[jira] [Commented] (SPARK-19944) Move SQLConf from sql/core to sql/catalyst

2017-03-14 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19944?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15925633#comment-15925633
 ] 

Apache Spark commented on SPARK-19944:
--

User 'rxin' has created a pull request for this issue:
https://github.com/apache/spark/pull/17301

> Move SQLConf from sql/core to sql/catalyst
> --
>
> Key: SPARK-19944
> URL: https://issues.apache.org/jira/browse/SPARK-19944
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Reynold Xin
>Assignee: Reynold Xin
> Fix For: 2.2.0
>
>
> It is pretty weird to have SQLConf only in sql/core and then we have to 
> duplicate config options that impact optimizer/analyzer in sql/catalyst using 
> CatalystConf. This ticket moves SQLConf into sql/catalyst.






[jira] [Closed] (SPARK-16591) HadoopFsRelation will list , cache all parquet file paths

2017-03-14 Thread Takeshi Yamamuro (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16591?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro closed SPARK-16591.

Resolution: Won't Fix

> HadoopFsRelation will list , cache all parquet file paths
> -
>
> Key: SPARK-16591
> URL: https://issues.apache.org/jira/browse/SPARK-16591
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0, 1.6.1, 1.6.2
>Reporter: cen yuhai
>
> HadoopFsRelation has a fileStatusCache which lists all paths and then caches 
> every FileStatus, no matter whether you specify partition columns or not. This 
> may cause OOM when reading a Parquet table.
> In HiveMetastoreCatalog, Spark converts a MetastoreRelation to a 
> ParquetRelation by calling the convertToParquetRelation method.
> That method calls metastoreRelation.getHiveQlPartitions() to request all 
> partitions from the Hive metastore service, without any filters, and then 
> passes all partition paths to ParquetRelation's paths member.
> In FileStatusCache's refresh method, all of those paths are listed: "val files = 
> listLeafFiles(paths)"
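
For illustration only, here is a minimal sketch of the caller-side workaround idea: point the 
reader directly at the partition directories that are actually needed, so the whole table is never 
listed. The table layout, path and partition values below are hypothetical, and this does not fix 
the underlying caching behavior.

{code}
// Spark 1.6-style API; assumes a table partitioned by `dt` under /user/hive/warehouse/events
val needed = Seq("2016-07-01", "2016-07-02")
  .map(d => s"/user/hive/warehouse/events/dt=$d")

// Only the listed directories are scanned, so listLeafFiles() never walks the whole table.
// The `basePath` option keeps the partition column `dt` in the resulting schema.
val df = sqlContext.read
  .option("basePath", "/user/hive/warehouse/events")
  .parquet(needed: _*)

df.count()
{code}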






[jira] [Commented] (SPARK-16591) HadoopFsRelation will list , cache all parquet file paths

2017-03-14 Thread Takeshi Yamamuro (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15925631#comment-15925631
 ] 

Takeshi Yamamuro commented on SPARK-16591:
--

Since HadoopFsRelation has changed completely, I'll close this. If you still have 
the problem, feel free to update the description and reopen this. Thanks.

> HadoopFsRelation will list , cache all parquet file paths
> -
>
> Key: SPARK-16591
> URL: https://issues.apache.org/jira/browse/SPARK-16591
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0, 1.6.1, 1.6.2
>Reporter: cen yuhai
>
> HadoopFsRelation has a fileStatusCache which lists all paths and then caches 
> every FileStatus, no matter whether you specify partition columns or not. This 
> may cause OOM when reading a Parquet table.
> In HiveMetastoreCatalog, Spark converts a MetastoreRelation to a 
> ParquetRelation by calling the convertToParquetRelation method.
> That method calls metastoreRelation.getHiveQlPartitions() to request all 
> partitions from the Hive metastore service, without any filters, and then 
> passes all partition paths to ParquetRelation's paths member.
> In FileStatusCache's refresh method, all of those paths are listed: "val files = 
> listLeafFiles(paths)"






[jira] [Updated] (SPARK-19957) Inconsist KMeans initialization mode behavior between ML and MLlib

2017-03-14 Thread yuhao yang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19957?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

yuhao yang updated SPARK-19957:
---
Description: 
When users set the initialization mode to "random", KMeans in ML and MLlib have 
inconsistent behavior across multiple runs:

MLlib basically uses a new Random for each run.
ML KMeans, however, uses the default random seed, which is 
{code}this.getClass.getName.hashCode.toLong{code}, and keeps using the same 
number across multiple fits.

I would expect the "random" initialization mode to be literally random. 
There are different solutions with different scopes of impact. Adjusting the 
hasSeed trait may have a broader impact (but maybe worth discussion). We can 
always just set a random default seed in KMeans. 

Appreciate your feedback.

  was:
When users set the initialization mode to "random", KMeans in ML and MLlib have 
inconsistent behavior across multiple runs:

MLlib basically uses a new Random for each run.
ML KMeans, however, uses the default random seed, which is 
{code}this.getClass.getName.hashCode.toLong{code}, and keeps using the same 
number across multiple fits.

I would expect the "random" initialization mode to be literally random. 
There are different solutions with different scopes of impact. Adjusting the 
hasSeed trait may have a broader impact. We can always just set a random default 
seed in KMeans. 

Appreciate your feedback.


> Inconsist KMeans initialization mode behavior between ML and MLlib
> --
>
> Key: SPARK-19957
> URL: https://issues.apache.org/jira/browse/SPARK-19957
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.1.0
>Reporter: yuhao yang
>Priority: Minor
>
> When users set the initialization mode to "random", KMeans in ML and MLlib 
> have inconsistent behavior across multiple runs:
> MLlib basically uses a new Random for each run.
> ML KMeans, however, uses the default random seed, which is 
> {code}this.getClass.getName.hashCode.toLong{code}, and keeps using the same 
> number across multiple fits.
> I would expect the "random" initialization mode to be literally random. 
> There are different solutions with different scopes of impact. Adjusting the 
> hasSeed trait may have a broader impact (but maybe worth discussion). We can 
> always just set a random default seed in KMeans. 
> Appreciate your feedback.






[jira] [Closed] (SPARK-6384) saveAsParquet doesn't clean up attempt_* folders

2017-03-14 Thread Takeshi Yamamuro (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6384?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro closed SPARK-6384.
---
Resolution: Won't Fix

> saveAsParquet doesn't clean up attempt_* folders
> 
>
> Key: SPARK-6384
> URL: https://issues.apache.org/jira/browse/SPARK-6384
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.2.1
>Reporter: Rex Xiong
>
> After calling SchemaRDD.saveAsParquet, it runs well and generates the *.parquet, 
> _SUCCESS, _common_metadata and _metadata files successfully.
> But sometimes there are attempt_* folders (e.g. 
> attempt_201503170229_0006_r_06_736, 
> attempt_201503170229_0006_r_000404_416) under the same folder; each contains 
> one Parquet file and seems to be a working temp folder.
> This happens even though the _SUCCESS file was created.
> In this situation, Spark SQL (Hive table) throws an exception when loading this 
> Parquet folder:
> Error: java.io.FileNotFoundException: Path is not a file: 
> ../attempt_201503170229_0006_r_06_736
> at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:69)
> at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:55)
> at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsUpdateTimes(FSNamesystem.java:1728)
> at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsInt(FSNamesystem.java:1671)
> at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1651)
> at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1625)
> at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getBlockLocations(NameNodeRpcServer.java:503)
> at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getBlockLocations(ClientNamenodeProtocolServerSideTranslatorPB.java:322)
> at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
> at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:585)
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:415)
> at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1594)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007) (state=,code=0)
> I'm not sure whether it's a Spark bug or a Parquet bug.
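
Until the root cause is found, one defensive measure is a cleanup pass over the output directory 
before readers touch it. The following is only a rough sketch under assumptions: the output path is 
hypothetical, it uses the Hadoop 2.x FileSystem API, and it is meant to be run from a Spark shell 
where `sc` is in scope. It hides the symptom; it does not fix the committer behavior.

{code}
import org.apache.hadoop.fs.{FileStatus, FileSystem, Path}

// Hypothetical directory written by saveAsParquet
val output = new Path("/warehouse/mytable")
val fs = FileSystem.get(sc.hadoopConfiguration)

// Remove any leftover attempt_* working directories so that readers
// (e.g. a Hive table pointed at this folder) only see the committed files.
val leftovers: Array[FileStatus] =
  Option(fs.globStatus(new Path(output, "attempt_*"))).getOrElse(Array.empty[FileStatus])
leftovers.filter(_.isDirectory).foreach(s => fs.delete(s.getPath, true))
{code}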






[jira] [Created] (SPARK-19957) Inconsist KMeans initialization mode behavior between ML and MLlib

2017-03-14 Thread yuhao yang (JIRA)
yuhao yang created SPARK-19957:
--

 Summary: Inconsist KMeans initialization mode behavior between ML 
and MLlib
 Key: SPARK-19957
 URL: https://issues.apache.org/jira/browse/SPARK-19957
 Project: Spark
  Issue Type: Bug
  Components: ML
Affects Versions: 2.1.0
Reporter: yuhao yang
Priority: Minor


When users set the initialization mode to "random", KMeans in ML and MLlib have 
inconsistent behavior across multiple runs:

MLlib basically uses a new Random for each run.
ML KMeans, however, uses the default random seed, which is 
{code}this.getClass.getName.hashCode.toLong{code}, and keeps using the same 
number across multiple fits.

I would expect the "random" initialization mode to be literally random. 
There are different solutions with different scopes of impact. Adjusting the 
hasSeed trait may have a broader impact. We can always just set a random default 
seed in KMeans. 

Appreciate your feedback.
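
As a caller-side illustration of the lowest-impact option mentioned above (supplying a fresh seed 
per fit), here is a short sketch against the ml.clustering API; the toy data and column name are 
placeholders, not part of the original report:

{code}
import org.apache.spark.ml.clustering.KMeans
import org.apache.spark.ml.linalg.Vectors

// Placeholder dataset with a vector column named "features".
val dataset = spark.createDataFrame(Seq(
  Tuple1(Vectors.dense(0.0, 0.0)),
  Tuple1(Vectors.dense(1.0, 1.0)),
  Tuple1(Vectors.dense(9.0, 8.0))
)).toDF("features")

// Without setSeed, every fit reuses the same default seed, so "random" initialization
// picks the same starting centers on every run. A fresh seed per run restores randomness.
val kmeans = new KMeans()
  .setK(2)
  .setInitMode("random")
  .setFeaturesCol("features")
  .setSeed(scala.util.Random.nextLong())

val model = kmeans.fit(dataset)
{code}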






[jira] [Commented] (SPARK-6384) saveAsParquet doesn't clean up attempt_* folders

2017-03-14 Thread Takeshi Yamamuro (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6384?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15925617#comment-15925617
 ] 

Takeshi Yamamuro commented on SPARK-6384:
-

Since this ticket has been almost inactive and the related code has changed 
completely (at the least, SchemaRDD is gone), I'll close this. If you still have 
the problem, feel free to update the description and reopen this. Thanks!

> saveAsParquet doesn't clean up attempt_* folders
> 
>
> Key: SPARK-6384
> URL: https://issues.apache.org/jira/browse/SPARK-6384
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.2.1
>Reporter: Rex Xiong
>
> After calling SchemaRDD.saveAsParquet, it runs well and generates the *.parquet, 
> _SUCCESS, _common_metadata and _metadata files successfully.
> But sometimes there are attempt_* folders (e.g. 
> attempt_201503170229_0006_r_06_736, 
> attempt_201503170229_0006_r_000404_416) under the same folder; each contains 
> one Parquet file and seems to be a working temp folder.
> This happens even though the _SUCCESS file was created.
> In this situation, Spark SQL (Hive table) throws an exception when loading this 
> Parquet folder:
> Error: java.io.FileNotFoundException: Path is not a file: 
> ../attempt_201503170229_0006_r_06_736
> at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:69)
> at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:55)
> at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsUpdateTimes(FSNamesystem.java:1728)
> at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsInt(FSNamesystem.java:1671)
> at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1651)
> at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1625)
> at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getBlockLocations(NameNodeRpcServer.java:503)
> at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getBlockLocations(ClientNamenodeProtocolServerSideTranslatorPB.java:322)
> at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
> at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:585)
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:415)
> at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1594)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007) (state=,code=0)
> I'm not sure whether it's a Spark bug or a Parquet bug.






[jira] [Commented] (SPARK-19954) Joining to a unioned DataFrame does not produce expected result.

2017-03-14 Thread Arun Allamsetty (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19954?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15925586#comment-15925586
 ] 

Arun Allamsetty commented on SPARK-19954:
-

I was assuming it could be a bug-fix release like 2.1.1. But I'll defer to the 
judgement of the people managing the Spark project. I set it to blocker as it 
seems like a non-trivial bug to me given the behavior.

> Joining to a unioned DataFrame does not produce expected result.
> 
>
> Key: SPARK-19954
> URL: https://issues.apache.org/jira/browse/SPARK-19954
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Arun Allamsetty
>Priority: Blocker
>
> I found this bug while trying to update from Spark 1.6.1 to 2.1.0. The bug is 
> that when we join two DataFrames, one of which is the result of a union 
> operation, the join produces data as if the table were joined only to the 
> first table in the union. This issue is not present in Spark 2.0.0, 2.0.1 or 
> 2.0.2, only in 2.1.0. Here's how to reproduce it.
> {noformat}
> import spark.implicits._
> import org.apache.spark.sql.functions.lit
> case class A(id: Long, colA: Boolean)
> case class B(id: Long, colB: Int)
> case class C(id: Long, colC: Double)
> case class X(id: Long, name: String)
> val aData = A(1, true) :: Nil
> val bData = B(2, 10) :: Nil
> val cData = C(3, 9.73D) :: Nil
> val xData = X(1, "a") :: X(2, "b") :: X(3, "c") :: Nil
> val aDf = spark.createDataset(aData).toDF
> val bDf = spark.createDataset(bData).toDF
> val cDf = spark.createDataset(cData).toDF
> val xDf = spark.createDataset(xData).toDF
> val unionDf =
>   aDf.select($"id", lit("a").as("name"), $"colA", lit(null).as("colB"), 
> lit(null).as("colC")).union(
>   bDf.select($"id", lit("b").as("name"), lit(null).as("colA"), $"colB", 
> lit(null).as("colC"))).union(
>   cDf.select($"id", lit("c").as("name"), lit(null).as("colA"), 
> lit(null).as("colB"), $"colC"))
> val result = xDf.join(unionDf, unionDf("name") === xDf("name") && 
> unionDf("id") === xDf("id"))
> result.show
> {noformat}
> The result being
> {noformat}
> +---+----+---+----+----+----+----+
> | id|name| id|name|colA|colB|colC|
> +---+----+---+----+----+----+----+
> |  1|   a|  1|   a|true|null|null|
> +---+----+---+----+----+----+----+
> {noformat}
> Forcing computation of {{unionDf}} using {{count}} does not change the result 
> of the join. However, writing the data to disk and reading it back does give 
> the correct result, though that is definitely not ideal. Interestingly, caching 
> {{unionDf}} also gives the correct result.
> {noformat}
> +---+----+---+----+----+----+----+
> | id|name| id|name|colA|colB|colC|
> +---+----+---+----+----+----+----+
> |  1|   a|  1|   a|true|null|null|
> |  2|   b|  2|   b|null|  10|null|
> |  3|   c|  3|   c|null|null|9.73|
> +---+----+---+----+----+----+----+
> {noformat}






[jira] [Assigned] (SPARK-19956) Optimize a location order of blocks with topology information

2017-03-14 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19956?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19956:


Assignee: (was: Apache Spark)

> Optimize a location order of blocks with topology information
> -
>
> Key: SPARK-19956
> URL: https://issues.apache.org/jira/browse/SPARK-19956
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: coneyliu
>
> When calling the getLocations method of BlockManager, we only compare the data 
> block host. Non-local data blocks are selected at random, which may cause the 
> selected data block to be in a different rack. So this patch adds rack-aware 
> sorting of the block locations.






[jira] [Commented] (SPARK-19956) Optimize a location order of blocks with topology information

2017-03-14 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15925471#comment-15925471
 ] 

Apache Spark commented on SPARK-19956:
--

User 'ConeyLiu' has created a pull request for this issue:
https://github.com/apache/spark/pull/17300

> Optimize a location order of blocks with topology information
> -
>
> Key: SPARK-19956
> URL: https://issues.apache.org/jira/browse/SPARK-19956
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: coneyliu
>
> When calling the getLocations method of BlockManager, we only compare the data 
> block host. Non-local data blocks are selected at random, which may cause the 
> selected data block to be in a different rack. So this patch adds rack-aware 
> sorting of the block locations.






[jira] [Assigned] (SPARK-19956) Optimize a location order of blocks with topology information

2017-03-14 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19956?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19956:


Assignee: Apache Spark

> Optimize a location order of blocks with topology information
> -
>
> Key: SPARK-19956
> URL: https://issues.apache.org/jira/browse/SPARK-19956
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: coneyliu
>Assignee: Apache Spark
>
> When calling the getLocations method of BlockManager, we only compare the data 
> block host. Non-local data blocks are selected at random, which may cause the 
> selected data block to be in a different rack. So this patch adds rack-aware 
> sorting of the block locations.






[jira] [Created] (SPARK-19956) Optimize a location order of blocks with topology information

2017-03-14 Thread coneyliu (JIRA)
coneyliu created SPARK-19956:


 Summary: Optimize a location order of blocks with topology 
information
 Key: SPARK-19956
 URL: https://issues.apache.org/jira/browse/SPARK-19956
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 2.1.0
Reporter: coneyliu


When calling the getLocations method of BlockManager, we only compare the data 
block host. Non-local data blocks are selected at random, which may cause the 
selected data block to be in a different rack. So this patch adds rack-aware 
sorting of the block locations.
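
The intent can be illustrated with a small, self-contained sketch (simplified types, not the actual 
BlockManager code): order candidate locations so that same-host replicas come first, then same-rack 
replicas, then everything else.

{code}
// Simplified illustration of rack-aware ordering of block locations.
case class Loc(host: String, rack: Option[String])

def sortLocations(locs: Seq[Loc], localHost: String, localRack: Option[String]): Seq[Loc] =
  locs.sortBy { loc =>
    if (loc.host == localHost) 0                              // same host first
    else if (localRack.isDefined && loc.rack == localRack) 1  // then same rack
    else 2                                                    // then everything else
  }

sortLocations(
  Seq(Loc("h3", Some("rackB")), Loc("h2", Some("rackA")), Loc("h1", Some("rackA"))),
  localHost = "h1", localRack = Some("rackA"))
// -> Seq(Loc("h1", ...), Loc("h2", ...), Loc("h3", ...))
{code}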






[jira] [Resolved] (SPARK-18112) Spark2.x does not support read data from Hive 2.x metastore

2017-03-14 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18112?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-18112.
-
   Resolution: Fixed
Fix Version/s: 2.2.0

Issue resolved by pull request 17232
[https://github.com/apache/spark/pull/17232]

> Spark2.x does not support read data from Hive 2.x metastore
> ---
>
> Key: SPARK-18112
> URL: https://issues.apache.org/jira/browse/SPARK-18112
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0, 2.0.1
>Reporter: KaiXu
>Assignee: Xiao Li
>Priority: Critical
> Fix For: 2.2.0
>
>
> Hive 2.0 was released in February 2016, and Hive 2.0.1 and Hive 2.1.0 have also 
> been out for a long time, but until now Spark only supports reading Hive 
> metastore data from Hive 1.2.1 and older versions. Since Hive 2.x has many bug 
> fixes and performance improvements, it is better, and urgent, to upgrade to 
> support Hive 2.x.
> Loading data from a Hive 2.x metastore fails with:
> Exception in thread "main" java.lang.NoSuchFieldError: HIVE_STATS_JDBC_TIMEOUT
> at org.apache.spark.sql.hive.HiveUtils$.hiveClientConfigurations(HiveUtils.scala:197)
> at org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:262)
> at org.apache.spark.sql.hive.HiveSharedState.metadataHive$lzycompute(HiveSharedState.scala:39)
> at org.apache.spark.sql.hive.HiveSharedState.metadataHive(HiveSharedState.scala:38)
> at org.apache.spark.sql.hive.HiveSharedState.externalCatalog$lzycompute(HiveSharedState.scala:4
> at org.apache.spark.sql.hive.HiveSharedState.externalCatalog(HiveSharedState.scala:45)
> at org.apache.spark.sql.hive.HiveSessionState.catalog$lzycompute(HiveSessionState.scala:50)
> at org.apache.spark.sql.hive.HiveSessionState.catalog(HiveSessionState.scala:48)
> at org.apache.spark.sql.hive.HiveSessionState.catalog(HiveSessionState.scala:31)
> at org.apache.spark.sql.SparkSession.table(SparkSession.scala:568)
> at org.apache.spark.sql.SparkSession.table(SparkSession.scala:564)
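
For anyone hitting this on a release that includes the fix (2.2.0 per the Fix Version above), the 
isolated Hive client can be pointed at a newer metastore via configuration. A hedged sketch; the 
metastore version string and jar-resolution mode below are illustrative and must match what the 
Spark release actually supports:

{code}
import org.apache.spark.sql.SparkSession

// These settings take effect when the (first) session is created.
val spark = SparkSession.builder()
  .appName("hive2x-metastore")
  .config("spark.sql.hive.metastore.version", "2.1.1")   // illustrative version
  .config("spark.sql.hive.metastore.jars", "maven")      // or a classpath of matching Hive jars
  .enableHiveSupport()
  .getOrCreate()

spark.sql("show databases").show()
{code}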






[jira] [Resolved] (SPARK-19828) R to support JSON array in column from_json

2017-03-14 Thread Felix Cheung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19828?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung resolved SPARK-19828.
--
  Resolution: Fixed
Assignee: Hyukjin Kwon
   Fix Version/s: 2.2.0
Target Version/s: 2.2.0

> R to support JSON array in column from_json
> ---
>
> Key: SPARK-19828
> URL: https://issues.apache.org/jira/browse/SPARK-19828
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR, SQL
>Affects Versions: 2.2.0
>Reporter: Felix Cheung
>Assignee: Hyukjin Kwon
> Fix For: 2.2.0
>
>







[jira] [Resolved] (SPARK-19887) __HIVE_DEFAULT_PARTITION__ is not interpreted as NULL partition value in partitioned persisted tables

2017-03-14 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19887?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-19887.
-
   Resolution: Fixed
 Assignee: Wenchen Fan
Fix Version/s: 2.2.0
   2.1.1

> __HIVE_DEFAULT_PARTITION__ is not interpreted as NULL partition value in 
> partitioned persisted tables
> -
>
> Key: SPARK-19887
> URL: https://issues.apache.org/jira/browse/SPARK-19887
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0, 2.2.0
>Reporter: Cheng Lian
>Assignee: Wenchen Fan
>  Labels: correctness
> Fix For: 2.1.1, 2.2.0
>
>
> The following Spark shell snippet under Spark 2.1 reproduces this issue:
> {code}
> val data = Seq(
>   ("p1", 1, 1),
>   ("p2", 2, 2),
>   (null, 3, 3)
> )
> // Correct case: Saving partitioned data to file system.
> val path = "/tmp/partitioned"
> data.
>   toDF("a", "b", "c").
>   write.
>   mode("overwrite").
>   partitionBy("a", "b").
>   parquet(path)
> spark.read.parquet(path).filter($"a".isNotNull).show(truncate = false)
> // +---+---+---+
> // |c  |a  |b  |
> // +---+---+---+
> // |2  |p2 |2  |
> // |1  |p1 |1  |
> // +---+---+---+
> // Incorrect case: Saving partitioned data as persisted table.
> data.
>   toDF("a", "b", "c").
>   write.
>   mode("overwrite").
>   partitionBy("a", "b").
>   saveAsTable("test_null")
> spark.table("test_null").filter($"a".isNotNull).show(truncate = false)
> // +---+--------------------------+---+
> // |c  |a                         |b  |
> // +---+--------------------------+---+
> // |3  |__HIVE_DEFAULT_PARTITION__|3  | <-- This line should not be here
> // |1  |p1                        |1  |
> // |2  |p2                        |2  |
> // +---+--------------------------+---+
> {code}
> Hive-style partitioned tables use the magic string 
> {{\_\_HIVE_DEFAULT_PARTITION\_\_}} to indicate {{NULL}} partition values in 
> partition directory names. However, in the case of persisted partitioned tables, 
> this magic string is not interpreted as {{NULL}} but as a regular string.






[jira] [Updated] (SPARK-19817) make it clear that `timeZone` option is a general option in DataFrameReader/Writer, DataStreamReader/Writer

2017-03-14 Thread Liwei Lin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19817?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Liwei Lin updated SPARK-19817:
--
Summary: make it clear that `timeZone` option is a general option in 
DataFrameReader/Writer, DataStreamReader/Writer  (was: make it clear that 
`timeZone` option is a general option in DataFrameReader/Writer)

> make it clear that `timeZone` option is a general option in 
> DataFrameReader/Writer, DataStreamReader/Writer
> ---
>
> Key: SPARK-19817
> URL: https://issues.apache.org/jira/browse/SPARK-19817
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL, Structured Streaming
>Affects Versions: 2.2.0
>Reporter: Wenchen Fan
>Assignee: Takuya Ueshin
> Fix For: 2.2.0
>
>
> Since the timezone setting can also affect partition values and applies to all 
> formats, we should make this clear.
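
For example (a sketch; the DataFrame, column and path are placeholders), the same option name is 
used on both reads and writes, regardless of format:

{code}
// `timeZone` influences how timestamp-typed values, including partition values,
// are rendered and parsed by the built-in file sources.
df.write
  .option("timeZone", "America/Los_Angeles")
  .partitionBy("ts")                      // placeholder timestamp column
  .json("/tmp/events_json")               // illustrative path

val back = spark.read
  .option("timeZone", "America/Los_Angeles")
  .json("/tmp/events_json")
{code}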






[jira] [Updated] (SPARK-19817) make it clear that `timeZone` option is a general option in DataFrameReader/Writer

2017-03-14 Thread Liwei Lin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19817?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Liwei Lin updated SPARK-19817:
--
Component/s: Structured Streaming

> make it clear that `timeZone` option is a general option in 
> DataFrameReader/Writer
> --
>
> Key: SPARK-19817
> URL: https://issues.apache.org/jira/browse/SPARK-19817
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL, Structured Streaming
>Affects Versions: 2.2.0
>Reporter: Wenchen Fan
>Assignee: Takuya Ueshin
> Fix For: 2.2.0
>
>
> Since the timezone setting can also affect partition values and applies to all 
> formats, we should make this clear.






[jira] [Resolved] (SPARK-19918) Use TextFileFormat in implementation of JsonFileFormat

2017-03-14 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-19918.
-
   Resolution: Fixed
Fix Version/s: 2.2.0

Issue resolved by pull request 17255
[https://github.com/apache/spark/pull/17255]

> Use TextFileFormat in implementation of JsonFileFormat
> --
>
> Key: SPARK-19918
> URL: https://issues.apache.org/jira/browse/SPARK-19918
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Hyukjin Kwon
> Fix For: 2.2.0
>
>
> If we use a Dataset for the initial loading when inferring the schema, there are 
> advantages. Please refer to SPARK-18362.
> It seems the JSON one was supposed to be fixed together but was missed, according to 
> https://github.com/apache/spark/pull/15813
> {quote}
> A similar problem also affects the JSON file format and this patch originally 
> fixed that as well, but I've decided to split that change into a separate 
> patch so as not to conflict with changes in another JSON PR.
> {quote}
> Also, this affects some functionality because it does not use 
> {{FileScanRDD}}. This problem is described in SPARK-19885 (but that was the CSV 
> case).
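
From the user-facing side, the general idea of inferring a schema from a Dataset of lines can be 
sketched as follows (the path is illustrative; DataFrameReader.json(Dataset[String]) is available in 
newer 2.x releases):

{code}
// Load the raw JSON lines as a Dataset[String] first, then let the JSON
// source infer the schema from that Dataset.
val lines = spark.read.textFile("/tmp/events.json")
val parsed = spark.read.json(lines)
parsed.printSchema()
{code}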






[jira] [Assigned] (SPARK-19918) Use TextFileFormat in implementation of JsonFileFormat

2017-03-14 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-19918:
---

Assignee: Hyukjin Kwon

> Use TextFileFormat in implementation of JsonFileFormat
> --
>
> Key: SPARK-19918
> URL: https://issues.apache.org/jira/browse/SPARK-19918
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
> Fix For: 2.2.0
>
>
> If we use a Dataset for the initial loading when inferring the schema, there are 
> advantages. Please refer to SPARK-18362.
> It seems the JSON one was supposed to be fixed together but was missed, according to 
> https://github.com/apache/spark/pull/15813
> {quote}
> A similar problem also affects the JSON file format and this patch originally 
> fixed that as well, but I've decided to split that change into a separate 
> patch so as not to conflict with changes in another JSON PR.
> {quote}
> Also, this affects some functionality because it does not use 
> {{FileScanRDD}}. This problem is described in SPARK-19885 (but that was the CSV 
> case).






[jira] [Commented] (SPARK-19817) make it clear that `timeZone` option is a general option in DataFrameReader/Writer

2017-03-14 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15925424#comment-15925424
 ] 

Apache Spark commented on SPARK-19817:
--

User 'lw-lin' has created a pull request for this issue:
https://github.com/apache/spark/pull/17299

> make it clear that `timeZone` option is a general option in 
> DataFrameReader/Writer
> --
>
> Key: SPARK-19817
> URL: https://issues.apache.org/jira/browse/SPARK-19817
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Wenchen Fan
>Assignee: Takuya Ueshin
> Fix For: 2.2.0
>
>
> Since the timezone setting can also affect partition values and applies to all 
> formats, we should make this clear.






[jira] [Closed] (SPARK-19881) Support Dynamic Partition Inserts params with SET command

2017-03-14 Thread Dongjoon Hyun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19881?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun closed SPARK-19881.
-

> Support Dynamic Partition Inserts params with SET command
> -
>
> Key: SPARK-19881
> URL: https://issues.apache.org/jira/browse/SPARK-19881
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0, 2.1.0
>Reporter: Dongjoon Hyun
>Priority: Minor
>
> Since Spark 2.0.0, `SET` commands do not pass the values to HiveClient. In 
> most cases, Spark handles this well. However, for dynamic partition inserts, 
> users run into the following misleading situation. 
> {code}
> scala> spark.range(1001).selectExpr("id as key", "id as 
> value").registerTempTable("t1001")
> scala> sql("create table p (value int) partitioned by (key int)").show
> scala> sql("insert into table p partition(key) select key, value from t1001")
> org.apache.spark.SparkException:
> Dynamic partition strict mode requires at least one static partition column.
> To turn this off set hive.exec.dynamic.partition.mode=nonstrict
> scala> sql("set hive.exec.dynamic.partition.mode=nonstrict")
> scala> sql("insert into table p partition(key) select key, value from t1001")
> org.apache.hadoop.hive.ql.metadata.HiveException:
> Number of dynamic partitions created is 1001, which is more than 1000.
> To solve this try to set hive.exec.max.dynamic.partitions to at least 1001.
> scala> sql("set hive.exec.max.dynamic.partitions=1001")
> scala> sql("set hive.exec.max.dynamic.partitions").show(false)
> +--------------------------------+-----+
> |key                             |value|
> +--------------------------------+-----+
> |hive.exec.max.dynamic.partitions|1001 |
> +--------------------------------+-----+
> scala> sql("insert into table p partition(key) select key, value from t1001")
> org.apache.hadoop.hive.ql.metadata.HiveException:
> Number of dynamic partitions created is 1001, which is more than 1000.
> To solve this try to set hive.exec.max.dynamic.partitions to at least 1001.
> {code}
> The last error is the same as the previous one. `HiveClient` does not know the 
> new value 1001. There is no way to change the default value of 
> `hive.exec.max.dynamic.partitions` of `HiveClient` with the `SET` command.
> The root cause is that `hive` parameters are passed to `HiveClient` at creation 
> time. So the workaround is to use `--hiveconf` when starting `spark-shell`. 
> However, the value is still unchangeable from within `spark-shell`. We had 
> better handle this case without misleading error messages that lead to an 
> endless loop.






[jira] [Closed] (SPARK-17277) Set hive conf failed

2017-03-14 Thread Dongjoon Hyun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17277?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun closed SPARK-17277.
-

> Set hive conf failed
> 
>
> Key: SPARK-17277
> URL: https://issues.apache.org/jira/browse/SPARK-17277
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Weizhong
>Priority: Minor
>
> Now we can't use "SET k=v" to set a Hive conf. For example, run the SQL below in 
> spark-sql:
> {noformat}
> set hive.exec.max.dynamic.partitions = 2000
> {noformat}
> The value actually stays at 1000 (the default). This is because after merging 
> SPARK-15012, we no longer call runSqlHive("SET k=v") when setting a Hive conf.
> Only those confs that Spark uses directly are OK, like 
> hive.exec.dynamic.partition.mode etc.






[jira] [Resolved] (SPARK-17277) Set hive conf failed

2017-03-14 Thread Dongjoon Hyun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17277?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-17277.
---
Resolution: Won't Fix

As mentioned in the related issue 
[comment|https://github.com/apache/spark/pull/17223#issuecomment-286608743], we 
will not set hive conf dynamically in order to keep session isolation.
{quote}
Since hive client is shared among all sessions, we can't set hive conf 
dynamically, to keep session isolation. I think we should treat hive conf as 
static sql conf, and throw exception when users try to change them.
{quote}

I'll close this issue as WON'T FIX, too.

> Set hive conf failed
> 
>
> Key: SPARK-17277
> URL: https://issues.apache.org/jira/browse/SPARK-17277
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Weizhong
>Priority: Minor
>
> Now we can't use "SET k=v" to set a Hive conf. For example, run the SQL below in 
> spark-sql:
> {noformat}
> set hive.exec.max.dynamic.partitions = 2000
> {noformat}
> The value actually stays at 1000 (the default). This is because after merging 
> SPARK-15012, we no longer call runSqlHive("SET k=v") when setting a Hive conf.
> Only those confs that Spark uses directly are OK, like 
> hive.exec.dynamic.partition.mode etc.






[jira] [Resolved] (SPARK-19881) Support Dynamic Partition Inserts params with SET command

2017-03-14 Thread Dongjoon Hyun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19881?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-19881.
---
Resolution: Won't Fix

As mentioned in 
[comment|https://github.com/apache/spark/pull/17223#issuecomment-286608743], we 
will not set hive conf dynamically in order to keep session isolation.
{quote}
Since hive client is shared among all sessions, we can't set hive conf 
dynamically, to keep session isolation. I think we should treat hive conf as 
static sql conf, and throw exception when users try to change them.
{quote}

> Support Dynamic Partition Inserts params with SET command
> -
>
> Key: SPARK-19881
> URL: https://issues.apache.org/jira/browse/SPARK-19881
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0, 2.1.0
>Reporter: Dongjoon Hyun
>Priority: Minor
>
> Since Spark 2.0.0, `SET` commands do not pass the values to HiveClient. In 
> most cases, Spark handles this well. However, for dynamic partition inserts, 
> users run into the following misleading situation. 
> {code}
> scala> spark.range(1001).selectExpr("id as key", "id as 
> value").registerTempTable("t1001")
> scala> sql("create table p (value int) partitioned by (key int)").show
> scala> sql("insert into table p partition(key) select key, value from t1001")
> org.apache.spark.SparkException:
> Dynamic partition strict mode requires at least one static partition column.
> To turn this off set hive.exec.dynamic.partition.mode=nonstrict
> scala> sql("set hive.exec.dynamic.partition.mode=nonstrict")
> scala> sql("insert into table p partition(key) select key, value from t1001")
> org.apache.hadoop.hive.ql.metadata.HiveException:
> Number of dynamic partitions created is 1001, which is more than 1000.
> To solve this try to set hive.exec.max.dynamic.partitions to at least 1001.
> scala> sql("set hive.exec.max.dynamic.partitions=1001")
> scala> sql("set hive.exec.max.dynamic.partitions").show(false)
> +--------------------------------+-----+
> |key                             |value|
> +--------------------------------+-----+
> |hive.exec.max.dynamic.partitions|1001 |
> +--------------------------------+-----+
> scala> sql("insert into table p partition(key) select key, value from t1001")
> org.apache.hadoop.hive.ql.metadata.HiveException:
> Number of dynamic partitions created is 1001, which is more than 1000.
> To solve this try to set hive.exec.max.dynamic.partitions to at least 1001.
> {code}
> The last error is the same as the previous one. `HiveClient` does not know the 
> new value 1001. There is no way to change the default value of 
> `hive.exec.max.dynamic.partitions` of `HiveClient` with the `SET` command.
> The root cause is that `hive` parameters are passed to `HiveClient` at creation 
> time. So the workaround is to use `--hiveconf` when starting `spark-shell`. 
> However, the value is still unchangeable from within `spark-shell`. We had 
> better handle this case without misleading error messages that lead to an 
> endless loop.






[jira] [Commented] (SPARK-17465) Inappropriate memory management in `org.apache.spark.storage.MemoryStore` may lead to memory leak

2017-03-14 Thread agate (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15925394#comment-15925394
 ] 

agate commented on SPARK-17465:
---

[~saturday_s] Thanks for this fix! It really helped us. We were using 1.6.1 and 
were seeing processing times increase gradually over a period of several days. 
With 1.6.3 this increase does not happen. Thank you!


> Inappropriate memory management in `org.apache.spark.storage.MemoryStore` may 
> lead to memory leak
> -
>
> Key: SPARK-17465
> URL: https://issues.apache.org/jira/browse/SPARK-17465
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.0, 1.6.1, 1.6.2
>Reporter: Xing Shi
>Assignee: Xing Shi
> Fix For: 1.6.3, 2.0.1, 2.1.0
>
>
> After updating Spark from 1.5.0 to 1.6.0, I found that it seems to have a 
> memory leak on my Spark streaming application.
> Here is the head of the heap histogram of my application, which has been 
> running about 160 hours:
> {code:borderStyle=solid}
>  num #instances #bytes  class name
> --
>1: 28094   71753976  [B
>2:   1188086   28514064  java.lang.Long
>3:   1183844   28412256  scala.collection.mutable.DefaultEntry
>4:102242   13098768  
>5:102242   12421000  
>6:  81849199032  
>7:388391584  [Lscala.collection.mutable.HashEntry;
>8:  81847514288  
>9:  66514874080  
>   10: 371973438040  [C
>   11:  64232445640  
>   12:  87731044808  java.lang.Class
>   13: 36869 884856  java.lang.String
>   14: 15715 848368  [[I
>   15: 13690 782808  [S
>   16: 18903 604896  
> java.util.concurrent.ConcurrentHashMap$HashEntry
>   17:13 426192  [Lscala.concurrent.forkjoin.ForkJoinTask;
> {code}
> It shows that *scala.collection.mutable.DefaultEntry* and *java.lang.Long* 
> have unexpectedly large numbers of instances. In fact, the numbers started 
> growing when the streaming process began, and keep growing in proportion to the 
> total number of tasks.
> After some further investigation, I found that the problem is caused by some 
> inappropriate memory management in _releaseUnrollMemoryForThisTask_ and 
> _unrollSafely_ method of class 
> [org.apache.spark.storage.MemoryStore|https://github.com/apache/spark/blob/branch-1.6/core/src/main/scala/org/apache/spark/storage/MemoryStore.scala].
> In Spark 1.6.x, a _releaseUnrollMemoryForThisTask_ operation will be 
> processed only with the parameter _memoryToRelease_ > 0:
> https://github.com/apache/spark/blob/branch-1.6/core/src/main/scala/org/apache/spark/storage/MemoryStore.scala#L530-L537
> But in fact, if a task successfully unrolled all its blocks in memory by 
> _unrollSafely_ method, the memory saved in _unrollMemoryMap_ would be set to 
> zero:
> https://github.com/apache/spark/blob/branch-1.6/core/src/main/scala/org/apache/spark/storage/MemoryStore.scala#L322
> As a result, the memory recorded in _unrollMemoryMap_ is released, but the key 
> for that part of memory is never removed from the hash map. The hash table 
> keeps growing as new tasks keep coming in. Although the speed of increase is 
> comparatively slow (about dozens of bytes per task), it can result in OOM after 
> weeks or months.
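
The leak pattern itself is easy to reproduce in isolation. Below is a simplified, self-contained 
illustration (not the actual MemoryStore code) of "set the value to zero instead of removing the 
key":

{code}
import scala.collection.mutable

// Simplified model of the bug: per-task unroll-memory bookkeeping where a
// fully unrolled task zeroes its entry instead of removing it.
val unrollMemoryMap = mutable.HashMap[Long, Long]()

def finishTask(taskId: Long): Unit = {
  // Mimics unrollSafely: the memory is handed over and the entry is zeroed...
  unrollMemoryMap(taskId) = 0L
  // ...and the release path skips entries whose remaining memory is 0,
  // so the key is never removed from the map.
  val memoryToRelease = unrollMemoryMap(taskId)
  if (memoryToRelease > 0) unrollMemoryMap.remove(taskId)
}

(1L to 100000L).foreach { taskId =>
  unrollMemoryMap(taskId) = 1024L   // task reserves some unroll memory
  finishTask(taskId)
}
println(unrollMemoryMap.size)       // 100000 stale entries retained -> slow leak
{code}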






[jira] [Closed] (SPARK-19834) csv escape of quote escape

2017-03-14 Thread Soonmok Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19834?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Soonmok Kwon closed SPARK-19834.

Resolution: Later

Closed for now. Will be re-opened when Spark uses univocity-parsers 2.4.0+.

> csv escape of quote escape
> --
>
> Key: SPARK-19834
> URL: https://issues.apache.org/jira/browse/SPARK-19834
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Soonmok Kwon
>Priority: Minor
>   Original Estimate: 4h
>  Remaining Estimate: 4h
>
> A DataFrame is stored in CSV format and loaded again. When there is a backslash 
> followed by a quotation mark, CSV reading seems to produce an error.
> reference:
> http://stackoverflow.com/questions/42607208/spark-csv-error-when-reading-backslash-and-quotation-mark
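
A rough repro sketch of the round-trip being described (the path and values are illustrative, and 
the exact failure mode may depend on the Spark and univocity versions involved):

{code}
import spark.implicits._

// The value ends with a backslash, so the quoted CSV field ends in \" --
// a backslash immediately followed by the quote character.
val df = Seq((1, """ends with a backslash \""")).toDF("id", "text")
df.write.mode("overwrite").option("quoteAll", true).csv("/tmp/csv-escape-test")

val back = spark.read.csv("/tmp/csv-escape-test")
back.show(false)   // the row may come back mangled instead of matching the original
{code}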






[jira] [Comment Edited] (SPARK-19834) csv escape of quote escape

2017-03-14 Thread Soonmok Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19834?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15925380#comment-15925380
 ] 

Soonmok Kwon edited comment on SPARK-19834 at 3/15/17 1:25 AM:
---

Agreed


was (Author: ep1804):
To resolve this issue we need to enable the uniVocity CSV parser options 
escapeUnquotedValues and charToEscapeQuoteEscaping. They are good to add, as they 
are also described in the univocity library's README.md, but exposing an option 
that currently has a small bug is a risk. Maybe it is better to close this issue 
for now and re-open it when Spark bumps the library up to uniVocity version 2.4.0.

> csv escape of quote escape
> --
>
> Key: SPARK-19834
> URL: https://issues.apache.org/jira/browse/SPARK-19834
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Soonmok Kwon
>Priority: Minor
>   Original Estimate: 4h
>  Remaining Estimate: 4h
>
> A DataFrame is stored in CSV format and loaded again. When there is a backslash 
> followed by a quotation mark, CSV reading seems to produce an error.
> reference:
> http://stackoverflow.com/questions/42607208/spark-csv-error-when-reading-backslash-and-quotation-mark






[jira] [Commented] (SPARK-19834) csv escape of quote escape

2017-03-14 Thread Soonmok Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19834?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15925380#comment-15925380
 ] 

Soonmok Kwon commented on SPARK-19834:
--

To resolve this issue we need to enable the uniVocity CSV parser options 
escapeUnquotedValues and charToEscapeQuoteEscaping. They are good to add, as they 
are also described in the univocity library's README.md, but exposing an option 
that currently has a small bug is a risk. Maybe it is better to close this issue 
for now and re-open it when Spark bumps the library up to uniVocity version 2.4.0.

> csv escape of quote escape
> --
>
> Key: SPARK-19834
> URL: https://issues.apache.org/jira/browse/SPARK-19834
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Soonmok Kwon
>Priority: Minor
>   Original Estimate: 4h
>  Remaining Estimate: 4h
>
> A DataFrame is stored in CSV format and loaded again. When there is a backslash 
> followed by a quotation mark, CSV reading seems to produce an error.
> reference:
> http://stackoverflow.com/questions/42607208/spark-csv-error-when-reading-backslash-and-quotation-mark






[jira] [Commented] (SPARK-19834) csv escape of quote escape

2017-03-14 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19834?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15925375#comment-15925375
 ] 

Hyukjin Kwon commented on SPARK-19834:
--

Just so others can easily track this: I think this is good to do, as it is 
also described in univocity's README.md - 
https://github.com/uniVocity/univocity-parsers/blob/master/README.md#escaping-quote-escape-characters

However, there is a small bug in this option which was fixed in 2.4.0. So I 
suggested closing this for now and bringing it back when we bump the library 
up to 2.4.0 later.
Please refer to the details in the PR.

Please let me know if anyone thinks differently.

> csv escape of quote escape
> --
>
> Key: SPARK-19834
> URL: https://issues.apache.org/jira/browse/SPARK-19834
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Soonmok Kwon
>Priority: Minor
>   Original Estimate: 4h
>  Remaining Estimate: 4h
>
> A DataFrame is stored in CSV format and loaded again. When there is a backslash 
> followed by a quotation mark, CSV reading seems to produce an error.
> reference:
> http://stackoverflow.com/questions/42607208/spark-csv-error-when-reading-backslash-and-quotation-mark






[jira] [Updated] (SPARK-19834) csv escape of quote escape

2017-03-14 Thread Soonmok Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19834?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Soonmok Kwon updated SPARK-19834:
-
Summary: csv escape of quote escape  (was: csv encoding/decoding error not 
using escape of escape)

> csv escape of quote escape
> --
>
> Key: SPARK-19834
> URL: https://issues.apache.org/jira/browse/SPARK-19834
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Soonmok Kwon
>Priority: Minor
>   Original Estimate: 4h
>  Remaining Estimate: 4h
>
> A DataFrame is stored in CSV format and loaded again. When there is a backslash 
> followed by a quotation mark, CSV reading seems to produce an error.
> reference:
> http://stackoverflow.com/questions/42607208/spark-csv-error-when-reading-backslash-and-quotation-mark






[jira] [Commented] (SPARK-19950) nullable ignored when df.load() is executed for file-based data source

2017-03-14 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19950?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15925367#comment-15925367
 ] 

Hyukjin Kwon commented on SPARK-19950:
--

Yes. Just to help: to my knowledge, the current problem is that the user-specified 
schema is forced into nullable fields.

SPARK-16472 describes that it should always be nullable, consistently.
SPARK-19950 describes that it should keep the schema as-is, consistently.

> nullable ignored when df.load() is executed for file-based data source
> --
>
> Key: SPARK-19950
> URL: https://issues.apache.org/jira/browse/SPARK-19950
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Kazuaki Ishizaki
>
> This problem is reported in [Databricks 
> forum|https://forums.databricks.com/questions/7123/nullable-seemingly-ignored-when-reading-parquet.html].
> When we execute the following code, a schema for "id" in {{dfRead}} has 
> {{nullable = true}}. It should be {{nullable = false}}.
> {code:java}
> val field = "id"
> val df = spark.range(0, 5, 1, 1).toDF(field)
> val fmt = "parquet"
> val path = "/tmp/parquet"
> val schema = StructType(Seq(StructField(field, LongType, false)))
> df.write.format(fmt).mode("overwrite").save(path)
> val dfRead = spark.read.format(fmt).schema(schema).load(path)
> dfRead.printSchema
> {code}






[jira] [Commented] (SPARK-19950) nullable ignored when df.load() is executed for file-based data source

2017-03-14 Thread Kazuaki Ishizaki (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19950?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15925357#comment-15925357
 ] 

Kazuaki Ishizaki commented on SPARK-19950:
--

[~hyukjin.kwon] Thank you for pointing out SPARK-16472. Give me some time to 
read the related discussions in the PR and on the mailing list.
IIUC, SPARK-19950 and SPARK-16472 try to implement different semantics for 
{{.schema(...).read()}}.

> nullable ignored when df.load() is executed for file-based data source
> --
>
> Key: SPARK-19950
> URL: https://issues.apache.org/jira/browse/SPARK-19950
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Kazuaki Ishizaki
>
> This problem is reported in [Databricks 
> forum|https://forums.databricks.com/questions/7123/nullable-seemingly-ignored-when-reading-parquet.html].
> When we execute the following code, a schema for "id" in {{dfRead}} has 
> {{nullable = true}}. It should be {{nullable = false}}.
> {code:java}
> val field = "id"
> val df = spark.range(0, 5, 1, 1).toDF(field)
> val fmt = "parquet"
> val path = "/tmp/parquet"
> val schema = StructType(Seq(StructField(field, LongType, false)))
> df.write.format(fmt).mode("overwrite").save(path)
> val dfRead = spark.read.format(fmt).schema(schema).load(path)
> dfRead.printSchema
> {code}






[jira] [Reopened] (SPARK-19803) Flaky BlockManagerProactiveReplicationSuite tests

2017-03-14 Thread Kay Ousterhout (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19803?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kay Ousterhout reopened SPARK-19803:

  Assignee: Shubham Chopra  (was: Genmao Yu)

> Flaky BlockManagerProactiveReplicationSuite tests
> -
>
> Key: SPARK-19803
> URL: https://issues.apache.org/jira/browse/SPARK-19803
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, Tests
>Affects Versions: 2.2.0
>Reporter: Sital Kedia
>Assignee: Shubham Chopra
>  Labels: flaky-test
> Fix For: 2.2.0
>
>
> The tests added for BlockManagerProactiveReplicationSuite have made the 
> Jenkins build flaky. Please refer to the build for more details - 
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73640/testReport/






[jira] [Commented] (SPARK-19803) Flaky BlockManagerProactiveReplicationSuite tests

2017-03-14 Thread Kay Ousterhout (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15925353#comment-15925353
 ] 

Kay Ousterhout commented on SPARK-19803:


This does not appear to be fixed -- it looks like there's some error condition 
in the underlying code that can cause this to break?  From 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74412/testReport/org.apache.spark.storage/BlockManagerProactiveReplicationSuite/proactive_block_replication___5_replicas___4_block_manager_deletions/:
 

org.scalatest.exceptions.TestFailedDueToTimeoutException: The code passed to 
eventually never returned normally. Attempted 493 times over 5.00752125399 
seconds. Last failure message: 4 did not equal 5.
at 
org.scalatest.concurrent.Eventually$class.tryTryAgain$1(Eventually.scala:420)
at 
org.scalatest.concurrent.Eventually$class.eventually(Eventually.scala:438)
at org.scalatest.concurrent.Eventually$.eventually(Eventually.scala:478)
at 
org.scalatest.concurrent.Eventually$class.eventually(Eventually.scala:307)
at org.scalatest.concurrent.Eventually$.eventually(Eventually.scala:478)
at 
org.apache.spark.storage.BlockManagerProactiveReplicationSuite.testProactiveReplication(BlockManagerReplicationSuite.scala:492)
at 
org.apache.spark.storage.BlockManagerProactiveReplicationSuite$$anonfun$12$$anonfun$apply$mcVI$sp$1.apply$mcV$sp(BlockManagerReplicationSuite.scala:464)
at 
org.apache.spark.storage.BlockManagerProactiveReplicationSuite$$anonfun$12$$anonfun$apply$mcVI$sp$1.apply(BlockManagerReplicationSuite.scala:464)
at 
org.apache.spark.storage.BlockManagerProactiveReplicationSuite$$anonfun$12$$anonfun$apply$mcVI$sp$1.apply(BlockManagerReplicationSuite.scala:464)

[~shubhamc] and [~cloud_fan], since you worked on the original code for this, 
can you take a look? I looked into it for a bit, and based on some 
experimentation it looks like there are some race conditions in the 
underlying code.
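
A quick way to probe this (a sketch only, not a proposed fix): widen the 
eventually window and see whether the assertion ever passes. A genuinely racy 
code path keeps failing even with a generous timeout, while a merely slow one 
eventually succeeds. The values and the helper below are illustrative; the 
suite currently gives up after roughly 5 seconds.

{code}
import org.scalatest.concurrent.Eventually._
import org.scalatest.time.{Millis, Seconds, Span}

// Stand-in for the suite's real replica lookup; purely illustrative.
def replicaCount(): Int = 5

// 20s/100ms are illustrative values, not a fix.
eventually(timeout(Span(20, Seconds)), interval(Span(100, Millis))) {
  assert(replicaCount() == 5)
}
{code}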

> Flaky BlockManagerProactiveReplicationSuite tests
> -
>
> Key: SPARK-19803
> URL: https://issues.apache.org/jira/browse/SPARK-19803
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, Tests
>Affects Versions: 2.2.0
>Reporter: Sital Kedia
>Assignee: Genmao Yu
>  Labels: flaky-test
> Fix For: 2.2.0
>
>
> The tests added for BlockManagerProactiveReplicationSuite has made the 
> jenkins build flaky. Please refer to the build for more details - 
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73640/testReport/



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-19875) Map->filter on many columns gets stuck in constraint inference optimization code

2017-03-14 Thread Takeshi Yamamuro (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19875?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15925332#comment-15925332
 ] 

Takeshi Yamamuro edited comment on SPARK-19875 at 3/15/17 12:32 AM:


If you understand the concrete reason for the bug you described, could you 
update the description in this JIRA so that we can fix it in the future?


was (Author: maropu):
Hi, Sameer. If you understand a concrete reason about the bug you described, 
could you update the description in this JIRA so that we could fix in future. 
Thanks.

> Map->filter on many columns gets stuck in constraint inference optimization 
> code
> 
>
> Key: SPARK-19875
> URL: https://issues.apache.org/jira/browse/SPARK-19875
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Jay Pranavamurthi
> Attachments: test10cols.csv, test50cols.csv, TestFilter.scala
>
>
> The attached code (TestFilter.scala) works with a 10-column csv dataset, but 
> gets stuck with a 50-column csv dataset. Both datasets are attached.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19875) Map->filter on many columns gets stuck in constraint inference optimization code

2017-03-14 Thread Takeshi Yamamuro (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19875?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15925332#comment-15925332
 ] 

Takeshi Yamamuro commented on SPARK-19875:
--

Hi, Sameer. If you understand a concrete reason about the bug you described, 
could you update the description in this JIRA so that we could fix in future. 
Thanks.

> Map->filter on many columns gets stuck in constraint inference optimization 
> code
> 
>
> Key: SPARK-19875
> URL: https://issues.apache.org/jira/browse/SPARK-19875
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Jay Pranavamurthi
> Attachments: test10cols.csv, test50cols.csv, TestFilter.scala
>
>
> The attached code (TestFilter.scala) works with a 10-column csv dataset, but 
> gets stuck with a 50-column csv dataset. Both datasets are attached.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19954) Joining to a unioned DataFrame does not produce expected result.

2017-03-14 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19954?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15925310#comment-15925310
 ] 

Hyukjin Kwon commented on SPARK-19954:
--

Is this a blocker BTW?

{quote}
pointless to release without this change as the release would be unusable to a 
large minority of users
{quote}

> Joining to a unioned DataFrame does not produce expected result.
> 
>
> Key: SPARK-19954
> URL: https://issues.apache.org/jira/browse/SPARK-19954
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Arun Allamsetty
>Priority: Blocker
>
> I found this bug while trying to update from Spark 1.6.1 to 2.1.0. The bug is 
> that when we try to join two DataFrames, one of which is a result of a union 
> operation, the result of the join results in data as if the table was joined 
> only to the first table in the union. This issue is not present in Spark 
> 2.0.0 or 2.0.1 or 2.0.2, only in 2.1.0. Here's how to reproduce it.
> {noformat}
> import spark.implicits._
> import org.apache.spark.sql.functions.lit
> case class A(id: Long, colA: Boolean)
> case class B(id: Long, colB: Int)
> case class C(id: Long, colC: Double)
> case class X(id: Long, name: String)
> val aData = A(1, true) :: Nil
> val bData = B(2, 10) :: Nil
> val cData = C(3, 9.73D) :: Nil
> val xData = X(1, "a") :: X(2, "b") :: X(3, "c") :: Nil
> val aDf = spark.createDataset(aData).toDF
> val bDf = spark.createDataset(bData).toDF
> val cDf = spark.createDataset(cData).toDF
> val xDf = spark.createDataset(xData).toDF
> val unionDf =
>   aDf.select($"id", lit("a").as("name"), $"colA", lit(null).as("colB"), 
> lit(null).as("colC")).union(
>   bDf.select($"id", lit("b").as("name"), lit(null).as("colA"), $"colB", 
> lit(null).as("colC"))).union(
>   cDf.select($"id", lit("c").as("name"), lit(null).as("colA"), 
> lit(null).as("colB"), $"colC"))
> val result = xDf.join(unionDf, unionDf("name") === xDf("name") && 
> unionDf("id") === xDf("id"))
> result.show
> {noformat}
> The result being
> {noformat}
> +---+----+---+----+----+----+----+
> | id|name| id|name|colA|colB|colC|
> +---+----+---+----+----+----+----+
> |  1|   a|  1|   a|true|null|null|
> +---+----+---+----+----+----+----+
> {noformat}
> Force computing {{unionDf}} using {{count}} does not help change the result 
> of the join. However, writing the data to disk and reading it back does give 
> the correct result. But it is definitely not ideal. Interestingly caching the 
> {{unionDf}} also gives the correct result.
> {noformat}
> +---+----+---+----+----+----+----+
> | id|name| id|name|colA|colB|colC|
> +---+----+---+----+----+----+----+
> |  1|   a|  1|   a|true|null|null|
> |  2|   b|  2|   b|null|  10|null|
> |  3|   c|  3|   c|null|null|9.73|
> +---+----+---+----+----+----+----+
> {noformat}
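
Building on the repro above, here is a minimal sketch of the caching 
workaround the description mentions. It sidesteps the bad plan on 2.1.0 but is 
not a fix for the underlying bug:

{code}
// Continues from the repro above: cache the union before joining so the
// unioned plan is materialized once, which yields all three expected rows.
unionDf.cache()
unionDf.count()  // force materialization of the cache

val workaroundResult = xDf.join(unionDf,
  unionDf("name") === xDf("name") && unionDf("id") === xDf("id"))
workaroundResult.show()
{code}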



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16472) Inconsistent nullability in schema after being read

2017-03-14 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16472?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-16472:
-
Summary: Inconsistent nullability in schema after being read  (was: 
Inconsistent nullability in schema after being read in SQL API.)

> Inconsistent nullability in schema after being read
> ---
>
> Key: SPARK-16472
> URL: https://issues.apache.org/jira/browse/SPARK-16472
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Hyukjin Kwon
>Priority: Minor
>
> It seems the data sources implementing {{FileFormat}} load the data by 
> forcing the fields to be nullable. It seems this was officially documented 
> in SPARK-11360 and was discussed here: 
> https://www.mail-archive.com/user@spark.apache.org/msg39230.html
> However, I realised that several APIs do not follow this. For example,
> {code}
> DataFrame.json(jsonRDD: RDD[String])
> {code}
> So, the codes below:
> {code}
> val rdd = spark.sparkContext.makeRDD(Seq("{\"a\" : 1}", "{\"a\" : null}"))
> val schema = StructType(StructField("a", IntegerType, nullable = false) :: 
> Nil)
> val df = spark.read.schema(schema).json(rdd)
> df.printSchema()
> {code}
> prints below:
> {code}
> root
>  |-- a: integer (nullable = false)
> {code}
> This API keeps the schema as it is after loading. However, the schema 
> becomes different (nullable fields) when loading it via the APIs below:
> {code}
> spark.read.format("json").schema(...).load(path).printSchema()
> {code}
> {code}
> spark.read.schema(...).load(path).printSchema()
> {code}
> produce below:
> {code}
> root
>  |-- a: integer (nullable = true)
> {code}
> In addition, this is happening for structured streaming as well. (even when 
> we read batch after writing it by structured streaming).
> While testing, I wrote some test code and patches. Please see the following 
> PR for more cases.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-19950) nullable ignored when df.load() is executed for file-based data source

2017-03-14 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19950?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15925301#comment-15925301
 ] 

Hyukjin Kwon edited comment on SPARK-19950 at 3/15/17 12:04 AM:


[~kiszk], do you think we should maybe resolve this JIRA as a duplicate of 
SPARK-16472, if it describes file-format datasources forcing the schema into 
nullable fields?


was (Author: hyukjin.kwon):
[~kiszk], do you think we maybe resolve this JIRA as a duplicate of SPARK-16472 
if this describes file format datasources force the schema into nullable ones?

> nullable ignored when df.load() is executed for file-based data source
> --
>
> Key: SPARK-19950
> URL: https://issues.apache.org/jira/browse/SPARK-19950
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Kazuaki Ishizaki
>
> This problem is reported in [Databricks 
> forum|https://forums.databricks.com/questions/7123/nullable-seemingly-ignored-when-reading-parquet.html].
> When we execute the following code, a schema for "id" in {{dfRead}} has 
> {{nullable = true}}. It should be {{nullable = false}}.
> {code:java}
> val field = "id"
> val df = spark.range(0, 5, 1, 1).toDF(field)
> val fmt = "parquet"
> val path = "/tmp/parquet"
> val schema = StructType(Seq(StructField(field, LongType, false)))
> df.write.format(fmt).mode("overwrite").save(path)
> val dfRead = spark.read.format(fmt).schema(schema).load(path)
> dfRead.printSchema
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19950) nullable ignored when df.load() is executed for file-based data source

2017-03-14 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19950?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15925301#comment-15925301
 ] 

Hyukjin Kwon commented on SPARK-19950:
--

[~kiszk], do you think we maybe resolve this JIRA as a duplicate of SPARK-16472 
if this describes file format datasources force the schema into nullable ones?

> nullable ignored when df.load() is executed for file-based data source
> --
>
> Key: SPARK-19950
> URL: https://issues.apache.org/jira/browse/SPARK-19950
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Kazuaki Ishizaki
>
> This problem is reported in [Databricks 
> forum|https://forums.databricks.com/questions/7123/nullable-seemingly-ignored-when-reading-parquet.html].
> When we execute the following code, a schema for "id" in {{dfRead}} has 
> {{nullable = true}}. It should be {{nullable = false}}.
> {code:java}
> val field = "id"
> val df = spark.range(0, 5, 1, 1).toDF(field)
> val fmt = "parquet"
> val path = "/tmp/parquet"
> val schema = StructType(Seq(StructField(field, LongType, false)))
> df.write.format(fmt).mode("overwrite").save(path)
> val dfRead = spark.read.format(fmt).schema(schema).load(path)
> dfRead.printSchema
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19955) Update run-tests to support conda

2017-03-14 Thread holdenk (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19955?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15925295#comment-15925295
 ] 

holdenk commented on SPARK-19955:
-

It appears that the current Jenkins workers have conda installed, and an old 
version of Python 2.7 is available in that conda.

> Update run-tests to support conda
> -
>
> Key: SPARK-19955
> URL: https://issues.apache.org/jira/browse/SPARK-19955
> Project: Spark
>  Issue Type: Sub-task
>  Components: Project Infra, PySpark
>Affects Versions: 2.1.1, 2.2.0
>Reporter: holdenk
>
> The current test scripts only look at the system Python. On the Jenkins 
> workers we also have Conda installed; we should support looking for Python 
> versions in Conda and testing with those.
> This could unblock some of the 2.6 deprecation work and more easily enable 
> testing of pip packaging.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19955) Update run-tests to support conda

2017-03-14 Thread holdenk (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19955?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

holdenk updated SPARK-19955:

Issue Type: Sub-task  (was: Improvement)
Parent: SPARK-12661

> Update run-tests to support conda
> -
>
> Key: SPARK-19955
> URL: https://issues.apache.org/jira/browse/SPARK-19955
> Project: Spark
>  Issue Type: Sub-task
>  Components: Project Infra, PySpark
>Affects Versions: 2.1.1, 2.2.0
>Reporter: holdenk
>
> The current test scripts only look at the system Python. On the Jenkins 
> workers we also have Conda installed; we should support looking for Python 
> versions in Conda and testing with those.
> This could unblock some of the 2.6 deprecation work and more easily enable 
> testing of pip packaging.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19955) Update run-tests to support conda

2017-03-14 Thread holdenk (JIRA)
holdenk created SPARK-19955:
---

 Summary: Update run-tests to support conda
 Key: SPARK-19955
 URL: https://issues.apache.org/jira/browse/SPARK-19955
 Project: Spark
  Issue Type: Improvement
  Components: Project Infra, PySpark
Affects Versions: 2.1.1, 2.2.0
Reporter: holdenk


The current test scripts only look at the system Python. On the Jenkins 
workers we also have Conda installed; we should support looking for Python 
versions in Conda and testing with those.

This could unblock some of the 2.6 deprecation work and more easily enable 
testing of pip packaging.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19094) Plumb through logging/error messages from the JVM to Jupyter PySpark

2017-03-14 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19094:


Assignee: Apache Spark

> Plumb through logging/error messages from the JVM to Jupyter PySpark
> 
>
> Key: SPARK-19094
> URL: https://issues.apache.org/jira/browse/SPARK-19094
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Reporter: holdenk
>Assignee: Apache Spark
>Priority: Trivial
>
> Jupyter/IPython notebooks work by overriding sys.stdout & sys.stderr; as 
> such, the error messages that show up in Jupyter/IPython are often missing 
> the related logs - which are often more useful than the exception itself.
> This could make it easier for Python developers getting started with Spark on 
> their local laptops to debug their applications, since otherwise they need to 
> remember to keep going to the terminal where they launched the notebook from.
> One counterpoint to this is that Spark's logging is fairly verbose, but since 
> we provide the ability for the user to tune the log messages from within the 
> notebook that should be OK.
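
As a side note on the tuning mentioned above, the in-notebook control it 
refers to is just the SparkContext log-level API; a one-line sketch (the 
"WARN" level is an illustrative choice):

{code}
// Dial down driver-side log verbosity from within a notebook session once the
// JVM logs are surfaced there.
spark.sparkContext.setLogLevel("WARN")
{code}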



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19094) Plumb through logging/error messages from the JVM to Jupyter PySpark

2017-03-14 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15925285#comment-15925285
 ] 

Apache Spark commented on SPARK-19094:
--

User 'holdenk' has created a pull request for this issue:
https://github.com/apache/spark/pull/17298

> Plumb through logging/error messages from the JVM to Jupyter PySpark
> 
>
> Key: SPARK-19094
> URL: https://issues.apache.org/jira/browse/SPARK-19094
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Reporter: holdenk
>Priority: Trivial
>
> Jupyter/IPython notebooks work by overriding sys.stdout & sys.stderr; as 
> such, the error messages that show up in Jupyter/IPython are often missing 
> the related logs - which are often more useful than the exception itself.
> This could make it easier for Python developers getting started with Spark on 
> their local laptops to debug their applications, since otherwise they need to 
> remember to keep going to the terminal where they launched the notebook from.
> One counterpoint to this is that Spark's logging is fairly verbose, but since 
> we provide the ability for the user to tune the log messages from within the 
> notebook that should be OK.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19094) Plumb through logging/error messages from the JVM to Jupyter PySpark

2017-03-14 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19094:


Assignee: (was: Apache Spark)

> Plumb through logging/error messages from the JVM to Jupyter PySpark
> 
>
> Key: SPARK-19094
> URL: https://issues.apache.org/jira/browse/SPARK-19094
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Reporter: holdenk
>Priority: Trivial
>
> Jupyter/IPython notebooks work by overriding sys.stdout & sys.stderr; as 
> such, the error messages that show up in Jupyter/IPython are often missing 
> the related logs - which are often more useful than the exception itself.
> This could make it easier for Python developers getting started with Spark on 
> their local laptops to debug their applications, since otherwise they need to 
> remember to keep going to the terminal where they launched the notebook from.
> One counterpoint to this is that Spark's logging is fairly verbose, but since 
> we provide the ability for the user to tune the log messages from within the 
> notebook that should be OK.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19282) RandomForestRegressionModel summary should expose getMaxDepth

2017-03-14 Thread Bryan Cutler (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15925159#comment-15925159
 ] 

Bryan Cutler commented on SPARK-19282:
--

[~iamshrek] I am currently working on SPARK-10931 and I'll create sub-tasks 
once we figure out the best way to proceed.  There will be quite a bit of work 
to expose all parameters, so help would be much appreciated!

> RandomForestRegressionModel summary should expose getMaxDepth
> -
>
> Key: SPARK-19282
> URL: https://issues.apache.org/jira/browse/SPARK-19282
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark, SparkR
>Affects Versions: 2.1.0
>Reporter: Nick Lothian
>Assignee: Xin Ren
>Priority: Minor
> Fix For: 2.2.0
>
>
> Currently it isn't clear how to get the max depth of a 
> RandomForestRegressionModel (e.g., after doing a grid search).
> It is possible to call
> {{regressor._java_obj.getMaxDepth()}} 
> but most other decision trees allow
> {{regressor.getMaxDepth()}} 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-19629) Partitioning of Parquet is not considered correctly at loading in local[X] mode

2017-03-14 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19629?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-19629.
--
Resolution: Not A Problem

{quote}
What could other solutions be, if this is not a bug?
{quote}

Setting {{openCostInBytes}} higher might be a workaround for it, and I don't 
think this is a bug; the write-time and read-time partition counts are not 
guaranteed to be the same.

I am resolving this JIRA. Please reopen it if I am mistaken.
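
A minimal sketch of that workaround (the config value and path are 
illustrative, following the repro in the description):

{code}
// A larger openCostInBytes discourages packing many small files into a single
// read partition, so the read-side partition count stays closer to the number
// of files written. 128MB here is only an example value.
spark.conf.set("spark.sql.files.openCostInBytes", 134217728L)

val reRead = spark.read.parquet("/tmp/test1")
println(reRead.rdd.getNumPartitions)

// If an exact partition count matters, repartition explicitly after loading.
val exact = reRead.repartition(10)
{code}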

> Partitioning of Parquet is not considered correctly at loading in local[X] 
> mode
> ---
>
> Key: SPARK-19629
> URL: https://issues.apache.org/jira/browse/SPARK-19629
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output, Spark Core
>Affects Versions: 2.0.0, 2.1.0
> Environment: Tested using docker run 
> gettyimages/spark:1.6.1-hadoop-2.6 and
> docker run gettyimages/spark:2.1.0-hadoop-2.7.
>Reporter: Navige
>Priority: Minor
>
> Running the following two examples will lead to different results depending 
> on whether the code is run using Spark 1.6 or Spark 2.1. 
> h1.What does the example do?
> - The code creates an exemplary dataframe with random data. 
> - The dataframe is repartitioned and stored to disk. 
> - Then the dataframe is re-read from disk.
> - The number of partitions of the dataframe is considered.
> h1. What is the/my expected behaviour?
> The number of partitions specified when storing the dataframe should be the 
> same as when re-loading the dataframe from disk.
> h1. Differences in Spark 1.6 and Spark 2
> On Spark 1.6 the partitioning is kept, i.e., the code example will return 10 
> partitions as specified using npartitions; on Spark 2.1 the number of 
> partitions will equal the number of local nodes specified when starting Spark 
> (using local[X] as master). 
> Looking at the data produced, in both Spark versions the number of files in 
> the parquet directory is the same - so Spark 2 writes as many files as the 
> number of partitions when storing, but when reading them back in Spark 2, 
> the number of partitions is messed up.
> h1.Minimal code example
> {code:none}
> # run on Spark 1.6
> import scala.util.Random
> import org.apache.spark.sql.types.{StructField, StructType, FloatType}
> import org.apache.spark.sql.Row
>  val rdd = sc.parallelize(Seq.fill(100)(Row(Seq(Random.nextFloat()): _*)))
> val df = sqlContext.createDataFrame(rdd, StructType(Seq(StructField("test", 
> FloatType))))
> val npartitions = 10
> df.repartition(npartitions).write.parquet("/tmp/test1")
> val read = sqlContext.read.parquet("/tmp/test1")
> assert(npartitions == read.rdd.getNumPartitions) //true on Spark 1.6
> {code}
> {code:none}
> # run on Spark 2.1
> import scala.util.Random
> import org.apache.spark.sql.types.{StructField, StructType, FloatType}
> import org.apache.spark.sql.Row
> val rdd = sc.parallelize(Seq.fill(100)(Row(Seq(Random.nextFloat()): _*)))
> val df = spark.sqlContext.createDataFrame(rdd, 
> StructType(Seq(StructField("test", FloatType))))
> val npartitions = 10
> df.repartition(npartitions).write.parquet("/tmp/test1")
> val read = spark.sqlContext.read.parquet("/tmp/test1")
> assert(npartitions == read.rdd.getNumPartitions) //false on Spark 2.1
> {code}
> h1.What could other solutions be, if this is not a bug?
> If this is intended, what about introducing a parameter at reading time, 
> which specifies whether the data should truly be repartitioned (depending on 
> the number of nodes) or should be read "as-is".



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19954) Joining to a unioned DataFrame does not produce expected result.

2017-03-14 Thread Arun Allamsetty (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19954?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arun Allamsetty updated SPARK-19954:

Description: 
I found this bug while trying to update from Spark 1.6.1 to 2.1.0. The bug is 
that when we try to join two DataFrames, one of which is a result of a union 
operation, the result of the join results in data as if the table was joined 
only to the first table in the union. This issue is not present in Spark 2.0.0 
or 2.0.1 or 2.0.2, only in 2.1.0. Here's how to reproduce it.

{noformat}
import spark.implicits._
import org.apache.spark.sql.functions.lit

case class A(id: Long, colA: Boolean)
case class B(id: Long, colB: Int)
case class C(id: Long, colC: Double)
case class X(id: Long, name: String)

val aData = A(1, true) :: Nil
val bData = B(2, 10) :: Nil
val cData = C(3, 9.73D) :: Nil
val xData = X(1, "a") :: X(2, "b") :: X(3, "c") :: Nil

val aDf = spark.createDataset(aData).toDF
val bDf = spark.createDataset(bData).toDF
val cDf = spark.createDataset(cData).toDF
val xDf = spark.createDataset(xData).toDF

val unionDf =
  aDf.select($"id", lit("a").as("name"), $"colA", lit(null).as("colB"), 
lit(null).as("colC")).union(
  bDf.select($"id", lit("b").as("name"), lit(null).as("colA"), $"colB", 
lit(null).as("colC"))).union(
  cDf.select($"id", lit("c").as("name"), lit(null).as("colA"), 
lit(null).as("colB"), $"colC"))
val result = xDf.join(unionDf, unionDf("name") === xDf("name") && unionDf("id") 
=== xDf("id"))
result.show
{noformat}

The result being
{noformat}
+---+----+---+----+----+----+----+
| id|name| id|name|colA|colB|colC|
+---+----+---+----+----+----+----+
|  1|   a|  1|   a|true|null|null|
+---+----+---+----+----+----+----+
{noformat}

Force computing {{unionDf}} using {{count}} does not help change the result of 
the join. However, writing the data to disk and reading it back does give the 
correct result. But it is definitely not ideal. Interestingly caching the 
{{unionDf}} also gives the correct result.

{noformat}
+---+----+---+----+----+----+----+
| id|name| id|name|colA|colB|colC|
+---+----+---+----+----+----+----+
|  1|   a|  1|   a|true|null|null|
|  2|   b|  2|   b|null|  10|null|
|  3|   c|  3|   c|null|null|9.73|
+---+----+---+----+----+----+----+
{noformat}

  was:
I found this bug while trying to update from Spark 1.6.1 to 2.1.0. The bug is 
that when we try to join two DataFrames, one of which is a result of a union 
operation, the result of the join results in data as if the table was joined 
only to the first table in the union. This issue is not present in Spark 2.0.0 
or 2.0.1 or 2.0.2, only in 2.1.0. Here's how to reproduce it.

{{{noformat}}}
import spark.implicits._
import org.apache.spark.sql.functions.lit

case class A(id: Long, colA: Boolean)
case class B(id: Long, colB: Int)
case class C(id: Long, colC: Double)
case class X(id: Long, name: String)

val aData = A(1, true) :: Nil
val bData = B(2, 10) :: Nil
val cData = C(3, 9.73D) :: Nil
val xData = X(1, "a") :: X(2, "b") :: X(3, "c") :: Nil

val aDf = spark.createDataset(aData).toDF
val bDf = spark.createDataset(bData).toDF
val cDf = spark.createDataset(cData).toDF
val xDf = spark.createDataset(xData).toDF

val unionDf =
  aDf.select($"id", lit("a").as("name"), $"colA", lit(null).as("colB"), 
lit(null).as("colC")).union(
  bDf.select($"id", lit("b").as("name"), lit(null).as("colA"), $"colB", 
lit(null).as("colC"))).union(
  cDf.select($"id", lit("c").as("name"), lit(null).as("colA"), 
lit(null).as("colB"), $"colC"))
val result = xDf.join(unionDf, unionDf("name") === xDf("name") && unionDf("id") 
=== xDf("id"))
result.show
{{{noformat}}}

The result being
{{{noformat}}}
+---+----+---+----+----+----+----+
| id|name| id|name|colA|colB|colC|
+---+----+---+----+----+----+----+
|  1|   a|  1|   a|true|null|null|
+---+----+---+----+----+----+----+
{{{noformat}}}

Force computing {{unionDf}} using {{count}} does not help change the result of 
the join. However, writing the data to disk and reading it back does give the 
correct result. But it is definitely not ideal. Interestingly caching the 
{{unionDf}} also gives the correct result.

{{{noformat}}}
+---+----+---+----+----+----+----+
| id|name| id|name|colA|colB|colC|
+---+----+---+----+----+----+----+
|  1|   a|  1|   a|true|null|null|
|  2|   b|  2|   b|null|  10|null|
|  3|   c|  3|   c|null|null|9.73|
+---+----+---+----+----+----+----+
{{{noformat}}}


> Joining to a unioned DataFrame does not produce expected result.
> 
>
> Key: SPARK-19954
> URL: https://issues.apache.org/jira/browse/SPARK-19954
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Arun Allamsetty
>Priority: Blocker
>
> I found this bug while trying to update from Spark 1.6.1 to 2.1.0. The bug is 
> that wh

[jira] [Commented] (SPARK-18603) Support `OuterReference` in projection list of IN correlated subqueries

2017-03-14 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18603?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15925140#comment-15925140
 ] 

Dongjoon Hyun commented on SPARK-18603:
---

Thank YOU!

> Support `OuterReference` in projection list of IN correlated subqueries
> ---
>
> Key: SPARK-18603
> URL: https://issues.apache.org/jira/browse/SPARK-18603
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.0.2
>Reporter: Dongjoon Hyun
>
> This issue aims to allow OuterReference columns in projection lists of IN 
> correlated subqueries. 
> *SIMPLE EXAMPLE*
> {code}
> scala> sql("CREATE TEMPORARY VIEW t1 AS SELECT * FROM VALUES 1, 2 AS t1(a)")
> scala> sql("CREATE TEMPORARY VIEW t2 AS SELECT * FROM VALUES 1 AS t2(b)")
> scala> sql("SELECT a FROM t1 WHERE a IN (SELECT a FROM t2)").show
> {code}
> *COMPLEX EXAMPLE*
> {code}
> SELECT *
> FROM t1
> WHERE a IN (SELECT x
> FROM (SELECT b, a + 1 as x, a + b as y
>   FROM t2)
> WHERE y > 0)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19954) Joining to a unioned DataFrame does not produce expected result.

2017-03-14 Thread Arun Allamsetty (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19954?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arun Allamsetty updated SPARK-19954:

Description: 
I found this bug while trying to update from Spark 1.6.1 to 2.1.0. The bug is 
that when we try to join two DataFrames, one of which is a result of a union 
operation, the result of the join results in data as if the table was joined 
only to the first table in the union. This issue is not present in Spark 2.0.0 
or 2.0.1 or 2.0.2, only in 2.1.0. Here's how to reproduce it.

{{{noformat}}}
import spark.implicits._
import org.apache.spark.sql.functions.lit

case class A(id: Long, colA: Boolean)
case class B(id: Long, colB: Int)
case class C(id: Long, colC: Double)
case class X(id: Long, name: String)

val aData = A(1, true) :: Nil
val bData = B(2, 10) :: Nil
val cData = C(3, 9.73D) :: Nil
val xData = X(1, "a") :: X(2, "b") :: X(3, "c") :: Nil

val aDf = spark.createDataset(aData).toDF
val bDf = spark.createDataset(bData).toDF
val cDf = spark.createDataset(cData).toDF
val xDf = spark.createDataset(xData).toDF

val unionDf =
  aDf.select($"id", lit("a").as("name"), $"colA", lit(null).as("colB"), 
lit(null).as("colC")).union(
  bDf.select($"id", lit("b").as("name"), lit(null).as("colA"), $"colB", 
lit(null).as("colC"))).union(
  cDf.select($"id", lit("c").as("name"), lit(null).as("colA"), 
lit(null).as("colB"), $"colC"))
val result = xDf.join(unionDf, unionDf("name") === xDf("name") && unionDf("id") 
=== xDf("id"))
result.show
{{{noformat}}}

The result being
{{{noformat}}}
+---+----+---+----+----+----+----+
| id|name| id|name|colA|colB|colC|
+---+----+---+----+----+----+----+
|  1|   a|  1|   a|true|null|null|
+---+----+---+----+----+----+----+
{{{noformat}}}

Force computing {{unionDf}} using {{count}} does not help change the result of 
the join. However, writing the data to disk and reading it back does give the 
correct result. But it is definitely not ideal. Interestingly caching the 
{{unionDf}} also gives the correct result.

{{{noformat}}}
+---+----+---+----+----+----+----+
| id|name| id|name|colA|colB|colC|
+---+----+---+----+----+----+----+
|  1|   a|  1|   a|true|null|null|
|  2|   b|  2|   b|null|  10|null|
|  3|   c|  3|   c|null|null|9.73|
+---+----+---+----+----+----+----+
{{{noformat}}}

  was:
I found this bug while trying to update from Spark 1.6.1 to 2.1.0. The bug is 
that when we try to join two DataFrames, one of which is a result of a union 
operation, the result of the join results in data as if the table was joined 
only to the first table in the union. This issue is not present in Spark 2.0.0 
or 2.0.1 or 2.0.2, only in 2.1.0. Here's how to reproduce it.

{{noformat}}
import spark.implicits._
import org.apache.spark.sql.functions.lit

case class A(id: Long, colA: Boolean)
case class B(id: Long, colB: Int)
case class C(id: Long, colC: Double)
case class X(id: Long, name: String)

val aData = A(1, true) :: Nil
val bData = B(2, 10) :: Nil
val cData = C(3, 9.73D) :: Nil
val xData = X(1, "a") :: X(2, "b") :: X(3, "c") :: Nil

val aDf = spark.createDataset(aData).toDF
val bDf = spark.createDataset(bData).toDF
val cDf = spark.createDataset(cData).toDF
val xDf = spark.createDataset(xData).toDF

val unionDf =
  aDf.select($"id", lit("a").as("name"), $"colA", lit(null).as("colB"), 
lit(null).as("colC")).union(
  bDf.select($"id", lit("b").as("name"), lit(null).as("colA"), $"colB", 
lit(null).as("colC"))).union(
  cDf.select($"id", lit("c").as("name"), lit(null).as("colA"), 
lit(null).as("colB"), $"colC"))
val result = xDf.join(unionDf, unionDf("name") === xDf("name") && unionDf("id") 
=== xDf("id"))
result.show
{{noformat}}

The result being
{{noformat}}
+---+----+---+----+----+----+----+
| id|name| id|name|colA|colB|colC|
+---+----+---+----+----+----+----+
|  1|   a|  1|   a|true|null|null|
+---+----+---+----+----+----+----+
{{noformat}}

Force computing {{unionDf}} using {{count}} does not help change the result of 
the join. However, writing the data to disk and reading it back does give the 
correct result. But it is definitely not ideal. Interestingly caching the 
{{unionDf}} also gives the correct result.

{{noformat}}
+---+----+---+----+----+----+----+
| id|name| id|name|colA|colB|colC|
+---+----+---+----+----+----+----+
|  1|   a|  1|   a|true|null|null|
|  2|   b|  2|   b|null|  10|null|
|  3|   c|  3|   c|null|null|9.73|
+---+----+---+----+----+----+----+
{{noformat}}


> Joining to a unioned DataFrame does not produce expected result.
> 
>
> Key: SPARK-19954
> URL: https://issues.apache.org/jira/browse/SPARK-19954
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Arun Allamsetty
>Priority: Blocker
>
> I found this bug while trying to update from Spark 1.6.1 to 2.1.0. The bug i

[jira] [Created] (SPARK-19954) Joining to a unioned DataFrame does not produce expected result.

2017-03-14 Thread Arun Allamsetty (JIRA)
Arun Allamsetty created SPARK-19954:
---

 Summary: Joining to a unioned DataFrame does not produce expected 
result.
 Key: SPARK-19954
 URL: https://issues.apache.org/jira/browse/SPARK-19954
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.1.0
Reporter: Arun Allamsetty
Priority: Blocker


I found this bug while trying to update from Spark 1.6.1 to 2.1.0. The bug is 
that when we try to join two DataFrames, one of which is a result of a union 
operation, the result of the join results in data as if the table was joined 
only to the first table in the union. This issue is not present in Spark 2.0.0 
or 2.0.1 or 2.0.2, only in 2.1.0. Here's how to reproduce it.

{{noformat}}
import spark.implicits._
import org.apache.spark.sql.functions.lit

case class A(id: Long, colA: Boolean)
case class B(id: Long, colB: Int)
case class C(id: Long, colC: Double)
case class X(id: Long, name: String)

val aData = A(1, true) :: Nil
val bData = B(2, 10) :: Nil
val cData = C(3, 9.73D) :: Nil
val xData = X(1, "a") :: X(2, "b") :: X(3, "c") :: Nil

val aDf = spark.createDataset(aData).toDF
val bDf = spark.createDataset(bData).toDF
val cDf = spark.createDataset(cData).toDF
val xDf = spark.createDataset(xData).toDF

val unionDf =
  aDf.select($"id", lit("a").as("name"), $"colA", lit(null).as("colB"), 
lit(null).as("colC")).union(
  bDf.select($"id", lit("b").as("name"), lit(null).as("colA"), $"colB", 
lit(null).as("colC"))).union(
  cDf.select($"id", lit("c").as("name"), lit(null).as("colA"), 
lit(null).as("colB"), $"colC"))
val result = xDf.join(unionDf, unionDf("name") === xDf("name") && unionDf("id") 
=== xDf("id"))
result.show
{{noformat}}

The result being
{{noformat}}
+---+----+---+----+----+----+----+
| id|name| id|name|colA|colB|colC|
+---+----+---+----+----+----+----+
|  1|   a|  1|   a|true|null|null|
+---+----+---+----+----+----+----+
{{noformat}}

Force computing {{unionDf}} using {{count}} does not help change the result of 
the join. However, writing the data to disk and reading it back does give the 
correct result. But it is definitely not ideal. Interestingly caching the 
{{unionDf}} also gives the correct result.

{{noformat}}
+---+----+---+----+----+----+----+
| id|name| id|name|colA|colB|colC|
+---+----+---+----+----+----+----+
|  1|   a|  1|   a|true|null|null|
|  2|   b|  2|   b|null|  10|null|
|  3|   c|  3|   c|null|null|9.73|
+---+----+---+----+----+----+----+
{{noformat}}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14649) DagScheduler re-starts all running tasks on fetch failure

2017-03-14 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14649?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15925125#comment-15925125
 ] 

Apache Spark commented on SPARK-14649:
--

User 'sitalkedia' has created a pull request for this issue:
https://github.com/apache/spark/pull/17297

> DagScheduler re-starts all running tasks on fetch failure
> -
>
> Key: SPARK-14649
> URL: https://issues.apache.org/jira/browse/SPARK-14649
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Reporter: Sital Kedia
>
> When a fetch failure occurs, the DAGScheduler re-launches the previous stage 
> (to re-generate output that was missing), and then re-launches all tasks in 
> the stage with the fetch failure that hadn't *completed* when the fetch 
> failure occurred (the DAGScheduler re-launches all of the tasks whose output 
> data is not available -- which is equivalent to the set of tasks that hadn't 
> yet completed).
> The assumption when this code was originally written was that when a fetch 
> failure occurred, the output from at least one of the tasks in the previous 
> stage was no longer available, so all of the tasks in the current stage would 
> eventually fail due to not being able to access that output.  This assumption 
> does not hold for some large-scale, long-running workloads.  E.g., there's 
> one use case where a job has ~100k tasks that each run for about 1 hour, and 
> only the first 5-10 minutes are spent fetching data.  Because of the large 
> number of tasks, it's very common to see a few tasks fail in the fetch phase, 
> and it's wasteful to re-run other tasks that had finished fetching data so 
> aren't affected by the fetch failure (and may be most of the way through 
> their hour-long execution).  The DAGScheduler should not re-start these tasks.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18603) Support `OuterReference` in projection list of IN correlated subqueries

2017-03-14 Thread Nattavut Sutyanyong (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18603?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15925122#comment-15925122
 ] 

Nattavut Sutyanyong commented on SPARK-18603:
-

I will include it in my list and will prioritize it along with other sub-tasks. 
Thanks!

> Support `OuterReference` in projection list of IN correlated subqueries
> ---
>
> Key: SPARK-18603
> URL: https://issues.apache.org/jira/browse/SPARK-18603
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.0.2
>Reporter: Dongjoon Hyun
>
> This issue aims to allow OuterReference columns in projection lists of IN 
> correlated subqueries. 
> *SIMPLE EXAMPLE*
> {code}
> scala> sql("CREATE TEMPORARY VIEW t1 AS SELECT * FROM VALUES 1, 2 AS t1(a)")
> scala> sql("CREATE TEMPORARY VIEW t2 AS SELECT * FROM VALUES 1 AS t2(b)")
> scala> sql("SELECT a FROM t1 WHERE a IN (SELECT a FROM t2)").show
> {code}
> *COMPLEX EXAMPLE*
> {code}
> SELECT *
> FROM t1
> WHERE a IN (SELECT x
> FROM (SELECT b, a + 1 as x, a + b as y
>   FROM t2)
> WHERE y > 0)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18603) Support `OuterReference` in projection list of IN correlated subqueries

2017-03-14 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18603?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15925115#comment-15925115
 ] 

Dongjoon Hyun commented on SPARK-18603:
---

For this kind of issue, I think you are the best person to fix it these 
days. Could you take this one, too? :)

> Support `OuterReference` in projection list of IN correlated subqueries
> ---
>
> Key: SPARK-18603
> URL: https://issues.apache.org/jira/browse/SPARK-18603
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.0.2
>Reporter: Dongjoon Hyun
>
> This issue aims to allow OuterReference columns in projection lists of IN 
> correlated subqueries. 
> *SIMPLE EXAMPLE*
> {code}
> scala> sql("CREATE TEMPORARY VIEW t1 AS SELECT * FROM VALUES 1, 2 AS t1(a)")
> scala> sql("CREATE TEMPORARY VIEW t2 AS SELECT * FROM VALUES 1 AS t2(b)")
> scala> sql("SELECT a FROM t1 WHERE a IN (SELECT a FROM t2)").show
> {code}
> *COMPLEX EXAMPLE*
> {code}
> SELECT *
> FROM t1
> WHERE a IN (SELECT x
> FROM (SELECT b, a + 1 as x, a + b as y
>   FROM t2)
> WHERE y > 0)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19953) RandomForest Models should use the UID of Estimator when fit

2017-03-14 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19953?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19953:


Assignee: (was: Apache Spark)

> RandomForest Models should use the UID of Estimator when fit
> 
>
> Key: SPARK-19953
> URL: https://issues.apache.org/jira/browse/SPARK-19953
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.0.1, 2.1.0
>Reporter: Bryan Cutler
>Priority: Minor
>
> Currently, RandomForestClassificationModel and RandomForestRegressionModel 
> use the alternate constructor which creates a new random UID instead of using 
> the parent estimator's UID.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19953) RandomForest Models should use the UID of Estimator when fit

2017-03-14 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19953?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19953:


Assignee: Apache Spark

> RandomForest Models should use the UID of Estimator when fit
> 
>
> Key: SPARK-19953
> URL: https://issues.apache.org/jira/browse/SPARK-19953
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.0.1, 2.1.0
>Reporter: Bryan Cutler
>Assignee: Apache Spark
>Priority: Minor
>
> Currently, RandomForestClassificationModel and RandomForestRegressionModel 
> use the alternate constructor which creates a new random UID instead of using 
> the parent estimator's UID.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19953) RandomForest Models should use the UID of Estimator when fit

2017-03-14 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19953?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15925107#comment-15925107
 ] 

Apache Spark commented on SPARK-19953:
--

User 'BryanCutler' has created a pull request for this issue:
https://github.com/apache/spark/pull/17296

> RandomForest Models should use the UID of Estimator when fit
> 
>
> Key: SPARK-19953
> URL: https://issues.apache.org/jira/browse/SPARK-19953
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.0.1, 2.1.0
>Reporter: Bryan Cutler
>Priority: Minor
>
> Currently, RandomForestClassificationModel and RandomForestRegressionModel 
> use the alternate constructor which creates a new random UID instead of using 
> the parent estimator's UID.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18603) Support `OuterReference` in projection list of IN correlated subqueries

2017-03-14 Thread Nattavut Sutyanyong (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18603?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15925092#comment-15925092
 ] 

Nattavut Sutyanyong commented on SPARK-18603:
-

[~dongjoon] Now that the first phase of the subquery work, which lays down a 
new infrastructure, has been merged to master, we can start this work on top 
of it. Please let me know when you will start working on it. We can 
collaborate here.

> Support `OuterReference` in projection list of IN correlated subqueries
> ---
>
> Key: SPARK-18603
> URL: https://issues.apache.org/jira/browse/SPARK-18603
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.0.2
>Reporter: Dongjoon Hyun
>
> This issue aims to allow OuterReference columns in projection lists of IN 
> correlated subqueries. 
> *SIMPLE EXAMPLE*
> {code}
> scala> sql("CREATE TEMPORARY VIEW t1 AS SELECT * FROM VALUES 1, 2 AS t1(a)")
> scala> sql("CREATE TEMPORARY VIEW t2 AS SELECT * FROM VALUES 1 AS t2(b)")
> scala> sql("SELECT a FROM t1 WHERE a IN (SELECT a FROM t2)").show
> {code}
> *COMPLEX EXAMPLE*
> {code}
> SELECT *
> FROM t1
> WHERE a IN (SELECT x
> FROM (SELECT b, a + 1 as x, a + b as y
>   FROM t2)
> WHERE y > 0)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19953) RandomForest Models should use the UID of Estimator when fit

2017-03-14 Thread Bryan Cutler (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19953?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15925088#comment-15925088
 ] 

Bryan Cutler commented on SPARK-19953:
--

I'll push the patch for this
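
A minimal sketch of how the mismatch shows up (the toy data below is made up 
purely for illustration):

{code}
import org.apache.spark.ml.classification.RandomForestClassifier
import org.apache.spark.ml.linalg.Vectors

// Tiny illustrative dataset with the default "label"/"features" column names.
val data = spark.createDataFrame(Seq(
  (0.0, Vectors.dense(0.0, 1.0)),
  (1.0, Vectors.dense(1.0, 0.0))
)).toDF("label", "features")

val rf = new RandomForestClassifier().setNumTrees(3)
val model = rf.fit(data)

// Before the fix these differ: the model gets a fresh random UID instead of
// inheriting the estimator's UID.
println(rf.uid)
println(model.uid)
{code}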

> RandomForest Models should use the UID of Estimator when fit
> 
>
> Key: SPARK-19953
> URL: https://issues.apache.org/jira/browse/SPARK-19953
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.0.1, 2.1.0
>Reporter: Bryan Cutler
>Priority: Minor
>
> Currently, RandomForestClassificationModel and RandomForestRegressionModel 
> use the alternate constructor which creates a new random UID instead of using 
> the parent estimators UID.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19953) RandomForest Models should use the UID of Estimator when fit

2017-03-14 Thread Bryan Cutler (JIRA)
Bryan Cutler created SPARK-19953:


 Summary: RandomForest Models should use the UID of Estimator when 
fit
 Key: SPARK-19953
 URL: https://issues.apache.org/jira/browse/SPARK-19953
 Project: Spark
  Issue Type: Bug
  Components: ML
Affects Versions: 2.1.0, 2.0.1
Reporter: Bryan Cutler
Priority: Minor


Currently, RandomForestClassificationModel and RandomForestRegressionModel use 
the alternate constructor which creates a new random UID instead of using the 
parent estimator's UID.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19556) Broadcast data is not encrypted when I/O encryption is on

2017-03-14 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15925067#comment-15925067
 ] 

Apache Spark commented on SPARK-19556:
--

User 'vanzin' has created a pull request for this issue:
https://github.com/apache/spark/pull/17295

> Broadcast data is not encrypted when I/O encryption is on
> -
>
> Key: SPARK-19556
> URL: https://issues.apache.org/jira/browse/SPARK-19556
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: Marcelo Vanzin
>
> {{TorrentBroadcast}} uses a couple of "back doors" into the block manager to 
> write and read data:
> {code}
>   if (!blockManager.putBytes(pieceId, bytes, MEMORY_AND_DISK_SER, 
> tellMaster = true)) {
> throw new SparkException(s"Failed to store $pieceId of $broadcastId 
> in local BlockManager")
>   }
> {code}
> {code}
>   bm.getLocalBytes(pieceId) match {
> case Some(block) =>
>   blocks(pid) = block
>   releaseLock(pieceId)
> case None =>
>   bm.getRemoteBytes(pieceId) match {
> case Some(b) =>
>   if (checksumEnabled) {
> val sum = calcChecksum(b.chunks(0))
> if (sum != checksums(pid)) {
>   throw new SparkException(s"corrupt remote block $pieceId of 
> $broadcastId:" +
> s" $sum != ${checksums(pid)}")
> }
>   }
>   // We found the block from remote executors/driver's 
> BlockManager, so put the block
>   // in this executor's BlockManager.
>   if (!bm.putBytes(pieceId, b, StorageLevel.MEMORY_AND_DISK_SER, 
> tellMaster = true)) {
> throw new SparkException(
>   s"Failed to store $pieceId of $broadcastId in local 
> BlockManager")
>   }
>   blocks(pid) = b
> case None =>
>   throw new SparkException(s"Failed to get $pieceId of 
> $broadcastId")
>   }
>   }
> {code}
> The thing these block manager methods have in common is that they bypass the 
> encryption code; so broadcast data is stored unencrypted in the block 
> manager, causing unencrypted data to be written to disk if those blocks need 
> to be evicted from memory.
> The correct fix here is actually not to change {{TorrentBroadcast}}, but to 
> fix the block manager so that:
> - data stored in memory is not encrypted
> - data written to disk is encrypted
> This would simplify the code paths that use BlockManager / SerializerManager 
> APIs (e.g. see SPARK-19520), but requires some tricky changes inside the 
> BlockManager to still be able to use file channels to avoid reading whole 
> blocks back into memory so they can be decrypted.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16617) Upgrade to Avro 1.8.x

2017-03-14 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16617?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-16617:
--
Affects Version/s: 2.1.0
 Target Version/s: 3.0.0
  Component/s: Spark Core
   Build

> Upgrade to Avro 1.8.x
> -
>
> Key: SPARK-16617
> URL: https://issues.apache.org/jira/browse/SPARK-16617
> Project: Spark
>  Issue Type: Improvement
>  Components: Build, Spark Core
>Affects Versions: 2.1.0
>Reporter: Ben McCann
>
> Avro 1.8 makes Avro objects serializable so that you can easily have an RDD 
> containing Avro objects.
> See https://issues.apache.org/jira/browse/AVRO-1502



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19817) make it clear that `timeZone` option is a general option in DataFrameReader/Writer

2017-03-14 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19817?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-19817:

Fix Version/s: 2.2.0

> make it clear that `timeZone` option is a general option in 
> DataFrameReader/Writer
> --
>
> Key: SPARK-19817
> URL: https://issues.apache.org/jira/browse/SPARK-19817
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Wenchen Fan
>Assignee: Takuya Ueshin
> Fix For: 2.2.0
>
>
> As the timezone setting can also affect partition values, it applies to all 
> formats; we should make that clear.
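
A short sketch of what "general option" means in practice (the path, zone ID, 
and sample row below are made up for illustration):

{code}
import java.sql.Timestamp
import spark.implicits._

val events = Seq((1, Timestamp.valueOf("2017-03-14 10:00:00"))).toDF("id", "ts")

// timeZone is read by DataFrameReader/Writer themselves rather than a single
// format, since it also controls how timestamp partition values are rendered.
events.write
  .option("timeZone", "America/Los_Angeles")
  .partitionBy("ts")
  .mode("overwrite")
  .parquet("/tmp/events_pst")

val reloaded = spark.read
  .option("timeZone", "America/Los_Angeles")
  .parquet("/tmp/events_pst")
{code}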



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-19817) make it clear that `timeZone` option is a general option in DataFrameReader/Writer

2017-03-14 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19817?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-19817.
-
Resolution: Fixed

> make it clear that `timeZone` option is a general option in 
> DataFrameReader/Writer
> --
>
> Key: SPARK-19817
> URL: https://issues.apache.org/jira/browse/SPARK-19817
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Wenchen Fan
>Assignee: Takuya Ueshin
>
> As the timezone setting can also affect partition values, it applies to all 
> formats; we should make that clear.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-18966) NOT IN subquery with correlated expressions may return incorrect result

2017-03-14 Thread Herman van Hovell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18966?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Herman van Hovell resolved SPARK-18966.
---
   Resolution: Fixed
 Assignee: Nattavut Sutyanyong
Fix Version/s: 2.2.0

> NOT IN subquery with correlated expressions may return incorrect result
> ---
>
> Key: SPARK-18966
> URL: https://issues.apache.org/jira/browse/SPARK-18966
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Nattavut Sutyanyong
>Assignee: Nattavut Sutyanyong
>  Labels: correctness
> Fix For: 2.2.0
>
>
> {code}
> Seq((1, 2)).toDF("a1", "b1").createOrReplaceTempView("t1")
> Seq[(java.lang.Integer, java.lang.Integer)]((1, null)).toDF("a2", 
> "b2").createOrReplaceTempView("t2")
> // The expected result is 1 row of (1,2) as shown in the next statement.
> sql("select * from t1 where a1 not in (select a2 from t2 where b2 = b1)").show
> +---+---+
> | a1| b1|
> +---+---+
> +---+---+
> sql("select * from t1 where a1 not in (select a2 from t2 where b2 = 2)").show
> +---+---+
> | a1| b1|
> +---+---+
> |  1|  2|
> +---+---+
> {code}
> The two SQL statements above should return the same result.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19952) Remove specialized catalog related analysis exceptions

2017-03-14 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19952?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-19952:

Description: 
We introduce catalog specific analysis exceptions (that extends 
AnalysisException) in Spark SQL. The problem with these are that they do not 
add much value, and that they are not well supported (for example in Pyspark). 
We should remove them, and use AnalysisException instead.

We should remove all the exceptions defined in NoSuchItemException.scala.


  was:We introduce catalog specific analysis exceptions (that extends 
AnalysisException) in Spark SQL. The problem with these are that they do not 
add much value, and that they are not well supported (for example in Pyspark). 
We should remove them, and use AnalysisException instead.


> Remove specialized catalog related analysis exceptions
> --
>
> Key: SPARK-19952
> URL: https://issues.apache.org/jira/browse/SPARK-19952
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0, 2.1.0
>Reporter: Herman van Hovell
>Assignee: Herman van Hovell
>
> We introduce catalog-specific analysis exceptions (that extend 
> AnalysisException) in Spark SQL. The problem with these is that they do not 
> add much value, and that they are not well supported (for example in 
> Pyspark). We should remove them, and use AnalysisException instead.
> We should remove all the exceptions defined in NoSuchItemException.scala.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19952) Remove specialized catalog related analysis exceptions

2017-03-14 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19952?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-19952:

Description: We introduce catalog specific analysis exceptions (that 
extends AnalysisException) in Spark SQL. The problem with these are that they 
do not add much value, and that they are not well supported (for example in 
Pyspark). We should remove them, and use AnalysisException instead.  (was: We 
introduce catalog specific analysis exceptions in Spark SQL. The problem with 
these are that they do not add much value, and that they are not well supported 
(for example in Pyspark). We should remove them, and use AnalysisException 
instead.)

> Remove specialized catalog related analysis exceptions
> --
>
> Key: SPARK-19952
> URL: https://issues.apache.org/jira/browse/SPARK-19952
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0, 2.1.0
>Reporter: Herman van Hovell
>Assignee: Herman van Hovell
>
> We introduce catalog-specific analysis exceptions (that extend 
> AnalysisException) in Spark SQL. The problem with these is that they do not 
> add much value, and that they are not well supported (for example in 
> Pyspark). We should remove them, and use AnalysisException instead.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19952) Remove specialized catalog related analysis exceptions

2017-03-14 Thread Herman van Hovell (JIRA)
Herman van Hovell created SPARK-19952:
-

 Summary: Remove specialized catalog related analysis exceptions
 Key: SPARK-19952
 URL: https://issues.apache.org/jira/browse/SPARK-19952
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.1.0, 2.0.0
Reporter: Herman van Hovell
Assignee: Herman van Hovell


We introduce catalog-specific analysis exceptions in Spark SQL. The problem 
with these is that they do not add much value, and that they are not well 
supported (for example in Pyspark). We should remove them, and use 
AnalysisException instead.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19282) RandomForestRegressionModel summary should expose getMaxDepth

2017-03-14 Thread Xin Ren (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15924728#comment-15924728
 ] 

Xin Ren commented on SPARK-19282:
-

thanks Bryan, could you please create some sub tasks under SPARK-10931?

I'd like to help on it if possible

> RandomForestRegressionModel summary should expose getMaxDepth
> -
>
> Key: SPARK-19282
> URL: https://issues.apache.org/jira/browse/SPARK-19282
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark, SparkR
>Affects Versions: 2.1.0
>Reporter: Nick Lothian
>Assignee: Xin Ren
>Priority: Minor
> Fix For: 2.2.0
>
>
> Currently it isn't clear how to get the max depth of a 
> RandomForestRegressionModel (eg, after doing a grid search)
> It is possible to call
> {{regressor._java_obj.getMaxDepth()}} 
> but most other decision trees allow
> {{regressor.getMaxDepth()}} 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19951) Add string concatenate operator || to Spark SQL

2017-03-14 Thread Herman van Hovell (JIRA)
Herman van Hovell created SPARK-19951:
-

 Summary: Add string concatenate operator || to Spark SQL
 Key: SPARK-19951
 URL: https://issues.apache.org/jira/browse/SPARK-19951
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.1.0
Reporter: Herman van Hovell
Assignee: Herman van Hovell
Priority: Minor


It is quite natural to concatenate strings using the {{||}} symbol. For example: 
{{select a || b || c as abc from tbl_x}}. Let's add it to Spark SQL.
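A usage sketch (the table {{tbl_x}} and a SparkSession {{spark}} are assumed); today the same result requires {{concat}}:

{code}
spark.sql("SELECT a || b || c AS abc FROM tbl_x")      // proposed operator form
spark.sql("SELECT concat(a, b, c) AS abc FROM tbl_x")  // existing equivalent
{code}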



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-19933) TPCDS Q70 went wrong while explaining

2017-03-14 Thread Herman van Hovell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19933?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Herman van Hovell resolved SPARK-19933.
---
   Resolution: Fixed
 Assignee: Herman van Hovell
Fix Version/s: 2.2.0

> TPCDS Q70 went wrong while explaining
> -
>
> Key: SPARK-19933
> URL: https://issues.apache.org/jira/browse/SPARK-19933
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Zhenhua Wang
>Assignee: Herman van Hovell
> Fix For: 2.2.0
>
>
> The query ran OK in January, so I think some recent change broke it.
> All tables are in parquet format.
> The latest commit of my test version (master branch on Mar 13) is: 
> https://github.com/apache/spark/commit/9456688547522a62f1e7520e9b3564550c57aa5d
> Error messages are as follows:
> TreeNodeException: Binding attribute, tree: s_state#4
> at 
> org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:56)
> at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:88)
> at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:87)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267)
> at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:266)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:272)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:256)
> at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$.bindReference(BoundAttribute.scala:87)
> at 
> org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$.bind(GeneratePredicate.scala:45)
> at 
> org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$.bind(GeneratePredicate.scala:40)
> at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator.generate(CodeGenerator.scala:884)
> at 
> org.apache.spark.sql.execution.SparkPlan.newPredicate(SparkPlan.scala:358)
> at 
> org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec.org$apache$spark$sql$execution$joins$BroadcastNestedLoopJoinExec$$boundCondition$lzycompute(BroadcastNestedLoopJoinExec.scala:87)
> at 
> org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec.org$apache$spark$sql$execution$joins$BroadcastNestedLoopJoinExec$$boundCondition(BroadcastNestedLoopJoinExec.scala:85)
> at 
> org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec$$anonfun$4$$anonfun$apply$2$$anonfun$apply$3.apply(BroadcastNestedLoopJoinExec.scala:191)
> at 
> org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec$$anonfun$4$$anonfun$apply$2$$anonfun$apply$3.apply(BroadcastNestedLoopJoinExec.scala:191)
> at 
> scala.collection.IndexedSeqOptimized$class.prefixLengthImpl(IndexedSeqOptimized.scala:38)
> at 
> scala.collection.IndexedSeqOptimized$class.exists(IndexedSeqOptimized.scala:46)
> at scala.collection.mutable.ArrayOps$ofRef.exists(ArrayOps.scala:186)
> at 
> org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec$$anonfun$4$$anonfun$apply$2.apply(BroadcastNestedLoopJoinExec.scala:191)
> at 
> org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec$$anonfun$4$$anonfun$apply$2.apply(BroadcastNestedLoopJoinExec.scala:190)
> at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:463)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithKeys$(Unknown
>  Source)
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
> at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
> at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$

[jira] [Resolved] (SPARK-19923) Remove unnecessary type conversion per call in Hive

2017-03-14 Thread Herman van Hovell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19923?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Herman van Hovell resolved SPARK-19923.
---
   Resolution: Fixed
 Assignee: Takeshi Yamamuro
Fix Version/s: 2.2.0

> Remove unnecessary type conversion per call in Hive
> ---
>
> Key: SPARK-19923
> URL: https://issues.apache.org/jira/browse/SPARK-19923
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Takeshi Yamamuro
>Assignee: Takeshi Yamamuro
> Fix For: 2.2.0
>
>
> IIUC I found unnecessary type conversions in HiveGenericUDF: 
> https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/hiveUDFs.scala#L116



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19899) FPGrowth input column naming

2017-03-14 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19899?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15924677#comment-15924677
 ] 

Felix Cheung commented on SPARK-19899:
--

+1 on "itemsCol"
looks like it is defaulting to "items" for association rules
https://github.com/apache/spark/blob/d4a637cd46b6dd5cc71ea17a55c4a26186e592c7/mllib/src/main/scala/org/apache/spark/ml/fpm/FPGrowth.scala#L214

> FPGrowth input column naming
> 
>
> Key: SPARK-19899
> URL: https://issues.apache.org/jira/browse/SPARK-19899
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: Maciej Szymkiewicz
>
> Current implementation extends {{HasFeaturesCol}}. Personally I find it 
> rather unfortunate. Up to this moment we used consistent conventions - if we 
> mix in {{HasFeaturesCol}}, the {{featuresCol}} should be a {{VectorUDT}}. 
> Using the same {{Param}} for an {{array<string>}} (and possibly for 
> {{array<array<string>>}} once {{PrefixSpan}} is ported to {{ml}}) will be 
> confusing for the users.
> I would like to suggest adding a new {{trait}} (let's say 
> {{HasTransactionsCol}}) to clearly indicate that the input type differs from 
> the other {{Estimators}}.
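A minimal sketch of what the suggested trait could look like, following the existing shared-param pattern; the name {{transactionsCol}} and the {{"items"}} default are illustrative, not a settled API:

{code}
import org.apache.spark.ml.param.{Param, Params}

// Dedicated shared param for an array-typed transactions column, instead of reusing
// HasFeaturesCol/featuresCol, which by convention implies a VectorUDT input.
trait HasTransactionsCol extends Params {

  final val transactionsCol: Param[String] = new Param[String](
    this, "transactionsCol", "input column holding the transactions as an array of items")

  setDefault(transactionsCol -> "items")

  final def getTransactionsCol: String = $(transactionsCol)
}
{code}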



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19416) Dataset.schema is inconsistent with Dataset in handling columns with periods

2017-03-14 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19416?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15924646#comment-15924646
 ] 

Reynold Xin commented on SPARK-19416:
-

We probably can't change any of them now, unless we introduce a config flag for 
the more consistent behavior.


> Dataset.schema is inconsistent with Dataset in handling columns with periods
> 
>
> Key: SPARK-19416
> URL: https://issues.apache.org/jira/browse/SPARK-19416
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.3, 2.0.2, 2.1.0, 2.2.0
>Reporter: Joseph K. Bradley
>Priority: Minor
>
> When you have a DataFrame with a column with a period in its name, the API is 
> inconsistent about how to quote the column name.
> Here's a reproduction:
> {code}
> import org.apache.spark.sql.functions.col
> val rows = Seq(
>   ("foo", 1),
>   ("bar", 2)
> )
> val df = spark.createDataFrame(rows).toDF("a.b", "id")
> {code}
> These methods are all consistent:
> {code}
> df.select("a.b") // fails
> df.select("`a.b`") // succeeds
> df.select(col("a.b")) // fails
> df.select(col("`a.b`")) // succeeds
> df("a.b") // fails
> df("`a.b`") // succeeds
> {code}
> But {{schema}} is inconsistent:
> {code}
> df.schema("a.b") // succeeds
> df.schema("`a.b`") // fails
> {code}
> "fails" produces error messages like:
> {code}
> org.apache.spark.sql.AnalysisException: cannot resolve '`a.b`' given input 
> columns: [a.b, id];;
> 'Project ['a.b]
> +- Project [_1#1511 AS a.b#1516, _2#1512 AS id#1517]
>+- LocalRelation [_1#1511, _2#1512]
>   at 
> org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:77)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:74)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:310)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:310)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:309)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionUp$1(QueryPlan.scala:282)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2(QueryPlan.scala:292)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2$1.apply(QueryPlan.scala:296)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:104)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2(QueryPlan.scala:296)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$7.apply(QueryPlan.scala:301)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:188)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionsUp(QueryPlan.scala:301)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:74)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:67)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:128)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:67)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:57)
>   at 
> org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:48)
>   at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:63)
>   at 
> org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$withPlan(Dataset.scala:2822)
>   at org.apache.spark.sql.Dataset.select(Dataset.scala:1121)
>   at org.apache.spark.sql.Dataset.select(Dataset.scala:1139)
>   at 
> line9667c6d14e79417280e5882aa52e0de727.$read$$iw

[jira] [Commented] (SPARK-19282) RandomForestRegressionModel summary should expose getMaxDepth

2017-03-14 Thread Bryan Cutler (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15924644#comment-15924644
 ] 

Bryan Cutler commented on SPARK-19282:
--

This is a common issue with all PySpark ML Models. SPARK-10931 should be 
completed first, then probably a coordinated effort to expose all parameters 
from Models, which most likely means using a common base class between the 
estimator and model that contains the params and get methods.

> RandomForestRegressionModel summary should expose getMaxDepth
> -
>
> Key: SPARK-19282
> URL: https://issues.apache.org/jira/browse/SPARK-19282
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark, SparkR
>Affects Versions: 2.1.0
>Reporter: Nick Lothian
>Assignee: Xin Ren
>Priority: Minor
> Fix For: 2.2.0
>
>
> Currently it isn't clear how to get the max depth of a 
> RandomForestRegressionModel (eg, after doing a grid search)
> It is possible to call
> {{regressor._java_obj.getMaxDepth()}} 
> but most other decision trees allow
> {{regressor.getMaxDepth()}} 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18961) Support `SHOW TABLE EXTENDED ... PARTITION` statement

2017-03-14 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18961?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li reassigned SPARK-18961:
---

Assignee: Jiang Xingbo

> Support `SHOW TABLE EXTENDED ... PARTITION` statement
> -
>
> Key: SPARK-18961
> URL: https://issues.apache.org/jira/browse/SPARK-18961
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Jiang Xingbo
>Assignee: Jiang Xingbo
> Fix For: 2.2.0
>
>
> We should support the statement `SHOW TABLE EXTENDED LIKE 'table_identifier' 
> PARTITION(partition_spec)`, just like Hive does.
> When partition is specified, the `SHOW TABLE EXTENDED` command should output 
> the information of the partitions instead of the tables.
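A usage sketch (the table {{sales}}, its partition column {{dt}}, and a SparkSession {{spark}} are hypothetical):

{code}
spark.sql("SHOW TABLE EXTENDED LIKE 'sales' PARTITION (dt='2017-03-14')")
  .show(truncate = false)
{code}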



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-18961) Support `SHOW TABLE EXTENDED ... PARTITION` statement

2017-03-14 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18961?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li closed SPARK-18961.
---
   Resolution: Fixed
Fix Version/s: 2.2.0

> Support `SHOW TABLE EXTENDED ... PARTITION` statement
> -
>
> Key: SPARK-18961
> URL: https://issues.apache.org/jira/browse/SPARK-18961
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Jiang Xingbo
>Assignee: Jiang Xingbo
> Fix For: 2.2.0
>
>
> We should support the statement `SHOW TABLE EXTENDED LIKE 'table_identifier' 
> PARTITION(partition_spec)`, just like Hive does.
> When partition is specified, the `SHOW TABLE EXTENDED` command should output 
> the information of the partitions instead of the tables.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18966) NOT IN subquery with correlated expressions may return incorrect result

2017-03-14 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15924593#comment-15924593
 ] 

Apache Spark commented on SPARK-18966:
--

User 'nsyca' has created a pull request for this issue:
https://github.com/apache/spark/pull/17294

> NOT IN subquery with correlated expressions may return incorrect result
> ---
>
> Key: SPARK-18966
> URL: https://issues.apache.org/jira/browse/SPARK-18966
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Nattavut Sutyanyong
>  Labels: correctness
>
> {code}
> Seq((1, 2)).toDF("a1", "b1").createOrReplaceTempView("t1")
> Seq[(java.lang.Integer, java.lang.Integer)]((1, null)).toDF("a2", 
> "b2").createOrReplaceTempView("t2")
> // The expected result is 1 row of (1,2) as shown in the next statement.
> sql("select * from t1 where a1 not in (select a2 from t2 where b2 = b1)").show
> +---+---+
> | a1| b1|
> +---+---+
> +---+---+
> sql("select * from t1 where a1 not in (select a2 from t2 where b2 = 2)").show
> +---+---+
> | a1| b1|
> +---+---+
> |  1|  2|
> +---+---+
> {code}
> The two SQL statements above should return the same result.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18966) NOT IN subquery with correlated expressions may return incorrect result

2017-03-14 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18966?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18966:


Assignee: Apache Spark

> NOT IN subquery with correlated expressions may return incorrect result
> ---
>
> Key: SPARK-18966
> URL: https://issues.apache.org/jira/browse/SPARK-18966
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Nattavut Sutyanyong
>Assignee: Apache Spark
>  Labels: correctness
>
> {code}
> Seq((1, 2)).toDF("a1", "b1").createOrReplaceTempView("t1")
> Seq[(java.lang.Integer, java.lang.Integer)]((1, null)).toDF("a2", 
> "b2").createOrReplaceTempView("t2")
> // The expected result is 1 row of (1,2) as shown in the next statement.
> sql("select * from t1 where a1 not in (select a2 from t2 where b2 = b1)").show
> +---+---+
> | a1| b1|
> +---+---+
> +---+---+
> sql("select * from t1 where a1 not in (select a2 from t2 where b2 = 2)").show
> +---+---+
> | a1| b1|
> +---+---+
> |  1|  2|
> +---+---+
> {code}
> The two SQL statements above should return the same result.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18966) NOT IN subquery with correlated expressions may return incorrect result

2017-03-14 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18966?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18966:


Assignee: (was: Apache Spark)

> NOT IN subquery with correlated expressions may return incorrect result
> ---
>
> Key: SPARK-18966
> URL: https://issues.apache.org/jira/browse/SPARK-18966
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Nattavut Sutyanyong
>  Labels: correctness
>
> {code}
> Seq((1, 2)).toDF("a1", "b1").createOrReplaceTempView("t1")
> Seq[(java.lang.Integer, java.lang.Integer)]((1, null)).toDF("a2", 
> "b2").createOrReplaceTempView("t2")
> // The expected result is 1 row of (1,2) as shown in the next statement.
> sql("select * from t1 where a1 not in (select a2 from t2 where b2 = b1)").show
> +---+---+
> | a1| b1|
> +---+---+
> +---+---+
> sql("select * from t1 where a1 not in (select a2 from t2 where b2 = 2)").show
> +---+---+
> | a1| b1|
> +---+---+
> |  1|  2|
> +---+---+
> {code}
> The two SQL statements above should return the same result.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19424) Wrong runtime type in RDD when reading from avro with custom serializer

2017-03-14 Thread Nira Amit (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19424?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15924573#comment-15924573
 ] 

Nira Amit commented on SPARK-19424:
---

[~hvanhovell] The solution would be to not ignore the "Unchecked cast" warning 
when your API assigns the returned value from the avro library (which seems to 
default to GenericData$Record if it can't return the requested type). Your API 
is the one handling "Object"s, so it should be the one throwing the 
ClassCastException error, not my code after the erroneous assignment has 
already been made.

> Wrong runtime type in RDD when reading from avro with custom serializer
> ---
>
> Key: SPARK-19424
> URL: https://issues.apache.org/jira/browse/SPARK-19424
> Project: Spark
>  Issue Type: Bug
>  Components: Java API
>Affects Versions: 2.0.2
> Environment: Ubuntu, spark 2.0.2 prebuilt for hadoop 2.7
>Reporter: Nira Amit
>
> I am trying to read data from avro files into an RDD using Kryo. My code 
> compiles fine, but in runtime I'm getting a ClassCastException. Here is what 
> my code does:
> {code}
> SparkConf conf = new SparkConf()...
> conf.set("spark.serializer", KryoSerializer.class.getCanonicalName());
> conf.set("spark.kryo.registrator", MyKryoRegistrator.class.getName());
> JavaSparkContext sc = new JavaSparkContext(conf);
> {code}
> Where MyKryoRegistrator registers a Serializer for MyCustomClass:
> {code}
> public void registerClasses(Kryo kryo) {
> kryo.register(MyCustomClass.class, new MyCustomClassSerializer());
> }
> {code}
> Then, I read my datafile:
> {code}
> JavaPairRDD<MyCustomClass, NullWritable> records =
> sc.newAPIHadoopFile("file:/path/to/datafile.avro",
> AvroKeyInputFormat.class, MyCustomClass.class, 
> NullWritable.class,
> sc.hadoopConfiguration());
> Tuple2<MyCustomClass, NullWritable> first = records.first();
> {code}
> This seems to work fine, but using a debugger I can see that while the RDD 
> has a kClassTag of my.package.containing.MyCustomClass, the variable first 
> contains a Tuple2<AvroKey, NullWritable>, not Tuple2<MyCustomClass, NullWritable>! And indeed, when the following line executes:
> {code}
> System.out.println("Got a result, custom field is: " + 
> first._1.getSomeCustomField());
> {code}
> I get an exception:
> {code}
> java.lang.ClassCastException: org.apache.avro.mapred.AvroKey cannot be cast 
> to my.package.containing.MyCustomClass
> {code}
> Am I doing something wrong? And even so, shouldn't I get a compilation error 
> rather than a runtime error?



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-19424) Wrong runtime type in RDD when reading from avro with custom serializer

2017-03-14 Thread Herman van Hovell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19424?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Herman van Hovell closed SPARK-19424.
-
Resolution: Not A Problem

> Wrong runtime type in RDD when reading from avro with custom serializer
> ---
>
> Key: SPARK-19424
> URL: https://issues.apache.org/jira/browse/SPARK-19424
> Project: Spark
>  Issue Type: Bug
>  Components: Java API
>Affects Versions: 2.0.2
> Environment: Ubuntu, spark 2.0.2 prebuilt for hadoop 2.7
>Reporter: Nira Amit
>
> I am trying to read data from avro files into an RDD using Kryo. My code 
> compiles fine, but in runtime I'm getting a ClassCastException. Here is what 
> my code does:
> {code}
> SparkConf conf = new SparkConf()...
> conf.set("spark.serializer", KryoSerializer.class.getCanonicalName());
> conf.set("spark.kryo.registrator", MyKryoRegistrator.class.getName());
> JavaSparkContext sc = new JavaSparkContext(conf);
> {code}
> Where MyKryoRegistrator registers a Serializer for MyCustomClass:
> {code}
> public void registerClasses(Kryo kryo) {
> kryo.register(MyCustomClass.class, new MyCustomClassSerializer());
> }
> {code}
> Then, I read my datafile:
> {code}
> JavaPairRDD<MyCustomClass, NullWritable> records =
> sc.newAPIHadoopFile("file:/path/to/datafile.avro",
> AvroKeyInputFormat.class, MyCustomClass.class, 
> NullWritable.class,
> sc.hadoopConfiguration());
> Tuple2<MyCustomClass, NullWritable> first = records.first();
> {code}
> This seems to work fine, but using a debugger I can see that while the RDD 
> has a kClassTag of my.package.containing.MyCustomClass, the variable first 
> contains a Tuple2<AvroKey, NullWritable>, not Tuple2<MyCustomClass, NullWritable>! And indeed, when the following line executes:
> {code}
> System.out.println("Got a result, custom field is: " + 
> first._1.getSomeCustomField());
> {code}
> I get an exception:
> {code}
> java.lang.ClassCastException: org.apache.avro.mapred.AvroKey cannot be cast 
> to my.package.containing.MyCustomClass
> {code}
> Am I doing something wrong? And even so, shouldn't I get a compilation error 
> rather than a runtime error?



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19424) Wrong runtime type in RDD when reading from avro with custom serializer

2017-03-14 Thread Herman van Hovell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19424?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15924564#comment-15924564
 ] 

Herman van Hovell commented on SPARK-19424:
---

[~amitnira] Like Sean said, you are not proposing any solution to the problem 
at hand, or even investigating the problem. I just see you echoing the same 
opinion over and over. Could you please keep this closed until you come up with 
a solution? (Remember, it is also your API.)

> Wrong runtime type in RDD when reading from avro with custom serializer
> ---
>
> Key: SPARK-19424
> URL: https://issues.apache.org/jira/browse/SPARK-19424
> Project: Spark
>  Issue Type: Bug
>  Components: Java API
>Affects Versions: 2.0.2
> Environment: Ubuntu, spark 2.0.2 prebuilt for hadoop 2.7
>Reporter: Nira Amit
>
> I am trying to read data from avro files into an RDD using Kryo. My code 
> compiles fine, but in runtime I'm getting a ClassCastException. Here is what 
> my code does:
> {code}
> SparkConf conf = new SparkConf()...
> conf.set("spark.serializer", KryoSerializer.class.getCanonicalName());
> conf.set("spark.kryo.registrator", MyKryoRegistrator.class.getName());
> JavaSparkContext sc = new JavaSparkContext(conf);
> {code}
> Where MyKryoRegistrator registers a Serializer for MyCustomClass:
> {code}
> public void registerClasses(Kryo kryo) {
> kryo.register(MyCustomClass.class, new MyCustomClassSerializer());
> }
> {code}
> Then, I read my datafile:
> {code}
> JavaPairRDD<MyCustomClass, NullWritable> records =
> sc.newAPIHadoopFile("file:/path/to/datafile.avro",
> AvroKeyInputFormat.class, MyCustomClass.class, 
> NullWritable.class,
> sc.hadoopConfiguration());
> Tuple2<MyCustomClass, NullWritable> first = records.first();
> {code}
> This seems to work fine, but using a debugger I can see that while the RDD 
> has a kClassTag of my.package.containing.MyCustomClass, the variable first 
> contains a Tuple2<AvroKey, NullWritable>, not Tuple2<MyCustomClass, NullWritable>! And indeed, when the following line executes:
> {code}
> System.out.println("Got a result, custom field is: " + 
> first._1.getSomeCustomField());
> {code}
> I get an exception:
> {code}
> java.lang.ClassCastException: org.apache.avro.mapred.AvroKey cannot be cast 
> to my.package.containing.MyCustomClass
> {code}
> Am I doing something wrong? And even so, shouldn't I get a compilation error 
> rather than a runtime error?



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19424) Wrong runtime type in RDD when reading from avro with custom serializer

2017-03-14 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19424?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15924517#comment-15924517
 ] 

Sean Owen commented on SPARK-19424:
---

All: I'm going to contact INFRA about blocking further changes. I can't disable 
users' access to JIRA.

> Wrong runtime type in RDD when reading from avro with custom serializer
> ---
>
> Key: SPARK-19424
> URL: https://issues.apache.org/jira/browse/SPARK-19424
> Project: Spark
>  Issue Type: Bug
>  Components: Java API
>Affects Versions: 2.0.2
> Environment: Ubuntu, spark 2.0.2 prebuilt for hadoop 2.7
>Reporter: Nira Amit
>
> I am trying to read data from avro files into an RDD using Kryo. My code 
> compiles fine, but in runtime I'm getting a ClassCastException. Here is what 
> my code does:
> {code}
> SparkConf conf = new SparkConf()...
> conf.set("spark.serializer", KryoSerializer.class.getCanonicalName());
> conf.set("spark.kryo.registrator", MyKryoRegistrator.class.getName());
> JavaSparkContext sc = new JavaSparkContext(conf);
> {code}
> Where MyKryoRegistrator registers a Serializer for MyCustomClass:
> {code}
> public void registerClasses(Kryo kryo) {
> kryo.register(MyCustomClass.class, new MyCustomClassSerializer());
> }
> {code}
> Then, I read my datafile:
> {code}
> JavaPairRDD<MyCustomClass, NullWritable> records =
> sc.newAPIHadoopFile("file:/path/to/datafile.avro",
> AvroKeyInputFormat.class, MyCustomClass.class, 
> NullWritable.class,
> sc.hadoopConfiguration());
> Tuple2<MyCustomClass, NullWritable> first = records.first();
> {code}
> This seems to work fine, but using a debugger I can see that while the RDD 
> has a kClassTag of my.package.containing.MyCustomClass, the variable first 
> contains a Tuple2<AvroKey, NullWritable>, not Tuple2<MyCustomClass, NullWritable>! And indeed, when the following line executes:
> {code}
> System.out.println("Got a result, custom field is: " + 
> first._1.getSomeCustomField());
> {code}
> I get an exception:
> {code}
> java.lang.ClassCastException: org.apache.avro.mapred.AvroKey cannot be cast 
> to my.package.containing.MyCustomClass
> {code}
> Am I doing something wrong? And even so, shouldn't I get a compilation error 
> rather than a runtime error?



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-19424) Wrong runtime type in RDD when reading from avro with custom serializer

2017-03-14 Thread Nira Amit (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19424?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nira Amit reopened SPARK-19424:
---

I managed to get my code to work, yes. The problem of the wrong type at runtime 
is not resolved, though. Your API ignores an "Unchecked cast" warning when 
assigning the returned value from the avro lib to the datum field of my custom 
AvroKey, I'm assuming. This is a problem as it violates the type-safety 
guarantees expected from Java APIs. My code shouldn't break unexpectedly with a 
ClassCastException that I did not explicitly cause. The API should throw the 
exception instead of performing the erroneous assignment.

> Wrong runtime type in RDD when reading from avro with custom serializer
> ---
>
> Key: SPARK-19424
> URL: https://issues.apache.org/jira/browse/SPARK-19424
> Project: Spark
>  Issue Type: Bug
>  Components: Java API
>Affects Versions: 2.0.2
> Environment: Ubuntu, spark 2.0.2 prebuilt for hadoop 2.7
>Reporter: Nira Amit
>
> I am trying to read data from avro files into an RDD using Kryo. My code 
> compiles fine, but in runtime I'm getting a ClassCastException. Here is what 
> my code does:
> {code}
> SparkConf conf = new SparkConf()...
> conf.set("spark.serializer", KryoSerializer.class.getCanonicalName());
> conf.set("spark.kryo.registrator", MyKryoRegistrator.class.getName());
> JavaSparkContext sc = new JavaSparkContext(conf);
> {code}
> Where MyKryoRegistrator registers a Serializer for MyCustomClass:
> {code}
> public void registerClasses(Kryo kryo) {
> kryo.register(MyCustomClass.class, new MyCustomClassSerializer());
> }
> {code}
> Then, I read my datafile:
> {code}
> JavaPairRDD<MyCustomClass, NullWritable> records =
> sc.newAPIHadoopFile("file:/path/to/datafile.avro",
> AvroKeyInputFormat.class, MyCustomClass.class, 
> NullWritable.class,
> sc.hadoopConfiguration());
> Tuple2<MyCustomClass, NullWritable> first = records.first();
> {code}
> This seems to work fine, but using a debugger I can see that while the RDD 
> has a kClassTag of my.package.containing.MyCustomClass, the variable first 
> contains a Tuple2<AvroKey, NullWritable>, not Tuple2<MyCustomClass, NullWritable>! And indeed, when the following line executes:
> {code}
> System.out.println("Got a result, custom field is: " + 
> first._1.getSomeCustomField());
> {code}
> I get an exception:
> {code}
> java.lang.ClassCastException: org.apache.avro.mapred.AvroKey cannot be cast 
> to my.package.containing.MyCustomClass
> {code}
> Am I doing something wrong? And even so, shouldn't I get a compilation error 
> rather than a runtime error?



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19950) nullable ignored when df.load() is executed for file-based data source

2017-03-14 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19950?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19950:


Assignee: (was: Apache Spark)

> nullable ignored when df.load() is executed for file-based data source
> --
>
> Key: SPARK-19950
> URL: https://issues.apache.org/jira/browse/SPARK-19950
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Kazuaki Ishizaki
>
> This problem is reported in [Databricks 
> forum|https://forums.databricks.com/questions/7123/nullable-seemingly-ignored-when-reading-parquet.html].
> When we execute the following code, a schema for "id" in {{dfRead}} has 
> {{nullable = true}}. It should be {{nullable = false}}.
> {code:java}
> val field = "id"
> val df = spark.range(0, 5, 1, 1).toDF(field)
> val fmt = "parquet"
> val path = "/tmp/parquet"
> val schema = StructType(Seq(StructField(field, LongType, false)))
> df.write.format(fmt).mode("overwrite").save(path)
> val dfRead = spark.read.format(fmt).schema(schema).load(path)
> dfRead.printSchema
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19950) nullable ignored when df.load() is executed for file-based data source

2017-03-14 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19950?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15924509#comment-15924509
 ] 

Apache Spark commented on SPARK-19950:
--

User 'kiszk' has created a pull request for this issue:
https://github.com/apache/spark/pull/17293

> nullable ignored when df.load() is executed for file-based data source
> --
>
> Key: SPARK-19950
> URL: https://issues.apache.org/jira/browse/SPARK-19950
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Kazuaki Ishizaki
>
> This problem is reported in [Databricks 
> forum|https://forums.databricks.com/questions/7123/nullable-seemingly-ignored-when-reading-parquet.html].
> When we execute the following code, a schema for "id" in {{dfRead}} has 
> {{nullable = true}}. It should be {{nullable = false}}.
> {code:java}
> val field = "id"
> val df = spark.range(0, 5, 1, 1).toDF(field)
> val fmt = "parquet"
> val path = "/tmp/parquet"
> val schema = StructType(Seq(StructField(field, LongType, false)))
> df.write.format(fmt).mode("overwrite").save(path)
> val dfRead = spark.read.format(fmt).schema(schema).load(path)
> dfRead.printSchema
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19950) nullable ignored when df.load() is executed for file-based data source

2017-03-14 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19950?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19950:


Assignee: Apache Spark

> nullable ignored when df.load() is executed for file-based data source
> --
>
> Key: SPARK-19950
> URL: https://issues.apache.org/jira/browse/SPARK-19950
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Kazuaki Ishizaki
>Assignee: Apache Spark
>
> This problem is reported in [Databricks 
> forum|https://forums.databricks.com/questions/7123/nullable-seemingly-ignored-when-reading-parquet.html].
> When we execute the following code, a schema for "id" in {{dfRead}} has 
> {{nullable = true}}. It should be {{nullable = false}}.
> {code:java}
> val field = "id"
> val df = spark.range(0, 5, 1, 1).toDF(field)
> val fmt = "parquet"
> val path = "/tmp/parquet"
> val schema = StructType(Seq(StructField(field, LongType, false)))
> df.write.format(fmt).mode("overwrite").save(path)
> val dfRead = spark.read.format(fmt).schema(schema).load(path)
> dfRead.printSchema
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19950) nullable ignored when df.load() is executed for file-based data source

2017-03-14 Thread Kazuaki Ishizaki (JIRA)
Kazuaki Ishizaki created SPARK-19950:


 Summary: nullable ignored when df.load() is executed for 
file-based data source
 Key: SPARK-19950
 URL: https://issues.apache.org/jira/browse/SPARK-19950
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.2.0
Reporter: Kazuaki Ishizaki


This problem is reported in [Databricks 
forum|https://forums.databricks.com/questions/7123/nullable-seemingly-ignored-when-reading-parquet.html].

When we execute the following code, a schema for "id" in {{dfRead}} has 
{{nullable = true}}. It should be {{nullable = false}}.

{code:java}
val field = "id"
val df = spark.range(0, 5, 1, 1).toDF(field)
val fmt = "parquet"
val path = "/tmp/parquet"
val schema = StructType(Seq(StructField(field, LongType, false)))
df.write.format(fmt).mode("overwrite").save(path)
val dfRead = spark.read.format(fmt).schema(schema).load(path)
dfRead.printSchema
{code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-19424) Wrong runtime type in RDD when reading from avro with custom serializer

2017-03-14 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19424?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-19424.
---
Resolution: Not A Problem

You resolved the problem above though, yourself. You hadn't passed the correct 
arguments to this function. I don't see anyone suggesting you suppress a 
ClassCastException, and correct usage does not produce one. I am actually not 
clear what you are demanding: even if we could break the API to change it, what 
would it look like? As I say, you can at best blame the Hadoop API for the 
structure this is mirroring, but I gave a more nuanced explanation in the 
email. It involves Scala, which I assume is new to you. That is worth 
re-reading.

Yes, you have some business need, but this doesn't entitle you to a specific 
outcome or anyone's time. You've received more than a fair amount of help here. 
On the contrary, it's not OK to continually reopen this issue with no change, 
or proposal to make. It's a quick way to ensure you get no help in the future.

This part of the discussion is clearly done, and I will close it one more time. 
Please leave it closed. Your next step is to back up and decide what code 
change you would propose at this point, and reply to your earlier thread on the 
mailing list. If there's any support for it then we can make a new JIRA. 

> Wrong runtime type in RDD when reading from avro with custom serializer
> ---
>
> Key: SPARK-19424
> URL: https://issues.apache.org/jira/browse/SPARK-19424
> Project: Spark
>  Issue Type: Bug
>  Components: Java API
>Affects Versions: 2.0.2
> Environment: Ubuntu, spark 2.0.2 prebuilt for hadoop 2.7
>Reporter: Nira Amit
>
> I am trying to read data from avro files into an RDD using Kryo. My code 
> compiles fine, but in runtime I'm getting a ClassCastException. Here is what 
> my code does:
> {code}
> SparkConf conf = new SparkConf()...
> conf.set("spark.serializer", KryoSerializer.class.getCanonicalName());
> conf.set("spark.kryo.registrator", MyKryoRegistrator.class.getName());
> JavaSparkContext sc = new JavaSparkContext(conf);
> {code}
> Where MyKryoRegistrator registers a Serializer for MyCustomClass:
> {code}
> public void registerClasses(Kryo kryo) {
> kryo.register(MyCustomClass.class, new MyCustomClassSerializer());
> }
> {code}
> Then, I read my datafile:
> {code}
> JavaPairRDD<MyCustomClass, NullWritable> records =
> sc.newAPIHadoopFile("file:/path/to/datafile.avro",
> AvroKeyInputFormat.class, MyCustomClass.class, 
> NullWritable.class,
> sc.hadoopConfiguration());
> Tuple2<MyCustomClass, NullWritable> first = records.first();
> {code}
> This seems to work fine, but using a debugger I can see that while the RDD 
> has a kClassTag of my.package.containing.MyCustomClass, the variable first 
> contains a Tuple2<AvroKey, NullWritable>, not Tuple2<MyCustomClass, NullWritable>! And indeed, when the following line executes:
> {code}
> System.out.println("Got a result, custom field is: " + 
> first._1.getSomeCustomField());
> {code}
> I get an exception:
> {code}
> java.lang.ClassCastException: org.apache.avro.mapred.AvroKey cannot be cast 
> to my.package.containing.MyCustomClass
> {code}
> Am I doing something wrong? And even so, shouldn't I get a compilation error 
> rather than a runtime error?



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-19424) Wrong runtime type in RDD when reading from avro with custom serializer

2017-03-14 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19424?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen closed SPARK-19424.
-

> Wrong runtime type in RDD when reading from avro with custom serializer
> ---
>
> Key: SPARK-19424
> URL: https://issues.apache.org/jira/browse/SPARK-19424
> Project: Spark
>  Issue Type: Bug
>  Components: Java API
>Affects Versions: 2.0.2
> Environment: Ubuntu, spark 2.0.2 prebuilt for hadoop 2.7
>Reporter: Nira Amit
>
> I am trying to read data from avro files into an RDD using Kryo. My code 
> compiles fine, but in runtime I'm getting a ClassCastException. Here is what 
> my code does:
> {code}
> SparkConf conf = new SparkConf()...
> conf.set("spark.serializer", KryoSerializer.class.getCanonicalName());
> conf.set("spark.kryo.registrator", MyKryoRegistrator.class.getName());
> JavaSparkContext sc = new JavaSparkContext(conf);
> {code}
> Where MyKryoRegistrator registers a Serializer for MyCustomClass:
> {code}
> public void registerClasses(Kryo kryo) {
> kryo.register(MyCustomClass.class, new MyCustomClassSerializer());
> }
> {code}
> Then, I read my datafile:
> {code}
> JavaPairRDD<MyCustomClass, NullWritable> records =
> sc.newAPIHadoopFile("file:/path/to/datafile.avro",
> AvroKeyInputFormat.class, MyCustomClass.class, 
> NullWritable.class,
> sc.hadoopConfiguration());
> Tuple2<MyCustomClass, NullWritable> first = records.first();
> {code}
> This seems to work fine, but using a debugger I can see that while the RDD 
> has a kClassTag of my.package.containing.MyCustomClass, the variable first 
> contains a Tuple2<AvroKey, NullWritable>, not Tuple2<MyCustomClass, NullWritable>! And indeed, when the following line executes:
> {code}
> System.out.println("Got a result, custom field is: " + 
> first._1.getSomeCustomField());
> {code}
> I get an exception:
> {code}
> java.lang.ClassCastException: org.apache.avro.mapred.AvroKey cannot be cast 
> to my.package.containing.MyCustomClass
> {code}
> Am I doing something wrong? And even so, shouldn't I get a compilation error 
> rather than a runtime error?



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-19424) Wrong runtime type in RDD when reading from avro with custom serializer

2017-03-14 Thread Nira Amit (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19424?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nira Amit reopened SPARK-19424:
---

Sean, I am not trying to piss you off but to investigate a problem. If your 
explanation is that it is fine to suppress a ClassCastException in a Java API 
then please provide a reference for this claim, because I provided two that say 
the opposite (including one citing the Java Generics lead designer).
I am putting significant time and effort into this, and it is very 
disrespectful to keep trying to blow me off. I have been working in this 
industry since the late 90's, I am a Java Tech Lead and System Architect, and I 
am telling you that this API does not behave like any I've worked with before.

> Wrong runtime type in RDD when reading from avro with custom serializer
> ---
>
> Key: SPARK-19424
> URL: https://issues.apache.org/jira/browse/SPARK-19424
> Project: Spark
>  Issue Type: Bug
>  Components: Java API
>Affects Versions: 2.0.2
> Environment: Ubuntu, spark 2.0.2 prebuilt for hadoop 2.7
>Reporter: Nira Amit
>
> I am trying to read data from avro files into an RDD using Kryo. My code 
> compiles fine, but in runtime I'm getting a ClassCastException. Here is what 
> my code does:
> {code}
> SparkConf conf = new SparkConf()...
> conf.set("spark.serializer", KryoSerializer.class.getCanonicalName());
> conf.set("spark.kryo.registrator", MyKryoRegistrator.class.getName());
> JavaSparkContext sc = new JavaSparkContext(conf);
> {code}
> Where MyKryoRegistrator registers a Serializer for MyCustomClass:
> {code}
> public void registerClasses(Kryo kryo) {
> kryo.register(MyCustomClass.class, new MyCustomClassSerializer());
> }
> {code}
> Then, I read my datafile:
> {code}
> JavaPairRDD<MyCustomClass, NullWritable> records =
> sc.newAPIHadoopFile("file:/path/to/datafile.avro",
> AvroKeyInputFormat.class, MyCustomClass.class, 
> NullWritable.class,
> sc.hadoopConfiguration());
> Tuple2<MyCustomClass, NullWritable> first = records.first();
> {code}
> This seems to work fine, but using a debugger I can see that while the RDD 
> has a kClassTag of my.package.containing.MyCustomClass, the variable first 
> contains a Tuple2<AvroKey, NullWritable>, not Tuple2<MyCustomClass, NullWritable>! And indeed, when the following line executes:
> {code}
> System.out.println("Got a result, custom field is: " + 
> first._1.getSomeCustomField());
> {code}
> I get an exception:
> {code}
> java.lang.ClassCastException: org.apache.avro.mapred.AvroKey cannot be cast 
> to my.package.containing.MyCustomClass
> {code}
> Am I doing something wrong? And even so, shouldn't I get a compilation error 
> rather than a runtime error?



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19899) FPGrowth input column naming

2017-03-14 Thread Maciej Szymkiewicz (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19899?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15924435#comment-15924435
 ] 

Maciej Szymkiewicz commented on SPARK-19899:


"itemsCol" sounds good. What should we use as a default value? Just "items"?

> FPGrowth input column naming
> 
>
> Key: SPARK-19899
> URL: https://issues.apache.org/jira/browse/SPARK-19899
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: Maciej Szymkiewicz
>
> Current implementation extends {{HasFeaturesCol}}. Personally I find it 
> rather unfortunate. Up to this moment we used consistent conventions - if we 
> mix in {{HasFeaturesCol}}, the {{featuresCol}} should be a {{VectorUDT}}. 
> Using the same {{Param}} for an {{array<string>}} (and possibly for 
> {{array<array<string>>}} once {{PrefixSpan}} is ported to {{ml}}) will be 
> confusing for the users.
> I would like to suggest adding a new {{trait}} (let's say 
> {{HasTransactionsCol}}) to clearly indicate that the input type differs from 
> the other {{Estimators}}.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11798) Datanucleus jars is missing under lib_managed/jars

2017-03-14 Thread Luca Menichetti (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15924420#comment-15924420
 ] 

Luca Menichetti commented on SPARK-11798:
-

I have exactly the same issue: I had to include all the datanucleus-* jars manually 
with the --jars option when submitting a job in cluster mode. With --master 
yarn-client everything works, so apparently these jars are not shipped 
automatically in cluster mode.

Spark is compiled with the Hive profile; the version is 1.6.2.

The logs I see are the following:
WARN metastore.HiveMetaStore: Retrying creating default database after error: 
Class org.datanucleus.api.jdo.JDOPersistenceManagerFactory was not found.
javax.jdo.JDOFatalUserException: Class 
org.datanucleus.api.jdo.JDOPersistenceManagerFactory was not found.
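
For reference, the workaround roughly looks like the following; the jar locations, versions, and application names are placeholders rather than the exact ones from this report:
{code}
# Ship the DataNucleus jars explicitly when running in cluster mode.
# Paths and versions are placeholders; use the jars from your Spark distribution's lib/ directory.
# hive-site.xml often also needs to be shipped in cluster mode.
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --jars /opt/spark/lib/datanucleus-api-jdo-3.2.6.jar,/opt/spark/lib/datanucleus-core-3.2.10.jar,/opt/spark/lib/datanucleus-rdbms-3.2.9.jar \
  --files /etc/hive/conf/hive-site.xml \
  --class com.example.MyHiveJob \
  my-hive-job.jar
{code}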

> Datanucleus jars is missing under lib_managed/jars
> --
>
> Key: SPARK-11798
> URL: https://issues.apache.org/jira/browse/SPARK-11798
> Project: Spark
>  Issue Type: Bug
>  Components: Build, SQL
>Reporter: Jeff Zhang
>
> I noticed the comments in https://github.com/apache/spark/pull/9575 said that 
> Datanucleus-related jars will still be copied to lib_managed/jars, but I 
> don't see any jars under lib_managed/jars. The weird thing is that I see the 
> jars on another machine but not on my laptop, even after I delete the whole 
> Spark project and start from scratch. Is it related to the environment? 
> I tried adding the following code to SparkBuild.scala to track down the 
> issue, and it shows that the jars list is empty. 
> {code}
> deployDatanucleusJars := {
>   val jars: Seq[File] = (fullClasspath in assembly).value.map(_.data)
>     .filter(_.getPath.contains("org.datanucleus"))
>   // this is what I added
>   println("*")
>   println("fullClasspath:"+fullClasspath)
>   println("assembly:"+assembly)
>   println("jars:"+jars.map(_.getAbsolutePath()).mkString(","))
>   //
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-19424) Wrong runtime type in RDD when reading from avro with custom serializer

2017-03-14 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19424?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen closed SPARK-19424.
-

> Wrong runtime type in RDD when reading from avro with custom serializer
> ---
>
> Key: SPARK-19424
> URL: https://issues.apache.org/jira/browse/SPARK-19424
> Project: Spark
>  Issue Type: Bug
>  Components: Java API
>Affects Versions: 2.0.2
> Environment: Ubuntu, spark 2.0.2 prebuilt for hadoop 2.7
>Reporter: Nira Amit
>
> I am trying to read data from avro files into an RDD using Kryo. My code 
> compiles fine, but at runtime I'm getting a ClassCastException. Here is what 
> my code does:
> {code}
> SparkConf conf = new SparkConf()...
> conf.set("spark.serializer", KryoSerializer.class.getCanonicalName());
> conf.set("spark.kryo.registrator", MyKryoRegistrator.class.getName());
> JavaSparkContext sc = new JavaSparkContext(conf);
> {code}
> Where MyKryoRegistrator registers a Serializer for MyCustomClass:
> {code}
> public void registerClasses(Kryo kryo) {
> kryo.register(MyCustomClass.class, new MyCustomClassSerializer());
> }
> {code}
> Then, I read my datafile:
> {code}
> JavaPairRDD<MyCustomClass, NullWritable> records =
> sc.newAPIHadoopFile("file:/path/to/datafile.avro",
> AvroKeyInputFormat.class, MyCustomClass.class, 
> NullWritable.class,
> sc.hadoopConfiguration());
> Tuple2<MyCustomClass, NullWritable> first = records.first();
> {code}
> This seems to work fine, but using a debugger I can see that while the RDD 
> has a kClassTag of my.package.containing.MyCustomClass, the variable first 
> contains a Tuple2<AvroKey, NullWritable>, not Tuple2<MyCustomClass, 
> NullWritable>! And indeed, when the following line executes:
> {code}
> System.out.println("Got a result, custom field is: " + 
> first._1.getSomeCustomField());
> {code}
> I get an exception:
> {code}
> java.lang.ClassCastException: org.apache.avro.mapred.AvroKey cannot be cast 
> to my.package.containing.MyCustomClass
> {code}
> Am I doing something wrong? And even so, shouldn't I get a compilation error 
> rather than a runtime error?



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-19424) Wrong runtime type in RDD when reading from avro with custom serializer

2017-03-14 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19424?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-19424.
---
Resolution: Not A Problem

This was discussed even further at 
https://www.mail-archive.com/user@spark.apache.org/msg62268.html

For those reasons, again, no, this is not a bug. 
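
For illustration, a minimal sketch of the conventional pattern discussed there, assuming the file holds generic records and reusing the {{sc}} from the snippet above; {{MyCustomClass}} and its {{fromGenericRecord}} factory are hypothetical stand-ins, so treat this as a sketch rather than code from this ticket:
{code}
// Imports needed for the sketch:
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.mapred.AvroKey;
import org.apache.avro.mapreduce.AvroKeyInputFormat;
import org.apache.hadoop.io.NullWritable;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;

// Declare the pair RDD with the types AvroKeyInputFormat actually emits,
// then convert each Avro record into the custom class explicitly.
JavaPairRDD<AvroKey<GenericRecord>, NullWritable> raw =
    sc.newAPIHadoopFile("file:/path/to/datafile.avro",
        AvroKeyInputFormat.class, AvroKey.class, NullWritable.class,
        sc.hadoopConfiguration());   // compiles with unchecked warnings

// Copy the data out of the (reused) AvroKey wrapper while converting.
// MyCustomClass.fromGenericRecord(...) is a hypothetical factory method.
JavaRDD<MyCustomClass> records =
    raw.map(pair -> MyCustomClass.fromGenericRecord(pair._1().datum()));
{code}
The point of the pattern is that the key class passed to newAPIHadoopFile is only a compile-time hint; the objects the input format actually emits are AvroKey instances, so the conversion to the custom class has to happen explicitly in a map.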

> Wrong runtime type in RDD when reading from avro with custom serializer
> ---
>
> Key: SPARK-19424
> URL: https://issues.apache.org/jira/browse/SPARK-19424
> Project: Spark
>  Issue Type: Bug
>  Components: Java API
>Affects Versions: 2.0.2
> Environment: Ubuntu, spark 2.0.2 prebuilt for hadoop 2.7
>Reporter: Nira Amit
>
> I am trying to read data from avro files into an RDD using Kryo. My code 
> compiles fine, but at runtime I'm getting a ClassCastException. Here is what 
> my code does:
> {code}
> SparkConf conf = new SparkConf()...
> conf.set("spark.serializer", KryoSerializer.class.getCanonicalName());
> conf.set("spark.kryo.registrator", MyKryoRegistrator.class.getName());
> JavaSparkContext sc = new JavaSparkContext(conf);
> {code}
> Where MyKryoRegistrator registers a Serializer for MyCustomClass:
> {code}
> public void registerClasses(Kryo kryo) {
> kryo.register(MyCustomClass.class, new MyCustomClassSerializer());
> }
> {code}
> Then, I read my datafile:
> {code}
> JavaPairRDD<MyCustomClass, NullWritable> records =
> sc.newAPIHadoopFile("file:/path/to/datafile.avro",
> AvroKeyInputFormat.class, MyCustomClass.class, 
> NullWritable.class,
> sc.hadoopConfiguration());
> Tuple2<MyCustomClass, NullWritable> first = records.first();
> {code}
> This seems to work fine, but using a debugger I can see that while the RDD 
> has a kClassTag of my.package.containing.MyCustomClass, the variable first 
> contains a Tuple2<AvroKey, NullWritable>, not Tuple2<MyCustomClass, 
> NullWritable>! And indeed, when the following line executes:
> {code}
> System.out.println("Got a result, custom field is: " + 
> first._1.getSomeCustomField());
> {code}
> I get an exception:
> {code}
> java.lang.ClassCastException: org.apache.avro.mapred.AvroKey cannot be cast 
> to my.package.containing.MyCustomClass
> {code}
> Am I doing something wrong? And even so, shouldn't I get a compilation error 
> rather than a runtime error?



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-19424) Wrong runtime type in RDD when reading from avro with custom serializer

2017-03-14 Thread Nira Amit (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19424?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nira Amit reopened SPARK-19424:
---

Re-opening this issue because throwing unexpected ClassCastExceptions is not 
accepted behavior for a Java API: 
"This sort of unexpected ClassCastException is considered a violation of the 
type-safety principle" (source: 
http://www.angelikalanger.com/GenericsFAQ/FAQSections/ParameterizedTypes.html#FAQ006)
"[The type-safety] principle is very important – we don’t want the implicit 
casts added when compiling generic code to raise runtime exceptions, since they 
would be hard to understand and fix". (source: 
https://eyalsch.wordpress.com/tag/type-safety/).

I created a GitHub repository with a complete test-app that reproduces this 
behavior: https://github.com/homosepian/spark-avro-kryo

> Wrong runtime type in RDD when reading from avro with custom serializer
> ---
>
> Key: SPARK-19424
> URL: https://issues.apache.org/jira/browse/SPARK-19424
> Project: Spark
>  Issue Type: Bug
>  Components: Java API
>Affects Versions: 2.0.2
> Environment: Ubuntu, spark 2.0.2 prebuilt for hadoop 2.7
>Reporter: Nira Amit
>
> I am trying to read data from avro files into an RDD using Kryo. My code 
> compiles fine, but at runtime I'm getting a ClassCastException. Here is what 
> my code does:
> {code}
> SparkConf conf = new SparkConf()...
> conf.set("spark.serializer", KryoSerializer.class.getCanonicalName());
> conf.set("spark.kryo.registrator", MyKryoRegistrator.class.getName());
> JavaSparkContext sc = new JavaSparkContext(conf);
> {code}
> Where MyKryoRegistrator registers a Serializer for MyCustomClass:
> {code}
> public void registerClasses(Kryo kryo) {
> kryo.register(MyCustomClass.class, new MyCustomClassSerializer());
> }
> {code}
> Then, I read my datafile:
> {code}
> JavaPairRDD<MyCustomClass, NullWritable> records =
> sc.newAPIHadoopFile("file:/path/to/datafile.avro",
> AvroKeyInputFormat.class, MyCustomClass.class, 
> NullWritable.class,
> sc.hadoopConfiguration());
> Tuple2<MyCustomClass, NullWritable> first = records.first();
> {code}
> This seems to work fine, but using a debugger I can see that while the RDD 
> has a kClassTag of my.package.containing.MyCustomClass, the variable first 
> contains a Tuple2<AvroKey, NullWritable>, not Tuple2<MyCustomClass, 
> NullWritable>! And indeed, when the following line executes:
> {code}
> System.out.println("Got a result, custom field is: " + 
> first._1.getSomeCustomField());
> {code}
> I get an exception:
> {code}
> java.lang.ClassCastException: org.apache.avro.mapred.AvroKey cannot be cast 
> to my.package.containing.MyCustomClass
> {code}
> Am I doing something wrong? And even so, shouldn't I get a compilation error 
> rather than a runtime error?



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19282) RandomForestRegressionModel summary should expose getMaxDepth

2017-03-14 Thread Xin Ren (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15924374#comment-15924374
 ] 

Xin Ren commented on SPARK-19282:
-

Sure, I'm working on the Python part :)

> RandomForestRegressionModel summary should expose getMaxDepth
> -
>
> Key: SPARK-19282
> URL: https://issues.apache.org/jira/browse/SPARK-19282
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark, SparkR
>Affects Versions: 2.1.0
>Reporter: Nick Lothian
>Assignee: Xin Ren
>Priority: Minor
> Fix For: 2.2.0
>
>
> Currently it isn't clear how to get the max depth of a 
> RandomForestRegressionModel (e.g., after doing a grid search).
> It is possible to call
> {{regressor._java_obj.getMaxDepth()}} 
> but most other decision trees allow
> {{regressor.getMaxDepth()}} 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19899) FPGrowth input column naming

2017-03-14 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19899?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-19899:
--
Target Version/s: 2.2.0

> FPGrowth input column naming
> 
>
> Key: SPARK-19899
> URL: https://issues.apache.org/jira/browse/SPARK-19899
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: Maciej Szymkiewicz
>
> Current implementation extends {{HasFeaturesCol}}. Personally I find it 
> rather unfortunate. Up to this point we have used consistent conventions: if we 
> mix in {{HasFeaturesCol}}, the {{featuresCol}} should be a {{VectorUDT}}. 
> Using the same {{Param}} for an {{array<_>}} (and possibly for 
> {{array<array<_>>}} once {{PrefixSpan}} is ported to {{ml}}) will be 
> confusing for the users.
> I would like to suggest adding a new {{trait}} (let's say 
> {{HasTransactionsCol}}) to clearly indicate that the input type differs from 
> the other {{Estimators}}.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


