[jira] [Commented] (SPARK-20336) spark.read.csv() with wholeFile=True option fails to read non ASCII unicode characters

2017-04-16 Thread HanCheol Cho (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15970259#comment-15970259
 ] 

HanCheol Cho commented on SPARK-20336:
--

Hi, [~hyukjin.kwon] 

I found that this case only happens when I run it in YARN mode, not local mode, 
and the cluster used here was running different Python versions (Anaconda Python 
2.7.11 on the client and the system's Python 2.7.5 on the worker nodes).
Other system configurations, such as the locale (en_US.UTF-8), were the same.

However, I am not yet sure whether this is the root cause.
I will test it once again after updating the cluster's Python, but it will take 
some time since other team members also use the cluster.
I think I can report additional results during next week. Would that be okay?
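
For reference, a rough sketch (not from the original report; it assumes a running 
SparkSession) of one way to compare the driver's Python with the Python that 
actually runs inside the YARN executors:

{code}
# Hypothetical check: compare the interpreter and default encoding seen by the
# driver with those seen inside executor tasks (on YARN this reflects the
# worker nodes' Python, see PYSPARK_PYTHON / spark.pyspark.python).
import sys

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

print("driver :", sys.version.split()[0], sys.getdefaultencoding())

worker_info = (
    sc.parallelize(range(sc.defaultParallelism), sc.defaultParallelism)
      .map(lambda _: (sys.version.split()[0], sys.getdefaultencoding()))
      .distinct()
      .collect()
)
print("workers:", worker_info)
{code}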



> spark.read.csv() with wholeFile=True option fails to read non ASCII unicode 
> characters
> --
>
> Key: SPARK-20336
> URL: https://issues.apache.org/jira/browse/SPARK-20336
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
> Environment: Spark 2.2.0 (master branch is downloaded from Github)
> PySpark
>Reporter: HanCheol Cho
>
> I used the spark.read.csv() method with the wholeFile=True option to load data 
> that has multi-line records.
> However, non-ASCII characters are not loaded properly.
> The following is sample data for the test:
> {code:none}
> col1,col2,col3
> 1,a,text
> 2,b,テキスト
> 3,c,텍스트
> 4,d,"text
> テキスト
> 텍스트"
> 5,e,last
> {code}
> When it is loaded without the wholeFile=True option, non-ASCII characters are 
> shown correctly, although multi-line records are parsed incorrectly, as follows:
> {code:none}
> testdf_default = spark.read.csv("test.encoding.csv", header=True)
> testdf_default.show()
> +----+----+----+
> |col1|col2|col3|
> +----+----+----+
> |   1|   a|text|
> |   2|   b|テキスト|
> |   3|   c| 텍스트|
> |   4|   d|text|
> |テキスト|null|null|
> | 텍스트"|null|null|
> |   5|   e|last|
> +----+----+----+
> {code}
> When the wholeFile=True option is used, non-ASCII characters are broken, as 
> follows:
> {code:none}
> testdf_wholefile = spark.read.csv("test.encoding.csv", header=True, 
> wholeFile=True)
> testdf_wholefile.show()
> +----+----+----+
> |col1|col2|col3|
> +----+----+----+
> |   1|   a|text|
> |   2|   b||
> |   3|   c|   �|
> |   4|   d|text
> ...|
> |   5|   e|last|
> +----+----+----+
> {code}
> The result is the same even if I use the encoding="UTF-8" option with 
> wholeFile=True.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-9278) DataFrameWriter.insertInto inserts incorrect data

2017-04-16 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9278?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-9278:

Comment: was deleted

(was: The result might well be different, as I ran the code below with the 
master branch of Spark, in a local environment without S3, with the Scala API, 
and on Mac OS. Still, I will leave a comment about what I tested in case you 
want to test without those environments.

Here is the code I ran:

{code}
  // Create data.
  val alphabets = Seq("a", "e", "i", "o", "u")
  val partA = (0 to 4).map(i => Seq(alphabets(i % 5), "a", i))
  val partB = (5 to 9).map(i => Seq(alphabets(i % 5), "b", i))
  val partC = (10 to 14).map(i => Seq(alphabets(i % 5), "c", i))
  val data = partA ++ partB ++ partC

  // Create RDD.
  val rowsRDD = sc.parallelize(data.map(Row.fromSeq))

  // Create Dataframe.
  val schema = StructType(List(
StructField("k", StringType, true),
StructField("pk", StringType, true),
StructField("v", IntegerType, true))
  )
  val sdf = sqlContext.createDataFrame(rowsRDD, schema)

  // Create an empty table.
  sdf.filter("FALSE")
.write
.format("parquet")
.option("path", "foo")
.partitionBy("pk")
.saveAsTable("foo")

  // Save a partitioned table.
  sdf.filter("pk = 'a'")
.write
.partitionBy("pk")
.insertInto("foo")

  // Select all.
  val foo = sqlContext.table("foo")
  foo.show()
{code} 

And the result was correct, as below:

{code}
+---+---+---+
|  k|  v| pk|
+---+---+---+
|  a|  0|  a|
|  e|  1|  a|
|  i|  2|  a|
|  o|  3|  a|
|  u|  4|  a|
+---+---+---+
{code})

> DataFrameWriter.insertInto inserts incorrect data
> -
>
> Key: SPARK-9278
> URL: https://issues.apache.org/jira/browse/SPARK-9278
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.0
> Environment: Linux, S3, Hive Metastore
>Reporter: Steve Lindemann
>Assignee: Cheng Lian
>Priority: Critical
>
> After creating a partitioned Hive table (stored as Parquet) via the 
> DataFrameWriter.createTable command, subsequent attempts to insert additional 
> data into new partitions of this table result in inserting incorrect data 
> rows. Reordering the columns in the data to be written seems to avoid this 
> issue.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-9278) DataFrameWriter.insertInto inserts incorrect data

2017-04-16 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9278?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-9278.
-
Resolution: Not A Problem

I tried to reproduce the code above.

{code}
import pandas

pdf = pandas.DataFrame({'pk': ['a']*5+['b']*5+['c']*5, 'k': ['a', 'e', 'i', 
'o', 'u']*3, 'v': range(15)})
sdf = spark.createDataFrame(pdf)
sdf.filter('FALSE').write.partitionBy('pk').saveAsTable('foo', 
format='parquet', path='/tmp/tmptable')
sdf.filter(sdf.pk == 'a').write.partitionBy('pk').insertInto('foo')
foo = spark.table('foo')
foo.show()
{code}

It seems it now produces an exception, as below:

{code}
Traceback (most recent call last):
  File "", line 1, in 
  File ".../spark/python/pyspark/sql/readwriter.py", line 606, in insertInto
self._jwrite.mode("overwrite" if overwrite else 
"append").insertInto(tableName)
  File ".../spark/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 
1133, in __call__
  File ".../spark/python/pyspark/sql/utils.py", line 69, in deco
raise AnalysisException(s.split(': ', 1)[1], stackTrace)
pyspark.sql.utils.AnalysisException: u"insertInto() can't be used together with 
partitionBy(). Partition columns have already be defined for the table. It is 
not necessary to use partitionBy().;"
{code}
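
For reference, a minimal sketch (reusing {{sdf}} and the {{foo}} table from the 
snippet above) of the insert the message asks for, i.e. without repeating 
partitionBy():

{code}
# Hypothetical follow-up: the table's partitioning was already declared by
# saveAsTable(), so the insert itself should not call partitionBy() again.
sdf.filter(sdf.pk == 'a').write.insertInto('foo')
spark.table('foo').show()
{code}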

I am resolving this per ...

{quote}
If the issue seems clearly obsolete and applies to issues or components that 
have changed radically since it was opened, resolve as Not a Problem
{quote}

Please reopen this if I was mistaken.

> DataFrameWriter.insertInto inserts incorrect data
> -
>
> Key: SPARK-9278
> URL: https://issues.apache.org/jira/browse/SPARK-9278
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.0
> Environment: Linux, S3, Hive Metastore
>Reporter: Steve Lindemann
>Assignee: Cheng Lian
>Priority: Critical
>
> After creating a partitioned Hive table (stored as Parquet) via the 
> DataFrameWriter.createTable command, subsequent attempts to insert additional 
> data into new partitions of this table result in inserting incorrect data 
> rows. Reordering the columns in the data to be written seems to avoid this 
> issue.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20336) spark.read.csv() with wholeFile=True option fails to read non ASCII unicode characters

2017-04-16 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15970262#comment-15970262
 ] 

Hyukjin Kwon commented on SPARK-20336:
--

Sure, definitely. No need to be in a hurry. I just wanted to know more details 
and to be sure this is in progress. Thank you for your input.

> spark.read.csv() with wholeFile=True option fails to read non ASCII unicode 
> characters
> --
>
> Key: SPARK-20336
> URL: https://issues.apache.org/jira/browse/SPARK-20336
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
> Environment: Spark 2.2.0 (master branch is downloaded from Github)
> PySpark
>Reporter: HanCheol Cho
>
> I used the spark.read.csv() method with the wholeFile=True option to load data 
> that has multi-line records.
> However, non-ASCII characters are not loaded properly.
> The following is sample data for the test:
> {code:none}
> col1,col2,col3
> 1,a,text
> 2,b,テキスト
> 3,c,텍스트
> 4,d,"text
> テキスト
> 텍스트"
> 5,e,last
> {code}
> When it is loaded without the wholeFile=True option, non-ASCII characters are 
> shown correctly, although multi-line records are parsed incorrectly, as follows:
> {code:none}
> testdf_default = spark.read.csv("test.encoding.csv", header=True)
> testdf_default.show()
> +----+----+----+
> |col1|col2|col3|
> +----+----+----+
> |   1|   a|text|
> |   2|   b|テキスト|
> |   3|   c| 텍스트|
> |   4|   d|text|
> |テキスト|null|null|
> | 텍스트"|null|null|
> |   5|   e|last|
> +----+----+----+
> {code}
> When the wholeFile=True option is used, non-ASCII characters are broken, as 
> follows:
> {code:none}
> testdf_wholefile = spark.read.csv("test.encoding.csv", header=True, 
> wholeFile=True)
> testdf_wholefile.show()
> +----+----+----+
> |col1|col2|col3|
> +----+----+----+
> |   1|   a|text|
> |   2|   b||
> |   3|   c|   �|
> |   4|   d|text
> ...|
> |   5|   e|last|
> +----+----+----+
> {code}
> The result is the same even if I use the encoding="UTF-8" option with 
> wholeFile=True.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-20336) spark.read.csv() with wholeFile=True option fails to read non ASCII unicode characters

2017-04-16 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15970262#comment-15970262
 ] 

Hyukjin Kwon edited comment on SPARK-20336 at 4/16/17 7:16 AM:
---

Sure, definitely. No need to be in a hurry. I just wanted to know more details 
and to be sure this is in progress. Thank you for your input.


was (Author: hyukjin.kwon):
Sure, definitely. No need to be in a harry. I just wanted to know in more 
details and wanted to be sure this is on a progress. Thank you for your input.

> spark.read.csv() with wholeFile=True option fails to read non ASCII unicode 
> characters
> --
>
> Key: SPARK-20336
> URL: https://issues.apache.org/jira/browse/SPARK-20336
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
> Environment: Spark 2.2.0 (master branch is downloaded from Github)
> PySpark
>Reporter: HanCheol Cho
>
> I used the spark.read.csv() method with the wholeFile=True option to load data 
> that has multi-line records.
> However, non-ASCII characters are not loaded properly.
> The following is sample data for the test:
> {code:none}
> col1,col2,col3
> 1,a,text
> 2,b,テキスト
> 3,c,텍스트
> 4,d,"text
> テキスト
> 텍스트"
> 5,e,last
> {code}
> When it is loaded without the wholeFile=True option, non-ASCII characters are 
> shown correctly, although multi-line records are parsed incorrectly, as follows:
> {code:none}
> testdf_default = spark.read.csv("test.encoding.csv", header=True)
> testdf_default.show()
> +----+----+----+
> |col1|col2|col3|
> +----+----+----+
> |   1|   a|text|
> |   2|   b|テキスト|
> |   3|   c| 텍스트|
> |   4|   d|text|
> |テキスト|null|null|
> | 텍스트"|null|null|
> |   5|   e|last|
> +----+----+----+
> {code}
> When the wholeFile=True option is used, non-ASCII characters are broken, as 
> follows:
> {code:none}
> testdf_wholefile = spark.read.csv("test.encoding.csv", header=True, 
> wholeFile=True)
> testdf_wholefile.show()
> +----+----+----+
> |col1|col2|col3|
> +----+----+----+
> |   1|   a|text|
> |   2|   b||
> |   3|   c|   �|
> |   4|   d|text
> ...|
> |   5|   e|last|
> +----+----+----+
> {code}
> The result is the same even if I use the encoding="UTF-8" option with 
> wholeFile=True.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-10109) NPE when saving Parquet To HDFS

2017-04-16 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10109?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-10109.
--
Resolution: Duplicate

Per the comment above, I am resolving this as it looks like a subset of 
SPARK-20038. Please reopen this if I misunderstood the comment.

> NPE when saving Parquet To HDFS
> ---
>
> Key: SPARK-10109
> URL: https://issues.apache.org/jira/browse/SPARK-10109
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.1
> Environment: Sparc-ec2, standalone cluster on amazon
>Reporter: Virgil Palanciuc
>
> Very simple code, trying to save a dataframe
> I get this in the driver
> {quote}
> 15/08/19 11:21:41 INFO TaskSetManager: Lost task 9.2 in stage 217.0 (TID 
> 4748) on executor 172.xx.xx.xx: java.lang.NullPointerException (null) 
> and  (not for that task):
> 15/08/19 11:21:46 WARN TaskSetManager: Lost task 5.0 in stage 543.0 (TID 
> 5607, 172.yy.yy.yy): java.lang.NullPointerException
> at 
> parquet.hadoop.InternalParquetRecordWriter.flushRowGroupToStore(InternalParquetRecordWriter.java:146)
> at 
> parquet.hadoop.InternalParquetRecordWriter.close(InternalParquetRecordWriter.java:112)
> at 
> parquet.hadoop.ParquetRecordWriter.close(ParquetRecordWriter.java:73)
> at 
> org.apache.spark.sql.parquet.ParquetOutputWriter.close(newParquet.scala:88)
> at 
> org.apache.spark.sql.sources.DynamicPartitionWriterContainer$$anonfun$clearOutputWriters$1.apply(commands.scala:536)
> at 
> org.apache.spark.sql.sources.DynamicPartitionWriterContainer$$anonfun$clearOutputWriters$1.apply(commands.scala:536)
> at 
> scala.collection.mutable.HashMap$$anon$2$$anonfun$foreach$3.apply(HashMap.scala:107)
> at 
> scala.collection.mutable.HashMap$$anon$2$$anonfun$foreach$3.apply(HashMap.scala:107)
> at 
> scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:226)
> at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:39)
> at scala.collection.mutable.HashMap$$anon$2.foreach(HashMap.scala:107)
> at 
> org.apache.spark.sql.sources.DynamicPartitionWriterContainer.clearOutputWriters(commands.scala:536)
> at 
> org.apache.spark.sql.sources.DynamicPartitionWriterContainer.abortTask(commands.scala:552)
> at 
> org.apache.spark.sql.sources.InsertIntoHadoopFsRelation.org$apache$spark$sql$sources$InsertIntoHadoopFsRelation$$writeRows$2(commands.scala:269)
> at 
> org.apache.spark.sql.sources.InsertIntoHadoopFsRelation$$anonfun$insertWithDynamicPartitions$3.apply(commands.scala:229)
> at 
> org.apache.spark.sql.sources.InsertIntoHadoopFsRelation$$anonfun$insertWithDynamicPartitions$3.apply(commands.scala:229)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:63)
> at org.apache.spark.scheduler.Task.run(Task.scala:70)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> {quote}
> I get this in the executor log:
> {quote}
> 15/08/19 11:21:41 WARN DFSClient: DataStreamer Exception
> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException):
>  No lease on 
> /gglogs/2015-07-27/_temporary/_attempt_201508191119_0217_m_09_2/dpid=18432/pid=1109/part-r-9-46ac3a79-a95c-4d9c-a2f1-b3ee76f6a46c.snappy.parquet
>  File does not exist. Holder DFSClient_NONMAPREDUCE_1730998114_63 does not 
> have any open files.
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:2396)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:2387)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:2183)
>   at 
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:481)
>   at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:297)
>   at 
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java:44080)
>   at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:453)
>   at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1002)
>   at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1695)
>   at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1691)
>   at java.security.AccessController.doPrivileged(Native Method)
>  

[jira] [Commented] (SPARK-10294) When Parquet writer's close method throws an exception, we will call close again and trigger a NPE

2017-04-16 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10294?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15970266#comment-15970266
 ] 

Hyukjin Kwon commented on SPARK-10294:
--

Would this be resolvable maybe?

> When Parquet writer's close method throws an exception, we will call close 
> again and trigger a NPE
> --
>
> Key: SPARK-10294
> URL: https://issues.apache.org/jira/browse/SPARK-10294
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Yin Huai
> Attachments: screenshot-1.png
>
>
> When a task saves a large parquet file (larger than the S3 file size limit) 
> to S3, it looks like we still call the parquet writer's close twice, which 
> triggers the NPE reported in SPARK-7837. Eventually, the job failed and I got 
> the NPE as the exception. Actually, the real problem was that the file was too 
> large for S3.
> {code}
> Driver stacktrace:
>   at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1280)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1268)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1267)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>   at 
> org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1267)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:697)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:697)
>   at scala.Option.foreach(Option.scala:236)
>   at 
> org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:697)
>   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1493)
>   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1455)
>   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1444)
>   at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
>   at 
> org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:567)
>   at org.apache.spark.SparkContext.runJob(SparkContext.scala:1818)
>   at org.apache.spark.SparkContext.runJob(SparkContext.scala:1831)
>   at org.apache.spark.SparkContext.runJob(SparkContext.scala:1908)
>   at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1.apply$mcV$sp(InsertIntoHadoopFsRelation.scala:150)
>   at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1.apply(InsertIntoHadoopFsRelation.scala:108)
>   at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1.apply(InsertIntoHadoopFsRelation.scala:108)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:56)
>   at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation.run(InsertIntoHadoopFsRelation.scala:108)
>   at 
> org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult$lzycompute(commands.scala:57)
>   at 
> org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult(commands.scala:57)
>   at 
> org.apache.spark.sql.execution.ExecutedCommand.doExecute(commands.scala:69)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:140)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:138)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
>   at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:138)
>   at 
> org.apache.spark.sql.SQLContext$QueryExecution.toRdd$lzycompute(SQLContext.scala:927)
>   at 
> org.apache.spark.sql.SQLContext$QueryExecution.toRdd(SQLContext.scala:927)
>   at 
> org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:197)
>   at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:146)
>   at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:137)
>   at 
> com.databricks.spark.sql.perf.tpcds.Tables$Table.genData(Tables.scala:147)
>   at 
> com.databricks.spark.sql.perf.tpcds.Tables$$anonfun$genData$2.apply(Tables.scala:192)
>   at 
> com.databricks.spark.sql.perf.tpcds.Tables$$anonfun$genData$2.apply(Tables.scala:190)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at com.dat

[jira] [Created] (SPARK-20349) ListFunction returns duplicate functions after using persistent functions

2017-04-16 Thread Xiao Li (JIRA)
Xiao Li created SPARK-20349:
---

 Summary: ListFunction returns duplicate functions after using 
persistent functions
 Key: SPARK-20349
 URL: https://issues.apache.org/jira/browse/SPARK-20349
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.1.0, 2.0.2
Reporter: Xiao Li
Assignee: Xiao Li


The session catalog caches some persistent functions in the FunctionRegistry, 
so there can be duplicates. Our Catalog API listFunctions does not handle it.
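
For reference, a rough sketch (the function name and jar path are made up) of how 
the duplication can be observed from the Python Catalog API, assuming a 
Hive-enabled session and a persistent function that has already been created and 
used:

{code}
# Hypothetical illustration; the function name and jar path are made up.
# spark.sql("CREATE FUNCTION my_udf AS 'com.example.MyUDF' USING JAR '/tmp/my-udf.jar'")
# spark.sql("SELECT my_udf(1)").collect()

from collections import Counter

names = [f.name for f in spark.catalog.listFunctions()]
dupes = [name for name, n in Counter(names).items() if n > 1]
print(dupes)  # a non-empty list shows the duplicate entries described above
{code}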



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-10746) count ( distinct columnref) over () returns wrong result set

2017-04-16 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10746?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-10746.
--
Resolution: Not A Problem

I am resolving this per the comment above.

I think this applies to ...

{quote}
If the issue seems clearly obsolete and applies to issues or components that 
have changed radically since it was opened, resolve as Not a Problem
{quote}

> count ( distinct columnref) over () returns wrong result set
> 
>
> Key: SPARK-10746
> URL: https://issues.apache.org/jira/browse/SPARK-10746
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: N Campbell
>
> Same issue as reported against Hive (HIVE-9534).
> The result set was expected to contain 5 rows instead of 1 row, as other 
> vendors (Oracle, Netezza, etc.) would return.
> select count( distinct column) over () from t1



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20349) ListFunctions returns duplicate functions after using persistent functions

2017-04-16 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20349?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-20349:

Summary: ListFunctions returns duplicate functions after using persistent 
functions  (was: ListFunction returns duplicate functions after using 
persistent functions)

> ListFunctions returns duplicate functions after using persistent functions
> --
>
> Key: SPARK-20349
> URL: https://issues.apache.org/jira/browse/SPARK-20349
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2, 2.1.0
>Reporter: Xiao Li
>Assignee: Xiao Li
>
> The session catalog caches some persistent functions in the FunctionRegistry, 
> so there can be duplicates. Our Catalog API listFunctions does not handle it.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-11186) Caseness inconsistency between SQLContext and HiveContext

2017-04-16 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11186?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-11186.
--
Resolution: Cannot Reproduce

I can't run the code as reported. I am resolving this per ...

{quote}
For issues that can’t be reproduced against master as reported, resolve as 
Cannot Reproduce
{quote}

Please reopen this if anyone is able to run the codes above in the current 
master.

> Caseness inconsistency between SQLContext and HiveContext
> -
>
> Key: SPARK-11186
> URL: https://issues.apache.org/jira/browse/SPARK-11186
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.1
>Reporter: Santiago M. Mola
>Priority: Minor
>
> Default catalog behaviour for caseness is different in {{SQLContext}} and 
> {{HiveContext}}.
> {code}
>   test("Catalog caseness (SQL)") {
> val sqlc = new SQLContext(sc)
> val relationName = "MyTable"
> sqlc.catalog.registerTable(relationName :: Nil, LogicalRelation(new 
> BaseRelation {
>   override def sqlContext: SQLContext = sqlc
>   override def schema: StructType = StructType(Nil)
> }))
> val tables = sqlc.tableNames()
> assert(tables.contains(relationName))
>   }
>   test("Catalog caseness (Hive)") {
> val sqlc = new HiveContext(sc)
> val relationName = "MyTable"
> sqlc.catalog.registerTable(relationName :: Nil, LogicalRelation(new 
> BaseRelation {
>   override def sqlContext: SQLContext = sqlc
>   override def schema: StructType = StructType(Nil)
> }))
> val tables = sqlc.tableNames()
> assert(tables.contains(relationName))
>   }
> {code}
> Looking at {{HiveContext#SQLSession}}, I see this is the intended behaviour. 
> But the reason that this is needed seems undocumented (both in the manual and 
> in the source code comments).



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20349) ListFunctions returns duplicate functions after using persistent functions

2017-04-16 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20349?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20349:


Assignee: Apache Spark  (was: Xiao Li)

> ListFunctions returns duplicate functions after using persistent functions
> --
>
> Key: SPARK-20349
> URL: https://issues.apache.org/jira/browse/SPARK-20349
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2, 2.1.0
>Reporter: Xiao Li
>Assignee: Apache Spark
>
> The session catalog caches some persistent functions in the FunctionRegistry, 
> so there can be duplicates. Our Catalog API listFunctions does not handle it.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20349) ListFunctions returns duplicate functions after using persistent functions

2017-04-16 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20349?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20349:


Assignee: Xiao Li  (was: Apache Spark)

> ListFunctions returns duplicate functions after using persistent functions
> --
>
> Key: SPARK-20349
> URL: https://issues.apache.org/jira/browse/SPARK-20349
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2, 2.1.0
>Reporter: Xiao Li
>Assignee: Xiao Li
>
> The session catalog caches some persistent functions in the FunctionRegistry, 
> so there can be duplicates. Our Catalog API listFunctions does not handle it.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20349) ListFunctions returns duplicate functions after using persistent functions

2017-04-16 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20349?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15970270#comment-15970270
 ] 

Apache Spark commented on SPARK-20349:
--

User 'gatorsmile' has created a pull request for this issue:
https://github.com/apache/spark/pull/17646

> ListFunctions returns duplicate functions after using persistent functions
> --
>
> Key: SPARK-20349
> URL: https://issues.apache.org/jira/browse/SPARK-20349
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2, 2.1.0
>Reporter: Xiao Li
>Assignee: Xiao Li
>
> The session catalog caches some persistent functions in the FunctionRegistry, 
> so there can be duplicates. Our Catalog API listFunctions does not handle it.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12259) Kryo/javaSerialization encoder are not composable

2017-04-16 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15970280#comment-15970280
 ] 

Hyukjin Kwon commented on SPARK-12259:
--

[~smilegator], I just happened to try to reproduce this.

{code}
import org.apache.spark.sql.Encoders

case class KryoClassData(st: String, i: Long)
implicit val kryoEncoder = Encoders.kryo[KryoClassData]
val ds = Seq( ( KryoClassData("a", 1), KryoClassData("b", 2) ) ).toDS()
ds.show()
{code}

It seems the code above works fine now in the current master. Would this be 
resolvable maybe?

> Kryo/javaSerialization encoder are not composable
> -
>
> Key: SPARK-12259
> URL: https://issues.apache.org/jira/browse/SPARK-12259
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Xiao Li
>
> {code}
> implicit val kryoEncoder = Encoders.kryo[KryoClassData]
> val ds = Seq( ( KryoClassData("a", 1), KryoClassData("b", 2) ) ).toDS()
> {code}
> The above code complains it is unable to find the encoder. 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20346) sum aggregate over empty Dataset gives null

2017-04-16 Thread Jacek Laskowski (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20346?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15970281#comment-15970281
 ] 

Jacek Laskowski commented on SPARK-20346:
-

[~hyukjin.kwon] Caught me! I didn't think about it much and thought 0 would be 
a viable candidate, but it's not obvious to me now. The field is {{nullable}}, 
so {{null}} is acceptable. I think it may be acceptable for end users (like me) 
as long as it is documented in the docs (or at least in the scaladoc for 
{{agg}} or {{groupBy}} / {{groupByKey}}). It was just weird to see {{null}} as a 
numeric value.
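
For reference, a small workaround sketch (not from the discussion; it assumes an 
active session) that yields 0 instead of null by wrapping the aggregate in 
coalesce:

{code}
# Hypothetical PySpark equivalent of the Scala snippet in the description.
from pyspark.sql import functions as F

spark.range(0).agg(F.coalesce(F.sum("id"), F.lit(0)).alias("sum_id")).show()
# expected:
# +------+
# |sum_id|
# +------+
# |     0|
# +------+
{code}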

> sum aggregate over empty Dataset gives null
> ---
>
> Key: SPARK-20346
> URL: https://issues.apache.org/jira/browse/SPARK-20346
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Jacek Laskowski
>Priority: Minor
>
> {code}
> scala> spark.range(0).agg(sum("id")).show
> +-------+
> |sum(id)|
> +-------+
> |   null|
> +-------+
> scala> spark.range(0).agg(sum("id")).printSchema
> root
>  |-- sum(id): long (nullable = true)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-12677) Lazy file discovery for parquet

2017-04-16 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12677?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-12677.
--
Resolution: Duplicate

I just realised that there is the option for this case. I am resolving this as 
a duplicate.

> Lazy file discovery for parquet
> ---
>
> Key: SPARK-12677
> URL: https://issues.apache.org/jira/browse/SPARK-12677
> Project: Spark
>  Issue Type: Wish
>  Components: SQL
>Reporter: Tiago Albineli Motta
>Priority: Minor
>  Labels: features
>
> When using sqlContext.read.parquet(files: _*), the driver verifies that 
> everything is OK with the files. But reading those files is lazy, so by the 
> time it starts, the files may not be there anymore, or they may have changed, 
> so we receive this error message:
> {quote}
> 16/01/06 10:52:43 ERROR yarn.ApplicationMaster: User class threw exception: 
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 4 in 
> stage 0.0 failed 4 times, most recent failure: Lost task 4.3 in stage 0.0 
> (TID 16, riolb586.globoi.com): java.io.FileNotFoundException: File does not 
> exist: 
> hdfs://mynamenode.com:8020/rec/prefs/2016/01/06/part-r-3-27a100b0-ff49-45ad-8803-e6cc77286661.gz.parquet
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem$22.doCall(DistributedFileSystem.java:1309)
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem$22.doCall(DistributedFileSystem.java:1301)
>   at 
> org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1317)
>   at 
> parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:381)
>   at 
> parquet.hadoop.ParquetRecordReader.initializeInternalReader(ParquetRecordReader.java:155)
>   at 
> parquet.hadoop.ParquetRecordReader.initialize(ParquetRecordReader.java:138)
>   at 
> org.apache.spark.sql.sources.SqlNewHadoopRDD$$anon$1.(SqlNewHadoopRDD.scala:153)
>   at 
> org.apache.spark.sql.sources.SqlNewHadoopRDD.compute(SqlNewHadoopRDD.scala:124)
>   at 
> org.apache.spark.sql.sources.SqlNewHadoopRDD.compute(SqlNewHadoopRDD.scala:66)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
>   at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:69)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:242)
>   at org.apache.spark.rdd.UnionRDD.compute(UnionRDD.scala:87)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:70)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
>   at org.apache.spark.scheduler.Task.run(Task.scala:70)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)
> {quote}
> Maybe this could be avoided if sqlContext.read.parquet could receive a 
> function to discover the files instead, like this: 
> sqlContext.read.parquet( () => files )



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-12677) Lazy file discovery for parquet

2017-04-16 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12677?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reopened SPARK-12677:
--

> Lazy file discovery for parquet
> ---
>
> Key: SPARK-12677
> URL: https://issues.apache.org/jira/browse/SPARK-12677
> Project: Spark
>  Issue Type: Wish
>  Components: SQL
>Reporter: Tiago Albineli Motta
>Priority: Minor
>  Labels: features
>
> When using sqlContext.read.parquet(files: _*), the driver verifies that 
> everything is OK with the files. But reading those files is lazy, so by the 
> time it starts, the files may not be there anymore, or they may have changed, 
> so we receive this error message:
> {quote}
> 16/01/06 10:52:43 ERROR yarn.ApplicationMaster: User class threw exception: 
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 4 in 
> stage 0.0 failed 4 times, most recent failure: Lost task 4.3 in stage 0.0 
> (TID 16, riolb586.globoi.com): java.io.FileNotFoundException: File does not 
> exist: 
> hdfs://mynamenode.com:8020/rec/prefs/2016/01/06/part-r-3-27a100b0-ff49-45ad-8803-e6cc77286661.gz.parquet
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem$22.doCall(DistributedFileSystem.java:1309)
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem$22.doCall(DistributedFileSystem.java:1301)
>   at 
> org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1317)
>   at 
> parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:381)
>   at 
> parquet.hadoop.ParquetRecordReader.initializeInternalReader(ParquetRecordReader.java:155)
>   at 
> parquet.hadoop.ParquetRecordReader.initialize(ParquetRecordReader.java:138)
>   at 
> org.apache.spark.sql.sources.SqlNewHadoopRDD$$anon$1.(SqlNewHadoopRDD.scala:153)
>   at 
> org.apache.spark.sql.sources.SqlNewHadoopRDD.compute(SqlNewHadoopRDD.scala:124)
>   at 
> org.apache.spark.sql.sources.SqlNewHadoopRDD.compute(SqlNewHadoopRDD.scala:66)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
>   at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:69)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:242)
>   at org.apache.spark.rdd.UnionRDD.compute(UnionRDD.scala:87)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:70)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
>   at org.apache.spark.scheduler.Task.run(Task.scala:70)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)
> {quote}
> Maybe this could be avoided if sqlContext.read.parquet could receive a 
> function to discover the files instead, like this: 
> sqlContext.read.parquet( () => files )



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-12677) Lazy file discovery for parquet

2017-04-16 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12677?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-12677.
--
Resolution: Not A Problem

It sounds like it was decided to explicitly throw an exception in SPARK-19916. 
There is an option for corrupt-file handling in SPARK-17850. I think that is 
good enough. I am resolving this.
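
For reference, a rough usage sketch of that option (the path is made up; check 
the documentation for the exact semantics):

{code}
# Hypothetical example of the corrupt-file handling referred to above.
spark.conf.set("spark.sql.files.ignoreCorruptFiles", "true")
df = spark.read.parquet("/data/that/may/have/changed")
df.count()  # corrupt files are skipped instead of failing the whole job
{code}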

> Lazy file discovery for parquet
> ---
>
> Key: SPARK-12677
> URL: https://issues.apache.org/jira/browse/SPARK-12677
> Project: Spark
>  Issue Type: Wish
>  Components: SQL
>Reporter: Tiago Albineli Motta
>Priority: Minor
>  Labels: features
>
> When using sqlContext.read.parquet(files: _*), the driver verifies that 
> everything is OK with the files. But reading those files is lazy, so by the 
> time it starts, the files may not be there anymore, or they may have changed, 
> so we receive this error message:
> {quote}
> 16/01/06 10:52:43 ERROR yarn.ApplicationMaster: User class threw exception: 
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 4 in 
> stage 0.0 failed 4 times, most recent failure: Lost task 4.3 in stage 0.0 
> (TID 16, riolb586.globoi.com): java.io.FileNotFoundException: File does not 
> exist: 
> hdfs://mynamenode.com:8020/rec/prefs/2016/01/06/part-r-3-27a100b0-ff49-45ad-8803-e6cc77286661.gz.parquet
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem$22.doCall(DistributedFileSystem.java:1309)
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem$22.doCall(DistributedFileSystem.java:1301)
>   at 
> org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1317)
>   at 
> parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:381)
>   at 
> parquet.hadoop.ParquetRecordReader.initializeInternalReader(ParquetRecordReader.java:155)
>   at 
> parquet.hadoop.ParquetRecordReader.initialize(ParquetRecordReader.java:138)
>   at 
> org.apache.spark.sql.sources.SqlNewHadoopRDD$$anon$1.(SqlNewHadoopRDD.scala:153)
>   at 
> org.apache.spark.sql.sources.SqlNewHadoopRDD.compute(SqlNewHadoopRDD.scala:124)
>   at 
> org.apache.spark.sql.sources.SqlNewHadoopRDD.compute(SqlNewHadoopRDD.scala:66)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
>   at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:69)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:242)
>   at org.apache.spark.rdd.UnionRDD.compute(UnionRDD.scala:87)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:70)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
>   at org.apache.spark.scheduler.Task.run(Task.scala:70)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)
> {quote}
> Maybe this could be avoided if sqlContext.read.parquet could receive a 
> function to discover the files instead, like this: 
> sqlContext.read.parquet( () => files )



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-12677) Lazy file discovery for parquet

2017-04-16 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12677?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-12677:
-
Comment: was deleted

(was: I just realised that there is the option for this case. I am resolving 
this as a duplicate.)

> Lazy file discovery for parquet
> ---
>
> Key: SPARK-12677
> URL: https://issues.apache.org/jira/browse/SPARK-12677
> Project: Spark
>  Issue Type: Wish
>  Components: SQL
>Reporter: Tiago Albineli Motta
>Priority: Minor
>  Labels: features
>
> When using sqlContext.read.parquet(files: _*), the driver verifies that 
> everything is OK with the files. But reading those files is lazy, so by the 
> time it starts, the files may not be there anymore, or they may have changed, 
> so we receive this error message:
> {quote}
> 16/01/06 10:52:43 ERROR yarn.ApplicationMaster: User class threw exception: 
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 4 in 
> stage 0.0 failed 4 times, most recent failure: Lost task 4.3 in stage 0.0 
> (TID 16, riolb586.globoi.com): java.io.FileNotFoundException: File does not 
> exist: 
> hdfs://mynamenode.com:8020/rec/prefs/2016/01/06/part-r-3-27a100b0-ff49-45ad-8803-e6cc77286661.gz.parquet
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem$22.doCall(DistributedFileSystem.java:1309)
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem$22.doCall(DistributedFileSystem.java:1301)
>   at 
> org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1317)
>   at 
> parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:381)
>   at 
> parquet.hadoop.ParquetRecordReader.initializeInternalReader(ParquetRecordReader.java:155)
>   at 
> parquet.hadoop.ParquetRecordReader.initialize(ParquetRecordReader.java:138)
>   at 
> org.apache.spark.sql.sources.SqlNewHadoopRDD$$anon$1.(SqlNewHadoopRDD.scala:153)
>   at 
> org.apache.spark.sql.sources.SqlNewHadoopRDD.compute(SqlNewHadoopRDD.scala:124)
>   at 
> org.apache.spark.sql.sources.SqlNewHadoopRDD.compute(SqlNewHadoopRDD.scala:66)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
>   at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:69)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:242)
>   at org.apache.spark.rdd.UnionRDD.compute(UnionRDD.scala:87)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:70)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
>   at org.apache.spark.scheduler.Task.run(Task.scala:70)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)
> {quote}
> Maybe this could be avoided if sqlContext.read.parquet could receive a 
> function to discover the files instead, like this: 
> sqlContext.read.parquet( () => files )



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20346) sum aggregate over empty Dataset gives null

2017-04-16 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20346?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15970285#comment-15970285
 ] 

Hyukjin Kwon commented on SPARK-20346:
--

Actually, [~a1ray] and I had a discussion about this in a PR related to this 
(missing values in pivot's output). It is currently {{0}} for {{count}} but it 
seems to be {{null}} for the others. To my knowledge, this one is related to it. 
I guess he has a better idea about this.

> sum aggregate over empty Dataset gives null
> ---
>
> Key: SPARK-20346
> URL: https://issues.apache.org/jira/browse/SPARK-20346
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Jacek Laskowski
>Priority: Minor
>
> {code}
> scala> spark.range(0).agg(sum("id")).show
> +-------+
> |sum(id)|
> +-------+
> |   null|
> +-------+
> scala> spark.range(0).agg(sum("id")).printSchema
> root
>  |-- sum(id): long (nullable = true)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-20346) sum aggregate over empty Dataset gives null

2017-04-16 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20346?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15970285#comment-15970285
 ] 

Hyukjin Kwon edited comment on SPARK-20346 at 4/16/17 8:24 AM:
---

Actually, [~a1ray] and I had a discussion about this in a PR related to this 
(missing values in pivot's output). To my knowledge, this one is related to it. 
I guess he has a better idea about this.


was (Author: hyukjin.kwon):
Actually, I and [~a1ray] had a discussion about this in a PR related with this 
(missing values in pivot's output). It will be {{0}} for {{count}} currently 
but it seems {{null}} for others. Up to my knowledge, this one is related with 
it. I guess he has a better idea about this.

> sum aggregate over empty Dataset gives null
> ---
>
> Key: SPARK-20346
> URL: https://issues.apache.org/jira/browse/SPARK-20346
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Jacek Laskowski
>Priority: Minor
>
> {code}
> scala> spark.range(0).agg(sum("id")).show
> +-------+
> |sum(id)|
> +-------+
> |   null|
> +-------+
> scala> spark.range(0).agg(sum("id")).printSchema
> root
>  |-- sum(id): long (nullable = true)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12677) Lazy file discovery for parquet

2017-04-16 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15970286#comment-15970286
 ] 

Hyukjin Kwon commented on SPARK-12677:
--

Please reopen this if anyone would like to argue against the current behaviour. 
I am only resolving this out of respect for the decision made in that PR, and it 
seems there has been no interest in this issue for quite a long time.

> Lazy file discovery for parquet
> ---
>
> Key: SPARK-12677
> URL: https://issues.apache.org/jira/browse/SPARK-12677
> Project: Spark
>  Issue Type: Wish
>  Components: SQL
>Reporter: Tiago Albineli Motta
>Priority: Minor
>  Labels: features
>
> When using sqlContext.read.parquet(files: _*), the driver verifies that 
> everything is OK with the files. But reading those files is lazy, so by the 
> time it starts, the files may not be there anymore, or they may have changed, 
> so we receive this error message:
> {quote}
> 16/01/06 10:52:43 ERROR yarn.ApplicationMaster: User class threw exception: 
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 4 in 
> stage 0.0 failed 4 times, most recent failure: Lost task 4.3 in stage 0.0 
> (TID 16, riolb586.globoi.com): java.io.FileNotFoundException: File does not 
> exist: 
> hdfs://mynamenode.com:8020/rec/prefs/2016/01/06/part-r-3-27a100b0-ff49-45ad-8803-e6cc77286661.gz.parquet
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem$22.doCall(DistributedFileSystem.java:1309)
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem$22.doCall(DistributedFileSystem.java:1301)
>   at 
> org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1317)
>   at 
> parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:381)
>   at 
> parquet.hadoop.ParquetRecordReader.initializeInternalReader(ParquetRecordReader.java:155)
>   at 
> parquet.hadoop.ParquetRecordReader.initialize(ParquetRecordReader.java:138)
>   at 
> org.apache.spark.sql.sources.SqlNewHadoopRDD$$anon$1.(SqlNewHadoopRDD.scala:153)
>   at 
> org.apache.spark.sql.sources.SqlNewHadoopRDD.compute(SqlNewHadoopRDD.scala:124)
>   at 
> org.apache.spark.sql.sources.SqlNewHadoopRDD.compute(SqlNewHadoopRDD.scala:66)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
>   at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:69)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:242)
>   at org.apache.spark.rdd.UnionRDD.compute(UnionRDD.scala:87)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:70)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
>   at org.apache.spark.scheduler.Task.run(Task.scala:70)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)
> {quote}
> Maybe this could be avoided if sqlContext.read.parquet could receive a 
> function that discovers the files instead, like this: 
> sqlContext.read.parquet( () => files )
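For illustration only, a sketch using existing APIs: the listing can at least be 
deferred until just before the read, which narrows (but does not close) the window 
in which files can disappear. The helper name is hypothetical and the directory is 
a placeholder taken from the error above; this is not part of any Spark API.

{code}
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.sql.SparkSession

// Hypothetical helper: list the parquet files only at the moment we are about to read them.
def discoverFiles(spark: SparkSession, dir: String): Seq[String] = {
  val path = new Path(dir)
  val fs = path.getFileSystem(spark.sparkContext.hadoopConfiguration)
  fs.listStatus(path)
    .filter(_.getPath.getName.endsWith(".parquet"))
    .map(_.getPath.toString)
    .toSeq
}

val files = discoverFiles(spark, "hdfs://mynamenode.com:8020/rec/prefs/2016/01/06")
val df = spark.read.parquet(files: _*)
{code}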



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-13301) PySpark Dataframe return wrong results with custom UDF

2017-04-16 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13301?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-13301.
--
Resolution: Cannot Reproduce

Per the description in the JIRA, I can't reproduce this. I could only guess at 
the data, so I tried to reproduce it as best I could.

Please reopen this if anyone knows if this still exists and how to reproduce

{code}
>>> from pyspark.sql import functions
>>> import string
>>>
>>> data = [
... ["1265AB4F65C05740E...", "Ivo"],
... ["1D94AB4F75C83B51E...", "Raffaele"],
... ["4F008903600A0133E...", "Cristina"]
... ]
>>>
>>> myDF = spark.createDataFrame(data, ["col1", "col2"])
>>> myFunc = functions.udf(lambda s: string.lower(s))
>>> myDF.select("col1", "col2").withColumn("col3", myFunc(myDF["col1"])).show()


+--------------------+--------+--------------------+
|                col1|    col2|                col3|
+--------------------+--------+--------------------+
|1265AB4F65C05740E...|     Ivo|1265ab4f65c05740e...|
|1D94AB4F75C83B51E...|Raffaele|1d94ab4f75c83b51e...|
|4F008903600A0133E...|Cristina|4f008903600a0133e...|
+--------------------+--------+--------------------+
{code}


> PySpark Dataframe return wrong results with custom UDF
> --
>
> Key: SPARK-13301
> URL: https://issues.apache.org/jira/browse/SPARK-13301
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
> Environment: PySpark in yarn-client mode - CDH 5.5.1
>Reporter: Simone
>Priority: Critical
>
> Using a User Defined Function in PySpark inside the withColumn() method of a 
> DataFrame gives wrong results.
> Here is an example:
> from pyspark.sql import functions
> import string
> myFunc = functions.udf(lambda s: string.lower(s))
> myDF.select("col1", "col2").withColumn("col3", myFunc(myDF["col1"])).show()
> |col1|   col2|col3|
> |1265AB4F65C05740E...|Ivo|4f00ae514e7c015be...|
> |1D94AB4F75C83B51E...|   Raffaele|4f00dcf6422100c0e...|
> |4F008903600A0133E...|   Cristina|4f008903600a0133e...|
> The results are wrong and seem to be random: some records are OK (for example 
> the third), while others are not (for example the first two).
> The problem does not seem to occur with Spark built-in functions:
> from pyspark.sql.functions import *
> myDF.select("col1", "col2").withColumn("col3", lower(myDF["col1"])).show()
> Without the withColumn() method, the results seem to be always correct:
> myDF.select("col1", "col2", myFunc(myDF["col1"])).show()
> This is only partly a workaround, because you have to list all the columns of 
> your DataFrame each time.
> In Scala/Java the problem does not seem to occur.
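For comparison, a rough Scala equivalent of the reproducer (the column values are 
shortened placeholders based on the description); per the report, this path behaves 
correctly:

{code}
import org.apache.spark.sql.functions.udf
import spark.implicits._

val toLower = udf((s: String) => s.toLowerCase)

val myDF = Seq(
  ("1265AB4F65C05740E", "Ivo"),
  ("1D94AB4F75C83B51E", "Raffaele"),
  ("4F008903600A0133E", "Cristina")
).toDF("col1", "col2")

myDF.select($"col1", $"col2").withColumn("col3", toLower($"col1")).show()
{code}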



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-13491) Issue using table alias in Spark SQL case statement

2017-04-16 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13491?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-13491.
--
Resolution: Cannot Reproduce

{code}
val employee1 = Seq(Tuple2("Hyukjin", 1), Tuple2("Tom", 2)).toDF("name", "id")
val employee2 = Seq(Tuple2("Hyukjin", 1), Tuple2("Tom", 2), Tuple2("Jackson", 
3)).toDF("name", "id")
employee1.createOrReplaceTempView("employee1")
employee2.createOrReplaceTempView("employee2")
val sqlStatement = s"select case when (employee1.id = employee2.id and 
concat(employee1.name) = concat(employee2.name) ) then 'NoChange' end from 
employee1 full outer join employee2 on employee1.id = employee2.id"
spark.sql(sqlStatement).show()
{code}

I can't reproduce this. Please reopen it if the issue still exists and there is 
a better reproducer or clearer steps to reproduce it.

> Issue using table alias in Spark SQL case statement
> ---
>
> Key: SPARK-13491
> URL: https://issues.apache.org/jira/browse/SPARK-13491
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Tanmay Deshpande
>
> Spark Version 1.6.0 and 1.5.2
> I am trying to run a Hive query from Spark SQL. When I use case statements in 
> the select, I get an error saying 
> 16/02/25 15:19:30 INFO audit: ugi=admin1  ip=unknown-ip-addr  
> cmd=get_table : db=default tbl=employee2
> Exception in thread "main" org.apache.spark.sql.AnalysisException: cannot 
> resolve 'employee1.id' given input columns salary, name, name, salary, id, 
> id; line 1 pos 18
> My Query in Spark is 
>val sqlStatement = s"select case when (employee1.id = employee2.id and 
> concat(employee1.*) = concat(employee2.*) ) then 'NoChange' end from 
> employee1 full outer join employee2 on employee1.id = employee2.id";
>  



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13644) Add the source file name and line into Logger when an exception occurs in the generated code

2017-04-16 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15970295#comment-15970295
 ] 

Hyukjin Kwon commented on SPARK-13644:
--

gentle ping [~kiszk], is this resolvable?

> Add the source file name and line into Logger when an exception occurs in the 
> generated code
> 
>
> Key: SPARK-13644
> URL: https://issues.apache.org/jira/browse/SPARK-13644
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Kazuaki Ishizaki
>Priority: Minor
>
> This is to show a message that points out the origin of a generated method 
> when an exception occurs in the generated method at runtime.
> An example of a message (the first line is newly added)
> {code}
> 07:49:29.525 ERROR 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator: 
> The method GeneratedIterator.processNext() is generated for filter at 
> Test.scala:23
> 07:49:29.526 ERROR org.apache.spark.executor.Executor: Exception in task 1.0 
> in stage 2.0 (TID 4)
> java.lang.NullPointerException:
> at ...
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13680) Java UDAF with more than one intermediate argument returns wrong results

2017-04-16 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15970296#comment-15970296
 ] 

Hyukjin Kwon commented on SPARK-13680:
--

Could we narrow down the scope? The query itself seems too complex. It would 
be great if this could be tested against the current master, if possible.

What was the expected output and what was the actual output? I think I'd rather 
close this if the reporter is not active.

> Java UDAF with more than one intermediate argument returns wrong results
> 
>
> Key: SPARK-13680
> URL: https://issues.apache.org/jira/browse/SPARK-13680
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
> Environment: CDH 5.5.2
>Reporter: Yael Aharon
> Attachments: data.csv, setup.hql
>
>
> I am trying to incorporate the Java UDAF from 
> https://github.com/apache/spark/blob/master/sql/hive/src/test/java/org/apache/spark/sql/hive/aggregate/MyDoubleAvg.java
>  into an SQL query. 
> I registered the UDAF like this:
>  sqlContext.udf().register("myavg", new MyDoubleAvg());
> My SQL query is:
> SELECT AVG(seqi) AS `avg_seqi`, AVG(seqd) AS `avg_seqd`, AVG(ci) AS `avg_ci`, 
> AVG(cd) AS `avg_cd`, AVG(stdevd) AS `avg_stdevd`, AVG(stdevi) AS 
> `avg_stdevi`, MAX(seqi) AS `max_seqi`, MAX(seqd) AS `max_seqd`, MAX(ci) AS 
> `max_ci`, MAX(cd) AS `max_cd`, MAX(stdevd) AS `max_stdevd`, MAX(stdevi) AS 
> `max_stdevi`, MIN(seqi) AS `min_seqi`, MIN(seqd) AS `min_seqd`, MIN(ci) AS 
> `min_ci`, MIN(cd) AS `min_cd`, MIN(stdevd) AS `min_stdevd`, MIN(stdevi) AS 
> `min_stdevi`,SUM(seqi) AS `sum_seqi`, SUM(seqd) AS `sum_seqd`, SUM(ci) AS 
> `sum_ci`, SUM(cd) AS `sum_cd`, SUM(stdevd) AS `sum_stdevd`, SUM(stdevi) AS 
> `sum_stdevi`, myavg(seqd) as `myavg_seqd`,  AVG(zero) AS `avg_zero`, 
> AVG(nulli) AS `avg_nulli`,AVG(nulld) AS `avg_nulld`, SUM(zero) AS `sum_zero`, 
> SUM(nulli) AS `sum_nulli`,SUM(nulld) AS `sum_nulld`,MAX(zero) AS `max_zero`, 
> MAX(nulli) AS `max_nulli`,MAX(nulld) AS `max_nulld`,count( * ) AS 
> `count_all`, count(nulli) AS `count_nulli` FROM mytable
> As soon as I add the UDAF myavg to the SQL, all the results become incorrect. 
> When I remove the call to the UDAF, the results are correct.
> I was able to work around the issue by modifying the bufferSchema of the UDAF 
> to use an array, along with the corresponding update and merge methods. 
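For reference, a minimal sketch of a UDAF whose bufferSchema has more than one 
intermediate field (a running sum and a count), which is roughly the shape being 
exercised here. The class and registered name are illustrative and are not taken 
from the attached reproducer:

{code}
import org.apache.spark.sql.Row
import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction}
import org.apache.spark.sql.types._

class TwoFieldAvg extends UserDefinedAggregateFunction {
  override def inputSchema: StructType = StructType(StructField("value", DoubleType) :: Nil)
  // Two intermediate fields: a running sum and a count.
  override def bufferSchema: StructType =
    StructType(StructField("sum", DoubleType) :: StructField("count", LongType) :: Nil)
  override def dataType: DataType = DoubleType
  override def deterministic: Boolean = true

  override def initialize(buffer: MutableAggregationBuffer): Unit = {
    buffer(0) = 0.0
    buffer(1) = 0L
  }
  override def update(buffer: MutableAggregationBuffer, input: Row): Unit = {
    if (!input.isNullAt(0)) {
      buffer(0) = buffer.getDouble(0) + input.getDouble(0)
      buffer(1) = buffer.getLong(1) + 1L
    }
  }
  override def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit = {
    buffer1(0) = buffer1.getDouble(0) + buffer2.getDouble(0)
    buffer1(1) = buffer1.getLong(1) + buffer2.getLong(1)
  }
  override def evaluate(buffer: Row): Any =
    if (buffer.getLong(1) == 0L) null else buffer.getDouble(0) / buffer.getLong(1)
}

spark.udf.register("two_field_avg", new TwoFieldAvg)
{code}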



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-14057) sql time stamps do not respect time zones

2017-04-16 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14057?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-14057.
--
Resolution: Duplicate

This appears to have been fixed by SPARK-18936. If it refers to other data sources 
or functions where the data is initially loaded, I think we can specify the 
timezone in the time format in many places. Please reopen this if it still exists 
and I misunderstood.

> sql time stamps do not respect time zones
> -
>
> Key: SPARK-14057
> URL: https://issues.apache.org/jira/browse/SPARK-14057
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Andrew Davidson
>Priority: Minor
>
> We have timestamp data. The timestamp data is UTC; however, when we load the 
> data into Spark data frames, the system assumes the timestamps are in the 
> local time zone. This causes problems for our data scientists. Often they 
> pull data from our data center onto their local Macs. The data centers run 
> UTC, while their computers are typically in PST or EST.
> This causes a lot of errors in their analysis.
> It is possible to hack around this problem.
> A complete description of this issue can be found in the following mail msg
> https://www.mail-archive.com/user@spark.apache.org/msg48121.html
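As a hedged sketch of what SPARK-18936 provides (Spark 2.2+), plus an explicit 
offset pattern at load time; the path and column layout below are placeholders:

{code}
// Interpret session-local timestamp operations in UTC instead of the JVM's default zone.
spark.conf.set("spark.sql.session.timeZone", "UTC")

// When loading text data, an explicit offset pattern can also be supplied.
val df = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .option("timestampFormat", "yyyy-MM-dd'T'HH:mm:ss.SSSXXX")
  .csv("/path/to/events.csv")
{code}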



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-14097) Spark SQL Optimization is not consistent

2017-04-16 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14097?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-14097.
--
Resolution: Invalid

Is the issue the different plans? It is really hard to read, and at least for me 
the point is not clear.

As there seems to have been no interest for almost a year, and I strongly feel 
that no one is going to resolve this issue unless the reporter keeps reproducing 
it, I am resolving this. Please reopen it if anyone can verify this.

> Spark SQL Optimization is not consistent
> 
>
> Key: SPARK-14097
> URL: https://issues.apache.org/jira/browse/SPARK-14097
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.2, 1.6.1
>Reporter: Gaurav Tiwari
>Priority: Minor
>
> I am trying to execute a simple query with a join on 3 tables. When I look at 
> the execution plan, it varies with the position of the tables in the "from" 
> clause of the query. The execution plan looks more optimized when the table 
> with predicates is specified before any other table. 
> a) Original query : 
> select distinct pge.portfolio_code 
> from table1 pge join table2 p 
> on p.perm_group = pge.anc_port_group 
> join table3 uge 
> on p.user_group=uge.anc_user_group 
> where uge.user_name = 'user' and p.perm_type = 'TEST' 
> b) Optimized query (table with predicates is moved ahead): 
> select distinct pge.portfolio_code from table3 uge, table2 p, table1 pge 
> where uge.user_name = 'user' and p.perm_type = 'TEST'  and p.perm_group = 
> pge.anc_port_group  and p.user_group=uge.anc_user_group
> 1) Execution Plan for Original query (a):
> == Parsed Logical Plan ==
> 'Distinct
>  'Project [unresolvedalias('pge.portfolio_code)]
>   'Filter (('uge.user_name = user) && ('p.perm_type = TEST))
>'Join Inner, Some(('p.user_group = 'uge.anc_user_group))
> 'Join Inner, Some(('p.perm_group = 'pge.anc_port_group))
>  'UnresolvedRelation [table1], Some(pge)
>  'UnresolvedRelation [table2], Some(p)
> 'UnresolvedRelation [table3], Some(uge)
> == Analyzed Logical Plan ==
> portfolio_code: string
> Distinct
>  Project [portfolio_code#7]
>   Filter ((user_name#12 = user) && (perm_type#9 = TEST))
>Join Inner, Some((user_group#8 = anc_user_group#11))
> Join Inner, Some((perm_group#10 = anc_port_group#5))
>  Subquery pge
>   Relation[anc_port_group#5,portfolio_name#6,portfolio_code#7] 
> ParquetRelation[snackfs://tst:9042/aladdin_data_beta/table1.parquet]
>  Subquery p
>   Relation[user_group#8,perm_type#9,perm_group#10] 
> ParquetRelation[snackfs://tst:9042/aladdin_data_beta/table2.parquet]
> Subquery uge
>  Relation[anc_user_group#11,user_name#12] 
> ParquetRelation[snackfs://tst:9042/aladdin_data_beta/table3.parquet]
> == Optimized Logical Plan ==
> Aggregate [portfolio_code#7], [portfolio_code#7]
>  Project [portfolio_code#7]
>   Join Inner, Some((user_group#8 = anc_user_group#11))
>Project [portfolio_code#7,user_group#8]
> Join Inner, Some((perm_group#10 = anc_port_group#5))
>  Project [portfolio_code#7,anc_port_group#5]
>   Relation[anc_port_group#5,portfolio_name#6,portfolio_code#7] 
> ParquetRelation[snackfs://tst:9042/aladdin_data_beta/table1.parquet]
>  Project [user_group#8,perm_group#10]
>   Filter (perm_type#9 = TEST)
>Relation[user_group#8,perm_type#9,perm_group#10] 
> ParquetRelation[snackfs://tst:9042/aladdin_data_beta/table2.parquet]
>Project [anc_user_group#11]
> Filter (user_name#12 = user)
>  Relation[anc_user_group#11,user_name#12] 
> ParquetRelation[snackfs://tst:9042/aladdin_data_beta/table3.parquet]
> == Physical Plan ==
> TungstenAggregate(key=[portfolio_code#7], functions=[], 
> output=[portfolio_code#7])
>  TungstenExchange hashpartitioning(portfolio_code#7)
>   TungstenAggregate(key=[portfolio_code#7], functions=[], 
> output=[portfolio_code#7])
>TungstenProject [portfolio_code#7]
> BroadcastHashJoin [user_group#8], [anc_user_group#11], BuildRight
>  TungstenProject [portfolio_code#7,user_group#8]
>   BroadcastHashJoin [anc_port_group#5], [perm_group#10], BuildRight
>ConvertToUnsafe
> Scan 
> ParquetRelation[snackfs://tst:9042/aladdin_data_beta/table1.parquet][portfolio_code#7,anc_port_group#5]
>ConvertToUnsafe
> Project [user_group#8,perm_group#10]
>  Filter (perm_type#9 = TEST)
>   Scan 
> ParquetRelation[snackfs://tst:9042/aladdin_data_beta/table2.parquet][user_group#8,perm_group#10,perm_type#9]
>  ConvertToUnsafe
>   Project [anc_user_group#11]
>Filter (user_name#12 = user)
> Scan 
> ParquetRelation[snackfs://tst:9042/aladdin_data_beta/table3.parquet][anc_user_group#11,user_name#12]
> Code Generation: true
> 2) Execution Plan for  Optimized query (b):
> == Parsed Logical Pla

[jira] [Commented] (SPARK-14584) Improve recognition of non-nullability in Dataset transformations

2017-04-16 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15970306#comment-15970306
 ] 

Hyukjin Kwon commented on SPARK-14584:
--

[~joshrosen], it seems it now recognises the non-nullability, as below:

{code}
scala> case class MyCaseClass(foo: Int)
defined class MyCaseClass

scala> sc.parallelize(Seq(0)).toDS.map(MyCaseClass).printSchema
root
 |-- foo: integer (nullable = false)
{code}

Could we resolve this?

> Improve recognition of non-nullability in Dataset transformations
> -
>
> Key: SPARK-14584
> URL: https://issues.apache.org/jira/browse/SPARK-14584
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Josh Rosen
>
> There are many cases where we can statically know that a field will never be 
> null. For instance, a field in a case class with a primitive type will never 
> return null. However, there are currently several cases in the Dataset API 
> where we do not properly recognize this non-nullability. For instance:
> {code}
> case class MyCaseClass(foo: Int)
> sc.parallelize(Seq(0)).toDS.map(MyCaseClass).printSchema
> {code}
> claims that the {{foo}} field is nullable even though this is impossible.
> I believe that this is due to the way that we reason about nullability when 
> constructing serializer expressions in ExpressionEncoders. The following 
> assertion will currently fail if added to ExpressionEncoder:
> {code}
>   require(schema.size == serializer.size)
>   schema.fields.zip(serializer).foreach { case (field, fieldSerializer) =>
> require(field.dataType == fieldSerializer.dataType, s"Field 
> ${field.name}'s data type is " +
>   s"${field.dataType} in the schema but ${fieldSerializer.dataType} in 
> its serializer")
> require(field.nullable == fieldSerializer.nullable, s"Field 
> ${field.name}'s nullability is " +
>   s"${field.nullable} in the schema but ${fieldSerializer.nullable} in 
> its serializer")
>   }
> {code}
> Most often, the schema claims that a field is non-nullable while the encoder 
> allows for nullability, but occasionally we see a mismatch in the datatypes 
> due to disagreements over the nullability of nested structs' fields (or 
> fields of structs in arrays).
> I think the problem is that when we're reasoning about nullability in a 
> struct's schema we consider its fields' nullability to be independent of the 
> nullability of the struct itself, whereas in the serializer expressions we 
> are considering those field extraction expressions to be nullable if the 
> input objects themselves can be nullable.
> I'm not sure what's the simplest way to fix this. One proposal would be to 
> leave the serializers unchanged and have ObjectOperator derive its output 
> attributes from an explicitly-passed schema rather than using the 
> serializers' attributes. However, I worry that this might introduce bugs in 
> case the serializer and schema disagree.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14651) CREATE TEMPORARY TABLE is not supported yet

2017-04-16 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14651?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15970307#comment-15970307
 ] 

Hyukjin Kwon commented on SPARK-14651:
--

I just added the affected version, as I can reproduce this in the current master 
as below:

{code}
scala> sql("CREATE temporary table t2")
org.apache.spark.sql.catalyst.parser.ParseException:
CREATE TEMPORARY TABLE is not supported yet. Please use CREATE TEMPORARY VIEW 
as an alternative.(line 1, pos 0)

== SQL ==
CREATE temporary table t2
^^^
{code}
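For completeness, a sketch of the suggested alternative; the view name and path 
are placeholders:

{code}
// Temporary view backed by a data source, instead of CREATE TEMPORARY TABLE.
spark.sql("CREATE TEMPORARY VIEW t2 USING parquet OPTIONS (path '/tmp/hello')")

// Or register an existing DataFrame as a temporary view.
spark.range(10).createOrReplaceTempView("t2")
{code}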

> CREATE TEMPORARY TABLE is not supported yet
> ---
>
> Key: SPARK-14651
> URL: https://issues.apache.org/jira/browse/SPARK-14651
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0, 2.2.0
>Reporter: Jacek Laskowski
>Priority: Minor
>
> With today's master it seems that {{CREATE TEMPORARY TABLE}} may or may not 
> work depending on how complete the DDL is (?)
> {code}
> scala> sql("CREATE temporary table t2")
> 16/04/14 23:29:26 INFO HiveSqlParser: Parsing command: CREATE temporary table 
> t2
> org.apache.spark.sql.catalyst.parser.ParseException:
> CREATE TEMPORARY TABLE is not supported yet. Please use registerTempTable as 
> an alternative.(line 1, pos 0)
> == SQL ==
> CREATE temporary table t2
> ^^^
>   at 
> org.apache.spark.sql.hive.execution.HiveSqlAstBuilder$$anonfun$visitCreateTable$1.apply(HiveSqlParser.scala:169)
>   at 
> org.apache.spark.sql.hive.execution.HiveSqlAstBuilder$$anonfun$visitCreateTable$1.apply(HiveSqlParser.scala:165)
>   at 
> org.apache.spark.sql.catalyst.parser.ParserUtils$.withOrigin(ParserUtils.scala:85)
>   at 
> org.apache.spark.sql.hive.execution.HiveSqlAstBuilder.visitCreateTable(HiveSqlParser.scala:165)
>   at 
> org.apache.spark.sql.hive.execution.HiveSqlAstBuilder.visitCreateTable(HiveSqlParser.scala:53)
>   at 
> org.apache.spark.sql.catalyst.parser.SqlBaseParser$CreateTableContext.accept(SqlBaseParser.java:1049)
>   at 
> org.antlr.v4.runtime.tree.AbstractParseTreeVisitor.visit(AbstractParseTreeVisitor.java:42)
>   at 
> org.apache.spark.sql.catalyst.parser.AstBuilder$$anonfun$visitSingleStatement$1.apply(AstBuilder.scala:63)
>   at 
> org.apache.spark.sql.catalyst.parser.AstBuilder$$anonfun$visitSingleStatement$1.apply(AstBuilder.scala:63)
>   at 
> org.apache.spark.sql.catalyst.parser.ParserUtils$.withOrigin(ParserUtils.scala:85)
>   at 
> org.apache.spark.sql.catalyst.parser.AstBuilder.visitSingleStatement(AstBuilder.scala:62)
>   at 
> org.apache.spark.sql.catalyst.parser.AbstractSqlParser$$anonfun$parsePlan$1.apply(ParseDriver.scala:54)
>   at 
> org.apache.spark.sql.catalyst.parser.AbstractSqlParser$$anonfun$parsePlan$1.apply(ParseDriver.scala:53)
>   at 
> org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parse(ParseDriver.scala:86)
>   at 
> org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parsePlan(ParseDriver.scala:53)
>   at org.apache.spark.sql.SQLContext.parseSql(SQLContext.scala:198)
>   at 
> org.apache.spark.sql.hive.HiveContext.org$apache$spark$sql$hive$HiveContext$$super$parseSql(HiveContext.scala:201)
>   at 
> org.apache.spark.sql.hive.HiveContext$$anonfun$parseSql$1.apply(HiveContext.scala:201)
>   at 
> org.apache.spark.sql.hive.HiveContext$$anonfun$parseSql$1.apply(HiveContext.scala:201)
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$withHiveState$1.apply(HiveClientImpl.scala:228)
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:175)
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:174)
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:217)
>   at org.apache.spark.sql.hive.HiveContext.parseSql(HiveContext.scala:200)
>   at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:765)
>   ... 48 elided
> scala> sql("CREATE temporary table t2 USING PARQUET OPTIONS (PATH 'hello') AS 
> SELECT * FROM t1")
> 16/04/14 23:30:21 INFO HiveSqlParser: Parsing command: CREATE temporary table 
> t2 USING PARQUET OPTIONS (PATH 'hello') AS SELECT * FROM t1
> org.apache.spark.sql.AnalysisException: Table or View not found: t1; line 1 
> pos 80
>   at 
> org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$.org$apache$spark$sql$catalyst$analysis$Analyzer$ResolveRelations$$getTable(Analyzer.scala:412)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$$anonfun$apply$8.applyOrElse(Analyzer.scala:421)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$$anonfun$apply$8.applyOrElse(Analyzer.scala:416)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(Logic

[jira] [Updated] (SPARK-14651) CREATE TEMPORARY TABLE is not supported yet

2017-04-16 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14651?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-14651:
-
Affects Version/s: 2.2.0

> CREATE TEMPORARY TABLE is not supported yet
> ---
>
> Key: SPARK-14651
> URL: https://issues.apache.org/jira/browse/SPARK-14651
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0, 2.2.0
>Reporter: Jacek Laskowski
>Priority: Minor
>
> With today's master it seems that {{CREATE TEMPORARY TABLE}} may or may not 
> work depending on how complete the DDL is (?)
> {code}
> scala> sql("CREATE temporary table t2")
> 16/04/14 23:29:26 INFO HiveSqlParser: Parsing command: CREATE temporary table 
> t2
> org.apache.spark.sql.catalyst.parser.ParseException:
> CREATE TEMPORARY TABLE is not supported yet. Please use registerTempTable as 
> an alternative.(line 1, pos 0)
> == SQL ==
> CREATE temporary table t2
> ^^^
>   at 
> org.apache.spark.sql.hive.execution.HiveSqlAstBuilder$$anonfun$visitCreateTable$1.apply(HiveSqlParser.scala:169)
>   at 
> org.apache.spark.sql.hive.execution.HiveSqlAstBuilder$$anonfun$visitCreateTable$1.apply(HiveSqlParser.scala:165)
>   at 
> org.apache.spark.sql.catalyst.parser.ParserUtils$.withOrigin(ParserUtils.scala:85)
>   at 
> org.apache.spark.sql.hive.execution.HiveSqlAstBuilder.visitCreateTable(HiveSqlParser.scala:165)
>   at 
> org.apache.spark.sql.hive.execution.HiveSqlAstBuilder.visitCreateTable(HiveSqlParser.scala:53)
>   at 
> org.apache.spark.sql.catalyst.parser.SqlBaseParser$CreateTableContext.accept(SqlBaseParser.java:1049)
>   at 
> org.antlr.v4.runtime.tree.AbstractParseTreeVisitor.visit(AbstractParseTreeVisitor.java:42)
>   at 
> org.apache.spark.sql.catalyst.parser.AstBuilder$$anonfun$visitSingleStatement$1.apply(AstBuilder.scala:63)
>   at 
> org.apache.spark.sql.catalyst.parser.AstBuilder$$anonfun$visitSingleStatement$1.apply(AstBuilder.scala:63)
>   at 
> org.apache.spark.sql.catalyst.parser.ParserUtils$.withOrigin(ParserUtils.scala:85)
>   at 
> org.apache.spark.sql.catalyst.parser.AstBuilder.visitSingleStatement(AstBuilder.scala:62)
>   at 
> org.apache.spark.sql.catalyst.parser.AbstractSqlParser$$anonfun$parsePlan$1.apply(ParseDriver.scala:54)
>   at 
> org.apache.spark.sql.catalyst.parser.AbstractSqlParser$$anonfun$parsePlan$1.apply(ParseDriver.scala:53)
>   at 
> org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parse(ParseDriver.scala:86)
>   at 
> org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parsePlan(ParseDriver.scala:53)
>   at org.apache.spark.sql.SQLContext.parseSql(SQLContext.scala:198)
>   at 
> org.apache.spark.sql.hive.HiveContext.org$apache$spark$sql$hive$HiveContext$$super$parseSql(HiveContext.scala:201)
>   at 
> org.apache.spark.sql.hive.HiveContext$$anonfun$parseSql$1.apply(HiveContext.scala:201)
>   at 
> org.apache.spark.sql.hive.HiveContext$$anonfun$parseSql$1.apply(HiveContext.scala:201)
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$withHiveState$1.apply(HiveClientImpl.scala:228)
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:175)
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:174)
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:217)
>   at org.apache.spark.sql.hive.HiveContext.parseSql(HiveContext.scala:200)
>   at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:765)
>   ... 48 elided
> scala> sql("CREATE temporary table t2 USING PARQUET OPTIONS (PATH 'hello') AS 
> SELECT * FROM t1")
> 16/04/14 23:30:21 INFO HiveSqlParser: Parsing command: CREATE temporary table 
> t2 USING PARQUET OPTIONS (PATH 'hello') AS SELECT * FROM t1
> org.apache.spark.sql.AnalysisException: Table or View not found: t1; line 1 
> pos 80
>   at 
> org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$.org$apache$spark$sql$catalyst$analysis$Analyzer$ResolveRelations$$getTable(Analyzer.scala:412)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$$anonfun$apply$8.applyOrElse(Analyzer.scala:421)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$$anonfun$apply$8.applyOrElse(Analyzer.scala:416)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:58)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:58)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:68)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperators(LogicalPlan.scala:57)
>   at 
> org.apache.spark.sql.catalyst.pla

[jira] [Commented] (SPARK-14764) Spark SQL documentation should be more precise about which SQL features it supports

2017-04-16 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15970310#comment-15970310
 ] 

Hyukjin Kwon commented on SPARK-14764:
--

How about https://docs.databricks.com/spark/latest/spark-sql/index.html ?

> Spark SQL documentation should be more precise about which SQL features it 
> supports
> ---
>
> Key: SPARK-14764
> URL: https://issues.apache.org/jira/browse/SPARK-14764
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, SQL
>Affects Versions: 1.5.0
>Reporter: Jeremy Beard
>Priority: Minor
>
> Terminology such as "vast majority" and "most" is difficult to develop 
> against without a lot of trial and error. It would be excellent if the Spark 
> SQL documentation could be more precise about which SQL features it does and 
> doesn't support. In a sense this is part of the API of Spark.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15071) Check the result of all TPCDS queries

2017-04-16 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15970312#comment-15970312
 ] 

Hyukjin Kwon commented on SPARK-15071:
--

[~nirmannarang] How has it been going?

> Check the result of all TPCDS queries
> -
>
> Key: SPARK-15071
> URL: https://issues.apache.org/jira/browse/SPARK-15071
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Reporter: Davies Liu
>
> We should compare the results of all TPCDS queries against other databases that 
> support all of them (for example, IBM Big SQL, PostgreSQL).



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13644) Add the source file name and line into Logger when an exception occurs in the generated code

2017-04-16 Thread Kazuaki Ishizaki (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15970313#comment-15970313
 ] 

Kazuaki Ishizaki commented on SPARK-13644:
--

It is not resolved yet, but let me close this for now since there is no time for 
this JIRA at the moment.

> Add the source file name and line into Logger when an exception occurs in the 
> generated code
> 
>
> Key: SPARK-13644
> URL: https://issues.apache.org/jira/browse/SPARK-13644
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Kazuaki Ishizaki
>Priority: Minor
>
> This is to show a message that points out the origin of a generated method 
> when an exception occurs in the generated method at runtime.
> An example of a message (the first line is newly added)
> {code}
> 07:49:29.525 ERROR 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator: 
> The method GeneratedIterator.processNext() is generated for filter at 
> Test.scala:23
> 07:49:29.526 ERROR org.apache.spark.executor.Executor: Exception in task 1.0 
> in stage 2.0 (TID 4)
> java.lang.NullPointerException:
> at ...
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-13644) Add the source file name and line into Logger when an exception occurs in the generated code

2017-04-16 Thread Kazuaki Ishizaki (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kazuaki Ishizaki closed SPARK-13644.

Resolution: Unresolved

> Add the source file name and line into Logger when an exception occurs in the 
> generated code
> 
>
> Key: SPARK-13644
> URL: https://issues.apache.org/jira/browse/SPARK-13644
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Kazuaki Ishizaki
>Priority: Minor
>
> This is to show a message that points out the origin of a generated method 
> when an exception occurs in the generated method at runtime.
> An example of a message (the first line is newly added)
> {code}
> 07:49:29.525 ERROR 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator: 
> The method GeneratedIterator.processNext() is generated for filter at 
> Test.scala:23
> 07:49:29.526 ERROR org.apache.spark.executor.Executor: Exception in task 1.0 
> in stage 2.0 (TID 4)
> java.lang.NullPointerException:
> at ...
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14584) Improve recognition of non-nullability in Dataset transformations

2017-04-16 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15970316#comment-15970316
 ] 

Hyukjin Kwon commented on SPARK-14584:
--

[~joshrosen], I am pretty sure this has been fixed somewhere after 2.1.0; in 
2.1.0 it still reports nullable = true, as below:

{code}
scala> case class MyCaseClass(foo: Int)
defined class MyCaseClass

scala> sc.parallelize(Seq(0)).toDS.map(MyCaseClass).printSchema
root
 |-- foo: integer (nullable = true)
{code}


> Improve recognition of non-nullability in Dataset transformations
> -
>
> Key: SPARK-14584
> URL: https://issues.apache.org/jira/browse/SPARK-14584
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Josh Rosen
>
> There are many cases where we can statically know that a field will never be 
> null. For instance, a field in a case class with a primitive type will never 
> return null. However, there are currently several cases in the Dataset API 
> where we do not properly recognize this non-nullability. For instance:
> {code}
> case class MyCaseClass(foo: Int)
> sc.parallelize(Seq(0)).toDS.map(MyCaseClass).printSchema
> {code}
> claims that the {{foo}} field is nullable even though this is impossible.
> I believe that this is due to the way that we reason about nullability when 
> constructing serializer expressions in ExpressionEncoders. The following 
> assertion will currently fail if added to ExpressionEncoder:
> {code}
>   require(schema.size == serializer.size)
>   schema.fields.zip(serializer).foreach { case (field, fieldSerializer) =>
> require(field.dataType == fieldSerializer.dataType, s"Field 
> ${field.name}'s data type is " +
>   s"${field.dataType} in the schema but ${fieldSerializer.dataType} in 
> its serializer")
> require(field.nullable == fieldSerializer.nullable, s"Field 
> ${field.name}'s nullability is " +
>   s"${field.nullable} in the schema but ${fieldSerializer.nullable} in 
> its serializer")
>   }
> {code}
> Most often, the schema claims that a field is non-nullable while the encoder 
> allows for nullability, but occasionally we see a mismatch in the datatypes 
> due to disagreements over the nullability of nested structs' fields (or 
> fields of structs in arrays).
> I think the problem is that when we're reasoning about nullability in a 
> struct's schema we consider its fields' nullability to be independent of the 
> nullability of the struct itself, whereas in the serializer expressions we 
> are considering those field extraction expressions to be nullable if the 
> input objects themselves can be nullable.
> I'm not sure what's the simplest way to fix this. One proposal would be to 
> leave the serializers unchanged and have ObjectOperator derive its output 
> attributes from an explicitly-passed schema rather than using the 
> serializers' attributes. However, I worry that this might introduce bugs in 
> case the serializer and schema disagree.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20344) Duplicate call in FairSchedulableBuilder.addTaskSetManager

2017-04-16 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20344?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20344:


Assignee: (was: Apache Spark)

> Duplicate call in FairSchedulableBuilder.addTaskSetManager
> --
>
> Key: SPARK-20344
> URL: https://issues.apache.org/jira/browse/SPARK-20344
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler
>Affects Versions: 2.1.0
>Reporter: Robert Stupp
>Priority: Trivial
>
> {{org.apache.spark.scheduler.FairSchedulableBuilder#addTaskSetManager}} 
> contains the code snippet:
> {code}
>   override def addTaskSetManager(manager: Schedulable, properties: 
> Properties) {
> var poolName = DEFAULT_POOL_NAME
> var parentPool = rootPool.getSchedulableByName(poolName)
> if (properties != null) {
>   poolName = properties.getProperty(FAIR_SCHEDULER_PROPERTIES, 
> DEFAULT_POOL_NAME)
>   parentPool = rootPool.getSchedulableByName(poolName)
>   if (parentPool == null) {
> {code}
> {{parentPool = rootPool.getSchedulableByName(poolName)}} is called twice if 
> {{properties != null}}.
> I'm not sure whether this is an oversight or there's something else missing. 
> This piece of the code hasn't been modified since 2013, so I doubt that this 
> is a serious issue.
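For illustration only, one way the duplicate lookup could be folded away; this is 
a sketch of the idea, not necessarily what the linked patch/PR does, and it omits 
the rest of the method body:

{code}
override def addTaskSetManager(manager: Schedulable, properties: Properties) {
  // Resolve the pool name first, then look it up exactly once.
  val poolName =
    if (properties != null) properties.getProperty(FAIR_SCHEDULER_PROPERTIES, DEFAULT_POOL_NAME)
    else DEFAULT_POOL_NAME
  var parentPool = rootPool.getSchedulableByName(poolName)
  if (parentPool == null) {
    // ... create and register the pool as before ...
  }
  // ... remainder of the method unchanged ...
}
{code}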



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20344) Duplicate call in FairSchedulableBuilder.addTaskSetManager

2017-04-16 Thread Robert Stupp (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15970319#comment-15970319
 ] 

Robert Stupp commented on SPARK-20344:
--

Yea, true. I've updated the patch and [submitted a 
PR|https://github.com/apache/spark/pull/17647].

> Duplicate call in FairSchedulableBuilder.addTaskSetManager
> --
>
> Key: SPARK-20344
> URL: https://issues.apache.org/jira/browse/SPARK-20344
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler
>Affects Versions: 2.1.0
>Reporter: Robert Stupp
>Priority: Trivial
>
> {{org.apache.spark.scheduler.FairSchedulableBuilder#addTaskSetManager}} 
> contains the code snippet:
> {code}
>   override def addTaskSetManager(manager: Schedulable, properties: 
> Properties) {
> var poolName = DEFAULT_POOL_NAME
> var parentPool = rootPool.getSchedulableByName(poolName)
> if (properties != null) {
>   poolName = properties.getProperty(FAIR_SCHEDULER_PROPERTIES, 
> DEFAULT_POOL_NAME)
>   parentPool = rootPool.getSchedulableByName(poolName)
>   if (parentPool == null) {
> {code}
> {{parentPool = rootPool.getSchedulableByName(poolName)}} is called twice if 
> {{properties != null}}.
> I'm not sure whether this is an oversight or there's something else missing. 
> This piece of the code hasn't been modified since 2013, so I doubt that this 
> is a serious issue.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20344) Duplicate call in FairSchedulableBuilder.addTaskSetManager

2017-04-16 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20344?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20344:


Assignee: Apache Spark

> Duplicate call in FairSchedulableBuilder.addTaskSetManager
> --
>
> Key: SPARK-20344
> URL: https://issues.apache.org/jira/browse/SPARK-20344
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler
>Affects Versions: 2.1.0
>Reporter: Robert Stupp
>Assignee: Apache Spark
>Priority: Trivial
>
> {{org.apache.spark.scheduler.FairSchedulableBuilder#addTaskSetManager}} 
> contains the code snippet:
> {code}
>   override def addTaskSetManager(manager: Schedulable, properties: 
> Properties) {
> var poolName = DEFAULT_POOL_NAME
> var parentPool = rootPool.getSchedulableByName(poolName)
> if (properties != null) {
>   poolName = properties.getProperty(FAIR_SCHEDULER_PROPERTIES, 
> DEFAULT_POOL_NAME)
>   parentPool = rootPool.getSchedulableByName(poolName)
>   if (parentPool == null) {
> {code}
> {{parentPool = rootPool.getSchedulableByName(poolName)}} is called twice if 
> {{properties != null}}.
> I'm not sure whether this is an oversight or there's something else missing. 
> This piece of the code hasn't been modified since 2013, so I doubt that this 
> is a serious issue.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20344) Duplicate call in FairSchedulableBuilder.addTaskSetManager

2017-04-16 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15970318#comment-15970318
 ] 

Apache Spark commented on SPARK-20344:
--

User 'snazy' has created a pull request for this issue:
https://github.com/apache/spark/pull/17647

> Duplicate call in FairSchedulableBuilder.addTaskSetManager
> --
>
> Key: SPARK-20344
> URL: https://issues.apache.org/jira/browse/SPARK-20344
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler
>Affects Versions: 2.1.0
>Reporter: Robert Stupp
>Priority: Trivial
>
> {{org.apache.spark.scheduler.FairSchedulableBuilder#addTaskSetManager}} 
> contains the code snippet:
> {code}
>   override def addTaskSetManager(manager: Schedulable, properties: 
> Properties) {
> var poolName = DEFAULT_POOL_NAME
> var parentPool = rootPool.getSchedulableByName(poolName)
> if (properties != null) {
>   poolName = properties.getProperty(FAIR_SCHEDULER_PROPERTIES, 
> DEFAULT_POOL_NAME)
>   parentPool = rootPool.getSchedulableByName(poolName)
>   if (parentPool == null) {
> {code}
> {{parentPool = rootPool.getSchedulableByName(poolName)}} is called twice if 
> {{properties != null}}.
> I'm not sure whether this is an oversight or there's something else missing. 
> This piece of the code hasn't been modified since 2013, so I doubt that this 
> is a serious issue.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-15848) Spark unable to read partitioned table in avro format and column name in upper case

2017-04-16 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15848?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-15848.
--
Resolution: Cannot Reproduce

I am resolving this per the comment above ^.

> Spark unable to read partitioned table in avro format and column name in 
> upper case
> ---
>
> Key: SPARK-15848
> URL: https://issues.apache.org/jira/browse/SPARK-15848
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.1
>Reporter: Zhan Zhang
>
> If external partitioned Hive tables are created in Avro format,
> Spark returns "null" values when column names are in upper case in the 
> Avro schema.
> The same tables return proper data when queried in the Hive client.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-15848) Spark unable to read partitioned table in avro format and column name in upper case

2017-04-16 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15848?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15970333#comment-15970333
 ] 

Hyukjin Kwon edited comment on SPARK-15848 at 4/16/17 10:57 AM:


I am resolving this per the comment above ^. It would be great if anyone could 
identify the JIRA that fixed this and backport it if applicable.


was (Author: hyukjin.kwon):
I am resolving this per the comment above ^.

> Spark unable to read partitioned table in avro format and column name in 
> upper case
> ---
>
> Key: SPARK-15848
> URL: https://issues.apache.org/jira/browse/SPARK-15848
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.1
>Reporter: Zhan Zhang
>
> If external partitioned Hive tables are created in Avro format,
> Spark returns "null" values when column names are in upper case in the 
> Avro schema.
> The same tables return proper data when queried in the Hive client.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19851) Add support for EVERY and ANY (SOME) aggregates

2017-04-16 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19851?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15970337#comment-15970337
 ] 

Apache Spark commented on SPARK-19851:
--

User 'ptkool' has created a pull request for this issue:
https://github.com/apache/spark/pull/17648

> Add support for EVERY and ANY (SOME) aggregates
> ---
>
> Key: SPARK-19851
> URL: https://issues.apache.org/jira/browse/SPARK-19851
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Affects Versions: 2.1.0
>Reporter: Michael Styles
>
> Add support for EVERY and ANY (SOME) aggregates.
> - EVERY returns true if all input values are true.
> - ANY returns true if at least one input value is true.
> - SOME is equivalent to ANY.
> Both aggregates are part of the SQL standard.
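Until such built-ins exist, the intended semantics can be approximated with 
existing aggregates (ignoring NULL and empty-input edge cases); the table and 
column names below are placeholders:

{code}
import spark.implicits._

Seq(true, true, false).toDF("flag").createOrReplaceTempView("t")

spark.sql("""
  SELECT
    MIN(CASE WHEN flag THEN 1 ELSE 0 END) = 1 AS every_flag,
    MAX(CASE WHEN flag THEN 1 ELSE 0 END) = 1 AS any_flag
  FROM t
""").show()
{code}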



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16544) Support for conversion from compatible schema for Parquet data source when data types are not matched

2017-04-16 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16544?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-16544:
-
Affects Version/s: 2.2.0

> Support for conversion from compatible schema for Parquet data source when 
> data types are not matched
> -
>
> Key: SPARK-16544
> URL: https://issues.apache.org/jira/browse/SPARK-16544
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.1.0, 2.2.0
>Reporter: Hyukjin Kwon
>
> This deals with scenario 1 - case - 1 from the parent issue.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-16562) Do not allow downcast in INT32 based types for non-vectorized Parquet reader

2017-04-16 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16562?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-16562.
--
Resolution: Not A Bug

This seems not a problem to me. Please my PR linked.

> Do not allow downcast in INT32 based types for non-vectorized Parquet reader
> 
>
> Key: SPARK-16562
> URL: https://issues.apache.org/jira/browse/SPARK-16562
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Hyukjin Kwon
>Priority: Minor
>
> Currently, INT32-based types ({{ShortType}}, {{ByteType}}, {{IntegerType}}) 
> can be downcast in any combination. For example, the code below:
> {code}
> val path = "/tmp/test.parquet"
> val data = (1 to 4).map(Tuple1(_.toInt))
> data.toDF("a").write.parquet(path)
> val schema = StructType(StructField("a", ShortType, true) :: Nil)
> spark.read.schema(schema).parquet(path).show()
> {code}
> works fine.
> This should not be allowed.
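A sketch of the explicit alternative, reusing the path from the example above: 
read with the type recorded in the Parquet footer and downcast deliberately with 
a cast, so any narrowing is visible in the query:

{code}
import org.apache.spark.sql.types.ShortType

val df = spark.read.parquet("/tmp/test.parquet")          // column "a" keeps its written int type
df.select(df("a").cast(ShortType).as("a")).printSchema()  // narrowing is now explicit
{code}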



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-16562) Do not allow downcast in INT32 based types for non-vectorized Parquet reader

2017-04-16 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16562?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-16562.
--
Resolution: Invalid

> Do not allow downcast in INT32 based types for non-vectorized Parquet reader
> 
>
> Key: SPARK-16562
> URL: https://issues.apache.org/jira/browse/SPARK-16562
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Hyukjin Kwon
>Priority: Minor
>
> Currently, INT32-based types ({{ShortType}}, {{ByteType}}, {{IntegerType}}) 
> can be downcast in any combination. For example, the code below:
> {code}
> val path = "/tmp/test.parquet"
> val data = (1 to 4).map(Tuple1(_.toInt))
> data.toDF("a").write.parquet(path)
> val schema = StructType(StructField("a", ShortType, true) :: Nil)
> spark.read.schema(schema).parquet(path).show()
> {code}
> works fine.
> This should not be allowed.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-16562) Do not allow downcast in INT32 based types for non-vectorized Parquet reader

2017-04-16 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16562?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reopened SPARK-16562:
--

> Do not allow downcast in INT32 based types for non-vectorized Parquet reader
> 
>
> Key: SPARK-16562
> URL: https://issues.apache.org/jira/browse/SPARK-16562
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Hyukjin Kwon
>Priority: Minor
>
> Currently, INT32-based types ({{ShortType}}, {{ByteType}}, {{IntegerType}}) 
> can be downcast in any combination. For example, the code below:
> {code}
> val path = "/tmp/test.parquet"
> val data = (1 to 4).map(Tuple1(_.toInt))
> data.toDF("a").write.parquet(path)
> val schema = StructType(StructField("a", ShortType, true) :: Nil)
> spark.read.schema(schema).parquet(path).show()
> {code}
> works fine.
> This should not be allowed.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-16562) Do not allow downcast in INT32 based types for non-vectorized Parquet reader

2017-04-16 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15970340#comment-15970340
 ] 

Hyukjin Kwon edited comment on SPARK-16562 at 4/16/17 11:16 AM:


This seems not a problem to me. Please see my PR linked.


was (Author: hyukjin.kwon):
This seems not a problem to me. Please my PR linked.

> Do not allow downcast in INT32 based types for non-vectorized Parquet reader
> 
>
> Key: SPARK-16562
> URL: https://issues.apache.org/jira/browse/SPARK-16562
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Hyukjin Kwon
>Priority: Minor
>
> Currently, INT32-based types ({{ShortType}}, {{ByteType}}, {{IntegerType}}) 
> can be downcast in any combination. For example, the code below:
> {code}
> val path = "/tmp/test.parquet"
> val data = (1 to 4).map(Tuple1(_.toInt))
> data.toDF("a").write.parquet(path)
> val schema = StructType(StructField("a", ShortType, true) :: Nil)
> spark.read.schema(schema).parquet(path).show()
> {code}
> works fine.
> This should not be allowed.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-16604) Spark2.0 fail in executing the sql statement which includes partition field in the "select" statement while spark1.6 supports

2017-04-16 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-16604.
--
Resolution: Cannot Reproduce

It sounds almost impossible to reproduce, so I am resolving this. Unless the 
reporter is active and keeps reproducing it against the current master, I 
think no one is going to reproduce or resolve this.

Please reopen this if it can be reproduced in the current master.


> Spark2.0 fail in executing the sql statement which includes partition field 
> in the "select" statement while spark1.6 supports
> -
>
> Key: SPARK-16604
> URL: https://issues.apache.org/jira/browse/SPARK-16604
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: marymwu
>
> Spark2.0 fail in executing the sql statement which includes partition field 
> in the "select" statement
> error:
> 16/07/14 16:10:47 INFO HiveThriftServer2: set 
> sessionId(69e92ba1-4be2-4be9-bc81-7a00c5802ef8) to 
> exeId(c93f69b0-0f6e-4f07-afdc-ca6c41045fa3)
> 16/07/14 16:10:47 INFO SparkSqlParser: Parsing command: INSERT OVERWRITE 
> TABLE 
> d_avatar.RPS__H_REPORT_MORE_DIMENSION_MORE_NORM_FIRST_CHANNEL_VCD_IMPALA 
> PARTITION(p_event_date='2016-07-13')
> select 
> app_key,
> app_version,
> app_channel,
> device_model,
> total_num,
> new_num,
> active_num,
> extant_num,
> visits_num,
> start_num,
> p_event_date
> from RPS__H_REPORT_MORE_DIMENSION_MORE_NORM_FIRST_CHANNEL_VCD where 
> p_event_date = '2016-07-13'
> 16/07/14 16:10:47 INFO ThriftHttpServlet: Could not validate cookie sent, 
> will try to generate a new cookie
> 16/07/14 16:10:47 INFO ThriftHttpServlet: Cookie added for clientUserName hive
> 16/07/14 16:10:47 INFO HiveMetaStore: 108: get_table : db=default 
> tbl=rps__h_report_more_dimension_more_norm_first_channel_vcd
> 16/07/14 16:10:47 INFO audit: ugi=u_reaperip=unknown-ip-addr  
> cmd=get_table : db=default 
> tbl=rps__h_report_more_dimension_more_norm_first_channel_vcd 
> 16/07/14 16:10:47 INFO HiveMetaStore: 108: Opening raw store with 
> implemenation class:org.apache.hadoop.hive.metastore.ObjectStore
> 16/07/14 16:10:47 INFO ObjectStore: ObjectStore, initialize called
> 16/07/14 16:10:47 INFO ThriftHttpServlet: Could not validate cookie sent, 
> will try to generate a new cookie
> 16/07/14 16:10:47 INFO ThriftHttpServlet: Cookie added for clientUserName hive
> 16/07/14 16:10:47 INFO Query: Reading in results for query 
> "org.datanucleus.store.rdbms.query.SQLQuery@0" since the connection used is 
> closing
> 16/07/14 16:10:47 INFO MetaStoreDirectSql: Using direct SQL, underlying DB is 
> MYSQL
> 16/07/14 16:10:47 INFO ObjectStore: Initialized ObjectStore
> 16/07/14 16:10:47 INFO CatalystSqlParser: Parsing command: string
> 16/07/14 16:10:47 INFO CatalystSqlParser: Parsing command: string
> 16/07/14 16:10:47 INFO CatalystSqlParser: Parsing command: string
> 16/07/14 16:10:47 INFO CatalystSqlParser: Parsing command: string
> 16/07/14 16:10:47 INFO CatalystSqlParser: Parsing command: string
> 16/07/14 16:10:47 INFO CatalystSqlParser: Parsing command: bigint
> 16/07/14 16:10:47 INFO CatalystSqlParser: Parsing command: bigint
> 16/07/14 16:10:47 INFO CatalystSqlParser: Parsing command: bigint
> 16/07/14 16:10:47 INFO CatalystSqlParser: Parsing command: bigint
> 16/07/14 16:10:47 INFO CatalystSqlParser: Parsing command: bigint
> 16/07/14 16:10:47 INFO CatalystSqlParser: Parsing command: bigint
> 16/07/14 16:10:47 INFO HiveMetaStore: 108: get_table : db=d_avatar 
> tbl=rps__h_report_more_dimension_more_norm_first_channel_vcd_impala
> 16/07/14 16:10:47 INFO audit: ugi=u_reaperip=unknown-ip-addr  
> cmd=get_table : db=d_avatar 
> tbl=rps__h_report_more_dimension_more_norm_first_channel_vcd_impala 
> 16/07/14 16:10:47 INFO CatalystSqlParser: Parsing command: string
> 16/07/14 16:10:47 INFO CatalystSqlParser: Parsing command: string
> 16/07/14 16:10:47 INFO CatalystSqlParser: Parsing command: string
> 16/07/14 16:10:47 INFO CatalystSqlParser: Parsing command: string
> 16/07/14 16:10:47 INFO CatalystSqlParser: Parsing command: string
> 16/07/14 16:10:47 INFO CatalystSqlParser: Parsing command: bigint
> 16/07/14 16:10:47 INFO CatalystSqlParser: Parsing command: bigint
> 16/07/14 16:10:47 INFO CatalystSqlParser: Parsing command: bigint
> 16/07/14 16:10:47 INFO CatalystSqlParser: Parsing command: bigint
> 16/07/14 16:10:47 INFO CatalystSqlParser: Parsing command: bigint
> 16/07/14 16:10:47 INFO CatalystSqlParser: Parsing command: bigint
> 16/07/14 16:10:49 WARN HiveSessionState$$anon$1: Max iterations (100) reached 
> for batch Resolution
> 16/07/14 16:10:49 ERROR SparkExecuteStatementOperation: Error executing 
> query, currentState RUNNING, 
> org.apache.spark.sql.AnalysisExcept

[jira] [Commented] (SPARK-16892) flatten function to get flat array (or map) column from array of array (or array of map) column

2017-04-16 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15970348#comment-15970348
 ] 

Hyukjin Kwon commented on SPARK-16892:
--

Maybe you are looking for something like this?

{code}
scala> Seq(Tuple1(Array(Array(1, 2), Array(3, 4)))).toDS.map(r => r._1.flatten).show()
+------------+
|       value|
+------------+
|[1, 2, 3, 4]|
+------------+
{code}
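
For the array-of-map half of the request, a similar one-liner might work as a 
workaround. This is only a sketch under the assumption that the usual 
spark-shell implicits are in scope; I have not verified it on every Spark version:

{code}
// Flatten an array-of-maps column into a single map column (illustrative only).
val ds = Seq(Tuple1(Array(Map(1 -> "one", 2 -> "two"), Map(0 -> "zero"), Map(4 -> "four")))).toDS
// Array[Map[K, V]].flatten yields an Array[(K, V)]; .toMap merges it into one map
// (later keys win on duplicates).
ds.map(r => Tuple1(r._1.flatten.toMap)).show(false)
{code}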

> flatten function to get flat array (or map) column from array of array (or 
> array of map) column
> ---
>
> Key: SPARK-16892
> URL: https://issues.apache.org/jira/browse/SPARK-16892
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Kapil Singh
>
> flatten(input)
> Converts input of array of array type into flat array type by inserting 
> elements of all element arrays into a single array. Example:
> input: [[1, 2, 3], [4, 5], [-1, -2, 0]]
> output: [1, 2, 3, 4, 5, -1, -2, 0]
> Converts input of array of map type into flat map type by inserting key-value 
> pairs of all element maps into a single map. Example:
> input: [(1 -> "one", 2 -> "two"), (0 -> "zero"), (4 -> "four")]
> output: (1 -> "one", 2 -> "two", 0 -> "zero", 4 -> "four")



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14764) Spark SQL documentation should be more precise about which SQL features it supports

2017-04-16 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15970350#comment-15970350
 ] 

Sean Owen commented on SPARK-14764:
---

Can we simply port that document into the main project docs?

> Spark SQL documentation should be more precise about which SQL features it 
> supports
> ---
>
> Key: SPARK-14764
> URL: https://issues.apache.org/jira/browse/SPARK-14764
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, SQL
>Affects Versions: 1.5.0
>Reporter: Jeremy Beard
>Priority: Minor
>
> Terminology such as "vast majority" and "most" is difficult to develop 
> against without a lot of trial and error. It would be excellent if the Spark 
> SQL documentation could be more precise about which SQL features it does and 
> doesn't support. In a sense this is part of the API of Spark.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-19608) setup.py missing reference to pyspark.ml.param

2017-04-16 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19608?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-19608.
--
Resolution: Duplicate

Seems added in 
https://github.com/apache/spark/commit/965c82d8c4b7f2d4dfbc45ec4d47d6b6588094c3

Please reopen this if I misunderstood.

> setup.py missing reference to pyspark.ml.param
> --
>
> Key: SPARK-19608
> URL: https://issues.apache.org/jira/browse/SPARK-19608
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.1.0
>Reporter: Arthur Tacca
>Priority: Minor
>
> The setup.py written for SPARK-1267 is missing "pyspark.ml.param" in the 
> packages list. Even better than just adding it would be to replace the 
> manually-updated list of packages with {{find_packages()}}, which is already 
> imported in this file, perhaps with {{+['pyspark.bin', 'pyspark.jars', ...}} 
> if necessary. (Just {{find_packages()}} without any extras worked fine for 
> me, but I'm just using setup.py to allow installing {{pyspark}} with pip to 
> connect to a standalone cluster rather than any of this fancy jar-copying 
> stuff.)



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20023) Can not see table comment when describe formatted table

2017-04-16 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20023?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15970360#comment-15970360
 ] 

Apache Spark commented on SPARK-20023:
--

User 'sujith71955' has created a pull request for this issue:
https://github.com/apache/spark/pull/17649

> Can not see table comment when describe formatted table
> ---
>
> Key: SPARK-20023
> URL: https://issues.apache.org/jira/browse/SPARK-20023
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: chenerlu
>Assignee: Xiao Li
> Fix For: 2.2.0
>
>
> Spark 2.x implements create table by itself.
> https://github.com/apache/spark/commit/7d2ed8cc030f3d84fea47fded072c320c3d87ca7
> But in the implementation mentioned above, it removes the table comment from 
> the properties, so users cannot see the table comment by running "describe 
> formatted table". Similarly, when a user alters the table comment, they still 
> cannot see the change by running "describe formatted table".
> I wonder why we removed table comments; is this a bug?



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20023) Can not see table comment when describe formatted table

2017-04-16 Thread Sujith (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20023?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15970365#comment-15970365
 ] 

Sujith commented on SPARK-20023:


@chenerlu, your point is right: after executing the alter command with a newly 
added/modified table comment, the change is not reflected when we execute a desc 
formatted table query.
The table comment, which is now directly part of the CatalogTable instance, is not 
getting updated, so the old table comment is shown. To handle this, while updating 
the table properties map with the newly added/modified properties in the 
CatalogTable instance, we should also update the comment field of the CatalogTable 
with the newly added/modified comment (a sketch of the idea follows below).

I raised a PR after fixing this issue: https://github.com/apache/spark/pull/17649
Please let me know if you have any suggestions.
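
A minimal sketch of the idea, assuming a CatalogTable {{table}} and the newly 
merged properties map (illustrative only, not the exact code in the PR):

{code}
import org.apache.spark.sql.catalyst.catalog.CatalogTable

// Keep CatalogTable.comment in sync with the "comment" entry of the merged properties.
def withUpdatedComment(table: CatalogTable, newProps: Map[String, String]): CatalogTable = {
  val mergedProps = table.properties ++ newProps
  table.copy(
    properties = mergedProps,
    comment = mergedProps.get("comment").orElse(table.comment))
}
{code}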

> Can not see table comment when describe formatted table
> ---
>
> Key: SPARK-20023
> URL: https://issues.apache.org/jira/browse/SPARK-20023
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: chenerlu
>Assignee: Xiao Li
> Fix For: 2.2.0
>
>
> Spark 2.x implements create table by itself.
> https://github.com/apache/spark/commit/7d2ed8cc030f3d84fea47fded072c320c3d87ca7
> But in the implementation mentioned above, it removes the table comment from 
> the properties, so users cannot see the table comment by running "describe 
> formatted table". Similarly, when a user alters the table comment, they still 
> cannot see the change by running "describe formatted table".
> I wonder why we removed table comments; is this a bug?



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14764) Spark SQL documentation should be more precise about which SQL features it supports

2017-04-16 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15970368#comment-15970368
 ] 

Hyukjin Kwon commented on SPARK-14764:
--

(For me, I would like to have this one, but I don't know if it is okay because 
it looks like it is Databricks's property.)

> Spark SQL documentation should be more precise about which SQL features it 
> supports
> ---
>
> Key: SPARK-14764
> URL: https://issues.apache.org/jira/browse/SPARK-14764
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, SQL
>Affects Versions: 1.5.0
>Reporter: Jeremy Beard
>Priority: Minor
>
> Terminology such as "vast majority" and "most" is difficult to develop 
> against without a lot of trial and error. It would be excellent if the Spark 
> SQL documentation could be more precise about which SQL features it does and 
> doesn't support. In a sense this is part of the API of Spark.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-20350) Apply Complementation Laws during boolean expression simplification

2017-04-16 Thread Michael Styles (JIRA)
Michael Styles created SPARK-20350:
--

 Summary: Apply Complementation Laws during boolean expression 
simplification
 Key: SPARK-20350
 URL: https://issues.apache.org/jira/browse/SPARK-20350
 Project: Spark
  Issue Type: Improvement
  Components: Optimizer
Affects Versions: 2.1.0
Reporter: Michael Styles


Apply Complementation Laws during boolean expression simplification.

* A AND NOT(A) == FALSE
* A OR NOT(A) == TRUE



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20350) Apply Complementation Laws during boolean expression simplification

2017-04-16 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20350?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15970372#comment-15970372
 ] 

Apache Spark commented on SPARK-20350:
--

User 'ptkool' has created a pull request for this issue:
https://github.com/apache/spark/pull/17650

> Apply Complementation Laws during boolean expression simplification
> ---
>
> Key: SPARK-20350
> URL: https://issues.apache.org/jira/browse/SPARK-20350
> Project: Spark
>  Issue Type: Improvement
>  Components: Optimizer
>Affects Versions: 2.1.0
>Reporter: Michael Styles
>
> Apply Complementation Laws during boolean expression simplification.
> * A AND NOT(A) == FALSE
> * A OR NOT(A) == TRUE



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20350) Apply Complementation Laws during boolean expression simplification

2017-04-16 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20350?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20350:


Assignee: Apache Spark

> Apply Complementation Laws during boolean expression simplification
> ---
>
> Key: SPARK-20350
> URL: https://issues.apache.org/jira/browse/SPARK-20350
> Project: Spark
>  Issue Type: Improvement
>  Components: Optimizer
>Affects Versions: 2.1.0
>Reporter: Michael Styles
>Assignee: Apache Spark
>
> Apply Complementation Laws during boolean expression simplification.
> * A AND NOT(A) == FALSE
> * A OR NOT(A) == TRUE



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20350) Apply Complementation Laws during boolean expression simplification

2017-04-16 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20350?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20350:


Assignee: (was: Apache Spark)

> Apply Complementation Laws during boolean expression simplification
> ---
>
> Key: SPARK-20350
> URL: https://issues.apache.org/jira/browse/SPARK-20350
> Project: Spark
>  Issue Type: Improvement
>  Components: Optimizer
>Affects Versions: 2.1.0
>Reporter: Michael Styles
>
> Apply Complementation Laws during boolean expression simplification.
> * A AND NOT(A) == FALSE
> * A OR NOT(A) == TRUE



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13680) Java UDAF with more than one intermediate argument returns wrong results

2017-04-16 Thread Yael Aharon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15970390#comment-15970390
 ] 

Yael Aharon commented on SPARK-13680:
-

This has been fixed in spark 1.6. It can probably be closed.

> Java UDAF with more than one intermediate argument returns wrong results
> 
>
> Key: SPARK-13680
> URL: https://issues.apache.org/jira/browse/SPARK-13680
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
> Environment: CDH 5.5.2
>Reporter: Yael Aharon
> Attachments: data.csv, setup.hql
>
>
> I am trying to incorporate the Java UDAF from 
> https://github.com/apache/spark/blob/master/sql/hive/src/test/java/org/apache/spark/sql/hive/aggregate/MyDoubleAvg.java
>  into an SQL query. 
> I registered the UDAF like this:
>  sqlContext.udf().register("myavg", new MyDoubleAvg());
> My SQL query is:
> SELECT AVG(seqi) AS `avg_seqi`, AVG(seqd) AS `avg_seqd`, AVG(ci) AS `avg_ci`, 
> AVG(cd) AS `avg_cd`, AVG(stdevd) AS `avg_stdevd`, AVG(stdevi) AS 
> `avg_stdevi`, MAX(seqi) AS `max_seqi`, MAX(seqd) AS `max_seqd`, MAX(ci) AS 
> `max_ci`, MAX(cd) AS `max_cd`, MAX(stdevd) AS `max_stdevd`, MAX(stdevi) AS 
> `max_stdevi`, MIN(seqi) AS `min_seqi`, MIN(seqd) AS `min_seqd`, MIN(ci) AS 
> `min_ci`, MIN(cd) AS `min_cd`, MIN(stdevd) AS `min_stdevd`, MIN(stdevi) AS 
> `min_stdevi`,SUM(seqi) AS `sum_seqi`, SUM(seqd) AS `sum_seqd`, SUM(ci) AS 
> `sum_ci`, SUM(cd) AS `sum_cd`, SUM(stdevd) AS `sum_stdevd`, SUM(stdevi) AS 
> `sum_stdevi`, myavg(seqd) as `myavg_seqd`,  AVG(zero) AS `avg_zero`, 
> AVG(nulli) AS `avg_nulli`,AVG(nulld) AS `avg_nulld`, SUM(zero) AS `sum_zero`, 
> SUM(nulli) AS `sum_nulli`,SUM(nulld) AS `sum_nulld`,MAX(zero) AS `max_zero`, 
> MAX(nulli) AS `max_nulli`,MAX(nulld) AS `max_nulld`,count( * ) AS 
> `count_all`, count(nulli) AS `count_nulli` FROM mytable
> As soon as I add the UDAF myavg to the SQL, all the results become incorrect. 
> When I remove the call to the UDAF, the results are correct.
> I was able to go around the issue by modifying bufferSchema of the UDAF to 
> use an array and the corresponding update and merge methods. 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-19740) Spark executor always runs as root when running on mesos

2017-04-16 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19740?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-19740.
---
   Resolution: Fixed
Fix Version/s: 2.2.0

Issue resolved by pull request 17109
[https://github.com/apache/spark/pull/17109]

> Spark executor always runs as root when running on mesos
> 
>
> Key: SPARK-19740
> URL: https://issues.apache.org/jira/browse/SPARK-19740
> Project: Spark
>  Issue Type: Improvement
>  Components: Mesos
>Affects Versions: 2.1.0
>Reporter: Ji Yan
>Priority: Minor
> Fix For: 2.2.0
>
>
> When running Spark on Mesos with docker containerizer, the spark executors 
> are always launched with 'docker run' command without specifying --user 
> option, which always results in spark executors running as root. Mesos has a 
> way to support arbitrary parameters. Spark could use that to expose setting the 
> user.
> Background on Mesos arbitrary parameters support: 
> https://issues.apache.org/jira/browse/MESOS-1816



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19740) Spark executor always runs as root when running on mesos

2017-04-16 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19740?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen reassigned SPARK-19740:
-

Assignee: Ji Yan

> Spark executor always runs as root when running on mesos
> 
>
> Key: SPARK-19740
> URL: https://issues.apache.org/jira/browse/SPARK-19740
> Project: Spark
>  Issue Type: Improvement
>  Components: Mesos
>Affects Versions: 2.1.0
>Reporter: Ji Yan
>Assignee: Ji Yan
>Priority: Minor
> Fix For: 2.2.0
>
>
> When running Spark on Mesos with docker containerizer, the spark executors 
> are always launched with 'docker run' command without specifying --user 
> option, which always results in spark executors running as root. Mesos has a 
> way to support arbitrary parameters. Spark could use that to expose setting the 
> user.
> Background on Mesos arbitrary parameters support: 
> https://issues.apache.org/jira/browse/MESOS-1816



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-20343) SBT master build for Hadoop 2.6 in Jenkins fails due to Avro version resolution

2017-04-16 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20343?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-20343.
---
   Resolution: Fixed
Fix Version/s: 2.2.0

Issue resolved by pull request 17642
[https://github.com/apache/spark/pull/17642]

> SBT master build for Hadoop 2.6 in Jenkins fails due to Avro version 
> resolution 
> 
>
> Key: SPARK-20343
> URL: https://issues.apache.org/jira/browse/SPARK-20343
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.2.0
>Reporter: Hyukjin Kwon
> Fix For: 2.2.0
>
>
> Please refer https://github.com/apache/spark/pull/17477#issuecomment-293942637
> {quote}
> [error] 
> /home/jenkins/workspace/spark-master-test-sbt-hadoop-2.6/core/src/main/scala/org/apache/spark/serializer/GenericAvroSerializer.scala:123:
>  value createDatumWriter is not a member of 
> org.apache.avro.generic.GenericData
> [error] writerCache.getOrElseUpdate(schema, 
> GenericData.get.createDatumWriter(schema))
> [error] 
> {quote}
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-sbt-hadoop-2.6/2770/consoleFull
> It seems sbt resolves the Avro version differently from Maven in some cases.
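
For anyone hitting the same missing-method error in their own sbt project, one 
possible workaround (illustrative only, not the actual fix in the Spark build) 
is to pin the Avro version explicitly so that an older transitive Avro pulled in 
by Hadoop does not win the resolution:

{code}
// In build.sbt: force an Avro version that provides GenericData.createDatumWriter.
dependencyOverrides += "org.apache.avro" % "avro" % "1.7.7"
{code}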



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20343) SBT master build for Hadoop 2.6 in Jenkins fails due to Avro version resolution

2017-04-16 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20343?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen reassigned SPARK-20343:
-

Assignee: Hyukjin Kwon

> SBT master build for Hadoop 2.6 in Jenkins fails due to Avro version 
> resolution 
> 
>
> Key: SPARK-20343
> URL: https://issues.apache.org/jira/browse/SPARK-20343
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.2.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
> Fix For: 2.2.0
>
>
> Please refer https://github.com/apache/spark/pull/17477#issuecomment-293942637
> {quote}
> [error] 
> /home/jenkins/workspace/spark-master-test-sbt-hadoop-2.6/core/src/main/scala/org/apache/spark/serializer/GenericAvroSerializer.scala:123:
>  value createDatumWriter is not a member of 
> org.apache.avro.generic.GenericData
> [error] writerCache.getOrElseUpdate(schema, 
> GenericData.get.createDatumWriter(schema))
> [error] 
> {quote}
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-sbt-hadoop-2.6/2770/consoleFull
> It seems sbt resolves the Avro version differently from Maven in some cases.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-13680) Java UDAF with more than one intermediate argument returns wrong results

2017-04-16 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15970399#comment-15970399
 ] 

Hyukjin Kwon edited comment on SPARK-13680 at 4/16/17 1:40 PM:
---

Thank you so much for your confirmation. I am resolving this as Cannot 
Reproduce per the guidelines (it cannot be reproduced in the master). It would 
probably be nicer if anyone identifies the JIRA that fixed it and backports it 
if applicable.


was (Author: hyukjin.kwon):
Thank you so much for your confirmation. I am resolving this as a Cannot 
Reproduce as the guide lines. Probably it would be nicer if anyone identifies 
the JIRA and backports if applicable.

> Java UDAF with more than one intermediate argument returns wrong results
> 
>
> Key: SPARK-13680
> URL: https://issues.apache.org/jira/browse/SPARK-13680
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
> Environment: CDH 5.5.2
>Reporter: Yael Aharon
> Attachments: data.csv, setup.hql
>
>
> I am trying to incorporate the Java UDAF from 
> https://github.com/apache/spark/blob/master/sql/hive/src/test/java/org/apache/spark/sql/hive/aggregate/MyDoubleAvg.java
>  into an SQL query. 
> I registered the UDAF like this:
>  sqlContext.udf().register("myavg", new MyDoubleAvg());
> My SQL query is:
> SELECT AVG(seqi) AS `avg_seqi`, AVG(seqd) AS `avg_seqd`, AVG(ci) AS `avg_ci`, 
> AVG(cd) AS `avg_cd`, AVG(stdevd) AS `avg_stdevd`, AVG(stdevi) AS 
> `avg_stdevi`, MAX(seqi) AS `max_seqi`, MAX(seqd) AS `max_seqd`, MAX(ci) AS 
> `max_ci`, MAX(cd) AS `max_cd`, MAX(stdevd) AS `max_stdevd`, MAX(stdevi) AS 
> `max_stdevi`, MIN(seqi) AS `min_seqi`, MIN(seqd) AS `min_seqd`, MIN(ci) AS 
> `min_ci`, MIN(cd) AS `min_cd`, MIN(stdevd) AS `min_stdevd`, MIN(stdevi) AS 
> `min_stdevi`,SUM(seqi) AS `sum_seqi`, SUM(seqd) AS `sum_seqd`, SUM(ci) AS 
> `sum_ci`, SUM(cd) AS `sum_cd`, SUM(stdevd) AS `sum_stdevd`, SUM(stdevi) AS 
> `sum_stdevi`, myavg(seqd) as `myavg_seqd`,  AVG(zero) AS `avg_zero`, 
> AVG(nulli) AS `avg_nulli`,AVG(nulld) AS `avg_nulld`, SUM(zero) AS `sum_zero`, 
> SUM(nulli) AS `sum_nulli`,SUM(nulld) AS `sum_nulld`,MAX(zero) AS `max_zero`, 
> MAX(nulli) AS `max_nulli`,MAX(nulld) AS `max_nulld`,count( * ) AS 
> `count_all`, count(nulli) AS `count_nulli` FROM mytable
> As soon as I add the UDAF myavg to the SQL, all the results become incorrect. 
> When I remove the call to the UDAF, the results are correct.
> I was able to go around the issue by modifying bufferSchema of the UDAF to 
> use an array and the corresponding update and merge methods. 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13680) Java UDAF with more than one intermediate argument returns wrong results

2017-04-16 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15970399#comment-15970399
 ] 

Hyukjin Kwon commented on SPARK-13680:
--

Thank you so much for your confirmation. I am resolving this as Cannot 
Reproduce per the guidelines. It would probably be nicer if anyone identifies 
the JIRA that fixed it and backports it if applicable.

> Java UDAF with more than one intermediate argument returns wrong results
> 
>
> Key: SPARK-13680
> URL: https://issues.apache.org/jira/browse/SPARK-13680
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
> Environment: CDH 5.5.2
>Reporter: Yael Aharon
> Attachments: data.csv, setup.hql
>
>
> I am trying to incorporate the Java UDAF from 
> https://github.com/apache/spark/blob/master/sql/hive/src/test/java/org/apache/spark/sql/hive/aggregate/MyDoubleAvg.java
>  into an SQL query. 
> I registered the UDAF like this:
>  sqlContext.udf().register("myavg", new MyDoubleAvg());
> My SQL query is:
> SELECT AVG(seqi) AS `avg_seqi`, AVG(seqd) AS `avg_seqd`, AVG(ci) AS `avg_ci`, 
> AVG(cd) AS `avg_cd`, AVG(stdevd) AS `avg_stdevd`, AVG(stdevi) AS 
> `avg_stdevi`, MAX(seqi) AS `max_seqi`, MAX(seqd) AS `max_seqd`, MAX(ci) AS 
> `max_ci`, MAX(cd) AS `max_cd`, MAX(stdevd) AS `max_stdevd`, MAX(stdevi) AS 
> `max_stdevi`, MIN(seqi) AS `min_seqi`, MIN(seqd) AS `min_seqd`, MIN(ci) AS 
> `min_ci`, MIN(cd) AS `min_cd`, MIN(stdevd) AS `min_stdevd`, MIN(stdevi) AS 
> `min_stdevi`,SUM(seqi) AS `sum_seqi`, SUM(seqd) AS `sum_seqd`, SUM(ci) AS 
> `sum_ci`, SUM(cd) AS `sum_cd`, SUM(stdevd) AS `sum_stdevd`, SUM(stdevi) AS 
> `sum_stdevi`, myavg(seqd) as `myavg_seqd`,  AVG(zero) AS `avg_zero`, 
> AVG(nulli) AS `avg_nulli`,AVG(nulld) AS `avg_nulld`, SUM(zero) AS `sum_zero`, 
> SUM(nulli) AS `sum_nulli`,SUM(nulld) AS `sum_nulld`,MAX(zero) AS `max_zero`, 
> MAX(nulli) AS `max_nulli`,MAX(nulld) AS `max_nulld`,count( * ) AS 
> `count_all`, count(nulli) AS `count_nulli` FROM mytable
> As soon as I add the UDAF myavg to the SQL, all the results become incorrect. 
> When I remove the call to the UDAF, the results are correct.
> I was able to go around the issue by modifying bufferSchema of the UDAF to 
> use an array and the corresponding update and merge methods. 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-13680) Java UDAF with more than one intermediate argument returns wrong results

2017-04-16 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13680?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-13680.
--
Resolution: Cannot Reproduce

> Java UDAF with more than one intermediate argument returns wrong results
> 
>
> Key: SPARK-13680
> URL: https://issues.apache.org/jira/browse/SPARK-13680
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
> Environment: CDH 5.5.2
>Reporter: Yael Aharon
> Attachments: data.csv, setup.hql
>
>
> I am trying to incorporate the Java UDAF from 
> https://github.com/apache/spark/blob/master/sql/hive/src/test/java/org/apache/spark/sql/hive/aggregate/MyDoubleAvg.java
>  into an SQL query. 
> I registered the UDAF like this:
>  sqlContext.udf().register("myavg", new MyDoubleAvg());
> My SQL query is:
> SELECT AVG(seqi) AS `avg_seqi`, AVG(seqd) AS `avg_seqd`, AVG(ci) AS `avg_ci`, 
> AVG(cd) AS `avg_cd`, AVG(stdevd) AS `avg_stdevd`, AVG(stdevi) AS 
> `avg_stdevi`, MAX(seqi) AS `max_seqi`, MAX(seqd) AS `max_seqd`, MAX(ci) AS 
> `max_ci`, MAX(cd) AS `max_cd`, MAX(stdevd) AS `max_stdevd`, MAX(stdevi) AS 
> `max_stdevi`, MIN(seqi) AS `min_seqi`, MIN(seqd) AS `min_seqd`, MIN(ci) AS 
> `min_ci`, MIN(cd) AS `min_cd`, MIN(stdevd) AS `min_stdevd`, MIN(stdevi) AS 
> `min_stdevi`,SUM(seqi) AS `sum_seqi`, SUM(seqd) AS `sum_seqd`, SUM(ci) AS 
> `sum_ci`, SUM(cd) AS `sum_cd`, SUM(stdevd) AS `sum_stdevd`, SUM(stdevi) AS 
> `sum_stdevi`, myavg(seqd) as `myavg_seqd`,  AVG(zero) AS `avg_zero`, 
> AVG(nulli) AS `avg_nulli`,AVG(nulld) AS `avg_nulld`, SUM(zero) AS `sum_zero`, 
> SUM(nulli) AS `sum_nulli`,SUM(nulld) AS `sum_nulld`,MAX(zero) AS `max_zero`, 
> MAX(nulli) AS `max_nulli`,MAX(nulld) AS `max_nulld`,count( * ) AS 
> `count_all`, count(nulli) AS `count_nulli` FROM mytable
> As soon as I add the UDAF myavg to the SQL, all the results become incorrect. 
> When I remove the call to the UDAF, the results are correct.
> I was able to go around the issue by modifying bufferSchema of the UDAF to 
> use an array and the corresponding update and merge methods. 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20343) SBT master build for Hadoop 2.6 in Jenkins fails due to Avro version resolution

2017-04-16 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15970424#comment-15970424
 ] 

Apache Spark commented on SPARK-20343:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/17651

> SBT master build for Hadoop 2.6 in Jenkins fails due to Avro version 
> resolution 
> 
>
> Key: SPARK-20343
> URL: https://issues.apache.org/jira/browse/SPARK-20343
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.2.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
> Fix For: 2.2.0
>
>
> Please refer https://github.com/apache/spark/pull/17477#issuecomment-293942637
> {quote}
> [error] 
> /home/jenkins/workspace/spark-master-test-sbt-hadoop-2.6/core/src/main/scala/org/apache/spark/serializer/GenericAvroSerializer.scala:123:
>  value createDatumWriter is not a member of 
> org.apache.avro.generic.GenericData
> [error] writerCache.getOrElseUpdate(schema, 
> GenericData.get.createDatumWriter(schema))
> [error] 
> {quote}
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-sbt-hadoop-2.6/2770/consoleFull
> It seems sbt resolves the Avro version differently from Maven in some cases.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-20278) Disable 'multiple_dots_linter' lint rule that is against project's code style

2017-04-16 Thread Felix Cheung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20278?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung resolved SPARK-20278.
--
  Resolution: Fixed
Assignee: Hyukjin Kwon
   Fix Version/s: 2.2.0
Target Version/s: 2.2.0

> Disable 'multiple_dots_linter' lint rule that is against project's code style
> -
>
> Key: SPARK-20278
> URL: https://issues.apache.org/jira/browse/SPARK-20278
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.2.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Minor
> Fix For: 2.2.0
>
>
> Currently, multi-dot separated variables in R are not allowed. For example,
> {code}
>  setMethod("from_json", signature(x = "Column", schema = "structType"),
> -  function(x, schema, asJsonArray = FALSE, ...) {
> +  function(x, schema, as.json.array = FALSE, ...) {
>  if (asJsonArray) {
>jschema <- callJStatic("org.apache.spark.sql.types.DataTypes",
>   "createArrayType",
> {code}
> produces an error as below:
> {code}
> R/functions.R:2462:31: style: Words within variable and function names should 
> be separated by '_' rather than '.'.
>   function(x, schema, as.json.array = FALSE, ...) {
>   ^
> {code}
> This seems against https://google.github.io/styleguide/Rguide.xml#identifiers 
> which says
> {quote}
>  The preferred form for variable names is all lower case letters and words 
> separated with dots
> {quote}
> This looks to be because lintr (https://github.com/jimhester/lintr) follows 
> http://r-pkgs.had.co.nz/style.html, as written in the README.md. A few cases 
> seem not to follow Google's guide.
> Per SPARK-6813, we follow Google's R Style Guide with few exceptions 
> https://google.github.io/styleguide/Rguide.xml. This is also merged into 
> Spark's website - https://github.com/apache/spark-website/pull/43
> Also, we have no limit on function names. This rule also looks like it affects 
> the names of functions, as written in the README.md.
> {quote}
> multiple_dots_linter: check that function and variable names are separated by 
> _ rather than ..
> {quote}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20307) SparkR: pass on setHandleInvalid to spark.mllib functions that use StringIndexer

2017-04-16 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20307?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15970475#comment-15970475
 ] 

Felix Cheung commented on SPARK-20307:
--

Thanks for reporting this!
Sounds like we should take a fix for this for 2.1.x and 2.2.x
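
For reference, a minimal sketch of the Scala-side knob that SparkR would need to 
pass through: StringIndexer already exposes handleInvalid ("error" and "skip" in 
2.1; "keep" came later), so unseen labels can be skipped instead of throwing 
"Unseen label".

{code}
import org.apache.spark.ml.feature.StringIndexer

val indexer = new StringIndexer()
  .setInputCol("someString")
  .setOutputCol("someStringIndex")
  .setHandleInvalid("skip")  // drop rows whose label was not seen during fitting
{code}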

> SparkR: pass on setHandleInvalid to spark.mllib functions that use 
> StringIndexer
> 
>
> Key: SPARK-20307
> URL: https://issues.apache.org/jira/browse/SPARK-20307
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Affects Versions: 2.1.0
>Reporter: Anne Rutten
>Priority: Minor
>
> When training a model in SparkR with string variables (tested with 
> spark.randomForest, but I assume this is valid for all spark.xx functions that 
> apply a StringIndexer under the hood), testing on a new dataset with factor 
> levels that are not in the training set will throw an "Unseen label" error. 
> I think this can be solved if there's a method to pass setHandleInvalid on to 
> the StringIndexers when calling spark.randomForest.
> code snippet:
> {code}
> # (i've run this in Zeppelin which already has SparkR and the context loaded)
> #library(SparkR)
> #sparkR.session(master = "local[*]") 
> data = data.frame(clicked = base::sample(c(0,1),100,replace=TRUE),
>   someString = base::sample(c("this", "that"), 
> 100, replace=TRUE), stringsAsFactors=FALSE)
> trainidxs = base::sample(nrow(data), nrow(data)*0.7)
> traindf = as.DataFrame(data[trainidxs,])
> testdf = as.DataFrame(rbind(data[-trainidxs,],c(0,"the other")))
> rf = spark.randomForest(traindf, clicked~., type="classification", 
> maxDepth=10, 
> maxBins=41,
> numTrees = 100)
> predictions = predict(rf, testdf)
> SparkR::collect(predictions)
> {code}
> stack trace:
> {quote}
> Error in handleErrors(returnStatus, conn): org.apache.spark.SparkException: 
> Job aborted due to stage failure: Task 0 in stage 607.0 failed 1 times, most 
> recent failure: Lost task 0.0 in stage 607.0 (TID 1581, localhost, executor 
> driver): org.apache.spark.SparkException: Failed to execute user defined 
> function($anonfun$4: (string) => double)
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
> at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
> at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:231)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:225)
> at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:826)
> at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:826)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
> at org.apache.spark.scheduler.Task.run(Task.scala:99)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: org.apache.spark.SparkException: Unseen label: the other.
> at 
> org.apache.spark.ml.feature.StringIndexerModel$$anonfun$4.apply(StringIndexer.scala:170)
> at 
> org.apache.spark.ml.feature.StringIndexerModel$$anonfun$4.apply(StringIndexer.scala:166)
> ... 16 more
> Driver stacktrace:
> at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1435)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1423)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1422)
> at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
> at 
> org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1422)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802)
> at scala.Opt

[jira] [Commented] (SPARK-18406) Race between end-of-task and completion iterator read lock release

2017-04-16 Thread Yongqin Xiao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15970496#comment-15970496
 ] 

Yongqin Xiao commented on SPARK-18406:
--

Thanks Josh for the quick response!
This issue is critical to my company's use cases, where, for performance reasons, 
we have to use a custom RDD that takes input from multiple parent RDDs and uses 
existing computation logic (in a black box) in the background to pull the result.


> Race between end-of-task and completion iterator read lock release
> --
>
> Key: SPARK-18406
> URL: https://issues.apache.org/jira/browse/SPARK-18406
> Project: Spark
>  Issue Type: Bug
>  Components: Block Manager, Spark Core
>Affects Versions: 2.0.0, 2.0.1
>Reporter: Josh Rosen
>
> The following log comes from a production streaming job where executors 
> periodically die due to uncaught exceptions during block release:
> {code}
> 16/11/07 17:11:06 INFO CoarseGrainedExecutorBackend: Got assigned task 7921
> 16/11/07 17:11:06 INFO Executor: Running task 0.0 in stage 2390.0 (TID 7921)
> 16/11/07 17:11:06 INFO CoarseGrainedExecutorBackend: Got assigned task 7922
> 16/11/07 17:11:06 INFO Executor: Running task 1.0 in stage 2390.0 (TID 7922)
> 16/11/07 17:11:06 INFO CoarseGrainedExecutorBackend: Got assigned task 7923
> 16/11/07 17:11:06 INFO Executor: Running task 2.0 in stage 2390.0 (TID 7923)
> 16/11/07 17:11:06 INFO TorrentBroadcast: Started reading broadcast variable 
> 2721
> 16/11/07 17:11:06 INFO CoarseGrainedExecutorBackend: Got assigned task 7924
> 16/11/07 17:11:06 INFO Executor: Running task 3.0 in stage 2390.0 (TID 7924)
> 16/11/07 17:11:06 INFO MemoryStore: Block broadcast_2721_piece0 stored as 
> bytes in memory (estimated size 5.0 KB, free 4.9 GB)
> 16/11/07 17:11:06 INFO TorrentBroadcast: Reading broadcast variable 2721 took 
> 3 ms
> 16/11/07 17:11:06 INFO MemoryStore: Block broadcast_2721 stored as values in 
> memory (estimated size 9.4 KB, free 4.9 GB)
> 16/11/07 17:11:06 INFO BlockManager: Found block rdd_2741_1 locally
> 16/11/07 17:11:06 INFO BlockManager: Found block rdd_2741_3 locally
> 16/11/07 17:11:06 INFO BlockManager: Found block rdd_2741_2 locally
> 16/11/07 17:11:06 INFO BlockManager: Found block rdd_2741_4 locally
> 16/11/07 17:11:06 INFO PythonRunner: Times: total = 2, boot = -566, init = 
> 567, finish = 1
> 16/11/07 17:11:06 INFO PythonRunner: Times: total = 7, boot = -540, init = 
> 541, finish = 6
> 16/11/07 17:11:06 INFO Executor: Finished task 2.0 in stage 2390.0 (TID 
> 7923). 1429 bytes result sent to driver
> 16/11/07 17:11:06 INFO PythonRunner: Times: total = 8, boot = -532, init = 
> 533, finish = 7
> 16/11/07 17:11:06 INFO Executor: Finished task 3.0 in stage 2390.0 (TID 
> 7924). 1429 bytes result sent to driver
> 16/11/07 17:11:06 ERROR Executor: Exception in task 0.0 in stage 2390.0 (TID 
> 7921)
> java.lang.AssertionError: assertion failed
>   at scala.Predef$.assert(Predef.scala:165)
>   at 
> org.apache.spark.storage.BlockInfo.checkInvariants(BlockInfoManager.scala:84)
>   at 
> org.apache.spark.storage.BlockInfo.readerCount_$eq(BlockInfoManager.scala:66)
>   at 
> org.apache.spark.storage.BlockInfoManager$$anonfun$releaseAllLocksForTask$2$$anonfun$apply$2.apply(BlockInfoManager.scala:362)
>   at 
> org.apache.spark.storage.BlockInfoManager$$anonfun$releaseAllLocksForTask$2$$anonfun$apply$2.apply(BlockInfoManager.scala:361)
>   at scala.Option.foreach(Option.scala:236)
>   at 
> org.apache.spark.storage.BlockInfoManager$$anonfun$releaseAllLocksForTask$2.apply(BlockInfoManager.scala:361)
>   at 
> org.apache.spark.storage.BlockInfoManager$$anonfun$releaseAllLocksForTask$2.apply(BlockInfoManager.scala:356)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>   at 
> org.apache.spark.storage.BlockInfoManager.releaseAllLocksForTask(BlockInfoManager.scala:356)
>   at 
> org.apache.spark.storage.BlockManager.releaseAllLocksForTask(BlockManager.scala:646)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:281)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> 16/11/07 17:11:06 INFO CoarseGrainedExecutorBackend: Got assigned task 7925
> 16/11/07 17:11:06 INFO Executor: Running task 0.1 in stage 2390.0 (TID 7925)
> 16/11/07 17:11:06 INFO BlockManager: Found block rdd_2741_1 locally
> 16/11/07 17:11:06 INFO PythonRunner: Times: total = 41, boot = -536, init = 
> 576, finish = 1
> 16/11/07 17:11:06 INFO Executor: Finished task 1.0 in stage 2390.0 (TID 
> 792

[jira] [Commented] (SPARK-20335) Children expressions of Hive UDF impacts the determinism of Hive UDF

2017-04-16 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15970520#comment-15970520
 ] 

Apache Spark commented on SPARK-20335:
--

User 'gatorsmile' has created a pull request for this issue:
https://github.com/apache/spark/pull/17652

> Children expressions of Hive UDF impacts the determinism of Hive UDF
> 
>
> Key: SPARK-20335
> URL: https://issues.apache.org/jira/browse/SPARK-20335
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2, 2.1.0
>Reporter: Xiao Li
>Assignee: Xiao Li
> Fix For: 2.2.0
>
>
> {noformat}
>   /**
>* Certain optimizations should not be applied if UDF is not deterministic.
>* Deterministic UDF returns same result each time it is invoked with a
>* particular input. This determinism just needs to hold within the context 
> of
>* a query.
>*
>* @return true if the UDF is deterministic
>*/
>   boolean deterministic() default true;
> {noformat}
> Based on the definition of UDFType, when a Hive UDF's children are 
> non-deterministic, the Hive UDF is also non-deterministic.
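
A toy, standalone illustration of that propagation rule (not Spark code, just 
the determinism logic in isolation):

{code}
sealed trait Expr { def deterministic: Boolean }
case class Leaf(deterministic: Boolean) extends Expr
// A UDF node is deterministic only if the UDF itself is deterministic AND all children are.
case class HiveUdf(udfDeterministic: Boolean, children: Seq[Expr]) extends Expr {
  def deterministic: Boolean = udfDeterministic && children.forall(_.deterministic)
}

val randChild = Leaf(deterministic = false)                         // e.g. rand()
val udfOverRand = HiveUdf(udfDeterministic = true, Seq(randChild))
assert(!udfOverRand.deterministic)                                  // the child taints the expression
{code}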



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19828) R to support JSON array in column from_json

2017-04-16 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19828?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15970536#comment-15970536
 ] 

Apache Spark commented on SPARK-19828:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/17653

> R to support JSON array in column from_json
> ---
>
> Key: SPARK-19828
> URL: https://issues.apache.org/jira/browse/SPARK-19828
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR, SQL
>Affects Versions: 2.2.0
>Reporter: Felix Cheung
>Assignee: Hyukjin Kwon
> Fix For: 2.2.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9478) Add sample weights to Random Forest

2017-04-16 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15970541#comment-15970541
 ] 

Joseph K. Bradley commented on SPARK-9478:
--

By the way, one design choice which has come up is whether the current 
minInstancesPerNode Param should take instance weights into account.

Pros of using instance weights with minInstancesPerNode:
* This maintains the semantics of instance weights.  The algorithm should treat 
these 2 datasets identically: (a) {{[(weight 1.0, example A), (weight 1.0, 
example B), (weight 1.0, example B)]}} vs. (b) {{[(weight 1.0, example A), 
(weight 2.0, example B)]}}.
* By maintaining these semantics, we avoid confusion about how RandomForest and 
GBT should treat the instance weights introduced by subsampling.  (Currently, 
these use instance weights with minInstancesPerNode, so this choice is 
consistent with our previous choices.)

Pros of not using instance weights with minInstancesPerNode:
* AFAIK, scikit-learn does not use instance weights with {{min_samples_leaf}}.

I vote for the first choice (taking weights into account).

> Add sample weights to Random Forest
> ---
>
> Key: SPARK-9478
> URL: https://issues.apache.org/jira/browse/SPARK-9478
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 1.4.1
>Reporter: Patrick Crenshaw
>
> Currently, this implementation of random forest does not support sample 
> (instance) weights. Weights are important when there is imbalanced training 
> data or the evaluation metric of a classifier is imbalanced (e.g. true 
> positive rate at some false positive threshold).  Sample weights generalize 
> class weights, so this could be used to add class weights later on.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-9478) Add sample weights to Random Forest

2017-04-16 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15970541#comment-15970541
 ] 

Joseph K. Bradley edited comment on SPARK-9478 at 4/16/17 10:36 PM:


By the way, one design choice which has come up is whether the current 
minInstancesPerNode Param should take instance weights into account.

Pros of using instance weights with minInstancesPerNode:
* This maintains the semantics of instance weights.  The algorithm should treat 
these 2 datasets identically: (a) {{[(weight 1.0, example A), (weight 1.0, 
example B), (weight 1.0, example B)]}} vs. (b) {{[(weight 1.0, example A), 
(weight 2.0, example B)]}}.
* By maintaining these semantics, we avoid confusion about how RandomForest and 
GBT should treat the instance weights introduced by subsampling.  (Currently, 
these use instance weights with minInstancesPerNode, so this choice is 
consistent with our previous choices.)

Pros of not using instance weights with minInstancesPerNode:
* AFAIK, scikit-learn does not use instance weights with {{min_samples_leaf}}.

I vote for the first choice (taking weights into account).

This does introduce one small complication:
* If you have small instance weights < 1.0, then the current limit on 
minInstancesPerNode of being >= 1.0 (in the ParamValidator) is a bit strict.
* I propose to permit minInstancesPerNode to be set to 0.  I plan to add a 
check to make sure each leaf node does have non-zero weight (i.e., at least one 
instance with non-0 weight).


was (Author: josephkb):
By the way, one design choice which has come up is whether the current 
minInstancesPerNode Param should take instance weights into account.

Pros of using instance weights with minInstancesPerNode:
* This maintains the semantics of instance weights.  The algorithm should treat 
these 2 datasets identically: (a) {{[(weight 1.0, example A), (weight 1.0, 
example B), (weight 1.0, example B)]}} vs. (b) {{[(weight 1.0, example A), 
(weight 2.0, example B)]}}.
* By maintaining these semantics, we avoid confusion about how RandomForest and 
GBT should treat the instance weights introduced by subsampling.  (Currently, 
these use instance weights with minInstancesPerNode, so this choice is 
consistent with our previous choices.)

Pros of not using instance weights with minInstancesPerNode:
* AFAIK, scikit-learn does not use instance weights with {{min_samples_leaf}}.

I vote for the first choice (taking weights into account).

> Add sample weights to Random Forest
> ---
>
> Key: SPARK-9478
> URL: https://issues.apache.org/jira/browse/SPARK-9478
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 1.4.1
>Reporter: Patrick Crenshaw
>
> Currently, this implementation of random forest does not support sample 
> (instance) weights. Weights are important when there is imbalanced training 
> data or the evaluation metric of a classifier is imbalanced (e.g. true 
> positive rate at some false positive threshold).  Sample weights generalize 
> class weights, so this could be used to add class weights later on.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-20351) Add trait hasTrainingSummary to replace the duplicate code

2017-04-16 Thread yuhao yang (JIRA)
yuhao yang created SPARK-20351:
--

 Summary: Add trait hasTrainingSummary to replace the duplicate code
 Key: SPARK-20351
 URL: https://issues.apache.org/jira/browse/SPARK-20351
 Project: Spark
  Issue Type: Improvement
  Components: ML
Affects Versions: 2.2.0
Reporter: yuhao yang
Priority: Minor


Add a trait HasTrainingSummary to avoid code duplication related to training 
summaries.
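
A hypothetical sketch of what such a trait could look like (names, type 
parameter, and visibility are illustrative; the actual implementation may differ):

{code}
trait HasTrainingSummary[S] {
  private var trainingSummary: Option[S] = None

  /** True iff a training summary was produced when this model was fit. */
  def hasSummary: Boolean = trainingSummary.isDefined

  /** The training summary, or an error if none is available. */
  def summary: S = trainingSummary.getOrElse(
    throw new RuntimeException("No training summary available for this model"))

  def setSummary(summary: Option[S]): this.type = {
    trainingSummary = summary
    this
  }
}
{code}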



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20351) Add trait hasTrainingSummary to replace the duplicate code

2017-04-16 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20351?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15970556#comment-15970556
 ] 

Apache Spark commented on SPARK-20351:
--

User 'hhbyyh' has created a pull request for this issue:
https://github.com/apache/spark/pull/17654

> Add trait hasTrainingSummary to replace the duplicate code
> --
>
> Key: SPARK-20351
> URL: https://issues.apache.org/jira/browse/SPARK-20351
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: yuhao yang
>Priority: Minor
>
> Add a trait HasTrainingSummary to avoid code duplication related to training 
> summaries.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20351) Add trait hasTrainingSummary to replace the duplicate code

2017-04-16 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20351?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20351:


Assignee: (was: Apache Spark)

> Add trait hasTrainingSummary to replace the duplicate code
> --
>
> Key: SPARK-20351
> URL: https://issues.apache.org/jira/browse/SPARK-20351
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: yuhao yang
>Priority: Minor
>
> Add a trait HasTrainingSummary to avoid code duplication related to training 
> summaries.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20351) Add trait hasTrainingSummary to replace the duplicate code

2017-04-16 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20351?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20351:


Assignee: Apache Spark

> Add trait hasTrainingSummary to replace the duplicate code
> --
>
> Key: SPARK-20351
> URL: https://issues.apache.org/jira/browse/SPARK-20351
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: yuhao yang
>Assignee: Apache Spark
>Priority: Minor
>
> Add a trait HasTrainingSummary to avoid code duplication related to training 
> summaries.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20338) Spaces in spark.eventLog.dir are not correctly handled

2017-04-16 Thread zuotingbing (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20338?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zuotingbing updated SPARK-20338:

Priority: Major  (was: Minor)

> Spaces in spark.eventLog.dir are not correctly handled
> --
>
> Key: SPARK-20338
> URL: https://issues.apache.org/jira/browse/SPARK-20338
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: zuotingbing
>
> Set spark.eventLog.dir=/home/mr/event log and submit an app; we got an error as 
> follows:
> 2017-04-14 17:28:40,378 INFO org.apache.spark.SparkContext: Successfully 
> stopped SparkContext
> Exception in thread "main" ExitCodeException exitCode=1: chmod: cannot access 
> `/home/mr/event%20log/app-20170414172839-.inprogress': No such file or 
> directory
>   at org.apache.hadoop.util.Shell.runCommand(Shell.java:561)
>   at org.apache.hadoop.util.Shell.run(Shell.java:478)
>   at 
> org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:738)
>   at org.apache.hadoop.util.Shell.execCommand(Shell.java:831)
>   at org.apache.hadoop.util.Shell.execCommand(Shell.java:814)
>   at 
> org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSystem.java:712)
>   at 
> org.apache.hadoop.fs.FilterFileSystem.setPermission(FilterFileSystem.java:506)
>   at 
> org.apache.spark.scheduler.EventLoggingListener.start(EventLoggingListener.scala:125)
>   at org.apache.spark.SparkContext.<init>(SparkContext.scala:516)
>   at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2258)
>   at 
> org.apache.spark.sql.SparkSession$Builder$$anonfun$9.apply(SparkSession.scala:879)
>   at 
> org.apache.spark.sql.SparkSession$Builder$$anonfun$9.apply(SparkSession.scala:871)
>   at scala.Option.getOrElse(Option.scala:121)
>   at 
> org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:871)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLEnv$.init(SparkSQLEnv.scala:58)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.<init>(SparkSQLCLIDriver.scala:288)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:137)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
>   at 
> org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:736)
>   at 
> org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:185)
>   at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:210)
>   at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:124)
>   at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
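
For reference, a small self-contained Scala sketch (plain java.net.URI and java.io.File, 
no Spark involved) showing how percent-encoding a directory name that contains a space 
yields a path that no longer exists on disk, which is what the chmod failure above points at:

{code}
import java.io.File
import java.net.URI

// Hypothetical log directory containing a space, as in the report.
val rawDir = "/home/mr/event log"

// Building a URI percent-encodes the space ...
val asUri = new URI("file", null, rawDir, null)
println(asUri.getRawPath)                    // /home/mr/event%20log

// ... and treating the encoded form as a literal filesystem path points at a
// directory that does not exist, even if rawDir itself does.
println(new File(asUri.getRawPath).exists())

// Decoding the path gives back the real directory name.
println(new File(asUri.getPath))             // /home/mr/event log
{code}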



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-20336) spark.read.csv() with wholeFile=True option fails to read non ASCII unicode characters

2017-04-16 Thread HanCheol Cho (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15970259#comment-15970259
 ] 

HanCheol Cho edited comment on SPARK-20336 at 4/17/17 1:18 AM:
---

Hi, [~hyukjin.kwon] 

I found that this case only happens when I run it in YARN mode, not local mode, 
and the cluster used here was using different Python versions: Anaconda Python 
2.7.11 on the client node and the system's Python 2.7.5 on the worker nodes.
Other system configurations, such as the locale (en_us.UTF-8), were the same.

However, I am not yet sure whether this is the root cause.
I will test it once again after updating the cluster's Python, but that will take 
some time since other team members also use the cluster.
I think I can provide an additional report next week. Would that be okay?




was (Author: priancho):
Hi, [~hyukjin.kwon] 

I found that this case only happens when I run it in Yarn mode, not local mode, 
and the clused used here were using different Python version (Anaconda Python 
2.7.11 in client and System's Python 2.7.5 in worker nodes).
Other system configurations such as locale (en_us.UTF-8) were same.

However, I am not yet sure if this is the root cause or not.
I will test it once agin by updating Cluster's Python.
But it will take some time since other team members also use it.
I think I can make additional reports during next week. Would it be okay?



> spark.read.csv() with wholeFile=True option fails to read non ASCII unicode 
> characters
> --
>
> Key: SPARK-20336
> URL: https://issues.apache.org/jira/browse/SPARK-20336
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
> Environment: Spark 2.2.0 (master branch is downloaded from Github)
> PySpark
>Reporter: HanCheol Cho
>
> I used the spark.read.csv() method with the wholeFile=True option to load data 
> that has multi-line records.
> However, non-ASCII characters are not loaded properly.
> The following is sample data for testing:
> {code:none}
> col1,col2,col3
> 1,a,text
> 2,b,テキスト
> 3,c,텍스트
> 4,d,"text
> テキスト
> 텍스트"
> 5,e,last
> {code}
> When it is loaded without the wholeFile=True option, non-ASCII characters are 
> shown correctly, although multi-line records are parsed incorrectly, as follows:
> {code:none}
> testdf_default = spark.read.csv("test.encoding.csv", header=True)
> testdf_default.show()
> ++++
> |col1|col2|col3|
> ++++
> |   1|   a|text|
> |   2|   b|テキスト|
> |   3|   c| 텍스트|
> |   4|   d|text|
> |テキスト|null|null|
> | 텍스트"|null|null|
> |   5|   e|last|
> ++++
> {code}
> When the wholeFile=True option is used, non-ASCII characters are broken as 
> follows:
> {code:none}
> testdf_wholefile = spark.read.csv("test.encoding.csv", header=True, 
> wholeFile=True)
> testdf_wholefile.show()
> ++++
> |col1|col2|col3|
> ++++
> |   1|   a|text|
> |   2|   b||
> |   3|   c|   �|
> |   4|   d|text
> ...|
> |   5|   e|last|
> ++++
> {code}
> The result is the same even if I use the encoding="UTF-8" option with wholeFile=True.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20156) Java String toLowerCase "Turkish locale bug" causes Spark problems

2017-04-16 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20156?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15970581#comment-15970581
 ] 

Apache Spark commented on SPARK-20156:
--

User 'gatorsmile' has created a pull request for this issue:
https://github.com/apache/spark/pull/17655

> Java String toLowerCase "Turkish locale bug" causes Spark problems
> --
>
> Key: SPARK-20156
> URL: https://issues.apache.org/jira/browse/SPARK-20156
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell
>Affects Versions: 2.1.0
> Environment: Ubunutu 16.04
> Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_121)
>Reporter: Serkan Taş
>Assignee: Sean Owen
> Fix For: 2.2.0
>
> Attachments: sprk_shell.txt
>
>
> If the regional setting of the operating system is Turkish, the famous Java 
> locale problem occurs (https://jira.atlassian.com/browse/CONF-5931 or 
> https://issues.apache.org/jira/browse/AVRO-1493). 
> For example: 
> "SERDEINFO" is lowercased to "serdeınfo"
> "uniquetable" is uppercased to "UNİQUETABLE"
> Workaround: 
> add -Duser.country=US -Duser.language=en to the end of the line 
> SPARK_SUBMIT_OPTS="$SPARK_SUBMIT_OPTS -Dscala.usejavacp=true"
> in spark-shell.sh
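
For reference, a locale-only Scala sketch (no Spark involved) that reproduces the case 
mapping and shows the usual locale-insensitive alternative:

{code}
import java.util.Locale

val turkish = new Locale("tr", "TR")

// With a Turkish locale, ASCII identifiers change letters on case conversion:
println("SERDEINFO".toLowerCase(turkish))    // serdeınfo  (dotless ı instead of i)
println("uniquetable".toUpperCase(turkish))  // UNİQUETABLE (dotted İ instead of I)

// Passing an explicit root locale keeps identifier handling locale-insensitive:
println("SERDEINFO".toLowerCase(Locale.ROOT))    // serdeinfo
println("uniquetable".toUpperCase(Locale.ROOT))  // UNIQUETABLE
{code}

The -Duser.country/-Duser.language workaround above changes the JVM's default locale 
for the whole shell rather than fixing the individual call sites.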



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-20346) sum aggregate over empty Dataset gives null

2017-04-16 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20346?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li closed SPARK-20346.
---
Resolution: Not A Problem

> sum aggregate over empty Dataset gives null
> ---
>
> Key: SPARK-20346
> URL: https://issues.apache.org/jira/browse/SPARK-20346
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Jacek Laskowski
>Priority: Minor
>
> {code}
> scala> spark.range(0).agg(sum("id")).show
> +---+
> |sum(id)|
> +---+
> |   null|
> +---+
> scala> spark.range(0).agg(sum("id")).printSchema
> root
>  |-- sum(id): long (nullable = true)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20346) sum aggregate over empty Dataset gives null

2017-04-16 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20346?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15970612#comment-15970612
 ] 

Xiao Li commented on SPARK-20346:
-

{quote}
The result of the COUNT and COUNT_BIG functions cannot be the null value. As 
specified in the description of AVG, MAX, MIN, STDDEV, SUM, and VARIANCE, the 
result is the null value when the function is applied to an empty set. However, 
the result is also the null value when the function is specified in an outer 
select list, the argument is given by an arithmetic expression, and any 
evaluation of the expression causes an arithmetic exception (such as division 
by zero).
{quote}

This is copied from 
https://www.ibm.com/support/knowledgecenter/en/SSEPEK_11.0.0/sqlref/src/tpc/db2z_aggregatefunctionsintro.html.
 It is common behavior. 

Thanks!
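
As a side note, callers who want 0 instead of null for an empty Dataset can coalesce the 
aggregate themselves. A minimal sketch, assuming a spark-shell session where {{spark}} is 
in scope:

{code}
import org.apache.spark.sql.functions.{coalesce, lit, sum}

spark.range(0)
  .agg(coalesce(sum("id"), lit(0L)).as("sum_id"))
  .show()
// prints 0 instead of null for the empty Dataset
{code}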

> sum aggregate over empty Dataset gives null
> ---
>
> Key: SPARK-20346
> URL: https://issues.apache.org/jira/browse/SPARK-20346
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Jacek Laskowski
>Priority: Minor
>
> {code}
> scala> spark.range(0).agg(sum("id")).show
> +---+
> |sum(id)|
> +---+
> |   null|
> +---+
> scala> spark.range(0).agg(sum("id")).printSchema
> root
>  |-- sum(id): long (nullable = true)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20299) NullPointerException when null and string are in a tuple while encoding Dataset

2017-04-16 Thread Umesh Chaudhary (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15970652#comment-15970652
 ] 

Umesh Chaudhary commented on SPARK-20299:
-

My bad, previously I was indeed trying to reproduce this on Spark 2.1. I was 
able to reproduce the issue by following the mentioned steps. 
After debugging the behaviour, I observed a difference in the generated 
"CleanExpressions", as shown below: 

{code}
=== Result of Batch CleanExpressions (Spark 2.1.0) ===

!staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, 
fromString, assertnotnull(input[0, scala.Tuple2, true], top level Product input 
object)._1, true) AS _1#0   
staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, 
fromString, assertnotnull(input[0, scala.Tuple2, true], top level Product input 
object)._1, true)
!+- staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, 
fromString, assertnotnull(input[0, scala.Tuple2, true], top level Product input 
object)._1, true)
+- assertnotnull(input[0, scala.Tuple2, true], top level Product input 
object)._1
!   +- assertnotnull(input[0, scala.Tuple2, true], top level Product input 
object)._1  
  
+- assertnotnull(input[0, scala.Tuple2, true], top level Product input object)
!  +- assertnotnull(input[0, scala.Tuple2, true], top level Product input 
object) 
 
+- input[0, scala.Tuple2, true]
! +- input[0, scala.Tuple2, true]  
{code}

{code}
=== Result of Batch CleanExpressions (Spark 2.2.0) ===
!staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, 
fromString, assertnotnull(assertnotnull(input[0, scala.Tuple2, true]))._1, 
true) AS _1#0   
staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, 
fromString, assertnotnull(assertnotnull(input[0, scala.Tuple2, true]))._1, true)
!+- staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, 
fromString, assertnotnull(assertnotnull(input[0, scala.Tuple2, true]))._1, 
true)
+- assertnotnull(assertnotnull(input[0, scala.Tuple2, true]))._1
!   +- assertnotnull(assertnotnull(input[0, scala.Tuple2, true]))._1

+- assertnotnull(assertnotnull(input[0, scala.Tuple2, true]))
!  +- assertnotnull(assertnotnull(input[0, scala.Tuple2, true]))

   +- assertnotnull(input[0, scala.Tuple2, true])
! +- assertnotnull(input[0, scala.Tuple2, true])

  +- input[0, scala.Tuple2, true]
!+- input[0, scala.Tuple2, true] 
{code}

There is an additional "assertnotnull" wrapper on the tuple rows, which seems to 
result from changes in the CodeGenerator. Need to confirm with [~marmbrus], as 
this seems to be the root cause of this issue. 
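
For anyone who wants to reproduce the comparison above, a sketch that prints the 
serializer expressions of the tuple encoder (this uses a Spark 2.x internal API, so 
it is subject to change):

{code}
import org.apache.spark.sql.catalyst.encoders.ExpressionEncoder

// Encoder for the (String, Int) tuple from the report; its serializer expressions
// contain the assertnotnull wrappers discussed above.
val enc = ExpressionEncoder[(String, Int)]()
enc.serializer.foreach(expr => println(expr.treeString))
{code}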

> NullPointerException when null and string are in a tuple while encoding 
> Dataset
> ---
>
> Key: SPARK-20299
> URL: https://issues.apache.org/jira/browse/SPARK-20299
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Jacek Laskowski
>Priority: Minor
>
> When creating a Dataset from a tuple with {{null}} and a string, an NPE is 
> reported. When either is removed, it works fine.
> {code}
> scala> Seq((1, null.asInstanceOf[Int]), (2, 1)).toDS
> res43: org.apache.spark.sql.Dataset[(Int, Int)] = [_1: int, _2: int]
> scala> Seq(("1", null.asInstanceOf[Int]), ("2", 1)).toDS
> java.lang.RuntimeException: Error while encoding: 
> java.lang.NullPointerException
> staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, 
> fromString, assertnotnull(assertnotnull(input[0, scala.Tuple2, true], top 
> level Product input object), - root class: "scala.Tuple2")._1, true) AS _1#474
> assertnotnull(assertnotnull(input[0, scala.Tuple2, true], top level Product 
> input object), - root class: "scala.Tuple2")._2 AS _2#475
>   at 
> org.apache.spark.sql.catalyst.encoders.ExpressionEncoder.toRow(ExpressionEncoder.scala:290)
>   at 
> org.apache.spark.sql.SparkSession$$anonfun$2.apply(SparkSession.scala:454)
>   at 
> org.apache.spark.sql.SparkSession$$anonfun$2.apply(SparkSession.scala:454)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at scala.collection.imm

[jira] [Created] (SPARK-20352) PySpark SparkSession initialization take longer every iteration in a single application

2017-04-16 Thread hosein (JIRA)
hosein created SPARK-20352:
--

 Summary: PySpark SparkSession initialization take longer every 
iteration in a single application
 Key: SPARK-20352
 URL: https://issues.apache.org/jira/browse/SPARK-20352
 Project: Spark
  Issue Type: Question
  Components: PySpark
Affects Versions: 2.1.0
 Environment: linux ubunto 12
pyspark
Reporter: hosein
 Fix For: 2.1.0


I run Spark on a standalone Ubuntu server with 128G memory and a 32-core CPU, and 
run spark-submit my_code.py without any additional configuration parameters.
In a while loop I start a SparkSession, analyze data, and then stop the context; 
this process repeats every 10 seconds.

#
while True:
    spark = SparkSession.builder.appName("sync_task") \
        .config('spark.driver.maxResultSize', '5g') \
        .getOrCreate()
    sc = spark.sparkContext
    # some processing and analysis
    spark.stop()
###

When the program starts, it works perfectly.

But after it has been running for many hours, Spark initialization takes a long 
time: 10 or 20 seconds just to initialize Spark.

So what is the problem?
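
A minimal sketch (in Scala here, although the report uses PySpark, and with a local 
master assumed) of how the per-iteration startup time could be measured to quantify 
the slowdown:

{code}
import org.apache.spark.sql.SparkSession

for (i <- 1 to 1000) {
  val start = System.nanoTime()
  val spark = SparkSession.builder()
    .master("local[*]")                          // assumption: measure locally
    .appName("sync_task")
    .config("spark.driver.maxResultSize", "5g")
    .getOrCreate()
  val startupMs = (System.nanoTime() - start) / 1e6
  println(s"iteration $i: getOrCreate took $startupMs ms")

  // ... analysis would go here ...

  spark.stop()
  Thread.sleep(10000)                            // the report repeats roughly every 10 seconds
}
{code}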



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20352) PySpark SparkSession initialization take longer every iteration in a single application

2017-04-16 Thread hosein (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20352?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

hosein updated SPARK-20352:
---
Environment: 
linux ubuntu 12
pyspark

  was:
linux ubunto 12
pyspark


> PySpark SparkSession initialization take longer every iteration in a single 
> application
> ---
>
> Key: SPARK-20352
> URL: https://issues.apache.org/jira/browse/SPARK-20352
> Project: Spark
>  Issue Type: Question
>  Components: PySpark
>Affects Versions: 2.1.0
> Environment: linux ubuntu 12
> pyspark
>Reporter: hosein
> Fix For: 2.1.0
>
>
> I run Spark on a standalone Ubuntu server with 128G memory and a 32-core CPU, and 
> run spark-submit my_code.py without any additional configuration parameters.
> In a while loop I start a SparkSession, analyze data, and then stop the context; 
> this process repeats every 10 seconds.
> #
> while True:
>     spark = SparkSession.builder.appName("sync_task") \
>         .config('spark.driver.maxResultSize', '5g') \
>         .getOrCreate()
>     sc = spark.sparkContext
>     # some processing and analysis
>     spark.stop()
> ###
> When the program starts, it works perfectly.
> But after it has been running for many hours, Spark initialization takes a long 
> time: 10 or 20 seconds just to initialize Spark.
> So what is the problem?



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20352) PySpark SparkSession initialization take longer every iteration in a single application

2017-04-16 Thread hosein (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20352?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

hosein updated SPARK-20352:
---
Environment: 
linux ubunto 12
spark 2.1
JRE 8.0


  was:
linux ubuntu 12
pyspark


> PySpark SparkSession initialization take longer every iteration in a single 
> application
> ---
>
> Key: SPARK-20352
> URL: https://issues.apache.org/jira/browse/SPARK-20352
> Project: Spark
>  Issue Type: Question
>  Components: PySpark
>Affects Versions: 2.1.0
> Environment: linux ubunto 12
> spark 2.1
> JRE 8.0
>Reporter: hosein
> Fix For: 2.1.0
>
>
> I run Spark on a standalone Ubuntu server with 128G memory and a 32-core CPU, and 
> run spark-submit my_code.py without any additional configuration parameters.
> In a while loop I start a SparkSession, analyze data, and then stop the context; 
> this process repeats every 10 seconds.
> #
> while True:
>     spark = SparkSession.builder.appName("sync_task") \
>         .config('spark.driver.maxResultSize', '5g') \
>         .getOrCreate()
>     sc = spark.sparkContext
>     # some processing and analysis
>     spark.stop()
> ###
> When the program starts, it works perfectly.
> But after it has been running for many hours, Spark initialization takes a long 
> time: 10 or 20 seconds just to initialize Spark.
> So what is the problem?



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20352) PySpark SparkSession initialization take longer every iteration in a single application

2017-04-16 Thread hosein (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20352?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

hosein updated SPARK-20352:
---
Environment: 
Ubuntu 12
Spark 2.1
JRE 8.0
Python 2.7


  was:
linux ubunto 12
spark 2.1
JRE 8.0



> PySpark SparkSession initialization take longer every iteration in a single 
> application
> ---
>
> Key: SPARK-20352
> URL: https://issues.apache.org/jira/browse/SPARK-20352
> Project: Spark
>  Issue Type: Question
>  Components: PySpark
>Affects Versions: 2.1.0
> Environment: Ubuntu 12
> Spark 2.1
> JRE 8.0
> Python 2.7
>Reporter: hosein
> Fix For: 2.1.0
>
>
> I run Spark on a standalone Ubuntu server with 128G memory and a 32-core CPU, and 
> run spark-submit my_code.py without any additional configuration parameters.
> In a while loop I start a SparkSession, analyze data, and then stop the context; 
> this process repeats every 10 seconds.
> #
> while True:
>     spark = SparkSession.builder.appName("sync_task") \
>         .config('spark.driver.maxResultSize', '5g') \
>         .getOrCreate()
>     sc = spark.sparkContext
>     # some processing and analysis
>     spark.stop()
> ###
> When the program starts, it works perfectly.
> But after it has been running for many hours, Spark initialization takes a long 
> time: 10 or 20 seconds just to initialize Spark.
> So what is the problem?



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20352) PySpark SparkSession initialization take longer every iteration in a single application

2017-04-16 Thread hosein (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20352?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

hosein updated SPARK-20352:
---
Description: 
I run Spark on a standalone Ubuntu server with 128G memory and a 32-core CPU, and 
run spark-submit my_code.py without any additional configuration parameters.
In a while loop I start a SparkSession, analyze data, and then stop the context; 
this process repeats every 10 seconds.

{code}
while True:
    spark = SparkSession.builder.appName("sync_task") \
        .config('spark.driver.maxResultSize', '5g') \
        .getOrCreate()
    sc = spark.sparkContext
    # some processing and analysis
    spark.stop()
{code}

When the program starts, it works perfectly.

But after it has been running for many hours, Spark initialization takes a long 
time: 10 or 20 seconds just to initialize Spark.

So what is the problem?

  was:
I run Spark on a standalone Ubuntu server with 128G memory and 32-core CPU. Run 
spark-sumbit my_code.py without any additional configuration parameters.
In a while loop I start SparkSession, analyze data and then stop the context 
and this process repeats every 10 seconds.

#
while True:
spark =   
SparkSession.builder.appName("sync_task").config('spark.driver.maxResultSize'
 , '5g').getOrCreate()
sc = spark.sparkContext
#some process and analyze
spark.stop()
###

When program starts, it works perfectly.

but when it works for many hours. spark initialization take long time. it makes 
10 or 20 seconds for just initializing spark.

So what is the problem ?


> PySpark SparkSession initialization take longer every iteration in a single 
> application
> ---
>
> Key: SPARK-20352
> URL: https://issues.apache.org/jira/browse/SPARK-20352
> Project: Spark
>  Issue Type: Question
>  Components: PySpark
>Affects Versions: 2.1.0
> Environment: Ubuntu 12
> Spark 2.1
> JRE 8.0
> Python 2.7
>Reporter: hosein
> Fix For: 2.1.0
>
>
> I run Spark on a standalone Ubuntu server with 128G memory and a 32-core CPU, and 
> run spark-submit my_code.py without any additional configuration parameters.
> In a while loop I start a SparkSession, analyze data, and then stop the context; 
> this process repeats every 10 seconds.
> {code}
> while True:
>     spark = SparkSession.builder.appName("sync_task") \
>         .config('spark.driver.maxResultSize', '5g') \
>         .getOrCreate()
>     sc = spark.sparkContext
>     # some processing and analysis
>     spark.stop()
> {code}
> When the program starts, it works perfectly.
> But after it has been running for many hours, Spark initialization takes a long 
> time: 10 or 20 seconds just to initialize Spark.
> So what is the problem?



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org