[jira] [Commented] (SPARK-21074) Parquet files are read fully even though only count() is requested

2017-06-18 Thread Kazuaki Ishizaki (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16053532#comment-16053532
 ] 

Kazuaki Ishizaki commented on SPARK-21074:
--

I think that [~spektom] expected behavior similar to this 
[article|https://stackoverflow.com/questions/40629435/fast-parquet-row-count-in-spark].
In summary, the expectation is to read only the metadata.

I agree with setting this to "Improvement", since it is a kind of performance 
optimization.
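
For reference, a minimal sketch of that metadata-only approach, assuming the 
parquet-hadoop API available at the time ({{ParquetFileReader.readFooters}}) and 
reusing the reporter's placeholder path; S3 credentials/configuration are omitted:

{code:scala}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.parquet.hadoop.ParquetFileReader
import scala.collection.JavaConverters._

// Sum row counts from the Parquet footers only; no column data is read.
val conf = new Configuration()
val footers = ParquetFileReader.readFooters(conf, new Path("s3://my-bucket/test"))
val rowCount = footers.asScala
  .flatMap(_.getParquetMetadata.getBlocks.asScala) // one entry per row group
  .map(_.getRowCount)
  .sum
println(s"Total rows (from metadata only): $rowCount")
{code}

Ideally, count() on a Parquet-backed Dataset would do something equivalent 
instead of issuing full-content reads.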

> Parquet files are read fully even though only count() is requested
> --
>
> Key: SPARK-21074
> URL: https://issues.apache.org/jira/browse/SPARK-21074
> Project: Spark
>  Issue Type: Improvement
>  Components: Optimizer
>Affects Versions: 2.1.0
>Reporter: Michael Spector
>
> I have the following sample code that creates parquet files:
> {code:java}
> val spark = SparkSession.builder()
>   .config("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", 
> "2")
>   .config("spark.hadoop.parquet.metadata.read.parallelism", "50")
>   .appName("Test Write").getOrCreate()
> val sqc = spark.sqlContext
> import sqc.implicits._
> val random = new scala.util.Random(31L)
> (1465720077 to 1465720077+1000).map(x => Event(x, random.nextString(2)))
>   .toDS()
>   .write
>   .mode(SaveMode.Overwrite)
>   .parquet("s3://my-bucket/test")
> {code}
> Afterwards, I'm trying to read these files with the following code:
> {code:java}
> val spark = SparkSession.builder()
>   .config("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", 
> "2")
>   .config("spark.hadoop.parquet.metadata.read.parallelism", "50")
>   .config("spark.sql.parquet.filterPushdown", "true")
>   .appName("Test Read").getOrCreate()
> spark.sqlContext.read
>   .option("mergeSchema", "false")
>   .parquet("s3://my-bucket/test")
>   .count()
> {code}
> I've enabled the DEBUG log level to see which requests are actually sent through 
> the S3 API, and I've found that in addition to the Parquet "footer" retrieval 
> there are requests that ask for the whole file content.
> For example, here is a full-content request:
> {noformat}
> 17/06/13 05:46:50 DEBUG wire: http-outgoing-1 >> "GET 
> /test/part-0-b8a8a1b7-0581-401f-b520-27fa9600f35e.snappy.parquet 
> HTTP/1.1[\r][\n]"
> 
> 17/06/13 05:46:50 DEBUG wire: http-outgoing-1 << "Content-Range: bytes 
> 0-7472093/7472094[\r][\n]"
> 
> 17/06/13 05:46:50 DEBUG wire: http-outgoing-1 << "Content-Length: 
> 7472094[\r][\n]"
> {noformat}
> And here is a partial request for the footer only:
> {noformat}
> 17/06/13 05:46:50 DEBUG headers: http-outgoing-2 >> GET 
> /test/part-0-b8a8a1b7-0581-401f-b520-27fa9600f35e.snappy.parquet HTTP/1.1
> 
> 17/06/13 05:46:50 DEBUG headers: http-outgoing-2 >> Range: 
> bytes=7472086-7472094
> ...
> 17/06/13 05:46:50 DEBUG wire: http-outgoing-2 << "Content-Length: 8[\r][\n]"
> 
> {noformat}
> Here's what FileScanRDD prints:
> {noformat}
> 17/06/13 05:46:52 INFO FileScanRDD: Reading File path: 
> s3://my-bucket/test/part-4-b8a8a1b7-0581-401f-b520-27fa9600f35e.snappy.parquet,
>  range: 0-7473020, partition values: [empty row]
> 17/06/13 05:46:52 INFO FileScanRDD: Reading File path: 
> s3://my-bucket/test/part-00011-b8a8a1b7-0581-401f-b520-27fa9600f35e.snappy.parquet,
>  range: 0-7472503, partition values: [empty row]
> 17/06/13 05:46:52 INFO FileScanRDD: Reading File path: 
> s3://my-bucket/test/part-6-b8a8a1b7-0581-401f-b520-27fa9600f35e.snappy.parquet,
>  range: 0-7472501, partition values: [empty row]
> 17/06/13 05:46:52 INFO FileScanRDD: Reading File path: 
> s3://my-bucket/test/part-7-b8a8a1b7-0581-401f-b520-27fa9600f35e.snappy.parquet,
>  range: 0-7473104, partition values: [empty row]
> 17/06/13 05:46:52 INFO FileScanRDD: Reading File path: 
> s3://my-bucket/test/part-3-b8a8a1b7-0581-401f-b520-27fa9600f35e.snappy.parquet,
>  range: 0-7472458, partition values: [empty row]
> 17/06/13 05:46:52 INFO FileScanRDD: Reading File path: 
> s3://my-bucket/test/part-00012-b8a8a1b7-0581-401f-b520-27fa9600f35e.snappy.parquet,
>  range: 0-7472594, partition values: [empty row]
> 17/06/13 05:46:52 INFO FileScanRDD: Reading File path: 
> s3://my-bucket/test/part-1-b8a8a1b7-0581-401f-b520-27fa9600f35e.snappy.parquet,
>  range: 0-7472984, partition values: [empty row]
> 17/06/13 05:46:52 INFO FileScanRDD: Reading File path: 
> s3://my-bucket/test/part-00014-b8a8a1b7-0581-401f-b520-27fa9600f35e.snappy.parquet,
>  range: 0-7472720, partition values: [empty row]
> 17/06/13 05:46:53 INFO FileScanRDD: Reading File path: 
> s3://my-bucket/test/part-8-b8a8a1b7-0581-401f-b520-27fa9600f35e.snappy.parquet,
>  range: 0-7472339, partition values: [empty row]
> 17/06/13 05:46:53 INFO FileScanRDD: Reading File path: 
> s3://my-bucket/test/

[jira] [Commented] (SPARK-19644) Memory leak in Spark Streaming

2017-06-18 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16053509#comment-16053509
 ] 

Dongjoon Hyun commented on SPARK-19644:
---

I see. Thank you anyway, [~deenbandhu].

> Memory leak in Spark Streaming
> --
>
> Key: SPARK-19644
> URL: https://issues.apache.org/jira/browse/SPARK-19644
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 2.0.2
> Environment: 3 AWS EC2 c3.xLarge
> Number of cores - 3
> Number of executors 3 
> Memory to each executor 2GB
>Reporter: Deenbandhu Agarwal
>  Labels: memory_leak, performance
> Attachments: Dominator_tree.png, heapdump.png, Path2GCRoot.png
>
>
> I am using Spark Streaming in production for some aggregation, fetching data 
> from Cassandra and saving data back to Cassandra. 
> I see a gradual increase in old-generation heap capacity from 1,161,216 bytes 
> to 1,397,760 bytes over a period of six hours.
> After 50 hours of processing, instances of the class 
> scala.collection.immutable.$colon$colon increased to 12,811,793, which is a 
> huge number. 
> I think this is a clear case of a memory leak.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19644) Memory leak in Spark Streaming

2017-06-18 Thread Deenbandhu Agarwal (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16053502#comment-16053502
 ] 

Deenbandhu Agarwal commented on SPARK-19644:


Also, the Spark Cassandra connector, which is a dependency for us, is not yet 
released for those Spark versions.

> Memory leak in Spark Streaming
> --
>
> Key: SPARK-19644
> URL: https://issues.apache.org/jira/browse/SPARK-19644
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 2.0.2
> Environment: 3 AWS EC2 c3.xLarge
> Number of cores - 3
> Number of executors 3 
> Memory to each executor 2GB
>Reporter: Deenbandhu Agarwal
>  Labels: memory_leak, performance
> Attachments: Dominator_tree.png, heapdump.png, Path2GCRoot.png
>
>
> I am using Spark Streaming in production for some aggregation, fetching data 
> from Cassandra and saving data back to Cassandra. 
> I see a gradual increase in old-generation heap capacity from 1,161,216 bytes 
> to 1,397,760 bytes over a period of six hours.
> After 50 hours of processing, instances of the class 
> scala.collection.immutable.$colon$colon increased to 12,811,793, which is a 
> huge number. 
> I think this is a clear case of a memory leak.






[jira] [Commented] (SPARK-21101) Error running Hive temporary UDTF on latest Spark 2.2

2017-06-18 Thread Dayou Zhou (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16053501#comment-16053501
 ] 

Dayou Zhou commented on SPARK-21101:


Hi [~q79969786], thank you kindly for your response. I was not aware of this 
'other' version of the initialize() method, and will try your suggestion tomorrow.

> Error running Hive temporary UDTF on latest Spark 2.2
> -
>
> Key: SPARK-21101
> URL: https://issues.apache.org/jira/browse/SPARK-21101
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.1
>Reporter: Dayou Zhou
>
> I'm using temporary UDTFs on Spark 2.2, e.g.
> CREATE TEMPORARY FUNCTION myudtf AS 'com.foo.MyUdtf' USING JAR 
> 'hdfs:///path/to/udf.jar'; 
> But when I try to invoke it, I get the following error:
> {noformat}
> 17/06/14 19:43:50 ERROR SparkExecuteStatementOperation: Error running hive 
> query:
> org.apache.hive.service.cli.HiveSQLException: 
> org.apache.spark.sql.AnalysisException: No handler for Hive UDF 
> 'com.foo.MyUdtf': java.lang.NullPointerException; line 1 pos 7
> at 
> org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.org$apache$spark$sql$hive$thriftserver$SparkExecuteStatementOperation$$execute(SparkExecuteStatementOperation.scala:266)
> at 
> org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$1$$anon$2.run(SparkExecuteStatementOperation.scala:174)
> at 
> org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$1$$anon$2.run(SparkExecuteStatementOperation.scala:171)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698)
> at 
> org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$1.run(SparkExecuteStatementOperation.scala:184)
> at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> {noformat}
> Any help appreciated, thanks.






[jira] [Comment Edited] (SPARK-21101) Error running Hive temporary UDTF on latest Spark 2.2

2017-06-18 Thread Dayou Zhou (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16053501#comment-16053501
 ] 

Dayou Zhou edited comment on SPARK-21101 at 6/19/17 5:54 AM:
-

Hi [~q79969786], thank you kindly for your response. I was not aware of this 
'other' version of the initialize() method, and will try your suggestion tomorrow.


was (Author: dyzhou):
Hi [~q79969786], thank you kindly for your response.  I was not aware of this 
'other' version of initialize() method, and will try our suggestion tomorrow.

> Error running Hive temporary UDTF on latest Spark 2.2
> -
>
> Key: SPARK-21101
> URL: https://issues.apache.org/jira/browse/SPARK-21101
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.1
>Reporter: Dayou Zhou
>
> I'm using temporary UDTFs on Spark 2.2, e.g.
> CREATE TEMPORARY FUNCTION myudtf AS 'com.foo.MyUdtf' USING JAR 
> 'hdfs:///path/to/udf.jar'; 
> But when I try to invoke it, I get the following error:
> {noformat}
> 17/06/14 19:43:50 ERROR SparkExecuteStatementOperation: Error running hive 
> query:
> org.apache.hive.service.cli.HiveSQLException: 
> org.apache.spark.sql.AnalysisException: No handler for Hive UDF 
> 'com.foo.MyUdtf': java.lang.NullPointerException; line 1 pos 7
> at 
> org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.org$apache$spark$sql$hive$thriftserver$SparkExecuteStatementOperation$$execute(SparkExecuteStatementOperation.scala:266)
> at 
> org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$1$$anon$2.run(SparkExecuteStatementOperation.scala:174)
> at 
> org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$1$$anon$2.run(SparkExecuteStatementOperation.scala:171)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698)
> at 
> org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$1.run(SparkExecuteStatementOperation.scala:184)
> at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> {noformat}
> Any help appreciated, thanks.






[jira] [Commented] (SPARK-19644) Memory leak in Spark Streaming

2017-06-18 Thread Deenbandhu Agarwal (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16053498#comment-16053498
 ] 

Deenbandhu Agarwal commented on SPARK-19644:


Yes, I can try, but are there any reports of such events in that particular version?


> Memory leak in Spark Streaming
> --
>
> Key: SPARK-19644
> URL: https://issues.apache.org/jira/browse/SPARK-19644
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 2.0.2
> Environment: 3 AWS EC2 c3.xLarge
> Number of cores - 3
> Number of executors 3 
> Memory to each executor 2GB
>Reporter: Deenbandhu Agarwal
>  Labels: memory_leak, performance
> Attachments: Dominator_tree.png, heapdump.png, Path2GCRoot.png
>
>
> I am using Spark Streaming in production for some aggregation, fetching data 
> from Cassandra and saving data back to Cassandra. 
> I see a gradual increase in old-generation heap capacity from 1,161,216 bytes 
> to 1,397,760 bytes over a period of six hours.
> After 50 hours of processing, instances of the class 
> scala.collection.immutable.$colon$colon increased to 12,811,793, which is a 
> huge number. 
> I think this is a clear case of a memory leak.






[jira] [Commented] (SPARK-19644) Memory leak in Spark Streaming

2017-06-18 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16053494#comment-16053494
 ] 

Dongjoon Hyun commented on SPARK-19644:
---

Hi, [~deenbandhu].
Could you try this with the latest versions like 2.1.1 or 2.2.0-RC4?

> Memory leak in Spark Streaming
> --
>
> Key: SPARK-19644
> URL: https://issues.apache.org/jira/browse/SPARK-19644
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 2.0.2
> Environment: 3 AWS EC2 c3.xLarge
> Number of cores - 3
> Number of executors 3 
> Memory to each executor 2GB
>Reporter: Deenbandhu Agarwal
>  Labels: memory_leak, performance
> Attachments: Dominator_tree.png, heapdump.png, Path2GCRoot.png
>
>
> I am using Spark Streaming in production for some aggregation, fetching data 
> from Cassandra and saving data back to Cassandra. 
> I see a gradual increase in old-generation heap capacity from 1,161,216 bytes 
> to 1,397,760 bytes over a period of six hours.
> After 50 hours of processing, instances of the class 
> scala.collection.immutable.$colon$colon increased to 12,811,793, which is a 
> huge number. 
> I think this is a clear case of a memory leak.






[jira] [Comment Edited] (SPARK-17381) Memory leak org.apache.spark.sql.execution.ui.SQLTaskMetrics

2017-06-18 Thread Deenbandhu Agarwal (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17381?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16053488#comment-16053488
 ] 

Deenbandhu Agarwal edited comment on SPARK-17381 at 6/19/17 5:33 AM:
-

[~joaomaiaduarte] I am facing a similar kind of issue. I am running Spark 
Streaming in the production environment with 6 executors (1 GB memory and 1 
core each) and a driver with 3 GB. The Spark version used is 2.0.1. Objects of 
some linked list are accumulating over time in the driver's JVM heap, and after 
2-3 hours GC becomes very frequent and jobs start queuing up. I tried your 
solution, but in vain. We are not using a linked list anywhere. You can find 
details of the issue here: [https://issues.apache.org/jira/browse/SPARK-19644]


was (Author: deenbandhu):
[~joaomaiaduarte] I am facing a similar kind of issue. I am running spark 
streaming in the production environment with 6 executors and 1 GB memory and 1 
core each and driver with 3 GB.Spark Version used is 2.0.1. Objects of some 
linked list are getting accumulated over the time in the JVM Heap of driver and 
after 2-3 hours the GC become very frequent and jobs starts queuing up. I tried 
your solution but in vain. We are not using linked list anywhere.  

> Memory leak  org.apache.spark.sql.execution.ui.SQLTaskMetrics
> -
>
> Key: SPARK-17381
> URL: https://issues.apache.org/jira/browse/SPARK-17381
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
> Environment: EMR 5.0.0 (submitted as yarn-client)
> Java Version  1.8.0_101 (Oracle Corporation)
> Scala Version version 2.11.8
> Problem also happens when I run locally with similar versions of java/scala. 
> OS: Ubuntu 16.04
>Reporter: Joao Duarte
>
> I am running a Spark Streaming application from a Kinesis stream. After 
> running for some hours it runs out of memory. After a driver heap dump I found 
> two problems:
> 1) a huge amount of org.apache.spark.sql.execution.ui.SQLTaskMetrics (it seems 
> this was a problem before: 
> https://issues.apache.org/jira/browse/SPARK-11192);
> To replicate the org.apache.spark.sql.execution.ui.SQLTaskMetrics leak I just 
> needed to run the code below:
> {code}
> val dstream = ssc.union(kinesisStreams)
> dstream.foreachRDD((streamInfo: RDD[Array[Byte]]) => {
>   val toyDF = streamInfo.map(_ =>
> (1, "data","more data "
> ))
> .toDF("Num", "Data", "MoreData" )
>   toyDF.agg(sum("Num")).first().get(0)
> }
> )
> {code}
> 2) a huge amount of Array[Byte] (9 GB+)
> After some analysis, I noticed that most of the Array[Byte] were being 
> referenced by objects that were being referenced by SQLTaskMetrics. The 
> strangest thing is that those Array[Byte] were basically text that was 
> loaded in the executors, so they should never be in the driver at all!
> I still could not replicate the 2nd problem with simple code (the original 
> was complex, with data coming from S3, DynamoDB and other databases). However, 
> when I debug the application I can see that in Executor.scala, during 
> reportHeartBeat(), the data that should not be sent to the driver is being 
> added to "accumUpdates" which, as I understand, will be sent to the driver 
> for reporting.
> To be more precise, one of the taskRunners in the loop "for (taskRunner <- 
> runningTasks.values().asScala)" contains a GenericInternalRow with a lot of 
> data that should not go to the driver. The path would be, in my case: 
> taskRunner.task.metrics.externalAccums[2]._list[0]. This data is similar (if 
> not the same) to the data I see when I do a driver heap dump. 
> I guess that if the org.apache.spark.sql.execution.ui.SQLTaskMetrics leak is 
> fixed I would have less of this undesirable data in the driver and I could 
> run my streaming app for a long period of time, but I think there will always 
> be some performance loss.






[jira] [Commented] (SPARK-17381) Memory leak org.apache.spark.sql.execution.ui.SQLTaskMetrics

2017-06-18 Thread Deenbandhu Agarwal (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17381?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16053488#comment-16053488
 ] 

Deenbandhu Agarwal commented on SPARK-17381:


[~joaomaiaduarte] I am facing a similar kind of issue. I am running Spark 
Streaming in the production environment with 6 executors (1 GB memory and 1 
core each) and a driver with 3 GB. The Spark version used is 2.0.1. Objects of 
some linked list are accumulating over time in the driver's JVM heap, and after 
2-3 hours GC becomes very frequent and jobs start queuing up. I tried your 
solution, but in vain. We are not using a linked list anywhere.

> Memory leak  org.apache.spark.sql.execution.ui.SQLTaskMetrics
> -
>
> Key: SPARK-17381
> URL: https://issues.apache.org/jira/browse/SPARK-17381
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
> Environment: EMR 5.0.0 (submitted as yarn-client)
> Java Version  1.8.0_101 (Oracle Corporation)
> Scala Version version 2.11.8
> Problem also happens when I run locally with similar versions of java/scala. 
> OS: Ubuntu 16.04
>Reporter: Joao Duarte
>
> I am running a Spark Streaming application from a Kinesis stream. After 
> running for some hours it runs out of memory. After a driver heap dump I found 
> two problems:
> 1) a huge amount of org.apache.spark.sql.execution.ui.SQLTaskMetrics (it seems 
> this was a problem before: 
> https://issues.apache.org/jira/browse/SPARK-11192);
> To replicate the org.apache.spark.sql.execution.ui.SQLTaskMetrics leak I just 
> needed to run the code below:
> {code}
> val dstream = ssc.union(kinesisStreams)
> dstream.foreachRDD((streamInfo: RDD[Array[Byte]]) => {
>   val toyDF = streamInfo.map(_ =>
> (1, "data","more data "
> ))
> .toDF("Num", "Data", "MoreData" )
>   toyDF.agg(sum("Num")).first().get(0)
> }
> )
> {code}
> 2) a huge amount of Array[Byte] (9 GB+)
> After some analysis, I noticed that most of the Array[Byte] were being 
> referenced by objects that were being referenced by SQLTaskMetrics. The 
> strangest thing is that those Array[Byte] were basically text that was 
> loaded in the executors, so they should never be in the driver at all!
> I still could not replicate the 2nd problem with simple code (the original 
> was complex, with data coming from S3, DynamoDB and other databases). However, 
> when I debug the application I can see that in Executor.scala, during 
> reportHeartBeat(), the data that should not be sent to the driver is being 
> added to "accumUpdates" which, as I understand, will be sent to the driver 
> for reporting.
> To be more precise, one of the taskRunners in the loop "for (taskRunner <- 
> runningTasks.values().asScala)" contains a GenericInternalRow with a lot of 
> data that should not go to the driver. The path would be, in my case: 
> taskRunner.task.metrics.externalAccums[2]._list[0]. This data is similar (if 
> not the same) to the data I see when I do a driver heap dump. 
> I guess that if the org.apache.spark.sql.execution.ui.SQLTaskMetrics leak is 
> fixed I would have less of this undesirable data in the driver and I could 
> run my streaming app for a long period of time, but I think there will always 
> be some performance loss.






[jira] [Resolved] (SPARK-19824) Standalone master JSON not showing cores for running applications

2017-06-18 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19824?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-19824.
-
  Resolution: Fixed
Assignee: Jiang Xingbo
   Fix Version/s: 2.3.0
Target Version/s: 2.3.0

> Standalone master JSON not showing cores for running applications
> -
>
> Key: SPARK-19824
> URL: https://issues.apache.org/jira/browse/SPARK-19824
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy
>Affects Versions: 2.1.0
>Reporter: Dan
>Assignee: Jiang Xingbo
>Priority: Minor
> Fix For: 2.3.0
>
>
> The JSON API of the standalone master ("/json") does not show the number of 
> cores for a running application, which is available on the UI.
>   "activeapps" : [ {
> "starttime" : 1488702337788,
> "id" : "app-20170305102537-19717",
> "name" : "POPAI_Aggregated",
> "user" : "ibiuser",
> "memoryperslave" : 16384,
> "submitdate" : "Sun Mar 05 10:25:37 IST 2017",
> "state" : "RUNNING",
> "duration" : 1141934
>   } ],






[jira] [Commented] (SPARK-20927) Add cache operator to Unsupported Operations in Structured Streaming

2017-06-18 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16053468#comment-16053468
 ] 

Apache Spark commented on SPARK-20927:
--

User 'ZiyueHuang' has created a pull request for this issue:
https://github.com/apache/spark/pull/18349

> Add cache operator to Unsupported Operations in Structured Streaming 
> -
>
> Key: SPARK-20927
> URL: https://issues.apache.org/jira/browse/SPARK-20927
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.2.0
>Reporter: Jacek Laskowski
>Priority: Trivial
>  Labels: starter
>
> Just [found 
> out|https://stackoverflow.com/questions/42062092/why-does-using-cache-on-streaming-datasets-fail-with-analysisexception-queries]
>  that {{cache}} is not allowed on streaming datasets.
> {{cache}} on streaming datasets leads to the following exception:
> {code}
> scala> spark.readStream.text("files").cache
> org.apache.spark.sql.AnalysisException: Queries with streaming sources must 
> be executed with writeStream.start();;
> FileSource[files]
>   at 
> org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$.org$apache$spark$sql$catalyst$analysis$UnsupportedOperationChecker$$throwError(UnsupportedOperationChecker.scala:297)
>   at 
> org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$$anonfun$checkForBatch$1.apply(UnsupportedOperationChecker.scala:36)
>   at 
> org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$$anonfun$checkForBatch$1.apply(UnsupportedOperationChecker.scala:34)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:127)
>   at 
> org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$.checkForBatch(UnsupportedOperationChecker.scala:34)
>   at 
> org.apache.spark.sql.execution.QueryExecution.assertSupported(QueryExecution.scala:63)
>   at 
> org.apache.spark.sql.execution.QueryExecution.withCachedData$lzycompute(QueryExecution.scala:74)
>   at 
> org.apache.spark.sql.execution.QueryExecution.withCachedData(QueryExecution.scala:72)
>   at 
> org.apache.spark.sql.execution.QueryExecution.optimizedPlan$lzycompute(QueryExecution.scala:78)
>   at 
> org.apache.spark.sql.execution.QueryExecution.optimizedPlan(QueryExecution.scala:78)
>   at 
> org.apache.spark.sql.execution.QueryExecution.sparkPlan$lzycompute(QueryExecution.scala:84)
>   at 
> org.apache.spark.sql.execution.QueryExecution.sparkPlan(QueryExecution.scala:80)
>   at 
> org.apache.spark.sql.execution.QueryExecution.executedPlan$lzycompute(QueryExecution.scala:89)
>   at 
> org.apache.spark.sql.execution.QueryExecution.executedPlan(QueryExecution.scala:89)
>   at 
> org.apache.spark.sql.execution.CacheManager$$anonfun$cacheQuery$1.apply(CacheManager.scala:104)
>   at 
> org.apache.spark.sql.execution.CacheManager.writeLock(CacheManager.scala:68)
>   at 
> org.apache.spark.sql.execution.CacheManager.cacheQuery(CacheManager.scala:92)
>   at org.apache.spark.sql.Dataset.persist(Dataset.scala:2603)
>   at org.apache.spark.sql.Dataset.cache(Dataset.scala:2613)
>   ... 48 elided
> {code}
> It should be included in Structured Streaming's [Unsupported 
> Operations|http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#unsupported-operations].






[jira] [Assigned] (SPARK-20927) Add cache operator to Unsupported Operations in Structured Streaming

2017-06-18 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20927?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20927:


Assignee: (was: Apache Spark)

> Add cache operator to Unsupported Operations in Structured Streaming 
> -
>
> Key: SPARK-20927
> URL: https://issues.apache.org/jira/browse/SPARK-20927
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.2.0
>Reporter: Jacek Laskowski
>Priority: Trivial
>  Labels: starter
>
> Just [found 
> out|https://stackoverflow.com/questions/42062092/why-does-using-cache-on-streaming-datasets-fail-with-analysisexception-queries]
>  that {{cache}} is not allowed on streaming datasets.
> {{cache}} on streaming datasets leads to the following exception:
> {code}
> scala> spark.readStream.text("files").cache
> org.apache.spark.sql.AnalysisException: Queries with streaming sources must 
> be executed with writeStream.start();;
> FileSource[files]
>   at 
> org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$.org$apache$spark$sql$catalyst$analysis$UnsupportedOperationChecker$$throwError(UnsupportedOperationChecker.scala:297)
>   at 
> org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$$anonfun$checkForBatch$1.apply(UnsupportedOperationChecker.scala:36)
>   at 
> org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$$anonfun$checkForBatch$1.apply(UnsupportedOperationChecker.scala:34)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:127)
>   at 
> org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$.checkForBatch(UnsupportedOperationChecker.scala:34)
>   at 
> org.apache.spark.sql.execution.QueryExecution.assertSupported(QueryExecution.scala:63)
>   at 
> org.apache.spark.sql.execution.QueryExecution.withCachedData$lzycompute(QueryExecution.scala:74)
>   at 
> org.apache.spark.sql.execution.QueryExecution.withCachedData(QueryExecution.scala:72)
>   at 
> org.apache.spark.sql.execution.QueryExecution.optimizedPlan$lzycompute(QueryExecution.scala:78)
>   at 
> org.apache.spark.sql.execution.QueryExecution.optimizedPlan(QueryExecution.scala:78)
>   at 
> org.apache.spark.sql.execution.QueryExecution.sparkPlan$lzycompute(QueryExecution.scala:84)
>   at 
> org.apache.spark.sql.execution.QueryExecution.sparkPlan(QueryExecution.scala:80)
>   at 
> org.apache.spark.sql.execution.QueryExecution.executedPlan$lzycompute(QueryExecution.scala:89)
>   at 
> org.apache.spark.sql.execution.QueryExecution.executedPlan(QueryExecution.scala:89)
>   at 
> org.apache.spark.sql.execution.CacheManager$$anonfun$cacheQuery$1.apply(CacheManager.scala:104)
>   at 
> org.apache.spark.sql.execution.CacheManager.writeLock(CacheManager.scala:68)
>   at 
> org.apache.spark.sql.execution.CacheManager.cacheQuery(CacheManager.scala:92)
>   at org.apache.spark.sql.Dataset.persist(Dataset.scala:2603)
>   at org.apache.spark.sql.Dataset.cache(Dataset.scala:2613)
>   ... 48 elided
> {code}
> It should be included in Structured Streaming's [Unsupported 
> Operations|http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#unsupported-operations].






[jira] [Assigned] (SPARK-20927) Add cache operator to Unsupported Operations in Structured Streaming

2017-06-18 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20927?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20927:


Assignee: Apache Spark

> Add cache operator to Unsupported Operations in Structured Streaming 
> -
>
> Key: SPARK-20927
> URL: https://issues.apache.org/jira/browse/SPARK-20927
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.2.0
>Reporter: Jacek Laskowski
>Assignee: Apache Spark
>Priority: Trivial
>  Labels: starter
>
> Just [found 
> out|https://stackoverflow.com/questions/42062092/why-does-using-cache-on-streaming-datasets-fail-with-analysisexception-queries]
>  that {{cache}} is not allowed on streaming datasets.
> {{cache}} on streaming datasets leads to the following exception:
> {code}
> scala> spark.readStream.text("files").cache
> org.apache.spark.sql.AnalysisException: Queries with streaming sources must 
> be executed with writeStream.start();;
> FileSource[files]
>   at 
> org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$.org$apache$spark$sql$catalyst$analysis$UnsupportedOperationChecker$$throwError(UnsupportedOperationChecker.scala:297)
>   at 
> org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$$anonfun$checkForBatch$1.apply(UnsupportedOperationChecker.scala:36)
>   at 
> org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$$anonfun$checkForBatch$1.apply(UnsupportedOperationChecker.scala:34)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:127)
>   at 
> org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$.checkForBatch(UnsupportedOperationChecker.scala:34)
>   at 
> org.apache.spark.sql.execution.QueryExecution.assertSupported(QueryExecution.scala:63)
>   at 
> org.apache.spark.sql.execution.QueryExecution.withCachedData$lzycompute(QueryExecution.scala:74)
>   at 
> org.apache.spark.sql.execution.QueryExecution.withCachedData(QueryExecution.scala:72)
>   at 
> org.apache.spark.sql.execution.QueryExecution.optimizedPlan$lzycompute(QueryExecution.scala:78)
>   at 
> org.apache.spark.sql.execution.QueryExecution.optimizedPlan(QueryExecution.scala:78)
>   at 
> org.apache.spark.sql.execution.QueryExecution.sparkPlan$lzycompute(QueryExecution.scala:84)
>   at 
> org.apache.spark.sql.execution.QueryExecution.sparkPlan(QueryExecution.scala:80)
>   at 
> org.apache.spark.sql.execution.QueryExecution.executedPlan$lzycompute(QueryExecution.scala:89)
>   at 
> org.apache.spark.sql.execution.QueryExecution.executedPlan(QueryExecution.scala:89)
>   at 
> org.apache.spark.sql.execution.CacheManager$$anonfun$cacheQuery$1.apply(CacheManager.scala:104)
>   at 
> org.apache.spark.sql.execution.CacheManager.writeLock(CacheManager.scala:68)
>   at 
> org.apache.spark.sql.execution.CacheManager.cacheQuery(CacheManager.scala:92)
>   at org.apache.spark.sql.Dataset.persist(Dataset.scala:2603)
>   at org.apache.spark.sql.Dataset.cache(Dataset.scala:2613)
>   ... 48 elided
> {code}
> It should be included in Structured Streaming's [Unsupported 
> Operations|http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#unsupported-operations].






[jira] [Updated] (SPARK-21090) Optimize the unified memory manager code

2017-06-18 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21090?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-21090:

Fix Version/s: (was: 2.3.0)
   2.2.0

> Optimize the unified  memory manager code
> -
>
> Key: SPARK-21090
> URL: https://issues.apache.org/jira/browse/SPARK-21090
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: liuxian
> Fix For: 2.2.0
>
>
> 1. In *acquireStorageMemory*, when the MemoryMode is OFF_HEAP, *maxMemory* 
> should be changed to *maxOffHeapStorageMemory*.
> 2. When borrowing memory from execution, changing *numBytes* to *numBytes - 
> storagePool.memoryFree* would be more reasonable.
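
For context, here is a self-contained sketch of the acquisition logic those two 
points describe (hypothetical helper types, not Spark's internal classes; the 
actual change lives in UnifiedMemoryManager):

{code:scala}
// Toy model: a memory pool with a resizable size and a usage counter
// (hypothetical helper, not Spark's StorageMemoryPool).
final class Pool(var poolSize: Long) {
  var memoryUsed: Long = 0L
  def memoryFree: Long = poolSize - memoryUsed
}

/** Try to acquire `numBytes` of storage memory, borrowing from execution if needed. */
def acquireStorageMemory(numBytes: Long, maxStorageMemory: Long,
                         storage: Pool, execution: Pool): Boolean = {
  // Point 1: maxStorageMemory must be the limit of the *selected* memory mode,
  // i.e. maxOffHeapStorageMemory when the request is OFF_HEAP.
  if (numBytes > maxStorageMemory) return false
  if (numBytes > storage.memoryFree) {
    // Point 2: borrow only the shortfall (numBytes - storage.memoryFree),
    // capped by what the execution pool actually has free.
    val borrowed = math.min(execution.memoryFree, numBytes - storage.memoryFree)
    execution.poolSize -= borrowed
    storage.poolSize += borrowed
  }
  if (numBytes <= storage.memoryFree) { storage.memoryUsed += numBytes; true }
  else false
}
{code}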






[jira] [Assigned] (SPARK-21090) Optimize the unified memory manager code

2017-06-18 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21090?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-21090:
---

Assignee: liuxian

> Optimize the unified  memory manager code
> -
>
> Key: SPARK-21090
> URL: https://issues.apache.org/jira/browse/SPARK-21090
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: liuxian
>Assignee: liuxian
> Fix For: 2.2.0
>
>
> 1. In *acquireStorageMemory*, when the MemoryMode is OFF_HEAP, *maxMemory* 
> should be changed to *maxOffHeapStorageMemory*.
> 2. When borrowing memory from execution, changing *numBytes* to *numBytes - 
> storagePool.memoryFree* would be more reasonable.






[jira] [Resolved] (SPARK-21090) Optimize the unified memory manager code

2017-06-18 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21090?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-21090.
-
   Resolution: Fixed
Fix Version/s: 2.3.0

Issue resolved by pull request 18296
[https://github.com/apache/spark/pull/18296]

> Optimize the unified  memory manager code
> -
>
> Key: SPARK-21090
> URL: https://issues.apache.org/jira/browse/SPARK-21090
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: liuxian
> Fix For: 2.3.0
>
>
> 1. In *acquireStorageMemory*, when the MemoryMode is OFF_HEAP, *maxMemory* 
> should be changed to *maxOffHeapStorageMemory*.
> 2. When borrowing memory from execution, changing *numBytes* to *numBytes - 
> storagePool.memoryFree* would be more reasonable.






[jira] [Resolved] (SPARK-20948) Built-in SQL Function UnaryMinus/UnaryPositive support string type

2017-06-18 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20948?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-20948.
-
   Resolution: Fixed
 Assignee: Yuming Wang
Fix Version/s: 2.3.0

> Built-in SQL Function UnaryMinus/UnaryPositive support string type
> --
>
> Key: SPARK-20948
> URL: https://issues.apache.org/jira/browse/SPARK-20948
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Yuming Wang
>Assignee: Yuming Wang
> Fix For: 2.3.0
>
>
> The {{UnaryMinus}}/{{UnaryPositive}} functions should support string type, the 
> same as Hive:
> {code:sql}
> $ bin/hive
> Logging initialized using configuration in 
> jar:file:/home/wym/apache-hive-1.2.2-bin/lib/hive-common-1.2.2.jar!/hive-log4j.properties
> hive> select positive('-1.11'), negative('-1.11');
> OK
> -1.11   1.11
> Time taken: 1.937 seconds, Fetched: 1 row(s)
> hive> 
> {code}
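
For comparison, the equivalent check from Spark (a hedged sketch; with string 
support added, this is expected to match the Hive output above):

{code:scala}
// Run the same expressions through Spark SQL; positive/negative accept string
// input once this change is in place.
spark.sql("SELECT positive('-1.11'), negative('-1.11')").show()
// Expected to print -1.11 and 1.11, matching Hive.
{code}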






[jira] [Commented] (SPARK-21120) Increasing the master's metric is conducive to the spark cluster management system monitoring.

2017-06-18 Thread guoxiaolongzte (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16053426#comment-16053426
 ] 

guoxiaolongzte commented on SPARK-21120:


Sorry, over the last two or three days I did not get to my JIRA in time.

> Increasing the master's metric is conducive to the spark cluster management 
> system monitoring.
> --
>
> Key: SPARK-21120
> URL: https://issues.apache.org/jira/browse/SPARK-21120
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: guoxiaolongzte
>Priority: Minor
> Attachments: 1.png
>
>
> The master currently exposes very few metrics, which cannot meet the needs of 
> a large-scale Spark cluster management system. So I am adding the relevant 
> metrics as completely as possible.






[jira] [Commented] (SPARK-21120) Increasing the master's metric is conducive to the spark cluster management system monitoring.

2017-06-18 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16053424#comment-16053424
 ] 

Apache Spark commented on SPARK-21120:
--

User 'guoxiaolongzte' has created a pull request for this issue:
https://github.com/apache/spark/pull/18348

> Increasing the master's metric is conducive to the spark cluster management 
> system monitoring.
> --
>
> Key: SPARK-21120
> URL: https://issues.apache.org/jira/browse/SPARK-21120
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: guoxiaolongzte
>Priority: Minor
> Attachments: 1.png
>
>
> The master currently exposes very few metrics, which cannot meet the needs of 
> a large-scale Spark cluster management system. So I am adding the relevant 
> metrics as completely as possible.






[jira] [Updated] (SPARK-21120) Increasing the master's metric is conducive to the spark cluster management system monitoring.

2017-06-18 Thread guoxiaolongzte (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

guoxiaolongzte updated SPARK-21120:
---
Attachment: 1.png

> Increasing the master's metric is conducive to the spark cluster management 
> system monitoring.
> --
>
> Key: SPARK-21120
> URL: https://issues.apache.org/jira/browse/SPARK-21120
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: guoxiaolongzte
>Priority: Minor
> Attachments: 1.png
>
>
> The master currently exposes very few metrics, which cannot meet the needs of 
> a large-scale Spark cluster management system. So I am adding the relevant 
> metrics as completely as possible.






[jira] [Updated] (SPARK-21120) Increasing the master's metric is conducive to the spark cluster management system monitoring.

2017-06-18 Thread guoxiaolongzte (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

guoxiaolongzte updated SPARK-21120:
---
Description: The master currently exposes very few metrics, which cannot meet 
the needs of a large-scale Spark cluster management system. So I am adding the 
relevant metrics as completely as possible.

> Increasing the master's metric is conducive to the spark cluster management 
> system monitoring.
> --
>
> Key: SPARK-21120
> URL: https://issues.apache.org/jira/browse/SPARK-21120
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: guoxiaolongzte
>Priority: Minor
> Attachments: 1.png
>
>
> The master currently exposes very few metrics, which cannot meet the needs of 
> a large-scale Spark cluster management system. So I am adding the relevant 
> metrics as completely as possible.






[jira] [Commented] (SPARK-20599) ConsoleSink should work with write (batch)

2017-06-18 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16053413#comment-16053413
 ] 

Apache Spark commented on SPARK-20599:
--

User 'lubozhan' has created a pull request for this issue:
https://github.com/apache/spark/pull/18347

> ConsoleSink should work with write (batch)
> --
>
> Key: SPARK-20599
> URL: https://issues.apache.org/jira/browse/SPARK-20599
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL, Structured Streaming
>Affects Versions: 2.2.0
>Reporter: Jacek Laskowski
>Priority: Minor
>  Labels: starter
>
> I think the following should just work.
> {code}
> spark.
>   read.  // <-- it's a batch query not streaming query if that matters
>   format("kafka").
>   option("subscribe", "topic1").
>   option("kafka.bootstrap.servers", "localhost:9092").
>   load.
>   write.
>   format("console").  // <-- that's not supported currently
>   save
> {code}
> The above combination of {{kafka}} source and {{console}} sink leads to the 
> following exception:
> {code}
> java.lang.RuntimeException: 
> org.apache.spark.sql.execution.streaming.ConsoleSinkProvider does not allow 
> create table as select.
>   at scala.sys.package$.error(package.scala:27)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource.write(DataSource.scala:479)
>   at 
> org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:48)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:135)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:132)
>   at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:113)
>   at 
> org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:93)
>   at 
> org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:93)
>   at 
> org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:610)
>   at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:233)
>   ... 48 elided
> {code}






[jira] [Assigned] (SPARK-20599) ConsoleSink should work with write (batch)

2017-06-18 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20599:


Assignee: Apache Spark

> ConsoleSink should work with write (batch)
> --
>
> Key: SPARK-20599
> URL: https://issues.apache.org/jira/browse/SPARK-20599
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL, Structured Streaming
>Affects Versions: 2.2.0
>Reporter: Jacek Laskowski
>Assignee: Apache Spark
>Priority: Minor
>  Labels: starter
>
> I think the following should just work.
> {code}
> spark.
>   read.  // <-- it's a batch query not streaming query if that matters
>   format("kafka").
>   option("subscribe", "topic1").
>   option("kafka.bootstrap.servers", "localhost:9092").
>   load.
>   write.
>   format("console").  // <-- that's not supported currently
>   save
> {code}
> The above combination of {{kafka}} source and {{console}} sink leads to the 
> following exception:
> {code}
> java.lang.RuntimeException: 
> org.apache.spark.sql.execution.streaming.ConsoleSinkProvider does not allow 
> create table as select.
>   at scala.sys.package$.error(package.scala:27)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource.write(DataSource.scala:479)
>   at 
> org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:48)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:135)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:132)
>   at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:113)
>   at 
> org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:93)
>   at 
> org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:93)
>   at 
> org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:610)
>   at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:233)
>   ... 48 elided
> {code}






[jira] [Assigned] (SPARK-20599) ConsoleSink should work with write (batch)

2017-06-18 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20599:


Assignee: (was: Apache Spark)

> ConsoleSink should work with write (batch)
> --
>
> Key: SPARK-20599
> URL: https://issues.apache.org/jira/browse/SPARK-20599
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL, Structured Streaming
>Affects Versions: 2.2.0
>Reporter: Jacek Laskowski
>Priority: Minor
>  Labels: starter
>
> I think the following should just work.
> {code}
> spark.
>   read.  // <-- it's a batch query not streaming query if that matters
>   format("kafka").
>   option("subscribe", "topic1").
>   option("kafka.bootstrap.servers", "localhost:9092").
>   load.
>   write.
>   format("console").  // <-- that's not supported currently
>   save
> {code}
> The above combination of {{kafka}} source and {{console}} sink leads to the 
> following exception:
> {code}
> java.lang.RuntimeException: 
> org.apache.spark.sql.execution.streaming.ConsoleSinkProvider does not allow 
> create table as select.
>   at scala.sys.package$.error(package.scala:27)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource.write(DataSource.scala:479)
>   at 
> org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:48)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:135)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:132)
>   at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:113)
>   at 
> org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:93)
>   at 
> org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:93)
>   at 
> org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:610)
>   at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:233)
>   ... 48 elided
> {code}






[jira] [Assigned] (SPARK-21134) Codegen-only expressions should not be collapsed with upper CodegenFallback expression

2017-06-18 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21134?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21134:


Assignee: (was: Apache Spark)

> Codegen-only expressions should not be collapsed with upper CodegenFallback 
> expression
> --
>
> Key: SPARK-21134
> URL: https://issues.apache.org/jira/browse/SPARK-21134
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.1
>Reporter: Liang-Chi Hsieh
>
> The {{CollapseProject}} rule in the optimizer collapses lower and upper 
> expressions when they have common expressions.
> We have seen a use case where a codegen-only expression was collapsed into an 
> upper {{CodegenFallback}} expression. Because all child expressions under 
> {{CodegenFallback}} are evaluated via the non-codegen {{eval}} path, this 
> causes an exception at runtime.






[jira] [Commented] (SPARK-21134) Codegen-only expressions should not be collapsed with upper CodegenFallback expression

2017-06-18 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21134?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16053404#comment-16053404
 ] 

Apache Spark commented on SPARK-21134:
--

User 'viirya' has created a pull request for this issue:
https://github.com/apache/spark/pull/18346

> Codegen-only expressions should not be collapsed with upper CodegenFallback 
> expression
> --
>
> Key: SPARK-21134
> URL: https://issues.apache.org/jira/browse/SPARK-21134
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.1
>Reporter: Liang-Chi Hsieh
>
> The {{CollapseProject}} rule in the optimizer collapses lower and upper 
> expressions when they have common expressions.
> We have seen a use case where a codegen-only expression was collapsed into an 
> upper {{CodegenFallback}} expression. Because all child expressions under 
> {{CodegenFallback}} are evaluated via the non-codegen {{eval}} path, this 
> causes an exception at runtime.






[jira] [Assigned] (SPARK-21134) Codegen-only expressions should not be collapsed with upper CodegenFallback expression

2017-06-18 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21134?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21134:


Assignee: Apache Spark

> Codegen-only expressions should not be collapsed with upper CodegenFallback 
> expression
> --
>
> Key: SPARK-21134
> URL: https://issues.apache.org/jira/browse/SPARK-21134
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.1
>Reporter: Liang-Chi Hsieh
>Assignee: Apache Spark
>
> The rule {{CollapseProject}} in the optimizer collapses lower and upper 
> projections if they have common expressions.
> We have seen a use case where a codegen-only expression is collapsed with 
> an upper {{CodegenFallback}} expression. Because all child expressions under 
> {{CodegenFallback}} are evaluated through the non-codegen {{eval}} path, this 
> causes a runtime exception.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-21134) Codegen-only expressions should not be collapsed with upper CodegenFallback expression

2017-06-18 Thread Liang-Chi Hsieh (JIRA)
Liang-Chi Hsieh created SPARK-21134:
---

 Summary: Codegen-only expressions should not be collapsed with 
upper CodegenFallback expression
 Key: SPARK-21134
 URL: https://issues.apache.org/jira/browse/SPARK-21134
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.1.1
Reporter: Liang-Chi Hsieh


The rule {{CollapseProject}} in the optimizer collapses lower and upper 
projections if they have common expressions.

We have seen a use case where a codegen-only expression is collapsed with an 
upper {{CodegenFallback}} expression. Because all child expressions under 
{{CodegenFallback}} are evaluated through the non-codegen {{eval}} path, this 
causes a runtime exception.
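
For illustration, a minimal sketch of how {{CollapseProject}} merges two adjacent 
projections into one. This shows the rule itself, not the codegen-only failure 
reported here; the column names and local session setup are made up for the example.

{code:scala}
// Two stacked select()s produce two Project nodes over the Range source;
// after optimization, CollapseProject merges them into a single Project
// computing ((id + 1) * 2) AS c directly on top of Range.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().master("local[*]").appName("collapse-project-sketch").getOrCreate()

val df = spark.range(3)
  .select((col("id") + 1).as("b"))   // lower Project
  .select((col("b") * 2).as("c"))    // upper Project

df.explain(true)   // optimized logical plan shows one collapsed Project over Range
{code}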




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-20896) spark executor get java.lang.ClassCastException when trigger two job at same time

2017-06-18 Thread poseidon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20896?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

poseidon closed SPARK-20896.

  Resolution: Fixed
   Fix Version/s: 1.6.4
Target Version/s: 1.6.2

Not an issue.

> spark executor get java.lang.ClassCastException when trigger two job at same 
> time
> -
>
> Key: SPARK-20896
> URL: https://issues.apache.org/jira/browse/SPARK-20896
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.6.1
>Reporter: poseidon
> Fix For: 1.6.4
>
>
> 1. Zeppelin 0.6.2 in *SCOPE* mode
> 2. Spark 1.6.2
> 3. HDP 2.4 for HDFS and YARN
> Trigger Scala code like the following:
> {noformat}
> var tmpDataFrame = sql(" select b1,b2,b3 from xxx.x")
> val vectorDf = assembler.transform(tmpDataFrame)
> val vectRdd = vectorDf.select("features").map{x:Row => x.getAs[Vector](0)}
> val correlMatrix: Matrix = Statistics.corr(vectRdd, "spearman")
> val columns = correlMatrix.toArray.grouped(correlMatrix.numRows)
> val rows = columns.toSeq.transpose
> val vectors = rows.map(row => new DenseVector(row.toArray))
> val vRdd = sc.parallelize(vectors)
> import sqlContext.implicits._
> val dfV = vRdd.map(_.toArray).map{ case Array(b1,b2,b3) => (b1,b2,b3) }.toDF()
> val rows = dfV.rdd.zipWithIndex.map(_.swap)
>   
> .join(sc.parallelize(Array("b1","b2","b3")).zipWithIndex.map(_.swap))
>   .values.map{case (row: Row, x: String) => Row.fromSeq(row.toSeq 
> :+ x)}
> {noformat}
> ---
> and code :
> {noformat}
> var df = sql("select b1,b2 from .x")
> var i = 0
> var threshold = Array(2.0,3.0)
> var inputCols = Array("b1","b2")
> var tmpDataFrame = df
> for (col <- inputCols){
>   val binarizer: Binarizer = new Binarizer().setInputCol(col)
> .setOutputCol(inputCols(i)+"_binary")
> .setThreshold(threshold(i))
>   tmpDataFrame = binarizer.transform(tmpDataFrame).drop(inputCols(i))
>   i = i+1
> }
> var saveDFBin = tmpDataFrame
> val dfAppendBin = sql("select b3 from poseidon.corelatdemo")
> val rows = saveDFBin.rdd.zipWithIndex.map(_.swap)
>   .join(dfAppendBin.rdd.zipWithIndex.map(_.swap))
>   .values.map{case (row1: Row, row2: Row) => Row.fromSeq(row1.toSeq 
> ++ row2.toSeq)}
> import org.apache.spark.sql.types.StructType
> val rowSchema = StructType(saveDFBin.schema.fields ++ 
> dfAppendBin.schema.fields)
> saveDFBin = sqlContext.createDataFrame(rows, rowSchema)
> //save result to table
> import org.apache.spark.sql.SaveMode
> saveDFBin.write.mode(SaveMode.Overwrite).saveAsTable(".")
> sql("alter table . set lifecycle 1")
> {noformat}
> Run these on Zeppelin in two different notebooks at the same time.
> The following exception appears in the executor log:
> {quote}
> l1.dtdream.com): java.lang.ClassCastException: 
> org.apache.spark.mllib.linalg.DenseVector cannot be cast to scala.Tuple2
> at 
> $line127359816836.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(:34)
> at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
> at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
> at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
> at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
> at org.apache.spark.util.Utils$.getIteratorSize(Utils.scala:1597)
> at 
> org.apache.spark.rdd.ZippedWithIndexRDD$$anonfun$2.apply(ZippedWithIndexRDD.scala:52)
> at 
> org.apache.spark.rdd.ZippedWithIndexRDD$$anonfun$2.apply(ZippedWithIndexRDD.scala:52)
> at 
> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1875)
> at 
> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1875)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
> at org.apache.spark.scheduler.Task.run(Task.scala:89)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> {quote}
> OR 
> {quote}
> java.lang.ClassCastException: scala.Tuple2 cannot be cast to 
> org.apache.spark.mllib.linalg.DenseVector
> at 
> $line34684895436.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(:57)
> at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
> at 
> org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:149)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
> at 
> org.apache.spark.scheduler

[jira] [Commented] (SPARK-20896) spark executor get java.lang.ClassCastException when trigger two job at same time

2017-06-18 Thread poseidon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20896?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16053386#comment-16053386
 ] 

poseidon commented on SPARK-20896:
--

It is related to Zeppelin, so this issue can be closed.

> spark executor get java.lang.ClassCastException when trigger two job at same 
> time
> -
>
> Key: SPARK-20896
> URL: https://issues.apache.org/jira/browse/SPARK-20896
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.6.1
>Reporter: poseidon
>
> 1. Zeppelin 0.6.2 in *SCOPE* mode
> 2. Spark 1.6.2
> 3. HDP 2.4 for HDFS and YARN
> Trigger Scala code like the following:
> {noformat}
> var tmpDataFrame = sql(" select b1,b2,b3 from xxx.x")
> val vectorDf = assembler.transform(tmpDataFrame)
> val vectRdd = vectorDf.select("features").map{x:Row => x.getAs[Vector](0)}
> val correlMatrix: Matrix = Statistics.corr(vectRdd, "spearman")
> val columns = correlMatrix.toArray.grouped(correlMatrix.numRows)
> val rows = columns.toSeq.transpose
> val vectors = rows.map(row => new DenseVector(row.toArray))
> val vRdd = sc.parallelize(vectors)
> import sqlContext.implicits._
> val dfV = vRdd.map(_.toArray).map{ case Array(b1,b2,b3) => (b1,b2,b3) }.toDF()
> val rows = dfV.rdd.zipWithIndex.map(_.swap)
>   
> .join(sc.parallelize(Array("b1","b2","b3")).zipWithIndex.map(_.swap))
>   .values.map{case (row: Row, x: String) => Row.fromSeq(row.toSeq 
> :+ x)}
> {noformat}
> ---
> and code :
> {noformat}
> var df = sql("select b1,b2 from .x")
> var i = 0
> var threshold = Array(2.0,3.0)
> var inputCols = Array("b1","b2")
> var tmpDataFrame = df
> for (col <- inputCols){
>   val binarizer: Binarizer = new Binarizer().setInputCol(col)
> .setOutputCol(inputCols(i)+"_binary")
> .setThreshold(threshold(i))
>   tmpDataFrame = binarizer.transform(tmpDataFrame).drop(inputCols(i))
>   i = i+1
> }
> var saveDFBin = tmpDataFrame
> val dfAppendBin = sql("select b3 from poseidon.corelatdemo")
> val rows = saveDFBin.rdd.zipWithIndex.map(_.swap)
>   .join(dfAppendBin.rdd.zipWithIndex.map(_.swap))
>   .values.map{case (row1: Row, row2: Row) => Row.fromSeq(row1.toSeq 
> ++ row2.toSeq)}
> import org.apache.spark.sql.types.StructType
> val rowSchema = StructType(saveDFBin.schema.fields ++ 
> dfAppendBin.schema.fields)
> saveDFBin = sqlContext.createDataFrame(rows, rowSchema)
> //save result to table
> import org.apache.spark.sql.SaveMode
> saveDFBin.write.mode(SaveMode.Overwrite).saveAsTable(".")
> sql("alter table . set lifecycle 1")
> {noformat}
> Run these on Zeppelin in two different notebooks at the same time.
> The following exception appears in the executor log:
> {quote}
> l1.dtdream.com): java.lang.ClassCastException: 
> org.apache.spark.mllib.linalg.DenseVector cannot be cast to scala.Tuple2
> at 
> $line127359816836.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(:34)
> at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
> at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
> at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
> at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
> at org.apache.spark.util.Utils$.getIteratorSize(Utils.scala:1597)
> at 
> org.apache.spark.rdd.ZippedWithIndexRDD$$anonfun$2.apply(ZippedWithIndexRDD.scala:52)
> at 
> org.apache.spark.rdd.ZippedWithIndexRDD$$anonfun$2.apply(ZippedWithIndexRDD.scala:52)
> at 
> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1875)
> at 
> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1875)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
> at org.apache.spark.scheduler.Task.run(Task.scala:89)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> {quote}
> OR 
> {quote}
> java.lang.ClassCastException: scala.Tuple2 cannot be cast to 
> org.apache.spark.mllib.linalg.DenseVector
> at 
> $line34684895436.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(:57)
> at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
> at 
> org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:149)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
> at 
> org.apache.spark.scheduler.ShuffleM

[jira] [Resolved] (SPARK-20892) Add SQL trunc function to SparkR

2017-06-18 Thread Felix Cheung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20892?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung resolved SPARK-20892.
--
  Resolution: Fixed
Assignee: Wayne Zhang
   Fix Version/s: 2.3.0
Target Version/s: 2.3.0

> Add SQL trunc function to SparkR
> 
>
> Key: SPARK-20892
> URL: https://issues.apache.org/jira/browse/SPARK-20892
> Project: Spark
>  Issue Type: New Feature
>  Components: SparkR
>Affects Versions: 2.1.1
>Reporter: Wayne Zhang
>Assignee: Wayne Zhang
> Fix For: 2.3.0
>
>
> Add SQL trunc function to SparkR
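
As a hedged sketch of the behavior this wrapper exposes, the underlying SQL 
{{trunc}} function can be exercised through the existing Scala API; the SparkR 
signature is assumed to mirror it, and the sample date is illustrative only.

{code:scala}
// Truncate a date column to the start of its year or month.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, trunc}

val spark = SparkSession.builder().master("local[*]").appName("trunc-sketch").getOrCreate()
import spark.implicits._

val df = Seq("2017-06-18").toDF("d").select(col("d").cast("date").as("d"))
df.select(trunc(col("d"), "year")).show()    // 2017-01-01
df.select(trunc(col("d"), "month")).show()   // 2017-06-01
{code}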



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-20520) R streaming tests failed on Windows

2017-06-18 Thread Felix Cheung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20520?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung resolved SPARK-20520.
--
   Resolution: Cannot Reproduce
Fix Version/s: 2.2.0

Tested with RC4 on Windows; this no longer reproduces.

> R streaming tests failed on Windows
> ---
>
> Key: SPARK-20520
> URL: https://issues.apache.org/jira/browse/SPARK-20520
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.1.1
>Reporter: Felix Cheung
>Assignee: Felix Cheung
> Fix For: 2.2.0
>
>
> Running R CMD check on SparkR 2.2 RC1 packages 
> {code}
> Failed 
> -
> 1. Failure: read.stream, write.stream, awaitTermination, stopQuery 
> (@test_streaming.R#56) 
> head(sql("SELECT count(*) FROM people"))[[1]] not equal to 3.
> 1/1 mismatches
> [1] 0 - 3 == -3
> 2. Failure: read.stream, write.stream, awaitTermination, stopQuery 
> (@test_streaming.R#60) 
> head(sql("SELECT count(*) FROM people"))[[1]] not equal to 6.
> 1/1 mismatches
> [1] 3 - 6 == -3
> 3. Failure: print from explain, lastProgress, status, isActive 
> (@test_streaming.R#75) 
> any(grepl("\"description\" : \"MemorySink\"", 
> capture.output(lastProgress(q isn't true.
> 4. Failure: Stream other format (@test_streaming.R#95) 
> -
> head(sql("SELECT count(*) FROM people3"))[[1]] not equal to 3.
> 1/1 mismatches
> [1] 0 - 3 == -3
> 5. Failure: Stream other format (@test_streaming.R#98) 
> -
> any(...) isn't true.
> {code}
> Need to investigate



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-21128) Running R tests multiple times failed due to pre-exiting "spark-warehouse" / "metastore_db"

2017-06-18 Thread Felix Cheung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21128?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung resolved SPARK-21128.
--
  Resolution: Fixed
Assignee: Hyukjin Kwon
   Fix Version/s: 2.3.0
Target Version/s: 2.3.0

> Running R tests multiple times failed due to pre-exiting "spark-warehouse" / 
> "metastore_db"
> ---
>
> Key: SPARK-21128
> URL: https://issues.apache.org/jira/browse/SPARK-21128
> Project: Spark
>  Issue Type: Test
>  Components: SparkR
>Affects Versions: 2.2.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Minor
> Fix For: 2.3.0
>
>
> Currently, running R tests multiple times fails due to pre-existing 
> "spark-warehouse" / "metastore_db" as below:
> {code}
> SparkSQL functions: Spark package found in SPARK_HOME: .../spark
> ...1234...
> {code}
> {code}
> Failed 
> -
> 1. Failure: No extra files are created in SPARK_HOME by starting session and 
> making calls (@test_sparkSQL.R#3384)
> length(list1) not equal to length(list2).
> 1/1 mismatches
> [1] 25 - 23 == 2
> 2. Failure: No extra files are created in SPARK_HOME by starting session and 
> making calls (@test_sparkSQL.R#3384)
> sort(list1, na.last = TRUE) not equal to sort(list2, na.last = TRUE).
> 10/25 mismatches
> x[16]: "metastore_db"
> y[16]: "pkg"
> x[17]: "pkg"
> y[17]: "R"
> x[18]: "R"
> y[18]: "README.md"
> x[19]: "README.md"
> y[19]: "run-tests.sh"
> x[20]: "run-tests.sh"
> y[20]: "SparkR_2.2.0.tar.gz"
> x[21]: "metastore_db"
> y[21]: "pkg"
> x[22]: "pkg"
> y[22]: "R"
> x[23]: "R"
> y[23]: "README.md"
> x[24]: "README.md"
> y[24]: "run-tests.sh"
> x[25]: "run-tests.sh"
> y[25]: "SparkR_2.2.0.tar.gz"
> 3. Failure: No extra files are created in SPARK_HOME by starting session and 
> making calls (@test_sparkSQL.R#3388)
> length(list1) not equal to length(list2).
> 1/1 mismatches
> [1] 25 - 23 == 2
> 4. Failure: No extra files are created in SPARK_HOME by starting session and 
> making calls (@test_sparkSQL.R#3388)
> sort(list1, na.last = TRUE) not equal to sort(list2, na.last = TRUE).
> 10/25 mismatches
> x[16]: "metastore_db"
> y[16]: "pkg"
> x[17]: "pkg"
> y[17]: "R"
> x[18]: "R"
> y[18]: "README.md"
> x[19]: "README.md"
> y[19]: "run-tests.sh"
> x[20]: "run-tests.sh"
> y[20]: "SparkR_2.2.0.tar.gz"
> x[21]: "metastore_db"
> y[21]: "pkg"
> x[22]: "pkg"
> y[22]: "R"
> x[23]: "R"
> y[23]: "README.md"
> x[24]: "README.md"
> y[24]: "run-tests.sh"
> x[25]: "run-tests.sh"
> y[25]: "SparkR_2.2.0.tar.gz"
> DONE 
> ===
> {code}
> It looks like we should remove both "spark-warehouse" and "metastore_db" _before_ 
> listing files into {{sparkRFilesBefore}}.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15799) Release SparkR on CRAN

2017-06-18 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15799?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-15799:
--
Target Version/s:   (was: 2.2.0)

> Release SparkR on CRAN
> --
>
> Key: SPARK-15799
> URL: https://issues.apache.org/jira/browse/SPARK-15799
> Project: Spark
>  Issue Type: New Feature
>  Components: SparkR
>Reporter: Xiangrui Meng
>
> Story: "As an R user, I would like to see SparkR released on CRAN, so I can 
> use SparkR easily in an existing R environment and have other packages built 
> on top of SparkR."
> I made this JIRA with the following questions in mind:
> * Are there known issues that prevent us releasing SparkR on CRAN?
> * Do we want to package Spark jars in the SparkR release?
> * Are there license issues?
> * How does it fit into Spark's release process?



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20520) R streaming tests failed on Windows

2017-06-18 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20520?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-20520:
--
Affects Version/s: 2.1.1
 Target Version/s:   (was: 2.2.0)

> R streaming tests failed on Windows
> ---
>
> Key: SPARK-20520
> URL: https://issues.apache.org/jira/browse/SPARK-20520
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.1.1
>Reporter: Felix Cheung
>Assignee: Felix Cheung
>
> Running R CMD check on SparkR 2.2 RC1 packages 
> {code}
> Failed 
> -
> 1. Failure: read.stream, write.stream, awaitTermination, stopQuery 
> (@test_streaming.R#56) 
> head(sql("SELECT count(*) FROM people"))[[1]] not equal to 3.
> 1/1 mismatches
> [1] 0 - 3 == -3
> 2. Failure: read.stream, write.stream, awaitTermination, stopQuery 
> (@test_streaming.R#60) 
> head(sql("SELECT count(*) FROM people"))[[1]] not equal to 6.
> 1/1 mismatches
> [1] 3 - 6 == -3
> 3. Failure: print from explain, lastProgress, status, isActive 
> (@test_streaming.R#75) 
> any(grepl("\"description\" : \"MemorySink\"", 
> capture.output(lastProgress(q isn't true.
> 4. Failure: Stream other format (@test_streaming.R#95) 
> -
> head(sql("SELECT count(*) FROM people3"))[[1]] not equal to 3.
> 1/1 mismatches
> [1] 0 - 3 == -3
> 5. Failure: Stream other format (@test_streaming.R#98) 
> -
> any(...) isn't true.
> {code}
> Need to investigate



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18267) Distribute PySpark via Python Package Index (pypi)

2017-06-18 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18267?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16053253#comment-16053253
 ] 

Sean Owen commented on SPARK-18267:
---

[~holdenk], is all of this basically done for 2.2?

> Distribute PySpark via Python Package Index (pypi)
> --
>
> Key: SPARK-18267
> URL: https://issues.apache.org/jira/browse/SPARK-18267
> Project: Spark
>  Issue Type: New Feature
>  Components: Build, Project Infra, PySpark
>Reporter: Reynold Xin
>
> The goal is to distribute PySpark via pypi, so users can simply run Spark on 
> a single node via "pip install pyspark" (or "pip install apache-spark").



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21125) PySpark context missing function to set Job Description.

2017-06-18 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21125?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-21125:
--
Target Version/s:   (was: 2.2.0)
   Fix Version/s: (was: 2.2.0)
  Issue Type: Improvement  (was: Bug)

> PySpark context missing function to set Job Description.
> 
>
> Key: SPARK-21125
> URL: https://issues.apache.org/jira/browse/SPARK-21125
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.1.1
>Reporter: Shane Jarvie
>Priority: Trivial
>  Labels: beginner
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> The PySpark API is missing a convenient function currently found in the 
> Scala API, which sets the Job Description for display in the Spark UI.
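
A short sketch of the Scala-side call being referenced here, 
{{SparkContext.setJobDescription}}; the session setup, description string, and 
demo job are illustrative only.

{code:scala}
// Set a human-readable description before triggering a job so it shows up
// next to that job in the Spark UI's "Description" column.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("job-desc-sketch").getOrCreate()
spark.sparkContext.setJobDescription("Counting 1000 rows for the UI demo")
spark.range(1000).count()   // this job is listed in the UI with the description above
{code}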



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21101) Error running Hive temporary UDTF on latest Spark 2.2

2017-06-18 Thread Yuming Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16053145#comment-16053145
 ] 

Yuming Wang commented on SPARK-21101:
-

[~dyzhou], can you try overriding 
https://github.com/apache/hive/blob/release-2.0.0/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDTF.java#L70

It works for me:
{code:sql}
add jar hdfs://nameservice1/tmp/wym/hive-exec-1.1.0-cdh5.4.3.jar;
CREATE TEMPORARY FUNCTION spark_21101 AS 
'org.apache.hadoop.hive.ql.udf.generic.GenericUDTFStack';
select spark_21101(2,'A',10,date '2015-01-01','B',20,date '2016-01-01');
{code}


Ref: 
https://github.com/apache/hive/blob/release-2.0.0/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDTFStack.java

> Error running Hive temporary UDTF on latest Spark 2.2
> -
>
> Key: SPARK-21101
> URL: https://issues.apache.org/jira/browse/SPARK-21101
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.1
>Reporter: Dayou Zhou
>
> I'm using temporary UDTFs on Spark 2.2, e.g.
> CREATE TEMPORARY FUNCTION myudtf AS 'com.foo.MyUdtf' USING JAR 
> 'hdfs:///path/to/udf.jar'; 
> But when I try to invoke it, I get the following error:
> {noformat}
> 17/06/14 19:43:50 ERROR SparkExecuteStatementOperation: Error running hive 
> query:
> org.apache.hive.service.cli.HiveSQLException: 
> org.apache.spark.sql.AnalysisException: No handler for Hive UDF 
> 'com.foo.MyUdtf': java.lang.NullPointerException; line 1 pos 7
> at 
> org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.org$apache$spark$sql$hive$thriftserver$SparkExecuteStatementOperation$$execute(SparkExecuteStatementOperation.scala:266)
> at 
> org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$1$$anon$2.run(SparkExecuteStatementOperation.scala:174)
> at 
> org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$1$$anon$2.run(SparkExecuteStatementOperation.scala:171)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698)
> at 
> org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$1.run(SparkExecuteStatementOperation.scala:184)
> at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> {noformat}
> Any help appreciated, thanks.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21123) Options for file stream source are in a wrong table

2017-06-18 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21123?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21123:


Assignee: (was: Apache Spark)

> Options for file stream source are in a wrong table
> ---
>
> Key: SPARK-21123
> URL: https://issues.apache.org/jira/browse/SPARK-21123
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, Structured Streaming
>Affects Versions: 2.1.1, 2.2.0
>Reporter: Shixiong Zhu
>Priority: Minor
>  Labels: starter
> Attachments: Screen Shot 2017-06-15 at 11.25.49 AM.png
>
>
> Right now the options for the file stream source are documented under the file 
> sink. We should create a separate table for the source options and fix it.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21123) Options for file stream source are in a wrong table

2017-06-18 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21123?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21123:


Assignee: Apache Spark

> Options for file stream source are in a wrong table
> ---
>
> Key: SPARK-21123
> URL: https://issues.apache.org/jira/browse/SPARK-21123
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, Structured Streaming
>Affects Versions: 2.1.1, 2.2.0
>Reporter: Shixiong Zhu
>Assignee: Apache Spark
>Priority: Minor
>  Labels: starter
> Attachments: Screen Shot 2017-06-15 at 11.25.49 AM.png
>
>
> Right now the options for the file stream source are documented under the file 
> sink. We should create a separate table for the source options and fix it.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21123) Options for file stream source are in a wrong table

2017-06-18 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21123?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16053125#comment-16053125
 ] 

Apache Spark commented on SPARK-21123:
--

User 'assafmendelson' has created a pull request for this issue:
https://github.com/apache/spark/pull/18342

> Options for file stream source are in a wrong table
> ---
>
> Key: SPARK-21123
> URL: https://issues.apache.org/jira/browse/SPARK-21123
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, Structured Streaming
>Affects Versions: 2.1.1, 2.2.0
>Reporter: Shixiong Zhu
>Priority: Minor
>  Labels: starter
> Attachments: Screen Shot 2017-06-15 at 11.25.49 AM.png
>
>
> Right now the options for the file stream source are documented under the file 
> sink. We should create a separate table for the source options and fix it.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-21133) HighlyCompressedMapStatus#writeExternal throws NPE

2017-06-18 Thread Yuming Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21133?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-21133:

Comment: was deleted

(was: I'll create a PR later.)

> HighlyCompressedMapStatus#writeExternal throws NPE
> --
>
> Key: SPARK-21133
> URL: https://issues.apache.org/jira/browse/SPARK-21133
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Yuming Wang
>
> To reproduce, set {{spark.sql.shuffle.partitions}} above 2000 for a query with 
> a shuffle, for example:
> {code:sql}
> spark-sql --executor-memory 12g --driver-memory 8g --executor-cores 7   -e "
>   set spark.sql.shuffle.partitions=2001;
>   drop table if exists spark_hcms_npe;
>   create table spark_hcms_npe as select id, count(*) from big_table group by 
> id;
> "
> {code}
> Error logs:
> {noformat}
> 17/06/18 15:00:27 ERROR Utils: Exception encountered
> java.lang.NullPointerException
> at 
> org.apache.spark.scheduler.HighlyCompressedMapStatus$$anonfun$writeExternal$2.apply$mcV$sp(MapStatus.scala:171)
> at 
> org.apache.spark.scheduler.HighlyCompressedMapStatus$$anonfun$writeExternal$2.apply(MapStatus.scala:167)
> at 
> org.apache.spark.scheduler.HighlyCompressedMapStatus$$anonfun$writeExternal$2.apply(MapStatus.scala:167)
> at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1303)
> at 
> org.apache.spark.scheduler.HighlyCompressedMapStatus.writeExternal(MapStatus.scala:167)
> at 
> java.io.ObjectOutputStream.writeExternalData(ObjectOutputStream.java:1459)
> at 
> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1430)
> at 
> java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
> at java.io.ObjectOutputStream.writeArray(ObjectOutputStream.java:1378)
> at 
> java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1174)
> at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348)
> at 
> org.apache.spark.MapOutputTracker$$anonfun$serializeMapStatuses$1.apply$mcV$sp(MapOutputTracker.scala:617)
> at 
> org.apache.spark.MapOutputTracker$$anonfun$serializeMapStatuses$1.apply(MapOutputTracker.scala:616)
> at 
> org.apache.spark.MapOutputTracker$$anonfun$serializeMapStatuses$1.apply(MapOutputTracker.scala:616)
> at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1337)
> at 
> org.apache.spark.MapOutputTracker$.serializeMapStatuses(MapOutputTracker.scala:619)
> at 
> org.apache.spark.MapOutputTrackerMaster.getSerializedMapOutputStatuses(MapOutputTracker.scala:562)
> at 
> org.apache.spark.MapOutputTrackerMaster$MessageLoop.run(MapOutputTracker.scala:351)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> 17/06/18 15:00:27 ERROR MapOutputTrackerMaster: java.lang.NullPointerException
> java.io.IOException: java.lang.NullPointerException
> at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1310)
> at 
> org.apache.spark.scheduler.HighlyCompressedMapStatus.writeExternal(MapStatus.scala:167)
> at 
> java.io.ObjectOutputStream.writeExternalData(ObjectOutputStream.java:1459)
> at 
> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1430)
> at 
> java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
> at java.io.ObjectOutputStream.writeArray(ObjectOutputStream.java:1378)
> at 
> java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1174)
> at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348)
> at 
> org.apache.spark.MapOutputTracker$$anonfun$serializeMapStatuses$1.apply$mcV$sp(MapOutputTracker.scala:617)
> at 
> org.apache.spark.MapOutputTracker$$anonfun$serializeMapStatuses$1.apply(MapOutputTracker.scala:616)
> at 
> org.apache.spark.MapOutputTracker$$anonfun$serializeMapStatuses$1.apply(MapOutputTracker.scala:616)
> at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1337)
> at 
> org.apache.spark.MapOutputTracker$.serializeMapStatuses(MapOutputTracker.scala:619)
> at 
> org.apache.spark.MapOutputTrackerMaster.getSerializedMapOutputStatuses(MapOutputTracker.scala:562)
> at 
> org.apache.spark.MapOutputTrackerMaster$MessageLoop.run(MapOutputTracker.scala:351)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> Caused 

[jira] [Assigned] (SPARK-21133) HighlyCompressedMapStatus#writeExternal throws NPE

2017-06-18 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21133?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21133:


Assignee: (was: Apache Spark)

> HighlyCompressedMapStatus#writeExternal throws NPE
> --
>
> Key: SPARK-21133
> URL: https://issues.apache.org/jira/browse/SPARK-21133
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Yuming Wang
>
> To reproduce, set {{spark.sql.shuffle.partitions}} above 2000 for a query with 
> a shuffle, for example:
> {code:sql}
> spark-sql --executor-memory 12g --driver-memory 8g --executor-cores 7   -e "
>   set spark.sql.shuffle.partitions=2001;
>   drop table if exists spark_hcms_npe;
>   create table spark_hcms_npe as select id, count(*) from big_table group by 
> id;
> "
> {code}
> Error logs:
> {noformat}
> 17/06/18 15:00:27 ERROR Utils: Exception encountered
> java.lang.NullPointerException
> at 
> org.apache.spark.scheduler.HighlyCompressedMapStatus$$anonfun$writeExternal$2.apply$mcV$sp(MapStatus.scala:171)
> at 
> org.apache.spark.scheduler.HighlyCompressedMapStatus$$anonfun$writeExternal$2.apply(MapStatus.scala:167)
> at 
> org.apache.spark.scheduler.HighlyCompressedMapStatus$$anonfun$writeExternal$2.apply(MapStatus.scala:167)
> at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1303)
> at 
> org.apache.spark.scheduler.HighlyCompressedMapStatus.writeExternal(MapStatus.scala:167)
> at 
> java.io.ObjectOutputStream.writeExternalData(ObjectOutputStream.java:1459)
> at 
> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1430)
> at 
> java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
> at java.io.ObjectOutputStream.writeArray(ObjectOutputStream.java:1378)
> at 
> java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1174)
> at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348)
> at 
> org.apache.spark.MapOutputTracker$$anonfun$serializeMapStatuses$1.apply$mcV$sp(MapOutputTracker.scala:617)
> at 
> org.apache.spark.MapOutputTracker$$anonfun$serializeMapStatuses$1.apply(MapOutputTracker.scala:616)
> at 
> org.apache.spark.MapOutputTracker$$anonfun$serializeMapStatuses$1.apply(MapOutputTracker.scala:616)
> at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1337)
> at 
> org.apache.spark.MapOutputTracker$.serializeMapStatuses(MapOutputTracker.scala:619)
> at 
> org.apache.spark.MapOutputTrackerMaster.getSerializedMapOutputStatuses(MapOutputTracker.scala:562)
> at 
> org.apache.spark.MapOutputTrackerMaster$MessageLoop.run(MapOutputTracker.scala:351)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> 17/06/18 15:00:27 ERROR MapOutputTrackerMaster: java.lang.NullPointerException
> java.io.IOException: java.lang.NullPointerException
> at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1310)
> at 
> org.apache.spark.scheduler.HighlyCompressedMapStatus.writeExternal(MapStatus.scala:167)
> at 
> java.io.ObjectOutputStream.writeExternalData(ObjectOutputStream.java:1459)
> at 
> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1430)
> at 
> java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
> at java.io.ObjectOutputStream.writeArray(ObjectOutputStream.java:1378)
> at 
> java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1174)
> at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348)
> at 
> org.apache.spark.MapOutputTracker$$anonfun$serializeMapStatuses$1.apply$mcV$sp(MapOutputTracker.scala:617)
> at 
> org.apache.spark.MapOutputTracker$$anonfun$serializeMapStatuses$1.apply(MapOutputTracker.scala:616)
> at 
> org.apache.spark.MapOutputTracker$$anonfun$serializeMapStatuses$1.apply(MapOutputTracker.scala:616)
> at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1337)
> at 
> org.apache.spark.MapOutputTracker$.serializeMapStatuses(MapOutputTracker.scala:619)
> at 
> org.apache.spark.MapOutputTrackerMaster.getSerializedMapOutputStatuses(MapOutputTracker.scala:562)
> at 
> org.apache.spark.MapOutputTrackerMaster$MessageLoop.run(MapOutputTracker.scala:351)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: java.l

[jira] [Commented] (SPARK-21133) HighlyCompressedMapStatus#writeExternal throws NPE

2017-06-18 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21133?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16053093#comment-16053093
 ] 

Apache Spark commented on SPARK-21133:
--

User 'wangyum' has created a pull request for this issue:
https://github.com/apache/spark/pull/18343

> HighlyCompressedMapStatus#writeExternal throws NPE
> --
>
> Key: SPARK-21133
> URL: https://issues.apache.org/jira/browse/SPARK-21133
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Yuming Wang
>
> To reproduce, set {{spark.sql.shuffle.partitions}} above 2000 for a query with 
> a shuffle, for example:
> {code:sql}
> spark-sql --executor-memory 12g --driver-memory 8g --executor-cores 7   -e "
>   set spark.sql.shuffle.partitions=2001;
>   drop table if exists spark_hcms_npe;
>   create table spark_hcms_npe as select id, count(*) from big_table group by 
> id;
> "
> {code}
> Error logs:
> {noformat}
> 17/06/18 15:00:27 ERROR Utils: Exception encountered
> java.lang.NullPointerException
> at 
> org.apache.spark.scheduler.HighlyCompressedMapStatus$$anonfun$writeExternal$2.apply$mcV$sp(MapStatus.scala:171)
> at 
> org.apache.spark.scheduler.HighlyCompressedMapStatus$$anonfun$writeExternal$2.apply(MapStatus.scala:167)
> at 
> org.apache.spark.scheduler.HighlyCompressedMapStatus$$anonfun$writeExternal$2.apply(MapStatus.scala:167)
> at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1303)
> at 
> org.apache.spark.scheduler.HighlyCompressedMapStatus.writeExternal(MapStatus.scala:167)
> at 
> java.io.ObjectOutputStream.writeExternalData(ObjectOutputStream.java:1459)
> at 
> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1430)
> at 
> java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
> at java.io.ObjectOutputStream.writeArray(ObjectOutputStream.java:1378)
> at 
> java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1174)
> at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348)
> at 
> org.apache.spark.MapOutputTracker$$anonfun$serializeMapStatuses$1.apply$mcV$sp(MapOutputTracker.scala:617)
> at 
> org.apache.spark.MapOutputTracker$$anonfun$serializeMapStatuses$1.apply(MapOutputTracker.scala:616)
> at 
> org.apache.spark.MapOutputTracker$$anonfun$serializeMapStatuses$1.apply(MapOutputTracker.scala:616)
> at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1337)
> at 
> org.apache.spark.MapOutputTracker$.serializeMapStatuses(MapOutputTracker.scala:619)
> at 
> org.apache.spark.MapOutputTrackerMaster.getSerializedMapOutputStatuses(MapOutputTracker.scala:562)
> at 
> org.apache.spark.MapOutputTrackerMaster$MessageLoop.run(MapOutputTracker.scala:351)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> 17/06/18 15:00:27 ERROR MapOutputTrackerMaster: java.lang.NullPointerException
> java.io.IOException: java.lang.NullPointerException
> at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1310)
> at 
> org.apache.spark.scheduler.HighlyCompressedMapStatus.writeExternal(MapStatus.scala:167)
> at 
> java.io.ObjectOutputStream.writeExternalData(ObjectOutputStream.java:1459)
> at 
> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1430)
> at 
> java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
> at java.io.ObjectOutputStream.writeArray(ObjectOutputStream.java:1378)
> at 
> java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1174)
> at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348)
> at 
> org.apache.spark.MapOutputTracker$$anonfun$serializeMapStatuses$1.apply$mcV$sp(MapOutputTracker.scala:617)
> at 
> org.apache.spark.MapOutputTracker$$anonfun$serializeMapStatuses$1.apply(MapOutputTracker.scala:616)
> at 
> org.apache.spark.MapOutputTracker$$anonfun$serializeMapStatuses$1.apply(MapOutputTracker.scala:616)
> at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1337)
> at 
> org.apache.spark.MapOutputTracker$.serializeMapStatuses(MapOutputTracker.scala:619)
> at 
> org.apache.spark.MapOutputTrackerMaster.getSerializedMapOutputStatuses(MapOutputTracker.scala:562)
> at 
> org.apache.spark.MapOutputTrackerMaster$MessageLoop.run(MapOutputTracker.scala:351)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecu

[jira] [Assigned] (SPARK-21133) HighlyCompressedMapStatus#writeExternal throws NPE

2017-06-18 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21133?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21133:


Assignee: Apache Spark

> HighlyCompressedMapStatus#writeExternal throws NPE
> --
>
> Key: SPARK-21133
> URL: https://issues.apache.org/jira/browse/SPARK-21133
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Yuming Wang
>Assignee: Apache Spark
>
> To reproduce, set {{spark.sql.shuffle.partitions}} above 2000 for a query with 
> a shuffle, for example:
> {code:sql}
> spark-sql --executor-memory 12g --driver-memory 8g --executor-cores 7   -e "
>   set spark.sql.shuffle.partitions=2001;
>   drop table if exists spark_hcms_npe;
>   create table spark_hcms_npe as select id, count(*) from big_table group by 
> id;
> "
> {code}
> Error logs:
> {noformat}
> 17/06/18 15:00:27 ERROR Utils: Exception encountered
> java.lang.NullPointerException
> at 
> org.apache.spark.scheduler.HighlyCompressedMapStatus$$anonfun$writeExternal$2.apply$mcV$sp(MapStatus.scala:171)
> at 
> org.apache.spark.scheduler.HighlyCompressedMapStatus$$anonfun$writeExternal$2.apply(MapStatus.scala:167)
> at 
> org.apache.spark.scheduler.HighlyCompressedMapStatus$$anonfun$writeExternal$2.apply(MapStatus.scala:167)
> at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1303)
> at 
> org.apache.spark.scheduler.HighlyCompressedMapStatus.writeExternal(MapStatus.scala:167)
> at 
> java.io.ObjectOutputStream.writeExternalData(ObjectOutputStream.java:1459)
> at 
> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1430)
> at 
> java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
> at java.io.ObjectOutputStream.writeArray(ObjectOutputStream.java:1378)
> at 
> java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1174)
> at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348)
> at 
> org.apache.spark.MapOutputTracker$$anonfun$serializeMapStatuses$1.apply$mcV$sp(MapOutputTracker.scala:617)
> at 
> org.apache.spark.MapOutputTracker$$anonfun$serializeMapStatuses$1.apply(MapOutputTracker.scala:616)
> at 
> org.apache.spark.MapOutputTracker$$anonfun$serializeMapStatuses$1.apply(MapOutputTracker.scala:616)
> at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1337)
> at 
> org.apache.spark.MapOutputTracker$.serializeMapStatuses(MapOutputTracker.scala:619)
> at 
> org.apache.spark.MapOutputTrackerMaster.getSerializedMapOutputStatuses(MapOutputTracker.scala:562)
> at 
> org.apache.spark.MapOutputTrackerMaster$MessageLoop.run(MapOutputTracker.scala:351)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> 17/06/18 15:00:27 ERROR MapOutputTrackerMaster: java.lang.NullPointerException
> java.io.IOException: java.lang.NullPointerException
> at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1310)
> at 
> org.apache.spark.scheduler.HighlyCompressedMapStatus.writeExternal(MapStatus.scala:167)
> at 
> java.io.ObjectOutputStream.writeExternalData(ObjectOutputStream.java:1459)
> at 
> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1430)
> at 
> java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
> at java.io.ObjectOutputStream.writeArray(ObjectOutputStream.java:1378)
> at 
> java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1174)
> at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348)
> at 
> org.apache.spark.MapOutputTracker$$anonfun$serializeMapStatuses$1.apply$mcV$sp(MapOutputTracker.scala:617)
> at 
> org.apache.spark.MapOutputTracker$$anonfun$serializeMapStatuses$1.apply(MapOutputTracker.scala:616)
> at 
> org.apache.spark.MapOutputTracker$$anonfun$serializeMapStatuses$1.apply(MapOutputTracker.scala:616)
> at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1337)
> at 
> org.apache.spark.MapOutputTracker$.serializeMapStatuses(MapOutputTracker.scala:619)
> at 
> org.apache.spark.MapOutputTrackerMaster.getSerializedMapOutputStatuses(MapOutputTracker.scala:562)
> at 
> org.apache.spark.MapOutputTrackerMaster$MessageLoop.run(MapOutputTracker.scala:351)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java

[jira] [Assigned] (SPARK-21126) The configuration which named "spark.core.connection.auth.wait.timeout" hasn't been used in spark

2017-06-18 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21126?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen reassigned SPARK-21126:
-

Assignee: liuzhaokun

> The configuration which named "spark.core.connection.auth.wait.timeout" 
> hasn't been used in spark
> -
>
> Key: SPARK-21126
> URL: https://issues.apache.org/jira/browse/SPARK-21126
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation, Spark Core
>Affects Versions: 2.1.1
>Reporter: liuzhaokun
>Assignee: liuzhaokun
>Priority: Trivial
> Fix For: 2.2.1
>
>
> The configuration named "spark.core.connection.auth.wait.timeout" 
> isn't used anywhere in Spark, so I think it should be removed from 
> configuration.md.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-21126) The configuration which named "spark.core.connection.auth.wait.timeout" hasn't been used in spark

2017-06-18 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21126?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-21126.
---
   Resolution: Fixed
Fix Version/s: 2.2.1

Issue resolved by pull request 18333
[https://github.com/apache/spark/pull/18333]

> The configuration which named "spark.core.connection.auth.wait.timeout" 
> hasn't been used in spark
> -
>
> Key: SPARK-21126
> URL: https://issues.apache.org/jira/browse/SPARK-21126
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation, Spark Core
>Affects Versions: 2.1.1
>Reporter: liuzhaokun
>Priority: Trivial
> Fix For: 2.2.1
>
>
> The configuration named "spark.core.connection.auth.wait.timeout" 
> isn't used anywhere in Spark, so I think it should be removed from 
> configuration.md.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-21133) HighlyCompressedMapStatus#writeExternal throws NPE

2017-06-18 Thread Yuming Wang (JIRA)
Yuming Wang created SPARK-21133:
---

 Summary: HighlyCompressedMapStatus#writeExternal throws NPE
 Key: SPARK-21133
 URL: https://issues.apache.org/jira/browse/SPARK-21133
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.2.0
Reporter: Yuming Wang



To reproduce, set {{spark.sql.shuffle.partitions}} above 2000 for a query with a 
shuffle, for example:

{code:sql}
spark-sql --executor-memory 12g --driver-memory 8g --executor-cores 7   -e "
  set spark.sql.shuffle.partitions=2001;
  drop table if exists spark_hcms_npe;
  create table spark_hcms_npe as select id, count(*) from big_table group by id;
"
{code}

Error logs:
{noformat}
17/06/18 15:00:27 ERROR Utils: Exception encountered
java.lang.NullPointerException
at 
org.apache.spark.scheduler.HighlyCompressedMapStatus$$anonfun$writeExternal$2.apply$mcV$sp(MapStatus.scala:171)
at 
org.apache.spark.scheduler.HighlyCompressedMapStatus$$anonfun$writeExternal$2.apply(MapStatus.scala:167)
at 
org.apache.spark.scheduler.HighlyCompressedMapStatus$$anonfun$writeExternal$2.apply(MapStatus.scala:167)
at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1303)
at 
org.apache.spark.scheduler.HighlyCompressedMapStatus.writeExternal(MapStatus.scala:167)
at 
java.io.ObjectOutputStream.writeExternalData(ObjectOutputStream.java:1459)
at 
java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1430)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
at java.io.ObjectOutputStream.writeArray(ObjectOutputStream.java:1378)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1174)
at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348)
at 
org.apache.spark.MapOutputTracker$$anonfun$serializeMapStatuses$1.apply$mcV$sp(MapOutputTracker.scala:617)
at 
org.apache.spark.MapOutputTracker$$anonfun$serializeMapStatuses$1.apply(MapOutputTracker.scala:616)
at 
org.apache.spark.MapOutputTracker$$anonfun$serializeMapStatuses$1.apply(MapOutputTracker.scala:616)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1337)
at 
org.apache.spark.MapOutputTracker$.serializeMapStatuses(MapOutputTracker.scala:619)
at 
org.apache.spark.MapOutputTrackerMaster.getSerializedMapOutputStatuses(MapOutputTracker.scala:562)
at 
org.apache.spark.MapOutputTrackerMaster$MessageLoop.run(MapOutputTracker.scala:351)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
17/06/18 15:00:27 ERROR MapOutputTrackerMaster: java.lang.NullPointerException
java.io.IOException: java.lang.NullPointerException
at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1310)
at 
org.apache.spark.scheduler.HighlyCompressedMapStatus.writeExternal(MapStatus.scala:167)
at 
java.io.ObjectOutputStream.writeExternalData(ObjectOutputStream.java:1459)
at 
java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1430)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
at java.io.ObjectOutputStream.writeArray(ObjectOutputStream.java:1378)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1174)
at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348)
at 
org.apache.spark.MapOutputTracker$$anonfun$serializeMapStatuses$1.apply$mcV$sp(MapOutputTracker.scala:617)
at 
org.apache.spark.MapOutputTracker$$anonfun$serializeMapStatuses$1.apply(MapOutputTracker.scala:616)
at 
org.apache.spark.MapOutputTracker$$anonfun$serializeMapStatuses$1.apply(MapOutputTracker.scala:616)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1337)
at 
org.apache.spark.MapOutputTracker$.serializeMapStatuses(MapOutputTracker.scala:619)
at 
org.apache.spark.MapOutputTrackerMaster.getSerializedMapOutputStatuses(MapOutputTracker.scala:562)
at 
org.apache.spark.MapOutputTrackerMaster$MessageLoop.run(MapOutputTracker.scala:351)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.NullPointerException
at 
org.apache.spark.scheduler.HighlyCompressedMapStatus$$anonfun$writeExternal$2.apply$mcV$sp(MapStatus.scala:171)
at 
org.apache.spark.scheduler.HighlyCompressedMapStatus$$anonfun$writeExternal$2.apply(MapStatus.scala:167)
at 
org.apache.spark.scheduler.HighlyCompressedMapStatus$$anonfun$writeExternal$2.apply(MapStatus.scala:167)

[jira] [Commented] (SPARK-21133) HighlyCompressedMapStatus#writeExternal throws NPE

2017-06-18 Thread Yuming Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21133?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16053091#comment-16053091
 ] 

Yuming Wang commented on SPARK-21133:
-

I'll create a PR later.

> HighlyCompressedMapStatus#writeExternal throws NPE
> --
>
> Key: SPARK-21133
> URL: https://issues.apache.org/jira/browse/SPARK-21133
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Yuming Wang
>
> To reproduce, set {{spark.sql.shuffle.partitions}} above 2000 for a query with 
> a shuffle, for example:
> {code:sql}
> spark-sql --executor-memory 12g --driver-memory 8g --executor-cores 7   -e "
>   set spark.sql.shuffle.partitions=2001;
>   drop table if exists spark_hcms_npe;
>   create table spark_hcms_npe as select id, count(*) from big_table group by 
> id;
> "
> {code}
> Error logs:
> {noformat}
> 17/06/18 15:00:27 ERROR Utils: Exception encountered
> java.lang.NullPointerException
> at 
> org.apache.spark.scheduler.HighlyCompressedMapStatus$$anonfun$writeExternal$2.apply$mcV$sp(MapStatus.scala:171)
> at 
> org.apache.spark.scheduler.HighlyCompressedMapStatus$$anonfun$writeExternal$2.apply(MapStatus.scala:167)
> at 
> org.apache.spark.scheduler.HighlyCompressedMapStatus$$anonfun$writeExternal$2.apply(MapStatus.scala:167)
> at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1303)
> at 
> org.apache.spark.scheduler.HighlyCompressedMapStatus.writeExternal(MapStatus.scala:167)
> at 
> java.io.ObjectOutputStream.writeExternalData(ObjectOutputStream.java:1459)
> at 
> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1430)
> at 
> java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
> at java.io.ObjectOutputStream.writeArray(ObjectOutputStream.java:1378)
> at 
> java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1174)
> at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348)
> at 
> org.apache.spark.MapOutputTracker$$anonfun$serializeMapStatuses$1.apply$mcV$sp(MapOutputTracker.scala:617)
> at 
> org.apache.spark.MapOutputTracker$$anonfun$serializeMapStatuses$1.apply(MapOutputTracker.scala:616)
> at 
> org.apache.spark.MapOutputTracker$$anonfun$serializeMapStatuses$1.apply(MapOutputTracker.scala:616)
> at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1337)
> at 
> org.apache.spark.MapOutputTracker$.serializeMapStatuses(MapOutputTracker.scala:619)
> at 
> org.apache.spark.MapOutputTrackerMaster.getSerializedMapOutputStatuses(MapOutputTracker.scala:562)
> at 
> org.apache.spark.MapOutputTrackerMaster$MessageLoop.run(MapOutputTracker.scala:351)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> 17/06/18 15:00:27 ERROR MapOutputTrackerMaster: java.lang.NullPointerException
> java.io.IOException: java.lang.NullPointerException
> at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1310)
> at 
> org.apache.spark.scheduler.HighlyCompressedMapStatus.writeExternal(MapStatus.scala:167)
> at 
> java.io.ObjectOutputStream.writeExternalData(ObjectOutputStream.java:1459)
> at 
> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1430)
> at 
> java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
> at java.io.ObjectOutputStream.writeArray(ObjectOutputStream.java:1378)
> at 
> java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1174)
> at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348)
> at 
> org.apache.spark.MapOutputTracker$$anonfun$serializeMapStatuses$1.apply$mcV$sp(MapOutputTracker.scala:617)
> at 
> org.apache.spark.MapOutputTracker$$anonfun$serializeMapStatuses$1.apply(MapOutputTracker.scala:616)
> at 
> org.apache.spark.MapOutputTracker$$anonfun$serializeMapStatuses$1.apply(MapOutputTracker.scala:616)
> at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1337)
> at 
> org.apache.spark.MapOutputTracker$.serializeMapStatuses(MapOutputTracker.scala:619)
> at 
> org.apache.spark.MapOutputTrackerMaster.getSerializedMapOutputStatuses(MapOutputTracker.scala:562)
> at 
> org.apache.spark.MapOutputTrackerMaster$MessageLoop.run(MapOutputTracker.scala:351)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(T