[jira] [Comment Edited] (SPARK-20325) Spark Structured Streaming documentation Update: checkpoint configuration

2017-04-15 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20325?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15970255#comment-15970255
 ] 

Hyukjin Kwon edited comment on SPARK-20325 at 4/16/17 6:30 AM:
---

It sounds like a documentation issue for ... 

{quote}
Could we update the documentation for Structured Streaming and describe this 
behavior?
{quote}


I think this question should go to the mailing list.

{quote}
Do we really need to specify the checkpoint dir per query? What is the reason for 
this? Eventually we will be forced to write some checkpointDir name generator, for 
example one that associates it with a particular named query, and so on.
{quote}




was (Author: hyukjin.kwon):
It sounds like a documentation issue for ... 

{quote}
Could we update the documentation for Structured Streaming and describe this 
behavior?
{quote}

{quote}
Do we really need to specify the checkpoint dir per query? What is the reason for 
this? Eventually we will be forced to write some checkpointDir name generator, for 
example one that associates it with a particular named query, and so on.
{quote}

I think this question should go to the mailing list.


> Spark Structured Streaming documentation Update: checkpoint configuration
> -
>
> Key: SPARK-20325
> URL: https://issues.apache.org/jira/browse/SPARK-20325
> Project: Spark
>  Issue Type: Documentation
>  Components: Structured Streaming
>Affects Versions: 2.1.0
>Reporter: Kate Eri
>Priority: Minor
>
> I have configured the following stream outputting to Kafka:
> {code}
> map.foreach(metric => {
>   streamToProcess
> .groupBy(metric)
> .agg(count(metric))
> .writeStream
> .outputMode("complete")
> .option("checkpointLocation", checkpointDir)
> .foreach(kafkaWriter)
> .start()
> })
> {code}
> I configured the checkpoint dir for each output sink with 
> .option("checkpointLocation", checkpointDir), according to the documentation: 
> http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#recovering-from-failures-with-checkpointing
>  
> As a result I've got the following exception: 
> Cannot start query with id bf6a1003-6252-4c62-8249-c6a189701255 as another 
> query with same id is already active. Perhaps you are attempting to restart a 
> query from checkpoint that is already active.
> java.lang.IllegalStateException: Cannot start query with id 
> bf6a1003-6252-4c62-8249-c6a189701255 as another query with same id is already 
> active. Perhaps you are attempting to restart a query from checkpoint that is 
> already active.
>   at 
> org.apache.spark.sql.streaming.StreamingQueryManager.startQuery(StreamingQueryManager.scala:291)
> So, according to the current Spark logic for the "foreach" sink, the checkpoint 
> configuration is resolved in the following way: 
> {code:title=StreamingQueryManager.scala}
>val checkpointLocation = userSpecifiedCheckpointLocation.map { 
> userSpecified =>
>   new Path(userSpecified).toUri.toString
> }.orElse {
>   df.sparkSession.sessionState.conf.checkpointLocation.map { location =>
> new Path(location, 
> userSpecifiedName.getOrElse(UUID.randomUUID().toString)).toUri.toString
>   }
> }.getOrElse {
>   if (useTempCheckpointLocation) {
> Utils.createTempDir(namePrefix = s"temporary").getCanonicalPath
>   } else {
> throw new AnalysisException(
>   "checkpointLocation must be specified either " +
> """through option("checkpointLocation", ...) or """ +
> s"""SparkSession.conf.set("${SQLConf.CHECKPOINT_LOCATION.key}", 
> ...)""")
>   }
> }
> {code}
> So Spark first takes the checkpoint dir from the query, then from the SparkSession 
> (spark.sql.streaming.checkpointLocation), and so on. 
> But this behavior was not documented, hence two questions:
> 1) Could we update the documentation for Structured Streaming and describe this 
> behavior?
> 2) Do we really need to specify the checkpoint dir per query? What is the reason 
> for this? Eventually we will be forced to write some checkpointDir name 
> generator, for example one that associates it with a particular named query, and so 
> on.
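
For what it's worth, the exception above is what you get when every query started in the loop shares the same checkpointLocation: the second start() finds a checkpoint whose query id already belongs to an active query. A minimal sketch of one way around it is to give each query its own checkpoint subdirectory; streamToProcess, kafkaWriter and checkpointDir are the names from the snippet above, while the metrics collection and the per-metric subpath are assumptions:

{code}
// Minimal sketch (assumptions noted above): one checkpoint subdirectory and one
// query name per metric, so no two active queries share a checkpoint.
import org.apache.spark.sql.functions.count

metrics.foreach { metric =>                       // `metrics` stands for the reporter's `map` collection
  streamToProcess
    .groupBy(metric)
    .agg(count(metric))
    .writeStream
    .queryName(s"count-$metric")                  // optional, makes queries easier to tell apart
    .outputMode("complete")
    .option("checkpointLocation", s"$checkpointDir/$metric")   // unique per query
    .foreach(kafkaWriter)
    .start()
}
{code}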



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20325) Spark Structured Streaming documentation Update: checkpoint configuration

2017-04-15 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20325?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-20325:
-
Issue Type: Documentation  (was: Bug)

> Spark Structured Streaming documentation Update: checkpoint configuration
> -
>
> Key: SPARK-20325
> URL: https://issues.apache.org/jira/browse/SPARK-20325
> Project: Spark
>  Issue Type: Documentation
>  Components: Structured Streaming
>Affects Versions: 2.1.0
>Reporter: Kate Eri
>Priority: Minor
>
> I have configured the following stream outputting to Kafka:
> {code}
> map.foreach(metric => {
>   streamToProcess
> .groupBy(metric)
> .agg(count(metric))
> .writeStream
> .outputMode("complete")
> .option("checkpointLocation", checkpointDir)
> .foreach(kafkaWriter)
> .start()
> })
> {code}
> I configured the checkpoint dir for each output sink with 
> .option("checkpointLocation", checkpointDir), according to the documentation: 
> http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#recovering-from-failures-with-checkpointing
>  
> As a result I've got the following exception: 
> Cannot start query with id bf6a1003-6252-4c62-8249-c6a189701255 as another 
> query with same id is already active. Perhaps you are attempting to restart a 
> query from checkpoint that is already active.
> java.lang.IllegalStateException: Cannot start query with id 
> bf6a1003-6252-4c62-8249-c6a189701255 as another query with same id is already 
> active. Perhaps you are attempting to restart a query from checkpoint that is 
> already active.
>   at 
> org.apache.spark.sql.streaming.StreamingQueryManager.startQuery(StreamingQueryManager.scala:291)
> So, according to the current Spark logic for the "foreach" sink, the checkpoint 
> configuration is resolved in the following way: 
> {code:title=StreamingQueryManager.scala}
>val checkpointLocation = userSpecifiedCheckpointLocation.map { 
> userSpecified =>
>   new Path(userSpecified).toUri.toString
> }.orElse {
>   df.sparkSession.sessionState.conf.checkpointLocation.map { location =>
> new Path(location, 
> userSpecifiedName.getOrElse(UUID.randomUUID().toString)).toUri.toString
>   }
> }.getOrElse {
>   if (useTempCheckpointLocation) {
> Utils.createTempDir(namePrefix = s"temporary").getCanonicalPath
>   } else {
> throw new AnalysisException(
>   "checkpointLocation must be specified either " +
> """through option("checkpointLocation", ...) or """ +
> s"""SparkSession.conf.set("${SQLConf.CHECKPOINT_LOCATION.key}", 
> ...)""")
>   }
> }
> {code}
> So Spark first takes the checkpoint dir from the query, then from the SparkSession 
> (spark.sql.streaming.checkpointLocation), and so on. 
> But this behavior was not documented, hence two questions:
> 1) Could we update the documentation for Structured Streaming and describe this 
> behavior?
> 2) Do we really need to specify the checkpoint dir per query? What is the reason 
> for this? Eventually we will be forced to write some checkpointDir name 
> generator, for example one that associates it with a particular named query, and so 
> on.






[jira] [Commented] (SPARK-20325) Spark Structured Streaming documentation Update: checkpoint configuration

2017-04-15 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20325?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15970255#comment-15970255
 ] 

Hyukjin Kwon commented on SPARK-20325:
--

It sounds like a documentation issue for ... 

{quote}
Could we update the documentation for Structured Streaming and describe this 
behavior?
{quote}

{quote}
Do we really need to specify the checkpoint dir per query? What is the reason for 
this? Eventually we will be forced to write some checkpointDir name generator, for 
example one that associates it with a particular named query, and so on.
{quote}

I think this question should go to the mailing list.


> Spark Structured Streaming documentation Update: checkpoint configuration
> -
>
> Key: SPARK-20325
> URL: https://issues.apache.org/jira/browse/SPARK-20325
> Project: Spark
>  Issue Type: Documentation
>  Components: Structured Streaming
>Affects Versions: 2.1.0
>Reporter: Kate Eri
>Priority: Minor
>
> I have configured the following stream outputting to Kafka:
> {code}
> map.foreach(metric => {
>   streamToProcess
> .groupBy(metric)
> .agg(count(metric))
> .writeStream
> .outputMode("complete")
> .option("checkpointLocation", checkpointDir)
> .foreach(kafkaWriter)
> .start()
> })
> {code}
> I configured the checkpoint dir for each output sink with 
> .option("checkpointLocation", checkpointDir), according to the documentation: 
> http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#recovering-from-failures-with-checkpointing
>  
> As a result I've got the following exception: 
> Cannot start query with id bf6a1003-6252-4c62-8249-c6a189701255 as another 
> query with same id is already active. Perhaps you are attempting to restart a 
> query from checkpoint that is already active.
> java.lang.IllegalStateException: Cannot start query with id 
> bf6a1003-6252-4c62-8249-c6a189701255 as another query with same id is already 
> active. Perhaps you are attempting to restart a query from checkpoint that is 
> already active.
>   at 
> org.apache.spark.sql.streaming.StreamingQueryManager.startQuery(StreamingQueryManager.scala:291)
> So, according to the current Spark logic for the "foreach" sink, the checkpoint 
> configuration is resolved in the following way: 
> {code:title=StreamingQueryManager.scala}
>val checkpointLocation = userSpecifiedCheckpointLocation.map { 
> userSpecified =>
>   new Path(userSpecified).toUri.toString
> }.orElse {
>   df.sparkSession.sessionState.conf.checkpointLocation.map { location =>
> new Path(location, 
> userSpecifiedName.getOrElse(UUID.randomUUID().toString)).toUri.toString
>   }
> }.getOrElse {
>   if (useTempCheckpointLocation) {
> Utils.createTempDir(namePrefix = s"temporary").getCanonicalPath
>   } else {
> throw new AnalysisException(
>   "checkpointLocation must be specified either " +
> """through option("checkpointLocation", ...) or """ +
> s"""SparkSession.conf.set("${SQLConf.CHECKPOINT_LOCATION.key}", 
> ...)""")
>   }
> }
> {code}
> So Spark first takes the checkpoint dir from the query, then from the SparkSession 
> (spark.sql.streaming.checkpointLocation), and so on. 
> But this behavior was not documented, hence two questions:
> 1) Could we update the documentation for Structured Streaming and describe this 
> behavior?
> 2) Do we really need to specify the checkpoint dir per query? What is the reason 
> for this? Eventually we will be forced to write some checkpointDir name 
> generator, for example one that associates it with a particular named query, and so 
> on.






[jira] [Commented] (SPARK-20346) sum aggregate over empty Dataset gives null

2017-04-15 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20346?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15970248#comment-15970248
 ] 

Hyukjin Kwon commented on SPARK-20346:
--

[~jlaskowski], do you mind if I ask what the expected output is? I think {{null}} for 
no input rows makes sense in a way.
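
If {{null}} is not the desired answer for an empty input, one possible user-side workaround (not a change to the aggregate itself) is to coalesce the nullable sum with a literal zero:

{code}
// Workaround sketch: fall back to 0 when the sum over zero rows is null.
import org.apache.spark.sql.functions.{coalesce, lit, sum}

spark.range(0)
  .agg(coalesce(sum("id"), lit(0L)).as("sum(id)"))
  .show()
// +-------+
// |sum(id)|
// +-------+
// |      0|
// +-------+
{code}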

> sum aggregate over empty Dataset gives null
> ---
>
> Key: SPARK-20346
> URL: https://issues.apache.org/jira/browse/SPARK-20346
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Jacek Laskowski
>Priority: Minor
>
> {code}
> scala> spark.range(0).agg(sum("id")).show
> +-------+
> |sum(id)|
> +-------+
> |   null|
> +-------+
> scala> spark.range(0).agg(sum("id")).printSchema
> root
>  |-- sum(id): long (nullable = true)
> {code}






[jira] [Commented] (SPARK-20336) spark.read.csv() with wholeFile=True option fails to read non ASCII unicode characters

2017-04-15 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15970246#comment-15970246
 ] 

Hyukjin Kwon commented on SPARK-20336:
--

Gentle ping [~priancho], I will resolve this JIRA if you are unable to provide 
more details, because I could not reproduce this.

> spark.read.csv() with wholeFile=True option fails to read non ASCII unicode 
> characters
> --
>
> Key: SPARK-20336
> URL: https://issues.apache.org/jira/browse/SPARK-20336
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
> Environment: Spark 2.2.0 (master branch is downloaded from Github)
> PySpark
>Reporter: HanCheol Cho
>
> I used spark.read.csv() method with wholeFile=True option to load data that 
> has multi-line records.
> However, non-ASCII characters are not properly loaded.
> The following is a sample data for test:
> {code:none}
> col1,col2,col3
> 1,a,text
> 2,b,テキスト
> 3,c,텍스트
> 4,d,"text
> テキスト
> 텍스트"
> 5,e,last
> {code}
> When it is loaded without wholeFile=True option, non-ASCII characters are 
> shown correctly although multi-line records are parsed incorrectly as follows:
> {code:none}
> testdf_default = spark.read.csv("test.encoding.csv", header=True)
> testdf_default.show()
> +----+----+----+
> |col1|col2|col3|
> +----+----+----+
> |   1|   a|text|
> |   2|   b|テキスト|
> |   3|   c| 텍스트|
> |   4|   d|text|
> |テキスト|null|null|
> | 텍스트"|null|null|
> |   5|   e|last|
> +----+----+----+
> {code}
> When wholeFile=True option is used, non-ASCII characters are broken as 
> follows:
> {code:none}
> testdf_wholefile = spark.read.csv("test.encoding.csv", header=True, 
> wholeFile=True)
> testdf_wholefile.show()
> +----+----+----+
> |col1|col2|col3|
> +----+----+----+
> |   1|   a|text|
> |   2|   b||
> |   3|   c|   �|
> |   4|   d|text
> ...|
> |   5|   e|last|
> +----+----+----+
> {code}
> The result is same even if I use encoding="UTF-8" option with wholeFile=True.






[jira] [Commented] (SPARK-18406) Race between end-of-task and completion iterator read lock release

2017-04-15 Thread Josh Rosen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15970235#comment-15970235
 ] 

Josh Rosen commented on SPARK-18406:


I can see how allowing user-level code to call setTaskContext() can fix this 
issue, but it's not ideal because it still places the burden on end users to 
call the setTaskContext() method in their code.

Instead, I think a cleaner fix would be to have the CompletionIterator record 
the task ID when it's instantiated so that the same task ID can be used even if 
the completion occurs in a different thread (the idea is to reduce our reliance 
on thread locals: there are reasons why we couldn't completely remove them (API 
changes), but there are parts of the internals where we can propagate more 
efficiently).

To move forward here, my suggestion is that we write a failing regression test 
based on the description provided by [~yxiao], then experiment with my suggested 
approach of more explicitly threading task ids into closeable objects when 
they're first created.

I'm on vacation this week and won't be able to help with this until Monday, 
April 24th, so someone else will need to help / review if this is urgent.
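
A rough sketch of that idea, using illustrative names rather than Spark's actual internals (the callback parameter and the -1 fallback are assumptions), records the task attempt id on the thread that creates the iterator and hands it to the completion callback, so releasing locks no longer depends on the thread-local TaskContext at completion time:

{code}
// Illustrative sketch only, not Spark's real CompletionIterator.
import org.apache.spark.TaskContext

class TaskBoundCompletionIterator[A](sub: Iterator[A], onCompletion: Long => Unit)
  extends Iterator[A] {

  // Captured once, on the thread that instantiated the iterator.
  private val taskAttemptId: Long =
    Option(TaskContext.get()).map(_.taskAttemptId()).getOrElse(-1L)
  private var completed = false

  override def hasNext: Boolean = {
    val more = sub.hasNext
    if (!more && !completed) {
      completed = true
      // e.g. blockInfoManager.releaseAllLocksForTask(taskAttemptId)
      onCompletion(taskAttemptId)
    }
    more
  }

  override def next(): A = sub.next()
}
{code}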

> Race between end-of-task and completion iterator read lock release
> --
>
> Key: SPARK-18406
> URL: https://issues.apache.org/jira/browse/SPARK-18406
> Project: Spark
>  Issue Type: Bug
>  Components: Block Manager, Spark Core
>Affects Versions: 2.0.0, 2.0.1
>Reporter: Josh Rosen
>
> The following log comes from a production streaming job where executors 
> periodically die due to uncaught exceptions during block release:
> {code}
> 16/11/07 17:11:06 INFO CoarseGrainedExecutorBackend: Got assigned task 7921
> 16/11/07 17:11:06 INFO Executor: Running task 0.0 in stage 2390.0 (TID 7921)
> 16/11/07 17:11:06 INFO CoarseGrainedExecutorBackend: Got assigned task 7922
> 16/11/07 17:11:06 INFO Executor: Running task 1.0 in stage 2390.0 (TID 7922)
> 16/11/07 17:11:06 INFO CoarseGrainedExecutorBackend: Got assigned task 7923
> 16/11/07 17:11:06 INFO Executor: Running task 2.0 in stage 2390.0 (TID 7923)
> 16/11/07 17:11:06 INFO TorrentBroadcast: Started reading broadcast variable 
> 2721
> 16/11/07 17:11:06 INFO CoarseGrainedExecutorBackend: Got assigned task 7924
> 16/11/07 17:11:06 INFO Executor: Running task 3.0 in stage 2390.0 (TID 7924)
> 16/11/07 17:11:06 INFO MemoryStore: Block broadcast_2721_piece0 stored as 
> bytes in memory (estimated size 5.0 KB, free 4.9 GB)
> 16/11/07 17:11:06 INFO TorrentBroadcast: Reading broadcast variable 2721 took 
> 3 ms
> 16/11/07 17:11:06 INFO MemoryStore: Block broadcast_2721 stored as values in 
> memory (estimated size 9.4 KB, free 4.9 GB)
> 16/11/07 17:11:06 INFO BlockManager: Found block rdd_2741_1 locally
> 16/11/07 17:11:06 INFO BlockManager: Found block rdd_2741_3 locally
> 16/11/07 17:11:06 INFO BlockManager: Found block rdd_2741_2 locally
> 16/11/07 17:11:06 INFO BlockManager: Found block rdd_2741_4 locally
> 16/11/07 17:11:06 INFO PythonRunner: Times: total = 2, boot = -566, init = 
> 567, finish = 1
> 16/11/07 17:11:06 INFO PythonRunner: Times: total = 7, boot = -540, init = 
> 541, finish = 6
> 16/11/07 17:11:06 INFO Executor: Finished task 2.0 in stage 2390.0 (TID 
> 7923). 1429 bytes result sent to driver
> 16/11/07 17:11:06 INFO PythonRunner: Times: total = 8, boot = -532, init = 
> 533, finish = 7
> 16/11/07 17:11:06 INFO Executor: Finished task 3.0 in stage 2390.0 (TID 
> 7924). 1429 bytes result sent to driver
> 16/11/07 17:11:06 ERROR Executor: Exception in task 0.0 in stage 2390.0 (TID 
> 7921)
> java.lang.AssertionError: assertion failed
>   at scala.Predef$.assert(Predef.scala:165)
>   at 
> org.apache.spark.storage.BlockInfo.checkInvariants(BlockInfoManager.scala:84)
>   at 
> org.apache.spark.storage.BlockInfo.readerCount_$eq(BlockInfoManager.scala:66)
>   at 
> org.apache.spark.storage.BlockInfoManager$$anonfun$releaseAllLocksForTask$2$$anonfun$apply$2.apply(BlockInfoManager.scala:362)
>   at 
> org.apache.spark.storage.BlockInfoManager$$anonfun$releaseAllLocksForTask$2$$anonfun$apply$2.apply(BlockInfoManager.scala:361)
>   at scala.Option.foreach(Option.scala:236)
>   at 
> org.apache.spark.storage.BlockInfoManager$$anonfun$releaseAllLocksForTask$2.apply(BlockInfoManager.scala:361)
>   at 
> org.apache.spark.storage.BlockInfoManager$$anonfun$releaseAllLocksForTask$2.apply(BlockInfoManager.scala:356)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>   at 
> org.apache.spark.storage.BlockInfoManager.releaseAllLocksForTask(BlockInfoManager.scala:356)
>   at 
> org.apache.spark.storage.BlockManager.releaseAllLocksForTask(B

[jira] [Resolved] (SPARK-20335) Children expressions of Hive UDF impacts the determinism of Hive UDF

2017-04-15 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-20335.
-
   Resolution: Fixed
Fix Version/s: 2.2.0

> Children expressions of Hive UDF impacts the determinism of Hive UDF
> 
>
> Key: SPARK-20335
> URL: https://issues.apache.org/jira/browse/SPARK-20335
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2, 2.1.0
>Reporter: Xiao Li
>Assignee: Xiao Li
> Fix For: 2.2.0
>
>
> {noformat}
>   /**
>* Certain optimizations should not be applied if UDF is not deterministic.
>* Deterministic UDF returns same result each time it is invoked with a
>* particular input. This determinism just needs to hold within the context 
> of
>* a query.
>*
>* @return true if the UDF is deterministic
>*/
>   boolean deterministic() default true;
> {noformat}
> Based on the definition of UDFType, when a Hive UDF's children are 
> non-deterministic, the Hive UDF is also non-deterministic.
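
In other words, the determinism of the wrapping expression has to combine the UDFType annotation with the children; a small sketch of that rule (not the actual patch) looks like this:

{code}
// Sketch of the rule described above: deterministic only if the annotation says
// so AND every child expression is deterministic.
def hiveUdfIsDeterministic(
    annotationDeterministic: Boolean,
    childrenDeterministic: Seq[Boolean]): Boolean =
  annotationDeterministic && childrenDeterministic.forall(identity)

// A non-deterministic child (e.g. rand()) makes the whole UDF call non-deterministic:
assert(!hiveUdfIsDeterministic(annotationDeterministic = true, Seq(true, false)))
{code}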






[jira] [Assigned] (SPARK-20348) Support squared hinge loss (L2 loss) for LinearSVC

2017-04-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20348?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20348:


Assignee: Apache Spark

> Support squared hinge loss (L2 loss) for LinearSVC
> --
>
> Key: SPARK-20348
> URL: https://issues.apache.org/jira/browse/SPARK-20348
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: yuhao yang
>Assignee: Apache Spark
>Priority: Minor
>
> While hinge loss is the standard loss function for linear SVM, squared hinge 
> loss (a.k.a. L2 loss) is also popular in practice. L2-SVM is differentiable 
> and imposes a bigger (quadratic vs. linear) loss on points that violate the 
> margin. An introduction can be found at 
> http://mccormickml.com/2015/01/06/what-is-an-l2-svm/
> Liblinear and [scikit 
> learn|http://scikit-learn.org/stable/modules/generated/sklearn.svm.LinearSVC.html]
>  both offer squared hinge loss as the default loss function for linear SVM. 
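
For reference, a small sketch of the two losses on a single example, where the margin is y * (w . x) with label y in {-1, +1} (illustrative only):

{code}
// Illustrative only: hinge vs. squared hinge (L2) loss for one example.
def hingeLoss(margin: Double): Double = math.max(0.0, 1.0 - margin)

def squaredHingeLoss(margin: Double): Double = {
  val h = math.max(0.0, 1.0 - margin)
  h * h  // quadratic penalty for margin violations, hence "L2 loss"
}

// A point well inside the margin is penalized more heavily by the squared variant:
// hingeLoss(-1.0) == 2.0, squaredHingeLoss(-1.0) == 4.0
{code}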






[jira] [Assigned] (SPARK-20348) Support squared hinge loss (L2 loss) for LinearSVC

2017-04-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20348?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20348:


Assignee: (was: Apache Spark)

> Support squared hinge loss (L2 loss) for LinearSVC
> --
>
> Key: SPARK-20348
> URL: https://issues.apache.org/jira/browse/SPARK-20348
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: yuhao yang
>Priority: Minor
>
> While hinge loss is the standard loss function for linear SVM, squared hinge 
> loss (a.k.a. L2 loss) is also popular in practice. L2-SVM is differentiable 
> and imposes a bigger (quadratic vs. linear) loss on points that violate the 
> margin. An introduction can be found at 
> http://mccormickml.com/2015/01/06/what-is-an-l2-svm/
> Liblinear and [scikit 
> learn|http://scikit-learn.org/stable/modules/generated/sklearn.svm.LinearSVC.html]
>  both offer squared hinge loss as the default loss function for linear SVM. 






[jira] [Commented] (SPARK-20348) Support squared hinge loss (L2 loss) for LinearSVC

2017-04-15 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15970217#comment-15970217
 ] 

Apache Spark commented on SPARK-20348:
--

User 'hhbyyh' has created a pull request for this issue:
https://github.com/apache/spark/pull/17645

> Support squared hinge loss (L2 loss) for LinearSVC
> --
>
> Key: SPARK-20348
> URL: https://issues.apache.org/jira/browse/SPARK-20348
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: yuhao yang
>Priority: Minor
>
> While hinge loss is the standard loss function for linear SVM, squared hinge 
> loss (a.k.a. L2 loss) is also popular in practice. L2-SVM is differentiable 
> and imposes a bigger (quadratic vs. linear) loss on points that violate the 
> margin. An introduction can be found at 
> http://mccormickml.com/2015/01/06/what-is-an-l2-svm/
> Liblinear and [scikit 
> learn|http://scikit-learn.org/stable/modules/generated/sklearn.svm.LinearSVC.html]
>  both offer squared hinge loss as the default loss function for linear SVM. 






[jira] [Created] (SPARK-20348) Support squared hinge loss (L2 loss) for LinearSVC

2017-04-15 Thread yuhao yang (JIRA)
yuhao yang created SPARK-20348:
--

 Summary: Support squared hinge loss (L2 loss) for LinearSVC
 Key: SPARK-20348
 URL: https://issues.apache.org/jira/browse/SPARK-20348
 Project: Spark
  Issue Type: New Feature
  Components: ML
Affects Versions: 2.2.0
Reporter: yuhao yang
Priority: Minor


While hinge loss is the standard loss function for linear SVM, squared hinge 
loss (a.k.a. L2 loss) is also popular in practice. L2-SVM is differentiable and 
imposes a bigger (quadratic vs. linear) loss on points that violate the 
margin. An introduction can be found at 
http://mccormickml.com/2015/01/06/what-is-an-l2-svm/

Liblinear and [scikit 
learn|http://scikit-learn.org/stable/modules/generated/sklearn.svm.LinearSVC.html]
 both offer squared hinge loss as the default loss function for linear SVM. 






[jira] [Updated] (SPARK-20347) Provide AsyncRDDActions in Python

2017-04-15 Thread holdenk (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20347?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

holdenk updated SPARK-20347:

Shepherd: holdenk

> Provide AsyncRDDActions in Python
> -
>
> Key: SPARK-20347
> URL: https://issues.apache.org/jira/browse/SPARK-20347
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.2.0
>Reporter: holdenk
>Priority: Minor
>
> In core Spark, AsyncRDDActions allows people to perform non-blocking RDD 
> actions. In Python, where threading is a bit more involved, there could be 
> value in exposing this; the easiest way might involve using the Py4J callback 
> server on the driver.
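
For comparison, this is roughly what core Spark already offers on the Scala side through AsyncRDDActions (assuming a spark-shell where {{sc}} is available); the JIRA asks for a PySpark equivalent:

{code}
// Non-blocking action in Scala: countAsync returns a FutureAction immediately.
import scala.concurrent.ExecutionContext.Implicits.global
import scala.util.{Failure, Success}

val rdd = sc.parallelize(1 to 1000000)

val futureCount = rdd.countAsync()   // does not block the driver thread
futureCount.onComplete {
  case Success(n)  => println(s"count finished: $n")
  case Failure(ex) => println(s"count failed: $ex")
}
// the driver is free to do other work while the job runs
{code}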






[jira] [Created] (SPARK-20347) Provide AsyncRDDActions in Python

2017-04-15 Thread holdenk (JIRA)
holdenk created SPARK-20347:
---

 Summary: Provide AsyncRDDActions in Python
 Key: SPARK-20347
 URL: https://issues.apache.org/jira/browse/SPARK-20347
 Project: Spark
  Issue Type: Improvement
  Components: PySpark
Affects Versions: 2.2.0
Reporter: holdenk
Priority: Minor


In core Spark, AsyncRDDActions allows people to perform non-blocking RDD 
actions. In Python, where threading is a bit more involved, there could be 
value in exposing this; the easiest way might involve using the Py4J callback 
server on the driver.






[jira] [Commented] (SPARK-17729) Enable creating hive bucketed tables

2017-04-15 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15970201#comment-15970201
 ] 

Apache Spark commented on SPARK-17729:
--

User 'tejasapatil' has created a pull request for this issue:
https://github.com/apache/spark/pull/17644

> Enable creating hive bucketed tables
> 
>
> Key: SPARK-17729
> URL: https://issues.apache.org/jira/browse/SPARK-17729
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Tejas Patil
>Priority: Trivial
>
> Hive allows inserting data into a bucketed table without guaranteeing bucketed- 
> and sorted-ness, based on these two configs: `hive.enforce.bucketing` and 
> `hive.enforce.sorting`. 
> With this JIRA, Spark still won't produce bucketed data as per Hive's 
> bucketing guarantees, but will allow writes IFF the user wishes to do so without 
> caring about bucketing guarantees. The ability to create bucketed tables will 
> enable adding test cases to Spark while pieces are being added to have it 
> support Hive bucketing (e.g. https://github.com/apache/spark/pull/15229)
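
For context, Spark's own (non-Hive-compatible) bucketed writes already go through DataFrameWriter, roughly as sketched below; the DataFrame, column and table names are illustrative:

{code}
// Sketch of Spark's existing bucketed write path (not Hive-compatible bucketing);
// the JIRA is about also allowing writes to Hive bucketed tables.
df.write
  .bucketBy(8, "user_id")        // 8 buckets on an illustrative column
  .sortBy("user_id")
  .saveAsTable("events_bucketed")
{code}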






[jira] [Updated] (SPARK-20346) sum aggregate over empty Dataset gives null

2017-04-15 Thread Jacek Laskowski (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20346?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jacek Laskowski updated SPARK-20346:

Description: 
{code}
scala> spark.range(0).agg(sum("id")).show
+-------+
|sum(id)|
+-------+
|   null|
+-------+

scala> spark.range(0).agg(sum("id")).printSchema
root
 |-- sum(id): long (nullable = true)
{code}

> sum aggregate over empty Dataset gives null
> ---
>
> Key: SPARK-20346
> URL: https://issues.apache.org/jira/browse/SPARK-20346
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Jacek Laskowski
>Priority: Minor
>
> {code}
> scala> spark.range(0).agg(sum("id")).show
> +-------+
> |sum(id)|
> +-------+
> |   null|
> +-------+
> scala> spark.range(0).agg(sum("id")).printSchema
> root
>  |-- sum(id): long (nullable = true)
> {code}






[jira] [Created] (SPARK-20346) sum aggregate over empty Dataset gives null

2017-04-15 Thread Jacek Laskowski (JIRA)
Jacek Laskowski created SPARK-20346:
---

 Summary: sum aggregate over empty Dataset gives null
 Key: SPARK-20346
 URL: https://issues.apache.org/jira/browse/SPARK-20346
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.2.0
Reporter: Jacek Laskowski
Priority: Minor









[jira] [Updated] (SPARK-20345) Fix STS error handling logic on HiveSQLException

2017-04-15 Thread Dongjoon Hyun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20345?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-20345:
--
Description: 
[SPARK-5100|https://github.com/apache/spark/commit/343d3bfafd449a0371feb6a88f78e07302fa7143]
 added Spark Thrift Server UI and the following logic to handle exceptions on 
case `Throwable`.
{code}
HiveThriftServer2.listener.onStatementError(
  statementId, e.getMessage, SparkUtils.exceptionString(e))
{code}

However, there occurred a missed case after implementing 
[SPARK-6964|https://github.com/apache/spark/commit/eb19d3f75cbd002f7e72ce02017a8de67f562792]'s
 `Support Cancellation in the Thrift Server` by adding case `HiveSQLException` 
before case `Throwable`.

Logically, we had better add `HiveThriftServer2.listener.onStatementError` on 
case `HiveSQLException`, too.
{code}
  case e: HiveSQLException =>
if (getStatus().getState() == OperationState.CANCELED) {
  return
} else {
  setState(OperationState.ERROR)
  throw e
}
  // Actually do need to catch Throwable as some failures don't inherit 
from Exception and
  // HiveServer will silently swallow them.
  case e: Throwable =>
val currentState = getStatus().getState()
logError(s"Error executing query, currentState $currentState, ", e)
setState(OperationState.ERROR)
HiveThriftServer2.listener.onStatementError(
  statementId, e.getMessage, SparkUtils.exceptionString(e))
throw new HiveSQLException(e.toString)
{code}

  was:
[SPARK-5100|https://github.com/apache/spark/commit/343d3bfafd449a0371feb6a88f78e07302fa7143]
 added Spark Thrift UI and the following logic to handle exceptions like the 
following on case `Throwable`.
{code}
HiveThriftServer2.listener.onStatementError(
  statementId, e.getMessage, SparkUtils.exceptionString(e))
{code}

However, there occurs a missed case after implementing 
[SPARK-6964|https://github.com/apache/spark/commit/eb19d3f75cbd002f7e72ce02017a8de67f562792]'s
 `Support Cancellation in the Thrift Server` by adding case `HiveSQLException` 
before case `Throwable`.

Logically, we had better add `HiveThriftServer2.listener.onStatementError` on 
case `HiveSQLException`, too.
{code}
  case e: HiveSQLException =>
if (getStatus().getState() == OperationState.CANCELED) {
  return
} else {
  setState(OperationState.ERROR)
  throw e
}
  // Actually do need to catch Throwable as some failures don't inherit 
from Exception and
  // HiveServer will silently swallow them.
  case e: Throwable =>
val currentState = getStatus().getState()
logError(s"Error executing query, currentState $currentState, ", e)
setState(OperationState.ERROR)
HiveThriftServer2.listener.onStatementError(
  statementId, e.getMessage, SparkUtils.exceptionString(e))
throw new HiveSQLException(e.toString)
{code}
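
Put differently, the proposal amounts to something like the following (a sketch, not the actual patch): notify the listener in the `HiveSQLException` branch as well, before re-throwing, mirroring what the `Throwable` branch already does.

{code}
  // Sketch of the proposed handling, not the actual patch:
  case e: HiveSQLException =>
    if (getStatus().getState() == OperationState.CANCELED) {
      return
    } else {
      setState(OperationState.ERROR)
      HiveThriftServer2.listener.onStatementError(
        statementId, e.getMessage, SparkUtils.exceptionString(e))
      throw e
    }
{code}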


> Fix STS error handling logic on HiveSQLException
> 
>
> Key: SPARK-20345
> URL: https://issues.apache.org/jira/browse/SPARK-20345
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.3, 2.1.0
>Reporter: Dongjoon Hyun
>
> [SPARK-5100|https://github.com/apache/spark/commit/343d3bfafd449a0371feb6a88f78e07302fa7143]
>  added Spark Thrift Server UI and the following logic to handle exceptions on 
> case `Throwable`.
> {code}
> HiveThriftServer2.listener.onStatementError(
>   statementId, e.getMessage, SparkUtils.exceptionString(e))
> {code}
> However, there occurred a missed case after implementing 
> [SPARK-6964|https://github.com/apache/spark/commit/eb19d3f75cbd002f7e72ce02017a8de67f562792]'s
>  `Support Cancellation in the Thrift Server` by adding case 
> `HiveSQLException` before case `Throwable`.
> Logically, we had better add `HiveThriftServer2.listener.onStatementError` on 
> case `HiveSQLException`, too.
> {code}
>   case e: HiveSQLException =>
> if (getStatus().getState() == OperationState.CANCELED) {
>   return
> } else {
>   setState(OperationState.ERROR)
>   throw e
> }
>   // Actually do need to catch Throwable as some failures don't inherit 
> from Exception and
>   // HiveServer will silently swallow them.
>   case e: Throwable =>
> val currentState = getStatus().getState()
> logError(s"Error executing query, currentState $currentState, ", e)
> setState(OperationState.ERROR)
> HiveThriftServer2.listener.onStatementError(
>   statementId, e.getMessage, SparkUtils.exceptionString(e))
> throw new HiveSQLException(e.toString)
> {code}




[jira] [Commented] (SPARK-20345) Fix STS error handling logic on HiveSQLException

2017-04-15 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15970084#comment-15970084
 ] 

Apache Spark commented on SPARK-20345:
--

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/17643

> Fix STS error handling logic on HiveSQLException
> 
>
> Key: SPARK-20345
> URL: https://issues.apache.org/jira/browse/SPARK-20345
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.3, 2.1.0
>Reporter: Dongjoon Hyun
>
> [SPARK-5100|https://github.com/apache/spark/commit/343d3bfafd449a0371feb6a88f78e07302fa7143]
>  added Spark Thrift UI and the following logic to handle exceptions like the 
> following on case `Throwable`.
> {code}
> HiveThriftServer2.listener.onStatementError(
>   statementId, e.getMessage, SparkUtils.exceptionString(e))
> {code}
> However, there occurs a missed case after implementing 
> [SPARK-6964|https://github.com/apache/spark/commit/eb19d3f75cbd002f7e72ce02017a8de67f562792]'s
>  `Support Cancellation in the Thrift Server` by adding case 
> `HiveSQLException` before case `Throwable`.
> Logically, we had better add `HiveThriftServer2.listener.onStatementError` on 
> case `HiveSQLException`, too.
> {code}
>   case e: HiveSQLException =>
> if (getStatus().getState() == OperationState.CANCELED) {
>   return
> } else {
>   setState(OperationState.ERROR)
>   throw e
> }
>   // Actually do need to catch Throwable as some failures don't inherit 
> from Exception and
>   // HiveServer will silently swallow them.
>   case e: Throwable =>
> val currentState = getStatus().getState()
> logError(s"Error executing query, currentState $currentState, ", e)
> setState(OperationState.ERROR)
> HiveThriftServer2.listener.onStatementError(
>   statementId, e.getMessage, SparkUtils.exceptionString(e))
> throw new HiveSQLException(e.toString)
> {code}






[jira] [Assigned] (SPARK-20345) Fix STS error handling logic on HiveSQLException

2017-04-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20345?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20345:


Assignee: Apache Spark

> Fix STS error handling logic on HiveSQLException
> 
>
> Key: SPARK-20345
> URL: https://issues.apache.org/jira/browse/SPARK-20345
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.3, 2.1.0
>Reporter: Dongjoon Hyun
>Assignee: Apache Spark
>
> [SPARK-5100|https://github.com/apache/spark/commit/343d3bfafd449a0371feb6a88f78e07302fa7143]
>  added Spark Thrift UI and the following logic to handle exceptions like the 
> following on case `Throwable`.
> {code}
> HiveThriftServer2.listener.onStatementError(
>   statementId, e.getMessage, SparkUtils.exceptionString(e))
> {code}
> However, there occurs a missed case after implementing 
> [SPARK-6964|https://github.com/apache/spark/commit/eb19d3f75cbd002f7e72ce02017a8de67f562792]'s
>  `Support Cancellation in the Thrift Server` by adding case 
> `HiveSQLException` before case `Throwable`.
> Logically, we had better add `HiveThriftServer2.listener.onStatementError` on 
> case `HiveSQLException`, too.
> {code}
>   case e: HiveSQLException =>
> if (getStatus().getState() == OperationState.CANCELED) {
>   return
> } else {
>   setState(OperationState.ERROR)
>   throw e
> }
>   // Actually do need to catch Throwable as some failures don't inherit 
> from Exception and
>   // HiveServer will silently swallow them.
>   case e: Throwable =>
> val currentState = getStatus().getState()
> logError(s"Error executing query, currentState $currentState, ", e)
> setState(OperationState.ERROR)
> HiveThriftServer2.listener.onStatementError(
>   statementId, e.getMessage, SparkUtils.exceptionString(e))
> throw new HiveSQLException(e.toString)
> {code}






[jira] [Assigned] (SPARK-20345) Fix STS error handling logic on HiveSQLException

2017-04-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20345?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20345:


Assignee: (was: Apache Spark)

> Fix STS error handling logic on HiveSQLException
> 
>
> Key: SPARK-20345
> URL: https://issues.apache.org/jira/browse/SPARK-20345
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.3, 2.1.0
>Reporter: Dongjoon Hyun
>
> [SPARK-5100|https://github.com/apache/spark/commit/343d3bfafd449a0371feb6a88f78e07302fa7143]
>  added Spark Thrift UI and the following logic to handle exceptions like the 
> following on case `Throwable`.
> {code}
> HiveThriftServer2.listener.onStatementError(
>   statementId, e.getMessage, SparkUtils.exceptionString(e))
> {code}
> However, there occurs a missed case after implementing 
> [SPARK-6964|https://github.com/apache/spark/commit/eb19d3f75cbd002f7e72ce02017a8de67f562792]'s
>  `Support Cancellation in the Thrift Server` by adding case 
> `HiveSQLException` before case `Throwable`.
> Logically, we had better add `HiveThriftServer2.listener.onStatementError` on 
> case `HiveSQLException`, too.
> {code}
>   case e: HiveSQLException =>
> if (getStatus().getState() == OperationState.CANCELED) {
>   return
> } else {
>   setState(OperationState.ERROR)
>   throw e
> }
>   // Actually do need to catch Throwable as some failures don't inherit 
> from Exception and
>   // HiveServer will silently swallow them.
>   case e: Throwable =>
> val currentState = getStatus().getState()
> logError(s"Error executing query, currentState $currentState, ", e)
> setState(OperationState.ERROR)
> HiveThriftServer2.listener.onStatementError(
>   statementId, e.getMessage, SparkUtils.exceptionString(e))
> throw new HiveSQLException(e.toString)
> {code}






[jira] [Created] (SPARK-20345) Fix STS error handling logic on HiveSQLException

2017-04-15 Thread Dongjoon Hyun (JIRA)
Dongjoon Hyun created SPARK-20345:
-

 Summary: Fix STS error handling logic on HiveSQLException
 Key: SPARK-20345
 URL: https://issues.apache.org/jira/browse/SPARK-20345
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.1.0, 1.6.3
Reporter: Dongjoon Hyun


[SPARK-5100|https://github.com/apache/spark/commit/343d3bfafd449a0371feb6a88f78e07302fa7143]
 added Spark Thrift UI and the following logic to handle exceptions like the 
following on case `Throwable`.
{code}
HiveThriftServer2.listener.onStatementError(
  statementId, e.getMessage, SparkUtils.exceptionString(e))
{code}

However, there occurs a missed case after implementing 
[SPARK-6964|https://github.com/apache/spark/commit/eb19d3f75cbd002f7e72ce02017a8de67f562792]'s
 `Support Cancellation in the Thrift Server` by adding case `HiveSQLException` 
before case `Throwable`.

Logically, we had better add `HiveThriftServer2.listener.onStatementError` on 
case `HiveSQLException`, too.
{code}
  case e: HiveSQLException =>
if (getStatus().getState() == OperationState.CANCELED) {
  return
} else {
  setState(OperationState.ERROR)
  throw e
}
  // Actually do need to catch Throwable as some failures don't inherit 
from Exception and
  // HiveServer will silently swallow them.
  case e: Throwable =>
val currentState = getStatus().getState()
logError(s"Error executing query, currentState $currentState, ", e)
setState(OperationState.ERROR)
HiveThriftServer2.listener.onStatementError(
  statementId, e.getMessage, SparkUtils.exceptionString(e))
throw new HiveSQLException(e.toString)
{code}






[jira] [Commented] (SPARK-20299) NullPointerException when null and string are in a tuple while encoding Dataset

2017-04-15 Thread Jacek Laskowski (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15969996#comment-15969996
 ] 

Jacek Laskowski commented on SPARK-20299:
-

It does work for 2.1. It does not for 2.2.0-SNAPSHOT.

Steps to reproduce:

1. Download the nightly build from 
http://people.apache.org/~pwendell/spark-nightly/spark-master-bin/latest/ (used 
{{spark-2.2.0-SNAPSHOT-bin-hadoop2.7.tgz}} from 2017-04-15 08:16)

{code}
➜  spark-2.2.0-SNAPSHOT-bin-hadoop2.7 ./bin/spark-submit --version
Welcome to
    __
 / __/__  ___ _/ /__
_\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.2.0-SNAPSHOT
  /_/

Using Scala version 2.11.8, Java HotSpot(TM) 64-Bit Server VM, 1.8.0_121
Branch HEAD
Compiled by user jenkins on 2017-04-15T08:05:06Z
Revision fb036c4413c2cd4d90880d080f418ec468d6c0fc
Url https://github.com/apache/spark.git
Type --help for more information.
{code}

2. Execute the following and you'll *surely* see the exception:

{code}
scala> Seq(("1", null.asInstanceOf[Int]), ("2", 1)).toDS
java.lang.RuntimeException: Error while encoding: java.lang.NullPointerException
staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, 
fromString, assertnotnull(assertnotnull(input[0, scala.Tuple2, true]))._1, 
true) AS _1#0
assertnotnull(assertnotnull(input[0, scala.Tuple2, true]))._2 AS _2#1
  at 
org.apache.spark.sql.catalyst.encoders.ExpressionEncoder.toRow(ExpressionEncoder.scala:290)
  at org.apache.spark.sql.SparkSession$$anonfun$2.apply(SparkSession.scala:454)
  at org.apache.spark.sql.SparkSession$$anonfun$2.apply(SparkSession.scala:454)
  at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
  at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
  at scala.collection.immutable.List.foreach(List.scala:381)
  at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
  at scala.collection.immutable.List.map(List.scala:285)
  at org.apache.spark.sql.SparkSession.createDataset(SparkSession.scala:454)
  at org.apache.spark.sql.SQLContext.createDataset(SQLContext.scala:377)
  at 
org.apache.spark.sql.SQLImplicits.localSeqToDatasetHolder(SQLImplicits.scala:246)
  ... 48 elided
Caused by: java.lang.NullPointerException
  at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply_1$(Unknown
 Source)
  at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown
 Source)
  at 
org.apache.spark.sql.catalyst.encoders.ExpressionEncoder.toRow(ExpressionEncoder.scala:287)
  ... 58 more
{code}
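
A possible workaround on the caller side, which does not address the encoder NPE itself, is to model the nullable value as Option[Int] instead of forcing null into a primitive Int (REPL output shown approximately):

{code}
// Workaround sketch: encode the nullable int as Option[Int].
scala> Seq(("1", None: Option[Int]), ("2", Some(1))).toDS
res0: org.apache.spark.sql.Dataset[(String, Option[Int])] = [_1: string, _2: int]
{code}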

> NullPointerException when null and string are in a tuple while encoding 
> Dataset
> ---
>
> Key: SPARK-20299
> URL: https://issues.apache.org/jira/browse/SPARK-20299
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Jacek Laskowski
>Priority: Minor
>
> When creating a Dataset from a tuple with {{null}} and a string, NPE is 
> reported. When either is removed, it works fine.
> {code}
> scala> Seq((1, null.asInstanceOf[Int]), (2, 1)).toDS
> res43: org.apache.spark.sql.Dataset[(Int, Int)] = [_1: int, _2: int]
> scala> Seq(("1", null.asInstanceOf[Int]), ("2", 1)).toDS
> java.lang.RuntimeException: Error while encoding: 
> java.lang.NullPointerException
> staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, 
> fromString, assertnotnull(assertnotnull(input[0, scala.Tuple2, true], top 
> level Product input object), - root class: "scala.Tuple2")._1, true) AS _1#474
> assertnotnull(assertnotnull(input[0, scala.Tuple2, true], top level Product 
> input object), - root class: "scala.Tuple2")._2 AS _2#475
>   at 
> org.apache.spark.sql.catalyst.encoders.ExpressionEncoder.toRow(ExpressionEncoder.scala:290)
>   at 
> org.apache.spark.sql.SparkSession$$anonfun$2.apply(SparkSession.scala:454)
>   at 
> org.apache.spark.sql.SparkSession$$anonfun$2.apply(SparkSession.scala:454)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at scala.collection.immutable.List.foreach(List.scala:381)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>   at scala.collection.immutable.List.map(List.scala:285)
>   at org.apache.spark.sql.SparkSession.createDataset(SparkSession.scala:454)
>   at org.apache.spark.sql.SQLContext.createDataset(SQLContext.scala:377)
>   at 
> org.apache.spark.sql.SQLImplicits.localSeqToDatasetHolder(SQLImplicits.scala:246)
>   ... 48 elided
> Caused by: java.lang.NullPointerException
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$Sp

[jira] [Updated] (SPARK-20299) NullPointerException when null and string are in a tuple while encoding Dataset

2017-04-15 Thread Jacek Laskowski (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20299?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jacek Laskowski updated SPARK-20299:

Description: 
When creating a Dataset from a tuple with {{null}} and a string, NPE is 
reported. When either is removed, it works fine.

{code}
scala> Seq((1, null.asInstanceOf[Int]), (2, 1)).toDS
res43: org.apache.spark.sql.Dataset[(Int, Int)] = [_1: int, _2: int]

scala> Seq(("1", null.asInstanceOf[Int]), ("2", 1)).toDS
java.lang.RuntimeException: Error while encoding: java.lang.NullPointerException
staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, 
fromString, assertnotnull(assertnotnull(input[0, scala.Tuple2, true], top level 
Product input object), - root class: "scala.Tuple2")._1, true) AS _1#474
assertnotnull(assertnotnull(input[0, scala.Tuple2, true], top level Product 
input object), - root class: "scala.Tuple2")._2 AS _2#475
  at 
org.apache.spark.sql.catalyst.encoders.ExpressionEncoder.toRow(ExpressionEncoder.scala:290)
  at org.apache.spark.sql.SparkSession$$anonfun$2.apply(SparkSession.scala:454)
  at org.apache.spark.sql.SparkSession$$anonfun$2.apply(SparkSession.scala:454)
  at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
  at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
  at scala.collection.immutable.List.foreach(List.scala:381)
  at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
  at scala.collection.immutable.List.map(List.scala:285)
  at org.apache.spark.sql.SparkSession.createDataset(SparkSession.scala:454)
  at org.apache.spark.sql.SQLContext.createDataset(SQLContext.scala:377)
  at 
org.apache.spark.sql.SQLImplicits.localSeqToDatasetHolder(SQLImplicits.scala:246)
  ... 48 elided
Caused by: java.lang.NullPointerException
  at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply_1$(Unknown
 Source)
  at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown
 Source)
  at 
org.apache.spark.sql.catalyst.encoders.ExpressionEncoder.toRow(ExpressionEncoder.scala:287)
  ... 58 more
{code}

  was:
When creating a Dataset from a tuple with {{null}} and a string, NPE is 
reported. When either is removed, it works fine.

{code}
scala> Seq((1, null.asInstanceOf[Int]), (2, 1)).toDS
res43: org.apache.spark.sql.Dataset[(Int, Int)] = [_1: int, _2: int]

scala> Seq(("1", null.asInstanceOf[Int]), ("2", 1)).toDS
scala> Seq(("1", null.asInstanceOf[Int]), ("2", 1)).toDS
java.lang.RuntimeException: Error while encoding: java.lang.NullPointerException
staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, 
fromString, assertnotnull(assertnotnull(input[0, scala.Tuple2, true], top level 
Product input object), - root class: "scala.Tuple2")._1, true) AS _1#474
assertnotnull(assertnotnull(input[0, scala.Tuple2, true], top level Product 
input object), - root class: "scala.Tuple2")._2 AS _2#475
  at 
org.apache.spark.sql.catalyst.encoders.ExpressionEncoder.toRow(ExpressionEncoder.scala:290)
  at org.apache.spark.sql.SparkSession$$anonfun$2.apply(SparkSession.scala:454)
  at org.apache.spark.sql.SparkSession$$anonfun$2.apply(SparkSession.scala:454)
  at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
  at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
  at scala.collection.immutable.List.foreach(List.scala:381)
  at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
  at scala.collection.immutable.List.map(List.scala:285)
  at org.apache.spark.sql.SparkSession.createDataset(SparkSession.scala:454)
  at org.apache.spark.sql.SQLContext.createDataset(SQLContext.scala:377)
  at 
org.apache.spark.sql.SQLImplicits.localSeqToDatasetHolder(SQLImplicits.scala:246)
  ... 48 elided
Caused by: java.lang.NullPointerException
  at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply_1$(Unknown
 Source)
  at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown
 Source)
  at 
org.apache.spark.sql.catalyst.encoders.ExpressionEncoder.toRow(ExpressionEncoder.scala:287)
  ... 58 more
{code}


> NullPointerException when null and string are in a tuple while encoding 
> Dataset
> ---
>
> Key: SPARK-20299
> URL: https://issues.apache.org/jira/browse/SPARK-20299
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Jacek Laskowski
>Priority: Minor
>
> When creating a Dataset from a tuple with {{null}} and a string, NPE is 
> reported. When either is removed, it works fine.
> {code}
> scala> Seq((1, null.asInstanceOf[Int]), (2, 1)).toDS
> res43: org.apache.spar

[jira] [Commented] (SPARK-20344) Duplicate call in FairSchedulableBuilder.addTaskSetManager

2017-04-15 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15969946#comment-15969946
 ] 

Sean Owen commented on SPARK-20344:
---

We use pull requests -- http://spark.apache.org/contributing.html
That change looks a little more complex than needed. I think the only thing 
that's needed is to avoid the redundant assignments in the first part where the 
pool is obtained, and then proceed as before.
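
That is, something along these lines (a sketch of the minimal change, not a reviewed patch): resolve the pool name first, then call getSchedulableByName only once.

{code}
// Sketch only: compute poolName up front and look the pool up a single time.
override def addTaskSetManager(manager: Schedulable, properties: Properties) {
  val poolName =
    if (properties != null) {
      properties.getProperty(FAIR_SCHEDULER_PROPERTIES, DEFAULT_POOL_NAME)
    } else {
      DEFAULT_POOL_NAME
    }
  var parentPool = rootPool.getSchedulableByName(poolName)
  if (parentPool == null) {
    // ... build the pool on demand, as before ...
  }
  // ... proceed as before ...
}
{code}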

> Duplicate call in FairSchedulableBuilder.addTaskSetManager
> --
>
> Key: SPARK-20344
> URL: https://issues.apache.org/jira/browse/SPARK-20344
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler
>Affects Versions: 2.1.0
>Reporter: Robert Stupp
>Priority: Trivial
>
> {{org.apache.spark.scheduler.FairSchedulableBuilder#addTaskSetManager}} 
> contains the code snippet:
> {code}
>   override def addTaskSetManager(manager: Schedulable, properties: 
> Properties) {
> var poolName = DEFAULT_POOL_NAME
> var parentPool = rootPool.getSchedulableByName(poolName)
> if (properties != null) {
>   poolName = properties.getProperty(FAIR_SCHEDULER_PROPERTIES, 
> DEFAULT_POOL_NAME)
>   parentPool = rootPool.getSchedulableByName(poolName)
>   if (parentPool == null) {
> {code}
> {{parentPool = rootPool.getSchedulableByName(poolName)}} is called twice if 
> {{properties != null}}.
> I'm not sure whether this is an oversight or there's something else missing. 
> This piece of the code hasn't been modified since 2013, so I doubt that this 
> is a serious issue.






[jira] [Commented] (SPARK-20286) dynamicAllocation.executorIdleTimeout is ignored after unpersist

2017-04-15 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SPARK-20286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15969944#comment-15969944
 ] 

Miguel Pérez commented on SPARK-20286:
--

My supposition is that {{onExecutorIdle}} is only called when a task ends, so the 
executor is already idle by the time you call {{unpersist}}. I'm not sure how to 
test this, though. Also, it would be great if the UI could show an "idle" status 
for the executors. Currently, they're shown as "Active" until they're killed and 
then shown as "Dead".

> dynamicAllocation.executorIdleTimeout is ignored after unpersist
> 
>
> Key: SPARK-20286
> URL: https://issues.apache.org/jira/browse/SPARK-20286
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.1
>Reporter: Miguel Pérez
>
> With dynamic allocation enabled, it seems that executors with cached data 
> which are unpersisted are still being killed using the 
> {{dynamicAllocation.cachedExecutorIdleTimeout}} configuration, instead of 
> {{dynamicAllocation.executorIdleTimeout}}. Assuming the default configuration 
> ({{dynamicAllocation.cachedExecutorIdleTimeout = Infinity}}), an executor 
> with unpersisted data won't be released until the job ends.
> *How to reproduce*
> - Set different values for {{dynamicAllocation.executorIdleTimeout}} and 
> {{dynamicAllocation.cachedExecutorIdleTimeout}}
> - Load a file into an RDD and persist it
> - Execute an action on the RDD (like a count) so some executors are activated.
> - When the action has finished, unpersist the RDD
> - The application UI correctly removes the persisted data from the *Storage* 
> tab, but if you look in the *Executors* tab, you will find that the executors 
> remain *active* until {{dynamicAllocation.cachedExecutorIdleTimeout}} is 
> reached.
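
For reference, the reproduction above as a minimal sketch (the file path is hypothetical; the full property names are spark.dynamicAllocation.executorIdleTimeout and spark.dynamicAllocation.cachedExecutorIdleTimeout):

{code}
// Assumed settings (different values for the two timeouts):
//   spark.dynamicAllocation.enabled=true
//   spark.dynamicAllocation.executorIdleTimeout=60s
//   spark.dynamicAllocation.cachedExecutorIdleTimeout=600s

val rdd = sc.textFile("hdfs:///path/to/large/file").persist()
rdd.count()      // action: executors are allocated and blocks get cached
rdd.unpersist()  // expectation: executors should now follow executorIdleTimeout
// Observed: executors stay "Active" until cachedExecutorIdleTimeout is reached
{code}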






[jira] [Resolved] (SPARK-20339) Issue in regex_replace in Apache Spark Java

2017-04-15 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20339?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-20339.
---
Resolution: Invalid

(No need to paste that much redundant code.)
If it's a question it should go to u...@spark.apache.org.
For such a huge sequence of generated columns you are probably much better off 
constructing a Row directly in a single transformation instead of calling 
withColumn hundreds of times. Or else disable code gen.
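
As one possible shape of that (a Scala sketch, not the reporter's Java; a single UDF is one way to collapse everything into one projection, and constructing a Row in a map is another; the map contents and column name come from the report, and {{dataFileContent}} is assumed to be the DataFrame loaded from CSV):

{code}
import org.apache.spark.sql.functions.{col, udf}

// Lookup table of manufacturer name -> canonical name (abbreviated).
val manufacturerNames: Map[String, String] = Map(
  "Allen" -> "Apex Tool Group",
  "Armstrong" -> "Apex Tool Group"
  // ... remaining entries ...
)

// Apply every replacement in one pass over the column value, so the plan has a
// single projection instead of hundreds of nested regexp_replace expressions.
val replaceAll = udf { (source: String) =>
  if (source == null) null
  else manufacturerNames.foldLeft(source) { case (acc, (name, alias)) =>
    acc.replaceAll(name, alias)
  }
}

val replaced = dataFileContent.withColumn("ManufacturerSource",
  replaceAll(col("ManufacturerSource")))
{code}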

> Issue in regex_replace in Apache Spark Java
> ---
>
> Key: SPARK-20339
> URL: https://issues.apache.org/jira/browse/SPARK-20339
> Project: Spark
>  Issue Type: Question
>  Components: Java API, Spark Core, SQL
>Affects Versions: 2.1.0
>Reporter: Nischay
>
> We are currently facing a couple of issues:
> 1. 
> "org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator" 
> grows beyond 64 KB
> 2. "java.lang.StackOverflowError"
> The first issue is reported as a major bug in the Apache Spark JIRA: 
> https://issues.apache.org/jira/browse/SPARK-18492
> We got these issues with the following program. We are trying to replace the 
> manufacturer name with its equivalent alternate name.
> These issues occur only when we have a huge number of alternate names to 
> replace; for a small number of replacements it works with no issues.
> dataFileContent=dataFileContent.withColumn("ManufacturerSource", 
> regexp_replace(col("ManufacturerSource"),str,manufacturerNames.get(str).toString()));
> Kindly suggest an alternative method or a solution to work around this 
> problem.
> {code}
>   Hashtable manufacturerNames = new Hashtable();
> Enumeration names;
> String str;
> double bal;
> manufacturerNames.put("Allen","Apex Tool Group");
> manufacturerNames.put("Armstrong","Apex Tool Group");
> manufacturerNames.put("Campbell","Apex Tool Group");
> manufacturerNames.put("Lubriplate","Apex Tool Group");
> manufacturerNames.put("Delta","Apex Tool Group");
> manufacturerNames.put("Gearwrench","Apex Tool Group");
> manufacturerNames.put("H.K. Porter","Apex Tool 
> Group");
> manufacturerNames.put("Jacobs","Apex Tool Group");
> manufacturerNames.put("Jobox","Apex Tool Group");
> ...about 100 more ...
> manufacturerNames.put("Standard Safety","Standard 
> Safety Equipment Company");
> manufacturerNames.put("Standard Safety","Standard 
> Safety Equipment Company");   
> // Show all balances in hash table.
> names = manufacturerNames.keys();
> Dataset dataFileContent = 
> sqlContext.load("com.databricks.spark.csv", options);
>   
> 
> while(names.hasMoreElements()) {
>str = (String) names.nextElement();
>
> dataFileContent=dataFileContent.withColumn("ManufacturerSource", 
> regexp_replace(col("ManufacturerSource"),str,manufacturerNames.get(str).toString()));
> }
> dataFileContent.show();
> {code}






[jira] [Updated] (SPARK-20339) Issue in regex_replace in Apache Spark Java

2017-04-15 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20339?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-20339:
--
Description: 
We are currently facing a couple of issues:

1. "org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator" 
grows beyond 64 KB
2. "java.lang.StackOverflowError"
The first issue is reported as a major bug in the Apache Spark JIRA: 
https://issues.apache.org/jira/browse/SPARK-18492

We got these issues with the following program. We are trying to replace the 
manufacturer name with its equivalent alternate name.

These issues occur only when we have a huge number of alternate names to replace; 
for a small number of replacements it works with no issues.
dataFileContent=dataFileContent.withColumn("ManufacturerSource", 
regexp_replace(col("ManufacturerSource"),str,manufacturerNames.get(str).toString()));

Kindly suggest an alternative method or a solution to work around this problem.

{code}
Hashtable manufacturerNames = new Hashtable();
  Enumeration names;
  String str;
  double bal;

  manufacturerNames.put("Allen","Apex Tool Group");
  manufacturerNames.put("Armstrong","Apex Tool Group");
  manufacturerNames.put("Campbell","Apex Tool Group");
  manufacturerNames.put("Lubriplate","Apex Tool Group");
  manufacturerNames.put("Delta","Apex Tool Group");
  manufacturerNames.put("Gearwrench","Apex Tool Group");
  manufacturerNames.put("H.K. Porter","Apex Tool 
Group");
  manufacturerNames.put("Jacobs","Apex Tool Group");
  manufacturerNames.put("Jobox","Apex Tool Group");
...about 100 more ...
  manufacturerNames.put("Standard Safety","Standard 
Safety Equipment Company");
  manufacturerNames.put("Standard Safety","Standard 
Safety Equipment Company");   

  // Show all balances in hash table.
  names = manufacturerNames.keys();
  Dataset dataFileContent = 
sqlContext.load("com.databricks.spark.csv", options);

  
  while(names.hasMoreElements()) {
 str = (String) names.nextElement();
 
dataFileContent=dataFileContent.withColumn("ManufacturerSource", 
regexp_replace(col("ManufacturerSource"),str,manufacturerNames.get(str).toString()));
  }
  dataFileContent.show();
{code}


  was:
We are currently facing couple of issues

1. "org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator" 
grows beyond 64 KB".
2. "java.lang.StackOverflowError"
The first issue is reported as a Major bug in Jira of Apache spark 
https://issues.apache.org/jira/browse/SPARK-18492

We got these issues by the following program. We are trying to replace the 
Manufacturer name by its equivalent alternate name,

These issues occur only when we have Huge number of alternate names to replace, 
for small number of replacements it works with no issues.
dataFileContent=dataFileContent.withColumn("ManufacturerSource", 
regexp_replace(col("ManufacturerSource"),str,manufacturerNames.get(str).toString()));`

Kindly suggest us an alternative method or a solution to go around this problem.

Hashtable manufacturerNames = new Hashtable();
  Enumeration names;
  String str;
  double bal;

  manufacturerNames.put("Allen","Apex Tool Group");
  manufacturerNames.put("Armstrong","Apex Tool Group");
  manufacturerNames.put("Campbell","Apex Tool Group");
  manufacturerNames.put("Lubriplate","Apex Tool Group");
  manufacturerNames.put("Delta","Apex Tool Group");
  manufacturerNames.put("Gearwrench","Apex Tool Group");
  manufacturerNames.put("H.K. Porter","Apex Tool 
Group");
  manufacturerNames.put("Jacobs","Apex Tool Group");
  manufacturerNames.put("Jobox","Apex Tool Group");
  manufacturerNames.put("Lufkin","Apex Tool Group");
  manufacturerNames.put("Nicholson","Apex Tool Group");
  manufacturerNames.put("Plumb","Apex Tool Group");
  manufacturerNames.put("Wiss","Apex Tool Group");
  manufacturerNames.put("Covert","Apex Tool Group");
  manufacturerNames.put("Apex-Geta"

[jira] [Commented] (SPARK-20299) NullPointerException when null and string are in a tuple while encoding Dataset

2017-04-15 Thread Umesh Chaudhary (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15969933#comment-15969933
 ] 

Umesh Chaudhary commented on SPARK-20299:
-

[~jlaskowski] your last two lines in the repro steps are the same. I tried 
different values in the tuple to get the NPE but was not able to see it. Can you 
please mention the exact steps to reproduce this issue?

> NullPointerException when null and string are in a tuple while encoding 
> Dataset
> ---
>
> Key: SPARK-20299
> URL: https://issues.apache.org/jira/browse/SPARK-20299
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Jacek Laskowski
>Priority: Minor
>
> When creating a Dataset from a tuple with {{null}} and a string, NPE is 
> reported. When either is removed, it works fine.
> {code}
> scala> Seq((1, null.asInstanceOf[Int]), (2, 1)).toDS
> res43: org.apache.spark.sql.Dataset[(Int, Int)] = [_1: int, _2: int]
> scala> Seq(("1", null.asInstanceOf[Int]), ("2", 1)).toDS
> scala> Seq(("1", null.asInstanceOf[Int]), ("2", 1)).toDS
> java.lang.RuntimeException: Error while encoding: 
> java.lang.NullPointerException
> staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, 
> fromString, assertnotnull(assertnotnull(input[0, scala.Tuple2, true], top 
> level Product input object), - root class: "scala.Tuple2")._1, true) AS _1#474
> assertnotnull(assertnotnull(input[0, scala.Tuple2, true], top level Product 
> input object), - root class: "scala.Tuple2")._2 AS _2#475
>   at 
> org.apache.spark.sql.catalyst.encoders.ExpressionEncoder.toRow(ExpressionEncoder.scala:290)
>   at 
> org.apache.spark.sql.SparkSession$$anonfun$2.apply(SparkSession.scala:454)
>   at 
> org.apache.spark.sql.SparkSession$$anonfun$2.apply(SparkSession.scala:454)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at scala.collection.immutable.List.foreach(List.scala:381)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>   at scala.collection.immutable.List.map(List.scala:285)
>   at org.apache.spark.sql.SparkSession.createDataset(SparkSession.scala:454)
>   at org.apache.spark.sql.SQLContext.createDataset(SQLContext.scala:377)
>   at 
> org.apache.spark.sql.SQLImplicits.localSeqToDatasetHolder(SQLImplicits.scala:246)
>   ... 48 elided
> Caused by: java.lang.NullPointerException
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply_1$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.encoders.ExpressionEncoder.toRow(ExpressionEncoder.scala:287)
>   ... 58 more
> {code}






[jira] [Commented] (SPARK-20286) dynamicAllocation.executorIdleTimeout is ignored after unpersist

2017-04-15 Thread Umesh Chaudhary (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15969924#comment-15969924
 ] 

Umesh Chaudhary commented on SPARK-20286:
-

While looking at ExecutorAllocationManager.onExecutorIdle, there is a condition 
which checks whether the executor has cached blocks or not; if it has cached 
blocks it uses cachedExecutorIdleTimeoutS, and if it has no cached blocks it uses 
executorIdleTimeoutS.

Still not sure why it behaves like this even after unpersist is called. One 
possibility: there might be some cached data on the executors that is not 
reported to the BlockManager, which causes the executor to follow 
cachedExecutorIdleTimeout instead of executorIdleTimeout.
Need some thoughts, though. cc: [~joshrosen], [~rxin]
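
Roughly, the check being described reads like this (a paraphrased sketch of ExecutorAllocationManager.onExecutorIdle, not the exact source; surrounding guards and overflow handling omitted):

{code}
private def onExecutorIdle(executorId: String): Unit = synchronized {
  // Ask the block manager master whether this executor still reports cached blocks.
  val hasCachedBlocks = SparkEnv.get.blockManager.master.hasCachedBlocks(executorId)
  val now = clock.getTimeMillis()
  val timeout =
    if (hasCachedBlocks) {
      // Cached blocks present: use cachedExecutorIdleTimeout (infinite by default).
      now + cachedExecutorIdleTimeoutS * 1000
    } else {
      now + executorIdleTimeoutS * 1000
    }
  removeTimes(executorId) = timeout
}
{code}

So if the unpersist is not reflected in the block manager master's view, {{hasCachedBlocks}} stays true and the executor keeps the cached timeout, which would match the observed behaviour.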

> dynamicAllocation.executorIdleTimeout is ignored after unpersist
> 
>
> Key: SPARK-20286
> URL: https://issues.apache.org/jira/browse/SPARK-20286
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.1
>Reporter: Miguel Pérez
>
> With dynamic allocation enabled, it seems that executors with cached data 
> which are unpersisted are still being killed using the 
> {{dynamicAllocation.cachedExecutorIdleTimeout}} configuration, instead of 
> {{dynamicAllocation.executorIdleTimeout}}. Assuming the default configuration 
> ({{dynamicAllocation.cachedExecutorIdleTimeout = Infinity}}), an executor 
> with unpersisted data won't be released until the job ends.
> *How to reproduce*
> - Set different values for {{dynamicAllocation.executorIdleTimeout}} and 
> {{dynamicAllocation.cachedExecutorIdleTimeout}}
> - Load a file into an RDD and persist it
> - Execute an action on the RDD (like a count) so some executors are activated.
> - When the action has finished, unpersist the RDD
> - The application UI correctly removes the persisted data from the *Storage* 
> tab, but if you look in the *Executors* tab, you will find that the executors 
> remain *active* until {{dynamicAllocation.cachedExecutorIdleTimeout}} is 
> reached.






[jira] [Commented] (SPARK-20344) Duplicate call in FairSchedulableBuilder.addTaskSetManager

2017-04-15 Thread Robert Stupp (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15969901#comment-15969901
 ] 

Robert Stupp commented on SPARK-20344:
--

Just saw it's a duplicate. Not a serious thing, just unnecessary.
I've set up a branch [on GitHub 
here|https://github.com/apache/spark/compare/master...snazy:20344-dup-call-master?expand=1]
 that rearranges the calls. Not sure whether you use pull requests against the 
ASF mirror.

> Duplicate call in FairSchedulableBuilder.addTaskSetManager
> --
>
> Key: SPARK-20344
> URL: https://issues.apache.org/jira/browse/SPARK-20344
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler
>Affects Versions: 2.1.0
>Reporter: Robert Stupp
>Priority: Trivial
>
> {{org.apache.spark.scheduler.FairSchedulableBuilder#addTaskSetManager}} 
> contains the code snippet:
> {code}
>   override def addTaskSetManager(manager: Schedulable, properties: 
> Properties) {
> var poolName = DEFAULT_POOL_NAME
> var parentPool = rootPool.getSchedulableByName(poolName)
> if (properties != null) {
>   poolName = properties.getProperty(FAIR_SCHEDULER_PROPERTIES, 
> DEFAULT_POOL_NAME)
>   parentPool = rootPool.getSchedulableByName(poolName)
>   if (parentPool == null) {
> {code}
> {{parentPool = rootPool.getSchedulableByName(poolName)}} is called twice if 
> {{properties != null}}.
> I'm not sure whether this is an oversight or there's something else missing. 
> This piece of the code hasn't been modified since 2013, so I doubt that this 
> is a serious issue.






[jira] [Updated] (SPARK-20344) Duplicate call in FairSchedulableBuilder.addTaskSetManager

2017-04-15 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20344?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-20344:
--
Priority: Trivial  (was: Minor)

Does it cause any problem? Yes, you could probably rearrange this anyway to 
avoid the duplication, but it's not really worth a JIRA.

> Duplicate call in FairSchedulableBuilder.addTaskSetManager
> --
>
> Key: SPARK-20344
> URL: https://issues.apache.org/jira/browse/SPARK-20344
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler
>Affects Versions: 2.1.0
>Reporter: Robert Stupp
>Priority: Trivial
>
> {{org.apache.spark.scheduler.FairSchedulableBuilder#addTaskSetManager}} 
> contains the code snippet:
> {code}
>   override def addTaskSetManager(manager: Schedulable, properties: 
> Properties) {
> var poolName = DEFAULT_POOL_NAME
> var parentPool = rootPool.getSchedulableByName(poolName)
> if (properties != null) {
>   poolName = properties.getProperty(FAIR_SCHEDULER_PROPERTIES, 
> DEFAULT_POOL_NAME)
>   parentPool = rootPool.getSchedulableByName(poolName)
>   if (parentPool == null) {
> {code}
> {{parentPool = rootPool.getSchedulableByName(poolName)}} is called twice if 
> {{properties != null}}.
> I'm not sure whether this is an oversight or there's something else missing. 
> This piece of the code hasn't been modified since 2013, so I doubt that this 
> is a serious issue.






[jira] [Assigned] (SPARK-20316) In SparkSQLCLIDriver, val and var should strictly follow the Scala syntax

2017-04-15 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20316?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen reassigned SPARK-20316:
-

Assignee: Xiaochen Ouyang

> In SparkSQLCLIDriver, val and var should strictly follow the Scala syntax
> -
>
> Key: SPARK-20316
> URL: https://issues.apache.org/jira/browse/SPARK-20316
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.0
> Environment: Spark2.1.0
>Reporter: Xiaochen Ouyang
>Assignee: Xiaochen Ouyang
>Priority: Trivial
> Fix For: 2.2.0
>
>
> In SparkSQLCLIDriver, val and var should strictly follow the Scala syntax.
>   private var prompt = "spark-sql"
>   private var continuedPrompt = "".padTo(prompt.length, ' ')
> If there is no place where the variable is changed, we should declare it with 
> 'val'; otherwise, 'var'.
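
In concrete terms, assuming (as the description states) that neither variable is reassigned anywhere else in SparkSQLCLIDriver, the change would simply be:

{code}
private val prompt = "spark-sql"
private val continuedPrompt = "".padTo(prompt.length, ' ')
{code}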






[jira] [Resolved] (SPARK-20316) In SparkSQLCLIDriver, val and var should strictly follow the Scala syntax

2017-04-15 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20316?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-20316.
---
   Resolution: Fixed
Fix Version/s: 2.2.0

Issue resolved by pull request 17628
[https://github.com/apache/spark/pull/17628]

> In SparkSQLCLIDriver, val and var should strictly follow the Scala syntax
> -
>
> Key: SPARK-20316
> URL: https://issues.apache.org/jira/browse/SPARK-20316
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.0
> Environment: Spark2.1.0
>Reporter: Xiaochen Ouyang
>Priority: Trivial
> Fix For: 2.2.0
>
>
> In SparkSQLCLIDriver, val and var should strictly follow the Scala syntax.
>   private var prompt = "spark-sql"
>   private var continuedPrompt = "".padTo(prompt.length, ' ')
> If there is no place where the variable is changed, we should declare it with 
> 'val'; otherwise, 'var'.






[jira] [Resolved] (SPARK-7674) R-like stats for ML models

2017-04-15 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7674?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-7674.
--
Resolution: Done

> R-like stats for ML models
> --
>
> Key: SPARK-7674
> URL: https://issues.apache.org/jira/browse/SPARK-7674
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Joseph K. Bradley
>Assignee: Joseph K. Bradley
>Priority: Critical
>
> This is an umbrella JIRA for supporting ML model summaries and statistics, 
> following the example of R's summary() and plot() functions.
> [Design 
> doc|https://docs.google.com/document/d/1oswC_Neqlqn5ElPwodlDY4IkSaHAi0Bx6Guo_LvhHK8/edit?usp=sharing]
> From the design doc:
> {quote}
> R and its well-established packages provide extensive functionality for 
> inspecting a model and its results.  This inspection is critical to 
> interpreting, debugging and improving models.
> R is arguably a gold standard for a statistics/ML library, so this doc 
> largely attempts to imitate it.  The challenge we face is supporting similar 
> functionality, but on big (distributed) data.  Data size makes both efficient 
> computation and meaningful displays/summaries difficult.
> R model and result summaries generally take 2 forms:
> * summary(model): Display text with information about the model and results 
> on data
> * plot(model): Display plots about the model and results
> We aim to provide both of these types of information.  Visualization for the 
> plottable results will not be supported in MLlib itself, but we can provide 
> results in a form which can be plotted easily with other tools.
> {quote}






[jira] [Created] (SPARK-20344) Duplicate call in FairSchedulableBuilder.addTaskSetManager

2017-04-15 Thread Robert Stupp (JIRA)
Robert Stupp created SPARK-20344:


 Summary: Duplicate call in FairSchedulableBuilder.addTaskSetManager
 Key: SPARK-20344
 URL: https://issues.apache.org/jira/browse/SPARK-20344
 Project: Spark
  Issue Type: Improvement
  Components: Scheduler
Affects Versions: 2.1.0
Reporter: Robert Stupp
Priority: Minor


{{org.apache.spark.scheduler.FairSchedulableBuilder#addTaskSetManager}} 
contains the code snippet:
{code}
  override def addTaskSetManager(manager: Schedulable, properties: Properties) {
var poolName = DEFAULT_POOL_NAME
var parentPool = rootPool.getSchedulableByName(poolName)
if (properties != null) {
  poolName = properties.getProperty(FAIR_SCHEDULER_PROPERTIES, 
DEFAULT_POOL_NAME)
  parentPool = rootPool.getSchedulableByName(poolName)
  if (parentPool == null) {
{code}

{{parentPool = rootPool.getSchedulableByName(poolName)}} is called twice if 
{{properties != null}}.

I'm not sure whether this is an oversight or there's something else missing. 
This piece of the code hasn't been modified since 2013, so I doubt that this is 
a serious issue.


