[jira] [Comment Edited] (SPARK-20325) Spark Structured Streaming documentation Update: checkpoint configuration
[ https://issues.apache.org/jira/browse/SPARK-20325?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15970255#comment-15970255 ] Hyukjin Kwon edited comment on SPARK-20325 at 4/16/17 6:30 AM: --- It sounds the documentation issue for ... {quote} could we update documentation for Structured Streaming and describe this behavior {quote} I think this question should go to the mailing list. {quote} Do we really need to specify the checkpoint dir per query? what the reason for this? finally we will be forced to write some checkpointDir name generator, for example associate it with some particular named query and so on? {quote} was (Author: hyukjin.kwon): It sounds the documentation issue for ... {quote} could we update documentation for Structured Streaming and describe this behavior {quote} {quote} Do we really need to specify the checkpoint dir per query? what the reason for this? finally we will be forced to write some checkpointDir name generator, for example associate it with some particular named query and so on? {quote} I think this question should go to the mailing list. 
> Spark Structured Streaming documentation Update: checkpoint configuration > - > > Key: SPARK-20325 > URL: https://issues.apache.org/jira/browse/SPARK-20325 > Project: Spark > Issue Type: Documentation > Components: Structured Streaming >Affects Versions: 2.1.0 >Reporter: Kate Eri >Priority: Minor > > I have configured the following stream outputting to Kafka: > {code} > map.foreach(metric => { > streamToProcess > .groupBy(metric) > .agg(count(metric)) > .writeStream > .outputMode("complete") > .option("checkpointLocation", checkpointDir) > .foreach(kafkaWriter) > .start() > }) > {code} > And configured the checkpoint Dir for each of output sinks like: > .option("checkpointLocation", checkpointDir) according to the documentation > => > http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#recovering-from-failures-with-checkpointing > > As a result I've got the following exception: > Cannot start query with id bf6a1003-6252-4c62-8249-c6a189701255 as another > query with same id is already active. Perhaps you are attempting to restart a > query from checkpoint that is already active. > java.lang.IllegalStateException: Cannot start query with id > bf6a1003-6252-4c62-8249-c6a189701255 as another query with same id is already > active. Perhaps you are attempting to restart a query from checkpoint that is > already active. 
> at > org.apache.spark.sql.streaming.StreamingQueryManager.startQuery(StreamingQueryManager.scala:291) > So according to current spark logic for “foreach” sink the checkpoint > configuration is loaded in the following way: > {code:title=StreamingQueryManager.scala} >val checkpointLocation = userSpecifiedCheckpointLocation.map { > userSpecified => > new Path(userSpecified).toUri.toString > }.orElse { > df.sparkSession.sessionState.conf.checkpointLocation.map { location => > new Path(location, > userSpecifiedName.getOrElse(UUID.randomUUID().toString)).toUri.toString > } > }.getOrElse { > if (useTempCheckpointLocation) { > Utils.createTempDir(namePrefix = s"temporary").getCanonicalPath > } else { > throw new AnalysisException( > "checkpointLocation must be specified either " + > """through option("checkpointLocation", ...) or """ + > s"""SparkSession.conf.set("${SQLConf.CHECKPOINT_LOCATION.key}", > ...)""") > } > } > {code} > so first spark take checkpointDir from query, then from sparksession > (spark.sql.streaming.checkpointLocation) and so on. > But this behavior was not documented, thus two questions: > 1) could we update documentation for Structured Streaming and describe this > behavior > 2) Do we really need to specify the checkpoint dir per query? what the reason > for this? finally we will be forced to write some checkpointDir name > generator, for example associate it with some particular named query and so > on? -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
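The precedence described in the Scala excerpt above (per-query option, then the session-wide `spark.sql.streaming.checkpointLocation` conf with a per-query subdirectory, then a temp dir or an error) can be sketched as plain logic. This is an illustrative Python sketch of the fallback chain, not Spark's actual implementation; all names are hypothetical:

```python
import uuid

def resolve_checkpoint_location(query_option=None, session_conf=None,
                                query_name=None, use_temp=False):
    """Mirror of the orElse chain in StreamingQueryManager (illustrative only)."""
    if query_option is not None:
        # option("checkpointLocation", ...) on the query always wins
        return query_option
    if session_conf is not None:
        # spark.sql.streaming.checkpointLocation plus a per-query subdirectory:
        # the query name if set, otherwise a random UUID
        return f"{session_conf}/{query_name or uuid.uuid4()}"
    if use_temp:
        # some sinks allow a throwaway temporary checkpoint directory
        return "/tmp/temporary-" + str(uuid.uuid4())
    raise ValueError("checkpointLocation must be specified")

# Per-query option takes precedence over the session conf:
print(resolve_checkpoint_location(query_option="/ckpt/q1", session_conf="/ckpt"))
```

This also illustrates why sharing one `checkpointDir` across several `start()` calls fails: every query needs its own directory, which the session-conf branch provides automatically via the query name or a UUID.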
[jira] [Updated] (SPARK-20325) Spark Structured Streaming documentation Update: checkpoint configuration
[ https://issues.apache.org/jira/browse/SPARK-20325?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-20325: - Issue Type: Documentation (was: Bug)
[jira] [Commented] (SPARK-20325) Spark Structured Streaming documentation Update: checkpoint configuration
[ https://issues.apache.org/jira/browse/SPARK-20325?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15970255#comment-15970255 ] Hyukjin Kwon commented on SPARK-20325: -- It sounds the documentation issue for ... {quote} could we update documentation for Structured Streaming and describe this behavior {quote} {quote} Do we really need to specify the checkpoint dir per query? what the reason for this? finally we will be forced to write some checkpointDir name generator, for example associate it with some particular named query and so on? {quote} I think this question should go to the mailing list.
[jira] [Commented] (SPARK-20346) sum aggregate over empty Dataset gives null
[ https://issues.apache.org/jira/browse/SPARK-20346?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15970248#comment-15970248 ] Hyukjin Kwon commented on SPARK-20346: -- [~jlaskowski], do you mind if I ask the expected output? I thought {{null}} for no input rows makes sense in a way. > sum aggregate over empty Dataset gives null > --- > > Key: SPARK-20346 > URL: https://issues.apache.org/jira/browse/SPARK-20346 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Jacek Laskowski >Priority: Minor > > {code} > scala> spark.range(0).agg(sum("id")).show > +---+ > |sum(id)| > +---+ > | null| > +---+ > scala> spark.range(0).agg(sum("id")).printSchema > root > |-- sum(id): long (nullable = true) > {code}
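For context on why {{null}} "makes sense in a way": standard SQL defines SUM over zero rows as NULL, while COUNT over zero rows is 0. A quick check with Python's built-in sqlite3 (illustrative of the SQL convention, not of Spark itself):

```python
import sqlite3

# SUM over zero rows yields NULL in standard SQL, while COUNT yields 0;
# Spark's sum("id") over an empty Dataset follows the same convention.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER)")  # empty table, no rows inserted
total, cnt = conn.execute("SELECT SUM(id), COUNT(id) FROM t").fetchone()
print(total, cnt)  # None 0
conn.close()
```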
[jira] [Commented] (SPARK-20336) spark.read.csv() with wholeFile=True option fails to read non ASCII unicode characters
[ https://issues.apache.org/jira/browse/SPARK-20336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15970246#comment-15970246 ] Hyukjin Kwon commented on SPARK-20336: -- gentle ping [~priancho], I would resolve this JIRA if you are unable to provide more details because I could not reproduce this. > spark.read.csv() with wholeFile=True option fails to read non ASCII unicode > characters > -- > > Key: SPARK-20336 > URL: https://issues.apache.org/jira/browse/SPARK-20336 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 > Environment: Spark 2.2.0 (master branch is downloaded from Github) > PySpark >Reporter: HanCheol Cho > > I used spark.read.csv() method with wholeFile=True option to load data that > has multi-line records. > However, non-ASCII characters are not properly loaded. > The following is a sample data for test: > {code:none} > col1,col2,col3 > 1,a,text > 2,b,テキスト > 3,c,텍스트 > 4,d,"text > テキスト > 텍스트" > 5,e,last > {code} > When it is loaded without wholeFile=True option, non-ASCII characters are > shown correctly although multi-line records are parsed incorrectly as follows: > {code:none} > testdf_default = spark.read.csv("test.encoding.csv", header=True) > testdf_default.show() > ++++ > |col1|col2|col3| > ++++ > | 1| a|text| > | 2| b|テキスト| > | 3| c| 텍스트| > | 4| d|text| > |テキスト|null|null| > | 텍스트"|null|null| > | 5| e|last| > ++++ > {code} > When wholeFile=True option is used, non-ASCII characters are broken as > follows: > {code:none} > testdf_wholefile = spark.read.csv("test.encoding.csv", header=True, > wholeFile=True) > testdf_wholefile.show() > ++++ > |col1|col2|col3| > ++++ > | 1| a|text| > | 2| b|| > | 3| c| �| > | 4| d|text > ...| > | 5| e|last| > ++++ > {code} > The result is same even if I use encoding="UTF-8" option with wholeFile=True. 
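The parsing the reporter expected from wholeFile=True (quoted fields spanning multiple lines, non-ASCII text intact) is the standard CSV behavior when the stream is decoded as UTF-8 before parsing. A sketch with Python's csv module, using an inline sample modeled on the test data above (this only illustrates the expected semantics, not PySpark's reader):

```python
import csv
import io

# Inline sample modeled on the ticket's test file: one quoted field
# spans three lines and mixes Japanese and Korean text.
data = 'col1,col2,col3\n1,a,text\n4,d,"text\nテキスト\n텍스트"\n5,e,last\n'

# csv handles embedded newlines inside quoted fields as long as the
# input is already decoded text (here, a str wrapped in StringIO).
rows = list(csv.DictReader(io.StringIO(data)))
print(rows[1]["col3"])  # the three-line multi-language value, undamaged
```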
[jira] [Commented] (SPARK-18406) Race between end-of-task and completion iterator read lock release
[ https://issues.apache.org/jira/browse/SPARK-18406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15970235#comment-15970235 ] Josh Rosen commented on SPARK-18406: I can see how allowing user-level code to call setTaskContext() can fix this issue but it's not ideal because it still places the burden on the end users to call the setTaskContext() method in their code. Instead, I think a cleaner fix would be to have the CompletionIterator record the task ID when it's instantiated so that the same task ID can be used even if the completion occurs in a different thread (the idea is to reduce our reliance on thread locals: there are reasons why we couldn't completely remove them (API changes), but there are parts of the internals where we can propagate more efficiently). To move forward here, my suggestion is that we write a failing regression test based on the description provided by [~yxiao], then experiment on my suggested approach of more explicit threading of task ids into closeable objects when they're first created. I'm on vacation this week and won't be able to help with this until Monday, April 24th, so someone else will need to help / review if this is urgent. 
> Race between end-of-task and completion iterator read lock release > -- > > Key: SPARK-18406 > URL: https://issues.apache.org/jira/browse/SPARK-18406 > Project: Spark > Issue Type: Bug > Components: Block Manager, Spark Core >Affects Versions: 2.0.0, 2.0.1 >Reporter: Josh Rosen > > The following log comes from a production streaming job where executors > periodically die due to uncaught exceptions during block release: > {code} > 16/11/07 17:11:06 INFO CoarseGrainedExecutorBackend: Got assigned task 7921 > 16/11/07 17:11:06 INFO Executor: Running task 0.0 in stage 2390.0 (TID 7921) > 16/11/07 17:11:06 INFO CoarseGrainedExecutorBackend: Got assigned task 7922 > 16/11/07 17:11:06 INFO Executor: Running task 1.0 in stage 2390.0 (TID 7922) > 16/11/07 17:11:06 INFO CoarseGrainedExecutorBackend: Got assigned task 7923 > 16/11/07 17:11:06 INFO Executor: Running task 2.0 in stage 2390.0 (TID 7923) > 16/11/07 17:11:06 INFO TorrentBroadcast: Started reading broadcast variable > 2721 > 16/11/07 17:11:06 INFO CoarseGrainedExecutorBackend: Got assigned task 7924 > 16/11/07 17:11:06 INFO Executor: Running task 3.0 in stage 2390.0 (TID 7924) > 16/11/07 17:11:06 INFO MemoryStore: Block broadcast_2721_piece0 stored as > bytes in memory (estimated size 5.0 KB, free 4.9 GB) > 16/11/07 17:11:06 INFO TorrentBroadcast: Reading broadcast variable 2721 took > 3 ms > 16/11/07 17:11:06 INFO MemoryStore: Block broadcast_2721 stored as values in > memory (estimated size 9.4 KB, free 4.9 GB) > 16/11/07 17:11:06 INFO BlockManager: Found block rdd_2741_1 locally > 16/11/07 17:11:06 INFO BlockManager: Found block rdd_2741_3 locally > 16/11/07 17:11:06 INFO BlockManager: Found block rdd_2741_2 locally > 16/11/07 17:11:06 INFO BlockManager: Found block rdd_2741_4 locally > 16/11/07 17:11:06 INFO PythonRunner: Times: total = 2, boot = -566, init = > 567, finish = 1 > 16/11/07 17:11:06 INFO PythonRunner: Times: total = 7, boot = -540, init = > 541, finish = 6 > 16/11/07 17:11:06 INFO Executor: 
Finished task 2.0 in stage 2390.0 (TID > 7923). 1429 bytes result sent to driver > 16/11/07 17:11:06 INFO PythonRunner: Times: total = 8, boot = -532, init = > 533, finish = 7 > 16/11/07 17:11:06 INFO Executor: Finished task 3.0 in stage 2390.0 (TID > 7924). 1429 bytes result sent to driver > 16/11/07 17:11:06 ERROR Executor: Exception in task 0.0 in stage 2390.0 (TID > 7921) > java.lang.AssertionError: assertion failed > at scala.Predef$.assert(Predef.scala:165) > at > org.apache.spark.storage.BlockInfo.checkInvariants(BlockInfoManager.scala:84) > at > org.apache.spark.storage.BlockInfo.readerCount_$eq(BlockInfoManager.scala:66) > at > org.apache.spark.storage.BlockInfoManager$$anonfun$releaseAllLocksForTask$2$$anonfun$apply$2.apply(BlockInfoManager.scala:362) > at > org.apache.spark.storage.BlockInfoManager$$anonfun$releaseAllLocksForTask$2$$anonfun$apply$2.apply(BlockInfoManager.scala:361) > at scala.Option.foreach(Option.scala:236) > at > org.apache.spark.storage.BlockInfoManager$$anonfun$releaseAllLocksForTask$2.apply(BlockInfoManager.scala:361) > at > org.apache.spark.storage.BlockInfoManager$$anonfun$releaseAllLocksForTask$2.apply(BlockInfoManager.scala:356) > at scala.collection.Iterator$class.foreach(Iterator.scala:727) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) > at > org.apache.spark.storage.BlockInfoManager.releaseAllLocksForTask(BlockInfoManager.scala:356) > at > org.apache.spark.storage.BlockManager.releaseAllLocksForTask(B
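Josh Rosen's suggested direction above, recording the task ID when the CompletionIterator is instantiated so that completion on a different thread still releases the right task's locks, might look like the following sketch. This is purely illustrative Python with hypothetical names, not Spark's BlockInfoManager:

```python
import threading

class CompletionIterator:
    """Sketch of the fix idea: snapshot the task ID at construction time,
    instead of re-reading a thread-local when the iterator completes."""
    def __init__(self, it, task_id, on_complete):
        self._it = iter(it)
        self._task_id = task_id          # captured eagerly; immune to thread switches
        self._on_complete = on_complete

    def __iter__(self):
        yield from self._it
        # Runs on whatever thread drains the iterator, but still releases
        # resources for the task that created it.
        self._on_complete(self._task_id)

released = []
ci = CompletionIterator([1, 2, 3], task_id=7921, on_complete=released.append)

consumed = []
# Drain the iterator on a different thread than the one that created it:
t = threading.Thread(target=lambda: consumed.extend(ci))
t.start()
t.join()
print(consumed, released)  # [1, 2, 3] [7921]
```

The point is that the completion callback never consults the current thread's context, so the end-of-task/read-lock-release race described in the log cannot mis-attribute the release.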
[jira] [Resolved] (SPARK-20335) Children expressions of Hive UDF impacts the determinism of Hive UDF
[ https://issues.apache.org/jira/browse/SPARK-20335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-20335. - Resolution: Fixed Fix Version/s: 2.2.0 > Children expressions of Hive UDF impacts the determinism of Hive UDF > > > Key: SPARK-20335 > URL: https://issues.apache.org/jira/browse/SPARK-20335 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.2, 2.1.0 >Reporter: Xiao Li >Assignee: Xiao Li > Fix For: 2.2.0 > > > {noformat} > /** >* Certain optimizations should not be applied if UDF is not deterministic. >* Deterministic UDF returns same result each time it is invoked with a >* particular input. This determinism just needs to hold within the context > of >* a query. >* >* @return true if the UDF is deterministic >*/ > boolean deterministic() default true; > {noformat} > Based on the definition of UDFType, when Hive UDF's children are > non-deterministic, Hive UDF is also non-deterministic.
[jira] [Assigned] (SPARK-20348) Support squared hinge loss (L2 loss) for LinearSVC
[ https://issues.apache.org/jira/browse/SPARK-20348?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20348: Assignee: Apache Spark
[jira] [Assigned] (SPARK-20348) Support squared hinge loss (L2 loss) for LinearSVC
[ https://issues.apache.org/jira/browse/SPARK-20348?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20348: Assignee: (was: Apache Spark)
[jira] [Commented] (SPARK-20348) Support squared hinge loss (L2 loss) for LinearSVC
[ https://issues.apache.org/jira/browse/SPARK-20348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15970217#comment-15970217 ] Apache Spark commented on SPARK-20348: -- User 'hhbyyh' has created a pull request for this issue: https://github.com/apache/spark/pull/17645
[jira] [Created] (SPARK-20348) Support squared hinge loss (L2 loss) for LinearSVC
yuhao yang created SPARK-20348: -- Summary: Support squared hinge loss (L2 loss) for LinearSVC Key: SPARK-20348 URL: https://issues.apache.org/jira/browse/SPARK-20348 Project: Spark Issue Type: New Feature Components: ML Affects Versions: 2.2.0 Reporter: yuhao yang Priority: Minor While Hinge loss is the standard loss function for linear SVM, Squared hinge loss (a.k.a. L2 loss) is also popular in practice. L2-SVM is differentiable and imposes a bigger (quadratic vs. linear) loss for points which violate the margin. Some introduction can be found from http://mccormickml.com/2015/01/06/what-is-an-l2-svm/ Liblinear and [scikit learn|http://scikit-learn.org/stable/modules/generated/sklearn.svm.LinearSVC.html] both offer squared hinge loss as the default loss function for linear SVM.
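The two losses differ only in the exponent on the margin violation. A minimal sketch (plain Python, independent of any ML library); note the quadratic penalty is smaller than the linear one for mild violations but larger for severe ones, which is the "bigger loss for points which violate the margin" behavior described above:

```python
def hinge(margin: float) -> float:
    """Standard hinge loss: linear penalty when margin < 1."""
    return max(0.0, 1.0 - margin)

def squared_hinge(margin: float) -> float:
    """L2 / squared hinge loss: differentiable, quadratic penalty."""
    return max(0.0, 1.0 - margin) ** 2

# margin = y * w.x; a mildly violating point (margin 0.5) is penalized less
# under L2, while a badly violating point (margin -1) is penalized more:
print(hinge(0.5), squared_hinge(0.5))    # 0.5 0.25
print(hinge(-1.0), squared_hinge(-1.0))  # 2.0 4.0
```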
[jira] [Updated] (SPARK-20347) Provide AsyncRDDActions in Python
[ https://issues.apache.org/jira/browse/SPARK-20347?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] holdenk updated SPARK-20347: Shepherd: holdenk
[jira] [Created] (SPARK-20347) Provide AsyncRDDActions in Python
holdenk created SPARK-20347: --- Summary: Provide AsyncRDDActions in Python Key: SPARK-20347 URL: https://issues.apache.org/jira/browse/SPARK-20347 Project: Spark Issue Type: Improvement Components: PySpark Affects Versions: 2.2.0 Reporter: holdenk Priority: Minor In core Spark, AsyncRDDActions allows people to perform non-blocking RDD actions. In Python, where threading is a bit more involved, there could be value in exposing this; the easiest way might involve using the Py4J callback server on the driver.
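The non-blocking shape being proposed (a future that resolves with the action's result while the driver thread keeps working) can be sketched with the standard library. This is a hypothetical stand-in for something like countAsync(), not an actual PySpark API:

```python
from concurrent.futures import ThreadPoolExecutor

# Wrap a blocking action in a Future so the calling (driver) thread
# is not blocked while the action runs.
executor = ThreadPoolExecutor(max_workers=2)

def count_async(blocking_count):
    """Hypothetical async wrapper: submit the blocking action to a pool."""
    return executor.submit(blocking_count)

# Stand-in for rdd.count() -- here just counting a local range:
fut = count_async(lambda: sum(1 for _ in range(1000)))
# ... the driver could launch other jobs or do other work here ...
print(fut.result())  # 1000
executor.shutdown(wait=True)
```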
[jira] [Commented] (SPARK-17729) Enable creating hive bucketed tables
[ https://issues.apache.org/jira/browse/SPARK-17729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15970201#comment-15970201 ] Apache Spark commented on SPARK-17729: -- User 'tejasapatil' has created a pull request for this issue: https://github.com/apache/spark/pull/17644 > Enable creating hive bucketed tables > > > Key: SPARK-17729 > URL: https://issues.apache.org/jira/browse/SPARK-17729 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.0.0 >Reporter: Tejas Patil >Priority: Trivial > > Hive allows inserting data into a bucketed table without guaranteeing > bucketed and sorted-ness, based on these two configs: `hive.enforce.bucketing` and > `hive.enforce.sorting`. > With this jira, Spark still won't produce bucketed data as per Hive's > bucketing guarantees, but will allow writes IFF the user wishes to do so without > caring about bucketing guarantees. The ability to create bucketed tables will > enable adding test cases to Spark while pieces are being added to Spark to have > it support hive bucketing (eg. https://github.com/apache/spark/pull/15229)
[jira] [Updated] (SPARK-20346) sum aggregate over empty Dataset gives null
[ https://issues.apache.org/jira/browse/SPARK-20346?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jacek Laskowski updated SPARK-20346: Description: {code} scala> spark.range(0).agg(sum("id")).show +---+ |sum(id)| +---+ | null| +---+ scala> spark.range(0).agg(sum("id")).printSchema root |-- sum(id): long (nullable = true) {code}
[jira] [Created] (SPARK-20346) sum aggregate over empty Dataset gives null
Jacek Laskowski created SPARK-20346: --- Summary: sum aggregate over empty Dataset gives null Key: SPARK-20346 URL: https://issues.apache.org/jira/browse/SPARK-20346 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.2.0 Reporter: Jacek Laskowski Priority: Minor
[jira] [Updated] (SPARK-20345) Fix STS error handling logic on HiveSQLException
[ https://issues.apache.org/jira/browse/SPARK-20345?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-20345: -- Description: [SPARK-5100|https://github.com/apache/spark/commit/343d3bfafd449a0371feb6a88f78e07302fa7143] added Spark Thrift Server UI and the following logic to handle exceptions on case `Throwable`. {code} HiveThriftServer2.listener.onStatementError( statementId, e.getMessage, SparkUtils.exceptionString(e)) {code} However, there occurred a missed case after implementing [SPARK-6964|https://github.com/apache/spark/commit/eb19d3f75cbd002f7e72ce02017a8de67f562792]'s `Support Cancellation in the Thrift Server` by adding case `HiveSQLException` before case `Throwable`. Logically, we had better add `HiveThriftServer2.listener.onStatementError` on case `HiveSQLException`, too. {code} case e: HiveSQLException => if (getStatus().getState() == OperationState.CANCELED) { return } else { setState(OperationState.ERROR) throw e } // Actually do need to catch Throwable as some failures don't inherit from Exception and // HiveServer will silently swallow them. case e: Throwable => val currentState = getStatus().getState() logError(s"Error executing query, currentState $currentState, ", e) setState(OperationState.ERROR) HiveThriftServer2.listener.onStatementError( statementId, e.getMessage, SparkUtils.exceptionString(e)) throw new HiveSQLException(e.toString) {code} was: [SPARK-5100|https://github.com/apache/spark/commit/343d3bfafd449a0371feb6a88f78e07302fa7143] added Spark Thrift UI and the following logic to handle exceptions like the following on case `Throwable`. 
{code} HiveThriftServer2.listener.onStatementError( statementId, e.getMessage, SparkUtils.exceptionString(e)) {code} However, there occurs a missed case after implementing [SPARK-6964|https://github.com/apache/spark/commit/eb19d3f75cbd002f7e72ce02017a8de67f562792]'s `Support Cancellation in the Thrift Server` by adding case `HiveSQLException` before case `Throwable`. Logically, we had better add `HiveThriftServer2.listener.onStatementError` on case `HiveSQLException`, too. {code} case e: HiveSQLException => if (getStatus().getState() == OperationState.CANCELED) { return } else { setState(OperationState.ERROR) throw e } // Actually do need to catch Throwable as some failures don't inherit from Exception and // HiveServer will silently swallow them. case e: Throwable => val currentState = getStatus().getState() logError(s"Error executing query, currentState $currentState, ", e) setState(OperationState.ERROR) HiveThriftServer2.listener.onStatementError( statementId, e.getMessage, SparkUtils.exceptionString(e)) throw new HiveSQLException(e.toString) {code} > Fix STS error handling logic on HiveSQLException > > > Key: SPARK-20345 > URL: https://issues.apache.org/jira/browse/SPARK-20345 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.3, 2.1.0 >Reporter: Dongjoon Hyun > > [SPARK-5100|https://github.com/apache/spark/commit/343d3bfafd449a0371feb6a88f78e07302fa7143] > added Spark Thrift Server UI and the following logic to handle exceptions on > case `Throwable`. > {code} > HiveThriftServer2.listener.onStatementError( > statementId, e.getMessage, SparkUtils.exceptionString(e)) > {code} > However, there occurred a missed case after implementing > [SPARK-6964|https://github.com/apache/spark/commit/eb19d3f75cbd002f7e72ce02017a8de67f562792]'s > `Support Cancellation in the Thrift Server` by adding case > `HiveSQLException` before case `Throwable`. 
> Logically, we had better add `HiveThriftServer2.listener.onStatementError` on > case `HiveSQLException`, too. > {code} > case e: HiveSQLException => > if (getStatus().getState() == OperationState.CANCELED) { > return > } else { > setState(OperationState.ERROR) > throw e > } > // Actually do need to catch Throwable as some failures don't inherit > from Exception and > // HiveServer will silently swallow them. > case e: Throwable => > val currentState = getStatus().getState() > logError(s"Error executing query, currentState $currentState, ", e) > setState(OperationState.ERROR) > HiveThriftServer2.listener.onStatementError( > statementId, e.getMessage, SparkUtils.exceptionString(e)) > throw new HiveSQLException(e.toString) > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) --
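The shape of the proposed fix can be sketched outside Spark with simplified stand-in types (the class and method names below are hypothetical, not Spark's actual API): notify the listener in the specific-exception branch as well as in the generic `Throwable` branch, so that no error path bypasses it. The real patch must also preserve the early return for the CANCELED state, which this sketch omits.

```java
import java.util.ArrayList;
import java.util.List;

public class StatementErrorDemo {
    // Hypothetical stand-in for HiveSQLException.
    static class SqlException extends RuntimeException {
        SqlException(String msg) { super(msg); }
    }

    // Hypothetical stand-in for HiveThriftServer2.listener.
    static final List<String> reportedErrors = new ArrayList<>();

    static void onStatementError(String statementId, String message) {
        reportedErrors.add(statementId + ": " + message);
    }

    // The point of the issue: report the error to the listener in BOTH catch
    // branches, not only in the generic Throwable one.
    static void runStatement(String statementId, Runnable body) {
        try {
            body.run();
        } catch (SqlException e) {
            onStatementError(statementId, e.getMessage()); // previously missing
            throw e;
        } catch (Throwable e) {
            onStatementError(statementId, e.getMessage());
            throw new SqlException(e.toString());
        }
    }

    public static void main(String[] args) {
        try {
            runStatement("stmt-1", () -> { throw new SqlException("syntax error"); });
        } catch (SqlException ignored) { }
        System.out.println(reportedErrors);
    }
}
```

With the listener call added to the specific branch, a statement that fails with the SQL-level exception still shows up as errored in the UI instead of being silently dropped.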
[jira] [Commented] (SPARK-20345) Fix STS error handling logic on HiveSQLException
[ https://issues.apache.org/jira/browse/SPARK-20345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15970084#comment-15970084 ] Apache Spark commented on SPARK-20345: -- User 'dongjoon-hyun' has created a pull request for this issue: https://github.com/apache/spark/pull/17643 > Fix STS error handling logic on HiveSQLException > > > Key: SPARK-20345 > URL: https://issues.apache.org/jira/browse/SPARK-20345 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.3, 2.1.0 >Reporter: Dongjoon Hyun > > [SPARK-5100|https://github.com/apache/spark/commit/343d3bfafd449a0371feb6a88f78e07302fa7143] > added Spark Thrift UI and the following logic to handle exceptions like the > following on case `Throwable`. > {code} > HiveThriftServer2.listener.onStatementError( > statementId, e.getMessage, SparkUtils.exceptionString(e)) > {code} > However, there occurs a missed case after implementing > [SPARK-6964|https://github.com/apache/spark/commit/eb19d3f75cbd002f7e72ce02017a8de67f562792]'s > `Support Cancellation in the Thrift Server` by adding case > `HiveSQLException` before case `Throwable`. > Logically, we had better add `HiveThriftServer2.listener.onStatementError` on > case `HiveSQLException`, too. > {code} > case e: HiveSQLException => > if (getStatus().getState() == OperationState.CANCELED) { > return > } else { > setState(OperationState.ERROR) > throw e > } > // Actually do need to catch Throwable as some failures don't inherit > from Exception and > // HiveServer will silently swallow them. 
> case e: Throwable => > val currentState = getStatus().getState() > logError(s"Error executing query, currentState $currentState, ", e) > setState(OperationState.ERROR) > HiveThriftServer2.listener.onStatementError( > statementId, e.getMessage, SparkUtils.exceptionString(e)) > throw new HiveSQLException(e.toString) > {code}
[jira] [Assigned] (SPARK-20345) Fix STS error handling logic on HiveSQLException
[ https://issues.apache.org/jira/browse/SPARK-20345?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20345: Assignee: Apache Spark > Fix STS error handling logic on HiveSQLException > > > Key: SPARK-20345 > URL: https://issues.apache.org/jira/browse/SPARK-20345 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.3, 2.1.0 >Reporter: Dongjoon Hyun >Assignee: Apache Spark > > [SPARK-5100|https://github.com/apache/spark/commit/343d3bfafd449a0371feb6a88f78e07302fa7143] > added Spark Thrift UI and the following logic to handle exceptions like the > following on case `Throwable`. > {code} > HiveThriftServer2.listener.onStatementError( > statementId, e.getMessage, SparkUtils.exceptionString(e)) > {code} > However, there occurs a missed case after implementing > [SPARK-6964|https://github.com/apache/spark/commit/eb19d3f75cbd002f7e72ce02017a8de67f562792]'s > `Support Cancellation in the Thrift Server` by adding case > `HiveSQLException` before case `Throwable`. > Logically, we had better add `HiveThriftServer2.listener.onStatementError` on > case `HiveSQLException`, too. > {code} > case e: HiveSQLException => > if (getStatus().getState() == OperationState.CANCELED) { > return > } else { > setState(OperationState.ERROR) > throw e > } > // Actually do need to catch Throwable as some failures don't inherit > from Exception and > // HiveServer will silently swallow them. > case e: Throwable => > val currentState = getStatus().getState() > logError(s"Error executing query, currentState $currentState, ", e) > setState(OperationState.ERROR) > HiveThriftServer2.listener.onStatementError( > statementId, e.getMessage, SparkUtils.exceptionString(e)) > throw new HiveSQLException(e.toString) > {code}
[jira] [Assigned] (SPARK-20345) Fix STS error handling logic on HiveSQLException
[ https://issues.apache.org/jira/browse/SPARK-20345?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20345: Assignee: (was: Apache Spark) > Fix STS error handling logic on HiveSQLException > > > Key: SPARK-20345 > URL: https://issues.apache.org/jira/browse/SPARK-20345 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.3, 2.1.0 >Reporter: Dongjoon Hyun > > [SPARK-5100|https://github.com/apache/spark/commit/343d3bfafd449a0371feb6a88f78e07302fa7143] > added Spark Thrift UI and the following logic to handle exceptions like the > following on case `Throwable`. > {code} > HiveThriftServer2.listener.onStatementError( > statementId, e.getMessage, SparkUtils.exceptionString(e)) > {code} > However, there occurs a missed case after implementing > [SPARK-6964|https://github.com/apache/spark/commit/eb19d3f75cbd002f7e72ce02017a8de67f562792]'s > `Support Cancellation in the Thrift Server` by adding case > `HiveSQLException` before case `Throwable`. > Logically, we had better add `HiveThriftServer2.listener.onStatementError` on > case `HiveSQLException`, too. > {code} > case e: HiveSQLException => > if (getStatus().getState() == OperationState.CANCELED) { > return > } else { > setState(OperationState.ERROR) > throw e > } > // Actually do need to catch Throwable as some failures don't inherit > from Exception and > // HiveServer will silently swallow them. > case e: Throwable => > val currentState = getStatus().getState() > logError(s"Error executing query, currentState $currentState, ", e) > setState(OperationState.ERROR) > HiveThriftServer2.listener.onStatementError( > statementId, e.getMessage, SparkUtils.exceptionString(e)) > throw new HiveSQLException(e.toString) > {code}
[jira] [Created] (SPARK-20345) Fix STS error handling logic on HiveSQLException
Dongjoon Hyun created SPARK-20345: - Summary: Fix STS error handling logic on HiveSQLException Key: SPARK-20345 URL: https://issues.apache.org/jira/browse/SPARK-20345 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.1.0, 1.6.3 Reporter: Dongjoon Hyun [SPARK-5100|https://github.com/apache/spark/commit/343d3bfafd449a0371feb6a88f78e07302fa7143] added Spark Thrift UI and the following logic to handle exceptions like the following on case `Throwable`. {code} HiveThriftServer2.listener.onStatementError( statementId, e.getMessage, SparkUtils.exceptionString(e)) {code} However, there occurs a missed case after implementing [SPARK-6964|https://github.com/apache/spark/commit/eb19d3f75cbd002f7e72ce02017a8de67f562792]'s `Support Cancellation in the Thrift Server` by adding case `HiveSQLException` before case `Throwable`. Logically, we had better add `HiveThriftServer2.listener.onStatementError` on case `HiveSQLException`, too. {code} case e: HiveSQLException => if (getStatus().getState() == OperationState.CANCELED) { return } else { setState(OperationState.ERROR) throw e } // Actually do need to catch Throwable as some failures don't inherit from Exception and // HiveServer will silently swallow them. case e: Throwable => val currentState = getStatus().getState() logError(s"Error executing query, currentState $currentState, ", e) setState(OperationState.ERROR) HiveThriftServer2.listener.onStatementError( statementId, e.getMessage, SparkUtils.exceptionString(e)) throw new HiveSQLException(e.toString) {code}
[jira] [Commented] (SPARK-20299) NullPointerException when null and string are in a tuple while encoding Dataset
[ https://issues.apache.org/jira/browse/SPARK-20299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15969996#comment-15969996 ] Jacek Laskowski commented on SPARK-20299: - It does work for 2.1. It does not for 2.2.0-SNAPSHOT. Steps to reproduce: 1. Download the nightly build from http://people.apache.org/~pwendell/spark-nightly/spark-master-bin/latest/ (used {{spark-2.2.0-SNAPSHOT-bin-hadoop2.7.tgz}} from 2017-04-15 08:16) {code} ➜ spark-2.2.0-SNAPSHOT-bin-hadoop2.7 ./bin/spark-submit --version Welcome to __ / __/__ ___ _/ /__ _\ \/ _ \/ _ `/ __/ '_/ /___/ .__/\_,_/_/ /_/\_\ version 2.2.0-SNAPSHOT /_/ Using Scala version 2.11.8, Java HotSpot(TM) 64-Bit Server VM, 1.8.0_121 Branch HEAD Compiled by user jenkins on 2017-04-15T08:05:06Z Revision fb036c4413c2cd4d90880d080f418ec468d6c0fc Url https://github.com/apache/spark.git Type --help for more information. {code} 2. Execute the following and you'll *surely* see the exception: {code} scala> Seq(("1", null.asInstanceOf[Int]), ("2", 1)).toDS java.lang.RuntimeException: Error while encoding: java.lang.NullPointerException staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, assertnotnull(assertnotnull(input[0, scala.Tuple2, true]))._1, true) AS _1#0 assertnotnull(assertnotnull(input[0, scala.Tuple2, true]))._2 AS _2#1 at org.apache.spark.sql.catalyst.encoders.ExpressionEncoder.toRow(ExpressionEncoder.scala:290) at org.apache.spark.sql.SparkSession$$anonfun$2.apply(SparkSession.scala:454) at org.apache.spark.sql.SparkSession$$anonfun$2.apply(SparkSession.scala:454) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) at scala.collection.immutable.List.foreach(List.scala:381) at scala.collection.TraversableLike$class.map(TraversableLike.scala:234) at scala.collection.immutable.List.map(List.scala:285) at 
org.apache.spark.sql.SparkSession.createDataset(SparkSession.scala:454) at org.apache.spark.sql.SQLContext.createDataset(SQLContext.scala:377) at org.apache.spark.sql.SQLImplicits.localSeqToDatasetHolder(SQLImplicits.scala:246) ... 48 elided Caused by: java.lang.NullPointerException at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply_1$(Unknown Source) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown Source) at org.apache.spark.sql.catalyst.encoders.ExpressionEncoder.toRow(ExpressionEncoder.scala:287) ... 58 more {code} > NullPointerException when null and string are in a tuple while encoding > Dataset > --- > > Key: SPARK-20299 > URL: https://issues.apache.org/jira/browse/SPARK-20299 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Jacek Laskowski >Priority: Minor > > When creating a Dataset from a tuple with {{null}} and a string, NPE is > reported. When either is removed, it works fine. 
> {code} > scala> Seq((1, null.asInstanceOf[Int]), (2, 1)).toDS > res43: org.apache.spark.sql.Dataset[(Int, Int)] = [_1: int, _2: int] > scala> Seq(("1", null.asInstanceOf[Int]), ("2", 1)).toDS > java.lang.RuntimeException: Error while encoding: > java.lang.NullPointerException > staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, > fromString, assertnotnull(assertnotnull(input[0, scala.Tuple2, true], top > level Product input object), - root class: "scala.Tuple2")._1, true) AS _1#474 > assertnotnull(assertnotnull(input[0, scala.Tuple2, true], top level Product > input object), - root class: "scala.Tuple2")._2 AS _2#475 > at > org.apache.spark.sql.catalyst.encoders.ExpressionEncoder.toRow(ExpressionEncoder.scala:290) > at > org.apache.spark.sql.SparkSession$$anonfun$2.apply(SparkSession.scala:454) > at > org.apache.spark.sql.SparkSession$$anonfun$2.apply(SparkSession.scala:454) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at scala.collection.immutable.List.foreach(List.scala:381) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:234) > at scala.collection.immutable.List.map(List.scala:285) > at org.apache.spark.sql.SparkSession.createDataset(SparkSession.scala:454) > at org.apache.spark.sql.SQLContext.createDataset(SQLContext.scala:377) > at > org.apache.spark.sql.SQLImplicits.localSeqToDatasetHolder(SQLImplicits.scala:246) > ... 48 elided > Caused by: java.lang.NullPointerException > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$Sp
[jira] [Updated] (SPARK-20299) NullPointerException when null and string are in a tuple while encoding Dataset
[ https://issues.apache.org/jira/browse/SPARK-20299?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jacek Laskowski updated SPARK-20299: Description: When creating a Dataset from a tuple with {{null}} and a string, NPE is reported. When either is removed, it works fine. {code} scala> Seq((1, null.asInstanceOf[Int]), (2, 1)).toDS res43: org.apache.spark.sql.Dataset[(Int, Int)] = [_1: int, _2: int] scala> Seq(("1", null.asInstanceOf[Int]), ("2", 1)).toDS java.lang.RuntimeException: Error while encoding: java.lang.NullPointerException staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, assertnotnull(assertnotnull(input[0, scala.Tuple2, true], top level Product input object), - root class: "scala.Tuple2")._1, true) AS _1#474 assertnotnull(assertnotnull(input[0, scala.Tuple2, true], top level Product input object), - root class: "scala.Tuple2")._2 AS _2#475 at org.apache.spark.sql.catalyst.encoders.ExpressionEncoder.toRow(ExpressionEncoder.scala:290) at org.apache.spark.sql.SparkSession$$anonfun$2.apply(SparkSession.scala:454) at org.apache.spark.sql.SparkSession$$anonfun$2.apply(SparkSession.scala:454) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) at scala.collection.immutable.List.foreach(List.scala:381) at scala.collection.TraversableLike$class.map(TraversableLike.scala:234) at scala.collection.immutable.List.map(List.scala:285) at org.apache.spark.sql.SparkSession.createDataset(SparkSession.scala:454) at org.apache.spark.sql.SQLContext.createDataset(SQLContext.scala:377) at org.apache.spark.sql.SQLImplicits.localSeqToDatasetHolder(SQLImplicits.scala:246) ... 
48 elided Caused by: java.lang.NullPointerException at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply_1$(Unknown Source) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown Source) at org.apache.spark.sql.catalyst.encoders.ExpressionEncoder.toRow(ExpressionEncoder.scala:287) ... 58 more {code} was: When creating a Dataset from a tuple with {{null}} and a string, NPE is reported. When either is removed, it works fine. {code} scala> Seq((1, null.asInstanceOf[Int]), (2, 1)).toDS res43: org.apache.spark.sql.Dataset[(Int, Int)] = [_1: int, _2: int] scala> Seq(("1", null.asInstanceOf[Int]), ("2", 1)).toDS scala> Seq(("1", null.asInstanceOf[Int]), ("2", 1)).toDS java.lang.RuntimeException: Error while encoding: java.lang.NullPointerException staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, assertnotnull(assertnotnull(input[0, scala.Tuple2, true], top level Product input object), - root class: "scala.Tuple2")._1, true) AS _1#474 assertnotnull(assertnotnull(input[0, scala.Tuple2, true], top level Product input object), - root class: "scala.Tuple2")._2 AS _2#475 at org.apache.spark.sql.catalyst.encoders.ExpressionEncoder.toRow(ExpressionEncoder.scala:290) at org.apache.spark.sql.SparkSession$$anonfun$2.apply(SparkSession.scala:454) at org.apache.spark.sql.SparkSession$$anonfun$2.apply(SparkSession.scala:454) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) at scala.collection.immutable.List.foreach(List.scala:381) at scala.collection.TraversableLike$class.map(TraversableLike.scala:234) at scala.collection.immutable.List.map(List.scala:285) at org.apache.spark.sql.SparkSession.createDataset(SparkSession.scala:454) at org.apache.spark.sql.SQLContext.createDataset(SQLContext.scala:377) at 
org.apache.spark.sql.SQLImplicits.localSeqToDatasetHolder(SQLImplicits.scala:246) ... 48 elided Caused by: java.lang.NullPointerException at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply_1$(Unknown Source) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown Source) at org.apache.spark.sql.catalyst.encoders.ExpressionEncoder.toRow(ExpressionEncoder.scala:287) ... 58 more {code} > NullPointerException when null and string are in a tuple while encoding > Dataset > --- > > Key: SPARK-20299 > URL: https://issues.apache.org/jira/browse/SPARK-20299 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Jacek Laskowski >Priority: Minor > > When creating a Dataset from a tuple with {{null}} and a string, NPE is > reported. When either is removed, it works fine. > {code} > scala> Seq((1, null.asInstanceOf[Int]), (2, 1)).toDS > res43: org.apache.spar
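The `NullPointerException` in the stack trace above surfaces inside generated projection code when it reads a primitive out of a value that is actually null. As a loose, plain-Java illustration of that failure mode (this is not Spark's generated code, just the auto-unboxing mechanism it trips over):

```java
public class NullUnboxDemo {
    // Auto-unboxing a null wrapper to a primitive throws NullPointerException,
    // which is the kind of primitive read the failing projection performs.
    static int unbox(Integer boxed) {
        return boxed; // NPE when boxed == null
    }

    static boolean throwsNpe(Integer boxed) {
        try {
            unbox(boxed);
            return false;
        } catch (NullPointerException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println(throwsNpe(1));    // false: a real value unboxes fine
        System.out.println(throwsNpe(null)); // true: null cannot be unboxed to int
    }
}
```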
[jira] [Commented] (SPARK-20344) Duplicate call in FairSchedulableBuilder.addTaskSetManager
[ https://issues.apache.org/jira/browse/SPARK-20344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15969946#comment-15969946 ] Sean Owen commented on SPARK-20344: --- We use pull requests -- http://spark.apache.org/contributing.html That change looks a little more complex than needed. I think the only thing that's needed is to avoid the redundant assignments in the first part where the pool is obtained, and then proceed as before. > Duplicate call in FairSchedulableBuilder.addTaskSetManager > -- > > Key: SPARK-20344 > URL: https://issues.apache.org/jira/browse/SPARK-20344 > Project: Spark > Issue Type: Improvement > Components: Scheduler >Affects Versions: 2.1.0 >Reporter: Robert Stupp >Priority: Trivial > > {{org.apache.spark.scheduler.FairSchedulableBuilder#addTaskSetManager}} > contains the code snippet: > {code} > override def addTaskSetManager(manager: Schedulable, properties: > Properties) { > var poolName = DEFAULT_POOL_NAME > var parentPool = rootPool.getSchedulableByName(poolName) > if (properties != null) { > poolName = properties.getProperty(FAIR_SCHEDULER_PROPERTIES, > DEFAULT_POOL_NAME) > parentPool = rootPool.getSchedulableByName(poolName) > if (parentPool == null) { > {code} > {{parentPool = rootPool.getSchedulableByName(poolName)}} is called twice if > {{properties != null}}. > I'm not sure whether this is an oversight or there's something else missing. > This piece of the code hasn't been modified since 2013, so I doubt that this > is a serious issue.
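The simplification suggested in the comment above — resolve the pool name first, then look the pool up exactly once — can be sketched with a plain map standing in for `rootPool` (the helper names here are hypothetical, not the actual Spark code):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;

public class PoolLookupDemo {
    static final String DEFAULT_POOL_NAME = "default";
    static final String FAIR_SCHEDULER_PROPERTIES = "spark.scheduler.pool";

    // Simplified stand-in for rootPool: pool name -> pool object.
    static final Map<String, Object> rootPool = new HashMap<>();

    // Decide the pool name first, instead of looking up the default pool and
    // immediately overwriting the result when properties are present.
    static String resolvePoolName(Properties properties) {
        return (properties != null)
            ? properties.getProperty(FAIR_SCHEDULER_PROPERTIES, DEFAULT_POOL_NAME)
            : DEFAULT_POOL_NAME;
    }

    // A single lookup replaces the duplicated getSchedulableByName call.
    static Object lookupParentPool(Properties properties) {
        return rootPool.get(resolvePoolName(properties));
    }

    public static void main(String[] args) {
        rootPool.put("default", "defaultPool");
        rootPool.put("etl", "etlPool");
        Properties props = new Properties();
        props.setProperty(FAIR_SCHEDULER_PROPERTIES, "etl");
        System.out.println(lookupParentPool(null));  // defaultPool
        System.out.println(lookupParentPool(props)); // etlPool
    }
}
```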
[jira] [Commented] (SPARK-20286) dynamicAllocation.executorIdleTimeout is ignored after unpersist
[ https://issues.apache.org/jira/browse/SPARK-20286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15969944#comment-15969944 ] Miguel Pérez commented on SPARK-20286: -- My supposition is that {{onExecutorIdle}} is only called when a task ends, so it's already idle when you call {{unpersist}}. I'm not sure how to test this though. Also, it would be great if the UI could show an "idle" status for the executors. Currently, they're shown as "Active" until they're killed and then shown as "Dead". > dynamicAllocation.executorIdleTimeout is ignored after unpersist > > > Key: SPARK-20286 > URL: https://issues.apache.org/jira/browse/SPARK-20286 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.0.1 >Reporter: Miguel Pérez > > With dynamic allocation enabled, it seems that executors with cached data > which are unpersisted are still being killed using the > {{dynamicAllocation.cachedExecutorIdleTimeout}} configuration, instead of > {{dynamicAllocation.executorIdleTimeout}}. Assuming the default configuration > ({{dynamicAllocation.cachedExecutorIdleTimeout = Infinity}}), an executor > with unpersisted data won't be released until the job ends. > *How to reproduce* > - Set different values for {{dynamicAllocation.executorIdleTimeout}} and > {{dynamicAllocation.cachedExecutorIdleTimeout}} > - Load a file into a RDD and persist it > - Execute an action on the RDD (like a count) so some executors are activated. > - When the action has finished, unpersist the RDD > - The application UI removes correctly the persisted data from the *Storage* > tab, but if you look in the *Executors* tab, you will find that the executors > remain *active* until ({{dynamicAllocation.cachedExecutorIdleTimeout}} is > reached.
[jira] [Resolved] (SPARK-20339) Issue in regex_replace in Apache Spark Java
[ https://issues.apache.org/jira/browse/SPARK-20339?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-20339. --- Resolution: Invalid (No need to paste that much redundant code.) If it's a question it should go to u...@spark.apache.org. For such a huge sequence of generated columns you are probably much better off constructing a Row directly in a transformation in one go instead of calling withColumn hundreds of times. Or else disable code gen. > Issue in regex_replace in Apache Spark Java > --- > > Key: SPARK-20339 > URL: https://issues.apache.org/jira/browse/SPARK-20339 > Project: Spark > Issue Type: Question > Components: Java API, Spark Core, SQL >Affects Versions: 2.1.0 >Reporter: Nischay > > We are currently facing couple of issues > 1. > "org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator" > grows beyond 64 KB". > 2. "java.lang.StackOverflowError" > The first issue is reported as a Major bug in Jira of Apache spark > https://issues.apache.org/jira/browse/SPARK-18492 > We got these issues by the following program. We are trying to replace the > Manufacturer name by its equivalent alternate name, > These issues occur only when we have Huge number of alternate names to > replace, for small number of replacements it works with no issues. > dataFileContent=dataFileContent.withColumn("ManufacturerSource", > regexp_replace(col("ManufacturerSource"),str,manufacturerNames.get(str).toString()));` > Kindly suggest us an alternative method or a solution to go around this > problem.
> {code} > Hashtable manufacturerNames = new Hashtable(); > Enumeration names; > String str; > double bal; > manufacturerNames.put("Allen","Apex Tool Group"); > manufacturerNames.put("Armstrong","Apex Tool Group"); > manufacturerNames.put("Campbell","Apex Tool Group"); > manufacturerNames.put("Lubriplate","Apex Tool Group"); > manufacturerNames.put("Delta","Apex Tool Group"); > manufacturerNames.put("Gearwrench","Apex Tool Group"); > manufacturerNames.put("H.K. Porter","Apex Tool > Group"); > manufacturerNames.put("Jacobs","Apex Tool Group"); > manufacturerNames.put("Jobox","Apex Tool Group"); > ...about 100 more ... > manufacturerNames.put("Standard Safety","Standard > Safety Equipment Company"); > manufacturerNames.put("Standard Safety","Standard > Safety Equipment Company"); > // Show all balances in hash table. > names = manufacturerNames.keys(); > Dataset dataFileContent = > sqlContext.load("com.databricks.spark.csv", options); > > > while(names.hasMoreElements()) { >str = (String) names.nextElement(); > > dataFileContent=dataFileContent.withColumn("ManufacturerSource", > regexp_replace(col("ManufacturerSource"),str,manufacturerNames.get(str).toString())); > } > dataFileContent.show(); > {code}
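One way to follow the advice above is to fold every replacement into a single function applied once per value (for example from one UDF), rather than stacking hundreds of `withColumn`/`regexp_replace` calls that each grow the generated code. This sketch uses literal `String.replace`, which also avoids accidentally treating manufacturer names as regular expressions; the map is a hypothetical slice of the issue's table:

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class ManufacturerNormalizer {
    // A small, illustrative slice of the alternate-name table from the issue.
    static final Map<String, String> MANUFACTURER_NAMES = new LinkedHashMap<>();
    static {
        MANUFACTURER_NAMES.put("Allen", "Apex Tool Group");
        MANUFACTURER_NAMES.put("Armstrong", "Apex Tool Group");
        MANUFACTURER_NAMES.put("Standard Safety", "Standard Safety Equipment Company");
    }

    // Apply all replacements in one pass over a single value. Wrapped in one
    // UDF, the column is only touched once, however many names are mapped.
    static String normalize(String source) {
        for (Map.Entry<String, String> e : MANUFACTURER_NAMES.entrySet()) {
            source = source.replace(e.getKey(), e.getValue());
        }
        return source;
    }

    public static void main(String[] args) {
        System.out.println(normalize("Allen 3mm hex key"));
        // -> Apex Tool Group 3mm hex key
    }
}
```

This keeps the query plan's size independent of the number of alternate names, which sidesteps both the 64 KB generated-code limit and the deep-plan stack overflow described above.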
[jira] [Updated] (SPARK-20339) Issue in regex_replace in Apache Spark Java
[ https://issues.apache.org/jira/browse/SPARK-20339?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-20339: -- Description: We are currently facing couple of issues 1. "org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator" grows beyond 64 KB". 2. "java.lang.StackOverflowError" The first issue is reported as a Major bug in Jira of Apache spark https://issues.apache.org/jira/browse/SPARK-18492 We got these issues by the following program. We are trying to replace the Manufacturer name by its equivalent alternate name, These issues occur only when we have Huge number of alternate names to replace, for small number of replacements it works with no issues. dataFileContent=dataFileContent.withColumn("ManufacturerSource", regexp_replace(col("ManufacturerSource"),str,manufacturerNames.get(str).toString()));` Kindly suggest us an alternative method or a solution to go around this problem. {code} Hashtable manufacturerNames = new Hashtable(); Enumeration names; String str; double bal; manufacturerNames.put("Allen","Apex Tool Group"); manufacturerNames.put("Armstrong","Apex Tool Group"); manufacturerNames.put("Campbell","Apex Tool Group"); manufacturerNames.put("Lubriplate","Apex Tool Group"); manufacturerNames.put("Delta","Apex Tool Group"); manufacturerNames.put("Gearwrench","Apex Tool Group"); manufacturerNames.put("H.K. Porter","Apex Tool Group"); manufacturerNames.put("Jacobs","Apex Tool Group"); manufacturerNames.put("Jobox","Apex Tool Group"); ...about 100 more ... manufacturerNames.put("Standard Safety","Standard Safety Equipment Company"); manufacturerNames.put("Standard Safety","Standard Safety Equipment Company"); // Show all balances in hash table. 
names = manufacturerNames.keys(); Dataset dataFileContent = sqlContext.load("com.databricks.spark.csv", options); while(names.hasMoreElements()) { str = (String) names.nextElement(); dataFileContent=dataFileContent.withColumn("ManufacturerSource", regexp_replace(col("ManufacturerSource"),str,manufacturerNames.get(str).toString())); } dataFileContent.show(); {code} was: We are currently facing couple of issues 1. "org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator" grows beyond 64 KB". 2. "java.lang.StackOverflowError" The first issue is reported as a Major bug in Jira of Apache spark https://issues.apache.org/jira/browse/SPARK-18492 We got these issues by the following program. We are trying to replace the Manufacturer name by its equivalent alternate name, These issues occur only when we have Huge number of alternate names to replace, for small number of replacements it works with no issues. dataFileContent=dataFileContent.withColumn("ManufacturerSource", regexp_replace(col("ManufacturerSource"),str,manufacturerNames.get(str).toString()));` Kindly suggest us an alternative method or a solution to go around this problem. Hashtable manufacturerNames = new Hashtable(); Enumeration names; String str; double bal; manufacturerNames.put("Allen","Apex Tool Group"); manufacturerNames.put("Armstrong","Apex Tool Group"); manufacturerNames.put("Campbell","Apex Tool Group"); manufacturerNames.put("Lubriplate","Apex Tool Group"); manufacturerNames.put("Delta","Apex Tool Group"); manufacturerNames.put("Gearwrench","Apex Tool Group"); manufacturerNames.put("H.K. 
Porter","Apex Tool Group"); manufacturerNames.put("Jacobs","Apex Tool Group"); manufacturerNames.put("Jobox","Apex Tool Group"); manufacturerNames.put("Lufkin","Apex Tool Group"); manufacturerNames.put("Nicholson","Apex Tool Group"); manufacturerNames.put("Plumb","Apex Tool Group"); manufacturerNames.put("Wiss","Apex Tool Group"); manufacturerNames.put("Covert","Apex Tool Group"); manufacturerNames.put("Apex-Geta"
[jira] [Commented] (SPARK-20299) NullPointerException when null and string are in a tuple while encoding Dataset
[ https://issues.apache.org/jira/browse/SPARK-20299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15969933#comment-15969933 ] Umesh Chaudhary commented on SPARK-20299: - [~jlaskowski] your last two lines in the repro steps are the same. I tried different values in the tuple to get the NPE but was not able to reproduce it. Could you please share the exact steps to reproduce this issue? > NullPointerException when null and string are in a tuple while encoding > Dataset > --- > > Key: SPARK-20299 > URL: https://issues.apache.org/jira/browse/SPARK-20299 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Jacek Laskowski >Priority: Minor > > When creating a Dataset from a tuple with {{null}} and a string, NPE is > reported. When either is removed, it works fine. > {code} > scala> Seq((1, null.asInstanceOf[Int]), (2, 1)).toDS > res43: org.apache.spark.sql.Dataset[(Int, Int)] = [_1: int, _2: int] > scala> Seq(("1", null.asInstanceOf[Int]), ("2", 1)).toDS > scala> Seq(("1", null.asInstanceOf[Int]), ("2", 1)).toDS > java.lang.RuntimeException: Error while encoding: > java.lang.NullPointerException > staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, > fromString, assertnotnull(assertnotnull(input[0, scala.Tuple2, true], top > level Product input object), - root class: "scala.Tuple2")._1, true) AS _1#474 > assertnotnull(assertnotnull(input[0, scala.Tuple2, true], top level Product > input object), - root class: "scala.Tuple2")._2 AS _2#475 > at > org.apache.spark.sql.catalyst.encoders.ExpressionEncoder.toRow(ExpressionEncoder.scala:290) > at > org.apache.spark.sql.SparkSession$$anonfun$2.apply(SparkSession.scala:454) > at > org.apache.spark.sql.SparkSession$$anonfun$2.apply(SparkSession.scala:454) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at 
scala.collection.immutable.List.foreach(List.scala:381) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:234) > at scala.collection.immutable.List.map(List.scala:285) > at org.apache.spark.sql.SparkSession.createDataset(SparkSession.scala:454) > at org.apache.spark.sql.SQLContext.createDataset(SQLContext.scala:377) > at > org.apache.spark.sql.SQLImplicits.localSeqToDatasetHolder(SQLImplicits.scala:246) > ... 48 elided > Caused by: java.lang.NullPointerException > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply_1$(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown > Source) > at > org.apache.spark.sql.catalyst.encoders.ExpressionEncoder.toRow(ExpressionEncoder.scala:287) > ... 58 more > {code}
[jira] [Commented] (SPARK-20286) dynamicAllocation.executorIdleTimeout is ignored after unpersist
[ https://issues.apache.org/jira/browse/SPARK-20286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15969924#comment-15969924 ] Umesh Chaudhary commented on SPARK-20286: - Looking at ExecutorAllocationManager.onExecutorIdle, there is a condition that checks whether the executor has cached blocks: if it does, the manager uses cachedExecutorIdleTimeoutS; otherwise it uses executorIdleTimeoutS. I'm still not sure why the executor keeps the cached timeout even after unpersist is called. One possibility: there may be some cached data on the executors that is not reported to the BlockManager, causing the executor to follow cachedExecutorIdleTimeout instead of executorIdleTimeout. Thoughts welcome. cc: [~joshrosen], [~rxin]. > dynamicAllocation.executorIdleTimeout is ignored after unpersist > > > Key: SPARK-20286 > URL: https://issues.apache.org/jira/browse/SPARK-20286 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.0.1 >Reporter: Miguel Pérez > > With dynamic allocation enabled, it seems that executors with cached data > which are unpersisted are still being killed using the > {{dynamicAllocation.cachedExecutorIdleTimeout}} configuration, instead of > {{dynamicAllocation.executorIdleTimeout}}. Assuming the default configuration > ({{dynamicAllocation.cachedExecutorIdleTimeout = Infinity}}), an executor > with unpersisted data won't be released until the job ends. > *How to reproduce* > - Set different values for {{dynamicAllocation.executorIdleTimeout}} and > {{dynamicAllocation.cachedExecutorIdleTimeout}} > - Load a file into a RDD and persist it > - Execute an action on the RDD (like a count) so some executors are activated.
> - When the action has finished, unpersist the RDD > - The application UI correctly removes the persisted data from the *Storage* > tab, but if you look at the *Executors* tab, you will find that the executors > remain *active* until {{dynamicAllocation.cachedExecutorIdleTimeout}} is > reached.
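For reference, the two timeouts from the repro steps can be set far apart in {{spark-defaults.conf}} so the behavior is easy to observe (the values below are illustrative, not taken from the report):

```
spark.dynamicAllocation.enabled                    true
spark.shuffle.service.enabled                      true
spark.dynamicAllocation.executorIdleTimeout        60s
spark.dynamicAllocation.cachedExecutorIdleTimeout  600s
```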
[jira] [Commented] (SPARK-20344) Duplicate call in FairSchedulableBuilder.addTaskSetManager
[ https://issues.apache.org/jira/browse/SPARK-20344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15969901#comment-15969901 ] Robert Stupp commented on SPARK-20344: -- Just saw it's a duplicate. Not a serious thing - just unnecessary. I've set up a branch [on GitHub here|https://github.com/apache/spark/compare/master...snazy:20344-dup-call-master?expand=1] that rearranges the calls. Not sure whether you accept pull requests against the ASF mirror. > Duplicate call in FairSchedulableBuilder.addTaskSetManager > -- > > Key: SPARK-20344 > URL: https://issues.apache.org/jira/browse/SPARK-20344 > Project: Spark > Issue Type: Improvement > Components: Scheduler >Affects Versions: 2.1.0 >Reporter: Robert Stupp >Priority: Trivial > > {{org.apache.spark.scheduler.FairSchedulableBuilder#addTaskSetManager}} > contains the code snippet: > {code} > override def addTaskSetManager(manager: Schedulable, properties: > Properties) { > var poolName = DEFAULT_POOL_NAME > var parentPool = rootPool.getSchedulableByName(poolName) > if (properties != null) { > poolName = properties.getProperty(FAIR_SCHEDULER_PROPERTIES, > DEFAULT_POOL_NAME) > parentPool = rootPool.getSchedulableByName(poolName) > if (parentPool == null) { > {code} > {{parentPool = rootPool.getSchedulableByName(poolName)}} is called twice if > {{properties != null}}. > I'm not sure whether this is an oversight or there's something else missing. > This piece of the code hasn't been modified since 2013, so I doubt that this > is a serious issue.
[jira] [Updated] (SPARK-20344) Duplicate call in FairSchedulableBuilder.addTaskSetManager
[ https://issues.apache.org/jira/browse/SPARK-20344?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-20344: -- Priority: Trivial (was: Minor) Does it cause any problem? Yes, you could probably rearrange this anyway to avoid the duplication, but it's not really worth a JIRA. > Duplicate call in FairSchedulableBuilder.addTaskSetManager > -- > > Key: SPARK-20344 > URL: https://issues.apache.org/jira/browse/SPARK-20344 > Project: Spark > Issue Type: Improvement > Components: Scheduler >Affects Versions: 2.1.0 >Reporter: Robert Stupp >Priority: Trivial > > {{org.apache.spark.scheduler.FairSchedulableBuilder#addTaskSetManager}} > contains the code snippet: > {code} > override def addTaskSetManager(manager: Schedulable, properties: > Properties) { > var poolName = DEFAULT_POOL_NAME > var parentPool = rootPool.getSchedulableByName(poolName) > if (properties != null) { > poolName = properties.getProperty(FAIR_SCHEDULER_PROPERTIES, > DEFAULT_POOL_NAME) > parentPool = rootPool.getSchedulableByName(poolName) > if (parentPool == null) { > {code} > {{parentPool = rootPool.getSchedulableByName(poolName)}} is called twice if > {{properties != null}}. > I'm not sure whether this is an oversight or there's something else missing. > This piece of the code hasn't been modified since 2013, so I doubt that this > is a serious issue.
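One way to rearrange the method so {{getSchedulableByName}} is called exactly once (a sketch of the idea only, not the contents of the linked branch): resolve the pool name first, then look it up.

```scala
// Sketch: compute poolName up front, then perform a single lookup.
override def addTaskSetManager(manager: Schedulable, properties: Properties) {
  val poolName = if (properties != null) {
    properties.getProperty(FAIR_SCHEDULER_PROPERTIES, DEFAULT_POOL_NAME)
  } else {
    DEFAULT_POOL_NAME
  }
  var parentPool = rootPool.getSchedulableByName(poolName)
  if (parentPool == null) {
    // create the pool on demand, as the original method does
  }
  // ... rest of the original method unchanged
}
```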
[jira] [Assigned] (SPARK-20316) In SparkSQLCLIDriver, val and var should strictly follow the Scala syntax
[ https://issues.apache.org/jira/browse/SPARK-20316?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen reassigned SPARK-20316: - Assignee: Xiaochen Ouyang > In SparkSQLCLIDriver, val and var should strictly follow the Scala syntax > - > > Key: SPARK-20316 > URL: https://issues.apache.org/jira/browse/SPARK-20316 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.1.0 > Environment: Spark2.1.0 >Reporter: Xiaochen Ouyang >Assignee: Xiaochen Ouyang >Priority: Trivial > Fix For: 2.2.0 > > > In SparkSQLCLIDriver, val and var should strictly follow the Scala syntax. > private var prompt = "spark-sql" > private var continuedPrompt = "".padTo(prompt.length, ' ') > If there is no place where the variable is changed, we should declare it with > 'val'; otherwise 'var'.
[jira] [Resolved] (SPARK-20316) In SparkSQLCLIDriver, val and var should strictly follow the Scala syntax
[ https://issues.apache.org/jira/browse/SPARK-20316?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-20316. --- Resolution: Fixed Fix Version/s: 2.2.0 Issue resolved by pull request 17628 [https://github.com/apache/spark/pull/17628] > In SparkSQLCLIDriver, val and var should strictly follow the Scala syntax > - > > Key: SPARK-20316 > URL: https://issues.apache.org/jira/browse/SPARK-20316 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.1.0 > Environment: Spark2.1.0 >Reporter: Xiaochen Ouyang >Priority: Trivial > Fix For: 2.2.0 > > > In SparkSQLCLIDriver, val and var should strictly follow the Scala syntax. > private var prompt = "spark-sql" > private var continuedPrompt = "".padTo(prompt.length, ' ') > If there is no place where the variable is changed, we should declare it with > 'val'; otherwise 'var'.
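A minimal illustration of the guideline (my own example, not taken from the patch): a variable that is never reassigned should be declared with {{val}}, which lets the compiler enforce immutability.

```scala
// Never reassigned, so declare with val; reassignment becomes a compile error.
val prompt = "spark-sql"
val continuedPrompt = "".padTo(prompt.length, ' ')
// prompt = "other"  // would not compile: reassignment to val
```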
[jira] [Resolved] (SPARK-7674) R-like stats for ML models
[ https://issues.apache.org/jira/browse/SPARK-7674?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-7674. -- Resolution: Done > R-like stats for ML models > -- > > Key: SPARK-7674 > URL: https://issues.apache.org/jira/browse/SPARK-7674 > Project: Spark > Issue Type: New Feature > Components: ML >Reporter: Joseph K. Bradley >Assignee: Joseph K. Bradley >Priority: Critical > > This is an umbrella JIRA for supporting ML model summaries and statistics, > following the example of R's summary() and plot() functions. > [Design > doc|https://docs.google.com/document/d/1oswC_Neqlqn5ElPwodlDY4IkSaHAi0Bx6Guo_LvhHK8/edit?usp=sharing] > From the design doc: > {quote} > R and its well-established packages provide extensive functionality for > inspecting a model and its results. This inspection is critical to > interpreting, debugging and improving models. > R is arguably a gold standard for a statistics/ML library, so this doc > largely attempts to imitate it. The challenge we face is supporting similar > functionality, but on big (distributed) data. Data size makes both efficient > computation and meaningful displays/summaries difficult. > R model and result summaries generally take 2 forms: > * summary(model): Display text with information about the model and results > on data > * plot(model): Display plots about the model and results > We aim to provide both of these types of information. Visualization for the > plottable results will not be supported in MLlib itself, but we can provide > results in a form which can be plotted easily with other tools. > {quote}
[jira] [Created] (SPARK-20344) Duplicate call in FairSchedulableBuilder.addTaskSetManager
Robert Stupp created SPARK-20344: Summary: Duplicate call in FairSchedulableBuilder.addTaskSetManager Key: SPARK-20344 URL: https://issues.apache.org/jira/browse/SPARK-20344 Project: Spark Issue Type: Improvement Components: Scheduler Affects Versions: 2.1.0 Reporter: Robert Stupp Priority: Minor {{org.apache.spark.scheduler.FairSchedulableBuilder#addTaskSetManager}} contains the code snippet: {code} override def addTaskSetManager(manager: Schedulable, properties: Properties) { var poolName = DEFAULT_POOL_NAME var parentPool = rootPool.getSchedulableByName(poolName) if (properties != null) { poolName = properties.getProperty(FAIR_SCHEDULER_PROPERTIES, DEFAULT_POOL_NAME) parentPool = rootPool.getSchedulableByName(poolName) if (parentPool == null) { {code} {{parentPool = rootPool.getSchedulableByName(poolName)}} is called twice if {{properties != null}}. I'm not sure whether this is an oversight or there's something else missing. This piece of the code hasn't been modified since 2013, so I doubt that this is a serious issue.