[jira] [Commented] (SPARK-22451) Reduce decision tree aggregate size for unordered features from O(2^numCategories) to O(numCategories)
[ https://issues.apache.org/jira/browse/SPARK-22451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16241634#comment-16241634 ] Weichen Xu commented on SPARK-22451: [~josephkb] I think it can be reconstructed from a summary with only O(numCategories) values, let me give a simple example to explain how it works (but correct me if I am wrong): Suppose an unordered feature contains the values [a, b, c, d, e, f]. We first collect separate statistics stat[a], stat[b], stat[c], stat[d], stat[e], stat[f]; then, if we want the stats for the split [a, c, d], we can compute them as stat[a] + stat[c] + stat[d]. [~Siddharth Murching] can also help confirm the correctness of this. > Reduce decision tree aggregate size for unordered features from > O(2^numCategories) to O(numCategories) > -- > > Key: SPARK-22451 > URL: https://issues.apache.org/jira/browse/SPARK-22451 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.2.0 >Reporter: Weichen Xu > Original Estimate: 24h > Remaining Estimate: 24h > > Do not need generate all possible splits for unordered features before > aggregate, > in aggregete (executor side): > 1. Change `mixedBinSeqOp`, for each unordered feature, we do the same stat > with ordered features. so for unordered features, we only need > O(numCategories) space for this feature stat. > 2. After driver side get the aggregate result, generate all possible split > combinations, and compute the best split. > This will reduce decision tree aggregate size for each unordered feature from > O(2^numCategories) to O(numCategories), `numCategories` is the arity of this > unordered feature. > This also reduce the cpu cost in executor side. Reduce time complexity for > this unordered feature from O(numPoints * 2^numCategories) to O(numPoints). > This won't increase time complexity for unordered features best split > computing in driver side. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
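To illustrate the reconstruction idea in the comment above, here is a minimal sketch. The CategoryStat type and the sample numbers are made up for illustration; this is not the actual Spark aggregation code.

{code}
// One statistic per category, with an additive combine operation.
case class CategoryStat(count: Long, labelSum: Double) {
  def +(other: CategoryStat): CategoryStat =
    CategoryStat(count + other.count, labelSum + other.labelSum)
}

// O(numCategories) summary gathered on the executor side: one stat per category.
val perCategory: Map[String, CategoryStat] = Map(
  "a" -> CategoryStat(10, 3.0), "b" -> CategoryStat(4, 1.0),
  "c" -> CategoryStat(7, 6.0), "d" -> CategoryStat(2, 2.0),
  "e" -> CategoryStat(5, 0.0), "f" -> CategoryStat(9, 4.0))

// On the driver, the stat for the candidate split {a, c, d} is just the sum of
// the three per-category stats, so no 2^numCategories sized aggregate is needed.
val splitStat = Seq("a", "c", "d").map(perCategory).reduce(_ + _)
{code}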
[jira] [Assigned] (SPARK-22464) <=> is not supported by Hive metastore partition predicate pushdown
[ https://issues.apache.org/jira/browse/SPARK-22464?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22464: Assignee: Apache Spark (was: Xiao Li) > <=> is not supported by Hive metastore partition predicate pushdown > --- > > Key: SPARK-22464 > URL: https://issues.apache.org/jira/browse/SPARK-22464 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Xiao Li >Assignee: Apache Spark > > <=> is not supported by Hive metastore partition predicate pushdown. We > should forbid it. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22464) <=> is not supported by Hive metastore partition predicate pushdown
[ https://issues.apache.org/jira/browse/SPARK-22464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16241607#comment-16241607 ] Apache Spark commented on SPARK-22464: -- User 'gatorsmile' has created a pull request for this issue: https://github.com/apache/spark/pull/19682 > <=> is not supported by Hive metastore partition predicate pushdown > --- > > Key: SPARK-22464 > URL: https://issues.apache.org/jira/browse/SPARK-22464 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Xiao Li >Assignee: Xiao Li > > <=> is not supported by Hive metastore partition predicate pushdown. We > should forbid it. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-22464) <=> is not supported by Hive metastore partition predicate pushdown
[ https://issues.apache.org/jira/browse/SPARK-22464?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22464: Assignee: Xiao Li (was: Apache Spark) > <=> is not supported by Hive metastore partition predicate pushdown > --- > > Key: SPARK-22464 > URL: https://issues.apache.org/jira/browse/SPARK-22464 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Xiao Li >Assignee: Xiao Li > > <=> is not supported by Hive metastore partition predicate pushdown. We > should forbid it. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-22464) <=> is not supported by Hive metastore partition predicate pushdown
Xiao Li created SPARK-22464: --- Summary: <=> is not supported by Hive metastore partition predicate pushdown Key: SPARK-22464 URL: https://issues.apache.org/jira/browse/SPARK-22464 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.2.0 Reporter: Xiao Li Assignee: Xiao Li <=> is not supported by Hive metastore partition predicate pushdown. We should forbid it. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
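For context, {{<=>}} is Spark SQL's null-safe equality operator. A hypothetical illustration of the kind of query affected follows; the table and partition column names are placeholders, and the session assumes Hive support.

{code}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("NullSafeEqualityPushdown")
  .enableHiveSupport()
  .getOrCreate()

// Plain equality on a partition column can be pushed down to the metastore...
spark.sql("SELECT * FROM sales WHERE part_col = 1").show()

// ...but a null-safe equality predicate on the same partition column is the
// case this issue reports as unsupported by metastore partition pruning.
spark.sql("SELECT * FROM sales WHERE part_col <=> 1").show()
{code}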
[jira] [Comment Edited] (SPARK-19680) Offsets out of range with no configured reset policy for partitions
[ https://issues.apache.org/jira/browse/SPARK-19680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16241577#comment-16241577 ] Serkan Taş edited comment on SPARK-19680 at 11/7/17 6:37 AM: - I frequently get the error of "numRecords must not be negative", but this time I have the same issue also on a yarn cluster for a job running 4 days terminated due to the same error : diagnostics: User class threw exception: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 18388.0 failed 4 times, most recent failure: Lost task 0.3 in stage 18388.0 (TID 21387, server, executor 1): org.apache.kafka.clients.consumer.OffsetOutOfRangeException: Offsets out of range with no configured reset policy for partitions: {topic-0=33436703} Exception in thread "main" org.apache.spark.SparkException: Application application_fdsfdsfsdfsdf_0001 finished with failed status Hadop : 2.8.0 Spark : 2.1.0 Kafka : 0.10.2.1 Configuration : val kafkaParams = Map[String, Object]( "bootstrap.servers" -> "brokers", "key.deserializer" -> classOf[StringDeserializer], "value.deserializer" -> classOf[StringDeserializer], "group.id" -> "grp_id", "auto.offset.reset" -> "latest", "enable.auto.commit" -> (false: java.lang.Boolean) ) was (Author: serkan_tas): I frequently get the error of "numRecords must not be negative ", But this time I have the same issue also on a yarn cluster for a job running 4 days terminated due to the same error : diagnostics: User class threw exception: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 18388.0 failed 4 times, most recent failure: Lost task 0.3 in stage 18388.0 (TID 21387, server, executor 1): org.apache.kafka.clients.consumer.OffsetOutOfRangeException: Offsets out of range with no configured reset policy for partitions: {topic-0=33436703} Exception in thread "main" org.apache.spark.SparkException: Application application_fdsfdsfsdfsdf_0001 finished with failed status Hadop : 2.8.0 Spark : 2.1.0 Kafka : 0.10.2.1 Configuration : val kafkaParams = Map[String, Object]( "bootstrap.servers" -> "brokers", "key.deserializer" -> classOf[StringDeserializer], "value.deserializer" -> classOf[StringDeserializer], "group.id" -> "grp_id", "auto.offset.reset" -> "latest", "enable.auto.commit" -> (false: java.lang.Boolean) ) > Offsets out of range with no configured reset policy for partitions > --- > > Key: SPARK-19680 > URL: https://issues.apache.org/jira/browse/SPARK-19680 > Project: Spark > Issue Type: Bug > Components: DStreams >Affects Versions: 2.1.0 >Reporter: Schakmann Rene > > I'm using spark streaming with kafka to acutally create a toplist. I want to > read all the messages in kafka. 
So I set >"auto.offset.reset" -> "earliest" > Nevertheless when I start the job on our spark cluster it is not working I > get: > Error: > {code:title=error.log|borderStyle=solid} > Job aborted due to stage failure: Task 2 in stage 111.0 failed 4 times, > most recent failure: Lost task 2.3 in stage 111.0 (TID 1270, 194.232.55.23, > executor 2): org.apache.kafka.clients.consumer.OffsetOutOfRangeException: > Offsets out of range with no configured reset policy for partitions: > {SearchEvents-2=161803385} > {code} > This is somehow wrong because I did set the auto.offset.reset property > Setup: > Kafka Parameter: > {code:title=Config.Scala|borderStyle=solid} > def getDefaultKafkaReceiverParameter(properties: Properties):Map[String, > Object] = { > Map( > "bootstrap.servers" -> > properties.getProperty("kafka.bootstrap.servers"), > "group.id" -> properties.getProperty("kafka.consumer.group"), > "auto.offset.reset" -> "earliest", > "spark.streaming.kafka.consumer.cache.enabled" -> "false", > "enable.auto.commit" -> "false", > "key.deserializer" -> classOf[StringDeserializer], > "value.deserializer" -> "at.willhaben.sid.DTOByteDeserializer") > } > {code} > Job: > {code:title=Job.Scala|borderStyle=solid} > def processSearchKeyWords(stream: InputDStream[ConsumerRecord[String, > Array[Byte]]], windowDuration: Int, slideDuration: Int, kafkaSink: > Broadcast[KafkaSink[TopList]]): Unit = { > getFilteredStream(stream.map(_.value()), windowDuration, > slideDuration).foreachRDD(rdd => { > val topList = new TopList > topList.setCreated(new Date()) > topList.setTopListEntryList(rdd.take(TopListLength).toList) > CurrentLogger.info("TopList length: " + > topList.getTopListEntryList.size().toString) > kafkaSink.value.send(SendToTopicName, topList) > CurrentLogger.info("Last Run: " + System.currentTimeMillis()) >
[jira] [Comment Edited] (SPARK-19680) Offsets out of range with no configured reset policy for partitions
[ https://issues.apache.org/jira/browse/SPARK-19680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16241577#comment-16241577 ] Serkan Taş edited comment on SPARK-19680 at 11/7/17 6:37 AM: - I frequently get the error of "numRecords must not be negative ", But this time I have the same issue also on a yarn cluster for a job running 4 days terminated due to the same error : diagnostics: User class threw exception: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 18388.0 failed 4 times, most recent failure: Lost task 0.3 in stage 18388.0 (TID 21387, server, executor 1): org.apache.kafka.clients.consumer.OffsetOutOfRangeException: Offsets out of range with no configured reset policy for partitions: {topic-0=33436703} Exception in thread "main" org.apache.spark.SparkException: Application application_fdsfdsfsdfsdf_0001 finished with failed status Hadop : 2.8.0 Spark : 2.1.0 Kafka : 0.10.2.1 Configuration : val kafkaParams = Map[String, Object]( "bootstrap.servers" -> "brokers", "key.deserializer" -> classOf[StringDeserializer], "value.deserializer" -> classOf[StringDeserializer], "group.id" -> "grp_id", "auto.offset.reset" -> "latest", "enable.auto.commit" -> (false: java.lang.Boolean) ) was (Author: serkan_tas): I have the same issue also on a yarn cluster for a job running 4 days terminated due to the same error : diagnostics: User class threw exception: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 18388.0 failed 4 times, most recent failure: Lost task 0.3 in stage 18388.0 (TID 21387, server, executor 1): org.apache.kafka.clients.consumer.OffsetOutOfRangeException: Offsets out of range with no configured reset policy for partitions: {topic-0=33436703} Exception in thread "main" org.apache.spark.SparkException: Application application_fdsfdsfsdfsdf_0001 finished with failed status Hadop : 2.8.0 Spark : 2.1.0 Kafka : 0.10.2.1 Configuration : val kafkaParams = Map[String, Object]( "bootstrap.servers" -> "brokers", "key.deserializer" -> classOf[StringDeserializer], "value.deserializer" -> classOf[StringDeserializer], "group.id" -> "grp_id", "auto.offset.reset" -> "latest", "enable.auto.commit" -> (false: java.lang.Boolean) ) > Offsets out of range with no configured reset policy for partitions > --- > > Key: SPARK-19680 > URL: https://issues.apache.org/jira/browse/SPARK-19680 > Project: Spark > Issue Type: Bug > Components: DStreams >Affects Versions: 2.1.0 >Reporter: Schakmann Rene > > I'm using spark streaming with kafka to acutally create a toplist. I want to > read all the messages in kafka. 
So I set >"auto.offset.reset" -> "earliest" > Nevertheless when I start the job on our spark cluster it is not working I > get: > Error: > {code:title=error.log|borderStyle=solid} > Job aborted due to stage failure: Task 2 in stage 111.0 failed 4 times, > most recent failure: Lost task 2.3 in stage 111.0 (TID 1270, 194.232.55.23, > executor 2): org.apache.kafka.clients.consumer.OffsetOutOfRangeException: > Offsets out of range with no configured reset policy for partitions: > {SearchEvents-2=161803385} > {code} > This is somehow wrong because I did set the auto.offset.reset property > Setup: > Kafka Parameter: > {code:title=Config.Scala|borderStyle=solid} > def getDefaultKafkaReceiverParameter(properties: Properties):Map[String, > Object] = { > Map( > "bootstrap.servers" -> > properties.getProperty("kafka.bootstrap.servers"), > "group.id" -> properties.getProperty("kafka.consumer.group"), > "auto.offset.reset" -> "earliest", > "spark.streaming.kafka.consumer.cache.enabled" -> "false", > "enable.auto.commit" -> "false", > "key.deserializer" -> classOf[StringDeserializer], > "value.deserializer" -> "at.willhaben.sid.DTOByteDeserializer") > } > {code} > Job: > {code:title=Job.Scala|borderStyle=solid} > def processSearchKeyWords(stream: InputDStream[ConsumerRecord[String, > Array[Byte]]], windowDuration: Int, slideDuration: Int, kafkaSink: > Broadcast[KafkaSink[TopList]]): Unit = { > getFilteredStream(stream.map(_.value()), windowDuration, > slideDuration).foreachRDD(rdd => { > val topList = new TopList > topList.setCreated(new Date()) > topList.setTopListEntryList(rdd.take(TopListLength).toList) > CurrentLogger.info("TopList length: " + > topList.getTopListEntryList.size().toString) > kafkaSink.value.send(SendToTopicName, topList) > CurrentLogger.info("Last Run: " + System.currentTimeMillis()) > }) > } > def getFilteredStream(result: DStream[Array[Byte]], windowDuration:
[jira] [Commented] (SPARK-19680) Offsets out of range with no configured reset policy for partitions
[ https://issues.apache.org/jira/browse/SPARK-19680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16241577#comment-16241577 ] Serkan Taş commented on SPARK-19680: I have the same issue also on a yarn cluster for a job running 4 days terminated due to the same error : diagnostics: User class threw exception: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 18388.0 failed 4 times, most recent failure: Lost task 0.3 in stage 18388.0 (TID 21387, server, executor 1): org.apache.kafka.clients.consumer.OffsetOutOfRangeException: Offsets out of range with no configured reset policy for partitions: {topic-0=33436703} Exception in thread "main" org.apache.spark.SparkException: Application application_fdsfdsfsdfsdf_0001 finished with failed status Hadop : 2.8.0 Spark : 2.1.0 Kafka : 0.10.2.1 Configuration : val kafkaParams = Map[String, Object]( "bootstrap.servers" -> "brokers", "key.deserializer" -> classOf[StringDeserializer], "value.deserializer" -> classOf[StringDeserializer], "group.id" -> "grp_id", "auto.offset.reset" -> "latest", "enable.auto.commit" -> (false: java.lang.Boolean) ) > Offsets out of range with no configured reset policy for partitions > --- > > Key: SPARK-19680 > URL: https://issues.apache.org/jira/browse/SPARK-19680 > Project: Spark > Issue Type: Bug > Components: DStreams >Affects Versions: 2.1.0 >Reporter: Schakmann Rene > > I'm using spark streaming with kafka to acutally create a toplist. I want to > read all the messages in kafka. So I set >"auto.offset.reset" -> "earliest" > Nevertheless when I start the job on our spark cluster it is not working I > get: > Error: > {code:title=error.log|borderStyle=solid} > Job aborted due to stage failure: Task 2 in stage 111.0 failed 4 times, > most recent failure: Lost task 2.3 in stage 111.0 (TID 1270, 194.232.55.23, > executor 2): org.apache.kafka.clients.consumer.OffsetOutOfRangeException: > Offsets out of range with no configured reset policy for partitions: > {SearchEvents-2=161803385} > {code} > This is somehow wrong because I did set the auto.offset.reset property > Setup: > Kafka Parameter: > {code:title=Config.Scala|borderStyle=solid} > def getDefaultKafkaReceiverParameter(properties: Properties):Map[String, > Object] = { > Map( > "bootstrap.servers" -> > properties.getProperty("kafka.bootstrap.servers"), > "group.id" -> properties.getProperty("kafka.consumer.group"), > "auto.offset.reset" -> "earliest", > "spark.streaming.kafka.consumer.cache.enabled" -> "false", > "enable.auto.commit" -> "false", > "key.deserializer" -> classOf[StringDeserializer], > "value.deserializer" -> "at.willhaben.sid.DTOByteDeserializer") > } > {code} > Job: > {code:title=Job.Scala|borderStyle=solid} > def processSearchKeyWords(stream: InputDStream[ConsumerRecord[String, > Array[Byte]]], windowDuration: Int, slideDuration: Int, kafkaSink: > Broadcast[KafkaSink[TopList]]): Unit = { > getFilteredStream(stream.map(_.value()), windowDuration, > slideDuration).foreachRDD(rdd => { > val topList = new TopList > topList.setCreated(new Date()) > topList.setTopListEntryList(rdd.take(TopListLength).toList) > CurrentLogger.info("TopList length: " + > topList.getTopListEntryList.size().toString) > kafkaSink.value.send(SendToTopicName, topList) > CurrentLogger.info("Last Run: " + System.currentTimeMillis()) > }) > } > def getFilteredStream(result: DStream[Array[Byte]], windowDuration: Int, > slideDuration: Int): DStream[TopListEntry] = { > val Mapper = 
MapperObject.readerFor[SearchEventDTO] > result.repartition(100).map(s => Mapper.readValue[SearchEventDTO](s)) > .filter(s => s != null && s.getSearchRequest != null && > s.getSearchRequest.getSearchParameters != null && s.getVertical == > Vertical.BAP && > s.getSearchRequest.getSearchParameters.containsKey(EspParameterEnum.KEYWORD.getName)) > .map(row => { > val name = > row.getSearchRequest.getSearchParameters.get(EspParameterEnum.KEYWORD.getName).getEspSearchParameterDTO.getValue.toLowerCase() > (name, new TopListEntry(name, 1, row.getResultCount)) > }) > .reduceByKeyAndWindow( > (a: TopListEntry, b: TopListEntry) => new TopListEntry(a.getKeyword, > a.getSearchCount + b.getSearchCount, a.getMeanSearchHits + > b.getMeanSearchHits), > (a: TopListEntry, b: TopListEntry) => new TopListEntry(a.getKeyword, > a.getSearchCount - b.getSearchCount, a.getMeanSearchHits - > b.getMeanSearchHits), > Minutes(windowDuration), > Seconds(slideDuration)) > .filter((x: (String, TopListEntry)) => x._2.getSearchCount >
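For reference, kafkaParams like those quoted above are typically wired into a direct stream as in the following minimal sketch of the Kafka 0.10 integration; the application name, topic name, and batch interval are placeholders, and the stream body is reduced to a per-batch count.

{code}
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

val conf = new SparkConf().setAppName("KafkaOffsetResetExample")
val ssc = new StreamingContext(conf, Seconds(10))

val kafkaParams = Map[String, Object](
  "bootstrap.servers" -> "broker1:9092",            // placeholder broker list
  "key.deserializer" -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id" -> "grp_id",
  "auto.offset.reset" -> "earliest",                // as set in the issue description
  "enable.auto.commit" -> (false: java.lang.Boolean))

// "topic" is a placeholder topic name.
val stream = KafkaUtils.createDirectStream[String, String](
  ssc, PreferConsistent, Subscribe[String, String](Array("topic"), kafkaParams))

stream.foreachRDD(rdd => println(s"records in batch: ${rdd.count()}"))

ssc.start()
ssc.awaitTermination()
{code}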
[jira] [Updated] (SPARK-22289) Cannot save LogisticRegressionClassificationModel with bounds on coefficients
[ https://issues.apache.org/jira/browse/SPARK-22289?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yanbo Liang updated SPARK-22289: Shepherd: Yanbo Liang > Cannot save LogisticRegressionClassificationModel with bounds on coefficients > - > > Key: SPARK-22289 > URL: https://issues.apache.org/jira/browse/SPARK-22289 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 2.2.0 >Reporter: Nic Eggert >Assignee: yuhao yang > > I think this was introduced in SPARK-20047. > Trying to call save on a logistic regression model trained with bounds on its > parameters throws an error. This seems to be because Spark doesn't know how > to serialize the Matrix parameter. > Model is set up like this: > {code} > val calibrator = new LogisticRegression() > .setFeaturesCol("uncalibrated_probability") > .setLabelCol("label") > .setWeightCol("weight") > .setStandardization(false) > .setLowerBoundsOnCoefficients(new DenseMatrix(1, 1, Array(0.0))) > .setFamily("binomial") > .setProbabilityCol("probability") > .setPredictionCol("logistic_prediction") > .setRawPredictionCol("logistic_raw_prediction") > {code} > {code} > 17/10/16 15:36:59 ERROR ApplicationMaster: User class threw exception: > scala.NotImplementedError: The default jsonEncode only supports string and > vector. org.apache.spark.ml.param.Param must override jsonEncode for > org.apache.spark.ml.linalg.DenseMatrix. > scala.NotImplementedError: The default jsonEncode only supports string and > vector. org.apache.spark.ml.param.Param must override jsonEncode for > org.apache.spark.ml.linalg.DenseMatrix. > at org.apache.spark.ml.param.Param.jsonEncode(params.scala:98) > at > org.apache.spark.ml.util.DefaultParamsWriter$$anonfun$1$$anonfun$2.apply(ReadWrite.scala:296) > at > org.apache.spark.ml.util.DefaultParamsWriter$$anonfun$1$$anonfun$2.apply(ReadWrite.scala:295) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:234) > at scala.collection.AbstractTraversable.map(Traversable.scala:104) > at > org.apache.spark.ml.util.DefaultParamsWriter$$anonfun$1.apply(ReadWrite.scala:295) > at > org.apache.spark.ml.util.DefaultParamsWriter$$anonfun$1.apply(ReadWrite.scala:295) > at scala.Option.getOrElse(Option.scala:121) > at > org.apache.spark.ml.util.DefaultParamsWriter$.getMetadataToSave(ReadWrite.scala:295) > at > org.apache.spark.ml.util.DefaultParamsWriter$.saveMetadata(ReadWrite.scala:277) > at > org.apache.spark.ml.classification.LogisticRegressionModel$LogisticRegressionModelWriter.saveImpl(LogisticRegression.scala:1182) > at org.apache.spark.ml.util.MLWriter.save(ReadWrite.scala:114) > at > org.apache.spark.ml.Pipeline$SharedReadWrite$$anonfun$saveImpl$1.apply(Pipeline.scala:254) > at > org.apache.spark.ml.Pipeline$SharedReadWrite$$anonfun$saveImpl$1.apply(Pipeline.scala:253) > at > scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) > at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186) > at > org.apache.spark.ml.Pipeline$SharedReadWrite$.saveImpl(Pipeline.scala:253) > at > org.apache.spark.ml.PipelineModel$PipelineModelWriter.saveImpl(Pipeline.scala:337) > at org.apache.spark.ml.util.MLWriter.save(ReadWrite.scala:114) > -snip- 
> {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-22289) Cannot save LogisticRegressionClassificationModel with bounds on coefficients
[ https://issues.apache.org/jira/browse/SPARK-22289?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yanbo Liang reassigned SPARK-22289: --- Assignee: yuhao yang > Cannot save LogisticRegressionClassificationModel with bounds on coefficients > - > > Key: SPARK-22289 > URL: https://issues.apache.org/jira/browse/SPARK-22289 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 2.2.0 >Reporter: Nic Eggert >Assignee: yuhao yang > > I think this was introduced in SPARK-20047. > Trying to call save on a logistic regression model trained with bounds on its > parameters throws an error. This seems to be because Spark doesn't know how > to serialize the Matrix parameter. > Model is set up like this: > {code} > val calibrator = new LogisticRegression() > .setFeaturesCol("uncalibrated_probability") > .setLabelCol("label") > .setWeightCol("weight") > .setStandardization(false) > .setLowerBoundsOnCoefficients(new DenseMatrix(1, 1, Array(0.0))) > .setFamily("binomial") > .setProbabilityCol("probability") > .setPredictionCol("logistic_prediction") > .setRawPredictionCol("logistic_raw_prediction") > {code} > {code} > 17/10/16 15:36:59 ERROR ApplicationMaster: User class threw exception: > scala.NotImplementedError: The default jsonEncode only supports string and > vector. org.apache.spark.ml.param.Param must override jsonEncode for > org.apache.spark.ml.linalg.DenseMatrix. > scala.NotImplementedError: The default jsonEncode only supports string and > vector. org.apache.spark.ml.param.Param must override jsonEncode for > org.apache.spark.ml.linalg.DenseMatrix. > at org.apache.spark.ml.param.Param.jsonEncode(params.scala:98) > at > org.apache.spark.ml.util.DefaultParamsWriter$$anonfun$1$$anonfun$2.apply(ReadWrite.scala:296) > at > org.apache.spark.ml.util.DefaultParamsWriter$$anonfun$1$$anonfun$2.apply(ReadWrite.scala:295) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:234) > at scala.collection.AbstractTraversable.map(Traversable.scala:104) > at > org.apache.spark.ml.util.DefaultParamsWriter$$anonfun$1.apply(ReadWrite.scala:295) > at > org.apache.spark.ml.util.DefaultParamsWriter$$anonfun$1.apply(ReadWrite.scala:295) > at scala.Option.getOrElse(Option.scala:121) > at > org.apache.spark.ml.util.DefaultParamsWriter$.getMetadataToSave(ReadWrite.scala:295) > at > org.apache.spark.ml.util.DefaultParamsWriter$.saveMetadata(ReadWrite.scala:277) > at > org.apache.spark.ml.classification.LogisticRegressionModel$LogisticRegressionModelWriter.saveImpl(LogisticRegression.scala:1182) > at org.apache.spark.ml.util.MLWriter.save(ReadWrite.scala:114) > at > org.apache.spark.ml.Pipeline$SharedReadWrite$$anonfun$saveImpl$1.apply(Pipeline.scala:254) > at > org.apache.spark.ml.Pipeline$SharedReadWrite$$anonfun$saveImpl$1.apply(Pipeline.scala:253) > at > scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) > at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186) > at > org.apache.spark.ml.Pipeline$SharedReadWrite$.saveImpl(Pipeline.scala:253) > at > org.apache.spark.ml.PipelineModel$PipelineModelWriter.saveImpl(Pipeline.scala:337) > at org.apache.spark.ml.util.MLWriter.save(ReadWrite.scala:114) > 
-snip- > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
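A minimal, hypothetical reproduction sketch of the failure described above; the training data, column names, and output path are placeholders, and only the lower-bound matrix is set to keep the example short.

{code}
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.linalg.{DenseMatrix, Vectors}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("BoundedLRSaveRepro").getOrCreate()
import spark.implicits._

val training = Seq(
  (0.0, Vectors.dense(0.1, 0.5)),
  (1.0, Vectors.dense(0.9, 0.4))).toDF("label", "features")

val lr = new LogisticRegression()
  .setFamily("binomial")
  .setStandardization(false)
  // The bounds are held in a Matrix param, which the stack trace above says
  // has no jsonEncode implementation.
  .setLowerBoundsOnCoefficients(new DenseMatrix(1, 2, Array(0.0, 0.0)))

val model = lr.fit(training)

// Per the report, this save step throws scala.NotImplementedError on 2.2.0.
model.write.overwrite().save("/tmp/bounded-lr-model")
{code}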
[jira] [Updated] (SPARK-22453) Zookeeper configuration for MesosClusterDispatcher is not documented
[ https://issues.apache.org/jira/browse/SPARK-22453?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jais Sebastian updated SPARK-22453: --- Priority: Critical (was: Major) > Zookeeper configuration for MesosClusterDispatcher is not documented > -- > > Key: SPARK-22453 > URL: https://issues.apache.org/jira/browse/SPARK-22453 > Project: Spark > Issue Type: Bug > Components: Documentation >Affects Versions: 2.2.0 > Environment: Spark 2.2 with Spark submit >Reporter: Jais Sebastian >Priority: Critical > > Spark does not have detailed documentation on how to configure zookeeper with > MesosClusterDispatcher . It is mentioned that the configuration is possible > and broken documentation link is provided. > https://spark.apache.org/docs/latest/running-on-mesos.html#cluster-mode > _{{The MesosClusterDispatcher also supports writing recovery state into > Zookeeper. This will allow the MesosClusterDispatcher to be able to recover > all submitted and running containers on relaunch. In order to enable this > recovery mode, you can set SPARK_DAEMON_JAVA_OPTS in spark-env by configuring > spark.deploy.recoveryMode and related spark.deploy.zookeeper.* > configurations. For more information about these configurations please refer > to the configurations doc.}}_ -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
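Since the linked section of the docs is broken, here is a sketch of the kind of setting the quoted paragraph refers to, placed in spark-env.sh on the host running MesosClusterDispatcher; the ZooKeeper connection string and znode path are placeholders.

{code}
# Recovery state written to ZooKeeper; adjust the URL and directory to your cluster.
export SPARK_DAEMON_JAVA_OPTS="-Dspark.deploy.recoveryMode=ZOOKEEPER \
  -Dspark.deploy.zookeeper.url=zk1:2181,zk2:2181 \
  -Dspark.deploy.zookeeper.dir=/spark_mesos_dispatcher"
{code}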
[jira] [Commented] (SPARK-22373) Intermittent NullPointerException in org.codehaus.janino.IClass.isAssignableFrom
[ https://issues.apache.org/jira/browse/SPARK-22373?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16241469#comment-16241469 ] Kazuaki Ishizaki commented on SPARK-22373: -- If you can submit a program that can reproduce this, I could investigate which software component can cause an issue. > Intermittent NullPointerException in > org.codehaus.janino.IClass.isAssignableFrom > > > Key: SPARK-22373 > URL: https://issues.apache.org/jira/browse/SPARK-22373 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.1 > Environment: Hortonworks distribution: HDP 2.6.2.0-205 , > /usr/hdp/current/spark2-client/jars/spark-core_2.11-2.1.1.2.6.2.0-205.jar >Reporter: Dan Meany >Priority: Minor > > Very occasional and retry works. > Full stack: > 17/10/27 21:06:15 ERROR Executor: Exception in task 29.0 in stage 12.0 (TID > 758) > java.lang.NullPointerException > at org.codehaus.janino.IClass.isAssignableFrom(IClass.java:569) > at > org.codehaus.janino.UnitCompiler.isWideningReferenceConvertible(UnitCompiler.java:10347) > at > org.codehaus.janino.UnitCompiler.isMethodInvocationConvertible(UnitCompiler.java:8636) > at > org.codehaus.janino.UnitCompiler.findMostSpecificIInvocable(UnitCompiler.java:8427) > at > org.codehaus.janino.UnitCompiler.findMostSpecificIInvocable(UnitCompiler.java:8285) > at org.codehaus.janino.UnitCompiler.findIMethod(UnitCompiler.java:8169) > at org.codehaus.janino.UnitCompiler.findIMethod(UnitCompiler.java:8071) > at org.codehaus.janino.UnitCompiler.compileGet2(UnitCompiler.java:4421) > at org.codehaus.janino.UnitCompiler.access$7500(UnitCompiler.java:206) > at > org.codehaus.janino.UnitCompiler$12.visitMethodInvocation(UnitCompiler.java:3774) > at > org.codehaus.janino.UnitCompiler$12.visitMethodInvocation(UnitCompiler.java:3762) > at org.codehaus.janino.Java$MethodInvocation.accept(Java.java:4328) > at org.codehaus.janino.UnitCompiler.compileGet(UnitCompiler.java:3762) > at > org.codehaus.janino.UnitCompiler.compileGetValue(UnitCompiler.java:4933) > at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:3180) > at org.codehaus.janino.UnitCompiler.access$5000(UnitCompiler.java:206) > at > org.codehaus.janino.UnitCompiler$9.visitMethodInvocation(UnitCompiler.java:3151) > at > org.codehaus.janino.UnitCompiler$9.visitMethodInvocation(UnitCompiler.java:3139) > at org.codehaus.janino.Java$MethodInvocation.accept(Java.java:4328) > at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:3139) > at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:2112) > at org.codehaus.janino.UnitCompiler.access$1700(UnitCompiler.java:206) > at > org.codehaus.janino.UnitCompiler$6.visitExpressionStatement(UnitCompiler.java:1377) > at > org.codehaus.janino.UnitCompiler$6.visitExpressionStatement(UnitCompiler.java:1370) > at org.codehaus.janino.Java$ExpressionStatement.accept(Java.java:2558) > at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:1370) > at > org.codehaus.janino.UnitCompiler.compileStatements(UnitCompiler.java:1450) > at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:2811) > at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:550) > at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:890) > at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:894) > at org.codehaus.janino.UnitCompiler.access$600(UnitCompiler.java:206) > at > org.codehaus.janino.UnitCompiler$2.visitMemberClassDeclaration(UnitCompiler.java:377) > at > 
org.codehaus.janino.UnitCompiler$2.visitMemberClassDeclaration(UnitCompiler.java:369) > at > org.codehaus.janino.Java$MemberClassDeclaration.accept(Java.java:1128) > at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:369) > at > org.codehaus.janino.UnitCompiler.compileDeclaredMemberTypes(UnitCompiler.java:1209) > at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:564) > at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:890) > at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:894) > at org.codehaus.janino.UnitCompiler.access$600(UnitCompiler.java:206) > at > org.codehaus.janino.UnitCompiler$2.visitMemberClassDeclaration(UnitCompiler.java:377) > at > org.codehaus.janino.UnitCompiler$2.visitMemberClassDeclaration(UnitCompiler.java:369) > at > org.codehaus.janino.Java$MemberClassDeclaration.accept(Java.java:1128) > at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:369) > at > org
[jira] [Issue Comment Deleted] (SPARK-22359) Improve the test coverage of window functions
[ https://issues.apache.org/jira/browse/SPARK-22359?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Teng Peng updated SPARK-22359: -- Comment: was deleted (was: [~jiangxb] If I can have 1 test as reference, I will figure out the rests myself.) > Improve the test coverage of window functions > - > > Key: SPARK-22359 > URL: https://issues.apache.org/jira/browse/SPARK-22359 > Project: Spark > Issue Type: Test > Components: SQL >Affects Versions: 2.3.0 >Reporter: Jiang Xingbo > > There are already quite a few integration tests using window functions, but > the unit tests coverage for window funtions is not ideal. > We'd like to test the following aspects: > * Specifications > ** different partition clauses (none, one, multiple) > ** different order clauses (none, one, multiple, asc/desc, nulls first/last) > * Frames and their combinations > ** OffsetWindowFunctionFrame > ** UnboundedWindowFunctionFrame > ** SlidingWindowFunctionFrame > ** UnboundedPrecedingWindowFunctionFrame > ** UnboundedFollowingWindowFunctionFrame > * Aggregate function types > ** Declarative > ** Imperative > ** UDAF > * Spilling > ** Cover the conditions that WindowExec should spill at least once -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22359) Improve the test coverage of window functions
[ https://issues.apache.org/jira/browse/SPARK-22359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16241392#comment-16241392 ] Teng Peng commented on SPARK-22359: --- [~jiangxb] If I can have one test as a reference, I will figure out the rest myself. > Improve the test coverage of window functions > - > > Key: SPARK-22359 > URL: https://issues.apache.org/jira/browse/SPARK-22359 > Project: Spark > Issue Type: Test > Components: SQL >Affects Versions: 2.3.0 >Reporter: Jiang Xingbo > > There are already quite a few integration tests using window functions, but > the unit tests coverage for window funtions is not ideal. > We'd like to test the following aspects: > * Specifications > ** different partition clauses (none, one, multiple) > ** different order clauses (none, one, multiple, asc/desc, nulls first/last) > * Frames and their combinations > ** OffsetWindowFunctionFrame > ** UnboundedWindowFunctionFrame > ** SlidingWindowFunctionFrame > ** UnboundedPrecedingWindowFunctionFrame > ** UnboundedFollowingWindowFunctionFrame > * Aggregate function types > ** Declarative > ** Imperative > ** UDAF > * Spilling > ** Cover the conditions that WindowExec should spill at least once -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
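As a starting point, here is a hedged sketch of the kind of unit-level check the list above asks for, covering one partition clause, one order clause, and a sliding frame; the data and expected values are illustrative and not taken from the existing Spark test suite.

{code}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().appName("WindowFrameSketch").getOrCreate()
import spark.implicits._

val df = Seq(("a", 1), ("a", 2), ("a", 3), ("b", 10), ("b", 20)).toDF("key", "value")

// Sliding frame: the previous row and the current row, per partition.
val w = Window.partitionBy("key").orderBy("value").rowsBetween(-1, 0)

val result = df.withColumn("slidingSum", sum("value").over(w))
result.show()
// Expected sliding sums: 1, 3, 5 for partition "a" and 10, 30 for partition "b".
{code}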
[jira] [Commented] (SPARK-22463) Missing hadoop/hive/hbase/etc configuration files in SPARK_CONF_DIR to distributed archive
[ https://issues.apache.org/jira/browse/SPARK-22463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16241389#comment-16241389 ] Apache Spark commented on SPARK-22463: -- User 'yaooqinn' has created a pull request for this issue: https://github.com/apache/spark/pull/19663 > Missing hadoop/hive/hbase/etc configuration files in SPARK_CONF_DIR to > distributed archive > -- > > Key: SPARK-22463 > URL: https://issues.apache.org/jira/browse/SPARK-22463 > Project: Spark > Issue Type: Bug > Components: YARN >Affects Versions: 2.1.2, 2.2.0 >Reporter: Kent Yao > > When I ran self contained sql apps, such as > {code:java} > import org.apache.spark.sql.SparkSession > object ShowHiveTables { > def main(args: Array[String]): Unit = { > val spark = SparkSession > .builder() > .appName("Show Hive Tables") > .enableHiveSupport() > .getOrCreate() > spark.sql("show tables").show() > spark.stop() > } > } > {code} > with **yarn cluster** mode and `hive-site.xml` correctly within > `$SPARK_HOME/conf`,they failed to connect the right hive metestore for not > seeing hive-site.xml in AM/Driver's classpath. > Although submitting them with `--files/--jars local/path/to/hive-site.xml` or > puting it to `$HADOOP_CONF_DIR/YARN_CONF_DIR` can make these apps works well > in cluster mode as client mode, according to the official doc, see @ > http://spark.apache.org/docs/latest/sql-programming-guide.html#hive-tables > > Configuration of Hive is done by placing your hive-site.xml, core-site.xml > > (for security configuration), and hdfs-site.xml (for HDFS configuration) > > file in conf/. > We may respect these configuration files too or modify the doc for > hive-tables in cluster mode. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-22463) Missing hadoop/hive/hbase/etc configuration files in SPARK_CONF_DIR to distributed archive
[ https://issues.apache.org/jira/browse/SPARK-22463?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22463: Assignee: (was: Apache Spark) > Missing hadoop/hive/hbase/etc configuration files in SPARK_CONF_DIR to > distributed archive > -- > > Key: SPARK-22463 > URL: https://issues.apache.org/jira/browse/SPARK-22463 > Project: Spark > Issue Type: Bug > Components: YARN >Affects Versions: 2.1.2, 2.2.0 >Reporter: Kent Yao > > When I ran self contained sql apps, such as > {code:java} > import org.apache.spark.sql.SparkSession > object ShowHiveTables { > def main(args: Array[String]): Unit = { > val spark = SparkSession > .builder() > .appName("Show Hive Tables") > .enableHiveSupport() > .getOrCreate() > spark.sql("show tables").show() > spark.stop() > } > } > {code} > with **yarn cluster** mode and `hive-site.xml` correctly within > `$SPARK_HOME/conf`,they failed to connect the right hive metestore for not > seeing hive-site.xml in AM/Driver's classpath. > Although submitting them with `--files/--jars local/path/to/hive-site.xml` or > puting it to `$HADOOP_CONF_DIR/YARN_CONF_DIR` can make these apps works well > in cluster mode as client mode, according to the official doc, see @ > http://spark.apache.org/docs/latest/sql-programming-guide.html#hive-tables > > Configuration of Hive is done by placing your hive-site.xml, core-site.xml > > (for security configuration), and hdfs-site.xml (for HDFS configuration) > > file in conf/. > We may respect these configuration files too or modify the doc for > hive-tables in cluster mode. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-22463) Missing hadoop/hive/hbase/etc configuration files in SPARK_CONF_DIR to distributed archive
[ https://issues.apache.org/jira/browse/SPARK-22463?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22463: Assignee: Apache Spark > Missing hadoop/hive/hbase/etc configuration files in SPARK_CONF_DIR to > distributed archive > -- > > Key: SPARK-22463 > URL: https://issues.apache.org/jira/browse/SPARK-22463 > Project: Spark > Issue Type: Bug > Components: YARN >Affects Versions: 2.1.2, 2.2.0 >Reporter: Kent Yao >Assignee: Apache Spark > > When I ran self contained sql apps, such as > {code:java} > import org.apache.spark.sql.SparkSession > object ShowHiveTables { > def main(args: Array[String]): Unit = { > val spark = SparkSession > .builder() > .appName("Show Hive Tables") > .enableHiveSupport() > .getOrCreate() > spark.sql("show tables").show() > spark.stop() > } > } > {code} > with **yarn cluster** mode and `hive-site.xml` correctly within > `$SPARK_HOME/conf`,they failed to connect the right hive metestore for not > seeing hive-site.xml in AM/Driver's classpath. > Although submitting them with `--files/--jars local/path/to/hive-site.xml` or > puting it to `$HADOOP_CONF_DIR/YARN_CONF_DIR` can make these apps works well > in cluster mode as client mode, according to the official doc, see @ > http://spark.apache.org/docs/latest/sql-programming-guide.html#hive-tables > > Configuration of Hive is done by placing your hive-site.xml, core-site.xml > > (for security configuration), and hdfs-site.xml (for HDFS configuration) > > file in conf/. > We may respect these configuration files too or modify the doc for > hive-tables in cluster mode. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-22463) Missing hadoop/hive/hbase/etc configuration files in SPARK_CONF_DIR to distributed archive
Kent Yao created SPARK-22463: Summary: Missing hadoop/hive/hbase/etc configuration files in SPARK_CONF_DIR to distributed archive Key: SPARK-22463 URL: https://issues.apache.org/jira/browse/SPARK-22463 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 2.2.0, 2.1.2 Reporter: Kent Yao When I ran self-contained SQL apps, such as {code:java} import org.apache.spark.sql.SparkSession object ShowHiveTables { def main(args: Array[String]): Unit = { val spark = SparkSession .builder() .appName("Show Hive Tables") .enableHiveSupport() .getOrCreate() spark.sql("show tables").show() spark.stop() } } {code} in **yarn cluster** mode with `hive-site.xml` correctly placed in `$SPARK_HOME/conf`, they failed to connect to the right Hive metastore because hive-site.xml was not on the AM/Driver's classpath. Although submitting them with `--files/--jars local/path/to/hive-site.xml` or putting it into `$HADOOP_CONF_DIR/YARN_CONF_DIR` can make these apps work as well in cluster mode as in client mode, the official doc (see http://spark.apache.org/docs/latest/sql-programming-guide.html#hive-tables) says: > Configuration of Hive is done by placing your hive-site.xml, core-site.xml > (for security configuration), and hdfs-site.xml (for HDFS configuration) file > in conf/. We should either respect these configuration files too or update the doc for Hive tables in cluster mode. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-22462) SQL metrics missing after foreach operation on dataframe
[ https://issues.apache.org/jira/browse/SPARK-22462?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Juliusz Sompolski updated SPARK-22462: -- Attachment: collect.png foreach.png > SQL metrics missing after foreach operation on dataframe > > > Key: SPARK-22462 > URL: https://issues.apache.org/jira/browse/SPARK-22462 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Juliusz Sompolski > Attachments: collect.png, foreach.png > > > No SQL metrics are visible in the SQL tab of SparkUI when foreach is executed > on the DataFrame. > e.g. > {code} > sql("select * from range(10)").collect() > sql("select * from range(10)").foreach(a => Unit) > sql("select * from range(10)").foreach(a => println(a)) > {code} > See collect.png vs. foreach.png -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-22462) SQL metrics missing after foreach operation on dataframe
Juliusz Sompolski created SPARK-22462: - Summary: SQL metrics missing after foreach operation on dataframe Key: SPARK-22462 URL: https://issues.apache.org/jira/browse/SPARK-22462 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.2.0 Reporter: Juliusz Sompolski No SQL metrics are visible in the SQL tab of SparkUI when foreach is executed on the DataFrame. e.g. {code} sql("select * from range(10)").collect() sql("select * from range(10)").foreach(a => Unit) sql("select * from range(10)").foreach(a => println(a)) {code} See collect.png vs. foreach.png -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20652) Make SQL UI use new app state store
[ https://issues.apache.org/jira/browse/SPARK-20652?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20652: Assignee: Apache Spark > Make SQL UI use new app state store > --- > > Key: SPARK-20652 > URL: https://issues.apache.org/jira/browse/SPARK-20652 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Marcelo Vanzin >Assignee: Apache Spark > > See spec in parent issue (SPARK-18085) for more details. > This task tracks modifying the SQL listener and UI code to use the new app > state store. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20652) Make SQL UI use new app state store
[ https://issues.apache.org/jira/browse/SPARK-20652?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16241245#comment-16241245 ] Apache Spark commented on SPARK-20652: -- User 'vanzin' has created a pull request for this issue: https://github.com/apache/spark/pull/19681 > Make SQL UI use new app state store > --- > > Key: SPARK-20652 > URL: https://issues.apache.org/jira/browse/SPARK-20652 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Marcelo Vanzin > > See spec in parent issue (SPARK-18085) for more details. > This task tracks modifying the SQL listener and UI code to use the new app > state store. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20652) Make SQL UI use new app state store
[ https://issues.apache.org/jira/browse/SPARK-20652?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20652: Assignee: (was: Apache Spark) > Make SQL UI use new app state store > --- > > Key: SPARK-20652 > URL: https://issues.apache.org/jira/browse/SPARK-20652 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Marcelo Vanzin > > See spec in parent issue (SPARK-18085) for more details. > This task tracks modifying the SQL listener and UI code to use the new app > state store. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-22330) Linear containsKey operation for serialized maps.
[ https://issues.apache.org/jira/browse/SPARK-22330?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-22330: --- Assignee: Alexander > Linear containsKey operation for serialized maps. > - > > Key: SPARK-22330 > URL: https://issues.apache.org/jira/browse/SPARK-22330 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.2.1, 2.2.0 >Reporter: Alexander >Assignee: Alexander > Labels: performance > Fix For: 2.3.0 > > Original Estimate: 5m > Remaining Estimate: 5m > > One of our production application which aggressively uses cached spark RDDs > degraded after increasing volumes of data though it shouldn't. Fast profiling > session showed that the slowest part was SerializableMapWrapper#containsKey: > it delegates get and remove to actual implementation, but containsKey is > inherited from AbstractMap which is implemented in linear time via iteration > over whole keySet. A workaround was simple: replacing all containsKey with > get(key) != null solved the issue. > Nevertheless, it would be much simpler for everyone if the issue will be > fixed once and for all. > A fix is straightforward, delegate containsKey to actual implementation. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-22330) Linear containsKey operation for serialized maps.
[ https://issues.apache.org/jira/browse/SPARK-22330?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-22330. - Resolution: Fixed Fix Version/s: 2.3.0 Issue resolved by pull request 19553 [https://github.com/apache/spark/pull/19553] > Linear containsKey operation for serialized maps. > - > > Key: SPARK-22330 > URL: https://issues.apache.org/jira/browse/SPARK-22330 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.2.1, 2.2.0 >Reporter: Alexander > Labels: performance > Fix For: 2.3.0 > > Original Estimate: 5m > Remaining Estimate: 5m > > One of our production application which aggressively uses cached spark RDDs > degraded after increasing volumes of data though it shouldn't. Fast profiling > session showed that the slowest part was SerializableMapWrapper#containsKey: > it delegates get and remove to actual implementation, but containsKey is > inherited from AbstractMap which is implemented in linear time via iteration > over whole keySet. A workaround was simple: replacing all containsKey with > get(key) != null solved the issue. > Nevertheless, it would be much simpler for everyone if the issue will be > fixed once and for all. > A fix is straightforward, delegate containsKey to actual implementation. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
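For clarity, a standalone sketch of the delegation idea described above; this is not the actual Spark class, just an illustration of forwarding containsKey to the wrapped map instead of inheriting AbstractMap's linear scan, using the get(key) != null workaround mentioned in the description.

{code}
import java.util.{AbstractMap, Map => JMap, Set => JSet}

class MapWrapperSketch[A, B](underlying: scala.collection.Map[A, B])
  extends AbstractMap[A, B] with Serializable {

  // Required by AbstractMap; materializes the entries for iteration.
  override def entrySet(): JSet[JMap.Entry[A, B]] = {
    val set = new java.util.HashSet[JMap.Entry[A, B]]()
    underlying.foreach { case (k, v) =>
      set.add(new AbstractMap.SimpleImmutableEntry(k, v))
    }
    set
  }

  // Delegates lookup to the wrapped map rather than scanning the entry set.
  override def get(key: AnyRef): B =
    underlying.getOrElse(key.asInstanceOf[A], null.asInstanceOf[B])

  // The fix described above: containsKey also delegates, instead of using
  // AbstractMap's default, which iterates over all keys in linear time.
  override def containsKey(key: AnyRef): Boolean = get(key) != null
}
{code}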
[jira] [Created] (SPARK-22461) Move Spark ML model summaries into a dedicated package
Seth Hendrickson created SPARK-22461: Summary: Move Spark ML model summaries into a dedicated package Key: SPARK-22461 URL: https://issues.apache.org/jira/browse/SPARK-22461 Project: Spark Issue Type: Improvement Components: ML Affects Versions: 2.3.0 Reporter: Seth Hendrickson Priority: Minor Summaries in ML right now do not adhere to a common abstraction, and are usually placed in the same file as the algorithm, which makes these files unwieldy. We can and should unify them under one hierarchy, perhaps in a new {{summary}} module. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22451) Reduce decision tree aggregate size for unordered features from O(2^numCategories) to O(numCategories)
[ https://issues.apache.org/jira/browse/SPARK-22451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16240967#comment-16240967 ] Joseph K. Bradley commented on SPARK-22451: --- How does this work? For unordered features, we need to keep track of stats for all possible 2^numCategories splits. That information cannot be reconstructed from a summary with only O(numCategories) values. > Reduce decision tree aggregate size for unordered features from > O(2^numCategories) to O(numCategories) > -- > > Key: SPARK-22451 > URL: https://issues.apache.org/jira/browse/SPARK-22451 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.2.0 >Reporter: Weichen Xu > Original Estimate: 24h > Remaining Estimate: 24h > > Do not need generate all possible splits for unordered features before > aggregate, > in aggregete (executor side): > 1. Change `mixedBinSeqOp`, for each unordered feature, we do the same stat > with ordered features. so for unordered features, we only need > O(numCategories) space for this feature stat. > 2. After driver side get the aggregate result, generate all possible split > combinations, and compute the best split. > This will reduce decision tree aggregate size for each unordered feature from > O(2^numCategories) to O(numCategories), `numCategories` is the arity of this > unordered feature. > This also reduce the cpu cost in executor side. Reduce time complexity for > this unordered feature from O(numPoints * 2^numCategories) to O(numPoints). > This won't increase time complexity for unordered features best split > computing in driver side. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-22078) clarify exception behaviors for all data source v2 interfaces
[ https://issues.apache.org/jira/browse/SPARK-22078?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-22078. - Resolution: Fixed Fix Version/s: 2.3.0 Issue resolved by pull request 19623 [https://github.com/apache/spark/pull/19623] > clarify exception behaviors for all data source v2 interfaces > - > > Key: SPARK-22078 > URL: https://issues.apache.org/jira/browse/SPARK-22078 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Wenchen Fan > Fix For: 2.3.0 > > -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20647) Make the Storage page use new app state store
[ https://issues.apache.org/jira/browse/SPARK-20647?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16240895#comment-16240895 ] Apache Spark commented on SPARK-20647: -- User 'vanzin' has created a pull request for this issue: https://github.com/apache/spark/pull/19679 > Make the Storage page use new app state store > - > > Key: SPARK-20647 > URL: https://issues.apache.org/jira/browse/SPARK-20647 > Project: Spark > Issue Type: Sub-task > Components: Web UI >Affects Versions: 2.3.0 >Reporter: Marcelo Vanzin > > See spec in parent issue (SPARK-18085) for more details. > This task tracks making the Storage page use the new app state store. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20647) Make the Storage page use new app state store
[ https://issues.apache.org/jira/browse/SPARK-20647?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20647: Assignee: (was: Apache Spark) > Make the Storage page use new app state store > - > > Key: SPARK-20647 > URL: https://issues.apache.org/jira/browse/SPARK-20647 > Project: Spark > Issue Type: Sub-task > Components: Web UI >Affects Versions: 2.3.0 >Reporter: Marcelo Vanzin > > See spec in parent issue (SPARK-18085) for more details. > This task tracks making the Storage page use the new app state store. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20647) Make the Storage page use new app state store
[ https://issues.apache.org/jira/browse/SPARK-20647?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20647: Assignee: Apache Spark > Make the Storage page use new app state store > - > > Key: SPARK-20647 > URL: https://issues.apache.org/jira/browse/SPARK-20647 > Project: Spark > Issue Type: Sub-task > Components: Web UI >Affects Versions: 2.3.0 >Reporter: Marcelo Vanzin >Assignee: Apache Spark > > See spec in parent issue (SPARK-18085) for more details. > This task tracks making the Storage page use the new app state store. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-22460) Spark De-serialization of Timestamp field is Incorrect
[ https://issues.apache.org/jira/browse/SPARK-22460?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Saniya Tech updated SPARK-22460: Description: We are trying to serialize Timestamp fields to Avro using spark-avro connector. I can see the Timestamp fields are getting correctly serialized as long (milliseconds since Epoch). I verified that the data is correctly read back from the Avro files. It is when we encode the Dataset as a case class that timestamp field is incorrectly converted to a long value as seconds since Epoch. As can be seen below, this shifts the timestamp many years in the future. Code used to reproduce the issue: {code:java} import java.sql.Timestamp import com.databricks.spark.avro._ import org.apache.spark.sql.{Dataset, Row, SaveMode, SparkSession} case class TestRecord(name: String, modified: Timestamp) import spark.implicits._ val data = Seq( TestRecord("One", new Timestamp(System.currentTimeMillis())) ) // Serialize: val parameters = Map("recordName" -> "TestRecord", "recordNamespace" -> "com.example.domain") val path = s"s3a://some-bucket/output/" val ds = spark.createDataset(data) ds.write .options(parameters) .mode(SaveMode.Overwrite) .avro(path) // // De-serialize val output = spark.read.avro(path).as[TestRecord] {code} Output from the test: {code:java} scala> data.head res4: TestRecord = TestRecord(One,2017-11-06 20:06:19.419) scala> output.collect().head res5: TestRecord = TestRecord(One,49819-12-16 17:23:39.0) {code} was: We are trying to serialize Timestamp fields to Avro using spark-avro connector. I can see the Timestamp fields are getting correctly serialized as long (milliseconds since Epoch). I verified that the data is correctly read back from the Avro files. It is when we encode the Dataset as a case class that timestamp field is incorrectly converted to as long value as seconds since Epoch. As can be seen below, this shifts the timestamp many years in the future. Code used to reproduce the issue: {code:java} import java.sql.Timestamp import com.databricks.spark.avro._ import org.apache.spark.sql.{Dataset, Row, SaveMode, SparkSession} case class TestRecord(name: String, modified: Timestamp) import spark.implicits._ val data = Seq( TestRecord("One", new Timestamp(System.currentTimeMillis())) ) // Serialize: val parameters = Map("recordName" -> "TestRecord", "recordNamespace" -> "com.example.domain") val path = s"s3a://some-bucket/output/" val ds = spark.createDataset(data) ds.write .options(parameters) .mode(SaveMode.Overwrite) .avro(path) // // De-serialize val output = spark.read.avro(path).as[TestRecord] {code} Output from the test: {code:java} scala> data.head res4: TestRecord = TestRecord(One,2017-11-06 20:06:19.419) scala> output.collect().head res5: TestRecord = TestRecord(One,49819-12-16 17:23:39.0) {code} > Spark De-serialization of Timestamp field is Incorrect > -- > > Key: SPARK-22460 > URL: https://issues.apache.org/jira/browse/SPARK-22460 > Project: Spark > Issue Type: Bug > Components: Input/Output >Affects Versions: 2.1.1 >Reporter: Saniya Tech > > We are trying to serialize Timestamp fields to Avro using spark-avro > connector. I can see the Timestamp fields are getting correctly serialized as > long (milliseconds since Epoch). I verified that the data is correctly read > back from the Avro files. It is when we encode the Dataset as a case class > that timestamp field is incorrectly converted to a long value as seconds > since Epoch. As can be seen below, this shifts the timestamp many years in > the future. 
> Code used to reproduce the issue: > {code:java} > import java.sql.Timestamp > import com.databricks.spark.avro._ > import org.apache.spark.sql.{Dataset, Row, SaveMode, SparkSession} > case class TestRecord(name: String, modified: Timestamp) > import spark.implicits._ > val data = Seq( > TestRecord("One", new Timestamp(System.currentTimeMillis())) > ) > // Serialize: > val parameters = Map("recordName" -> "TestRecord", "recordNamespace" -> > "com.example.domain") > val path = s"s3a://some-bucket/output/" > val ds = spark.createDataset(data) > ds.write > .options(parameters) > .mode(SaveMode.Overwrite) > .avro(path) > // > // De-serialize > val output = spark.read.avro(path).as[TestRecord] > {code} > Output from the test: > {code:java} > scala> data.head > res4: TestRecord = TestRecord(One,2017-11-06 20:06:19.419) > scala> output.collect().head > res5: TestRecord = TestRecord(One,49819-12-16 17:23:39.0) > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
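For illustration, a possible workaround sketch (not from the report): read the Avro long column explicitly and convert it from milliseconds since the epoch rather than relying on {{.as[TestRecord]}}. It assumes the reporter's context ({{spark}}, {{path}}, {{TestRecord}}, and {{spark.implicits._}} in scope) and that the stored value is milliseconds, as observed above; casting a numeric column to timestamp in Spark SQL interprets it as seconds, hence the division by 1000.

{code:java}
// Hypothetical workaround sketch, assuming the Avro "modified" column holds
// milliseconds since the epoch: convert to seconds before casting.
import com.databricks.spark.avro._
import org.apache.spark.sql.functions.col

val output = spark.read.avro(path)
  .withColumn("modified", (col("modified") / 1000).cast("timestamp"))
  .as[TestRecord]
{code}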
[jira] [Created] (SPARK-22460) Spark De-serialization of Timestamp field is Incorrect
Saniya Tech created SPARK-22460: --- Summary: Spark De-serialization of Timestamp field is Incorrect Key: SPARK-22460 URL: https://issues.apache.org/jira/browse/SPARK-22460 Project: Spark Issue Type: Bug Components: Input/Output Affects Versions: 2.1.1 Reporter: Saniya Tech We are trying to serialize Timestamp fields to Avro using spark-avro connector. I can see the Timestamp fields are getting correctly serialized as long (milliseconds since Epoch). I verified that the data is correctly read back from the Avro files. It is when we encode the Dataset as a case class that timestamp field is incorrectly converted to as long value as seconds since Epoch. As can be seen below, this shifts the timestamp many years in the future. Code used to reproduce the issue: {code:java} import java.sql.Timestamp import com.databricks.spark.avro._ import org.apache.spark.sql.{Dataset, Row, SaveMode, SparkSession} case class TestRecord(name: String, modified: Timestamp) import spark.implicits._ val data = Seq( TestRecord("One", new Timestamp(System.currentTimeMillis())) ) // Serialize: val parameters = Map("recordName" -> "TestRecord", "recordNamespace" -> "com.example.domain") val path = s"s3a://some-bucket/output/" val ds = spark.createDataset(data) ds.write .options(parameters) .mode(SaveMode.Overwrite) .avro(path) // // De-serialize val output = spark.read.avro(path).as[TestRecord] {code} Output from the test: {code:java} scala> data.head res4: TestRecord = TestRecord(One,2017-11-06 20:06:19.419) scala> output.collect().head res5: TestRecord = TestRecord(One,49819-12-16 17:23:39.0) {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20646) Make Executors page use new app state store
[ https://issues.apache.org/jira/browse/SPARK-20646?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20646: Assignee: Apache Spark > Make Executors page use new app state store > --- > > Key: SPARK-20646 > URL: https://issues.apache.org/jira/browse/SPARK-20646 > Project: Spark > Issue Type: Sub-task > Components: Web UI >Affects Versions: 2.3.0 >Reporter: Marcelo Vanzin >Assignee: Apache Spark > > See spec in parent issue (SPARK-18085) for more details. > This task tracks making the Executors page use the new app state store. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20646) Make Executors page use new app state store
[ https://issues.apache.org/jira/browse/SPARK-20646?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20646: Assignee: (was: Apache Spark) > Make Executors page use new app state store > --- > > Key: SPARK-20646 > URL: https://issues.apache.org/jira/browse/SPARK-20646 > Project: Spark > Issue Type: Sub-task > Components: Web UI >Affects Versions: 2.3.0 >Reporter: Marcelo Vanzin > > See spec in parent issue (SPARK-18085) for more details. > This task tracks making the Executors page use the new app state store. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20646) Make Executors page use new app state store
[ https://issues.apache.org/jira/browse/SPARK-20646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16240805#comment-16240805 ] Apache Spark commented on SPARK-20646: -- User 'vanzin' has created a pull request for this issue: https://github.com/apache/spark/pull/19678 > Make Executors page use new app state store > --- > > Key: SPARK-20646 > URL: https://issues.apache.org/jira/browse/SPARK-20646 > Project: Spark > Issue Type: Sub-task > Components: Web UI >Affects Versions: 2.3.0 >Reporter: Marcelo Vanzin > > See spec in parent issue (SPARK-18085) for more details. > This task tracks making the Executors page use the new app state store. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20645) Make Environment page use new app state store
[ https://issues.apache.org/jira/browse/SPARK-20645?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16240783#comment-16240783 ] Apache Spark commented on SPARK-20645: -- User 'vanzin' has created a pull request for this issue: https://github.com/apache/spark/pull/19677 > Make Environment page use new app state store > - > > Key: SPARK-20645 > URL: https://issues.apache.org/jira/browse/SPARK-20645 > Project: Spark > Issue Type: Sub-task > Components: Web UI >Affects Versions: 2.3.0 >Reporter: Marcelo Vanzin > > See spec in parent issue (SPARK-18085) for more details. > This task tracks porting the Environment page to use the new key-value store. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20645) Make Environment page use new app state store
[ https://issues.apache.org/jira/browse/SPARK-20645?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20645: Assignee: Apache Spark > Make Environment page use new app state store > - > > Key: SPARK-20645 > URL: https://issues.apache.org/jira/browse/SPARK-20645 > Project: Spark > Issue Type: Sub-task > Components: Web UI >Affects Versions: 2.3.0 >Reporter: Marcelo Vanzin >Assignee: Apache Spark > > See spec in parent issue (SPARK-18085) for more details. > This task tracks porting the Environment page to use the new key-value store. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20645) Make Environment page use new app state store
[ https://issues.apache.org/jira/browse/SPARK-20645?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20645: Assignee: (was: Apache Spark) > Make Environment page use new app state store > - > > Key: SPARK-20645 > URL: https://issues.apache.org/jira/browse/SPARK-20645 > Project: Spark > Issue Type: Sub-task > Components: Web UI >Affects Versions: 2.3.0 >Reporter: Marcelo Vanzin > > See spec in parent issue (SPARK-18085) for more details. > This task tracks porting the Environment page to use the new key-value store. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-22459) EdgeDirection "Either" Does Not Considerate Real "Either" Direction
[ https://issues.apache.org/jira/browse/SPARK-22459?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tom updated SPARK-22459: Description: When running functions involving [EdgeDirection|https://github.com/apache/spark/blob/master/graphx/src/main/scala/org/apache/spark/graphx/EdgeDirection.scala], the _EdgeDirection.Either_ type does not truly perform the actual specified direction (_either_). For instance, when using _Pregel API_ with a simple graph as shown: !http://www.icodeguru.com/CPP/10book/books/book9/images/fig6_1.gif! With _EdgeDirection.Either_ you would guess that vertex #3 will send a message (when activated) to vertices 1, 2, and 4, but in reality it does not. This might be bypassed, but in an expansive and ineffective way. Tested with 2.2.0, 2.1.1 and 2.1.2 spark versions. was: When running functions involving [EdgeDirection|https://github.com/apache/spark/blob/master/graphx/src/main/scala/org/apache/spark/graphx/EdgeDirection.scala], the _EdgeDirection.Either_ type does not truly perform the actual specified direction (_either_). For instance, when using _Pregel API_ with a simple graph as shown: !http://www.icodeguru.com/CPP/10book/books/book9/images/fig6_1.gif|thumbnail! With _EdgeDirection.Either_ you would guess that vertex #3 will send a message (when activated) to vertices 1, 2, and 4, but in reality it does not. This might be bypassed, but in an expansive and ineffective way. Tested with 2.2.0, 2.1.1 and 2.1.2 spark versions. > EdgeDirection "Either" Does Not Considerate Real "Either" Direction > --- > > Key: SPARK-22459 > URL: https://issues.apache.org/jira/browse/SPARK-22459 > Project: Spark > Issue Type: Bug > Components: GraphX >Affects Versions: 2.1.1, 2.1.2, 2.2.0 > Environment: Windows 7. Yarn cluster / client mode. Tested with > 2.2.0, 2.1.1 and 2.1.2 spark versions. >Reporter: Tom > > When running functions involving > [EdgeDirection|https://github.com/apache/spark/blob/master/graphx/src/main/scala/org/apache/spark/graphx/EdgeDirection.scala], > the _EdgeDirection.Either_ type does not truly perform the actual specified > direction (_either_). > For instance, when using _Pregel API_ with a simple graph as shown: > !http://www.icodeguru.com/CPP/10book/books/book9/images/fig6_1.gif! > With _EdgeDirection.Either_ you would guess that vertex #3 will send a > message (when activated) to vertices 1, 2, and 4, but in reality it does not. > This might be bypassed, but in an expansive and ineffective way. > Tested with 2.2.0, 2.1.1 and 2.1.2 spark versions. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-22459) EdgeDirection "Either" Does Not Considerate Real "Either" Direction
[ https://issues.apache.org/jira/browse/SPARK-22459?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tom updated SPARK-22459: Description: When running functions involving [EdgeDirection|https://github.com/apache/spark/blob/master/graphx/src/main/scala/org/apache/spark/graphx/EdgeDirection.scala], the _EdgeDirection.Either_ type does not truly perform the actual specified direction (_either_). For instance, when using _Pregel API_ with a simple graph as shown: !http://www.icodeguru.com/CPP/10book/books/book9/images/fig6_1.gif|thumbnail! With _EdgeDirection.Either_ you would guess that vertex #3 will send a message (when activated) to vertices 1, 2, and 4, but in reality it does not. This might be bypassed, but in an expansive and ineffective way. Tested with 2.2.0, 2.1.1 and 2.1.2 spark versions. was: When running functions involving [EdgeDirection|https://github.com/apache/spark/blob/master/graphx/src/main/scala/org/apache/spark/graphx/EdgeDirection.scala], the _EdgeDirection.Either_ type does not truly perform the actual specified direction (_either_). For instance, when using _Pregel API_ with a simple graph as shown: !http://www.icodeguru.com/CPP/10book/books/book9/images/fig6_1.gif! With _EdgeDirection.Either_ you would guess that vertex #3 will send a message (when activated) to vertices 1, 2, and 4, but in reality it does not. This might be bypassed, but in an expansive and ineffective way. Tested with 2.2.0, 2.1.1 and 2.1.2 spark versions. > EdgeDirection "Either" Does Not Considerate Real "Either" Direction > --- > > Key: SPARK-22459 > URL: https://issues.apache.org/jira/browse/SPARK-22459 > Project: Spark > Issue Type: Bug > Components: GraphX >Affects Versions: 2.1.1, 2.1.2, 2.2.0 > Environment: Windows 7. Yarn cluster / client mode. Tested with > 2.2.0, 2.1.1 and 2.1.2 spark versions. >Reporter: Tom > > When running functions involving > [EdgeDirection|https://github.com/apache/spark/blob/master/graphx/src/main/scala/org/apache/spark/graphx/EdgeDirection.scala], > the _EdgeDirection.Either_ type does not truly perform the actual specified > direction (_either_). > For instance, when using _Pregel API_ with a simple graph as shown: > !http://www.icodeguru.com/CPP/10book/books/book9/images/fig6_1.gif|thumbnail! > With _EdgeDirection.Either_ you would guess that vertex #3 will send a > message (when activated) to vertices 1, 2, and 4, but in reality it does not. > This might be bypassed, but in an expansive and ineffective way. > Tested with 2.2.0, 2.1.1 and 2.1.2 spark versions. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
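For context, a minimal Pregel sketch (my own illustration, not the reporter's code) showing where the {{activeDirection}} parameter under discussion enters; the graph type and vertex values are hypothetical.

{code:java}
// Hypothetical sketch: propagate the minimum vertex value with
// activeDirection = EdgeDirection.Either, i.e. an edge is expected to be
// considered when either of its endpoints received a message.
import org.apache.spark.graphx._

def propagateMin(graph: Graph[Long, Int]): Graph[Long, Int] =
  Pregel(graph, initialMsg = Long.MaxValue,
         activeDirection = EdgeDirection.Either)(
    vprog = (id, attr, msg) => math.min(attr, msg),
    sendMsg = triplet =>
      if (triplet.srcAttr < triplet.dstAttr)
        Iterator((triplet.dstId, triplet.srcAttr))
      else if (triplet.dstAttr < triplet.srcAttr)
        Iterator((triplet.srcId, triplet.dstAttr))
      else Iterator.empty,
    mergeMsg = (a, b) => math.min(a, b))
{code}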
[jira] [Updated] (SPARK-22459) EdgeDirection "Either" Does Not Considerate Real "Either" Direction
[ https://issues.apache.org/jira/browse/SPARK-22459?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tom updated SPARK-22459: Description: When running functions involving [EdgeDirection|https://github.com/apache/spark/blob/master/graphx/src/main/scala/org/apache/spark/graphx/EdgeDirection.scala], the _EdgeDirection.Either_ type does not truly perform the actual specified direction (_either_). For instance, when using _Pregel API_ with a simple graph as shown: !http://www.icodeguru.com/CPP/10book/books/book9/images/fig6_1.gif! With _EdgeDirection.Either_ you would guess that vertex #3 will send a message (when activated) to vertices 1, 2, and 4, but in reality it does not. This might be bypassed, but in an expansive and ineffective way. Tested with 2.2.0, 2.1.1 and 2.1.2 spark versions. was: When running functions involving [EdgeDirection|https://github.com/apache/spark/blob/master/graphx/src/main/scala/org/apache/spark/graphx/EdgeDirection.scala], the _EdgeDirection.Either_ type does not truly perform the actual specified direction (_either_). For instance, when using _Pregel API_ with a simple graph as shown: (inline base64-encoded image omitted; later revisions of this description reference http://www.icodeguru.com/CPP/10book/books/book9/images/fig6_1.gif instead) With _EdgeDirection.Either_ you would guess that vertex #3 will send a message (when activated) to vertices 1, 2, and 4, but in reality it does not. This might be bypassed, but in an expansive and ineffective way. Tested with 2.2.0, 2.1.1 and 2.1.2 spark versions.

[jira] [Created] (SPARK-22459) EdgeDirection "Either" Does Not Considerate real "either" direction
Tom created SPARK-22459: --- Summary: EdgeDirection "Either" Does Not Considerate real "either" direction Key: SPARK-22459 URL: https://issues.apache.org/jira/browse/SPARK-22459 Project: Spark Issue Type: Bug Components: GraphX Affects Versions: 2.2.0, 2.1.2, 2.1.1 Environment: Windows 7. Yarn cluster / client mode. Tested with 2.2.0, 2.1.1 and 2.1.2 spark versions. Reporter: Tom When running functions involving [EdgeDirection|https://github.com/apache/spark/blob/master/graphx/src/main/scala/org/apache/spark/graphx/EdgeDirection.scala], the _EdgeDirection.Either_ type does not truly perform the actual specified direction (_either_). For instance, when using _Pregel API_ with a simple graph as shown: (inline base64-encoded image omitted; later revisions of this description reference http://www.icodeguru.com/CPP/10book/books/book9/images/fig6_1.gif instead) With _EdgeDirection.Either_ you would guess that vertex #3 will send a message (when activated) to vertices 1, 2, and 4, but in reality it does not. This might be bypassed, but in an expansive and ineffective way. Tested with 2.2.0, 2.1.1 and 2.1.2 spark versions. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22427) StackOverFlowError when using FPGrowth
[ https://issues.apache.org/jira/browse/SPARK-22427?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16240662#comment-16240662 ] lyt commented on SPARK-22427: - Maybe it's problem of DataFrame. This problem won't occur when using pure rdds. > StackOverFlowError when using FPGrowth > -- > > Key: SPARK-22427 > URL: https://issues.apache.org/jira/browse/SPARK-22427 > Project: Spark > Issue Type: Bug > Components: ML, MLlib >Affects Versions: 2.2.0 > Environment: Centos Linux 3.10.0-327.el7.x86_64 > java 1.8.0.111 > spark 2.2.0 >Reporter: lyt > > code part: > val path = jobConfig.getString("hdfspath") > val vectordata = sc.sparkContext.textFile(path) > val finaldata = sc.createDataset(vectordata.map(obj => { > obj.split(" ") > }).filter(arr => arr.length > 0)).toDF("items") > val fpg = new FPGrowth() > > fpg.setMinSupport(minSupport).setItemsCol("items").setMinConfidence(minConfidence) > val train = fpg.fit(finaldata) > print(train.freqItemsets.count()) > print(train.associationRules.count()) > train.save("/tmp/FPGModel") > And encountered following exception: > Driver stacktrace: > at > org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1499) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1487) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1486) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) > at > org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1486) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:814) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:814) > at scala.Option.foreach(Option.scala:257) > at > org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:814) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1714) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1669) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1658) > at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48) > at > org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:630) > at org.apache.spark.SparkContext.runJob(SparkContext.scala:2022) > at org.apache.spark.SparkContext.runJob(SparkContext.scala:2043) > at org.apache.spark.SparkContext.runJob(SparkContext.scala:2062) > at org.apache.spark.SparkContext.runJob(SparkContext.scala:2087) > at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:936) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112) > at org.apache.spark.rdd.RDD.withScope(RDD.scala:362) > at org.apache.spark.rdd.RDD.collect(RDD.scala:935) > at > org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:278) > at > org.apache.spark.sql.Dataset$$anonfun$count$1.apply(Dataset.scala:2430) > at > org.apache.spark.sql.Dataset$$anonfun$count$1.apply(Dataset.scala:2429) > at org.apache.spark.sql.Dataset$$anonfun$55.apply(Dataset.scala:2837) > at > org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:65) > at 
org.apache.spark.sql.Dataset.withAction(Dataset.scala:2836) > at org.apache.spark.sql.Dataset.count(Dataset.scala:2429) > at DataMining.FPGrowth$.runJob(FPGrowth.scala:116) > at DataMining.testFPG$.main(FPGrowth.scala:36) > at DataMining.testFPG.main(FPGrowth.scala) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:755) > at > org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180) > at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205) > at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:119) > at org.apache.spark.deploy.Sp
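For illustration of the commenter's suggestion, a minimal sketch (not from the report) of the RDD-based API, org.apache.spark.mllib.fpm.FPGrowth, operating on plain RDDs; variable names follow the snippet in the description, and {{minSupport}}/{{minConfidence}} are assumed to be in scope.

{code:java}
// Hypothetical sketch using the RDD-based FPGrowth instead of the DataFrame API.
import org.apache.spark.mllib.fpm.FPGrowth

val transactions = sc.sparkContext.textFile(path)
  .map(_.split(" "))
  .filter(_.nonEmpty)

val model = new FPGrowth()
  .setMinSupport(minSupport)
  .run(transactions)

println(model.freqItemsets.count())
println(model.generateAssociationRules(minConfidence).count())
{code}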
[jira] [Closed] (SPARK-22458) OutOfDirectMemoryError with Spark 2.2
[ https://issues.apache.org/jira/browse/SPARK-22458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen closed SPARK-22458. - > OutOfDirectMemoryError with Spark 2.2 > - > > Key: SPARK-22458 > URL: https://issues.apache.org/jira/browse/SPARK-22458 > Project: Spark > Issue Type: Bug > Components: Shuffle, SQL, YARN >Affects Versions: 2.2.0 >Reporter: Kaushal Prajapati >Priority: Blocker > > We were using Spark 2.1 from last 6 months to execute multiple spark jobs > that is running 15 hour long for 50+ TB of source data with below > configurations successfully. > {quote}spark.master yarn > spark.driver.cores10 > spark.driver.maxResultSize5g > spark.driver.memory 20g > spark.executor.cores 5 > spark.executor.extraJavaOptions *-XX:+UseG1GC > -Dio.netty.maxDirectMemory=1024* -XX:MaxGCPauseMillis=6 > *-XX:MaxDirectMemorySize=2048m* > -Dlog4j.configuration=file:///conf/log4j.properties -Dhdp.version=2.5.3.0-37 > spark.driver.extraJavaOptions > *-Dio.netty.maxDirectMemory=2048 -XX:MaxDirectMemorySize=2048m* > -Dlog4j.configuration=file:///conf/log4j.properties -Dhdp.version=2.5.3.0-37 > spark.executor.instances 30 > spark.executor.memory 30g > *spark.kryoserializer.buffer.max 512m* > spark.network.timeout 12000s > spark.serializer > org.apache.spark.serializer.KryoSerializer > spark.shuffle.io.preferDirectBufs false > spark.sql.catalogImplementation hive > spark.sql.shuffle.partitions 5000 > spark.yarn.driver.memoryOverhead 1536 > spark.yarn.executor.memoryOverhead4096 > spark.core.connection.ack.wait.timeout600s > spark.scheduler.maxRegisteredResourcesWaitingTime 15s > spark.sql.hive.filesourcePartitionFileCacheSize 524288000 > spark.dynamicAllocation.executorIdleTimeout 3s > spark.dynamicAllocation.enabled true > spark.hadoop.yarn.timeline-service.enabledfalse > spark.shuffle.service.enabled true > spark.yarn.am.extraJavaOptions*-Dhdp.version=2.5.3.0-37 > -Dio.netty.maxDirectMemory=1024 -XX:MaxDirectMemorySize=1024m*{quote} > Recently we tried to upgrade from Spark 2.1 to Spark 2.2 to get some fixes > using latest version. But we started facing DirectBuffer outOfMemory error > and exceeding memory limits for executor memoryOverhead issue. To fix that we > started tweaking multiple properties but still issue persists. 
Relevant > information is shared below > Please let me any other details is requried, > > Snapshot for DirectMemory Error Stacktrace :- > {code:java} > 10:48:26.417 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 5.0 in > stage 5.3 (TID 25022, dedwdprshc070.de.xxx.com, executor 615): > FetchFailed(BlockManagerId(465, dedwdprshc061.de.xxx.com, 7337, None), > shuffleId=7, mapId=141, reduceId=3372, message= > org.apache.spark.shuffle.FetchFailedException: failed to allocate 65536 > byte(s) of direct memory (used: 1073699840, max: 1073741824) > at > org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:442) > at > org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:418) > at > org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:59) > at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434) > at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at > org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32) > at > org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.sort_addToSorter$(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:395) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.findNextInnerJoinRows$(Unknown >
[jira] [Resolved] (SPARK-22458) OutOfDirectMemoryError with Spark 2.2
[ https://issues.apache.org/jira/browse/SPARK-22458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-22458. --- Resolution: Not A Problem Changes in implementation between minor releases could change memory usage. Your overhead is pretty low. This is not bug and should not be reopened. > OutOfDirectMemoryError with Spark 2.2 > - > > Key: SPARK-22458 > URL: https://issues.apache.org/jira/browse/SPARK-22458 > Project: Spark > Issue Type: Bug > Components: Shuffle, SQL, YARN >Affects Versions: 2.2.0 >Reporter: Kaushal Prajapati >Priority: Blocker > > We were using Spark 2.1 from last 6 months to execute multiple spark jobs > that is running 15 hour long for 50+ TB of source data with below > configurations successfully. > {quote}spark.master yarn > spark.driver.cores10 > spark.driver.maxResultSize5g > spark.driver.memory 20g > spark.executor.cores 5 > spark.executor.extraJavaOptions *-XX:+UseG1GC > -Dio.netty.maxDirectMemory=1024* -XX:MaxGCPauseMillis=6 > *-XX:MaxDirectMemorySize=2048m* > -Dlog4j.configuration=file:///conf/log4j.properties -Dhdp.version=2.5.3.0-37 > spark.driver.extraJavaOptions > *-Dio.netty.maxDirectMemory=2048 -XX:MaxDirectMemorySize=2048m* > -Dlog4j.configuration=file:///conf/log4j.properties -Dhdp.version=2.5.3.0-37 > spark.executor.instances 30 > spark.executor.memory 30g > *spark.kryoserializer.buffer.max 512m* > spark.network.timeout 12000s > spark.serializer > org.apache.spark.serializer.KryoSerializer > spark.shuffle.io.preferDirectBufs false > spark.sql.catalogImplementation hive > spark.sql.shuffle.partitions 5000 > spark.yarn.driver.memoryOverhead 1536 > spark.yarn.executor.memoryOverhead4096 > spark.core.connection.ack.wait.timeout600s > spark.scheduler.maxRegisteredResourcesWaitingTime 15s > spark.sql.hive.filesourcePartitionFileCacheSize 524288000 > spark.dynamicAllocation.executorIdleTimeout 3s > spark.dynamicAllocation.enabled true > spark.hadoop.yarn.timeline-service.enabledfalse > spark.shuffle.service.enabled true > spark.yarn.am.extraJavaOptions*-Dhdp.version=2.5.3.0-37 > -Dio.netty.maxDirectMemory=1024 -XX:MaxDirectMemorySize=1024m*{quote} > Recently we tried to upgrade from Spark 2.1 to Spark 2.2 to get some fixes > using latest version. But we started facing DirectBuffer outOfMemory error > and exceeding memory limits for executor memoryOverhead issue. To fix that we > started tweaking multiple properties but still issue persists. 
Relevant > information is shared below > Please let me any other details is requried, > > Snapshot for DirectMemory Error Stacktrace :- > {code:java} > 10:48:26.417 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 5.0 in > stage 5.3 (TID 25022, dedwdprshc070.de.xxx.com, executor 615): > FetchFailed(BlockManagerId(465, dedwdprshc061.de.xxx.com, 7337, None), > shuffleId=7, mapId=141, reduceId=3372, message= > org.apache.spark.shuffle.FetchFailedException: failed to allocate 65536 > byte(s) of direct memory (used: 1073699840, max: 1073741824) > at > org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:442) > at > org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:418) > at > org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:59) > at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434) > at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at > org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32) > at > org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.sort_addToSorter$(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec
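If one does follow the suggestion in the resolution comment, a hypothetical spark-defaults sketch (placeholder values, not a tuning recommendation) would raise the overhead and direct-memory ceilings that appear in the reported configuration:

{code:java}
# Hypothetical sketch only; actual sizing depends on the workload.
spark.yarn.executor.memoryOverhead   8192
spark.executor.extraJavaOptions      -XX:+UseG1GC -XX:MaxDirectMemorySize=4g
{code}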
[jira] [Reopened] (SPARK-22458) OutOfDirectMemoryError with Spark 2.2
[ https://issues.apache.org/jira/browse/SPARK-22458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kaushal Prajapati reopened SPARK-22458: --- It's working with same configs using Spark 2.1, why not on Spark 2.2 and why Spark 2.2 demand extra memory and how much extra? > OutOfDirectMemoryError with Spark 2.2 > - > > Key: SPARK-22458 > URL: https://issues.apache.org/jira/browse/SPARK-22458 > Project: Spark > Issue Type: Bug > Components: Shuffle, SQL, YARN >Affects Versions: 2.2.0 >Reporter: Kaushal Prajapati >Priority: Blocker > > We were using Spark 2.1 from last 6 months to execute multiple spark jobs > that is running 15 hour long for 50+ TB of source data with below > configurations successfully. > {quote}spark.master yarn > spark.driver.cores10 > spark.driver.maxResultSize5g > spark.driver.memory 20g > spark.executor.cores 5 > spark.executor.extraJavaOptions *-XX:+UseG1GC > -Dio.netty.maxDirectMemory=1024* -XX:MaxGCPauseMillis=6 > *-XX:MaxDirectMemorySize=2048m* > -Dlog4j.configuration=file:///conf/log4j.properties -Dhdp.version=2.5.3.0-37 > spark.driver.extraJavaOptions > *-Dio.netty.maxDirectMemory=2048 -XX:MaxDirectMemorySize=2048m* > -Dlog4j.configuration=file:///conf/log4j.properties -Dhdp.version=2.5.3.0-37 > spark.executor.instances 30 > spark.executor.memory 30g > *spark.kryoserializer.buffer.max 512m* > spark.network.timeout 12000s > spark.serializer > org.apache.spark.serializer.KryoSerializer > spark.shuffle.io.preferDirectBufs false > spark.sql.catalogImplementation hive > spark.sql.shuffle.partitions 5000 > spark.yarn.driver.memoryOverhead 1536 > spark.yarn.executor.memoryOverhead4096 > spark.core.connection.ack.wait.timeout600s > spark.scheduler.maxRegisteredResourcesWaitingTime 15s > spark.sql.hive.filesourcePartitionFileCacheSize 524288000 > spark.dynamicAllocation.executorIdleTimeout 3s > spark.dynamicAllocation.enabled true > spark.hadoop.yarn.timeline-service.enabledfalse > spark.shuffle.service.enabled true > spark.yarn.am.extraJavaOptions*-Dhdp.version=2.5.3.0-37 > -Dio.netty.maxDirectMemory=1024 -XX:MaxDirectMemorySize=1024m*{quote} > Recently we tried to upgrade from Spark 2.1 to Spark 2.2 to get some fixes > using latest version. But we started facing DirectBuffer outOfMemory error > and exceeding memory limits for executor memoryOverhead issue. To fix that we > started tweaking multiple properties but still issue persists. 
Relevant > information is shared below > Please let me any other details is requried, > > Snapshot for DirectMemory Error Stacktrace :- > {code:java} > 10:48:26.417 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 5.0 in > stage 5.3 (TID 25022, dedwdprshc070.de.xxx.com, executor 615): > FetchFailed(BlockManagerId(465, dedwdprshc061.de.xxx.com, 7337, None), > shuffleId=7, mapId=141, reduceId=3372, message= > org.apache.spark.shuffle.FetchFailedException: failed to allocate 65536 > byte(s) of direct memory (used: 1073699840, max: 1073741824) > at > org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:442) > at > org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:418) > at > org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:59) > at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434) > at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at > org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32) > at > org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.sort_addToSorter$(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeSta
[jira] [Resolved] (SPARK-22315) Check for version match between R package and JVM
[ https://issues.apache.org/jira/browse/SPARK-22315?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shivaram Venkataraman resolved SPARK-22315. --- Resolution: Fixed Fix Version/s: 2.3.0 2.2.1 Issue resolved by pull request 19624 [https://github.com/apache/spark/pull/19624] > Check for version match between R package and JVM > - > > Key: SPARK-22315 > URL: https://issues.apache.org/jira/browse/SPARK-22315 > Project: Spark > Issue Type: Improvement > Components: SparkR >Affects Versions: 2.2.1 >Reporter: Shivaram Venkataraman > Fix For: 2.2.1, 2.3.0 > > > With the release of SparkR on CRAN we could have scenarios where users have a > newer version of package when compared to the Spark cluster they are > connecting to. > We should print appropriate warnings on either (a) connecting to a different > version R Backend (b) connecting to a Spark master running a different > version of Spark (this should ideally happen inside Scala ?) -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22458) OutOfDirectMemoryError with Spark 2.2
[ https://issues.apache.org/jira/browse/SPARK-22458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16240537#comment-16240537 ] Kaushal Prajapati commented on SPARK-22458: --- yes, but we tried multiple times, same code is working with Spark 2.1 not with Spark 2.2. > OutOfDirectMemoryError with Spark 2.2 > - > > Key: SPARK-22458 > URL: https://issues.apache.org/jira/browse/SPARK-22458 > Project: Spark > Issue Type: Bug > Components: Shuffle, SQL, YARN >Affects Versions: 2.2.0 >Reporter: Kaushal Prajapati >Priority: Blocker > > We were using Spark 2.1 from last 6 months to execute multiple spark jobs > that is running 15 hour long for 50+ TB of source data with below > configurations successfully. > {quote}spark.master yarn > spark.driver.cores10 > spark.driver.maxResultSize5g > spark.driver.memory 20g > spark.executor.cores 5 > spark.executor.extraJavaOptions *-XX:+UseG1GC > -Dio.netty.maxDirectMemory=1024* -XX:MaxGCPauseMillis=6 > *-XX:MaxDirectMemorySize=2048m* > -Dlog4j.configuration=file:///conf/log4j.properties -Dhdp.version=2.5.3.0-37 > spark.driver.extraJavaOptions > *-Dio.netty.maxDirectMemory=2048 -XX:MaxDirectMemorySize=2048m* > -Dlog4j.configuration=file:///conf/log4j.properties -Dhdp.version=2.5.3.0-37 > spark.executor.instances 30 > spark.executor.memory 30g > *spark.kryoserializer.buffer.max 512m* > spark.network.timeout 12000s > spark.serializer > org.apache.spark.serializer.KryoSerializer > spark.shuffle.io.preferDirectBufs false > spark.sql.catalogImplementation hive > spark.sql.shuffle.partitions 5000 > spark.yarn.driver.memoryOverhead 1536 > spark.yarn.executor.memoryOverhead4096 > spark.core.connection.ack.wait.timeout600s > spark.scheduler.maxRegisteredResourcesWaitingTime 15s > spark.sql.hive.filesourcePartitionFileCacheSize 524288000 > spark.dynamicAllocation.executorIdleTimeout 3s > spark.dynamicAllocation.enabled true > spark.hadoop.yarn.timeline-service.enabledfalse > spark.shuffle.service.enabled true > spark.yarn.am.extraJavaOptions*-Dhdp.version=2.5.3.0-37 > -Dio.netty.maxDirectMemory=1024 -XX:MaxDirectMemorySize=1024m*{quote} > Recently we tried to upgrade from Spark 2.1 to Spark 2.2 to get some fixes > using latest version. But we started facing DirectBuffer outOfMemory error > and exceeding memory limits for executor memoryOverhead issue. To fix that we > started tweaking multiple properties but still issue persists. 
Relevant > information is shared below > Please let me any other details is requried, > > Snapshot for DirectMemory Error Stacktrace :- > {code:java} > 10:48:26.417 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 5.0 in > stage 5.3 (TID 25022, dedwdprshc070.de.xxx.com, executor 615): > FetchFailed(BlockManagerId(465, dedwdprshc061.de.xxx.com, 7337, None), > shuffleId=7, mapId=141, reduceId=3372, message= > org.apache.spark.shuffle.FetchFailedException: failed to allocate 65536 > byte(s) of direct memory (used: 1073699840, max: 1073741824) > at > org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:442) > at > org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:418) > at > org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:59) > at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434) > at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at > org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32) > at > org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.sort_addToSorter$(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$
[jira] [Commented] (SPARK-14516) Clustering evaluator
[ https://issues.apache.org/jira/browse/SPARK-14516?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16240468#comment-16240468 ] Apache Spark commented on SPARK-14516: -- User 'mgaido91' has created a pull request for this issue: https://github.com/apache/spark/pull/19676 > Clustering evaluator > > > Key: SPARK-14516 > URL: https://issues.apache.org/jira/browse/SPARK-14516 > Project: Spark > Issue Type: New Feature > Components: ML >Affects Versions: 2.2.0 >Reporter: zhengruifeng >Assignee: Marco Gaido > Fix For: 2.3.0 > > > MLlib does not have any general purposed clustering metrics with a ground > truth. > In > [Scikit-Learn](http://scikit-learn.org/stable/modules/classes.html#clustering-metrics), > there are several kinds of metrics for this. > It may be meaningful to add some clustering metrics into MLlib. > This should be added as a {{ClusteringEvaluator}} class of extending > {{Evaluator}} in spark.ml. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
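For illustration, a minimal usage sketch of the evaluator added for this issue; it assumes a hypothetical {{predictions}} DataFrame with {{features}} and {{prediction}} columns produced by a clustering model.

{code:java}
// Hypothetical usage sketch of the spark.ml ClusteringEvaluator.
import org.apache.spark.ml.evaluation.ClusteringEvaluator

val evaluator = new ClusteringEvaluator()
  .setFeaturesCol("features")
  .setPredictionCol("prediction")

val silhouette = evaluator.evaluate(predictions)
println(s"Silhouette = $silhouette")
{code}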
[jira] [Assigned] (SPARK-14540) Support Scala 2.12 closures and Java 8 lambdas in ClosureCleaner
[ https://issues.apache.org/jira/browse/SPARK-14540?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-14540: Assignee: Apache Spark > Support Scala 2.12 closures and Java 8 lambdas in ClosureCleaner > > > Key: SPARK-14540 > URL: https://issues.apache.org/jira/browse/SPARK-14540 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Reporter: Josh Rosen >Assignee: Apache Spark > > Using https://github.com/JoshRosen/spark/tree/build-for-2.12, I tried running > ClosureCleanerSuite with Scala 2.12 and ran into two bad test failures: > {code} > [info] - toplevel return statements in closures are identified at cleaning > time *** FAILED *** (32 milliseconds) > [info] Expected exception > org.apache.spark.util.ReturnStatementInClosureException to be thrown, but no > exception was thrown. (ClosureCleanerSuite.scala:57) > {code} > and > {code} > [info] - user provided closures are actually cleaned *** FAILED *** (56 > milliseconds) > [info] Expected ReturnStatementInClosureException, but got > org.apache.spark.SparkException: Job aborted due to stage failure: Task not > serializable: java.io.NotSerializableException: java.lang.Object > [info]- element of array (index: 0) > [info]- array (class "[Ljava.lang.Object;", size: 1) > [info]- field (class "java.lang.invoke.SerializedLambda", name: > "capturedArgs", type: "class [Ljava.lang.Object;") > [info]- object (class "java.lang.invoke.SerializedLambda", > SerializedLambda[capturingClass=class > org.apache.spark.util.TestUserClosuresActuallyCleaned$, > functionalInterfaceMethod=scala/runtime/java8/JFunction1$mcII$sp.apply$mcII$sp:(I)I, > implementation=invokeStatic > org/apache/spark/util/TestUserClosuresActuallyCleaned$.org$apache$spark$util$TestUserClosuresActuallyCleaned$$$anonfun$69:(Ljava/lang/Object;I)I, > instantiatedMethodType=(I)I, numCaptured=1]) > [info]- element of array (index: 0) > [info]- array (class "[Ljava.lang.Object;", size: 1) > [info]- field (class "java.lang.invoke.SerializedLambda", name: > "capturedArgs", type: "class [Ljava.lang.Object;") > [info]- object (class "java.lang.invoke.SerializedLambda", > SerializedLambda[capturingClass=class org.apache.spark.rdd.RDD, > functionalInterfaceMethod=scala/Function3.apply:(Ljava/lang/Object;Ljava/lang/Object;Ljava/lang/Object;)Ljava/lang/Object;, > implementation=invokeStatic > org/apache/spark/rdd/RDD.org$apache$spark$rdd$RDD$$$anonfun$20$adapted:(Lscala/Function1;Lorg/apache/spark/TaskContext;Ljava/lang/Object;Lscala/collection/Iterator;)Lscala/collection/Iterator;, > > instantiatedMethodType=(Lorg/apache/spark/TaskContext;Ljava/lang/Object;Lscala/collection/Iterator;)Lscala/collection/Iterator;, > numCaptured=1]) > [info]- field (class "org.apache.spark.rdd.MapPartitionsRDD", name: > "f", type: "interface scala.Function3") > [info]- object (class "org.apache.spark.rdd.MapPartitionsRDD", > MapPartitionsRDD[2] at apply at Transformer.scala:22) > [info]- field (class "scala.Tuple2", name: "_1", type: "class > java.lang.Object") > [info]- root object (class "scala.Tuple2", (MapPartitionsRDD[2] at > apply at > Transformer.scala:22,org.apache.spark.SparkContext$$Lambda$957/431842435@6e803685)). > [info] This means the closure provided by user is not actually cleaned. > (ClosureCleanerSuite.scala:78) > {code} > We'll need to figure out a closure cleaning strategy which works for 2.12 > lambdas. 
-- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
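For context, a hypothetical example (not from the test suite) of the kind of closure the first failing test describes: a return that escapes the lambda passed to an RDD transformation, which ClosureCleaner is expected to reject with ReturnStatementInClosureException at cleaning time.

{code:java}
// Hypothetical illustration of a top-level return inside a closure.
import org.apache.spark.rdd.RDD

def firstDoubled(rdd: RDD[Int]): Int =
  rdd.map { x =>
    if (x < 0) return -1  // non-local return escaping the closure
    x * 2
  }.first()
{code}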
[jira] [Assigned] (SPARK-14540) Support Scala 2.12 closures and Java 8 lambdas in ClosureCleaner
[ https://issues.apache.org/jira/browse/SPARK-14540?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-14540: Assignee: (was: Apache Spark) > Support Scala 2.12 closures and Java 8 lambdas in ClosureCleaner > > > Key: SPARK-14540 > URL: https://issues.apache.org/jira/browse/SPARK-14540 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Reporter: Josh Rosen > > Using https://github.com/JoshRosen/spark/tree/build-for-2.12, I tried running > ClosureCleanerSuite with Scala 2.12 and ran into two bad test failures: > {code} > [info] - toplevel return statements in closures are identified at cleaning > time *** FAILED *** (32 milliseconds) > [info] Expected exception > org.apache.spark.util.ReturnStatementInClosureException to be thrown, but no > exception was thrown. (ClosureCleanerSuite.scala:57) > {code} > and > {code} > [info] - user provided closures are actually cleaned *** FAILED *** (56 > milliseconds) > [info] Expected ReturnStatementInClosureException, but got > org.apache.spark.SparkException: Job aborted due to stage failure: Task not > serializable: java.io.NotSerializableException: java.lang.Object > [info]- element of array (index: 0) > [info]- array (class "[Ljava.lang.Object;", size: 1) > [info]- field (class "java.lang.invoke.SerializedLambda", name: > "capturedArgs", type: "class [Ljava.lang.Object;") > [info]- object (class "java.lang.invoke.SerializedLambda", > SerializedLambda[capturingClass=class > org.apache.spark.util.TestUserClosuresActuallyCleaned$, > functionalInterfaceMethod=scala/runtime/java8/JFunction1$mcII$sp.apply$mcII$sp:(I)I, > implementation=invokeStatic > org/apache/spark/util/TestUserClosuresActuallyCleaned$.org$apache$spark$util$TestUserClosuresActuallyCleaned$$$anonfun$69:(Ljava/lang/Object;I)I, > instantiatedMethodType=(I)I, numCaptured=1]) > [info]- element of array (index: 0) > [info]- array (class "[Ljava.lang.Object;", size: 1) > [info]- field (class "java.lang.invoke.SerializedLambda", name: > "capturedArgs", type: "class [Ljava.lang.Object;") > [info]- object (class "java.lang.invoke.SerializedLambda", > SerializedLambda[capturingClass=class org.apache.spark.rdd.RDD, > functionalInterfaceMethod=scala/Function3.apply:(Ljava/lang/Object;Ljava/lang/Object;Ljava/lang/Object;)Ljava/lang/Object;, > implementation=invokeStatic > org/apache/spark/rdd/RDD.org$apache$spark$rdd$RDD$$$anonfun$20$adapted:(Lscala/Function1;Lorg/apache/spark/TaskContext;Ljava/lang/Object;Lscala/collection/Iterator;)Lscala/collection/Iterator;, > > instantiatedMethodType=(Lorg/apache/spark/TaskContext;Ljava/lang/Object;Lscala/collection/Iterator;)Lscala/collection/Iterator;, > numCaptured=1]) > [info]- field (class "org.apache.spark.rdd.MapPartitionsRDD", name: > "f", type: "interface scala.Function3") > [info]- object (class "org.apache.spark.rdd.MapPartitionsRDD", > MapPartitionsRDD[2] at apply at Transformer.scala:22) > [info]- field (class "scala.Tuple2", name: "_1", type: "class > java.lang.Object") > [info]- root object (class "scala.Tuple2", (MapPartitionsRDD[2] at > apply at > Transformer.scala:22,org.apache.spark.SparkContext$$Lambda$957/431842435@6e803685)). > [info] This means the closure provided by user is not actually cleaned. > (ClosureCleanerSuite.scala:78) > {code} > We'll need to figure out a closure cleaning strategy which works for 2.12 > lambdas. 
-- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14540) Support Scala 2.12 closures and Java 8 lambdas in ClosureCleaner
[ https://issues.apache.org/jira/browse/SPARK-14540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16240422#comment-16240422 ] Apache Spark commented on SPARK-14540: -- User 'srowen' has created a pull request for this issue: https://github.com/apache/spark/pull/19675 > Support Scala 2.12 closures and Java 8 lambdas in ClosureCleaner > > > Key: SPARK-14540 > URL: https://issues.apache.org/jira/browse/SPARK-14540 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Reporter: Josh Rosen > > Using https://github.com/JoshRosen/spark/tree/build-for-2.12, I tried running > ClosureCleanerSuite with Scala 2.12 and ran into two bad test failures: > {code} > [info] - toplevel return statements in closures are identified at cleaning > time *** FAILED *** (32 milliseconds) > [info] Expected exception > org.apache.spark.util.ReturnStatementInClosureException to be thrown, but no > exception was thrown. (ClosureCleanerSuite.scala:57) > {code} > and > {code} > [info] - user provided closures are actually cleaned *** FAILED *** (56 > milliseconds) > [info] Expected ReturnStatementInClosureException, but got > org.apache.spark.SparkException: Job aborted due to stage failure: Task not > serializable: java.io.NotSerializableException: java.lang.Object > [info]- element of array (index: 0) > [info]- array (class "[Ljava.lang.Object;", size: 1) > [info]- field (class "java.lang.invoke.SerializedLambda", name: > "capturedArgs", type: "class [Ljava.lang.Object;") > [info]- object (class "java.lang.invoke.SerializedLambda", > SerializedLambda[capturingClass=class > org.apache.spark.util.TestUserClosuresActuallyCleaned$, > functionalInterfaceMethod=scala/runtime/java8/JFunction1$mcII$sp.apply$mcII$sp:(I)I, > implementation=invokeStatic > org/apache/spark/util/TestUserClosuresActuallyCleaned$.org$apache$spark$util$TestUserClosuresActuallyCleaned$$$anonfun$69:(Ljava/lang/Object;I)I, > instantiatedMethodType=(I)I, numCaptured=1]) > [info]- element of array (index: 0) > [info]- array (class "[Ljava.lang.Object;", size: 1) > [info]- field (class "java.lang.invoke.SerializedLambda", name: > "capturedArgs", type: "class [Ljava.lang.Object;") > [info]- object (class "java.lang.invoke.SerializedLambda", > SerializedLambda[capturingClass=class org.apache.spark.rdd.RDD, > functionalInterfaceMethod=scala/Function3.apply:(Ljava/lang/Object;Ljava/lang/Object;Ljava/lang/Object;)Ljava/lang/Object;, > implementation=invokeStatic > org/apache/spark/rdd/RDD.org$apache$spark$rdd$RDD$$$anonfun$20$adapted:(Lscala/Function1;Lorg/apache/spark/TaskContext;Ljava/lang/Object;Lscala/collection/Iterator;)Lscala/collection/Iterator;, > > instantiatedMethodType=(Lorg/apache/spark/TaskContext;Ljava/lang/Object;Lscala/collection/Iterator;)Lscala/collection/Iterator;, > numCaptured=1]) > [info]- field (class "org.apache.spark.rdd.MapPartitionsRDD", name: > "f", type: "interface scala.Function3") > [info]- object (class "org.apache.spark.rdd.MapPartitionsRDD", > MapPartitionsRDD[2] at apply at Transformer.scala:22) > [info]- field (class "scala.Tuple2", name: "_1", type: "class > java.lang.Object") > [info]- root object (class "scala.Tuple2", (MapPartitionsRDD[2] at > apply at > Transformer.scala:22,org.apache.spark.SparkContext$$Lambda$957/431842435@6e803685)). > [info] This means the closure provided by user is not actually cleaned. > (ClosureCleanerSuite.scala:78) > {code} > We'll need to figure out a closure cleaning strategy which works for 2.12 > lambdas. 
-- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-22458) OutOfDirectMemoryError with Spark 2.2
[ https://issues.apache.org/jira/browse/SPARK-22458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-22458. --- Resolution: Not A Problem Please read http://spark.apache.org/contributing.html This just means you're out of YARN container memory. You need to increase the overhead size. This is not a bug. > OutOfDirectMemoryError with Spark 2.2 > - > > Key: SPARK-22458 > URL: https://issues.apache.org/jira/browse/SPARK-22458 > Project: Spark > Issue Type: Bug > Components: Shuffle, SQL, YARN >Affects Versions: 2.2.0 >Reporter: Kaushal Prajapati >Priority: Blocker > > We were using Spark 2.1 from last 6 months to execute multiple spark jobs > that is running 15 hour long for 50+ TB of source data with below > configurations successfully. > {quote}spark.master yarn > spark.driver.cores10 > spark.driver.maxResultSize5g > spark.driver.memory 20g > spark.executor.cores 5 > spark.executor.extraJavaOptions *-XX:+UseG1GC > -Dio.netty.maxDirectMemory=1024* -XX:MaxGCPauseMillis=6 > *-XX:MaxDirectMemorySize=2048m* > -Dlog4j.configuration=file:///conf/log4j.properties -Dhdp.version=2.5.3.0-37 > spark.driver.extraJavaOptions > *-Dio.netty.maxDirectMemory=2048 -XX:MaxDirectMemorySize=2048m* > -Dlog4j.configuration=file:///conf/log4j.properties -Dhdp.version=2.5.3.0-37 > spark.executor.instances 30 > spark.executor.memory 30g > *spark.kryoserializer.buffer.max 512m* > spark.network.timeout 12000s > spark.serializer > org.apache.spark.serializer.KryoSerializer > spark.shuffle.io.preferDirectBufs false > spark.sql.catalogImplementation hive > spark.sql.shuffle.partitions 5000 > spark.yarn.driver.memoryOverhead 1536 > spark.yarn.executor.memoryOverhead4096 > spark.core.connection.ack.wait.timeout600s > spark.scheduler.maxRegisteredResourcesWaitingTime 15s > spark.sql.hive.filesourcePartitionFileCacheSize 524288000 > spark.dynamicAllocation.executorIdleTimeout 3s > spark.dynamicAllocation.enabled true > spark.hadoop.yarn.timeline-service.enabledfalse > spark.shuffle.service.enabled true > spark.yarn.am.extraJavaOptions*-Dhdp.version=2.5.3.0-37 > -Dio.netty.maxDirectMemory=1024 -XX:MaxDirectMemorySize=1024m*{quote} > Recently we tried to upgrade from Spark 2.1 to Spark 2.2 to get some fixes > using latest version. But we started facing DirectBuffer outOfMemory error > and exceeding memory limits for executor memoryOverhead issue. To fix that we > started tweaking multiple properties but still issue persists. 
Relevant > information is shared below > Please let me any other details is requried, > > Snapshot for DirectMemory Error Stacktrace :- > {code:java} > 10:48:26.417 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 5.0 in > stage 5.3 (TID 25022, dedwdprshc070.de.xxx.com, executor 615): > FetchFailed(BlockManagerId(465, dedwdprshc061.de.xxx.com, 7337, None), > shuffleId=7, mapId=141, reduceId=3372, message= > org.apache.spark.shuffle.FetchFailedException: failed to allocate 65536 > byte(s) of direct memory (used: 1073699840, max: 1073741824) > at > org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:442) > at > org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:418) > at > org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:59) > at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434) > at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at > org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32) > at > org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.sort_addToSorter$(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.Whol
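As a rough illustration of the remedy in the resolution above (raise the container overhead rather than treat this as a bug), a hedged sketch follows. The numbers are placeholders, not tuned recommendations for this workload, and in practice these settings are usually supplied at submit time via spark-submit --conf rather than hard-coded.

{code:scala}
import org.apache.spark.{SparkConf, SparkContext}

object OverheadTuningSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("overhead-tuning-sketch")
      // Extra non-JVM memory per executor container, in MiB (placeholder value).
      .set("spark.yarn.executor.memoryOverhead", "6144")
      // Keep Netty's direct-buffer pool (bytes) comfortably below that overhead.
      .set("spark.executor.extraJavaOptions", "-Dio.netty.maxDirectMemory=2147483648")
    val sc = new SparkContext(conf)
    // ... the long-running job would go here ...
    sc.stop()
  }
}
{code}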
[jira] [Commented] (SPARK-22395) Fix the behavior of timestamp values for Pandas to respect session timezone
[ https://issues.apache.org/jira/browse/SPARK-22395?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16240396#comment-16240396 ] Apache Spark commented on SPARK-22395: -- User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/19674 > Fix the behavior of timestamp values for Pandas to respect session timezone > --- > > Key: SPARK-22395 > URL: https://issues.apache.org/jira/browse/SPARK-22395 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 2.3.0 >Reporter: Takuya Ueshin > > When converting Pandas DataFrame/Series from/to Spark DataFrame using > {{toPandas()}} or pandas udfs, timestamp values behave to respect Python > system timezone instead of session timezone. > For example, let's say we use {{"America/Los_Angeles"}} as session timezone > and have a timestamp value {{"1970-01-01 00:00:01"}} in the timezone. Btw, > I'm in Japan so Python timezone would be {{"Asia/Tokyo"}}. > The timestamp value from current {{toPandas()}} will be the following: > {noformat} > >>> spark.conf.set("spark.sql.session.timeZone", "America/Los_Angeles") > >>> df = spark.createDataFrame([28801], "long").selectExpr("timestamp(value) > >>> as ts") > >>> df.show() > +---+ > | ts| > +---+ > |1970-01-01 00:00:01| > +---+ > >>> df.toPandas() >ts > 0 1970-01-01 17:00:01 > {noformat} > As you can see, the value becomes {{"1970-01-01 17:00:01"}} because it > respects Python timezone. > As we discussed in https://github.com/apache/spark/pull/18664, we consider > this behavior is a bug and the value should be {{"1970-01-01 00:00:01"}}. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
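A Scala analogue of the quoted PySpark repro, included only as a sketch of the expected behaviour: the JVM side already renders the value in the session time zone, which is what the Pandas conversion should also do.

{code:scala}
import org.apache.spark.sql.SparkSession

object SessionTimeZoneRepro {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]").appName("tz-repro").getOrCreate()
    import spark.implicits._

    spark.conf.set("spark.sql.session.timeZone", "America/Los_Angeles")
    // Same epoch offset as in the PySpark example above.
    val df = Seq(28801L).toDF("value").selectExpr("timestamp(value) as ts")
    df.show() // 1970-01-01 00:00:01, rendered in the session time zone
    spark.stop()
  }
}
{code}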
[jira] [Resolved] (SPARK-20644) Hook up Spark UI to the new key-value store backend
[ https://issues.apache.org/jira/browse/SPARK-20644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Imran Rashid resolved SPARK-20644. -- Resolution: Fixed Fix Version/s: 2.3.0 Issue resolved by pull request 19582 [https://github.com/apache/spark/pull/19582] > Hook up Spark UI to the new key-value store backend > --- > > Key: SPARK-20644 > URL: https://issues.apache.org/jira/browse/SPARK-20644 > Project: Spark > Issue Type: Sub-task > Components: Web UI >Affects Versions: 2.3.0 >Reporter: Marcelo Vanzin > Fix For: 2.3.0 > > > See spec in parent issue (SPARK-18085) for more details. > This task tracks hooking up the Spark UI (both live and SHS) to the key-value > store based backend. It's the initial work to allow individual UI pages to be > de-coupled from the listener implementations and use the REST API data saved > in the store. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20644) Hook up Spark UI to the new key-value store backend
[ https://issues.apache.org/jira/browse/SPARK-20644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Imran Rashid reassigned SPARK-20644: Assignee: Marcelo Vanzin > Hook up Spark UI to the new key-value store backend > --- > > Key: SPARK-20644 > URL: https://issues.apache.org/jira/browse/SPARK-20644 > Project: Spark > Issue Type: Sub-task > Components: Web UI >Affects Versions: 2.3.0 >Reporter: Marcelo Vanzin >Assignee: Marcelo Vanzin > Fix For: 2.3.0 > > > See spec in parent issue (SPARK-18085) for more details. > This task tracks hooking up the Spark UI (both live and SHS) to the key-value > store based backend. It's the initial work to allow individual UI pages to be > de-coupled from the listener implementations and use the REST API data saved > in the store. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-22458) OutOfDirectMemoryError with Spark 2.2
[ https://issues.apache.org/jira/browse/SPARK-22458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kaushal Prajapati updated SPARK-22458: -- Description: We were using Spark 2.1 from last 6 months to execute multiple spark jobs that is running 15 hour long for 50+ TB of source data with below configurations successfully. {quote}spark.master yarn spark.driver.cores10 spark.driver.maxResultSize5g spark.driver.memory 20g spark.executor.cores 5 spark.executor.extraJavaOptions *-XX:+UseG1GC -Dio.netty.maxDirectMemory=1024* -XX:MaxGCPauseMillis=6 *-XX:MaxDirectMemorySize=2048m* -Dlog4j.configuration=file:///conf/log4j.properties -Dhdp.version=2.5.3.0-37 spark.driver.extraJavaOptions *-Dio.netty.maxDirectMemory=2048 -XX:MaxDirectMemorySize=2048m* -Dlog4j.configuration=file:///conf/log4j.properties -Dhdp.version=2.5.3.0-37 spark.executor.instances 30 spark.executor.memory 30g *spark.kryoserializer.buffer.max 512m* spark.network.timeout 12000s spark.serializer org.apache.spark.serializer.KryoSerializer spark.shuffle.io.preferDirectBufs false spark.sql.catalogImplementation hive spark.sql.shuffle.partitions 5000 spark.yarn.driver.memoryOverhead 1536 spark.yarn.executor.memoryOverhead4096 spark.core.connection.ack.wait.timeout600s spark.scheduler.maxRegisteredResourcesWaitingTime 15s spark.sql.hive.filesourcePartitionFileCacheSize 524288000 spark.dynamicAllocation.executorIdleTimeout 3s spark.dynamicAllocation.enabled true spark.hadoop.yarn.timeline-service.enabledfalse spark.shuffle.service.enabled true spark.yarn.am.extraJavaOptions*-Dhdp.version=2.5.3.0-37 -Dio.netty.maxDirectMemory=1024 -XX:MaxDirectMemorySize=1024m*{quote} Recently we tried to upgrade from Spark 2.1 to Spark 2.2 to get some fixes using latest version. But we started facing DirectBuffer outOfMemory error and exceeding memory limits for executor memoryOverhead issue. To fix that we started tweaking multiple properties but still issue persists. 
Relevant information is shared below Please let me any other details is requried, Snapshot for DirectMemory Error Stacktrace :- {code:java} 10:48:26.417 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 5.0 in stage 5.3 (TID 25022, dedwdprshc070.de.xxx.com, executor 615): FetchFailed(BlockManagerId(465, dedwdprshc061.de.xxx.com, 7337, None), shuffleId=7, mapId=141, reduceId=3372, message= org.apache.spark.shuffle.FetchFailedException: failed to allocate 65536 byte(s) of direct memory (used: 1073699840, max: 1073741824) at org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:442) at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:418) at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:59) at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434) at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32) at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.sort_addToSorter$(Unknown Source) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source) at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:395) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.findNextInnerJoinRows$(Unknown Source) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source) at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$10$$anon$2.hasNext(WholeStageCodegenExec.scala:414) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) at org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(Unsafe
[jira] [Updated] (SPARK-22458) OutOfDirectMemoryError with Spark 2.2
[ https://issues.apache.org/jira/browse/SPARK-22458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kaushal Prajapati updated SPARK-22458: -- Description: We were using Spark 2.1 from last 6 months to execute multiple spark jobs that is running 15 hour long for 50+ TB of source data with below configurations successfully. {noformat} spark.master yarn spark.driver.cores10 spark.driver.maxResultSize5g spark.driver.memory 20g spark.executor.cores 5 spark.executor.extraJavaOptions -XX:+UseG1GC {color:red}-Dio.netty.maxDirectMemory=1024* -XX:MaxGCPauseMillis=6{color} -XX:MaxDirectMemorySize=2048m* -Dlog4j.configuration=file:///conf/log4j.properties -Dhdp.version=2.5.3.0-37 spark.driver.extraJavaOptions *-Dio.netty.maxDirectMemory=2048 -XX:MaxDirectMemorySize=2048m* -Dlog4j.configuration=file:///conf/log4j.properties -Dhdp.version=2.5.3.0-37 spark.executor.instances 30 spark.executor.memory 30g *spark.kryoserializer.buffer.max 512m* spark.network.timeout 12000s spark.serializer org.apache.spark.serializer.KryoSerializer spark.shuffle.io.preferDirectBufs false spark.sql.catalogImplementation hive spark.sql.shuffle.partitions 5000 spark.yarn.driver.memoryOverhead 1536 spark.yarn.executor.memoryOverhead4096 spark.core.connection.ack.wait.timeout600s spark.scheduler.maxRegisteredResourcesWaitingTime 15s spark.sql.hive.filesourcePartitionFileCacheSize 524288000 spark.dynamicAllocation.executorIdleTimeout 3s spark.dynamicAllocation.enabled true spark.hadoop.yarn.timeline-service.enabledfalse spark.shuffle.service.enabled true spark.yarn.am.extraJavaOptions-Dhdp.version=2.5.3.0-37 * -Dio.netty.maxDirectMemory=1024 -XX:MaxDirectMemorySize=1024m* {noformat} Recently we tried to upgrade from Spark 2.1 to Spark 2.2 to get some fixes using latest version. But we started facing DirectBuffer outOfMemory error and exceeding memory limits for executor memoryOverhead issue. To fix that we started tweaking multiple properties but still issue persists. 
Relevant information is shared below Please let me any other details is requried, Snapshot for DirectMemory Error Stacktrace :- {code:java} 10:48:26.417 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 5.0 in stage 5.3 (TID 25022, dedwdprshc070.de.xxx.com, executor 615): FetchFailed(BlockManagerId(465, dedwdprshc061.de.xxx.com, 7337, None), shuffleId=7, mapId=141, reduceId=3372, message= org.apache.spark.shuffle.FetchFailedException: failed to allocate 65536 byte(s) of direct memory (used: 1073699840, max: 1073741824) at org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:442) at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:418) at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:59) at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434) at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32) at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.sort_addToSorter$(Unknown Source) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source) at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:395) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.findNextInnerJoinRows$(Unknown Source) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source) at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$10$$anon$2.hasNext(WholeStageCodegenExec.scala:414) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) at org.apache.spark.shuffle.sort.Unsafe
[jira] [Updated] (SPARK-22458) OutOfDirectMemoryError with Spark 2.2
[ https://issues.apache.org/jira/browse/SPARK-22458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kaushal Prajapati updated SPARK-22458: -- Description: We were using Spark 2.1 from last 6 months to execute multiple spark jobs that is running 15 hour long for 50+ TB of source data with below configurations successfully. {noformat} spark.master yarn spark.driver.cores10 spark.driver.maxResultSize5g spark.driver.memory 20g spark.executor.cores 5 spark.executor.extraJavaOptions -XX:+UseG1GC *-Dio.netty.maxDirectMemory=1024* -XX:MaxGCPauseMillis=6 *-XX:MaxDirectMemorySize=2048m* -Dlog4j.configuration=file:///conf/log4j.properties -Dhdp.version=2.5.3.0-37 spark.driver.extraJavaOptions *-Dio.netty.maxDirectMemory=2048 -XX:MaxDirectMemorySize=2048m* -Dlog4j.configuration=file:///conf/log4j.properties -Dhdp.version=2.5.3.0-37 spark.executor.instances 30 spark.executor.memory 30g *spark.kryoserializer.buffer.max 512m* spark.network.timeout 12000s spark.serializer org.apache.spark.serializer.KryoSerializer spark.shuffle.io.preferDirectBufs false spark.sql.catalogImplementation hive spark.sql.shuffle.partitions 5000 spark.yarn.driver.memoryOverhead 1536 spark.yarn.executor.memoryOverhead4096 spark.core.connection.ack.wait.timeout600s spark.scheduler.maxRegisteredResourcesWaitingTime 15s spark.sql.hive.filesourcePartitionFileCacheSize 524288000 spark.dynamicAllocation.executorIdleTimeout 3s spark.dynamicAllocation.enabled true spark.hadoop.yarn.timeline-service.enabledfalse spark.shuffle.service.enabled true spark.yarn.am.extraJavaOptions-Dhdp.version=2.5.3.0-37 * -Dio.netty.maxDirectMemory=1024 -XX:MaxDirectMemorySize=1024m* {noformat} Recently we tried to upgrade from Spark 2.1 to Spark 2.2 to get some fixes using latest version. But we started facing DirectBuffer outOfMemory error and exceeding memory limits for executor memoryOverhead issue. To fix that we started tweaking multiple properties but still issue persists. 
Relevant information is shared below Please let me any other details is requried, Snapshot for DirectMemory Error Stacktrace :- {code:java} 10:48:26.417 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 5.0 in stage 5.3 (TID 25022, dedwdprshc070.de.xxx.com, executor 615): FetchFailed(BlockManagerId(465, dedwdprshc061.de.xxx.com, 7337, None), shuffleId=7, mapId=141, reduceId=3372, message= org.apache.spark.shuffle.FetchFailedException: failed to allocate 65536 byte(s) of direct memory (used: 1073699840, max: 1073741824) at org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:442) at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:418) at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:59) at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434) at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32) at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.sort_addToSorter$(Unknown Source) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source) at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:395) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.findNextInnerJoinRows$(Unknown Source) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source) at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$10$$anon$2.hasNext(WholeStageCodegenExec.scala:414) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) at org.apache.spark.shuffle.sort.UnsafeShuffleWriter.wr
[jira] [Updated] (SPARK-22458) OutOfDirectMemoryError with Spark 2.2
[ https://issues.apache.org/jira/browse/SPARK-22458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kaushal Prajapati updated SPARK-22458: -- Description: We were using Spark 2.1 from last 6 months to execute multiple spark jobs that is running 15 hour long for 50+ TB of source data with below configurations successfully. {noformat} spark.master yarn spark.driver.cores10 spark.driver.maxResultSize5g spark.driver.memory 20g spark.executor.cores 5 spark.executor.extraJavaOptions -XX:+UseG1GC *-Dio.netty.maxDirectMemory=1024* -XX:MaxGCPauseMillis=6 *-XX:MaxDirectMemorySize=2048m* -Dlog4j.configuration=file:///conf/log4j.properties -Dhdp.version=2.5.3.0-37 spark.driver.extraJavaOptions* -Dio.netty.maxDirectMemory=2048 -XX:MaxDirectMemorySize=2048m* -Dlog4j.configuration=file:///conf/log4j.properties -Dhdp.version=2.5.3.0-37 spark.executor.instances 30 spark.executor.memory 30g *spark.kryoserializer.buffer.max 512m* spark.network.timeout 12000s spark.serializer org.apache.spark.serializer.KryoSerializer spark.shuffle.io.preferDirectBufs false spark.sql.catalogImplementation hive spark.sql.shuffle.partitions 5000 spark.yarn.driver.memoryOverhead 1536 spark.yarn.executor.memoryOverhead4096 spark.core.connection.ack.wait.timeout600s spark.scheduler.maxRegisteredResourcesWaitingTime 15s spark.sql.hive.filesourcePartitionFileCacheSize 524288000 spark.dynamicAllocation.executorIdleTimeout 3s spark.dynamicAllocation.enabled true spark.hadoop.yarn.timeline-service.enabledfalse spark.shuffle.service.enabled true spark.yarn.am.extraJavaOptions-Dhdp.version=2.5.3.0-37 * -Dio.netty.maxDirectMemory=1024 -XX:MaxDirectMemorySize=1024m* {noformat} Recently we tried to upgrade from Spark 2.1 to Spark 2.2 to get some fixes using latest version. But we started facing DirectBuffer outOfMemory error and exceeding memory limits for executor memoryOverhead issue. To fix that we started tweaking multiple properties but still issue persists. 
Relevant information is shared below Please let me any other details is requried, Snapshot for DirectMemory Error Stacktrace :- {code:java} 10:48:26.417 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 5.0 in stage 5.3 (TID 25022, dedwdprshc070.de.xxx.com, executor 615): FetchFailed(BlockManagerId(465, dedwdprshc061.de.xxx.com, 7337, None), shuffleId=7, mapId=141, reduceId=3372, message= org.apache.spark.shuffle.FetchFailedException: failed to allocate 65536 byte(s) of direct memory (used: 1073699840, max: 1073741824) at org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:442) at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:418) at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:59) at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434) at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32) at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.sort_addToSorter$(Unknown Source) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source) at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:395) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.findNextInnerJoinRows$(Unknown Source) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source) at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$10$$anon$2.hasNext(WholeStageCodegenExec.scala:414) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) at org.apache.spark.shuffle.sort.UnsafeShuffleWriter.w
[jira] [Updated] (SPARK-22458) OutOfDirectMemoryError with Spark 2.2
[ https://issues.apache.org/jira/browse/SPARK-22458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kaushal Prajapati updated SPARK-22458: -- Description: We were using Spark 2.1 from last 6 months to execute multiple spark jobs that is running 15 hour long for 50+ TB of source data with below configurations successfully. {{spark.master yarn spark.driver.cores10 spark.driver.maxResultSize5g spark.driver.memory 20g spark.executor.cores 5 spark.executor.extraJavaOptions -XX:+UseG1GC *-Dio.netty.maxDirectMemory=1024* -XX:MaxGCPauseMillis=6 *-XX:MaxDirectMemorySize=2048m* -Dlog4j.configuration=file:///conf/log4j.properties -Dhdp.version=2.5.3.0-37 spark.driver.extraJavaOptions* -Dio.netty.maxDirectMemory=2048 -XX:MaxDirectMemorySize=2048m* -Dlog4j.configuration=file:///conf/log4j.properties -Dhdp.version=2.5.3.0-37 spark.executor.instances 30 spark.executor.memory 30g *spark.kryoserializer.buffer.max 512m* spark.network.timeout 12000s spark.serializer org.apache.spark.serializer.KryoSerializer spark.shuffle.io.preferDirectBufs false spark.sql.catalogImplementation hive spark.sql.shuffle.partitions 5000 spark.yarn.driver.memoryOverhead 1536 spark.yarn.executor.memoryOverhead4096 spark.core.connection.ack.wait.timeout600s spark.scheduler.maxRegisteredResourcesWaitingTime 15s spark.sql.hive.filesourcePartitionFileCacheSize 524288000 spark.dynamicAllocation.executorIdleTimeout 3s spark.dynamicAllocation.enabled true spark.hadoop.yarn.timeline-service.enabledfalse spark.shuffle.service.enabled true spark.yarn.am.extraJavaOptions-Dhdp.version=2.5.3.0-37 * -Dio.netty.maxDirectMemory=1024 -XX:MaxDirectMemorySize=1024m* }} Recently we tried to upgrade from Spark 2.1 to Spark 2.2 to get some fixes using latest version. But we started facing DirectBuffer outOfMemory error and exceeding memory limits for executor memoryOverhead issue. To fix that we started tweaking multiple properties but still issue persists. 
Relevant information is shared below Please let me any other details is requried, Snapshot for DirectMemory Error Stacktrace :- {code:java} 10:48:26.417 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 5.0 in stage 5.3 (TID 25022, dedwdprshc070.de.xxx.com, executor 615): FetchFailed(BlockManagerId(465, dedwdprshc061.de.xxx.com, 7337, None), shuffleId=7, mapId=141, reduceId=3372, message= org.apache.spark.shuffle.FetchFailedException: failed to allocate 65536 byte(s) of direct memory (used: 1073699840, max: 1073741824) at org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:442) at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:418) at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:59) at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434) at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32) at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.sort_addToSorter$(Unknown Source) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source) at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:395) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.findNextInnerJoinRows$(Unknown Source) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source) at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$10$$anon$2.hasNext(WholeStageCodegenExec.scala:414) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) at org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWr
[jira] [Resolved] (SPARK-22445) move CodegenContext.copyResult to CodegenSupport
[ https://issues.apache.org/jira/browse/SPARK-22445?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-22445. - Resolution: Fixed Fix Version/s: 2.3.0 Issue resolved by pull request 19656 [https://github.com/apache/spark/pull/19656] > move CodegenContext.copyResult to CodegenSupport > > > Key: SPARK-22445 > URL: https://issues.apache.org/jira/browse/SPARK-22445 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan > Fix For: 2.3.0 > > -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-22458) OutOfDirectMemoryError with Spark 2.2
[ https://issues.apache.org/jira/browse/SPARK-22458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kaushal Prajapati updated SPARK-22458: -- Description: We were using Spark 2.1 from last 6 months to execute multiple spark jobs that is running 15 hour long for 50+ TB of source data with below configurations successfully. spark.master yarn spark.driver.cores10 spark.driver.maxResultSize5g spark.driver.memory 20g spark.executor.cores 5 spark.executor.extraJavaOptions -XX:+UseG1GC *-Dio.netty.maxDirectMemory=1024* -XX:MaxGCPauseMillis=6 *-XX:MaxDirectMemorySize=2048m* -Dlog4j.configuration=file:///conf/log4j.properties -Dhdp.version=2.5.3.0-37 spark.driver.extraJavaOptions* -Dio.netty.maxDirectMemory=2048 -XX:MaxDirectMemorySize=2048m* -Dlog4j.configuration=file:///conf/log4j.properties -Dhdp.version=2.5.3.0-37 spark.executor.instances 30 spark.executor.memory 30g *spark.kryoserializer.buffer.max 512m* spark.network.timeout 12000s spark.serializer org.apache.spark.serializer.KryoSerializer spark.shuffle.io.preferDirectBufs false spark.sql.catalogImplementation hive spark.sql.shuffle.partitions 5000 spark.yarn.driver.memoryOverhead 1536 spark.yarn.executor.memoryOverhead4096 spark.core.connection.ack.wait.timeout600s spark.scheduler.maxRegisteredResourcesWaitingTime 15s spark.sql.hive.filesourcePartitionFileCacheSize 524288000 spark.dynamicAllocation.executorIdleTimeout 3s spark.dynamicAllocation.enabled true spark.hadoop.yarn.timeline-service.enabledfalse spark.shuffle.service.enabled true spark.yarn.am.extraJavaOptions-Dhdp.version=2.5.3.0-37 * -Dio.netty.maxDirectMemory=1024 -XX:MaxDirectMemorySize=1024m* Recently we tried to upgrade from Spark 2.1 to Spark 2.2 to get some fixes using latest version. But we started facing DirectBuffer outOfMemory error and exceeding memory limits for executor memoryOverhead issue. To fix that we started tweaking multiple properties but still issue persists. 
Relevant information is shared below Please let me any other details is requried, Snapshot for DirectMemory Error Stacktrace :- 10:48:26.417 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 5.0 in stage 5.3 (TID 25022, dedwdprshc070.de.xxx.com, executor 615): FetchFailed(BlockManagerId(465, dedwdprshc061.de.xxx.com, 7337, None), shuffleId=7, mapId=141, reduceId=3372, message= org.apache.spark.shuffle.FetchFailedException: failed to allocate 65536 byte(s) of direct memory (used: 1073699840, max: 1073741824) at org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:442) at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:418) at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:59) at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434) at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32) at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.sort_addToSorter$(Unknown Source) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source) at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:395) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.findNextInnerJoinRows$(Unknown Source) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source) at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$10$$anon$2.hasNext(WholeStageCodegenExec.scala:414) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) at org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:166)
[jira] [Created] (SPARK-22458) OutOfDirectMemoryError with Spark 2.2
Kaushal Prajapati created SPARK-22458: - Summary: OutOfDirectMemoryError with Spark 2.2 Key: SPARK-22458 URL: https://issues.apache.org/jira/browse/SPARK-22458 Project: Spark Issue Type: Bug Components: Shuffle, SQL, YARN Affects Versions: 2.2.0 Reporter: Kaushal Prajapati Priority: Blocker We were using Spark 2.1 from last 6 months to execute multiple spark jobs that is running 15 hour long for 50+ TB of source data with below configurations successfully. spark.master yarn spark.driver.cores10 spark.driver.maxResultSize5g spark.driver.memory 20g spark.executor.cores 5 spark.executor.extraJavaOptions -XX:+UseG1GC *-Dio.netty.maxDirectMemory=1024* -XX:MaxGCPauseMillis=6 *-XX:MaxDirectMemorySize=2048m* -Dlog4j.configuration=file:///conf/log4j.properties -Dhdp.version=2.5.3.0-37 spark.driver.extraJavaOptions* -Dio.netty.maxDirectMemory=2048 -XX:MaxDirectMemorySize=2048m* -Dlog4j.configuration=file:///conf/log4j.properties -Dhdp.version=2.5.3.0-37 spark.executor.instances 30 spark.executor.memory 30g *spark.kryoserializer.buffer.max 512m* spark.network.timeout 12000s spark.serializer org.apache.spark.serializer.KryoSerializer spark.shuffle.io.preferDirectBufs false spark.sql.catalogImplementation hive spark.sql.shuffle.partitions 5000 spark.yarn.driver.memoryOverhead 1536 spark.yarn.executor.memoryOverhead4096 spark.core.connection.ack.wait.timeout600s spark.scheduler.maxRegisteredResourcesWaitingTime 15s spark.sql.hive.filesourcePartitionFileCacheSize 524288000 spark.dynamicAllocation.executorIdleTimeout 3s spark.dynamicAllocation.enabled true spark.hadoop.yarn.timeline-service.enabledfalse spark.shuffle.service.enabled true spark.yarn.am.extraJavaOptions-Dhdp.version=2.5.3.0-37 * -Dio.netty.maxDirectMemory=1024 -XX:MaxDirectMemorySize=1024m* Recently we tried to upgrade from Spark 2.1 to Spark 2.2 to get some fixes using latest version. But we started facing DirectBuffer outOfMemory error and exceeding memory limits for executor memoryOverhead issue. To fix that we started tweaking multiple properties but still issue persists. 
Relevant information is shared below Please let me any other details is requried, Snapshot for DirectMemory Error Stacktrace :- 10:48:26.417 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 5.0 in stage 5.3 (TID 25022, dedwdprshc070.de.xxx.com, executor 615): FetchFailed(BlockManagerId(465, dedwdprshc061.de.xxx.com, 7337, None), shuffleId=7, mapId=141, reduceId=3372, message= org.apache.spark.shuffle.FetchFailedException: failed to allocate 65536 byte(s) of direct memory (used: 1073699840, max: 1073741824) at org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:442) at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:418) at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:59) at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434) at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32) at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.sort_addToSorter$(Unknown Source) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source) at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:395) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.findNextInnerJoinRows$(Unknown Source) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source) at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$10$$anon
[jira] [Commented] (SPARK-22456) Add new function dayofweek
[ https://issues.apache.org/jira/browse/SPARK-22456?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16240331#comment-16240331 ] Michael Styles commented on SPARK-22456: This functionality is part of the SQL standard and supported by a number of relational database systems including Oracle, DB2, SQL Server, MySQL, and Teradata, to name a few. > Add new function dayofweek > -- > > Key: SPARK-22456 > URL: https://issues.apache.org/jira/browse/SPARK-22456 > Project: Spark > Issue Type: New Feature > Components: PySpark, SQL >Affects Versions: 2.2.0 >Reporter: Michael Styles > > Add new function *dayofweek* to return the day of the week of the given > argument as an integer value in the range 1-7, where 1 represents Sunday. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
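Until such a built-in lands, roughly the same 1-7 (Sunday = 1) value can be derived from existing functions. The sketch below leans on date_format's "u" pattern (day number with Monday = 1 ... Sunday = 7) and remaps it, so treat it as a workaround illustration rather than the proposed API.

{code:scala}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object DayOfWeekWorkaround {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]").appName("dayofweek-sketch").getOrCreate()
    import spark.implicits._

    val df = Seq("2017-11-05", "2017-11-06").toDF("d") // a Sunday and a Monday
    // "u" yields Monday=1..Sunday=7; (u % 7) + 1 remaps to Sunday=1..Saturday=7.
    df.select($"d", ((date_format($"d", "u").cast("int") % 7) + 1).as("dayofweek"))
      .show()
    spark.stop()
  }
}
{code}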
[jira] [Updated] (SPARK-22457) Tables are supposed to be MANAGED only taking into account whether a path is provided
[ https://issues.apache.org/jira/browse/SPARK-22457?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] David Arroyo Cazorla updated SPARK-22457: - Description: As far as I know, since Spark 2.2, tables are supposed to be MANAGED only taking into account whether a path is provided: {code:java} val tableType = if (storage.locationUri.isDefined) { CatalogTableType.EXTERNAL } else { CatalogTableType.MANAGED } {code} This solution seems to be right for filesystem based data sources. On the other hand, when working with other data sources such as elasticsearch, that solution is leading to a weird behaviour described below: 1) InMemoryCatalog's doCreateTable() adds a locationURI if CatalogTableType.MANAGED && tableDefinition.storage.locationUri.isEmpty. 2) Before loading the data source table FindDataSourceTable's readDataSourceTable() adds a path option if locationURI exists: {code:java} val pathOption = table.storage.locationUri.map("path" -> CatalogUtils.URIToString(_)) {code} 3) That causes an error when reading from elasticsearch because 'path' is an option already supported by elasticsearch (locationUri is set to file:/home/user/spark-rv/elasticsearch/shop/clients) org.elasticsearch.hadoop.EsHadoopIllegalArgumentException: Cannot find mapping for file:/home/user/spark-rv/elasticsearch/shop/clients - one is required before using Spark SQL Would be possible only to mark tables as MANAGED for a subset of data sources (TEXT, CSV, JSON, JDBC, PARQUET, ORC, HIVE) or think about any other solution? P.S. InMemoryCatalog' doDropTable() deletes the directory of the table which from my point of view should only be required for filesystem based data sources: {code:java} if (tableMeta.tableType == CatalogTableType.MANAGED) ... // Delete the data/directory of the table val dir = new Path(tableMeta.location) try { val fs = dir.getFileSystem(hadoopConfig) fs.delete(dir, true) } catch { case e: IOException => throw new SparkException(s"Unable to drop table $table as failed " + s"to delete its directory $dir", e) } {code} was: As far as I know, since Spark 2.2, tables are supposed to be MANAGED only taking into account whether a path is provided: {code:java} val tableType = if (storage.locationUri.isDefined) { CatalogTableType.EXTERNAL } else { CatalogTableType.MANAGED } {code} This solution seems to be right for filesystem based data sources. On the other hand, when working with other data sources such as elasticsearch, that solution is leading to a weird behaviour described below: 1) InMemoryCatalog's doCreateTable() adds a locationURI if CatalogTableType.MANAGED && tableDefinition.storage.locationUri.isEmpty. 2) Before loading the data source table FindDataSourceTable's readDataSourceTable() adds a path option if locationURI exists: {code:java} val pathOption = table.storage.locationUri.map("path" -> CatalogUtils.URIToString(_)) {code} 3) That causes an error when reading from elasticsearch because 'path' is an option already supported by elasticsearch (locationUri is set to file:/home/user/spark-rv/elasticsearch/shop/clients) org.elasticsearch.hadoop.EsHadoopIllegalArgumentException: Cannot find mapping for file:/home/user/spark-rv/elasticsearch/shop/clients - one is required before using Spark SQL Would be possible only mark tables as MANAGED for a subset of data sources (TEXT, CSV, JSON, JDBC, PARQUET, ORC, HIVE) or think about any other solution? P.S. 
InMemoryCatalog' doDropTable() deletes the directory of the table which from my point of view should only be required for filesystem based data sources: {code:java} if (tableMeta.tableType == CatalogTableType.MANAGED) ... // Delete the data/directory of the table val dir = new Path(tableMeta.location) try { val fs = dir.getFileSystem(hadoopConfig) fs.delete(dir, true) } catch { case e: IOException => throw new SparkException(s"Unable to drop table $table as failed " + s"to delete its directory $dir", e) } {code} > Tables are supposed to be MANAGED only taking into account whether a path is > provided > - > > Key: SPARK-22457 > URL: https://issues.apache.org/jira/browse/SPARK-22457 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: David Arroyo Cazorla > > As far as I know, since Spark 2.2, tables are supposed to be MANAGED only > taking into account whether a path is provided: > {code:java} > val tableType = if (storage.locationUri.isDefined) { > CatalogTableType.EXTERNAL > } else { >
[jira] [Commented] (SPARK-10399) Off Heap Memory Access for non-JVM libraries (C++)
[ https://issues.apache.org/jira/browse/SPARK-10399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16240314#comment-16240314 ] Sean Owen commented on SPARK-10399: --- I think this is mostly superseded by Arrow's intended role here, yeah. > Off Heap Memory Access for non-JVM libraries (C++) > -- > > Key: SPARK-10399 > URL: https://issues.apache.org/jira/browse/SPARK-10399 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Reporter: Paul Weiss > > *Summary* > Provide direct off-heap memory access to an external non-JVM program such as > a c++ library within the Spark running JVM/executor. As Spark moves to > storing all data into off heap memory it makes sense to provide access points > to the memory for non-JVM programs. > > *Assumptions* > * Zero copies will be made during the call into non-JVM library > * Access into non-JVM libraries will be accomplished via JNI > * A generic JNI interface will be created so that developers will not need to > deal with the raw JNI call > * C++ will be the initial target non-JVM use case > * memory management will remain on the JVM/Spark side > * the API from C++ will be similar to dataframes as much as feasible and NOT > require expert knowledge of JNI > * Data organization and layout will support complex (multi-type, nested, > etc.) types > > *Design* > * Initially Spark JVM -> non-JVM will be supported > * Creating an embedded JVM with Spark running from a non-JVM program is > initially out of scope > > *Technical* > * GetDirectBufferAddress is the JNI call used to access byte buffer without > copy -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
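To make the zero-copy path concrete, here is a hedged JVM-side sketch (the library and method names are hypothetical): a direct ByteBuffer is allocated off heap and handed to a native routine, whose C++ implementation would obtain the raw pointer with GetDirectBufferAddress as noted in the issue.

{code:scala}
import java.nio.ByteBuffer

object OffHeapBridgeSketch {
  // Implemented in a hypothetical C++ library; @native means no JVM body is emitted.
  @native def process(buffer: ByteBuffer, length: Int): Unit = ???

  def main(args: Array[String]): Unit = {
    System.loadLibrary("sparknative")          // hypothetical native library name
    val buf = ByteBuffer.allocateDirect(1024)  // off-heap, reachable from C++ without copying
    buf.putLong(42L)
    process(buf, buf.capacity())
  }
}
{code}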
[jira] [Updated] (SPARK-22457) Tables are supposed to be MANAGED only taking into account whether a path is provided
[ https://issues.apache.org/jira/browse/SPARK-22457?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] David Arroyo Cazorla updated SPARK-22457: - Description: As far as I know, since Spark 2.2, tables are supposed to be MANAGED only taking into account whether a path is provided: {code:java} val tableType = if (storage.locationUri.isDefined) { CatalogTableType.EXTERNAL } else { CatalogTableType.MANAGED } {code} This solution seems to be right for filesystem based data sources. On the other hand, when working with other data sources such as elasticsearch, that solution is leading to a weird behaviour described below. 1) InMemoryCatalog's doCreateTable() adds a locationURI if CatalogTableType.MANAGED && tableDefinition.storage.locationUri.isEmpty. 2) Before loading the data source table FindDataSourceTable's readDataSourceTable() adds a path option if locationURI exists: {code:java} val pathOption = table.storage.locationUri.map("path" -> CatalogUtils.URIToString(_)) {code} 3) That causes an error when reading from elasticsearch because 'path' is an option already supported by elasticsearch (locationUri is set to file:/home/user/spark-rv/elasticsearch/shop/clients) org.elasticsearch.hadoop.EsHadoopIllegalArgumentException: Cannot find mapping for file:/home/user/spark-rv/elasticsearch/shop/clients - one is required before using Spark SQL Would be possible only mark tables as MANAGED for a subset of data sources (TEXT, CSV, JSON, JDBC, PARQUET, ORC, HIVE) or think about any other solution? P.S. InMemoryCatalog' doDropTable() deletes the directory of the table which from my point of view should only be required for filesystem based data sources: {code:java} if (tableMeta.tableType == CatalogTableType.MANAGED) ... // Delete the data/directory of the table val dir = new Path(tableMeta.location) try { val fs = dir.getFileSystem(hadoopConfig) fs.delete(dir, true) } catch { case e: IOException => throw new SparkException(s"Unable to drop table $table as failed " + s"to delete its directory $dir", e) } {code} was: As far as I know, since Spark 2.2, tables are supposed to be MANAGED only taking into account whether a path is provided: {code:scala} val tableType = if (storage.locationUri.isDefined) { CatalogTableType.EXTERNAL } else { CatalogTableType.MANAGED } {code} This solution seems to be right for filesystem based data sources. On the other hand, when working with other data sources such as elasticsearch, that solution is leading to a weird behaviour described below. 1) InMemoryCatalog's doCreateTable() adds a locationURI if CatalogTableType.MANAGED && tableDefinition.storage.locationUri.isEmpty. 2) Before loading the data source table FindDataSourceTable's readDataSourceTable() adds a path option if locationURI exists: {code:scala} val pathOption = table.storage.locationUri.map("path" -> CatalogUtils.URIToString(_)) {code} 3) That causes an error when reading from elasticsearch because 'path' is an option already supported by elasticsearch (locationUri is set to file:/home/user/spark-rv/elasticsearch/shop/clients) org.elasticsearch.hadoop.EsHadoopIllegalArgumentException: Cannot find mapping for file:/home/user/spark-rv/elasticsearch/shop/clients - one is required before using Spark SQL Would be possible only mark tables as MANAGED for a subset of data sources (TEXT, CSV, JSON, JDBC, PARQUET, ORC, HIVE) or think about any other solution? P.S. 
InMemoryCatalog' doDropTable() deletes the directory of the table which from my point of view should only be required for filesystem based data sources: {code:scala} if (tableMeta.tableType == CatalogTableType.MANAGED) ... // Delete the data/directory of the table val dir = new Path(tableMeta.location) try { val fs = dir.getFileSystem(hadoopConfig) fs.delete(dir, true) } catch { case e: IOException => throw new SparkException(s"Unable to drop table $table as failed " + s"to delete its directory $dir", e) } {code} > Tables are supposed to be MANAGED only taking into account whether a path is > provided > - > > Key: SPARK-22457 > URL: https://issues.apache.org/jira/browse/SPARK-22457 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: David Arroyo Cazorla > > As far as I know, since Spark 2.2, tables are supposed to be MANAGED only > taking into account whether a path is provided: > {code:java} > val tableType = if (storage.locationUri.isDefined) { > CatalogTableType.EXTERNAL > } else { >
[jira] [Updated] (SPARK-22457) Tables are supposed to be MANAGED only taking into account whether a path is provided
[ https://issues.apache.org/jira/browse/SPARK-22457?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] David Arroyo Cazorla updated SPARK-22457: - Description: As far as I know, since Spark 2.2, tables are supposed to be MANAGED only taking into account whether a path is provided: {code:java} val tableType = if (storage.locationUri.isDefined) { CatalogTableType.EXTERNAL } else { CatalogTableType.MANAGED } {code} This solution seems to be right for filesystem based data sources. On the other hand, when working with other data sources such as elasticsearch, that solution is leading to a weird behaviour described below: 1) InMemoryCatalog's doCreateTable() adds a locationURI if CatalogTableType.MANAGED && tableDefinition.storage.locationUri.isEmpty. 2) Before loading the data source table FindDataSourceTable's readDataSourceTable() adds a path option if locationURI exists: {code:java} val pathOption = table.storage.locationUri.map("path" -> CatalogUtils.URIToString(_)) {code} 3) That causes an error when reading from elasticsearch because 'path' is an option already supported by elasticsearch (locationUri is set to file:/home/user/spark-rv/elasticsearch/shop/clients) org.elasticsearch.hadoop.EsHadoopIllegalArgumentException: Cannot find mapping for file:/home/user/spark-rv/elasticsearch/shop/clients - one is required before using Spark SQL Would be possible only mark tables as MANAGED for a subset of data sources (TEXT, CSV, JSON, JDBC, PARQUET, ORC, HIVE) or think about any other solution? P.S. InMemoryCatalog' doDropTable() deletes the directory of the table which from my point of view should only be required for filesystem based data sources: {code:java} if (tableMeta.tableType == CatalogTableType.MANAGED) ... // Delete the data/directory of the table val dir = new Path(tableMeta.location) try { val fs = dir.getFileSystem(hadoopConfig) fs.delete(dir, true) } catch { case e: IOException => throw new SparkException(s"Unable to drop table $table as failed " + s"to delete its directory $dir", e) } {code} was: As far as I know, since Spark 2.2, tables are supposed to be MANAGED only taking into account whether a path is provided: {code:java} val tableType = if (storage.locationUri.isDefined) { CatalogTableType.EXTERNAL } else { CatalogTableType.MANAGED } {code} This solution seems to be right for filesystem based data sources. On the other hand, when working with other data sources such as elasticsearch, that solution is leading to a weird behaviour described below. 1) InMemoryCatalog's doCreateTable() adds a locationURI if CatalogTableType.MANAGED && tableDefinition.storage.locationUri.isEmpty. 2) Before loading the data source table FindDataSourceTable's readDataSourceTable() adds a path option if locationURI exists: {code:java} val pathOption = table.storage.locationUri.map("path" -> CatalogUtils.URIToString(_)) {code} 3) That causes an error when reading from elasticsearch because 'path' is an option already supported by elasticsearch (locationUri is set to file:/home/user/spark-rv/elasticsearch/shop/clients) org.elasticsearch.hadoop.EsHadoopIllegalArgumentException: Cannot find mapping for file:/home/user/spark-rv/elasticsearch/shop/clients - one is required before using Spark SQL Would be possible only mark tables as MANAGED for a subset of data sources (TEXT, CSV, JSON, JDBC, PARQUET, ORC, HIVE) or think about any other solution? P.S. 
InMemoryCatalog's doDropTable() deletes the directory of the table, which from my point of view should only be required for filesystem-based data sources: {code:java} if (tableMeta.tableType == CatalogTableType.MANAGED) ... // Delete the data/directory of the table val dir = new Path(tableMeta.location) try { val fs = dir.getFileSystem(hadoopConfig) fs.delete(dir, true) } catch { case e: IOException => throw new SparkException(s"Unable to drop table $table as failed " + s"to delete its directory $dir", e) } {code} > Tables are supposed to be MANAGED only taking into account whether a path is > provided > - > > Key: SPARK-22457 > URL: https://issues.apache.org/jira/browse/SPARK-22457 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: David Arroyo Cazorla > > As far as I know, since Spark 2.2, tables are supposed to be MANAGED only > taking into account whether a path is provided: > {code:java} > val tableType = if (storage.locationUri.isDefined) { > CatalogTableType.EXTERNAL > } else { >
[jira] [Created] (SPARK-22457) Tables are supposed to be MANAGED only taking into account whether a path is provided
David Arroyo Cazorla created SPARK-22457: Summary: Tables are supposed to be MANAGED only taking into account whether a path is provided Key: SPARK-22457 URL: https://issues.apache.org/jira/browse/SPARK-22457 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.2.0 Reporter: David Arroyo Cazorla As far as I know, since Spark 2.2, tables are supposed to be MANAGED only taking into account whether a path is provided: {code:scala} val tableType = if (storage.locationUri.isDefined) { CatalogTableType.EXTERNAL } else { CatalogTableType.MANAGED } {code} This solution seems to be right for filesystem-based data sources. On the other hand, when working with other data sources such as elasticsearch, that solution is leading to a weird behaviour described below. 1) InMemoryCatalog's doCreateTable() adds a locationURI if CatalogTableType.MANAGED && tableDefinition.storage.locationUri.isEmpty. 2) Before loading the data source table, FindDataSourceTable's readDataSourceTable() adds a path option if locationURI exists: {code:scala} val pathOption = table.storage.locationUri.map("path" -> CatalogUtils.URIToString(_)) {code} 3) That causes an error when reading from elasticsearch because 'path' is an option already supported by elasticsearch (locationUri is set to file:/home/user/spark-rv/elasticsearch/shop/clients) org.elasticsearch.hadoop.EsHadoopIllegalArgumentException: Cannot find mapping for file:/home/user/spark-rv/elasticsearch/shop/clients - one is required before using Spark SQL Would it be possible to only mark tables as MANAGED for a subset of data sources (TEXT, CSV, JSON, JDBC, PARQUET, ORC, HIVE) or think about any other solution? P.S. InMemoryCatalog's doDropTable() deletes the directory of the table, which from my point of view should only be required for filesystem-based data sources: {code:scala} if (tableMeta.tableType == CatalogTableType.MANAGED) ... // Delete the data/directory of the table val dir = new Path(tableMeta.location) try { val fs = dir.getFileSystem(hadoopConfig) fs.delete(dir, true) } catch { case e: IOException => throw new SparkException(s"Unable to drop table $table as failed " + s"to delete its directory $dir", e) } {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
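The proposal above amounts to consulting the data source provider, not only the presence of a location, when deciding the table type. A minimal sketch of that idea, assuming the provider name is available at the decision point; the provider list is the reporter's suggestion and none of this is existing Spark code:
{code:scala}
// Sketch only: treat a table as MANAGED when no location is given AND the
// provider is one of the file-based sources listed above. "fileBasedProviders"
// and the plain string return values are illustrative, not Spark internals.
val fileBasedProviders = Set("text", "csv", "json", "jdbc", "parquet", "orc", "hive")

def resolveTableType(locationUriDefined: Boolean, provider: String): String =
  if (locationUriDefined) "EXTERNAL"
  else if (fileBasedProviders.contains(provider.toLowerCase)) "MANAGED"
  else "EXTERNAL" // e.g. elasticsearch: no location given, but not filesystem-backed
{code}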
[jira] [Commented] (SPARK-10399) Off Heap Memory Access for non-JVM libraries (C++)
[ https://issues.apache.org/jira/browse/SPARK-10399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16240304#comment-16240304 ] Jim Pivarski commented on SPARK-10399: -- WontFix because PR 19222 has no conflicts and will be merged, or because off-heap memory will instead be exposed as Arrow buffers, or because you don't intend to support this feature? I'm just asking for clarification, not asking you to change your minds. > Off Heap Memory Access for non-JVM libraries (C++) > -- > > Key: SPARK-10399 > URL: https://issues.apache.org/jira/browse/SPARK-10399 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Reporter: Paul Weiss > > *Summary* > Provide direct off-heap memory access to an external non-JVM program such as > a c++ library within the Spark running JVM/executor. As Spark moves to > storing all data into off heap memory it makes sense to provide access points > to the memory for non-JVM programs. > > *Assumptions* > * Zero copies will be made during the call into non-JVM library > * Access into non-JVM libraries will be accomplished via JNI > * A generic JNI interface will be created so that developers will not need to > deal with the raw JNI call > * C++ will be the initial target non-JVM use case > * memory management will remain on the JVM/Spark side > * the API from C++ will be similar to dataframes as much as feasible and NOT > require expert knowledge of JNI > * Data organization and layout will support complex (multi-type, nested, > etc.) types > > *Design* > * Initially Spark JVM -> non-JVM will be supported > * Creating an embedded JVM with Spark running from a non-JVM program is > initially out of scope > > *Technical* > * GetDirectBufferAddress is the JNI call used to access byte buffer without > copy -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
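For context on the zero-copy assumption in the ticket: the JVM side would hand a direct (off-heap) buffer to the native library, which can then obtain its address through the GetDirectBufferAddress JNI call mentioned under *Technical*. A minimal JVM-side sketch; the native side is not shown and the buffer size is arbitrary:
{code:scala}
import java.nio.ByteBuffer

// Off-heap buffer allocated on the JVM side; a C++ library reached via JNI could
// read it in place via GetDirectBufferAddress, without copying.
val offHeapBuf: ByteBuffer = ByteBuffer.allocateDirect(1 << 20) // 1 MiB, size is arbitrary
{code}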
[jira] [Commented] (SPARK-21640) Method mode with String parameters within DataFrameWriter is error prone
[ https://issues.apache.org/jira/browse/SPARK-21640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16240298#comment-16240298 ] Apache Spark commented on SPARK-21640: -- User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/19673 > Method mode with String parameters within DataFrameWriter is error prone > > > Key: SPARK-21640 > URL: https://issues.apache.org/jira/browse/SPARK-21640 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.2.0 >Reporter: Alberto >Assignee: Alberto >Priority: Trivial > Fix For: 2.3.0 > > > The following method: > {code:java} > def mode(saveMode: String): DataFrameWriter[T] > {code} > sets the SaveMode of the DataFrameWriter depending on the string that is > passed in as a parameter. > There is a Java enum with all the save modes, which are Append, Overwrite, > ErrorIfExists and Ignore. In my current project I was writing some code that > was using this enum to get the string value that I use to call the mode > method: > {code:java} > private[utils] val configModeAppend = SaveMode.Append.toString.toLowerCase > private[utils] val configModeErrorIfExists = > SaveMode.ErrorIfExists.toString.toLowerCase > private[utils] val configModeIgnore = SaveMode.Ignore.toString.toLowerCase > private[utils] val configModeOverwrite = > SaveMode.Overwrite.toString.toLowerCase > {code} > The configModeErrorIfExists val contains the value "errorifexists", and when I > call the saveMode method using this string it does not match. I suggest > including "errorifexists" as a valid match for the ErrorIfExists SaveMode. > Will create a PR to address this issue ASAP. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
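A sketch of the string-to-SaveMode matching the reporter describes, with "errorifexists" accepted alongside "error". The real logic lives in DataFrameWriter.mode(String); this standalone helper is only illustrative:
{code:scala}
import org.apache.spark.sql.SaveMode

// Illustrative parser: accepts the lower-cased enum name "errorifexists"
// in addition to "error", so SaveMode.ErrorIfExists.toString.toLowerCase matches.
def parseSaveMode(saveMode: String): SaveMode = saveMode.toLowerCase match {
  case "overwrite"              => SaveMode.Overwrite
  case "append"                 => SaveMode.Append
  case "ignore"                 => SaveMode.Ignore
  case "error" | "errorifexists" => SaveMode.ErrorIfExists
  case other =>
    throw new IllegalArgumentException(s"Unknown save mode: $other")
}
{code}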
[jira] [Comment Edited] (SPARK-3383) DecisionTree aggregate size could be smaller
[ https://issues.apache.org/jira/browse/SPARK-3383?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16240284#comment-16240284 ] Yan Facai (颜发才) edited comment on SPARK-3383 at 11/6/17 1:28 PM: - [~WeichenXu123] Good work! I'd like to take a look if time allows. Anyway, I believe that unordered features can benefit a lot from your work. was (Author: facai): [~WeichenXu123] Good work! I'd like to take a look if time allows. Anyway, I believe that unordered features can benefit a lot from the PR. > DecisionTree aggregate size could be smaller > > > Key: SPARK-3383 > URL: https://issues.apache.org/jira/browse/SPARK-3383 > Project: Spark > Issue Type: Improvement > Components: MLlib >Affects Versions: 1.1.0 >Reporter: Joseph K. Bradley >Priority: Minor > > Storage and communication optimization: > DecisionTree aggregate statistics could store less data (described below). > The savings would be significant for datasets with many low-arity categorical > features (binary features, or unordered categorical features). Savings would > be negligible for continuous features. > DecisionTree stores a vector of sufficient statistics for each (node, feature, > bin). We could store 1 fewer bin per (node, feature): For a given (node, > feature), if we store these vectors for all but the last bin, and also store > the total statistics for each node, then we could compute the statistics for > the last bin. For binary and unordered categorical features, this would cut > in half the number of bins to store and communicate. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3383) DecisionTree aggregate size could be smaller
[ https://issues.apache.org/jira/browse/SPARK-3383?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16240284#comment-16240284 ] Yan Facai (颜发才) commented on SPARK-3383: [~WeichenXu123] Good work! I'd like to take a look if time allows. Anyway, I believe that unordered features can benefit a lot from the PR. > DecisionTree aggregate size could be smaller > > > Key: SPARK-3383 > URL: https://issues.apache.org/jira/browse/SPARK-3383 > Project: Spark > Issue Type: Improvement > Components: MLlib >Affects Versions: 1.1.0 >Reporter: Joseph K. Bradley >Priority: Minor > > Storage and communication optimization: > DecisionTree aggregate statistics could store less data (described below). > The savings would be significant for datasets with many low-arity categorical > features (binary features, or unordered categorical features). Savings would > be negligible for continuous features. > DecisionTree stores a vector of sufficient statistics for each (node, feature, > bin). We could store 1 fewer bin per (node, feature): For a given (node, > feature), if we store these vectors for all but the last bin, and also store > the total statistics for each node, then we could compute the statistics for > the last bin. For binary and unordered categorical features, this would cut > in half the number of bins to store and communicate. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
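The space saving described in SPARK-3383 relies on the last bin being recoverable by subtraction: the node's total statistics minus all of the stored bins. A minimal sketch of that reconstruction, with statistics vectors represented as plain arrays purely for illustration:
{code:scala}
// Reconstruct the omitted last bin: start from the node total and subtract
// every stored bin, element-wise.
def lastBinStats(nodeTotal: Array[Double], storedBins: Seq[Array[Double]]): Array[Double] =
  storedBins.foldLeft(nodeTotal.clone()) { (acc, bin) =>
    for (i <- acc.indices) acc(i) -= bin(i)
    acc
  }
{code}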
[jira] [Assigned] (SPARK-21195) MetricSystem should pick up dynamically registered metrics in sources
[ https://issues.apache.org/jira/browse/SPARK-21195?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-21195: Assignee: (was: Apache Spark) > MetricSystem should pick up dynamically registered metrics in sources > - > > Key: SPARK-21195 > URL: https://issues.apache.org/jira/browse/SPARK-21195 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.1.1 >Reporter: Robert Kruszewski >Priority: Minor > > Currently when MetricsSystem registers a source it only picks up currently > registered metrics. It's quite cumbersome and leads to a lot of boilerplate > to preregister all metrics especially with systems that use instrumentation. > This change proposes to teach MetricsSystem to watch metrics added to sources > and dynamically register them -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
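The "watch and dynamically register" idea can be illustrated with a Dropwizard MetricRegistryListener that mirrors newly added metrics from a source's registry into the system-wide one. The registries, the name prefixing, and the counter-only callback below are placeholders, not the actual MetricsSystem internals:
{code:scala}
import com.codahale.metrics.{Counter, MetricRegistry, MetricRegistryListener}

// Mirror metrics added to `source` into `system` as they appear at runtime.
def mirror(source: MetricRegistry, system: MetricRegistry, prefix: String): Unit =
  source.addListener(new MetricRegistryListener.Base {
    override def onCounterAdded(name: String, counter: Counter): Unit =
      system.register(s"$prefix.$name", counter)
    // onGaugeAdded, onHistogramAdded, onMeterAdded and onTimerAdded would be
    // overridden the same way.
  })
{code}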
[jira] [Assigned] (SPARK-21195) MetricSystem should pick up dynamically registered metrics in sources
[ https://issues.apache.org/jira/browse/SPARK-21195?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-21195: Assignee: Apache Spark > MetricSystem should pick up dynamically registered metrics in sources > - > > Key: SPARK-21195 > URL: https://issues.apache.org/jira/browse/SPARK-21195 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.1.1 >Reporter: Robert Kruszewski >Assignee: Apache Spark >Priority: Minor > > Currently when MetricsSystem registers a source it only picks up currently > registered metrics. It's quite cumbersome and leads to a lot of boilerplate > to preregister all metrics especially with systems that use instrumentation. > This change proposes to teach MetricsSystem to watch metrics added to sources > and dynamically register them -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Reopened] (SPARK-21195) MetricSystem should pick up dynamically registered metrics in sources
[ https://issues.apache.org/jira/browse/SPARK-21195?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Kruszewski reopened SPARK-21195: --- > MetricSystem should pick up dynamically registered metrics in sources > - > > Key: SPARK-21195 > URL: https://issues.apache.org/jira/browse/SPARK-21195 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.1.1 >Reporter: Robert Kruszewski >Priority: Minor > > Currently when MetricsSystem registers a source it only picks up currently > registered metrics. It's quite cumbersome and leads to a lot of boilerplate > to preregister all metrics especially with systems that use instrumentation. > This change proposes to teach MetricsSystem to watch metrics added to sources > and dynamically register them -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21195) MetricSystem should pick up dynamically registered metrics in sources
[ https://issues.apache.org/jira/browse/SPARK-21195?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16240274#comment-16240274 ] Robert Kruszewski commented on SPARK-21195: --- Thanks, indeed I responded that I missed the last comment. > MetricSystem should pick up dynamically registered metrics in sources > - > > Key: SPARK-21195 > URL: https://issues.apache.org/jira/browse/SPARK-21195 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.1.1 >Reporter: Robert Kruszewski >Priority: Minor > > Currently when MetricsSystem registers a source it only picks up currently > registered metrics. It's quite cumbersome and leads to a lot of boilerplate > to preregister all metrics especially with systems that use instrumentation. > This change proposes to teach MetricsSystem to watch metrics added to sources > and dynamically register them -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-22456) Add new function dayofweek
[ https://issues.apache.org/jira/browse/SPARK-22456?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22456: Assignee: (was: Apache Spark) > Add new function dayofweek > -- > > Key: SPARK-22456 > URL: https://issues.apache.org/jira/browse/SPARK-22456 > Project: Spark > Issue Type: New Feature > Components: PySpark, SQL >Affects Versions: 2.2.0 >Reporter: Michael Styles > > Add new function *dayofweek* to return the day of the week of the given > argument as an integer value in the range 1-7, where 1 represents Sunday. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22456) Add new function dayofweek
[ https://issues.apache.org/jira/browse/SPARK-22456?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16240272#comment-16240272 ] Apache Spark commented on SPARK-22456: -- User 'ptkool' has created a pull request for this issue: https://github.com/apache/spark/pull/19672 > Add new function dayofweek > -- > > Key: SPARK-22456 > URL: https://issues.apache.org/jira/browse/SPARK-22456 > Project: Spark > Issue Type: New Feature > Components: PySpark, SQL >Affects Versions: 2.2.0 >Reporter: Michael Styles > > Add new function *dayofweek* to return the day of the week of the given > argument as an integer value in the range 1-7, where 1 represents Sunday. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-22456) Add new function dayofweek
[ https://issues.apache.org/jira/browse/SPARK-22456?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22456: Assignee: Apache Spark > Add new function dayofweek > -- > > Key: SPARK-22456 > URL: https://issues.apache.org/jira/browse/SPARK-22456 > Project: Spark > Issue Type: New Feature > Components: PySpark, SQL >Affects Versions: 2.2.0 >Reporter: Michael Styles >Assignee: Apache Spark > > Add new function *dayofweek* to return the day of the week of the given > argument as an integer value in the range 1-7, where 1 represents Sunday. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21195) MetricSystem should pick up dynamically registered metrics in sources
[ https://issues.apache.org/jira/browse/SPARK-21195?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16240270#comment-16240270 ] Sean Owen commented on SPARK-21195: --- The PR hadn't been updated in almost half a year, and Jerry had some concerns. I see you just commented on it. If there's any interest in continuing with it, you can reopen. > MetricSystem should pick up dynamically registered metrics in sources > - > > Key: SPARK-21195 > URL: https://issues.apache.org/jira/browse/SPARK-21195 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.1.1 >Reporter: Robert Kruszewski >Priority: Minor > > Currently when MetricsSystem registers a source it only picks up currently > registered metrics. It's quite cumbersome and leads to a lot of boilerplate > to preregister all metrics especially with systems that use instrumentation. > This change proposes to teach MetricsSystem to watch metrics added to sources > and dynamically register them -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22456) Add new function dayofweek
[ https://issues.apache.org/jira/browse/SPARK-22456?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16240269#comment-16240269 ] Sean Owen commented on SPARK-22456: --- Does it exist in other SQL dialects? That's the reason most functions like this exist in Spark. Beyond that, these sorts of things should be UDFs. > Add new function dayofweek > -- > > Key: SPARK-22456 > URL: https://issues.apache.org/jira/browse/SPARK-22456 > Project: Spark > Issue Type: New Feature > Components: PySpark, SQL >Affects Versions: 2.2.0 >Reporter: Michael Styles > > Add new function *dayofweek* to return the day of the week of the given > argument as an integer value in the range 1-7, where 1 represents Sunday. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
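For reference, the UDF route suggested above could be sketched as follows; shifting from java.time's Monday=1..Sunday=7 convention to the requested Sunday=1..Saturday=7 numbering is the only real logic, and the names are illustrative:
{code:scala}
import org.apache.spark.sql.functions.udf

// Day of week as 1..7 with 1 = Sunday, built from java.time (Monday = 1 ... Sunday = 7).
// Null handling is omitted for brevity.
val dayOfWeekUdf = udf { d: java.sql.Date =>
  d.toLocalDate.getDayOfWeek.getValue % 7 + 1
}
// Usage, assuming a DataFrame `df` with a DateType column "d":
// df.withColumn("dow", dayOfWeekUdf(df("d")))
{code}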
[jira] [Created] (SPARK-22456) Add new function dayofweek
Michael Styles created SPARK-22456: -- Summary: Add new function dayofweek Key: SPARK-22456 URL: https://issues.apache.org/jira/browse/SPARK-22456 Project: Spark Issue Type: New Feature Components: PySpark, SQL Affects Versions: 2.2.0 Reporter: Michael Styles Add new function *dayofweek* to return the day of the week of the given argument as an integer value in the range 1-7, where 1 represents Sunday. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21195) MetricSystem should pick up dynamically registered metrics in sources
[ https://issues.apache.org/jira/browse/SPARK-21195?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16240267#comment-16240267 ] Robert Kruszewski commented on SPARK-21195: --- What's the justification for `won't fix` here? > MetricSystem should pick up dynamically registered metrics in sources > - > > Key: SPARK-21195 > URL: https://issues.apache.org/jira/browse/SPARK-21195 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.1.1 >Reporter: Robert Kruszewski >Priority: Minor > > Currently when MetricsSystem registers a source it only picks up currently > registered metrics. It's quite cumbersome and leads to a lot of boilerplate > to preregister all metrics especially with systems that use instrumentation. > This change proposes to teach MetricsSystem to watch metrics added to sources > and dynamically register them -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-22455) Provide an option to store the exception records/files and reasons in log files when reading data from a file-based data source.
[ https://issues.apache.org/jira/browse/SPARK-22455?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-22455: -- Flags: (was: Patch) Labels: (was: patch) Priority: Minor (was: Major) > Provide an option to store the exception records/files and reasons in log > files when reading data from a file-based data source. > > > Key: SPARK-22455 > URL: https://issues.apache.org/jira/browse/SPARK-22455 > Project: Spark > Issue Type: Improvement > Components: Input/Output >Affects Versions: 2.2.0 >Reporter: Sreenath Chothar >Priority: Minor > > Provide an option to store the exception/bad records and reasons in log files > when reading data from a file-based data source into a PySpark dataframe. Now > only the following three options are available: > 1. PERMISSIVE : sets other fields to null when it meets a corrupted record, > and puts the malformed string into a field configured by > columnNameOfCorruptRecord. > 2. DROPMALFORMED : ignores the whole corrupted records. > 3. FAILFAST : throws an exception when it meets corrupted records. > We could use the first option to accumulate the corrupted records and output them to a > log file. But we can't use this option when the input schema is inferred > automatically. If the number of columns to read is too large, providing the > complete schema with an additional column for storing corrupted data is > difficult. Instead, the "pyspark.sql.DataFrameReader.csv" reader function could > provide an option to redirect the bad records to a configured log file path > with exception details. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-21195) MetricSystem should pick up dynamically registered metrics in sources
[ https://issues.apache.org/jira/browse/SPARK-21195?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-21195. --- Resolution: Won't Fix > MetricSystem should pick up dynamically registered metrics in sources > - > > Key: SPARK-21195 > URL: https://issues.apache.org/jira/browse/SPARK-21195 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.1.1 >Reporter: Robert Kruszewski >Priority: Minor > > Currently when MetricsSystem registers a source it only picks up currently > registered metrics. It's quite cumbersome and leads to a lot of boilerplate > to preregister all metrics especially with systems that use instrumentation. > This change proposes to teach MetricsSystem to watch metrics added to sources > and dynamically register them -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-20638) Optimize the CartesianRDD to reduce repeatedly data fetching
[ https://issues.apache.org/jira/browse/SPARK-20638?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-20638. --- Resolution: Won't Fix > Optimize the CartesianRDD to reduce repeatedly data fetching > > > Key: SPARK-20638 > URL: https://issues.apache.org/jira/browse/SPARK-20638 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: Teng Jiang > > In CartesianRDD, group each iterator into multiple groups. Thus, in the second > iteration, the data will be fetched (num of data)/groupSize times, rather > than (num of data) times. > The test results are: > Test Environment : 3 workers, each has 10 cores, 30G memory, 1 executor > Test data : users : 480,189, each is a 10-dim vector, and items : 17770, each > is a 10-dim vector. > With default CartesianRDD, cartesian time is 2420.7s. > With this proposal, cartesian time is 45.3s, > 50x faster than the original method. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
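The "(num of data)/groupSize" fetch count in the description comes from materializing blocks of one side so that the other side is re-read once per block rather than once per element. A minimal sketch of that pattern on plain iterators, with fetchRight standing in for the (expensive) re-fetch of the second RDD's partition; this is not the proposed CartesianRDD code itself:
{code:scala}
// Cross product of `left` with a re-fetchable right side, touching the right
// side once per block of `groupSize` left elements instead of once per element.
def groupedCartesian[A, B](left: Iterator[A],
                           fetchRight: () => Iterator[B],
                           groupSize: Int): Iterator[(A, B)] =
  left.grouped(groupSize).flatMap { block =>
    fetchRight().flatMap(b => block.iterator.map(a => (a, b)))
  }
{code}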
[jira] [Resolved] (SPARK-22154) add a shutdown hook that explains why the output is terminating
[ https://issues.apache.org/jira/browse/SPARK-22154?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-22154. --- Resolution: Won't Fix > add a shutdown hook that explains why the output is terminating > > > Key: SPARK-22154 > URL: https://issues.apache.org/jira/browse/SPARK-22154 > Project: Spark > Issue Type: Improvement > Components: Deploy >Affects Versions: 2.2.0 >Reporter: liuzhaokun >Priority: Critical > > It would be nice to add a shutdown hook here that explains why the output is > terminating. Otherwise if the worker dies the executor logs will silently > stop. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
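A minimal sketch of the suggestion in SPARK-22154, using a plain JVM shutdown hook; the message text and the use of stderr are placeholders for whatever the worker's logging would actually emit:
{code:scala}
// Leaves a final line in the log so readers know why output stops here.
val hook = sys.addShutdownHook {
  System.err.println("Worker JVM is shutting down; executor log output ends here.")
}
{code}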
[jira] [Resolved] (SPARK-10399) Off Heap Memory Access for non-JVM libraries (C++)
[ https://issues.apache.org/jira/browse/SPARK-10399?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-10399. --- Resolution: Won't Fix > Off Heap Memory Access for non-JVM libraries (C++) > -- > > Key: SPARK-10399 > URL: https://issues.apache.org/jira/browse/SPARK-10399 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Reporter: Paul Weiss > > *Summary* > Provide direct off-heap memory access to an external non-JVM program such as > a c++ library within the Spark running JVM/executor. As Spark moves to > storing all data into off heap memory it makes sense to provide access points > to the memory for non-JVM programs. > > *Assumptions* > * Zero copies will be made during the call into non-JVM library > * Access into non-JVM libraries will be accomplished via JNI > * A generic JNI interface will be created so that developers will not need to > deal with the raw JNI call > * C++ will be the initial target non-JVM use case > * memory management will remain on the JVM/Spark side > * the API from C++ will be similar to dataframes as much as feasible and NOT > require expert knowledge of JNI > * Data organization and layout will support complex (multi-type, nested, > etc.) types > > *Design* > * Initially Spark JVM -> non-JVM will be supported > * Creating an embedded JVM with Spark running from a non-JVM program is > initially out of scope > > *Technical* > * GetDirectBufferAddress is the JNI call used to access byte buffer without > copy -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-13547) Add SQL query in web UI's SQL Tab
[ https://issues.apache.org/jira/browse/SPARK-13547?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-13547. --- Resolution: Won't Fix > Add SQL query in web UI's SQL Tab > - > > Key: SPARK-13547 > URL: https://issues.apache.org/jira/browse/SPARK-13547 > Project: Spark > Issue Type: Improvement > Components: SQL, Web UI >Affects Versions: 1.6.0 >Reporter: Jeff Zhang >Priority: Minor > Attachments: screenshot-1.png > > > It would be nice to have the SQL query in the SQL tab -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-19458) loading hive jars from the local repo which has already downloaded
[ https://issues.apache.org/jira/browse/SPARK-19458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-19458. --- Resolution: Won't Fix > loading hive jars from the local repo which has already downloaded > -- > > Key: SPARK-19458 > URL: https://issues.apache.org/jira/browse/SPARK-19458 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.2.0 >Reporter: Song Jun >Priority: Minor > > Currently, when we create a HiveClient for a specific metastore version and > `spark.sql.hive.metastore.jars` is set to `maven`, Spark will download the > hive jars from the remote repo (http://www.datanucleus.org/downloads/maven2). > We should allow the user to load hive jars from the local repo to which they have > already been downloaded. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-20044) Support Spark UI behind front-end reverse proxy using a path prefix
[ https://issues.apache.org/jira/browse/SPARK-20044?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-20044. --- Resolution: Won't Fix > Support Spark UI behind front-end reverse proxy using a path prefix > --- > > Key: SPARK-20044 > URL: https://issues.apache.org/jira/browse/SPARK-20044 > Project: Spark > Issue Type: Improvement > Components: Web UI >Affects Versions: 2.1.0 >Reporter: Oliver Koeth >Priority: Minor > Labels: reverse-proxy, sso > > Purpose: allow to run the Spark web UI behind a reverse proxy with URLs > prefixed by a context root, like www.mydomain.com/spark. In particular, this > allows to access multiple Spark clusters through the same virtual host, only > distinguishing them by context root, like www.mydomain.com/cluster1, > www.mydomain.com/cluster2, and it allows to run the Spark UI in a common > cookie domain (for SSO) with other services. > [SPARK-15487] introduced some support for front-end reverse proxies by > allowing all Spark UI requests to be routed through the master UI as a single > endpoint and also added a spark.ui.reverseProxyUrl setting to define > another proxy sitting in front of Spark. However, as noted in the comments on > [SPARK-15487], this mechanism does not currently work if the reverseProxyUrl > includes a context root like the examples above: Most links generated by the > Spark UI result in full path URLs (like /proxy/app-"id"/...) that do not > account for a path prefix (context root) and work only if the Spark UI "owns" > the entire virtual host. In fact, the only place in the UI where the > reverseProxyUrl seems to be used is the back-link from the worker UI to the > master UI. > The discussion on [SPARK-15487] proposes to open a new issue for the problem, > but that does not seem to have happened, so this issue aims to address the > remaining shortcomings of spark.ui.reverseProxyUrl > The problem can be partially worked around by doing content rewrite in a > front-end proxy and prefixing src="/..." or href="/..." links with a context > root. However, detecting and patching URLs in HTML output is not a robust > approach and breaks down for URLs included in custom REST responses. E.g. the > "allexecutors" REST call used from the Spark 2.1.0 application/executors page > returns links for log viewing that direct to the worker UI and do not work in > this scenario. > This issue proposes to honor spark.ui.reverseProxyUrl throughout Spark UI URL > generation. Experiments indicate that most of this can simply be achieved by > using/prepending spark.ui.reverseProxyUrl to the existing spark.ui.proxyBase > system property. Beyond that, the places that require adaption are > - worker and application links in the master web UI > - webui URLs returned by REST interfaces > Note: It seems that returned redirect location headers do not need to be > adapted, since URL rewriting for these is commonly done in front-end proxies > and has a well-defined interface -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
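For orientation, the two settings discussed above are enabled roughly like this; the host name and path prefix are taken from the reporter's own examples and are not a recommendation:
{noformat}
# spark-defaults.conf (illustrative values)
spark.ui.reverseProxy     true
spark.ui.reverseProxyUrl  https://www.mydomain.com/cluster1
{noformat}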
[jira] [Resolved] (SPARK-19527) Approximate Size of Intersection of Bloom Filters
[ https://issues.apache.org/jira/browse/SPARK-19527?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-19527. --- Resolution: Won't Fix > Approximate Size of Intersection of Bloom Filters > - > > Key: SPARK-19527 > URL: https://issues.apache.org/jira/browse/SPARK-19527 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.1.0 >Reporter: Brandon Poole > > There is no way to calculate the approximate number of items in a Bloom filter externally, as > numOfHashFunctions and bits are private. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
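The standard approximation such a calculation would use is n ≈ -(m/k) * ln(1 - X/m), where m is the number of bits, k the number of hash functions and X the number of set bits; the ticket's point is that m and k are not exposed by Spark's BloomFilter. A self-contained sketch of the formula, not a Spark API:
{code:scala}
// Estimated number of distinct inserted items for a Bloom filter with
// `numBits` bits, `numHashFunctions` hash functions and `bitsSet` bits set to 1.
def approxCardinality(numBits: Long, numHashFunctions: Int, bitsSet: Long): Double =
  -(numBits.toDouble / numHashFunctions) * math.log1p(-(bitsSet.toDouble / numBits))
{code}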
[jira] [Created] (SPARK-22455) Provide an option to store the exception records/files and reasons in log files when reading data from a file-based data source.
Sreenath Chothar created SPARK-22455: Summary: Provide an option to store the exception records/files and reasons in log files when reading data from a file-based data source. Key: SPARK-22455 URL: https://issues.apache.org/jira/browse/SPARK-22455 Project: Spark Issue Type: Improvement Components: Input/Output Affects Versions: 2.2.0 Reporter: Sreenath Chothar Provide an option to store the exception/bad records and reasons in log files when reading data from a file-based data source into a PySpark dataframe. Now only the following three options are available: 1. PERMISSIVE : sets other fields to null when it meets a corrupted record, and puts the malformed string into a field configured by columnNameOfCorruptRecord. 2. DROPMALFORMED : ignores the whole corrupted records. 3. FAILFAST : throws an exception when it meets corrupted records. We could use the first option to accumulate the corrupted records and output them to a log file. But we can't use this option when the input schema is inferred automatically. If the number of columns to read is too large, providing the complete schema with an additional column for storing corrupted data is difficult. Instead, the "pyspark.sql.DataFrameReader.csv" reader function could provide an option to redirect the bad records to a configured log file path with exception details. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
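For completeness, the "first option" workaround the description refers to looks roughly like this in Scala; the schema, the "_corrupt_record" column name, the paths, and the `spark` session are illustrative assumptions, and the ticket's point is precisely that writing the full schema by hand is painful for wide inputs:
{code:scala}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._

val spark = SparkSession.builder().appName("corrupt-record-demo").getOrCreate()

// Explicit schema with an extra string column that receives malformed rows.
val schema = StructType(Seq(
  StructField("id", IntegerType),
  StructField("name", StringType),
  StructField("_corrupt_record", StringType)
))

val df = spark.read
  .option("mode", "PERMISSIVE")
  .option("columnNameOfCorruptRecord", "_corrupt_record")
  .schema(schema)
  .csv("/path/to/input.csv")

// The bad rows could then be collected into a separate location, e.g.:
// df.filter(df("_corrupt_record").isNotNull).select("_corrupt_record").write.text("/path/to/bad-records")
{code}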
[jira] [Commented] (SPARK-22297) Flaky test: BlockManagerSuite "Shuffle registration timeout and maxAttempts conf"
[ https://issues.apache.org/jira/browse/SPARK-22297?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16240228#comment-16240228 ] Apache Spark commented on SPARK-22297: -- User 'mpetruska' has created a pull request for this issue: https://github.com/apache/spark/pull/19671 > Flaky test: BlockManagerSuite "Shuffle registration timeout and maxAttempts > conf" > - > > Key: SPARK-22297 > URL: https://issues.apache.org/jira/browse/SPARK-22297 > Project: Spark > Issue Type: Bug > Components: Spark Core, Tests >Affects Versions: 2.3.0 >Reporter: Marcelo Vanzin >Priority: Minor > > Ran into this locally; the test code seems to use timeouts which generally > end up in flakiness like this. > {noformat} > [info] - SPARK-20640: Shuffle registration timeout and maxAttempts conf are > working *** FAILED *** (1 second, 203 milliseconds) > [info] "Unable to register with external shuffle server due to : > java.util.concurrent.TimeoutException: Timeout waiting for task." did not > contain "test_spark_20640_try_again" (BlockManagerSuite.scala:1370) > [info] org.scalatest.exceptions.TestFailedException: > [info] at > org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:528) > [info] at > org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1560) > [info] at > org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:501) > [info] at > org.apache.spark.storage.BlockManagerSuite$$anonfun$14.apply$mcV$sp(BlockManagerSuite.scala:1370) > [info] at > org.apache.spark.storage.BlockManagerSuite$$anonfun$14.apply(BlockManagerSuite.scala:1323) > [info] at > org.apache.spark.storage.BlockManagerSuite$$anonfun$14.apply(BlockManagerSuite.scala:1323) > {noformat} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-22297) Flaky test: BlockManagerSuite "Shuffle registration timeout and maxAttempts conf"
[ https://issues.apache.org/jira/browse/SPARK-22297?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22297: Assignee: (was: Apache Spark) > Flaky test: BlockManagerSuite "Shuffle registration timeout and maxAttempts > conf" > - > > Key: SPARK-22297 > URL: https://issues.apache.org/jira/browse/SPARK-22297 > Project: Spark > Issue Type: Bug > Components: Spark Core, Tests >Affects Versions: 2.3.0 >Reporter: Marcelo Vanzin >Priority: Minor > > Ran into this locally; the test code seems to use timeouts which generally > end up in flakiness like this. > {noformat} > [info] - SPARK-20640: Shuffle registration timeout and maxAttempts conf are > working *** FAILED *** (1 second, 203 milliseconds) > [info] "Unable to register with external shuffle server due to : > java.util.concurrent.TimeoutException: Timeout waiting for task." did not > contain "test_spark_20640_try_again" (BlockManagerSuite.scala:1370) > [info] org.scalatest.exceptions.TestFailedException: > [info] at > org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:528) > [info] at > org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1560) > [info] at > org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:501) > [info] at > org.apache.spark.storage.BlockManagerSuite$$anonfun$14.apply$mcV$sp(BlockManagerSuite.scala:1370) > [info] at > org.apache.spark.storage.BlockManagerSuite$$anonfun$14.apply(BlockManagerSuite.scala:1323) > [info] at > org.apache.spark.storage.BlockManagerSuite$$anonfun$14.apply(BlockManagerSuite.scala:1323) > {noformat} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-22297) Flaky test: BlockManagerSuite "Shuffle registration timeout and maxAttempts conf"
[ https://issues.apache.org/jira/browse/SPARK-22297?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22297: Assignee: Apache Spark > Flaky test: BlockManagerSuite "Shuffle registration timeout and maxAttempts > conf" > - > > Key: SPARK-22297 > URL: https://issues.apache.org/jira/browse/SPARK-22297 > Project: Spark > Issue Type: Bug > Components: Spark Core, Tests >Affects Versions: 2.3.0 >Reporter: Marcelo Vanzin >Assignee: Apache Spark >Priority: Minor > > Ran into this locally; the test code seems to use timeouts which generally > end up in flakiness like this. > {noformat} > [info] - SPARK-20640: Shuffle registration timeout and maxAttempts conf are > working *** FAILED *** (1 second, 203 milliseconds) > [info] "Unable to register with external shuffle server due to : > java.util.concurrent.TimeoutException: Timeout waiting for task." did not > contain "test_spark_20640_try_again" (BlockManagerSuite.scala:1370) > [info] org.scalatest.exceptions.TestFailedException: > [info] at > org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:528) > [info] at > org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1560) > [info] at > org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:501) > [info] at > org.apache.spark.storage.BlockManagerSuite$$anonfun$14.apply$mcV$sp(BlockManagerSuite.scala:1370) > [info] at > org.apache.spark.storage.BlockManagerSuite$$anonfun$14.apply(BlockManagerSuite.scala:1323) > [info] at > org.apache.spark.storage.BlockManagerSuite$$anonfun$14.apply(BlockManagerSuite.scala:1323) > {noformat} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22454) ExternalShuffleClient.close() should check null
[ https://issues.apache.org/jira/browse/SPARK-22454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16240192#comment-16240192 ] Apache Spark commented on SPARK-22454: -- User 'wangyum' has created a pull request for this issue: https://github.com/apache/spark/pull/19670 > ExternalShuffleClient.close() should check null > --- > > Key: SPARK-22454 > URL: https://issues.apache.org/jira/browse/SPARK-22454 > Project: Spark > Issue Type: Bug > Components: Shuffle >Affects Versions: 2.1.1 >Reporter: Yuming Wang >Priority: Minor > > {noformat} > 17/11/06 20:08:05 ERROR Utils: Uncaught exception in thread main > java.lang.NullPointerException > at > org.apache.spark.network.shuffle.ExternalShuffleClient.close(ExternalShuffleClient.java:152) > at org.apache.spark.storage.BlockManager.stop(BlockManager.scala:1407) > at org.apache.spark.SparkEnv.stop(SparkEnv.scala:89) > at > org.apache.spark.SparkContext$$anonfun$stop$11.apply$mcV$sp(SparkContext.scala:1849) > at org.apache.spark.util.Utils$.tryLogNonFatalError(Utils.scala:1283) > at org.apache.spark.SparkContext.stop(SparkContext.scala:1848) > at org.apache.spark.SparkContext.(SparkContext.scala:587) > at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2320) > at > org.apache.spark.sql.SparkSession$Builder$$anonfun$6.apply(SparkSession.scala:868) > at > org.apache.spark.sql.SparkSession$Builder$$anonfun$6.apply(SparkSession.scala:860) > at scala.Option.getOrElse(Option.scala:121) > at > org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:860) > at > org.apache.spark.sql.hive.thriftserver.SparkSQLEnv$.init(SparkSQLEnv.scala:47) > at > org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.(SparkSQLCLIDriver.scala:288) > at > org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:137) > at > org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:743) > at > org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:187) > at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:212) > at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:126) > at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) > {noformat} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
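The fix being requested is essentially a defensive close(): only tear down what was actually initialised. A generic sketch of the pattern; the class and field names here are illustrative and are not the real ExternalShuffleClient members:
{code:scala}
// Close the underlying factory only if init() ever ran.
class GuardedClient extends AutoCloseable {
  private var factory: AutoCloseable = _

  def init(f: AutoCloseable): Unit = { factory = f }

  override def close(): Unit = {
    if (factory != null) {
      factory.close()
      factory = null
    }
  }
}
{code}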