[jira] [Commented] (SPARK-22451) Reduce decision tree aggregate size for unordered features from O(2^numCategories) to O(numCategories)

2017-11-06 Thread Weichen Xu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16241634#comment-16241634
 ] 

Weichen Xu commented on SPARK-22451:


[~josephkb] I think it can be reconstructed from a summary with only 
O(numCategories) values. Let me give a simple example to explain how it works 
(correct me if I am wrong):
Suppose an unordered feature has categories [a, b, c, d, e, f]. We first 
collect separate statistics stat[a], stat[b], stat[c], stat[d], stat[e], stat[f];
then, if we want the stats for the split [a, c, d], we can compute them as 
stat[a] + stat[c] + stat[d].
[~Siddharth Murching] can also help confirm the correctness of this.
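
A rough sketch of the idea in plain Scala (simplified to counts; this is only an 
illustration, not the actual Spark implementation):
{code}
// Per-category statistics are aggregated once (O(numCategories) values).
// Statistics for any candidate split are then reconstructed by summing the
// per-category entries, so the 2^numCategories splits never need to be
// aggregated individually.
object UnorderedSplitSketch {
  def main(args: Array[String]): Unit = {
    // stat(c) = aggregated label statistic for category c (a plain count here)
    val stat = Map("a" -> 3.0, "b" -> 1.0, "c" -> 4.0, "d" -> 1.0, "e" -> 5.0, "f" -> 9.0)

    // Stats for the split {a, c, d} vs. the rest:
    val left = Set("a", "c", "d")
    val leftStat = left.toSeq.map(stat).sum
    val rightStat = stat.values.sum - leftStat

    println(s"left = $leftStat, right = $rightStat")
  }
}
{code}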

> Reduce decision tree aggregate size for unordered features from 
> O(2^numCategories) to O(numCategories)
> --
>
> Key: SPARK-22451
> URL: https://issues.apache.org/jira/browse/SPARK-22451
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: Weichen Xu
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> We do not need to generate all possible splits for unordered features before 
> aggregation. In the aggregation (executor side):
> 1. Change `mixedBinSeqOp` so that each unordered feature is handled with the 
> same statistics as ordered features; for unordered features we then only need 
> O(numCategories) space per feature.
> 2. After the driver gets the aggregate result, generate all possible split 
> combinations and compute the best split.
> This reduces the decision tree aggregate size for each unordered feature from 
> O(2^numCategories) to O(numCategories), where `numCategories` is the arity of 
> the unordered feature.
> It also reduces the CPU cost on the executor side: the time complexity for 
> each unordered feature drops from O(numPoints * 2^numCategories) to 
> O(numPoints).
> It does not increase the time complexity of computing the best split for 
> unordered features on the driver side.






[jira] [Assigned] (SPARK-22464) <=> is not supported by Hive metastore partition predicate pushdown

2017-11-06 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22464?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22464:


Assignee: Apache Spark  (was: Xiao Li)

> <=> is not supported by Hive metastore partition predicate pushdown
> ---
>
> Key: SPARK-22464
> URL: https://issues.apache.org/jira/browse/SPARK-22464
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Xiao Li
>Assignee: Apache Spark
>
> <=> is not supported by Hive metastore partition predicate pushdown. We 
> should forbid it. 






[jira] [Commented] (SPARK-22464) <=> is not supported by Hive metastore partition predicate pushdown

2017-11-06 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16241607#comment-16241607
 ] 

Apache Spark commented on SPARK-22464:
--

User 'gatorsmile' has created a pull request for this issue:
https://github.com/apache/spark/pull/19682

> <=> is not supported by Hive metastore partition predicate pushdown
> ---
>
> Key: SPARK-22464
> URL: https://issues.apache.org/jira/browse/SPARK-22464
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Xiao Li
>Assignee: Xiao Li
>
> <=> is not supported by Hive metastore partition predicate pushdown. We 
> should forbid it. 






[jira] [Assigned] (SPARK-22464) <=> is not supported by Hive metastore partition predicate pushdown

2017-11-06 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22464?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22464:


Assignee: Xiao Li  (was: Apache Spark)

> <=> is not supported by Hive metastore partition predicate pushdown
> ---
>
> Key: SPARK-22464
> URL: https://issues.apache.org/jira/browse/SPARK-22464
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Xiao Li
>Assignee: Xiao Li
>
> <=> is not supported by Hive metastore partition predicate pushdown. We 
> should forbid it. 






[jira] [Created] (SPARK-22464) <=> is not supported by Hive metastore partition predicate pushdown

2017-11-06 Thread Xiao Li (JIRA)
Xiao Li created SPARK-22464:
---

 Summary: <=> is not supported by Hive metastore partition 
predicate pushdown
 Key: SPARK-22464
 URL: https://issues.apache.org/jira/browse/SPARK-22464
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.2.0
Reporter: Xiao Li
Assignee: Xiao Li


<=> is not supported by Hive metastore partition predicate pushdown. We should 
forbid it. 
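
For context, {{<=>}} is Spark SQL's null-safe equality operator. A sketch of the 
kind of query that would hit this code path, run from a spark-shell session 
(the table and column names are hypothetical):
{code}
// Null-safe equality on a partition column of a Hive table; this kind of
// predicate is what Hive metastore partition predicate pushdown cannot handle,
// hence the proposal to forbid pushing it down.
spark.sql("SELECT * FROM sales_partitioned WHERE dt <=> '2017-11-06'").show()
{code}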






[jira] [Comment Edited] (SPARK-19680) Offsets out of range with no configured reset policy for partitions

2017-11-06 Thread Serkan Taş (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16241577#comment-16241577
 ] 

Serkan Taş edited comment on SPARK-19680 at 11/7/17 6:37 AM:
-

I frequently get the "numRecords must not be negative" error, but this time I 
also hit it on a YARN cluster: a job that had been running for 4 days was 
terminated due to the same error:

diagnostics: User class threw exception: org.apache.spark.SparkException: Job 
aborted due to stage failure: Task 0 in stage 18388.0 failed 4 times, most 
recent failure: Lost task 0.3 in stage 18388.0 (TID 21387, server, executor 1): 
org.apache.kafka.clients.consumer.OffsetOutOfRangeException: Offsets out of 
range with no configured reset policy for partitions: {topic-0=33436703}



Exception in thread "main" org.apache.spark.SparkException: Application 
application_fdsfdsfsdfsdf_0001 finished with failed status

Hadoop : 2.8.0
Spark : 2.1.0
Kafka : 0.10.2.1

Configuration : 

val kafkaParams = Map[String, Object](
  "bootstrap.servers" -> "brokers",
  "key.deserializer" -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id" -> "grp_id",
  "auto.offset.reset" -> "latest",
  "enable.auto.commit" -> (false: java.lang.Boolean)
)


was (Author: serkan_tas):
I frequently get the error of "numRecords must not be negative", but this time 
I have the same issue also on a yarn cluster for a job running 4 days 
terminated due to the same error :

diagnostics: User class threw exception: org.apache.spark.SparkException: Job 
aborted due to stage failure: Task 0 in stage 18388.0 failed 4 times, most 
recent failure: Lost task 0.3 in stage 18388.0 (TID 21387, server, executor 1): 
org.apache.kafka.clients.consumer.OffsetOutOfRangeException: Offsets out of 
range with no configured reset policy for partitions: {topic-0=33436703}



Exception in thread "main" org.apache.spark.SparkException: Application 
application_fdsfdsfsdfsdf_0001 finished with failed status

Hadoop : 2.8.0
Spark : 2.1.0
Kafka : 0.10.2.1

Configuration : 

val kafkaParams = Map[String, Object](
  "bootstrap.servers" -> "brokers",
  "key.deserializer" -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id" -> "grp_id",
  "auto.offset.reset" -> "latest",
  "enable.auto.commit" -> (false: java.lang.Boolean)
)

> Offsets out of range with no configured reset policy for partitions
> ---
>
> Key: SPARK-19680
> URL: https://issues.apache.org/jira/browse/SPARK-19680
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 2.1.0
>Reporter: Schakmann Rene
>
> I'm using Spark Streaming with Kafka to actually create a toplist. I want to 
> read all the messages in Kafka, so I set
>"auto.offset.reset" -> "earliest"
> Nevertheless, when I start the job on our Spark cluster it is not working; I 
> get:
> Error:
> {code:title=error.log|borderStyle=solid}
>   Job aborted due to stage failure: Task 2 in stage 111.0 failed 4 times, 
> most recent failure: Lost task 2.3 in stage 111.0 (TID 1270, 194.232.55.23, 
> executor 2): org.apache.kafka.clients.consumer.OffsetOutOfRangeException: 
> Offsets out of range with no configured reset policy for partitions: 
> {SearchEvents-2=161803385}
> {code}
> This is somehow wrong because I did set the auto.offset.reset property
> Setup:
> Kafka Parameter:
> {code:title=Config.Scala|borderStyle=solid}
>   def getDefaultKafkaReceiverParameter(properties: Properties):Map[String, 
> Object] = {
> Map(
>   "bootstrap.servers" -> 
> properties.getProperty("kafka.bootstrap.servers"),
>   "group.id" -> properties.getProperty("kafka.consumer.group"),
>   "auto.offset.reset" -> "earliest",
>   "spark.streaming.kafka.consumer.cache.enabled" -> "false",
>   "enable.auto.commit" -> "false",
>   "key.deserializer" -> classOf[StringDeserializer],
>   "value.deserializer" -> "at.willhaben.sid.DTOByteDeserializer")
>   }
> {code}
> Job:
> {code:title=Job.Scala|borderStyle=solid}
>   def processSearchKeyWords(stream: InputDStream[ConsumerRecord[String, 
> Array[Byte]]], windowDuration: Int, slideDuration: Int, kafkaSink: 
> Broadcast[KafkaSink[TopList]]): Unit = {
> getFilteredStream(stream.map(_.value()), windowDuration, 
> slideDuration).foreachRDD(rdd => {
>   val topList = new TopList
>   topList.setCreated(new Date())
>   topList.setTopListEntryList(rdd.take(TopListLength).toList)
>   CurrentLogger.info("TopList length: " + 
> topList.getTopListEntryList.size().toString)
>   kafkaSink.value.send(SendToTopicName, topList)
>   CurrentLogger.info("Last Run: " + System.currentTimeMillis())
> 

[jira] [Comment Edited] (SPARK-19680) Offsets out of range with no configured reset policy for partitions

2017-11-06 Thread Serkan Taş (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16241577#comment-16241577
 ] 

Serkan Taş edited comment on SPARK-19680 at 11/7/17 6:37 AM:
-

I frequently get the error of "numRecords must not be negative", but this time 
I have the same issue also on a yarn cluster for a job running 4 days 
terminated due to the same error :

diagnostics: User class threw exception: org.apache.spark.SparkException: Job 
aborted due to stage failure: Task 0 in stage 18388.0 failed 4 times, most 
recent failure: Lost task 0.3 in stage 18388.0 (TID 21387, server, executor 1): 
org.apache.kafka.clients.consumer.OffsetOutOfRangeException: Offsets out of 
range with no configured reset policy for partitions: {topic-0=33436703}



Exception in thread "main" org.apache.spark.SparkException: Application 
application_fdsfdsfsdfsdf_0001 finished with failed status

Hadoop : 2.8.0
Spark : 2.1.0
Kafka : 0.10.2.1

Configuration : 

val kafkaParams = Map[String, Object](
  "bootstrap.servers" -> "brokers",
  "key.deserializer" -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id" -> "grp_id",
  "auto.offset.reset" -> "latest",
  "enable.auto.commit" -> (false: java.lang.Boolean)
)


was (Author: serkan_tas):
I have the same issue on a YARN cluster: a job that had been running for 4 
days was terminated due to the same error:

diagnostics: User class threw exception: org.apache.spark.SparkException: Job 
aborted due to stage failure: Task 0 in stage 18388.0 failed 4 times, most 
recent failure: Lost task 0.3 in stage 18388.0 (TID 21387, server, executor 1): 
org.apache.kafka.clients.consumer.OffsetOutOfRangeException: Offsets out of 
range with no configured reset policy for partitions: {topic-0=33436703}



Exception in thread "main" org.apache.spark.SparkException: Application 
application_fdsfdsfsdfsdf_0001 finished with failed status

Hadoop : 2.8.0
Spark : 2.1.0
Kafka : 0.10.2.1

Configuration : 

val kafkaParams = Map[String, Object](
  "bootstrap.servers" -> "brokers",
  "key.deserializer" -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id" -> "grp_id",
  "auto.offset.reset" -> "latest",
  "enable.auto.commit" -> (false: java.lang.Boolean)
)

> Offsets out of range with no configured reset policy for partitions
> ---
>
> Key: SPARK-19680
> URL: https://issues.apache.org/jira/browse/SPARK-19680
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 2.1.0
>Reporter: Schakmann Rene
>
> I'm using Spark Streaming with Kafka to actually create a toplist. I want to 
> read all the messages in Kafka, so I set
>"auto.offset.reset" -> "earliest"
> Nevertheless, when I start the job on our Spark cluster it is not working; I 
> get:
> Error:
> {code:title=error.log|borderStyle=solid}
>   Job aborted due to stage failure: Task 2 in stage 111.0 failed 4 times, 
> most recent failure: Lost task 2.3 in stage 111.0 (TID 1270, 194.232.55.23, 
> executor 2): org.apache.kafka.clients.consumer.OffsetOutOfRangeException: 
> Offsets out of range with no configured reset policy for partitions: 
> {SearchEvents-2=161803385}
> {code}
> This is somehow wrong because I did set the auto.offset.reset property
> Setup:
> Kafka Parameter:
> {code:title=Config.Scala|borderStyle=solid}
>   def getDefaultKafkaReceiverParameter(properties: Properties):Map[String, 
> Object] = {
> Map(
>   "bootstrap.servers" -> 
> properties.getProperty("kafka.bootstrap.servers"),
>   "group.id" -> properties.getProperty("kafka.consumer.group"),
>   "auto.offset.reset" -> "earliest",
>   "spark.streaming.kafka.consumer.cache.enabled" -> "false",
>   "enable.auto.commit" -> "false",
>   "key.deserializer" -> classOf[StringDeserializer],
>   "value.deserializer" -> "at.willhaben.sid.DTOByteDeserializer")
>   }
> {code}
> Job:
> {code:title=Job.Scala|borderStyle=solid}
>   def processSearchKeyWords(stream: InputDStream[ConsumerRecord[String, 
> Array[Byte]]], windowDuration: Int, slideDuration: Int, kafkaSink: 
> Broadcast[KafkaSink[TopList]]): Unit = {
> getFilteredStream(stream.map(_.value()), windowDuration, 
> slideDuration).foreachRDD(rdd => {
>   val topList = new TopList
>   topList.setCreated(new Date())
>   topList.setTopListEntryList(rdd.take(TopListLength).toList)
>   CurrentLogger.info("TopList length: " + 
> topList.getTopListEntryList.size().toString)
>   kafkaSink.value.send(SendToTopicName, topList)
>   CurrentLogger.info("Last Run: " + System.currentTimeMillis())
> })
>   }
>   def getFilteredStream(result: DStream[Array[Byte]], windowDuration:

[jira] [Commented] (SPARK-19680) Offsets out of range with no configured reset policy for partitions

2017-11-06 Thread Serkan Taş (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16241577#comment-16241577
 ] 

Serkan Taş commented on SPARK-19680:


I have the same issue on a YARN cluster: a job that had been running for 4 
days was terminated due to the same error:

diagnostics: User class threw exception: org.apache.spark.SparkException: Job 
aborted due to stage failure: Task 0 in stage 18388.0 failed 4 times, most 
recent failure: Lost task 0.3 in stage 18388.0 (TID 21387, server, executor 1): 
org.apache.kafka.clients.consumer.OffsetOutOfRangeException: Offsets out of 
range with no configured reset policy for partitions: {topic-0=33436703}



Exception in thread "main" org.apache.spark.SparkException: Application 
application_fdsfdsfsdfsdf_0001 finished with failed status

Hadoop : 2.8.0
Spark : 2.1.0
Kafka : 0.10.2.1

Configuration : 

val kafkaParams = Map[String, Object](
  "bootstrap.servers" -> "brokers",
  "key.deserializer" -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id" -> "grp_id",
  "auto.offset.reset" -> "latest",
  "enable.auto.commit" -> (false: java.lang.Boolean)
)
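
For reference, a sketch of how parameters like these are typically wired into a 
direct stream with the 0-10 integration (the topic name is a placeholder; this 
is not the reporter's actual job):
{code}
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.{ConsumerStrategies, KafkaUtils, LocationStrategies}

object DirectStreamSketch {
  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(new SparkConf().setAppName("DirectStreamSketch"), Seconds(10))

    val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> "brokers",
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id" -> "grp_id",
      "auto.offset.reset" -> "latest",
      "enable.auto.commit" -> (false: java.lang.Boolean)
    )

    // OffsetOutOfRangeException is raised when the offsets requested (e.g. by a
    // long-running job) no longer exist on the brokers and no reset policy applies.
    val stream = KafkaUtils.createDirectStream[String, String](
      ssc,
      LocationStrategies.PreferConsistent,
      ConsumerStrategies.Subscribe[String, String](Seq("topic"), kafkaParams))

    stream.foreachRDD(rdd => println(s"batch size: ${rdd.count()}"))
    ssc.start()
    ssc.awaitTermination()
  }
}
{code}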

> Offsets out of range with no configured reset policy for partitions
> ---
>
> Key: SPARK-19680
> URL: https://issues.apache.org/jira/browse/SPARK-19680
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 2.1.0
>Reporter: Schakmann Rene
>
> I'm using Spark Streaming with Kafka to actually create a toplist. I want to 
> read all the messages in Kafka, so I set
>"auto.offset.reset" -> "earliest"
> Nevertheless, when I start the job on our Spark cluster it is not working; I 
> get:
> Error:
> {code:title=error.log|borderStyle=solid}
>   Job aborted due to stage failure: Task 2 in stage 111.0 failed 4 times, 
> most recent failure: Lost task 2.3 in stage 111.0 (TID 1270, 194.232.55.23, 
> executor 2): org.apache.kafka.clients.consumer.OffsetOutOfRangeException: 
> Offsets out of range with no configured reset policy for partitions: 
> {SearchEvents-2=161803385}
> {code}
> This is somehow wrong because I did set the auto.offset.reset property
> Setup:
> Kafka Parameter:
> {code:title=Config.Scala|borderStyle=solid}
>   def getDefaultKafkaReceiverParameter(properties: Properties):Map[String, 
> Object] = {
> Map(
>   "bootstrap.servers" -> 
> properties.getProperty("kafka.bootstrap.servers"),
>   "group.id" -> properties.getProperty("kafka.consumer.group"),
>   "auto.offset.reset" -> "earliest",
>   "spark.streaming.kafka.consumer.cache.enabled" -> "false",
>   "enable.auto.commit" -> "false",
>   "key.deserializer" -> classOf[StringDeserializer],
>   "value.deserializer" -> "at.willhaben.sid.DTOByteDeserializer")
>   }
> {code}
> Job:
> {code:title=Job.Scala|borderStyle=solid}
>   def processSearchKeyWords(stream: InputDStream[ConsumerRecord[String, 
> Array[Byte]]], windowDuration: Int, slideDuration: Int, kafkaSink: 
> Broadcast[KafkaSink[TopList]]): Unit = {
> getFilteredStream(stream.map(_.value()), windowDuration, 
> slideDuration).foreachRDD(rdd => {
>   val topList = new TopList
>   topList.setCreated(new Date())
>   topList.setTopListEntryList(rdd.take(TopListLength).toList)
>   CurrentLogger.info("TopList length: " + 
> topList.getTopListEntryList.size().toString)
>   kafkaSink.value.send(SendToTopicName, topList)
>   CurrentLogger.info("Last Run: " + System.currentTimeMillis())
> })
>   }
>   def getFilteredStream(result: DStream[Array[Byte]], windowDuration: Int, 
> slideDuration: Int): DStream[TopListEntry] = {
> val Mapper = MapperObject.readerFor[SearchEventDTO]
> result.repartition(100).map(s => Mapper.readValue[SearchEventDTO](s))
>   .filter(s => s != null && s.getSearchRequest != null && 
> s.getSearchRequest.getSearchParameters != null && s.getVertical == 
> Vertical.BAP && 
> s.getSearchRequest.getSearchParameters.containsKey(EspParameterEnum.KEYWORD.getName))
>   .map(row => {
> val name = 
> row.getSearchRequest.getSearchParameters.get(EspParameterEnum.KEYWORD.getName).getEspSearchParameterDTO.getValue.toLowerCase()
> (name, new TopListEntry(name, 1, row.getResultCount))
>   })
>   .reduceByKeyAndWindow(
> (a: TopListEntry, b: TopListEntry) => new TopListEntry(a.getKeyword, 
> a.getSearchCount + b.getSearchCount, a.getMeanSearchHits + 
> b.getMeanSearchHits),
> (a: TopListEntry, b: TopListEntry) => new TopListEntry(a.getKeyword, 
> a.getSearchCount - b.getSearchCount, a.getMeanSearchHits - 
> b.getMeanSearchHits),
> Minutes(windowDuration),
> Seconds(slideDuration))
>   .filter((x: (String, TopListEntry)) => x._2.getSearchCount > 

[jira] [Updated] (SPARK-22289) Cannot save LogisticRegressionClassificationModel with bounds on coefficients

2017-11-06 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22289?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang updated SPARK-22289:

Shepherd: Yanbo Liang

> Cannot save LogisticRegressionClassificationModel with bounds on coefficients
> -
>
> Key: SPARK-22289
> URL: https://issues.apache.org/jira/browse/SPARK-22289
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: Nic Eggert
>Assignee: yuhao yang
>
> I think this was introduced in SPARK-20047.
> Trying to call save on a logistic regression model trained with bounds on its 
> parameters throws an error. This seems to be because Spark doesn't know how 
> to serialize the Matrix parameter.
> Model is set up like this:
> {code}
> val calibrator = new LogisticRegression()
>   .setFeaturesCol("uncalibrated_probability")
>   .setLabelCol("label")
>   .setWeightCol("weight")
>   .setStandardization(false)
>   .setLowerBoundsOnCoefficients(new DenseMatrix(1, 1, Array(0.0)))
>   .setFamily("binomial")
>   .setProbabilityCol("probability")
>   .setPredictionCol("logistic_prediction")
>   .setRawPredictionCol("logistic_raw_prediction")
> {code}
> {code}
> 17/10/16 15:36:59 ERROR ApplicationMaster: User class threw exception: 
> scala.NotImplementedError: The default jsonEncode only supports string and 
> vector. org.apache.spark.ml.param.Param must override jsonEncode for 
> org.apache.spark.ml.linalg.DenseMatrix.
> scala.NotImplementedError: The default jsonEncode only supports string and 
> vector. org.apache.spark.ml.param.Param must override jsonEncode for 
> org.apache.spark.ml.linalg.DenseMatrix.
>   at org.apache.spark.ml.param.Param.jsonEncode(params.scala:98)
>   at 
> org.apache.spark.ml.util.DefaultParamsWriter$$anonfun$1$$anonfun$2.apply(ReadWrite.scala:296)
>   at 
> org.apache.spark.ml.util.DefaultParamsWriter$$anonfun$1$$anonfun$2.apply(ReadWrite.scala:295)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:104)
>   at 
> org.apache.spark.ml.util.DefaultParamsWriter$$anonfun$1.apply(ReadWrite.scala:295)
>   at 
> org.apache.spark.ml.util.DefaultParamsWriter$$anonfun$1.apply(ReadWrite.scala:295)
>   at scala.Option.getOrElse(Option.scala:121)
>   at 
> org.apache.spark.ml.util.DefaultParamsWriter$.getMetadataToSave(ReadWrite.scala:295)
>   at 
> org.apache.spark.ml.util.DefaultParamsWriter$.saveMetadata(ReadWrite.scala:277)
>   at 
> org.apache.spark.ml.classification.LogisticRegressionModel$LogisticRegressionModelWriter.saveImpl(LogisticRegression.scala:1182)
>   at org.apache.spark.ml.util.MLWriter.save(ReadWrite.scala:114)
>   at 
> org.apache.spark.ml.Pipeline$SharedReadWrite$$anonfun$saveImpl$1.apply(Pipeline.scala:254)
>   at 
> org.apache.spark.ml.Pipeline$SharedReadWrite$$anonfun$saveImpl$1.apply(Pipeline.scala:253)
>   at 
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
>   at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
>   at 
> org.apache.spark.ml.Pipeline$SharedReadWrite$.saveImpl(Pipeline.scala:253)
>   at 
> org.apache.spark.ml.PipelineModel$PipelineModelWriter.saveImpl(Pipeline.scala:337)
>   at org.apache.spark.ml.util.MLWriter.save(ReadWrite.scala:114)
>   -snip-
> {code}
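
For reference, a minimal reproduction sketch along the lines described above 
(the dataset, local master, and output path are made up for illustration):
{code}
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.linalg.{DenseMatrix, Vectors}
import org.apache.spark.sql.SparkSession

object BoundedLrSaveRepro {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("BoundedLrSaveRepro").master("local[2]").getOrCreate()
    import spark.implicits._

    // Tiny stand-in dataset using the column names referenced above.
    val df = Seq(
      (0.0, Vectors.dense(0.2), 1.0),
      (1.0, Vectors.dense(0.9), 1.0)
    ).toDF("label", "uncalibrated_probability", "weight")

    val calibrator = new LogisticRegression()
      .setFeaturesCol("uncalibrated_probability")
      .setLabelCol("label")
      .setWeightCol("weight")
      .setStandardization(false)
      .setLowerBoundsOnCoefficients(new DenseMatrix(1, 1, Array(0.0)))
      .setFamily("binomial")

    val model = calibrator.fit(df)
    // Persisting the bounded model is what triggers the jsonEncode error above.
    model.write.overwrite().save("/tmp/bounded-lr-model")

    spark.stop()
  }
}
{code}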






[jira] [Assigned] (SPARK-22289) Cannot save LogisticRegressionClassificationModel with bounds on coefficients

2017-11-06 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22289?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang reassigned SPARK-22289:
---

Assignee: yuhao yang

> Cannot save LogisticRegressionClassificationModel with bounds on coefficients
> -
>
> Key: SPARK-22289
> URL: https://issues.apache.org/jira/browse/SPARK-22289
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: Nic Eggert
>Assignee: yuhao yang
>
> I think this was introduced in SPARK-20047.
> Trying to call save on a logistic regression model trained with bounds on its 
> parameters throws an error. This seems to be because Spark doesn't know how 
> to serialize the Matrix parameter.
> Model is set up like this:
> {code}
> val calibrator = new LogisticRegression()
>   .setFeaturesCol("uncalibrated_probability")
>   .setLabelCol("label")
>   .setWeightCol("weight")
>   .setStandardization(false)
>   .setLowerBoundsOnCoefficients(new DenseMatrix(1, 1, Array(0.0)))
>   .setFamily("binomial")
>   .setProbabilityCol("probability")
>   .setPredictionCol("logistic_prediction")
>   .setRawPredictionCol("logistic_raw_prediction")
> {code}
> {code}
> 17/10/16 15:36:59 ERROR ApplicationMaster: User class threw exception: 
> scala.NotImplementedError: The default jsonEncode only supports string and 
> vector. org.apache.spark.ml.param.Param must override jsonEncode for 
> org.apache.spark.ml.linalg.DenseMatrix.
> scala.NotImplementedError: The default jsonEncode only supports string and 
> vector. org.apache.spark.ml.param.Param must override jsonEncode for 
> org.apache.spark.ml.linalg.DenseMatrix.
>   at org.apache.spark.ml.param.Param.jsonEncode(params.scala:98)
>   at 
> org.apache.spark.ml.util.DefaultParamsWriter$$anonfun$1$$anonfun$2.apply(ReadWrite.scala:296)
>   at 
> org.apache.spark.ml.util.DefaultParamsWriter$$anonfun$1$$anonfun$2.apply(ReadWrite.scala:295)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:104)
>   at 
> org.apache.spark.ml.util.DefaultParamsWriter$$anonfun$1.apply(ReadWrite.scala:295)
>   at 
> org.apache.spark.ml.util.DefaultParamsWriter$$anonfun$1.apply(ReadWrite.scala:295)
>   at scala.Option.getOrElse(Option.scala:121)
>   at 
> org.apache.spark.ml.util.DefaultParamsWriter$.getMetadataToSave(ReadWrite.scala:295)
>   at 
> org.apache.spark.ml.util.DefaultParamsWriter$.saveMetadata(ReadWrite.scala:277)
>   at 
> org.apache.spark.ml.classification.LogisticRegressionModel$LogisticRegressionModelWriter.saveImpl(LogisticRegression.scala:1182)
>   at org.apache.spark.ml.util.MLWriter.save(ReadWrite.scala:114)
>   at 
> org.apache.spark.ml.Pipeline$SharedReadWrite$$anonfun$saveImpl$1.apply(Pipeline.scala:254)
>   at 
> org.apache.spark.ml.Pipeline$SharedReadWrite$$anonfun$saveImpl$1.apply(Pipeline.scala:253)
>   at 
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
>   at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
>   at 
> org.apache.spark.ml.Pipeline$SharedReadWrite$.saveImpl(Pipeline.scala:253)
>   at 
> org.apache.spark.ml.PipelineModel$PipelineModelWriter.saveImpl(Pipeline.scala:337)
>   at org.apache.spark.ml.util.MLWriter.save(ReadWrite.scala:114)
>   -snip-
> {code}






[jira] [Updated] (SPARK-22453) Zookeeper configuration for MesosClusterDispatcher is not documented

2017-11-06 Thread Jais Sebastian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22453?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jais Sebastian updated SPARK-22453:
---
Priority: Critical  (was: Major)

> Zookeeper configuration for MesosClusterDispatcher  is not documented 
> --
>
> Key: SPARK-22453
> URL: https://issues.apache.org/jira/browse/SPARK-22453
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation
>Affects Versions: 2.2.0
> Environment: Spark 2.2 with Spark submit
>Reporter: Jais Sebastian
>Priority: Critical
>
> Spark does not have detailed documentation on how to configure ZooKeeper with 
> MesosClusterDispatcher. It is mentioned that the configuration is possible, 
> but the documentation link provided is broken.
> https://spark.apache.org/docs/latest/running-on-mesos.html#cluster-mode
> _{{The MesosClusterDispatcher also supports writing recovery state into 
> Zookeeper. This will allow the MesosClusterDispatcher to be able to recover 
> all submitted and running containers on relaunch. In order to enable this 
> recovery mode, you can set SPARK_DAEMON_JAVA_OPTS in spark-env by configuring 
> spark.deploy.recoveryMode and related spark.deploy.zookeeper.* 
> configurations. For more information about these configurations please refer 
> to the configurations doc.}}_






[jira] [Commented] (SPARK-22373) Intermittent NullPointerException in org.codehaus.janino.IClass.isAssignableFrom

2017-11-06 Thread Kazuaki Ishizaki (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22373?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16241469#comment-16241469
 ] 

Kazuaki Ishizaki commented on SPARK-22373:
--

If you can provide a program that reproduces this, I can investigate which 
software component is causing the issue.

> Intermittent NullPointerException in 
> org.codehaus.janino.IClass.isAssignableFrom
> 
>
> Key: SPARK-22373
> URL: https://issues.apache.org/jira/browse/SPARK-22373
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.1
> Environment: Hortonworks distribution: HDP 2.6.2.0-205 , 
> /usr/hdp/current/spark2-client/jars/spark-core_2.11-2.1.1.2.6.2.0-205.jar
>Reporter: Dan Meany
>Priority: Minor
>
> It happens very occasionally, and a retry works.
> Full stack:
> 17/10/27 21:06:15 ERROR Executor: Exception in task 29.0 in stage 12.0 (TID 
> 758)
> java.lang.NullPointerException
>   at org.codehaus.janino.IClass.isAssignableFrom(IClass.java:569)
>   at 
> org.codehaus.janino.UnitCompiler.isWideningReferenceConvertible(UnitCompiler.java:10347)
>   at 
> org.codehaus.janino.UnitCompiler.isMethodInvocationConvertible(UnitCompiler.java:8636)
>   at 
> org.codehaus.janino.UnitCompiler.findMostSpecificIInvocable(UnitCompiler.java:8427)
>   at 
> org.codehaus.janino.UnitCompiler.findMostSpecificIInvocable(UnitCompiler.java:8285)
>   at org.codehaus.janino.UnitCompiler.findIMethod(UnitCompiler.java:8169)
>   at org.codehaus.janino.UnitCompiler.findIMethod(UnitCompiler.java:8071)
>   at org.codehaus.janino.UnitCompiler.compileGet2(UnitCompiler.java:4421)
>   at org.codehaus.janino.UnitCompiler.access$7500(UnitCompiler.java:206)
>   at 
> org.codehaus.janino.UnitCompiler$12.visitMethodInvocation(UnitCompiler.java:3774)
>   at 
> org.codehaus.janino.UnitCompiler$12.visitMethodInvocation(UnitCompiler.java:3762)
>   at org.codehaus.janino.Java$MethodInvocation.accept(Java.java:4328)
>   at org.codehaus.janino.UnitCompiler.compileGet(UnitCompiler.java:3762)
>   at 
> org.codehaus.janino.UnitCompiler.compileGetValue(UnitCompiler.java:4933)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:3180)
>   at org.codehaus.janino.UnitCompiler.access$5000(UnitCompiler.java:206)
>   at 
> org.codehaus.janino.UnitCompiler$9.visitMethodInvocation(UnitCompiler.java:3151)
>   at 
> org.codehaus.janino.UnitCompiler$9.visitMethodInvocation(UnitCompiler.java:3139)
>   at org.codehaus.janino.Java$MethodInvocation.accept(Java.java:4328)
>   at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:3139)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:2112)
>   at org.codehaus.janino.UnitCompiler.access$1700(UnitCompiler.java:206)
>   at 
> org.codehaus.janino.UnitCompiler$6.visitExpressionStatement(UnitCompiler.java:1377)
>   at 
> org.codehaus.janino.UnitCompiler$6.visitExpressionStatement(UnitCompiler.java:1370)
>   at org.codehaus.janino.Java$ExpressionStatement.accept(Java.java:2558)
>   at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:1370)
>   at 
> org.codehaus.janino.UnitCompiler.compileStatements(UnitCompiler.java:1450)
>   at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:2811)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:550)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:890)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:894)
>   at org.codehaus.janino.UnitCompiler.access$600(UnitCompiler.java:206)
>   at 
> org.codehaus.janino.UnitCompiler$2.visitMemberClassDeclaration(UnitCompiler.java:377)
>   at 
> org.codehaus.janino.UnitCompiler$2.visitMemberClassDeclaration(UnitCompiler.java:369)
>   at 
> org.codehaus.janino.Java$MemberClassDeclaration.accept(Java.java:1128)
>   at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:369)
>   at 
> org.codehaus.janino.UnitCompiler.compileDeclaredMemberTypes(UnitCompiler.java:1209)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:564)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:890)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:894)
>   at org.codehaus.janino.UnitCompiler.access$600(UnitCompiler.java:206)
>   at 
> org.codehaus.janino.UnitCompiler$2.visitMemberClassDeclaration(UnitCompiler.java:377)
>   at 
> org.codehaus.janino.UnitCompiler$2.visitMemberClassDeclaration(UnitCompiler.java:369)
>   at 
> org.codehaus.janino.Java$MemberClassDeclaration.accept(Java.java:1128)
>   at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:369)
>   at 
> org

[jira] [Issue Comment Deleted] (SPARK-22359) Improve the test coverage of window functions

2017-11-06 Thread Teng Peng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22359?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Teng Peng updated SPARK-22359:
--
Comment: was deleted

(was: [~jiangxb] If I can have one test as a reference, I will figure out the 
rest myself.)

> Improve the test coverage of window functions
> -
>
> Key: SPARK-22359
> URL: https://issues.apache.org/jira/browse/SPARK-22359
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Jiang Xingbo
>
> There are already quite a few integration tests using window functions, but 
> the unit test coverage for window functions is not ideal.
> We'd like to test the following aspects:
> * Specifications
> ** different partition clauses (none, one, multiple)
> ** different order clauses (none, one, multiple, asc/desc, nulls first/last)
> * Frames and their combinations
> ** OffsetWindowFunctionFrame
> ** UnboundedWindowFunctionFrame
> ** SlidingWindowFunctionFrame
> ** UnboundedPrecedingWindowFunctionFrame
> ** UnboundedFollowingWindowFunctionFrame
> * Aggregate function types
> ** Declarative
> ** Imperative
> ** UDAF
> * Spilling
> ** Cover the conditions that WindowExec should spill at least once 






[jira] [Commented] (SPARK-22359) Improve the test coverage of window functions

2017-11-06 Thread Teng Peng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16241392#comment-16241392
 ] 

Teng Peng commented on SPARK-22359:
---

[~jiangxb] If I can have one test as a reference, I will figure out the rest 
myself.

> Improve the test coverage of window functions
> -
>
> Key: SPARK-22359
> URL: https://issues.apache.org/jira/browse/SPARK-22359
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Jiang Xingbo
>
> There are already quite a few integration tests using window functions, but 
> the unit test coverage for window functions is not ideal.
> We'd like to test the following aspects:
> * Specifications
> ** different partition clauses (none, one, multiple)
> ** different order clauses (none, one, multiple, asc/desc, nulls first/last)
> * Frames and their combinations
> ** OffsetWindowFunctionFrame
> ** UnboundedWindowFunctionFrame
> ** SlidingWindowFunctionFrame
> ** UnboundedPrecedingWindowFunctionFrame
> ** UnboundedFollowingWindowFunctionFrame
> * Aggregate function types
> ** Declarative
> ** Imperative
> ** UDAF
> * Spilling
> ** Cover the conditions that WindowExec should spill at least once 
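
As a concrete illustration of the specifications and frames listed above (a 
sketch only, not a test from the actual suite), a unit-style check of one 
combination might look like this:
{code}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.sum

object WindowSpecSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("WindowSpecSketch").master("local[2]").getOrCreate()
    import spark.implicits._

    val df = Seq(("a", 1), ("a", 2), ("b", 3)).toDF("key", "value")

    // One partition clause, one ascending order clause, a sliding row frame.
    val w = Window.partitionBy("key").orderBy("value").rowsBetween(-1, 0)
    df.withColumn("running_sum", sum($"value").over(w)).show()

    spark.stop()
  }
}
{code}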






[jira] [Commented] (SPARK-22463) Missing hadoop/hive/hbase/etc configuration files in SPARK_CONF_DIR to distributed archive

2017-11-06 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16241389#comment-16241389
 ] 

Apache Spark commented on SPARK-22463:
--

User 'yaooqinn' has created a pull request for this issue:
https://github.com/apache/spark/pull/19663

> Missing hadoop/hive/hbase/etc configuration files in SPARK_CONF_DIR to 
> distributed archive
> --
>
> Key: SPARK-22463
> URL: https://issues.apache.org/jira/browse/SPARK-22463
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 2.1.2, 2.2.0
>Reporter: Kent Yao
>
> When I ran self-contained SQL apps, such as
> {code:java}
> import org.apache.spark.sql.SparkSession
> object ShowHiveTables {
>   def main(args: Array[String]): Unit = {
> val spark = SparkSession
>   .builder()
>   .appName("Show Hive Tables")
>   .enableHiveSupport()
>   .getOrCreate()
> spark.sql("show tables").show()
> spark.stop()
>   }
> }
> {code}
> in **yarn cluster** mode with `hive-site.xml` correctly placed under 
> `$SPARK_HOME/conf`, they failed to connect to the right Hive metastore because 
> hive-site.xml was not on the AM/Driver's classpath.
> Although submitting them with `--files/--jars local/path/to/hive-site.xml`, or 
> putting it into `$HADOOP_CONF_DIR`/`YARN_CONF_DIR`, can make these apps work as 
> well in cluster mode as in client mode, the official doc says (see 
> http://spark.apache.org/docs/latest/sql-programming-guide.html#hive-tables):
> > Configuration of Hive is done by placing your hive-site.xml, core-site.xml 
> > (for security configuration), and hdfs-site.xml (for HDFS configuration) 
> > file in conf/.
> We should either respect these configuration files too, or update the doc for 
> Hive tables in cluster mode.






[jira] [Assigned] (SPARK-22463) Missing hadoop/hive/hbase/etc configuration files in SPARK_CONF_DIR to distributed archive

2017-11-06 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22463?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22463:


Assignee: (was: Apache Spark)

> Missing hadoop/hive/hbase/etc configuration files in SPARK_CONF_DIR to 
> distributed archive
> --
>
> Key: SPARK-22463
> URL: https://issues.apache.org/jira/browse/SPARK-22463
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 2.1.2, 2.2.0
>Reporter: Kent Yao
>
> When I ran self-contained SQL apps, such as
> {code:java}
> import org.apache.spark.sql.SparkSession
> object ShowHiveTables {
>   def main(args: Array[String]): Unit = {
> val spark = SparkSession
>   .builder()
>   .appName("Show Hive Tables")
>   .enableHiveSupport()
>   .getOrCreate()
> spark.sql("show tables").show()
> spark.stop()
>   }
> }
> {code}
> in **yarn cluster** mode with `hive-site.xml` correctly placed under 
> `$SPARK_HOME/conf`, they failed to connect to the right Hive metastore because 
> hive-site.xml was not on the AM/Driver's classpath.
> Although submitting them with `--files/--jars local/path/to/hive-site.xml`, or 
> putting it into `$HADOOP_CONF_DIR`/`YARN_CONF_DIR`, can make these apps work as 
> well in cluster mode as in client mode, the official doc says (see 
> http://spark.apache.org/docs/latest/sql-programming-guide.html#hive-tables):
> > Configuration of Hive is done by placing your hive-site.xml, core-site.xml 
> > (for security configuration), and hdfs-site.xml (for HDFS configuration) 
> > file in conf/.
> We should either respect these configuration files too, or update the doc for 
> Hive tables in cluster mode.






[jira] [Assigned] (SPARK-22463) Missing hadoop/hive/hbase/etc configuration files in SPARK_CONF_DIR to distributed archive

2017-11-06 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22463?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22463:


Assignee: Apache Spark

> Missing hadoop/hive/hbase/etc configuration files in SPARK_CONF_DIR to 
> distributed archive
> --
>
> Key: SPARK-22463
> URL: https://issues.apache.org/jira/browse/SPARK-22463
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 2.1.2, 2.2.0
>Reporter: Kent Yao
>Assignee: Apache Spark
>
> When I ran self-contained SQL apps, such as
> {code:java}
> import org.apache.spark.sql.SparkSession
> object ShowHiveTables {
>   def main(args: Array[String]): Unit = {
> val spark = SparkSession
>   .builder()
>   .appName("Show Hive Tables")
>   .enableHiveSupport()
>   .getOrCreate()
> spark.sql("show tables").show()
> spark.stop()
>   }
> }
> {code}
> in **yarn cluster** mode with `hive-site.xml` correctly placed under 
> `$SPARK_HOME/conf`, they failed to connect to the right Hive metastore because 
> hive-site.xml was not on the AM/Driver's classpath.
> Although submitting them with `--files/--jars local/path/to/hive-site.xml`, or 
> putting it into `$HADOOP_CONF_DIR`/`YARN_CONF_DIR`, can make these apps work as 
> well in cluster mode as in client mode, the official doc says (see 
> http://spark.apache.org/docs/latest/sql-programming-guide.html#hive-tables):
> > Configuration of Hive is done by placing your hive-site.xml, core-site.xml 
> > (for security configuration), and hdfs-site.xml (for HDFS configuration) 
> > file in conf/.
> We should either respect these configuration files too, or update the doc for 
> Hive tables in cluster mode.






[jira] [Created] (SPARK-22463) Missing hadoop/hive/hbase/etc configuration files in SPARK_CONF_DIR to distributed archive

2017-11-06 Thread Kent Yao (JIRA)
Kent Yao created SPARK-22463:


 Summary: Missing hadoop/hive/hbase/etc configuration files in 
SPARK_CONF_DIR to distributed archive
 Key: SPARK-22463
 URL: https://issues.apache.org/jira/browse/SPARK-22463
 Project: Spark
  Issue Type: Bug
  Components: YARN
Affects Versions: 2.2.0, 2.1.2
Reporter: Kent Yao


When I ran self-contained SQL apps, such as
{code:java}
import org.apache.spark.sql.SparkSession

object ShowHiveTables {
  def main(args: Array[String]): Unit = {
val spark = SparkSession
  .builder()
  .appName("Show Hive Tables")
  .enableHiveSupport()
  .getOrCreate()
spark.sql("show tables").show()
spark.stop()
  }
}
{code}
in **yarn cluster** mode with `hive-site.xml` correctly placed under 
`$SPARK_HOME/conf`, they failed to connect to the right Hive metastore because 
hive-site.xml was not on the AM/Driver's classpath.

Although submitting them with `--files/--jars local/path/to/hive-site.xml`, or 
putting it into `$HADOOP_CONF_DIR`/`YARN_CONF_DIR`, can make these apps work as 
well in cluster mode as in client mode, the official doc says (see 
http://spark.apache.org/docs/latest/sql-programming-guide.html#hive-tables):
> Configuration of Hive is done by placing your hive-site.xml, core-site.xml 
> (for security configuration), and hdfs-site.xml (for HDFS configuration) file 
> in conf/.

We should either respect these configuration files too, or update the doc for 
Hive tables in cluster mode.








[jira] [Updated] (SPARK-22462) SQL metrics missing after foreach operation on dataframe

2017-11-06 Thread Juliusz Sompolski (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22462?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Juliusz Sompolski updated SPARK-22462:
--
Attachment: collect.png
foreach.png

> SQL metrics missing after foreach operation on dataframe
> 
>
> Key: SPARK-22462
> URL: https://issues.apache.org/jira/browse/SPARK-22462
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Juliusz Sompolski
> Attachments: collect.png, foreach.png
>
>
> No SQL metrics are visible in the SQL tab of SparkUI when foreach is executed 
> on the DataFrame.
> e.g.
> {code}
> sql("select * from range(10)").collect()
> sql("select * from range(10)").foreach(a => Unit)
> sql("select * from range(10)").foreach(a => println(a))
> {code}
> See collect.png vs. foreach.png






[jira] [Created] (SPARK-22462) SQL metrics missing after foreach operation on dataframe

2017-11-06 Thread Juliusz Sompolski (JIRA)
Juliusz Sompolski created SPARK-22462:
-

 Summary: SQL metrics missing after foreach operation on dataframe
 Key: SPARK-22462
 URL: https://issues.apache.org/jira/browse/SPARK-22462
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.2.0
Reporter: Juliusz Sompolski


No SQL metrics are visible in the SQL tab of SparkUI when foreach is executed 
on the DataFrame.
e.g.
{code}
sql("select * from range(10)").collect()
sql("select * from range(10)").foreach(a => Unit)
sql("select * from range(10)").foreach(a => println(a))
{code}
See collect.png vs. foreach.png






[jira] [Assigned] (SPARK-20652) Make SQL UI use new app state store

2017-11-06 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20652?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20652:


Assignee: Apache Spark

> Make SQL UI use new app state store
> ---
>
> Key: SPARK-20652
> URL: https://issues.apache.org/jira/browse/SPARK-20652
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Marcelo Vanzin
>Assignee: Apache Spark
>
> See spec in parent issue (SPARK-18085) for more details.
> This task tracks modifying the SQL listener and UI code to use the new app 
> state store.






[jira] [Commented] (SPARK-20652) Make SQL UI use new app state store

2017-11-06 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20652?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16241245#comment-16241245
 ] 

Apache Spark commented on SPARK-20652:
--

User 'vanzin' has created a pull request for this issue:
https://github.com/apache/spark/pull/19681

> Make SQL UI use new app state store
> ---
>
> Key: SPARK-20652
> URL: https://issues.apache.org/jira/browse/SPARK-20652
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Marcelo Vanzin
>
> See spec in parent issue (SPARK-18085) for more details.
> This task tracks modifying the SQL listener and UI code to use the new app 
> state store.






[jira] [Assigned] (SPARK-20652) Make SQL UI use new app state store

2017-11-06 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20652?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20652:


Assignee: (was: Apache Spark)

> Make SQL UI use new app state store
> ---
>
> Key: SPARK-20652
> URL: https://issues.apache.org/jira/browse/SPARK-20652
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Marcelo Vanzin
>
> See spec in parent issue (SPARK-18085) for more details.
> This task tracks modifying the SQL listener and UI code to use the new app 
> state store.






[jira] [Assigned] (SPARK-22330) Linear containsKey operation for serialized maps.

2017-11-06 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22330?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-22330:
---

Assignee: Alexander

> Linear containsKey operation for serialized maps.
> -
>
> Key: SPARK-22330
> URL: https://issues.apache.org/jira/browse/SPARK-22330
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.2.1, 2.2.0
>Reporter: Alexander
>Assignee: Alexander
>  Labels: performance
> Fix For: 2.3.0
>
>   Original Estimate: 5m
>  Remaining Estimate: 5m
>
> One of our production applications, which aggressively uses cached Spark RDDs, 
> degraded after data volumes increased, even though it shouldn't have. A quick 
> profiling session showed that the slowest part was 
> SerializableMapWrapper#containsKey: it delegates get and remove to the actual 
> implementation, but containsKey is inherited from AbstractMap, which runs in 
> linear time by iterating over the whole keySet. The workaround was simple: 
> replacing every containsKey with get(key) != null solved the issue.
> Nevertheless, it would be much simpler for everyone if the issue were fixed 
> once and for all.
> The fix is straightforward: delegate containsKey to the actual implementation.
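
A small sketch of the workaround pattern described above (plain Java 
collections for illustration; not the actual SerializableMapWrapper fix):
{code}
import java.util.{HashMap => JHashMap, Map => JMap}

object ContainsKeyWorkaround {
  def main(args: Array[String]): Unit = {
    val m: JMap[String, Integer] = new JHashMap[String, Integer]()
    m.put("a", 1)

    // In the reported case (SerializableMapWrapper), containsKey fell back to
    // AbstractMap's linear scan over the key set, so this call was the hot spot.
    val viaContainsKey = m.containsKey("a")
    // The workaround from the report: get plus a null check, which delegates
    // to the wrapped map's constant-time lookup.
    val viaGet = m.get("a") != null

    println(s"containsKey: $viaContainsKey, get != null: $viaGet")
  }
}
{code}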






[jira] [Resolved] (SPARK-22330) Linear containsKey operation for serialized maps.

2017-11-06 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22330?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-22330.
-
   Resolution: Fixed
Fix Version/s: 2.3.0

Issue resolved by pull request 19553
[https://github.com/apache/spark/pull/19553]

> Linear containsKey operation for serialized maps.
> -
>
> Key: SPARK-22330
> URL: https://issues.apache.org/jira/browse/SPARK-22330
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.2.1, 2.2.0
>Reporter: Alexander
>  Labels: performance
> Fix For: 2.3.0
>
>   Original Estimate: 5m
>  Remaining Estimate: 5m
>
> One of our production applications, which aggressively uses cached Spark RDDs, 
> degraded after data volumes increased, even though it shouldn't have. A quick 
> profiling session showed that the slowest part was 
> SerializableMapWrapper#containsKey: it delegates get and remove to the actual 
> implementation, but containsKey is inherited from AbstractMap, which runs in 
> linear time by iterating over the whole keySet. The workaround was simple: 
> replacing every containsKey with get(key) != null solved the issue.
> Nevertheless, it would be much simpler for everyone if the issue were fixed 
> once and for all.
> The fix is straightforward: delegate containsKey to the actual implementation.






[jira] [Created] (SPARK-22461) Move Spark ML model summaries into a dedicated package

2017-11-06 Thread Seth Hendrickson (JIRA)
Seth Hendrickson created SPARK-22461:


 Summary: Move Spark ML model summaries into a dedicated package
 Key: SPARK-22461
 URL: https://issues.apache.org/jira/browse/SPARK-22461
 Project: Spark
  Issue Type: Improvement
  Components: ML
Affects Versions: 2.3.0
Reporter: Seth Hendrickson
Priority: Minor


Summaries in ML right now do not adhere to a common abstraction, and are 
usually placed in the same file as the algorithm, which makes these files 
unwieldy. We can and should unify them under one hierarchy, perhaps in a new 
{{summary}} module.






[jira] [Commented] (SPARK-22451) Reduce decision tree aggregate size for unordered features from O(2^numCategories) to O(numCategories)

2017-11-06 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16240967#comment-16240967
 ] 

Joseph K. Bradley commented on SPARK-22451:
---

How does this work?  For unordered features, we need to keep track of stats for 
all possible 2^numCategories splits.  That information cannot be reconstructed 
from a summary with only O(numCategories) values.

> Reduce decision tree aggregate size for unordered features from 
> O(2^numCategories) to O(numCategories)
> --
>
> Key: SPARK-22451
> URL: https://issues.apache.org/jira/browse/SPARK-22451
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: Weichen Xu
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> We do not need to generate all possible splits for unordered features before 
> aggregation. In the aggregation (executor side):
> 1. Change `mixedBinSeqOp` so that each unordered feature is handled with the 
> same statistics as ordered features; for unordered features we then only need 
> O(numCategories) space per feature.
> 2. After the driver gets the aggregate result, generate all possible split 
> combinations and compute the best split.
> This reduces the decision tree aggregate size for each unordered feature from 
> O(2^numCategories) to O(numCategories), where `numCategories` is the arity of 
> the unordered feature.
> It also reduces the CPU cost on the executor side: the time complexity for 
> each unordered feature drops from O(numPoints * 2^numCategories) to 
> O(numPoints).
> It does not increase the time complexity of computing the best split for 
> unordered features on the driver side.






[jira] [Resolved] (SPARK-22078) clarify exception behaviors for all data source v2 interfaces

2017-11-06 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22078?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-22078.
-
   Resolution: Fixed
Fix Version/s: 2.3.0

Issue resolved by pull request 19623
[https://github.com/apache/spark/pull/19623]

> clarify exception behaviors for all data source v2 interfaces
> -
>
> Key: SPARK-22078
> URL: https://issues.apache.org/jira/browse/SPARK-22078
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Wenchen Fan
> Fix For: 2.3.0
>
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20647) Make the Storage page use new app state store

2017-11-06 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20647?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16240895#comment-16240895
 ] 

Apache Spark commented on SPARK-20647:
--

User 'vanzin' has created a pull request for this issue:
https://github.com/apache/spark/pull/19679

> Make the Storage page use new app state store
> -
>
> Key: SPARK-20647
> URL: https://issues.apache.org/jira/browse/SPARK-20647
> Project: Spark
>  Issue Type: Sub-task
>  Components: Web UI
>Affects Versions: 2.3.0
>Reporter: Marcelo Vanzin
>
> See spec in parent issue (SPARK-18085) for more details.
> This task tracks making the Storage page use the new app state store.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20647) Make the Storage page use new app state store

2017-11-06 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20647?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20647:


Assignee: (was: Apache Spark)

> Make the Storage page use new app state store
> -
>
> Key: SPARK-20647
> URL: https://issues.apache.org/jira/browse/SPARK-20647
> Project: Spark
>  Issue Type: Sub-task
>  Components: Web UI
>Affects Versions: 2.3.0
>Reporter: Marcelo Vanzin
>
> See spec in parent issue (SPARK-18085) for more details.
> This task tracks making the Storage page use the new app state store.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20647) Make the Storage page use new app state store

2017-11-06 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20647?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20647:


Assignee: Apache Spark

> Make the Storage page use new app state store
> -
>
> Key: SPARK-20647
> URL: https://issues.apache.org/jira/browse/SPARK-20647
> Project: Spark
>  Issue Type: Sub-task
>  Components: Web UI
>Affects Versions: 2.3.0
>Reporter: Marcelo Vanzin
>Assignee: Apache Spark
>
> See spec in parent issue (SPARK-18085) for more details.
> This task tracks making the Storage page use the new app state store.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22460) Spark De-serialization of Timestamp field is Incorrect

2017-11-06 Thread Saniya Tech (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22460?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Saniya Tech updated SPARK-22460:

Description: 
We are trying to serialize Timestamp fields to Avro using spark-avro connector. 
I can see the Timestamp fields are getting correctly serialized as long 
(milliseconds since Epoch). I verified that the data is correctly read back 
from the Avro files. It is when we encode the Dataset as a case class that 
timestamp field is incorrectly converted to a long value as seconds since 
Epoch. As can be seen below, this shifts the timestamp many years in the future.

Code used to reproduce the issue:

{code:java}
import java.sql.Timestamp

import com.databricks.spark.avro._
import org.apache.spark.sql.{Dataset, Row, SaveMode, SparkSession}

case class TestRecord(name: String, modified: Timestamp)

import spark.implicits._
val data = Seq(
  TestRecord("One", new Timestamp(System.currentTimeMillis()))
)

// Serialize:
val parameters = Map("recordName" -> "TestRecord", "recordNamespace" -> 
"com.example.domain")
val path = s"s3a://some-bucket/output/"
val ds = spark.createDataset(data)
ds.write
  .options(parameters)
  .mode(SaveMode.Overwrite)
  .avro(path)
//

// De-serialize
val output = spark.read.avro(path).as[TestRecord]
{code}

Output from the test:
{code:java}
scala> data.head
res4: TestRecord = TestRecord(One,2017-11-06 20:06:19.419)

scala> output.collect().head
res5: TestRecord = TestRecord(One,49819-12-16 17:23:39.0)
{code}



  was:
We are trying to serialize Timestamp fields to Avro using spark-avro connector. 
I can see the Timestamp fields are getting correctly serialized as long 
(milliseconds since Epoch). I verified that the data is correctly read back 
from the Avro files. It is when we encode the Dataset as a case class that 
timestamp field is incorrectly converted to as long value as seconds since 
Epoch. As can be seen below, this shifts the timestamp many years in the future.

Code used to reproduce the issue:

{code:java}
import java.sql.Timestamp

import com.databricks.spark.avro._
import org.apache.spark.sql.{Dataset, Row, SaveMode, SparkSession}

case class TestRecord(name: String, modified: Timestamp)

import spark.implicits._
val data = Seq(
  TestRecord("One", new Timestamp(System.currentTimeMillis()))
)

// Serialize:
val parameters = Map("recordName" -> "TestRecord", "recordNamespace" -> 
"com.example.domain")
val path = s"s3a://some-bucket/output/"
val ds = spark.createDataset(data)
ds.write
  .options(parameters)
  .mode(SaveMode.Overwrite)
  .avro(path)
//

// De-serialize
val output = spark.read.avro(path).as[TestRecord]
{code}

Output from the test:
{code:java}
scala> data.head
res4: TestRecord = TestRecord(One,2017-11-06 20:06:19.419)

scala> output.collect().head
res5: TestRecord = TestRecord(One,49819-12-16 17:23:39.0)
{code}




> Spark De-serialization of Timestamp field is Incorrect
> --
>
> Key: SPARK-22460
> URL: https://issues.apache.org/jira/browse/SPARK-22460
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 2.1.1
>Reporter: Saniya Tech
>
> We are trying to serialize Timestamp fields to Avro using spark-avro 
> connector. I can see the Timestamp fields are getting correctly serialized as 
> long (milliseconds since Epoch). I verified that the data is correctly read 
> back from the Avro files. It is when we encode the Dataset as a case class 
> that timestamp field is incorrectly converted to a long value as seconds 
> since Epoch. As can be seen below, this shifts the timestamp many years in 
> the future.
> Code used to reproduce the issue:
> {code:java}
> import java.sql.Timestamp
> import com.databricks.spark.avro._
> import org.apache.spark.sql.{Dataset, Row, SaveMode, SparkSession}
> case class TestRecord(name: String, modified: Timestamp)
> import spark.implicits._
> val data = Seq(
>   TestRecord("One", new Timestamp(System.currentTimeMillis()))
> )
> // Serialize:
> val parameters = Map("recordName" -> "TestRecord", "recordNamespace" -> 
> "com.example.domain")
> val path = s"s3a://some-bucket/output/"
> val ds = spark.createDataset(data)
> ds.write
>   .options(parameters)
>   .mode(SaveMode.Overwrite)
>   .avro(path)
> //
> // De-serialize
> val output = spark.read.avro(path).as[TestRecord]
> {code}
> Output from the test:
> {code:java}
> scala> data.head
> res4: TestRecord = TestRecord(One,2017-11-06 20:06:19.419)
> scala> output.collect().head
> res5: TestRecord = TestRecord(One,49819-12-16 17:23:39.0)
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-22460) Spark De-serialization of Timestamp field is Incorrect

2017-11-06 Thread Saniya Tech (JIRA)
Saniya Tech created SPARK-22460:
---

 Summary: Spark De-serialization of Timestamp field is Incorrect
 Key: SPARK-22460
 URL: https://issues.apache.org/jira/browse/SPARK-22460
 Project: Spark
  Issue Type: Bug
  Components: Input/Output
Affects Versions: 2.1.1
Reporter: Saniya Tech


We are trying to serialize Timestamp fields to Avro using spark-avro connector. 
I can see the Timestamp fields are getting correctly serialized as long 
(milliseconds since Epoch). I verified that the data is correctly read back 
from the Avro files. It is when we encode the Dataset as a case class that the 
timestamp field is incorrectly converted, with the long value interpreted as 
seconds since Epoch. As can be seen below, this shifts the timestamp many years 
into the future.

Code used to reproduce the issue:

{code:java}
import java.sql.Timestamp

import com.databricks.spark.avro._
import org.apache.spark.sql.{Dataset, Row, SaveMode, SparkSession}

case class TestRecord(name: String, modified: Timestamp)

import spark.implicits._
val data = Seq(
  TestRecord("One", new Timestamp(System.currentTimeMillis()))
)

// Serialize:
val parameters = Map("recordName" -> "TestRecord", "recordNamespace" -> 
"com.example.domain")
val path = s"s3a://some-bucket/output/"
val ds = spark.createDataset(data)
ds.write
  .options(parameters)
  .mode(SaveMode.Overwrite)
  .avro(path)
//

// De-serialize
val output = spark.read.avro(path).as[TestRecord]
{code}

Output from the test:
{code:java}
scala> data.head
res4: TestRecord = TestRecord(One,2017-11-06 20:06:19.419)

scala> output.collect().head
res5: TestRecord = TestRecord(One,49819-12-16 17:23:39.0)
{code}
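
The size of the shift is consistent with the stored epoch milliseconds being re-interpreted as epoch seconds on the read path; a rough sanity check of the arithmetic (not a confirmed root cause):

{code:java}
import java.sql.Timestamp

// Epoch milliseconds for the timestamp above (~1.5e12).
val millis = Timestamp.valueOf("2017-11-06 20:06:19.419").getTime

// Treating those milliseconds as *seconds* puts the instant ~1000x further from
// the epoch, i.e. tens of thousands of years in the future.
val misread = new Timestamp(millis * 1000L)
println(misread)  // a date around the year 49819, matching the bad output above
{code}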





--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20646) Make Executors page use new app state store

2017-11-06 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20646?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20646:


Assignee: Apache Spark

> Make Executors page use new app state store
> ---
>
> Key: SPARK-20646
> URL: https://issues.apache.org/jira/browse/SPARK-20646
> Project: Spark
>  Issue Type: Sub-task
>  Components: Web UI
>Affects Versions: 2.3.0
>Reporter: Marcelo Vanzin
>Assignee: Apache Spark
>
> See spec in parent issue (SPARK-18085) for more details.
> This task tracks making the Executors page use the new app state store.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20646) Make Executors page use new app state store

2017-11-06 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20646?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20646:


Assignee: (was: Apache Spark)

> Make Executors page use new app state store
> ---
>
> Key: SPARK-20646
> URL: https://issues.apache.org/jira/browse/SPARK-20646
> Project: Spark
>  Issue Type: Sub-task
>  Components: Web UI
>Affects Versions: 2.3.0
>Reporter: Marcelo Vanzin
>
> See spec in parent issue (SPARK-18085) for more details.
> This task tracks making the Executors page use the new app state store.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20646) Make Executors page use new app state store

2017-11-06 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16240805#comment-16240805
 ] 

Apache Spark commented on SPARK-20646:
--

User 'vanzin' has created a pull request for this issue:
https://github.com/apache/spark/pull/19678

> Make Executors page use new app state store
> ---
>
> Key: SPARK-20646
> URL: https://issues.apache.org/jira/browse/SPARK-20646
> Project: Spark
>  Issue Type: Sub-task
>  Components: Web UI
>Affects Versions: 2.3.0
>Reporter: Marcelo Vanzin
>
> See spec in parent issue (SPARK-18085) for more details.
> This task tracks making the Executors page use the new app state store.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20645) Make Environment page use new app state store

2017-11-06 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20645?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16240783#comment-16240783
 ] 

Apache Spark commented on SPARK-20645:
--

User 'vanzin' has created a pull request for this issue:
https://github.com/apache/spark/pull/19677

> Make Environment page use new app state store
> -
>
> Key: SPARK-20645
> URL: https://issues.apache.org/jira/browse/SPARK-20645
> Project: Spark
>  Issue Type: Sub-task
>  Components: Web UI
>Affects Versions: 2.3.0
>Reporter: Marcelo Vanzin
>
> See spec in parent issue (SPARK-18085) for more details.
> This task tracks porting the Environment page to use the new key-value store.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20645) Make Environment page use new app state store

2017-11-06 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20645?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20645:


Assignee: Apache Spark

> Make Environment page use new app state store
> -
>
> Key: SPARK-20645
> URL: https://issues.apache.org/jira/browse/SPARK-20645
> Project: Spark
>  Issue Type: Sub-task
>  Components: Web UI
>Affects Versions: 2.3.0
>Reporter: Marcelo Vanzin
>Assignee: Apache Spark
>
> See spec in parent issue (SPARK-18085) for more details.
> This task tracks porting the Environment page to use the new key-value store.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20645) Make Environment page use new app state store

2017-11-06 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20645?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20645:


Assignee: (was: Apache Spark)

> Make Environment page use new app state store
> -
>
> Key: SPARK-20645
> URL: https://issues.apache.org/jira/browse/SPARK-20645
> Project: Spark
>  Issue Type: Sub-task
>  Components: Web UI
>Affects Versions: 2.3.0
>Reporter: Marcelo Vanzin
>
> See spec in parent issue (SPARK-18085) for more details.
> This task tracks porting the Environment page to use the new key-value store.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22459) EdgeDirection "Either" Does Not Considerate Real "Either" Direction

2017-11-06 Thread Tom (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22459?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tom updated SPARK-22459:

Description: 
When running functions involving 
[EdgeDirection|https://github.com/apache/spark/blob/master/graphx/src/main/scala/org/apache/spark/graphx/EdgeDirection.scala],
 the _EdgeDirection.Either_ type does not truly honour the specified direction 
(_either_).
For instance, when using the _Pregel API_ with a simple graph as shown:
!http://www.icodeguru.com/CPP/10book/books/book9/images/fig6_1.gif!
With _EdgeDirection.Either_ you would expect vertex #3 to send a message 
(when activated) to vertices 1, 2, and 4, but in reality it does not.
This can be worked around, but only in an expensive and inefficient way.
Tested with Spark 2.2.0, 2.1.1 and 2.1.2.

  was:
When running functions involving  
[EdgeDirection|https://github.com/apache/spark/blob/master/graphx/src/main/scala/org/apache/spark/graphx/EdgeDirection.scala],
 the _EdgeDirection.Either_ type does not truly perform the actual specified 
direction (_either_).
For instance, when using _Pregel API_ with a simple graph as shown:
!http://www.icodeguru.com/CPP/10book/books/book9/images/fig6_1.gif|thumbnail!
With _EdgeDirection.Either_ you would guess that vertex #3 will send a message 
(when activated) to vertices 1, 2, and 4, but in reality it does not.
This might be bypassed, but in an expansive and ineffective way. 
Tested with 2.2.0, 2.1.1 and 2.1.2 spark versions.


> EdgeDirection "Either" Does Not Considerate Real "Either" Direction
> ---
>
> Key: SPARK-22459
> URL: https://issues.apache.org/jira/browse/SPARK-22459
> Project: Spark
>  Issue Type: Bug
>  Components: GraphX
>Affects Versions: 2.1.1, 2.1.2, 2.2.0
> Environment: Windows 7. Yarn cluster / client mode. Tested with 
> 2.2.0, 2.1.1 and 2.1.2 spark versions.
>Reporter: Tom
>
> When running functions involving  
> [EdgeDirection|https://github.com/apache/spark/blob/master/graphx/src/main/scala/org/apache/spark/graphx/EdgeDirection.scala],
>  the _EdgeDirection.Either_ type does not truly perform the actual specified 
> direction (_either_).
> For instance, when using _Pregel API_ with a simple graph as shown:
> !http://www.icodeguru.com/CPP/10book/books/book9/images/fig6_1.gif!
> With _EdgeDirection.Either_ you would guess that vertex #3 will send a 
> message (when activated) to vertices 1, 2, and 4, but in reality it does not.
> This might be bypassed, but in an expansive and ineffective way. 
> Tested with 2.2.0, 2.1.1 and 2.1.2 spark versions.
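
A minimal way to exercise the behaviour described above (a sketch only: the toy graph, edge directions and message logic are assumptions standing in for the figure, and an existing SparkSession named {{spark}} is assumed):

{code:java}
import org.apache.spark.graphx._

val sc = spark.sparkContext  // assumes an existing SparkSession `spark`

// Toy graph in which vertex 3 is connected to vertices 1, 2 and 4.
val graph = Graph.fromEdges(
  sc.parallelize(Seq(Edge(1L, 3L, 0), Edge(2L, 3L, 0), Edge(3L, 4L, 0))),
  defaultValue = 0)

// Pregel with activeDirection = Either; sendMsg emits towards both endpoints.
// The report is that an activated vertex 3 still does not reach 1, 2 and 4.
val result = graph.pregel(initialMsg = 0, maxIterations = 3,
    activeDirection = EdgeDirection.Either)(
  (id, attr, msg) => math.max(attr, msg),                             // vprog
  t => Iterator((t.srcId, t.dstAttr + 1), (t.dstId, t.srcAttr + 1)),  // sendMsg
  (a, b) => math.max(a, b))                                           // mergeMsg

result.vertices.collect().foreach(println)
{code}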



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22459) EdgeDirection "Either" Does Not Considerate Real "Either" Direction

2017-11-06 Thread Tom (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22459?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tom updated SPARK-22459:

Description: 
When running functions involving  
[EdgeDirection|https://github.com/apache/spark/blob/master/graphx/src/main/scala/org/apache/spark/graphx/EdgeDirection.scala],
 the _EdgeDirection.Either_ type does not truly perform the actual specified 
direction (_either_).
For instance, when using _Pregel API_ with a simple graph as shown:
!http://www.icodeguru.com/CPP/10book/books/book9/images/fig6_1.gif|thumbnail!
With _EdgeDirection.Either_ you would guess that vertex #3 will send a message 
(when activated) to vertices 1, 2, and 4, but in reality it does not.
This might be bypassed, but in an expansive and ineffective way. 
Tested with 2.2.0, 2.1.1 and 2.1.2 spark versions.

  was:
When running functions involving  
[EdgeDirection|https://github.com/apache/spark/blob/master/graphx/src/main/scala/org/apache/spark/graphx/EdgeDirection.scala],
 the _EdgeDirection.Either_ type does not truly perform the actual specified 
direction (_either_).
For instance, when using _Pregel API_ with a simple graph as shown:
!http://www.icodeguru.com/CPP/10book/books/book9/images/fig6_1.gif!
With _EdgeDirection.Either_ you would guess that vertex #3 will send a message 
(when activated) to vertices 1, 2, and 4, but in reality it does not.
This might be bypassed, but in an expansive and ineffective way. 
Tested with 2.2.0, 2.1.1 and 2.1.2 spark versions.


> EdgeDirection "Either" Does Not Considerate Real "Either" Direction
> ---
>
> Key: SPARK-22459
> URL: https://issues.apache.org/jira/browse/SPARK-22459
> Project: Spark
>  Issue Type: Bug
>  Components: GraphX
>Affects Versions: 2.1.1, 2.1.2, 2.2.0
> Environment: Windows 7. Yarn cluster / client mode. Tested with 
> 2.2.0, 2.1.1 and 2.1.2 spark versions.
>Reporter: Tom
>
> When running functions involving  
> [EdgeDirection|https://github.com/apache/spark/blob/master/graphx/src/main/scala/org/apache/spark/graphx/EdgeDirection.scala],
>  the _EdgeDirection.Either_ type does not truly perform the actual specified 
> direction (_either_).
> For instance, when using _Pregel API_ with a simple graph as shown:
> !http://www.icodeguru.com/CPP/10book/books/book9/images/fig6_1.gif|thumbnail!
> With _EdgeDirection.Either_ you would guess that vertex #3 will send a 
> message (when activated) to vertices 1, 2, and 4, but in reality it does not.
> This might be bypassed, but in an expansive and ineffective way. 
> Tested with 2.2.0, 2.1.1 and 2.1.2 spark versions.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22459) EdgeDirection "Either" Does Not Considerate Real "Either" Direction

2017-11-06 Thread Tom (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22459?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tom updated SPARK-22459:

Description: 
When running functions involving  
[EdgeDirection|https://github.com/apache/spark/blob/master/graphx/src/main/scala/org/apache/spark/graphx/EdgeDirection.scala],
 the _EdgeDirection.Either_ type does not truly perform the actual specified 
direction (_either_).
For instance, when using _Pregel API_ with a simple graph as shown:
!http://www.icodeguru.com/CPP/10book/books/book9/images/fig6_1.gif!
With _EdgeDirection.Either_ you would guess that vertex #3 will send a message 
(when activated) to vertices 1, 2, and 4, but in reality it does not.
This might be bypassed, but in an expansive and ineffective way. 
Tested with 2.2.0, 2.1.1 and 2.1.2 spark versions.

  was:
When running functions involving  
[EdgeDirection|https://github.com/apache/spark/blob/master/graphx/src/main/scala/org/apache/spark/graphx/EdgeDirection.scala],
 the _EdgeDirection.Either_ type does not truly perform the actual specified 
direction (_either_).
For instance, when using _Pregel API_ with a simple graph as shown:
(inline base64-encoded image of the example graph omitted)
[jira] [Created] (SPARK-22459) EdgeDirection "Either" Does Not Considerate real "either" direction

2017-11-06 Thread Tom (JIRA)
Tom created SPARK-22459:
---

 Summary: EdgeDirection "Either" Does Not Considerate real "either" 
direction
 Key: SPARK-22459
 URL: https://issues.apache.org/jira/browse/SPARK-22459
 Project: Spark
  Issue Type: Bug
  Components: GraphX
Affects Versions: 2.2.0, 2.1.2, 2.1.1
 Environment: Windows 7. Yarn cluster / client mode. Tested with 2.2.0, 
2.1.1 and 2.1.2 spark versions.
Reporter: Tom


When running functions involving  
[EdgeDirection|https://github.com/apache/spark/blob/master/graphx/src/main/scala/org/apache/spark/graphx/EdgeDirection.scala],
 the _EdgeDirection.Either_ type does not truly perform the actual specified 
direction (_either_).
For instance, when using _Pregel API_ with a simple graph as shown:
(inline base64-encoded image of the example graph omitted)

[jira] [Commented] (SPARK-22427) StackOverFlowError when using FPGrowth

2017-11-06 Thread lyt (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22427?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16240662#comment-16240662
 ] 

lyt commented on SPARK-22427:
-

Maybe it's a problem with the DataFrame-based API.
The problem does not occur when using plain RDDs.
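
For reference, the RDD-based API mentioned above is the spark.mllib one; a minimal sketch, assuming an active SparkContext {{sc}} and a {{path}} of space-separated transactions as in the snippet above:

{code:java}
import org.apache.spark.mllib.fpm.FPGrowth

// Each transaction is an array of items; items within a transaction must be unique.
val transactions = sc.textFile(path)
  .map(_.trim.split(" "))
  .filter(_.nonEmpty)

val model = new FPGrowth()
  .setMinSupport(0.3)
  .run(transactions)

println(model.freqItemsets.count())
{code}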

> StackOverFlowError when using FPGrowth
> --
>
> Key: SPARK-22427
> URL: https://issues.apache.org/jira/browse/SPARK-22427
> Project: Spark
>  Issue Type: Bug
>  Components: ML, MLlib
>Affects Versions: 2.2.0
> Environment: Centos Linux 3.10.0-327.el7.x86_64
> java 1.8.0.111
> spark 2.2.0
>Reporter: lyt
>
> code part:
> val path = jobConfig.getString("hdfspath")
> val vectordata = sc.sparkContext.textFile(path)
> val finaldata = sc.createDataset(vectordata.map(obj => {
>   obj.split(" ")
> }).filter(arr => arr.length > 0)).toDF("items")
> val fpg = new FPGrowth()
> 
> fpg.setMinSupport(minSupport).setItemsCol("items").setMinConfidence(minConfidence)
> val train = fpg.fit(finaldata)
> print(train.freqItemsets.count())
> print(train.associationRules.count())
> train.save("/tmp/FPGModel")
> And encountered following exception:
> Driver stacktrace:
>   at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1499)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1487)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1486)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
>   at 
> org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1486)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:814)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:814)
>   at scala.Option.foreach(Option.scala:257)
>   at 
> org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:814)
>   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1714)
>   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1669)
>   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1658)
>   at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
>   at 
> org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:630)
>   at org.apache.spark.SparkContext.runJob(SparkContext.scala:2022)
>   at org.apache.spark.SparkContext.runJob(SparkContext.scala:2043)
>   at org.apache.spark.SparkContext.runJob(SparkContext.scala:2062)
>   at org.apache.spark.SparkContext.runJob(SparkContext.scala:2087)
>   at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:936)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
>   at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
>   at org.apache.spark.rdd.RDD.collect(RDD.scala:935)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:278)
>   at 
> org.apache.spark.sql.Dataset$$anonfun$count$1.apply(Dataset.scala:2430)
>   at 
> org.apache.spark.sql.Dataset$$anonfun$count$1.apply(Dataset.scala:2429)
>   at org.apache.spark.sql.Dataset$$anonfun$55.apply(Dataset.scala:2837)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:65)
>   at org.apache.spark.sql.Dataset.withAction(Dataset.scala:2836)
>   at org.apache.spark.sql.Dataset.count(Dataset.scala:2429)
>   at DataMining.FPGrowth$.runJob(FPGrowth.scala:116)
>   at DataMining.testFPG$.main(FPGrowth.scala:36)
>   at DataMining.testFPG.main(FPGrowth.scala)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:755)
>   at 
> org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
>   at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
>   at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:119)
>   at org.apache.spark.deploy.Sp

[jira] [Closed] (SPARK-22458) OutOfDirectMemoryError with Spark 2.2

2017-11-06 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen closed SPARK-22458.
-

> OutOfDirectMemoryError with Spark 2.2
> -
>
> Key: SPARK-22458
> URL: https://issues.apache.org/jira/browse/SPARK-22458
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle, SQL, YARN
>Affects Versions: 2.2.0
>Reporter: Kaushal Prajapati
>Priority: Blocker
>
> We were using Spark 2.1 from last 6 months to execute multiple spark jobs 
> that is running 15 hour long for 50+ TB of source data with below 
> configurations successfully. 
> {quote}spark.master  yarn
> spark.driver.cores10
> spark.driver.maxResultSize5g
> spark.driver.memory   20g
> spark.executor.cores  5
> spark.executor.extraJavaOptions   *-XX:+UseG1GC 
> -Dio.netty.maxDirectMemory=1024* -XX:MaxGCPauseMillis=6 
> *-XX:MaxDirectMemorySize=2048m* 
> -Dlog4j.configuration=file:///conf/log4j.properties -Dhdp.version=2.5.3.0-37
> spark.driver.extraJavaOptions   
> *-Dio.netty.maxDirectMemory=2048 -XX:MaxDirectMemorySize=2048m* 
> -Dlog4j.configuration=file:///conf/log4j.properties -Dhdp.version=2.5.3.0-37
> spark.executor.instances  30
> spark.executor.memory 30g
> *spark.kryoserializer.buffer.max   512m*
> spark.network.timeout 12000s
> spark.serializer  
> org.apache.spark.serializer.KryoSerializer
> spark.shuffle.io.preferDirectBufs false
> spark.sql.catalogImplementation   hive
> spark.sql.shuffle.partitions  5000
> spark.yarn.driver.memoryOverhead  1536
> spark.yarn.executor.memoryOverhead4096
> spark.core.connection.ack.wait.timeout600s
> spark.scheduler.maxRegisteredResourcesWaitingTime 15s
> spark.sql.hive.filesourcePartitionFileCacheSize   524288000
> spark.dynamicAllocation.executorIdleTimeout   3s
> spark.dynamicAllocation.enabled   true
> spark.hadoop.yarn.timeline-service.enabledfalse
> spark.shuffle.service.enabled true
> spark.yarn.am.extraJavaOptions*-Dhdp.version=2.5.3.0-37 
> -Dio.netty.maxDirectMemory=1024 -XX:MaxDirectMemorySize=1024m*{quote}
> Recently we tried to upgrade from Spark 2.1 to Spark 2.2 to get some fixes 
> using latest version. But we started facing DirectBuffer outOfMemory error 
> and exceeding memory limits for executor memoryOverhead issue. To fix that we 
> started tweaking multiple properties but still issue persists. Relevant 
> information is shared below
> Please let me any other details is requried,
>   
> Snapshot for DirectMemory Error Stacktrace :- 
> {code:java}
> 10:48:26.417 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 5.0 in 
> stage 5.3 (TID 25022, dedwdprshc070.de.xxx.com, executor 615): 
> FetchFailed(BlockManagerId(465, dedwdprshc061.de.xxx.com, 7337, None), 
> shuffleId=7, mapId=141, reduceId=3372, message=
> org.apache.spark.shuffle.FetchFailedException: failed to allocate 65536 
> byte(s) of direct memory (used: 1073699840, max: 1073741824)
> at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:442)
> at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:418)
> at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:59)
> at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
> at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
> at 
> org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
> at 
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.sort_addToSorter$(Unknown
>  Source)
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
> at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
> at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:395)
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.findNextInnerJoinRows$(Unknown
> 

[jira] [Resolved] (SPARK-22458) OutOfDirectMemoryError with Spark 2.2

2017-11-06 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-22458.
---
Resolution: Not A Problem

Changes in implementation between minor releases could change memory usage. 
Your overhead is pretty low. This is not a bug and should not be reopened.
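
For anyone experimenting along these lines, the relevant knobs are the ones already quoted in the description; the values below are purely illustrative, not a recommended configuration:

{code:java}
import org.apache.spark.SparkConf

// Give the Netty direct-buffer pool more headroom than the ~1 GiB limit seen in
// the stack trace, and raise the YARN overhead to cover it (illustrative values).
val conf = new SparkConf()
  .set("spark.executor.extraJavaOptions",
    "-XX:+UseG1GC -Dio.netty.maxDirectMemory=2048 -XX:MaxDirectMemorySize=2048m")
  .set("spark.yarn.executor.memoryOverhead", "6144")
{code}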

> OutOfDirectMemoryError with Spark 2.2
> -
>
> Key: SPARK-22458
> URL: https://issues.apache.org/jira/browse/SPARK-22458
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle, SQL, YARN
>Affects Versions: 2.2.0
>Reporter: Kaushal Prajapati
>Priority: Blocker
>
> We were using Spark 2.1 from last 6 months to execute multiple spark jobs 
> that is running 15 hour long for 50+ TB of source data with below 
> configurations successfully. 
> {quote}spark.master  yarn
> spark.driver.cores10
> spark.driver.maxResultSize5g
> spark.driver.memory   20g
> spark.executor.cores  5
> spark.executor.extraJavaOptions   *-XX:+UseG1GC 
> -Dio.netty.maxDirectMemory=1024* -XX:MaxGCPauseMillis=6 
> *-XX:MaxDirectMemorySize=2048m* 
> -Dlog4j.configuration=file:///conf/log4j.properties -Dhdp.version=2.5.3.0-37
> spark.driver.extraJavaOptions   
> *-Dio.netty.maxDirectMemory=2048 -XX:MaxDirectMemorySize=2048m* 
> -Dlog4j.configuration=file:///conf/log4j.properties -Dhdp.version=2.5.3.0-37
> spark.executor.instances  30
> spark.executor.memory 30g
> *spark.kryoserializer.buffer.max   512m*
> spark.network.timeout 12000s
> spark.serializer  
> org.apache.spark.serializer.KryoSerializer
> spark.shuffle.io.preferDirectBufs false
> spark.sql.catalogImplementation   hive
> spark.sql.shuffle.partitions  5000
> spark.yarn.driver.memoryOverhead  1536
> spark.yarn.executor.memoryOverhead4096
> spark.core.connection.ack.wait.timeout600s
> spark.scheduler.maxRegisteredResourcesWaitingTime 15s
> spark.sql.hive.filesourcePartitionFileCacheSize   524288000
> spark.dynamicAllocation.executorIdleTimeout   3s
> spark.dynamicAllocation.enabled   true
> spark.hadoop.yarn.timeline-service.enabledfalse
> spark.shuffle.service.enabled true
> spark.yarn.am.extraJavaOptions*-Dhdp.version=2.5.3.0-37 
> -Dio.netty.maxDirectMemory=1024 -XX:MaxDirectMemorySize=1024m*{quote}
> Recently we tried to upgrade from Spark 2.1 to Spark 2.2 to get some fixes 
> using latest version. But we started facing DirectBuffer outOfMemory error 
> and exceeding memory limits for executor memoryOverhead issue. To fix that we 
> started tweaking multiple properties but still issue persists. Relevant 
> information is shared below
> Please let me any other details is requried,
>   
> Snapshot for DirectMemory Error Stacktrace :- 
> {code:java}
> 10:48:26.417 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 5.0 in 
> stage 5.3 (TID 25022, dedwdprshc070.de.xxx.com, executor 615): 
> FetchFailed(BlockManagerId(465, dedwdprshc061.de.xxx.com, 7337, None), 
> shuffleId=7, mapId=141, reduceId=3372, message=
> org.apache.spark.shuffle.FetchFailedException: failed to allocate 65536 
> byte(s) of direct memory (used: 1073699840, max: 1073741824)
> at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:442)
> at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:418)
> at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:59)
> at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
> at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
> at 
> org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
> at 
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.sort_addToSorter$(Unknown
>  Source)
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
> at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
> at 
> org.apache.spark.sql.execution.WholeStageCodegenExec

[jira] [Reopened] (SPARK-22458) OutOfDirectMemoryError with Spark 2.2

2017-11-06 Thread Kaushal Prajapati (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kaushal Prajapati reopened SPARK-22458:
---

It works with the same configs on Spark 2.1, so why not on Spark 2.2? Why does 
Spark 2.2 demand extra memory, and how much extra?

> OutOfDirectMemoryError with Spark 2.2
> -
>
> Key: SPARK-22458
> URL: https://issues.apache.org/jira/browse/SPARK-22458
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle, SQL, YARN
>Affects Versions: 2.2.0
>Reporter: Kaushal Prajapati
>Priority: Blocker
>
> We were using Spark 2.1 from last 6 months to execute multiple spark jobs 
> that is running 15 hour long for 50+ TB of source data with below 
> configurations successfully. 
> {quote}spark.master  yarn
> spark.driver.cores10
> spark.driver.maxResultSize5g
> spark.driver.memory   20g
> spark.executor.cores  5
> spark.executor.extraJavaOptions   *-XX:+UseG1GC 
> -Dio.netty.maxDirectMemory=1024* -XX:MaxGCPauseMillis=6 
> *-XX:MaxDirectMemorySize=2048m* 
> -Dlog4j.configuration=file:///conf/log4j.properties -Dhdp.version=2.5.3.0-37
> spark.driver.extraJavaOptions   
> *-Dio.netty.maxDirectMemory=2048 -XX:MaxDirectMemorySize=2048m* 
> -Dlog4j.configuration=file:///conf/log4j.properties -Dhdp.version=2.5.3.0-37
> spark.executor.instances  30
> spark.executor.memory 30g
> *spark.kryoserializer.buffer.max   512m*
> spark.network.timeout 12000s
> spark.serializer  
> org.apache.spark.serializer.KryoSerializer
> spark.shuffle.io.preferDirectBufs false
> spark.sql.catalogImplementation   hive
> spark.sql.shuffle.partitions  5000
> spark.yarn.driver.memoryOverhead  1536
> spark.yarn.executor.memoryOverhead4096
> spark.core.connection.ack.wait.timeout600s
> spark.scheduler.maxRegisteredResourcesWaitingTime 15s
> spark.sql.hive.filesourcePartitionFileCacheSize   524288000
> spark.dynamicAllocation.executorIdleTimeout   3s
> spark.dynamicAllocation.enabled   true
> spark.hadoop.yarn.timeline-service.enabledfalse
> spark.shuffle.service.enabled true
> spark.yarn.am.extraJavaOptions*-Dhdp.version=2.5.3.0-37 
> -Dio.netty.maxDirectMemory=1024 -XX:MaxDirectMemorySize=1024m*{quote}
> Recently we tried to upgrade from Spark 2.1 to Spark 2.2 to get some fixes 
> using latest version. But we started facing DirectBuffer outOfMemory error 
> and exceeding memory limits for executor memoryOverhead issue. To fix that we 
> started tweaking multiple properties but still issue persists. Relevant 
> information is shared below
> Please let me any other details is requried,
>   
> Snapshot for DirectMemory Error Stacktrace :- 
> {code:java}
> 10:48:26.417 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 5.0 in 
> stage 5.3 (TID 25022, dedwdprshc070.de.xxx.com, executor 615): 
> FetchFailed(BlockManagerId(465, dedwdprshc061.de.xxx.com, 7337, None), 
> shuffleId=7, mapId=141, reduceId=3372, message=
> org.apache.spark.shuffle.FetchFailedException: failed to allocate 65536 
> byte(s) of direct memory (used: 1073699840, max: 1073741824)
> at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:442)
> at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:418)
> at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:59)
> at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
> at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
> at 
> org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
> at 
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.sort_addToSorter$(Unknown
>  Source)
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
> at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
> at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeSta

[jira] [Resolved] (SPARK-22315) Check for version match between R package and JVM

2017-11-06 Thread Shivaram Venkataraman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22315?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shivaram Venkataraman resolved SPARK-22315.
---
   Resolution: Fixed
Fix Version/s: 2.3.0
   2.2.1

Issue resolved by pull request 19624
[https://github.com/apache/spark/pull/19624]

> Check for version match between R package and JVM
> -
>
> Key: SPARK-22315
> URL: https://issues.apache.org/jira/browse/SPARK-22315
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Affects Versions: 2.2.1
>Reporter: Shivaram Venkataraman
> Fix For: 2.2.1, 2.3.0
>
>
> With the release of SparkR on CRAN we could have scenarios where users have a 
> newer version of package when compared to the Spark cluster they are 
> connecting to.
> We should print appropriate warnings on either (a) connecting to a different 
> version R Backend (b) connecting to a Spark master running a different 
> version of Spark (this should ideally happen inside Scala ?)
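
A minimal sketch of the kind of check described (the helper and argument names are hypothetical; the actual fix may do this during the R backend handshake):

{code:java}
// Warn when the SparkR package version and the JVM-side Spark version disagree.
def warnOnVersionMismatch(rPackageVersion: String, jvmSparkVersion: String): Unit = {
  if (rPackageVersion != jvmSparkVersion) {
    Console.err.println(
      s"WARNING: SparkR package version ($rPackageVersion) does not match the " +
        s"Spark JVM version ($jvmSparkVersion); behaviour may differ.")
  }
}

warnOnVersionMismatch("2.3.0", org.apache.spark.SPARK_VERSION)
{code}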



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22458) OutOfDirectMemoryError with Spark 2.2

2017-11-06 Thread Kaushal Prajapati (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16240537#comment-16240537
 ] 

Kaushal Prajapati commented on SPARK-22458:
---

Yes, but we have tried multiple times; the same code works with Spark 2.1 but 
not with Spark 2.2.

> OutOfDirectMemoryError with Spark 2.2
> -
>
> Key: SPARK-22458
> URL: https://issues.apache.org/jira/browse/SPARK-22458
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle, SQL, YARN
>Affects Versions: 2.2.0
>Reporter: Kaushal Prajapati
>Priority: Blocker
>
> We were using Spark 2.1 from last 6 months to execute multiple spark jobs 
> that is running 15 hour long for 50+ TB of source data with below 
> configurations successfully. 
> {quote}spark.master  yarn
> spark.driver.cores10
> spark.driver.maxResultSize5g
> spark.driver.memory   20g
> spark.executor.cores  5
> spark.executor.extraJavaOptions   *-XX:+UseG1GC 
> -Dio.netty.maxDirectMemory=1024* -XX:MaxGCPauseMillis=6 
> *-XX:MaxDirectMemorySize=2048m* 
> -Dlog4j.configuration=file:///conf/log4j.properties -Dhdp.version=2.5.3.0-37
> spark.driver.extraJavaOptions   
> *-Dio.netty.maxDirectMemory=2048 -XX:MaxDirectMemorySize=2048m* 
> -Dlog4j.configuration=file:///conf/log4j.properties -Dhdp.version=2.5.3.0-37
> spark.executor.instances  30
> spark.executor.memory 30g
> *spark.kryoserializer.buffer.max   512m*
> spark.network.timeout 12000s
> spark.serializer  
> org.apache.spark.serializer.KryoSerializer
> spark.shuffle.io.preferDirectBufs false
> spark.sql.catalogImplementation   hive
> spark.sql.shuffle.partitions  5000
> spark.yarn.driver.memoryOverhead  1536
> spark.yarn.executor.memoryOverhead4096
> spark.core.connection.ack.wait.timeout600s
> spark.scheduler.maxRegisteredResourcesWaitingTime 15s
> spark.sql.hive.filesourcePartitionFileCacheSize   524288000
> spark.dynamicAllocation.executorIdleTimeout   3s
> spark.dynamicAllocation.enabled   true
> spark.hadoop.yarn.timeline-service.enabledfalse
> spark.shuffle.service.enabled true
> spark.yarn.am.extraJavaOptions*-Dhdp.version=2.5.3.0-37 
> -Dio.netty.maxDirectMemory=1024 -XX:MaxDirectMemorySize=1024m*{quote}
> Recently we tried to upgrade from Spark 2.1 to Spark 2.2 to get some fixes 
> using latest version. But we started facing DirectBuffer outOfMemory error 
> and exceeding memory limits for executor memoryOverhead issue. To fix that we 
> started tweaking multiple properties but still issue persists. Relevant 
> information is shared below
> Please let me any other details is requried,
>   
> Snapshot for DirectMemory Error Stacktrace :- 
> {code:java}
> 10:48:26.417 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 5.0 in 
> stage 5.3 (TID 25022, dedwdprshc070.de.xxx.com, executor 615): 
> FetchFailed(BlockManagerId(465, dedwdprshc061.de.xxx.com, 7337, None), 
> shuffleId=7, mapId=141, reduceId=3372, message=
> org.apache.spark.shuffle.FetchFailedException: failed to allocate 65536 
> byte(s) of direct memory (used: 1073699840, max: 1073741824)
> at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:442)
> at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:418)
> at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:59)
> at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
> at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
> at 
> org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
> at 
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.sort_addToSorter$(Unknown
>  Source)
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
> at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
> at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$

[jira] [Commented] (SPARK-14516) Clustering evaluator

2017-11-06 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14516?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16240468#comment-16240468
 ] 

Apache Spark commented on SPARK-14516:
--

User 'mgaido91' has created a pull request for this issue:
https://github.com/apache/spark/pull/19676

> Clustering evaluator
> 
>
> Key: SPARK-14516
> URL: https://issues.apache.org/jira/browse/SPARK-14516
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: zhengruifeng
>Assignee: Marco Gaido
> Fix For: 2.3.0
>
>
> MLlib does not have any general-purpose clustering metrics with a ground 
> truth.
> In 
> [Scikit-Learn](http://scikit-learn.org/stable/modules/classes.html#clustering-metrics),
>  there are several kinds of metrics for this.
> It may be meaningful to add some clustering metrics into MLlib.
> This should be added as a {{ClusteringEvaluator}} class extending 
> {{Evaluator}} in spark.ml.
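
With the fix targeted at 2.3.0, usage would look like the other spark.ml evaluators; a sketch, assuming {{dataset}} is an existing DataFrame with a {{features}} vector column:

{code:java}
import org.apache.spark.ml.clustering.KMeans
import org.apache.spark.ml.evaluation.ClusteringEvaluator

// Fit any clusterer, then score its predictions without ground-truth labels.
val predictions = new KMeans().setK(3).setSeed(1L).fit(dataset).transform(dataset)
val silhouette = new ClusteringEvaluator().evaluate(predictions)
println(s"Silhouette: $silhouette")
{code}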



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-14540) Support Scala 2.12 closures and Java 8 lambdas in ClosureCleaner

2017-11-06 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14540?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14540:


Assignee: Apache Spark

> Support Scala 2.12 closures and Java 8 lambdas in ClosureCleaner
> 
>
> Key: SPARK-14540
> URL: https://issues.apache.org/jira/browse/SPARK-14540
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Reporter: Josh Rosen
>Assignee: Apache Spark
>
> Using https://github.com/JoshRosen/spark/tree/build-for-2.12, I tried running 
> ClosureCleanerSuite with Scala 2.12 and ran into two bad test failures:
> {code}
> [info] - toplevel return statements in closures are identified at cleaning 
> time *** FAILED *** (32 milliseconds)
> [info]   Expected exception 
> org.apache.spark.util.ReturnStatementInClosureException to be thrown, but no 
> exception was thrown. (ClosureCleanerSuite.scala:57)
> {code}
> and
> {code}
> [info] - user provided closures are actually cleaned *** FAILED *** (56 
> milliseconds)
> [info]   Expected ReturnStatementInClosureException, but got 
> org.apache.spark.SparkException: Job aborted due to stage failure: Task not 
> serializable: java.io.NotSerializableException: java.lang.Object
> [info]- element of array (index: 0)
> [info]- array (class "[Ljava.lang.Object;", size: 1)
> [info]- field (class "java.lang.invoke.SerializedLambda", name: 
> "capturedArgs", type: "class [Ljava.lang.Object;")
> [info]- object (class "java.lang.invoke.SerializedLambda", 
> SerializedLambda[capturingClass=class 
> org.apache.spark.util.TestUserClosuresActuallyCleaned$, 
> functionalInterfaceMethod=scala/runtime/java8/JFunction1$mcII$sp.apply$mcII$sp:(I)I,
>  implementation=invokeStatic 
> org/apache/spark/util/TestUserClosuresActuallyCleaned$.org$apache$spark$util$TestUserClosuresActuallyCleaned$$$anonfun$69:(Ljava/lang/Object;I)I,
>  instantiatedMethodType=(I)I, numCaptured=1])
> [info]- element of array (index: 0)
> [info]- array (class "[Ljava.lang.Object;", size: 1)
> [info]- field (class "java.lang.invoke.SerializedLambda", name: 
> "capturedArgs", type: "class [Ljava.lang.Object;")
> [info]- object (class "java.lang.invoke.SerializedLambda", 
> SerializedLambda[capturingClass=class org.apache.spark.rdd.RDD, 
> functionalInterfaceMethod=scala/Function3.apply:(Ljava/lang/Object;Ljava/lang/Object;Ljava/lang/Object;)Ljava/lang/Object;,
>  implementation=invokeStatic 
> org/apache/spark/rdd/RDD.org$apache$spark$rdd$RDD$$$anonfun$20$adapted:(Lscala/Function1;Lorg/apache/spark/TaskContext;Ljava/lang/Object;Lscala/collection/Iterator;)Lscala/collection/Iterator;,
>  
> instantiatedMethodType=(Lorg/apache/spark/TaskContext;Ljava/lang/Object;Lscala/collection/Iterator;)Lscala/collection/Iterator;,
>  numCaptured=1])
> [info]- field (class "org.apache.spark.rdd.MapPartitionsRDD", name: 
> "f", type: "interface scala.Function3")
> [info]- object (class "org.apache.spark.rdd.MapPartitionsRDD", 
> MapPartitionsRDD[2] at apply at Transformer.scala:22)
> [info]- field (class "scala.Tuple2", name: "_1", type: "class 
> java.lang.Object")
> [info]- root object (class "scala.Tuple2", (MapPartitionsRDD[2] at 
> apply at 
> Transformer.scala:22,org.apache.spark.SparkContext$$Lambda$957/431842435@6e803685)).
> [info]   This means the closure provided by user is not actually cleaned. 
> (ClosureCleanerSuite.scala:78)
> {code}
> We'll need to figure out a closure cleaning strategy which works for 2.12 
> lambdas.
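
For context, here is a minimal sketch (an illustration, not code from the 
ticket) of the kind of closure the first failing test exercises: a non-local 
return captured inside a closure passed to an RDD transformation, which 
ClosureCleaner is expected to reject at cleaning time with 
ReturnStatementInClosureException. The object and method names are made up for 
the example.

{code:java}
import org.apache.spark.sql.SparkSession

object ReturnInClosureExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("closure-example").getOrCreate()
    val sc = spark.sparkContext

    def badCount(): Long = {
      // The "return" is a non-local return from badCount() captured inside the
      // map closure; cleaning is expected to fail before the job ever runs.
      sc.parallelize(1 to 10).map { x =>
        if (x > 5) return 0L
        x
      }.count()
    }

    try badCount()
    catch { case e: Exception => println(s"Closure rejected: ${e.getClass.getName}") }

    spark.stop()
  }
}
{code}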



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-14540) Support Scala 2.12 closures and Java 8 lambdas in ClosureCleaner

2017-11-06 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14540?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14540:


Assignee: (was: Apache Spark)

> Support Scala 2.12 closures and Java 8 lambdas in ClosureCleaner
> 
>
> Key: SPARK-14540
> URL: https://issues.apache.org/jira/browse/SPARK-14540
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Reporter: Josh Rosen
>
> Using https://github.com/JoshRosen/spark/tree/build-for-2.12, I tried running 
> ClosureCleanerSuite with Scala 2.12 and ran into two bad test failures:
> {code}
> [info] - toplevel return statements in closures are identified at cleaning 
> time *** FAILED *** (32 milliseconds)
> [info]   Expected exception 
> org.apache.spark.util.ReturnStatementInClosureException to be thrown, but no 
> exception was thrown. (ClosureCleanerSuite.scala:57)
> {code}
> and
> {code}
> [info] - user provided closures are actually cleaned *** FAILED *** (56 
> milliseconds)
> [info]   Expected ReturnStatementInClosureException, but got 
> org.apache.spark.SparkException: Job aborted due to stage failure: Task not 
> serializable: java.io.NotSerializableException: java.lang.Object
> [info]- element of array (index: 0)
> [info]- array (class "[Ljava.lang.Object;", size: 1)
> [info]- field (class "java.lang.invoke.SerializedLambda", name: 
> "capturedArgs", type: "class [Ljava.lang.Object;")
> [info]- object (class "java.lang.invoke.SerializedLambda", 
> SerializedLambda[capturingClass=class 
> org.apache.spark.util.TestUserClosuresActuallyCleaned$, 
> functionalInterfaceMethod=scala/runtime/java8/JFunction1$mcII$sp.apply$mcII$sp:(I)I,
>  implementation=invokeStatic 
> org/apache/spark/util/TestUserClosuresActuallyCleaned$.org$apache$spark$util$TestUserClosuresActuallyCleaned$$$anonfun$69:(Ljava/lang/Object;I)I,
>  instantiatedMethodType=(I)I, numCaptured=1])
> [info]- element of array (index: 0)
> [info]- array (class "[Ljava.lang.Object;", size: 1)
> [info]- field (class "java.lang.invoke.SerializedLambda", name: 
> "capturedArgs", type: "class [Ljava.lang.Object;")
> [info]- object (class "java.lang.invoke.SerializedLambda", 
> SerializedLambda[capturingClass=class org.apache.spark.rdd.RDD, 
> functionalInterfaceMethod=scala/Function3.apply:(Ljava/lang/Object;Ljava/lang/Object;Ljava/lang/Object;)Ljava/lang/Object;,
>  implementation=invokeStatic 
> org/apache/spark/rdd/RDD.org$apache$spark$rdd$RDD$$$anonfun$20$adapted:(Lscala/Function1;Lorg/apache/spark/TaskContext;Ljava/lang/Object;Lscala/collection/Iterator;)Lscala/collection/Iterator;,
>  
> instantiatedMethodType=(Lorg/apache/spark/TaskContext;Ljava/lang/Object;Lscala/collection/Iterator;)Lscala/collection/Iterator;,
>  numCaptured=1])
> [info]- field (class "org.apache.spark.rdd.MapPartitionsRDD", name: 
> "f", type: "interface scala.Function3")
> [info]- object (class "org.apache.spark.rdd.MapPartitionsRDD", 
> MapPartitionsRDD[2] at apply at Transformer.scala:22)
> [info]- field (class "scala.Tuple2", name: "_1", type: "class 
> java.lang.Object")
> [info]- root object (class "scala.Tuple2", (MapPartitionsRDD[2] at 
> apply at 
> Transformer.scala:22,org.apache.spark.SparkContext$$Lambda$957/431842435@6e803685)).
> [info]   This means the closure provided by user is not actually cleaned. 
> (ClosureCleanerSuite.scala:78)
> {code}
> We'll need to figure out a closure cleaning strategy which works for 2.12 
> lambdas.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14540) Support Scala 2.12 closures and Java 8 lambdas in ClosureCleaner

2017-11-06 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16240422#comment-16240422
 ] 

Apache Spark commented on SPARK-14540:
--

User 'srowen' has created a pull request for this issue:
https://github.com/apache/spark/pull/19675

> Support Scala 2.12 closures and Java 8 lambdas in ClosureCleaner
> 
>
> Key: SPARK-14540
> URL: https://issues.apache.org/jira/browse/SPARK-14540
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Reporter: Josh Rosen
>
> Using https://github.com/JoshRosen/spark/tree/build-for-2.12, I tried running 
> ClosureCleanerSuite with Scala 2.12 and ran into two bad test failures:
> {code}
> [info] - toplevel return statements in closures are identified at cleaning 
> time *** FAILED *** (32 milliseconds)
> [info]   Expected exception 
> org.apache.spark.util.ReturnStatementInClosureException to be thrown, but no 
> exception was thrown. (ClosureCleanerSuite.scala:57)
> {code}
> and
> {code}
> [info] - user provided closures are actually cleaned *** FAILED *** (56 
> milliseconds)
> [info]   Expected ReturnStatementInClosureException, but got 
> org.apache.spark.SparkException: Job aborted due to stage failure: Task not 
> serializable: java.io.NotSerializableException: java.lang.Object
> [info]- element of array (index: 0)
> [info]- array (class "[Ljava.lang.Object;", size: 1)
> [info]- field (class "java.lang.invoke.SerializedLambda", name: 
> "capturedArgs", type: "class [Ljava.lang.Object;")
> [info]- object (class "java.lang.invoke.SerializedLambda", 
> SerializedLambda[capturingClass=class 
> org.apache.spark.util.TestUserClosuresActuallyCleaned$, 
> functionalInterfaceMethod=scala/runtime/java8/JFunction1$mcII$sp.apply$mcII$sp:(I)I,
>  implementation=invokeStatic 
> org/apache/spark/util/TestUserClosuresActuallyCleaned$.org$apache$spark$util$TestUserClosuresActuallyCleaned$$$anonfun$69:(Ljava/lang/Object;I)I,
>  instantiatedMethodType=(I)I, numCaptured=1])
> [info]- element of array (index: 0)
> [info]- array (class "[Ljava.lang.Object;", size: 1)
> [info]- field (class "java.lang.invoke.SerializedLambda", name: 
> "capturedArgs", type: "class [Ljava.lang.Object;")
> [info]- object (class "java.lang.invoke.SerializedLambda", 
> SerializedLambda[capturingClass=class org.apache.spark.rdd.RDD, 
> functionalInterfaceMethod=scala/Function3.apply:(Ljava/lang/Object;Ljava/lang/Object;Ljava/lang/Object;)Ljava/lang/Object;,
>  implementation=invokeStatic 
> org/apache/spark/rdd/RDD.org$apache$spark$rdd$RDD$$$anonfun$20$adapted:(Lscala/Function1;Lorg/apache/spark/TaskContext;Ljava/lang/Object;Lscala/collection/Iterator;)Lscala/collection/Iterator;,
>  
> instantiatedMethodType=(Lorg/apache/spark/TaskContext;Ljava/lang/Object;Lscala/collection/Iterator;)Lscala/collection/Iterator;,
>  numCaptured=1])
> [info]- field (class "org.apache.spark.rdd.MapPartitionsRDD", name: 
> "f", type: "interface scala.Function3")
> [info]- object (class "org.apache.spark.rdd.MapPartitionsRDD", 
> MapPartitionsRDD[2] at apply at Transformer.scala:22)
> [info]- field (class "scala.Tuple2", name: "_1", type: "class 
> java.lang.Object")
> [info]- root object (class "scala.Tuple2", (MapPartitionsRDD[2] at 
> apply at 
> Transformer.scala:22,org.apache.spark.SparkContext$$Lambda$957/431842435@6e803685)).
> [info]   This means the closure provided by user is not actually cleaned. 
> (ClosureCleanerSuite.scala:78)
> {code}
> We'll need to figure out a closure cleaning strategy which works for 2.12 
> lambdas.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-22458) OutOfDirectMemoryError with Spark 2.2

2017-11-06 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-22458.
---
Resolution: Not A Problem

Please read http://spark.apache.org/contributing.html
This just means you're out of YARN container memory. You need to increase the 
overhead size. This is not a bug.
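
For anyone landing here from search, a minimal sketch of the kind of change 
this points at (the sizes are placeholders, not recommendations; on YARN these 
keys are normally passed as --conf options to spark-submit before launch):

{code:java}
// Sketch only: raise the YARN container overhead so netty's direct buffers fit.
// The values (6144 / 2048 MiB) are illustrative; tune them for your workload.
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("overhead-example")
  .set("spark.yarn.executor.memoryOverhead", "6144") // MiB on top of the executor heap
  .set("spark.yarn.driver.memoryOverhead", "2048")   // MiB on top of the driver heap
// Pass `conf` to SparkSession.builder().config(conf), or set the same keys
// with spark-submit --conf, which is the usual route on YARN.
{code}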

> OutOfDirectMemoryError with Spark 2.2
> -
>
> Key: SPARK-22458
> URL: https://issues.apache.org/jira/browse/SPARK-22458
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle, SQL, YARN
>Affects Versions: 2.2.0
>Reporter: Kaushal Prajapati
>Priority: Blocker
>
> We had been using Spark 2.1 for the last 6 months to run multiple Spark 
> jobs, each running about 15 hours over 50+ TB of source data, successfully 
> with the configurations below. 
> {quote}spark.master  yarn
> spark.driver.cores10
> spark.driver.maxResultSize5g
> spark.driver.memory   20g
> spark.executor.cores  5
> spark.executor.extraJavaOptions   *-XX:+UseG1GC 
> -Dio.netty.maxDirectMemory=1024* -XX:MaxGCPauseMillis=6 
> *-XX:MaxDirectMemorySize=2048m* 
> -Dlog4j.configuration=file:///conf/log4j.properties -Dhdp.version=2.5.3.0-37
> spark.driver.extraJavaOptions   
> *-Dio.netty.maxDirectMemory=2048 -XX:MaxDirectMemorySize=2048m* 
> -Dlog4j.configuration=file:///conf/log4j.properties -Dhdp.version=2.5.3.0-37
> spark.executor.instances  30
> spark.executor.memory 30g
> *spark.kryoserializer.buffer.max   512m*
> spark.network.timeout 12000s
> spark.serializer  
> org.apache.spark.serializer.KryoSerializer
> spark.shuffle.io.preferDirectBufs false
> spark.sql.catalogImplementation   hive
> spark.sql.shuffle.partitions  5000
> spark.yarn.driver.memoryOverhead  1536
> spark.yarn.executor.memoryOverhead4096
> spark.core.connection.ack.wait.timeout600s
> spark.scheduler.maxRegisteredResourcesWaitingTime 15s
> spark.sql.hive.filesourcePartitionFileCacheSize   524288000
> spark.dynamicAllocation.executorIdleTimeout   3s
> spark.dynamicAllocation.enabled   true
> spark.hadoop.yarn.timeline-service.enabledfalse
> spark.shuffle.service.enabled true
> spark.yarn.am.extraJavaOptions*-Dhdp.version=2.5.3.0-37 
> -Dio.netty.maxDirectMemory=1024 -XX:MaxDirectMemorySize=1024m*{quote}
> Recently we tried to upgrade from Spark 2.1 to Spark 2.2 to pick up fixes in 
> the latest version. Since then we have been hitting DirectBuffer 
> out-of-memory errors and executors exceeding their memoryOverhead limits. We 
> tried tweaking multiple properties, but the issue still persists. The 
> relevant information is shared below.
> Please let me know if any other details are required.
>   
> Snapshot of the DirectMemory error stack trace: 
> {code:java}
> 10:48:26.417 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 5.0 in 
> stage 5.3 (TID 25022, dedwdprshc070.de.xxx.com, executor 615): 
> FetchFailed(BlockManagerId(465, dedwdprshc061.de.xxx.com, 7337, None), 
> shuffleId=7, mapId=141, reduceId=3372, message=
> org.apache.spark.shuffle.FetchFailedException: failed to allocate 65536 
> byte(s) of direct memory (used: 1073699840, max: 1073741824)
> at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:442)
> at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:418)
> at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:59)
> at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
> at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
> at 
> org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
> at 
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.sort_addToSorter$(Unknown
>  Source)
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
> at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
> at 
> org.apache.spark.sql.execution.Whol

[jira] [Commented] (SPARK-22395) Fix the behavior of timestamp values for Pandas to respect session timezone

2017-11-06 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22395?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16240396#comment-16240396
 ] 

Apache Spark commented on SPARK-22395:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/19674

> Fix the behavior of timestamp values for Pandas to respect session timezone
> ---
>
> Key: SPARK-22395
> URL: https://issues.apache.org/jira/browse/SPARK-22395
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 2.3.0
>Reporter: Takuya Ueshin
>
> When converting a Pandas DataFrame/Series from/to a Spark DataFrame using 
> {{toPandas()}} or pandas udfs, timestamp values respect the Python system 
> timezone instead of the session timezone.
> For example, let's say we use {{"America/Los_Angeles"}} as session timezone 
> and have a timestamp value {{"1970-01-01 00:00:01"}} in the timezone. Btw, 
> I'm in Japan so Python timezone would be {{"Asia/Tokyo"}}.
> The timestamp value from current {{toPandas()}} will be the following:
> {noformat}
> >>> spark.conf.set("spark.sql.session.timeZone", "America/Los_Angeles")
> >>> df = spark.createDataFrame([28801], "long").selectExpr("timestamp(value) 
> >>> as ts")
> >>> df.show()
> +---+
> | ts|
> +---+
> |1970-01-01 00:00:01|
> +---+
> >>> df.toPandas()
>ts
> 0 1970-01-01 17:00:01
> {noformat}
> As you can see, the value becomes {{"1970-01-01 17:00:01"}} because it 
> respects Python timezone.
> As we discussed in https://github.com/apache/spark/pull/18664, we consider 
> this behavior is a bug and the value should be {{"1970-01-01 00:00:01"}}.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-20644) Hook up Spark UI to the new key-value store backend

2017-11-06 Thread Imran Rashid (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Imran Rashid resolved SPARK-20644.
--
   Resolution: Fixed
Fix Version/s: 2.3.0

Issue resolved by pull request 19582
[https://github.com/apache/spark/pull/19582]

> Hook up Spark UI to the new key-value store backend
> ---
>
> Key: SPARK-20644
> URL: https://issues.apache.org/jira/browse/SPARK-20644
> Project: Spark
>  Issue Type: Sub-task
>  Components: Web UI
>Affects Versions: 2.3.0
>Reporter: Marcelo Vanzin
> Fix For: 2.3.0
>
>
> See spec in parent issue (SPARK-18085) for more details.
> This task tracks hooking up the Spark UI (both live and SHS) to the key-value 
> store based backend. It's the initial work to allow individual UI pages to be 
> de-coupled from the listener implementations and use the REST API data saved 
> in the store.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20644) Hook up Spark UI to the new key-value store backend

2017-11-06 Thread Imran Rashid (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Imran Rashid reassigned SPARK-20644:


Assignee: Marcelo Vanzin

> Hook up Spark UI to the new key-value store backend
> ---
>
> Key: SPARK-20644
> URL: https://issues.apache.org/jira/browse/SPARK-20644
> Project: Spark
>  Issue Type: Sub-task
>  Components: Web UI
>Affects Versions: 2.3.0
>Reporter: Marcelo Vanzin
>Assignee: Marcelo Vanzin
> Fix For: 2.3.0
>
>
> See spec in parent issue (SPARK-18085) for more details.
> This task tracks hooking up the Spark UI (both live and SHS) to the key-value 
> store based backend. It's the initial work to allow individual UI pages to be 
> de-coupled from the listener implementations and use the REST API data saved 
> in the store.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22458) OutOfDirectMemoryError with Spark 2.2

2017-11-06 Thread Kaushal Prajapati (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kaushal Prajapati updated SPARK-22458:
--
Description: 
We had been using Spark 2.1 for the last 6 months to run multiple Spark jobs, 
each running about 15 hours over 50+ TB of source data, successfully with the 
configurations below. 


{quote}spark.master  yarn
spark.driver.cores10
spark.driver.maxResultSize5g
spark.driver.memory   20g
spark.executor.cores  5
spark.executor.extraJavaOptions   *-XX:+UseG1GC 
-Dio.netty.maxDirectMemory=1024* -XX:MaxGCPauseMillis=6 
*-XX:MaxDirectMemorySize=2048m* 
-Dlog4j.configuration=file:///conf/log4j.properties -Dhdp.version=2.5.3.0-37
spark.driver.extraJavaOptions   
*-Dio.netty.maxDirectMemory=2048 -XX:MaxDirectMemorySize=2048m* 
-Dlog4j.configuration=file:///conf/log4j.properties -Dhdp.version=2.5.3.0-37
spark.executor.instances  30
spark.executor.memory 30g
*spark.kryoserializer.buffer.max   512m*

spark.network.timeout 12000s
spark.serializer  
org.apache.spark.serializer.KryoSerializer
spark.shuffle.io.preferDirectBufs false
spark.sql.catalogImplementation   hive
spark.sql.shuffle.partitions  5000
spark.yarn.driver.memoryOverhead  1536
spark.yarn.executor.memoryOverhead4096
spark.core.connection.ack.wait.timeout600s
spark.scheduler.maxRegisteredResourcesWaitingTime 15s
spark.sql.hive.filesourcePartitionFileCacheSize   524288000


spark.dynamicAllocation.executorIdleTimeout   3s
spark.dynamicAllocation.enabled   true
spark.hadoop.yarn.timeline-service.enabledfalse
spark.shuffle.service.enabled true
spark.yarn.am.extraJavaOptions*-Dhdp.version=2.5.3.0-37 
-Dio.netty.maxDirectMemory=1024 -XX:MaxDirectMemorySize=1024m*{quote}


Recently we tried to upgrade from Spark 2.1 to Spark 2.2 to pick up fixes in 
the latest version. Since then we have been hitting DirectBuffer out-of-memory 
errors and executors exceeding their memoryOverhead limits. We tried tweaking 
multiple properties, but the issue still persists. The relevant information is 
shared below.

Please let me know if any other details are required.

Snapshot of the DirectMemory error stack trace: 

{code:java}
10:48:26.417 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 5.0 in 
stage 5.3 (TID 25022, dedwdprshc070.de.xxx.com, executor 615): 
FetchFailed(BlockManagerId(465, dedwdprshc061.de.xxx.com, 7337, None), 
shuffleId=7, mapId=141, reduceId=3372, message=
org.apache.spark.shuffle.FetchFailedException: failed to allocate 65536 byte(s) 
of direct memory (used: 1073699840, max: 1073741824)
at 
org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:442)
at 
org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:418)
at 
org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:59)
at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at 
org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
at 
org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.sort_addToSorter$(Unknown
 Source)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
 Source)
at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:395)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.findNextInnerJoinRows$(Unknown
 Source)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
 Source)
at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$10$$anon$2.hasNext(WholeStageCodegenExec.scala:414)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at 
org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(Unsafe

[jira] [Updated] (SPARK-22458) OutOfDirectMemoryError with Spark 2.2

2017-11-06 Thread Kaushal Prajapati (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kaushal Prajapati updated SPARK-22458:
--
Description: 
We had been using Spark 2.1 for the last 6 months to run multiple Spark jobs, 
each running about 15 hours over 50+ TB of source data, successfully with the 
configurations below. 


{noformat}
spark.master  yarn
spark.driver.cores10
spark.driver.maxResultSize5g
spark.driver.memory   20g
spark.executor.cores  5
spark.executor.extraJavaOptions   -XX:+UseG1GC 
{color:red}-Dio.netty.maxDirectMemory=1024* -XX:MaxGCPauseMillis=6{color} 
-XX:MaxDirectMemorySize=2048m* 
-Dlog4j.configuration=file:///conf/log4j.properties -Dhdp.version=2.5.3.0-37
spark.driver.extraJavaOptions
*-Dio.netty.maxDirectMemory=2048 -XX:MaxDirectMemorySize=2048m* 
-Dlog4j.configuration=file:///conf/log4j.properties -Dhdp.version=2.5.3.0-37
spark.executor.instances  30
spark.executor.memory 30g
*spark.kryoserializer.buffer.max   512m*

spark.network.timeout 12000s
spark.serializer  
org.apache.spark.serializer.KryoSerializer
spark.shuffle.io.preferDirectBufs false
spark.sql.catalogImplementation   hive
spark.sql.shuffle.partitions  5000
spark.yarn.driver.memoryOverhead  1536
spark.yarn.executor.memoryOverhead4096
spark.core.connection.ack.wait.timeout600s
spark.scheduler.maxRegisteredResourcesWaitingTime 15s
spark.sql.hive.filesourcePartitionFileCacheSize   524288000


spark.dynamicAllocation.executorIdleTimeout   3s
spark.dynamicAllocation.enabled   true
spark.hadoop.yarn.timeline-service.enabledfalse
spark.shuffle.service.enabled true
spark.yarn.am.extraJavaOptions-Dhdp.version=2.5.3.0-37 * 
-Dio.netty.maxDirectMemory=1024 -XX:MaxDirectMemorySize=1024m*
{noformat}


Recently we tried to upgrade from Spark 2.1 to Spark 2.2 to pick up fixes in 
the latest version. Since then we have been hitting DirectBuffer out-of-memory 
errors and executors exceeding their memoryOverhead limits. We tried tweaking 
multiple properties, but the issue still persists. The relevant information is 
shared below.

Please let me know if any other details are required.

Snapshot of the DirectMemory error stack trace: 

{code:java}
10:48:26.417 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 5.0 in 
stage 5.3 (TID 25022, dedwdprshc070.de.xxx.com, executor 615): 
FetchFailed(BlockManagerId(465, dedwdprshc061.de.xxx.com, 7337, None), 
shuffleId=7, mapId=141, reduceId=3372, message=
org.apache.spark.shuffle.FetchFailedException: failed to allocate 65536 byte(s) 
of direct memory (used: 1073699840, max: 1073741824)
at 
org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:442)
at 
org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:418)
at 
org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:59)
at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at 
org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
at 
org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.sort_addToSorter$(Unknown
 Source)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
 Source)
at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:395)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.findNextInnerJoinRows$(Unknown
 Source)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
 Source)
at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$10$$anon$2.hasNext(WholeStageCodegenExec.scala:414)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at 
org.apache.spark.shuffle.sort.Unsafe

[jira] [Updated] (SPARK-22458) OutOfDirectMemoryError with Spark 2.2

2017-11-06 Thread Kaushal Prajapati (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kaushal Prajapati updated SPARK-22458:
--
Description: 
We had been using Spark 2.1 for the last 6 months to run multiple Spark jobs, 
each running about 15 hours over 50+ TB of source data, successfully with the 
configurations below. 


{noformat}
spark.master  yarn
spark.driver.cores10
spark.driver.maxResultSize5g
spark.driver.memory   20g
spark.executor.cores  5
spark.executor.extraJavaOptions   -XX:+UseG1GC 
*-Dio.netty.maxDirectMemory=1024* -XX:MaxGCPauseMillis=6 
*-XX:MaxDirectMemorySize=2048m* 
-Dlog4j.configuration=file:///conf/log4j.properties -Dhdp.version=2.5.3.0-37
spark.driver.extraJavaOptions
*-Dio.netty.maxDirectMemory=2048 -XX:MaxDirectMemorySize=2048m* 
-Dlog4j.configuration=file:///conf/log4j.properties -Dhdp.version=2.5.3.0-37
spark.executor.instances  30
spark.executor.memory 30g
*spark.kryoserializer.buffer.max   512m*

spark.network.timeout 12000s
spark.serializer  
org.apache.spark.serializer.KryoSerializer
spark.shuffle.io.preferDirectBufs false
spark.sql.catalogImplementation   hive
spark.sql.shuffle.partitions  5000
spark.yarn.driver.memoryOverhead  1536
spark.yarn.executor.memoryOverhead4096
spark.core.connection.ack.wait.timeout600s
spark.scheduler.maxRegisteredResourcesWaitingTime 15s
spark.sql.hive.filesourcePartitionFileCacheSize   524288000


spark.dynamicAllocation.executorIdleTimeout   3s
spark.dynamicAllocation.enabled   true
spark.hadoop.yarn.timeline-service.enabledfalse
spark.shuffle.service.enabled true
spark.yarn.am.extraJavaOptions-Dhdp.version=2.5.3.0-37 * 
-Dio.netty.maxDirectMemory=1024 -XX:MaxDirectMemorySize=1024m*
{noformat}


Recently we tried to upgrade from Spark 2.1 to Spark 2.2 to pick up fixes in 
the latest version. Since then we have been hitting DirectBuffer out-of-memory 
errors and executors exceeding their memoryOverhead limits. We tried tweaking 
multiple properties, but the issue still persists. The relevant information is 
shared below.

Please let me know if any other details are required.

Snapshot of the DirectMemory error stack trace: 

{code:java}
10:48:26.417 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 5.0 in 
stage 5.3 (TID 25022, dedwdprshc070.de.xxx.com, executor 615): 
FetchFailed(BlockManagerId(465, dedwdprshc061.de.xxx.com, 7337, None), 
shuffleId=7, mapId=141, reduceId=3372, message=
org.apache.spark.shuffle.FetchFailedException: failed to allocate 65536 byte(s) 
of direct memory (used: 1073699840, max: 1073741824)
at 
org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:442)
at 
org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:418)
at 
org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:59)
at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at 
org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
at 
org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.sort_addToSorter$(Unknown
 Source)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
 Source)
at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:395)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.findNextInnerJoinRows$(Unknown
 Source)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
 Source)
at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$10$$anon$2.hasNext(WholeStageCodegenExec.scala:414)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at 
org.apache.spark.shuffle.sort.UnsafeShuffleWriter.wr

[jira] [Updated] (SPARK-22458) OutOfDirectMemoryError with Spark 2.2

2017-11-06 Thread Kaushal Prajapati (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kaushal Prajapati updated SPARK-22458:
--
Description: 
We had been using Spark 2.1 for the last 6 months to run multiple Spark jobs, 
each running about 15 hours over 50+ TB of source data, successfully with the 
configurations below. 


{noformat}
spark.master  yarn
spark.driver.cores10
spark.driver.maxResultSize5g
spark.driver.memory   20g
spark.executor.cores  5
spark.executor.extraJavaOptions   -XX:+UseG1GC 
*-Dio.netty.maxDirectMemory=1024* -XX:MaxGCPauseMillis=6 
*-XX:MaxDirectMemorySize=2048m* 
-Dlog4j.configuration=file:///conf/log4j.properties -Dhdp.version=2.5.3.0-37
spark.driver.extraJavaOptions* 
-Dio.netty.maxDirectMemory=2048 -XX:MaxDirectMemorySize=2048m* 
-Dlog4j.configuration=file:///conf/log4j.properties -Dhdp.version=2.5.3.0-37
spark.executor.instances  30
spark.executor.memory 30g
*spark.kryoserializer.buffer.max   512m*

spark.network.timeout 12000s
spark.serializer  
org.apache.spark.serializer.KryoSerializer
spark.shuffle.io.preferDirectBufs false
spark.sql.catalogImplementation   hive
spark.sql.shuffle.partitions  5000
spark.yarn.driver.memoryOverhead  1536
spark.yarn.executor.memoryOverhead4096
spark.core.connection.ack.wait.timeout600s
spark.scheduler.maxRegisteredResourcesWaitingTime 15s
spark.sql.hive.filesourcePartitionFileCacheSize   524288000


spark.dynamicAllocation.executorIdleTimeout   3s
spark.dynamicAllocation.enabled   true
spark.hadoop.yarn.timeline-service.enabledfalse
spark.shuffle.service.enabled true
spark.yarn.am.extraJavaOptions-Dhdp.version=2.5.3.0-37 * 
-Dio.netty.maxDirectMemory=1024 -XX:MaxDirectMemorySize=1024m*
{noformat}


Recently we tried to upgrade from Spark 2.1 to Spark 2.2 to pick up fixes in 
the latest version. Since then we have been hitting DirectBuffer out-of-memory 
errors and executors exceeding their memoryOverhead limits. We tried tweaking 
multiple properties, but the issue still persists. The relevant information is 
shared below.

Please let me know if any other details are required.

Snapshot of the DirectMemory error stack trace: 

{code:java}
10:48:26.417 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 5.0 in 
stage 5.3 (TID 25022, dedwdprshc070.de.xxx.com, executor 615): 
FetchFailed(BlockManagerId(465, dedwdprshc061.de.xxx.com, 7337, None), 
shuffleId=7, mapId=141, reduceId=3372, message=
org.apache.spark.shuffle.FetchFailedException: failed to allocate 65536 byte(s) 
of direct memory (used: 1073699840, max: 1073741824)
at 
org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:442)
at 
org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:418)
at 
org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:59)
at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at 
org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
at 
org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.sort_addToSorter$(Unknown
 Source)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
 Source)
at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:395)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.findNextInnerJoinRows$(Unknown
 Source)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
 Source)
at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$10$$anon$2.hasNext(WholeStageCodegenExec.scala:414)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at 
org.apache.spark.shuffle.sort.UnsafeShuffleWriter.w

[jira] [Updated] (SPARK-22458) OutOfDirectMemoryError with Spark 2.2

2017-11-06 Thread Kaushal Prajapati (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kaushal Prajapati updated SPARK-22458:
--
Description: 
We had been using Spark 2.1 for the last 6 months to run multiple Spark jobs, 
each running about 15 hours over 50+ TB of source data, successfully with the 
configurations below. 

{{spark.master  yarn
spark.driver.cores10
spark.driver.maxResultSize5g
spark.driver.memory   20g
spark.executor.cores  5
spark.executor.extraJavaOptions   -XX:+UseG1GC 
*-Dio.netty.maxDirectMemory=1024* -XX:MaxGCPauseMillis=6 
*-XX:MaxDirectMemorySize=2048m* 
-Dlog4j.configuration=file:///conf/log4j.properties -Dhdp.version=2.5.3.0-37
spark.driver.extraJavaOptions* 
-Dio.netty.maxDirectMemory=2048 -XX:MaxDirectMemorySize=2048m* 
-Dlog4j.configuration=file:///conf/log4j.properties -Dhdp.version=2.5.3.0-37
spark.executor.instances  30
spark.executor.memory 30g
*spark.kryoserializer.buffer.max   512m*

spark.network.timeout 12000s
spark.serializer  
org.apache.spark.serializer.KryoSerializer
spark.shuffle.io.preferDirectBufs false
spark.sql.catalogImplementation   hive
spark.sql.shuffle.partitions  5000
spark.yarn.driver.memoryOverhead  1536
spark.yarn.executor.memoryOverhead4096
spark.core.connection.ack.wait.timeout600s
spark.scheduler.maxRegisteredResourcesWaitingTime 15s
spark.sql.hive.filesourcePartitionFileCacheSize   524288000


spark.dynamicAllocation.executorIdleTimeout   3s
spark.dynamicAllocation.enabled   true
spark.hadoop.yarn.timeline-service.enabledfalse
spark.shuffle.service.enabled true
spark.yarn.am.extraJavaOptions-Dhdp.version=2.5.3.0-37 * 
-Dio.netty.maxDirectMemory=1024 -XX:MaxDirectMemorySize=1024m*
}}
Recently we tried to upgrade from Spark 2.1 to Spark 2.2 to pick up fixes in 
the latest version. Since then we have been hitting DirectBuffer out-of-memory 
errors and executors exceeding their memoryOverhead limits. We tried tweaking 
multiple properties, but the issue still persists. The relevant information is 
shared below.

Please let me know if any other details are required.

Snapshot of the DirectMemory error stack trace: 

{code:java}
10:48:26.417 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 5.0 in 
stage 5.3 (TID 25022, dedwdprshc070.de.xxx.com, executor 615): 
FetchFailed(BlockManagerId(465, dedwdprshc061.de.xxx.com, 7337, None), 
shuffleId=7, mapId=141, reduceId=3372, message=
org.apache.spark.shuffle.FetchFailedException: failed to allocate 65536 byte(s) 
of direct memory (used: 1073699840, max: 1073741824)
at 
org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:442)
at 
org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:418)
at 
org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:59)
at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at 
org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
at 
org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.sort_addToSorter$(Unknown
 Source)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
 Source)
at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:395)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.findNextInnerJoinRows$(Unknown
 Source)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
 Source)
at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$10$$anon$2.hasNext(WholeStageCodegenExec.scala:414)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at 
org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWr

[jira] [Resolved] (SPARK-22445) move CodegenContext.copyResult to CodegenSupport

2017-11-06 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22445?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-22445.
-
   Resolution: Fixed
Fix Version/s: 2.3.0

Issue resolved by pull request 19656
[https://github.com/apache/spark/pull/19656]

> move CodegenContext.copyResult to CodegenSupport
> 
>
> Key: SPARK-22445
> URL: https://issues.apache.org/jira/browse/SPARK-22445
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
> Fix For: 2.3.0
>
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22458) OutOfDirectMemoryError with Spark 2.2

2017-11-06 Thread Kaushal Prajapati (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kaushal Prajapati updated SPARK-22458:
--
Description: 
We had been using Spark 2.1 for the last 6 months to run multiple Spark jobs, 
each running about 15 hours over 50+ TB of source data, successfully with the 
configurations below. 

spark.master  yarn
spark.driver.cores10
spark.driver.maxResultSize5g
spark.driver.memory   20g
spark.executor.cores  5
spark.executor.extraJavaOptions   -XX:+UseG1GC 
*-Dio.netty.maxDirectMemory=1024* -XX:MaxGCPauseMillis=6 
*-XX:MaxDirectMemorySize=2048m* 
-Dlog4j.configuration=file:///conf/log4j.properties -Dhdp.version=2.5.3.0-37
spark.driver.extraJavaOptions* 
-Dio.netty.maxDirectMemory=2048 -XX:MaxDirectMemorySize=2048m* 
-Dlog4j.configuration=file:///conf/log4j.properties -Dhdp.version=2.5.3.0-37
spark.executor.instances  30
spark.executor.memory 30g
*spark.kryoserializer.buffer.max   512m*

spark.network.timeout 12000s
spark.serializer  
org.apache.spark.serializer.KryoSerializer
spark.shuffle.io.preferDirectBufs false
spark.sql.catalogImplementation   hive
spark.sql.shuffle.partitions  5000
spark.yarn.driver.memoryOverhead  1536
spark.yarn.executor.memoryOverhead4096
spark.core.connection.ack.wait.timeout600s
spark.scheduler.maxRegisteredResourcesWaitingTime 15s
spark.sql.hive.filesourcePartitionFileCacheSize   524288000


spark.dynamicAllocation.executorIdleTimeout   3s
spark.dynamicAllocation.enabled   true
spark.hadoop.yarn.timeline-service.enabledfalse
spark.shuffle.service.enabled true
spark.yarn.am.extraJavaOptions-Dhdp.version=2.5.3.0-37 * 
-Dio.netty.maxDirectMemory=1024 -XX:MaxDirectMemorySize=1024m*

Recently we tried to upgrade from Spark 2.1 to Spark 2.2 to pick up fixes in 
the latest version. Since then we have been hitting DirectBuffer out-of-memory 
errors and executors exceeding their memoryOverhead limits. We tried tweaking 
multiple properties, but the issue still persists. The relevant information is 
shared below.

Please let me know if any other details are required.

Snapshot of the DirectMemory error stack trace: 
10:48:26.417 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 5.0 in 
stage 5.3 (TID 25022, dedwdprshc070.de.xxx.com, executor 615): 
FetchFailed(BlockManagerId(465, dedwdprshc061.de.xxx.com, 7337, None), 
shuffleId=7, mapId=141, reduceId=3372, message=
org.apache.spark.shuffle.FetchFailedException: failed to allocate 65536 byte(s) 
of direct memory (used: 1073699840, max: 1073741824)
at 
org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:442)
at 
org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:418)
at 
org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:59)
at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at 
org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
at 
org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.sort_addToSorter$(Unknown
 Source)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
 Source)
at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:395)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.findNextInnerJoinRows$(Unknown
 Source)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
 Source)
at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$10$$anon$2.hasNext(WholeStageCodegenExec.scala:414)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at 
org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:166)
  

[jira] [Created] (SPARK-22458) OutOfDirectMemoryError with Spark 2.2

2017-11-06 Thread Kaushal Prajapati (JIRA)
Kaushal Prajapati created SPARK-22458:
-

 Summary: OutOfDirectMemoryError with Spark 2.2
 Key: SPARK-22458
 URL: https://issues.apache.org/jira/browse/SPARK-22458
 Project: Spark
  Issue Type: Bug
  Components: Shuffle, SQL, YARN
Affects Versions: 2.2.0
Reporter: Kaushal Prajapati
Priority: Blocker


We had been using Spark 2.1 for the last 6 months to run multiple Spark jobs, 
each running about 15 hours over 50+ TB of source data, successfully with the 
configurations below. 

spark.master  yarn
spark.driver.cores10
spark.driver.maxResultSize5g
spark.driver.memory   20g
spark.executor.cores  5
spark.executor.extraJavaOptions   -XX:+UseG1GC 
*-Dio.netty.maxDirectMemory=1024* -XX:MaxGCPauseMillis=6 
*-XX:MaxDirectMemorySize=2048m* 
-Dlog4j.configuration=file:///conf/log4j.properties -Dhdp.version=2.5.3.0-37
spark.driver.extraJavaOptions* 
-Dio.netty.maxDirectMemory=2048 -XX:MaxDirectMemorySize=2048m* 
-Dlog4j.configuration=file:///conf/log4j.properties -Dhdp.version=2.5.3.0-37
spark.executor.instances  30
spark.executor.memory 30g
*spark.kryoserializer.buffer.max   512m*

spark.network.timeout 12000s
spark.serializer  
org.apache.spark.serializer.KryoSerializer
spark.shuffle.io.preferDirectBufs false
spark.sql.catalogImplementation   hive
spark.sql.shuffle.partitions  5000
spark.yarn.driver.memoryOverhead  1536
spark.yarn.executor.memoryOverhead4096
spark.core.connection.ack.wait.timeout600s
spark.scheduler.maxRegisteredResourcesWaitingTime 15s
spark.sql.hive.filesourcePartitionFileCacheSize   524288000


spark.dynamicAllocation.executorIdleTimeout   3s
spark.dynamicAllocation.enabled   true
spark.hadoop.yarn.timeline-service.enabledfalse
spark.shuffle.service.enabled true
spark.yarn.am.extraJavaOptions-Dhdp.version=2.5.3.0-37 * 
-Dio.netty.maxDirectMemory=1024 -XX:MaxDirectMemorySize=1024m*

Recently we tried to upgrade from Spark 2.1 to Spark 2.2 to pick up fixes in 
the latest version. Since then we have been hitting DirectBuffer out-of-memory 
errors and executors exceeding their memoryOverhead limits. We tried tweaking 
multiple properties, but the issue still persists. The relevant information is 
shared below.

Please let me know if any other details are required.

Snapshot of the DirectMemory error stack trace: 
10:48:26.417 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 5.0 in 
stage 5.3 (TID 25022, dedwdprshc070.de.xxx.com, executor 615): 
FetchFailed(BlockManagerId(465, dedwdprshc061.de.xxx.com, 7337, None), 
shuffleId=7, mapId=141, reduceId=3372, message=
org.apache.spark.shuffle.FetchFailedException: failed to allocate 65536 byte(s) 
of direct memory (used: 1073699840, max: 1073741824)
at 
org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:442)
at 
org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:418)
at 
org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:59)
at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at 
org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
at 
org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.sort_addToSorter$(Unknown
 Source)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
 Source)
at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:395)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.findNextInnerJoinRows$(Unknown
 Source)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
 Source)
at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$10$$anon

[jira] [Commented] (SPARK-22456) Add new function dayofweek

2017-11-06 Thread Michael Styles (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22456?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16240331#comment-16240331
 ] 

Michael Styles commented on SPARK-22456:


This functionality is part of the SQL standard and supported by a number of 
relational database systems including Oracle, DB2, SQL Server, MySQL, and 
Teradata, to name a few.
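
As a stopgap until such a function exists, the same value can already be 
derived; a small sketch follows (my own, not part of the proposal, assuming an 
active SparkSession named {{spark}} and Spark 2.2 APIs):

{code:java}
// Derive 1=Sunday..7=Saturday from date_format's "u" pattern
// (SimpleDateFormat: 1=Monday..7=Sunday), shifted by one modulo 7.
import org.apache.spark.sql.functions._

val df = spark.sql("SELECT to_date('2017-11-06') AS d")   // 2017-11-06 is a Monday
val withDow = df.withColumn(
  "dow_1_is_sunday",
  (date_format(col("d"), "u").cast("int") % 7) + 1)       // Monday -> 2
withDow.show()
{code}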

> Add new function dayofweek
> --
>
> Key: SPARK-22456
> URL: https://issues.apache.org/jira/browse/SPARK-22456
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark, SQL
>Affects Versions: 2.2.0
>Reporter: Michael Styles
>
> Add new function *dayofweek* to return the day of the week of the given 
> argument as an integer value in the range 1-7, where 1 represents Sunday.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22457) Tables are supposed to be MANAGED only taking into account whether a path is provided

2017-11-06 Thread David Arroyo Cazorla (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22457?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Arroyo Cazorla updated SPARK-22457:
-
Description: 
As far as I know, since Spark 2.2, tables are supposed to be MANAGED only 
taking into account whether a path is provided:

{code:java}
val tableType = if (storage.locationUri.isDefined) {
  CatalogTableType.EXTERNAL
} else {
  CatalogTableType.MANAGED
}
{code}

This solution seems to be right for filesystem based data sources. On the other 
hand, when working with other data sources such as elasticsearch, that solution 
is leading to a weird behaviour described below: 

1) InMemoryCatalog's doCreateTable() adds a locationURI if 
CatalogTableType.MANAGED && tableDefinition.storage.locationUri.isEmpty.

2) Before loading the data source table FindDataSourceTable's 
readDataSourceTable() adds a path option if locationURI exists:

{code:java}
val pathOption = table.storage.locationUri.map("path" -> 
CatalogUtils.URIToString(_))
{code}

3) That causes an error when reading from elasticsearch because 'path' is an 
option already supported by elasticsearch (locationUri is set to 
file:/home/user/spark-rv/elasticsearch/shop/clients)

org.elasticsearch.hadoop.EsHadoopIllegalArgumentException: Cannot find mapping 
for file:/home/user/spark-rv/elasticsearch/shop/clients - one is required 
before using Spark SQL


Would it be possible to mark tables as MANAGED only for a subset of data 
sources (TEXT, CSV, JSON, JDBC, PARQUET, ORC, HIVE), or to consider another 
solution? (A rough sketch of that idea follows the P.S. below.)

P.S. InMemoryCatalog's doDropTable() deletes the directory of the table, which 
from my point of view should only be required for filesystem-based data 
sources: 
{code:java}
   if (tableMeta.tableType == CatalogTableType.MANAGED)
   ...
   // Delete the data/directory of the table
val dir = new Path(tableMeta.location)
try {
  val fs = dir.getFileSystem(hadoopConfig)
  fs.delete(dir, true)
} catch {
  case e: IOException =>
throw new SparkException(s"Unable to drop table $table as failed " +
  s"to delete its directory $dir", e)
}
{code}
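
A rough sketch of the subset idea raised above (purely illustrative; the 
provider list and the helper are my own assumptions, not Spark's actual 
implementation):

{code:java}
// Treat only file-based providers as candidates for MANAGED when no location
// is given, so sources like elasticsearch never receive a synthetic
// locationUri (and hence no injected 'path' option).
import org.apache.spark.sql.catalyst.catalog.CatalogTableType

val fileBasedProviders = Set("text", "csv", "json", "jdbc", "parquet", "orc", "hive")

def resolveTableType(locationDefined: Boolean, provider: String): CatalogTableType = {
  if (locationDefined) CatalogTableType.EXTERNAL
  else if (fileBasedProviders.contains(provider.toLowerCase)) CatalogTableType.MANAGED
  else CatalogTableType.EXTERNAL // e.g. elasticsearch: no directory to manage
}
{code}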

  was:
As far as I know, since Spark 2.2, tables are supposed to be MANAGED only 
taking into account whether a path is provided:

{code:java}
val tableType = if (storage.locationUri.isDefined) {
  CatalogTableType.EXTERNAL
} else {
  CatalogTableType.MANAGED
}
{code}

This solution seems to be right for filesystem based data sources. On the other 
hand, when working with other data sources such as elasticsearch, that solution 
is leading to a weird behaviour described below: 

1) InMemoryCatalog's doCreateTable() adds a locationURI if 
CatalogTableType.MANAGED && tableDefinition.storage.locationUri.isEmpty.

2) Before loading the data source table FindDataSourceTable's 
readDataSourceTable() adds a path option if locationURI exists:

{code:java}
val pathOption = table.storage.locationUri.map("path" -> 
CatalogUtils.URIToString(_))
{code}

3) That causes an error when reading from elasticsearch because 'path' is an 
option already supported by elasticsearch (locationUri is set to 
file:/home/user/spark-rv/elasticsearch/shop/clients)

org.elasticsearch.hadoop.EsHadoopIllegalArgumentException: Cannot find mapping 
for file:/home/user/spark-rv/elasticsearch/shop/clients - one is required 
before using Spark SQL


Would be possible only mark tables as MANAGED for a subset of data sources 
(TEXT, CSV, JSON, JDBC, PARQUET, ORC, HIVE) or think about any other solution?

P.S. InMemoryCatalog' doDropTable() deletes the directory of the table which 
from my point of view should only be required for filesystem based data 
sources: 
{code:java}
   if (tableMeta.tableType == CatalogTableType.MANAGED)
   ...
   // Delete the data/directory of the table
val dir = new Path(tableMeta.location)
try {
  val fs = dir.getFileSystem(hadoopConfig)
  fs.delete(dir, true)
} catch {
  case e: IOException =>
throw new SparkException(s"Unable to drop table $table as failed " +
  s"to delete its directory $dir", e)
}
{code}


> Tables are supposed to be MANAGED only taking into account whether a path is 
> provided
> -
>
> Key: SPARK-22457
> URL: https://issues.apache.org/jira/browse/SPARK-22457
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: David Arroyo Cazorla
>
> As far as I know, since Spark 2.2, tables are supposed to be MANAGED only 
> taking into account whether a path is provided:
> {code:java}
> val tableType = if (storage.locationUri.isDefined) {
>   CatalogTableType.EXTERNAL
> } else {
>  

[jira] [Commented] (SPARK-10399) Off Heap Memory Access for non-JVM libraries (C++)

2017-11-06 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16240314#comment-16240314
 ] 

Sean Owen commented on SPARK-10399:
---

I think this is mostly superseded by Arrow's intended role here, yeah.

> Off Heap Memory Access for non-JVM libraries (C++)
> --
>
> Key: SPARK-10399
> URL: https://issues.apache.org/jira/browse/SPARK-10399
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Paul Weiss
>
> *Summary*
> Provide direct off-heap memory access to an external non-JVM program, such as 
> a C++ library, within the running Spark JVM/executor.  As Spark moves to 
> storing all data in off-heap memory, it makes sense to provide access points 
> to that memory for non-JVM programs.
> 
> *Assumptions*
> * Zero copies will be made during the call into non-JVM library
> * Access into non-JVM libraries will be accomplished via JNI
> * A generic JNI interface will be created so that developers will not need to 
> deal with the raw JNI call
> * C++ will be the initial target non-JVM use case
> * memory management will remain on the JVM/Spark side
> * the API from C++ will be similar to dataframes as much as feasible and NOT 
> require expert knowledge of JNI
> * Data organization and layout will support complex (multi-type, nested, 
> etc.) types
> 
> *Design*
> * Initially Spark JVM -> non-JVM will be supported 
> * Creating an embedded JVM with Spark running from a non-JVM program is 
> initially out of scope
> 
> *Technical*
> * GetDirectBufferAddress is the JNI call used to access byte buffer without 
> copy
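
For reference, a minimal JVM-side sketch of the zero-copy pattern described 
above, assuming a direct ByteBuffer handed to a native method; NativeBridge and 
the library name are hypothetical, not part of Spark:
{code:scala}
import java.nio.ByteBuffer

// Sketch only: the JVM side of a zero-copy JNI hand-off. The C++ side would
// call GetDirectBufferAddress on `buf` to read the same off-heap memory
// without copying. NativeBridge and "sparkcpp" are illustrative names.
object NativeBridge {
  System.loadLibrary("sparkcpp")   // hypothetical native library
  @native def process(buf: ByteBuffer, numBytes: Int): Unit
}

val buf = ByteBuffer.allocateDirect(1 << 20)  // off-heap, addressable from C++
// ... fill buf with column data, then: NativeBridge.process(buf, buf.remaining())
{code}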



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22457) Tables are supposed to be MANAGED only taking into account whether a path is provided

2017-11-06 Thread David Arroyo Cazorla (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22457?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Arroyo Cazorla updated SPARK-22457:
-
Description: 
As far as I know, since Spark 2.2, tables are supposed to be MANAGED only 
taking into account whether a path is provided:

{code:java}
val tableType = if (storage.locationUri.isDefined) {
  CatalogTableType.EXTERNAL
} else {
  CatalogTableType.MANAGED
}
{code}

This solution seems to be right for filesystem based data sources. On the other 
hand, when working with other data sources such as elasticsearch, that solution 
is leading to a weird behaviour described below. 

1) InMemoryCatalog's doCreateTable() adds a locationURI if 
CatalogTableType.MANAGED && tableDefinition.storage.locationUri.isEmpty.

2) Before loading the data source table FindDataSourceTable's 
readDataSourceTable() adds a path option if locationURI exists:

{code:java}
val pathOption = table.storage.locationUri.map("path" -> 
CatalogUtils.URIToString(_))
{code}

3) That causes an error when reading from elasticsearch because 'path' is an 
option already supported by elasticsearch (locationUri is set to 
file:/home/user/spark-rv/elasticsearch/shop/clients)

org.elasticsearch.hadoop.EsHadoopIllegalArgumentException: Cannot find mapping 
for file:/home/user/spark-rv/elasticsearch/shop/clients - one is required 
before using Spark SQL


Would it be possible to mark tables as MANAGED only for a subset of data sources 
(TEXT, CSV, JSON, JDBC, PARQUET, ORC, HIVE), or should another solution be 
considered?

P.S. InMemoryCatalog's doDropTable() deletes the directory of the table, which, 
from my point of view, should only be required for filesystem-based data 
sources: 
{code:java}
if (tableMeta.tableType == CatalogTableType.MANAGED) {
  ...
  // Delete the data/directory of the table
  val dir = new Path(tableMeta.location)
  try {
    val fs = dir.getFileSystem(hadoopConfig)
    fs.delete(dir, true)
  } catch {
    case e: IOException =>
      throw new SparkException(s"Unable to drop table $table as failed " +
        s"to delete its directory $dir", e)
  }
}
{code}

  was:
As far as I know, since Spark 2.2, tables are supposed to be MANAGED only 
taking into account whether a path is provided:

{code:scala}
val tableType = if (storage.locationUri.isDefined) {
  CatalogTableType.EXTERNAL
} else {
  CatalogTableType.MANAGED
}
{code}

This solution seems to be right for filesystem based data sources. On the other 
hand, when working with other data sources such as elasticsearch, that solution 
is leading to a weird behaviour described below. 

1) InMemoryCatalog's doCreateTable() adds a locationURI if 
CatalogTableType.MANAGED && tableDefinition.storage.locationUri.isEmpty.

2) Before loading the data source table FindDataSourceTable's 
readDataSourceTable() adds a path option if locationURI exists:

{code:scala}
val pathOption = table.storage.locationUri.map("path" -> 
CatalogUtils.URIToString(_))
{code}

3) That causes an error when reading from elasticsearch because 'path' is an 
option already supported by elasticsearch (locationUri is set to 
file:/home/user/spark-rv/elasticsearch/shop/clients)

org.elasticsearch.hadoop.EsHadoopIllegalArgumentException: Cannot find mapping 
for file:/home/user/spark-rv/elasticsearch/shop/clients - one is required 
before using Spark SQL


Would it be possible to mark tables as MANAGED only for a subset of data sources 
(TEXT, CSV, JSON, JDBC, PARQUET, ORC, HIVE), or should another solution be 
considered?

P.S. InMemoryCatalog's doDropTable() deletes the directory of the table, which, 
from my point of view, should only be required for filesystem-based data 
sources: 
{code:scala}
if (tableMeta.tableType == CatalogTableType.MANAGED) {
  ...
  // Delete the data/directory of the table
  val dir = new Path(tableMeta.location)
  try {
    val fs = dir.getFileSystem(hadoopConfig)
    fs.delete(dir, true)
  } catch {
    case e: IOException =>
      throw new SparkException(s"Unable to drop table $table as failed " +
        s"to delete its directory $dir", e)
  }
}
{code}


> Tables are supposed to be MANAGED only taking into account whether a path is 
> provided
> -
>
> Key: SPARK-22457
> URL: https://issues.apache.org/jira/browse/SPARK-22457
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: David Arroyo Cazorla
>
> As far as I know, since Spark 2.2, tables are supposed to be MANAGED only 
> taking into account whether a path is provided:
> {code:java}
> val tableType = if (storage.locationUri.isDefined) {
>   CatalogTableType.EXTERNAL
> } else {
>  

[jira] [Updated] (SPARK-22457) Tables are supposed to be MANAGED only taking into account whether a path is provided

2017-11-06 Thread David Arroyo Cazorla (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22457?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Arroyo Cazorla updated SPARK-22457:
-
Description: 
As far as I know, since Spark 2.2, tables are supposed to be MANAGED only 
taking into account whether a path is provided:

{code:java}
val tableType = if (storage.locationUri.isDefined) {
  CatalogTableType.EXTERNAL
} else {
  CatalogTableType.MANAGED
}
{code}

This solution seems to be right for filesystem based data sources. On the other 
hand, when working with other data sources such as elasticsearch, that solution 
is leading to a weird behaviour described below: 

1) InMemoryCatalog's doCreateTable() adds a locationURI if 
CatalogTableType.MANAGED && tableDefinition.storage.locationUri.isEmpty.

2) Before loading the data source table FindDataSourceTable's 
readDataSourceTable() adds a path option if locationURI exists:

{code:java}
val pathOption = table.storage.locationUri.map("path" -> 
CatalogUtils.URIToString(_))
{code}

3) That causes an error when reading from elasticsearch because 'path' is an 
option already supported by elasticsearch (locationUri is set to 
file:/home/user/spark-rv/elasticsearch/shop/clients)

org.elasticsearch.hadoop.EsHadoopIllegalArgumentException: Cannot find mapping 
for file:/home/user/spark-rv/elasticsearch/shop/clients - one is required 
before using Spark SQL


Would it be possible to mark tables as MANAGED only for a subset of data sources 
(TEXT, CSV, JSON, JDBC, PARQUET, ORC, HIVE), or should another solution be 
considered?

P.S. InMemoryCatalog's doDropTable() deletes the directory of the table, which, 
from my point of view, should only be required for filesystem-based data 
sources: 
{code:java}
if (tableMeta.tableType == CatalogTableType.MANAGED) {
  ...
  // Delete the data/directory of the table
  val dir = new Path(tableMeta.location)
  try {
    val fs = dir.getFileSystem(hadoopConfig)
    fs.delete(dir, true)
  } catch {
    case e: IOException =>
      throw new SparkException(s"Unable to drop table $table as failed " +
        s"to delete its directory $dir", e)
  }
}
{code}

  was:
As far as I know, since Spark 2.2, tables are supposed to be MANAGED only 
taking into account whether a path is provided:

{code:java}
val tableType = if (storage.locationUri.isDefined) {
  CatalogTableType.EXTERNAL
} else {
  CatalogTableType.MANAGED
}
{code}

This solution seems to be right for filesystem based data sources. On the other 
hand, when working with other data sources such as elasticsearch, that solution 
is leading to a weird behaviour described below. 

1) InMemoryCatalog's doCreateTable() adds a locationURI if 
CatalogTableType.MANAGED && tableDefinition.storage.locationUri.isEmpty.

2) Before loading the data source table FindDataSourceTable's 
readDataSourceTable() adds a path option if locationURI exists:

{code:java}
val pathOption = table.storage.locationUri.map("path" -> 
CatalogUtils.URIToString(_))
{code}

3) That causes an error when reading from elasticsearch because 'path' is an 
option already supported by elasticsearch (locationUri is set to 
file:/home/user/spark-rv/elasticsearch/shop/clients)

org.elasticsearch.hadoop.EsHadoopIllegalArgumentException: Cannot find mapping 
for file:/home/user/spark-rv/elasticsearch/shop/clients - one is required 
before using Spark SQL


Would it be possible to mark tables as MANAGED only for a subset of data sources 
(TEXT, CSV, JSON, JDBC, PARQUET, ORC, HIVE), or should another solution be 
considered?

P.S. InMemoryCatalog's doDropTable() deletes the directory of the table, which, 
from my point of view, should only be required for filesystem-based data 
sources: 
{code:java}
if (tableMeta.tableType == CatalogTableType.MANAGED) {
  ...
  // Delete the data/directory of the table
  val dir = new Path(tableMeta.location)
  try {
    val fs = dir.getFileSystem(hadoopConfig)
    fs.delete(dir, true)
  } catch {
    case e: IOException =>
      throw new SparkException(s"Unable to drop table $table as failed " +
        s"to delete its directory $dir", e)
  }
}
{code}


> Tables are supposed to be MANAGED only taking into account whether a path is 
> provided
> -
>
> Key: SPARK-22457
> URL: https://issues.apache.org/jira/browse/SPARK-22457
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: David Arroyo Cazorla
>
> As far as I know, since Spark 2.2, tables are supposed to be MANAGED only 
> taking into account whether a path is provided:
> {code:java}
> val tableType = if (storage.locationUri.isDefined) {
>   CatalogTableType.EXTERNAL
> } else {
> 

[jira] [Created] (SPARK-22457) Tables are supposed to be MANAGED only taking into account whether a path is provided

2017-11-06 Thread David Arroyo Cazorla (JIRA)
David Arroyo Cazorla created SPARK-22457:


 Summary: Tables are supposed to be MANAGED only taking into 
account whether a path is provided
 Key: SPARK-22457
 URL: https://issues.apache.org/jira/browse/SPARK-22457
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.2.0
Reporter: David Arroyo Cazorla


As far as I know, since Spark 2.2, tables are supposed to be MANAGED only 
taking into account whether a path is provided:

{code:scala}
val tableType = if (storage.locationUri.isDefined) {
  CatalogTableType.EXTERNAL
} else {
  CatalogTableType.MANAGED
}
{code}

This solution seems to be right for filesystem based data sources. On the other 
hand, when working with other data sources such as elasticsearch, that solution 
is leading to a weird behaviour described below. 

1) InMemoryCatalog's doCreateTable() adds a locationURI if 
CatalogTableType.MANAGED && tableDefinition.storage.locationUri.isEmpty.

2) Before loading the data source table FindDataSourceTable's 
readDataSourceTable() adds a path option if locationURI exists:

{code:scala}
val pathOption = table.storage.locationUri.map("path" -> 
CatalogUtils.URIToString(_))
{code}

3) That causes an error when reading from elasticsearch because 'path' is an 
option already supported by elasticsearch (locationUri is set to 
file:/home/user/spark-rv/elasticsearch/shop/clients)

org.elasticsearch.hadoop.EsHadoopIllegalArgumentException: Cannot find mapping 
for file:/home/user/spark-rv/elasticsearch/shop/clients - one is required 
before using Spark SQL


Would it be possible to mark tables as MANAGED only for a subset of data sources 
(TEXT, CSV, JSON, JDBC, PARQUET, ORC, HIVE), or should another solution be 
considered?

P.S. InMemoryCatalog's doDropTable() deletes the directory of the table, which, 
from my point of view, should only be required for filesystem-based data 
sources: 
{code:scala}
if (tableMeta.tableType == CatalogTableType.MANAGED) {
  ...
  // Delete the data/directory of the table
  val dir = new Path(tableMeta.location)
  try {
    val fs = dir.getFileSystem(hadoopConfig)
    fs.delete(dir, true)
  } catch {
    case e: IOException =>
      throw new SparkException(s"Unable to drop table $table as failed " +
        s"to delete its directory $dir", e)
  }
}
{code}
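
For illustration, a minimal sketch of the guard suggested in the P.S.; the 
helper and the set of providers treated as filesystem-based are hypothetical, 
not existing Spark code:
{code:scala}
// Hypothetical sketch only: decide whether dropping a MANAGED table should
// also delete its directory, based on the data source provider. The set of
// providers treated as filesystem-based is illustrative.
object DropTableGuard {
  private val fileBasedProviders =
    Set("text", "csv", "json", "parquet", "orc", "hive")

  def shouldDeleteDirectory(isManaged: Boolean, provider: Option[String]): Boolean =
    isManaged && provider.map(_.toLowerCase).exists(p => fileBasedProviders.contains(p))
}

// e.g. shouldDeleteDirectory(isManaged = true, Some("org.elasticsearch.spark.sql"))
// returns false, so a MANAGED elasticsearch table would not have its
// locationUri-based directory removed on drop.
{code}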



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10399) Off Heap Memory Access for non-JVM libraries (C++)

2017-11-06 Thread Jim Pivarski (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16240304#comment-16240304
 ] 

Jim Pivarski commented on SPARK-10399:
--

WontFix because PR 19222 has no conflicts and will be merged, or because 
off-heap memory will instead be exposed as Arrow buffers, or because you don't 
intend to support this feature? I'm just asking for clarification, not asking 
you to change your minds.


> Off Heap Memory Access for non-JVM libraries (C++)
> --
>
> Key: SPARK-10399
> URL: https://issues.apache.org/jira/browse/SPARK-10399
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Paul Weiss
>
> *Summary*
> Provide direct off-heap memory access to an external non-JVM program, such as 
> a C++ library, within the running Spark JVM/executor.  As Spark moves to 
> storing all data in off-heap memory, it makes sense to provide access points 
> to that memory for non-JVM programs.
> 
> *Assumptions*
> * Zero copies will be made during the call into non-JVM library
> * Access into non-JVM libraries will be accomplished via JNI
> * A generic JNI interface will be created so that developers will not need to 
> deal with the raw JNI call
> * C++ will be the initial target non-JVM use case
> * memory management will remain on the JVM/Spark side
> * the API from C++ will be similar to dataframes as much as feasible and NOT 
> require expert knowledge of JNI
> * Data organization and layout will support complex (multi-type, nested, 
> etc.) types
> 
> *Design*
> * Initially Spark JVM -> non-JVM will be supported 
> * Creating an embedded JVM with Spark running from a non-JVM program is 
> initially out of scope
> 
> *Technical*
> * GetDirectBufferAddress is the JNI call used to access byte buffer without 
> copy



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21640) Method mode with String parameters within DataFrameWriter is error prone

2017-11-06 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16240298#comment-16240298
 ] 

Apache Spark commented on SPARK-21640:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/19673

> Method mode with String parameters within DataFrameWriter is error prone
> 
>
> Key: SPARK-21640
> URL: https://issues.apache.org/jira/browse/SPARK-21640
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Alberto
>Assignee: Alberto
>Priority: Trivial
> Fix For: 2.3.0
>
>
> The following method:
> {code:java}
> def mode(saveMode: String): DataFrameWriter[T]
> {code}
> sets the SaveMode of the DataFrameWriter depending on the string that is 
> passed in as a parameter.
> There is a java Enum with all the save modes which are Append, Overwrite, 
> ErrorIfExists and Ignore. In my current project I was writing some code that 
> was using this enum to get the string value that I use to call the mode 
> method:
> {code:java}
>   private[utils] val configModeAppend = SaveMode.Append.toString.toLowerCase
>   private[utils] val configModeErrorIfExists = 
> SaveMode.ErrorIfExists.toString.toLowerCase
>   private[utils] val configModeIgnore = SaveMode.Ignore.toString.toLowerCase
>   private[utils] val configModeOverwrite = 
> SaveMode.Overwrite.toString.toLowerCase
> {code}
> The configModeErrorIfExists val contains the value "errorifexists", and when I 
> call the saveMode method using this string it does not match. I suggest 
> including "errorifexists" as a valid match for the ErrorIfExists SaveMode.
> Will create a PR to address this issue ASAP.
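
For illustration, a sketch of the kind of string-to-SaveMode mapping the fix 
needs, accepting "errorifexists" alongside "error"; this is a sketch, not the 
actual DataFrameWriter code:
{code:scala}
import java.util.Locale
import org.apache.spark.sql.SaveMode

// Sketch only: map the user-supplied string to a SaveMode, treating
// "errorifexists" (SaveMode.ErrorIfExists.toString.toLowerCase) like "error".
def parseSaveMode(saveMode: String): SaveMode =
  saveMode.toLowerCase(Locale.ROOT) match {
    case "overwrite"               => SaveMode.Overwrite
    case "append"                  => SaveMode.Append
    case "ignore"                  => SaveMode.Ignore
    case "error" | "errorifexists" => SaveMode.ErrorIfExists
    case other =>
      throw new IllegalArgumentException(s"Unknown save mode: $other")
  }
{code}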



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-3383) DecisionTree aggregate size could be smaller

2017-11-06 Thread 颜发才

[ 
https://issues.apache.org/jira/browse/SPARK-3383?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16240284#comment-16240284
 ] 

Yan Facai (颜发才) edited comment on SPARK-3383 at 11/6/17 1:28 PM:
-

[~WeichenXu123] Good work! I'd like to take a look if time allows. Anyway, I 
believe that unordered features can benefit a lot from your work.


was (Author: facai):
[~WeichenXu123] Good work! I'd like to take a look if time allows. Anyway, I 
believe that unordered features can benefit a lot from the PR.

> DecisionTree aggregate size could be smaller
> 
>
> Key: SPARK-3383
> URL: https://issues.apache.org/jira/browse/SPARK-3383
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.1.0
>Reporter: Joseph K. Bradley
>Priority: Minor
>
> Storage and communication optimization:
> DecisionTree aggregate statistics could store less data (described below).  
> The savings would be significant for datasets with many low-arity categorical 
> features (binary features, or unordered categorical features).  Savings would 
> be negligible for continuous features.
> DecisionTree stores a vector of sufficient statistics for each (node, feature, 
> bin).  We could store 1 fewer bin per (node, feature): For a given (node, 
> feature), if we store these vectors for all but the last bin, and also store 
> the total statistics for each node, then we could compute the statistics for 
> the last bin.  For binary and unordered categorical features, this would cut 
> in half the number of bins to store and communicate.
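
For illustration, a small sketch of how the omitted last bin could be 
reconstructed from the per-node total and the stored bins; plain arrays of 
class counts are assumed here, not Spark's internal aggregator types:
{code:scala}
// Sketch only: last-bin stats = node total minus the sum of all stored bins.
def lastBinStats(nodeTotal: Array[Double],
                 storedBins: Seq[Array[Double]]): Array[Double] = {
  val last = nodeTotal.clone()
  for (bin <- storedBins; i <- bin.indices) {
    last(i) -= bin(i)
  }
  last
}

// e.g. for a binary feature with node total Array(10.0, 6.0) and one stored
// bin Array(4.0, 1.0), the omitted bin is recovered as Array(6.0, 5.0).
{code}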



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3383) DecisionTree aggregate size could be smaller

2017-11-06 Thread 颜发才

[ 
https://issues.apache.org/jira/browse/SPARK-3383?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16240284#comment-16240284
 ] 

Yan Facai (颜发才) commented on SPARK-3383:


[~WeichenXu123] Good work! I'd like to take a look if time allows. Anyway, I 
believe that unordered features can benefit a lot from the PR.

> DecisionTree aggregate size could be smaller
> 
>
> Key: SPARK-3383
> URL: https://issues.apache.org/jira/browse/SPARK-3383
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.1.0
>Reporter: Joseph K. Bradley
>Priority: Minor
>
> Storage and communication optimization:
> DecisionTree aggregate statistics could store less data (described below).  
> The savings would be significant for datasets with many low-arity categorical 
> features (binary features, or unordered categorical features).  Savings would 
> be negligible for continuous features.
> DecisionTree stores a vector of sufficient statistics for each (node, feature, 
> bin).  We could store 1 fewer bin per (node, feature): For a given (node, 
> feature), if we store these vectors for all but the last bin, and also store 
> the total statistics for each node, then we could compute the statistics for 
> the last bin.  For binary and unordered categorical features, this would cut 
> in half the number of bins to store and communicate.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21195) MetricSystem should pick up dynamically registered metrics in sources

2017-11-06 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21195?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21195:


Assignee: (was: Apache Spark)

> MetricSystem should pick up dynamically registered metrics in sources
> -
>
> Key: SPARK-21195
> URL: https://issues.apache.org/jira/browse/SPARK-21195
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.1.1
>Reporter: Robert Kruszewski
>Priority: Minor
>
> Currently when MetricsSystem registers a source it only picks up currently 
> registered metrics. It's quite cumbersome and leads to a lot of boilerplate 
> to preregister all metrics especially with systems that use instrumentation. 
> This change proposes to teach MetricsSystem to watch metrics added to sources 
> and dynamically register them
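
For illustration, a rough sketch of the proposal using Dropwizard's 
MetricRegistryListener; mirrorInto is a hypothetical helper and this is not 
Spark's actual MetricsSystem code:
{code:scala}
import com.codahale.metrics.{Counter, Gauge, MetricRegistry, MetricRegistryListener}

// Sketch only: watch a source's registry and mirror newly added metrics into
// the system-wide registry under a prefix, instead of requiring sources to
// preregister everything up front.
def mirrorInto(system: MetricRegistry, source: MetricRegistry, prefix: String): Unit =
  source.addListener(new MetricRegistryListener.Base {
    override def onGaugeAdded(name: String, gauge: Gauge[_]): Unit =
      system.register(MetricRegistry.name(prefix, name), gauge)
    override def onCounterAdded(name: String, counter: Counter): Unit =
      system.register(MetricRegistry.name(prefix, name), counter)
  })
{code}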



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21195) MetricSystem should pick up dynamically registered metrics in sources

2017-11-06 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21195?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21195:


Assignee: Apache Spark

> MetricSystem should pick up dynamically registered metrics in sources
> -
>
> Key: SPARK-21195
> URL: https://issues.apache.org/jira/browse/SPARK-21195
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.1.1
>Reporter: Robert Kruszewski
>Assignee: Apache Spark
>Priority: Minor
>
> Currently when MetricsSystem registers a source it only picks up currently 
> registered metrics. It's quite cumbersome and leads to a lot of boilerplate 
> to preregister all metrics especially with systems that use instrumentation. 
> This change proposes to teach MetricsSystem to watch metrics added to sources 
> and dynamically register them



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-21195) MetricSystem should pick up dynamically registered metrics in sources

2017-11-06 Thread Robert Kruszewski (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21195?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Kruszewski reopened SPARK-21195:
---

> MetricSystem should pick up dynamically registered metrics in sources
> -
>
> Key: SPARK-21195
> URL: https://issues.apache.org/jira/browse/SPARK-21195
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.1.1
>Reporter: Robert Kruszewski
>Priority: Minor
>
> Currently when MetricsSystem registers a source it only picks up currently 
> registered metrics. It's quite cumbersome and leads to a lot of boilerplate 
> to preregister all metrics especially with systems that use instrumentation. 
> This change proposes to teach MetricsSystem to watch metrics added to sources 
> and dynamically register them



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21195) MetricSystem should pick up dynamically registered metrics in sources

2017-11-06 Thread Robert Kruszewski (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21195?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16240274#comment-16240274
 ] 

Robert Kruszewski commented on SPARK-21195:
---

Thanks, indeed I responded that I missed the last comment.

> MetricSystem should pick up dynamically registered metrics in sources
> -
>
> Key: SPARK-21195
> URL: https://issues.apache.org/jira/browse/SPARK-21195
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.1.1
>Reporter: Robert Kruszewski
>Priority: Minor
>
> Currently when MetricsSystem registers a source it only picks up currently 
> registered metrics. It's quite cumbersome and leads to a lot of boilerplate 
> to preregister all metrics especially with systems that use instrumentation. 
> This change proposes to teach MetricsSystem to watch metrics added to sources 
> and dynamically register them



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22456) Add new function dayofweek

2017-11-06 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22456?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22456:


Assignee: (was: Apache Spark)

> Add new function dayofweek
> --
>
> Key: SPARK-22456
> URL: https://issues.apache.org/jira/browse/SPARK-22456
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark, SQL
>Affects Versions: 2.2.0
>Reporter: Michael Styles
>
> Add new function *dayofweek* to return the day of the week of the given 
> argument as an integer value in the range 1-7, where 1 represents Sunday.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22456) Add new function dayofweek

2017-11-06 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22456?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16240272#comment-16240272
 ] 

Apache Spark commented on SPARK-22456:
--

User 'ptkool' has created a pull request for this issue:
https://github.com/apache/spark/pull/19672

> Add new function dayofweek
> --
>
> Key: SPARK-22456
> URL: https://issues.apache.org/jira/browse/SPARK-22456
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark, SQL
>Affects Versions: 2.2.0
>Reporter: Michael Styles
>
> Add new function *dayofweek* to return the day of the week of the given 
> argument as an integer value in the range 1-7, where 1 represents Sunday.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22456) Add new function dayofweek

2017-11-06 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22456?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22456:


Assignee: Apache Spark

> Add new function dayofweek
> --
>
> Key: SPARK-22456
> URL: https://issues.apache.org/jira/browse/SPARK-22456
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark, SQL
>Affects Versions: 2.2.0
>Reporter: Michael Styles
>Assignee: Apache Spark
>
> Add new function *dayofweek* to return the day of the week of the given 
> argument as an integer value in the range 1-7, where 1 represents Sunday.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21195) MetricSystem should pick up dynamically registered metrics in sources

2017-11-06 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21195?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16240270#comment-16240270
 ] 

Sean Owen commented on SPARK-21195:
---

The PR hadn't been updated in almost half a year, and Jerry had some concerns. 
I see you just commented on it. If there's any interest in continuing with it, 
you can reopen.

> MetricSystem should pick up dynamically registered metrics in sources
> -
>
> Key: SPARK-21195
> URL: https://issues.apache.org/jira/browse/SPARK-21195
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.1.1
>Reporter: Robert Kruszewski
>Priority: Minor
>
> Currently when MetricsSystem registers a source it only picks up currently 
> registered metrics. It's quite cumbersome and leads to a lot of boilerplate 
> to preregister all metrics especially with systems that use instrumentation. 
> This change proposes to teach MetricsSystem to watch metrics added to sources 
> and dynamically register them



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22456) Add new function dayofweek

2017-11-06 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22456?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16240269#comment-16240269
 ] 

Sean Owen commented on SPARK-22456:
---

Does it exist in other SQL dialects? That's the reason most functions like this 
exist in Spark. Beyond that, these sorts of things should be UDFs.

> Add new function dayofweek
> --
>
> Key: SPARK-22456
> URL: https://issues.apache.org/jira/browse/SPARK-22456
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark, SQL
>Affects Versions: 2.2.0
>Reporter: Michael Styles
>
> Add new function *dayofweek* to return the day of the week of the given 
> argument as an integer value in the range 1-7, where 1 represents Sunday.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-22456) Add new function dayofweek

2017-11-06 Thread Michael Styles (JIRA)
Michael Styles created SPARK-22456:
--

 Summary: Add new function dayofweek
 Key: SPARK-22456
 URL: https://issues.apache.org/jira/browse/SPARK-22456
 Project: Spark
  Issue Type: New Feature
  Components: PySpark, SQL
Affects Versions: 2.2.0
Reporter: Michael Styles


Add new function *dayofweek* to return the day of the week of the given 
argument as an integer value in the range 1-7, where 1 represents Sunday.
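
For example, assuming the proposed function were available in Spark SQL 
(2017-11-05 was a Sunday and 2017-11-06 a Monday):
{code:scala}
// Expected behaviour of the proposed function, shown via Spark SQL.
// Would return 1 for the Sunday and 2 for the Monday.
spark.sql("SELECT dayofweek('2017-11-05'), dayofweek('2017-11-06')").show()
{code}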



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21195) MetricSystem should pick up dynamically registered metrics in sources

2017-11-06 Thread Robert Kruszewski (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21195?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16240267#comment-16240267
 ] 

Robert Kruszewski commented on SPARK-21195:
---

What's the justification for `won't fix` here?

> MetricSystem should pick up dynamically registered metrics in sources
> -
>
> Key: SPARK-21195
> URL: https://issues.apache.org/jira/browse/SPARK-21195
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.1.1
>Reporter: Robert Kruszewski
>Priority: Minor
>
> Currently when MetricsSystem registers a source it only picks up currently 
> registered metrics. It's quite cumbersome and leads to a lot of boilerplate 
> to preregister all metrics especially with systems that use instrumentation. 
> This change proposes to teach MetricsSystem to watch metrics added to sources 
> and dynamically register them



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22455) Provide an option to store the exception records/files and reasons in log files when reading data from a file-based data source.

2017-11-06 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22455?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-22455:
--
   Flags:   (was: Patch)
  Labels:   (was: patch)
Priority: Minor  (was: Major)

> Provide an option to store the exception records/files and reasons in log 
> files when reading data from a file-based data source.
> 
>
> Key: SPARK-22455
> URL: https://issues.apache.org/jira/browse/SPARK-22455
> Project: Spark
>  Issue Type: Improvement
>  Components: Input/Output
>Affects Versions: 2.2.0
>Reporter: Sreenath Chothar
>Priority: Minor
>
> Provide an option to store the exception/bad records and reasons in log files 
> when reading data from a file-based data source into a PySpark dataframe. 
> Currently, only the following three options are available:
> 1. PERMISSIVE : sets other fields to null when it meets a corrupted record, 
> and puts the malformed string into a field configured by 
> columnNameOfCorruptRecord.
> 2. DROPMALFORMED : ignores the whole corrupted records.
> 3. FAILFAST : throws an exception when it meets corrupted records.
> We could use the first option to accumulate the corrupted records and output 
> them to a log file. But we can't use this option when the input schema is 
> inferred automatically. If the number of columns to read is too large, 
> providing the complete schema with an additional column for storing corrupted 
> data is difficult. Instead, the "pyspark.sql.DataFrameReader.csv" reader 
> functions could provide an option to redirect the bad records to a configured 
> log file path with exception details.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-21195) MetricSystem should pick up dynamically registered metrics in sources

2017-11-06 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21195?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-21195.
---
Resolution: Won't Fix

> MetricSystem should pick up dynamically registered metrics in sources
> -
>
> Key: SPARK-21195
> URL: https://issues.apache.org/jira/browse/SPARK-21195
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.1.1
>Reporter: Robert Kruszewski
>Priority: Minor
>
> Currently when MetricsSystem registers a source it only picks up currently 
> registered metrics. It's quite cumbersome and leads to a lot of boilerplate 
> to preregister all metrics especially with systems that use instrumentation. 
> This change proposes to teach MetricsSystem to watch metrics added to sources 
> and dynamically register them



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-20638) Optimize the CartesianRDD to reduce repeatedly data fetching

2017-11-06 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20638?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-20638.
---
Resolution: Won't Fix

> Optimize the CartesianRDD to reduce repeatedly data fetching
> 
>
> Key: SPARK-20638
> URL: https://issues.apache.org/jira/browse/SPARK-20638
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Teng Jiang
>
> In CartesianRDD, group each iterator into multiple groups. Thus, in the second 
> iteration, the data will be fetched (num of data)/groupSize times, rather 
> than (num of data) times.
> The test results are:
> Test Environment : 3 workers, each has 10 cores, 30G memory, 1 executor
> Test data : users : 480,189, each is a 10-dim vector, and items : 17770, each 
> is a 10-dim vector.
> With default CartesianRDD, cartesian time is 2420.7s.
> With this proposal, cartesian time is 45.3s, about 50x faster than the 
> original method.
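
For illustration, a minimal sketch of the grouping idea on plain iterators (not 
the actual CartesianRDD change): the right-hand partition is fetched once per 
chunk of the left-hand iterator instead of once per element, cutting re-fetches 
by a factor of groupSize.
{code:scala}
// Sketch only: chunked cartesian product. fetchRight() stands in for
// re-reading the second RDD's partition; it is invoked once per chunk of
// `left` rather than once per left element.
def chunkedCartesian[A, B](left: Iterator[A],
                           fetchRight: () => Iterator[B],
                           groupSize: Int): Iterator[(A, B)] =
  left.grouped(groupSize).flatMap { chunk =>
    fetchRight().flatMap(b => chunk.iterator.map(a => (a, b)))
  }
{code}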



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-22154) add a shutdown hook that explains why the output is terminating

2017-11-06 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22154?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-22154.
---
Resolution: Won't Fix

> add a shutdown hook that explains why the output is  terminating
> 
>
> Key: SPARK-22154
> URL: https://issues.apache.org/jira/browse/SPARK-22154
> Project: Spark
>  Issue Type: Improvement
>  Components: Deploy
>Affects Versions: 2.2.0
>Reporter: liuzhaokun
>Priority: Critical
>
> It would be nice to add a shutdown hook here that explains why the output is 
> terminating. Otherwise if the worker dies the executor logs will silently 
> stop.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-10399) Off Heap Memory Access for non-JVM libraries (C++)

2017-11-06 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10399?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-10399.
---
Resolution: Won't Fix

> Off Heap Memory Access for non-JVM libraries (C++)
> --
>
> Key: SPARK-10399
> URL: https://issues.apache.org/jira/browse/SPARK-10399
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Paul Weiss
>
> *Summary*
> Provide direct off-heap memory access to an external non-JVM program, such as 
> a C++ library, within the running Spark JVM/executor.  As Spark moves to 
> storing all data in off-heap memory, it makes sense to provide access points 
> to that memory for non-JVM programs.
> 
> *Assumptions*
> * Zero copies will be made during the call into non-JVM library
> * Access into non-JVM libraries will be accomplished via JNI
> * A generic JNI interface will be created so that developers will not need to 
> deal with the raw JNI call
> * C++ will be the initial target non-JVM use case
> * memory management will remain on the JVM/Spark side
> * the API from C++ will be similar to dataframes as much as feasible and NOT 
> require expert knowledge of JNI
> * Data organization and layout will support complex (multi-type, nested, 
> etc.) types
> 
> *Design*
> * Initially Spark JVM -> non-JVM will be supported 
> * Creating an embedded JVM with Spark running from a non-JVM program is 
> initially out of scope
> 
> *Technical*
> * GetDirectBufferAddress is the JNI call used to access byte buffer without 
> copy



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-13547) Add SQL query in web UI's SQL Tab

2017-11-06 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13547?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-13547.
---
Resolution: Won't Fix

> Add SQL query in web UI's SQL Tab
> -
>
> Key: SPARK-13547
> URL: https://issues.apache.org/jira/browse/SPARK-13547
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL, Web UI
>Affects Versions: 1.6.0
>Reporter: Jeff Zhang
>Priority: Minor
> Attachments: screenshot-1.png
>
>
> It would be nice to have the SQL query shown in the SQL tab.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-19458) loading hive jars from the local repo which has already downloaded

2017-11-06 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-19458.
---
Resolution: Won't Fix

> loading hive jars from the local repo which has already downloaded
> --
>
> Key: SPARK-19458
> URL: https://issues.apache.org/jira/browse/SPARK-19458
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Song Jun
>Priority: Minor
>
> Currently, when we create a new HiveClient for a specific metastore version and 
> `spark.sql.hive.metastore.jars` is set to `maven`, Spark will download the 
> hive jars from the remote repo (http://www.datanucleus.org/downloads/maven2).
> We should allow the user to load hive jars from a local repo into which they 
> have already been downloaded.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-20044) Support Spark UI behind front-end reverse proxy using a path prefix

2017-11-06 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20044?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-20044.
---
Resolution: Won't Fix

> Support Spark UI behind front-end reverse proxy using a path prefix
> ---
>
> Key: SPARK-20044
> URL: https://issues.apache.org/jira/browse/SPARK-20044
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 2.1.0
>Reporter: Oliver Koeth
>Priority: Minor
>  Labels: reverse-proxy, sso
>
> Purpose: allow running the Spark web UI behind a reverse proxy with URLs 
> prefixed by a context root, like www.mydomain.com/spark. In particular, this 
> allows accessing multiple Spark clusters through the same virtual host, only 
> distinguishing them by context root, like www.mydomain.com/cluster1, 
> www.mydomain.com/cluster2, and it allows running the Spark UI in a common 
> cookie domain (for SSO) with other services.
> [SPARK-15487] introduced some support for front-end reverse proxies by 
> allowing all Spark UI requests to be routed through the master UI as a single 
> endpoint and also added a spark.ui.reverseProxyUrl setting to define 
> another proxy sitting in front of Spark. However, as noted in the comments on 
> [SPARK-15487], this mechanism does not currently work if the reverseProxyUrl 
> includes a context root like the examples above: Most links generated by the 
> Spark UI result in full path URLs (like /proxy/app-"id"/...) that do not 
> account for a path prefix (context root) and work only if the Spark UI "owns" 
> the entire virtual host. In fact, the only place in the UI where the 
> reverseProxyUrl seems to be used is the back-link from the worker UI to the 
> master UI.
> The discussion on [SPARK-15487] proposes to open a new issue for the problem, 
> but that does not seem to have happened, so this issue aims to address the 
> remaining shortcomings of spark.ui.reverseProxyUrl
> The problem can be partially worked around by doing content rewrite in a 
> front-end proxy and prefixing src="/..." or href="/..." links with a context 
> root. However, detecting and patching URLs in HTML output is not a robust 
> approach and breaks down for URLs included in custom REST responses. E.g. the 
> "allexecutors" REST call used from the Spark 2.1.0 application/executors page 
> returns links for log viewing that direct to the worker UI and do not work in 
> this scenario.
> This issue proposes to honor spark.ui.reverseProxyUrl throughout Spark UI URL 
> generation. Experiments indicate that most of this can simply be achieved by 
> using/prepending spark.ui.reverseProxyUrl to the existing spark.ui.proxyBase 
> system property. Beyond that, the places that require adaption are
> - worker and application links in the master web UI
> - webui URLs returned by REST interfaces
> Note: It seems that returned redirect location headers do not need to be 
> adapted, since URL rewriting for these is commonly done in front-end proxies 
> and has a well-defined interface



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-19527) Approximate Size of Intersection of Bloom Filters

2017-11-06 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19527?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-19527.
---
Resolution: Won't Fix

> Approximate Size of Intersection of Bloom Filters
> -
>
> Key: SPARK-19527
> URL: https://issues.apache.org/jira/browse/SPARK-19527
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: Brandon Poole
>
> There is no way to calculate the approximate number of items in a Bloom filter 
> externally, as numOfHashFunctions and bits are private.
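
For reference, a sketch of the standard estimator this would enable, assuming 
the filter exposed its bit size, hash count, and number of set bits (these 
accessors are hypothetical today):
{code:scala}
// Standard Bloom filter cardinality estimate: n ≈ -(m / k) * ln(1 - X / m),
// with m = total bits, k = hash functions, X = bits set to 1. Given two
// filters built with the same m and k, |A ∪ B| can be estimated from their
// bitwise OR, and |A ∩ B| ≈ n(A) + n(B) - n(A ∪ B).
def approxCardinality(numBits: Long, numHashFunctions: Int, setBits: Long): Double =
  -(numBits.toDouble / numHashFunctions) *
    math.log(1.0 - setBits.toDouble / numBits)
{code}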



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-22455) Provide an option to store the exception records/files and reasons in log files when reading data from a file-based data source.

2017-11-06 Thread Sreenath Chothar (JIRA)
Sreenath Chothar created SPARK-22455:


 Summary: Provide an option to store the exception records/files 
and reasons in log files when reading data from a file-based data source.
 Key: SPARK-22455
 URL: https://issues.apache.org/jira/browse/SPARK-22455
 Project: Spark
  Issue Type: Improvement
  Components: Input/Output
Affects Versions: 2.2.0
Reporter: Sreenath Chothar


Provide an option to store the exception/bad records and reasons in log files 
when reading data from a file-based data source into a PySpark dataframe. 
Currently, only the following three options are available:
1. PERMISSIVE : sets other fields to null when it meets a corrupted record, and 
puts the malformed string into a field configured by columnNameOfCorruptRecord.
2. DROPMALFORMED : ignores the whole corrupted records.
3. FAILFAST : throws an exception when it meets corrupted records.

We could use the first option to accumulate the corrupted records and output 
them to a log file. But we can't use this option when the input schema is 
inferred automatically. If the number of columns to read is too large, providing 
the complete schema with an additional column for storing corrupted data is 
difficult. Instead, the "pyspark.sql.DataFrameReader.csv" reader functions could 
provide an option to redirect the bad records to a configured log file path with 
exception details.
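
As a point of comparison, a sketch of the manual workaround available today 
with an explicit schema (shown in Scala; the PySpark reader takes the same 
options). The schema and paths are placeholders, and this is exactly the step 
the proposed option would make unnecessary:
{code:scala}
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types._

// PERMISSIVE mode plus an explicit schema with a corrupt-record column,
// then persist the bad rows to a separate location.
val schema = new StructType()
  .add("id", IntegerType)
  .add("amount", DoubleType)
  .add("_corrupt_record", StringType)

val df = spark.read
  .option("mode", "PERMISSIVE")
  .option("columnNameOfCorruptRecord", "_corrupt_record")
  .schema(schema)
  .csv("/data/input/*.csv")

df.filter(col("_corrupt_record").isNotNull)
  .write.json("/logs/bad_records")
{code}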



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22297) Flaky test: BlockManagerSuite "Shuffle registration timeout and maxAttempts conf"

2017-11-06 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22297?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16240228#comment-16240228
 ] 

Apache Spark commented on SPARK-22297:
--

User 'mpetruska' has created a pull request for this issue:
https://github.com/apache/spark/pull/19671

> Flaky test: BlockManagerSuite "Shuffle registration timeout and maxAttempts 
> conf"
> -
>
> Key: SPARK-22297
> URL: https://issues.apache.org/jira/browse/SPARK-22297
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, Tests
>Affects Versions: 2.3.0
>Reporter: Marcelo Vanzin
>Priority: Minor
>
> Ran into this locally; the test code seems to use timeouts which generally 
> end up in flakiness like this.
> {noformat}
> [info] - SPARK-20640: Shuffle registration timeout and maxAttempts conf are 
> working *** FAILED *** (1 second, 203 milliseconds)
> [info]   "Unable to register with external shuffle server due to : 
> java.util.concurrent.TimeoutException: Timeout waiting for task." did not 
> contain "test_spark_20640_try_again" (BlockManagerSuite.scala:1370)
> [info]   org.scalatest.exceptions.TestFailedException:
> [info]   at 
> org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:528)
> [info]   at 
> org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1560)
> [info]   at 
> org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:501)
> [info]   at 
> org.apache.spark.storage.BlockManagerSuite$$anonfun$14.apply$mcV$sp(BlockManagerSuite.scala:1370)
> [info]   at 
> org.apache.spark.storage.BlockManagerSuite$$anonfun$14.apply(BlockManagerSuite.scala:1323)
> [info]   at 
> org.apache.spark.storage.BlockManagerSuite$$anonfun$14.apply(BlockManagerSuite.scala:1323)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22297) Flaky test: BlockManagerSuite "Shuffle registration timeout and maxAttempts conf"

2017-11-06 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22297?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22297:


Assignee: (was: Apache Spark)

> Flaky test: BlockManagerSuite "Shuffle registration timeout and maxAttempts 
> conf"
> -
>
> Key: SPARK-22297
> URL: https://issues.apache.org/jira/browse/SPARK-22297
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, Tests
>Affects Versions: 2.3.0
>Reporter: Marcelo Vanzin
>Priority: Minor
>
> Ran into this locally; the test code seems to use timeouts which generally 
> end up in flakiness like this.
> {noformat}
> [info] - SPARK-20640: Shuffle registration timeout and maxAttempts conf are 
> working *** FAILED *** (1 second, 203 milliseconds)
> [info]   "Unable to register with external shuffle server due to : 
> java.util.concurrent.TimeoutException: Timeout waiting for task." did not 
> contain "test_spark_20640_try_again" (BlockManagerSuite.scala:1370)
> [info]   org.scalatest.exceptions.TestFailedException:
> [info]   at 
> org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:528)
> [info]   at 
> org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1560)
> [info]   at 
> org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:501)
> [info]   at 
> org.apache.spark.storage.BlockManagerSuite$$anonfun$14.apply$mcV$sp(BlockManagerSuite.scala:1370)
> [info]   at 
> org.apache.spark.storage.BlockManagerSuite$$anonfun$14.apply(BlockManagerSuite.scala:1323)
> [info]   at 
> org.apache.spark.storage.BlockManagerSuite$$anonfun$14.apply(BlockManagerSuite.scala:1323)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22297) Flaky test: BlockManagerSuite "Shuffle registration timeout and maxAttempts conf"

2017-11-06 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22297?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22297:


Assignee: Apache Spark

> Flaky test: BlockManagerSuite "Shuffle registration timeout and maxAttempts 
> conf"
> -
>
> Key: SPARK-22297
> URL: https://issues.apache.org/jira/browse/SPARK-22297
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, Tests
>Affects Versions: 2.3.0
>Reporter: Marcelo Vanzin
>Assignee: Apache Spark
>Priority: Minor
>
> Ran into this locally; the test code seems to use timeouts which generally 
> end up in flakiness like this.
> {noformat}
> [info] - SPARK-20640: Shuffle registration timeout and maxAttempts conf are 
> working *** FAILED *** (1 second, 203 milliseconds)
> [info]   "Unable to register with external shuffle server due to : 
> java.util.concurrent.TimeoutException: Timeout waiting for task." did not 
> contain "test_spark_20640_try_again" (BlockManagerSuite.scala:1370)
> [info]   org.scalatest.exceptions.TestFailedException:
> [info]   at 
> org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:528)
> [info]   at 
> org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1560)
> [info]   at 
> org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:501)
> [info]   at 
> org.apache.spark.storage.BlockManagerSuite$$anonfun$14.apply$mcV$sp(BlockManagerSuite.scala:1370)
> [info]   at 
> org.apache.spark.storage.BlockManagerSuite$$anonfun$14.apply(BlockManagerSuite.scala:1323)
> [info]   at 
> org.apache.spark.storage.BlockManagerSuite$$anonfun$14.apply(BlockManagerSuite.scala:1323)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22454) ExternalShuffleClient.close() should check null

2017-11-06 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16240192#comment-16240192
 ] 

Apache Spark commented on SPARK-22454:
--

User 'wangyum' has created a pull request for this issue:
https://github.com/apache/spark/pull/19670

> ExternalShuffleClient.close() should check null
> ---
>
> Key: SPARK-22454
> URL: https://issues.apache.org/jira/browse/SPARK-22454
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle
>Affects Versions: 2.1.1
>Reporter: Yuming Wang
>Priority: Minor
>
> {noformat}
> 17/11/06 20:08:05 ERROR Utils: Uncaught exception in thread main
> java.lang.NullPointerException
>   at 
> org.apache.spark.network.shuffle.ExternalShuffleClient.close(ExternalShuffleClient.java:152)
>   at org.apache.spark.storage.BlockManager.stop(BlockManager.scala:1407)
>   at org.apache.spark.SparkEnv.stop(SparkEnv.scala:89)
>   at 
> org.apache.spark.SparkContext$$anonfun$stop$11.apply$mcV$sp(SparkContext.scala:1849)
>   at org.apache.spark.util.Utils$.tryLogNonFatalError(Utils.scala:1283)
>   at org.apache.spark.SparkContext.stop(SparkContext.scala:1848)
>   at org.apache.spark.SparkContext.(SparkContext.scala:587)
>   at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2320)
>   at 
> org.apache.spark.sql.SparkSession$Builder$$anonfun$6.apply(SparkSession.scala:868)
>   at 
> org.apache.spark.sql.SparkSession$Builder$$anonfun$6.apply(SparkSession.scala:860)
>   at scala.Option.getOrElse(Option.scala:121)
>   at 
> org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:860)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLEnv$.init(SparkSQLEnv.scala:47)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.(SparkSQLCLIDriver.scala:288)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:137)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:743)
>   at 
> org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:187)
>   at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:212)
>   at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:126)
>   at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


