[jira] [Updated] (SPARK-23811) FetchFailed arriving before Success of the same task causes the child stage to never succeed

2018-03-29 Thread Li Yuanjian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23811?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Li Yuanjian updated SPARK-23811:

Description: 
This is a bug caused by the abnormal scenario described below:
 # ShuffleMapTask 1.0 is running; this task will fetch data from ExecutorA.
 # ExecutorA is lost, which triggers `mapOutputTracker.removeOutputsOnExecutor(execId)`; 
the shuffleStatus changes.
 # Speculative ShuffleMapTask 1.1 starts and gets a FetchFailed immediately.
 # ShuffleMapTask 1.0 finally succeeds, but because of 1.1's FetchFailed, the 
stage is still marked as failed.
 # ShuffleMapTask 1 is the last task of its stage, so the stage can never 
succeed because the DAGScheduler finds no missing tasks to resubmit.

  was:
This is a bug caused by the abnormal scenario described below:
 # ShuffleMapTask 1.0 is running; this task will fetch data from ExecutorA.
 # ExecutorA is lost, which triggers `mapOutputTracker.removeOutputsOnExecutor(execId)`; 
the shuffleStatus changes.
 # Speculative ShuffleMapTask 1.1 starts and gets a FetchFailed immediately.
 # ShuffleMapTask 1 is the last task of its stage, so the stage can never 
succeed because the DAGScheduler finds no missing tasks to resubmit.


> FetchFailed arriving before Success of the same task causes the child stage 
> to never succeed
> --
>
> Key: SPARK-23811
> URL: https://issues.apache.org/jira/browse/SPARK-23811
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0, 2.3.0
>Reporter: Li Yuanjian
>Priority: Major
> Attachments: 1.png, 2.png
>
>
> This is a bug caused by the abnormal scenario described below:
>  # ShuffleMapTask 1.0 is running; this task will fetch data from ExecutorA.
>  # ExecutorA is lost, which triggers `mapOutputTracker.removeOutputsOnExecutor(execId)`; 
> the shuffleStatus changes.
>  # Speculative ShuffleMapTask 1.1 starts and gets a FetchFailed immediately.
>  # ShuffleMapTask 1.0 finally succeeds, but because of 1.1's FetchFailed, the 
> stage is still marked as failed.
>  # ShuffleMapTask 1 is the last task of its stage, so the stage can never 
> succeed because the DAGScheduler finds no missing tasks to resubmit.






[jira] [Commented] (SPARK-22968) java.lang.IllegalStateException: No current assignment for partition kssh-2

2018-03-29 Thread Jepson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16420215#comment-16420215
 ] 

Jepson commented on SPARK-22968:


Adjusted these parameters; keeping it under monitoring.

"request.timeout.ms" -> (21: java.lang.Integer)
"session.timeout.ms" -> (18: java.lang.Integer)
"heartbeat.interval.ms" -> (6000: java.lang.Integer)
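For reference, a minimal sketch of passing such consumer parameters to the kafka-0-10 
direct stream; the broker address, topic, master, and timeout values below are 
illustrative placeholders, not the actual settings used here:
{code}
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

val kafkaParams = Map[String, Object](
  "bootstrap.servers" -> "broker1:9092",                   // placeholder
  "key.deserializer" -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id" -> "use_a_separate_group_id_for_each_stream",
  // placeholder timeouts; heartbeat.interval.ms must stay well below session.timeout.ms
  "request.timeout.ms" -> (90000: java.lang.Integer),
  "session.timeout.ms" -> (60000: java.lang.Integer),
  "heartbeat.interval.ms" -> (6000: java.lang.Integer)
)

val ssc = new StreamingContext(
  new SparkConf().setAppName("kssh-demo").setMaster("local[2]"), // placeholder master
  Seconds(5))
val stream = KafkaUtils.createDirectStream[String, String](
  ssc, PreferConsistent, Subscribe[String, String](Seq("kssh"), kafkaParams))
// ... register output operations on `stream`, then ssc.start() / ssc.awaitTermination()
{code}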

> java.lang.IllegalStateException: No current assignment for partition kssh-2
> ---
>
> Key: SPARK-22968
> URL: https://issues.apache.org/jira/browse/SPARK-22968
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.1
> Environment: Kafka:  0.10.0  (CDH5.12.0)
> Apache Spark 2.1.1
> Spark streaming+Kafka
>Reporter: Jepson
>Priority: Major
>   Original Estimate: 96h
>  Remaining Estimate: 96h
>
> *Kafka Broker:*
> {code:java}
>message.max.bytes : 2621440  
> {code}
> *Spark Streaming+Kafka Code:*
> {code:java}
> , "max.partition.fetch.bytes" -> (5242880: java.lang.Integer) //default: 
> 1048576
> , "request.timeout.ms" -> (9: java.lang.Integer) //default: 6
> , "session.timeout.ms" -> (6: java.lang.Integer) //default: 3
> , "heartbeat.interval.ms" -> (5000: java.lang.Integer)
> , "receive.buffer.bytes" -> (10485760: java.lang.Integer)
> {code}
> *Error message:*
> {code:java}
> 8/01/05 09:48:27 INFO internals.ConsumerCoordinator: Revoking previously 
> assigned partitions [kssh-7, kssh-4, kssh-3, kssh-6, kssh-5, kssh-0, kssh-2, 
> kssh-1] for group use_a_separate_group_id_for_each_stream
> 18/01/05 09:48:27 INFO internals.AbstractCoordinator: (Re-)joining group 
> use_a_separate_group_id_for_each_stream
> 18/01/05 09:48:27 INFO internals.AbstractCoordinator: Successfully joined 
> group use_a_separate_group_id_for_each_stream with generation 4
> 18/01/05 09:48:27 INFO internals.ConsumerCoordinator: Setting newly assigned 
> partitions [kssh-7, kssh-4, kssh-6, kssh-5] for group 
> use_a_separate_group_id_for_each_stream
> 18/01/05 09:48:27 ERROR scheduler.JobScheduler: Error generating jobs for 
> time 1515116907000 ms
> java.lang.IllegalStateException: No current assignment for partition kssh-2
>   at 
> org.apache.kafka.clients.consumer.internals.SubscriptionState.assignedState(SubscriptionState.java:231)
>   at 
> org.apache.kafka.clients.consumer.internals.SubscriptionState.needOffsetReset(SubscriptionState.java:295)
>   at 
> org.apache.kafka.clients.consumer.KafkaConsumer.seekToEnd(KafkaConsumer.java:1169)
>   at 
> org.apache.spark.streaming.kafka010.DirectKafkaInputDStream.latestOffsets(DirectKafkaInputDStream.scala:197)
>   at 
> org.apache.spark.streaming.kafka010.DirectKafkaInputDStream.compute(DirectKafkaInputDStream.scala:214)
>   at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:341)
>   at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:341)
>   at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58)
>   at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:340)
>   at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:340)
>   at 
> org.apache.spark.streaming.dstream.DStream.createRDDWithLocalProperties(DStream.scala:415)
>   at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:335)
>   at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:333)
>   at scala.Option.orElse(Option.scala:289)
>   at 
> org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:330)
>   at 
> org.apache.spark.streaming.dstream.ForEachDStream.generateJob(ForEachDStream.scala:48)
>   at 
> org.apache.spark.streaming.DStreamGraph$$anonfun$1.apply(DStreamGraph.scala:117)
>   at 
> org.apache.spark.streaming.DStreamGraph$$anonfun$1.apply(DStreamGraph.scala:116)
>   at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
>   at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
>   at 
> scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
>   at scala.collection.AbstractTraversable.flatMap(Traversable.scala:104)
>   at 
> org.apache.spark.streaming.DStreamGraph.generateJobs(DStreamGraph.scala:116)
>   at 
> org.apache.spark.streaming.scheduler.JobGenerator$$anonfun$3.apply(JobGenerato

[jira] [Assigned] (SPARK-23743) IsolatedClientLoader.isSharedClass returns an unintended result against the `slf4j` keyword

2018-03-29 Thread Saisai Shao (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23743?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Saisai Shao reassigned SPARK-23743:
---

Assignee: Jongyoul Lee

> IsolatedClientLoader.isSharedClass returns an unintended result against 
> the `slf4j` keyword
> ---
>
> Key: SPARK-23743
> URL: https://issues.apache.org/jira/browse/SPARK-23743
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Jongyoul Lee
>Assignee: Jongyoul Lee
>Priority: Minor
> Fix For: 2.4.0
>
>
> {{isSharedClass}} returns whether certain classes can/should be shared. It 
> checks whether the class names contain certain keywords or start with certain 
> prefixes. With this logic, unintended behavior can occur when a custom package 
> has {{slf4j}} inside its package or class name. The original intention was 
> presumably to match classes under {{org.slf4j}}, so it would be better to 
> change the comparison logic to {{name.startsWith("org.slf4j")}}.
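
A minimal sketch of the mismatch, using hypothetical helper functions rather than the 
actual IsolatedClientLoader code:
{code}
// keyword-style check: also matches user classes that merely contain "slf4j"
def isSharedByKeyword(name: String): Boolean = name.contains("slf4j")

// proposed prefix check: only matches classes under the org.slf4j package
def isSharedByPrefix(name: String): Boolean = name.startsWith("org.slf4j")

isSharedByKeyword("com.example.myslf4jwrapper.Helper")  // true  (unintended)
isSharedByPrefix("com.example.myslf4jwrapper.Helper")   // false
isSharedByPrefix("org.slf4j.LoggerFactory")             // true
{code}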






[jira] [Resolved] (SPARK-23743) IsolatedClientLoader.isSharedClass returns an unintended result against the `slf4j` keyword

2018-03-29 Thread Saisai Shao (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23743?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Saisai Shao resolved SPARK-23743.
-
   Resolution: Fixed
Fix Version/s: 2.4.0

Issue resolved by pull request 20860
[https://github.com/apache/spark/pull/20860]

> IsolatedClientLoader.isSharedClass returns an unintended result against 
> the `slf4j` keyword
> ---
>
> Key: SPARK-23743
> URL: https://issues.apache.org/jira/browse/SPARK-23743
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Jongyoul Lee
>Priority: Minor
> Fix For: 2.4.0
>
>
> {{isSharedClass}} returns whether certain classes can/should be shared. It 
> checks whether the class names contain certain keywords or start with certain 
> prefixes. With this logic, unintended behavior can occur when a custom package 
> has {{slf4j}} inside its package or class name. The original intention was 
> presumably to match classes under {{org.slf4j}}, so it would be better to 
> change the comparison logic to {{name.startsWith("org.slf4j")}}.






[jira] [Created] (SPARK-23829) spark-sql-kafka source in spark 2.3 causes reading stream failure frequently

2018-03-29 Thread Norman Bai (JIRA)
Norman Bai created SPARK-23829:
--

 Summary: spark-sql-kafka source in spark 2.3 causes reading stream 
failure frequently
 Key: SPARK-23829
 URL: https://issues.apache.org/jira/browse/SPARK-23829
 Project: Spark
  Issue Type: Bug
  Components: Structured Streaming
Affects Versions: 2.3.0
Reporter: Norman Bai


Spark 2.3 provides the source "spark-sql-kafka-0-10_2.11".

When I wanted to read from my kafka-0.10.2.1 cluster, it frequently threw the 
error "*java.util.concurrent.TimeoutException: Cannot fetch record for offset in 
12000 milliseconds*", and the job thus failed.

I searched on Google & Stack Overflow for a while and found many other people 
who hit this exception too, but nobody gave an answer.

I debugged the source code and found nothing conclusive, but I suspect it is 
caused by the kafka-clients dependency that spark-sql-kafka-0-10_2.11 is using.

 
{code:xml}
<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-sql-kafka-0-10_2.11</artifactId>
  <version>2.3.0</version>
  <exclusions>
    <exclusion>
      <artifactId>kafka-clients</artifactId>
      <groupId>org.apache.kafka</groupId>
    </exclusion>
  </exclusions>
</dependency>

<dependency>
  <groupId>org.apache.kafka</groupId>
  <artifactId>kafka-clients</artifactId>
  <version>0.10.2.1</version>
</dependency>
{code}
I excluded it in Maven, added another version (see the POM above), reran the 
code, and now it works.

I guess something is wrong with kafka-clients 0.10.0.1 working against Kafka 
0.10.2.1, or possibly other Kafka versions.

Hope for an explanation.

Here is the error stack.
{code:java}
[ERROR] 2018-03-30 13:34:11,404 [stream execution thread for [id = 
83076cf1-4bf0-4c82-a0b3-23d8432f5964, runId = 
b3e18aa6-358f-43f6-a077-e34db0822df6]] 
org.apache.spark.sql.execution.streaming.MicroBatchExecution logError - Query 
[id = 83076cf1-4bf0-4c82-a0b3-23d8432f5964, runId = 
b3e18aa6-358f-43f6-a077-e34db0822df6] terminated with error
org.apache.spark.SparkException: Job aborted due to stage failure: Task 6 in 
stage 0.0 failed 1 times, most recent failure: Lost task 6.0 in stage 0.0 (TID 
6, localhost, executor driver): java.util.concurrent.TimeoutException: Cannot 
fetch record for offset 6481521 in 12 milliseconds
at 
org.apache.spark.sql.kafka010.CachedKafkaConsumer.org$apache$spark$sql$kafka010$CachedKafkaConsumer$$fetchData(CachedKafkaConsumer.scala:230)
at 
org.apache.spark.sql.kafka010.CachedKafkaConsumer$$anonfun$get$1.apply(CachedKafkaConsumer.scala:122)
at 
org.apache.spark.sql.kafka010.CachedKafkaConsumer$$anonfun$get$1.apply(CachedKafkaConsumer.scala:106)
at 
org.apache.spark.util.UninterruptibleThread.runUninterruptibly(UninterruptibleThread.scala:77)
at 
org.apache.spark.sql.kafka010.CachedKafkaConsumer.runUninterruptiblyIfPossible(CachedKafkaConsumer.scala:68)
at 
org.apache.spark.sql.kafka010.CachedKafkaConsumer.get(CachedKafkaConsumer.scala:106)
at 
org.apache.spark.sql.kafka010.KafkaSourceRDD$$anon$1.getNext(KafkaSourceRDD.scala:157)
at 
org.apache.spark.sql.kafka010.KafkaSourceRDD$$anon$1.getNext(KafkaSourceRDD.scala:148)
at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
 Source)
at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$10$$anon$1.hasNext(WholeStageCodegenExec.scala:614)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage2.processNext(Unknown
 Source)
at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$10$$anon$1.hasNext(WholeStageCodegenExec.scala:614)
at 
org.apache.spark.sql.execution.aggregate.ObjectHashAggregateExec$$anonfun$doExecute$1$$anonfun$2.apply(ObjectHashAggregateExec.scala:107)
at 
org.apache.spark.sql.execution.aggregate.ObjectHashAggregateExec$$anonfun$doExecute$1$$anonfun$2.apply(ObjectHashAggregateExec.scala:105)
at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndexInternal$1$$anonfun$apply$24.apply(RDD.scala:818)
at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndexInternal$1$$anonfun$apply$24.apply(RDD.scala:818)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
at o

[jira] [Resolved] (SPARK-23808) Test spark sessions should set default session

2018-03-29 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23808?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-23808.
-
Resolution: Fixed
  Assignee: Jose Torres

> Test spark sessions should set default session
> --
>
> Key: SPARK-23808
> URL: https://issues.apache.org/jira/browse/SPARK-23808
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Jose Torres
>Assignee: Jose Torres
>Priority: Major
> Fix For: 2.3.1, 2.4.0
>
>
> SparkSession.getOrCreate() ensures that the session it returns is set as the 
> default session. Test code (TestSparkSession and TestHiveSparkSession) bypasses 
> this method, so a default is never set. We need to set it.
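
For illustration, a minimal sketch of the behavior that getOrCreate() provides and 
that the test sessions bypass (assumes a local master; this is not the actual test code):
{code}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[1]").appName("demo").getOrCreate()
// getOrCreate() registers the session it returns as the default session
assert(SparkSession.getDefaultSession.contains(spark))
// A session constructed directly (as TestSparkSession/TestHiveSparkSession do)
// skips this registration, so the default stays unset unless set explicitly.
{code}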






[jira] [Updated] (SPARK-23808) Test spark sessions should set default session

2018-03-29 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23808?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-23808:

Fix Version/s: 2.4.0
   2.3.1

> Test spark sessions should set default session
> --
>
> Key: SPARK-23808
> URL: https://issues.apache.org/jira/browse/SPARK-23808
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Jose Torres
>Priority: Major
> Fix For: 2.3.1, 2.4.0
>
>
> SparkSession.getOrCreate() ensures that the session it returns is set as the 
> default session. Test code (TestSparkSession and TestHiveSparkSession) bypasses 
> this method, so a default is never set. We need to set it.






[jira] [Commented] (SPARK-23805) support vector-size validation and Inference

2018-03-29 Thread zhengruifeng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23805?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16420113#comment-16420113
 ] 

zhengruifeng commented on SPARK-23805:
--

[~sethah] [~josephkb]  Are you interested in this?

> support vector-size validation and Inference
> 
>
> Key: SPARK-23805
> URL: https://issues.apache.org/jira/browse/SPARK-23805
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.4.0
>Reporter: zhengruifeng
>Priority: Major
>
> I think it may be meaningful to unify the usage of {{AttributeGroup}} and 
> support vector-size validation and inference in the algorithms.
> My thoughts are (a minimal sketch follows after this list):
>  * In {{transformSchema}}, validate the input vector-size if possible. If 
> the input vector-size can be obtained from the schema, check it.
>  ** Suppose a {{PCA}} estimator with k=4: {{transformSchema}} will require 
> the vector-size to be at least 4.
>  ** Suppose a {{PCAModel}} trained with vectors of length 10: 
> {{transformSchema}} will require the vector-size to be 10.
>  * In {{transformSchema}}, infer the output vector-size if possible.
>  ** Suppose a {{PCA}} estimator with k=4: {{transformSchema}} will return a 
> schema with output vector-size=4.
>  ** Suppose a {{PCAModel}} trained with k=4: {{transformSchema}} will return 
> a schema with output vector-size=4.
>  * In {{transform}}, infer the output vector-size if possible.
>  * In {{fit}}, obtain the input vector-size from the schema if possible. 
> This can help eliminate redundant {{first}} jobs.
>  
> The current PR only modifies {{PCA}} and {{MaxAbsScaler}} to illustrate the 
> idea. Since the validation and inference is quite algorithm-specific, we may 
> need to separate the task into several small subtasks.
> What do you think about this? [~srowen] [~yanboliang] [~WeichenXu123] [~mlnick]
>  
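
A minimal sketch of the idea, using a hypothetical standalone helper rather than the 
PR code; the column names and the k-based requirement are illustrative:
{code}
import org.apache.spark.ml.attribute.AttributeGroup
import org.apache.spark.sql.types.StructType

// hypothetical helper sketching vector-size validation and inference in transformSchema
def transformSchemaSketch(schema: StructType, inputCol: String, outputCol: String, k: Int): StructType = {
  val inputGroup = AttributeGroup.fromStructField(schema(inputCol))
  if (inputGroup.size >= 0) {
    // vector size is known from the metadata: validate it (e.g. PCA needs at least k features)
    require(inputGroup.size >= k,
      s"Input vector size ${inputGroup.size} must be at least k=$k")
  }
  // inference: declare the known output vector size in the returned schema
  val outputGroup = new AttributeGroup(outputCol, k)
  StructType(schema.fields :+ outputGroup.toStructField())
}
{code}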






[jira] [Issue Comment Deleted] (SPARK-5928) Remote Shuffle Blocks cannot be more than 2 GB

2018-03-29 Thread yuliang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5928?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

yuliang updated SPARK-5928:
---
Comment: was deleted

(was: !image-2018-03-29-11-52-32-075.png|width=741,height=189!
 
From the picture, the shuffle read size is over 2 GB, but the job did not fail.

I wonder if anyone can answer me?

 )

> Remote Shuffle Blocks cannot be more than 2 GB
> --
>
> Key: SPARK-5928
> URL: https://issues.apache.org/jira/browse/SPARK-5928
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Imran Rashid
>Priority: Major
> Attachments: image-2018-03-29-11-52-32-075.png
>
>
> If a shuffle block is over 2GB, the shuffle fails, with an uninformative 
> exception.  The tasks get retried a few times and then eventually the job 
> fails.
> Here is an example program which can cause the exception:
> {code}
> val rdd = sc.parallelize(1 to 1e6.toInt, 1).map{ ignore =>
>   val n = 3e3.toInt
>   val arr = new Array[Byte](n)
>   //need to make sure the array doesn't compress to something small
>   scala.util.Random.nextBytes(arr)
>   arr
> }
> rdd.map { x => (1, x)}.groupByKey().count()
> {code}
> Note that you can't trigger this exception in local mode, it only happens on 
> remote fetches.   I triggered these exceptions running with 
> {{MASTER=yarn-client spark-shell --num-executors 2 --executor-memory 4000m}}
> {noformat}
> 15/02/20 11:10:23 WARN TaskSetManager: Lost task 0.0 in stage 3.0 (TID 3, 
> imran-3.ent.cloudera.com): FetchFailed(BlockManagerId(1, 
> imran-2.ent.cloudera.com, 55028), shuffleId=1, mapId=0, reduceId=0, message=
> org.apache.spark.shuffle.FetchFailedException: Adjusted frame length exceeds 
> 2147483647: 3021252889 - discarded
>   at 
> org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$.org$apache$spark$shuffle$hash$BlockStoreShuffleFetcher$$unpackBlock$1(BlockStoreShuffleFetcher.scala:67)
>   at 
> org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$$anonfun$3.apply(BlockStoreShuffleFetcher.scala:83)
>   at 
> org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$$anonfun$3.apply(BlockStoreShuffleFetcher.scala:83)
>   at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
>   at 
> org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
>   at 
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
>   at 
> org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:125)
>   at org.apache.spark.Aggregator.combineValuesByKey(Aggregator.scala:58)
>   at 
> org.apache.spark.shuffle.hash.HashShuffleReader.read(HashShuffleReader.scala:46)
>   at org.apache.spark.rdd.ShuffledRDD.compute(ShuffledRDD.scala:92)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
>   at org.apache.spark.scheduler.Task.run(Task.scala:56)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: io.netty.handler.codec.TooLongFrameException: Adjusted frame 
> length exceeds 2147483647: 3021252889 - discarded
>   at 
> io.netty.handler.codec.LengthFieldBasedFrameDecoder.fail(LengthFieldBasedFrameDecoder.java:501)
>   at 
> io.netty.handler.codec.LengthFieldBasedFrameDecoder.failIfNecessary(LengthFieldBasedFrameDecoder.java:477)
>   at 
> io.netty.handler.codec.LengthFieldBasedFrameDecoder.decode(LengthFieldBasedFrameDecoder.java:403)
>   at 
> io.netty.handler.codec.LengthFieldBasedFrameDecoder.decode(LengthFieldBasedFrameDecoder.java:343)
>   at 
> io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:249)
>   at 
> io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:149)
>   at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:333)
>   at 
> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:319)
>   at 
> io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:787)
>   at 
> io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:130)
>   at 
> io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511)
>   at 
> io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
>   at 
> io

[jira] [Commented] (SPARK-23825) [K8s] Spark pods should request memory + memoryOverhead as resources

2018-03-29 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23825?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16419995#comment-16419995
 ] 

Apache Spark commented on SPARK-23825:
--

User 'dvogelbacher' has created a pull request for this issue:
https://github.com/apache/spark/pull/20943

> [K8s] Spark pods should request memory + memoryOverhead as resources
> 
>
> Key: SPARK-23825
> URL: https://issues.apache.org/jira/browse/SPARK-23825
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.3.0
>Reporter: David Vogelbacher
>Priority: Major
>
> We currently request  {{spark.[driver,executor].memory}} as memory from 
> Kubernetes (e.g., 
> [here|https://github.com/apache/spark/blob/master/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/submit/steps/BasicDriverConfigurationStep.scala#L95]).
> The limit is set to {{spark.[driver,executor].memory + 
> spark.kubernetes.[driver,executor].memoryOverhead}}.
> This seems to be using Kubernetes wrong. 
> [How Pods with resource limits are 
> run|https://kubernetes.io/docs/concepts/configuration/manage-compute-resources-container/#how-pods-with-resource-limits-are-run],
>  states:
> {noformat}
> If a Container exceeds its memory request, it is likely that its Pod will be 
> evicted whenever the node runs out of memory.
> {noformat}
> Thus, if the Spark driver/executor uses {{memory + memoryOverhead}} memory, 
> it can be evicted. While an executor might get restarted (though even that 
> would be very bad performance-wise), the driver would be hard to recover.
> I think Spark should be able to run with the requested (and thus guaranteed) 
> resources from Kubernetes, without being in danger of termination and without 
> needing to rely on optionally available resources.
> Thus, we should request {{memory + memoryOverhead}} memory from Kubernetes 
> (and this should also be the limit).
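
For illustration, a minimal sketch of the proposed sizing; the memory value is a 
placeholder, and the 10% / 384 MiB overhead heuristic is Spark's default when 
memoryOverhead is not set explicitly:
{code}
// placeholder value for spark.driver.memory, in MiB
val driverMemoryMiB = 4096
// default overhead heuristic: max(10% of memory, 384 MiB)
val overheadMiB = math.max((0.10 * driverMemoryMiB).toInt, 384)
val podMemoryMiB = driverMemoryMiB + overheadMiB

// current behavior: request = driverMemoryMiB, limit = podMemoryMiB
// proposal:         request = limit = podMemoryMiB, so the pod cannot be
//                   evicted for exceeding its memory request under node pressure
println(s"pod memory request and limit should both be ${podMemoryMiB}Mi")
{code}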






[jira] [Assigned] (SPARK-23825) [K8s] Spark pods should request memory + memoryOverhead as resources

2018-03-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23825?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23825:


Assignee: Apache Spark

> [K8s] Spark pods should request memory + memoryOverhead as resources
> 
>
> Key: SPARK-23825
> URL: https://issues.apache.org/jira/browse/SPARK-23825
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.3.0
>Reporter: David Vogelbacher
>Assignee: Apache Spark
>Priority: Major
>
> We currently request  {{spark.[driver,executor].memory}} as memory from 
> Kubernetes (e.g., 
> [here|https://github.com/apache/spark/blob/master/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/submit/steps/BasicDriverConfigurationStep.scala#L95]).
> The limit is set to {{spark.[driver,executor].memory + 
> spark.kubernetes.[driver,executor].memoryOverhead}}.
> This seems to be using Kubernetes wrong. 
> [How Pods with resource limits are 
> run|https://kubernetes.io/docs/concepts/configuration/manage-compute-resources-container/#how-pods-with-resource-limits-are-run],
>  states:
> {noformat}
> If a Container exceeds its memory request, it is likely that its Pod will be 
> evicted whenever the node runs out of memory.
> {noformat}
> Thus, if the Spark driver/executor uses {{memory + memoryOverhead}} memory, 
> it can be evicted. While an executor might get restarted (though even that 
> would be very bad performance-wise), the driver would be hard to recover.
> I think Spark should be able to run with the requested (and thus guaranteed) 
> resources from Kubernetes, without being in danger of termination and without 
> needing to rely on optionally available resources.
> Thus, we should request {{memory + memoryOverhead}} memory from Kubernetes 
> (and this should also be the limit).






[jira] [Assigned] (SPARK-23825) [K8s] Spark pods should request memory + memoryOverhead as resources

2018-03-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23825?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23825:


Assignee: (was: Apache Spark)

> [K8s] Spark pods should request memory + memoryOverhead as resources
> 
>
> Key: SPARK-23825
> URL: https://issues.apache.org/jira/browse/SPARK-23825
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.3.0
>Reporter: David Vogelbacher
>Priority: Major
>
> We currently request  {{spark.[driver,executor].memory}} as memory from 
> Kubernetes (e.g., 
> [here|https://github.com/apache/spark/blob/master/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/submit/steps/BasicDriverConfigurationStep.scala#L95]).
> The limit is set to {{spark.[driver,executor].memory + 
> spark.kubernetes.[driver,executor].memoryOverhead}}.
> This seems to be using Kubernetes wrong. 
> [How Pods with resource limits are 
> run|https://kubernetes.io/docs/concepts/configuration/manage-compute-resources-container/#how-pods-with-resource-limits-are-run],
>  states:
> {noformat}
> If a Container exceeds its memory request, it is likely that its Pod will be 
> evicted whenever the node runs out of memory.
> {noformat}
> Thus, if the Spark driver/executor uses {{memory + memoryOverhead}} memory, 
> it can be evicted. While an executor might get restarted (though even that 
> would be very bad performance-wise), the driver would be hard to recover.
> I think Spark should be able to run with the requested (and thus guaranteed) 
> resources from Kubernetes, without being in danger of termination and without 
> needing to rely on optionally available resources.
> Thus, we should request {{memory + memoryOverhead}} memory from Kubernetes 
> (and this should also be the limit).






[jira] [Updated] (SPARK-23822) Improve error message for Parquet schema mismatches

2018-03-29 Thread Yuchen Huo (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23822?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuchen Huo updated SPARK-23822:
---
Component/s: (was: Input/Output)
 SQL

> Improve error message for Parquet schema mismatches
> ---
>
> Key: SPARK-23822
> URL: https://issues.apache.org/jira/browse/SPARK-23822
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Yuchen Huo
>Priority: Major
>
> If a user attempts to read Parquet files with mismatched schemas and schema 
> merging is disabled, this may result in very confusing 
> UnsupportedOperationException and ParquetDecodingException errors from 
> Parquet.
> e.g.
> {code:java}
> Seq(("bcd")).toDF("a").coalesce(1).write.mode("overwrite").parquet(s"$path/")
> Seq((1)).toDF("a").coalesce(1).write.mode("append").parquet(s"$path/")
> spark.read.parquet(s"$path/").collect()
> {code}
> Would result in
> {code:java}
> Caused by: java.lang.UnsupportedOperationException: Unimplemented type: 
> IntegerType
>   at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readBinaryBatch(VectorizedColumnReader.java:474)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readBatch(VectorizedColumnReader.java:214)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextBatch(VectorizedParquetRecordReader.java:261)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextKeyValue(VectorizedParquetRecordReader.java:159)
>   at 
> org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:106)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:182)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:106)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.scan_nextBatch$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$11$$anon$1.hasNext(WholeStageCodegenExec.scala:617)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:253)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:247)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:830)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:830)
>   at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
>   at org.apache.spark.scheduler.Task.run(Task.scala:109)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:748)
> {code}
>  






[jira] [Commented] (SPARK-23825) [K8s] Spark pods should request memory + memoryOverhead as resources

2018-03-29 Thread David Vogelbacher (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23825?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16419950#comment-16419950
 ] 

David Vogelbacher commented on SPARK-23825:
---

addressed by https://github.com/apache/spark/pull/20943

> [K8s] Spark pods should request memory + memoryOverhead as resources
> 
>
> Key: SPARK-23825
> URL: https://issues.apache.org/jira/browse/SPARK-23825
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.3.0
>Reporter: David Vogelbacher
>Priority: Major
>
> We currently request  {{spark.[driver,executor].memory}} as memory from 
> Kubernetes (e.g., 
> [here|https://github.com/apache/spark/blob/master/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/submit/steps/BasicDriverConfigurationStep.scala#L95]).
> The limit is set to {{spark.[driver,executor].memory + 
> spark.kubernetes.[driver,executor].memoryOverhead}}.
> This seems to be using Kubernetes wrong. 
> [How Pods with resource limits are 
> run|https://kubernetes.io/docs/concepts/configuration/manage-compute-resources-container/#how-pods-with-resource-limits-are-run],
>  states:
> {noformat}
> If a Container exceeds its memory request, it is likely that its Pod will be 
> evicted whenever the node runs out of memory.
> {noformat}
> Thus, if the Spark driver/executor uses {{memory + memoryOverhead}} memory, 
> it can be evicted. While an executor might get restarted (though even that 
> would be very bad performance-wise), the driver would be hard to recover.
> I think Spark should be able to run with the requested (and thus guaranteed) 
> resources from Kubernetes, without being in danger of termination and without 
> needing to rely on optionally available resources.
> Thus, we should request {{memory + memoryOverhead}} memory from Kubernetes 
> (and this should also be the limit).






[jira] [Created] (SPARK-23828) PySpark StringIndexerModel should have constructor from labels

2018-03-29 Thread Bryan Cutler (JIRA)
Bryan Cutler created SPARK-23828:


 Summary: PySpark StringIndexerModel should have constructor from 
labels
 Key: SPARK-23828
 URL: https://issues.apache.org/jira/browse/SPARK-23828
 Project: Spark
  Issue Type: Improvement
  Components: ML, PySpark
Affects Versions: 2.3.0
Reporter: Bryan Cutler


The Scala StringIndexerModel has an alternate constructor that creates the 
model from an array of label strings. This should be in the Python API as well, 
for example as a classmethod:
{code:java}
model = StringIndexerModel.from_labels(['a', 'b']){code}
See SPARK-15009 which added a similar constructor to CountVectorizerModel






[jira] [Commented] (SPARK-15009) PySpark CountVectorizerModel should be able to construct from vocabulary list

2018-03-29 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15009?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16419939#comment-16419939
 ] 

Apache Spark commented on SPARK-15009:
--

User 'BryanCutler' has created a pull request for this issue:
https://github.com/apache/spark/pull/20942

> PySpark CountVectorizerModel should be able to construct from vocabulary list
> -
>
> Key: SPARK-15009
> URL: https://issues.apache.org/jira/browse/SPARK-15009
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Reporter: Bryan Cutler
>Assignee: Bryan Cutler
>Priority: Minor
> Fix For: 2.4.0
>
>
> Like the Scala version, PySpark CountVectorizerModel should be able to 
> construct the model from a given vocabulary list.
> For example
> {noformat}
> cvm = CountVectorizerModel(["a", "b", "c"])
> {noformat}






[jira] [Assigned] (SPARK-23827) StreamingJoinExec should ensure that input data is partitioned into specific number of partitions

2018-03-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23827?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23827:


Assignee: Apache Spark  (was: Tathagata Das)

> StreamingJoinExec should ensure that input data is partitioned into specific 
> number of partitions
> -
>
> Key: SPARK-23827
> URL: https://issues.apache.org/jira/browse/SPARK-23827
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.3.0
>Reporter: Tathagata Das
>Assignee: Apache Spark
>Priority: Critical
>
> Currently, the requiredChildDistribution does not specify the number of 
> partitions. This can cause weird corner cases where the child's distribution 
> is `SinglePartition`, which satisfies the required distribution of 
> `ClusteredDistribution(no-num-partition-requirement)`, thus eliminating the 
> shuffle needed to repartition the input data into the required number of 
> partitions (i.e. the same as the state stores). That can lead to "file not 
> found" errors on the state store delta files, as the 
> micro-batch-with-no-shuffle will not run certain tasks and therefore not 
> generate the expected state store delta files.






[jira] [Commented] (SPARK-23827) StreamingJoinExec should ensure that input data is partitioned into specific number of partitions

2018-03-29 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23827?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16419925#comment-16419925
 ] 

Apache Spark commented on SPARK-23827:
--

User 'tdas' has created a pull request for this issue:
https://github.com/apache/spark/pull/20941

> StreamingJoinExec should ensure that input data is partitioned into specific 
> number of partitions
> -
>
> Key: SPARK-23827
> URL: https://issues.apache.org/jira/browse/SPARK-23827
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.3.0
>Reporter: Tathagata Das
>Assignee: Tathagata Das
>Priority: Critical
>
> Currently, the requiredChildDistribution does not specify the number of 
> partitions. This can cause weird corner cases where the child's distribution 
> is `SinglePartition`, which satisfies the required distribution of 
> `ClusteredDistribution(no-num-partition-requirement)`, thus eliminating the 
> shuffle needed to repartition the input data into the required number of 
> partitions (i.e. the same as the state stores). That can lead to "file not 
> found" errors on the state store delta files, as the 
> micro-batch-with-no-shuffle will not run certain tasks and therefore not 
> generate the expected state store delta files.






[jira] [Assigned] (SPARK-23827) StreamingJoinExec should ensure that input data is partitioned into specific number of partitions

2018-03-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23827?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23827:


Assignee: Tathagata Das  (was: Apache Spark)

> StreamingJoinExec should ensure that input data is partitioned into specific 
> number of partitions
> -
>
> Key: SPARK-23827
> URL: https://issues.apache.org/jira/browse/SPARK-23827
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.3.0
>Reporter: Tathagata Das
>Assignee: Tathagata Das
>Priority: Critical
>
> Currently, the requiredChildDistribution does not specify the number of 
> partitions. This can cause weird corner cases where the child's distribution 
> is `SinglePartition`, which satisfies the required distribution of 
> `ClusteredDistribution(no-num-partition-requirement)`, thus eliminating the 
> shuffle needed to repartition the input data into the required number of 
> partitions (i.e. the same as the state stores). That can lead to "file not 
> found" errors on the state store delta files, as the 
> micro-batch-with-no-shuffle will not run certain tasks and therefore not 
> generate the expected state store delta files.






[jira] [Resolved] (SPARK-22711) _pickle.PicklingError: args[0] from __newobj__ args has the wrong class from cloudpickle.py

2018-03-29 Thread Bryan Cutler (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22711?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler resolved SPARK-22711.
--
Resolution: Workaround

Closing this because it seems wordnet is not serializable with Cloudpickle and 
it's preferable to import inside the executor function rather than globally.

> _pickle.PicklingError: args[0] from __newobj__ args has the wrong class from 
> cloudpickle.py
> ---
>
> Key: SPARK-22711
> URL: https://issues.apache.org/jira/browse/SPARK-22711
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Spark Submit
>Affects Versions: 2.2.0, 2.2.1
> Environment: Ubuntu pseudo distributed installation of Spark 2.2.0
>Reporter: Prateek
>Priority: Major
> Attachments: Jira_Spark_minimized_code.py
>
>   Original Estimate: 336h
>  Remaining Estimate: 336h
>
> When I submit a PySpark program with the spark-submit command, this error is 
> thrown.
> It happens for code like the following:
> RDD2 = RDD1.map(lambda m: function_x(m)).reduceByKey(lambda c,v :c+v)
> or 
> RDD2 = RDD1.flatMap(lambda m: function_x(m)).reduceByKey(lambda c,v :c+v)
> or
> RDD2 = RDD1.flatMap(lambda m: function_x(m)).reduce(lambda c,v :c+v)
> Traceback (most recent call last):
>   File "/home/prateek/Project/textrank.py", line 299, in 
> summaryRDD = sentenceTokensReduceRDD.map(lambda m: 
> get_summary(m)).reduceByKey(lambda c,v :c+v)
>   File "/usr/local/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 1608, 
> in reduceByKey
>   File "/usr/local/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 1846, 
> in combineByKey
>   File "/usr/local/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 1783, 
> in partitionBy
>   File "/usr/local/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 2455, 
> in _jrdd
>   File "/usr/local/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 2388, 
> in _wrap_function
>   File "/usr/local/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 2374, 
> in _prepare_for_python_RDD
>   File "/usr/local/spark/python/lib/pyspark.zip/pyspark/serializers.py", line 
> 460, in dumps
>   File "/usr/local/spark/python/lib/pyspark.zip/pyspark/cloudpickle.py", line 
> 704, in dumps
>   File "/usr/local/spark/python/lib/pyspark.zip/pyspark/cloudpickle.py", line 
> 148, in dump
>   File "/usr/lib/python3.5/pickle.py", line 408, in dump
> self.save(obj)
>   File "/usr/lib/python3.5/pickle.py", line 475, in save
> f(self, obj) # Call unbound method with explicit self
>   File "/usr/lib/python3.5/pickle.py", line 740, in save_tuple
> save(element)
>   File "/usr/lib/python3.5/pickle.py", line 475, in save
> f(self, obj) # Call unbound method with explicit self
>   File "/usr/local/spark/python/lib/pyspark.zip/pyspark/cloudpickle.py", line 
> 255, in save_function
>   File "/usr/local/spark/python/lib/pyspark.zip/pyspark/cloudpickle.py", line 
> 292, in save_function_tuple
>   File "/usr/lib/python3.5/pickle.py", line 475, in save
> f(self, obj) # Call unbound method with explicit self
>   File "/usr/lib/python3.5/pickle.py", line 725, in save_tuple
> save(element)
>   File "/usr/lib/python3.5/pickle.py", line 475, in save
> f(self, obj) # Call unbound method with explicit self
>   File "/usr/lib/python3.5/pickle.py", line 770, in save_list
> self._batch_appends(obj)
>   File "/usr/lib/python3.5/pickle.py", line 794, in _batch_appends
> save(x)
>   File "/usr/lib/python3.5/pickle.py", line 475, in save
> f(self, obj) # Call unbound method with explicit self
>   File "/usr/local/spark/python/lib/pyspark.zip/pyspark/cloudpickle.py", line 
> 255, in save_function
>   File "/usr/local/spark/python/lib/pyspark.zip/pyspark/cloudpickle.py", line 
> 292, in save_function_tuple
>   File "/usr/lib/python3.5/pickle.py", line 475, in save
> f(self, obj) # Call unbound method with explicit self
>   File "/usr/lib/python3.5/pickle.py", line 725, in save_tuple
> save(element)
>   File "/usr/lib/python3.5/pickle.py", line 475, in save
> f(self, obj) # Call unbound method with explicit self
>   File "/usr/lib/python3.5/pickle.py", line 770, in save_list
> self._batch_appends(obj)
>   File "/usr/lib/python3.5/pickle.py", line 794, in _batch_appends
> save(x)
>   File "/usr/lib/python3.5/pickle.py", line 475, in save
> f(self, obj) # Call unbound method with explicit self
>   File "/usr/local/spark/python/lib/pyspark.zip/pyspark/cloudpickle.py", line 
> 255, in save_function
>   File "/usr/local/spark/python/lib/pyspark.zip/pyspark/cloudpickle.py", line 
> 292, in save_function_tuple
>   File "/usr/lib/python3.5/pickle.py", line 475, in save
> f(self, obj) # Call unbound method with explicit self
>   File "/usr/lib/python3.5/pickle.py", line 725, in save_tuple
> save(element)
>

[jira] [Updated] (SPARK-23429) Add executor memory metrics to heartbeat and expose in executors REST API

2018-03-29 Thread Edwina Lu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23429?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Edwina Lu updated SPARK-23429:
--
Description: 
Add new executor level memory metrics ( jvmUsedMemory, onHeapExecutionMemory, 
offHeapExecutionMemory, onHeapStorageMemory, offHeapStorageMemory, 
onHeapUnifiedMemory, and offHeapUnifiedMemory), and expose these via the 
executors REST API. This information will help provide insight into how 
executor and driver JVM memory is used, and for the different memory regions. 
It can be used to help determine good values for spark.executor.memory, 
spark.driver.memory, spark.memory.fraction, and spark.memory.storageFraction.

Add an ExecutorMetrics class, with jvmUsedMemory, onHeapExecutionMemory, 
offHeapExecutionMemory, onHeapStorageMemory, and offHeapStorageMemory. This 
will track the memory usage at the executor level. The new ExecutorMetrics will 
be sent by executors to the driver as part of the Heartbeat. A heartbeat will 
be added for the driver as well, to collect these metrics for the driver.

Modify the EventLoggingListener to log ExecutorMetricsUpdate events if there is 
a new peak value for one of the memory metrics for an executor and stage. Only 
the ExecutorMetrics will be logged, and not the TaskMetrics, to minimize 
additional logging. Analysis on a set of sample applications showed an increase 
of 0.25% in the size of the Spark history log, with this approach.

Modify the AppStatusListener to collect snapshots of peak values for each 
memory metric. Each snapshot has the time, jvmUsedMemory, executionMemory and 
storageMemory, and list of active stages.

Add the new memory metrics (snapshots of peak values for each memory metric) to 
the executors REST API.

This is a subtask for SPARK-23206. Please refer to the design doc for that 
ticket for more details.
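
For illustration, a minimal sketch of the proposed executor-level metrics payload; the 
field set follows the description above, while the class shape, the timestamp field, and 
the assumption that unified memory = execution + storage per region are illustrative, 
not the final API:
{code}
// illustrative sketch only, not the proposed patch
case class ExecutorMetrics(
    timestamp: Long,
    jvmUsedMemory: Long,
    onHeapExecutionMemory: Long,
    offHeapExecutionMemory: Long,
    onHeapStorageMemory: Long,
    offHeapStorageMemory: Long) {
  // assumed derivation: unified memory = execution + storage for each region
  def onHeapUnifiedMemory: Long = onHeapExecutionMemory + onHeapStorageMemory
  def offHeapUnifiedMemory: Long = offHeapExecutionMemory + offHeapStorageMemory
}
{code}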

  was:
Add new executor level memory metrics ( jvmUsedMemory, executionMemory, 
storageMemory, and unifiedMemory), and expose these via the executors REST API. 
This information will help provide insight into how executor and driver JVM 
memory is used, and for the different memory regions. It can be used to help 
determine good values for spark.executor.memory, spark.driver.memory, 
spark.memory.fraction, and spark.memory.storageFraction.

Add an ExecutorMetrics class, with jvmUsedMemory, executionMemory, and 
storageMemory. This will track the memory usage at the executor level. The new 
ExecutorMetrics will be sent by executors to the driver as part of the 
Heartbeat. A heartbeat will be added for the driver as well, to collect these 
metrics for the driver.

Modify the EventLoggingListener to log ExecutorMetricsUpdate events if there is 
a new peak value for one of the memory metrics for an executor and stage. Only 
the ExecutorMetrics will be logged, and not the TaskMetrics, to minimize 
additional logging. Analysis on a set of sample applications showed an increase 
of 0.25% in the size of the Spark history log, with this approach.

Modify the AppStatusListener to collect snapshots of peak values for each 
memory metric. Each snapshot has the time, jvmUsedMemory, executionMemory and 
storageMemory, and list of active stages.

Add the new memory metrics (snapshots of peak values for each memory metric) to 
the executors REST API.

This is a subtask for SPARK-23206. Please refer to the design doc for that 
ticket for more details.


> Add executor memory metrics to heartbeat and expose in executors REST API
> -
>
> Key: SPARK-23429
> URL: https://issues.apache.org/jira/browse/SPARK-23429
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 2.2.1
>Reporter: Edwina Lu
>Priority: Major
>
> Add new executor level memory metrics ( jvmUsedMemory, onHeapExecutionMemory, 
> offHeapExecutionMemory, onHeapStorageMemory, offHeapStorageMemory, 
> onHeapUnifiedMemory, and offHeapUnifiedMemory), and expose these via the 
> executors REST API. This information will help provide insight into how 
> executor and driver JVM memory is used, and for the different memory regions. 
> It can be used to help determine good values for spark.executor.memory, 
> spark.driver.memory, spark.memory.fraction, and spark.memory.storageFraction.
> Add an ExecutorMetrics class, with jvmUsedMemory, onHeapExecutionMemory, 
> offHeapExecutionMemory, onHeapStorageMemory, and offHeapStorageMemory. This 
> will track the memory usage at the executor level. The new ExecutorMetrics 
> will be sent by executors to the driver as part of the Heartbeat. A heartbeat 
> will be added for the driver as well, to collect these metrics for the driver.
> Modify the EventLoggingListener to log ExecutorMetricsUpdate events if there 
> is a new peak value for one of the memory metrics for an execu

[jira] [Created] (SPARK-23827) StreamingJoinExec should ensure that input data is partitioned into specific number of partitions

2018-03-29 Thread Tathagata Das (JIRA)
Tathagata Das created SPARK-23827:
-

 Summary: StreamingJoinExec should ensure that input data is 
partitioned into specific number of partitions
 Key: SPARK-23827
 URL: https://issues.apache.org/jira/browse/SPARK-23827
 Project: Spark
  Issue Type: Bug
  Components: Structured Streaming
Affects Versions: 2.3.0
Reporter: Tathagata Das
Assignee: Tathagata Das


Currently, the requiredChildDistribution does not specify the number of 
partitions. This can cause weird corner cases where the child's distribution is 
`SinglePartition`, which satisfies the required distribution of 
`ClusteredDistribution(no-num-partition-requirement)`, thus eliminating the 
shuffle needed to repartition the input data into the required number of 
partitions (i.e. the same as the state stores). That can lead to "file not found" 
errors on the state store delta files, as the micro-batch-with-no-shuffle will 
not run certain tasks and therefore not generate the expected state store delta 
files.






[jira] [Created] (SPARK-23826) TestHiveSparkSession should set default session

2018-03-29 Thread Jose Torres (JIRA)
Jose Torres created SPARK-23826:
---

 Summary: TestHiveSparkSession should set default session
 Key: SPARK-23826
 URL: https://issues.apache.org/jira/browse/SPARK-23826
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 2.4.0
Reporter: Jose Torres


The fix for TestSparkSession breaks hive/testOnly, because many of the tests 
both instantiate a TestHiveSparkSession and call SparkSession.getOrCreate().






[jira] [Assigned] (SPARK-23429) Add executor memory metrics to heartbeat and expose in executors REST API

2018-03-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23429?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23429:


Assignee: (was: Apache Spark)

> Add executor memory metrics to heartbeat and expose in executors REST API
> -
>
> Key: SPARK-23429
> URL: https://issues.apache.org/jira/browse/SPARK-23429
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 2.2.1
>Reporter: Edwina Lu
>Priority: Major
>
> Add new executor level memory metrics ( jvmUsedMemory, executionMemory, 
> storageMemory, and unifiedMemory), and expose these via the executors REST 
> API. This information will help provide insight into how executor and driver 
> JVM memory is used, and for the different memory regions. It can be used to 
> help determine good values for spark.executor.memory, spark.driver.memory, 
> spark.memory.fraction, and spark.memory.storageFraction.
> Add an ExecutorMetrics class, with jvmUsedMemory, executionMemory, and 
> storageMemory. This will track the memory usage at the executor level. The 
> new ExecutorMetrics will be sent by executors to the driver as part of the 
> Heartbeat. A heartbeat will be added for the driver as well, to collect these 
> metrics for the driver.
> Modify the EventLoggingListener to log ExecutorMetricsUpdate events if there 
> is a new peak value for one of the memory metrics for an executor and stage. 
> Only the ExecutorMetrics will be logged, and not the TaskMetrics, to minimize 
> additional logging. Analysis on a set of sample applications showed an 
> increase of 0.25% in the size of the Spark history log, with this approach.
> Modify the AppStatusListener to collect snapshots of peak values for each 
> memory metric. Each snapshot has the time, jvmUsedMemory, executionMemory and 
> storageMemory, and list of active stages.
> Add the new memory metrics (snapshots of peak values for each memory metric) 
> to the executors REST API.
> This is a subtask for SPARK-23206. Please refer to the design doc for that 
> ticket for more details.
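
For illustration, a minimal Scala shape of such a metrics holder; the field names 
follow the description above, while the class layout itself is an assumption and 
not the actual patch:

{code:scala}
// Illustrative sketch only: per-executor memory metrics carried on the heartbeat.
class ExecutorMetrics(
    val timestamp: Long,
    val jvmUsedMemory: Long,
    val executionMemory: Long,
    val storageMemory: Long) extends Serializable {

  // Under the unified memory manager, execution and storage share one region.
  def unifiedMemory: Long = executionMemory + storageMemory
}
{code}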



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23429) Add executor memory metrics to heartbeat and expose in executors REST API

2018-03-29 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16419861#comment-16419861
 ] 

Apache Spark commented on SPARK-23429:
--

User 'edwinalu' has created a pull request for this issue:
https://github.com/apache/spark/pull/20940

> Add executor memory metrics to heartbeat and expose in executors REST API
> -
>
> Key: SPARK-23429
> URL: https://issues.apache.org/jira/browse/SPARK-23429
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 2.2.1
>Reporter: Edwina Lu
>Priority: Major
>
> Add new executor level memory metrics ( jvmUsedMemory, executionMemory, 
> storageMemory, and unifiedMemory), and expose these via the executors REST 
> API. This information will help provide insight into how executor and driver 
> JVM memory is used, and for the different memory regions. It can be used to 
> help determine good values for spark.executor.memory, spark.driver.memory, 
> spark.memory.fraction, and spark.memory.storageFraction.
> Add an ExecutorMetrics class, with jvmUsedMemory, executionMemory, and 
> storageMemory. This will track the memory usage at the executor level. The 
> new ExecutorMetrics will be sent by executors to the driver as part of the 
> Heartbeat. A heartbeat will be added for the driver as well, to collect these 
> metrics for the driver.
> Modify the EventLoggingListener to log ExecutorMetricsUpdate events if there 
> is a new peak value for one of the memory metrics for an executor and stage. 
> Only the ExecutorMetrics will be logged, and not the TaskMetrics, to minimize 
> additional logging. Analysis on a set of sample applications showed an 
> increase of 0.25% in the size of the Spark history log, with this approach.
> Modify the AppStatusListener to collect snapshots of peak values for each 
> memory metric. Each snapshot has the time, jvmUsedMemory, executionMemory and 
> storageMemory, and list of active stages.
> Add the new memory metrics (snapshots of peak values for each memory metric) 
> to the executors REST API.
> This is a subtask for SPARK-23206. Please refer to the design doc for that 
> ticket for more details.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23429) Add executor memory metrics to heartbeat and expose in executors REST API

2018-03-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23429?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23429:


Assignee: Apache Spark

> Add executor memory metrics to heartbeat and expose in executors REST API
> -
>
> Key: SPARK-23429
> URL: https://issues.apache.org/jira/browse/SPARK-23429
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 2.2.1
>Reporter: Edwina Lu
>Assignee: Apache Spark
>Priority: Major
>
> Add new executor level memory metrics ( jvmUsedMemory, executionMemory, 
> storageMemory, and unifiedMemory), and expose these via the executors REST 
> API. This information will help provide insight into how executor and driver 
> JVM memory is used, and for the different memory regions. It can be used to 
> help determine good values for spark.executor.memory, spark.driver.memory, 
> spark.memory.fraction, and spark.memory.storageFraction.
> Add an ExecutorMetrics class, with jvmUsedMemory, executionMemory, and 
> storageMemory. This will track the memory usage at the executor level. The 
> new ExecutorMetrics will be sent by executors to the driver as part of the 
> Heartbeat. A heartbeat will be added for the driver as well, to collect these 
> metrics for the driver.
> Modify the EventLoggingListener to log ExecutorMetricsUpdate events if there 
> is a new peak value for one of the memory metrics for an executor and stage. 
> Only the ExecutorMetrics will be logged, and not the TaskMetrics, to minimize 
> additional logging. Analysis on a set of sample applications showed an 
> increase of 0.25% in the size of the Spark history log, with this approach.
> Modify the AppStatusListener to collect snapshots of peak values for each 
> memory metric. Each snapshot has the time, jvmUsedMemory, executionMemory and 
> storageMemory, and list of active stages.
> Add the new memory metrics (snapshots of peak values for each memory metric) 
> to the executors REST API.
> This is a subtask for SPARK-23206. Please refer to the design doc for that 
> ticket for more details.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23825) [K8s] Spark pods should request memory + memoryOverhead as resources

2018-03-29 Thread David Vogelbacher (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23825?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16419857#comment-16419857
 ] 

David Vogelbacher commented on SPARK-23825:
---

Will make a PR shortly, cc [~mcheah]

> [K8s] Spark pods should request memory + memoryOverhead as resources
> 
>
> Key: SPARK-23825
> URL: https://issues.apache.org/jira/browse/SPARK-23825
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.3.0
>Reporter: David Vogelbacher
>Priority: Major
>
> We currently request  {{spark.[driver,executor].memory}} as memory from 
> Kubernetes (e.g., 
> [here|https://github.com/apache/spark/blob/master/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/submit/steps/BasicDriverConfigurationStep.scala#L95]).
> The limit is set to {{spark.[driver,executor].memory + 
> spark.kubernetes.[driver,executor].memoryOverhead}}.
> This seems to be using Kubernetes wrong. 
> [How Pods with resource limits are 
> run|https://kubernetes.io/docs/concepts/configuration/manage-compute-resources-container/#how-pods-with-resource-limits-are-run],
>  states:
> {noformat}
> If a Container exceeds its memory request, it is likely that its Pod will be 
> evicted whenever the node runs out of memory.
> {noformat}
> Thus, if the Spark driver/executor uses {{memory + memoryOverhead}} memory, it 
> can be evicted. While an executor might get restarted (though it would still be 
> very bad performance-wise), the driver would be hard to recover.
> I think Spark should be able to run with the requested (and, thus, guaranteed) 
> resources from Kubernetes, without being in danger of termination and without 
> needing to rely on optionally available resources.
> Thus, we should request {{memory + memoryOverhead}} memory from Kubernetes 
> (and this should also be the limit).
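
A small sketch of the proposed sizing for the driver pod, using Spark's byte-string 
config parsing; the default overhead formula (max of 10% or 384 MiB) mirrors the 
usual convention and is an assumption here:

{code:scala}
import org.apache.spark.SparkConf

// Sketch only: compute the pod memory so that the Kubernetes request equals the
// limit, i.e. both are memory + memoryOverhead, as proposed in this ticket.
val conf = new SparkConf()
val driverMemoryMiB = conf.getSizeAsMb("spark.driver.memory", "1g")
val overheadMiB = conf.getSizeAsMb(
  "spark.kubernetes.driver.memoryOverhead",
  s"${math.max((0.1 * driverMemoryMiB).toLong, 384L)}m")

// Use this single value for both resources.requests.memory and resources.limits.memory.
val podMemory = s"${driverMemoryMiB + overheadMiB}Mi"
{code}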



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23825) [K8s] Spark pods should request memory + memoryOverhead as resources

2018-03-29 Thread David Vogelbacher (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23825?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Vogelbacher updated SPARK-23825:
--
Description: 
We currently request  {{spark.[driver,executor].memory}} as memory from 
Kubernetes (e.g., 
[here|https://github.com/apache/spark/blob/master/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/submit/steps/BasicDriverConfigurationStep.scala#L95]).
The limit is set to {{spark.[driver,executor].memory + 
spark.kubernetes.[driver,executor].memoryOverhead}}.
This seems to be using Kubernetes wrong. 
[How Pods with resource limits are 
run|https://kubernetes.io/docs/concepts/configuration/manage-compute-resources-container/#how-pods-with-resource-limits-are-run],
 states:

{noformat}
If a Container exceeds its memory request, it is likely that its Pod will be 
evicted whenever the node runs out of memory.
{noformat}
Thus, if the Spark driver/executor uses {{memory + memoryOverhead}} memory, it 
can be evicted. While an executor might get restarted (though it would still be 
very bad performance-wise), the driver would be hard to recover.

I think Spark should be able to run with the requested (and, thus, guaranteed) 
resources from Kubernetes, without being in danger of termination and without 
needing to rely on optionally available resources.

Thus, we should request {{memory + memoryOverhead}} memory from Kubernetes (and 
this should also be the limit).

  was:
We currently request  {{spark.{driver,executor}.memory}} as memory from 
Kubernetes (e.g., 
[here|https://github.com/apache/spark/blob/master/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/submit/steps/BasicDriverConfigurationStep.scala#L95]).
The limit is set to {{spark.{driver,executor}.memory + 
spark.kubernetes.{driver,executor}.memoryOverhead}}.
This seems to be using Kubernetes wrong. 
[How Pods with resource limits are 
run|https://kubernetes.io/docs/concepts/configuration/manage-compute-resources-container/#how-pods-with-resource-limits-are-run],
 states"

{noformat}
If a Container exceeds its memory request, it is likely that its Pod will be 
evicted whenever the node runs out of memory.
{noformat}

Thus, if a the  spark driver/executor uses {{memory + memoryOverhead}} memory, 
it can be evicted. While an executor might get restarted (but it would still be 
very bad performance-wise), the driver would be hard to recover.

I think spark should be able to run with the requested (and, thus, guaranteed) 
resources from Kubernetes. It shouldn't rely on optional resources above the 
request and, therefore, be in danger of termination on high cluster utilization.

Thus, we shoud request {{memory + memoryOverhead}} memory from Kubernetes (and 
this should also be the limit).


> [K8s] Spark pods should request memory + memoryOverhead as resources
> 
>
> Key: SPARK-23825
> URL: https://issues.apache.org/jira/browse/SPARK-23825
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.3.0
>Reporter: David Vogelbacher
>Priority: Major
>
> We currently request  {{spark.[driver,executor].memory}} as memory from 
> Kubernetes (e.g., 
> [here|https://github.com/apache/spark/blob/master/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/submit/steps/BasicDriverConfigurationStep.scala#L95]).
> The limit is set to {{spark.[driver,executor].memory + 
> spark.kubernetes.[driver,executor].memoryOverhead}}.
> This seems to be using Kubernetes wrong. 
> [How Pods with resource limits are 
> run|https://kubernetes.io/docs/concepts/configuration/manage-compute-resources-container/#how-pods-with-resource-limits-are-run],
>  states:
> {noformat}
> If a Container exceeds its memory request, it is likely that its Pod will be 
> evicted whenever the node runs out of memory.
> {noformat}
> Thus, if the Spark driver/executor uses {{memory + memoryOverhead}} memory, it 
> can be evicted. While an executor might get restarted (though it would still be 
> very bad performance-wise), the driver would be hard to recover.
> I think Spark should be able to run with the requested (and, thus, guaranteed) 
> resources from Kubernetes, without being in danger of termination and without 
> needing to rely on optionally available resources.
> Thus, we should request {{memory + memoryOverhead}} memory from Kubernetes 
> (and this should also be the limit).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23825) [K8s] Spark pods should request memory + memoryOverhead as resources

2018-03-29 Thread David Vogelbacher (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23825?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Vogelbacher updated SPARK-23825:
--
Description: 
We currently request  {{spark.{driver,executor}.memory}} as memory from 
Kubernetes (e.g., 
[here|https://github.com/apache/spark/blob/master/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/submit/steps/BasicDriverConfigurationStep.scala#L95]).
The limit is set to {{spark.{driver,executor}.memory + 
spark.kubernetes.{driver,executor}.memoryOverhead}}.
This seems to be using Kubernetes wrong. 
[How Pods with resource limits are 
run|https://kubernetes.io/docs/concepts/configuration/manage-compute-resources-container/#how-pods-with-resource-limits-are-run],
 states:

{noformat}
If a Container exceeds its memory request, it is likely that its Pod will be 
evicted whenever the node runs out of memory.
{noformat}

Thus, if the Spark driver/executor uses {{memory + memoryOverhead}} memory, it 
can be evicted. While an executor might get restarted (though it would still be 
very bad performance-wise), the driver would be hard to recover.

I think Spark should be able to run with the requested (and, thus, guaranteed) 
resources from Kubernetes. It shouldn't rely on optional resources above the 
request and, therefore, be in danger of termination on high cluster utilization.

Thus, we should request {{memory + memoryOverhead}} memory from Kubernetes (and 
this should also be the limit).

  was:
We currently request `spark.{driver,executor}.memory` as memory from Kubernetes 
(e.g., 
[here|https://github.com/apache/spark/blob/master/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/submit/steps/BasicDriverConfigurationStep.scala#L95]).
The limit is set to `spark.{driver,executor}.memory + 
spark.kubernetes.{driver,executor}.memoryOverhead`.
This seems to be using Kubernetes wrong. 
[How Pods with resource limits are 
run|https://kubernetes.io/docs/concepts/configuration/manage-compute-resources-container/#how-pods-with-resource-limits-are-run],
 states"

{noformat}
If a Container exceeds its memory request, it is likely that its Pod will be 
evicted whenever the node runs out of memory.
{noformat}

Thus, if a the  spark driver/executor uses `memory + memoryOverhead` memory, it 
can be evicted. While an executor might get restarted (but it would still be 
very bad performance-wise), the driver would be hard to recover.

I think spark should be able to run with the requested (and, thus, guaranteed) 
resources from Kubernetes. It shouldn't rely on optional resources above the 
request and, therefore, be in danger of termination on high cluster utilization.

Thus, we shoud request `memory + memoryOverhead` memory from Kubernetes (and 
this should also be the limit).


> [K8s] Spark pods should request memory + memoryOverhead as resources
> 
>
> Key: SPARK-23825
> URL: https://issues.apache.org/jira/browse/SPARK-23825
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.3.0
>Reporter: David Vogelbacher
>Priority: Major
>
> We currently request  {{spark.{driver,executor}.memory}} as memory from 
> Kubernetes (e.g., 
> [here|https://github.com/apache/spark/blob/master/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/submit/steps/BasicDriverConfigurationStep.scala#L95]).
> The limit is set to {{spark.{driver,executor}.memory + 
> spark.kubernetes.{driver,executor}.memoryOverhead}}.
> This seems to be using Kubernetes wrong. 
> [How Pods with resource limits are 
> run|https://kubernetes.io/docs/concepts/configuration/manage-compute-resources-container/#how-pods-with-resource-limits-are-run],
>  states:
> {noformat}
> If a Container exceeds its memory request, it is likely that its Pod will be 
> evicted whenever the node runs out of memory.
> {noformat}
> Thus, if the Spark driver/executor uses {{memory + memoryOverhead}} memory, it 
> can be evicted. While an executor might get restarted (though it would still be 
> very bad performance-wise), the driver would be hard to recover.
> I think Spark should be able to run with the requested (and, thus, guaranteed) 
> resources from Kubernetes. It shouldn't rely on optional resources above the 
> request and, therefore, be in danger of termination on high cluster utilization.
> Thus, we should request {{memory + memoryOverhead}} memory from Kubernetes 
> (and this should also be the limit).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-23825) [K8s] Spark pods should request memory + memoryOverhead as resources

2018-03-29 Thread David Vogelbacher (JIRA)
David Vogelbacher created SPARK-23825:
-

 Summary: [K8s] Spark pods should request memory + memoryOverhead 
as resources
 Key: SPARK-23825
 URL: https://issues.apache.org/jira/browse/SPARK-23825
 Project: Spark
  Issue Type: Bug
  Components: Kubernetes
Affects Versions: 2.3.0
Reporter: David Vogelbacher


We currently request `spark.{driver,executor}.memory` as memory from Kubernetes 
(e.g., 
[here|https://github.com/apache/spark/blob/master/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/submit/steps/BasicDriverConfigurationStep.scala#L95]).
The limit is set to `spark.{driver,executor}.memory + 
spark.kubernetes.{driver,executor}.memoryOverhead`.
This seems to be using Kubernetes wrong. 
[How Pods with resource limits are 
run|https://kubernetes.io/docs/concepts/configuration/manage-compute-resources-container/#how-pods-with-resource-limits-are-run],
 states:

{noformat}
If a Container exceeds its memory request, it is likely that its Pod will be 
evicted whenever the node runs out of memory.
{noformat}

Thus, if the Spark driver/executor uses `memory + memoryOverhead` memory, it 
can be evicted. While an executor might get restarted (though it would still be 
very bad performance-wise), the driver would be hard to recover.

I think Spark should be able to run with the requested (and, thus, guaranteed) 
resources from Kubernetes. It shouldn't rely on optional resources above the 
request and, therefore, be in danger of termination on high cluster utilization.

Thus, we should request `memory + memoryOverhead` memory from Kubernetes (and 
this should also be the limit).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20169) Groupby Bug with Sparksql

2018-03-29 Thread Maryann Xue (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20169?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16419808#comment-16419808
 ] 

Maryann Xue commented on SPARK-20169:
-

[~smilegator], I think this is also caused by SPARK-23368, so could you please 
assign someone else to review https://github.com/apache/spark/pull/20613 if the 
original person still does not respond in a few days?

> Groupby Bug with Sparksql
> -
>
> Key: SPARK-20169
> URL: https://issues.apache.org/jira/browse/SPARK-20169
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0, 2.1.0
>Reporter: Bin Wu
>Priority: Major
>
> We found a potential bug in the Catalyst optimizer, which cannot correctly 
> process "groupBy". You can reproduce it with the following simple example:
> =
> from pyspark.sql.functions import *
> #e=sc.parallelize([(1,2),(1,3),(1,4),(2,1),(3,1),(4,1)]).toDF(["src","dst"])
> e = spark.read.csv("graph.csv", header=True)
> r = sc.parallelize([(1,),(2,),(3,),(4,)]).toDF(['src'])
> r1 = e.join(r, 'src').groupBy('dst').count().withColumnRenamed('dst','src')
> jr = e.join(r1, 'src')
> jr.show()
> r2 = jr.groupBy('dst').count()
> r2.show()
> =
> FYI, "graph.csv" contains exactly the same data as the commented line.
> You can find that jr is:
> |src|dst|count|
> |  3|  1|1|
> |  1|  4|3|
> |  1|  3|3|
> |  1|  2|3|
> |  4|  1|1|
> |  2|  1|1|
> But, after the last groupBy, the 3 rows with dst = 1 are not grouped together:
> |dst|count|
> |  1|1|
> |  4|1|
> |  3|1|
> |  2|1|
> |  1|1|
> |  1|1|
> If we build jr directly from the raw data (the commented line), this error does 
> not show up. So we suspect that there is a bug in the Catalyst optimizer when 
> multiple joins and groupBy operations are being optimized.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23784) Cannot use custom Aggregator with groupBy/agg

2018-03-29 Thread Joshua Howard (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16419764#comment-16419764
 ] 

Joshua Howard commented on SPARK-23784:
---

Correct. It had been answered, but I am just now closing this out.  

> Cannot use custom Aggregator with groupBy/agg 
> --
>
> Key: SPARK-23784
> URL: https://issues.apache.org/jira/browse/SPARK-23784
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Joshua Howard
>Priority: Major
>
> I have code 
> [here|http://stackoverflow.com/questions/49440766/trouble-getting-spark-aggregators-to-work]
>  where I am trying to use an Aggregator with both the select and agg 
> functions. I cannot seem to get this to work in Spark 2.3.0. 
> [Here|https://docs.cloud.databricks.com/docs/spark/1.6/examples/Dataset%20Aggregator.html]
>  is a blog post that appears to be using this functionality in Spark 1.6, but 
> it no longer appears to work. 
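
For reference, a minimal typed usage that does work on a Dataset via 
groupByKey(...).agg(aggregator.toColumn); the case class and column names are 
invented for the example, and the untyped DataFrame groupBy/agg path discussed in 
the links is a separate question:

{code:scala}
import org.apache.spark.sql.{Encoder, Encoders, SparkSession}
import org.apache.spark.sql.expressions.Aggregator

case class Sale(city: String, amount: Double)

// A simple custom Aggregator summing amounts per group.
object SumAmount extends Aggregator[Sale, Double, Double] {
  def zero: Double = 0.0
  def reduce(acc: Double, s: Sale): Double = acc + s.amount
  def merge(a: Double, b: Double): Double = a + b
  def finish(acc: Double): Double = acc
  def bufferEncoder: Encoder[Double] = Encoders.scalaDouble
  def outputEncoder: Encoder[Double] = Encoders.scalaDouble
}

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

val sales = Seq(Sale("NY", 1.0), Sale("NY", 2.5), Sale("SF", 4.0)).toDS()
sales.groupByKey(_.city).agg(SumAmount.toColumn.name("total")).show()
{code}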



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-23784) Cannot use custom Aggregator with groupBy/agg

2018-03-29 Thread Joshua Howard (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23784?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joshua Howard closed SPARK-23784.
-

See SO link.

> Cannot use custom Aggregator with groupBy/agg 
> --
>
> Key: SPARK-23784
> URL: https://issues.apache.org/jira/browse/SPARK-23784
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Joshua Howard
>Priority: Major
>
> I have code 
> [here|http://stackoverflow.com/questions/49440766/trouble-getting-spark-aggregators-to-work]
>  where I am trying to use an Aggregator with both the select and agg 
> functions. I cannot seem to get this to work in Spark 2.3.0. 
> [Here|https://docs.cloud.databricks.com/docs/spark/1.6/examples/Dataset%20Aggregator.html]
>  is a blog post that appears to be using this functionality in Spark 1.6, but 
> it no longer appears to work. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-23333) SparkML VectorAssembler.transform slow when needing to invoke .first() on sorted DataFrame

2018-03-29 Thread V Luong (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

V Luong resolved SPARK-2.
-
Resolution: Won't Fix

> SparkML VectorAssembler.transform slow when needing to invoke .first() on 
> sorted DataFrame
> --
>
> Key: SPARK-2
> URL: https://issues.apache.org/jira/browse/SPARK-2
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib, SQL
>Affects Versions: 2.2.1
>Reporter: V Luong
>Priority: Minor
>
> Under certain circumstances, newDF = vectorAssembler.transform(oldDF) invokes 
> oldDF.first() in order to establish some metadata/attributes: 
> [https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/VectorAssembler.scala#L88.]
>  When oldDF is sorted, the above triggering of oldDF.first() can be very slow.
> For the purpose of establishing metadata, taking an arbitrary row from oldDF 
> will be just as good as taking oldDF.first(). Is there hence a way we can 
> speed up a great deal by somehow grabbing a random row, instead of relying on 
> oldDF.first()?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23333) SparkML VectorAssembler.transform slow when needing to invoke .first() on sorted DataFrame

2018-03-29 Thread V Luong (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16419745#comment-16419745
 ] 

V Luong commented on SPARK-2:
-

[~bago.amirbekian] thank you, that is indeed a good solution available in Spark 
2.3.

I'm using that successfully. Will close this issue now.

> SparkML VectorAssembler.transform slow when needing to invoke .first() on 
> sorted DataFrame
> --
>
> Key: SPARK-2
> URL: https://issues.apache.org/jira/browse/SPARK-2
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib, SQL
>Affects Versions: 2.2.1
>Reporter: V Luong
>Priority: Minor
>
> Under certain circumstances, newDF = vectorAssembler.transform(oldDF) invokes 
> oldDF.first() in order to establish some metadata/attributes: 
> [https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/VectorAssembler.scala#L88.]
>  When oldDF is sorted, the above triggering of oldDF.first() can be very slow.
> For the purpose of establishing metadata, taking an arbitrary row from oldDF 
> will be just as good as taking oldDF.first(). Is there hence a way we can 
> speed up a great deal by somehow grabbing a random row, instead of relying on 
> oldDF.first()?
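
The Spark 2.3 feature referred to above is not named in this thread; assuming it is 
the VectorSizeHint transformer (which lets VectorAssembler skip the first() 
metadata probe), a usage sketch with invented column names:

{code:scala}
import org.apache.spark.ml.feature.{VectorAssembler, VectorSizeHint}

// Assumption: declaring the vector column's size up front removes the need for
// VectorAssembler to inspect a row of the (possibly sorted) DataFrame.
val sizeHint = new VectorSizeHint()
  .setInputCol("rawFeatures")
  .setSize(3)
  .setHandleInvalid("skip")

val assembler = new VectorAssembler()
  .setInputCols(Array("rawFeatures", "otherCol"))
  .setOutputCol("features")

// val prepared = assembler.transform(sizeHint.transform(oldDF))
{code}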



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-23824) Make inpurityStats publicly accessible in ml.tree.Node

2018-03-29 Thread Barry Becker (JIRA)
Barry Becker created SPARK-23824:


 Summary: Make inpurityStats publicly accessible in ml.tree.Node
 Key: SPARK-23824
 URL: https://issues.apache.org/jira/browse/SPARK-23824
 Project: Spark
  Issue Type: Wish
  Components: ML
Affects Versions: 2.1.1
Reporter: Barry Becker


This is minor, but it is also a very easy fix.

I would like to visualize the structure of a decision tree model, but currently 
the only means of obtaining the label distribution data at each node of the 
tree is hidden within each ml.tree.Node inside the impurityStats.

I'm pretty sure that the fix for this is as easy as removing the private[ml] 
qualifier from occurrences of

private[ml] def impurityStats: ImpurityCalculator
and
override private[ml] val impurityStats: ImpurityCalculator

 

As a workaround, I've put my class that needs access into the 
org.apache.spark.ml.tree package in my own repository, but I would really like 
not to have to do that.
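
The workaround above can be illustrated with a tiny accessor compiled into that 
package; this is a sketch of the reporter's approach (it assumes the stats array 
exposed by ImpurityCalculator), not code from Spark itself:

{code:scala}
// Lives in the org.apache.spark.ml.tree package so it can see the private[ml]
// member; drop this file into your own project.
package org.apache.spark.ml.tree

object NodeStatsAccess {
  // Returns the per-node class statistics held by the otherwise hidden impurityStats.
  def labelStats(node: Node): Array[Double] = node.impurityStats.stats
}
{code}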



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23823) ResolveReferences loses correct origin

2018-03-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23823?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23823:


Assignee: (was: Apache Spark)

> ResolveReferences loses correct origin
> --
>
> Key: SPARK-23823
> URL: https://issues.apache.org/jira/browse/SPARK-23823
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Jiahui Jiang
>Priority: Major
>
> Introduced in [https://github.com/apache/spark/pull/19585]
> ResolveReferences stopped doing transformUp after this change, and Attributes 
> sometimes lose their correct origin.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23823) ResolveReferences loses correct origin

2018-03-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23823?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23823:


Assignee: Apache Spark

> ResolveReferences loses correct origin
> --
>
> Key: SPARK-23823
> URL: https://issues.apache.org/jira/browse/SPARK-23823
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Jiahui Jiang
>Assignee: Apache Spark
>Priority: Major
>
> Introduced in [https://github.com/apache/spark/pull/19585]
> ResolveReferences stopped doing transformUp after this change, and Attributes 
> sometimes lose their correct origin.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23823) ResolveReferences loses correct origin

2018-03-29 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16419482#comment-16419482
 ] 

Apache Spark commented on SPARK-23823:
--

User 'JiahuiJiang' has created a pull request for this issue:
https://github.com/apache/spark/pull/20939

> ResolveReferences loses correct origin
> --
>
> Key: SPARK-23823
> URL: https://issues.apache.org/jira/browse/SPARK-23823
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Jiahui Jiang
>Priority: Major
>
> Introduced in [https://github.com/apache/spark/pull/19585]
> ResolveReferences stopped doing transformUp after this change, and Attributes 
> sometimes lose their correct origin.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-23823) ResolveReferences loses correct origin

2018-03-29 Thread Jiahui Jiang (JIRA)
Jiahui Jiang created SPARK-23823:


 Summary: ResolveReferences loses correct origin
 Key: SPARK-23823
 URL: https://issues.apache.org/jira/browse/SPARK-23823
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.3.0
Reporter: Jiahui Jiang


Introduced in [https://github.com/apache/spark/pull/19585]

ResolveReferences stopped doing transformUp after this change, and Attributes 
sometimes lose their correct origin.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-23639) SparkSQL CLI fails talk to Kerberized metastore when use proxy user

2018-03-29 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23639?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-23639.

   Resolution: Fixed
Fix Version/s: 2.3.1
   2.4.0

Issue resolved by pull request 20784
[https://github.com/apache/spark/pull/20784]

> SparkSQL CLI fails talk to Kerberized metastore when use proxy user
> ---
>
> Key: SPARK-23639
> URL: https://issues.apache.org/jira/browse/SPARK-23639
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Kent Yao
>Assignee: Kent Yao
>Priority: Major
> Fix For: 2.4.0, 2.3.1
>
>
> In SparkSQLCLI, SessionState is created before the SparkContext is instantiated. 
> When we use --proxy-user to impersonate, it is unable to initialize a metastore 
> client to talk to the secured metastore because there is no Kerberos ticket.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23639) SparkSQL CLI fails talk to Kerberized metastore when use proxy user

2018-03-29 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23639?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin reassigned SPARK-23639:
--

Assignee: Kent Yao

> SparkSQL CLI fails talk to Kerberized metastore when use proxy user
> ---
>
> Key: SPARK-23639
> URL: https://issues.apache.org/jira/browse/SPARK-23639
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Kent Yao
>Assignee: Kent Yao
>Priority: Major
>
> In SparkSQLCLI, SessionState is created before the SparkContext is instantiated. 
> When we use --proxy-user to impersonate, it is unable to initialize a metastore 
> client to talk to the secured metastore because there is no Kerberos ticket.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23785) LauncherBackend doesn't check state of connection before setting state

2018-03-29 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23785?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin reassigned SPARK-23785:
--

Assignee: Sahil Takiar

> LauncherBackend doesn't check state of connection before setting state
> --
>
> Key: SPARK-23785
> URL: https://issues.apache.org/jira/browse/SPARK-23785
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: Sahil Takiar
>Assignee: Sahil Takiar
>Priority: Major
> Fix For: 2.3.1, 2.4.0
>
>
> Found in HIVE-18533 while trying to integrate with the 
> {{InProcessLauncher}}. {{LauncherBackend}} doesn't check the state of its 
> connection to the {{LauncherServer}} before trying to run {{setState}} - 
> which sends a {{SetState}} message on the connection.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-23785) LauncherBackend doesn't check state of connection before setting state

2018-03-29 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23785?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-23785.

   Resolution: Fixed
Fix Version/s: 2.3.1
   2.4.0

Issue resolved by pull request 20893
[https://github.com/apache/spark/pull/20893]

> LauncherBackend doesn't check state of connection before setting state
> --
>
> Key: SPARK-23785
> URL: https://issues.apache.org/jira/browse/SPARK-23785
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: Sahil Takiar
>Assignee: Sahil Takiar
>Priority: Major
> Fix For: 2.4.0, 2.3.1
>
>
> Found in HIVE-18533 while trying to integrate with the 
> {{InProcessLauncher}}. {{LauncherBackend}} doesn't check the state of its 
> connection to the {{LauncherServer}} before trying to run {{setState}} - 
> which sends a {{SetState}} message on the connection.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23821) Collection function: flatten

2018-03-29 Thread Marek Novotny (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23821?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marek Novotny updated SPARK-23821:
--
Summary: Collection function: flatten  (was: Collection functions: flatten)

> Collection function: flatten
> 
>
> Key: SPARK-23821
> URL: https://issues.apache.org/jira/browse/SPARK-23821
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Marek Novotny
>Priority: Major
>
> Add the flatten function that transforms an Array of Arrays column into an 
> Array of elements column. If the array structure contains more than two levels 
> of nesting, the function removes only one nesting level.
> Example:
> {{flatten(array(array(1, 2, 3), array(4, 5, 6), array(7, 8, 9))) => 
> [1,2,3,4,5,6,7,8,9]}}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23821) Collection functions: flatten

2018-03-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23821?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23821:


Assignee: (was: Apache Spark)

> Collection functions: flatten
> -
>
> Key: SPARK-23821
> URL: https://issues.apache.org/jira/browse/SPARK-23821
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Marek Novotny
>Priority: Major
>
> Add the flatten function that transforms an Array of Arrays column into an 
> Array of elements column. If the array structure contains more than two levels 
> of nesting, the function removes only one nesting level.
> Example:
> {{flatten(array(array(1, 2, 3), array(4, 5, 6), array(7, 8, 9))) => 
> [1,2,3,4,5,6,7,8,9]}}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23821) Collection functions: flatten

2018-03-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23821?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23821:


Assignee: Apache Spark

> Collection functions: flatten
> -
>
> Key: SPARK-23821
> URL: https://issues.apache.org/jira/browse/SPARK-23821
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Marek Novotny
>Assignee: Apache Spark
>Priority: Major
>
> Add the flatten function that transforms an Array of Arrays column into an 
> Array of elements column. If the array structure contains more than two levels 
> of nesting, the function removes only one nesting level.
> Example:
> {{flatten(array(array(1, 2, 3), array(4, 5, 6), array(7, 8, 9))) => 
> [1,2,3,4,5,6,7,8,9]}}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23821) Collection functions: flatten

2018-03-29 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16419426#comment-16419426
 ] 

Apache Spark commented on SPARK-23821:
--

User 'mn-mikke' has created a pull request for this issue:
https://github.com/apache/spark/pull/20938

> Collection functions: flatten
> -
>
> Key: SPARK-23821
> URL: https://issues.apache.org/jira/browse/SPARK-23821
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Marek Novotny
>Priority: Major
>
> Add the flatten function that transforms an Array of Arrays column into an 
> Array of elements column. If the array structure contains more than two levels 
> of nesting, the function removes only one nesting level.
> Example:
> {{flatten(array(array(1, 2, 3), array(4, 5, 6), array(7, 8, 9))) => 
> [1,2,3,4,5,6,7,8,9]}}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-20498) RandomForestRegressionModel should expose getMaxDepth in PySpark

2018-03-29 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20498?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley resolved SPARK-20498.
---
   Resolution: Fixed
Fix Version/s: 2.3.0

> RandomForestRegressionModel should expose getMaxDepth in PySpark
> 
>
> Key: SPARK-20498
> URL: https://issues.apache.org/jira/browse/SPARK-20498
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Affects Versions: 2.1.0
>Reporter: Nick Lothian
>Assignee: Xin Ren
>Priority: Minor
> Fix For: 2.3.0
>
>
> Currently it isn't clear how to get the max depth of a 
> RandomForestRegressionModel (e.g., after doing a grid search).
> It is possible to call
> {{regressor._java_obj.getMaxDepth()}} 
> but most other decision trees allow
> {{regressor.getMaxDepth()}} 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23822) Improve error message for Parquet schema mismatches

2018-03-29 Thread Yuchen Huo (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23822?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuchen Huo updated SPARK-23822:
---
Description: 
If a user attempts to read Parquet files with mismatched schemas and schema 
merging is disabled then this may result in a very confusing 
UnsupportedOperationException and ParquetDecodingException errors from Parquet.

e.g.
{code:java}
Seq(("bcd")).toDF("a").coalesce(1).write.mode("overwrite").parquet(s"$path/")
Seq((1)).toDF("a").coalesce(1).write.mode("append").parquet(s"$path/")

spark.read.parquet(s"$path/").collect()
{code}
Would result in
{code:java}
Caused by: java.lang.UnsupportedOperationException: Unimplemented type: 
IntegerType

  at 
org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readBinaryBatch(VectorizedColumnReader.java:474)

  at 
org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readBatch(VectorizedColumnReader.java:214)

  at 
org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextBatch(VectorizedParquetRecordReader.java:261)

  at 
org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextKeyValue(VectorizedParquetRecordReader.java:159)

  at 
org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39)

  at 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:106)

  at 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:182)

  at 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:106)

  at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.scan_nextBatch$(Unknown
 Source)

  at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
 Source)

  at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)

  at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$11$$anon$1.hasNext(WholeStageCodegenExec.scala:617)

  at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:253)

  at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:247)

  at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:830)

  at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:830)

  at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)

  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)

  at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)

  at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)

  at org.apache.spark.scheduler.Task.run(Task.scala:109)

  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)

  at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)

  at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)

  at java.lang.Thread.run(Thread.java:748)
{code}
 

  was:
If a user attempts to read Parquet files with mismatched schemas and schema 
merging is disabled then this may result in a very confusing 
UnsupportedOperationException and ParquetDecodingException errors from Parquet.

e.g.
{code:java}
Seq(("bcd")).toDF("a").coalesce(1).write.mode("overwrite").parquet(s"$path/")
Seq((1)).toDF("a").coalesce(1).write.mode("append").parquet(s"$path/")

spark.read.parquet(s"$path/").collect()
{code}
Would result in

 


> Improve error message for Parquet schema mismatches
> ---
>
> Key: SPARK-23822
> URL: https://issues.apache.org/jira/browse/SPARK-23822
> Project: Spark
>  Issue Type: Improvement
>  Components: Input/Output
>Affects Versions: 2.3.0
>Reporter: Yuchen Huo
>Priority: Major
>
> If a user attempts to read Parquet files with mismatched schemas and schema 
> merging is disabled then this may result in a very confusing 
> UnsupportedOperationException and ParquetDecodingException errors from 
> Parquet.
> e.g.
> {code:java}
> Seq(("bcd")).toDF("a").coalesce(1).write.mode("overwrite").parquet(s"$path/")
> Seq((1)).toDF("a").coalesce(1).write.mode("append").parquet(s"$path/")
> spark.read.parquet(s"$path/").collect()
> {code}
> Would result in
> {code:java}
> Caused by: java.lang.UnsupportedOperationException: Unimplemented type: 
> IntegerType
>   at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readBinaryBatch(VectorizedColumnReader.java:474)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readBatch(VectorizedColumnReader.java:214)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReade

[jira] [Updated] (SPARK-20498) RandomForestRegressionModel should expose getMaxDepth in PySpark

2018-03-29 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20498?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-20498:
--
Target Version/s:   (was: 2.4.0)

> RandomForestRegressionModel should expose getMaxDepth in PySpark
> 
>
> Key: SPARK-20498
> URL: https://issues.apache.org/jira/browse/SPARK-20498
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Affects Versions: 2.1.0
>Reporter: Nick Lothian
>Assignee: Xin Ren
>Priority: Minor
>
> Currently it isn't clear how to get the max depth of a 
> RandomForestRegressionModel (e.g., after doing a grid search).
> It is possible to call
> {{regressor._java_obj.getMaxDepth()}} 
> but most other decision trees allow
> {{regressor.getMaxDepth()}} 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20498) RandomForestRegressionModel should expose getMaxDepth in PySpark

2018-03-29 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16419397#comment-16419397
 ] 

Joseph K. Bradley commented on SPARK-20498:
---

I'll close this since Bryan's PR mostly solved this issue, but feel free to 
reopen it if needed.  Sorry for the delay.

> RandomForestRegressionModel should expose getMaxDepth in PySpark
> 
>
> Key: SPARK-20498
> URL: https://issues.apache.org/jira/browse/SPARK-20498
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Affects Versions: 2.1.0
>Reporter: Nick Lothian
>Assignee: Xin Ren
>Priority: Minor
>
> Currently it isn't clear how to get the max depth of a 
> RandomForestRegressionModel (e.g., after doing a grid search).
> It is possible to call
> {{regressor._java_obj.getMaxDepth()}} 
> but most other decision trees allow
> {{regressor.getMaxDepth()}} 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20498) RandomForestRegressionModel should expose getMaxDepth in PySpark

2018-03-29 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20498?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-20498:
--
Shepherd:   (was: Joseph K. Bradley)

> RandomForestRegressionModel should expose getMaxDepth in PySpark
> 
>
> Key: SPARK-20498
> URL: https://issues.apache.org/jira/browse/SPARK-20498
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Affects Versions: 2.1.0
>Reporter: Nick Lothian
>Assignee: Xin Ren
>Priority: Minor
>
> Currently it isn't clear how to get the max depth of a 
> RandomForestRegressionModel (e.g., after doing a grid search).
> It is possible to call
> {{regressor._java_obj.getMaxDepth()}} 
> but most other decision trees allow
> {{regressor.getMaxDepth()}} 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-23821) Collection functions: flatten

2018-03-29 Thread Marek Novotny (JIRA)
Marek Novotny created SPARK-23821:
-

 Summary: Collection functions: flatten
 Key: SPARK-23821
 URL: https://issues.apache.org/jira/browse/SPARK-23821
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Affects Versions: 2.4.0
Reporter: Marek Novotny


Add the flatten function that transforms an Array of Arrays column into an 
Array of elements column. If the array structure contains more than two levels of 
nesting, the function removes only one nesting level.

Example:
{{flatten(array(array(1, 2, 3), array(4, 5, 6), array(7, 8, 9))) => 
[1,2,3,4,5,6,7,8,9]}}
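
A usage sketch of the eventual DataFrame surface, assuming the function lands in 
org.apache.spark.sql.functions as flatten:

{code:scala}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.flatten

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

// One row holding an array of arrays; flatten removes a single nesting level.
val df = Seq(Seq(Seq(1, 2, 3), Seq(4, 5, 6), Seq(7, 8, 9))).toDF("arrays")
df.select(flatten($"arrays")).show(false)   // [1, 2, 3, 4, 5, 6, 7, 8, 9]
{code}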



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-23822) Improve error message for Parquet schema mismatches

2018-03-29 Thread Yuchen Huo (JIRA)
Yuchen Huo created SPARK-23822:
--

 Summary: Improve error message for Parquet schema mismatches
 Key: SPARK-23822
 URL: https://issues.apache.org/jira/browse/SPARK-23822
 Project: Spark
  Issue Type: Improvement
  Components: Input/Output
Affects Versions: 2.3.0
Reporter: Yuchen Huo


If a user attempts to read Parquet files with mismatched schemas and schema 
merging is disabled then this may result in a very confusing 
UnsupportedOperationException and ParquetDecodingException errors from Parquet.

e.g.
{code:java}
Seq(("bcd")).toDF("a").coalesce(1).write.mode("overwrite").parquet(s"$path/")
Seq((1)).toDF("a").coalesce(1).write.mode("append").parquet(s"$path/")

spark.read.parquet(s"$path/").collect()
{code}
Would result in

 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-20297) Parquet Decimal(12,2) written by Spark is unreadable by Hive and Impala

2018-03-29 Thread Zoltan Ivanfi (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20297?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zoltan Ivanfi updated SPARK-20297:
--
Comment: was deleted

(was: Sorry, commented to the wrong JIRA.)

> Parquet Decimal(12,2) written by Spark is unreadable by Hive and Impala
> ---
>
> Key: SPARK-20297
> URL: https://issues.apache.org/jira/browse/SPARK-20297
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Mostafa Mokhtar
>Priority: Major
>  Labels: integration
>
> While trying to load some data using Spark 2.1 I realized that decimal(12,2) 
> columns stored in Parquet written by Spark are not readable by Hive or Impala.
> Repro 
> {code}
> CREATE TABLE customer_acctbal(
>   c_acctbal decimal(12,2))
> STORED AS Parquet;
> insert into customer_acctbal values (7539.95);
> {code}
> Error from Hive
> {code}
> Failed with exception 
> java.io.IOException:parquet.io.ParquetDecodingException: Can not read value 
> at 1 in block 0 in file 
> hdfs://server1:8020/user/hive/warehouse/tpch_nested_3000_parquet.db/customer_acctbal/part-0-03d6e3bb-fe5e-4f20-87a4-88dec955dfcd.snappy.parquet
> Time taken: 0.122 seconds
> {code}
> Error from Impala
> {code}
> File 
> 'hdfs://server:8020/user/hive/warehouse/tpch_nested_3000_parquet.db/customer_acctbal/part-0-32db4c61-fe67-4be2-9c16-b55c75c517a4.snappy.parquet'
>  has an incompatible Parquet schema for column 
> 'tpch_nested_3000_parquet.customer_acctbal.c_acctbal'. Column type: 
> DECIMAL(12,2), Parquet schema:
> optional int64 c_acctbal [i:0 d:1 r:0] (1 of 2 similar)
> {code}
> Table info 
> {code}
> hive> describe formatted customer_acctbal;
> OK
> # col_name  data_type   comment
> c_acctbal   decimal(12,2)
> # Detailed Table Information
> Database:   tpch_nested_3000_parquet
> Owner:  mmokhtar
> CreateTime: Mon Apr 10 17:47:24 PDT 2017
> LastAccessTime: UNKNOWN
> Protect Mode:   None
> Retention:  0
> Location:   
> hdfs://server1.com:8020/user/hive/warehouse/tpch_nested_3000_parquet.db/customer_acctbal
> Table Type: MANAGED_TABLE
> Table Parameters:
> COLUMN_STATS_ACCURATE   true
> numFiles1
> numRows 0
> rawDataSize 0
> totalSize   120
> transient_lastDdlTime   1491871644
> # Storage Information
> SerDe Library:  
> org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe
> InputFormat:
> org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat
> OutputFormat:   
> org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat
> Compressed: No
> Num Buckets:-1
> Bucket Columns: []
> Sort Columns:   []
> Storage Desc Params:
> serialization.format1
> Time taken: 0.032 seconds, Fetched: 31 row(s)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23704) PySpark access of individual trees in random forest is slow

2018-03-29 Thread Bryan Cutler (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23704?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler updated SPARK-23704:
-
Component/s: PySpark

> PySpark access of individual trees in random forest is slow
> ---
>
> Key: SPARK-23704
> URL: https://issues.apache.org/jira/browse/SPARK-23704
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Affects Versions: 2.2.1
> Environment: PySpark 2.2.1 / Windows 10
>Reporter: Julian King
>Priority: Minor
>
> Making predictions from a RandomForestClassifier in PySpark is much faster than 
> making predictions from an individual tree contained within the .trees 
> attribute.
> In fact, the model.transform call without an action is more than 10x slower 
> for an individual tree vs the model.transform call for the random forest 
> model.
> See 
> [https://stackoverflow.com/questions/49297470/slow-individual-tree-access-for-random-forest-in-pyspark]
>  for example with timing.
> Ideally:
>  * Getting a prediction from a single tree should be comparable to or faster 
> than getting predictions from the whole forest
>  * Getting all the predictions from all the individual trees should be 
> comparable in speed to getting the predictions from the random forest
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20297) Parquet Decimal(12,2) written by Spark is unreadable by Hive and Impala

2018-03-29 Thread Zoltan Ivanfi (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20297?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16419338#comment-16419338
 ] 

Zoltan Ivanfi commented on SPARK-20297:
---

Sorry, commented on the wrong JIRA.

> Parquet Decimal(12,2) written by Spark is unreadable by Hive and Impala
> ---
>
> Key: SPARK-20297
> URL: https://issues.apache.org/jira/browse/SPARK-20297
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Mostafa Mokhtar
>Priority: Major
>  Labels: integration
>
> While trying to load some data using Spark 2.1 I realized that decimal(12,2) 
> columns stored in Parquet written by Spark are not readable by Hive or Impala.
> Repro 
> {code}
> CREATE TABLE customer_acctbal(
>   c_acctbal decimal(12,2))
> STORED AS Parquet;
> insert into customer_acctbal values (7539.95);
> {code}
> Error from Hive
> {code}
> Failed with exception 
> java.io.IOException:parquet.io.ParquetDecodingException: Can not read value 
> at 1 in block 0 in file 
> hdfs://server1:8020/user/hive/warehouse/tpch_nested_3000_parquet.db/customer_acctbal/part-0-03d6e3bb-fe5e-4f20-87a4-88dec955dfcd.snappy.parquet
> Time taken: 0.122 seconds
> {code}
> Error from Impala
> {code}
> File 
> 'hdfs://server:8020/user/hive/warehouse/tpch_nested_3000_parquet.db/customer_acctbal/part-0-32db4c61-fe67-4be2-9c16-b55c75c517a4.snappy.parquet'
>  has an incompatible Parquet schema for column 
> 'tpch_nested_3000_parquet.customer_acctbal.c_acctbal'. Column type: 
> DECIMAL(12,2), Parquet schema:
> optional int64 c_acctbal [i:0 d:1 r:0] (1 of 2 similar)
> {code}
> Table info 
> {code}
> hive> describe formatted customer_acctbal;
> OK
> # col_name  data_type   comment
> c_acctbal   decimal(12,2)
> # Detailed Table Information
> Database:   tpch_nested_3000_parquet
> Owner:  mmokhtar
> CreateTime: Mon Apr 10 17:47:24 PDT 2017
> LastAccessTime: UNKNOWN
> Protect Mode:   None
> Retention:  0
> Location:   
> hdfs://server1.com:8020/user/hive/warehouse/tpch_nested_3000_parquet.db/customer_acctbal
> Table Type: MANAGED_TABLE
> Table Parameters:
> COLUMN_STATS_ACCURATE   true
> numFiles 1
> numRows 0
> rawDataSize 0
> totalSize   120
> transient_lastDdlTime   1491871644
> # Storage Information
> SerDe Library:  
> org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe
> InputFormat:
> org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat
> OutputFormat:   
> org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat
> Compressed: No
> Num Buckets:-1
> Bucket Columns: []
> Sort Columns:   []
> Storage Desc Params:
> serialization.format 1
> Time taken: 0.032 seconds, Fetched: 31 row(s)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-20297) Parquet Decimal(12,2) written by Spark is unreadable by Hive and Impala

2018-03-29 Thread Zoltan Ivanfi (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20297?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zoltan Ivanfi updated SPARK-20297:
--
Comment: was deleted

(was: Could you please clarify how those DECIMALS were written in the first 
place?
 * If some manual configuration was done to allow Spark to choose this 
representation, then we are fine.
 * If an upstream Spark version wrote data using this representation by 
default, that's a valid reason to feel mildly uncomfortable.
 * If a downstream Spark version wrote data using this representation by 
default, then we should open a JIRA to prevent CDH Spark from doing so until 
Hive and Impala support it.

Thanks!)

> Parquet Decimal(12,2) written by Spark is unreadable by Hive and Impala
> ---
>
> Key: SPARK-20297
> URL: https://issues.apache.org/jira/browse/SPARK-20297
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Mostafa Mokhtar
>Priority: Major
>  Labels: integration
>
> While trying to load some data using Spark 2.1 I realized that decimal(12,2) 
> columns stored in Parquet written by Spark are not readable by Hive or Impala.
> Repro 
> {code}
> CREATE TABLE customer_acctbal(
>   c_acctbal decimal(12,2))
> STORED AS Parquet;
> insert into customer_acctbal values (7539.95);
> {code}
> Error from Hive
> {code}
> Failed with exception 
> java.io.IOException:parquet.io.ParquetDecodingException: Can not read value 
> at 1 in block 0 in file 
> hdfs://server1:8020/user/hive/warehouse/tpch_nested_3000_parquet.db/customer_acctbal/part-0-03d6e3bb-fe5e-4f20-87a4-88dec955dfcd.snappy.parquet
> Time taken: 0.122 seconds
> {code}
> Error from Impala
> {code}
> File 
> 'hdfs://server:8020/user/hive/warehouse/tpch_nested_3000_parquet.db/customer_acctbal/part-0-32db4c61-fe67-4be2-9c16-b55c75c517a4.snappy.parquet'
>  has an incompatible Parquet schema for column 
> 'tpch_nested_3000_parquet.customer_acctbal.c_acctbal'. Column type: 
> DECIMAL(12,2), Parquet schema:
> optional int64 c_acctbal [i:0 d:1 r:0] (1 of 2 similar)
> {code}
> Table info 
> {code}
> hive> describe formatted customer_acctbal;
> OK
> # col_name  data_type   comment
> c_acctbal   decimal(12,2)
> # Detailed Table Information
> Database:   tpch_nested_3000_parquet
> Owner:  mmokhtar
> CreateTime: Mon Apr 10 17:47:24 PDT 2017
> LastAccessTime: UNKNOWN
> Protect Mode:   None
> Retention:  0
> Location:   
> hdfs://server1.com:8020/user/hive/warehouse/tpch_nested_3000_parquet.db/customer_acctbal
> Table Type: MANAGED_TABLE
> Table Parameters:
> COLUMN_STATS_ACCURATE   true
> numFiles 1
> numRows 0
> rawDataSize 0
> totalSize   120
> transient_lastDdlTime   1491871644
> # Storage Information
> SerDe Library:  
> org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe
> InputFormat:
> org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat
> OutputFormat:   
> org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat
> Compressed: No
> Num Buckets:-1
> Bucket Columns: []
> Sort Columns:   []
> Storage Desc Params:
> serialization.format 1
> Time taken: 0.032 seconds, Fetched: 31 row(s)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20297) Parquet Decimal(12,2) written by Spark is unreadable by Hive and Impala

2018-03-29 Thread Zoltan Ivanfi (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20297?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16419335#comment-16419335
 ] 

Zoltan Ivanfi commented on SPARK-20297:
---

Could you please clarify how those DECIMALS were written in the first place?
 * If some manual configuration was done to allow Spark to choose this 
representation, then we are fine.
 * If an upstream Spark version wrote data using this representation by 
default, that's a valid reason to feel mildly uncomfortable.
 * If a downstream Spark version wrote data using this representation by 
default, then we should open a JIRA to prevent CDH Spark from doing so until 
Hive and Impala support it.

Thanks!

> Parquet Decimal(12,2) written by Spark is unreadable by Hive and Impala
> ---
>
> Key: SPARK-20297
> URL: https://issues.apache.org/jira/browse/SPARK-20297
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Mostafa Mokhtar
>Priority: Major
>  Labels: integration
>
> While trying to load some data using Spark 2.1 I realized that decimal(12,2) 
> columns stored in Parquet written by Spark are not readable by Hive or Impala.
> Repro 
> {code}
> CREATE TABLE customer_acctbal(
>   c_acctbal decimal(12,2))
> STORED AS Parquet;
> insert into customer_acctbal values (7539.95);
> {code}
> Error from Hive
> {code}
> Failed with exception 
> java.io.IOException:parquet.io.ParquetDecodingException: Can not read value 
> at 1 in block 0 in file 
> hdfs://server1:8020/user/hive/warehouse/tpch_nested_3000_parquet.db/customer_acctbal/part-0-03d6e3bb-fe5e-4f20-87a4-88dec955dfcd.snappy.parquet
> Time taken: 0.122 seconds
> {code}
> Error from Impala
> {code}
> File 
> 'hdfs://server:8020/user/hive/warehouse/tpch_nested_3000_parquet.db/customer_acctbal/part-0-32db4c61-fe67-4be2-9c16-b55c75c517a4.snappy.parquet'
>  has an incompatible Parquet schema for column 
> 'tpch_nested_3000_parquet.customer_acctbal.c_acctbal'. Column type: 
> DECIMAL(12,2), Parquet schema:
> optional int64 c_acctbal [i:0 d:1 r:0] (1 of 2 similar)
> {code}
> Table info 
> {code}
> hive> describe formatted customer_acctbal;
> OK
> # col_name  data_type   comment
> c_acctbal   decimal(12,2)
> # Detailed Table Information
> Database:   tpch_nested_3000_parquet
> Owner:  mmokhtar
> CreateTime: Mon Apr 10 17:47:24 PDT 2017
> LastAccessTime: UNKNOWN
> Protect Mode:   None
> Retention:  0
> Location:   
> hdfs://server1.com:8020/user/hive/warehouse/tpch_nested_3000_parquet.db/customer_acctbal
> Table Type: MANAGED_TABLE
> Table Parameters:
> COLUMN_STATS_ACCURATE   true
> numFiles 1
> numRows 0
> rawDataSize 0
> totalSize   120
> transient_lastDdlTime   1491871644
> # Storage Information
> SerDe Library:  
> org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe
> InputFormat:
> org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat
> OutputFormat:   
> org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat
> Compressed: No
> Num Buckets:-1
> Bucket Columns: []
> Sort Columns:   []
> Storage Desc Params:
> serialization.format 1
> Time taken: 0.032 seconds, Fetched: 31 row(s)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23503) continuous execution should sequence committed epochs

2018-03-29 Thread Efim Poberezkin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23503?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16419136#comment-16419136
 ] 

Efim Poberezkin commented on SPARK-23503:
-

[~joseph.torres] Good day Jose. From what I've gathered about the Continuous 
Execution implementation, an epoch coordinator is created per streaming query 
run and is able to store state. I've added tracking of the last committed epoch 
and of waiting epochs to it to enforce epoch sequencing. Could you please 
correct me if my understanding of your implementation is wrong, and take a look 
at my approach?
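
For reference, a minimal sketch of the sequencing idea described above, assuming a 
commit callback supplied by the coordinator; the names ({{EpochSequencer}}, 
{{onEpochReady}}) are illustrative and not the actual EpochCoordinator API:
{code:java}
import scala.collection.mutable

// Illustrative only: commit epochs strictly in order, buffering any epoch that
// becomes ready before its predecessor has been committed.
class EpochSequencer(commit: Long => Unit) {
  private var lastCommittedEpoch = -1L
  private val waiting = mutable.Set.empty[Long]

  // Called whenever `epoch` is reported ready to commit, in any order.
  def onEpochReady(epoch: Long): Unit = {
    waiting += epoch
    while (waiting.contains(lastCommittedEpoch + 1)) {
      lastCommittedEpoch += 1
      waiting -= lastCommittedEpoch
      commit(lastCommittedEpoch) // fires only after every earlier epoch is committed
    }
  }
}
{code}
With this shape, a ready signal for epoch n + 1 that arrives first simply waits 
until epoch n has been committed.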

> continuous execution should sequence committed epochs
> -
>
> Key: SPARK-23503
> URL: https://issues.apache.org/jira/browse/SPARK-23503
> Project: Spark
>  Issue Type: Sub-task
>  Components: Structured Streaming
>Affects Versions: 2.4.0
>Reporter: Jose Torres
>Priority: Major
>
> Currently, the EpochCoordinator doesn't enforce a commit order. If a message 
> for epoch n gets lost in the ether, and epoch n + 1 happens to be ready for 
> commit earlier, epoch n + 1 will be committed.
>  
> This is either incorrect or needlessly confusing, because it's not safe to 
> start from the end offset of epoch n + 1 until epoch n is committed. 
> EpochCoordinator should enforce this sequencing.
>  
> Note that this is not actually a problem right now, because the commit 
> messages go through the same RPC channel from the same place. But we 
> shouldn't implicitly bake this assumption in.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23723) New charset option for json datasource

2018-03-29 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23723?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16419123#comment-16419123
 ] 

Apache Spark commented on SPARK-23723:
--

User 'MaxGekk' has created a pull request for this issue:
https://github.com/apache/spark/pull/20937

> New charset option for json datasource
> --
>
> Key: SPARK-23723
> URL: https://issues.apache.org/jira/browse/SPARK-23723
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Maxim Gekk
>Priority: Major
>
> Currently JSON Reader can read json files in different charset/encodings. The 
> JSON Reader uses the jackson-json library to automatically detect the charset 
> of input text/stream. Here you can see the method which detects encoding: 
> [https://github.com/FasterXML/jackson-core/blob/master/src/main/java/com/fasterxml/jackson/core/json/ByteSourceJsonBootstrapper.java#L111-L174]
>  
> The detectEncoding method checks the BOM 
> ([https://en.wikipedia.org/wiki/Byte_order_mark]) at the beginning of a text. 
> The BOM can be in the file but it is not mandatory. If it is not present, the 
> auto-detection mechanism can select the wrong charset and, as a consequence, 
> the user cannot read the json file. *The proposed option will allow users to 
> bypass the auto-detection mechanism and set the charset explicitly.*
>  
> The charset option is already exposed as a CSV option: 
> [https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVOptions.scala#L87-L88]
>  . I propose to add the same option for JSON.
>  
> Regarding the JSON writer, *the charset option will give users the 
> opportunity* to read json files in a charset different from UTF-8, modify the 
> dataset and *write the results back to json files in the original encoding.* At 
> the moment this is not possible because the result can only be saved in UTF-8.
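
A hypothetical usage sketch of the proposed option, assuming a SparkSession 
available as {{spark}}; the option name, encoding and paths below are 
illustrative, and the final API is whatever the pull request settles on:
{code:java}
import org.apache.spark.sql.functions.lit

// Illustrative only: declare the encoding explicitly instead of relying on
// jackson's BOM-based auto detection, then write back in the same encoding.
val df = spark.read
  .option("charset", "UTF-16LE")
  .json("/data/events-utf16le.json")

df.withColumn("processed", lit(true))
  .write
  .option("charset", "UTF-16LE")
  .json("/data/events-processed")
{code}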



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23724) Custom record separator for jsons in charsets different from UTF-8

2018-03-29 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16419124#comment-16419124
 ] 

Apache Spark commented on SPARK-23724:
--

User 'MaxGekk' has created a pull request for this issue:
https://github.com/apache/spark/pull/20937

> Custom record separator for jsons in charsets different from UTF-8
> --
>
> Key: SPARK-23724
> URL: https://issues.apache.org/jira/browse/SPARK-23724
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Maxim Gekk
>Priority: Major
>
> The option should define a sequence of bytes between two consecutive json 
> records. Currently the separator is detected automatically by hadoop library:
>  
> [https://github.com/apache/hadoop/blob/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/util/LineReader.java#L185-L254]
>  
> The method is able to recognize only *\r, \n* and *\r\n* in UTF-8 encoding. 
> It doesn't work in cases where the encoding of the input stream is different from 
> UTF-8. The option should allow users to explicitly set the separator/delimiter of 
> json records.
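
A hypothetical sketch of what such an option could look like from the user side, 
assuming a SparkSession available as {{spark}}; the option name 
{{recordSeparator}} is purely illustrative and does not exist yet (the charset 
option is the one proposed in SPARK-23723):
{code:java}
// Illustrative only: declare both the encoding and the byte sequence that
// separates json records, instead of relying on LineReader's auto detection.
val df = spark.read
  .option("charset", "UTF-16LE")       // proposed in SPARK-23723
  .option("recordSeparator", "\r\n")   // hypothetical option name
  .json("/data/events-utf16le.json")
{code}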



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23503) continuous execution should sequence committed epochs

2018-03-29 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23503?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16419111#comment-16419111
 ] 

Apache Spark commented on SPARK-23503:
--

User 'efimpoberezkin' has created a pull request for this issue:
https://github.com/apache/spark/pull/20936

> continuous execution should sequence committed epochs
> -
>
> Key: SPARK-23503
> URL: https://issues.apache.org/jira/browse/SPARK-23503
> Project: Spark
>  Issue Type: Sub-task
>  Components: Structured Streaming
>Affects Versions: 2.4.0
>Reporter: Jose Torres
>Priority: Major
>
> Currently, the EpochCoordinator doesn't enforce a commit order. If a message 
> for epoch n gets lost in the ether, and epoch n + 1 happens to be ready for 
> commit earlier, epoch n + 1 will be committed.
>  
> This is either incorrect or needlessly confusing, because it's not safe to 
> start from the end offset of epoch n + 1 until epoch n is committed. 
> EpochCoordinator should enforce this sequencing.
>  
> Note that this is not actually a problem right now, because the commit 
> messages go through the same RPC channel from the same place. But we 
> shouldn't implicitly bake this assumption in.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23503) continuous execution should sequence committed epochs

2018-03-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23503?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23503:


Assignee: (was: Apache Spark)

> continuous execution should sequence committed epochs
> -
>
> Key: SPARK-23503
> URL: https://issues.apache.org/jira/browse/SPARK-23503
> Project: Spark
>  Issue Type: Sub-task
>  Components: Structured Streaming
>Affects Versions: 2.4.0
>Reporter: Jose Torres
>Priority: Major
>
> Currently, the EpochCoordinator doesn't enforce a commit order. If a message 
> for epoch n gets lost in the ether, and epoch n + 1 happens to be ready for 
> commit earlier, epoch n + 1 will be committed.
>  
> This is either incorrect or needlessly confusing, because it's not safe to 
> start from the end offset of epoch n + 1 until epoch n is committed. 
> EpochCoordinator should enforce this sequencing.
>  
> Note that this is not actually a problem right now, because the commit 
> messages go through the same RPC channel from the same place. But we 
> shouldn't implicitly bake this assumption in.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23503) continuous execution should sequence committed epochs

2018-03-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23503?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23503:


Assignee: Apache Spark

> continuous execution should sequence committed epochs
> -
>
> Key: SPARK-23503
> URL: https://issues.apache.org/jira/browse/SPARK-23503
> Project: Spark
>  Issue Type: Sub-task
>  Components: Structured Streaming
>Affects Versions: 2.4.0
>Reporter: Jose Torres
>Assignee: Apache Spark
>Priority: Major
>
> Currently, the EpochCoordinator doesn't enforce a commit order. If a message 
> for epoch n gets lost in the ether, and epoch n + 1 happens to be ready for 
> commit earlier, epoch n + 1 will be committed.
>  
> This is either incorrect or needlessly confusing, because it's not safe to 
> start from the end offset of epoch n + 1 until epoch n is committed. 
> EpochCoordinator should enforce this sequencing.
>  
> Note that this is not actually a problem right now, because the commit 
> messages go through the same RPC channel from the same place. But we 
> shouldn't implicitly bake this assumption in.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21960) Spark Streaming Dynamic Allocation should respect spark.executor.instances

2018-03-29 Thread Leonel Atencio (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16419104#comment-16419104
 ] 

Leonel Atencio commented on SPARK-21960:


This is a very important issue, because right now Streaming DRA does not 
include an `initExecutors` property. Some streaming jobs need a minimum number 
of executors to work properly, yet all you can do is wait for the DRA 
algorithm to add executors until that minimum is reached. This is an issue.
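
For context, a sketch of the configuration combination involved, using the 
{{spark.streaming.dynamicAllocation.*}} keys read by the streaming 
ExecutorAllocationManager; the values are illustrative only:
{code:java}
import org.apache.spark.SparkConf

// Illustrative only: streaming DRA currently requires spark.executor.instances
// to be unset (or 0), so a positive value inherited from spark-defaults.conf
// has to be removed before the check discussed in this ticket will pass.
val conf = new SparkConf()
  .set("spark.streaming.dynamicAllocation.enabled", "true")
  .set("spark.streaming.dynamicAllocation.minExecutors", "4")  // desired floor
  .set("spark.streaming.dynamicAllocation.maxExecutors", "20")
// .set("spark.executor.instances", "8")  // would make the validation fail today
{code}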

> Spark Streaming Dynamic Allocation should respect spark.executor.instances
> --
>
> Key: SPARK-21960
> URL: https://issues.apache.org/jira/browse/SPARK-21960
> Project: Spark
>  Issue Type: Improvement
>  Components: DStreams
>Affects Versions: 2.2.0
>Reporter: Karthik Palaniappan
>Priority: Minor
>
> This check enforces that spark.executor.instances (aka --num-executors) is 
> either unset or explicitly set to 0. 
> https://github.com/apache/spark/blob/v2.2.0/streaming/src/main/scala/org/apache/spark/streaming/scheduler/ExecutorAllocationManager.scala#L207
> If spark.executor.instances is unset, the check is fine, and the property 
> defaults to 2. Spark requests the cluster manager for 2 executors to start 
> with, then adds/removes executors appropriately.
> However, if you explicitly set it to 0, the check also succeeds, but Spark 
> never asks the cluster manager for any executors. When running on YARN, I 
> repeatedly saw:
> {code:java}
> 17/08/22 19:35:21 WARN org.apache.spark.scheduler.cluster.YarnScheduler: 
> Initial job has not accepted any resources; check your cluster UI to ensure 
> that workers are registered and have sufficient resources
> 17/08/22 19:35:36 WARN org.apache.spark.scheduler.cluster.YarnScheduler: 
> Initial job has not accepted any resources; check your cluster UI to ensure 
> that workers are registered and have sufficient resources
> 17/08/22 19:35:51 WARN org.apache.spark.scheduler.cluster.YarnScheduler: 
> Initial job has not accepted any resources; check your cluster UI to ensure 
> that workers are registered and have sufficient resources
> {code}
> I noticed that at least Google Dataproc and Ambari explicitly set 
> spark.executor.instances to a positive number, meaning that to use dynamic 
> allocation, you would have to edit spark-defaults.conf to remove the 
> property. That's obnoxious.
> In addition, in Spark 2.3, spark-submit will refuse to accept "0" as a value 
> for --num-executors or --conf spark.executor.instances: 
> https://github.com/apache/spark/commit/0fd84b05dc9ac3de240791e2d4200d8bdffbb01a#diff-63a5d817d2d45ae24de577f6a1bd80f9
> It is much more reasonable for Streaming DRA to use spark.executor.instances, 
> just like Core DRA. I'll open a pull request to remove the check if there are 
> no objections.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-23820) Allow the long form of call sites to be recorded in the log

2018-03-29 Thread Michael Mior (JIRA)
Michael Mior created SPARK-23820:


 Summary: Allow the long form of call sites to be recorded in the 
log
 Key: SPARK-23820
 URL: https://issues.apache.org/jira/browse/SPARK-23820
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 2.3.0
Reporter: Michael Mior


It would be nice if the long form of the callsite information could be included 
in the log. An example of what I'm proposing is here: 
https://github.com/michaelmior/spark/commit/4b4076cfb1d51ceb20fd2b0a3b1b5be2aebb6416



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-991) Report call sites of operators in Python

2018-03-29 Thread Michael Mior (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-991?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16419099#comment-16419099
 ] 

Michael Mior commented on SPARK-991:


Maybe I'm missing something here, but I'm not seeing a Python stack trace 
captured when I look at the RDD call site information.

> Report call sites of operators in Python
> 
>
> Key: SPARK-991
> URL: https://issues.apache.org/jira/browse/SPARK-991
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Reporter: Matei Zaharia
>Priority: Major
>  Labels: Starter
> Fix For: 0.9.0
>
>
> Similar to our call site reporting in Java, we can get a stack trace in 
> Python with traceback.extract_stack. It would be nice to set that stack trace 
> as the name on the corresponding RDDs so it can appear in the UI.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22618) RDD.unpersist can cause fatal exception when used with dynamic allocation

2018-03-29 Thread Brad (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22618?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16419062#comment-16419062
 ] 

Brad commented on SPARK-22618:
--

Yeah, the fix for broadcast unpersist should be basically the same. Thanks 
Thomas.
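
A sketch of the effective behaviour being proposed, written here as a user-side 
wrapper rather than the actual patch (the real change lives inside Spark, where 
the IOException is raised; at the driver it typically surfaces wrapped in a 
SparkException):
{code:java}
import java.io.IOException
import org.apache.spark.SparkException
import org.apache.spark.rdd.RDD

// Illustrative only: if removal fails because an executor went away mid-request,
// log and continue instead of letting the exception kill the job.
def unpersistQuietly[T](rdd: RDD[T], blocking: Boolean = true): Unit = {
  try {
    rdd.unpersist(blocking)
  } catch {
    case e @ (_: IOException | _: SparkException) =>
      // The executor holding the blocks likely died (e.g. removed by dynamic
      // allocation); its cached partitions are gone anyway.
      println(s"Ignoring failure while unpersisting RDD ${rdd.id}: ${e.getMessage}")
  }
}
{code}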

> RDD.unpersist can cause fatal exception when used with dynamic allocation
> -
>
> Key: SPARK-22618
> URL: https://issues.apache.org/jira/browse/SPARK-22618
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: Brad
>Assignee: Brad
>Priority: Minor
> Fix For: 2.3.0
>
>
> If you use rdd.unpersist() with dynamic allocation, then an executor can be 
> deallocated while your rdd is being removed, which will throw an uncaught 
> exception killing your job. 
> I looked into different ways of preventing this error from occurring but 
> couldn't come up with anything that wouldn't require a big change. I propose 
> the best fix is just to catch and log IOExceptions in unpersist() so they 
> don't kill your job. This will match the effective behavior when executors 
> are lost from dynamic allocation in other parts of the code.
> In the worst case scenario I think this could lead to RDD partitions getting 
> left on executors after they were unpersisted, but this is probably better 
> than the whole job failing. I think in most cases the IOException would be 
> due to the executor dying for some reason, which is effectively the same 
> result as unpersisting the rdd from that executor anyway.
> I noticed this exception in a job that loads a 100GB dataset on a cluster 
> where we use dynamic allocation heavily. Here is the relevant stack trace
> java.io.IOException: Connection reset by peer
> at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
> at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
> at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
> at sun.nio.ch.IOUtil.read(IOUtil.java:192)
> at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:380)
> at 
> io.netty.buffer.PooledUnsafeDirectByteBuf.setBytes(PooledUnsafeDirectByteBuf.java:221)
> at 
> io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:899)
> at 
> io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:276)
> at 
> io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:119)
> at 
> io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:645)
> at 
> io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:580)
> at 
> io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:497)
> at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:459)
> at 
> io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:131)
> at 
> io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:138)
> at java.lang.Thread.run(Thread.java:748)
> Exception in thread "main" org.apache.spark.SparkException: Exception thrown 
> in awaitResult:
> at 
> org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:205)
> at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
> at 
> org.apache.spark.storage.BlockManagerMaster.removeRdd(BlockManagerMaster.scala:131)
> at org.apache.spark.SparkContext.unpersistRDD(SparkContext.scala:1806)
> at org.apache.spark.rdd.RDD.unpersist(RDD.scala:217)
> at 
> com.ibm.sparktc.sparkbench.workload.exercise.CacheTest.doWorkload(CacheTest.scala:62)
> at 
> com.ibm.sparktc.sparkbench.workload.Workload$class.run(Workload.scala:40)
> at 
> com.ibm.sparktc.sparkbench.workload.exercise.CacheTest.run(CacheTest.scala:33)
> at 
> com.ibm.sparktc.sparkbench.workload.SuiteKickoff$$anonfun$com$ibm$sparktc$sparkbench$workload$SuiteKickoff$$runSerially$1.apply(SuiteKickoff.scala:78)
> at 
> com.ibm.sparktc.sparkbench.workload.SuiteKickoff$$anonfun$com$ibm$sparktc$sparkbench$workload$SuiteKickoff$$runSerially$1.apply(SuiteKickoff.scala:78)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
> at scala.collection.immutable.List.foreach(List.scala:381)
> at 
> scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
> at scala.collection.immutable.List.map(List.scala:285)
> at 
> com.ibm.sparktc.sparkbench.workload.SuiteKickoff$.com$ibm$sparktc$sparkbench$workload$SuiteKickoff$$runSeria

[jira] [Commented] (SPARK-23534) Spark run on Hadoop 3.0.0

2018-03-29 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16419053#comment-16419053
 ] 

Steve Loughran commented on SPARK-23534:


[~nchammas]

bq. Cloudera still ships 2.6 and EMR is on 2.7.

EMR and google have 2.8.x options & CDH may be based on 2.6 but they've 
backported and updated a lot. The aws & azure connectors on (HDP, CDH, 
HD/Insights) are all very much 2.8.x+ source and dependencies

bq. Hadoop releases have historically been very disruptive, with 
backwards-incompatible changes even in minor releases. 

I don't disagree. But having the profiles so you can build and test this stuff 
is how to find out the problems. The HBase team have already been fairly busy, 
driving the development of a shaded client adequate for their needs...spark & 
hadoop 3 helps identify & fix regressions, push that shaded jar to be complete, 
etc.  Profiles => build & tests => bug reports => fixes.

[~Bidek]

bq. I am afraid of using of older version of azure-storage because of all the 
security issues that have been found and fixed in the newer version, 

Hadoop 3.x is still shipping with Azure storage 5.4.0, and as that upgrade came 
from Microsoft (HADOOP-14662) I think they'll be up to speed with any security 
issues. That's in Hadoop 2.9.0 BTW.

If you really want that storage lib, build Spark with {{-Phadoop-2.7 
-Dhadoop.version=2.9.0}}. Hive will be happy.

> Spark run on Hadoop 3.0.0
> -
>
> Key: SPARK-23534
> URL: https://issues.apache.org/jira/browse/SPARK-23534
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 2.3.0
>Reporter: Saisai Shao
>Priority: Major
>
> Major Hadoop vendors have already adopted, or will soon adopt, Hadoop 3.0. So we should also make 
> sure Spark can run with Hadoop 3.0. This Jira tracks the work to make Spark 
> run on Hadoop 3.0.
> The work includes:
>  # Add a Hadoop 3.0.0 new profile to make Spark build-able with Hadoop 3.0.
>  # Test to see if there's dependency issues with Hadoop 3.0.
>  # Investigating the feasibility to use shaded client jars (HADOOP-11804).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23807) Add Hadoop 3 profile with relevant POM fix ups, cloud-storage artifacts and binding

2018-03-29 Thread Steve Loughran (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23807?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steve Loughran updated SPARK-23807:
---
Issue Type: Sub-task  (was: Improvement)
Parent: SPARK-23534

> Add Hadoop 3 profile with relevant POM fix ups, cloud-storage artifacts and 
> binding
> ---
>
> Key: SPARK-23807
> URL: https://issues.apache.org/jira/browse/SPARK-23807
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Affects Versions: 2.4.0
>Reporter: Steve Loughran
>Priority: Major
>
> Hadoop 3, and in particular Hadoop 3.1, adds:
>  * Java 8 as the minimum (and currently sole) supported Java version
>  * A new "hadoop-cloud-storage" module intended to be a minimal dependency 
> POM for all the cloud connectors in the version of hadoop built against
>  * The ability to declare a committer for any FileOutputFormat which 
> supercedes the classic FileOutputCommitter -in both a job and for a specific 
> FS URI
>  * A shaded client JAR, though not yet one complete enough for spark.
>  * Lots of other features and fixes.
> The basic work of building spark with hadoop 3 is just a matter of doing the build 
> with {{-Dhadoop.version=3.x.y}}; however that
>  * Doesn't build on SBT (dependency resolution of zookeeper JAR)
>  * Misses the new cloud features
> The ZK dependency can be fixed everywhere by explicitly declaring the ZK 
> artifact, instead of relying on curator to pull it in; this needs a profile 
> to declare the right ZK version, obviously.
> To use the cloud features, the spark hadoop-3 profile should declare that the 
> spark-hadoop-cloud module depends on —and only on— the 
> hadoop/hadoop-cloud-storage module for its transitive dependencies on cloud 
> storage, and a source package which is only built and tested when built 
> against Hadoop 3.1+
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23807) Add Hadoop 3 profile with relevant POM fix ups, cloud-storage artifacts and binding

2018-03-29 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23807?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16419017#comment-16419017
 ] 

Steve Loughran commented on SPARK-23807:


yes, this profile is part of the hadoop 3 support, "necessary but not 
sufficient", given the hive issue

> Add Hadoop 3 profile with relevant POM fix ups, cloud-storage artifacts and 
> binding
> ---
>
> Key: SPARK-23807
> URL: https://issues.apache.org/jira/browse/SPARK-23807
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Affects Versions: 2.4.0
>Reporter: Steve Loughran
>Priority: Major
>
> Hadoop 3, and in particular Hadoop 3.1, adds:
>  * Java 8 as the minimum (and currently sole) supported Java version
>  * A new "hadoop-cloud-storage" module intended to be a minimal dependency 
> POM for all the cloud connectors in the version of hadoop built against
>  * The ability to declare a committer for any FileOutputFormat which 
> supercedes the classic FileOutputCommitter -in both a job and for a specific 
> FS URI
>  * A shaded client JAR, though not yet one complete enough for spark.
>  * Lots of other features and fixes.
> The basic work of building spark with hadoop 3 is just a matter of doing the build 
> with {{-Dhadoop.version=3.x.y}}; however that
>  * Doesn't build on SBT (dependency resolution of zookeeper JAR)
>  * Misses the new cloud features
> The ZK dependency can be fixed everywhere by explicitly declaring the ZK 
> artifact, instead of relying on curator to pull it in; this needs a profile 
> to declare the right ZK version, obviously.
> To use the cloud features, the spark hadoop-3 profile should declare that the 
> spark-hadoop-cloud module depends on —and only on— the 
> hadoop/hadoop-cloud-storage module for its transitive dependencies on cloud 
> storage, and a source package which is only built and tested when built 
> against Hadoop 3.1+
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15125) CSV data source recognizes empty quoted strings in the input as null.

2018-03-29 Thread Max Murphy (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15125?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16418925#comment-16418925
 ] 

Max Murphy commented on SPARK-15125:


[~snanda] That would not allow ,, to be distinguished from ,"", whereas we want 
,, => null and ,"", => "".

> CSV data source recognizes empty quoted strings in the input as null. 
> --
>
> Key: SPARK-15125
> URL: https://issues.apache.org/jira/browse/SPARK-15125
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Suresh Thalamati
>Priority: Major
>
> The CSV data source does not differentiate between empty quoted strings and empty 
> fields; both are read as null. In some scenarios users would want to differentiate 
> between these values, especially in the context of SQL where NULL and empty string 
> have different meanings. If the input data happens to be a dump from a traditional 
> relational data source, users will see different results for their SQL queries. 
> {code}
> Repro:
> Test Data: (test.csv)
> year,make,model,comment,price
> 2017,Tesla,Mode 3,looks nice.,35000.99
> 2016,Chevy,Bolt,"",29000.00
> 2015,Porsche,"",,
> scala> val df= sqlContext.read.format("csv").option("header", 
> "true").option("inferSchema", "true").option("nullValue", 
> null).load("/tmp/test.csv")
> df: org.apache.spark.sql.DataFrame = [year: int, make: string ... 3 more 
> fields]
> scala> df.show
> ++---+--+---++
> |year|   make| model|comment|   price|
> ++---+--+---++
> |2017|  Tesla|Mode 3|looks nice.|35000.99|
> |2016|  Chevy|  Bolt|   null| 29000.0|
> |2015|Porsche|  null|   null|null|
> ++---+--+---++
> Expected:
> ++---+--+---++
> |year|   make| model|comment|   price|
> ++---+--+---++
> |2017|  Tesla|Mode 3|looks nice.|35000.99|
> |2016|  Chevy|  Bolt|   | 29000.0|
> |2015|Porsche|  |   null|null|
> ++---+--+---++
> {code}
> Testing a fix for the this issue. I will give a shot at submitting a PR for 
> this soon. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23811) FetchFailed comes before Success of same task will cause child stage never succeed

2018-03-29 Thread Li Yuanjian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23811?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Li Yuanjian updated SPARK-23811:

Summary: FetchFailed comes before Success of same task will cause child 
stage never succeed  (was: Same tasks' FetchFailed event comes before Success 
will cause child stage never succeed)

> FetchFailed comes before Success of same task will cause child stage never 
> succeed
> --
>
> Key: SPARK-23811
> URL: https://issues.apache.org/jira/browse/SPARK-23811
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0, 2.3.0
>Reporter: Li Yuanjian
>Priority: Major
> Attachments: 1.png, 2.png
>
>
> This is a bug caused by an abnormal scenario described below:
>  # ShuffleMapTask 1.0 running, this task will fetch data from ExecutorA
>  # ExecutorA Lost, trigger `mapOutputTracker.removeOutputsOnExecutor(execId)` 
> , shuffleStatus changed.
>  # Speculative ShuffleMapTask 1.1 start, got a FetchFailed immediately.
>  # ShuffleMapTask 1 is the last task of its stage, so this stage will never 
> succeed because there is no missing task the DAGScheduler can get.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23819) InMemoryTableScanExec prunes orderable complex types due to out of date ColumnStats

2018-03-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23819?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23819:


Assignee: Apache Spark

> InMemoryTableScanExec prunes orderable complex types due to out of date 
> ColumnStats
> ---
>
> Key: SPARK-23819
> URL: https://issues.apache.org/jira/browse/SPARK-23819
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Patrick Woody
>Assignee: Apache Spark
>Priority: Major
>
> The set of data types that can be compared via BinaryComparison was expanded in 
> SPARK-21110 to include Arrays/Structs/etc, but ColumnStats still has 
> hard-coded upper/lower bounds for these types.
> InMemoryTableScanExec used to be safe against these comparisons because the 
> predicate would fail type checking. Now that it passes, the statistics 
> unintentionally allow pruning of the partition, causing correctness issues.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23819) InMemoryTableScanExec prunes orderable complex types due to out of date ColumnStats

2018-03-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23819?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23819:


Assignee: (was: Apache Spark)

> InMemoryTableScanExec prunes orderable complex types due to out of date 
> ColumnStats
> ---
>
> Key: SPARK-23819
> URL: https://issues.apache.org/jira/browse/SPARK-23819
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Patrick Woody
>Priority: Major
>
> The set of data types that can be compared via BinaryComparison was expanded in 
> SPARK-21110 to include Arrays/Structs/etc, but ColumnStats still has 
> hard-coded upper/lower bounds for these types.
> InMemoryTableScanExec used to be safe against these comparisons because the 
> predicate would fail type checking. Now that it passes, the statistics 
> unintentionally allow pruning of the partition, causing correctness issues.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23819) InMemoryTableScanExec prunes orderable complex types due to out of date ColumnStats

2018-03-29 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23819?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16418777#comment-16418777
 ] 

Apache Spark commented on SPARK-23819:
--

User 'pwoody' has created a pull request for this issue:
https://github.com/apache/spark/pull/20935

> InMemoryTableScanExec prunes orderable complex types due to out of date 
> ColumnStats
> ---
>
> Key: SPARK-23819
> URL: https://issues.apache.org/jira/browse/SPARK-23819
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Patrick Woody
>Priority: Major
>
> The set of data types that can be compared via BinaryComparison was expanded in 
> SPARK-21110 to include Arrays/Structs/etc, but ColumnStats still has 
> hard-coded upper/lower bounds for these types.
> InMemoryTableScanExec used to be safe against these comparisons because the 
> predicate would fail type checking. Now that it passes, the statistics 
> unintentionally allow pruning of the partition, causing correctness issues.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23818) an official UDF interface for Spark SQL

2018-03-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23818?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23818:


Assignee: Apache Spark

> an official UDF interface for Spark SQL
> ---
>
> Key: SPARK-23818
> URL: https://issues.apache.org/jira/browse/SPARK-23818
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Wenchen Fan
>Assignee: Apache Spark
>Priority: Major
>
> Nowadays the Spark UDF interface is pretty simple: it's basically just a 
> lambda function.
> We should come up with a new one that can:
> 1. operate on InternalRow directly for better performance
> 2. define deterministic/mutable/nullable in the UDF itself
> 3. java friendly



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23818) an official UDF interface for Spark SQL

2018-03-29 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23818?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16418754#comment-16418754
 ] 

Apache Spark commented on SPARK-23818:
--

User 'WeichenXu123' has created a pull request for this issue:
https://github.com/apache/spark/pull/20934

> an official UDF interface for Spark SQL
> ---
>
> Key: SPARK-23818
> URL: https://issues.apache.org/jira/browse/SPARK-23818
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Wenchen Fan
>Priority: Major
>
> Nowadays the Spark UDF interface is pretty simple: it's basically just a 
> lambda function.
> We should come up with a new one that can:
> 1. operate on InternalRow directly for better performance
> 2. define deterministic/mutable/nullable in the UDF itself
> 3. java friendly



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23818) an official UDF interface for Spark SQL

2018-03-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23818?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23818:


Assignee: (was: Apache Spark)

> an official UDF interface for Spark SQL
> ---
>
> Key: SPARK-23818
> URL: https://issues.apache.org/jira/browse/SPARK-23818
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Wenchen Fan
>Priority: Major
>
> Nowadays the Spark UDF interface is pretty simple: it's basically just a 
> lambda function.
> We should come up with a new one that can:
> 1. operate on InternalRow directly for better performance
> 2. define deterministic/mutable/nullable in the UDF itself
> 3. java friendly



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-23819) InMemoryTableScanExec prunes orderable complex types due to out of date ColumnStats

2018-03-29 Thread Patrick Woody (JIRA)
Patrick Woody created SPARK-23819:
-

 Summary: InMemoryTableScanExec prunes orderable complex types due 
to out of date ColumnStats
 Key: SPARK-23819
 URL: https://issues.apache.org/jira/browse/SPARK-23819
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.3.0
Reporter: Patrick Woody


The set of data types that can be compared via BinaryComparison was expanded in 
SPARK-21110 to include Arrays/Structs/etc, but ColumnStats still has 
hard-coded upper/lower bounds for these types.

InMemoryTableScanExec used to be safe against these comparisons because the 
predicate would fail type checking. Now that it passes, the statistics 
unintentionally allow pruning of the partition, causing correctness issues.
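
For orientation, a minimal sketch of the kind of query shape involved: an 
orderable complex-type comparison (possible since SPARK-21110) evaluated against 
a cached relation, which is where the stale ColumnStats bounds come into play. 
Assuming a SparkSession available as {{spark}}; this is illustrative only and 
not claimed to be the exact reproduction from the pull request:
{code:java}
import org.apache.spark.sql.functions._
import spark.implicits._

// Illustrative only: a cached relation with an array column, filtered with a
// binary comparison on that column. On affected versions the cached batch
// statistics can prune batches that actually contain matching rows.
val cached = spark.range(10).select(array($"id").as("a")).cache()
cached.count()                               // materialise the cached batches
cached.where($"a" >= array(lit(5L))).show()  // orderable array comparison
{code}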



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-23818) an official UDF interface for Spark SQL

2018-03-29 Thread Wenchen Fan (JIRA)
Wenchen Fan created SPARK-23818:
---

 Summary: an official UDF interface for Spark SQL
 Key: SPARK-23818
 URL: https://issues.apache.org/jira/browse/SPARK-23818
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.4.0
Reporter: Wenchen Fan


Nowadays the Spark UDF interface is pretty simple: it's basically just a lambda 
function.

We should come up with a new one that can:
1. operate on InternalRow directly for better performance
2. define deterministic/mutable/nullable in the UDF itself
3. java friendly
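
Purely as an illustration of the three requirements above, one hypothetical 
shape such an interface could take; this is not the design from the linked pull 
request:
{code:java}
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.types.DataType

// Hypothetical sketch only, not an actual Spark API.
trait RowUDF {
  def dataType: DataType              // result type of the UDF
  def nullable: Boolean               // (2) declared by the UDF itself
  def deterministic: Boolean          // (2) declared by the UDF itself
  def eval(input: InternalRow): Any   // (1) operates on InternalRow directly
}
// (3) A trait with only abstract methods remains easy to implement from Java.
{code}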



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23770) Expose repartitionByRange in SparkR

2018-03-29 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23770?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-23770:


Assignee: Hyukjin Kwon

> Expose repartitionByRange in SparkR
> ---
>
> Key: SPARK-23770
> URL: https://issues.apache.org/jira/browse/SPARK-23770
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Affects Versions: 2.4.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
> Fix For: 2.4.0
>
>
> SPARK-22614 added repartitionByRange. It'd be good to have it on the R side too.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-23770) Expose repartitionByRange in SparkR

2018-03-29 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23770?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-23770.
--
   Resolution: Fixed
Fix Version/s: 2.4.0

Issue resolved by pull request 20902
[https://github.com/apache/spark/pull/20902]

> Expose repartitionByRange in SparkR
> ---
>
> Key: SPARK-23770
> URL: https://issues.apache.org/jira/browse/SPARK-23770
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Affects Versions: 2.4.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
> Fix For: 2.4.0
>
>
> SPARK-22614 added repartitionByRange. It'd be good to have it on the R side too.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-20384) supporting value classes over primitives in DataSets

2018-03-29 Thread Furcy Pin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20384?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16418691#comment-16418691
 ] 

Furcy Pin edited comment on SPARK-20384 at 3/29/18 10:10 AM:
-

+1 on this issue.

I think the generic use case is that the spark Encoder magic to automatically 
transform a DataFrame into a case class currently only works for base types.

This is great if you have a 
{code:java}
case class Table(id: Long, attribute: String)
{code}
with simple attributes,

 

BUT, if you want to wrap your attribute into another simple class like this
{code:java}
case class Attribute(value: String) {
  // some specific methods...
}
case class Table(id: Long, attribute: Attribute){code}
Then this won't work automatically, unless the "attribute" column in your 
DataFrame is a struct itself.

 

The problem is that currently there doesn't seem to be any simple way to 
achieve this, which really limits the usefulness of the whole Encoder magic. 

And if a nice, simple way to achieve this exists, please document it as I did 
not find it.

 

 EDIT: after giving it some thought, I tried to do this:
{code:java}
implicit class Attribute(value: String)
case class Table(id: Long, attribute: Attribute){code}
But it does not work either. If it were possible like this, it would be a nice 
way to do it.

 

 


was (Author: fpin):
+1 on this issue.


 I think the generic use case is that the spark Encoder magic to automatically 
transform a DataFrame into a case class currently only works for base types.

This is great if you have a 
{code:java}
case class Table(id: Long, attribute: String)
{code}
with simple attributes,

 

BUT, if you want to wrap your attribute into another simple class like this
{code:java}
case class Attribute(value: String) {
  // some specific methods...
}
case class Table(id: Long, attribute: Attribute){code}
Then this won't work automatically, unless the "attribute" column in your 
DataFrame is a struct itself.

 

The problem is that currently there doesn't seem to be any simple way to 
achieve this, which really limits the usefulness of the whole Encoder magic. 

And if a nice, simple way to achieve this exists, please document it as I did 
not find it.

 

 

> supporting value classes over primitives in DataSets
> 
>
> Key: SPARK-20384
> URL: https://issues.apache.org/jira/browse/SPARK-20384
> Project: Spark
>  Issue Type: Improvement
>  Components: Optimizer, SQL
>Affects Versions: 2.1.0
>Reporter: Daniel Davis
>Priority: Minor
>
> As a spark user who uses value classes in scala for modelling domain objects, 
> I also would like to make use of them for datasets. 
> For example, I would like to use the {{User}} case class which is using a 
> value-class for its {{id}} as the type for a DataSet:
> - the underlying primitive should be mapped to the value-class column
> - functions on the column (for example comparison) should only work if 
> defined on the value-class and use that implementation
> - show() should pick up the toString method of the value-class
> {code}
> case class Id(value: Long) extends AnyVal {
>   def toString: String = value.toHexString
> }
> case class User(id: Id, name: String)
> val ds = spark.sparkContext
>   .parallelize(0L to 12L).map(i => (i, f"name-$i")).toDS()
>   .withColumnRenamed("_1", "id")
>   .withColumnRenamed("_2", "name")
> // mapping should work
> val usrs = ds.as[User]
> // show should use toString
> usrs.show()
> // comparison with long should throw exception, as not defined on Id
> usrs.col("id") > 0L
> {code}
> For example `.show()` should use the toString of the `Id` value class:
> {noformat}
> +---+---+
> | id|   name|
> +---+---+
> |  0| name-0|
> |  1| name-1|
> |  2| name-2|
> |  3| name-3|
> |  4| name-4|
> |  5| name-5|
> |  6| name-6|
> |  7| name-7|
> |  8| name-8|
> |  9| name-9|
> |  A|name-10|
> |  B|name-11|
> |  C|name-12|
> +---+---+
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20384) supporting value classes over primitives in DataSets

2018-03-29 Thread Furcy Pin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20384?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16418691#comment-16418691
 ] 

Furcy Pin commented on SPARK-20384:
---

+1 on this issue.


 I think the generic use case is that the spark Encoder magic to automatically 
transform a DataFrame into a case class currently only works for base types.

This is great if you have a 
{code:java}
case class Table(id: Long, attribute: String)
{code}
with simple attributes,

 

BUT, if you want to wrap your attribute into another simple class like this
{code:java}
case class Attribute(value: String) {
  // some specific methods...
}
case class Table(id: Long, attribute: Attribute){code}
Then this won't work automatically, unless the "attribute" column in your 
DataFrame is a struct itself.

 

The problem is that currently there doesn't seem to be any simple way to 
achieve this, which really limits the usefulness of the whole Encoder magic. 

And if a nice, simple way to achieve this exists, please document it as I did 
not find it.

 

 

> supporting value classes over primitives in DataSets
> 
>
> Key: SPARK-20384
> URL: https://issues.apache.org/jira/browse/SPARK-20384
> Project: Spark
>  Issue Type: Improvement
>  Components: Optimizer, SQL
>Affects Versions: 2.1.0
>Reporter: Daniel Davis
>Priority: Minor
>
> As a spark user who uses value classes in scala for modelling domain objects, 
> I also would like to make use of them for datasets. 
> For example, I would like to use the {{User}} case class which is using a 
> value-class for it's {{id}} as the type for a DataSet:
> - the underlying primitive should be mapped to the value-class column
> - function on the column (for example comparison ) should only work if 
> defined on the value-class and use these implementation
> - show() should pick up the toString method of the value-class
> {code}
> case class Id(value: Long) extends AnyVal {
>   def toString: String = value.toHexString
> }
> case class User(id: Id, name: String)
> val ds = spark.sparkContext
>   .parallelize(0L to 12L).map(i => (i, f"name-$i")).toDS()
>   .withColumnRenamed("_1", "id")
>   .withColumnRenamed("_2", "name")
> // mapping should work
> val usrs = ds.as[User]
> // show should use toString
> usrs.show()
> // comparison with long should throw exception, as not defined on Id
> usrs.col("id") > 0L
> {code}
> For example `.show()` should use the toString of the `Id` value class:
> {noformat}
> +---+---+
> | id|   name|
> +---+---+
> |  0| name-0|
> |  1| name-1|
> |  2| name-2|
> |  3| name-3|
> |  4| name-4|
> |  5| name-5|
> |  6| name-6|
> |  7| name-7|
> |  8| name-8|
> |  9| name-9|
> |  A|name-10|
> |  B|name-11|
> |  C|name-12|
> +---+---+
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-22342) refactor schedulerDriver registration

2018-03-29 Thread Stavros Kontopoulos (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16418617#comment-16418617
 ] 

Stavros Kontopoulos edited comment on SPARK-22342 at 3/29/18 9:43 AM:
--

[~susanxhuynh] I guess we need to create a bug for this or change this issue 
type.


was (Author: skonto):
[~susanxhuynh] I guess we need to create a bug for this or change this is issue 
type.

> refactor schedulerDriver registration
> -
>
> Key: SPARK-22342
> URL: https://issues.apache.org/jira/browse/SPARK-22342
> Project: Spark
>  Issue Type: Improvement
>  Components: Mesos
>Affects Versions: 2.2.0
>Reporter: Stavros Kontopoulos
>Priority: Major
>
> This is an umbrella issue for working on
> https://github.com/apache/spark/pull/13143
> and for handling the multiple re-registration issue, which invalidates an offer.
> To test:
>  dcos spark run --verbose --name=spark-nohive  --submit-args="--driver-cores 
> 1 --conf spark.cores.max=1 --driver-memory 512M --class 
> org.apache.spark.examples.SparkPi http://.../spark-examples_2.11-2.2.0.jar";
> master log:
> I1020 13:49:05.00  3087 master.cpp:6618] Updating info for framework 
> 9764beab-c90a-4b4f-b0ff-44c187851b34-0004-driver-20171020134857-0003
> I1020 13:49:05.00  3085 hierarchical.cpp:303] Added framework 
> 9764beab-c90a-4b4f-b0ff-44c187851b34-0004-driver-20171020134857-0003
> I1020 13:49:05.00  3085 hierarchical.cpp:412] Deactivated framework 
> 9764beab-c90a-4b4f-b0ff-44c187851b34-0004-driver-20171020134857-0003
> I1020 13:49:05.00  3090 hierarchical.cpp:380] Activated framework 
> 9764beab-c90a-4b4f-b0ff-44c187851b34-0004-driver-20171020134857-0003
> I1020 13:49:05.00  3087 master.cpp:2974] Subscribing framework Spark Pi 
> with checkpointing disabled and capabilities [  ]
> I1020 13:49:05.00  3087 master.cpp:6618] Updating info for framework 
> 9764beab-c90a-4b4f-b0ff-44c187851b34-0004-driver-20171020134857-0003
> I1020 13:49:05.00  3087 master.cpp:3083] Framework 
> 9764beab-c90a-4b4f-b0ff-44c187851b34-0004-driver-20171020134857-0003 (Spark 
> Pi) at scheduler-73f79027-b262-40d2-b751-05d8a6b60146@10.0.2.97:40697 failed 
> over
> I1020 13:49:05.00  3087 master.cpp:2894] Received SUBSCRIBE call for 
> framework 'Spark Pi' at 
> scheduler-73f79027-b262-40d2-b751-05d8a6b60146@10.0.2.97:40697
> I1020 13:49:05.00  3087 master.cpp:2894] Received SUBSCRIBE call for 
> framework 'Spark Pi' at 
> scheduler-73f79027-b262-40d2-b751-05d8a6b60146@10.0.2.97:40697
> I1020 13:49:05.00 3087 master.cpp:2894] Received SUBSCRIBE call for 
> framework 'Spark Pi' at 
> scheduler-73f79027-b262-40d2-b751-05d8a6b60146@10.0.2.97:40697
> I1020 13:49:05.00 3087 master.cpp:2894] Received SUBSCRIBE call for 
> framework 'Spark Pi' at 
> scheduler-73f79027-b262-40d2-b751-05d8a6b60146@10.0.2.97:40697
> I1020 13:49:05.00 3087 master.cpp:2974] Subscribing framework Spark Pi 
> with checkpointing disabled and capabilities [ ]
> I1020 13:49:05.00 3087 master.cpp:6618] Updating info for framework 
> 9764beab-c90a-4b4f-b0ff-44c187851b34-0004-driver-20171020134857-0003
> I1020 13:49:05.00 3087 master.cpp:3083] Framework 
> 9764beab-c90a-4b4f-b0ff-44c187851b34-0004-driver-20171020134857-0003 (Spark 
> Pi) at scheduler-73f79027-b262-40d2-b751-05d8a6b60146@10.0.2.97:40697 failed 
> over
> I1020 13:49:05.00 3087 master.cpp:7662] Sending 6 offers to framework 
> 9764beab-c90a-4b4f-b0ff-44c187851b34-0004-driver-20171020134857-0003 (Spark 
> Pi) at scheduler-73f79027-b262-40d2-b751-05d8a6b60146@10.0.2.97:40697
> I1020 13:49:05.00 3087 master.cpp:2974] Subscribing framework Spark Pi 
> with checkpointing disabled and capabilities [ ]
> I1020 13:49:05.00 3087 master.cpp:6618] Updating info for framework 
> 9764beab-c90a-4b4f-b0ff-44c187851b34-0004-driver-20171020134857-0003
> I1020 13:49:05.00 3087 master.cpp:3083] Framework 
> 9764beab-c90a-4b4f-b0ff-44c187851b34-0004-driver-20171020134857-0003 (Spark 
> Pi) at scheduler-73f79027-b262-40d2-b751-05d8a6b60146@10.0.2.97:40697 failed 
> over
> I1020 13:49:05.00 3087 master.cpp:9159] Removing offer 
> 9764beab-c90a-4b4f-b0ff-44c187851b34-O10039
> I1020 13:49:05.00 3087 master.cpp:9159] Removing offer 
> 9764beab-c90a-4b4f-b0ff-44c187851b34-O10038
> I1020 13:49:05.00 3087 master.cpp:9159] Removing offer 
> 9764beab-c90a-4b4f-b0ff-44c187851b34-O10037
> I1020 13:49:05.00 3087 master.cpp:9159] Removing offer 
> 9764beab-c90a-4b4f-b0ff-44c187851b34-O10036
> I1020 13:49:05.00 3087 master.cpp:9159] Removing offer 
> 9764beab-c90a-4b4f-b0ff-44c187851b34-O10035
> I1020 13:49:05.00 3087 master.cpp:9159] Removing offer 
> 9764beab-c90a-4b4f-b0ff-44c187851b34-O10034
> I1020 13:49:05.00 3087 master.cpp:2894] Received SUBSCRIBE call 

[jira] [Updated] (SPARK-21337) SQL which has large ‘case when’ expressions may cause code generation beyond 64KB

2018-03-29 Thread fengchaoge (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21337?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

fengchaoge updated SPARK-21337:
---
Attachment: (was: duibi1.zip)

> SQL which has large ‘case when’ expressions may cause code generation beyond 
> 64KB
> -
>
> Key: SPARK-21337
> URL: https://issues.apache.org/jira/browse/SPARK-21337
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.1.1
> Environment: spark-2.1.1-hadoop-2.6.0-cdh-5.4.2
>Reporter: fengchaoge
>Priority: Major
> Fix For: 2.1.1
>
> Attachments: test.JPG, test1.JPG, test2.JPG
>
>
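
For anyone unfamiliar with this failure mode, here is a minimal sketch of the kind 
of expression shape the title refers to. It is not taken from this report and the 
branch count is arbitrary; it assumes Spark 2.1.x with a local SparkSession. A CASE 
WHEN with thousands of branches can push the generated Java for a single method past 
the JVM's 64 KB limit, which, depending on the version, either fails Janino 
compilation or forces a fallback to interpreted evaluation.
{code:java}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, lit, when}

object LargeCaseWhenSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("large-case-when").getOrCreate()

    val df = spark.range(100).toDF("id")

    // Chain a few thousand WHEN branches into one CASE expression.
    val bucket = (1 to 3000)
      .foldLeft(when(col("id") === 0, lit(0))) { (acc, i) => acc.when(col("id") === i, lit(i)) }
      .otherwise(lit(-1))

    // On affected versions, the code generated for this projection can hit the
    // 64 KB method-size limit ("... grows beyond 64 KB").
    df.select(bucket.as("bucket")).show()

    spark.stop()
  }
}
{code}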




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21337) SQL which has large ‘case when’ expressions may cause code generation beyond 64KB

2018-03-29 Thread fengchaoge (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21337?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

fengchaoge updated SPARK-21337:
---
Attachment: (was: duibi2.zip)

> SQL which has large ‘case when’ expressions may cause code generation beyond 
> 64KB
> -
>
> Key: SPARK-21337
> URL: https://issues.apache.org/jira/browse/SPARK-21337
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.1.1
> Environment: spark-2.1.1-hadoop-2.6.0-cdh-5.4.2
>Reporter: fengchaoge
>Priority: Major
> Fix For: 2.1.1
>
> Attachments: test.JPG, test1.JPG, test2.JPG
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21337) SQL which has large ‘case when’ expressions may cause code generation beyond 64KB

2018-03-29 Thread fengchaoge (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21337?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

fengchaoge updated SPARK-21337:
---
Attachment: duibi2.zip

> SQL which has large ‘case when’ expressions may cause code generation beyond 
> 64KB
> -
>
> Key: SPARK-21337
> URL: https://issues.apache.org/jira/browse/SPARK-21337
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.1.1
> Environment: spark-2.1.1-hadoop-2.6.0-cdh-5.4.2
>Reporter: fengchaoge
>Priority: Major
> Fix For: 2.1.1
>
> Attachments: duibi1.zip, duibi2.zip, test.JPG, test1.JPG, test2.JPG
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21337) SQL which has large ‘case when’ expressions may cause code generation beyond 64KB

2018-03-29 Thread fengchaoge (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21337?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

fengchaoge updated SPARK-21337:
---
Attachment: duibi1.zip

> SQL which has large ‘case when’ expressions may cause code generation beyond 
> 64KB
> -
>
> Key: SPARK-21337
> URL: https://issues.apache.org/jira/browse/SPARK-21337
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.1.1
> Environment: spark-2.1.1-hadoop-2.6.0-cdh-5.4.2
>Reporter: fengchaoge
>Priority: Major
> Fix For: 2.1.1
>
> Attachments: duibi1.zip, duibi2.zip, test.JPG, test1.JPG, test2.JPG
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21337) SQL which has large ‘case when’ expressions may cause code generation beyond 64KB

2018-03-29 Thread fengchaoge (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21337?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

fengchaoge updated SPARK-21337:
---
Attachment: (was: shiro.ini)

> SQL which has large ‘case when’ expressions may cause code generation beyond 
> 64KB
> -
>
> Key: SPARK-21337
> URL: https://issues.apache.org/jira/browse/SPARK-21337
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.1.1
> Environment: spark-2.1.1-hadoop-2.6.0-cdh-5.4.2
>Reporter: fengchaoge
>Priority: Major
> Fix For: 2.1.1
>
> Attachments: test.JPG, test1.JPG, test2.JPG
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21337) SQL which has large ‘case when’ expressions may cause code generation beyond 64KB

2018-03-29 Thread fengchaoge (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21337?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

fengchaoge updated SPARK-21337:
---
Attachment: (was: SecurityRestApi.java)

> SQL which has large ‘case when’ expressions may cause code generation beyond 
> 64KB
> -
>
> Key: SPARK-21337
> URL: https://issues.apache.org/jira/browse/SPARK-21337
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.1.1
> Environment: spark-2.1.1-hadoop-2.6.0-cdh-5.4.2
>Reporter: fengchaoge
>Priority: Major
> Fix For: 2.1.1
>
> Attachments: test.JPG, test1.JPG, test2.JPG
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21337) SQL which has large ‘case when’ expressions may cause code generation beyond 64KB

2018-03-29 Thread fengchaoge (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21337?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

fengchaoge updated SPARK-21337:
---
Attachment: (was: zeppelin-site.xml)

> SQL which has large ‘case when’ expressions may cause code generation beyond 
> 64KB
> -
>
> Key: SPARK-21337
> URL: https://issues.apache.org/jira/browse/SPARK-21337
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.1.1
> Environment: spark-2.1.1-hadoop-2.6.0-cdh-5.4.2
>Reporter: fengchaoge
>Priority: Major
> Fix For: 2.1.1
>
> Attachments: test.JPG, test1.JPG, test2.JPG
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21337) SQL which has large ‘case when’ expressions may cause code generation beyond 64KB

2018-03-29 Thread fengchaoge (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21337?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

fengchaoge updated SPARK-21337:
---
Attachment: (was: ZeppelinConfiguration.java)

> SQL which has large ‘case when’ expressions may cause code generation beyond 
> 64KB
> -
>
> Key: SPARK-21337
> URL: https://issues.apache.org/jira/browse/SPARK-21337
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.1.1
> Environment: spark-2.1.1-hadoop-2.6.0-cdh-5.4.2
>Reporter: fengchaoge
>Priority: Major
> Fix For: 2.1.1
>
> Attachments: test.JPG, test1.JPG, test2.JPG
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21337) SQL which has large ‘case when’ expressions may cause code generation beyond 64KB

2018-03-29 Thread fengchaoge (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21337?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

fengchaoge updated SPARK-21337:
---
Attachment: (was: SecurityUtils.java)

> SQL which has large ‘case when’ expressions may cause code generation beyond 
> 64KB
> -
>
> Key: SPARK-21337
> URL: https://issues.apache.org/jira/browse/SPARK-21337
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.1.1
> Environment: spark-2.1.1-hadoop-2.6.0-cdh-5.4.2
>Reporter: fengchaoge
>Priority: Major
> Fix For: 2.1.1
>
> Attachments: test.JPG, test1.JPG, test2.JPG
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21337) SQL which has large ‘case when’ expressions may cause code generation beyond 64KB

2018-03-29 Thread fengchaoge (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21337?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

fengchaoge updated SPARK-21337:
---
Attachment: (was: LoginRestApi.java)

> SQL which has large ‘case when’ expressions may cause code generation beyond 
> 64KB
> -
>
> Key: SPARK-21337
> URL: https://issues.apache.org/jira/browse/SPARK-21337
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.1.1
> Environment: spark-2.1.1-hadoop-2.6.0-cdh-5.4.2
>Reporter: fengchaoge
>Priority: Major
> Fix For: 2.1.1
>
> Attachments: SecurityRestApi.java, SecurityUtils.java, 
> ZeppelinConfiguration.java, shiro.ini, test.JPG, test1.JPG, test2.JPG, 
> zeppelin-site.xml
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21337) SQL which has large ‘case when’ expressions may cause code generation beyond 64KB

2018-03-29 Thread fengchaoge (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21337?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

fengchaoge updated SPARK-21337:
---
Attachment: (was: GetUserList.java)

> SQL which has large ‘case when’ expressions may cause code generation beyond 
> 64KB
> -
>
> Key: SPARK-21337
> URL: https://issues.apache.org/jira/browse/SPARK-21337
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.1.1
> Environment: spark-2.1.1-hadoop-2.6.0-cdh-5.4.2
>Reporter: fengchaoge
>Priority: Major
> Fix For: 2.1.1
>
> Attachments: SecurityRestApi.java, SecurityUtils.java, 
> ZeppelinConfiguration.java, shiro.ini, test.JPG, test1.JPG, test2.JPG, 
> zeppelin-site.xml
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21337) SQL which has large ‘case when’ expressions may cause code generation beyond 64KB

2018-03-29 Thread fengchaoge (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21337?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

fengchaoge updated SPARK-21337:
---
Attachment: (was: GbdLdapRealm.java)

> SQL which has large ‘case when’ expressions may cause code generation beyond 
> 64KB
> -
>
> Key: SPARK-21337
> URL: https://issues.apache.org/jira/browse/SPARK-21337
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.1.1
> Environment: spark-2.1.1-hadoop-2.6.0-cdh-5.4.2
>Reporter: fengchaoge
>Priority: Major
> Fix For: 2.1.1
>
> Attachments: SecurityRestApi.java, SecurityUtils.java, 
> ZeppelinConfiguration.java, shiro.ini, test.JPG, test1.JPG, test2.JPG, 
> zeppelin-site.xml
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


