[jira] [Created] (SPARK-19487) Low latency execution for Spark
Shivaram Venkataraman created SPARK-19487: - Summary: Low latency execution for Spark Key: SPARK-19487 URL: https://issues.apache.org/jira/browse/SPARK-19487 Project: Spark Issue Type: Umbrella Components: ML, Scheduler, Structured Streaming Affects Versions: 2.1.0 Reporter: Shivaram Venkataraman This JIRA tracks the design discussion for supporting low latency execution in Apache Spark. The motivation for this comes from need to support lower latency stream processing and lower latency iterations for sparse ML workloads. Overview of proposed design (in the format of Spark Improvement Proposal) is at https://docs.google.com/document/d/1m_q83DjQcWQonEz4IsRUHu4QSjcDyqRpl29qE4LJc4s/edit?usp=sharing Source code prototype is at: https://github.com/amplab/drizzle-spark Lets use this JIRA to discuss high level design and we can create subtasks as we break this down into smaller PRs. This is joint work with [~kayousterhout] -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-18967) Locality preferences should be used when scheduling even when delay scheduling is turned off
[ https://issues.apache.org/jira/browse/SPARK-18967?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kay Ousterhout resolved SPARK-18967. Resolution: Fixed Fix Version/s: 2.2 > Locality preferences should be used when scheduling even when delay > scheduling is turned off > > > Key: SPARK-18967 > URL: https://issues.apache.org/jira/browse/SPARK-18967 > Project: Spark > Issue Type: Bug > Components: Scheduler >Affects Versions: 2.1.0 >Reporter: Imran Rashid >Assignee: Imran Rashid > Fix For: 2.2 > > > If you turn delay scheduling off by setting {{spark.locality.wait=0}}, you > effectively turn off the use the of locality preferences when there is a bulk > scheduling event. {{TaskSchedulerImpl}} will use resources based on whatever > random order it decides to shuffle them, rather than taking advantage of the > most local options. > This happens because {{TaskSchedulerImpl}} offers resources to a > {{TaskSetManager}} one at a time, each time subject to a maxLocality > constraint. However, that constraint doesn't move through all possible > locality levels -- it uses [{{tsm.myLocalityLevels}} > |https://github.com/apache/spark/blob/1a64388973711b4e567f25fa33d752066a018b49/core/src/main/scala/org/apache/spark/scheduler/TaskSchedulerImpl.scala#L360]. > And {{tsm.myLocalityLevels}} [skips locality levels completely if the wait > == 0 | > https://github.com/apache/spark/blob/1a64388973711b4e567f25fa33d752066a018b49/core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala#L953]. > So with delay scheduling off, {{TaskSchedulerImpl}} immediately jumps to > giving tsms the offers with {{maxLocality = ANY}}. > *WORKAROUND*: instead of setting {{spark.locality.wait=0}}, use > {{spark.locality.wait=1ms}}. The one downside of this is if you have tasks > that actually take less than 1ms. You could even run into SPARK-18886. But > that is a relatively unlikely scenario for real workloads. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16441) Spark application hang when dynamic allocation is enabled
[ https://issues.apache.org/jira/browse/SPARK-16441?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15855414#comment-15855414 ] Yuming Wang commented on SPARK-16441: - [~cenyuhai], [2.1.0|https://github.com/apache/spark/tree/v2.1.0] not fix, master and branch-2.1 has fixed. My [PR|https://github.com/apache/spark/pull/16819] is to reduce friction. > Spark application hang when dynamic allocation is enabled > - > > Key: SPARK-16441 > URL: https://issues.apache.org/jira/browse/SPARK-16441 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.6.2, 2.0.0, 2.1.0 > Environment: hadoop 2.7.2 spark1.6.2 >Reporter: cen yuhai > Attachments: SPARK-16441-compare-apply-PR-16819.zip, > SPARK-16441-stage.jpg, SPARK-16441-threadDump.jpg, > SPARK-16441-yarn-metrics.jpg > > > spark application are waiting for rpc response all the time and spark > listener are blocked by dynamic allocation. Executors can not connect to > driver and lost. > "spark-dynamic-executor-allocation" #239 daemon prio=5 os_prio=0 > tid=0x7fa304438000 nid=0xcec6 waiting on condition [0x7fa2b81e4000] >java.lang.Thread.State: TIMED_WAITING (parking) > at sun.misc.Unsafe.park(Native Method) > - parking to wait for <0x00070fdb94f8> (a > scala.concurrent.impl.Promise$CompletionLatch) > at > java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215) > at > java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedNanos(AbstractQueuedSynchronizer.java:1037) > at > java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(AbstractQueuedSynchronizer.java:1328) > at > scala.concurrent.impl.Promise$DefaultPromise.tryAwait(Promise.scala:208) > at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:218) > at > scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223) > at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107) > at > scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53) > at scala.concurrent.Await$.result(package.scala:107) > at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75) > at > org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:101) > at > org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:77) > at > org.apache.spark.scheduler.cluster.YarnSchedulerBackend.doRequestTotalExecutors(YarnSchedulerBackend.scala:59) > at > org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.requestTotalExecutors(CoarseGrainedSchedulerBackend.scala:436) > - locked <0x828a8960> (a > org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend) > at > org.apache.spark.SparkContext.requestTotalExecutors(SparkContext.scala:1438) > at > org.apache.spark.ExecutorAllocationManager.addExecutors(ExecutorAllocationManager.scala:359) > at > org.apache.spark.ExecutorAllocationManager.updateAndSyncNumExecutorsTarget(ExecutorAllocationManager.scala:310) > - locked <0x880e6308> (a > org.apache.spark.ExecutorAllocationManager) > at > org.apache.spark.ExecutorAllocationManager.org$apache$spark$ExecutorAllocationManager$$schedule(ExecutorAllocationManager.scala:264) > - locked <0x880e6308> (a > org.apache.spark.ExecutorAllocationManager) > at > org.apache.spark.ExecutorAllocationManager$$anon$2.run(ExecutorAllocationManager.scala:223) > "SparkListenerBus" #161 daemon prio=5 os_prio=0 tid=0x7fa3053be000 > nid=0xcec9 waiting for monitor entry [0x7fa2b3dfc000] >java.lang.Thread.State: BLOCKED (on object monitor) > at > org.apache.spark.ExecutorAllocationManager$ExecutorAllocationListener.onTaskEnd(ExecutorAllocationManager.scala:618) > - waiting to lock <0x880e6308> (a > org.apache.spark.ExecutorAllocationManager) > at > org.apache.spark.scheduler.SparkListenerBus$class.onPostEvent(SparkListenerBus.scala:42) > at > org.apache.spark.scheduler.LiveListenerBus.onPostEvent(LiveListenerBus.scala:31) > at > org.apache.spark.scheduler.LiveListenerBus.onPostEvent(LiveListenerBus.scala:31) > at > org.apache.spark.util.ListenerBus$class.postToAll(ListenerBus.scala:55) > at > org.apache.spark.util.AsynchronousListenerBus.postToAll(AsynchronousListenerBus.scala:37) > at > org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(AsynchronousListenerBus.scala:80) > at > org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(AsynchronousListenerBus.scala:65) > at >
[jira] [Commented] (SPARK-19484) continue work to create a table with an empty schema
[ https://issues.apache.org/jira/browse/SPARK-19484?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15855402#comment-15855402 ] Apache Spark commented on SPARK-19484: -- User 'windpiger' has created a pull request for this issue: https://github.com/apache/spark/pull/16828 > continue work to create a table with an empty schema > > > Key: SPARK-19484 > URL: https://issues.apache.org/jira/browse/SPARK-19484 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.2.0 >Reporter: Song Jun >Priority: Minor > > after SPARK-19279, we could not create a Hive table with an empty schema, > we should tighten up the condition when create a hive table in > https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala#L835 > That is if a CatalogTable t has an empty schema, and (there is no > `spark.sql.schema.numParts` or its value is 0), we should not add a default > `col` schema, if we did, a table with an empty schema will be created, that > is not we expected. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-19484) continue work to create a table with an empty schema
[ https://issues.apache.org/jira/browse/SPARK-19484?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19484: Assignee: Apache Spark > continue work to create a table with an empty schema > > > Key: SPARK-19484 > URL: https://issues.apache.org/jira/browse/SPARK-19484 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.2.0 >Reporter: Song Jun >Assignee: Apache Spark >Priority: Minor > > after SPARK-19279, we could not create a Hive table with an empty schema, > we should tighten up the condition when create a hive table in > https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala#L835 > That is if a CatalogTable t has an empty schema, and (there is no > `spark.sql.schema.numParts` or its value is 0), we should not add a default > `col` schema, if we did, a table with an empty schema will be created, that > is not we expected. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-19484) continue work to create a table with an empty schema
[ https://issues.apache.org/jira/browse/SPARK-19484?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19484: Assignee: (was: Apache Spark) > continue work to create a table with an empty schema > > > Key: SPARK-19484 > URL: https://issues.apache.org/jira/browse/SPARK-19484 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.2.0 >Reporter: Song Jun >Priority: Minor > > after SPARK-19279, we could not create a Hive table with an empty schema, > we should tighten up the condition when create a hive table in > https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala#L835 > That is if a CatalogTable t has an empty schema, and (there is no > `spark.sql.schema.numParts` or its value is 0), we should not add a default > `col` schema, if we did, a table with an empty schema will be created, that > is not we expected. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-16441) Spark application hang when dynamic allocation is enabled
[ https://issues.apache.org/jira/browse/SPARK-16441?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-16441: Attachment: SPARK-16441-compare-apply-PR-16819.zip > Spark application hang when dynamic allocation is enabled > - > > Key: SPARK-16441 > URL: https://issues.apache.org/jira/browse/SPARK-16441 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.6.2, 2.0.0, 2.1.0 > Environment: hadoop 2.7.2 spark1.6.2 >Reporter: cen yuhai > Attachments: SPARK-16441-compare-apply-PR-16819.zip, > SPARK-16441-stage.jpg, SPARK-16441-threadDump.jpg, > SPARK-16441-yarn-metrics.jpg > > > spark application are waiting for rpc response all the time and spark > listener are blocked by dynamic allocation. Executors can not connect to > driver and lost. > "spark-dynamic-executor-allocation" #239 daemon prio=5 os_prio=0 > tid=0x7fa304438000 nid=0xcec6 waiting on condition [0x7fa2b81e4000] >java.lang.Thread.State: TIMED_WAITING (parking) > at sun.misc.Unsafe.park(Native Method) > - parking to wait for <0x00070fdb94f8> (a > scala.concurrent.impl.Promise$CompletionLatch) > at > java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215) > at > java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedNanos(AbstractQueuedSynchronizer.java:1037) > at > java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(AbstractQueuedSynchronizer.java:1328) > at > scala.concurrent.impl.Promise$DefaultPromise.tryAwait(Promise.scala:208) > at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:218) > at > scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223) > at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107) > at > scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53) > at scala.concurrent.Await$.result(package.scala:107) > at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75) > at > org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:101) > at > org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:77) > at > org.apache.spark.scheduler.cluster.YarnSchedulerBackend.doRequestTotalExecutors(YarnSchedulerBackend.scala:59) > at > org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.requestTotalExecutors(CoarseGrainedSchedulerBackend.scala:436) > - locked <0x828a8960> (a > org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend) > at > org.apache.spark.SparkContext.requestTotalExecutors(SparkContext.scala:1438) > at > org.apache.spark.ExecutorAllocationManager.addExecutors(ExecutorAllocationManager.scala:359) > at > org.apache.spark.ExecutorAllocationManager.updateAndSyncNumExecutorsTarget(ExecutorAllocationManager.scala:310) > - locked <0x880e6308> (a > org.apache.spark.ExecutorAllocationManager) > at > org.apache.spark.ExecutorAllocationManager.org$apache$spark$ExecutorAllocationManager$$schedule(ExecutorAllocationManager.scala:264) > - locked <0x880e6308> (a > org.apache.spark.ExecutorAllocationManager) > at > org.apache.spark.ExecutorAllocationManager$$anon$2.run(ExecutorAllocationManager.scala:223) > "SparkListenerBus" #161 daemon prio=5 os_prio=0 tid=0x7fa3053be000 > nid=0xcec9 waiting for monitor entry [0x7fa2b3dfc000] >java.lang.Thread.State: BLOCKED (on object monitor) > at > org.apache.spark.ExecutorAllocationManager$ExecutorAllocationListener.onTaskEnd(ExecutorAllocationManager.scala:618) > - waiting to lock <0x880e6308> (a > org.apache.spark.ExecutorAllocationManager) > at > org.apache.spark.scheduler.SparkListenerBus$class.onPostEvent(SparkListenerBus.scala:42) > at > org.apache.spark.scheduler.LiveListenerBus.onPostEvent(LiveListenerBus.scala:31) > at > org.apache.spark.scheduler.LiveListenerBus.onPostEvent(LiveListenerBus.scala:31) > at > org.apache.spark.util.ListenerBus$class.postToAll(ListenerBus.scala:55) > at > org.apache.spark.util.AsynchronousListenerBus.postToAll(AsynchronousListenerBus.scala:37) > at > org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(AsynchronousListenerBus.scala:80) > at > org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(AsynchronousListenerBus.scala:65) > at > org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(AsynchronousListenerBus.scala:65) > at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57)
[jira] [Created] (SPARK-19486) Investigate using multiple threads for task serialization
Shivaram Venkataraman created SPARK-19486: - Summary: Investigate using multiple threads for task serialization Key: SPARK-19486 URL: https://issues.apache.org/jira/browse/SPARK-19486 Project: Spark Issue Type: Improvement Components: Scheduler Affects Versions: 2.1.0 Reporter: Shivaram Venkataraman This is related to SPARK-18890, where all the serialization logic is moved into the Scheduler backend thread. As a follow on to this we can investigate using a thread pool to serialize a number of tasks together instead of using a single thread to serialize all of them. Note that this may not yield sufficient benefits unless the driver has enough cores and we don't run into contention across threads. We can first investigate potential benefits and if there are sufficient benefits we can create a PR for this. cc [~kayousterhout] -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-19485) Launch tasks async i.e. dont wait for the network
Shivaram Venkataraman created SPARK-19485: - Summary: Launch tasks async i.e. dont wait for the network Key: SPARK-19485 URL: https://issues.apache.org/jira/browse/SPARK-19485 Project: Spark Issue Type: Improvement Components: Scheduler Affects Versions: 2.1.0 Reporter: Shivaram Venkataraman Currently the scheduling thread in CoarseGrainedSchedulerBackend is used to both walk through the list of offers and to serialize, create RPCs and send messages over the network. For stages with large number of tasks we can avoid blocking on RPCs / serialization by moving that to a separate thread in CGSB. As a part of this JIRA we can first investigate the potential benefits of doing this for different kinds of jobs (one large stage, many independent small stages etc.) and then propose a code change. cc [~kayousterhout] -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-19484) continue work to create a table with an empty schema
Song Jun created SPARK-19484: Summary: continue work to create a table with an empty schema Key: SPARK-19484 URL: https://issues.apache.org/jira/browse/SPARK-19484 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.2.0 Reporter: Song Jun Priority: Minor after SPARK-19279, we could not create a Hive table with an empty schema, we should tighten up the condition when create a hive table in https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala#L835 That is if a CatalogTable t has an empty schema, and (there is no `spark.sql.schema.numParts` or its value is 0), we should not add a default `col` schema, if we did, a table with an empty schema will be created, that is not we expected. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-19407) defaultFS is used FileSystem.get instead of getting it from uri scheme
[ https://issues.apache.org/jira/browse/SPARK-19407?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu resolved SPARK-19407. -- Resolution: Fixed Assignee: Genmao Yu Fix Version/s: 2.2.0 2.1.1 > defaultFS is used FileSystem.get instead of getting it from uri scheme > -- > > Key: SPARK-19407 > URL: https://issues.apache.org/jira/browse/SPARK-19407 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.1.0 >Reporter: Amit Assudani >Assignee: Genmao Yu > Labels: checkpoint, filesystem, starter, streaming > Fix For: 2.1.1, 2.2.0 > > > Caused by: java.lang.IllegalArgumentException: Wrong FS: > s3a://**/checkpoint/7b2231a3-d845-4740-bfa3-681850e5987f/metadata, > expected: file:/// > at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:649) > at > org.apache.hadoop.fs.RawLocalFileSystem.pathToFile(RawLocalFileSystem.java:82) > at > org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:606) > at > org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:824) > at > org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:601) > at > org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:421) > at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:1426) > at > org.apache.spark.sql.execution.streaming.StreamMetadata$.read(StreamMetadata.scala:51) > at > org.apache.spark.sql.execution.streaming.StreamExecution.(StreamExecution.scala:100) > at > org.apache.spark.sql.streaming.StreamingQueryManager.createQuery(StreamingQueryManager.scala:232) > at > org.apache.spark.sql.streaming.StreamingQueryManager.startQuery(StreamingQueryManager.scala:269) > at > org.apache.spark.sql.streaming.DataStreamWriter.start(DataStreamWriter.scala:262) > Can easily replicate on spark standalone cluster by providing checkpoint > location uri scheme anything other than "file://" and not overriding in > config. > WorkAround --conf spark.hadoop.fs.defaultFS=s3a://somebucket or set it in > sparkConf or spark-default.conf -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-19483) Add one RocketMQ plugin for the Apache Spark
[ https://issues.apache.org/jira/browse/SPARK-19483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Longda Feng updated SPARK-19483: External issue URL: https://issues.apache.org/jira/browse/ROCKETMQ-81 External issue ID: ROCKETMQ-81 > Add one RocketMQ plugin for the Apache Spark > > > Key: SPARK-19483 > URL: https://issues.apache.org/jira/browse/SPARK-19483 > Project: Spark > Issue Type: Task > Components: Input/Output >Affects Versions: 2.1.0 >Reporter: Longda Feng >Priority: Minor > > Apache RocketMQ® is an open source distributed messaging and streaming data > platform. It has been used in a lot of companies. Please refer to > http://rocketmq.incubator.apache.org/ for more details. > Since the Apache RocketMq 4.0 will be released in the next few days, we can > start the job of adding the RocketMq plugin for the Apache Spark. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-19483) Add one RocketMQ plugin for the Apache Spark
Longda Feng created SPARK-19483: --- Summary: Add one RocketMQ plugin for the Apache Spark Key: SPARK-19483 URL: https://issues.apache.org/jira/browse/SPARK-19483 Project: Spark Issue Type: Task Components: Input/Output Affects Versions: 2.1.0 Reporter: Longda Feng Priority: Minor Apache RocketMQ® is an open source distributed messaging and streaming data platform. It has been used in a lot of companies. Please refer to http://rocketmq.incubator.apache.org/ for more details. Since the Apache RocketMq 4.0 will be released in the next few days, we can start the job of adding the RocketMq plugin for the Apache Spark. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-19482) Fail it if 'spark.master' is set with different value
[ https://issues.apache.org/jira/browse/SPARK-19482?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19482: Assignee: (was: Apache Spark) > Fail it if 'spark.master' is set with different value > - > > Key: SPARK-19482 > URL: https://issues.apache.org/jira/browse/SPARK-19482 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.0.2, 2.1.0 >Reporter: Genmao Yu > > First, there is no need to set 'spark.master' multi-times with different > values. Second, It is possible for users to set the different 'spark.master' > in code with > `spark-submit` command, and will confuse users. So, we should do once check > if the 'spark.master' already exists in settings and if the previous value is > the same with current value. Throw a IllegalArgumentException when previous > value is different with current value. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-19482) Fail it if 'spark.master' is set with different value
[ https://issues.apache.org/jira/browse/SPARK-19482?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19482: Assignee: Apache Spark > Fail it if 'spark.master' is set with different value > - > > Key: SPARK-19482 > URL: https://issues.apache.org/jira/browse/SPARK-19482 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.0.2, 2.1.0 >Reporter: Genmao Yu >Assignee: Apache Spark > > First, there is no need to set 'spark.master' multi-times with different > values. Second, It is possible for users to set the different 'spark.master' > in code with > `spark-submit` command, and will confuse users. So, we should do once check > if the 'spark.master' already exists in settings and if the previous value is > the same with current value. Throw a IllegalArgumentException when previous > value is different with current value. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19482) Fail it if 'spark.master' is set with different value
[ https://issues.apache.org/jira/browse/SPARK-19482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15855240#comment-15855240 ] Apache Spark commented on SPARK-19482: -- User 'uncleGen' has created a pull request for this issue: https://github.com/apache/spark/pull/16827 > Fail it if 'spark.master' is set with different value > - > > Key: SPARK-19482 > URL: https://issues.apache.org/jira/browse/SPARK-19482 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.0.2, 2.1.0 >Reporter: Genmao Yu > > First, there is no need to set 'spark.master' multi-times with different > values. Second, It is possible for users to set the different 'spark.master' > in code with > `spark-submit` command, and will confuse users. So, we should do once check > if the 'spark.master' already exists in settings and if the previous value is > the same with current value. Throw a IllegalArgumentException when previous > value is different with current value. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-19482) Fail it if 'spark.master' is set with different value
Genmao Yu created SPARK-19482: - Summary: Fail it if 'spark.master' is set with different value Key: SPARK-19482 URL: https://issues.apache.org/jira/browse/SPARK-19482 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 2.1.0, 2.0.2 Reporter: Genmao Yu First, there is no need to set 'spark.master' multi-times with different values. Second, It is possible for users to set the different 'spark.master' in code with `spark-submit` command, and will confuse users. So, we should do once check if the 'spark.master' already exists in settings and if the previous value is the same with current value. Throw a IllegalArgumentException when previous value is different with current value. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-19467) PySpark ML shouldn't use circular imports
[ https://issues.apache.org/jira/browse/SPARK-19467?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley reassigned SPARK-19467: - Assignee: Maciej Szymkiewicz > PySpark ML shouldn't use circular imports > - > > Key: SPARK-19467 > URL: https://issues.apache.org/jira/browse/SPARK-19467 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Affects Versions: 2.0.0, 2.1.0, 2.2.0 >Reporter: Maciej Szymkiewicz >Assignee: Maciej Szymkiewicz >Priority: Minor > Fix For: 2.2.0 > > > {{pyspark.ml}} and {{pyspark.ml.pipeline}} contain circular imports with the > [former > one|https://github.com/apache/spark/blob/39f328ba3519b01940a7d1cdee851ba4e75ef31f/python/pyspark/ml/__init__.py#L23]: > {code} > from pyspark.ml.pipeline import Pipeline, PipelineModel > {code} > and the [latter > one|https://github.com/apache/spark/blob/39f328ba3519b01940a7d1cdee851ba4e75ef31f/python/pyspark/ml/pipeline.py#L24]: > {code} > from pyspark.ml import Estimator, Model, Transformer > {code} > This is unnecessary and can cause failures when working with external tools. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-19467) PySpark ML shouldn't use circular imports
[ https://issues.apache.org/jira/browse/SPARK-19467?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley resolved SPARK-19467. --- Resolution: Fixed Fix Version/s: 2.2.0 Issue resolved by pull request 16814 [https://github.com/apache/spark/pull/16814] > PySpark ML shouldn't use circular imports > - > > Key: SPARK-19467 > URL: https://issues.apache.org/jira/browse/SPARK-19467 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Affects Versions: 2.0.0, 2.1.0, 2.2.0 >Reporter: Maciej Szymkiewicz >Assignee: Maciej Szymkiewicz >Priority: Minor > Fix For: 2.2.0 > > > {{pyspark.ml}} and {{pyspark.ml.pipeline}} contain circular imports with the > [former > one|https://github.com/apache/spark/blob/39f328ba3519b01940a7d1cdee851ba4e75ef31f/python/pyspark/ml/__init__.py#L23]: > {code} > from pyspark.ml.pipeline import Pipeline, PipelineModel > {code} > and the [latter > one|https://github.com/apache/spark/blob/39f328ba3519b01940a7d1cdee851ba4e75ef31f/python/pyspark/ml/pipeline.py#L24]: > {code} > from pyspark.ml import Estimator, Model, Transformer > {code} > This is unnecessary and can cause failures when working with external tools. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-19441) Remove IN type coercion from PromoteStrings
[ https://issues.apache.org/jira/browse/SPARK-19441?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-19441. - Resolution: Fixed Issue resolved by pull request 16783 [https://github.com/apache/spark/pull/16783] > Remove IN type coercion from PromoteStrings > --- > > Key: SPARK-19441 > URL: https://issues.apache.org/jira/browse/SPARK-19441 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.1.0 >Reporter: Xiao Li >Assignee: Xiao Li > Fix For: 2.2.0 > > > The removed codes are not reachable, because `InConversion` already resolve > the type coercion issues. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19479) Spark Mesos artifact split causes spark-core dependency to not pull in mesos impl
[ https://issues.apache.org/jira/browse/SPARK-19479?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15855102#comment-15855102 ] Charles Allen commented on SPARK-19479: --- [~mgummelt] that's actually a really good suggestion. Somehow I never got subscribed to the dev list > Spark Mesos artifact split causes spark-core dependency to not pull in mesos > impl > - > > Key: SPARK-19479 > URL: https://issues.apache.org/jira/browse/SPARK-19479 > Project: Spark > Issue Type: Bug > Components: Mesos, Spark Core >Affects Versions: 2.1.0 >Reporter: Charles Allen > > https://github.com/apache/spark/pull/14637 ( > https://issues.apache.org/jira/browse/SPARK-16967 ) forked off the mesos impl > into its own artifact, but the release notes do not call this out. This broke > our deployments because we depend on packaging with spark-core, which no > longer had any mesos awareness. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15573) Backwards-compatible persistence for spark.ml
[ https://issues.apache.org/jira/browse/SPARK-15573?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15855097#comment-15855097 ] Joseph K. Bradley commented on SPARK-15573: --- It's a good point that we can't make updates to older Spark releases for persistence. However, I doubt that we would backport many such fixes for non-bugs. The issue you reference is arguably a scalability limit, not a bug. Still, adding an internal ML persistence version is a good idea; I'd be OK with it. > Backwards-compatible persistence for spark.ml > - > > Key: SPARK-15573 > URL: https://issues.apache.org/jira/browse/SPARK-15573 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: Joseph K. Bradley > > This JIRA is for imposing backwards-compatible persistence for the > DataFrames-based API for MLlib. I.e., we want to be able to load models > saved in previous versions of Spark. We will not require loading models > saved in later versions of Spark. > This requires: > * Putting unit tests in place to check loading models from previous versions > * Notifying all committers active on MLlib to be aware of this requirement in > the future > The unit tests could be written as in spark.mllib, where we essentially > copied and pasted the save() code every time it changed. This happens > rarely, so it should be acceptable, though other designs are fine. > Subtasks of this JIRA should cover checking and adding tests for existing > cases, such as KMeansModel (whose format changed between 1.6 and 2.0). -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19208) MultivariateOnlineSummarizer performance optimization
[ https://issues.apache.org/jira/browse/SPARK-19208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15855090#comment-15855090 ] Joseph K. Bradley commented on SPARK-19208: --- You're right that sharing intermediate results will be necessary. I'm happy with [~mlnick]'s VectorSummarizer API. I also think that, if we wanted to use the API I suggested above, the version returning a single struct col would work: {{df.select(VectorSummary.summary("features", "weights"))}}. The new column could be constructed from intermediate columns which would not show up in the final output. (Is this essentially the "private UDAF" [~podongfeng] is mentioning above?) I'm OK either way. > MultivariateOnlineSummarizer performance optimization > - > > Key: SPARK-19208 > URL: https://issues.apache.org/jira/browse/SPARK-19208 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: zhengruifeng > Attachments: Tests.pdf, WechatIMG2621.jpeg > > > Now, {{MaxAbsScaler}} and {{MinMaxScaler}} are using > {{MultivariateOnlineSummarizer}} to compute the min/max. > However {{MultivariateOnlineSummarizer}} will also compute extra unused > statistics. It slows down the task, moreover it is more prone to cause OOM. > For example: > env : --driver-memory 4G --executor-memory 1G --num-executors 4 > data: > [http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html#kdd2010%20(bridge%20to%20algebra)] > 748401 instances, and 29,890,095 features > {{MaxAbsScaler.fit}} fails because of OOM > {{MultivariateOnlineSummarizer}} maintains 8 arrays: > {code} > private var currMean: Array[Double] = _ > private var currM2n: Array[Double] = _ > private var currM2: Array[Double] = _ > private var currL1: Array[Double] = _ > private var totalCnt: Long = 0 > private var totalWeightSum: Double = 0.0 > private var weightSquareSum: Double = 0.0 > private var weightSum: Array[Double] = _ > private var nnz: Array[Long] = _ > private var currMax: Array[Double] = _ > private var currMin: Array[Double] = _ > {code} > For {{MaxAbsScaler}}, only 1 array is needed (max of abs value) > For {{MinMaxScaler}}, only 3 arrays are needed (max, min, nnz) > After modication in the pr, the above example run successfully. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19479) Spark Mesos artifact split causes spark-core dependency to not pull in mesos impl
[ https://issues.apache.org/jira/browse/SPARK-19479?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15855091#comment-15855091 ] Michael Gummelt commented on SPARK-19479: - Yea, sorry for the inconvenience, but I announced this on the dev list. Search for "Mesos is now a maven module". If I were you, I would create an email filter for "Mesos" on the user/dev lists. This is what I do. > Spark Mesos artifact split causes spark-core dependency to not pull in mesos > impl > - > > Key: SPARK-19479 > URL: https://issues.apache.org/jira/browse/SPARK-19479 > Project: Spark > Issue Type: Bug > Components: Mesos, Spark Core >Affects Versions: 2.1.0 >Reporter: Charles Allen > > https://github.com/apache/spark/pull/14637 ( > https://issues.apache.org/jira/browse/SPARK-16967 ) forked off the mesos impl > into its own artifact, but the release notes do not call this out. This broke > our deployments because we depend on packaging with spark-core, which no > longer had any mesos awareness. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12157) Support numpy types as return values of Python UDFs
[ https://issues.apache.org/jira/browse/SPARK-12157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15855060#comment-15855060 ] Joseph K. Bradley commented on SPARK-12157: --- I don't know of any Python UDF perf tests. Ad hoc tests could suffice for now... > Support numpy types as return values of Python UDFs > --- > > Key: SPARK-12157 > URL: https://issues.apache.org/jira/browse/SPARK-12157 > Project: Spark > Issue Type: Improvement > Components: PySpark, SQL >Affects Versions: 1.5.2 >Reporter: Justin Uang > > Currently, if I have a python UDF > {code} > import pyspark.sql.types as T > import pyspark.sql.functions as F > from pyspark.sql import Row > import numpy as np > argmax = F.udf(lambda x: np.argmax(x), T.IntegerType()) > df = sqlContext.createDataFrame([Row(array=[1,2,3])]) > df.select(argmax("array")).count() > {code} > I get an exception that is fairly opaque: > {code} > Caused by: net.razorvine.pickle.PickleException: expected zero arguments for > construction of ClassDict (for numpy.dtype) > at > net.razorvine.pickle.objects.ClassDictConstructor.construct(ClassDictConstructor.java:23) > at net.razorvine.pickle.Unpickler.load_reduce(Unpickler.java:701) > at net.razorvine.pickle.Unpickler.dispatch(Unpickler.java:171) > at net.razorvine.pickle.Unpickler.load(Unpickler.java:85) > at net.razorvine.pickle.Unpickler.loads(Unpickler.java:98) > at > org.apache.spark.sql.execution.BatchPythonEvaluation$$anonfun$doExecute$1$$anonfun$apply$3.apply(python.scala:404) > at > org.apache.spark.sql.execution.BatchPythonEvaluation$$anonfun$doExecute$1$$anonfun$apply$3.apply(python.scala:403) > {code} > Numpy types like np.int and np.float64 should automatically be cast to the > proper dtypes. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16824) Add API docs for VectorUDT
[ https://issues.apache.org/jira/browse/SPARK-16824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15855057#comment-15855057 ] Joseph K. Bradley commented on SPARK-16824: --- I think we didn't document it since the future of UDTs becoming public APIs was uncertain. VectorUDT is private in spark.ml. Still, adding docs for the public spark.mllib VectorUDT sounds good to me. > Add API docs for VectorUDT > -- > > Key: SPARK-16824 > URL: https://issues.apache.org/jira/browse/SPARK-16824 > Project: Spark > Issue Type: Documentation > Components: Documentation, MLlib, PySpark >Affects Versions: 2.0.0 >Reporter: Nicholas Chammas >Priority: Minor > > Following on the [discussion > here|https://issues.apache.org/jira/browse/SPARK-12157?focusedCommentId=15401153=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15401153], > it appears that {{VectorUDT}} is missing documentation, at least in PySpark. > I'm not sure if this is intentional or not. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-16824) Add API docs for VectorUDT
[ https://issues.apache.org/jira/browse/SPARK-16824?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-16824: -- Issue Type: Documentation (was: Improvement) > Add API docs for VectorUDT > -- > > Key: SPARK-16824 > URL: https://issues.apache.org/jira/browse/SPARK-16824 > Project: Spark > Issue Type: Documentation > Components: Documentation, MLlib, PySpark >Affects Versions: 2.0.0 >Reporter: Nicholas Chammas >Priority: Minor > > Following on the [discussion > here|https://issues.apache.org/jira/browse/SPARK-12157?focusedCommentId=15401153=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15401153], > it appears that {{VectorUDT}} is missing documentation, at least in PySpark. > I'm not sure if this is intentional or not. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-16824) Add API docs for VectorUDT
[ https://issues.apache.org/jira/browse/SPARK-16824?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-16824: -- Component/s: MLlib > Add API docs for VectorUDT > -- > > Key: SPARK-16824 > URL: https://issues.apache.org/jira/browse/SPARK-16824 > Project: Spark > Issue Type: Improvement > Components: Documentation, MLlib, PySpark >Affects Versions: 2.0.0 >Reporter: Nicholas Chammas >Priority: Minor > > Following on the [discussion > here|https://issues.apache.org/jira/browse/SPARK-12157?focusedCommentId=15401153=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15401153], > it appears that {{VectorUDT}} is missing documentation, at least in PySpark. > I'm not sure if this is intentional or not. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18891) Support for specific collection types
[ https://issues.apache.org/jira/browse/SPARK-18891?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15855011#comment-15855011 ] Michal Šenkýř commented on SPARK-18891: --- Thanks. Yes, I know about the Maps issue and I will be happy to continue on to Map support. However in [PR #16541|https://github.com/apache/spark/pull/16541] I tried to use a better deserialization method. In order to make the code for Maps consistent with that of Seqs I am waiting for how that turns out before I start on that. > Support for specific collection types > - > > Key: SPARK-18891 > URL: https://issues.apache.org/jira/browse/SPARK-18891 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.6.3, 2.1.0 >Reporter: Michael Armbrust >Priority: Critical > > Encoders treat all collections the same (i.e. {{Seq}} vs {{List}}) which > force users to only define classes with the most generic type. > An [example > error|https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/1023043053387187/2398463439880241/2840265927289860/latest.html]: > {code} > case class SpecificCollection(aList: List[Int]) > Seq(SpecificCollection(1 :: Nil)).toDS().collect() > {code} > {code} > java.lang.RuntimeException: Error while decoding: > java.util.concurrent.ExecutionException: java.lang.Exception: failed to > compile: org.codehaus.commons.compiler.CompileException: File > 'generated.java', Line 98, Column 120: No applicable constructor/method found > for actual parameters "scala.collection.Seq"; candidates are: > "line29e7e4b1e36445baa3505b2e102aa86b29.$read$$iw$$iw$$iw$$iw$SpecificCollection(scala.collection.immutable.List)" > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-19481) Fix flaky test: o.a.s.repl.ReplSuite should clone and clean line object in ClosureCleaner
[ https://issues.apache.org/jira/browse/SPARK-19481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19481: Assignee: Shixiong Zhu (was: Apache Spark) > Fix flaky test: o.a.s.repl.ReplSuite should clone and clean line object in > ClosureCleaner > - > > Key: SPARK-19481 > URL: https://issues.apache.org/jira/browse/SPARK-19481 > Project: Spark > Issue Type: Test > Components: Spark Shell >Affects Versions: 2.1.0 >Reporter: Shixiong Zhu >Assignee: Shixiong Zhu > > org.apache.spark.repl.cancelOnInterrupt leaks a SparkContext and makes the > tests unstable. See: > http://spark-tests.appspot.com/test-details?suite_name=org.apache.spark.repl.ReplSuite_name=should+clone+and+clean+line+object+in+ClosureCleaner -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-19481) Fix flaky test: o.a.s.repl.ReplSuite should clone and clean line object in ClosureCleaner
[ https://issues.apache.org/jira/browse/SPARK-19481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19481: Assignee: Apache Spark (was: Shixiong Zhu) > Fix flaky test: o.a.s.repl.ReplSuite should clone and clean line object in > ClosureCleaner > - > > Key: SPARK-19481 > URL: https://issues.apache.org/jira/browse/SPARK-19481 > Project: Spark > Issue Type: Test > Components: Spark Shell >Affects Versions: 2.1.0 >Reporter: Shixiong Zhu >Assignee: Apache Spark > > org.apache.spark.repl.cancelOnInterrupt leaks a SparkContext and makes the > tests unstable. See: > http://spark-tests.appspot.com/test-details?suite_name=org.apache.spark.repl.ReplSuite_name=should+clone+and+clean+line+object+in+ClosureCleaner -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19481) Fix flaky test: o.a.s.repl.ReplSuite should clone and clean line object in ClosureCleaner
[ https://issues.apache.org/jira/browse/SPARK-19481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15854889#comment-15854889 ] Apache Spark commented on SPARK-19481: -- User 'zsxwing' has created a pull request for this issue: https://github.com/apache/spark/pull/16825 > Fix flaky test: o.a.s.repl.ReplSuite should clone and clean line object in > ClosureCleaner > - > > Key: SPARK-19481 > URL: https://issues.apache.org/jira/browse/SPARK-19481 > Project: Spark > Issue Type: Test > Components: Spark Shell >Affects Versions: 2.1.0 >Reporter: Shixiong Zhu >Assignee: Shixiong Zhu > > org.apache.spark.repl.cancelOnInterrupt leaks a SparkContext and makes the > tests unstable. See: > http://spark-tests.appspot.com/test-details?suite_name=org.apache.spark.repl.ReplSuite_name=should+clone+and+clean+line+object+in+ClosureCleaner -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-19481) Fix flaky test: o.a.s.repl.ReplSuite should clone and clean line object in ClosureCleaner
Shixiong Zhu created SPARK-19481: Summary: Fix flaky test: o.a.s.repl.ReplSuite should clone and clean line object in ClosureCleaner Key: SPARK-19481 URL: https://issues.apache.org/jira/browse/SPARK-19481 Project: Spark Issue Type: Test Components: Spark Shell Affects Versions: 2.1.0 Reporter: Shixiong Zhu Assignee: Shixiong Zhu org.apache.spark.repl.cancelOnInterrupt leaks a SparkContext and makes the tests unstable. See: http://spark-tests.appspot.com/test-details?suite_name=org.apache.spark.repl.ReplSuite_name=should+clone+and+clean+line+object+in+ClosureCleaner -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-19480) Higher order functions in SQL
Reynold Xin created SPARK-19480: --- Summary: Higher order functions in SQL Key: SPARK-19480 URL: https://issues.apache.org/jira/browse/SPARK-19480 Project: Spark Issue Type: New Feature Components: SQL Affects Versions: 2.2.0 Reporter: Reynold Xin To enable users to manipulate nested data types, which is common in ETL jobs with deeply nested JSON fields. Operations should include map, filter, reduce on arrays/maps. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-19472) [SQL]SQLParser fails to resolve nested CASE WHEN statement with parentheses
[ https://issues.apache.org/jira/browse/SPARK-19472?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li resolved SPARK-19472. - Resolution: Fixed Assignee: Herman van Hovell Fix Version/s: 2.2.0 2.1.1 2.0.3 > [SQL]SQLParser fails to resolve nested CASE WHEN statement with parentheses > --- > > Key: SPARK-19472 > URL: https://issues.apache.org/jira/browse/SPARK-19472 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: StanZhai >Assignee: Herman van Hovell > Fix For: 2.0.3, 2.1.1, 2.2.0 > > > SQLParser fails to resolve nested CASE WHEN statement like this: > select case when > (1) + > case when 1>0 then 1 else 0 end = 2 > then 1 else 0 end > from tb > Exception > Exception in thread "main" > org.apache.spark.sql.catalyst.parser.ParseException: > mismatched input 'then' expecting {'.', '[', 'OR', 'AND', 'IN', NOT, > 'BETWEEN', 'LIKE', RLIKE, 'IS', 'WHEN', EQ, '<=>', '<>', '!=', '<', LTE, '>', > GTE, '+', '-', '*', '/', '%', 'DIV', '&', '|', '^'}(line 5, pos 0) > == SQL == > select case when > (1) + > case when 1>0 then 1 else 0 end = 2 > then 1 else 0 end > ^^^ > from tb > But,remove parentheses will be fine: > select case when > 1 + > case when 1>0 then 1 else 0 end = 2 > then 1 else 0 end > from tb -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-19479) Spark Mesos artifact split causes spark-core dependency to not pull in mesos impl
[ https://issues.apache.org/jira/browse/SPARK-19479?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-19479. --- Resolution: Invalid As I said on the PR, I don't think this is for release notes because release notes are for end users. Regardless, I am not sure what the actionable change is supposed to be here. > Spark Mesos artifact split causes spark-core dependency to not pull in mesos > impl > - > > Key: SPARK-19479 > URL: https://issues.apache.org/jira/browse/SPARK-19479 > Project: Spark > Issue Type: Bug > Components: Mesos, Spark Core >Affects Versions: 2.1.0 >Reporter: Charles Allen > > https://github.com/apache/spark/pull/14637 ( > https://issues.apache.org/jira/browse/SPARK-16967 ) forked off the mesos impl > into its own artifact, but the release notes do not call this out. This broke > our deployments because we depend on packaging with spark-core, which no > longer had any mesos awareness. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-19479) Spark Mesos artifact split causes spark-core dependency to not pull in mesos impl
Charles Allen created SPARK-19479: - Summary: Spark Mesos artifact split causes spark-core dependency to not pull in mesos impl Key: SPARK-19479 URL: https://issues.apache.org/jira/browse/SPARK-19479 Project: Spark Issue Type: Bug Components: Mesos, Spark Core Affects Versions: 2.1.0 Reporter: Charles Allen https://github.com/apache/spark/pull/14637 ( https://issues.apache.org/jira/browse/SPARK-16967 ) forked off the mesos impl into its own artifact, but the release notes do not call this out. This broke our deployments because we depend on packaging with spark-core, which no longer had any mesos awareness. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-19478) JDBC Sink
[ https://issues.apache.org/jira/browse/SPARK-19478?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-19478: - Issue Type: New Feature (was: Bug) > JDBC Sink > - > > Key: SPARK-19478 > URL: https://issues.apache.org/jira/browse/SPARK-19478 > Project: Spark > Issue Type: New Feature > Components: Structured Streaming >Affects Versions: 2.0.0 >Reporter: Michael Armbrust > > A sink that transactionally commits data into a database use JDBC. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-19478) JDBC Sink
Michael Armbrust created SPARK-19478: Summary: JDBC Sink Key: SPARK-19478 URL: https://issues.apache.org/jira/browse/SPARK-19478 Project: Spark Issue Type: Bug Components: Structured Streaming Affects Versions: 2.0.0 Reporter: Michael Armbrust A sink that transactionally commits data into a database use JDBC. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-19398) Log in TaskSetManager is not correct
[ https://issues.apache.org/jira/browse/SPARK-19398?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kay Ousterhout updated SPARK-19398: --- Fix Version/s: 2.2 > Log in TaskSetManager is not correct > > > Key: SPARK-19398 > URL: https://issues.apache.org/jira/browse/SPARK-19398 > Project: Spark > Issue Type: Improvement > Components: Scheduler >Affects Versions: 2.1.0 >Reporter: jin xing >Assignee: jin xing >Priority: Trivial > Fix For: 2.2 > > > Log below is misleading: > {code:title="TaskSetManager.scala"} > if (successful(index)) { > logInfo( > s"Task ${info.id} in stage ${taskSet.id} (TID $tid) failed, " + > "but another instance of the task has already succeeded, " + > "so not re-queuing the task to be re-executed.") > } > {code} > If fetch failed, the task is marked as *successful* in *TaskSetManager:: > handleFailedTask*. Then log above will be printed. The *successful* just > means task will not be scheduled any longer, not a real success. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-19398) Log in TaskSetManager is not correct
[ https://issues.apache.org/jira/browse/SPARK-19398?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kay Ousterhout resolved SPARK-19398. Resolution: Fixed Assignee: jin xing > Log in TaskSetManager is not correct > > > Key: SPARK-19398 > URL: https://issues.apache.org/jira/browse/SPARK-19398 > Project: Spark > Issue Type: Improvement > Components: Scheduler >Affects Versions: 2.1.0 >Reporter: jin xing >Assignee: jin xing >Priority: Trivial > > Log below is misleading: > {code:title="TaskSetManager.scala"} > if (successful(index)) { > logInfo( > s"Task ${info.id} in stage ${taskSet.id} (TID $tid) failed, " + > "but another instance of the task has already succeeded, " + > "so not re-queuing the task to be re-executed.") > } > {code} > If fetch failed, the task is marked as *successful* in *TaskSetManager:: > handleFailedTask*. Then log above will be printed. The *successful* just > means task will not be scheduled any longer, not a real success. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19462) when spark.sql.adaptive.enabled is enabled, DF is not resilient to node/container failure
[ https://issues.apache.org/jira/browse/SPARK-19462?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15854534#comment-15854534 ] Ian commented on SPARK-19462: - I appears that the state mutating of newPartitioning of org.apache.spark.sql.execution.Exchange is preventing the recalculation when dynamic shuffle partitioning is enabled. > when spark.sql.adaptive.enabled is enabled, DF is not resilient to > node/container failure > - > > Key: SPARK-19462 > URL: https://issues.apache.org/jira/browse/SPARK-19462 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.3 >Reporter: Ian > > property spark.sql.adaptive.enabled needs to be set "true" for the issue to > be reproduced. > reproducible steps using spark-shell > 0. we use yarn as cluster manager, spark-shell runs in client mode > 1. launch spark-shell > 2. > {code} > val df1 = sc.parallelize( 1 to 1000, 2).toDF("number") > df1.registerTempTable("test") > val data1 = sqlContext.sql("SELECT * FROM test WHERE number > 50") > data1.collect > val data2 = sqlContext.sql("SELECT number, count(*) cnt FROM test GROUP BY > number") > data2.collect > // everything is fine up to this point > // manually kill both the AM and all the NMs of the spark-shell app > // re-run data1.collect, the result is returned successfully > data1.collect > // but data2.collect will fail > data2.collect > // stacktrace > Caused by: java.lang.RuntimeException: Exchange not implemented for > UnknownPartitioning(1) > at scala.sys.package$.error(package.scala:27) > at > org.apache.spark.sql.execution.Exchange.org$apache$spark$sql$execution$Exchange$$getPartitionKeyExtractor$1(Exchange.scala:198) > at > org.apache.spark.sql.execution.Exchange$$anonfun$3.apply(Exchange.scala:208) > at > org.apache.spark.sql.execution.Exchange$$anonfun$3.apply(Exchange.scala:207) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$21.apply(RDD.scala:728) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$21.apply(RDD.scala:728) > at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:270) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) > at org.apache.spark.scheduler.Task.run(Task.scala:89) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:227) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > {code} > The difference between data1 and data2 is whether ShuffledRowRDD is present > in lineage. > When the RDD lineage contains ShuffledRowRDD, the above mentioned behavior > can be observed when node failures or container loss happens. > {code} > scala> data2.rdd.toDebugString > res6: String = > (1) MapPartitionsRDD[20] at rdd at :26 [] > | MapPartitionsRDD[19] at rdd at :26 [] > | ShuffledRowRDD[8] at collect at :26 [] > +-(2) MapPartitionsRDD[7] at collect at :26 [] > | MapPartitionsRDD[6] at collect at :26 [] > | MapPartitionsRDD[5] at collect at :26 [] > | MapPartitionsRDD[1] at intRddToDataFrameHolder at :25 [] > | ParallelCollectionRDD[0] at parallelize at :25 [] > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-19477) [SQL] Datasets created from a Dataframe with extra columns retain the extra columns
Don Drake created SPARK-19477: - Summary: [SQL] Datasets created from a Dataframe with extra columns retain the extra columns Key: SPARK-19477 URL: https://issues.apache.org/jira/browse/SPARK-19477 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.1.0 Reporter: Don Drake In 1.6, when you created a Dataset from a Dataframe that had extra columns, the columns not in the case class were dropped from the Dataset. For example in 1.6, the column c4 is gone: {code} scala> case class F(f1: String, f2: String, f3:String) defined class F scala> import sqlContext.implicits._ import sqlContext.implicits._ scala> val df = Seq(("a","b","c","x"), ("d", "e", "f","y"), ("h", "i", "j","z")).toDF("f1", "f2", "f3", "c4") df: org.apache.spark.sql.DataFrame = [f1: string, f2: string, f3: string, c4: string] scala> val ds = df.as[F] ds: org.apache.spark.sql.Dataset[F] = [f1: string, f2: string, f3: string] scala> ds.show +---+---+---+ | f1| f2| f3| +---+---+---+ | a| b| c| | d| e| f| | h| i| j| {code} This seems to have changed in Spark 2.0 and also 2.1: Spark 2.1.0: {code} scala> case class F(f1: String, f2: String, f3:String) defined class F scala> import spark.implicits._ import spark.implicits._ scala> val df = Seq(("a","b","c","x"), ("d", "e", "f","y"), ("h", "i", "j","z")).toDF("f1", "f2", "f3", "c4") df: org.apache.spark.sql.DataFrame = [f1: string, f2: string ... 2 more fields] scala> val ds = df.as[F] ds: org.apache.spark.sql.Dataset[F] = [f1: string, f2: string ... 2 more fields] scala> ds.show +---+---+---+---+ | f1| f2| f3| c4| +---+---+---+---+ | a| b| c| x| | d| e| f| y| | h| i| j| z| +---+---+---+---+ scala> import org.apache.spark.sql.Encoders import org.apache.spark.sql.Encoders scala> val fEncoder = Encoders.product[F] fEncoder: org.apache.spark.sql.Encoder[F] = class[f1[0]: string, f2[0]: string, f3[0]: string] scala> fEncoder.schema == ds.schema res2: Boolean = false scala> ds.schema res3: org.apache.spark.sql.types.StructType = StructType(StructField(f1,StringType,true), StructField(f2,StringType,true), StructField(f3,StringType,true), StructField(c4,StringType,true)) scala> fEncoder.schema res4: org.apache.spark.sql.types.StructType = StructType(StructField(f1,StringType,true), StructField(f2,StringType,true), StructField(f3,StringType,true)) {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18069) Many examples in Python docstrings are incomplete
[ https://issues.apache.org/jira/browse/SPARK-18069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15854468#comment-15854468 ] Apache Spark commented on SPARK-18069: -- User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/16824 > Many examples in Python docstrings are incomplete > - > > Key: SPARK-18069 > URL: https://issues.apache.org/jira/browse/SPARK-18069 > Project: Spark > Issue Type: Documentation > Components: Documentation >Affects Versions: 2.0.1 >Reporter: Mortada Mehyar >Priority: Minor > > A lot of the python API functions show example usage that is incomplete. The > docstring shows output without having the input DataFrame defined. It can be > quite confusing trying to understand and/or follow the example. > For instance, the docstring for `DataFrame.dtypes()` is currently > {code} > def dtypes(self): > """Returns all column names and their data types as a list. > > >>> df.dtypes > [('age', 'int'), ('name', 'string')] > """ > {code} > when it should really be > {code} > def dtypes(self): > """Returns all column names and their data types as a list. > > >>> df = spark.createDataFrame([('Alice', 2), ('Bob', 5)], ['name', > 'age']) > >>> df.dtypes > [('age', 'int'), ('name', 'string')] > """ > {code} > I have a pending PR for fixing many of these occurrences here: > https://github.com/apache/spark/pull/15053 -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19439) PySpark's registerJavaFunction Should Support UDAFs
[ https://issues.apache.org/jira/browse/SPARK-19439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15854330#comment-15854330 ] Keith Bourgoin commented on SPARK-19439: SPARK-10915 refers to making it possible to write UDAFs in Python, and looks like it'll require a significant amount of work. This ticket is merely looking to add the ability to call Scala UDAFs from Python. It's substantially less work, and a different sort of issue. In fact, I was able to hack my way around and figure out how to get it working just now: {code} In [1]: data = [[i+0.5] for i in range(100)] In [2]: df = sqlContext.createDataFrame(data, ['key']) In [3]: df.registerTempTable('df') In [4]: geo_mean = sc._jvm.com.foo.bar.GeometricMean() In [5]: sqlContext.sparkSession._jsparkSession.udf().register('geo_mean', geo_mean) Out[5]: JavaObject id=o43 In [6]: sqlContext.sql('select geo_mean(key) from df').collect() Out[6]: [Row(geometricmean(key)=36.91550879325741)] {code} I clearly can't submit this as a PR, but if I have time I'll try to cook up a worthwhile patch to get the process moving. > PySpark's registerJavaFunction Should Support UDAFs > --- > > Key: SPARK-19439 > URL: https://issues.apache.org/jira/browse/SPARK-19439 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 2.1.0 >Reporter: Keith Bourgoin > > When trying to import a Scala UDAF using registerJavaFunction, I get this > error: > {code} > In [1]: sqlContext.registerJavaFunction('geo_mean', > 'com.foo.bar.GeometricMean') > --- > Py4JJavaError Traceback (most recent call last) > in () > > 1 sqlContext.registerJavaFunction('geo_mean', > 'com.foo.bar.GeometricMean') > /home/kfb/src/projects/spark/python/pyspark/sql/context.pyc in > registerJavaFunction(self, name, javaClassName, returnType) > 227 if returnType is not None: > 228 jdt = > self.sparkSession._jsparkSession.parseDataType(returnType.json()) > --> 229 self.sparkSession._jsparkSession.udf().registerJava(name, > javaClassName, jdt) > 230 > 231 # TODO(andrew): delete this once we refactor things to take in > SparkSession > /home/kfb/src/projects/spark/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py > in __call__(self, *args) >1131 answer = self.gateway_client.send_command(command) >1132 return_value = get_return_value( > -> 1133 answer, self.gateway_client, self.target_id, self.name) >1134 >1135 for temp_arg in temp_args: > /home/kfb/src/projects/spark/python/pyspark/sql/utils.pyc in deco(*a, **kw) > 61 def deco(*a, **kw): > 62 try: > ---> 63 return f(*a, **kw) > 64 except py4j.protocol.Py4JJavaError as e: > 65 s = e.java_exception.toString() > /home/kfb/src/projects/spark/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py > in get_return_value(answer, gateway_client, target_id, name) > 317 raise Py4JJavaError( > 318 "An error occurred while calling {0}{1}{2}.\n". > --> 319 format(target_id, ".", name), value) > 320 else: > 321 raise Py4JError( > Py4JJavaError: An error occurred while calling o28.registerJava. > : java.io.IOException: UDF class com.foo.bar.GeometricMean doesn't implement > any UDF interface > at > org.apache.spark.sql.UDFRegistration.registerJava(UDFRegistration.scala:438) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244) > at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357) > at py4j.Gateway.invoke(Gateway.java:280) > at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132) > at py4j.commands.CallCommand.execute(CallCommand.java:79) > at py4j.GatewayConnection.run(GatewayConnection.java:214) > at java.lang.Thread.run(Thread.java:745) > {code} > According to SPARK-10915, UDAFs in Python aren't happening anytime soon. > Without this, there's no way to get Scala UDAFs into Python Spark SQL > whatsoever. Fixing that would be a huge help so that we can keep aggregations > in the JVM and using DataFrames. Otherwise, all our code has to drop to to > RDDs and live in Python. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Created] (SPARK-19476) Running threads in Spark DataFrame foreachPartition() causes NullPointerException
Gal Topper created SPARK-19476: -- Summary: Running threads in Spark DataFrame foreachPartition() causes NullPointerException Key: SPARK-19476 URL: https://issues.apache.org/jira/browse/SPARK-19476 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.1.0, 2.0.2, 2.0.1, 2.0.0, 1.6.3, 1.6.2, 1.6.1, 1.6.0 Reporter: Gal Topper First reported on [Stack overflow|http://stackoverflow.com/questions/41674069/running-threads-in-spark-dataframe-foreachpartition]. I use multiple threads inside foreachPartition(), which works great for me except for when the underlying iterator is TungstenAggregationIterator. Here is a minimal code snippet to reproduce: {code:title=Reproduce.scala|borderStyle=solid} import scala.concurrent.ExecutionContext.Implicits.global import scala.concurrent.duration.Duration import scala.concurrent.{Await, Future} import org.apache.spark.SparkContext import org.apache.spark.sql.SQLContext object Reproduce extends App { val sc = new SparkContext("local", "reproduce") val sqlContext = new SQLContext(sc) import sqlContext.implicits._ val df = sc.parallelize(Seq(1)).toDF("number").groupBy("number").count() df.foreachPartition { iterator => val f = Future(iterator.toVector) Await.result(f, Duration.Inf) } } {code} When I run this, I get: {noformat} java.lang.NullPointerException at org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.next(TungstenAggregationIterator.scala:751) at org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.next(TungstenAggregationIterator.scala:84) at scala.collection.Iterator$$anon$11.next(Iterator.scala:409) at scala.collection.Iterator$class.foreach(Iterator.scala:893) at scala.collection.AbstractIterator.foreach(Iterator.scala:1336) {noformat} I believe I actually understand why this happens - TungstenAggregationIterator uses a ThreadLocal variable that returns null when called from a thread other than the original thread that got the iterator from Spark. From examining the code, this does not appear to differ between recent Spark versions. However, this limitation is specific to TungstenAggregationIterator, and not documented, as far as I'm aware. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-19080) simplify data source analysis
[ https://issues.apache.org/jira/browse/SPARK-19080?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-19080. - Resolution: Fixed Fix Version/s: 2.2.0 Issue resolved by pull request 16269 [https://github.com/apache/spark/pull/16269] > simplify data source analysis > - > > Key: SPARK-19080 > URL: https://issues.apache.org/jira/browse/SPARK-19080 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Wenchen Fan >Assignee: Wenchen Fan > Fix For: 2.2.0 > > -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-19475) (ML|MLlib).linalg.DenseVector method delegation fails for __neg__
[ https://issues.apache.org/jira/browse/SPARK-19475?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19475: Assignee: (was: Apache Spark) > (ML|MLlib).linalg.DenseVector method delegation fails for __neg__ > - > > Key: SPARK-19475 > URL: https://issues.apache.org/jira/browse/SPARK-19475 > Project: Spark > Issue Type: Bug > Components: ML, MLlib, PySpark >Affects Versions: 2.0.0, 2.1.0, 2.2.0 >Reporter: Maciej Szymkiewicz >Priority: Minor > > {{(ML|MLlib).linalg.DenseVector}} delegate number of methods to NumPy. By > design [it does the > same|https://github.com/apache/spark/blob/933a6548d423cf17448207a99299cf36fc1a95f6/python/pyspark/mllib/linalg/__init__.py#L487] > for {{__neg__}} but current {{_delegate}} method [expects binary > operators|https://github.com/apache/spark/blob/933a6548d423cf17448207a99299cf36fc1a95f6/python/pyspark/mllib/linalg/__init__.py#L481]. > {code} > Welcome to > __ > / __/__ ___ _/ /__ > _\ \/ _ \/ _ `/ __/ '_/ >/__ / .__/\_,_/_/ /_/\_\ version 2.1.0 > /_/ > Using Python version 3.5.2 (default, Jul 2 2016 17:53:06) > SparkSession available as 'spark'. > In [1]: from pyspark.ml import linalg > In [2]: -linalg.DenseVector([1, 2, 3]) > --- > TypeError Traceback (most recent call last) > in () > > 1 -linalg.DenseVector([1, 2, 3]) > TypeError: func() missing 1 required positional argument: 'other' > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19475) (ML|MLlib).linalg.DenseVector method delegation fails for __neg__
[ https://issues.apache.org/jira/browse/SPARK-19475?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15854270#comment-15854270 ] Apache Spark commented on SPARK-19475: -- User 'zero323' has created a pull request for this issue: https://github.com/apache/spark/pull/16822 > (ML|MLlib).linalg.DenseVector method delegation fails for __neg__ > - > > Key: SPARK-19475 > URL: https://issues.apache.org/jira/browse/SPARK-19475 > Project: Spark > Issue Type: Bug > Components: ML, MLlib, PySpark >Affects Versions: 2.0.0, 2.1.0, 2.2.0 >Reporter: Maciej Szymkiewicz >Priority: Minor > > {{(ML|MLlib).linalg.DenseVector}} delegate number of methods to NumPy. By > design [it does the > same|https://github.com/apache/spark/blob/933a6548d423cf17448207a99299cf36fc1a95f6/python/pyspark/mllib/linalg/__init__.py#L487] > for {{__neg__}} but current {{_delegate}} method [expects binary > operators|https://github.com/apache/spark/blob/933a6548d423cf17448207a99299cf36fc1a95f6/python/pyspark/mllib/linalg/__init__.py#L481]. > {code} > Welcome to > __ > / __/__ ___ _/ /__ > _\ \/ _ \/ _ `/ __/ '_/ >/__ / .__/\_,_/_/ /_/\_\ version 2.1.0 > /_/ > Using Python version 3.5.2 (default, Jul 2 2016 17:53:06) > SparkSession available as 'spark'. > In [1]: from pyspark.ml import linalg > In [2]: -linalg.DenseVector([1, 2, 3]) > --- > TypeError Traceback (most recent call last) > in () > > 1 -linalg.DenseVector([1, 2, 3]) > TypeError: func() missing 1 required positional argument: 'other' > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-19475) (ML|MLlib).linalg.DenseVector method delegation fails for __neg__
[ https://issues.apache.org/jira/browse/SPARK-19475?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19475: Assignee: Apache Spark > (ML|MLlib).linalg.DenseVector method delegation fails for __neg__ > - > > Key: SPARK-19475 > URL: https://issues.apache.org/jira/browse/SPARK-19475 > Project: Spark > Issue Type: Bug > Components: ML, MLlib, PySpark >Affects Versions: 2.0.0, 2.1.0, 2.2.0 >Reporter: Maciej Szymkiewicz >Assignee: Apache Spark >Priority: Minor > > {{(ML|MLlib).linalg.DenseVector}} delegate number of methods to NumPy. By > design [it does the > same|https://github.com/apache/spark/blob/933a6548d423cf17448207a99299cf36fc1a95f6/python/pyspark/mllib/linalg/__init__.py#L487] > for {{__neg__}} but current {{_delegate}} method [expects binary > operators|https://github.com/apache/spark/blob/933a6548d423cf17448207a99299cf36fc1a95f6/python/pyspark/mllib/linalg/__init__.py#L481]. > {code} > Welcome to > __ > / __/__ ___ _/ /__ > _\ \/ _ \/ _ `/ __/ '_/ >/__ / .__/\_,_/_/ /_/\_\ version 2.1.0 > /_/ > Using Python version 3.5.2 (default, Jul 2 2016 17:53:06) > SparkSession available as 'spark'. > In [1]: from pyspark.ml import linalg > In [2]: -linalg.DenseVector([1, 2, 3]) > --- > TypeError Traceback (most recent call last) > in () > > 1 -linalg.DenseVector([1, 2, 3]) > TypeError: func() missing 1 required positional argument: 'other' > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-19475) (ML|MLlib).linalg.DenseVector method delegation fails for __neg__
Maciej Szymkiewicz created SPARK-19475: -- Summary: (ML|MLlib).linalg.DenseVector method delegation fails for __neg__ Key: SPARK-19475 URL: https://issues.apache.org/jira/browse/SPARK-19475 Project: Spark Issue Type: Bug Components: ML, MLlib, PySpark Affects Versions: 2.1.0, 2.0.0, 2.2.0 Reporter: Maciej Szymkiewicz Priority: Minor {{(ML|MLlib).linalg.DenseVector}} delegate number of methods to NumPy. By design [it does the same|https://github.com/apache/spark/blob/933a6548d423cf17448207a99299cf36fc1a95f6/python/pyspark/mllib/linalg/__init__.py#L487] for {{__neg__}} but current {{_delegate}} method [expects binary operators|https://github.com/apache/spark/blob/933a6548d423cf17448207a99299cf36fc1a95f6/python/pyspark/mllib/linalg/__init__.py#L481]. {code} Welcome to __ / __/__ ___ _/ /__ _\ \/ _ \/ _ `/ __/ '_/ /__ / .__/\_,_/_/ /_/\_\ version 2.1.0 /_/ Using Python version 3.5.2 (default, Jul 2 2016 17:53:06) SparkSession available as 'spark'. In [1]: from pyspark.ml import linalg In [2]: -linalg.DenseVector([1, 2, 3]) --- TypeError Traceback (most recent call last) in () > 1 -linalg.DenseVector([1, 2, 3]) TypeError: func() missing 1 required positional argument: 'other' {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19468) Dataset slow because of unnecessary shuffles
[ https://issues.apache.org/jira/browse/SPARK-19468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15854242#comment-15854242 ] koert kuipers commented on SPARK-19468: --- so to summarize: RDD does what we would expect, DataFrame does what we would expect, Dataset inserts extra unnecessary shuffles. > Dataset slow because of unnecessary shuffles > > > Key: SPARK-19468 > URL: https://issues.apache.org/jira/browse/SPARK-19468 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: koert kuipers > > we noticed that some algos we ported from rdd to dataset are significantly > slower, and the main reason seems to be more shuffles that we successfully > avoid for rdds by careful partitioning. this seems to be dataset specific as > it works ok for dataframe. > see also here: > http://blog.hydronitrogen.com/2016/05/13/shuffle-free-joins-in-spark-sql/ > it kind of boils down to this... if i partition and sort dataframes that get > used for joins repeatedly i can avoid shuffles: > {noformat} > System.setProperty("spark.sql.autoBroadcastJoinThreshold", "-1") > val df1 = Seq((0, 0), (1, 1)).toDF("key", "value") > > .repartition(col("key")).sortWithinPartitions(col("key")).persist(StorageLevel.DISK_ONLY) > val df2 = Seq((0, 0), (1, 1)).toDF("key2", "value2") > > .repartition(col("key2")).sortWithinPartitions(col("key2")).persist(StorageLevel.DISK_ONLY) > val joined = df1.join(df2, col("key") === col("key2")) > joined.explain > == Physical Plan == > *SortMergeJoin [key#5], [key2#27], Inner > :- InMemoryTableScan [key#5, value#6] > : +- InMemoryRelation [key#5, value#6], true, 1, StorageLevel(disk, 1 > replicas) > : +- *Sort [key#5 ASC NULLS FIRST], false, 0 > : +- Exchange hashpartitioning(key#5, 4) > : +- LocalTableScan [key#5, value#6] > +- InMemoryTableScan [key2#27, value2#28] > +- InMemoryRelation [key2#27, value2#28], true, 1, > StorageLevel(disk, 1 replicas) > +- *Sort [key2#27 ASC NULLS FIRST], false, 0 >+- Exchange hashpartitioning(key2#27, 4) > +- LocalTableScan [key2#27, value2#28] > {noformat} > notice how the persisted dataframes are not shuffled or sorted anymore before > being used in the join. however if i try to do the same with dataset i have > no luck: > {noformat} > val ds1 = Seq((0, 0), (1, 1)).toDS > > .repartition(col("_1")).sortWithinPartitions(col("_1")).persist(StorageLevel.DISK_ONLY) > val ds2 = Seq((0, 0), (1, 1)).toDS > > .repartition(col("_1")).sortWithinPartitions(col("_1")).persist(StorageLevel.DISK_ONLY) > val joined1 = ds1.joinWith(ds2, ds1("_1") === ds2("_1")) > joined1.explain > == Physical Plan == > *SortMergeJoin [_1#105._1], [_2#106._1], Inner > :- *Sort [_1#105._1 ASC NULLS FIRST], false, 0 > : +- Exchange hashpartitioning(_1#105._1, 4) > : +- *Project [named_struct(_1, _1#83, _2, _2#84) AS _1#105] > :+- InMemoryTableScan [_1#83, _2#84] > : +- InMemoryRelation [_1#83, _2#84], true, 1, > StorageLevel(disk, 1 replicas) > :+- *Sort [_1#83 ASC NULLS FIRST], false, 0 > : +- Exchange hashpartitioning(_1#83, 4) > : +- LocalTableScan [_1#83, _2#84] > +- *Sort [_2#106._1 ASC NULLS FIRST], false, 0 >+- Exchange hashpartitioning(_2#106._1, 4) > +- *Project [named_struct(_1, _1#100, _2, _2#101) AS _2#106] > +- InMemoryTableScan [_1#100, _2#101] >+- InMemoryRelation [_1#100, _2#101], true, 1, > StorageLevel(disk, 1 replicas) > +- *Sort [_1#83 ASC NULLS FIRST], false, 0 > +- Exchange hashpartitioning(_1#83, 4) >+- LocalTableScan [_1#83, _2#84] > {noformat} > notice how my persisted Datasets are shuffled and sorted again. part of the > issue seems to be in joinWith, which does some preprocessing that seems to > confuse the planner. if i change the joinWith to join (which returns a > dataframe) it looks a little better in that only one side gets shuffled > again, but still not optimal: > {noformat} > val ds1 = Seq((0, 0), (1, 1)).toDS > > .repartition(col("_1")).sortWithinPartitions(col("_1")).persist(StorageLevel.DISK_ONLY) > val ds2 = Seq((0, 0), (1, 1)).toDS > > .repartition(col("_1")).sortWithinPartitions(col("_1")).persist(StorageLevel.DISK_ONLY) > val joined1 = ds1.join(ds2, ds1("_1") === ds2("_1")) > joined1.explain > == Physical Plan == > *SortMergeJoin [_1#83], [_1#100], Inner > :- InMemoryTableScan [_1#83, _2#84] > : +- InMemoryRelation [_1#83, _2#84], true, 1, StorageLevel(disk, 1 > replicas) > : +- *Sort [_1#83 ASC NULLS FIRST], false, 0 > : +- Exchange hashpartitioning(_1#83, 4) > :
[jira] [Assigned] (SPARK-19472) [SQL]SQLParser fails to resolve nested CASE WHEN statement with parentheses
[ https://issues.apache.org/jira/browse/SPARK-19472?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19472: Assignee: Apache Spark > [SQL]SQLParser fails to resolve nested CASE WHEN statement with parentheses > --- > > Key: SPARK-19472 > URL: https://issues.apache.org/jira/browse/SPARK-19472 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: StanZhai >Assignee: Apache Spark > > SQLParser fails to resolve nested CASE WHEN statement like this: > select case when > (1) + > case when 1>0 then 1 else 0 end = 2 > then 1 else 0 end > from tb > Exception > Exception in thread "main" > org.apache.spark.sql.catalyst.parser.ParseException: > mismatched input 'then' expecting {'.', '[', 'OR', 'AND', 'IN', NOT, > 'BETWEEN', 'LIKE', RLIKE, 'IS', 'WHEN', EQ, '<=>', '<>', '!=', '<', LTE, '>', > GTE, '+', '-', '*', '/', '%', 'DIV', '&', '|', '^'}(line 5, pos 0) > == SQL == > select case when > (1) + > case when 1>0 then 1 else 0 end = 2 > then 1 else 0 end > ^^^ > from tb > But,remove parentheses will be fine: > select case when > 1 + > case when 1>0 then 1 else 0 end = 2 > then 1 else 0 end > from tb -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-19472) [SQL]SQLParser fails to resolve nested CASE WHEN statement with parentheses
[ https://issues.apache.org/jira/browse/SPARK-19472?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19472: Assignee: (was: Apache Spark) > [SQL]SQLParser fails to resolve nested CASE WHEN statement with parentheses > --- > > Key: SPARK-19472 > URL: https://issues.apache.org/jira/browse/SPARK-19472 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: StanZhai > > SQLParser fails to resolve nested CASE WHEN statement like this: > select case when > (1) + > case when 1>0 then 1 else 0 end = 2 > then 1 else 0 end > from tb > Exception > Exception in thread "main" > org.apache.spark.sql.catalyst.parser.ParseException: > mismatched input 'then' expecting {'.', '[', 'OR', 'AND', 'IN', NOT, > 'BETWEEN', 'LIKE', RLIKE, 'IS', 'WHEN', EQ, '<=>', '<>', '!=', '<', LTE, '>', > GTE, '+', '-', '*', '/', '%', 'DIV', '&', '|', '^'}(line 5, pos 0) > == SQL == > select case when > (1) + > case when 1>0 then 1 else 0 end = 2 > then 1 else 0 end > ^^^ > from tb > But,remove parentheses will be fine: > select case when > 1 + > case when 1>0 then 1 else 0 end = 2 > then 1 else 0 end > from tb -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19472) [SQL]SQLParser fails to resolve nested CASE WHEN statement with parentheses
[ https://issues.apache.org/jira/browse/SPARK-19472?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15854187#comment-15854187 ] Apache Spark commented on SPARK-19472: -- User 'hvanhovell' has created a pull request for this issue: https://github.com/apache/spark/pull/16821 > [SQL]SQLParser fails to resolve nested CASE WHEN statement with parentheses > --- > > Key: SPARK-19472 > URL: https://issues.apache.org/jira/browse/SPARK-19472 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: StanZhai > > SQLParser fails to resolve nested CASE WHEN statement like this: > select case when > (1) + > case when 1>0 then 1 else 0 end = 2 > then 1 else 0 end > from tb > Exception > Exception in thread "main" > org.apache.spark.sql.catalyst.parser.ParseException: > mismatched input 'then' expecting {'.', '[', 'OR', 'AND', 'IN', NOT, > 'BETWEEN', 'LIKE', RLIKE, 'IS', 'WHEN', EQ, '<=>', '<>', '!=', '<', LTE, '>', > GTE, '+', '-', '*', '/', '%', 'DIV', '&', '|', '^'}(line 5, pos 0) > == SQL == > select case when > (1) + > case when 1>0 then 1 else 0 end = 2 > then 1 else 0 end > ^^^ > from tb > But,remove parentheses will be fine: > select case when > 1 + > case when 1>0 then 1 else 0 end = 2 > then 1 else 0 end > from tb -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19468) Dataset slow because of unnecessary shuffles
[ https://issues.apache.org/jira/browse/SPARK-19468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15854152#comment-15854152 ] koert kuipers commented on SPARK-19468: --- inserting unnecessary shuffles makes things very slow. and this is not an edge case. this is the most basic shuffle optimization. given how RDD and DataFrame behave it is also unexpected. however it does not produce incorrect results. it is definitely a blocker for switching from RDD to Dataset. no matter what else you do to make things fast (catalyst planner, tungsten, etc,) if you cant get the basic shuffles right then RDD is going to be more efficient for a lot of algos. > Dataset slow because of unnecessary shuffles > > > Key: SPARK-19468 > URL: https://issues.apache.org/jira/browse/SPARK-19468 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: koert kuipers > > we noticed that some algos we ported from rdd to dataset are significantly > slower, and the main reason seems to be more shuffles that we successfully > avoid for rdds by careful partitioning. this seems to be dataset specific as > it works ok for dataframe. > see also here: > http://blog.hydronitrogen.com/2016/05/13/shuffle-free-joins-in-spark-sql/ > it kind of boils down to this... if i partition and sort dataframes that get > used for joins repeatedly i can avoid shuffles: > {noformat} > System.setProperty("spark.sql.autoBroadcastJoinThreshold", "-1") > val df1 = Seq((0, 0), (1, 1)).toDF("key", "value") > > .repartition(col("key")).sortWithinPartitions(col("key")).persist(StorageLevel.DISK_ONLY) > val df2 = Seq((0, 0), (1, 1)).toDF("key2", "value2") > > .repartition(col("key2")).sortWithinPartitions(col("key2")).persist(StorageLevel.DISK_ONLY) > val joined = df1.join(df2, col("key") === col("key2")) > joined.explain > == Physical Plan == > *SortMergeJoin [key#5], [key2#27], Inner > :- InMemoryTableScan [key#5, value#6] > : +- InMemoryRelation [key#5, value#6], true, 1, StorageLevel(disk, 1 > replicas) > : +- *Sort [key#5 ASC NULLS FIRST], false, 0 > : +- Exchange hashpartitioning(key#5, 4) > : +- LocalTableScan [key#5, value#6] > +- InMemoryTableScan [key2#27, value2#28] > +- InMemoryRelation [key2#27, value2#28], true, 1, > StorageLevel(disk, 1 replicas) > +- *Sort [key2#27 ASC NULLS FIRST], false, 0 >+- Exchange hashpartitioning(key2#27, 4) > +- LocalTableScan [key2#27, value2#28] > {noformat} > notice how the persisted dataframes are not shuffled or sorted anymore before > being used in the join. however if i try to do the same with dataset i have > no luck: > {noformat} > val ds1 = Seq((0, 0), (1, 1)).toDS > > .repartition(col("_1")).sortWithinPartitions(col("_1")).persist(StorageLevel.DISK_ONLY) > val ds2 = Seq((0, 0), (1, 1)).toDS > > .repartition(col("_1")).sortWithinPartitions(col("_1")).persist(StorageLevel.DISK_ONLY) > val joined1 = ds1.joinWith(ds2, ds1("_1") === ds2("_1")) > joined1.explain > == Physical Plan == > *SortMergeJoin [_1#105._1], [_2#106._1], Inner > :- *Sort [_1#105._1 ASC NULLS FIRST], false, 0 > : +- Exchange hashpartitioning(_1#105._1, 4) > : +- *Project [named_struct(_1, _1#83, _2, _2#84) AS _1#105] > :+- InMemoryTableScan [_1#83, _2#84] > : +- InMemoryRelation [_1#83, _2#84], true, 1, > StorageLevel(disk, 1 replicas) > :+- *Sort [_1#83 ASC NULLS FIRST], false, 0 > : +- Exchange hashpartitioning(_1#83, 4) > : +- LocalTableScan [_1#83, _2#84] > +- *Sort [_2#106._1 ASC NULLS FIRST], false, 0 >+- Exchange hashpartitioning(_2#106._1, 4) > +- *Project [named_struct(_1, _1#100, _2, _2#101) AS _2#106] > +- InMemoryTableScan [_1#100, _2#101] >+- InMemoryRelation [_1#100, _2#101], true, 1, > StorageLevel(disk, 1 replicas) > +- *Sort [_1#83 ASC NULLS FIRST], false, 0 > +- Exchange hashpartitioning(_1#83, 4) >+- LocalTableScan [_1#83, _2#84] > {noformat} > notice how my persisted Datasets are shuffled and sorted again. part of the > issue seems to be in joinWith, which does some preprocessing that seems to > confuse the planner. if i change the joinWith to join (which returns a > dataframe) it looks a little better in that only one side gets shuffled > again, but still not optimal: > {noformat} > val ds1 = Seq((0, 0), (1, 1)).toDS > > .repartition(col("_1")).sortWithinPartitions(col("_1")).persist(StorageLevel.DISK_ONLY) > val ds2 = Seq((0, 0), (1, 1)).toDS > > .repartition(col("_1")).sortWithinPartitions(col("_1")).persist(StorageLevel.DISK_ONLY) > val joined1 = ds1.join(ds2,
[jira] [Assigned] (SPARK-17663) SchedulableBuilder should handle invalid data access via scheduler.allocation.file
[ https://issues.apache.org/jira/browse/SPARK-17663?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Imran Rashid reassigned SPARK-17663: Assignee: Eren Avsarogullari (was: Imran Rashid) > SchedulableBuilder should handle invalid data access via > scheduler.allocation.file > -- > > Key: SPARK-17663 > URL: https://issues.apache.org/jira/browse/SPARK-17663 > Project: Spark > Issue Type: Bug > Components: Scheduler >Affects Versions: 2.1.0 >Reporter: Eren Avsarogullari >Assignee: Eren Avsarogullari > Fix For: 2.2.0 > > > If spark.scheduler.allocation.file has invalid minShare or/and weight values, > these cause : > - NumberFormatException due to toInt function > - SparkContext can not be initialized. > - It does not show meaningful error message to user. > In a nutshell, this functionality can be more robust by selecting one of the > following flows : > *1-* Currently, if schedulingMode has an invalid value, a warning message is > logged and default value is set as FIFO. Same pattern can be used for > minShare(default: 0) and weight(default: 1) as well > *2-* Meaningful error message can be shown to the user for all invalid cases. > *Code to Reproduce* : > {code} > val conf = new > SparkConf().setAppName("spark-fairscheduler").setMaster("local") > conf.set("spark.scheduler.mode", "FAIR") > conf.set("spark.scheduler.allocation.file", > "src/main/resources/fairscheduler-invalid-data.xml") > val sc = new SparkContext(conf) > {code} > *fairscheduler-invalid-data.xml* : > {code} > > > FIFO > invalid_weight > 2 > > > {code} > *Stacktrace* : > {code} > Exception in thread "main" java.lang.NumberFormatException: For input string: > "invalid_weight" > at > java.lang.NumberFormatException.forInputString(NumberFormatException.java:65) > at java.lang.Integer.parseInt(Integer.java:580) > at java.lang.Integer.parseInt(Integer.java:615) > at > scala.collection.immutable.StringLike$class.toInt(StringLike.scala:272) > at scala.collection.immutable.StringOps.toInt(StringOps.scala:29) > at > org.apache.spark.scheduler.FairSchedulableBuilder$$anonfun$org$apache$spark$scheduler$FairSchedulableBuilder$$buildFairSchedulerPool$1.apply(SchedulableBuilder.scala:127) > at > org.apache.spark.scheduler.FairSchedulableBuilder$$anonfun$org$apache$spark$scheduler$FairSchedulableBuilder$$buildFairSchedulerPool$1.apply(SchedulableBuilder.scala:102) > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-17663) SchedulableBuilder should handle invalid data access via scheduler.allocation.file
[ https://issues.apache.org/jira/browse/SPARK-17663?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Imran Rashid reassigned SPARK-17663: Assignee: Imran Rashid > SchedulableBuilder should handle invalid data access via > scheduler.allocation.file > -- > > Key: SPARK-17663 > URL: https://issues.apache.org/jira/browse/SPARK-17663 > Project: Spark > Issue Type: Bug > Components: Scheduler >Affects Versions: 2.1.0 >Reporter: Eren Avsarogullari >Assignee: Imran Rashid > Fix For: 2.2.0 > > > If spark.scheduler.allocation.file has invalid minShare or/and weight values, > these cause : > - NumberFormatException due to toInt function > - SparkContext can not be initialized. > - It does not show meaningful error message to user. > In a nutshell, this functionality can be more robust by selecting one of the > following flows : > *1-* Currently, if schedulingMode has an invalid value, a warning message is > logged and default value is set as FIFO. Same pattern can be used for > minShare(default: 0) and weight(default: 1) as well > *2-* Meaningful error message can be shown to the user for all invalid cases. > *Code to Reproduce* : > {code} > val conf = new > SparkConf().setAppName("spark-fairscheduler").setMaster("local") > conf.set("spark.scheduler.mode", "FAIR") > conf.set("spark.scheduler.allocation.file", > "src/main/resources/fairscheduler-invalid-data.xml") > val sc = new SparkContext(conf) > {code} > *fairscheduler-invalid-data.xml* : > {code} > > > FIFO > invalid_weight > 2 > > > {code} > *Stacktrace* : > {code} > Exception in thread "main" java.lang.NumberFormatException: For input string: > "invalid_weight" > at > java.lang.NumberFormatException.forInputString(NumberFormatException.java:65) > at java.lang.Integer.parseInt(Integer.java:580) > at java.lang.Integer.parseInt(Integer.java:615) > at > scala.collection.immutable.StringLike$class.toInt(StringLike.scala:272) > at scala.collection.immutable.StringOps.toInt(StringOps.scala:29) > at > org.apache.spark.scheduler.FairSchedulableBuilder$$anonfun$org$apache$spark$scheduler$FairSchedulableBuilder$$buildFairSchedulerPool$1.apply(SchedulableBuilder.scala:127) > at > org.apache.spark.scheduler.FairSchedulableBuilder$$anonfun$org$apache$spark$scheduler$FairSchedulableBuilder$$buildFairSchedulerPool$1.apply(SchedulableBuilder.scala:102) > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-17663) SchedulableBuilder should handle invalid data access via scheduler.allocation.file
[ https://issues.apache.org/jira/browse/SPARK-17663?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Imran Rashid resolved SPARK-17663. -- Resolution: Fixed Fix Version/s: 2.2.0 Issue resolved by pull request 15237 [https://github.com/apache/spark/pull/15237] > SchedulableBuilder should handle invalid data access via > scheduler.allocation.file > -- > > Key: SPARK-17663 > URL: https://issues.apache.org/jira/browse/SPARK-17663 > Project: Spark > Issue Type: Bug > Components: Scheduler >Affects Versions: 2.1.0 >Reporter: Eren Avsarogullari > Fix For: 2.2.0 > > > If spark.scheduler.allocation.file has invalid minShare or/and weight values, > these cause : > - NumberFormatException due to toInt function > - SparkContext can not be initialized. > - It does not show meaningful error message to user. > In a nutshell, this functionality can be more robust by selecting one of the > following flows : > *1-* Currently, if schedulingMode has an invalid value, a warning message is > logged and default value is set as FIFO. Same pattern can be used for > minShare(default: 0) and weight(default: 1) as well > *2-* Meaningful error message can be shown to the user for all invalid cases. > *Code to Reproduce* : > {code} > val conf = new > SparkConf().setAppName("spark-fairscheduler").setMaster("local") > conf.set("spark.scheduler.mode", "FAIR") > conf.set("spark.scheduler.allocation.file", > "src/main/resources/fairscheduler-invalid-data.xml") > val sc = new SparkContext(conf) > {code} > *fairscheduler-invalid-data.xml* : > {code} > > > FIFO > invalid_weight > 2 > > > {code} > *Stacktrace* : > {code} > Exception in thread "main" java.lang.NumberFormatException: For input string: > "invalid_weight" > at > java.lang.NumberFormatException.forInputString(NumberFormatException.java:65) > at java.lang.Integer.parseInt(Integer.java:580) > at java.lang.Integer.parseInt(Integer.java:615) > at > scala.collection.immutable.StringLike$class.toInt(StringLike.scala:272) > at scala.collection.immutable.StringOps.toInt(StringOps.scala:29) > at > org.apache.spark.scheduler.FairSchedulableBuilder$$anonfun$org$apache$spark$scheduler$FairSchedulableBuilder$$buildFairSchedulerPool$1.apply(SchedulableBuilder.scala:127) > at > org.apache.spark.scheduler.FairSchedulableBuilder$$anonfun$org$apache$spark$scheduler$FairSchedulableBuilder$$buildFairSchedulerPool$1.apply(SchedulableBuilder.scala:102) > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19439) PySpark's registerJavaFunction Should Support UDAFs
[ https://issues.apache.org/jira/browse/SPARK-19439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15854034#comment-15854034 ] Hyukjin Kwon commented on SPARK-19439: -- So, as you said, is this a duplicate of SPARK-10915? If so, it'd maybe make sense just to add some comment there or reopen rather than creating a new one. > PySpark's registerJavaFunction Should Support UDAFs > --- > > Key: SPARK-19439 > URL: https://issues.apache.org/jira/browse/SPARK-19439 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 2.1.0 >Reporter: Keith Bourgoin > > When trying to import a Scala UDAF using registerJavaFunction, I get this > error: > {code} > In [1]: sqlContext.registerJavaFunction('geo_mean', > 'com.foo.bar.GeometricMean') > --- > Py4JJavaError Traceback (most recent call last) > in () > > 1 sqlContext.registerJavaFunction('geo_mean', > 'com.foo.bar.GeometricMean') > /home/kfb/src/projects/spark/python/pyspark/sql/context.pyc in > registerJavaFunction(self, name, javaClassName, returnType) > 227 if returnType is not None: > 228 jdt = > self.sparkSession._jsparkSession.parseDataType(returnType.json()) > --> 229 self.sparkSession._jsparkSession.udf().registerJava(name, > javaClassName, jdt) > 230 > 231 # TODO(andrew): delete this once we refactor things to take in > SparkSession > /home/kfb/src/projects/spark/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py > in __call__(self, *args) >1131 answer = self.gateway_client.send_command(command) >1132 return_value = get_return_value( > -> 1133 answer, self.gateway_client, self.target_id, self.name) >1134 >1135 for temp_arg in temp_args: > /home/kfb/src/projects/spark/python/pyspark/sql/utils.pyc in deco(*a, **kw) > 61 def deco(*a, **kw): > 62 try: > ---> 63 return f(*a, **kw) > 64 except py4j.protocol.Py4JJavaError as e: > 65 s = e.java_exception.toString() > /home/kfb/src/projects/spark/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py > in get_return_value(answer, gateway_client, target_id, name) > 317 raise Py4JJavaError( > 318 "An error occurred while calling {0}{1}{2}.\n". > --> 319 format(target_id, ".", name), value) > 320 else: > 321 raise Py4JError( > Py4JJavaError: An error occurred while calling o28.registerJava. > : java.io.IOException: UDF class com.foo.bar.GeometricMean doesn't implement > any UDF interface > at > org.apache.spark.sql.UDFRegistration.registerJava(UDFRegistration.scala:438) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244) > at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357) > at py4j.Gateway.invoke(Gateway.java:280) > at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132) > at py4j.commands.CallCommand.execute(CallCommand.java:79) > at py4j.GatewayConnection.run(GatewayConnection.java:214) > at java.lang.Thread.run(Thread.java:745) > {code} > According to SPARK-10915, UDAFs in Python aren't happening anytime soon. > Without this, there's no way to get Scala UDAFs into Python Spark SQL > whatsoever. Fixing that would be a huge help so that we can keep aggregations > in the JVM and using DataFrames. Otherwise, all our code has to drop to to > RDDs and live in Python. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-19440) Window in pyspark doesn't have attributes unboundedPreceding, unboundedFollowing and currentRow
[ https://issues.apache.org/jira/browse/SPARK-19440?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-19440. -- Resolution: Invalid It seems there are as below: {code} >>> from pyspark.sql import Window >>> window = Window.orderBy("date").rowsBetween(Window.unboundedPreceding, >>> Window.currentRow) >>> dir(Window) ['_FOLLOWING_THRESHOLD', '_JAVA_MAX_LONG', '_JAVA_MIN_LONG', '_PRECEDING_THRESHOLD', '__class__', '__delattr__', '__dict__', '__doc__', '__format__', '__getattribute__', '__hash__', '__init__', '__module__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__', 'currentRow', 'orderBy', 'partitionBy', 'rangeBetween', 'rowsBetween', 'unboundedFollowing', 'unboundedPreceding'] >>> {code} Please reopen this if I am mistaken. > Window in pyspark doesn't have attributes unboundedPreceding, > unboundedFollowing and currentRow > --- > > Key: SPARK-19440 > URL: https://issues.apache.org/jira/browse/SPARK-19440 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.0.0 >Reporter: Franklyn Dsouza >Priority: Trivial > > The Window class in pyspark doesn't have the attributes unboundedPreceding, > unboundedFollowing and currentRow despite the documentation suggesting this > attributes be used. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-19442) Unable to add column to the dataset using Dataset.WithColumn() api
[ https://issues.apache.org/jira/browse/SPARK-19442?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-19442. -- Resolution: Cannot Reproduce I am resolving this as I can't reproduce in the current master as below: {code} // Replace existing column spark.range(10).toDF("newColumnName") .withColumn("newColumnName", new Column("newColumnName").cast("int")) .show(); // Adding a new column spark.range(10).toDF("a") .withColumn("newColumnName", new Column("a").cast("int")) .show(); {code} prints {code} +-+ |newColumnName| +-+ |0| |1| |2| |3| |4| |5| |6| |7| |8| |9| +-+ +---+-+ | a|newColumnName| +---+-+ | 0|0| | 1|1| | 2|2| | 3|3| | 4|4| | 5|5| | 6|6| | 7|7| | 8|8| | 9|9| +---+-+ {code} Please reopen this if I am mistaken > Unable to add column to the dataset using Dataset.WithColumn() api > -- > > Key: SPARK-19442 > URL: https://issues.apache.org/jira/browse/SPARK-19442 > Project: Spark > Issue Type: Bug > Components: Java API >Affects Versions: 2.0.2 >Reporter: Navya Krishnappa > > When I'm creating a new column using Dataset.WithColumn() api, Analysis > Exception is thrown. > Dataset.WithColumn() api: > Dataset.withColumn("newColumnName', new > org.apache.spark.sql.Column("newColumnName").cast("int")); > Stacktrace: > cannot resolve '`NewColumn`' given input columns: [abc,xyz ] -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19451) Long values in Window function
[ https://issues.apache.org/jira/browse/SPARK-19451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15853963#comment-15853963 ] Herman van Hovell commented on SPARK-19451: --- At the end of the day I would like to support arbitrary literals for range frames. See: https://issues.apache.org/jira/browse/SPARK-9221 > Long values in Window function > -- > > Key: SPARK-19451 > URL: https://issues.apache.org/jira/browse/SPARK-19451 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.6.1, 2.0.2 >Reporter: Julien Champ > > Hi there, > there seems to be a major limitation in spark window functions and > rangeBetween method. > If I have the following code : > {code:title=Exemple |borderStyle=solid} > val tw = Window.orderBy("date") > .partitionBy("id") > .rangeBetween( from , 0) > {code} > Everything seems ok, while *from* value is not too large... Even if the > rangeBetween() method supports Long parameters. > But If i set *-216000L* value to *from* it does not work ! > It is probably related to this part of code in the between() method, of the > WindowSpec class, called by rangeBetween() > {code:title=between() method|borderStyle=solid} > val boundaryStart = start match { > case 0 => CurrentRow > case Long.MinValue => UnboundedPreceding > case x if x < 0 => ValuePreceding(-start.toInt) > case x if x > 0 => ValueFollowing(start.toInt) > } > {code} > ( look at this *.toInt* ) > Does anybody know it there's a way to solve / patch this behavior ? > Any help will be appreciated > Thx -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19451) Long values in Window function
[ https://issues.apache.org/jira/browse/SPARK-19451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15853936#comment-15853936 ] Julien Champ commented on SPARK-19451: -- Glad to see that I'm not the only one convinced by this usage ! This probably needs to use different data structures for rowBetween() and rangeBetween() > Long values in Window function > -- > > Key: SPARK-19451 > URL: https://issues.apache.org/jira/browse/SPARK-19451 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.6.1, 2.0.2 >Reporter: Julien Champ > > Hi there, > there seems to be a major limitation in spark window functions and > rangeBetween method. > If I have the following code : > {code:title=Exemple |borderStyle=solid} > val tw = Window.orderBy("date") > .partitionBy("id") > .rangeBetween( from , 0) > {code} > Everything seems ok, while *from* value is not too large... Even if the > rangeBetween() method supports Long parameters. > But If i set *-216000L* value to *from* it does not work ! > It is probably related to this part of code in the between() method, of the > WindowSpec class, called by rangeBetween() > {code:title=between() method|borderStyle=solid} > val boundaryStart = start match { > case 0 => CurrentRow > case Long.MinValue => UnboundedPreceding > case x if x < 0 => ValuePreceding(-start.toInt) > case x if x > 0 => ValueFollowing(start.toInt) > } > {code} > ( look at this *.toInt* ) > Does anybody know it there's a way to solve / patch this behavior ? > Any help will be appreciated > Thx -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-19471) [SQL]A confusing NullPointerException when creating table
[ https://issues.apache.org/jira/browse/SPARK-19471?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19471: Assignee: (was: Apache Spark) > [SQL]A confusing NullPointerException when creating table > - > > Key: SPARK-19471 > URL: https://issues.apache.org/jira/browse/SPARK-19471 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: StanZhai >Priority: Critical > > After upgrading our Spark from 1.6.2 to 2.1.0, I encounter a confusing > NullPointerException when creating table under Spark 2.1.0, but the problem > does not exists in Spark 1.6.1. > Environment: Hive 1.2.1, Hadoop 2.6.4 > Code > // spark is an instance of HiveContext > // merge is a Hive UDF > val df = spark.sql("SELECT merge(field_a, null) AS new_a, field_b AS new_b > FROM tb_1 group by field_a, field_b") > df.createTempView("tb_temp") > spark.sql("create table tb_result stored as parquet as " + > "SELECT new_a" + > "FROM tb_temp" + > "LEFT JOIN `tb_2` ON " + > "if(((`tb_temp`.`new_b`) = '' OR (`tb_temp`.`new_b`) IS NULL), > concat('GrLSRwZE_', cast((rand() * 200) AS int)), (`tb_temp`.`new_b`)) = > `tb_2`.`fka6862f17`") > Physical Plan > *Project [new_a] > +- *BroadcastHashJoin [if (((new_b = ) || isnull(new_b))) concat(GrLSRwZE_, > cast(cast((_nondeterministic * 200.0) as int) as string)) else new_b], > [fka6862f17], LeftOuter, BuildRight >:- HashAggregate(keys=[field_a, field_b], functions=[], output=[new_a, > new_b, _nondeterministic]) >: +- Exchange(coordinator ) hashpartitioning(field_a, field_b, 180), > coordinator[target post-shuffle partition size: 1024880] >: +- *HashAggregate(keys=[field_a, field_b], functions=[], > output=[field_a, field_b]) >:+- *FileScan parquet bdp.tb_1[field_a,field_b] Batched: true, > Format: Parquet, Location: InMemoryFileIndex[hdfs://hdcluster/data/tb_1, > PartitionFilters: [], PushedFilters: [], ReadSchema: struct >+- BroadcastExchange HashedRelationBroadcastMode(List(input[0, string, > true])) > +- *Project [fka6862f17] > +- *FileScan parquet bdp.tb_2[fka6862f17] Batched: true, Format: > Parquet, Location: InMemoryFileIndex[hdfs://hdcluster/data/tb_2, > PartitionFilters: [], PushedFilters: [], ReadSchema: struct > What does '*' mean before HashAggregate? > Exception > org.apache.spark.SparkException: Task failed while writing rows > ... > java.lang.NullPointerException > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply_2$(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown > Source) > at > org.apache.spark.sql.execution.aggregate.AggregationIterator$$anonfun$generateResultProjection$3.apply(AggregationIterator.scala:260) > > at > org.apache.spark.sql.execution.aggregate.AggregationIterator$$anonfun$generateResultProjection$3.apply(AggregationIterator.scala:259) > > at > org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.next(TungstenAggregationIterator.scala:392) > > at > org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.next(TungstenAggregationIterator.scala:79) > > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377) > > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.execute(FileFormatWriter.scala:252) > > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:199) > > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:197) > > at > org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1341) > > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:202) > > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$4.apply(FileFormatWriter.scala:138) > > at >
[jira] [Assigned] (SPARK-19471) [SQL]A confusing NullPointerException when creating table
[ https://issues.apache.org/jira/browse/SPARK-19471?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19471: Assignee: Apache Spark > [SQL]A confusing NullPointerException when creating table > - > > Key: SPARK-19471 > URL: https://issues.apache.org/jira/browse/SPARK-19471 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: StanZhai >Assignee: Apache Spark >Priority: Critical > > After upgrading our Spark from 1.6.2 to 2.1.0, I encounter a confusing > NullPointerException when creating table under Spark 2.1.0, but the problem > does not exists in Spark 1.6.1. > Environment: Hive 1.2.1, Hadoop 2.6.4 > Code > // spark is an instance of HiveContext > // merge is a Hive UDF > val df = spark.sql("SELECT merge(field_a, null) AS new_a, field_b AS new_b > FROM tb_1 group by field_a, field_b") > df.createTempView("tb_temp") > spark.sql("create table tb_result stored as parquet as " + > "SELECT new_a" + > "FROM tb_temp" + > "LEFT JOIN `tb_2` ON " + > "if(((`tb_temp`.`new_b`) = '' OR (`tb_temp`.`new_b`) IS NULL), > concat('GrLSRwZE_', cast((rand() * 200) AS int)), (`tb_temp`.`new_b`)) = > `tb_2`.`fka6862f17`") > Physical Plan > *Project [new_a] > +- *BroadcastHashJoin [if (((new_b = ) || isnull(new_b))) concat(GrLSRwZE_, > cast(cast((_nondeterministic * 200.0) as int) as string)) else new_b], > [fka6862f17], LeftOuter, BuildRight >:- HashAggregate(keys=[field_a, field_b], functions=[], output=[new_a, > new_b, _nondeterministic]) >: +- Exchange(coordinator ) hashpartitioning(field_a, field_b, 180), > coordinator[target post-shuffle partition size: 1024880] >: +- *HashAggregate(keys=[field_a, field_b], functions=[], > output=[field_a, field_b]) >:+- *FileScan parquet bdp.tb_1[field_a,field_b] Batched: true, > Format: Parquet, Location: InMemoryFileIndex[hdfs://hdcluster/data/tb_1, > PartitionFilters: [], PushedFilters: [], ReadSchema: struct >+- BroadcastExchange HashedRelationBroadcastMode(List(input[0, string, > true])) > +- *Project [fka6862f17] > +- *FileScan parquet bdp.tb_2[fka6862f17] Batched: true, Format: > Parquet, Location: InMemoryFileIndex[hdfs://hdcluster/data/tb_2, > PartitionFilters: [], PushedFilters: [], ReadSchema: struct > What does '*' mean before HashAggregate? > Exception > org.apache.spark.SparkException: Task failed while writing rows > ... > java.lang.NullPointerException > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply_2$(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown > Source) > at > org.apache.spark.sql.execution.aggregate.AggregationIterator$$anonfun$generateResultProjection$3.apply(AggregationIterator.scala:260) > > at > org.apache.spark.sql.execution.aggregate.AggregationIterator$$anonfun$generateResultProjection$3.apply(AggregationIterator.scala:259) > > at > org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.next(TungstenAggregationIterator.scala:392) > > at > org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.next(TungstenAggregationIterator.scala:79) > > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377) > > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.execute(FileFormatWriter.scala:252) > > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:199) > > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:197) > > at > org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1341) > > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:202) > > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$4.apply(FileFormatWriter.scala:138) >
[jira] [Commented] (SPARK-19471) [SQL]A confusing NullPointerException when creating table
[ https://issues.apache.org/jira/browse/SPARK-19471?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15853906#comment-15853906 ] Apache Spark commented on SPARK-19471: -- User 'yangw1234' has created a pull request for this issue: https://github.com/apache/spark/pull/16820 > [SQL]A confusing NullPointerException when creating table > - > > Key: SPARK-19471 > URL: https://issues.apache.org/jira/browse/SPARK-19471 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: StanZhai >Priority: Critical > > After upgrading our Spark from 1.6.2 to 2.1.0, I encounter a confusing > NullPointerException when creating table under Spark 2.1.0, but the problem > does not exists in Spark 1.6.1. > Environment: Hive 1.2.1, Hadoop 2.6.4 > Code > // spark is an instance of HiveContext > // merge is a Hive UDF > val df = spark.sql("SELECT merge(field_a, null) AS new_a, field_b AS new_b > FROM tb_1 group by field_a, field_b") > df.createTempView("tb_temp") > spark.sql("create table tb_result stored as parquet as " + > "SELECT new_a" + > "FROM tb_temp" + > "LEFT JOIN `tb_2` ON " + > "if(((`tb_temp`.`new_b`) = '' OR (`tb_temp`.`new_b`) IS NULL), > concat('GrLSRwZE_', cast((rand() * 200) AS int)), (`tb_temp`.`new_b`)) = > `tb_2`.`fka6862f17`") > Physical Plan > *Project [new_a] > +- *BroadcastHashJoin [if (((new_b = ) || isnull(new_b))) concat(GrLSRwZE_, > cast(cast((_nondeterministic * 200.0) as int) as string)) else new_b], > [fka6862f17], LeftOuter, BuildRight >:- HashAggregate(keys=[field_a, field_b], functions=[], output=[new_a, > new_b, _nondeterministic]) >: +- Exchange(coordinator ) hashpartitioning(field_a, field_b, 180), > coordinator[target post-shuffle partition size: 1024880] >: +- *HashAggregate(keys=[field_a, field_b], functions=[], > output=[field_a, field_b]) >:+- *FileScan parquet bdp.tb_1[field_a,field_b] Batched: true, > Format: Parquet, Location: InMemoryFileIndex[hdfs://hdcluster/data/tb_1, > PartitionFilters: [], PushedFilters: [], ReadSchema: struct >+- BroadcastExchange HashedRelationBroadcastMode(List(input[0, string, > true])) > +- *Project [fka6862f17] > +- *FileScan parquet bdp.tb_2[fka6862f17] Batched: true, Format: > Parquet, Location: InMemoryFileIndex[hdfs://hdcluster/data/tb_2, > PartitionFilters: [], PushedFilters: [], ReadSchema: struct > What does '*' mean before HashAggregate? > Exception > org.apache.spark.SparkException: Task failed while writing rows > ... > java.lang.NullPointerException > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply_2$(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown > Source) > at > org.apache.spark.sql.execution.aggregate.AggregationIterator$$anonfun$generateResultProjection$3.apply(AggregationIterator.scala:260) > > at > org.apache.spark.sql.execution.aggregate.AggregationIterator$$anonfun$generateResultProjection$3.apply(AggregationIterator.scala:259) > > at > org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.next(TungstenAggregationIterator.scala:392) > > at > org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.next(TungstenAggregationIterator.scala:79) > > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377) > > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.execute(FileFormatWriter.scala:252) > > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:199) > > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:197) > > at > org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1341) > > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:202) > > at >
[jira] [Resolved] (SPARK-19469) PySpark should allow driver process on different machine
[ https://issues.apache.org/jira/browse/SPARK-19469?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-19469. --- Resolution: Not A Problem I don't think this makes sense. The JVM and Python process must be colocated. (Please use user@ for questions) > PySpark should allow driver process on different machine > > > Key: SPARK-19469 > URL: https://issues.apache.org/jira/browse/SPARK-19469 > Project: Spark > Issue Type: Wish > Components: PySpark >Affects Versions: 1.6.3 >Reporter: haiyangsea > > In my scenario, there is a resident spark driver process(creating by yarn > cluster mode), all PySpark python client will connect to this resident driver > process and share the SparkContext. In python client process, I also run > other applications that have requirements for the environment.So I want > separate the PySpark python client process and the driver process. > There are two main limitations in PySpark: > 1. *parallelize* method uses local file to transform data between java > process and python process. > 2. using the hard code *localhost* to transform task result or intermediate > data between java process and python process. > Is there any way to achieve my goal? -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-18341) Eliminate use of SingularMatrixException in WeightedLeastSquares logic
[ https://issues.apache.org/jira/browse/SPARK-18341?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-18341. --- Resolution: Won't Fix > Eliminate use of SingularMatrixException in WeightedLeastSquares logic > -- > > Key: SPARK-18341 > URL: https://issues.apache.org/jira/browse/SPARK-18341 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: Joseph K. Bradley >Priority: Minor > > WeightedLeastSquares uses an Exception to implement fallback logic for which > solver to use: > [https://github.com/apache/spark/blob/6f3697136aa68dc39d3ce42f43a7af554d2a3bf9/mllib/src/main/scala/org/apache/spark/ml/optim/WeightedLeastSquares.scala#L258] > We should use an error code instead of an exception. > * Note the error code should be internal, not a public API. > * We may be able to eliminate the SingularMatrixException class. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19449) Inconsistent results between ml package RandomForestClassificationModel and mllib package RandomForestModel
[ https://issues.apache.org/jira/browse/SPARK-19449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15853831#comment-15853831 ] Sean Owen commented on SPARK-19449: --- This isn't a bug. It's not expected that, even if they're both deterministic, that two different implementations give exactly the same answers. They should give similar answers. > Inconsistent results between ml package RandomForestClassificationModel and > mllib package RandomForestModel > --- > > Key: SPARK-19449 > URL: https://issues.apache.org/jira/browse/SPARK-19449 > Project: Spark > Issue Type: Bug > Components: ML, MLlib >Affects Versions: 2.1.0 >Reporter: Aseem Bansal > > I worked on some code to convert ml package RandomForestClassificationModel > to mllib package RandomForestModel. It was needed because we need to make > predictions on the order of ms. I found that the results are inconsistent > although the underlying DecisionTreeModel are exactly the same. So the > behavior between the 2 implementations is inconsistent which should not be > the case. > The below code can be used to reproduce the issue. Can run this as a simple > Java app as long as you have spark dependencies set up properly. > {noformat} > import org.apache.spark.ml.Transformer; > import org.apache.spark.ml.classification.*; > import org.apache.spark.ml.linalg.*; > import org.apache.spark.ml.regression.RandomForestRegressionModel; > import org.apache.spark.mllib.linalg.DenseVector; > import org.apache.spark.mllib.linalg.Vector; > import org.apache.spark.mllib.tree.configuration.Algo; > import org.apache.spark.mllib.tree.model.DecisionTreeModel; > import org.apache.spark.mllib.tree.model.RandomForestModel; > import org.apache.spark.sql.Dataset; > import org.apache.spark.sql.Row; > import org.apache.spark.sql.RowFactory; > import org.apache.spark.sql.SparkSession; > import org.apache.spark.sql.types.DataTypes; > import org.apache.spark.sql.types.Metadata; > import org.apache.spark.sql.types.StructField; > import org.apache.spark.sql.types.StructType; > import scala.Enumeration; > import java.util.ArrayList; > import java.util.List; > import java.util.Random; > abstract class Predictor { > abstract double predict(Vector vector); > } > public class MainConvertModels { > public static final int seed = 42; > public static void main(String[] args) { > int numRows = 1000; > int numFeatures = 3; > int numClasses = 2; > double trainFraction = 0.8; > double testFraction = 0.2; > SparkSession spark = SparkSession.builder() > .appName("conversion app") > .master("local") > .getOrCreate(); > Dataset data = getDummyData(spark, numRows, numFeatures, > numClasses); > Dataset[] splits = data.randomSplit(new double[]{trainFraction, > testFraction}, seed); > Dataset trainingData = splits[0]; > Dataset testData = splits[1]; > testData.cache(); > List labels = getLabels(testData); > List features = getFeatures(testData); > DecisionTreeClassifier classifier1 = new DecisionTreeClassifier(); > DecisionTreeClassificationModel model1 = > classifier1.fit(trainingData); > final DecisionTreeModel convertedModel1 = > convertDecisionTreeModel(model1, Algo.Classification()); > RandomForestClassifier classifier = new RandomForestClassifier(); > RandomForestClassificationModel model2 = classifier.fit(trainingData); > final RandomForestModel convertedModel2 = > convertRandomForestModel(model2); > System.out.println( > "** DecisionTreeClassifier\n" + > "** Original **" + getInfo(model1, testData) + "\n" + > "** New **" + getInfo(new Predictor() { > double predict(Vector vector) {return > convertedModel1.predict(vector);} > }, labels, features) + "\n" + > "\n" + > "** RandomForestClassifier\n" + > "** Original **" + getInfo(model2, testData) + "\n" + > "** New **" + getInfo(new Predictor() {double > predict(Vector vector) {return convertedModel2.predict(vector);}}, labels, > features) + "\n" + > "\n" + > ""); > } > static Dataset getDummyData(SparkSession spark, int numberRows, int > numberFeatures, int labelUpperBound) { > StructType schema = new StructType(new StructField[]{ > new StructField("label", DataTypes.DoubleType, false, > Metadata.empty()), > new StructField("features", new VectorUDT(), false, >
[jira] [Comment Edited] (SPARK-10643) Support remote application download in client mode spark submit
[ https://issues.apache.org/jira/browse/SPARK-10643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15853817#comment-15853817 ] wangqiaoshi edited comment on SPARK-10643 at 2/6/17 10:50 AM: -- +1. i think it would be useful when use azkaban in mutil-executor mode,i expect get execution-jar from hdfs but from mysql. eg. not support client mode: when we edit spark job with client mode,azkaban executor A need to get xx.jar from host a,azkaban executor B also need to get xx.jar from host b. was (Author: wangqiaoshi): +1. i think it would be useful when use azkaban in mutil-executor mode,i expect get execution-jar from hdfs but from mysql. > Support remote application download in client mode spark submit > --- > > Key: SPARK-10643 > URL: https://issues.apache.org/jira/browse/SPARK-10643 > Project: Spark > Issue Type: New Feature > Components: Spark Submit >Reporter: Alan Braithwaite >Priority: Minor > > When using mesos with docker and marathon, it would be nice to be able to > make spark-submit deployable on marathon and have that download a jar from > HDFS instead of having to package the jar with the docker. > {code} > $ docker run -it docker.example.com/spark:latest > /usr/local/spark/bin/spark-submit --class > com.example.spark.streaming.EventHandler hdfs://hdfs/tmp/application.jar > Warning: Skip remote jar hdfs://hdfs/tmp/application.jar. > java.lang.ClassNotFoundException: com.example.spark.streaming.EventHandler > at java.net.URLClassLoader.findClass(URLClassLoader.java:381) > at java.lang.ClassLoader.loadClass(ClassLoader.java:424) > at java.lang.ClassLoader.loadClass(ClassLoader.java:357) > at java.lang.Class.forName0(Native Method) > at java.lang.Class.forName(Class.java:348) > at org.apache.spark.util.Utils$.classForName(Utils.scala:173) > at > org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:639) > at > org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180) > at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205) > at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:120) > at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) > {code} > Although I'm aware that we can run in cluster mode with mesos, we've already > built some nice tools surrounding marathon for logging and monitoring. > Code in question: > https://github.com/apache/spark/blob/132718ad7f387e1002b708b19e471d9cd907e105/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L723-L736 -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-19470) Spark 1.6.4 in Intellij can't use jetty 8
[ https://issues.apache.org/jira/browse/SPARK-19470?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-19470. --- Resolution: Invalid There is no Spark 1.6.4, and you can't expect to change a dependency with no other changes required in the code. > Spark 1.6.4 in Intellij can't use jetty 8 > - > > Key: SPARK-19470 > URL: https://issues.apache.org/jira/browse/SPARK-19470 > Project: Spark > Issue Type: Bug > Components: Project Infra >Affects Versions: 1.6.1, 1.6.4 >Reporter: Mingda Li > Original Estimate: 1m > Remaining Estimate: 1m > > When I build the project in Intellij, I get the following error: > [error] > /Users/MingdaLi/Desktop/ucla_4/research/spark-branch-1.6.4/core/src/main/scala/org/apache/spark/HttpServer.scala:22: > object ssl is not a member of package org.eclipse.jetty.server > [error] import org.eclipse.jetty.server.ssl.SslSocketConnector > [error] ^ > [error] > /Users/MingdaLi/Desktop/ucla_4/research/spark-branch-1.6.4/core/src/main/scala/org/apache/spark/HttpServer.scala:28: > object bio is not a member of package org.eclipse.jetty.server > [error] import org.eclipse.jetty.server.bio.SocketConnector > [error] ^ > [error] > /Users/MingdaLi/Desktop/ucla_4/research/spark-branch-1.6.4/core/src/main/scala/org/apache/spark/HttpServer.scala:78: > not found: type SslSocketConnector > [error] .map(new SslSocketConnector(_)).getOrElse(new SocketConnector) > [error]^ > [error] > /Users/MingdaLi/Desktop/ucla_4/research/spark-branch-1.6.4/core/src/main/scala/org/apache/spark/HttpServer.scala:87: > value setThreadPool is not a member of org.eclipse.jetty.server.Server > [error] server.setThreadPool(threadPool) > [error]^ > [error] > /Users/MingdaLi/Desktop/ucla_4/research/spark-branch-1.6.4/core/src/main/scala/org/apache/spark/HttpServer.scala:106: > value getLocalPort is not a member of org.eclipse.jetty.server.Connector > [error] val actualPort = server.getConnectors()(0).getLocalPort > [error]^ > [error] > /Users/MingdaLi/Desktop/ucla_4/research/spark-branch-1.6.4/core/src/main/scala/org/apache/spark/SSLOptions.scala:66: > type mismatch; > [error] found : String > [error] required: java.security.KeyStore > [error] trustStore.foreach(file => > sslContextFactory.setTrustStore(file.getAbsolutePath)) > [error] > ^ > [error] > /Users/MingdaLi/Desktop/ucla_4/research/spark-branch-1.6.4/core/src/main/scala/org/apache/spark/deploy/rest/RestSubmissionServer.scala:84: > value setThreadPool is not a member of org.eclipse.jetty.server.Server > [error] server.setThreadPool(threadPool) > [error]^ > [error] > /Users/MingdaLi/Desktop/ucla_4/research/spark-branch-1.6.4/core/src/main/scala/org/apache/spark/deploy/rest/RestSubmissionServer.scala:92: > value getLocalPort is not a member of org.eclipse.jetty.server.Connector > [error] val boundPort = server.getConnectors()(0).getLocalPort > [error] > I think this is caused by using the jetty 9.3.11 but not 8.1.19. > Actually, I solved this by manually remove the 9.3.11's libraries of jetty > from project's library and add the 8.1.19 to it. Then I build it. This can > run well. > But how to avoid manually solve this? > And I am afraid next time, when I open Intellij, this will appear again. > -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-19474) SparkSQL unsupports to change hive table's name\dataType
Xiaochen Ouyang created SPARK-19474: --- Summary: SparkSQL unsupports to change hive table's name\dataType Key: SPARK-19474 URL: https://issues.apache.org/jira/browse/SPARK-19474 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.0.1, 2.0.0 Environment: Spark2.0.1 Reporter: Xiaochen Ouyang After creating a hive table, it will be not allowed to change a column's date type or name. It will be useful to provide a public interface (e.g. SQL) to do that. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10643) Support remote application download in client mode spark submit
[ https://issues.apache.org/jira/browse/SPARK-10643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15853817#comment-15853817 ] wangqiaoshi commented on SPARK-10643: - +1. i think it would be useful when use azkaban in mutil-executor mode,i expect get execution-jar from hdfs but from mysql. > Support remote application download in client mode spark submit > --- > > Key: SPARK-10643 > URL: https://issues.apache.org/jira/browse/SPARK-10643 > Project: Spark > Issue Type: New Feature > Components: Spark Submit >Reporter: Alan Braithwaite >Priority: Minor > > When using mesos with docker and marathon, it would be nice to be able to > make spark-submit deployable on marathon and have that download a jar from > HDFS instead of having to package the jar with the docker. > {code} > $ docker run -it docker.example.com/spark:latest > /usr/local/spark/bin/spark-submit --class > com.example.spark.streaming.EventHandler hdfs://hdfs/tmp/application.jar > Warning: Skip remote jar hdfs://hdfs/tmp/application.jar. > java.lang.ClassNotFoundException: com.example.spark.streaming.EventHandler > at java.net.URLClassLoader.findClass(URLClassLoader.java:381) > at java.lang.ClassLoader.loadClass(ClassLoader.java:424) > at java.lang.ClassLoader.loadClass(ClassLoader.java:357) > at java.lang.Class.forName0(Native Method) > at java.lang.Class.forName(Class.java:348) > at org.apache.spark.util.Utils$.classForName(Utils.scala:173) > at > org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:639) > at > org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180) > at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205) > at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:120) > at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) > {code} > Although I'm aware that we can run in cluster mode with mesos, we've already > built some nice tools surrounding marathon for logging and monitoring. > Code in question: > https://github.com/apache/spark/blob/132718ad7f387e1002b708b19e471d9cd907e105/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L723-L736 -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19468) Dataset slow because of unnecessary shuffles
[ https://issues.apache.org/jira/browse/SPARK-19468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15853809#comment-15853809 ] Sean Owen commented on SPARK-19468: --- I am unclear whether this is a bug report. You're saying one thing is slower than another thing, but is the behavior wrong or unreasonably slow? if x is better than y, do x? > Dataset slow because of unnecessary shuffles > > > Key: SPARK-19468 > URL: https://issues.apache.org/jira/browse/SPARK-19468 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: koert kuipers > > we noticed that some algos we ported from rdd to dataset are significantly > slower, and the main reason seems to be more shuffles that we successfully > avoid for rdds by careful partitioning. this seems to be dataset specific as > it works ok for dataframe. > see also here: > http://blog.hydronitrogen.com/2016/05/13/shuffle-free-joins-in-spark-sql/ > it kind of boils down to this... if i partition and sort dataframes that get > used for joins repeatedly i can avoid shuffles: > {noformat} > System.setProperty("spark.sql.autoBroadcastJoinThreshold", "-1") > val df1 = Seq((0, 0), (1, 1)).toDF("key", "value") > > .repartition(col("key")).sortWithinPartitions(col("key")).persist(StorageLevel.DISK_ONLY) > val df2 = Seq((0, 0), (1, 1)).toDF("key2", "value2") > > .repartition(col("key2")).sortWithinPartitions(col("key2")).persist(StorageLevel.DISK_ONLY) > val joined = df1.join(df2, col("key") === col("key2")) > joined.explain > == Physical Plan == > *SortMergeJoin [key#5], [key2#27], Inner > :- InMemoryTableScan [key#5, value#6] > : +- InMemoryRelation [key#5, value#6], true, 1, StorageLevel(disk, 1 > replicas) > : +- *Sort [key#5 ASC NULLS FIRST], false, 0 > : +- Exchange hashpartitioning(key#5, 4) > : +- LocalTableScan [key#5, value#6] > +- InMemoryTableScan [key2#27, value2#28] > +- InMemoryRelation [key2#27, value2#28], true, 1, > StorageLevel(disk, 1 replicas) > +- *Sort [key2#27 ASC NULLS FIRST], false, 0 >+- Exchange hashpartitioning(key2#27, 4) > +- LocalTableScan [key2#27, value2#28] > {noformat} > notice how the persisted dataframes are not shuffled or sorted anymore before > being used in the join. however if i try to do the same with dataset i have > no luck: > {noformat} > val ds1 = Seq((0, 0), (1, 1)).toDS > > .repartition(col("_1")).sortWithinPartitions(col("_1")).persist(StorageLevel.DISK_ONLY) > val ds2 = Seq((0, 0), (1, 1)).toDS > > .repartition(col("_1")).sortWithinPartitions(col("_1")).persist(StorageLevel.DISK_ONLY) > val joined1 = ds1.joinWith(ds2, ds1("_1") === ds2("_1")) > joined1.explain > == Physical Plan == > *SortMergeJoin [_1#105._1], [_2#106._1], Inner > :- *Sort [_1#105._1 ASC NULLS FIRST], false, 0 > : +- Exchange hashpartitioning(_1#105._1, 4) > : +- *Project [named_struct(_1, _1#83, _2, _2#84) AS _1#105] > :+- InMemoryTableScan [_1#83, _2#84] > : +- InMemoryRelation [_1#83, _2#84], true, 1, > StorageLevel(disk, 1 replicas) > :+- *Sort [_1#83 ASC NULLS FIRST], false, 0 > : +- Exchange hashpartitioning(_1#83, 4) > : +- LocalTableScan [_1#83, _2#84] > +- *Sort [_2#106._1 ASC NULLS FIRST], false, 0 >+- Exchange hashpartitioning(_2#106._1, 4) > +- *Project [named_struct(_1, _1#100, _2, _2#101) AS _2#106] > +- InMemoryTableScan [_1#100, _2#101] >+- InMemoryRelation [_1#100, _2#101], true, 1, > StorageLevel(disk, 1 replicas) > +- *Sort [_1#83 ASC NULLS FIRST], false, 0 > +- Exchange hashpartitioning(_1#83, 4) >+- LocalTableScan [_1#83, _2#84] > {noformat} > notice how my persisted Datasets are shuffled and sorted again. part of the > issue seems to be in joinWith, which does some preprocessing that seems to > confuse the planner. if i change the joinWith to join (which returns a > dataframe) it looks a little better in that only one side gets shuffled > again, but still not optimal: > {noformat} > val ds1 = Seq((0, 0), (1, 1)).toDS > > .repartition(col("_1")).sortWithinPartitions(col("_1")).persist(StorageLevel.DISK_ONLY) > val ds2 = Seq((0, 0), (1, 1)).toDS > > .repartition(col("_1")).sortWithinPartitions(col("_1")).persist(StorageLevel.DISK_ONLY) > val joined1 = ds1.join(ds2, ds1("_1") === ds2("_1")) > joined1.explain > == Physical Plan == > *SortMergeJoin [_1#83], [_1#100], Inner > :- InMemoryTableScan [_1#83, _2#84] > : +- InMemoryRelation [_1#83, _2#84], true, 1, StorageLevel(disk, 1 > replicas) > : +- *Sort [_1#83 ASC NULLS FIRST], false, 0 > : +- Exchange
[jira] [Commented] (SPARK-19451) Long values in Window function
[ https://issues.apache.org/jira/browse/SPARK-19451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15853800#comment-15853800 ] Herman van Hovell commented on SPARK-19451: --- Yeah, you are right about that. We should definitely support this. > Long values in Window function > -- > > Key: SPARK-19451 > URL: https://issues.apache.org/jira/browse/SPARK-19451 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.6.1, 2.0.2 >Reporter: Julien Champ > > Hi there, > there seems to be a major limitation in spark window functions and > rangeBetween method. > If I have the following code : > {code:title=Exemple |borderStyle=solid} > val tw = Window.orderBy("date") > .partitionBy("id") > .rangeBetween( from , 0) > {code} > Everything seems ok, while *from* value is not too large... Even if the > rangeBetween() method supports Long parameters. > But If i set *-216000L* value to *from* it does not work ! > It is probably related to this part of code in the between() method, of the > WindowSpec class, called by rangeBetween() > {code:title=between() method|borderStyle=solid} > val boundaryStart = start match { > case 0 => CurrentRow > case Long.MinValue => UnboundedPreceding > case x if x < 0 => ValuePreceding(-start.toInt) > case x if x > 0 => ValueFollowing(start.toInt) > } > {code} > ( look at this *.toInt* ) > Does anybody know it there's a way to solve / patch this behavior ? > Any help will be appreciated > Thx -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19451) Long values in Window function
[ https://issues.apache.org/jira/browse/SPARK-19451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15853793#comment-15853793 ] Julien Champ commented on SPARK-19451: -- Let's imagine that this window is used on timestamp values in ms : I can ask for a window with a range between [-216000L, 0] and only have a few values inside, not necessarily 216000L. I can understand the limitation for the rowBetween() method but the rangeBetween() method is nice for this kind of usage. > Long values in Window function > -- > > Key: SPARK-19451 > URL: https://issues.apache.org/jira/browse/SPARK-19451 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.6.1, 2.0.2 >Reporter: Julien Champ > > Hi there, > there seems to be a major limitation in spark window functions and > rangeBetween method. > If I have the following code : > {code:title=Exemple |borderStyle=solid} > val tw = Window.orderBy("date") > .partitionBy("id") > .rangeBetween( from , 0) > {code} > Everything seems ok, while *from* value is not too large... Even if the > rangeBetween() method supports Long parameters. > But If i set *-216000L* value to *from* it does not work ! > It is probably related to this part of code in the between() method, of the > WindowSpec class, called by rangeBetween() > {code:title=between() method|borderStyle=solid} > val boundaryStart = start match { > case 0 => CurrentRow > case Long.MinValue => UnboundedPreceding > case x if x < 0 => ValuePreceding(-start.toInt) > case x if x > 0 => ValueFollowing(start.toInt) > } > {code} > ( look at this *.toInt* ) > Does anybody know it there's a way to solve / patch this behavior ? > Any help will be appreciated > Thx -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-16441) Spark application hang when dynamic allocation is enabled
[ https://issues.apache.org/jira/browse/SPARK-16441?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-16441: Attachment: SPARK-16441-yarn-metrics.jpg SPARK-16441-threadDump.jpg SPARK-16441-stage.jpg > Spark application hang when dynamic allocation is enabled > - > > Key: SPARK-16441 > URL: https://issues.apache.org/jira/browse/SPARK-16441 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.6.2, 2.0.0, 2.1.0 > Environment: hadoop 2.7.2 spark1.6.2 >Reporter: cen yuhai > Attachments: SPARK-16441-stage.jpg, SPARK-16441-threadDump.jpg, > SPARK-16441-yarn-metrics.jpg > > > spark application are waiting for rpc response all the time and spark > listener are blocked by dynamic allocation. Executors can not connect to > driver and lost. > "spark-dynamic-executor-allocation" #239 daemon prio=5 os_prio=0 > tid=0x7fa304438000 nid=0xcec6 waiting on condition [0x7fa2b81e4000] >java.lang.Thread.State: TIMED_WAITING (parking) > at sun.misc.Unsafe.park(Native Method) > - parking to wait for <0x00070fdb94f8> (a > scala.concurrent.impl.Promise$CompletionLatch) > at > java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215) > at > java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedNanos(AbstractQueuedSynchronizer.java:1037) > at > java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(AbstractQueuedSynchronizer.java:1328) > at > scala.concurrent.impl.Promise$DefaultPromise.tryAwait(Promise.scala:208) > at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:218) > at > scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223) > at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107) > at > scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53) > at scala.concurrent.Await$.result(package.scala:107) > at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75) > at > org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:101) > at > org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:77) > at > org.apache.spark.scheduler.cluster.YarnSchedulerBackend.doRequestTotalExecutors(YarnSchedulerBackend.scala:59) > at > org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.requestTotalExecutors(CoarseGrainedSchedulerBackend.scala:436) > - locked <0x828a8960> (a > org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend) > at > org.apache.spark.SparkContext.requestTotalExecutors(SparkContext.scala:1438) > at > org.apache.spark.ExecutorAllocationManager.addExecutors(ExecutorAllocationManager.scala:359) > at > org.apache.spark.ExecutorAllocationManager.updateAndSyncNumExecutorsTarget(ExecutorAllocationManager.scala:310) > - locked <0x880e6308> (a > org.apache.spark.ExecutorAllocationManager) > at > org.apache.spark.ExecutorAllocationManager.org$apache$spark$ExecutorAllocationManager$$schedule(ExecutorAllocationManager.scala:264) > - locked <0x880e6308> (a > org.apache.spark.ExecutorAllocationManager) > at > org.apache.spark.ExecutorAllocationManager$$anon$2.run(ExecutorAllocationManager.scala:223) > "SparkListenerBus" #161 daemon prio=5 os_prio=0 tid=0x7fa3053be000 > nid=0xcec9 waiting for monitor entry [0x7fa2b3dfc000] >java.lang.Thread.State: BLOCKED (on object monitor) > at > org.apache.spark.ExecutorAllocationManager$ExecutorAllocationListener.onTaskEnd(ExecutorAllocationManager.scala:618) > - waiting to lock <0x880e6308> (a > org.apache.spark.ExecutorAllocationManager) > at > org.apache.spark.scheduler.SparkListenerBus$class.onPostEvent(SparkListenerBus.scala:42) > at > org.apache.spark.scheduler.LiveListenerBus.onPostEvent(LiveListenerBus.scala:31) > at > org.apache.spark.scheduler.LiveListenerBus.onPostEvent(LiveListenerBus.scala:31) > at > org.apache.spark.util.ListenerBus$class.postToAll(ListenerBus.scala:55) > at > org.apache.spark.util.AsynchronousListenerBus.postToAll(AsynchronousListenerBus.scala:37) > at > org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(AsynchronousListenerBus.scala:80) > at > org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(AsynchronousListenerBus.scala:65) > at > org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(AsynchronousListenerBus.scala:65) > at
[jira] [Updated] (SPARK-16441) Spark application hang when dynamic allocation is enabled
[ https://issues.apache.org/jira/browse/SPARK-16441?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-16441: Affects Version/s: 2.1.0 > Spark application hang when dynamic allocation is enabled > - > > Key: SPARK-16441 > URL: https://issues.apache.org/jira/browse/SPARK-16441 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.6.2, 2.0.0, 2.1.0 > Environment: hadoop 2.7.2 spark1.6.2 >Reporter: cen yuhai > > spark application are waiting for rpc response all the time and spark > listener are blocked by dynamic allocation. Executors can not connect to > driver and lost. > "spark-dynamic-executor-allocation" #239 daemon prio=5 os_prio=0 > tid=0x7fa304438000 nid=0xcec6 waiting on condition [0x7fa2b81e4000] >java.lang.Thread.State: TIMED_WAITING (parking) > at sun.misc.Unsafe.park(Native Method) > - parking to wait for <0x00070fdb94f8> (a > scala.concurrent.impl.Promise$CompletionLatch) > at > java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215) > at > java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedNanos(AbstractQueuedSynchronizer.java:1037) > at > java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(AbstractQueuedSynchronizer.java:1328) > at > scala.concurrent.impl.Promise$DefaultPromise.tryAwait(Promise.scala:208) > at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:218) > at > scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223) > at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107) > at > scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53) > at scala.concurrent.Await$.result(package.scala:107) > at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75) > at > org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:101) > at > org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:77) > at > org.apache.spark.scheduler.cluster.YarnSchedulerBackend.doRequestTotalExecutors(YarnSchedulerBackend.scala:59) > at > org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.requestTotalExecutors(CoarseGrainedSchedulerBackend.scala:436) > - locked <0x828a8960> (a > org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend) > at > org.apache.spark.SparkContext.requestTotalExecutors(SparkContext.scala:1438) > at > org.apache.spark.ExecutorAllocationManager.addExecutors(ExecutorAllocationManager.scala:359) > at > org.apache.spark.ExecutorAllocationManager.updateAndSyncNumExecutorsTarget(ExecutorAllocationManager.scala:310) > - locked <0x880e6308> (a > org.apache.spark.ExecutorAllocationManager) > at > org.apache.spark.ExecutorAllocationManager.org$apache$spark$ExecutorAllocationManager$$schedule(ExecutorAllocationManager.scala:264) > - locked <0x880e6308> (a > org.apache.spark.ExecutorAllocationManager) > at > org.apache.spark.ExecutorAllocationManager$$anon$2.run(ExecutorAllocationManager.scala:223) > "SparkListenerBus" #161 daemon prio=5 os_prio=0 tid=0x7fa3053be000 > nid=0xcec9 waiting for monitor entry [0x7fa2b3dfc000] >java.lang.Thread.State: BLOCKED (on object monitor) > at > org.apache.spark.ExecutorAllocationManager$ExecutorAllocationListener.onTaskEnd(ExecutorAllocationManager.scala:618) > - waiting to lock <0x880e6308> (a > org.apache.spark.ExecutorAllocationManager) > at > org.apache.spark.scheduler.SparkListenerBus$class.onPostEvent(SparkListenerBus.scala:42) > at > org.apache.spark.scheduler.LiveListenerBus.onPostEvent(LiveListenerBus.scala:31) > at > org.apache.spark.scheduler.LiveListenerBus.onPostEvent(LiveListenerBus.scala:31) > at > org.apache.spark.util.ListenerBus$class.postToAll(ListenerBus.scala:55) > at > org.apache.spark.util.AsynchronousListenerBus.postToAll(AsynchronousListenerBus.scala:37) > at > org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(AsynchronousListenerBus.scala:80) > at > org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(AsynchronousListenerBus.scala:65) > at > org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(AsynchronousListenerBus.scala:65) > at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57) > at > org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1.apply$mcV$sp(AsynchronousListenerBus.scala:64) > at
[jira] [Commented] (SPARK-16441) Spark application hang when dynamic allocation is enabled
[ https://issues.apache.org/jira/browse/SPARK-16441?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15853784#comment-15853784 ] Yuming Wang commented on SPARK-16441: - set {{spark.dynamicAllocation.maxExecutors}} to a reasonable value is OK. This happens when a stage has a lot of tasks and need a lot of executors , this is far exceed cluster resources, and it doesn't make sense. my opinion is dynamic set {{spark.dynamicAllocation.maxExecutors}} by cluster resources. I have logged the numExecutors by [CoarseGrainedSchedulerBackend.scala|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/cluster/CoarseGrainedSchedulerBackend.scala]: {code:java} final override def requestTotalExecutors( numExecutors: Int, localityAwareTasks: Int, hostToLocalTaskCount: Map[String, Int] ): Boolean = { if (numExecutors < 0) { throw new IllegalArgumentException( "Attempted to request a negative number of executor(s) " + s"$numExecutors from the cluster manager. Please specify a positive number!") } val response = synchronized { this.localityAwareTasks = localityAwareTasks this.hostToLocalTaskCount = hostToLocalTaskCount numPendingExecutors = math.max(numExecutors - numExistingExecutors + executorsPendingToRemove.size, 0) logInfo(s"numExecutors: ${numExecutors}, " + s"numExistingExecutors: ${numExistingExecutors}, " + s"executorsPendingToRemove.size: ${executorsPendingToRemove.size}") doRequestTotalExecutors(numExecutors) } defaultAskTimeout.awaitResult(response) } {code} This is the result: {noformat} 17/02/06 16:00:17 INFO YarnClientSchedulerBackend: numExecutors: 200, numExistingExecutors: 0, executorsPendingToRemove.size: 0 17/02/06 16:01:19 INFO YarnClientSchedulerBackend: numExecutors: 0, numExistingExecutors: 91, executorsPendingToRemove.size: 3 17/02/06 16:02:01 INFO YarnClientSchedulerBackend: numExecutors: 1, numExistingExecutors: 0, executorsPendingToRemove.size: 0 17/02/06 16:02:02 INFO YarnClientSchedulerBackend: numExecutors: 3, numExistingExecutors: 0, executorsPendingToRemove.size: 0 17/02/06 16:02:03 INFO YarnClientSchedulerBackend: numExecutors: 7, numExistingExecutors: 0, executorsPendingToRemove.size: 0 17/02/06 16:02:04 INFO YarnClientSchedulerBackend: numExecutors: 15, numExistingExecutors: 1, executorsPendingToRemove.size: 0 17/02/06 16:02:05 INFO YarnClientSchedulerBackend: numExecutors: 31, numExistingExecutors: 1, executorsPendingToRemove.size: 0 17/02/06 16:02:06 INFO YarnClientSchedulerBackend: numExecutors: 63, numExistingExecutors: 3, executorsPendingToRemove.size: 0 17/02/06 16:02:07 INFO YarnClientSchedulerBackend: numExecutors: 127, numExistingExecutors: 10, executorsPendingToRemove.size: 0 17/02/06 16:02:08 INFO YarnClientSchedulerBackend: numExecutors: 255, numExistingExecutors: 15, executorsPendingToRemove.size: 0 17/02/06 16:02:09 INFO YarnClientSchedulerBackend: numExecutors: 511, numExistingExecutors: 39, executorsPendingToRemove.size: 0 17/02/06 16:02:10 INFO YarnClientSchedulerBackend: numExecutors: 1023, numExistingExecutors: 51, executorsPendingToRemove.size: 0 17/02/06 16:02:11 INFO YarnClientSchedulerBackend: numExecutors: 2047, numExistingExecutors: 60, executorsPendingToRemove.size: 0 17/02/06 16:02:12 INFO YarnClientSchedulerBackend: numExecutors: 4095, numExistingExecutors: 65, executorsPendingToRemove.size: 0 17/02/06 16:02:13 INFO YarnClientSchedulerBackend: numExecutors: 8191, numExistingExecutors: 65, executorsPendingToRemove.size: 0 17/02/06 16:02:14 INFO YarnClientSchedulerBackend: numExecutors: 16383, numExistingExecutors: 67, executorsPendingToRemove.size: 0 17/02/06 16:02:15 INFO YarnClientSchedulerBackend: numExecutors: 32767, numExistingExecutors: 68, executorsPendingToRemove.size: 0 17/02/06 16:02:16 INFO YarnClientSchedulerBackend: numExecutors: 65535, numExistingExecutors: 71, executorsPendingToRemove.size: 0 17/02/06 16:02:17 INFO YarnClientSchedulerBackend: numExecutors: 69248, numExistingExecutors: 72, executorsPendingToRemove.size: 0 17/02/06 16:02:20 INFO YarnClientSchedulerBackend: numExecutors: 69195, numExistingExecutors: 79, executorsPendingToRemove.size: 0 17/02/06 16:02:20 INFO YarnClientSchedulerBackend: numExecutors: 69151, numExistingExecutors: 79, executorsPendingToRemove.size: 0 17/02/06 16:02:21 INFO YarnClientSchedulerBackend: numExecutors: 69083, numExistingExecutors: 82, executorsPendingToRemove.size: 0 17/02/06 16:02:21 INFO YarnClientSchedulerBackend: numExecutors: 69021, numExistingExecutors: 82, executorsPendingToRemove.size: 0 17/02/06 16:02:21 INFO YarnClientSchedulerBackend: numExecutors: 68954, numExistingExecutors: 82, executorsPendingToRemove.size: 0 17/02/06 16:02:21 INFO YarnClientSchedulerBackend: numExecutors: 68868, numExistingExecutors: 82, executorsPendingToRemove.size:
[jira] [Commented] (SPARK-19451) Long values in Window function
[ https://issues.apache.org/jira/browse/SPARK-19451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15853782#comment-15853782 ] Herman van Hovell commented on SPARK-19451: --- [~jchamp] how may rows are in your partitions? 2 billion? So this is an oversight, but I am not sure we should even try to support more than {{1 << 32 - 1}} values in a partition. > Long values in Window function > -- > > Key: SPARK-19451 > URL: https://issues.apache.org/jira/browse/SPARK-19451 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.6.1, 2.0.2 >Reporter: Julien Champ > > Hi there, > there seems to be a major limitation in spark window functions and > rangeBetween method. > If I have the following code : > {code:title=Exemple |borderStyle=solid} > val tw = Window.orderBy("date") > .partitionBy("id") > .rangeBetween( from , 0) > {code} > Everything seems ok, while *from* value is not too large... Even if the > rangeBetween() method supports Long parameters. > But If i set *-216000L* value to *from* it does not work ! > It is probably related to this part of code in the between() method, of the > WindowSpec class, called by rangeBetween() > {code:title=between() method|borderStyle=solid} > val boundaryStart = start match { > case 0 => CurrentRow > case Long.MinValue => UnboundedPreceding > case x if x < 0 => ValuePreceding(-start.toInt) > case x if x > 0 => ValueFollowing(start.toInt) > } > {code} > ( look at this *.toInt* ) > Does anybody know it there's a way to solve / patch this behavior ? > Any help will be appreciated > Thx -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-16441) Spark application hang when dynamic allocation is enabled
[ https://issues.apache.org/jira/browse/SPARK-16441?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16441: Assignee: (was: Apache Spark) > Spark application hang when dynamic allocation is enabled > - > > Key: SPARK-16441 > URL: https://issues.apache.org/jira/browse/SPARK-16441 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.6.2, 2.0.0 > Environment: hadoop 2.7.2 spark1.6.2 >Reporter: cen yuhai > > spark application are waiting for rpc response all the time and spark > listener are blocked by dynamic allocation. Executors can not connect to > driver and lost. > "spark-dynamic-executor-allocation" #239 daemon prio=5 os_prio=0 > tid=0x7fa304438000 nid=0xcec6 waiting on condition [0x7fa2b81e4000] >java.lang.Thread.State: TIMED_WAITING (parking) > at sun.misc.Unsafe.park(Native Method) > - parking to wait for <0x00070fdb94f8> (a > scala.concurrent.impl.Promise$CompletionLatch) > at > java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215) > at > java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedNanos(AbstractQueuedSynchronizer.java:1037) > at > java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(AbstractQueuedSynchronizer.java:1328) > at > scala.concurrent.impl.Promise$DefaultPromise.tryAwait(Promise.scala:208) > at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:218) > at > scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223) > at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107) > at > scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53) > at scala.concurrent.Await$.result(package.scala:107) > at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75) > at > org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:101) > at > org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:77) > at > org.apache.spark.scheduler.cluster.YarnSchedulerBackend.doRequestTotalExecutors(YarnSchedulerBackend.scala:59) > at > org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.requestTotalExecutors(CoarseGrainedSchedulerBackend.scala:436) > - locked <0x828a8960> (a > org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend) > at > org.apache.spark.SparkContext.requestTotalExecutors(SparkContext.scala:1438) > at > org.apache.spark.ExecutorAllocationManager.addExecutors(ExecutorAllocationManager.scala:359) > at > org.apache.spark.ExecutorAllocationManager.updateAndSyncNumExecutorsTarget(ExecutorAllocationManager.scala:310) > - locked <0x880e6308> (a > org.apache.spark.ExecutorAllocationManager) > at > org.apache.spark.ExecutorAllocationManager.org$apache$spark$ExecutorAllocationManager$$schedule(ExecutorAllocationManager.scala:264) > - locked <0x880e6308> (a > org.apache.spark.ExecutorAllocationManager) > at > org.apache.spark.ExecutorAllocationManager$$anon$2.run(ExecutorAllocationManager.scala:223) > "SparkListenerBus" #161 daemon prio=5 os_prio=0 tid=0x7fa3053be000 > nid=0xcec9 waiting for monitor entry [0x7fa2b3dfc000] >java.lang.Thread.State: BLOCKED (on object monitor) > at > org.apache.spark.ExecutorAllocationManager$ExecutorAllocationListener.onTaskEnd(ExecutorAllocationManager.scala:618) > - waiting to lock <0x880e6308> (a > org.apache.spark.ExecutorAllocationManager) > at > org.apache.spark.scheduler.SparkListenerBus$class.onPostEvent(SparkListenerBus.scala:42) > at > org.apache.spark.scheduler.LiveListenerBus.onPostEvent(LiveListenerBus.scala:31) > at > org.apache.spark.scheduler.LiveListenerBus.onPostEvent(LiveListenerBus.scala:31) > at > org.apache.spark.util.ListenerBus$class.postToAll(ListenerBus.scala:55) > at > org.apache.spark.util.AsynchronousListenerBus.postToAll(AsynchronousListenerBus.scala:37) > at > org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(AsynchronousListenerBus.scala:80) > at > org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(AsynchronousListenerBus.scala:65) > at > org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(AsynchronousListenerBus.scala:65) > at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57) > at > org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1.apply$mcV$sp(AsynchronousListenerBus.scala:64) > at
[jira] [Assigned] (SPARK-16441) Spark application hang when dynamic allocation is enabled
[ https://issues.apache.org/jira/browse/SPARK-16441?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16441: Assignee: Apache Spark > Spark application hang when dynamic allocation is enabled > - > > Key: SPARK-16441 > URL: https://issues.apache.org/jira/browse/SPARK-16441 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.6.2, 2.0.0 > Environment: hadoop 2.7.2 spark1.6.2 >Reporter: cen yuhai >Assignee: Apache Spark > > spark application are waiting for rpc response all the time and spark > listener are blocked by dynamic allocation. Executors can not connect to > driver and lost. > "spark-dynamic-executor-allocation" #239 daemon prio=5 os_prio=0 > tid=0x7fa304438000 nid=0xcec6 waiting on condition [0x7fa2b81e4000] >java.lang.Thread.State: TIMED_WAITING (parking) > at sun.misc.Unsafe.park(Native Method) > - parking to wait for <0x00070fdb94f8> (a > scala.concurrent.impl.Promise$CompletionLatch) > at > java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215) > at > java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedNanos(AbstractQueuedSynchronizer.java:1037) > at > java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(AbstractQueuedSynchronizer.java:1328) > at > scala.concurrent.impl.Promise$DefaultPromise.tryAwait(Promise.scala:208) > at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:218) > at > scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223) > at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107) > at > scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53) > at scala.concurrent.Await$.result(package.scala:107) > at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75) > at > org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:101) > at > org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:77) > at > org.apache.spark.scheduler.cluster.YarnSchedulerBackend.doRequestTotalExecutors(YarnSchedulerBackend.scala:59) > at > org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.requestTotalExecutors(CoarseGrainedSchedulerBackend.scala:436) > - locked <0x828a8960> (a > org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend) > at > org.apache.spark.SparkContext.requestTotalExecutors(SparkContext.scala:1438) > at > org.apache.spark.ExecutorAllocationManager.addExecutors(ExecutorAllocationManager.scala:359) > at > org.apache.spark.ExecutorAllocationManager.updateAndSyncNumExecutorsTarget(ExecutorAllocationManager.scala:310) > - locked <0x880e6308> (a > org.apache.spark.ExecutorAllocationManager) > at > org.apache.spark.ExecutorAllocationManager.org$apache$spark$ExecutorAllocationManager$$schedule(ExecutorAllocationManager.scala:264) > - locked <0x880e6308> (a > org.apache.spark.ExecutorAllocationManager) > at > org.apache.spark.ExecutorAllocationManager$$anon$2.run(ExecutorAllocationManager.scala:223) > "SparkListenerBus" #161 daemon prio=5 os_prio=0 tid=0x7fa3053be000 > nid=0xcec9 waiting for monitor entry [0x7fa2b3dfc000] >java.lang.Thread.State: BLOCKED (on object monitor) > at > org.apache.spark.ExecutorAllocationManager$ExecutorAllocationListener.onTaskEnd(ExecutorAllocationManager.scala:618) > - waiting to lock <0x880e6308> (a > org.apache.spark.ExecutorAllocationManager) > at > org.apache.spark.scheduler.SparkListenerBus$class.onPostEvent(SparkListenerBus.scala:42) > at > org.apache.spark.scheduler.LiveListenerBus.onPostEvent(LiveListenerBus.scala:31) > at > org.apache.spark.scheduler.LiveListenerBus.onPostEvent(LiveListenerBus.scala:31) > at > org.apache.spark.util.ListenerBus$class.postToAll(ListenerBus.scala:55) > at > org.apache.spark.util.AsynchronousListenerBus.postToAll(AsynchronousListenerBus.scala:37) > at > org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(AsynchronousListenerBus.scala:80) > at > org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(AsynchronousListenerBus.scala:65) > at > org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(AsynchronousListenerBus.scala:65) > at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57) > at > org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1.apply$mcV$sp(AsynchronousListenerBus.scala:64) > at
[jira] [Commented] (SPARK-16441) Spark application hang when dynamic allocation is enabled
[ https://issues.apache.org/jira/browse/SPARK-16441?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15853775#comment-15853775 ] Apache Spark commented on SPARK-16441: -- User 'wangyum' has created a pull request for this issue: https://github.com/apache/spark/pull/16819 > Spark application hang when dynamic allocation is enabled > - > > Key: SPARK-16441 > URL: https://issues.apache.org/jira/browse/SPARK-16441 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.6.2, 2.0.0 > Environment: hadoop 2.7.2 spark1.6.2 >Reporter: cen yuhai > > spark application are waiting for rpc response all the time and spark > listener are blocked by dynamic allocation. Executors can not connect to > driver and lost. > "spark-dynamic-executor-allocation" #239 daemon prio=5 os_prio=0 > tid=0x7fa304438000 nid=0xcec6 waiting on condition [0x7fa2b81e4000] >java.lang.Thread.State: TIMED_WAITING (parking) > at sun.misc.Unsafe.park(Native Method) > - parking to wait for <0x00070fdb94f8> (a > scala.concurrent.impl.Promise$CompletionLatch) > at > java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215) > at > java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedNanos(AbstractQueuedSynchronizer.java:1037) > at > java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(AbstractQueuedSynchronizer.java:1328) > at > scala.concurrent.impl.Promise$DefaultPromise.tryAwait(Promise.scala:208) > at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:218) > at > scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223) > at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107) > at > scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53) > at scala.concurrent.Await$.result(package.scala:107) > at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75) > at > org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:101) > at > org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:77) > at > org.apache.spark.scheduler.cluster.YarnSchedulerBackend.doRequestTotalExecutors(YarnSchedulerBackend.scala:59) > at > org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.requestTotalExecutors(CoarseGrainedSchedulerBackend.scala:436) > - locked <0x828a8960> (a > org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend) > at > org.apache.spark.SparkContext.requestTotalExecutors(SparkContext.scala:1438) > at > org.apache.spark.ExecutorAllocationManager.addExecutors(ExecutorAllocationManager.scala:359) > at > org.apache.spark.ExecutorAllocationManager.updateAndSyncNumExecutorsTarget(ExecutorAllocationManager.scala:310) > - locked <0x880e6308> (a > org.apache.spark.ExecutorAllocationManager) > at > org.apache.spark.ExecutorAllocationManager.org$apache$spark$ExecutorAllocationManager$$schedule(ExecutorAllocationManager.scala:264) > - locked <0x880e6308> (a > org.apache.spark.ExecutorAllocationManager) > at > org.apache.spark.ExecutorAllocationManager$$anon$2.run(ExecutorAllocationManager.scala:223) > "SparkListenerBus" #161 daemon prio=5 os_prio=0 tid=0x7fa3053be000 > nid=0xcec9 waiting for monitor entry [0x7fa2b3dfc000] >java.lang.Thread.State: BLOCKED (on object monitor) > at > org.apache.spark.ExecutorAllocationManager$ExecutorAllocationListener.onTaskEnd(ExecutorAllocationManager.scala:618) > - waiting to lock <0x880e6308> (a > org.apache.spark.ExecutorAllocationManager) > at > org.apache.spark.scheduler.SparkListenerBus$class.onPostEvent(SparkListenerBus.scala:42) > at > org.apache.spark.scheduler.LiveListenerBus.onPostEvent(LiveListenerBus.scala:31) > at > org.apache.spark.scheduler.LiveListenerBus.onPostEvent(LiveListenerBus.scala:31) > at > org.apache.spark.util.ListenerBus$class.postToAll(ListenerBus.scala:55) > at > org.apache.spark.util.AsynchronousListenerBus.postToAll(AsynchronousListenerBus.scala:37) > at > org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(AsynchronousListenerBus.scala:80) > at > org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(AsynchronousListenerBus.scala:65) > at > org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(AsynchronousListenerBus.scala:65) > at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57) > at >
[jira] [Commented] (SPARK-19451) Long values in Window function
[ https://issues.apache.org/jira/browse/SPARK-19451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15853773#comment-15853773 ] Genmao Yu commented on SPARK-19451: --- [~jchamp] I have taken a fast look through the code, and did not find any strong point to set the type of index as Int. In the window function, there is really a underlying integer overflow issue. I have make a pull request (https://github.com/apache/spark/pull/16818), and any suggestion is appreciated. > Long values in Window function > -- > > Key: SPARK-19451 > URL: https://issues.apache.org/jira/browse/SPARK-19451 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.6.1, 2.0.2 >Reporter: Julien Champ > > Hi there, > there seems to be a major limitation in spark window functions and > rangeBetween method. > If I have the following code : > {code:title=Exemple |borderStyle=solid} > val tw = Window.orderBy("date") > .partitionBy("id") > .rangeBetween( from , 0) > {code} > Everything seems ok, while *from* value is not too large... Even if the > rangeBetween() method supports Long parameters. > But If i set *-216000L* value to *from* it does not work ! > It is probably related to this part of code in the between() method, of the > WindowSpec class, called by rangeBetween() > {code:title=between() method|borderStyle=solid} > val boundaryStart = start match { > case 0 => CurrentRow > case Long.MinValue => UnboundedPreceding > case x if x < 0 => ValuePreceding(-start.toInt) > case x if x > 0 => ValueFollowing(start.toInt) > } > {code} > ( look at this *.toInt* ) > Does anybody know it there's a way to solve / patch this behavior ? > Any help will be appreciated > Thx -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-19451) Long values in Window function
[ https://issues.apache.org/jira/browse/SPARK-19451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15853773#comment-15853773 ] Genmao Yu edited comment on SPARK-19451 at 2/6/17 9:58 AM: --- [~jchamp] I have taken a fast look through the code, and did not find any strong point to set the type of index as Int. In the window function, there is really a underlying integer overflow issue. I made a pull request (https://github.com/apache/spark/pull/16818), and any suggestion is appreciated. was (Author: unclegen): [~jchamp] I have taken a fast look through the code, and did not find any strong point to set the type of index as Int. In the window function, there is really a underlying integer overflow issue. I have make a pull request (https://github.com/apache/spark/pull/16818), and any suggestion is appreciated. > Long values in Window function > -- > > Key: SPARK-19451 > URL: https://issues.apache.org/jira/browse/SPARK-19451 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.6.1, 2.0.2 >Reporter: Julien Champ > > Hi there, > there seems to be a major limitation in spark window functions and > rangeBetween method. > If I have the following code : > {code:title=Exemple |borderStyle=solid} > val tw = Window.orderBy("date") > .partitionBy("id") > .rangeBetween( from , 0) > {code} > Everything seems ok, while *from* value is not too large... Even if the > rangeBetween() method supports Long parameters. > But If i set *-216000L* value to *from* it does not work ! > It is probably related to this part of code in the between() method, of the > WindowSpec class, called by rangeBetween() > {code:title=between() method|borderStyle=solid} > val boundaryStart = start match { > case 0 => CurrentRow > case Long.MinValue => UnboundedPreceding > case x if x < 0 => ValuePreceding(-start.toInt) > case x if x > 0 => ValueFollowing(start.toInt) > } > {code} > ( look at this *.toInt* ) > Does anybody know it there's a way to solve / patch this behavior ? > Any help will be appreciated > Thx -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-19451) Long values in Window function
[ https://issues.apache.org/jira/browse/SPARK-19451?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19451: Assignee: (was: Apache Spark) > Long values in Window function > -- > > Key: SPARK-19451 > URL: https://issues.apache.org/jira/browse/SPARK-19451 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.6.1, 2.0.2 >Reporter: Julien Champ > > Hi there, > there seems to be a major limitation in spark window functions and > rangeBetween method. > If I have the following code : > {code:title=Exemple |borderStyle=solid} > val tw = Window.orderBy("date") > .partitionBy("id") > .rangeBetween( from , 0) > {code} > Everything seems ok, while *from* value is not too large... Even if the > rangeBetween() method supports Long parameters. > But If i set *-216000L* value to *from* it does not work ! > It is probably related to this part of code in the between() method, of the > WindowSpec class, called by rangeBetween() > {code:title=between() method|borderStyle=solid} > val boundaryStart = start match { > case 0 => CurrentRow > case Long.MinValue => UnboundedPreceding > case x if x < 0 => ValuePreceding(-start.toInt) > case x if x > 0 => ValueFollowing(start.toInt) > } > {code} > ( look at this *.toInt* ) > Does anybody know it there's a way to solve / patch this behavior ? > Any help will be appreciated > Thx -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19451) Long values in Window function
[ https://issues.apache.org/jira/browse/SPARK-19451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15853763#comment-15853763 ] Apache Spark commented on SPARK-19451: -- User 'uncleGen' has created a pull request for this issue: https://github.com/apache/spark/pull/16818 > Long values in Window function > -- > > Key: SPARK-19451 > URL: https://issues.apache.org/jira/browse/SPARK-19451 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.6.1, 2.0.2 >Reporter: Julien Champ > > Hi there, > there seems to be a major limitation in spark window functions and > rangeBetween method. > If I have the following code : > {code:title=Exemple |borderStyle=solid} > val tw = Window.orderBy("date") > .partitionBy("id") > .rangeBetween( from , 0) > {code} > Everything seems ok, while *from* value is not too large... Even if the > rangeBetween() method supports Long parameters. > But If i set *-216000L* value to *from* it does not work ! > It is probably related to this part of code in the between() method, of the > WindowSpec class, called by rangeBetween() > {code:title=between() method|borderStyle=solid} > val boundaryStart = start match { > case 0 => CurrentRow > case Long.MinValue => UnboundedPreceding > case x if x < 0 => ValuePreceding(-start.toInt) > case x if x > 0 => ValueFollowing(start.toInt) > } > {code} > ( look at this *.toInt* ) > Does anybody know it there's a way to solve / patch this behavior ? > Any help will be appreciated > Thx -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-19451) Long values in Window function
[ https://issues.apache.org/jira/browse/SPARK-19451?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19451: Assignee: Apache Spark > Long values in Window function > -- > > Key: SPARK-19451 > URL: https://issues.apache.org/jira/browse/SPARK-19451 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.6.1, 2.0.2 >Reporter: Julien Champ >Assignee: Apache Spark > > Hi there, > there seems to be a major limitation in spark window functions and > rangeBetween method. > If I have the following code : > {code:title=Exemple |borderStyle=solid} > val tw = Window.orderBy("date") > .partitionBy("id") > .rangeBetween( from , 0) > {code} > Everything seems ok, while *from* value is not too large... Even if the > rangeBetween() method supports Long parameters. > But If i set *-216000L* value to *from* it does not work ! > It is probably related to this part of code in the between() method, of the > WindowSpec class, called by rangeBetween() > {code:title=between() method|borderStyle=solid} > val boundaryStart = start match { > case 0 => CurrentRow > case Long.MinValue => UnboundedPreceding > case x if x < 0 => ValuePreceding(-start.toInt) > case x if x > 0 => ValueFollowing(start.toInt) > } > {code} > ( look at this *.toInt* ) > Does anybody know it there's a way to solve / patch this behavior ? > Any help will be appreciated > Thx -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17910) Allow users to update the comment of a column
[ https://issues.apache.org/jira/browse/SPARK-17910?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15853748#comment-15853748 ] Xiaochen Ouyang commented on SPARK-17910: - Hey,I wonder that do we have a plan to support changing column's dataType and name later ? Thanks! [~yhuai] [~jiangxb] > Allow users to update the comment of a column > - > > Key: SPARK-17910 > URL: https://issues.apache.org/jira/browse/SPARK-17910 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Yin Huai >Assignee: Jiang Xingbo > Fix For: 2.2.0 > > > Right now, once a user set the comment of a column with create table command, > he/she cannot update the comment. It will be useful to provide a public > interface (e.g. SQL) to do that. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19451) Long values in Window function
[ https://issues.apache.org/jira/browse/SPARK-19451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15853685#comment-15853685 ] Julien Champ commented on SPARK-19451: -- Thanks [~uncleGen] for your answer. I was tempted to try to fix it myself.. But not sure of possible side effects. If I can help you in any way ( code / debug / tests ), feel free to ask. P.S. : I think this affects all version of Spark with Window functions, but I have only tested it on spark 1.6.1 and 2.0.2 > Long values in Window function > -- > > Key: SPARK-19451 > URL: https://issues.apache.org/jira/browse/SPARK-19451 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.6.1, 2.0.2 >Reporter: Julien Champ > > Hi there, > there seems to be a major limitation in spark window functions and > rangeBetween method. > If I have the following code : > {code:title=Exemple |borderStyle=solid} > val tw = Window.orderBy("date") > .partitionBy("id") > .rangeBetween( from , 0) > {code} > Everything seems ok, while *from* value is not too large... Even if the > rangeBetween() method supports Long parameters. > But If i set *-216000L* value to *from* it does not work ! > It is probably related to this part of code in the between() method, of the > WindowSpec class, called by rangeBetween() > {code:title=between() method|borderStyle=solid} > val boundaryStart = start match { > case 0 => CurrentRow > case Long.MinValue => UnboundedPreceding > case x if x < 0 => ValuePreceding(-start.toInt) > case x if x > 0 => ValueFollowing(start.toInt) > } > {code} > ( look at this *.toInt* ) > Does anybody know it there's a way to solve / patch this behavior ? > Any help will be appreciated > Thx -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17213) Parquet String Pushdown for Non-Eq Comparisons Broken
[ https://issues.apache.org/jira/browse/SPARK-17213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15853673#comment-15853673 ] Apache Spark commented on SPARK-17213: -- User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/16817 > Parquet String Pushdown for Non-Eq Comparisons Broken > - > > Key: SPARK-17213 > URL: https://issues.apache.org/jira/browse/SPARK-17213 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Andrew Duffy >Assignee: Cheng Lian > Fix For: 2.1.0 > > > Spark defines ordering over strings based on comparison of UTF8 byte arrays, > which compare bytes as unsigned integers. Currently however Parquet does not > respect this ordering. This is currently in the process of being fixed in > Parquet, JIRA and PR link below, but currently all filters are broken over > strings, with there actually being a correctness issue for {{>}} and {{<}}. > *Repro:* > Querying directly from in-memory DataFrame: > {code} > > Seq("a", "é").toDF("name").where("name > 'a'").count > 1 > {code} > Querying from a parquet dataset: > {code} > > Seq("a", "é").toDF("name").write.parquet("/tmp/bad") > > spark.read.parquet("/tmp/bad").where("name > 'a'").count > 0 > {code} > This happens because Spark sorts the rows to be {{[a, é]}}, but Parquet's > implementation of comparison of strings is based on signed byte array > comparison, so it will actually create 1 row group with statistics > {{min=é,max=a}}, and so the row group will be dropped by the query. > Based on the way Parquet pushes down Eq, it will not be affecting correctness > but it will force you to read row groups you should be able to skip. > Link to PARQUET issue: https://issues.apache.org/jira/browse/PARQUET-686 > Link to PR: https://github.com/apache/parquet-mr/pull/362 -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org