[jira] [Created] (SPARK-19487) Low latency execution for Spark

2017-02-06 Thread Shivaram Venkataraman (JIRA)
Shivaram Venkataraman created SPARK-19487:
-

 Summary: Low latency execution for Spark
 Key: SPARK-19487
 URL: https://issues.apache.org/jira/browse/SPARK-19487
 Project: Spark
  Issue Type: Umbrella
  Components: ML, Scheduler, Structured Streaming
Affects Versions: 2.1.0
Reporter: Shivaram Venkataraman


This JIRA tracks the design discussion for supporting low latency execution in 
Apache Spark. The motivation for this comes from the need to support lower-latency 
stream processing and lower-latency iterations for sparse ML workloads.

Overview of proposed design (in the format of Spark Improvement Proposal) is at 
https://docs.google.com/document/d/1m_q83DjQcWQonEz4IsRUHu4QSjcDyqRpl29qE4LJc4s/edit?usp=sharing

Source code prototype is at: https://github.com/amplab/drizzle-spark

Let's use this JIRA to discuss the high-level design; we can create subtasks as 
we break this down into smaller PRs.

This is joint work with [~kayousterhout]






[jira] [Resolved] (SPARK-18967) Locality preferences should be used when scheduling even when delay scheduling is turned off

2017-02-06 Thread Kay Ousterhout (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18967?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kay Ousterhout resolved SPARK-18967.

   Resolution: Fixed
Fix Version/s: 2.2

> Locality preferences should be used when scheduling even when delay 
> scheduling is turned off
> 
>
> Key: SPARK-18967
> URL: https://issues.apache.org/jira/browse/SPARK-18967
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 2.1.0
>Reporter: Imran Rashid
>Assignee: Imran Rashid
> Fix For: 2.2
>
>
> If you turn delay scheduling off by setting {{spark.locality.wait=0}}, you 
> effectively turn off the use of locality preferences when there is a bulk 
> scheduling event.  {{TaskSchedulerImpl}} will offer resources in whatever 
> random order it decides to shuffle them into, rather than taking advantage of 
> the most local options.
> This happens because {{TaskSchedulerImpl}} offers resources to a 
> {{TaskSetManager}} one at a time, each time subject to a maxLocality 
> constraint.  However, that constraint doesn't move through all possible 
> locality levels -- it uses [{{tsm.myLocalityLevels}} 
> |https://github.com/apache/spark/blob/1a64388973711b4e567f25fa33d752066a018b49/core/src/main/scala/org/apache/spark/scheduler/TaskSchedulerImpl.scala#L360].
>   And {{tsm.myLocalityLevels}} [skips locality levels completely if the wait 
> == 0 | 
> https://github.com/apache/spark/blob/1a64388973711b4e567f25fa33d752066a018b49/core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala#L953].
>   So with delay scheduling off, {{TaskSchedulerImpl}} immediately jumps to 
> giving tsms the offers with {{maxLocality = ANY}}.
> *WORKAROUND*: instead of setting {{spark.locality.wait=0}}, use 
> {{spark.locality.wait=1ms}}.  The one downside of this is if you have tasks 
> that actually take less than 1ms.  You could even run into SPARK-18886.  But 
> that is a relatively unlikely scenario for real workloads.
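For illustration, a minimal sketch of the workaround above, assuming a plain 
SparkConf-based application (only the 1ms value comes from the description; the 
rest is generic setup):

{code}
import org.apache.spark.{SparkConf, SparkContext}

// Workaround sketch: keep a tiny non-zero locality wait instead of 0, so
// TaskSetManager still steps through all of its locality levels.
val conf = new SparkConf()
  .setAppName("locality-wait-workaround")
  .set("spark.locality.wait", "1ms") // instead of "0"

val sc = new SparkContext(conf)
{code}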






[jira] [Commented] (SPARK-16441) Spark application hang when dynamic allocation is enabled

2017-02-06 Thread Yuming Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16441?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15855414#comment-15855414
 ] 

Yuming Wang commented on SPARK-16441:
-

[~cenyuhai], this is not fixed in [2.1.0|https://github.com/apache/spark/tree/v2.1.0]; 
it has been fixed on master and branch-2.1. My 
[PR|https://github.com/apache/spark/pull/16819] is meant to reduce the friction.

> Spark application hang when dynamic allocation is enabled
> -
>
> Key: SPARK-16441
> URL: https://issues.apache.org/jira/browse/SPARK-16441
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.2, 2.0.0, 2.1.0
> Environment: hadoop 2.7.2  spark1.6.2
>Reporter: cen yuhai
> Attachments: SPARK-16441-compare-apply-PR-16819.zip, 
> SPARK-16441-stage.jpg, SPARK-16441-threadDump.jpg, 
> SPARK-16441-yarn-metrics.jpg
>
>
> The Spark application keeps waiting for an RPC response while the Spark 
> listener is blocked by dynamic allocation. Executors cannot connect to the 
> driver and are lost.
> "spark-dynamic-executor-allocation" #239 daemon prio=5 os_prio=0 
> tid=0x7fa304438000 nid=0xcec6 waiting on condition [0x7fa2b81e4000]
>java.lang.Thread.State: TIMED_WAITING (parking)
>   at sun.misc.Unsafe.park(Native Method)
>   - parking to wait for  <0x00070fdb94f8> (a 
> scala.concurrent.impl.Promise$CompletionLatch)
>   at 
> java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
>   at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedNanos(AbstractQueuedSynchronizer.java:1037)
>   at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(AbstractQueuedSynchronizer.java:1328)
>   at 
> scala.concurrent.impl.Promise$DefaultPromise.tryAwait(Promise.scala:208)
>   at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:218)
>   at 
> scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
>   at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
>   at 
> scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
>   at scala.concurrent.Await$.result(package.scala:107)
>   at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
>   at 
> org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:101)
>   at 
> org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:77)
>   at 
> org.apache.spark.scheduler.cluster.YarnSchedulerBackend.doRequestTotalExecutors(YarnSchedulerBackend.scala:59)
>   at 
> org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.requestTotalExecutors(CoarseGrainedSchedulerBackend.scala:436)
>   - locked <0x828a8960> (a 
> org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend)
>   at 
> org.apache.spark.SparkContext.requestTotalExecutors(SparkContext.scala:1438)
>   at 
> org.apache.spark.ExecutorAllocationManager.addExecutors(ExecutorAllocationManager.scala:359)
>   at 
> org.apache.spark.ExecutorAllocationManager.updateAndSyncNumExecutorsTarget(ExecutorAllocationManager.scala:310)
>   - locked <0x880e6308> (a 
> org.apache.spark.ExecutorAllocationManager)
>   at 
> org.apache.spark.ExecutorAllocationManager.org$apache$spark$ExecutorAllocationManager$$schedule(ExecutorAllocationManager.scala:264)
>   - locked <0x880e6308> (a 
> org.apache.spark.ExecutorAllocationManager)
>   at 
> org.apache.spark.ExecutorAllocationManager$$anon$2.run(ExecutorAllocationManager.scala:223)
> "SparkListenerBus" #161 daemon prio=5 os_prio=0 tid=0x7fa3053be000 
> nid=0xcec9 waiting for monitor entry [0x7fa2b3dfc000]
>java.lang.Thread.State: BLOCKED (on object monitor)
>   at 
> org.apache.spark.ExecutorAllocationManager$ExecutorAllocationListener.onTaskEnd(ExecutorAllocationManager.scala:618)
>   - waiting to lock <0x880e6308> (a 
> org.apache.spark.ExecutorAllocationManager)
>   at 
> org.apache.spark.scheduler.SparkListenerBus$class.onPostEvent(SparkListenerBus.scala:42)
>   at 
> org.apache.spark.scheduler.LiveListenerBus.onPostEvent(LiveListenerBus.scala:31)
>   at 
> org.apache.spark.scheduler.LiveListenerBus.onPostEvent(LiveListenerBus.scala:31)
>   at 
> org.apache.spark.util.ListenerBus$class.postToAll(ListenerBus.scala:55)
>   at 
> org.apache.spark.util.AsynchronousListenerBus.postToAll(AsynchronousListenerBus.scala:37)
>   at 
> org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(AsynchronousListenerBus.scala:80)
>   at 
> org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(AsynchronousListenerBus.scala:65)
>   at 
> 

[jira] [Commented] (SPARK-19484) continue work to create a table with an empty schema

2017-02-06 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19484?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15855402#comment-15855402
 ] 

Apache Spark commented on SPARK-19484:
--

User 'windpiger' has created a pull request for this issue:
https://github.com/apache/spark/pull/16828

> continue work to create a table with an empty schema
> 
>
> Key: SPARK-19484
> URL: https://issues.apache.org/jira/browse/SPARK-19484
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Song Jun
>Priority: Minor
>
> After SPARK-19279, we can no longer create a Hive table with an empty schema,
> so we should tighten the condition used when creating a Hive table in 
> https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala#L835
> That is, if a CatalogTable t has an empty schema and there is no 
> `spark.sql.schema.numParts` property (or its value is 0), we should not add a 
> default `col` schema; if we did, a table with an empty schema would be 
> created, which is not what we expect.






[jira] [Assigned] (SPARK-19484) continue work to create a table with an empty schema

2017-02-06 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19484?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19484:


Assignee: Apache Spark

> continue work to create a table with an empty schema
> 
>
> Key: SPARK-19484
> URL: https://issues.apache.org/jira/browse/SPARK-19484
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Song Jun
>Assignee: Apache Spark
>Priority: Minor
>
> After SPARK-19279, we can no longer create a Hive table with an empty schema,
> so we should tighten the condition used when creating a Hive table in 
> https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala#L835
> That is, if a CatalogTable t has an empty schema and there is no 
> `spark.sql.schema.numParts` property (or its value is 0), we should not add a 
> default `col` schema; if we did, a table with an empty schema would be 
> created, which is not what we expect.






[jira] [Assigned] (SPARK-19484) continue work to create a table with an empty schema

2017-02-06 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19484?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19484:


Assignee: (was: Apache Spark)

> continue work to create a table with an empty schema
> 
>
> Key: SPARK-19484
> URL: https://issues.apache.org/jira/browse/SPARK-19484
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Song Jun
>Priority: Minor
>
> After SPARK-19279, we can no longer create a Hive table with an empty schema,
> so we should tighten the condition used when creating a Hive table in 
> https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala#L835
> That is, if a CatalogTable t has an empty schema and there is no 
> `spark.sql.schema.numParts` property (or its value is 0), we should not add a 
> default `col` schema; if we did, a table with an empty schema would be 
> created, which is not what we expect.






[jira] [Updated] (SPARK-16441) Spark application hang when dynamic allocation is enabled

2017-02-06 Thread Yuming Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16441?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-16441:

Attachment: SPARK-16441-compare-apply-PR-16819.zip

> Spark application hang when dynamic allocation is enabled
> -
>
> Key: SPARK-16441
> URL: https://issues.apache.org/jira/browse/SPARK-16441
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.2, 2.0.0, 2.1.0
> Environment: hadoop 2.7.2  spark1.6.2
>Reporter: cen yuhai
> Attachments: SPARK-16441-compare-apply-PR-16819.zip, 
> SPARK-16441-stage.jpg, SPARK-16441-threadDump.jpg, 
> SPARK-16441-yarn-metrics.jpg
>
>
> The Spark application keeps waiting for an RPC response while the Spark 
> listener is blocked by dynamic allocation. Executors cannot connect to the 
> driver and are lost.
> "spark-dynamic-executor-allocation" #239 daemon prio=5 os_prio=0 
> tid=0x7fa304438000 nid=0xcec6 waiting on condition [0x7fa2b81e4000]
>java.lang.Thread.State: TIMED_WAITING (parking)
>   at sun.misc.Unsafe.park(Native Method)
>   - parking to wait for  <0x00070fdb94f8> (a 
> scala.concurrent.impl.Promise$CompletionLatch)
>   at 
> java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
>   at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedNanos(AbstractQueuedSynchronizer.java:1037)
>   at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(AbstractQueuedSynchronizer.java:1328)
>   at 
> scala.concurrent.impl.Promise$DefaultPromise.tryAwait(Promise.scala:208)
>   at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:218)
>   at 
> scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
>   at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
>   at 
> scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
>   at scala.concurrent.Await$.result(package.scala:107)
>   at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
>   at 
> org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:101)
>   at 
> org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:77)
>   at 
> org.apache.spark.scheduler.cluster.YarnSchedulerBackend.doRequestTotalExecutors(YarnSchedulerBackend.scala:59)
>   at 
> org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.requestTotalExecutors(CoarseGrainedSchedulerBackend.scala:436)
>   - locked <0x828a8960> (a 
> org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend)
>   at 
> org.apache.spark.SparkContext.requestTotalExecutors(SparkContext.scala:1438)
>   at 
> org.apache.spark.ExecutorAllocationManager.addExecutors(ExecutorAllocationManager.scala:359)
>   at 
> org.apache.spark.ExecutorAllocationManager.updateAndSyncNumExecutorsTarget(ExecutorAllocationManager.scala:310)
>   - locked <0x880e6308> (a 
> org.apache.spark.ExecutorAllocationManager)
>   at 
> org.apache.spark.ExecutorAllocationManager.org$apache$spark$ExecutorAllocationManager$$schedule(ExecutorAllocationManager.scala:264)
>   - locked <0x880e6308> (a 
> org.apache.spark.ExecutorAllocationManager)
>   at 
> org.apache.spark.ExecutorAllocationManager$$anon$2.run(ExecutorAllocationManager.scala:223)
> "SparkListenerBus" #161 daemon prio=5 os_prio=0 tid=0x7fa3053be000 
> nid=0xcec9 waiting for monitor entry [0x7fa2b3dfc000]
>java.lang.Thread.State: BLOCKED (on object monitor)
>   at 
> org.apache.spark.ExecutorAllocationManager$ExecutorAllocationListener.onTaskEnd(ExecutorAllocationManager.scala:618)
>   - waiting to lock <0x880e6308> (a 
> org.apache.spark.ExecutorAllocationManager)
>   at 
> org.apache.spark.scheduler.SparkListenerBus$class.onPostEvent(SparkListenerBus.scala:42)
>   at 
> org.apache.spark.scheduler.LiveListenerBus.onPostEvent(LiveListenerBus.scala:31)
>   at 
> org.apache.spark.scheduler.LiveListenerBus.onPostEvent(LiveListenerBus.scala:31)
>   at 
> org.apache.spark.util.ListenerBus$class.postToAll(ListenerBus.scala:55)
>   at 
> org.apache.spark.util.AsynchronousListenerBus.postToAll(AsynchronousListenerBus.scala:37)
>   at 
> org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(AsynchronousListenerBus.scala:80)
>   at 
> org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(AsynchronousListenerBus.scala:65)
>   at 
> org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(AsynchronousListenerBus.scala:65)
>   at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57)

[jira] [Created] (SPARK-19486) Investigate using multiple threads for task serialization

2017-02-06 Thread Shivaram Venkataraman (JIRA)
Shivaram Venkataraman created SPARK-19486:
-

 Summary: Investigate using multiple threads for task serialization
 Key: SPARK-19486
 URL: https://issues.apache.org/jira/browse/SPARK-19486
 Project: Spark
  Issue Type: Improvement
  Components: Scheduler
Affects Versions: 2.1.0
Reporter: Shivaram Venkataraman


This is related to SPARK-18890, where all the serialization logic is moved into 
the scheduler backend thread. As a follow-on, we can investigate using a thread 
pool to serialize a batch of tasks together instead of serializing all of them 
on a single thread.

Note that this may not yield much benefit unless the driver has enough cores 
and we don't run into contention across threads. We can first measure the 
potential benefits and, if they are sufficient, create a PR for this.
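A minimal sketch of the idea being investigated, assuming task payloads can be 
serialized independently; this is illustrative and not Spark's scheduler code:

{code}
import java.util.concurrent.Executors
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration._
import org.apache.spark.SparkConf
import org.apache.spark.serializer.JavaSerializer

// Serialize a batch of task payloads on a small thread pool instead of a
// single thread. Each closure gets its own SerializerInstance, since
// instances are not guaranteed to be thread-safe.
val pool = Executors.newFixedThreadPool(4)
implicit val ec: ExecutionContext = ExecutionContext.fromExecutor(pool)

val javaSer = new JavaSerializer(new SparkConf())
val taskPayloads: Seq[String] = (1 to 1000).map(i => s"task-$i")

val serialized = Future.traverse(taskPayloads) { payload =>
  Future(javaSer.newInstance().serialize(payload))
}
val buffers = Await.result(serialized, 1.minute)
pool.shutdown()
{code}

Whether this pays off depends on the driver cores and contention mentioned above.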

cc [~kayousterhout]






[jira] [Created] (SPARK-19485) Launch tasks async, i.e. don't wait for the network

2017-02-06 Thread Shivaram Venkataraman (JIRA)
Shivaram Venkataraman created SPARK-19485:
-

 Summary: Launch tasks async, i.e. don't wait for the network
 Key: SPARK-19485
 URL: https://issues.apache.org/jira/browse/SPARK-19485
 Project: Spark
  Issue Type: Improvement
  Components: Scheduler
Affects Versions: 2.1.0
Reporter: Shivaram Venkataraman


Currently the scheduling thread in CoarseGrainedSchedulerBackend is used both 
to walk through the list of offers and to serialize tasks, create RPCs, and 
send messages over the network.

For stages with a large number of tasks we can avoid blocking on RPCs / 
serialization by moving that work to a separate thread in CGSB. As part of this 
JIRA we can first investigate the potential benefits of doing this for 
different kinds of jobs (one large stage, many independent small stages, etc.) 
and then propose a code change.
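A hedged sketch of the separation being proposed; LaunchTask and sendToExecutor 
are hypothetical stand-ins for the real CGSB message and RPC call:

{code}
import java.util.concurrent.Executors

// Hypothetical message and send hook standing in for the scheduler backend's
// RPC layer.
case class LaunchTask(executorId: String, serializedTask: Array[Byte])
def sendToExecutor(msg: LaunchTask): Unit = { /* stand-in for the RPC send */ }

// The scheduling thread only decides placements; serialization/RPC work is
// handed to a dedicated thread so offer processing never blocks on the network.
val rpcSendThread = Executors.newSingleThreadExecutor()

def launchTasks(assignments: Seq[(String, Array[Byte])]): Unit = {
  assignments.foreach { case (executorId, taskBytes) =>
    rpcSendThread.submit(new Runnable {
      override def run(): Unit = sendToExecutor(LaunchTask(executorId, taskBytes))
    })
  }
}
{code}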

cc [~kayousterhout]






[jira] [Created] (SPARK-19484) continue work to create a table with an empty schema

2017-02-06 Thread Song Jun (JIRA)
Song Jun created SPARK-19484:


 Summary: continue work to create a table with an empty schema
 Key: SPARK-19484
 URL: https://issues.apache.org/jira/browse/SPARK-19484
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.2.0
Reporter: Song Jun
Priority: Minor


After SPARK-19279, we can no longer create a Hive table with an empty schema, 
so we should tighten the condition used when creating a Hive table in 

https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala#L835

That is, if a CatalogTable t has an empty schema and there is no 
`spark.sql.schema.numParts` property (or its value is 0), we should not add a 
default `col` schema; if we did, a table with an empty schema would be created, 
which is not what we expect.
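A hedged sketch of the kind of guard being described (this is not the actual 
HiveClientImpl code; the helper name is made up):

{code}
import org.apache.spark.sql.catalyst.catalog.CatalogTable

// Only fall back to the dummy `col` schema when the real schema is stored in
// the table properties (spark.sql.schema.numParts present and non-zero);
// otherwise do not add it, so we never silently create an empty-schema table.
def shouldAddDefaultColSchema(table: CatalogTable): Boolean = {
  val schemaStoredInProps = table.properties
    .get("spark.sql.schema.numParts")
    .exists(n => scala.util.Try(n.toInt).getOrElse(0) > 0)
  table.schema.isEmpty && schemaStoredInProps
}
{code}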






[jira] [Resolved] (SPARK-19407) defaultFS is used via FileSystem.get instead of getting it from the URI scheme

2017-02-06 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19407?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu resolved SPARK-19407.
--
   Resolution: Fixed
 Assignee: Genmao Yu
Fix Version/s: 2.2.0
   2.1.1

> defaultFS is used via FileSystem.get instead of getting it from the URI scheme
> --
>
> Key: SPARK-19407
> URL: https://issues.apache.org/jira/browse/SPARK-19407
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.1.0
>Reporter: Amit Assudani
>Assignee: Genmao Yu
>  Labels: checkpoint, filesystem, starter, streaming
> Fix For: 2.1.1, 2.2.0
>
>
> Caused by: java.lang.IllegalArgumentException: Wrong FS: 
> s3a://**/checkpoint/7b2231a3-d845-4740-bfa3-681850e5987f/metadata,
>  expected: file:///
>   at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:649)
>   at 
> org.apache.hadoop.fs.RawLocalFileSystem.pathToFile(RawLocalFileSystem.java:82)
>   at 
> org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:606)
>   at 
> org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:824)
>   at 
> org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:601)
>   at 
> org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:421)
>   at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:1426)
>   at 
> org.apache.spark.sql.execution.streaming.StreamMetadata$.read(StreamMetadata.scala:51)
>   at 
> org.apache.spark.sql.execution.streaming.StreamExecution.(StreamExecution.scala:100)
>   at 
> org.apache.spark.sql.streaming.StreamingQueryManager.createQuery(StreamingQueryManager.scala:232)
>   at 
> org.apache.spark.sql.streaming.StreamingQueryManager.startQuery(StreamingQueryManager.scala:269)
>   at 
> org.apache.spark.sql.streaming.DataStreamWriter.start(DataStreamWriter.scala:262)
> This can easily be replicated on a Spark standalone cluster by providing a 
> checkpoint location whose URI scheme is anything other than "file://" and not 
> overriding fs.defaultFS in the config.
> Workaround: pass --conf spark.hadoop.fs.defaultFS=s3a://somebucket, or set it 
> in SparkConf or spark-defaults.conf.
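For reference, a small sketch of the difference at the heart of this bug, using 
the standard Hadoop FileSystem API (the path below is a placeholder):

{code}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

val hadoopConf = new Configuration()
val checkpointPath = new Path("s3a://somebucket/checkpoint/metadata")

// FileSystem.get(conf) resolves against fs.defaultFS (often file:///),
// which is what triggers the "Wrong FS" error above.
val defaultFs = FileSystem.get(hadoopConf)

// path.getFileSystem(conf) resolves the filesystem from the path's own
// URI scheme (s3a:// here), which is the behavior the fix needs.
val schemeFs = checkpointPath.getFileSystem(hadoopConf)
{code}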






[jira] [Updated] (SPARK-19483) Add one RocketMQ plugin for the Apache Spark

2017-02-06 Thread Longda Feng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Longda Feng updated SPARK-19483:

External issue URL: https://issues.apache.org/jira/browse/ROCKETMQ-81
 External issue ID: ROCKETMQ-81

> Add one RocketMQ plugin for the Apache Spark
> 
>
> Key: SPARK-19483
> URL: https://issues.apache.org/jira/browse/SPARK-19483
> Project: Spark
>  Issue Type: Task
>  Components: Input/Output
>Affects Versions: 2.1.0
>Reporter: Longda Feng
>Priority: Minor
>
> Apache RocketMQ® is an open-source distributed messaging and streaming data 
> platform that is used by many companies. Please refer to 
> http://rocketmq.incubator.apache.org/ for more details.
> Since Apache RocketMQ 4.0 will be released in the next few days, we can start 
> adding a RocketMQ plugin for Apache Spark.






[jira] [Created] (SPARK-19483) Add one RocketMQ plugin for the Apache Spark

2017-02-06 Thread Longda Feng (JIRA)
Longda Feng created SPARK-19483:
---

 Summary: Add one RocketMQ plugin for the Apache Spark
 Key: SPARK-19483
 URL: https://issues.apache.org/jira/browse/SPARK-19483
 Project: Spark
  Issue Type: Task
  Components: Input/Output
Affects Versions: 2.1.0
Reporter: Longda Feng
Priority: Minor


Apache RocketMQ® is an open-source distributed messaging and streaming data 
platform that is used by many companies. Please refer to 
http://rocketmq.incubator.apache.org/ for more details.

Since Apache RocketMQ 4.0 will be released in the next few days, we can start 
adding a RocketMQ plugin for Apache Spark.






[jira] [Assigned] (SPARK-19482) Fail it if 'spark.master' is set with different value

2017-02-06 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19482?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19482:


Assignee: (was: Apache Spark)

> Fail it if 'spark.master' is set with different value
> -
>
> Key: SPARK-19482
> URL: https://issues.apache.org/jira/browse/SPARK-19482
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.0.2, 2.1.0
>Reporter: Genmao Yu
>
> First, there is no need to set 'spark.master' multiple times with different 
> values. Second, users may set a 'spark.master' in code that differs from the 
> one passed to the `spark-submit` command, which is confusing. So we should 
> check once whether 'spark.master' already exists in the settings and whether 
> the previous value is the same as the current one, and throw an 
> IllegalArgumentException when the previous value differs from the current 
> value.






[jira] [Assigned] (SPARK-19482) Fail it if 'spark.master' is set with different value

2017-02-06 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19482?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19482:


Assignee: Apache Spark

> Fail it if 'spark.master' is set with different value
> -
>
> Key: SPARK-19482
> URL: https://issues.apache.org/jira/browse/SPARK-19482
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.0.2, 2.1.0
>Reporter: Genmao Yu
>Assignee: Apache Spark
>
> First, there is no need to set 'spark.master' multiple times with different 
> values. Second, users may set a 'spark.master' in code that differs from the 
> one passed to the `spark-submit` command, which is confusing. So we should 
> check once whether 'spark.master' already exists in the settings and whether 
> the previous value is the same as the current one, and throw an 
> IllegalArgumentException when the previous value differs from the current 
> value.






[jira] [Commented] (SPARK-19482) Fail it if 'spark.master' is set with different value

2017-02-06 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15855240#comment-15855240
 ] 

Apache Spark commented on SPARK-19482:
--

User 'uncleGen' has created a pull request for this issue:
https://github.com/apache/spark/pull/16827

> Fail it if 'spark.master' is set with different value
> -
>
> Key: SPARK-19482
> URL: https://issues.apache.org/jira/browse/SPARK-19482
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.0.2, 2.1.0
>Reporter: Genmao Yu
>
> First, there is no need to set 'spark.master' multiple times with different 
> values. Second, users may set a 'spark.master' in code that differs from the 
> one passed to the `spark-submit` command, which is confusing. So we should 
> check once whether 'spark.master' already exists in the settings and whether 
> the previous value is the same as the current one, and throw an 
> IllegalArgumentException when the previous value differs from the current 
> value.






[jira] [Created] (SPARK-19482) Fail it if 'spark.master' is set with different value

2017-02-06 Thread Genmao Yu (JIRA)
Genmao Yu created SPARK-19482:
-

 Summary: Fail it if 'spark.master' is set with different value
 Key: SPARK-19482
 URL: https://issues.apache.org/jira/browse/SPARK-19482
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 2.1.0, 2.0.2
Reporter: Genmao Yu


First, there is no need to set 'spark.master' multiple times with different 
values. Second, users may set a 'spark.master' in code that differs from the one 
passed to the `spark-submit` command, which is confusing. So we should check 
once whether 'spark.master' already exists in the settings and whether the 
previous value is the same as the current one, and throw an 
IllegalArgumentException when the previous value differs from the current value.
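A minimal sketch of the proposed check, written against a plain map of settings 
rather than the real SparkConf internals:

{code}
import scala.collection.mutable

// Reject a second, conflicting value for a key that must stay consistent,
// such as 'spark.master'.
def setChecked(settings: mutable.Map[String, String],
               key: String, value: String): Unit = {
  settings.get(key) match {
    case Some(previous) if previous != value =>
      throw new IllegalArgumentException(
        s"$key was already set to '$previous' and cannot be changed to '$value'")
    case _ =>
      settings(key) = value
  }
}

val settings = mutable.Map("spark.master" -> "yarn")
setChecked(settings, "spark.master", "yarn")        // fine: same value
// setChecked(settings, "spark.master", "local[*]") // throws IllegalArgumentException
{code}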






[jira] [Assigned] (SPARK-19467) PySpark ML shouldn't use circular imports

2017-02-06 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19467?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley reassigned SPARK-19467:
-

Assignee: Maciej Szymkiewicz

> PySpark ML shouldn't use circular imports
> -
>
> Key: SPARK-19467
> URL: https://issues.apache.org/jira/browse/SPARK-19467
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Affects Versions: 2.0.0, 2.1.0, 2.2.0
>Reporter: Maciej Szymkiewicz
>Assignee: Maciej Szymkiewicz
>Priority: Minor
> Fix For: 2.2.0
>
>
> {{pyspark.ml}} and {{pyspark.ml.pipeline}} contain circular imports with the 
> [former 
> one|https://github.com/apache/spark/blob/39f328ba3519b01940a7d1cdee851ba4e75ef31f/python/pyspark/ml/__init__.py#L23]:
> {code}
> from pyspark.ml.pipeline import Pipeline, PipelineModel
> {code}
> and the [latter 
> one|https://github.com/apache/spark/blob/39f328ba3519b01940a7d1cdee851ba4e75ef31f/python/pyspark/ml/pipeline.py#L24]:
> {code}
> from pyspark.ml import Estimator, Model, Transformer
> {code}
> This is unnecessary and can cause failures when working with external tools.






[jira] [Resolved] (SPARK-19467) PySpark ML shouldn't use circular imports

2017-02-06 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19467?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley resolved SPARK-19467.
---
   Resolution: Fixed
Fix Version/s: 2.2.0

Issue resolved by pull request 16814
[https://github.com/apache/spark/pull/16814]

> PySpark ML shouldn't use circular imports
> -
>
> Key: SPARK-19467
> URL: https://issues.apache.org/jira/browse/SPARK-19467
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Affects Versions: 2.0.0, 2.1.0, 2.2.0
>Reporter: Maciej Szymkiewicz
>Assignee: Maciej Szymkiewicz
>Priority: Minor
> Fix For: 2.2.0
>
>
> {{pyspark.ml}} and {{pyspark.ml.pipeline}} contain circular imports with the 
> [former 
> one|https://github.com/apache/spark/blob/39f328ba3519b01940a7d1cdee851ba4e75ef31f/python/pyspark/ml/__init__.py#L23]:
> {code}
> from pyspark.ml.pipeline import Pipeline, PipelineModel
> {code}
> and the [latter 
> one|https://github.com/apache/spark/blob/39f328ba3519b01940a7d1cdee851ba4e75ef31f/python/pyspark/ml/pipeline.py#L24]:
> {code}
> from pyspark.ml import Estimator, Model, Transformer
> {code}
> This is unnecessary and can cause failures when working with external tools.






[jira] [Resolved] (SPARK-19441) Remove IN type coercion from PromoteStrings

2017-02-06 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19441?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-19441.
-
Resolution: Fixed

Issue resolved by pull request 16783
[https://github.com/apache/spark/pull/16783]

> Remove IN type coercion from PromoteStrings
> ---
>
> Key: SPARK-19441
> URL: https://issues.apache.org/jira/browse/SPARK-19441
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Xiao Li
>Assignee: Xiao Li
> Fix For: 2.2.0
>
>
> The removed codes are not reachable, because `InConversion` already resolve 
> the type coercion issues. 






[jira] [Commented] (SPARK-19479) Spark Mesos artifact split causes spark-core dependency to not pull in mesos impl

2017-02-06 Thread Charles Allen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19479?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15855102#comment-15855102
 ] 

Charles Allen commented on SPARK-19479:
---

[~mgummelt] That's actually a really good suggestion. Somehow I never got 
subscribed to the dev list.

> Spark Mesos artifact split causes spark-core dependency to not pull in mesos 
> impl
> -
>
> Key: SPARK-19479
> URL: https://issues.apache.org/jira/browse/SPARK-19479
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos, Spark Core
>Affects Versions: 2.1.0
>Reporter: Charles Allen
>
> https://github.com/apache/spark/pull/14637 ( 
> https://issues.apache.org/jira/browse/SPARK-16967 ) forked off the mesos impl 
> into its own artifact, but the release notes do not call this out. This broke 
> our deployments because we depend on packaging with spark-core, which no 
> longer had any mesos awareness. 






[jira] [Commented] (SPARK-15573) Backwards-compatible persistence for spark.ml

2017-02-06 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15573?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15855097#comment-15855097
 ] 

Joseph K. Bradley commented on SPARK-15573:
---

It's a good point that we can't make updates to older Spark releases for 
persistence.  However, I doubt that we would backport many such fixes for 
non-bugs.  The issue you reference is arguably a scalability limit, not a bug.  
Still, adding an internal ML persistence version is a good idea; I'd be OK with 
it.

> Backwards-compatible persistence for spark.ml
> -
>
> Key: SPARK-15573
> URL: https://issues.apache.org/jira/browse/SPARK-15573
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Joseph K. Bradley
>
> This JIRA is for imposing backwards-compatible persistence for the 
> DataFrames-based API for MLlib.  I.e., we want to be able to load models 
> saved in previous versions of Spark.  We will not require loading models 
> saved in later versions of Spark.
> This requires:
> * Putting unit tests in place to check loading models from previous versions
> * Notifying all committers active on MLlib to be aware of this requirement in 
> the future
> The unit tests could be written as in spark.mllib, where we essentially 
> copied and pasted the save() code every time it changed.  This happens 
> rarely, so it should be acceptable, though other designs are fine.
> Subtasks of this JIRA should cover checking and adding tests for existing 
> cases, such as KMeansModel (whose format changed between 1.6 and 2.0).
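A hedged sketch of such a unit test, using the KMeansModel case mentioned above 
(the fixture path is hypothetical):

{code}
import org.apache.spark.ml.clustering.KMeansModel

// Load a model saved by an older Spark version from a checked-in test fixture
// and verify that its parameters survive the round trip.
val oldModelPath = "src/test/resources/ml-models/kmeans-saved-by-1.6" // hypothetical
val model = KMeansModel.load(oldModelPath)
assert(model.clusterCenters.nonEmpty)
assert(model.getK == model.clusterCenters.length)
{code}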






[jira] [Commented] (SPARK-19208) MultivariateOnlineSummarizer performance optimization

2017-02-06 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15855090#comment-15855090
 ] 

Joseph K. Bradley commented on SPARK-19208:
---

You're right that sharing intermediate results will be necessary.

I'm happy with [~mlnick]'s VectorSummarizer API.  I also think that, if we 
wanted to use the API I suggested above, the version returning a single struct 
col would work: {{df.select(VectorSummary.summary("features", "weights"))}}.  
The new column could be constructed from intermediate columns which would not 
show up in the final output.  (Is this essentially the "private UDAF" 
[~podongfeng] is mentioning above?)

I'm OK either way.

> MultivariateOnlineSummarizer performance optimization
> -
>
> Key: SPARK-19208
> URL: https://issues.apache.org/jira/browse/SPARK-19208
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: zhengruifeng
> Attachments: Tests.pdf, WechatIMG2621.jpeg
>
>
> Currently, {{MaxAbsScaler}} and {{MinMaxScaler}} use 
> {{MultivariateOnlineSummarizer}} to compute the min/max.
> However, {{MultivariateOnlineSummarizer}} also computes extra, unused 
> statistics. This slows down the task and makes it more prone to cause OOM.
> For example:
> env : --driver-memory 4G --executor-memory 1G --num-executors 4
> data: 
> [http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html#kdd2010%20(bridge%20to%20algebra)]
>  748401 instances,   and 29,890,095 features
> {{MaxAbsScaler.fit}} fails because of OOM
> {{MultivariateOnlineSummarizer}} maintains 8 arrays:
> {code}
> private var currMean: Array[Double] = _
>   private var currM2n: Array[Double] = _
>   private var currM2: Array[Double] = _
>   private var currL1: Array[Double] = _
>   private var totalCnt: Long = 0
>   private var totalWeightSum: Double = 0.0
>   private var weightSquareSum: Double = 0.0
>   private var weightSum: Array[Double] = _
>   private var nnz: Array[Long] = _
>   private var currMax: Array[Double] = _
>   private var currMin: Array[Double] = _
> {code}
> For {{MaxAbsScaler}}, only 1 array is needed (max of abs value).
> For {{MinMaxScaler}}, only 3 arrays are needed (max, min, nnz).
> After the modification in the PR, the above example runs successfully.
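A hedged sketch of the optimization idea for the {{MaxAbsScaler}} case (not the 
actual PR code):

{code}
import org.apache.spark.ml.linalg.Vector

// Track only the statistic MaxAbsScaler needs (the per-feature max absolute
// value) instead of the full set of arrays kept by MultivariateOnlineSummarizer.
class MaxAbsSummarizer(numFeatures: Int) extends Serializable {
  val maxAbs: Array[Double] = Array.fill(numFeatures)(0.0)

  def add(v: Vector): this.type = {
    v.foreachActive { (i, value) =>
      val a = math.abs(value)
      if (a > maxAbs(i)) maxAbs(i) = a
    }
    this
  }

  def merge(other: MaxAbsSummarizer): this.type = {
    var i = 0
    while (i < numFeatures) {
      if (other.maxAbs(i) > maxAbs(i)) maxAbs(i) = other.maxAbs(i)
      i += 1
    }
    this
  }
}
{code}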






[jira] [Commented] (SPARK-19479) Spark Mesos artifact split causes spark-core dependency to not pull in mesos impl

2017-02-06 Thread Michael Gummelt (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19479?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15855091#comment-15855091
 ] 

Michael Gummelt commented on SPARK-19479:
-

Yea, sorry for the inconvenience, but I announced this on the dev list.  Search 
for "Mesos is now a maven module".  If I were you, I would create an email 
filter for "Mesos" on the user/dev lists.  This is what I do.

> Spark Mesos artifact split causes spark-core dependency to not pull in mesos 
> impl
> -
>
> Key: SPARK-19479
> URL: https://issues.apache.org/jira/browse/SPARK-19479
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos, Spark Core
>Affects Versions: 2.1.0
>Reporter: Charles Allen
>
> https://github.com/apache/spark/pull/14637 ( 
> https://issues.apache.org/jira/browse/SPARK-16967 ) forked off the mesos impl 
> into its own artifact, but the release notes do not call this out. This broke 
> our deployments because we depend on packaging with spark-core, which no 
> longer had any mesos awareness. 






[jira] [Commented] (SPARK-12157) Support numpy types as return values of Python UDFs

2017-02-06 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15855060#comment-15855060
 ] 

Joseph K. Bradley commented on SPARK-12157:
---

I don't know of any Python UDF perf tests.  Ad hoc tests could suffice for 
now...

> Support numpy types as return values of Python UDFs
> ---
>
> Key: SPARK-12157
> URL: https://issues.apache.org/jira/browse/SPARK-12157
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Affects Versions: 1.5.2
>Reporter: Justin Uang
>
> Currently, if I have a python UDF
> {code}
> import pyspark.sql.types as T
> import pyspark.sql.functions as F
> from pyspark.sql import Row
> import numpy as np
> argmax = F.udf(lambda x: np.argmax(x), T.IntegerType())
> df = sqlContext.createDataFrame([Row(array=[1,2,3])])
> df.select(argmax("array")).count()
> {code}
> I get an exception that is fairly opaque:
> {code}
> Caused by: net.razorvine.pickle.PickleException: expected zero arguments for 
> construction of ClassDict (for numpy.dtype)
> at 
> net.razorvine.pickle.objects.ClassDictConstructor.construct(ClassDictConstructor.java:23)
> at net.razorvine.pickle.Unpickler.load_reduce(Unpickler.java:701)
> at net.razorvine.pickle.Unpickler.dispatch(Unpickler.java:171)
> at net.razorvine.pickle.Unpickler.load(Unpickler.java:85)
> at net.razorvine.pickle.Unpickler.loads(Unpickler.java:98)
> at 
> org.apache.spark.sql.execution.BatchPythonEvaluation$$anonfun$doExecute$1$$anonfun$apply$3.apply(python.scala:404)
> at 
> org.apache.spark.sql.execution.BatchPythonEvaluation$$anonfun$doExecute$1$$anonfun$apply$3.apply(python.scala:403)
> {code}
> Numpy types like np.int and np.float64 should automatically be cast to the 
> proper dtypes.






[jira] [Commented] (SPARK-16824) Add API docs for VectorUDT

2017-02-06 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15855057#comment-15855057
 ] 

Joseph K. Bradley commented on SPARK-16824:
---

I think we didn't document it since the future of UDTs becoming public APIs was 
uncertain.  VectorUDT is private in spark.ml.  Still, adding docs for the 
public spark.mllib VectorUDT sounds good to me.

> Add API docs for VectorUDT
> --
>
> Key: SPARK-16824
> URL: https://issues.apache.org/jira/browse/SPARK-16824
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, MLlib, PySpark
>Affects Versions: 2.0.0
>Reporter: Nicholas Chammas
>Priority: Minor
>
> Following on the [discussion 
> here|https://issues.apache.org/jira/browse/SPARK-12157?focusedCommentId=15401153=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15401153],
>  it appears that {{VectorUDT}} is missing documentation, at least in PySpark. 
> I'm not sure if this is intentional or not.






[jira] [Updated] (SPARK-16824) Add API docs for VectorUDT

2017-02-06 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16824?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-16824:
--
Issue Type: Documentation  (was: Improvement)

> Add API docs for VectorUDT
> --
>
> Key: SPARK-16824
> URL: https://issues.apache.org/jira/browse/SPARK-16824
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, MLlib, PySpark
>Affects Versions: 2.0.0
>Reporter: Nicholas Chammas
>Priority: Minor
>
> Following on the [discussion 
> here|https://issues.apache.org/jira/browse/SPARK-12157?focusedCommentId=15401153=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15401153],
>  it appears that {{VectorUDT}} is missing documentation, at least in PySpark. 
> I'm not sure if this is intentional or not.






[jira] [Updated] (SPARK-16824) Add API docs for VectorUDT

2017-02-06 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16824?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-16824:
--
Component/s: MLlib

> Add API docs for VectorUDT
> --
>
> Key: SPARK-16824
> URL: https://issues.apache.org/jira/browse/SPARK-16824
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, MLlib, PySpark
>Affects Versions: 2.0.0
>Reporter: Nicholas Chammas
>Priority: Minor
>
> Following on the [discussion 
> here|https://issues.apache.org/jira/browse/SPARK-12157?focusedCommentId=15401153=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15401153],
>  it appears that {{VectorUDT}} is missing documentation, at least in PySpark. 
> I'm not sure if this is intentional or not.






[jira] [Commented] (SPARK-18891) Support for specific collection types

2017-02-06 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SPARK-18891?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15855011#comment-15855011
 ] 

Michal Šenkýř commented on SPARK-18891:
---

Thanks. Yes, I know about the Maps issue and I will be happy to continue on to 
Map support. However, in [PR #16541|https://github.com/apache/spark/pull/16541] 
I tried to use a better deserialization method, and in order to keep the code 
for Maps consistent with that of Seqs I am waiting to see how that turns out 
before I start on Maps.

> Support for specific collection types
> -
>
> Key: SPARK-18891
> URL: https://issues.apache.org/jira/browse/SPARK-18891
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.6.3, 2.1.0
>Reporter: Michael Armbrust
>Priority: Critical
>
> Encoders treat all collections the same (i.e. {{Seq}} vs {{List}}), which 
> forces users to define classes with only the most generic type.
> An [example 
> error|https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/1023043053387187/2398463439880241/2840265927289860/latest.html]:
> {code}
> case class SpecificCollection(aList: List[Int])
> Seq(SpecificCollection(1 :: Nil)).toDS().collect()
> {code}
> {code}
> java.lang.RuntimeException: Error while decoding: 
> java.util.concurrent.ExecutionException: java.lang.Exception: failed to 
> compile: org.codehaus.commons.compiler.CompileException: File 
> 'generated.java', Line 98, Column 120: No applicable constructor/method found 
> for actual parameters "scala.collection.Seq"; candidates are: 
> "line29e7e4b1e36445baa3505b2e102aa86b29.$read$$iw$$iw$$iw$$iw$SpecificCollection(scala.collection.immutable.List)"
> {code}
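For comparison, the generic-type workaround that works today, as a small 
self-contained sketch:

{code}
import org.apache.spark.sql.SparkSession

// Declaring the field as Seq[Int] (the most generic collection type) avoids
// the decoding error above, at the cost of losing the specific List type.
case class GenericCollection(aList: Seq[Int])

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("encoders-demo")
  .getOrCreate()
import spark.implicits._

Seq(GenericCollection(Seq(1, 2, 3))).toDS().collect()
{code}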






[jira] [Assigned] (SPARK-19481) Fix flaky test: o.a.s.repl.ReplSuite should clone and clean line object in ClosureCleaner

2017-02-06 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19481:


Assignee: Shixiong Zhu  (was: Apache Spark)

> Fix flaky test: o.a.s.repl.ReplSuite should clone and clean line object in 
> ClosureCleaner
> -
>
> Key: SPARK-19481
> URL: https://issues.apache.org/jira/browse/SPARK-19481
> Project: Spark
>  Issue Type: Test
>  Components: Spark Shell
>Affects Versions: 2.1.0
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>
> org.apache.spark.repl.cancelOnInterrupt leaks a SparkContext and makes the 
> tests unstable. See:
> http://spark-tests.appspot.com/test-details?suite_name=org.apache.spark.repl.ReplSuite_name=should+clone+and+clean+line+object+in+ClosureCleaner






[jira] [Assigned] (SPARK-19481) Fix flaky test: o.a.s.repl.ReplSuite should clone and clean line object in ClosureCleaner

2017-02-06 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19481:


Assignee: Apache Spark  (was: Shixiong Zhu)

> Fix flaky test: o.a.s.repl.ReplSuite should clone and clean line object in 
> ClosureCleaner
> -
>
> Key: SPARK-19481
> URL: https://issues.apache.org/jira/browse/SPARK-19481
> Project: Spark
>  Issue Type: Test
>  Components: Spark Shell
>Affects Versions: 2.1.0
>Reporter: Shixiong Zhu
>Assignee: Apache Spark
>
> org.apache.spark.repl.cancelOnInterrupt leaks a SparkContext and makes the 
> tests unstable. See:
> http://spark-tests.appspot.com/test-details?suite_name=org.apache.spark.repl.ReplSuite_name=should+clone+and+clean+line+object+in+ClosureCleaner






[jira] [Commented] (SPARK-19481) Fix flaky test: o.a.s.repl.ReplSuite should clone and clean line object in ClosureCleaner

2017-02-06 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15854889#comment-15854889
 ] 

Apache Spark commented on SPARK-19481:
--

User 'zsxwing' has created a pull request for this issue:
https://github.com/apache/spark/pull/16825

> Fix flaky test: o.a.s.repl.ReplSuite should clone and clean line object in 
> ClosureCleaner
> -
>
> Key: SPARK-19481
> URL: https://issues.apache.org/jira/browse/SPARK-19481
> Project: Spark
>  Issue Type: Test
>  Components: Spark Shell
>Affects Versions: 2.1.0
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>
> org.apache.spark.repl.cancelOnInterrupt leaks a SparkContext and makes the 
> tests unstable. See:
> http://spark-tests.appspot.com/test-details?suite_name=org.apache.spark.repl.ReplSuite_name=should+clone+and+clean+line+object+in+ClosureCleaner






[jira] [Created] (SPARK-19481) Fix flaky test: o.a.s.repl.ReplSuite should clone and clean line object in ClosureCleaner

2017-02-06 Thread Shixiong Zhu (JIRA)
Shixiong Zhu created SPARK-19481:


 Summary: Fix flaky test: o.a.s.repl.ReplSuite should clone and 
clean line object in ClosureCleaner
 Key: SPARK-19481
 URL: https://issues.apache.org/jira/browse/SPARK-19481
 Project: Spark
  Issue Type: Test
  Components: Spark Shell
Affects Versions: 2.1.0
Reporter: Shixiong Zhu
Assignee: Shixiong Zhu


org.apache.spark.repl.cancelOnInterrupt leaks a SparkContext and makes the 
tests unstable. See:

http://spark-tests.appspot.com/test-details?suite_name=org.apache.spark.repl.ReplSuite_name=should+clone+and+clean+line+object+in+ClosureCleaner






[jira] [Created] (SPARK-19480) Higher order functions in SQL

2017-02-06 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-19480:
---

 Summary: Higher order functions in SQL
 Key: SPARK-19480
 URL: https://issues.apache.org/jira/browse/SPARK-19480
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Affects Versions: 2.2.0
Reporter: Reynold Xin


Enable users to manipulate nested data types, which are common in ETL jobs 
with deeply nested JSON fields. Operations should include map, filter, and 
reduce on arrays/maps.







[jira] [Resolved] (SPARK-19472) [SQL]SQLParser fails to resolve nested CASE WHEN statement with parentheses

2017-02-06 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19472?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-19472.
-
   Resolution: Fixed
 Assignee: Herman van Hovell
Fix Version/s: 2.2.0
   2.1.1
   2.0.3

> [SQL]SQLParser fails to resolve nested CASE WHEN statement with parentheses
> ---
>
> Key: SPARK-19472
> URL: https://issues.apache.org/jira/browse/SPARK-19472
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: StanZhai
>Assignee: Herman van Hovell
> Fix For: 2.0.3, 2.1.1, 2.2.0
>
>
> SQLParser fails to resolve nested CASE WHEN statement like this:
> select case when
>   (1) +
>   case when 1>0 then 1 else 0 end = 2
> then 1 else 0 end
> from tb
> Exception:
> Exception in thread "main" 
> org.apache.spark.sql.catalyst.parser.ParseException: 
> mismatched input 'then' expecting {'.', '[', 'OR', 'AND', 'IN', NOT, 
> 'BETWEEN', 'LIKE', RLIKE, 'IS', 'WHEN', EQ, '<=>', '<>', '!=', '<', LTE, '>', 
> GTE, '+', '-', '*', '/', '%', 'DIV', '&', '|', '^'}(line 5, pos 0)
> == SQL ==
> select case when
>   (1) +
>   case when 1>0 then 1 else 0 end = 2
> then 1 else 0 end
> ^^^
> from tb
> But removing the parentheses works fine:
> select case when
>   1 +
>   case when 1>0 then 1 else 0 end = 2
> then 1 else 0 end
> from tb






[jira] [Resolved] (SPARK-19479) Spark Mesos artifact split causes spark-core dependency to not pull in mesos impl

2017-02-06 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19479?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-19479.
---
Resolution: Invalid

As I said on the PR, I don't think this is for release notes because release 
notes are for end users. Regardless, I am not sure what the actionable change 
is supposed to be here.

> Spark Mesos artifact split causes spark-core dependency to not pull in mesos 
> impl
> -
>
> Key: SPARK-19479
> URL: https://issues.apache.org/jira/browse/SPARK-19479
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos, Spark Core
>Affects Versions: 2.1.0
>Reporter: Charles Allen
>
> https://github.com/apache/spark/pull/14637 ( 
> https://issues.apache.org/jira/browse/SPARK-16967 ) forked off the mesos impl 
> into its own artifact, but the release notes do not call this out. This broke 
> our deployments because we depend on packaging with spark-core, which no 
> longer had any mesos awareness. 






[jira] [Created] (SPARK-19479) Spark Mesos artifact split causes spark-core dependency to not pull in mesos impl

2017-02-06 Thread Charles Allen (JIRA)
Charles Allen created SPARK-19479:
-

 Summary: Spark Mesos artifact split causes spark-core dependency 
to not pull in mesos impl
 Key: SPARK-19479
 URL: https://issues.apache.org/jira/browse/SPARK-19479
 Project: Spark
  Issue Type: Bug
  Components: Mesos, Spark Core
Affects Versions: 2.1.0
Reporter: Charles Allen


https://github.com/apache/spark/pull/14637 ( 
https://issues.apache.org/jira/browse/SPARK-16967 ) forked off the mesos impl 
into its own artifact, but the release notes do not call this out. This broke 
our deployments because we depend on packaging with spark-core, which no longer 
had any mesos awareness. 
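
For anyone hit by this, a hedged sketch of the dependency change a build would now need, 
assuming the split publishes the Mesos code as a separate "spark-mesos" artifact alongside 
spark-core (sbt shown; the versions are placeholders):

{code}
// sbt sketch only; adjust the Scala and Spark versions to your build.
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core"  % "2.1.0" % "provided",
  "org.apache.spark" %% "spark-mesos" % "2.1.0"
)
{code}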



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19478) JDBC Sink

2017-02-06 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19478?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-19478:
-
Issue Type: New Feature  (was: Bug)

> JDBC Sink
> -
>
> Key: SPARK-19478
> URL: https://issues.apache.org/jira/browse/SPARK-19478
> Project: Spark
>  Issue Type: New Feature
>  Components: Structured Streaming
>Affects Versions: 2.0.0
>Reporter: Michael Armbrust
>
> A sink that transactionally commits data into a database using JDBC.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19478) JDBC Sink

2017-02-06 Thread Michael Armbrust (JIRA)
Michael Armbrust created SPARK-19478:


 Summary: JDBC Sink
 Key: SPARK-19478
 URL: https://issues.apache.org/jira/browse/SPARK-19478
 Project: Spark
  Issue Type: Bug
  Components: Structured Streaming
Affects Versions: 2.0.0
Reporter: Michael Armbrust


A sink that transactionally commits data into a database using JDBC.
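
Until such a sink exists, a rough sketch of how JDBC writes can be done from Structured 
Streaming today via ForeachWriter; the table name, columns, and credentials below are 
invented, and this gives at-least-once rather than transactional semantics:

{code}
import java.sql.{Connection, DriverManager, PreparedStatement}
import org.apache.spark.sql.{ForeachWriter, Row}

// Hypothetical per-partition JDBC writer; wire it up with
// streamingDF.writeStream.foreach(new JdbcWriter(url, user, pass)).start()
class JdbcWriter(url: String, user: String, password: String) extends ForeachWriter[Row] {
  private var conn: Connection = _
  private var stmt: PreparedStatement = _

  override def open(partitionId: Long, version: Long): Boolean = {
    conn = DriverManager.getConnection(url, user, password)
    stmt = conn.prepareStatement("INSERT INTO sink_table (key, value) VALUES (?, ?)")
    true
  }

  override def process(row: Row): Unit = {
    stmt.setString(1, row.getString(0))
    stmt.setLong(2, row.getLong(1))
    stmt.executeUpdate()
  }

  override def close(errorOrNull: Throwable): Unit = {
    if (stmt != null) stmt.close()
    if (conn != null) conn.close()
  }
}
{code}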



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19398) Log in TaskSetManager is not correct

2017-02-06 Thread Kay Ousterhout (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19398?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kay Ousterhout updated SPARK-19398:
---
Fix Version/s: 2.2

> Log in TaskSetManager is not correct
> 
>
> Key: SPARK-19398
> URL: https://issues.apache.org/jira/browse/SPARK-19398
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler
>Affects Versions: 2.1.0
>Reporter: jin xing
>Assignee: jin xing
>Priority: Trivial
> Fix For: 2.2
>
>
> Log below is misleading:
> {code:title="TaskSetManager.scala"}
> if (successful(index)) {
>   logInfo(
> s"Task ${info.id} in stage ${taskSet.id} (TID $tid) failed, " +
> "but another instance of the task has already succeeded, " +
> "so not re-queuing the task to be re-executed.")
> }
> {code}
> If a fetch failed, the task is marked as *successful* in 
> *TaskSetManager::handleFailedTask*. Then the log above will be printed. Here 
> *successful* just means the task will not be scheduled any longer, not that it 
> actually succeeded.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-19398) Log in TaskSetManager is not correct

2017-02-06 Thread Kay Ousterhout (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19398?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kay Ousterhout resolved SPARK-19398.

Resolution: Fixed
  Assignee: jin xing

> Log in TaskSetManager is not correct
> 
>
> Key: SPARK-19398
> URL: https://issues.apache.org/jira/browse/SPARK-19398
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler
>Affects Versions: 2.1.0
>Reporter: jin xing
>Assignee: jin xing
>Priority: Trivial
>
> Log below is misleading:
> {code:title="TaskSetManager.scala"}
> if (successful(index)) {
>   logInfo(
> s"Task ${info.id} in stage ${taskSet.id} (TID $tid) failed, " +
> "but another instance of the task has already succeeded, " +
> "so not re-queuing the task to be re-executed.")
> }
> {code}
> If a fetch failed, the task is marked as *successful* in 
> *TaskSetManager::handleFailedTask*. Then the log above will be printed. Here 
> *successful* just means the task will not be scheduled any longer, not that it 
> actually succeeded.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19462) when spark.sql.adaptive.enabled is enabled, DF is not resilient to node/container failure

2017-02-06 Thread Ian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19462?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15854534#comment-15854534
 ] 

Ian commented on SPARK-19462:
-

It appears that the state mutation of newPartitioning in 
org.apache.spark.sql.execution.Exchange is preventing the recalculation when 
dynamic shuffle partitioning is enabled.
 

> when spark.sql.adaptive.enabled is enabled, DF is not resilient to 
> node/container failure
> -
>
> Key: SPARK-19462
> URL: https://issues.apache.org/jira/browse/SPARK-19462
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.3
>Reporter: Ian
>
> The property spark.sql.adaptive.enabled needs to be set to "true" for the 
> issue to be reproduced.
> reproducible steps using spark-shell
> 0. we use yarn as cluster manager, spark-shell runs in client mode 
> 1. launch spark-shell
> 2. 
> {code}
> val df1 = sc.parallelize( 1 to 1000, 2).toDF("number")
> df1.registerTempTable("test")
> val data1 = sqlContext.sql("SELECT * FROM test WHERE number > 50")
> data1.collect
> val data2 = sqlContext.sql("SELECT number, count(*) cnt FROM test GROUP BY 
> number")
> data2.collect
> // everything is fine up to this point
> // manually kill both the AM and all the NMs of the spark-shell app
> // re-run data1.collect, the result is returned successfully
> data1.collect
> // but data2.collect will fail
> data2.collect
> // stacktrace
> Caused by: java.lang.RuntimeException: Exchange not implemented for 
> UnknownPartitioning(1)
>   at scala.sys.package$.error(package.scala:27)
>   at 
> org.apache.spark.sql.execution.Exchange.org$apache$spark$sql$execution$Exchange$$getPartitionKeyExtractor$1(Exchange.scala:198)
>   at 
> org.apache.spark.sql.execution.Exchange$$anonfun$3.apply(Exchange.scala:208)
>   at 
> org.apache.spark.sql.execution.Exchange$$anonfun$3.apply(Exchange.scala:207)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$21.apply(RDD.scala:728)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$21.apply(RDD.scala:728)
>   at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
>   at org.apache.spark.scheduler.Task.run(Task.scala:89)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:227)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> {code}
> The difference between data1 and data2 is whether ShuffledRowRDD is present 
> in lineage.
> When the RDD lineage contains ShuffledRowRDD, the above mentioned behavior 
> can be observed when node failures or container loss happens.
> {code}
> scala> data2.rdd.toDebugString
> res6: String =
> (1) MapPartitionsRDD[20] at rdd at <console>:26 []
>  |  MapPartitionsRDD[19] at rdd at <console>:26 []
>  |  ShuffledRowRDD[8] at collect at <console>:26 []
>  +-(2) MapPartitionsRDD[7] at collect at <console>:26 []
> |  MapPartitionsRDD[6] at collect at <console>:26 []
> |  MapPartitionsRDD[5] at collect at <console>:26 []
> |  MapPartitionsRDD[1] at intRddToDataFrameHolder at <console>:25 []
> |  ParallelCollectionRDD[0] at parallelize at <console>:25 []
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19477) [SQL] Datasets created from a Dataframe with extra columns retain the extra columns

2017-02-06 Thread Don Drake (JIRA)
Don Drake created SPARK-19477:
-

 Summary: [SQL] Datasets created from a Dataframe with extra 
columns retain the extra columns
 Key: SPARK-19477
 URL: https://issues.apache.org/jira/browse/SPARK-19477
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.1.0
Reporter: Don Drake


In 1.6, when you created a Dataset from a Dataframe that had extra columns, the 
columns not in the case class were dropped from the Dataset.

For example in 1.6, the column c4 is gone:

{code}
scala> case class F(f1: String, f2: String, f3:String)
defined class F

scala> import sqlContext.implicits._
import sqlContext.implicits._

scala> val df = Seq(("a","b","c","x"), ("d", "e", "f","y"), ("h", "i", 
"j","z")).toDF("f1", "f2", "f3", "c4")
df: org.apache.spark.sql.DataFrame = [f1: string, f2: string, f3: string, c4: 
string]

scala> val ds = df.as[F]
ds: org.apache.spark.sql.Dataset[F] = [f1: string, f2: string, f3: string]

scala> ds.show
+---+---+---+
| f1| f2| f3|
+---+---+---+
|  a|  b|  c|
|  d|  e|  f|
|  h|  i|  j|
+---+---+---+

{code}

This seems to have changed in Spark 2.0 and also 2.1:

Spark 2.1.0:

{code}
scala> case class F(f1: String, f2: String, f3:String)
defined class F

scala> import spark.implicits._
import spark.implicits._

scala> val df = Seq(("a","b","c","x"), ("d", "e", "f","y"), ("h", "i", 
"j","z")).toDF("f1", "f2", "f3", "c4")
df: org.apache.spark.sql.DataFrame = [f1: string, f2: string ... 2 more fields]

scala> val ds = df.as[F]
ds: org.apache.spark.sql.Dataset[F] = [f1: string, f2: string ... 2 more fields]

scala> ds.show
+---+---+---+---+
| f1| f2| f3| c4|
+---+---+---+---+
|  a|  b|  c|  x|
|  d|  e|  f|  y|
|  h|  i|  j|  z|
+---+---+---+---+

scala> import org.apache.spark.sql.Encoders
import org.apache.spark.sql.Encoders

scala> val fEncoder = Encoders.product[F]
fEncoder: org.apache.spark.sql.Encoder[F] = class[f1[0]: string, f2[0]: string, 
f3[0]: string]

scala> fEncoder.schema == ds.schema
res2: Boolean = false

scala> ds.schema
res3: org.apache.spark.sql.types.StructType = 
StructType(StructField(f1,StringType,true), StructField(f2,StringType,true), 
StructField(f3,StringType,true), StructField(c4,StringType,true))

scala> fEncoder.schema
res4: org.apache.spark.sql.types.StructType = 
StructType(StructField(f1,StringType,true), StructField(f2,StringType,true), 
StructField(f3,StringType,true))


{code}
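
A possible workaround for 2.x, continuing the session above (a sketch, not an official fix): 
select the case-class columns before converting, or force a pass through the encoder, which 
should keep only the fields of F:

{code}
// Drop the extra column explicitly before converting to a Dataset[F]:
val dsTrimmed = df.select($"f1", $"f2", $"f3").as[F]
dsTrimmed.show()   // only f1, f2, f3

// Or force deserialization to F and back, which also drops c4:
val dsMapped = df.as[F].map(identity)
dsMapped.show()    // only f1, f2, f3
{code}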





--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18069) Many examples in Python docstrings are incomplete

2017-02-06 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15854468#comment-15854468
 ] 

Apache Spark commented on SPARK-18069:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/16824

> Many examples in Python docstrings are incomplete
> -
>
> Key: SPARK-18069
> URL: https://issues.apache.org/jira/browse/SPARK-18069
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 2.0.1
>Reporter: Mortada Mehyar
>Priority: Minor
>
> A lot of the python API functions show example usage that is incomplete. The 
> docstring shows output without having the input DataFrame defined. It can be 
> quite confusing trying to understand and/or follow the example.
> For instance, the docstring for `DataFrame.dtypes()` is currently
> {code}
>  def dtypes(self):
>  """Returns all column names and their data types as a list.
>  
>  >>> df.dtypes
>  [('age', 'int'), ('name', 'string')]
>  """
> {code}
> when it should really be
> {code}
>  def dtypes(self):
>  """Returns all column names and their data types as a list.
>  
>  >>> df = spark.createDataFrame([('Alice', 2), ('Bob', 5)], ['name', 
> 'age'])
>  >>> df.dtypes
>  [('age', 'int'), ('name', 'string')]
>  """
> {code}
> I have a pending PR for fixing many of these occurrences here: 
> https://github.com/apache/spark/pull/15053 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19439) PySpark's registerJavaFunction Should Support UDAFs

2017-02-06 Thread Keith Bourgoin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15854330#comment-15854330
 ] 

Keith Bourgoin commented on SPARK-19439:


SPARK-10915 refers to making it possible to write UDAFs in Python, and it looks 
like it'll require a significant amount of work. This ticket is merely about 
adding the ability to call Scala UDAFs from Python. It's substantially less 
work, and a different sort of issue.

In fact, I was able to hack my way around and figure out how to get it working 
just now:

{code}
In [1]: data = [[i+0.5] for i in range(100)]

In [2]: df = sqlContext.createDataFrame(data, ['key'])

In [3]: df.registerTempTable('df')

In [4]: geo_mean = sc._jvm.com.foo.bar.GeometricMean()

In [5]: sqlContext.sparkSession._jsparkSession.udf().register('geo_mean', 
geo_mean)
Out[5]: JavaObject id=o43

In [6]: sqlContext.sql('select geo_mean(key) from df').collect()
Out[6]: [Row(geometricmean(key)=36.91550879325741)]
{code}

I clearly can't submit this as a PR, but if I have time I'll try to cook up a 
worthwhile patch to get the process moving.

> PySpark's registerJavaFunction Should Support UDAFs
> ---
>
> Key: SPARK-19439
> URL: https://issues.apache.org/jira/browse/SPARK-19439
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.1.0
>Reporter: Keith Bourgoin
>
> When trying to import a Scala UDAF using registerJavaFunction, I get this 
> error:
> {code}
> In [1]: sqlContext.registerJavaFunction('geo_mean', 
> 'com.foo.bar.GeometricMean')
> ---
> Py4JJavaError Traceback (most recent call last)
>  in ()
> > 1 sqlContext.registerJavaFunction('geo_mean', 
> 'com.foo.bar.GeometricMean')
> /home/kfb/src/projects/spark/python/pyspark/sql/context.pyc in 
> registerJavaFunction(self, name, javaClassName, returnType)
> 227 if returnType is not None:
> 228 jdt = 
> self.sparkSession._jsparkSession.parseDataType(returnType.json())
> --> 229 self.sparkSession._jsparkSession.udf().registerJava(name, 
> javaClassName, jdt)
> 230
> 231 # TODO(andrew): delete this once we refactor things to take in 
> SparkSession
> /home/kfb/src/projects/spark/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py
>  in __call__(self, *args)
>1131 answer = self.gateway_client.send_command(command)
>1132 return_value = get_return_value(
> -> 1133 answer, self.gateway_client, self.target_id, self.name)
>1134
>1135 for temp_arg in temp_args:
> /home/kfb/src/projects/spark/python/pyspark/sql/utils.pyc in deco(*a, **kw)
>  61 def deco(*a, **kw):
>  62 try:
> ---> 63 return f(*a, **kw)
>  64 except py4j.protocol.Py4JJavaError as e:
>  65 s = e.java_exception.toString()
> /home/kfb/src/projects/spark/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py 
> in get_return_value(answer, gateway_client, target_id, name)
> 317 raise Py4JJavaError(
> 318 "An error occurred while calling {0}{1}{2}.\n".
> --> 319 format(target_id, ".", name), value)
> 320 else:
> 321 raise Py4JError(
> Py4JJavaError: An error occurred while calling o28.registerJava.
> : java.io.IOException: UDF class com.foo.bar.GeometricMean doesn't implement 
> any UDF interface
>   at 
> org.apache.spark.sql.UDFRegistration.registerJava(UDFRegistration.scala:438)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
>   at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
>   at py4j.Gateway.invoke(Gateway.java:280)
>   at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
>   at py4j.commands.CallCommand.execute(CallCommand.java:79)
>   at py4j.GatewayConnection.run(GatewayConnection.java:214)
>   at java.lang.Thread.run(Thread.java:745)
> {code}
> According to SPARK-10915, UDAFs in Python aren't happening anytime soon. 
> Without this, there's no way to get Scala UDAFs into Python Spark SQL 
> whatsoever. Fixing that would be a huge help so that we can keep aggregations 
> in the JVM and keep using DataFrames. Otherwise, all our code has to drop to 
> RDDs and live in Python.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Created] (SPARK-19476) Running threads in Spark DataFrame foreachPartition() causes NullPointerException

2017-02-06 Thread Gal Topper (JIRA)
Gal Topper created SPARK-19476:
--

 Summary: Running threads in Spark DataFrame foreachPartition() 
causes NullPointerException
 Key: SPARK-19476
 URL: https://issues.apache.org/jira/browse/SPARK-19476
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.1.0, 2.0.2, 2.0.1, 2.0.0, 1.6.3, 1.6.2, 1.6.1, 1.6.0
Reporter: Gal Topper


First reported on [Stack 
overflow|http://stackoverflow.com/questions/41674069/running-threads-in-spark-dataframe-foreachpartition].

I use multiple threads inside foreachPartition(), which works great for me 
except for when the underlying iterator is TungstenAggregationIterator. Here is 
a minimal code snippet to reproduce:
{code:title=Reproduce.scala|borderStyle=solid}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration.Duration
import scala.concurrent.{Await, Future}

import org.apache.spark.SparkContext
import org.apache.spark.sql.SQLContext

object Reproduce extends App {

  val sc = new SparkContext("local", "reproduce")
  val sqlContext = new SQLContext(sc)

  import sqlContext.implicits._

  val df = sc.parallelize(Seq(1)).toDF("number").groupBy("number").count()

  df.foreachPartition { iterator =>
val f = Future(iterator.toVector)
Await.result(f, Duration.Inf)
  }
}
{code}

When I run this, I get:
{noformat}
java.lang.NullPointerException
at 
org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.next(TungstenAggregationIterator.scala:751)
at 
org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.next(TungstenAggregationIterator.scala:84)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at scala.collection.Iterator$class.foreach(Iterator.scala:893)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
{noformat}

I believe I actually understand why this happens - TungstenAggregationIterator 
uses a ThreadLocal variable that returns null when called from a thread other 
than the original thread that got the iterator from Spark. From examining the 
code, this does not appear to differ between recent Spark versions.

However, this limitation is specific to TungstenAggregationIterator, and not 
documented, as far as I'm aware.
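
If the limitation stands, one possible workaround (a sketch continuing the snippet above, 
not a fix): drain the iterator on the task's own thread first, and only hand the 
materialized collection to other threads.

{code}
df.foreachPartition { iterator =>
  val rows = iterator.toVector                 // consumed on the thread Spark gave us
  val f = Future(rows.map(_.getLong(1)).sum)   // only the materialized Vector crosses threads
  Await.result(f, Duration.Inf)
}
{code}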



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-19080) simplify data source analysis

2017-02-06 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19080?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-19080.
-
   Resolution: Fixed
Fix Version/s: 2.2.0

Issue resolved by pull request 16269
[https://github.com/apache/spark/pull/16269]

> simplify data source analysis
> -
>
> Key: SPARK-19080
> URL: https://issues.apache.org/jira/browse/SPARK-19080
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
> Fix For: 2.2.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19475) (ML|MLlib).linalg.DenseVector method delegation fails for __neg__

2017-02-06 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19475?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19475:


Assignee: (was: Apache Spark)

> (ML|MLlib).linalg.DenseVector method delegation fails for __neg__
> -
>
> Key: SPARK-19475
> URL: https://issues.apache.org/jira/browse/SPARK-19475
> Project: Spark
>  Issue Type: Bug
>  Components: ML, MLlib, PySpark
>Affects Versions: 2.0.0, 2.1.0, 2.2.0
>Reporter: Maciej Szymkiewicz
>Priority: Minor
>
> {{(ML|MLlib).linalg.DenseVector}} delegates a number of methods to NumPy. By 
> design [it does the 
> same|https://github.com/apache/spark/blob/933a6548d423cf17448207a99299cf36fc1a95f6/python/pyspark/mllib/linalg/__init__.py#L487]
>  for {{__neg__}} but current {{_delegate}} method [expects binary 
> operators|https://github.com/apache/spark/blob/933a6548d423cf17448207a99299cf36fc1a95f6/python/pyspark/mllib/linalg/__init__.py#L481].
> {code}
> Welcome to
>     __
>  / __/__  ___ _/ /__
> _\ \/ _ \/ _ `/ __/  '_/
>/__ / .__/\_,_/_/ /_/\_\   version 2.1.0
>   /_/
> Using Python version 3.5.2 (default, Jul  2 2016 17:53:06)
> SparkSession available as 'spark'.
> In [1]: from pyspark.ml import linalg
> In [2]: -linalg.DenseVector([1, 2, 3])
> ---
> TypeError Traceback (most recent call last)
>  in ()
> > 1 -linalg.DenseVector([1, 2, 3])
> TypeError: func() missing 1 required positional argument: 'other'
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19475) (ML|MLlib).linalg.DenseVector method delegation fails for __neg__

2017-02-06 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19475?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15854270#comment-15854270
 ] 

Apache Spark commented on SPARK-19475:
--

User 'zero323' has created a pull request for this issue:
https://github.com/apache/spark/pull/16822

> (ML|MLlib).linalg.DenseVector method delegation fails for __neg__
> -
>
> Key: SPARK-19475
> URL: https://issues.apache.org/jira/browse/SPARK-19475
> Project: Spark
>  Issue Type: Bug
>  Components: ML, MLlib, PySpark
>Affects Versions: 2.0.0, 2.1.0, 2.2.0
>Reporter: Maciej Szymkiewicz
>Priority: Minor
>
> {{(ML|MLlib).linalg.DenseVector}} delegates a number of methods to NumPy. By 
> design [it does the 
> same|https://github.com/apache/spark/blob/933a6548d423cf17448207a99299cf36fc1a95f6/python/pyspark/mllib/linalg/__init__.py#L487]
>  for {{__neg__}} but current {{_delegate}} method [expects binary 
> operators|https://github.com/apache/spark/blob/933a6548d423cf17448207a99299cf36fc1a95f6/python/pyspark/mllib/linalg/__init__.py#L481].
> {code}
> Welcome to
>     __
>  / __/__  ___ _/ /__
> _\ \/ _ \/ _ `/ __/  '_/
>/__ / .__/\_,_/_/ /_/\_\   version 2.1.0
>   /_/
> Using Python version 3.5.2 (default, Jul  2 2016 17:53:06)
> SparkSession available as 'spark'.
> In [1]: from pyspark.ml import linalg
> In [2]: -linalg.DenseVector([1, 2, 3])
> ---
> TypeError Traceback (most recent call last)
>  in ()
> > 1 -linalg.DenseVector([1, 2, 3])
> TypeError: func() missing 1 required positional argument: 'other'
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19475) (ML|MLlib).linalg.DenseVector method delegation fails for __neg__

2017-02-06 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19475?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19475:


Assignee: Apache Spark

> (ML|MLlib).linalg.DenseVector method delegation fails for __neg__
> -
>
> Key: SPARK-19475
> URL: https://issues.apache.org/jira/browse/SPARK-19475
> Project: Spark
>  Issue Type: Bug
>  Components: ML, MLlib, PySpark
>Affects Versions: 2.0.0, 2.1.0, 2.2.0
>Reporter: Maciej Szymkiewicz
>Assignee: Apache Spark
>Priority: Minor
>
> {{(ML|MLlib).linalg.DenseVector}} delegates a number of methods to NumPy. By 
> design [it does the 
> same|https://github.com/apache/spark/blob/933a6548d423cf17448207a99299cf36fc1a95f6/python/pyspark/mllib/linalg/__init__.py#L487]
>  for {{__neg__}} but current {{_delegate}} method [expects binary 
> operators|https://github.com/apache/spark/blob/933a6548d423cf17448207a99299cf36fc1a95f6/python/pyspark/mllib/linalg/__init__.py#L481].
> {code}
> Welcome to
>     __
>  / __/__  ___ _/ /__
> _\ \/ _ \/ _ `/ __/  '_/
>/__ / .__/\_,_/_/ /_/\_\   version 2.1.0
>   /_/
> Using Python version 3.5.2 (default, Jul  2 2016 17:53:06)
> SparkSession available as 'spark'.
> In [1]: from pyspark.ml import linalg
> In [2]: -linalg.DenseVector([1, 2, 3])
> ---
> TypeError Traceback (most recent call last)
>  in ()
> > 1 -linalg.DenseVector([1, 2, 3])
> TypeError: func() missing 1 required positional argument: 'other'
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19475) (ML|MLlib).linalg.DenseVector method delegation fails for __neg__

2017-02-06 Thread Maciej Szymkiewicz (JIRA)
Maciej Szymkiewicz created SPARK-19475:
--

 Summary: (ML|MLlib).linalg.DenseVector method delegation fails for 
__neg__
 Key: SPARK-19475
 URL: https://issues.apache.org/jira/browse/SPARK-19475
 Project: Spark
  Issue Type: Bug
  Components: ML, MLlib, PySpark
Affects Versions: 2.1.0, 2.0.0, 2.2.0
Reporter: Maciej Szymkiewicz
Priority: Minor


{{(ML|MLlib).linalg.DenseVector}} delegates a number of methods to NumPy. By 
design [it does the 
same|https://github.com/apache/spark/blob/933a6548d423cf17448207a99299cf36fc1a95f6/python/pyspark/mllib/linalg/__init__.py#L487]
 for {{__neg__}} but current {{_delegate}} method [expects binary 
operators|https://github.com/apache/spark/blob/933a6548d423cf17448207a99299cf36fc1a95f6/python/pyspark/mllib/linalg/__init__.py#L481].

{code}
Welcome to
    __
 / __/__  ___ _/ /__
_\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.1.0
  /_/

Using Python version 3.5.2 (default, Jul  2 2016 17:53:06)
SparkSession available as 'spark'.

In [1]: from pyspark.ml import linalg

In [2]: -linalg.DenseVector([1, 2, 3])
---
TypeError Traceback (most recent call last)
 in ()
> 1 -linalg.DenseVector([1, 2, 3])

TypeError: func() missing 1 required positional argument: 'other'
{code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19468) Dataset slow because of unnecessary shuffles

2017-02-06 Thread koert kuipers (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15854242#comment-15854242
 ] 

koert kuipers commented on SPARK-19468:
---

so to summarize: RDD does what we would expect, DataFrame does what we would 
expect, Dataset inserts extra unnecessary shuffles.

> Dataset slow because of unnecessary shuffles
> 
>
> Key: SPARK-19468
> URL: https://issues.apache.org/jira/browse/SPARK-19468
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: koert kuipers
>
> we noticed that some algos we ported from rdd to dataset are significantly 
> slower, and the main reason seems to be more shuffles that we successfully 
> avoid for rdds by careful partitioning. this seems to be dataset specific as 
> it works ok for dataframe.
> see also here:
> http://blog.hydronitrogen.com/2016/05/13/shuffle-free-joins-in-spark-sql/
> it kind of boils down to this... if i partition and sort dataframes that get 
> used for joins repeatedly i can avoid shuffles:
> {noformat}
> System.setProperty("spark.sql.autoBroadcastJoinThreshold", "-1")
> val df1 = Seq((0, 0), (1, 1)).toDF("key", "value")
>   
> .repartition(col("key")).sortWithinPartitions(col("key")).persist(StorageLevel.DISK_ONLY)
> val df2 = Seq((0, 0), (1, 1)).toDF("key2", "value2")
>   
> .repartition(col("key2")).sortWithinPartitions(col("key2")).persist(StorageLevel.DISK_ONLY)
> val joined = df1.join(df2, col("key") === col("key2"))
> joined.explain
> == Physical Plan ==
> *SortMergeJoin [key#5], [key2#27], Inner
> :- InMemoryTableScan [key#5, value#6]
> : +- InMemoryRelation [key#5, value#6], true, 1, StorageLevel(disk, 1 
> replicas)
> :   +- *Sort [key#5 ASC NULLS FIRST], false, 0
> :  +- Exchange hashpartitioning(key#5, 4)
> : +- LocalTableScan [key#5, value#6]
> +- InMemoryTableScan [key2#27, value2#28]
>   +- InMemoryRelation [key2#27, value2#28], true, 1, 
> StorageLevel(disk, 1 replicas)
> +- *Sort [key2#27 ASC NULLS FIRST], false, 0
>+- Exchange hashpartitioning(key2#27, 4)
>   +- LocalTableScan [key2#27, value2#28]
> {noformat}
> notice how the persisted dataframes are not shuffled or sorted anymore before 
> being used in the join. however if i try to do the same with dataset i have 
> no luck:
> {noformat}
> val ds1 = Seq((0, 0), (1, 1)).toDS
>   
> .repartition(col("_1")).sortWithinPartitions(col("_1")).persist(StorageLevel.DISK_ONLY)
> val ds2 = Seq((0, 0), (1, 1)).toDS
>   
> .repartition(col("_1")).sortWithinPartitions(col("_1")).persist(StorageLevel.DISK_ONLY)
> val joined1 = ds1.joinWith(ds2, ds1("_1") === ds2("_1"))
> joined1.explain
> == Physical Plan ==
> *SortMergeJoin [_1#105._1], [_2#106._1], Inner
> :- *Sort [_1#105._1 ASC NULLS FIRST], false, 0
> :  +- Exchange hashpartitioning(_1#105._1, 4)
> : +- *Project [named_struct(_1, _1#83, _2, _2#84) AS _1#105]
> :+- InMemoryTableScan [_1#83, _2#84]
> :  +- InMemoryRelation [_1#83, _2#84], true, 1, 
> StorageLevel(disk, 1 replicas)
> :+- *Sort [_1#83 ASC NULLS FIRST], false, 0
> :   +- Exchange hashpartitioning(_1#83, 4)
> :  +- LocalTableScan [_1#83, _2#84]
> +- *Sort [_2#106._1 ASC NULLS FIRST], false, 0
>+- Exchange hashpartitioning(_2#106._1, 4)
>   +- *Project [named_struct(_1, _1#100, _2, _2#101) AS _2#106]
>  +- InMemoryTableScan [_1#100, _2#101]
>+- InMemoryRelation [_1#100, _2#101], true, 1, 
> StorageLevel(disk, 1 replicas)
>  +- *Sort [_1#83 ASC NULLS FIRST], false, 0
> +- Exchange hashpartitioning(_1#83, 4)
>+- LocalTableScan [_1#83, _2#84]
> {noformat}
> notice how my persisted Datasets are shuffled and sorted again. part of the 
> issue seems to be in joinWith, which does some preprocessing that seems to 
> confuse the planner. if i change the joinWith to join (which returns a 
> dataframe) it looks a little better in that only one side gets shuffled 
> again, but still not optimal:
> {noformat}
> val ds1 = Seq((0, 0), (1, 1)).toDS
>   
> .repartition(col("_1")).sortWithinPartitions(col("_1")).persist(StorageLevel.DISK_ONLY)
> val ds2 = Seq((0, 0), (1, 1)).toDS
>   
> .repartition(col("_1")).sortWithinPartitions(col("_1")).persist(StorageLevel.DISK_ONLY)
> val joined1 = ds1.join(ds2, ds1("_1") === ds2("_1"))
> joined1.explain
> == Physical Plan ==
> *SortMergeJoin [_1#83], [_1#100], Inner
> :- InMemoryTableScan [_1#83, _2#84]
> : +- InMemoryRelation [_1#83, _2#84], true, 1, StorageLevel(disk, 1 
> replicas)
> :   +- *Sort [_1#83 ASC NULLS FIRST], false, 0
> :  +- Exchange hashpartitioning(_1#83, 4)
> :   

[jira] [Assigned] (SPARK-19472) [SQL]SQLParser fails to resolve nested CASE WHEN statement with parentheses

2017-02-06 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19472?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19472:


Assignee: Apache Spark

> [SQL]SQLParser fails to resolve nested CASE WHEN statement with parentheses
> ---
>
> Key: SPARK-19472
> URL: https://issues.apache.org/jira/browse/SPARK-19472
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: StanZhai
>Assignee: Apache Spark
>
> SQLParser fails to resolve nested CASE WHEN statement like this:
> select case when
>   (1) +
>   case when 1>0 then 1 else 0 end = 2
> then 1 else 0 end
> from tb
>  Exception 
> Exception in thread "main" 
> org.apache.spark.sql.catalyst.parser.ParseException: 
> mismatched input 'then' expecting {'.', '[', 'OR', 'AND', 'IN', NOT, 
> 'BETWEEN', 'LIKE', RLIKE, 'IS', 'WHEN', EQ, '<=>', '<>', '!=', '<', LTE, '>', 
> GTE, '+', '-', '*', '/', '%', 'DIV', '&', '|', '^'}(line 5, pos 0)
> == SQL ==
> select case when
>   (1) +
>   case when 1>0 then 1 else 0 end = 2
> then 1 else 0 end
> ^^^
> from tb
> But removing the parentheses will be fine:
> select case when
>   1 +
>   case when 1>0 then 1 else 0 end = 2
> then 1 else 0 end
> from tb



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19472) [SQL]SQLParser fails to resolve nested CASE WHEN statement with parentheses

2017-02-06 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19472?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19472:


Assignee: (was: Apache Spark)

> [SQL]SQLParser fails to resolve nested CASE WHEN statement with parentheses
> ---
>
> Key: SPARK-19472
> URL: https://issues.apache.org/jira/browse/SPARK-19472
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: StanZhai
>
> SQLParser fails to resolve nested CASE WHEN statement like this:
> select case when
>   (1) +
>   case when 1>0 then 1 else 0 end = 2
> then 1 else 0 end
> from tb
>  Exception 
> Exception in thread "main" 
> org.apache.spark.sql.catalyst.parser.ParseException: 
> mismatched input 'then' expecting {'.', '[', 'OR', 'AND', 'IN', NOT, 
> 'BETWEEN', 'LIKE', RLIKE, 'IS', 'WHEN', EQ, '<=>', '<>', '!=', '<', LTE, '>', 
> GTE, '+', '-', '*', '/', '%', 'DIV', '&', '|', '^'}(line 5, pos 0)
> == SQL ==
> select case when
>   (1) +
>   case when 1>0 then 1 else 0 end = 2
> then 1 else 0 end
> ^^^
> from tb
> But removing the parentheses will be fine:
> select case when
>   1 +
>   case when 1>0 then 1 else 0 end = 2
> then 1 else 0 end
> from tb



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19472) [SQL]SQLParser fails to resolve nested CASE WHEN statement with parentheses

2017-02-06 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19472?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15854187#comment-15854187
 ] 

Apache Spark commented on SPARK-19472:
--

User 'hvanhovell' has created a pull request for this issue:
https://github.com/apache/spark/pull/16821

> [SQL]SQLParser fails to resolve nested CASE WHEN statement with parentheses
> ---
>
> Key: SPARK-19472
> URL: https://issues.apache.org/jira/browse/SPARK-19472
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: StanZhai
>
> SQLParser fails to resolve nested CASE WHEN statement like this:
> select case when
>   (1) +
>   case when 1>0 then 1 else 0 end = 2
> then 1 else 0 end
> from tb
>  Exception 
> Exception in thread "main" 
> org.apache.spark.sql.catalyst.parser.ParseException: 
> mismatched input 'then' expecting {'.', '[', 'OR', 'AND', 'IN', NOT, 
> 'BETWEEN', 'LIKE', RLIKE, 'IS', 'WHEN', EQ, '<=>', '<>', '!=', '<', LTE, '>', 
> GTE, '+', '-', '*', '/', '%', 'DIV', '&', '|', '^'}(line 5, pos 0)
> == SQL ==
> select case when
>   (1) +
>   case when 1>0 then 1 else 0 end = 2
> then 1 else 0 end
> ^^^
> from tb
> But removing the parentheses will be fine:
> select case when
>   1 +
>   case when 1>0 then 1 else 0 end = 2
> then 1 else 0 end
> from tb



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19468) Dataset slow because of unnecessary shuffles

2017-02-06 Thread koert kuipers (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15854152#comment-15854152
 ] 

koert kuipers commented on SPARK-19468:
---

inserting unnecessary shuffles makes things very slow. and this is not an edge 
case. this is the most basic shuffle optimization.

given how RDD and DataFrame behave it is also unexpected. however it does not 
produce incorrect results.

it is definitely a blocker for switching from RDD to Dataset. no matter what 
else you do to make things fast (catalyst planner, tungsten, etc.), if you can't 
get the basic shuffles right then RDD is going to be more efficient for a lot 
of algos.
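
For reference, a minimal sketch of the RDD-level pattern being compared against here 
(a spark-shell style example with made-up data): pre-partition both sides with the same 
partitioner so join() adds no extra shuffle.

{code}
import org.apache.spark.HashPartitioner

val partitioner = new HashPartitioner(4)
val left  = sc.parallelize(Seq(0 -> "a", 1 -> "b")).partitionBy(partitioner).cache()
val right = sc.parallelize(Seq(0 -> "x", 1 -> "y")).partitionBy(partitioner).cache()

// Both RDDs carry the same partitioner, so this join needs no additional shuffle.
val joined = left.join(right)
println(joined.toDebugString)   // no new ShuffledRDD on top of the cached inputs
{code}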

> Dataset slow because of unnecessary shuffles
> 
>
> Key: SPARK-19468
> URL: https://issues.apache.org/jira/browse/SPARK-19468
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: koert kuipers
>
> we noticed that some algos we ported from rdd to dataset are significantly 
> slower, and the main reason seems to be more shuffles that we successfully 
> avoid for rdds by careful partitioning. this seems to be dataset specific as 
> it works ok for dataframe.
> see also here:
> http://blog.hydronitrogen.com/2016/05/13/shuffle-free-joins-in-spark-sql/
> it kind of boils down to this... if i partition and sort dataframes that get 
> used for joins repeatedly i can avoid shuffles:
> {noformat}
> System.setProperty("spark.sql.autoBroadcastJoinThreshold", "-1")
> val df1 = Seq((0, 0), (1, 1)).toDF("key", "value")
>   
> .repartition(col("key")).sortWithinPartitions(col("key")).persist(StorageLevel.DISK_ONLY)
> val df2 = Seq((0, 0), (1, 1)).toDF("key2", "value2")
>   
> .repartition(col("key2")).sortWithinPartitions(col("key2")).persist(StorageLevel.DISK_ONLY)
> val joined = df1.join(df2, col("key") === col("key2"))
> joined.explain
> == Physical Plan ==
> *SortMergeJoin [key#5], [key2#27], Inner
> :- InMemoryTableScan [key#5, value#6]
> : +- InMemoryRelation [key#5, value#6], true, 1, StorageLevel(disk, 1 
> replicas)
> :   +- *Sort [key#5 ASC NULLS FIRST], false, 0
> :  +- Exchange hashpartitioning(key#5, 4)
> : +- LocalTableScan [key#5, value#6]
> +- InMemoryTableScan [key2#27, value2#28]
>   +- InMemoryRelation [key2#27, value2#28], true, 1, 
> StorageLevel(disk, 1 replicas)
> +- *Sort [key2#27 ASC NULLS FIRST], false, 0
>+- Exchange hashpartitioning(key2#27, 4)
>   +- LocalTableScan [key2#27, value2#28]
> {noformat}
> notice how the persisted dataframes are not shuffled or sorted anymore before 
> being used in the join. however if i try to do the same with dataset i have 
> no luck:
> {noformat}
> val ds1 = Seq((0, 0), (1, 1)).toDS
>   
> .repartition(col("_1")).sortWithinPartitions(col("_1")).persist(StorageLevel.DISK_ONLY)
> val ds2 = Seq((0, 0), (1, 1)).toDS
>   
> .repartition(col("_1")).sortWithinPartitions(col("_1")).persist(StorageLevel.DISK_ONLY)
> val joined1 = ds1.joinWith(ds2, ds1("_1") === ds2("_1"))
> joined1.explain
> == Physical Plan ==
> *SortMergeJoin [_1#105._1], [_2#106._1], Inner
> :- *Sort [_1#105._1 ASC NULLS FIRST], false, 0
> :  +- Exchange hashpartitioning(_1#105._1, 4)
> : +- *Project [named_struct(_1, _1#83, _2, _2#84) AS _1#105]
> :+- InMemoryTableScan [_1#83, _2#84]
> :  +- InMemoryRelation [_1#83, _2#84], true, 1, 
> StorageLevel(disk, 1 replicas)
> :+- *Sort [_1#83 ASC NULLS FIRST], false, 0
> :   +- Exchange hashpartitioning(_1#83, 4)
> :  +- LocalTableScan [_1#83, _2#84]
> +- *Sort [_2#106._1 ASC NULLS FIRST], false, 0
>+- Exchange hashpartitioning(_2#106._1, 4)
>   +- *Project [named_struct(_1, _1#100, _2, _2#101) AS _2#106]
>  +- InMemoryTableScan [_1#100, _2#101]
>+- InMemoryRelation [_1#100, _2#101], true, 1, 
> StorageLevel(disk, 1 replicas)
>  +- *Sort [_1#83 ASC NULLS FIRST], false, 0
> +- Exchange hashpartitioning(_1#83, 4)
>+- LocalTableScan [_1#83, _2#84]
> {noformat}
> notice how my persisted Datasets are shuffled and sorted again. part of the 
> issue seems to be in joinWith, which does some preprocessing that seems to 
> confuse the planner. if i change the joinWith to join (which returns a 
> dataframe) it looks a little better in that only one side gets shuffled 
> again, but still not optimal:
> {noformat}
> val ds1 = Seq((0, 0), (1, 1)).toDS
>   
> .repartition(col("_1")).sortWithinPartitions(col("_1")).persist(StorageLevel.DISK_ONLY)
> val ds2 = Seq((0, 0), (1, 1)).toDS
>   
> .repartition(col("_1")).sortWithinPartitions(col("_1")).persist(StorageLevel.DISK_ONLY)
> val joined1 = ds1.join(ds2, 

[jira] [Assigned] (SPARK-17663) SchedulableBuilder should handle invalid data access via scheduler.allocation.file

2017-02-06 Thread Imran Rashid (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17663?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Imran Rashid reassigned SPARK-17663:


Assignee: Eren Avsarogullari  (was: Imran Rashid)

> SchedulableBuilder should handle invalid data access via 
> scheduler.allocation.file
> --
>
> Key: SPARK-17663
> URL: https://issues.apache.org/jira/browse/SPARK-17663
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 2.1.0
>Reporter: Eren Avsarogullari
>Assignee: Eren Avsarogullari
> Fix For: 2.2.0
>
>
> If spark.scheduler.allocation.file has invalid minShare and/or weight values, 
> these cause:
> - a NumberFormatException due to the toInt call
> - SparkContext cannot be initialized
> - no meaningful error message is shown to the user
> In a nutshell, this functionality can be more robust by selecting one of the 
> following flows :
> *1-* Currently, if schedulingMode has an invalid value, a warning message is 
> logged and default value is set as FIFO. Same pattern can be used for 
> minShare(default: 0) and weight(default: 1) as well
> *2-* Meaningful error message can be shown to the user for all invalid cases.
> *Code to Reproduce* :
> {code}
> val conf = new 
> SparkConf().setAppName("spark-fairscheduler").setMaster("local")
> conf.set("spark.scheduler.mode", "FAIR")
> conf.set("spark.scheduler.allocation.file", 
> "src/main/resources/fairscheduler-invalid-data.xml")
> val sc = new SparkContext(conf)
> {code}
> *fairscheduler-invalid-data.xml* :
> {code}
> 
> 
> FIFO
> invalid_weight
> 2
> 
> 
> {code}
> *Stacktrace* :
> {code}
> Exception in thread "main" java.lang.NumberFormatException: For input string: 
> "invalid_weight"
>   at 
> java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
>   at java.lang.Integer.parseInt(Integer.java:580)
>   at java.lang.Integer.parseInt(Integer.java:615)
>   at 
> scala.collection.immutable.StringLike$class.toInt(StringLike.scala:272)
>   at scala.collection.immutable.StringOps.toInt(StringOps.scala:29)
>   at 
> org.apache.spark.scheduler.FairSchedulableBuilder$$anonfun$org$apache$spark$scheduler$FairSchedulableBuilder$$buildFairSchedulerPool$1.apply(SchedulableBuilder.scala:127)
>   at 
> org.apache.spark.scheduler.FairSchedulableBuilder$$anonfun$org$apache$spark$scheduler$FairSchedulableBuilder$$buildFairSchedulerPool$1.apply(SchedulableBuilder.scala:102)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17663) SchedulableBuilder should handle invalid data access via scheduler.allocation.file

2017-02-06 Thread Imran Rashid (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17663?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Imran Rashid reassigned SPARK-17663:


Assignee: Imran Rashid

> SchedulableBuilder should handle invalid data access via 
> scheduler.allocation.file
> --
>
> Key: SPARK-17663
> URL: https://issues.apache.org/jira/browse/SPARK-17663
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 2.1.0
>Reporter: Eren Avsarogullari
>Assignee: Imran Rashid
> Fix For: 2.2.0
>
>
> If spark.scheduler.allocation.file has invalid minShare and/or weight values, 
> these cause:
> - a NumberFormatException due to the toInt call
> - SparkContext cannot be initialized
> - no meaningful error message is shown to the user
> In a nutshell, this functionality can be more robust by selecting one of the 
> following flows :
> *1-* Currently, if schedulingMode has an invalid value, a warning message is 
> logged and default value is set as FIFO. Same pattern can be used for 
> minShare(default: 0) and weight(default: 1) as well
> *2-* Meaningful error message can be shown to the user for all invalid cases.
> *Code to Reproduce* :
> {code}
> val conf = new 
> SparkConf().setAppName("spark-fairscheduler").setMaster("local")
> conf.set("spark.scheduler.mode", "FAIR")
> conf.set("spark.scheduler.allocation.file", 
> "src/main/resources/fairscheduler-invalid-data.xml")
> val sc = new SparkContext(conf)
> {code}
> *fairscheduler-invalid-data.xml* :
> {code}
> 
> 
> FIFO
> invalid_weight
> 2
> 
> 
> {code}
> *Stacktrace* :
> {code}
> Exception in thread "main" java.lang.NumberFormatException: For input string: 
> "invalid_weight"
>   at 
> java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
>   at java.lang.Integer.parseInt(Integer.java:580)
>   at java.lang.Integer.parseInt(Integer.java:615)
>   at 
> scala.collection.immutable.StringLike$class.toInt(StringLike.scala:272)
>   at scala.collection.immutable.StringOps.toInt(StringOps.scala:29)
>   at 
> org.apache.spark.scheduler.FairSchedulableBuilder$$anonfun$org$apache$spark$scheduler$FairSchedulableBuilder$$buildFairSchedulerPool$1.apply(SchedulableBuilder.scala:127)
>   at 
> org.apache.spark.scheduler.FairSchedulableBuilder$$anonfun$org$apache$spark$scheduler$FairSchedulableBuilder$$buildFairSchedulerPool$1.apply(SchedulableBuilder.scala:102)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-17663) SchedulableBuilder should handle invalid data access via scheduler.allocation.file

2017-02-06 Thread Imran Rashid (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17663?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Imran Rashid resolved SPARK-17663.
--
   Resolution: Fixed
Fix Version/s: 2.2.0

Issue resolved by pull request 15237
[https://github.com/apache/spark/pull/15237]

> SchedulableBuilder should handle invalid data access via 
> scheduler.allocation.file
> --
>
> Key: SPARK-17663
> URL: https://issues.apache.org/jira/browse/SPARK-17663
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 2.1.0
>Reporter: Eren Avsarogullari
> Fix For: 2.2.0
>
>
> If spark.scheduler.allocation.file has invalid minShare and/or weight values, 
> these cause:
> - a NumberFormatException due to the toInt call
> - SparkContext cannot be initialized
> - no meaningful error message is shown to the user
> In a nutshell, this functionality can be more robust by selecting one of the 
> following flows :
> *1-* Currently, if schedulingMode has an invalid value, a warning message is 
> logged and default value is set as FIFO. Same pattern can be used for 
> minShare(default: 0) and weight(default: 1) as well
> *2-* Meaningful error message can be shown to the user for all invalid cases.
> *Code to Reproduce* :
> {code}
> val conf = new 
> SparkConf().setAppName("spark-fairscheduler").setMaster("local")
> conf.set("spark.scheduler.mode", "FAIR")
> conf.set("spark.scheduler.allocation.file", 
> "src/main/resources/fairscheduler-invalid-data.xml")
> val sc = new SparkContext(conf)
> {code}
> *fairscheduler-invalid-data.xml* :
> {code}
> 
> 
> FIFO
> invalid_weight
> 2
> 
> 
> {code}
> *Stacktrace* :
> {code}
> Exception in thread "main" java.lang.NumberFormatException: For input string: 
> "invalid_weight"
>   at 
> java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
>   at java.lang.Integer.parseInt(Integer.java:580)
>   at java.lang.Integer.parseInt(Integer.java:615)
>   at 
> scala.collection.immutable.StringLike$class.toInt(StringLike.scala:272)
>   at scala.collection.immutable.StringOps.toInt(StringOps.scala:29)
>   at 
> org.apache.spark.scheduler.FairSchedulableBuilder$$anonfun$org$apache$spark$scheduler$FairSchedulableBuilder$$buildFairSchedulerPool$1.apply(SchedulableBuilder.scala:127)
>   at 
> org.apache.spark.scheduler.FairSchedulableBuilder$$anonfun$org$apache$spark$scheduler$FairSchedulableBuilder$$buildFairSchedulerPool$1.apply(SchedulableBuilder.scala:102)
> {code}
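
For context, a minimal sketch of the defensive parsing that option *1* above asks for 
(illustrative only; not necessarily what pull request 15237 actually does):

{code}
// Fall back to a default and warn instead of failing SparkContext creation.
def parseIntWithDefault(raw: String, default: Int, property: String): Int =
  try raw.trim.toInt
  catch {
    case _: NumberFormatException =>
      println(s"Warning: invalid value '$raw' for $property, using default $default")
      default
  }

parseIntWithDefault("invalid_weight", 1, "weight")   // -> 1, with a warning
parseIntWithDefault("2", 0, "minShare")              // -> 2
{code}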



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19439) PySpark's registerJavaFunction Should Support UDAFs

2017-02-06 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15854034#comment-15854034
 ] 

Hyukjin Kwon commented on SPARK-19439:
--

So, as you said, is this a duplicate of SPARK-10915? If so, it'd maybe make 
sense just to add some comment there or reopen rather than creating a new one.

> PySpark's registerJavaFunction Should Support UDAFs
> ---
>
> Key: SPARK-19439
> URL: https://issues.apache.org/jira/browse/SPARK-19439
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.1.0
>Reporter: Keith Bourgoin
>
> When trying to import a Scala UDAF using registerJavaFunction, I get this 
> error:
> {code}
> In [1]: sqlContext.registerJavaFunction('geo_mean', 
> 'com.foo.bar.GeometricMean')
> ---
> Py4JJavaError Traceback (most recent call last)
>  in ()
> > 1 sqlContext.registerJavaFunction('geo_mean', 
> 'com.foo.bar.GeometricMean')
> /home/kfb/src/projects/spark/python/pyspark/sql/context.pyc in 
> registerJavaFunction(self, name, javaClassName, returnType)
> 227 if returnType is not None:
> 228 jdt = 
> self.sparkSession._jsparkSession.parseDataType(returnType.json())
> --> 229 self.sparkSession._jsparkSession.udf().registerJava(name, 
> javaClassName, jdt)
> 230
> 231 # TODO(andrew): delete this once we refactor things to take in 
> SparkSession
> /home/kfb/src/projects/spark/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py
>  in __call__(self, *args)
>1131 answer = self.gateway_client.send_command(command)
>1132 return_value = get_return_value(
> -> 1133 answer, self.gateway_client, self.target_id, self.name)
>1134
>1135 for temp_arg in temp_args:
> /home/kfb/src/projects/spark/python/pyspark/sql/utils.pyc in deco(*a, **kw)
>  61 def deco(*a, **kw):
>  62 try:
> ---> 63 return f(*a, **kw)
>  64 except py4j.protocol.Py4JJavaError as e:
>  65 s = e.java_exception.toString()
> /home/kfb/src/projects/spark/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py 
> in get_return_value(answer, gateway_client, target_id, name)
> 317 raise Py4JJavaError(
> 318 "An error occurred while calling {0}{1}{2}.\n".
> --> 319 format(target_id, ".", name), value)
> 320 else:
> 321 raise Py4JError(
> Py4JJavaError: An error occurred while calling o28.registerJava.
> : java.io.IOException: UDF class com.foo.bar.GeometricMean doesn't implement 
> any UDF interface
>   at 
> org.apache.spark.sql.UDFRegistration.registerJava(UDFRegistration.scala:438)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
>   at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
>   at py4j.Gateway.invoke(Gateway.java:280)
>   at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
>   at py4j.commands.CallCommand.execute(CallCommand.java:79)
>   at py4j.GatewayConnection.run(GatewayConnection.java:214)
>   at java.lang.Thread.run(Thread.java:745)
> {code}
> According to SPARK-10915, UDAFs in Python aren't happening anytime soon. 
> Without this, there's no way to get Scala UDAFs into Python Spark SQL 
> whatsoever. Fixing that would be a huge help so that we can keep aggregations 
> in the JVM and keep using DataFrames. Otherwise, all our code has to drop to 
> RDDs and live in Python.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-19440) Window in pyspark doesn't have attributes unboundedPreceding, unboundedFollowing and currentRow

2017-02-06 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19440?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-19440.
--
Resolution: Invalid

It seems these attributes are there, as shown below:

{code}
>>> from pyspark.sql import Window
>>> window = Window.orderBy("date").rowsBetween(Window.unboundedPreceding, 
>>> Window.currentRow)
>>> dir(Window)
['_FOLLOWING_THRESHOLD', '_JAVA_MAX_LONG', '_JAVA_MIN_LONG', 
'_PRECEDING_THRESHOLD', '__class__', '__delattr__', '__dict__', '__doc__', 
'__format__', '__getattribute__', '__hash__', '__init__', '__module__', 
'__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', 
'__sizeof__', '__str__', '__subclasshook__', '__weakref__', 'currentRow', 
'orderBy', 'partitionBy', 'rangeBetween', 'rowsBetween', 'unboundedFollowing', 
'unboundedPreceding']
>>>
{code}

Please reopen this if I am mistaken.

> Window in pyspark doesn't have attributes unboundedPreceding, 
> unboundedFollowing and currentRow
> ---
>
> Key: SPARK-19440
> URL: https://issues.apache.org/jira/browse/SPARK-19440
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.0.0
>Reporter: Franklyn Dsouza
>Priority: Trivial
>
> The Window class in pyspark doesn't have the attributes unboundedPreceding, 
> unboundedFollowing and currentRow, despite the documentation suggesting these 
> attributes be used.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-19442) Unable to add column to the dataset using Dataset.WithColumn() api

2017-02-06 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19442?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-19442.
--
Resolution: Cannot Reproduce

I am resolving this as I can't reproduce it in the current master, as shown below:

{code}
// Replace existing column
spark.range(10).toDF("newColumnName")
  .withColumn("newColumnName", new Column("newColumnName").cast("int"))
  .show();

// Adding a new column
spark.range(10).toDF("a")
   .withColumn("newColumnName", new Column("a").cast("int"))
   .show();
{code}

prints

{code}
+-------------+
|newColumnName|
+-------------+
|            0|
|            1|
|            2|
|            3|
|            4|
|            5|
|            6|
|            7|
|            8|
|            9|
+-------------+

+---+-------------+
|  a|newColumnName|
+---+-------------+
|  0|            0|
|  1|            1|
|  2|            2|
|  3|            3|
|  4|            4|
|  5|            5|
|  6|            6|
|  7|            7|
|  8|            8|
|  9|            9|
+---+-------------+
{code}

Please reopen this if I am mistaken

> Unable to add column to the dataset using Dataset.WithColumn() api
> --
>
> Key: SPARK-19442
> URL: https://issues.apache.org/jira/browse/SPARK-19442
> Project: Spark
>  Issue Type: Bug
>  Components: Java API
>Affects Versions: 2.0.2
>Reporter: Navya Krishnappa
>
> When I'm creating a new column using the Dataset.withColumn() API, an 
> AnalysisException is thrown.
> Dataset.withColumn() API: 
> Dataset.withColumn("newColumnName", new 
> org.apache.spark.sql.Column("newColumnName").cast("int"));
> Stacktrace: 
> cannot resolve '`NewColumn`' given input columns: [abc,xyz ]



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19451) Long values in Window function

2017-02-06 Thread Herman van Hovell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15853963#comment-15853963
 ] 

Herman van Hovell commented on SPARK-19451:
---

At the end of the day I would like to support arbitrary literals for range 
frames. See: https://issues.apache.org/jira/browse/SPARK-9221

> Long values in Window function
> --
>
> Key: SPARK-19451
> URL: https://issues.apache.org/jira/browse/SPARK-19451
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.1, 2.0.2
>Reporter: Julien Champ
>
> Hi there,
> there seems to be a major limitation in spark window functions and 
> rangeBetween method.
> If I have the following code :
> {code:title=Exemple |borderStyle=solid}
> val tw =  Window.orderBy("date")
>   .partitionBy("id")
>   .rangeBetween( from , 0)
> {code}
> Everything seems OK as long as the *from* value is not too large... even though the 
> rangeBetween() method supports Long parameters.
> But if I set *-216000L* as the *from* value, it does not work!
> It is probably related to this part of code in the between() method, of the 
> WindowSpec class, called by rangeBetween()
> {code:title=between() method|borderStyle=solid}
> val boundaryStart = start match {
>   case 0 => CurrentRow
>   case Long.MinValue => UnboundedPreceding
>   case x if x < 0 => ValuePreceding(-start.toInt)
>   case x if x > 0 => ValueFollowing(start.toInt)
> }
> {code}
> ( look at this *.toInt* )
> Does anybody know if there's a way to solve / patch this behavior?
> Any help will be appreciated
> Thx



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19451) Long values in Window function

2017-02-06 Thread Julien Champ (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15853936#comment-15853936
 ] 

Julien Champ commented on SPARK-19451:
--

Glad to see that I'm not the only one convinced by this usage !

This probably needs to use different data structures for rowBetween() and 
rangeBetween()

> Long values in Window function
> --
>
> Key: SPARK-19451
> URL: https://issues.apache.org/jira/browse/SPARK-19451
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.1, 2.0.2
>Reporter: Julien Champ
>
> Hi there,
> there seems to be a major limitation in spark window functions and 
> rangeBetween method.
> If I have the following code :
> {code:title=Exemple |borderStyle=solid}
> val tw =  Window.orderBy("date")
>   .partitionBy("id")
>   .rangeBetween( from , 0)
> {code}
> Everything seems OK as long as the *from* value is not too large... even though the 
> rangeBetween() method supports Long parameters.
> But if I set *-216000L* as the *from* value, it does not work!
> It is probably related to this part of code in the between() method, of the 
> WindowSpec class, called by rangeBetween()
> {code:title=between() method|borderStyle=solid}
> val boundaryStart = start match {
>   case 0 => CurrentRow
>   case Long.MinValue => UnboundedPreceding
>   case x if x < 0 => ValuePreceding(-start.toInt)
>   case x if x > 0 => ValueFollowing(start.toInt)
> }
> {code}
> ( look at this *.toInt* )
> Does anybody know if there's a way to solve / patch this behavior?
> Any help will be appreciated
> Thx



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19471) [SQL]A confusing NullPointerException when creating table

2017-02-06 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19471?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19471:


Assignee: (was: Apache Spark)

> [SQL]A confusing NullPointerException when creating table
> -
>
> Key: SPARK-19471
> URL: https://issues.apache.org/jira/browse/SPARK-19471
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: StanZhai
>Priority: Critical
>
> After upgrading our Spark from 1.6.2 to 2.1.0, I encounter a confusing 
> NullPointerException when creating a table under Spark 2.1.0, but the problem 
> does not exist in Spark 1.6.1. 
> Environment: Hive 1.2.1, Hadoop 2.6.4 
>  Code  
> // spark is an instance of HiveContext 
> // merge is a Hive UDF 
> val df = spark.sql("SELECT merge(field_a, null) AS new_a, field_b AS new_b 
> FROM tb_1 group by field_a, field_b") 
> df.createTempView("tb_temp") 
> spark.sql("create table tb_result stored as parquet as " + 
>   "SELECT new_a" + 
>   "FROM tb_temp" + 
>   "LEFT JOIN `tb_2` ON " + 
>   "if(((`tb_temp`.`new_b`) = '' OR (`tb_temp`.`new_b`) IS NULL), 
> concat('GrLSRwZE_', cast((rand() * 200) AS int)), (`tb_temp`.`new_b`)) = 
> `tb_2`.`fka6862f17`") 
>  Physical Plan  
> *Project [new_a] 
> +- *BroadcastHashJoin [if (((new_b = ) || isnull(new_b))) concat(GrLSRwZE_, 
> cast(cast((_nondeterministic * 200.0) as int) as string)) else new_b], 
> [fka6862f17], LeftOuter, BuildRight 
>:- HashAggregate(keys=[field_a, field_b], functions=[], output=[new_a, 
> new_b, _nondeterministic]) 
>:  +- Exchange(coordinator ) hashpartitioning(field_a, field_b, 180), 
> coordinator[target post-shuffle partition size: 1024880] 
>: +- *HashAggregate(keys=[field_a, field_b], functions=[], 
> output=[field_a, field_b]) 
>:+- *FileScan parquet bdp.tb_1[field_a,field_b] Batched: true, 
> Format: Parquet, Location: InMemoryFileIndex[hdfs://hdcluster/data/tb_1, 
> PartitionFilters: [], PushedFilters: [], ReadSchema: struct 
>+- BroadcastExchange HashedRelationBroadcastMode(List(input[0, string, 
> true])) 
>   +- *Project [fka6862f17] 
>  +- *FileScan parquet bdp.tb_2[fka6862f17] Batched: true, Format: 
> Parquet, Location: InMemoryFileIndex[hdfs://hdcluster/data/tb_2, 
> PartitionFilters: [], PushedFilters: [], ReadSchema: struct 
> What does '*' mean before HashAggregate? 
>  Exception  
> org.apache.spark.SparkException: Task failed while writing rows 
> ... 
> java.lang.NullPointerException 
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply_2$(Unknown
>  Source) 
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown
>  Source) 
> at 
> org.apache.spark.sql.execution.aggregate.AggregationIterator$$anonfun$generateResultProjection$3.apply(AggregationIterator.scala:260)
>  
> at 
> org.apache.spark.sql.execution.aggregate.AggregationIterator$$anonfun$generateResultProjection$3.apply(AggregationIterator.scala:259)
>  
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.next(TungstenAggregationIterator.scala:392)
>  
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.next(TungstenAggregationIterator.scala:79)
>  
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source) 
> at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>  
> at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377)
>  
> at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.execute(FileFormatWriter.scala:252)
>  
> at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:199)
>  
> at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:197)
>  
> at 
> org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1341)
>  
> at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:202)
>  
> at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$4.apply(FileFormatWriter.scala:138)
>  
> at 
> 

[jira] [Assigned] (SPARK-19471) [SQL]A confusing NullPointerException when creating table

2017-02-06 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19471?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19471:


Assignee: Apache Spark

> [SQL]A confusing NullPointerException when creating table
> -
>
> Key: SPARK-19471
> URL: https://issues.apache.org/jira/browse/SPARK-19471
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: StanZhai
>Assignee: Apache Spark
>Priority: Critical
>
> After upgrading our Spark from 1.6.2 to 2.1.0, I encounter a confusing 
> NullPointerException when creating a table under Spark 2.1.0, but the problem 
> does not exist in Spark 1.6.1. 
> Environment: Hive 1.2.1, Hadoop 2.6.4 
>  Code  
> // spark is an instance of HiveContext 
> // merge is a Hive UDF 
> val df = spark.sql("SELECT merge(field_a, null) AS new_a, field_b AS new_b 
> FROM tb_1 group by field_a, field_b") 
> df.createTempView("tb_temp") 
> spark.sql("create table tb_result stored as parquet as " + 
>   "SELECT new_a" + 
>   "FROM tb_temp" + 
>   "LEFT JOIN `tb_2` ON " + 
>   "if(((`tb_temp`.`new_b`) = '' OR (`tb_temp`.`new_b`) IS NULL), 
> concat('GrLSRwZE_', cast((rand() * 200) AS int)), (`tb_temp`.`new_b`)) = 
> `tb_2`.`fka6862f17`") 
>  Physical Plan  
> *Project [new_a] 
> +- *BroadcastHashJoin [if (((new_b = ) || isnull(new_b))) concat(GrLSRwZE_, 
> cast(cast((_nondeterministic * 200.0) as int) as string)) else new_b], 
> [fka6862f17], LeftOuter, BuildRight 
>:- HashAggregate(keys=[field_a, field_b], functions=[], output=[new_a, 
> new_b, _nondeterministic]) 
>:  +- Exchange(coordinator ) hashpartitioning(field_a, field_b, 180), 
> coordinator[target post-shuffle partition size: 1024880] 
>: +- *HashAggregate(keys=[field_a, field_b], functions=[], 
> output=[field_a, field_b]) 
>:+- *FileScan parquet bdp.tb_1[field_a,field_b] Batched: true, 
> Format: Parquet, Location: InMemoryFileIndex[hdfs://hdcluster/data/tb_1, 
> PartitionFilters: [], PushedFilters: [], ReadSchema: struct 
>+- BroadcastExchange HashedRelationBroadcastMode(List(input[0, string, 
> true])) 
>   +- *Project [fka6862f17] 
>  +- *FileScan parquet bdp.tb_2[fka6862f17] Batched: true, Format: 
> Parquet, Location: InMemoryFileIndex[hdfs://hdcluster/data/tb_2, 
> PartitionFilters: [], PushedFilters: [], ReadSchema: struct 
> What does '*' mean before HashAggregate? 
>  Exception  
> org.apache.spark.SparkException: Task failed while writing rows 
> ... 
> java.lang.NullPointerException 
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply_2$(Unknown
>  Source) 
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown
>  Source) 
> at 
> org.apache.spark.sql.execution.aggregate.AggregationIterator$$anonfun$generateResultProjection$3.apply(AggregationIterator.scala:260)
>  
> at 
> org.apache.spark.sql.execution.aggregate.AggregationIterator$$anonfun$generateResultProjection$3.apply(AggregationIterator.scala:259)
>  
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.next(TungstenAggregationIterator.scala:392)
>  
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.next(TungstenAggregationIterator.scala:79)
>  
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source) 
> at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>  
> at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377)
>  
> at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.execute(FileFormatWriter.scala:252)
>  
> at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:199)
>  
> at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:197)
>  
> at 
> org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1341)
>  
> at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:202)
>  
> at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$4.apply(FileFormatWriter.scala:138)
>  

[jira] [Commented] (SPARK-19471) [SQL]A confusing NullPointerException when creating table

2017-02-06 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19471?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15853906#comment-15853906
 ] 

Apache Spark commented on SPARK-19471:
--

User 'yangw1234' has created a pull request for this issue:
https://github.com/apache/spark/pull/16820

> [SQL]A confusing NullPointerException when creating table
> -
>
> Key: SPARK-19471
> URL: https://issues.apache.org/jira/browse/SPARK-19471
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: StanZhai
>Priority: Critical
>
> After upgrading our Spark from 1.6.2 to 2.1.0, I encounter a confusing 
> NullPointerException when creating a table under Spark 2.1.0, but the problem 
> does not exist in Spark 1.6.1. 
> Environment: Hive 1.2.1, Hadoop 2.6.4 
>  Code  
> // spark is an instance of HiveContext 
> // merge is a Hive UDF 
> val df = spark.sql("SELECT merge(field_a, null) AS new_a, field_b AS new_b 
> FROM tb_1 group by field_a, field_b") 
> df.createTempView("tb_temp") 
> spark.sql("create table tb_result stored as parquet as " + 
>   "SELECT new_a" + 
>   "FROM tb_temp" + 
>   "LEFT JOIN `tb_2` ON " + 
>   "if(((`tb_temp`.`new_b`) = '' OR (`tb_temp`.`new_b`) IS NULL), 
> concat('GrLSRwZE_', cast((rand() * 200) AS int)), (`tb_temp`.`new_b`)) = 
> `tb_2`.`fka6862f17`") 
>  Physical Plan  
> *Project [new_a] 
> +- *BroadcastHashJoin [if (((new_b = ) || isnull(new_b))) concat(GrLSRwZE_, 
> cast(cast((_nondeterministic * 200.0) as int) as string)) else new_b], 
> [fka6862f17], LeftOuter, BuildRight 
>:- HashAggregate(keys=[field_a, field_b], functions=[], output=[new_a, 
> new_b, _nondeterministic]) 
>:  +- Exchange(coordinator ) hashpartitioning(field_a, field_b, 180), 
> coordinator[target post-shuffle partition size: 1024880] 
>: +- *HashAggregate(keys=[field_a, field_b], functions=[], 
> output=[field_a, field_b]) 
>:+- *FileScan parquet bdp.tb_1[field_a,field_b] Batched: true, 
> Format: Parquet, Location: InMemoryFileIndex[hdfs://hdcluster/data/tb_1, 
> PartitionFilters: [], PushedFilters: [], ReadSchema: struct 
>+- BroadcastExchange HashedRelationBroadcastMode(List(input[0, string, 
> true])) 
>   +- *Project [fka6862f17] 
>  +- *FileScan parquet bdp.tb_2[fka6862f17] Batched: true, Format: 
> Parquet, Location: InMemoryFileIndex[hdfs://hdcluster/data/tb_2, 
> PartitionFilters: [], PushedFilters: [], ReadSchema: struct 
> What does '*' mean before HashAggregate? 
>  Exception  
> org.apache.spark.SparkException: Task failed while writing rows 
> ... 
> java.lang.NullPointerException 
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply_2$(Unknown
>  Source) 
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown
>  Source) 
> at 
> org.apache.spark.sql.execution.aggregate.AggregationIterator$$anonfun$generateResultProjection$3.apply(AggregationIterator.scala:260)
>  
> at 
> org.apache.spark.sql.execution.aggregate.AggregationIterator$$anonfun$generateResultProjection$3.apply(AggregationIterator.scala:259)
>  
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.next(TungstenAggregationIterator.scala:392)
>  
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.next(TungstenAggregationIterator.scala:79)
>  
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source) 
> at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>  
> at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377)
>  
> at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.execute(FileFormatWriter.scala:252)
>  
> at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:199)
>  
> at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:197)
>  
> at 
> org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1341)
>  
> at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:202)
>  
> at 
> 

[jira] [Resolved] (SPARK-19469) PySpark should allow driver process on different machine

2017-02-06 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19469?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-19469.
---
Resolution: Not A Problem

I don't think this makes sense. The JVM and Python process must be colocated.

(Please use user@ for questions)

> PySpark should allow driver process on different machine
> 
>
> Key: SPARK-19469
> URL: https://issues.apache.org/jira/browse/SPARK-19469
> Project: Spark
>  Issue Type: Wish
>  Components: PySpark
>Affects Versions: 1.6.3
>Reporter: haiyangsea
>
> In my scenario, there is a resident Spark driver process (created in yarn 
> cluster mode), and all PySpark python clients connect to this resident driver 
> process and share the SparkContext. In the python client process, I also run 
> other applications that have requirements for the environment. So I want to 
> separate the PySpark python client process from the driver process.
> There are two main limitations in PySpark:
> 1. The *parallelize* method uses a local file to transfer data between the java 
> process and the python process.
> 2. The hard-coded *localhost* is used to transfer task results or intermediate 
> data between the java process and the python process.
> Is there any way to achieve my goal?



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-18341) Eliminate use of SingularMatrixException in WeightedLeastSquares logic

2017-02-06 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18341?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-18341.
---
Resolution: Won't Fix

> Eliminate use of SingularMatrixException in WeightedLeastSquares logic
> --
>
> Key: SPARK-18341
> URL: https://issues.apache.org/jira/browse/SPARK-18341
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Joseph K. Bradley
>Priority: Minor
>
> WeightedLeastSquares uses an Exception to implement fallback logic for which 
> solver to use: 
> [https://github.com/apache/spark/blob/6f3697136aa68dc39d3ce42f43a7af554d2a3bf9/mllib/src/main/scala/org/apache/spark/ml/optim/WeightedLeastSquares.scala#L258]
> We should use an error code instead of an exception.
> * Note the error code should be internal, not a public API.
> * We may be able to eliminate the SingularMatrixException class.
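
The fallback described above could be signalled with an internal result value instead of a thrown exception. A minimal, self-contained sketch with made-up names (not the actual WeightedLeastSquares internals):

{code}
// Hypothetical illustration: an internal result type replaces throwing
// SingularMatrixException when the normal equation turns out to be singular.
sealed trait NormalEquationResult
final case class Solved(coefficients: Array[Double]) extends NormalEquationResult
case object SingularMatrix extends NormalEquationResult

object SolverFallbackSketch {
  // Stand-in for the Cholesky path: reports singularity via the result type.
  def solveCholesky(isSingular: Boolean): NormalEquationResult =
    if (isSingular) SingularMatrix else Solved(Array(1.0, 2.0))

  // Fallback keyed off the result value instead of a try/catch.
  def solve(isSingular: Boolean): Array[Double] = solveCholesky(isSingular) match {
    case Solved(coef)   => coef
    case SingularMatrix => Array(0.0, 0.0) // e.g. switch to the quasi-Newton solver here
  }

  def main(args: Array[String]): Unit = {
    println(solve(isSingular = false).mkString(", "))
    println(solve(isSingular = true).mkString(", "))
  }
}
{code}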



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19449) Inconsistent results between ml package RandomForestClassificationModel and mllib package RandomForestModel

2017-02-06 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15853831#comment-15853831
 ] 

Sean Owen commented on SPARK-19449:
---

This isn't a bug. Even if they're both deterministic, it's not expected that 
two different implementations give exactly the same answers. They should 
give similar answers.
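
In practice, "similar" is usually checked by comparing the two implementations' predictions (or an aggregate metric) within a tolerance rather than expecting bit-for-bit equality; a trivial sketch:

{code}
// Trivial sketch: treat two scores as consistent if they agree within eps.
// The numbers below are made up, not taken from this report.
val eps = 1e-3
def similar(a: Double, b: Double): Boolean = math.abs(a - b) <= eps

val accuracyMl    = 0.9120
val accuracyMllib = 0.9118
assert(similar(accuracyMl, accuracyMllib))
{code}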

> Inconsistent results between ml package RandomForestClassificationModel and 
> mllib package RandomForestModel
> ---
>
> Key: SPARK-19449
> URL: https://issues.apache.org/jira/browse/SPARK-19449
> Project: Spark
>  Issue Type: Bug
>  Components: ML, MLlib
>Affects Versions: 2.1.0
>Reporter: Aseem Bansal
>
> I worked on some code to convert the ml package RandomForestClassificationModel 
> to the mllib package RandomForestModel. It was needed because we need to make 
> predictions on the order of ms. I found that the results are inconsistent 
> although the underlying DecisionTreeModels are exactly the same. So the 
> behavior between the 2 implementations is inconsistent, which should not be 
> the case.
> The below code can be used to reproduce the issue. You can run this as a simple 
> Java app as long as you have the spark dependencies set up properly.
> {noformat}
> import org.apache.spark.ml.Transformer;
> import org.apache.spark.ml.classification.*;
> import org.apache.spark.ml.linalg.*;
> import org.apache.spark.ml.regression.RandomForestRegressionModel;
> import org.apache.spark.mllib.linalg.DenseVector;
> import org.apache.spark.mllib.linalg.Vector;
> import org.apache.spark.mllib.tree.configuration.Algo;
> import org.apache.spark.mllib.tree.model.DecisionTreeModel;
> import org.apache.spark.mllib.tree.model.RandomForestModel;
> import org.apache.spark.sql.Dataset;
> import org.apache.spark.sql.Row;
> import org.apache.spark.sql.RowFactory;
> import org.apache.spark.sql.SparkSession;
> import org.apache.spark.sql.types.DataTypes;
> import org.apache.spark.sql.types.Metadata;
> import org.apache.spark.sql.types.StructField;
> import org.apache.spark.sql.types.StructType;
> import scala.Enumeration;
> import java.util.ArrayList;
> import java.util.List;
> import java.util.Random;
> abstract class Predictor {
> abstract double predict(Vector vector);
> }
> public class MainConvertModels {
> public static final int seed = 42;
> public static void main(String[] args) {
> int numRows = 1000;
> int numFeatures = 3;
> int numClasses = 2;
> double trainFraction = 0.8;
> double testFraction = 0.2;
> SparkSession spark = SparkSession.builder()
> .appName("conversion app")
> .master("local")
> .getOrCreate();
> Dataset data = getDummyData(spark, numRows, numFeatures, 
> numClasses);
> Dataset[] splits = data.randomSplit(new double[]{trainFraction, 
> testFraction}, seed);
> Dataset trainingData = splits[0];
> Dataset testData = splits[1];
> testData.cache();
> List labels = getLabels(testData);
> List features = getFeatures(testData);
> DecisionTreeClassifier classifier1 = new DecisionTreeClassifier();
> DecisionTreeClassificationModel model1 = 
> classifier1.fit(trainingData);
> final DecisionTreeModel convertedModel1 = 
> convertDecisionTreeModel(model1, Algo.Classification());
> RandomForestClassifier classifier = new RandomForestClassifier();
> RandomForestClassificationModel model2 = classifier.fit(trainingData);
> final RandomForestModel convertedModel2 = 
> convertRandomForestModel(model2);
> System.out.println(
> "** DecisionTreeClassifier\n" +
> "** Original **" + getInfo(model1, testData) + "\n" +
> "** New  **" + getInfo(new Predictor() {
> double predict(Vector vector) {return 
> convertedModel1.predict(vector);}
> }, labels, features) + "\n" +
> "\n" +
> "** RandomForestClassifier\n" +
> "** Original **" + getInfo(model2, testData) + "\n" +
> "** New  **" + getInfo(new Predictor() {double 
> predict(Vector vector) {return convertedModel2.predict(vector);}}, labels, 
> features) + "\n" +
> "\n" +
> "");
> }
> static Dataset getDummyData(SparkSession spark, int numberRows, int 
> numberFeatures, int labelUpperBound) {
> StructType schema = new StructType(new StructField[]{
> new StructField("label", DataTypes.DoubleType, false, 
> Metadata.empty()),
> new StructField("features", new VectorUDT(), false, 
> 

[jira] [Comment Edited] (SPARK-10643) Support remote application download in client mode spark submit

2017-02-06 Thread wangqiaoshi (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15853817#comment-15853817
 ] 

wangqiaoshi edited comment on SPARK-10643 at 2/6/17 10:50 AM:
--

+1.
I think it would be useful when using azkaban in multi-executor mode; I expect 
to get the execution jar from hdfs rather than from mysql.

e.g.
Client mode is not supported today:
when we define a spark job in client mode, azkaban executor A needs to get xx.jar 
from host a, and azkaban executor B also needs to get xx.jar from host b.


was (Author: wangqiaoshi):
+1.
I think it would be useful when using azkaban in multi-executor mode; I expect 
to get the execution jar from hdfs rather than from mysql.

> Support remote application download in client mode spark submit
> ---
>
> Key: SPARK-10643
> URL: https://issues.apache.org/jira/browse/SPARK-10643
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Submit
>Reporter: Alan Braithwaite
>Priority: Minor
>
> When using mesos with docker and marathon, it would be nice to be able to 
> make spark-submit deployable on marathon and have that download a jar from 
> HDFS instead of having to package the jar into the docker image.
> {code}
> $ docker run -it docker.example.com/spark:latest 
> /usr/local/spark/bin/spark-submit  --class 
> com.example.spark.streaming.EventHandler hdfs://hdfs/tmp/application.jar 
> Warning: Skip remote jar hdfs://hdfs/tmp/application.jar.
> java.lang.ClassNotFoundException: com.example.spark.streaming.EventHandler
> at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
> at java.lang.Class.forName0(Native Method)
> at java.lang.Class.forName(Class.java:348)
> at org.apache.spark.util.Utils$.classForName(Utils.scala:173)
> at 
> org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:639)
> at 
> org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
> at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
> at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:120)
> at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> {code}
> Although I'm aware that we can run in cluster mode with mesos, we've already 
> built some nice tools surrounding marathon for logging and monitoring.
> Code in question:
> https://github.com/apache/spark/blob/132718ad7f387e1002b708b19e471d9cd907e105/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L723-L736



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-19470) Spark 1.6.4 in Intellij can't use jetty 8

2017-02-06 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19470?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-19470.
---
Resolution: Invalid

There is no Spark 1.6.4, and you can't expect to change a dependency with no 
other changes required in the code.

> Spark 1.6.4 in Intellij can't use jetty 8
> -
>
> Key: SPARK-19470
> URL: https://issues.apache.org/jira/browse/SPARK-19470
> Project: Spark
>  Issue Type: Bug
>  Components: Project Infra
>Affects Versions: 1.6.1, 1.6.4
>Reporter: Mingda Li
>   Original Estimate: 1m
>  Remaining Estimate: 1m
>
> When I build the project in Intellij, I get the following error:
> [error] 
> /Users/MingdaLi/Desktop/ucla_4/research/spark-branch-1.6.4/core/src/main/scala/org/apache/spark/HttpServer.scala:22:
>  object ssl is not a member of package org.eclipse.jetty.server
> [error] import org.eclipse.jetty.server.ssl.SslSocketConnector
> [error] ^
> [error] 
> /Users/MingdaLi/Desktop/ucla_4/research/spark-branch-1.6.4/core/src/main/scala/org/apache/spark/HttpServer.scala:28:
>  object bio is not a member of package org.eclipse.jetty.server
> [error] import org.eclipse.jetty.server.bio.SocketConnector
> [error] ^
> [error] 
> /Users/MingdaLi/Desktop/ucla_4/research/spark-branch-1.6.4/core/src/main/scala/org/apache/spark/HttpServer.scala:78:
>  not found: type SslSocketConnector
> [error]   .map(new SslSocketConnector(_)).getOrElse(new SocketConnector)
> [error]^
> [error] 
> /Users/MingdaLi/Desktop/ucla_4/research/spark-branch-1.6.4/core/src/main/scala/org/apache/spark/HttpServer.scala:87:
>  value setThreadPool is not a member of org.eclipse.jetty.server.Server
> [error] server.setThreadPool(threadPool)
> [error]^
> [error] 
> /Users/MingdaLi/Desktop/ucla_4/research/spark-branch-1.6.4/core/src/main/scala/org/apache/spark/HttpServer.scala:106:
>  value getLocalPort is not a member of org.eclipse.jetty.server.Connector
> [error] val actualPort = server.getConnectors()(0).getLocalPort
> [error]^
> [error] 
> /Users/MingdaLi/Desktop/ucla_4/research/spark-branch-1.6.4/core/src/main/scala/org/apache/spark/SSLOptions.scala:66:
>  type mismatch;
> [error]  found   : String
> [error]  required: java.security.KeyStore
> [error]   trustStore.foreach(file => 
> sslContextFactory.setTrustStore(file.getAbsolutePath))
> [error]   
> ^
> [error] 
> /Users/MingdaLi/Desktop/ucla_4/research/spark-branch-1.6.4/core/src/main/scala/org/apache/spark/deploy/rest/RestSubmissionServer.scala:84:
>  value setThreadPool is not a member of org.eclipse.jetty.server.Server
> [error] server.setThreadPool(threadPool)
> [error]^
> [error] 
> /Users/MingdaLi/Desktop/ucla_4/research/spark-branch-1.6.4/core/src/main/scala/org/apache/spark/deploy/rest/RestSubmissionServer.scala:92:
>  value getLocalPort is not a member of org.eclipse.jetty.server.Connector
> [error] val boundPort = server.getConnectors()(0).getLocalPort
> [error]   
> I think this is caused by using jetty 9.3.11 instead of 8.1.19.
> Actually, I solved this by manually removing the 9.3.11 jetty libraries 
> from the project's libraries and adding 8.1.19 instead. Then the build 
> runs well.
> But how can I avoid solving this manually?
> And I am afraid that next time I open Intellij, this will appear again.  
> 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19474) SparkSQL does not support changing a hive table's column name/dataType

2017-02-06 Thread Xiaochen Ouyang (JIRA)
Xiaochen Ouyang created SPARK-19474:
---

 Summary: SparkSQL does not support changing a hive table's column name/dataType
 Key: SPARK-19474
 URL: https://issues.apache.org/jira/browse/SPARK-19474
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.0.1, 2.0.0
 Environment: Spark2.0.1
Reporter: Xiaochen Ouyang


After creating a hive table, it is not possible to change a column's data 
type or name. It would be useful to provide a public interface (e.g. SQL) to do 
that, along the lines of the sketch below. 
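
For reference, the kind of SQL interface being asked for is the Hive-style column-change DDL. A sketch of what issuing it through Spark SQL could look like (the table and column names are made up, and Spark did not accept this statement at the time of this report):

{code}
import org.apache.spark.sql.SparkSession

// Hypothetical example of the requested capability: rename a column and change
// its type on a Hive table. `events` and `cnt` are made-up identifiers.
val spark = SparkSession.builder()
  .appName("alter-column-sketch")
  .enableHiveSupport()
  .getOrCreate()

spark.sql("ALTER TABLE events CHANGE COLUMN cnt event_count BIGINT")
{code}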



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10643) Support remote application download in client mode spark submit

2017-02-06 Thread wangqiaoshi (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15853817#comment-15853817
 ] 

wangqiaoshi commented on SPARK-10643:
-

+1.
I think it would be useful when using azkaban in multi-executor mode; I expect 
to get the execution jar from hdfs rather than from mysql.

> Support remote application download in client mode spark submit
> ---
>
> Key: SPARK-10643
> URL: https://issues.apache.org/jira/browse/SPARK-10643
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Submit
>Reporter: Alan Braithwaite
>Priority: Minor
>
> When using mesos with docker and marathon, it would be nice to be able to 
> make spark-submit deployable on marathon and have that download a jar from 
> HDFS instead of having to package the jar into the docker image.
> {code}
> $ docker run -it docker.example.com/spark:latest 
> /usr/local/spark/bin/spark-submit  --class 
> com.example.spark.streaming.EventHandler hdfs://hdfs/tmp/application.jar 
> Warning: Skip remote jar hdfs://hdfs/tmp/application.jar.
> java.lang.ClassNotFoundException: com.example.spark.streaming.EventHandler
> at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
> at java.lang.Class.forName0(Native Method)
> at java.lang.Class.forName(Class.java:348)
> at org.apache.spark.util.Utils$.classForName(Utils.scala:173)
> at 
> org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:639)
> at 
> org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
> at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
> at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:120)
> at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> {code}
> Although I'm aware that we can run in cluster mode with mesos, we've already 
> built some nice tools surrounding marathon for logging and monitoring.
> Code in question:
> https://github.com/apache/spark/blob/132718ad7f387e1002b708b19e471d9cd907e105/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L723-L736



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19468) Dataset slow because of unnecessary shuffles

2017-02-06 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15853809#comment-15853809
 ] 

Sean Owen commented on SPARK-19468:
---

I am unclear whether this is a bug report. You're saying one thing is slower 
than another thing, but is the behavior wrong or unreasonably slow? If x is 
better than y, do x?

> Dataset slow because of unnecessary shuffles
> 
>
> Key: SPARK-19468
> URL: https://issues.apache.org/jira/browse/SPARK-19468
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: koert kuipers
>
> We noticed that some algos we ported from rdd to dataset are significantly 
> slower, and the main reason seems to be extra shuffles that we successfully 
> avoid for rdds by careful partitioning. This seems to be dataset specific, as 
> it works ok for dataframe.
> See also here:
> http://blog.hydronitrogen.com/2016/05/13/shuffle-free-joins-in-spark-sql/
> It kind of boils down to this... if I partition and sort dataframes that get 
> used for joins repeatedly, I can avoid shuffles:
> {noformat}
> System.setProperty("spark.sql.autoBroadcastJoinThreshold", "-1")
> val df1 = Seq((0, 0), (1, 1)).toDF("key", "value")
>   
> .repartition(col("key")).sortWithinPartitions(col("key")).persist(StorageLevel.DISK_ONLY)
> val df2 = Seq((0, 0), (1, 1)).toDF("key2", "value2")
>   
> .repartition(col("key2")).sortWithinPartitions(col("key2")).persist(StorageLevel.DISK_ONLY)
> val joined = df1.join(df2, col("key") === col("key2"))
> joined.explain
> == Physical Plan ==
> *SortMergeJoin [key#5], [key2#27], Inner
> :- InMemoryTableScan [key#5, value#6]
> : +- InMemoryRelation [key#5, value#6], true, 1, StorageLevel(disk, 1 
> replicas)
> :   +- *Sort [key#5 ASC NULLS FIRST], false, 0
> :  +- Exchange hashpartitioning(key#5, 4)
> : +- LocalTableScan [key#5, value#6]
> +- InMemoryTableScan [key2#27, value2#28]
>   +- InMemoryRelation [key2#27, value2#28], true, 1, 
> StorageLevel(disk, 1 replicas)
> +- *Sort [key2#27 ASC NULLS FIRST], false, 0
>+- Exchange hashpartitioning(key2#27, 4)
>   +- LocalTableScan [key2#27, value2#28]
> {noformat}
> Notice how the persisted dataframes are not shuffled or sorted anymore before 
> being used in the join. However, if I try to do the same with dataset I have 
> no luck:
> {noformat}
> val ds1 = Seq((0, 0), (1, 1)).toDS
>   
> .repartition(col("_1")).sortWithinPartitions(col("_1")).persist(StorageLevel.DISK_ONLY)
> val ds2 = Seq((0, 0), (1, 1)).toDS
>   
> .repartition(col("_1")).sortWithinPartitions(col("_1")).persist(StorageLevel.DISK_ONLY)
> val joined1 = ds1.joinWith(ds2, ds1("_1") === ds2("_1"))
> joined1.explain
> == Physical Plan ==
> *SortMergeJoin [_1#105._1], [_2#106._1], Inner
> :- *Sort [_1#105._1 ASC NULLS FIRST], false, 0
> :  +- Exchange hashpartitioning(_1#105._1, 4)
> : +- *Project [named_struct(_1, _1#83, _2, _2#84) AS _1#105]
> :+- InMemoryTableScan [_1#83, _2#84]
> :  +- InMemoryRelation [_1#83, _2#84], true, 1, 
> StorageLevel(disk, 1 replicas)
> :+- *Sort [_1#83 ASC NULLS FIRST], false, 0
> :   +- Exchange hashpartitioning(_1#83, 4)
> :  +- LocalTableScan [_1#83, _2#84]
> +- *Sort [_2#106._1 ASC NULLS FIRST], false, 0
>+- Exchange hashpartitioning(_2#106._1, 4)
>   +- *Project [named_struct(_1, _1#100, _2, _2#101) AS _2#106]
>  +- InMemoryTableScan [_1#100, _2#101]
>+- InMemoryRelation [_1#100, _2#101], true, 1, 
> StorageLevel(disk, 1 replicas)
>  +- *Sort [_1#83 ASC NULLS FIRST], false, 0
> +- Exchange hashpartitioning(_1#83, 4)
>+- LocalTableScan [_1#83, _2#84]
> {noformat}
> Notice how my persisted Datasets are shuffled and sorted again. Part of the 
> issue seems to be in joinWith, which does some preprocessing that seems to 
> confuse the planner. If I change the joinWith to join (which returns a 
> dataframe) it looks a little better, in that only one side gets shuffled 
> again, but it is still not optimal:
> {noformat}
> val ds1 = Seq((0, 0), (1, 1)).toDS
>   
> .repartition(col("_1")).sortWithinPartitions(col("_1")).persist(StorageLevel.DISK_ONLY)
> val ds2 = Seq((0, 0), (1, 1)).toDS
>   
> .repartition(col("_1")).sortWithinPartitions(col("_1")).persist(StorageLevel.DISK_ONLY)
> val joined1 = ds1.join(ds2, ds1("_1") === ds2("_1"))
> joined1.explain
> == Physical Plan ==
> *SortMergeJoin [_1#83], [_1#100], Inner
> :- InMemoryTableScan [_1#83, _2#84]
> : +- InMemoryRelation [_1#83, _2#84], true, 1, StorageLevel(disk, 1 
> replicas)
> :   +- *Sort [_1#83 ASC NULLS FIRST], false, 0
> :  +- Exchange 

[jira] [Commented] (SPARK-19451) Long values in Window function

2017-02-06 Thread Herman van Hovell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15853800#comment-15853800
 ] 

Herman van Hovell commented on SPARK-19451:
---

Yeah, you are right about that. We should definitely support this.

> Long values in Window function
> --
>
> Key: SPARK-19451
> URL: https://issues.apache.org/jira/browse/SPARK-19451
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.1, 2.0.2
>Reporter: Julien Champ
>
> Hi there,
> there seems to be a major limitation in spark window functions and 
> rangeBetween method.
> If I have the following code :
> {code:title=Exemple |borderStyle=solid}
> val tw =  Window.orderBy("date")
>   .partitionBy("id")
>   .rangeBetween( from , 0)
> {code}
> Everything seems OK as long as the *from* value is not too large... even though the 
> rangeBetween() method supports Long parameters.
> But if I set *-216000L* as the *from* value, it does not work!
> It is probably related to this part of code in the between() method, of the 
> WindowSpec class, called by rangeBetween()
> {code:title=between() method|borderStyle=solid}
> val boundaryStart = start match {
>   case 0 => CurrentRow
>   case Long.MinValue => UnboundedPreceding
>   case x if x < 0 => ValuePreceding(-start.toInt)
>   case x if x > 0 => ValueFollowing(start.toInt)
> }
> {code}
> ( look at this *.toInt* )
> Does anybody know if there's a way to solve / patch this behavior?
> Any help will be appreciated
> Thx



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19451) Long values in Window function

2017-02-06 Thread Julien Champ (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15853793#comment-15853793
 ] 

Julien Champ commented on SPARK-19451:
--

Let's imagine that this window is used on timestamp values in ms: I can ask 
for a window with a range between [-216000L, 0] and only have a few values 
inside, not necessarily 216000L.

I can understand the limitation for the rowBetween() method, but the 
rangeBetween() method is nice for this kind of usage.
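
For what it's worth, the silent narrowing is easy to demonstrate in a plain Scala REPL; the bound below is just an illustrative value outside Int range:

{code}
// Illustrative only: a range bound outside Int range is silently wrapped by .toInt,
// which is what the between() code quoted in this ticket ends up doing.
val from = -3000000000L   // e.g. 3,000,000,000 ms (~35 days), below Int.MinValue
println(from.toInt)       // 1294967296  -- sign and magnitude are both wrong
println(-from.toInt)      // -1294967296 -- the value ValuePreceding would receive
{code}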

> Long values in Window function
> --
>
> Key: SPARK-19451
> URL: https://issues.apache.org/jira/browse/SPARK-19451
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.1, 2.0.2
>Reporter: Julien Champ
>
> Hi there,
> there seems to be a major limitation in spark window functions and 
> rangeBetween method.
> If I have the following code :
> {code:title=Exemple |borderStyle=solid}
> val tw =  Window.orderBy("date")
>   .partitionBy("id")
>   .rangeBetween( from , 0)
> {code}
> Everything seems OK as long as the *from* value is not too large... even though the 
> rangeBetween() method supports Long parameters.
> But if I set *-216000L* as the *from* value, it does not work!
> It is probably related to this part of code in the between() method, of the 
> WindowSpec class, called by rangeBetween()
> {code:title=between() method|borderStyle=solid}
> val boundaryStart = start match {
>   case 0 => CurrentRow
>   case Long.MinValue => UnboundedPreceding
>   case x if x < 0 => ValuePreceding(-start.toInt)
>   case x if x > 0 => ValueFollowing(start.toInt)
> }
> {code}
> ( look at this *.toInt* )
> Does anybody know if there's a way to solve / patch this behavior?
> Any help will be appreciated
> Thx



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16441) Spark application hang when dynamic allocation is enabled

2017-02-06 Thread Yuming Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16441?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-16441:

Attachment: SPARK-16441-yarn-metrics.jpg
SPARK-16441-threadDump.jpg
SPARK-16441-stage.jpg

> Spark application hang when dynamic allocation is enabled
> -
>
> Key: SPARK-16441
> URL: https://issues.apache.org/jira/browse/SPARK-16441
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.2, 2.0.0, 2.1.0
> Environment: hadoop 2.7.2  spark1.6.2
>Reporter: cen yuhai
> Attachments: SPARK-16441-stage.jpg, SPARK-16441-threadDump.jpg, 
> SPARK-16441-yarn-metrics.jpg
>
>
> The spark application is waiting for an rpc response all the time, and the spark 
> listener is blocked by dynamic allocation. Executors cannot connect to the 
> driver and are lost.
> "spark-dynamic-executor-allocation" #239 daemon prio=5 os_prio=0 
> tid=0x7fa304438000 nid=0xcec6 waiting on condition [0x7fa2b81e4000]
>java.lang.Thread.State: TIMED_WAITING (parking)
>   at sun.misc.Unsafe.park(Native Method)
>   - parking to wait for  <0x00070fdb94f8> (a 
> scala.concurrent.impl.Promise$CompletionLatch)
>   at 
> java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
>   at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedNanos(AbstractQueuedSynchronizer.java:1037)
>   at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(AbstractQueuedSynchronizer.java:1328)
>   at 
> scala.concurrent.impl.Promise$DefaultPromise.tryAwait(Promise.scala:208)
>   at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:218)
>   at 
> scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
>   at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
>   at 
> scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
>   at scala.concurrent.Await$.result(package.scala:107)
>   at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
>   at 
> org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:101)
>   at 
> org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:77)
>   at 
> org.apache.spark.scheduler.cluster.YarnSchedulerBackend.doRequestTotalExecutors(YarnSchedulerBackend.scala:59)
>   at 
> org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.requestTotalExecutors(CoarseGrainedSchedulerBackend.scala:436)
>   - locked <0x828a8960> (a 
> org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend)
>   at 
> org.apache.spark.SparkContext.requestTotalExecutors(SparkContext.scala:1438)
>   at 
> org.apache.spark.ExecutorAllocationManager.addExecutors(ExecutorAllocationManager.scala:359)
>   at 
> org.apache.spark.ExecutorAllocationManager.updateAndSyncNumExecutorsTarget(ExecutorAllocationManager.scala:310)
>   - locked <0x880e6308> (a 
> org.apache.spark.ExecutorAllocationManager)
>   at 
> org.apache.spark.ExecutorAllocationManager.org$apache$spark$ExecutorAllocationManager$$schedule(ExecutorAllocationManager.scala:264)
>   - locked <0x880e6308> (a 
> org.apache.spark.ExecutorAllocationManager)
>   at 
> org.apache.spark.ExecutorAllocationManager$$anon$2.run(ExecutorAllocationManager.scala:223)
> "SparkListenerBus" #161 daemon prio=5 os_prio=0 tid=0x7fa3053be000 
> nid=0xcec9 waiting for monitor entry [0x7fa2b3dfc000]
>java.lang.Thread.State: BLOCKED (on object monitor)
>   at 
> org.apache.spark.ExecutorAllocationManager$ExecutorAllocationListener.onTaskEnd(ExecutorAllocationManager.scala:618)
>   - waiting to lock <0x880e6308> (a 
> org.apache.spark.ExecutorAllocationManager)
>   at 
> org.apache.spark.scheduler.SparkListenerBus$class.onPostEvent(SparkListenerBus.scala:42)
>   at 
> org.apache.spark.scheduler.LiveListenerBus.onPostEvent(LiveListenerBus.scala:31)
>   at 
> org.apache.spark.scheduler.LiveListenerBus.onPostEvent(LiveListenerBus.scala:31)
>   at 
> org.apache.spark.util.ListenerBus$class.postToAll(ListenerBus.scala:55)
>   at 
> org.apache.spark.util.AsynchronousListenerBus.postToAll(AsynchronousListenerBus.scala:37)
>   at 
> org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(AsynchronousListenerBus.scala:80)
>   at 
> org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(AsynchronousListenerBus.scala:65)
>   at 
> org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(AsynchronousListenerBus.scala:65)
>   at 

[jira] [Updated] (SPARK-16441) Spark application hang when dynamic allocation is enabled

2017-02-06 Thread Yuming Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16441?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-16441:

Affects Version/s: 2.1.0

> Spark application hang when dynamic allocation is enabled
> -
>
> Key: SPARK-16441
> URL: https://issues.apache.org/jira/browse/SPARK-16441
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.2, 2.0.0, 2.1.0
> Environment: hadoop 2.7.2  spark1.6.2
>Reporter: cen yuhai
>
> The spark application is waiting for an rpc response all the time, and the spark 
> listener is blocked by dynamic allocation. Executors cannot connect to the 
> driver and are lost.
> "spark-dynamic-executor-allocation" #239 daemon prio=5 os_prio=0 
> tid=0x7fa304438000 nid=0xcec6 waiting on condition [0x7fa2b81e4000]
>java.lang.Thread.State: TIMED_WAITING (parking)
>   at sun.misc.Unsafe.park(Native Method)
>   - parking to wait for  <0x00070fdb94f8> (a 
> scala.concurrent.impl.Promise$CompletionLatch)
>   at 
> java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
>   at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedNanos(AbstractQueuedSynchronizer.java:1037)
>   at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(AbstractQueuedSynchronizer.java:1328)
>   at 
> scala.concurrent.impl.Promise$DefaultPromise.tryAwait(Promise.scala:208)
>   at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:218)
>   at 
> scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
>   at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
>   at 
> scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
>   at scala.concurrent.Await$.result(package.scala:107)
>   at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
>   at 
> org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:101)
>   at 
> org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:77)
>   at 
> org.apache.spark.scheduler.cluster.YarnSchedulerBackend.doRequestTotalExecutors(YarnSchedulerBackend.scala:59)
>   at 
> org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.requestTotalExecutors(CoarseGrainedSchedulerBackend.scala:436)
>   - locked <0x828a8960> (a 
> org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend)
>   at 
> org.apache.spark.SparkContext.requestTotalExecutors(SparkContext.scala:1438)
>   at 
> org.apache.spark.ExecutorAllocationManager.addExecutors(ExecutorAllocationManager.scala:359)
>   at 
> org.apache.spark.ExecutorAllocationManager.updateAndSyncNumExecutorsTarget(ExecutorAllocationManager.scala:310)
>   - locked <0x880e6308> (a 
> org.apache.spark.ExecutorAllocationManager)
>   at 
> org.apache.spark.ExecutorAllocationManager.org$apache$spark$ExecutorAllocationManager$$schedule(ExecutorAllocationManager.scala:264)
>   - locked <0x880e6308> (a 
> org.apache.spark.ExecutorAllocationManager)
>   at 
> org.apache.spark.ExecutorAllocationManager$$anon$2.run(ExecutorAllocationManager.scala:223)
> "SparkListenerBus" #161 daemon prio=5 os_prio=0 tid=0x7fa3053be000 
> nid=0xcec9 waiting for monitor entry [0x7fa2b3dfc000]
>java.lang.Thread.State: BLOCKED (on object monitor)
>   at 
> org.apache.spark.ExecutorAllocationManager$ExecutorAllocationListener.onTaskEnd(ExecutorAllocationManager.scala:618)
>   - waiting to lock <0x880e6308> (a 
> org.apache.spark.ExecutorAllocationManager)
>   at 
> org.apache.spark.scheduler.SparkListenerBus$class.onPostEvent(SparkListenerBus.scala:42)
>   at 
> org.apache.spark.scheduler.LiveListenerBus.onPostEvent(LiveListenerBus.scala:31)
>   at 
> org.apache.spark.scheduler.LiveListenerBus.onPostEvent(LiveListenerBus.scala:31)
>   at 
> org.apache.spark.util.ListenerBus$class.postToAll(ListenerBus.scala:55)
>   at 
> org.apache.spark.util.AsynchronousListenerBus.postToAll(AsynchronousListenerBus.scala:37)
>   at 
> org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(AsynchronousListenerBus.scala:80)
>   at 
> org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(AsynchronousListenerBus.scala:65)
>   at 
> org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(AsynchronousListenerBus.scala:65)
>   at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57)
>   at 
> org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1.apply$mcV$sp(AsynchronousListenerBus.scala:64)
>   at 

[jira] [Commented] (SPARK-16441) Spark application hang when dynamic allocation is enabled

2017-02-06 Thread Yuming Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16441?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15853784#comment-15853784
 ] 

Yuming Wang commented on SPARK-16441:
-

Setting {{spark.dynamicAllocation.maxExecutors}} to a reasonable value works around this.

This happens when a stage has a lot of tasks and needs a lot of executors, which 
far exceeds the cluster resources, and that doesn't make sense. My opinion is that 
we should derive {{spark.dynamicAllocation.maxExecutors}} dynamically from the cluster resources.
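
A minimal sketch of applying that workaround when building the session (Spark 2.x API; the cap of 200 is an arbitrary example, not a recommendation):

{code}
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

// Illustrative workaround: cap dynamic allocation so the requested executor
// count cannot balloon far past what the cluster can actually provide.
val conf = new SparkConf()
  .set("spark.dynamicAllocation.enabled", "true")
  .set("spark.shuffle.service.enabled", "true") // required by dynamic allocation
  .set("spark.dynamicAllocation.maxExecutors", "200")

val spark = SparkSession.builder().config(conf).getOrCreate()
{code}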

I have logged the numExecutors by 
[CoarseGrainedSchedulerBackend.scala|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/cluster/CoarseGrainedSchedulerBackend.scala]:
{code:java}
  final override def requestTotalExecutors(
  numExecutors: Int,
  localityAwareTasks: Int,
  hostToLocalTaskCount: Map[String, Int]
): Boolean = {
if (numExecutors < 0) {
  throw new IllegalArgumentException(
"Attempted to request a negative number of executor(s) " +
  s"$numExecutors from the cluster manager. Please specify a positive 
number!")
}

val response = synchronized {
  this.localityAwareTasks = localityAwareTasks
  this.hostToLocalTaskCount = hostToLocalTaskCount

  numPendingExecutors =
math.max(numExecutors - numExistingExecutors + 
executorsPendingToRemove.size, 0)
  logInfo(s"numExecutors: ${numExecutors}, " +
s"numExistingExecutors: ${numExistingExecutors}, " +
s"executorsPendingToRemove.size: ${executorsPendingToRemove.size}")
  doRequestTotalExecutors(numExecutors)
}

defaultAskTimeout.awaitResult(response)
  }
{code}
This is the result:
{noformat}
17/02/06 16:00:17 INFO YarnClientSchedulerBackend: numExecutors: 200, numExistingExecutors: 0, executorsPendingToRemove.size: 0
17/02/06 16:01:19 INFO YarnClientSchedulerBackend: numExecutors: 0, numExistingExecutors: 91, executorsPendingToRemove.size: 3
17/02/06 16:02:01 INFO YarnClientSchedulerBackend: numExecutors: 1, numExistingExecutors: 0, executorsPendingToRemove.size: 0
17/02/06 16:02:02 INFO YarnClientSchedulerBackend: numExecutors: 3, numExistingExecutors: 0, executorsPendingToRemove.size: 0
17/02/06 16:02:03 INFO YarnClientSchedulerBackend: numExecutors: 7, numExistingExecutors: 0, executorsPendingToRemove.size: 0
17/02/06 16:02:04 INFO YarnClientSchedulerBackend: numExecutors: 15, numExistingExecutors: 1, executorsPendingToRemove.size: 0
17/02/06 16:02:05 INFO YarnClientSchedulerBackend: numExecutors: 31, numExistingExecutors: 1, executorsPendingToRemove.size: 0
17/02/06 16:02:06 INFO YarnClientSchedulerBackend: numExecutors: 63, numExistingExecutors: 3, executorsPendingToRemove.size: 0
17/02/06 16:02:07 INFO YarnClientSchedulerBackend: numExecutors: 127, numExistingExecutors: 10, executorsPendingToRemove.size: 0
17/02/06 16:02:08 INFO YarnClientSchedulerBackend: numExecutors: 255, numExistingExecutors: 15, executorsPendingToRemove.size: 0
17/02/06 16:02:09 INFO YarnClientSchedulerBackend: numExecutors: 511, numExistingExecutors: 39, executorsPendingToRemove.size: 0
17/02/06 16:02:10 INFO YarnClientSchedulerBackend: numExecutors: 1023, numExistingExecutors: 51, executorsPendingToRemove.size: 0
17/02/06 16:02:11 INFO YarnClientSchedulerBackend: numExecutors: 2047, numExistingExecutors: 60, executorsPendingToRemove.size: 0
17/02/06 16:02:12 INFO YarnClientSchedulerBackend: numExecutors: 4095, numExistingExecutors: 65, executorsPendingToRemove.size: 0
17/02/06 16:02:13 INFO YarnClientSchedulerBackend: numExecutors: 8191, numExistingExecutors: 65, executorsPendingToRemove.size: 0
17/02/06 16:02:14 INFO YarnClientSchedulerBackend: numExecutors: 16383, numExistingExecutors: 67, executorsPendingToRemove.size: 0
17/02/06 16:02:15 INFO YarnClientSchedulerBackend: numExecutors: 32767, numExistingExecutors: 68, executorsPendingToRemove.size: 0
17/02/06 16:02:16 INFO YarnClientSchedulerBackend: numExecutors: 65535, numExistingExecutors: 71, executorsPendingToRemove.size: 0
17/02/06 16:02:17 INFO YarnClientSchedulerBackend: numExecutors: 69248, numExistingExecutors: 72, executorsPendingToRemove.size: 0
17/02/06 16:02:20 INFO YarnClientSchedulerBackend: numExecutors: 69195, numExistingExecutors: 79, executorsPendingToRemove.size: 0
17/02/06 16:02:20 INFO YarnClientSchedulerBackend: numExecutors: 69151, numExistingExecutors: 79, executorsPendingToRemove.size: 0
17/02/06 16:02:21 INFO YarnClientSchedulerBackend: numExecutors: 69083, numExistingExecutors: 82, executorsPendingToRemove.size: 0
17/02/06 16:02:21 INFO YarnClientSchedulerBackend: numExecutors: 69021, numExistingExecutors: 82, executorsPendingToRemove.size: 0
17/02/06 16:02:21 INFO YarnClientSchedulerBackend: numExecutors: 68954, numExistingExecutors: 82, executorsPendingToRemove.size: 0
17/02/06 16:02:21 INFO YarnClientSchedulerBackend: numExecutors: 68868, numExistingExecutors: 82, executorsPendingToRemove.size: 
{noformat}
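
The geometric growth visible above (1, 3, 7, 15, ... up to 69248, presumably the number of executors needed for all outstanding tasks) can be reproduced with a simplified model of how the allocation target grows. This is only an illustration of the doubling behaviour, not the actual ExecutorAllocationManager code:
{code:scala}
// Simplified model: each round the number of executors to add doubles, and the
// target is capped at the number of executors needed for the pending tasks.
def growTarget(neededExecutors: Int): Seq[Int] = {
  val history = scala.collection.mutable.ArrayBuffer.empty[Int]
  var target = 0
  var toAdd = 1
  while (target < neededExecutors) {
    target = math.min(target + toAdd, neededExecutors)
    history += target
    toAdd *= 2
  }
  history.toSeq
}

// growTarget(69248) == Seq(1, 3, 7, 15, 31, ..., 32767, 65535, 69248),
// matching the progression in the log above.
{code}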

[jira] [Commented] (SPARK-19451) Long values in Window function

2017-02-06 Thread Herman van Hovell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15853782#comment-15853782
 ] 

Herman van Hovell commented on SPARK-19451:
---

[~jchamp] how many rows are in your partitions? 2 billion? So this is an oversight, but I am not sure we should even try to support more than {{1 << 32 - 1}} values in a partition.
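
As background for the overflow discussed in this thread, here is a quick sketch (with a hypothetical bound, not the reporter's exact value) of what happens when a Long range bound is forced through {{.toInt}} as in the {{between()}} snippet quoted below:
{code:scala}
// Illustration only: .toInt silently wraps for values outside Int range.
val from = -3000000000L          // hypothetical bound, well outside Int range
val truncated = (-from).toInt    // 3000000000L.toInt == -1294967296
println(s"expected ${-from}, got $truncated")
{code}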

> Long values in Window function
> --
>
> Key: SPARK-19451
> URL: https://issues.apache.org/jira/browse/SPARK-19451
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.1, 2.0.2
>Reporter: Julien Champ
>
> Hi there,
> there seems to be a major limitation in Spark window functions and the
> rangeBetween method.
> If I have the following code:
> {code:title=Example|borderStyle=solid}
> val tw =  Window.orderBy("date")
>   .partitionBy("id")
>   .rangeBetween( from , 0)
> {code}
> Everything seems OK as long as the *from* value is not too large, even though
> the rangeBetween() method accepts Long parameters.
> But if I set *from* to *-216000L*, it does not work!
> It is probably related to this part of the code in the between() method of the
> WindowSpec class, called by rangeBetween():
> {code:title=between() method|borderStyle=solid}
> val boundaryStart = start match {
>   case 0 => CurrentRow
>   case Long.MinValue => UnboundedPreceding
>   case x if x < 0 => ValuePreceding(-start.toInt)
>   case x if x > 0 => ValueFollowing(start.toInt)
> }
> {code}
> (look at this *.toInt*)
> Does anybody know if there's a way to solve / patch this behavior?
> Any help will be appreciated.
> Thx



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-16441) Spark application hang when dynamic allocation is enabled

2017-02-06 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16441?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16441:


Assignee: (was: Apache Spark)

> Spark application hang when dynamic allocation is enabled
> -
>
> Key: SPARK-16441
> URL: https://issues.apache.org/jira/browse/SPARK-16441
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.2, 2.0.0
> Environment: hadoop 2.7.2  spark1.6.2
>Reporter: cen yuhai
>
> The Spark application is waiting for an RPC response all the time and the Spark
> listener is blocked by dynamic allocation. Executors cannot connect to the
> driver and are lost.
> "spark-dynamic-executor-allocation" #239 daemon prio=5 os_prio=0 
> tid=0x7fa304438000 nid=0xcec6 waiting on condition [0x7fa2b81e4000]
>java.lang.Thread.State: TIMED_WAITING (parking)
>   at sun.misc.Unsafe.park(Native Method)
>   - parking to wait for  <0x00070fdb94f8> (a 
> scala.concurrent.impl.Promise$CompletionLatch)
>   at 
> java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
>   at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedNanos(AbstractQueuedSynchronizer.java:1037)
>   at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(AbstractQueuedSynchronizer.java:1328)
>   at 
> scala.concurrent.impl.Promise$DefaultPromise.tryAwait(Promise.scala:208)
>   at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:218)
>   at 
> scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
>   at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
>   at 
> scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
>   at scala.concurrent.Await$.result(package.scala:107)
>   at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
>   at 
> org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:101)
>   at 
> org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:77)
>   at 
> org.apache.spark.scheduler.cluster.YarnSchedulerBackend.doRequestTotalExecutors(YarnSchedulerBackend.scala:59)
>   at 
> org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.requestTotalExecutors(CoarseGrainedSchedulerBackend.scala:436)
>   - locked <0x828a8960> (a 
> org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend)
>   at 
> org.apache.spark.SparkContext.requestTotalExecutors(SparkContext.scala:1438)
>   at 
> org.apache.spark.ExecutorAllocationManager.addExecutors(ExecutorAllocationManager.scala:359)
>   at 
> org.apache.spark.ExecutorAllocationManager.updateAndSyncNumExecutorsTarget(ExecutorAllocationManager.scala:310)
>   - locked <0x880e6308> (a 
> org.apache.spark.ExecutorAllocationManager)
>   at 
> org.apache.spark.ExecutorAllocationManager.org$apache$spark$ExecutorAllocationManager$$schedule(ExecutorAllocationManager.scala:264)
>   - locked <0x880e6308> (a 
> org.apache.spark.ExecutorAllocationManager)
>   at 
> org.apache.spark.ExecutorAllocationManager$$anon$2.run(ExecutorAllocationManager.scala:223)
> "SparkListenerBus" #161 daemon prio=5 os_prio=0 tid=0x7fa3053be000 
> nid=0xcec9 waiting for monitor entry [0x7fa2b3dfc000]
>java.lang.Thread.State: BLOCKED (on object monitor)
>   at 
> org.apache.spark.ExecutorAllocationManager$ExecutorAllocationListener.onTaskEnd(ExecutorAllocationManager.scala:618)
>   - waiting to lock <0x880e6308> (a 
> org.apache.spark.ExecutorAllocationManager)
>   at 
> org.apache.spark.scheduler.SparkListenerBus$class.onPostEvent(SparkListenerBus.scala:42)
>   at 
> org.apache.spark.scheduler.LiveListenerBus.onPostEvent(LiveListenerBus.scala:31)
>   at 
> org.apache.spark.scheduler.LiveListenerBus.onPostEvent(LiveListenerBus.scala:31)
>   at 
> org.apache.spark.util.ListenerBus$class.postToAll(ListenerBus.scala:55)
>   at 
> org.apache.spark.util.AsynchronousListenerBus.postToAll(AsynchronousListenerBus.scala:37)
>   at 
> org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(AsynchronousListenerBus.scala:80)
>   at 
> org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(AsynchronousListenerBus.scala:65)
>   at 
> org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(AsynchronousListenerBus.scala:65)
>   at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57)
>   at 
> org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1.apply$mcV$sp(AsynchronousListenerBus.scala:64)
>   at 
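
The two stacks quoted above show the allocation thread blocking on an RPC while holding a lock that the listener thread needs. Below is a minimal, self-contained sketch of that pattern; it is an illustration only, not Spark code:
{code:scala}
// One thread performs a long blocking call while holding a lock, and a
// listener-style thread that needs the same lock stalls until it returns.
object LockWhileBlockingDemo {
  private val allocationLock = new Object

  def main(args: Array[String]): Unit = {
    val allocator = new Thread(new Runnable {
      override def run(): Unit = allocationLock.synchronized {
        Thread.sleep(5000) // stands in for a blocking RPC with a long timeout
      }
    })
    val listener = new Thread(new Runnable {
      override def run(): Unit = {
        Thread.sleep(100) // let the allocator take the lock first
        val start = System.nanoTime()
        allocationLock.synchronized { () } // an onTaskEnd-style handler
        val blockedMs = (System.nanoTime() - start) / 1e6
        println(s"listener was blocked for $blockedMs ms")
      }
    })
    allocator.start(); listener.start()
    allocator.join(); listener.join()
  }
}
{code}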

[jira] [Assigned] (SPARK-16441) Spark application hang when dynamic allocation is enabled

2017-02-06 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16441?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16441:


Assignee: Apache Spark

> Spark application hang when dynamic allocation is enabled
> -
>
> Key: SPARK-16441
> URL: https://issues.apache.org/jira/browse/SPARK-16441
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.2, 2.0.0
> Environment: hadoop 2.7.2  spark1.6.2
>Reporter: cen yuhai
>Assignee: Apache Spark
>

[jira] [Commented] (SPARK-16441) Spark application hang when dynamic allocation is enabled

2017-02-06 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16441?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15853775#comment-15853775
 ] 

Apache Spark commented on SPARK-16441:
--

User 'wangyum' has created a pull request for this issue:
https://github.com/apache/spark/pull/16819

> Spark application hang when dynamic allocation is enabled
> -
>
> Key: SPARK-16441
> URL: https://issues.apache.org/jira/browse/SPARK-16441
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.2, 2.0.0
> Environment: hadoop 2.7.2  spark1.6.2
>Reporter: cen yuhai
>

[jira] [Commented] (SPARK-19451) Long values in Window function

2017-02-06 Thread Genmao Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15853773#comment-15853773
 ] 

Genmao Yu commented on SPARK-19451:
---

[~jchamp] I have taken a quick look through the code and did not find any strong reason for the index to be an Int. In the window function there is indeed an underlying integer overflow issue. I have made a pull request (https://github.com/apache/spark/pull/16818); any suggestions are appreciated.

> Long values in Window function
> --
>
> Key: SPARK-19451
> URL: https://issues.apache.org/jira/browse/SPARK-19451
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.1, 2.0.2
>Reporter: Julien Champ
>



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-19451) Long values in Window function

2017-02-06 Thread Genmao Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15853773#comment-15853773
 ] 

Genmao Yu edited comment on SPARK-19451 at 2/6/17 9:58 AM:
---

[~jchamp] I have taken a quick look through the code and did not find any strong reason for the index to be an Int. In the window function there is indeed an underlying integer overflow issue. I made a pull request (https://github.com/apache/spark/pull/16818); any suggestions are appreciated.


was (Author: unclegen):
[~jchamp] I have taken a fast look through the code, and did not find any 
strong point to set the type of index as Int. In the window function, there is 
really a underlying integer overflow issue. I have make a pull request 
(https://github.com/apache/spark/pull/16818), and any suggestion is appreciated.

> Long values in Window function
> --
>
> Key: SPARK-19451
> URL: https://issues.apache.org/jira/browse/SPARK-19451
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.1, 2.0.2
>Reporter: Julien Champ
>



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19451) Long values in Window function

2017-02-06 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19451?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19451:


Assignee: (was: Apache Spark)

> Long values in Window function
> --
>
> Key: SPARK-19451
> URL: https://issues.apache.org/jira/browse/SPARK-19451
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.1, 2.0.2
>Reporter: Julien Champ
>



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19451) Long values in Window function

2017-02-06 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15853763#comment-15853763
 ] 

Apache Spark commented on SPARK-19451:
--

User 'uncleGen' has created a pull request for this issue:
https://github.com/apache/spark/pull/16818

> Long values in Window function
> --
>
> Key: SPARK-19451
> URL: https://issues.apache.org/jira/browse/SPARK-19451
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.1, 2.0.2
>Reporter: Julien Champ
>



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19451) Long values in Window function

2017-02-06 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19451?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19451:


Assignee: Apache Spark

> Long values in Window function
> --
>
> Key: SPARK-19451
> URL: https://issues.apache.org/jira/browse/SPARK-19451
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.1, 2.0.2
>Reporter: Julien Champ
>Assignee: Apache Spark
>



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17910) Allow users to update the comment of a column

2017-02-06 Thread Xiaochen Ouyang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17910?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15853748#comment-15853748
 ] 

Xiaochen Ouyang commented on SPARK-17910:
-

Hey, I was wondering whether there is a plan to support changing a column's data type and name later? Thanks! [~yhuai] [~jiangxb]

> Allow users to update the comment of a column
> -
>
> Key: SPARK-17910
> URL: https://issues.apache.org/jira/browse/SPARK-17910
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Yin Huai
>Assignee: Jiang Xingbo
> Fix For: 2.2.0
>
>
> Right now, once a user sets the comment of a column with the CREATE TABLE
> command, he/she cannot update the comment. It would be useful to provide a
> public interface (e.g. SQL) to do that.
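
For illustration, the statement this ticket enables looks roughly like the following (the table and column names are hypothetical, the syntax is sketched from memory, and the column name and data type must be repeated unchanged; check the Spark 2.2 SQL docs for the exact form):
{code:scala}
// In spark-shell, `spark` is the predefined SparkSession.
// Only the column comment is being changed here.
spark.sql("ALTER TABLE sales CHANGE COLUMN amount amount DOUBLE COMMENT 'gross amount in USD'")
{code}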



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19451) Long values in Window function

2017-02-06 Thread Julien Champ (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15853685#comment-15853685
 ] 

Julien Champ commented on SPARK-19451:
--

Thanks [~uncleGen] for your answer.

I was tempted to try to fix it myself, but I was not sure about possible side effects.
If I can help you in any way (code / debug / tests), feel free to ask.

P.S.: I think this affects all versions of Spark with window functions, but I have only tested it on Spark 1.6.1 and 2.0.2.

> Long values in Window function
> --
>
> Key: SPARK-19451
> URL: https://issues.apache.org/jira/browse/SPARK-19451
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.1, 2.0.2
>Reporter: Julien Champ
>



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17213) Parquet String Pushdown for Non-Eq Comparisons Broken

2017-02-06 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15853673#comment-15853673
 ] 

Apache Spark commented on SPARK-17213:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/16817

> Parquet String Pushdown for Non-Eq Comparisons Broken
> -
>
> Key: SPARK-17213
> URL: https://issues.apache.org/jira/browse/SPARK-17213
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Andrew Duffy
>Assignee: Cheng Lian
> Fix For: 2.1.0
>
>
> Spark defines ordering over strings based on comparison of UTF8 byte arrays,
> which compare bytes as unsigned integers. However, Parquet currently does not
> respect this ordering. This is in the process of being fixed in Parquet (JIRA
> and PR linked below), but for now all string filters are affected, and there is
> an actual correctness issue for {{>}} and {{<}}.
> *Repro:*
> Querying directly from in-memory DataFrame:
> {code}
> > Seq("a", "é").toDF("name").where("name > 'a'").count
> 1
> {code}
> Querying from a parquet dataset:
> {code}
> > Seq("a", "é").toDF("name").write.parquet("/tmp/bad")
> > spark.read.parquet("/tmp/bad").where("name > 'a'").count
> 0
> {code}
> This happens because Spark sorts the rows as {{[a, é]}}, but Parquet's string
> comparison is based on signed byte array comparison, so it will actually create
> one row group with statistics {{min=é,max=a}}, and that row group will be
> dropped by the query.
> Based on the way Parquet pushes down Eq, this does not affect correctness, but
> it forces you to read row groups you should be able to skip.
> Link to PARQUET issue: https://issues.apache.org/jira/browse/PARQUET-686
> Link to PR: https://github.com/apache/parquet-mr/pull/362
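
To make the signed- vs. unsigned-byte ordering concrete, here is a small standalone illustration (not Spark or Parquet code):
{code:scala}
// "a" is the single byte 0x61; "é" is the UTF-8 sequence 0xC3 0xA9.
val a = "a".getBytes("UTF-8")
val e = "é".getBytes("UTF-8")

// Signed bytes: 0xC3 is -61, so "é" sorts before "a" -- this is how Parquet
// ends up with row-group statistics of min=é, max=a.
val signedCmp = java.lang.Byte.compare(e(0), a(0))                      // < 0

// Unsigned bytes: 0xC3 is 195, so "é" sorts after "a", matching Spark's
// UTF8String ordering.
val unsignedCmp = java.lang.Integer.compare(e(0) & 0xFF, a(0) & 0xFF)   // > 0
{code}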



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org