[jira] [Updated] (SPARK-21545) pyspark2

2017-07-26 Thread gumpcheng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21545?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

gumpcheng updated SPARK-21545:
--
Description: 
I installed Spark 2.2 with CDH 5.12 following the official steps.
Everything looks fine in Cloudera Manager, but I failed to initialize pyspark2.

My environment: Python 3.6.1, JDK 1.8, CDH 5.12

The problem has been driving me crazy for several days and I have found no way to solve it.
Can anyone help me? Thank you very much!


[hdfs@Master /data/soft/spark2.2]$ pyspark2
Python 3.6.1 (default, Jul 27 2017, 11:07:01) 
[GCC 4.4.6 20110731 (Red Hat 4.4.6-4)] on linux
Type "help", "copyright", "credits" or "license" for more information.
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
17/07/27 12:02:09 ERROR spark.SparkContext: Error initializing SparkContext.
org.apache.spark.SparkException: Yarn application has already ended! It might have been killed or unable to launch application master.
        at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.waitForApplication(YarnClientSchedulerBackend.scala:85)
        at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:62)
        at org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:173)
        at org.apache.spark.SparkContext.<init>(SparkContext.scala:509)
        at org.apache.spark.api.java.JavaSparkContext.<init>(JavaSparkContext.scala:58)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
        at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
        at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
        at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:247)
        at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
        at py4j.Gateway.invoke(Gateway.java:236)
        at py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:80)
        at py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:69)
        at py4j.GatewayConnection.run(GatewayConnection.java:214)
        at java.lang.Thread.run(Thread.java:748)
17/07/27 12:02:09 WARN cluster.YarnSchedulerBackend$YarnSchedulerEndpoint: Attempted to request executors before the AM has registered!
17/07/27 12:02:09 ERROR util.Utils: Uncaught exception in thread Thread-2
java.lang.NullPointerException
        at org.apache.spark.network.shuffle.ExternalShuffleClient.close(ExternalShuffleClient.java:141)
        at org.apache.spark.storage.BlockManager.stop(BlockManager.scala:1485)
        at org.apache.spark.SparkEnv.stop(SparkEnv.scala:90)
        at org.apache.spark.SparkContext$$anonfun$stop$11.apply$mcV$sp(SparkContext.scala:1937)
        at org.apache.spark.util.Utils$.tryLogNonFatalError(Utils.scala:1317)
        at org.apache.spark.SparkContext.stop(SparkContext.scala:1936)
        at org.apache.spark.SparkContext.<init>(SparkContext.scala:587)
        at org.apache.spark.api.java.JavaSparkContext.<init>(JavaSparkContext.scala:58)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
        at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
        at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
        at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:247)
        at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
        at py4j.Gateway.invoke(Gateway.java:236)
        at py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:80)
        at py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:69)
        at py4j.GatewayConnection.run(GatewayConnection.java:214)
        at java.lang.Thread.run(Thread.java:748)
/opt/cloudera/parcels/SPARK2-2.2.0.cloudera1-1.cdh5.12.0.p0.142354/lib/spark2/python/pyspark/shell.py:52: UserWarning: Fall back to non-hive support because failing to access HiveConf, please make sure you build spark with hive
  warnings.warn("Fall back to non-hive support because failing to access HiveConf, "
17/07/27 12:02:09 WARN spark.SparkContext: Another SparkContext is being constructed (or threw an exception in its constructor). This may indicate an error, since only one SparkContext may be running in this JVM (see SPARK-2243). The other SparkContext was created at:
org.apache.spark.api.java.JavaSparkContext.<init>(JavaSparkContext.scala:58)
sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
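
A quick way to narrow down whether a failure like this comes from the YARN submission or from the local Python/Java setup is to start a local-mode session first; a minimal, hedged sketch (standard PySpark API only, not taken from the report):

{code:python}
# If this local-mode smoke test works, the Python/JDK install is fine and the
# problem is on the YARN side (check the application logs with
# "yarn logs -applicationId <appId>" for the real cause).
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .master("local[2]")          # bypass YARN entirely
         .appName("pyspark2-smoke-test")
         .getOrCreate())
print(spark.range(10).count())        # expect 10
spark.stop()
{code}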

[jira] [Created] (SPARK-21545) pyspark2

2017-07-26 Thread gumpcheng (JIRA)
gumpcheng created SPARK-21545:
-

 Summary: pyspark2
 Key: SPARK-21545
 URL: https://issues.apache.org/jira/browse/SPARK-21545
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 2.2.0
 Environment: Spark 2.2 with CDH 5.12, Python 3.6.1, Java JDK 1.8_b31.
Reporter: gumpcheng
 Fix For: 2.2.0


[hdfs@Master /data/soft/spark2.2]$ pyspark2
Python 3.6.1 (default, Jul 27 2017, 11:07:01) 
[GCC 4.4.6 20110731 (Red Hat 4.4.6-4)] on linux
Type "help", "copyright", "credits" or "license" for more information.
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
17/07/27 12:02:09 ERROR spark.SparkContext: Error initializing SparkContext.
org.apache.spark.SparkException: Yarn application has already ended! It might have been killed or unable to launch application master.
        at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.waitForApplication(YarnClientSchedulerBackend.scala:85)
        at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:62)
        at org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:173)
        at org.apache.spark.SparkContext.<init>(SparkContext.scala:509)
        at org.apache.spark.api.java.JavaSparkContext.<init>(JavaSparkContext.scala:58)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
        at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
        at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
        at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:247)
        at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
        at py4j.Gateway.invoke(Gateway.java:236)
        at py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:80)
        at py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:69)
        at py4j.GatewayConnection.run(GatewayConnection.java:214)
        at java.lang.Thread.run(Thread.java:748)
17/07/27 12:02:09 WARN cluster.YarnSchedulerBackend$YarnSchedulerEndpoint: Attempted to request executors before the AM has registered!
17/07/27 12:02:09 ERROR util.Utils: Uncaught exception in thread Thread-2
java.lang.NullPointerException
        at org.apache.spark.network.shuffle.ExternalShuffleClient.close(ExternalShuffleClient.java:141)
        at org.apache.spark.storage.BlockManager.stop(BlockManager.scala:1485)
        at org.apache.spark.SparkEnv.stop(SparkEnv.scala:90)
        at org.apache.spark.SparkContext$$anonfun$stop$11.apply$mcV$sp(SparkContext.scala:1937)
        at org.apache.spark.util.Utils$.tryLogNonFatalError(Utils.scala:1317)
        at org.apache.spark.SparkContext.stop(SparkContext.scala:1936)
        at org.apache.spark.SparkContext.<init>(SparkContext.scala:587)
        at org.apache.spark.api.java.JavaSparkContext.<init>(JavaSparkContext.scala:58)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
        at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
        at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
        at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:247)
        at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
        at py4j.Gateway.invoke(Gateway.java:236)
        at py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:80)
        at py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:69)
        at py4j.GatewayConnection.run(GatewayConnection.java:214)
        at java.lang.Thread.run(Thread.java:748)
/opt/cloudera/parcels/SPARK2-2.2.0.cloudera1-1.cdh5.12.0.p0.142354/lib/spark2/python/pyspark/shell.py:52: UserWarning: Fall back to non-hive support because failing to access HiveConf, please make sure you build spark with hive
  warnings.warn("Fall back to non-hive support because failing to access HiveConf, "
17/07/27 12:02:09 WARN spark.SparkContext: Another SparkContext is being constructed (or threw an exception in its constructor). This may indicate an error, since only one SparkContext may be running in this JVM (see SPARK-2243). The other SparkContext was created at:
org.apache.spark.api.java.JavaSparkContext.<init>(JavaSparkContext.scala:58)
sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)

[jira] [Resolved] (SPARK-21530) Update description of spark.shuffle.maxChunksBeingTransferred

2017-07-26 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21530?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-21530.
-
   Resolution: Fixed
Fix Version/s: 2.3.0

Issue resolved by pull request 18735
[https://github.com/apache/spark/pull/18735]

> Update description of spark.shuffle.maxChunksBeingTransferred
> -
>
> Key: SPARK-21530
> URL: https://issues.apache.org/jira/browse/SPARK-21530
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle
>Affects Versions: 2.3.0
>Reporter: Thomas Graves
> Fix For: 2.3.0
>
>
> We should expand the description of spark.shuffle.maxChunksBeingTransferred 
> to include what happens when the max is hit. In that case it actually closes 
> incoming connections, so if the retry logic on the reducer side doesn't kick in, 
> the job could end up failing.
> The way it's currently worded doesn't tell users that bad things can happen when 
> the max is hit.
> Introduced with https://github.com/apache/spark/pull/18388/files
> SPARK-21175
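
For reference, the knobs involved can be set through SparkConf; a minimal hedged sketch (the values are placeholders, not recommendations, and spark.shuffle.maxChunksBeingTransferred only exists from Spark 2.3 onwards):

{code:python}
# Server-side cap on in-flight shuffle chunks, plus the reducer-side retry knobs
# whose interaction the config description should call out.
from pyspark import SparkConf

conf = (SparkConf()
        .set("spark.shuffle.maxChunksBeingTransferred", "3000")  # placeholder value
        .set("spark.shuffle.io.maxRetries", "3")                  # reducer retry logic
        .set("spark.shuffle.io.retryWait", "5s"))
{code}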






[jira] [Assigned] (SPARK-21530) Update description of spark.shuffle.maxChunksBeingTransferred

2017-07-26 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21530?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-21530:
---

Assignee: jin xing

> Update description of spark.shuffle.maxChunksBeingTransferred
> -
>
> Key: SPARK-21530
> URL: https://issues.apache.org/jira/browse/SPARK-21530
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle
>Affects Versions: 2.3.0
>Reporter: Thomas Graves
>Assignee: jin xing
> Fix For: 2.3.0
>
>
> We should expand the description of spark.shuffle.maxChunksBeingTransferred 
> to include what happens when the max is hit. In that case it actually closes 
> incoming connections, so if the retry logic on the reducer side doesn't kick in, 
> the job could end up failing.
> The way it's currently worded doesn't tell users that bad things can happen when 
> the max is hit.
> Introduced with https://github.com/apache/spark/pull/18388/files
> SPARK-21175






[jira] [Closed] (SPARK-21400) Spark shouldn't ignore user defined output committer in append mode

2017-07-26 Thread Robert Kruszewski (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21400?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Kruszewski closed SPARK-21400.
-
   Resolution: Fixed
Fix Version/s: 2.3.0

> Spark shouldn't ignore user defined output committer in append mode
> ---
>
> Key: SPARK-21400
> URL: https://issues.apache.org/jira/browse/SPARK-21400
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Robert Kruszewski
> Fix For: 2.3.0
>
>
> In https://issues.apache.org/jira/browse/SPARK-8578 we decided to override 
> user-defined output committers in append mode. The reasoning was that some 
> output committers can lead to correctness issues. Since then we have removed 
> DirectParquetOutputCommitter (the biggest known offender) from the codebase 
> and rely on the default implementations.
> I believe we shouldn't restrict this anymore; users who override this 
> configuration should be expected to have tested their committer for 
> correctness. This unblocks using more sophisticated and performant output 
> committers without needing to override file format implementations.
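
For context, a hedged sketch of how a user-supplied committer is currently wired in through configuration (the committer class below is a hypothetical placeholder; Parquet additionally has its own committer setting):

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
# Hypothetical, user-tested committer class; per this ticket, it is ignored
# today when writing in append mode.
spark.conf.set("spark.sql.sources.outputCommitterClass",
               "com.example.MyTestedOutputCommitter")
spark.range(10).write.mode("append").json("/tmp/spark-21400-demo")
{code}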






[jira] [Created] (SPARK-21544) Test jar of some module should not install or deploy twice

2017-07-26 Thread zhoukang (JIRA)
zhoukang created SPARK-21544:


 Summary: Test jar of some module should not install or deploy twice
 Key: SPARK-21544
 URL: https://issues.apache.org/jira/browse/SPARK-21544
 Project: Spark
  Issue Type: Improvement
  Components: Deploy
Affects Versions: 2.1.0, 1.6.1
Reporter: zhoukang


For the modules below:
common/network-common
streaming
sql/core
sql/catalyst
the tests jar will be installed or deployed twice, like:

{code:java}
[DEBUG] Installing org.apache.spark:spark-streaming_2.11/maven-metadata.xml to 
/home/mi/.m2/repository/org/apache/spark/spark-streaming_2.11/maven-metadata-local.xml
[INFO] Installing 
/home/mi/Work/Spark/scala2.11/spark/streaming/target/spark-streaming_2.11-2.1.0-mdh2.1.0.1-SNAPSHOT-tests.jar
 to 
/home/mi/.m2/repository/org/apache/spark/spark-streaming_2.11/2.1.0-mdh2.1.0.1-SNAPSHOT/spark-streaming_2.11-2.1.0-mdh2.1.0.1-SNAPSHOT-tests.jar
[DEBUG] Skipped re-installing 
/home/mi/Work/Spark/scala2.11/spark/streaming/target/spark-streaming_2.11-2.1.0-mdh2.1.0.1-SNAPSHOT-tests.jar
 to 
/home/mi/.m2/repository/org/apache/spark/spark-streaming_2.11/2.1.0-mdh2.1.0.1-SNAPSHOT/spark-streaming_2.11-2.1.0-mdh2.1.0.1-SNAPSHOT-tests.jar,
 seems unchanged
{code}
The reason is below:

{code:java}
[DEBUG]   (f) artifact = 
org.apache.spark:spark-streaming_2.11:jar:2.1.0-mdh2.1.0.1-SNAPSHOT
[DEBUG]   (f) attachedArtifacts = 
[org.apache.spark:spark-streaming_2.11:test-jar:tests:2.1.0-mdh2.1.0.1-SNAPSHOT,
 org.apache.spark:spark-streaming_2.11:jar:tests:2.1.0-mdh2.1.0.1-SNAPSHOT, 
org.apache.spark:spark
-streaming_2.11:java-source:sources:2.1.0-mdh2.1.0.1-SNAPSHOT, 
org.apache.spark:spark-streaming_2.11:java-source:test-sources:2.1.0-mdh2.1.0.1-SNAPSHOT,
 org.apache.spark:spark-streaming_2.11:javadoc:javadoc:2.1.0
-mdh2.1.0.1-SNAPSHOT]
{code}
When executing 'mvn deploy' to Nexus during a release, it will fail, since artifacts 
in the release Nexus repository cannot be overwritten.







[jira] [Commented] (SPARK-21400) Spark shouldn't ignore user defined output committer in append mode

2017-07-26 Thread Robert Kruszewski (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16102638#comment-16102638
 ] 

Robert Kruszewski commented on SPARK-21400:
---

Fixed in [https://github.com/apache/spark/pull/18689]

> Spark shouldn't ignore user defined output committer in append mode
> ---
>
> Key: SPARK-21400
> URL: https://issues.apache.org/jira/browse/SPARK-21400
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Robert Kruszewski
>
> In https://issues.apache.org/jira/browse/SPARK-8578 we decided to override 
> user-defined output committers in append mode. The reasoning was that some 
> output committers can lead to correctness issues. Since then we have removed 
> DirectParquetOutputCommitter (the biggest known offender) from the codebase 
> and rely on the default implementations.
> I believe we shouldn't restrict this anymore; users who override this 
> configuration should be expected to have tested their committer for 
> correctness. This unblocks using more sophisticated and performant output 
> committers without needing to override file format implementations.






[jira] [Created] (SPARK-21543) Should not count executor initialization failures towards task failures

2017-07-26 Thread zhoukang (JIRA)
zhoukang created SPARK-21543:


 Summary: Should not count executor initialization failures towards task failures
 Key: SPARK-21543
 URL: https://issues.apache.org/jira/browse/SPARK-21543
 Project: Spark
  Issue Type: Improvement
  Components: YARN
Affects Versions: 2.1.0, 1.6.1
Reporter: zhoukang


Currently, when an executor fails to initialize and exits with error code 1, the 
failure counts toward task failures. I think executor initialization failures 
should not count towards task failures.






[jira] [Comment Edited] (SPARK-21539) Job should not be aborted when dynamic allocation is enabled or spark.executor.instances larger than current allocated number by yarn

2017-07-26 Thread zhoukang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16101686#comment-16101686
 ] 

zhoukang edited comment on SPARK-21539 at 7/27/17 2:06 AM:
---

I want to report the node blacklist to YARN in real time. Does this make 
sense? [~squito]


was (Author: cane):
I am working on this.

> Job should not be aborted when dynamic allocation is enabled or 
> spark.executor.instances larger than current allocated number by yarn
> -
>
> Key: SPARK-21539
> URL: https://issues.apache.org/jira/browse/SPARK-21539
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.6.1, 2.1.0
>Reporter: zhoukang
>
> For Spark on YARN:
> Right now, when a TaskSet cannot run on any node or host (which means 
> blacklistedEverywhere is true in TaskSetManager#abortIfCompleteBlacklisted), 
> the job is aborted.
> However, if dynamic allocation is enabled, we should wait for YARN to 
> allocate a new NodeManager in order to execute the job successfully.
> How to reproduce?
> 1. Set up a YARN cluster with 5 nodes, and give one node (node1) much more CPU 
> and memory, so that YARN can launch containers on it even when it is 
> blacklisted by the TaskScheduler.
> 2. Modify BlockManager#registerWithExternalShuffleServer:
> {code:java}
> logInfo("Registering executor with local external shuffle service.")
> val shuffleConfig = new ExecutorShuffleInfo(
>   diskBlockManager.localDirs.map(_.toString),
>   diskBlockManager.subDirsPerLocalDir,
>   shuffleManager.getClass.getName)
> val MAX_ATTEMPTS = conf.get(config.SHUFFLE_REGISTRATION_MAX_ATTEMPTS)
> val SLEEP_TIME_SECS = 5
> for (i <- 1 to MAX_ATTEMPTS) {
>   try {
> {color:red}if (shuffleId.host.equals("node1's address")) {
>  throw new Exception
> }{color}
> // Synchronous and will throw an exception if we cannot connect.
> 
> shuffleClient.asInstanceOf[ExternalShuffleClient].registerWithShuffleServer(
>   shuffleServerId.host, shuffleServerId.port, 
> shuffleServerId.executorId, shuffleConfig)
> return
>   } catch {
> case e: Exception if i < MAX_ATTEMPTS =>
>   logError(s"Failed to connect to external shuffle server, will retry 
> ${MAX_ATTEMPTS - i}"
> + s" more times after waiting $SLEEP_TIME_SECS seconds...", e)
>   Thread.sleep(SLEEP_TIME_SECS * 1000)
> case NonFatal(e) =>
>   throw new SparkException("Unable to register with external shuffle 
> server due to : " +
> e.getMessage, e)
>   }
> }
> {code}
> Add the logic shown in red.
> 3. Set the shuffle service enabled flag to true and enable the external shuffle 
> service for YARN.
> Then YARN will always launch executors on node1, but they fail since the 
> shuffle service registration cannot succeed.
> Then the job will be aborted.
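
For step 3, a minimal sketch of the Spark-side settings (the YARN NodeManager aux-service itself is configured in yarn-site.xml and is not shown here):

{code:python}
from pyspark import SparkConf

conf = (SparkConf()
        .set("spark.shuffle.service.enabled", "true")      # external shuffle service
        .set("spark.dynamicAllocation.enabled", "true"))    # the scenario in this ticket
{code}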






[jira] [Updated] (SPARK-21539) Job should not be aborted when dynamic allocation is enabled or spark.executor.instances larger than current allocated number by yarn

2017-07-26 Thread zhoukang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21539?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhoukang updated SPARK-21539:
-
Affects Version/s: (was: 2.2.0)

> Job should not be aborted when dynamic allocation is enabled or 
> spark.executor.instances larger than current allocated number by yarn
> -
>
> Key: SPARK-21539
> URL: https://issues.apache.org/jira/browse/SPARK-21539
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.6.1, 2.1.0
>Reporter: zhoukang
>
> For Spark on YARN:
> Right now, when a TaskSet cannot run on any node or host (which means 
> blacklistedEverywhere is true in TaskSetManager#abortIfCompleteBlacklisted), 
> the job is aborted.
> However, if dynamic allocation is enabled, we should wait for YARN to 
> allocate a new NodeManager in order to execute the job successfully.
> How to reproduce?
> 1. Set up a YARN cluster with 5 nodes, and give one node (node1) much more CPU 
> and memory, so that YARN can launch containers on it even when it is 
> blacklisted by the TaskScheduler.
> 2. Modify BlockManager#registerWithExternalShuffleServer:
> {code:java}
> logInfo("Registering executor with local external shuffle service.")
> val shuffleConfig = new ExecutorShuffleInfo(
>   diskBlockManager.localDirs.map(_.toString),
>   diskBlockManager.subDirsPerLocalDir,
>   shuffleManager.getClass.getName)
> val MAX_ATTEMPTS = conf.get(config.SHUFFLE_REGISTRATION_MAX_ATTEMPTS)
> val SLEEP_TIME_SECS = 5
> for (i <- 1 to MAX_ATTEMPTS) {
>   try {
> {color:red}if (shuffleId.host.equals("node1's address")) {
>  throw new Exception
> }{color}
> // Synchronous and will throw an exception if we cannot connect.
> 
> shuffleClient.asInstanceOf[ExternalShuffleClient].registerWithShuffleServer(
>   shuffleServerId.host, shuffleServerId.port, 
> shuffleServerId.executorId, shuffleConfig)
> return
>   } catch {
> case e: Exception if i < MAX_ATTEMPTS =>
>   logError(s"Failed to connect to external shuffle server, will retry 
> ${MAX_ATTEMPTS - i}"
> + s" more times after waiting $SLEEP_TIME_SECS seconds...", e)
>   Thread.sleep(SLEEP_TIME_SECS * 1000)
> case NonFatal(e) =>
>   throw new SparkException("Unable to register with external shuffle 
> server due to : " +
> e.getMessage, e)
>   }
> }
> {code}
> Add the logic shown in red.
> 3. Set the shuffle service enabled flag to true and enable the external shuffle 
> service for YARN.
> Then YARN will always launch executors on node1, but they fail since the 
> shuffle service registration cannot succeed.
> Then the job will be aborted.






[jira] [Reopened] (SPARK-21539) Job should not be aborted when dynamic allocation is enabled or spark.executor.instances larger than current allocated number by yarn

2017-07-26 Thread zhoukang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21539?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhoukang reopened SPARK-21539:
--

> Job should not be aborted when dynamic allocation is enabled or 
> spark.executor.instances larger than current allocated number by yarn
> -
>
> Key: SPARK-21539
> URL: https://issues.apache.org/jira/browse/SPARK-21539
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.6.1, 2.1.0
>Reporter: zhoukang
>
> For Spark on YARN:
> Right now, when a TaskSet cannot run on any node or host (which means 
> blacklistedEverywhere is true in TaskSetManager#abortIfCompleteBlacklisted), 
> the job is aborted.
> However, if dynamic allocation is enabled, we should wait for YARN to 
> allocate a new NodeManager in order to execute the job successfully.
> How to reproduce?
> 1. Set up a YARN cluster with 5 nodes, and give one node (node1) much more CPU 
> and memory, so that YARN can launch containers on it even when it is 
> blacklisted by the TaskScheduler.
> 2. Modify BlockManager#registerWithExternalShuffleServer:
> {code:java}
> logInfo("Registering executor with local external shuffle service.")
> val shuffleConfig = new ExecutorShuffleInfo(
>   diskBlockManager.localDirs.map(_.toString),
>   diskBlockManager.subDirsPerLocalDir,
>   shuffleManager.getClass.getName)
> val MAX_ATTEMPTS = conf.get(config.SHUFFLE_REGISTRATION_MAX_ATTEMPTS)
> val SLEEP_TIME_SECS = 5
> for (i <- 1 to MAX_ATTEMPTS) {
>   try {
> {color:red}if (shuffleId.host.equals("node1's address")) {
>  throw new Exception
> }{color}
> // Synchronous and will throw an exception if we cannot connect.
> 
> shuffleClient.asInstanceOf[ExternalShuffleClient].registerWithShuffleServer(
>   shuffleServerId.host, shuffleServerId.port, 
> shuffleServerId.executorId, shuffleConfig)
> return
>   } catch {
> case e: Exception if i < MAX_ATTEMPTS =>
>   logError(s"Failed to connect to external shuffle server, will retry 
> ${MAX_ATTEMPTS - i}"
> + s" more times after waiting $SLEEP_TIME_SECS seconds...", e)
>   Thread.sleep(SLEEP_TIME_SECS * 1000)
> case NonFatal(e) =>
>   throw new SparkException("Unable to register with external shuffle 
> server due to : " +
> e.getMessage, e)
>   }
> }
> {code}
> Add the logic shown in red.
> 3. Set the shuffle service enabled flag to true and enable the external shuffle 
> service for YARN.
> Then YARN will always launch executors on node1, but they fail since the 
> shuffle service registration cannot succeed.
> Then the job will be aborted.






[jira] [Commented] (SPARK-21533) "configure(...)" method not called when using Hive Generic UDFs

2017-07-26 Thread Feng Zhu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21533?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16102598#comment-16102598
 ] 

Feng Zhu commented on SPARK-21533:
--

Could you post any examples?

> "configure(...)" method not called when using Hive Generic UDFs
> ---
>
> Key: SPARK-21533
> URL: https://issues.apache.org/jira/browse/SPARK-21533
> Project: Spark
>  Issue Type: Bug
>  Components: Java API, SQL
>Affects Versions: 2.1.1
>Reporter: Dean Gurvitz
>Priority: Minor
>
> Using Spark 2.1.1 and Java API, when executing a Hive Generic UDF through the 
> Spark SQL API, the configure() method in it is not called prior to the 
> initialize/evaluate methods as expected. 
> The method configure receives a MapredContext object. It is possible to 
> construct a version of such an object adjusted to Spark, and therefore 
> configure should be called to enable a smooth execution of all Hive Generic 
> UDFs.
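
A hedged sketch of the kind of example being asked for: invoking a Hive GenericUDF through Spark SQL from PySpark (the UDF class name and jar path are hypothetical placeholders):

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()
spark.sql("""
  CREATE TEMPORARY FUNCTION my_generic_udf
  AS 'com.example.hive.MyGenericUDF'
  USING JAR 'hdfs:///udfs/my-udfs.jar'
""")
# The reported problem: MyGenericUDF.configure(MapredContext) is never invoked
# before initialize()/evaluate() when the UDF runs through this path.
spark.sql("SELECT my_generic_udf(1)").show()
{code}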






[jira] [Closed] (SPARK-21539) Job should not be aborted when dynamic allocation is enabled or spark.executor.instances larger than current allocated number by yarn

2017-07-26 Thread zhoukang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21539?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhoukang closed SPARK-21539.

Resolution: Not A Problem

> Job should not be aborted when dynamic allocation is enabled or 
> spark.executor.instances larger than current allocated number by yarn
> -
>
> Key: SPARK-21539
> URL: https://issues.apache.org/jira/browse/SPARK-21539
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.6.1, 2.1.0, 2.2.0
>Reporter: zhoukang
>
> For Spark on YARN:
> Right now, when a TaskSet cannot run on any node or host (which means 
> blacklistedEverywhere is true in TaskSetManager#abortIfCompleteBlacklisted), 
> the job is aborted.
> However, if dynamic allocation is enabled, we should wait for YARN to 
> allocate a new NodeManager in order to execute the job successfully.
> How to reproduce?
> 1. Set up a YARN cluster with 5 nodes, and give one node (node1) much more CPU 
> and memory, so that YARN can launch containers on it even when it is 
> blacklisted by the TaskScheduler.
> 2. Modify BlockManager#registerWithExternalShuffleServer:
> {code:java}
> logInfo("Registering executor with local external shuffle service.")
> val shuffleConfig = new ExecutorShuffleInfo(
>   diskBlockManager.localDirs.map(_.toString),
>   diskBlockManager.subDirsPerLocalDir,
>   shuffleManager.getClass.getName)
> val MAX_ATTEMPTS = conf.get(config.SHUFFLE_REGISTRATION_MAX_ATTEMPTS)
> val SLEEP_TIME_SECS = 5
> for (i <- 1 to MAX_ATTEMPTS) {
>   try {
> {color:red}if (shuffleId.host.equals("node1's address")) {
>  throw new Exception
> }{color}
> // Synchronous and will throw an exception if we cannot connect.
> 
> shuffleClient.asInstanceOf[ExternalShuffleClient].registerWithShuffleServer(
>   shuffleServerId.host, shuffleServerId.port, 
> shuffleServerId.executorId, shuffleConfig)
> return
>   } catch {
> case e: Exception if i < MAX_ATTEMPTS =>
>   logError(s"Failed to connect to external shuffle server, will retry 
> ${MAX_ATTEMPTS - i}"
> + s" more times after waiting $SLEEP_TIME_SECS seconds...", e)
>   Thread.sleep(SLEEP_TIME_SECS * 1000)
> case NonFatal(e) =>
>   throw new SparkException("Unable to register with external shuffle 
> server due to : " +
> e.getMessage, e)
>   }
> }
> {code}
> Add the logic shown in red.
> 3. Set the shuffle service enabled flag to true and enable the external shuffle 
> service for YARN.
> Then YARN will always launch executors on node1, but they fail since the 
> shuffle service registration cannot succeed.
> Then the job will be aborted.






[jira] [Resolved] (SPARK-21540) add spark.sql.functions.map_keys and spark.sql.functions.map_values

2017-07-26 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21540?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-21540.
--
Resolution: Duplicate

I guess this was added in SPARK-19975.

> add spark.sql.functions.map_keys and spark.sql.functions.map_values
> ---
>
> Key: SPARK-21540
> URL: https://issues.apache.org/jira/browse/SPARK-21540
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: yu peng
>
> For DataFrame/SQL we support MapType but, unlike ArrayType, about all we can 
> do with it is explode to unpack it.
> getItem is just not powerful enough for manipulating MapType.
> df.select(map_keys('$map')) and df.select(map_values('$map')) would be really 
> useful.
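
For reference, a sketch of the requested usage, assuming the functions are exposed as pyspark.sql.functions.map_keys / map_values (which is what SPARK-19975 added):

{code:python}
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([({"a": 1, "b": 2},)], ["m"])   # a single MapType column
df.select(F.map_keys("m").alias("keys"),
          F.map_values("m").alias("values")).show()
{code}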






[jira] [Updated] (SPARK-21542) Helper functions for custom Python Persistence

2017-07-26 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-21542:
--
Description: 
Currently, there is no way to easily persist Json-serializable parameters in 
Python only. All parameters in Python are persisted by converting them to Java 
objects and using the Java persistence implementation. In order to facilitate 
the creation of custom Python-only pipeline stages, it would be good to have a 
Python-only persistence framework so that these stages do not need to be 
implemented in Scala for persistence. 

This task involves:
- Adding implementations for DefaultParamsReadable, DefaultParamsWriteable, 
DefaultParamsReader, and DefaultParamsWriter in pyspark.

  was:
Currnetly, there is no way to easily persist Json-serializable parameters in 
Python only. All parameters in Python are persisted by converting them to Java 
objects and using the Java persistence implementation. In order to facilitate 
the creation of custom Python-only pipeline stages, it would be good to have a 
Python-only persistence framework so that these stages do not need to be 
implemented in Scala for persistence. 

This task involves:
- Adding implementations for DefaultParamsReadable, DefaultParamsWriteable, 
DefaultParamsReader, and DefaultParamsWriter in pyspark.


> Helper functions for custom Python Persistence
> --
>
> Key: SPARK-21542
> URL: https://issues.apache.org/jira/browse/SPARK-21542
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, PySpark
>Affects Versions: 2.2.0
>Reporter: Ajay Saini
>
> Currently, there is no way to easily persist Json-serializable parameters in 
> Python only. All parameters in Python are persisted by converting them to 
> Java objects and using the Java persistence implementation. In order to 
> facilitate the creation of custom Python-only pipeline stages, it would be 
> good to have a Python-only persistence framework so that these stages do not 
> need to be implemented in Scala for persistence. 
> This task involves:
> - Adding implementations for DefaultParamsReadable, DefaultParamsWriteable, 
> DefaultParamsReader, and DefaultParamsWriter in pyspark.
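
A hedged sketch of the intended outcome, assuming pyspark gains mixins mirroring Scala's DefaultParamsReadable/DefaultParamsWritable (class names follow the ticket; the final API may differ):

{code:python}
from pyspark.ml import Transformer
from pyspark.ml.param.shared import HasInputCol, HasOutputCol
from pyspark.ml.util import DefaultParamsReadable, DefaultParamsWritable
from pyspark.sql import functions as F

class PlusOne(Transformer, HasInputCol, HasOutputCol,
              DefaultParamsReadable, DefaultParamsWritable):
    """A Python-only stage whose params persist as JSON, with no Scala twin."""

    def __init__(self, inputCol="in", outputCol="out"):
        super(PlusOne, self).__init__()
        self._set(inputCol=inputCol, outputCol=outputCol)

    def _transform(self, dataset):
        return dataset.withColumn(self.getOutputCol(),
                                  F.col(self.getInputCol()) + 1)

# stage = PlusOne(inputCol="x", outputCol="y")
# stage.write().overwrite().save("/tmp/plus_one")   # params only, no JVM round-trip
# same_stage = PlusOne.load("/tmp/plus_one")
{code}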






[jira] [Updated] (SPARK-21542) Helper functions for custom Python Persistence

2017-07-26 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-21542:
--
Component/s: ML

> Helper functions for custom Python Persistence
> --
>
> Key: SPARK-21542
> URL: https://issues.apache.org/jira/browse/SPARK-21542
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, PySpark
>Affects Versions: 2.2.0
>Reporter: Ajay Saini
>
> Currnetly, there is no way to easily persist Json-serializable parameters in 
> Python only. All parameters in Python are persisted by converting them to 
> Java objects and using the Java persistence implementation. In order to 
> facilitate the creation of custom Python-only pipeline stages, it would be 
> good to have a Python-only persistence framework so that these stages do not 
> need to be implemented in Scala for persistence. 
> This task involves:
> - Adding implementations for DefaultParamsReadable, DefaultParamsWriteable, 
> DefaultParamsReader, and DefaultParamsWriter in pyspark.






[jira] [Updated] (SPARK-21542) Helper functions for custom Python Persistence

2017-07-26 Thread Ajay Saini (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ajay Saini updated SPARK-21542:
---
Component/s: (was: ML)

> Helper functions for custom Python Persistence
> --
>
> Key: SPARK-21542
> URL: https://issues.apache.org/jira/browse/SPARK-21542
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark
>Affects Versions: 2.2.0
>Reporter: Ajay Saini
>
> Currnetly, there is no way to easily persist Json-serializable parameters in 
> Python only. All parameters in Python are persisted by converting them to 
> Java objects and using the Java persistence implementation. In order to 
> facilitate the creation of custom Python-only pipeline stages, it would be 
> good to have a Python-only persistence framework so that these stages do not 
> need to be implemented in Scala for persistence. 
> This task involves:
> - Adding implementations for DefaultParamsReadable, DefaultParamsWriteable, 
> DefaultParamsReader, and DefaultParamsWriter in pyspark.






[jira] [Created] (SPARK-21542) Helper functions for custom Python Persistence

2017-07-26 Thread Ajay Saini (JIRA)
Ajay Saini created SPARK-21542:
--

 Summary: Helper functions for custom Python Persistence
 Key: SPARK-21542
 URL: https://issues.apache.org/jira/browse/SPARK-21542
 Project: Spark
  Issue Type: New Feature
  Components: ML, PySpark
Affects Versions: 2.2.0
Reporter: Ajay Saini


Currnetly, there is no way to easily persist Json-serializable parameters in 
Python only. All parameters in Python are persisted by converting them to Java 
objects and using the Java persistence implementation. In order to facilitate 
the creation of custom Python-only pipeline stages, it would be good to have a 
Python-only persistence framework so that these stages do not need to be 
implemented in Scala for persistence. 

This task involves:
- Adding implementations for DefaultParamsReadable, DefaultParamsWriteable, 
DefaultParamsReader, and DefaultParamsWriter in pyspark.






[jira] [Commented] (SPARK-21190) SPIP: Vectorized UDFs in Python

2017-07-26 Thread Bryan Cutler (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16102420#comment-16102420
 ] 

Bryan Cutler commented on SPARK-21190:
--

Hi [~icexelloss], yes I think there is definitely some basic framework needed 
to enable vectorized UDFs in Python that could be done before any API 
improvements.  Like you pointed out in (3) the PythonRDD and pyspark.worker 
will need to be more flexible to handle a different udf, as well as some work 
on the pyspark.serializer to produce/consume vector data.  I was hoping 
[#18659|https://github.com/apache/spark/pull/18659] could serve as this basis.  
I plan on updating it to use the {{ArrowColumnVector}} and {{ArrowWriter}} from 
[~ueshin] when that gets merged.
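
To illustrate the produce/consume-vector-data part of the serializer work in isolation, a hedged sketch using only the pyarrow API (this is not Spark's internal serializer):

{code:python}
import pandas as pd
import pyarrow as pa

pdf = pd.DataFrame({"a": [1, 2, 3], "b": [0.5, 1.0, 1.5]})
batch = pa.RecordBatch.from_pandas(pdf)            # columnar batch, not pickled rows
sink = pa.BufferOutputStream()
writer = pa.ipc.new_stream(sink, batch.schema)
writer.write_batch(batch)
writer.close()
payload = sink.getvalue()                          # bytes crossing the JVM/Python boundary
{code}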

> SPIP: Vectorized UDFs in Python
> ---
>
> Key: SPARK-21190
> URL: https://issues.apache.org/jira/browse/SPARK-21190
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark, SQL
>Affects Versions: 2.2.0
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>  Labels: SPIP
> Attachments: SPIPVectorizedUDFsforPython (1).pdf
>
>
> *Background and Motivation*
> Python is one of the most popular programming languages among Spark users. 
> Spark currently exposes a row-at-a-time interface for defining and executing 
> user-defined functions (UDFs). This introduces high overhead in serialization 
> and deserialization, and also makes it difficult to leverage Python libraries 
> (e.g. numpy, Pandas) that are written in native code.
>  
> This proposal advocates introducing new APIs to support vectorized UDFs in 
> Python, in which a block of data is transferred over to Python in some 
> columnar format for execution.
>  
>  
> *Target Personas*
> Data scientists, data engineers, library developers.
>  
> *Goals*
> - Support vectorized UDFs that apply on chunks of the data frame
> - Low system overhead: Substantially reduce serialization and deserialization 
> overhead when compared with row-at-a-time interface
> - UDF performance: Enable users to leverage native libraries in Python (e.g. 
> numpy, Pandas) for data manipulation in these UDFs
>  
> *Non-Goals*
> The following are explicitly out of scope for the current SPIP, and should be 
> done in future SPIPs. Nonetheless, it would be good to consider these future 
> use cases during API design, so we can achieve some consistency when rolling 
> out new APIs.
>  
> - Define block oriented UDFs in other languages (that are not Python).
> - Define aggregate UDFs
> - Tight integration with machine learning frameworks
>  
> *Proposed API Changes*
> The following sketches some possibilities. I haven’t spent a lot of time 
> thinking about the API (wrote it down in 5 mins) and I am not attached to 
> this design at all. The main purpose of the SPIP is to get feedback on use 
> cases and see how they can impact API design.
>  
> A few things to consider are:
>  
> 1. Python is dynamically typed, whereas DataFrames/SQL requires static, 
> analysis time typing. This means users would need to specify the return type 
> of their UDFs.
>  
> 2. Ratio of input rows to output rows. We propose initially we require number 
> of output rows to be the same as the number of input rows. In the future, we 
> can consider relaxing this constraint with support for vectorized aggregate 
> UDFs.
> 3. How do we handle null values, since Pandas doesn't have the concept of 
> nulls?
>  
> Proposed API sketch (using examples):
>  
> Use case 1. A function that defines all the columns of a DataFrame (similar 
> to a “map” function):
>  
> {code}
> @spark_udf(some way to describe the return schema)
> def my_func_on_entire_df(input):
>   """ Some user-defined function.
>  
>   :param input: A Pandas DataFrame with two columns, a and b.
>   :return: :class: A Pandas data frame.
>   """
>   input[c] = input[a] + input[b]
>   input[d] = input[a] - input[b]
>   return input
>  
> spark.range(1000).selectExpr("id a", "id / 2 b")
>   .mapBatches(my_func_on_entire_df)
> {code}
>  
> Use case 2. A function that defines only one column (similar to existing 
> UDFs):
>  
> {code}
> @spark_udf(some way to describe the return schema)
> def my_func_that_returns_one_column(input):
>   """ Some user-defined function.
>  
>   :param input: A Pandas DataFrame with two columns, a and b.
>   :return: :class: A numpy array
>   """
>   return input[a] + input[b]
>  
> my_func = udf(my_func_that_returns_one_column)
>  
> df = spark.range(1000).selectExpr("id a", "id / 2 b")
> df.withColumn("c", my_func(df.a, df.b))
> {code}
>  
>  
>  
> *Optional Design Sketch*
> I’m more concerned about getting proper feedback for API design. The 
> implementation should be pretty straightforward and is not a huge concern at 
> this point. We can leverage the same implementation for faster 

[jira] [Updated] (SPARK-21541) Spark Logs show incorrect job status for a job that does not create SparkContext

2017-07-26 Thread Parth Gandhi (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21541?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Parth Gandhi updated SPARK-21541:
-
Description: 
If you run a Spark job without creating a SparkSession or SparkContext, the 
Spark job logs say it succeeded, but YARN says it failed and retries 3 times. 
Also, since the ApplicationMaster unregisters with the ResourceManager and exits 
successfully, it deletes the Spark staging directory, so when YARN makes 
subsequent retries, it fails to find the staging directory and thus the 
retries fail.

*Steps:*
1. For example, run a pyspark job without creating SparkSession or 
SparkContext. 
*Example:*
import sys
from random import random
from operator import add
from pyspark import SparkContext

if __name__ == "__main__":
  print("hello world")

2. Spark will mark it as FAILED. Go to the UI and check the container logs.

3. You will see the following information in the logs:
spark:
7/07/14 13:22:10 INFO ApplicationMaster: Final app status: SUCCEEDED, exitCode: 0
17/07/14 13:22:10 INFO ApplicationMaster: Unregistering ApplicationMaster with 
SUCCEEDED

But yarn logs will show:
2017-07-14 01:14:33,203 [AsyncDispatcher event handler] INFO 
attempt.RMAppAttemptImpl: appattempt_1493735952617_12443844_01 State change 
from FINAL_SAVING to FAILED

  was:
If you run a Spark job without creating a SparkSession or SparkContext, the 
Spark job logs say it succeeded, but YARN says it failed and retries 3 times. 

*Steps:*
1. For example, run a pyspark job without creating SparkSession or 
SparkContext. 
*Example:*
import sys
from random import random
from operator import add
from pyspark import SparkContext

if __name__ == "__main__":
  print("hello world")

2. Spark will mark it as FAILED. Go to the UI and check the container logs.

3. You will see the following information in the logs:
spark:
7/07/14 13:22:10 INFO ApplicationMaster: Final app status: SUCCEEDED, exitCode: 0
17/07/14 13:22:10 INFO ApplicationMaster: Unregistering ApplicationMaster with 
SUCCEEDED

But yarn logs will show:
2017-07-14 01:14:33,203 [AsyncDispatcher event handler] INFO 
attempt.RMAppAttemptImpl: appattempt_1493735952617_12443844_01 State change 
from FINAL_SAVING to FAILED


> Spark Logs show incorrect job status for a job that does not create 
> SparkContext
> 
>
> Key: SPARK-21541
> URL: https://issues.apache.org/jira/browse/SPARK-21541
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 2.2.0
>Reporter: Parth Gandhi
>Priority: Minor
>
> If you run a Spark job without creating a SparkSession or SparkContext, the 
> Spark job logs say it succeeded, but YARN says it failed and retries 3 times. 
> Also, since the ApplicationMaster unregisters with the ResourceManager and exits 
> successfully, it deletes the Spark staging directory, so when YARN makes 
> subsequent retries, it fails to find the staging directory and thus the 
> retries fail.
> *Steps:*
> 1. For example, run a pyspark job without creating SparkSession or 
> SparkContext. 
> *Example:*
> import sys
> from random import random
> from operator import add
> from pyspark import SparkContext
> if __name__ == "__main__":
>   print("hello world")
> 2. Spark will mark it as FAILED. Go to the UI and check the container logs.
> 3. You will see the following information in the logs:
> spark:
> 7/07/14 13:22:10 INFO ApplicationMaster: Final app status: SUCCEEDED, 
> exitCode: 0
> 17/07/14 13:22:10 INFO ApplicationMaster: Unregistering ApplicationMaster 
> with SUCCEEDED
> But yarn logs will show:
> 2017-07-14 01:14:33,203 [AsyncDispatcher event handler] INFO 
> attempt.RMAppAttemptImpl: appattempt_1493735952617_12443844_01 State 
> change from FINAL_SAVING to FAILED
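
For contrast, a minimal sketch (not from the ticket) of the same toy job written so that a SparkContext is actually created and stopped, which keeps the driver status and the YARN attempt status in agreement:

{code:python}
from pyspark import SparkContext

if __name__ == "__main__":
    sc = SparkContext(appName="hello-world")
    print("hello world, count =", sc.parallelize(range(10)).count())
    sc.stop()
{code}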






[jira] [Commented] (SPARK-21541) Spark Logs show incorrect job status for a job that does not create SparkContext

2017-07-26 Thread Parth Gandhi (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16102308#comment-16102308
 ] 

Parth Gandhi commented on SPARK-21541:
--

Currently working on the fix, will file a pull request as soon as it is done.

> Spark Logs show incorrect job status for a job that does not create 
> SparkContext
> 
>
> Key: SPARK-21541
> URL: https://issues.apache.org/jira/browse/SPARK-21541
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 2.2.0
>Reporter: Parth Gandhi
>Priority: Minor
>
> If you run a Spark job without creating a SparkSession or SparkContext, the 
> Spark job logs say it succeeded, but YARN says it failed and retries 3 times. 
> *Steps:*
> 1. For example, run a pyspark job without creating SparkSession or 
> SparkContext. 
> *Example:*
> import sys
> from random import random
> from operator import add
> from pyspark import SparkContext
> if __name__ == "__main__":
>   print("hello world")
> 2. Spark will mark it as FAILED. Go to the UI and check the container logs.
> 3. You will see the following information in the logs:
> spark:
> 7/07/14 13:22:10 INFO ApplicationMaster: Final app status: SUCCEEDED, 
> exitCode: 0
> 17/07/14 13:22:10 INFO ApplicationMaster: Unregistering ApplicationMaster 
> with SUCCEEDED
> But yarn logs will show:
> 2017-07-14 01:14:33,203 [AsyncDispatcher event handler] INFO 
> attempt.RMAppAttemptImpl: appattempt_1493735952617_12443844_01 State 
> change from FINAL_SAVING to FAILED






[jira] [Created] (SPARK-21541) Spark Logs show incorrect job status for a job that does not create SparkContext

2017-07-26 Thread Parth Gandhi (JIRA)
Parth Gandhi created SPARK-21541:


 Summary: Spark Logs show incorrect job status for a job that does 
not create SparkContext
 Key: SPARK-21541
 URL: https://issues.apache.org/jira/browse/SPARK-21541
 Project: Spark
  Issue Type: Bug
  Components: YARN
Affects Versions: 2.2.0
Reporter: Parth Gandhi
Priority: Minor


If you run a Spark job without creating a SparkSession or SparkContext, the 
Spark job logs say it succeeded, but YARN says it failed and retries 3 times. 

*Steps:*
1. For example, run a pyspark job without creating SparkSession or 
SparkContext. 
*Example:*
import sys
from random import random
from operator import add
from pyspark import SparkContext

if __name__ == "__main__":
  print("hello world")

2. Spark will mark it as FAILED. Go to the UI and check the container logs.

3. You will see the following information in the logs:
spark:
7/07/14 13:22:10 INFO ApplicationMaster: Final app status: SUCCEEDED, exitCode: 0
17/07/14 13:22:10 INFO ApplicationMaster: Unregistering ApplicationMaster with 
SUCCEEDED

But yarn logs will show:
2017-07-14 01:14:33,203 [AsyncDispatcher event handler] INFO 
attempt.RMAppAttemptImpl: appattempt_1493735952617_12443844_01 State change 
from FINAL_SAVING to FAILED






[jira] [Commented] (SPARK-20418) multi-label classification support

2017-07-26 Thread Weichen Xu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16102280#comment-16102280
 ] 

Weichen Xu commented on SPARK-20418:


I will work on this.

> multi-label classification support
> --
>
> Key: SPARK-20418
> URL: https://issues.apache.org/jira/browse/SPARK-20418
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Affects Versions: 2.1.0
>Reporter: yu peng
>
> Add a multi-label labeled point and a binary-relevance-based classifier 
> holder.
> http://scikit-learn.org/dev/modules/generated/sklearn.multioutput.MultiOutputClassifier.html#sklearn.multioutput.MultiOutputClassifier
> ```
> X = [ [ 0, 1, 0], [1, 0, 1]] 
> y = [[1], [1, 2]]
> df = sql.CreateDataframe(zip(X, y), ['x', 'y'])
> mlp_clf = 
> MultiOutputClassifier(LogisticRegression).setFeature('X').setLabel('y')
> mlp_clf.fit(df)
> ```
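
A hedged sketch of the binary-relevance idea with the existing pyspark API, one binary LogisticRegression per label (the column names and the helper are invented for illustration):

{code:python}
from pyspark.ml.classification import LogisticRegression
from pyspark.sql import functions as F

def fit_binary_relevance(df, features_col, labels_col, label_values):
    """Train one binary classifier per label in `label_values`."""
    models = {}
    for lbl in label_values:
        # 1.0 if the label appears in the array column, else 0.0
        binarized = df.withColumn(
            "y", F.array_contains(labels_col, lbl).cast("double"))
        lr = LogisticRegression(featuresCol=features_col, labelCol="y")
        models[lbl] = lr.fit(binarized)
    return models
{code}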






[jira] [Commented] (SPARK-11215) Add multiple columns support to StringIndexer

2017-07-26 Thread Weichen Xu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16102274#comment-16102274
 ] 

Weichen Xu commented on SPARK-11215:


I will take over this feature and create a PR soon.

> Add multiple columns support to StringIndexer
> -
>
> Key: SPARK-11215
> URL: https://issues.apache.org/jira/browse/SPARK-11215
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Yanbo Liang
>Assignee: Yanbo Liang
>
> Add multiple columns support to StringIndexer, then users can transform 
> multiple input columns to multiple output columns simultaneously. See 
> discussion SPARK-8418.
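Until that support lands, the usual workaround is one StringIndexer per column chained in 
a Pipeline; a small sketch (the dataframe df and the column names are assumed for 
illustration):
{code}
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer

# One StringIndexer per input column, combined in a single Pipeline.
input_cols = ["color", "size"]          # illustrative column names
indexers = [StringIndexer(inputCol=c, outputCol=c + "_idx") for c in input_cols]
indexed = Pipeline(stages=indexers).fit(df).transform(df)
{code}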



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-21535) Reduce memory requirement for CrossValidator and TrainValidationSplit

2017-07-26 Thread yuhao yang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16100860#comment-16100860
 ] 

yuhao yang edited comment on SPARK-21535 at 7/26/17 6:30 PM:
-

https://github.com/apache/spark/pull/18733


was (Author: yuhaoyan):
https://github.com/apache/spark/pulls 

> Reduce memory requirement for CrossValidator and TrainValidationSplit 
> --
>
> Key: SPARK-21535
> URL: https://issues.apache.org/jira/browse/SPARK-21535
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: yuhao yang
>
> CrossValidator and TrainValidationSplit both use 
> {code}models = est.fit(trainingDataset, epm) {code} to fit the models, where 
> epm is Array[ParamMap].
> Even though the training process is sequential, current implementation 
> consumes extra driver memory for holding the trained models, which is not 
> necessary and often leads to memory exceptions for both CrossValidator and 
> TrainValidationSplit. My proposal is to optimize the training implementation 
> so that a used model can be collected by GC, avoiding the unnecessary OOM 
> exceptions.
> E.g. when grid search space is 12, old implementation needs to hold all 12 
> trained models in the driver memory at the same time, while the new 
> implementation only needs to hold 1 trained model at a time, and previous 
> model can be cleared by GC.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21087) CrossValidator, TrainValidationSplit should preserve all models after fitting: Scala

2017-07-26 Thread Weichen Xu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21087?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16102048#comment-16102048
 ] 

Weichen Xu commented on SPARK-21087:


I will work on it.

> CrossValidator, TrainValidationSplit should preserve all models after 
> fitting: Scala
> 
>
> Key: SPARK-21087
> URL: https://issues.apache.org/jira/browse/SPARK-21087
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: Joseph K. Bradley
>
> See parent JIRA



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17025) Cannot persist PySpark ML Pipeline model that includes custom Transformer

2017-07-26 Thread Weichen Xu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17025?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16101994#comment-16101994
 ] 

Weichen Xu commented on SPARK-17025:


Because Scala calling Python is currently difficult and changing the related 
code is hard, maybe we can consider duplicating the import/export logic in a 
Python implementation: add a Python-implemented MLWriter for 
pyspark.ml.PipelineModel that traverses the list of stages; if a stage is a 
"javaModel", call the Java-side writer, otherwise call the stage's own 
Python-implemented writer.
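
A minimal sketch of that delegation idea (the class name and file layout are hypothetical, 
not the actual pyspark.ml API; a stage is assumed JVM-backed if it has a _to_java method, 
as in the traceback below):
{code}
import os

class PythonPipelineModelWriter(object):
    """Hypothetical Python-side writer for a PipelineModel (illustrative only)."""

    def __init__(self, pipeline_model):
        self.pipeline_model = pipeline_model

    def save(self, path):
        for i, stage in enumerate(self.pipeline_model.stages):
            stage_path = os.path.join(path, "stages", "%d_%s" % (i, stage.uid))
            if hasattr(stage, "_to_java"):
                # JVM-backed stage: hand off to the Java-side MLWriter via py4j.
                stage._to_java().write().save(stage_path)
            else:
                # Python-only stage: assumed to expose its own Python-implemented writer.
                stage.write().save(stage_path)
{code}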

> Cannot persist PySpark ML Pipeline model that includes custom Transformer
> -
>
> Key: SPARK-17025
> URL: https://issues.apache.org/jira/browse/SPARK-17025
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, PySpark
>Affects Versions: 2.0.0
>Reporter: Nicholas Chammas
>Priority: Minor
>
> Following the example in [this Databricks blog 
> post|https://databricks.com/blog/2016/05/31/apache-spark-2-0-preview-machine-learning-model-persistence.html]
>  under "Python tuning", I'm trying to save an ML Pipeline model.
> This pipeline, however, includes a custom transformer. When I try to save the 
> model, the operation fails because the custom transformer doesn't have a 
> {{_to_java}} attribute.
> {code}
> Traceback (most recent call last):
>   File ".../file.py", line 56, in 
> model.bestModel.save('model')
>   File 
> "/usr/local/Cellar/apache-spark/2.0.0/libexec/python/lib/pyspark.zip/pyspark/ml/pipeline.py",
>  line 222, in save
>   File 
> "/usr/local/Cellar/apache-spark/2.0.0/libexec/python/lib/pyspark.zip/pyspark/ml/pipeline.py",
>  line 217, in write
>   File 
> "/usr/local/Cellar/apache-spark/2.0.0/libexec/python/lib/pyspark.zip/pyspark/ml/util.py",
>  line 93, in __init__
>   File 
> "/usr/local/Cellar/apache-spark/2.0.0/libexec/python/lib/pyspark.zip/pyspark/ml/pipeline.py",
>  line 254, in _to_java
> AttributeError: 'PeoplePairFeaturizer' object has no attribute '_to_java'
> {code}
> Looking at the source code for 
> [ml/base.py|https://github.com/apache/spark/blob/acaf2a81ad5238fd1bc81e7be2c328f40c07e755/python/pyspark/ml/base.py],
>  I see that not even the base Transformer class has such an attribute.
> I'm assuming this is missing functionality that is intended to be patched up 
> (i.e. [like 
> this|https://github.com/apache/spark/blob/acaf2a81ad5238fd1bc81e7be2c328f40c07e755/python/pyspark/ml/classification.py#L1421-L1433]).
> I'm not sure if there is an existing JIRA for this (my searches didn't turn 
> up clear results).



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-17025) Cannot persist PySpark ML Pipeline model that includes custom Transformer

2017-07-26 Thread Ajay Saini (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17025?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16101984#comment-16101984
 ] 

Ajay Saini edited comment on SPARK-17025 at 7/26/17 5:40 PM:
-

I'm currently working on a solution to this that involves building custom 
persistence support for Python-only pipeline stages. As of now, you cannot 
persist a pipeline stage in Python unless there is a Java implementation of 
that stage. The framework I'm working on will make it much easier to implement 
Python-only persistence of custom stages so that they don't need to rely on 
Java.

This fix will be sent out later this week.


was (Author: ajaysaini):

I'm currently working on a solution to this that involves building custom 
persistence support for Python-only pipeline stages. As of now, you cannot 
persist a pipeline stage in Python unless there is a Java implementation of 
that stage. The framework I'm working on will make it much easier to implement 
Python-only persistence of custom stages so that they don't need to rely on 
Java.

> Cannot persist PySpark ML Pipeline model that includes custom Transformer
> -
>
> Key: SPARK-17025
> URL: https://issues.apache.org/jira/browse/SPARK-17025
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, PySpark
>Affects Versions: 2.0.0
>Reporter: Nicholas Chammas
>Priority: Minor
>
> Following the example in [this Databricks blog 
> post|https://databricks.com/blog/2016/05/31/apache-spark-2-0-preview-machine-learning-model-persistence.html]
>  under "Python tuning", I'm trying to save an ML Pipeline model.
> This pipeline, however, includes a custom transformer. When I try to save the 
> model, the operation fails because the custom transformer doesn't have a 
> {{_to_java}} attribute.
> {code}
> Traceback (most recent call last):
>   File ".../file.py", line 56, in 
> model.bestModel.save('model')
>   File 
> "/usr/local/Cellar/apache-spark/2.0.0/libexec/python/lib/pyspark.zip/pyspark/ml/pipeline.py",
>  line 222, in save
>   File 
> "/usr/local/Cellar/apache-spark/2.0.0/libexec/python/lib/pyspark.zip/pyspark/ml/pipeline.py",
>  line 217, in write
>   File 
> "/usr/local/Cellar/apache-spark/2.0.0/libexec/python/lib/pyspark.zip/pyspark/ml/util.py",
>  line 93, in __init__
>   File 
> "/usr/local/Cellar/apache-spark/2.0.0/libexec/python/lib/pyspark.zip/pyspark/ml/pipeline.py",
>  line 254, in _to_java
> AttributeError: 'PeoplePairFeaturizer' object has no attribute '_to_java'
> {code}
> Looking at the source code for 
> [ml/base.py|https://github.com/apache/spark/blob/acaf2a81ad5238fd1bc81e7be2c328f40c07e755/python/pyspark/ml/base.py],
>  I see that not even the base Transformer class has such an attribute.
> I'm assuming this is missing functionality that is intended to be patched up 
> (i.e. [like 
> this|https://github.com/apache/spark/blob/acaf2a81ad5238fd1bc81e7be2c328f40c07e755/python/pyspark/ml/classification.py#L1421-L1433]).
> I'm not sure if there is an existing JIRA for this (my searches didn't turn 
> up clear results).



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17025) Cannot persist PySpark ML Pipeline model that includes custom Transformer

2017-07-26 Thread Ajay Saini (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17025?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16101984#comment-16101984
 ] 

Ajay Saini commented on SPARK-17025:



I'm currently working on a solution to this that involves building custom 
persistence support for Python-only pipeline stages. As of now, you cannot 
persist a pipeline stage in Python unless there is a Java implementation of 
that stage. The framework I'm working on will make it much easier to implement 
Python-only persistence of custom stages so that they don't need to rely on 
Java.

> Cannot persist PySpark ML Pipeline model that includes custom Transformer
> -
>
> Key: SPARK-17025
> URL: https://issues.apache.org/jira/browse/SPARK-17025
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, PySpark
>Affects Versions: 2.0.0
>Reporter: Nicholas Chammas
>Priority: Minor
>
> Following the example in [this Databricks blog 
> post|https://databricks.com/blog/2016/05/31/apache-spark-2-0-preview-machine-learning-model-persistence.html]
>  under "Python tuning", I'm trying to save an ML Pipeline model.
> This pipeline, however, includes a custom transformer. When I try to save the 
> model, the operation fails because the custom transformer doesn't have a 
> {{_to_java}} attribute.
> {code}
> Traceback (most recent call last):
>   File ".../file.py", line 56, in 
> model.bestModel.save('model')
>   File 
> "/usr/local/Cellar/apache-spark/2.0.0/libexec/python/lib/pyspark.zip/pyspark/ml/pipeline.py",
>  line 222, in save
>   File 
> "/usr/local/Cellar/apache-spark/2.0.0/libexec/python/lib/pyspark.zip/pyspark/ml/pipeline.py",
>  line 217, in write
>   File 
> "/usr/local/Cellar/apache-spark/2.0.0/libexec/python/lib/pyspark.zip/pyspark/ml/util.py",
>  line 93, in __init__
>   File 
> "/usr/local/Cellar/apache-spark/2.0.0/libexec/python/lib/pyspark.zip/pyspark/ml/pipeline.py",
>  line 254, in _to_java
> AttributeError: 'PeoplePairFeaturizer' object has no attribute '_to_java'
> {code}
> Looking at the source code for 
> [ml/base.py|https://github.com/apache/spark/blob/acaf2a81ad5238fd1bc81e7be2c328f40c07e755/python/pyspark/ml/base.py],
>  I see that not even the base Transformer class has such an attribute.
> I'm assuming this is missing functionality that is intended to be patched up 
> (i.e. [like 
> this|https://github.com/apache/spark/blob/acaf2a81ad5238fd1bc81e7be2c328f40c07e755/python/pyspark/ml/classification.py#L1421-L1433]).
> I'm not sure if there is an existing JIRA for this (my searches didn't turn 
> up clear results).



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-21485) API Documentation for Spark SQL functions

2017-07-26 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-21485.
-
   Resolution: Fixed
 Assignee: Hyukjin Kwon
Fix Version/s: 2.3.0

> API Documentation for Spark SQL functions
> -
>
> Key: SPARK-21485
> URL: https://issues.apache.org/jira/browse/SPARK-21485
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, SQL
>Affects Versions: 2.3.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
> Fix For: 2.3.0
>
>
> It looks like we can generate the documentation from {{ExpressionDescription}} and 
> {{ExpressionInfo}} for Spark's SQL function documentation.
> I had some time to play with this so I just made a rough version - 
> https://spark-test.github.io/sparksqldoc/
> Codes I used are as below :
> In {{pyspark}} shell:
> {code}
> from collections import namedtuple
> ExpressionInfo = namedtuple("ExpressionInfo", "className usage name extended")
> jinfos = spark.sparkContext._jvm.org.apache.spark.sql.api.python.PythonSQLUtils.listBuiltinFunctions()
> infos = []
> for jinfo in jinfos:
>     name = jinfo.getName()
>     usage = jinfo.getUsage()
>     usage = usage.replace("_FUNC_", name) if usage is not None else usage
>     extended = jinfo.getExtended()
>     extended = extended.replace("_FUNC_", name) if extended is not None else extended
>     infos.append(ExpressionInfo(
>         className=jinfo.getClassName(),
>         usage=usage,
>         name=name,
>         extended=extended))
> with open("index.md", 'w') as mdfile:
>     strip = lambda s: "\n".join(map(lambda u: u.strip(), s.split("\n")))
>     for info in sorted(infos, key=lambda i: i.name):
>         mdfile.write("### %s\n\n" % info.name)
>         if info.usage is not None:
>             mdfile.write("%s\n\n" % strip(info.usage))
>         if info.extended is not None:
>             mdfile.write("```%s```\n\n" % strip(info.extended))
> {code}
> This change had to be made first before running the code above:
> {code:none}
> +++ 
> b/sql/core/src/main/scala/org/apache/spark/sql/api/python/PythonSQLUtils.scala
> @@ -17,9 +17,15 @@
>  package org.apache.spark.sql.api.python
> +import org.apache.spark.sql.catalyst.analysis.FunctionRegistry
> +import org.apache.spark.sql.catalyst.expressions.ExpressionInfo
>  import org.apache.spark.sql.catalyst.parser.CatalystSqlParser
>  import org.apache.spark.sql.types.DataType
>  private[sql] object PythonSQLUtils {
>def parseDataType(typeText: String): DataType = 
> CatalystSqlParser.parseDataType(typeText)
> +
> +  def listBuiltinFunctions(): Array[ExpressionInfo] = {
> +    FunctionRegistry.functionSet.flatMap(f => FunctionRegistry.builtin.lookupFunction(f)).toArray
> +  }
>  }
> {code}
> And then, I ran this:
> {code}
> mkdir docs
> echo "site_name: Spark SQL 2.3.0" >> mkdocs.yml
> echo "theme: readthedocs" >> mkdocs.yml
> mv index.md docs/index.md
> mkdocs serve
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6809) Make numPartitions optional in pairRDD APIs

2017-07-26 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16101900#comment-16101900
 ] 

Shivaram Venkataraman commented on SPARK-6809:
--

Yeah, I don't think this JIRA is applicable any more. We can close this.

> Make numPartitions optional in pairRDD APIs
> ---
>
> Key: SPARK-6809
> URL: https://issues.apache.org/jira/browse/SPARK-6809
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Reporter: Davies Liu
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-6809) Make numPartitions optional in pairRDD APIs

2017-07-26 Thread Shivaram Venkataraman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shivaram Venkataraman resolved SPARK-6809.
--
Resolution: Not A Problem

> Make numPartitions optional in pairRDD APIs
> ---
>
> Key: SPARK-6809
> URL: https://issues.apache.org/jira/browse/SPARK-6809
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Reporter: Davies Liu
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-21540) add spark.sql.functions.map_keys and spark.sql.functions.map_values

2017-07-26 Thread yu peng (JIRA)
yu peng created SPARK-21540:
---

 Summary: add spark.sql.functions.map_keys and 
spark.sql.functions.map_values
 Key: SPARK-21540
 URL: https://issues.apache.org/jira/browse/SPARK-21540
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.2.0
Reporter: yu peng


For DataFrame/SQL we support MapType, but unlike ArrayType, we can only explode 
to unpack it.
getItem is just not powerful enough for manipulating MapType columns.

df.select(map_keys('$map')) and df.select(map_values('$map')) would be really 
useful.
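
For reference, if the SQL built-ins map_keys/map_values are available in your Spark 
version, they can already be reached through expr(); the proposal is essentially to add 
matching wrappers in spark.sql.functions. A small sketch (the column name "m" is assumed):
{code}
from pyspark.sql import SparkSession
from pyspark.sql.functions import expr

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([({"a": 1, "b": 2},)], ["m"])   # "m" is a MapType column

# Reach the SQL built-ins through expr(); the proposed wrappers would let this
# be written without expr().
df.select(expr("map_keys(m)").alias("keys"),
          expr("map_values(m)").alias("values")).show()
{code}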



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21245) Resolve code duplication for classification/regression summarizers

2017-07-26 Thread Seth Hendrickson (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21245?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Seth Hendrickson updated SPARK-21245:
-
  Labels: starter  (was: )
Priority: Minor  (was: Major)

> Resolve code duplication for classification/regression summarizers
> --
>
> Key: SPARK-21245
> URL: https://issues.apache.org/jira/browse/SPARK-21245
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Affects Versions: 2.2.1
>Reporter: Seth Hendrickson
>Priority: Minor
>  Labels: starter
>
> In several places (LogReg, LinReg, SVC) in Spark ML, we collect summary 
> information about training data using {{MultivariateOnlineSummarizer}} and 
> {{MulticlassSummarizer}}. We have the same code appearing in several places 
> (and including test suites). We can eliminate this by creating a common 
> implementation somewhere.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21535) Reduce memory requirement for CrossValidator and TrainValidationSplit

2017-07-26 Thread yuhao yang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16101870#comment-16101870
 ] 

yuhao yang commented on SPARK-21535:


The basic idea is that we should release the driver memory as soon as a trained 
model is evaluated. I don't think there's any conflict, but let me know if there 
is and I'll revert the JIRA.

I'm not a big fan of the parallel CV idea. Personally I cannot see how it 
improves the overall performance or ease of use, but maybe I just haven't run 
into the appropriate scenarios.
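
A rough PySpark-flavoured sketch of that pattern (the names est, epm, train_df, 
validation_df and evaluator are assumptions mirroring what CrossValidator works with; 
this is not the actual patch):
{code}
# Fit and evaluate one model per param map, keeping no array of trained models
# on the driver, so each model becomes garbage-collectable right after use.
metrics = []
for param_map in epm:                      # epm: list of param maps
    model = est.fit(train_df, param_map)   # fit a single model
    metrics.append(evaluator.evaluate(model.transform(validation_df)))
    model = None                           # drop the only reference to the model
{code}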

> Reduce memory requirement for CrossValidator and TrainValidationSplit 
> --
>
> Key: SPARK-21535
> URL: https://issues.apache.org/jira/browse/SPARK-21535
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: yuhao yang
>
> CrossValidator and TrainValidationSplit both use 
> {code}models = est.fit(trainingDataset, epm) {code} to fit the models, where 
> epm is Array[ParamMap].
> Even though the training process is sequential, current implementation 
> consumes extra driver memory for holding the trained models, which is not 
> necessary and often leads to memory exceptions for both CrossValidator and 
> TrainValidationSplit. My proposal is to optimize the training implementation 
> so that a used model can be collected by GC, avoiding the unnecessary OOM 
> exceptions.
> E.g. when grid search space is 12, old implementation needs to hold all 12 
> trained models in the driver memory at the same time, while the new 
> implementation only needs to hold 1 trained model at a time, and previous 
> model can be cleared by GC.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-12957) Derive and propagate data constrains in logical plan

2017-07-26 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12957?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-12957.
-
   Resolution: Fixed
Fix Version/s: 2.0.0

> Derive and propagate data constrains in logical plan 
> -
>
> Key: SPARK-12957
> URL: https://issues.apache.org/jira/browse/SPARK-12957
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Yin Huai
>Assignee: Sameer Agarwal
> Fix For: 2.0.0
>
> Attachments: ConstraintPropagationinSparkSQL.pdf
>
>
> Based on the semantics of a query plan, we can derive data constraints (e.g. if 
> a filter defines {{a > 10}}, we know that the output data of this filter 
> satisfies the constraints {{a > 10}} and {{a is not null}}). We should build a 
> framework to derive and propagate constraints in the logical plan, which can 
> help us build more advanced optimizations.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21538) Attribute resolution inconsistency in Dataset API

2017-07-26 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21538?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-21538:

Affects Version/s: (was: 3.0.0)
   2.3.0

> Attribute resolution inconsistency in Dataset API
> -
>
> Key: SPARK-21538
> URL: https://issues.apache.org/jira/browse/SPARK-21538
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Adrian Ionescu
>
> {code}
> spark.range(1).withColumnRenamed("id", "x").sort(col("id"))  // works
> spark.range(1).withColumnRenamed("id", "x").sort($"id")  // works
> spark.range(1).withColumnRenamed("id", "x").sort('id) // works
> spark.range(1).withColumnRenamed("id", "x").sort("id") // fails with:
> org.apache.spark.sql.AnalysisException: Cannot resolve column name "id" among 
> (x);
> ...
> {code}
> It looks like the Dataset API functions taking {{String}} use the basic 
> resolver, which only looks at the columns at that level, whereas all the other 
> means of expressing an attribute are lazily resolved during analysis.
> The reason why the first 3 calls work is explained in the docs for {{object 
> ResolveMissingReferences}}:
> {code}
>   /**
>* In many dialects of SQL it is valid to sort by attributes that are not 
> present in the SELECT
>* clause.  This rule detects such queries and adds the required attributes 
> to the original
>* projection, so that they will be available during sorting. Another 
> projection is added to
>* remove these attributes after sorting.
>*
>* The HAVING clause could also used a grouping columns that is not 
> presented in the SELECT.
>*/
> {code}
> For consistency, it would be good to use the same attribute resolution 
> mechanism everywhere.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21538) Attribute resolution inconsistency in Dataset API

2017-07-26 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21538?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-21538:

Issue Type: Improvement  (was: Story)

> Attribute resolution inconsistency in Dataset API
> -
>
> Key: SPARK-21538
> URL: https://issues.apache.org/jira/browse/SPARK-21538
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Adrian Ionescu
>
> {code}
> spark.range(1).withColumnRenamed("id", "x").sort(col("id"))  // works
> spark.range(1).withColumnRenamed("id", "x").sort($"id")  // works
> spark.range(1).withColumnRenamed("id", "x").sort('id) // works
> spark.range(1).withColumnRenamed("id", "x").sort("id") // fails with:
> org.apache.spark.sql.AnalysisException: Cannot resolve column name "id" among 
> (x);
> ...
> {code}
> It looks like the Dataset API functions taking {{String}} use the basic 
> resolver, which only looks at the columns at that level, whereas all the other 
> means of expressing an attribute are lazily resolved during analysis.
> The reason why the first 3 calls work is explained in the docs for {{object 
> ResolveMissingReferences}}:
> {code}
>   /**
>* In many dialects of SQL it is valid to sort by attributes that are not 
> present in the SELECT
>* clause.  This rule detects such queries and adds the required attributes 
> to the original
>* projection, so that they will be available during sorting. Another 
> projection is added to
>* remove these attributes after sorting.
>*
>* The HAVING clause could also used a grouping columns that is not 
> presented in the SELECT.
>*/
> {code}
> For consistency, it would be good to use the same attribute resolution 
> mechanism everywhere.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21539) Job should not be aborted when dynamic allocation is enabled or spark.executor.instances larger then current allocated number by yarn

2017-07-26 Thread zhoukang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16101686#comment-16101686
 ] 

zhoukang commented on SPARK-21539:
--

I am working on this.

> Job should not be aborted when dynamic allocation is enabled or 
> spark.executor.instances larger then current allocated number by yarn
> -
>
> Key: SPARK-21539
> URL: https://issues.apache.org/jira/browse/SPARK-21539
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.6.1, 2.1.0, 2.2.0
>Reporter: zhoukang
>
> This is for Spark on YARN.
> Right now, when a TaskSet cannot run on any node or host (i.e. 
> blacklistedEverywhere is true in TaskSetManager#abortIfCompleteBlacklisted), 
> the job is aborted. However, if dynamic allocation is enabled, we should wait 
> for YARN to allocate a new NodeManager so that the job can still complete 
> successfully.
> How to reproduce?
> 1. Set up a YARN cluster with 5 nodes, and give node1 much more CPU and 
> memory, so that YARN keeps launching containers on this node even when it is 
> blacklisted by the TaskScheduler.
> 2. Modify BlockManager#registerWithExternalShuffleServer:
> {code:java}
> logInfo("Registering executor with local external shuffle service.")
> val shuffleConfig = new ExecutorShuffleInfo(
>   diskBlockManager.localDirs.map(_.toString),
>   diskBlockManager.subDirsPerLocalDir,
>   shuffleManager.getClass.getName)
> val MAX_ATTEMPTS = conf.get(config.SHUFFLE_REGISTRATION_MAX_ATTEMPTS)
> val SLEEP_TIME_SECS = 5
> for (i <- 1 to MAX_ATTEMPTS) {
>   try {
> {color:red}if (shuffleId.host.equals("node1's address")) {
>  throw new Exception
> }{color}
> // Synchronous and will throw an exception if we cannot connect.
> 
> shuffleClient.asInstanceOf[ExternalShuffleClient].registerWithShuffleServer(
>   shuffleServerId.host, shuffleServerId.port, 
> shuffleServerId.executorId, shuffleConfig)
> return
>   } catch {
> case e: Exception if i < MAX_ATTEMPTS =>
>   logError(s"Failed to connect to external shuffle server, will retry 
> ${MAX_ATTEMPTS - i}"
> + s" more times after waiting $SLEEP_TIME_SECS seconds...", e)
>   Thread.sleep(SLEEP_TIME_SECS * 1000)
> case NonFatal(e) =>
>   throw new SparkException("Unable to register with external shuffle 
> server due to : " +
> e.getMessage, e)
>   }
> }
> {code}
> (i.e. add the logic highlighted in red.)
> 3. Set the shuffle service enabled flag to true and enable the external 
> shuffle service for YARN.
> YARN will then keep launching executors on node1, but they fail because the 
> shuffle service cannot register successfully, and the job is aborted.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21539) Job should not be aborted when dynamic allocation is enabled or spark.executor.instances larger then current allocated number by yarn

2017-07-26 Thread zhoukang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21539?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhoukang updated SPARK-21539:
-
Description: 
This is for Spark on YARN.
Right now, when a TaskSet cannot run on any node or host (i.e. 
blacklistedEverywhere is true in TaskSetManager#abortIfCompleteBlacklisted), the 
job is aborted. However, if dynamic allocation is enabled, we should wait for 
YARN to allocate a new NodeManager so that the job can still complete successfully.
How to reproduce?
1. Set up a YARN cluster with 5 nodes, and give node1 much more CPU and memory, 
so that YARN keeps launching containers on this node even when it is blacklisted 
by the TaskScheduler.
2. Modify BlockManager#registerWithExternalShuffleServer:
{code:java}
logInfo("Registering executor with local external shuffle service.")
val shuffleConfig = new ExecutorShuffleInfo(
  diskBlockManager.localDirs.map(_.toString),
  diskBlockManager.subDirsPerLocalDir,
  shuffleManager.getClass.getName)

val MAX_ATTEMPTS = conf.get(config.SHUFFLE_REGISTRATION_MAX_ATTEMPTS)
val SLEEP_TIME_SECS = 5

for (i <- 1 to MAX_ATTEMPTS) {
  try {
{color:red}if (shuffleId.host.equals("node1's address")) {
 throw new Exception
}{color}
// Synchronous and will throw an exception if we cannot connect.

shuffleClient.asInstanceOf[ExternalShuffleClient].registerWithShuffleServer(
  shuffleServerId.host, shuffleServerId.port, 
shuffleServerId.executorId, shuffleConfig)
return
  } catch {
case e: Exception if i < MAX_ATTEMPTS =>
  logError(s"Failed to connect to external shuffle server, will retry 
${MAX_ATTEMPTS - i}"
+ s" more times after waiting $SLEEP_TIME_SECS seconds...", e)
  Thread.sleep(SLEEP_TIME_SECS * 1000)
case NonFatal(e) =>
  throw new SparkException("Unable to register with external shuffle 
server due to : " +
e.getMessage, e)
  }
}
{code}
(i.e. add the logic highlighted in red.)
3. Set the shuffle service enabled flag to true and enable the external shuffle 
service for YARN.
YARN will then keep launching executors on node1, but they fail because the 
shuffle service cannot register successfully, and the job is aborted.

  was:
For spark on yarn.
Right now, when TaskSet can not run on any node or host.Which means 
blacklistedEverywhere is true in TaskSetManager#abortIfCompleteBlacklisted.
However, if dynamic allocation is enabled, we should wait for yarn to allocate 
new nodemanager in order to execute job successfully.And we should report this 
information to yarn in case of assign same node which blacklisted by 
TaskScheduler.
How to reproduce?
1、Set up a yarn cluster with  5 nodes.And assign a node1 with much larger cpu 
core and memory,which can let yarn launch container on this node even it is 
blacklisted by TaskScheduler.
2、modify BlockManager#registerWithExternalShuffleServer
{code:java}
logInfo("Registering executor with local external shuffle service.")
val shuffleConfig = new ExecutorShuffleInfo(
  diskBlockManager.localDirs.map(_.toString),
  diskBlockManager.subDirsPerLocalDir,
  shuffleManager.getClass.getName)

val MAX_ATTEMPTS = conf.get(config.SHUFFLE_REGISTRATION_MAX_ATTEMPTS)
val SLEEP_TIME_SECS = 5

for (i <- 1 to MAX_ATTEMPTS) {
  try {
{color:red}if (shuffleId.host.equals("node1's address")) {
 throw new Exception
}{color}
// Synchronous and will throw an exception if we cannot connect.

shuffleClient.asInstanceOf[ExternalShuffleClient].registerWithShuffleServer(
  shuffleServerId.host, shuffleServerId.port, 
shuffleServerId.executorId, shuffleConfig)
return
  } catch {
case e: Exception if i < MAX_ATTEMPTS =>
  logError(s"Failed to connect to external shuffle server, will retry 
${MAX_ATTEMPTS - i}"
+ s" more times after waiting $SLEEP_TIME_SECS seconds...", e)
  Thread.sleep(SLEEP_TIME_SECS * 1000)
case NonFatal(e) =>
  throw new SparkException("Unable to register with external shuffle 
server due to : " +
e.getMessage, e)
  }
}
{code}
add logic in red.
3、set shuffle service enable as true and open shuffle service for yarn.
Then yarn will always launch executor on node1 but failed since shuffle service 
can not register success.
Then job will be aborted.


> Job should not be aborted when dynamic allocation is enabled or 
> spark.executor.instances larger then current allocated number by yarn
> -
>
> Key: SPARK-21539
> URL: https://issues.apache.org/jira/browse/SPARK-21539
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.6.1, 2.1.0, 2.2.0
>Reporter: zhoukang
>
> For spark on 

[jira] [Created] (SPARK-21539) Job should not be aborted when dynamic allocation is enabled or spark.executor.instances larger then current allocated number by yarn

2017-07-26 Thread zhoukang (JIRA)
zhoukang created SPARK-21539:


 Summary: Job should not be aborted when dynamic allocation is 
enabled or spark.executor.instances larger then current allocated number by yarn
 Key: SPARK-21539
 URL: https://issues.apache.org/jira/browse/SPARK-21539
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 2.2.0, 2.1.0, 1.6.1
Reporter: zhoukang


This is for Spark on YARN.
Right now, when a TaskSet cannot run on any node or host (i.e. 
blacklistedEverywhere is true in TaskSetManager#abortIfCompleteBlacklisted), the 
job is aborted. However, if dynamic allocation is enabled, we should wait for 
YARN to allocate a new NodeManager so that the job can still complete 
successfully. And we should report this information to YARN so that it does not 
assign the same node that was blacklisted by the TaskScheduler.
How to reproduce?
1. Set up a YARN cluster with 5 nodes, and give node1 much more CPU and memory, 
so that YARN keeps launching containers on this node even when it is blacklisted 
by the TaskScheduler.
2. Modify BlockManager#registerWithExternalShuffleServer:
{code:java}
logInfo("Registering executor with local external shuffle service.")
val shuffleConfig = new ExecutorShuffleInfo(
  diskBlockManager.localDirs.map(_.toString),
  diskBlockManager.subDirsPerLocalDir,
  shuffleManager.getClass.getName)

val MAX_ATTEMPTS = conf.get(config.SHUFFLE_REGISTRATION_MAX_ATTEMPTS)
val SLEEP_TIME_SECS = 5

for (i <- 1 to MAX_ATTEMPTS) {
  try {
{color:red}if (shuffleId.host.equals("node1's address")) {
 throw new Exception
}{color}
// Synchronous and will throw an exception if we cannot connect.

shuffleClient.asInstanceOf[ExternalShuffleClient].registerWithShuffleServer(
  shuffleServerId.host, shuffleServerId.port, 
shuffleServerId.executorId, shuffleConfig)
return
  } catch {
case e: Exception if i < MAX_ATTEMPTS =>
  logError(s"Failed to connect to external shuffle server, will retry 
${MAX_ATTEMPTS - i}"
+ s" more times after waiting $SLEEP_TIME_SECS seconds...", e)
  Thread.sleep(SLEEP_TIME_SECS * 1000)
case NonFatal(e) =>
  throw new SparkException("Unable to register with external shuffle 
server due to : " +
e.getMessage, e)
  }
}
{code}
(i.e. add the logic highlighted in red.)
3. Set the shuffle service enabled flag to true and enable the external shuffle 
service for YARN.
YARN will then keep launching executors on node1, but they fail because the 
shuffle service cannot register successfully, and the job is aborted.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-9776) Another instance of Derby may have already booted the database

2017-07-26 Thread Quincy HSIEH (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16101597#comment-16101597
 ] 

Quincy HSIEH edited comment on SPARK-9776 at 7/26/17 12:10 PM:
---

Hi,

In case someone has the same issue and comes to this page, here is the 
workaround I use to resolve the problem:

1. Configure Spark in YARN client mode.

2. Use the FairScheduler in the Hadoop config yarn-site.xml to enable multiple 
queues, so that the 2 jobs can be spark-submitted to different queues with --queue:

<property>
  <name>yarn.resourcemanager.scheduler.class</name>
  <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler</value>
</property>

3. Switch Derby to in-memory mode by modifying the connection string in the 
Spark config hive-site.xml:

<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:derby:memory:/metastore_db;create=true</value>
  <description>JDBC connect string for a JDBC metastore</description>
</property>


was (Author: quincy.tw):
Hi,

In case that someone who has the same issue and comes to this page : here is 
the workaround I use to resolve the problem :

1. Configure Spark with Yarn client mode.

2. Use FairScheduler in Hadoop config yarn-site.xml to enable multi-queques so 
that you can spark-submit job to different queue name with --queue 


yarn.resourcemanager.scheduler.class

org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler


3. Switch derby to memory mode by modifying the connection string in Spark 
config hive-site.xml
   
 javax.jdo.option.ConnectionURL
 
jdbc:derby:memory:/metastore_db;create=true
 JDBC connect string for a JDBC metastore
   

> Another instance of Derby may have already booted the database 
> ---
>
> Key: SPARK-9776
> URL: https://issues.apache.org/jira/browse/SPARK-9776
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
> Environment: Mac Yosemite, spark-1.5.0
>Reporter: Sudhakar Thota
> Attachments: SPARK-9776-FL1.rtf
>
>
> val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc) results in 
> error. Though the same works for spark-1.4.1.
> Caused by: ERROR XSDB6: Another instance of Derby may have already booted the 
> database 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-9776) Another instance of Derby may have already booted the database

2017-07-26 Thread Quincy HSIEH (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16101597#comment-16101597
 ] 

Quincy HSIEH edited comment on SPARK-9776 at 7/26/17 12:09 PM:
---

Hi,

In case that someone who has the same issue and comes to this page : here is 
the workaround I use to resolve the problem :

1. Configure Spark with Yarn client mode.

2. Use FairScheduler in Hadoop config yarn-site.xml to enable multi-queques so 
that you can spark-submit job to different queue name with --queue 


yarn.resourcemanager.scheduler.class

org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler


3. Switch derby to memory mode by modifying the connection string in Spark 
config hive-site.xml
   
 javax.jdo.option.ConnectionURL
 
jdbc:derby:memory:/metastore_db;create=true
 JDBC connect string for a JDBC metastore
   


was (Author: quincy.tw):
Hi,

In case that someone who has the same issue and comes to this page : here is 
the workaround I use to resolve the problem :

1. Configure Spark with Yarn client mode.

2. Use FairScheduler in Hadoop config yarn-site.xml to enable multi-queques so 
that you can spark-submit job to different queue name with --queue 


yarn.resourcemanager.scheduler.class   
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler


3. Switch derby to memory mode by modifying the connection string in Spark 
config hive-site.xml
   
 javax.jdo.option.ConnectionURL
 
jdbc:derby:memory:/metastore_db;create=true
 JDBC connect string for a JDBC metastore
   

> Another instance of Derby may have already booted the database 
> ---
>
> Key: SPARK-9776
> URL: https://issues.apache.org/jira/browse/SPARK-9776
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
> Environment: Mac Yosemite, spark-1.5.0
>Reporter: Sudhakar Thota
> Attachments: SPARK-9776-FL1.rtf
>
>
> val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc) results in 
> error. Though the same works for spark-1.4.1.
> Caused by: ERROR XSDB6: Another instance of Derby may have already booted the 
> database 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-9776) Another instance of Derby may have already booted the database

2017-07-26 Thread Quincy HSIEH (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16101597#comment-16101597
 ] 

Quincy HSIEH edited comment on SPARK-9776 at 7/26/17 12:08 PM:
---

Hi,

In case that someone who has the same issue and comes to this page : here is 
the workaround I use to resolve the problem :

1. Configure Spark with Yarn client mode.

2. Use FairScheduler in Hadoop config yarn-site.xml to enable multi-queques so 
that you can spark-submit job to different queue name with --queue 


yarn.resourcemanager.scheduler.class   
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler


3. Switch derby to memory mode by modifying the connection string in Spark 
config hive-site.xml
   
 javax.jdo.option.ConnectionURL
 
jdbc:derby:memory:/metastore_db;create=true
 JDBC connect string for a JDBC metastore
   


was (Author: quincy.tw):
Hi,

In case that someone who has the same issue and comes to this page : here is 
the workaround I use to resolve the problem :

1. Configure Spark with Yarn client mode.

2. Use FairScheduler in Hadoop config yarn-site.xml to enable multi-queques so 
that you can spark-submit job to different queue name with --queue 


yarn.resourcemanager.scheduler.class   
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler


3. Switch derby to memory mode by modifying the connection string in Spark 
config hive-site.xml
   
 javax.jdo.option.ConnectionURL
 
jdbc:derby:memory:/metastore_db;create=true
 JDBC connect string for a JDBC metastore
   

> Another instance of Derby may have already booted the database 
> ---
>
> Key: SPARK-9776
> URL: https://issues.apache.org/jira/browse/SPARK-9776
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
> Environment: Mac Yosemite, spark-1.5.0
>Reporter: Sudhakar Thota
> Attachments: SPARK-9776-FL1.rtf
>
>
> val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc) results in 
> error. Though the same works for spark-1.4.1.
> Caused by: ERROR XSDB6: Another instance of Derby may have already booted the 
> database 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-9776) Another instance of Derby may have already booted the database

2017-07-26 Thread Quincy HSIEH (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16101597#comment-16101597
 ] 

Quincy HSIEH edited comment on SPARK-9776 at 7/26/17 12:07 PM:
---

Hi,

In case that someone who has the same issue and comes to this page : here is 
the workaround I use to resolve the problem :

1. Configure Spark with Yarn client mode.

2. Use FairScheduler in Hadoop config yarn-site.xml to enable multi-queques so 
that you can spark-submit job to different queue name with --queue 


yarn.resourcemanager.scheduler.class   
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler


3. Switch derby to memory mode by modifying the connection string in Spark 
config hive-site.xml
   
 javax.jdo.option.ConnectionURL
 
jdbc:derby:memory:/metastore_db;create=true
 JDBC connect string for a JDBC metastore
   


was (Author: quincy.tw):
Hi,

In case that someone who has the same issue and comes to this page : here is 
the workaround I use to resolve the problem :

1. Configure Spark with Yarn client mode.

2. Use FairScheduler in yarn-site.xml to enable multi-queques so that you can 
spark-submit job to different queue name with --queue 

yarn.resourcemanager.scheduler.class   
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler


3. Switch derby to memory mode by modify the connection string in 
   
 javax.jdo.option.ConnectionURL
 
jdbc:derby:memory:/metastore_db;create=true
 JDBC connect string for a JDBC metastore
   

> Another instance of Derby may have already booted the database 
> ---
>
> Key: SPARK-9776
> URL: https://issues.apache.org/jira/browse/SPARK-9776
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
> Environment: Mac Yosemite, spark-1.5.0
>Reporter: Sudhakar Thota
> Attachments: SPARK-9776-FL1.rtf
>
>
> val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc) results in 
> error. Though the same works for spark-1.4.1.
> Caused by: ERROR XSDB6: Another instance of Derby may have already booted the 
> database 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-9776) Another instance of Derby may have already booted the database

2017-07-26 Thread Quincy HSIEH (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16101597#comment-16101597
 ] 

Quincy HSIEH edited comment on SPARK-9776 at 7/26/17 12:06 PM:
---

Hi,

In case that someone who has the same issue and comes to this page : here is 
the workaround I use to resolve the problem :

1. Configure Spark with Yarn client mode.

2. Use FairScheduler in yarn-site.xml to enable multi-queques so that you can 
spark-submit job to different queue name with --queue 

yarn.resourcemanager.scheduler.class   
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler


3. Switch derby to memory mode by modify the connection string in 
   
 javax.jdo.option.ConnectionURL
 
jdbc:derby:memory:/metastore_db;create=true
 JDBC connect string for a JDBC metastore
   


was (Author: quincy.tw):
Hi,

In case that someone who has the same issue and comes to this page : here is 
the workaround I use to resolve the problem :

1. Configure Spark with Yarn client mode.
2. Use FairScheduler in yarn-site.xml to enable multi-queques so that you can 
spark-submit job to different queue name with --queue 

yarn.resourcemanager.scheduler.class   
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler

3. Switch derby to memory mode by modify the connection string in 
   
 javax.jdo.option.ConnectionURL
 
jdbc:derby*:memory*:/metastore_db;create=true
 JDBC connect string for a JDBC metastore
   

> Another instance of Derby may have already booted the database 
> ---
>
> Key: SPARK-9776
> URL: https://issues.apache.org/jira/browse/SPARK-9776
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
> Environment: Mac Yosemite, spark-1.5.0
>Reporter: Sudhakar Thota
> Attachments: SPARK-9776-FL1.rtf
>
>
> val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc) results in 
> error. Though the same works for spark-1.4.1.
> Caused by: ERROR XSDB6: Another instance of Derby may have already booted the 
> database 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9776) Another instance of Derby may have already booted the database

2017-07-26 Thread Quincy HSIEH (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16101597#comment-16101597
 ] 

Quincy HSIEH commented on SPARK-9776:
-

Hi,

In case that someone who has the same issue and comes to this page : here is 
the workaround I use to resolve the problem :

1. Configure Spark with Yarn client mode.
2. Use FairScheduler in yarn-site.xml to enable multi-queques so that you can 
spark-submit job to different queue name with --queue 

yarn.resourcemanager.scheduler.class



org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler

3. Switch derby to memory mode by modify the connection string in 
   
 javax.jdo.option.ConnectionURL
 
jdbc:derby*:memory*:/metastore_db;create=true
 JDBC connect string for a JDBC metastore
   

> Another instance of Derby may have already booted the database 
> ---
>
> Key: SPARK-9776
> URL: https://issues.apache.org/jira/browse/SPARK-9776
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
> Environment: Mac Yosemite, spark-1.5.0
>Reporter: Sudhakar Thota
> Attachments: SPARK-9776-FL1.rtf
>
>
> val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc) results in 
> error. Though the same works for spark-1.4.1.
> Caused by: ERROR XSDB6: Another instance of Derby may have already booted the 
> database 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-9776) Another instance of Derby may have already booted the database

2017-07-26 Thread Quincy HSIEH (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16101597#comment-16101597
 ] 

Quincy HSIEH edited comment on SPARK-9776 at 7/26/17 12:05 PM:
---

Hi,

In case that someone who has the same issue and comes to this page : here is 
the workaround I use to resolve the problem :

1. Configure Spark with Yarn client mode.
2. Use FairScheduler in yarn-site.xml to enable multi-queques so that you can 
spark-submit job to different queue name with --queue 

yarn.resourcemanager.scheduler.class   
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler

3. Switch derby to memory mode by modify the connection string in 
   
 javax.jdo.option.ConnectionURL
 
jdbc:derby*:memory*:/metastore_db;create=true
 JDBC connect string for a JDBC metastore
   


was (Author: quincy.tw):
Hi,

In case that someone who has the same issue and comes to this page : here is 
the workaround I use to resolve the problem :

1. Configure Spark with Yarn client mode.
2. Use FairScheduler in yarn-site.xml to enable multi-queques so that you can 
spark-submit job to different queue name with --queue 

yarn.resourcemanager.scheduler.class



org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler

3. Switch derby to memory mode by modify the connection string in 
   
 javax.jdo.option.ConnectionURL
 
jdbc:derby*:memory*:/metastore_db;create=true
 JDBC connect string for a JDBC metastore
   

> Another instance of Derby may have already booted the database 
> ---
>
> Key: SPARK-9776
> URL: https://issues.apache.org/jira/browse/SPARK-9776
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
> Environment: Mac Yosemite, spark-1.5.0
>Reporter: Sudhakar Thota
> Attachments: SPARK-9776-FL1.rtf
>
>
> val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc) results in 
> error. Though the same works for spark-1.4.1.
> Caused by: ERROR XSDB6: Another instance of Derby may have already booted the 
> database 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-20988) Convert logistic regression to new aggregator framework

2017-07-26 Thread Nick Pentreath (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20988?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Pentreath resolved SPARK-20988.

   Resolution: Fixed
Fix Version/s: 2.3.0

> Convert logistic regression to new aggregator framework
> ---
>
> Key: SPARK-20988
> URL: https://issues.apache.org/jira/browse/SPARK-20988
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Affects Versions: 2.3.0
>Reporter: Seth Hendrickson
>Assignee: Seth Hendrickson
>Priority: Minor
> Fix For: 2.3.0
>
>
> Use the hierarchy from SPARK-19762 for logistic regression optimization



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20988) Convert logistic regression to new aggregator framework

2017-07-26 Thread Nick Pentreath (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20988?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Pentreath reassigned SPARK-20988:
--

Assignee: Seth Hendrickson

> Convert logistic regression to new aggregator framework
> ---
>
> Key: SPARK-20988
> URL: https://issues.apache.org/jira/browse/SPARK-20988
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Affects Versions: 2.3.0
>Reporter: Seth Hendrickson
>Assignee: Seth Hendrickson
>Priority: Minor
> Fix For: 2.3.0
>
>
> Use the hierarchy from SPARK-19762 for logistic regression optimization



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-21538) Attribute resolution inconsistency in Dataset API

2017-07-26 Thread Adrian Ionescu (JIRA)
Adrian Ionescu created SPARK-21538:
--

 Summary: Attribute resolution inconsistency in Dataset API
 Key: SPARK-21538
 URL: https://issues.apache.org/jira/browse/SPARK-21538
 Project: Spark
  Issue Type: Story
  Components: SQL
Affects Versions: 3.0.0
Reporter: Adrian Ionescu


{code}
spark.range(1).withColumnRenamed("id", "x").sort(col("id"))  // works
spark.range(1).withColumnRenamed("id", "x").sort($"id")  // works
spark.range(1).withColumnRenamed("id", "x").sort('id) // works
spark.range(1).withColumnRenamed("id", "x").sort("id") // fails with:
org.apache.spark.sql.AnalysisException: Cannot resolve column name "id" among 
(x);
...
{code}

It looks like the Dataset API functions taking {{String}} use the basic 
resolver, which only looks at the columns at that level, whereas all the other 
means of expressing an attribute are lazily resolved during analysis.

The reason why the first 3 calls work is explained in the docs for {{object 
ResolveMissingReferences}}:
{code}
  /**
   * In many dialects of SQL it is valid to sort by attributes that are not 
present in the SELECT
   * clause.  This rule detects such queries and adds the required attributes 
to the original
   * projection, so that they will be available during sorting. Another 
projection is added to
   * remove these attributes after sorting.
   *
   * The HAVING clause could also used a grouping columns that is not presented 
in the SELECT.
   */
{code}

For consistency, it would be good to use the same attribute resolution 
mechanism everywhere.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21537) toPandas() should handle nested columns (as a Pandas MultiIndex)

2017-07-26 Thread Eric O. LEBIGOT (EOL) (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21537?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric O. LEBIGOT (EOL) updated SPARK-21537:
--
Description: 
The conversion of a *PySpark dataframe with nested columns* to Pandas (with 
`toPandas()`) does not convert nested columns into their Pandas equivalent, 
i.e. *columns indexed by a 
[MultiIndex|https://pandas.pydata.org/pandas-docs/stable/advanced.html]*.

For example, a dataframe with the following structure:
{code:java}
>>> df.printSchema()
root
 |-- device_ID: string (nullable = true)
 |-- time_origin_UTC: timestamp (nullable = true)
 |-- duration_s: integer (nullable = true)
 |-- session_time_UTC: timestamp (nullable = true)
 |-- probes_by_AP: struct (nullable = true)
 ||-- aa:bb:cc:dd:ee:ff: array (nullable = true)
 |||-- element: struct (containsNull = true)
 ||||-- delay_s: float (nullable = true)
 ||||-- RSSI: short (nullable = true)
 |-- max_RSSI_info_by_AP: struct (nullable = true)
 ||-- aa:bb:cc:dd:ee:ff: struct (nullable = true)
 |||-- delay_s: float (nullable = true)
 |||-- RSSI: short (nullable = true)
{code}
yields a Pandas dataframe where the `max_RSSI_info_by_AP` column is _not_ 
nested inside Pandas (through a MultiIndex):
{code}
>>> df_pandas_version = df.toPandas()
>>> df_pandas_version["max_RSSI_info_by_AP", "aa:bb:cc:dd:ee:ff", "RSSI"]. # 
>>> Should work!
(…)
KeyError: ('max_RSSI_info_by_AP', 'aa:bb:cc:dd:ee:ff', 'RSSI')
>>> df_pandas_version["max_RSSI_info_by_AP"].iloc[0]
Row(aa:bb:cc:dd:ee:ff=Row(delay_s=0.0, RSSI=6))
>>> type(_)  # PySpark type, instead of Pandas!
pyspark.sql.types.Row
{code}

It would be much more convenient if the Spark dataframe did the conversion to 
Pandas more thoroughly.

  was:
The conversion of a *PySpark dataframe with nested columns* to Pandas (with 
`toPandas()`) does not convert nested columns into their Pandas equivalent, 
i.e. *columns indexed by a 
[MultiIndex|https://pandas.pydata.org/pandas-docs/stable/advanced.html]*.

For example, a dataframe with the following structure:
{code:java}
>>> df.printSchema()
root
 |-- device_ID: string (nullable = true)
 |-- time_origin_UTC: timestamp (nullable = true)
 |-- duration_s: integer (nullable = true)
 |-- session_time_UTC: timestamp (nullable = true)
 |-- probes_by_AP: struct (nullable = true)
 ||-- aa:bb:cc:dd:ee:ff: array (nullable = true)
 |||-- element: struct (containsNull = true)
 ||||-- delay_s: float (nullable = true)
 ||||-- RSSI: short (nullable = true)
 |-- max_RSSI_info_by_AP: struct (nullable = true)
 ||-- aa:bb:cc:dd:ee:ff: struct (nullable = true)
 |||-- delay_s: float (nullable = true)
 |||-- RSSI: short (nullable = true)
{code}
yields a Pandas dataframe where the `max_RSSI_info_by_AP` column is _not_ 
nested inside Pandas (through a MultiIndex):
{code}
>>> df_pandas_version["max_RSSI_info_by_AP", "aa:bb:cc:dd:ee:ff", "RSSI"]. # 
>>> Should work!
(…)
KeyError: ('max_RSSI_info_by_AP', 'aa:bb:cc:dd:ee:ff', 'RSSI')

>>> sessions_in_period["max_RSSI_info_by_AP"].iloc[0]
Row(aa:bb:cc:dd:ee:ff=Row(delay_s=0.0, RSSI=6))

>>> type(_)  # PySpark type, instead of Pandas!
pyspark.sql.types.Row
{code}

It would be much more convenient if the Spark dataframe did the conversion to 
Pandas more thoroughly.


> toPandas() should handle nested columns (as a Pandas MultiIndex)
> 
>
> Key: SPARK-21537
> URL: https://issues.apache.org/jira/browse/SPARK-21537
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.2.0
>Reporter: Eric O. LEBIGOT (EOL)
>  Labels: pandas
>
> The conversion of a *PySpark dataframe with nested columns* to Pandas (with 
> `toPandas()`) does not convert nested columns into their Pandas equivalent, 
> i.e. *columns indexed by a 
> [MultiIndex|https://pandas.pydata.org/pandas-docs/stable/advanced.html]*.
> For example, a dataframe with the following structure:
> {code:java}
> >>> df.printSchema()
> root
>  |-- device_ID: string (nullable = true)
>  |-- time_origin_UTC: timestamp (nullable = true)
>  |-- duration_s: integer (nullable = true)
>  |-- session_time_UTC: timestamp (nullable = true)
>  |-- probes_by_AP: struct (nullable = true)
>  ||-- aa:bb:cc:dd:ee:ff: array (nullable = true)
>  |||-- element: struct (containsNull = true)
>  ||||-- delay_s: float (nullable = true)
>  ||||-- RSSI: short (nullable = true)
>  |-- max_RSSI_info_by_AP: struct (nullable = true)
>  ||-- aa:bb:cc:dd:ee:ff: struct (nullable = true)
>  |||-- delay_s: float (nullable = true)
>  |||-- RSSI: short (nullable = true)
> {code}
> yields a Pandas dataframe where the `max_RSSI_info_by_AP` column is _not_ 
> nested inside Pandas 

[jira] [Created] (SPARK-21537) toPandas() should handle nested columns (as a Pandas MultiIndex)

2017-07-26 Thread Eric O. LEBIGOT (EOL) (JIRA)
Eric O. LEBIGOT (EOL) created SPARK-21537:
-

 Summary: toPandas() should handle nested columns (as a Pandas 
MultiIndex)
 Key: SPARK-21537
 URL: https://issues.apache.org/jira/browse/SPARK-21537
 Project: Spark
  Issue Type: Improvement
  Components: PySpark
Affects Versions: 2.2.0
Reporter: Eric O. LEBIGOT (EOL)


The conversion of a *PySpark dataframe with nested columns* to Pandas (with 
`toPandas()`) does not convert nested columns into their Pandas equivalent, 
i.e. *columns indexed by a 
[MultiIndex|https://pandas.pydata.org/pandas-docs/stable/advanced.html]*.

For example, a dataframe with the following structure:
{code:java}
>>> df.printSchema()
root
 |-- device_ID: string (nullable = true)
 |-- time_origin_UTC: timestamp (nullable = true)
 |-- duration_s: integer (nullable = true)
 |-- session_time_UTC: timestamp (nullable = true)
 |-- probes_by_AP: struct (nullable = true)
 ||-- aa:bb:cc:dd:ee:ff: array (nullable = true)
 |||-- element: struct (containsNull = true)
 ||||-- delay_s: float (nullable = true)
 ||||-- RSSI: short (nullable = true)
 |-- max_RSSI_info_by_AP: struct (nullable = true)
 ||-- aa:bb:cc:dd:ee:ff: struct (nullable = true)
 |||-- delay_s: float (nullable = true)
 |||-- RSSI: short (nullable = true)
{code}
yields a Pandas dataframe where the `max_RSSI_info_by_AP` column is _not_ 
nested inside Pandas (through a MultiIndex):
{code}
>>> df_pandas_version["max_RSSI_info_by_AP", "aa:bb:cc:dd:ee:ff", "RSSI"]. # 
>>> Should work!
(…)
KeyError: ('max_RSSI_info_by_AP', 'aa:bb:cc:dd:ee:ff', 'RSSI')

>>> sessions_in_period["max_RSSI_info_by_AP"].iloc[0]
Row(aa:bb:cc:dd:ee:ff=Row(delay_s=0.0, RSSI=6))

>>> type(_)  # PySpark type, instead of Pandas!
pyspark.sql.types.Row
{code}

It would be much more convenient if the Spark dataframe did the conversion to 
Pandas more thoroughly.
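
Until toPandas() supports this, one possible user-level workaround (a rough 
sketch only; the separator and the explicit flattening below are illustrative 
and not part of Spark or of this report) is to flatten the struct fields into 
top-level columns before converting, then rebuild a pandas MultiIndex from the 
flattened names:
{code}
# Sketch of a workaround, assuming `df` is the dataframe with the schema above.
# getField() is used because the field names contain ':' characters.
import pandas as pd
from pyspark.sql import functions as F

SEP = "||"  # illustrative separator, assumed not to appear in real field names

flat = df.select(
    F.col("device_ID"),
    F.col("max_RSSI_info_by_AP").getField("aa:bb:cc:dd:ee:ff").getField("delay_s")
        .alias(SEP.join(["max_RSSI_info_by_AP", "aa:bb:cc:dd:ee:ff", "delay_s"])),
    F.col("max_RSSI_info_by_AP").getField("aa:bb:cc:dd:ee:ff").getField("RSSI")
        .alias(SEP.join(["max_RSSI_info_by_AP", "aa:bb:cc:dd:ee:ff", "RSSI"])),
)
pdf = flat.toPandas()

# Split the flattened names back into a 3-level MultiIndex (pad plain columns).
pdf.columns = pd.MultiIndex.from_tuples(
    [tuple(c.split(SEP)) if SEP in c else (c, "", "") for c in pdf.columns])

pdf["max_RSSI_info_by_AP", "aa:bb:cc:dd:ee:ff", "RSSI"]  # now resolvable
{code}
This only covers fields that are enumerated explicitly; a built-in conversion 
would handle arbitrary nesting automatically, which is what this issue asks for.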



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21476) RandomForest classification model not using broadcast in transform

2017-07-26 Thread Peng Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16101476#comment-16101476
 ] 

Peng Meng commented on SPARK-21476:
---

Hi [~sagraw], could you please test copying the transform method from 
ProbabilisticClassifier into RandomForestClassificationModel and adding 
broadcasting inside transform? Thanks.

> RandomForest classification model not using broadcast in transform
> --
>
> Key: SPARK-21476
> URL: https://issues.apache.org/jira/browse/SPARK-21476
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: Saurabh Agrawal
>
> I notice significant task deserialization latency while running prediction 
> with pipelines using RandomForestClassificationModel. While digging into the 
> source, found that the transform method in RandomForestClassificationModel 
> binds to its parent ProbabilisticClassificationModel and the only concrete 
> definition that RandomForestClassificationModel provides and which is 
> actually used in transform is that of predictRaw. Broadcasting is not being 
> used in predictRaw.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-21476) RandomForest classification model not using broadcast in transform

2017-07-26 Thread Peng Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16101476#comment-16101476
 ] 

Peng Meng edited comment on SPARK-21476 at 7/26/17 10:06 AM:
-

Hi [~sagraw], could you please test copying the transform method from 
ProbabilisticClassificationModel into RandomForestClassificationModel and 
adding broadcasting inside transform? Thanks.


was (Author: peng.m...@intel.com):
Hi [~sagraw], could you please test copying the transform method from 
ProbabilisticClassifier into RandomForestClassificationModel and adding 
broadcasting inside transform? Thanks.

> RandomForest classification model not using broadcast in transform
> --
>
> Key: SPARK-21476
> URL: https://issues.apache.org/jira/browse/SPARK-21476
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: Saurabh Agrawal
>
> I notice significant task deserialization latency while running prediction 
> with pipelines using RandomForestClassificationModel. While digging into the 
> source, found that the transform method in RandomForestClassificationModel 
> binds to its parent ProbabilisticClassificationModel and the only concrete 
> definition that RandomForestClassificationModel provides and which is 
> actually used in transform is that of predictRaw. Broadcasting is not being 
> used in predictRaw.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6809) Make numPartitions optional in pairRDD APIs

2017-07-26 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16101459#comment-16101459
 ] 

Hyukjin Kwon commented on SPARK-6809:
-

Hi [~davies], I can't find the APIs from the pairRDD.R file in the R API 
documentation. Is this JIRA obsolete in favour of SPARK-7230?

> Make numPartitions optional in pairRDD APIs
> ---
>
> Key: SPARK-6809
> URL: https://issues.apache.org/jira/browse/SPARK-6809
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Reporter: Davies Liu
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21535) Reduce memory requirement for CrossValidator and TrainValidationSplit

2017-07-26 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16101452#comment-16101452
 ] 

Nick Pentreath commented on SPARK-21535:


Parallel CV is in progress: https://github.com/apache/spark/pull/16774. How 
will this work with that?

> Reduce memory requirement for CrossValidator and TrainValidationSplit 
> --
>
> Key: SPARK-21535
> URL: https://issues.apache.org/jira/browse/SPARK-21535
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: yuhao yang
>
> CrossValidator and TrainValidationSplit both use 
> {code}models = est.fit(trainingDataset, epm){code} to fit the models, where 
> epm is an Array[ParamMap].
> Even though the training process is sequential, the current implementation 
> consumes extra driver memory to hold all of the trained models, which is not 
> necessary and often leads to memory exceptions for both CrossValidator and 
> TrainValidationSplit. My proposal is to optimize the training implementation 
> so that a model that is no longer needed can be collected by GC, avoiding 
> unnecessary OOM exceptions.
> E.g. when the grid search space has 12 candidates, the old implementation 
> needs to hold all 12 trained models in driver memory at the same time, while 
> the new implementation only needs to hold one trained model at a time, and 
> the previous model can be cleared by GC.
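
For illustration, a rough PySpark sketch of the sequential idea described above 
(this is not the actual CrossValidator/TrainValidationSplit code, and the 
refit-on-all-data step at the end is just one possible way to finish):
{code}
# Sketch: evaluate one ParamMap at a time so that only a single trained model
# is referenced at any moment and the previous one can be garbage collected.
from pyspark.sql import SparkSession
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.linalg import Vectors

spark = SparkSession.builder.getOrCreate()
data = spark.createDataFrame(
    [(Vectors.dense([0.0]), 0.0), (Vectors.dense([1.0]), 1.0)] * 50,
    ["features", "label"])
train_df, valid_df = data.randomSplit([0.8, 0.2], seed=42)

lr = LogisticRegression()
evaluator = BinaryClassificationEvaluator()
param_maps = [{lr.regParam: r} for r in (0.0, 0.01, 0.1)]  # small grid

best_metric, best_params = float("-inf"), None
for params in param_maps:
    model = lr.fit(train_df, params)            # fit a single candidate
    metric = evaluator.evaluate(model.transform(valid_df))
    if metric > best_metric:                    # assumes evaluator.isLargerBetter()
        best_metric, best_params = metric, params
    # 'model' is not appended to a list, so it can be reclaimed before the next fit

best_model = lr.fit(data, best_params)          # refit the winner on all the data
{code}
The existing implementation instead materializes all the fitted models at once 
via est.fit(trainingDataset, epm), which is what this issue proposes to avoid.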



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-21524) ValidatorParamsSuiteHelpers generates wrong temp files

2017-07-26 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21524?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-21524.
---
   Resolution: Fixed
Fix Version/s: 2.3.0

Issue resolved by pull request 18728
[https://github.com/apache/spark/pull/18728]

> ValidatorParamsSuiteHelpers generates wrong temp files
> --
>
> Key: SPARK-21524
> URL: https://issues.apache.org/jira/browse/SPARK-21524
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: yuhao yang
>Priority: Minor
> Fix For: 2.3.0
>
>
> ValidatorParamsSuiteHelpers.testFileMove() generates temp dirs in the 
> wrong place and does not delete them.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21524) ValidatorParamsSuiteHelpers generates wrong temp files

2017-07-26 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21524?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen reassigned SPARK-21524:
-

Assignee: yuhao yang

> ValidatorParamsSuiteHelpers generates wrong temp files
> --
>
> Key: SPARK-21524
> URL: https://issues.apache.org/jira/browse/SPARK-21524
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: yuhao yang
>Assignee: yuhao yang
>Priority: Minor
> Fix For: 2.3.0
>
>
> ValidatorParamsSuiteHelpers.testFileMove() generates temp dirs in the 
> wrong place and does not delete them.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-11046) Pass schema from R to JVM using JSON format

2017-07-26 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11046?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-11046.
--
Resolution: Not A Problem

I recently touched some code around here - SPARK-20493. I think we have lost 
track of the actual issue and the benefit of the fix, so I am resolving this 
as it looks obsolete anyway.

Please reopen this if I misunderstood.

> Pass schema from R to JVM using JSON format
> ---
>
> Key: SPARK-11046
> URL: https://issues.apache.org/jira/browse/SPARK-11046
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Affects Versions: 1.5.1
>Reporter: Sun Rui
>Priority: Minor
>
> Currently, SparkR passes a DataFrame schema from R to the JVM backend using 
> a regular expression. However, Spark now supports schemas in JSON format, 
> so SparkR should be enhanced to use the JSON schema format.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21476) RandomForest classification model not using broadcast in transform

2017-07-26 Thread Saurabh Agrawal (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16101443#comment-16101443
 ] 

Saurabh Agrawal commented on SPARK-21476:
-

[~peng.m...@intel.com] My streaming application is suffering from increased 
latency in the stage where I use this model for prediction. In the Spark UI for 
this stage, I can see that the task execution time is fine, but each task 
spends a large chunk of its time in deserialization. When I comment out the 
line where I call model.transform, use dummy values there instead, and run the 
same application in the same environment, the task deserialization time drops 
and the stage executes significantly faster overall.

I also tried an XGBoost model with a Spark ML-compatible third-party library 
([https://github.com/komiya-atsushi/xgboost-predictor-java/blob/master/xgboost-predictor-spark/src/main/scala/biz/k11i/xgboost/spark/model/XGBoostBinaryClassificationModel.scala]).
 Just like RandomForestClassificationModel, this model subclasses 
ProbabilisticClassificationModel and hence uses the same transform method. I 
noticed the same kind of task deserialization and stage execution times as with 
the RF model. I cloned this repo and copy-pasted the transform method from 
ProbabilisticClassifier into XGBoostBinaryClassificationModel, only this time I 
added broadcasting inside transform. This change brought down the execution 
time significantly.

[~srowen] Adding a broadcast in the ProbabilisticClassifier transform 
implementation can fix this, i.e. broadcasting the model instance and calling 
predictRaw, raw2Probability and the other row-level methods on the broadcast 
value.


> RandomForest classification model not using broadcast in transform
> --
>
> Key: SPARK-21476
> URL: https://issues.apache.org/jira/browse/SPARK-21476
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: Saurabh Agrawal
>
> I notice significant task deserialization latency while running prediction 
> with pipelines using RandomForestClassificationModel. While digging into the 
> source, found that the transform method in RandomForestClassificationModel 
> binds to its parent ProbabilisticClassificationModel and the only concrete 
> definition that RandomForestClassificationModel provides and which is 
> actually used in transform is that of predictRaw. Broadcasting is not being 
> used in predictRaw.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21536) Remove the workaround to allow dots in field names in R's createDataFrame

2017-07-26 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21536?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16101418#comment-16101418
 ] 

Hyukjin Kwon commented on SPARK-21536:
--

BTW, my attempt was https://github.com/apache/spark/pull/18737. I think the fix 
can be done separately (like I did in that PR) or together with it. I won't 
mind resolving this as a duplicate if anyone feels strongly about it.

> Remove the workaround to allow dots in field names in R's createDataFrame
> ---
>
> Key: SPARK-21536
> URL: https://issues.apache.org/jira/browse/SPARK-21536
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Affects Versions: 2.3.0
>Reporter: Hyukjin Kwon
>Priority: Minor
>
> We are currently converting dots to underscores in some cases in SparkR, as 
> shown below:
> {code}
> > createDataFrame(list(list(1)), "a.a")
> SparkDataFrame[a_a:double]
> ...
> In FUN(X[[i]], ...) : Use a_a instead of a.a  as column name
> {code}
> {code}
> > createDataFrame(list(list(a.a = 1)))
> SparkDataFrame[a_a:double]
> ...
> In FUN(X[[i]], ...) : Use a_a instead of a.a  as column name
> {code}
> This looks like it was originally introduced because of SPARK-2775, but that 
> is now fixed by SPARK-6898.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21536) Remove the workaround to allow dots in field names in R's createDataFrame

2017-07-26 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21536?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16101410#comment-16101410
 ] 

Hyukjin Kwon commented on SPARK-21536:
--

I just realised that what I found while fixing this is related to SPARK-12191.

> Remove the workaround to allow dots in field names in R's createDataFrame
> ---
>
> Key: SPARK-21536
> URL: https://issues.apache.org/jira/browse/SPARK-21536
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Affects Versions: 2.3.0
>Reporter: Hyukjin Kwon
>Priority: Minor
>
> We are currently converting dots to underscores in some cases in SparkR, as 
> shown below:
> {code}
> > createDataFrame(list(list(1)), "a.a")
> SparkDataFrame[a_a:double]
> ...
> In FUN(X[[i]], ...) : Use a_a instead of a.a  as column name
> {code}
> {code}
> > createDataFrame(list(list(a.a = 1)))
> SparkDataFrame[a_a:double]
> ...
> In FUN(X[[i]], ...) : Use a_a instead of a.a  as column name
> {code}
> This looks like it was originally introduced because of SPARK-2775, but that 
> is now fixed by SPARK-6898.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-16784) Configurable log4j settings

2017-07-26 Thread HanCheol Cho (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16099477#comment-16099477
 ] 

HanCheol Cho edited comment on SPARK-16784 at 7/26/17 8:27 AM:
---

Hi, 

I used the following options, which allow both the driver and the executors to 
use a custom log4j.properties:
{code}
spark-submit \
  --driver-java-options "-Dlog4j.configuration=file:///absolute/path/to/conf/log4j.properties" \
  --files conf/log4j.properties \
  --conf "spark.executor.extraJavaOptions='-Dlog4j.configuration=log4j.properties'" \
  ...
{code}
I used a local log4j.properties file for the driver and the file in the 
distributed cache (shipped via the --files option) for the executors.
As shown above, the paths for the driver and the executors are different.

I hope this is useful to others.

Edited 2017-07-26: log4j.configuration needs a path in URI format, so the 
file:// prefix is necessary for the driver option.



was (Author: priancho):
Hi, 

I used the following options, which allow both the driver and the executors to 
use a custom log4j.properties:
{code}
spark-submit \
  --driver-java-options=-Dlog4j.configuration=file:///absolute/path/to/conf/log4j.properties \
  --files conf/log4j.properties \
  --conf "spark.executor.extraJavaOptions='-Dlog4j.configuration=log4j.properties'" \
  ...
{code}
I used a local log4j.properties file for the driver and the file in the 
distributed cache (shipped via the --files option) for the executors.
As shown above, the paths for the driver and the executors are different.

I hope this is useful to others.

Edited 2017-07-26: log4j.configuration needs a path in URI format, so the 
file:// prefix is necessary for the driver option.


> Configurable log4j settings
> ---
>
> Key: SPARK-16784
> URL: https://issues.apache.org/jira/browse/SPARK-16784
> Project: Spark
>  Issue Type: Improvement
>Affects Versions: 2.0.0, 2.1.0
>Reporter: Michael Gummelt
>
> I often want to change the logging configuration on a single spark job.  This 
> is easy in client mode.  I just modify log4j.properties.  It's difficult in 
> cluster mode, because I need to modify the log4j.properties in the 
> distribution in which the driver runs.  I'd like a way of setting this 
> dynamically, such as a java system property.  Some brief searching showed 
> that log4j doesn't seem to accept such a property, but I'd like to open up 
> this idea for further comment.  Maybe we can find a solution.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-16784) Configurable log4j settings

2017-07-26 Thread HanCheol Cho (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16099477#comment-16099477
 ] 

HanCheol Cho edited comment on SPARK-16784 at 7/26/17 8:26 AM:
---

Hi, 

I used the following options, which allow both the driver and the executors to 
use a custom log4j.properties:
{code}
spark-submit \
  --driver-java-options=-Dlog4j.configuration=file:///absolute/path/to/conf/log4j.properties \
  --files conf/log4j.properties \
  --conf "spark.executor.extraJavaOptions='-Dlog4j.configuration=log4j.properties'" \
  ...
{code}
I used a local log4j.properties file for the driver and the file in the 
distributed cache (shipped via the --files option) for the executors.
As shown above, the paths for the driver and the executors are different.

I hope this is useful to others.

Edited 2017-07-26: log4j.configuration needs a path in URI format, so the 
file:// prefix is necessary for the driver option.



was (Author: priancho):
Hi, 

I used the following options, which allow both the driver and the executors to 
use a custom log4j.properties:
{code}
spark-submit \
  --driver-java-options=-Dlog4j.configuration=conf/log4j.properties \
  --files conf/log4j.properties \
  --conf "spark.executor.extraJavaOptions='-Dlog4j.configuration=log4j.properties'" \
  ...
{code}
I used a local log4j.properties file for the driver and the file in the 
distributed cache (shipped via the --files option) for the executors.
As shown above, the paths for the driver and the executors are different.

I hope this is useful to others.



> Configurable log4j settings
> ---
>
> Key: SPARK-16784
> URL: https://issues.apache.org/jira/browse/SPARK-16784
> Project: Spark
>  Issue Type: Improvement
>Affects Versions: 2.0.0, 2.1.0
>Reporter: Michael Gummelt
>
> I often want to change the logging configuration on a single spark job.  This 
> is easy in client mode.  I just modify log4j.properties.  It's difficult in 
> cluster mode, because I need to modify the log4j.properties in the 
> distribution in which the driver runs.  I'd like a way of setting this 
> dynamically, such as a java system property.  Some brief searching showed 
> that log4j doesn't seem to accept such a property, but I'd like to open up 
> this idea for further comment.  Maybe we can find a solution.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21476) RandomForest classification model not using broadcast in transform

2017-07-26 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16101298#comment-16101298
 ] 

Sean Owen commented on SPARK-21476:
---

[~sagraw] first someone would have to propose a fix

> RandomForest classification model not using broadcast in transform
> --
>
> Key: SPARK-21476
> URL: https://issues.apache.org/jira/browse/SPARK-21476
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: Saurabh Agrawal
>
> I notice significant task deserialization latency while running prediction 
> with pipelines using RandomForestClassificationModel. While digging into the 
> source, found that the transform method in RandomForestClassificationModel 
> binds to its parent ProbabilisticClassificationModel and the only concrete 
> definition that RandomForestClassificationModel provides and which is 
> actually used in transform is that of predictRaw. Broadcasting is not being 
> used in predictRaw.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-21412) Reset BufferHolder while initialize an UnsafeRowWriter

2017-07-26 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21412?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-21412.
---
Resolution: Not A Problem

> Reset BufferHolder while initialize an UnsafeRowWriter
> --
>
> Key: SPARK-21412
> URL: https://issues.apache.org/jira/browse/SPARK-21412
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.1
>Reporter: Chenzhao Guo
>
> UnsafeRowWriter's constructor should call BufferHolder.reset so that the 
> writer works out of the box.
> While writing an UnsafeRow using UnsafeRowWriter, developers currently have 
> to call BufferHolder.reset manually so that the BufferHolder's cursor (which 
> indicates where to write the variable-length portion) is correct before 
> writing variable-length fields like UTF8String, byte[], etc.
> If a developer doesn't reuse the BufferHolder, he may never notice reset and 
> the comments in the code, and the UnsafeRow won't be correct if he also 
> writes variable-length UTF8String fields. This API design doesn't make sense. 
> We should reset the BufferHolder so that UnsafeRowWriter works out of the 
> box, rather than requiring users to read the code and call it manually.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21476) RandomForest classification model not using broadcast in transform

2017-07-26 Thread Peng Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16101253#comment-16101253
 ] 

Peng Meng commented on SPARK-21476:
---

Not every transform uses broadcast. Do you have some experimental data showing 
that broadcast is a problem here? Thanks.

> RandomForest classification model not using broadcast in transform
> --
>
> Key: SPARK-21476
> URL: https://issues.apache.org/jira/browse/SPARK-21476
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: Saurabh Agrawal
>
> I notice significant task deserialization latency while running prediction 
> with pipelines using RandomForestClassificationModel. While digging into the 
> source, found that the transform method in RandomForestClassificationModel 
> binds to its parent ProbabilisticClassificationModel and the only concrete 
> definition that RandomForestClassificationModel provides and which is 
> actually used in transform is that of predictRaw. Broadcasting is not being 
> used in predictRaw.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-21476) RandomForest classification model not using broadcast in transform

2017-07-26 Thread Saurabh Agrawal (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16100151#comment-16100151
 ] 

Saurabh Agrawal edited comment on SPARK-21476 at 7/26/17 6:56 AM:
--

[~peng.m...@intel.com] I am using it in spark streaming where I give 16 cores 
to the application. The dataset in each batch has around 100 partitions. The 
model has 120 trees and is trained with max depth 15. Number of features is 
around 100. 


was (Author: sagraw):
I am using it in spark streaming where I give 16 cores to the application. The 
dataset in each batch has around 100 partitions. The model has 120 trees and is 
trained with max depth 15. Number of features is around 100. 

> RandomForest classification model not using broadcast in transform
> --
>
> Key: SPARK-21476
> URL: https://issues.apache.org/jira/browse/SPARK-21476
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: Saurabh Agrawal
>
> I notice significant task deserialization latency while running prediction 
> with pipelines using RandomForestClassificationModel. While digging into the 
> source, found that the transform method in RandomForestClassificationModel 
> binds to its parent ProbabilisticClassificationModel and the only concrete 
> definition that RandomForestClassificationModel provides and which is 
> actually used in transform is that of predictRaw. Broadcasting is not being 
> used in predictRaw.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21476) RandomForest classification model not using broadcast in transform

2017-07-26 Thread Saurabh Agrawal (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16101241#comment-16101241
 ] 

Saurabh Agrawal commented on SPARK-21476:
-

[~srowen] Can a fix for this go in the next maintenance release for 2.2? 

> RandomForest classification model not using broadcast in transform
> --
>
> Key: SPARK-21476
> URL: https://issues.apache.org/jira/browse/SPARK-21476
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: Saurabh Agrawal
>
> I notice significant task deserialization latency while running prediction 
> with pipelines using RandomForestClassificationModel. While digging into the 
> source, found that the transform method in RandomForestClassificationModel 
> binds to its parent ProbabilisticClassificationModel and the only concrete 
> definition that RandomForestClassificationModel provides and which is 
> actually used in transform is that of predictRaw. Broadcasting is not being 
> used in predictRaw.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21445) NotSerializableException thrown by UTF8String.IntWrapper

2017-07-26 Thread jin xing (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16101224#comment-16101224
 ] 

jin xing commented on SPARK-21445:
--

Sorry, I reported the exception by mistake.
With the change, it works well for me.

> NotSerializableException thrown by UTF8String.IntWrapper
> 
>
> Key: SPARK-21445
> URL: https://issues.apache.org/jira/browse/SPARK-21445
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Burak Yavuz
>Assignee: Burak Yavuz
> Fix For: 2.2.1, 2.3.0
>
>
> {code}
> Caused by: java.io.NotSerializableException: 
> org.apache.spark.unsafe.types.UTF8String$IntWrapper
> Serialization stack:
> - object not serializable (class: 
> org.apache.spark.unsafe.types.UTF8String$IntWrapper, value: 
> org.apache.spark.unsafe.types.UTF8String$IntWrapper@326450e)
> - field (class: 
> org.apache.spark.sql.catalyst.expressions.Cast$$anonfun$castToInt$1, name: 
> result$2, type: class org.apache.spark.unsafe.types.UTF8String$IntWrapper)
> - object (class 
> org.apache.spark.sql.catalyst.expressions.Cast$$anonfun$castToInt$1, 
> )
> {code}
> I'm not exactly sure in which specific case this exception is thrown, because 
> I couldn't come up with a simple reproducer yet, but I will include a test in 
> the PR.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-21536) Remove the workaround to allow dots in field names in R's createDataFrame

2017-07-26 Thread Hyukjin Kwon (JIRA)
Hyukjin Kwon created SPARK-21536:


 Summary: Remove the workaround to allow dots in field names in R's 
createDataFrame
 Key: SPARK-21536
 URL: https://issues.apache.org/jira/browse/SPARK-21536
 Project: Spark
  Issue Type: Improvement
  Components: SparkR
Affects Versions: 2.3.0
Reporter: Hyukjin Kwon
Priority: Minor


We are currently converting dots to underscores in some cases in SparkR, as shown below:

{code}
> createDataFrame(list(list(1)), "a.a")
SparkDataFrame[a_a:double]
...
In FUN(X[[i]], ...) : Use a_a instead of a.a  as column name
{code}

{code}
> createDataFrame(list(list(a.a = 1)))
SparkDataFrame[a_a:double]
...
In FUN(X[[i]], ...) : Use a_a instead of a.a  as column name
{code}

This looks like it was originally introduced because of SPARK-2775, but that is 
now fixed by SPARK-6898.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org