[jira] [Assigned] (SPARK-12230) WeightedLeastSquares.fit() should handle division by zero properly if standard deviation of target variable is zero.

2015-12-14 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12230?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12230:


Assignee: Apache Spark

> WeightedLeastSquares.fit() should handle division by zero properly if 
> standard deviation of target variable is zero.
> 
>
> Key: SPARK-12230
> URL: https://issues.apache.org/jira/browse/SPARK-12230
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Reporter: Imran Younus
>Assignee: Apache Spark
>Priority: Trivial
>
> This is a TODO in the WeightedLeastSquares.fit() method. If the standard 
> deviation of the target variable is zero, then the regression is 
> meaningless. I think the fit() method should inform the user and exit nicely.
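
As a rough illustration of what the description asks for, here is a minimal sketch (not the actual patch; the method and message wording are made up) of guarding against a constant label before fitting:

{code}
// Minimal sketch, not the real WeightedLeastSquares change: refuse to fit
// when the label column is constant, instead of dividing by a zero std.
def checkLabelStd(labels: Seq[Double]): Unit = {
  val mean = labels.sum / labels.size
  val variance = labels.map(y => (y - mean) * (y - mean)).sum / labels.size
  if (math.sqrt(variance) == 0.0) {
    throw new IllegalArgumentException(
      "The standard deviation of the label is zero, so the regression is " +
        "undefined. Please check that the label column is not constant.")
  }
}

checkLabelStd(Seq(1.0, 1.0, 1.0))  // throws with a readable message
{code}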



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-12230) WeightedLeastSquares.fit() should handle division by zero properly if standard deviation of target variable is zero.

2015-12-14 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12230?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12230:


Assignee: (was: Apache Spark)

> WeightedLeastSquares.fit() should handle division by zero properly if 
> standard deviation of target variable is zero.
> 
>
> Key: SPARK-12230
> URL: https://issues.apache.org/jira/browse/SPARK-12230
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Reporter: Imran Younus
>Priority: Trivial
>
> This is a TODO in the WeightedLeastSquares.fit() method. If the standard 
> deviation of the target variable is zero, then the regression is 
> meaningless. I think the fit() method should inform the user and exit nicely.






[jira] [Commented] (SPARK-12230) WeightedLeastSquares.fit() should handle division by zero properly if standard deviation of target variable is zero.

2015-12-14 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12230?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15056360#comment-15056360
 ] 

Apache Spark commented on SPARK-12230:
--

User 'iyounus' has created a pull request for this issue:
https://github.com/apache/spark/pull/10274

> WeightedLeastSquares.fit() should handle division by zero properly if 
> standard deviation of target variable is zero.
> 
>
> Key: SPARK-12230
> URL: https://issues.apache.org/jira/browse/SPARK-12230
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Reporter: Imran Younus
>Priority: Trivial
>
> This is a TODO in the WeightedLeastSquares.fit() method. If the standard 
> deviation of the target variable is zero, then the regression is 
> meaningless. I think the fit() method should inform the user and exit nicely.






[jira] [Assigned] (SPARK-12271) Improve error message for Dataset.as[] when the schema is incompatible.

2015-12-14 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12271?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12271:


Assignee: Apache Spark

> Improve error message for Dataset.as[] when the schema is incompatible.
> ---
>
> Key: SPARK-12271
> URL: https://issues.apache.org/jira/browse/SPARK-12271
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Nong Li
>Assignee: Apache Spark
>
> It currently fails with an unexecutable exception.
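
For illustration only, a sketch of the kind of call that hits this (it assumes a Spark 1.6 SQLContext named sqlContext; the case class is made up):

{code}
// Sketch: the DataFrame has a single LongType column `id`, which does not
// match the fields of Person, and the current failure is a low-level
// exception rather than a clear analysis error naming the mismatch.
case class Person(name: String, age: Int)

import sqlContext.implicits._
val df = sqlContext.range(10).toDF("id")
val ds = df.as[Person]   // this is where a friendlier error message would help
ds.collect()
{code}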






[jira] [Assigned] (SPARK-12296) Feature parity for pyspark.mllib StandardScalerModel

2015-12-14 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12296?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12296:


Assignee: (was: Apache Spark)

> Feature parity for pyspark.mllib StandardScalerModel
> 
>
> Key: SPARK-12296
> URL: https://issues.apache.org/jira/browse/SPARK-12296
> Project: Spark
>  Issue Type: Sub-task
>  Components: MLlib, PySpark
>Reporter: Joseph K. Bradley
>Priority: Minor
>
> Some methods are missing, such as ways to access the std, mean, etc.  This 
> JIRA is for feature parity for pyspark.mllib.feature.StandardScaler & 
> StandardScalerModel
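
For reference, a sketch of the Scala-side accessors the Python wrapper would presumably mirror (Scala is used here only for illustration; the exact PySpark surface is what this JIRA is about, and `sc` is assumed to be an existing SparkContext):

{code}
import org.apache.spark.mllib.feature.StandardScaler
import org.apache.spark.mllib.linalg.Vectors

// Sketch: the Scala model exposes the fitted statistics directly.
val data = sc.parallelize(Seq(Vectors.dense(1.0, 2.0), Vectors.dense(3.0, 6.0)))
val model = new StandardScaler(withMean = true, withStd = true).fit(data)

model.mean     // per-feature mean used for centering
model.std      // per-feature standard deviation used for scaling
model.withMean // whether centering is enabled
model.withStd  // whether scaling is enabled
{code}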






[jira] [Assigned] (SPARK-12296) Feature parity for pyspark.mllib StandardScalerModel

2015-12-14 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12296?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12296:


Assignee: Apache Spark

> Feature parity for pyspark.mllib StandardScalerModel
> 
>
> Key: SPARK-12296
> URL: https://issues.apache.org/jira/browse/SPARK-12296
> Project: Spark
>  Issue Type: Sub-task
>  Components: MLlib, PySpark
>Reporter: Joseph K. Bradley
>Assignee: Apache Spark
>Priority: Minor
>
> Some methods are missing, such as ways to access the std, mean, etc.  This 
> JIRA is for feature parity for pyspark.mllib.feature.StandardScaler & 
> StandardScalerModel






[jira] [Reopened] (SPARK-12062) Master rebuilding historical SparkUI should be asynchronous

2015-12-14 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12062?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or reopened SPARK-12062:
---

> Master rebuilding historical SparkUI should be asynchronous
> ---
>
> Key: SPARK-12062
> URL: https://issues.apache.org/jira/browse/SPARK-12062
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy
>Affects Versions: 1.0.0
>Reporter: Andrew Or
>Assignee: Bryan Cutler
>
> When a long-running application finishes, it takes a while (sometimes 
> minutes) to rebuild the SparkUI. However, in Master.scala this is currently 
> done within the RPC event loop, which runs on only one thread. Thus, in the 
> meantime no other applications can register with this master.
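
A minimal sketch of the general idea (not the actual Master.scala patch): move the rebuild off the RPC event loop onto a dedicated thread so registrations are not blocked.

{code}
import java.util.concurrent.Executors
import scala.concurrent.{ExecutionContext, Future}

// Sketch only: a single-thread pool dedicated to replaying event logs, so the
// RPC event loop can keep handling registration messages in the meantime.
val rebuildExecutor =
  ExecutionContext.fromExecutorService(Executors.newSingleThreadExecutor())

def rebuildUiAsync(rebuild: () => Unit): Future[Unit] =
  Future(rebuild())(rebuildExecutor)

// Hypothetical usage inside the master:
//   rebuildUiAsync(() => rebuildSparkUI(app))  // rebuildSparkUI stands in for the existing rebuild logic
{code}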






[jira] [Updated] (SPARK-12062) Master rebuilding historical SparkUI should be asynchronous

2015-12-14 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12062?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-12062:
--
Target Version/s: 1.6.1, 2.0.0

> Master rebuilding historical SparkUI should be asynchronous
> ---
>
> Key: SPARK-12062
> URL: https://issues.apache.org/jira/browse/SPARK-12062
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy
>Affects Versions: 1.0.0
>Reporter: Andrew Or
>Assignee: Bryan Cutler
>
> When a long-running application finishes, it takes a while (sometimes 
> minutes) to rebuild the SparkUI. However, in Master.scala this is currently 
> done within the RPC event loop, which runs on only one thread. Thus, in the 
> meantime no other applications can register with this master.






[jira] [Commented] (SPARK-4816) Maven profile netlib-lgpl does not work

2015-12-14 Thread RJ Nowling (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15056630#comment-15056630
 ] 

RJ Nowling commented on SPARK-4816:
---

Tried with Maven 3.3.9.  I see no issues with the newer version of Maven:

{code}
$ mvn -version
Apache Maven 3.3.9 (bb52d8502b132ec0a5a3f4c09453c07478323dc5; 2015-11-10T16:41:47+00:00)
Maven home: /root/apache-maven-3.3.9
Java version: 1.7.0_85, vendor: Oracle Corporation
Java home: /usr/lib/jvm/java-1.7.0-openjdk-1.7.0.85-2.6.1.2.el7_1.x86_64/jre
Default locale: en_US, platform encoding: UTF-8
OS name: "linux", version: "3.10.0-229.1.2.el7.x86_64", arch: "amd64", family: "unix"
$ zipinfo -1 assembly/target/scala-2.10/spark-assembly-1.4.1-hadoop2.4.0.jar | grep netlib-native
netlib-native_ref-osx-x86_64.jnilib
netlib-native_ref-osx-x86_64.jnilib.asc
netlib-native_ref-osx-x86_64.pom
netlib-native_ref-osx-x86_64.pom.asc
META-INF/maven/com.github.fommil.netlib/netlib-native_ref-osx-x86_64/
META-INF/maven/com.github.fommil.netlib/netlib-native_ref-osx-x86_64/pom.xml
META-INF/maven/com.github.fommil.netlib/netlib-native_ref-osx-x86_64/pom.properties
netlib-native_ref-linux-x86_64.so
META-INF/maven/com.github.fommil.netlib/netlib-native_ref-linux-x86_64/
META-INF/maven/com.github.fommil.netlib/netlib-native_ref-linux-x86_64/pom.xml
META-INF/maven/com.github.fommil.netlib/netlib-native_ref-linux-x86_64/pom.properties
netlib-native_ref-linux-i686.so
META-INF/maven/com.github.fommil.netlib/netlib-native_ref-linux-i686/
META-INF/maven/com.github.fommil.netlib/netlib-native_ref-linux-i686/pom.xml
META-INF/maven/com.github.fommil.netlib/netlib-native_ref-linux-i686/pom.properties
netlib-native_ref-win-x86_64.dll
netlib-native_ref-win-x86_64.dll.asc
netlib-native_ref-win-x86_64.pom
netlib-native_ref-win-x86_64.pom.asc
META-INF/maven/com.github.fommil.netlib/netlib-native_ref-win-x86_64/
META-INF/maven/com.github.fommil.netlib/netlib-native_ref-win-x86_64/pom.xml
META-INF/maven/com.github.fommil.netlib/netlib-native_ref-win-x86_64/pom.properties
netlib-native_ref-win-i686.dll
netlib-native_ref-win-i686.dll.asc
netlib-native_ref-win-i686.pom
netlib-native_ref-win-i686.pom.asc
META-INF/maven/com.github.fommil.netlib/netlib-native_ref-win-i686/
META-INF/maven/com.github.fommil.netlib/netlib-native_ref-win-i686/pom.xml
META-INF/maven/com.github.fommil.netlib/netlib-native_ref-win-i686/pom.properties
netlib-native_ref-linux-armhf.so
META-INF/maven/com.github.fommil.netlib/netlib-native_ref-linux-armhf/
META-INF/maven/com.github.fommil.netlib/netlib-native_ref-linux-armhf/pom.xml
META-INF/maven/com.github.fommil.netlib/netlib-native_ref-linux-armhf/pom.properties
netlib-native_system-osx-x86_64.jnilib
netlib-native_system-osx-x86_64.jnilib.asc
netlib-native_system-osx-x86_64.pom
netlib-native_system-osx-x86_64.pom.asc
META-INF/maven/com.github.fommil.netlib/netlib-native_system-osx-x86_64/
META-INF/maven/com.github.fommil.netlib/netlib-native_system-osx-x86_64/pom.xml
META-INF/maven/com.github.fommil.netlib/netlib-native_system-osx-x86_64/pom.properties
netlib-native_system-linux-x86_64.pom.asc
netlib-native_system-linux-x86_64.pom
netlib-native_system-linux-x86_64.so
netlib-native_system-linux-x86_64.so.asc
META-INF/maven/com.github.fommil.netlib/netlib-native_system-linux-x86_64/
META-INF/maven/com.github.fommil.netlib/netlib-native_system-linux-x86_64/pom.xml
META-INF/maven/com.github.fommil.netlib/netlib-native_system-linux-x86_64/pom.properties
netlib-native_system-linux-i686.pom
netlib-native_system-linux-i686.so.asc
netlib-native_system-linux-i686.pom.asc
netlib-native_system-linux-i686.so
META-INF/maven/com.github.fommil.netlib/netlib-native_system-linux-i686/
META-INF/maven/com.github.fommil.netlib/netlib-native_system-linux-i686/pom.xml
META-INF/maven/com.github.fommil.netlib/netlib-native_system-linux-i686/pom.properties
netlib-native_system-linux-armhf.pom
netlib-native_system-linux-armhf.so.asc
netlib-native_system-linux-armhf.pom.asc
netlib-native_system-linux-armhf.so
META-INF/maven/com.github.fommil.netlib/netlib-native_system-linux-armhf/
META-INF/maven/com.github.fommil.netlib/netlib-native_system-linux-armhf/pom.xml
META-INF/maven/com.github.fommil.netlib/netlib-native_system-linux-armhf/pom.properties
netlib-native_system-win-x86_64.dll
netlib-native_system-win-x86_64.dll.asc
netlib-native_system-win-x86_64.pom
netlib-native_system-win-x86_64.pom.asc
META-INF/maven/com.github.fommil.netlib/netlib-native_system-win-x86_64/
META-INF/maven/com.github.fommil.netlib/netlib-native_system-win-x86_64/pom.xml
META-INF/maven/com.github.fommil.netlib/netlib-native_system-win-x86_64/pom.properties
netlib-native_system-win-i686.dll
netlib-native_system-win-i686.dll.asc
netlib-native_system-win-i686.pom
netlib-native_system-win-i686.pom.asc
META-INF/maven/com.github.fommil.netlib/netlib-native_system-win-i686/
META-INF/maven/com.github.fommil.netlib/netlib-native_system-win-i686/pom.xml

[jira] [Resolved] (SPARK-4816) Maven profile netlib-lgpl does not work

2015-12-14 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4816?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-4816.
--
   Resolution: Fixed
 Assignee: Sean Owen
Fix Version/s: 1.4.2

I see, so it's re-fixed for older (but supported) versions of Maven by a commit 
already in the branch. Elsewhere, it's a moot point. I guess we can consider it 
fixed as a better resolution here.

> Maven profile netlib-lgpl does not work
> ---
>
> Key: SPARK-4816
> URL: https://issues.apache.org/jira/browse/SPARK-4816
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 1.1.0
> Environment: maven 3.0.5 / Ubuntu
>Reporter: Guillaume Pitel
>Assignee: Sean Owen
>Priority: Minor
> Fix For: 1.4.2, 1.1.1
>
>
> When doing what the documentation recommends to recompile Spark with the Netlib 
> native system binding (i.e. to bind with OpenBLAS or, in my case, MKL), 
> mvn -Pnetlib-lgpl -Pyarn -Phadoop-2.3 -Dhadoop.version=2.3.0 -DskipTests 
> clean package
> the resulting assembly jar still lacked the netlib-system class. (I checked 
> the content of spark-assembly...jar.)
> When forcing the netlib-lgpl profile in the MLlib package to be active, the jar 
> is built correctly.
> So I guess it's a problem with the way Maven passes profile activations to 
> child modules.
> Also, despite the documentation claiming that if the job's jar contains 
> netlib with the necessary bindings it should work, it does not. The classloader 
> must be unhappy with two occurrences of netlib?






[jira] [Comment Edited] (SPARK-2356) Exception: Could not locate executable null\bin\winutils.exe in the Hadoop

2015-12-14 Thread Michael Han (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15055633#comment-15055633
 ] 

Michael Han edited comment on SPARK-2356 at 12/14/15 9:21 AM:
--

Hello everyone,

I encountered this issue again today when I tried to create a cluster using two 
Windows 7 (64-bit) desktops.
The error happens when I register the second worker with the master using the 
following command:
spark-class org.apache.spark.deploy.worker.Worker spark://masternode:7077

Strangely, it works fine when I register the first worker with the master.
Does anyone know a workaround for this issue?
The workaround mentioned above works fine when I use local mode.
I registered one worker successfully in the cluster, but when I run 
spark-submit on that worker, it also throws this exception.

I tried setting HADOOP_HOME = C:\winutil in the environment variables, but it 
doesn't work.
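
For reference, the workaround most often cited for this particular IOException is to make winutils.exe available under %HADOOP_HOME%\bin and to point hadoop.home.dir at that directory before any Hadoop class loads; a minimal sketch (it assumes C:\winutil\bin\winutils.exe actually exists and is not verified against the cluster setup described here):

{code}
// Commonly cited workaround sketch: hadoop.home.dir must point at a directory
// that contains bin\winutils.exe, and it has to be set before Hadoop classes load.
System.setProperty("hadoop.home.dir", "C:\\winutil")

val conf = new org.apache.spark.SparkConf()
  .setAppName("winutils-check")
  .setMaster("local[*]")
val sc = new org.apache.spark.SparkContext(conf)
{code}
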
The error is:
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
15/12/14 16:49:22 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
15/12/14 16:49:22 ERROR Shell: Failed to locate the winutils binary in the hadoop binary path
java.io.IOException: Could not locate executable null\bin\winutils.exe in the Hadoop binaries.
    at org.apache.hadoop.util.Shell.getQualifiedBinPath(Shell.java:355)
    at org.apache.hadoop.util.Shell.getWinUtilsPath(Shell.java:370)
    at org.apache.hadoop.util.Shell.<clinit>(Shell.java:363)
    at org.apache.hadoop.util.StringUtils.<clinit>(StringUtils.java:79)
    at org.apache.hadoop.security.Groups.parseStaticMapping(Groups.java:104)
    at org.apache.hadoop.security.Groups.<init>(Groups.java:86)
    at org.apache.hadoop.security.Groups.<init>(Groups.java:66)
    at org.apache.hadoop.security.Groups.getUserToGroupsMappingService(Groups.java:280)
    at org.apache.hadoop.security.UserGroupInformation.initialize(UserGroupInformation.java:271)
    at org.apache.hadoop.security.UserGroupInformation.ensureInitialized(UserGroupInformation.java:248)
    at org.apache.hadoop.security.UserGroupInformation.loginUserFromSubject(UserGroupInformation.java:763)
    at org.apache.hadoop.security.UserGroupInformation.getLoginUser(UserGroupInformation.java:748)
    at org.apache.hadoop.security.UserGroupInformation.getCurrentUser(UserGroupInformation.java:621)
    at org.apache.spark.util.Utils$$anonfun$getCurrentUserName$1.apply(Utils.scala:2091)
    at org.apache.spark.util.Utils$$anonfun$getCurrentUserName$1.apply(Utils.scala:2091)
    at scala.Option.getOrElse(Option.scala:120)
    at org.apache.spark.util.Utils$.getCurrentUserName(Utils.scala:2091)
    at org.apache.spark.SecurityManager.<init>(SecurityManager.scala:212)
    at org.apache.spark.deploy.worker.Worker$.startRpcEnvAndEndpoint(Worker.scala:692)
    at org.apache.spark.deploy.worker.Worker$.main(Worker.scala:674)
    at org.apache.spark.deploy.worker.Worker.main(Worker.scala)
15/12/14 16:49:22 INFO SecurityManager: Changing view acls to: mh6
15/12/14 16:49:22 INFO SecurityManager: Changing modify acls to: mh6
15/12/14 16:49:22 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(mh6); users with modify permissions: Set(mh6)
15/12/14 16:49:23 INFO Slf4jLogger: Slf4jLogger started
15/12/14 16:49:23 INFO Remoting: Starting remoting
15/12/14 16:49:24 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkWorker@167.3.129.160:46862]
15/12/14 16:49:24 INFO Utils: Successfully started service 'sparkWorker' on port 46862.
15/12/14 16:49:24 INFO Worker: Starting Spark worker 167.3.129.160:46862 with 4 cores, 2.9 GB RAM
15/12/14 16:49:24 INFO Worker: Running Spark version 1.5.2
15/12/14 16:49:24 INFO Worker: Spark home: C:\spark-1.5.2-bin-hadoop2.6\bin\..
15/12/14 16:49:24 INFO Utils: Successfully started service 'WorkerUI' on port 8081.
15/12/14 16:49:24 INFO WorkerWebUI: Started WorkerWebUI at http://167.3.129.160:8081
15/12/14 16:49:24 INFO Worker: Connecting to master 192.168.79.1:7077...
15/12/14 16:49:39 INFO Worker: Retrying connection to master (attempt # 1)
15/12/14 16:49:39 ERROR SparkUncaughtExceptionHandler: Uncaught exception in thread Thread[sparkWorker-akka.actor.default-dispatcher-2,5,main]
java.util.concurrent.RejectedExecutionException: Task java.util.concurrent.FutureTask@3ef5e68c rejected from java.util.concurrent.ThreadPoolExecutor@741cb720[Running, pool size = 1, active threads = 1, queued tasks = 0, completed tasks = 0]
    at java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.java:2047)
    at java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:823)
    at 

[jira] [Assigned] (SPARK-11882) Allow for running Spark applications against a custom coarse grained scheduler

2015-12-14 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11882?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11882:


Assignee: (was: Apache Spark)

> Allow for running Spark applications against a custom coarse grained scheduler
> --
>
> Key: SPARK-11882
> URL: https://issues.apache.org/jira/browse/SPARK-11882
> Project: Spark
>  Issue Type: Wish
>  Components: Spark Core, Spark Submit
>Reporter: Jacek Lewandowski
>Priority: Minor
>
> SparkContext decides which scheduler to use according to the Master 
> URI. How about running applications against a custom scheduler? Such a custom 
> scheduler would just extend {{CoarseGrainedSchedulerBackend}}. 
> The custom scheduler would be created by a provided factory. Factories would 
> be defined in the configuration like 
> {{spark.scheduler.factory.<name>=<factory class name>}}, where {{name}} is the 
> scheduler name. Once {{SparkContext}} learns that the master address is not 
> for standalone, YARN, Mesos, local or any other predefined scheduler, it 
> would resolve the scheme from the provided master URI and look for the scheduler 
> factory whose name equals the resolved scheme. 
> For example, with
> {{spark.scheduler.factory.custom=org.a.b.c.CustomSchedulerFactory}}
> the Master address would be {{custom://192.168.1.1}}.
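
A rough sketch of what such a factory hook might look like (every name below is hypothetical; the concrete interface is whatever the attached pull request proposes):

{code}
import org.apache.spark.SparkConf

// Hypothetical factory trait, sketched from the description above.
trait CoarseGrainedSchedulerFactory {
  // Would return a CoarseGrainedSchedulerBackend in a real implementation.
  def createSchedulerBackend(conf: SparkConf, masterUrl: String): AnyRef
}

class CustomSchedulerFactory extends CoarseGrainedSchedulerFactory {
  override def createSchedulerBackend(conf: SparkConf, masterUrl: String): AnyRef = {
    // Build a backend for master URLs of the form custom://host:port here.
    ???
  }
}

// With spark.scheduler.factory.custom=org.a.b.c.CustomSchedulerFactory set,
// a master URL of custom://192.168.1.1 would resolve the "custom" scheme to
// this factory.
{code}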






[jira] [Commented] (SPARK-11882) Allow for running Spark applications against a custom coarse grained scheduler

2015-12-14 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15055842#comment-15055842
 ] 

Apache Spark commented on SPARK-11882:
--

User 'jacek-lewandowski' has created a pull request for this issue:
https://github.com/apache/spark/pull/10292

> Allow for running Spark applications against a custom coarse grained scheduler
> --
>
> Key: SPARK-11882
> URL: https://issues.apache.org/jira/browse/SPARK-11882
> Project: Spark
>  Issue Type: Wish
>  Components: Spark Core, Spark Submit
>Reporter: Jacek Lewandowski
>Priority: Minor
>
> SparkContext decides which scheduler to use according to the Master 
> URI. How about running applications against a custom scheduler? Such a custom 
> scheduler would just extend {{CoarseGrainedSchedulerBackend}}. 
> The custom scheduler would be created by a provided factory. Factories would 
> be defined in the configuration like 
> {{spark.scheduler.factory.<name>=<factory class name>}}, where {{name}} is the 
> scheduler name. Once {{SparkContext}} learns that the master address is not 
> for standalone, YARN, Mesos, local or any other predefined scheduler, it 
> would resolve the scheme from the provided master URI and look for the scheduler 
> factory whose name equals the resolved scheme. 
> For example, with
> {{spark.scheduler.factory.custom=org.a.b.c.CustomSchedulerFactory}}
> the Master address would be {{custom://192.168.1.1}}.






[jira] [Assigned] (SPARK-11882) Allow for running Spark applications against a custom coarse grained scheduler

2015-12-14 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11882?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11882:


Assignee: Apache Spark

> Allow for running Spark applications against a custom coarse grained scheduler
> --
>
> Key: SPARK-11882
> URL: https://issues.apache.org/jira/browse/SPARK-11882
> Project: Spark
>  Issue Type: Wish
>  Components: Spark Core, Spark Submit
>Reporter: Jacek Lewandowski
>Assignee: Apache Spark
>Priority: Minor
>
> SparkContext decides which scheduler to use according to the Master 
> URI. How about running applications against a custom scheduler? Such a custom 
> scheduler would just extend {{CoarseGrainedSchedulerBackend}}. 
> The custom scheduler would be created by a provided factory. Factories would 
> be defined in the configuration like 
> {{spark.scheduler.factory.<name>=<factory class name>}}, where {{name}} is the 
> scheduler name. Once {{SparkContext}} learns that the master address is not 
> for standalone, YARN, Mesos, local or any other predefined scheduler, it 
> would resolve the scheme from the provided master URI and look for the scheduler 
> factory whose name equals the resolved scheme. 
> For example, with
> {{spark.scheduler.factory.custom=org.a.b.c.CustomSchedulerFactory}}
> the Master address would be {{custom://192.168.1.1}}.






[jira] [Assigned] (SPARK-12320) throw exception if the number of fields does not line up for Tuple encoder

2015-12-14 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12320?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12320:


Assignee: (was: Apache Spark)

> throw exception if the number of fields does not line up for Tuple encoder
> --
>
> Key: SPARK-12320
> URL: https://issues.apache.org/jira/browse/SPARK-12320
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Wenchen Fan
>







[jira] [Commented] (SPARK-12320) throw exception if the number of fields does not line up for Tuple encoder

2015-12-14 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15055858#comment-15055858
 ] 

Apache Spark commented on SPARK-12320:
--

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/10293

> throw exception if the number of fields does not line up for Tuple encoder
> --
>
> Key: SPARK-12320
> URL: https://issues.apache.org/jira/browse/SPARK-12320
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Wenchen Fan
>







[jira] [Assigned] (SPARK-12320) throw exception if the number of fields does not line up for Tuple encoder

2015-12-14 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12320?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12320:


Assignee: Apache Spark

> throw exception if the number of fields does not line up for Tuple encoder
> --
>
> Key: SPARK-12320
> URL: https://issues.apache.org/jira/browse/SPARK-12320
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Wenchen Fan
>Assignee: Apache Spark
>







[jira] [Assigned] (SPARK-12323) Don't assign default value for non-nullable columns of a Dataset

2015-12-14 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12323?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12323:


Assignee: Apache Spark  (was: Cheng Lian)

> Don't assign default value for non-nullable columns of a Dataset
> 
>
> Key: SPARK-12323
> URL: https://issues.apache.org/jira/browse/SPARK-12323
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0, 2.0.0
>Reporter: Cheng Lian
>Assignee: Apache Spark
>
> For a field of a Dataset, if it's specified as non-nullable in the schema of 
> the Dataset, we shouldn't assign a default value for it if the input data 
> contains null. Instead, a runtime exception with a nice error message should 
> be thrown, asking the user to use {{Option}} or nullable types (e.g., 
> {{java.lang.Integer}} instead of {{scala.Int}}).
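
A small illustration of the {{Option}} vs. primitive distinction described above (a sketch assuming an existing SparkContext `sc` and a Spark 1.6 SQLContext named sqlContext; it is not the fix itself):

{code}
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

// Sketch: with a primitive field a null in the input currently becomes a
// default value (0); with Option the null can be represented as None.
case class PersonStrict(name: String, age: Int)            // non-nullable age
case class PersonNullable(name: String, age: Option[Int])  // nullable age

import sqlContext.implicits._
val schema = StructType(Seq(StructField("name", StringType), StructField("age", IntegerType)))
val df = sqlContext.createDataFrame(sc.parallelize(Seq(Row("alice", null))), schema)

df.as[PersonNullable].collect()  // PersonNullable(alice, None)
df.as[PersonStrict].collect()    // should fail with a clear error, not yield age = 0
{code}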






[jira] [Resolved] (SPARK-12016) word2vec load model can't use findSynonyms to get words

2015-12-14 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu resolved SPARK-12016.

   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 10100
[https://github.com/apache/spark/pull/10100]

> word2vec load model can't use findSynonyms to get words 
> 
>
> Key: SPARK-12016
> URL: https://issues.apache.org/jira/browse/SPARK-12016
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.5.2
> Environment: ubuntu 14.04
>Reporter: yuangang.liu
> Fix For: 2.0.0
>
>
> I use word2vec.fit to train a Word2VecModel and then save the model to the file 
> system. When I load the model from the file system, I find that I can use 
> transform('a') to get a vector, but I can't use findSynonyms('a', 2) to get 
> some words.
> I use the following code to test word2vec:
> from pyspark import SparkContext
> from pyspark.mllib.feature import Word2Vec, Word2VecModel
> import os, tempfile
> from shutil import rmtree
> if __name__ == '__main__':
>     sc = SparkContext('local', 'test')
>     sentence = "a b " * 100 + "a c " * 10
>     localDoc = [sentence, sentence]
>     doc = sc.parallelize(localDoc).map(lambda line: line.split(" "))
>     model = Word2Vec().setVectorSize(10).setSeed(42).fit(doc)
>     syms = model.findSynonyms("a", 2)
>     print [s[0] for s in syms]
>     path = tempfile.mkdtemp()
>     model.save(sc, path)
>     sameModel = Word2VecModel.load(sc, path)
>     print model.transform("a") == sameModel.transform("a")
>     syms = sameModel.findSynonyms("a", 2)
>     print [s[0] for s in syms]
>     try:
>         rmtree(path)
>     except OSError:
>         pass
> The first print gives "[u'b', u'c']",
> then "True", and then " [u'__class__'] ".
> I don't know how to get 'b' or 'c' with sameModel.findSynonyms("a", 2).






[jira] [Commented] (SPARK-4816) Maven profile netlib-lgpl does not work

2015-12-14 Thread RJ Nowling (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15056414#comment-15056414
 ] 

RJ Nowling commented on SPARK-4816:
---

Happy to try Maven 3.3.x and report back. That would certainly confirm whether 
it's a Maven bug or a regression in behavior.

> Maven profile netlib-lgpl does not work
> ---
>
> Key: SPARK-4816
> URL: https://issues.apache.org/jira/browse/SPARK-4816
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 1.1.0
> Environment: maven 3.0.5 / Ubuntu
>Reporter: Guillaume Pitel
>Priority: Minor
> Fix For: 1.1.1
>
>
> When doing what the documentation recommends to recompile Spark with the Netlib 
> native system binding (i.e. to bind with OpenBLAS or, in my case, MKL), 
> mvn -Pnetlib-lgpl -Pyarn -Phadoop-2.3 -Dhadoop.version=2.3.0 -DskipTests 
> clean package
> the resulting assembly jar still lacked the netlib-system class. (I checked 
> the content of spark-assembly...jar.)
> When forcing the netlib-lgpl profile in the MLlib package to be active, the jar 
> is built correctly.
> So I guess it's a problem with the way Maven passes profile activations to 
> child modules.
> Also, despite the documentation claiming that if the job's jar contains 
> netlib with the necessary bindings it should work, it does not. The classloader 
> must be unhappy with two occurrences of netlib?






[jira] [Assigned] (SPARK-12271) Improve error message for Dataset.as[] when the schema is incompatible.

2015-12-14 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12271?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12271:


Assignee: Apache Spark

> Improve error message for Dataset.as[] when the schema is incompatible.
> ---
>
> Key: SPARK-12271
> URL: https://issues.apache.org/jira/browse/SPARK-12271
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Nong Li
>Assignee: Apache Spark
>
> It currently fails with an unexecutable exception.






[jira] [Closed] (SPARK-11255) R Test build should run on R 3.1.1

2015-12-14 Thread shane knapp (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11255?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

shane knapp closed SPARK-11255.
---

> R Test build should run on R 3.1.1
> --
>
> Key: SPARK-11255
> URL: https://issues.apache.org/jira/browse/SPARK-11255
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Reporter: Felix Cheung
>Assignee: shane knapp
>Priority: Minor
>
> Tests should run on R 3.1.1, which is the version listed as supported.
> Apparently there are a few R changes that can go undetected since the Jenkins 
> test build is running something newer.






[jira] [Resolved] (SPARK-11255) R Test build should run on R 3.1.1

2015-12-14 Thread shane knapp (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11255?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

shane knapp resolved SPARK-11255.
-
Resolution: Fixed

this is done

> R Test build should run on R 3.1.1
> --
>
> Key: SPARK-11255
> URL: https://issues.apache.org/jira/browse/SPARK-11255
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Reporter: Felix Cheung
>Assignee: shane knapp
>Priority: Minor
>
> Tests should run on R 3.1.1, which is the version listed as supported.
> Apparently there are a few R changes that can go undetected since the Jenkins 
> test build is running something newer.






[jira] [Assigned] (SPARK-12271) Improve error message for Dataset.as[] when the schema is incompatible.

2015-12-14 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12271?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12271:


Assignee: (was: Apache Spark)

> Improve error message for Dataset.as[] when the schema is incompatible.
> ---
>
> Key: SPARK-12271
> URL: https://issues.apache.org/jira/browse/SPARK-12271
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Nong Li
>
> It currently fails with an unexecutable exception.






[jira] [Commented] (SPARK-12296) Feature parity for pyspark.mllib StandardScalerModel

2015-12-14 Thread holdenk (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12296?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15056576#comment-15056576
 ] 

holdenk commented on SPARK-12296:
-

I can take this one :)

> Feature parity for pyspark.mllib StandardScalerModel
> 
>
> Key: SPARK-12296
> URL: https://issues.apache.org/jira/browse/SPARK-12296
> Project: Spark
>  Issue Type: Sub-task
>  Components: MLlib, PySpark
>Reporter: Joseph K. Bradley
>Priority: Minor
>
> Some methods are missing, such as ways to access the std, mean, etc.  This 
> JIRA is for feature parity for pyspark.mllib.feature.StandardScaler & 
> StandardScalerModel






[jira] [Commented] (SPARK-12304) Make Spark Streaming web UI display more friendly Receiver graphs

2015-12-14 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15056422#comment-15056422
 ] 

Apache Spark commented on SPARK-12304:
--

User 'proflin' has created a pull request for this issue:
https://github.com/apache/spark/pull/10276

> Make Spark Streaming web UI display more friendly Receiver graphs
> -
>
> Key: SPARK-12304
> URL: https://issues.apache.org/jira/browse/SPARK-12304
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Affects Versions: 1.5.2
>Reporter: Liwei Lin
>Priority: Minor
> Attachments: after-5.png, before-5.png
>
>
> Currently, the Spark Streaming web UI uses the same maxY when displaying 'Input 
> Rate Times & Histograms' and 'Per-Receiver Times & Histograms'. 
> This may lead to somewhat unfriendly graphs: once we have tens of Receivers 
> or more, every 'Per-Receiver Times' line almost hits the ground.
> This issue proposes to calculate a new maxY, separate from the original one, 
> which is shared among all the 'Per-Receiver Times & Histograms' graphs.
> Before:
> !before-5.png!
> After:
> !after-5.png!
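
The gist of the proposal can be sketched in a few lines (made-up numbers, not the patch itself): derive the ceiling for the per-receiver graphs from the per-receiver rates only, instead of reusing the ceiling of the aggregate graph.

{code}
// Sketch with made-up data: event-rate histories (records/sec) per receiver.
val perReceiverRates: Map[Int, Seq[Double]] = Map(
  0 -> Seq(120.0, 150.0, 90.0),
  1 -> Seq(10.0, 12.0, 8.0)
)

// Shared ceiling for all 'Per-Receiver Times & Histograms' graphs, computed
// from the receivers themselves rather than from the total 'Input Rate' graph.
val perReceiverMaxY = perReceiverRates.values.flatten.foldLeft(0.0)(math.max)
{code}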






[jira] [Commented] (SPARK-12317) Support configurate value with unit(e.g. kb/mb/gb) in SQL

2015-12-14 Thread Bo Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12317?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15056471#comment-15056471
 ] 

Bo Meng commented on SPARK-12317:
-

Good point. We can follow the JVM convention for memory configuration, i.e. a 
size suffix of [g|G|m|M|k|K].

> Support configurate value with unit(e.g. kb/mb/gb) in SQL
> -
>
> Key: SPARK-12317
> URL: https://issues.apache.org/jira/browse/SPARK-12317
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.5.2
>Reporter: Yadong Qi
>Priority: Minor
>
> e.g. `spark.sql.autoBroadcastJoinThreshold` should be configurable as `10MB` 
> instead of `10485760`, because `10MB` is easier to read than `10485760`.
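
A toy sketch of the convention being discussed (Spark's existing size-string utilities would be the natural home for this; the parser below is only illustrative and not the proposed implementation):

{code}
// Toy parser: accept a number with an optional k/m/g suffix (and optional 'b')
// and normalise it to bytes.
def parseByteString(s: String): Long = {
  val Pattern = """(?i)\s*([0-9]+(?:\.[0-9]+)?)\s*([kmg]?)b?\s*""".r
  s match {
    case Pattern(num, unit) =>
      val factor = unit.toLowerCase match {
        case "k" => 1L << 10
        case "m" => 1L << 20
        case "g" => 1L << 30
        case _   => 1L
      }
      (num.toDouble * factor).toLong
    case _ =>
      throw new IllegalArgumentException(s"Cannot parse byte string: $s")
  }
}

parseByteString("10MB")   // 10485760
parseByteString("1.5gb")  // 1610612736
{code}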






[jira] [Assigned] (SPARK-12323) Don't assign default value for non-nullable columns of a Dataset

2015-12-14 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12323?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12323:


Assignee: Cheng Lian  (was: Apache Spark)

> Don't assign default value for non-nullable columns of a Dataset
> 
>
> Key: SPARK-12323
> URL: https://issues.apache.org/jira/browse/SPARK-12323
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0, 2.0.0
>Reporter: Cheng Lian
>Assignee: Cheng Lian
>
> For a field of a Dataset, if it's specified as non-nullable in the schema of 
> the Dataset, we shouldn't assign a default value for it if the input data 
> contains null. Instead, a runtime exception with a nice error message should 
> be thrown, asking the user to use {{Option}} or nullable types (e.g., 
> {{java.lang.Integer}} instead of {{scala.Int}}).






[jira] [Commented] (SPARK-4816) Maven profile netlib-lgpl does not work

2015-12-14 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15056409#comment-15056409
 ] 

Sean Owen commented on SPARK-4816:
--

I'm on Maven 3.3.x. I wonder if that could be a difference -- can you try 3.3.x 
just to check?

If you're correct, this is already fixed for the next 1.4 which should be 
1.4.2. I don't know if/when that will be released though. (I also don't know 
why the branch shows 1.4.3-SNAPSHOT) It's as fixed as it would be for this 
branch though. But then yes it would be listed as fixed as part of any release 
notes, automatically.

I think finding a relevant JIRA may be as good as it gets in the general case 
for finding whether something's already known as an issue and fixed. This one 
ought to be easy to find by keyword. Of course -- if there is a problem -- just 
having it work in later releases is even better.

I'm not aware of any additional fix that needs to be made though. As I say I 
can't even reproduce it.

> Maven profile netlib-lgpl does not work
> ---
>
> Key: SPARK-4816
> URL: https://issues.apache.org/jira/browse/SPARK-4816
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 1.1.0
> Environment: maven 3.0.5 / Ubuntu
>Reporter: Guillaume Pitel
>Priority: Minor
> Fix For: 1.1.1
>
>
> When doing what the documentation recommends to recompile Spark with the Netlib 
> native system binding (i.e. to bind with OpenBLAS or, in my case, MKL), 
> mvn -Pnetlib-lgpl -Pyarn -Phadoop-2.3 -Dhadoop.version=2.3.0 -DskipTests 
> clean package
> the resulting assembly jar still lacked the netlib-system class. (I checked 
> the content of spark-assembly...jar.)
> When forcing the netlib-lgpl profile in the MLlib package to be active, the jar 
> is built correctly.
> So I guess it's a problem with the way Maven passes profile activations to 
> child modules.
> Also, despite the documentation claiming that if the job's jar contains 
> netlib with the necessary bindings it should work, it does not. The classloader 
> must be unhappy with two occurrences of netlib?






[jira] [Commented] (SPARK-5506) java.lang.ClassCastException using lambda expressions in combination of spark and Servlet

2015-12-14 Thread Pavan Achanta (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5506?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15056413#comment-15056413
 ] 

Pavan Achanta commented on SPARK-5506:
--

I get the same exception while running a job from Intellij IDE. 

{code}

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class App {
    public static void main(String[] args) {
        String logFile = "/usr/local/spark-1.5.2/README.md"; // Should be some file on your system
        SparkConf conf = new SparkConf().setAppName("Simple Application")
                .set("spark.eventLog.enabled", "true")
                .set("spark.eventLog.dir", "/opt/logs/")
                //.setMaster("local")
                .setMaster("spark://localhost:7077");
        JavaSparkContext sc = new JavaSparkContext(conf);
        JavaRDD<String> logData = sc.textFile(logFile).cache();

        long numAs = logData.filter(s -> s.contains("a")).count();
        long numBs = logData.filter(s -> s.contains("b")).count();

        System.out.println("Lines with a: " + numAs + ", lines with b: " + numBs);
    }
}
{code}



The exception I see is as follows:
{code}
15/12/13 23:47:58 INFO SparkDeploySchedulerBackend: Registered executor: AkkaRpcEndpointRef(Actor[akka.tcp://sparkExecutor@127.0.0.1:50873/user/Executor#-484673147]) with ID 0
15/12/13 23:47:59 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, 127.0.0.1, PROCESS_LOCAL, 2146 bytes)
15/12/13 23:47:59 INFO TaskSetManager: Starting task 1.0 in stage 0.0 (TID 1, 127.0.0.1, PROCESS_LOCAL, 2146 bytes)
15/12/13 23:47:59 INFO BlockManagerMasterEndpoint: Registering block manager 127.0.0.1:50877 with 530.0 MB RAM, BlockManagerId(0, 127.0.0.1, 50877)
15/12/13 23:48:00 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on 127.0.0.1:50877 (size: 2.2 KB, free: 530.0 MB)
15/12/13 23:48:01 WARN TaskSetManager: Lost task 1.0 in stage 0.0 (TID 1, 127.0.0.1): java.lang.ClassCastException: cannot assign instance of java.lang.invoke.SerializedLambda to field org.apache.spark.api.java.JavaRDD$$anonfun$filter$1.f$1 of type org.apache.spark.api.java.function.Function in instance of org.apache.spark.api.java.JavaRDD$$anonfun$filter$1
    at java.io.ObjectStreamClass$FieldReflector.setObjFieldValues(ObjectStreamClass.java:2133)
    at java.io.ObjectStreamClass.setObjFieldValues(ObjectStreamClass.java:1305)
    at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2006)
    at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1924)
    at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
    at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2000)
    at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1924)
    at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
    at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2000)
    at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1924)
    at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
    at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2000)
    at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1924)
    at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
    at java.io.ObjectInputStream.readObject(ObjectInputStream.java:371)
    at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:72)
    at org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:98)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
    at org.apache.spark.scheduler.Task.run(Task.scala:88)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)

15/12/13 23:48:01 INFO TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0) on executor 127.0.0.1: java.lang.ClassCastException (cannot assign instance of java.lang.invoke.SerializedLambda to field org.apache.spark.api.java.JavaRDD$$anonfun$filter$1.f$1 of type org.apache.spark.api.java.function.Function in instance of org.apache.spark.api.java.JavaRDD$$anonfun$filter$1) [duplicate 1]
15/12/13 23:48:01 INFO 

[jira] [Closed] (SPARK-12282) Document spark.jars

2015-12-14 Thread Justin Bailey (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12282?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Justin Bailey closed SPARK-12282.
-

> Document spark.jars
> ---
>
> Key: SPARK-12282
> URL: https://issues.apache.org/jira/browse/SPARK-12282
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Reporter: Justin Bailey
>Priority: Trivial
>
> The spark.jars property (as implemented in SparkSubmit.scala,  
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L516)
>  is not documented anywhere, and should be.






[jira] [Commented] (SPARK-11255) R Test build should run on R 3.1.1

2015-12-14 Thread shane knapp (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11255?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15056461#comment-15056461
 ] 

shane knapp commented on SPARK-11255:
-

this is happening now.  i forgot about it last week...  :)

> R Test build should run on R 3.1.1
> --
>
> Key: SPARK-11255
> URL: https://issues.apache.org/jira/browse/SPARK-11255
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Reporter: Felix Cheung
>Assignee: shane knapp
>Priority: Minor
>
> Tests should run on R 3.1.1, which is the version listed as supported.
> Apparently there are a few R changes that can go undetected since the Jenkins 
> test build is running something newer.






[jira] [Commented] (SPARK-4816) Maven profile netlib-lgpl does not work

2015-12-14 Thread RJ Nowling (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15056368#comment-15056368
 ] 

RJ Nowling commented on SPARK-4816:
---

I want to push for two things (a) some sort of documentation for users (e.g., 
release notes in the next releases) and (b) make sure it's fixed in the latest 
releases.  I want users to be able to find documentation (like this JIRA) so 
they don't have to spend time tracking it down like I did.  

Spark 1.4.2 hasn't been released yet and git has moved to a 1.4.3 SNAPSHOT.  
You mention adding the commit to the 1.5.x branch in the commit -- has this 
been done?

Until 1.4.3 and a 1.5.x release are out with your change, this could still hit 
certain users, even if it's rare because it's tied to a specific Maven version 
or such.

> Maven profile netlib-lgpl does not work
> ---
>
> Key: SPARK-4816
> URL: https://issues.apache.org/jira/browse/SPARK-4816
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 1.1.0
> Environment: maven 3.0.5 / Ubuntu
>Reporter: Guillaume Pitel
>Priority: Minor
> Fix For: 1.1.1
>
>
> When doing what the documentation recommends to recompile Spark with the Netlib 
> native system binding (i.e. to bind with OpenBLAS or, in my case, MKL), 
> mvn -Pnetlib-lgpl -Pyarn -Phadoop-2.3 -Dhadoop.version=2.3.0 -DskipTests 
> clean package
> the resulting assembly jar still lacked the netlib-system class. (I checked 
> the content of spark-assembly...jar.)
> When forcing the netlib-lgpl profile in the MLlib package to be active, the jar 
> is built correctly.
> So I guess it's a problem with the way Maven passes profile activations to 
> child modules.
> Also, despite the documentation claiming that if the job's jar contains 
> netlib with the necessary bindings it should work, it does not. The classloader 
> must be unhappy with two occurrences of netlib?






[jira] [Commented] (SPARK-4816) Maven profile netlib-lgpl does not work

2015-12-14 Thread RJ Nowling (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15056375#comment-15056375
 ] 

RJ Nowling commented on SPARK-4816:
---

Also, what version of Maven are you running?

> Maven profile netlib-lgpl does not work
> ---
>
> Key: SPARK-4816
> URL: https://issues.apache.org/jira/browse/SPARK-4816
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 1.1.0
> Environment: maven 3.0.5 / Ubuntu
>Reporter: Guillaume Pitel
>Priority: Minor
> Fix For: 1.1.1
>
>
> When doing what the documentation recommends to recompile Spark with the Netlib 
> native system binding (i.e. to bind with OpenBLAS or, in my case, MKL), 
> mvn -Pnetlib-lgpl -Pyarn -Phadoop-2.3 -Dhadoop.version=2.3.0 -DskipTests 
> clean package
> the resulting assembly jar still lacked the netlib-system class. (I checked 
> the content of spark-assembly...jar.)
> When forcing the netlib-lgpl profile in the MLlib package to be active, the jar 
> is built correctly.
> So I guess it's a problem with the way Maven passes profile activations to 
> child modules.
> Also, despite the documentation claiming that if the job's jar contains 
> netlib with the necessary bindings it should work, it does not. The classloader 
> must be unhappy with two occurrences of netlib?






[jira] [Commented] (SPARK-12317) Support configurate value with unit(e.g. kb/mb/gb) in SQL

2015-12-14 Thread Bo Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12317?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15056446#comment-15056446
 ] 

Bo Meng commented on SPARK-12317:
-

If we want to go that route, my suggestion is to support a Double as the number 
plus the unit, for example 1.5GB; that would make the configuration more 
general.

> Support configurate value with unit(e.g. kb/mb/gb) in SQL
> -
>
> Key: SPARK-12317
> URL: https://issues.apache.org/jira/browse/SPARK-12317
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.5.2
>Reporter: Yadong Qi
>Priority: Minor
>
> e.g. `spark.sql.autoBroadcastJoinThreshold` should be configurable as `10MB` 
> instead of `10485760`, because `10MB` is easier to read than `10485760`.






[jira] [Resolved] (SPARK-12282) Document spark.jars

2015-12-14 Thread Justin Bailey (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12282?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Justin Bailey resolved SPARK-12282.
---
Resolution: Not A Problem

> Document spark.jars
> ---
>
> Key: SPARK-12282
> URL: https://issues.apache.org/jira/browse/SPARK-12282
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Reporter: Justin Bailey
>Priority: Trivial
>
> The spark.jars property (as implemented in SparkSubmit.scala,  
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L516)
>  is not documented anywhere, and should be.






[jira] [Commented] (SPARK-12317) Support configurate value with unit(e.g. kb/mb/gb) in SQL

2015-12-14 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12317?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15056448#comment-15056448
 ] 

Sean Owen commented on SPARK-12317:
---

I like the idea though am so used to the JVM's version of this which doesn't 
allow fractional values.

> Support configurate value with unit(e.g. kb/mb/gb) in SQL
> -
>
> Key: SPARK-12317
> URL: https://issues.apache.org/jira/browse/SPARK-12317
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.5.2
>Reporter: Yadong Qi
>Priority: Minor
>
> e.g. `spark.sql.autoBroadcastJoinThreshold` should be configurable as `10MB` 
> instead of `10485760`, because `10MB` is easier to read than `10485760`.






[jira] [Updated] (SPARK-12325) Inappropriate error messages in DataFrame StatFunctions

2015-12-14 Thread Narine Kokhlikyan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12325?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Narine Kokhlikyan updated SPARK-12325:
--
Description: 
Hi there,

I have mentioned this issue earlier in one of my pull requests for the SQL 
component, but I've never received feedback on any of them.
https://github.com/apache/spark/pull/9366#issuecomment-155171975

Although this has been very frustrating, I'll try to list certain facts again:

1. I call dataframe correlation method and it says that covariance is wrong.
I do not think that this is an appropriate message to show here.

scala> df.stat.corr("rating", "income")
java.lang.IllegalArgumentException: requirement failed: Covariance calculation 
for columns with dataType StringType not supported.
at scala.Predef$.require(Predef.scala:233)
at 
org.apache.spark.sql.execution.stat.StatFunctions$$anonfun$collectStatisticalData$3.apply(StatFunctions.scala:81)


2. The bigger issue here is not the message itself, but the design.
A class called CovarianceCounter does the computations for both correlation and 
covariance. This may be convenient from a certain perspective, but it is harder 
to understand and extend, especially if you want to add another algorithm, 
e.g. Spearman correlation.

There are several possible solutions here, starting from:
1. just fixing the message
2. fixing the message and renaming CovarianceCounter and the corresponding methods
3. creating a CorrelationCounter and splitting the computations for correlation 
and covariance

and many more.

Since I'm not getting any response and, according to GitHub, all five of you 
have been working on this, I'll try again:
[~brkyvz], [~rxin], [~davies], [~viirya], [~cloud_fan]

Can any of you please explain this behavior of the stat functions, or 
communicate more about it?
In case you are planning to remove it or change it, we'd truly appreciate it if 
you would let us know.

In fact, I would like to open a pull request for this, but since my pull 
requests in the SQL/ML components are sitting there without any response, I'll 
wait for your response first.

cc: [~shivaram], [~mengxr]

Thank you,
Narine


  was:
Hi there,

I have mentioned this issue earlier in one of my pull requests for SQL 
component, but I've never received a feedback in any of them.
https://github.com/apache/spark/pull/9366#issuecomment-155171975

Although this has been very frustrating, I'll try to list certain facts again:

1. I call dataframe correlation method and it says that covariance is wrong.
I do not think that this is an appropriate message to show here.

scala> df.stat.corr("rating", "income")
java.lang.IllegalArgumentException: requirement failed: Covariance calculation 
for columns with dataType StringType not supported.
at scala.Predef$.require(Predef.scala:233)
at 
org.apache.spark.sql.execution.stat.StatFunctions$$anonfun$collectStatisticalData$3.apply(StatFunctions.scala:81)


2. The biggest issue here is not the message shown, but the design.
A class called CovarianceCounter does the computations both for correlation and 
covariance. This might be a convenient way
from certain perspective, however something like this is harder to understand 
and extend, especially if you want to add another algorithm
e.g. Spearman correlation, or something else.

There are many possible solutions here:
starting from
1. just fixing the message 
2. fixing the message and renaming  CovarianceCounter and corresponding methods
3. create CorrelationCounter and splitting the computations for correlation and 
covariance

and many more  

Since I'm not getting any response and according to github all five of you have 
been working on this, I'll try again:
[~brkyvz], [~rxin], [~davies], [~viirya], [~cloud_fan]

Can any of you ,please, explain me such a behavior or communicate more about 
this ?
In case you are planning to remove it or something else, we'd truly appreciate 
if you communicate.

In fact, I would like to do a pull request on this, but since my pull requests 
in SQL/ML components are just staying there without any response, I'll wait for 
your response first.

cc: [~shivaram], [~mengxr]

Thank you,
Narine



> Inappropriate error messages in DataFrame StatFunctions 
> 
>
> Key: SPARK-12325
> URL: https://issues.apache.org/jira/browse/SPARK-12325
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Narine Kokhlikyan
>Priority: Critical
>
> Hi there,
> I have mentioned this issue earlier in one of my pull requests for the SQL 
> component, but I've never received feedback on any of them.
> https://github.com/apache/spark/pull/9366#issuecomment-155171975
> Although this has been very frustrating, I'll try to list the facts again:
> 1. I call 

[jira] [Assigned] (SPARK-12302) Example for servlet filter used by spark.ui.filters

2015-12-14 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12302:


Assignee: Apache Spark

> Example for servlet filter used by spark.ui.filters
> ---
>
> Key: SPARK-12302
> URL: https://issues.apache.org/jira/browse/SPARK-12302
> Project: Spark
>  Issue Type: Improvement
>  Components: Examples
>Affects Versions: 1.5.2
>Reporter: Kai Sasaki
>Assignee: Apache Spark
>Priority: Trivial
>  Labels: examples, security
>
> Although the {{spark.ui.filters}} configuration uses a simple servlet filter, it 
> is often difficult to understand how to write the filter code and how to 
> integrate it with actual Spark applications. 
> It would be helpful to have examples for trying out a secured Spark cluster.
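
For context, a minimal sketch of what such an example filter might look like 
(a hypothetical class, not shipped with Spark; it assumes the standard 
javax.servlet API and that init parameters are supplied through the filter's 
params configuration):

{code}
// Hypothetical example filter: rejects Spark UI requests that do not carry a
// shared-secret header. The class would be placed on the driver classpath and
// referenced from the spark.ui.filters configuration.
import javax.servlet._
import javax.servlet.http.{HttpServletRequest, HttpServletResponse}

class SharedSecretFilter extends Filter {
  private var secret: String = _

  override def init(conf: FilterConfig): Unit = {
    secret = conf.getInitParameter("secret")   // supplied as a filter init parameter
  }

  override def doFilter(req: ServletRequest, res: ServletResponse, chain: FilterChain): Unit = {
    val httpReq = req.asInstanceOf[HttpServletRequest]
    if (secret != null && secret == httpReq.getHeader("X-Ui-Secret")) {
      chain.doFilter(req, res)                 // authorized: continue to the UI
    } else {
      res.asInstanceOf[HttpServletResponse].sendError(HttpServletResponse.SC_FORBIDDEN)
    }
  }

  override def destroy(): Unit = {}
}
{code}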



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-12302) Example for servlet filter used by spark.ui.filters

2015-12-14 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12302:


Assignee: (was: Apache Spark)

> Example for servlet filter used by spark.ui.filters
> ---
>
> Key: SPARK-12302
> URL: https://issues.apache.org/jira/browse/SPARK-12302
> Project: Spark
>  Issue Type: Improvement
>  Components: Examples
>Affects Versions: 1.5.2
>Reporter: Kai Sasaki
>Priority: Trivial
>  Labels: examples, security
>
> Although the {{spark.ui.filters}} configuration uses a simple servlet filter, it 
> is often difficult to understand how to write the filter code and how to 
> integrate it with actual Spark applications. 
> It would be helpful to have examples for trying out a secured Spark cluster.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12270) JDBC Where clause comparison doesn't work for DB2 char(n)

2015-12-14 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15056870#comment-15056870
 ] 

Apache Spark commented on SPARK-12270:
--

User 'huaxingao' has created a pull request for this issue:
https://github.com/apache/spark/pull/10262

> JDBC Where clause comparison doesn't work for DB2 char(n) 
> --
>
> Key: SPARK-12270
> URL: https://issues.apache.org/jira/browse/SPARK-12270
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Huaxin Gao
>Priority: Minor
>
> I am doing some Spark jdbc test against DB2. My test is like this: 
> {code}
>  conn.prepareStatement(
> "create table people (name char(32)").executeUpdate()
>  conn.prepareStatement("insert into people values 
> ('fred')").executeUpdate()
>  sql(
>s"""
>   |CREATE TEMPORARY TABLE foobar
>   |USING org.apache.spark.sql.jdbc
>   |OPTIONS (url '$url', dbtable 'PEOPLE', user 'testuser', password 
> 'testpassword')
>   """.stripMargin.replaceAll("\n", " "))
>  val df = sqlContext.sql("SELECT * FROM foobar WHERE NAME = 'fred'")
> {code}
> I expect to see one row containing 'fred' in df. However, no row is 
> returned. If I change the data type to varchar(32) in the create table DDL, 
> then I get the row back correctly. The cause of the problem is that DB2 
> defines char(num) as a fixed-length character string, so with char(32), 
> "SELECT * FROM foobar WHERE NAME = 'fred'" returns 'fred' padded with 28 
> trailing spaces. Spark does not treat 'fred' padded with spaces as equal to 
> 'fred', so df has no rows. With varchar(32), DB2 returns just 'fred' for the 
> select statement and df has the right row. To make DB2 char(num) work with 
> Spark, I suggest changing the Spark code to trim the trailing spaces after 
> reading the data from the database. 
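
Until that is addressed in Spark itself, one possible application-level 
workaround (a sketch, reusing the foobar temporary table from the test case 
above) is to strip the padding before comparing:

{code}
// Sketch of a workaround for DB2's CHAR(n) space padding: compare against the
// right-trimmed column instead of the raw fixed-width value.
import org.apache.spark.sql.functions.{col, rtrim}

val matched = sqlContext.table("foobar").where(rtrim(col("NAME")) === "fred")
matched.show()
// Roughly equivalent SQL: SELECT * FROM foobar WHERE rtrim(NAME) = 'fred'
{code}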



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-12325) Inappropriate error messages in DataFrame StatFunctions

2015-12-14 Thread Narine Kokhlikyan (JIRA)
Narine Kokhlikyan created SPARK-12325:
-

 Summary: Inappropriate error messages in DataFrame StatFunctions 
 Key: SPARK-12325
 URL: https://issues.apache.org/jira/browse/SPARK-12325
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Narine Kokhlikyan
Priority: Critical


Hi there,

I have mentioned this issue earlier in one of my pull requests for the SQL 
component, but I've never received feedback on any of them.
https://github.com/apache/spark/pull/9366#issuecomment-155171975

Although this has been very frustrating, I'll try to list the facts again:

1. I call the DataFrame correlation method and the error message complains 
about covariance. I do not think that this is an appropriate message to show here.

scala> df.stat.corr("rating", "income")
java.lang.IllegalArgumentException: requirement failed: Covariance calculation 
for columns with dataType StringType not supported.
at scala.Predef$.require(Predef.scala:233)
at 
org.apache.spark.sql.execution.stat.StatFunctions$$anonfun$collectStatisticalData$3.apply(StatFunctions.scala:81)


2. The bigger issue here is not the message itself, but the design.
A class called CovarianceCounter does the computations for both correlation and 
covariance. This may be convenient from a certain perspective, but it is harder 
to understand and extend, especially if you want to add another algorithm, 
e.g. Spearman correlation.

There are several possible solutions here, starting from:
1. just fixing the message
2. fixing the message and renaming CovarianceCounter and the corresponding methods
3. creating a CorrelationCounter and splitting the computations for correlation 
and covariance

and many more.

Since I'm not getting any response and, according to GitHub, all five of you 
have been working on this, I'll try again:
[~brkyvz], [~rxin], [~davies], [~viirya], [~cloud_fan]

Can any of you please explain this behavior, or communicate more about it?
In case you are planning to remove it or change it, we'd truly appreciate it if 
you would let us know.

In fact, I would like to open a pull request for this, but since my pull 
requests in the SQL/ML components are sitting there without any response, I'll 
wait for your response first.

cc: [~shivaram], [~mengxr]

Thank you,
Narine
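
As a side note on the design point above: Pearson correlation can be derived 
from the same accumulated moments as covariance, which is presumably why a 
single counter is reused; a minimal sketch with hypothetical field names (this 
is not Spark's actual CovarianceCounter):

{code}
// Hypothetical sketch: one pass over (x, y) pairs can accumulate the sums of
// squared deviations (m2X, m2Y) and co-deviations (cXY); both statistics then
// follow from the same state, so mainly the error message needs to differ.
case class MomentSummary(n: Long, m2X: Double, m2Y: Double, cXY: Double) {
  def covariance: Double = cXY / (n - 1)
  def correlation: Double = cXY / math.sqrt(m2X * m2Y)  // covariance rescaled by the std devs
}
{code}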




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12325) Inappropriate error messages in DataFrame StatFunctions

2015-12-14 Thread Narine Kokhlikyan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12325?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Narine Kokhlikyan updated SPARK-12325:
--
Description: 
Hi there,

I have mentioned this issue earlier in one of my pull requests for the SQL 
component, but I've never received feedback on any of them.
https://github.com/apache/spark/pull/9366#issuecomment-155171975

Although this has been very frustrating, I'll try to list the facts again:

1. I call the DataFrame correlation method and the error message complains 
about covariance. I do not think that this is an appropriate message to show here.

scala> df.stat.corr("rating", "income")
java.lang.IllegalArgumentException: requirement failed: Covariance calculation 
for columns with dataType StringType not supported.
at scala.Predef$.require(Predef.scala:233)
at 
org.apache.spark.sql.execution.stat.StatFunctions$$anonfun$collectStatisticalData$3.apply(StatFunctions.scala:81)


2. The bigger issue here is not the message itself, but the design.
A class called CovarianceCounter does the computations for both correlation and 
covariance. This may be convenient from a certain perspective, but it is harder 
to understand and extend, especially if you want to add another algorithm, 
e.g. Spearman correlation.

There are several possible solutions here, starting from:
1. just fixing the message
2. fixing the message and renaming CovarianceCounter and the corresponding methods
3. creating a CorrelationCounter and splitting the computations for correlation 
and covariance

and many more.

Since I'm not getting any response and, according to GitHub, all five of you 
have been working on this, I'll try again:
[~brkyvz], [~rxin], [~davies], [~viirya], [~cloud_fan]

Can any of you please explain this behavior, or communicate more about it?
In case you are planning to remove it or change it, we'd truly appreciate it if 
you would let us know.

In fact, I would like to open a pull request for this, but since my pull 
requests in the SQL/ML components are sitting there without any response, I'll 
wait for your response first.

cc: [~shivaram], [~mengxr]

Thank you,
Narine


  was:
Hi there,

I have mentioned this issue earlier in one of my pull requests for SQL 
component, but I've never received a feedback in any of them.
https://github.com/apache/spark/pull/9366#issuecomment-155171975

Although this has been very frustrating, I'll try to list certain facts again:

1. I call dataframe correlation method and it says that covariance is wrong.
I do not think that this is an appropriate message to show here.

scala> df.stat.corr("rating", "income")
java.lang.IllegalArgumentException: requirement failed: Covariance calculation 
for columns with dataType StringType not supported.
at scala.Predef$.require(Predef.scala:233)
at 
org.apache.spark.sql.execution.stat.StatFunctions$$anonfun$collectStatisticalData$3.apply(StatFunctions.scala:81)


2. The biggest issue here is not the message shown, but the design.
A class called CovarianceCounter does the computations both for correlation and 
covariance. This might be a convenient way
from certain perspective, however something like this is harder to understand 
and extend, especially if you want to add another algorithm
e.g. Spearman correlation, or something else.

There are many possible solutions here:
starting from
1. just fixing the message 
2. fixing the message and renaming  CovarianceCounter and corresponding methods
3. create CorrelationCounter and splitting the computations for correlation and 
covariance

and many more  

Since I'm not getting any response and according to github all five of you have 
been working on this, I'll try again:
[~brkyvz], [~rxin], [~davies], [~viirya], [~cloud_fan]

Can any of you ,please, explain me such a behavior or communicate more about 
this.
In case you are planning to remove it or something else, we'd truly appreciate 
if you communicate.

In fact, I would like to do a pull request on this, but since my pull requests 
in SQL/ML components are just staying there without any response, I'll wait for 
your response first.

cc: [~shivaram], [~mengxr]

Thank you,
Narine



> Inappropriate error messages in DataFrame StatFunctions 
> 
>
> Key: SPARK-12325
> URL: https://issues.apache.org/jira/browse/SPARK-12325
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Narine Kokhlikyan
>Priority: Critical
>
> Hi there,
> I have mentioned this issue earlier in one of my pull requests for the SQL 
> component, but I've never received feedback on any of them.
> https://github.com/apache/spark/pull/9366#issuecomment-155171975
> Although this has been very frustrating, I'll try to list the facts again:
> 1. I call dataframe correlation 

[jira] [Created] (SPARK-12326) Move GBT implementation from spark.mllib to spark.ml

2015-12-14 Thread Seth Hendrickson (JIRA)
Seth Hendrickson created SPARK-12326:


 Summary: Move GBT implementation from spark.mllib to spark.ml
 Key: SPARK-12326
 URL: https://issues.apache.org/jira/browse/SPARK-12326
 Project: Spark
  Issue Type: Improvement
  Components: ML, MLlib
Reporter: Seth Hendrickson


Several improvements can be made to gradient boosted trees, but are not 
possible without moving the GBT implementation to spark.ml (e.g. rawPrediction 
column, feature importance). This Jira is for moving the current GBT 
implementation to spark.ml, which will have roughly the following steps:

1. Copy the implementation to spark.ml and change spark.ml classes to use that 
implementation. Current tests will ensure that the implementations learn 
exactly the same models. 
2. Move the decision tree helper classes over to spark.ml (e.g. Impurity, 
InformationGainStats, ImpurityStats, DTStatsAggregator, etc...). Since 
eventually all tree implementations will reside in spark.ml, the helper classes 
should as well.
3. Remove the spark.mllib implementation, and make the spark.mllib APIs 
wrappers around the spark.ml implementation. The spark.ml tests will again 
ensure that we do not change any behavior.
4. Move the unit tests to spark.ml, and change the spark.mllib unit tests to 
verify model equivalence.

Steps 2, 3, and 4 should be in separate Jiras. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-12296) Feature parity for pyspark.mllib StandardScalerModel

2015-12-14 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12296?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12296:


Assignee: (was: Apache Spark)

> Feature parity for pyspark.mllib StandardScalerModel
> 
>
> Key: SPARK-12296
> URL: https://issues.apache.org/jira/browse/SPARK-12296
> Project: Spark
>  Issue Type: Sub-task
>  Components: MLlib, PySpark
>Reporter: Joseph K. Bradley
>Priority: Minor
>
> Some methods are missing, such as ways to access the std, mean, etc.  This 
> JIRA is for feature parity for pyspark.mllib.feature.StandardScaler & 
> StandardScalerModel
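
For reference, the Scala-side model already exposes these values; below is a 
sketch of the accessors a Python wrapper would presumably mirror (it assumes an 
existing SparkContext named sc):

{code}
// Sketch: the Scala mllib accessors that feature parity would surface in PySpark.
import org.apache.spark.mllib.feature.{StandardScaler, StandardScalerModel}
import org.apache.spark.mllib.linalg.Vectors

val data = sc.parallelize(Seq(Vectors.dense(1.0, 2.0), Vectors.dense(3.0, 6.0)))
val scalerModel: StandardScalerModel =
  new StandardScaler(withMean = true, withStd = true).fit(data)

scalerModel.std       // per-feature standard deviation (Vector)
scalerModel.mean      // per-feature mean (Vector)
scalerModel.withStd   // whether scaling to unit std is enabled
scalerModel.withMean  // whether mean centering is enabled
{code}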



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-12296) Feature parity for pyspark.mllib StandardScalerModel

2015-12-14 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12296?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12296:


Assignee: Apache Spark

> Feature parity for pyspark.mllib StandardScalerModel
> 
>
> Key: SPARK-12296
> URL: https://issues.apache.org/jira/browse/SPARK-12296
> Project: Spark
>  Issue Type: Sub-task
>  Components: MLlib, PySpark
>Reporter: Joseph K. Bradley
>Assignee: Apache Spark
>Priority: Minor
>
> Some methods are missing, such as ways to access the std, mean, etc.  This 
> JIRA is for feature parity for pyspark.mllib.feature.StandardScaler & 
> StandardScalerModel



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-12324) The documentation sidebar does not collapse properly

2015-12-14 Thread Timothy Hunter (JIRA)
Timothy Hunter created SPARK-12324:
--

 Summary: The documentation sidebar does not collapse properly
 Key: SPARK-12324
 URL: https://issues.apache.org/jira/browse/SPARK-12324
 Project: Spark
  Issue Type: Bug
  Components: Documentation, MLlib
Affects Versions: 1.5.2
Reporter: Timothy Hunter


When the browser's window is reduced horizontally, the sidebar slides under the 
main content and does not collapse:
 - hide the sidebar when the browser's width is not large enough
 - add a button to show and hide the sidebar



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12324) The documentation sidebar does not collapse properly

2015-12-14 Thread Timothy Hunter (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12324?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Timothy Hunter updated SPARK-12324:
---
Attachment: Screen Shot 2015-12-14 at 12.29.57 PM.png

> The documentation sidebar does not collapse properly
> 
>
> Key: SPARK-12324
> URL: https://issues.apache.org/jira/browse/SPARK-12324
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation, MLlib
>Affects Versions: 1.5.2
>Reporter: Timothy Hunter
> Attachments: Screen Shot 2015-12-14 at 12.29.57 PM.png
>
>
> When the browser's window is reduced horizontally, the sidebar slides under 
> the main content and does not collapse:
>  - hide the sidebar when the browser's width is not large enough
>  - add a button to show and hide the sidebar



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4816) Maven profile netlib-lgpl does not work

2015-12-14 Thread RJ Nowling (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15056675#comment-15056675
 ] 

RJ Nowling commented on SPARK-4816:
---

Agreed.  Thanks!

> Maven profile netlib-lgpl does not work
> ---
>
> Key: SPARK-4816
> URL: https://issues.apache.org/jira/browse/SPARK-4816
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 1.1.0
> Environment: maven 3.0.5 / Ubuntu
>Reporter: Guillaume Pitel
>Assignee: Sean Owen
>Priority: Minor
> Fix For: 1.1.1, 1.4.2
>
>
> When doing what the documentation recommends to recompile Spark with Netlib 
> Native system binding (i.e. to bind with openblas or, in my case, MKL), 
> mvn -Pnetlib-lgpl -Pyarn -Phadoop-2.3 -Dhadoop.version=2.3.0 -DskipTests 
> clean package
> The resulting assembly jar still lacked the netlib-system class. (I checked 
> the content of spark-assembly...jar)
> When forcing the netlib-lgpl profile in MLLib package to be active, the jar 
> is correctly built.
> So I guess it's a problem with the way Maven passes profile activations to 
> child modules.
> Also, despite the documentation claiming that if the job's jar contains 
> netlib with the necessary bindings, it should work, it does not. The 
> classloader must be unhappy with two occurrences of netlib?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12324) The documentation sidebar does not collapse properly

2015-12-14 Thread Timothy Hunter (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15056677#comment-15056677
 ] 

Timothy Hunter commented on SPARK-12324:


I am creating a PR with a fix.

cc [~josephkb]

> The documentation sidebar does not collapse properly
> 
>
> Key: SPARK-12324
> URL: https://issues.apache.org/jira/browse/SPARK-12324
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation, MLlib
>Affects Versions: 1.5.2
>Reporter: Timothy Hunter
> Attachments: Screen Shot 2015-12-14 at 12.29.57 PM.png
>
>
> When the browser's window is reduced horizontally, the sidebar slides under 
> the main content and does not collapse:
>  - hide the sidebar when the browser's width is not large enough
>  - add a button to show and hide the sidebar



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12324) The documentation sidebar does not collapse properly

2015-12-14 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12324?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-12324:
--
   Priority: Minor  (was: Major)
Component/s: (was: MLlib)

> The documentation sidebar does not collapse properly
> 
>
> Key: SPARK-12324
> URL: https://issues.apache.org/jira/browse/SPARK-12324
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation
>Affects Versions: 1.5.2
>Reporter: Timothy Hunter
>Priority: Minor
> Attachments: Screen Shot 2015-12-14 at 12.29.57 PM.png
>
>
> When the browser's window is reduced horizontally, the sidebar slides under 
> the main content and does not collapse:
>  - hide the sidebar when the browser's width is not large enough
>  - add a button to show and hide the sidebar



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-12324) The documentation sidebar does not collapse properly

2015-12-14 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12324?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12324:


Assignee: Apache Spark

> The documentation sidebar does not collapse properly
> 
>
> Key: SPARK-12324
> URL: https://issues.apache.org/jira/browse/SPARK-12324
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation
>Affects Versions: 1.5.2
>Reporter: Timothy Hunter
>Assignee: Apache Spark
>Priority: Minor
> Attachments: Screen Shot 2015-12-14 at 12.29.57 PM.png
>
>
> When the browser's window is reduced horizontally, the sidebar slides under 
> the main content and does not collapse:
>  - hide the sidebar when the browser's width is not large enough
>  - add a button to show and hide the sidebar



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12324) The documentation sidebar does not collapse properly

2015-12-14 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15056703#comment-15056703
 ] 

Apache Spark commented on SPARK-12324:
--

User 'thunterdb' has created a pull request for this issue:
https://github.com/apache/spark/pull/10297

> The documentation sidebar does not collapse properly
> 
>
> Key: SPARK-12324
> URL: https://issues.apache.org/jira/browse/SPARK-12324
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation
>Affects Versions: 1.5.2
>Reporter: Timothy Hunter
>Priority: Minor
> Attachments: Screen Shot 2015-12-14 at 12.29.57 PM.png
>
>
> When the browser's window is reduced horizontally, the sidebar slides under 
> the main content and does not collapse:
>  - hide the sidebar when the browser's width is not large enough
>  - add a button to show and hide the sidebar



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12326) Move GBT implementation from spark.mllib to spark.ml

2015-12-14 Thread Seth Hendrickson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12326?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15056715#comment-15056715
 ] 

Seth Hendrickson commented on SPARK-12326:
--

[~josephkb] Could you review the plan above? I couldn't find any other Jira for 
moving GBTs to ML, and it seems like it would be good to get this done so we can 
move on to some other improvements that are needed as well. Thanks!

> Move GBT implementation from spark.mllib to spark.ml
> 
>
> Key: SPARK-12326
> URL: https://issues.apache.org/jira/browse/SPARK-12326
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Reporter: Seth Hendrickson
>
> Several improvements can be made to gradient boosted trees, but are not 
> possible without moving the GBT implementation to spark.ml (e.g. 
> rawPrediction column, feature importance). This Jira is for moving the 
> current GBT implementation to spark.ml, which will have roughly the following 
> steps:
> 1. Copy the implementation to spark.ml and change spark.ml classes to use 
> that implementation. Current tests will ensure that the implementations learn 
> exactly the same models. 
> 2. Move the decision tree helper classes over to spark.ml (e.g. Impurity, 
> InformationGainStats, ImpurityStats, DTStatsAggregator, etc...). Since 
> eventually all tree implementations will reside in spark.ml, the helper 
> classes should as well.
> 3. Remove the spark.mllib implementation, and make the spark.mllib APIs 
> wrappers around the spark.ml implementation. The spark.ml tests will again 
> ensure that we do not change any behavior.
> 4. Move the unit tests to spark.ml, and change the spark.mllib unit tests to 
> verify model equivalence.
> Steps 2, 3, and 4 should be in separate Jiras. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-12275) No plan for BroadcastHint in some condition

2015-12-14 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12275?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or resolved SPARK-12275.
---
  Resolution: Fixed
   Fix Version/s: 1.5.3
Target Version/s: 1.5.3, 1.6.1, 2.0.0  (was: 1.5.3)

> No plan for BroadcastHint in some condition
> ---
>
> Key: SPARK-12275
> URL: https://issues.apache.org/jira/browse/SPARK-12275
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: yucai
>Assignee: yucai
>  Labels: backport-needed
> Fix For: 1.5.3, 1.6.1, 2.0.0
>
>
> *Summary*
> No plan for BroadcastHint is generated in some condition.
> *Test Case*
> {code}
> val df1 = Seq((1, "1"), (2, "2")).toDF("key", "value")
> val parquetTempFile =
>   "%s/SPARK-_%d.parquet".format(System.getProperty("java.io.tmpdir"), 
> scala.util.Random.nextInt)
> df1.write.parquet(parquetTempFile)
> val pf1 = sqlContext.read.parquet(parquetTempFile)
> #1. df1.join(broadcast(pf1)).count()
> #2. broadcast(pf1).count()
> {code}
> *Result*
> It will trigger assertion in QueryPlanner.scala, like below:
> {code}
> scala> df1.join(broadcast(pf1)).count()
> java.lang.AssertionError: assertion failed: No plan for BroadcastHint
> +- Relation[key#6,value#7] 
> ParquetRelation[hdfs://10.1.0.20:8020/tmp/SPARK-_1817830406.parquet]
>   at scala.Predef$.assert(Predef.scala:179)
>   at 
> org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:59)
>   at 
> org.apache.spark.sql.catalyst.planning.QueryPlanner.planLater(QueryPlanner.scala:54)
>   at 
> org.apache.spark.sql.execution.SparkStrategies$BasicOperators$.apply(SparkStrategies.scala:336)
>   at 
> org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
>   at 
> org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
>   at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
>   at 
> org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:59)
>   at 
> org.apache.spark.sql.catalyst.planning.QueryPlanner.planLater(QueryPlanner.scala:54)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12296) Feature parity for pyspark.mllib StandardScalerModel

2015-12-14 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12296?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15056723#comment-15056723
 ] 

Apache Spark commented on SPARK-12296:
--

User 'holdenk' has created a pull request for this issue:
https://github.com/apache/spark/pull/10298

> Feature parity for pyspark.mllib StandardScalerModel
> 
>
> Key: SPARK-12296
> URL: https://issues.apache.org/jira/browse/SPARK-12296
> Project: Spark
>  Issue Type: Sub-task
>  Components: MLlib, PySpark
>Reporter: Joseph K. Bradley
>Priority: Minor
>
> Some methods are missing, such as ways to access the std, mean, etc.  This 
> JIRA is for feature parity for pyspark.mllib.feature.StandardScaler & 
> StandardScalerModel



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-12318) Save mode in SparkR should be error by default

2015-12-14 Thread Jeff Zhang (JIRA)
Jeff Zhang created SPARK-12318:
--

 Summary: Save mode in SparkR should be error by default
 Key: SPARK-12318
 URL: https://issues.apache.org/jira/browse/SPARK-12318
 Project: Spark
  Issue Type: Bug
  Components: SparkR
Affects Versions: 1.5.2
Reporter: Jeff Zhang
Priority: Minor


The save mode in SparkR should be consistent with that of the Scala API.
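
For reference, a sketch of the Scala-side default being referred to (the 
DataFrameWriter falls back to SaveMode.ErrorIfExists when no mode is given; df 
is an assumed existing DataFrame and the output path is illustrative):

{code}
// Sketch: the Scala API's default save mode, which SparkR should match.
import org.apache.spark.sql.SaveMode

df.write.parquet("/tmp/out")                               // default: ErrorIfExists
df.write.mode(SaveMode.ErrorIfExists).parquet("/tmp/out")  // same behavior, explicit
df.write.mode(SaveMode.Overwrite).parquet("/tmp/out")      // caller opts in to overwriting
{code}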



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12318) Save mode in SparkR should be error by default

2015-12-14 Thread Jeff Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15055625#comment-15055625
 ] 

Jeff Zhang commented on SPARK-12318:


Working on it. 

> Save mode in SparkR should be error by default
> --
>
> Key: SPARK-12318
> URL: https://issues.apache.org/jira/browse/SPARK-12318
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 1.5.2
>Reporter: Jeff Zhang
>Priority: Minor
>
> The save mode in SparkR should be consistent with that of the Scala API.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9578) Stemmer feature transformer

2015-12-14 Thread yuhao yang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15055626#comment-15055626
 ] 

yuhao yang commented on SPARK-9578:
---

PR was sent two days ago. I'm not sure why it's not linked here...

https://github.com/apache/spark/pull/10272

> Stemmer feature transformer
> ---
>
> Key: SPARK-9578
> URL: https://issues.apache.org/jira/browse/SPARK-9578
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Joseph K. Bradley
>Priority: Minor
>
> Transformer mentioned first in [SPARK-5571] based on suggestion from 
> [~aloknsingh].  Very standard NLP preprocessing task.
> From [~aloknsingh]:
> {quote}
> We have one Scala stemmer in scalanlp%chalk 
> https://github.com/scalanlp/chalk/tree/master/src/main/scala/chalk/text/analyze
>   which can easily be copied (as it is an Apache-licensed project) and is in Scala too.
> I think this will be a better alternative than the Lucene EnglishAnalyzer or 
> OpenNLP.
> Note: we already use scalanlp%breeze via the Maven dependency, so I think 
> adding a scalanlp%chalk dependency is also an option. But as you said, we 
> can copy the code as it is small.
> {quote}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10347) Investigate the usage of normalizePath()

2015-12-14 Thread Sun Rui (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10347?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15055629#comment-15055629
 ] 

Sun Rui commented on SPARK-10347:
-

A possible solution would be to provide a utility function whose pseudo code is 
roughly as follows:
{code}
  if (path does not contain a scheme &&
      the default Hadoop file system scheme is local) {
    normalizePath(path, mustWork = TRUE)
  } else {
    path
  }
{code}

The code to get the default Hadoop file system scheme:
{code}
hadoopConf <- callJMethod(sc, "hadoopConfiguration")
defaultScheme <- callJMethod(hadoopConf, "get", "fs.default.name")
{code}

> Investigate the usage of normalizePath()
> 
>
> Key: SPARK-10347
> URL: https://issues.apache.org/jira/browse/SPARK-10347
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Reporter: Sun Rui
>Priority: Minor
>
> Currently normalizePath() is used in several places allowing users to specify 
> paths via the use of tilde expansion, or to normalize a relative path to an 
> absolute path. However, normalizePath() is used for paths which are actually 
> expected to be a URI. normalizePath() may display warning messages when it 
> does not recognize a URI as a local file path. So suppressWarnings() is used 
> to suppress the possible warnings.
> Worse than warnings, calling normalizePath() on a URI may cause errors, 
> because it may turn a user-specified relative path into an absolute path using 
> the local current directory; this may be wrong because the path is actually 
> relative to the working directory of the default file system instead of the 
> local file system (depending on the Hadoop configuration of Spark).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2356) Exception: Could not locate executable null\bin\winutils.exe in the Hadoop

2015-12-14 Thread Michael Han (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15055633#comment-15055633
 ] 

Michael Han commented on SPARK-2356:


Hello everyone,

I encountered this issue again today when I tried to create a cluster using two 
Windows 7 (64-bit) desktops.
The error happens when I register the second worker with the master using the 
following command:
spark-class org.apache.spark.deploy.worker.Worker spark://masternode:7077

Strangely, it works fine when I register the first worker with the master.
Does anyone know a workaround for this issue?
The workaround above works fine when I use local mode.

> Exception: Could not locate executable null\bin\winutils.exe in the Hadoop 
> ---
>
> Key: SPARK-2356
> URL: https://issues.apache.org/jira/browse/SPARK-2356
> Project: Spark
>  Issue Type: Bug
>  Components: Windows
>Affects Versions: 1.0.0
>Reporter: Kostiantyn Kudriavtsev
>Priority: Critical
>
> I'm trying to run some transformations on Spark. It works fine on a cluster 
> (YARN, Linux machines). However, when I try to run it on a local machine 
> (Windows 7) under a unit test, I get errors (I don't use Hadoop; I read files 
> from the local filesystem):
> {code}
> 14/07/02 19:59:31 WARN NativeCodeLoader: Unable to load native-hadoop library 
> for your platform... using builtin-java classes where applicable
> 14/07/02 19:59:31 ERROR Shell: Failed to locate the winutils binary in the 
> hadoop binary path
> java.io.IOException: Could not locate executable null\bin\winutils.exe in the 
> Hadoop binaries.
>   at org.apache.hadoop.util.Shell.getQualifiedBinPath(Shell.java:318)
>   at org.apache.hadoop.util.Shell.getWinUtilsPath(Shell.java:333)
>   at org.apache.hadoop.util.Shell.(Shell.java:326)
>   at org.apache.hadoop.util.StringUtils.(StringUtils.java:76)
>   at org.apache.hadoop.security.Groups.parseStaticMapping(Groups.java:93)
>   at org.apache.hadoop.security.Groups.(Groups.java:77)
>   at 
> org.apache.hadoop.security.Groups.getUserToGroupsMappingService(Groups.java:240)
>   at 
> org.apache.hadoop.security.UserGroupInformation.initialize(UserGroupInformation.java:255)
>   at 
> org.apache.hadoop.security.UserGroupInformation.setConfiguration(UserGroupInformation.java:283)
>   at 
> org.apache.spark.deploy.SparkHadoopUtil.(SparkHadoopUtil.scala:36)
>   at 
> org.apache.spark.deploy.SparkHadoopUtil$.(SparkHadoopUtil.scala:109)
>   at 
> org.apache.spark.deploy.SparkHadoopUtil$.(SparkHadoopUtil.scala)
>   at org.apache.spark.SparkContext.(SparkContext.scala:228)
>   at org.apache.spark.SparkContext.(SparkContext.scala:97)
> {code}
> This happens because the Hadoop config is initialized each time a Spark 
> context is created, regardless of whether Hadoop is required or not.
> I propose adding a special flag to indicate whether the Hadoop config is 
> required (or starting this configuration manually).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12325) Inappropriate error messages in DataFrame StatFunctions

2015-12-14 Thread Narine Kokhlikyan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12325?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Narine Kokhlikyan updated SPARK-12325:
--
Affects Version/s: 1.5.2

> Inappropriate error messages in DataFrame StatFunctions 
> 
>
> Key: SPARK-12325
> URL: https://issues.apache.org/jira/browse/SPARK-12325
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.2
>Reporter: Narine Kokhlikyan
>Priority: Critical
>
> Hi there,
> I have mentioned this issue earlier in one of my pull requests for the SQL 
> component, but I've never received feedback on any of them.
> https://github.com/apache/spark/pull/9366#issuecomment-155171975
> Although this has been very frustrating, I'll try to list the facts again:
> 1. I call the DataFrame correlation method and the error message complains 
> about covariance. I do not think that this is an appropriate message to show here.
> scala> df.stat.corr("rating", "income")
> java.lang.IllegalArgumentException: requirement failed: Covariance 
> calculation for columns with dataType StringType not supported.
> at scala.Predef$.require(Predef.scala:233)
> at 
> org.apache.spark.sql.execution.stat.StatFunctions$$anonfun$collectStatisticalData$3.apply(StatFunctions.scala:81)
> 2. The bigger issue here is not the message itself, but the design.
> A class called CovarianceCounter does the computations for both correlation 
> and covariance. This may be convenient from a certain perspective, but it is 
> harder to understand and extend, especially if you want to add another 
> algorithm, e.g. Spearman correlation.
> There are several possible solutions here, starting from:
> 1. just fixing the message
> 2. fixing the message and renaming CovarianceCounter and the corresponding 
> methods
> 3. creating a CorrelationCounter and splitting the computations for correlation 
> and covariance
> and many more.
> Since I'm not getting any response and, according to GitHub, all five of you 
> have been working on this, I'll try again:
> [~brkyvz], [~rxin], [~davies], [~viirya], [~cloud_fan]
> Can any of you please explain this behavior of the stat functions, or 
> communicate more about it?
> In case you are planning to remove it or change it, we'd truly appreciate it 
> if you would let us know.
> In fact, I would like to open a pull request for this, but since my pull 
> requests in the SQL/ML components are sitting there without any response, 
> I'll wait for your response first.
> cc: [~shivaram], [~mengxr]
> Thank you,
> Narine



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12317) Support configurate value with unit(e.g. kb/mb/gb) in SQL

2015-12-14 Thread kevin yu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12317?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15056948#comment-15056948
 ] 

kevin yu commented on SPARK-12317:
--

I talked with Bo, I will work on this PR. Thanks.

Kevin

> Support configurate value with unit(e.g. kb/mb/gb) in SQL
> -
>
> Key: SPARK-12317
> URL: https://issues.apache.org/jira/browse/SPARK-12317
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.5.2
>Reporter: Yadong Qi
>Priority: Minor
>
> e.g. `spark.sql.autoBroadcastJoinThreshold` should be configurable as `10MB` 
> instead of `10485760`, because `10MB` is much easier to read than `10485760`.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-12327) lint-r checks fail with commented code

2015-12-14 Thread Shivaram Venkataraman (JIRA)
Shivaram Venkataraman created SPARK-12327:
-

 Summary: lint-r checks fail with commented code
 Key: SPARK-12327
 URL: https://issues.apache.org/jira/browse/SPARK-12327
 Project: Spark
  Issue Type: Bug
  Components: SparkR
Reporter: Shivaram Venkataraman


We get this after our R version downgrade

{code}
R/RDD.R:183:68: style: Commented code should be removed.
rdd@env$jrdd_val <- callJMethod(rddRef, "asJavaRDD") # 
rddRef$asJavaRDD()
   
^~
R/RDD.R:228:63: style: Commented code should be removed.
#' http://spark.apache.org/docs/latest/programming-guide.html#rdd-persistence.
  ^~~~
R/RDD.R:388:24: style: Commented code should be removed.
#' collectAsMap(rdd) # list(`1` = 2, `3` = 4)
   ^~
R/RDD.R:603:61: style: Commented code should be removed.
#' unlist(collect(filterRDD(rdd, function (x) { x < 3 }))) # c(1, 2)
^~~~
R/RDD.R:762:20: style: Commented code should be removed.
#' take(rdd, 2L) # list(1, 2)
   ^~
R/RDD.R:830:42: style: Commented code should be removed.
#' sort(unlist(collect(distinct(rdd # c(1, 2, 3)
 ^~~
R/RDD.R:980:47: style: Commented code should be removed.
#' collect(keyBy(rdd, function(x) { x*x })) # list(list(1, 1), list(4, 2), 
list(9, 3))
  
^~~~
R/RDD.R:1194:27: style: Commented code should be removed.
#' takeOrdered(rdd, 6L) # list(1, 2, 3, 4, 5, 6)
  ^~
R/RDD.R:1215:19: style: Commented code should be removed.
#' top(rdd, 6L) # list(10, 9, 7, 6, 5, 4)
  ^~~
R/RDD.R:1270:50: style: Commented code should be removed.
#' aggregateRDD(rdd, zeroValue, seqOp, combOp) # list(10, 4)
 ^~~
R/RDD.R:1374:6: style: Commented code should be removed.
#' # list(list("a", 0), list("b", 3), list("c", 1), list("d", 4), list("e", 2))
 ^~
R/RDD.R:1415:6: style: Commented code should be removed.
#' # list(list("a", 0), list("b", 1), list("c", 2), list("d", 3), list("e", 4))
 ^~
R/RDD.R:1461:6: style: Commented code should be removed.
#' # list(list(1, 2), list(3, 4))
 ^~~~
R/RDD.R:1527:6: style: Commented code should be removed.
#' # list(list(0, 1000), list(1, 1001), list(2, 1002), list(3, 1003), list(4, 
1004))
 
^~~
R/RDD.R:1564:6: style: Commented code should be removed.
#' # list(list(1, 1), list(1, 2), list(2, 1), list(2, 2))
 ^~~~
R/RDD.R:1595:6: style: Commented code should be removed.
#' # list(1, 1, 3)
 ^
R/RDD.R:1627:6: style: Commented code should be removed.
#' # list(1, 2, 3)
 ^
R/RDD.R:1663:6: style: Commented code should be removed.
#' # list(list(1, c(1,2), c(1,2,3)), list(2, c(3,4), c(4,5,6)))
 ^~
R/deserialize.R:22:3: style: Commented code should be removed.
# void -> NULL
  ^~~~
R/deserialize.R:23:3: style: Commented code should be removed.
# Int -> integer
  ^~
R/deserialize.R:24:3: style: Commented code should be removed.
# String -> character
  ^~~
R/deserialize.R:25:3: style: Commented code should be removed.
# Boolean -> logical
  ^~
R/deserialize.R:26:3: style: Commented code should be removed.
# Float -> double
  ^~~
R/deserialize.R:27:3: style: Commented code should be removed.
# Double -> double
  ^~~~
R/deserialize.R:28:3: style: Commented code should be removed.
# Long -> double
  ^~
R/deserialize.R:29:3: style: Commented code should be removed.
# Array[Byte] -> raw
  ^~
R/deserialize.R:30:3: style: Commented code should be removed.
# Date -> Date
  ^~~~
R/deserialize.R:31:3: style: Commented code should be removed.
# Time -> POSIXct
  ^~~
R/deserialize.R:33:3: style: Commented code should be removed.
# Array[T] -> list()
  ^~
R/deserialize.R:34:3: style: Commented code should be removed.
# Object -> jobj
  ^~
R/pairRDD.R:37:21: style: Commented code should be removed.
#' lookup(rdd, 1) # list(1, 3)
^~
R/pairRDD.R:83:25: style: Commented code should be removed.
#' collect(keys(rdd)) # list(1, 3)

[jira] [Updated] (SPARK-12232) Create new R API for read.table to avoid conflict

2015-12-14 Thread Felix Cheung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12232?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung updated SPARK-12232:
-
Summary: Create new R API for read.table to avoid conflict  (was: Consider 
exporting read.table in R)

> Create new R API for read.table to avoid conflict
> -
>
> Key: SPARK-12232
> URL: https://issues.apache.org/jira/browse/SPARK-12232
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 1.5.2
>Reporter: Felix Cheung
>Priority: Minor
>
> Since we have read.df, read.json, and read.parquet (some in pending PRs), as 
> well as table(), we should consider having read.table() for consistency and 
> R-likeness.
> However, this conflicts with utils::read.table, which returns an R data.frame.
> It seems neither table() nor read.table() is desirable in this case.
> table: https://stat.ethz.ch/R-manual/R-devel/library/base/html/table.html
> read.table: 
> https://stat.ethz.ch/R-manual/R-devel/library/utils/html/read.table.html



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


