[jira] [Commented] (SPARK-5669) Spark assembly includes incompatibly licensed libgfortran, libgcc code via JBLAS
[ https://issues.apache.org/jira/browse/SPARK-5669?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14327133#comment-14327133 ]

Sean Owen commented on SPARK-5669:
----------------------------------
[~mengxr] That just applies to GCC, right? It still wouldn't change the LGPL license of libgfortran. I also don't know whether Spark qualifies, given the definition of "Eligible Compilation Process". My understanding is that, without this exception, anything compiled by GCC would be copyleft, and the exception prevents that. I don't know that it generally allows redistribution of libgcc.

Spark assembly includes incompatibly licensed libgfortran, libgcc code via JBLAS
---------------------------------------------------------------------------------
Key: SPARK-5669
URL: https://issues.apache.org/jira/browse/SPARK-5669
Project: Spark
Issue Type: Bug
Components: Build
Reporter: Sean Owen
Assignee: Sean Owen
Priority: Blocker
Fix For: 1.3.0

Sorry for Blocker, but it's a license issue. The Spark assembly includes the following, from JBLAS:
{code}
lib/
lib/static/
lib/static/Mac OS X/
lib/static/Mac OS X/x86_64/
lib/static/Mac OS X/x86_64/libjblas_arch_flavor.jnilib
lib/static/Mac OS X/x86_64/sse3/
lib/static/Mac OS X/x86_64/sse3/libjblas.jnilib
lib/static/Windows/
lib/static/Windows/x86/
lib/static/Windows/x86/libgfortran-3.dll
lib/static/Windows/x86/libgcc_s_dw2-1.dll
lib/static/Windows/x86/jblas_arch_flavor.dll
lib/static/Windows/x86/sse3/
lib/static/Windows/x86/sse3/jblas.dll
lib/static/Windows/amd64/
lib/static/Windows/amd64/libgfortran-3.dll
lib/static/Windows/amd64/jblas.dll
lib/static/Windows/amd64/libgcc_s_sjlj-1.dll
lib/static/Windows/amd64/jblas_arch_flavor.dll
lib/static/Linux/
lib/static/Linux/i386/
lib/static/Linux/i386/sse3/
lib/static/Linux/i386/sse3/libjblas.so
lib/static/Linux/i386/libjblas_arch_flavor.so
lib/static/Linux/amd64/
lib/static/Linux/amd64/sse3/
lib/static/Linux/amd64/sse3/libjblas.so
lib/static/Linux/amd64/libjblas_arch_flavor.so
{code}
Unfortunately, the libgfortran and libgcc libraries included for Windows are not licensed in a way that's compatible with Spark and the AL2 -- they are LGPL at least. It's easy to exclude them. I'm not clear what that does to running on Windows; I assume it can still work, but the libs would have to be made available locally and put on the shared library path manually. I don't think there's a package manager, as on Linux, that would make them easily available. I'm not able to test on Windows. If it doesn't work, the follow-up question is whether that means JBLAS has to be removed on the double, or treated as a known issue for 1.3.0.
[jira] [Created] (SPARK-5910) DataFrame.selectExpr("col as newName") does not work
Yin Huai created SPARK-5910:
----------------------------
Summary: DataFrame.selectExpr("col as newName") does not work
Key: SPARK-5910
URL: https://issues.apache.org/jira/browse/SPARK-5910
Project: Spark
Issue Type: Bug
Components: SQL
Reporter: Yin Huai
Priority: Blocker

{code}
val rdd = sc.parallelize((1 to 10).map(i => s"""{"a":$i, "b":"str${i}"}"""))
sqlContext.jsonRDD(rdd).selectExpr("a as newName")
{code}
{code}
java.lang.RuntimeException: [1.3] failure: ``or'' expected but `as' found

a as newName
  ^
	at scala.sys.package$.error(package.scala:27)
	at org.apache.spark.sql.catalyst.SqlParser.parseExpression(SqlParser.scala:45)
{code}
For selectExpr, we need to use the projection parser instead of the expression parser (which cannot parse AS).
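Until that parser change lands, a workaround is to alias through the Column API, which bypasses the expression parser entirely; a minimal sketch, assuming the rdd and sqlContext from the report above:
{code}
// Workaround sketch: Column.as performs the rename without any string
// parsing, so the broken expression parser is never invoked.
val df = sqlContext.jsonRDD(rdd)
df.select(df("a").as("newName"))
{code}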
[jira] [Updated] (SPARK-5337) respect spark.task.cpus when launching executors
[ https://issues.apache.org/jira/browse/SPARK-5337?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrew Or updated SPARK-5337:
-----------------------------
Affects Version/s: 1.0.0

respect spark.task.cpus when launching executors
-------------------------------------------------
Key: SPARK-5337
URL: https://issues.apache.org/jira/browse/SPARK-5337
Project: Spark
Issue Type: Bug
Components: Spark Core
Affects Versions: 1.0.0
Reporter: Tao Wang

In standalone mode, we do not respect spark.task.cpus when launching executors, so some executors end up without enough cores to run even a single task.
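To make the failure mode concrete: each task claims spark.task.cpus cores, so an executor granted fewer cores than that can never schedule anything. A minimal, self-contained sketch of the missing guard (names are illustrative, not the actual Master code):
{code}
// Illustrative only: a worker allocation is useful only if the cores it
// grants can host at least one task of spark.task.cpus cores.
def canLaunchExecutor(coresToAssign: Int, taskCpus: Int): Boolean =
  coresToAssign >= taskCpus

// With spark.task.cpus=4, a 2-core grant would produce a useless executor.
assert(!canLaunchExecutor(coresToAssign = 2, taskCpus = 4))
assert(canLaunchExecutor(coresToAssign = 8, taskCpus = 4))
{code}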
[jira] [Closed] (SPARK-2628) Mesos backend throwing unable to find LoginModule
[ https://issues.apache.org/jira/browse/SPARK-2628?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Timothy Chen closed SPARK-2628.
-------------------------------
Resolution: Won't Fix

Mesos backend throwing unable to find LoginModule
-------------------------------------------------
Key: SPARK-2628
URL: https://issues.apache.org/jira/browse/SPARK-2628
Project: Spark
Issue Type: Bug
Components: Mesos
Reporter: Timothy Chen
Assignee: Tim Chen

http://mail-archives.apache.org/mod_mbox/incubator-spark-user/201406.mbox/%3c1401892590126-6927.p...@n3.nabble.com%3E

14/07/22 19:57:59 INFO HttpServer: Starting HTTP Server
14/07/22 19:57:59 ERROR Executor: Uncaught exception in thread Thread[Executor task launch worker-1,5,main]
java.lang.Error: java.io.IOException: failure to login
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1116)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
	at java.lang.Thread.run(Thread.java:636)
Caused by: java.io.IOException: failure to login
	at org.apache.hadoop.security.UserGroupInformation.getLoginUser(UserGroupInformation.java:490)
	at org.apache.hadoop.security.UserGroupInformation.getCurrentUser(UserGroupInformation.java:452)
	at org.apache.spark.deploy.SparkHadoopUtil.runAsUser(SparkHadoopUtil.scala:40)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:176)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
	... 2 more
Caused by: javax.security.auth.login.LoginException: unable to find LoginModule class: org/apache/hadoop/security/UserGroupInformation$HadoopLoginModule
	at javax.security.auth.login.LoginContext.invoke(LoginContext.java:823)
	at javax.security.auth.login.LoginContext.access$000(LoginContext.java:203)
	at javax.security.auth.login.LoginContext$5.run(LoginContext.java:721)
	at javax.security.auth.login.LoginContext$5.run(LoginContext.java:719)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.login.LoginContext.invokeCreatorPriv(LoginContext.java:718)
	at javax.security.auth.login.LoginContext.login(LoginContext.java:590)
	at org.apache.hadoop.security.UserGroupInformation.getLoginUser(UserGroupInformation.java:471)
	... 6 more
14/07/22 19:57:59 ERROR Executor: Uncaught exception in thread Thread[Executor task launch worker-0,5,main]
(an identical stack trace follows for worker-0)
[jira] [Commented] (SPARK-2628) Mesos backend throwing unable to find LoginModule
[ https://issues.apache.org/jira/browse/SPARK-2628?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14327978#comment-14327978 ]

Timothy Chen commented on SPARK-2628:
-------------------------------------
It seems this was fixed after 1.0.4, somewhere in 1.1. Users on versions older than 1.1 can still run into it. Will close this as Won't Fix.

Mesos backend throwing unable to find LoginModule
-------------------------------------------------
Key: SPARK-2628
URL: https://issues.apache.org/jira/browse/SPARK-2628
Project: Spark
Issue Type: Bug
Components: Mesos
Reporter: Timothy Chen
Assignee: Tim Chen

(See the issue description and stack trace quoted in full above.)
[jira] [Commented] (SPARK-5669) Spark assembly includes incompatibly licensed libgfortran, libgcc code via JBLAS
[ https://issues.apache.org/jira/browse/SPARK-5669?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14328010#comment-14328010 ]

Xiangrui Meng commented on SPARK-5669:
--------------------------------------
Yes, we are going to remove JBLAS anyway in 1.4. Having a simple dependency tree is always a good thing. The problem is how we should proceed for branch-1.0/1.1/1.2. If we are covered by this exemption, maybe the only thing we need to do is put up a notice.

This also applies to branch-1.3. We don't call native routines in Spark, but that doesn't mean users don't. With the current solution, users need to supply the original JBLAS jar at runtime to use native routines, and I haven't tested whether that works. So if we are covered by this exemption, the best thing to do might be to revert the patch and put up a notice.

Spark assembly includes incompatibly licensed libgfortran, libgcc code via JBLAS
---------------------------------------------------------------------------------
Key: SPARK-5669
URL: https://issues.apache.org/jira/browse/SPARK-5669
Project: Spark
Issue Type: Bug
Components: Build
Reporter: Sean Owen
Assignee: Sean Owen
Priority: Blocker
Fix For: 1.3.0

(See the full issue description and bundled-library listing quoted above.)
[jira] [Comment Edited] (SPARK-5669) Spark assembly includes incompatibly licensed libgfortran, libgcc code via JBLAS
[ https://issues.apache.org/jira/browse/SPARK-5669?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14328010#comment-14328010 ]

Xiangrui Meng edited comment on SPARK-5669 at 2/19/15 7:43 PM:
---------------------------------------------------------------
Yes, we are going to remove JBLAS anyway in 1.4. Having a simple dependency tree is always a good thing. The problem is how we should proceed for branch-1.0/1.1/1.2. If we are covered by this exemption, maybe the only thing we need to do is to put a notice.

This also applies to branch-1.3. We don't call native routines in Spark but it doesn't mean that users don't. With the current solution, users need to supply the original JBLAS jar at runtime to use native routines, and I haven't tested whether it works or not. So if we are covered by this exemption, the best thing to do might be to revert the patch and put a notice.

was (Author: mengxr):
Yes, we are going to remove JBLAS anyway in 1.4. Having a simple dependency tree is always a good thing. The problem is how we should proceed for branch-1.0/1.1/1.2. If we are covered by this exemption. Maybe the only thing we need to do is to put a notice.

This also applies to branch-1.3. We don't call native routines in Spark but it doesn't mean that users don't. With the current solution, users need to supply the origin JBLAS jar at runtime to use native routines, and I haven't tested whether it works or not. So if we are covered by this exemption, the best thing to do might be to revert the patch and put a notice.

Spark assembly includes incompatibly licensed libgfortran, libgcc code via JBLAS
---------------------------------------------------------------------------------
Key: SPARK-5669
URL: https://issues.apache.org/jira/browse/SPARK-5669
Project: Spark
Issue Type: Bug
Components: Build
Reporter: Sean Owen
Assignee: Sean Owen
Priority: Blocker
Fix For: 1.3.0

(See the full issue description and bundled-library listing quoted above.)
[jira] [Updated] (SPARK-5825) Failure stopping services while command line arguments are too long
[ https://issues.apache.org/jira/browse/SPARK-5825?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrew Or updated SPARK-5825:
-----------------------------
Affects Version/s: 1.0.0

Failure stopping services while command line arguments are too long
--------------------------------------------------------------------
Key: SPARK-5825
URL: https://issues.apache.org/jira/browse/SPARK-5825
Project: Spark
Issue Type: Bug
Components: Deploy
Affects Versions: 1.0.0
Reporter: Cheng Hao
Assignee: Cheng Hao
Priority: Blocker

Stopping a service in `spark-daemon.sh` confirms the process id by fuzzy-matching on the class name; however, this fails if the Java process's argument list is very long (greater than 4096 characters).
[jira] [Closed] (SPARK-5825) Failure stopping services while command line arguments are too long
[ https://issues.apache.org/jira/browse/SPARK-5825?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrew Or closed SPARK-5825.
----------------------------
Resolution: Fixed
Fix Version/s: 1.3.0, 1.2.2

Failure stopping services while command line arguments are too long
--------------------------------------------------------------------
Key: SPARK-5825
URL: https://issues.apache.org/jira/browse/SPARK-5825
Project: Spark
Issue Type: Bug
Components: Deploy
Affects Versions: 1.0.0
Reporter: Cheng Hao
Assignee: Cheng Hao
Priority: Blocker
Fix For: 1.3.0, 1.2.2

Stopping a service in `spark-daemon.sh` confirms the process id by fuzzy-matching on the class name; however, this fails if the Java process's argument list is very long (greater than 4096 characters).
[jira] [Commented] (SPARK-5669) Spark assembly includes incompatibly licensed libgfortran, libgcc code via JBLAS
[ https://issues.apache.org/jira/browse/SPARK-5669?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14328092#comment-14328092 ]

Sean Owen commented on SPARK-5669:
----------------------------------
I do find it confusing. I can see an argument that this is allowed: it *does* meet the exception, because the target work is created *without* GCC, and it is *not* a Category X license case as described in http://www.apache.org/legal/resolved.html#category-x -- even though that list calls out the special exception to the GPL licenses, the problem it identifies regarding derived works is *not* part of the exception terms. If that's true, I don't even see that a notice is required. On those grounds, you could put the binaries back into 1.3. (Yes, it's a moot point in 1.4.)

Your reasoning is that this would save users from having to bring their own JBLAS if they already use JBLAS. But they'll have to in 1.4 anyway, and we've always required programs to bring their own dependencies even when they're also used by Spark. I suppose I'd favor taking that hit earlier rather than later, since it happens anyway, and it lets us be a tiny bit more conservative about the licensing issue. But I do not feel strongly about it. Having said all that, would you rather proceed by just putting the libs back in 1.3?

Spark assembly includes incompatibly licensed libgfortran, libgcc code via JBLAS
---------------------------------------------------------------------------------
Key: SPARK-5669
URL: https://issues.apache.org/jira/browse/SPARK-5669
Project: Spark
Issue Type: Bug
Components: Build
Reporter: Sean Owen
Assignee: Sean Owen
Priority: Blocker
Fix For: 1.3.0

(See the full issue description and bundled-library listing quoted above.)
[jira] [Resolved] (SPARK-5902) PipelineStage.transformSchema should be public, not private
[ https://issues.apache.org/jira/browse/SPARK-5902?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xiangrui Meng resolved SPARK-5902.
----------------------------------
Resolution: Fixed
Fix Version/s: 1.3.0

Issue resolved by pull request 4682
[https://github.com/apache/spark/pull/4682]

PipelineStage.transformSchema should be public, not private
------------------------------------------------------------
Key: SPARK-5902
URL: https://issues.apache.org/jira/browse/SPARK-5902
Project: Spark
Issue Type: Improvement
Components: ML
Affects Versions: 1.3.0
Reporter: Joseph K. Bradley
Assignee: Joseph K. Bradley
Priority: Minor
Fix For: 1.3.0

For users to implement their own PipelineStages, we need to make PipelineStage.transformSchema be public instead of private to ml.
[jira] [Updated] (SPARK-5825) Failure stopping services while command line arguments are too long
[ https://issues.apache.org/jira/browse/SPARK-5825?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Owen updated SPARK-5825:
-----------------------------
Component/s: Deploy  (was: Spark Submit)
Target Version/s: 1.3.0, 1.2.2  (was: 1.3.0)

Failure stopping services while command line arguments are too long
--------------------------------------------------------------------
Key: SPARK-5825
URL: https://issues.apache.org/jira/browse/SPARK-5825
Project: Spark
Issue Type: Bug
Components: Deploy
Reporter: Cheng Hao
Assignee: Cheng Hao
Priority: Blocker

Stopping a service in `spark-daemon.sh` confirms the process id by fuzzy-matching on the class name; however, this fails if the Java process's argument list is very long (greater than 4096 characters).
[jira] [Comment Edited] (SPARK-5837) HTTP 500 when trying to access Spark UI in yarn-cluster or yarn-client mode
[ https://issues.apache.org/jira/browse/SPARK-5837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14327182#comment-14327182 ]

Rok Roskar edited comment on SPARK-5837 at 2/19/15 9:59 AM:
------------------------------------------------------------
this looks to perhaps be a related yarn issue: https://issues.apache.org/jira/browse/YARN-2713 -- though I don't know if this is why the ApplicationMaster link results in a connection refused error

was (Author: rok):
this looks to be a yarn issue: https://issues.apache.org/jira/browse/YARN-2713

HTTP 500 when trying to access Spark UI in yarn-cluster or yarn-client mode
----------------------------------------------------------------------------
Key: SPARK-5837
URL: https://issues.apache.org/jira/browse/SPARK-5837
Project: Spark
Issue Type: Bug
Components: Web UI
Affects Versions: 1.2.0, 1.2.1
Reporter: Marco Capuccini

Both Spark 1.2.0 and Spark 1.2.1 return this error when I try to access the Spark UI while running on YARN (version 2.4.0):

HTTP ERROR 500
Problem accessing /proxy/application_1423564210894_0017/. Reason: Connection refused

Caused by: java.net.ConnectException: Connection refused
	at java.net.PlainSocketImpl.socketConnect(Native Method)
	at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:339)
	at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:200)
	at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:182)
	at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
	at java.net.Socket.connect(Socket.java:579)
	at java.net.Socket.connect(Socket.java:528)
	at java.net.Socket.<init>(Socket.java:425)
	at java.net.Socket.<init>(Socket.java:280)
	at org.apache.commons.httpclient.protocol.DefaultProtocolSocketFactory.createSocket(DefaultProtocolSocketFactory.java:80)
	at org.apache.commons.httpclient.protocol.DefaultProtocolSocketFactory.createSocket(DefaultProtocolSocketFactory.java:122)
	at org.apache.commons.httpclient.HttpConnection.open(HttpConnection.java:707)
	at org.apache.commons.httpclient.HttpMethodDirector.executeWithRetry(HttpMethodDirector.java:387)
	at org.apache.commons.httpclient.HttpMethodDirector.executeMethod(HttpMethodDirector.java:171)
	at org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:397)
	at org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:346)
	at org.apache.hadoop.yarn.server.webproxy.WebAppProxyServlet.proxyLink(WebAppProxyServlet.java:187)
	at org.apache.hadoop.yarn.server.webproxy.WebAppProxyServlet.doGet(WebAppProxyServlet.java:344)
	at javax.servlet.http.HttpServlet.service(HttpServlet.java:707)
	at javax.servlet.http.HttpServlet.service(HttpServlet.java:820)
	at org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:511)
	at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1221)
	at com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:66)
	at com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:900)
	at com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:834)
	at org.apache.hadoop.yarn.server.resourcemanager.webapp.RMWebAppFilter.doFilter(RMWebAppFilter.java:79)
	at com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:795)
	at com.google.inject.servlet.FilterDefinition.doFilter(FilterDefinition.java:163)
	at com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:58)
	at com.google.inject.servlet.ManagedFilterPipeline.dispatch(ManagedFilterPipeline.java:118)
	at com.google.inject.servlet.GuiceFilter.doFilter(GuiceFilter.java:113)
	at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
	at org.apache.hadoop.http.lib.StaticUserWebFilter$StaticUserFilter.doFilter(StaticUserWebFilter.java:109)
	at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
	at org.apache.hadoop.http.HttpServer2$QuotingInputFilter.doFilter(HttpServer2.java:1192)
	at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
	at org.apache.hadoop.http.NoCacheFilter.doFilter(NoCacheFilter.java:45)
	at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
	at org.apache.hadoop.http.NoCacheFilter.doFilter(NoCacheFilter.java:45)
	at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
	...
[jira] [Updated] (SPARK-5889) remove pid file in spark-daemon.sh after killing the process.
[ https://issues.apache.org/jira/browse/SPARK-5889?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Owen updated SPARK-5889:
-----------------------------
Priority: Minor  (was: Major)
Target Version/s: 1.3.0, 1.2.2
Affects Version/s: 1.2.1
Assignee: Zhan Zhang

remove pid file in spark-daemon.sh after killing the process
-------------------------------------------------------------
Key: SPARK-5889
URL: https://issues.apache.org/jira/browse/SPARK-5889
Project: Spark
Issue Type: Bug
Affects Versions: 1.2.1
Reporter: Zhan Zhang
Assignee: Zhan Zhang
Priority: Minor

Currently, if the thrift server or history server is stopped, the pid file is not deleted. The fix is trivial, but it is important for service checks that rely on the file.
[jira] [Commented] (SPARK-5889) remove pid file in spark-daemon.sh after killing the process.
[ https://issues.apache.org/jira/browse/SPARK-5889?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14327238#comment-14327238 ]

Sean Owen commented on SPARK-5889:
----------------------------------
Yeah, I wanted to do this in the original PR, although I think there's a small potential problem: what if {{kill}} fails? Then you lose the PID file. In that case a lot of bets are off anyway, and it's not clear that subsequent retries would succeed. Still, since the script already handles old PID files (or at least tries to), I wonder if this can be slightly more conservative and only remove the file if {{kill}} succeeds?

remove pid file in spark-daemon.sh after killing the process
-------------------------------------------------------------
Key: SPARK-5889
URL: https://issues.apache.org/jira/browse/SPARK-5889
Project: Spark
Issue Type: Bug
Affects Versions: 1.2.1
Reporter: Zhan Zhang

Currently, if the thrift server or history server is stopped, the pid file is not deleted. The fix is trivial, but it is important for service checks that rely on the file.
[jira] [Resolved] (SPARK-5899) Viewing specific stage information which contains thousands of tasks will freak out the driver and the CPU cores where it runs
[ https://issues.apache.org/jira/browse/SPARK-5899?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Owen resolved SPARK-5899.
------------------------------
Resolution: Duplicate

Viewing specific stage information which contains thousands of tasks will freak out the driver and the CPU cores where it runs
--------------------------------------------------------------------------------------------------------------------------------
Key: SPARK-5899
URL: https://issues.apache.org/jira/browse/SPARK-5899
Project: Spark
Issue Type: Bug
Components: Web UI
Affects Versions: 1.3.0, 1.2.1
Environment: CentOS 7, Spark Standalone
Reporter: Mark Khaitman
Priority: Critical

If a user ever tries to view the stats for a specific stage -- for example, a repartition stage involving thousands of partitions -- the Web UI attempts to load every single task result onto a single web page, which completely destroys CPU usage on the driver and in turn makes the remaining tasks/jobs nearly impossible to complete. Ideally the task results should be paged (if it's not too much trouble) to prevent this from happening.
[jira] [Commented] (SPARK-5837) HTTP 500 when trying to access Spark UI in yarn-cluster or yarn-client mode
[ https://issues.apache.org/jira/browse/SPARK-5837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14327182#comment-14327182 ]

Rok Roskar commented on SPARK-5837:
-----------------------------------
this looks to be a yarn issue: https://issues.apache.org/jira/browse/YARN-2713

HTTP 500 when trying to access Spark UI in yarn-cluster or yarn-client mode
----------------------------------------------------------------------------
Key: SPARK-5837
URL: https://issues.apache.org/jira/browse/SPARK-5837
Project: Spark
Issue Type: Bug
Components: Web UI
Affects Versions: 1.2.0, 1.2.1
Reporter: Marco Capuccini

(See the issue description and proxy stack trace quoted in full above.)
[jira] [Updated] (SPARK-5889) remove pid file in spark-daemon.sh after killing the process.
[ https://issues.apache.org/jira/browse/SPARK-5889?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Owen updated SPARK-5889:
-----------------------------
Component/s: Deploy

remove pid file in spark-daemon.sh after killing the process
-------------------------------------------------------------
Key: SPARK-5889
URL: https://issues.apache.org/jira/browse/SPARK-5889
Project: Spark
Issue Type: Bug
Components: Deploy
Affects Versions: 1.2.1
Reporter: Zhan Zhang
Assignee: Zhan Zhang
Priority: Minor

Currently, if the thrift server or history server is stopped, the pid file is not deleted. The fix is trivial, but it is important for service checks that rely on the file.
[jira] [Commented] (SPARK-1476) 2GB limit in spark for blocks
[ https://issues.apache.org/jira/browse/SPARK-1476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14327394#comment-14327394 ]

Imran Rashid commented on SPARK-1476:
-------------------------------------
Based on discussion on the dev list, [~mridulm80] isn't actively working on this. I'd like to start on it, with the following very minimal goals:
1. Make it *possible* for blocks to be bigger than 2GB
2. Maintain performance on smaller blocks
i.e., I'm not going to try to do anything fancy to optimize performance for the large blocks. To that end, my plan is to:
1. Create a {{LargeByteBuffer}} interface, which just has the same methods we use on {{ByteBuffer}}
2. Have one implementation that simply wraps one {{ByteBuffer}}, and another which wraps a completely static set of {{ByteBuffer}}s (e.g., if you map a 3 GB file, it will just immediately map it as 2 {{ByteBuffer}}s -- nothing fancy like mapping only the first half of the file until the second is needed, etc.); a sketch of this follows after this message
3. Change {{ByteBuffer}} to {{LargeByteBuffer}} in {{ShuffleBlockManager}} and {{BlockStore}}
I see that about a year back there was a lot of discussion on this, and some alternate proposals. I'd like to push forward with a POC to try to move the discussion along again. I know there was some discussion about how important this is, and whether or not we want to support it. IMO this is a big limitation and results in a lot of frustration for users; we really need a solution for this.

2GB limit in spark for blocks
-----------------------------
Key: SPARK-1476
URL: https://issues.apache.org/jira/browse/SPARK-1476
Project: Spark
Issue Type: Improvement
Components: Spark Core
Environment: all
Reporter: Mridul Muralidharan
Assignee: Mridul Muralidharan
Priority: Critical
Attachments: 2g_fix_proposal.pdf

The underlying abstraction for blocks in Spark is a ByteBuffer, which limits the size of a block to 2GB. This has implications not just for managed blocks in use, but also for shuffle blocks (memory-mapped blocks are limited to 2GB, even though the API allows for long), ser/deser via byte-array-backed output streams (SPARK-1391), etc. This is a severe limitation for use of Spark on non-trivial datasets.
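A minimal sketch of that proposed abstraction, with illustrative names and only a couple of representative methods (the real interface would mirror more of {{ByteBuffer}}):
{code}
import java.nio.ByteBuffer

// Illustrative interface: the same read operations as ByteBuffer, but with
// Long-based sizes so a block can exceed 2GB.
trait LargeByteBuffer {
  def get(): Byte
  def remaining(): Long
}

// Case 1: wraps a single ByteBuffer (the common, under-2GB case).
class WrappedLargeByteBuffer(buf: ByteBuffer) extends LargeByteBuffer {
  override def get(): Byte = buf.get()
  override def remaining(): Long = buf.remaining().toLong
}

// Case 2: a completely static chain of ByteBuffers, e.g. a 3GB file mapped
// eagerly as two chunks; reads advance through the chunks in order.
class ChainedLargeByteBuffer(chunks: Array[ByteBuffer]) extends LargeByteBuffer {
  private var idx = 0
  override def get(): Byte = {
    while (idx < chunks.length - 1 && !chunks(idx).hasRemaining) idx += 1
    chunks(idx).get() // throws BufferUnderflowException once fully drained
  }
  override def remaining(): Long =
    chunks.drop(idx).map(_.remaining().toLong).sum
}
{code}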
[jira] [Commented] (SPARK-5494) SparkSqlSerializer Ignores KryoRegistrators
[ https://issues.apache.org/jira/browse/SPARK-5494?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14327657#comment-14327657 ]

Apache Spark commented on SPARK-5494:
-------------------------------------
User 'hkothari' has created a pull request for this issue:
https://github.com/apache/spark/pull/4693

SparkSqlSerializer Ignores KryoRegistrators
-------------------------------------------
Key: SPARK-5494
URL: https://issues.apache.org/jira/browse/SPARK-5494
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 1.2.0
Reporter: Hamel Ajay Kothari

We should make SparkSqlSerializer call {{super.newKryo}} before doing any of its custom stuff, in order to make sure it picks up custom KryoRegistrators.
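A sketch of the shape of the fix (simplified; the real class registers many SQL-internal types, and the actual patch may differ). Starting from {{super.newKryo()}} means user KryoRegistrators configured via spark.kryo.registrator are applied before any SQL-specific setup:
{code}
import com.esotericsoftware.kryo.Kryo
import org.apache.spark.SparkConf
import org.apache.spark.serializer.KryoSerializer

class SparkSqlSerializer(conf: SparkConf) extends KryoSerializer(conf) {
  override def newKryo(): Kryo = {
    val kryo = super.newKryo() // honors custom KryoRegistrators
    kryo.setReferences(false)  // example of SQL-specific tuning layered on top
    kryo
  }
}
{code}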
[jira] [Created] (SPARK-5907) Selected column from DataFrame should not re-analyze logical plan
Liang-Chi Hsieh created SPARK-5907:
-----------------------------------
Summary: Selected column from DataFrame should not re-analyze logical plan
Key: SPARK-5907
URL: https://issues.apache.org/jira/browse/SPARK-5907
Project: Spark
Issue Type: Bug
Components: SQL
Reporter: Liang-Chi Hsieh

Currently, selecting a column from a DataFrame wraps the original logical plan in a Project. When the column is used, the logical plan is analyzed again. For some query plans, re-analysis has the side effect of incrementing expression ids, so when the column is accessed, the column's expr and the analyzed plan point to different expressions.
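A hypothetical illustration of the symptom, in the spirit of the report (the DataFrame and column names are assumptions, not taken from the issue):
{code}
// Assuming a SQLContext named sqlContext and some jsonRDD, as in a shell:
val df = sqlContext.jsonRDD(jsonRDD)
val c = df("a")   // wraps the plan in a Project; using it re-analyzes the plan
df.select(c)      // c.expr may now carry a stale expression id and fail to resolve
{code}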
[jira] [Commented] (SPARK-5908) Hive udtf with single alias should be resolved correctly
[ https://issues.apache.org/jira/browse/SPARK-5908?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14327512#comment-14327512 ]

Apache Spark commented on SPARK-5908:
-------------------------------------
User 'viirya' has created a pull request for this issue:
https://github.com/apache/spark/pull/4692

Hive udtf with single alias should be resolved correctly
---------------------------------------------------------
Key: SPARK-5908
URL: https://issues.apache.org/jira/browse/SPARK-5908
Project: Spark
Issue Type: Bug
Components: SQL
Reporter: Liang-Chi Hsieh

ResolveUdtfsAlias in hiveUdfs only considers a HiveGenericUdtf with multiple aliases. When only a single alias is used with a HiveGenericUdtf, the alias does not take effect.
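A hypothetical repro sketch of the two alias forms (the table and column names are assumptions; per the description, only the single-alias form misbehaves):
{code}
// Assuming a HiveContext named hiveContext, a table t with an array column
// `arr`, and a table t2 with a map column `kv`:
hiveContext.sql("SELECT explode(arr) AS item FROM t")    // single alias: broken
hiveContext.sql("SELECT explode(kv) AS (k, v) FROM t2")  // multiple aliases: fine
{code}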
[jira] [Updated] (SPARK-5719) allow daemons to bind to specified host
[ https://issues.apache.org/jira/browse/SPARK-5719?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrew Or updated SPARK-5719:
-----------------------------
Affects Version/s: 1.0.0

allow daemons to bind to specified host
---------------------------------------
Key: SPARK-5719
URL: https://issues.apache.org/jira/browse/SPARK-5719
Project: Spark
Issue Type: Improvement
Components: Deploy
Affects Versions: 1.0.0
Reporter: Tao Wang
Priority: Minor

Currently the web UI binds to 0.0.0.0. When multiple network planes are enabled, we may want to bind the UI port to a specific IP address, so that it is possible to do firewall work (IP filtering). The added config items also work for daemons.
[jira] [Closed] (SPARK-5423) ExternalAppendOnlyMap won't delete temp spilled file if an exception happens while using it
[ https://issues.apache.org/jira/browse/SPARK-5423?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrew Or closed SPARK-5423.
----------------------------
Resolution: Fixed
Fix Version/s: 1.3.0, 1.1.2, 1.2.2
Assignee: Shixiong Zhu
Target Version/s: 1.3.0, 1.1.2, 1.2.2

ExternalAppendOnlyMap won't delete temp spilled file if an exception happens while using it
--------------------------------------------------------------------------------------------
Key: SPARK-5423
URL: https://issues.apache.org/jira/browse/SPARK-5423
Project: Spark
Issue Type: Improvement
Components: Shuffle
Affects Versions: 1.0.0
Reporter: Shixiong Zhu
Assignee: Shixiong Zhu
Fix For: 1.3.0, 1.1.2, 1.2.2

ExternalAppendOnlyMap won't delete its temp spilled file if an exception happens while it is being used. There is already a TODO in the comments:
{code}
// TODO: Ensure this gets called even if the iterator isn't drained.
private def cleanup() {
  batchIndex = batchOffsets.length  // Prevent reading any other batch
  val ds = deserializeStream
  deserializeStream = null
  fileStream = null
  ds.close()
  file.delete()
}
{code}
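One way to make that TODO concrete is to wrap the spill-file iterator so cleanup fires once the iterator is exhausted; a self-contained sketch of the idea (Spark has a similar internal CompletionIterator utility, but whether the actual fix uses it is an assumption, and the exception-mid-iteration case would still need a task-completion hook):
{code}
// Generic stand-in: run `cleanup` exactly once, when `underlying` is drained.
def withCleanup[A](underlying: Iterator[A])(cleanup: () => Unit): Iterator[A] =
  new Iterator[A] {
    private var cleaned = false
    def hasNext: Boolean = {
      val more = underlying.hasNext
      if (!more && !cleaned) { cleaned = true; cleanup() }
      more
    }
    def next(): A = underlying.next()
  }

// Usage sketch: the real callback would close the stream and delete the file.
val spilled = withCleanup(Iterator(1, 2, 3))(() => println("deleting spill file"))
spilled.foreach(println)
{code}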
[jira] [Commented] (SPARK-5775) GenericRow cannot be cast to SpecificMutableRow when nested data and partitioned table
[ https://issues.apache.org/jira/browse/SPARK-5775?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14327938#comment-14327938 ]

Anselme Vignon commented on SPARK-5775:
---------------------------------------
This bug is due to a problem in the table-scan operations, involving both partition columns and complex-type columns. I made a pull request patching the issue here:
https://github.com/apache/spark/pull/4697

GenericRow cannot be cast to SpecificMutableRow when nested data and partitioned table
---------------------------------------------------------------------------------------
Key: SPARK-5775
URL: https://issues.apache.org/jira/browse/SPARK-5775
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 1.2.1
Reporter: Ayoub Benali
Labels: hivecontext, nested, parquet, partition

Using the LOAD sql command in a Hive context to load parquet files into a partitioned table causes exceptions at query time. The bug requires the table to have a column of type *array of struct* and to be *partitioned*. The example below shows how to reproduce the bug; note that if the table is not partitioned, the query works fine.
{noformat}
scala> val data1 = """{"data_array":[{"field1":1,"field2":2}]}"""
scala> val data2 = """{"data_array":[{"field1":3,"field2":4}]}"""
scala> val jsonRDD = sc.makeRDD(data1 :: data2 :: Nil)
scala> val schemaRDD = hiveContext.jsonRDD(jsonRDD)
scala> schemaRDD.printSchema
root
 |-- data_array: array (nullable = true)
 |    |-- element: struct (containsNull = false)
 |    |    |-- field1: integer (nullable = true)
 |    |    |-- field2: integer (nullable = true)
scala> hiveContext.sql("create external table if not exists partitioned_table(data_array ARRAY<STRUCT<field1: INT, field2: INT>>) Partitioned by (date STRING) STORED AS PARQUET Location 'hdfs:///partitioned_table'")
scala> hiveContext.sql("create external table if not exists none_partitioned_table(data_array ARRAY<STRUCT<field1: INT, field2: INT>>) STORED AS PARQUET Location 'hdfs:///none_partitioned_table'")
scala> schemaRDD.saveAsParquetFile("hdfs:///tmp_data_1")
scala> schemaRDD.saveAsParquetFile("hdfs:///tmp_data_2")
scala> hiveContext.sql("LOAD DATA INPATH 'hdfs://qa-hdc001.ffm.nugg.ad:8020/erlogd/tmp_data_1' INTO TABLE partitioned_table PARTITION(date='2015-02-12')")
scala> hiveContext.sql("LOAD DATA INPATH 'hdfs://qa-hdc001.ffm.nugg.ad:8020/erlogd/tmp_data_2' INTO TABLE none_partitioned_table")
scala> hiveContext.sql("select data.field1 from none_partitioned_table LATERAL VIEW explode(data_array) nestedStuff AS data").collect
res23: Array[org.apache.spark.sql.Row] = Array([1], [3])
scala> hiveContext.sql("select data.field1 from partitioned_table LATERAL VIEW explode(data_array) nestedStuff AS data").collect
15/02/12 16:21:03 INFO ParseDriver: Parsing command: select data.field1 from partitioned_table LATERAL VIEW explode(data_array) nestedStuff AS data
15/02/12 16:21:03 INFO ParseDriver: Parse Completed
15/02/12 16:21:03 INFO MemoryStore: ensureFreeSpace(260661) called with curMem=0, maxMem=280248975
15/02/12 16:21:03 INFO MemoryStore: Block broadcast_18 stored as values in memory (estimated size 254.6 KB, free 267.0 MB)
15/02/12 16:21:03 INFO MemoryStore: ensureFreeSpace(28615) called with curMem=260661, maxMem=280248975
15/02/12 16:21:03 INFO MemoryStore: Block broadcast_18_piece0 stored as bytes in memory (estimated size 27.9 KB, free 267.0 MB)
15/02/12 16:21:03 INFO BlockManagerInfo: Added broadcast_18_piece0 in memory on *:51990 (size: 27.9 KB, free: 267.2 MB)
15/02/12 16:21:03 INFO BlockManagerMaster: Updated info of block broadcast_18_piece0
15/02/12 16:21:03 INFO SparkContext: Created broadcast 18 from NewHadoopRDD at ParquetTableOperations.scala:119
15/02/12 16:21:03 INFO FileInputFormat: Total input paths to process : 3
15/02/12 16:21:03 INFO ParquetInputFormat: Total input paths to process : 3
15/02/12 16:21:03 INFO FilteringParquetRowInputFormat: Using Task Side Metadata Split Strategy
15/02/12 16:21:03 INFO SparkContext: Starting job: collect at SparkPlan.scala:84
15/02/12 16:21:03 INFO DAGScheduler: Got job 12 (collect at SparkPlan.scala:84) with 3 output partitions (allowLocal=false)
15/02/12 16:21:03 INFO DAGScheduler: Final stage: Stage 13(collect at SparkPlan.scala:84)
15/02/12 16:21:03 INFO DAGScheduler: Parents of final stage: List()
15/02/12 16:21:03 INFO DAGScheduler: Missing parents: List()
15/02/12 16:21:03 INFO DAGScheduler: Submitting Stage 13 (MappedRDD[111] at map at SparkPlan.scala:84), which has no missing parents
15/02/12 16:21:03 INFO MemoryStore: ensureFreeSpace(7632) called with curMem=289276, maxMem=280248975
15/02/12 16:21:03 INFO MemoryStore: Block broadcast_19 stored as values in memory (estimated size 7.5 KB, free 267.0 MB)
15/02/12 16:21:03 INFO
{noformat}
(The log is truncated here in the digest, before the ClassCastException named in the title.)
[jira] [Commented] (SPARK-5669) Spark assembly includes incompatibly licensed libgfortran, libgcc code via JBLAS
[ https://issues.apache.org/jira/browse/SPARK-5669?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14327953#comment-14327953 ]

Xiangrui Meng commented on SPARK-5669:
--------------------------------------
GFortran is part of GCC (https://gcc.gnu.org/wiki/GFortran), and hence so is the `libgfortran` library. In Apple's libgfortran header file (http://www.opensource.apple.com/source/gcc/gcc-5484/libgfortran/libgfortran.h), I found the following:
{code}
As a special exception, if you link this library with other files,
some of which are compiled with GCC, to produce an executable,
this library does not by itself cause the resulting executable
to be covered by the GNU General Public License.
This exception does not however invalidate any other reasons why
the executable file might be covered by the GNU General Public License.
{code}
The official one links to the special exception page:
https://github.com/gcc-mirror/gcc/blob/master/libgfortran/libgfortran.h#L18

Spark assembly includes incompatibly licensed libgfortran, libgcc code via JBLAS
---------------------------------------------------------------------------------
Key: SPARK-5669
URL: https://issues.apache.org/jira/browse/SPARK-5669
Project: Spark
Issue Type: Bug
Components: Build
Reporter: Sean Owen
Assignee: Sean Owen
Priority: Blocker
Fix For: 1.3.0

(See the full issue description and bundled-library listing quoted above.)
[jira] [Commented] (SPARK-4423) Improve foreach() documentation to avoid confusion between local- and cluster-mode behavior
[ https://issues.apache.org/jira/browse/SPARK-4423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14327833#comment-14327833 ]

Apache Spark commented on SPARK-4423:
-------------------------------------
User 'ilganeli' has created a pull request for this issue:
https://github.com/apache/spark/pull/4696

Improve foreach() documentation to avoid confusion between local- and cluster-mode behavior
--------------------------------------------------------------------------------------------
Key: SPARK-4423
URL: https://issues.apache.org/jira/browse/SPARK-4423
Project: Spark
Issue Type: Improvement
Components: Documentation
Reporter: Josh Rosen
Assignee: Ilya Ganelin

{{foreach}} seems to be a common source of confusion for new users: in {{local}} mode, {{foreach}} can be used to update local variables on the driver, but programs that do this will not work properly when executed on clusters, since {{foreach}} will update per-executor variables (note that this _will_ work correctly for accumulators, but not for other types of mutable objects). Similarly, I've seen users become confused when {{.foreach(println)}} doesn't print to the driver's standard output.

At a minimum, we should improve the documentation to warn users against unsafe uses of {{foreach}} that won't work properly when transitioning from local mode to a real cluster. We might also consider changes to local mode so that its behavior more closely matches the cluster modes; this will require some discussion, though, since any change of behavior here would technically be a user-visible backwards-incompatible change (I don't think that we made any explicit guarantees about the current local-mode behavior, but someone might be relying on the current implicit behavior).
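A minimal sketch of the pitfall and the accumulator-based alternative described above, assuming a SparkContext named sc (e.g. in spark-shell):
{code}
val rdd = sc.parallelize(1 to 100)

// Unsafe: works in local mode only. On a cluster, each executor mutates its
// own copy of `counter`; the driver's variable stays 0.
var counter = 0
rdd.foreach(x => counter += x)

// Cluster-safe: accumulators aggregate updates back to the driver.
val acc = sc.accumulator(0)
rdd.foreach(x => acc += x)
println(acc.value)  // 5050
{code}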
[jira] [Updated] (SPARK-5902) PipelineStage.transformSchema should be public, not private
[ https://issues.apache.org/jira/browse/SPARK-5902?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-5902: - Description: For users to implement their own PipelineStages, we need to make PipelineStage.transformSchema public instead of private to ml. (was: For users to implement their own PipelineStages, we need to make PipelineStage.transformSchema protected instead of private to ml.) PipelineStage.transformSchema should be public, not private --- Key: SPARK-5902 URL: https://issues.apache.org/jira/browse/SPARK-5902 Project: Spark Issue Type: Improvement Components: ML Affects Versions: 1.3.0 Reporter: Joseph K. Bradley Assignee: Joseph K. Bradley Priority: Minor For users to implement their own PipelineStages, we need to make PipelineStage.transformSchema public instead of private to ml. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5423) ExternalAppendOnlyMap won't delete its temp spilled file if an exception occurs while it is being used
[ https://issues.apache.org/jira/browse/SPARK-5423?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-5423: - Affects Version/s: 1.0.0 ExternalAppendOnlyMap won't delete its temp spilled file if an exception occurs while it is being used --- Key: SPARK-5423 URL: https://issues.apache.org/jira/browse/SPARK-5423 Project: Spark Issue Type: Improvement Components: Shuffle Affects Versions: 1.0.0 Reporter: Shixiong Zhu Priority: Minor ExternalAppendOnlyMap won't delete its temp spilled file if an exception occurs while it is being used. There is already a TODO in the comment: {code}
// TODO: Ensure this gets called even if the iterator isn't drained.
private def cleanup() {
  batchIndex = batchOffsets.length // Prevent reading any other batch
  val ds = deserializeStream
  deserializeStream = null
  fileStream = null
  ds.close()
  file.delete()
}
{code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
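One way to address the TODO above, sketched here only as the general pattern rather than the actual fix: wrap consumption of the spill iterator so that cleanup runs whether or not the iterator is drained.
{code}
// Generic pattern only; `cleanup` stands in for the ExternalAppendOnlyMap
// method quoted above, and `it` for its spill-file iterator.
def foreachWithCleanup[A](it: Iterator[A])(f: A => Unit)(cleanup: => Unit): Unit =
  try it.foreach(f)
  finally cleanup // runs on normal completion and on any exception
{code}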
[jira] [Commented] (SPARK-5887) Class not found exception com.datastax.spark.connector.rdd.partitioner.CassandraPartition
[ https://issues.apache.org/jira/browse/SPARK-5887?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14327921#comment-14327921 ] Vijay Pawnarkar commented on SPARK-5887: Thanks! This could be a class loader issue in Spark. The class is present in the connector jar, and the jar is being added to the class loader's list of jars as per the logs. However, the classloader is not able to find it. The property spark.files.userClassPathFirst is documented as being experimental. Debugging this further. Class not found exception com.datastax.spark.connector.rdd.partitioner.CassandraPartition -- Key: SPARK-5887 URL: https://issues.apache.org/jira/browse/SPARK-5887 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.2.1 Environment: Spark 1.2.1 Spark Cassandra Connector 1.2.0 Alpha2 Reporter: Vijay Pawnarkar I am getting the following class not found exception when using Spark 1.2.1 with spark-cassandra-connector_2.10-1.2.0-alpha2. When the job is submitted to Spark, it successfully adds the required connector JAR file to the worker's classpath; the corresponding log entries are included below. From the log statements and from looking at the Spark 1.2.1 codebase, it looks like the JAR gets added to the URLClassLoader via Executor.scala's updateDependencies method. However, when it is time to execute the task, it is not able to resolve the class name.
[task-result-getter-0] WARN org.apache.spark.scheduler.TaskSetManager - Lost task 0.0 in stage 0.0 (TID 0, 127.0.0.1): java.lang.ClassNotFoundException: com.datastax.spark.connector.rdd.partitioner.CassandraPartition
at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:274)
at org.apache.spark.serializer.JavaDeserializationStream$$anon$1.resolveClass(JavaSerializer.scala:59)
at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1612)
at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1517)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1771)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:62)
at org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:87)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:182)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
-- LOG indicating JAR files were added to worker classpath. 
15/02/17 16:56:48 INFO Executor: Fetching http://127.0.0.1:64265/jars/spark-cassandra-connector-java_2.10-1.2.0-alpha2.jar with timestamp 1424210185005 15/02/17 16:56:48 INFO Utils: Fetching http://127.0.0.1:64265/jars/spark-cassandra-connector-java_2.10-1.2.0-alpha2.jar to C:\Users\sparkus\AppData\Local\Temp\spark-10f5e149-5460-4899-9c8f-b19b19bdaf55\spark-fba24b2b-5847-4b04-848c-90677d12ff99\spark-35f5ed4b-041d-40d8-8854-b243787de188\fetchFileTemp4665176275367448514.tmp 15/02/17 16:56:48 DEBUG Utils: fetchFile not using security 15/02/17 16:56:48 INFO Utils: Copying C:\Users\sparkus\AppData\Local\Temp\spark-10f5e149-5460-4899-9c8f-b19b19bdaf55\spark-fba24b2b-5847-4b04-848c-90677d12ff99\spark-35f5ed4b-041d-40d8-8854-b243787de188\16215993091424210185005_cache to C:\localapps\spark-1.2.1-bin-hadoop2.4\work\app-20150217165625-0006\0\.\spark-cassandra-connector-java_2.10-1.2.0-alpha2.jar 15/02/17 16:56:48 INFO Executor: Adding file:/C:/localapps/spark-1.2.1-bin-hadoop2.4/work/app-20150217165625-0006/0/./spark-cassandra-connector-java_2.10-1.2.0-alpha2.jar to class loader 15/02/17 16:56:50 INFO Executor: Fetching http://127.0.0.1:64265/jars/spark-cassandra-connector_2.10-1.2.0-alpha2.jar with timestamp 1424210185012 15/02/17 16:56:50 INFO Utils: Fetching
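The comment above mentions spark.files.userClassPathFirst; a hedged sketch of trying that flag when configuring the job (the jar path is illustrative, and the property was experimental in Spark 1.2):
{code}
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("cassandra-test")
  // Experimental in Spark 1.2: prefer user-added jars when resolving classes.
  .set("spark.files.userClassPathFirst", "true")
  .setJars(Seq("spark-cassandra-connector_2.10-1.2.0-alpha2.jar"))
val sc = new SparkContext(conf)
{code}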
[jira] [Commented] (SPARK-5775) GenericRow cannot be cast to SpecificMutableRow when nested data and partitioned table
[ https://issues.apache.org/jira/browse/SPARK-5775?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14327936#comment-14327936 ] Apache Spark commented on SPARK-5775: - User 'anselmevignon' has created a pull request for this issue: https://github.com/apache/spark/pull/4697 GenericRow cannot be cast to SpecificMutableRow when nested data and partitioned table -- Key: SPARK-5775 URL: https://issues.apache.org/jira/browse/SPARK-5775 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.1 Reporter: Ayoub Benali Labels: hivecontext, nested, parquet, partition Using the LOAD sql command in Hive context to load parquet files into a partitioned table causes exceptions during query time. The bug requires the table to have a column of *type Array of struct* and to be *partitioned*. The example below shows how to reproduce the bug, and you can see that if the table is not partitioned the query works fine. {noformat}
scala> val data1 = """{"data_array":[{"field1":1,"field2":2}]}"""
scala> val data2 = """{"data_array":[{"field1":3,"field2":4}]}"""
scala> val jsonRDD = sc.makeRDD(data1 :: data2 :: Nil)
scala> val schemaRDD = hiveContext.jsonRDD(jsonRDD)
scala> schemaRDD.printSchema
root
 |-- data_array: array (nullable = true)
 |    |-- element: struct (containsNull = false)
 |    |    |-- field1: integer (nullable = true)
 |    |    |-- field2: integer (nullable = true)
scala> hiveContext.sql("create external table if not exists partitioned_table(data_array ARRAY<STRUCT<field1: INT, field2: INT>>) Partitioned by (date STRING) STORED AS PARQUET Location 'hdfs:///partitioned_table'")
scala> hiveContext.sql("create external table if not exists none_partitioned_table(data_array ARRAY<STRUCT<field1: INT, field2: INT>>) STORED AS PARQUET Location 'hdfs:///none_partitioned_table'")
scala> schemaRDD.saveAsParquetFile("hdfs:///tmp_data_1")
scala> schemaRDD.saveAsParquetFile("hdfs:///tmp_data_2")
scala> hiveContext.sql("LOAD DATA INPATH 'hdfs://qa-hdc001.ffm.nugg.ad:8020/erlogd/tmp_data_1' INTO TABLE partitioned_table PARTITION(date='2015-02-12')")
scala> hiveContext.sql("LOAD DATA INPATH 'hdfs://qa-hdc001.ffm.nugg.ad:8020/erlogd/tmp_data_2' INTO TABLE none_partitioned_table")
scala> hiveContext.sql("select data.field1 from none_partitioned_table LATERAL VIEW explode(data_array) nestedStuff AS data").collect
res23: Array[org.apache.spark.sql.Row] = Array([1], [3])
scala> hiveContext.sql("select data.field1 from partitioned_table LATERAL VIEW explode(data_array) nestedStuff AS data").collect
15/02/12 16:21:03 INFO ParseDriver: Parsing command: select data.field1 from partitioned_table LATERAL VIEW explode(data_array) nestedStuff AS data
15/02/12 16:21:03 INFO ParseDriver: Parse Completed
15/02/12 16:21:03 INFO MemoryStore: ensureFreeSpace(260661) called with curMem=0, maxMem=280248975
15/02/12 16:21:03 INFO MemoryStore: Block broadcast_18 stored as values in memory (estimated size 254.6 KB, free 267.0 MB)
15/02/12 16:21:03 INFO MemoryStore: ensureFreeSpace(28615) called with curMem=260661, maxMem=280248975
15/02/12 16:21:03 INFO MemoryStore: Block broadcast_18_piece0 stored as bytes in memory (estimated size 27.9 KB, free 267.0 MB)
15/02/12 16:21:03 INFO BlockManagerInfo: Added broadcast_18_piece0 in memory on *:51990 (size: 27.9 KB, free: 267.2 MB)
15/02/12 16:21:03 INFO BlockManagerMaster: Updated info of block broadcast_18_piece0
15/02/12 16:21:03 INFO SparkContext: Created broadcast 18 from NewHadoopRDD at ParquetTableOperations.scala:119
15/02/12 16:21:03 INFO FileInputFormat: Total input paths to process : 3
15/02/12 16:21:03 INFO ParquetInputFormat: Total input paths to process : 3
15/02/12 16:21:03 INFO FilteringParquetRowInputFormat: Using Task Side Metadata Split Strategy
15/02/12 16:21:03 INFO SparkContext: Starting job: collect at SparkPlan.scala:84
15/02/12 16:21:03 INFO DAGScheduler: Got job 12 (collect at SparkPlan.scala:84) with 3 output partitions (allowLocal=false)
15/02/12 16:21:03 INFO DAGScheduler: Final stage: Stage 13(collect at SparkPlan.scala:84)
15/02/12 16:21:03 INFO DAGScheduler: Parents of final stage: List()
15/02/12 16:21:03 INFO DAGScheduler: Missing parents: List()
15/02/12 16:21:03 INFO DAGScheduler: Submitting Stage 13 (MappedRDD[111] at map at SparkPlan.scala:84), which has no missing parents
15/02/12 16:21:03 INFO MemoryStore: ensureFreeSpace(7632) called with curMem=289276, maxMem=280248975
15/02/12 16:21:03 INFO MemoryStore: Block broadcast_19 stored as values in memory (estimated size 7.5 KB, free 267.0 MB)
15/02/12 16:21:03 INFO MemoryStore: ensureFreeSpace(4230) called with curMem=296908, maxMem=280248975
15/02/12 16:21:03 INFO
[jira] [Updated] (SPARK-5423) ExternalAppendOnlyMap won't delete its temp spilled file if an exception occurs while it is being used
[ https://issues.apache.org/jira/browse/SPARK-5423?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-5423: - Priority: Major (was: Minor) ExternalAppendOnlyMap won't delete its temp spilled file if an exception occurs while it is being used --- Key: SPARK-5423 URL: https://issues.apache.org/jira/browse/SPARK-5423 Project: Spark Issue Type: Improvement Components: Shuffle Affects Versions: 1.0.0 Reporter: Shixiong Zhu ExternalAppendOnlyMap won't delete its temp spilled file if an exception occurs while it is being used. There is already a TODO in the comment: {code}
// TODO: Ensure this gets called even if the iterator isn't drained.
private def cleanup() {
  batchIndex = batchOffsets.length // Prevent reading any other batch
  val ds = deserializeStream
  deserializeStream = null
  fileStream = null
  ds.close()
  file.delete()
}
{code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-5887) Class not found exception com.datastax.spark.connector.rdd.partitioner.CassandraPartition
[ https://issues.apache.org/jira/browse/SPARK-5887?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell resolved SPARK-5887. Resolution: Invalid The Datastax connector is not part of the Apache Spark distribution, it's maintained by Datastax directly. So please reach out to them for support. Thanks! Class not found exception com.datastax.spark.connector.rdd.partitioner.CassandraPartition -- Key: SPARK-5887 URL: https://issues.apache.org/jira/browse/SPARK-5887 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.2.1 Environment: Spark 1.2.1 Spark Cassandra Connector 1.2.0 Alpha2 Reporter: Vijay Pawnarkar I am getting following class not found exception when using Spark 1.2.1 with spark-cassandra-connector_2.10-1.2.0-alpha2. When the job is submitted to Spark.. it successfully adds required connector JAR file to Worker's classpath. Corresponding log entries are also included in following description. From log statements and looking at spark 1.2.1 codebase it looks like the JAR get added to urlClassLoader via Executor.scala's updateDependencies method. However when it time to execute the Task, its not able to resolve the class name. [task-result-getter-0] WARN org.apache.spark.scheduler.TaskSetManager - Lost task 0.0 in stage 0.0 (TID 0, 127.0.0.1): java.lang.ClassNotFoundException: com.datastax.spark.connector.rdd.partitioner.CassandraPartition at java.net.URLClassLoader$1.run(URLClassLoader.java:366) at java.net.URLClassLoader$1.run(URLClassLoader.java:355) at java.security.AccessController.doPrivileged(Native Method) at java.net.URLClassLoader.findClass(URLClassLoader.java:354) at java.lang.ClassLoader.loadClass(ClassLoader.java:425) at java.lang.ClassLoader.loadClass(ClassLoader.java:358) at java.lang.Class.forName0(Native Method) at java.lang.Class.forName(Class.java:274) at org.apache.spark.serializer.JavaDeserializationStream$$anon$1.resolveClass(JavaSerializer.scala:59) at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1612) at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1517) at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1771) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990) at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915) at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370) at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:62) at org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:87) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:182) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) -- LOG indicating JAR files were added to worker classpath. 
15/02/17 16:56:48 INFO Executor: Fetching http://127.0.0.1:64265/jars/spark-cassandra-connector-java_2.10-1.2.0-alpha2.jar with timestamp 1424210185005 15/02/17 16:56:48 INFO Utils: Fetching http://127.0.0.1:64265/jars/spark-cassandra-connector-java_2.10-1.2.0-alpha2.jar to C:\Users\sparkus\AppData\Local\Temp\spark-10f5e149-5460-4899-9c8f-b19b19bdaf55\spark-fba24b2b-5847-4b04-848c-90677d12ff99\spark-35f5ed4b-041d-40d8-8854-b243787de188\fetchFileTemp4665176275367448514.tmp 15/02/17 16:56:48 DEBUG Utils: fetchFile not using security 15/02/17 16:56:48 INFO Utils: Copying C:\Users\sparkus\AppData\Local\Temp\spark-10f5e149-5460-4899-9c8f-b19b19bdaf55\spark-fba24b2b-5847-4b04-848c-90677d12ff99\spark-35f5ed4b-041d-40d8-8854-b243787de188\16215993091424210185005_cache to C:\localapps\spark-1.2.1-bin-hadoop2.4\work\app-20150217165625-0006\0\.\spark-cassandra-connector-java_2.10-1.2.0-alpha2.jar 15/02/17 16:56:48 INFO Executor: Adding file:/C:/localapps/spark-1.2.1-bin-hadoop2.4/work/app-20150217165625-0006/0/./spark-cassandra-connector-java_2.10-1.2.0-alpha2.jar to class loader 15/02/17 16:56:50 INFO Executor: Fetching http://127.0.0.1:64265/jars/spark-cassandra-connector_2.10-1.2.0-alpha2.jar with timestamp 1424210185012 15/02/17 16:56:50 INFO Utils: Fetching http://127.0.0.1:64265/jars/spark-cassandra-connector_2.10-1.2.0-alpha2.jar to
[jira] [Updated] (SPARK-5863) Performance regression in Spark SQL/Parquet due to ScalaReflection.convertRowToScala
[ https://issues.apache.org/jira/browse/SPARK-5863?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-5863: --- Priority: Critical (was: Major) Performance regression in Spark SQL/Parquet due to ScalaReflection.convertRowToScala Key: SPARK-5863 URL: https://issues.apache.org/jira/browse/SPARK-5863 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.0, 1.2.1 Reporter: Cristian Priority: Critical I was doing some perf testing on reading parquet files and noticed that, moving from Spark 1.1 to 1.2, the performance is 3x worse. In the profiler the culprit showed up as ScalaReflection.convertRowToScala. In particular, this zip is the issue: {code} r.toSeq.zip(schema.fields.map(_.dataType)) {code} There is already a comment there noting that this is slow, but it wasn't fixed. This produces a 3x degradation in parquet read performance, at least in my test case. Edit: the map is part of the issue as well. This whole code block is in a tight loop and allocates a new ListBuffer that needs to grow for each transformation. A possible solution is to change to using seq.view, which would allocate iterators instead. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
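To make the allocation point above concrete, a small illustration (not the Spark code itself): the strict zip/map pair materializes intermediate collections on every row, whereas a view fuses the steps and produces elements on demand.
{code}
val row: Seq[Any] = Seq(1, "a", 2.0)
val types: Seq[String] = Seq("int", "string", "double")

// Strict: allocates one collection for the map and another for the zip,
// once per row when run in a tight loop.
val strict = row.zip(types.map(_.toUpperCase))

// Lazy: no intermediate buffers; elements are computed as they are consumed.
val fused = row.view.zip(types.view.map(_.toUpperCase))
{code}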
[jira] [Comment Edited] (SPARK-5887) Class not found exception com.datastax.spark.connector.rdd.partitioner.CassandraPartition
[ https://issues.apache.org/jira/browse/SPARK-5887?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14327921#comment-14327921 ] Vijay Pawnarkar edited comment on SPARK-5887 at 2/19/15 6:35 PM: - Thanks! This could be a class loader issue in Spark. The class is present in the connector jar and the jar is being added to class loader's list of jars as per the logs . However classloader is not able to find it. Property spark.files.userClassPathFirst is documented as being experimental. Debugging this further. Had logged a issue with Datastax as well. https://datastax-oss.atlassian.net/browse/SPARKC-59 was (Author: tech20nn): Thanks! This could be a class loader issue in Spark. The class is present in the connector jar and the jar is being added to class loader's list of jars as per the logs . However classloader is not able to find it. Property spark.files.userClassPathFirst is documented as being experimental. Debugging this further. Class not found exception com.datastax.spark.connector.rdd.partitioner.CassandraPartition -- Key: SPARK-5887 URL: https://issues.apache.org/jira/browse/SPARK-5887 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.2.1 Environment: Spark 1.2.1 Spark Cassandra Connector 1.2.0 Alpha2 Reporter: Vijay Pawnarkar I am getting following class not found exception when using Spark 1.2.1 with spark-cassandra-connector_2.10-1.2.0-alpha2. When the job is submitted to Spark.. it successfully adds required connector JAR file to Worker's classpath. Corresponding log entries are also included in following description. From log statements and looking at spark 1.2.1 codebase it looks like the JAR get added to urlClassLoader via Executor.scala's updateDependencies method. However when it time to execute the Task, its not able to resolve the class name. 
[task-result-getter-0] WARN org.apache.spark.scheduler.TaskSetManager - Lost task 0.0 in stage 0.0 (TID 0, 127.0.0.1): java.lang.ClassNotFoundException: com.datastax.spark.connector.rdd.partitioner.CassandraPartition at java.net.URLClassLoader$1.run(URLClassLoader.java:366) at java.net.URLClassLoader$1.run(URLClassLoader.java:355) at java.security.AccessController.doPrivileged(Native Method) at java.net.URLClassLoader.findClass(URLClassLoader.java:354) at java.lang.ClassLoader.loadClass(ClassLoader.java:425) at java.lang.ClassLoader.loadClass(ClassLoader.java:358) at java.lang.Class.forName0(Native Method) at java.lang.Class.forName(Class.java:274) at org.apache.spark.serializer.JavaDeserializationStream$$anon$1.resolveClass(JavaSerializer.scala:59) at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1612) at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1517) at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1771) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990) at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915) at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370) at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:62) at org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:87) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:182) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) -- LOG indicating JAR files were added to worker classpath. 15/02/17 16:56:48 INFO Executor: Fetching http://127.0.0.1:64265/jars/spark-cassandra-connector-java_2.10-1.2.0-alpha2.jar with timestamp 1424210185005 15/02/17 16:56:48 INFO Utils: Fetching http://127.0.0.1:64265/jars/spark-cassandra-connector-java_2.10-1.2.0-alpha2.jar to C:\Users\sparkus\AppData\Local\Temp\spark-10f5e149-5460-4899-9c8f-b19b19bdaf55\spark-fba24b2b-5847-4b04-848c-90677d12ff99\spark-35f5ed4b-041d-40d8-8854-b243787de188\fetchFileTemp4665176275367448514.tmp 15/02/17 16:56:48 DEBUG Utils: fetchFile not using security 15/02/17 16:56:48 INFO Utils: Copying C:\Users\sparkus\AppData\Local\Temp\spark-10f5e149-5460-4899-9c8f-b19b19bdaf55\spark-fba24b2b-5847-4b04-848c-90677d12ff99\spark-35f5ed4b-041d-40d8-8854-b243787de188\16215993091424210185005_cache to
[jira] [Updated] (SPARK-5316) DAGScheduler may leak shuffleToMapStage entries if getParentStages fails
[ https://issues.apache.org/jira/browse/SPARK-5316?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-5316: - Priority: Major (was: Minor) DAGScheduler may leak shuffleToMapStage entries if getParentStages fails -- Key: SPARK-5316 URL: https://issues.apache.org/jira/browse/SPARK-5316 Project: Spark Issue Type: Bug Components: Scheduler Reporter: YanTang Zhai DAGScheduler may leak shuffleToMapStage entries if getParentStages fails. If getParentStages throws an exception, for example because an input path does not exist, DAGScheduler fails to handle the job submission, but records may already have been added to shuffleToMapStage during getParentStages. These records are never cleaned up. A simple job that reproduces this: {code:java}
val inputFile1 = ... // Input path does not exist when this job submits
val inputFile2 = ...
val outputFile = ...
val conf = new SparkConf()
val sc = new SparkContext(conf)
val rdd1 = sc.textFile(inputFile1)
  .flatMap(line => line.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _, 1)
val rdd2 = sc.textFile(inputFile2)
  .flatMap(line => line.split(","))
  .map(word => (word, 1))
  .reduceByKey(_ + _, 1)
try {
  val rdd3 = new PairRDDFunctions(rdd1).join(rdd2, 1)
  rdd3.saveAsTextFile(outputFile)
} catch {
  case e: Exception => logError(e)
}
// print the information of DAGScheduler's shuffleToMapStage to check
// whether it still has uncleaned records.
...
{code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4962) Put TaskScheduler.start back in SparkContext to shorten cluster resources occupation period
[ https://issues.apache.org/jira/browse/SPARK-4962?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-4962: - Affects Version/s: 1.0.0 Put TaskScheduler.start back in SparkContext to shorten cluster resources occupation period --- Key: SPARK-4962 URL: https://issues.apache.org/jira/browse/SPARK-4962 Project: Spark Issue Type: Improvement Components: Scheduler Affects Versions: 1.0.0 Reporter: YanTang Zhai Priority: Minor When the SparkContext object is instantiated, the TaskScheduler is started and some resources are allocated from the cluster. However, these resources may not be used for a while, for example while DAGScheduler.JobSubmitted is still being processed. These resources are wasted in that period. Thus, we want to move TaskScheduler.start later to shorten the period during which cluster resources are occupied, especially on a busy cluster. The TaskScheduler could be started just before running stages. We can analyse and compare the resource occupation period before and after the optimization.
TaskScheduler.start execution time: [time1__]
DAGScheduler.JobSubmitted (excluding HadoopRDD.getPartitions or TaskScheduler.start) execution time: [time2_]
HadoopRDD.getPartitions execution time: [time3___]
Stages execution time: [time4_]
The cluster resource occupation period before the optimization is [time2_][time3___][time4_]. The cluster resource occupation period after the optimization is [time3___][time4_]. In summary, the cluster resource occupation period after the optimization is shorter than before. If HadoopRDD.getPartitions could be moved earlier (SPARK-4961), the period might shrink further, to [time4_]. This resource saving is important for a busy cluster. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5316) DAGScheduler may leak shuffleToMapStage entries if getParentStages fails
[ https://issues.apache.org/jira/browse/SPARK-5316?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-5316: - Affects Version/s: 1.0.0 DAGScheduler may leak shuffleToMapStage entries if getParentStages fails -- Key: SPARK-5316 URL: https://issues.apache.org/jira/browse/SPARK-5316 Project: Spark Issue Type: Bug Components: Scheduler Affects Versions: 1.0.0 Reporter: YanTang Zhai DAGScheduler may leak shuffleToMapStage entries if getParentStages fails. If getParentStages throws an exception, for example because an input path does not exist, DAGScheduler fails to handle the job submission, but records may already have been added to shuffleToMapStage during getParentStages. These records are never cleaned up. A simple job that reproduces this: {code:java}
val inputFile1 = ... // Input path does not exist when this job submits
val inputFile2 = ...
val outputFile = ...
val conf = new SparkConf()
val sc = new SparkContext(conf)
val rdd1 = sc.textFile(inputFile1)
  .flatMap(line => line.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _, 1)
val rdd2 = sc.textFile(inputFile2)
  .flatMap(line => line.split(","))
  .map(word => (word, 1))
  .reduceByKey(_ + _, 1)
try {
  val rdd3 = new PairRDDFunctions(rdd1).join(rdd2, 1)
  rdd3.saveAsTextFile(outputFile)
} catch {
  case e: Exception => logError(e)
}
// print the information of DAGScheduler's shuffleToMapStage to check
// whether it still has uncleaned records.
...
{code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4921) TaskSetManager mistakenly returns PROCESS_LOCAL for NO_PREF tasks
[ https://issues.apache.org/jira/browse/SPARK-4921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-4921: - Summary: TaskSetManager mistakenly returns PROCESS_LOCAL for NO_PREF tasks (was: Performance issue caused by TaskSetManager returning PROCESS_LOCAL for NO_PREF tasks) TaskSetManager mistakenly returns PROCESS_LOCAL for NO_PREF tasks - Key: SPARK-4921 URL: https://issues.apache.org/jira/browse/SPARK-4921 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.2.0 Reporter: Xuefu Zhang Attachments: NO_PREF.patch During research for HIVE-9153, we found that TaskSetManager returns PROCESS_LOCAL for NO_PREF tasks, which may cause performance degradation. Changing the return value to NO_PREF, as demonstrated in the attached patch, seemingly improves the performance. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-3545) Put HadoopRDD.getPartitions forward and put TaskScheduler.start back in SparkContext to reduce DAGScheduler.JobSubmitted processing time and shorten cluster resources occu
[ https://issues.apache.org/jira/browse/SPARK-3545?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or closed SPARK-3545. Resolution: Won't Fix Put HadoopRDD.getPartitions forward and put TaskScheduler.start back in SparkContext to reduce DAGScheduler.JobSubmitted processing time and shorten cluster resources occupation period Key: SPARK-3545 URL: https://issues.apache.org/jira/browse/SPARK-3545 Project: Spark Issue Type: Improvement Components: Scheduler Reporter: YanTang Zhai Priority: Minor We have two problems: (1) HadoopRDD.getPartitions is lazily evaluated inside DAGScheduler.JobSubmitted. If the input directory is large, getPartitions may take a long time; for example, in our cluster it takes anywhere from 0.029s to 766.699s. While one JobSubmitted event is being processed, the others have to wait. Thus, we want to move HadoopRDD.getPartitions earlier to reduce DAGScheduler.JobSubmitted processing time, so that other JobSubmitted events don't need to wait as long. The HadoopRDD object could get its partitions when it is instantiated. (2) When the SparkContext object is instantiated, the TaskScheduler is started and some resources are allocated from the cluster. However, these resources may not be used for a while, for example while DAGScheduler.JobSubmitted is still being processed. These resources are wasted in that period. Thus, we want to move TaskScheduler.start later to shorten the period during which cluster resources are occupied, especially on a busy cluster. The TaskScheduler could be started just before running stages. We can analyse and compare the execution time before and after the optimization.
TaskScheduler.start execution time: [time1__]
DAGScheduler.JobSubmitted (excluding HadoopRDD.getPartitions or TaskScheduler.start) execution time: [time2_]
HadoopRDD.getPartitions execution time: [time3___]
Stages execution time: [time4_]
(1) The app has only one job. (a) The execution time of the job before optimization is [time1__][time2_][time3___][time4_]. The execution time of the job after optimization is [time3___][time2_][time1__][time4_]. (b) The cluster resources occupation period before optimization is [time2_][time3___][time4_]. The cluster resources occupation period after optimization is [time4_]. In summary, if the app has only one job, the total execution time is the same before and after optimization, while the cluster resources occupation period after optimization is shorter than before. (2) The app has 4 jobs. (a) Before optimization, job1 execution time is [time2_][time3___][time4_], job2 execution time is [time2__][time3___][time4_], job3 execution time is [time2][time3___][time4_], job4 execution time is [time2__][time3___][time4_]. After optimization, job1 execution time is [time3___][time2_][time1__][time4_], job2 execution time is [time3___][time2__][time4_], job3 execution time is [time3___][time2_][time4_], job4 execution time is [time3___][time2__][time4_]. In summary, if the app has multiple jobs, the average execution time after optimization is less than before, and the cluster resources occupation period after optimization is shorter than before. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-1537) Add integration with Yarn's Application Timeline Server
[ https://issues.apache.org/jira/browse/SPARK-1537?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhan Zhang updated SPARK-1537: -- Attachment: SPARK-1537.txt High level design doc for spark ATS integration. Add integration with Yarn's Application Timeline Server --- Key: SPARK-1537 URL: https://issues.apache.org/jira/browse/SPARK-1537 Project: Spark Issue Type: New Feature Components: YARN Reporter: Marcelo Vanzin Assignee: Marcelo Vanzin Attachments: SPARK-1537.txt, spark-1573.patch It would be nice to have Spark integrate with Yarn's Application Timeline Server (see YARN-321, YARN-1530). This would allow users running Spark on Yarn to have a single place to go for all their history needs, and avoid having to manage a separate service (Spark's built-in server). At the moment, there's a working version of the ATS in the Hadoop 2.4 branch, although there is still some ongoing work. But the basics are there, and I wouldn't expect them to change (much) at this point. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5814) Remove JBLAS from runtime dependencies
[ https://issues.apache.org/jira/browse/SPARK-5814?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-5814: - Priority: Major (was: Critical) Remove JBLAS from runtime dependencies -- Key: SPARK-5814 URL: https://issues.apache.org/jira/browse/SPARK-5814 Project: Spark Issue Type: Dependency upgrade Components: GraphX, MLlib Reporter: Xiangrui Meng Assignee: Xiangrui Meng We are using mixed breeze/netlib-java and jblas code in MLlib. They take different approaches to utilizing native libraries and we should keep only one of them. netlib-java has a clear separation between the Java implementation and the native JNI libraries, while JBLAS packs statically linked binaries that cause license issues (SPARK-5669). So we want to remove JBLAS from the Spark runtime. One issue with this approach is that we have JBLAS' DoubleMatrix exposed (by mistake) in SVDPlusPlus of GraphX. We should deprecate it and replace `DoubleMatrix` with `Array[Double]`. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-5911) Make Column.cast(to: String) support fixed precision and scale decimal type
Yin Huai created SPARK-5911: --- Summary: Make Column.cast(to: String) support fixed precision and scale decimal type Key: SPARK-5911 URL: https://issues.apache.org/jira/browse/SPARK-5911 Project: Spark Issue Type: Improvement Components: SQL Reporter: Yin Huai -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
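A sketch of what this improvement would allow, assuming the string form carries precision and scale (today only the bare decimal keyword parses; `sqlContext`, the "payments" table, and the amount column are all illustrative):
{code}
// Hypothetical usage once fixed precision and scale are supported in the
// string form of Column.cast:
val df = sqlContext.table("payments")
val fixed = df.select(df("amount").cast("decimal(10,2)"))
{code}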
[jira] [Commented] (SPARK-5744) RDD.isEmpty / take fails for (empty) RDD of Nothing
[ https://issues.apache.org/jira/browse/SPARK-5744?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14328173#comment-14328173 ] Apache Spark commented on SPARK-5744: - User 'srowen' has created a pull request for this issue: https://github.com/apache/spark/pull/4698 RDD.isEmpty / take fails for (empty) RDD of Nothing --- Key: SPARK-5744 URL: https://issues.apache.org/jira/browse/SPARK-5744 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Tobias Bertelsen Assignee: Tobias Bertelsen Priority: Minor Original Estimate: 0h Remaining Estimate: 0h The implementation of {{RDD.isEmpty()}} fails if there are empty partitions. It was introduced by https://github.com/apache/spark/pull/4074 Example: {code} sc.parallelize(Seq(), 1).isEmpty() {code} The above code throws an exception like this: {code}
org.apache.spark.SparkDriverExecutionException: Execution error
at org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:977)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1374)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1338)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
Cause: java.lang.ArrayStoreException: [Ljava.lang.Object;
at scala.runtime.ScalaRunTime$.array_update(ScalaRunTime.scala:88)
at org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1466)
at org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1466)
at org.apache.spark.scheduler.JobWaiter.taskSucceeded(JobWaiter.scala:56)
at org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:973)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1374)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1338)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
{code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
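As the issue title notes, the failing case is specifically an RDD of Nothing: the untyped Seq() makes Scala infer Nothing as the element type. Giving the RDD a concrete element type sidesteps the crash (illustrative):
{code}
sc.parallelize(Seq(), 1).isEmpty()          // RDD[Nothing]: throws as above
sc.parallelize(Seq.empty[Int], 1).isEmpty() // RDD[Int]: returns true
{code}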
[jira] [Updated] (SPARK-4848) On a stand-alone cluster, several worker-specific variables are read only on the master
[ https://issues.apache.org/jira/browse/SPARK-4848?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-4848: - Component/s: (was: Project Infra) Deploy On a stand-alone cluster, several worker-specific variables are read only on the master --- Key: SPARK-4848 URL: https://issues.apache.org/jira/browse/SPARK-4848 Project: Spark Issue Type: Bug Components: Deploy Affects Versions: 1.0.0 Environment: stand-alone spark cluster Reporter: Nathan Kronenfeld Original Estimate: 24h Remaining Estimate: 24h On a stand-alone spark cluster, much of the determination of worker specifics, especially when one has multiple instances per node, is done only on the master. The master loops over instances and starts a worker per instance on each node. This means that if your workers have different values of SPARK_WORKER_INSTANCES or SPARK_WORKER_WEBUI_PORT from each other (or from the master), all values are ignored except the one on the master. SPARK_WORKER_PORT looks like it is unread in scripts, but read in code - I'm not sure how it will behave, since all instances will read the same value from the environment. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-1537) Add integration with Yarn's Application Timeline Server
[ https://issues.apache.org/jira/browse/SPARK-1537?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhan Zhang updated SPARK-1537: -- Attachment: spark-1573.patch Patch against v1.2.1 Add integration with Yarn's Application Timeline Server --- Key: SPARK-1537 URL: https://issues.apache.org/jira/browse/SPARK-1537 Project: Spark Issue Type: New Feature Components: YARN Reporter: Marcelo Vanzin Assignee: Marcelo Vanzin Attachments: spark-1573.patch It would be nice to have Spark integrate with Yarn's Application Timeline Server (see YARN-321, YARN-1530). This would allow users running Spark on Yarn to have a single place to go for all their history needs, and avoid having to manage a separate service (Spark's built-in server). At the moment, there's a working version of the ATS in the Hadoop 2.4 branch, although there is still some ongoing work. But the basics are there, and I wouldn't expect them to change (much) at this point. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4848) On a stand-alone cluster, several worker-specific variables are read only on the master
[ https://issues.apache.org/jira/browse/SPARK-4848?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-4848: - Affects Version/s: 1.0.0 On a stand-alone cluster, several worker-specific variables are read only on the master --- Key: SPARK-4848 URL: https://issues.apache.org/jira/browse/SPARK-4848 Project: Spark Issue Type: Bug Components: Deploy Affects Versions: 1.0.0 Environment: stand-alone spark cluster Reporter: Nathan Kronenfeld Original Estimate: 24h Remaining Estimate: 24h On a stand-alone spark cluster, much of the determination of worker specifics, especially when one has multiple instances per node, is done only on the master. The master loops over instances and starts a worker per instance on each node. This means that if your workers have different values of SPARK_WORKER_INSTANCES or SPARK_WORKER_WEBUI_PORT from each other (or from the master), all values are ignored except the one on the master. SPARK_WORKER_PORT looks like it is unread in scripts, but read in code - I'm not sure how it will behave, since all instances will read the same value from the environment. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4721) Improve handling when the first thread to put a block fails
[ https://issues.apache.org/jira/browse/SPARK-4721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-4721: - Affects Version/s: 1.0.0 Improve handling when the first thread to put a block fails Key: SPARK-4721 URL: https://issues.apache.org/jira/browse/SPARK-4721 Project: Spark Issue Type: Improvement Components: Block Manager Affects Versions: 1.0.0 Reporter: SuYan In the current code, when multiple threads try to put a block with the same blockID into the blockManager, the thread that first puts the info into blockinfos performs the put, and the others wait until that put fails or succeeds. This is fine when the put succeeds, but there are problems when it fails: 1. The failed thread removes the info from blockinfo. 2. The other threads wake up and use the old info.synchronized to retry the put. 3. If one of them succeeds, marking success reports that the block is no longer in pending status, so the "mark success" fails. All the remaining threads then do the same thing: grab info.synchronized and mark success or failure, even though one has already succeeded. First, I can't understand why the info is removed from blockinfos while other threads are waiting. The comment tells us it is so other threads can create a new block info, but a block info is just an ID and a level, so using the old one or a new one makes no difference if there are waiting threads. Second, if the first thread fails, the other waiting threads could retry the put one by one, rather than all at once; or, if the first thread fails, all other threads could simply log a warning and return after waking up. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4669) Allow users to set arbitrary akka configurations via property file
[ https://issues.apache.org/jira/browse/SPARK-4669?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-4669: - Affects Version/s: 1.0.0 Allow users to set arbitrary akka configurations via property file -- Key: SPARK-4669 URL: https://issues.apache.org/jira/browse/SPARK-4669 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.0.0 Reporter: Tao Wang Currently spark only supports several configuration settings in the property file, plus arbitrary settings in SparkConf. If we want to set other items in the akka configuration, for instance akka.remote.startup-timeout, it is not possible to do this in the property file. I reviewed the commit history and could not find why we keep the current strategy. So in my opinion it would be better to open up all akka settings in the property file. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
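For context, a sketch of the gap being described (as I understand it; illustrative only): in Spark 1.x only specific spark.akka.* keys are translated into the underlying akka config, so there is no pass-through for arbitrary akka keys.
{code}
import org.apache.spark.SparkConf

// Supported today: a recognized spark.akka.* key.
val conf = new SparkConf().set("spark.akka.timeout", "200")
// Not supported: arbitrary akka keys such as "akka.remote.startup-timeout",
// which is the gap this issue asks to close.
{code}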
[jira] [Closed] (SPARK-2188) Support sbt/sbt for Windows
[ https://issues.apache.org/jira/browse/SPARK-2188?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or closed SPARK-2188. Resolution: Won't Fix Support sbt/sbt for Windows --- Key: SPARK-2188 URL: https://issues.apache.org/jira/browse/SPARK-2188 Project: Spark Issue Type: New Feature Components: Build Affects Versions: 1.0.0 Reporter: Pat McDonough Add the equivalent of sbt/sbt for Windows users. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-911) Support map pruning on sorted (K, V) RDD's
[ https://issues.apache.org/jira/browse/SPARK-911?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-911: Affects Version/s: 1.0.0 Support map pruning on sorted (K, V) RDD's -- Key: SPARK-911 URL: https://issues.apache.org/jira/browse/SPARK-911 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.0 Reporter: Patrick Wendell If someone has sorted a (K, V) rdd, we should offer them a way to filter a range of the partitions that employs map pruning. This would be simple using a small range index within the rdd itself. A good example: I sort my dataset by time, and then I want to serve queries that are restricted to a certain time range. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
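A sketch of the idea (not an existing Spark API): keep per-partition (min, max) key bounds for the sorted RDD, then use the existing PartitionPruningRDD developer API to skip partitions that cannot overlap the queried range. `rangeScan` and `bounds` are illustrative names.
{code}
import org.apache.spark.rdd.{PartitionPruningRDD, RDD}

def rangeScan[V](sorted: RDD[(Long, V)],
                 bounds: Array[(Long, Long)], // per-partition (min, max) keys
                 lo: Long, hi: Long): RDD[(Long, V)] = {
  // Keep only partitions whose key range can intersect [lo, hi].
  val pruned = PartitionPruningRDD.create(sorted, i => {
    val (min, max) = bounds(i)
    max >= lo && min <= hi
  })
  pruned.filter { case (k, _) => k >= lo && k <= hi }
}
{code}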
[jira] [Updated] (SPARK-3051) Support looking-up named accumulators in a registry
[ https://issues.apache.org/jira/browse/SPARK-3051?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-3051: - Affects Version/s: 1.0.0 Support looking-up named accumulators in a registry --- Key: SPARK-3051 URL: https://issues.apache.org/jira/browse/SPARK-3051 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.0.0 Reporter: Neil Ferguson This is a proposed enhancement to Spark based on the following mailing list discussion: http://apache-spark-developers-list.1001551.n3.nabble.com/quot-Dynamic-variables-quot-in-Spark-td7450.html. This proposal builds on SPARK-2380 (Support displaying accumulator values in the web UI) to allow named accumulables to be looked up in a registry, as opposed to having to be passed to every method that needs to access them. The use case was described well by [~shivaram], as follows: Let's say you have two functions you use in a map call and want to measure how much time each of them takes. For example, if you have a code block like the one below and you want to measure how much time f1 takes as a fraction of the task. {noformat}
a.map { l =>
  val f = f1(l)
  ... some work here ...
}
{noformat} It would be really cool if we could do something like {noformat}
a.map { l =>
  val start = System.nanoTime
  val f = f1(l)
  TaskMetrics.get("f1-time").add(System.nanoTime - start)
}
{noformat} SPARK-2380 provides a partial solution to this problem -- however the accumulables would still need to be passed to every function that needs them, which I think would be cumbersome in any application of reasonable complexity. The proposal, as suggested by [~pwendell], is to have a registry of accumulables that can be looked up by name. Regarding the implementation details, I'd propose that we broadcast a serialized version of all named accumulables in the DAGScheduler (similar to what SPARK-2521 does for Tasks). These can then be deserialized in the Executor. Accumulables are already stored in thread-local variables in the Accumulators object, so exposing these in the registry should be simply a matter of wrapping this object and keying the accumulables by name (they are currently keyed by ID). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
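A minimal sketch of the registry shape being proposed (names and types are illustrative; the real design would wrap the existing thread-local state in the Accumulators object, as described above):
{code}
import org.apache.spark.Accumulator

object AccumulatorRegistry {
  // Thread-local so each running task sees the accumulables shipped with it.
  private val local = new ThreadLocal[Map[String, Accumulator[_]]] {
    override def initialValue = Map.empty[String, Accumulator[_]]
  }
  def register(name: String, acc: Accumulator[_]): Unit =
    local.set(local.get + (name -> acc))
  def get(name: String): Accumulator[_] = local.get()(name)
}
{code}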
[jira] [Updated] (SPARK-2033) Automatically cleanup checkpoint
[ https://issues.apache.org/jira/browse/SPARK-2033?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-2033: - Affects Version/s: 1.0.0 Automatically cleanup checkpoint - Key: SPARK-2033 URL: https://issues.apache.org/jira/browse/SPARK-2033 Project: Spark Issue Type: New Feature Components: Spark Core Affects Versions: 1.0.0 Reporter: Guoqiang Li Assignee: Guoqiang Li Currently we use the ContextCleaner to asynchronously clean up RDDs, shuffles, and broadcasts, but not checkpoints. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-5912) Programming guide for feature selection
Joseph K. Bradley created SPARK-5912: Summary: Programming guide for feature selection Key: SPARK-5912 URL: https://issues.apache.org/jira/browse/SPARK-5912 Project: Spark Issue Type: Documentation Components: Documentation, MLlib Affects Versions: 1.3.0 Reporter: Joseph K. Bradley The new ChiSqSelector for feature selection should have a section in the Programming Guide. It should probably be under the feature extraction and transformation section as a new subsection for feature selection. If we get more feature selection methods later on, we could expand it to a larger section of the guide. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3882) JobProgressListener gets permanently out of sync with long running job
[ https://issues.apache.org/jira/browse/SPARK-3882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14328237#comment-14328237 ] Andrew Or commented on SPARK-3882: -- Hi [~dgshep] is this still an issue after upgrading to Spark 1.1 and beyond? If not I think we should close this issue. JobProgressListener gets permanently out of sync with long running job -- Key: SPARK-3882 URL: https://issues.apache.org/jira/browse/SPARK-3882 Project: Spark Issue Type: Bug Components: Web UI Affects Versions: 1.0.2 Reporter: Davis Shepherd Attachments: Screen Shot 2014-10-03 at 12.50.59 PM.png A long running spark context (non-streaming) will eventually start throwing the following in the driver: {code} java.util.NoSuchElementException: key not found: 12771 at scala.collection.MapLike$class.default(MapLike.scala:228) at scala.collection.AbstractMap.default(Map.scala:58) at scala.collection.mutable.HashMap.apply(HashMap.scala:64) at org.apache.spark.ui.jobs.JobProgressListener.onStageCompleted(JobProgressListener.scala:79) at org.apache.spark.scheduler.SparkListenerBus$$anonfun$postToAll$2.apply(SparkListenerBus.scala:48) at org.apache.spark.scheduler.SparkListenerBus$$anonfun$postToAll$2.apply(SparkListenerBus.scala:48) at org.apache.spark.scheduler.SparkListenerBus$$anonfun$foreachListener$1.apply(SparkListenerBus.scala:81) at org.apache.spark.scheduler.SparkListenerBus$$anonfun$foreachListener$1.apply(SparkListenerBus.scala:79) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at org.apache.spark.scheduler.SparkListenerBus$class.foreachListener(SparkListenerBus.scala:79) at org.apache.spark.scheduler.SparkListenerBus$class.postToAll(SparkListenerBus.scala:48) at org.apache.spark.scheduler.LiveListenerBus.postToAll(LiveListenerBus.scala:32) at org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(LiveListenerBus.scala:56) at org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(LiveListenerBus.scala:56) at scala.Option.foreach(Option.scala:236) at org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1.apply$mcV$sp(LiveListenerBus.scala:56) at org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1.apply(LiveListenerBus.scala:47) at org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1.apply(LiveListenerBus.scala:47) at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1160) at org.apache.spark.scheduler.LiveListenerBus$$anon$1.run(LiveListenerBus.scala:46) 2014-10-09 18:45:33,523 [SparkListenerBus] ERROR org.apache.spark.scheduler.LiveListenerBus - Listener JobProgressListener threw an exception java.util.NoSuchElementException: key not found: 12782 at scala.collection.MapLike$class.default(MapLike.scala:228) at scala.collection.AbstractMap.default(Map.scala:58) at scala.collection.mutable.HashMap.apply(HashMap.scala:64) at org.apache.spark.ui.jobs.JobProgressListener.onStageCompleted(JobProgressListener.scala:79) at org.apache.spark.scheduler.SparkListenerBus$$anonfun$postToAll$2.apply(SparkListenerBus.scala:48) at org.apache.spark.scheduler.SparkListenerBus$$anonfun$postToAll$2.apply(SparkListenerBus.scala:48) at org.apache.spark.scheduler.SparkListenerBus$$anonfun$foreachListener$1.apply(SparkListenerBus.scala:81) at org.apache.spark.scheduler.SparkListenerBus$$anonfun$foreachListener$1.apply(SparkListenerBus.scala:79) at 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at org.apache.spark.scheduler.SparkListenerBus$class.foreachListener(SparkListenerBus.scala:79) at org.apache.spark.scheduler.SparkListenerBus$class.postToAll(SparkListenerBus.scala:48) at org.apache.spark.scheduler.LiveListenerBus.postToAll(LiveListenerBus.scala:32) at org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(LiveListenerBus.scala:56) at org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(LiveListenerBus.scala:56) at scala.Option.foreach(Option.scala:236) at org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1.apply$mcV$sp(LiveListenerBus.scala:56) at
[jira] [Commented] (SPARK-5912) Programming guide for feature selection
[ https://issues.apache.org/jira/browse/SPARK-5912?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14328238#comment-14328238 ] Joseph K. Bradley commented on SPARK-5912: -- [~avulanov] Would you have time to make this guide for the 1.3 release (as soon as possible, really)? If not, I could add it. Thanks! Programming guide for feature selection --- Key: SPARK-5912 URL: https://issues.apache.org/jira/browse/SPARK-5912 Project: Spark Issue Type: Documentation Components: Documentation, MLlib Affects Versions: 1.3.0 Reporter: Joseph K. Bradley The new ChiSqSelector for feature selection should have a section in the Programming Guide. It should probably be under the feature extraction and transformation section as a new subsection for feature selection. If we get more feature selection methods later on, we could expand it to a larger section of the guide. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5912) Programming guide for feature selection
[ https://issues.apache.org/jira/browse/SPARK-5912?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14328246#comment-14328246 ] Alexander Ulanov commented on SPARK-5912: - Sure, I can. Could you point me to some template or a good example of a programming guide? Programming guide for feature selection --- Key: SPARK-5912 URL: https://issues.apache.org/jira/browse/SPARK-5912 Project: Spark Issue Type: Documentation Components: Documentation, MLlib Affects Versions: 1.3.0 Reporter: Joseph K. Bradley The new ChiSqSelector for feature selection should have a section in the Programming Guide. It should probably be under the feature extraction and transformation section as a new subsection for feature selection. If we get more feature selection methods later on, we could expand it to a larger section of the guide. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-1476) 2GB limit in spark for blocks
[ https://issues.apache.org/jira/browse/SPARK-1476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14328248#comment-14328248 ] Marcelo Vanzin commented on SPARK-1476: --- Hi [~irashid], Approach sounds good. It would be nice to measure whether the optimization for smaller blocks actually makes a difference; from what I can tell, supporting multiple ByteBuffer instances just means having an array and picking the right ByteBuffer based on an offset, both of which should be pretty cheap. 2GB limit in spark for blocks - Key: SPARK-1476 URL: https://issues.apache.org/jira/browse/SPARK-1476 Project: Spark Issue Type: Improvement Components: Spark Core Environment: all Reporter: Mridul Muralidharan Assignee: Mridul Muralidharan Priority: Critical Attachments: 2g_fix_proposal.pdf The underlying abstraction for blocks in spark is a ByteBuffer : which limits the size of the block to 2GB. This has implication not just for managed blocks in use, but also for shuffle blocks (memory mapped blocks are limited to 2gig, even though the api allows for long), ser-deser via byte array backed outstreams (SPARK-1391), etc. This is a severe limitation for use of spark when used on non trivial datasets. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
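For concreteness, the multiple-ByteBuffer scheme discussed above could look roughly like this sketch (hypothetical names, not Spark code): a large block is backed by an array of fixed-size chunks, and a global offset is resolved to a chunk plus an offset within it.
{code}
import java.nio.ByteBuffer

// Illustrative-only: back a block larger than 2GB with fixed-size chunks.
class ChunkedBuffer(chunks: Array[ByteBuffer], chunkSize: Int) {
  require(chunkSize > 0, "chunk size must be positive")

  // Read one byte at a logical offset that may exceed 2GB.
  def get(offset: Long): Byte = {
    val chunkIndex = (offset / chunkSize).toInt  // which chunk holds the byte
    val chunkOffset = (offset % chunkSize).toInt // position inside that chunk
    chunks(chunkIndex).get(chunkOffset)
  }

  // Total logical size across all chunks.
  def size: Long = chunks.map(_.limit.toLong).sum
}
{code}
As the comment notes, the per-access cost is one division and one array lookup, which supports the suggestion that the small-block optimization may not be worth measuring around.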
[jira] [Created] (SPARK-5918) Spark Thrift server reports metadata for VARCHAR column as STRING in result set schema
Holman Lan created SPARK-5918: - Summary: Spark Thrift server reports metadata for VARCHAR column as STRING in result set schema Key: SPARK-5918 URL: https://issues.apache.org/jira/browse/SPARK-5918 Project: Spark Issue Type: Bug Affects Versions: 1.2.0, 1.1.1 Reporter: Holman Lan This is reproducible using the open source JDBC driver by executing a query that returns a VARCHAR column and then retrieving the result set metadata. The type name returned by the JDBC driver is VARCHAR, which is expected, but the column type is reported as string[12] and the precision/column length as 2147483647 (which is what the JDBC driver would return for a STRING column), even though we created a VARCHAR column with a max length of 1000. Further investigation indicates the GetResultSetMetadata Thrift client API call returns the incorrect metadata. We have confirmed this behaviour in versions 1.1.1 and 1.2.0. We have not yet tested this against 1.2.1 but will do so and report our findings. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
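A reproduction sketch over the Hive JDBC driver (the connection URL, table, and column names are hypothetical; assumes a table created with a VARCHAR(1000) column named name):
{code}
import java.sql.DriverManager

// Connect to a local Spark Thrift server (URL is illustrative).
val conn = DriverManager.getConnection("jdbc:hive2://localhost:10000/default")
val rs = conn.createStatement().executeQuery("SELECT name FROM people LIMIT 1")
val md = rs.getMetaData

println(md.getColumnTypeName(1)) // "VARCHAR", as expected
println(md.getColumnType(1))     // 12 (java.sql.Types.VARCHAR), but with STRING semantics
println(md.getPrecision(1))      // 2147483647 instead of the declared 1000

conn.close()
{code}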
[jira] [Commented] (SPARK-5879) spark_ec2.py should expose/return master and slave lists (e.g. write to file)
[ https://issues.apache.org/jira/browse/SPARK-5879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14328612#comment-14328612 ] Florian Verhein commented on SPARK-5879: cc [~shivaram], any opinions on how to best do this? spark_ec2.py should expose/return master and slave lists (e.g. write to file) - Key: SPARK-5879 URL: https://issues.apache.org/jira/browse/SPARK-5879 Project: Spark Issue Type: Improvement Components: EC2 Reporter: Florian Verhein After running spark_ec2.py, it is often useful/necessary to know the master's ip / dn, particularly if running spark_ec2.py is part of a larger pipeline. For example, consider a wrapper that launches a cluster, then waits for completion of some application running on it (e.g. polling via ssh), before destroying the cluster. Some options: - write `launch-variables.sh` with MASTERS and SLAVES exports (i.e. basically a subset of the ec2_variables.sh that is temporarily created as part of deploy_files variable substitution) - launch-variables.json (same info but as json) Both would be useful depending on the wrapper language. I think we should incorporate the cluster name for the case that multiple clusters are launched. E.g. cluster_name_variables.sh/.json Thoughts? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4144) Support incremental model training of Naive Bayes classifier
[ https://issues.apache.org/jira/browse/SPARK-4144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14328569#comment-14328569 ] Jatinpreet Singh commented on SPARK-4144: - Hi, I have been waiting for this feature to be included. It would be great if this can be done. Thanks, Jatin Support incremental model training of Naive Bayes classifier Key: SPARK-4144 URL: https://issues.apache.org/jira/browse/SPARK-4144 Project: Spark Issue Type: Improvement Components: MLlib, Streaming Reporter: Chris Fregly Assignee: Liquan Pei Per Xiangrui Meng from the following user list discussion: http://mail-archives.apache.org/mod_mbox/spark-user/201408.mbox/%3CCAJgQjQ_QjMGO=jmm8weq1v8yqfov8du03abzy7eeavgjrou...@mail.gmail.com%3E For Naive Bayes, we need to update the priors and conditional probabilities, which means we should also remember the number of observations for the updates. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
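To make the update requirement above concrete, here is a sketch of the state an incremental multinomial Naive Bayes would need to remember (all names are hypothetical, not MLlib APIs): raw counts rather than derived probabilities, so a new batch of observations can be folded in.
{code}
// Incremental Naive Bayes must retain counts, not just log-probabilities.
case class NaiveBayesState(
    classCounts: Map[Double, Long],            // number of observations per label
    featureSums: Map[Double, Array[Double]]) { // per-label feature count totals

  // Fold one labeled observation into the running counts.
  def update(label: Double, features: Array[Double]): NaiveBayesState = {
    val counts = classCounts.updated(label, classCounts.getOrElse(label, 0L) + 1L)
    val old = featureSums.getOrElse(label, Array.fill(features.length)(0.0))
    val sums = featureSums.updated(label, old.zip(features).map { case (a, b) => a + b })
    NaiveBayesState(counts, sums)
  }

  // Priors are recomputed on demand from the remembered counts.
  def priors: Map[Double, Double] = {
    val total = classCounts.values.sum.toDouble
    classCounts.map { case (label, n) => label -> n / total }
  }
}
{code}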
[jira] [Commented] (SPARK-4655) Split Stage into ShuffleMapStage and ResultStage subclasses
[ https://issues.apache.org/jira/browse/SPARK-4655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14328659#comment-14328659 ] Apache Spark commented on SPARK-4655: - User 'ilganeli' has created a pull request for this issue: https://github.com/apache/spark/pull/4703 Split Stage into ShuffleMapStage and ResultStage subclasses --- Key: SPARK-4655 URL: https://issues.apache.org/jira/browse/SPARK-4655 Project: Spark Issue Type: Sub-task Components: Spark Core Reporter: Josh Rosen Assignee: Ilya Ganelin The scheduler's {{Stage}} class has many fields which are only applicable to result stages or shuffle map stages. As a result, I think that it makes sense to make {{Stage}} into an abstract base class with two subclasses, {{ResultStage}} and {{ShuffleMapStage}}. This would improve the understandability of the DAGScheduler code. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
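The proposed shape, as a sketch (illustrative fields only; the real DAGScheduler classes carry much more state than this):
{code}
// Common fields stay on the abstract base class.
abstract class Stage(val id: Int, val numTasks: Int)

// Stages that write shuffle output keep shuffle-specific state here.
class ShuffleMapStage(id: Int, numTasks: Int, val shuffleDepId: Int)
  extends Stage(id, numTasks)

// The final stage of a job, which computes the result of an action.
class ResultStage(id: Int, numTasks: Int) extends Stage(id, numTasks)
{code}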
[jira] [Commented] (SPARK-5912) Programming guide for feature selection
[ https://issues.apache.org/jira/browse/SPARK-5912?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14328255#comment-14328255 ] Joseph K. Bradley commented on SPARK-5912: -- Sure, can you please follow the examples in [https://github.com/apache/spark/blob/master/docs/mllib-feature-extraction.md], which generates into [http://spark.apache.org/docs/latest/mllib-feature-extraction.html]? I'd add a new subsection at the level of the other algorithms (TF-IDF, Word2Vec, etc.). There can be Scala/Java examples but we can of course skip Python since that API isn't available yet. To see what it looks like on your machine, you can compile the docs using the instructions here: [https://github.com/apache/spark/tree/master/docs] Let me know if you run into questions. Thanks! Programming guide for feature selection --- Key: SPARK-5912 URL: https://issues.apache.org/jira/browse/SPARK-5912 Project: Spark Issue Type: Documentation Components: Documentation, MLlib Affects Versions: 1.3.0 Reporter: Joseph K. Bradley The new ChiSqSelector for feature selection should have a section in the Programming Guide. It should probably be under the feature extraction and transformation section as a new subsection for feature selection. If we get more feature selection methods later on, we could expand it to a larger section of the guide. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
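A snippet of the kind the new subsection might contain, written against the MLlib 1.3 ChiSqSelector API (sc is assumed to be an existing SparkContext; the data is a toy set):
{code}
import org.apache.spark.mllib.feature.ChiSqSelector
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

// ChiSqSelector expects categorical (e.g. binned) feature values.
val data = sc.parallelize(Seq(
  LabeledPoint(0.0, Vectors.dense(8.0, 7.0, 0.0)),
  LabeledPoint(1.0, Vectors.dense(0.0, 9.0, 6.0))))

// Keep the 2 features most predictive of the label by the chi-squared test.
val selector = new ChiSqSelector(2)
val model = selector.fit(data)
val filtered = data.map(lp => LabeledPoint(lp.label, model.transform(lp.features)))
{code}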
[jira] [Updated] (SPARK-5914) Spark-submit cannot execute without machine admin permission on windows
[ https://issues.apache.org/jira/browse/SPARK-5914?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-5914: - Component/s: (was: Spark Core) Windows Spark Submit Yes of course you are not expected to run as admin. It'd be good to find a way to set the permissions correctly. I don't know how well Java plays with Windows file permissions though? Spark-submit cannot execute without machine admin permission on windows --- Key: SPARK-5914 URL: https://issues.apache.org/jira/browse/SPARK-5914 Project: Spark Issue Type: Bug Components: Spark Submit, Windows Environment: Windows Reporter: Judy Nash Priority: Minor On the Windows platform only: if the slave is executed with user permissions, spark-submit fails with java.lang.ClassNotFoundException when attempting to read the cached jar from the spark_home\work folder. This is because the jars do not have read permission set by default on Windows. The fix is to add read permission explicitly for the owner of the file. Having the service account run as admin (the equivalent of sudo on Linux) is a major security risk for production clusters. This makes it easy for hackers to compromise the cluster by taking over the service account. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
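For what it's worth, plain java.io.File can grant owner-only read permission, which is one way the proposed fix could be expressed from the JVM (the path below is illustrative):
{code}
import java.io.File

// Grant read permission to the file owner only, without widening access
// for other users. Path is a made-up example of a cached worker jar.
val jar = new File("""C:\spark\work\app-20150220\0\myapp.jar""")
if (!jar.canRead) {
  val ok = jar.setReadable(true, true) // (readable, ownerOnly)
  if (!ok) println(s"Could not set read permission on ${jar.getPath}")
}
{code}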
[jira] [Resolved] (SPARK-5900) Wrap the results returned by PIC and FPGrowth in case classes
[ https://issues.apache.org/jira/browse/SPARK-5900?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-5900. -- Resolution: Fixed Fix Version/s: 1.3.0 Issue resolved by pull request 4695 [https://github.com/apache/spark/pull/4695] Wrap the results returned by PIC and FPGrowth in case classes - Key: SPARK-5900 URL: https://issues.apache.org/jira/browse/SPARK-5900 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 1.3.0 Reporter: Xiangrui Meng Assignee: Xiangrui Meng Fix For: 1.3.0 We return tuples in the current version of PIC and FPGrowth. This is not very Java-friendly because the primitive types are translated into Objects. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
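The tuple-to-case-class change is easy to picture. A minimal sketch follows; the class name matches the FreqItemset wrapper discussed for FPGrowth, but treat the exact shape as illustrative:
{code}
// Before: results as tuples; in Java these surface as pairs of Objects
// with boxed primitives.
val tupleResults: Seq[(Array[String], Long)] = Seq((Array("a", "b"), 42L))

// After: a small wrapper class gives named, typed accessors in both languages.
class FreqItemset[Item](val items: Array[Item], val freq: Long)

val wrapped: Seq[FreqItemset[String]] =
  tupleResults.map { case (items, freq) => new FreqItemset(items, freq) }
{code}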
[jira] [Created] (SPARK-5909) Add a clearCache command to Spark SQL's cache manager
Yin Huai created SPARK-5909: --- Summary: Add a clearCache command to Spark SQL's cache manager Key: SPARK-5909 URL: https://issues.apache.org/jira/browse/SPARK-5909 Project: Spark Issue Type: Task Components: SQL Reporter: Yin Huai This command will clear all cached data from the in-memory cache, which will be useful when users want to quickly clear the cache, or as a workaround for cases like SPARK-5881. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
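Assuming the command surfaces as a clearCache() call on SQLContext (table names below are hypothetical), usage would look roughly like this sketch:
{code}
// Populate the in-memory cache.
sqlContext.cacheTable("t1")
sqlContext.sql("CACHE TABLE t2 AS SELECT * FROM t1 LIMIT 5")

// Drop everything from the cache in one call, e.g. to work around stale
// entries such as the SPARK-5881 case.
sqlContext.clearCache()
{code}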
[jira] [Commented] (SPARK-5909) Add a clearCache command to Spark SQL's cache manager
[ https://issues.apache.org/jira/browse/SPARK-5909?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14327706#comment-14327706 ] Apache Spark commented on SPARK-5909: - User 'yhuai' has created a pull request for this issue: https://github.com/apache/spark/pull/4694 Add a clearCache command to Spark SQL's cache manager - Key: SPARK-5909 URL: https://issues.apache.org/jira/browse/SPARK-5909 Project: Spark Issue Type: Task Components: SQL Reporter: Yin Huai This command will clear all cached data from the in-memory cache, which will be useful when users want to quickly clear the cache, or as a workaround for cases like SPARK-5881. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5881) RDD remains cached after the table gets overridden by CACHE TABLE
[ https://issues.apache.org/jira/browse/SPARK-5881?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14327696#comment-14327696 ] Yin Huai commented on SPARK-5881: - As mentioned by [~lian cheng], we should also track the table names in the Cache Manager to correctly handle the following case.
{code}
val df1 = sql("SELECT * FROM testData LIMIT 10")
df1.registerTempTable("t1")
// Cache t1 explicitly
sql("CACHE TABLE t1")
// t1 and t2 share the same query plan
sql("CACHE TABLE t2 AS SELECT * FROM testData LIMIT 10")
// Replace t2 with a different query plan
sql("CACHE TABLE t2 AS SELECT * FROM testData LIMIT 5")
{code}
RDD remains cached after the table gets overridden by CACHE TABLE --- Key: SPARK-5881 URL: https://issues.apache.org/jira/browse/SPARK-5881 Project: Spark Issue Type: Bug Components: SQL Reporter: Yin Huai Priority: Blocker
{code}
val rdd = sc.parallelize((1 to 10).map(i => s"""{"a":$i, "b":"str${i}"}"""))
sqlContext.jsonRDD(rdd).registerTempTable("jt")
sqlContext.sql("CACHE TABLE foo AS SELECT * FROM jt")
sqlContext.sql("CACHE TABLE foo AS SELECT a FROM jt")
{code}
After the second CACHE TABLE command, the RDD for the first table still remains in the cache. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5881) RDD remains cached after the table gets overridden by CACHE TABLE
[ https://issues.apache.org/jira/browse/SPARK-5881?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai updated SPARK-5881: Priority: Major (was: Blocker) RDD remains cached after the table gets overridden by CACHE TABLE --- Key: SPARK-5881 URL: https://issues.apache.org/jira/browse/SPARK-5881 Project: Spark Issue Type: Bug Components: SQL Reporter: Yin Huai
{code}
val rdd = sc.parallelize((1 to 10).map(i => s"""{"a":$i, "b":"str${i}"}"""))
sqlContext.jsonRDD(rdd).registerTempTable("jt")
sqlContext.sql("CACHE TABLE foo AS SELECT * FROM jt")
sqlContext.sql("CACHE TABLE foo AS SELECT a FROM jt")
{code}
After the second CACHE TABLE command, the RDD for the first table still remains in the cache. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5907) Selected column from DataFrame should not re-analyze logical plan
[ https://issues.apache.org/jira/browse/SPARK-5907?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14327494#comment-14327494 ] Apache Spark commented on SPARK-5907: - User 'viirya' has created a pull request for this issue: https://github.com/apache/spark/pull/4691 Selected column from DataFrame should not re-analyze logical plan - Key: SPARK-5907 URL: https://issues.apache.org/jira/browse/SPARK-5907 Project: Spark Issue Type: Bug Components: SQL Reporter: Liang-Chi Hsieh Currently, selecting a column from a DataFrame wraps the original logical plan with a Project. When the column is used, the logical plan is analyzed again. For some query plans, re-analyzing has the side effect of increasing expression ids. So when accessing the column, the column's expr and its analyzed plan will point to different expressions. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-5908) Hive udtf with single alias should be resolved correctly
Liang-Chi Hsieh created SPARK-5908: -- Summary: Hive udtf with single alias should be resolved correctly Key: SPARK-5908 URL: https://issues.apache.org/jira/browse/SPARK-5908 Project: Spark Issue Type: Bug Components: SQL Reporter: Liang-Chi Hsieh ResolveUdtfsAlias in hiveUdfs only handles HiveGenericUdtf with multiple aliases. When a single alias is used with HiveGenericUdtf, the alias does not work. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-5907) Selected column from DataFrame should not re-analyze logical plan
[ https://issues.apache.org/jira/browse/SPARK-5907?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Liang-Chi Hsieh closed SPARK-5907. -- Resolution: Duplicate Selected column from DataFrame should not re-analyze logical plan - Key: SPARK-5907 URL: https://issues.apache.org/jira/browse/SPARK-5907 Project: Spark Issue Type: Bug Components: SQL Reporter: Liang-Chi Hsieh Currently, selecting a column from a DataFrame wraps the original logical plan with a Project. When the column is used, the logical plan is analyzed again. For some query plans, re-analyzing has the side effect of increasing expression ids. So when accessing the column, the column's expr and its analyzed plan will point to different expressions. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5900) Wrap the results returned by PIC and FPGrowth in case classes
[ https://issues.apache.org/jira/browse/SPARK-5900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14327787#comment-14327787 ] Apache Spark commented on SPARK-5900: - User 'mengxr' has created a pull request for this issue: https://github.com/apache/spark/pull/4695 Wrap the results returned by PIC and FPGrowth in case classes - Key: SPARK-5900 URL: https://issues.apache.org/jira/browse/SPARK-5900 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 1.3.0 Reporter: Xiangrui Meng Assignee: Xiangrui Meng We return tuples in the current version of PIC and FPGrowth. This is not very Java-friendly because the primitive types are translated into Objects. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-5548) Flaky test: o.a.s.util.AkkaUtilsSuite.remote fetch ssl on - untrusted server
[ https://issues.apache.org/jira/browse/SPARK-5548?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or closed SPARK-5548. Resolution: Fixed Fix Version/s: 1.3.0 Target Version/s: 1.3.0 Closing again https://github.com/apache/spark/pull/4653. Let's hope we won't have to reopen this again. Flaky test: o.a.s.util.AkkaUtilsSuite.remote fetch ssl on - untrusted server Key: SPARK-5548 URL: https://issues.apache.org/jira/browse/SPARK-5548 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.3.0 Reporter: Patrick Wendell Assignee: Jacek Lewandowski Priority: Critical Labels: flaky-test Fix For: 1.3.0
{code}
sbt.ForkMain$ForkError: Expected exception java.util.concurrent.TimeoutException to be thrown, but akka.actor.ActorNotFound was thrown.
  at org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:496)
  at org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1555)
  at org.scalatest.Assertions$class.intercept(Assertions.scala:1004)
  at org.scalatest.FunSuite.intercept(FunSuite.scala:1555)
  at org.apache.spark.util.AkkaUtilsSuite$$anonfun$8.apply$mcV$sp(AkkaUtilsSuite.scala:373)
  at org.apache.spark.util.AkkaUtilsSuite$$anonfun$8.apply(AkkaUtilsSuite.scala:349)
  at org.apache.spark.util.AkkaUtilsSuite$$anonfun$8.apply(AkkaUtilsSuite.scala:349)
  at org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22)
  at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
  at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
  at org.scalatest.Transformer.apply(Transformer.scala:22)
  at org.scalatest.Transformer.apply(Transformer.scala:20)
  at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166)
  at org.scalatest.Suite$class.withFixture(Suite.scala:1122)
  at org.scalatest.FunSuite.withFixture(FunSuite.scala:1555)
  at org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163)
  at org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
  at org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
  at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
  at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175)
  at org.apache.spark.util.AkkaUtilsSuite.org$scalatest$BeforeAndAfterEach$$super$runTest(AkkaUtilsSuite.scala:37)
  at org.scalatest.BeforeAndAfterEach$class.runTest(BeforeAndAfterEach.scala:255)
  at org.apache.spark.util.AkkaUtilsSuite.runTest(AkkaUtilsSuite.scala:37)
  at org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
  at org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
  at org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:413)
  at org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:401)
  at scala.collection.immutable.List.foreach(List.scala:318)
  at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401)
  at org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:396)
  at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:483)
  at org.scalatest.FunSuiteLike$class.runTests(FunSuiteLike.scala:208)
  at org.scalatest.FunSuite.runTests(FunSuite.scala:1555)
  at org.scalatest.Suite$class.run(Suite.scala:1424)
  at org.scalatest.FunSuite.org$scalatest$FunSuiteLike$$super$run(FunSuite.scala:1555)
  at org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212)
  at org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212)
  at org.scalatest.SuperEngine.runImpl(Engine.scala:545)
  at org.scalatest.FunSuiteLike$class.run(FunSuiteLike.scala:212)
  at org.apache.spark.util.AkkaUtilsSuite.org$scalatest$BeforeAndAfterAll$$super$run(AkkaUtilsSuite.scala:37)
  at org.scalatest.BeforeAndAfterAll$class.liftedTree1$1(BeforeAndAfterAll.scala:257)
  at org.scalatest.BeforeAndAfterAll$class.run(BeforeAndAfterAll.scala:256)
  at org.apache.spark.util.AkkaUtilsSuite.run(AkkaUtilsSuite.scala:37)
  at org.scalatest.tools.Framework.org$scalatest$tools$Framework$$runSuite(Framework.scala:462)
  at org.scalatest.tools.Framework$ScalaTestTask.execute(Framework.scala:671)
  at sbt.ForkMain$Run$2.call(ForkMain.java:294)
  at sbt.ForkMain$Run$2.call(ForkMain.java:284)
  at java.util.concurrent.FutureTask.run(FutureTask.java:262)
{code}
[jira] [Resolved] (SPARK-5889) remove pid file in spark-daemon.sh after killing the process.
[ https://issues.apache.org/jira/browse/SPARK-5889?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-5889. -- Resolution: Fixed Fix Version/s: 1.2.2 1.3.0 Issue resolved by pull request 4676 [https://github.com/apache/spark/pull/4676] remove pid file in spark-daemon.sh after killing the process. - Key: SPARK-5889 URL: https://issues.apache.org/jira/browse/SPARK-5889 Project: Spark Issue Type: Bug Components: Deploy Affects Versions: 1.2.1 Reporter: Zhan Zhang Assignee: Zhan Zhang Priority: Minor Fix For: 1.3.0, 1.2.2 Currently, if the thrift server or history server is stopped, the pid file is not deleted. The fix is trivial, but it is important for service checks that rely on the file. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-5914) Spark-submit cannot execute without machine admin permission on windows
Judy Nash created SPARK-5914: Summary: Spark-submit cannot execute without machine admin permission on windows Key: SPARK-5914 URL: https://issues.apache.org/jira/browse/SPARK-5914 Project: Spark Issue Type: Bug Components: Spark Core Environment: Windows Reporter: Judy Nash Priority: Minor On the Windows platform only: if the slave is executed with user permissions, spark-submit fails with java.lang.ClassNotFoundException when attempting to read the cached jar from the spark_home\work folder. This is because the jars do not have read permission set by default on Windows. The fix is to add read permission explicitly for the owner of the file. Having the service account run as admin (the equivalent of sudo on Linux) is a major security risk for production clusters. This makes it easy for hackers to compromise the cluster by taking over the service account. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-5915) Spillable should check every N bytes rather than every 32 elements
Mingyu Kim created SPARK-5915: - Summary: Spillable should check every N bytes rather than every 32 elements Key: SPARK-5915 URL: https://issues.apache.org/jira/browse/SPARK-5915 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Mingyu Kim Spillable currently checks for spill every 32 elements. However, this puts it at risk of OOM if each element is large enough. A better alternative is to check every N bytes accumulated. N should be set to a reasonable value via proper testing. This is a follow-up of SPARK-4808, and was discussed originally in https://github.com/apache/spark/pull/4420. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
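A sketch of the byte-based check (hypothetical names, not the real Spillable internals): instead of estimating size on every 32nd element, trigger an estimate once roughly N bytes have been appended since the last check.
{code}
// The size estimator is passed in so the expensive call stays pluggable.
class ByteCheckedSpillable(checkIntervalBytes: Long, memoryLimit: Long,
                           estimateSize: () => Long) {
  private var bytesSinceLastCheck = 0L

  // Called after each insert; returns true when the caller should spill.
  def maybeSpill(recordSizeEstimate: Long): Boolean = {
    bytesSinceLastCheck += recordSizeEstimate
    if (bytesSinceLastCheck < checkIntervalBytes) {
      false
    } else {
      bytesSinceLastCheck = 0L
      estimateSize() >= memoryLimit // expensive call, amortized over ~N bytes
    }
  }
}
{code}
Unlike the every-32-elements rule, the worst-case memory growth between checks is bounded by N bytes regardless of element size.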
[jira] [Updated] (SPARK-4808) Spark fails to spill with small number of large objects
[ https://issues.apache.org/jira/browse/SPARK-4808?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-4808: - Target Version/s: 1.3.0, 1.4.0 (was: 1.2.1) Spark fails to spill with small number of large objects --- Key: SPARK-4808 URL: https://issues.apache.org/jira/browse/SPARK-4808 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.2, 1.1.0, 1.2.0, 1.2.1 Reporter: Dennis Lawler Spillable's maybeSpill does not allow a spill to occur until at least 1000 elements have been read, and then will only evaluate spilling every 32nd element thereafter. When a small number of very large items is being tracked, out-of-memory conditions may occur. I suspect that this and the every-32nd-element behavior were intended to reduce the impact of the estimateSize() call. This method was extracted into SizeTracker, which implements its own exponential backoff for size estimation, so now we are only avoiding the use of the resulting estimated size. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5915) Spillable should check every N bytes rather than every 32 elements
[ https://issues.apache.org/jira/browse/SPARK-5915?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-5915: - Target Version/s: 1.4.0 Spillable should check every N bytes rather than every 32 elements -- Key: SPARK-5915 URL: https://issues.apache.org/jira/browse/SPARK-5915 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.0.0 Reporter: Mingyu Kim Spillable currently checks for spill every 32 elements. However, this puts it at risk of OOM if each element is large enough. A better alternative is to check every N bytes accumulated. N should be set to a reasonable value via proper testing. This is a follow-up of SPARK-4808, and was discussed originally in https://github.com/apache/spark/pull/4420. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5915) Spillable should check every N bytes rather than every 32 elements
[ https://issues.apache.org/jira/browse/SPARK-5915?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-5915: - Affects Version/s: 1.0.0 Spillable should check every N bytes rather than every 32 elements -- Key: SPARK-5915 URL: https://issues.apache.org/jira/browse/SPARK-5915 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.0.0 Reporter: Mingyu Kim Spillable currently checks for spill every 32 elements. However, this puts it at risk of OOM if each element is large enough. A better alternative is to check every N bytes accumulated. N should be set to a reasonable value via proper testing. This is a follow-up of SPARK-4808, and was discussed originally in https://github.com/apache/spark/pull/4420. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5753) add basic support to JDBCRDD for postgresql types: uuid, hstore, and array
[ https://issues.apache.org/jira/browse/SPARK-5753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14328449#comment-14328449 ] Evan Yu commented on SPARK-5753: Ignore this, commit under wrong ticket add basic support to JDBCRDD for postgresql types: uuid, hstore, and array -- Key: SPARK-5753 URL: https://issues.apache.org/jira/browse/SPARK-5753 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.3.0 Reporter: Ricky Nguyen I recently saw the new JDBCRDD merged into master. Thanks for that, it works pretty well and is really convenient. It would be nice if it could have basic support for a few more types. * uuid (as StringType) * hstore (as MapType). keys and values are both strings. * array (as ArrayType) I have a patch that gets started in this direction. Not sure where or how to write/run tests, but I ran manual tests in spark-shell against my postgres db. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
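For illustration, the mapping the description asks for might look like the following sketch (a hypothetical helper, not the actual patch); Catalyst types are from org.apache.spark.sql.types, and the type names are those the Postgres JDBC driver is assumed to report:
{code}
import org.apache.spark.sql.types._

// Map a Postgres-reported JDBC type name to a Catalyst type.
def postgresCatalystType(typeName: String): Option[DataType] = typeName match {
  case "uuid"   => Some(StringType)                      // uuid as string
  case "hstore" => Some(MapType(StringType, StringType)) // string-to-string map
  case "_text"  => Some(ArrayType(StringType))           // text[] arrays
  case _        => None                                  // fall back to defaults
}
{code}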
[jira] [Created] (SPARK-5917) Distinct is broken
Derrick Burns created SPARK-5917: Summary: Distinct is broken Key: SPARK-5917 URL: https://issues.apache.org/jira/browse/SPARK-5917 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 1.1.1 Environment: Spark 1.1.1 running on YARN 2.4 via Amazon EMR. Reporter: Derrick Burns Priority: Critical I hate to file bugs that are hard to reproduce (by other people), but after spending a full week trying to debug my code, I constructed a scenario where the following assertion FAILS:
{code}
val x: RDD[T] = ...
val y = x.distinct()
assert( y.count() <= x.count() )
{code}
I am at a complete loss as to how this can occur under ANY definition of equality/order unless the RDD underlying x is mutable. Since none of my RDD transforms mutate any existing RDD data and I am reading from immutable sources (data on S3), I conclude that there must be a bug in Spark or I am mutating my data unknowingly. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4682) Consolidate various 'Clock' classes
[ https://issues.apache.org/jira/browse/SPARK-4682?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-4682: - Affects Version/s: 1.2.0 Consolidate various 'Clock' classes --- Key: SPARK-4682 URL: https://issues.apache.org/jira/browse/SPARK-4682 Project: Spark Issue Type: Improvement Components: Spark Core, Streaming Affects Versions: 1.2.0 Reporter: Josh Rosen Fix For: 1.3.0 Spark currently has at least four different {{Clock}} classes for mocking out wall-clock time, most of which are nearly identical. We should replace all of these with one Clock class that lives in the utilities package. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-4682) Consolidate various 'Clock' classes
[ https://issues.apache.org/jira/browse/SPARK-4682?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or closed SPARK-4682. Resolution: Fixed Fix Version/s: 1.3.0 Assignee: Sean Owen Target Version/s: 1.3.0 Consolidate various 'Clock' classes --- Key: SPARK-4682 URL: https://issues.apache.org/jira/browse/SPARK-4682 Project: Spark Issue Type: Improvement Components: Spark Core, Streaming Affects Versions: 1.2.0 Reporter: Josh Rosen Assignee: Sean Owen Fix For: 1.3.0 Spark currently has at least four different {{Clock}} classes for mocking out wall-clock time, most of which are nearly identical. We should replace all of these with one Clock class that lives in the utilities package. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5669) Spark assembly includes incompatibly licensed libgfortran, libgcc code via JBLAS
[ https://issues.apache.org/jira/browse/SPARK-5669?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14328364#comment-14328364 ] Sean Owen commented on SPARK-5669: -- It *should* be fine on the grounds that the native libs are on the classpath and there is no conflict. That said I have not tried it. Are you proposing the new PR for 1.3.0? That would also solve the issue. If not, I would support it if you felt more comfortable restoring the native libs for 1.3.0 instead. Spark assembly includes incompatibly licensed libgfortran, libgcc code via JBLAS Key: SPARK-5669 URL: https://issues.apache.org/jira/browse/SPARK-5669 Project: Spark Issue Type: Bug Components: Build Reporter: Sean Owen Assignee: Sean Owen Priority: Blocker Fix For: 1.3.0 Sorry for Blocker, but it's a license issue. The Spark assembly includes the following, from JBLAS: {code} lib/ lib/static/ lib/static/Mac OS X/ lib/static/Mac OS X/x86_64/ lib/static/Mac OS X/x86_64/libjblas_arch_flavor.jnilib lib/static/Mac OS X/x86_64/sse3/ lib/static/Mac OS X/x86_64/sse3/libjblas.jnilib lib/static/Windows/ lib/static/Windows/x86/ lib/static/Windows/x86/libgfortran-3.dll lib/static/Windows/x86/libgcc_s_dw2-1.dll lib/static/Windows/x86/jblas_arch_flavor.dll lib/static/Windows/x86/sse3/ lib/static/Windows/x86/sse3/jblas.dll lib/static/Windows/amd64/ lib/static/Windows/amd64/libgfortran-3.dll lib/static/Windows/amd64/jblas.dll lib/static/Windows/amd64/libgcc_s_sjlj-1.dll lib/static/Windows/amd64/jblas_arch_flavor.dll lib/static/Linux/ lib/static/Linux/i386/ lib/static/Linux/i386/sse3/ lib/static/Linux/i386/sse3/libjblas.so lib/static/Linux/i386/libjblas_arch_flavor.so lib/static/Linux/amd64/ lib/static/Linux/amd64/sse3/ lib/static/Linux/amd64/sse3/libjblas.so lib/static/Linux/amd64/libjblas_arch_flavor.so {code} Unfortunately the libgfortran and libgcc libraries included for Windows are not licensed in a way that's compatible with Spark and the AL2 -- LGPL at least. It's easy to exclude them. I'm not clear what it does to running on Windows; I assume it can still work but the libs would have to be made available locally and put on the shared library path manually. I don't think there's a package manager as in Linux that would make it easily available. I'm not able to test on Windows. If it doesn't work, the follow-up question is whether that means JBLAS has to be removed on the double, or treated as a known issue for 1.3.0. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-5913) Python API for ChiSqSelector
Joseph K. Bradley created SPARK-5913: Summary: Python API for ChiSqSelector Key: SPARK-5913 URL: https://issues.apache.org/jira/browse/SPARK-5913 Project: Spark Issue Type: New Feature Components: MLlib, PySpark Affects Versions: 1.3.0 Reporter: Joseph K. Bradley Priority: Minor Add a Python API for mllib.feature.ChiSqSelector -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5860) JdbcRDD: overflow on large range with high number of partitions
[ https://issues.apache.org/jira/browse/SPARK-5860?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14328454#comment-14328454 ] Apache Spark commented on SPARK-5860: - User 'hotou' has created a pull request for this issue: https://github.com/apache/spark/pull/4701 JdbcRDD: overflow on large range with high number of partitions --- Key: SPARK-5860 URL: https://issues.apache.org/jira/browse/SPARK-5860 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.2.1 Reporter: Jeroen Warmerdam Priority: Minor
{code}
val jdbcRDD = new JdbcRDD(sc,
  () => DriverManager.getConnection(url, username, password),
  "SELECT id FROM documents WHERE ? <= id AND id <= ?",
  lowerBound = 1131544775L,
  upperBound = 567279358897692673L,
  numPartitions = 20,
  mapRow = r => r.getLong("id")
)
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
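To see where the overflow comes from: JdbcRDD computes partition bounds by, roughly, multiplying the range length by a partition index, and with bounds this far apart that product exceeds Long.MaxValue. Below is a sketch (not the actual PR) of the failure mode and one safe alternative that does the intermediate arithmetic in BigInt:
{code}
val lowerBound = 1131544775L
val upperBound = 567279358897692673L
val numPartitions = 20
val length = 1 + upperBound - lowerBound // still fits in a Long

// Overflows: for most i in 0 until 20, i * length exceeds Long.MaxValue.
// val start = lowerBound + (i * length) / numPartitions

// Safe: intermediate math in BigInt; each final bound fits in a Long
// by construction.
def partitionBounds(i: Int): (Long, Long) = {
  val start = BigInt(lowerBound) + (BigInt(i) * length) / numPartitions
  val end = BigInt(lowerBound) + (BigInt(i + 1) * length) / numPartitions - 1
  (start.toLong, end.toLong)
}
{code}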