[jira] [Commented] (SPARK-5669) Spark assembly includes incompatibly licensed libgfortran, libgcc code via JBLAS
[ https://issues.apache.org/jira/browse/SPARK-5669?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14327133#comment-14327133 ]

Sean Owen commented on SPARK-5669:
----------------------------------
[~mengxr] That just applies to GCC, right? It still wouldn't change the LGPL license of libgfortran. I also don't know whether Spark qualifies, given the definition of "Eligible Compilation Process". My understanding is that, without this exception, anything compiled by GCC would be copyleft, and the exception prevents that. I don't know that it generally allows redistribution of libgcc.

Spark assembly includes incompatibly licensed libgfortran, libgcc code via JBLAS
---------------------------------------------------------------------------------
Key: SPARK-5669
URL: https://issues.apache.org/jira/browse/SPARK-5669
Project: Spark
Issue Type: Bug
Components: Build
Reporter: Sean Owen
Assignee: Sean Owen
Priority: Blocker
Fix For: 1.3.0

Sorry for Blocker, but it's a license issue. The Spark assembly includes the following, from JBLAS:
{code}
lib/
lib/static/
lib/static/Mac OS X/
lib/static/Mac OS X/x86_64/
lib/static/Mac OS X/x86_64/libjblas_arch_flavor.jnilib
lib/static/Mac OS X/x86_64/sse3/
lib/static/Mac OS X/x86_64/sse3/libjblas.jnilib
lib/static/Windows/
lib/static/Windows/x86/
lib/static/Windows/x86/libgfortran-3.dll
lib/static/Windows/x86/libgcc_s_dw2-1.dll
lib/static/Windows/x86/jblas_arch_flavor.dll
lib/static/Windows/x86/sse3/
lib/static/Windows/x86/sse3/jblas.dll
lib/static/Windows/amd64/
lib/static/Windows/amd64/libgfortran-3.dll
lib/static/Windows/amd64/jblas.dll
lib/static/Windows/amd64/libgcc_s_sjlj-1.dll
lib/static/Windows/amd64/jblas_arch_flavor.dll
lib/static/Linux/
lib/static/Linux/i386/
lib/static/Linux/i386/sse3/
lib/static/Linux/i386/sse3/libjblas.so
lib/static/Linux/i386/libjblas_arch_flavor.so
lib/static/Linux/amd64/
lib/static/Linux/amd64/sse3/
lib/static/Linux/amd64/sse3/libjblas.so
lib/static/Linux/amd64/libjblas_arch_flavor.so
{code}
Unfortunately, the libgfortran and libgcc libraries included for Windows are not licensed in a way that's compatible with Spark and the AL2 -- they are LGPL at least. It's easy to exclude them. I'm not clear what that does to running on Windows; I assume it can still work, but the libs would have to be made available locally and put on the shared library path manually. I don't think there's a package manager, as on Linux, that would make them easily available. I'm not able to test on Windows. If it doesn't work, the follow-up question is whether that means JBLAS has to be removed on the double, or treated as a known issue for 1.3.0.
[jira] [Created] (SPARK-5910) DataFrame.selectExpr("col as newName") does not work
Yin Huai created SPARK-5910:
----------------------------
Summary: DataFrame.selectExpr("col as newName") does not work
Key: SPARK-5910
URL: https://issues.apache.org/jira/browse/SPARK-5910
Project: Spark
Issue Type: Bug
Components: SQL
Reporter: Yin Huai
Priority: Blocker

{code}
val rdd = sc.parallelize((1 to 10).map(i => s"""{"a":$i, "b":"str${i}"}"""))
sqlContext.jsonRDD(rdd).selectExpr("a as newName")
{code}
{code}
java.lang.RuntimeException: [1.3] failure: ``or'' expected but `as' found

a as newName
  ^
	at scala.sys.package$.error(package.scala:27)
	at org.apache.spark.sql.catalyst.SqlParser.parseExpression(SqlParser.scala:45)
{code}
For selectExpr, we need to use the projection parser instead of the expression parser (which cannot parse AS).
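Until that parser change lands, a workaround is to alias through the Column API, which bypasses the expression parser entirely; a minimal sketch, assuming the rdd and sqlContext from the report above:
{code}
// Workaround sketch: Column.as performs the rename without any string
// parsing, so the broken expression parser is never invoked.
val df = sqlContext.jsonRDD(rdd)
df.select(df("a").as("newName"))
{code}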
[jira] [Updated] (SPARK-5337) respect spark.task.cpus when launching executors
[ https://issues.apache.org/jira/browse/SPARK-5337?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrew Or updated SPARK-5337:
-----------------------------
Affects Version/s: 1.0.0

respect spark.task.cpus when launching executors
-------------------------------------------------
Key: SPARK-5337
URL: https://issues.apache.org/jira/browse/SPARK-5337
Project: Spark
Issue Type: Bug
Components: Spark Core
Affects Versions: 1.0.0
Reporter: Tao Wang

In standalone mode, we do not respect spark.task.cpus when launching executors, so some executors end up without enough cores to run even a single task.
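To make the failure mode concrete: each task claims spark.task.cpus cores, so an executor granted fewer cores than that can never schedule anything. A minimal, self-contained sketch of the missing guard (names are illustrative, not the actual Master code):
{code}
// Illustrative only: a worker allocation is useful only if the cores it
// grants can host at least one task of spark.task.cpus cores.
def canLaunchExecutor(coresToAssign: Int, taskCpus: Int): Boolean =
  coresToAssign >= taskCpus

// With spark.task.cpus=4, a 2-core grant would produce a useless executor.
assert(!canLaunchExecutor(coresToAssign = 2, taskCpus = 4))
assert(canLaunchExecutor(coresToAssign = 8, taskCpus = 4))
{code}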
[jira] [Closed] (SPARK-2628) Mesos backend throwing unable to find LoginModule
[ https://issues.apache.org/jira/browse/SPARK-2628?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Timothy Chen closed SPARK-2628.
-------------------------------
Resolution: Won't Fix

Mesos backend throwing unable to find LoginModule
-------------------------------------------------
Key: SPARK-2628
URL: https://issues.apache.org/jira/browse/SPARK-2628
Project: Spark
Issue Type: Bug
Components: Mesos
Reporter: Timothy Chen
Assignee: Tim Chen

http://mail-archives.apache.org/mod_mbox/incubator-spark-user/201406.mbox/%3c1401892590126-6927.p...@n3.nabble.com%3E

14/07/22 19:57:59 INFO HttpServer: Starting HTTP Server
14/07/22 19:57:59 ERROR Executor: Uncaught exception in thread Thread[Executor task launch worker-1,5,main]
java.lang.Error: java.io.IOException: failure to login
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1116)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
	at java.lang.Thread.run(Thread.java:636)
Caused by: java.io.IOException: failure to login
	at org.apache.hadoop.security.UserGroupInformation.getLoginUser(UserGroupInformation.java:490)
	at org.apache.hadoop.security.UserGroupInformation.getCurrentUser(UserGroupInformation.java:452)
	at org.apache.spark.deploy.SparkHadoopUtil.runAsUser(SparkHadoopUtil.scala:40)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:176)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
	... 2 more
Caused by: javax.security.auth.login.LoginException: unable to find LoginModule class: org/apache/hadoop/security/UserGroupInformation$HadoopLoginModule
	at javax.security.auth.login.LoginContext.invoke(LoginContext.java:823)
	at javax.security.auth.login.LoginContext.access$000(LoginContext.java:203)
	at javax.security.auth.login.LoginContext$5.run(LoginContext.java:721)
	at javax.security.auth.login.LoginContext$5.run(LoginContext.java:719)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.login.LoginContext.invokeCreatorPriv(LoginContext.java:718)
	at javax.security.auth.login.LoginContext.login(LoginContext.java:590)
	at org.apache.hadoop.security.UserGroupInformation.getLoginUser(UserGroupInformation.java:471)
	... 6 more
14/07/22 19:57:59 ERROR Executor: Uncaught exception in thread Thread[Executor task launch worker-0,5,main]
(an identical stack trace follows for worker-0)
[jira] [Commented] (SPARK-2628) Mesos backend throwing unable to find LoginModule
[ https://issues.apache.org/jira/browse/SPARK-2628?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14327978#comment-14327978 ]

Timothy Chen commented on SPARK-2628:
-------------------------------------
It seems this was fixed after 1.0.4, somewhere in 1.1. Users on versions older than 1.1 can still run into it. Will close this as Won't Fix.

Mesos backend throwing unable to find LoginModule
-------------------------------------------------
Key: SPARK-2628
URL: https://issues.apache.org/jira/browse/SPARK-2628
Project: Spark
Issue Type: Bug
Components: Mesos
Reporter: Timothy Chen
Assignee: Tim Chen

(See the issue description and stack trace quoted in full above.)
[jira] [Commented] (SPARK-5669) Spark assembly includes incompatibly licensed libgfortran, libgcc code via JBLAS
[ https://issues.apache.org/jira/browse/SPARK-5669?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14328010#comment-14328010 ]

Xiangrui Meng commented on SPARK-5669:
--------------------------------------
Yes, we are going to remove JBLAS anyway in 1.4. Having a simple dependency tree is always a good thing. The problem is how we should proceed for branch-1.0/1.1/1.2. If we are covered by this exemption, maybe the only thing we need to do is put up a notice.

This also applies to branch-1.3. We don't call native routines in Spark, but that doesn't mean users don't. With the current solution, users need to supply the original JBLAS jar at runtime to use native routines, and I haven't tested whether that works. So if we are covered by this exemption, the best thing to do might be to revert the patch and put up a notice.

Spark assembly includes incompatibly licensed libgfortran, libgcc code via JBLAS
---------------------------------------------------------------------------------
Key: SPARK-5669
URL: https://issues.apache.org/jira/browse/SPARK-5669
Project: Spark
Issue Type: Bug
Components: Build
Reporter: Sean Owen
Assignee: Sean Owen
Priority: Blocker
Fix For: 1.3.0

(See the full issue description and bundled-library listing quoted above.)
[jira] [Comment Edited] (SPARK-5669) Spark assembly includes incompatibly licensed libgfortran, libgcc code via JBLAS
[ https://issues.apache.org/jira/browse/SPARK-5669?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14328010#comment-14328010 ]

Xiangrui Meng edited comment on SPARK-5669 at 2/19/15 7:43 PM:
---------------------------------------------------------------
Yes, we are going to remove JBLAS anyway in 1.4. Having a simple dependency tree is always a good thing. The problem is how we should proceed for branch-1.0/1.1/1.2. If we are covered by this exemption, maybe the only thing we need to do is to put a notice.

This also applies to branch-1.3. We don't call native routines in Spark but it doesn't mean that users don't. With the current solution, users need to supply the original JBLAS jar at runtime to use native routines, and I haven't tested whether it works or not. So if we are covered by this exemption, the best thing to do might be to revert the patch and put a notice.

was (Author: mengxr):
Yes, we are going to remove JBLAS anyway in 1.4. Having a simple dependency tree is always a good thing. The problem is how we should proceed for branch-1.0/1.1/1.2. If we are covered by this exemption. Maybe the only thing we need to do is to put a notice.

This also applies to branch-1.3. We don't call native routines in Spark but it doesn't mean that users don't. With the current solution, users need to supply the origin JBLAS jar at runtime to use native routines, and I haven't tested whether it works or not. So if we are covered by this exemption, the best thing to do might be to revert the patch and put a notice.

Spark assembly includes incompatibly licensed libgfortran, libgcc code via JBLAS
---------------------------------------------------------------------------------
Key: SPARK-5669
URL: https://issues.apache.org/jira/browse/SPARK-5669
Project: Spark
Issue Type: Bug
Components: Build
Reporter: Sean Owen
Assignee: Sean Owen
Priority: Blocker
Fix For: 1.3.0

(See the full issue description and bundled-library listing quoted above.)
[jira] [Updated] (SPARK-5825) Failure stopping services while command line arguments are too long
[ https://issues.apache.org/jira/browse/SPARK-5825?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrew Or updated SPARK-5825:
-----------------------------
Affects Version/s: 1.0.0

Failure stopping services while command line arguments are too long
--------------------------------------------------------------------
Key: SPARK-5825
URL: https://issues.apache.org/jira/browse/SPARK-5825
Project: Spark
Issue Type: Bug
Components: Deploy
Affects Versions: 1.0.0
Reporter: Cheng Hao
Assignee: Cheng Hao
Priority: Blocker

Stopping a service in `spark-daemon.sh` confirms the process id by fuzzy-matching on the class name; however, this fails if the Java process's argument list is very long (greater than 4096 characters).
[jira] [Closed] (SPARK-5825) Failure stopping services while command line arguments are too long
[ https://issues.apache.org/jira/browse/SPARK-5825?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrew Or closed SPARK-5825.
----------------------------
Resolution: Fixed
Fix Version/s: 1.3.0, 1.2.2

Failure stopping services while command line arguments are too long
--------------------------------------------------------------------
Key: SPARK-5825
URL: https://issues.apache.org/jira/browse/SPARK-5825
Project: Spark
Issue Type: Bug
Components: Deploy
Affects Versions: 1.0.0
Reporter: Cheng Hao
Assignee: Cheng Hao
Priority: Blocker
Fix For: 1.3.0, 1.2.2

Stopping a service in `spark-daemon.sh` confirms the process id by fuzzy-matching on the class name; however, this fails if the Java process's argument list is very long (greater than 4096 characters).
[jira] [Commented] (SPARK-5669) Spark assembly includes incompatibly licensed libgfortran, libgcc code via JBLAS
[ https://issues.apache.org/jira/browse/SPARK-5669?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14328092#comment-14328092 ]

Sean Owen commented on SPARK-5669:
----------------------------------
I do find it confusing. I can see an argument that this is allowed: it *does* meet the exception, because the target work is created *without* GCC, and it is *not* a Category X license case as described in http://www.apache.org/legal/resolved.html#category-x -- even though that list calls out the special exception to the GPL licenses, the problem it identifies regarding derived works is *not* part of the exception terms. If that's true, I don't even see that a notice is required. On those grounds, you could put the binaries back into 1.3. (Yes, it's a moot point in 1.4.)

Your reasoning is that this would save users from having to bring their own JBLAS if they already use JBLAS. But they'll have to in 1.4 anyway, and we've always required programs to bring their own dependencies even when they're also used by Spark. I suppose I'd favor taking that hit earlier rather than later, since it happens anyway, and it lets us be a tiny bit more conservative about the licensing issue. But I do not feel strongly about it. Having said all that, would you rather proceed by just putting the libs back in 1.3?

Spark assembly includes incompatibly licensed libgfortran, libgcc code via JBLAS
---------------------------------------------------------------------------------
Key: SPARK-5669
URL: https://issues.apache.org/jira/browse/SPARK-5669
Project: Spark
Issue Type: Bug
Components: Build
Reporter: Sean Owen
Assignee: Sean Owen
Priority: Blocker
Fix For: 1.3.0

(See the full issue description and bundled-library listing quoted above.)
[jira] [Resolved] (SPARK-5902) PipelineStage.transformSchema should be public, not private
[ https://issues.apache.org/jira/browse/SPARK-5902?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xiangrui Meng resolved SPARK-5902.
----------------------------------
Resolution: Fixed
Fix Version/s: 1.3.0

Issue resolved by pull request 4682
[https://github.com/apache/spark/pull/4682]

PipelineStage.transformSchema should be public, not private
------------------------------------------------------------
Key: SPARK-5902
URL: https://issues.apache.org/jira/browse/SPARK-5902
Project: Spark
Issue Type: Improvement
Components: ML
Affects Versions: 1.3.0
Reporter: Joseph K. Bradley
Assignee: Joseph K. Bradley
Priority: Minor
Fix For: 1.3.0

For users to implement their own PipelineStages, we need to make PipelineStage.transformSchema be public instead of private to ml.
[jira] [Updated] (SPARK-5825) Failure stopping services while command line arguments are too long
[ https://issues.apache.org/jira/browse/SPARK-5825?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Owen updated SPARK-5825:
-----------------------------
Component/s: Deploy  (was: Spark Submit)
Target Version/s: 1.3.0, 1.2.2  (was: 1.3.0)

Failure stopping services while command line arguments are too long
--------------------------------------------------------------------
Key: SPARK-5825
URL: https://issues.apache.org/jira/browse/SPARK-5825
Project: Spark
Issue Type: Bug
Components: Deploy
Reporter: Cheng Hao
Assignee: Cheng Hao
Priority: Blocker

Stopping a service in `spark-daemon.sh` confirms the process id by fuzzy-matching on the class name; however, this fails if the Java process's argument list is very long (greater than 4096 characters).
[jira] [Comment Edited] (SPARK-5837) HTTP 500 when trying to access Spark UI in yarn-cluster or yarn-client mode
[ https://issues.apache.org/jira/browse/SPARK-5837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14327182#comment-14327182 ]

Rok Roskar edited comment on SPARK-5837 at 2/19/15 9:59 AM:
------------------------------------------------------------
this looks to perhaps be a related yarn issue: https://issues.apache.org/jira/browse/YARN-2713 -- though I don't know if this is why the ApplicationMaster link results in a connection refused error

was (Author: rok):
this looks to be a yarn issue: https://issues.apache.org/jira/browse/YARN-2713

HTTP 500 when trying to access Spark UI in yarn-cluster or yarn-client mode
----------------------------------------------------------------------------
Key: SPARK-5837
URL: https://issues.apache.org/jira/browse/SPARK-5837
Project: Spark
Issue Type: Bug
Components: Web UI
Affects Versions: 1.2.0, 1.2.1
Reporter: Marco Capuccini

Both Spark 1.2.0 and Spark 1.2.1 return this error when I try to access the Spark UI while running on YARN (version 2.4.0):

HTTP ERROR 500
Problem accessing /proxy/application_1423564210894_0017/. Reason: Connection refused

Caused by: java.net.ConnectException: Connection refused
	at java.net.PlainSocketImpl.socketConnect(Native Method)
	at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:339)
	at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:200)
	at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:182)
	at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
	at java.net.Socket.connect(Socket.java:579)
	at java.net.Socket.connect(Socket.java:528)
	at java.net.Socket.<init>(Socket.java:425)
	at java.net.Socket.<init>(Socket.java:280)
	at org.apache.commons.httpclient.protocol.DefaultProtocolSocketFactory.createSocket(DefaultProtocolSocketFactory.java:80)
	at org.apache.commons.httpclient.protocol.DefaultProtocolSocketFactory.createSocket(DefaultProtocolSocketFactory.java:122)
	at org.apache.commons.httpclient.HttpConnection.open(HttpConnection.java:707)
	at org.apache.commons.httpclient.HttpMethodDirector.executeWithRetry(HttpMethodDirector.java:387)
	at org.apache.commons.httpclient.HttpMethodDirector.executeMethod(HttpMethodDirector.java:171)
	at org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:397)
	at org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:346)
	at org.apache.hadoop.yarn.server.webproxy.WebAppProxyServlet.proxyLink(WebAppProxyServlet.java:187)
	at org.apache.hadoop.yarn.server.webproxy.WebAppProxyServlet.doGet(WebAppProxyServlet.java:344)
	at javax.servlet.http.HttpServlet.service(HttpServlet.java:707)
	at javax.servlet.http.HttpServlet.service(HttpServlet.java:820)
	at org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:511)
	at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1221)
	at com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:66)
	at com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:900)
	at com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:834)
	at org.apache.hadoop.yarn.server.resourcemanager.webapp.RMWebAppFilter.doFilter(RMWebAppFilter.java:79)
	at com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:795)
	at com.google.inject.servlet.FilterDefinition.doFilter(FilterDefinition.java:163)
	at com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:58)
	at com.google.inject.servlet.ManagedFilterPipeline.dispatch(ManagedFilterPipeline.java:118)
	at com.google.inject.servlet.GuiceFilter.doFilter(GuiceFilter.java:113)
	at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
	at org.apache.hadoop.http.lib.StaticUserWebFilter$StaticUserFilter.doFilter(StaticUserWebFilter.java:109)
	at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
	at org.apache.hadoop.http.HttpServer2$QuotingInputFilter.doFilter(HttpServer2.java:1192)
	at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
	at org.apache.hadoop.http.NoCacheFilter.doFilter(NoCacheFilter.java:45)
	at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
	at org.apache.hadoop.http.NoCacheFilter.doFilter(NoCacheFilter.java:45)
	at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
	...
[jira] [Updated] (SPARK-5889) remove pid file in spark-daemon.sh after killing the process.
[ https://issues.apache.org/jira/browse/SPARK-5889?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Owen updated SPARK-5889:
-----------------------------
Priority: Minor  (was: Major)
Target Version/s: 1.3.0, 1.2.2
Affects Version/s: 1.2.1
Assignee: Zhan Zhang

remove pid file in spark-daemon.sh after killing the process
-------------------------------------------------------------
Key: SPARK-5889
URL: https://issues.apache.org/jira/browse/SPARK-5889
Project: Spark
Issue Type: Bug
Affects Versions: 1.2.1
Reporter: Zhan Zhang
Assignee: Zhan Zhang
Priority: Minor

Currently, if the thrift server or history server is stopped, the pid file is not deleted. The fix is trivial, but it is important for service checks that rely on the file.
[jira] [Commented] (SPARK-5889) remove pid file in spark-daemon.sh after killing the process.
[ https://issues.apache.org/jira/browse/SPARK-5889?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14327238#comment-14327238 ]

Sean Owen commented on SPARK-5889:
----------------------------------
Yeah, I wanted to do this in the original PR, although I think there's a small potential problem: what if {{kill}} fails? Then you lose the PID file. In that case a lot of bets are off anyway, and it's not clear that subsequent retries would succeed. Still, since the script already handles old PID files (or at least tries to), I wonder if this can be slightly more conservative and only remove the file if {{kill}} succeeds?

remove pid file in spark-daemon.sh after killing the process
-------------------------------------------------------------
Key: SPARK-5889
URL: https://issues.apache.org/jira/browse/SPARK-5889
Project: Spark
Issue Type: Bug
Affects Versions: 1.2.1
Reporter: Zhan Zhang

Currently, if the thrift server or history server is stopped, the pid file is not deleted. The fix is trivial, but it is important for service checks that rely on the file.
[jira] [Resolved] (SPARK-5899) Viewing specific stage information which contains thousands of tasks will freak out the driver and the CPU cores where it runs
[ https://issues.apache.org/jira/browse/SPARK-5899?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Owen resolved SPARK-5899.
------------------------------
Resolution: Duplicate

Viewing specific stage information which contains thousands of tasks will freak out the driver and the CPU cores where it runs
--------------------------------------------------------------------------------------------------------------------------------
Key: SPARK-5899
URL: https://issues.apache.org/jira/browse/SPARK-5899
Project: Spark
Issue Type: Bug
Components: Web UI
Affects Versions: 1.3.0, 1.2.1
Environment: CentOS 7, Spark Standalone
Reporter: Mark Khaitman
Priority: Critical

If a user ever tries to view the stats for a specific stage -- for example, a repartition stage involving thousands of partitions -- the Web UI attempts to load every single task result onto a single web page, which completely destroys CPU usage on the driver and in turn makes the remaining tasks/jobs nearly impossible to complete. Ideally the task results should be paged (if it's not too much trouble) to prevent this from happening.
[jira] [Commented] (SPARK-5837) HTTP 500 when trying to access Spark UI in yarn-cluster or yarn-client mode
[ https://issues.apache.org/jira/browse/SPARK-5837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14327182#comment-14327182 ]

Rok Roskar commented on SPARK-5837:
-----------------------------------
this looks to be a yarn issue: https://issues.apache.org/jira/browse/YARN-2713

HTTP 500 when trying to access Spark UI in yarn-cluster or yarn-client mode
----------------------------------------------------------------------------
Key: SPARK-5837
URL: https://issues.apache.org/jira/browse/SPARK-5837
Project: Spark
Issue Type: Bug
Components: Web UI
Affects Versions: 1.2.0, 1.2.1
Reporter: Marco Capuccini

(See the issue description and proxy stack trace quoted in full above.)
[jira] [Updated] (SPARK-5889) remove pid file in spark-daemon.sh after killing the process.
[ https://issues.apache.org/jira/browse/SPARK-5889?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Owen updated SPARK-5889:
-----------------------------
Component/s: Deploy

remove pid file in spark-daemon.sh after killing the process
-------------------------------------------------------------
Key: SPARK-5889
URL: https://issues.apache.org/jira/browse/SPARK-5889
Project: Spark
Issue Type: Bug
Components: Deploy
Affects Versions: 1.2.1
Reporter: Zhan Zhang
Assignee: Zhan Zhang
Priority: Minor

Currently, if the thrift server or history server is stopped, the pid file is not deleted. The fix is trivial, but it is important for service checks that rely on the file.
[jira] [Commented] (SPARK-1476) 2GB limit in spark for blocks
[ https://issues.apache.org/jira/browse/SPARK-1476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14327394#comment-14327394 ]

Imran Rashid commented on SPARK-1476:
-------------------------------------
Based on discussion on the dev list, [~mridulm80] isn't actively working on this. I'd like to start on it, with the following very minimal goals:
1. Make it *possible* for blocks to be bigger than 2GB
2. Maintain performance on smaller blocks
i.e., I'm not going to try to do anything fancy to optimize performance for the large blocks. To that end, my plan is to:
1. Create a {{LargeByteBuffer}} interface, which just has the same methods we use on {{ByteBuffer}}
2. Have one implementation that simply wraps one {{ByteBuffer}}, and another which wraps a completely static set of {{ByteBuffer}}s (e.g., if you map a 3 GB file, it will just immediately map it as 2 {{ByteBuffer}}s -- nothing fancy like mapping only the first half of the file until the second is needed, etc.); a sketch of this follows after this message
3. Change {{ByteBuffer}} to {{LargeByteBuffer}} in {{ShuffleBlockManager}} and {{BlockStore}}
I see that about a year back there was a lot of discussion on this, and some alternate proposals. I'd like to push forward with a POC to try to move the discussion along again. I know there was some discussion about how important this is, and whether or not we want to support it. IMO this is a big limitation and results in a lot of frustration for users; we really need a solution for this.

2GB limit in spark for blocks
-----------------------------
Key: SPARK-1476
URL: https://issues.apache.org/jira/browse/SPARK-1476
Project: Spark
Issue Type: Improvement
Components: Spark Core
Environment: all
Reporter: Mridul Muralidharan
Assignee: Mridul Muralidharan
Priority: Critical
Attachments: 2g_fix_proposal.pdf

The underlying abstraction for blocks in Spark is a ByteBuffer, which limits the size of a block to 2GB. This has implications not just for managed blocks in use, but also for shuffle blocks (memory-mapped blocks are limited to 2GB, even though the API allows for long), ser/deser via byte-array-backed output streams (SPARK-1391), etc. This is a severe limitation for use of Spark on non-trivial datasets.
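A minimal sketch of that proposed abstraction, with illustrative names and only a couple of representative methods (the real interface would mirror more of {{ByteBuffer}}):
{code}
import java.nio.ByteBuffer

// Illustrative interface: the same read operations as ByteBuffer, but with
// Long-based sizes so a block can exceed 2GB.
trait LargeByteBuffer {
  def get(): Byte
  def remaining(): Long
}

// Case 1: wraps a single ByteBuffer (the common, under-2GB case).
class WrappedLargeByteBuffer(buf: ByteBuffer) extends LargeByteBuffer {
  override def get(): Byte = buf.get()
  override def remaining(): Long = buf.remaining().toLong
}

// Case 2: a completely static chain of ByteBuffers, e.g. a 3GB file mapped
// eagerly as two chunks; reads advance through the chunks in order.
class ChainedLargeByteBuffer(chunks: Array[ByteBuffer]) extends LargeByteBuffer {
  private var idx = 0
  override def get(): Byte = {
    while (idx < chunks.length - 1 && !chunks(idx).hasRemaining) idx += 1
    chunks(idx).get() // throws BufferUnderflowException once fully drained
  }
  override def remaining(): Long =
    chunks.drop(idx).map(_.remaining().toLong).sum
}
{code}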
[jira] [Commented] (SPARK-5494) SparkSqlSerializer Ignores KryoRegistrators
[ https://issues.apache.org/jira/browse/SPARK-5494?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14327657#comment-14327657 ]

Apache Spark commented on SPARK-5494:
-------------------------------------
User 'hkothari' has created a pull request for this issue:
https://github.com/apache/spark/pull/4693

SparkSqlSerializer Ignores KryoRegistrators
-------------------------------------------
Key: SPARK-5494
URL: https://issues.apache.org/jira/browse/SPARK-5494
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 1.2.0
Reporter: Hamel Ajay Kothari

We should make SparkSqlSerializer call {{super.newKryo}} before doing any of its custom stuff, in order to make sure it picks up custom KryoRegistrators.
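A sketch of the shape of the fix (simplified; the real class registers many SQL-internal types, and the actual patch may differ). Starting from {{super.newKryo()}} means user KryoRegistrators configured via spark.kryo.registrator are applied before any SQL-specific setup:
{code}
import com.esotericsoftware.kryo.Kryo
import org.apache.spark.SparkConf
import org.apache.spark.serializer.KryoSerializer

class SparkSqlSerializer(conf: SparkConf) extends KryoSerializer(conf) {
  override def newKryo(): Kryo = {
    val kryo = super.newKryo() // honors custom KryoRegistrators
    kryo.setReferences(false)  // example of SQL-specific tuning layered on top
    kryo
  }
}
{code}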
[jira] [Created] (SPARK-5907) Selected column from DataFrame should not re-analyze logical plan
Liang-Chi Hsieh created SPARK-5907:
-----------------------------------
Summary: Selected column from DataFrame should not re-analyze logical plan
Key: SPARK-5907
URL: https://issues.apache.org/jira/browse/SPARK-5907
Project: Spark
Issue Type: Bug
Components: SQL
Reporter: Liang-Chi Hsieh

Currently, selecting a column from a DataFrame wraps the original logical plan in a Project. When the column is used, the logical plan is analyzed again. For some query plans, re-analysis has the side effect of incrementing expression ids, so when the column is accessed, the column's expr and the analyzed plan point to different expressions.
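A hypothetical illustration of the symptom, in the spirit of the report (the DataFrame and column names are assumptions, not taken from the issue):
{code}
// Assuming a SQLContext named sqlContext and some jsonRDD, as in a shell:
val df = sqlContext.jsonRDD(jsonRDD)
val c = df("a")   // wraps the plan in a Project; using it re-analyzes the plan
df.select(c)      // c.expr may now carry a stale expression id and fail to resolve
{code}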
[jira] [Commented] (SPARK-5908) Hive udtf with single alias should be resolved correctly
[ https://issues.apache.org/jira/browse/SPARK-5908?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14327512#comment-14327512 ]

Apache Spark commented on SPARK-5908:
-------------------------------------
User 'viirya' has created a pull request for this issue:
https://github.com/apache/spark/pull/4692

Hive udtf with single alias should be resolved correctly
---------------------------------------------------------
Key: SPARK-5908
URL: https://issues.apache.org/jira/browse/SPARK-5908
Project: Spark
Issue Type: Bug
Components: SQL
Reporter: Liang-Chi Hsieh

ResolveUdtfsAlias in hiveUdfs only considers a HiveGenericUdtf with multiple aliases. When only a single alias is used with a HiveGenericUdtf, the alias does not take effect.
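A hypothetical repro sketch of the two alias forms (the table and column names are assumptions; per the description, only the single-alias form misbehaves):
{code}
// Assuming a HiveContext named hiveContext, a table t with an array column
// `arr`, and a table t2 with a map column `kv`:
hiveContext.sql("SELECT explode(arr) AS item FROM t")    // single alias: broken
hiveContext.sql("SELECT explode(kv) AS (k, v) FROM t2")  // multiple aliases: fine
{code}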
[jira] [Updated] (SPARK-5719) allow daemons to bind to specified host
[ https://issues.apache.org/jira/browse/SPARK-5719?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrew Or updated SPARK-5719:
-----------------------------
Affects Version/s: 1.0.0

allow daemons to bind to specified host
---------------------------------------
Key: SPARK-5719
URL: https://issues.apache.org/jira/browse/SPARK-5719
Project: Spark
Issue Type: Improvement
Components: Deploy
Affects Versions: 1.0.0
Reporter: Tao Wang
Priority: Minor

Currently the web UI binds to 0.0.0.0. When multiple network planes are enabled, we may want to bind the UI port to a specific IP address, so that it is possible to do firewall work (IP filtering). The added config items also work for daemons.
[jira] [Closed] (SPARK-5423) ExternalAppendOnlyMap won't delete temp spilled file if an exception happens while using it
[ https://issues.apache.org/jira/browse/SPARK-5423?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrew Or closed SPARK-5423.
----------------------------
Resolution: Fixed
Fix Version/s: 1.3.0, 1.1.2, 1.2.2
Assignee: Shixiong Zhu
Target Version/s: 1.3.0, 1.1.2, 1.2.2

ExternalAppendOnlyMap won't delete temp spilled file if an exception happens while using it
--------------------------------------------------------------------------------------------
Key: SPARK-5423
URL: https://issues.apache.org/jira/browse/SPARK-5423
Project: Spark
Issue Type: Improvement
Components: Shuffle
Affects Versions: 1.0.0
Reporter: Shixiong Zhu
Assignee: Shixiong Zhu
Fix For: 1.3.0, 1.1.2, 1.2.2

ExternalAppendOnlyMap won't delete its temp spilled file if an exception happens while it is being used. There is already a TODO in the comments:
{code}
// TODO: Ensure this gets called even if the iterator isn't drained.
private def cleanup() {
  batchIndex = batchOffsets.length  // Prevent reading any other batch
  val ds = deserializeStream
  deserializeStream = null
  fileStream = null
  ds.close()
  file.delete()
}
{code}
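One way to make that TODO concrete is to wrap the spill-file iterator so cleanup fires once the iterator is exhausted; a self-contained sketch of the idea (Spark has a similar internal CompletionIterator utility, but whether the actual fix uses it is an assumption, and the exception-mid-iteration case would still need a task-completion hook):
{code}
// Generic stand-in: run `cleanup` exactly once, when `underlying` is drained.
def withCleanup[A](underlying: Iterator[A])(cleanup: () => Unit): Iterator[A] =
  new Iterator[A] {
    private var cleaned = false
    def hasNext: Boolean = {
      val more = underlying.hasNext
      if (!more && !cleaned) { cleaned = true; cleanup() }
      more
    }
    def next(): A = underlying.next()
  }

// Usage sketch: the real callback would close the stream and delete the file.
val spilled = withCleanup(Iterator(1, 2, 3))(() => println("deleting spill file"))
spilled.foreach(println)
{code}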
[jira] [Commented] (SPARK-5775) GenericRow cannot be cast to SpecificMutableRow when nested data and partitioned table
[ https://issues.apache.org/jira/browse/SPARK-5775?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14327938#comment-14327938 ]

Anselme Vignon commented on SPARK-5775:
---------------------------------------
This bug is due to a problem in the table-scan operations, involving both partition columns and complex-type columns. I made a pull request patching the issue here:
https://github.com/apache/spark/pull/4697

GenericRow cannot be cast to SpecificMutableRow when nested data and partitioned table
---------------------------------------------------------------------------------------
Key: SPARK-5775
URL: https://issues.apache.org/jira/browse/SPARK-5775
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 1.2.1
Reporter: Ayoub Benali
Labels: hivecontext, nested, parquet, partition

Using the LOAD sql command in a Hive context to load parquet files into a partitioned table causes exceptions at query time. The bug requires the table to have a column of type *array of struct* and to be *partitioned*. The example below shows how to reproduce the bug; note that if the table is not partitioned, the query works fine.
{noformat}
scala> val data1 = """{"data_array":[{"field1":1,"field2":2}]}"""
scala> val data2 = """{"data_array":[{"field1":3,"field2":4}]}"""
scala> val jsonRDD = sc.makeRDD(data1 :: data2 :: Nil)
scala> val schemaRDD = hiveContext.jsonRDD(jsonRDD)
scala> schemaRDD.printSchema
root
 |-- data_array: array (nullable = true)
 |    |-- element: struct (containsNull = false)
 |    |    |-- field1: integer (nullable = true)
 |    |    |-- field2: integer (nullable = true)
scala> hiveContext.sql("create external table if not exists partitioned_table(data_array ARRAY<STRUCT<field1: INT, field2: INT>>) Partitioned by (date STRING) STORED AS PARQUET Location 'hdfs:///partitioned_table'")
scala> hiveContext.sql("create external table if not exists none_partitioned_table(data_array ARRAY<STRUCT<field1: INT, field2: INT>>) STORED AS PARQUET Location 'hdfs:///none_partitioned_table'")
scala> schemaRDD.saveAsParquetFile("hdfs:///tmp_data_1")
scala> schemaRDD.saveAsParquetFile("hdfs:///tmp_data_2")
scala> hiveContext.sql("LOAD DATA INPATH 'hdfs://qa-hdc001.ffm.nugg.ad:8020/erlogd/tmp_data_1' INTO TABLE partitioned_table PARTITION(date='2015-02-12')")
scala> hiveContext.sql("LOAD DATA INPATH 'hdfs://qa-hdc001.ffm.nugg.ad:8020/erlogd/tmp_data_2' INTO TABLE none_partitioned_table")
scala> hiveContext.sql("select data.field1 from none_partitioned_table LATERAL VIEW explode(data_array) nestedStuff AS data").collect
res23: Array[org.apache.spark.sql.Row] = Array([1], [3])
scala> hiveContext.sql("select data.field1 from partitioned_table LATERAL VIEW explode(data_array) nestedStuff AS data").collect
15/02/12 16:21:03 INFO ParseDriver: Parsing command: select data.field1 from partitioned_table LATERAL VIEW explode(data_array) nestedStuff AS data
15/02/12 16:21:03 INFO ParseDriver: Parse Completed
15/02/12 16:21:03 INFO MemoryStore: ensureFreeSpace(260661) called with curMem=0, maxMem=280248975
15/02/12 16:21:03 INFO MemoryStore: Block broadcast_18 stored as values in memory (estimated size 254.6 KB, free 267.0 MB)
15/02/12 16:21:03 INFO MemoryStore: ensureFreeSpace(28615) called with curMem=260661, maxMem=280248975
15/02/12 16:21:03 INFO MemoryStore: Block broadcast_18_piece0 stored as bytes in memory (estimated size 27.9 KB, free 267.0 MB)
15/02/12 16:21:03 INFO BlockManagerInfo: Added broadcast_18_piece0 in memory on *:51990 (size: 27.9 KB, free: 267.2 MB)
15/02/12 16:21:03 INFO BlockManagerMaster: Updated info of block broadcast_18_piece0
15/02/12 16:21:03 INFO SparkContext: Created broadcast 18 from NewHadoopRDD at ParquetTableOperations.scala:119
15/02/12 16:21:03 INFO FileInputFormat: Total input paths to process : 3
15/02/12 16:21:03 INFO ParquetInputFormat: Total input paths to process : 3
15/02/12 16:21:03 INFO FilteringParquetRowInputFormat: Using Task Side Metadata Split Strategy
15/02/12 16:21:03 INFO SparkContext: Starting job: collect at SparkPlan.scala:84
15/02/12 16:21:03 INFO DAGScheduler: Got job 12 (collect at SparkPlan.scala:84) with 3 output partitions (allowLocal=false)
15/02/12 16:21:03 INFO DAGScheduler: Final stage: Stage 13(collect at SparkPlan.scala:84)
15/02/12 16:21:03 INFO DAGScheduler: Parents of final stage: List()
15/02/12 16:21:03 INFO DAGScheduler: Missing parents: List()
15/02/12 16:21:03 INFO DAGScheduler: Submitting Stage 13 (MappedRDD[111] at map at SparkPlan.scala:84), which has no missing parents
15/02/12 16:21:03 INFO MemoryStore: ensureFreeSpace(7632) called with curMem=289276, maxMem=280248975
15/02/12 16:21:03 INFO MemoryStore: Block broadcast_19 stored as values in memory (estimated size 7.5 KB, free 267.0 MB)
15/02/12 16:21:03 INFO
{noformat}
(The log is truncated here in the digest, before the ClassCastException named in the title.)
[jira] [Commented] (SPARK-5669) Spark assembly includes incompatibly licensed libgfortran, libgcc code via JBLAS
[ https://issues.apache.org/jira/browse/SPARK-5669?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14327953#comment-14327953 ]

Xiangrui Meng commented on SPARK-5669:
--------------------------------------
GFortran is part of GCC (https://gcc.gnu.org/wiki/GFortran), and hence so is the `libgfortran` library. In Apple's libgfortran header file (http://www.opensource.apple.com/source/gcc/gcc-5484/libgfortran/libgfortran.h), I found the following:
{code}
As a special exception, if you link this library with other files,
some of which are compiled with GCC, to produce an executable,
this library does not by itself cause the resulting executable
to be covered by the GNU General Public License.
This exception does not however invalidate any other reasons why
the executable file might be covered by the GNU General Public License.
{code}
The official one links to the special exception page:
https://github.com/gcc-mirror/gcc/blob/master/libgfortran/libgfortran.h#L18

Spark assembly includes incompatibly licensed libgfortran, libgcc code via JBLAS
---------------------------------------------------------------------------------
Key: SPARK-5669
URL: https://issues.apache.org/jira/browse/SPARK-5669
Project: Spark
Issue Type: Bug
Components: Build
Reporter: Sean Owen
Assignee: Sean Owen
Priority: Blocker
Fix For: 1.3.0

(See the full issue description and bundled-library listing quoted above.)
[jira] [Commented] (SPARK-4423) Improve foreach() documentation to avoid confusion between local- and cluster-mode behavior
[ https://issues.apache.org/jira/browse/SPARK-4423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14327833#comment-14327833 ]

Apache Spark commented on SPARK-4423:
-------------------------------------
User 'ilganeli' has created a pull request for this issue:
https://github.com/apache/spark/pull/4696

Improve foreach() documentation to avoid confusion between local- and cluster-mode behavior
--------------------------------------------------------------------------------------------
Key: SPARK-4423
URL: https://issues.apache.org/jira/browse/SPARK-4423
Project: Spark
Issue Type: Improvement
Components: Documentation
Reporter: Josh Rosen
Assignee: Ilya Ganelin

{{foreach}} seems to be a common source of confusion for new users: in {{local}} mode, {{foreach}} can be used to update local variables on the driver, but programs that do this will not work properly when executed on clusters, since {{foreach}} will update per-executor variables (note that this _will_ work correctly for accumulators, but not for other types of mutable objects). Similarly, I've seen users become confused when {{.foreach(println)}} doesn't print to the driver's standard output.

At a minimum, we should improve the documentation to warn users against unsafe uses of {{foreach}} that won't work properly when transitioning from local mode to a real cluster. We might also consider changes to local mode so that its behavior more closely matches the cluster modes; this will require some discussion, though, since any change of behavior here would technically be a user-visible backwards-incompatible change (I don't think that we made any explicit guarantees about the current local-mode behavior, but someone might be relying on the current implicit behavior).
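A minimal sketch of the pitfall and the accumulator-based alternative described above, assuming a SparkContext named sc (e.g. in spark-shell):
{code}
val rdd = sc.parallelize(1 to 100)

// Unsafe: works in local mode only. On a cluster, each executor mutates its
// own copy of `counter`; the driver's variable stays 0.
var counter = 0
rdd.foreach(x => counter += x)

// Cluster-safe: accumulators aggregate updates back to the driver.
val acc = sc.accumulator(0)
rdd.foreach(x => acc += x)
println(acc.value)  // 5050
{code}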
[jira] [Updated] (SPARK-5902) PipelineStage.transformSchema should be public, not private
[ https://issues.apache.org/jira/browse/SPARK-5902?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-5902: - Description: For users to implement their own PipelineStages, we need to make PipelineStage.transformSchema public instead of private to ml. (was: For users to implement their own PipelineStages, we need to make PipelineStage.transformSchema protected instead of private to ml.) PipelineStage.transformSchema should be public, not private --- Key: SPARK-5902 URL: https://issues.apache.org/jira/browse/SPARK-5902 Project: Spark Issue Type: Improvement Components: ML Affects Versions: 1.3.0 Reporter: Joseph K. Bradley Assignee: Joseph K. Bradley Priority: Minor For users to implement their own PipelineStages, we need to make PipelineStage.transformSchema public instead of private to ml. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5423) ExternalAppendOnlyMap won't delete its temp spilled file if an exception occurs while it is being used
[ https://issues.apache.org/jira/browse/SPARK-5423?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-5423: - Affects Version/s: 1.0.0 ExternalAppendOnlyMap won't delete its temp spilled file if an exception occurs while it is being used --- Key: SPARK-5423 URL: https://issues.apache.org/jira/browse/SPARK-5423 Project: Spark Issue Type: Improvement Components: Shuffle Affects Versions: 1.0.0 Reporter: Shixiong Zhu Priority: Minor ExternalAppendOnlyMap won't delete its temp spilled file if an exception occurs while it is being used. There is already a TODO in the comment: {code}
// TODO: Ensure this gets called even if the iterator isn't drained.
private def cleanup() {
  batchIndex = batchOffsets.length // Prevent reading any other batch
  val ds = deserializeStream
  deserializeStream = null
  fileStream = null
  ds.close()
  file.delete()
}
{code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
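One way to address the TODO above, sketched here only as the general pattern rather than the actual fix: wrap consumption of the spill iterator so that cleanup runs whether or not the iterator is drained.
{code}
// Generic pattern only; `cleanup` stands in for the ExternalAppendOnlyMap
// method quoted above, and `it` for its spill-file iterator.
def foreachWithCleanup[A](it: Iterator[A])(f: A => Unit)(cleanup: => Unit): Unit =
  try it.foreach(f)
  finally cleanup // runs on normal completion and on any exception
{code}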
[jira] [Commented] (SPARK-5887) Class not found exception com.datastax.spark.connector.rdd.partitioner.CassandraPartition
[ https://issues.apache.org/jira/browse/SPARK-5887?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14327921#comment-14327921 ] Vijay Pawnarkar commented on SPARK-5887: Thanks! This could be a class loader issue in Spark. The class is present in the connector jar, and the jar is being added to the class loader's list of jars as per the logs. However, the classloader is not able to find it. The property spark.files.userClassPathFirst is documented as being experimental. Debugging this further. Class not found exception com.datastax.spark.connector.rdd.partitioner.CassandraPartition -- Key: SPARK-5887 URL: https://issues.apache.org/jira/browse/SPARK-5887 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.2.1 Environment: Spark 1.2.1 Spark Cassandra Connector 1.2.0 Alpha2 Reporter: Vijay Pawnarkar I am getting the following class not found exception when using Spark 1.2.1 with spark-cassandra-connector_2.10-1.2.0-alpha2. When the job is submitted to Spark, it successfully adds the required connector JAR file to the worker's classpath; the corresponding log entries are included below. From the log statements and from looking at the Spark 1.2.1 codebase, it looks like the JAR gets added to the URLClassLoader via Executor.scala's updateDependencies method. However, when it is time to execute the task, it is not able to resolve the class name.
[task-result-getter-0] WARN org.apache.spark.scheduler.TaskSetManager - Lost task 0.0 in stage 0.0 (TID 0, 127.0.0.1): java.lang.ClassNotFoundException: com.datastax.spark.connector.rdd.partitioner.CassandraPartition
at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:274)
at org.apache.spark.serializer.JavaDeserializationStream$$anon$1.resolveClass(JavaSerializer.scala:59)
at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1612)
at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1517)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1771)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:62)
at org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:87)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:182)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
-- LOG indicating JAR files were added to worker classpath. 
15/02/17 16:56:48 INFO Executor: Fetching http://127.0.0.1:64265/jars/spark-cassandra-connector-java_2.10-1.2.0-alpha2.jar with timestamp 1424210185005 15/02/17 16:56:48 INFO Utils: Fetching http://127.0.0.1:64265/jars/spark-cassandra-connector-java_2.10-1.2.0-alpha2.jar to C:\Users\sparkus\AppData\Local\Temp\spark-10f5e149-5460-4899-9c8f-b19b19bdaf55\spark-fba24b2b-5847-4b04-848c-90677d12ff99\spark-35f5ed4b-041d-40d8-8854-b243787de188\fetchFileTemp4665176275367448514.tmp 15/02/17 16:56:48 DEBUG Utils: fetchFile not using security 15/02/17 16:56:48 INFO Utils: Copying C:\Users\sparkus\AppData\Local\Temp\spark-10f5e149-5460-4899-9c8f-b19b19bdaf55\spark-fba24b2b-5847-4b04-848c-90677d12ff99\spark-35f5ed4b-041d-40d8-8854-b243787de188\16215993091424210185005_cache to C:\localapps\spark-1.2.1-bin-hadoop2.4\work\app-20150217165625-0006\0\.\spark-cassandra-connector-java_2.10-1.2.0-alpha2.jar 15/02/17 16:56:48 INFO Executor: Adding file:/C:/localapps/spark-1.2.1-bin-hadoop2.4/work/app-20150217165625-0006/0/./spark-cassandra-connector-java_2.10-1.2.0-alpha2.jar to class loader 15/02/17 16:56:50 INFO Executor: Fetching http://127.0.0.1:64265/jars/spark-cassandra-connector_2.10-1.2.0-alpha2.jar with timestamp 1424210185012 15/02/17 16:56:50 INFO Utils: Fetching
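The comment above mentions spark.files.userClassPathFirst; a hedged sketch of trying that flag when configuring the job (the jar path is illustrative, and the property was experimental in Spark 1.2):
{code}
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("cassandra-test")
  // Experimental in Spark 1.2: prefer user-added jars when resolving classes.
  .set("spark.files.userClassPathFirst", "true")
  .setJars(Seq("spark-cassandra-connector_2.10-1.2.0-alpha2.jar"))
val sc = new SparkContext(conf)
{code}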
[jira] [Commented] (SPARK-5775) GenericRow cannot be cast to SpecificMutableRow when nested data and partitioned table
[ https://issues.apache.org/jira/browse/SPARK-5775?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14327936#comment-14327936 ] Apache Spark commented on SPARK-5775: - User 'anselmevignon' has created a pull request for this issue: https://github.com/apache/spark/pull/4697 GenericRow cannot be cast to SpecificMutableRow when nested data and partitioned table -- Key: SPARK-5775 URL: https://issues.apache.org/jira/browse/SPARK-5775 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.1 Reporter: Ayoub Benali Labels: hivecontext, nested, parquet, partition Using the LOAD sql command in Hive context to load parquet files into a partitioned table causes exceptions during query time. The bug requires the table to have a column of *type Array of struct* and to be *partitioned*. The example below shows how to reproduce the bug, and you can see that if the table is not partitioned the query works fine. {noformat}
scala> val data1 = """{"data_array":[{"field1":1,"field2":2}]}"""
scala> val data2 = """{"data_array":[{"field1":3,"field2":4}]}"""
scala> val jsonRDD = sc.makeRDD(data1 :: data2 :: Nil)
scala> val schemaRDD = hiveContext.jsonRDD(jsonRDD)
scala> schemaRDD.printSchema
root
 |-- data_array: array (nullable = true)
 |    |-- element: struct (containsNull = false)
 |    |    |-- field1: integer (nullable = true)
 |    |    |-- field2: integer (nullable = true)
scala> hiveContext.sql("create external table if not exists partitioned_table(data_array ARRAY<STRUCT<field1: INT, field2: INT>>) Partitioned by (date STRING) STORED AS PARQUET Location 'hdfs:///partitioned_table'")
scala> hiveContext.sql("create external table if not exists none_partitioned_table(data_array ARRAY<STRUCT<field1: INT, field2: INT>>) STORED AS PARQUET Location 'hdfs:///none_partitioned_table'")
scala> schemaRDD.saveAsParquetFile("hdfs:///tmp_data_1")
scala> schemaRDD.saveAsParquetFile("hdfs:///tmp_data_2")
scala> hiveContext.sql("LOAD DATA INPATH 'hdfs://qa-hdc001.ffm.nugg.ad:8020/erlogd/tmp_data_1' INTO TABLE partitioned_table PARTITION(date='2015-02-12')")
scala> hiveContext.sql("LOAD DATA INPATH 'hdfs://qa-hdc001.ffm.nugg.ad:8020/erlogd/tmp_data_2' INTO TABLE none_partitioned_table")
scala> hiveContext.sql("select data.field1 from none_partitioned_table LATERAL VIEW explode(data_array) nestedStuff AS data").collect
res23: Array[org.apache.spark.sql.Row] = Array([1], [3])
scala> hiveContext.sql("select data.field1 from partitioned_table LATERAL VIEW explode(data_array) nestedStuff AS data").collect
15/02/12 16:21:03 INFO ParseDriver: Parsing command: select data.field1 from partitioned_table LATERAL VIEW explode(data_array) nestedStuff AS data
15/02/12 16:21:03 INFO ParseDriver: Parse Completed
15/02/12 16:21:03 INFO MemoryStore: ensureFreeSpace(260661) called with curMem=0, maxMem=280248975
15/02/12 16:21:03 INFO MemoryStore: Block broadcast_18 stored as values in memory (estimated size 254.6 KB, free 267.0 MB)
15/02/12 16:21:03 INFO MemoryStore: ensureFreeSpace(28615) called with curMem=260661, maxMem=280248975
15/02/12 16:21:03 INFO MemoryStore: Block broadcast_18_piece0 stored as bytes in memory (estimated size 27.9 KB, free 267.0 MB)
15/02/12 16:21:03 INFO BlockManagerInfo: Added broadcast_18_piece0 in memory on *:51990 (size: 27.9 KB, free: 267.2 MB)
15/02/12 16:21:03 INFO BlockManagerMaster: Updated info of block broadcast_18_piece0
15/02/12 16:21:03 INFO SparkContext: Created broadcast 18 from NewHadoopRDD at ParquetTableOperations.scala:119
15/02/12 16:21:03 INFO FileInputFormat: Total input paths to process : 3
15/02/12 16:21:03 INFO ParquetInputFormat: Total input paths to process : 3
15/02/12 16:21:03 INFO FilteringParquetRowInputFormat: Using Task Side Metadata Split Strategy
15/02/12 16:21:03 INFO SparkContext: Starting job: collect at SparkPlan.scala:84
15/02/12 16:21:03 INFO DAGScheduler: Got job 12 (collect at SparkPlan.scala:84) with 3 output partitions (allowLocal=false)
15/02/12 16:21:03 INFO DAGScheduler: Final stage: Stage 13(collect at SparkPlan.scala:84)
15/02/12 16:21:03 INFO DAGScheduler: Parents of final stage: List()
15/02/12 16:21:03 INFO DAGScheduler: Missing parents: List()
15/02/12 16:21:03 INFO DAGScheduler: Submitting Stage 13 (MappedRDD[111] at map at SparkPlan.scala:84), which has no missing parents
15/02/12 16:21:03 INFO MemoryStore: ensureFreeSpace(7632) called with curMem=289276, maxMem=280248975
15/02/12 16:21:03 INFO MemoryStore: Block broadcast_19 stored as values in memory (estimated size 7.5 KB, free 267.0 MB)
15/02/12 16:21:03 INFO MemoryStore: ensureFreeSpace(4230) called with curMem=296908, maxMem=280248975
15/02/12 16:21:03 INFO
[jira] [Updated] (SPARK-5423) ExternalAppendOnlyMap won't delete its temp spilled file if an exception occurs while it is being used
[ https://issues.apache.org/jira/browse/SPARK-5423?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-5423: - Priority: Major (was: Minor) ExternalAppendOnlyMap won't delete its temp spilled file if an exception occurs while it is being used --- Key: SPARK-5423 URL: https://issues.apache.org/jira/browse/SPARK-5423 Project: Spark Issue Type: Improvement Components: Shuffle Affects Versions: 1.0.0 Reporter: Shixiong Zhu ExternalAppendOnlyMap won't delete its temp spilled file if an exception occurs while it is being used. There is already a TODO in the comment: {code}
// TODO: Ensure this gets called even if the iterator isn't drained.
private def cleanup() {
  batchIndex = batchOffsets.length // Prevent reading any other batch
  val ds = deserializeStream
  deserializeStream = null
  fileStream = null
  ds.close()
  file.delete()
}
{code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-5887) Class not found exception com.datastax.spark.connector.rdd.partitioner.CassandraPartition
[ https://issues.apache.org/jira/browse/SPARK-5887?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell resolved SPARK-5887. Resolution: Invalid The Datastax connector is not part of the Apache Spark distribution, it's maintained by Datastax directly. So please reach out to them for support. Thanks! Class not found exception com.datastax.spark.connector.rdd.partitioner.CassandraPartition -- Key: SPARK-5887 URL: https://issues.apache.org/jira/browse/SPARK-5887 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.2.1 Environment: Spark 1.2.1 Spark Cassandra Connector 1.2.0 Alpha2 Reporter: Vijay Pawnarkar I am getting following class not found exception when using Spark 1.2.1 with spark-cassandra-connector_2.10-1.2.0-alpha2. When the job is submitted to Spark.. it successfully adds required connector JAR file to Worker's classpath. Corresponding log entries are also included in following description. From log statements and looking at spark 1.2.1 codebase it looks like the JAR get added to urlClassLoader via Executor.scala's updateDependencies method. However when it time to execute the Task, its not able to resolve the class name. [task-result-getter-0] WARN org.apache.spark.scheduler.TaskSetManager - Lost task 0.0 in stage 0.0 (TID 0, 127.0.0.1): java.lang.ClassNotFoundException: com.datastax.spark.connector.rdd.partitioner.CassandraPartition at java.net.URLClassLoader$1.run(URLClassLoader.java:366) at java.net.URLClassLoader$1.run(URLClassLoader.java:355) at java.security.AccessController.doPrivileged(Native Method) at java.net.URLClassLoader.findClass(URLClassLoader.java:354) at java.lang.ClassLoader.loadClass(ClassLoader.java:425) at java.lang.ClassLoader.loadClass(ClassLoader.java:358) at java.lang.Class.forName0(Native Method) at java.lang.Class.forName(Class.java:274) at org.apache.spark.serializer.JavaDeserializationStream$$anon$1.resolveClass(JavaSerializer.scala:59) at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1612) at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1517) at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1771) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990) at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915) at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370) at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:62) at org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:87) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:182) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) -- LOG indicating JAR files were added to worker classpath. 
15/02/17 16:56:48 INFO Executor: Fetching http://127.0.0.1:64265/jars/spark-cassandra-connector-java_2.10-1.2.0-alpha2.jar with timestamp 1424210185005 15/02/17 16:56:48 INFO Utils: Fetching http://127.0.0.1:64265/jars/spark-cassandra-connector-java_2.10-1.2.0-alpha2.jar to C:\Users\sparkus\AppData\Local\Temp\spark-10f5e149-5460-4899-9c8f-b19b19bdaf55\spark-fba24b2b-5847-4b04-848c-90677d12ff99\spark-35f5ed4b-041d-40d8-8854-b243787de188\fetchFileTemp4665176275367448514.tmp 15/02/17 16:56:48 DEBUG Utils: fetchFile not using security 15/02/17 16:56:48 INFO Utils: Copying C:\Users\sparkus\AppData\Local\Temp\spark-10f5e149-5460-4899-9c8f-b19b19bdaf55\spark-fba24b2b-5847-4b04-848c-90677d12ff99\spark-35f5ed4b-041d-40d8-8854-b243787de188\16215993091424210185005_cache to C:\localapps\spark-1.2.1-bin-hadoop2.4\work\app-20150217165625-0006\0\.\spark-cassandra-connector-java_2.10-1.2.0-alpha2.jar 15/02/17 16:56:48 INFO Executor: Adding file:/C:/localapps/spark-1.2.1-bin-hadoop2.4/work/app-20150217165625-0006/0/./spark-cassandra-connector-java_2.10-1.2.0-alpha2.jar to class loader 15/02/17 16:56:50 INFO Executor: Fetching http://127.0.0.1:64265/jars/spark-cassandra-connector_2.10-1.2.0-alpha2.jar with timestamp 1424210185012 15/02/17 16:56:50 INFO Utils: Fetching http://127.0.0.1:64265/jars/spark-cassandra-connector_2.10-1.2.0-alpha2.jar to
[jira] [Updated] (SPARK-5863) Performance regression in Spark SQL/Parquet due to ScalaReflection.convertRowToScala
[ https://issues.apache.org/jira/browse/SPARK-5863?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-5863: --- Priority: Critical (was: Major) Performance regression in Spark SQL/Parquet due to ScalaReflection.convertRowToScala Key: SPARK-5863 URL: https://issues.apache.org/jira/browse/SPARK-5863 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.0, 1.2.1 Reporter: Cristian Priority: Critical I was doing some perf testing on reading parquet files and noticed that, moving from Spark 1.1 to 1.2, the performance is 3x worse. In the profiler the culprit showed up as ScalaReflection.convertRowToScala. In particular, this zip is the issue: {code} r.toSeq.zip(schema.fields.map(_.dataType)) {code} There is already a comment there noting that this is slow, but it wasn't fixed. This produces a 3x degradation in parquet read performance, at least in my test case. Edit: the map is part of the issue as well. This whole code block is in a tight loop and allocates a new ListBuffer that needs to grow for each transformation. A possible solution is to change to using seq.view, which would allocate iterators instead. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
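To make the allocation point above concrete, a small illustration (not the Spark code itself): the strict zip/map pair materializes intermediate collections on every row, whereas a view fuses the steps and produces elements on demand.
{code}
val row: Seq[Any] = Seq(1, "a", 2.0)
val types: Seq[String] = Seq("int", "string", "double")

// Strict: allocates one collection for the map and another for the zip,
// once per row when run in a tight loop.
val strict = row.zip(types.map(_.toUpperCase))

// Lazy: no intermediate buffers; elements are computed as they are consumed.
val fused = row.view.zip(types.view.map(_.toUpperCase))
{code}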
[jira] [Comment Edited] (SPARK-5887) Class not found exception com.datastax.spark.connector.rdd.partitioner.CassandraPartition
[ https://issues.apache.org/jira/browse/SPARK-5887?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14327921#comment-14327921 ] Vijay Pawnarkar edited comment on SPARK-5887 at 2/19/15 6:35 PM: - Thanks! This could be a class loader issue in Spark. The class is present in the connector jar and the jar is being added to class loader's list of jars as per the logs . However classloader is not able to find it. Property spark.files.userClassPathFirst is documented as being experimental. Debugging this further. Had logged a issue with Datastax as well. https://datastax-oss.atlassian.net/browse/SPARKC-59 was (Author: tech20nn): Thanks! This could be a class loader issue in Spark. The class is present in the connector jar and the jar is being added to class loader's list of jars as per the logs . However classloader is not able to find it. Property spark.files.userClassPathFirst is documented as being experimental. Debugging this further. Class not found exception com.datastax.spark.connector.rdd.partitioner.CassandraPartition -- Key: SPARK-5887 URL: https://issues.apache.org/jira/browse/SPARK-5887 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.2.1 Environment: Spark 1.2.1 Spark Cassandra Connector 1.2.0 Alpha2 Reporter: Vijay Pawnarkar I am getting following class not found exception when using Spark 1.2.1 with spark-cassandra-connector_2.10-1.2.0-alpha2. When the job is submitted to Spark.. it successfully adds required connector JAR file to Worker's classpath. Corresponding log entries are also included in following description. From log statements and looking at spark 1.2.1 codebase it looks like the JAR get added to urlClassLoader via Executor.scala's updateDependencies method. However when it time to execute the Task, its not able to resolve the class name. 
[task-result-getter-0] WARN org.apache.spark.scheduler.TaskSetManager - Lost task 0.0 in stage 0.0 (TID 0, 127.0.0.1): java.lang.ClassNotFoundException: com.datastax.spark.connector.rdd.partitioner.CassandraPartition at java.net.URLClassLoader$1.run(URLClassLoader.java:366) at java.net.URLClassLoader$1.run(URLClassLoader.java:355) at java.security.AccessController.doPrivileged(Native Method) at java.net.URLClassLoader.findClass(URLClassLoader.java:354) at java.lang.ClassLoader.loadClass(ClassLoader.java:425) at java.lang.ClassLoader.loadClass(ClassLoader.java:358) at java.lang.Class.forName0(Native Method) at java.lang.Class.forName(Class.java:274) at org.apache.spark.serializer.JavaDeserializationStream$$anon$1.resolveClass(JavaSerializer.scala:59) at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1612) at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1517) at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1771) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990) at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915) at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370) at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:62) at org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:87) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:182) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) -- LOG indicating JAR files were added to worker classpath. 15/02/17 16:56:48 INFO Executor: Fetching http://127.0.0.1:64265/jars/spark-cassandra-connector-java_2.10-1.2.0-alpha2.jar with timestamp 1424210185005 15/02/17 16:56:48 INFO Utils: Fetching http://127.0.0.1:64265/jars/spark-cassandra-connector-java_2.10-1.2.0-alpha2.jar to C:\Users\sparkus\AppData\Local\Temp\spark-10f5e149-5460-4899-9c8f-b19b19bdaf55\spark-fba24b2b-5847-4b04-848c-90677d12ff99\spark-35f5ed4b-041d-40d8-8854-b243787de188\fetchFileTemp4665176275367448514.tmp 15/02/17 16:56:48 DEBUG Utils: fetchFile not using security 15/02/17 16:56:48 INFO Utils: Copying C:\Users\sparkus\AppData\Local\Temp\spark-10f5e149-5460-4899-9c8f-b19b19bdaf55\spark-fba24b2b-5847-4b04-848c-90677d12ff99\spark-35f5ed4b-041d-40d8-8854-b243787de188\16215993091424210185005_cache to
[jira] [Updated] (SPARK-5316) DAGScheduler may leak shuffleToMapStage entries if getParentStages fails
[ https://issues.apache.org/jira/browse/SPARK-5316?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-5316: - Priority: Major (was: Minor) DAGScheduler may leak shuffleToMapStage entries if getParentStages fails -- Key: SPARK-5316 URL: https://issues.apache.org/jira/browse/SPARK-5316 Project: Spark Issue Type: Bug Components: Scheduler Reporter: YanTang Zhai DAGScheduler may leak shuffleToMapStage entries if getParentStages fails. If getParentStages throws an exception, for example because an input path does not exist, DAGScheduler fails to handle the job submission, but records may already have been added to shuffleToMapStage during getParentStages. These records are never cleaned up. A simple job that reproduces this: {code:java}
val inputFile1 = ... // Input path does not exist when this job submits
val inputFile2 = ...
val outputFile = ...
val conf = new SparkConf()
val sc = new SparkContext(conf)
val rdd1 = sc.textFile(inputFile1)
  .flatMap(line => line.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _, 1)
val rdd2 = sc.textFile(inputFile2)
  .flatMap(line => line.split(","))
  .map(word => (word, 1))
  .reduceByKey(_ + _, 1)
try {
  val rdd3 = new PairRDDFunctions(rdd1).join(rdd2, 1)
  rdd3.saveAsTextFile(outputFile)
} catch {
  case e: Exception => logError(e)
}
// print the information of DAGScheduler's shuffleToMapStage to check
// whether it still has uncleaned records.
...
{code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4962) Put TaskScheduler.start back in SparkContext to shorten cluster resources occupation period
[ https://issues.apache.org/jira/browse/SPARK-4962?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-4962: - Affects Version/s: 1.0.0 Put TaskScheduler.start back in SparkContext to shorten cluster resources occupation period --- Key: SPARK-4962 URL: https://issues.apache.org/jira/browse/SPARK-4962 Project: Spark Issue Type: Improvement Components: Scheduler Affects Versions: 1.0.0 Reporter: YanTang Zhai Priority: Minor When the SparkContext object is instantiated, the TaskScheduler is started and some resources are allocated from the cluster. However, these resources may not be used for a while, for example while DAGScheduler.JobSubmitted is still being processed. These resources are wasted in that period. Thus, we want to move TaskScheduler.start later to shorten the period during which cluster resources are occupied, especially on a busy cluster. The TaskScheduler could be started just before running stages. We can analyse and compare the resource occupation period before and after the optimization.
TaskScheduler.start execution time: [time1__]
DAGScheduler.JobSubmitted (excluding HadoopRDD.getPartitions or TaskScheduler.start) execution time: [time2_]
HadoopRDD.getPartitions execution time: [time3___]
Stages execution time: [time4_]
The cluster resource occupation period before the optimization is [time2_][time3___][time4_]. The cluster resource occupation period after the optimization is [time3___][time4_]. In summary, the cluster resource occupation period after the optimization is shorter than before. If HadoopRDD.getPartitions could be moved earlier (SPARK-4961), the period might shrink further, to [time4_]. This resource saving is important for a busy cluster. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5316) DAGScheduler may leak shuffleToMapStage entries if getParentStages fails
[ https://issues.apache.org/jira/browse/SPARK-5316?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-5316: - Affects Version/s: 1.0.0 DAGScheduler may leak shuffleToMapStage entries if getParentStages fails -- Key: SPARK-5316 URL: https://issues.apache.org/jira/browse/SPARK-5316 Project: Spark Issue Type: Bug Components: Scheduler Affects Versions: 1.0.0 Reporter: YanTang Zhai DAGScheduler may leak shuffleToMapStage entries if getParentStages fails. If getParentStages throws an exception, for example because an input path does not exist, DAGScheduler fails to handle the job submission, but records may already have been added to shuffleToMapStage during getParentStages. These records are never cleaned up. A simple job that reproduces this: {code:java}
val inputFile1 = ... // Input path does not exist when this job submits
val inputFile2 = ...
val outputFile = ...
val conf = new SparkConf()
val sc = new SparkContext(conf)
val rdd1 = sc.textFile(inputFile1)
  .flatMap(line => line.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _, 1)
val rdd2 = sc.textFile(inputFile2)
  .flatMap(line => line.split(","))
  .map(word => (word, 1))
  .reduceByKey(_ + _, 1)
try {
  val rdd3 = new PairRDDFunctions(rdd1).join(rdd2, 1)
  rdd3.saveAsTextFile(outputFile)
} catch {
  case e: Exception => logError(e)
}
// print the information of DAGScheduler's shuffleToMapStage to check
// whether it still has uncleaned records.
...
{code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4921) TaskSetManager mistakenly returns PROCESS_LOCAL for NO_PREF tasks
[ https://issues.apache.org/jira/browse/SPARK-4921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-4921: - Summary: TaskSetManager mistakenly returns PROCESS_LOCAL for NO_PREF tasks (was: Performance issue caused by TaskSetManager returning PROCESS_LOCAL for NO_PREF tasks) TaskSetManager mistakenly returns PROCESS_LOCAL for NO_PREF tasks - Key: SPARK-4921 URL: https://issues.apache.org/jira/browse/SPARK-4921 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.2.0 Reporter: Xuefu Zhang Attachments: NO_PREF.patch During research for HIVE-9153, we found that TaskSetManager returns PROCESS_LOCAL for NO_PREF tasks, which may cause performance degradation. Changing the return value to NO_PREF, as demonstrated in the attached patch, seemingly improves the performance. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-3545) Put HadoopRDD.getPartitions forward and put TaskScheduler.start back in SparkContext to reduce DAGScheduler.JobSubmitted processing time and shorten cluster resources occu
[ https://issues.apache.org/jira/browse/SPARK-3545?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or closed SPARK-3545. Resolution: Won't Fix Put HadoopRDD.getPartitions forward and put TaskScheduler.start back in SparkContext to reduce DAGScheduler.JobSubmitted processing time and shorten cluster resources occupation period Key: SPARK-3545 URL: https://issues.apache.org/jira/browse/SPARK-3545 Project: Spark Issue Type: Improvement Components: Scheduler Reporter: YanTang Zhai Priority: Minor We have two problems: (1) HadoopRDD.getPartitions is lazily evaluated inside DAGScheduler.JobSubmitted. If the input directory is large, getPartitions may take a long time; for example, in our cluster it takes anywhere from 0.029s to 766.699s. While one JobSubmitted event is being processed, the others have to wait. Thus, we want to move HadoopRDD.getPartitions earlier to reduce DAGScheduler.JobSubmitted processing time, so that other JobSubmitted events don't need to wait as long. The HadoopRDD object could get its partitions when it is instantiated. (2) When the SparkContext object is instantiated, the TaskScheduler is started and some resources are allocated from the cluster. However, these resources may not be used for a while, for example while DAGScheduler.JobSubmitted is still being processed. These resources are wasted in that period. Thus, we want to move TaskScheduler.start later to shorten the period during which cluster resources are occupied, especially on a busy cluster. The TaskScheduler could be started just before running stages. We can analyse and compare the execution time before and after the optimization.
TaskScheduler.start execution time: [time1__]
DAGScheduler.JobSubmitted (excluding HadoopRDD.getPartitions or TaskScheduler.start) execution time: [time2_]
HadoopRDD.getPartitions execution time: [time3___]
Stages execution time: [time4_]
(1) The app has only one job. (a) The execution time of the job before optimization is [time1__][time2_][time3___][time4_]. The execution time of the job after optimization is [time3___][time2_][time1__][time4_]. (b) The cluster resources occupation period before optimization is [time2_][time3___][time4_]. The cluster resources occupation period after optimization is [time4_]. In summary, if the app has only one job, the total execution time is the same before and after optimization, while the cluster resources occupation period after optimization is shorter than before. (2) The app has 4 jobs. (a) Before optimization, job1 execution time is [time2_][time3___][time4_], job2 execution time is [time2__][time3___][time4_], job3 execution time is [time2][time3___][time4_], job4 execution time is [time2__][time3___][time4_]. After optimization, job1 execution time is [time3___][time2_][time1__][time4_], job2 execution time is [time3___][time2__][time4_], job3 execution time is [time3___][time2_][time4_], job4 execution time is [time3___][time2__][time4_]. In summary, if the app has multiple jobs, the average execution time after optimization is less than before, and the cluster resources occupation period after optimization is shorter than before. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-1537) Add integration with Yarn's Application Timeline Server
[ https://issues.apache.org/jira/browse/SPARK-1537?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhan Zhang updated SPARK-1537: -- Attachment: SPARK-1537.txt High level design doc for spark ATS integration. Add integration with Yarn's Application Timeline Server --- Key: SPARK-1537 URL: https://issues.apache.org/jira/browse/SPARK-1537 Project: Spark Issue Type: New Feature Components: YARN Reporter: Marcelo Vanzin Assignee: Marcelo Vanzin Attachments: SPARK-1537.txt, spark-1573.patch It would be nice to have Spark integrate with Yarn's Application Timeline Server (see YARN-321, YARN-1530). This would allow users running Spark on Yarn to have a single place to go for all their history needs, and avoid having to manage a separate service (Spark's built-in server). At the moment, there's a working version of the ATS in the Hadoop 2.4 branch, although there is still some ongoing work. But the basics are there, and I wouldn't expect them to change (much) at this point. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5814) Remove JBLAS from runtime dependencies
[ https://issues.apache.org/jira/browse/SPARK-5814?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-5814: - Priority: Major (was: Critical) Remove JBLAS from runtime dependencies -- Key: SPARK-5814 URL: https://issues.apache.org/jira/browse/SPARK-5814 Project: Spark Issue Type: Dependency upgrade Components: GraphX, MLlib Reporter: Xiangrui Meng Assignee: Xiangrui Meng We are using mixed breeze/netlib-java and jblas code in MLlib. They take different approaches to utilizing native libraries and we should keep only one of them. netlib-java has a clear separation between the Java implementation and the native JNI libraries, while JBLAS packs statically linked binaries that cause license issues (SPARK-5669). So we want to remove JBLAS from the Spark runtime. One issue with this approach is that we have JBLAS' DoubleMatrix exposed (by mistake) in SVDPlusPlus of GraphX. We should deprecate it and replace `DoubleMatrix` with `Array[Double]`. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-5911) Make Column.cast(to: String) support fixed precision and scale decimal type
Yin Huai created SPARK-5911: --- Summary: Make Column.cast(to: String) support fixed precision and scale decimal type Key: SPARK-5911 URL: https://issues.apache.org/jira/browse/SPARK-5911 Project: Spark Issue Type: Improvement Components: SQL Reporter: Yin Huai -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
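A sketch of what this improvement would allow, assuming the string form carries precision and scale (today only the bare decimal keyword parses; `sqlContext`, the "payments" table, and the amount column are all illustrative):
{code}
// Hypothetical usage once fixed precision and scale are supported in the
// string form of Column.cast:
val df = sqlContext.table("payments")
val fixed = df.select(df("amount").cast("decimal(10,2)"))
{code}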
[jira] [Commented] (SPARK-5744) RDD.isEmpty / take fails for (empty) RDD of Nothing
[ https://issues.apache.org/jira/browse/SPARK-5744?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14328173#comment-14328173 ] Apache Spark commented on SPARK-5744: - User 'srowen' has created a pull request for this issue: https://github.com/apache/spark/pull/4698 RDD.isEmpty / take fails for (empty) RDD of Nothing --- Key: SPARK-5744 URL: https://issues.apache.org/jira/browse/SPARK-5744 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Tobias Bertelsen Assignee: Tobias Bertelsen Priority: Minor Original Estimate: 0h Remaining Estimate: 0h The implementation of {{RDD.isEmpty()}} fails if there are empty partitions. It was introduced by https://github.com/apache/spark/pull/4074 Example: {code} sc.parallelize(Seq(), 1).isEmpty() {code} The above code throws an exception like this: {code}
org.apache.spark.SparkDriverExecutionException: Execution error
at org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:977)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1374)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1338)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
Cause: java.lang.ArrayStoreException: [Ljava.lang.Object;
at scala.runtime.ScalaRunTime$.array_update(ScalaRunTime.scala:88)
at org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1466)
at org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1466)
at org.apache.spark.scheduler.JobWaiter.taskSucceeded(JobWaiter.scala:56)
at org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:973)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1374)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1338)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
{code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
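As the issue title notes, the failing case is specifically an RDD of Nothing: the untyped Seq() makes Scala infer Nothing as the element type. Giving the RDD a concrete element type sidesteps the crash (illustrative):
{code}
sc.parallelize(Seq(), 1).isEmpty()          // RDD[Nothing]: throws as above
sc.parallelize(Seq.empty[Int], 1).isEmpty() // RDD[Int]: returns true
{code}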
[jira] [Updated] (SPARK-4848) On a stand-alone cluster, several worker-specific variables are read only on the master
[ https://issues.apache.org/jira/browse/SPARK-4848?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-4848: - Component/s: (was: Project Infra) Deploy On a stand-alone cluster, several worker-specific variables are read only on the master --- Key: SPARK-4848 URL: https://issues.apache.org/jira/browse/SPARK-4848 Project: Spark Issue Type: Bug Components: Deploy Affects Versions: 1.0.0 Environment: stand-alone spark cluster Reporter: Nathan Kronenfeld Original Estimate: 24h Remaining Estimate: 24h On a stand-alone spark cluster, much of the determination of worker specifics, especially when one has multiple instances per node, is done only on the master. The master loops over instances and starts a worker per instance on each node. This means that if your workers have different values of SPARK_WORKER_INSTANCES or SPARK_WORKER_WEBUI_PORT from each other (or from the master), all values are ignored except the one on the master. SPARK_WORKER_PORT looks like it is unread in scripts, but read in code - I'm not sure how it will behave, since all instances will read the same value from the environment. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-1537) Add integration with Yarn's Application Timeline Server
[ https://issues.apache.org/jira/browse/SPARK-1537?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhan Zhang updated SPARK-1537: -- Attachment: spark-1573.patch Patch against v1.2.1 Add integration with Yarn's Application Timeline Server --- Key: SPARK-1537 URL: https://issues.apache.org/jira/browse/SPARK-1537 Project: Spark Issue Type: New Feature Components: YARN Reporter: Marcelo Vanzin Assignee: Marcelo Vanzin Attachments: spark-1573.patch It would be nice to have Spark integrate with Yarn's Application Timeline Server (see YARN-321, YARN-1530). This would allow users running Spark on Yarn to have a single place to go for all their history needs, and avoid having to manage a separate service (Spark's built-in server). At the moment, there's a working version of the ATS in the Hadoop 2.4 branch, although there is still some ongoing work. But the basics are there, and I wouldn't expect them to change (much) at this point. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4848) On a stand-alone cluster, several worker-specific variables are read only on the master
[ https://issues.apache.org/jira/browse/SPARK-4848?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-4848: - Affects Version/s: 1.0.0 On a stand-alone cluster, several worker-specific variables are read only on the master --- Key: SPARK-4848 URL: https://issues.apache.org/jira/browse/SPARK-4848 Project: Spark Issue Type: Bug Components: Deploy Affects Versions: 1.0.0 Environment: stand-alone spark cluster Reporter: Nathan Kronenfeld Original Estimate: 24h Remaining Estimate: 24h On a stand-alone spark cluster, much of the determination of worker specifics, especially when one has multiple instances per node, is done only on the master. The master loops over instances and starts a worker per instance on each node. This means that if your workers have different values of SPARK_WORKER_INSTANCES or SPARK_WORKER_WEBUI_PORT from each other (or from the master), all values are ignored except the one on the master. SPARK_WORKER_PORT looks like it is unread in scripts, but read in code - I'm not sure how it will behave, since all instances will read the same value from the environment. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4721) Improve handling when the first thread to put a block fails
[ https://issues.apache.org/jira/browse/SPARK-4721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-4721: - Affects Version/s: 1.0.0 Improve handling when the first thread to put a block fails Key: SPARK-4721 URL: https://issues.apache.org/jira/browse/SPARK-4721 Project: Spark Issue Type: Improvement Components: Block Manager Affects Versions: 1.0.0 Reporter: SuYan In the current code, when multiple threads try to put a block with the same blockID into the blockManager, the thread that first puts the info into blockinfos performs the put, and the others wait until that put fails or succeeds. This is fine when the put succeeds, but there are problems when it fails: 1. The failed thread removes the info from blockinfo. 2. The other threads wake up and use the old info.synchronized to retry the put. 3. If one of them succeeds, marking success reports that the block is no longer in pending status, so the "mark success" fails. All the remaining threads then do the same thing: grab info.synchronized and mark success or failure, even though one has already succeeded. First, I can't understand why the info is removed from blockinfos while other threads are waiting. The comment tells us it is so other threads can create a new block info, but a block info is just an ID and a level, so using the old one or a new one makes no difference if there are waiting threads. Second, if the first thread fails, the other waiting threads could retry the put one by one, rather than all at once; or, if the first thread fails, all other threads could simply log a warning and return after waking up. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4669) Allow users to set arbitrary akka configurations via property file
[ https://issues.apache.org/jira/browse/SPARK-4669?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-4669: - Affects Version/s: 1.0.0 Allow users to set arbitrary akka configurations via property file -- Key: SPARK-4669 URL: https://issues.apache.org/jira/browse/SPARK-4669 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.0.0 Reporter: Tao Wang Currently spark only supports several configuration settings in the property file, plus arbitrary settings in SparkConf. If we want to set other items in the akka configuration, for instance akka.remote.startup-timeout, it is not possible to do this in the property file. I reviewed the commit history and could not find why we keep the current strategy. So in my opinion it would be better to open up all akka settings in the property file. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
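For context, a sketch of the gap being described (as I understand it; illustrative only): in Spark 1.x only specific spark.akka.* keys are translated into the underlying akka config, so there is no pass-through for arbitrary akka keys.
{code}
import org.apache.spark.SparkConf

// Supported today: a recognized spark.akka.* key.
val conf = new SparkConf().set("spark.akka.timeout", "200")
// Not supported: arbitrary akka keys such as "akka.remote.startup-timeout",
// which is the gap this issue asks to close.
{code}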
[jira] [Closed] (SPARK-2188) Support sbt/sbt for Windows
[ https://issues.apache.org/jira/browse/SPARK-2188?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or closed SPARK-2188. Resolution: Won't Fix Support sbt/sbt for Windows --- Key: SPARK-2188 URL: https://issues.apache.org/jira/browse/SPARK-2188 Project: Spark Issue Type: New Feature Components: Build Affects Versions: 1.0.0 Reporter: Pat McDonough Add the equivalent of sbt/sbt for Windows users. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-911) Support map pruning on sorted (K, V) RDD's
[ https://issues.apache.org/jira/browse/SPARK-911?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-911: Affects Version/s: 1.0.0 Support map pruning on sorted (K, V) RDD's -- Key: SPARK-911 URL: https://issues.apache.org/jira/browse/SPARK-911 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.0 Reporter: Patrick Wendell If someone has sorted a (K, V) rdd, we should offer them a way to filter a range of the partitions that employs map pruning. This would be simple using a small range index within the rdd itself. A good example: I sort my dataset by time, and then I want to serve queries that are restricted to a certain time range. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
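A sketch of the idea (not an existing Spark API): keep per-partition (min, max) key bounds for the sorted RDD, then use the existing PartitionPruningRDD developer API to skip partitions that cannot overlap the queried range. `rangeScan` and `bounds` are illustrative names.
{code}
import org.apache.spark.rdd.{PartitionPruningRDD, RDD}

def rangeScan[V](sorted: RDD[(Long, V)],
                 bounds: Array[(Long, Long)], // per-partition (min, max) keys
                 lo: Long, hi: Long): RDD[(Long, V)] = {
  // Keep only partitions whose key range can intersect [lo, hi].
  val pruned = PartitionPruningRDD.create(sorted, i => {
    val (min, max) = bounds(i)
    max >= lo && min <= hi
  })
  pruned.filter { case (k, _) => k >= lo && k <= hi }
}
{code}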
[jira] [Updated] (SPARK-3051) Support looking-up named accumulators in a registry
[ https://issues.apache.org/jira/browse/SPARK-3051?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-3051: - Affects Version/s: 1.0.0 Support looking-up named accumulators in a registry --- Key: SPARK-3051 URL: https://issues.apache.org/jira/browse/SPARK-3051 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.0.0 Reporter: Neil Ferguson This is a proposed enhancement to Spark based on the following mailing list discussion: http://apache-spark-developers-list.1001551.n3.nabble.com/quot-Dynamic-variables-quot-in-Spark-td7450.html. This proposal builds on SPARK-2380 (Support displaying accumulator values in the web UI) to allow named accumulables to be looked up in a registry, as opposed to having to be passed to every method that needs to access them. The use case was described well by [~shivaram], as follows: Let's say you have two functions you use in a map call and want to measure how much time each of them takes. For example, if you have a code block like the one below and you want to measure how much time f1 takes as a fraction of the task. {noformat}
a.map { l =>
  val f = f1(l)
  ... some work here ...
}
{noformat} It would be really cool if we could do something like {noformat}
a.map { l =>
  val start = System.nanoTime
  val f = f1(l)
  TaskMetrics.get("f1-time").add(System.nanoTime - start)
}
{noformat} SPARK-2380 provides a partial solution to this problem -- however the accumulables would still need to be passed to every function that needs them, which I think would be cumbersome in any application of reasonable complexity. The proposal, as suggested by [~pwendell], is to have a registry of accumulables that can be looked up by name. Regarding the implementation details, I'd propose that we broadcast a serialized version of all named accumulables in the DAGScheduler (similar to what SPARK-2521 does for Tasks). These can then be deserialized in the Executor. Accumulables are already stored in thread-local variables in the Accumulators object, so exposing these in the registry should be simply a matter of wrapping this object and keying the accumulables by name (they are currently keyed by ID). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
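A minimal sketch of the registry shape being proposed (names and types are illustrative; the real design would wrap the existing thread-local state in the Accumulators object, as described above):
{code}
import org.apache.spark.Accumulator

object AccumulatorRegistry {
  // Thread-local so each running task sees the accumulables shipped with it.
  private val local = new ThreadLocal[Map[String, Accumulator[_]]] {
    override def initialValue = Map.empty[String, Accumulator[_]]
  }
  def register(name: String, acc: Accumulator[_]): Unit =
    local.set(local.get + (name -> acc))
  def get(name: String): Accumulator[_] = local.get()(name)
}
{code}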
[jira] [Updated] (SPARK-2033) Automatically cleanup checkpoint
[ https://issues.apache.org/jira/browse/SPARK-2033?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-2033: - Affects Version/s: 1.0.0 Automatically cleanup checkpoint - Key: SPARK-2033 URL: https://issues.apache.org/jira/browse/SPARK-2033 Project: Spark Issue Type: New Feature Components: Spark Core Affects Versions: 1.0.0 Reporter: Guoqiang Li Assignee: Guoqiang Li Currently we use the ContextCleaner to asynchronously clean up RDDs, shuffles, and broadcasts, but not checkpoints. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-5912) Programming guide for feature selection
Joseph K. Bradley created SPARK-5912: Summary: Programming guide for feature selection Key: SPARK-5912 URL: https://issues.apache.org/jira/browse/SPARK-5912 Project: Spark Issue Type: Documentation Components: Documentation, MLlib Affects Versions: 1.3.0 Reporter: Joseph K. Bradley The new ChiSqSelector for feature selection should have a section in the Programming Guide. It should probably be under the feature extraction and transformation section as a new subsection for feature selection. If we get more feature selection methods later on, we could expand it to a larger section of the guide. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3882) JobProgressListener gets permanently out of sync with long running job
[ https://issues.apache.org/jira/browse/SPARK-3882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14328237#comment-14328237 ] Andrew Or commented on SPARK-3882: -- Hi [~dgshep] is this still an issue after upgrading to Spark 1.1 and beyond? If not I think we should close this issue. JobProgressListener gets permanently out of sync with long running job -- Key: SPARK-3882 URL: https://issues.apache.org/jira/browse/SPARK-3882 Project: Spark Issue Type: Bug Components: Web UI Affects Versions: 1.0.2 Reporter: Davis Shepherd Attachments: Screen Shot 2014-10-03 at 12.50.59 PM.png A long running spark context (non-streaming) will eventually start throwing the following in the driver: {code} java.util.NoSuchElementException: key not found: 12771 at scala.collection.MapLike$class.default(MapLike.scala:228) at scala.collection.AbstractMap.default(Map.scala:58) at scala.collection.mutable.HashMap.apply(HashMap.scala:64) at org.apache.spark.ui.jobs.JobProgressListener.onStageCompleted(JobProgressListener.scala:79) at org.apache.spark.scheduler.SparkListenerBus$$anonfun$postToAll$2.apply(SparkListenerBus.scala:48) at org.apache.spark.scheduler.SparkListenerBus$$anonfun$postToAll$2.apply(SparkListenerBus.scala:48) at org.apache.spark.scheduler.SparkListenerBus$$anonfun$foreachListener$1.apply(SparkListenerBus.scala:81) at org.apache.spark.scheduler.SparkListenerBus$$anonfun$foreachListener$1.apply(SparkListenerBus.scala:79) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at org.apache.spark.scheduler.SparkListenerBus$class.foreachListener(SparkListenerBus.scala:79) at org.apache.spark.scheduler.SparkListenerBus$class.postToAll(SparkListenerBus.scala:48) at org.apache.spark.scheduler.LiveListenerBus.postToAll(LiveListenerBus.scala:32) at org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(LiveListenerBus.scala:56) at org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(LiveListenerBus.scala:56) at scala.Option.foreach(Option.scala:236) at org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1.apply$mcV$sp(LiveListenerBus.scala:56) at org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1.apply(LiveListenerBus.scala:47) at org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1.apply(LiveListenerBus.scala:47) at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1160) at org.apache.spark.scheduler.LiveListenerBus$$anon$1.run(LiveListenerBus.scala:46) 2014-10-09 18:45:33,523 [SparkListenerBus] ERROR org.apache.spark.scheduler.LiveListenerBus - Listener JobProgressListener threw an exception java.util.NoSuchElementException: key not found: 12782 at scala.collection.MapLike$class.default(MapLike.scala:228) at scala.collection.AbstractMap.default(Map.scala:58) at scala.collection.mutable.HashMap.apply(HashMap.scala:64) at org.apache.spark.ui.jobs.JobProgressListener.onStageCompleted(JobProgressListener.scala:79) at org.apache.spark.scheduler.SparkListenerBus$$anonfun$postToAll$2.apply(SparkListenerBus.scala:48) at org.apache.spark.scheduler.SparkListenerBus$$anonfun$postToAll$2.apply(SparkListenerBus.scala:48) at org.apache.spark.scheduler.SparkListenerBus$$anonfun$foreachListener$1.apply(SparkListenerBus.scala:81) at org.apache.spark.scheduler.SparkListenerBus$$anonfun$foreachListener$1.apply(SparkListenerBus.scala:79) at 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at org.apache.spark.scheduler.SparkListenerBus$class.foreachListener(SparkListenerBus.scala:79) at org.apache.spark.scheduler.SparkListenerBus$class.postToAll(SparkListenerBus.scala:48) at org.apache.spark.scheduler.LiveListenerBus.postToAll(LiveListenerBus.scala:32) at org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(LiveListenerBus.scala:56) at org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(LiveListenerBus.scala:56) at scala.Option.foreach(Option.scala:236) at org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1.apply$mcV$sp(LiveListenerBus.scala:56) at
[jira] [Commented] (SPARK-5912) Programming guide for feature selection
[ https://issues.apache.org/jira/browse/SPARK-5912?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14328238#comment-14328238 ] Joseph K. Bradley commented on SPARK-5912: -- [~avulanov] Would you have time to make this guide for the 1.3 release (as soon as possible, really)? If not, I could add it. Thanks! Programming guide for feature selection --- Key: SPARK-5912 URL: https://issues.apache.org/jira/browse/SPARK-5912 Project: Spark Issue Type: Documentation Components: Documentation, MLlib Affects Versions: 1.3.0 Reporter: Joseph K. Bradley The new ChiSqSelector for feature selection should have a section in the Programming Guide. It should probably be under the feature extraction and transformation section as a new subsection for feature selection. If we get more feature selection methods later on, we could expand it to a larger section of the guide. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5912) Programming guide for feature selection
[ https://issues.apache.org/jira/browse/SPARK-5912?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14328246#comment-14328246 ] Alexander Ulanov commented on SPARK-5912: - Sure, I can. Could you point me to some template or a good example of a programming guide? Programming guide for feature selection --- Key: SPARK-5912 URL: https://issues.apache.org/jira/browse/SPARK-5912 Project: Spark Issue Type: Documentation Components: Documentation, MLlib Affects Versions: 1.3.0 Reporter: Joseph K. Bradley The new ChiSqSelector for feature selection should have a section in the Programming Guide. It should probably be under the feature extraction and transformation section as a new subsection for feature selection. If we get more feature selection methods later on, we could expand it to a larger section of the guide. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-1476) 2GB limit in spark for blocks
[ https://issues.apache.org/jira/browse/SPARK-1476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14328248#comment-14328248 ] Marcelo Vanzin commented on SPARK-1476: --- Hi [~irashid], Approach sounds good. It would be nice to measure whether the optimization for smaller blocks actually makes a difference; from what I can tell, supporting multiple ByteBuffer instances just means having an array and picking the right ByteBuffer based on an offset, both of which should be pretty cheap. 2GB limit in spark for blocks - Key: SPARK-1476 URL: https://issues.apache.org/jira/browse/SPARK-1476 Project: Spark Issue Type: Improvement Components: Spark Core Environment: all Reporter: Mridul Muralidharan Assignee: Mridul Muralidharan Priority: Critical Attachments: 2g_fix_proposal.pdf The underlying abstraction for blocks in spark is a ByteBuffer : which limits the size of the block to 2GB. This has implication not just for managed blocks in use, but also for shuffle blocks (memory mapped blocks are limited to 2gig, even though the api allows for long), ser-deser via byte array backed outstreams (SPARK-1391), etc. This is a severe limitation for use of spark when used on non trivial datasets. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
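For concreteness, the multiple-ByteBuffer scheme discussed above could look roughly like this sketch (hypothetical names, not Spark code): a large block is backed by an array of fixed-size chunks, and a global offset is resolved to a chunk plus an offset within it.
{code}
import java.nio.ByteBuffer

// Illustrative-only: back a block larger than 2GB with fixed-size chunks.
class ChunkedBuffer(chunks: Array[ByteBuffer], chunkSize: Int) {
  require(chunkSize > 0, "chunk size must be positive")

  // Read one byte at a logical offset that may exceed 2GB.
  def get(offset: Long): Byte = {
    val chunkIndex = (offset / chunkSize).toInt  // which chunk holds the byte
    val chunkOffset = (offset % chunkSize).toInt // position inside that chunk
    chunks(chunkIndex).get(chunkOffset)
  }

  // Total logical size across all chunks.
  def size: Long = chunks.map(_.limit.toLong).sum
}
{code}
As the comment notes, the per-access cost is one division and one array lookup, which supports the suggestion that the small-block optimization may not be worth measuring around.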
[jira] [Created] (SPARK-5918) Spark Thrift server reports metadata for VARCHAR column as STRING in result set schema
Holman Lan created SPARK-5918: - Summary: Spark Thrift server reports metadata for VARCHAR column as STRING in result set schema Key: SPARK-5918 URL: https://issues.apache.org/jira/browse/SPARK-5918 Project: Spark Issue Type: Bug Affects Versions: 1.2.0, 1.1.1 Reporter: Holman Lan This is reproducible using the open source JDBC driver by executing a query that returns a VARCHAR column and then retrieving the result set metadata. The type name returned by the JDBC driver is VARCHAR, which is expected, but the column type is reported as string[12] and the precision/column length as 2147483647 (which is what the JDBC driver would return for a STRING column), even though we created a VARCHAR column with a max length of 1000. Further investigation indicates the GetResultSetMetadata Thrift client API call returns the incorrect metadata. We have confirmed this behaviour in versions 1.1.1 and 1.2.0. We have not yet tested this against 1.2.1 but will do so and report our findings. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
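A reproduction sketch over the Hive JDBC driver (the connection URL, table, and column names are hypothetical; assumes a table created with a VARCHAR(1000) column named name):
{code}
import java.sql.DriverManager

// Connect to a local Spark Thrift server (URL is illustrative).
val conn = DriverManager.getConnection("jdbc:hive2://localhost:10000/default")
val rs = conn.createStatement().executeQuery("SELECT name FROM people LIMIT 1")
val md = rs.getMetaData

println(md.getColumnTypeName(1)) // "VARCHAR", as expected
println(md.getColumnType(1))     // 12 (java.sql.Types.VARCHAR), but with STRING semantics
println(md.getPrecision(1))      // 2147483647 instead of the declared 1000

conn.close()
{code}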
[jira] [Commented] (SPARK-5879) spark_ec2.py should expose/return master and slave lists (e.g. write to file)
[ https://issues.apache.org/jira/browse/SPARK-5879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14328612#comment-14328612 ] Florian Verhein commented on SPARK-5879: cc [~shivaram], any opinions on how to best do this? spark_ec2.py should expose/return master and slave lists (e.g. write to file) - Key: SPARK-5879 URL: https://issues.apache.org/jira/browse/SPARK-5879 Project: Spark Issue Type: Improvement Components: EC2 Reporter: Florian Verhein After running spark_ec2.py, it is often useful/necessary to know the master's ip / dn, particularly if running spark_ec2.py is part of a larger pipeline. For example, consider a wrapper that launches a cluster, then waits for completion of some application running on it (e.g. polling via ssh), before destroying the cluster. Some options: - write `launch-variables.sh` with MASTERS and SLAVES exports (i.e. basically a subset of the ec2_variables.sh that is temporarily created as part of deploy_files variable substitution) - launch-variables.json (same info but as json) Both would be useful depending on the wrapper language. I think we should incorporate the cluster name for the case that multiple clusters are launched. E.g. cluster_name_variables.sh/.json Thoughts? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4144) Support incremental model training of Naive Bayes classifier
[ https://issues.apache.org/jira/browse/SPARK-4144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14328569#comment-14328569 ] Jatinpreet Singh commented on SPARK-4144: - Hi, I have been waiting for this feature to be included. It would be great if this can be done. Thanks, Jatin Support incremental model training of Naive Bayes classifier Key: SPARK-4144 URL: https://issues.apache.org/jira/browse/SPARK-4144 Project: Spark Issue Type: Improvement Components: MLlib, Streaming Reporter: Chris Fregly Assignee: Liquan Pei Per Xiangrui Meng from the following user list discussion: http://mail-archives.apache.org/mod_mbox/spark-user/201408.mbox/%3CCAJgQjQ_QjMGO=jmm8weq1v8yqfov8du03abzy7eeavgjrou...@mail.gmail.com%3E For Naive Bayes, we need to update the priors and conditional probabilities, which means we should also remember the number of observations for the updates. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
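To make the update requirement above concrete, here is a sketch of the state an incremental multinomial Naive Bayes would need to remember (all names are hypothetical, not MLlib APIs): raw counts rather than derived probabilities, so a new batch of observations can be folded in.
{code}
// Incremental Naive Bayes must retain counts, not just log-probabilities.
case class NaiveBayesState(
    classCounts: Map[Double, Long],            // number of observations per label
    featureSums: Map[Double, Array[Double]]) { // per-label feature count totals

  // Fold one labeled observation into the running counts.
  def update(label: Double, features: Array[Double]): NaiveBayesState = {
    val counts = classCounts.updated(label, classCounts.getOrElse(label, 0L) + 1L)
    val old = featureSums.getOrElse(label, Array.fill(features.length)(0.0))
    val sums = featureSums.updated(label, old.zip(features).map { case (a, b) => a + b })
    NaiveBayesState(counts, sums)
  }

  // Priors are recomputed on demand from the remembered counts.
  def priors: Map[Double, Double] = {
    val total = classCounts.values.sum.toDouble
    classCounts.map { case (label, n) => label -> n / total }
  }
}
{code}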
[jira] [Commented] (SPARK-4655) Split Stage into ShuffleMapStage and ResultStage subclasses
[ https://issues.apache.org/jira/browse/SPARK-4655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14328659#comment-14328659 ] Apache Spark commented on SPARK-4655: - User 'ilganeli' has created a pull request for this issue: https://github.com/apache/spark/pull/4703 Split Stage into ShuffleMapStage and ResultStage subclasses --- Key: SPARK-4655 URL: https://issues.apache.org/jira/browse/SPARK-4655 Project: Spark Issue Type: Sub-task Components: Spark Core Reporter: Josh Rosen Assignee: Ilya Ganelin The scheduler's {{Stage}} class has many fields which are only applicable to result stages or shuffle map stages. As a result, I think that it makes sense to make {{Stage}} into an abstract base class with two subclasses, {{ResultStage}} and {{ShuffleMapStage}}. This would improve the understandability of the DAGScheduler code. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
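The proposed shape, as a sketch (illustrative fields only; the real DAGScheduler classes carry much more state than this):
{code}
// Common fields stay on the abstract base class.
abstract class Stage(val id: Int, val numTasks: Int)

// Stages that write shuffle output keep shuffle-specific state here.
class ShuffleMapStage(id: Int, numTasks: Int, val shuffleDepId: Int)
  extends Stage(id, numTasks)

// The final stage of a job, which computes the result of an action.
class ResultStage(id: Int, numTasks: Int) extends Stage(id, numTasks)
{code}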
[jira] [Commented] (SPARK-5912) Programming guide for feature selection
[ https://issues.apache.org/jira/browse/SPARK-5912?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14328255#comment-14328255 ] Joseph K. Bradley commented on SPARK-5912: -- Sure, can you please follow the examples in [https://github.com/apache/spark/blob/master/docs/mllib-feature-extraction.md], which generates into [http://spark.apache.org/docs/latest/mllib-feature-extraction.html]? I'd add a new subsection at the level of the other algorithms (TF-IDF, Word2Vec, etc.). There can be Scala/Java examples but we can of course skip Python since that API isn't available yet. To see what it looks like on your machine, you can compile the docs using the instructions here: [https://github.com/apache/spark/tree/master/docs] Let me know if you run into questions. Thanks! Programming guide for feature selection --- Key: SPARK-5912 URL: https://issues.apache.org/jira/browse/SPARK-5912 Project: Spark Issue Type: Documentation Components: Documentation, MLlib Affects Versions: 1.3.0 Reporter: Joseph K. Bradley The new ChiSqSelector for feature selection should have a section in the Programming Guide. It should probably be under the feature extraction and transformation section as a new subsection for feature selection. If we get more feature selection methods later on, we could expand it to a larger section of the guide. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
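A snippet of the kind the new subsection might contain, written against the MLlib 1.3 ChiSqSelector API (sc is assumed to be an existing SparkContext; the data is a toy set):
{code}
import org.apache.spark.mllib.feature.ChiSqSelector
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

// ChiSqSelector expects categorical (e.g. binned) feature values.
val data = sc.parallelize(Seq(
  LabeledPoint(0.0, Vectors.dense(8.0, 7.0, 0.0)),
  LabeledPoint(1.0, Vectors.dense(0.0, 9.0, 6.0))))

// Keep the 2 features most predictive of the label by the chi-squared test.
val selector = new ChiSqSelector(2)
val model = selector.fit(data)
val filtered = data.map(lp => LabeledPoint(lp.label, model.transform(lp.features)))
{code}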
[jira] [Updated] (SPARK-5914) Spark-submit cannot execute without machine admin permission on windows
[ https://issues.apache.org/jira/browse/SPARK-5914?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-5914: - Component/s: (was: Spark Core) Windows Spark Submit Yes of course you are not expected to run as admin. It'd be good to find a way to set the permissions correctly. I don't know how well Java plays with Windows file permissions though? Spark-submit cannot execute without machine admin permission on windows --- Key: SPARK-5914 URL: https://issues.apache.org/jira/browse/SPARK-5914 Project: Spark Issue Type: Bug Components: Spark Submit, Windows Environment: Windows Reporter: Judy Nash Priority: Minor On the Windows platform only: if the slave is executed with user permissions, spark-submit fails with java.lang.ClassNotFoundException when attempting to read the cached jar from the spark_home\work folder. This is because the jars do not have read permission set by default on Windows. The fix is to add read permission explicitly for the owner of the file. Having the service account run as admin (the equivalent of sudo on Linux) is a major security risk for production clusters. This makes it easy for hackers to compromise the cluster by taking over the service account. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
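For what it's worth, plain java.io.File can grant owner-only read permission, which is one way the proposed fix could be expressed from the JVM (the path below is illustrative):
{code}
import java.io.File

// Grant read permission to the file owner only, without widening access
// for other users. Path is a made-up example of a cached worker jar.
val jar = new File("""C:\spark\work\app-20150220\0\myapp.jar""")
if (!jar.canRead) {
  val ok = jar.setReadable(true, true) // (readable, ownerOnly)
  if (!ok) println(s"Could not set read permission on ${jar.getPath}")
}
{code}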
[jira] [Resolved] (SPARK-5900) Wrap the results returned by PIC and FPGrowth in case classes
[ https://issues.apache.org/jira/browse/SPARK-5900?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-5900. -- Resolution: Fixed Fix Version/s: 1.3.0 Issue resolved by pull request 4695 [https://github.com/apache/spark/pull/4695] Wrap the results returned by PIC and FPGrowth in case classes - Key: SPARK-5900 URL: https://issues.apache.org/jira/browse/SPARK-5900 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 1.3.0 Reporter: Xiangrui Meng Assignee: Xiangrui Meng Fix For: 1.3.0 We return tuples in the current version of PIC and FPGrowth. This is not very Java-friendly because the primitive types are translated into Objects. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
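The tuple-to-case-class change is easy to picture. A minimal sketch follows; the class name matches the FreqItemset wrapper discussed for FPGrowth, but treat the exact shape as illustrative:
{code}
// Before: results as tuples; in Java these surface as pairs of Objects
// with boxed primitives.
val tupleResults: Seq[(Array[String], Long)] = Seq((Array("a", "b"), 42L))

// After: a small wrapper class gives named, typed accessors in both languages.
class FreqItemset[Item](val items: Array[Item], val freq: Long)

val wrapped: Seq[FreqItemset[String]] =
  tupleResults.map { case (items, freq) => new FreqItemset(items, freq) }
{code}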
[jira] [Created] (SPARK-5909) Add a clearCache command to Spark SQL's cache manager
Yin Huai created SPARK-5909: --- Summary: Add a clearCache command to Spark SQL's cache manager Key: SPARK-5909 URL: https://issues.apache.org/jira/browse/SPARK-5909 Project: Spark Issue Type: Task Components: SQL Reporter: Yin Huai This command will clear all cached data from the in-memory cache, which will be useful when users want to quickly clear the cache, or as a workaround for cases like SPARK-5881. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
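Assuming the command surfaces as a clearCache() call on SQLContext (table names below are hypothetical), usage would look roughly like this sketch:
{code}
// Populate the in-memory cache.
sqlContext.cacheTable("t1")
sqlContext.sql("CACHE TABLE t2 AS SELECT * FROM t1 LIMIT 5")

// Drop everything from the cache in one call, e.g. to work around stale
// entries such as the SPARK-5881 case.
sqlContext.clearCache()
{code}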
[jira] [Commented] (SPARK-5909) Add a clearCache command to Spark SQL's cache manager
[ https://issues.apache.org/jira/browse/SPARK-5909?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14327706#comment-14327706 ] Apache Spark commented on SPARK-5909: - User 'yhuai' has created a pull request for this issue: https://github.com/apache/spark/pull/4694 Add a clearCache command to Spark SQL's cache manager - Key: SPARK-5909 URL: https://issues.apache.org/jira/browse/SPARK-5909 Project: Spark Issue Type: Task Components: SQL Reporter: Yin Huai This command will clear all cached data from the in-memory cache, which will be useful when users want to quickly clear the cache, or as a workaround for cases like SPARK-5881. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5881) RDD remains cached after the table gets overridden by CACHE TABLE
[ https://issues.apache.org/jira/browse/SPARK-5881?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14327696#comment-14327696 ] Yin Huai commented on SPARK-5881: - As mentioned by [~lian cheng], we should also track the table names in the Cache Manager to correctly handle the following case.
{code}
val df1 = sql("SELECT * FROM testData LIMIT 10")
df1.registerTempTable("t1")
// Cache t1 explicitly
sql("CACHE TABLE t1")
// t1 and t2 share the same query plan
sql("CACHE TABLE t2 AS SELECT * FROM testData LIMIT 10")
// Replace t2 with a different query plan
sql("CACHE TABLE t2 AS SELECT * FROM testData LIMIT 5")
{code}
RDD remains cached after the table gets overridden by CACHE TABLE --- Key: SPARK-5881 URL: https://issues.apache.org/jira/browse/SPARK-5881 Project: Spark Issue Type: Bug Components: SQL Reporter: Yin Huai Priority: Blocker
{code}
val rdd = sc.parallelize((1 to 10).map(i => s"""{"a":$i, "b":"str${i}"}"""))
sqlContext.jsonRDD(rdd).registerTempTable("jt")
sqlContext.sql("CACHE TABLE foo AS SELECT * FROM jt")
sqlContext.sql("CACHE TABLE foo AS SELECT a FROM jt")
{code}
After the second CACHE TABLE command, the RDD for the first table still remains in the cache. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5881) RDD remains cached after the table gets overridden by CACHE TABLE
[ https://issues.apache.org/jira/browse/SPARK-5881?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai updated SPARK-5881: Priority: Major (was: Blocker) RDD remains cached after the table gets overridden by CACHE TABLE --- Key: SPARK-5881 URL: https://issues.apache.org/jira/browse/SPARK-5881 Project: Spark Issue Type: Bug Components: SQL Reporter: Yin Huai
{code}
val rdd = sc.parallelize((1 to 10).map(i => s"""{"a":$i, "b":"str${i}"}"""))
sqlContext.jsonRDD(rdd).registerTempTable("jt")
sqlContext.sql("CACHE TABLE foo AS SELECT * FROM jt")
sqlContext.sql("CACHE TABLE foo AS SELECT a FROM jt")
{code}
After the second CACHE TABLE command, the RDD for the first table still remains in the cache. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5907) Selected column from DataFrame should not re-analyze logical plan
[ https://issues.apache.org/jira/browse/SPARK-5907?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14327494#comment-14327494 ] Apache Spark commented on SPARK-5907: - User 'viirya' has created a pull request for this issue: https://github.com/apache/spark/pull/4691 Selected column from DataFrame should not re-analyze logical plan - Key: SPARK-5907 URL: https://issues.apache.org/jira/browse/SPARK-5907 Project: Spark Issue Type: Bug Components: SQL Reporter: Liang-Chi Hsieh Currently, selecting a column from a DataFrame wraps the original logical plan with a Project. When the column is used, the logical plan is analyzed again. For some query plans, re-analyzing has the side effect of increasing expression ids. So when accessing the column, the column's expr and its analyzed plan will point to different expressions. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-5908) Hive udtf with single alias should be resolved correctly
Liang-Chi Hsieh created SPARK-5908: -- Summary: Hive udtf with single alias should be resolved correctly Key: SPARK-5908 URL: https://issues.apache.org/jira/browse/SPARK-5908 Project: Spark Issue Type: Bug Components: SQL Reporter: Liang-Chi Hsieh ResolveUdtfsAlias in hiveUdfs only handles HiveGenericUdtf with multiple aliases. When a single alias is used with HiveGenericUdtf, the alias does not work. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-5907) Selected column from DataFrame should not re-analyze logical plan
[ https://issues.apache.org/jira/browse/SPARK-5907?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Liang-Chi Hsieh closed SPARK-5907. -- Resolution: Duplicate Selected column from DataFrame should not re-analyze logical plan - Key: SPARK-5907 URL: https://issues.apache.org/jira/browse/SPARK-5907 Project: Spark Issue Type: Bug Components: SQL Reporter: Liang-Chi Hsieh Currently, selecting a column from a DataFrame wraps the original logical plan with a Project. When the column is used, the logical plan is analyzed again. For some query plans, re-analyzing has the side effect of increasing expression ids. So when accessing the column, the column's expr and its analyzed plan will point to different expressions. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5900) Wrap the results returned by PIC and FPGrowth in case classes
[ https://issues.apache.org/jira/browse/SPARK-5900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14327787#comment-14327787 ] Apache Spark commented on SPARK-5900: - User 'mengxr' has created a pull request for this issue: https://github.com/apache/spark/pull/4695 Wrap the results returned by PIC and FPGrowth in case classes - Key: SPARK-5900 URL: https://issues.apache.org/jira/browse/SPARK-5900 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 1.3.0 Reporter: Xiangrui Meng Assignee: Xiangrui Meng We return tuples in the current version of PIC and FPGrowth. This is not very Java-friendly because the primitive types are translated into Objects. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-5548) Flaky test: o.a.s.util.AkkaUtilsSuite.remote fetch ssl on - untrusted server
[ https://issues.apache.org/jira/browse/SPARK-5548?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or closed SPARK-5548. Resolution: Fixed Fix Version/s: 1.3.0 Target Version/s: 1.3.0 Closing again https://github.com/apache/spark/pull/4653. Let's hope we won't have to reopen this again. Flaky test: o.a.s.util.AkkaUtilsSuite.remote fetch ssl on - untrusted server Key: SPARK-5548 URL: https://issues.apache.org/jira/browse/SPARK-5548 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.3.0 Reporter: Patrick Wendell Assignee: Jacek Lewandowski Priority: Critical Labels: flaky-test Fix For: 1.3.0
{code}
sbt.ForkMain$ForkError: Expected exception java.util.concurrent.TimeoutException to be thrown, but akka.actor.ActorNotFound was thrown.
  at org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:496)
  at org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1555)
  at org.scalatest.Assertions$class.intercept(Assertions.scala:1004)
  at org.scalatest.FunSuite.intercept(FunSuite.scala:1555)
  at org.apache.spark.util.AkkaUtilsSuite$$anonfun$8.apply$mcV$sp(AkkaUtilsSuite.scala:373)
  at org.apache.spark.util.AkkaUtilsSuite$$anonfun$8.apply(AkkaUtilsSuite.scala:349)
  at org.apache.spark.util.AkkaUtilsSuite$$anonfun$8.apply(AkkaUtilsSuite.scala:349)
  at org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22)
  at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
  at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
  at org.scalatest.Transformer.apply(Transformer.scala:22)
  at org.scalatest.Transformer.apply(Transformer.scala:20)
  at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166)
  at org.scalatest.Suite$class.withFixture(Suite.scala:1122)
  at org.scalatest.FunSuite.withFixture(FunSuite.scala:1555)
  at org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163)
  at org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
  at org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
  at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
  at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175)
  at org.apache.spark.util.AkkaUtilsSuite.org$scalatest$BeforeAndAfterEach$$super$runTest(AkkaUtilsSuite.scala:37)
  at org.scalatest.BeforeAndAfterEach$class.runTest(BeforeAndAfterEach.scala:255)
  at org.apache.spark.util.AkkaUtilsSuite.runTest(AkkaUtilsSuite.scala:37)
  at org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
  at org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
  at org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:413)
  at org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:401)
  at scala.collection.immutable.List.foreach(List.scala:318)
  at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401)
  at org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:396)
  at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:483)
  at org.scalatest.FunSuiteLike$class.runTests(FunSuiteLike.scala:208)
  at org.scalatest.FunSuite.runTests(FunSuite.scala:1555)
  at org.scalatest.Suite$class.run(Suite.scala:1424)
  at org.scalatest.FunSuite.org$scalatest$FunSuiteLike$$super$run(FunSuite.scala:1555)
  at org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212)
  at org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212)
  at org.scalatest.SuperEngine.runImpl(Engine.scala:545)
  at org.scalatest.FunSuiteLike$class.run(FunSuiteLike.scala:212)
  at org.apache.spark.util.AkkaUtilsSuite.org$scalatest$BeforeAndAfterAll$$super$run(AkkaUtilsSuite.scala:37)
  at org.scalatest.BeforeAndAfterAll$class.liftedTree1$1(BeforeAndAfterAll.scala:257)
  at org.scalatest.BeforeAndAfterAll$class.run(BeforeAndAfterAll.scala:256)
  at org.apache.spark.util.AkkaUtilsSuite.run(AkkaUtilsSuite.scala:37)
  at org.scalatest.tools.Framework.org$scalatest$tools$Framework$$runSuite(Framework.scala:462)
  at org.scalatest.tools.Framework$ScalaTestTask.execute(Framework.scala:671)
  at sbt.ForkMain$Run$2.call(ForkMain.java:294)
  at sbt.ForkMain$Run$2.call(ForkMain.java:284)
  at java.util.concurrent.FutureTask.run(FutureTask.java:262)
{code}
[jira] [Resolved] (SPARK-5889) remove pid file in spark-daemon.sh after killing the process.
[ https://issues.apache.org/jira/browse/SPARK-5889?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-5889. -- Resolution: Fixed Fix Version/s: 1.2.2 1.3.0 Issue resolved by pull request 4676 [https://github.com/apache/spark/pull/4676] remove pid file in spark-daemon.sh after killing the process. - Key: SPARK-5889 URL: https://issues.apache.org/jira/browse/SPARK-5889 Project: Spark Issue Type: Bug Components: Deploy Affects Versions: 1.2.1 Reporter: Zhan Zhang Assignee: Zhan Zhang Priority: Minor Fix For: 1.3.0, 1.2.2 Currently, if the thrift server or history server is stopped, the pid file is not deleted. The fix is trivial, but it is important for service checks that rely on the file. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-5914) Spark-submit cannot execute without machine admin permission on windows
Judy Nash created SPARK-5914: Summary: Spark-submit cannot execute without machine admin permission on windows Key: SPARK-5914 URL: https://issues.apache.org/jira/browse/SPARK-5914 Project: Spark Issue Type: Bug Components: Spark Core Environment: Windows Reporter: Judy Nash Priority: Minor On the Windows platform only: if the slave is executed with user permissions, spark-submit fails with java.lang.ClassNotFoundException when attempting to read the cached jar from the spark_home\work folder. This is because the jars do not have read permission set by default on Windows. The fix is to add read permission explicitly for the owner of the file. Having the service account run as admin (the equivalent of sudo on Linux) is a major security risk for production clusters. This makes it easy for hackers to compromise the cluster by taking over the service account. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-5915) Spillable should check every N bytes rather than every 32 elements
Mingyu Kim created SPARK-5915: - Summary: Spillable should check every N bytes rather than every 32 elements Key: SPARK-5915 URL: https://issues.apache.org/jira/browse/SPARK-5915 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Mingyu Kim Spillable currently checks for spill every 32 elements. However, this puts it at risk of OOM if each element is large enough. A better alternative is to check every N bytes accumulated. N should be set to a reasonable value via proper testing. This is a follow-up of SPARK-4808, and was discussed originally in https://github.com/apache/spark/pull/4420. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
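A sketch of the byte-based check (hypothetical names, not the real Spillable internals): instead of estimating size on every 32nd element, trigger an estimate once roughly N bytes have been appended since the last check.
{code}
// The size estimator is passed in so the expensive call stays pluggable.
class ByteCheckedSpillable(checkIntervalBytes: Long, memoryLimit: Long,
                           estimateSize: () => Long) {
  private var bytesSinceLastCheck = 0L

  // Called after each insert; returns true when the caller should spill.
  def maybeSpill(recordSizeEstimate: Long): Boolean = {
    bytesSinceLastCheck += recordSizeEstimate
    if (bytesSinceLastCheck < checkIntervalBytes) {
      false
    } else {
      bytesSinceLastCheck = 0L
      estimateSize() >= memoryLimit // expensive call, amortized over ~N bytes
    }
  }
}
{code}
Unlike the every-32-elements rule, the worst-case memory growth between checks is bounded by N bytes regardless of element size.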
[jira] [Updated] (SPARK-4808) Spark fails to spill with small number of large objects
[ https://issues.apache.org/jira/browse/SPARK-4808?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-4808: - Target Version/s: 1.3.0, 1.4.0 (was: 1.2.1) Spark fails to spill with small number of large objects --- Key: SPARK-4808 URL: https://issues.apache.org/jira/browse/SPARK-4808 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.2, 1.1.0, 1.2.0, 1.2.1 Reporter: Dennis Lawler Spillable's maybeSpill does not allow a spill to occur until at least 1000 elements have been read, and then will only evaluate spilling every 32nd element thereafter. When a small number of very large items is being tracked, out-of-memory conditions may occur. I suspect that this and the every-32nd-element behavior were intended to reduce the impact of the estimateSize() call. This method was extracted into SizeTracker, which implements its own exponential backoff for size estimation, so now we are only avoiding the use of the resulting estimated size. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5915) Spillable should check every N bytes rather than every 32 elements
[ https://issues.apache.org/jira/browse/SPARK-5915?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-5915: - Target Version/s: 1.4.0 Spillable should check every N bytes rather than every 32 elements -- Key: SPARK-5915 URL: https://issues.apache.org/jira/browse/SPARK-5915 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.0.0 Reporter: Mingyu Kim Spillable currently checks for spill every 32 elements. However, this puts it at risk of OOM if each element is large enough. A better alternative is to check every N bytes accumulated. N should be set to a reasonable value via proper testing. This is a follow-up of SPARK-4808, and was discussed originally in https://github.com/apache/spark/pull/4420. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5915) Spillable should check every N bytes rather than every 32 elements
[ https://issues.apache.org/jira/browse/SPARK-5915?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-5915: - Affects Version/s: 1.0.0 Spillable should check every N bytes rather than every 32 elements -- Key: SPARK-5915 URL: https://issues.apache.org/jira/browse/SPARK-5915 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.0.0 Reporter: Mingyu Kim Spillable currently checks for spill every 32 elements. However, this puts it at risk of OOM if each element is large enough. A better alternative is to check every N bytes accumulated. N should be set to a reasonable value via proper testing. This is a follow-up of SPARK-4808, and was discussed originally in https://github.com/apache/spark/pull/4420. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5753) add basic support to JDBCRDD for postgresql types: uuid, hstore, and array
[ https://issues.apache.org/jira/browse/SPARK-5753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14328449#comment-14328449 ] Evan Yu commented on SPARK-5753: Ignore this, commit under wrong ticket add basic support to JDBCRDD for postgresql types: uuid, hstore, and array -- Key: SPARK-5753 URL: https://issues.apache.org/jira/browse/SPARK-5753 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.3.0 Reporter: Ricky Nguyen I recently saw the new JDBCRDD merged into master. Thanks for that, it works pretty well and is really convenient. It would be nice if it could have basic support for a few more types. * uuid (as StringType) * hstore (as MapType). keys and values are both strings. * array (as ArrayType) I have a patch that gets started in this direction. Not sure where or how to write/run tests, but I ran manual tests in spark-shell against my postgres db. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
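For illustration, the mapping the description asks for might look like the following sketch (a hypothetical helper, not the actual patch); Catalyst types are from org.apache.spark.sql.types, and the type names are those the Postgres JDBC driver is assumed to report:
{code}
import org.apache.spark.sql.types._

// Map a Postgres-reported JDBC type name to a Catalyst type.
def postgresCatalystType(typeName: String): Option[DataType] = typeName match {
  case "uuid"   => Some(StringType)                      // uuid as string
  case "hstore" => Some(MapType(StringType, StringType)) // string-to-string map
  case "_text"  => Some(ArrayType(StringType))           // text[] arrays
  case _        => None                                  // fall back to defaults
}
{code}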
[jira] [Created] (SPARK-5917) Distinct is broken
Derrick Burns created SPARK-5917: Summary: Distinct is broken Key: SPARK-5917 URL: https://issues.apache.org/jira/browse/SPARK-5917 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 1.1.1 Environment: Spark 1.1.1 running on YARN 2.4 via Amazon EMR. Reporter: Derrick Burns Priority: Critical I hate to file bugs that are hard to reproduce (by other people), but after spending a full week trying to debug my code, I constructed a scenario where the following assertion FAILS:
{code}
val x: RDD[T] = ...
val y = x.distinct()
assert( y.count() <= x.count() )
{code}
I am at a complete loss as to how this can occur under ANY definition of equality/order unless the RDD underlying x is mutable. Since none of my RDD transforms mutate any existing RDD data and I am reading from immutable sources (data on S3), I conclude that there must be a bug in Spark or I am mutating my data unknowingly. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4682) Consolidate various 'Clock' classes
[ https://issues.apache.org/jira/browse/SPARK-4682?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-4682: - Affects Version/s: 1.2.0 Consolidate various 'Clock' classes --- Key: SPARK-4682 URL: https://issues.apache.org/jira/browse/SPARK-4682 Project: Spark Issue Type: Improvement Components: Spark Core, Streaming Affects Versions: 1.2.0 Reporter: Josh Rosen Fix For: 1.3.0 Spark currently has at least four different {{Clock}} classes for mocking out wall-clock time, most of which are nearly identical. We should replace all of these with one Clock class that lives in the utilities package. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-4682) Consolidate various 'Clock' classes
[ https://issues.apache.org/jira/browse/SPARK-4682?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or closed SPARK-4682. Resolution: Fixed Fix Version/s: 1.3.0 Assignee: Sean Owen Target Version/s: 1.3.0 Consolidate various 'Clock' classes --- Key: SPARK-4682 URL: https://issues.apache.org/jira/browse/SPARK-4682 Project: Spark Issue Type: Improvement Components: Spark Core, Streaming Affects Versions: 1.2.0 Reporter: Josh Rosen Assignee: Sean Owen Fix For: 1.3.0 Spark currently has at least four different {{Clock}} classes for mocking out wall-clock time, most of which are nearly identical. We should replace all of these with one Clock class that lives in the utilities package. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5669) Spark assembly includes incompatibly licensed libgfortran, libgcc code via JBLAS
[ https://issues.apache.org/jira/browse/SPARK-5669?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14328364#comment-14328364 ] Sean Owen commented on SPARK-5669: -- It *should* be fine on the grounds that the native libs are on the classpath and there is no conflict. That said I have not tried it. Are you proposing the new PR for 1.3.0? That would also solve the issue. If not, I would support it if you felt more comfortable restoring the native libs for 1.3.0 instead. Spark assembly includes incompatibly licensed libgfortran, libgcc code via JBLAS Key: SPARK-5669 URL: https://issues.apache.org/jira/browse/SPARK-5669 Project: Spark Issue Type: Bug Components: Build Reporter: Sean Owen Assignee: Sean Owen Priority: Blocker Fix For: 1.3.0 Sorry for Blocker, but it's a license issue. The Spark assembly includes the following, from JBLAS: {code} lib/ lib/static/ lib/static/Mac OS X/ lib/static/Mac OS X/x86_64/ lib/static/Mac OS X/x86_64/libjblas_arch_flavor.jnilib lib/static/Mac OS X/x86_64/sse3/ lib/static/Mac OS X/x86_64/sse3/libjblas.jnilib lib/static/Windows/ lib/static/Windows/x86/ lib/static/Windows/x86/libgfortran-3.dll lib/static/Windows/x86/libgcc_s_dw2-1.dll lib/static/Windows/x86/jblas_arch_flavor.dll lib/static/Windows/x86/sse3/ lib/static/Windows/x86/sse3/jblas.dll lib/static/Windows/amd64/ lib/static/Windows/amd64/libgfortran-3.dll lib/static/Windows/amd64/jblas.dll lib/static/Windows/amd64/libgcc_s_sjlj-1.dll lib/static/Windows/amd64/jblas_arch_flavor.dll lib/static/Linux/ lib/static/Linux/i386/ lib/static/Linux/i386/sse3/ lib/static/Linux/i386/sse3/libjblas.so lib/static/Linux/i386/libjblas_arch_flavor.so lib/static/Linux/amd64/ lib/static/Linux/amd64/sse3/ lib/static/Linux/amd64/sse3/libjblas.so lib/static/Linux/amd64/libjblas_arch_flavor.so {code} Unfortunately the libgfortran and libgcc libraries included for Windows are not licensed in a way that's compatible with Spark and the AL2 -- LGPL at least. It's easy to exclude them. I'm not clear what it does to running on Windows; I assume it can still work but the libs would have to be made available locally and put on the shared library path manually. I don't think there's a package manager as in Linux that would make it easily available. I'm not able to test on Windows. If it doesn't work, the follow-up question is whether that means JBLAS has to be removed on the double, or treated as a known issue for 1.3.0. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-5913) Python API for ChiSqSelector
Joseph K. Bradley created SPARK-5913: Summary: Python API for ChiSqSelector Key: SPARK-5913 URL: https://issues.apache.org/jira/browse/SPARK-5913 Project: Spark Issue Type: New Feature Components: MLlib, PySpark Affects Versions: 1.3.0 Reporter: Joseph K. Bradley Priority: Minor Add a Python API for mllib.feature.ChiSqSelector -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5860) JdbcRDD: overflow on large range with high number of partitions
[ https://issues.apache.org/jira/browse/SPARK-5860?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14328454#comment-14328454 ] Apache Spark commented on SPARK-5860: - User 'hotou' has created a pull request for this issue: https://github.com/apache/spark/pull/4701 JdbcRDD: overflow on large range with high number of partitions --- Key: SPARK-5860 URL: https://issues.apache.org/jira/browse/SPARK-5860 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.2.1 Reporter: Jeroen Warmerdam Priority: Minor
{code}
val jdbcRDD = new JdbcRDD(sc,
  () => DriverManager.getConnection(url, username, password),
  "SELECT id FROM documents WHERE ? <= id AND id <= ?",
  lowerBound = 1131544775L,
  upperBound = 567279358897692673L,
  numPartitions = 20,
  mapRow = r => r.getLong("id")
)
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
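To see where the overflow comes from: JdbcRDD computes partition bounds by, roughly, multiplying the range length by a partition index, and with bounds this far apart that product exceeds Long.MaxValue. Below is a sketch (not the actual PR) of the failure mode and one safe alternative that does the intermediate arithmetic in BigInt:
{code}
val lowerBound = 1131544775L
val upperBound = 567279358897692673L
val numPartitions = 20
val length = 1 + upperBound - lowerBound // still fits in a Long

// Overflows: for most i in 0 until 20, i * length exceeds Long.MaxValue.
// val start = lowerBound + (i * length) / numPartitions

// Safe: intermediate math in BigInt; each final bound fits in a Long
// by construction.
def partitionBounds(i: Int): (Long, Long) = {
  val start = BigInt(lowerBound) + (BigInt(i) * length) / numPartitions
  val end = BigInt(lowerBound) + (BigInt(i + 1) * length) / numPartitions - 1
  (start.toLong, end.toLong)
}
{code}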