[jira] [Commented] (SPARK-4094) checkpoint should still be available after rdd actions
[ https://issues.apache.org/jira/browse/SPARK-4094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14184897#comment-14184897 ] Apache Spark commented on SPARK-4094: - User 'liyezhang556520' has created a pull request for this issue: https://github.com/apache/spark/pull/2956 > checkpoint should still be available after rdd actions > -- > > Key: SPARK-4094 > URL: https://issues.apache.org/jira/browse/SPARK-4094 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.1.0 >Reporter: Zhang, Liye > > rdd.checkpoint() must be called before any action on this rdd; if any other > action has already run, the checkpoint will never succeed. Take the following > code as an example: > *rdd = sc.makeRDD(...)* > *rdd.collect()* > *rdd.checkpoint()* > *rdd.count()* > This rdd will never be checkpointed. RDD caching does not behave this way: > cache() always takes effect before subsequent rdd actions, no matter whether > any actions ran before cache(). > So rdd.checkpoint() should behave the same way as rdd.cache(). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
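The ordering problem reported above can be sketched with a toy model. This is not Spark's implementation; `ToyRDD` and its flags are hypothetical, and it only models the *reported* behavior that a checkpoint request is honored only on the first materialization:

```python
# Hypothetical sketch (not Spark's actual code) of the reported behavior:
# the checkpoint request only takes effect the first time the dataset is
# materialized by an action, so requesting it after an action does nothing.
class ToyRDD:
    def __init__(self, data):
        self.data = data
        self.checkpoint_requested = False
        self.checkpointed = False
        self.materialized = False  # set by the first action

    def checkpoint(self):
        # only records the request; work happens on the next first action
        self.checkpoint_requested = True

    def collect(self):  # an "action"
        if not self.materialized:
            self.materialized = True
            if self.checkpoint_requested:
                self.checkpointed = True
        return list(self.data)

rdd = ToyRDD([1, 2, 3])
rdd.collect()            # action runs before checkpoint()
rdd.checkpoint()
rdd.collect()            # too late: the RDD is never checkpointed
print(rdd.checkpointed)  # False
```

By contrast, calling `checkpoint()` before the first action sets the flag in time, which is the cache()-like behavior the issue asks for.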
[jira] [Commented] (SPARK-4096) Update executor memory description in the help message
[ https://issues.apache.org/jira/browse/SPARK-4096?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14184894#comment-14184894 ] Apache Spark commented on SPARK-4096: - User 'WangTaoTheTonic' has created a pull request for this issue: https://github.com/apache/spark/pull/2955 > Update executor memory description in the help message > -- > > Key: SPARK-4096 > URL: https://issues.apache.org/jira/browse/SPARK-4096 > Project: Spark > Issue Type: Improvement > Components: YARN >Reporter: WangTaoTheTonic >Priority: Minor > > `ApplicationMaster` accepts the executor memory argument only in numeric > format, so we should update the description in the help message.
[jira] [Created] (SPARK-4096) Update executor memory description in the help message
WangTaoTheTonic created SPARK-4096: -- Summary: Update executor memory description in the help message Key: SPARK-4096 URL: https://issues.apache.org/jira/browse/SPARK-4096 Project: Spark Issue Type: Improvement Components: YARN Reporter: WangTaoTheTonic Priority: Minor `ApplicationMaster` accepts the executor memory argument only in numeric format, so we should update the description in the help message.
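To illustrate why the help text matters: a user who passes a suffixed value like "2g" would be rejected by an argument parser that wants a plain number. A hedged sketch of the normalization such a parser would need (`memory_to_mb` is a hypothetical helper, not Spark's code, and the assumption that bare numbers mean megabytes is mine):

```python
# Hypothetical helper (not Spark's actual code): normalize an executor-memory
# argument such as "512", "512m", or "2g" into a plain number of megabytes,
# assuming bare numbers are already megabytes.
def memory_to_mb(arg: str) -> int:
    s = arg.strip().lower()
    if s.endswith("g"):
        return int(s[:-1]) * 1024  # gigabytes -> megabytes
    if s.endswith("m"):
        return int(s[:-1])
    return int(s)  # bare number, taken as megabytes

print(memory_to_mb("2g"))    # 2048
print(memory_to_mb("512m"))  # 512
print(memory_to_mb("512"))   # 512
```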
[jira] [Updated] (SPARK-4094) checkpoint should still be available after rdd actions
[ https://issues.apache.org/jira/browse/SPARK-4094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhang, Liye updated SPARK-4094: --- Description: rdd.checkpoint() must be called before any action on this rdd; if any other action has already run, the checkpoint will never succeed. Take the following code as an example: *rdd = sc.makeRDD(...)* *rdd.collect()* *rdd.checkpoint()* *rdd.count()* This rdd will never be checkpointed. RDD caching does not behave this way: cache() always takes effect before subsequent rdd actions, no matter whether any actions ran before cache(). So rdd.checkpoint() should behave the same way as rdd.cache(). was:kjh > checkpoint should still be available after rdd actions > -- > > Key: SPARK-4094 > URL: https://issues.apache.org/jira/browse/SPARK-4094 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.1.0 >Reporter: Zhang, Liye > > rdd.checkpoint() must be called before any action on this rdd; if any other > action has already run, the checkpoint will never succeed. Take the following > code as an example: > *rdd = sc.makeRDD(...)* > *rdd.collect()* > *rdd.checkpoint()* > *rdd.count()* > This rdd will never be checkpointed. RDD caching does not behave this way: > cache() always takes effect before subsequent rdd actions, no matter whether > any actions ran before cache(). > So rdd.checkpoint() should behave the same way as rdd.cache().
[jira] [Commented] (SPARK-4049) Storage web UI "fraction cached" shows as > 100%
[ https://issues.apache.org/jira/browse/SPARK-4049?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14184889#comment-14184889 ] Patrick Wendell commented on SPARK-4049: This actually seems alright to me if it means that a single partition is cached in two locations. > Storage web UI "fraction cached" shows as > 100% > > > Key: SPARK-4049 > URL: https://issues.apache.org/jira/browse/SPARK-4049 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 1.2.0 >Reporter: Josh Rosen >Priority: Minor > > In the Storage tab of the Spark Web UI, I saw a case where the "Fraction > Cached" was greater than 100%: > !http://i.imgur.com/Gm2hEeL.png!
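Patrick's explanation can be made concrete with a small sketch (my illustration with assumed names, not the Web UI's actual code): if the UI divides cached block *replicas* by total partitions, a partition cached on two executors pushes the fraction past 100%, while counting distinct partitions caps it at 100%.

```python
# Illustrative sketch (assumed, not Spark's Web UI code) of how counting
# replicas instead of distinct partitions yields "fraction cached" > 100%.
def fraction_cached(cached_blocks, total_partitions):
    # cached_blocks: list of (partition_id, executor_id) pairs
    return 100.0 * len(cached_blocks) / total_partitions

def fraction_cached_distinct(cached_blocks, total_partitions):
    distinct = {pid for pid, _ in cached_blocks}
    return 100.0 * len(distinct) / total_partitions

blocks = [(0, "exec-1"), (1, "exec-1"), (1, "exec-2")]  # partition 1 cached twice
print(fraction_cached(blocks, 2))           # 150.0
print(fraction_cached_distinct(blocks, 2))  # 100.0
```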
[jira] [Updated] (SPARK-4094) checkpoint should still be available after rdd actions
[ https://issues.apache.org/jira/browse/SPARK-4094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhang, Liye updated SPARK-4094: --- Description: kjh > checkpoint should still be available after rdd actions > -- > > Key: SPARK-4094 > URL: https://issues.apache.org/jira/browse/SPARK-4094 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.1.0 >Reporter: Zhang, Liye > > kjh
[jira] [Commented] (SPARK-4095) [YARN][Minor]extract val isLaunchingDriver in ClientBase
[ https://issues.apache.org/jira/browse/SPARK-4095?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14184881#comment-14184881 ] Apache Spark commented on SPARK-4095: - User 'WangTaoTheTonic' has created a pull request for this issue: https://github.com/apache/spark/pull/2954 > [YARN][Minor]extract val isLaunchingDriver in ClientBase > > > Key: SPARK-4095 > URL: https://issues.apache.org/jira/browse/SPARK-4095 > Project: Spark > Issue Type: Improvement > Components: YARN >Reporter: WangTaoTheTonic >Priority: Minor > > Instead of checking if `args.userClass` is null repeatedly, we extract it to > a global val, as in `ApplicationMaster`.
[jira] [Created] (SPARK-4095) [YARN][Minor]extract val isLaunchingDriver in ClientBase
WangTaoTheTonic created SPARK-4095: -- Summary: [YARN][Minor]extract val isLaunchingDriver in ClientBase Key: SPARK-4095 URL: https://issues.apache.org/jira/browse/SPARK-4095 Project: Spark Issue Type: Improvement Components: YARN Reporter: WangTaoTheTonic Priority: Minor Instead of checking if `args.userClass` is null repeatedly, we extract it to a global val, as in `ApplicationMaster`.
[jira] [Created] (SPARK-4094) checkpoint should still be available after rdd actions
Zhang, Liye created SPARK-4094: -- Summary: checkpoint should still be available after rdd actions Key: SPARK-4094 URL: https://issues.apache.org/jira/browse/SPARK-4094 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.1.0 Reporter: Zhang, Liye
[jira] [Commented] (SPARK-1442) Add Window function support
[ https://issues.apache.org/jira/browse/SPARK-1442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14184873#comment-14184873 ] Apache Spark commented on SPARK-1442: - User 'guowei2' has created a pull request for this issue: https://github.com/apache/spark/pull/2953 > Add Window function support > --- > > Key: SPARK-1442 > URL: https://issues.apache.org/jira/browse/SPARK-1442 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Chengxiang Li > > Similar to Hive, add window function support for Catalyst. > https://issues.apache.org/jira/browse/HIVE-4197 > https://issues.apache.org/jira/browse/HIVE-896
[jira] [Commented] (SPARK-2336) Approximate k-NN Models for MLLib
[ https://issues.apache.org/jira/browse/SPARK-2336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14184848#comment-14184848 ] Ashutosh Trivedi commented on SPARK-2336: - Is anybody already working on it? I can take up this task. We can also implement KNN joins, which would be a nice utility for data mining. Here is the link for KNN-joins: http://ww2.cs.fsu.edu/~czhang/knnjedbt/ > Approximate k-NN Models for MLLib > - > > Key: SPARK-2336 > URL: https://issues.apache.org/jira/browse/SPARK-2336 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: Brian Gawalt >Priority: Minor > Labels: features, newbie > > After tackling the general k-Nearest Neighbor model as per > https://issues.apache.org/jira/browse/SPARK-2335 , there's an opportunity to > also offer approximate k-Nearest Neighbor. A promising approach would involve > building a kd-tree variant within each partition, a la > http://www.autonlab.org/autonweb/14714.html?branch=1&language=2 > This could offer a simple non-linear ML model that can label new data with > much lower latency than the plain-vanilla kNN versions. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
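For reference, the "plain-vanilla" baseline the approximate approach would be measured against is a brute-force k-NN. A minimal sketch (my own illustration, not an MLlib API); the per-partition kd-tree proposal aims to approximate these answers at far lower query latency:

```python
import heapq

# Brute-force k-nearest-neighbors baseline: score every point against the
# query and keep the k closest by squared Euclidean distance.
def knn(points, query, k):
    def dist2(p):
        return sum((a - b) ** 2 for a, b in zip(p, query))
    return heapq.nsmallest(k, points, key=dist2)

pts = [(0.0, 0.0), (1.0, 0.0), (5.0, 5.0), (0.1, 0.1)]
print(knn(pts, (0.0, 0.0), 2))  # [(0.0, 0.0), (0.1, 0.1)]
```

This is O(n) per query; the kd-tree variant trades exactness for sub-linear lookups.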
[jira] [Comment Edited] (SPARK-3988) Public API for DateType support
[ https://issues.apache.org/jira/browse/SPARK-3988?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14180933#comment-14180933 ] Adrian Wang edited comment on SPARK-3988 at 10/27/14 4:28 AM: -- have to investigate solution 3 in spark-2674 was (Author: adrian-wang): have to investigate solution 3 in spark-2179 > Public API for DateType support > --- > > Key: SPARK-3988 > URL: https://issues.apache.org/jira/browse/SPARK-3988 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Adrian Wang >Assignee: Adrian Wang >Priority: Minor > > add Python API and something else.
[jira] [Commented] (SPARK-2396) Spark EC2 scripts fail when trying to log in to EC2 instances
[ https://issues.apache.org/jira/browse/SPARK-2396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14184803#comment-14184803 ] Anant Daksh Asthana commented on SPARK-2396: Seems like a python issue on your system. You are missing the subprocess module. > Spark EC2 scripts fail when trying to log in to EC2 instances > - > > Key: SPARK-2396 > URL: https://issues.apache.org/jira/browse/SPARK-2396 > Project: Spark > Issue Type: Bug > Components: EC2 >Affects Versions: 1.0.0 > Environment: Windows 8, Cygwin and command prompt, Python 2.7 >Reporter: Stephen M. Hopper > Labels: aws, ec2, ssh > > I cannot seem to successfully start up a Spark EC2 cluster using the > spark-ec2 script. > I'm using variations on the following command: > ./spark-ec2 --instance-type=m1.small --region=us-west-1 --spot-price=0.05 > --spark-version=1.0.0 -k my-key-name -i my-key-name.pem -s 1 launch > spark-test-cluster > The script always allocates the EC2 instances without much trouble, but can > never seem to complete the SSH step to install Spark on the cluster. It > always complains about my SSH key. If I try to log in with my ssh key doing > something like this: > ssh -i my-key-name.pem root@ > it fails. However, if I log in to the AWS console, click on my instance and > select "connect", it displays the instructions for SSHing into my instance > (which are no different from the ssh command from above). So, if I rerun the > SSH command from above, I'm able to log in. > Next, if I try to rerun the spark-ec2 command from above (replacing "launch" > with "start"), the script logs in and starts installing Spark. However, it > eventually errors out with the following output: > Cloning into 'spark-ec2'... > remote: Counting objects: 1465, done. > remote: Compressing objects: 100% (697/697), done. > remote: Total 1465 (delta 485), reused 1465 (delta 485) > Receiving objects: 100% (1465/1465), 228.51 KiB | 287 KiB/s, done. 
> Resolving deltas: 100% (485/485), done. > Connection to ec2-.us-west-1.compute.amazonaws.com closed. > Searching for existing cluster spark-test-cluster... > Found 1 master(s), 1 slaves > Starting slaves... > Starting master... > Waiting for instances to start up... > Waiting 120 more seconds... > Deploying files to master... > Traceback (most recent call last): > File "./spark_ec2.py", line 823, in > main() > File "./spark_ec2.py", line 815, in main > real_main() > File "./spark_ec2.py", line 806, in real_main > setup_cluster(conn, master_nodes, slave_nodes, opts, False) > File "./spark_ec2.py", line 450, in setup_cluster > deploy_files(conn, "deploy.generic", opts, master_nodes, slave_nodes, > modules) > File "./spark_ec2.py", line 593, in deploy_files > subprocess.check_call(command) > File "E:\windows_programs\Python27\lib\subprocess.py", line 535, in > check_call > retcode = call(*popenargs, **kwargs) > File "E:\windows_programs\Python27\lib\subprocess.py", line 522, in call > return Popen(*popenargs, **kwargs).wait() > File "E:\windows_programs\Python27\lib\subprocess.py", line 710, in __init__ > errread, errwrite) > File "E:\windows_programs\Python27\lib\subprocess.py", line 958, in > _execute_child > startupinfo) > WindowsError: [Error 2] The system cannot find the file specified > So, in short, am I missing something or is this a bug? Any help would be > appreciated. > Other notes: > -I've tried both us-west-1 and us-east-1 regions. > -I've tried several different instance types. > -I've tried playing with the permissions on the ssh key (600, 400, etc.), but > to no avail
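A note on the traceback above: `WindowsError: [Error 2]` raised from inside `subprocess` usually means the *external program being launched* was not found on PATH (on Windows, likely a tool the script shells out to for file deployment), not that the `subprocess` module itself is missing. A hedged sketch of a friendlier failure mode, under that assumption (`run_checked` is my own helper, and `shutil.which` requires Python 3.3+, newer than the reporter's Python 2.7):

```python
import shutil
import subprocess

# Sketch (an assumption about the cause, not a confirmed fix): check that the
# external program exists before handing the command to subprocess, so the
# user gets an actionable message instead of "[Error 2] file not found".
def run_checked(command):
    exe = command[0]
    if shutil.which(exe) is None:
        raise RuntimeError(
            "required program %r not found on PATH; "
            "install it (e.g. via Cygwin) before launching the cluster" % exe)
    subprocess.check_call(command)
```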
[jira] [Commented] (SPARK-3838) Python code example for Word2Vec in user guide
[ https://issues.apache.org/jira/browse/SPARK-3838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14184750#comment-14184750 ] Anant Daksh Asthana commented on SPARK-3838: Pull request for resolution can be found at https://github.com/apache/spark/pull/2952 > Python code example for Word2Vec in user guide > -- > > Key: SPARK-3838 > URL: https://issues.apache.org/jira/browse/SPARK-3838 > Project: Spark > Issue Type: Sub-task > Components: Documentation, MLlib >Reporter: Xiangrui Meng >Assignee: Anant Daksh Asthana >Priority: Trivial >
[jira] [Commented] (SPARK-3838) Python code example for Word2Vec in user guide
[ https://issues.apache.org/jira/browse/SPARK-3838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14184749#comment-14184749 ] Apache Spark commented on SPARK-3838: - User 'anantasty' has created a pull request for this issue: https://github.com/apache/spark/pull/2952 > Python code example for Word2Vec in user guide > -- > > Key: SPARK-3838 > URL: https://issues.apache.org/jira/browse/SPARK-3838 > Project: Spark > Issue Type: Sub-task > Components: Documentation, MLlib >Reporter: Xiangrui Meng >Assignee: Anant Daksh Asthana >Priority: Trivial >
[jira] [Closed] (SPARK-4091) Occasionally spark.local.dir can be deleted twice and causes test failure
[ https://issues.apache.org/jira/browse/SPARK-4091?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian closed SPARK-4091. - Resolution: Duplicate > Occasionally spark.local.dir can be deleted twice and causes test failure > - > > Key: SPARK-4091 > URL: https://issues.apache.org/jira/browse/SPARK-4091 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.1.0 >Reporter: Cheng Lian > > By persisting an arbitrary RDD with storage level {{MEMORY_AND_DISK}}, Spark > may occasionally throw the following exception when shutting down: > {code} > java.io.IOException: Failed to list files for dir: > /var/folders/kh/r9ylmzln40n9nrlchnsry2qwgn/T/spark-local-20141027005012-5bcd/0b > at org.apache.spark.util.Utils$.listFilesSafely(Utils.scala:664) > at org.apache.spark.util.Utils$.deleteRecursively(Utils.scala:678) > at > org.apache.spark.util.Utils$$anonfun$deleteRecursively$1.apply(Utils.scala:680) > at > org.apache.spark.util.Utils$$anonfun$deleteRecursively$1.apply(Utils.scala:678) > at > scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) > at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:34) > at org.apache.spark.util.Utils$.deleteRecursively(Utils.scala:678) > at > org.apache.spark.util.Utils$$anon$4$$anonfun$run$1$$anonfun$apply$mcV$sp$2.apply(Utils.scala:177) > at > org.apache.spark.util.Utils$$anon$4$$anonfun$run$1$$anonfun$apply$mcV$sp$2.apply(Utils.scala:175) > at scala.collection.mutable.HashSet.foreach(HashSet.scala:79) > at > org.apache.spark.util.Utils$$anon$4$$anonfun$run$1.apply$mcV$sp(Utils.scala:175) > at > org.apache.spark.util.Utils$$anon$4$$anonfun$run$1.apply(Utils.scala:173) > at > org.apache.spark.util.Utils$$anon$4$$anonfun$run$1.apply(Utils.scala:173) > at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1323) > at org.apache.spark.util.Utils$$anon$4.run(Utils.scala:173) > {code} > By adding log output to {{Utils.deleteRecursively}}, 
setting breakpoints at > {{File.delete}} in IntelliJ, and asking IntelliJ to evaluate and log > {{Thread.currentThread().getStackTrace()}} when the breakpoint is hit rather > than suspend execution, we can get the following result, which shows > {{spark.local.dir}} is deleted twice from both {{DiskBlockManager.stop}} and > the shutdown hook installed in {{Utils}}: > {code} > +++ Deleting file: > /var/folders/kh/r9ylmzln40n9nrlchnsry2qwgn/T/spark-local-20141027003412-7fae/1d > Breakpoint reached at java.io.File.delete(File.java:1028) > [java.lang.Thread.getStackTrace(Thread.java:1589) > java.io.File.delete(File.java:1028) > org.apache.spark.util.Utils$.deleteRecursively(Utils.scala:695) > > org.apache.spark.util.Utils$$anon$4$$anonfun$run$1$$anonfun$apply$mcV$sp$2.apply(Utils.scala:177) > > org.apache.spark.util.Utils$$anon$4$$anonfun$run$1$$anonfun$apply$mcV$sp$2.apply(Utils.scala:175) > scala.collection.mutable.HashSet.foreach(HashSet.scala:79) > > org.apache.spark.util.Utils$$anon$4$$anonfun$run$1.apply$mcV$sp(Utils.scala:175) > > org.apache.spark.util.Utils$$anon$4$$anonfun$run$1.apply(Utils.scala:173) > > org.apache.spark.util.Utils$$anon$4$$anonfun$run$1.apply(Utils.scala:173) > org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1323) > org.apache.spark.util.Utils$$anon$4.run(Utils.scala:173)] > +++ Deleting file: > /var/folders/kh/r9ylmzln40n9nrlchnsry2qwgn/T/spark-local-20141027003412-7fae/1d > Breakpoint reached at java.io.File.delete(File.java:1028) > [java.lang.Thread.getStackTrace(Thread.java:1589) > java.io.File.delete(File.java:1028) > org.apache.spark.util.Utils$.deleteRecursively(Utils.scala:695) > > org.apache.spark.util.Utils$$anonfun$deleteRecursively$1.apply(Utils.scala:680) > > org.apache.spark.util.Utils$$anonfun$deleteRecursively$1.apply(Utils.scala:678) > > scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) > scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:34) > 
org.apache.spark.util.Utils$.deleteRecursively(Utils.scala:678) > > org.apache.spark.storage.DiskBlockManager$$anonfun$stop$1.apply(DiskBlockManager.scala:157) > > org.apache.spark.storage.DiskBlockManager$$anonfun$stop$1.apply(DiskBlockManager.scala:154) > > scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) > scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108) > > org.apache.spark.storage.DiskBlockManager.stop(DiskBlockManager.scala:154) > > org.apache.spark.storage.DiskBlockManager$$anon$1$$anonfun$run$1.appl
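The stack traces in SPARK-4091 show the same directory deleted once by the shutdown hook and once by DiskBlockManager.stop. The general remedy (either deregister one deleter, as SPARK-3970 does, or make deletion idempotent) can be sketched as follows; this is my own Python illustration, not Spark's Scala Utils.deleteRecursively:

```python
import os
import shutil
import tempfile

# Sketch: make recursive deletion idempotent so a second deleter racing the
# first finds nothing to complain about, instead of raising an IOException.
def delete_recursively_idempotent(path):
    try:
        shutil.rmtree(path)
    except FileNotFoundError:
        pass  # someone else already deleted it; that is fine

d = tempfile.mkdtemp()
delete_recursively_idempotent(d)  # first deletion removes the directory
delete_recursively_idempotent(d)  # second deletion is a harmless no-op
print(os.path.exists(d))  # False
```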
[jira] [Commented] (SPARK-4091) Occasionally spark.local.dir can be deleted twice and causes test failure
[ https://issues.apache.org/jira/browse/SPARK-4091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14184741#comment-14184741 ] Cheng Lian commented on SPARK-4091: --- Yes, thanks [~joshrosen], closing this. > Occasionally spark.local.dir can be deleted twice and causes test failure > - > > Key: SPARK-4091 > URL: https://issues.apache.org/jira/browse/SPARK-4091 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.1.0 >Reporter: Cheng Lian > > By persisting an arbitrary RDD with storage level {{MEMORY_AND_DISK}}, Spark > may occasionally throw the following exception when shutting down: > {code} > java.io.IOException: Failed to list files for dir: > /var/folders/kh/r9ylmzln40n9nrlchnsry2qwgn/T/spark-local-20141027005012-5bcd/0b > at org.apache.spark.util.Utils$.listFilesSafely(Utils.scala:664) > at org.apache.spark.util.Utils$.deleteRecursively(Utils.scala:678) > at > org.apache.spark.util.Utils$$anonfun$deleteRecursively$1.apply(Utils.scala:680) > at > org.apache.spark.util.Utils$$anonfun$deleteRecursively$1.apply(Utils.scala:678) > at > scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) > at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:34) > at org.apache.spark.util.Utils$.deleteRecursively(Utils.scala:678) > at > org.apache.spark.util.Utils$$anon$4$$anonfun$run$1$$anonfun$apply$mcV$sp$2.apply(Utils.scala:177) > at > org.apache.spark.util.Utils$$anon$4$$anonfun$run$1$$anonfun$apply$mcV$sp$2.apply(Utils.scala:175) > at scala.collection.mutable.HashSet.foreach(HashSet.scala:79) > at > org.apache.spark.util.Utils$$anon$4$$anonfun$run$1.apply$mcV$sp(Utils.scala:175) > at > org.apache.spark.util.Utils$$anon$4$$anonfun$run$1.apply(Utils.scala:173) > at > org.apache.spark.util.Utils$$anon$4$$anonfun$run$1.apply(Utils.scala:173) > at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1323) > at 
org.apache.spark.util.Utils$$anon$4.run(Utils.scala:173) > {code} > By adding log output to {{Utils.deleteRecursively}}, setting breakpoints at > {{File.delete}} in IntelliJ, and asking IntelliJ to evaluate and log > {{Thread.currentThread().getStackTrace()}} when the breakpoint is hit rather > than suspend execution, we can get the following result, which shows > {{spark.local.dir}} is deleted twice from both {{DiskBlockManager.stop}} and > the shutdown hook installed in {{Utils}}: > {code} > +++ Deleting file: > /var/folders/kh/r9ylmzln40n9nrlchnsry2qwgn/T/spark-local-20141027003412-7fae/1d > Breakpoint reached at java.io.File.delete(File.java:1028) > [java.lang.Thread.getStackTrace(Thread.java:1589) > java.io.File.delete(File.java:1028) > org.apache.spark.util.Utils$.deleteRecursively(Utils.scala:695) > > org.apache.spark.util.Utils$$anon$4$$anonfun$run$1$$anonfun$apply$mcV$sp$2.apply(Utils.scala:177) > > org.apache.spark.util.Utils$$anon$4$$anonfun$run$1$$anonfun$apply$mcV$sp$2.apply(Utils.scala:175) > scala.collection.mutable.HashSet.foreach(HashSet.scala:79) > > org.apache.spark.util.Utils$$anon$4$$anonfun$run$1.apply$mcV$sp(Utils.scala:175) > > org.apache.spark.util.Utils$$anon$4$$anonfun$run$1.apply(Utils.scala:173) > > org.apache.spark.util.Utils$$anon$4$$anonfun$run$1.apply(Utils.scala:173) > org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1323) > org.apache.spark.util.Utils$$anon$4.run(Utils.scala:173)] > +++ Deleting file: > /var/folders/kh/r9ylmzln40n9nrlchnsry2qwgn/T/spark-local-20141027003412-7fae/1d > Breakpoint reached at java.io.File.delete(File.java:1028) > [java.lang.Thread.getStackTrace(Thread.java:1589) > java.io.File.delete(File.java:1028) > org.apache.spark.util.Utils$.deleteRecursively(Utils.scala:695) > > org.apache.spark.util.Utils$$anonfun$deleteRecursively$1.apply(Utils.scala:680) > > org.apache.spark.util.Utils$$anonfun$deleteRecursively$1.apply(Utils.scala:678) > > 
scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) > scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:34) > org.apache.spark.util.Utils$.deleteRecursively(Utils.scala:678) > > org.apache.spark.storage.DiskBlockManager$$anonfun$stop$1.apply(DiskBlockManager.scala:157) > > org.apache.spark.storage.DiskBlockManager$$anonfun$stop$1.apply(DiskBlockManager.scala:154) > > scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) > scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108) > > org.apache.spark.storage.DiskBlockManager.stop(DiskBlockManager.scala:154) >
[jira] [Created] (SPARK-4093) Simplify the unwrap/wrap between HiveUDFs
Cheng Hao created SPARK-4093: Summary: Simplify the unwrap/wrap between HiveUDFs Key: SPARK-4093 URL: https://issues.apache.org/jira/browse/SPARK-4093 Project: Spark Issue Type: Improvement Components: SQL Reporter: Cheng Hao Priority: Minor Currently, invoking nested Hive UDFs causes extra overhead in "unwrapping" / "wrapping" data, e.g. SELECT cos(sin(a)) FROM t; We can reuse the ObjectInspector and the output of the nested Hive UDF (sin), and avoid the extra data "unwrap" and "wrap".
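The overhead can be made concrete with a toy model (names like `wrap`/`unwrap` are assumed for illustration; this is not Hive's ObjectInspector API): each UDF boundary unwraps a boxed value into a raw one and wraps the result again, so cos(sin(a)) pays two full round trips instead of one.

```python
import math

# Count wrap/unwrap conversions in a toy nested-UDF pipeline.
conversions = {"wrap": 0, "unwrap": 0}

def wrap(v):                  # raw value -> boxed "Hive" value
    conversions["wrap"] += 1
    return {"boxed": v}

def unwrap(b):                # boxed "Hive" value -> raw value
    conversions["unwrap"] += 1
    return b["boxed"]

def apply_udf(udf, boxed):    # naive: unwrap and re-wrap at every UDF boundary
    return wrap(udf(unwrap(boxed)))

x = wrap(0.5)
y = apply_udf(math.cos, apply_udf(math.sin, x))
naive = dict(conversions)     # {'wrap': 3, 'unwrap': 2}

# Reusing the inner UDF's raw output skips the intermediate round trip:
conversions["wrap"] = conversions["unwrap"] = 0
x = wrap(0.5)
y_fused = wrap(math.cos(math.sin(unwrap(x))))
fused = dict(conversions)     # {'wrap': 2, 'unwrap': 1}

print(naive, fused)
```

Both paths compute the same value; the fused one just converts less, which is the saving the issue proposes.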
[jira] [Updated] (SPARK-3970) Remove duplicate removal of local dirs
[ https://issues.apache.org/jira/browse/SPARK-3970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-3970: - Assignee: Liang-Chi Hsieh > Remove duplicate removal of local dirs > -- > > Key: SPARK-3970 > URL: https://issues.apache.org/jira/browse/SPARK-3970 > Project: Spark > Issue Type: Bug >Affects Versions: 1.1.0 >Reporter: Liang-Chi Hsieh >Assignee: Liang-Chi Hsieh > Fix For: 1.2.0 > > > The shutdown hook of DiskBlockManager already removes localDirs, so there is > no need to also register them with Utils.registerShutdownDeleteDir. Doing so > causes duplicate removal of these local dirs and the corresponding exceptions.
[jira] [Closed] (SPARK-3970) Remove duplicate removal of local dirs
[ https://issues.apache.org/jira/browse/SPARK-3970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or closed SPARK-3970. Resolution: Fixed Fix Version/s: 1.2.0 Target Version/s: 1.2.0 > Remove duplicate removal of local dirs > -- > > Key: SPARK-3970 > URL: https://issues.apache.org/jira/browse/SPARK-3970 > Project: Spark > Issue Type: Bug >Affects Versions: 1.1.0 >Reporter: Liang-Chi Hsieh >Assignee: Liang-Chi Hsieh > Fix For: 1.2.0 > > > The shutdown hook of DiskBlockManager already removes localDirs, so there is > no need to also register them with Utils.registerShutdownDeleteDir. Doing so > causes duplicate removal of these local dirs and the corresponding exceptions.
[jira] [Updated] (SPARK-3970) Remove duplicate removal of local dirs
[ https://issues.apache.org/jira/browse/SPARK-3970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-3970: - Affects Version/s: 1.1.0 > Remove duplicate removal of local dirs > -- > > Key: SPARK-3970 > URL: https://issues.apache.org/jira/browse/SPARK-3970 > Project: Spark > Issue Type: Bug >Affects Versions: 1.1.0 >Reporter: Liang-Chi Hsieh >Assignee: Liang-Chi Hsieh > Fix For: 1.2.0 > > > The shutdown hook of DiskBlockManager already removes localDirs, so there is > no need to also register them with Utils.registerShutdownDeleteDir. Doing so > causes duplicate removal of these local dirs and the corresponding exceptions.
[jira] [Resolved] (SPARK-2760) Caching tables from multiple databases does not work
[ https://issues.apache.org/jira/browse/SPARK-2760?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-2760. - Resolution: Fixed Fix Version/s: 1.2.0 Assignee: Michael Armbrust This was fixed with the caching overhaul. > Caching tables from multiple databases does not work > > > Key: SPARK-2760 > URL: https://issues.apache.org/jira/browse/SPARK-2760 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.1.0 >Reporter: Michael Armbrust >Assignee: Michael Armbrust >Priority: Critical > Fix For: 1.2.0 >
[jira] [Resolved] (SPARK-4042) append columns ids and names before broadcast
[ https://issues.apache.org/jira/browse/SPARK-4042?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-4042. - Resolution: Fixed Fix Version/s: 1.2.0 Issue resolved by pull request 2885 [https://github.com/apache/spark/pull/2885] > append columns ids and names before broadcast > - > > Key: SPARK-4042 > URL: https://issues.apache.org/jira/browse/SPARK-4042 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.1.0 >Reporter: wangfei > Fix For: 1.2.0 > > > Appended column ids and names are not broadcast because we append them after > creating the table reader. As a result, the config broadcast to the executor > side does not contain the appended column ids and names.
[jira] [Resolved] (SPARK-4061) We cannot use EOL character in the operand of LIKE predicate.
[ https://issues.apache.org/jira/browse/SPARK-4061?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-4061. - Resolution: Fixed Fix Version/s: 1.2.0 Issue resolved by pull request 2908 [https://github.com/apache/spark/pull/2908] > We cannot use EOL character in the operand of LIKE predicate. > - > > Key: SPARK-4061 > URL: https://issues.apache.org/jira/browse/SPARK-4061 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.2.0 >Reporter: Kousuke Saruta > Fix For: 1.2.0 > > > We cannot use an EOL character like \n or \r in the operand of a LIKE > predicate, so the following condition is never true. > {code} > -- someStr is 'hoge\nfuga' > where someStr LIKE 'hoge_fuga' > {code}
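The essence of the bug is that a LIKE implementation built on regular expressions must let the single-character wildcard `_` match end-of-line characters too. A hedged sketch of that idea in Python (my own translation, not Spark's actual implementation): compiling the translated pattern with `re.DOTALL` makes `.` match `\n` as well.

```python
import re

# Translate a SQL LIKE pattern to a regex: "%" -> ".*", "_" -> ".",
# everything else literal. DOTALL lets the wildcards match "\n" and "\r".
def like_to_regex(pattern):
    parts = []
    for ch in pattern:
        if ch == "%":
            parts.append(".*")
        elif ch == "_":
            parts.append(".")
        else:
            parts.append(re.escape(ch))
    return re.compile("".join(parts) + r"\Z", re.DOTALL)

print(bool(like_to_regex("hoge_fuga").match("hoge\nfuga")))  # True
print(bool(like_to_regex("hoge_fuga").match("hogeXfuga")))   # True
print(bool(like_to_regex("hoge_fuga").match("hogefuga")))    # False
```

Without `re.DOTALL`, the first case would be False, which is exactly the condition the issue reports as "never true".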
[jira] [Resolved] (SPARK-3959) SqlParser fails to parse literal -9223372036854775808 (Long.MinValue).
[ https://issues.apache.org/jira/browse/SPARK-3959?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-3959. - Resolution: Fixed Fix Version/s: 1.2.0 Issue resolved by pull request 2816 [https://github.com/apache/spark/pull/2816] > SqlParser fails to parse literal -9223372036854775808 (Long.MinValue). > -- > > Key: SPARK-3959 > URL: https://issues.apache.org/jira/browse/SPARK-3959 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.2.0 >Reporter: Kousuke Saruta >Priority: Critical > Fix For: 1.2.0 > > > SqlParser fails to parse -9223372036854775808 (Long.MinValue), so we cannot > write queries such as the following. > {code} > SELECT value FROM someTable WHERE value > -9223372036854775808 > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
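The underlying problem is a grammar issue: if `-` is consumed as a standalone unary operator, the parser must first materialize the positive literal 9223372036854775808, which is one larger than Long.MaxValue and therefore out of range. A hedged Python sketch (names are ours, not SqlParser's) of a literal parser that keeps the sign folded into the token:

```python
LONG_MIN = -(2 ** 63)      # -9223372036854775808, Long.MinValue
LONG_MAX = 2 ** 63 - 1     #  9223372036854775807, Long.MaxValue

def parse_long_literal(text: str) -> int:
    """Parse a signed 64-bit literal with the sign attached to the digits.

    If the parser instead treated '-' as a separate unary operator, it
    would first have to represent +9223372036854775808, which does not
    fit in a signed 64-bit long: the failure reported in this issue.
    """
    value = int(text)
    if not LONG_MIN <= value <= LONG_MAX:
        raise ValueError(f"literal {text} is outside the signed 64-bit range")
    return value

print(parse_long_literal("-9223372036854775808"))  # -9223372036854775808
```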
[jira] [Resolved] (SPARK-3483) Special chars in column names
[ https://issues.apache.org/jira/browse/SPARK-3483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-3483. - Resolution: Fixed Fix Version/s: 1.2.0 Issue resolved by pull request 2927 [https://github.com/apache/spark/pull/2927] > Special chars in column names > - > > Key: SPARK-3483 > URL: https://issues.apache.org/jira/browse/SPARK-3483 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.0.2 >Reporter: Kuldeep >Assignee: Ravindra Pesala > Fix For: 1.2.0 > > > For columns with special characters in names, double quoted ANSI syntax would > be nice to have. > select "a/b" from mytable > Is there a workaround for this? Currently the grammar interprets this as a > string value. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3266) JavaDoubleRDD doesn't contain max()
[ https://issues.apache.org/jira/browse/SPARK-3266?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14184704#comment-14184704 ] Patrick Wendell commented on SPARK-3266: I think it sort of depends how many people use JavaRDDLike and how they use it. In my mind it wasn't intended to be used by user applications, but probably some do because there isn't really a way to write functions that pass RDD's around and deal with both Pair RDD's and normal ones in Java. [~matei], what do you think of this vis-a-vis compatibility? > JavaDoubleRDD doesn't contain max() > --- > > Key: SPARK-3266 > URL: https://issues.apache.org/jira/browse/SPARK-3266 > Project: Spark > Issue Type: Bug > Components: Java API >Affects Versions: 1.0.1, 1.0.2, 1.1.0, 1.2.0 >Reporter: Amey Chaugule >Assignee: Josh Rosen > Attachments: spark-repro-3266.tar.gz > > > While I can compile my code, I see: > Caused by: java.lang.NoSuchMethodError: > org.apache.spark.api.java.JavaDoubleRDD.max(Ljava/util/Comparator;)Ljava/lang/Double; > When I try to execute my Spark code. Stepping into the JavaDoubleRDD class, I > don't notice max() > although it is clearly listed in the documentation. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-4068) NPE in jsonRDD schema inference
[ https://issues.apache.org/jira/browse/SPARK-4068?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-4068. - Resolution: Fixed Fix Version/s: 1.2.0 Issue resolved by pull request 2918 [https://github.com/apache/spark/pull/2918] > NPE in jsonRDD schema inference > --- > > Key: SPARK-4068 > URL: https://issues.apache.org/jira/browse/SPARK-4068 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.1.0 >Reporter: Michael Armbrust >Assignee: Yin Huai >Priority: Critical > Fix For: 1.2.0 > > > {code} > val jsonData = """{"data":[[null], [[["Test"}""" :: """{"other": ""}""" > :: Nil > sqlContext.jsonRDD(sc.parallelize(jsonData)) > {code} > {code} > org.apache.spark.SparkException: Job aborted due to stage failure: Task 13 in > stage 5.0 failed 4 times, most recent failure: Lost task 13.3 in stage 5.0 > (TID 347, ip-10-0-234-152.us-west-2.compute.internal): > java.lang.NullPointerException: > > org.apache.spark.sql.json.JsonRDD$$anonfun$org$apache$spark$sql$json$JsonRDD$$allKeysWithValueTypes$1.org$apache$spark$sql$json$JsonRDD$$anonfun$$buildKeyPathForInnerStructs$1(JsonRDD.scala:252) > > org.apache.spark.sql.json.JsonRDD$$anonfun$org$apache$spark$sql$json$JsonRDD$$allKeysWithValueTypes$1$$anonfun$org$apache$spark$sql$json$JsonRDD$$anonfun$$buildKeyPathForInnerStructs$1$3.apply(JsonRDD.scala:253) > > org.apache.spark.sql.json.JsonRDD$$anonfun$org$apache$spark$sql$json$JsonRDD$$allKeysWithValueTypes$1$$anonfun$org$apache$spark$sql$json$JsonRDD$$anonfun$$buildKeyPathForInnerStructs$1$3.apply(JsonRDD.scala:253) > > scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251) > > scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251) > > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) > > 
scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:251) > scala.collection.AbstractTraversable.flatMap(Traversable.scala:105) > ... > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
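The NPE comes from schema inference recursing into a JSON null as if it were a nested record or array. A rough Python analogue (the function name is hypothetical, not Spark's JsonRDD code) of collecting key paths while treating null as a leaf:

```python
def all_key_paths(value, prefix=""):
    """Collect dotted key paths from nested JSON-like data.

    Treating None (JSON null) as a leaf instead of recursing into it is
    the null check whose absence produced the NullPointerException
    described in this issue.
    """
    if isinstance(value, dict):
        paths = []
        for key, child in value.items():
            paths.append(prefix + key)
            paths.extend(all_key_paths(child, prefix + key + "."))
        return paths
    if isinstance(value, list):
        paths = []
        for item in value:
            paths.extend(all_key_paths(item, prefix))
        return paths
    return []  # None, strings, numbers: nothing to recurse into

# data shaped like the reproduction case: nulls nested inside arrays
print(all_key_paths({"data": [[None], [["Test"]]], "other": ""}))  # ['data', 'other']
```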
[jira] [Resolved] (SPARK-4052) Use scala.collection.Map for pattern matching instead of using Predef.Map (it is scala.collection.immutable.Map)
[ https://issues.apache.org/jira/browse/SPARK-4052?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-4052. - Resolution: Fixed Fix Version/s: 1.2.0 > Use scala.collection.Map for pattern matching instead of using Predef.Map (it > is scala.collection.immutable.Map) > > > Key: SPARK-4052 > URL: https://issues.apache.org/jira/browse/SPARK-4052 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.1.0 >Reporter: Yin Huai >Assignee: Yin Huai >Priority: Minor > Fix For: 1.2.0 > > > Seems ScalaReflection and InsertIntoHiveTable only take > scala.collection.immutable.Map as the value type of MapType. Here are test > cases showing errors. > {code} > val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc) > import sqlContext.createSchemaRDD > val rdd = sc.parallelize(("key", "value") :: Nil) > // Test1: This one fails. > case class Test1(m: scala.collection.Map[String, String]) > val rddOfTest1 = rdd.map { case (k, v) => Test1(Map(k->v)) } > rddOfTest1.registerTempTable("t1") > /* Stack trace > scala.MatchError: scala.collection.Map[String,String] (of class > scala.reflect.internal.Types$TypeRef$$anon$5) > at > org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:53) > at > org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$schemaFor$1.apply(ScalaReflection.scala:64) > at > org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$schemaFor$1.apply(ScalaReflection.scala:62) > ... > */ > // Test2: This one is fine. > case class Test2(m: scala.collection.immutable.Map[String, String]) > val rddOfTest2 = rdd.map { case (k, v) => Test2(Map(k->v)) } > rddOfTest2.registerTempTable("t2") > sqlContext.sql("SELECT m FROM t2").collect > sqlContext.sql("SELECT m['key'] FROM t2").collect > // Test3: This one fails. 
> val schema = StructType(StructField("m", MapType(StringType, StringType), > true) :: Nil) > val rowRDD = rdd.map { case (k, v) => > Row(scala.collection.mutable.HashMap(k->v)) } > val schemaRDD = sqlContext.applySchema(rowRDD, schema) > schemaRDD.registerTempTable("t3") > sqlContext.sql("SELECT m FROM t3").collect > sqlContext.sql("SELECT m['key'] FROM t3").collect > sqlContext.sql("CREATE TABLE testHiveTable1(m MAP<STRING, STRING>)") > sqlContext.sql("INSERT OVERWRITE TABLE testHiveTable1 SELECT m FROM t3") > /* Stack trace > 14/10/22 19:30:56 INFO DAGScheduler: Job 4 failed: runJob at > InsertIntoHiveTable.scala:124, took 1.384579 s > org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in > stage 4.0 failed 4 times, most recent failure: Lost task 1.3 in stage 4.0 > (TID 12, yins-mbp): java.lang.ClassCastException: > scala.collection.mutable.HashMap cannot be cast to > scala.collection.immutable.Map > > org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$wrapperFor$5.apply(InsertIntoHiveTable.scala:96) > > org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$wrapperFor$5.apply(InsertIntoHiveTable.scala:96) > > org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$org$apache$spark$sql$hive$execution$InsertIntoHiveTable$$writeToFile$1$1.apply(InsertIntoHiveTable.scala:148) > > org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$org$apache$spark$sql$hive$execution$InsertIntoHiveTable$$writeToFile$1$1.apply(InsertIntoHiveTable.scala:145) > */ > // Test4: This one is fine. 
> val rowRDD = rdd.map { case (k, v) => Row(Map(k->v)) } > val schemaRDD = sqlContext.applySchema(rowRDD, schema) > schemaRDD.registerTempTable("t4") > sqlContext.sql("SELECT m FROM t4").collect > sqlContext.sql("SELECT m['key'] FROM t4").collect > sqlContext.sql("CREATE TABLE testHiveTable1(m MAP<STRING, STRING>)") > sqlContext.sql("INSERT OVERWRITE TABLE testHiveTable1 SELECT m FROM t4") > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
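The pitfall behind Test1 and Test3 has an easy analogue outside Scala. In the Python sketch below (an analogy, not Spark code), dispatching on the abstract `collections.abc.Mapping` accepts every map implementation, while a check against the concrete `dict` (the counterpart of matching only `Predef.Map`, i.e. `scala.collection.immutable.Map`) silently rejects other mappings:

```python
from collections.abc import Mapping
from types import MappingProxyType

def schema_for(value):
    """Infer a toy 'schema' name for a value.

    Dispatching on the abstract Mapping type (analogue of
    scala.collection.Map) accepts every map implementation; an
    isinstance(value, dict) check (analogue of matching only the
    immutable Predef.Map) would miss MappingProxyType below, which
    is the class of bug this issue fixes.
    """
    if isinstance(value, Mapping):
        return "MapType"
    return "StringType"

proxy = MappingProxyType({"key": "value"})
print(schema_for(proxy))        # MapType
print(isinstance(proxy, dict))  # False: a concrete-dict check would reject it
```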
[jira] [Resolved] (SPARK-3953) Confusable variable name.
[ https://issues.apache.org/jira/browse/SPARK-3953?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-3953. - Resolution: Fixed Fix Version/s: 1.2.0 Issue resolved by pull request 2807 [https://github.com/apache/spark/pull/2807] > Confusable variable name. > - > > Key: SPARK-3953 > URL: https://issues.apache.org/jira/browse/SPARK-3953 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.2.0 >Reporter: Kousuke Saruta >Priority: Minor > Fix For: 1.2.0 > > > In SqlParser.scala, there is following code. > {code} > case d ~ p ~ r ~ f ~ g ~ h ~ o ~ l => > val base = r.getOrElse(NoRelation) > val withFilter = f.map(f => Filter(f, base)).getOrElse(base) > {code} > in the code above, there 2 variables which has same name "f" in near place. > One is receiver "f" and other is bound variable "f". -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-3997) scalastyle should output the error location
[ https://issues.apache.org/jira/browse/SPARK-3997?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen resolved SPARK-3997. --- Resolution: Fixed Fix Version/s: 1.2.0 Issue resolved by pull request 2846 [https://github.com/apache/spark/pull/2846] > scalastyle should output the error location > --- > > Key: SPARK-3997 > URL: https://issues.apache.org/jira/browse/SPARK-3997 > Project: Spark > Issue Type: Bug > Components: Build >Reporter: Guoqiang Li > Fix For: 1.2.0 > > > {{./dev/scalastyle}} => > {noformat} > Scalastyle checks failed at following occurrences: > java.lang.RuntimeException: exists error > at scala.sys.package$.error(package.scala:27) > at scala.Predef$.error(Predef.scala:142) > [error] (mllib/*:scalastyle) exists error > {noformat} > scalastyle should output the error location: > {noformat} > [error] > /Users/witgo/work/code/java/spark/mllib/src/main/scala/org/apache/spark/mllib/feature/TopicModeling.scala:413: > File line length exceeds 100 characters > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3266) JavaDoubleRDD doesn't contain max()
[ https://issues.apache.org/jira/browse/SPARK-3266?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14184696#comment-14184696 ] Josh Rosen commented on SPARK-3266: --- I've opened a new pull request which tries to work around the Scala issue by moving the implementations of these methods from the Java*Like traits into abstract base classes that inherit from those traits (essentially making the traits act as interfaces). This breaks binary compatibility from Scala's point of view, since the fact that a trait contains a default implementation of a method is part of its API contract (it affects implementors of that trait). I don't think there's any legitimate reason for someone to have extended JavaRDDLike from their own code, so we shouldn't have to worry about this. From a simplicity perspective, I prefer the approach from my first PR of simply converting JavaRDDLike into an abstract class. This would cause problems for Java API users who were invoking methods through the interface, though. I can't imagine that most users would have done this, but maybe it's important to not break compatibility. On the other hand, the current API is functionally broken as long as it's throwing NoSuchMethodErrors. The one approach that doesn't break _any_ binary compatibility would be to just keep the default implementations of methods in JavaRDDLike then copy-paste the ones affected by the bugs into the individual JavaRDD classes. This is a mess, but I can do it if necessary. 
> JavaDoubleRDD doesn't contain max() > --- > > Key: SPARK-3266 > URL: https://issues.apache.org/jira/browse/SPARK-3266 > Project: Spark > Issue Type: Bug > Components: Java API >Affects Versions: 1.0.1, 1.0.2, 1.1.0, 1.2.0 >Reporter: Amey Chaugule >Assignee: Josh Rosen > Attachments: spark-repro-3266.tar.gz > > > While I can compile my code, I see: > Caused by: java.lang.NoSuchMethodError: > org.apache.spark.api.java.JavaDoubleRDD.max(Ljava/util/Comparator;)Ljava/lang/Double; > When I try to execute my Spark code. Stepping into the JavaDoubleRDD class, I > don't notice max() > although it is clearly listed in the documentation. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-3537) Statistics for cached RDDs
[ https://issues.apache.org/jira/browse/SPARK-3537?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-3537. - Resolution: Fixed Fix Version/s: 1.2.0 Issue resolved by pull request 2860 [https://github.com/apache/spark/pull/2860] > Statistics for cached RDDs > -- > > Key: SPARK-3537 > URL: https://issues.apache.org/jira/browse/SPARK-3537 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Michael Armbrust >Assignee: Cheng Lian > Fix For: 1.2.0 > > > Right now we only have limited statistics for hive tables. We could easily > collect this data when caching an RDD as well. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3266) JavaDoubleRDD doesn't contain max()
[ https://issues.apache.org/jira/browse/SPARK-3266?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14184688#comment-14184688 ] Apache Spark commented on SPARK-3266: - User 'JoshRosen' has created a pull request for this issue: https://github.com/apache/spark/pull/2951 > JavaDoubleRDD doesn't contain max() > --- > > Key: SPARK-3266 > URL: https://issues.apache.org/jira/browse/SPARK-3266 > Project: Spark > Issue Type: Bug > Components: Java API >Affects Versions: 1.0.1, 1.0.2, 1.1.0, 1.2.0 >Reporter: Amey Chaugule >Assignee: Josh Rosen > Attachments: spark-repro-3266.tar.gz > > > While I can compile my code, I see: > Caused by: java.lang.NoSuchMethodError: > org.apache.spark.api.java.JavaDoubleRDD.max(Ljava/util/Comparator;)Ljava/lang/Double; > When I try to execute my Spark code. Stepping into the JavaDoubleRDD class, I > don't notice max() > although it is clearly listed in the documentation. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-3925) Do not consider the ordering of qualifiers during comparison
[ https://issues.apache.org/jira/browse/SPARK-3925?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-3925. - Resolution: Fixed Fix Version/s: 1.2.0 Issue resolved by pull request 2783 [https://github.com/apache/spark/pull/2783] > Do not consider the ordering of qualifiers during comparison > > > Key: SPARK-3925 > URL: https://issues.apache.org/jira/browse/SPARK-3925 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Liang-Chi Hsieh > Fix For: 1.2.0 > > > The qualifiers orderings should not be considered during the comparison > between old qualifiers and new qualifiers when calling 'withQualifiers'. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-799) Windows versions of the deploy scripts
[ https://issues.apache.org/jira/browse/SPARK-799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14184663#comment-14184663 ] Andrew Tweddle commented on SPARK-799: -- Powershell is the modern Microsoft shell for Windows. Do you specifically want .cmd files rather than .ps1? What about .cmd files that delegate to .ps1 scripts? > Windows versions of the deploy scripts > -- > > Key: SPARK-799 > URL: https://issues.apache.org/jira/browse/SPARK-799 > Project: Spark > Issue Type: Bug > Components: Deploy, Windows >Reporter: Matei Zaharia > Labels: Starter > > Although the Spark daemons run fine on Windows with run.cmd, the deploy > scripts (bin/start-all.sh and such) don't do so unless you have Cygwin. It > would be nice to make .cmd versions of those. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3960) We can apply unary minus only to literal.
[ https://issues.apache.org/jira/browse/SPARK-3960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14184649#comment-14184649 ] Apache Spark commented on SPARK-3960: - User 'sarutak' has created a pull request for this issue: https://github.com/apache/spark/pull/2949 > We can apply unary minus only to literal. > - > > Key: SPARK-3960 > URL: https://issues.apache.org/jira/browse/SPARK-3960 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.2.0 >Reporter: Kousuke Saruta >Priority: Critical > > Because of the wrong syntax definition, we cannot apply unary minus only to > literal. So, we cannot write such expressions. > {code} > -(value1 + value2) // Parenthesized expressions > -column // Columns > -MAX(column) // Functions > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3959) SqlParser fails to parse literal -9223372036854775808 (Long.MinValue).
[ https://issues.apache.org/jira/browse/SPARK-3959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14184648#comment-14184648 ] Apache Spark commented on SPARK-3959: - User 'sarutak' has created a pull request for this issue: https://github.com/apache/spark/pull/2949 > SqlParser fails to parse literal -9223372036854775808 (Long.MinValue). > -- > > Key: SPARK-3959 > URL: https://issues.apache.org/jira/browse/SPARK-3959 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.2.0 >Reporter: Kousuke Saruta >Priority: Critical > > SqlParser fails to parse -9223372036854775808 (Long.MinValue), so we cannot > write queries such as the following. > {code} > SELECT value FROM someTable WHERE value > -9223372036854775808 > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-4092) Input metrics don't work for coalesce()'d RDD's
Patrick Wendell created SPARK-4092: -- Summary: Input metrics don't work for coalesce()'d RDD's Key: SPARK-4092 URL: https://issues.apache.org/jira/browse/SPARK-4092 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Patrick Wendell Priority: Critical In every case where we set input metrics (from both Hadoop and block storage) we currently assume that exactly one input partition is computed within the task. This is not a correct assumption in the general case. The main example in the current API is coalesce(), but user-defined RDD's could also be affected. To deal with the most general case, we would need to support the notion of a single task having multiple input sources. A more surgical and less general fix is to simply go to HadoopRDD and check if there are already inputMetrics defined for the task with the same "type". If there are, then merge in the new data rather than blowing away the old one. This wouldn't cover the case where, e.g., a single task has input from both on-disk and in-memory blocks. It _would_ cover the case where someone calls coalesce on a HadoopRDD... which is more common. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
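The surgical fix described above can be sketched in a few lines. This is a toy Python model (the class and method names are illustrative, not Spark's TaskMetrics API): incoming input metrics are merged into the existing ones when the read method matches, instead of overwriting them.

```python
from dataclasses import dataclass

@dataclass
class InputMetrics:
    read_method: str       # e.g. "Hadoop" or "Memory"
    bytes_read: int = 0

class TaskMetrics:
    """Toy model of per-task input metrics with the merge-if-same-type fix."""

    def __init__(self):
        self.input_metrics = None

    def record_input(self, read_method: str, bytes_read: int):
        existing = self.input_metrics
        if existing is not None and existing.read_method == read_method:
            # same source type: merge in the new data rather than
            # blowing away the old metrics
            existing.bytes_read += bytes_read
        else:
            # first record, or a different source type: replace
            self.input_metrics = InputMetrics(read_method, bytes_read)

tm = TaskMetrics()
tm.record_input("Hadoop", 100)  # first coalesced partition
tm.record_input("Hadoop", 250)  # second partition: merged, not overwritten
print(tm.input_metrics.bytes_read)  # 350
```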
[jira] [Commented] (SPARK-2811) update algebird to 0.8.1
[ https://issues.apache.org/jira/browse/SPARK-2811?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14184635#comment-14184635 ] Adam Pingel commented on SPARK-2811: This seemed like an easy first way to contribute to spark. I created a pull request with the 1-line change https://github.com/apache/spark/pull/2947 and confirmed that the two uses of Algebird (the streaming examples TwitterAlgebirdHLL and TwitterAlgebirdCMS) still work. > update algebird to 0.8.1 > > > Key: SPARK-2811 > URL: https://issues.apache.org/jira/browse/SPARK-2811 > Project: Spark > Issue Type: Sub-task > Components: Build, Spark Core >Reporter: Anand Avati > > First algebird_2.11 0.8.1 has to be released -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2811) update algebird to 0.8.1
[ https://issues.apache.org/jira/browse/SPARK-2811?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14184627#comment-14184627 ] Apache Spark commented on SPARK-2811: - User 'adampingel' has created a pull request for this issue: https://github.com/apache/spark/pull/2947 > update algebird to 0.8.1 > > > Key: SPARK-2811 > URL: https://issues.apache.org/jira/browse/SPARK-2811 > Project: Spark > Issue Type: Sub-task > Components: Build, Spark Core >Reporter: Anand Avati > > First algebird_2.11 0.8.1 has to be released -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-1812) Support cross-building with Scala 2.11
[ https://issues.apache.org/jira/browse/SPARK-1812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14184625#comment-14184625 ] Apache Spark commented on SPARK-1812: - User 'adampingel' has created a pull request for this issue: https://github.com/apache/spark/pull/2947 > Support cross-building with Scala 2.11 > -- > > Key: SPARK-1812 > URL: https://issues.apache.org/jira/browse/SPARK-1812 > Project: Spark > Issue Type: New Feature > Components: Build, Spark Core >Reporter: Matei Zaharia >Assignee: Prashant Sharma > > Since Scala 2.10/2.11 are source compatible, we should be able to cross build > for both versions. From what I understand there are basically three things we > need to figure out: > 1. Have a two versions of our dependency graph, one that uses 2.11 > dependencies and the other that uses 2.10 dependencies. > 2. Figure out how to publish different poms for 2.10 and 2.11. > I think (1) can be accomplished by having a scala 2.11 profile. (2) isn't > really well supported by Maven since published pom's aren't generated > dynamically. But we can probably script around it to make it work. I've done > some initial sanity checks with a simple build here: > https://github.com/pwendell/scala-maven-crossbuild -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4061) We cannot use EOL character in the operand of LIKE predicate.
[ https://issues.apache.org/jira/browse/SPARK-4061?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14184623#comment-14184623 ] Apache Spark commented on SPARK-4061: - User 'sarutak' has created a pull request for this issue: https://github.com/apache/spark/pull/2946 > We cannot use EOL character in the operand of LIKE predicate. > - > > Key: SPARK-4061 > URL: https://issues.apache.org/jira/browse/SPARK-4061 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.2.0 >Reporter: Kousuke Saruta > > We cannot use an EOL character such as \n or \r in the operand of a LIKE > predicate. So the following condition is never true. > {code} > -- someStr is 'hoge\nfuga' > where someStr LIKE 'hoge_fuga' > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2105) SparkUI doesn't remove active stages that failed
[ https://issues.apache.org/jira/browse/SPARK-2105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14184620#comment-14184620 ] Andrew Or commented on SPARK-2105: -- Hey Josh, I think this was fixed by this commit: https://github.com/apache/spark/commit/d934801d53fc2f1d57d3534ae4e1e9384c7dda99 The root cause is that we were dropping events, and that happened because one of the listeners was taking too long to process the events. We may run into this only if the application attaches arbitrary listeners to Spark and these listeners perform expensive operations, but from Spark's side I don't think there's anything we can do about that. > SparkUI doesn't remove active stages that failed > > > Key: SPARK-2105 > URL: https://issues.apache.org/jira/browse/SPARK-2105 > Project: Spark > Issue Type: Bug > Components: Spark Core, Web UI >Affects Versions: 1.0.0 >Reporter: Andrew Or > > If a stage fails because its tasks cannot be serialized, for instance, the > failed stage remains in the Active Stages section forever. This is because > the StageCompleted event is never posted. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
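The dropped-event failure mode is easy to model. Below is a minimal Python sketch (names are illustrative, not Spark's listener bus implementation) of a bounded event queue: when a slow listener lets the queue fill up, a later StageCompleted event is silently dropped, so the UI never learns the stage finished.

```python
from collections import deque

class ListenerBus:
    """Bounded event bus sketch: events posted to a full queue are dropped."""

    def __init__(self, capacity: int):
        self.queue = deque()
        self.capacity = capacity
        self.dropped = 0

    def post(self, event: str) -> bool:
        if len(self.queue) >= self.capacity:
            self.dropped += 1  # event silently lost; the UI never sees it
            return False
        self.queue.append(event)
        return True

# a slow listener has let two events pile up already
bus = ListenerBus(capacity=2)
bus.post("StageSubmitted")
bus.post("TaskEnd")
accepted = bus.post("StageCompleted")  # queue full: dropped
print(accepted, bus.dropped)
```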
[jira] [Resolved] (SPARK-3590) Expose async APIs in the Java API
[ https://issues.apache.org/jira/browse/SPARK-3590?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen resolved SPARK-3590. --- Resolution: Fixed Fix Version/s: 1.2.0 Assignee: Josh Rosen > Expose async APIs in the Java API > - > > Key: SPARK-3590 > URL: https://issues.apache.org/jira/browse/SPARK-3590 > Project: Spark > Issue Type: New Feature > Components: Java API >Reporter: Marcelo Vanzin >Assignee: Josh Rosen > Fix For: 1.2.0 > > > Currently, a single async method is exposed through the Java API > (JavaRDDLike::foreachAsync). That method returns a Scala future > (FutureAction). > We should bring the Java API up to sync with the Scala async APIs, and also > expose Java-friendly types (e.g. a proper java.util.concurrent.Future). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-3274) Spark Streaming Java API reports java.lang.ClassCastException when calling collectAsMap on JavaPairDStream
[ https://issues.apache.org/jira/browse/SPARK-3274?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen resolved SPARK-3274. --- Resolution: Invalid > Spark Streaming Java API reports java.lang.ClassCastException when calling > collectAsMap on JavaPairDStream > -- > > Key: SPARK-3274 > URL: https://issues.apache.org/jira/browse/SPARK-3274 > Project: Spark > Issue Type: Bug > Components: Java API >Affects Versions: 1.0.2 >Reporter: Jack Hu > > Reproduce code: > scontext > .socketTextStream("localhost", 1) > .mapToPair(new PairFunction<String, String, String>() { > public Tuple2<String, String> call(String arg0) > throws Exception { > return new Tuple2<String, String>("1", arg0); > } > }) > .foreachRDD(new Function2<JavaPairRDD<String, String>, Time, > Void>() { > public Void call(JavaPairRDD<String, String> v1, Time > v2) throws Exception { > System.out.println(v2.toString() + ": " + > v1.collectAsMap().toString()); > return null; > } > }); > Exception: > java.lang.ClassCastException: [Ljava.lang.Object; cannot be cast to > [Lscala.Tuple2; > at > org.apache.spark.rdd.PairRDDFunctions.collectAsMap(PairRDDFunctions.scala:447) > at > org.apache.spark.api.java.JavaPairRDD.collectAsMap(JavaPairRDD.scala:464) > at tuk.usecase.failedcall.FailedCall$1.call(FailedCall.java:90) > at tuk.usecase.failedcall.FailedCall$1.call(FailedCall.java:88) > at > org.apache.spark.streaming.api.java.JavaDStreamLike$$anonfun$foreachRDD$2.apply(JavaDStreamLike.scala:282) > at > org.apache.spark.streaming.api.java.JavaDStreamLike$$anonfun$foreachRDD$2.apply(JavaDStreamLike.scala:282) > at > org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply$mcV$sp(ForEachDStream.scala:41) > at > org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:40) > at > org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:40) > at scala.util.Try$.apply(Try.scala:161) > at org.apache.spark.streaming.scheduler.Job.run(Job.scala:32) > at >
org.apache.spark.streaming.scheduler.JobScheduler$JobHandler.run(JobS -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2105) SparkUI doesn't remove active stages that failed
[ https://issues.apache.org/jira/browse/SPARK-2105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14184613#comment-14184613 ] Josh Rosen commented on SPARK-2105: --- I tried and failed to reproduce this: https://github.com/apache/spark/commit/bf589fc717c842d1998e3c3a523bc8775cb30269#diff-f346ada4cd59416756b6dd36b6c2605aR97 That doesn't mean that we've fixed the issue, though. In my tests, the stage never becomes active because the ClosureCleaner detects that the task isn't serializable. Maybe there's some UDF that manages to slip through the closure cleaning step and fails once the stage is submitted to the scheduler, so it's still possible that we could hit this bug. > SparkUI doesn't remove active stages that failed > > > Key: SPARK-2105 > URL: https://issues.apache.org/jira/browse/SPARK-2105 > Project: Spark > Issue Type: Bug > Components: Spark Core, Web UI >Affects Versions: 1.0.0 >Reporter: Andrew Or > > If a stage fails because its tasks cannot be serialized, for instance, the > failed stage remains in the Active Stages section forever. This is because > the StageCompleted event is never posted. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-3021) Job remains in Active Stages after failing
[ https://issues.apache.org/jira/browse/SPARK-3021?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen resolved SPARK-3021. --- Resolution: Cannot Reproduce Fix Version/s: 1.2.0 Assignee: Josh Rosen I tried to reproduce this in Selenium (https://github.com/apache/spark/commit/bf589fc717c842d1998e3c3a523bc8775cb30269#diff-f346ada4cd59416756b6dd36b6c2605aR87), but wasn't able to find a reproduction in Spark 1.2. Therefore, I'm going to resolve this as "Cannot Reproduce" for now. > Job remains in Active Stages after failing > -- > > Key: SPARK-3021 > URL: https://issues.apache.org/jira/browse/SPARK-3021 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 1.1.0 >Reporter: Michael Armbrust >Assignee: Josh Rosen > Fix For: 1.2.0 > > > It died with the following exception, but it is still hanging out in the UI. > {code} > org.apache.spark.SparkException: Job aborted due to stage failure: Task 20 in > stage 8.1 failed 4 times, most recent failure: Lost task 20.3 in stage 8.1 > (TID 710, ip-10-0-166-165.us-west-2.compute.internal): ExecutorLostFailure > (executor lost) > Driver stacktrace: > at > org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1153) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1142) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1141) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-2527) incorrect persistence level shown in Spark UI after repersisting
[ https://issues.apache.org/jira/browse/SPARK-2527?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-2527: -- Fix Version/s: 1.2.0 > incorrect persistence level shown in Spark UI after repersisting > > > Key: SPARK-2527 > URL: https://issues.apache.org/jira/browse/SPARK-2527 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 1.0.0 >Reporter: Diana Carroll >Assignee: Josh Rosen > Fix For: 1.2.0 > > Attachments: persistbug1.png, persistbug2.png > > > If I persist an RDD at one level, unpersist it, then repersist it at another > level, the UI will continue to show the RDD at the first level...but > correctly show individual partitions at the second level. > {code} > import org.apache.spark.api.java.StorageLevels > import org.apache.spark.api.java.StorageLevels._ > val test1 = sc.parallelize(Array(1,2,3))test1.persist(StorageLevels.DISK_ONLY) > test1.count() > test1.unpersist() > test1.persist(StorageLevels.MEMORY_ONLY) > test1.count() > {code} > after the first call to persist and count, the Spark App web UI shows: > RDD Storage Info for 14 Storage Level: Disk Serialized 1x Replicated > rdd_14_0 Disk Serialized 1x Replicated > After the second call, it shows: > RDD Storage Info for 14 Storage Level: Disk Serialized 1x Replicated > rdd_14_0 Memory Deserialized 1x Replicated -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-2527) incorrect persistence level shown in Spark UI after repersisting
[ https://issues.apache.org/jira/browse/SPARK-2527?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen resolved SPARK-2527. --- Resolution: Cannot Reproduce Assignee: Josh Rosen I think that this was fixed in either 1.1 or 1.2 since I was unable to reproduce this when writing a Selenium test to run your example script: https://github.com/apache/spark/commit/bf589fc717c842d1998e3c3a523bc8775cb30269#diff-f346ada4cd59416756b6dd36b6c2605aR53 Therefore, I'm going to mark this as "Cannot Reproduce" since it was probably fixed. Please re-open this ticket if you observe this in the wild with a newer version of Spark. > incorrect persistence level shown in Spark UI after repersisting > > > Key: SPARK-2527 > URL: https://issues.apache.org/jira/browse/SPARK-2527 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 1.0.0 >Reporter: Diana Carroll >Assignee: Josh Rosen > Attachments: persistbug1.png, persistbug2.png > > > If I persist an RDD at one level, unpersist it, then repersist it at another > level, the UI will continue to show the RDD at the first level...but > correctly show individual partitions at the second level. > {code} > import org.apache.spark.api.java.StorageLevels > import org.apache.spark.api.java.StorageLevels._ > val test1 = sc.parallelize(Array(1,2,3))test1.persist(StorageLevels.DISK_ONLY) > test1.count() > test1.unpersist() > test1.persist(StorageLevels.MEMORY_ONLY) > test1.count() > {code} > after the first call to persist and count, the Spark App web UI shows: > RDD Storage Info for 14 Storage Level: Disk Serialized 1x Replicated > rdd_14_0 Disk Serialized 1x Replicated > After the second call, it shows: > RDD Storage Info for 14 Storage Level: Disk Serialized 1x Replicated > rdd_14_0 Memory Deserialized 1x Replicated -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-2698) RDD pages shows negative bytes remaining for some executors
[ https://issues.apache.org/jira/browse/SPARK-2698?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-2698: -- Summary: RDD pages shows negative bytes remaining for some executors (was: RDD page Spark Web UI bug) > RDD pages shows negative bytes remaining for some executors > --- > > Key: SPARK-2698 > URL: https://issues.apache.org/jira/browse/SPARK-2698 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 1.0.0 >Reporter: Hossein Falaki > Attachments: spark ui.png > > > The RDD page shows negative bytes remaining for some executors. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-3616) Add Selenium tests to Web UI
[ https://issues.apache.org/jira/browse/SPARK-3616?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen resolved SPARK-3616. --- Resolution: Fixed Fix Version/s: 1.2.0 Issue resolved by pull request 2474 [https://github.com/apache/spark/pull/2474] > Add Selenium tests to Web UI > > > Key: SPARK-3616 > URL: https://issues.apache.org/jira/browse/SPARK-3616 > Project: Spark > Issue Type: Improvement > Components: Web UI >Reporter: Josh Rosen >Assignee: Josh Rosen > Fix For: 1.2.0 > > > We should add basic Selenium tests to Web UI suite. This will make it easy > to write regression tests / reproductions for UI bugs and will be useful in > testing some planned refactorings / redesigns that I'm working on. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-1758) failing test org.apache.spark.JavaAPISuite.wholeTextFiles
[ https://issues.apache.org/jira/browse/SPARK-1758?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen resolved SPARK-1758. --- Resolution: Cannot Reproduce Resolving this as "Cannot Reproduce" for now, since I haven't observed this problem and both PRs for this were closed. > failing test org.apache.spark.JavaAPISuite.wholeTextFiles > - > > Key: SPARK-1758 > URL: https://issues.apache.org/jira/browse/SPARK-1758 > Project: Spark > Issue Type: Bug > Components: Java API >Affects Versions: 1.0.0 >Reporter: Nishkam Ravi > Fix For: 1.0.0 > > Attachments: SPARK-1758.patch > > > Test org.apache.spark.JavaAPISuite.wholeTextFiles fails (during sbt/sbt test) > with the following error message: > Test org.apache.spark.JavaAPISuite.wholeTextFiles failed: > java.lang.AssertionError: expected: but was: -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3962) Mark spark dependency as "provided" in external libraries
[ https://issues.apache.org/jira/browse/SPARK-3962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14184580#comment-14184580 ] Patrick Wendell commented on SPARK-3962: [~prashant_] can you take a crack at this? It's pretty simple, we just want the streaming external projects to mark spark-core as provided. > Mark spark dependency as "provided" in external libraries > - > > Key: SPARK-3962 > URL: https://issues.apache.org/jira/browse/SPARK-3962 > Project: Spark > Issue Type: Improvement > Components: Streaming >Reporter: Patrick Wendell >Assignee: Prashant Sharma >Priority: Blocker > > Right now there is not an easy way for users to link against the external > streaming libraries and not accidentally pull Spark into their assembly jar. > We should mark Spark as "provided" in the external connector pom's so that > user applications can simply include those like any other dependency in the > user's jar. > This is also the best format for third-party libraries that depend on Spark > (of which there will eventually be many) so it would be nice for our own > build to conform to this nicely. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-3962) Mark spark dependency as "provided" in external libraries
[ https://issues.apache.org/jira/browse/SPARK-3962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14184580#comment-14184580 ] Patrick Wendell edited comment on SPARK-3962 at 10/26/14 6:11 PM: -- [~prashant_] can you take a crack at this? It's pretty simple, we just want the streaming external projects to mark spark-core and spark-streaming as provided. was (Author: pwendell): [~prashant_] can you take a crack at this? It's pretty simple, we just want the streaming external projects to mark spark-core as provided. > Mark spark dependency as "provided" in external libraries > - > > Key: SPARK-3962 > URL: https://issues.apache.org/jira/browse/SPARK-3962 > Project: Spark > Issue Type: Improvement > Components: Streaming >Reporter: Patrick Wendell >Assignee: Prashant Sharma >Priority: Blocker > > Right now there is not an easy way for users to link against the external > streaming libraries and not accidentally pull Spark into their assembly jar. > We should mark Spark as "provided" in the external connector pom's so that > user applications can simply include those like any other dependency in the > user's jar. > This is also the best format for third-party libraries that depend on Spark > (of which there will eventually be many) so it would be nice for our own > build to conform to this nicely. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3962) Mark spark dependency as "provided" in external libraries
[ https://issues.apache.org/jira/browse/SPARK-3962?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-3962: --- Assignee: Prashant Sharma > Mark spark dependency as "provided" in external libraries > - > > Key: SPARK-3962 > URL: https://issues.apache.org/jira/browse/SPARK-3962 > Project: Spark > Issue Type: Improvement > Components: Streaming >Reporter: Patrick Wendell >Assignee: Prashant Sharma >Priority: Blocker > > Right now there is not an easy way for users to link against the external > streaming libraries and not accidentally pull Spark into their assembly jar. > We should mark Spark as "provided" in the external connector pom's so that > user applications can simply include those like any other dependency in the > user's jar. > This is also the best format for third-party libraries that depend on Spark > (of which there will eventually be many) so it would be nice for our own > build to conform to this nicely. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
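For concreteness, the change being requested amounts to a {{provided}} scope on the Spark artifacts in each external connector's pom. A sketch only (the artifact id and version property are illustrative, not the exact diff):

```xml
<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-streaming_2.10</artifactId>
  <version>${spark.version}</version>
  <!-- "provided": compiled against, but excluded from user assembly jars -->
  <scope>provided</scope>
</dependency>
```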
[jira] [Resolved] (SPARK-2633) enhance spark listener API to gather more spark job information
[ https://issues.apache.org/jira/browse/SPARK-2633?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell resolved SPARK-2633. Resolution: Duplicate I believe the design of SPARK-2321 is such that it covers Hive's use case. So I'm closing this as a dup of that issue. > enhance spark listener API to gather more spark job information > --- > > Key: SPARK-2633 > URL: https://issues.apache.org/jira/browse/SPARK-2633 > Project: Spark > Issue Type: New Feature > Components: Java API >Reporter: Chengxiang Li >Priority: Critical > Labels: hive > Attachments: Spark listener enhancement for Hive on Spark job monitor > and statistic.docx > > > Based on Hive on Spark job status monitoring and statistic collection > requirement, try to enhance spark listener API to gather more spark job > information. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4091) Occasionally spark.local.dir can be deleted twice and causes test failure
[ https://issues.apache.org/jira/browse/SPARK-4091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14184577#comment-14184577 ] Josh Rosen commented on SPARK-4091: --- This looks like a duplicate of SPARK-3970. > Occasionally spark.local.dir can be deleted twice and causes test failure > - > > Key: SPARK-4091 > URL: https://issues.apache.org/jira/browse/SPARK-4091 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.1.0 >Reporter: Cheng Lian > > By persisting an arbitrary RDD with storage level {{MEMORY_AND_DISK}}, Spark > may occasionally throw the following exception when shutting down: > {code} > java.io.IOException: Failed to list files for dir: > /var/folders/kh/r9ylmzln40n9nrlchnsry2qwgn/T/spark-local-20141027005012-5bcd/0b > at org.apache.spark.util.Utils$.listFilesSafely(Utils.scala:664) > at org.apache.spark.util.Utils$.deleteRecursively(Utils.scala:678) > at > org.apache.spark.util.Utils$$anonfun$deleteRecursively$1.apply(Utils.scala:680) > at > org.apache.spark.util.Utils$$anonfun$deleteRecursively$1.apply(Utils.scala:678) > at > scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) > at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:34) > at org.apache.spark.util.Utils$.deleteRecursively(Utils.scala:678) > at > org.apache.spark.util.Utils$$anon$4$$anonfun$run$1$$anonfun$apply$mcV$sp$2.apply(Utils.scala:177) > at > org.apache.spark.util.Utils$$anon$4$$anonfun$run$1$$anonfun$apply$mcV$sp$2.apply(Utils.scala:175) > at scala.collection.mutable.HashSet.foreach(HashSet.scala:79) > at > org.apache.spark.util.Utils$$anon$4$$anonfun$run$1.apply$mcV$sp(Utils.scala:175) > at > org.apache.spark.util.Utils$$anon$4$$anonfun$run$1.apply(Utils.scala:173) > at > org.apache.spark.util.Utils$$anon$4$$anonfun$run$1.apply(Utils.scala:173) > at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1323) > at 
org.apache.spark.util.Utils$$anon$4.run(Utils.scala:173) > {code} > By adding log output to {{Utils.deleteRecursively}}, setting breakpoints at > {{File.delete}} in IntelliJ, and asking IntelliJ to evaluate and log > {{Thread.currentThread().getStackTrace()}} when the breakpoint is hit rather > than suspend execution, we can get the following result, which shows > {{spark.local.dir}} is deleted twice from both {{DiskBlockManager.stop}} and > the shutdown hook installed in {{Utils}}: > {code} > +++ Deleting file: > /var/folders/kh/r9ylmzln40n9nrlchnsry2qwgn/T/spark-local-20141027003412-7fae/1d > Breakpoint reached at java.io.File.delete(File.java:1028) > [java.lang.Thread.getStackTrace(Thread.java:1589) > java.io.File.delete(File.java:1028) > org.apache.spark.util.Utils$.deleteRecursively(Utils.scala:695) > > org.apache.spark.util.Utils$$anon$4$$anonfun$run$1$$anonfun$apply$mcV$sp$2.apply(Utils.scala:177) > > org.apache.spark.util.Utils$$anon$4$$anonfun$run$1$$anonfun$apply$mcV$sp$2.apply(Utils.scala:175) > scala.collection.mutable.HashSet.foreach(HashSet.scala:79) > > org.apache.spark.util.Utils$$anon$4$$anonfun$run$1.apply$mcV$sp(Utils.scala:175) > > org.apache.spark.util.Utils$$anon$4$$anonfun$run$1.apply(Utils.scala:173) > > org.apache.spark.util.Utils$$anon$4$$anonfun$run$1.apply(Utils.scala:173) > org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1323) > org.apache.spark.util.Utils$$anon$4.run(Utils.scala:173)] > +++ Deleting file: > /var/folders/kh/r9ylmzln40n9nrlchnsry2qwgn/T/spark-local-20141027003412-7fae/1d > Breakpoint reached at java.io.File.delete(File.java:1028) > [java.lang.Thread.getStackTrace(Thread.java:1589) > java.io.File.delete(File.java:1028) > org.apache.spark.util.Utils$.deleteRecursively(Utils.scala:695) > > org.apache.spark.util.Utils$$anonfun$deleteRecursively$1.apply(Utils.scala:680) > > org.apache.spark.util.Utils$$anonfun$deleteRecursively$1.apply(Utils.scala:678) > > 
scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) > scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:34) > org.apache.spark.util.Utils$.deleteRecursively(Utils.scala:678) > > org.apache.spark.storage.DiskBlockManager$$anonfun$stop$1.apply(DiskBlockManager.scala:157) > > org.apache.spark.storage.DiskBlockManager$$anonfun$stop$1.apply(DiskBlockManager.scala:154) > > scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) > scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108) > > org.apache.spark.storage.DiskBlockManager.stop(DiskBlockManager.scala:154) >
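One way to make the double deletion harmless (a sketch of the idempotent-delete idea only, not Spark's actual patch for this ticket): treat a vanished directory as already handled. `File.listFiles()` returns `null` when the path no longer exists, which is exactly the case that `listFilesSafely` turns into the `IOException` above.

```java
import java.io.File;

public class SafeDelete {
    // Recursively delete a directory, tolerating a concurrent or repeated
    // deletion (e.g. DiskBlockManager.stop and the Utils shutdown hook
    // both removing spark.local.dir).
    public static void deleteRecursively(File file) {
        File[] children = file.listFiles();
        // null means "not a directory, already gone, or I/O error":
        // skip the recursion instead of throwing.
        if (children != null) {
            for (File child : children) {
                deleteRecursively(child);
            }
        }
        file.delete(); // returns false if the file is already gone; ignore
    }

    public static void main(String[] args) {
        File dir = new File(System.getProperty("java.io.tmpdir"), "safe-delete-demo");
        new File(dir, "sub").mkdirs();
        deleteRecursively(dir);
        deleteRecursively(dir); // second pass is a no-op, no IOException
        System.out.println(dir.exists()); // prints "false"
    }
}
```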
[jira] [Commented] (SPARK-2532) Fix issues with consolidated shuffle
[ https://issues.apache.org/jira/browse/SPARK-2532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14184576#comment-14184576 ] Patrick Wendell commented on SPARK-2532: Hey [~matei] - you created some sub-tasks here that are pretty tersely described... would you mind looking through them and deciding whether these are still relevant? Not sure whether we can close this. > Fix issues with consolidated shuffle > > > Key: SPARK-2532 > URL: https://issues.apache.org/jira/browse/SPARK-2532 > Project: Spark > Issue Type: Bug > Components: Shuffle, Spark Core >Affects Versions: 1.1.0 > Environment: All >Reporter: Mridul Muralidharan >Assignee: Mridul Muralidharan >Priority: Critical > > Will file PR with changes as soon as merge is done (earlier merge became > outdated in 2 weeks unfortunately :) ). > Consolidated shuffle is broken in multiple ways in Spark: > a) Task failure(s) can cause the state to become inconsistent. > b) Multiple reverts or a combination of close/revert/close can cause the state > to be inconsistent. > (As part of exception/error handling.) > c) Some of the APIs in the block writer cause implementation issues - for > example: a revert is always followed by a close, but the implementation tries to > keep them separate, resulting in more surface for errors. > d) Fetching data from consolidated shuffle files can go badly wrong if the > file is being actively written to: it computes length by subtracting the next > offset from the current offset (or the file length if this is the last offset) - the latter > fails when a fetch happens in parallel with a write. > Note, this happens even if there are no task failures of any kind! > This usually results in stream corruption or decompression errors. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
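Point (d) above is worth spelling out: in a consolidated file a segment's length is implicit, derived from the next segment's offset (or from the file length for the last segment), so reading while a writer is still appending makes the last length a moving target. A self-contained sketch of that arithmetic (not Spark's actual shuffle code):

```java
public class SegmentLengths {
    // offsets[i] is where map output segment i starts in the consolidated
    // file. Segment i's length is the next offset minus this one, or
    // fileLength - offsets[i] for the last segment.
    static long segmentLength(long[] offsets, int i, long fileLength) {
        long end = (i == offsets.length - 1) ? fileLength : offsets[i + 1];
        return end - offsets[i];
    }

    public static void main(String[] args) {
        long[] offsets = {0, 100, 250};
        // A reader that races an in-progress write sees a truncated last
        // segment, which is how the stream corruption / decompression
        // errors described above arise.
        System.out.println(segmentLength(offsets, 2, 300)); // prints "50"
        System.out.println(segmentLength(offsets, 2, 400)); // prints "150"
    }
}
```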
[jira] [Updated] (SPARK-3917) Compress data before network transfer
[ https://issues.apache.org/jira/browse/SPARK-3917?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-3917: --- Priority: Major (was: Critical) > Compress data before network transfer > - > > Key: SPARK-3917 > URL: https://issues.apache.org/jira/browse/SPARK-3917 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.1.0 > Environment: All >Reporter: junlong > Fix For: 1.1.0 > > > When training a Gradient Boosting Decision Tree on large sparse data, heavy > network traffic pulls down the CPU utilization ratio, and compressing the > transferred data reduced it by 90%. > So compressing data before transfer may provide a higher speedup on > Spark, and the user could configure whether or not to compress. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3655) Support sorting of values in addition to keys (i.e. secondary sort)
[ https://issues.apache.org/jira/browse/SPARK-3655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14184575#comment-14184575 ] koert kuipers commented on SPARK-3655: -- can you assign to me? i will have 2 pullreq in a few days > Support sorting of values in addition to keys (i.e. secondary sort) > --- > > Key: SPARK-3655 > URL: https://issues.apache.org/jira/browse/SPARK-3655 > Project: Spark > Issue Type: New Feature > Components: Spark Core >Affects Versions: 1.1.0 >Reporter: koert kuipers >Priority: Minor > > Now that spark has a sort based shuffle, can we expect a secondary sort soon? > There are some use cases where getting a sorted iterator of values per key is > helpful. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
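The usual trick behind this feature request is a composite (key, value) sort key, so that values arrive already ordered within each key. A plain-Java sketch of the comparator idea, independent of Spark's shuffle machinery (in Spark it would additionally involve partitioning by key before sorting):

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;

public class SecondarySort {
    // Sort (key, value) pairs by key first, then by value, so iterating
    // the result yields each key's values in sorted order -- the
    // composite-key idea a sort-based shuffle can piggyback on.
    static List<Map.Entry<String, Integer>> secondarySort(List<Map.Entry<String, Integer>> pairs) {
        List<Map.Entry<String, Integer>> sorted = new ArrayList<>(pairs);
        sorted.sort(Comparator
                .comparing((Map.Entry<String, Integer> e) -> e.getKey())
                .thenComparing(Map.Entry::getValue));
        return sorted;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        pairs.add(new AbstractMap.SimpleEntry<>("b", 2));
        pairs.add(new AbstractMap.SimpleEntry<>("a", 3));
        pairs.add(new AbstractMap.SimpleEntry<>("a", 1));
        System.out.println(secondarySort(pairs)); // prints "[a=1, a=3, b=2]"
    }
}
```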
[jira] [Updated] (SPARK-2760) Caching tables from multiple databases does not work
[ https://issues.apache.org/jira/browse/SPARK-2760?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-2760: --- Component/s: SQL > Caching tables from multiple databases does not work > > > Key: SPARK-2760 > URL: https://issues.apache.org/jira/browse/SPARK-2760 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.1.0 >Reporter: Michael Armbrust >Priority: Critical > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4085) Job will fail if a shuffle file that's read locally gets deleted
[ https://issues.apache.org/jira/browse/SPARK-4085?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-4085: --- Component/s: Spark Core > Job will fail if a shuffle file that's read locally gets deleted > > > Key: SPARK-4085 > URL: https://issues.apache.org/jira/browse/SPARK-4085 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.2.0 >Reporter: Kay Ousterhout >Assignee: Reynold Xin >Priority: Critical > > This commit: > https://github.com/apache/spark/commit/665e71d14debb8a7fc1547c614867a8c3b1f806a > changed the behavior of fetching local shuffle blocks such that if a shuffle > block is not found locally, the shuffle block is no longer marked as failed, > and a fetch failed exception is not thrown (this is because the "catch" block > here won't ever be invoked: > https://github.com/apache/spark/commit/665e71d14debb8a7fc1547c614867a8c3b1f806a#diff-e6e1631fa01e17bf851f49d30d028823R202 > because the exception called from getLocalFromDisk() doesn't get thrown > until next() gets called on the iterator). > [~rxin] [~matei] it looks like you guys changed the test for this to catch > the new exception that gets thrown > (https://github.com/apache/spark/commit/665e71d14debb8a7fc1547c614867a8c3b1f806a#diff-9c2e1918319de967045d04caf813a7d1R93). > Was that intentional? Because the new exception is a SparkException and > not a FetchFailedException, jobs with missing local shuffle data will now > fail, rather than having the map stage get retried. > This problem is reproducible with this test case: > {code} > test("hash shuffle manager recovers when local shuffle files get deleted") { > val conf = new SparkConf(false) > conf.set("spark.shuffle.manager", "hash") > sc = new SparkContext("local", "test", conf) > val rdd = sc.parallelize(1 to 10, 2).map((_, 1)).reduceByKey(_+_) > rdd.count() > // Delete one of the local shuffle blocks. 
> sc.env.blockManager.diskBlockManager.getFile(new ShuffleBlockId(0, 0, > 0)).delete() > rdd.count() > } > {code} > which will fail on the second rdd.count(). > This is a regression from 1.1. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
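The subtlety called out above, the exception from `getLocalFromDisk()` surfacing only once `next()` is called on the iterator, is the generic lazy-iterator failure pattern: a try/catch wrapping the construction site never fires. A Spark-free Java sketch of that pattern (names are illustrative):

```java
import java.util.Iterator;

public class LazyFailure {
    // Building the iterator succeeds; the failure is deferred until the
    // first element is actually demanded.
    static Iterator<byte[]> readBlock() {
        return new Iterator<byte[]>() {
            public boolean hasNext() { return true; }
            public byte[] next() {
                throw new RuntimeException("block not found on disk");
            }
        };
    }

    public static void main(String[] args) {
        Iterator<byte[]> it = readBlock(); // no exception thrown here...
        try {
            it.next();                     // ...it escapes here instead
        } catch (RuntimeException e) {
            System.out.println(e.getMessage()); // prints "block not found on disk"
        }
    }
}
```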
[jira] [Commented] (SPARK-4056) Upgrade snappy-java to 1.1.1.5
[ https://issues.apache.org/jira/browse/SPARK-4056?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14184570#comment-14184570 ] Josh Rosen commented on SPARK-4056: --- We reverted the 1.1.5 upgrade after discovering that it caused a memory leak. It looks like this has been fixed in 1.1.6 if we still want to upgrade. > Upgrade snappy-java to 1.1.1.5 > -- > > Key: SPARK-4056 > URL: https://issues.apache.org/jira/browse/SPARK-4056 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Reporter: Josh Rosen >Assignee: Josh Rosen > Fix For: 1.1.1, 1.2.0 > > > We should upgrade snappy-java to 1.1.1.5 across all of our maintenance > branches. This release improves error messages when attempting to > deserialize empty inputs using SnappyInputStream (this operation is always an > error, but the old error messages made it hard to distinguish failures due to > empty streams from ones due to reading invalid / corrupted streams); see > https://github.com/xerial/snappy-java/issues/89 for more context. > This should be a major help in the Snappy debugging work that I've been doing. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4056) Upgrade snappy-java to 1.1.1.5
[ https://issues.apache.org/jira/browse/SPARK-4056?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-4056: --- Component/s: Spark Core > Upgrade snappy-java to 1.1.1.5 > -- > > Key: SPARK-4056 > URL: https://issues.apache.org/jira/browse/SPARK-4056 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Reporter: Josh Rosen >Assignee: Josh Rosen > Fix For: 1.1.1, 1.2.0 > > > We should upgrade snappy-java to 1.1.1.5 across all of our maintenance > branches. This release improves error messages when attempting to > deserialize empty inputs using SnappyInputStream (this operation is always an > error, but the old error messages made it hard to distinguish failures due to > empty streams from ones due to reading invalid / corrupted streams); see > https://github.com/xerial/snappy-java/issues/89 for more context. > This should be a major help in the Snappy debugging work that I've been doing. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3655) Support sorting of values in addition to keys (i.e. secondary sort)
[ https://issues.apache.org/jira/browse/SPARK-3655?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-3655: --- Summary: Support sorting of values in addition to keys (i.e. secondary sort) (was: Secondary sort) > Support sorting of values in addition to keys (i.e. secondary sort) > --- > > Key: SPARK-3655 > URL: https://issues.apache.org/jira/browse/SPARK-3655 > Project: Spark > Issue Type: New Feature > Components: Spark Core >Affects Versions: 1.1.0 >Reporter: koert kuipers >Priority: Minor > > Now that spark has a sort based shuffle, can we expect a secondary sort soon? > There are some use cases where getting a sorted iterator of values per key is > helpful. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3655) Secondary sort
[ https://issues.apache.org/jira/browse/SPARK-3655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14184565#comment-14184565 ] Patrick Wendell commented on SPARK-3655: Okay, sounds good. > Secondary sort > -- > > Key: SPARK-3655 > URL: https://issues.apache.org/jira/browse/SPARK-3655 > Project: Spark > Issue Type: New Feature > Components: Spark Core >Affects Versions: 1.1.0 >Reporter: koert kuipers >Priority: Minor > > Now that spark has a sort based shuffle, can we expect a secondary sort soon? > There are some use cases where getting a sorted iterator of values per key is > helpful. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4064) NioBlockTransferService should deal with empty messages correctly
[ https://issues.apache.org/jira/browse/SPARK-4064?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-4064: --- Summary: NioBlockTransferService should deal with empty messages correctly (was: If we create a lot of big broadcast variables, Spark has great possibility to hang) > NioBlockTransferService should deal with empty messages correctly > - > > Key: SPARK-4064 > URL: https://issues.apache.org/jira/browse/SPARK-4064 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.2.0 >Reporter: Guoqiang Li >Assignee: Guoqiang Li >Priority: Critical > Fix For: 1.2.0 > > Attachments: executor.log, jstack.txt, screenshot.png > > > When I test [the PR 1983|https://github.com/apache/spark/pull/1983], Spark > hangs about one time in three -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4090) Memory leak in snappy-java 1.1.1.4/5
[ https://issues.apache.org/jira/browse/SPARK-4090?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14184562#comment-14184562 ] Davies Liu commented on SPARK-4090: --- [~joshrosen] 1.1.1.6 is released. > Memory leak in snappy-java 1.1.1.4/5 > > > Key: SPARK-4090 > URL: https://issues.apache.org/jira/browse/SPARK-4090 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.2.0 >Reporter: Davies Liu >Priority: Blocker > Attachments: screenshot-12.png > > > There is a memory-leak bug in snappy-java 1.1.1.4/5; we should roll back to > 1.1.1.3 or wait for a bugfix. > The Jenkins tests timed out or OOMed multiple times recently. While testing it > locally, I got a heap dump of the leaking JVM: > Then I found that it's a bug in recent releases of snappy-java: > {code} > +inputBuffer = inputBufferAllocator.allocate(inputSize); > +outputBuffer = inputBufferAllocator.allocate(outputSize); > {code} > The outputBuffer is allocated from inputBufferAllocator but released to > outputBufferAllocator: https://github.com/xerial/snappy-java/issues/91 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
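The quoted diff shows the bug's shape: a buffer taken from one pooled allocator is released to a different one. A deliberately simplified, self-contained sketch of why that leaks (snappy-java's real allocators are pooled per buffer size; this toy model is not its actual code):

```java
import java.util.ArrayDeque;
import java.util.Deque;

public class PoolMismatchDemo {
    // A toy buffer pool: hand back a freed buffer of the right size if
    // one is available, otherwise allocate fresh.
    static class Pool {
        final Deque<byte[]> free = new ArrayDeque<>();
        byte[] allocate(int size) {
            byte[] b = free.pollFirst();
            return (b != null && b.length == size) ? b : new byte[size];
        }
        void release(byte[] b) {
            free.addFirst(b);
        }
    }

    public static void main(String[] args) {
        Pool inputPool = new Pool();
        Pool outputPool = new Pool();
        for (int i = 0; i < 1000; i++) {
            byte[] buf = inputPool.allocate(64 * 1024);
            outputPool.release(buf); // the bug: released to the wrong pool
        }
        // inputPool never gets a buffer back, so it allocates 1000 fresh
        // ones; outputPool hoards 1000 buffers nobody requests from it.
        // Memory grows without bound -- the leak seen in the heap dump.
        System.out.println(inputPool.free.size());  // prints "0"
        System.out.println(outputPool.free.size()); // prints "1000"
    }
}
```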
[jira] [Commented] (SPARK-4091) Occasionally spark.local.dir can be deleted twice and causes test failure
[ https://issues.apache.org/jira/browse/SPARK-4091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14184560#comment-14184560 ] Apache Spark commented on SPARK-4091: - User 'liancheng' has created a pull request for this issue: https://github.com/apache/spark/pull/2945 > Occasionally spark.local.dir can be deleted twice and causes test failure > - > > Key: SPARK-4091 > URL: https://issues.apache.org/jira/browse/SPARK-4091 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.1.0 >Reporter: Cheng Lian > > By persisting an arbitrary RDD with storage level {{MEMORY_AND_DISK}}, Spark > may occasionally throw the following exception when shutting down: > {code} > java.io.IOException: Failed to list files for dir: > /var/folders/kh/r9ylmzln40n9nrlchnsry2qwgn/T/spark-local-20141027005012-5bcd/0b > at org.apache.spark.util.Utils$.listFilesSafely(Utils.scala:664) > at org.apache.spark.util.Utils$.deleteRecursively(Utils.scala:678) > at > org.apache.spark.util.Utils$$anonfun$deleteRecursively$1.apply(Utils.scala:680) > at > org.apache.spark.util.Utils$$anonfun$deleteRecursively$1.apply(Utils.scala:678) > at > scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) > at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:34) > at org.apache.spark.util.Utils$.deleteRecursively(Utils.scala:678) > at > org.apache.spark.util.Utils$$anon$4$$anonfun$run$1$$anonfun$apply$mcV$sp$2.apply(Utils.scala:177) > at > org.apache.spark.util.Utils$$anon$4$$anonfun$run$1$$anonfun$apply$mcV$sp$2.apply(Utils.scala:175) > at scala.collection.mutable.HashSet.foreach(HashSet.scala:79) > at > org.apache.spark.util.Utils$$anon$4$$anonfun$run$1.apply$mcV$sp(Utils.scala:175) > at > org.apache.spark.util.Utils$$anon$4$$anonfun$run$1.apply(Utils.scala:173) > at > org.apache.spark.util.Utils$$anon$4$$anonfun$run$1.apply(Utils.scala:173) > at 
org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1323) > at org.apache.spark.util.Utils$$anon$4.run(Utils.scala:173) > {code} > By adding log output to {{Utils.deleteRecursively}}, setting breakpoints at > {{File.delete}} in IntelliJ, and asking IntelliJ to evaluate and log > {{Thread.currentThread().getStackTrace()}} when the breakpoint is hit rather > than suspend execution, we can get the following result, which shows > {{spark.local.dir}} is deleted twice from both {{DiskBlockManager.stop}} and > the shutdown hook installed in {{Utils}}: > {code} > +++ Deleting file: > /var/folders/kh/r9ylmzln40n9nrlchnsry2qwgn/T/spark-local-20141027003412-7fae/1d > Breakpoint reached at java.io.File.delete(File.java:1028) > [java.lang.Thread.getStackTrace(Thread.java:1589) > java.io.File.delete(File.java:1028) > org.apache.spark.util.Utils$.deleteRecursively(Utils.scala:695) > > org.apache.spark.util.Utils$$anon$4$$anonfun$run$1$$anonfun$apply$mcV$sp$2.apply(Utils.scala:177) > > org.apache.spark.util.Utils$$anon$4$$anonfun$run$1$$anonfun$apply$mcV$sp$2.apply(Utils.scala:175) > scala.collection.mutable.HashSet.foreach(HashSet.scala:79) > > org.apache.spark.util.Utils$$anon$4$$anonfun$run$1.apply$mcV$sp(Utils.scala:175) > > org.apache.spark.util.Utils$$anon$4$$anonfun$run$1.apply(Utils.scala:173) > > org.apache.spark.util.Utils$$anon$4$$anonfun$run$1.apply(Utils.scala:173) > org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1323) > org.apache.spark.util.Utils$$anon$4.run(Utils.scala:173)] > +++ Deleting file: > /var/folders/kh/r9ylmzln40n9nrlchnsry2qwgn/T/spark-local-20141027003412-7fae/1d > Breakpoint reached at java.io.File.delete(File.java:1028) > [java.lang.Thread.getStackTrace(Thread.java:1589) > java.io.File.delete(File.java:1028) > org.apache.spark.util.Utils$.deleteRecursively(Utils.scala:695) > > org.apache.spark.util.Utils$$anonfun$deleteRecursively$1.apply(Utils.scala:680) > > 
org.apache.spark.util.Utils$$anonfun$deleteRecursively$1.apply(Utils.scala:678) > > scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) > scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:34) > org.apache.spark.util.Utils$.deleteRecursively(Utils.scala:678) > > org.apache.spark.storage.DiskBlockManager$$anonfun$stop$1.apply(DiskBlockManager.scala:157) > > org.apache.spark.storage.DiskBlockManager$$anonfun$stop$1.apply(DiskBlockManager.scala:154) > > scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) > scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108) > > org.apache.spar
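The two stack traces show the same directory being deleted once by the shutdown hook installed in {{Utils}} and once by {{DiskBlockManager.stop}}; the {{IOException}} arises because {{listFiles}} returns null after the other pass has already removed the directory. A hedged sketch of one way to tolerate the race, treating "already gone" as success (illustrative code, not Spark's actual {{Utils.deleteRecursively}}):

```java
import java.io.File;
import java.io.IOException;

// Sketch: a recursive delete that tolerates another thread (e.g. a shutdown
// hook) removing the same tree concurrently, instead of failing on the
// null result from listFiles().
public class SafeDelete {
    public static void deleteRecursively(File file) throws IOException {
        if (file.isDirectory()) {
            File[] children = file.listFiles();
            if (children == null) {
                // Either a real I/O error or the directory vanished under us.
                // Only report a failure if the directory still exists.
                if (file.exists()) {
                    throw new IOException("Failed to list files for dir: " + file);
                }
                return;
            }
            for (File child : children) {
                deleteRecursively(child);
            }
        }
        // delete() returning false is fine if someone else deleted it first.
        if (!file.delete() && file.exists()) {
            throw new IOException("Failed to delete: " + file);
        }
    }

    public static void main(String[] args) throws IOException {
        File dir = new File(System.getProperty("java.io.tmpdir"), "safe-delete-demo");
        new File(dir, "sub").mkdirs();
        deleteRecursively(dir);   // first pass removes the tree
        deleteRecursively(dir);   // second pass is now a harmless no-op
        System.out.println("deleted twice without error: " + !dir.exists());
    }
}
```

An alternative fix is structural rather than defensive: have {{DiskBlockManager.stop}} deregister its directories from the shutdown hook once it has deleted them, so each path is only ever owned by one cleanup pass.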
[jira] [Updated] (SPARK-4091) Occasionally spark.local.dir can be deleted twice and causes test failure
[ https://issues.apache.org/jira/browse/SPARK-4091?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-4091: -- Description: By persisting an arbitrary RDD with storage level {{MEMORY_AND_DISK}}, Spark may occasionally throw the following exception when shutting down: {code} java.io.IOException: Failed to list files for dir: /var/folders/kh/r9ylmzln40n9nrlchnsry2qwgn/T/spark-local-20141027005012-5bcd/0b at org.apache.spark.util.Utils$.listFilesSafely(Utils.scala:664) at org.apache.spark.util.Utils$.deleteRecursively(Utils.scala:678) at org.apache.spark.util.Utils$$anonfun$deleteRecursively$1.apply(Utils.scala:680) at org.apache.spark.util.Utils$$anonfun$deleteRecursively$1.apply(Utils.scala:678) at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:34) at org.apache.spark.util.Utils$.deleteRecursively(Utils.scala:678) at org.apache.spark.util.Utils$$anon$4$$anonfun$run$1$$anonfun$apply$mcV$sp$2.apply(Utils.scala:177) at org.apache.spark.util.Utils$$anon$4$$anonfun$run$1$$anonfun$apply$mcV$sp$2.apply(Utils.scala:175) at scala.collection.mutable.HashSet.foreach(HashSet.scala:79) at org.apache.spark.util.Utils$$anon$4$$anonfun$run$1.apply$mcV$sp(Utils.scala:175) at org.apache.spark.util.Utils$$anon$4$$anonfun$run$1.apply(Utils.scala:173) at org.apache.spark.util.Utils$$anon$4$$anonfun$run$1.apply(Utils.scala:173) at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1323) at org.apache.spark.util.Utils$$anon$4.run(Utils.scala:173) {code} By adding log output to {{Utils.deleteRecursively}}, setting breakpoints at {{File.delete}} in IntelliJ, and asking IntelliJ to evaluate and log {{Thread.currentThread().getStackTrace()}} when the breakpoint is hit rather than suspend execution, we can get the following result, which shows {{spark.local.dir}} is deleted twice from both {{DiskBlockManager.stop}} and the shutdown hook 
installed in {{Utils}}: {code} +++ Deleting file: /var/folders/kh/r9ylmzln40n9nrlchnsry2qwgn/T/spark-local-20141027003412-7fae/1d Breakpoint reached at java.io.File.delete(File.java:1028) [java.lang.Thread.getStackTrace(Thread.java:1589) java.io.File.delete(File.java:1028) org.apache.spark.util.Utils$.deleteRecursively(Utils.scala:695) org.apache.spark.util.Utils$$anon$4$$anonfun$run$1$$anonfun$apply$mcV$sp$2.apply(Utils.scala:177) org.apache.spark.util.Utils$$anon$4$$anonfun$run$1$$anonfun$apply$mcV$sp$2.apply(Utils.scala:175) scala.collection.mutable.HashSet.foreach(HashSet.scala:79) org.apache.spark.util.Utils$$anon$4$$anonfun$run$1.apply$mcV$sp(Utils.scala:175) org.apache.spark.util.Utils$$anon$4$$anonfun$run$1.apply(Utils.scala:173) org.apache.spark.util.Utils$$anon$4$$anonfun$run$1.apply(Utils.scala:173) org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1323) org.apache.spark.util.Utils$$anon$4.run(Utils.scala:173)] +++ Deleting file: /var/folders/kh/r9ylmzln40n9nrlchnsry2qwgn/T/spark-local-20141027003412-7fae/1d Breakpoint reached at java.io.File.delete(File.java:1028) [java.lang.Thread.getStackTrace(Thread.java:1589) java.io.File.delete(File.java:1028) org.apache.spark.util.Utils$.deleteRecursively(Utils.scala:695) org.apache.spark.util.Utils$$anonfun$deleteRecursively$1.apply(Utils.scala:680) org.apache.spark.util.Utils$$anonfun$deleteRecursively$1.apply(Utils.scala:678) scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:34) org.apache.spark.util.Utils$.deleteRecursively(Utils.scala:678) org.apache.spark.storage.DiskBlockManager$$anonfun$stop$1.apply(DiskBlockManager.scala:157) org.apache.spark.storage.DiskBlockManager$$anonfun$stop$1.apply(DiskBlockManager.scala:154) scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108) 
org.apache.spark.storage.DiskBlockManager.stop(DiskBlockManager.scala:154) org.apache.spark.storage.DiskBlockManager$$anon$1$$anonfun$run$1.apply$mcV$sp(DiskBlockManager.scala:147) org.apache.spark.storage.DiskBlockManager$$anon$1$$anonfun$run$1.apply(DiskBlockManager.scala:145) org.apache.spark.storage.DiskBlockManager$$anon$1$$anonfun$run$1.apply(DiskBlockManager.scala:145) org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1323) org.apache.spark.storage.DiskBlockManager$$anon$1.run(DiskBlockManager.scala:145)] {code} When this bug happens during Jenkins build, it fails {{CliSuite}}. was: By p
[jira] [Created] (SPARK-4091) Occasionally spark.local.dir can be deleted twice and causes test failure
Cheng Lian created SPARK-4091: - Summary: Occasionally spark.local.dir can be deleted twice and causes test failure Key: SPARK-4091 URL: https://issues.apache.org/jira/browse/SPARK-4091 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.1.0 Reporter: Cheng Lian By persisting an arbitrary RDD with storage level {{MEMORY_AND_DISK}}, Spark may occasionally throw the following exception when shutting down: {code} java.io.IOException: Failed to list files for dir: /var/folders/kh/r9ylmzln40n9nrlchnsry2qwgn/T/spark-local-20141027005012-5bcd/0b at org.apache.spark.util.Utils$.listFilesSafely(Utils.scala:664) at org.apache.spark.util.Utils$.deleteRecursively(Utils.scala:678) at org.apache.spark.util.Utils$$anonfun$deleteRecursively$1.apply(Utils.scala:680) at org.apache.spark.util.Utils$$anonfun$deleteRecursively$1.apply(Utils.scala:678) at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:34) at org.apache.spark.util.Utils$.deleteRecursively(Utils.scala:678) at org.apache.spark.util.Utils$$anon$4$$anonfun$run$1$$anonfun$apply$mcV$sp$2.apply(Utils.scala:177) at org.apache.spark.util.Utils$$anon$4$$anonfun$run$1$$anonfun$apply$mcV$sp$2.apply(Utils.scala:175) at scala.collection.mutable.HashSet.foreach(HashSet.scala:79) at org.apache.spark.util.Utils$$anon$4$$anonfun$run$1.apply$mcV$sp(Utils.scala:175) at org.apache.spark.util.Utils$$anon$4$$anonfun$run$1.apply(Utils.scala:173) at org.apache.spark.util.Utils$$anon$4$$anonfun$run$1.apply(Utils.scala:173) at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1323) at org.apache.spark.util.Utils$$anon$4.run(Utils.scala:173) {code} By adding log output to {{Utils.deleteRecursively}}, setting breakpoints at {{File.delete}} in IntelliJ, and asking IntelliJ to evaluate and log {{Thread.currentThread().getStackTrace()}} when the breakpoint is hit rather than suspend execution, we can get the following 
result, which shows {{spark.local.dir}} is deleted twice from both {{DiskBlockManager.stop}} and the shutdown hook installed in {{Utils}}: {code} +++ Deleting file: /var/folders/kh/r9ylmzln40n9nrlchnsry2qwgn/T/spark-local-20141027003412-7fae/1d Breakpoint reached at java.io.File.delete(File.java:1028) [java.lang.Thread.getStackTrace(Thread.java:1589) java.io.File.delete(File.java:1028) org.apache.spark.util.Utils$.deleteRecursively(Utils.scala:695) org.apache.spark.util.Utils$$anon$4$$anonfun$run$1$$anonfun$apply$mcV$sp$2.apply(Utils.scala:177) org.apache.spark.util.Utils$$anon$4$$anonfun$run$1$$anonfun$apply$mcV$sp$2.apply(Utils.scala:175) scala.collection.mutable.HashSet.foreach(HashSet.scala:79) org.apache.spark.util.Utils$$anon$4$$anonfun$run$1.apply$mcV$sp(Utils.scala:175) org.apache.spark.util.Utils$$anon$4$$anonfun$run$1.apply(Utils.scala:173) org.apache.spark.util.Utils$$anon$4$$anonfun$run$1.apply(Utils.scala:173) org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1323) org.apache.spark.util.Utils$$anon$4.run(Utils.scala:173)] +++ Deleting file: /var/folders/kh/r9ylmzln40n9nrlchnsry2qwgn/T/spark-local-20141027003412-7fae/1d Breakpoint reached at java.io.File.delete(File.java:1028) [java.lang.Thread.getStackTrace(Thread.java:1589) java.io.File.delete(File.java:1028) org.apache.spark.util.Utils$.deleteRecursively(Utils.scala:695) org.apache.spark.util.Utils$$anonfun$deleteRecursively$1.apply(Utils.scala:680) org.apache.spark.util.Utils$$anonfun$deleteRecursively$1.apply(Utils.scala:678) scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:34) org.apache.spark.util.Utils$.deleteRecursively(Utils.scala:678) org.apache.spark.storage.DiskBlockManager$$anonfun$stop$1.apply(DiskBlockManager.scala:157) org.apache.spark.storage.DiskBlockManager$$anonfun$stop$1.apply(DiskBlockManager.scala:154) 
scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108) org.apache.spark.storage.DiskBlockManager.stop(DiskBlockManager.scala:154) org.apache.spark.storage.DiskBlockManager$$anon$1$$anonfun$run$1.apply$mcV$sp(DiskBlockManager.scala:147) org.apache.spark.storage.DiskBlockManager$$anon$1$$anonfun$run$1.apply(DiskBlockManager.scala:145) org.apache.spark.storage.DiskBlockManager$$anon$1$$anonfun$run$1.apply(DiskBlockManager.scala:145) org.apache.spark.util.Utils$.logUncaughtEx
[jira] [Commented] (SPARK-1473) Feature selection for high dimensional datasets
[ https://issues.apache.org/jira/browse/SPARK-1473?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14184457#comment-14184457 ] David Martinez Rego commented on SPARK-1473: Dear Sam, Thank you for the invitation. Funnily enough, I am a regular at the meetups and have already been invited by Martin Goodson to do a talk about ... "selected topics on ML in Big Data". I currently have a lab in Spain polishing the code and deploying it on a cluster to prove its performance (and support a future pull request). Dr. Brown has suggested a couple of improvements to me using semi-supervised data. When we have solid results, at least on my side, I would love to share them with the community. > Feature selection for high dimensional datasets > --- > > Key: SPARK-1473 > URL: https://issues.apache.org/jira/browse/SPARK-1473 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: Ignacio Zendejas >Assignee: Alexander Ulanov >Priority: Minor > Labels: features > > For classification tasks involving large feature spaces in the order of tens > of thousands or higher (e.g., text classification with n-grams, where n > 1), > it is often useful to rank and filter features that are irrelevant thereby > reducing the feature space by at least one or two orders of magnitude without > impacting performance on key evaluation metrics (accuracy/precision/recall). > A feature evaluation interface which is flexible needs to be designed and at > least two methods should be implemented with Information Gain being a > priority as it has been shown to be amongst the most reliable. > Special consideration should be taken in the design to account for wrapper > methods (see research papers below) which are more practical for lower > dimensional data. > Relevant research: > * Brown, G., Pocock, A., Zhao, M. J., & Luján, M. (2012).
Conditional > likelihood maximisation: a unifying framework for information theoretic > feature selection.*The Journal of Machine Learning Research*, *13*, 27-66. > * Forman, George. "An extensive empirical study of feature selection metrics > for text classification." The Journal of machine learning research 3 (2003): > 1289-1305.
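The ticket asks for a flexible feature-evaluation interface with Information Gain as the first method. As a reference point for what such a scorer computes, here is a minimal, hypothetical single-machine sketch of information-gain ranking for binary features and binary labels (illustrative only, not the MLlib API this ticket proposes):

```java
// Hypothetical sketch of Information Gain scoring: IG(Y; X_j) = H(Y) - H(Y | X_j).
// A distributed MLlib version would aggregate the same counts per partition.
public class InfoGain {
    // Shannon entropy (bits) of a distribution given as raw counts.
    static double entropy(double... counts) {
        double total = 0;
        for (double c : counts) total += c;
        double h = 0;
        for (double c : counts) {
            if (c > 0) {
                double p = c / total;
                h -= p * (Math.log(p) / Math.log(2));
            }
        }
        return h;
    }

    /** Information gain of binary feature j for binary labels y. */
    static double infoGain(int[][] x, int[] y, int j) {
        double n = y.length;
        double n1 = 0, pos1 = 0, pos0 = 0;
        for (int i = 0; i < y.length; i++) {
            if (x[i][j] == 1) { n1++; pos1 += y[i]; }
            else { pos0 += y[i]; }
        }
        double n0 = n - n1;
        double hY  = entropy(pos1 + pos0, n - pos1 - pos0);          // H(Y)
        double hY1 = (n1 > 0) ? entropy(pos1, n1 - pos1) : 0;        // H(Y | X=1)
        double hY0 = (n0 > 0) ? entropy(pos0, n0 - pos0) : 0;        // H(Y | X=0)
        return hY - (n1 / n) * hY1 - (n0 / n) * hY0;
    }

    public static void main(String[] args) {
        // Feature 0 perfectly predicts the label; feature 1 is noise.
        int[][] x = {{1, 1}, {1, 0}, {0, 1}, {0, 0}};
        int[] y  = {1, 1, 0, 0};
        System.out.printf("IG(feature 0) = %.3f%n", infoGain(x, y, 0)); // 1.000
        System.out.printf("IG(feature 1) = %.3f%n", infoGain(x, y, 1)); // 0.000
    }
}
```

Ranking then amounts to sorting features by this score and keeping the top k, which is where the one-to-two orders-of-magnitude reduction mentioned in the description comes from.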
[jira] [Commented] (SPARK-1473) Feature selection for high dimensional datasets
[ https://issues.apache.org/jira/browse/SPARK-1473?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14184453#comment-14184453 ] sam commented on SPARK-1473: [~gbr...@cs.man.ac.uk] Thanks for taking the time to respond to my questions, and I thank you again for writing the paper as I always enjoy reading foundational (i.e. information theoretic) approaches to Machine Learning. Regarding your final point about empiricism, yes this is better than "arbitrary" and so my original comment was too strong. I guess I was hoping for the same kind of foundational approach used to define the feature selection, and I am optimistic that there does exist a principled approach to how to define independence (which I think would also link with estimation). I notice that your email address indicates that you are at Manchester University (I must have overlooked this when reading the paper - typical mathematician). This is where I learnt about Information Theory - in the maths department; Jeff Paris, George Wilmers, Vencovska, etc have all done sterling work. Do you ever come to London? Do you have any interest in applications? We have a Spark Meetup in London and it would be great if you could attend - much easier to share ideas in person. Perhaps yourself and [~torito1984] may even be willing to give a talk on "Information Theoretic Feature Selection with Implementation in Spark"? 
> Feature selection for high dimensional datasets > --- > > Key: SPARK-1473 > URL: https://issues.apache.org/jira/browse/SPARK-1473 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: Ignacio Zendejas >Assignee: Alexander Ulanov >Priority: Minor > Labels: features > > For classification tasks involving large feature spaces in the order of tens > of thousands or higher (e.g., text classification with n-grams, where n > 1), > it is often useful to rank and filter features that are irrelevant thereby > reducing the feature space by at least one or two orders of magnitude without > impacting performance on key evaluation metrics (accuracy/precision/recall). > A feature evaluation interface which is flexible needs to be designed and at > least two methods should be implemented with Information Gain being a > priority as it has been shown to be amongst the most reliable. > Special consideration should be taken in the design to account for wrapper > methods (see research papers below) which are more practical for lower > dimensional data. > Relevant research: > * Brown, G., Pocock, A., Zhao, M. J., & Luján, M. (2012). Conditional > likelihood maximisation: a unifying framework for information theoretic > feature selection.*The Journal of Machine Learning Research*, *13*, 27-66. > * Forman, George. "An extensive empirical study of feature selection metrics > for text classification." The Journal of machine learning research 3 (2003): > 1289-1305.
[jira] [Resolved] (SPARK-4090) Memory leak in snappy-java 1.1.1.4/5
[ https://issues.apache.org/jira/browse/SPARK-4090?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu resolved SPARK-4090. --- Resolution: Fixed > Memory leak in snappy-java 1.1.1.4/5 > > > Key: SPARK-4090 > URL: https://issues.apache.org/jira/browse/SPARK-4090 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.2.0 >Reporter: Davies Liu >Priority: Blocker > Attachments: screenshot-12.png > > > There is a memory-leak bug in snappy-java 1.1.1.4/5; we should roll back to > 1.1.1.3 or wait for a bugfix. > The Jenkins tests have timed out or hit OOM multiple times recently. While testing it > locally, I got a heap dump of the leaking JVM: > I then found that it's a bug in recent releases of snappy-java: > {code} > +inputBuffer = inputBufferAllocator.allocate(inputSize); > +outputBuffer = inputBufferAllocator.allocate(outputSize); > {code} > The outputBuffer is allocated from inputBufferAllocator but released to > outputBufferAllocator: https://github.com/xerial/snappy-java/issues/91
[jira] [Commented] (SPARK-4090) Memory leak in snappy-java 1.1.1.4/5
[ https://issues.apache.org/jira/browse/SPARK-4090?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14184423#comment-14184423 ] Josh Rosen commented on SPARK-4090: --- I rolled back earlier today, so the build should be fixed now. > Memory leak in snappy-java 1.1.1.4/5 > > > Key: SPARK-4090 > URL: https://issues.apache.org/jira/browse/SPARK-4090 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.2.0 >Reporter: Davies Liu >Priority: Blocker > Attachments: screenshot-12.png > > > There is a memory-leak bug in snappy-java 1.1.1.4/5; we should roll back to > 1.1.1.3 or wait for a bugfix. > The Jenkins tests have timed out or hit OOM multiple times recently. While testing it > locally, I got a heap dump of the leaking JVM: > I then found that it's a bug in recent releases of snappy-java: > {code} > +inputBuffer = inputBufferAllocator.allocate(inputSize); > +outputBuffer = inputBufferAllocator.allocate(outputSize); > {code} > The outputBuffer is allocated from inputBufferAllocator but released to > outputBufferAllocator: https://github.com/xerial/snappy-java/issues/91
[jira] [Updated] (SPARK-4090) Memory leak in snappy-java 1.1.1.4/5
[ https://issues.apache.org/jira/browse/SPARK-4090?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu updated SPARK-4090: -- Attachment: screenshot-12.png > Memory leak in snappy-java 1.1.1.4/5 > > > Key: SPARK-4090 > URL: https://issues.apache.org/jira/browse/SPARK-4090 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.2.0 >Reporter: Davies Liu >Priority: Blocker > Attachments: screenshot-12.png > > > There is a memory-leak bug in snappy-java 1.1.1.4/5; we should roll back to > 1.1.1.3 or wait for a bugfix. > The Jenkins tests have timed out or hit OOM multiple times recently. While testing it > locally, I got a heap dump of the leaking JVM: > I then found that it's a bug in recent releases of snappy-java: > {code} > +inputBuffer = inputBufferAllocator.allocate(inputSize); > +outputBuffer = inputBufferAllocator.allocate(outputSize); > {code} > The outputBuffer is allocated from inputBufferAllocator but released to > outputBufferAllocator: https://github.com/xerial/snappy-java/issues/91
[jira] [Created] (SPARK-4090) Memory leak in snappy-java 1.1.1.4/5
Davies Liu created SPARK-4090: - Summary: Memory leak in snappy-java 1.1.1.4/5 Key: SPARK-4090 URL: https://issues.apache.org/jira/browse/SPARK-4090 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.2.0 Reporter: Davies Liu Priority: Blocker There is a memory-leak bug in snappy-java 1.1.1.4/5; we should roll back to 1.1.1.3 or wait for a bugfix. The Jenkins tests have timed out or hit OOM multiple times recently. While testing it locally, I got a heap dump of the leaking JVM: I then found that it's a bug in recent releases of snappy-java: {code} +inputBuffer = inputBufferAllocator.allocate(inputSize); +outputBuffer = inputBufferAllocator.allocate(outputSize); {code} The outputBuffer is allocated from inputBufferAllocator but released to outputBufferAllocator: https://github.com/xerial/snappy-java/issues/91
[jira] [Updated] (SPARK-4049) Storage web UI "fraction cached" shows as > 100%
[ https://issues.apache.org/jira/browse/SPARK-4049?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-4049: -- Priority: Minor (was: Major) > Storage web UI "fraction cached" shows as > 100% > > > Key: SPARK-4049 > URL: https://issues.apache.org/jira/browse/SPARK-4049 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 1.2.0 >Reporter: Josh Rosen >Priority: Minor > > In the Storage tab of the Spark Web UI, I saw a case where the "Fraction > Cached" was greater than 100%: > !http://i.imgur.com/Gm2hEeL.png! 