[jira] [Commented] (SPARK-22357) SparkContext.binaryFiles ignores minPartitions parameter
[ https://issues.apache.org/jira/browse/SPARK-22357?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16510507#comment-16510507 ] John Brock commented on SPARK-22357:
What are people's opinions of [~bomeng]'s fix? This bug just bit me, so I'd like to see this fixed.
> SparkContext.binaryFiles ignores minPartitions parameter
> ---
>
> Key: SPARK-22357
> URL: https://issues.apache.org/jira/browse/SPARK-22357
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 2.1.2, 2.2.0
> Reporter: Weichen Xu
> Priority: Major
>
> This is a bug in binaryFiles: even though we pass it minPartitions, binaryFiles ignores it.
> The bug was introduced in Spark 2.1 (relative to Spark 2.0); in PortableDataStream.scala the argument “minPartitions” is no longer used (with the push to master on 11/7/6):
> {code}
> /**
>  * Allow minPartitions set by end-user in order to keep compatibility with old Hadoop API,
>  * which is set through setMaxSplitSize.
>  */
> def setMinPartitions(sc: SparkContext, context: JobContext, minPartitions: Int) {
>   val defaultMaxSplitBytes = sc.getConf.get(config.FILES_MAX_PARTITION_BYTES)
>   val openCostInBytes = sc.getConf.get(config.FILES_OPEN_COST_IN_BYTES)
>   val defaultParallelism = sc.defaultParallelism
>   val files = listStatus(context).asScala
>   val totalBytes = files.filterNot(_.isDirectory).map(_.getLen + openCostInBytes).sum
>   val bytesPerCore = totalBytes / defaultParallelism
>   val maxSplitSize = Math.min(defaultMaxSplitBytes, Math.max(openCostInBytes, bytesPerCore))
>   super.setMaxSplitSize(maxSplitSize)
> }
> {code}
> The code previously, in version 2.0, was:
> {code}
> def setMinPartitions(context: JobContext, minPartitions: Int) {
>   val totalLen = listStatus(context).asScala.filterNot(_.isDirectory).map(_.getLen).sum
>   val maxSplitSize = math.ceil(totalLen / math.max(minPartitions, 1.0)).toLong
>   super.setMaxSplitSize(maxSplitSize)
> }
> {code}
> The new code is clever, but it ignores what the user passes in and sizes the splits from the data alone, which is effectively a breaking change.
> In our specific case this was a problem: we initially read in just the file names, and the dataframe only becomes very large later, when the images themselves are read in; in that case the new code does not handle the partitioning well.
> I'm not sure whether it can be easily fixed, because I don't understand the full context of the change in Spark (but at the very least the unused parameter should be removed to avoid confusion).
--
This message was sent by Atlassian JIRA (v7.6.3#76005)
-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
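As a concrete illustration of the kind of fix being discussed (this is only a sketch of the general idea, not [~bomeng]'s actual patch), one way to make setMinPartitions honor minPartitions again is to keep the 2.1 size-based heuristic but cap the split size so that at least minPartitions splits are produced. The helper below is standalone Scala; the parameter names mirror the values the real method reads from the SparkConf and JobContext.
{code:scala}
// Sketch only: combine the 2.0 behaviour (respect minPartitions) with the
// 2.1 size-based heuristic by taking the smaller of the two candidate split sizes.
def chooseMaxSplitSize(
    totalBytes: Long,
    openCostInBytes: Long,
    defaultMaxSplitBytes: Long,
    defaultParallelism: Int,
    minPartitions: Int): Long = {
  val bytesPerCore = totalBytes / math.max(defaultParallelism, 1)
  // 2.1 heuristic: balance the per-file open cost against the per-core data volume.
  val sizeBased = math.min(defaultMaxSplitBytes, math.max(openCostInBytes, bytesPerCore))
  // 2.0 behaviour: a split size small enough to yield at least minPartitions splits.
  val userBased = math.ceil(totalBytes / math.max(minPartitions, 1).toDouble).toLong
  math.max(1L, math.min(sizeBased, userBased))
}
{code}
A smaller maxSplitSize can only increase the number of partitions, so taking the minimum of the two candidates respects the user's lower bound while still splitting large inputs sensibly.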
[jira] [Commented] (SPARK-22851) Download mirror for spark-2.2.1-bin-hadoop2.7.tgz has file with incorrect checksum
[ https://issues.apache.org/jira/browse/SPARK-22851?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16300814#comment-16300814 ] John Brock commented on SPARK-22851: I wouldn't call this resolved yet. Is there a way to contact whoever is in charge of the pair.com mirror? It's not a bug in Spark itself, but it is a bug with a mirror listed on the official Apache Spark downloads page, so we should endeavor to get it fixed. > Download mirror for spark-2.2.1-bin-hadoop2.7.tgz has file with incorrect > checksum > -- > > Key: SPARK-22851 > URL: https://issues.apache.org/jira/browse/SPARK-22851 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.1 >Reporter: John Brock >Priority: Critical > > The correct sha512 is: > 349ee4bc95c760259c1c28aaae0d9db4146115b03d710fe57685e0d18c9f9538d0b90d9c28f4031ed45f69def5bd217a5bf77fd50f685d93eb207445787f2685. > However, the file I downloaded from > http://apache.mirrors.pair.com/spark/spark-2.2.1/spark-2.2.1-bin-hadoop2.7.tgz > is giving me a different sha256: > 039935ef9c4813eca15b29e7ddf91706844a52287999e8c5780f4361b736eb454110825224ae1b58cac9d686785ae0944a1c29e0b345532762752abab9b2cba9 > It looks like this mirror has a file that isn't actually gzipped, just > tarred. If I ungzip one of the copies of spark-2.2.1-bin-hadoop2.7.tgz with > the correct sha512, and take the sha512 of the resulting tar, I get the same > incorrect hash above of > 039935ef9c4813eca15b29e7ddf91706844a52287999e8c5780f4361b736eb454110825224ae1b58cac9d686785ae0944a1c29e0b345532762752abab9b2cba9. > I asked some colleagues to download the incorrect file themselves to check > the hash -- some of them got a file that was gzipped and some didn't. I'm > assuming there's some caching or mirroring happening that may give you a > different file than the one I got. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-22851) Download mirror for spark-2.2.1-bin-hadoop2.7.tgz has file with incorrect checksum
[ https://issues.apache.org/jira/browse/SPARK-22851?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16300740#comment-16300740 ] John Brock edited comment on SPARK-22851 at 12/21/17 11:59 PM: --- I think the inconsistent behavior in Chrome is due to different headers being sent back from the mirrors: {code:none} > curl -I > http://apache.mirrors.pair.com/spark/spark-2.2.1/spark-2.2.1-bin-hadoop2.7.tgz HTTP/1.1 200 OK Date: Thu, 21 Dec 2017 23:46:58 GMT Server: Apache/2.2.29 Last-Modified: Sat, 25 Nov 2017 02:44:26 GMT ETag: "32b662-bfa03c4-55ec5a5c358a1" Accept-Ranges: bytes Content-Length: 200934340 Content-Type: application/x-tar Content-Encoding: x-gzip > curl -I > http://apache.cs.utah.edu/spark/spark-2.2.1/spark-2.2.1-bin-hadoop2.7.tgz HTTP/1.1 200 OK Date: Thu, 21 Dec 2017 23:47:19 GMT Server: Apache/2.2.14 (Ubuntu) Last-Modified: Sat, 25 Nov 2017 02:44:26 GMT ETag: "2ae630-bfa03c4-55ec5a5c0d680" Accept-Ranges: bytes Content-Length: 200934340 Content-Type: application/x-gzip {code} Note that for the first mirror above, {{Content-Type}} is {{application/x-tar}}, and {{Content-Encoding}} is {{x-gzip}}. For the second mirror above, {{Content-Type}} is {{applicaton/x-gzip}} and there is no {{Content-Encoding}} value. For Safari, both sites give me a tar, so Safari may use some other method than looking at the header to determine whether a file is a gzip tarball. EDIT: See the top answer at https://superuser.com/questions/940605/chromium-prevent-unpacking-tar-gz, it seems like the "bug" is that the first mirror above sends back a {{Content-Encoding}} value of {{x-gzip}}. {quote}Your web server is likely sending the .tar.gz file with a content-encoding: gzip header, causing the web browser to assume a gzip layer was applied only to save bandwidth, and what you really intended to send was the .tar archive. Chrome un-gzips it on the other side like it would with any other file (.html, .js, .css, etc.) that it receives gzipped (it dutifully doesn't modify the filename though). To fix this, make sure your web server serves .tar.gz files without the content-encoding: gzip header. More Info: https://code.google.com/p/chromium/issues/detail?id=83292{quote} was (Author: jbrock): I think the inconsistent behavior in Chrome is due to different headers being sent back from the mirrors: {code:sh} > curl -I > http://apache.mirrors.pair.com/spark/spark-2.2.1/spark-2.2.1-bin-hadoop2.7.tgz HTTP/1.1 200 OK Date: Thu, 21 Dec 2017 23:46:58 GMT Server: Apache/2.2.29 Last-Modified: Sat, 25 Nov 2017 02:44:26 GMT ETag: "32b662-bfa03c4-55ec5a5c358a1" Accept-Ranges: bytes Content-Length: 200934340 Content-Type: application/x-tar Content-Encoding: x-gzip > curl -I > http://apache.cs.utah.edu/spark/spark-2.2.1/spark-2.2.1-bin-hadoop2.7.tgz HTTP/1.1 200 OK Date: Thu, 21 Dec 2017 23:47:19 GMT Server: Apache/2.2.14 (Ubuntu) Last-Modified: Sat, 25 Nov 2017 02:44:26 GMT ETag: "2ae630-bfa03c4-55ec5a5c0d680" Accept-Ranges: bytes Content-Length: 200934340 Content-Type: application/x-gzip {code} Note that for the first mirror above, {{Content-Type}} is {{application/x-tar}}, and {{Content-Encoding}} is {{x-gzip}}. For the second mirror above, {{Content-Type}} is {{applicaton/x-gzip}} and there is no {{Content-Encoding}} value. For Safari, both sites give me a tar, so Safari may use some other method than looking at the header to determine whether a file is a gzip tarball. 
EDIT: See the top answer at https://superuser.com/questions/940605/chromium-prevent-unpacking-tar-gz, it seems like the "bug" is that the first mirror above sends back a {{Content-Encoding}} value of {{x-gzip}}. {quote}Your web server is likely sending the .tar.gz file with a content-encoding: gzip header, causing the web browser to assume a gzip layer was applied only to save bandwidth, and what you really intended to send was the .tar archive. Chrome un-gzips it on the other side like it would with any other file (.html, .js, .css, etc.) that it receives gzipped (it dutifully doesn't modify the filename though). To fix this, make sure your web server serves .tar.gz files without the content-encoding: gzip header. More Info: https://code.google.com/p/chromium/issues/detail?id=83292{quote} > Download mirror for spark-2.2.1-bin-hadoop2.7.tgz has file with incorrect > checksum > -- > > Key: SPARK-22851 > URL: https://issues.apache.org/jira/browse/SPARK-22851 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.1 >Reporter: John Brock >Priority: Critical > > The correct sha512 is: > 349ee4bc95c760259c1c28aaae0d9db4146115b03d710fe57685e0d18c9f9538d0b90d9c28f4031ed45f69def5bd217a5bf77fd50f685d93eb207445787f2685. > However,
[jira] [Comment Edited] (SPARK-22851) Download mirror for spark-2.2.1-bin-hadoop2.7.tgz has file with incorrect checksum
[ https://issues.apache.org/jira/browse/SPARK-22851?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16300740#comment-16300740 ] John Brock edited comment on SPARK-22851 at 12/21/17 11:57 PM: --- I think the inconsistent behavior in Chrome is due to different headers being sent back from the mirrors: {code:sh} > curl -I > http://apache.mirrors.pair.com/spark/spark-2.2.1/spark-2.2.1-bin-hadoop2.7.tgz HTTP/1.1 200 OK Date: Thu, 21 Dec 2017 23:46:58 GMT Server: Apache/2.2.29 Last-Modified: Sat, 25 Nov 2017 02:44:26 GMT ETag: "32b662-bfa03c4-55ec5a5c358a1" Accept-Ranges: bytes Content-Length: 200934340 Content-Type: application/x-tar Content-Encoding: x-gzip > curl -I > http://apache.cs.utah.edu/spark/spark-2.2.1/spark-2.2.1-bin-hadoop2.7.tgz HTTP/1.1 200 OK Date: Thu, 21 Dec 2017 23:47:19 GMT Server: Apache/2.2.14 (Ubuntu) Last-Modified: Sat, 25 Nov 2017 02:44:26 GMT ETag: "2ae630-bfa03c4-55ec5a5c0d680" Accept-Ranges: bytes Content-Length: 200934340 Content-Type: application/x-gzip {code} Note that for the first mirror above, {{Content-Type}} is {{application/x-tar}}, and {{Content-Encoding}} is {{x-gzip}}. For the second mirror above, {{Content-Type}} is {{applicaton/x-gzip}} and there is no {{Content-Encoding}} value. For Safari, both sites give me a tar, so Safari may use some other method than looking at the header to determine whether a file is a gzip tarball. EDIT: See the top answer at https://superuser.com/questions/940605/chromium-prevent-unpacking-tar-gz, it seems like the "bug" is that the first mirror above sends back a {{Content-Encoding}} value of {{x-gzip}}. {quote}Your web server is likely sending the .tar.gz file with a content-encoding: gzip header, causing the web browser to assume a gzip layer was applied only to save bandwidth, and what you really intended to send was the .tar archive. Chrome un-gzips it on the other side like it would with any other file (.html, .js, .css, etc.) that it receives gzipped (it dutifully doesn't modify the filename though). To fix this, make sure your web server serves .tar.gz files without the content-encoding: gzip header. More Info: https://code.google.com/p/chromium/issues/detail?id=83292{quote} was (Author: jbrock): I think the inconsistent behavior in Chrome is due to different headers being sent back from the mirrors: {code:sh} > curl -I > http://apache.mirrors.pair.com/spark/spark-2.2.1/spark-2.2.1-bin-hadoop2.7.tgz HTTP/1.1 200 OK Date: Thu, 21 Dec 2017 23:46:58 GMT Server: Apache/2.2.29 Last-Modified: Sat, 25 Nov 2017 02:44:26 GMT ETag: "32b662-bfa03c4-55ec5a5c358a1" Accept-Ranges: bytes Content-Length: 200934340 Content-Type: application/x-tar Content-Encoding: x-gzip > curl -I > http://apache.cs.utah.edu/spark/spark-2.2.1/spark-2.2.1-bin-hadoop2.7.tgz HTTP/1.1 200 OK Date: Thu, 21 Dec 2017 23:47:19 GMT Server: Apache/2.2.14 (Ubuntu) Last-Modified: Sat, 25 Nov 2017 02:44:26 GMT ETag: "2ae630-bfa03c4-55ec5a5c0d680" Accept-Ranges: bytes Content-Length: 200934340 Content-Type: application/x-gzip {code} Note that for the first mirror above, {{Content-Type}} is {{application/x-tar}}, and {{Content-Encoding}} is {{x-gzip}}. For the second mirror above, {{Content-Type}} is {{applicaton/x-gzip}} and there is no {{Content-Encoding}} value. For Safari, both sites give me a tar, so Safari may use some other method than looking at the header to determine whether a file is a gzip tarball. 
> Download mirror for spark-2.2.1-bin-hadoop2.7.tgz has file with incorrect > checksum > -- > > Key: SPARK-22851 > URL: https://issues.apache.org/jira/browse/SPARK-22851 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.1 >Reporter: John Brock >Priority: Critical > > The correct sha512 is: > 349ee4bc95c760259c1c28aaae0d9db4146115b03d710fe57685e0d18c9f9538d0b90d9c28f4031ed45f69def5bd217a5bf77fd50f685d93eb207445787f2685. > However, the file I downloaded from > http://apache.mirrors.pair.com/spark/spark-2.2.1/spark-2.2.1-bin-hadoop2.7.tgz > is giving me a different sha256: > 039935ef9c4813eca15b29e7ddf91706844a52287999e8c5780f4361b736eb454110825224ae1b58cac9d686785ae0944a1c29e0b345532762752abab9b2cba9 > It looks like this mirror has a file that isn't actually gzipped, just > tarred. If I ungzip one of the copies of spark-2.2.1-bin-hadoop2.7.tgz with > the correct sha512, and take the sha512 of the resulting tar, I get the same > incorrect hash above of > 039935ef9c4813eca15b29e7ddf91706844a52287999e8c5780f4361b736eb454110825224ae1b58cac9d686785ae0944a1c29e0b345532762752abab9b2cba9. > I asked some colleagues to download the incorrect file themselves to check > the hash -- some of them got a file that was gzipped and some didn't
[jira] [Commented] (SPARK-22851) Download mirror for spark-2.2.1-bin-hadoop2.7.tgz has file with incorrect checksum
[ https://issues.apache.org/jira/browse/SPARK-22851?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16300740#comment-16300740 ] John Brock commented on SPARK-22851:
I think the inconsistent behavior in Chrome is due to different headers being sent back from the mirrors:
{code:sh}
> curl -I http://apache.mirrors.pair.com/spark/spark-2.2.1/spark-2.2.1-bin-hadoop2.7.tgz
HTTP/1.1 200 OK
Date: Thu, 21 Dec 2017 23:46:58 GMT
Server: Apache/2.2.29
Last-Modified: Sat, 25 Nov 2017 02:44:26 GMT
ETag: "32b662-bfa03c4-55ec5a5c358a1"
Accept-Ranges: bytes
Content-Length: 200934340
Content-Type: application/x-tar
Content-Encoding: x-gzip

> curl -I http://apache.cs.utah.edu/spark/spark-2.2.1/spark-2.2.1-bin-hadoop2.7.tgz
HTTP/1.1 200 OK
Date: Thu, 21 Dec 2017 23:47:19 GMT
Server: Apache/2.2.14 (Ubuntu)
Last-Modified: Sat, 25 Nov 2017 02:44:26 GMT
ETag: "2ae630-bfa03c4-55ec5a5c0d680"
Accept-Ranges: bytes
Content-Length: 200934340
Content-Type: application/x-gzip
{code}
Note that for the first mirror above, {{Content-Type}} is {{application/x-tar}} and {{Content-Encoding}} is {{x-gzip}}. For the second mirror above, {{Content-Type}} is {{application/x-gzip}} and there is no {{Content-Encoding}} value. For Safari, both sites give me a tar, so Safari may use some method other than looking at the headers to determine whether a file is a gzip tarball.
> Download mirror for spark-2.2.1-bin-hadoop2.7.tgz has file with incorrect checksum
> ---
>
> Key: SPARK-22851
> URL: https://issues.apache.org/jira/browse/SPARK-22851
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 2.2.1
> Reporter: John Brock
> Priority: Critical
>
> The correct sha512 is:
> 349ee4bc95c760259c1c28aaae0d9db4146115b03d710fe57685e0d18c9f9538d0b90d9c28f4031ed45f69def5bd217a5bf77fd50f685d93eb207445787f2685.
> However, the file I downloaded from http://apache.mirrors.pair.com/spark/spark-2.2.1/spark-2.2.1-bin-hadoop2.7.tgz is giving me a different sha512:
> 039935ef9c4813eca15b29e7ddf91706844a52287999e8c5780f4361b736eb454110825224ae1b58cac9d686785ae0944a1c29e0b345532762752abab9b2cba9
> It looks like this mirror has a file that isn't actually gzipped, just tarred. If I ungzip one of the copies of spark-2.2.1-bin-hadoop2.7.tgz with the correct sha512, and take the sha512 of the resulting tar, I get the same incorrect hash above of 039935ef9c4813eca15b29e7ddf91706844a52287999e8c5780f4361b736eb454110825224ae1b58cac9d686785ae0944a1c29e0b345532762752abab9b2cba9.
> I asked some colleagues to download the incorrect file themselves to check the hash -- some of them got a file that was gzipped and some didn't. I'm assuming there's some caching or mirroring happening that may give you a different file than the one I got.
--
This message was sent by Atlassian JIRA (v6.4.14#64029)
-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
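For anyone who wants to check a mirror without a browser in the middle, the same header inspection can be done from a few lines of Scala using the JDK's HttpURLConnection. This is just a convenience sketch (the URL is the pair.com mirror from the report), not anything Spark-specific:
{code:scala}
import java.net.{HttpURLConnection, URL}

// Sketch: send a HEAD request and print the headers relevant to this bug.
// A mirror answering with "Content-Encoding: x-gzip" is the one Chrome will
// transparently decompress, leaving a bare tar with the wrong checksum.
object CheckMirrorHeaders {
  def main(args: Array[String]): Unit = {
    val url = new URL("http://apache.mirrors.pair.com/spark/spark-2.2.1/spark-2.2.1-bin-hadoop2.7.tgz")
    val conn = url.openConnection().asInstanceOf[HttpURLConnection]
    conn.setRequestMethod("HEAD")
    try {
      println(s"Content-Type:     ${conn.getHeaderField("Content-Type")}")
      println(s"Content-Encoding: ${conn.getHeaderField("Content-Encoding")}")
      println(s"Content-Length:   ${conn.getHeaderField("Content-Length")}")
    } finally {
      conn.disconnect()
    }
  }
}
{code}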
[jira] [Commented] (SPARK-22851) Download mirror for spark-2.2.1-bin-hadoop2.7.tgz has file with incorrect checksum
[ https://issues.apache.org/jira/browse/SPARK-22851?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16300428#comment-16300428 ] John Brock commented on SPARK-22851: Interesting... for that specific pair.com mirror, I'm also seeing wget retrieve the gzipped file with the correct hash, but Chrome continues to give me just the tar. > Download mirror for spark-2.2.1-bin-hadoop2.7.tgz has file with incorrect > checksum > -- > > Key: SPARK-22851 > URL: https://issues.apache.org/jira/browse/SPARK-22851 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.1 >Reporter: John Brock >Priority: Critical > > The correct sha512 is: > 349ee4bc95c760259c1c28aaae0d9db4146115b03d710fe57685e0d18c9f9538d0b90d9c28f4031ed45f69def5bd217a5bf77fd50f685d93eb207445787f2685. > However, the file I downloaded from > http://apache.mirrors.pair.com/spark/spark-2.2.1/spark-2.2.1-bin-hadoop2.7.tgz > is giving me a different sha256: > 039935ef9c4813eca15b29e7ddf91706844a52287999e8c5780f4361b736eb454110825224ae1b58cac9d686785ae0944a1c29e0b345532762752abab9b2cba9 > It looks like this mirror has a file that isn't actually gzipped, just > tarred. If I ungzip one of the copies of spark-2.2.1-bin-hadoop2.7.tgz with > the correct sha512, and take the sha512 of the resulting tar, I get the same > incorrect hash above of > 039935ef9c4813eca15b29e7ddf91706844a52287999e8c5780f4361b736eb454110825224ae1b58cac9d686785ae0944a1c29e0b345532762752abab9b2cba9. > I asked some colleagues to download the incorrect file themselves to check > the hash -- some of them got a file that was gzipped and some didn't. I'm > assuming there's some caching or mirroring happening that may give you a > different file than the one I got. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22851) Download mirror for spark-2.2.1-bin-hadoop2.7.tgz has file with incorrect checksum
[ https://issues.apache.org/jira/browse/SPARK-22851?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16299109#comment-16299109 ] John Brock commented on SPARK-22851: Some colleagues claimed to have specifically downloaded from http://apache.mirrors.pair.com/spark/spark-2.2.1/spark-2.2.1-bin-hadoop2.7.tgz and calculated a correct sha512 checksum of 349ee4bc95c760259c1c28aaae0d9db4146115b03d710fe57685e0d18c9f9538d0b90d9c28f4031ed45f69def5bd217a5bf77fd50f685d93eb207445787f2685. However, every time I download from that same mirror I get an incorrect checksum of 039935ef9c4813eca15b29e7ddf91706844a52287999e8c5780f4361b736eb454110825224ae1b58cac9d686785ae0944a1c29e0b345532762752abab9b2cba9. > Download mirror for spark-2.2.1-bin-hadoop2.7.tgz has file with incorrect > checksum > -- > > Key: SPARK-22851 > URL: https://issues.apache.org/jira/browse/SPARK-22851 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.1 >Reporter: John Brock >Priority: Critical > > The correct sha512 is: > 349ee4bc95c760259c1c28aaae0d9db4146115b03d710fe57685e0d18c9f9538d0b90d9c28f4031ed45f69def5bd217a5bf77fd50f685d93eb207445787f2685. > However, the file I downloaded from > http://apache.mirrors.pair.com/spark/spark-2.2.1/spark-2.2.1-bin-hadoop2.7.tgz > is giving me a different sha256: > 039935ef9c4813eca15b29e7ddf91706844a52287999e8c5780f4361b736eb454110825224ae1b58cac9d686785ae0944a1c29e0b345532762752abab9b2cba9 > It looks like this mirror has a file that isn't actually gzipped, just > tarred. If I ungzip one of the copies of spark-2.2.1-bin-hadoop2.7.tgz with > the correct sha512, and take the sha512 of the resulting tar, I get the same > incorrect hash above of > 039935ef9c4813eca15b29e7ddf91706844a52287999e8c5780f4361b736eb454110825224ae1b58cac9d686785ae0944a1c29e0b345532762752abab9b2cba9. > I asked some colleagues to download the incorrect file themselves to check > the hash -- some of them got a file that was gzipped and some didn't. I'm > assuming there's some caching or mirroring happening that may give you a > different file than the one I got. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-22851) Download mirror for spark-2.2.1-bin-hadoop2.7.tgz has file with incorrect checksum
John Brock created SPARK-22851:
--
Summary: Download mirror for spark-2.2.1-bin-hadoop2.7.tgz has file with incorrect checksum
Key: SPARK-22851
URL: https://issues.apache.org/jira/browse/SPARK-22851
Project: Spark
Issue Type: Bug
Components: Spark Core
Affects Versions: 2.2.1
Reporter: John Brock
Priority: Critical

The correct sha512 is: 349ee4bc95c760259c1c28aaae0d9db4146115b03d710fe57685e0d18c9f9538d0b90d9c28f4031ed45f69def5bd217a5bf77fd50f685d93eb207445787f2685.

However, the file I downloaded from http://apache.mirrors.pair.com/spark/spark-2.2.1/spark-2.2.1-bin-hadoop2.7.tgz is giving me a different sha512: 039935ef9c4813eca15b29e7ddf91706844a52287999e8c5780f4361b736eb454110825224ae1b58cac9d686785ae0944a1c29e0b345532762752abab9b2cba9

It looks like this mirror has a file that isn't actually gzipped, just tarred. If I ungzip one of the copies of spark-2.2.1-bin-hadoop2.7.tgz with the correct sha512, and take the sha512 of the resulting tar, I get the same incorrect hash above of 039935ef9c4813eca15b29e7ddf91706844a52287999e8c5780f4361b736eb454110825224ae1b58cac9d686785ae0944a1c29e0b345532762752abab9b2cba9.

I asked some colleagues to download the incorrect file themselves to check the hash -- some of them got a file that was gzipped and some didn't. I'm assuming there's some caching or mirroring happening that may give you a different file than the one I got.
--
This message was sent by Atlassian JIRA (v6.4.14#64029)
-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
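Since the whole report hinges on comparing checksums, here is a small self-contained Scala sketch that computes the SHA-512 of a downloaded file so it can be compared against the value published on the Spark downloads page. The file path is a placeholder:
{code:scala}
import java.nio.file.{Files, Paths}
import java.security.MessageDigest

// Sketch: compute the SHA-512 digest of a local file and print it as hex,
// for comparison with the published checksum for the release.
object Sha512Check {
  def main(args: Array[String]): Unit = {
    val path = Paths.get("spark-2.2.1-bin-hadoop2.7.tgz") // placeholder path
    val md = MessageDigest.getInstance("SHA-512")
    val in = Files.newInputStream(path)
    try {
      val buf = new Array[Byte](8192)
      Iterator.continually(in.read(buf)).takeWhile(_ != -1).foreach(n => md.update(buf, 0, n))
    } finally {
      in.close()
    }
    println(md.digest().map("%02x".format(_)).mkString)
  }
}
{code}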
[jira] [Commented] (SPARK-21928) ClassNotFoundException for custom Kryo registrator class during serde in netty threads
[ https://issues.apache.org/jira/browse/SPARK-21928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16175452#comment-16175452 ] John Brock commented on SPARK-21928:
Excellent, thanks for looking into this.
> ClassNotFoundException for custom Kryo registrator class during serde in netty threads
> --
>
> Key: SPARK-21928
> URL: https://issues.apache.org/jira/browse/SPARK-21928
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 2.1.1, 2.2.0
> Reporter: John Brock
> Assignee: Imran Rashid
> Fix For: 2.2.1, 2.3.0
>
> From SPARK-13990 & SPARK-13926, Spark's SerializerManager has its own instance of a KryoSerializer which does not have the defaultClassLoader set on it. For normal task execution, that doesn't cause problems, because the serializer falls back to the current thread's task loader, which is set anyway.
> However, netty maintains its own thread pool, and those threads don't change their classloader to include the extra user jars needed for the custom kryo registrator. That only matters when blocks are sent across the network in a way that forces serde in the netty thread. That won't happen often, because (a) Spark tries to execute tasks where the RDDs are already cached and (b) broadcast blocks generally don't require any serde in the netty threads (that occurs in the task thread that is reading the broadcast value). However, it can come up with remote cache reads, or if fetching a broadcast block forces another block to disk, which requires serialization.
> This doesn't affect the shuffle path, because the serde is never done in the threads created by netty.
> I think a fix for this should be fairly straightforward; we just need to set the classloader on that extra kryo instance.
> (original problem description below)
> I unfortunately can't reliably reproduce this bug; it happens only occasionally, when training a logistic regression model with very large datasets. The training will often proceed through several {{treeAggregate}} calls without any problems, and then suddenly workers will start running into this {{java.lang.ClassNotFoundException}}.
> After doing some debugging, it seems that whenever this error happens, Spark is trying to use the {{sun.misc.Launcher$AppClassLoader}} {{ClassLoader}} instance instead of the usual {{org.apache.spark.util.MutableURLClassLoader}}. {{MutableURLClassLoader}} can see my custom Kryo registrator, but the {{AppClassLoader}} instance can't.
> When this error does pop up, it's usually accompanied by the task seeming to hang, and I need to kill Spark manually.
> I'm running a Spark application in cluster mode via spark-submit, and I have a custom Kryo registrator. The JAR is built with {{sbt assembly}}.
> Exception message: > {noformat} > 17/08/29 22:39:04 ERROR TransportRequestHandler: Error opening block > StreamChunkId{streamId=542074019336, chunkIndex=0} for request from > /10.0.29.65:34332 > org.apache.spark.SparkException: Failed to register classes with Kryo > at > org.apache.spark.serializer.KryoSerializer.newKryo(KryoSerializer.scala:139) > at > org.apache.spark.serializer.KryoSerializerInstance.borrowKryo(KryoSerializer.scala:292) > at > org.apache.spark.serializer.KryoSerializerInstance.(KryoSerializer.scala:277) > at > org.apache.spark.serializer.KryoSerializer.newInstance(KryoSerializer.scala:186) > at > org.apache.spark.serializer.SerializerManager.dataSerializeStream(SerializerManager.scala:169) > at > org.apache.spark.storage.BlockManager$$anonfun$dropFromMemory$3.apply(BlockManager.scala:1382) > at > org.apache.spark.storage.BlockManager$$anonfun$dropFromMemory$3.apply(BlockManager.scala:1377) > at org.apache.spark.storage.DiskStore.put(DiskStore.scala:69) > at > org.apache.spark.storage.BlockManager.dropFromMemory(BlockManager.scala:1377) > at > org.apache.spark.storage.memory.MemoryStore.org$apache$spark$storage$memory$MemoryStore$$dropBlock$1(MemoryStore.scala:524) > at > org.apache.spark.storage.memory.MemoryStore$$anonfun$evictBlocksToFreeSpace$2.apply(MemoryStore.scala:545) > at > org.apache.spark.storage.memory.MemoryStore$$anonfun$evictBlocksToFreeSpace$2.apply(MemoryStore.scala:539) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) > at > org.apache.spark.storage.memory.MemoryStore.evictBlocksToFreeSpace(MemoryStore.scala:539) > at > org.apache.spark.memory.StorageMemoryPool.acquireMemory(StorageMemoryPool.scala:92) > at > org.apache.spark.memory.StorageMemoryPool.acquireMemory(StorageMem
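For readers who have not used a custom Kryo registrator, the setup described above (a registrator class shipped inside an assembly JAR and named in the Spark conf) looks roughly like the sketch below. The class and registered types are placeholders; the conf keys are the standard {{spark.serializer}} and {{spark.kryo.registrator}} settings rather than anything specific to this ticket.
{code:scala}
import com.esotericsoftware.kryo.Kryo
import org.apache.spark.SparkConf
import org.apache.spark.serializer.KryoRegistrator

// Placeholder application class that needs Kryo registration.
case class FeatureRecord(id: Long, values: Array[Double])

// A custom registrator like the one this ticket refers to. The class has to be
// visible to whichever classloader performs the (de)serialization -- which is
// exactly what goes wrong in the netty threads described above.
class MyKryoRegistrator extends KryoRegistrator {
  override def registerClasses(kryo: Kryo): Unit = {
    kryo.register(classOf[FeatureRecord])
    kryo.register(classOf[Array[FeatureRecord]])
  }
}

object KryoConfExample {
  // Typical configuration when submitting the assembly JAR that contains the registrator.
  def buildConf(): SparkConf = new SparkConf()
    .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .set("spark.kryo.registrator", classOf[MyKryoRegistrator].getName)
}
{code}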
[jira] [Commented] (SPARK-21928) ML LogisticRegression training occasionally produces java.lang.ClassNotFoundException when attempting to load custom Kryo registrator class
[ https://issues.apache.org/jira/browse/SPARK-21928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16175066#comment-16175066 ] John Brock commented on SPARK-21928: It does! I see this in the log right before an executor got stuck: {{17/08/31 19:56:25 INFO MemoryStore: 3 blocks selected for dropping (284.2 MB bytes)}} > ML LogisticRegression training occasionally produces > java.lang.ClassNotFoundException when attempting to load custom Kryo > registrator class > --- > > Key: SPARK-21928 > URL: https://issues.apache.org/jira/browse/SPARK-21928 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.1, 2.2.0 >Reporter: John Brock > > I unfortunately can't reliably reproduce this bug; it happens only > occasionally, when training a logistic regression model with very large > datasets. The training will often proceed through several {{treeAggregate}} > calls without any problems, and then suddenly workers will start running into > this {{java.lang.ClassNotFoundException}}. > After doing some debugging, it seems that whenever this error happens, Spark > is trying to use the {{sun.misc.Launcher$AppClassLoader}} {{ClassLoader}} > instance instead of the usual > {{org.apache.spark.util.MutableURLClassLoader}}. {{MutableURLClassLoader}} > can see my custom Kryo registrator, but the {{AppClassLoader}} instance can't. > When this error does pop up, it's usually accompanied by the task seeming to > hang, and I need to kill Spark manually. > I'm running a Spark application in cluster mode via spark-submit, and I have > a custom Kryo registrator. The JAR is built with {{sbt assembly}}. > Exception message: > {noformat} > 17/08/29 22:39:04 ERROR TransportRequestHandler: Error opening block > StreamChunkId{streamId=542074019336, chunkIndex=0} for request from > /10.0.29.65:34332 > org.apache.spark.SparkException: Failed to register classes with Kryo > at > org.apache.spark.serializer.KryoSerializer.newKryo(KryoSerializer.scala:139) > at > org.apache.spark.serializer.KryoSerializerInstance.borrowKryo(KryoSerializer.scala:292) > at > org.apache.spark.serializer.KryoSerializerInstance.(KryoSerializer.scala:277) > at > org.apache.spark.serializer.KryoSerializer.newInstance(KryoSerializer.scala:186) > at > org.apache.spark.serializer.SerializerManager.dataSerializeStream(SerializerManager.scala:169) > at > org.apache.spark.storage.BlockManager$$anonfun$dropFromMemory$3.apply(BlockManager.scala:1382) > at > org.apache.spark.storage.BlockManager$$anonfun$dropFromMemory$3.apply(BlockManager.scala:1377) > at org.apache.spark.storage.DiskStore.put(DiskStore.scala:69) > at > org.apache.spark.storage.BlockManager.dropFromMemory(BlockManager.scala:1377) > at > org.apache.spark.storage.memory.MemoryStore.org$apache$spark$storage$memory$MemoryStore$$dropBlock$1(MemoryStore.scala:524) > at > org.apache.spark.storage.memory.MemoryStore$$anonfun$evictBlocksToFreeSpace$2.apply(MemoryStore.scala:545) > at > org.apache.spark.storage.memory.MemoryStore$$anonfun$evictBlocksToFreeSpace$2.apply(MemoryStore.scala:539) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) > at > org.apache.spark.storage.memory.MemoryStore.evictBlocksToFreeSpace(MemoryStore.scala:539) > at > org.apache.spark.memory.StorageMemoryPool.acquireMemory(StorageMemoryPool.scala:92) > at > org.apache.spark.memory.StorageMemoryPool.acquireMemory(StorageMemoryPool.scala:73) > at > 
org.apache.spark.memory.StaticMemoryManager.acquireStorageMemory(StaticMemoryManager.scala:72) > at > org.apache.spark.storage.memory.MemoryStore.putBytes(MemoryStore.scala:147) > at > org.apache.spark.storage.BlockManager.maybeCacheDiskBytesInMemory(BlockManager.scala:1143) > at > org.apache.spark.storage.BlockManager.org$apache$spark$storage$BlockManager$$doGetLocalBytes(BlockManager.scala:594) > at > org.apache.spark.storage.BlockManager$$anonfun$getLocalBytes$2.apply(BlockManager.scala:559) > at > org.apache.spark.storage.BlockManager$$anonfun$getLocalBytes$2.apply(BlockManager.scala:559) > at scala.Option.map(Option.scala:146) > at > org.apache.spark.storage.BlockManager.getLocalBytes(BlockManager.scala:559) > at > org.apache.spark.storage.BlockManager.getBlockData(BlockManager.scala:353) > at > org.apache.spark.network.netty.NettyBlockRpcServer$$anonfun$1.apply(NettyBlockRpcServer.scala:61) > at > org.apache.spark.network.netty.NettyBlockRpcServer$$anonfun$1.apply(NettyBlockRpcServer.scala:60) > at scala.collection.I
[jira] [Comment Edited] (SPARK-21928) ML LogisticRegression training occasionally produces java.lang.ClassNotFoundException when attempting to load custom Kryo registrator class
[ https://issues.apache.org/jira/browse/SPARK-21928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16172123#comment-16172123 ] John Brock edited comment on SPARK-21928 at 9/19/17 6:26 PM: - [~irashid], thanks for taking a look. I see the same thing as that user -- the executor gets stuck, causing the application to fail. Before I found the workaround I mentioned above with spark.driver.extraClassPath and spark.executor.extraClassPath, I was using speculation to kill off the hanging tasks, although this wasn't always enough (e.g., if a stage got stuck before reaching the speculation threshold), and sometimes caused long-running (but non-stuck) tasks to be killed. was (Author: jbrock): [~irashid], thanks for taking a look. I see the same thing as that user -- the executor gets stuck, causing the application to fail. Before I found the workaround I mentioned above with spark.driver.extraClassPath and spark.executor.extraClassPath, I was using speculation to kill off the hanging tasks, although this wasn't always enough (e.g., if a stage got stuck before reaching the speculation threshold), and sometimes caused long-running but non-stuck tasks to be killed. > ML LogisticRegression training occasionally produces > java.lang.ClassNotFoundException when attempting to load custom Kryo > registrator class > --- > > Key: SPARK-21928 > URL: https://issues.apache.org/jira/browse/SPARK-21928 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.1, 2.2.0 >Reporter: John Brock > > I unfortunately can't reliably reproduce this bug; it happens only > occasionally, when training a logistic regression model with very large > datasets. The training will often proceed through several {{treeAggregate}} > calls without any problems, and then suddenly workers will start running into > this {{java.lang.ClassNotFoundException}}. > After doing some debugging, it seems that whenever this error happens, Spark > is trying to use the {{sun.misc.Launcher$AppClassLoader}} {{ClassLoader}} > instance instead of the usual > {{org.apache.spark.util.MutableURLClassLoader}}. {{MutableURLClassLoader}} > can see my custom Kryo registrator, but the {{AppClassLoader}} instance can't. > When this error does pop up, it's usually accompanied by the task seeming to > hang, and I need to kill Spark manually. > I'm running a Spark application in cluster mode via spark-submit, and I have > a custom Kryo registrator. The JAR is built with {{sbt assembly}}. 
> Exception message: > {noformat} > 17/08/29 22:39:04 ERROR TransportRequestHandler: Error opening block > StreamChunkId{streamId=542074019336, chunkIndex=0} for request from > /10.0.29.65:34332 > org.apache.spark.SparkException: Failed to register classes with Kryo > at > org.apache.spark.serializer.KryoSerializer.newKryo(KryoSerializer.scala:139) > at > org.apache.spark.serializer.KryoSerializerInstance.borrowKryo(KryoSerializer.scala:292) > at > org.apache.spark.serializer.KryoSerializerInstance.(KryoSerializer.scala:277) > at > org.apache.spark.serializer.KryoSerializer.newInstance(KryoSerializer.scala:186) > at > org.apache.spark.serializer.SerializerManager.dataSerializeStream(SerializerManager.scala:169) > at > org.apache.spark.storage.BlockManager$$anonfun$dropFromMemory$3.apply(BlockManager.scala:1382) > at > org.apache.spark.storage.BlockManager$$anonfun$dropFromMemory$3.apply(BlockManager.scala:1377) > at org.apache.spark.storage.DiskStore.put(DiskStore.scala:69) > at > org.apache.spark.storage.BlockManager.dropFromMemory(BlockManager.scala:1377) > at > org.apache.spark.storage.memory.MemoryStore.org$apache$spark$storage$memory$MemoryStore$$dropBlock$1(MemoryStore.scala:524) > at > org.apache.spark.storage.memory.MemoryStore$$anonfun$evictBlocksToFreeSpace$2.apply(MemoryStore.scala:545) > at > org.apache.spark.storage.memory.MemoryStore$$anonfun$evictBlocksToFreeSpace$2.apply(MemoryStore.scala:539) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) > at > org.apache.spark.storage.memory.MemoryStore.evictBlocksToFreeSpace(MemoryStore.scala:539) > at > org.apache.spark.memory.StorageMemoryPool.acquireMemory(StorageMemoryPool.scala:92) > at > org.apache.spark.memory.StorageMemoryPool.acquireMemory(StorageMemoryPool.scala:73) > at > org.apache.spark.memory.StaticMemoryManager.acquireStorageMemory(StaticMemoryManager.scala:72) > at > org.apache.spark.storage.memory.MemoryStore.putBytes(MemoryStore.scala:147) > at > org.apa
[jira] [Commented] (SPARK-21928) ML LogisticRegression training occasionally produces java.lang.ClassNotFoundException when attempting to load custom Kryo registrator class
[ https://issues.apache.org/jira/browse/SPARK-21928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16172123#comment-16172123 ] John Brock commented on SPARK-21928: [~irashid], thanks for taking a look. I see the same thing as that user -- the executor gets stuck, causing the application to fail. Before I found the workaround I mentioned above with spark.driver.extraClassPath and spark.executor.extraClassPath, I was using speculation to kill off the hanging tasks, although this wasn't always enough (e.g., if a stage got stuck before reaching the speculation threshold), and sometimes caused long-running but non-stuck tasks to be killed. > ML LogisticRegression training occasionally produces > java.lang.ClassNotFoundException when attempting to load custom Kryo > registrator class > --- > > Key: SPARK-21928 > URL: https://issues.apache.org/jira/browse/SPARK-21928 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.1, 2.2.0 >Reporter: John Brock > > I unfortunately can't reliably reproduce this bug; it happens only > occasionally, when training a logistic regression model with very large > datasets. The training will often proceed through several {{treeAggregate}} > calls without any problems, and then suddenly workers will start running into > this {{java.lang.ClassNotFoundException}}. > After doing some debugging, it seems that whenever this error happens, Spark > is trying to use the {{sun.misc.Launcher$AppClassLoader}} {{ClassLoader}} > instance instead of the usual > {{org.apache.spark.util.MutableURLClassLoader}}. {{MutableURLClassLoader}} > can see my custom Kryo registrator, but the {{AppClassLoader}} instance can't. > When this error does pop up, it's usually accompanied by the task seeming to > hang, and I need to kill Spark manually. > I'm running a Spark application in cluster mode via spark-submit, and I have > a custom Kryo registrator. The JAR is built with {{sbt assembly}}. 
> Exception message: > {noformat} > 17/08/29 22:39:04 ERROR TransportRequestHandler: Error opening block > StreamChunkId{streamId=542074019336, chunkIndex=0} for request from > /10.0.29.65:34332 > org.apache.spark.SparkException: Failed to register classes with Kryo > at > org.apache.spark.serializer.KryoSerializer.newKryo(KryoSerializer.scala:139) > at > org.apache.spark.serializer.KryoSerializerInstance.borrowKryo(KryoSerializer.scala:292) > at > org.apache.spark.serializer.KryoSerializerInstance.(KryoSerializer.scala:277) > at > org.apache.spark.serializer.KryoSerializer.newInstance(KryoSerializer.scala:186) > at > org.apache.spark.serializer.SerializerManager.dataSerializeStream(SerializerManager.scala:169) > at > org.apache.spark.storage.BlockManager$$anonfun$dropFromMemory$3.apply(BlockManager.scala:1382) > at > org.apache.spark.storage.BlockManager$$anonfun$dropFromMemory$3.apply(BlockManager.scala:1377) > at org.apache.spark.storage.DiskStore.put(DiskStore.scala:69) > at > org.apache.spark.storage.BlockManager.dropFromMemory(BlockManager.scala:1377) > at > org.apache.spark.storage.memory.MemoryStore.org$apache$spark$storage$memory$MemoryStore$$dropBlock$1(MemoryStore.scala:524) > at > org.apache.spark.storage.memory.MemoryStore$$anonfun$evictBlocksToFreeSpace$2.apply(MemoryStore.scala:545) > at > org.apache.spark.storage.memory.MemoryStore$$anonfun$evictBlocksToFreeSpace$2.apply(MemoryStore.scala:539) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) > at > org.apache.spark.storage.memory.MemoryStore.evictBlocksToFreeSpace(MemoryStore.scala:539) > at > org.apache.spark.memory.StorageMemoryPool.acquireMemory(StorageMemoryPool.scala:92) > at > org.apache.spark.memory.StorageMemoryPool.acquireMemory(StorageMemoryPool.scala:73) > at > org.apache.spark.memory.StaticMemoryManager.acquireStorageMemory(StaticMemoryManager.scala:72) > at > org.apache.spark.storage.memory.MemoryStore.putBytes(MemoryStore.scala:147) > at > org.apache.spark.storage.BlockManager.maybeCacheDiskBytesInMemory(BlockManager.scala:1143) > at > org.apache.spark.storage.BlockManager.org$apache$spark$storage$BlockManager$$doGetLocalBytes(BlockManager.scala:594) > at > org.apache.spark.storage.BlockManager$$anonfun$getLocalBytes$2.apply(BlockManager.scala:559) > at > org.apache.spark.storage.BlockManager$$anonfun$getLocalBytes$2.apply(BlockManager.scala:559) > at scala.Option.map(Option.scala:146) > at > org.apache.spark.storage.BlockManager.getLocalBytes(BlockManager.scala:559) >
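The speculation workaround mentioned in the comment is ordinary Spark configuration; a sketch of what it looks like is below, with illustrative values rather than the ones actually used. (The more reliable workaround referred to above was adding the assembly JAR to spark.driver.extraClassPath and spark.executor.extraClassPath so the other classloaders can see the registrator; those options are supplied at submit time and are not shown here.)
{code:scala}
import org.apache.spark.SparkConf

// Sketch of the speculation-based workaround: let Spark re-launch suspiciously
// slow (here: stuck) tasks on other executors. The values are illustrative.
object SpeculationWorkaround {
  def buildConf(): SparkConf = new SparkConf()
    .set("spark.speculation", "true")
    // How often to check running tasks for speculation candidates.
    .set("spark.speculation.interval", "100ms")
    // A task must be this many times slower than the median to be re-launched.
    .set("spark.speculation.multiplier", "1.5")
    // Fraction of tasks that must finish before speculation kicks in; a stage
    // stuck before this threshold is never helped, as the comment notes.
    .set("spark.speculation.quantile", "0.75")
}
{code}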
[jira] [Created] (SPARK-21928) ML LogisticRegression training occasionally produces java.lang.ClassNotFoundException when attempting to load custom Kryo registrator class
John Brock created SPARK-21928: -- Summary: ML LogisticRegression training occasionally produces java.lang.ClassNotFoundException when attempting to load custom Kryo registrator class Key: SPARK-21928 URL: https://issues.apache.org/jira/browse/SPARK-21928 Project: Spark Issue Type: Bug Components: ML Affects Versions: 2.2.0 Reporter: John Brock I unfortunately can't reliably reproduce this bug; it happens only occasionally, when training a logistic regression model with very large datasets. The training will often proceed through several {{treeAggregate}} calls without any problems, and then suddenly workers will start running into this {{java.lang.ClassNotFoundException}}. After doing some debugging, it seems that whenever this error happens, Spark is trying to use the {{sun.misc.Launcher$AppClassLoader}} {{ClassLoader}} instance instead of the usual {{org.apache.spark.util.MutableURLClassLoader}}. {{MutableURLClassLoader}} can see my custom Kryo registrator, but the {{AppClassLoader}} instance can't. When this error does pop up, it's usually accompanied by the task seeming to hang, and I need to kill Spark manually. I'm running a Spark application in cluster mode via spark-submit, and I have a custom Kryo registrator. The JAR is built with {{sbt assembly}}. Exception message: {noformat} 17/08/29 22:39:04 ERROR TransportRequestHandler: Error opening block StreamChunkId{streamId=542074019336, chunkIndex=0} for request from /10.0.29.65:34332 org.apache.spark.SparkException: Failed to register classes with Kryo at org.apache.spark.serializer.KryoSerializer.newKryo(KryoSerializer.scala:139) at org.apache.spark.serializer.KryoSerializerInstance.borrowKryo(KryoSerializer.scala:292) at org.apache.spark.serializer.KryoSerializerInstance.(KryoSerializer.scala:277) at org.apache.spark.serializer.KryoSerializer.newInstance(KryoSerializer.scala:186) at org.apache.spark.serializer.SerializerManager.dataSerializeStream(SerializerManager.scala:169) at org.apache.spark.storage.BlockManager$$anonfun$dropFromMemory$3.apply(BlockManager.scala:1382) at org.apache.spark.storage.BlockManager$$anonfun$dropFromMemory$3.apply(BlockManager.scala:1377) at org.apache.spark.storage.DiskStore.put(DiskStore.scala:69) at org.apache.spark.storage.BlockManager.dropFromMemory(BlockManager.scala:1377) at org.apache.spark.storage.memory.MemoryStore.org$apache$spark$storage$memory$MemoryStore$$dropBlock$1(MemoryStore.scala:524) at org.apache.spark.storage.memory.MemoryStore$$anonfun$evictBlocksToFreeSpace$2.apply(MemoryStore.scala:545) at org.apache.spark.storage.memory.MemoryStore$$anonfun$evictBlocksToFreeSpace$2.apply(MemoryStore.scala:539) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) at org.apache.spark.storage.memory.MemoryStore.evictBlocksToFreeSpace(MemoryStore.scala:539) at org.apache.spark.memory.StorageMemoryPool.acquireMemory(StorageMemoryPool.scala:92) at org.apache.spark.memory.StorageMemoryPool.acquireMemory(StorageMemoryPool.scala:73) at org.apache.spark.memory.StaticMemoryManager.acquireStorageMemory(StaticMemoryManager.scala:72) at org.apache.spark.storage.memory.MemoryStore.putBytes(MemoryStore.scala:147) at org.apache.spark.storage.BlockManager.maybeCacheDiskBytesInMemory(BlockManager.scala:1143) at org.apache.spark.storage.BlockManager.org$apache$spark$storage$BlockManager$$doGetLocalBytes(BlockManager.scala:594) at 
org.apache.spark.storage.BlockManager$$anonfun$getLocalBytes$2.apply(BlockManager.scala:559) at org.apache.spark.storage.BlockManager$$anonfun$getLocalBytes$2.apply(BlockManager.scala:559) at scala.Option.map(Option.scala:146) at org.apache.spark.storage.BlockManager.getLocalBytes(BlockManager.scala:559) at org.apache.spark.storage.BlockManager.getBlockData(BlockManager.scala:353) at org.apache.spark.network.netty.NettyBlockRpcServer$$anonfun$1.apply(NettyBlockRpcServer.scala:61) at org.apache.spark.network.netty.NettyBlockRpcServer$$anonfun$1.apply(NettyBlockRpcServer.scala:60) at scala.collection.Iterator$$anon$11.next(Iterator.scala:409) at scala.collection.convert.Wrappers$IteratorWrapper.next(Wrappers.scala:31) at org.apache.spark.network.server.OneForOneStreamManager.getChunk(OneForOneStreamManager.java:89) at org.apache.spark.network.server.TransportRequestHandler.processFetchRequest(TransportRequestHandler.java:125) at org.apache.spark.network.server.TransportRequestHandler.handle(TransportRequestHandler.java:103) at org.apache.spark.network.server.TransportChannelHandler.channelRead(TransportChannelHandler.java:118) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(
[jira] [Commented] (SPARK-21526) Add support to ML LogisticRegression for setting initial model
[ https://issues.apache.org/jira/browse/SPARK-21526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16099331#comment-16099331 ] John Brock commented on SPARK-21526: Related to SPARK-21386. > Add support to ML LogisticRegression for setting initial model > -- > > Key: SPARK-21526 > URL: https://issues.apache.org/jira/browse/SPARK-21526 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.1.1 >Reporter: John Brock > > Make it possible to set the initial model when training a logistic regression > model. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-21526) Add support to ML LogisticRegression for setting initial model
John Brock created SPARK-21526: -- Summary: Add support to ML LogisticRegression for setting initial model Key: SPARK-21526 URL: https://issues.apache.org/jira/browse/SPARK-21526 Project: Spark Issue Type: Improvement Components: ML Affects Versions: 2.1.1 Reporter: John Brock Make it possible to set the initial model when training a logistic regression model. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
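To make the request concrete, a sketch of the kind of API being asked for is below. {{setInitialModel}} does not exist in Spark ML at the time of this ticket; it is shown purely as a hypothetical signature for warm-starting training from a previously fitted model.
{code:scala}
import org.apache.spark.ml.classification.{LogisticRegression, LogisticRegressionModel}
import org.apache.spark.sql.DataFrame

// Hypothetical usage sketch for this feature request. setInitialModel is NOT
// an existing Spark ML method; it illustrates the API being proposed.
object WarmStartSketch {
  def retrain(previous: LogisticRegressionModel, newData: DataFrame): LogisticRegressionModel = {
    new LogisticRegression()
      .setMaxIter(50)
      .setInitialModel(previous) // hypothetical setter requested by this ticket
      .fit(newData)
  }
}
{code}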
[jira] [Commented] (SPARK-5480) GraphX pageRank: java.lang.ArrayIndexOutOfBoundsException:
[ https://issues.apache.org/jira/browse/SPARK-5480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15292306#comment-15292306 ] John Brock commented on SPARK-5480: --- I'm seeing the same issue in Spark v1.6.1. > GraphX pageRank: java.lang.ArrayIndexOutOfBoundsException: > --- > > Key: SPARK-5480 > URL: https://issues.apache.org/jira/browse/SPARK-5480 > Project: Spark > Issue Type: Bug > Components: GraphX >Affects Versions: 1.2.0, 1.3.1 > Environment: Yarn client >Reporter: Stephane Maarek > > Running the following code: > val subgraph = graph.subgraph ( > vpred = (id,article) => //working predicate) > ).cache() > println( s"Subgraph contains ${subgraph.vertices.count} nodes and > ${subgraph.edges.count} edges") > val prGraph = subgraph.staticPageRank(5).cache > val titleAndPrGraph = subgraph.outerJoinVertices(prGraph.vertices) { > (v, title, rank) => (rank.getOrElse(0.0), title) > } > titleAndPrGraph.vertices.top(13) { > Ordering.by((entry: (VertexId, (Double, _))) => entry._2._1) > }.foreach(t => println(t._2._2._1 + ": " + t._2._1 + ", id:" + t._1)) > Returns a graph with 5000 nodes and 4000 edges. > Then it crashes during the PageRank with the following: > 15/01/29 05:51:07 INFO scheduler.TaskSetManager: Starting task 125.0 in stage > 39.0 (TID 1808, *HIDDEN, PROCESS_LOCAL, 2059 bytes) > 15/01/29 05:51:07 WARN scheduler.TaskSetManager: Lost task 107.0 in stage > 39.0 (TID 1794, *HIDDEN): java.lang.ArrayIndexOutOfBoundsException: -1 > at > org.apache.spark.graphx.util.collection.GraphXPrimitiveKeyOpenHashMap$mcJI$sp.apply$mcJI$sp(GraphXPrimitiveKeyOpenHashMap.scala:64) > at > org.apache.spark.graphx.impl.EdgePartition.updateVertices(EdgePartition.scala:91) > at > org.apache.spark.graphx.impl.ReplicatedVertexView$$anonfun$2$$anonfun$apply$1.apply(ReplicatedVertexView.scala:75) > at > org.apache.spark.graphx.impl.ReplicatedVertexView$$anonfun$2$$anonfun$apply$1.apply(ReplicatedVertexView.scala:73) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) > at > org.apache.spark.graphx.impl.EdgeRDDImpl$$anonfun$mapEdgePartitions$1.apply(EdgeRDDImpl.scala:110) > at > org.apache.spark.graphx.impl.EdgeRDDImpl$$anonfun$mapEdgePartitions$1.apply(EdgeRDDImpl.scala:108) > at org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:601) > at org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:601) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263) > at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:61) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:228) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:230) > at > org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:88) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:230) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:230) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:230) > at > org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:88) 
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:230) > at > org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:88) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263) > at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:61) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:228) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:230) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) > at org.apache.spark.scheduler.Task.run(Task.scala:56) > at > org.apache.spark.executor.Executor$TaskRunn
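For readers unfamiliar with the GraphX calls in the report, a small self-contained example of the same pattern (subgraph, then staticPageRank, then joining the ranks back onto the vertex attributes) is sketched below. The toy graph and the vertex predicate are stand-ins for the reporter's data and "working predicate".
{code:scala}
import org.apache.spark.SparkContext
import org.apache.spark.graphx.{Edge, Graph, VertexId}

// Toy version of the pattern from the report: filter a graph, run a fixed
// number of PageRank iterations, then join the ranks back onto the titles.
object PageRankSketch {
  def run(sc: SparkContext): Unit = {
    val vertices = sc.parallelize(Seq[(VertexId, String)](
      (1L, "article-1"), (2L, "article-2"), (3L, "article-3")))
    val edges = sc.parallelize(Seq(Edge(1L, 2L, 1), Edge(2L, 3L, 1), Edge(3L, 1L, 1)))
    val graph = Graph(vertices, edges)

    // Stand-in for the reporter's "working predicate".
    val subgraph = graph.subgraph(vpred = (id, title) => title.nonEmpty).cache()

    val prGraph = subgraph.staticPageRank(5).cache()
    val titleAndPrGraph = subgraph.outerJoinVertices(prGraph.vertices) {
      (v, title, rank) => (rank.getOrElse(0.0), title)
    }
    titleAndPrGraph.vertices.top(3) {
      Ordering.by((entry: (VertexId, (Double, String))) => entry._2._1)
    }.foreach(t => println(t._2._2 + ": " + t._2._1 + ", id: " + t._1))
  }
}
{code}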