[jira] [Commented] (SPARK-2873) OOM happens when group by and join operation with big data
[ https://issues.apache.org/jira/browse/SPARK-2873?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14101915#comment-14101915 ] Apache Spark commented on SPARK-2873: - User 'guowei2' has created a pull request for this issue: https://github.com/apache/spark/pull/2029 OOM happens when group by and join operation with big data --- Key: SPARK-2873 URL: https://issues.apache.org/jira/browse/SPARK-2873 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.0.0 Reporter: guowei -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-3119) Re-implement TorrentBroadcast
Reynold Xin created SPARK-3119: -- Summary: Re-implement TorrentBroadcast Key: SPARK-3119 URL: https://issues.apache.org/jira/browse/SPARK-3119 Project: Spark Issue Type: Improvement Reporter: Reynold Xin Assignee: Reynold Xin
[jira] [Updated] (SPARK-3119) Re-implement TorrentBroadcast
[ https://issues.apache.org/jira/browse/SPARK-3119?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-3119: --- Description: TorrentBroadcast is unnecessarily complicated: 1. It tracks a lot of mutable states, such as total number of bytes, number of blocks fetched. 2. It has at least two data structures that are not needed: TorrentInfo and TorrentBlock. 3. It uses getSingle on executors to get the block instead of getLocal, resulting in an extra roundtrip to look up the location of the block when the block doesn't exist yet. 4. It has a metadata block that is completely unnecessary. 5. It does an extra memory copy during deserialization to copy all the blocks into a single giant array. Re-implement TorrentBroadcast - Key: SPARK-3119 URL: https://issues.apache.org/jira/browse/SPARK-3119 Project: Spark Issue Type: Improvement Reporter: Reynold Xin Assignee: Reynold Xin TorrentBroadcast is unnecessarily complicated: 1. It tracks a lot of mutable states, such as total number of bytes, number of blocks fetched. 2. It has at least two data structures that are not needed: TorrentInfo and TorrentBlock. 3. It uses getSingle on executors to get the block instead of getLocal, resulting in an extra roundtrip to look up the location of the block when the block doesn't exist yet. 4. It has a metadata block that is completely unnecessary. 5. It does an extra memory copy during deserialization to copy all the blocks into a single giant array.
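[Editor's note] Point 5 above is the kind of copy that can be avoided by deserializing straight out of the fetched blocks. A hedged Python sketch of that idea (illustrative only, not Spark's actual code): wrap the block list in a stream that reads across block boundaries, so no merged intermediate array is ever built.

```python
import io

class ChainedRaw(io.RawIOBase):
    """Raw stream that reads across a list of byte blocks in order."""
    def __init__(self, blocks):
        self._iter = iter(blocks)
        self._cur = io.BytesIO(next(self._iter, b""))

    def readable(self):
        return True

    def readinto(self, b):
        while True:
            n = self._cur.readinto(b)
            if n:
                return n
            nxt = next(self._iter, None)
            if nxt is None:
                return 0  # EOF: all blocks consumed
            self._cur = io.BytesIO(nxt)

# A deserializer can then consume the blocks through one stream,
# without first copying them into a single giant array:
stream = io.BufferedReader(ChainedRaw([b"hello ", b"wor", b"ld"]))
data = stream.read()  # b"hello world"
```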
[jira] [Updated] (SPARK-3119) Re-implement TorrentBroadcast
[ https://issues.apache.org/jira/browse/SPARK-3119?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-3119: --- Component/s: Spark Core Re-implement TorrentBroadcast - Key: SPARK-3119 URL: https://issues.apache.org/jira/browse/SPARK-3119 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Reynold Xin Assignee: Reynold Xin TorrentBroadcast is unnecessarily complicated: 1. It tracks a lot of mutable states, such as total number of bytes, number of blocks fetched. 2. It has at least two data structures that are not needed: TorrentInfo and TorrentBlock. 3. It uses getSingle on executors to get the block instead of getLocal, resulting in an extra roundtrip to look up the location of the block when the block doesn't exist yet. 4. It has a metadata block that is completely unnecessary. 5. It does an extra memory copy during deserialization to copy all the blocks into a single giant array.
[jira] [Commented] (SPARK-3119) Re-implement TorrentBroadcast
[ https://issues.apache.org/jira/browse/SPARK-3119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14101931#comment-14101931 ] Apache Spark commented on SPARK-3119: - User 'rxin' has created a pull request for this issue: https://github.com/apache/spark/pull/2030 Re-implement TorrentBroadcast - Key: SPARK-3119 URL: https://issues.apache.org/jira/browse/SPARK-3119 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Reynold Xin Assignee: Reynold Xin TorrentBroadcast is unnecessarily complicated: 1. It tracks a lot of mutable states, such as total number of bytes, number of blocks fetched. 2. It has at least two data structures that are not needed: TorrentInfo and TorrentBlock. 3. It uses getSingle on executors to get the block instead of getLocal, resulting in an extra roundtrip to look up the location of the block when the block doesn't exist yet. 4. It has a metadata block that is completely unnecessary. 5. It does an extra memory copy during deserialization to copy all the blocks into a single giant array.
[jira] [Created] (SPARK-3120) Local Dirs is not useful in yarn-client mode
hzw created SPARK-3120: -- Summary: Local Dirs is not useful in yarn-client mode Key: SPARK-3120 URL: https://issues.apache.org/jira/browse/SPARK-3120 Project: Spark Issue Type: Bug Components: Spark Core, YARN Affects Versions: 1.0.2 Environment: Spark 1.0.2 Yarn 2.3.0 Reporter: hzw I was using Spark 1.0.2 and Hadoop 2.3.0 to run a Spark application on YARN. I expected to set spark.local.dir to spread the shuffle files across many disks, so I exported LOCAL_DIRS in spark-env.sh. But it failed to create the local dirs in my specified path; they just went to /tmp/hadoop-root/nm-local-dir/usercache/root/appcache/, the Hadoop default path. To reproduce this: 1. Do not set "yarn.nodemanager.local-dirs" in yarn-site.xml, which influences the result. 2. Run a job and then look in the executor log for the line INFO DiskBlockManager: Created local directory at .. In addition, I tried exporting LOCAL_DIRS in yarn-env.sh. That passes the LOCAL_DIRS value to the ExecutorLauncher, but it is still overwritten by YARN when launching the executor container.
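[Editor's note] On YARN, executors write shuffle files to the local directories that YARN hands them, which is why exporting LOCAL_DIRS on the Spark side has no effect here; the property the report mentions is the one to set. A minimal yarn-site.xml fragment (the /disk1 and /disk2 paths are illustrative placeholders):

```xml
<!-- yarn.nodemanager.local-dirs controls where NodeManagers (and
     hence Spark executors running on YARN) create local/shuffle
     directories. -->
<property>
  <name>yarn.nodemanager.local-dirs</name>
  <value>/disk1/nm-local,/disk2/nm-local</value>
</property>
```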
[jira] [Comment Edited] (SPARK-3098) In some cases, operation groupByKey get a wrong results
[ https://issues.apache.org/jira/browse/SPARK-3098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14102012#comment-14102012 ] Guoqiang Li edited comment on SPARK-3098 at 8/19/14 8:22 AM: - I found the error id is continuous. Seems there is something wrong with the {{zipWithIndex}}. was (Author: gq): I found the error id is continuous. Seems there is something wrong with the{{zipWithIndex}}. In some cases, operation groupByKey get a wrong results Key: SPARK-3098 URL: https://issues.apache.org/jira/browse/SPARK-3098 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.1 Reporter: Guoqiang Li Priority: Critical I do not know how to reproduce the bug. This is the case: when I was operating on 10 billion records with groupByKey, the results were wrong: {noformat} (4696501, 370568) (4696501, 376672) (4696501, 374880) . (4696502, 350264) (4696502, 358458) (4696502, 398502) .. {noformat} => {noformat} (4696501,ArrayBuffer(350264, 358458, 398502 )), (4696502,ArrayBuffer(376621, ..)) {noformat} code: {code} val dealOuts = clickPreferences(sc, dealOutPath, periodTime) val dealOrders = orderPreferences(sc, dealOrderPath, periodTime) val favorites = favoritePreferences(sc, favoritePath, periodTime) val allBehaviors = (dealOrders ++ favorites ++ dealOuts) val preferences = allBehaviors.groupByKey().map { ... } {code} spark-defaults.conf: {code} spark.default.parallelism 280 {code}
[jira] [Commented] (SPARK-3098) In some cases, operation groupByKey get a wrong results
[ https://issues.apache.org/jira/browse/SPARK-3098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14102012#comment-14102012 ] Guoqiang Li commented on SPARK-3098: I found the error id is continuous. Seems there is something wrong with the {{zipWithIndex}}. In some cases, operation groupByKey get a wrong results Key: SPARK-3098 URL: https://issues.apache.org/jira/browse/SPARK-3098 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.1 Reporter: Guoqiang Li Priority: Critical I do not know how to reproduce the bug. This is the case: when I was operating on 10 billion records with groupByKey, the results were wrong: {noformat} (4696501, 370568) (4696501, 376672) (4696501, 374880) . (4696502, 350264) (4696502, 358458) (4696502, 398502) .. {noformat} => {noformat} (4696501,ArrayBuffer(350264, 358458, 398502 )), (4696502,ArrayBuffer(376621, ..)) {noformat} code: {code} val dealOuts = clickPreferences(sc, dealOutPath, periodTime) val dealOrders = orderPreferences(sc, dealOrderPath, periodTime) val favorites = favoritePreferences(sc, favoritePath, periodTime) val allBehaviors = (dealOrders ++ favorites ++ dealOuts) val preferences = allBehaviors.groupByKey().map { ... } {code} spark-defaults.conf: {code} spark.default.parallelism 280 {code}
[jira] [Commented] (SPARK-3098) In some cases, operation groupByKey get a wrong results
[ https://issues.apache.org/jira/browse/SPARK-3098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14102015#comment-14102015 ] Sean Owen commented on SPARK-3098: -- zipWithIndex returns an RDD[(T,Long)]. It does not create a continuous value, and doesn't add a value in the first position in the tuples such that it may be used as a key. Your IDs do not seem to be floating-point values here. What does your comment mean? In some cases, operation groupByKey get a wrong results Key: SPARK-3098 URL: https://issues.apache.org/jira/browse/SPARK-3098 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.1 Reporter: Guoqiang Li Priority: Critical I do not know how to reproduce the bug. This is the case: when I was operating on 10 billion records with groupByKey, the results were wrong: {noformat} (4696501, 370568) (4696501, 376672) (4696501, 374880) . (4696502, 350264) (4696502, 358458) (4696502, 398502) .. {noformat} => {noformat} (4696501,ArrayBuffer(350264, 358458, 398502 )), (4696502,ArrayBuffer(376621, ..)) {noformat} code: {code} val dealOuts = clickPreferences(sc, dealOutPath, periodTime) val dealOrders = orderPreferences(sc, dealOrderPath, periodTime) val favorites = favoritePreferences(sc, favoritePath, periodTime) val allBehaviors = (dealOrders ++ favorites ++ dealOuts) val preferences = allBehaviors.groupByKey().map { ... } {code} spark-defaults.conf: {code} spark.default.parallelism 280 {code}
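[Editor's note] A plain-Python analogue of the contract Sean describes (element first, 0-based index second; a sketch of the semantics, not a Spark API):

```python
# Analogue of RDD.zipWithIndex: each element is paired with its
# 0-based position, element first and index second -- the index is
# appended to the tuple, not placed in the key slot.
def zip_with_index(elems):
    return [(e, i) for i, e in enumerate(elems)]

pairs = zip_with_index(["a", "b", "c"])
# pairs == [("a", 0), ("b", 1), ("c", 2)]
```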
[jira] [Commented] (SPARK-3098) In some cases, operation groupByKey get a wrong results
[ https://issues.apache.org/jira/browse/SPARK-3098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14102024#comment-14102024 ] Guoqiang Li commented on SPARK-3098: the (id, value) pairs are generated there zipWithIndex. Reproduce the code: {code} val c = sc.parallelize(1 to 7899).flatMap { i => (1 to 10000).toSeq.map(p => i * 6000 + p) }.distinct().zipWithIndex() val e = c.map(t => (t._1, t._2.toString)) val d = c.filter(t => t._1 % 100 > 5) e.join(d).filter(t => t._2._1 != t._2._2.toString).take(3) {code} In some cases, operation groupByKey get a wrong results Key: SPARK-3098 URL: https://issues.apache.org/jira/browse/SPARK-3098 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.1 Reporter: Guoqiang Li Priority: Critical I do not know how to reproduce the bug. This is the case: when I was operating on 10 billion records with groupByKey, the results were wrong: {noformat} (4696501, 370568) (4696501, 376672) (4696501, 374880) . (4696502, 350264) (4696502, 358458) (4696502, 398502) .. {noformat} => {noformat} (4696501,ArrayBuffer(350264, 358458, 398502 )), (4696502,ArrayBuffer(376621, ..)) {noformat} code: {code} val dealOuts = clickPreferences(sc, dealOutPath, periodTime) val dealOrders = orderPreferences(sc, dealOrderPath, periodTime) val favorites = favoritePreferences(sc, favoritePath, periodTime) val allBehaviors = (dealOrders ++ favorites ++ dealOuts) val preferences = allBehaviors.groupByKey().map { ... } {code} spark-defaults.conf: {code} spark.default.parallelism 280 {code}
[jira] [Updated] (SPARK-3098) In some cases, operation zipWithIndex get a wrong results
[ https://issues.apache.org/jira/browse/SPARK-3098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Guoqiang Li updated SPARK-3098: --- Summary: In some cases, operation zipWithIndex get a wrong results (was: In some cases, operation groupByKey get a wrong results) In some cases, operation zipWithIndex get a wrong results -- Key: SPARK-3098 URL: https://issues.apache.org/jira/browse/SPARK-3098 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.1 Reporter: Guoqiang Li Priority: Critical I do not know how to reproduce the bug. This is the case: when I was operating on 10 billion records with groupByKey, the results were wrong: {noformat} (4696501, 370568) (4696501, 376672) (4696501, 374880) . (4696502, 350264) (4696502, 358458) (4696502, 398502) .. {noformat} => {noformat} (4696501,ArrayBuffer(350264, 358458, 398502 )), (4696502,ArrayBuffer(376621, ..)) {noformat} code: {code} val dealOuts = clickPreferences(sc, dealOutPath, periodTime) val dealOrders = orderPreferences(sc, dealOrderPath, periodTime) val favorites = favoritePreferences(sc, favoritePath, periodTime) val allBehaviors = (dealOrders ++ favorites ++ dealOuts) val preferences = allBehaviors.groupByKey().map { ... } {code} spark-defaults.conf: {code} spark.default.parallelism 280 {code}
[jira] [Commented] (SPARK-3037) Add ArrayType containing null value support to Parquet.
[ https://issues.apache.org/jira/browse/SPARK-3037?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14102034#comment-14102034 ] Apache Spark commented on SPARK-3037: - User 'ueshin' has created a pull request for this issue: https://github.com/apache/spark/pull/2032 Add ArrayType containing null value support to Parquet. --- Key: SPARK-3037 URL: https://issues.apache.org/jira/browse/SPARK-3037 Project: Spark Issue Type: Bug Components: SQL Reporter: Takuya Ueshin Priority: Blocker Parquet support should handle {{ArrayType}} when {{containsNull}} is {{true}}. When {{containsNull}} is {{true}}, the schema should be as follows: {noformat} message root { optional group a (LIST) { repeated group bag { optional int32 array_element; } } } {noformat} FYI: Hive's Parquet writer *always* uses this schema, and its reader can read only from this schema, i.e. the current Parquet support of Spark SQL is not compatible with Hive. NOTICE: If Hive compatibility is top priority, we also have to use this schema regardless of {{containsNull}}, which will break backward compatibility. But using this schema could affect performance.
[jira] [Commented] (SPARK-3036) Add MapType containing null value support to Parquet.
[ https://issues.apache.org/jira/browse/SPARK-3036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14102033#comment-14102033 ] Apache Spark commented on SPARK-3036: - User 'ueshin' has created a pull request for this issue: https://github.com/apache/spark/pull/2032 Add MapType containing null value support to Parquet. - Key: SPARK-3036 URL: https://issues.apache.org/jira/browse/SPARK-3036 Project: Spark Issue Type: Bug Components: SQL Reporter: Takuya Ueshin Priority: Blocker The current Parquet schema for {{MapType}} is as follows regardless of {{valueContainsNull}}: {noformat} message root { optional group a (MAP) { repeated group map (MAP_KEY_VALUE) { required int32 key; required int32 value; } } } {noformat} and if the map contains a {{null}} value, it throws a runtime exception. To handle {{MapType}} containing a {{null}} value, the schema should be as follows if {{valueContainsNull}} is {{true}}: {noformat} message root { optional group a (MAP) { repeated group map (MAP_KEY_VALUE) { required int32 key; optional int32 value; } } } {noformat} FYI: Hive's Parquet writer *always* uses the latter schema, but its reader can read from both schemas. NOTICE: This change will break backward compatibility when the schema is read from Parquet metadata ({{org.apache.spark.sql.parquet.row.metadata}}).
[jira] [Comment Edited] (SPARK-3098) In some cases, operation zipWithIndex get a wrong results
[ https://issues.apache.org/jira/browse/SPARK-3098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14102024#comment-14102024 ] Guoqiang Li edited comment on SPARK-3098 at 8/19/14 8:55 AM: - the (id, value) pairs are generated there zipWithIndex. The reproduce code: {code} val c = sc.parallelize(1 to 7899).flatMap { i => (1 to 10000).toSeq.map(p => i * 6000 + p) }.distinct().zipWithIndex() val e = c.map(t => (t._1, t._2.toString)) val d = c.filter(t => t._1 % 100 > 5) e.join(d).filter(t => t._2._1 != t._2._2.toString).take(3) {code} was (Author: gq): the (id, value) pairs are generated there zipWithIndex. Reproduce the code: {code} val c = sc.parallelize(1 to 7899).flatMap { i => (1 to 10000).toSeq.map(p => i * 6000 + p) }.distinct().zipWithIndex() val e = c.map(t => (t._1, t._2.toString)) val d = c.filter(t => t._1 % 100 > 5) e.join(d).filter(t => t._2._1 != t._2._2.toString).take(3) {code} In some cases, operation zipWithIndex get a wrong results -- Key: SPARK-3098 URL: https://issues.apache.org/jira/browse/SPARK-3098 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.1 Reporter: Guoqiang Li Priority: Critical I do not know how to reproduce the bug. This is the case: when I was operating on 10 billion records with groupByKey, the results were wrong: {noformat} (4696501, 370568) (4696501, 376672) (4696501, 374880) . (4696502, 350264) (4696502, 358458) (4696502, 398502) .. {noformat} => {noformat} (4696501,ArrayBuffer(350264, 358458, 398502 )), (4696502,ArrayBuffer(376621, ..)) {noformat} code: {code} val dealOuts = clickPreferences(sc, dealOutPath, periodTime) val dealOrders = orderPreferences(sc, dealOrderPath, periodTime) val favorites = favoritePreferences(sc, favoritePath, periodTime) val allBehaviors = (dealOrders ++ favorites ++ dealOuts) val preferences = allBehaviors.groupByKey().map { ... } {code} spark-defaults.conf: {code} spark.default.parallelism 280 {code}
[jira] [Comment Edited] (SPARK-3098) In some cases, operation zipWithIndex get a wrong results
[ https://issues.apache.org/jira/browse/SPARK-3098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14102024#comment-14102024 ] Guoqiang Li edited comment on SPARK-3098 at 8/19/14 8:58 AM: - the (id, value) pairs are generated by zipWithIndex. The reproduce code: {code} val c = sc.parallelize(1 to 7899).flatMap { i => (1 to 10000).toSeq.map(p => i * 6000 + p) }.distinct().zipWithIndex() val e = c.map(t => (t._1, t._2.toString)) val d = c.filter(t => t._1 % 100 > 5) e.join(d).filter(t => t._2._1 != t._2._2.toString).take(3) {code} was (Author: gq): the (id, value) pairs are generated there zipWithIndex. The reproduce code: {code} val c = sc.parallelize(1 to 7899).flatMap { i => (1 to 10000).toSeq.map(p => i * 6000 + p) }.distinct().zipWithIndex() val e = c.map(t => (t._1, t._2.toString)) val d = c.filter(t => t._1 % 100 > 5) e.join(d).filter(t => t._2._1 != t._2._2.toString).take(3) {code} In some cases, operation zipWithIndex get a wrong results -- Key: SPARK-3098 URL: https://issues.apache.org/jira/browse/SPARK-3098 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.1 Reporter: Guoqiang Li Priority: Critical I do not know how to reproduce the bug. This is the case: when I was operating on 10 billion records with groupByKey, the results were wrong: {noformat} (4696501, 370568) (4696501, 376672) (4696501, 374880) . (4696502, 350264) (4696502, 358458) (4696502, 398502) .. {noformat} => {noformat} (4696501,ArrayBuffer(350264, 358458, 398502 )), (4696502,ArrayBuffer(376621, ..)) {noformat} code: {code} val dealOuts = clickPreferences(sc, dealOutPath, periodTime) val dealOrders = orderPreferences(sc, dealOrderPath, periodTime) val favorites = favoritePreferences(sc, favoritePath, periodTime) val allBehaviors = (dealOrders ++ favorites ++ dealOuts) val preferences = allBehaviors.groupByKey().map { ... } {code} spark-defaults.conf: {code} spark.default.parallelism 280 {code}
[jira] [Updated] (SPARK-2964) Fix wrong option (-S, --silent), and improve spark-sql and start-thriftserver to leverage bin/util.sh
[ https://issues.apache.org/jira/browse/SPARK-2964?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kousuke Saruta updated SPARK-2964: -- Summary: Fix wrong option (-S, --silent), and improve spark-sql and start-thriftserver to leverage bin/util.sh (was: Improve spark-sql and start-thriftserver to leverage bin/util.sh) Fix wrong option (-S, --silent), and improve spark-sql and start-thriftserver to leverage bin/util.sh - Key: SPARK-2964 URL: https://issues.apache.org/jira/browse/SPARK-2964 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.1.0 Reporter: Kousuke Saruta Priority: Minor Now, we have bin/utils.sh, so let's improve spark-sql and start-thriftserver.sh to leverage utils.sh.
[jira] [Updated] (SPARK-2964) Fix wrong option (-S, --silent), and improve spark-sql and start-thriftserver to leverage bin/util.sh
[ https://issues.apache.org/jira/browse/SPARK-2964?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kousuke Saruta updated SPARK-2964: -- Description: The spark-sql script expects a -s option, but that's wrong; it's a typo for -S (capital S). We need to fix that. And now that we have bin/utils.sh, let's improve spark-sql and start-thriftserver.sh to leverage utils.sh. was:Now, we have bin/utils.sh, so let's improve spark-sql and start-thriftserver.sh to leverage utils.sh. Fix wrong option (-S, --silent), and improve spark-sql and start-thriftserver to leverage bin/util.sh - Key: SPARK-2964 URL: https://issues.apache.org/jira/browse/SPARK-2964 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.1.0 Reporter: Kousuke Saruta Priority: Minor The spark-sql script expects a -s option, but that's wrong; it's a typo for -S (capital S). We need to fix that. And now that we have bin/utils.sh, let's improve spark-sql and start-thriftserver.sh to leverage utils.sh.
[jira] [Commented] (SPARK-3098) In some cases, operation zipWithIndex get a wrong results
[ https://issues.apache.org/jira/browse/SPARK-3098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14102044#comment-14102044 ] Sean Owen commented on SPARK-3098: -- It would be helpful if you would explain what you are trying to reproduce here; this is just code, and there's not a continuous value here, for example. It looks like you're producing overlapping sequences of numbers like 1..1, 6001..16000, ... and then flattening and removing duplicates, to get the range 1..47404000. That's zipped with its index to get these as (n,n-1) pairs. Then you map the second element to a String, and do the same to a subset of the data, join them, and see if there are any mismatches, because there shouldn't be. All the keys and values are unique. But why would this demonstrate something about zipWithIndex more directly than a test of the RDD c? More importantly, I ran this locally and got an empty Array, as expected. In some cases, operation zipWithIndex get a wrong results -- Key: SPARK-3098 URL: https://issues.apache.org/jira/browse/SPARK-3098 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.1 Reporter: Guoqiang Li Priority: Critical I do not know how to reproduce the bug. This is the case: when I was operating on 10 billion records with groupByKey, the results were wrong: {noformat} (4696501, 370568) (4696501, 376672) (4696501, 374880) . (4696502, 350264) (4696502, 358458) (4696502, 398502) .. {noformat} => {noformat} (4696501,ArrayBuffer(350264, 358458, 398502 )), (4696502,ArrayBuffer(376621, ..)) {noformat} code: {code} val dealOuts = clickPreferences(sc, dealOutPath, periodTime) val dealOrders = orderPreferences(sc, dealOrderPath, periodTime) val favorites = favoritePreferences(sc, favoritePath, periodTime) val allBehaviors = (dealOrders ++ favorites ++ dealOuts) val preferences = allBehaviors.groupByKey().map { ... } {code} spark-defaults.conf: {code} spark.default.parallelism 280 {code}
[jira] [Comment Edited] (SPARK-3098) In some cases, operation zipWithIndex get a wrong results
[ https://issues.apache.org/jira/browse/SPARK-3098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14102064#comment-14102064 ] Guoqiang Li edited comment on SPARK-3098 at 8/19/14 9:41 AM: - Our cluster runs on YARN. You can try the following code in cluster mode: {code} val c = sc.parallelize(1 to 7899).flatMap { i => (1 to 10000).toSeq.map(p => i * 6000 + p) }.distinct().zipWithIndex() c.join(c).filter(t => t._2._1 != t._2._2).take(3) {code} => {code} Array[(Int, (Long, Long))] = Array((1732608,(11,12)), (45515264,(12,13)), (36579712,(13,14))) {code} was (Author: gq): Our cluster runs on YARN. You can try the following code in cluster mode: {code} val c = sc.parallelize(1 to 7899).flatMap { i => (1 to 10000).toSeq.map(p => i * 6000 + p) }.distinct().zipWithIndex() val e = c.map(t => (t._1, t._2)) e.join(e).filter(t => t._2._1 != t._2._2).take(3) {code} => {code} Array[(Int, (Long, Long))] = Array((1732608,(11,12)), (45515264,(12,13)), (36579712,(13,14))) {code} In some cases, operation zipWithIndex get a wrong results -- Key: SPARK-3098 URL: https://issues.apache.org/jira/browse/SPARK-3098 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.1 Reporter: Guoqiang Li Priority: Critical I do not know how to reproduce the bug. This is the case: when I was operating on 10 billion records with groupByKey, the results were wrong: {noformat} (4696501, 370568) (4696501, 376672) (4696501, 374880) . (4696502, 350264) (4696502, 358458) (4696502, 398502) .. {noformat} => {noformat} (4696501,ArrayBuffer(350264, 358458, 398502 )), (4696502,ArrayBuffer(376621, ..)) {noformat} code: {code} val dealOuts = clickPreferences(sc, dealOutPath, periodTime) val dealOrders = orderPreferences(sc, dealOrderPath, periodTime) val favorites = favoritePreferences(sc, favoritePath, periodTime) val allBehaviors = (dealOrders ++ favorites ++ dealOuts) val preferences = allBehaviors.groupByKey().map { ... } {code} spark-defaults.conf: {code} spark.default.parallelism 280 {code}
[jira] [Created] (SPARK-3121) Wrong implementation of implicit bytesWritableConverter
Jakub Dubovsky created SPARK-3121: - Summary: Wrong implementation of implicit bytesWritableConverter Key: SPARK-3121 URL: https://issues.apache.org/jira/browse/SPARK-3121 Project: Spark Issue Type: Bug Affects Versions: 1.0.2 Reporter: Jakub Dubovsky Priority: Minor val path = ... //path to seq file with BytesWritable as type of both key and value val file = sc.sequenceFile[Array[Byte],Array[Byte]](path) file.take(1)(0)._1 This prints incorrect content of the byte array. The actual content starts out correct, but some random bytes and zeros are appended. BytesWritable has two methods: getBytes() - returns the content of the whole internal array, which is often longer than the actual value stored. It usually contains the rest of previous, longer values. copyBytes() - returns just the beginning of the internal array, determined by the internal length property. It looks like the implicit conversion between BytesWritable and Array[Byte] uses getBytes instead of the correct copyBytes.
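[Editor's note] A Python sketch of the pitfall described above ({{BytesWritableLike}} is a hypothetical stand-in, not Hadoop's real Java class): the backing array only grows, so after a shorter value is stored, returning the raw array leaks stale bytes from the previous, longer value.

```python
class BytesWritableLike:
    """Toy model of Hadoop BytesWritable's buffer-reuse behavior."""
    def __init__(self):
        self._buf = bytearray()
        self._len = 0

    def set(self, data: bytes):
        if len(data) > len(self._buf):      # grow only when needed
            self._buf = bytearray(len(data))
        self._buf[:len(data)] = data
        self._len = len(data)               # old tail bytes remain in _buf

    def get_bytes(self) -> bytes:
        return bytes(self._buf)             # whole backing array (wrong here)

    def copy_bytes(self) -> bytes:
        return bytes(self._buf[:self._len]) # just the valid prefix (correct)

w = BytesWritableLike()
w.set(b"longer-value")
w.set(b"short")
# get_bytes()  -> b"shortr-value"  (correct start, stale tail appended)
# copy_bytes() -> b"short"
```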
[jira] [Commented] (SPARK-3098) In some cases, operation zipWithIndex get a wrong results
[ https://issues.apache.org/jira/browse/SPARK-3098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14102101#comment-14102101 ] Guoqiang Li commented on SPARK-3098: Seems {{zipWithUniqueId}} also has this issue. {code} val c = sc.parallelize(1 to 7899).flatMap { i => (1 to 10000).toSeq.map(p => i * 6000 + p) }.distinct().zipWithUniqueId() c.join(c).filter(t => t._2._1 != t._2._2).take(3) {code} In some cases, operation zipWithIndex get a wrong results -- Key: SPARK-3098 URL: https://issues.apache.org/jira/browse/SPARK-3098 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.1 Reporter: Guoqiang Li Priority: Critical I do not know how to reproduce the bug. This is the case: when I was operating on 10 billion records with groupByKey, the results were wrong: {noformat} (4696501, 370568) (4696501, 376672) (4696501, 374880) . (4696502, 350264) (4696502, 358458) (4696502, 398502) .. {noformat} => {noformat} (4696501,ArrayBuffer(350264, 358458, 398502 )), (4696502,ArrayBuffer(376621, ..)) {noformat} code: {code} val dealOuts = clickPreferences(sc, dealOutPath, periodTime) val dealOrders = orderPreferences(sc, dealOrderPath, periodTime) val favorites = favoritePreferences(sc, favoritePath, periodTime) val allBehaviors = (dealOrders ++ favorites ++ dealOuts) val preferences = allBehaviors.groupByKey().map { ... } {code} spark-defaults.conf: {code} spark.default.parallelism 280 {code}
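[Editor's note] For contrast with {{zipWithIndex}}: Spark documents {{zipWithUniqueId}} as giving items in the k-th of n partitions the ids k, n+k, 2n+k, ... (unique but not consecutive). A plain-Python model of that scheme, not the real API:

```python
# Model of RDD.zipWithUniqueId: with n partitions, the j-th item of
# partition i gets id i + j*n -- ids are unique but have gaps, which
# is why "continuous" ids are not expected from this operation.
def zip_with_unique_id(partitions):
    n = len(partitions)
    return [(x, i + j * n)
            for i, part in enumerate(partitions)
            for j, x in enumerate(part)]

ids = dict(zip_with_unique_id([["a", "b"], ["c"], ["d", "e"]]))
# ids == {"a": 0, "b": 3, "c": 1, "d": 2, "e": 5}
```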
[jira] [Updated] (SPARK-3106) *Race Condition Issue* Fix the order of resources in Connection
[ https://issues.apache.org/jira/browse/SPARK-3106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kousuke Saruta updated SPARK-3106: -- Summary: *Race Condition Issue* Fix the order of resources in Connection (was: Suppress unwilling Exception and error messages caused by SendingConnection) *Race Condition Issue* Fix the order of resources in Connection --- Key: SPARK-3106 URL: https://issues.apache.org/jira/browse/SPARK-3106 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.1.0 Reporter: Kousuke Saruta Currently, when we run a Spark application, error messages appear in the driver's log, including the following: * messages caused by ClosedChannelException * messages caused by CancelledKeyException * Corresponding SendingConnectionManagerId not found These are mainly caused by the read behavior of SendingConnection. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3106) *Race Condition Issue* Fix the order of resources in Connection
[ https://issues.apache.org/jira/browse/SPARK-3106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kousuke Saruta updated SPARK-3106: -- Description: Currently, when we run a Spark application, error messages appear in the driver's log, including the following: * messages caused by ClosedChannelException * messages caused by CancelledKeyException * Corresponding SendingConnectionManagerId not found These are mainly caused by the race condition issue in was: Currently, when we run a Spark application, error messages appear in the driver's log, including the following: * messages caused by ClosedChannelException * messages caused by CancelledKeyException * Corresponding SendingConnectionManagerId not found These are mainly caused by the read behavior of SendingConnection. *Race Condition Issue* Fix the order of resources in Connection --- Key: SPARK-3106 URL: https://issues.apache.org/jira/browse/SPARK-3106 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.1.0 Reporter: Kousuke Saruta Currently, when we run a Spark application, error messages appear in the driver's log, including the following: * messages caused by ClosedChannelException * messages caused by CancelledKeyException * Corresponding SendingConnectionManagerId not found These are mainly caused by the race condition issue in -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3106) *Race Condition Issue* Fix the order of closing resources when Connection is closed
[ https://issues.apache.org/jira/browse/SPARK-3106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kousuke Saruta updated SPARK-3106: -- Description: Currently, when we run a Spark application, error messages appear in the driver's log, including the following: * messages caused by ClosedChannelException * messages caused by CancelledKeyException * Corresponding SendingConnectionManagerId not found These are mainly caused by a race condition at the time the Connection is closed. was: Currently, when we run a Spark application, error messages appear in the driver's log, including the following: * messages caused by ClosedChannelException * messages caused by CancelledKeyException * Corresponding SendingConnectionManagerId not found These are mainly caused by the race condition issue in *Race Condition Issue* Fix the order of closing resources when Connection is closed --- Key: SPARK-3106 URL: https://issues.apache.org/jira/browse/SPARK-3106 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.1.0 Reporter: Kousuke Saruta Currently, when we run a Spark application, error messages appear in the driver's log, including the following: * messages caused by ClosedChannelException * messages caused by CancelledKeyException * Corresponding SendingConnectionManagerId not found These are mainly caused by a race condition at the time the Connection is closed. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3106) *Race Condition Issue* Fix the order of closing resources when Connection is closed
[ https://issues.apache.org/jira/browse/SPARK-3106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kousuke Saruta updated SPARK-3106: -- Summary: *Race Condition Issue* Fix the order of closing resources when Connection is closed (was: *Race Condition Issue* Fix the order of resources in Connection) *Race Condition Issue* Fix the order of closing resources when Connection is closed --- Key: SPARK-3106 URL: https://issues.apache.org/jira/browse/SPARK-3106 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.1.0 Reporter: Kousuke Saruta Currently, when we run a Spark application, error messages appear in the driver's log, including the following: * messages caused by ClosedChannelException * messages caused by CancelledKeyException * Corresponding SendingConnectionManagerId not found These are mainly caused by the race condition issue in -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3115) Improve task broadcast latency for small tasks
[ https://issues.apache.org/jira/browse/SPARK-3115?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14102142#comment-14102142 ] Mridul Muralidharan commented on SPARK-3115: I had a tab open with pretty much the exact same bug comments ready to be filed :-) Improve task broadcast latency for small tasks -- Key: SPARK-3115 URL: https://issues.apache.org/jira/browse/SPARK-3115 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.1.0 Reporter: Shivaram Venkataraman Assignee: Reynold Xin Broadcasting the task information helps reduce the amount of data transferred for large tasks. However, we've seen that this adds more latency for small tasks. It'll be great to profile and fix this. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3098) In some cases, operation zipWithIndex get a wrong results
[ https://issues.apache.org/jira/browse/SPARK-3098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14102145#comment-14102145 ] Sean Owen commented on SPARK-3098: -- Yes, I get the same result with Spark 1.0.0 with patches, including the fix for SPARK-2043, in standalone mode: {code} Array[(Int, (Long, Long))] = Array((9272040,(13,14)), (9985320,(14,13)), (32797680,(24,26))) {code} If I change the code above so that the ranges are not overlapping to begin with, and remove distinct(), I don't see the issue. It also goes away if the RDD c is cached. I would assume distinct() is deterministic, even if it doesn't guarantee an ordering. Same with zipWithIndex(). Either those assumptions are wrong, or it could be an issue in either place. A quick check says most keys are correct (no mismatch), and the mismatch is generally small. This makes me wonder if there's some kind of race condition in handing out numbers? I'll look at the code too. In some cases, operation zipWithIndex get a wrong results -- Key: SPARK-3098 URL: https://issues.apache.org/jira/browse/SPARK-3098 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.1 Reporter: Guoqiang Li Priority: Critical I do not know how to reproduce the bug. This is the case: when I was processing 10 billion records with groupByKey, the results were wrong: {noformat} (4696501, 370568) (4696501, 376672) (4696501, 374880) . (4696502, 350264) (4696502, 358458) (4696502, 398502) .. {noformat} => {noformat} (4696501,ArrayBuffer(350264, 358458, 398502 )), (4696502,ArrayBuffer(376621, ..)) {noformat} code: {code} val dealOuts = clickPreferences(sc, dealOutPath, periodTime) val dealOrders = orderPreferences(sc, dealOrderPath, periodTime) val favorites = favoritePreferences(sc, favoritePath, periodTime) val allBehaviors = (dealOrders ++ favorites ++ dealOuts) val peferences = allBehaviors.groupByKey().map { ... } {code} spark-defaults.conf: {code} spark.default.parallelism 280 {code} -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3098) In some cases, operation zipWithIndex get a wrong results
[ https://issues.apache.org/jira/browse/SPARK-3098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Guoqiang Li updated SPARK-3098: --- Description: {code} val c = sc.parallelize(1 to 7899).flatMap { i => (1 to 1).toSeq.map(p => i * 6000 + p) }.distinct().zipWithIndex() c.join(c).filter(t => t._2._1 != t._2._2).take(3) {code} => {code} Array[(Int, (Long, Long))] = Array((1732608,(11,12)), (45515264,(12,13)), (36579712,(13,14))) {code} was: I do not know how to reproduce the bug. This is the case. When I was operating on 10 billion records with groupByKey, the results were wrong: {noformat} (4696501, 370568) (4696501, 376672) (4696501, 374880) . (4696502, 350264) (4696502, 358458) (4696502, 398502) .. {noformat} => {noformat} (4696501,ArrayBuffer(350264, 358458, 398502 )), (4696502,ArrayBuffer(376621, ..)) {noformat} code: {code} val dealOuts = clickPreferences(sc, dealOutPath, periodTime) val dealOrders = orderPreferences(sc, dealOrderPath, periodTime) val favorites = favoritePreferences(sc, favoritePath, periodTime) val allBehaviors = (dealOrders ++ favorites ++ dealOuts) val peferences = allBehaviors.groupByKey().map { ... } {code} spark-defaults.conf: {code} spark.default.parallelism 280 {code} In some cases, operation zipWithIndex get a wrong results -- Key: SPARK-3098 URL: https://issues.apache.org/jira/browse/SPARK-3098 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.1 Reporter: Guoqiang Li Priority: Critical {code} val c = sc.parallelize(1 to 7899).flatMap { i => (1 to 1).toSeq.map(p => i * 6000 + p) }.distinct().zipWithIndex() c.join(c).filter(t => t._2._1 != t._2._2).take(3) {code} => {code} Array[(Int, (Long, Long))] = Array((1732608,(11,12)), (45515264,(12,13)), (36579712,(13,14))) {code} -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3098) In some cases, operation zipWithIndex get a wrong results
[ https://issues.apache.org/jira/browse/SPARK-3098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Guoqiang Li updated SPARK-3098: --- Description: The code to reproduce: {code} val c = sc.parallelize(1 to 7899).flatMap { i => (1 to 1).toSeq.map(p => i * 6000 + p) }.distinct().zipWithIndex() c.join(c).filter(t => t._2._1 != t._2._2).take(3) {code} => {code} Array[(Int, (Long, Long))] = Array((1732608,(11,12)), (45515264,(12,13)), (36579712,(13,14))) {code} was: {code} val c = sc.parallelize(1 to 7899).flatMap { i => (1 to 1).toSeq.map(p => i * 6000 + p) }.distinct().zipWithIndex() c.join(c).filter(t => t._2._1 != t._2._2).take(3) {code} => {code} Array[(Int, (Long, Long))] = Array((1732608,(11,12)), (45515264,(12,13)), (36579712,(13,14))) {code} In some cases, operation zipWithIndex get a wrong results -- Key: SPARK-3098 URL: https://issues.apache.org/jira/browse/SPARK-3098 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.1 Reporter: Guoqiang Li Priority: Critical The code to reproduce: {code} val c = sc.parallelize(1 to 7899).flatMap { i => (1 to 1).toSeq.map(p => i * 6000 + p) }.distinct().zipWithIndex() c.join(c).filter(t => t._2._1 != t._2._2).take(3) {code} => {code} Array[(Int, (Long, Long))] = Array((1732608,(11,12)), (45515264,(12,13)), (36579712,(13,14))) {code} -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3099) Staging Directory is never deleed when we run job with YARN Client Mode
[ https://issues.apache.org/jira/browse/SPARK-3099?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kousuke Saruta updated SPARK-3099: -- Summary: Staging Directory is never deleed when we run job with YARN Client Mode (was: Add a shutdown hook for cleanup staging directory to ExecutorLauncher) Staging Directory is never deleed when we run job with YARN Client Mode --- Key: SPARK-3099 URL: https://issues.apache.org/jira/browse/SPARK-3099 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 1.1.0 Reporter: Kousuke Saruta When we run an application in YARN Cluster mode, the class 'ApplicationMaster' is used as the ApplicationMaster, which has a shutdown hook to clean up the staging directory (~/.sparkStaging). But when we run an application in YARN Client mode, the class 'ExecutorLauncher' used as the ApplicationMaster doesn't clean up the staging directory. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3101) Missing volatile annotation in ApplicationMaster
[ https://issues.apache.org/jira/browse/SPARK-3101?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kousuke Saruta updated SPARK-3101: -- Summary: Missing volatile annotation in ApplicationMaster (was: Flag variable in ApplicationMaster should be declared as volatile) Missing volatile annotation in ApplicationMaster Key: SPARK-3101 URL: https://issues.apache.org/jira/browse/SPARK-3101 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 1.1.0 Reporter: Kousuke Saruta In ApplicationMaster, a field variable 'isLastAMRetry' is used as a flag but it's not declared as volatile though it's used from multiple threads. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
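The fix amounts to marking the flag volatile so a write by one thread is visible to the others. A minimal sketch (not the actual ApplicationMaster code; names are illustrative):

```scala
object AMState {
  // Without @volatile a reader thread may keep seeing a stale cached value,
  // since the flag is written and read from different threads.
  @volatile private var isLastAMRetry: Boolean = true

  def markLastRetry(last: Boolean): Unit = { isLastAMRetry = last } // writer thread
  def shouldCleanup: Boolean = isLastAMRetry                        // reader thread, e.g. a shutdown hook
}
```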
[jira] [Updated] (SPARK-3099) Staging Directory is never deleted when we run job with YARN Client Mode
[ https://issues.apache.org/jira/browse/SPARK-3099?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kousuke Saruta updated SPARK-3099: -- Summary: Staging Directory is never deleted when we run job with YARN Client Mode (was: Staging Directory is never deleed when we run job with YARN Client Mode) Staging Directory is never deleted when we run job with YARN Client Mode Key: SPARK-3099 URL: https://issues.apache.org/jira/browse/SPARK-3099 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 1.1.0 Reporter: Kousuke Saruta When we run an application in YARN Cluster mode, the class 'ApplicationMaster' is used as the ApplicationMaster, which has a shutdown hook to clean up the staging directory (~/.sparkStaging). But when we run an application in YARN Client mode, the class 'ExecutorLauncher' used as the ApplicationMaster doesn't clean up the staging directory. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
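A sketch of the direction the original summary proposed (a shutdown hook in the client-mode AM); the method name and FileSystem handling are illustrative, not the actual patch:

```scala
import org.apache.hadoop.fs.{FileSystem, Path}

// Illustrative only: delete the application's staging directory
// (under ~/.sparkStaging) when the AM JVM exits, mirroring what the
// cluster-mode ApplicationMaster already does.
def registerStagingCleanupHook(fs: FileSystem, stagingDir: Path): Unit =
  Runtime.getRuntime.addShutdownHook(new Thread {
    override def run(): Unit = fs.delete(stagingDir, true) // recursive delete
  })
```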
[jira] [Updated] (SPARK-3090) Avoid not stopping SparkContext with YARN Client mode
[ https://issues.apache.org/jira/browse/SPARK-3090?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kousuke Saruta updated SPARK-3090: -- Summary: Avoid not stopping SparkContext with YARN Client mode (was: Add shutdown hook to stop SparkContext for YARN Client mode) Avoid not stopping SparkContext with YARN Client mode -- Key: SPARK-3090 URL: https://issues.apache.org/jira/browse/SPARK-3090 Project: Spark Issue Type: Bug Components: Spark Core, YARN Affects Versions: 1.1.0 Reporter: Kousuke Saruta When we use YARN Cluster mode, the ApplicationMaster registers a shutdown hook that stops the SparkContext. Thanks to this, the SparkContext can be stopped even if the application forgets to stop it itself. But, unfortunately, YARN Client mode doesn't have such a mechanism. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
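The cluster-mode mechanism described above can be sketched as follows; client mode would need something equivalent (illustrative, not the actual ApplicationMaster code):

```scala
import org.apache.spark.SparkContext

// Ensure the SparkContext is stopped on JVM exit even if the
// application never calls sc.stop() itself.
def registerStopHook(sc: SparkContext): Unit =
  Runtime.getRuntime.addShutdownHook(new Thread {
    override def run(): Unit = sc.stop()
  })
```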
[jira] [Updated] (SPARK-3089) Fix meaningless error message in ConnectionManager
[ https://issues.apache.org/jira/browse/SPARK-3089?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kousuke Saruta updated SPARK-3089: -- Summary: Fix meaningless error message in ConnectionManager (was: Make error message in ConnectionManager more meaningful) Fix meaningless error message in ConnectionManager -- Key: SPARK-3089 URL: https://issues.apache.org/jira/browse/SPARK-3089 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.1.0 Reporter: Kousuke Saruta When ConnectionManager#removeConnection is invoked and it cannot find the SendingConnection to be closed corresponding to a ConnectionManagerId, the following message is logged. {code} logError("Corresponding SendingConnectionManagerId not found") {code} But we cannot tell from the message which SendingConnectionManagerId is meant. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
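The fix the title suggests is simply to include the id in the message. A sketch (the variable name is assumed, not taken from the actual patch):

```scala
// Include which id failed to resolve so the log line is actionable.
logError("Corresponding SendingConnectionManagerId not found: " + sendingConnectionManagerId)
```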
[jira] [Commented] (SPARK-1782) svd for sparse matrix using ARPACK
[ https://issues.apache.org/jira/browse/SPARK-1782?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14102179#comment-14102179 ] Tarek Elgamal commented on SPARK-1782: -- I am interested in trying this new svd implementation. Is there an estimate of when Spark 1.1.0 will be officially released? svd for sparse matrix using ARPACK -- Key: SPARK-1782 URL: https://issues.apache.org/jira/browse/SPARK-1782 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Li Pu Fix For: 1.1.0 Original Estimate: 672h Remaining Estimate: 672h Currently the svd implementation in mllib calls the dense matrix svd in breeze, which has a limitation of fitting n^2 Gram matrix entries in memory (n is the number of rows or number of columns of the matrix, whichever is smaller). In many use cases, the original matrix is sparse but the Gram matrix might not be, and we often need only the largest k singular values/vectors. To make svd really scalable, the memory usage must be proportional to the non-zero entries in the matrix. One solution is to call the de facto standard eigen-decomposition package ARPACK. For an input matrix M, we compute a few eigenvalues and eigenvectors of M^t*M (or M*M^t if its size is smaller) using ARPACK, then use the eigenvalues/vectors to reconstruct singular values/vectors. ARPACK has a reverse communication interface. The user provides a function to multiply a square matrix to be decomposed with a dense vector provided by ARPACK, and return the resulting dense vector to ARPACK. Inside ARPACK it uses an Implicitly Restarted Lanczos Method for symmetric matrix. Outside, what we need to provide are two matrix-vector multiplications, first M*x then M^t*x. These multiplications can be done in Spark in a distributed manner. The working memory used by ARPACK is O(n*k). When k (the number of desired singular values) is small, it can be easily fit into the memory of the master machine. 
The overall model is: the master machine runs ARPACK, and the matrix-vector multiplications are distributed onto worker executors in each iteration. I made a PR to breeze with an ARPACK-backed svds interface (https://github.com/scalanlp/breeze/pull/240). The interface takes anything that can be multiplied by a DenseVector. On the Spark/mllib side, we just need to implement the sparse matrix-vector multiplication. It might take some time to optimize and fully test this implementation, so I set the workload estimate to 4 weeks. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
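The two distributed multiplications described above (first M*x, then M^t applied to the result) could be sketched over an RDD of sparse rows like this. This is illustrative only; the row representation and names are assumptions, not MLlib's actual API:

```scala
import org.apache.spark.rdd.RDD

// rows: each element is one sparse row of M as (columnIndex, value) pairs.
// x: the dense vector supplied by ARPACK on the master.
// Returns y = M^t * (M * x), computed in one distributed pass.
def multiplyGramian(rows: RDD[Array[(Int, Double)]], x: Array[Double]): Array[Double] =
  rows.map { row =>
    // local dot product of this sparse row with x: one entry of M * x
    val mx = row.map { case (j, v) => v * x(j) }.sum
    // this row's contribution to M^t * (M * x)
    val contrib = new Array[Double](x.length)
    row.foreach { case (j, v) => contrib(j) += v * mx }
    contrib
  }.reduce { (a, b) =>
    // elementwise sum of the per-row contributions
    a.indices.map(i => a(i) + b(i)).toArray
  }
```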
[jira] [Commented] (SPARK-3098) In some cases, operation zipWithIndex get a wrong results
[ https://issues.apache.org/jira/browse/SPARK-3098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14102187#comment-14102187 ] Guoqiang Li commented on SPARK-3098: [~srowen] the following code also has this issue. {code} val c = sc.parallelize(1 to 7899).flatMap { i => (1 to 1).toSeq.map(p => i * 6000 + p) }.distinct() c.zip(c).filter(t => t._1 != t._2).take(3) {code} {{distinct}} seems to be the problem. In some cases, operation zipWithIndex get a wrong results -- Key: SPARK-3098 URL: https://issues.apache.org/jira/browse/SPARK-3098 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.1 Reporter: Guoqiang Li Priority: Critical The code to reproduce: {code} val c = sc.parallelize(1 to 7899).flatMap { i => (1 to 1).toSeq.map(p => i * 6000 + p) }.distinct().zipWithIndex() c.join(c).filter(t => t._2._1 != t._2._2).take(3) {code} => {code} Array[(Int, (Long, Long))] = Array((1732608,(11,12)), (45515264,(12,13)), (36579712,(13,14))) {code} -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
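Sean Owen observed earlier in this thread that the mismatch goes away when the RDD is cached; a workaround sketch along those lines:

```scala
// Persist the distinct() output before zipping, so both sides of the
// join evaluate one materialized ordering rather than re-running the
// shuffle fetch (whose request order is randomized) twice.
val c = sc.parallelize(1 to 7899).flatMap { i =>
  (1 to 1).toSeq.map(p => i * 6000 + p)
}.distinct().cache()

val indexed = c.zipWithIndex()
// empty once c is cached, per the observation above
indexed.join(indexed).filter(t => t._2._1 != t._2._2).take(3)
```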
[jira] [Closed] (SPARK-2753) Is it supposed --archives option in yarn cluster mode to uncompress file?
[ https://issues.apache.org/jira/browse/SPARK-2753?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] José Manuel Abuín Mosquera closed SPARK-2753. - Resolution: Not a Problem Is it supposed --archives option in yarn cluster mode to uncompress file? - Key: SPARK-2753 URL: https://issues.apache.org/jira/browse/SPARK-2753 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 1.0.0 Environment: CentOS release 6.5 (64 bits) and Hadoop 2.2.0 Reporter: José Manuel Abuín Mosquera Labels: archives, cache, distributed, yarn Hi all, this is my first filed issue; I googled and searched into the Spark code and arrived here. When passing a tar.gz or .zip file as the argument to --archives, Spark uploads it to the distributed cache, but it does not uncompress it. According to the documentation, it is supposed to uncompress it; is this a bug? The launch command is: /opt/spark-1.0.1/bin/spark-submit --class ProlnatSpark --master yarn-cluster --num-executors 32 --driver-library-path /opt/hadoop/hadoop-2.2.0/lib/native/ --driver-memory 390m --executor-memory 890m --executor-cores 1 --archives=Diccionarios.tar.gz --verbose ProlnatSpark.jar Wikipedias/WikipediaPlain.txt saidaWikipediaSpark The files /yarn/common/src/main/scala/org/apache/spark/deploy/yarn/ClientBase.scala and /yarn/common/src/main/scala/org/apache/spark/deploy/yarn/ExecutorRunnableUtil.scala don't seem to uncompress the files. I hope this helps, thank you very much :) -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3120) Local Dirs is not useful in yarn-client mode
[ https://issues.apache.org/jira/browse/SPARK-3120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14102199#comment-14102199 ] Thomas Graves commented on SPARK-3120: -- Can you please clarify this? You are trying to set the local-dirs on the executors? If so, then you are specifically not allowed to do that. On YARN you should be using the YARN-approved directories, which get properly managed by YARN (cleaned up when the application finishes). It automatically just picks up the YARN-configured directories. Local Dirs is not useful in yarn-client mode Key: SPARK-3120 URL: https://issues.apache.org/jira/browse/SPARK-3120 Project: Spark Issue Type: Bug Components: Spark Core, YARN Affects Versions: 1.0.2 Environment: Spark 1.0.2 Yarn 2.3.0 Reporter: hzw I was using Spark 1.0.2 and Hadoop 2.3.0 to run a Spark application on YARN. I expected to set spark.local.dir to spread the shuffle files across many disks, so I exported LOCAL_DIRS in spark-env.sh. But it failed to create the local dirs in my specified path; it just went to /tmp/hadoop-root/nm-local-dir/usercache/root/appcache/, the Hadoop default path. To reproduce this: 1. Do not set "yarn.nodemanager.local-dirs" in yarn-site.xml, which influences the result. 2. Run a job and then find in the executor log the line INFO DiskBlockManager: Created local directory at .. In addition, I tried to add the exported LOCAL_DIRS in yarn-env.sh. It launches with the LOCAL_DIRS value in the ExecutorLauncher, but it is still overwritten by YARN when launching the executor container. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
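Per the comment above, executor local directories on YARN come from the YARN configuration rather than spark.local.dir or LOCAL_DIRS. A yarn-site.xml fragment along those lines (the paths are illustrative):

```xml
<!-- yarn-site.xml on each NodeManager: Spark on YARN inherits
     executor local dirs from YARN itself, which also cleans them up
     when the application finishes. -->
<property>
  <name>yarn.nodemanager.local-dirs</name>
  <value>/disk1/yarn/local,/disk2/yarn/local</value>
</property>
```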
[jira] [Resolved] (SPARK-3072) Yarn AM not always properly exiting after unregistering from RM
[ https://issues.apache.org/jira/browse/SPARK-3072?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves resolved SPARK-3072. -- Resolution: Fixed Fix Version/s: 1.1.0 Yarn AM not always properly exiting after unregistering from RM --- Key: SPARK-3072 URL: https://issues.apache.org/jira/browse/SPARK-3072 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 1.0.2 Reporter: Thomas Graves Assignee: Thomas Graves Priority: Critical Fix For: 1.1.0 The yarn application master doesn't always exit properly after unregistering from the RM. One way to reproduce is to ask for large containers (> 4g) but use jdk32 so that all of them fail. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3098) In some cases, operation zipWithIndex get a wrong results
[ https://issues.apache.org/jira/browse/SPARK-3098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14102279#comment-14102279 ] Guoqiang Li commented on SPARK-3098: This issue is caused by this code: [BlockFetcherIterator.scala#L221|https://github.com/apache/spark/blob/v1.0.1/core/src/main/scala/org/apache/spark/storage/BlockFetcherIterator.scala#L221] {noformat}fetchRequests ++= Utils.randomize(remoteRequests){noformat} <= [ShuffledRDD.scala#L65|https://github.com/apache/spark/blob/v1.0.1/core/src/main/scala/org/apache/spark/rdd/ShuffledRDD.scala#L65] {noformat}SparkEnv.get.shuffleFetcher.fetch[P](shuffledId, split.index, context, ser){noformat} <= [PairRDDFunctions.scala#L100|https://github.com/apache/spark/blob/v1.0.1/core/src/main/scala/org/apache/spark/rdd/PairRDDFunctions.scala#L100] {noformat} val partitioned = new ShuffledRDD[K, C, (K, C)](combined, partitioner) .setSerializer(serializer) partitioned.mapPartitionsWithContext((context, iter) => { new InterruptibleIterator(context, aggregator.combineCombinersByKey(iter, context)) }, preservesPartitioning = true) {noformat} <= [PairRDDFunctions.scala#L163|https://github.com/apache/spark/blob/v1.0.1/core/src/main/scala/org/apache/spark/rdd/PairRDDFunctions.scala#L163] {noformat} def reduceByKey(partitioner: Partitioner, func: (V, V) => V): RDD[(K, V)] = { combineByKey[V]((v: V) => v, func, func, partitioner) } {noformat} <= [RDD.scala#L288|https://github.com/apache/spark/blob/v1.0.1/core/src/main/scala/org/apache/spark/rdd/RDD.scala#L288] {noformat} def distinct(numPartitions: Int)(implicit ord: Ordering[T] = null): RDD[T] = map(x => (x, null)).reduceByKey((x, y) => x, numPartitions).map(_._1) {noformat} In some cases, operation zipWithIndex get a wrong results -- Key: SPARK-3098 URL: https://issues.apache.org/jira/browse/SPARK-3098 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.1 Reporter: Guoqiang Li Priority: Critical The code to reproduce: {code} val c = sc.parallelize(1 to 7899).flatMap { i => (1 to 1).toSeq.map(p => i * 6000 + p) }.distinct().zipWithIndex() c.join(c).filter(t => t._2._1 != t._2._2).take(3) {code} => {code} Array[(Int, (Long, Long))] = Array((1732608,(11,12)), (45515264,(12,13)), (36579712,(13,14))) {code} -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-3098) In some cases, operation zipWithIndex get a wrong results
[ https://issues.apache.org/jira/browse/SPARK-3098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14102279#comment-14102279 ] Guoqiang Li edited comment on SPARK-3098 at 8/19/14 3:02 PM: - this issue caused by the code: [BlockFetcherIterator.scala#L221|https://github.com/apache/spark/blob/v1.0.1/core/src/main/scala/org/apache/spark/storage/BlockFetcherIterator.scala#L221] {noformat}fetchRequests ++= Utils.randomize(remoteRequests){noformat} =[ShuffledRDD.scala#L65|https://github.com/apache/spark/blob/v1.0.1/core/src/main/scala/org/apache/spark/rdd/ShuffledRDD.scala#L65]{noformat}SparkEnv.get.shuffleFetcher.fetch[P](shuffledId, split.index, context, ser){noformat} = [PairRDDFunctions.scala#L100|https://github.com/apache/spark/blob/v1.0.1/core/src/main/scala/org/apache/spark/rdd/PairRDDFunctions.scala#L100] {noformat} val partitioned = new ShuffledRDD[K, C, (K, C)](combined, partitioner) .setSerializer(serializer) partitioned.mapPartitionsWithContext((context, iter) = { new InterruptibleIterator(context, aggregator.combineCombinersByKey(iter, context)) }, preservesPartitioning = true) {noformat} = [PairRDDFunctions.scala#L163|https://github.com/apache/spark/blob/v1.0.1/core/src/main/scala/org/apache/spark/rdd/PairRDDFunctions.scala#L163] {noformat} def reduceByKey(partitioner: Partitioner, func: (V, V) = V): RDD[(K, V)] = { combineByKey[V]((v: V) = v, func, func, partitioner) } {noformat} = [RDD.scala#L288|https://github.com/apache/spark/blob/v1.0.1/core/src/main/scala/org/apache/spark/rdd/RDD.scala#L288] {noformat} def distinct(numPartitions: Int)(implicit ord: Ordering[T] = null): RDD[T] = map(x = (x, null)).reduceByKey((x, y) = x, numPartitions).map(_._1) {noformat} was (Author: gq): this issue caused by the code: [BlockFetcherIterator.scala#L221|https://github.com/apache/spark/blob/v1.0.1/core/src/main/scala/org/apache/spark/storage/BlockFetcherIterator.scala#L221] {noformat}fetchRequests ++= 
Utils.randomize(remoteRequests){noformat} =[ShuffledRDD.scala#L65|https://github.com/apache/spark/blob/v1.0.1/core/src/main/scala/org/apache/spark/rdd/ShuffledRDD.scala#L65]{noformat}SparkEnv.get.shuffleFetcher.fetch[P](shuffledId, split.index, context, ser){noformat} = [PairRDDFunctions.scala#L100|https://github.com/apache/spark/blob/v1.0.1/core/src/main/scala/org/apache/spark/rdd/PairRDDFunctions.scala#L100] {noformat} val partitioned = new ShuffledRDD[K, C, (K, C)](combined, partitioner) .setSerializer(serializer) partitioned.mapPartitionsWithContext((context, iter) = { new InterruptibleIterator(context, aggregator.combineCombinersByKey(iter, context)) }, preservesPartitioning = true) {noformat} = [PairRDDFunctions.scala#L163|https://github.com/apache/spark/blob/v1.0.1/core/src/main/scala/org/apache/spark/rdd/PairRDDFunctions.scala#L163] {noformat} def reduceByKey(partitioner: Partitioner, func: (V, V) = V): RDD[(K, V)] = { combineByKey[V]((v: V) = v, func, func, partitioner) } {noformat} = [RDD.scala#L288|https://github.com/apache/spark/blob/v1.0.1/core/src/main/scala/org/apache/spark/rdd/RDD.scala#L288] {noformat} def distinct(numPartitions: Int)(implicit ord: Ordering[T] = null): RDD[T] = map(x = (x, null)).reduceByKey((x, y) = x, numPartitions).map(_._1) {noformat} In some cases, operation zipWithIndex get a wrong results -- Key: SPARK-3098 URL: https://issues.apache.org/jira/browse/SPARK-3098 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.1 Reporter: Guoqiang Li Priority: Critical The reproduce code: {code} val c = sc.parallelize(1 to 7899).flatMap { i = (1 to 1).toSeq.map(p = i * 6000 + p) }.distinct().zipWithIndex() c.join(c).filter(t = t._2._1 != t._2._2).take(3) {code} = {code} Array[(Int, (Long, Long))] = Array((1732608,(11,12)), (45515264,(12,13)), (36579712,(13,14))) {code} -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: 
issues-h...@spark.apache.org
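The call chain quoted above ends at {{distinct()}}, whose shuffle makes the parent RDD's contents nondeterministically ordered across evaluations, while {{zipWithIndex()}} derives each element's index from per-partition counts and offsets. A plain-Python sketch of that indexing scheme (illustrative only, not Spark's API; the helper name is made up) shows why re-evaluating a nondeterministic parent can hand the same element two different indices:

```python
# Sketch (plain Python, not Spark code): zipWithIndex first counts the
# elements in each partition, then gives partition i the starting offset
# sum(counts[:i]), and numbers elements by their position in the partition.
from itertools import accumulate

def zip_with_index(partitions):
    counts = [len(p) for p in partitions]
    offsets = [0] + list(accumulate(counts))[:-1]
    return [(x, off + i)
            for part, off in zip(partitions, offsets)
            for i, x in enumerate(part)]

# Two evaluations of a nondeterministic upstream (e.g. distinct()):
# same elements, but a different order within the first partition.
run1 = [[10, 20], [30, 40]]
run2 = [[20, 10], [30, 40]]
idx1 = dict(zip_with_index(run1))
idx2 = dict(zip_with_index(run2))
# Element 10 gets one index in the first evaluation and a different one
# in the second, so a self-join can observe mismatched index pairs.
```

In the reproduction above, the self-join evaluates the lineage twice, which is how elements whose position shifted between evaluations surface as pairs with unequal indices.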
[jira] [Commented] (SPARK-3122) hadoop-yarn dependencies cannot be resolved
[ https://issues.apache.org/jira/browse/SPARK-3122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14102288#comment-14102288 ] Guoqiang Li commented on SPARK-3122: Why add the {{spark-yarn_2.10}} dependency? Normally you add this to your POM file's dependencies section: {code:xml} <dependency> <groupId>org.apache.hadoop</groupId> <artifactId>hadoop-client</artifactId> <version>2.4.0</version> </dependency> {code} hadoop-yarn dependencies cannot be resolved --- Key: SPARK-3122 URL: https://issues.apache.org/jira/browse/SPARK-3122 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 1.0.0, 1.0.2 Environment: spark 1.0.1 + YARN + hadoop 2.4.0 cluster on linux machines client on windows 7 maven 3.0.4 maven repository http://repo1.maven.org/maven2/org/apache/hadoop/ Reporter: Ran Levi Labels: build, easyfix, hadoop, maven When adding spark-yarn_2.10:1.0.2 dependency to java project, other hadoop-yarn-XXX dependencies are needed. Those dependencies are downloaded using version 1.0.4 which does not exist, resulting in build error. Version 1.0.4 is taken from hadoop.version variable in spark-parent-1.0.2.pom. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3122) hadoop-yarn dependencies cannot be resolved
[ https://issues.apache.org/jira/browse/SPARK-3122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14102300#comment-14102300 ] Ran Levi commented on SPARK-3122: - It was my understanding that it is required to create a JavaSparkContext with master='yarn-client'. Was I wrong? hadoop-yarn dependencies cannot be resolved --- Key: SPARK-3122 URL: https://issues.apache.org/jira/browse/SPARK-3122 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 1.0.0, 1.0.2 Environment: spark 1.0.1 + YARN + hadoop 2.4.0 cluster on linux machines client on windows 7 maven 3.0.4 maven repository http://repo1.maven.org/maven2/org/apache/hadoop/ Reporter: Ran Levi Labels: build, easyfix, hadoop, maven When adding spark-yarn_2.10:1.0.2 dependency to java project, other hadoop-yarn-XXX dependencies are needed. Those dependencies are downloaded using version 1.0.4 which does not exist, resulting in build error. Version 1.0.4 is taken from hadoop.version variable in spark-parent-1.0.2.pom. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-3123) override the setName function to set EdgeRDD's name manually just as VertexRDD does.
uncleGen created SPARK-3123: --- Summary: override the setName function to set EdgeRDD's name manually just as VertexRDD does. Key: SPARK-3123 URL: https://issues.apache.org/jira/browse/SPARK-3123 Project: Spark Issue Type: Improvement Components: GraphX Affects Versions: 1.0.2, 1.0.0 Reporter: uncleGen Priority: Minor -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3123) override the setName function to set EdgeRDD's name manually just as VertexRDD does.
[ https://issues.apache.org/jira/browse/SPARK-3123?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14102308#comment-14102308 ] Apache Spark commented on SPARK-3123: - User 'uncleGen' has created a pull request for this issue: https://github.com/apache/spark/pull/2033 override the setName function to set EdgeRDD's name manually just as VertexRDD does. -- Key: SPARK-3123 URL: https://issues.apache.org/jira/browse/SPARK-3123 Project: Spark Issue Type: Improvement Components: GraphX Affects Versions: 1.0.0, 1.0.2 Reporter: uncleGen Priority: Minor -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3118) add SHOW TBLPROPERTIES tblname; and SHOW COLUMNS (FROM|IN) table_name [(FROM|IN) db_name] support
[ https://issues.apache.org/jira/browse/SPARK-3118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14102317#comment-14102317 ] Apache Spark commented on SPARK-3118: - User 'u0jing' has created a pull request for this issue: https://github.com/apache/spark/pull/2034 add SHOW TBLPROPERTIES tblname; and SHOW COLUMNS (FROM|IN) table_name [(FROM|IN) db_name] support - Key: SPARK-3118 URL: https://issues.apache.org/jira/browse/SPARK-3118 Project: Spark Issue Type: New Feature Components: SQL Reporter: wangxiaojing Priority: Minor Original Estimate: 12h Remaining Estimate: 12h The SHOW TBLPROPERTIES tblname; and SHOW COLUMNS (FROM|IN) table_name [(FROM|IN) db_name] syntax had been disabled. SHOW COLUMNS shows all the columns in a table including partition columns. SHOW TBLPROPERTIES shows Table Properties. They all describe a hive table. spark-sql SHOW COLUMNS in test; SHOW COLUMNS in test; java.lang.RuntimeException: Unsupported language features in query: SHOW COLUMNS in test TOK_SHOWCOLUMNS TOK_TABNAME test spark-sql SHOW TBLPROPERTIES test; SHOW TBLPROPERTIES test; java.lang.RuntimeException: Unsupported language features in query: SHOW TBLPROPERTIES test TOK_SHOW_TBLPROPERTIES test -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3122) hadoop-yarn dependencies cannot be resolved
[ https://issues.apache.org/jira/browse/SPARK-3122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14102318#comment-14102318 ] Guoqiang Li commented on SPARK-3122: It is not necessary. You only need these: {code:xml} <dependency> <groupId>org.apache.hadoop</groupId> <artifactId>hadoop-client</artifactId> <version>2.4.0</version> </dependency> <dependency> <groupId>org.apache.hadoop</groupId> <artifactId>hadoop-client</artifactId> <version>2.4.0</version> </dependency> {code} Refer to these: http://spark.apache.org/docs/latest/running-on-yarn.html http://spark.apache.org/docs/latest/submitting-applications.html https://github.com/apache/spark/blob/master/README.md#a-note-about-hadoop-versions hadoop-yarn dependencies cannot be resolved --- Key: SPARK-3122 URL: https://issues.apache.org/jira/browse/SPARK-3122 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 1.0.0, 1.0.2 Environment: spark 1.0.1 + YARN + hadoop 2.4.0 cluster on linux machines client on windows 7 maven 3.0.4 maven repository http://repo1.maven.org/maven2/org/apache/hadoop/ Reporter: Ran Levi Labels: build, easyfix, hadoop, maven When adding spark-yarn_2.10:1.0.2 dependency to java project, other hadoop-yarn-XXX dependencies are needed. Those dependencies are downloaded using version 1.0.4 which does not exist, resulting in build error. Version 1.0.4 is taken from hadoop.version variable in spark-parent-1.0.2.pom. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3124) Jar version conflict in the assembly package
[ https://issues.apache.org/jira/browse/SPARK-3124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14102330#comment-14102330 ] Guoqiang Li commented on SPARK-3124: What's your command? Jar version conflict in the assembly package Key: SPARK-3124 URL: https://issues.apache.org/jira/browse/SPARK-3124 Project: Spark Issue Type: Bug Components: SQL Reporter: Cheng Hao Priority: Blocker Both netty-3.2.2.Final.jar and netty-3.6.6.Final.jar are flatten into the assembly package, however, the class(NioWorker) signature difference leads to the failure in launching sparksql CLI/ThriftServer. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-3124) Jar version conflict in the assembly package
[ https://issues.apache.org/jira/browse/SPARK-3124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14102330#comment-14102330 ] Guoqiang Li edited comment on SPARK-3124 at 8/19/14 3:42 PM: - What's your command? {{ ./make-distribution.sh -Pyarn -Phadoop-2.3 -Phive-thriftserver -Phive -Dhadoop.version=2.3.0 }} should be no problem. was (Author: gq): What's your command? Jar version conflict in the assembly package Key: SPARK-3124 URL: https://issues.apache.org/jira/browse/SPARK-3124 Project: Spark Issue Type: Bug Components: SQL Reporter: Cheng Hao Priority: Blocker Both netty-3.2.2.Final.jar and netty-3.6.6.Final.jar are flatten into the assembly package, however, the class(NioWorker) signature difference leads to the failure in launching sparksql CLI/ThriftServer. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3124) Jar version conflict in the assembly package
[ https://issues.apache.org/jira/browse/SPARK-3124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14102334#comment-14102334 ] Apache Spark commented on SPARK-3124: - User 'chenghao-intel' has created a pull request for this issue: https://github.com/apache/spark/pull/2035 Jar version conflict in the assembly package Key: SPARK-3124 URL: https://issues.apache.org/jira/browse/SPARK-3124 Project: Spark Issue Type: Bug Components: SQL Reporter: Cheng Hao Priority: Blocker Both netty-3.2.2.Final.jar and netty-3.6.6.Final.jar are flatten into the assembly package, however, the class(NioWorker) signature difference leads to the failure in launching sparksql CLI/ThriftServer. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3124) Jar version conflict in the assembly package
[ https://issues.apache.org/jira/browse/SPARK-3124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14102338#comment-14102338 ] Cheng Hao commented on SPARK-3124: -- Can you try bin/spark-sql after make distribution? Jar version conflict in the assembly package Key: SPARK-3124 URL: https://issues.apache.org/jira/browse/SPARK-3124 Project: Spark Issue Type: Bug Components: SQL Reporter: Cheng Hao Priority: Blocker Both netty-3.2.2.Final.jar and netty-3.6.6.Final.jar are flatten into the assembly package, however, the class(NioWorker) signature difference leads to the failure in launching sparksql CLI/ThriftServer. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-3125) hive thriftserver test suite failure
wangfei created SPARK-3125: -- Summary: hive thriftserver test suite failure Key: SPARK-3125 URL: https://issues.apache.org/jira/browse/SPARK-3125 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.0.2 Reporter: wangfei hive thriftserver test suite failure 1 CliSuite: [info] - simple commands *** FAILED *** [info] java.lang.AssertionError: assertion failed: Didn't find OK in the output: [info] at scala.Predef$.assert(Predef.scala:179) [info] at org.apache.spark.sql.hive.thriftserver.TestUtils$class.waitForQuery(TestUtils.scala:70) [info] at org.apache.spark.sql.hive.thriftserver.CliSuite.waitForQuery(CliSuite.scala:26) [info] at org.apache.spark.sql.hive.thriftserver.TestUtils$class.executeQuery(TestUtils.scala:62) [info] at org.apache.spark.sql.hive.thriftserver.CliSuite.executeQuery(CliSuite.scala:26) [info] at org.apache.spark.sql.hive.thriftserver.CliSuite$$anonfun$1.apply$mcV$sp(CliSuite.scala:54) [info] at org.apache.spark.sql.hive.thriftserver.CliSuite$$anonfun$1.apply(CliSuite.scala:52) [info] at org.apache.spark.sql.hive.thriftserver.CliSuite$$anonfun$1.apply(CliSuite.scala:52) [info] at org.scalatest.Transformer$$anonfun$apply$1.apply(Transformer.scala:22) [info] at org.scalatest.Transformer$$anonfun$apply$1.apply(Transformer.scala:22) [info] ... 
2.HiveThriftServer2Suite - test query execution against a Hive Thrift server *** FAILED *** [info] java.sql.SQLException: Could not open connection to jdbc:hive2://localhost:41419/: java.net.ConnectException: Connection refused [info] at org.apache.hive.jdbc.HiveConnection.openTransport(HiveConnection.java:146) [info] at org.apache.hive.jdbc.HiveConnection.init(HiveConnection.java:123) [info] at org.apache.hive.jdbc.HiveDriver.connect(HiveDriver.java:105) [info] at java.sql.DriverManager.getConnection(DriverManager.java:571) [info] at java.sql.DriverManager.getConnection(DriverManager.java:215) [info] at org.apache.spark.sql.hive.thriftserver.HiveThriftServer2Suite.getConnection(HiveThriftServer2Suite.scala:152) [info] at org.apache.spark.sql.hive.thriftserver.HiveThriftServer2Suite.createStatement(HiveThriftServer2Suite.scala:155) [info] at org.apache.spark.sql.hive.thriftserver.HiveThriftServer2Suite$$anonfun$1.apply$mcV$sp(HiveThriftServer2Suite.scala:113) [info] at org.apache.spark.sql.hive.thriftserver.HiveThriftServer2Suite$$anonfun$1.apply(HiveThriftServer2Suite.scala:110) [info] at org.apache.spark.sql.hive.thriftserver.HiveThriftServer2Suite$$anonfun$1.apply(HiveThriftServer2Suite.scala:110) [info] ... 
[info] Cause: org.apache.thrift.transport.TTransportException: java.net.ConnectException: Connection refused [info] at org.apache.thrift.transport.TSocket.open(TSocket.java:185) [info] at org.apache.thrift.transport.TSaslTransport.open(TSaslTransport.java:248) [info] at org.apache.thrift.transport.TSaslClientTransport.open(TSaslClientTransport.java:37) [info] at org.apache.hive.jdbc.HiveConnection.openTransport(HiveConnection.java:144) [info] at org.apache.hive.jdbc.HiveConnection.init(HiveConnection.java:123) [info] at org.apache.hive.jdbc.HiveDriver.connect(HiveDriver.java:105) [info] at java.sql.DriverManager.getConnection(DriverManager.java:571) [info] at java.sql.DriverManager.getConnection(DriverManager.java:215) [info] at org.apache.spark.sql.hive.thriftserver.HiveThriftServer2Suite.getConnection(HiveThriftServer2Suite.scala:152) [info] at org.apache.spark.sql.hive.thriftserver.HiveThriftServer2Suite.createStatement(HiveThriftServer2Suite.scala:155) [info] ... [info] Cause: java.net.ConnectException: Connection refused [info] at java.net.PlainSocketImpl.socketConnect(Native Method) [info] at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:339) [info] at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:200) [info] at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:182) [info] at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392) [info] at java.net.Socket.connect(Socket.java:579) [info] at org.apache.thrift.transport.TSocket.open(TSocket.java:180) [info] at org.apache.thrift.transport.TSaslTransport.open(TSaslTransport.java:248) [info] at org.apache.thrift.transport.TSaslClientTransport.open(TSaslClientTransport.java:37) [info] at org.apache.hive.jdbc.HiveConnection.openTransport(HiveConnection.java:144) [info] ... 
[info] - SPARK-3004 regression: result set containing NULL *** FAILED *** [info] java.sql.SQLException: Could not open connection to jdbc:hive2://localhost:41419/: java.net.ConnectException: Connection refused [info] at org.apache.hive.jdbc.HiveConnection.openTransport(HiveConnection.java:146) [info] at org.apache.hive.jdbc.HiveConnection.init(HiveConnection.java:123) [info]
[jira] [Created] (SPARK-3126) HiveThriftServer2Suite hangs
Cheng Lian created SPARK-3126: - Summary: HiveThriftServer2Suite hangs Key: SPARK-3126 URL: https://issues.apache.org/jira/browse/SPARK-3126 Project: Spark Issue Type: Bug Components: SQL Reporter: Cheng Lian Priority: Blocker Fix For: 1.0.1, 1.0.2 [PR #1851|https://github.com/apache/spark/pull/1851] modified {{sbin/start-thriftserver.sh}}, added proper quotation and removed {{eval}}, but {{HiveThriftServer2Suite}}, which invokes {{sbin/start-thriftserver.sh}}, was not updated accordingly. The JDBC URL command line option shouldn't be quoted after removing {{eval}} from the script; otherwise, the following wrong command is issued (notice the unclosed double quote): {code} ../../sbin/start-thriftserver.sh ... --hiveconf "javax.jdo.option.ConnectionURL=xxx ... {code} This makes the test cases hang until they time out. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
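The quoting mismatch can be seen with a toy shell snippet (illustrative only, not the actual start-thriftserver.sh logic): when a variable holds a pre-quoted option string, expanding it without {{eval}} leaves the double quotes as literal characters inside the argument, while {{eval}} re-parses and strips them:

```shell
# Hypothetical option string, mimicking the pre-quoted JDBC URL option.
args='--hiveconf "javax.jdo.option.ConnectionURL=xxx"'

# Without eval: plain word splitting keeps the quote characters.
set -- $args
no_eval_arg=$2   # -> "javax.jdo.option.ConnectionURL=xxx" (quotes included)

# With eval: the string is re-parsed, so the quotes are removed as intended.
eval "set -- $args"
eval_arg=$2      # -> javax.jdo.option.ConnectionURL=xxx
```

This is why, once the script stopped using {{eval}}, the test suite also had to stop pre-quoting the option.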
[jira] [Commented] (SPARK-3124) Jar version conflict in the assembly package
[ https://issues.apache.org/jira/browse/SPARK-3124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14102344#comment-14102344 ] Guoqiang Li commented on SPARK-3124: We should modify the file sql/hive-thriftserver/pom.xml: {code:xml} <dependency> <groupId>org.spark-project.hive</groupId> <artifactId>hive-cli</artifactId> <version>${hive.version}</version> <exclusions> <exclusion> <groupId>org.jboss.netty</groupId> <artifactId>netty</artifactId> </exclusion> </exclusions> </dependency> {code} Jar version conflict in the assembly package Key: SPARK-3124 URL: https://issues.apache.org/jira/browse/SPARK-3124 Project: Spark Issue Type: Bug Components: SQL Reporter: Cheng Hao Priority: Blocker Both netty-3.2.2.Final.jar and netty-3.6.6.Final.jar are flattened into the assembly package; however, the class (NioWorker) signature difference leads to the failure in launching the sparksql CLI/ThriftServer. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-3127) Modifying Spark SQL related scripts should trigger Spark SQL test suites
Cheng Lian created SPARK-3127: - Summary: Modifying Spark SQL related scripts should trigger Spark SQL test suites Key: SPARK-3127 URL: https://issues.apache.org/jira/browse/SPARK-3127 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.0.2, 1.0.1 Reporter: Cheng Lian Currently only modifying files under {{sql/}} triggers execution of Spark SQL test suites, {{bin/spark-sql}} and {{sbin/start-thriftserver.sh}} are not included. This is an indirect cause of SPARK-3126. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3120) Local Dirs is not useful in yarn-client mode
[ https://issues.apache.org/jira/browse/SPARK-3120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14102350#comment-14102350 ] hzw commented on SPARK-3120: Do you mean that if I want to change the local dirs in YARN mode, I must restart the cluster, since the configuration was cached in memory? And I was also confused about where to set LOCAL_DIRS, as the Spark configuration says: "NOTE: In Spark 1.0 and later this will be overridden by SPARK_LOCAL_DIRS (Standalone, Mesos) or LOCAL_DIRS (YARN) environment variables set by the cluster manager." Local Dirs is not useful in yarn-client mode Key: SPARK-3120 URL: https://issues.apache.org/jira/browse/SPARK-3120 Project: Spark Issue Type: Bug Components: Spark Core, YARN Affects Versions: 1.0.2 Environment: Spark 1.0.2 Yarn 2.3.0 Reporter: hzw I was using Spark 1.0.2 and Hadoop 2.3.0 to run a Spark application on YARN. I expected to set spark.local.dir to spread the shuffle files across many disks, so I exported LOCAL_DIRS in spark-env.sh. But it failed to create the local dirs in my specified path; it just went to /tmp/hadoop-root/nm-local-dir/usercache/root/appcache/, the Hadoop default path. To reproduce this: 1. Do not set "yarn.nodemanager.local-dirs" in yarn-site.xml, which influences the result. 2. Run a job and then find in the executor log the line INFO DiskBlockManager: Created local directory at .. In addition, I tried to add the exported LOCAL_DIRS in yarn-env.sh. It launches the LOCAL_DIRS value in the ExecutorLauncher, but it would still be overwritten by YARN when launching the executor container. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
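For reference, the setting under discussion looks like this in {{spark-env.sh}} (a hypothetical fragment; the disk paths are made up). As the quoted note says, on YARN the cluster manager's LOCAL_DIRS wins, so this value only takes effect for Standalone and Mesos deployments:

```shell
# spark-env.sh (illustrative only): spread shuffle files across two disks.
# On YARN this is overridden by the LOCAL_DIRS environment variable that the
# NodeManager derives from yarn.nodemanager.local-dirs, which is why
# exporting it here has no effect on executor containers.
export SPARK_LOCAL_DIRS=/disk1/spark,/disk2/spark
```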
[jira] [Commented] (SPARK-3124) Jar version conflict in the assembly package
[ https://issues.apache.org/jira/browse/SPARK-3124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14102352#comment-14102352 ] Cheng Hao commented on SPARK-3124: -- Yes, actually I did in the PR. Jar version conflict in the assembly package Key: SPARK-3124 URL: https://issues.apache.org/jira/browse/SPARK-3124 Project: Spark Issue Type: Bug Components: SQL Reporter: Cheng Hao Priority: Blocker Both netty-3.2.2.Final.jar and netty-3.6.6.Final.jar are flatten into the assembly package, however, the class(NioWorker) signature difference leads to the failure in launching sparksql CLI/ThriftServer. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2929) Rewrite HiveThriftServer2Suite and CliSuite
[ https://issues.apache.org/jira/browse/SPARK-2929?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14102370#comment-14102370 ] Cheng Lian commented on SPARK-2929: --- Opened SPARK-3126 and SPARK-3127 to track failures of these test suites more precisely. [PR #2036|https://github.com/apache/spark/pull/2036] was submitted to fix both issues. Set SPARK-3126 as blocker and restored this one to major. Rewrite HiveThriftServer2Suite and CliSuite --- Key: SPARK-2929 URL: https://issues.apache.org/jira/browse/SPARK-2929 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.0.1, 1.0.2 Reporter: Cheng Lian Assignee: Cheng Lian {{HiveThriftServer2Suite}} and {{CliSuite}} were inherited from Shark and contain too many hard-coded timeouts and timing assumptions when doing IPC. This makes these tests both flaky and slow. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3127) Modifying Spark SQL related scripts should trigger Spark SQL test suites
[ https://issues.apache.org/jira/browse/SPARK-3127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14102373#comment-14102373 ] Apache Spark commented on SPARK-3127: - User 'liancheng' has created a pull request for this issue: https://github.com/apache/spark/pull/2036 Modifying Spark SQL related scripts should trigger Spark SQL test suites Key: SPARK-3127 URL: https://issues.apache.org/jira/browse/SPARK-3127 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.0.1, 1.0.2 Reporter: Cheng Lian Currently only modifying files under {{sql/}} triggers execution of Spark SQL test suites, {{bin/spark-sql}} and {{sbin/start-thriftserver.sh}} are not included. This is an indirect cause of SPARK-3126. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3125) hive thriftserver test suite failure
[ https://issues.apache.org/jira/browse/SPARK-3125?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14102424#comment-14102424 ] wangfei commented on SPARK-3125: for clisuite i print the error info, as follows: log4j:WARN No appenders could be found for logger (org.apache.hadoop.util.Shell). log4j:WARN Please initialize the log4j system properly. log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info. Logging initialized using configuration in jar:file:/home/wf/code/spark/assembly/target/scala-2.10/spark-assembly-1.1.0-SNAPSHOT-hadoop2.3.0.jar!/hive-log4j.properties FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.metastore.HiveMetaStoreClient org.apache.spark.sql.execution.QueryExecutionException: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.metastore.HiveMetaStoreClient at org.apache.spark.sql.hive.HiveContext.runHive(HiveContext.scala:301) at org.apache.spark.sql.hive.HiveContext.runSqlHive(HiveContext.scala:271) at org.apache.spark.sql.hive.execution.NativeCommand.sideEffectResult$lzycompute(NativeCommand.scala:35) at org.apache.spark.sql.hive.execution.NativeCommand.sideEffectResult(NativeCommand.scala:35) at org.apache.spark.sql.hive.execution.NativeCommand.execute(NativeCommand.scala:38) at org.apache.spark.sql.hive.HiveContext$QueryExecution.toRdd$lzycompute(HiveContext.scala:359) at org.apache.spark.sql.hive.HiveContext$QueryExecution.toRdd(HiveContext.scala:359) at org.apache.spark.sql.SchemaRDDLike$class.$init$(SchemaRDDLike.scala:58) at org.apache.spark.sql.SchemaRDD.init(SchemaRDD.scala:103) at org.apache.spark.sql.hive.HiveContext.sql(HiveContext.scala:98) at org.apache.spark.sql.hive.thriftserver.SparkSQLDriver.run(SparkSQLDriver.scala:58) at 
org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.processCmd(SparkSQLCLIDriver.scala:291) at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:413) at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:226) at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:314) at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:73) at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) hive thriftserver test suite failure Key: SPARK-3125 URL: https://issues.apache.org/jira/browse/SPARK-3125 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.0.2 Reporter: wangfei hive thriftserver test suite failure 1 CliSuite: [info] - simple commands *** FAILED *** [info] java.lang.AssertionError: assertion failed: Didn't find OK in the output: [info] at scala.Predef$.assert(Predef.scala:179) [info] at org.apache.spark.sql.hive.thriftserver.TestUtils$class.waitForQuery(TestUtils.scala:70) [info] at org.apache.spark.sql.hive.thriftserver.CliSuite.waitForQuery(CliSuite.scala:26) [info] at org.apache.spark.sql.hive.thriftserver.TestUtils$class.executeQuery(TestUtils.scala:62) [info] at org.apache.spark.sql.hive.thriftserver.CliSuite.executeQuery(CliSuite.scala:26) [info] at org.apache.spark.sql.hive.thriftserver.CliSuite$$anonfun$1.apply$mcV$sp(CliSuite.scala:54) [info] at org.apache.spark.sql.hive.thriftserver.CliSuite$$anonfun$1.apply(CliSuite.scala:52) [info] at org.apache.spark.sql.hive.thriftserver.CliSuite$$anonfun$1.apply(CliSuite.scala:52) [info] at org.scalatest.Transformer$$anonfun$apply$1.apply(Transformer.scala:22) 
[info] at org.scalatest.Transformer$$anonfun$apply$1.apply(Transformer.scala:22) [info] ... 2.HiveThriftServer2Suite - test query execution against a Hive Thrift server *** FAILED *** [info] java.sql.SQLException: Could not open connection to jdbc:hive2://localhost:41419/: java.net.ConnectException: Connection refused [info] at org.apache.hive.jdbc.HiveConnection.openTransport(HiveConnection.java:146) [info] at org.apache.hive.jdbc.HiveConnection.init(HiveConnection.java:123) [info] at org.apache.hive.jdbc.HiveDriver.connect(HiveDriver.java:105) [info] at
[jira] [Commented] (SPARK-1782) svd for sparse matrix using ARPACK
[ https://issues.apache.org/jira/browse/SPARK-1782?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14102427#comment-14102427 ] Xiangrui Meng commented on SPARK-1782: -- The plan is to release v1.1 by the end of the month. The feature is available in both master and branch-1.1. You can also check out the current snapshot and try it. svd for sparse matrix using ARPACK -- Key: SPARK-1782 URL: https://issues.apache.org/jira/browse/SPARK-1782 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Li Pu Fix For: 1.1.0 Original Estimate: 672h Remaining Estimate: 672h Currently the svd implementation in mllib calls the dense matrix svd in breeze, which has a limitation of fitting n^2 Gram matrix entries in memory (n is the number of rows or number of columns of the matrix, whichever is smaller). In many use cases, the original matrix is sparse but the Gram matrix might not be, and we often need only the largest k singular values/vectors. To make svd really scalable, the memory usage must be proportional to the non-zero entries in the matrix. One solution is to call the de facto standard eigen-decomposition package ARPACK. For an input matrix M, we compute a few eigenvalues and eigenvectors of M^t*M (or M*M^t if its size is smaller) using ARPACK, then use the eigenvalues/vectors to reconstruct singular values/vectors. ARPACK has a reverse communication interface. The user provides a function that multiplies the square matrix to be decomposed with a dense vector provided by ARPACK, and returns the resulting dense vector to ARPACK. Inside, ARPACK uses an Implicitly Restarted Lanczos Method for symmetric matrices. Outside, what we need to provide are two matrix-vector multiplications, first M*x then M^t*x. These multiplications can be done in Spark in a distributed manner. The working memory used by ARPACK is O(n*k). When k (the number of desired singular values) is small, it easily fits into the memory of the master machine. 
The overall model is that the master machine runs ARPACK and distributes the matrix-vector multiplications to worker executors in each iteration. I made a PR to Breeze with an ARPACK-backed svds interface (https://github.com/scalanlp/breeze/pull/240). The interface takes anything that can be multiplied by a DenseVector. On the Spark/MLlib side, we just need to implement the sparse matrix-vector multiplication. It might take some time to optimize and fully test this implementation, so the workload estimate is set to 4 weeks. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
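The reverse-communication loop described above can be sketched in plain Python (a toy stand-in: a simple power iteration replaces ARPACK's implicitly restarted Lanczos method, and a small dense matrix replaces the distributed sparse matrix; all names are made up). The point is that the solver only ever calls back for v -> M^t*(M*v), and those two matrix-vector products are exactly what Spark would distribute:

```python
# Toy sketch of the reverse-communication idea behind the proposal above.
# The eigensolver never sees the matrix; it only calls gram_matvec, which
# chains the two products M*x and M^t*x (each a distributed job in Spark).
import math
import random

def matvec(M, v):
    # Dense stand-in for a distributed M * x over partitioned rows.
    return [sum(row[j] * v[j] for j in range(len(v))) for row in M]

def gram_matvec(M, v):
    # v -> M^t * (M * v), without ever materializing the Gram matrix M^t*M.
    Mv = matvec(M, v)
    return [sum(M[i][j] * Mv[i] for i in range(len(M))) for j in range(len(v))]

def top_singular_value(M, iters=200):
    # Power iteration on M^t*M; its largest eigenvalue is sigma_max^2.
    random.seed(0)
    v = [random.random() for _ in range(len(M[0]))]
    for _ in range(iters):
        w = gram_matvec(M, v)
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Rayleigh quotient of the converged vector gives the top eigenvalue.
    lam = sum(a * b for a, b in zip(v, gram_matvec(M, v)))
    return math.sqrt(lam)
```

Only the two matvec helpers touch the matrix data; the O(n*k) solver state stays on the driver, matching the memory argument in the issue description.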
[jira] [Created] (SPARK-3128) Use streaming test suite for StreamingLR
Jeremy Freeman created SPARK-3128: - Summary: Use streaming test suite for StreamingLR Key: SPARK-3128 URL: https://issues.apache.org/jira/browse/SPARK-3128 Project: Spark Issue Type: Improvement Components: MLlib, Streaming Affects Versions: 1.1.0 Reporter: Jeremy Freeman Priority: Minor Unit tests for Streaming Linear Regression currently use file writing to generate input data and a TextFileStream to read the data. It would be better to use existing utilities from the streaming test suite to simulate DStreams and collect and evaluate results of DStream operations. This will make tests faster, simpler, and easier to maintain / extend. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3129) Prevent data loss in Spark Streaming
[ https://issues.apache.org/jira/browse/SPARK-3129?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hari Shreedharan updated SPARK-3129: Issue Type: New Feature (was: Bug) Prevent data loss in Spark Streaming Key: SPARK-3129 URL: https://issues.apache.org/jira/browse/SPARK-3129 Project: Spark Issue Type: New Feature Reporter: Hari Shreedharan Attachments: StreamingPreventDataLoss.pdf Spark Streaming can lose small amounts of data when the driver goes down - and the sending system cannot re-send the data (or the data has already expired on the sender side). The document attached has more details.
[jira] [Updated] (SPARK-3129) Prevent data loss in Spark Streaming
[ https://issues.apache.org/jira/browse/SPARK-3129?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hari Shreedharan updated SPARK-3129: Attachment: StreamingPreventDataLoss.pdf Prevent data loss in Spark Streaming Key: SPARK-3129 URL: https://issues.apache.org/jira/browse/SPARK-3129 Project: Spark Issue Type: Bug Reporter: Hari Shreedharan Attachments: StreamingPreventDataLoss.pdf Spark Streaming can lose small amounts of data when the driver goes down - and the sending system cannot re-send the data (or the data has already expired on the sender side). The document attached has more details.
[jira] [Commented] (SPARK-3122) hadoop-yarn dependencies cannot be resolved
[ https://issues.apache.org/jira/browse/SPARK-3122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14102472#comment-14102472 ] Sean Owen commented on SPARK-3122: -- [~gq] You do not need to depend on hadoop-client for Spark's sake. You may need to depend on it because your code uses Hadoop directly. Again, if you include hadoop-client, it should be provided. Your example shows a compile-time dependency, and duplicates a single dependency. [~ran.l...@hp.com] spark-yarn is really the 'server side' implementation. You depend on spark-core. It should be a provided dependency. The Spark artifacts published to Maven happen to reference Hadoop 1.x, but that will not matter for your build, because you do not inherit their dependencies. All you care about is Spark's API, which does not vary with the Hadoop version. hadoop-yarn dependencies cannot be resolved --- Key: SPARK-3122 URL: https://issues.apache.org/jira/browse/SPARK-3122 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 1.0.0, 1.0.2 Environment: spark 1.0.1 + YARN + hadoop 2.4.0 cluster on linux machines client on windows 7 maven 3.0.4 maven repository http://repo1.maven.org/maven2/org/apache/hadoop/ Reporter: Ran Levi Labels: build, easyfix, hadoop, maven When adding the spark-yarn_2.10:1.0.2 dependency to a Java project, other hadoop-yarn-XXX dependencies are needed. Those dependencies are downloaded with version 1.0.4, which does not exist, resulting in a build error. Version 1.0.4 is taken from the hadoop.version variable in spark-parent-1.0.2.pom.
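The dependency setup Sean describes might look like the following pom.xml fragment. This is an illustrative sketch only; the versions shown are examples, and the point is the {code}provided{code} scope, which keeps both artifacts off the runtime classpath so the cluster's own versions are used.

```xml
<!-- Illustrative fragment: Spark and (if your code uses Hadoop directly)
     hadoop-client declared as "provided"; versions are examples only. -->
<dependencies>
  <dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_2.10</artifactId>
    <version>1.0.2</version>
    <scope>provided</scope>
  </dependency>
  <dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-client</artifactId>
    <version>2.4.0</version>
    <scope>provided</scope>
  </dependency>
</dependencies>
```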
[jira] [Created] (SPARK-3130) Should not allow negative values in naive Bayes
Xiangrui Meng created SPARK-3130: Summary: Should not allow negative values in naive Bayes Key: SPARK-3130 URL: https://issues.apache.org/jira/browse/SPARK-3130 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 1.1.0 Reporter: Xiangrui Meng Assignee: Xiangrui Meng We should not allow negative feature values, because NB treats feature values as term frequencies.
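The check the issue calls for can be sketched as follows. This is an illustrative sketch, not Spark's API (the function name is hypothetical): multinomial naive Bayes interprets feature values as term frequencies, so a negative value is meaningless and should fail fast rather than silently corrupt the model.

```python
import numpy as np

def validate_nonnegative(features):
    """Reject negative feature values before fitting naive Bayes."""
    features = np.asarray(features, dtype=float)
    if (features < 0).any():
        # Fail fast with a descriptive message instead of training on bad input.
        raise ValueError("naive Bayes requires nonnegative feature values, "
                         f"got min {features.min()}")
    return features
```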
[jira] [Updated] (SPARK-3110) Add a ha mode in YARN mode to keep executors in between restarts
[ https://issues.apache.org/jira/browse/SPARK-3110?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hari Shreedharan updated SPARK-3110: Issue Type: Sub-task (was: Bug) Parent: SPARK-3129 Add a ha mode in YARN mode to keep executors in between restarts -- Key: SPARK-3110 URL: https://issues.apache.org/jira/browse/SPARK-3110 Project: Spark Issue Type: Sub-task Reporter: Hari Shreedharan The idea is that for long-running processes like streaming, you'd want the AM to come back up and reuse the same executors, so it can get the blocks from the memory of the executors, because many streaming systems like Flume cannot really replay the data once it has been taken out. Even for those which can, the time period before data expires can mean some data could be lost. This is the first step in a series of patches for this one. The next is to get the AM to find the executors. My current plan is to use HDFS to keep track of where the executors are running and then communicate with them via Akka to get a block list. I plan to expose this via SparkSubmit as the last step, once we have all of the other pieces in place.
[jira] [Commented] (SPARK-3128) Use streaming test suite for StreamingLR
[ https://issues.apache.org/jira/browse/SPARK-3128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14102481#comment-14102481 ] Apache Spark commented on SPARK-3128: - User 'freeman-lab' has created a pull request for this issue: https://github.com/apache/spark/pull/2037 Use streaming test suite for StreamingLR Key: SPARK-3128 URL: https://issues.apache.org/jira/browse/SPARK-3128 Project: Spark Issue Type: Improvement Components: MLlib, Streaming Affects Versions: 1.1.0 Reporter: Jeremy Freeman Priority: Minor Unit tests for Streaming Linear Regression currently use file writing to generate input data and a TextFileStream to read the data. It would be better to use existing utilities from the streaming test suite to simulate DStreams and collect and evaluate results of DStream operations. This will make tests faster, simpler, and easier to maintain / extend.
[jira] [Resolved] (SPARK-3089) Fix meaningless error message in ConnectionManager
[ https://issues.apache.org/jira/browse/SPARK-3089?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen resolved SPARK-3089. --- Resolution: Fixed Fix Version/s: 1.1.0 Assignee: Kousuke Saruta Fix meaningless error message in ConnectionManager -- Key: SPARK-3089 URL: https://issues.apache.org/jira/browse/SPARK-3089 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.1.0 Reporter: Kousuke Saruta Assignee: Kousuke Saruta Fix For: 1.1.0 When ConnectionManager#removeConnection is invoked and it cannot find the SendingConnection to be closed corresponding to a ConnectionManagerId, the following message is logged. {code} logError(Corresponding SendingConnectionManagerId not found) {code} But we cannot tell from the message which SendingConnectionManagerId is meant.
[jira] [Commented] (SPARK-3130) Should not allow negative values in naive Bayes
[ https://issues.apache.org/jira/browse/SPARK-3130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14102499#comment-14102499 ] Apache Spark commented on SPARK-3130: - User 'mengxr' has created a pull request for this issue: https://github.com/apache/spark/pull/2038 Should not allow negative values in naive Bayes --- Key: SPARK-3130 URL: https://issues.apache.org/jira/browse/SPARK-3130 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 1.1.0 Reporter: Xiangrui Meng Assignee: Xiangrui Meng We should not allow negative feature values, because NB treats feature values as term frequencies.
[jira] [Commented] (SPARK-3129) Prevent data loss in Spark Streaming
[ https://issues.apache.org/jira/browse/SPARK-3129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14102501#comment-14102501 ] Hari Shreedharan commented on SPARK-3129: - This doc is an early list of fixes. I may have missed some, and/or there may be better ways to do this. Please post any feedback you have! Thanks! Prevent data loss in Spark Streaming Key: SPARK-3129 URL: https://issues.apache.org/jira/browse/SPARK-3129 Project: Spark Issue Type: New Feature Reporter: Hari Shreedharan Assignee: Hari Shreedharan Attachments: StreamingPreventDataLoss.pdf Spark Streaming can lose small amounts of data when the driver goes down - and the sending system cannot re-send the data (or the data has already expired on the sender side). The document attached has more details.
[jira] [Commented] (SPARK-3129) Prevent data loss in Spark Streaming
[ https://issues.apache.org/jira/browse/SPARK-3129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14102504#comment-14102504 ] Thomas Graves commented on SPARK-3129: -- A couple of random thoughts on this for YARN. YARN added this ability in 2.4.0, and you have to tell it you want it in the application submission context, so you will have to handle other versions of YARN properly where it's not supported. I believe YARN will tell you what nodes you already have containers running on, but you'll have to figure out details about ports, etc. I haven't looked at all the specifics. You'll have to figure out how to do authentication properly; this gets forgotten many times. I think we should flesh out more of the high-level design concerns between YARN/standalone/Mesos, and on YARN the client/cluster modes. Prevent data loss in Spark Streaming Key: SPARK-3129 URL: https://issues.apache.org/jira/browse/SPARK-3129 Project: Spark Issue Type: New Feature Reporter: Hari Shreedharan Assignee: Hari Shreedharan Attachments: StreamingPreventDataLoss.pdf Spark Streaming can lose small amounts of data when the driver goes down - and the sending system cannot re-send the data (or the data has already expired on the sender side). The document attached has more details.
[jira] [Commented] (SPARK-3129) Prevent data loss in Spark Streaming
[ https://issues.apache.org/jira/browse/SPARK-3129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14102579#comment-14102579 ] Hari Shreedharan commented on SPARK-3129: - The way the driver finds the executors would be common for all the scheduling systems (it should really be independent of scheduling/deployment). I agree about the auth part too. [~tdas] mentioned there is something similar already in standalone. I'd like to concentrate on YARN - if someone else is interested in Mesos, please feel free to take it up! I posted an initial patch for client mode to simply keep the executors around (though it is not exposed via SparkSubmit, which we can do once we get the whole series of patches in). For YARN mode, does that mean the method calls have to be via reflection? I'd assume so. The reason I mentioned doing it via HDFS and then pinging the executors is to make it independent of YARN/Mesos/standalone - we can just do it via StreamingContext and make it completely independent of the backend on which Spark is running (I am not even sure this should be a valid option for non-streaming cases, as it does not really add any value elsewhere). Prevent data loss in Spark Streaming Key: SPARK-3129 URL: https://issues.apache.org/jira/browse/SPARK-3129 Project: Spark Issue Type: New Feature Reporter: Hari Shreedharan Assignee: Hari Shreedharan Attachments: StreamingPreventDataLoss.pdf Spark Streaming can lose small amounts of data when the driver goes down - and the sending system cannot re-send the data (or the data has already expired on the sender side). The document attached has more details.
[jira] [Updated] (SPARK-3131) Allow user to set parquet compression codec for writing ParquetFile in SQLContext
[ https://issues.apache.org/jira/browse/SPARK-3131?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Teng Qiu updated SPARK-3131: Summary: Allow user to set parquet compression codec for writing ParquetFile in SQLContext (was: Allow user to set parquet compression codec) Allow user to set parquet compression codec for writing ParquetFile in SQLContext - Key: SPARK-3131 URL: https://issues.apache.org/jira/browse/SPARK-3131 Project: Spark Issue Type: Improvement Components: SQL Reporter: Teng Qiu There are 4 different compression codecs available for ParquetOutputFormat. Currently it is set as a hard-coded value in {code}ParquetRelation.defaultCompression{code} Original discussion: https://github.com/apache/spark/pull/195#discussion-diff-11002083 So we need to add a new config property in SQLConf to allow the user to change this compression codec, and I used a similar short-names syntax as described in SPARK-2953. BTW, which codec should we use as default? It was set to GZIP (https://github.com/apache/spark/pull/195/files#diff-4), but I think maybe we should change this to SNAPPY, since SNAPPY is already the default codec for shuffling in spark-core (SPARK-2469), and parquet-mr supports the Snappy codec natively.
[jira] [Updated] (SPARK-3131) Allow user to set parquet compression codec for writing ParquetFile in SQLContext
[ https://issues.apache.org/jira/browse/SPARK-3131?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Teng Qiu updated SPARK-3131: Description: There are 4 different compression codecs available for ParquetOutputFormat. In Spark SQL it is set as a hard-coded value in {code}ParquetRelation.defaultCompression{code} Original discussion: https://github.com/apache/spark/pull/195#discussion-diff-11002083 So we need to add a new config property in SQLConf to allow the user to change this compression codec, and I used a similar short-names syntax as described in SPARK-2953. BTW, which codec should we use as default? It was set to GZIP (https://github.com/apache/spark/pull/195/files#diff-4), but I think maybe we should change this to SNAPPY, since SNAPPY is already the default codec for shuffling in spark-core (SPARK-2469), and parquet-mr supports the Snappy codec natively. was: There are 4 different compression codec available for ParquetOutputFormat currently it was set as a hard-coded value in {code}ParquetRelation.defaultCompression{code} original discuss: https://github.com/apache/spark/pull/195#discussion-diff-11002083 so we need to add a new config property in SQLConf to allow user change this compression codec, and i used similar short names syntax as described in SPARK-2953 btw, which codec should we use as default? it was set to GZIP (https://github.com/apache/spark/pull/195/files#diff-4), but i think maybe we should change this to SNAPPY, since SNAPPY is already the default codec for shuffling in spark-core (SPARK-2469), and parquet-mr supports Snappy codec natively. 
Allow user to set parquet compression codec for writing ParquetFile in SQLContext - Key: SPARK-3131 URL: https://issues.apache.org/jira/browse/SPARK-3131 Project: Spark Issue Type: Improvement Components: SQL Reporter: Teng Qiu There are 4 different compression codecs available for ParquetOutputFormat. In Spark SQL it is set as a hard-coded value in {code}ParquetRelation.defaultCompression{code} Original discussion: https://github.com/apache/spark/pull/195#discussion-diff-11002083 So we need to add a new config property in SQLConf to allow the user to change this compression codec, and I used a similar short-names syntax as described in SPARK-2953. BTW, which codec should we use as default? It was set to GZIP (https://github.com/apache/spark/pull/195/files#diff-4), but I think maybe we should change this to SNAPPY, since SNAPPY is already the default codec for shuffling in spark-core (SPARK-2469), and parquet-mr supports the Snappy codec natively.
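The SPARK-2953-style short-name resolution the issue proposes can be sketched as follows. This is an illustrative sketch, not Spark's implementation; the config key below is chosen here only for illustration. The four codec names match parquet-mr's CompressionCodecName values.

```python
# Short names a user would put in the config, mapped to parquet-mr codec names.
PARQUET_CODECS = {
    "uncompressed": "UNCOMPRESSED",
    "snappy": "SNAPPY",
    "gzip": "GZIP",
    "lzo": "LZO",
}

def resolve_codec(conf, key="spark.sql.parquet.compression.codec",
                  default="gzip"):
    """Look up the user's codec choice (case-insensitive), falling back to a default."""
    name = conf.get(key, default).lower()
    if name not in PARQUET_CODECS:
        raise ValueError(f"unknown parquet codec {name!r}; "
                         f"expected one of {sorted(PARQUET_CODECS)}")
    return PARQUET_CODECS[name]
```

The `default` parameter is where the GZIP-vs-SNAPPY question from the discussion would be decided, in one place.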
[jira] [Commented] (SPARK-3131) Allow user to set parquet compression codec for writing ParquetFile in SQLContext
[ https://issues.apache.org/jira/browse/SPARK-3131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14102724#comment-14102724 ] Apache Spark commented on SPARK-3131: - User 'chutium' has created a pull request for this issue: https://github.com/apache/spark/pull/2039 Allow user to set parquet compression codec for writing ParquetFile in SQLContext - Key: SPARK-3131 URL: https://issues.apache.org/jira/browse/SPARK-3131 Project: Spark Issue Type: Improvement Components: SQL Reporter: Teng Qiu There are 4 different compression codecs available for ParquetOutputFormat. In Spark SQL it is set as a hard-coded value in {code}ParquetRelation.defaultCompression{code} Original discussion: https://github.com/apache/spark/pull/195#discussion-diff-11002083 So we need to add a new config property in SQLConf to allow the user to change this compression codec, and I used a similar short-names syntax as described in SPARK-2953. BTW, which codec should we use as default? It was set to GZIP (https://github.com/apache/spark/pull/195/files#diff-4), but I think maybe we should change this to SNAPPY, since SNAPPY is already the default codec for shuffling in spark-core (SPARK-2469), and parquet-mr supports the Snappy codec natively.
[jira] [Commented] (SPARK-3117) Avoid serialization for TorrentBroadcast blocks
[ https://issues.apache.org/jira/browse/SPARK-3117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14102758#comment-14102758 ] Reynold Xin commented on SPARK-3117: This is going to be fixed by https://github.com/apache/spark/pull/2030 Avoid serialization for TorrentBroadcast blocks --- Key: SPARK-3117 URL: https://issues.apache.org/jira/browse/SPARK-3117 Project: Spark Issue Type: Sub-task Components: Spark Core Reporter: Reynold Xin Assignee: Reynold Xin TorrentBroadcast uses a bunch of wrapper objects and MEMORY_AND_DISK storage level to store the torrent blocks. I don't think those are necessary. We can probably get rid of them completely to store everything in serialized form.
[jira] [Created] (SPARK-3132) Avoid serialization for Array[Byte] in TorrentBroadcast
Reynold Xin created SPARK-3132: -- Summary: Avoid serialization for Array[Byte] in TorrentBroadcast Key: SPARK-3132 URL: https://issues.apache.org/jira/browse/SPARK-3132 Project: Spark Issue Type: Sub-task Reporter: Reynold Xin If the input data is a byte array, we should allow TorrentBroadcast to skip serializing and compressing the input. To do this, we should add a new parameter (shortCircuitByteArray) to TorrentBroadcast, and then avoid serialization if the input is a byte array and shortCircuitByteArray is true. We should then also do compression in task serialization itself instead of doing it in TorrentBroadcast.
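The short-circuit described above can be sketched in a few lines. This is an illustrative sketch, not Spark's code (only the parameter name is taken from the issue): when the payload is already a byte array, the serialize-and-compress step is skipped entirely.

```python
import pickle

def to_bytes(value, short_circuit_byte_array=True):
    """Serialize a value for broadcast, passing byte arrays through untouched."""
    if short_circuit_byte_array and isinstance(value, (bytes, bytearray)):
        return bytes(value)       # already bytes: skip serialization entirely
    return pickle.dumps(value)    # general case: serialize as usual
```

Compression would then move to the caller (task serialization in the issue's plan), so it is applied exactly once regardless of which branch is taken.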
[jira] [Created] (SPARK-3133) Piggyback get location RPC call to fetch small blocks
Reynold Xin created SPARK-3133: -- Summary: Piggyback get location RPC call to fetch small blocks Key: SPARK-3133 URL: https://issues.apache.org/jira/browse/SPARK-3133 Project: Spark Issue Type: Sub-task Reporter: Reynold Xin We should add a new API to the BlockManagerMasterActor to get the location, or the data block directly if the data block is small. Once we use this, it effectively makes TorrentBroadcast behave similarly to HttpBroadcast.
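The piggybacked lookup can be sketched as a toy master (class, method, and threshold names are hypothetical, not Spark's API): when the requested block is below a size threshold, the response carries the block data itself rather than a location list, saving the follow-up fetch round trip.

```python
SMALL_BLOCK_THRESHOLD = 8 * 1024  # bytes; illustrative cutoff

class BlockMaster:
    def __init__(self):
        self.locations = {}  # block_id -> list of executor ids
        self.blocks = {}     # block_id -> bytes held master-side

    def get_locations_or_data(self, block_id):
        """Return ("data", bytes) for small blocks, else ("locations", [...])."""
        data = self.blocks.get(block_id)
        if data is not None and len(data) <= SMALL_BLOCK_THRESHOLD:
            return ("data", data)                       # piggybacked payload
        return ("locations", self.locations.get(block_id, []))
```

For small broadcasts this collapses the two-round-trip pattern (locate, then fetch) into one, which is the HttpBroadcast-like behavior the issue mentions.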
[jira] [Created] (SPARK-3134) Update block locations asynchronously in TorrentBroadcast
Reynold Xin created SPARK-3134: -- Summary: Update block locations asynchronously in TorrentBroadcast Key: SPARK-3134 URL: https://issues.apache.org/jira/browse/SPARK-3134 Project: Spark Issue Type: Sub-task Reporter: Reynold Xin Once TorrentBroadcast gets the data blocks, it needs to tell the master the new location. We should make the location update non-blocking to reduce the round trips needed to launch tasks.
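The non-blocking update can be sketched as a fire-and-forget notification (illustrative only, not Spark's code; the master object and method name are hypothetical): the fetch path returns immediately while the master is notified on a background thread.

```python
from concurrent.futures import ThreadPoolExecutor

# Single background thread keeps updates ordered while keeping callers unblocked.
_notifier = ThreadPoolExecutor(max_workers=1)

def report_block_location(master, block_id, executor_id):
    """Notify the master of a new block location without blocking the caller."""
    return _notifier.submit(master.update_location, block_id, executor_id)
```

The returned future lets a caller that does care (e.g. shutdown code) wait for delivery, while the common path ignores it.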
[jira] [Created] (SPARK-3135) Avoid memory copy in TorrentBroadcast serialization
Reynold Xin created SPARK-3135: -- Summary: Avoid memory copy in TorrentBroadcast serialization Key: SPARK-3135 URL: https://issues.apache.org/jira/browse/SPARK-3135 Project: Spark Issue Type: Sub-task Reporter: Reynold Xin TorrentBroadcast uses a ByteArrayOutputStream to serialize broadcast object into a single giant byte array, and then separates it into smaller chunks. We should implement a new OutputStream that writes serialized bytes directly into chunks of byte arrays so we don't need the extra memory copy.
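The chunked output stream the issue proposes can be sketched as follows (illustrative, not Spark's class): bytes are appended directly into fixed-size chunks as they arrive, so the single giant intermediate array, and the copy out of it, never exist.

```python
import io

class ChunkedOutputStream(io.RawIOBase):
    """Write bytes straight into fixed-size chunks, avoiding one big buffer."""

    def __init__(self, chunk_size):
        self.chunk_size = chunk_size
        self.chunks = [bytearray()]

    def write(self, data):
        view = memoryview(bytes(data))
        written = len(view)
        while view:
            room = self.chunk_size - len(self.chunks[-1])
            if room == 0:
                self.chunks.append(bytearray())  # current chunk full: start a new one
                continue
            self.chunks[-1].extend(view[:room])
            view = view[room:]
        return written
```

A serializer pointed at this stream produces the torrent chunks directly; the final chunk is simply allowed to be short.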
[jira] [Created] (SPARK-3136) create java-friendly methods in RandomRDDs
Xiangrui Meng created SPARK-3136: Summary: create java-friendly methods in RandomRDDs Key: SPARK-3136 URL: https://issues.apache.org/jira/browse/SPARK-3136 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Xiangrui Meng Assignee: Xiangrui Meng Though we don't use default arguments for methods in RandomRDDs, it is still not easy for Java users to use because the output type is either `RDD[Double]` or `RDD[Vector]`. Java users would expect `JavaDoubleRDD` and `JavaRDD[Vector]`, respectively. We should create dedicated methods for Java users, and allow default arguments in Scala methods in RandomRDDs, to make life easier for both Java and Scala users.
[jira] [Updated] (SPARK-3135) Avoid memory copy in TorrentBroadcast serialization
[ https://issues.apache.org/jira/browse/SPARK-3135?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-3135: --- Description: TorrentBroadcast.blockifyObject uses a ByteArrayOutputStream to serialize broadcast object into a single giant byte array, and then separates it into smaller chunks. We should implement a new OutputStream that writes serialized bytes directly into chunks of byte arrays so we don't need the extra memory copy. (was: TorrentBroadcast uses a ByteArrayOutputStream to serialize broadcast object into a single giant byte array, and then separates it into smaller chunks. We should implement a new OutputStream that writes serialized bytes directly into chunks of byte arrays so we don't need the extra memory copy.) Avoid memory copy in TorrentBroadcast serialization --- Key: SPARK-3135 URL: https://issues.apache.org/jira/browse/SPARK-3135 Project: Spark Issue Type: Sub-task Reporter: Reynold Xin Labels: starter TorrentBroadcast.blockifyObject uses a ByteArrayOutputStream to serialize broadcast object into a single giant byte array, and then separates it into smaller chunks. We should implement a new OutputStream that writes serialized bytes directly into chunks of byte arrays so we don't need the extra memory copy.
[jira] [Updated] (SPARK-3135) Avoid memory copy in TorrentBroadcast serialization
[ https://issues.apache.org/jira/browse/SPARK-3135?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-3135: --- Labels: starter (was: ) Avoid memory copy in TorrentBroadcast serialization --- Key: SPARK-3135 URL: https://issues.apache.org/jira/browse/SPARK-3135 Project: Spark Issue Type: Sub-task Reporter: Reynold Xin Labels: starter TorrentBroadcast.blockifyObject uses a ByteArrayOutputStream to serialize broadcast object into a single giant byte array, and then separates it into smaller chunks. We should implement a new OutputStream that writes serialized bytes directly into chunks of byte arrays so we don't need the extra memory copy.
[jira] [Updated] (SPARK-3133) Piggyback get location RPC call to fetch small blocks
[ https://issues.apache.org/jira/browse/SPARK-3133?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-3133: --- Description: We should add a new API to the BlockManagerMasterActor to get the location, or the data block directly if the data block is small. This effectively makes TorrentBroadcast behave similarly to HttpBroadcast. was: We should add a new API to the BlockManagerMasterActor to get location or the data block directly if the data block is small. Once we use this, this effectively makes TorrentBroadcast behaves similarly to HttpBroadcast. Piggyback get location RPC call to fetch small blocks - Key: SPARK-3133 URL: https://issues.apache.org/jira/browse/SPARK-3133 Project: Spark Issue Type: Sub-task Reporter: Reynold Xin We should add a new API to the BlockManagerMasterActor to get the location, or the data block directly if the data block is small. This effectively makes TorrentBroadcast behave similarly to HttpBroadcast.
[jira] [Resolved] (SPARK-3128) Use streaming test suite for StreamingLR
[ https://issues.apache.org/jira/browse/SPARK-3128?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das resolved SPARK-3128. -- Resolution: Fixed Fix Version/s: 1.2.0 1.1.0 Use streaming test suite for StreamingLR Key: SPARK-3128 URL: https://issues.apache.org/jira/browse/SPARK-3128 Project: Spark Issue Type: Improvement Components: MLlib, Streaming Affects Versions: 1.1.0 Reporter: Jeremy Freeman Priority: Minor Fix For: 1.1.0, 1.2.0 Unit tests for Streaming Linear Regression currently use file writing to generate input data and a TextFileStream to read the data. It would be better to use existing utilities from the streaming test suite to simulate DStreams and collect and evaluate results of DStream operations. This will make tests faster, simpler, and easier to maintain / extend.
[jira] [Updated] (SPARK-3133) Piggyback get location RPC call to fetch small blocks
[ https://issues.apache.org/jira/browse/SPARK-3133?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-3133: --- Description: We should add a new API to the BlockManagerMasterActor to get the location, or the data block directly if the data block is small. This effectively makes TorrentBroadcast behave similarly to HttpBroadcast for small blocks. was: We should add a new API to the BlockManagerMasterActor to get location or the data block directly if the data block is small. This effectively makes TorrentBroadcast behaves similarly to HttpBroadcast. Piggyback get location RPC call to fetch small blocks - Key: SPARK-3133 URL: https://issues.apache.org/jira/browse/SPARK-3133 Project: Spark Issue Type: Sub-task Reporter: Reynold Xin We should add a new API to the BlockManagerMasterActor to get the location, or the data block directly if the data block is small. This effectively makes TorrentBroadcast behave similarly to HttpBroadcast for small blocks.
[jira] [Created] (SPARK-3137) Use finer grained locking in TorrentBroadcast.readObject
Reynold Xin created SPARK-3137: -- Summary: Use finer grained locking in TorrentBroadcast.readObject Key: SPARK-3137 URL: https://issues.apache.org/jira/browse/SPARK-3137 Project: Spark Issue Type: Sub-task Reporter: Reynold Xin TorrentBroadcast.readObject uses a global lock so only one task can be fetching the blocks at the same time. This is not optimal if we are running multiple stages concurrently because they should be able to independently fetch their own blocks.
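The finer-grained scheme can be sketched as a lock-per-broadcast table (illustrative only; class and method names are hypothetical, not Spark's): tasks fetching different broadcasts take different locks and proceed independently, while tasks racing on the same broadcast still serialize.

```python
import threading
from collections import defaultdict

class PerBroadcastLocks:
    """One lock per broadcast id instead of a single global lock."""

    def __init__(self):
        self._guard = threading.Lock()             # protects the lock table only
        self._locks = defaultdict(threading.Lock)  # broadcast_id -> its lock

    def lock_for(self, broadcast_id):
        with self._guard:  # brief critical section: just the table lookup
            return self._locks[broadcast_id]
```

Usage would be `with locks.lock_for(bid): fetch_blocks(bid)`; the table guard is held only for the dictionary lookup, never across a fetch.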
[jira] [Updated] (SPARK-3137) Use finer grained locking in TorrentBroadcast.readObject
[ https://issues.apache.org/jira/browse/SPARK-3137?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-3137: --- Component/s: Spark Core Target Version/s: 1.2.0 Use finer grained locking in TorrentBroadcast.readObject Key: SPARK-3137 URL: https://issues.apache.org/jira/browse/SPARK-3137 Project: Spark Issue Type: Sub-task Components: Spark Core Reporter: Reynold Xin TorrentBroadcast.readObject uses a global lock so only one task can be fetching the blocks at the same time. This is not optimal if we are running multiple stages concurrently because they should be able to independently fetch their own blocks.
[jira] [Commented] (SPARK-3136) create java-friendly methods in RandomRDDs
[ https://issues.apache.org/jira/browse/SPARK-3136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14102791#comment-14102791 ] Apache Spark commented on SPARK-3136: - User 'mengxr' has created a pull request for this issue: https://github.com/apache/spark/pull/2041 create java-friendly methods in RandomRDDs -- Key: SPARK-3136 URL: https://issues.apache.org/jira/browse/SPARK-3136 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Xiangrui Meng Assignee: Xiangrui Meng Though we don't use default arguments for methods in RandomRDDs, it is still not easy for Java users to use because the output type is either `RDD[Double]` or `RDD[Vector]`. Java users would expect `JavaDoubleRDD` and `JavaRDD[Vector]`, respectively. We should create dedicated methods for Java users, and allow default arguments in Scala methods in RandomRDDs, to make life easier for both Java and Scala users.
[jira] [Resolved] (SPARK-2333) spark_ec2 script should allow option for existing security group
[ https://issues.apache.org/jira/browse/SPARK-2333?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen resolved SPARK-2333. --- Resolution: Fixed spark_ec2 script should allow option for existing security group Key: SPARK-2333 URL: https://issues.apache.org/jira/browse/SPARK-2333 Project: Spark Issue Type: Bug Components: EC2 Affects Versions: 1.0.0 Reporter: Kam Kasravi Priority: Minor Fix For: 1.1.0 spark-ec2 will create a new security group with hardcoded attributes - an option --use_group should be provided to reuse an existing security group
[jira] [Updated] (SPARK-2333) spark_ec2 script should allow option for existing security group
[ https://issues.apache.org/jira/browse/SPARK-2333?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-2333: -- Issue Type: Improvement (was: Bug) spark_ec2 script should allow option for existing security group Key: SPARK-2333 URL: https://issues.apache.org/jira/browse/SPARK-2333 Project: Spark Issue Type: Improvement Components: EC2 Affects Versions: 1.0.0 Reporter: Kam Kasravi Priority: Minor Fix For: 1.1.0 spark-ec2 will create a new security group with hardcoded attributes - an option --use_group should be provided to reuse an existing security group
[jira] [Updated] (SPARK-2839) Documentation for statistical functions
[ https://issues.apache.org/jira/browse/SPARK-2839?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-2839: - Assignee: Burak Yavuz Documentation for statistical functions --- Key: SPARK-2839 URL: https://issues.apache.org/jira/browse/SPARK-2839 Project: Spark Issue Type: Sub-task Components: Documentation, MLlib Reporter: Xiangrui Meng Assignee: Burak Yavuz Add documentation and code examples for statistical functions to MLlib's programming guide.
[jira] [Updated] (SPARK-3112) Documentation for Streaming Logistic Regression
[ https://issues.apache.org/jira/browse/SPARK-3112?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-3112: - Assignee: Jeremy Freeman Documentation for Streaming Logistic Regression - Key: SPARK-3112 URL: https://issues.apache.org/jira/browse/SPARK-3112 Project: Spark Issue Type: Sub-task Components: Documentation, MLlib Reporter: Ameet Talwalkar Assignee: Jeremy Freeman
[jira] [Resolved] (SPARK-2790) PySpark zip() doesn't work properly if RDDs have different serializers
[ https://issues.apache.org/jira/browse/SPARK-2790?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen resolved SPARK-2790. --- Resolution: Fixed Fix Version/s: 1.1.0 PySpark zip() doesn't work properly if RDDs have different serializers -- Key: SPARK-2790 URL: https://issues.apache.org/jira/browse/SPARK-2790 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.0.0, 1.1.0 Reporter: Josh Rosen Assignee: Davies Liu Priority: Critical Fix For: 1.1.0 In PySpark, attempting to {{zip()}} two RDDs may fail if the RDDs have different serializers (e.g. batched vs. unbatched), even if those RDDs have the same number of partitions and same numbers of elements. This problem occurs in the MLlib Python APIs, where we might want to zip a JavaRDD of LabelledPoints with a JavaRDD of batch-serialized Python objects. This is problematic because whether zip() succeeds or errors depends on the partitioning / batching strategy, and we don't want to surface the serialization details to users.
[jira] [Created] (SPARK-3138) sqlContext.parquetFile should be able to take a single file as parameter
Teng Qiu created SPARK-3138: --- Summary: sqlContext.parquetFile should be able to take a single file as parameter Key: SPARK-3138 URL: https://issues.apache.org/jira/browse/SPARK-3138 Project: Spark Issue Type: Bug Components: SQL Reporter: Teng Qiu http://apache-spark-user-list.1001560.n3.nabble.com/sqlContext-parquetFile-path-fails-if-path-is-a-file-but-succeeds-if-a-directory-tp12345.html to reproduce this issue in spark-shell:
{code:java}
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext._
import org.apache.hadoop.fs.{FileSystem, Path}

case class TestRDDEntry(key: Int, value: String)

val path = "/tmp/parquet_test"
sc.parallelize(1 to 100).map(i => TestRDDEntry(i, s"val_$i")).coalesce(1).saveAsParquetFile(path)

val fsPath = new Path(path)
val fs: FileSystem = fsPath.getFileSystem(sc.hadoopConfiguration)
val children = fs.listStatus(fsPath).filter(_.getPath.getName.endsWith(".parquet"))
val readFile = sqlContext.parquetFile(path + "/" + children(0).getPath.getName)
{code}
it throws an exception:
{code}
java.lang.IllegalArgumentException: Expected file:/tmp/parquet_test/part-r-1.parquet for be a directory with Parquet files/metadata
	at org.apache.spark.sql.parquet.ParquetTypesConverter$.readMetaData(ParquetTypes.scala:374)
	at org.apache.spark.sql.parquet.ParquetTypesConverter$.readSchemaFromFile(ParquetTypes.scala:414)
	at org.apache.spark.sql.parquet.ParquetRelation.init(ParquetRelation.scala:66)
{code}
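One way the requested behaviour could work is to branch on whether the input path is a directory before deciding where to read Parquet metadata from, instead of rejecting plain files outright. This is a hypothetical sketch in plain Java using java.nio.file as a stand-in for Hadoop's FileSystem; the class and method names are illustrative, not Spark's actual ParquetTypesConverter code.

```java
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical sketch of the proposed fix: accept either a single Parquet
// file or a directory of Parquet files.
public class ParquetPathResolver {
    static String resolveMetadataPath(Path p) {
        if (Files.isDirectory(p)) {
            // directory: read the summary metadata file inside it
            return p.resolve("_metadata").toString();
        }
        // single file: read the footer of the file itself
        return p.toString();
    }
}
```

With this check in place, `parquetFile("/tmp/parquet_test")` and `parquetFile("/tmp/parquet_test/part-r-1.parquet")` would both resolve to a readable metadata source rather than throwing IllegalArgumentException for the file case.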
[jira] [Created] (SPARK-3139) Akka timeouts from ContextCleaner when cleaning shuffles
Josh Rosen created SPARK-3139: - Summary: Akka timeouts from ContextCleaner when cleaning shuffles Key: SPARK-3139 URL: https://issues.apache.org/jira/browse/SPARK-3139 Project: Spark Issue Type: Bug Affects Versions: 1.1.0 Environment: 10 r3.2xlarge tests on EC2, running the scala-agg-by-key-int spark-perf test against master commit d7e80c2597d4a9cae2e0cb35a86f7889323f4cbb. Reporter: Josh Rosen Priority: Blocker When running spark-perf tests on EC2, I have a job that's consistently logging the following Akka exceptions: {code} 14/08/19 22:07:12 ERROR spark.ContextCleaner: Error cleaning shuffle 0 java.util.concurrent.TimeoutException: Futures timed out after [30 seconds] at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219) at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223) at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107) at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53) at scala.concurrent.Await$.result(package.scala:107) at org.apache.spark.storage.BlockManagerMaster.removeShuffle(BlockManagerMaster.scala:118) at org.apache.spark.ContextCleaner.doCleanupShuffle(ContextCleaner.scala:159) at org.apache.spark.ContextCleaner$$anonfun$org$apache$spark$ContextCleaner$$keepCleaning$1$$anonfun$apply$mcV$sp$2.apply(ContextCleaner.scala:131) at org.apache.spark.ContextCleaner$$anonfun$org$apache$spark$ContextCleaner$$keepCleaning$1$$anonfun$apply$mcV$sp$2.apply(ContextCleaner.scala:124) at scala.Option.foreach(Option.scala:236) at org.apache.spark.ContextCleaner$$anonfun$org$apache$spark$ContextCleaner$$keepCleaning$1.apply$mcV$sp(ContextCleaner.scala:124) at org.apache.spark.ContextCleaner$$anonfun$org$apache$spark$ContextCleaner$$keepCleaning$1.apply(ContextCleaner.scala:120) at org.apache.spark.ContextCleaner$$anonfun$org$apache$spark$ContextCleaner$$keepCleaning$1.apply(ContextCleaner.scala:120) at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1252) at 
org.apache.spark.ContextCleaner.org$apache$spark$ContextCleaner$$keepCleaning(ContextCleaner.scala:119) at org.apache.spark.ContextCleaner$$anon$3.run(ContextCleaner.scala:65) {code} and {code} 14/08/19 22:07:12 ERROR storage.BlockManagerMaster: Failed to remove shuffle 0 akka.pattern.AskTimeoutException: Timed out at akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:334) at akka.actor.Scheduler$$anon$11.run(Scheduler.scala:118) at scala.concurrent.Future$InternalCallbackExecutor$.scala$concurrent$Future$InternalCallbackExecutor$$unbatchedExecute(Future.scala:694) at scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:691) at akka.actor.LightArrayRevolverScheduler$TaskHolder.executeTask(Scheduler.scala:455) at akka.actor.LightArrayRevolverScheduler$$anon$12.executeBucket$1(Scheduler.scala:407) at akka.actor.LightArrayRevolverScheduler$$anon$12.nextTick(Scheduler.scala:411) at akka.actor.LightArrayRevolverScheduler$$anon$12.run(Scheduler.scala:363) at java.lang.Thread.run(Thread.java:745) {code} This doesn't seem to prevent the job from completing successfully, but it's a serious issue because it means that resources aren't being cleaned up. The test script, ScalaAggByKeyInt, runs each test 10 times, and I see the same error after each test, so this seems deterministically reproducible. I'll look at the executor logs to see if I can find more info there.
[jira] [Commented] (SPARK-3138) sqlContext.parquetFile should be able to take a single file as parameter
[ https://issues.apache.org/jira/browse/SPARK-3138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14102961#comment-14102961 ] Teng Qiu commented on SPARK-3138: - Be careful if someone is working on SPARK-2551; make sure the new change passes the test case {code}test("Read a parquet file instead of a directory"){code} sqlContext.parquetFile should be able to take a single file as parameter Key: SPARK-3138 URL: https://issues.apache.org/jira/browse/SPARK-3138 Project: Spark Issue Type: Bug Components: SQL Reporter: Teng Qiu http://apache-spark-user-list.1001560.n3.nabble.com/sqlContext-parquetFile-path-fails-if-path-is-a-file-but-succeeds-if-a-directory-tp12345.html to reproduce this issue in spark-shell:
{code:java}
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext._
import org.apache.hadoop.fs.{FileSystem, Path}

case class TestRDDEntry(key: Int, value: String)

val path = "/tmp/parquet_test"
sc.parallelize(1 to 100).map(i => TestRDDEntry(i, s"val_$i")).coalesce(1).saveAsParquetFile(path)

val fsPath = new Path(path)
val fs: FileSystem = fsPath.getFileSystem(sc.hadoopConfiguration)
val children = fs.listStatus(fsPath).filter(_.getPath.getName.endsWith(".parquet"))
val readFile = sqlContext.parquetFile(path + "/" + children(0).getPath.getName)
{code}
it throws an exception:
{code}
java.lang.IllegalArgumentException: Expected file:/tmp/parquet_test/part-r-1.parquet for be a directory with Parquet files/metadata
	at org.apache.spark.sql.parquet.ParquetTypesConverter$.readMetaData(ParquetTypes.scala:374)
	at org.apache.spark.sql.parquet.ParquetTypesConverter$.readSchemaFromFile(ParquetTypes.scala:414)
	at org.apache.spark.sql.parquet.ParquetRelation.init(ParquetRelation.scala:66)
{code}
[jira] [Commented] (SPARK-3139) Akka timeouts from ContextCleaner when cleaning shuffles
[ https://issues.apache.org/jira/browse/SPARK-3139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14102960#comment-14102960 ] Josh Rosen commented on SPARK-3139: --- I used pssh + grep to search through the application logs on the workers and I couldn't find any ERRORs or Exceptions (I'm sure that I was searching the right log directories, since other searches return matches). Akka timeouts from ContextCleaner when cleaning shuffles Key: SPARK-3139 URL: https://issues.apache.org/jira/browse/SPARK-3139 Project: Spark Issue Type: Bug Affects Versions: 1.1.0 Environment: 10 r3.2xlarge tests on EC2, running the scala-agg-by-key-int spark-perf test against master commit d7e80c2597d4a9cae2e0cb35a86f7889323f4cbb. Reporter: Josh Rosen Priority: Blocker When running spark-perf tests on EC2, I have a job that's consistently logging the following Akka exceptions: {code} 14/08/19 22:07:12 ERROR spark.ContextCleaner: Error cleaning shuffle 0 java.util.concurrent.TimeoutException: Futures timed out after [30 seconds] at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219) at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223) at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107) at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53) at scala.concurrent.Await$.result(package.scala:107) at org.apache.spark.storage.BlockManagerMaster.removeShuffle(BlockManagerMaster.scala:118) at org.apache.spark.ContextCleaner.doCleanupShuffle(ContextCleaner.scala:159) at org.apache.spark.ContextCleaner$$anonfun$org$apache$spark$ContextCleaner$$keepCleaning$1$$anonfun$apply$mcV$sp$2.apply(ContextCleaner.scala:131) at org.apache.spark.ContextCleaner$$anonfun$org$apache$spark$ContextCleaner$$keepCleaning$1$$anonfun$apply$mcV$sp$2.apply(ContextCleaner.scala:124) at scala.Option.foreach(Option.scala:236) at 
org.apache.spark.ContextCleaner$$anonfun$org$apache$spark$ContextCleaner$$keepCleaning$1.apply$mcV$sp(ContextCleaner.scala:124) at org.apache.spark.ContextCleaner$$anonfun$org$apache$spark$ContextCleaner$$keepCleaning$1.apply(ContextCleaner.scala:120) at org.apache.spark.ContextCleaner$$anonfun$org$apache$spark$ContextCleaner$$keepCleaning$1.apply(ContextCleaner.scala:120) at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1252) at org.apache.spark.ContextCleaner.org$apache$spark$ContextCleaner$$keepCleaning(ContextCleaner.scala:119) at org.apache.spark.ContextCleaner$$anon$3.run(ContextCleaner.scala:65) {code} and {code} 14/08/19 22:07:12 ERROR storage.BlockManagerMaster: Failed to remove shuffle 0 akka.pattern.AskTimeoutException: Timed out at akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:334) at akka.actor.Scheduler$$anon$11.run(Scheduler.scala:118) at scala.concurrent.Future$InternalCallbackExecutor$.scala$concurrent$Future$InternalCallbackExecutor$$unbatchedExecute(Future.scala:694) at scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:691) at akka.actor.LightArrayRevolverScheduler$TaskHolder.executeTask(Scheduler.scala:455) at akka.actor.LightArrayRevolverScheduler$$anon$12.executeBucket$1(Scheduler.scala:407) at akka.actor.LightArrayRevolverScheduler$$anon$12.nextTick(Scheduler.scala:411) at akka.actor.LightArrayRevolverScheduler$$anon$12.run(Scheduler.scala:363) at java.lang.Thread.run(Thread.java:745) {code} This doesn't seem to prevent the job from completing successfully, but it's a serious issue because it means that resources aren't being cleaned up. The test script, ScalaAggByKeyInt, runs each test 10 times, and I see the same error after each test, so this seems deterministically reproducible. I'll look at the executor logs to see if I can find more info there. 
[jira] [Created] (SPARK-3140) PySpark start-up throws confusing exception
Andrew Or created SPARK-3140: Summary: PySpark start-up throws confusing exception Key: SPARK-3140 URL: https://issues.apache.org/jira/browse/SPARK-3140 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.0.2 Reporter: Andrew Or Priority: Critical Currently we read the pyspark port through stdout of the spark-submit subprocess. However, if there is stdout interference, e.g. spark-submit echoes something unexpected to stdout, we print the following: {code} Exception: Launching GatewayServer failed! (Warning: unexpected output detected.) {code} This condition is fine. However, we actually throw the same exception if there is *no* output from the subprocess as well. This is very confusing because it implies that the subprocess is outputting something (possibly whitespace, which is not visible) when it's actually not.
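The fix the issue implies is to distinguish the two failure modes when parsing the gateway port from the subprocess's stdout: "no output at all" versus "output that isn't a port number". A hypothetical sketch of that distinction follows; PySpark's real launcher is Python, and the class, method, and message strings below are illustrative, not the actual code.

```java
// Hypothetical sketch (not PySpark's actual launcher): report empty stdout
// and unexpected stdout as two different, unambiguous errors instead of one
// confusing shared exception message.
public class GatewayPortParser {
    static String diagnose(String stdoutLine) {
        if (stdoutLine == null || stdoutLine.trim().isEmpty()) {
            // previously this case reused the "unexpected output" message
            return "Launching GatewayServer failed! (no output from subprocess)";
        }
        try {
            Integer.parseInt(stdoutLine.trim()); // a bare port number is success
            return "ok";
        } catch (NumberFormatException e) {
            return "Launching GatewayServer failed! (unexpected output: " + stdoutLine + ")";
        }
    }
}
```

With separate messages, a user seeing the "no output" variant knows the subprocess died or printed nothing, rather than suspecting invisible whitespace on stdout.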