[jira] [Created] (SPARK-3988) Public API for DateType support
Adrian Wang created SPARK-3988: -- Summary: Public API for DateType support Key: SPARK-3988 URL: https://issues.apache.org/jira/browse/SPARK-3988 Project: Spark Issue Type: Improvement Components: SQL Reporter: Adrian Wang Priority: Minor add Python API and something else. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3940) Print the error code three times
[ https://issues.apache.org/jira/browse/SPARK-3940?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Ash updated SPARK-3940: -- Target Version/s: 1.1.1, 1.2.0 (was: 1.1.0) > Print the error code three times > - > > Key: SPARK-3940 > URL: https://issues.apache.org/jira/browse/SPARK-3940 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.1.0 >Reporter: wangxj > Labels: patch > Original Estimate: 0.05h > Remaining Estimate: 0.05h > > if an error of SQL,the console print Error three times。 > eg: > {noformat} > spark-sql> show tablesss; > show tablesss; > 14/10/13 20:56:29 INFO ParseDriver: Parsing command: show tablesss > NoViableAltException(26@[598:1: ddlStatement : ( createDatabaseStatement | > switchDatabaseStatement | dropDatabaseStatement | createTableStatement | > dropTableStatement | truncateTableStatement | alterStatement | descStatement > | showStatement | metastoreCheck | createViewStatement | dropViewStatement | > createFunctionStatement | createMacroStatement | createIndexStatement | > dropIndexStatement | dropFunctionStatement | dropMacroStatement | > analyzeStatement | lockStatement | unlockStatement | createRoleStatement | > dropRoleStatement | grantPrivileges | revokePrivileges | showGrants | > showRoleGrants | grantRole | revokeRole );]) > at org.antlr.runtime.DFA.noViableAlt(DFA.java:158) > at org.antlr.runtime.DFA.predict(DFA.java:144) > at > org.apache.hadoop.hive.ql.parse.HiveParser.ddlStatement(HiveParser.java:1962) > at > org.apache.hadoop.hive.ql.parse.HiveParser.execStatement(HiveParser.java:1298) > at > org.apache.hadoop.hive.ql.parse.HiveParser.statement(HiveParser.java:938) > at > org.apache.hadoop.hive.ql.parse.ParseDriver.parse(ParseDriver.java:190) > at > org.apache.hadoop.hive.ql.parse.ParseDriver.parse(ParseDriver.java:161) > at org.apache.spark.sql.hive.HiveQl$.getAst(HiveQl.scala:218) > at org.apache.spark.sql.hive.HiveQl$.createPlan(HiveQl.scala:226) > at > 
org.apache.spark.sql.hive.ExtendedHiveQlParser$$anonfun$hiveQl$1.apply(ExtendedHiveQlParser.scala:50) > at > org.apache.spark.sql.hive.ExtendedHiveQlParser$$anonfun$hiveQl$1.apply(ExtendedHiveQlParser.scala:49) > at scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:136) > at scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:135) > at > scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:242) > at > scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:242) > at > scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222) > at > scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1$$anonfun$apply$2.apply(Parsers.scala:254) > at > scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1$$anonfun$apply$2.apply(Parsers.scala:254) > at > scala.util.parsing.combinator.Parsers$Failure.append(Parsers.scala:202) > at > scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254) > at > scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254) > at > scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222) > at > scala.util.parsing.combinator.Parsers$$anon$2$$anonfun$apply$14.apply(Parsers.scala:891) > at > scala.util.parsing.combinator.Parsers$$anon$2$$anonfun$apply$14.apply(Parsers.scala:891) > at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57) > at > scala.util.parsing.combinator.Parsers$$anon$2.apply(Parsers.scala:890) > at > scala.util.parsing.combinator.PackratParsers$$anon$1.apply(PackratParsers.scala:110) > at > org.apache.spark.sql.catalyst.AbstractSparkSQLParser.apply(SparkSQLParser.scala:31) > at org.apache.spark.sql.hive.HiveQl$$anonfun$3.apply(HiveQl.scala:130) > at org.apache.spark.sql.hive.HiveQl$$anonfun$3.apply(HiveQl.scala:130) > at > 
org.apache.spark.sql.catalyst.SparkSQLParser$$anonfun$org$apache$spark$sql$catalyst$SparkSQLParser$$others$1.apply(SparkSQLParser.scala:184) > at > org.apache.spark.sql.catalyst.SparkSQLParser$$anonfun$org$apache$spark$sql$catalyst$SparkSQLParser$$others$1.apply(SparkSQLParser.scala:183) > at scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:136) > at scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:135) > at > scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:242) > at > scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:242) > at > scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222) > at > scala.util.parsing.comb
[jira] [Updated] (SPARK-3940) Print the error code three times
[ https://issues.apache.org/jira/browse/SPARK-3940?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Ash updated SPARK-3940: -- Fix Version/s: (was: 1.3.0) (was: 1.0.3) > Print the error code three times > - > > Key: SPARK-3940 > URL: https://issues.apache.org/jira/browse/SPARK-3940 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.1.0 >Reporter: wangxj > Labels: patch > Original Estimate: 0.05h > Remaining Estimate: 0.05h > > if an error of SQL,the console print Error three times。
[jira] [Updated] (SPARK-3940) Print the error code three times
[ https://issues.apache.org/jira/browse/SPARK-3940?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Ash updated SPARK-3940: -- Description: if an error of SQL,the console print Error three times。 eg: {noformat} spark-sql> show tablesss; show tablesss; 14/10/13 20:56:29 INFO ParseDriver: Parsing command: show tablesss NoViableAltException(26@[598:1: ddlStatement : ( createDatabaseStatement | switchDatabaseStatement | dropDatabaseStatement | createTableStatement | dropTableStatement | truncateTableStatement | alterStatement | descStatement | showStatement | metastoreCheck | createViewStatement | dropViewStatement | createFunctionStatement | createMacroStatement | createIndexStatement | dropIndexStatement | dropFunctionStatement | dropMacroStatement | analyzeStatement | lockStatement | unlockStatement | createRoleStatement | dropRoleStatement | grantPrivileges | revokePrivileges | showGrants | showRoleGrants | grantRole | revokeRole );]) at org.antlr.runtime.DFA.noViableAlt(DFA.java:158) at org.antlr.runtime.DFA.predict(DFA.java:144) at org.apache.hadoop.hive.ql.parse.HiveParser.ddlStatement(HiveParser.java:1962) at org.apache.hadoop.hive.ql.parse.HiveParser.execStatement(HiveParser.java:1298) at org.apache.hadoop.hive.ql.parse.HiveParser.statement(HiveParser.java:938) at org.apache.hadoop.hive.ql.parse.ParseDriver.parse(ParseDriver.java:190) at org.apache.hadoop.hive.ql.parse.ParseDriver.parse(ParseDriver.java:161) at org.apache.spark.sql.hive.HiveQl$.getAst(HiveQl.scala:218) at org.apache.spark.sql.hive.HiveQl$.createPlan(HiveQl.scala:226) at org.apache.spark.sql.hive.ExtendedHiveQlParser$$anonfun$hiveQl$1.apply(ExtendedHiveQlParser.scala:50) at org.apache.spark.sql.hive.ExtendedHiveQlParser$$anonfun$hiveQl$1.apply(ExtendedHiveQlParser.scala:49) at scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:136) at scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:135) at 
scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:242) at scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:242) at scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222) at scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1$$anonfun$apply$2.apply(Parsers.scala:254) at scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1$$anonfun$apply$2.apply(Parsers.scala:254) at scala.util.parsing.combinator.Parsers$Failure.append(Parsers.scala:202) at scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254) at scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254) at scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222) at scala.util.parsing.combinator.Parsers$$anon$2$$anonfun$apply$14.apply(Parsers.scala:891) at scala.util.parsing.combinator.Parsers$$anon$2$$anonfun$apply$14.apply(Parsers.scala:891) at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57) at scala.util.parsing.combinator.Parsers$$anon$2.apply(Parsers.scala:890) at scala.util.parsing.combinator.PackratParsers$$anon$1.apply(PackratParsers.scala:110) at org.apache.spark.sql.catalyst.AbstractSparkSQLParser.apply(SparkSQLParser.scala:31) at org.apache.spark.sql.hive.HiveQl$$anonfun$3.apply(HiveQl.scala:130) at org.apache.spark.sql.hive.HiveQl$$anonfun$3.apply(HiveQl.scala:130) at org.apache.spark.sql.catalyst.SparkSQLParser$$anonfun$org$apache$spark$sql$catalyst$SparkSQLParser$$others$1.apply(SparkSQLParser.scala:184) at org.apache.spark.sql.catalyst.SparkSQLParser$$anonfun$org$apache$spark$sql$catalyst$SparkSQLParser$$others$1.apply(SparkSQLParser.scala:183) at scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:136) at scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:135) at scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:242) at 
scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:242) at scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222) at scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1$$anonfun$apply$2.apply(Parsers.scala:254) at scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1$$anonfun$apply$2.apply(Parsers.scala:254) at scala.util.parsing.combinator.Parsers$Failure.append(Parsers.scala:202) at scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254) at scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254) at scala.util.parsing.combinator.Pa
[jira] [Commented] (SPARK-3736) Workers should reconnect to Master if disconnected
[ https://issues.apache.org/jira/browse/SPARK-3736?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14174753#comment-14174753 ]

Andrew Ash commented on SPARK-3736:
---

The configuration for Hadoop's retry policy was added in HDFS-3504:

{quote}
+ * Return the default retry policy used in RPC.
+ *
+ * If dfs.client.retry.policy.enabled == false, use TRY_ONCE_THEN_FAIL.
+ *
+ * Otherwise, first unwrap ServiceException if possible, and then
+ * (1) use multipleLinearRandomRetry for
+ * - SafeModeException, or
+ * - IOException other than RemoteException, or
+ * - ServiceException; and
+ * (2) use TRY_ONCE_THEN_FAIL for
+ * - non-SafeMode RemoteException, or
+ * - non-IOException.
+ *
+ * Note that dfs.client.retry.max < 0 is not allowed.
{quote}

From https://github.com/apache/hadoop/commit/45fafc2b8fc1aab0a082600b0d50ad693491ea70#diff-36b19e9d8816002ed9dff8580055d3fbR44 it looks like the default policy is to retry every 10 seconds for 6 attempts, and then every 60 seconds for 10 attempts.

> Workers should reconnect to Master if disconnected
> --
>
> Key: SPARK-3736
> URL: https://issues.apache.org/jira/browse/SPARK-3736
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 1.0.2, 1.1.0
> Reporter: Andrew Ash
> Assignee: Matthew Cheah
> Priority: Critical
>
> In standalone mode, when a worker gets disconnected from the master for some
> reason it never attempts to reconnect. In this situation you have to bounce
> the worker before it will reconnect to the master.
> The preferred alternative is to follow what Hadoop does -- when there's a
> disconnect, attempt to reconnect at a particular interval until successful (I
> think it repeats indefinitely every 10sec).
> This has been observed by:
> - [~pkolaczk] in
> http://apache-spark-user-list.1001560.n3.nabble.com/Workers-disconnected-from-master-sometimes-and-never-reconnect-back-td6240.html
> - [~romi-totango] in
> http://apache-spark-user-list.1001560.n3.nabble.com/Re-Workers-disconnected-from-master-sometimes-and-never-reconnect-back-td15335.html
> - [~aash]
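The two-phase schedule described in the comment above (6 attempts spaced 10 seconds apart, then 10 attempts spaced 60 seconds apart) could be sketched for a worker reconnect loop roughly as follows. This is only an illustration under assumptions: `RetrySketch`, `Phase`, `delays` and `retry` are hypothetical names, not Spark or Hadoop APIs; the waits are fixed rather than randomized; and the sleep function is passed in so the logic can be tried without real waiting.

```scala
import scala.annotation.tailrec
import scala.util.{Failure, Success, Try}

object RetrySketch {
  // One phase of the schedule: wait `sleepSeconds` before each of `attempts` retries.
  final case class Phase(sleepSeconds: Int, attempts: Int)

  // Values taken from the comment above: 6 retries at 10 s, then 10 retries at 60 s.
  val defaultPhases: List[Phase] = List(Phase(10, 6), Phase(60, 10))

  // Expand the phases into the flat list of waits (in seconds) between attempts.
  def delays(phases: List[Phase]): List[Int] =
    phases.flatMap(p => List.fill(p.attempts)(p.sleepSeconds))

  // Call `connect` until it succeeds or the schedule is exhausted.
  // `sleep` is injected (an assumption of this sketch) so no real waiting is needed in tests.
  def retry[A](phases: List[Phase], sleep: Int => Unit)(connect: () => A): Try[A] = {
    @tailrec
    def loop(remaining: List[Int]): Try[A] =
      Try(connect()) match {
        case s @ Success(_) => s
        case f @ Failure(_) =>
          remaining match {
            case Nil       => f // schedule exhausted: give up with the last error
            case d :: rest => sleep(d); loop(rest)
          }
      }
    loop(delays(phases))
  }
}
```

With the default phases the schedule contains 16 waits. A caller would pass something like `s => Thread.sleep(s * 1000L)` as `sleep`; the indefinite retry suggested in the issue description would need an unbounded final phase instead.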
[jira] [Commented] (SPARK-3957) Broadcast variable memory usage not reflected in UI
[ https://issues.apache.org/jira/browse/SPARK-3957?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14174747#comment-14174747 ]

Nan Zhu commented on SPARK-3957:

OK, while working on the executor tab I realized that we will eventually need a per-executor record of broadcast usage, so I will still follow the heartbeat-based strategy.

> Broadcast variable memory usage not reflected in UI
> ---
>
> Key: SPARK-3957
> URL: https://issues.apache.org/jira/browse/SPARK-3957
> Project: Spark
> Issue Type: Bug
> Components: Block Manager, Web UI
> Affects Versions: 1.0.2, 1.1.0
> Reporter: Shivaram Venkataraman
> Assignee: Nan Zhu
>
> Memory used by broadcast variables is not reflected in the memory usage
> reported in the WebUI. For example, the executors tab shows memory used in
> each executor but this number doesn't include memory used by broadcast
> variables. Similarly, the storage tab only shows the list of cached RDDs and how
> much memory they use.
> We should add a separate column / tab for broadcast variables to make it
> easier to debug.
[jira] [Created] (SPARK-3987) NNLS generates incorrect result
Debasish Das created SPARK-3987: --- Summary: NNLS generates incorrect result Key: SPARK-3987 URL: https://issues.apache.org/jira/browse/SPARK-3987 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 1.1.0 Reporter: Debasish Das Hi, Please see the example gram matrix and linear term: val P2 = new DoubleMatrix(20, 20, 333907.312770, -60814.043975, 207935.829941, -162881.367739, -43730.396770, 17511.428983, -243340.496449, -225245.957922, 104700.445881, 32430.845099, 336378.693135, -373497.970207, -41147.159621, 53928.060360, -293517.883778, 53105.278068, 0.00, -85257.781696, 84913.970469, -10584.080103, -60814.043975, 13826.806664, -38032.612640, 33475.833875, 10791.916809, -1040.950810, 48106.552472, 45390.073380, -16310.282190, -2861.455903, -60790.833191, 73109.516544, 9826.614644, -8283.992464, 56991.742991, -6171.366034, 0.00, 19152.382499, -13218.721710, 2793.734234, 207935.829941, -38032.612640, 129661.677608, -101682.098412, -27401.299347, 10787.713362, -151803.006149, -140563.601672, 65067.935324, 20031.263383, 209521.268600, -232958.054688, -25764.179034, 33507.951918, -183046.845592, 32884.782835, 0.00, -53315.811196, 52770.762546, -6642.187643, -162881.367739, 33475.833875, -101682.098412, 85094.407608, 25422.850782, -5437.646141, 124197.166330, 116206.265909, -47093.484134, -11420.168521, -163429.436848, 189574.783900, 23447.172314, -24087.375367, 148311.355507, -20848.385466, 0.00, 46835.814559, -38180.352878, 6415.873901, -43730.396770, 10791.916809, -27401.299347, 25422.850782, 8882.869799, 15.638084, 35933.473986, 34186.371325, -10745.330690, -974.314375, -43537.709621, 54371.010558, 7894.453004, -5408.929644, 42231.381747, -3192.010574, 0.00, 15058.753110, -8704.757256, 2316.581535, 17511.428983, -1040.950810, 10787.713362, -5437.646141, 15.638084, 2794.949847, -9681.950987, -8258.171646, 7754.358930, 4193.359412, 18052.143842, -15456.096769, -253.356253, 4089.672804, -12524.380088, 5651.579348, 0.00, -1513.302547, 6296.461898, 
152.427321, -243340.496449, 48106.552472, -151803.006149, 124197.166330, 35933.473986, -9681.950987, 182931.600236, 170454.352953, -72361.174145, -19270.461728, -244518.179729, 279551.060579, 33340.452802, -37103.267653, 219025.288975, -33687.141423, 0.00, 67347.950443, -58673.009647, 8957.800259, -225245.957922, 45390.073380, -140563.601672, 116206.265909, 34186.371325, -8258.171646, 170454.352953, 159322.942894, -66074.960534, -16839.743193, -226173.967766, 260421.044094, 31624.194003, -33839.612565, 203889.695169, -30034.828909, 0.00, 63525.040745, -53572.741748, 8575.071847, 104700.445881, -16310.282190, 65067.935324, -47093.484134, -10745.330690, 7754.358930, -72361.174145, -66074.960534, 35869.598076, 13378.653317, 106033.647837, -111831.682883, -10455.465743, 18537.392481, -88370.612394, 20344.288488, 0.00, -22935.482766, 29004.543704, -2409.461759, 32430.845099, -2861.455903, 20031.263383, -11420.168521, -974.314375, 4193.359412, -19270.461728, -16839.743193, 13378.653317, 6802.081898, 33256.395091, -30421.985199, -1296.785870, 7026.518692, -24443.378205, 9221.982599, 0.00, -4088.076871, 10861.014242, -25.092938, 336378.693135, -60790.833191, 209521.268600, -163429.436848, -43537.709621, 18052.143842, -244518.179729, -226173.967766, 106033.647837, 33256.395091, 339200.268106, -375442.716811, -41027.594509, 54636.778527, -295133.248586, 54177.278365, 0.00, -85237.666701, 85996.957056, -10503.209968, -373497.970207, 73109.516544, -232958.054688, 189574.783900, 54371.010558, -15456.096769, 279551.060579, 260421.044094, -111831.682883, -30421.985199, -375442.716811, 427793.208465, 50528.074431, -57375.986301, 335203.382015, -52676.385869, 0.00, 102368.307670, -90679.792485, 13509.390393, -41147.159621, 9826.614644, -25764.179034, 23447.172314, 7894.453004, -253.356253, 33340.452802, 31624.194003, -10455.465743, -1296.785870, -41027.594509, 50528.074431, 7255.977434, -5281.636812, 39298.355527, -3440.450858, 0.00, 13717.870243, -8471.405582, 2071.812204, 
53928.060360, -8283.992464, 33507.951918, -24087.375367, -5408.929644, 4089.672804, -37103.267653, -33839.612565, 18537.392481, 7026.518692, 54636.778527, -57375.986301, -5281.636812, 9735.061160, -45360.674033, 10634.633559, 0.00, -11652.364691, 15039.566630, -1202.539106, -293517.883778, 56991.742991, -183046.845592, 148311.355507, 42231.381747, -12524.380088, 219025.288975, 203889.695169, -88370.612394, -24443.378205, -295133.248586, 335203.382015, 39298.355527, -45360.674033, 262923.925938, -42012.606885, 0.00, 79810.919951, -71657.856143, 10464.327491, 53105.278068, -6171.366034, 32884.782835, -20848.385466, -3192.010574, 5651.579348, -33687.141423, -30034.828909, 20344.288488, 9221.982599, 54177.278365, -52676.385869, -3440.450858, 10634.6335
[jira] [Commented] (SPARK-3986) Fix package names to fit their directory names.
[ https://issues.apache.org/jira/browse/SPARK-3986?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14174696#comment-14174696 ] Apache Spark commented on SPARK-3986: - User 'ueshin' has created a pull request for this issue: https://github.com/apache/spark/pull/2835 > Fix package names to fit their directory names. > --- > > Key: SPARK-3986 > URL: https://issues.apache.org/jira/browse/SPARK-3986 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Takuya Ueshin >Priority: Minor > > Package names of 2 test suites are different from their directory names. > - {{GeneratedEvaluationSuite}} > - {{GeneratedMutableEvaluationSuite}}
[jira] [Created] (SPARK-3986) Fix package names to fit their directory names.
Takuya Ueshin created SPARK-3986: Summary: Fix package names to fit their directory names. Key: SPARK-3986 URL: https://issues.apache.org/jira/browse/SPARK-3986 Project: Spark Issue Type: Bug Components: SQL Reporter: Takuya Ueshin Priority: Minor Package names of 2 test suites are different from their directory names. - {{GeneratedEvaluationSuite}} - {{GeneratedMutableEvaluationSuite}}
[jira] [Comment Edited] (SPARK-3957) Broadcast variable memory usage not reflected in UI
[ https://issues.apache.org/jira/browse/SPARK-3957?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14174684#comment-14174684 ]

Nan Zhu edited comment on SPARK-3957 at 10/17/14 3:04 AM:
--

[~andrewor14], why don't we report broadcast variable resource usage to BlockManagerMaster in the current implementation?

was (Author: codingcat):
[~andrewor14], why didn't we report broadcast variable resource usage to BlockManagerMaster in the current implementation?

> Broadcast variable memory usage not reflected in UI
> ---
>
> Key: SPARK-3957
> URL: https://issues.apache.org/jira/browse/SPARK-3957
> Project: Spark
> Issue Type: Bug
> Components: Block Manager, Web UI
> Affects Versions: 1.0.2, 1.1.0
> Reporter: Shivaram Venkataraman
> Assignee: Nan Zhu
>
> Memory used by broadcast variables is not reflected in the memory usage
> reported in the WebUI. For example, the executors tab shows memory used in
> each executor but this number doesn't include memory used by broadcast
> variables. Similarly, the storage tab only shows the list of cached RDDs and how
> much memory they use.
> We should add a separate column / tab for broadcast variables to make it
> easier to debug.
[jira] [Commented] (SPARK-3957) Broadcast variable memory usage not reflected in UI
[ https://issues.apache.org/jira/browse/SPARK-3957?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14174684#comment-14174684 ]

Nan Zhu commented on SPARK-3957:

[~andrewor14], why didn't we report broadcast variable resource usage to BlockManagerMaster in the current implementation?

> Broadcast variable memory usage not reflected in UI
> ---
>
> Key: SPARK-3957
> URL: https://issues.apache.org/jira/browse/SPARK-3957
> Project: Spark
> Issue Type: Bug
> Components: Block Manager, Web UI
> Affects Versions: 1.0.2, 1.1.0
> Reporter: Shivaram Venkataraman
> Assignee: Nan Zhu
>
> Memory used by broadcast variables is not reflected in the memory usage
> reported in the WebUI. For example, the executors tab shows memory used in
> each executor but this number doesn't include memory used by broadcast
> variables. Similarly, the storage tab only shows the list of cached RDDs and how
> much memory they use.
> We should add a separate column / tab for broadcast variables to make it
> easier to debug.
[jira] [Commented] (SPARK-3985) json file path is not right
[ https://issues.apache.org/jira/browse/SPARK-3985?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14174683#comment-14174683 ]

Apache Spark commented on SPARK-3985:
-

User 'adrian-wang' has created a pull request for this issue:
https://github.com/apache/spark/pull/2834

> json file path is not right
> ---
>
> Key: SPARK-3985
> URL: https://issues.apache.org/jira/browse/SPARK-3985
> Project: Spark
> Issue Type: Bug
> Components: Examples
> Reporter: Adrian Wang
> Assignee: Adrian Wang
> Priority: Minor
>
> In examples/src/main/python/sql.py, SPARK_HOME and "examples/..." are simply
> concatenated instead of being joined with "os.path.join", which can produce a
> wrong path (for example when SPARK_HOME has no trailing separator).
[jira] [Created] (SPARK-3985) json file path is not right
Adrian Wang created SPARK-3985:
--

Summary: json file path is not right
Key: SPARK-3985
URL: https://issues.apache.org/jira/browse/SPARK-3985
Project: Spark
Issue Type: Bug
Components: Examples
Reporter: Adrian Wang
Priority: Minor

In examples/src/main/python/sql.py, SPARK_HOME and "examples/..." are simply concatenated instead of being joined with "os.path.join", which can produce a wrong path (for example when SPARK_HOME has no trailing separator).
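The pitfall is not specific to Python. A rough JVM-side sketch of the same idea, assuming a Unix-like file system, compares plain concatenation with a proper join. `PathJoinSketch`, its methods and the sample paths are hypothetical and only illustrate the point; the fix in sql.py itself would use os.path.join, as the issue says.

```scala
import java.nio.file.Paths

object PathJoinSketch {
  // Naive concatenation: only correct if `home` happens to end with a separator.
  def concat(home: String, relative: String): String = home + relative

  // Paths.get joins the components with the platform separator
  // and collapses a doubled separator on Unix-like systems.
  def join(home: String, relative: String): String =
    Paths.get(home, relative).toString

  def main(args: Array[String]): Unit = {
    // Hypothetical values; the real relative path is elided in the issue.
    val home = "/opt/spark"
    val relative = "examples/data.json"
    println(concat(home, relative)) // separator is missing between the two parts
    println(join(home, relative))   // separator is inserted
  }
}
```

The join variant gives the same result whether or not the home directory is written with a trailing slash, which is the property the issue is asking for.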
[jira] [Commented] (SPARK-3957) Broadcast variable memory usage not reflected in UI
[ https://issues.apache.org/jira/browse/SPARK-3957?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14174675#comment-14174675 ]

Nan Zhu commented on SPARK-3957:

BlockId can directly tell whether the corresponding block is a broadcast variable.

> Broadcast variable memory usage not reflected in UI
> ---
>
> Key: SPARK-3957
> URL: https://issues.apache.org/jira/browse/SPARK-3957
> Project: Spark
> Issue Type: Bug
> Components: Block Manager, Web UI
> Affects Versions: 1.0.2, 1.1.0
> Reporter: Shivaram Venkataraman
> Assignee: Nan Zhu
>
> Memory used by broadcast variables is not reflected in the memory usage
> reported in the WebUI. For example, the executors tab shows memory used in
> each executor but this number doesn't include memory used by broadcast
> variables. Similarly, the storage tab only shows the list of cached RDDs and how
> much memory they use.
> We should add a separate column / tab for broadcast variables to make it
> easier to debug.
[jira] [Resolved] (SPARK-3973) Print callSite information for broadcast variables
[ https://issues.apache.org/jira/browse/SPARK-3973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen resolved SPARK-3973. --- Resolution: Fixed Issue resolved by pull request 2829 [https://github.com/apache/spark/pull/2829] > Print callSite information for broadcast variables > -- > > Key: SPARK-3973 > URL: https://issues.apache.org/jira/browse/SPARK-3973 > Project: Spark > Issue Type: Bug > Components: Spark Core >Reporter: Shivaram Venkataraman >Assignee: Shivaram Venkataraman >Priority: Minor > Fix For: 1.2.0 > > > Printing call site information for broadcast variables will help in debugging > which variables are used, when they are used etc.
[jira] [Comment Edited] (SPARK-3963) Support getting task-scoped properties from TaskContext
[ https://issues.apache.org/jira/browse/SPARK-3963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14174660#comment-14174660 ] Patrick Wendell edited comment on SPARK-3963 at 10/17/14 2:35 AM: -- It does make sense longer term to merge this with TaskMetrics - but I'm proposing here just a simple API for users to get properties that are strings. What we want in the long term is definitely something more general than "metrics" (hence the naming "properties" here). It would be fairly simple to extend this to have e.g. `getIntProperty` where type safety is wanted. So for now I'd prefer not to do either of those things, the bigger question is does this tie our hands from doing them in the future. I think it doesn't... was (Author: pwendell): It does make sense longer term to merge this with TaskMetrics - but I'm proposing here just a simple API for users to get properties that are strings. So for now I'd prefer not to do either of those things. > Support getting task-scoped properties from TaskContext > --- > > Key: SPARK-3963 > URL: https://issues.apache.org/jira/browse/SPARK-3963 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Reporter: Patrick Wendell > > This is a proposal for a minor feature. Given stabilization of the > TaskContext API, it would be nice to have a mechanism for Spark jobs to > access properties that are defined based on task-level scope by Spark RDD's. > I'd like to propose adding a simple properties hash map with some standard > spark properties that users can access. Later it would be nice to support > users setting these properties, but for now to keep it simple in 1.2. I'd > prefer users not be able to set them. > The main use case is providing the file name from Hadoop RDD's, a very common > request. But I'd imagine us using this for other things later on. We could > also use this to expose some of the taskMetrics, such as e.g. the input bytes. 
> {code}
> val data = sc.textFile("s3n//..2014/*/*/*.json")
> data.mapPartitions { iter =>
>   // Look up the file backing this partition via the proposed task-scoped property.
>   val tc = TaskContext.get
>   val fileName = tc.getProperty(TaskContext.HADOOP_FILE_NAME)
>   val parts = fileName.split("/")
>   val (year, month, day) = (parts(3), parts(4), parts(5))
>   ...
> }
> {code}
> Internally we'd have a method called setProperty, but this wouldn't be exposed initially. This is structured as a simple (String, String) hash map for ease of porting to Python.
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
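The SPARK-3963 proposal above (a read-only (String, String) property map scoped to the running task) could be sketched roughly as follows. This is only an illustration under stated assumptions: TaskProperties, HadoopFileName, the thread-local storage and dateParts are hypothetical names invented here, not Spark's API, and the setter is left public only so the sketch runs standalone.

```scala
// Hypothetical sketch; none of these names are part of Spark's API.
object TaskProperties {
  // Illustrative key, standing in for the proposed TaskContext.HADOOP_FILE_NAME.
  val HadoopFileName = "hadoop.file.name"

  // One property map per thread, on the assumption that a task runs on one thread.
  private val props = new ThreadLocal[Map[String, String]] {
    override def initialValue(): Map[String, String] = Map.empty
  }

  // Read-only accessor: the only method users would call in this sketch.
  def getProperty(key: String): Option[String] = props.get().get(key)

  // Mirrors the proposal's internal setProperty; it would not be exposed to users.
  def setProperty(key: String, value: String): Unit =
    props.set(props.get() + (key -> value))

  // Extract (year, month, day) from path segments 3, 4 and 5, as in the ticket's example.
  def dateParts(fileName: String): Option[(String, String, String)] = {
    val parts = fileName.split("/")
    if (parts.length > 5) Some((parts(3), parts(4), parts(5))) else None
  }
}
```

Returning Option instead of a bare String is a choice made for this sketch; the ticket does not say how a missing property would be reported.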
[jira] [Commented] (SPARK-3957) Broadcast variable memory usage not reflected in UI
[ https://issues.apache.org/jira/browse/SPARK-3957?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14174664#comment-14174664 ] Nan Zhu commented on SPARK-3957: After looking at the problem more closely, I think we might just set the tellMaster flag to true to get this information (after a put, it will report to BlockManagerMaster), instead of introducing a fat heartbeat message or opening a new channel. The only thing we need to add is a way to distinguish RDDs from broadcast variables in BlockStatus. What do you think? > Broadcast variable memory usage not reflected in UI > --- > > Key: SPARK-3957 > URL: https://issues.apache.org/jira/browse/SPARK-3957 > Project: Spark > Issue Type: Bug > Components: Block Manager, Web UI >Affects Versions: 1.0.2, 1.1.0 >Reporter: Shivaram Venkataraman >Assignee: Nan Zhu > > Memory used by broadcast variables is not reflected in the memory usage > reported in the WebUI. For example, the executors tab shows memory used in > each executor but this number doesn't include memory used by broadcast > variables. Similarly the storage tab only shows the list of cached RDDs and how > much memory they use. > We should add a separate column / tab for broadcast variables to make it > easier to debug. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3067) JobProgressPage could not show Fair Scheduler Pools section sometimes
[ https://issues.apache.org/jira/browse/SPARK-3067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-3067: - Affects Version/s: (was: 1.2.0) 1.1.0 > JobProgressPage could not show Fair Scheduler Pools section sometimes > - > > Key: SPARK-3067 > URL: https://issues.apache.org/jira/browse/SPARK-3067 > Project: Spark > Issue Type: Bug > Components: Spark Core, Web UI >Affects Versions: 1.1.0 >Reporter: YanTang Zhai >Priority: Minor > Fix For: 1.2.0 > > > JobProgressPage could not show Fair Scheduler Pools section sometimes. > SparkContext starts webui and then postEnvironmentUpdate. Sometimes > JobProgressPage is accessed between webui starting and postEnvironmentUpdate, > then the lazy val isFairScheduler will be false. The Fair Scheduler Pools > section will not display any more. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
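The mechanism described in SPARK-3067 (a lazy val read before the environment update stays false forever) can be shown with a small standalone sketch. SchedulerState, PageWithLazyVal and PageWithDef are hypothetical stand-ins invented for illustration; they are not the actual JobProgressPage code, and the fix that was merged may look different.

```scala
// Hypothetical stand-in for state that is only published after the UI has started.
class SchedulerState {
  @volatile var mode: Option[String] = None
}

class PageWithLazyVal(state: SchedulerState) {
  // Evaluated once, on first access, and then cached for the object's lifetime.
  lazy val isFairScheduler: Boolean = state.mode == Some("FAIR")
}

class PageWithDef(state: SchedulerState) {
  // Re-evaluated on every access, so a late update is still observed.
  def isFairScheduler: Boolean = state.mode == Some("FAIR")
}
```

If the page is rendered once before mode is set, the lazy val variant keeps reporting false afterwards, while the def variant picks up the new value on the next access.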
[jira] [Closed] (SPARK-3067) JobProgressPage could not show Fair Scheduler Pools section sometimes
[ https://issues.apache.org/jira/browse/SPARK-3067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or closed SPARK-3067. Resolution: Fixed Fix Version/s: 1.2.0 Assignee: YanTang Zhai > JobProgressPage could not show Fair Scheduler Pools section sometimes > - > > Key: SPARK-3067 > URL: https://issues.apache.org/jira/browse/SPARK-3067 > Project: Spark > Issue Type: Bug > Components: Spark Core, Web UI >Affects Versions: 1.1.0 >Reporter: YanTang Zhai >Assignee: YanTang Zhai >Priority: Minor > Fix For: 1.2.0 > > > JobProgressPage could not show Fair Scheduler Pools section sometimes. > SparkContext starts webui and then postEnvironmentUpdate. Sometimes > JobProgressPage is accessed between webui starting and postEnvironmentUpdate, > then the lazy val isFairScheduler will be false. The Fair Scheduler Pools > section will not display any more. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3067) JobProgressPage could not show Fair Scheduler Pools section sometimes
[ https://issues.apache.org/jira/browse/SPARK-3067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-3067: - Affects Version/s: 1.2.0 > JobProgressPage could not show Fair Scheduler Pools section sometimes > - > > Key: SPARK-3067 > URL: https://issues.apache.org/jira/browse/SPARK-3067 > Project: Spark > Issue Type: Bug > Components: Spark Core, Web UI >Affects Versions: 1.2.0 >Reporter: YanTang Zhai >Priority: Minor > > JobProgressPage could not show Fair Scheduler Pools section sometimes. > SparkContext starts webui and then postEnvironmentUpdate. Sometimes > JobProgressPage is accessed between webui starting and postEnvironmentUpdate, > then the lazy val isFairScheduler will be false. The Fair Scheduler Pools > section will not display any more. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-3963) Support getting task-scoped properties from TaskContext
[ https://issues.apache.org/jira/browse/SPARK-3963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14174660#comment-14174660 ] Patrick Wendell edited comment on SPARK-3963 at 10/17/14 2:30 AM: -- It does make sense longer term to merge this with TaskMetrics - but I'm proposing here just a simple API for users to get properties that are strings. So for now I'd prefer not to do either of those things. was (Author: pwendell): In the initial version of this - I don't want to do either of those things. > Support getting task-scoped properties from TaskContext > --- > > Key: SPARK-3963 > URL: https://issues.apache.org/jira/browse/SPARK-3963 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Reporter: Patrick Wendell > > This is a proposal for a minor feature. Given stabilization of the > TaskContext API, it would be nice to have a mechanism for Spark jobs to > access properties that are defined based on task-level scope by Spark RDD's. > I'd like to propose adding a simple properties hash map with some standard > spark properties that users can access. Later it would be nice to support > users setting these properties, but for now to keep it simple in 1.2. I'd > prefer users not be able to set them. > The main use case is providing the file name from Hadoop RDD's, a very common > request. But I'd imagine us using this for other things later on. We could > also use this to expose some of the taskMetrics, such as e.g. the input bytes.
> {code}
> val data = sc.textFile("s3n//..2014/*/*/*.json")
> data.mapPartitions { iter =>
>   // Look up the file backing this partition via the proposed task-scoped property.
>   val tc = TaskContext.get
>   val fileName = tc.getProperty(TaskContext.HADOOP_FILE_NAME)
>   val parts = fileName.split("/")
>   val (year, month, day) = (parts(3), parts(4), parts(5))
>   ...
> }
> {code}
> Internally we'd have a method called setProperty, but this wouldn't be exposed initially. This is structured as a simple (String, String) hash map for ease of porting to Python.
[jira] [Commented] (SPARK-3958) Possible stream-corruption issues in TorrentBroadcast
[ https://issues.apache.org/jira/browse/SPARK-3958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14174662#comment-14174662 ] Josh Rosen commented on SPARK-3958: --- [~davies] ran across this exception while testing a pull request that modifies TorrentBroadcast: https://github.com/apache/spark/pull/2681#issuecomment-59120483 That PR's reproduction could be a valuable debugging clue for this issue. > Possible stream-corruption issues in TorrentBroadcast > - > > Key: SPARK-3958 > URL: https://issues.apache.org/jira/browse/SPARK-3958 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.2.0 >Reporter: Josh Rosen >Assignee: Josh Rosen > > TorrentBroadcast deserialization sometimes fails with decompression errors, > which are most likely caused by stream-corruption exceptions. For example, > this can manifest itself as a Snappy PARSING_ERROR when deserializing a > broadcasted task: > {code} > 14/10/14 17:20:55.016 DEBUG BlockManager: Getting local block broadcast_8 > 14/10/14 17:20:55.016 DEBUG BlockManager: Block broadcast_8 not registered > locally > 14/10/14 17:20:55.016 INFO TorrentBroadcast: Started reading broadcast > variable 8 > 14/10/14 17:20:55.017 INFO TorrentBroadcast: Reading broadcast variable 8 > took 5.3433E-5 s > 14/10/14 17:20:55.017 ERROR Executor: Exception in task 2.0 in stage 8.0 (TID > 18) > java.io.IOException: PARSING_ERROR(2) > at org.xerial.snappy.SnappyNative.throw_error(SnappyNative.java:84) > at org.xerial.snappy.SnappyNative.uncompressedLength(Native Method) > at org.xerial.snappy.Snappy.uncompressedLength(Snappy.java:594) > at > org.xerial.snappy.SnappyInputStream.readFully(SnappyInputStream.java:125) > at > org.xerial.snappy.SnappyInputStream.readHeader(SnappyInputStream.java:88) > at org.xerial.snappy.SnappyInputStream.<init>(SnappyInputStream.java:58) > at > org.apache.spark.io.SnappyCompressionCodec.compressedInputStream(CompressionCodec.scala:128) > at > 
org.apache.spark.broadcast.TorrentBroadcast$.unBlockifyObject(TorrentBroadcast.scala:216) > at > org.apache.spark.broadcast.TorrentBroadcast.readObject(TorrentBroadcast.scala:170) > at sun.reflect.GeneratedMethodAccessor92.invoke(Unknown Source) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:606) > at > java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017) > at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893) > at > java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798) > at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) > at > java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990) > at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915) > at > java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798) > at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) > at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370) > at > org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:62) > at > org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:87) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:164) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > {code} > SPARK-3630 is an umbrella ticket for investigating all causes of these Kryo > and Snappy deserialization errors. This ticket is for a more > narrowly-focused exploration of the TorrentBroadcast version of these errors, > since the similar errors that we've seen in sort-based shuffle seem to be > explained by a different cause (see SPARK-3948). 
[jira] [Commented] (SPARK-3963) Support getting task-scoped properties from TaskContext
[ https://issues.apache.org/jira/browse/SPARK-3963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14174660#comment-14174660 ] Patrick Wendell commented on SPARK-3963: In the initial version of this - I don't want to do either of those things. > Support getting task-scoped properties from TaskContext > --- > > Key: SPARK-3963 > URL: https://issues.apache.org/jira/browse/SPARK-3963 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Reporter: Patrick Wendell > > This is a proposal for a minor feature. Given stabilization of the > TaskContext API, it would be nice to have a mechanism for Spark jobs to > access properties that are defined based on task-level scope by Spark RDD's. > I'd like to propose adding a simple properties hash map with some standard > spark properties that users can access. Later it would be nice to support > users setting these properties, but for now to keep it simple in 1.2. I'd > prefer users not be able to set them. > The main use case is providing the file name from Hadoop RDD's, a very common > request. But I'd imagine us using this for other things later on. We could > also use this to expose some of the taskMetrics, such as e.g. the input bytes.
> {code}
> val data = sc.textFile("s3n//..2014/*/*/*.json")
> data.mapPartitions { iter =>
>   // Look up the file backing this partition via the proposed task-scoped property.
>   val tc = TaskContext.get
>   val fileName = tc.getProperty(TaskContext.HADOOP_FILE_NAME)
>   val parts = fileName.split("/")
>   val (year, month, day) = (parts(3), parts(4), parts(5))
>   ...
> }
> {code}
> Internally we'd have a method called setProperty, but this wouldn't be exposed initially. This is structured as a simple (String, String) hash map for ease of porting to Python.
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3890) remove redundant spark.executor.memory in doc
[ https://issues.apache.org/jira/browse/SPARK-3890?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-3890: - Affects Version/s: (was: 1.2.0) 1.1.0 > remove redundant spark.executor.memory in doc > - > > Key: SPARK-3890 > URL: https://issues.apache.org/jira/browse/SPARK-3890 > Project: Spark > Issue Type: Improvement > Components: Documentation >Affects Versions: 1.1.0 >Reporter: WangTaoTheTonic >Priority: Minor > Fix For: 1.1.1, 1.2.0 > > > Seems like there is a redundant spark.executor.memory config item in docs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-3890) remove redundant spark.executor.memory in doc
[ https://issues.apache.org/jira/browse/SPARK-3890?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or closed SPARK-3890. Resolution: Fixed Fix Version/s: 1.2.0 1.1.1 Assignee: WangTaoTheTonic Target Version/s: 1.1.1, 1.2.0 > remove redundant spark.executor.memory in doc > - > > Key: SPARK-3890 > URL: https://issues.apache.org/jira/browse/SPARK-3890 > Project: Spark > Issue Type: Improvement > Components: Documentation >Affects Versions: 1.1.0 >Reporter: WangTaoTheTonic >Assignee: WangTaoTheTonic >Priority: Minor > Fix For: 1.1.1, 1.2.0 > > > Seems like there is a redundant spark.executor.memory config item in docs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3890) remove redundant spark.executor.memory in doc
[ https://issues.apache.org/jira/browse/SPARK-3890?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-3890: - Affects Version/s: 1.2.0 > remove redundant spark.executor.memory in doc > - > > Key: SPARK-3890 > URL: https://issues.apache.org/jira/browse/SPARK-3890 > Project: Spark > Issue Type: Improvement > Components: Documentation >Affects Versions: 1.1.0 >Reporter: WangTaoTheTonic >Priority: Minor > Fix For: 1.1.1, 1.2.0 > > > Seems like there is a redundant spark.executor.memory config item in docs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-3941) _remainingMem should not increase twice when updateBlockInfo
[ https://issues.apache.org/jira/browse/SPARK-3941?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or closed SPARK-3941. Resolution: Fixed Fix Version/s: 1.2.0 Assignee: Zhang, Liye Target Version/s: 1.2.0 > _remainingMem should not increase twice when updateBlockInfo > > > Key: SPARK-3941 > URL: https://issues.apache.org/jira/browse/SPARK-3941 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.1.0 >Reporter: Zhang, Liye >Assignee: Zhang, Liye > Fix For: 1.2.0 > > > In BlockManagerMasterActor, _remainingMem is increased by memSize twice > during updateBlockInfo if the new storageLevel is invalid. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-3923) All Standalone Mode services time out with each other
[ https://issues.apache.org/jira/browse/SPARK-3923?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or closed SPARK-3923. Resolution: Fixed Fix Version/s: 1.2.0 Assignee: Aaron Davidson Target Version/s: 1.2.0 > All Standalone Mode services time out with each other > - > > Key: SPARK-3923 > URL: https://issues.apache.org/jira/browse/SPARK-3923 > Project: Spark > Issue Type: Bug > Components: Deploy >Affects Versions: 1.2.0 >Reporter: Aaron Davidson >Assignee: Aaron Davidson >Priority: Blocker > Fix For: 1.2.0 > > > I'm seeing an issue where it seems that components in Standalone Mode > (Worker, Master, Driver, and Executor) all seem to time out with each other > after around 1000 seconds. Here is an example log: > {code} > 14/10/13 06:43:55 INFO Master: Registering worker > ip-10-0-147-189.us-west-2.compute.internal:38922 with 4 cores, 29.0 GB RAM > 14/10/13 06:43:55 INFO Master: Registering worker > ip-10-0-175-214.us-west-2.compute.internal:42918 with 4 cores, 59.0 GB RAM > 14/10/13 06:43:56 INFO Master: Registering app Databricks Shell > 14/10/13 06:43:56 INFO Master: Registered app Databricks Shell with ID > app-20141013064356- > ... precisely 1000 seconds later ... > 14/10/13 07:00:35 WARN ReliableDeliverySupervisor: Association with remote > system > [akka.tcp://sparkwor...@ip-10-0-147-189.us-west-2.compute.internal:38922] has > failed, address is now gated for [5000] ms. Reason is: [Disassociated]. > 14/10/13 07:00:35 INFO Master: > akka.tcp://sparkwor...@ip-10-0-147-189.us-west-2.compute.internal:38922 got > disassociated, removing it. > 14/10/13 07:00:35 INFO LocalActorRef: Message > [akka.remote.transport.ActorTransportAdapter$DisassociateUnderlying] from > Actor[akka://sparkMaster/deadLetters] to > Actor[akka://sparkMaster/system/transports/akkaprotocolmanager.tcp0/akkaProtocol-tcp%3A%2F%2FsparkMaster%4010.0.147.189%3A54956-1#1529980245] > was not delivered. [2] dead letters encountered. 
This logging can be turned > off or adjusted with configuration settings 'akka.log-dead-letters' and > 'akka.log-dead-letters-during-shutdown'. > 14/10/13 07:00:35 INFO Master: > akka.tcp://sparkwor...@ip-10-0-175-214.us-west-2.compute.internal:42918 got > disassociated, removing it. > 14/10/13 07:00:35 INFO Master: Removing worker > worker-20141013064354-ip-10-0-175-214.us-west-2.compute.internal-42918 on > ip-10-0-175-214.us-west-2.compute.internal:42918 > 14/10/13 07:00:35 INFO Master: Telling app of lost executor: 1 > 14/10/13 07:00:35 INFO Master: > akka.tcp://sparkwor...@ip-10-0-175-214.us-west-2.compute.internal:42918 got > disassociated, removing it. > 14/10/13 07:00:35 WARN ReliableDeliverySupervisor: Association with remote > system > [akka.tcp://sparkwor...@ip-10-0-175-214.us-west-2.compute.internal:42918] has > failed, address is now gated for [5000] ms. Reason is: [Disassociated]. > 14/10/13 07:00:35 INFO LocalActorRef: Message > [akka.remote.transport.ActorTransportAdapter$DisassociateUnderlying] from > Actor[akka://sparkMaster/deadLetters] to > Actor[akka://sparkMaster/system/transports/akkaprotocolmanager.tcp0/akkaProtocol-tcp%3A%2F%2FsparkMaster%4010.0.175.214%3A35958-2#314633324] > was not delivered. [3] dead letters encountered. This logging can be turned > off or adjusted with configuration settings 'akka.log-dead-letters' and > 'akka.log-dead-letters-during-shutdown'. > 14/10/13 07:00:35 INFO LocalActorRef: Message > [akka.remote.transport.AssociationHandle$Disassociated] from > Actor[akka://sparkMaster/deadLetters] to > Actor[akka://sparkMaster/system/transports/akkaprotocolmanager.tcp0/akkaProtocol-tcp%3A%2F%2FsparkMaster%4010.0.175.214%3A35958-2#314633324] > was not delivered. [4] dead letters encountered. This logging can be turned > off or adjusted with configuration settings 'akka.log-dead-letters' and > 'akka.log-dead-letters-during-shutdown'. > 14/10/13 07:00:36 INFO ProtocolStateActor: No response from remote. 
Handshake > timed out or transport failure detector triggered. > 14/10/13 07:00:36 INFO Master: > akka.tcp://sparkdri...@ip-10-0-175-215.us-west-2.compute.internal:58259 got > disassociated, removing it. > 14/10/13 07:00:36 INFO LocalActorRef: Message > [akka.remote.transport.AssociationHandle$InboundPayload] from > Actor[akka://sparkMaster/deadLetters] to > Actor[akka://sparkMaster/system/transports/akkaprotocolmanager.tcp0/akkaProtocol-tcp%3A%2F%2FsparkMaster%4010.0.175.215%3A41987-3#1944377249] > was not delivered. [5] dead letters encountered. This logging can be turned > off or adjusted with configuration settings 'akka.log-dead-letters' and > 'akka.log-dead-letters-during-shutdown'. > 14/10/13 07:00:36 INFO Master: Removing app app-20141013064356- > 14/10/13
[jira] [Updated] (SPARK-3973) Print callSite information for broadcast variables
[ https://issues.apache.org/jira/browse/SPARK-3973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-3973: --- Component/s: Spark Core > Print callSite information for broadcast variables > -- > > Key: SPARK-3973 > URL: https://issues.apache.org/jira/browse/SPARK-3973 > Project: Spark > Issue Type: Bug > Components: Spark Core >Reporter: Shivaram Venkataraman >Assignee: Shivaram Venkataraman >Priority: Minor > Fix For: 1.2.0 > > > Printing call site information for broadcast variables will help in debugging > which variables are used, when they are used etc. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3882) JobProgressListener gets permanently out of sync with long running job
[ https://issues.apache.org/jira/browse/SPARK-3882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14174629#comment-14174629 ] Patrick Wendell commented on SPARK-3882: This is a known issue (SPARK-2316) that was fixed in Spark 1.1. To verify that you are hitting the same issue, would you mind testing your job with Spark 1.1 and seeing if you observe it? > JobProgressListener gets permanently out of sync with long running job > -- > > Key: SPARK-3882 > URL: https://issues.apache.org/jira/browse/SPARK-3882 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 1.0.2 >Reporter: Davis Shepherd > Attachments: Screen Shot 2014-10-03 at 12.50.59 PM.png > > > A long running spark context (non-streaming) will eventually start throwing > the following in the driver: > {code} > java.util.NoSuchElementException: key not found: 12771 > at scala.collection.MapLike$class.default(MapLike.scala:228) > at scala.collection.AbstractMap.default(Map.scala:58) > at scala.collection.mutable.HashMap.apply(HashMap.scala:64) > at > org.apache.spark.ui.jobs.JobProgressListener.onStageCompleted(JobProgressListener.scala:79) > at > org.apache.spark.scheduler.SparkListenerBus$$anonfun$postToAll$2.apply(SparkListenerBus.scala:48) > at > org.apache.spark.scheduler.SparkListenerBus$$anonfun$postToAll$2.apply(SparkListenerBus.scala:48) > at > org.apache.spark.scheduler.SparkListenerBus$$anonfun$foreachListener$1.apply(SparkListenerBus.scala:81) > at > org.apache.spark.scheduler.SparkListenerBus$$anonfun$foreachListener$1.apply(SparkListenerBus.scala:79) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) > at > org.apache.spark.scheduler.SparkListenerBus$class.foreachListener(SparkListenerBus.scala:79) > at > org.apache.spark.scheduler.SparkListenerBus$class.postToAll(SparkListenerBus.scala:48) > at > 
org.apache.spark.scheduler.LiveListenerBus.postToAll(LiveListenerBus.scala:32) > at > org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(LiveListenerBus.scala:56) > at > org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(LiveListenerBus.scala:56) > at scala.Option.foreach(Option.scala:236) > at > org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1.apply$mcV$sp(LiveListenerBus.scala:56) > at > org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1.apply(LiveListenerBus.scala:47) > at > org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1.apply(LiveListenerBus.scala:47) > at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1160) > at > org.apache.spark.scheduler.LiveListenerBus$$anon$1.run(LiveListenerBus.scala:46) > 2014-10-09 18:45:33,523 [SparkListenerBus] ERROR > org.apache.spark.scheduler.LiveListenerBus - Listener JobProgressListener > threw an exception > java.util.NoSuchElementException: key not found: 12782 > at scala.collection.MapLike$class.default(MapLike.scala:228) > at scala.collection.AbstractMap.default(Map.scala:58) > at scala.collection.mutable.HashMap.apply(HashMap.scala:64) > at > org.apache.spark.ui.jobs.JobProgressListener.onStageCompleted(JobProgressListener.scala:79) > at > org.apache.spark.scheduler.SparkListenerBus$$anonfun$postToAll$2.apply(SparkListenerBus.scala:48) > at > org.apache.spark.scheduler.SparkListenerBus$$anonfun$postToAll$2.apply(SparkListenerBus.scala:48) > at > org.apache.spark.scheduler.SparkListenerBus$$anonfun$foreachListener$1.apply(SparkListenerBus.scala:81) > at > org.apache.spark.scheduler.SparkListenerBus$$anonfun$foreachListener$1.apply(SparkListenerBus.scala:79) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) > at > 
org.apache.spark.scheduler.SparkListenerBus$class.foreachListener(SparkListenerBus.scala:79) > at > org.apache.spark.scheduler.SparkListenerBus$class.postToAll(SparkListenerBus.scala:48) > at > org.apache.spark.scheduler.LiveListenerBus.postToAll(LiveListenerBus.scala:32) > at > org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(LiveListenerBus.scala:56) > at > org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(LiveListenerBus.scala:56) > at scala.Option.foreach(Option.scala:236) > at > org.apache.spark.scheduler.LiveListenerBu
[jira] [Updated] (SPARK-3977) Conversions between {Row, Coordinate}Matrix <-> BlockMatrix
[ https://issues.apache.org/jira/browse/SPARK-3977?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reza Zadeh updated SPARK-3977: -- Component/s: MLlib > Conversions between {Row, Coordinate}Matrix <-> BlockMatrix > --- > > Key: SPARK-3977 > URL: https://issues.apache.org/jira/browse/SPARK-3977 > Project: Spark > Issue Type: Improvement > Components: MLlib >Reporter: Reza Zadeh > > Build conversion functions between {Row, Coordinate}Matrix <-> BlockMatrix -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3976) Detect block matrix partitioning schemes
[ https://issues.apache.org/jira/browse/SPARK-3976?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reza Zadeh updated SPARK-3976: -- Component/s: MLlib > Detect block matrix partitioning schemes > > > Key: SPARK-3976 > URL: https://issues.apache.org/jira/browse/SPARK-3976 > Project: Spark > Issue Type: Improvement > Components: MLlib >Reporter: Reza Zadeh > > Provide repartitioning methods for block matrices to repartition matrix for > add/multiply of non-identically partitioned matrices -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3974) Block matrix abstractions and partitioners
[ https://issues.apache.org/jira/browse/SPARK-3974?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reza Zadeh updated SPARK-3974: -- Component/s: MLlib > Block matrix abstractions and partitioners > -- > > Key: SPARK-3974 > URL: https://issues.apache.org/jira/browse/SPARK-3974 > Project: Spark > Issue Type: Improvement > Components: MLlib >Reporter: Reza Zadeh > > We need abstractions for block matrices with fixed block sizes, with each > block being dense. Partitioners along both rows and columns are required. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3975) Block Matrix addition and multiplication
[ https://issues.apache.org/jira/browse/SPARK-3975?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-3975: --- Component/s: MLlib > Block Matrix addition and multiplication > > > Key: SPARK-3975 > URL: https://issues.apache.org/jira/browse/SPARK-3975 > Project: Spark > Issue Type: Improvement > Components: MLlib >Reporter: Reza Zadeh > > Block matrix addition and multiplication, for the case when partitioning > schemes match. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
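As a rough illustration of the "partitioning schemes match" case in SPARK-3975, block matrix addition reduces to adding blocks that share the same grid index. The sketch below works on local collections only; Block, BlockMatrixOps and the Map-based layout are assumptions made for illustration and say nothing about how MLlib would actually represent or distribute blocks. Multiplication is omitted.

```scala
// Hypothetical dense block stored row-major; not MLlib's representation.
final case class Block(rows: Int, cols: Int, values: Vector[Double]) {
  require(values.length == rows * cols, "values must hold rows * cols entries")

  // Element-wise sum of two blocks with identical shape.
  def +(other: Block): Block = {
    require(rows == other.rows && cols == other.cols, "block shapes must match")
    Block(rows, cols, values.zip(other.values).map { case (a, b) => a + b })
  }
}

object BlockMatrixOps {
  // A block matrix keyed by (blockRow, blockCol) grid index.
  type BlockMatrix = Map[(Int, Int), Block]

  // Addition when both operands use the same block grid: add block by block.
  def add(a: BlockMatrix, b: BlockMatrix): BlockMatrix = {
    require(a.keySet == b.keySet, "partitioning schemes must match")
    a.map { case (idx, blk) => idx -> (blk + b(idx)) }
  }
}
```

When the grids differ, the operands would first need to be repartitioned onto a common grid, which is the subject of SPARK-3976.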
[jira] [Resolved] (SPARK-3874) Provide stable TaskContext API
[ https://issues.apache.org/jira/browse/SPARK-3874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell resolved SPARK-3874. Resolution: Fixed Fix Version/s: 1.2.0 > Provide stable TaskContext API > -- > > Key: SPARK-3874 > URL: https://issues.apache.org/jira/browse/SPARK-3874 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Reporter: Patrick Wendell >Assignee: Prashant Sharma > Fix For: 1.2.0 > > > We made some improvements in SPARK-3543 but for Spark 1.2 we should convert > TaskContext into a fully stable API. To do this I’d suggest the following > changes - note that some of this reverses parts of SPARK-3543. The goal is to > provide a class that users can’t easily construct and exposes only the public > functionality. > 1. Separate TaskContext into a public abstract class (TaskContext) and a > private implementation called TaskContextImpl. The former should be a Java > abstract class - the latter should be a private[spark] Scala class to reduce > visibility (or maybe we can keep it as Java and tell people not to use it?). > 2. TaskContext abstract class will have (NOTE: this changes getXX() to XX() > intentionally) > public isCompleted() > public isInterrupted() > public addTaskCompletionListener(...) > public addTaskCompletionCallback(...) (deprecated) > public stageId() > public partitionId() > public attemptId() > public isRunningLocally() > STATIC > public get() > set() and unset() at default visibility > 3. A new private[spark] static object TaskContextHelper in the same package > as TaskContext will exist to expose set() and unset() from within Spark using > forwarder methods that just call TaskContext.set(). If someone within Spark > wants to set this they call TaskContextHelper.set() and it forwards it. > 4. TaskContextImpl will be used whenever we construct a TaskContext > internally. 
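The four steps above can be sketched in a few lines of Scala. This is a simplified illustration, not the code that was merged: every name carries a "Sketch" suffix to mark it as hypothetical, the listener and callback methods are left out, the abstract class is written in Scala rather than Java, and the visibility restrictions (default visibility, private[spark]) are only described in comments so that the sketch runs without a package layout.

```scala
// Hypothetical sketch; the "Sketch" suffix marks names that are not Spark's.
abstract class TaskContextSketch {
  def isCompleted: Boolean
  def isInterrupted: Boolean
  def stageId: Int
  def partitionId: Int
  def attemptId: Long
  def isRunningLocally: Boolean
}

object TaskContextSketch {
  private val current = new ThreadLocal[TaskContextSketch]

  // Public entry point: the context of the task on this thread, or null if none.
  def get(): TaskContextSketch = current.get()

  // The proposal gives these default (package) visibility; they are left
  // unrestricted here only so the sketch compiles standalone.
  def set(tc: TaskContextSketch): Unit = current.set(tc)
  def unset(): Unit = current.remove()
}

// Forwarder in the spirit of the proposed TaskContextHelper.
object TaskContextHelperSketch {
  def set(tc: TaskContextSketch): Unit = TaskContextSketch.set(tc)
  def unset(): Unit = TaskContextSketch.unset()
}

// Concrete implementation that only internal code would construct.
final class TaskContextImplSketch(
    val stageId: Int,
    val partitionId: Int,
    val attemptId: Long,
    val isRunningLocally: Boolean = false) extends TaskContextSketch {
  @volatile private var completed = false
  @volatile private var interrupted = false

  def isCompleted: Boolean = completed
  def isInterrupted: Boolean = interrupted
  def markCompleted(): Unit = { completed = true }
  def markInterrupted(): Unit = { interrupted = true }
}
```

User code would only ever see the abstract type through get(), which is what keeps the implementation class free to change.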
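Editorial sketch of the SPARK-3874 proposal: the four steps in the issue (public abstract class, private implementation, restricted set()/unset(), forwarding helper) can be modelled roughly as below. This is a simplified illustration written entirely in Scala, not Spark's actual source. The enclosing `object spark` stands in for the real package so that `private[spark]` has something to refer to, `LocalRunner` is a hypothetical stand-in for the internal code that runs a task, and the deprecated addTaskCompletionCallback is left out.

```scala
import scala.collection.mutable.ArrayBuffer

// Simplified, hypothetical model of the proposal; not Spark's actual source.
object spark {

  // Step 1 and 2: the public surface is abstract, so users cannot construct it.
  abstract class TaskContext {
    def isCompleted: Boolean
    def isInterrupted: Boolean
    def isRunningLocally: Boolean
    def stageId: Int
    def partitionId: Int
    def attemptId: Long
    def addTaskCompletionListener(listener: TaskContext => Unit): TaskContext
  }

  object TaskContext {
    private val current = new ThreadLocal[TaskContext]

    // Public accessor: returns the context of the running task, or null if none.
    def get(): TaskContext = current.get()

    // Restricted visibility: only code inside `spark` may install a context.
    private[spark] def set(tc: TaskContext): Unit = current.set(tc)
    private[spark] def unset(): Unit = current.remove()
  }

  // Step 3: forwarder that internal callers use instead of TaskContext.set().
  private[spark] object TaskContextHelper {
    def set(tc: TaskContext): Unit = TaskContext.set(tc)
    def unset(): Unit = TaskContext.unset()
  }

  // Step 4: the only concrete implementation, hidden from user code.
  private[spark] class TaskContextImpl(
      val stageId: Int,
      val partitionId: Int,
      val attemptId: Long,
      val isRunningLocally: Boolean = false) extends TaskContext {

    private var completed = false
    private val listeners = ArrayBuffer.empty[TaskContext => Unit]

    def isCompleted: Boolean = completed
    def isInterrupted: Boolean = false // interruption is not modelled in this sketch

    def addTaskCompletionListener(listener: TaskContext => Unit): TaskContext = {
      listeners += listener
      this
    }

    private[spark] def markTaskCompleted(): Unit = {
      completed = true
      listeners.foreach(listener => listener(this))
    }
  }

  // Hypothetical stand-in for the internal code path that executes a task.
  object LocalRunner {
    def runTask[T](stageId: Int, partitionId: Int)(body: => T): T = {
      val ctx = new TaskContextImpl(stageId, partitionId, attemptId = 0L)
      TaskContextHelper.set(ctx)
      try body
      finally {
        ctx.markTaskCompleted()
        TaskContextHelper.unset()
      }
    }
  }
}
```

User code in this model only ever sees `spark.TaskContext.get()` and the abstract members; it cannot call `set()` or instantiate `TaskContextImpl` from outside `spark`.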
[jira] [Commented] (SPARK-3982) receiverStream in Python API
[ https://issues.apache.org/jira/browse/SPARK-3982?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14174580#comment-14174580 ] Apache Spark commented on SPARK-3982: - User 'davies' has created a pull request for this issue: https://github.com/apache/spark/pull/2833 > receiverStream in Python API > > > Key: SPARK-3982 > URL: https://issues.apache.org/jira/browse/SPARK-3982 > Project: Spark > Issue Type: New Feature > Components: PySpark, Streaming >Reporter: Davies Liu >Assignee: Davies Liu > > receiverStream() is used to extend the input sources of streaming, it will be > very useful to have it in Python API. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3983) Scheduler delay (shown in the UI) is incorrect
[ https://issues.apache.org/jira/browse/SPARK-3983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14174577#comment-14174577 ] Apache Spark commented on SPARK-3983: - User 'kayousterhout' has created a pull request for this issue: https://github.com/apache/spark/pull/2832 > Scheduler delay (shown in the UI) is incorrect > -- > > Key: SPARK-3983 > URL: https://issues.apache.org/jira/browse/SPARK-3983 > Project: Spark > Issue Type: Bug > Components: Web UI >Reporter: Kay Ousterhout >Assignee: Kay Ousterhout > Fix For: 1.2.0 > > > The reported scheduler delay includes time to get a new thread (from a > threadpool) in order to start the task, time to deserialize the task, and > time to serialize the result. None of these things are delay caused by the > scheduler; including them as such is misleading. > This is especially problematic when debugging performance of short tasks > (that run in 10s of milliseconds), when the scheduler delay can be very large > relative to the task duration. > cc [~sparks] [~shivaram] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3984) Display finer grained metrics about task launch overhead in the UI
[ https://issues.apache.org/jira/browse/SPARK-3984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14174575#comment-14174575 ] Kay Ousterhout commented on SPARK-3984: --- https://github.com/apache/spark/pull/2832 > Display finer grained metrics about task launch overhead in the UI > -- > > Key: SPARK-3984 > URL: https://issues.apache.org/jira/browse/SPARK-3984 > Project: Spark > Issue Type: Improvement > Components: Web UI >Reporter: Kay Ousterhout >Assignee: Kay Ousterhout > Fix For: 1.2.0 > > > Right now, the UI does not display the time to deserialize the task, to > serialize the task result, or to launch a new thread for the task. When > running short jobs (e.g., for ML) these overheads can become significant. It > would be great to show these in the summary quantiles for each stage in the > UI to facilitate better performance debugging. > cc [~sparks] [~shivaram] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Issue Comment Deleted] (SPARK-3983) Scheduler delay (shown in the UI) is incorrect
[ https://issues.apache.org/jira/browse/SPARK-3983?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kay Ousterhout updated SPARK-3983: -- Comment: was deleted (was: https://github.com/apache/spark/pull/2832) > Scheduler delay (shown in the UI) is incorrect > -- > > Key: SPARK-3983 > URL: https://issues.apache.org/jira/browse/SPARK-3983 > Project: Spark > Issue Type: Bug > Components: Web UI >Reporter: Kay Ousterhout >Assignee: Kay Ousterhout > Fix For: 1.2.0 > > > The reported scheduler delay includes time to get a new thread (from a > threadpool) in order to start the task, time to deserialize the task, and > time to serialize the result. None of these things are delay caused by the > scheduler; including them as such is misleading. > This is especially problematic when debugging performance of short tasks > (that run in 10s of milliseconds), when the scheduler delay can be very large > relative to the task duration. > cc [~sparks] [~shivaram] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3983) Scheduler delay (shown in the UI) is incorrect
[ https://issues.apache.org/jira/browse/SPARK-3983?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kay Ousterhout updated SPARK-3983: -- Component/s: Web UI > Scheduler delay (shown in the UI) is incorrect > -- > > Key: SPARK-3983 > URL: https://issues.apache.org/jira/browse/SPARK-3983 > Project: Spark > Issue Type: Bug > Components: Web UI >Reporter: Kay Ousterhout >Assignee: Kay Ousterhout > Fix For: 1.2.0 > > > The reported scheduler delay includes time to get a new thread (from a > threadpool) in order to start the task, time to deserialize the task, and > time to serialize the result. None of these things are delay caused by the > scheduler; including them as such is misleading. > This is especially problematic when debugging performance of short tasks > (that run in 10s of milliseconds), when the scheduler delay can be very large > relative to the task duration. > cc [~sparks] [~shivaram] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Issue Comment Deleted] (SPARK-3984) Display finer grained metrics about task launch overhead in the UI
[ https://issues.apache.org/jira/browse/SPARK-3984?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kay Ousterhout updated SPARK-3984: -- Comment: was deleted (was: https://github.com/apache/spark/pull/2832) > Display finer grained metrics about task launch overhead in the UI > -- > > Key: SPARK-3984 > URL: https://issues.apache.org/jira/browse/SPARK-3984 > Project: Spark > Issue Type: Improvement > Components: Web UI >Reporter: Kay Ousterhout >Assignee: Kay Ousterhout > Fix For: 1.2.0 > > > Right now, the UI does not display the time to deserialize the task, to > serialize the task result, or to launch a new thread for the task. When > running short jobs (e.g., for ML) these overheads can become significant. It > would be great to show these in the summary quantiles for each stage in the > UI to facilitate better performance debugging. > cc [~sparks] [~shivaram] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3984) Display finer grained metrics about task launch overhead in the UI
[ https://issues.apache.org/jira/browse/SPARK-3984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14174576#comment-14174576 ] Apache Spark commented on SPARK-3984: - User 'kayousterhout' has created a pull request for this issue: https://github.com/apache/spark/pull/2832 > Display finer grained metrics about task launch overhead in the UI > -- > > Key: SPARK-3984 > URL: https://issues.apache.org/jira/browse/SPARK-3984 > Project: Spark > Issue Type: Improvement > Components: Web UI >Reporter: Kay Ousterhout >Assignee: Kay Ousterhout > Fix For: 1.2.0 > > > Right now, the UI does not display the time to deserialize the task, to > serialize the task result, or to launch a new thread for the task. When > running short jobs (e.g., for ML) these overheads can become significant. It > would be great to show these in the summary quantiles for each stage in the > UI to facilitate better performance debugging. > cc [~sparks] [~shivaram] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3983) Scheduler delay (shown in the UI) is incorrect
[ https://issues.apache.org/jira/browse/SPARK-3983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14174574#comment-14174574 ] Kay Ousterhout commented on SPARK-3983: --- https://github.com/apache/spark/pull/2832 > Scheduler delay (shown in the UI) is incorrect > -- > > Key: SPARK-3983 > URL: https://issues.apache.org/jira/browse/SPARK-3983 > Project: Spark > Issue Type: Bug >Reporter: Kay Ousterhout >Assignee: Kay Ousterhout > Fix For: 1.2.0 > > > The reported scheduler delay includes time to get a new thread (from a > threadpool) in order to start the task, time to deserialize the task, and > time to serialize the result. None of these things are delay caused by the > scheduler; including them as such is misleading. > This is especially problematic when debugging performance of short tasks > (that run in 10s of milliseconds), when the scheduler delay can be very large > relative to the task duration. > cc [~sparks] [~shivaram] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-3984) Display finer grained metrics about task launch overhead in the UI
Kay Ousterhout created SPARK-3984: - Summary: Display finer grained metrics about task launch overhead in the UI Key: SPARK-3984 URL: https://issues.apache.org/jira/browse/SPARK-3984 Project: Spark Issue Type: Improvement Components: Web UI Reporter: Kay Ousterhout Assignee: Kay Ousterhout Fix For: 1.2.0 Right now, the UI does not display the time to deserialize the task, to serialize the task result, or to launch a new thread for the task. When running short jobs (e.g., for ML) these overheads can become significant. It would be great to show these in the summary quantiles for each stage in the UI to facilitate better performance debugging. cc [~sparks] [~shivaram] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
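Editorial sketch for SPARK-3984: the issue asks for the overhead times to appear in the per-stage summary quantiles. A minimal way to compute such a summary from per-task values is shown below. The function name, the choice of five quantile points, and the simple index-based quantile rule are assumptions made for illustration; the issue does not say how the UI computes its quantiles.

```scala
object QuantileSketch {
  // Returns one value per requested probability, picked from the sorted input.
  // The index rule (floor(p * n), capped at n - 1) is an assumption of this sketch.
  def quantiles(
      values: Seq[Long],
      probabilities: Seq[Double] = Seq(0.0, 0.25, 0.5, 0.75, 1.0)): Seq[Long] = {
    require(values.nonEmpty, "need at least one task to summarize")
    val sorted = values.sorted.toIndexedSeq
    probabilities.map { p =>
      val idx = math.min((p * sorted.length).toInt, sorted.length - 1)
      sorted(idx)
    }
  }

  def main(args: Array[String]): Unit = {
    // Hypothetical task deserialization times for one stage, in milliseconds.
    val deserializationMs = Seq(5L, 1L, 9L, 3L, 7L)
    println(quantiles(deserializationMs)) // List(1, 3, 5, 7, 9)
  }
}
```

The same function could be applied separately to deserialization time, result serialization time, and thread launch time to fill one summary row each.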
[jira] [Comment Edited] (SPARK-3877) The exit code of spark-submit is still 0 when an yarn application fails
[ https://issues.apache.org/jira/browse/SPARK-3877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14174507#comment-14174507 ] Marcelo Vanzin edited comment on SPARK-3877 at 10/17/14 12:08 AM: -- [~tgraves] this can be seen as a subset of SPARK-2167, but as I mentioned on that bug, I don't think it's fixable for all cases. SparkSubmit is executing user code, so it can only report errors when the user code does. e.g., a job like this would report an error today {code} val sc = ... try { // do stuff if (somethingBad) throw MyJobFailedException() } finally { sc.stop() } {code} But this one wouldn't: {code} val sc = ... try { // do stuff if (somethingBad) throw MyJobFailedException() } catch { case e: Exception => logError("Oops, something bad happened.", e) } finally { sc.stop() } {code} yarn-client mode will abruptly stop the SparkContext when the Yarn app fails. But depending on how the user's {{main()}} deals with errors, that still may not result in a non-zero exit status. was (Author: vanzin): [~tgraves] this can be seen as a subset of SPARK-2167, but as I mentioned on that bug, I don't think it's fixable for all cases. SparkSubmit is executing user code, so it can only report errors when the user code does. e.g., a job like this would report an error today {code} val sc = ... try { // do stuff if (somethingBad) throw MyJobFailedException() } finally { sc.stop() } {code} But this one wouldn't: {code} val sc = ... try { // do stuff if (somethingBad) throw MyJobFailedException() } catch { case e: Exception => logError("Oops, something bad happened.", e) } finally { sc.stop() } {code} yarn-client mode will abruptly stop the SparkContext when the Yarn app fails. But depending on how the user's {main()} deals with errors, that still may not result in a non-zero exit status. 
> The exit code of spark-submit is still 0 when an yarn application fails > --- > > Key: SPARK-3877 > URL: https://issues.apache.org/jira/browse/SPARK-3877 > Project: Spark > Issue Type: Bug > Components: YARN >Reporter: Shixiong Zhu >Priority: Minor > Labels: yarn > > When an yarn application fails (yarn-cluster mode), the exit code of > spark-submit is still 0. It's hard for people to write some automatic scripts > to run spark jobs in yarn because the failure can not be detected in these > scripts. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3877) The exit code of spark-submit is still 0 when an yarn application fails
[ https://issues.apache.org/jira/browse/SPARK-3877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14174507#comment-14174507 ] Marcelo Vanzin commented on SPARK-3877: --- [~tgraves] this can be seen as a subset of SPARK-2167, but as I mentioned on that bug, I don't think it's fixable for all cases. SparkSubmit is executing user code, so it can only report errors when the user code does. e.g., a job like this would report an error today {code} val sc = ... try { // do stuff if (somethingBad) throw MyJobFailedException() } finally { sc.stop() } {code} But this one wouldn't: {code} val sc = ... try { // do stuff if (somethingBad) throw MyJobFailedException() } catch { case e: Exception => logError("Oops, something bad happened.", e) } finally { sc.stop() } {code} yarn-client mode will abruptly stop the SparkContext when the Yarn app fails. But depending on how the user's {main()} deals with errors, that still may not result in a non-zero exit status. > The exit code of spark-submit is still 0 when an yarn application fails > --- > > Key: SPARK-3877 > URL: https://issues.apache.org/jira/browse/SPARK-3877 > Project: Spark > Issue Type: Bug > Components: YARN >Reporter: Shixiong Zhu >Priority: Minor > Labels: yarn > > When an yarn application fails (yarn-cluster mode), the exit code of > spark-submit is still 0. It's hard for people to write some automatic scripts > to run spark jobs in yarn because the failure can not be detected in these > scripts. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
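Editorial sketch for the SPARK-3877 comment above: the point that the launcher can only report a failure if the user's main() lets it escape can be reproduced without Spark. Everything here is a stand-in: `FakeContext` replaces SparkContext, and `exitCodeOf` is a hypothetical model of a launcher that maps "main() threw" to a non-zero status. The third variant (log, then rethrow) is an addition for illustration and is not something the comment proposes.

```scala
object ExitCodeSketch {
  class MyJobFailedException extends Exception("job failed")

  // Stand-in for SparkContext; only stop() is modelled.
  class FakeContext {
    var stopped = false
    def stop(): Unit = { stopped = true }
  }

  // Hypothetical launcher model: status 0 unless the user's main() throws.
  def exitCodeOf(userMain: () => Unit): Int =
    try { userMain(); 0 }
    catch { case _: Exception => 1 }

  // First snippet in the comment: the exception escapes, so the failure is visible.
  def propagatingMain(): Unit = {
    val sc = new FakeContext
    try {
      throw new MyJobFailedException
    } finally {
      sc.stop()
    }
  }

  // Second snippet: the exception is logged and swallowed, so main() returns normally.
  def swallowingMain(): Unit = {
    val sc = new FakeContext
    try {
      throw new MyJobFailedException
    } catch {
      case e: Exception =>
        Console.err.println(s"Oops, something bad happened: ${e.getMessage}")
    } finally {
      sc.stop()
    }
  }

  // Illustrative variant: log for diagnostics, then rethrow so the failure still escapes.
  def rethrowingMain(): Unit = {
    val sc = new FakeContext
    try {
      throw new MyJobFailedException
    } catch {
      case e: Exception =>
        Console.err.println(s"Oops, something bad happened: ${e.getMessage}")
        throw e
    } finally {
      sc.stop()
    }
  }

  def main(args: Array[String]): Unit = {
    println(exitCodeOf(() => propagatingMain())) // 1
    println(exitCodeOf(() => swallowingMain()))  // 0
    println(exitCodeOf(() => rethrowingMain()))  // 1
  }
}
```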
[jira] [Updated] (SPARK-3983) Scheduler delay (shown in the UI) is incorrect
[ https://issues.apache.org/jira/browse/SPARK-3983?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kay Ousterhout updated SPARK-3983: -- Description: The reported scheduler delay includes time to get a new thread (from a threadpool) in order to start the task, time to deserialize the task, and time to serialize the result. None of these things are delay caused by the scheduler; including them as such is misleading. This is especially problematic when debugging performance of short tasks (that run in 10s of milliseconds), when the scheduler delay can be very large relative to the task duration. cc [~sparks] [~shivaram] was: The reported scheduler delay includes time to get a new thread (from a threadpool) in order to start the task, time to deserialize the task, and time to serialize the result. None of these things are delay caused by the scheduler; including them as such is misleading. cc [~sparks] [~shivaram] > Scheduler delay (shown in the UI) is incorrect > -- > > Key: SPARK-3983 > URL: https://issues.apache.org/jira/browse/SPARK-3983 > Project: Spark > Issue Type: Bug >Reporter: Kay Ousterhout >Assignee: Kay Ousterhout > Fix For: 1.2.0 > > > The reported scheduler delay includes time to get a new thread (from a > threadpool) in order to start the task, time to deserialize the task, and > time to serialize the result. None of these things are delay caused by the > scheduler; including them as such is misleading. > This is especially problematic when debugging performance of short tasks > (that run in 10s of milliseconds), when the scheduler delay can be very large > relative to the task duration. > cc [~sparks] [~shivaram] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
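Editorial sketch for SPARK-3983: the arithmetic behind the complaint can be written out as follows. The field names are hypothetical, and the sketch assumes that the reported delay is the task's total duration minus the executor run time, and that the run time excludes the three overheads named in the issue. Neither assumption is confirmed by the issue text.

```scala
object SchedulerDelaySketch {
  // All fields are hypothetical names; all values are in milliseconds.
  final case class TaskTimings(
      totalDuration: Long,           // launch to finish, as seen by the driver
      executorRunTime: Long,         // time spent running the task body
      taskDeserializationTime: Long, // overhead 1 named in the issue
      resultSerializationTime: Long, // overhead 2 named in the issue
      gettingThreadTime: Long)       // overhead 3 named in the issue

  // Assumed current behaviour: everything that is not run time counts as delay.
  def reportedDelay(t: TaskTimings): Long =
    math.max(0L, t.totalDuration - t.executorRunTime)

  // Delay with the three non-scheduler overheads taken out.
  def correctedDelay(t: TaskTimings): Long =
    math.max(
      0L,
      reportedDelay(t) - t.taskDeserializationTime -
        t.resultSerializationTime - t.gettingThreadTime)

  def main(args: Array[String]): Unit = {
    val shortTask = TaskTimings(
      totalDuration = 100L,
      executorRunTime = 20L,
      taskDeserializationTime = 30L,
      resultSerializationTime = 10L,
      gettingThreadTime = 15L)
    println(reportedDelay(shortTask))  // 80
    println(correctedDelay(shortTask)) // 25
  }
}
```

With these made-up numbers the reported delay is four times the run time, which is the kind of distortion the issue describes for tasks that run in tens of milliseconds.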
[jira] [Commented] (SPARK-3983) Scheduler delay (shown in the UI) is incorrect
[ https://issues.apache.org/jira/browse/SPARK-3983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14174452#comment-14174452 ] Kay Ousterhout commented on SPARK-3983: --- This is especially problematic when debugging performance of short tasks (that run in 10s of milliseconds), when the scheduler delay can be very large relative to the task duration. > Scheduler delay (shown in the UI) is incorrect > -- > > Key: SPARK-3983 > URL: https://issues.apache.org/jira/browse/SPARK-3983 > Project: Spark > Issue Type: Bug >Reporter: Kay Ousterhout >Assignee: Kay Ousterhout > Fix For: 1.2.0 > > > The reported scheduler delay includes time to get a new thread (from a > threadpool) in order to start the task, time to deserialize the task, and > time to serialize the result. None of these things are delay caused by the > scheduler; including them as such is misleading. > cc [~sparks] [~shivaram] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Issue Comment Deleted] (SPARK-3983) Scheduler delay (shown in the UI) is incorrect
[ https://issues.apache.org/jira/browse/SPARK-3983?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kay Ousterhout updated SPARK-3983: -- Comment: was deleted (was: This is especially problematic when debugging performance of short tasks (that run in 10s of milliseconds), when the scheduler delay can be very large relative to the task duration.) > Scheduler delay (shown in the UI) is incorrect > -- > > Key: SPARK-3983 > URL: https://issues.apache.org/jira/browse/SPARK-3983 > Project: Spark > Issue Type: Bug >Reporter: Kay Ousterhout >Assignee: Kay Ousterhout > Fix For: 1.2.0 > > > The reported scheduler delay includes time to get a new thread (from a > threadpool) in order to start the task, time to deserialize the task, and > time to serialize the result. None of these things are delay caused by the > scheduler; including them as such is misleading. > cc [~sparks] [~shivaram] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-3983) Scheduler delay (shown in the UI) is incorrect
Kay Ousterhout created SPARK-3983: - Summary: Scheduler delay (shown in the UI) is incorrect Key: SPARK-3983 URL: https://issues.apache.org/jira/browse/SPARK-3983 Project: Spark Issue Type: Bug Reporter: Kay Ousterhout Assignee: Kay Ousterhout Fix For: 1.2.0 The reported scheduler delay includes time to get a new thread (from a threadpool) in order to start the task, time to deserialize the task, and time to serialize the result. None of these things are delay caused by the scheduler; including them as such is misleading. cc [~sparks] [~shivaram] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3980) GraphX Performance Issue
[ https://issues.apache.org/jira/browse/SPARK-3980?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jarred Li updated SPARK-3980: - Description: I run 4 workes in AWS (c3.xlarge), 4g memory for executor, 85,331,846 edges from(http://socialcomputing.asu.edu/uploads/1296759055/Twitter-dataset.zip). For PageRank algorithm, the job can not be completed within 7 hours. For small dataset with 5,000,000 edges(http://socialcomputing.asu.edu/uploads/1296591553/Last.fm-dataset.zip) , the job can be completed within 16 seconds. was:I run 4 workes in AWS (c3.xlarge), 4g memory for executor, 85,331,846 edges from(http://socialcomputing.asu.edu/uploads/1296759055/Twitter-dataset.zip). For PageRank algorithm, the job can not be completed within 7 hours. > GraphX Performance Issue > > > Key: SPARK-3980 > URL: https://issues.apache.org/jira/browse/SPARK-3980 > Project: Spark > Issue Type: Bug > Components: GraphX >Affects Versions: 1.1.0 >Reporter: Jarred Li > > I run 4 workes in AWS (c3.xlarge), 4g memory for executor, 85,331,846 edges > from(http://socialcomputing.asu.edu/uploads/1296759055/Twitter-dataset.zip). > For PageRank algorithm, the job can not be completed within 7 hours. For > small dataset with 5,000,000 > edges(http://socialcomputing.asu.edu/uploads/1296591553/Last.fm-dataset.zip) > , the job can be completed within 16 seconds. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-3982) receiverStream in Python API
Davies Liu created SPARK-3982: - Summary: receiverStream in Python API Key: SPARK-3982 URL: https://issues.apache.org/jira/browse/SPARK-3982 Project: Spark Issue Type: New Feature Components: PySpark, Streaming Reporter: Davies Liu receiverStream() is used to extend the input sources of streaming, it will be very useful to have it in Python API. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3980) GraphX Performance Issue
[ https://issues.apache.org/jira/browse/SPARK-3980?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jarred Li updated SPARK-3980: - Description: I run 4 workes in AWS (c3.xlarge), 4g memory for executor, 85,331,846 edges from(http://socialcomputing.asu.edu/uploads/1296759055/Twitter-dataset.zip). For PageRank algorithm, the job can not be completed within 7 hours. (was: I run 4 workes in AWS (c3.xlarge), 4g memory for executor, 85,331,846 edges from(http://socialcomputing.asu.edu/uploads/1296759055/Twitter-dataset.zip). For PageRank algorithm, the job can not be completed withon 7 hours.) > GraphX Performance Issue > > > Key: SPARK-3980 > URL: https://issues.apache.org/jira/browse/SPARK-3980 > Project: Spark > Issue Type: Bug > Components: GraphX >Affects Versions: 1.1.0 >Reporter: Jarred Li > > I run 4 workes in AWS (c3.xlarge), 4g memory for executor, 85,331,846 edges > from(http://socialcomputing.asu.edu/uploads/1296759055/Twitter-dataset.zip). > For PageRank algorithm, the job can not be completed within 7 hours. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-3971) Failed to deserialize Vector in cluster mode
[ https://issues.apache.org/jira/browse/SPARK-3971?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-3971. -- Resolution: Fixed Fix Version/s: 1.2.0 Issue resolved by pull request 2830 [https://github.com/apache/spark/pull/2830] > Failed to deserialize Vector in cluster mode > > > Key: SPARK-3971 > URL: https://issues.apache.org/jira/browse/SPARK-3971 > Project: Spark > Issue Type: Bug > Components: MLlib, PySpark >Affects Versions: 1.2.0 >Reporter: Davies Liu >Assignee: Davies Liu >Priority: Blocker > Fix For: 1.2.0 > > > The serialization of Vector/Rating did not work in cluster mode, because the > initializer is not called in executor. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-3981) Consider a better approach to initialize SerDe on executors
Xiangrui Meng created SPARK-3981: Summary: Consider a better approach to initialize SerDe on executors Key: SPARK-3981 URL: https://issues.apache.org/jira/browse/SPARK-3981 Project: Spark Issue Type: Improvement Components: MLlib, PySpark Affects Versions: 1.2.0 Reporter: Xiangrui Meng In SPARK-3971, we copied SerDe code from Core to MLlib in order to recognize MLlib types on executors as a hotfix. This is not ideal. We should find a way to add hooks to the SerDe in Core to support MLlib types in a pluggable way. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-3980) GraphX Performance Issue
Jarred Li created SPARK-3980: Summary: GraphX Performance Issue Key: SPARK-3980 URL: https://issues.apache.org/jira/browse/SPARK-3980 Project: Spark Issue Type: Bug Components: GraphX Affects Versions: 1.1.0 Reporter: Jarred Li I run 4 workes in AWS (c3.xlarge), 4g memory for executor, 85,331,846 edges from(http://socialcomputing.asu.edu/uploads/1296759055/Twitter-dataset.zip). For PageRank algorithm, the job can not be completed withon 7 hours. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3979) Yarn backend's default file replication should match HDFS's default one
[ https://issues.apache.org/jira/browse/SPARK-3979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14174224#comment-14174224 ] Apache Spark commented on SPARK-3979: - User 'vanzin' has created a pull request for this issue: https://github.com/apache/spark/pull/2831 > Yarn backend's default file replication should match HDFS's default one > --- > > Key: SPARK-3979 > URL: https://issues.apache.org/jira/browse/SPARK-3979 > Project: Spark > Issue Type: Bug > Components: YARN >Reporter: Marcelo Vanzin >Assignee: Marcelo Vanzin >Priority: Minor > > This code in ClientBase.scala sets the replication used for files uploaded to > HDFS: > {code} > val replication = sparkConf.getInt("spark.yarn.submit.file.replication", > 3).toShort > {code} > Instead of a hardcoded "3" (which is the default value for HDFS), it should > be using the default value from the HDFS conf ("dfs.replication"). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3979) Yarn backend's default file replication should match HDFS's default one
[ https://issues.apache.org/jira/browse/SPARK-3979?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin updated SPARK-3979: -- Description: This code in ClientBase.scala sets the replication used for files uploaded to HDFS: {code} val replication = sparkConf.getInt("spark.yarn.submit.file.replication", 3).toShort {code} Instead of a hardcoded "3" (which is the default value for HDFS), it should be using the default value from the HDFS conf ("dfs.replication"). was: This code in ClientBase.scala sets the replication used for files uploaded to HDFS: {{noformat}} val replication = sparkConf.getInt("spark.yarn.submit.file.replication", 3).toShort {{noformat}} Instead of a hardcoded "3" (which is the default value for HDFS), it should be using the default value from the HDFS conf ("dfs.replication"). > Yarn backend's default file replication should match HDFS's default one > --- > > Key: SPARK-3979 > URL: https://issues.apache.org/jira/browse/SPARK-3979 > Project: Spark > Issue Type: Bug > Components: YARN >Reporter: Marcelo Vanzin >Assignee: Marcelo Vanzin >Priority: Minor > > This code in ClientBase.scala sets the replication used for files uploaded to > HDFS: > {code} > val replication = sparkConf.getInt("spark.yarn.submit.file.replication", > 3).toShort > {code} > Instead of a hardcoded "3" (which is the default value for HDFS), it should > be using the default value from the HDFS conf ("dfs.replication"). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
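Editorial sketch for SPARK-3979: the lookup order the issue asks for (explicit Spark setting first, then the HDFS default from "dfs.replication", then 3) can be expressed as below. Plain `Map[String, String]` values stand in for SparkConf and the Hadoop configuration so that the example runs without either library; `replicationFor` is a hypothetical helper name, and the actual change in the linked pull request may look different.

```scala
object ReplicationSketch {
  // sparkConf and hadoopConf are plain maps standing in for the real config objects.
  def replicationFor(
      sparkConf: Map[String, String],
      hadoopConf: Map[String, String]): Short =
    sparkConf.get("spark.yarn.submit.file.replication")
      .orElse(hadoopConf.get("dfs.replication")) // cluster default instead of a hardcoded 3
      .map(_.toShort)
      .getOrElse(3.toShort)                      // HDFS's own default, per the issue

  def main(args: Array[String]): Unit = {
    // A cluster that caps replication at 1, as in the stack trace quoted in the comments.
    val singleNodeHdfs = Map("dfs.replication" -> "1")
    println(replicationFor(Map.empty, singleNodeHdfs)) // 1
    println(replicationFor(Map.empty, Map.empty))      // 3
  }
}
```

Under this ordering an explicit spark.yarn.submit.file.replication still wins, so existing user settings keep their meaning.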
[jira] [Commented] (SPARK-3979) Yarn backend's default file replication should match HDFS's default one
[ https://issues.apache.org/jira/browse/SPARK-3979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14174165#comment-14174165 ] Marcelo Vanzin commented on SPARK-3979: --- BTW, this would avoid issues like this: {noformat} Exception in thread "main" org.apache.hadoop.ipc.RemoteException(java.io.IOException): file /user/systest/.sparkStaging/application_1413485082283_0001/spark-assembly-1.2.0-SNAPSHOT-hadoop2.3.0.jar. Requested replication 3 exceeds maximum 1 at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.verifyReplication(BlockManager.java:943) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.setReplicationInt(FSNamesystem.java:2243) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.setReplication(FSNamesystem.java:2233) ... at org.apache.spark.deploy.yarn.ClientBase$class.copyFileToRemote(ClientBase.scala:101) {noformat} > Yarn backend's default file replication should match HDFS's default one > --- > > Key: SPARK-3979 > URL: https://issues.apache.org/jira/browse/SPARK-3979 > Project: Spark > Issue Type: Bug > Components: YARN >Reporter: Marcelo Vanzin >Priority: Minor > > This code in ClientBase.scala sets the replication used for files uploaded to > HDFS: > {{noformat}} > val replication = sparkConf.getInt("spark.yarn.submit.file.replication", > 3).toShort > {{noformat}} > Instead of a hardcoded "3" (which is the default value for HDFS), it should > be using the default value from the HDFS conf ("dfs.replication"). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-3979) Yarn backend's default file replication should match HDFS's default one
Marcelo Vanzin created SPARK-3979: - Summary: Yarn backend's default file replication should match HDFS's default one Key: SPARK-3979 URL: https://issues.apache.org/jira/browse/SPARK-3979 Project: Spark Issue Type: Bug Components: YARN Reporter: Marcelo Vanzin Priority: Minor This code in ClientBase.scala sets the replication used for files uploaded to HDFS: {{noformat}} val replication = sparkConf.getInt("spark.yarn.submit.file.replication", 3).toShort {{noformat}} Instead of a hardcoded "3" (which is the default value for HDFS), it should be using the default value from the HDFS conf ("dfs.replication"). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3971) Failed to deserialize Vector in cluster mode
[ https://issues.apache.org/jira/browse/SPARK-3971?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-3971: - Assignee: Davies Liu > Failed to deserialize Vector in cluster mode > > > Key: SPARK-3971 > URL: https://issues.apache.org/jira/browse/SPARK-3971 > Project: Spark > Issue Type: Bug > Components: MLlib, PySpark >Affects Versions: 1.2.0 >Reporter: Davies Liu >Assignee: Davies Liu >Priority: Blocker > > The serialization of Vector/Rating did not work in cluster mode, because the > initializer is not called in executor. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3971) Failed to deserialize Vector in cluster mode
[ https://issues.apache.org/jira/browse/SPARK-3971?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xiangrui Meng updated SPARK-3971:
-
Target Version/s: 1.2.0
Affects Version/s: 1.2.0

> Failed to deserialize Vector in cluster mode
>
> Key: SPARK-3971
> URL: https://issues.apache.org/jira/browse/SPARK-3971
> Project: Spark
> Issue Type: Bug
> Components: MLlib, PySpark
> Affects Versions: 1.2.0
> Reporter: Davies Liu
> Assignee: Davies Liu
> Priority: Blocker
>
> The serialization of Vector/Rating did not work in cluster mode, because the initializer is not called in the executor.
[jira] [Updated] (SPARK-3466) Limit size of results that a driver collects for each action
[ https://issues.apache.org/jira/browse/SPARK-3466?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrew Ash updated SPARK-3466:
--
Assignee: Matthew Cheah

> Limit size of results that a driver collects for each action
>
> Key: SPARK-3466
> URL: https://issues.apache.org/jira/browse/SPARK-3466
> Project: Spark
> Issue Type: New Feature
> Components: Spark Core
> Reporter: Matei Zaharia
> Assignee: Matthew Cheah
>
> Right now, operations like {{collect()}} and {{take()}} can crash the driver with an OOM if they bring back too much data. We should add a {{spark.driver.maxResultSize}} setting (or something like that) that will make the driver abort a job if its result is too big. We can set it to some fraction of the driver's memory by default, or to something like 100 MB.
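The behaviour requested in SPARK-3466 can be sketched as a running size check on the driver side. This is an illustration in plain Python, not Spark's implementation: `collect_with_limit`, `ResultTooLargeError`, and the representation of task results as byte strings are all hypothetical, and the limit plays the role of the proposed {{spark.driver.maxResultSize}} setting.

```python
class ResultTooLargeError(Exception):
    """Raised when the collected results exceed the configured limit."""


def collect_with_limit(serialized_results, max_result_size):
    """Gather serialized task results, aborting once their total size
    exceeds max_result_size bytes (hypothetical stand-in for a
    spark.driver.maxResultSize-style setting)."""
    total = 0
    collected = []
    for chunk in serialized_results:
        total += len(chunk)
        if total > max_result_size:
            # Abort the "job" instead of letting the driver run out of memory.
            raise ResultTooLargeError(
                "total result size %d bytes exceeds limit of %d bytes"
                % (total, max_result_size)
            )
        collected.append(chunk)
    return collected


# Two 2-byte results fit within a 4-byte limit.
small = collect_with_limit([b"ab", b"cd"], max_result_size=4)
```

Checking the running total as each result arrives means the abort happens before the oversized result set has been fully accumulated in memory.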
[jira] [Created] (SPARK-3978) Schema change on Spark-Hive (Parquet file format) table not working
Nilesh Barge created SPARK-3978:
---
Summary: Schema change on Spark-Hive (Parquet file format) table not working
Key: SPARK-3978
URL: https://issues.apache.org/jira/browse/SPARK-3978
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 1.1.0
Reporter: Nilesh Barge

On the following releases: Spark 1.1.0 (built using sbt/sbt -Dhadoop.version=2.2.0 -Phive assembly), Apache HDFS 2.2

A Spark job is able to create, add to, and read data in Parquet-formatted Hive tables using HiveContext. But after the schema is changed, the Spark job is no longer able to read the data and throws the following exception:

java.lang.ArrayIndexOutOfBoundsException: 2
    at org.apache.hadoop.hive.ql.io.parquet.serde.ArrayWritableObjectInspector.getStructFieldData(ArrayWritableObjectInspector.java:127)
    at org.apache.spark.sql.hive.HadoopTableReader$$anonfun$fillObject$1.apply(TableReader.scala:284)
    at org.apache.spark.sql.hive.HadoopTableReader$$anonfun$fillObject$1.apply(TableReader.scala:278)
    at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
    at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
    at scala.collection.Iterator$class.foreach(Iterator.scala:727)
    at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
    at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
    at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
    at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
    at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
    at scala.collection.AbstractIterator.to(Iterator.scala:1157)
    at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
    at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
    at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
    at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
    at org.apache.spark.rdd.RDD$$anonfun$16.apply(RDD.scala:774)
    at org.apache.spark.rdd.RDD$$anonfun$16.apply(RDD.scala:774)
    at org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1121)
    at org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1121)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
    at org.apache.spark.scheduler.Task.run(Task.scala:54)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:744)

Code snippet, in short:

hiveContext.sql("CREATE EXTERNAL TABLE IF NOT EXISTS people_table (name String, age INT) ROW FORMAT SERDE 'parquet.hive.serde.ParquetHiveSerDe' STORED AS INPUTFORMAT 'parquet.hive.DeprecatedParquetInputFormat' OUTPUTFORMAT 'parquet.hive.DeprecatedParquetOutputFormat'");
hiveContext.sql("INSERT INTO TABLE people_table SELECT name, age FROM temp_table_people1");
hiveContext.sql("SELECT * FROM people_table"); // Here, the data was read successfully.
hiveContext.sql("ALTER TABLE people_table ADD COLUMNS (gender STRING)");
hiveContext.sql("SELECT * FROM people_table"); // Not able to read existing data; ArrayIndexOutOfBoundsException is thrown.
[jira] [Commented] (SPARK-3466) Limit size of results that a driver collects for each action
[ https://issues.apache.org/jira/browse/SPARK-3466?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14174148#comment-14174148 ]

Matt Cheah commented on SPARK-3466:
---
I'll look into this. Someone please assign to me!

> Limit size of results that a driver collects for each action
>
> Key: SPARK-3466
> URL: https://issues.apache.org/jira/browse/SPARK-3466
> Project: Spark
> Issue Type: New Feature
> Components: Spark Core
> Reporter: Matei Zaharia
>
> Right now, operations like {{collect()}} and {{take()}} can crash the driver with an OOM if they bring back too much data. We should add a {{spark.driver.maxResultSize}} setting (or something like that) that will make the driver abort a job if its result is too big. We can set it to some fraction of the driver's memory by default, or to something like 100 MB.
[jira] [Commented] (SPARK-3971) Failed to deserialize Vector in cluster mode
[ https://issues.apache.org/jira/browse/SPARK-3971?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14174146#comment-14174146 ]

Apache Spark commented on SPARK-3971:
-
User 'davies' has created a pull request for this issue: https://github.com/apache/spark/pull/2830

> Failed to deserialize Vector in cluster mode
>
> Key: SPARK-3971
> URL: https://issues.apache.org/jira/browse/SPARK-3971
> Project: Spark
> Issue Type: Bug
> Components: MLlib, PySpark
> Reporter: Davies Liu
> Priority: Blocker
>
> The serialization of Vector/Rating did not work in cluster mode, because the initializer is not called in the executor.
[jira] [Created] (SPARK-3977) Conversions between {Row, Coordinate}Matrix <-> BlockMatrix
Reza Zadeh created SPARK-3977:
-
Summary: Conversions between {Row, Coordinate}Matrix <-> BlockMatrix
Key: SPARK-3977
URL: https://issues.apache.org/jira/browse/SPARK-3977
Project: Spark
Issue Type: Improvement
Reporter: Reza Zadeh

Build conversion functions between {Row, Coordinate}Matrix <-> BlockMatrix.
[jira] [Created] (SPARK-3976) Detect block matrix partitioning schemes
Reza Zadeh created SPARK-3976:
-
Summary: Detect block matrix partitioning schemes
Key: SPARK-3976
URL: https://issues.apache.org/jira/browse/SPARK-3976
Project: Spark
Issue Type: Improvement
Reporter: Reza Zadeh

Provide repartitioning methods for block matrices, so that non-identically partitioned matrices can be repartitioned for addition and multiplication.
[jira] [Created] (SPARK-3975) Block Matrix addition and multiplication
Reza Zadeh created SPARK-3975:
-
Summary: Block Matrix addition and multiplication
Key: SPARK-3975
URL: https://issues.apache.org/jira/browse/SPARK-3975
Project: Spark
Issue Type: Improvement
Reporter: Reza Zadeh

Block matrix addition and multiplication, for the case when partitioning schemes match.
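To make "the case when partitioning schemes match" concrete, here is a small single-machine sketch in plain Python. It is not MLlib code: a block matrix is modelled, as an assumption, as a dict from (block_row, block_col) to a dense block stored as a list of rows, and the names `block_add` and `block_multiply` are hypothetical.

```python
def mat_add(a, b):
    """Element-wise sum of two dense blocks of equal shape."""
    return [[x + y for x, y in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]


def mat_mul(a, b):
    """Dense product of two blocks with compatible shapes."""
    columns_of_b = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in columns_of_b]
            for row in a]


def block_add(left, right):
    """Add two block matrices that share exactly the same block layout."""
    if left.keys() != right.keys():
        raise ValueError("block layouts differ; repartition first")
    return {key: mat_add(left[key], right[key]) for key in left}


def block_multiply(left, right):
    """Multiply block matrices: C[i, j] is the sum over k of A[i, k] * B[k, j]."""
    result = {}
    for (i, k), a_block in left.items():
        for (k2, j), b_block in right.items():
            if k != k2:
                continue  # only matching inner block indices contribute
            product = mat_mul(a_block, b_block)
            if (i, j) in result:
                result[(i, j)] = mat_add(result[(i, j)], product)
            else:
                result[(i, j)] = product
    return result


# A 1x2 grid of 1x1 blocks times a 2x1 grid of 1x1 blocks.
a = {(0, 0): [[1]], (0, 1): [[2]]}
b = {(0, 0): [[3]], (1, 0): [[4]]}
c = block_multiply(a, b)  # single block holding 1*3 + 2*4
```

In a distributed setting the inner loop would become a join on the shared block index, which is why matching partitioning schemes matter: blocks that need to meet are then already co-located.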
[jira] [Created] (SPARK-3974) Block matrix abstractions and partitioners
Reza Zadeh created SPARK-3974:
-
Summary: Block matrix abstractions and partitioners
Key: SPARK-3974
URL: https://issues.apache.org/jira/browse/SPARK-3974
Project: Spark
Issue Type: Improvement
Reporter: Reza Zadeh

We need abstractions for block matrices with fixed block sizes, with each block being dense. Partitioners along both rows and columns are required.
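One way to picture fixed-size blocks with row and column partitioners is the sketch below. It is plain Python with hypothetical names (`block_coordinates`, `row_partitioner`, `column_partitioner`) and a simple modulo assignment chosen for illustration; the issue does not prescribe any of these details.

```python
def block_coordinates(row, col, rows_per_block, cols_per_block):
    """Map a matrix entry (row, col) to the (block_row, block_col) holding it,
    assuming every block has the same fixed size."""
    return (row // rows_per_block, col // cols_per_block)


def row_partitioner(num_partitions):
    """Partition blocks by block row: all blocks of one block row
    land in the same partition."""
    def partition(block_row, block_col):
        return block_row % num_partitions
    return partition


def column_partitioner(num_partitions):
    """Partition blocks by block column: all blocks of one block column
    land in the same partition."""
    def partition(block_row, block_col):
        return block_col % num_partitions
    return partition


# Entry (5, 2) of a matrix cut into 2x2 blocks lives in block (2, 1).
block = block_coordinates(5, 2, rows_per_block=2, cols_per_block=2)
by_row = row_partitioner(2)(*block)
by_col = column_partitioner(2)(*block)
```

Keeping the two partitioners separate reflects the requirement that partitioning be possible along either dimension.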
[jira] [Updated] (SPARK-3882) JobProgressListener gets permanently out of sync with long running job
[ https://issues.apache.org/jira/browse/SPARK-3882?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-3882: --- Description: A long running spark context (non-streaming) will eventually start throwing the following in the driver: {code} java.util.NoSuchElementException: key not found: 12771 at scala.collection.MapLike$class.default(MapLike.scala:228) at scala.collection.AbstractMap.default(Map.scala:58) at scala.collection.mutable.HashMap.apply(HashMap.scala:64) at org.apache.spark.ui.jobs.JobProgressListener.onStageCompleted(JobProgressListener.scala:79) at org.apache.spark.scheduler.SparkListenerBus$$anonfun$postToAll$2.apply(SparkListenerBus.scala:48) at org.apache.spark.scheduler.SparkListenerBus$$anonfun$postToAll$2.apply(SparkListenerBus.scala:48) at org.apache.spark.scheduler.SparkListenerBus$$anonfun$foreachListener$1.apply(SparkListenerBus.scala:81) at org.apache.spark.scheduler.SparkListenerBus$$anonfun$foreachListener$1.apply(SparkListenerBus.scala:79) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at org.apache.spark.scheduler.SparkListenerBus$class.foreachListener(SparkListenerBus.scala:79) at org.apache.spark.scheduler.SparkListenerBus$class.postToAll(SparkListenerBus.scala:48) at org.apache.spark.scheduler.LiveListenerBus.postToAll(LiveListenerBus.scala:32) at org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(LiveListenerBus.scala:56) at org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(LiveListenerBus.scala:56) at scala.Option.foreach(Option.scala:236) at org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1.apply$mcV$sp(LiveListenerBus.scala:56) at org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1.apply(LiveListenerBus.scala:47) at 
org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1.apply(LiveListenerBus.scala:47) at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1160) at org.apache.spark.scheduler.LiveListenerBus$$anon$1.run(LiveListenerBus.scala:46) 2014-10-09 18:45:33,523 [SparkListenerBus] ERROR org.apache.spark.scheduler.LiveListenerBus - Listener JobProgressListener threw an exception java.util.NoSuchElementException: key not found: 12782 at scala.collection.MapLike$class.default(MapLike.scala:228) at scala.collection.AbstractMap.default(Map.scala:58) at scala.collection.mutable.HashMap.apply(HashMap.scala:64) at org.apache.spark.ui.jobs.JobProgressListener.onStageCompleted(JobProgressListener.scala:79) at org.apache.spark.scheduler.SparkListenerBus$$anonfun$postToAll$2.apply(SparkListenerBus.scala:48) at org.apache.spark.scheduler.SparkListenerBus$$anonfun$postToAll$2.apply(SparkListenerBus.scala:48) at org.apache.spark.scheduler.SparkListenerBus$$anonfun$foreachListener$1.apply(SparkListenerBus.scala:81) at org.apache.spark.scheduler.SparkListenerBus$$anonfun$foreachListener$1.apply(SparkListenerBus.scala:79) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at org.apache.spark.scheduler.SparkListenerBus$class.foreachListener(SparkListenerBus.scala:79) at org.apache.spark.scheduler.SparkListenerBus$class.postToAll(SparkListenerBus.scala:48) at org.apache.spark.scheduler.LiveListenerBus.postToAll(LiveListenerBus.scala:32) at org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(LiveListenerBus.scala:56) at org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(LiveListenerBus.scala:56) at scala.Option.foreach(Option.scala:236) at org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1.apply$mcV$sp(LiveListenerBus.scala:56) at 
org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1.apply(LiveListenerBus.scala:47) at org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1.apply(LiveListenerBus.scala:47) at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1160) at org.apache.spark.scheduler.LiveListenerBus$$anon$1.run(LiveListenerBus.scala:46) {code} And the ui will show running jobs that are in fact no longer running and never clean them up. (see attached screenshot) The result is that the ui becomes unusable, and the JobProgressListener leaks memory as the list of "running" jobs continues to grow. was: A long running spark context (non-streaming) will eventually start throwing the following in the driver: java.util.NoSuchElementExceptio
[jira] [Commented] (SPARK-3972) PySpark Error on Windows with sc.wholeTextFiles
[ https://issues.apache.org/jira/browse/SPARK-3972?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14174110#comment-14174110 ]

Michael Griffiths commented on SPARK-3972:
--
This issue does NOT occur if I build Spark from source using Bash and sbt\sbt assembly. It's restricted to the pre-compiled version.

> PySpark Error on Windows with sc.wholeTextFiles
> ---
>
> Key: SPARK-3972
> URL: https://issues.apache.org/jira/browse/SPARK-3972
> Project: Spark
> Issue Type: Bug
> Components: Input/Output, PySpark, Windows
> Affects Versions: 1.1.0
> Environment: Windows 8.1 x64
> Java SE Version 8 Update 20 (build 1.8.0_20-b26);
> Python 2.7.7
> Reporter: Michael Griffiths
> Priority: Minor
>
> When running sc.wholeTextFiles() on a directory, I can run the command but not do anything with the resulting RDD – specifically, I get an error in py4j.protocol.Py4JJavaError; the error is unspecified. This occurs even if I can read the text file(s) individually with sc.textFile()
> Steps followed:
> 1) Download Spark 1.1.0 (pre-built for Hadoop 2.4: [spark-1.1.0-bin-hadoop2.4.tgz|http://d3kbcqa49mib13.cloudfront.net/spark-1.1.0-bin-hadoop2.4.tgz])
> 2) Extract into folder at root of drive: **D:\spark**
> 3) Create test folder at **D:\testdata** with one (HTML) file contained within it.
> 4) Launch PySpark at **bin\PySpark**
> 5) Try to use sc.wholeTextFiles('d:/testdata'); fail.
> Note: I followed instructions from the upcoming O'Reilly book [Learning Spark|http://shop.oreilly.com/product/0636920028512.do] for this. I do not have any related tools installed (e.g. Hadoop) on the Windows machine.
> See session (below) with tracebacks from errors.
> {noformat}
> Welcome to
>       ____              __
>      / __/__  ___ _____/ /__
>     _\ \/ _ \/ _ `/ __/  '_/
>    /__ / .__/\_,_/_/ /_/\_\   version 1.1.0
>       /_/
> Using Python version 2.7.7 (default, Jun 11 2014 10:40:02)
> SparkContext available as sc.
> >>> file = sc.textFile("d:/testdata/cbcc5b470ec06f212990c68c8f76e887b884") > >>> file.count() > 732 > >>> file.first() > u'' > >>> data = sc.wholeTextFiles('d:/testdata') > >>> data.first() > Traceback (most recent call last): > File "", line 1, in > File "D:\spark\python\pyspark\rdd.py", line 1167, in first > return self.take(1)[0] > File "D:\spark\python\pyspark\rdd.py", line 1126, in take > totalParts = self._jrdd.partitions().size() > File "D:\spark\python\lib\py4j-0.8.2.1-src.zip\py4j\java_gateway.py", line > 538, in __call__ > File "D:\spark\python\lib\py4j-0.8.2.1-src.zip\py4j\protocol.py", line 300, > in get_return_value > py4j.protocol.Py4JJavaError: An error occurred while calling o21.partitions. > : java.lang.NullPointerException > at java.lang.ProcessBuilder.start(Unknown Source) > at org.apache.hadoop.util.Shell.runCommand(Shell.java:445) > at org.apache.hadoop.util.Shell.run(Shell.java:418) > at > org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:650) > at org.apache.hadoop.util.Shell.execCommand(Shell.java:739) > at org.apache.hadoop.util.Shell.execCommand(Shell.java:722) > at org.apache.hadoop.fs.FileUtil.execCommand(FileUtil.java:1097) > at > org.apache.hadoop.fs.RawLocalFileSystem$DeprecatedRawLocalFileStatus.loadPermissionInfo(RawLocalFileSystem.java:559) > at > org.apache.hadoop.fs.RawLocalFileSystem$DeprecatedRawLocalFileStatus.getPermission(RawLocalFileSystem.java:534) > at > org.apache.hadoop.fs.LocatedFileStatus.(LocatedFileStatus.java:42) >at org.apache.hadoop.fs.FileSystem$4.next(FileSystem.java:1697) > at org.apache.hadoop.fs.FileSystem$4.next(FileSystem.java:1679) > at > org.apache.hadoop.mapreduce.lib.input.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:302) > at > org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:263) > at > org.apache.spark.input.WholeTextFileInputFormat.setMaxSplitSize(WholeTextFileInputFormat.scala:54) > at > 
org.apache.spark.rdd.WholeTextFileRDD.getPartitions(NewHadoopRDD.scala:219) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202) > at scala.Option.getOrElse(Option.scala:120) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:202) > at > org.apache.spark.api.java.JavaRDDLike$class.partitions(JavaRDDLike.scala:50) > at > org.apache.spark.api.java.JavaPairRDD.partitions(JavaPairRDD.scala:44) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source) > at sun
[jira] [Commented] (SPARK-3957) Broadcast variable memory usage not reflected in UI
[ https://issues.apache.org/jira/browse/SPARK-3957?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14174107#comment-14174107 ]

Dev Lakhani commented on SPARK-3957:

Hi,

For now I am happy for [~CodingCat] to take this on. Maybe once there are some commits I can help with the UI side, but for now I'll hold back.

> Broadcast variable memory usage not reflected in UI
> ---
>
> Key: SPARK-3957
> URL: https://issues.apache.org/jira/browse/SPARK-3957
> Project: Spark
> Issue Type: Bug
> Components: Block Manager, Web UI
> Affects Versions: 1.0.2, 1.1.0
> Reporter: Shivaram Venkataraman
> Assignee: Nan Zhu
>
> Memory used by broadcast variables is not reflected in the memory usage reported in the WebUI. For example, the executors tab shows memory used in each executor but this number doesn't include memory used by broadcast variables. Similarly the storage tab only shows the list of cached RDDs and how much memory they use.
> We should add a separate column / tab for broadcast variables to make it easier to debug.
[jira] [Commented] (SPARK-2706) Enable Spark to support Hive 0.13
[ https://issues.apache.org/jira/browse/SPARK-2706?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14174073#comment-14174073 ]

Zhan Zhang commented on SPARK-2706:
---
The code has not gone upstream yet. To build with Hive 0.13.1 support, you need to apply the patch. However, the patch currently uses the customized package under org.spark-project, which seems to have been withdrawn from publication. So to use it, you need to change org.spark-project to the original Hive package.

> Enable Spark to support Hive 0.13
> -
>
> Key: SPARK-2706
> URL: https://issues.apache.org/jira/browse/SPARK-2706
> Project: Spark
> Issue Type: Dependency upgrade
> Components: SQL
> Affects Versions: 1.0.1
> Reporter: Chunjun Xiao
> Assignee: Zhan Zhang
> Attachments: hive.diff, spark-2706-v1.txt, spark-2706-v2.txt, spark-hive.err, v1.0.2.diff
>
> It seems Spark cannot work well with Hive 0.13.
> When I compiled Spark with Hive 0.13.1, I got some error messages, as attached below.
> So, when can Spark be enabled to support Hive 0.13?
> Compiling Error: > {quote} > [ERROR] > /ws/spark/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveContext.scala:180: > type mismatch; > found : String > required: Array[String] > [ERROR] val proc: CommandProcessor = > CommandProcessorFactory.get(tokens(0), hiveconf) > [ERROR] ^ > [ERROR] > /ws/spark/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala:264: > overloaded method constructor TableDesc with alternatives: > (x$1: Class[_ <: org.apache.hadoop.mapred.InputFormat[_, _]],x$2: > Class[_],x$3: java.util.Properties)org.apache.hadoop.hive.ql.plan.TableDesc > > ()org.apache.hadoop.hive.ql.plan.TableDesc > cannot be applied to (Class[org.apache.hadoop.hive.serde2.Deserializer], > Class[(some other)?0(in value tableDesc)(in value tableDesc)], Class[?0(in > value tableDesc)(in value tableDesc)], java.util.Properties) > [ERROR] val tableDesc = new TableDesc( > [ERROR] ^ > [ERROR] > /ws/spark/sql/hive/src/main/scala/org/apache/spark/sql/hive/TableReader.scala:140: > value getPartitionPath is not a member of > org.apache.hadoop.hive.ql.metadata.Partition > [ERROR] val partPath = partition.getPartitionPath > [ERROR]^ > [ERROR] > /ws/spark/sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/HiveTableScan.scala:132: > value appendReadColumnNames is not a member of object > org.apache.hadoop.hive.serde2.ColumnProjectionUtils > [ERROR] ColumnProjectionUtils.appendReadColumnNames(hiveConf, > attributes.map(_.name)) > [ERROR] ^ > [ERROR] > /ws/spark/sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/InsertIntoHiveTable.scala:79: > org.apache.hadoop.hive.common.type.HiveDecimal does not have a constructor > [ERROR] new HiveDecimal(bd.underlying()) > [ERROR] ^ > [ERROR] > /ws/spark/sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/InsertIntoHiveTable.scala:132: > type mismatch; > found : org.apache.hadoop.fs.Path > required: String > [ERROR] > SparkHiveHadoopWriter.createPathFromString(fileSinkConf.getDirName, conf)) > 
[ERROR] ^ > [ERROR] > /ws/spark/sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/InsertIntoHiveTable.scala:179: > value getExternalTmpFileURI is not a member of > org.apache.hadoop.hive.ql.Context > [ERROR] val tmpLocation = hiveContext.getExternalTmpFileURI(tableLocation) > [ERROR] ^ > [ERROR] > /ws/spark/sql/hive/src/main/scala/org/apache/spark/sql/hive/hiveUdfs.scala:209: > org.apache.hadoop.hive.common.type.HiveDecimal does not have a constructor > [ERROR] case bd: BigDecimal => new HiveDecimal(bd.underlying()) > [ERROR] ^ > [ERROR] 8 errors found > [DEBUG] Compilation failed (CompilerInterface) > [INFO] > > [INFO] Reactor Summary: > [INFO] > [INFO] Spark Project Parent POM .. SUCCESS [2.579s] > [INFO] Spark Project Core SUCCESS [2:39.805s] > [INFO] Spark Project Bagel ... SUCCESS [21.148s] > [INFO] Spark Project GraphX .. SUCCESS [59.950s] > [INFO] Spark Project ML Library .. SUCCESS [1:08.771s] > [INFO] Spark Project Streaming ... SUCCESS [1:17.759s] > [INFO] Spark Project Tools ... SUCCESS [15.405s] > [INFO] Spark Project Catalyst SUCCESS [1:17.405s] > [INFO] Spark Project SQL .
[jira] [Commented] (SPARK-3973) Print callSite information for broadcast variables
[ https://issues.apache.org/jira/browse/SPARK-3973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14174066#comment-14174066 ]

Apache Spark commented on SPARK-3973:
-
User 'shivaram' has created a pull request for this issue: https://github.com/apache/spark/pull/2829

> Print callSite information for broadcast variables
> --
>
> Key: SPARK-3973
> URL: https://issues.apache.org/jira/browse/SPARK-3973
> Project: Spark
> Issue Type: Bug
> Reporter: Shivaram Venkataraman
> Assignee: Shivaram Venkataraman
> Priority: Minor
> Fix For: 1.2.0
>
> Printing call site information for broadcast variables will help in debugging which variables are used, when they are used, etc.
[jira] [Created] (SPARK-3973) Print callSite information for broadcast variables
Shivaram Venkataraman created SPARK-3973:

Summary: Print callSite information for broadcast variables
Key: SPARK-3973
URL: https://issues.apache.org/jira/browse/SPARK-3973
Project: Spark
Issue Type: Bug
Reporter: Shivaram Venkataraman
Assignee: Shivaram Venkataraman
Priority: Minor
Fix For: 1.2.0

Printing call site information for broadcast variables will help in debugging which variables are used, when they are used, etc.
[jira] [Commented] (SPARK-3957) Broadcast variable memory usage not reflected in UI
[ https://issues.apache.org/jira/browse/SPARK-3957?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14174053#comment-14174053 ]

Nan Zhu commented on SPARK-3957:

I agree with [~andrewor14]. I was also thinking about piggybacking the information on the heartbeat between heartbeatReceiver and the executor. I'm not sure about the current Hadoop implementation, but in the 1.x versions TaskStatus was piggybacked on the heartbeat between TaskTracker and JobTracker. To me, it's a very natural way to do this.

I accepted it this morning and have started some work, so, [~devlakhani], please let me finish this, thanks.

> Broadcast variable memory usage not reflected in UI
> ---
>
> Key: SPARK-3957
> URL: https://issues.apache.org/jira/browse/SPARK-3957
> Project: Spark
> Issue Type: Bug
> Components: Block Manager, Web UI
> Affects Versions: 1.0.2, 1.1.0
> Reporter: Shivaram Venkataraman
> Assignee: Nan Zhu
>
> Memory used by broadcast variables is not reflected in the memory usage reported in the WebUI. For example, the executors tab shows memory used in each executor but this number doesn't include memory used by broadcast variables. Similarly the storage tab only shows the list of cached RDDs and how much memory they use.
> We should add a separate column / tab for broadcast variables to make it easier to debug.
[jira] [Commented] (SPARK-3736) Workers should reconnect to Master if disconnected
[ https://issues.apache.org/jira/browse/SPARK-3736?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14174049#comment-14174049 ]

Apache Spark commented on SPARK-3736:
-
User 'mccheah' has created a pull request for this issue: https://github.com/apache/spark/pull/2828

> Workers should reconnect to Master if disconnected
> --
>
> Key: SPARK-3736
> URL: https://issues.apache.org/jira/browse/SPARK-3736
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 1.0.2, 1.1.0
> Reporter: Andrew Ash
> Assignee: Matthew Cheah
> Priority: Critical
>
> In standalone mode, when a worker gets disconnected from the master for some reason it never attempts to reconnect. In this situation you have to bounce the worker before it will reconnect to the master.
> The preferred alternative is to follow what Hadoop does -- when there's a disconnect, attempt to reconnect at a particular interval until successful (I think it repeats indefinitely every 10sec).
> This has been observed by:
> - [~pkolaczk] in http://apache-spark-user-list.1001560.n3.nabble.com/Workers-disconnected-from-master-sometimes-and-never-reconnect-back-td6240.html
> - [~romi-totango] in http://apache-spark-user-list.1001560.n3.nabble.com/Re-Workers-disconnected-from-master-sometimes-and-never-reconnect-back-td15335.html
> - [~aash]
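The retry behaviour described in SPARK-3736 (attempt to reconnect at a fixed interval until successful) can be sketched as follows. This is an illustration in plain Python, not the code from the pull request: `reconnect_until_success` and `try_connect` are hypothetical names, and the 10-second default only echoes the reporter's recollection of what Hadoop does.

```python
import time


def reconnect_until_success(try_connect, interval_seconds=10.0, sleep=time.sleep):
    """Call try_connect() repeatedly until it returns True.

    Waits interval_seconds between failed attempts and retries indefinitely.
    The sleep function is injectable so the loop can be exercised
    without actually waiting. Returns the number of attempts made.
    """
    attempts = 0
    while True:
        attempts += 1
        if try_connect():
            return attempts
        sleep(interval_seconds)


# Simulated master that accepts the worker on the third attempt.
outcomes = iter([False, False, True])
waits = []
attempts_needed = reconnect_until_success(
    lambda: next(outcomes), interval_seconds=10.0, sleep=waits.append
)
```

A fixed interval keeps the sketch close to the description in the issue; a real worker might also want a cap on attempts or a backoff, which the issue does not discuss.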
[jira] [Created] (SPARK-3972) PySpark Error on Windows with sc.wholeTextFiles
Michael Griffiths created SPARK-3972:

Summary: PySpark Error on Windows with sc.wholeTextFiles
Key: SPARK-3972
URL: https://issues.apache.org/jira/browse/SPARK-3972
Project: Spark
Issue Type: Bug
Components: Input/Output, PySpark, Windows
Affects Versions: 1.1.0
Environment: Windows 8.1 x64
Java SE Version 8 Update 20 (build 1.8.0_20-b26);
Python 2.7.7
Reporter: Michael Griffiths
Priority: Minor

When running sc.wholeTextFiles() on a directory, I can run the command but not do anything with the resulting RDD – specifically, I get an error in py4j.protocol.Py4JJavaError; the error is unspecified. This occurs even if I can read the text file(s) individually with sc.textFile()

Steps followed:
1) Download Spark 1.1.0 (pre-built for Hadoop 2.4: [spark-1.1.0-bin-hadoop2.4.tgz|http://d3kbcqa49mib13.cloudfront.net/spark-1.1.0-bin-hadoop2.4.tgz])
2) Extract into folder at root of drive: **D:\spark**
3) Create test folder at **D:\testdata** with one (HTML) file contained within it.
4) Launch PySpark at **bin\PySpark**
5) Try to use sc.wholeTextFiles('d:/testdata'); fail.

Note: I followed instructions from the upcoming O'Reilly book [Learning Spark|http://shop.oreilly.com/product/0636920028512.do] for this. I do not have any related tools installed (e.g. Hadoop) on the Windows machine.

See session (below) with tracebacks from errors.
{noformat}
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 1.1.0
      /_/
Using Python version 2.7.7 (default, Jun 11 2014 10:40:02)
SparkContext available as sc.
>>> file = sc.textFile("d:/testdata/cbcc5b470ec06f212990c68c8f76e887b884") >>> file.count() 732 >>> file.first() u'' >>> data = sc.wholeTextFiles('d:/testdata') >>> data.first() Traceback (most recent call last): File "", line 1, in File "D:\spark\python\pyspark\rdd.py", line 1167, in first return self.take(1)[0] File "D:\spark\python\pyspark\rdd.py", line 1126, in take totalParts = self._jrdd.partitions().size() File "D:\spark\python\lib\py4j-0.8.2.1-src.zip\py4j\java_gateway.py", line 538, in __call__ File "D:\spark\python\lib\py4j-0.8.2.1-src.zip\py4j\protocol.py", line 300, in get_return_value py4j.protocol.Py4JJavaError: An error occurred while calling o21.partitions. : java.lang.NullPointerException at java.lang.ProcessBuilder.start(Unknown Source) at org.apache.hadoop.util.Shell.runCommand(Shell.java:445) at org.apache.hadoop.util.Shell.run(Shell.java:418) at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:650) at org.apache.hadoop.util.Shell.execCommand(Shell.java:739) at org.apache.hadoop.util.Shell.execCommand(Shell.java:722) at org.apache.hadoop.fs.FileUtil.execCommand(FileUtil.java:1097) at org.apache.hadoop.fs.RawLocalFileSystem$DeprecatedRawLocalFileStatus.loadPermissionInfo(RawLocalFileSystem.java:559) at org.apache.hadoop.fs.RawLocalFileSystem$DeprecatedRawLocalFileStatus.getPermission(RawLocalFileSystem.java:534) at org.apache.hadoop.fs.LocatedFileStatus.(LocatedFileStatus.java:42) at org.apache.hadoop.fs.FileSystem$4.next(FileSystem.java:1697) at org.apache.hadoop.fs.FileSystem$4.next(FileSystem.java:1679) at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:302) at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:263) at org.apache.spark.input.WholeTextFileInputFormat.setMaxSplitSize(WholeTextFileInputFormat.scala:54) at org.apache.spark.rdd.WholeTextFileRDD.getPartitions(NewHadoopRDD.scala:219) at 
org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:202) at org.apache.spark.api.java.JavaRDDLike$class.partitions(JavaRDDLike.scala:50) at org.apache.spark.api.java.JavaPairRDD.partitions(JavaPairRDD.scala:44) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source) at java.lang.reflect.Method.invoke(Unknown Source) at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231) at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379) at py4j.Gateway.invoke(Gateway.java:259) at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133) at py4j.commands.CallCommand.execute(CallCommand.java:79) at py4j.GatewayConnection.run(GatewayConnec
[jira] [Resolved] (SPARK-2585) Remove special handling of Hadoop JobConf
[ https://issues.apache.org/jira/browse/SPARK-2585?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen resolved SPARK-2585. --- Resolution: Fixed Due to the CONFIGURATION_INSTANTIATION_LOCK thread-safety issue, I think that we'll still end up having to serialize the Configuration separately. If we didn't, then we'd have to hold CONFIGURATION_INSTANTIATION_LOCK while deserializing each task, which could have a huge performance penalty (it's fine to hold the lock while loading the Configuration, since that doesn't take too long). Therefore, I'm closing this as "Won't Fix." The thread-safety issues with Configuration will be addressed by a separate clone() patch. > Remove special handling of Hadoop JobConf > - > > Key: SPARK-2585 > URL: https://issues.apache.org/jira/browse/SPARK-2585 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Reporter: Patrick Wendell >Assignee: Josh Rosen >Priority: Critical > > This is a follow up to SPARK-2521 and should close SPARK-2546 (provided the > implementation does not use shared conf objects). We no longer need to > specially broadcast the Hadoop configuration since we are broadcasting RDD > data anyways. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-3971) Failed to deserialize Vector in cluster mode
Davies Liu created SPARK-3971: - Summary: Failed to deserialize Vector in cluster mode Key: SPARK-3971 URL: https://issues.apache.org/jira/browse/SPARK-3971 Project: Spark Issue Type: Bug Components: MLlib, PySpark Reporter: Davies Liu Priority: Blocker The serialization of Vector/Rating does not work in cluster mode, because the initializer is not called on the executor. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-1761) Add broadcast information on SparkUI storage tab
[ https://issues.apache.org/jira/browse/SPARK-1761?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14174026#comment-14174026 ] Andrew Or commented on SPARK-1761: -- Closing in favor of SPARK-3957, which is more descriptive. > Add broadcast information on SparkUI storage tab > > > Key: SPARK-1761 > URL: https://issues.apache.org/jira/browse/SPARK-1761 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.0.0 >Reporter: Andrew Or > > It would be nice to know where the broadcast blocks are persisted. More > details coming. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3957) Broadcast variable memory usage not reflected in UI
[ https://issues.apache.org/jira/browse/SPARK-3957?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14174024#comment-14174024 ] Andrew Or commented on SPARK-3957: -- Hey [~devl.development] are you planning to work on this? Or is [~CodingCat]? The latter is currently assigned but maybe you guys should work it out. > Broadcast variable memory usage not reflected in UI > --- > > Key: SPARK-3957 > URL: https://issues.apache.org/jira/browse/SPARK-3957 > Project: Spark > Issue Type: Bug > Components: Block Manager, Web UI >Affects Versions: 1.0.2, 1.1.0 >Reporter: Shivaram Venkataraman >Assignee: Nan Zhu > > Memory used by broadcast variables are not reflected in the memory usage > reported in the WebUI. For example, the executors tab shows memory used in > each executor but this number doesn't include memory used by broadcast > variables. Similarly the storage tab only shows list of rdds cached and how > much memory they use. > We should add a separate column / tab for broadcast variables to make it > easier to debug. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-1761) Add broadcast information on SparkUI storage tab
[ https://issues.apache.org/jira/browse/SPARK-1761?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or closed SPARK-1761. Resolution: Duplicate > Add broadcast information on SparkUI storage tab > > > Key: SPARK-1761 > URL: https://issues.apache.org/jira/browse/SPARK-1761 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.0.0 >Reporter: Andrew Or > > It would be nice to know where the broadcast blocks are persisted. More > details coming. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3957) Broadcast variable memory usage not reflected in UI
[ https://issues.apache.org/jira/browse/SPARK-3957?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14174019#comment-14174019 ] Andrew Or commented on SPARK-3957: -- Yeah my understanding is that broadcast blocks aren't reported to the driver (and it makes sense to not report them because the driver is the one who initiated the broadcast in the first place). The source of the broadcast info we want to display is in the BlockManager of each executor, and we need to get this to the driver somehow. We could add some periodic reporting but that opens another channel between the driver and the executors. There is an ongoing effort to do something similar for task metrics https://github.com/apache/spark/pull/2087, so maybe we can piggyback this information on the heartbeats there. Also I believe this is a duplicate of an old issue SPARK-1761, though this one contains more information so let's keep this one open. I will close the other one in favor of this. > Broadcast variable memory usage not reflected in UI > --- > > Key: SPARK-3957 > URL: https://issues.apache.org/jira/browse/SPARK-3957 > Project: Spark > Issue Type: Bug > Components: Block Manager, Web UI >Affects Versions: 1.0.2, 1.1.0 >Reporter: Shivaram Venkataraman >Assignee: Nan Zhu > > Memory used by broadcast variables are not reflected in the memory usage > reported in the WebUI. For example, the executors tab shows memory used in > each executor but this number doesn't include memory used by broadcast > variables. Similarly the storage tab only shows list of rdds cached and how > much memory they use. > We should add a separate column / tab for broadcast variables to make it > easier to debug. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-2750) Add Https support for Web UI
[ https://issues.apache.org/jira/browse/SPARK-2750?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14174003#comment-14174003 ] Marcelo Vanzin edited comment on SPARK-2750 at 10/16/14 5:35 PM: - FYI, any PR here should make sure the default configuration is safe against the "POODLE" attack (https://access.redhat.com/security/cve/CVE-2014-3566). Here's something for Jetty: http://stackoverflow.com/questions/26382540/how-to-disable-the-sslv3-protocol-in-jetty-to-prevent-poodle-attack was (Author: vanzin): FYI, any PR here should make sure the default configuration is save against the "POODLE" attack (https://access.redhat.com/security/cve/CVE-2014-3566). Here's something for Jetty: http://stackoverflow.com/questions/26382540/how-to-disable-the-sslv3-protocol-in-jetty-to-prevent-poodle-attack > Add Https support for Web UI > > > Key: SPARK-2750 > URL: https://issues.apache.org/jira/browse/SPARK-2750 > Project: Spark > Issue Type: New Feature > Components: Web UI >Reporter: WangTaoTheTonic > Labels: https, ssl, webui > Fix For: 1.0.3 > > Original Estimate: 96h > Remaining Estimate: 96h > > Now I try to add https support for web ui using Jetty ssl integration.Below > is the plan: > 1.Web UI include Master UI, Worker UI, HistoryServer UI and Spark Ui. User > can switch between https and http by configure "spark.http.policy" in JVM > property for each process, while choose http by default. > 2.Web port of Master and worker would be decided in order of launch > arguments, JVM property, System Env and default port. 
> 3.Below is some other configuration items: > spark.ssl.server.keystore.location The file or URL of the SSL Key store > spark.ssl.server.keystore.password The password for the key store > spark.ssl.server.keystore.keypassword The password (if any) for the specific > key within the key store > spark.ssl.server.keystore.type The type of the key store (default "JKS") > spark.client.https.need-auth True if SSL needs client authentication > spark.ssl.server.truststore.location The file name or URL of the trust store > location > spark.ssl.server.truststore.password The password for the trust store > spark.ssl.server.truststore.type The type of the trust store (default "JKS") > Any feedback is welcome! -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
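Item 2 of the plan above (web port decided in order of launch arguments, JVM property, system env and default port) can be sketched as a simple precedence lookup. This is only an illustration in plain Python, not the proposed patch: the property name, the environment variable name, and the default of 8080 are assumptions made for the example, since the proposal does not name them.

```python
import os

# Hypothetical names and default, for illustration only; the plan does not fix them.
PORT_PROPERTY = "spark.master.ui.port"
PORT_ENV_VAR = "SPARK_MASTER_WEBUI_PORT"
DEFAULT_PORT = 8080


def resolve_web_port(launch_arg=None, jvm_properties=None, env=None,
                     default=DEFAULT_PORT):
    """Pick the web UI port using the precedence from item 2 of the plan:
    launch argument, then JVM property, then environment, then default."""
    jvm_properties = jvm_properties or {}
    env = os.environ if env is None else env
    candidates = (
        launch_arg,
        jvm_properties.get(PORT_PROPERTY),
        env.get(PORT_ENV_VAR),
    )
    # The first source that actually provides a value wins.
    for value in candidates:
        if value is not None:
            return int(value)
    return default
```

Passing the property map and environment explicitly keeps the lookup easy to test for each process (Master, Worker, HistoryServer) without touching the real environment.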
[jira] [Commented] (SPARK-2750) Add Https support for Web UI
[ https://issues.apache.org/jira/browse/SPARK-2750?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14174003#comment-14174003 ] Marcelo Vanzin commented on SPARK-2750: --- FYI, any PR here should make sure the default configuration is save against the "POODLE" attack (https://access.redhat.com/security/cve/CVE-2014-3566). Here's something for Jetty: http://stackoverflow.com/questions/26382540/how-to-disable-the-sslv3-protocol-in-jetty-to-prevent-poodle-attack > Add Https support for Web UI > > > Key: SPARK-2750 > URL: https://issues.apache.org/jira/browse/SPARK-2750 > Project: Spark > Issue Type: New Feature > Components: Web UI >Reporter: WangTaoTheTonic > Labels: https, ssl, webui > Fix For: 1.0.3 > > Original Estimate: 96h > Remaining Estimate: 96h > > Now I try to add https support for web ui using Jetty ssl integration.Below > is the plan: > 1.Web UI include Master UI, Worker UI, HistoryServer UI and Spark Ui. User > can switch between https and http by configure "spark.http.policy" in JVM > property for each process, while choose http by default. > 2.Web port of Master and worker would be decided in order of launch > arguments, JVM property, System Env and default port. > 3.Below is some other configuration items: > spark.ssl.server.keystore.location The file or URL of the SSL Key store > spark.ssl.server.keystore.password The password for the key store > spark.ssl.server.keystore.keypassword The password (if any) for the specific > key within the key store > spark.ssl.server.keystore.type The type of the key store (default "JKS") > spark.client.https.need-auth True if SSL needs client authentication > spark.ssl.server.truststore.location The file name or URL of the trust store > location > spark.ssl.server.truststore.password The password for the trust store > spark.ssl.server.truststore.type The type of the trust store (default "JKS") > Any feedback is welcome! 
[jira] [Commented] (SPARK-3883) Provide SSL support for Akka and HttpServer based connections
[ https://issues.apache.org/jira/browse/SPARK-3883?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14174005#comment-14174005 ] Marcelo Vanzin commented on SPARK-3883: --- FYI, any PR here should make sure the default configuration is safe against the "POODLE" attack (https://access.redhat.com/security/cve/CVE-2014-3566). Here's something for Jetty: http://stackoverflow.com/questions/26382540/how-to-disable-the-sslv3-protocol-in-jetty-to-prevent-poodle-attack > Provide SSL support for Akka and HttpServer based connections > - > > Key: SPARK-3883 > URL: https://issues.apache.org/jira/browse/SPARK-3883 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Reporter: Jacek Lewandowski > > Spark uses at least 4 logical communication channels: > 1. Control messages - Akka based > 2. JARs and other files - Jetty based (HttpServer) > 3. Computation results - Java NIO based > 4. Web UI - Jetty based > The aim of this feature is to enable SSL for (1) and (2). > Why: > Spark configuration is sent through (1). Spark configuration may contain > sensitive information like credentials for accessing external data sources or > streams. Application JAR files (2) may include the application logic and > therefore they may include information about the structure of the external > data sources, and credentials as well. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3174) Provide elastic scaling within a Spark application
[ https://issues.apache.org/jira/browse/SPARK-3174?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14173951#comment-14173951 ] Marcelo Vanzin commented on SPARK-3174: --- bq. Lets say we start the Spark application with just 2 executors. It will double the number of executors and hence goes to 4, 8 and so on. Well, I'd say it's unusual for applications to start with a low number of executors, especially if the user knows it will be executing things right away. So if I start it with 32 executors, your code will right away try to make it 64. Andrew's approach would try to make it 33, then 35, then... But I agree that it might be a good idea to make the auto-scaling backend an interface, so that we can easily play with different approaches. That shouldn't be hard at all. bq. The main point being, It does all these without making any changes in TaskSchedulerImpl/TaskSetManager Theoretically, I agree that's a good thing. I haven't gone through the code in detail, though, to know whether all the information Andrew is using from the scheduler is available from SparkListener events. If you can derive that info, great, I think it would be worth it to make the auto-scale code decoupled from the scheduler. If not, then we either have the choice of hooking the auto-scaling backend into the scheduler (like Andrew's change) or exposing more info in the events - which may or may not be a good thing, depending on what that info is. Anyway, as I've said, both approaches are not irreconcilably different - they're actually more similar than not. 
> Provide elastic scaling within a Spark application > -- > > Key: SPARK-3174 > URL: https://issues.apache.org/jira/browse/SPARK-3174 > Project: Spark > Issue Type: Improvement > Components: Spark Core, YARN >Affects Versions: 1.0.2 >Reporter: Sandy Ryza >Assignee: Andrew Or > Attachments: SPARK-3174design.pdf, SparkElasticScalingDesignB.pdf, > dynamic-scaling-executors-10-6-14.pdf > > > A common complaint with Spark in a multi-tenant environment is that > applications have a fixed allocation that doesn't grow and shrink with their > resource needs. We're blocked on YARN-1197 for dynamically changing the > resources within executors, but we can still allocate and discard whole > executors. > It would be useful to have some heuristics that > * Request more executors when many pending tasks are building up > * Discard executors when they are idle > See the latest design doc for more information. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
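The two ramp-up behaviours compared in the comment above can be sketched side by side. This is a toy illustration, not code from either patch. The first two increments of the second strategy (+1, then +2) follow the "33, then 35" example; continuing to double the increment after that is an assumption made here.

```python
def doubling_targets(start, rounds):
    """Executor targets when the total is doubled each round (2 -> 4 -> 8 ...)."""
    targets = []
    current = start
    for _ in range(rounds):
        current *= 2
        targets.append(current)
    return targets


def incremental_targets(start, rounds):
    """Executor targets when each round adds a step and the step then doubles.

    The first two steps (+1, +2) match the "33, then 35" example in the
    comment; doubling the step in later rounds is an assumption.
    """
    targets = []
    current, step = start, 1
    for _ in range(rounds):
        current += step
        targets.append(current)
        step *= 2
    return targets
```

Starting from 32 executors, the first strategy asks for 64 straight away, while the second asks for 33 and then 35, which is the difference the comment points out.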
[jira] [Commented] (SPARK-3957) Broadcast variable memory usage not reflected in UI
[ https://issues.apache.org/jira/browse/SPARK-3957?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14173954#comment-14173954 ] Shivaram Venkataraman commented on SPARK-3957: -- I think it needs to be tracked in the Block Manager -- However we also need to track this on a per-executor basis and not just at the driver. Right now AFAIK, executors do not report new broadcast blocks to the master to reduce communication. However we could add broadcast blocks to some periodic report. [~andrewor] might know more. > Broadcast variable memory usage not reflected in UI > --- > > Key: SPARK-3957 > URL: https://issues.apache.org/jira/browse/SPARK-3957 > Project: Spark > Issue Type: Bug > Components: Block Manager, Web UI >Affects Versions: 1.0.2, 1.1.0 >Reporter: Shivaram Venkataraman >Assignee: Nan Zhu > > Memory used by broadcast variables are not reflected in the memory usage > reported in the WebUI. For example, the executors tab shows memory used in > each executor but this number doesn't include memory used by broadcast > variables. Similarly the storage tab only shows list of rdds cached and how > much memory they use. > We should add a separate column / tab for broadcast variables to make it > easier to debug. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3957) Broadcast variable memory usage not reflected in UI
[ https://issues.apache.org/jira/browse/SPARK-3957?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14173926#comment-14173926 ] Dev Lakhani commented on SPARK-3957: Here are my thoughts on a possible approach. Hi All The broadcast goes from the SparkContext to the BroadcastManager and its newBroadcast method. In the first instance, the broadcast data is stored in the Block Manager (see HttpBroadcast) of the executor. Any tracking of broadcast variables must be referenced by the BlockManagerSlaveActor and BlockManagerMasterActor. In particular, UpdateBlockInfo and RemoveBroadcast should update the total memory used by blocks when blocks are added and removed. These can then be hooked up to the UI using a new page like ExecutorsPage and by defining new methods in the relevant listener, such as StorageStatusListener. These are my initial thoughts as someone new to these components; any other ideas or approaches? > Broadcast variable memory usage not reflected in UI > --- > > Key: SPARK-3957 > URL: https://issues.apache.org/jira/browse/SPARK-3957 > Project: Spark > Issue Type: Bug > Components: Block Manager, Web UI >Affects Versions: 1.0.2, 1.1.0 >Reporter: Shivaram Venkataraman >Assignee: Nan Zhu > > Memory used by broadcast variables are not reflected in the memory usage > reported in the WebUI. For example, the executors tab shows memory used in > each executor but this number doesn't include memory used by broadcast > variables. Similarly the storage tab only shows list of rdds cached and how > much memory they use. > We should add a separate column / tab for broadcast variables to make it > easier to debug. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
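The accounting suggested in the comment above (update a per-executor total when blocks are added and removed) might look roughly like the following. It is a toy model in plain Python, not Spark code: the class and method names are hypothetical, and the "broadcast_<id>" block naming is an assumption made for the example.

```python
from collections import defaultdict


class BroadcastMemoryTracker:
    """Toy per-executor accounting of memory held by broadcast blocks."""

    def __init__(self):
        # executor id -> {block id -> size in bytes}
        self._blocks = defaultdict(dict)

    def update_block(self, executor_id, block_id, size_bytes):
        """Record a block being added or resized (compare UpdateBlockInfo)."""
        self._blocks[executor_id][block_id] = size_bytes

    def remove_broadcast(self, broadcast_id):
        """Forget all blocks of one broadcast on every executor (compare RemoveBroadcast)."""
        prefix = "broadcast_%d" % broadcast_id  # assumed naming scheme
        for blocks in self._blocks.values():
            doomed = [b for b in blocks
                      if b == prefix or b.startswith(prefix + "_")]
            for block_id in doomed:
                del blocks[block_id]

    def memory_used(self, executor_id):
        """Total bytes currently attributed to broadcast blocks on one executor."""
        return sum(self._blocks.get(executor_id, {}).values())
```

A UI page would then only need to read memory_used per executor; how that number reaches the driver is the open question discussed in the other comments on this issue.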
[jira] [Commented] (SPARK-2365) Add IndexedRDD, an efficient updatable key-value store
[ https://issues.apache.org/jira/browse/SPARK-2365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14173904#comment-14173904 ] Akshat Aranya commented on SPARK-2365: -- This looks great! I have been using IndexedRDD for a while, to great effect. I have one suggestion: it would be nice to override setName() in IndexedRDDLike {code} override def setName(_name: String): this.type = { partitionsRDD.setName(_name) this } {code} so that the IndexedRDD shows up with friendly names in the storage UI, just like regular, cached RDDs do. > Add IndexedRDD, an efficient updatable key-value store > -- > > Key: SPARK-2365 > URL: https://issues.apache.org/jira/browse/SPARK-2365 > Project: Spark > Issue Type: New Feature > Components: GraphX, Spark Core >Reporter: Ankur Dave >Assignee: Ankur Dave > Attachments: 2014-07-07-IndexedRDD-design-review.pdf > > > RDDs currently provide a bulk-updatable, iterator-based interface. This > imposes minimal requirements on the storage layer, which only needs to > support sequential access, enabling on-disk and serialized storage. > However, many applications would benefit from a richer interface. Efficient > support for point lookups would enable serving data out of RDDs, but it > currently requires iterating over an entire partition to find the desired > element. Point updates similarly require copying an entire iterator. Joins > are also expensive, requiring a shuffle and local hash joins. > To address these problems, we propose IndexedRDD, an efficient key-value > store built on RDDs. IndexedRDD would extend RDD[(Long, V)] by enforcing key > uniqueness and pre-indexing the entries for efficient joins and point > lookups, updates, and deletions. > It would be implemented by (1) hash-partitioning the entries by key, (2) > maintaining a hash index within each partition, and (3) using purely > functional (immutable and efficiently updatable) data structures to enable > efficient modifications and deletions. 
> GraphX would be the first user of IndexedRDD, since it currently implements a > limited form of this functionality in VertexRDD. We envision a variety of > other uses for IndexedRDD, including streaming updates to RDDs, direct > serving from RDDs, and as an execution strategy for Spark SQL. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
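The three implementation points in the description above (hash-partition by key, keep a hash index per partition, update without mutating the old version) can be shown with a small toy in plain Python. This is not the IndexedRDD implementation: the class name is hypothetical, and copying one partition's dict on each update is a stand-in for the purely functional data structures the proposal refers to.

```python
class IndexedStore:
    """Toy key-value store: hash-partitioned, with a dict index per partition."""

    def __init__(self, num_partitions=4, partitions=None):
        self.num_partitions = num_partitions
        # Each partition is a dict, i.e. a hash index from key to value.
        self.partitions = partitions or tuple({} for _ in range(num_partitions))

    def _partition_of(self, key):
        return hash(key) % self.num_partitions

    def get(self, key):
        """Point lookup: consults one partition's index, no full scan."""
        return self.partitions[self._partition_of(key)].get(key)

    def _with_partition(self, index, new_partition):
        # Unchanged partitions are shared between the old and new store.
        parts = self.partitions[:index] + (new_partition,) + self.partitions[index + 1:]
        return IndexedStore(self.num_partitions, parts)

    def put(self, key, value):
        """Return a new store with key set; the receiver is left unchanged."""
        index = self._partition_of(key)
        updated = dict(self.partitions[index])  # copy only the affected partition
        updated[key] = value
        return self._with_partition(index, updated)

    def delete(self, key):
        """Return a new store without key; the receiver is left unchanged."""
        index = self._partition_of(key)
        updated = dict(self.partitions[index])
        updated.pop(key, None)
        return self._with_partition(index, updated)
```

Because every update returns a new store and leaves the old one intact, older versions stay readable, which is what makes this style fit the immutable RDD model.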
[jira] [Commented] (SPARK-3970) Remove duplicate removal of local dirs
[ https://issues.apache.org/jira/browse/SPARK-3970?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14173856#comment-14173856 ] Apache Spark commented on SPARK-3970: - User 'viirya' has created a pull request for this issue: https://github.com/apache/spark/pull/2826 > Remove duplicate removal of local dirs > -- > > Key: SPARK-3970 > URL: https://issues.apache.org/jira/browse/SPARK-3970 > Project: Spark > Issue Type: Bug >Reporter: Liang-Chi Hsieh > > The shutdown hook of DiskBlockManager would remove localDirs. So do not need > to register them with Utils.registerShutdownDeleteDir. It causes duplicate > removal of these local dirs and corresponding exceptions. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-3969) Optimizer should have a super class as an interface.
Takuya Ueshin created SPARK-3969: Summary: Optimizer should have a super class as an interface. Key: SPARK-3969 URL: https://issues.apache.org/jira/browse/SPARK-3969 Project: Spark Issue Type: Improvement Components: SQL Reporter: Takuya Ueshin Some developers want to replace {{Optimizer}} to fit their projects but can't do so because currently {{Optimizer}} is an {{object}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-3970) Remove duplicate removal of local dirs
Liang-Chi Hsieh created SPARK-3970: -- Summary: Remove duplicate removal of local dirs Key: SPARK-3970 URL: https://issues.apache.org/jira/browse/SPARK-3970 Project: Spark Issue Type: Bug Reporter: Liang-Chi Hsieh The shutdown hook of DiskBlockManager already removes localDirs, so there is no need to also register them with Utils.registerShutdownDeleteDir. Doing so causes duplicate removal of these local dirs and the corresponding exceptions. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3969) Optimizer should have a super class as an interface.
[ https://issues.apache.org/jira/browse/SPARK-3969?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14173854#comment-14173854 ] Apache Spark commented on SPARK-3969: - User 'ueshin' has created a pull request for this issue: https://github.com/apache/spark/pull/2825 > Optimizer should have a super class as an interface. > > > Key: SPARK-3969 > URL: https://issues.apache.org/jira/browse/SPARK-3969 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Takuya Ueshin > > Some developers want to replace {{Optimizer}} to fit their projects but can't > do so because currently {{Optimizer}} is an {{object}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3907) add "truncate table" support
[ https://issues.apache.org/jira/browse/SPARK-3907?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14173806#comment-14173806 ] Apache Spark commented on SPARK-3907: - User 'wangxiaojing' has created a pull request for this issue: https://github.com/apache/spark/pull/2770 > add "truncate table" support > - > > Key: SPARK-3907 > URL: https://issues.apache.org/jira/browse/SPARK-3907 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 1.1.0 >Reporter: wangxj >Priority: Minor > Labels: features > Fix For: 1.1.0 > > Original Estimate: 0.05h > Remaining Estimate: 0.05h > > The "truncate table " syntax had been disabled. > Removes all rows from a table or partition(s),Currently target table should > be native/managed table or exception will be thrown.User can specify partial > partition_spec for truncating multiple partitions at once and omitting > partition_spec will truncate all partitions in the table. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3948) Sort-based shuffle can lead to assorted stream-corruption exceptions
[ https://issues.apache.org/jira/browse/SPARK-3948?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14173757#comment-14173757 ] Apache Spark commented on SPARK-3948: - User 'jerryshao' has created a pull request for this issue: https://github.com/apache/spark/pull/2824 > Sort-based shuffle can lead to assorted stream-corruption exceptions > > > Key: SPARK-3948 > URL: https://issues.apache.org/jira/browse/SPARK-3948 > Project: Spark > Issue Type: Bug > Components: Shuffle >Affects Versions: 1.2.0 >Reporter: Saisai Shao >Assignee: Saisai Shao > > Several exceptions occurred when running TPC-DS queries against latest master > branch with sort-based shuffle enable, like PARSING_ERROR(2) in snappy, > deserializing error in Kryo and offset out-range in FileManagedBuffer, all > these exceptions are gone when we changed to hash-based shuffle. > With deep investigation, we found that some shuffle output file is > unexpectedly smaller than the others, as the log shows: > {noformat} > 14/10/14 18:25:06 INFO shuffle.IndexShuffleBlockManager: Block id: > shuffle_6_9_11, offset: 3055635, length: 236708, file length: 47274167 > 14/10/14 18:25:06 INFO shuffle.IndexShuffleBlockManager: Block id: > shuffle_6_10_11, offset: 2986484, length: 222755, file length: 47174539 > 14/10/14 18:25:06 INFO shuffle.IndexShuffleBlockManager: Block id: > shuffle_6_11_11, offset: 2995341, length: 259871, file length: 383405 > 14/10/14 18:25:06 INFO shuffle.IndexShuffleBlockManager: Block id: > shuffle_6_12_11, offset: 2991030, length: 268191, file length: 47478892 > 14/10/14 18:25:06 INFO shuffle.IndexShuffleBlockManager: Block id: > shuffle_6_13_11, offset: 3016292, length: 230694, file length: 47420826 > 14/10/14 18:25:06 INFO shuffle.IndexShuffleBlockManager: Block id: > shuffle_6_14_11, offset: 3061400, length: 241136, file length: 47395509 > {noformat} > As you can see the total file length of shuffle_6_11_11 is much smaller than > other same stage map 
output results. > And we also dump the map outputs in map side to see if this small size output > is correct or not, below is the log: > {noformat} > In bypass merge sort, file name: /mnt/DP_disk1/animal/spark/spark-local- > 20141014182142-8345/22/shuffle_6_11_0.data, file length: 383405length: > 274722 262597 291290 272902 264941 270358 291005 295285 252482 > 287142 232617 259871 233734 241439 228897 234282 253834 235619 > 233803 255532 270739 253825 262087 266404 234273 250120 262983 > 257024 255947 254971 258908 247862 221613 258566 245399 251684 > 274843 226150 264278 245279 225656 235084 239466 212851 242245 > 218781 222191 215500 211548 234256 208601 204113 191923 217895 > 227020 215331 212313 223725 250876 256875 239276 266777 235520 > 237462 234063 242270 246825 255888 235937 236956 233099 264508 > 260303 233294 239061 254856 257475 230105 246553 260412 210355 > 211201 219572 206636 226866 209937 226618 218208 206255 248069 > 221717 222112 215734 248088 239207 246125 239056 241133 253091 > 246738 233128 242794 231606 255737 221123 252115 247286 229688 > 251087 250047 237579 263079 256251 238214 208641 201120 204009 > 200825 211965 200600 194492 226471 194887 226975 215072 206008 > 233288 222132 208860 219064 218162 237126 220465 201343 225711 > 232178 233786 212767 211462 213671 215853 227822 233782 214727 > 247001 228968 247413 222674 214241 184122 215643 207665 219079 > 215185 207718 212723 201613 216600 212591 208174 204195 208099 > 229079 230274 223373 214999 256626 228895 231821 383405 229646 > 220212 245495 245960 227556 213266 237203 203805 240509 239306 > 242365 218416 238487 219397 240026 251011 258369 255365 259811 > 283313 248450 264286 264562 257485 279459 249187 257609 274964 > 292369 273826 > {noformat} > Here I dump the file name, length and each partition's length, obviously the > sum of all partition lengths is not equal to file length. 
So I think there > may be a situation paritionWriter in ExternalSorter not always append to the > end of previous written file, the file's content is overwritten in some > parts, and this lead to the exceptions I mentioned before. > Also I changed the code of copyStream by disable transferTo, use the previous > one, all the issues are gone. So I think there maybe some flushing problems > in transferTo when processed data is large. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
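The inconsistency visible in the first log excerpt above can be checked mechanically: a block described by (offset, length) can only be read if it ends inside the data file. The sketch below is plain Python with hypothetical helper names; the sample values are copied from the log lines quoted in the issue.

```python
def block_fits(offset, length, file_length):
    """True if the block [offset, offset + length) lies inside the file."""
    return offset + length <= file_length


def find_truncated_blocks(blocks):
    """Return ids of blocks whose recorded range runs past the end of the file.

    Each entry is (block_id, offset, length, file_length).
    """
    return [block_id
            for block_id, offset, length, file_length in blocks
            if not block_fits(offset, length, file_length)]


# Values taken from the IndexShuffleBlockManager log lines in the issue.
LOGGED_BLOCKS = [
    ("shuffle_6_9_11", 3055635, 236708, 47274167),
    ("shuffle_6_10_11", 2986484, 222755, 47174539),
    ("shuffle_6_11_11", 2995341, 259871, 383405),
    ("shuffle_6_12_11", 2991030, 268191, 47478892),
]
```

Running find_truncated_blocks(LOGGED_BLOCKS) singles out shuffle_6_11_11, the map output whose file is far shorter than the offset recorded for it.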
[jira] [Commented] (SPARK-3629) Improvements to YARN doc
[ https://issues.apache.org/jira/browse/SPARK-3629?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14173731#comment-14173731 ] ssj commented on SPARK-3629: Need someone to verify this patch. > Improvements to YARN doc > > > Key: SPARK-3629 > URL: https://issues.apache.org/jira/browse/SPARK-3629 > Project: Spark > Issue Type: Documentation > Components: Documentation, YARN >Reporter: Matei Zaharia > Labels: starter > > Right now this doc starts off with a big list of config options, and only > then tells you how to submit an app. It would be better to put that part and > the packaging part first, and the config options only at the end. > In addition, the doc mentions yarn-cluster vs yarn-client as separate > masters, which is inconsistent with the help output from spark-submit (which > says to always use "yarn").
[jira] [Updated] (SPARK-3968) Using parquet-mr filter2 api in spark sql, add a custom filter for InSet clause
[ https://issues.apache.org/jira/browse/SPARK-3968?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yash Datta updated SPARK-3968: -- Shepherd: Yash Datta > Using parquet-mr filter2 api in spark sql, add a custom filter for InSet > clause > --- > > Key: SPARK-3968 > URL: https://issues.apache.org/jira/browse/SPARK-3968 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.1.0 >Reporter: Yash Datta >Priority: Minor > Fix For: 1.1.1 > > > The parquet-mr project has introduced a new filter API, along with several > fixes, such as filtering on OPTIONAL columns. It can also eliminate > entire RowGroups based on statistics such as min/max. > We can leverage that to further improve the performance of queries with filters. > The filter2 API also introduces the ability to create custom filters. We can create a > custom filter for the optimized In clause (InSet), so that elimination > happens in the ParquetRecordReader itself.
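The idea of eliminating whole RowGroups for an InSet filter using min/max statistics can be sketched roughly as below. This is a plain Python illustration under stated assumptions, not the parquet-mr filter2 API: `RowGroupStats`, `may_contain_any`, and `prune_row_groups` are hypothetical names, and the column is assumed to hold integers.

```python
from dataclasses import dataclass
from typing import List, Set


@dataclass(frozen=True)
class RowGroupStats:
    """Hypothetical per-row-group statistics for one integer column."""
    min_value: int
    max_value: int


def may_contain_any(stats: RowGroupStats, in_set: Set[int]) -> bool:
    """True if at least one value of the IN set lies within [min, max].

    If no value falls in the range, the row group cannot contain a match
    and can be skipped without reading its records.
    """
    return any(stats.min_value <= v <= stats.max_value for v in in_set)


def prune_row_groups(groups: List[RowGroupStats], in_set: Set[int]) -> List[int]:
    """Return the indices of the row groups that still have to be read."""
    return [i for i, g in enumerate(groups) if may_contain_any(g, in_set)]


if __name__ == "__main__":
    groups = [RowGroupStats(0, 9), RowGroupStats(10, 19), RowGroupStats(20, 29)]
    # Only the first and third groups can hold 3 or 25.
    print(prune_row_groups(groups, {3, 25}))  # [0, 2]
```

Statistics can only rule row groups out; a group that passes this test may still contain no matching record, so the filter must still be applied to the records that are read.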