[jira] [Created] (SPARK-11150) Dynamic partition pruning
Younes created SPARK-11150: -- Summary: Dynamic partition pruning Key: SPARK-11150 URL: https://issues.apache.org/jira/browse/SPARK-11150 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.5.1, 1.6.0 Reporter: Younes Partitions are not pruned when tables are joined on the partition columns. This is the same issue as HIVE-9152. Ex: "Select * from tab where partcol=1" will prune on value 1, while "Select * from tab join dim on (dim.partcol=tab.partcol) where dim.partcol=1" will scan all partitions. Tables are stored as Parquet. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
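For illustration, a minimal Spark SQL sketch of the two query shapes described above, assuming hypothetical {{tab}}/{{dim}} tables and a {{partcol}} partition column; only the first is reported to prune partitions.
{code}
// Hypothetical tables: tab (partitioned by partcol) and dim.
// Direct filter on the partition column -- partitions are pruned.
sqlContext.sql("SELECT * FROM tab WHERE partcol = 1").count()

// Same predicate expressed through a join on the partition column --
// the filter on dim.partcol is not propagated to tab, so every
// partition of tab is scanned.
sqlContext.sql(
  """SELECT *
    |FROM tab JOIN dim ON (dim.partcol = tab.partcol)
    |WHERE dim.partcol = 1""".stripMargin).count()
{code}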
[jira] [Updated] (SPARK-11152) Streaming UI: Input sizes are 0 for makeup batches started from a checkpoint
[ https://issues.apache.org/jira/browse/SPARK-11152?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yongjia Wang updated SPARK-11152: - Description: When a streaming job is resumed from a checkpoint at batch time x, and say the current time when we resume this streaming job is x+10. In this scenario, since Spark will schedule the missing batches from x+1 to x+10 without any metadata, the behavior is to pack up all the backlogged inputs into batch x+1, then assign any new inputs into x+2 to x+10 immediately without waiting. This results in tiny batches that capture inputs only during the back to back scheduling intervals. This behavior is very reasonable. However, the streaming UI does not show correctly the input sizes for all these makeup batches - they are all 0 from batch x to x+10. Fixing this would be very helpful. This happens when I use Kafka direct streaming, I assume this would happen for all other streaming sources as well. (was: When a streaming job starts from a checkpoint at batch time x, and say the current time when we resume this streaming job is x+10. In this scenario, since Spark will schedule the missing batches from x+1 to x+10 without any metadata, the behavior is to pack up all the backlogged inputs into batch x+1, then assign any new inputs into x+2 to x+10 immediately without waiting. This results in tiny batches that capture inputs only during the back to back scheduling intervals. This behavior is very reasonable. However, the streaming UI does not show correctly the input sizes for all these makeup batches - they are all 0 from batch x to x+10. Fixing this would be very helpful. This happens when I use Kafka direct streaming, I assume this would happen for all other streaming sources as well.) > Streaming UI: Input sizes are 0 for makeup batches started from a checkpoint > - > > Key: SPARK-11152 > URL: https://issues.apache.org/jira/browse/SPARK-11152 > Project: Spark > Issue Type: Bug > Components: Streaming, Web UI >Reporter: Yongjia Wang >Priority: Minor > > When a streaming job is resumed from a checkpoint at batch time x, and say > the current time when we resume this streaming job is x+10. In this scenario, > since Spark will schedule the missing batches from x+1 to x+10 without any > metadata, the behavior is to pack up all the backlogged inputs into batch > x+1, then assign any new inputs into x+2 to x+10 immediately without waiting. > This results in tiny batches that capture inputs only during the back to back > scheduling intervals. This behavior is very reasonable. However, the > streaming UI does not show correctly the input sizes for all these makeup > batches - they are all 0 from batch x to x+10. Fixing this would be very > helpful. This happens when I use Kafka direct streaming, I assume this would > happen for all other streaming sources as well. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-11153) Turns off Parquet filter push-down for string and binary columns
[ https://issues.apache.org/jira/browse/SPARK-11153?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-11153: --- Description: Due to PARQUET-251, {{BINARY}} columns in existing Parquet files may be written with corrupted statistics information. This information is used by filter push-down optimization. Since Spark 1.5 turns on Parquet filter push-down by default, we may end up with wrong query results. PARQUET-251 has been fixed in parquet-mr 1.8.1, but Spark 1.5 is still using 1.7.0. Note that this kind of corrupted Parquet files could be produced by any Parquet data models. This affects all Spark SQL data types that can be mapped to Parquet {{BINARY}}, namely: - {{StringType}} - {{BinaryType}} - {{DecimalType}} (but Spark SQL doesn't support pushing down {{DecimalType}} columns for now.) To avoid wrong query results, we should disable filter push-down for columns of {{StringType}} and {{BinaryType}} until we upgrade to parquet-mr 1.8. was: Due to PARQUET-251, {{BINARY}} columns in existing Parquet files may be written with corrupted statistics information. This information is used by filter push-down optimization. Since Spark 1.5 turns on Parquet filter push-down by default, we may end up with wrong query results. PARQUET-251 has been fixed in parquet-mr 1.8.1, but Spark 1.5 is still using 1.7.0. This affects all Spark SQL data types that can be mapped to Parquet {{BINARY}}, namely: - {{StringType}} - {{BinaryType}} - {{DecimalType}} (but Spark SQL doesn't support pushing down {{DecimalType}} columns for now.) To avoid wrong query results, we should disable filter push-down for columns of {{StringType}} and {{BinaryType}} until we upgrade to parquet-mr 1.8. > Turns off Parquet filter push-down for string and binary columns > > > Key: SPARK-11153 > URL: https://issues.apache.org/jira/browse/SPARK-11153 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.0, 1.5.1 >Reporter: Cheng Lian >Assignee: Cheng Lian >Priority: Blocker > > Due to PARQUET-251, {{BINARY}} columns in existing Parquet files may be > written with corrupted statistics information. This information is used by > filter push-down optimization. Since Spark 1.5 turns on Parquet filter > push-down by default, we may end up with wrong query results. PARQUET-251 has > been fixed in parquet-mr 1.8.1, but Spark 1.5 is still using 1.7.0. > Note that this kind of corrupted Parquet files could be produced by any > Parquet data models. > This affects all Spark SQL data types that can be mapped to Parquet > {{BINARY}}, namely: > - {{StringType}} > - {{BinaryType}} > - {{DecimalType}} (but Spark SQL doesn't support pushing down {{DecimalType}} > columns for now.) > To avoid wrong query results, we should disable filter push-down for columns > of {{StringType}} and {{BinaryType}} until we upgrade to parquet-mr 1.8. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
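As a stopgap sketch on affected 1.5.x builds (assuming the existing {{spark.sql.parquet.filterPushdown}} flag and a hypothetical input path), push-down can also be switched off globally until the parquet-mr upgrade lands:
{code}
// Workaround sketch: turn off Parquet filter push-down entirely so the
// corrupted BINARY column statistics are never consulted.
sqlContext.setConf("spark.sql.parquet.filterPushdown", "false")

// Subsequent scans evaluate string/binary predicates in Spark instead of
// pushing them into the Parquet reader (path below is hypothetical).
val users = sqlContext.read.parquet("/path/to/users")
users.filter(users("name") === "foo").count()
{code}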
[jira] [Resolved] (SPARK-10953) Benchmark codegen vs. hand-written code for univariate statistics
[ https://issues.apache.org/jira/browse/SPARK-10953?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-10953. --- Resolution: Done Fix Version/s: 1.6.0 > Benchmark codegen vs. hand-written code for univariate statistics > - > > Key: SPARK-10953 > URL: https://issues.apache.org/jira/browse/SPARK-10953 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Xiangrui Meng >Assignee: Jihong MA > Fix For: 1.6.0 > > > I checked the generated code for a simple stddev_pop call: > {code} > val df = sqlContext.range(100) > df.select(stddev_pop(col("id"))).show() > {code} > This is the generated code for the merge part, which is very long and > complex. I'm not sure whether we can get benefit from the code generation for > univariate statistics. We should benchmark it against Scala implementation. > {code} > 15/10/06 10:10:57 DEBUG GenerateMutableProjection: code for if > (isnull(input[1, DoubleType])) cast(0 as double) else input[1, DoubleType],if > (isnull(input[1, DoubleType])) input[6, DoubleType] else if (isnull(input[6, > DoubleType])) input[1, DoubleType] else (input[1, DoubleType] + input[6, > DoubleType]),if (isnull(input[3, DoubleType])) cast(0 as double) else > input[3, DoubleType],if (isnull(input[3, DoubleType])) input[8, DoubleType] > else if (isnull(input[8, DoubleType])) input[3, DoubleType] else (((input[3, > DoubleType] * input[0, DoubleType]) + (input[8, DoubleType] * input[6, > DoubleType])) / (input[0, DoubleType] + input[6, DoubleType])),if > (isnull(input[4, DoubleType])) input[9, DoubleType] else if (isnull(input[9, > DoubleType])) input[4, DoubleType] else ((input[4, DoubleType] + input[9, > DoubleType]) + input[8, DoubleType] - input[2, DoubleType]) * (input[8, > DoubleType] - input[2, DoubleType])) * (input[0, DoubleType] * input[6, > DoubleType])) / (input[0, DoubleType] + input[6, DoubleType]))): > public Object generate(org.apache.spark.sql.catalyst.expressions.Expression[] > expr) { > return new SpecificMutableProjection(expr); > } > class SpecificMutableProjection extends > org.apache.spark.sql.catalyst.expressions.codegen.BaseMutableProjection { > private org.apache.spark.sql.catalyst.expressions.Expression[] expressions; > private org.apache.spark.sql.catalyst.expressions.MutableRow mutableRow; > public > SpecificMutableProjection(org.apache.spark.sql.catalyst.expressions.Expression[] > expr) { > expressions = expr; > mutableRow = new > org.apache.spark.sql.catalyst.expressions.GenericMutableRow(5); > } > public > org.apache.spark.sql.catalyst.expressions.codegen.BaseMutableProjection > target(org.apache.spark.sql.catalyst.expressions.MutableRow row) { > mutableRow = row; > return this; > } > /* Provide immutable access to the last projected row. */ > public InternalRow currentValue() { > return (InternalRow) mutableRow; > } > public Object apply(Object _i) { > InternalRow i = (InternalRow) _i; > /* if (isnull(input[1, DoubleType])) cast(0 as double) else input[1, > DoubleType] */ > /* isnull(input[1, DoubleType]) */ > /* input[1, DoubleType] */ > boolean isNull4 = i.isNullAt(1); > double primitive5 = isNull4 ? -1.0 : (i.getDouble(1)); > boolean isNull0 = false; > double primitive1 = -1.0; > if (!false && isNull4) { > /* cast(0 as double) */ > /* 0 */ > boolean isNull6 = false; > double primitive7 = -1.0; > if (!false) { > primitive7 = (double) 0; > } > isNull0 = isNull6; > primitive1 = primitive7; > } else { > /* input[1, DoubleType] */ > boolean isNull10 = i.isNullAt(1); > double primitive11 = isNull10 ? 
-1.0 : (i.getDouble(1)); > isNull0 = isNull10; > primitive1 = primitive11; > } > if (isNull0) { > mutableRow.setNullAt(0); > } else { > mutableRow.setDouble(0, primitive1); > } > /* if (isnull(input[1, DoubleType])) input[6, DoubleType] else if > (isnull(input[6, DoubleType])) input[1, DoubleType] else (input[1, > DoubleType] + input[6, DoubleType]) */ > /* isnull(input[1, DoubleType]) */ > /* input[1, DoubleType] */ > boolean isNull16 = i.isNullAt(1); > double primitive17 = isNull16 ? -1.0 : (i.getDouble(1)); > boolean isNull12 = false; > double primitive13 = -1.0; > if (!false && isNull16) { > /* input[6, DoubleType] */ > boolean isNull18 = i.isNullAt(6); > double primitive19 = isNull18 ? -1.0 : (i.getDouble(6)); > isNull12 = isNull18; > primitive13 = primitive19; > } else { > /* if (isnull(input[6, DoubleType])) input[1, DoubleType] else > (input[1, DoubleType] + input[6, DoubleType]) */ > /* isnull(input[6,
[jira] [Commented] (SPARK-10994) Local clustering coefficient computation in GraphX
[ https://issues.apache.org/jira/browse/SPARK-10994?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14961000#comment-14961000 ] Apache Spark commented on SPARK-10994: -- User 'SherlockYang' has created a pull request for this issue: https://github.com/apache/spark/pull/9150 > Local clustering coefficient computation in GraphX > -- > > Key: SPARK-10994 > URL: https://issues.apache.org/jira/browse/SPARK-10994 > Project: Spark > Issue Type: New Feature > Components: GraphX >Reporter: Yang Yang > Original Estimate: 336h > Remaining Estimate: 336h > > We propose to implement an algorithm to compute the local clustering > coefficient in GraphX. The local clustering coefficient of a vertex (node) in > a graph quantifies how close its neighbors are to being a clique (complete > graph). More specifically, the local clustering coefficient C_i for a vertex > v_i is given by the proportion of links between the vertices within its > neighbourhood divided by the number of links that could possibly exist > between them. Duncan J. Watts and Steven Strogatz introduced the measure in > 1998 to determine whether a graph is a small-world network. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
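For reference, the textbook formula behind the description above (undirected case), with \(N_i\) the neighborhood of vertex \(v_i\) and \(k_i = |N_i|\); this restates the definition, not the proposed GraphX API:

\[ C_i = \frac{2\,\bigl|\{(v_j, v_k) : v_j, v_k \in N_i,\ (v_j, v_k) \in E\}\bigr|}{k_i\,(k_i - 1)} \]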
[jira] [Assigned] (SPARK-10994) Local clustering coefficient computation in GraphX
[ https://issues.apache.org/jira/browse/SPARK-10994?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-10994: Assignee: (was: Apache Spark) > Local clustering coefficient computation in GraphX > -- > > Key: SPARK-10994 > URL: https://issues.apache.org/jira/browse/SPARK-10994 > Project: Spark > Issue Type: New Feature > Components: GraphX >Reporter: Yang Yang > Original Estimate: 336h > Remaining Estimate: 336h > > We propose to implement an algorithm to compute the local clustering > coefficient in GraphX. The local clustering coefficient of a vertex (node) in > a graph quantifies how close its neighbors are to being a clique (complete > graph). More specifically, the local clustering coefficient C_i for a vertex > v_i is given by the proportion of links between the vertices within its > neighbourhood divided by the number of links that could possibly exist > between them. Duncan J. Watts and Steven Strogatz introduced the measure in > 1998 to determine whether a graph is a small-world network. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-10994) Local clustering coefficient computation in GraphX
[ https://issues.apache.org/jira/browse/SPARK-10994?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-10994: Assignee: Apache Spark > Local clustering coefficient computation in GraphX > -- > > Key: SPARK-10994 > URL: https://issues.apache.org/jira/browse/SPARK-10994 > Project: Spark > Issue Type: New Feature > Components: GraphX >Reporter: Yang Yang >Assignee: Apache Spark > Original Estimate: 336h > Remaining Estimate: 336h > > We propose to implement an algorithm to compute the local clustering > coefficient in GraphX. The local clustering coefficient of a vertex (node) in > a graph quantifies how close its neighbors are to being a clique (complete > graph). More specifically, the local clustering coefficient C_i for a vertex > v_i is given by the proportion of links between the vertices within its > neighbourhood divided by the number of links that could possibly exist > between them. Duncan J. Watts and Steven Strogatz introduced the measure in > 1998 to determine whether a graph is a small-world network. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-11149) Improve performance of primitive types in columnar cache
Davies Liu created SPARK-11149: -- Summary: Improve performance of primitive types in columnar cache Key: SPARK-11149 URL: https://issues.apache.org/jira/browse/SPARK-11149 Project: Spark Issue Type: Improvement Components: SQL Reporter: Davies Liu Assignee: Davies Liu Improve performance of primitive types in columnar cache -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
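A minimal sketch of the workload in question, caching a single primitive ({{LongType}}) column in the in-memory columnar cache; the size is arbitrary:
{code}
// One LongType column ("id"); caching it goes through the columnar cache.
val df = sqlContext.range(1 << 22)
df.cache()
df.count()   // first pass builds the in-memory columnar buffers
df.count()   // second pass reads back the cached primitive column
{code}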
[jira] [Created] (SPARK-11153) Turns off Parquet filter push-down for string and binary columns
Cheng Lian created SPARK-11153: -- Summary: Turns off Parquet filter push-down for string and binary columns Key: SPARK-11153 URL: https://issues.apache.org/jira/browse/SPARK-11153 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.5.1, 1.5.0 Reporter: Cheng Lian Assignee: Cheng Lian Priority: Critical Due to PARQUET-251, {{BINARY}} columns in existing Parquet files may be written with corrupted statistics information. This information is used by filter push-down optimization. Since Spark 1.5 turns on Parquet filter push-down by default, we may end up with wrong query results. PARQUET-251 has been fixed in parquet-mr 1.8.1, but Spark 1.5 is still using 1.7.0. This affects all Spark SQL data types that can be mapped to Parquet {{BINARY}}, namely: - {{StringType}} - {{BinaryType}} - {{DecimalType}} (but Spark SQL doesn't support pushing down {{DecimalType}} columns for now.) To avoid wrong query results, we should disable filter push-down for columns of {{StringType}} and {{BinaryType}} until we upgrade to parquet-mr 1.8. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-11153) Turns off Parquet filter push-down for string and binary columns
[ https://issues.apache.org/jira/browse/SPARK-11153?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-11153: --- Priority: Blocker (was: Critical) > Turns off Parquet filter push-down for string and binary columns > > > Key: SPARK-11153 > URL: https://issues.apache.org/jira/browse/SPARK-11153 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.0, 1.5.1 >Reporter: Cheng Lian >Assignee: Cheng Lian >Priority: Blocker > > Due to PARQUET-251, {{BINARY}} columns in existing Parquet files may be > written with corrupted statistics information. This information is used by > filter push-down optimization. Since Spark 1.5 turns on Parquet filter > push-down by default, we may end up with wrong query results. PARQUET-251 has > been fixed in parquet-mr 1.8.1, but Spark 1.5 is still using 1.7.0. > This affects all Spark SQL data types that can be mapped to Parquet > {{BINARY}}, namely: > - {{StringType}} > - {{BinaryType}} > - {{DecimalType}} (but Spark SQL doesn't support pushing down {{DecimalType}} > columns for now.) > To avoid wrong query results, we should disable filter push-down for columns > of {{StringType}} and {{BinaryType}} until we upgrade to parquet-mr 1.8. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-11155) Stage summary json should include stage duration
Imran Rashid created SPARK-11155: Summary: Stage summary json should include stage duration Key: SPARK-11155 URL: https://issues.apache.org/jira/browse/SPARK-11155 Project: Spark Issue Type: Improvement Components: Web UI Reporter: Imran Rashid Priority: Minor The json endpoint for stages doesn't include information on the stage duration that is present in the UI. This looks like a simple oversight; these metrics should be included, e.g., at {{api/v1/applications//stages}}. The missing metrics are {{submissionTime}} and {{completionTime}} (and whatever other metrics come out of the discussion on SPARK-10930). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-10895) Add pushdown string filters for Parquet
[ https://issues.apache.org/jira/browse/SPARK-10895?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-10895: --- Assignee: Liang-Chi Hsieh > Add pushdown string filters for Parquet > --- > > Key: SPARK-10895 > URL: https://issues.apache.org/jira/browse/SPARK-10895 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Liang-Chi Hsieh >Assignee: Liang-Chi Hsieh > > We should be able to push down string filters such as contains, startsWith > and endsWith to Parquet. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
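For illustration, the kinds of string predicates involved (hypothetical column name and path); with this change they could be pushed to the Parquet reader instead of being evaluated in Spark:
{code}
import org.apache.spark.sql.functions.col

// Hypothetical data; "name" is a StringType column.
val users = sqlContext.read.parquet("/path/to/users")
users.filter(col("name").startsWith("A")).count()    // candidate for push-down
users.filter(col("name").endsWith("son")).count()    // candidate for push-down
users.filter(col("name").contains("ar")).count()     // candidate for push-down
{code}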
[jira] [Commented] (SPARK-10165) Nested Hive UDF resolution fails in Analyzer
[ https://issues.apache.org/jira/browse/SPARK-10165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14961159#comment-14961159 ] Michael Armbrust commented on SPARK-10165: -- That sounds like a different issue. Please open up a separate JIRA. > Nested Hive UDF resolution fails in Analyzer > > > Key: SPARK-10165 > URL: https://issues.apache.org/jira/browse/SPARK-10165 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.0 >Reporter: Michael Armbrust >Assignee: Michael Armbrust >Priority: Blocker > Fix For: 1.5.0 > > > When running a query with hive udfs nested in hive udfs the analyzer fails > since we don't check children resolution first. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-11154) make specification of spark.yarn.executor.memoryOverhead consistent with typical JVM options
Dustin Cote created SPARK-11154: --- Summary: make specification of spark.yarn.executor.memoryOverhead consistent with typical JVM options Key: SPARK-11154 URL: https://issues.apache.org/jira/browse/SPARK-11154 Project: Spark Issue Type: Improvement Components: Documentation, Spark Submit Reporter: Dustin Cote Priority: Minor spark.yarn.executor.memoryOverhead is currently specified in megabytes by default, but it would be nice to allow users to specify the size as though it were a typical -Xmx option to a JVM where you can have 'm' and 'g' appended to the end to explicitly specify megabytes or gigabytes. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
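A minimal sketch of the suffix handling being requested, keeping a bare number as megabytes for backwards compatibility; this is illustrative, not Spark's implementation:
{code}
// Illustrative parser for JVM-style size strings; not Spark's implementation.
// A bare number keeps the current meaning (megabytes) for backwards
// compatibility.
def parseOverheadMb(s: String): Long = {
  val v = s.trim.toLowerCase
  if (v.endsWith("g")) v.dropRight(1).toLong * 1024L
  else if (v.endsWith("m")) v.dropRight(1).toLong
  else v.toLong   // no suffix: interpret as megabytes, as today
}

parseOverheadMb("512m")   // 512
parseOverheadMb("2g")     // 2048
parseOverheadMb("384")    // 384
{code}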
[jira] [Issue Comment Deleted] (SPARK-10994) Local clustering coefficient computation in GraphX
[ https://issues.apache.org/jira/browse/SPARK-10994?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yang Yang updated SPARK-10994: -- Comment: was deleted (was: Proposed implementation: https://github.com/amplab/graphx/pull/148/) > Local clustering coefficient computation in GraphX > -- > > Key: SPARK-10994 > URL: https://issues.apache.org/jira/browse/SPARK-10994 > Project: Spark > Issue Type: New Feature > Components: GraphX >Reporter: Yang Yang > Original Estimate: 336h > Remaining Estimate: 336h > > We propose to implement an algorithm to compute the local clustering > coefficient in GraphX. The local clustering coefficient of a vertex (node) in > a graph quantifies how close its neighbors are to being a clique (complete > graph). More specifically, the local clustering coefficient C_i for a vertex > v_i is given by the proportion of links between the vertices within its > neighbourhood divided by the number of links that could possibly exist > between them. Duncan J. Watts and Steven Strogatz introduced the measure in > 1998 to determine whether a graph is a small-world network. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-11152) Streaming UI: Input sizes are 0 for makeup batches started from a checkpoint
Yongjia Wang created SPARK-11152: Summary: Streaming UI: Input sizes are 0 for makeup batches started from a checkpoint Key: SPARK-11152 URL: https://issues.apache.org/jira/browse/SPARK-11152 Project: Spark Issue Type: Bug Components: Streaming, Web UI Reporter: Yongjia Wang Priority: Minor When a streaming job starts from a checkpoint at batch time x, and say the current time when we resume this streaming job is x+10. In this scenario, since Spark will schedule the missing batches from x+1 to x+10 without any metadata, the behavior is to pack up all the backlogged inputs into batch x+1, then assign any new inputs into x+2 to x+10 immediately without waiting. This results in tiny batches that capture inputs only during the back to back scheduling intervals. This behavior is very reasonable. However, the streaming UI does not show correctly the input sizes for all these makeup batches - they are all 0 from batch x to x+10. Fixing this would be very helpful. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
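For context, the resume path described above is typically entered through {{StreamingContext.getOrCreate}}; a sketch with placeholder checkpoint directory, batch interval, and input setup:
{code}
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Illustrative resume-from-checkpoint setup; the checkpoint path and
// batch interval are placeholders.
def createContext(): StreamingContext = {
  val conf = new SparkConf().setAppName("checkpointed-app")
  val ssc = new StreamingContext(conf, Seconds(10))
  ssc.checkpoint("/path/to/checkpoints")
  // ... set up the Kafka direct stream and output operations here ...
  ssc
}

// If a checkpoint exists, the batches missed since the checkpointed batch
// time are scheduled back to back on restart -- these are the makeup
// batches whose input sizes the UI shows as 0.
val ssc = StreamingContext.getOrCreate("/path/to/checkpoints", createContext _)
ssc.start()
ssc.awaitTermination()
{code}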
[jira] [Updated] (SPARK-11152) Streaming UI: Input sizes are 0 for makeup batches started from a checkpoint
[ https://issues.apache.org/jira/browse/SPARK-11152?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yongjia Wang updated SPARK-11152: - Description: When a streaming job starts from a checkpoint at batch time x, and say the current time when we resume this streaming job is x+10. In this scenario, since Spark will schedule the missing batches from x+1 to x+10 without any metadata, the behavior is to pack up all the backlogged inputs into batch x+1, then assign any new inputs into x+2 to x+10 immediately without waiting. This results in tiny batches that capture inputs only during the back to back scheduling intervals. This behavior is very reasonable. However, the streaming UI does not show correctly the input sizes for all these makeup batches - they are all 0 from batch x to x+10. Fixing this would be very helpful. This happens when I use Kafka direct streaming, I assume this would happen for all other streaming sources as well. (was: When a streaming job starts from a checkpoint at batch time x, and say the current time when we resume this streaming job is x+10. In this scenario, since Spark will schedule the missing batches from x+1 to x+10 without any metadata, the behavior is to pack up all the backlogged inputs into batch x+1, then assign any new inputs into x+2 to x+10 immediately without waiting. This results in tiny batches that capture inputs only during the back to back scheduling intervals. This behavior is very reasonable. However, the streaming UI does not show correctly the input sizes for all these makeup batches - they are all 0 from batch x to x+10. Fixing this would be very helpful.) > Streaming UI: Input sizes are 0 for makeup batches started from a checkpoint > - > > Key: SPARK-11152 > URL: https://issues.apache.org/jira/browse/SPARK-11152 > Project: Spark > Issue Type: Bug > Components: Streaming, Web UI >Reporter: Yongjia Wang >Priority: Minor > > When a streaming job starts from a checkpoint at batch time x, and say the > current time when we resume this streaming job is x+10. In this scenario, > since Spark will schedule the missing batches from x+1 to x+10 without any > metadata, the behavior is to pack up all the backlogged inputs into batch > x+1, then assign any new inputs into x+2 to x+10 immediately without waiting. > This results in tiny batches that capture inputs only during the back to back > scheduling intervals. This behavior is very reasonable. However, the > streaming UI does not show correctly the input sizes for all these makeup > batches - they are all 0 from batch x to x+10. Fixing this would be very > helpful. This happens when I use Kafka direct streaming, I assume this would > happen for all other streaming sources as well. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-11149) Improve performance of primitive types in columnar cache
[ https://issues.apache.org/jira/browse/SPARK-11149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14961003#comment-14961003 ] Apache Spark commented on SPARK-11149: -- User 'davies' has created a pull request for this issue: https://github.com/apache/spark/pull/9145 > Improve performance of primitive types in columnar cache > > > Key: SPARK-11149 > URL: https://issues.apache.org/jira/browse/SPARK-11149 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Davies Liu >Assignee: Davies Liu > > Improve performance of primitive types in columnar cache -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-11149) Improve performance of primitive types in columnar cache
[ https://issues.apache.org/jira/browse/SPARK-11149?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-11149: Assignee: Apache Spark (was: Davies Liu) > Improve performance of primitive types in columnar cache > > > Key: SPARK-11149 > URL: https://issues.apache.org/jira/browse/SPARK-11149 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Davies Liu >Assignee: Apache Spark > > Improve performance of primitive types in columnar cache -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-11149) Improve performance of primitive types in columnar cache
[ https://issues.apache.org/jira/browse/SPARK-11149?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-11149: Assignee: Davies Liu (was: Apache Spark) > Improve performance of primitive types in columnar cache > > > Key: SPARK-11149 > URL: https://issues.apache.org/jira/browse/SPARK-11149 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Davies Liu >Assignee: Davies Liu > > Improve performance of primitive types in columnar cache -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-11153) Turns off Parquet filter push-down for string and binary columns
[ https://issues.apache.org/jira/browse/SPARK-11153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14961147#comment-14961147 ] Michael Armbrust commented on SPARK-11153: -- Its actually corrupted statistics in data that is written? Does parquet write the version in the metadata? Should we actually be turning this off based on writer version? > Turns off Parquet filter push-down for string and binary columns > > > Key: SPARK-11153 > URL: https://issues.apache.org/jira/browse/SPARK-11153 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.0, 1.5.1 >Reporter: Cheng Lian >Assignee: Cheng Lian >Priority: Blocker > > Due to PARQUET-251, {{BINARY}} columns in existing Parquet files may be > written with corrupted statistics information. This information is used by > filter push-down optimization. Since Spark 1.5 turns on Parquet filter > push-down by default, we may end up with wrong query results. PARQUET-251 has > been fixed in parquet-mr 1.8.1, but Spark 1.5 is still using 1.7.0. > This affects all Spark SQL data types that can be mapped to Parquet > {{BINARY}}, namely: > - {{StringType}} > - {{BinaryType}} > - {{DecimalType}} (but Spark SQL doesn't support pushing down {{DecimalType}} > columns for now.) > To avoid wrong query results, we should disable filter push-down for columns > of {{StringType}} and {{BinaryType}} until we upgrade to parquet-mr 1.8. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10953) Benchmark codegen vs. hand-written code for univariate statistics
[ https://issues.apache.org/jira/browse/SPARK-10953?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14961209#comment-14961209 ] Xiangrui Meng commented on SPARK-10953: --- That sounds good. I'm closing this for now since the conclusion is clear. > Benchmark codegen vs. hand-written code for univariate statistics > - > > Key: SPARK-10953 > URL: https://issues.apache.org/jira/browse/SPARK-10953 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Xiangrui Meng >Assignee: Jihong MA > Fix For: 1.6.0 > > > I checked the generated code for a simple stddev_pop call: > {code} > val df = sqlContext.range(100) > df.select(stddev_pop(col("id"))).show() > {code} > This is the generated code for the merge part, which is very long and > complex. I'm not sure whether we can get benefit from the code generation for > univariate statistics. We should benchmark it against Scala implementation. > {code} > 15/10/06 10:10:57 DEBUG GenerateMutableProjection: code for if > (isnull(input[1, DoubleType])) cast(0 as double) else input[1, DoubleType],if > (isnull(input[1, DoubleType])) input[6, DoubleType] else if (isnull(input[6, > DoubleType])) input[1, DoubleType] else (input[1, DoubleType] + input[6, > DoubleType]),if (isnull(input[3, DoubleType])) cast(0 as double) else > input[3, DoubleType],if (isnull(input[3, DoubleType])) input[8, DoubleType] > else if (isnull(input[8, DoubleType])) input[3, DoubleType] else (((input[3, > DoubleType] * input[0, DoubleType]) + (input[8, DoubleType] * input[6, > DoubleType])) / (input[0, DoubleType] + input[6, DoubleType])),if > (isnull(input[4, DoubleType])) input[9, DoubleType] else if (isnull(input[9, > DoubleType])) input[4, DoubleType] else ((input[4, DoubleType] + input[9, > DoubleType]) + input[8, DoubleType] - input[2, DoubleType]) * (input[8, > DoubleType] - input[2, DoubleType])) * (input[0, DoubleType] * input[6, > DoubleType])) / (input[0, DoubleType] + input[6, DoubleType]))): > public Object generate(org.apache.spark.sql.catalyst.expressions.Expression[] > expr) { > return new SpecificMutableProjection(expr); > } > class SpecificMutableProjection extends > org.apache.spark.sql.catalyst.expressions.codegen.BaseMutableProjection { > private org.apache.spark.sql.catalyst.expressions.Expression[] expressions; > private org.apache.spark.sql.catalyst.expressions.MutableRow mutableRow; > public > SpecificMutableProjection(org.apache.spark.sql.catalyst.expressions.Expression[] > expr) { > expressions = expr; > mutableRow = new > org.apache.spark.sql.catalyst.expressions.GenericMutableRow(5); > } > public > org.apache.spark.sql.catalyst.expressions.codegen.BaseMutableProjection > target(org.apache.spark.sql.catalyst.expressions.MutableRow row) { > mutableRow = row; > return this; > } > /* Provide immutable access to the last projected row. */ > public InternalRow currentValue() { > return (InternalRow) mutableRow; > } > public Object apply(Object _i) { > InternalRow i = (InternalRow) _i; > /* if (isnull(input[1, DoubleType])) cast(0 as double) else input[1, > DoubleType] */ > /* isnull(input[1, DoubleType]) */ > /* input[1, DoubleType] */ > boolean isNull4 = i.isNullAt(1); > double primitive5 = isNull4 ? 
-1.0 : (i.getDouble(1)); > boolean isNull0 = false; > double primitive1 = -1.0; > if (!false && isNull4) { > /* cast(0 as double) */ > /* 0 */ > boolean isNull6 = false; > double primitive7 = -1.0; > if (!false) { > primitive7 = (double) 0; > } > isNull0 = isNull6; > primitive1 = primitive7; > } else { > /* input[1, DoubleType] */ > boolean isNull10 = i.isNullAt(1); > double primitive11 = isNull10 ? -1.0 : (i.getDouble(1)); > isNull0 = isNull10; > primitive1 = primitive11; > } > if (isNull0) { > mutableRow.setNullAt(0); > } else { > mutableRow.setDouble(0, primitive1); > } > /* if (isnull(input[1, DoubleType])) input[6, DoubleType] else if > (isnull(input[6, DoubleType])) input[1, DoubleType] else (input[1, > DoubleType] + input[6, DoubleType]) */ > /* isnull(input[1, DoubleType]) */ > /* input[1, DoubleType] */ > boolean isNull16 = i.isNullAt(1); > double primitive17 = isNull16 ? -1.0 : (i.getDouble(1)); > boolean isNull12 = false; > double primitive13 = -1.0; > if (!false && isNull16) { > /* input[6, DoubleType] */ > boolean isNull18 = i.isNullAt(6); > double primitive19 = isNull18 ? -1.0 : (i.getDouble(6)); > isNull12 = isNull18; > primitive13 = primitive19; > } else { > /* if (isnull(input[6, DoubleType])) input[1, DoubleType] else > (input[1,
[jira] [Created] (SPARK-11151) Use Long internally for DecimalType with precision <= 18
Davies Liu created SPARK-11151: -- Summary: Use Long internally for DecimalType with precision <= 18 Key: SPARK-11151 URL: https://issues.apache.org/jira/browse/SPARK-11151 Project: Spark Issue Type: Improvement Components: SQL Reporter: Davies Liu It's expensive to create a Decimal object for small values; we could use Long directly, just like what we did for Date and Timestamp. This will involve a lot of changes, including: 1) inbound/outbound conversion 2) access/storage in InternalRow 3) all the expressions that support DecimalType -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
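A hedged sketch of the core idea: any decimal with precision <= 18 fits its unscaled value in a signed 64-bit integer, so it can be carried as a Long plus a scale. Names below are illustrative, not the actual Catalyst internals:
{code}
import java.math.BigDecimal

// Illustrative encoding only; names do not reflect Catalyst internals.
// A decimal with precision <= 18 fits its unscaled value in a signed Long.
val MaxLongDigits = 18

def toUnscaledLong(d: BigDecimal, scale: Int): Long = {
  val rescaled = d.setScale(scale)
  require(rescaled.precision <= MaxLongDigits, "needs the generic Decimal path")
  rescaled.unscaledValue.longValueExact()
}

def fromUnscaledLong(unscaled: Long, scale: Int): BigDecimal =
  BigDecimal.valueOf(unscaled, scale)

val encoded = toUnscaledLong(new BigDecimal("123.45"), scale = 2)  // 12345L
val decoded = fromUnscaledLong(encoded, scale = 2)                 // 123.45
{code}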
[jira] [Commented] (SPARK-11147) HTTP 500 if try to access Spark UI in yarn-cluster
[ https://issues.apache.org/jira/browse/SPARK-11147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14961088#comment-14961088 ] Sebastian YEPES FERNANDEZ commented on SPARK-11147: --- I don't think it's a networking issue: until now we have not had any issue like this, we regularly submit jobs in client mode, and all worker nodes communicate correctly. Which part of the logs (YARN or Spark) would be the most useful so we can pinpoint this problem? Note: Between all the servers there are no firewalls nor OS filtering. > HTTP 500 if try to access Spark UI in yarn-cluster > -- > > Key: SPARK-11147 > URL: https://issues.apache.org/jira/browse/SPARK-11147 > Project: Spark > Issue Type: Bug > Components: Web UI, YARN >Affects Versions: 1.5.1 > Environment: HDP: 2.3.2.0-2950 (Hadoop 2.7.1.2.3.2.0-2950) > Spark: 1.5.x (c27e1904) >Reporter: Sebastian YEPES FERNANDEZ > > Hello, > I am facing a similar issue as described in SPARK-5837, but in my case the > Spark UI only works in "yarn-client" mode. If I run the same job using > "yarn-cluster" I get the HTTP 500 error: > {code} > HTTP ERROR 500 > Problem accessing /proxy/application_1444297190346_0085/. Reason: > Connection to http://XX.XX.XX.XX:55827 refused > Caused by: > org.apache.http.conn.HttpHostConnectException: Connection to > http://XX.XX.XX.XX:55827 refused > at > org.apache.http.impl.conn.DefaultClientConnectionOperator.openConnection(DefaultClientConnectionOperator.java:190) > at > org.apache.http.impl.conn.ManagedClientConnectionImpl.open(ManagedClientConnectionImpl.java:294) > at > org.apache.http.impl.client.DefaultRequestDirector.tryConnect(DefaultRequestDirector.java:643) > at > org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:479) > at > org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:906) > {code} > I have verified that the UI port "55827" is actually listening on the worker > node; I can even run a "curl http://XX.XX.XX.XX:55827" and it redirects me to > another URL: http://YY.YY.YY.YY:8088/proxy/application_1444297190346_0082 > The strange thing is that it's redirecting me to the app "_0082" and not the > actually running job "_0085". > Does anyone have any suggestions on what could be causing this issue? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-11124) JsonParser/Generator should be closed for resource recycle
[ https://issues.apache.org/jira/browse/SPARK-11124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-11124. - Resolution: Fixed Assignee: Navis Fix Version/s: 1.6.0 > JsonParser/Generator should be closed for resource recycle > -- > > Key: SPARK-11124 > URL: https://issues.apache.org/jira/browse/SPARK-11124 > Project: Spark > Issue Type: Bug > Components: Spark Core >Reporter: Navis >Assignee: Navis >Priority: Trivial > Fix For: 1.6.0 > > > Some json parsers are not closed. parser in JacksonParser#parseJson, for > example. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-11154) make specification of spark.yarn.executor.memoryOverhead consistent with typical JVM options
[ https://issues.apache.org/jira/browse/SPARK-11154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14961184#comment-14961184 ] Sean Owen commented on SPARK-11154: --- Should be for all similar properties, not just this one. The twist is that you have to support the current syntax. 1000 must mean "1000 megabytes". But then someone writing "100" would be surprised to find that it means "100 megabytes". (CM might do just this, note.) Hence I'm actually not sure if this is feasible. > make specification of spark.yarn.executor.memoryOverhead consistent with > typical JVM options > -- > > Key: SPARK-11154 > URL: https://issues.apache.org/jira/browse/SPARK-11154 > Project: Spark > Issue Type: Improvement > Components: Documentation, Spark Submit >Reporter: Dustin Cote >Priority: Minor > > spark.yarn.executor.memoryOverhead is currently specified in megabytes by > default, but it would be nice to allow users to specify the size as though it > were a typical -Xmx option to a JVM where you can have 'm' and 'g' appended > to the end to explicitly specify megabytes or gigabytes. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-10641) skewness and kurtosis support
[ https://issues.apache.org/jira/browse/SPARK-10641?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-10641: -- Attachment: simpler-moments.pdf I did some calculation offline and got a simpler formula for updating high-order moments. [~sethah] If you are interested, you can implement ImperativeAggregate using this formula. > skewness and kurtosis support > - > > Key: SPARK-10641 > URL: https://issues.apache.org/jira/browse/SPARK-10641 > Project: Spark > Issue Type: Sub-task > Components: ML, SQL >Reporter: Jihong MA >Assignee: Seth Hendrickson > Attachments: simpler-moments.pdf > > > Implementing skewness and kurtosis support based on following algorithm: > https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Higher-order_statistics -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
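For reference, a compact sketch of the one-pass update of the central moments that skewness and kurtosis are built from, following the Wikipedia algorithm linked in the description (not the simplified formulas in the attached PDF):
{code}
// Streaming update of the central moments M2, M3, M4 (standard one-pass
// recurrences); skewness and kurtosis follow from the final moments.
class MomentAggregator {
  private var n = 0L
  private var mean, m2, m3, m4 = 0.0

  def update(x: Double): Unit = {
    val n1 = n
    n += 1
    val delta = x - mean
    val deltaN = delta / n
    val deltaN2 = deltaN * deltaN
    val term1 = delta * deltaN * n1
    mean += deltaN
    m4 += term1 * deltaN2 * (n * n - 3 * n + 3) + 6 * deltaN2 * m2 - 4 * deltaN * m3
    m3 += term1 * deltaN * (n - 2) - 3 * deltaN * m2
    m2 += term1
  }

  def skewness: Double = math.sqrt(n.toDouble) * m3 / math.pow(m2, 1.5)
  def kurtosis: Double = n * m4 / (m2 * m2) - 3.0   // excess kurtosis
}
{code}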
[jira] [Created] (SPARK-11157) Allow Spark to be built without assemblies
Marcelo Vanzin created SPARK-11157: -- Summary: Allow Spark to be built without assemblies Key: SPARK-11157 URL: https://issues.apache.org/jira/browse/SPARK-11157 Project: Spark Issue Type: Umbrella Components: Build, Spark Core, YARN Reporter: Marcelo Vanzin For reasoning, discussion of pros and cons, and other more detailed information, please see attached doc. The idea is to be able to build a Spark distribution that has just a directory full of jars instead of the huge assembly files we currently have. Getting there requires changes in a bunch of places, I'll try to list the ones I identified in the document, in the order that I think would be needed to not break things: * make streaming backends not be assemblies Since people may depend on the current assembly artifacts in their deployments, we can't really remove them; but we can make them be dummy jars and rely on dependency resolution to download all the jars. PySpark tests would also need some tweaking here. * make examples jar not be an assembly Probably requires tweaks to the {{run-example}} script. The location of the examples jar would have to change (it won't be able to live in the same place as the main Spark jars anymore). * update YARN backend to handle a directory full of jars when launching apps Currently YARN localizes the Spark assembly (depending on the user configuration); it needs to be modified so that it can localize all needed libraries instead of a single jar. * Modify launcher library to handle the jars directory This should be trivial * Modify {{assembly/pom.xml}} to generate assembly or a {{libs}} directory depending on which profile is enabled. We should keep the option to build with the assembly on by default, for backwards compatibility, to give people time to prepare. Filing this bug as an umbrella; please file sub-tasks if you plan to work on a specific part of the issue. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-11155) Stage summary json should include stage duration
[ https://issues.apache.org/jira/browse/SPARK-11155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14961366#comment-14961366 ] Kay Ousterhout commented on SPARK-11155: [~imranr] where exactly do you mean this is missing? I thought you meant the Json info for StageSubmitted / StageCompleted, but that does include the stage submission and completion time (via StageInfo), which can be used to compute the duration. > Stage summary json should include stage duration > - > > Key: SPARK-11155 > URL: https://issues.apache.org/jira/browse/SPARK-11155 > Project: Spark > Issue Type: Improvement > Components: Web UI >Reporter: Imran Rashid >Priority: Minor > Labels: Starter > > The json endpoint for stages doesn't include information on the stage > duration that is present in the UI. This looks like a simple oversight, they > should be included. eg., the metrics should be included at > {{api/v1/applications//stages}}. The missing metrics are > {{submissionTime}} and {{completionTime}} (and whatever other metrics come > out of the discussion on SPARK-10930) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-11050) PySpark SparseVector can return wrong index in error message
[ https://issues.apache.org/jira/browse/SPARK-11050?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley resolved SPARK-11050. --- Resolution: Fixed Fix Version/s: 1.6.0 Issue resolved by pull request 9069 [https://github.com/apache/spark/pull/9069] > PySpark SparseVector can return wrong index in error message > > > Key: SPARK-11050 > URL: https://issues.apache.org/jira/browse/SPARK-11050 > Project: Spark > Issue Type: Improvement > Components: MLlib, PySpark >Affects Versions: 1.6.0 >Reporter: Joseph K. Bradley >Assignee: Bhargav Mangipudi >Priority: Trivial > Labels: starter > Fix For: 1.6.0 > > > PySpark {{SparseVector.__getitem__}} returns an error message if given a bad > index here: > [https://github.com/apache/spark/blob/a16396df76cc27099011bfb96b28cbdd7f964ca8/python/pyspark/mllib/linalg/__init__.py#L770] > But the index it complains about could have been modified (if negative), > meaning the index in the error message could be wrong. This should be > corrected. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-11153) Turns off Parquet filter push-down for string and binary columns
[ https://issues.apache.org/jira/browse/SPARK-11153?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-11153: Assignee: Cheng Lian (was: Apache Spark) > Turns off Parquet filter push-down for string and binary columns > > > Key: SPARK-11153 > URL: https://issues.apache.org/jira/browse/SPARK-11153 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.0, 1.5.1 >Reporter: Cheng Lian >Assignee: Cheng Lian >Priority: Blocker > > Due to PARQUET-251, {{BINARY}} columns in existing Parquet files may be > written with corrupted statistics information. This information is used by > filter push-down optimization. Since Spark 1.5 turns on Parquet filter > push-down by default, we may end up with wrong query results. PARQUET-251 has > been fixed in parquet-mr 1.8.1, but Spark 1.5 is still using 1.7.0. > Note that this kind of corrupted Parquet files could be produced by any > Parquet data models. > This affects all Spark SQL data types that can be mapped to Parquet > {{BINARY}}, namely: > - {{StringType}} > - {{BinaryType}} > - {{DecimalType}} (but Spark SQL doesn't support pushing down {{DecimalType}} > columns for now.) > To avoid wrong query results, we should disable filter push-down for columns > of {{StringType}} and {{BinaryType}} until we upgrade to parquet-mr 1.8. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-11153) Turns off Parquet filter push-down for string and binary columns
[ https://issues.apache.org/jira/browse/SPARK-11153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14961300#comment-14961300 ] Apache Spark commented on SPARK-11153: -- User 'liancheng' has created a pull request for this issue: https://github.com/apache/spark/pull/9152 > Turns off Parquet filter push-down for string and binary columns > > > Key: SPARK-11153 > URL: https://issues.apache.org/jira/browse/SPARK-11153 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.0, 1.5.1 >Reporter: Cheng Lian >Assignee: Cheng Lian >Priority: Blocker > > Due to PARQUET-251, {{BINARY}} columns in existing Parquet files may be > written with corrupted statistics information. This information is used by > filter push-down optimization. Since Spark 1.5 turns on Parquet filter > push-down by default, we may end up with wrong query results. PARQUET-251 has > been fixed in parquet-mr 1.8.1, but Spark 1.5 is still using 1.7.0. > Note that this kind of corrupted Parquet files could be produced by any > Parquet data models. > This affects all Spark SQL data types that can be mapped to Parquet > {{BINARY}}, namely: > - {{StringType}} > - {{BinaryType}} > - {{DecimalType}} (but Spark SQL doesn't support pushing down {{DecimalType}} > columns for now.) > To avoid wrong query results, we should disable filter push-down for columns > of {{StringType}} and {{BinaryType}} until we upgrade to parquet-mr 1.8. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-11158) Add more information in Error statement for sql/types _verify_type()
[ https://issues.apache.org/jira/browse/SPARK-11158?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-11158: Assignee: Apache Spark > Add more information in Error statement for sql/types _verify_type() > --- > > Key: SPARK-11158 > URL: https://issues.apache.org/jira/browse/SPARK-11158 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Mahmoud Lababidi >Assignee: Apache Spark >Priority: Minor > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-11158) Add more information in Error statement for sql/types _verify_type()
[ https://issues.apache.org/jira/browse/SPARK-11158?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-11158: Assignee: (was: Apache Spark) > Add more information in Error statement for sql/types _verify_type() > --- > > Key: SPARK-11158 > URL: https://issues.apache.org/jira/browse/SPARK-11158 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Mahmoud Lababidi >Priority: Minor > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-11158) Add more information in Error statement for sql/types _verify_type()
[ https://issues.apache.org/jira/browse/SPARK-11158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14961315#comment-14961315 ] Apache Spark commented on SPARK-11158: -- User 'lababidi' has created a pull request for this issue: https://github.com/apache/spark/pull/9149 > Add more information in Error statement for sql/types _verify_type() > --- > > Key: SPARK-11158 > URL: https://issues.apache.org/jira/browse/SPARK-11158 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Mahmoud Lababidi >Priority: Minor > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-10581) Groups are not resolved in scaladoc for org.apache.spark.sql.Column
[ https://issues.apache.org/jira/browse/SPARK-10581?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-10581. - Resolution: Fixed Fix Version/s: 1.6.0 1.5.2 > Groups are not resolved in scaladoc for org.apache.spark.sql.Column > --- > > Key: SPARK-10581 > URL: https://issues.apache.org/jira/browse/SPARK-10581 > Project: Spark > Issue Type: Bug > Components: Documentation >Reporter: Jacek Laskowski >Priority: Minor > Fix For: 1.5.2, 1.6.0 > > > The Scala API documentation (scaladoc) for > [org.apache.spark.sql.Column|http://people.apache.org/~pwendell/spark-nightly/spark-master-docs/latest/api/scala/index.html#org.apache.spark.sql.Column] > does not resolve groups, and they appear unresolved like {{df_ops}}, > {{expr_ops}}, et al. instead of _DataFrame functions._, _Expression > operators._, et al. > BTW, > [DataFrame|http://people.apache.org/~pwendell/spark-nightly/spark-master-docs/latest/api/scala/index.html#org.apache.spark.sql.DataFrame] > and other classes in the > [org.apache.spark.sql|http://people.apache.org/~pwendell/spark-nightly/spark-master-docs/latest/api/scala/index.html#org.apache.spark.sql.package] > package seem fine. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-10641) skewness and kurtosis support
[ https://issues.apache.org/jira/browse/SPARK-10641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14961359#comment-14961359 ] Seth Hendrickson edited comment on SPARK-10641 at 10/16/15 9:07 PM: [~mengxr] I am interested, do you mind providing it or a link to it? was (Author: sethah): [~mengxr] I am interested, do you mine providing it or a link to it? > skewness and kurtosis support > - > > Key: SPARK-10641 > URL: https://issues.apache.org/jira/browse/SPARK-10641 > Project: Spark > Issue Type: Sub-task > Components: ML, SQL >Reporter: Jihong MA >Assignee: Seth Hendrickson > Attachments: simpler-moments.pdf > > > Implementing skewness and kurtosis support based on following algorithm: > https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Higher-order_statistics -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-10581) Groups are not resolved in scaladoc for org.apache.spark.sql.Column
[ https://issues.apache.org/jira/browse/SPARK-10581?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-10581: Assignee: Pravin Vishnu Gadakh > Groups are not resolved in scaladoc for org.apache.spark.sql.Column > --- > > Key: SPARK-10581 > URL: https://issues.apache.org/jira/browse/SPARK-10581 > Project: Spark > Issue Type: Bug > Components: Documentation >Reporter: Jacek Laskowski >Assignee: Pravin Vishnu Gadakh >Priority: Minor > Fix For: 1.5.2, 1.6.0 > > > The Scala API documentation (scaladoc) for > [org.apache.spark.sql.Column|http://people.apache.org/~pwendell/spark-nightly/spark-master-docs/latest/api/scala/index.html#org.apache.spark.sql.Column] > does not resolve groups, and they appear unresolved like {{df_ops}}, > {{expr_ops}}, et al. instead of _DataFrame functions._, _Expression > operators._, et al. > BTW, > [DataFrame|http://people.apache.org/~pwendell/spark-nightly/spark-master-docs/latest/api/scala/index.html#org.apache.spark.sql.DataFrame] > and other classes in the > [org.apache.spark.sql|http://people.apache.org/~pwendell/spark-nightly/spark-master-docs/latest/api/scala/index.html#org.apache.spark.sql.package] > package seem fine. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-11156) Web UI doesn't count or show info about replicated blocks
Ryan Williams created SPARK-11156: - Summary: Web UI doesn't count or show info about replicated blocks Key: SPARK-11156 URL: https://issues.apache.org/jira/browse/SPARK-11156 Project: Spark Issue Type: Bug Components: Web UI Affects Versions: 1.5.1 Reporter: Ryan Williams When executors receive a replica of a block, they [notify the driver with a {{UpdateBlockInfo}} message|https://github.com/apache/spark/blob/4ee2cea2a43f7d04ab8511d9c029f80c5dadd48e/core/src/main/scala/org/apache/spark/storage/BlockManagerMaster.scala#L59-L61] which [sends a {{SparkListenerBlockUpdated}} event to SparkListeners|https://github.com/apache/spark/blob/4ee2cea2a43f7d04ab8511d9c029f80c5dadd48e/core/src/main/scala/org/apache/spark/storage/BlockManagerMasterEndpoint.scala#L67]. However, the web UI (via its BlockStatusListener) [ignores {{SparkListenerBlockUpdated}} events for non-streaming blocks|https://github.com/apache/spark/blob/4ee2cea2a43f7d04ab8511d9c029f80c5dadd48e/core/src/main/scala/org/apache/spark/storage/BlockStatusListener.scala#L57-L60]. As a result, in non-streaming apps: * The "Executors" column on RDD Page doesn't show executors housing replicas; it can only show the executor that initially computed (and initiated replication of) the block. * The executor-memory-usage and related stats displayed throughout the web interface are undercounting due to ignorance of the existence of block replicas. For example, here is the Storage tab for a simple app with 3 identical RDDs cached with replication equal to 1, 2, and 3: !http://f.cl.ly/items/3m3B2v2k2J23350I3t1c/Screen%20Shot%202015-10-16%20at%2012.30.54%20AM.png! These were generated with: {code} val bar1 = sc.parallelize(1 to 1, 100).map(_ % 100 -> 1).reduceByKey(_+_, 100).setName("bar1").persist(StorageLevel(false, true, false, true, 1)) bar1.count val bar2 = sc.parallelize(1 to 1, 100).map(_ % 100 -> 1).reduceByKey(_+_, 100).setName("bar2").persist(StorageLevel(false, true, false, true, 2)) bar2.count val bar3 = sc.parallelize(1 to 1, 100).map(_ % 100 -> 1).reduceByKey(_+_, 100).setName("bar3").persist(StorageLevel(false, true, false, true, 3)) bar3.count {code} Note the identically-reported memory usage across the three. Here is the RDD page for the 3x-replicated RDD above: !http://f.cl.ly/items/0t0H1o2S2g140s1A0X0k/Screen%20Shot%202015-10-16%20at%2012.31.24%20AM.png! Note that only one executor is listed for each partition. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-9409) make-distribution.sh should copy all files in conf, so that it's easy to create a distro with custom configuration and property settings
[ https://issues.apache.org/jira/browse/SPARK-9409?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen resolved SPARK-9409. --- Resolution: Won't Fix > make-distribution.sh should copy all files in conf, so that it's easy to > create a distro with custom configuration and property settings > > > Key: SPARK-9409 > URL: https://issues.apache.org/jira/browse/SPARK-9409 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 1.4.1 > Environment: MacOS, Linux >Reporter: Dean Wampler >Priority: Minor > Labels: easyfix > Original Estimate: 1h > Remaining Estimate: 1h > > When using make-distribution.sh to build a custom distribution, it would be > nice to be able to drop custom configuration files in the conf directory and > have them included in the archive. Currently, only the *.template files are > included. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-11155) Stage summary json should include stage duration
[ https://issues.apache.org/jira/browse/SPARK-11155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14961282#comment-14961282 ] Xin Ren commented on SPARK-11155: - Hi, I'd like to have a try on this one. Thanks > Stage summary json should include stage duration > - > > Key: SPARK-11155 > URL: https://issues.apache.org/jira/browse/SPARK-11155 > Project: Spark > Issue Type: Improvement > Components: Web UI >Reporter: Imran Rashid >Priority: Minor > Labels: Starter > > The json endpoint for stages doesn't include information on the stage > duration that is present in the UI. This looks like a simple oversight, they > should be included. eg., the metrics should be included at > {{api/v1/applications//stages}}. The missing metrics are > {{submissionTime}} and {{completionTime}} (and whatever other metrics come > out of the discussion on SPARK-10930) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
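For reference, the stage summary JSON discussed above can already be read from the REST API of a running application; a minimal Python sketch, assuming a local UI at http://localhost:4040 and a hypothetical application id. The {{submissionTime}}/{{completionTime}} fields are exactly what this issue asks to add, so they are read defensively:
{code}
# Hedged sketch: read the stage summary JSON from a running application's UI.
# Host, port and application id below are illustrative only.
import json
import urllib2  # Python 2, matching the era of this report

app_id = "app-20151016120000-0000"  # hypothetical
url = "http://localhost:4040/api/v1/applications/%s/stages" % app_id
stages = json.load(urllib2.urlopen(url))
for stage in stages:
    # submissionTime / completionTime are the metrics this issue proposes to expose
    print("%s %s %s" % (stage["stageId"], stage.get("submissionTime"), stage.get("completionTime")))
{code}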
[jira] [Assigned] (SPARK-11127) Upgrade Kinesis Client Library to the latest stable version
[ https://issues.apache.org/jira/browse/SPARK-11127?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-11127: Assignee: Tathagata Das (was: Apache Spark) > Upgrade Kinesis Client Library to the latest stable version > --- > > Key: SPARK-11127 > URL: https://issues.apache.org/jira/browse/SPARK-11127 > Project: Spark > Issue Type: Improvement > Components: Streaming >Affects Versions: 1.6.0 >Reporter: Xiangrui Meng >Assignee: Tathagata Das > > We use KCL 1.3.0 in the current master. KCL 1.4.0 added integration with > Kinesis Producer Library (KPL) and support auto de-aggregation. It would be > great to upgrade KCL to the latest stable version. > Note that the latest version is 1.6.1 and 1.6.0 restored compatibility with > dynamodb-streams-kinesis-adapter, which was broken in 1.4.0. See > https://github.com/awslabs/amazon-kinesis-client#release-notes. > [~tdas] [~brkyvz] Please recommend a version for upgrade. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-11127) Upgrade Kinesis Client Library to the latest stable version
[ https://issues.apache.org/jira/browse/SPARK-11127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14961308#comment-14961308 ] Apache Spark commented on SPARK-11127: -- User 'mengxr' has created a pull request for this issue: https://github.com/apache/spark/pull/9153 > Upgrade Kinesis Client Library to the latest stable version > --- > > Key: SPARK-11127 > URL: https://issues.apache.org/jira/browse/SPARK-11127 > Project: Spark > Issue Type: Improvement > Components: Streaming >Affects Versions: 1.6.0 >Reporter: Xiangrui Meng >Assignee: Tathagata Das > > We use KCL 1.3.0 in the current master. KCL 1.4.0 added integration with > Kinesis Producer Library (KPL) and support auto de-aggregation. It would be > great to upgrade KCL to the latest stable version. > Note that the latest version is 1.6.1 and 1.6.0 restored compatibility with > dynamodb-streams-kinesis-adapter, which was broken in 1.4.0. See > https://github.com/awslabs/amazon-kinesis-client#release-notes. > [~tdas] [~brkyvz] Please recommend a version for upgrade. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-11160) CloudPickleSerializer conflicts with xmlrunner
Gabor Liptak created SPARK-11160: Summary: CloudPickleSerializer conflicts with xmlrunner Key: SPARK-11160 URL: https://issues.apache.org/jira/browse/SPARK-11160 Project: Spark Issue Type: Bug Components: PySpark Reporter: Gabor Liptak Priority: Minor Change L259 in pyspark/tests.py to:
{code}
# Regression test for SPARK-3415
def test_pickling_file_handles(self):
    # JIRA number here
    if xmlrunner is None:
        ser = CloudPickleSerializer()
        out1 = sys.stderr
        out2 = ser.loads(ser.dumps(out1))
        self.assertEqual(out1, out2)
{code}
The issue is that CloudPickleSerializer wraps stderr, which conflicts with xmlrunner. But it might take some time to fix. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-11153) Turns off Parquet filter push-down for string and binary columns
[ https://issues.apache.org/jira/browse/SPARK-11153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14961281#comment-14961281 ] Cheng Lian commented on SPARK-11153: Yes, it's the statistics information that is corrupted. And yes, Parquet does write its version in the metadata. Parquet-mr 1.8 handles this issue in exactly the way you suggested, namely ignoring binary statistics when necessary according to version information written in the metadata. However, Spark SQL performs filter push-down on the driver side. This means we need to gather Parquet versions from all Parquet files using a distributed Spark job. We can probably merge this one into the job used to merge Spark schemata. But I think this is too risky for 1.5.2 at this stage. So I'd propose we simply disable filter push-down for strings and binaries in all cases until a parquet-mr upgrade. > Turns off Parquet filter push-down for string and binary columns > > > Key: SPARK-11153 > URL: https://issues.apache.org/jira/browse/SPARK-11153 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.0, 1.5.1 >Reporter: Cheng Lian >Assignee: Cheng Lian >Priority: Blocker > > Due to PARQUET-251, {{BINARY}} columns in existing Parquet files may be > written with corrupted statistics information. This information is used by > filter push-down optimization. Since Spark 1.5 turns on Parquet filter > push-down by default, we may end up with wrong query results. PARQUET-251 has > been fixed in parquet-mr 1.8.1, but Spark 1.5 is still using 1.7.0. > Note that this kind of corrupted Parquet files could be produced by any > Parquet data models. > This affects all Spark SQL data types that can be mapped to Parquet > {{BINARY}}, namely: > - {{StringType}} > - {{BinaryType}} > - {{DecimalType}} (but Spark SQL doesn't support pushing down {{DecimalType}} > columns for now.) > To avoid wrong query results, we should disable filter push-down for columns > of {{StringType}} and {{BinaryType}} until we upgrade to parquet-mr 1.8. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
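As a stopgap on affected 1.5.x deployments, the same effect can be had per session through the existing configuration key for Parquet push-down; a minimal sketch, assuming a pyspark session with an existing {{sqlContext}}:
{code}
# Hedged workaround sketch: disable Parquet filter push-down for this session
# until the parquet-mr upgrade discussed above is available.
sqlContext.setConf("spark.sql.parquet.filterPushdown", "false")
# Subsequent Parquet queries then scan without consulting the (possibly
# corrupted) binary column statistics.
{code}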
[jira] [Assigned] (SPARK-11153) Turns off Parquet filter push-down for string and binary columns
[ https://issues.apache.org/jira/browse/SPARK-11153?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-11153: Assignee: Apache Spark (was: Cheng Lian) > Turns off Parquet filter push-down for string and binary columns > > > Key: SPARK-11153 > URL: https://issues.apache.org/jira/browse/SPARK-11153 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.0, 1.5.1 >Reporter: Cheng Lian >Assignee: Apache Spark >Priority: Blocker > > Due to PARQUET-251, {{BINARY}} columns in existing Parquet files may be > written with corrupted statistics information. This information is used by > filter push-down optimization. Since Spark 1.5 turns on Parquet filter > push-down by default, we may end up with wrong query results. PARQUET-251 has > been fixed in parquet-mr 1.8.1, but Spark 1.5 is still using 1.7.0. > Note that this kind of corrupted Parquet files could be produced by any > Parquet data models. > This affects all Spark SQL data types that can be mapped to Parquet > {{BINARY}}, namely: > - {{StringType}} > - {{BinaryType}} > - {{DecimalType}} (but Spark SQL doesn't support pushing down {{DecimalType}} > columns for now.) > To avoid wrong query results, we should disable filter push-down for columns > of {{StringType}} and {{BinaryType}} until we upgrade to parquet-mr 1.8. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9162) Implement code generation for ScalaUDF
[ https://issues.apache.org/jira/browse/SPARK-9162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14961335#comment-14961335 ] Reynold Xin commented on SPARK-9162: [~viirya] can you work on this? We can then close this umbrella ticket .. > Implement code generation for ScalaUDF > -- > > Key: SPARK-9162 > URL: https://issues.apache.org/jira/browse/SPARK-9162 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-8100) Make able to refer lost executor log
[ https://issues.apache.org/jira/browse/SPARK-8100?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen resolved SPARK-8100. --- Resolution: Duplicate This looks like a duplicate of SPARK-7729. > Make able to refer lost executor log > > > Key: SPARK-8100 > URL: https://issues.apache.org/jira/browse/SPARK-8100 > Project: Spark > Issue Type: Improvement > Components: Web UI >Affects Versions: 1.3.1 >Reporter: SuYan >Priority: Minor > > While the application is still running, the lost executor's info > disappears from the Spark UI -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-11157) Allow Spark to be built without assemblies
[ https://issues.apache.org/jira/browse/SPARK-11157?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin updated SPARK-11157: --- Attachment: no-assemblies.pdf > Allow Spark to be built without assemblies > -- > > Key: SPARK-11157 > URL: https://issues.apache.org/jira/browse/SPARK-11157 > Project: Spark > Issue Type: Umbrella > Components: Build, Spark Core, YARN >Reporter: Marcelo Vanzin > Attachments: no-assemblies.pdf > > > For reasoning, discussion of pros and cons, and other more detailed > information, please see attached doc. > The idea is to be able to build a Spark distribution that has just a > directory full of jars instead of the huge assembly files we currently have. > Getting there requires changes in a bunch of places, I'll try to list the > ones I identified in the document, in the order that I think would be needed > to not break things: > * make streaming backends not be assemblies > Since people may depend on the current assembly artifacts in their > deployments, we can't really remove them; but we can make them be dummy jars > and rely on dependency resolution to download all the jars. > PySpark tests would also need some tweaking here. > * make examples jar not be an assembly > Probably requires tweaks to the {{run-example}} script. The location of the > examples jar would have to change (it won't be able to live in the same place > as the main Spark jars anymore). > * update YARN backend to handle a directory full of jars when launching apps > Currently YARN localizes the Spark assembly (depending on the user > configuration); it needs to be modified so that it can localize all needed > libraries instead of a single jar. > * Modify launcher library to handle the jars directory > This should be trivial > * Modify {{assembly/pom.xml}} to generate assembly or a {{libs}} directory > depending on which profile is enabled. > We should keep the option to build with the assembly on by default, for > backwards compatibility, to give people time to prepare. > Filing this bug as an umbrella; please file sub-tasks if you plan to work on > a specific part of the issue. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-11127) Upgrade Kinesis Client Library to the latest stable version
[ https://issues.apache.org/jira/browse/SPARK-11127?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-11127: Assignee: Apache Spark (was: Tathagata Das) > Upgrade Kinesis Client Library to the latest stable version > --- > > Key: SPARK-11127 > URL: https://issues.apache.org/jira/browse/SPARK-11127 > Project: Spark > Issue Type: Improvement > Components: Streaming >Affects Versions: 1.6.0 >Reporter: Xiangrui Meng >Assignee: Apache Spark > > We use KCL 1.3.0 in the current master. KCL 1.4.0 added integration with > Kinesis Producer Library (KPL) and support auto de-aggregation. It would be > great to upgrade KCL to the latest stable version. > Note that the latest version is 1.6.1 and 1.6.0 restored compatibility with > dynamodb-streams-kinesis-adapter, which was broken in 1.4.0. See > https://github.com/awslabs/amazon-kinesis-client#release-notes. > [~tdas] [~brkyvz] Please recommend a version for upgrade. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-11127) Upgrade Kinesis Client Library to the latest stable version
[ https://issues.apache.org/jira/browse/SPARK-11127?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng reassigned SPARK-11127: - Assignee: Xiangrui Meng (was: Tathagata Das) > Upgrade Kinesis Client Library to the latest stable version > --- > > Key: SPARK-11127 > URL: https://issues.apache.org/jira/browse/SPARK-11127 > Project: Spark > Issue Type: Improvement > Components: Streaming >Affects Versions: 1.6.0 >Reporter: Xiangrui Meng >Assignee: Xiangrui Meng > > We use KCL 1.3.0 in the current master. KCL 1.4.0 added integration with > Kinesis Producer Library (KPL) and support auto de-aggregation. It would be > great to upgrade KCL to the latest stable version. > Note that the latest version is 1.6.1 and 1.6.0 restored compatibility with > dynamodb-streams-kinesis-adapter, which was broken in 1.4.0. See > https://github.com/awslabs/amazon-kinesis-client#release-notes. > [~tdas] [~brkyvz] Please recommend a version for upgrade. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6859) Parquet File Binary column statistics error when reuse byte[] among rows
[ https://issues.apache.org/jira/browse/SPARK-6859?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14961318#comment-14961318 ] Cheng Lian commented on SPARK-6859: --- This issue was left unresolved because Parquet filter push-down wasn't enabled by default. But now in 1.5, it's turned on by default. Opened SPARK-11153 to disable filter push-down for strings and binaries. > Parquet File Binary column statistics error when reuse byte[] among rows > > > Key: SPARK-6859 > URL: https://issues.apache.org/jira/browse/SPARK-6859 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.2.0, 1.3.0, 1.4.0 >Reporter: Yijie Shen >Priority: Minor > > Suppose I create a dataRDD which extends RDD\[Row\], and each row is > GenericMutableRow(Array(Int, Array\[Byte\])). A same Array\[Byte\] object is > reused among rows but has different content each time. When I convert it to a > dataFrame and save it as Parquet File, the file's row group statistic(max & > min) of Binary column would be wrong. > \\ > \\ > Here is the reason: In Parquet, BinaryStatistic just keep max & min as > parquet.io.api.Binary references, Spark sql would generate a new Binary > backed by the same Array\[Byte\] passed from row. > > | |reference| |backed| | > |max: Binary|-->|ByteArrayBackedBinary|-->|Array\[Byte\]| > Therefore, each time parquet updating row group's statistic, max & min would > always refer to the same Array\[Byte\], which has new content each time. When > parquet decides to save it into file, the last row's content would be saved > as both max & min. > \\ > \\ > It seems it is a parquet bug because it's parquet's responsibility to update > statistics correctly. > But not quite sure. Should I report it as a bug in parquet JIRA? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
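The reference-versus-copy hazard described above is easy to illustrate outside of Spark/Parquet; a purely illustrative Python sketch (not the actual Parquet statistics code):
{code}
# Hypothetical sketch: "statistics" that hold a reference to a reused buffer
# end up reporting only the last row's content, mirroring the max/min bug.
buf = bytearray(b"aaa")
seen_max = buf            # keeps a reference, like the Binary-backed statistics
buf[:] = b"zzz"           # the same buffer is reused for the next row
print(bytes(seen_max))    # shows the last write, not the value recorded as "max"
{code}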
[jira] [Updated] (SPARK-8360) Streaming DataFrames
[ https://issues.apache.org/jira/browse/SPARK-8360?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-8360: --- Target Version/s: (was: 1.6.0) > Streaming DataFrames > > > Key: SPARK-8360 > URL: https://issues.apache.org/jira/browse/SPARK-8360 > Project: Spark > Issue Type: Umbrella > Components: SQL, Streaming >Reporter: Reynold Xin > > Umbrella ticket to track what's needed to make streaming DataFrame a reality. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-11153) Turns off Parquet filter push-down for string and binary columns
[ https://issues.apache.org/jira/browse/SPARK-11153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14961371#comment-14961371 ] Felix Cheung commented on SPARK-11153: -- so the corrupted stats data would still be a problem when for future releases? how would it be handled then? > Turns off Parquet filter push-down for string and binary columns > > > Key: SPARK-11153 > URL: https://issues.apache.org/jira/browse/SPARK-11153 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.0, 1.5.1 >Reporter: Cheng Lian >Assignee: Cheng Lian >Priority: Blocker > > Due to PARQUET-251, {{BINARY}} columns in existing Parquet files may be > written with corrupted statistics information. This information is used by > filter push-down optimization. Since Spark 1.5 turns on Parquet filter > push-down by default, we may end up with wrong query results. PARQUET-251 has > been fixed in parquet-mr 1.8.1, but Spark 1.5 is still using 1.7.0. > Note that this kind of corrupted Parquet files could be produced by any > Parquet data models. > This affects all Spark SQL data types that can be mapped to Parquet > {{BINARY}}, namely: > - {{StringType}} > - {{BinaryType}} > - {{DecimalType}} (but Spark SQL doesn't support pushing down {{DecimalType}} > columns for now.) > To avoid wrong query results, we should disable filter push-down for columns > of {{StringType}} and {{BinaryType}} until we upgrade to parquet-mr 1.8. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10641) skewness and kurtosis support
[ https://issues.apache.org/jira/browse/SPARK-10641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14961396#comment-14961396 ] Xiangrui Meng commented on SPARK-10641: --- See attached PDF file. > skewness and kurtosis support > - > > Key: SPARK-10641 > URL: https://issues.apache.org/jira/browse/SPARK-10641 > Project: Spark > Issue Type: Sub-task > Components: ML, SQL >Reporter: Jihong MA >Assignee: Seth Hendrickson > Attachments: simpler-moments.pdf > > > Implementing skewness and kurtosis support based on following algorithm: > https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Higher-order_statistics -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
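For readers without the attachment at hand, the Wikipedia algorithm referenced in the description maintains the central moments in a single pass; a minimal illustrative Python sketch (not the proposed Spark implementation):
{code}
# Hedged sketch of one-pass updates for the 2nd/3rd/4th central moments,
# from which sample skewness and kurtosis are derived.
import math

def skewness_kurtosis(xs):
    n = 0
    mean = m2 = m3 = m4 = 0.0
    for x in xs:
        n1 = n
        n += 1
        delta = x - mean
        delta_n = delta / n
        delta_n2 = delta_n * delta_n
        term1 = delta * delta_n * n1
        mean += delta_n
        m4 += term1 * delta_n2 * (n * n - 3 * n + 3) + 6 * delta_n2 * m2 - 4 * delta_n * m3
        m3 += term1 * delta_n * (n - 2) - 3 * delta_n * m2
        m2 += term1
    skewness = math.sqrt(n) * m3 / m2 ** 1.5
    kurtosis = n * m4 / (m2 * m2) - 3.0
    return skewness, kurtosis

print(skewness_kurtosis([1.0, 2.0, 2.0, 3.0, 7.0]))
{code}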
[jira] [Created] (SPARK-11158) Add more information in Error statement for sql/types _verify_type()
Mahmoud Lababidi created SPARK-11158: Summary: Add more information in Error statement for sql/types _verify_type() Key: SPARK-11158 URL: https://issues.apache.org/jira/browse/SPARK-11158 Project: Spark Issue Type: Improvement Components: SQL Reporter: Mahmoud Lababidi Priority: Minor -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-10974) Add progress bar for output operation column and use red dots for failed batches
[ https://issues.apache.org/jira/browse/SPARK-10974?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das resolved SPARK-10974. --- Resolution: Fixed > Add progress bar for output operation column and use red dots for failed > batches > > > Key: SPARK-10974 > URL: https://issues.apache.org/jira/browse/SPARK-10974 > Project: Spark > Issue Type: Improvement > Components: Streaming >Reporter: Shixiong Zhu >Assignee: Shixiong Zhu > Fix For: 1.6.0 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-11104) A potential deadlock in StreamingContext.stop and stopOnShutdown
[ https://issues.apache.org/jira/browse/SPARK-11104?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das resolved SPARK-11104. --- Resolution: Fixed Assignee: Shixiong Zhu Fix Version/s: 1.6.0 1.5.2 > A potential deadlock in StreamingContext.stop and stopOnShutdown > > > Key: SPARK-11104 > URL: https://issues.apache.org/jira/browse/SPARK-11104 > Project: Spark > Issue Type: Bug > Components: Streaming >Reporter: Shixiong Zhu >Assignee: Shixiong Zhu > Fix For: 1.5.2, 1.6.0 > > > When the shutdown hook of StreamingContext and StreamingContext.stop are > running at the same time (e.g., press CTRL-C when StreamingContext.stop is > running), the following deadlock may happen: > {code} > Java stack information for the threads listed above: > === > "Thread-2": > at > org.apache.spark.streaming.StreamingContext.stop(StreamingContext.scala:699) > - waiting to lock <0x0005405a1680> (a > org.apache.spark.streaming.StreamingContext) > at > org.apache.spark.streaming.StreamingContext.org$apache$spark$streaming$StreamingContext$$stopOnShutdown(StreamingContext.scala:729) > at > org.apache.spark.streaming.StreamingContext$$anonfun$start$1.apply$mcV$sp(StreamingContext.scala:625) > at > org.apache.spark.util.SparkShutdownHook.run(ShutdownHookManager.scala:266) > at > org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(ShutdownHookManager.scala:236) > at > org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(ShutdownHookManager.scala:236) > at > org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(ShutdownHookManager.scala:236) > at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1697) > at > org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply$mcV$sp(ShutdownHookManager.scala:236) > at > org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(ShutdownHookManager.scala:236) > at > org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(ShutdownHookManager.scala:236) > at scala.util.Try$.apply(Try.scala:161) > at > org.apache.spark.util.SparkShutdownHookManager.runAll(ShutdownHookManager.scala:236) > - locked <0x0005405b6a00> (a > org.apache.spark.util.SparkShutdownHookManager) > at > org.apache.spark.util.SparkShutdownHookManager$$anon$2.run(ShutdownHookManager.scala:216) > at > org.apache.hadoop.util.ShutdownHookManager$1.run(ShutdownHookManager.java:54) > "main": > at > org.apache.spark.util.SparkShutdownHookManager.remove(ShutdownHookManager.scala:248) > - waiting to lock <0x0005405b6a00> (a > org.apache.spark.util.SparkShutdownHookManager) > at > org.apache.spark.util.ShutdownHookManager$.removeShutdownHook(ShutdownHookManager.scala:199) > at > org.apache.spark.streaming.StreamingContext.stop(StreamingContext.scala:712) > - locked <0x0005405a1680> (a > org.apache.spark.streaming.StreamingContext) > at > org.apache.spark.streaming.StreamingContext.stop(StreamingContext.scala:684) > - locked <0x0005405a1680> (a > org.apache.spark.streaming.StreamingContext) > at > org.apache.spark.streaming.SessionByKeyBenchmark$.main(SessionByKeyBenchmark.scala:108) > at > org.apache.spark.streaming.SessionByKeyBenchmark.main(SessionByKeyBenchmark.scala) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at 
java.lang.reflect.Method.invoke(Method.java:497) > at > org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:680) > at > org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180) > at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205) > at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:120) > at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-11109) move FsHistoryProvider off import org.apache.hadoop.fs.permission.AccessControlException
[ https://issues.apache.org/jira/browse/SPARK-11109?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin resolved SPARK-11109. Resolution: Fixed Assignee: Glenn Weidner Fix Version/s: 1.6.0 > move FsHistoryProvider off import > org.apache.hadoop.fs.permission.AccessControlException > > > Key: SPARK-11109 > URL: https://issues.apache.org/jira/browse/SPARK-11109 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.6.0 >Reporter: Steve Loughran >Assignee: Glenn Weidner >Priority: Minor > Fix For: 1.6.0 > > Original Estimate: 1h > Remaining Estimate: 1h > > {{FsHistoryProvider}} imports and uses > {{org.apache.hadoop.fs.permission.AccessControlException}}; this has been > superseded by its subclass > {{org.apache.hadoop.security.AccessControlException}} since ~2011. Moving to > that subclass would remove a deprecation warning and ensure that were the > Hadoop team to remove that old method (as HADOOP-11356 has currently done to > trunk), everything still compiles and links -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10641) skewness and kurtosis support
[ https://issues.apache.org/jira/browse/SPARK-10641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14961359#comment-14961359 ] Seth Hendrickson commented on SPARK-10641: -- [~mengxr] I am interested, do you mine providing it or a link to it? > skewness and kurtosis support > - > > Key: SPARK-10641 > URL: https://issues.apache.org/jira/browse/SPARK-10641 > Project: Spark > Issue Type: Sub-task > Components: ML, SQL >Reporter: Jihong MA >Assignee: Seth Hendrickson > Attachments: simpler-moments.pdf > > > Implementing skewness and kurtosis support based on following algorithm: > https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Higher-order_statistics -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-11159) Nested SQL UDF raises java.lang.UnsupportedOperationException: Cannot evaluate expression
Jacob Wellington created SPARK-11159: Summary: Nested SQL UDF raises java.lang.UnsupportedOperationException: Cannot evaluate expression Key: SPARK-11159 URL: https://issues.apache.org/jira/browse/SPARK-11159 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.5.1 Reporter: Jacob Wellington I'm running spark 1.5.1 and getting the following error: {{java.lang.UnsupportedOperationException: Cannot evaluate expression: PythonUDF#func_db_v1863()}} whenever I run a query like: {{SELECT func_format_v1863(func_db_v1863('')) as ds261_v1869 FROM df1}} after registering {{func_db_v1863}} and {{func_format_v1863}} as functions. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
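A minimal reproduction sketch of the nesting described above, assuming a running pyspark shell with {{sc}} defined; the UDF bodies and the contents of {{df1}} are hypothetical stand-ins:
{code}
# Hedged repro sketch: nest one registered Python UDF inside another.
from pyspark.sql import SQLContext
from pyspark.sql.types import StringType

sqlContext = SQLContext(sc)
sqlContext.registerFunction("func_db_v1863", lambda s: s + "_db", StringType())
sqlContext.registerFunction("func_format_v1863", lambda s: s.upper(), StringType())
sqlContext.createDataFrame([("x",)], ["col"]).registerTempTable("df1")

# On 1.5.1 the nested call reportedly fails with
# java.lang.UnsupportedOperationException: Cannot evaluate expression: PythonUDF#...
sqlContext.sql("SELECT func_format_v1863(func_db_v1863('')) AS ds261_v1869 FROM df1").show()
{code}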
[jira] [Commented] (SPARK-11153) Turns off Parquet filter push-down for string and binary columns
[ https://issues.apache.org/jira/browse/SPARK-11153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14961374#comment-14961374 ] Felix Cheung commented on SPARK-11153: -- re-read what you said, I think it makes sense. I assume it means for Spark 1.6.x it would handle it like Parquet-mr 1.8 in that it would check the writer version and enable/disable push-down for sting/binary columns. > Turns off Parquet filter push-down for string and binary columns > > > Key: SPARK-11153 > URL: https://issues.apache.org/jira/browse/SPARK-11153 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.0, 1.5.1 >Reporter: Cheng Lian >Assignee: Cheng Lian >Priority: Blocker > > Due to PARQUET-251, {{BINARY}} columns in existing Parquet files may be > written with corrupted statistics information. This information is used by > filter push-down optimization. Since Spark 1.5 turns on Parquet filter > push-down by default, we may end up with wrong query results. PARQUET-251 has > been fixed in parquet-mr 1.8.1, but Spark 1.5 is still using 1.7.0. > Note that this kind of corrupted Parquet files could be produced by any > Parquet data models. > This affects all Spark SQL data types that can be mapped to Parquet > {{BINARY}}, namely: > - {{StringType}} > - {{BinaryType}} > - {{DecimalType}} (but Spark SQL doesn't support pushing down {{DecimalType}} > columns for now.) > To avoid wrong query results, we should disable filter push-down for columns > of {{StringType}} and {{BinaryType}} until we upgrade to parquet-mr 1.8. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-11050) PySpark SparseVector can return wrong index in error message
[ https://issues.apache.org/jira/browse/SPARK-11050?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-11050: -- Assignee: Bhargav Mangipudi > PySpark SparseVector can return wrong index in error message > > > Key: SPARK-11050 > URL: https://issues.apache.org/jira/browse/SPARK-11050 > Project: Spark > Issue Type: Improvement > Components: MLlib, PySpark >Affects Versions: 1.6.0 >Reporter: Joseph K. Bradley >Assignee: Bhargav Mangipudi >Priority: Trivial > Labels: starter > > PySpark {{SparseVector.__getitem__}} returns an error message if given a bad > index here: > [https://github.com/apache/spark/blob/a16396df76cc27099011bfb96b28cbdd7f964ca8/python/pyspark/mllib/linalg/__init__.py#L770] > But the index it complains about could have been modified (if negative), > meaning the index in the error message could be wrong. This should be > corrected. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
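A minimal sketch of the proposed fix, with hypothetical helper names rather than the actual MLlib source: remember the caller's index before normalizing negative values, and report that index in the error.
{code}
# Hedged sketch: report the index the caller actually passed, even after
# negative-index normalization.
def check_index(index, size):
    orig = index                  # kept purely for the error message
    if index < 0:
        index += size             # normalize negative indices
    if not 0 <= index < size:
        raise IndexError("index %d out of range [0, %d)" % (orig, size))
    return index
{code}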
[jira] [Commented] (SPARK-11087) spark.sql.orc.filterPushdown does not work, No ORC pushdown predicate
[ https://issues.apache.org/jira/browse/SPARK-11087?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14960296#comment-14960296 ] patcharee commented on SPARK-11087: --- [~zhazhan] Below is my test. Please check. I tried to change "hive.exec.orc.split.strategy" also, but none of them given " OrcInputFormat [INFO] ORC pushdown predicate" as same as your result 2508 case class Contact(name: String, phone: String) 2509 case class Person(name: String, age: Int, contacts: Seq[Contact]) 2510 val records = (1 to 100).map { i => Person(s"name_$i", i, (0 to 1).map { m => Contact(s"contact_$m", s"phone_$m") } ) 2511 } 2512 sqlContext.setConf("spark.sql.orc.filterPushdown", "true") 2513 sc.parallelize(records).toDF().write.format("orc").partitionBy("age").save("peoplePartitioned") 2514 val peoplePartitioned = sqlContext.read.format("orc").load("peoplePartitioned") 2515 peoplePartitioned.registerTempTable("peoplePartitioned") scala> sqlContext.setConf("hive.exec.orc.split.strategy", "ETL") 15/10/16 09:10:49 DEBUG VariableSubstitution: Substitution is on: ETL 15/10/16 09:10:49 DEBUG VariableSubstitution: Substitution is on: ETL 15/10/16 09:10:49 DEBUG VariableSubstitution: Substitution is on: ETL 15/10/16 09:10:49 DEBUG VariableSubstitution: Substitution is on: ETL scala> sqlContext.sql("SELECT * FROM peoplePartitioned WHERE age = 20 and name = 'name_20'").count 15/10/16 09:10:52 DEBUG VariableSubstitution: Substitution is on: SELECT * FROM peoplePartitioned WHERE age = 20 and name = 'name_20' 15/10/16 09:10:52 DEBUG VariableSubstitution: Substitution is on: SELECT * FROM peoplePartitioned WHERE age = 20 and name = 'name_20' 15/10/16 09:10:53 INFO PerfLogger: 15/10/16 09:10:53 INFO PerfLogger: 15/10/16 09:10:53 DEBUG OrcInputFormat: Number of buckets specified by conf file is 0 15/10/16 09:10:53 DEBUG OrcInputFormat: Number of buckets specified by conf file is 0 15/10/16 09:10:53 DEBUG AcidUtils: in directory hdfs://helmhdfs/user/patcharee/peoplePartitioned/age=20/part-r-00014-fb3d0874-db8b-40e7-9a4f-0e071c46f509.orc base = null deltas = 0 15/10/16 09:10:53 DEBUG AcidUtils: in directory hdfs://helmhdfs/user/patcharee/peoplePartitioned/age=20/part-r-00014-fb3d0874-db8b-40e7-9a4f-0e071c46f509.orc base = null deltas = 0 15/10/16 09:10:53 DEBUG OrcInputFormat: BISplitStrategy strategy for hdfs://helmhdfs/user/patcharee/peoplePartitioned/age=20/part-r-00014-fb3d0874-db8b-40e7-9a4f-0e071c46f509.orc 15/10/16 09:10:53 DEBUG OrcInputFormat: BISplitStrategy strategy for hdfs://helmhdfs/user/patcharee/peoplePartitioned/age=20/part-r-00014-fb3d0874-db8b-40e7-9a4f-0e071c46f509.orc 15/10/16 09:10:53 INFO OrcInputFormat: FooterCacheHitRatio: 0/0 15/10/16 09:10:53 INFO OrcInputFormat: FooterCacheHitRatio: 0/0 15/10/16 09:10:53 DEBUG OrcInputFormat: hdfs://helmhdfs/user/patcharee/peoplePartitioned/age=20/part-r-00014-fb3d0874-db8b-40e7-9a4f-0e071c46f509.orc:0+551 projected_columns_uncompressed_size: -1 15/10/16 09:10:53 DEBUG OrcInputFormat: hdfs://helmhdfs/user/patcharee/peoplePartitioned/age=20/part-r-00014-fb3d0874-db8b-40e7-9a4f-0e071c46f509.orc:0+551 projected_columns_uncompressed_size: -1 15/10/16 09:10:53 INFO PerfLogger: 15/10/16 09:10:53 INFO PerfLogger: res5: Long = 1 scala> sqlContext.setConf("hive.exec.orc.split.strategy", "BI") 15/10/16 09:11:13 DEBUG VariableSubstitution: Substitution is on: BI 15/10/16 09:11:13 DEBUG VariableSubstitution: Substitution is on: BI 15/10/16 09:11:13 DEBUG VariableSubstitution: Substitution is on: BI 15/10/16 09:11:13 DEBUG VariableSubstitution: 
Substitution is on: BI scala> sqlContext.sql("SELECT * FROM peoplePartitioned WHERE age = 20 and name = 'name_20'").count 15/10/16 09:11:19 DEBUG VariableSubstitution: Substitution is on: SELECT * FROM peoplePartitioned WHERE age = 20 and name = 'name_20' 15/10/16 09:11:19 DEBUG VariableSubstitution: Substitution is on: SELECT * FROM peoplePartitioned WHERE age = 20 and name = 'name_20' 15/10/16 09:11:19 INFO PerfLogger: 15/10/16 09:11:19 INFO PerfLogger: 15/10/16 09:11:19 DEBUG OrcInputFormat: Number of buckets specified by conf file is 0 15/10/16 09:11:19 DEBUG OrcInputFormat: Number of buckets specified by conf file is 0 15/10/16 09:11:19 DEBUG AcidUtils: in directory hdfs://helmhdfs/user/patcharee/peoplePartitioned/age=20/part-r-00014-fb3d0874-db8b-40e7-9a4f-0e071c46f509.orc base = null deltas = 0 15/10/16 09:11:19 DEBUG AcidUtils: in directory hdfs://helmhdfs/user/patcharee/peoplePartitioned/age=20/part-r-00014-fb3d0874-db8b-40e7-9a4f-0e071c46f509.orc base = null deltas = 0 15/10/16 09:11:19 DEBUG OrcInputFormat: BISplitStrategy strategy for hdfs://helmhdfs/user/patcharee/peoplePartitioned/age=20/part-r-00014-fb3d0874-db8b-40e7-9a4f-0e071c46f509.orc 15/10/16 09:11:19 DEBUG OrcInputFormat: BISplitStrategy strategy for hdfs://helmhdfs/user/patcharee/peoplePartitioned/age=20/part-r-00014-fb3d0874-db8b-40e7-9a4f-0e071c46f509.orc
[jira] [Resolved] (SPARK-10974) Add progress bar for output operation column and use red dots for failed batches
[ https://issues.apache.org/jira/browse/SPARK-10974?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das resolved SPARK-10974. --- Resolution: Fixed Fix Version/s: 1.6.0 > Add progress bar for output operation column and use red dots for failed > batches > > > Key: SPARK-10974 > URL: https://issues.apache.org/jira/browse/SPARK-10974 > Project: Spark > Issue Type: Improvement > Components: Streaming >Reporter: Shixiong Zhu >Assignee: Shixiong Zhu > Fix For: 1.6.0 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7271) Redesign shuffle interface for binary processing
[ https://issues.apache.org/jira/browse/SPARK-7271?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14960363#comment-14960363 ] Hong Shen commented on SPARK-7271: -- Hi, I have a question: are you planning to redesign the shuffle reader to implement binary processing? If so, when will you complete it? > Redesign shuffle interface for binary processing > > > Key: SPARK-7271 > URL: https://issues.apache.org/jira/browse/SPARK-7271 > Project: Spark > Issue Type: Improvement > Components: Shuffle, Spark Core >Reporter: Reynold Xin >Assignee: Josh Rosen > > Current shuffle interface is not exactly ideal for binary processing. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-10974) Add progress bar for output operation column and use red dots for failed batches
[ https://issues.apache.org/jira/browse/SPARK-10974?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das reassigned SPARK-10974: - Assignee: Tathagata Das > Add progress bar for output operation column and use red dots for failed > batches > > > Key: SPARK-10974 > URL: https://issues.apache.org/jira/browse/SPARK-10974 > Project: Spark > Issue Type: Improvement > Components: Streaming >Reporter: Shixiong Zhu >Assignee: Tathagata Das > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-10974) Add progress bar for output operation column and use red dots for failed batches
[ https://issues.apache.org/jira/browse/SPARK-10974?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das updated SPARK-10974: -- Assignee: Shixiong Zhu (was: Tathagata Das) > Add progress bar for output operation column and use red dots for failed > batches > > > Key: SPARK-10974 > URL: https://issues.apache.org/jira/browse/SPARK-10974 > Project: Spark > Issue Type: Improvement > Components: Streaming >Reporter: Shixiong Zhu >Assignee: Shixiong Zhu > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-3950) Completed time is blank for some successful tasks
[ https://issues.apache.org/jira/browse/SPARK-3950?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-3950. -- Resolution: Cannot Reproduce > Completed time is blank for some successful tasks > - > > Key: SPARK-3950 > URL: https://issues.apache.org/jira/browse/SPARK-3950 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 1.1.1 >Reporter: Aaron Davidson > > In the Spark web UI, some tasks appear to have a blank Duration column. It's > possible that these ran for <.5 seconds, but if so, we should use > milliseconds like we do for GC time. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Reopened] (SPARK-10974) Add progress bar for output operation column and use red dots for failed batches
[ https://issues.apache.org/jira/browse/SPARK-10974?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das reopened SPARK-10974: --- > Add progress bar for output operation column and use red dots for failed > batches > > > Key: SPARK-10974 > URL: https://issues.apache.org/jira/browse/SPARK-10974 > Project: Spark > Issue Type: Improvement > Components: Streaming >Reporter: Shixiong Zhu >Assignee: Shixiong Zhu > Fix For: 1.6.0 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-11145) Cannot filter using a partition key and another column
Julien Buret created SPARK-11145: Summary: Cannot filter using a partition key and another column Key: SPARK-11145 URL: https://issues.apache.org/jira/browse/SPARK-11145 Project: Spark Issue Type: Bug Components: PySpark, SQL Affects Versions: 1.5.1 Reporter: Julien Buret A DataFrame loaded from partitioned Parquet files cannot be filtered by a predicate comparing a partition key with another column; in this case all records are returned. Example:
{code}
from pyspark.sql import SQLContext

sqlContext = SQLContext(sc)
d = [
    {'name': 'a', 'YEAR': 2015, 'year_2': 2014, 'statut': 'a'},
    {'name': 'b', 'YEAR': 2014, 'year_2': 2014, 'statut': 'a'},
    {'name': 'c', 'YEAR': 2013, 'year_2': 2011, 'statut': 'a'},
    {'name': 'd', 'YEAR': 2014, 'year_2': 2013, 'statut': 'a'},
    {'name': 'e', 'YEAR': 2016, 'year_2': 2017, 'statut': 'p'}
]
rdd = sc.parallelize(d)
df = sqlContext.createDataFrame(rdd)
df.write.partitionBy('YEAR').mode('overwrite').parquet('data')
df2 = sqlContext.read.parquet('data')
df2.filter(df2.YEAR == df2.year_2).show()
{code}
returns
{code}
+----+------+------+----+
|name|statut|year_2|YEAR|
+----+------+------+----+
|   d|     a|  2013|2014|
|   b|     a|  2014|2014|
|   c|     a|  2011|2013|
|   e|     p|  2017|2016|
|   a|     a|  2014|2015|
+----+------+------+----+
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-11139) Make SparkContext.stop() exception-safe
[ https://issues.apache.org/jira/browse/SPARK-11139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14960378#comment-14960378 ] Sean Owen commented on SPARK-11139: --- Yes please. StreamingContext probably needs a similar treatment: execute a series of functions that do part of the cleanup and ensure that an exception from one doesn't stop the rest from executing. > Make SparkContext.stop() exception-safe > --- > > Key: SPARK-11139 > URL: https://issues.apache.org/jira/browse/SPARK-11139 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.5.1 >Reporter: Felix Cheung >Priority: Minor > > In SparkContext.stop(), when an exception is thrown the rest of the > stop/cleanup action is aborted. > Work has been done in SPARK-4194 to allow for cleanup to partial > initialization. > Similarly issue in StreamingContext SPARK-11137 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
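The pattern suggested above can be sketched as a small helper that runs each cleanup step independently, so one failure cannot abort the rest; shown in Python for brevity with illustrative step names (the actual change would live in the Scala SparkContext/StreamingContext code):
{code}
# Hedged sketch: execute every cleanup block even if an earlier one throws.
import logging

def run_cleanup_steps(*steps):
    for step in steps:
        try:
            step()
        except Exception:
            logging.exception("Ignoring exception from cleanup step %s", step)

# e.g. run_cleanup_steps(stop_ui, stop_dag_scheduler, remove_shutdown_hook)
{code}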
[jira] [Resolved] (SPARK-11137) Make StreamingContext.stop() exception-safe
[ https://issues.apache.org/jira/browse/SPARK-11137?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-11137. --- Resolution: Duplicate If you don't mind, this is too logically related to SPARK-11139 to make separate JIRAs. > Make StreamingContext.stop() exception-safe > --- > > Key: SPARK-11137 > URL: https://issues.apache.org/jira/browse/SPARK-11137 > Project: Spark > Issue Type: Bug > Components: Streaming >Affects Versions: 1.5.1 >Reporter: Felix Cheung >Priority: Minor > > In StreamingContext.stop(), when an exception is thrown the rest of the > stop/cleanup action is aborted. > Discussed in https://github.com/apache/spark/pull/9116, > srowen commented > Hm, this is getting unwieldy. There are several nested try blocks here. The > same argument goes for many of these methods -- if one fails should they not > continue trying? A more tidy solution would be to execute a series of () -> > Unit code blocks that perform some cleanup and make sure that they each fire > in succession, regardless of the others. The final one to remove the shutdown > hook could occur outside synchronization. > I realize we're expanding the scope of the change here, but is it maybe > worthwhile to go all the way here? > Really, something similar could be done for SparkContext and there's an > existing JIRA for it somewhere. > At least, I'd prefer to either narrowly fix the deadlock here, or fix all of > the finally-related issue separately and all at once. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-11143) SparkMesosDispatcher can not launch driver in docker
[ https://issues.apache.org/jira/browse/SPARK-11143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14960420#comment-14960420 ] Klaus Ma commented on SPARK-11143: -- I addressed the issue by a new docker image which is more environment; but I still suggest to provide parameters to simplify the configuration. > SparkMesosDispatcher can not launch driver in docker > > > Key: SPARK-11143 > URL: https://issues.apache.org/jira/browse/SPARK-11143 > Project: Spark > Issue Type: Bug > Components: Mesos >Affects Versions: 1.5.1 > Environment: Ubuntu 14.04 >Reporter: Klaus Ma > > I'm working on integration between Mesos & Spark. For now, I can start > SlaveMesosDispatcher in a docker; and I like to also run Spark executor in > Mesos docker. I do the following configuration for it, but I got an error; > any suggestion? > Configuration: > Spark: conf/spark-defaults.conf > {code} > spark.mesos.executor.docker.imageubuntu > spark.mesos.executor.docker.volumes > /usr/bin:/usr/bin,/usr/local/lib:/usr/local/lib,/usr/lib:/usr/lib,/lib:/lib,/home/test/workshop/spark:/root/spark > spark.mesos.executor.home/root/spark > #spark.executorEnv.SPARK_HOME /root/spark > spark.executorEnv.MESOS_NATIVE_LIBRARY /usr/local/lib > {code} > NOTE: The spark are installed in /home/test/workshop/spark, and all > dependencies are installed. > After submit SparkPi to the dispatcher, the driver job is started but failed. > The error messes is: > {code} > I1015 11:10:29.488456 18697 exec.cpp:134] Version: 0.26.0 > I1015 11:10:29.506619 18699 exec.cpp:208] Executor registered on slave > b7e24114-7585-40bc-879b-6a1188cb65b6-S1 > WARNING: Your kernel does not support swap limit capabilities, memory limited > without swap. > /bin/sh: 1: ./bin/spark-submit: not found > {code} > Does any know how to map/set spark home in docker for this case? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-11060) Fix some potential NPEs in DStream transformation
[ https://issues.apache.org/jira/browse/SPARK-11060?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-11060. --- Resolution: Fixed Fix Version/s: 1.6.0 Issue resolved by pull request 9070 [https://github.com/apache/spark/pull/9070] > Fix some potential NPEs in DStream transformation > - > > Key: SPARK-11060 > URL: https://issues.apache.org/jira/browse/SPARK-11060 > Project: Spark > Issue Type: Bug > Components: Streaming >Affects Versions: 1.5.0, 1.5.1 >Reporter: Saisai Shao >Priority: Minor > Fix For: 1.6.0 > > > Guard out some potential NPEs when input stream returns None instead of empty > RDD. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-11060) Fix some potential NPEs in DStream transformation
[ https://issues.apache.org/jira/browse/SPARK-11060?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-11060: -- Assignee: Saisai Shao > Fix some potential NPEs in DStream transformation > - > > Key: SPARK-11060 > URL: https://issues.apache.org/jira/browse/SPARK-11060 > Project: Spark > Issue Type: Bug > Components: Streaming >Affects Versions: 1.5.0, 1.5.1 >Reporter: Saisai Shao >Assignee: Saisai Shao >Priority: Minor > Fix For: 1.6.0 > > > Guard out some potential NPEs when input stream returns None instead of empty > RDD. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-11144) Add SparkLauncher for Spark Streaming, Spark SQL, etc
Yuhang Chen created SPARK-11144: --- Summary: Add SparkLauncher for Spark Streaming, Spark SQL, etc Key: SPARK-11144 URL: https://issues.apache.org/jira/browse/SPARK-11144 Project: Spark Issue Type: Improvement Components: Spark Core, SQL, Streaming Affects Versions: 1.5.1 Environment: Linux x64 Reporter: Yuhang Chen Priority: Minor Now we have org.apache.spark.launcher.SparkLauncher to launch Spark as a child process. However, it does not support other libs, such as Spark Streaming and Spark SQL. What I'm looking for is a utility like spark-submit, with which you can submit any Spark lib job to all supported resource managers (Standalone, YARN, Mesos, etc.) in Java/Scala code. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-11143) SparkMesosDispatcher can not launch driver in docker
[ https://issues.apache.org/jira/browse/SPARK-11143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14960420#comment-14960420 ] Klaus Ma edited comment on SPARK-11143 at 10/16/15 9:24 AM: I addressed the issue by a new docker image which is more about environment; I still suggest to provide parameters to simplify the docker configuration. was (Author: klaus1982): I addressed the issue by a new docker image which is more environment; but I still suggest to provide parameters to simplify the configuration. > SparkMesosDispatcher can not launch driver in docker > > > Key: SPARK-11143 > URL: https://issues.apache.org/jira/browse/SPARK-11143 > Project: Spark > Issue Type: Bug > Components: Mesos >Affects Versions: 1.5.1 > Environment: Ubuntu 14.04 >Reporter: Klaus Ma > > I'm working on integration between Mesos & Spark. For now, I can start > SlaveMesosDispatcher in a docker; and I like to also run Spark executor in > Mesos docker. I do the following configuration for it, but I got an error; > any suggestion? > Configuration: > Spark: conf/spark-defaults.conf > {code} > spark.mesos.executor.docker.imageubuntu > spark.mesos.executor.docker.volumes > /usr/bin:/usr/bin,/usr/local/lib:/usr/local/lib,/usr/lib:/usr/lib,/lib:/lib,/home/test/workshop/spark:/root/spark > spark.mesos.executor.home/root/spark > #spark.executorEnv.SPARK_HOME /root/spark > spark.executorEnv.MESOS_NATIVE_LIBRARY /usr/local/lib > {code} > NOTE: The spark are installed in /home/test/workshop/spark, and all > dependencies are installed. > After submit SparkPi to the dispatcher, the driver job is started but failed. > The error messes is: > {code} > I1015 11:10:29.488456 18697 exec.cpp:134] Version: 0.26.0 > I1015 11:10:29.506619 18699 exec.cpp:208] Executor registered on slave > b7e24114-7585-40bc-879b-6a1188cb65b6-S1 > WARNING: Your kernel does not support swap limit capabilities, memory limited > without swap. > /bin/sh: 1: ./bin/spark-submit: not found > {code} > Does any know how to map/set spark home in docker for this case? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-10581) Groups are not resolved in scaladoc for org.apache.spark.sql.Column
[ https://issues.apache.org/jira/browse/SPARK-10581?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-10581: Assignee: (was: Apache Spark) > Groups are not resolved in scaladoc for org.apache.spark.sql.Column > --- > > Key: SPARK-10581 > URL: https://issues.apache.org/jira/browse/SPARK-10581 > Project: Spark > Issue Type: Bug > Components: Documentation >Reporter: Jacek Laskowski >Priority: Minor > > The Scala API documentation (scaladoc) for > [org.apache.spark.sql.Column|http://people.apache.org/~pwendell/spark-nightly/spark-master-docs/latest/api/scala/index.html#org.apache.spark.sql.Column] > does not resolve groups, and they appear unresolved like {{df_ops}}, > {{expr_ops}}, et al. instead of _DataFrame functions._, _Expression > operators._, et al. > BTW, > [DataFrame|http://people.apache.org/~pwendell/spark-nightly/spark-master-docs/latest/api/scala/index.html#org.apache.spark.sql.DataFrame] > and other classes in the > [org.apache.spark.sql|http://people.apache.org/~pwendell/spark-nightly/spark-master-docs/latest/api/scala/index.html#org.apache.spark.sql.package] > package seem fine. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
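A likely cause is that the doc build is not passing the {{-groups}} flag to scaladoc; without it, {{@group}}/{{@groupname}} annotations are left unresolved. A minimal sketch of enabling it in a plain sbt build is below (Spark's actual unidoc settings may differ):

{code}
// build.sbt fragment (sketch only): scaladoc renders @group / @groupname
// sections only when it is invoked with the -groups flag.
scalacOptions in (Compile, doc) ++= Seq("-groups")
{code}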
[jira] [Commented] (SPARK-10965) Optimize filesEqualRecursive
[ https://issues.apache.org/jira/browse/SPARK-10965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14960515#comment-14960515 ] Sean Owen commented on SPARK-10965: --- I'd like to resolve this, at least for now. I am not sure I see a way to optimize this without introducing significantly more complication. If it's not a major problem, I suspect it's not worth it. > Optimize filesEqualRecursive > > > Key: SPARK-10965 > URL: https://issues.apache.org/jira/browse/SPARK-10965 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.5.2 >Reporter: Mark Grover >Priority: Minor > > When we try to download dependencies, if there is a file at the destination > already, we compare if the files are equal (recursively, if they are > directories). For files, we compare their bytes. Now, these dependencies can > be jars and be really large, and byte-by-byte comparisons can be super slow. > I think it'd be better to do a checksum. > Here's the code in question: > https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/util/Utils.scala#L500 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
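For illustration only, a minimal sketch of the checksum idea from the description (not the actual {{Utils.filesEqualRecursive}} code). It assumes files small enough to read into memory; a production version would stream through a {{DigestInputStream}} and cache the digest of the file already at the destination, which is where the real saving would come from.

{code}
import java.io.File
import java.nio.file.Files
import java.security.MessageDigest

// Sketch: compare SHA-256 digests instead of walking both files byte by byte.
// Reading each file once is still O(size); the win comes from caching the
// digest of the file that is already at the destination.
def sameContent(a: File, b: File): Boolean = {
  def digest(f: File): Seq[Byte] =
    MessageDigest.getInstance("SHA-256").digest(Files.readAllBytes(f.toPath)).toSeq
  a.length() == b.length() && digest(a) == digest(b)
}
{code}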
[jira] [Assigned] (SPARK-10581) Groups are not resolved in scaladoc for org.apache.spark.sql.Column
[ https://issues.apache.org/jira/browse/SPARK-10581?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-10581: Assignee: Apache Spark > Groups are not resolved in scaladoc for org.apache.spark.sql.Column > --- > > Key: SPARK-10581 > URL: https://issues.apache.org/jira/browse/SPARK-10581 > Project: Spark > Issue Type: Bug > Components: Documentation >Reporter: Jacek Laskowski >Assignee: Apache Spark >Priority: Minor > > The Scala API documentation (scaladoc) for > [org.apache.spark.sql.Column|http://people.apache.org/~pwendell/spark-nightly/spark-master-docs/latest/api/scala/index.html#org.apache.spark.sql.Column] > does not resolve groups, and they appear unresolved like {{df_ops}}, > {{expr_ops}}, et al. instead of _DataFrame functions._, _Expression > operators._, et al. > BTW, > [DataFrame|http://people.apache.org/~pwendell/spark-nightly/spark-master-docs/latest/api/scala/index.html#org.apache.spark.sql.DataFrame] > and other classes in the > [org.apache.spark.sql|http://people.apache.org/~pwendell/spark-nightly/spark-master-docs/latest/api/scala/index.html#org.apache.spark.sql.package] > package seem fine. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-11092) Add source URLs to API documentation.
[ https://issues.apache.org/jira/browse/SPARK-11092?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-11092. --- Resolution: Fixed Fix Version/s: 1.6.0 Issue resolved by pull request 9110 [https://github.com/apache/spark/pull/9110] > Add source URLs to API documentation. > - > > Key: SPARK-11092 > URL: https://issues.apache.org/jira/browse/SPARK-11092 > Project: Spark > Issue Type: Documentation > Components: Build, Documentation >Reporter: Jakob Odersky >Assignee: Jakob Odersky >Priority: Trivial > Fix For: 1.6.0 > > > It would be nice to have source URLs in the Spark scaladoc, similar to the > standard library (e.g. > http://www.scala-lang.org/api/current/index.html#scala.collection.immutable.List). > The fix should be really simple, just adding a line to the sbt unidoc > settings. > I'll use the github repo url > bq. https://github.com/apache/spark/tree/v${version}/${FILE_PATH} > Feel free to tell me if I should use something else as base url. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10581) Groups are not resolved in scaladoc for org.apache.spark.sql.Column
[ https://issues.apache.org/jira/browse/SPARK-10581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14960513#comment-14960513 ] Apache Spark commented on SPARK-10581: -- User 'pravingadakh' has created a pull request for this issue: https://github.com/apache/spark/pull/9148 > Groups are not resolved in scaladoc for org.apache.spark.sql.Column > --- > > Key: SPARK-10581 > URL: https://issues.apache.org/jira/browse/SPARK-10581 > Project: Spark > Issue Type: Bug > Components: Documentation >Reporter: Jacek Laskowski >Priority: Minor > > The Scala API documentation (scaladoc) for > [org.apache.spark.sql.Column|http://people.apache.org/~pwendell/spark-nightly/spark-master-docs/latest/api/scala/index.html#org.apache.spark.sql.Column] > does not resolve groups, and they appear unresolved like {{df_ops}}, > {{expr_ops}}, et al. instead of _DataFrame functions._, _Expression > operators._, et al. > BTW, > [DataFrame|http://people.apache.org/~pwendell/spark-nightly/spark-master-docs/latest/api/scala/index.html#org.apache.spark.sql.DataFrame] > and other classes in the > [org.apache.spark.sql|http://people.apache.org/~pwendell/spark-nightly/spark-master-docs/latest/api/scala/index.html#org.apache.spark.sql.package] > package seem fine. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-11146) missing or invalid dependency detected while loading class file 'RDDOperationScope.class
[ https://issues.apache.org/jira/browse/SPARK-11146?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-11146. --- Resolution: Cannot Reproduce Since all tests are passing, it sounds strongly like a problem local to your environment. Run a clean build please. We can reopen if you can show this happens on a fresh build from git, but then please give more info. > missing or invalid dependency detected while loading class file > 'RDDOperationScope.class > > > Key: SPARK-11146 > URL: https://issues.apache.org/jira/browse/SPARK-11146 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.4.1 > Environment: hadoop 2.2.0 ubuntu,eclipse mars,scala 2.10.4 >Reporter: Veerendra Nath Jasthi > > I am getting error whenever trying to run the scala code in eclipse (MARS) > ERROR: > missing or invalid dependency detected while loading class file > 'RDDOperationScope.class -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-11094) Test runner script fails to parse Java version.
[ https://issues.apache.org/jira/browse/SPARK-11094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-11094: -- Assignee: Jakob Odersky > Test runner script fails to parse Java version. > --- > > Key: SPARK-11094 > URL: https://issues.apache.org/jira/browse/SPARK-11094 > Project: Spark > Issue Type: Bug > Components: Tests > Environment: Debian testing >Reporter: Jakob Odersky >Assignee: Jakob Odersky >Priority: Minor > Fix For: 1.5.2, 1.6.0 > > > Running {{dev/run-tests}} fails when the local Java version has an extra > string appended to the version. > For example, in Debian Stretch (currently testing distribution), {{java > -version}} yields "1.8.0_66-internal" where the extra part "-internal" causes > the script to fail. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-11147) HTTP 500 if try to access Spark UI in yarn-cluster
Sebastian YEPES FERNANDEZ created SPARK-11147: - Summary: HTTP 500 if try to access Spark UI in yarn-cluster Key: SPARK-11147 URL: https://issues.apache.org/jira/browse/SPARK-11147 Project: Spark Issue Type: Bug Components: Web UI, YARN Affects Versions: 1.5.1 Environment: HDP: 2.3.2.0-2950 (Hadoop 2.7.1.2.3.2.0-2950) Spark: 1.5.x (c27e1904) Reporter: Sebastian YEPES FERNANDEZ Hello, I am facing a similar issue as described in SPARK-5837, but in my case the Spark UI only works in "yarn-client" mode. If I run the same job using "yarn-cluster" I get the following HTTP 500 error: {code} HTTP ERROR 500 Problem accessing /proxy/application_1444297190346_0085/. Reason: Connection to http://XX.XX.XX.XX:55827 refused Caused by: org.apache.http.conn.HttpHostConnectException: Connection to http://XX.XX.XX.XX:55827 refused at org.apache.http.impl.conn.DefaultClientConnectionOperator.openConnection(DefaultClientConnectionOperator.java:190) at org.apache.http.impl.conn.ManagedClientConnectionImpl.open(ManagedClientConnectionImpl.java:294) at org.apache.http.impl.client.DefaultRequestDirector.tryConnect(DefaultRequestDirector.java:643) at org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:479) at org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:906) {code} I have verified that the UI port "55827" is actually listening on the worker node; I can even run a "curl http://XX.XX.XX.XX:55827" and it redirects me to another URL: http://YY.YY.YY.YY:8088/proxy/application_1444297190346_0082 The strange thing is that it is redirecting me to app "_0082" and not the actually running job "_0085". Does anyone have any suggestions on what could be causing this issue? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-10965) Optimize filesEqualRecursive
[ https://issues.apache.org/jira/browse/SPARK-10965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mark Grover resolved SPARK-10965. - Resolution: Won't Fix Thanks Sean. Marking this as Won't Fix since I don't think this is super important. > Optimize filesEqualRecursive > > > Key: SPARK-10965 > URL: https://issues.apache.org/jira/browse/SPARK-10965 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.5.2 >Reporter: Mark Grover >Priority: Minor > > When we try to download dependencies, if there is a file at the destination > already, we compare if the files are equal (recursively, if they are > directories). For files, we compare their bytes. Now, these dependencies can > be jars and be really large, and byte-by-byte comparisons can be super slow. > I think it'd be better to do a checksum. > Here's the code in question: > https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/util/Utils.scala#L500 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10754) table and column name are case sensitive when json Dataframe was registered as tempTable using JavaSparkContext.
[ https://issues.apache.org/jira/browse/SPARK-10754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14960802#comment-14960802 ] Rick Hillegas commented on SPARK-10754: --- Note that unquoted identifiers are case-insensitive in the SQL Standard. Thanks. > table and column name are case sensitive when json Dataframe was registered > as tempTable using JavaSparkContext. > - > > Key: SPARK-10754 > URL: https://issues.apache.org/jira/browse/SPARK-10754 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.3.0, 1.3.1, 1.4.1 > Environment: Linux ,Hadoop Version 1.3 >Reporter: Babulal > > Create a dataframe using json data source > SparkConf conf=new > SparkConf().setMaster("spark://xyz:7077")).setAppName("Spark Tabble"); > JavaSparkContext javacontext=new JavaSparkContext(conf); > SQLContext sqlContext=new SQLContext(javacontext); > > DataFrame df = > sqlContext.jsonFile("/user/root/examples/src/main/resources/people.json"); > > df.registerTempTable("sparktable"); > > Run the Query > > sqlContext.sql("select * from sparktable").show()// this will PASs > > > sqlContext.sql("select * from sparkTable").show()/// This will FAIL > > java.lang.RuntimeException: Table Not Found: sparkTable > at scala.sys.package$.error(package.scala:27) > at > org.apache.spark.sql.catalyst.analysis.SimpleCatalog$$anonfun$1.apply(Catalog.scala:115) > at > org.apache.spark.sql.catalyst.analysis.SimpleCatalog$$anonfun$1.apply(Catalog.scala:115) > at scala.collection.MapLike$class.getOrElse(MapLike.scala:128) > at scala.collection.AbstractMap.getOrElse(Map.scala:58) > at > org.apache.spark.sql.catalyst.analysis.SimpleCatalog.lookupRelation(Catalog.scala:115) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$.getTable(Analyzer.scala:233) > > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
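As a hedged workaround sketch only (assuming Spark 1.4+ where the {{spark.sql.caseSensitive}} conf key exists; not the official resolution of this issue): either use one consistent casing for the temp table name, or turn identifier case sensitivity off so that both spellings resolve, which also matches the SQL Standard treatment of unquoted identifiers noted above. Shown in Scala for brevity:

{code}
// Sketch: make catalog/identifier resolution case-insensitive.
sqlContext.setConf("spark.sql.caseSensitive", "false")

sqlContext.sql("select * from sparktable").show()  // passes
sqlContext.sql("select * from sparkTable").show()  // now resolves to the same table
{code}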
[jira] [Created] (SPARK-11148) Unable to create views
Lunen created SPARK-11148: - Summary: Unable to create views Key: SPARK-11148 URL: https://issues.apache.org/jira/browse/SPARK-11148 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.5.1 Environment: Ubuntu 14.04 Spark-1.5.1-bin-hadoop2.6 (I don't have Hadoop or Hive installed) Start spark-all.sh and thriftserver with mysql jar driver Reporter: Lunen Priority: Critical I am unable to create views within Spark SQL. Creating tables without specifying the column names works, e.g. CREATE TABLE trade2 USING org.apache.spark.sql.jdbc OPTIONS ( url "jdbc:mysql://192.168.30.191:3318/?user=root", dbtable "database.trade", driver "com.mysql.jdbc.Driver" ); Creating tables with data types gives an error: CREATE TABLE trade2( COL1 timestamp, COL2 STRING, COL3 STRING) USING org.apache.spark.sql.jdbc OPTIONS ( url "jdbc:mysql://192.168.30.191:3318/?user=root", dbtable "database.trade", driver "com.mysql.jdbc.Driver" ); Error: org.apache.spark.sql.AnalysisException: org.apache.spark.sql.execution.datasources.jdbc.DefaultSource does not allow user-specified schemas.; SQLState: null ErrorCode: 0 Trying to create a VIEW from the table that was created. (The select statement below returns data.) CREATE VIEW viewtrade as Select Col1 from trade2; Error: org.apache.spark.sql.execution.QueryExecutionException: FAILED: SemanticException [Error 10004]: Line 1:30 Invalid table alias or column reference 'Col1': (possible column names are: col) SQLState: null ErrorCode: 0 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9999) RDD-like API on top of Catalyst/DataFrame
[ https://issues.apache.org/jira/browse/SPARK-?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14961513#comment-14961513 ] Sandy Ryza commented on SPARK-: --- So ClassTags would work for case classes and Avro specific records, but wouldn't work for tuples (or anywhere else types get erased). Blrgh. I wonder if the former is enough? Tuples are pretty useful though. > RDD-like API on top of Catalyst/DataFrame > - > > Key: SPARK- > URL: https://issues.apache.org/jira/browse/SPARK- > Project: Spark > Issue Type: Story > Components: SQL >Reporter: Reynold Xin >Assignee: Michael Armbrust > > The RDD API is very flexible, and as a result harder to optimize its > execution in some cases. The DataFrame API, on the other hand, is much easier > to optimize, but lacks some of the nice perks of the RDD API (e.g. harder to > use UDFs, lack of strong types in Scala/Java). > The goal of Spark Datasets is to provide an API that allows users to easily > express transformations on domain objects, while also providing the > performance and robustness advantages of the Spark SQL execution engine. > h2. Requirements > - *Fast* - In most cases, the performance of Datasets should be equal to or > better than working with RDDs. Encoders should be as fast or faster than > Kryo and Java serialization, and unnecessary conversion should be avoided. > - *Typesafe* - Similar to RDDs, objects and functions that operate on those > objects should provide compile-time safety where possible. When converting > from data where the schema is not known at compile-time (for example data > read from an external source such as JSON), the conversion function should > fail-fast if there is a schema mismatch. > - *Support for a variety of object models* - Default encoders should be > provided for a variety of object models: primitive types, case classes, > tuples, POJOs, JavaBeans, etc. Ideally, objects that follow standard > conventions, such as Avro SpecificRecords, should also work out of the box. > - *Java Compatible* - Datasets should provide a single API that works in > both Scala and Java. Where possible, shared types like Array will be used in > the API. Where not possible, overloaded functions should be provided for > both languages. Scala concepts, such as ClassTags should not be required in > the user-facing API. > - *Interoperates with DataFrames* - Users should be able to seamlessly > transition between Datasets and DataFrames, without specifying conversion > boiler-plate. When names used in the input schema line-up with fields in the > given class, no extra mapping should be necessary. Libraries like MLlib > should not need to provide different interfaces for accepting DataFrames and > Datasets as input. > For a detailed outline of the complete proposed API: > [marmbrus/dataset-api|https://github.com/marmbrus/spark/pull/18/files] > For an initial discussion of the design considerations in this API: [design > doc|https://docs.google.com/document/d/1ZVaDqOcLm2-NcS0TElmslHLsEIEwqzt0vBvzpLrV6Ik/edit#] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
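A small illustration of the erasure point in the comment above (not taken from the Dataset design itself): a {{ClassTag}} carries only the erased runtime class, so the element types of a tuple are lost, while a {{TypeTag}} (or an explicit encoder) preserves them.

{code}
// REPL-style sketch
import scala.reflect.classTag
import scala.reflect.runtime.universe.typeTag

classTag[(Int, String)].runtimeClass  // class scala.Tuple2 -- element types erased
typeTag[(Int, String)].tpe            // (Int, String)      -- full type preserved
{code}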
[jira] [Assigned] (SPARK-11070) Remove older releases on dist.apache.org
[ https://issues.apache.org/jira/browse/SPARK-11070?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell reassigned SPARK-11070: --- Assignee: Patrick Wendell > Remove older releases on dist.apache.org > > > Key: SPARK-11070 > URL: https://issues.apache.org/jira/browse/SPARK-11070 > Project: Spark > Issue Type: Task > Components: Build >Reporter: Sean Owen >Assignee: Patrick Wendell >Priority: Trivial > Attachments: SPARK-11070.patch > > > dist.apache.org should be periodically cleaned up such that it only includes > the latest releases in each active minor release branch. This is to reduce > load on mirrors. It can probably lose the 1.2.x releases at this point. In > total this would clean out 6 of the 9 releases currently mirrored at > https://dist.apache.org/repos/dist/release/spark/ > All releases are always archived at archive.apache.org and continue to be > available. The JS behind spark.apache.org/downloads.html needs to be updated > to point at archive.apache.org for older releases, then. > There won't be a pull request for this as it's strictly an update to the site > hosted in SVN, and the files hosted by Apache. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-11163) Remove unnecessary addPendingTask calls in TaskSetManager.executorLost
[ https://issues.apache.org/jira/browse/SPARK-11163?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-11163: Assignee: Apache Spark (was: Kay Ousterhout) > Remove unnecessary addPendingTask calls in TaskSetManager.executorLost > -- > > Key: SPARK-11163 > URL: https://issues.apache.org/jira/browse/SPARK-11163 > Project: Spark > Issue Type: Improvement > Components: Scheduler >Reporter: Kay Ousterhout >Assignee: Apache Spark >Priority: Minor > Fix For: 1.5.1, 1.5.2 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-10599) Decrease communication in BlockMatrix multiply and increase performance
[ https://issues.apache.org/jira/browse/SPARK-10599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley resolved SPARK-10599. --- Resolution: Fixed Fix Version/s: 1.6.0 Issue resolved by pull request 8757 [https://github.com/apache/spark/pull/8757] > Decrease communication in BlockMatrix multiply and increase performance > --- > > Key: SPARK-10599 > URL: https://issues.apache.org/jira/browse/SPARK-10599 > Project: Spark > Issue Type: Improvement > Components: MLlib >Reporter: Burak Yavuz >Assignee: Burak Yavuz > Fix For: 1.6.0 > > > The BlockMatrix multiply sends each block to all the corresponding columns of > the right BlockMatrix, even though there might not be any corresponding block > to multiply with. > Some optimizations we can perform are: > - Simulate the multiplication on the driver, and figure out which blocks > actually need to be shuffled > - Send the block once to a partition, and join inside the partition rather > than sending multiple copies to the same partition -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
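For intuition, a hypothetical sketch of the "simulate the multiplication on the driver" idea (not the actual MLlib implementation): given only the block coordinates present in two sparse block matrices, work out which output columns each block of A actually contributes to, so blocks are shuffled only where a partner block exists.

{code}
// Sketch: coordinates are (rowBlockIndex, colBlockIndex) of non-empty blocks.
def neededDestinations(
    aBlocks: Set[(Int, Int)],      // coordinates (i, p) of blocks of A
    bBlocks: Set[(Int, Int)]       // coordinates (p, j) of blocks of B
  ): Map[(Int, Int), Set[Int]] = {
  // group B's column indices by its row index (the shared inner dimension)
  val bColsByRow: Map[Int, Set[Int]] =
    bBlocks.groupBy(_._1).map { case (p, coords) => p -> coords.map(_._2) }
  aBlocks.map { case (i, p) =>
    // A(i, p) only matters for output blocks (i, j) where B(p, j) exists
    (i, p) -> bColsByRow.getOrElse(p, Set.empty[Int])
  }.toMap
}
{code}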
[jira] [Created] (SPARK-11162) Allow enabling debug logging from the command line
Ryan Williams created SPARK-11162: - Summary: Allow enabling debug logging from the command line Key: SPARK-11162 URL: https://issues.apache.org/jira/browse/SPARK-11162 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.5.1 Reporter: Ryan Williams Priority: Minor Per [~vanzin] on [the user list|http://apache-spark-user-list.1001560.n3.nabble.com/how-to-set-log-level-of-spark-executor-on-YARN-using-yarn-cluster-mode-tp16528p16529.html], it would be nice if debug-logging could be enabled from the command line. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
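Until such a flag exists, a hedged sketch of the usual workaround (the file name and application class are hypothetical; it assumes yarn-cluster mode, where files passed via {{--files}} are localized into each container's working directory):

{code}
spark-submit \
  --files log4j-debug.properties \
  --conf "spark.driver.extraJavaOptions=-Dlog4j.configuration=log4j-debug.properties" \
  --conf "spark.executor.extraJavaOptions=-Dlog4j.configuration=log4j-debug.properties" \
  --class com.example.App \
  app.jar
{code}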
[jira] [Commented] (SPARK-10877) Assertions fail straightforward DataFrame job due to word alignment
[ https://issues.apache.org/jira/browse/SPARK-10877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14961505#comment-14961505 ] Davies Liu commented on SPARK-10877: This is already fixed in master and 1.5 branch. > Assertions fail straightforward DataFrame job due to word alignment > --- > > Key: SPARK-10877 > URL: https://issues.apache.org/jira/browse/SPARK-10877 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.0 >Reporter: Matt Cheah >Assignee: Davies Liu > Attachments: SparkFilterByKeyTest.scala > > > I have some code that I’m running in a unit test suite, but the code I’m > running is failing with an assertion error. > I have translated the JUnit test that was failing, to a Scala script that I > will attach to the ticket. The assertion error is the following: > {code} > Exception in thread "main" org.apache.spark.SparkException: Job aborted due > to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: > Lost task 0.0 in stage 0.0 (TID 0, localhost): java.lang.AssertionError: > lengthInBytes must be a multiple of 8 (word-aligned) > at > org.apache.spark.unsafe.hash.Murmur3_x86_32.hashUnsafeWords(Murmur3_x86_32.java:53) > at > org.apache.spark.sql.catalyst.expressions.UnsafeArrayData.hashCode(UnsafeArrayData.java:289) > at > org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow$class.hashCode(rows.scala:149) > at > org.apache.spark.sql.catalyst.expressions.GenericMutableRow.hashCode(rows.scala:247) > at org.apache.spark.HashPartitioner.getPartition(Partitioner.scala:85) > at > org.apache.spark.sql.execution.Exchange$$anonfun$doExecute$1$$anonfun$4$$anonfun$apply$4.apply(Exchange.scala:180) > at > org.apache.spark.sql.execution.Exchange$$anonfun$doExecute$1$$anonfun$4$$anonfun$apply$4.apply(Exchange.scala:180) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) > {code} > However, it turns out that this code actually works normally and computes the > correct result if assertions are turned off. > I traced the code and found that when hashUnsafeWords was called, it was > given a byte-length of 12, which clearly is not a multiple of 8. However, the > job seems to compute correctly regardless of this fact. Of course, I can’t > just disable assertions for my unit test though. > A few things we need to understand: > 1. Why is the lengthInBytes of size 12? > 2. Is it actually a problem that the byte length is not word-aligned? If so, > how should we fix the byte length? If it's not a problem, why is the > assertion flagging a false negative? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
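As an illustration of what the assertion is checking (not the actual fix, which per the comment above has already landed): the word-oriented hash requires a byte length that is a multiple of 8, i.e. rounded up to the next word boundary.

{code}
// Illustration only: round a byte length up to the next 8-byte word boundary.
def roundUpToWordBoundary(lengthInBytes: Int): Int = (lengthInBytes + 7) & ~7

roundUpToWordBoundary(12)  // 16 -- the failing length from the stack trace
roundUpToWordBoundary(16)  // 16 -- already aligned
{code}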
[jira] [Commented] (SPARK-11070) Remove older releases on dist.apache.org
[ https://issues.apache.org/jira/browse/SPARK-11070?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14961515#comment-14961515 ] Patrick Wendell commented on SPARK-11070: - I removed them - I did leave 1.5.0 for now, but we can remove it in a bit - just because 1.5.1 is so new. {code} svn rm https://dist.apache.org/repos/dist/release/spark/spark-1.1.1 -m "Remving Spark 1.1.1 release" svn rm https://dist.apache.org/repos/dist/release/spark/spark-1.2.1 -m "Remving Spark 1.2.1 release" svn rm https://dist.apache.org/repos/dist/release/spark/spark-1.2.2 -m "Remving Spark 1.2.2 release" svn rm https://dist.apache.org/repos/dist/release/spark/spark-1.3.0 -m "Remving Spark 1.3.0 release" svn rm https://dist.apache.org/repos/dist/release/spark/spark-1.4.0 -m "Remving Spark 1.4.0 release" {code} > Remove older releases on dist.apache.org > > > Key: SPARK-11070 > URL: https://issues.apache.org/jira/browse/SPARK-11070 > Project: Spark > Issue Type: Task > Components: Build >Reporter: Sean Owen >Assignee: Patrick Wendell >Priority: Trivial > Attachments: SPARK-11070.patch > > > dist.apache.org should be periodically cleaned up such that it only includes > the latest releases in each active minor release branch. This is to reduce > load on mirrors. It can probably lose the 1.2.x releases at this point. In > total this would clean out 6 of the 9 releases currently mirrored at > https://dist.apache.org/repos/dist/release/spark/ > All releases are always archived at archive.apache.org and continue to be > available. The JS behind spark.apache.org/downloads.html needs to be updated > to point at archive.apache.org for older releases, then. > There won't be a pull request for this as it's strictly an update to the site > hosted in SVN, and the files hosted by Apache. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-11070) Remove older releases on dist.apache.org
[ https://issues.apache.org/jira/browse/SPARK-11070?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell resolved SPARK-11070. - Resolution: Fixed > Remove older releases on dist.apache.org > > > Key: SPARK-11070 > URL: https://issues.apache.org/jira/browse/SPARK-11070 > Project: Spark > Issue Type: Task > Components: Build >Reporter: Sean Owen >Assignee: Patrick Wendell >Priority: Trivial > Attachments: SPARK-11070.patch > > > dist.apache.org should be periodically cleaned up such that it only includes > the latest releases in each active minor release branch. This is to reduce > load on mirrors. It can probably lose the 1.2.x releases at this point. In > total this would clean out 6 of the 9 releases currently mirrored at > https://dist.apache.org/repos/dist/release/spark/ > All releases are always archived at archive.apache.org and continue to be > available. The JS behind spark.apache.org/downloads.html needs to be updated > to point at archive.apache.org for older releases, then. > There won't be a pull request for this as it's strictly an update to the site > hosted in SVN, and the files hosted by Apache. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9999) RDD-like API on top of Catalyst/DataFrame
[ https://issues.apache.org/jira/browse/SPARK-?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14961518#comment-14961518 ] Michael Armbrust commented on SPARK-: - Yeah, I think tuples are a pretty important use case. Perhaps more importantly though, I think having a concept of encoders instead of relying on JVM types future-proofs the API by giving us more control. If you look closely at the test case examples, there are some pretty crazy macro examples (i.e., {{R(a = 1, b = 2L)}}) where we actually create something like named tuples that code-generate, at compile time, the logic required to directly encode the user's results into Tungsten format without needing to allocate an intermediate object. > RDD-like API on top of Catalyst/DataFrame > - > > Key: SPARK- > URL: https://issues.apache.org/jira/browse/SPARK- > Project: Spark > Issue Type: Story > Components: SQL >Reporter: Reynold Xin >Assignee: Michael Armbrust > > The RDD API is very flexible, and as a result harder to optimize its > execution in some cases. The DataFrame API, on the other hand, is much easier > to optimize, but lacks some of the nice perks of the RDD API (e.g. harder to > use UDFs, lack of strong types in Scala/Java). > The goal of Spark Datasets is to provide an API that allows users to easily > express transformations on domain objects, while also providing the > performance and robustness advantages of the Spark SQL execution engine. > h2. Requirements > - *Fast* - In most cases, the performance of Datasets should be equal to or > better than working with RDDs. Encoders should be as fast or faster than > Kryo and Java serialization, and unnecessary conversion should be avoided. > - *Typesafe* - Similar to RDDs, objects and functions that operate on those > objects should provide compile-time safety where possible. When converting > from data where the schema is not known at compile-time (for example data > read from an external source such as JSON), the conversion function should > fail-fast if there is a schema mismatch. > - *Support for a variety of object models* - Default encoders should be > provided for a variety of object models: primitive types, case classes, > tuples, POJOs, JavaBeans, etc. Ideally, objects that follow standard > conventions, such as Avro SpecificRecords, should also work out of the box. > - *Java Compatible* - Datasets should provide a single API that works in > both Scala and Java. Where possible, shared types like Array will be used in > the API. Where not possible, overloaded functions should be provided for > both languages. Scala concepts, such as ClassTags should not be required in > the user-facing API. > - *Interoperates with DataFrames* - Users should be able to seamlessly > transition between Datasets and DataFrames, without specifying conversion > boiler-plate. When names used in the input schema line-up with fields in the > given class, no extra mapping should be necessary. Libraries like MLlib > should not need to provide different interfaces for accepting DataFrames and > Datasets as input. > For a detailed outline of the complete proposed API: > [marmbrus/dataset-api|https://github.com/marmbrus/spark/pull/18/files] > For an initial discussion of the design considerations in this API: [design > doc|https://docs.google.com/document/d/1ZVaDqOcLm2-NcS0TElmslHLsEIEwqzt0vBvzpLrV6Ik/edit#] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-11163) Remove unnecessary addPendingTask calls in TaskSetManager.executorLost
[ https://issues.apache.org/jira/browse/SPARK-11163?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kay Ousterhout updated SPARK-11163: --- Summary: Remove unnecessary addPendingTask calls in TaskSetManager.executorLost (was: Remove unnecessary addPendingTask calls in TaskSetManager) > Remove unnecessary addPendingTask calls in TaskSetManager.executorLost > -- > > Key: SPARK-11163 > URL: https://issues.apache.org/jira/browse/SPARK-11163 > Project: Spark > Issue Type: Improvement > Components: Scheduler >Reporter: Kay Ousterhout >Assignee: Kay Ousterhout >Priority: Minor > Fix For: 1.5.1, 1.5.2 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-11163) Remove unnecessary addPendingTask calls in TaskSetManager
Kay Ousterhout created SPARK-11163: -- Summary: Remove unnecessary addPendingTask calls in TaskSetManager Key: SPARK-11163 URL: https://issues.apache.org/jira/browse/SPARK-11163 Project: Spark Issue Type: Improvement Components: Scheduler Reporter: Kay Ousterhout Assignee: Kay Ousterhout Priority: Minor Fix For: 1.5.2, 1.5.1 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org