[jira] [Created] (SPARK-11150) Dynamic partition pruning

2015-10-16 Thread Younes (JIRA)
Younes created SPARK-11150:
--

 Summary: Dynamic partition pruning
 Key: SPARK-11150
 URL: https://issues.apache.org/jira/browse/SPARK-11150
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.5.1, 1.6.0
Reporter: Younes


Partitions are not pruned when a table is joined on its partition columns.
This is the same issue as HIVE-9152.
Example:
Select ... from tab where partcol=1 will prune down to the partition with value 1, but
Select ... from tab join dim on (dim.partcol=tab.partcol) where dim.partcol=1
will scan all partitions.
The tables are stored as Parquet.
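A minimal way to observe the problem is to compare the physical plans of the two queries. This is only an illustrative sketch; the table names {{tab}}, {{dim}} and the column {{partcol}} are taken from the example above and assumed to exist as a partitioned Parquet table and a dimension table.

{code}
// Sketch: compare physical plans to see whether partition pruning kicks in.
// Table and column names are just the ones from the example above.
val direct = sqlContext.sql("SELECT * FROM tab WHERE partcol = 1")
direct.explain(true)   // expected: only the partcol=1 partition is read

val joined = sqlContext.sql(
  "SELECT * FROM tab JOIN dim ON dim.partcol = tab.partcol WHERE dim.partcol = 1")
joined.explain(true)   // reported problem: the scan of tab covers all partitions
{code}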






[jira] [Updated] (SPARK-11152) Streaming UI: Input sizes are 0 for makeup batches started from a checkpoint

2015-10-16 Thread Yongjia Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11152?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yongjia Wang updated SPARK-11152:
-
Description: Say a streaming job is resumed from a checkpoint at batch time x, and the
current time when we resume this streaming job is x+10. In this scenario, since Spark
will schedule the missing batches from x+1 to x+10 without any metadata, the behavior
is to pack up all the backlogged inputs into batch x+1, then assign any new inputs
into x+2 to x+10 immediately without waiting. This results in tiny batches that
capture inputs only during the back-to-back scheduling intervals. This behavior is
very reasonable. However, the streaming UI does not correctly show the input sizes
for all these makeup batches - they are all 0 from batch x to x+10. Fixing this would
be very helpful. This happens when I use Kafka direct streaming; I assume this would
happen for all other streaming sources as well.  (was: Say a streaming job starts
from a checkpoint at batch time x, and the current time when we resume this streaming
job is x+10. In this scenario, since Spark will schedule the missing batches from x+1
to x+10 without any metadata, the behavior is to pack up all the backlogged inputs
into batch x+1, then assign any new inputs into x+2 to x+10 immediately without
waiting. This results in tiny batches that capture inputs only during the
back-to-back scheduling intervals. This behavior is very reasonable. However, the
streaming UI does not correctly show the input sizes for all these makeup batches -
they are all 0 from batch x to x+10. Fixing this would be very helpful. This happens
when I use Kafka direct streaming; I assume this would happen for all other streaming
sources as well.)

> Streaming UI: Input sizes are 0 for makeup batches started from a checkpoint 
> -
>
> Key: SPARK-11152
> URL: https://issues.apache.org/jira/browse/SPARK-11152
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming, Web UI
>Reporter: Yongjia Wang
>Priority: Minor
>
> Say a streaming job is resumed from a checkpoint at batch time x, and the current
> time when we resume this streaming job is x+10. In this scenario, since Spark will
> schedule the missing batches from x+1 to x+10 without any metadata, the behavior is
> to pack up all the backlogged inputs into batch x+1, then assign any new inputs into
> x+2 to x+10 immediately without waiting. This results in tiny batches that capture
> inputs only during the back-to-back scheduling intervals. This behavior is very
> reasonable. However, the streaming UI does not correctly show the input sizes for
> all these makeup batches - they are all 0 from batch x to x+10. Fixing this would be
> very helpful. This happens when I use Kafka direct streaming; I assume this would
> happen for all other streaming sources as well.
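For context, a checkpointed job is typically resumed with {{StreamingContext.getOrCreate}}; the following is only a hedged sketch of the resume path the description refers to, with the checkpoint directory, batch interval and app name as placeholders rather than anything from this report.

{code}
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Hedged sketch of resuming from a checkpoint; paths and intervals are placeholders.
val checkpointDir = "/tmp/checkpoint"

def createContext(): StreamingContext = {
  val conf = new SparkConf().setAppName("makeup-batches-demo")
  val ssc = new StreamingContext(conf, Seconds(10))
  ssc.checkpoint(checkpointDir)
  // ... define input streams (e.g. a Kafka direct stream) and output operations here ...
  ssc
}

// If the checkpoint exists, Spark reschedules the batches missed while the job was
// down ("makeup" batches); their input sizes are what the streaming UI shows as 0.
val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
ssc.start()
ssc.awaitTermination()
{code}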






[jira] [Updated] (SPARK-11153) Turns off Parquet filter push-down for string and binary columns

2015-10-16 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11153?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-11153:
---
Description: 
Due to PARQUET-251, {{BINARY}} columns in existing Parquet files may be written 
with corrupted statistics information. This information is used by filter 
push-down optimization. Since Spark 1.5 turns on Parquet filter push-down by 
default, we may end up with wrong query results. PARQUET-251 has been fixed in 
parquet-mr 1.8.1, but Spark 1.5 is still using 1.7.0.

Note that this kind of corrupted Parquet file could be produced by any Parquet
data model.

This affects all Spark SQL data types that can be mapped to Parquet {{BINARY}}, 
namely:

- {{StringType}}
- {{BinaryType}}
- {{DecimalType}} (but Spark SQL doesn't support pushing down {{DecimalType}} 
columns for now.)

To avoid wrong query results, we should disable filter push-down for columns of 
{{StringType}} and {{BinaryType}} until we upgrade to parquet-mr 1.8.

  was:
Due to PARQUET-251, {{BINARY}} columns in existing Parquet files may be written 
with corrupted statistics information. This information is used by filter 
push-down optimization. Since Spark 1.5 turns on Parquet filter push-down by 
default, we may end up with wrong query results. PARQUET-251 has been fixed in 
parquet-mr 1.8.1, but Spark 1.5 is still using 1.7.0.

This affects all Spark SQL data types that can be mapped to Parquet {{BINARY}}, 
namely:

- {{StringType}}
- {{BinaryType}}
- {{DecimalType}} (but Spark SQL doesn't support pushing down {{DecimalType}} 
columns for now.)

To avoid wrong query results, we should disable filter push-down for columns of 
{{StringType}} and {{BinaryType}} until we upgrade to parquet-mr 1.8.


> Turns off Parquet filter push-down for string and binary columns
> 
>
> Key: SPARK-11153
> URL: https://issues.apache.org/jira/browse/SPARK-11153
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0, 1.5.1
>Reporter: Cheng Lian
>Assignee: Cheng Lian
>Priority: Blocker
>
> Due to PARQUET-251, {{BINARY}} columns in existing Parquet files may be 
> written with corrupted statistics information. This information is used by 
> filter push-down optimization. Since Spark 1.5 turns on Parquet filter 
> push-down by default, we may end up with wrong query results. PARQUET-251 has 
> been fixed in parquet-mr 1.8.1, but Spark 1.5 is still using 1.7.0.
> Note that this kind of corrupted Parquet files could be produced by any 
> Parquet data models.
> This affects all Spark SQL data types that can be mapped to Parquet 
> {{BINARY}}, namely:
> - {{StringType}}
> - {{BinaryType}}
> - {{DecimalType}} (but Spark SQL doesn't support pushing down {{DecimalType}} 
> columns for now.)
> To avoid wrong query results, we should disable filter push-down for columns 
> of {{StringType}} and {{BinaryType}} until we upgrade to parquet-mr 1.8.
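Until the parquet-mr upgrade lands, a coarser interim workaround (my suggestion for affected users, not the fix in the linked PR) is to disable Parquet filter push-down entirely via the existing {{spark.sql.parquet.filterPushdown}} flag:

{code}
// Workaround sketch (not the actual fix): turn off Parquet filter push-down
// globally so corrupted BINARY statistics can never affect query results.
sqlContext.setConf("spark.sql.parquet.filterPushdown", "false")

// or at submit time:
//   spark-submit --conf spark.sql.parquet.filterPushdown=false ...
{code}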






[jira] [Resolved] (SPARK-10953) Benchmark codegen vs. hand-written code for univariate statistics

2015-10-16 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10953?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-10953.
---
   Resolution: Done
Fix Version/s: 1.6.0

> Benchmark codegen vs. hand-written code for univariate statistics
> -
>
> Key: SPARK-10953
> URL: https://issues.apache.org/jira/browse/SPARK-10953
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Xiangrui Meng
>Assignee: Jihong MA
> Fix For: 1.6.0
>
>
> I checked the generated code for a simple stddev_pop call:
> {code}
> val df = sqlContext.range(100)
> df.select(stddev_pop(col("id"))).show()
> {code}
> This is the generated code for the merge part, which is very long and 
> complex. I'm not sure whether we can get benefit from the code generation for 
> univariate statistics. We should benchmark it against Scala implementation.
> {code}
> 15/10/06 10:10:57 DEBUG GenerateMutableProjection: code for if 
> (isnull(input[1, DoubleType])) cast(0 as double) else input[1, DoubleType],if 
> (isnull(input[1, DoubleType])) input[6, DoubleType] else if (isnull(input[6, 
> DoubleType])) input[1, DoubleType] else (input[1, DoubleType] + input[6, 
> DoubleType]),if (isnull(input[3, DoubleType])) cast(0 as double) else 
> input[3, DoubleType],if (isnull(input[3, DoubleType])) input[8, DoubleType] 
> else if (isnull(input[8, DoubleType])) input[3, DoubleType] else (((input[3, 
> DoubleType] * input[0, DoubleType]) + (input[8, DoubleType] * input[6, 
> DoubleType])) / (input[0, DoubleType] + input[6, DoubleType])),if 
> (isnull(input[4, DoubleType])) input[9, DoubleType] else if (isnull(input[9, 
> DoubleType])) input[4, DoubleType] else ((input[4, DoubleType] + input[9, 
> DoubleType]) + input[8, DoubleType] - input[2, DoubleType]) * (input[8, 
> DoubleType] - input[2, DoubleType])) * (input[0, DoubleType] * input[6, 
> DoubleType])) / (input[0, DoubleType] + input[6, DoubleType]))):
> public Object generate(org.apache.spark.sql.catalyst.expressions.Expression[] 
> expr) {
>   return new SpecificMutableProjection(expr);
> }
> class SpecificMutableProjection extends 
> org.apache.spark.sql.catalyst.expressions.codegen.BaseMutableProjection {
>   private org.apache.spark.sql.catalyst.expressions.Expression[] expressions;
>   private org.apache.spark.sql.catalyst.expressions.MutableRow mutableRow;
>   public 
> SpecificMutableProjection(org.apache.spark.sql.catalyst.expressions.Expression[]
>  expr) {
> expressions = expr;
> mutableRow = new 
> org.apache.spark.sql.catalyst.expressions.GenericMutableRow(5);
>   }
>   public 
> org.apache.spark.sql.catalyst.expressions.codegen.BaseMutableProjection 
> target(org.apache.spark.sql.catalyst.expressions.MutableRow row) {
> mutableRow = row;
> return this;
>   }
>   /* Provide immutable access to the last projected row. */
>   public InternalRow currentValue() {
> return (InternalRow) mutableRow;
>   }
>   public Object apply(Object _i) {
> InternalRow i = (InternalRow) _i;
> /* if (isnull(input[1, DoubleType])) cast(0 as double) else input[1, 
> DoubleType] */
> /* isnull(input[1, DoubleType]) */
> /* input[1, DoubleType] */
> boolean isNull4 = i.isNullAt(1);
> double primitive5 = isNull4 ? -1.0 : (i.getDouble(1));
> boolean isNull0 = false;
> double primitive1 = -1.0;
> if (!false && isNull4) {
>   /* cast(0 as double) */
>   /* 0 */
>   boolean isNull6 = false;
>   double primitive7 = -1.0;
>   if (!false) {
> primitive7 = (double) 0;
>   }
>   isNull0 = isNull6;
>   primitive1 = primitive7;
> } else {
>   /* input[1, DoubleType] */
>   boolean isNull10 = i.isNullAt(1);
>   double primitive11 = isNull10 ? -1.0 : (i.getDouble(1));
>   isNull0 = isNull10;
>   primitive1 = primitive11;
> }
> if (isNull0) {
>   mutableRow.setNullAt(0);
> } else {
>   mutableRow.setDouble(0, primitive1);
> }
> /* if (isnull(input[1, DoubleType])) input[6, DoubleType] else if 
> (isnull(input[6, DoubleType])) input[1, DoubleType] else (input[1, 
> DoubleType] + input[6, DoubleType]) */
> /* isnull(input[1, DoubleType]) */
> /* input[1, DoubleType] */
> boolean isNull16 = i.isNullAt(1);
> double primitive17 = isNull16 ? -1.0 : (i.getDouble(1));
> boolean isNull12 = false;
> double primitive13 = -1.0;
> if (!false && isNull16) {
>   /* input[6, DoubleType] */
>   boolean isNull18 = i.isNullAt(6);
>   double primitive19 = isNull18 ? -1.0 : (i.getDouble(6));
>   isNull12 = isNull18;
>   primitive13 = primitive19;
> } else {
>   /* if (isnull(input[6, DoubleType])) input[1, DoubleType] else 
> (input[1, DoubleType] + input[6, DoubleType]) */
>   /* isnull(input[6, 

[jira] [Commented] (SPARK-10994) Local clustering coefficient computation in GraphX

2015-10-16 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10994?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14961000#comment-14961000
 ] 

Apache Spark commented on SPARK-10994:
--

User 'SherlockYang' has created a pull request for this issue:
https://github.com/apache/spark/pull/9150

> Local clustering coefficient computation in GraphX
> --
>
> Key: SPARK-10994
> URL: https://issues.apache.org/jira/browse/SPARK-10994
> Project: Spark
>  Issue Type: New Feature
>  Components: GraphX
>Reporter: Yang Yang
>   Original Estimate: 336h
>  Remaining Estimate: 336h
>
> We propose to implement an algorithm to compute the local clustering 
> coefficient in GraphX. The local clustering coefficient of a vertex (node) in 
> a graph quantifies how close its neighbors are to being a clique (complete 
> graph). More specifically, the local clustering coefficient C_i for a vertex 
> v_i is given by the proportion of links between the vertices within its 
> neighbourhood divided by the number of links that could possibly exist 
> between them. Duncan J. Watts and Steven Strogatz introduced the measure in 
> 1998 to determine whether a graph is a small-world network. 
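For reference, the quantity can already be approximated on top of existing GraphX primitives; below is a hedged sketch (assuming an undirected graph loaded with canonical edge orientation, and a placeholder edge-list path), dividing each vertex's triangle count by the number of possible neighbor pairs.

{code}
import org.apache.spark.graphx.{GraphLoader, PartitionStrategy}

// Hedged sketch using existing GraphX operators; the edge file path is a placeholder.
val graph = GraphLoader
  .edgeListFile(sc, "hdfs:///tmp/edges.txt", canonicalOrientation = true)
  .partitionBy(PartitionStrategy.RandomVertexCut)

val triangles = graph.triangleCount().vertices           // triangles per vertex
val localCC = graph.degrees.join(triangles).mapValues {
  case (deg, tri) =>
    if (deg < 2) 0.0 else 2.0 * tri / (deg * (deg - 1))  // C_i = 2*T_i / (k_i*(k_i-1))
}
localCC.take(10).foreach(println)
{code}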






[jira] [Assigned] (SPARK-10994) Local clustering coefficient computation in GraphX

2015-10-16 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10994?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10994:


Assignee: (was: Apache Spark)

> Local clustering coefficient computation in GraphX
> --
>
> Key: SPARK-10994
> URL: https://issues.apache.org/jira/browse/SPARK-10994
> Project: Spark
>  Issue Type: New Feature
>  Components: GraphX
>Reporter: Yang Yang
>   Original Estimate: 336h
>  Remaining Estimate: 336h
>
> We propose to implement an algorithm to compute the local clustering 
> coefficient in GraphX. The local clustering coefficient of a vertex (node) in 
> a graph quantifies how close its neighbors are to being a clique (complete 
> graph). More specifically, the local clustering coefficient C_i for a vertex 
> v_i is given by the proportion of links between the vertices within its 
> neighbourhood divided by the number of links that could possibly exist 
> between them. Duncan J. Watts and Steven Strogatz introduced the measure in 
> 1998 to determine whether a graph is a small-world network. 






[jira] [Assigned] (SPARK-10994) Local clustering coefficient computation in GraphX

2015-10-16 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10994?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10994:


Assignee: Apache Spark

> Local clustering coefficient computation in GraphX
> --
>
> Key: SPARK-10994
> URL: https://issues.apache.org/jira/browse/SPARK-10994
> Project: Spark
>  Issue Type: New Feature
>  Components: GraphX
>Reporter: Yang Yang
>Assignee: Apache Spark
>   Original Estimate: 336h
>  Remaining Estimate: 336h
>
> We propose to implement an algorithm to compute the local clustering 
> coefficient in GraphX. The local clustering coefficient of a vertex (node) in 
> a graph quantifies how close its neighbors are to being a clique (complete 
> graph). More specifically, the local clustering coefficient C_i for a vertex 
> v_i is given by the proportion of links between the vertices within its 
> neighbourhood divided by the number of links that could possibly exist 
> between them. Duncan J. Watts and Steven Strogatz introduced the measure in 
> 1998 to determine whether a graph is a small-world network. 






[jira] [Created] (SPARK-11149) Improve performance of primitive types in columnar cache

2015-10-16 Thread Davies Liu (JIRA)
Davies Liu created SPARK-11149:
--

 Summary: Improve performance of primitive types in columnar cache
 Key: SPARK-11149
 URL: https://issues.apache.org/jira/browse/SPARK-11149
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Davies Liu
Assignee: Davies Liu


Improve performance of primitive types in columnar cache






[jira] [Created] (SPARK-11153) Turns off Parquet filter push-down for string and binary columns

2015-10-16 Thread Cheng Lian (JIRA)
Cheng Lian created SPARK-11153:
--

 Summary: Turns off Parquet filter push-down for string and binary 
columns
 Key: SPARK-11153
 URL: https://issues.apache.org/jira/browse/SPARK-11153
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.5.1, 1.5.0
Reporter: Cheng Lian
Assignee: Cheng Lian
Priority: Critical


Due to PARQUET-251, {{BINARY}} columns in existing Parquet files may be written 
with corrupted statistics information. This information is used by filter 
push-down optimization. Since Spark 1.5 turns on Parquet filter push-down by 
default, we may end up with wrong query results. PARQUET-251 has been fixed in 
parquet-mr 1.8.1, but Spark 1.5 is still using 1.7.0.

This affects all Spark SQL data types that can be mapped to Parquet {{BINARY}}, 
namely:

- {{StringType}}
- {{BinaryType}}
- {{DecimalType}} (but Spark SQL doesn't support pushing down {{DecimalType}} 
columns for now.)

To avoid wrong query results, we should disable filter push-down for columns of 
{{StringType}} and {{BinaryType}} until we upgrade to parquet-mr 1.8.






[jira] [Updated] (SPARK-11153) Turns off Parquet filter push-down for string and binary columns

2015-10-16 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11153?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-11153:
---
Priority: Blocker  (was: Critical)

> Turns off Parquet filter push-down for string and binary columns
> 
>
> Key: SPARK-11153
> URL: https://issues.apache.org/jira/browse/SPARK-11153
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0, 1.5.1
>Reporter: Cheng Lian
>Assignee: Cheng Lian
>Priority: Blocker
>
> Due to PARQUET-251, {{BINARY}} columns in existing Parquet files may be 
> written with corrupted statistics information. This information is used by 
> filter push-down optimization. Since Spark 1.5 turns on Parquet filter 
> push-down by default, we may end up with wrong query results. PARQUET-251 has 
> been fixed in parquet-mr 1.8.1, but Spark 1.5 is still using 1.7.0.
> This affects all Spark SQL data types that can be mapped to Parquet 
> {{BINARY}}, namely:
> - {{StringType}}
> - {{BinaryType}}
> - {{DecimalType}} (but Spark SQL doesn't support pushing down {{DecimalType}} 
> columns for now.)
> To avoid wrong query results, we should disable filter push-down for columns 
> of {{StringType}} and {{BinaryType}} until we upgrade to parquet-mr 1.8.






[jira] [Created] (SPARK-11155) Stage summary json should include stage duration

2015-10-16 Thread Imran Rashid (JIRA)
Imran Rashid created SPARK-11155:


 Summary: Stage summary json should include stage duration 
 Key: SPARK-11155
 URL: https://issues.apache.org/jira/browse/SPARK-11155
 Project: Spark
  Issue Type: Improvement
  Components: Web UI
Reporter: Imran Rashid
Priority: Minor


The JSON endpoint for stages doesn't include the stage duration information that is
present in the UI.  This looks like a simple oversight; the metrics should be
included, e.g. at {{api/v1/applications//stages}}. The missing metrics are
{{submissionTime}} and {{completionTime}} (and whatever other metrics come out
of the discussion on SPARK-10930).
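For reference, the endpoint in question can be queried as below. This is a sketch only; the host, port and application id are placeholders, not values from this report.

{code}
import scala.io.Source

// Sketch: fetch the stage summaries from the REST API of a running application.
// Host, port and application id are placeholders.
val appId = "app-20151016120000-0000"
val url = s"http://localhost:4040/api/v1/applications/$appId/stages"
val json = Source.fromURL(url).mkString
println(json)  // per the report, lacks submissionTime / completionTime per stage
{code}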






[jira] [Updated] (SPARK-10895) Add pushdown string filters for Parquet

2015-10-16 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10895?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-10895:
---
Assignee: Liang-Chi Hsieh

> Add pushdown string filters for Parquet
> ---
>
> Key: SPARK-10895
> URL: https://issues.apache.org/jira/browse/SPARK-10895
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Liang-Chi Hsieh
>Assignee: Liang-Chi Hsieh
>
> We should be able to push down string filters such as contains, startsWith 
> and endsWith to Parquet.






[jira] [Commented] (SPARK-10165) Nested Hive UDF resolution fails in Analyzer

2015-10-16 Thread Michael Armbrust (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14961159#comment-14961159
 ] 

Michael Armbrust commented on SPARK-10165:
--

That sounds like a different issue.  Please open up a separate JIRA.

> Nested Hive UDF resolution fails in Analyzer
> 
>
> Key: SPARK-10165
> URL: https://issues.apache.org/jira/browse/SPARK-10165
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Michael Armbrust
>Assignee: Michael Armbrust
>Priority: Blocker
> Fix For: 1.5.0
>
>
> When running a query with hive udfs nested in hive udfs the analyzer fails 
> since we don't check children resolution first.






[jira] [Created] (SPARK-11154) make specification spark.yarn.executor.memoryOverhead consistent with typical JVM options

2015-10-16 Thread Dustin Cote (JIRA)
Dustin Cote created SPARK-11154:
---

 Summary: make specification spark.yarn.executor.memoryOverhead 
consistent with typical JVM options
 Key: SPARK-11154
 URL: https://issues.apache.org/jira/browse/SPARK-11154
 Project: Spark
  Issue Type: Improvement
  Components: Documentation, Spark Submit
Reporter: Dustin Cote
Priority: Minor


spark.yarn.executor.memoryOverhead is currently specified in megabytes by 
default, but it would be nice to allow users to specify the size as though it 
were a typical -Xmx option to a JVM where you can have 'm' and 'g' appended to 
the end to explicitly specify megabytes or gigabytes.  
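For illustration, here is today's MB-only form next to the suffixed form the report asks for. The suffixed value is only the proposal, not syntax that is currently supported.

{code}
// Today: the value is interpreted as megabytes, with no unit suffix allowed.
val conf = new org.apache.spark.SparkConf()
  .set("spark.yarn.executor.memoryOverhead", "384")      // 384 MB

// Proposed (illustrative only, not yet supported): allow JVM-style suffixes,
// e.g. "384m" or "1g", matching options like -Xmx1g.
// .set("spark.yarn.executor.memoryOverhead", "384m")
{code}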






[jira] [Issue Comment Deleted] (SPARK-10994) Local clustering coefficient computation in GraphX

2015-10-16 Thread Yang Yang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10994?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yang Yang updated SPARK-10994:
--
Comment: was deleted

(was: Proposed implementation: https://github.com/amplab/graphx/pull/148/)

> Local clustering coefficient computation in GraphX
> --
>
> Key: SPARK-10994
> URL: https://issues.apache.org/jira/browse/SPARK-10994
> Project: Spark
>  Issue Type: New Feature
>  Components: GraphX
>Reporter: Yang Yang
>   Original Estimate: 336h
>  Remaining Estimate: 336h
>
> We propose to implement an algorithm to compute the local clustering 
> coefficient in GraphX. The local clustering coefficient of a vertex (node) in 
> a graph quantifies how close its neighbors are to being a clique (complete 
> graph). More specifically, the local clustering coefficient C_i for a vertex 
> v_i is given by the proportion of links between the vertices within its 
> neighbourhood divided by the number of links that could possibly exist 
> between them. Duncan J. Watts and Steven Strogatz introduced the measure in 
> 1998 to determine whether a graph is a small-world network. 






[jira] [Created] (SPARK-11152) Streaming UI: Input sizes are 0 for makeup batches started from a checkpoint

2015-10-16 Thread Yongjia Wang (JIRA)
Yongjia Wang created SPARK-11152:


 Summary: Streaming UI: Input sizes are 0 for makeup batches 
started from a checkpoint 
 Key: SPARK-11152
 URL: https://issues.apache.org/jira/browse/SPARK-11152
 Project: Spark
  Issue Type: Bug
  Components: Streaming, Web UI
Reporter: Yongjia Wang
Priority: Minor


When a streaming job starts from a checkpoint at batch time x, and say the 
current time when we resume this streaming job is x+10. In this scenario, since 
Spark will schedule the missing batches from x+1 to x+10 without any metadata, 
the behavior is to pack up all the backlogged inputs into batch x+1, then 
assign any new inputs into x+2 to x+10 immediately without waiting. This 
results in tiny batches that capture inputs only during the back to back 
scheduling intervals. This behavior is very reasonable. However, the streaming 
UI does not show correctly the input sizes for all these makeup batches - they 
are all 0 from batch x to x+10. Fixing this would be very helpful.






[jira] [Updated] (SPARK-11152) Streaming UI: Input sizes are 0 for makeup batches started from a checkpoint

2015-10-16 Thread Yongjia Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11152?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yongjia Wang updated SPARK-11152:
-
Description: When a streaming job starts from a checkpoint at batch time x, 
and say the current time when we resume this streaming job is x+10. In this 
scenario, since Spark will schedule the missing batches from x+1 to x+10 
without any metadata, the behavior is to pack up all the backlogged inputs into 
batch x+1, then assign any new inputs into x+2 to x+10 immediately without 
waiting. This results in tiny batches that capture inputs only during the back 
to back scheduling intervals. This behavior is very reasonable. However, the 
streaming UI does not show correctly the input sizes for all these makeup 
batches - they are all 0 from batch x to x+10. Fixing this would be very 
helpful. This happens when I use Kafka direct streaming, I assume this would 
happen for all other streaming sources as well.  (was: When a streaming job 
starts from a checkpoint at batch time x, and say the current time when we 
resume this streaming job is x+10. In this scenario, since Spark will schedule 
the missing batches from x+1 to x+10 without any metadata, the behavior is to 
pack up all the backlogged inputs into batch x+1, then assign any new inputs 
into x+2 to x+10 immediately without waiting. This results in tiny batches that 
capture inputs only during the back to back scheduling intervals. This behavior 
is very reasonable. However, the streaming UI does not show correctly the input 
sizes for all these makeup batches - they are all 0 from batch x to x+10. 
Fixing this would be very helpful.)

> Streaming UI: Input sizes are 0 for makeup batches started from a checkpoint 
> -
>
> Key: SPARK-11152
> URL: https://issues.apache.org/jira/browse/SPARK-11152
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming, Web UI
>Reporter: Yongjia Wang
>Priority: Minor
>
> When a streaming job starts from a checkpoint at batch time x, and say the 
> current time when we resume this streaming job is x+10. In this scenario, 
> since Spark will schedule the missing batches from x+1 to x+10 without any 
> metadata, the behavior is to pack up all the backlogged inputs into batch 
> x+1, then assign any new inputs into x+2 to x+10 immediately without waiting. 
> This results in tiny batches that capture inputs only during the back to back 
> scheduling intervals. This behavior is very reasonable. However, the 
> streaming UI does not show correctly the input sizes for all these makeup 
> batches - they are all 0 from batch x to x+10. Fixing this would be very 
> helpful. This happens when I use Kafka direct streaming, I assume this would 
> happen for all other streaming sources as well.






[jira] [Commented] (SPARK-11149) Improve performance of primitive types in columnar cache

2015-10-16 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14961003#comment-14961003
 ] 

Apache Spark commented on SPARK-11149:
--

User 'davies' has created a pull request for this issue:
https://github.com/apache/spark/pull/9145

> Improve performance of primitive types in columnar cache
> 
>
> Key: SPARK-11149
> URL: https://issues.apache.org/jira/browse/SPARK-11149
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Davies Liu
>Assignee: Davies Liu
>
> Improve performance of primitive types in columnar cache






[jira] [Assigned] (SPARK-11149) Improve performance of primitive types in columnar cache

2015-10-16 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11149?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11149:


Assignee: Apache Spark  (was: Davies Liu)

> Improve performance of primitive types in columnar cache
> 
>
> Key: SPARK-11149
> URL: https://issues.apache.org/jira/browse/SPARK-11149
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Davies Liu
>Assignee: Apache Spark
>
> Improve performance of primitive types in columnar cache






[jira] [Assigned] (SPARK-11149) Improve performance of primitive types in columnar cache

2015-10-16 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11149?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11149:


Assignee: Davies Liu  (was: Apache Spark)

> Improve performance of primitive types in columnar cache
> 
>
> Key: SPARK-11149
> URL: https://issues.apache.org/jira/browse/SPARK-11149
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Davies Liu
>Assignee: Davies Liu
>
> Improve performance of primitive types in columnar cache






[jira] [Commented] (SPARK-11153) Turns off Parquet filter push-down for string and binary columns

2015-10-16 Thread Michael Armbrust (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14961147#comment-14961147
 ] 

Michael Armbrust commented on SPARK-11153:
--

It's actually corrupted statistics in the data that is written?  Does Parquet write
the version in the metadata?  Should we actually be turning this off based on the
writer version?

> Turns off Parquet filter push-down for string and binary columns
> 
>
> Key: SPARK-11153
> URL: https://issues.apache.org/jira/browse/SPARK-11153
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0, 1.5.1
>Reporter: Cheng Lian
>Assignee: Cheng Lian
>Priority: Blocker
>
> Due to PARQUET-251, {{BINARY}} columns in existing Parquet files may be 
> written with corrupted statistics information. This information is used by 
> filter push-down optimization. Since Spark 1.5 turns on Parquet filter 
> push-down by default, we may end up with wrong query results. PARQUET-251 has 
> been fixed in parquet-mr 1.8.1, but Spark 1.5 is still using 1.7.0.
> This affects all Spark SQL data types that can be mapped to Parquet 
> {{BINARY}}, namely:
> - {{StringType}}
> - {{BinaryType}}
> - {{DecimalType}} (but Spark SQL doesn't support pushing down {{DecimalType}} 
> columns for now.)
> To avoid wrong query results, we should disable filter push-down for columns 
> of {{StringType}} and {{BinaryType}} until we upgrade to parquet-mr 1.8.






[jira] [Commented] (SPARK-10953) Benchmark codegen vs. hand-written code for univariate statistics

2015-10-16 Thread Xiangrui Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10953?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14961209#comment-14961209
 ] 

Xiangrui Meng commented on SPARK-10953:
---

That sounds good. I'm closing this for now since the conclusion is clear.

> Benchmark codegen vs. hand-written code for univariate statistics
> -
>
> Key: SPARK-10953
> URL: https://issues.apache.org/jira/browse/SPARK-10953
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Xiangrui Meng
>Assignee: Jihong MA
> Fix For: 1.6.0
>
>
> I checked the generated code for a simple stddev_pop call:
> {code}
> val df = sqlContext.range(100)
> df.select(stddev_pop(col("id"))).show()
> {code}
> This is the generated code for the merge part, which is very long and 
> complex. I'm not sure whether we can get benefit from the code generation for 
> univariate statistics. We should benchmark it against Scala implementation.
> {code}
> 15/10/06 10:10:57 DEBUG GenerateMutableProjection: code for if 
> (isnull(input[1, DoubleType])) cast(0 as double) else input[1, DoubleType],if 
> (isnull(input[1, DoubleType])) input[6, DoubleType] else if (isnull(input[6, 
> DoubleType])) input[1, DoubleType] else (input[1, DoubleType] + input[6, 
> DoubleType]),if (isnull(input[3, DoubleType])) cast(0 as double) else 
> input[3, DoubleType],if (isnull(input[3, DoubleType])) input[8, DoubleType] 
> else if (isnull(input[8, DoubleType])) input[3, DoubleType] else (((input[3, 
> DoubleType] * input[0, DoubleType]) + (input[8, DoubleType] * input[6, 
> DoubleType])) / (input[0, DoubleType] + input[6, DoubleType])),if 
> (isnull(input[4, DoubleType])) input[9, DoubleType] else if (isnull(input[9, 
> DoubleType])) input[4, DoubleType] else ((input[4, DoubleType] + input[9, 
> DoubleType]) + input[8, DoubleType] - input[2, DoubleType]) * (input[8, 
> DoubleType] - input[2, DoubleType])) * (input[0, DoubleType] * input[6, 
> DoubleType])) / (input[0, DoubleType] + input[6, DoubleType]))):
> public Object generate(org.apache.spark.sql.catalyst.expressions.Expression[] 
> expr) {
>   return new SpecificMutableProjection(expr);
> }
> class SpecificMutableProjection extends 
> org.apache.spark.sql.catalyst.expressions.codegen.BaseMutableProjection {
>   private org.apache.spark.sql.catalyst.expressions.Expression[] expressions;
>   private org.apache.spark.sql.catalyst.expressions.MutableRow mutableRow;
>   public 
> SpecificMutableProjection(org.apache.spark.sql.catalyst.expressions.Expression[]
>  expr) {
> expressions = expr;
> mutableRow = new 
> org.apache.spark.sql.catalyst.expressions.GenericMutableRow(5);
>   }
>   public 
> org.apache.spark.sql.catalyst.expressions.codegen.BaseMutableProjection 
> target(org.apache.spark.sql.catalyst.expressions.MutableRow row) {
> mutableRow = row;
> return this;
>   }
>   /* Provide immutable access to the last projected row. */
>   public InternalRow currentValue() {
> return (InternalRow) mutableRow;
>   }
>   public Object apply(Object _i) {
> InternalRow i = (InternalRow) _i;
> /* if (isnull(input[1, DoubleType])) cast(0 as double) else input[1, 
> DoubleType] */
> /* isnull(input[1, DoubleType]) */
> /* input[1, DoubleType] */
> boolean isNull4 = i.isNullAt(1);
> double primitive5 = isNull4 ? -1.0 : (i.getDouble(1));
> boolean isNull0 = false;
> double primitive1 = -1.0;
> if (!false && isNull4) {
>   /* cast(0 as double) */
>   /* 0 */
>   boolean isNull6 = false;
>   double primitive7 = -1.0;
>   if (!false) {
> primitive7 = (double) 0;
>   }
>   isNull0 = isNull6;
>   primitive1 = primitive7;
> } else {
>   /* input[1, DoubleType] */
>   boolean isNull10 = i.isNullAt(1);
>   double primitive11 = isNull10 ? -1.0 : (i.getDouble(1));
>   isNull0 = isNull10;
>   primitive1 = primitive11;
> }
> if (isNull0) {
>   mutableRow.setNullAt(0);
> } else {
>   mutableRow.setDouble(0, primitive1);
> }
> /* if (isnull(input[1, DoubleType])) input[6, DoubleType] else if 
> (isnull(input[6, DoubleType])) input[1, DoubleType] else (input[1, 
> DoubleType] + input[6, DoubleType]) */
> /* isnull(input[1, DoubleType]) */
> /* input[1, DoubleType] */
> boolean isNull16 = i.isNullAt(1);
> double primitive17 = isNull16 ? -1.0 : (i.getDouble(1));
> boolean isNull12 = false;
> double primitive13 = -1.0;
> if (!false && isNull16) {
>   /* input[6, DoubleType] */
>   boolean isNull18 = i.isNullAt(6);
>   double primitive19 = isNull18 ? -1.0 : (i.getDouble(6));
>   isNull12 = isNull18;
>   primitive13 = primitive19;
> } else {
>   /* if (isnull(input[6, DoubleType])) input[1, DoubleType] else 
> (input[1, 

[jira] [Created] (SPARK-11151) Use Long internally for DecimalType with precision <= 18

2015-10-16 Thread Davies Liu (JIRA)
Davies Liu created SPARK-11151:
--

 Summary: Use Long internally for DecimalType with precision <= 18
 Key: SPARK-11151
 URL: https://issues.apache.org/jira/browse/SPARK-11151
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Davies Liu


It's expensive to create a Decimal object for small precisions; we could use a Long
directly, just like what we already do for Date and Timestamp.

This will involve a lot of changes, including:

1) inbound/outbound conversion
2) access/storage in InternalRow
3) all the expressions that support DecimalType
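A rough sketch of the core idea follows; this is not Spark's internal API, and {{LongDecimal}} is a hypothetical name used purely for illustration of storing the unscaled value in a Long.

{code}
// Hedged sketch of the idea, not Spark's internal implementation: a decimal with
// precision <= 18 always fits in a signed 64-bit integer as an unscaled value plus
// a scale, so no java.math.BigDecimal allocation is needed on the hot path.
case class LongDecimal(unscaled: Long, scale: Int) {
  def toBigDecimal: BigDecimal = BigDecimal(unscaled) / BigDecimal(10).pow(scale)
  def +(that: LongDecimal): LongDecimal = {
    require(scale == that.scale, "rescale first")
    LongDecimal(unscaled + that.unscaled, scale)  // may overflow beyond 18 digits
  }
}

val price = LongDecimal(12345, 2)            // represents 123.45
val tax   = LongDecimal(99, 2)               // represents 0.99
println((price + tax).toBigDecimal)          // 124.44
{code}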






[jira] [Commented] (SPARK-11147) HTTP 500 if try to access Spark UI in yarn-cluster

2015-10-16 Thread Sebastian YEPES FERNANDEZ (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14961088#comment-14961088
 ] 

Sebastian YEPES FERNANDEZ commented on SPARK-11147:
---

I don't think it's a networking issue, as until now we have not had any issue
like this; we regularly submit jobs in client mode and all worker nodes
communicate correctly.
Which part of the logs (YARN or Spark) would be the most useful so we can
pinpoint this problem?

Note: Between all the servers there are no firewalls nor OS filtering.

> HTTP 500 if try to access Spark UI in yarn-cluster
> --
>
> Key: SPARK-11147
> URL: https://issues.apache.org/jira/browse/SPARK-11147
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI, YARN
>Affects Versions: 1.5.1
> Environment: HDP: 2.3.2.0-2950 (Hadoop 2.7.1.2.3.2.0-2950)
> Spark: 1.5.x (c27e1904)
>Reporter: Sebastian YEPES FERNANDEZ
>
> Hello,
> I am facing a similar issue to the one described in SPARK-5837, but in my case the
> SparkUI only works in "yarn-client" mode. If I run the same job using
> "yarn-cluster", I get an HTTP 500 error:
> {code}
> HTTP ERROR 500
> Problem accessing /proxy/application_1444297190346_0085/. Reason:
> Connection to http://XX.XX.XX.XX:55827 refused
> Caused by:
> org.apache.http.conn.HttpHostConnectException: Connection to 
> http://XX.XX.XX.XX:55827 refused
>   at 
> org.apache.http.impl.conn.DefaultClientConnectionOperator.openConnection(DefaultClientConnectionOperator.java:190)
>   at 
> org.apache.http.impl.conn.ManagedClientConnectionImpl.open(ManagedClientConnectionImpl.java:294)
>   at 
> org.apache.http.impl.client.DefaultRequestDirector.tryConnect(DefaultRequestDirector.java:643)
>   at 
> org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:479)
>   at 
> org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:906)
> {code}
> I have verified that the UI port "55827" is actually listening on the worker
> node; I can even run "curl http://XX.XX.XX.XX:55827" and it redirects me to
> another URL: http://YY.YY.YY.YY:8088/proxy/application_1444297190346_0082
> The strange thing is that it is redirecting me to the app "_0082" and not the
> actually running job "_0085".
> Does anyone have any suggestions on what could be causing this issue?






[jira] [Resolved] (SPARK-11124) JsonParser/Generator should be closed for resource recycle

2015-10-16 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-11124.
-
   Resolution: Fixed
 Assignee: Navis
Fix Version/s: 1.6.0

> JsonParser/Generator should be closed for resource recycle
> --
>
> Key: SPARK-11124
> URL: https://issues.apache.org/jira/browse/SPARK-11124
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Navis
>Assignee: Navis
>Priority: Trivial
> Fix For: 1.6.0
>
>
> Some json parsers are not closed. parser in JacksonParser#parseJson, for 
> example.
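A hedged sketch of the pattern the fix is about, closing a Jackson parser in a finally block so its buffers are recycled; this is illustrative only and not the actual patch.

{code}
import com.fasterxml.jackson.core.JsonFactory

// Illustrative only: ensure the JsonParser is always closed, even on failure.
val factory = new JsonFactory()
val parser = factory.createParser("""{"a": 1}""")
try {
  while (parser.nextToken() != null) {
    // ... consume tokens ...
  }
} finally {
  parser.close()
}
{code}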






[jira] [Commented] (SPARK-11154) make specification spark.yarn.executor.memoryOverhead consistent with typical JVM options

2015-10-16 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14961184#comment-14961184
 ] 

Sean Owen commented on SPARK-11154:
---

Should be for all similar properties, not just this one. The twist is that you 
have to support the current syntax. 1000 must mean "1000 megabytes". But then 
someone writing "100" would be surprised to find that it means "100 
megabytes". (CM might do just this, note.) Hence I'm actually not sure if this 
is feasible.

> make specification spark.yarn.executor.memoryOverhead consistent with 
> typical JVM options
> --
>
> Key: SPARK-11154
> URL: https://issues.apache.org/jira/browse/SPARK-11154
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, Spark Submit
>Reporter: Dustin Cote
>Priority: Minor
>
> spark.yarn.executor.memoryOverhead is currently specified in megabytes by 
> default, but it would be nice to allow users to specify the size as though it 
> were a typical -Xmx option to a JVM where you can have 'm' and 'g' appended 
> to the end to explicitly specify megabytes or gigabytes.  






[jira] [Updated] (SPARK-10641) skewness and kurtosis support

2015-10-16 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10641?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-10641:
--
Attachment: simpler-moments.pdf

I did some calculation offline and got a simpler formula for updating 
high-order moments. [~sethah] If you are interested, you can implement 
ImperativeAggregate using this formula.

> skewness and kurtosis support
> -
>
> Key: SPARK-10641
> URL: https://issues.apache.org/jira/browse/SPARK-10641
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, SQL
>Reporter: Jihong MA
>Assignee: Seth Hendrickson
> Attachments: simpler-moments.pdf
>
>
> Implementing skewness and kurtosis support based on following algorithm:
> https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Higher-order_statistics
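For reference, the single-pass update rules from the linked page look like the sketch below. This is the standard online higher-order-moment algorithm from that Wikipedia article, not the simpler merge formula in the attached simpler-moments.pdf.

{code}
// Sketch of the one-pass higher-order moment updates from the linked Wikipedia page.
class MomentAggregator extends Serializable {
  private var n = 0L
  private var mean, m2, m3, m4 = 0.0

  def update(x: Double): Unit = {
    val n1 = n; n += 1
    val delta = x - mean
    val deltaN = delta / n
    val deltaN2 = deltaN * deltaN
    val term1 = delta * deltaN * n1
    mean += deltaN
    m4 += term1 * deltaN2 * (n * n - 3 * n + 3) + 6 * deltaN2 * m2 - 4 * deltaN * m3
    m3 += term1 * deltaN * (n - 2) - 3 * deltaN * m2
    m2 += term1
  }

  def skewness: Double = math.sqrt(n.toDouble) * m3 / math.pow(m2, 1.5)
  def kurtosis: Double = n * m4 / (m2 * m2) - 3.0
}
{code}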






[jira] [Created] (SPARK-11157) Allow Spark to be built without assemblies

2015-10-16 Thread Marcelo Vanzin (JIRA)
Marcelo Vanzin created SPARK-11157:
--

 Summary: Allow Spark to be built without assemblies
 Key: SPARK-11157
 URL: https://issues.apache.org/jira/browse/SPARK-11157
 Project: Spark
  Issue Type: Umbrella
  Components: Build, Spark Core, YARN
Reporter: Marcelo Vanzin


For reasoning, discussion of pros and cons, and other more detailed 
information, please see attached doc.

The idea is to be able to build a Spark distribution that has just a directory 
full of jars instead of the huge assembly files we currently have.

Getting there requires changes in a bunch of places. I'll try to list the ones
I identified in the document, in the order that I think is needed to avoid
breaking things:

* make streaming backends not be assemblies

Since people may depend on the current assembly artifacts in their deployments, 
we can't really remove them; but we can make them be dummy jars and rely on 
dependency resolution to download all the jars.

PySpark tests would also need some tweaking here.

* make examples jar not be an assembly

Probably requires tweaks to the {{run-example}} script. The location of the 
examples jar would have to change (it won't be able to live in the same place 
as the main Spark jars anymore).

* update YARN backend to handle a directory full of jars when launching apps

Currently YARN localizes the Spark assembly (depending on the user 
configuration); it needs to be modified so that it can localize all needed 
libraries instead of a single jar.

* Modify launcher library to handle the jars directory

This should be trivial.

* Modify {{assembly/pom.xml}} to generate assembly or a {{libs}} directory 
depending on which profile is enabled.

We should keep the option to build with the assembly on by default, for 
backwards compatibility, to give people time to prepare.

Filing this bug as an umbrella; please file sub-tasks if you plan to work on a 
specific part of the issue.






[jira] [Commented] (SPARK-11155) Stage summary json should include stage duration

2015-10-16 Thread Kay Ousterhout (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14961366#comment-14961366
 ] 

Kay Ousterhout commented on SPARK-11155:


[~imranr] where exactly do you mean this is missing?  I thought you meant the 
Json info for StageSubmitted / StageCompleted, but that does include the stage 
submission and completion time (via StageInfo), which can be used to compute 
the duration.

> Stage summary json should include stage duration 
> -
>
> Key: SPARK-11155
> URL: https://issues.apache.org/jira/browse/SPARK-11155
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Reporter: Imran Rashid
>Priority: Minor
>  Labels: Starter
>
> The json endpoint for stages doesn't include information on the stage 
> duration that is present in the UI.  This looks like a simple oversight, they 
> should be included.  eg., the metrics should be included at 
> {{api/v1/applications//stages}}. The missing metrics are 
> {{submissionTime}} and {{completionTime}} (and whatever other metrics come 
> out of the discussion on SPARK-10930)






[jira] [Resolved] (SPARK-11050) PySpark SparseVector can return wrong index in error message

2015-10-16 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11050?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley resolved SPARK-11050.
---
   Resolution: Fixed
Fix Version/s: 1.6.0

Issue resolved by pull request 9069
[https://github.com/apache/spark/pull/9069]

> PySpark SparseVector can return wrong index in error message
> 
>
> Key: SPARK-11050
> URL: https://issues.apache.org/jira/browse/SPARK-11050
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib, PySpark
>Affects Versions: 1.6.0
>Reporter: Joseph K. Bradley
>Assignee: Bhargav Mangipudi
>Priority: Trivial
>  Labels: starter
> Fix For: 1.6.0
>
>
> PySpark {{SparseVector.__getitem__}} returns an error message if given a bad 
> index here:
> [https://github.com/apache/spark/blob/a16396df76cc27099011bfb96b28cbdd7f964ca8/python/pyspark/mllib/linalg/__init__.py#L770]
> But the index it complains about could have been modified (if negative), 
> meaning the index in the error message could be wrong.  This should be 
> corrected.






[jira] [Assigned] (SPARK-11153) Turns off Parquet filter push-down for string and binary columns

2015-10-16 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11153?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11153:


Assignee: Cheng Lian  (was: Apache Spark)

> Turns off Parquet filter push-down for string and binary columns
> 
>
> Key: SPARK-11153
> URL: https://issues.apache.org/jira/browse/SPARK-11153
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0, 1.5.1
>Reporter: Cheng Lian
>Assignee: Cheng Lian
>Priority: Blocker
>
> Due to PARQUET-251, {{BINARY}} columns in existing Parquet files may be 
> written with corrupted statistics information. This information is used by 
> filter push-down optimization. Since Spark 1.5 turns on Parquet filter 
> push-down by default, we may end up with wrong query results. PARQUET-251 has 
> been fixed in parquet-mr 1.8.1, but Spark 1.5 is still using 1.7.0.
> Note that this kind of corrupted Parquet files could be produced by any 
> Parquet data models.
> This affects all Spark SQL data types that can be mapped to Parquet 
> {{BINARY}}, namely:
> - {{StringType}}
> - {{BinaryType}}
> - {{DecimalType}} (but Spark SQL doesn't support pushing down {{DecimalType}} 
> columns for now.)
> To avoid wrong query results, we should disable filter push-down for columns 
> of {{StringType}} and {{BinaryType}} until we upgrade to parquet-mr 1.8.






[jira] [Commented] (SPARK-11153) Turns off Parquet filter push-down for string and binary columns

2015-10-16 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14961300#comment-14961300
 ] 

Apache Spark commented on SPARK-11153:
--

User 'liancheng' has created a pull request for this issue:
https://github.com/apache/spark/pull/9152

> Turns off Parquet filter push-down for string and binary columns
> 
>
> Key: SPARK-11153
> URL: https://issues.apache.org/jira/browse/SPARK-11153
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0, 1.5.1
>Reporter: Cheng Lian
>Assignee: Cheng Lian
>Priority: Blocker
>
> Due to PARQUET-251, {{BINARY}} columns in existing Parquet files may be 
> written with corrupted statistics information. This information is used by 
> filter push-down optimization. Since Spark 1.5 turns on Parquet filter 
> push-down by default, we may end up with wrong query results. PARQUET-251 has 
> been fixed in parquet-mr 1.8.1, but Spark 1.5 is still using 1.7.0.
> Note that this kind of corrupted Parquet files could be produced by any 
> Parquet data models.
> This affects all Spark SQL data types that can be mapped to Parquet 
> {{BINARY}}, namely:
> - {{StringType}}
> - {{BinaryType}}
> - {{DecimalType}} (but Spark SQL doesn't support pushing down {{DecimalType}} 
> columns for now.)
> To avoid wrong query results, we should disable filter push-down for columns 
> of {{StringType}} and {{BinaryType}} until we upgrade to parquet-mr 1.8.






[jira] [Assigned] (SPARK-11158) Add more information in Error statement for sql/types _verify_type()

2015-10-16 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11158?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11158:


Assignee: Apache Spark

> Add more information in Error statement for sql/types _verify_type()
> ---
>
> Key: SPARK-11158
> URL: https://issues.apache.org/jira/browse/SPARK-11158
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Mahmoud Lababidi
>Assignee: Apache Spark
>Priority: Minor
>







[jira] [Assigned] (SPARK-11158) Add more information in Error statement for sql/types _verify_type()

2015-10-16 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11158?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11158:


Assignee: (was: Apache Spark)

> Add more information in Error statement for sql/types _verify_type()
> ---
>
> Key: SPARK-11158
> URL: https://issues.apache.org/jira/browse/SPARK-11158
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Mahmoud Lababidi
>Priority: Minor
>







[jira] [Commented] (SPARK-11158) Add more information in Error statement for sql/types _verify_type()

2015-10-16 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14961315#comment-14961315
 ] 

Apache Spark commented on SPARK-11158:
--

User 'lababidi' has created a pull request for this issue:
https://github.com/apache/spark/pull/9149

> Add more information in Error statement for sql/types _verify_type()
> ---
>
> Key: SPARK-11158
> URL: https://issues.apache.org/jira/browse/SPARK-11158
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Mahmoud Lababidi
>Priority: Minor
>







[jira] [Resolved] (SPARK-10581) Groups are not resolved in scaladoc for org.apache.spark.sql.Column

2015-10-16 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10581?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-10581.
-
   Resolution: Fixed
Fix Version/s: 1.6.0
   1.5.2

> Groups are not resolved in scaladoc for org.apache.spark.sql.Column
> ---
>
> Key: SPARK-10581
> URL: https://issues.apache.org/jira/browse/SPARK-10581
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation
>Reporter: Jacek Laskowski
>Priority: Minor
> Fix For: 1.5.2, 1.6.0
>
>
> The Scala API documentation (scaladoc) for 
> [org.apache.spark.sql.Column|http://people.apache.org/~pwendell/spark-nightly/spark-master-docs/latest/api/scala/index.html#org.apache.spark.sql.Column]
>  does not resolve groups, and they appear unresolved like {{df_ops}}, 
> {{expr_ops}}, et al. instead of _DataFrame functions._, _Expression 
> operators._, et al.  
> BTW, 
> [DataFrame|http://people.apache.org/~pwendell/spark-nightly/spark-master-docs/latest/api/scala/index.html#org.apache.spark.sql.DataFrame]
>  and other classes in the 
> [org.apache.spark.sql|http://people.apache.org/~pwendell/spark-nightly/spark-master-docs/latest/api/scala/index.html#org.apache.spark.sql.package]
>  package seem fine.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-10641) skewness and kurtosis support

2015-10-16 Thread Seth Hendrickson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14961359#comment-14961359
 ] 

Seth Hendrickson edited comment on SPARK-10641 at 10/16/15 9:07 PM:


[~mengxr] I am interested, do you mind providing it or a link to it?


was (Author: sethah):
[~mengxr] I am interested, do you mine providing it or a link to it?

> skewness and kurtosis support
> -
>
> Key: SPARK-10641
> URL: https://issues.apache.org/jira/browse/SPARK-10641
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, SQL
>Reporter: Jihong MA
>Assignee: Seth Hendrickson
> Attachments: simpler-moments.pdf
>
>
> Implementing skewness and kurtosis support based on following algorithm:
> https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Higher-order_statistics
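
For reference, a minimal sketch of the one-pass higher-order moment update from the linked Wikipedia page (textbook formulas only; the attached simpler-moments.pdf and the eventual Spark implementation may well differ):

{code}
class MomentAggregator {
  private var n = 0L
  private var mean, m2, m3, m4 = 0.0

  def update(x: Double): Unit = {
    val nOld = n.toDouble
    n += 1
    val delta  = x - mean
    val deltaN = delta / n
    val deltaN2 = deltaN * deltaN
    val term1   = delta * deltaN * nOld
    mean += deltaN
    // m4 and m3 must be updated before m2, since they use the old m2/m3 values
    m4 += term1 * deltaN2 * (n * n - 3 * n + 3) + 6 * deltaN2 * m2 - 4 * deltaN * m3
    m3 += term1 * deltaN * (n - 2) - 3 * deltaN * m2
    m2 += term1
  }

  def skewness: Double = math.sqrt(n.toDouble) * m3 / math.pow(m2, 1.5)
  def kurtosis: Double = n * m4 / (m2 * m2) - 3.0  // excess kurtosis
}
{code}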



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10581) Groups are not resolved in scaladoc for org.apache.spark.sql.Column

2015-10-16 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10581?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-10581:

Assignee: Pravin Vishnu Gadakh

> Groups are not resolved in scaladoc for org.apache.spark.sql.Column
> ---
>
> Key: SPARK-10581
> URL: https://issues.apache.org/jira/browse/SPARK-10581
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation
>Reporter: Jacek Laskowski
>Assignee: Pravin Vishnu Gadakh
>Priority: Minor
> Fix For: 1.5.2, 1.6.0
>
>
> The Scala API documentation (scaladoc) for 
> [org.apache.spark.sql.Column|http://people.apache.org/~pwendell/spark-nightly/spark-master-docs/latest/api/scala/index.html#org.apache.spark.sql.Column]
>  does not resolve groups, and they appear unresolved like {{df_ops}}, 
> {{expr_ops}}, et al. instead of _DataFrame functions._, _Expression 
> operators._, et al.  
> BTW, 
> [DataFrame|http://people.apache.org/~pwendell/spark-nightly/spark-master-docs/latest/api/scala/index.html#org.apache.spark.sql.DataFrame]
>  and other classes in the 
> [org.apache.spark.sql|http://people.apache.org/~pwendell/spark-nightly/spark-master-docs/latest/api/scala/index.html#org.apache.spark.sql.package]
>  package seem fine.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-11156) Web UI doesn't count or show info about replicated blocks

2015-10-16 Thread Ryan Williams (JIRA)
Ryan Williams created SPARK-11156:
-

 Summary: Web UI doesn't count or show info about replicated blocks
 Key: SPARK-11156
 URL: https://issues.apache.org/jira/browse/SPARK-11156
 Project: Spark
  Issue Type: Bug
  Components: Web UI
Affects Versions: 1.5.1
Reporter: Ryan Williams


When executors receive a replica of a block, they [notify the driver with a 
{{UpdateBlockInfo}} 
message|https://github.com/apache/spark/blob/4ee2cea2a43f7d04ab8511d9c029f80c5dadd48e/core/src/main/scala/org/apache/spark/storage/BlockManagerMaster.scala#L59-L61]
 which [sends a {{SparkListenerBlockUpdated}} event to 
SparkListeners|https://github.com/apache/spark/blob/4ee2cea2a43f7d04ab8511d9c029f80c5dadd48e/core/src/main/scala/org/apache/spark/storage/BlockManagerMasterEndpoint.scala#L67].

However, the web UI (via its BlockStatusListener) [ignores 
{{SparkListenerBlockUpdated}} events for non-streaming 
blocks|https://github.com/apache/spark/blob/4ee2cea2a43f7d04ab8511d9c029f80c5dadd48e/core/src/main/scala/org/apache/spark/storage/BlockStatusListener.scala#L57-L60].

As a result, in non-streaming apps:
* The "Executors" column on RDD Page doesn't show executors housing replicas; 
it can only show the executor that initially computed (and initiated 
replication of) the block.
*  The executor-memory-usage and related stats displayed throughout the web 
interface are undercounting due to ignorance of the existence of block replicas.

For example, here is the Storage tab for a simple app with 3 identical RDDs 
cached with replication equal to 1, 2, and 3:

!http://f.cl.ly/items/3m3B2v2k2J23350I3t1c/Screen%20Shot%202015-10-16%20at%2012.30.54%20AM.png!

These were generated with:

{code}
val bar1 = sc.parallelize(1 to 1, 100).map(_ % 100 -> 1).reduceByKey(_+_, 100).setName("bar1").persist(StorageLevel(false, true, false, true, 1))
bar1.count
val bar2 = sc.parallelize(1 to 1, 100).map(_ % 100 -> 1).reduceByKey(_+_, 100).setName("bar2").persist(StorageLevel(false, true, false, true, 2))
bar2.count
val bar3 = sc.parallelize(1 to 1, 100).map(_ % 100 -> 1).reduceByKey(_+_, 100).setName("bar3").persist(StorageLevel(false, true, false, true, 3))
bar3.count
{code}

Note the identically-reported memory usage across the three.

Here is the RDD page for the 3x-replicated RDD above:

!http://f.cl.ly/items/0t0H1o2S2g140s1A0X0k/Screen%20Shot%202015-10-16%20at%2012.31.24%20AM.png!

Note that only one executor is listed for each partition.
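
For reference, the three {{persist}} calls above correspond to MEMORY_ONLY storage with replication 1, 2 and 3 (assuming the standard {{StorageLevel}} constants; only the 1x and 2x variants have named constants):

{code}
import org.apache.spark.storage.StorageLevel

val memOnly1x = StorageLevel.MEMORY_ONLY                   // replication = 1
val memOnly2x = StorageLevel.MEMORY_ONLY_2                 // replication = 2
val memOnly3x = StorageLevel(false, true, false, true, 3)  // no named 3x constant
{code}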



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-9409) make-distribution.sh should copy all files in conf, so that it's easy to create a distro with custom configuration and property settings

2015-10-16 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9409?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen resolved SPARK-9409.
---
Resolution: Won't Fix

> make-distribution.sh should copy all files in conf, so that it's easy to 
> create a distro with custom configuration and property settings
> 
>
> Key: SPARK-9409
> URL: https://issues.apache.org/jira/browse/SPARK-9409
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 1.4.1
> Environment: MacOS, Linux
>Reporter: Dean Wampler
>Priority: Minor
>  Labels: easyfix
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> When using make-distribution.sh to build a custom distribution, it would be 
> nice to be able to drop custom configuration files in the conf directory and 
> have them included in the archive. Currently, only the *.template files are 
> included.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11155) Stage summary json should include stage duration

2015-10-16 Thread Xin Ren (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14961282#comment-14961282
 ] 

Xin Ren commented on SPARK-11155:
-

Hi, I'd like to have a try on this one. Thanks

> Stage summary json should include stage duration 
> -
>
> Key: SPARK-11155
> URL: https://issues.apache.org/jira/browse/SPARK-11155
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Reporter: Imran Rashid
>Priority: Minor
>  Labels: Starter
>
> The json endpoint for stages doesn't include information on the stage 
> duration that is present in the UI. This looks like a simple oversight; these 
> metrics should be included, e.g. at {{api/v1/applications//stages}}. The 
> missing metrics are {{submissionTime}} and {{completionTime}} (and whatever 
> other metrics come out of the discussion on SPARK-10930).
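
A hypothetical client-side sketch (field names assumed from the description, not the final API): once {{submissionTime}} and {{completionTime}} are exposed, the duration is just their difference:

{code}
case class StageSummary(submissionTime: Option[Long], completionTime: Option[Long]) {
  // duration in ms, defined only for stages that have both timestamps
  def duration: Option[Long] =
    for (s <- submissionTime; c <- completionTime) yield c - s
}
{code}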



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11127) Upgrade Kinesis Client Library to the latest stable version

2015-10-16 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11127?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11127:


Assignee: Tathagata Das  (was: Apache Spark)

> Upgrade Kinesis Client Library to the latest stable version
> ---
>
> Key: SPARK-11127
> URL: https://issues.apache.org/jira/browse/SPARK-11127
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Affects Versions: 1.6.0
>Reporter: Xiangrui Meng
>Assignee: Tathagata Das
>
> We use KCL 1.3.0 in the current master. KCL 1.4.0 added integration with 
> Kinesis Producer Library (KPL) and support for auto de-aggregation. It would be 
> great to upgrade KCL to the latest stable version.
> Note that the latest version is 1.6.1 and 1.6.0 restored compatibility with 
> dynamodb-streams-kinesis-adapter, which was broken in 1.4.0. See 
> https://github.com/awslabs/amazon-kinesis-client#release-notes.
> [~tdas] [~brkyvz] Please recommend a version for upgrade.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11127) Upgrade Kinesis Client Library to the latest stable version

2015-10-16 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14961308#comment-14961308
 ] 

Apache Spark commented on SPARK-11127:
--

User 'mengxr' has created a pull request for this issue:
https://github.com/apache/spark/pull/9153

> Upgrade Kinesis Client Library to the latest stable version
> ---
>
> Key: SPARK-11127
> URL: https://issues.apache.org/jira/browse/SPARK-11127
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Affects Versions: 1.6.0
>Reporter: Xiangrui Meng
>Assignee: Tathagata Das
>
> We use KCL 1.3.0 in the current master. KCL 1.4.0 added integration with 
> Kinesis Producer Library (KPL) and support for auto de-aggregation. It would be 
> great to upgrade KCL to the latest stable version.
> Note that the latest version is 1.6.1 and 1.6.0 restored compatibility with 
> dynamodb-streams-kinesis-adapter, which was broken in 1.4.0. See 
> https://github.com/awslabs/amazon-kinesis-client#release-notes.
> [~tdas] [~brkyvz] Please recommend a version for upgrade.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-11160) CloudPickleSerializer conflicts with xmlrunner

2015-10-16 Thread Gabor Liptak (JIRA)
Gabor Liptak created SPARK-11160:


 Summary: CloudPickleSerializer conflicts with xmlrunner
 Key: SPARK-11160
 URL: https://issues.apache.org/jira/browse/SPARK-11160
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Reporter: Gabor Liptak
Priority: Minor




Change L259 in pyspark/tests.py to:

{code}
    # Regression test for SPARK-3415
    def test_pickling_file_handles(self):
        # JIRA number here
        if xmlrunner is None:
            ser = CloudPickleSerializer()
            out1 = sys.stderr
            out2 = ser.loads(ser.dumps(out1))
            self.assertEqual(out1, out2)
{code}
The issue is that CloudPickleSerializer wraps stderr, which conflicts with 
xmlrunner. But it might take some time to fix.




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11153) Turns off Parquet filter push-down for string and binary columns

2015-10-16 Thread Cheng Lian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14961281#comment-14961281
 ] 

Cheng Lian commented on SPARK-11153:


Yes, it's the statistics information that is corrupted. And yes, Parquet does 
write version in the metadata. Parquet-mr 1.8 handles this issue in exactly the 
way you suggested, namely ignoring binary statistics when necessary according 
to version information written in the metadata.

However, Spark SQL performs filter push-down on the driver side. This means we 
need to gather Parquet versions from all Parquet files using a distributed Spark 
job. We could probably merge this into the job used to merge Spark schemata, but 
I think that is too risky for 1.5.2 at this stage. So I'd propose we simply 
disable filter push-down for strings and binaries in all cases until the 
parquet-mr upgrade.
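
Until that upgrade lands, a user-side workaround sketch (assuming the standard {{spark.sql.parquet.filterPushdown}} flag, which defaults to true in 1.5) is simply to turn push-down off:

{code}
// Disable Parquet filter push-down for this SQLContext until parquet-mr is upgraded
sqlContext.setConf("spark.sql.parquet.filterPushdown", "false")
{code}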

> Turns off Parquet filter push-down for string and binary columns
> 
>
> Key: SPARK-11153
> URL: https://issues.apache.org/jira/browse/SPARK-11153
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0, 1.5.1
>Reporter: Cheng Lian
>Assignee: Cheng Lian
>Priority: Blocker
>
> Due to PARQUET-251, {{BINARY}} columns in existing Parquet files may be 
> written with corrupted statistics information. This information is used by 
> filter push-down optimization. Since Spark 1.5 turns on Parquet filter 
> push-down by default, we may end up with wrong query results. PARQUET-251 has 
> been fixed in parquet-mr 1.8.1, but Spark 1.5 is still using 1.7.0.
> Note that this kind of corrupted Parquet file could be produced by any 
> Parquet data model.
> This affects all Spark SQL data types that can be mapped to Parquet 
> {{BINARY}}, namely:
> - {{StringType}}
> - {{BinaryType}}
> - {{DecimalType}} (but Spark SQL doesn't support pushing down {{DecimalType}} 
> columns for now.)
> To avoid wrong query results, we should disable filter push-down for columns 
> of {{StringType}} and {{BinaryType}} until we upgrade to parquet-mr 1.8.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11153) Turns off Parquet filter push-down for string and binary columns

2015-10-16 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11153?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11153:


Assignee: Apache Spark  (was: Cheng Lian)

> Turns off Parquet filter push-down for string and binary columns
> 
>
> Key: SPARK-11153
> URL: https://issues.apache.org/jira/browse/SPARK-11153
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0, 1.5.1
>Reporter: Cheng Lian
>Assignee: Apache Spark
>Priority: Blocker
>
> Due to PARQUET-251, {{BINARY}} columns in existing Parquet files may be 
> written with corrupted statistics information. This information is used by 
> filter push-down optimization. Since Spark 1.5 turns on Parquet filter 
> push-down by default, we may end up with wrong query results. PARQUET-251 has 
> been fixed in parquet-mr 1.8.1, but Spark 1.5 is still using 1.7.0.
> Note that this kind of corrupted Parquet file could be produced by any 
> Parquet data model.
> This affects all Spark SQL data types that can be mapped to Parquet 
> {{BINARY}}, namely:
> - {{StringType}}
> - {{BinaryType}}
> - {{DecimalType}} (but Spark SQL doesn't support pushing down {{DecimalType}} 
> columns for now.)
> To avoid wrong query results, we should disable filter push-down for columns 
> of {{StringType}} and {{BinaryType}} until we upgrade to parquet-mr 1.8.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9162) Implement code generation for ScalaUDF

2015-10-16 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14961335#comment-14961335
 ] 

Reynold Xin commented on SPARK-9162:


[~viirya] can you work on this? We can then close this umbrella ticket ..


> Implement code generation for ScalaUDF
> --
>
> Key: SPARK-9162
> URL: https://issues.apache.org/jira/browse/SPARK-9162
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-8100) Make able to refer lost executor log

2015-10-16 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8100?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen resolved SPARK-8100.
---
Resolution: Duplicate

This looks like a duplicate of SPARK-7729

> Make able to refer lost executor log
> 
>
> Key: SPARK-8100
> URL: https://issues.apache.org/jira/browse/SPARK-8100
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 1.3.1
>Reporter: SuYan
>Priority: Minor
>
> While the application is still running, the lost executor's info disappears 
> from the Spark UI.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11157) Allow Spark to be built without assemblies

2015-10-16 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11157?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin updated SPARK-11157:
---
Attachment: no-assemblies.pdf

> Allow Spark to be built without assemblies
> --
>
> Key: SPARK-11157
> URL: https://issues.apache.org/jira/browse/SPARK-11157
> Project: Spark
>  Issue Type: Umbrella
>  Components: Build, Spark Core, YARN
>Reporter: Marcelo Vanzin
> Attachments: no-assemblies.pdf
>
>
> For reasoning, discussion of pros and cons, and other more detailed 
> information, please see attached doc.
> The idea is to be able to build a Spark distribution that has just a 
> directory full of jars instead of the huge assembly files we currently have.
> Getting there requires changes in a bunch of places, I'll try to list the 
> ones I identified in the document, in the order that I think would be needed 
> to not break things:
> * make streaming backends not be assemblies
> Since people may depend on the current assembly artifacts in their 
> deployments, we can't really remove them; but we can make them dummy jars 
> and rely on dependency resolution to download all the jars.
> PySpark tests would also need some tweaking here.
> * make examples jar not be an assembly
> Probably requires tweaks to the {{run-example}} script. The location of the 
> examples jar would have to change (it won't be able to live in the same place 
> as the main Spark jars anymore).
> * update YARN backend to handle a directory full of jars when launching apps
> Currently YARN localizes the Spark assembly (depending on the user 
> configuration); it needs to be modified so that it can localize all needed 
> libraries instead of a single jar.
> * Modify the launcher library to handle the jars directory.
> This should be trivial.
> * Modify {{assembly/pom.xml}} to generate an assembly or a {{libs}} directory, 
> depending on which profile is enabled.
> We should keep the option to build with the assembly on by default, for 
> backwards compatibility, to give people time to prepare.
> Filing this bug as an umbrella; please file sub-tasks if you plan to work on 
> a specific part of the issue.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11127) Upgrade Kinesis Client Library to the latest stable version

2015-10-16 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11127?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11127:


Assignee: Apache Spark  (was: Tathagata Das)

> Upgrade Kinesis Client Library to the latest stable version
> ---
>
> Key: SPARK-11127
> URL: https://issues.apache.org/jira/browse/SPARK-11127
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Affects Versions: 1.6.0
>Reporter: Xiangrui Meng
>Assignee: Apache Spark
>
> We use KCL 1.3.0 in the current master. KCL 1.4.0 added integration with 
> Kinesis Producer Library (KPL) and support for auto de-aggregation. It would be 
> great to upgrade KCL to the latest stable version.
> Note that the latest version is 1.6.1 and 1.6.0 restored compatibility with 
> dynamodb-streams-kinesis-adapter, which was broken in 1.4.0. See 
> https://github.com/awslabs/amazon-kinesis-client#release-notes.
> [~tdas] [~brkyvz] Please recommend a version for upgrade.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11127) Upgrade Kinesis Client Library to the latest stable version

2015-10-16 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11127?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng reassigned SPARK-11127:
-

Assignee: Xiangrui Meng  (was: Tathagata Das)

> Upgrade Kinesis Client Library to the latest stable version
> ---
>
> Key: SPARK-11127
> URL: https://issues.apache.org/jira/browse/SPARK-11127
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Affects Versions: 1.6.0
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
>
> We use KCL 1.3.0 in the current master. KCL 1.4.0 added integration with 
> Kinesis Producer Library (KPL) and support for auto de-aggregation. It would be 
> great to upgrade KCL to the latest stable version.
> Note that the latest version is 1.6.1 and 1.6.0 restored compatibility with 
> dynamodb-streams-kinesis-adapter, which was broken in 1.4.0. See 
> https://github.com/awslabs/amazon-kinesis-client#release-notes.
> [~tdas] [~brkyvz] Please recommend a version for upgrade.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6859) Parquet File Binary column statistics error when reuse byte[] among rows

2015-10-16 Thread Cheng Lian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6859?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14961318#comment-14961318
 ] 

Cheng Lian commented on SPARK-6859:
---

This issue was left unresolved because Parquet filter push-down wasn't enabled 
by default. But now in 1.5, it's turned on by default. Opened SPARK-11153 to 
disable filter push-down for strings and binaries.

> Parquet File Binary column statistics error when reuse byte[] among rows
> 
>
> Key: SPARK-6859
> URL: https://issues.apache.org/jira/browse/SPARK-6859
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.2.0, 1.3.0, 1.4.0
>Reporter: Yijie Shen
>Priority: Minor
>
> Suppose I create a dataRDD which extends RDD\[Row\], and each row is 
> GenericMutableRow(Array(Int, Array\[Byte\])). The same Array\[Byte\] object is 
> reused among rows but has different content each time. When I convert it to a 
> dataFrame and save it as Parquet File, the file's row group statistic(max & 
> min) of Binary column would be wrong.
> \\
> \\
> Here is the reason: In Parquet, BinaryStatistic just keep max & min as 
> parquet.io.api.Binary references, Spark sql would generate a new Binary 
> backed by the same Array\[Byte\] passed from row.
>   
> | |reference| |backed| |  
> |max: Binary|-->|ByteArrayBackedBinary|-->|Array\[Byte\]|
> Therefore, each time parquet updating row group's statistic, max & min would 
> always refer to the same Array\[Byte\], which has new content each time. When 
> parquet decides to save it into the file, the last row's content would be saved 
> as both max & min.
> \\
> \\
> It seems to be a Parquet bug, because it's Parquet's responsibility to update 
> statistics correctly, but I'm not quite sure. Should I report it as a bug in 
> the Parquet JIRA?
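
A plain-Scala analogy of the aliasing problem described above (illustrative only, not the actual Parquet statistics code):

{code}
// "max" keeps a reference to a buffer that is mutated for every row, so it ends
// up reflecting the last row written rather than the largest value observed.
val buffer = new Array[Byte](1)
var max: Array[Byte] = null
for (c <- Seq('c', 'b', 'a')) {
  buffer(0) = c.toByte
  if (max == null || buffer(0) > max(0)) max = buffer  // stores a reference, not a copy
}
println(new String(max))  // prints "a" (the last row), although the true max is "c"
{code}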



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-8360) Streaming DataFrames

2015-10-16 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8360?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-8360:
---
Target Version/s:   (was: 1.6.0)

> Streaming DataFrames
> 
>
> Key: SPARK-8360
> URL: https://issues.apache.org/jira/browse/SPARK-8360
> Project: Spark
>  Issue Type: Umbrella
>  Components: SQL, Streaming
>Reporter: Reynold Xin
>
> Umbrella ticket to track what's needed to make streaming DataFrame a reality.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11153) Turns off Parquet filter push-down for string and binary columns

2015-10-16 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14961371#comment-14961371
 ] 

Felix Cheung commented on SPARK-11153:
--

So the corrupted stats data would still be a problem for future releases? How 
would it be handled then?


> Turns off Parquet filter push-down for string and binary columns
> 
>
> Key: SPARK-11153
> URL: https://issues.apache.org/jira/browse/SPARK-11153
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0, 1.5.1
>Reporter: Cheng Lian
>Assignee: Cheng Lian
>Priority: Blocker
>
> Due to PARQUET-251, {{BINARY}} columns in existing Parquet files may be 
> written with corrupted statistics information. This information is used by 
> filter push-down optimization. Since Spark 1.5 turns on Parquet filter 
> push-down by default, we may end up with wrong query results. PARQUET-251 has 
> been fixed in parquet-mr 1.8.1, but Spark 1.5 is still using 1.7.0.
> Note that this kind of corrupted Parquet file could be produced by any 
> Parquet data model.
> This affects all Spark SQL data types that can be mapped to Parquet 
> {{BINARY}}, namely:
> - {{StringType}}
> - {{BinaryType}}
> - {{DecimalType}} (but Spark SQL doesn't support pushing down {{DecimalType}} 
> columns for now.)
> To avoid wrong query results, we should disable filter push-down for columns 
> of {{StringType}} and {{BinaryType}} until we upgrade to parquet-mr 1.8.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10641) skewness and kurtosis support

2015-10-16 Thread Xiangrui Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14961396#comment-14961396
 ] 

Xiangrui Meng commented on SPARK-10641:
---

See attached PDF file.

> skewness and kurtosis support
> -
>
> Key: SPARK-10641
> URL: https://issues.apache.org/jira/browse/SPARK-10641
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, SQL
>Reporter: Jihong MA
>Assignee: Seth Hendrickson
> Attachments: simpler-moments.pdf
>
>
> Implementing skewness and kurtosis support based on following algorithm:
> https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Higher-order_statistics



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-11158) Add more information in Error statement for sql/types _verify_type()

2015-10-16 Thread Mahmoud Lababidi (JIRA)
Mahmoud Lababidi created SPARK-11158:


 Summary: Add more information in Error statement for sql/types _verify_type()
 Key: SPARK-11158
 URL: https://issues.apache.org/jira/browse/SPARK-11158
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Mahmoud Lababidi
Priority: Minor






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-10974) Add progress bar for output operation column and use red dots for failed batches

2015-10-16 Thread Tathagata Das (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10974?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das resolved SPARK-10974.
---
Resolution: Fixed

> Add progress bar for output operation column and use red dots for failed 
> batches
> 
>
> Key: SPARK-10974
> URL: https://issues.apache.org/jira/browse/SPARK-10974
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
> Fix For: 1.6.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-11104) A potential deadlock in StreamingContext.stop and stopOnShutdown

2015-10-16 Thread Tathagata Das (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11104?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das resolved SPARK-11104.
---
   Resolution: Fixed
 Assignee: Shixiong Zhu
Fix Version/s: 1.6.0
   1.5.2

> A potential deadlock in StreamingContext.stop and stopOnShutdown
> 
>
> Key: SPARK-11104
> URL: https://issues.apache.org/jira/browse/SPARK-11104
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
> Fix For: 1.5.2, 1.6.0
>
>
> When the shutdown hook of StreamingContext and StreamingContext.stop are 
> running at the same time (e.g., press CTRL-C when StreamingContext.stop is 
> running), the following deadlock may happen:
> {code}
> Java stack information for the threads listed above:
> ===
> "Thread-2":
>   at 
> org.apache.spark.streaming.StreamingContext.stop(StreamingContext.scala:699)
>   - waiting to lock <0x0005405a1680> (a 
> org.apache.spark.streaming.StreamingContext)
>   at 
> org.apache.spark.streaming.StreamingContext.org$apache$spark$streaming$StreamingContext$$stopOnShutdown(StreamingContext.scala:729)
>   at 
> org.apache.spark.streaming.StreamingContext$$anonfun$start$1.apply$mcV$sp(StreamingContext.scala:625)
>   at 
> org.apache.spark.util.SparkShutdownHook.run(ShutdownHookManager.scala:266)
>   at 
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(ShutdownHookManager.scala:236)
>   at 
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(ShutdownHookManager.scala:236)
>   at 
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(ShutdownHookManager.scala:236)
>   at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1697)
>   at 
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply$mcV$sp(ShutdownHookManager.scala:236)
>   at 
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(ShutdownHookManager.scala:236)
>   at 
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(ShutdownHookManager.scala:236)
>   at scala.util.Try$.apply(Try.scala:161)
>   at 
> org.apache.spark.util.SparkShutdownHookManager.runAll(ShutdownHookManager.scala:236)
>   - locked <0x0005405b6a00> (a 
> org.apache.spark.util.SparkShutdownHookManager)
>   at 
> org.apache.spark.util.SparkShutdownHookManager$$anon$2.run(ShutdownHookManager.scala:216)
>   at 
> org.apache.hadoop.util.ShutdownHookManager$1.run(ShutdownHookManager.java:54)
> "main":
>   at 
> org.apache.spark.util.SparkShutdownHookManager.remove(ShutdownHookManager.scala:248)
>   - waiting to lock <0x0005405b6a00> (a 
> org.apache.spark.util.SparkShutdownHookManager)
>   at 
> org.apache.spark.util.ShutdownHookManager$.removeShutdownHook(ShutdownHookManager.scala:199)
>   at 
> org.apache.spark.streaming.StreamingContext.stop(StreamingContext.scala:712)
>   - locked <0x0005405a1680> (a 
> org.apache.spark.streaming.StreamingContext)
>   at 
> org.apache.spark.streaming.StreamingContext.stop(StreamingContext.scala:684)
>   - locked <0x0005405a1680> (a 
> org.apache.spark.streaming.StreamingContext)
>   at 
> org.apache.spark.streaming.SessionByKeyBenchmark$.main(SessionByKeyBenchmark.scala:108)
>   at 
> org.apache.spark.streaming.SessionByKeyBenchmark.main(SessionByKeyBenchmark.scala)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:497)
>   at 
> org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:680)
>   at 
> org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
>   at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
>   at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:120)
>   at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> {code}
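
The lock-ordering pattern behind this trace, reduced to a generic sketch (hypothetical lock objects, not the real StreamingContext/ShutdownHookManager code):

{code}
object DeadlockSketch {
  private val contextLock = new Object // plays the role of the StreamingContext monitor
  private val hooksLock   = new Object // plays the role of SparkShutdownHookManager

  // main thread: locks the context first, then needs the hook manager to remove the hook
  def stopFromMain(): Unit = contextLock.synchronized {
    hooksLock.synchronized { /* removeShutdownHook */ }
  }

  // shutdown thread: locks the hook manager first, then calls stop() on the context
  def stopFromShutdownHook(): Unit = hooksLock.synchronized {
    contextLock.synchronized { /* StreamingContext.stop */ }
  }
}
{code}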



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-11109) move FsHistoryProvider off import org.apache.hadoop.fs.permission.AccessControlException

2015-10-16 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11109?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-11109.

   Resolution: Fixed
 Assignee: Glenn Weidner
Fix Version/s: 1.6.0

> move FsHistoryProvider off import 
> org.apache.hadoop.fs.permission.AccessControlException
> 
>
> Key: SPARK-11109
> URL: https://issues.apache.org/jira/browse/SPARK-11109
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.6.0
>Reporter: Steve Loughran
>Assignee: Glenn Weidner
>Priority: Minor
> Fix For: 1.6.0
>
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> {{FsHistoryProvider}} imports and uses 
> {{org.apache.hadoop.fs.permission.AccessControlException}}; this has been 
> superseded by its subclass 
> {{org.apache.hadoop.security.AccessControlException}} since ~2011. Moving to 
> that subclass would remove a deprecation warning and ensure that, were the 
> Hadoop team to remove the old class (as HADOOP-11356 has currently done to 
> trunk), everything would still compile and link.
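
The proposed change is essentially an import swap (sketch only; the surrounding FsHistoryProvider code is unchanged):

{code}
// before (deprecated):
// import org.apache.hadoop.fs.permission.AccessControlException
// after:
import org.apache.hadoop.security.AccessControlException
{code}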



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10641) skewness and kurtosis support

2015-10-16 Thread Seth Hendrickson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14961359#comment-14961359
 ] 

Seth Hendrickson commented on SPARK-10641:
--

[~mengxr] I am interested, do you mine providing it or a link to it?

> skewness and kurtosis support
> -
>
> Key: SPARK-10641
> URL: https://issues.apache.org/jira/browse/SPARK-10641
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, SQL
>Reporter: Jihong MA
>Assignee: Seth Hendrickson
> Attachments: simpler-moments.pdf
>
>
> Implementing skewness and kurtosis support based on following algorithm:
> https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Higher-order_statistics



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-11159) Nested SQL UDF raises java.lang.UnsupportedOperationException: Cannot evaluate expression

2015-10-16 Thread Jacob Wellington (JIRA)
Jacob Wellington created SPARK-11159:


 Summary: Nested SQL UDF raises 
java.lang.UnsupportedOperationException: Cannot evaluate expression
 Key: SPARK-11159
 URL: https://issues.apache.org/jira/browse/SPARK-11159
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.5.1
Reporter: Jacob Wellington


I'm running spark 1.5.1 and getting the following error: 
{{java.lang.UnsupportedOperationException: Cannot evaluate expression: 
PythonUDF#func_db_v1863()}} whenever I run a query like: {{SELECT 
func_format_v1863(func_db_v1863('')) as ds261_v1869 FROM df1}} after 
registering {{func_db_v1863}} and {{func_format_v1863}} as functions.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11153) Turns off Parquet filter push-down for string and binary columns

2015-10-16 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14961374#comment-14961374
 ] 

Felix Cheung commented on SPARK-11153:
--

Re-reading what you said, I think it makes sense. I assume it means that Spark 
1.6.x would handle it like parquet-mr 1.8, in that it would check the writer 
version and enable/disable push-down for string/binary columns.

> Turns off Parquet filter push-down for string and binary columns
> 
>
> Key: SPARK-11153
> URL: https://issues.apache.org/jira/browse/SPARK-11153
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0, 1.5.1
>Reporter: Cheng Lian
>Assignee: Cheng Lian
>Priority: Blocker
>
> Due to PARQUET-251, {{BINARY}} columns in existing Parquet files may be 
> written with corrupted statistics information. This information is used by 
> filter push-down optimization. Since Spark 1.5 turns on Parquet filter 
> push-down by default, we may end up with wrong query results. PARQUET-251 has 
> been fixed in parquet-mr 1.8.1, but Spark 1.5 is still using 1.7.0.
> Note that this kind of corrupted Parquet file could be produced by any 
> Parquet data model.
> This affects all Spark SQL data types that can be mapped to Parquet 
> {{BINARY}}, namely:
> - {{StringType}}
> - {{BinaryType}}
> - {{DecimalType}} (but Spark SQL doesn't support pushing down {{DecimalType}} 
> columns for now.)
> To avoid wrong query results, we should disable filter push-down for columns 
> of {{StringType}} and {{BinaryType}} until we upgrade to parquet-mr 1.8.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11050) PySpark SparseVector can return wrong index in error message

2015-10-16 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11050?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-11050:
--
Assignee: Bhargav Mangipudi

> PySpark SparseVector can return wrong index in error message
> 
>
> Key: SPARK-11050
> URL: https://issues.apache.org/jira/browse/SPARK-11050
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib, PySpark
>Affects Versions: 1.6.0
>Reporter: Joseph K. Bradley
>Assignee: Bhargav Mangipudi
>Priority: Trivial
>  Labels: starter
>
> PySpark {{SparseVector.__getitem__}} returns an error message if given a bad 
> index here:
> [https://github.com/apache/spark/blob/a16396df76cc27099011bfb96b28cbdd7f964ca8/python/pyspark/mllib/linalg/__init__.py#L770]
> But the index it complains about could have been modified (if negative), 
> meaning the index in the error message could be wrong.  This should be 
> corrected.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11087) spark.sql.orc.filterPushdown does not work, No ORC pushdown predicate

2015-10-16 Thread patcharee (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11087?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14960296#comment-14960296
 ] 

patcharee commented on SPARK-11087:
---

[~zhazhan]

Below is my test. Please check. I also tried changing 
"hive.exec.orc.split.strategy", but none of the settings produced the 
"OrcInputFormat [INFO] ORC pushdown predicate" output that you got.

{code}
case class Contact(name: String, phone: String)
case class Person(name: String, age: Int, contacts: Seq[Contact])
val records = (1 to 100).map { i =>
  Person(s"name_$i", i, (0 to 1).map { m => Contact(s"contact_$m", s"phone_$m") })
}
sqlContext.setConf("spark.sql.orc.filterPushdown", "true")
sc.parallelize(records).toDF().write.format("orc").partitionBy("age").save("peoplePartitioned")
val peoplePartitioned = sqlContext.read.format("orc").load("peoplePartitioned")
peoplePartitioned.registerTempTable("peoplePartitioned")
{code}

scala> sqlContext.setConf("hive.exec.orc.split.strategy", "ETL")
15/10/16 09:10:49 DEBUG VariableSubstitution: Substitution is on: ETL
15/10/16 09:10:49 DEBUG VariableSubstitution: Substitution is on: ETL
15/10/16 09:10:49 DEBUG VariableSubstitution: Substitution is on: ETL
15/10/16 09:10:49 DEBUG VariableSubstitution: Substitution is on: ETL

scala>  sqlContext.sql("SELECT * FROM peoplePartitioned WHERE age = 20 and name 
= 'name_20'").count
15/10/16 09:10:52 DEBUG VariableSubstitution: Substitution is on: SELECT * FROM 
peoplePartitioned WHERE age = 20 and name = 'name_20'
15/10/16 09:10:52 DEBUG VariableSubstitution: Substitution is on: SELECT * FROM 
peoplePartitioned WHERE age = 20 and name = 'name_20'
15/10/16 09:10:53 INFO PerfLogger: 
15/10/16 09:10:53 INFO PerfLogger: 
15/10/16 09:10:53 DEBUG OrcInputFormat: Number of buckets specified by conf 
file is 0
15/10/16 09:10:53 DEBUG OrcInputFormat: Number of buckets specified by conf 
file is 0
15/10/16 09:10:53 DEBUG AcidUtils: in directory 
hdfs://helmhdfs/user/patcharee/peoplePartitioned/age=20/part-r-00014-fb3d0874-db8b-40e7-9a4f-0e071c46f509.orc
 base = null deltas = 0
15/10/16 09:10:53 DEBUG AcidUtils: in directory 
hdfs://helmhdfs/user/patcharee/peoplePartitioned/age=20/part-r-00014-fb3d0874-db8b-40e7-9a4f-0e071c46f509.orc
 base = null deltas = 0
15/10/16 09:10:53 DEBUG OrcInputFormat: BISplitStrategy strategy for 
hdfs://helmhdfs/user/patcharee/peoplePartitioned/age=20/part-r-00014-fb3d0874-db8b-40e7-9a4f-0e071c46f509.orc
15/10/16 09:10:53 DEBUG OrcInputFormat: BISplitStrategy strategy for 
hdfs://helmhdfs/user/patcharee/peoplePartitioned/age=20/part-r-00014-fb3d0874-db8b-40e7-9a4f-0e071c46f509.orc
15/10/16 09:10:53 INFO OrcInputFormat: FooterCacheHitRatio: 0/0
15/10/16 09:10:53 INFO OrcInputFormat: FooterCacheHitRatio: 0/0
15/10/16 09:10:53 DEBUG OrcInputFormat: 
hdfs://helmhdfs/user/patcharee/peoplePartitioned/age=20/part-r-00014-fb3d0874-db8b-40e7-9a4f-0e071c46f509.orc:0+551
 projected_columns_uncompressed_size: -1
15/10/16 09:10:53 DEBUG OrcInputFormat: 
hdfs://helmhdfs/user/patcharee/peoplePartitioned/age=20/part-r-00014-fb3d0874-db8b-40e7-9a4f-0e071c46f509.orc:0+551
 projected_columns_uncompressed_size: -1
15/10/16 09:10:53 INFO PerfLogger: 
15/10/16 09:10:53 INFO PerfLogger: 
res5: Long = 1

scala> sqlContext.setConf("hive.exec.orc.split.strategy", "BI")
15/10/16 09:11:13 DEBUG VariableSubstitution: Substitution is on: BI
15/10/16 09:11:13 DEBUG VariableSubstitution: Substitution is on: BI
15/10/16 09:11:13 DEBUG VariableSubstitution: Substitution is on: BI
15/10/16 09:11:13 DEBUG VariableSubstitution: Substitution is on: BI

scala>  sqlContext.sql("SELECT * FROM peoplePartitioned WHERE age = 20 and name 
= 'name_20'").count
15/10/16 09:11:19 DEBUG VariableSubstitution: Substitution is on: SELECT * FROM 
peoplePartitioned WHERE age = 20 and name = 'name_20'
15/10/16 09:11:19 DEBUG VariableSubstitution: Substitution is on: SELECT * FROM 
peoplePartitioned WHERE age = 20 and name = 'name_20'
15/10/16 09:11:19 INFO PerfLogger: 
15/10/16 09:11:19 INFO PerfLogger: 
15/10/16 09:11:19 DEBUG OrcInputFormat: Number of buckets specified by conf 
file is 0
15/10/16 09:11:19 DEBUG OrcInputFormat: Number of buckets specified by conf 
file is 0
15/10/16 09:11:19 DEBUG AcidUtils: in directory 
hdfs://helmhdfs/user/patcharee/peoplePartitioned/age=20/part-r-00014-fb3d0874-db8b-40e7-9a4f-0e071c46f509.orc
 base = null deltas = 0
15/10/16 09:11:19 DEBUG AcidUtils: in directory 
hdfs://helmhdfs/user/patcharee/peoplePartitioned/age=20/part-r-00014-fb3d0874-db8b-40e7-9a4f-0e071c46f509.orc
 base = null deltas = 0
15/10/16 09:11:19 DEBUG OrcInputFormat: BISplitStrategy strategy for 
hdfs://helmhdfs/user/patcharee/peoplePartitioned/age=20/part-r-00014-fb3d0874-db8b-40e7-9a4f-0e071c46f509.orc
15/10/16 09:11:19 DEBUG OrcInputFormat: BISplitStrategy strategy for 
hdfs://helmhdfs/user/patcharee/peoplePartitioned/age=20/part-r-00014-fb3d0874-db8b-40e7-9a4f-0e071c46f509.orc

[jira] [Resolved] (SPARK-10974) Add progress bar for output operation column and use red dots for failed batches

2015-10-16 Thread Tathagata Das (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10974?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das resolved SPARK-10974.
---
   Resolution: Fixed
Fix Version/s: 1.6.0

> Add progress bar for output operation column and use red dots for failed 
> batches
> 
>
> Key: SPARK-10974
> URL: https://issues.apache.org/jira/browse/SPARK-10974
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
> Fix For: 1.6.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7271) Redesign shuffle interface for binary processing

2015-10-16 Thread Hong Shen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7271?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14960363#comment-14960363
 ] 

Hong Shen commented on SPARK-7271:
--

Hi, I have a question, are you plan to rededign the shuffle reader to implement 
binary processing? If so, when will you complete it?

> Redesign shuffle interface for binary processing
> 
>
> Key: SPARK-7271
> URL: https://issues.apache.org/jira/browse/SPARK-7271
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle, Spark Core
>Reporter: Reynold Xin
>Assignee: Josh Rosen
>
> Current shuffle interface is not exactly ideal for binary processing. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-10974) Add progress bar for output operation column and use red dots for failed batches

2015-10-16 Thread Tathagata Das (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10974?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das reassigned SPARK-10974:
-

Assignee: Tathagata Das

> Add progress bar for output operation column and use red dots for failed 
> batches
> 
>
> Key: SPARK-10974
> URL: https://issues.apache.org/jira/browse/SPARK-10974
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Reporter: Shixiong Zhu
>Assignee: Tathagata Das
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10974) Add progress bar for output operation column and use red dots for failed batches

2015-10-16 Thread Tathagata Das (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10974?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das updated SPARK-10974:
--
Assignee: Shixiong Zhu  (was: Tathagata Das)

> Add progress bar for output operation column and use red dots for failed 
> batches
> 
>
> Key: SPARK-10974
> URL: https://issues.apache.org/jira/browse/SPARK-10974
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-3950) Completed time is blank for some successful tasks

2015-10-16 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3950?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-3950.
--
Resolution: Cannot Reproduce

> Completed time is blank for some successful tasks
> -
>
> Key: SPARK-3950
> URL: https://issues.apache.org/jira/browse/SPARK-3950
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 1.1.1
>Reporter: Aaron Davidson
>
> In the Spark web UI, some tasks appear to have a blank Duration column. It's 
> possible that these ran for <.5 seconds, but if so, we should use 
> milliseconds like we do for GC time.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-10974) Add progress bar for output operation column and use red dots for failed batches

2015-10-16 Thread Tathagata Das (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10974?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das reopened SPARK-10974:
---

> Add progress bar for output operation column and use red dots for failed 
> batches
> 
>
> Key: SPARK-10974
> URL: https://issues.apache.org/jira/browse/SPARK-10974
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
> Fix For: 1.6.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-11145) Cannot filter using a partition key and another column

2015-10-16 Thread Julien Buret (JIRA)
Julien Buret created SPARK-11145:


 Summary: Cannot filter using a partition key and another column
 Key: SPARK-11145
 URL: https://issues.apache.org/jira/browse/SPARK-11145
 Project: Spark
  Issue Type: Bug
  Components: PySpark, SQL
Affects Versions: 1.5.1
Reporter: Julien Buret


A DataFrame loaded from partitioned Parquet files cannot be filtered by a 
predicate comparing a partition key and another column.
In this case all records are returned.

Example

{code}
from pyspark.sql import SQLContext

sqlContext = SQLContext(sc)
d = [
{'name': 'a', 'YEAR': 2015, 'year_2': 2014, 'statut': 'a'},
{'name': 'b', 'YEAR': 2014, 'year_2': 2014, 'statut': 'a'},
{'name': 'c', 'YEAR': 2013, 'year_2': 2011, 'statut': 'a'},
{'name': 'd', 'YEAR': 2014, 'year_2': 2013, 'statut': 'a'},
{'name': 'e', 'YEAR': 2016, 'year_2': 2017, 'statut': 'p'}
]

rdd = sc.parallelize(d)
df = sqlContext.createDataFrame(rdd)
df.write.partitionBy('YEAR').mode('overwrite').parquet('data')
df2 = sqlContext.read.parquet('data')
df2.filter(df2.YEAR == df2.year_2).show()
{code}

This returns:

{code}

++--+--++
|name|statut|year_2|YEAR|
++--+--++
|   d| a|  2013|2014|
|   b| a|  2014|2014|
|   c| a|  2011|2013|
|   e| p|  2017|2016|
|   a| a|  2014|2015|
++--+--++
{code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11139) Make SparkContext.stop() exception-safe

2015-10-16 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14960378#comment-14960378
 ] 

Sean Owen commented on SPARK-11139:
---

Yes please. StreamingContext probably needs a similar treatment: execute a 
series of functions that do part of the cleanup and ensure that an exception 
from one doesn't stop the rest from executing.
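
A minimal sketch of that pattern (illustrative names, not an actual Spark utility): run every cleanup block even if earlier ones throw, then surface the first failure:

{code}
def tryAll(steps: (() => Unit)*): Unit = {
  val errors = steps.flatMap { step =>
    try { step(); None } catch { case e: Exception => Some(e) }
  }
  errors.headOption.foreach(e => throw e) // rethrow the first failure after all steps ran
}

// usage (hypothetical cleanup steps):
// tryAll(() => stopScheduler(), () => stopUI(), () => removeShutdownHook())
{code}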

> Make SparkContext.stop() exception-safe
> ---
>
> Key: SPARK-11139
> URL: https://issues.apache.org/jira/browse/SPARK-11139
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.5.1
>Reporter: Felix Cheung
>Priority: Minor
>
> In SparkContext.stop(), when an exception is thrown the rest of the 
> stop/cleanup action is aborted.
> Work has been done in SPARK-4194 to allow cleanup after partial 
> initialization.
> There is a similar issue in StreamingContext: SPARK-11137.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-11137) Make StreamingContext.stop() exception-safe

2015-10-16 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11137?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-11137.
---
Resolution: Duplicate

If you don't mind, this is too logically related to SPARK-11139 to make 
separate JIRAs.

> Make StreamingContext.stop() exception-safe
> ---
>
> Key: SPARK-11137
> URL: https://issues.apache.org/jira/browse/SPARK-11137
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Affects Versions: 1.5.1
>Reporter: Felix Cheung
>Priority: Minor
>
> In StreamingContext.stop(), when an exception is thrown the rest of the 
> stop/cleanup action is aborted.
> Discussed in https://github.com/apache/spark/pull/9116,
> srowen commented
> Hm, this is getting unwieldy. There are several nested try blocks here. The 
> same argument goes for many of these methods -- if one fails should they not 
> continue trying? A more tidy solution would be to execute a series of () -> 
> Unit code blocks that perform some cleanup and make sure that they each fire 
> in succession, regardless of the others. The final one to remove the shutdown 
> hook could occur outside synchronization.
> I realize we're expanding the scope of the change here, but is it maybe 
> worthwhile to go all the way here?
> Really, something similar could be done for SparkContext and there's an 
> existing JIRA for it somewhere.
> At least, I'd prefer to either narrowly fix the deadlock here, or fix all of 
> the finally-related issue separately and all at once.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11143) SparkMesosDispatcher can not launch driver in docker

2015-10-16 Thread Klaus Ma (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14960420#comment-14960420
 ] 

Klaus Ma commented on SPARK-11143:
--

I addressed the issue with a new Docker image that carries more of the required 
environment; but I still suggest providing parameters to simplify the configuration.

> SparkMesosDispatcher can not launch driver in docker
> 
>
> Key: SPARK-11143
> URL: https://issues.apache.org/jira/browse/SPARK-11143
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos
>Affects Versions: 1.5.1
> Environment: Ubuntu 14.04
>Reporter: Klaus Ma
>
> I'm working on integration between Mesos & Spark. For now, I can start 
> SlaveMesosDispatcher in a Docker container, and I'd also like to run the Spark 
> executor in a Mesos Docker container. I use the following configuration for it, 
> but I get an error; any suggestions?
> Configuration:
> Spark: conf/spark-defaults.conf
> {code}
> spark.mesos.executor.docker.image   ubuntu
> spark.mesos.executor.docker.volumes  
> /usr/bin:/usr/bin,/usr/local/lib:/usr/local/lib,/usr/lib:/usr/lib,/lib:/lib,/home/test/workshop/spark:/root/spark
> spark.mesos.executor.home   /root/spark
> #spark.executorEnv.SPARK_HOME /root/spark
> spark.executorEnv.MESOS_NATIVE_LIBRARY   /usr/local/lib
> {code}
> NOTE: Spark is installed in /home/test/workshop/spark, and all 
> dependencies are installed.
> After submitting SparkPi to the dispatcher, the driver job starts but fails. 
> The error message is:
> {code}
> I1015 11:10:29.488456 18697 exec.cpp:134] Version: 0.26.0
> I1015 11:10:29.506619 18699 exec.cpp:208] Executor registered on slave 
> b7e24114-7585-40bc-879b-6a1188cb65b6-S1
> WARNING: Your kernel does not support swap limit capabilities, memory limited 
> without swap.
> /bin/sh: 1: ./bin/spark-submit: not found
> {code}
> Does anyone know how to map/set the Spark home in Docker for this case?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-11060) Fix some potential NPEs in DStream transformation

2015-10-16 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11060?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-11060.
---
   Resolution: Fixed
Fix Version/s: 1.6.0

Issue resolved by pull request 9070
[https://github.com/apache/spark/pull/9070]

> Fix some potential NPEs in DStream transformation
> -
>
> Key: SPARK-11060
> URL: https://issues.apache.org/jira/browse/SPARK-11060
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Affects Versions: 1.5.0, 1.5.1
>Reporter: Saisai Shao
>Priority: Minor
> Fix For: 1.6.0
>
>
> Guard out some potential NPEs when input stream returns None instead of empty 
> RDD.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11060) Fix some potential NPEs in DStream transformation

2015-10-16 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11060?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-11060:
--
Assignee: Saisai Shao

> Fix some potential NPEs in DStream transformation
> -
>
> Key: SPARK-11060
> URL: https://issues.apache.org/jira/browse/SPARK-11060
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Affects Versions: 1.5.0, 1.5.1
>Reporter: Saisai Shao
>Assignee: Saisai Shao
>Priority: Minor
> Fix For: 1.6.0
>
>
> Guard out some potential NPEs when input stream returns None instead of empty 
> RDD.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-11144) Add SparkLauncher for Spark Streaming, Spark SQL, etc

2015-10-16 Thread Yuhang Chen (JIRA)
Yuhang Chen created SPARK-11144:
---

 Summary: Add SparkLauncher for Spark Streaming, Spark SQL, etc
 Key: SPARK-11144
 URL: https://issues.apache.org/jira/browse/SPARK-11144
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core, SQL, Streaming
Affects Versions: 1.5.1
 Environment: Linux x64
Reporter: Yuhang Chen
Priority: Minor


Now we have org.apache.spark.launcher.SparkLauncher to launch Spark as a child 
process. However, it does not support other libs, such as Spark Streaming and 
Spark SQL.

What I'm looking for is a utility like spark-submit, with which you can submit 
jobs from any Spark lib to all supported resource managers (Standalone, YARN, 
Mesos, etc.) in Java/Scala code.
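
For context, a minimal sketch of how org.apache.spark.launcher.SparkLauncher is driven from Scala today; the jar path, main class, and master below are placeholders:
{code}
import org.apache.spark.launcher.SparkLauncher

// Launch a Spark application as a child process and wait for it to exit.
object LaunchExample {
  def main(args: Array[String]): Unit = {
    val process = new SparkLauncher()
      .setAppResource("/path/to/my-streaming-app.jar") // placeholder jar
      .setMainClass("com.example.MyStreamingJob")      // placeholder main class
      .setMaster("yarn-cluster")                       // or spark://..., mesos://...
      .setConf("spark.executor.memory", "2g")
      .launch()
    val exitCode = process.waitFor()
    println(s"Application finished with exit code $exitCode")
  }
}
{code}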



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-11143) SparkMesosDispatcher can not launch driver in docker

2015-10-16 Thread Klaus Ma (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14960420#comment-14960420
 ] 

Klaus Ma edited comment on SPARK-11143 at 10/16/15 9:24 AM:


I addressed the issue with a new Docker image which is more about the environment; I 
still suggest providing parameters to simplify the Docker configuration.


was (Author: klaus1982):
I addressed the issue with a new Docker image that carries more of the required 
environment; but I still suggest providing parameters to simplify the configuration.

> SparkMesosDispatcher can not launch driver in docker
> 
>
> Key: SPARK-11143
> URL: https://issues.apache.org/jira/browse/SPARK-11143
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos
>Affects Versions: 1.5.1
> Environment: Ubuntu 14.04
>Reporter: Klaus Ma
>
> I'm working on integration between Mesos & Spark. For now, I can start 
> SlaveMesosDispatcher in a Docker container, and I'd also like to run the Spark 
> executor in a Mesos Docker container. I use the following configuration for it, 
> but I get an error; any suggestions?
> Configuration:
> Spark: conf/spark-defaults.conf
> {code}
> spark.mesos.executor.docker.image   ubuntu
> spark.mesos.executor.docker.volumes  
> /usr/bin:/usr/bin,/usr/local/lib:/usr/local/lib,/usr/lib:/usr/lib,/lib:/lib,/home/test/workshop/spark:/root/spark
> spark.mesos.executor.home   /root/spark
> #spark.executorEnv.SPARK_HOME /root/spark
> spark.executorEnv.MESOS_NATIVE_LIBRARY   /usr/local/lib
> {code}
> NOTE: Spark is installed in /home/test/workshop/spark, and all 
> dependencies are installed.
> After submitting SparkPi to the dispatcher, the driver job starts but fails. 
> The error message is:
> {code}
> I1015 11:10:29.488456 18697 exec.cpp:134] Version: 0.26.0
> I1015 11:10:29.506619 18699 exec.cpp:208] Executor registered on slave 
> b7e24114-7585-40bc-879b-6a1188cb65b6-S1
> WARNING: Your kernel does not support swap limit capabilities, memory limited 
> without swap.
> /bin/sh: 1: ./bin/spark-submit: not found
> {code}
> Does anyone know how to map/set the Spark home in Docker for this case?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-10581) Groups are not resolved in scaladoc for org.apache.spark.sql.Column

2015-10-16 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10581?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10581:


Assignee: (was: Apache Spark)

> Groups are not resolved in scaladoc for org.apache.spark.sql.Column
> ---
>
> Key: SPARK-10581
> URL: https://issues.apache.org/jira/browse/SPARK-10581
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation
>Reporter: Jacek Laskowski
>Priority: Minor
>
> The Scala API documentation (scaladoc) for 
> [org.apache.spark.sql.Column|http://people.apache.org/~pwendell/spark-nightly/spark-master-docs/latest/api/scala/index.html#org.apache.spark.sql.Column]
>  does not resolve groups, and they appear unresolved like {{df_ops}}, 
> {{expr_ops}}, et al. instead of _DataFrame functions._, _Expression 
> operators._, et al.  
> BTW, 
> [DataFrame|http://people.apache.org/~pwendell/spark-nightly/spark-master-docs/latest/api/scala/index.html#org.apache.spark.sql.DataFrame]
>  and other classes in the 
> [org.apache.spark.sql|http://people.apache.org/~pwendell/spark-nightly/spark-master-docs/latest/api/scala/index.html#org.apache.spark.sql.package]
>  package seem fine.
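
If the root cause is simply that the {{-groups}} scaladoc flag is not passed for this project, the fix would be a one-line sbt setting along these lines (where exactly it belongs in Spark's build is an assumption):
{code}
// Enable resolution of @group / @groupname / @groupdesc annotations in scaladoc.
scalacOptions in (Compile, doc) ++= Seq("-groups")
{code}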



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10965) Optimize filesEqualRecursive

2015-10-16 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14960515#comment-14960515
 ] 

Sean Owen commented on SPARK-10965:
---

I'd like to resolve this, at least for now. I am not sure I see a way to 
optimize this without introducing significantly more complication. If it's not 
a major problem, I suspect it's not worth it.

> Optimize filesEqualRecursive
> 
>
> Key: SPARK-10965
> URL: https://issues.apache.org/jira/browse/SPARK-10965
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.5.2
>Reporter: Mark Grover
>Priority: Minor
>
> When we try to download dependencies, if there is a file at the destination 
> already, we compare if the files are equal (recursively, if they are 
> directories). For files, we compare their bytes. Now, these dependencies can 
> be jars and be really large, and byte-by-byte comparisons can be super slow.
> I think it'd be better to do a checksum.
> Here's the code in question:
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/util/Utils.scala#L500
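
A small sketch of the checksum-based comparison being suggested, in plain Scala rather than the actual Utils code:
{code}
import java.nio.file.{Files, Paths}
import java.security.MessageDigest

// Compare two files by SHA-256 digest instead of byte-by-byte comparison.
// (Reads whole files here for brevity; a streaming digest would avoid that.)
def sameChecksum(path1: String, path2: String): Boolean = {
  def digest(p: String): Seq[Byte] =
    MessageDigest.getInstance("SHA-256")
      .digest(Files.readAllBytes(Paths.get(p)))
      .toSeq
  digest(path1) == digest(path2)
}
{code}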



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-10581) Groups are not resolved in scaladoc for org.apache.spark.sql.Column

2015-10-16 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10581?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10581:


Assignee: Apache Spark

> Groups are not resolved in scaladoc for org.apache.spark.sql.Column
> ---
>
> Key: SPARK-10581
> URL: https://issues.apache.org/jira/browse/SPARK-10581
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation
>Reporter: Jacek Laskowski
>Assignee: Apache Spark
>Priority: Minor
>
> The Scala API documentation (scaladoc) for 
> [org.apache.spark.sql.Column|http://people.apache.org/~pwendell/spark-nightly/spark-master-docs/latest/api/scala/index.html#org.apache.spark.sql.Column]
>  does not resolve groups, and they appear unresolved like {{df_ops}}, 
> {{expr_ops}}, et al. instead of _DataFrame functions._, _Expression 
> operators._, et al.  
> BTW, 
> [DataFrame|http://people.apache.org/~pwendell/spark-nightly/spark-master-docs/latest/api/scala/index.html#org.apache.spark.sql.DataFrame]
>  and other classes in the 
> [org.apache.spark.sql|http://people.apache.org/~pwendell/spark-nightly/spark-master-docs/latest/api/scala/index.html#org.apache.spark.sql.package]
>  package seem fine.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-11092) Add source URLs to API documentation.

2015-10-16 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11092?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-11092.
---
   Resolution: Fixed
Fix Version/s: 1.6.0

Issue resolved by pull request 9110
[https://github.com/apache/spark/pull/9110]

> Add source URLs to API documentation.
> -
>
> Key: SPARK-11092
> URL: https://issues.apache.org/jira/browse/SPARK-11092
> Project: Spark
>  Issue Type: Documentation
>  Components: Build, Documentation
>Reporter: Jakob Odersky
>Assignee: Jakob Odersky
>Priority: Trivial
> Fix For: 1.6.0
>
>
> It would be nice to have source URLs in the Spark scaladoc, similar to the 
> standard library (e.g. 
> http://www.scala-lang.org/api/current/index.html#scala.collection.immutable.List).
> The fix should be really simple, just adding a line to the sbt unidoc 
> settings.
> I'll use the github repo url 
> bq. https://github.com/apache/spark/tree/v${version}/${FILE_PATH}
> Feel free to tell me if I should use something else as base url.
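
A rough sketch of the kind of unidoc setting involved, assuming the sbt-unidoc plugin and scaladoc's {{-doc-source-url}} option; the placeholder syntax and exact wiring into Spark's build are assumptions:
{code}
// Point each scaladoc page at the corresponding file on GitHub.
scalacOptions in (ScalaUnidoc, unidoc) ++= Seq(
  "-doc-source-url",
  s"https://github.com/apache/spark/tree/v${version.value}/€{FILE_PATH}.scala",
  "-sourcepath", (baseDirectory in ThisBuild).value.getAbsolutePath
)
{code}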



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10581) Groups are not resolved in scaladoc for org.apache.spark.sql.Column

2015-10-16 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14960513#comment-14960513
 ] 

Apache Spark commented on SPARK-10581:
--

User 'pravingadakh' has created a pull request for this issue:
https://github.com/apache/spark/pull/9148

> Groups are not resolved in scaladoc for org.apache.spark.sql.Column
> ---
>
> Key: SPARK-10581
> URL: https://issues.apache.org/jira/browse/SPARK-10581
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation
>Reporter: Jacek Laskowski
>Priority: Minor
>
> The Scala API documentation (scaladoc) for 
> [org.apache.spark.sql.Column|http://people.apache.org/~pwendell/spark-nightly/spark-master-docs/latest/api/scala/index.html#org.apache.spark.sql.Column]
>  does not resolve groups, and they appear unresolved like {{df_ops}}, 
> {{expr_ops}}, et al. instead of _DataFrame functions._, _Expression 
> operators._, et al.  
> BTW, 
> [DataFrame|http://people.apache.org/~pwendell/spark-nightly/spark-master-docs/latest/api/scala/index.html#org.apache.spark.sql.DataFrame]
>  and other classes in the 
> [org.apache.spark.sql|http://people.apache.org/~pwendell/spark-nightly/spark-master-docs/latest/api/scala/index.html#org.apache.spark.sql.package]
>  package seem fine.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-11146) missing or invalid dependency detected while loading class file 'RDDOperationScope.class

2015-10-16 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11146?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-11146.
---
Resolution: Cannot Reproduce

Since all tests are passing, this strongly suggests a problem local to your 
environment. Please run a clean build. We can reopen if you can show this happens 
on a fresh build from git, but then please give more info.

> missing or invalid dependency detected while loading class file 
> 'RDDOperationScope.class
> 
>
> Key: SPARK-11146
> URL: https://issues.apache.org/jira/browse/SPARK-11146
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.4.1
> Environment: hadoop 2.2.0 ubuntu,eclipse mars,scala 2.10.4
>Reporter: Veerendra Nath Jasthi
>
> I am getting error whenever trying to run the scala code in eclipse (MARS)
> ERROR:
> missing or invalid dependency detected while loading class file 
> 'RDDOperationScope.class 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11094) Test runner script fails to parse Java version.

2015-10-16 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-11094:
--
Assignee: Jakob Odersky

> Test runner script fails to parse Java version.
> ---
>
> Key: SPARK-11094
> URL: https://issues.apache.org/jira/browse/SPARK-11094
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
> Environment: Debian testing
>Reporter: Jakob Odersky
>Assignee: Jakob Odersky
>Priority: Minor
> Fix For: 1.5.2, 1.6.0
>
>
> Running {{dev/run-tests}} fails when the local Java version has an extra 
> string appended to the version.
> For example, in Debian Stretch (currently testing distribution), {{java 
> -version}} yields "1.8.0_66-internal" where the extra part "-internal" causes 
> the script to fail.
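
A sketch of a more tolerant parse, written in Scala here purely for illustration (the actual runner script is not Scala):
{code}
// Tolerate vendor suffixes such as "-internal" when extracting the Java version.
val versionPattern = """(\d+)\.(\d+)\.(\d+)""".r.unanchored

def majorMinor(raw: String): Option[(Int, Int)] = raw match {
  case versionPattern(major, minor, _) => Some((major.toInt, minor.toInt))
  case _                               => None
}

assert(majorMinor("1.8.0_66-internal") == Some((1, 8)))
assert(majorMinor("1.7.0_79") == Some((1, 7)))
{code}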



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-11147) HTTP 500 if try to access Spark UI in yarn-cluster

2015-10-16 Thread Sebastian YEPES FERNANDEZ (JIRA)
Sebastian YEPES FERNANDEZ created SPARK-11147:
-

 Summary: HTTP 500 if try to access Spark UI in yarn-cluster
 Key: SPARK-11147
 URL: https://issues.apache.org/jira/browse/SPARK-11147
 Project: Spark
  Issue Type: Bug
  Components: Web UI, YARN
Affects Versions: 1.5.1
 Environment: HDP: 2.3.2.0-2950 (Hadoop 2.7.1.2.3.2.0-2950)
Spark: 1.5.x (c27e1904)
Reporter: Sebastian YEPES FERNANDEZ


Hello,

I am facing a similar issue to the one described in SPARK-5837, but in my case the 
Spark UI only works in "yarn-client" mode. If I run the same job using 
"yarn-cluster" I get the HTTP 500 error:

{code}
HTTP ERROR 500

Problem accessing /proxy/application_1444297190346_0085/. Reason:
Connection to http://XX.XX.XX.XX:55827 refused

Caused by:

org.apache.http.conn.HttpHostConnectException: Connection to 
http://XX.XX.XX.XX:55827 refused
at 
org.apache.http.impl.conn.DefaultClientConnectionOperator.openConnection(DefaultClientConnectionOperator.java:190)
at 
org.apache.http.impl.conn.ManagedClientConnectionImpl.open(ManagedClientConnectionImpl.java:294)
at 
org.apache.http.impl.client.DefaultRequestDirector.tryConnect(DefaultRequestDirector.java:643)
at 
org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:479)
at 
org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:906)
{code}

I have verified that the UI port "55827" is actually listening on the worker 
node. I can even run "curl http://XX.XX.XX.XX:55827" and it redirects me to 
another URL: http://YY.YY.YY.YY:8088/proxy/application_1444297190346_0082

The strange thing is that it's redirecting me to the app "_0082" and not to the 
actually running job "_0085".


Does anyone have any suggestions on what could be causing this issue?






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-10965) Optimize filesEqualRecursive

2015-10-16 Thread Mark Grover (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Grover resolved SPARK-10965.
-
Resolution: Won't Fix

Thanks Sean. Marking this as Won't Fix since I don't think this is super 
important.

> Optimize filesEqualRecursive
> 
>
> Key: SPARK-10965
> URL: https://issues.apache.org/jira/browse/SPARK-10965
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.5.2
>Reporter: Mark Grover
>Priority: Minor
>
> When we try to download dependencies, if there is a file at the destination 
> already, we compare if the files are equal (recursively, if they are 
> directories). For files, we compare their bytes. Now, these dependencies can 
> be jars and be really large, and byte-by-byte comparisons can be super slow.
> I think it'd be better to do a checksum.
> Here's the code in question:
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/util/Utils.scala#L500



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10754) table and column name are case sensitive when json Dataframe was registered as tempTable using JavaSparkContext.

2015-10-16 Thread Rick Hillegas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14960802#comment-14960802
 ] 

Rick Hillegas commented on SPARK-10754:
---

Note that unquoted identifiers are case-insensitive in the SQL Standard. Thanks.

> table and column name are case sensitive when json Dataframe was registered 
> as tempTable using JavaSparkContext. 
> -
>
> Key: SPARK-10754
> URL: https://issues.apache.org/jira/browse/SPARK-10754
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.3.0, 1.3.1, 1.4.1
> Environment: Linux ,Hadoop Version 1.3
>Reporter: Babulal
>
> Create a dataframe using json data source 
>   SparkConf conf = new 
> SparkConf().setMaster("spark://xyz:7077").setAppName("Spark Tabble");
>   JavaSparkContext javacontext=new JavaSparkContext(conf);
>   SQLContext sqlContext=new SQLContext(javacontext);
>   
>   DataFrame df = 
> sqlContext.jsonFile("/user/root/examples/src/main/resources/people.json");
>   
>   df.registerTempTable("sparktable");
>   
>   Run the Query
>   
>   sqlContext.sql("select * from sparktable").show(); // this will PASS
>
>
>   sqlContext.sql("select * from sparkTable").show(); // this will FAIL
>   
>   java.lang.RuntimeException: Table Not Found: sparkTable
> at scala.sys.package$.error(package.scala:27)
> at 
> org.apache.spark.sql.catalyst.analysis.SimpleCatalog$$anonfun$1.apply(Catalog.scala:115)
> at 
> org.apache.spark.sql.catalyst.analysis.SimpleCatalog$$anonfun$1.apply(Catalog.scala:115)
> at scala.collection.MapLike$class.getOrElse(MapLike.scala:128)
> at scala.collection.AbstractMap.getOrElse(Map.scala:58)
> at 
> org.apache.spark.sql.catalyst.analysis.SimpleCatalog.lookupRelation(Catalog.scala:115)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$.getTable(Analyzer.scala:233)
>   
>   
>   



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-11148) Unable to create views

2015-10-16 Thread Lunen (JIRA)
Lunen created SPARK-11148:
-

 Summary: Unable to create views
 Key: SPARK-11148
 URL: https://issues.apache.org/jira/browse/SPARK-11148
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.5.1
 Environment: Ubuntu 14.04
Spark-1.5.1-bin-hadoop2.6
(I don't have Hadoop or Hive installed)
Start spark-all.sh and thriftserver with mysql jar driver
Reporter: Lunen
Priority: Critical


I am unable to create views within Spark SQL.
Creating tables without specifying the column names works, e.g.

CREATE TABLE trade2 
USING org.apache.spark.sql.jdbc
OPTIONS ( 
url "jdbc:mysql://192.168.30.191:3318/?user=root", 
dbtable "database.trade", 
driver "com.mysql.jdbc.Driver" 
);

Creating tables with data types gives an error:

CREATE TABLE trade2( 
COL1 timestamp, 
COL2 STRING, 
COL3 STRING) 
USING org.apache.spark.sql.jdbc 
OPTIONS (
  url "jdbc:mysql://192.168.30.191:3318/?user=root",   
  dbtable "database.trade",   
  driver "com.mysql.jdbc.Driver" 
);
Error: org.apache.spark.sql.AnalysisException: 
org.apache.spark.sql.execution.datasources.jdbc.DefaultSource does not allow 
user-specified schemas.; SQLState: null ErrorCode: 0

Trying to create a VIEW from the table that was created (the SELECT statement 
below returns data):
CREATE VIEW viewtrade as Select Col1 from trade2;

Error: org.apache.spark.sql.execution.QueryExecutionException: FAILED: 
SemanticException [Error 10004]: Line 1:30 Invalid table alias or column 
reference 'Col1': (possible column names are: col)
SQLState:  null
ErrorCode: 0



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9999) RDD-like API on top of Catalyst/DataFrame

2015-10-16 Thread Sandy Ryza (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14961513#comment-14961513
 ] 

Sandy Ryza commented on SPARK-:
---

So ClassTags would work for case classes and Avro specific records, but 
wouldn't work for tuples (or anywhere else types get erased).  Blrgh.  I wonder 
if the former is enough?  Tuples are pretty useful though.
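
A quick illustration of the erasure point, in plain Scala with nothing Spark-specific:
{code}
import scala.reflect.classTag

case class Person(name: String, age: Int)

// A ClassTag recovers the runtime class of a case class or Avro specific record...
println(classTag[Person].runtimeClass)        // class Person
// ...but for a tuple it only sees Tuple2: the element types (Int, String) are erased.
println(classTag[(Int, String)].runtimeClass) // class scala.Tuple2
{code}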

> RDD-like API on top of Catalyst/DataFrame
> -
>
> Key: SPARK-
> URL: https://issues.apache.org/jira/browse/SPARK-
> Project: Spark
>  Issue Type: Story
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Michael Armbrust
>
> The RDD API is very flexible, and as a result harder to optimize its 
> execution in some cases. The DataFrame API, on the other hand, is much easier 
> to optimize, but lacks some of the nice perks of the RDD API (e.g. harder to 
> use UDFs, lack of strong types in Scala/Java).
> The goal of Spark Datasets is to provide an API that allows users to easily 
> express transformations on domain objects, while also providing the 
> performance and robustness advantages of the Spark SQL execution engine.
> h2. Requirements
>  - *Fast* - In most cases, the performance of Datasets should be equal to or 
> better than working with RDDs.  Encoders should be as fast or faster than 
> Kryo and Java serialization, and unnecessary conversion should be avoided.
>  - *Typesafe* - Similar to RDDs, objects and functions that operate on those 
> objects should provide compile-time safety where possible.  When converting 
> from data where the schema is not known at compile-time (for example data 
> read from an external source such as JSON), the conversion function should 
> fail-fast if there is a schema mismatch.
>  - *Support for a variety of object models* - Default encoders should be 
> provided for a variety of object models: primitive types, case classes, 
> tuples, POJOs, JavaBeans, etc.  Ideally, objects that follow standard 
> conventions, such as Avro SpecificRecords, should also work out of the box.
>  - *Java Compatible* - Datasets should provide a single API that works in 
> both Scala and Java.  Where possible, shared types like Array will be used in 
> the API.  Where not possible, overloaded functions should be provided for 
> both languages.  Scala concepts, such as ClassTags should not be required in 
> the user-facing API.
>  - *Interoperates with DataFrames* - Users should be able to seamlessly 
> transition between Datasets and DataFrames, without specifying conversion 
> boiler-plate.  When names used in the input schema line-up with fields in the 
> given class, no extra mapping should be necessary.  Libraries like MLlib 
> should not need to provide different interfaces for accepting DataFrames and 
> Datasets as input.
> For a detailed outline of the complete proposed API: 
> [marmbrus/dataset-api|https://github.com/marmbrus/spark/pull/18/files]
> For an initial discussion of the design considerations in this API: [design 
> doc|https://docs.google.com/document/d/1ZVaDqOcLm2-NcS0TElmslHLsEIEwqzt0vBvzpLrV6Ik/edit#]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11070) Remove older releases on dist.apache.org

2015-10-16 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11070?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell reassigned SPARK-11070:
---

Assignee: Patrick Wendell

> Remove older releases on dist.apache.org
> 
>
> Key: SPARK-11070
> URL: https://issues.apache.org/jira/browse/SPARK-11070
> Project: Spark
>  Issue Type: Task
>  Components: Build
>Reporter: Sean Owen
>Assignee: Patrick Wendell
>Priority: Trivial
> Attachments: SPARK-11070.patch
>
>
> dist.apache.org should be periodically cleaned up such that it only includes 
> the latest releases in each active minor release branch. This is to reduce 
> load on mirrors. It can probably lose the 1.2.x releases at this point. In 
> total this would clean out 6 of the 9 releases currently mirrored at 
> https://dist.apache.org/repos/dist/release/spark/ 
> All releases are always archived at archive.apache.org and continue to be 
> available. The JS behind spark.apache.org/downloads.html needs to be updated 
> to point at archive.apache.org for older releases, then.
> There won't be a pull request for this as it's strictly an update to the site 
> hosted in SVN, and the files hosted by Apache.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11163) Remove unnecessary addPendingTask calls in TaskSetManager.executorLost

2015-10-16 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11163?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11163:


Assignee: Apache Spark  (was: Kay Ousterhout)

> Remove unnecessary addPendingTask calls in TaskSetManager.executorLost
> --
>
> Key: SPARK-11163
> URL: https://issues.apache.org/jira/browse/SPARK-11163
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler
>Reporter: Kay Ousterhout
>Assignee: Apache Spark
>Priority: Minor
> Fix For: 1.5.1, 1.5.2
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-10599) Decrease communication in BlockMatrix multiply and increase performance

2015-10-16 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley resolved SPARK-10599.
---
   Resolution: Fixed
Fix Version/s: 1.6.0

Issue resolved by pull request 8757
[https://github.com/apache/spark/pull/8757]

> Decrease communication in BlockMatrix multiply and increase performance
> ---
>
> Key: SPARK-10599
> URL: https://issues.apache.org/jira/browse/SPARK-10599
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Reporter: Burak Yavuz
>Assignee: Burak Yavuz
> Fix For: 1.6.0
>
>
> The BlockMatrix multiply sends each block to all the corresponding columns of 
> the right BlockMatrix, even though there might not be any corresponding block 
> to multiply with.
> Some optimizations we can perform are:
>  - Simulate the multiplication on the driver, and figure out which blocks 
> actually need to be shuffled
>  - Send the block once to a partition, and join inside the partition rather 
> than sending multiple copies to the same partition
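
A rough sketch of the "simulate on the driver" idea with made-up block coordinates (not the BlockMatrix implementation):
{code}
// Non-empty blocks of A and B, as (rowBlockIndex, colBlockIndex) pairs.
val aBlocks = Set((0, 0), (0, 2), (1, 1))
val bBlocks = Set((0, 0), (1, 0), (1, 3))

// C(i, j) sums A(i, k) * B(k, j); a block of A only needs to be shuffled if some
// block of B shares its inner index k, and vice versa.
val pairsToMultiply = for {
  (i, k)  <- aBlocks
  (k2, j) <- bBlocks
  if k == k2
} yield ((i, k), (k, j))

val neededABlocks = pairsToMultiply.map(_._1)
val neededBBlocks = pairsToMultiply.map(_._2)
println(s"ship ${neededABlocks.size} of ${aBlocks.size} A blocks, " +
  s"${neededBBlocks.size} of ${bBlocks.size} B blocks")
{code}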



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-11162) Allow enabling debug logging from the command line

2015-10-16 Thread Ryan Williams (JIRA)
Ryan Williams created SPARK-11162:
-

 Summary: Allow enabling debug logging from the command line
 Key: SPARK-11162
 URL: https://issues.apache.org/jira/browse/SPARK-11162
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.5.1
Reporter: Ryan Williams
Priority: Minor


Per [~vanzin] on [the user 
list|http://apache-spark-user-list.1001560.n3.nabble.com/how-to-set-log-level-of-spark-executor-on-YARN-using-yarn-cluster-mode-tp16528p16529.html],
 it would be nice if debug-logging could be enabled from the command line.
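
For context, the log level can already be changed programmatically on the driver (since Spark 1.4); the ask here is to expose the same knob on the command line. A small sketch:
{code}
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("debug-logging-demo").setMaster("local[*]"))
// Valid levels include ALL, DEBUG, INFO, WARN, ERROR, OFF.
sc.setLogLevel("DEBUG")
sc.parallelize(1 to 10).count()
sc.stop()
{code}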




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10877) Assertions fail straightforward DataFrame job due to word alignment

2015-10-16 Thread Davies Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14961505#comment-14961505
 ] 

Davies Liu commented on SPARK-10877:


This is already fixed in master and 1.5 branch.

> Assertions fail straightforward DataFrame job due to word alignment
> ---
>
> Key: SPARK-10877
> URL: https://issues.apache.org/jira/browse/SPARK-10877
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Matt Cheah
>Assignee: Davies Liu
> Attachments: SparkFilterByKeyTest.scala
>
>
> I have some code that I’m running in a unit test suite, but it is failing with 
> an assertion error.
> I have translated the failing JUnit test into a Scala script that I will attach 
> to the ticket. The assertion error is the following:
> {code}
> Exception in thread "main" org.apache.spark.SparkException: Job aborted due 
> to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: 
> Lost task 0.0 in stage 0.0 (TID 0, localhost): java.lang.AssertionError: 
> lengthInBytes must be a multiple of 8 (word-aligned)
> at 
> org.apache.spark.unsafe.hash.Murmur3_x86_32.hashUnsafeWords(Murmur3_x86_32.java:53)
> at 
> org.apache.spark.sql.catalyst.expressions.UnsafeArrayData.hashCode(UnsafeArrayData.java:289)
> at 
> org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow$class.hashCode(rows.scala:149)
> at 
> org.apache.spark.sql.catalyst.expressions.GenericMutableRow.hashCode(rows.scala:247)
> at org.apache.spark.HashPartitioner.getPartition(Partitioner.scala:85)
> at 
> org.apache.spark.sql.execution.Exchange$$anonfun$doExecute$1$$anonfun$4$$anonfun$apply$4.apply(Exchange.scala:180)
> at 
> org.apache.spark.sql.execution.Exchange$$anonfun$doExecute$1$$anonfun$4$$anonfun$apply$4.apply(Exchange.scala:180)
> at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
> {code}
> However, it turns out that this code actually works normally and computes the 
> correct result if assertions are turned off.
> I traced the code and found that when hashUnsafeWords was called, it was 
> given a byte-length of 12, which clearly is not a multiple of 8. However, the 
> job seems to compute correctly regardless of this fact. Of course, I can’t 
> just disable assertions for my unit test though.
> A few things we need to understand:
> 1. Why is the lengthInBytes of size 12?
> 2. Is it actually a problem that the byte length is not word-aligned? If so, 
> how should we fix the byte length? If it's not a problem, why is the 
> assertion flagging a false negative?
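
For reference, "word-aligned" here just means a multiple of 8 bytes; a tiny illustration of rounding a length up to the next word boundary (not the actual fix):
{code}
// Round a byte length up to the next multiple of 8 (one 64-bit word).
def roundUpToWord(lengthInBytes: Int): Int = (lengthInBytes + 7) & ~7

assert(roundUpToWord(12) == 16) // the 12-byte case from the assertion above
assert(roundUpToWord(16) == 16)
assert(roundUpToWord(1)  == 8)
{code}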



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11070) Remove older releases on dist.apache.org

2015-10-16 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11070?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14961515#comment-14961515
 ] 

Patrick Wendell commented on SPARK-11070:
-

I removed them - I did leave 1.5.0 for now, but we can remove it in a bit - 
just because 1.5.1 is so new.

{code}
svn rm https://dist.apache.org/repos/dist/release/spark/spark-1.1.1 -m "Remving 
Spark 1.1.1 release"
svn rm https://dist.apache.org/repos/dist/release/spark/spark-1.2.1 -m "Remving 
Spark 1.2.1 release"
svn rm https://dist.apache.org/repos/dist/release/spark/spark-1.2.2 -m "Remving 
Spark 1.2.2 release"
svn rm https://dist.apache.org/repos/dist/release/spark/spark-1.3.0 -m "Remving 
Spark 1.3.0 release"
svn rm https://dist.apache.org/repos/dist/release/spark/spark-1.4.0 -m "Remving 
Spark 1.4.0 release"
{code}

> Remove older releases on dist.apache.org
> 
>
> Key: SPARK-11070
> URL: https://issues.apache.org/jira/browse/SPARK-11070
> Project: Spark
>  Issue Type: Task
>  Components: Build
>Reporter: Sean Owen
>Assignee: Patrick Wendell
>Priority: Trivial
> Attachments: SPARK-11070.patch
>
>
> dist.apache.org should be periodically cleaned up such that it only includes 
> the latest releases in each active minor release branch. This is to reduce 
> load on mirrors. It can probably lose the 1.2.x releases at this point. In 
> total this would clean out 6 of the 9 releases currently mirrored at 
> https://dist.apache.org/repos/dist/release/spark/ 
> All releases are always archived at archive.apache.org and continue to be 
> available. The JS behind spark.apache.org/downloads.html needs to be updated 
> to point at archive.apache.org for older releases, then.
> There won't be a pull request for this as it's strictly an update to the site 
> hosted in SVN, and the files hosted by Apache.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-11070) Remove older releases on dist.apache.org

2015-10-16 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11070?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell resolved SPARK-11070.
-
Resolution: Fixed

> Remove older releases on dist.apache.org
> 
>
> Key: SPARK-11070
> URL: https://issues.apache.org/jira/browse/SPARK-11070
> Project: Spark
>  Issue Type: Task
>  Components: Build
>Reporter: Sean Owen
>Assignee: Patrick Wendell
>Priority: Trivial
> Attachments: SPARK-11070.patch
>
>
> dist.apache.org should be periodically cleaned up such that it only includes 
> the latest releases in each active minor release branch. This is to reduce 
> load on mirrors. It can probably lose the 1.2.x releases at this point. In 
> total this would clean out 6 of the 9 releases currently mirrored at 
> https://dist.apache.org/repos/dist/release/spark/ 
> All releases are always archived at archive.apache.org and continue to be 
> available. The JS behind spark.apache.org/downloads.html needs to be updated 
> to point at archive.apache.org for older releases, then.
> There won't be a pull request for this as it's strictly an update to the site 
> hosted in SVN, and the files hosted by Apache.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9999) RDD-like API on top of Catalyst/DataFrame

2015-10-16 Thread Michael Armbrust (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14961518#comment-14961518
 ] 

Michael Armbrust commented on SPARK-:
-

Yeah, I think tuples are a pretty important use case.  Perhaps more importantly 
though, I think having a concept of encoders instead of relying on JVM types 
future-proofs the API by giving us more control.  If you look closely at the 
test case examples, there are some pretty crazy macro examples (i.e., {{R(a = 
1, b = 2L)}}) where we actually create something like named tuples that generate, 
at compile time, the logic required to directly encode the user's results into 
Tungsten format without needing to allocate an intermediate object.
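
A deliberately simplified, hypothetical sketch of the encoder idea (not the proposed Spark API): the conversion between a user type and the engine's binary format becomes an explicit value that the engine controls, rather than something inferred from the JVM type alone.
{code}
import java.nio.ByteBuffer

// Hypothetical encoder: how to turn T into bytes and back is a first-class value.
trait Enc[T] {
  def toBytes(value: T): Array[Byte]
  def fromBytes(bytes: Array[Byte]): T
}

val intPairEnc: Enc[(Int, Int)] = new Enc[(Int, Int)] {
  def toBytes(v: (Int, Int)): Array[Byte] =
    ByteBuffer.allocate(8).putInt(v._1).putInt(v._2).array()
  def fromBytes(b: Array[Byte]): (Int, Int) = {
    val buf = ByteBuffer.wrap(b)
    (buf.getInt(), buf.getInt())
  }
}

assert(intPairEnc.fromBytes(intPairEnc.toBytes((1, 2))) == (1, 2))
{code}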

> RDD-like API on top of Catalyst/DataFrame
> -
>
> Key: SPARK-
> URL: https://issues.apache.org/jira/browse/SPARK-
> Project: Spark
>  Issue Type: Story
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Michael Armbrust
>
> The RDD API is very flexible, and as a result harder to optimize its 
> execution in some cases. The DataFrame API, on the other hand, is much easier 
> to optimize, but lacks some of the nice perks of the RDD API (e.g. harder to 
> use UDFs, lack of strong types in Scala/Java).
> The goal of Spark Datasets is to provide an API that allows users to easily 
> express transformations on domain objects, while also providing the 
> performance and robustness advantages of the Spark SQL execution engine.
> h2. Requirements
>  - *Fast* - In most cases, the performance of Datasets should be equal to or 
> better than working with RDDs.  Encoders should be as fast or faster than 
> Kryo and Java serialization, and unnecessary conversion should be avoided.
>  - *Typesafe* - Similar to RDDs, objects and functions that operate on those 
> objects should provide compile-time safety where possible.  When converting 
> from data where the schema is not known at compile-time (for example data 
> read from an external source such as JSON), the conversion function should 
> fail-fast if there is a schema mismatch.
>  - *Support for a variety of object models* - Default encoders should be 
> provided for a variety of object models: primitive types, case classes, 
> tuples, POJOs, JavaBeans, etc.  Ideally, objects that follow standard 
> conventions, such as Avro SpecificRecords, should also work out of the box.
>  - *Java Compatible* - Datasets should provide a single API that works in 
> both Scala and Java.  Where possible, shared types like Array will be used in 
> the API.  Where not possible, overloaded functions should be provided for 
> both languages.  Scala concepts, such as ClassTags should not be required in 
> the user-facing API.
>  - *Interoperates with DataFrames* - Users should be able to seamlessly 
> transition between Datasets and DataFrames, without specifying conversion 
> boiler-plate.  When names used in the input schema line-up with fields in the 
> given class, no extra mapping should be necessary.  Libraries like MLlib 
> should not need to provide different interfaces for accepting DataFrames and 
> Datasets as input.
> For a detailed outline of the complete proposed API: 
> [marmbrus/dataset-api|https://github.com/marmbrus/spark/pull/18/files]
> For an initial discussion of the design considerations in this API: [design 
> doc|https://docs.google.com/document/d/1ZVaDqOcLm2-NcS0TElmslHLsEIEwqzt0vBvzpLrV6Ik/edit#]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11163) Remove unnecessary addPendingTask calls in TaskSetManager.executorLost

2015-10-16 Thread Kay Ousterhout (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11163?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kay Ousterhout updated SPARK-11163:
---
Summary: Remove unnecessary addPendingTask calls in 
TaskSetManager.executorLost  (was: Remove unnecessary addPendingTask calls in 
TaskSetManager)

> Remove unnecessary addPendingTask calls in TaskSetManager.executorLost
> --
>
> Key: SPARK-11163
> URL: https://issues.apache.org/jira/browse/SPARK-11163
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler
>Reporter: Kay Ousterhout
>Assignee: Kay Ousterhout
>Priority: Minor
> Fix For: 1.5.1, 1.5.2
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-11163) Remove unnecessary addPendingTask calls in TaskSetManager

2015-10-16 Thread Kay Ousterhout (JIRA)
Kay Ousterhout created SPARK-11163:
--

 Summary: Remove unnecessary addPendingTask calls in TaskSetManager
 Key: SPARK-11163
 URL: https://issues.apache.org/jira/browse/SPARK-11163
 Project: Spark
  Issue Type: Improvement
  Components: Scheduler
Reporter: Kay Ousterhout
Assignee: Kay Ousterhout
Priority: Minor
 Fix For: 1.5.2, 1.5.1






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


