[jira] [Resolved] (SPARK-6536) Add IN to python Column

2015-03-26 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6536?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-6536.

      Resolution: Fixed
   Fix Version/s: 1.4.0
                  1.3.1
Target Version/s:   (was: 1.4.0)

> Add IN to python Column
> ---
>
> Key: SPARK-6536
> URL: https://issues.apache.org/jira/browse/SPARK-6536
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Michael Armbrust
>Assignee: Davies Liu
> Fix For: 1.3.1, 1.4.0
>
>
> In Scala you can check for membership in a set using the DSL function {{in}}:
> {code}
> df("column").in(lit(1), lit(2))
> {code}
> It would be nice to be able to do something similar in Python, possibly 
> without the lits (which we might revisit for Scala as well). /cc [~rxin]
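As a rough sketch of the kind of API being asked for (the helper name and signature below are illustrative assumptions, not the eventual API): a varargs variant on the Scala side could lift plain values to literal columns itself, so callers would not need to wrap each value in {{lit}}.

{code}
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.lit

// Hypothetical helper: accept raw values and lift them to literal columns
// before delegating to the existing in(...) DSL function.
def isIn(column: Column, values: Any*): Column =
  column.in(values.map(lit): _*)

// Usage: isIn(df("column"), 1, 2) instead of df("column").in(lit(1), lit(2))
{code}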






[jira] [Closed] (SPARK-6547) Missing import Files in InsertIntoHiveTableSuite.scala

2015-03-26 Thread Zhichao Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6547?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhichao  Zhang closed SPARK-6547.
-
Resolution: Duplicate

This is a duplicate of [SPARK-6546](https://issues.apache.org/jira/browse/SPARK-6546), 
so closing this.

> Missing import Files in InsertIntoHiveTableSuite.scala
> --
>
> Key: SPARK-6547
> URL: https://issues.apache.org/jira/browse/SPARK-6547
> Project: Spark
>  Issue Type: Test
>  Components: Tests
>Affects Versions: 1.4.0
>Reporter: Zhichao  Zhang
>Priority: Minor
>
> Missing import of {{Files}}; the build fails as follows:
> {quote}
> [error] 
> /opt/spark-master/sql/hive/src/test/scala/org/apache/spark/sql/hive/InsertIntoHiveTableSuite.scala:201:
>  
> not found: value Files
> [error] val tmpDir = Files.createTempDir()
> {quote}
> Use *val tmpDir = Utils.createTempDir()* instead.






[jira] [Assigned] (SPARK-6117) describe function for summary statistics

2015-03-26 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6117?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-6117:
---

Assignee: (was: Apache Spark)

> describe function for summary statistics
> 
>
> Key: SPARK-6117
> URL: https://issues.apache.org/jira/browse/SPARK-6117
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>  Labels: starter
>
> DataFrame.describe should return a DataFrame with summary statistics. 
> {code}
> def describe(cols: String*): DataFrame
> {code}
> If cols is empty, then run describe on all numeric columns.
> The returned DataFrame should have 5 rows (count, mean, stddev, min, max) and 
> n + 1 columns. The 1st column is the name of the aggregate function, and the 
> next n columns are the numeric columns of interest in the input DataFrame.
> Similar to Pandas (but removing percentile since accurate percentiles are too 
> expensive to compute for Big Data)
> {code}
> In [19]: df.describe()
> Out[19]: 
>   A B C D
> count  6.00  6.00  6.00  6.00
> mean   0.073711 -0.431125 -0.687758 -0.233103
> std0.843157  0.922818  0.779887  0.973118
> min   -0.861849 -2.104569 -1.509059 -1.135632
> max1.212112  0.567020  0.276232  1.071804
> {code}
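For intuition, the statistics above can already be computed per column with the existing aggregate functions; describe would essentially run such an aggregation over every numeric column and pivot the single result row into the one-row-per-statistic layout shown above. A minimal sketch, assuming a placeholder DataFrame {{df}} with a numeric column "A":

{code}
import org.apache.spark.sql.functions._

// One column's worth of the summary; a real describe() would do this for all
// numeric columns at once. stddev is not a built-in aggregate yet, see
// SPARK-6548 further down in this digest.
val summary = df.agg(count(col("A")), avg(col("A")), min(col("A")), max(col("A")))
{code}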






[jira] [Assigned] (SPARK-6117) describe function for summary statistics

2015-03-26 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6117?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-6117:
---

Assignee: Apache Spark

> describe function for summary statistics
> 
>
> Key: SPARK-6117
> URL: https://issues.apache.org/jira/browse/SPARK-6117
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Apache Spark
>  Labels: starter
>
> DataFrame.describe should return a DataFrame with summary statistics. 
> {code}
> def describe(cols: String*): DataFrame
> {code}
> If cols is empty, then run describe on all numeric columns.
> The returned DataFrame should have 5 rows (count, mean, stddev, min, max) and 
> n + 1 columns. The 1st column is the name of the aggregate function, and the 
> next n columns are the numeric columns of interest in the input DataFrame.
> Similar to Pandas (but removing percentile since accurate percentiles are too 
> expensive to compute for Big Data)
> {code}
> In [19]: df.describe()
> Out[19]: 
>   A B C D
> count  6.00  6.00  6.00  6.00
> mean   0.073711 -0.431125 -0.687758 -0.233103
> std0.843157  0.922818  0.779887  0.973118
> min   -0.861849 -2.104569 -1.509059 -1.135632
> max1.212112  0.567020  0.276232  1.071804
> {code}






[jira] [Commented] (SPARK-6117) describe function for summary statistics

2015-03-26 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14381527#comment-14381527
 ] 

Apache Spark commented on SPARK-6117:
-

User 'rxin' has created a pull request for this issue:
https://github.com/apache/spark/pull/5201

> describe function for summary statistics
> 
>
> Key: SPARK-6117
> URL: https://issues.apache.org/jira/browse/SPARK-6117
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>  Labels: starter
>
> DataFrame.describe should return a DataFrame with summary statistics. 
> {code}
> def describe(cols: String*): DataFrame
> {code}
> If cols is empty, then run describe on all numeric columns.
> The returned DataFrame should have 5 rows (count, mean, stddev, min, max) and 
> n + 1 columns. The 1st column is the name of the aggregate function, and the 
> next n columns are the numeric columns of interest in the input DataFrame.
> Similar to Pandas (but removing percentile since accurate percentiles are too 
> expensive to compute for Big Data)
> {code}
> In [19]: df.describe()
> Out[19]: 
>   A B C D
> count  6.00  6.00  6.00  6.00
> mean   0.073711 -0.431125 -0.687758 -0.233103
> std0.843157  0.922818  0.779887  0.973118
> min   -0.861849 -2.104569 -1.509059 -1.135632
> max1.212112  0.567020  0.276232  1.071804
> {code}






[jira] [Assigned] (SPARK-6521) executors in the same node read local shuffle file

2015-03-26 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6521?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-6521:
---

Assignee: Apache Spark

> executors in the same node read local shuffle file
> --
>
> Key: SPARK-6521
> URL: https://issues.apache.org/jira/browse/SPARK-6521
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle, Spark Core
>Affects Versions: 1.2.0
>Reporter: xukun
>Assignee: Apache Spark
>
> Previously, an executor read other executors' shuffle files on the same node 
> over the network. This PR makes executors on the same node read shuffle files 
> locally in sort-based shuffle, which reduces network transfer.






[jira] [Assigned] (SPARK-6521) executors in the same node read local shuffle file

2015-03-26 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6521?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-6521:
---

Assignee: (was: Apache Spark)

> executors in the same node read local shuffle file
> --
>
> Key: SPARK-6521
> URL: https://issues.apache.org/jira/browse/SPARK-6521
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle, Spark Core
>Affects Versions: 1.2.0
>Reporter: xukun
>
> Previously, an executor read other executors' shuffle files on the same node 
> over the network. This PR makes executors on the same node read shuffle files 
> locally in sort-based shuffle, which reduces network transfer.






[jira] [Updated] (SPARK-6548) Adding stddev to DataFrame functions

2015-03-26 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6548?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-6548:
---
Labels: DataFrame starter  (was: starter)

> Adding stddev to DataFrame functions
> 
>
> Key: SPARK-6548
> URL: https://issues.apache.org/jira/browse/SPARK-6548
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Reynold Xin
>  Labels: DataFrame, starter
> Fix For: 1.4.0
>
>
> Add it to the list of aggregate functions:
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/functions.scala
> Also add it to 
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/GroupedData.scala
> We can either add a Stddev Catalyst expression, or just compute it using 
> existing functions like here: 
> https://github.com/apache/spark/commit/5bbcd1304cfebba31ec6857a80d3825a40d02e83#diff-c3d0394b2fc08fb2842ff0362a5ac6c9R776






[jira] [Created] (SPARK-6548) Adding stddev to DataFrame functions

2015-03-26 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-6548:
--

 Summary: Adding stddev to DataFrame functions
 Key: SPARK-6548
 URL: https://issues.apache.org/jira/browse/SPARK-6548
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Reynold Xin


Add it to the list of aggregate functions:

https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/functions.scala

Also add it to 

https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/GroupedData.scala

We can either add a Stddev Catalyst expression, or just compute it using 
existing functions like here: 
https://github.com/apache/spark/commit/5bbcd1304cfebba31ec6857a80d3825a40d02e83#diff-c3d0394b2fc08fb2842ff0362a5ac6c9R776
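To illustrate the "existing functions" route mentioned above (a sketch only; the helper name is hypothetical and this is the population form of the standard deviation):

{code}
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions._

// Standard deviation composed from existing aggregates via sqrt(E[x^2] - E[x]^2).
// A dedicated Stddev Catalyst expression (or a sample n/(n-1) correction) would
// be the alternative discussed above.
def stddevSketch(c: Column): Column =
  sqrt(avg(c * c) - avg(c) * avg(c))

// Usage: df.agg(stddevSketch(col("value")))
{code}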







[jira] [Updated] (SPARK-6548) Adding stddev to DataFrame functions

2015-03-26 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6548?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-6548:
---
Fix Version/s: 1.4.0

> Adding stddev to DataFrame functions
> 
>
> Key: SPARK-6548
> URL: https://issues.apache.org/jira/browse/SPARK-6548
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Reynold Xin
>  Labels: DataFrame, starter
> Fix For: 1.4.0
>
>
> Add it to the list of aggregate functions:
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/functions.scala
> Also add it to 
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/GroupedData.scala
> We can either add a Stddev Catalyst expression, or just compute it using 
> existing functions like here: 
> https://github.com/apache/spark/commit/5bbcd1304cfebba31ec6857a80d3825a40d02e83#diff-c3d0394b2fc08fb2842ff0362a5ac6c9R776






[jira] [Commented] (SPARK-4587) Model export/import

2015-03-26 Thread zhangyouhua (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14381544#comment-14381544
 ] 

zhangyouhua commented on SPARK-4587:


“Sorry, this file is invalid so it cannot be displayed.”

Could you send it to me?

> Model export/import
> ---
>
> Key: SPARK-4587
> URL: https://issues.apache.org/jira/browse/SPARK-4587
> Project: Spark
>  Issue Type: Umbrella
>  Components: ML, MLlib
>Reporter: Xiangrui Meng
>Assignee: Joseph K. Bradley
>Priority: Critical
>
> This is an umbrella JIRA for one of the most requested features on the user 
> mailing list. Model export/import can be done via Java serialization. But it 
> doesn't work for models stored distributively, e.g., ALS and LDA. Ideally, we 
> should provide save/load methods to every model. PMML is an option but it has 
> its limitations. There are a couple of things we need to discuss: 1) data format, 
> 2) how to preserve partitioning, 3) data compatibility between versions and 
> language APIs, etc.
> UPDATE: [Design doc for model import/export | 
> https://docs.google.com/document/d/1kABFz1ssKJxLGMkboreSl3-I2CdLAOjNh5IQCrnDN3g/edit?usp=sharing]
> This document sketches machine learning model import/export plans, including 
> goals, an API, and development plans.
> UPDATE: As in the design doc, we plan to support:
> * Our own Spark-specific format.
> ** This is needed to (a) support distributed models and (b) get model 
> import/export support into Spark quickly (while avoiding the complexity of 
> PMML).
> * PMML
> ** This is needed since it is the most commonly used format in industry.
> This JIRA will be for the internal Spark-specific format described in the 
> design doc. Parallel JIRAs will cover PMML.






[jira] [Created] (SPARK-6549) Spark console logger logs to stderr by default

2015-03-26 Thread Pavel Sakun (JIRA)
Pavel Sakun created SPARK-6549:
--

 Summary: Spark console logger logs to stderr by default
 Key: SPARK-6549
 URL: https://issues.apache.org/jira/browse/SPARK-6549
 Project: Spark
  Issue Type: Bug
  Components: Input/Output
Affects Versions: 1.2.0
Reporter: Pavel Sakun
Priority: Trivial


Spark's console logger is configured to log messages at the INFO level to stderr, 
while it should log to stdout:
https://github.com/apache/spark/blob/master/core/src/main/resources/org/apache/spark/log4j-defaults.properties
https://github.com/apache/spark/blob/master/conf/log4j.properties.template
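For reference, the requested change amounts to flipping the console appender's target in the default log4j configuration. A sketch of the relevant log4j 1.x properties (written from memory of the ConsoleAppender conventions rather than copied from the linked files, so treat the exact keys as an assumption):

{code}
log4j.appender.console=org.apache.log4j.ConsoleAppender
# Currently the target is System.err; the request is to default INFO output to stdout.
log4j.appender.console.target=System.out
log4j.appender.console.layout=org.apache.log4j.PatternLayout
{code}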






[jira] [Created] (SPARK-6550) Add PreAnalyzer to keep logical plan consistent across DataFrame

2015-03-26 Thread Liang-Chi Hsieh (JIRA)
Liang-Chi Hsieh created SPARK-6550:
--

 Summary: Add PreAnalyzer to keep logical plan consistent across 
DataFrame
 Key: SPARK-6550
 URL: https://issues.apache.org/jira/browse/SPARK-6550
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Liang-Chi Hsieh


h2. Problems

In some cases, the expressions in a logical plan are replaced with new ones 
during analysis, e.g. when handling self-joins. If some expressions are then 
resolved against the analyzed plan, they refer to the changed expression ids, 
not the original ones.

But DataFrame transformations such as {{groupBy}} and aggregation use the 
logical plan to construct new DataFrames. So in such cases, the expressions in 
these DataFrames become inconsistent.

The problems are as follows:

# Expression ids in the logical plan can become inconsistent if they are changed 
during analysis and some expressions are resolved after that

When we run the following code:
{code}
val df = Seq(1,2,3).map(i => (i, i.toString)).toDF("int", "str")
val df2 = df.as('x).join(df.as('y), $"x.str" === 
$"y.str").groupBy("y.str").min("y.int")
{code}

Because {{groupBy}} and {{min}} resolve against the analyzed logical plan, their 
expression ids refer to the analyzed plan instead of the logical plan.

So the logical plan of {{df2}} looks like:

{code}
'Aggregate [str#5], [str#5,MIN(int#4) AS MIN(int)#6]
 'Join Inner, Some(('x.str = 'y.str))
  Subquery x
   Project [_1#0 AS int#2,_2#1 AS str#3]
LocalRelation [_1#0,_2#1], [[1,1],[2,2],[3,3]]
  Subquery y
   Project [_1#0 AS int#2,_2#1 AS str#3]
LocalRelation [_1#0,_2#1], [[1,1],[2,2],[3,3]]
{code}

As you can see, the expression ids in {{Aggregate}} differ from the expression 
ids in {{Subquery y}}. This is the first problem.

# {{df2}} cannot be executed

The logical plan of {{df2}} shown above cannot be executed. Because the expression 
ids of {{Subquery y}} are modified for self-join handling during analysis, 
the analyzed plan of {{df2}} becomes:

{code}
Aggregate [str#5], [str#5,MIN(int#4) AS MIN(int)#6]
 Join Inner, Some((str#3 = str#8))
  Subquery x
   Project [_1#0 AS int#2,_2#1 AS str#3]
LocalRelation [_1#0,_2#1], [[1,1],[2,2],[3,3]]
  Subquery y
   Project [_1#0 AS int#7,_2#1 AS str#8]
LocalRelation [_1#0,_2#1], [[1,1],[2,2],[3,3]]
{code}

The expressions referenced in {{Aggregate}} do not match those in 
{{Subquery y}}. This is the second problem.

h2. Proposed solution

We propose adding a PreAnalyzer. When a logical plan {{rawPlan}} is given to 
SQLContext, it uses the PreAnalyzer to modify the logical plan before assigning 
it to {{QueryExecution.logical}}. Later operations are then based on the 
pre-analyzed logical plan instead of the original {{rawPlan}}.
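A minimal sketch of the proposed wiring, based only on the description above; {{PreAnalyzer}} and the exact SQLContext/QueryExecution integration point are assumptions here, not existing Spark API (only {{LogicalPlan}} is):

{code}
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan

// Hypothetical: a pre-analysis pass applied once, up front.
trait PreAnalyzer {
  def apply(plan: LogicalPlan): LogicalPlan
}

// QueryExecution.logical would then hold the pre-analyzed plan, so groupBy,
// aggregation and later analysis all resolve against the same expression ids.
class QueryExecutionSketch(preAnalyzer: PreAnalyzer, rawPlan: LogicalPlan) {
  lazy val logical: LogicalPlan = preAnalyzer(rawPlan)
}
{code}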







[jira] [Updated] (SPARK-6550) Add PreAnalyzer to keep logical plan consistent across DataFrame

2015-03-26 Thread Liang-Chi Hsieh (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6550?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Liang-Chi Hsieh updated SPARK-6550:
---
Description: 
h2. Problems

In some cases, the expressions in a logical plan will be modified to new ones 
during analysis, e.g. the handling for self-join cases. If some expressions are 
resolved based on the analyzed plan, they are referring to changed expression 
ids, not original ids.

But the transformation of DataFrame will use logical plan to construct new 
DataFrame, e.g. {{groupBy}} and aggregation. So in such cases, the expressions 
in these DataFrames will be inconsistent.

The problems are specified as following:

# Expression ids in logical plan are possibly inconsistent if expression ids 
are changed during analysis and some expressions are resolved after that

When we try to run the following codes:
{code}
val df = Seq(1,2,3).map(i => (i, i.toString)).toDF("int", "str")
val df2 = df.as('x).join(df.as('y), $"x.str" === 
$"y.str").groupBy("y.str").min("y.int")
{code}

Because {{groupBy}} and {{min}} will perform resolving based on the analyzed 
logical plan, their expression ids refer to analyzed plan, instead of logical 
plan.

So the logical plan of df2 looks like:

{code}
'Aggregate [str#5], [str#5,MIN(int#4) AS MIN(int)#6]
 'Join Inner, Some(('x.str = 'y.str))
  Subquery x
   Project [_1#0 AS int#2,_2#1 AS str#3]
LocalRelation [_1#0,_2#1], [[1,1],[2,2],[3,3]]
  Subquery y
   Project [_1#0 AS int#2,_2#1 AS str#3]
LocalRelation [_1#0,_2#1], [[1,1],[2,2],[3,3]]
{code}

As you see, the expression ids in {{Aggregate}} are different to the expression 
ids in {{Subquery y}}. This is the first problem.

# The {{df2}} can't be performed

The showing logical plan of {{df2}} can't be performed. Because the expression 
ids of {{Subquery y}} will be modified for self-join handling during analysis, 
the analyzed plan of {{df2}} becomes:

{code}
Aggregate [str#5], [str#5,MIN(int#4) AS MIN(int)#6]
 Join Inner, Some((str#3 = str#8))
  Subquery x
   Project [_1#0 AS int#2,_2#1 AS str#3]
LocalRelation [_1#0,_2#1], [[1,1],[2,2],[3,3]]
  Subquery y
   Project [_1#0 AS int#7,_2#1 AS str#8]
LocalRelation [_1#0,_2#1], [[1,1],[2,2],[3,3]]
{code}

The expressions referred in {{Aggregate}} are not matching to these in 
{{Subquery y}}. This is the second problem.

h2. Proposed solution

We try to add a PreAnalyzer. When a logical plan {{rawPlan}} is given to 
SQLContext, it uses PreAnalyzer to modify the logical plan before assigning to 
{{QueryExecution.logical}}. Then later operations will based on the 
pre-analyzed logical plan, instead of the original {{rawPlan}}.


  was:
h2. Problems

In some cases, the expressions in a logical plan will be modified to new ones 
during analysis, e.g. the handling for self-join cases. If some expressions are 
resolved based on the analyzed plan, they are referring to changed expression 
ids, not original ids.

But the transformation of DataFrame will use logical plan to construct new 
DataFrame, e.g. `groupBy` and aggregation. So in such cases, the expressions in 
these DataFrames will be inconsistent.

The problems are specified as following:

# Expression ids in logical plan are possibly inconsistent if expression ids 
are changed during analysis and some expressions are resolved after that

When we try to run the following codes:
{code}
val df = Seq(1,2,3).map(i => (i, i.toString)).toDF("int", "str")
val df2 = df.as('x).join(df.as('y), $"x.str" === 
$"y.str").groupBy("y.str").min("y.int")
{code}

Because {{groupBy}} and {{min}} will perform resolving based on the analyzed 
logical plan, their expression ids refer to analyzed plan, instead of logical 
plan.

So the logical plan of df2 looks like:

{code}
'Aggregate [str#5], [str#5,MIN(int#4) AS MIN(int)#6]
 'Join Inner, Some(('x.str = 'y.str))
  Subquery x
   Project [_1#0 AS int#2,_2#1 AS str#3]
LocalRelation [_1#0,_2#1], [[1,1],[2,2],[3,3]]
  Subquery y
   Project [_1#0 AS int#2,_2#1 AS str#3]
LocalRelation [_1#0,_2#1], [[1,1],[2,2],[3,3]]
{code}

As you see, the expression ids in {{Aggregate}} are different to the expression 
ids in {{Subquery y}}. This is the first problem.

# The {{df2}} can't be performed

The showing logical plan of {{df2}} can't be performed. Because the expression 
ids of {{Subquery y}} will be modified for self-join handling during analysis, 
the analyzed plan of {{df2}} becomes:

{code}
Aggregate [str#5], [str#5,MIN(int#4) AS MIN(int)#6]
 Join Inner, Some((str#3 = str#8))
  Subquery x
   Project [_1#0 AS int#2,_2#1 AS str#3]
LocalRelation [_1#0,_2#1], [[1,1],[2,2],[3,3]]
  Subquery y
   Project [_1#0 AS int#7,_2#1 AS str#8]
LocalRelation [_1#0,_2#1], [[1,1],[2,2],[3,3]]
{code}

The expressions referred in {{Aggregate}} are not matching to these in 
{{Subquery y}}. This is the second problem.

h2. Proposed solution

We try to add a PreAnalyzer. When a logical plan {{ra

[jira] [Commented] (SPARK-6549) Spark console logger logs to stderr by default

2015-03-26 Thread Pavel Sakun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6549?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14381568#comment-14381568
 ] 

Pavel Sakun commented on SPARK-6549:


Pull request: https://github.com/apache/spark/pull/5202

> Spark console logger logs to stderr by default
> --
>
> Key: SPARK-6549
> URL: https://issues.apache.org/jira/browse/SPARK-6549
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 1.2.0
>Reporter: Pavel Sakun
>Priority: Trivial
>  Labels: log4j
>
> Spark's console logger is configured to log messages at the INFO level to 
> stderr, while it should log to stdout:
> https://github.com/apache/spark/blob/master/core/src/main/resources/org/apache/spark/log4j-defaults.properties
> https://github.com/apache/spark/blob/master/conf/log4j.properties.template






[jira] [Assigned] (SPARK-6549) Spark console logger logs to stderr by default

2015-03-26 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-6549:
---

Assignee: (was: Apache Spark)

> Spark console logger logs to stderr by default
> --
>
> Key: SPARK-6549
> URL: https://issues.apache.org/jira/browse/SPARK-6549
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 1.2.0
>Reporter: Pavel Sakun
>Priority: Trivial
>  Labels: log4j
>
> Spark's console logger is configured to log messages at the INFO level to 
> stderr, while it should log to stdout:
> https://github.com/apache/spark/blob/master/core/src/main/resources/org/apache/spark/log4j-defaults.properties
> https://github.com/apache/spark/blob/master/conf/log4j.properties.template






[jira] [Commented] (SPARK-6550) Add PreAnalyzer to keep logical plan consistent across DataFrame

2015-03-26 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6550?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14381571#comment-14381571
 ] 

Apache Spark commented on SPARK-6550:
-

User 'viirya' has created a pull request for this issue:
https://github.com/apache/spark/pull/5203

> Add PreAnalyzer to keep logical plan consistent across DataFrame
> 
>
> Key: SPARK-6550
> URL: https://issues.apache.org/jira/browse/SPARK-6550
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Liang-Chi Hsieh
>
> h2. Problems
> In some cases, the expressions in a logical plan will be modified to new ones 
> during analysis, e.g. the handling for self-join cases. If some expressions 
> are resolved based on the analyzed plan, they are referring to changed 
> expression ids, not original ids.
> But the transformation of DataFrame will use logical plan to construct new 
> DataFrame, e.g. {{groupBy}} and aggregation. So in such cases, the 
> expressions in these DataFrames will be inconsistent.
> The problems are specified as following:
> # Expression ids in logical plan are possibly inconsistent if expression ids 
> are changed during analysis and some expressions are resolved after that
> When we try to run the following codes:
> {code}
> val df = Seq(1,2,3).map(i => (i, i.toString)).toDF("int", "str")
> val df2 = df.as('x).join(df.as('y), $"x.str" === 
> $"y.str").groupBy("y.str").min("y.int")
> {code}
> Because {{groupBy}} and {{min}} will perform resolving based on the analyzed 
> logical plan, their expression ids refer to analyzed plan, instead of logical 
> plan.
> So the logical plan of df2 looks like:
> {code}
> 'Aggregate [str#5], [str#5,MIN(int#4) AS MIN(int)#6]
>  'Join Inner, Some(('x.str = 'y.str))
>   Subquery x
>Project [_1#0 AS int#2,_2#1 AS str#3]
> LocalRelation [_1#0,_2#1], [[1,1],[2,2],[3,3]]
>   Subquery y
>Project [_1#0 AS int#2,_2#1 AS str#3]
> LocalRelation [_1#0,_2#1], [[1,1],[2,2],[3,3]]
> {code}
> As you see, the expression ids in {{Aggregate}} are different to the 
> expression ids in {{Subquery y}}. This is the first problem.
> # The {{df2}} can't be performed
> The showing logical plan of {{df2}} can't be performed. Because the 
> expression ids of {{Subquery y}} will be modified for self-join handling 
> during analysis, the analyzed plan of {{df2}} becomes:
> {code}
> Aggregate [str#5], [str#5,MIN(int#4) AS MIN(int)#6]
>  Join Inner, Some((str#3 = str#8))
>   Subquery x
>Project [_1#0 AS int#2,_2#1 AS str#3]
> LocalRelation [_1#0,_2#1], [[1,1],[2,2],[3,3]]
>   Subquery y
>Project [_1#0 AS int#7,_2#1 AS str#8]
> LocalRelation [_1#0,_2#1], [[1,1],[2,2],[3,3]]
> {code}
> The expressions referred in {{Aggregate}} are not matching to these in 
> {{Subquery y}}. This is the second problem.
> h2. Proposed solution
> We try to add a PreAnalyzer. When a logical plan {{rawPlan}} is given to 
> SQLContext, it uses PreAnalyzer to modify the logical plan before assigning 
> to {{QueryExecution.logical}}. Then later operations will based on the 
> pre-analyzed logical plan, instead of the original {{rawPlan}}.






[jira] [Assigned] (SPARK-6550) Add PreAnalyzer to keep logical plan consistent across DataFrame

2015-03-26 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6550?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-6550:
---

Assignee: Apache Spark

> Add PreAnalyzer to keep logical plan consistent across DataFrame
> 
>
> Key: SPARK-6550
> URL: https://issues.apache.org/jira/browse/SPARK-6550
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Liang-Chi Hsieh
>Assignee: Apache Spark
>
> h2. Problems
> In some cases, the expressions in a logical plan will be modified to new ones 
> during analysis, e.g. the handling for self-join cases. If some expressions 
> are resolved based on the analyzed plan, they are referring to changed 
> expression ids, not original ids.
> But the transformation of DataFrame will use logical plan to construct new 
> DataFrame, e.g. {{groupBy}} and aggregation. So in such cases, the 
> expressions in these DataFrames will be inconsistent.
> The problems are specified as following:
> # Expression ids in logical plan are possibly inconsistent if expression ids 
> are changed during analysis and some expressions are resolved after that
> When we try to run the following codes:
> {code}
> val df = Seq(1,2,3).map(i => (i, i.toString)).toDF("int", "str")
> val df2 = df.as('x).join(df.as('y), $"x.str" === 
> $"y.str").groupBy("y.str").min("y.int")
> {code}
> Because {{groupBy}} and {{min}} will perform resolving based on the analyzed 
> logical plan, their expression ids refer to analyzed plan, instead of logical 
> plan.
> So the logical plan of df2 looks like:
> {code}
> 'Aggregate [str#5], [str#5,MIN(int#4) AS MIN(int)#6]
>  'Join Inner, Some(('x.str = 'y.str))
>   Subquery x
>Project [_1#0 AS int#2,_2#1 AS str#3]
> LocalRelation [_1#0,_2#1], [[1,1],[2,2],[3,3]]
>   Subquery y
>Project [_1#0 AS int#2,_2#1 AS str#3]
> LocalRelation [_1#0,_2#1], [[1,1],[2,2],[3,3]]
> {code}
> As you see, the expression ids in {{Aggregate}} are different to the 
> expression ids in {{Subquery y}}. This is the first problem.
> # The {{df2}} can't be performed
> The showing logical plan of {{df2}} can't be performed. Because the 
> expression ids of {{Subquery y}} will be modified for self-join handling 
> during analysis, the analyzed plan of {{df2}} becomes:
> {code}
> Aggregate [str#5], [str#5,MIN(int#4) AS MIN(int)#6]
>  Join Inner, Some((str#3 = str#8))
>   Subquery x
>Project [_1#0 AS int#2,_2#1 AS str#3]
> LocalRelation [_1#0,_2#1], [[1,1],[2,2],[3,3]]
>   Subquery y
>Project [_1#0 AS int#7,_2#1 AS str#8]
> LocalRelation [_1#0,_2#1], [[1,1],[2,2],[3,3]]
> {code}
> The expressions referred in {{Aggregate}} are not matching to these in 
> {{Subquery y}}. This is the second problem.
> h2. Proposed solution
> We try to add a PreAnalyzer. When a logical plan {{rawPlan}} is given to 
> SQLContext, it uses PreAnalyzer to modify the logical plan before assigning 
> to {{QueryExecution.logical}}. Then later operations will based on the 
> pre-analyzed logical plan, instead of the original {{rawPlan}}.






[jira] [Assigned] (SPARK-6550) Add PreAnalyzer to keep logical plan consistent across DataFrame

2015-03-26 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6550?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-6550:
---

Assignee: (was: Apache Spark)

> Add PreAnalyzer to keep logical plan consistent across DataFrame
> 
>
> Key: SPARK-6550
> URL: https://issues.apache.org/jira/browse/SPARK-6550
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Liang-Chi Hsieh
>
> h2. Problems
> In some cases, the expressions in a logical plan will be modified to new ones 
> during analysis, e.g. the handling for self-join cases. If some expressions 
> are resolved based on the analyzed plan, they are referring to changed 
> expression ids, not original ids.
> But the transformation of DataFrame will use logical plan to construct new 
> DataFrame, e.g. {{groupBy}} and aggregation. So in such cases, the 
> expressions in these DataFrames will be inconsistent.
> The problems are specified as following:
> # Expression ids in logical plan are possibly inconsistent if expression ids 
> are changed during analysis and some expressions are resolved after that
> When we try to run the following codes:
> {code}
> val df = Seq(1,2,3).map(i => (i, i.toString)).toDF("int", "str")
> val df2 = df.as('x).join(df.as('y), $"x.str" === 
> $"y.str").groupBy("y.str").min("y.int")
> {code}
> Because {{groupBy}} and {{min}} will perform resolving based on the analyzed 
> logical plan, their expression ids refer to analyzed plan, instead of logical 
> plan.
> So the logical plan of df2 looks like:
> {code}
> 'Aggregate [str#5], [str#5,MIN(int#4) AS MIN(int)#6]
>  'Join Inner, Some(('x.str = 'y.str))
>   Subquery x
>Project [_1#0 AS int#2,_2#1 AS str#3]
> LocalRelation [_1#0,_2#1], [[1,1],[2,2],[3,3]]
>   Subquery y
>Project [_1#0 AS int#2,_2#1 AS str#3]
> LocalRelation [_1#0,_2#1], [[1,1],[2,2],[3,3]]
> {code}
> As you see, the expression ids in {{Aggregate}} are different to the 
> expression ids in {{Subquery y}}. This is the first problem.
> # The {{df2}} can't be performed
> The showing logical plan of {{df2}} can't be performed. Because the 
> expression ids of {{Subquery y}} will be modified for self-join handling 
> during analysis, the analyzed plan of {{df2}} becomes:
> {code}
> Aggregate [str#5], [str#5,MIN(int#4) AS MIN(int)#6]
>  Join Inner, Some((str#3 = str#8))
>   Subquery x
>Project [_1#0 AS int#2,_2#1 AS str#3]
> LocalRelation [_1#0,_2#1], [[1,1],[2,2],[3,3]]
>   Subquery y
>Project [_1#0 AS int#7,_2#1 AS str#8]
> LocalRelation [_1#0,_2#1], [[1,1],[2,2],[3,3]]
> {code}
> The expressions referred in {{Aggregate}} are not matching to these in 
> {{Subquery y}}. This is the second problem.
> h2. Proposed solution
> We try to add a PreAnalyzer. When a logical plan {{rawPlan}} is given to 
> SQLContext, it uses PreAnalyzer to modify the logical plan before assigning 
> to {{QueryExecution.logical}}. Then later operations will based on the 
> pre-analyzed logical plan, instead of the original {{rawPlan}}.






[jira] [Assigned] (SPARK-6549) Spark console logger logs to stderr by default

2015-03-26 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-6549:
---

Assignee: Apache Spark

> Spark console logger logs to stderr by default
> --
>
> Key: SPARK-6549
> URL: https://issues.apache.org/jira/browse/SPARK-6549
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 1.2.0
>Reporter: Pavel Sakun
>Assignee: Apache Spark
>Priority: Trivial
>  Labels: log4j
>
> Spark's console logger is configured to log messages at the INFO level to 
> stderr, while it should log to stdout:
> https://github.com/apache/spark/blob/master/core/src/main/resources/org/apache/spark/log4j-defaults.properties
> https://github.com/apache/spark/blob/master/conf/log4j.properties.template






[jira] [Commented] (SPARK-6549) Spark console logger logs to stderr by default

2015-03-26 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6549?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14381570#comment-14381570
 ] 

Apache Spark commented on SPARK-6549:
-

User 'pavel-sakun' has created a pull request for this issue:
https://github.com/apache/spark/pull/5202

> Spark console logger logs to stderr by default
> --
>
> Key: SPARK-6549
> URL: https://issues.apache.org/jira/browse/SPARK-6549
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 1.2.0
>Reporter: Pavel Sakun
>Priority: Trivial
>  Labels: log4j
>
> Spark's console logger is configured to log messages at the INFO level to 
> stderr, while it should log to stdout:
> https://github.com/apache/spark/blob/master/core/src/main/resources/org/apache/spark/log4j-defaults.properties
> https://github.com/apache/spark/blob/master/conf/log4j.properties.template






[jira] [Updated] (SPARK-6550) Add PreAnalyzer to keep logical plan consistent across DataFrame

2015-03-26 Thread Liang-Chi Hsieh (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6550?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Liang-Chi Hsieh updated SPARK-6550:
---
Description: 
h2. Problems

In some cases, the expressions in a logical plan will be modified to new ones 
during analysis, e.g. the handling for self-join cases. If some expressions are 
resolved based on the analyzed plan, they are referring to changed expression 
ids, not original ids.

But the transformation of DataFrame will use logical plan to construct new 
DataFrame, e.g. {{groupBy}} and aggregation. So in such cases, the expressions 
in these DataFrames will be inconsistent.

The problems are specified as following:

# Expression ids in logical plan are possibly inconsistent if expression ids 
are changed during analysis and some expressions are resolved after that

When we try to run the following codes:
{code}
val df = Seq(1,2,3).map(i => (i, i.toString)).toDF("int", "str")
val df2 = df.as('x).join(df.as('y), $"x.str" === 
$"y.str").groupBy("y.str").min("y.int")
{code}

Because {{groupBy}} and {{min}} will perform resolving based on the analyzed 
logical plan, their expression ids refer to analyzed plan, instead of logical 
plan.

So the logical plan of df2 looks like:

{code}
'Aggregate [str#5], [str#5,MIN(int#4) AS MIN(int)#6]
 'Join Inner, Some(('x.str = 'y.str))
  Subquery x
   Project [_1#0 AS int#2,_2#1 AS str#3]
LocalRelation [_1#0,_2#1], [[1,1],[2,2],[3,3]]
  Subquery y
   Project [_1#0 AS int#2,_2#1 AS str#3]
LocalRelation [_1#0,_2#1], [[1,1],[2,2],[3,3]]
{code}

As you see, the expression ids in {{Aggregate}} are different to the expression 
ids in {{Subquery y}}. This is the first problem.

# The {{df2}} can't be performed

The showing logical plan of {{df2}} can't be performed. Because the expression 
ids of {{Subquery y}} will be modified for self-join handling during analysis, 
the analyzed plan of {{df2}} becomes:

{code}
Aggregate [str#5], [str#5,MIN(int#4) AS MIN(int)#6]
 Join Inner, Some((str#3 = str#8))
  Subquery x
   Project [_1#0 AS int#2,_2#1 AS str#3]
LocalRelation [_1#0,_2#1], [[1,1],[2,2],[3,3]]
  Subquery y
   Project [_1#0 AS int#7,_2#1 AS str#8]
LocalRelation [_1#0,_2#1], [[1,1],[2,2],[3,3]]
{code}

The expressions referred in {{Aggregate}} are not matching to these in 
{{Subquery y}}. This is the second problem.

h2. Proposed solution

We try to add a {{PreAnalyzer}}. When a logical plan {{rawPlan}} is given to 
SQLContext, it uses PreAnalyzer to modify the logical plan before assigning to 
{{QueryExecution.logical}}. Then later operations will based on the 
pre-analyzed logical plan, instead of the original {{rawPlan}}.


  was:
h2. Problems

In some cases, the expressions in a logical plan will be modified to new ones 
during analysis, e.g. the handling for self-join cases. If some expressions are 
resolved based on the analyzed plan, they are referring to changed expression 
ids, not original ids.

But the transformation of DataFrame will use logical plan to construct new 
DataFrame, e.g. {{groupBy}} and aggregation. So in such cases, the expressions 
in these DataFrames will be inconsistent.

The problems are specified as following:

# Expression ids in logical plan are possibly inconsistent if expression ids 
are changed during analysis and some expressions are resolved after that

When we try to run the following codes:
{code}
val df = Seq(1,2,3).map(i => (i, i.toString)).toDF("int", "str")
val df2 = df.as('x).join(df.as('y), $"x.str" === 
$"y.str").groupBy("y.str").min("y.int")
{code}

Because {{groupBy}} and {{min}} will perform resolving based on the analyzed 
logical plan, their expression ids refer to analyzed plan, instead of logical 
plan.

So the logical plan of df2 looks like:

{code}
'Aggregate [str#5], [str#5,MIN(int#4) AS MIN(int)#6]
 'Join Inner, Some(('x.str = 'y.str))
  Subquery x
   Project [_1#0 AS int#2,_2#1 AS str#3]
LocalRelation [_1#0,_2#1], [[1,1],[2,2],[3,3]]
  Subquery y
   Project [_1#0 AS int#2,_2#1 AS str#3]
LocalRelation [_1#0,_2#1], [[1,1],[2,2],[3,3]]
{code}

As you see, the expression ids in {{Aggregate}} are different to the expression 
ids in {{Subquery y}}. This is the first problem.

# The {{df2}} can't be performed

The showing logical plan of {{df2}} can't be performed. Because the expression 
ids of {{Subquery y}} will be modified for self-join handling during analysis, 
the analyzed plan of {{df2}} becomes:

{code}
Aggregate [str#5], [str#5,MIN(int#4) AS MIN(int)#6]
 Join Inner, Some((str#3 = str#8))
  Subquery x
   Project [_1#0 AS int#2,_2#1 AS str#3]
LocalRelation [_1#0,_2#1], [[1,1],[2,2],[3,3]]
  Subquery y
   Project [_1#0 AS int#7,_2#1 AS str#8]
LocalRelation [_1#0,_2#1], [[1,1],[2,2],[3,3]]
{code}

The expressions referred in {{Aggregate}} are not matching to these in 
{{Subquery y}}. This is the second problem.

h2. Proposed solution

We try to add a PreAnalyzer. When a logical pla

[jira] [Commented] (SPARK-6480) histogram() bucket function is wrong in some simple edge cases

2015-03-26 Thread Frank Rosner (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14381588#comment-14381588
 ] 

Frank Rosner commented on SPARK-6480:
-

[~srowen] will do today!

> histogram() bucket function is wrong in some simple edge cases
> --
>
> Key: SPARK-6480
> URL: https://issues.apache.org/jira/browse/SPARK-6480
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.3.0
>Reporter: Sean Owen
>Assignee: Sean Owen
>
> (Credit to a customer report here) This test would fail now: 
> {code}
> val rdd = sc.parallelize(Seq(1, 1, 1, 2, 3, 3))
> assert(Array(3, 1, 2) === rdd.map(_.toDouble).histogram(3)._2)
> {code}
> Because it returns 3, 1, 0. The problem ultimately traces to the 'fast' 
> bucket function that judges buckets based on a multiple of the gap between 
> first and second elements. Errors multiply and the end of the final bucket 
> fails to include the max.
> Fairly plausible use case actually.
> This can be tightened up easily with a slightly better expression. It will 
> also fix this test, which is actually expecting the wrong answer:
> {code}
> val rdd = sc.parallelize(6 to 99)
> val (histogramBuckets, histogramResults) = rdd.histogram(9)
> val expectedHistogramResults =
>   Array(11, 10, 11, 10, 10, 11, 10, 10, 11)
> {code}
> (Should be {{Array(11, 10, 10, 11, 10, 10, 11, 10, 11)}})
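To make the expected arithmetic concrete, here is a worked version of the first failing test above, under the usual convention that the buckets are evenly spaced over [min, max] and the last bucket is closed on the right:

{code}
// min = 1.0, max = 3.0, 3 buckets -> boundaries 1.0, 1.6667, 2.3333, 3.0
//   [1.0, 1.6667)    -> three 1s
//   [1.6667, 2.3333) -> one 2
//   [2.3333, 3.0]    -> two 3s   (the max itself must land in the last bucket)
// Expected counts: Array(3, 1, 2). The reported Array(3, 1, 0) drops the two 3s
// because the 'fast' lookup derives the bucket from a multiple of the first gap,
// and the accumulated floating-point error places 3.0 outside the final bucket.
val boundaries = (0 to 3).map(i => 1.0 + i * (3.0 - 1.0) / 3)
{code}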






[jira] [Created] (SPARK-6551) Incorrect aggregate results if op(...) mutates first argument

2015-03-26 Thread Jarno Seppanen (JIRA)
Jarno Seppanen created SPARK-6551:
-

 Summary: Incorrect aggregate results if op(...) mutates first 
argument
 Key: SPARK-6551
 URL: https://issues.apache.org/jira/browse/SPARK-6551
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 1.3.0
 Environment: Amazon EMR, AMI version 3.5
Reporter: Jarno Seppanen


Python RDD.aggregate method doesn't match its documentation w.r.t. seqOp or 
combOp mutating their first argument.

* the results are incorrect if seqOp mutates its first argument
* the zero value is modified if combOp mutates its first argument

I'm seeing the following behavior:

{code}
def inc_mutate(counter, item):
counter[0] += 1
return counter

def inc_pure(counter, item):
return [counter[0] + 1]

def merge_mutate(c1, c2):
c1[0] += c2[0]
return c1

def merge_pure(c1, c2):
return [c1[0] + c2[0]]

# correct answer, when neither function mutates their arguments
init = [0]
sc.parallelize(range(10)).aggregate(init, inc_pure, merge_pure)
# [10]
init
# [0]

# incorrect answer if seqOp mutates its first argument
init = [0]
sc.parallelize(range(10)).aggregate(init, inc_mutate, merge_pure)
# [20] <- WRONG
init
# [0]

# zero value is modified if combOp mutates its first argument
init = [0]
sc.parallelize(range(10)).aggregate(init, inc_pure, merge_mutate)
# [10]
init
# [10] <- funny behavior (though not documented)

# for completeness
init = [0]
sc.parallelize(range(10)).aggregate(init, inc_mutate, merge_mutate)
# [20]
init
# [20]
{code}






[jira] [Updated] (SPARK-6546) Using the wrong code that will make spark compile failed!!

2015-03-26 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6546?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-6546:
--
Assignee: DoingDone9

> Using the wrong code that will make spark compile failed!! 
> ---
>
> Key: SPARK-6546
> URL: https://issues.apache.org/jira/browse/SPARK-6546
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Reporter: DoingDone9
>Assignee: DoingDone9
>
> Wrong code: {{val tmpDir = Files.createTempDir()}}
> It should use {{Utils}} instead of {{Files}}.
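Spelled out, the fix being described is just (sketch; the {{Utils}} package path is given from memory):

{code}
// Before (relies on Guava's Files, which is no longer imported):
//   val tmpDir = Files.createTempDir()
// After: use Spark's own utility instead.
import org.apache.spark.util.Utils

val tmpDir = Utils.createTempDir()
{code}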






[jira] [Created] (SPARK-6552) expose start-slave.sh to user and update outdated doc

2015-03-26 Thread Tao Wang (JIRA)
Tao Wang created SPARK-6552:
---

 Summary: expose start-slave.sh to user and update outdated doc
 Key: SPARK-6552
 URL: https://issues.apache.org/jira/browse/SPARK-6552
 Project: Spark
  Issue Type: Improvement
  Components: Deploy
Reporter: Tao Wang
Priority: Minor


It would be better to expose start-slave.sh to users, to allow starting a worker 
on a single node.

Since the documentation currently describes starting a worker in the foreground, 
I also changed it to the background way (using start-slave.sh).






[jira] [Updated] (SPARK-6551) Incorrect aggregate results if seqOp(...) mutates its first argument

2015-03-26 Thread Jarno Seppanen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6551?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jarno Seppanen updated SPARK-6551:
--
Description: 
Python RDD.aggregate method doesn't match its documentation w.r.t. seqOp 
mutating its first argument.

* the results are incorrect if seqOp mutates its first argument
* additionally, the zero value is modified if combOp mutates its first argument 
(this is slightly surprising, would be nice to document)

I'm aggregating the RDD into a nontrivial data structure, and it would be 
wasteful to copy the whole data structure into a new instance in every seqOp, 
so mutation is an important feature.

I'm seeing the following behavior:

{code}
def inc_mutate(counter, item):
counter[0] += 1
return counter

def inc_pure(counter, item):
return [counter[0] + 1]

def merge_mutate(c1, c2):
c1[0] += c2[0]
return c1

def merge_pure(c1, c2):
return [c1[0] + c2[0]]

# correct answer, when neither function mutates their arguments
init = [0]
sc.parallelize(range(10)).aggregate(init, inc_pure, merge_pure)
# [10]
init
# [0]

# incorrect answer if seqOp mutates its first argument
init = [0]
sc.parallelize(range(10)).aggregate(init, inc_mutate, merge_pure)
# [20] <- WRONG
init
# [0]

# zero value is modified if combOp mutates its first argument
init = [0]
sc.parallelize(range(10)).aggregate(init, inc_pure, merge_mutate)
# [10]
init
# [10]

# for completeness
init = [0]
sc.parallelize(range(10)).aggregate(init, inc_mutate, merge_mutate)
# [20]
init
# [20]
{code}

I'm running on an EMR cluster launched with:
{code}
aws emr create-cluster --name jarno-spark \
 --ami-version 3.5 \
 --instance-type c3.8xlarge \
 --instance-count 5 \
 --ec2-attributes KeyName=foo \
 --applications Name=Ganglia \
 --log-uri s3://foo/log \
 --bootstrap-actions 
Path=s3://support.elasticmapreduce/spark/install-spark,Args=[-g,-x,-l,ERROR]
{code}


  was:
Python RDD.aggregate method doesn't match its documentation w.r.t. seqOp or 
combOp mutating their first argument.

* the results are incorrect if seqOp mutates its first argument
* the zero value is modified if combOp mutates its first argument

I'm seeing the following behavior:

{code}
def inc_mutate(counter, item):
counter[0] += 1
return counter

def inc_pure(counter, item):
return [counter[0] + 1]

def merge_mutate(c1, c2):
c1[0] += c2[0]
return c1

def merge_pure(c1, c2):
return [c1[0] + c2[0]]

# correct answer, when neither function mutates their arguments
init = [0]
sc.parallelize(range(10)).aggregate(init, inc_pure, merge_pure)
# [10]
init
# [0]

# incorrect answer if seqOp mutates its first argument
init = [0]
sc.parallelize(range(10)).aggregate(init, inc_mutate, merge_pure)
# [20] <- WRONG
init
# [0]

# zero value is modified if combOp mutates its first argument
init = [0]
sc.parallelize(range(10)).aggregate(init, inc_pure, merge_mutate)
# [10]
init
# [10] <- funny behavior (though not documented)

# for completeness
init = [0]
sc.parallelize(range(10)).aggregate(init, inc_mutate, merge_mutate)
# [20]
init
# [20]
{code}

Summary: Incorrect aggregate results if seqOp(...) mutates its first 
argument  (was: Incorrect aggregate results if op(...) mutates first argument)

> Incorrect aggregate results if seqOp(...) mutates its first argument
> 
>
> Key: SPARK-6551
> URL: https://issues.apache.org/jira/browse/SPARK-6551
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.3.0
> Environment: Amazon EMR, AMI version 3.5
>Reporter: Jarno Seppanen
>
> Python RDD.aggregate method doesn't match its documentation w.r.t. seqOp 
> mutating its first argument.
> * the results are incorrect if seqOp mutates its first argument
> * additionally, the zero value is modified if combOp mutates its first 
> argument (this is slightly surprising, would be nice to document)
> I'm aggregating the RDD into a nontrivial data structure, and it would be 
> wasteful to copy the whole data structure into a new instance in every seqOp, 
> so mutation is an important feature.
> I'm seeing the following behavior:
> {code}
> def inc_mutate(counter, item):
> counter[0] += 1
> return counter
> def inc_pure(counter, item):
> return [counter[0] + 1]
> def merge_mutate(c1, c2):
> c1[0] += c2[0]
> return c1
> def merge_pure(c1, c2):
> return [c1[0] + c2[0]]
> # correct answer, when neither function mutates their arguments
> init = [0]
> sc.parallelize(range(10)).aggregate(init, inc_pure, merge_pure)
> # [10]
> init
> # [0]
> # incorrect answer if seqOp mutates its first argument
> init = [0]
> sc.parallelize(range(10)).aggregate(init, inc_mutate, merge_pure)
> # [20] <- WRONG
> init
> # [0]
> # zero value is modified if combOp mutates its first argument
> init = [0]
> s

[jira] [Updated] (SPARK-6552) expose start-slave.sh to user and update outdated doc

2015-03-26 Thread Tao Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6552?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tao Wang updated SPARK-6552:

Component/s: Documentation

> expose start-slave.sh to user and update outdated doc
> -
>
> Key: SPARK-6552
> URL: https://issues.apache.org/jira/browse/SPARK-6552
> Project: Spark
>  Issue Type: Improvement
>  Components: Deploy, Documentation
>Reporter: Tao Wang
>Priority: Minor
>
> It would be better to expose start-slave.sh to users, to allow starting a 
> worker on a single node.
> Since the documentation currently describes starting a worker in the foreground, 
> I also changed it to the background way (using start-slave.sh).






[jira] [Resolved] (SPARK-6546) Using the wrong code that will make spark compile failed!!

2015-03-26 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6546?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian resolved SPARK-6546.
---
   Resolution: Fixed
Fix Version/s: 1.4.0

Issue resolved by pull request 5198
[https://github.com/apache/spark/pull/5198]

> Using the wrong code that will make spark compile failed!! 
> ---
>
> Key: SPARK-6546
> URL: https://issues.apache.org/jira/browse/SPARK-6546
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 1.4.0
>Reporter: DoingDone9
>Assignee: DoingDone9
> Fix For: 1.4.0
>
>
> Wrong code: {{val tmpDir = Files.createTempDir()}}
> It should use {{Utils}} instead of {{Files}}.






[jira] [Updated] (SPARK-6546) Using the wrong code that will make spark compile failed!!

2015-03-26 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6546?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-6546:
--
Affects Version/s: 1.4.0

> Using the wrong code that will make spark compile failed!! 
> ---
>
> Key: SPARK-6546
> URL: https://issues.apache.org/jira/browse/SPARK-6546
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 1.4.0
>Reporter: DoingDone9
>Assignee: DoingDone9
> Fix For: 1.4.0
>
>
> Wrong code: {{val tmpDir = Files.createTempDir()}}
> It should use {{Utils}} instead of {{Files}}.






[jira] [Updated] (SPARK-6546) Build failure caused by PR #5029 together with #4289

2015-03-26 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6546?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-6546:
--
 Component/s: (was: Build)
  SQL
 Description: PR [#4289|https://github.com/apache/spark/pull/4289] was 
using Guava's {{com.google.common.io.Files}} according to the first commit of 
that PR, see 
[here|https://github.com/jeanlyn/spark/blob/3b27af36f82580c2171df965140c9a14e62fd5f0/sql/hive/src/test/scala/org/apache/spark/sql/hive/InsertIntoHiveTableSuite.scala#L22].
 However, [PR #5029|https://github.com/apache/spark/pull/5029] was merged 
earlier, and replaced Guava's {{Files}} with {{Utils}}. These two changes combined 
caused this build failure. (There are no conflicts in the eyes of Git, but there 
do exist semantic conflicts.)  (was: wrong code : val tmpDir = 
Files.createTempDir()
not Files should Utils)
Target Version/s: 1.4.0
 Summary: Build failure caused by PR #5029 together with #4289  
(was: Using the wrong code that will make spark compile failed!! )

> Build failure caused by PR #5029 together with #4289
> 
>
> Key: SPARK-6546
> URL: https://issues.apache.org/jira/browse/SPARK-6546
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.0
>Reporter: Pei, Zhongshuai
>Assignee: Pei, Zhongshuai
> Fix For: 1.4.0
>
>
> PR [#4289|https://github.com/apache/spark/pull/4289] was using Guava's 
> {{com.google.common.io.Files}} according to the first commit of that PR, see 
> [here|https://github.com/jeanlyn/spark/blob/3b27af36f82580c2171df965140c9a14e62fd5f0/sql/hive/src/test/scala/org/apache/spark/sql/hive/InsertIntoHiveTableSuite.scala#L22].
>  However, [PR #5029|https://github.com/apache/spark/pull/5029] was merged 
> earlier, and deprecated the Guava {{Files}} usage in favor of {{Utils}}. These two 
> combined caused this build failure. (There are no conflicts in the eyes of 
> Git, but there do exist semantic conflicts.)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6552) expose start-slave.sh to user and update outdated doc

2015-03-26 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6552?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14381604#comment-14381604
 ] 

Apache Spark commented on SPARK-6552:
-

User 'WangTaoTheTonic' has created a pull request for this issue:
https://github.com/apache/spark/pull/5205

> expose start-slave.sh to user and update outdated doc
> -
>
> Key: SPARK-6552
> URL: https://issues.apache.org/jira/browse/SPARK-6552
> Project: Spark
>  Issue Type: Improvement
>  Components: Deploy, Documentation
>Reporter: Tao Wang
>Priority: Minor
>
> It would be better to expose start-slave.sh to users to allow starting a 
> worker on a single node.
> As the documentation describes starting a worker in the foreground, I 
> also changed it to the background way (using start-slave.sh).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6546) Build failure caused by PR #5029 together with #4289

2015-03-26 Thread Cheng Lian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14381606#comment-14381606
 ] 

Cheng Lian commented on SPARK-6546:
---

Updated ticket title and description to reflect the root cause.

> Build failure caused by PR #5029 together with #4289
> 
>
> Key: SPARK-6546
> URL: https://issues.apache.org/jira/browse/SPARK-6546
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.0
>Reporter: Pei, Zhongshuai
>Assignee: Pei, Zhongshuai
> Fix For: 1.4.0
>
>
> PR [#4289|https://github.com/apache/spark/pull/4289] was using Guava's 
> {{com.google.common.io.Files}} according to the first commit of that PR, see 
> [here|https://github.com/jeanlyn/spark/blob/3b27af36f82580c2171df965140c9a14e62fd5f0/sql/hive/src/test/scala/org/apache/spark/sql/hive/InsertIntoHiveTableSuite.scala#L22].
>  However, [PR #5029|https://github.com/apache/spark/pull/5029] was merged 
> earlier, and deprecated the Guava {{Files}} usage in favor of {{Utils}}. These two 
> combined caused this build failure. (There are no conflicts in the eyes of 
> Git, but there do exist semantic conflicts.)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-6552) expose start-slave.sh to user and update outdated doc

2015-03-26 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6552?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-6552:
---

Assignee: (was: Apache Spark)

> expose start-slave.sh to user and update outdated doc
> -
>
> Key: SPARK-6552
> URL: https://issues.apache.org/jira/browse/SPARK-6552
> Project: Spark
>  Issue Type: Improvement
>  Components: Deploy, Documentation
>Reporter: Tao Wang
>Priority: Minor
>
> It would be better to expose start-slave.sh to users to allow starting a 
> worker on a single node.
> As the documentation describes starting a worker in the foreground, I 
> also changed it to the background way (using start-slave.sh).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-6552) expose start-slave.sh to user and update outdated doc

2015-03-26 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6552?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-6552:
---

Assignee: Apache Spark

> expose start-slave.sh to user and update outdated doc
> -
>
> Key: SPARK-6552
> URL: https://issues.apache.org/jira/browse/SPARK-6552
> Project: Spark
>  Issue Type: Improvement
>  Components: Deploy, Documentation
>Reporter: Tao Wang
>Assignee: Apache Spark
>Priority: Minor
>
> It would be better to expose start-slave.sh to users to allow starting a 
> worker on a single node.
> As the documentation describes starting a worker in the foreground, I 
> also changed it to the background way (using start-slave.sh).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-6553) Support for functools.partial as UserDefinedFunction

2015-03-26 Thread Kalle Jepsen (JIRA)
Kalle Jepsen created SPARK-6553:
---

 Summary: Support for functools.partial as UserDefinedFunction
 Key: SPARK-6553
 URL: https://issues.apache.org/jira/browse/SPARK-6553
 Project: Spark
  Issue Type: Improvement
  Components: PySpark
Affects Versions: 1.3.0
Reporter: Kalle Jepsen


Currently {{functools.partial}} objects cannot be used as {{UserDefinedFunction}}s 
for {{DataFrame}}s, as the {{\_\_name\_\_}} attribute does not exist. Passing 
a {{functools.partial}} object will raise an Exception at 
https://github.com/apache/spark/blob/master/python/pyspark/sql/functions.py#L126.
 

{{functools.partial}} is very widely used and should probably be supported, 
despite its lack of a {{\_\_name\_\_}}.

My suggestion is to use {{f.\_\_repr\_\_()}} instead, or to check with 
{{hasattr(f, '\_\_name\_\_')}} and use {{\_\_class\_\_}} if it returns {{False}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-6553) Support for functools.partial as UserDefinedFunction

2015-03-26 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6553?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-6553:
---

Assignee: (was: Apache Spark)

> Support for functools.partial as UserDefinedFunction
> 
>
> Key: SPARK-6553
> URL: https://issues.apache.org/jira/browse/SPARK-6553
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 1.3.0
>Reporter: Kalle Jepsen
>  Labels: features
>
> Currently {{functools.partial}} objects cannot be used as {{UserDefinedFunction}}s 
> for {{DataFrame}}s, as the {{\_\_name\_\_}} attribute does not exist. 
> Passing a {{functools.partial}} object will raise an Exception at 
> https://github.com/apache/spark/blob/master/python/pyspark/sql/functions.py#L126.
> 
> {{functools.partial}} is very widely used and should probably be supported, 
> despite its lack of a {{\_\_name\_\_}}.
> My suggestion is to use {{f.\_\_repr\_\_()}} instead, or to check with 
> {{hasattr(f, '\_\_name\_\_')}} and use {{\_\_class\_\_}} if it returns {{False}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6553) Support for functools.partial as UserDefinedFunction

2015-03-26 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6553?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14381616#comment-14381616
 ] 

Apache Spark commented on SPARK-6553:
-

User 'ksonj' has created a pull request for this issue:
https://github.com/apache/spark/pull/5206

> Support for functools.partial as UserDefinedFunction
> 
>
> Key: SPARK-6553
> URL: https://issues.apache.org/jira/browse/SPARK-6553
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 1.3.0
>Reporter: Kalle Jepsen
>  Labels: features
>
> Currently {{functools.partial}} objects cannot be used as {{UserDefinedFunction}}s 
> for {{DataFrame}}s, as the {{\_\_name\_\_}} attribute does not exist. 
> Passing a {{functools.partial}} object will raise an Exception at 
> https://github.com/apache/spark/blob/master/python/pyspark/sql/functions.py#L126.
> 
> {{functools.partial}} is very widely used and should probably be supported, 
> despite its lack of a {{\_\_name\_\_}}.
> My suggestion is to use {{f.\_\_repr\_\_()}} instead, or to check with 
> {{hasattr(f, '\_\_name\_\_')}} and use {{\_\_class\_\_}} if it returns {{False}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-6553) Support for functools.partial as UserDefinedFunction

2015-03-26 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6553?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-6553:
---

Assignee: Apache Spark

> Support for functools.partial as UserDefinedFunction
> 
>
> Key: SPARK-6553
> URL: https://issues.apache.org/jira/browse/SPARK-6553
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 1.3.0
>Reporter: Kalle Jepsen
>Assignee: Apache Spark
>  Labels: features
>
> Currently {{functools.partial}} objects cannot be used as {{UserDefinedFunction}}s 
> for {{DataFrame}}s, as the {{\_\_name\_\_}} attribute does not exist. 
> Passing a {{functools.partial}} object will raise an Exception at 
> https://github.com/apache/spark/blob/master/python/pyspark/sql/functions.py#L126.
> 
> {{functools.partial}} is very widely used and should probably be supported, 
> despite its lack of a {{\_\_name\_\_}}.
> My suggestion is to use {{f.\_\_repr\_\_()}} instead, or to check with 
> {{hasattr(f, '\_\_name\_\_')}} and use {{\_\_class\_\_}} if it returns {{False}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6435) spark-shell --jars option does not add all jars to classpath

2015-03-26 Thread Masayoshi TSUZUKI (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6435?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14381641#comment-14381641
 ] 

Masayoshi TSUZUKI commented on SPARK-6435:
--

I think {{"%~2"==""}} way is better for this.
But this script no longer exists now because the launching method of spark has 
changed drastically by [SPARK-4924].

> spark-shell --jars option does not add all jars to classpath
> 
>
> Key: SPARK-6435
> URL: https://issues.apache.org/jira/browse/SPARK-6435
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell, Windows
>Affects Versions: 1.3.0
> Environment: Win64
>Reporter: vijay
>
> Not all jars supplied via the --jars option will be added to the driver (and 
> presumably executor) classpath.  The first jar(s) will be added, but not all.
> To reproduce this, just add a few jars (I tested 5) to the --jars option, and 
> then try to import a class from the last jar.  This fails.  A simple 
> reproducer: 
> Create a bunch of dummy jars:
> jar cfM jar1.jar log.txt
> jar cfM jar2.jar log.txt
> jar cfM jar3.jar log.txt
> jar cfM jar4.jar log.txt
> Start the spark-shell with the dummy jars and guava at the end:
> %SPARK_HOME%\bin\spark-shell --master local --jars 
> jar1.jar,jar2.jar,jar3.jar,jar4.jar,c:\code\lib\guava-14.0.1.jar
> In the shell, try importing from guava; you'll get an error:
> {code}
> scala> import com.google.common.base.Strings
> :19: error: object Strings is not a member of package 
> com.google.common.base
>import com.google.common.base.Strings
>   ^
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2213) Sort Merge Join

2015-03-26 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14381650#comment-14381650
 ] 

Apache Spark commented on SPARK-2213:
-

User 'adrian-wang' has created a pull request for this issue:
https://github.com/apache/spark/pull/5208

> Sort Merge Join
> ---
>
> Key: SPARK-2213
> URL: https://issues.apache.org/jira/browse/SPARK-2213
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Cheng Hao
>Assignee: Liquan Pei
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4830) Spark Streaming Java Application : java.lang.ClassNotFoundException

2015-03-26 Thread sam (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4830?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14381663#comment-14381663
 ] 

sam commented on SPARK-4830:


Could also be related to https://issues.apache.org/jira/browse/SPARK-4660

> Spark Streaming Java Application : java.lang.ClassNotFoundException
> ---
>
> Key: SPARK-4830
> URL: https://issues.apache.org/jira/browse/SPARK-4830
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Affects Versions: 1.1.0
>Reporter: Mykhaylo Telizhyn
>
> h4. Application Overview:
>   
>We have a Spark Streaming application that consumes messages from 
> RabbitMQ and processes them. When generating hundreds of events on RabbitMQ 
> and running our application on a Spark standalone cluster, we see some 
> {{java.lang.ClassNotFoundException}} exceptions in the log. 
> Our domain model is simple POJO that represents RabbitMQ events we want to 
> consume and contains some custom properties we are interested in: 
> {code:title=com.xxx.Event.java|borderStyle=solid}
> public class Event implements java.io.Externalizable {
> 
> // custom properties
> // custom implementation of writeExternal(), readExternal() 
> methods
> }
> {code}
> We have implemented a custom Spark Streaming receiver that just 
> receives messages from the RabbitMQ queue by means of a custom consumer (See 
> _"Receiving messages by subscription"_ at 
> https://www.rabbitmq.com/api-guide.html), converts them to our custom domain 
> event objects ({{com.xxx.Event}}) and stores them on spark memory:
> {code:title=RabbitMQReceiver.java|borderStyle=solid}
> byte[] body = // data received from Rabbit using custom consumer
> Event event = new Event(body);
> store(event)  // store into Spark  
> {code}
> The main program is simple, it just set up spark streaming context:
> {code:title=Application.java|borderStyle=solid}
> SparkConf sparkConf = new 
> SparkConf().setAppName(APPLICATION_NAME);
> 
> sparkConf.setJars(SparkContext.jarOfClass(Application.class).toList());  
> JavaStreamingContext ssc = new JavaStreamingContext(sparkConf, 
> new Duration(BATCH_DURATION_MS));
> {code}
> Initialize input streams:
> {code:title=Application.java|borderStyle=solid}
> ReceiverInputDStream stream = // create input stream from 
> RabbitMQ
> JavaReceiverInputDStream events = new 
> JavaReceiverInputDStream(stream, classTag(Event.class));
> {code}
> Process events:
> {code:title=Application.java|borderStyle=solid}
> events.foreachRDD(
> rdd -> {
> rdd.foreachPartition(
> partition -> {
>  
> // process partition
> }
> }
> })
> 
> ssc.start();
> ssc.awaitTermination();
> {code}
> h4. Application submission:
> 
> Application is packaged as a single fat jar file using maven shade 
> plugin (http://maven.apache.org/plugins/maven-shade-plugin/). It is compiled 
> with spark version _1.1.0_   
> We run our application on spark version _1.1.0_ standalone cluster 
> that consists of driver host, master host and two worker hosts. We submit 
> application from driver host.
> 
> On one of the workers we see {{java.lang.ClassNotFoundException}} 
> exceptions:   
> {panel:title=app.log|borderStyle=dashed|borderColor=#ccc|titleBGColor=#e3e4e1|bgColor=#f0f8ff}
> 14/11/27 10:27:10 ERROR BlockManagerWorker: Exception handling buffer message
> java.lang.ClassNotFoundException: com.xxx.Event
> at java.net.URLClassLoader$1.run(URLClassLoader.java:372)
> at java.net.URLClassLoader$1.run(URLClassLoader.java:361)
> at java.security.AccessController.doPrivileged(Native Method)
> at java.net.URLClassLoader.findClass(URLClassLoader.java:360)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
> at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
> at java.lang.Class.forName0(Native Method)
> at java.lang.Class.forName(Class.java:344)
> at 
> org.apache.spark.serializer.JavaDeserializationStream$$anon$1.resolveClass(JavaSerializer.scala:59)
> at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1613)
> at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1518)
> at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1774)

[jira] [Assigned] (SPARK-6471) Metastore schema should only be a subset of parquet schema to support dropping of columns using replace columns

2015-03-26 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6471?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-6471:
---

Assignee: Apache Spark

> Metastore schema should only be a subset of parquet schema to support 
> dropping of columns using replace columns
> ---
>
> Key: SPARK-6471
> URL: https://issues.apache.org/jira/browse/SPARK-6471
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.3.0
>Reporter: Yash Datta
>Assignee: Apache Spark
> Fix For: 1.4.0
>
>
> Currently in the parquet relation 2 implementation, an error is thrown when the 
> merged schema is not exactly the same as the metastore schema. 
> But to support cases like dropping a column using the replace columns command, we 
> can relax the restriction so that the query will still work even if the metastore 
> schema is a subset of the merged parquet schema.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-6471) Metastore schema should only be a subset of parquet schema to support dropping of columns using replace columns

2015-03-26 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6471?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-6471:
---

Assignee: (was: Apache Spark)

> Metastore schema should only be a subset of parquet schema to support 
> dropping of columns using replace columns
> ---
>
> Key: SPARK-6471
> URL: https://issues.apache.org/jira/browse/SPARK-6471
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.3.0
>Reporter: Yash Datta
> Fix For: 1.4.0
>
>
> Currently in the parquet relation 2 implementation, an error is thrown when the 
> merged schema is not exactly the same as the metastore schema. 
> But to support cases like dropping a column using the replace columns command, we 
> can relax the restriction so that the query will still work even if the metastore 
> schema is a subset of the merged parquet schema.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-6554) Cannot use partition columns in where clause

2015-03-26 Thread Jon Chase (JIRA)
Jon Chase created SPARK-6554:


 Summary: Cannot use partition columns in where clause
 Key: SPARK-6554
 URL: https://issues.apache.org/jira/browse/SPARK-6554
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.3.0
Reporter: Jon Chase


I'm having trouble referencing partition columns in my queries with Parquet.  

In the following example, 'probeTypeId' is a partition column.  For example, 
the directory structure looks like this:

/mydata
/probeTypeId=1
...files...
/probeTypeId=2
...files...

I see the column when I load a DF using the /mydata directory and 
call df.printSchema():

...
 |-- probeTypeId: integer (nullable = true)
...

Parquet is also aware of the column:
 optional int32 probeTypeId;

And this works fine:

sqlContext.sql("select probeTypeId from df limit 1");

...as does df.show() - it shows the correct values for the partition column.


However, when I try to use a partition column in a where clause, I get an 
exception stating that the column was not found in the schema:

sqlContext.sql("select probeTypeId from df where probeTypeId = 1 limit 1");



...
...
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 
0, localhost): java.lang.IllegalArgumentException: Column [probeTypeId] was not 
found in schema!
at parquet.Preconditions.checkArgument(Preconditions.java:47)
at 
parquet.filter2.predicate.SchemaCompatibilityValidator.getColumnDescriptor(SchemaCompatibilityValidator.java:172)
at 
parquet.filter2.predicate.SchemaCompatibilityValidator.validateColumn(SchemaCompatibilityValidator.java:160)
at 
parquet.filter2.predicate.SchemaCompatibilityValidator.validateColumnFilterPredicate(SchemaCompatibilityValidator.java:142)
at 
parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:76)
at 
parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:41)
at parquet.filter2.predicate.Operators$Eq.accept(Operators.java:162)
at 
parquet.filter2.predicate.SchemaCompatibilityValidator.validate(SchemaCompatibilityValidator.java:46)
at parquet.filter2.compat.RowGroupFilter.visit(RowGroupFilter.java:41)
at parquet.filter2.compat.RowGroupFilter.visit(RowGroupFilter.java:22)
at 
parquet.filter2.compat.FilterCompat$FilterPredicateCompat.accept(FilterCompat.java:108)
at 
parquet.filter2.compat.RowGroupFilter.filterRowGroups(RowGroupFilter.java:28)
at 
parquet.hadoop.ParquetRecordReader.initializeInternalReader(ParquetRecordReader.java:158)
...
...






Here's the full stack trace:

using local[*] for master
06:05:55,675 |-INFO in ch.qos.logback.classic.joran.action.ConfigurationAction 
- debug attribute not set
06:05:55,683 |-INFO in ch.qos.logback.core.joran.action.AppenderAction - About 
to instantiate appender of type [ch.qos.logback.core.ConsoleAppender]
06:05:55,694 |-INFO in ch.qos.logback.core.joran.action.AppenderAction - Naming 
appender as [STDOUT]
06:05:55,721 |-INFO in ch.qos.logback.core.joran.action.NestedComplexPropertyIA 
- Assuming default type [ch.qos.logback.classic.encoder.PatternLayoutEncoder] 
for [encoder] property
06:05:55,768 |-INFO in ch.qos.logback.classic.joran.action.RootLoggerAction - 
Setting level of ROOT logger to INFO
06:05:55,768 |-INFO in ch.qos.logback.core.joran.action.AppenderRefAction - 
Attaching appender named [STDOUT] to Logger[ROOT]
06:05:55,769 |-INFO in ch.qos.logback.classic.joran.action.ConfigurationAction 
- End of configuration.
06:05:55,770 |-INFO in ch.qos.logback.classic.joran.JoranConfigurator@6aaceffd 
- Registering current configuration as safe fallback point

INFO  org.apache.spark.SparkContext Running Spark version 1.3.0
WARN  o.a.hadoop.util.NativeCodeLoader Unable to load native-hadoop library for 
your platform... using builtin-java classes where applicable
INFO  org.apache.spark.SecurityManager Changing view acls to: jon
INFO  org.apache.spark.SecurityManager Changing modify acls to: jon
INFO  org.apache.spark.SecurityManager SecurityManager: authentication 
disabled; ui acls disabled; users with view permissions: Set(jon); users with 
modify permissions: Set(jon)
INFO  akka.event.slf4j.Slf4jLogger Slf4jLogger started
INFO  Remoting Starting remoting
INFO  Remoting Remoting started; listening on addresses 
:[akka.tcp://sparkDriver@192.168.1.134:62493]
INFO  org.apache.spark.util.Utils Successfully started service 'sparkDriver' on 
port 62493.
INFO  org.apache.spark.SparkEnv Registering MapOutputTracker
INFO  org.apache.spark.SparkEnv Registering BlockManagerMaster
INFO  o.a.spark.storage.DiskBlockManager Created local directory at 
/var/folders/x7/9hdp8kw9569864088tsl4jmmgn/T/spark-150e23

[jira] [Resolved] (SPARK-6465) GenericRowWithSchema: KryoException: Class cannot be created (missing no-arg constructor):

2015-03-26 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian resolved SPARK-6465.
---
   Resolution: Fixed
Fix Version/s: 1.4.0
   1.3.1

Issue resolved by pull request 5191
[https://github.com/apache/spark/pull/5191]

> GenericRowWithSchema: KryoException: Class cannot be created (missing no-arg 
> constructor):
> --
>
> Key: SPARK-6465
> URL: https://issues.apache.org/jira/browse/SPARK-6465
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.3.0
> Environment: Spark 1.3, YARN 2.6.0, CentOS
>Reporter: Earthson Lu
>Assignee: Michael Armbrust
>Priority: Critical
> Fix For: 1.3.1, 1.4.0
>
>   Original Estimate: 0.5h
>  Remaining Estimate: 0.5h
>
> I could not find an existing issue for this. 
> The Kryo registration for GenericRowWithSchema is missing in 
> org.apache.spark.sql.execution.SparkSqlSerializer.
> Is this the only thing we need to do?
> Here is the log
> {code}
> 15/03/23 16:21:00 WARN TaskSetManager: Lost task 9.0 in stage 20.0 (TID 
> 31978, datanode06.site): com.esotericsoftware.kryo.KryoException: Class 
> cannot be created (missing no-arg constructor): 
> org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema
> at com.esotericsoftware.kryo.Kryo.newInstantiator(Kryo.java:1050)
> at com.esotericsoftware.kryo.Kryo.newInstance(Kryo.java:1062)
> at 
> com.esotericsoftware.kryo.serializers.FieldSerializer.create(FieldSerializer.java:228)
> at 
> com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:217)
> at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:732)
> at com.twitter.chill.Tuple2Serializer.read(TupleSerializers.scala:42)
> at com.twitter.chill.Tuple2Serializer.read(TupleSerializers.scala:33)
> at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:732)
> at 
> org.apache.spark.serializer.KryoDeserializationStream.readObject(KryoSerializer.scala:138)
> at 
> org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:133)
> at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:71)
> at 
> org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
> at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
> at 
> org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
> at 
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
> at 
> org.apache.spark.sql.execution.joins.HashJoin$$anon$1.hasNext(HashJoin.scala:66)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
> at 
> org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:217)
> at 
> org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:63)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
> at org.apache.spark.scheduler.Task.run(Task.scala:64)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
> at java.lang.Thread.run(Thread.java:722)
> {code}
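
For readers wondering what the missing registration means concretely, here is a hedged 
sketch of registering the class with Kryo from application code via a custom registrator. 
The registrator name below is made up for illustration; the reporter's point is that this 
registration belongs inside SparkSqlSerializer, which is where pull request 5191 addresses 
it.

{code}
import com.esotericsoftware.kryo.Kryo
import org.apache.spark.serializer.KryoRegistrator
import org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema

// Illustrative registrator: registers the row class with the Kryo instance,
// mirroring the registration the reporter notes is missing.
class RowWithSchemaRegistrator extends KryoRegistrator {
  override def registerClasses(kryo: Kryo): Unit = {
    kryo.register(classOf[GenericRowWithSchema])
  }
}

// Enabled via configuration, e.g.:
//   spark.kryo.registrator=<your.package>.RowWithSchemaRegistrator
{code}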



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6481) Set "In Progress" when a PR is opened for an issue

2015-03-26 Thread Cheng Lian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14381703#comment-14381703
 ] 

Cheng Lian commented on SPARK-6481:
---

Maybe unrelated to this issue, but I saw a lot of JIRA notifications about 
"Assignee" updates, jumping between a normal user and "Apache Spark". Is this 
behavior a side effect of the "In Progress" PR? (Seems caused by [this code 
block|https://github.com/databricks/spark-pr-dashboard/pull/49/files#diff-6f3562e8b8a773341837373ab53b5462R34].)

> Set "In Progress" when a PR is opened for an issue
> --
>
> Key: SPARK-6481
> URL: https://issues.apache.org/jira/browse/SPARK-6481
> Project: Spark
>  Issue Type: Bug
>  Components: Project Infra
>Reporter: Michael Armbrust
>Assignee: Nicholas Chammas
>
> [~pwendell] and I are not sure if this is possible, but it would be really 
> helpful if the JIRA status was updated to "In Progress" when we do the 
> linking to an open pull request.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6549) Spark console logger logs to stderr by default

2015-03-26 Thread Pavel Sakun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pavel Sakun updated SPARK-6549:
---
Description: 
Spark's console logger is configured to log messages at INFO level to stderr 
by default, while it should be stdout:
https://github.com/apache/spark/blob/master/core/src/main/resources/org/apache/spark/log4j-defaults.properties
https://github.com/apache/spark/blob/master/conf/log4j.properties.template

  was:
Spark's console logger is configure to log message with INFO level to stderr 
while it should be stdout:
https://github.com/apache/spark/blob/master/core/src/main/resources/org/apache/spark/log4j-defaults.properties
https://github.com/apache/spark/blob/master/conf/log4j.properties.template


> Spark console logger logs to stderr by default
> --
>
> Key: SPARK-6549
> URL: https://issues.apache.org/jira/browse/SPARK-6549
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 1.2.0
>Reporter: Pavel Sakun
>Priority: Trivial
>  Labels: log4j
>
> Spark's console logger is configured to log messages at INFO level to stderr 
> by default, while it should be stdout:
> https://github.com/apache/spark/blob/master/core/src/main/resources/org/apache/spark/log4j-defaults.properties
> https://github.com/apache/spark/blob/master/conf/log4j.properties.template
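
For users who want console output on stdout in the meantime, the relevant knob in the 
linked files is the console appender's target. A minimal user-side override (a sketch 
based on the template above, not a change Spark ships) could look like:

{code}
# log4j.properties placed on the application classpath.
# The bundled defaults point the console appender at System.err; switching the
# target routes INFO logging to stdout instead.
log4j.rootCategory=INFO, console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.target=System.out
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n
{code}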



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-6555) Override equals and hashCode in MetastoreRelation

2015-03-26 Thread Cheng Lian (JIRA)
Cheng Lian created SPARK-6555:
-

 Summary: Override equals and hashCode in MetastoreRelation
 Key: SPARK-6555
 URL: https://issues.apache.org/jira/browse/SPARK-6555
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.3.0, 1.2.1, 1.1.1, 1.0.2
Reporter: Cheng Lian


This is a follow-up of SPARK-6450.

As explained in [this 
comment|https://issues.apache.org/jira/browse/SPARK-6450?focusedCommentId=14379499&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14379499]
 of SPARK-6450, we resorted to a more surgical fix due to the upcoming 1.3.1 
release. But overriding {{equals}} and {{hashCode}} is the proper fix to that 
problem.
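
As a generic illustration of what that proper fix amounts to (the class and field names 
below are hypothetical and not taken from MetastoreRelation), overriding {{equals}} and 
{{hashCode}} means basing equality on the identifying fields rather than on object 
identity:

{code}
// Sketch only: value-based equality for a relation-like class.
class ExampleRelation(val databaseName: String, val tableName: String) {
  override def equals(other: Any): Boolean = other match {
    case that: ExampleRelation =>
      databaseName == that.databaseName && tableName == that.tableName
    case _ => false
  }

  override def hashCode(): Int =
    java.util.Objects.hash(databaseName, tableName)
}
{code}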



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-6544) Problem with Avro and Kryo Serialization

2015-03-26 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6544?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-6544:
---

Assignee: Apache Spark

> Problem with Avro and Kryo Serialization
> 
>
> Key: SPARK-6544
> URL: https://issues.apache.org/jira/browse/SPARK-6544
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.2.0, 1.3.0
>Reporter: Dean Chen
>Assignee: Apache Spark
>
> We're running in to the following bug with Avro 1.7.6 and the Kryo serializer 
> causing jobs to fail
> https://issues.apache.org/jira/browse/AVRO-1476?focusedCommentId=13999249&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13999249
> PR here
> https://github.com/apache/spark/pull/5193



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-6544) Problem with Avro and Kryo Serialization

2015-03-26 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6544?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-6544:
---

Assignee: (was: Apache Spark)

> Problem with Avro and Kryo Serialization
> 
>
> Key: SPARK-6544
> URL: https://issues.apache.org/jira/browse/SPARK-6544
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.2.0, 1.3.0
>Reporter: Dean Chen
>
> We're running in to the following bug with Avro 1.7.6 and the Kryo serializer 
> causing jobs to fail
> https://issues.apache.org/jira/browse/AVRO-1476?focusedCommentId=13999249&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13999249
> PR here
> https://github.com/apache/spark/pull/5193



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6554) Cannot use partition columns in where clause

2015-03-26 Thread Jon Chase (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6554?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14381720#comment-14381720
 ] 

Jon Chase commented on SPARK-6554:
--

Here's a test case to reproduce the issue:

{code}
@Test
public void testSpark_6554() {
// given:
DataFrame saveDF = sql.jsonRDD(
    sc.parallelize(Lists.newArrayList("{\"col1\": 1}")),
    DataTypes.createStructType(Lists.newArrayList(
        DataTypes.createStructField("col1", DataTypes.IntegerType, false))));

// when:
saveDF.saveAsParquetFile(tmp.getRoot().getAbsolutePath() + "/col2=2");

// then:
DataFrame loadedDF = sql.load(tmp.getRoot().getAbsolutePath());
assertEquals(1, loadedDF.count());

assertEquals(2, loadedDF.schema().fieldNames().length);
assertEquals("col1", loadedDF.schema().fieldNames()[0]);
assertEquals("col2", loadedDF.schema().fieldNames()[1]);

loadedDF.registerTempTable("df");

// this query works
Row[] results = sql.sql("select col1, col2 from df").collect();
assertEquals(1, results.length);
assertEquals(2, results[0].size());

// this query is broken
results = sql.sql("select col1, col2 from df where col2 > 0").collect();
assertEquals(1, results.length);
assertEquals(2, results[0].size());
}
{code}

> Cannot use partition columns in where clause
> 
>
> Key: SPARK-6554
> URL: https://issues.apache.org/jira/browse/SPARK-6554
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.3.0
>Reporter: Jon Chase
>
> I'm having trouble referencing partition columns in my queries with Parquet.  
> In the following example, 'probeTypeId' is a partition column.  For example, 
> the directory structure looks like this:
> /mydata
> /probeTypeId=1
> ...files...
> /probeTypeId=2
> ...files...
> I see the column when I load a DF using the /mydata directory and 
> call df.printSchema():
> ...
>  |-- probeTypeId: integer (nullable = true)
> ...
> Parquet is also aware of the column:
>  optional int32 probeTypeId;
> And this works fine:
> sqlContext.sql("select probeTypeId from df limit 1");
> ...as does df.show() - it shows the correct values for the partition column.
> However, when I try to use a partition column in a where clause, I get an 
> exception stating that the column was not found in the schema:
> sqlContext.sql("select probeTypeId from df where probeTypeId = 1 limit 1");
> ...
> ...
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
> stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 
> (TID 0, localhost): java.lang.IllegalArgumentException: Column [probeTypeId] 
> was not found in schema!
>   at parquet.Preconditions.checkArgument(Preconditions.java:47)
>   at 
> parquet.filter2.predicate.SchemaCompatibilityValidator.getColumnDescriptor(SchemaCompatibilityValidator.java:172)
>   at 
> parquet.filter2.predicate.SchemaCompatibilityValidator.validateColumn(SchemaCompatibilityValidator.java:160)
>   at 
> parquet.filter2.predicate.SchemaCompatibilityValidator.validateColumnFilterPredicate(SchemaCompatibilityValidator.java:142)
>   at 
> parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:76)
>   at 
> parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:41)
>   at parquet.filter2.predicate.Operators$Eq.accept(Operators.java:162)
>   at 
> parquet.filter2.predicate.SchemaCompatibilityValidator.validate(SchemaCompatibilityValidator.java:46)
>   at parquet.filter2.compat.RowGroupFilter.visit(RowGroupFilter.java:41)
>   at parquet.filter2.compat.RowGroupFilter.visit(RowGroupFilter.java:22)
>   at 
> parquet.filter2.compat.FilterCompat$FilterPredicateCompat.accept(FilterCompat.java:108)
>   at 
> parquet.filter2.compat.RowGroupFilter.filterRowGroups(RowGroupFilter.java:28)
>   at 
> parquet.hadoop.ParquetRecordReader.initializeInternalReader(ParquetRecordReader.java:158)
> ...
> ...
> Here's the full stack trace:
> using local[*] for master
> 06:05:55,675 |-INFO in 
> ch.qos.logback.classic.joran.action.ConfigurationAction - debug attribute not 
> set
> 06:05:55,683 |-INFO in ch.qos.logback.core.joran.action.AppenderAction - 
> About to instantiate appender of type [ch.qos.logback.core.ConsoleAppender]
> 06:05:55,694 |-INFO in ch.qos.logback.core.joran.action.AppenderAction - 
> Naming appender as [STDOUT]
> 06:05:55,721 |-INFO in 
> ch.qos.logback.core.joran.action.NestedComplexPropertyIA - Assuming default 
> type [ch.qos.logback.classic.encoder.PatternLayoutEncoder] for [encoder] 
> property
> 06:05:55,768 |-INFO in

[jira] [Created] (SPARK-6556) Fix wrong parsing logic of executorTimeoutMs and checkTimeoutIntervalMs in HeartbeatReceiver

2015-03-26 Thread Shixiong Zhu (JIRA)
Shixiong Zhu created SPARK-6556:
---

 Summary: Fix wrong parsing logic of executorTimeoutMs and 
checkTimeoutIntervalMs in HeartbeatReceiver
 Key: SPARK-6556
 URL: https://issues.apache.org/jira/browse/SPARK-6556
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Reporter: Shixiong Zhu


The current reading logic of "executorTimeoutMs" is:

{code}
private val executorTimeoutMs = sc.conf.getLong("spark.network.timeout", 
sc.conf.getLong("spark.storage.blockManagerSlaveTimeoutMs", 120)) * 1000
{code}

So if "spark.storage.blockManagerSlaveTimeoutMs" is 1, executorTimeoutMs 
will be 1 * 1000. But the correct value should have been 1. 

"checkTimeoutIntervalMs" has the same issue. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6556) Fix wrong parsing logic of executorTimeoutMs and checkTimeoutIntervalMs in HeartbeatReceiver

2015-03-26 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6556?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-6556:

Description: 
The current reading logic of "executorTimeoutMs" is:

{code}
private val executorTimeoutMs = sc.conf.getLong("spark.network.timeout", 
sc.conf.getLong("spark.storage.blockManagerSlaveTimeoutMs", 120)) * 1000
{code}

So if "spark.storage.blockManagerSlaveTimeoutMs" is 1 and 
"spark.network.timeout" is not set, executorTimeoutMs will be 1 * 1000. But 
the correct value should have been 1. 

"checkTimeoutIntervalMs" has the same issue. 

  was:
The current reading logic of "executorTimeoutMs" is:

{code}
private val executorTimeoutMs = sc.conf.getLong("spark.network.timeout", 
sc.conf.getLong("spark.storage.blockManagerSlaveTimeoutMs", 120)) * 1000
{code}

So if "spark.storage.blockManagerSlaveTimeoutMs" is 1, executorTimeoutMs 
will be 1 * 1000. But the correct value should have been 1. 

"checkTimeoutIntervalMs" has the same issue. 


> Fix wrong parsing logic of executorTimeoutMs and checkTimeoutIntervalMs in 
> HeartbeatReceiver
> 
>
> Key: SPARK-6556
> URL: https://issues.apache.org/jira/browse/SPARK-6556
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Shixiong Zhu
>
> The current reading logic of "executorTimeoutMs" is:
> {code}
> private val executorTimeoutMs = sc.conf.getLong("spark.network.timeout", 
> sc.conf.getLong("spark.storage.blockManagerSlaveTimeoutMs", 120)) * 1000
> {code}
> So if "spark.storage.blockManagerSlaveTimeoutMs" is 1 and 
> "spark.network.timeout" is not set, executorTimeoutMs will be 1 * 1000. 
> But the correct value should have been 1. 
> "checkTimeoutIntervalMs" has the same issue. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6556) Fix wrong parsing logic of executorTimeoutMs and checkTimeoutIntervalMs in HeartbeatReceiver

2015-03-26 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14381769#comment-14381769
 ] 

Apache Spark commented on SPARK-6556:
-

User 'zsxwing' has created a pull request for this issue:
https://github.com/apache/spark/pull/5209

> Fix wrong parsing logic of executorTimeoutMs and checkTimeoutIntervalMs in 
> HeartbeatReceiver
> 
>
> Key: SPARK-6556
> URL: https://issues.apache.org/jira/browse/SPARK-6556
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Shixiong Zhu
>
> The current reading logic of "executorTimeoutMs" is:
> {code}
> private val executorTimeoutMs = sc.conf.getLong("spark.network.timeout", 
> sc.conf.getLong("spark.storage.blockManagerSlaveTimeoutMs", 120)) * 1000
> {code}
> So if "spark.storage.blockManagerSlaveTimeoutMs" is 1 and 
> "spark.network.timeout" is not set, executorTimeoutMs will be 1 * 1000. 
> But the correct value should have been 1. 
> "checkTimeoutIntervalMs" has the same issue. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-6556) Fix wrong parsing logic of executorTimeoutMs and checkTimeoutIntervalMs in HeartbeatReceiver

2015-03-26 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6556?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-6556:
---

Assignee: Apache Spark

> Fix wrong parsing logic of executorTimeoutMs and checkTimeoutIntervalMs in 
> HeartbeatReceiver
> 
>
> Key: SPARK-6556
> URL: https://issues.apache.org/jira/browse/SPARK-6556
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Shixiong Zhu
>Assignee: Apache Spark
>
> The current reading logic of "executorTimeoutMs" is:
> {code}
> private val executorTimeoutMs = sc.conf.getLong("spark.network.timeout", 
> sc.conf.getLong("spark.storage.blockManagerSlaveTimeoutMs", 120)) * 1000
> {code}
> So if "spark.storage.blockManagerSlaveTimeoutMs" is 1 and 
> "spark.network.timeout" is not set, executorTimeoutMs will be 1 * 1000. 
> But the correct value should have been 1. 
> "checkTimeoutIntervalMs" has the same issue. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-6556) Fix wrong parsing logic of executorTimeoutMs and checkTimeoutIntervalMs in HeartbeatReceiver

2015-03-26 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6556?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-6556:
---

Assignee: (was: Apache Spark)

> Fix wrong parsing logic of executorTimeoutMs and checkTimeoutIntervalMs in 
> HeartbeatReceiver
> 
>
> Key: SPARK-6556
> URL: https://issues.apache.org/jira/browse/SPARK-6556
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Shixiong Zhu
>
> The current reading logic of "executorTimeoutMs" is:
> {code}
> private val executorTimeoutMs = sc.conf.getLong("spark.network.timeout", 
> sc.conf.getLong("spark.storage.blockManagerSlaveTimeoutMs", 120)) * 1000
> {code}
> So if "spark.storage.blockManagerSlaveTimeoutMs" is 1 and 
> "spark.network.timeout" is not set, executorTimeoutMs will be 1 * 1000. 
> But the correct value should have been 1. 
> "checkTimeoutIntervalMs" has the same issue. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6481) Set "In Progress" when a PR is opened for an issue

2015-03-26 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14381771#comment-14381771
 ] 

Sean Owen commented on SPARK-6481:
--

[~nchammas] Agree, I really like this, though it's generating a lot of extra 
updates, and I'm one of the fools that actually tries to read `issues@`. If 
that's avoidable it would be great!

> Set "In Progress" when a PR is opened for an issue
> --
>
> Key: SPARK-6481
> URL: https://issues.apache.org/jira/browse/SPARK-6481
> Project: Spark
>  Issue Type: Bug
>  Components: Project Infra
>Reporter: Michael Armbrust
>Assignee: Nicholas Chammas
>
> [~pwendell] and I are not sure if this is possible, but it would be really 
> helpful if the JIRA status was updated to "In Progress" when we do the 
> linking to an open pull request.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6554) Cannot use partition columns in where clause

2015-03-26 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6554?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-6554:
--
Description: 
I'm having trouble referencing partition columns in my queries with Parquet.  

In the following example, 'probeTypeId' is a partition column.  For example, 
the directory structure looks like this:
{noformat}
/mydata
/probeTypeId=1
...files...
/probeTypeId=2
...files...
{noformat}
I see the column when I load a DF using the /mydata directory and 
call df.printSchema():
{noformat}
 |-- probeTypeId: integer (nullable = true)
{noformat}
Parquet is also aware of the column:
{noformat}
 optional int32 probeTypeId;
{noformat}
And this works fine:
{code}
sqlContext.sql("select probeTypeId from df limit 1");
{code}
...as does {{df.show()}} - it shows the correct values for the partition column.

However, when I try to use a partition column in a where clause, I get an 
exception stating that the column was not found in the schema:
{noformat}
sqlContext.sql("select probeTypeId from df where probeTypeId = 1 limit 1");
...
...
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 
0, localhost): java.lang.IllegalArgumentException: Column [probeTypeId] was not 
found in schema!
at parquet.Preconditions.checkArgument(Preconditions.java:47)
at 
parquet.filter2.predicate.SchemaCompatibilityValidator.getColumnDescriptor(SchemaCompatibilityValidator.java:172)
at 
parquet.filter2.predicate.SchemaCompatibilityValidator.validateColumn(SchemaCompatibilityValidator.java:160)
at 
parquet.filter2.predicate.SchemaCompatibilityValidator.validateColumnFilterPredicate(SchemaCompatibilityValidator.java:142)
at 
parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:76)
at 
parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:41)
at parquet.filter2.predicate.Operators$Eq.accept(Operators.java:162)
at 
parquet.filter2.predicate.SchemaCompatibilityValidator.validate(SchemaCompatibilityValidator.java:46)
at parquet.filter2.compat.RowGroupFilter.visit(RowGroupFilter.java:41)
at parquet.filter2.compat.RowGroupFilter.visit(RowGroupFilter.java:22)
at 
parquet.filter2.compat.FilterCompat$FilterPredicateCompat.accept(FilterCompat.java:108)
at 
parquet.filter2.compat.RowGroupFilter.filterRowGroups(RowGroupFilter.java:28)
at 
parquet.hadoop.ParquetRecordReader.initializeInternalReader(ParquetRecordReader.java:158)
...
...
{noformat}
Here's the full stack trace:
{noformat}
using local[*] for master
06:05:55,675 |-INFO in ch.qos.logback.classic.joran.action.ConfigurationAction 
- debug attribute not set
06:05:55,683 |-INFO in ch.qos.logback.core.joran.action.AppenderAction - About 
to instantiate appender of type [ch.qos.logback.core.ConsoleAppender]
06:05:55,694 |-INFO in ch.qos.logback.core.joran.action.AppenderAction - Naming 
appender as [STDOUT]
06:05:55,721 |-INFO in ch.qos.logback.core.joran.action.NestedComplexPropertyIA 
- Assuming default type [ch.qos.logback.classic.encoder.PatternLayoutEncoder] 
for [encoder] property
06:05:55,768 |-INFO in ch.qos.logback.classic.joran.action.RootLoggerAction - 
Setting level of ROOT logger to INFO
06:05:55,768 |-INFO in ch.qos.logback.core.joran.action.AppenderRefAction - 
Attaching appender named [STDOUT] to Logger[ROOT]
06:05:55,769 |-INFO in ch.qos.logback.classic.joran.action.ConfigurationAction 
- End of configuration.
06:05:55,770 |-INFO in ch.qos.logback.classic.joran.JoranConfigurator@6aaceffd 
- Registering current configuration as safe fallback point

INFO  org.apache.spark.SparkContext Running Spark version 1.3.0
WARN  o.a.hadoop.util.NativeCodeLoader Unable to load native-hadoop library for 
your platform... using builtin-java classes where applicable
INFO  org.apache.spark.SecurityManager Changing view acls to: jon
INFO  org.apache.spark.SecurityManager Changing modify acls to: jon
INFO  org.apache.spark.SecurityManager SecurityManager: authentication 
disabled; ui acls disabled; users with view permissions: Set(jon); users with 
modify permissions: Set(jon)
INFO  akka.event.slf4j.Slf4jLogger Slf4jLogger started
INFO  Remoting Starting remoting
INFO  Remoting Remoting started; listening on addresses 
:[akka.tcp://sparkDriver@192.168.1.134:62493]
INFO  org.apache.spark.util.Utils Successfully started service 'sparkDriver' on 
port 62493.
INFO  org.apache.spark.SparkEnv Registering MapOutputTracker
INFO  org.apache.spark.SparkEnv Registering BlockManagerMaster
INFO  o.a.spark.storage.DiskBlockManager Created local directory at 
/var/folders/x7/9hdp8kw9569864088tsl4jmmgn/T/spark-150e23b2-ff19-4a51-8cfc-25fb8e1b3f2b/blockmgr-6eea286c-7473-4bda-8886-7250156b68f4
IN

[jira] [Updated] (SPARK-6554) Cannot use partition columns in where clause

2015-03-26 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6554?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-6554:
--
Target Version/s: 1.3.1, 1.4.0

> Cannot use partition columns in where clause
> 
>
> Key: SPARK-6554
> URL: https://issues.apache.org/jira/browse/SPARK-6554
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.3.0
>Reporter: Jon Chase
>Assignee: Cheng Lian
>
> I'm having trouble referencing partition columns in my queries with Parquet.  
> In the following example, 'probeTypeId' is a partition column.  For example, 
> the directory structure looks like this:
> {noformat}
> /mydata
> /probeTypeId=1
> ...files...
> /probeTypeId=2
> ...files...
> {noformat}
> I see the column when I load a DF using the /mydata directory and 
> call df.printSchema():
> {noformat}
>  |-- probeTypeId: integer (nullable = true)
> {noformat}
> Parquet is also aware of the column:
> {noformat}
>  optional int32 probeTypeId;
> {noformat}
> And this works fine:
> {code}
> sqlContext.sql("select probeTypeId from df limit 1");
> {code}
> ...as does {{df.show()}} - it shows the correct values for the partition 
> column.
> However, when I try to use a partition column in a where clause, I get an 
> exception stating that the column was not found in the schema:
> {noformat}
> sqlContext.sql("select probeTypeId from df where probeTypeId = 1 limit 1");
> ...
> ...
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
> stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 
> (TID 0, localhost): java.lang.IllegalArgumentException: Column [probeTypeId] 
> was not found in schema!
>   at parquet.Preconditions.checkArgument(Preconditions.java:47)
>   at 
> parquet.filter2.predicate.SchemaCompatibilityValidator.getColumnDescriptor(SchemaCompatibilityValidator.java:172)
>   at 
> parquet.filter2.predicate.SchemaCompatibilityValidator.validateColumn(SchemaCompatibilityValidator.java:160)
>   at 
> parquet.filter2.predicate.SchemaCompatibilityValidator.validateColumnFilterPredicate(SchemaCompatibilityValidator.java:142)
>   at 
> parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:76)
>   at 
> parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:41)
>   at parquet.filter2.predicate.Operators$Eq.accept(Operators.java:162)
>   at 
> parquet.filter2.predicate.SchemaCompatibilityValidator.validate(SchemaCompatibilityValidator.java:46)
>   at parquet.filter2.compat.RowGroupFilter.visit(RowGroupFilter.java:41)
>   at parquet.filter2.compat.RowGroupFilter.visit(RowGroupFilter.java:22)
>   at 
> parquet.filter2.compat.FilterCompat$FilterPredicateCompat.accept(FilterCompat.java:108)
>   at 
> parquet.filter2.compat.RowGroupFilter.filterRowGroups(RowGroupFilter.java:28)
>   at 
> parquet.hadoop.ParquetRecordReader.initializeInternalReader(ParquetRecordReader.java:158)
> ...
> ...
> {noformat}
> Here's the full stack trace:
> {noformat}
> using local[*] for master
> 06:05:55,675 |-INFO in 
> ch.qos.logback.classic.joran.action.ConfigurationAction - debug attribute not 
> set
> 06:05:55,683 |-INFO in ch.qos.logback.core.joran.action.AppenderAction - 
> About to instantiate appender of type [ch.qos.logback.core.ConsoleAppender]
> 06:05:55,694 |-INFO in ch.qos.logback.core.joran.action.AppenderAction - 
> Naming appender as [STDOUT]
> 06:05:55,721 |-INFO in 
> ch.qos.logback.core.joran.action.NestedComplexPropertyIA - Assuming default 
> type [ch.qos.logback.classic.encoder.PatternLayoutEncoder] for [encoder] 
> property
> 06:05:55,768 |-INFO in ch.qos.logback.classic.joran.action.RootLoggerAction - 
> Setting level of ROOT logger to INFO
> 06:05:55,768 |-INFO in ch.qos.logback.core.joran.action.AppenderRefAction - 
> Attaching appender named [STDOUT] to Logger[ROOT]
> 06:05:55,769 |-INFO in 
> ch.qos.logback.classic.joran.action.ConfigurationAction - End of 
> configuration.
> 06:05:55,770 |-INFO in 
> ch.qos.logback.classic.joran.JoranConfigurator@6aaceffd - Registering current 
> configuration as safe fallback point
> INFO  org.apache.spark.SparkContext Running Spark version 1.3.0
> WARN  o.a.hadoop.util.NativeCodeLoader Unable to load native-hadoop library 
> for your platform... using builtin-java classes where applicable
> INFO  org.apache.spark.SecurityManager Changing view acls to: jon
> INFO  org.apache.spark.SecurityManager Changing modify acls to: jon
> INFO  org.apache.spark.SecurityManager SecurityManager: authentication 
> disabled; ui acls disabled; users with view permissions: Set(jon); users with 
> modify permissions: Set(jon)
> INFO  akka.event.slf4j.Slf4jLogger Slf4jLogger started

[jira] [Assigned] (SPARK-6554) Cannot use partition columns in where clause

2015-03-26 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6554?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian reassigned SPARK-6554:
-

Assignee: Cheng Lian

> Cannot use partition columns in where clause
> 
>
> Key: SPARK-6554
> URL: https://issues.apache.org/jira/browse/SPARK-6554
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.3.0
>Reporter: Jon Chase
>Assignee: Cheng Lian
>
> I'm having trouble referencing partition columns in my queries with Parquet.  
> In the following example, 'probeTypeId' is a partition column.  For example, 
> the directory structure looks like this:
> {noformat}
> /mydata
> /probeTypeId=1
> ...files...
> /probeTypeId=2
> ...files...
> {noformat}
> I see the column when I load a DF using the /mydata directory and 
> call df.printSchema():
> {noformat}
>  |-- probeTypeId: integer (nullable = true)
> {noformat}
> Parquet is also aware of the column:
> {noformat}
>  optional int32 probeTypeId;
> {noformat}
> And this works fine:
> {code}
> sqlContext.sql("select probeTypeId from df limit 1");
> {code}
> ...as does {{df.show()}} - it shows the correct values for the partition 
> column.
> However, when I try to use a partition column in a where clause, I get an 
> exception stating that the column was not found in the schema:
> {noformat}
> sqlContext.sql("select probeTypeId from df where probeTypeId = 1 limit 1");
> ...
> ...
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
> stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 
> (TID 0, localhost): java.lang.IllegalArgumentException: Column [probeTypeId] 
> was not found in schema!
>   at parquet.Preconditions.checkArgument(Preconditions.java:47)
>   at 
> parquet.filter2.predicate.SchemaCompatibilityValidator.getColumnDescriptor(SchemaCompatibilityValidator.java:172)
>   at 
> parquet.filter2.predicate.SchemaCompatibilityValidator.validateColumn(SchemaCompatibilityValidator.java:160)
>   at 
> parquet.filter2.predicate.SchemaCompatibilityValidator.validateColumnFilterPredicate(SchemaCompatibilityValidator.java:142)
>   at 
> parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:76)
>   at 
> parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:41)
>   at parquet.filter2.predicate.Operators$Eq.accept(Operators.java:162)
>   at 
> parquet.filter2.predicate.SchemaCompatibilityValidator.validate(SchemaCompatibilityValidator.java:46)
>   at parquet.filter2.compat.RowGroupFilter.visit(RowGroupFilter.java:41)
>   at parquet.filter2.compat.RowGroupFilter.visit(RowGroupFilter.java:22)
>   at 
> parquet.filter2.compat.FilterCompat$FilterPredicateCompat.accept(FilterCompat.java:108)
>   at 
> parquet.filter2.compat.RowGroupFilter.filterRowGroups(RowGroupFilter.java:28)
>   at 
> parquet.hadoop.ParquetRecordReader.initializeInternalReader(ParquetRecordReader.java:158)
> ...
> ...
> {noformat}
> Here's the full stack trace:
> {noformat}
> using local[*] for master
> 06:05:55,675 |-INFO in 
> ch.qos.logback.classic.joran.action.ConfigurationAction - debug attribute not 
> set
> 06:05:55,683 |-INFO in ch.qos.logback.core.joran.action.AppenderAction - 
> About to instantiate appender of type [ch.qos.logback.core.ConsoleAppender]
> 06:05:55,694 |-INFO in ch.qos.logback.core.joran.action.AppenderAction - 
> Naming appender as [STDOUT]
> 06:05:55,721 |-INFO in 
> ch.qos.logback.core.joran.action.NestedComplexPropertyIA - Assuming default 
> type [ch.qos.logback.classic.encoder.PatternLayoutEncoder] for [encoder] 
> property
> 06:05:55,768 |-INFO in ch.qos.logback.classic.joran.action.RootLoggerAction - 
> Setting level of ROOT logger to INFO
> 06:05:55,768 |-INFO in ch.qos.logback.core.joran.action.AppenderRefAction - 
> Attaching appender named [STDOUT] to Logger[ROOT]
> 06:05:55,769 |-INFO in 
> ch.qos.logback.classic.joran.action.ConfigurationAction - End of 
> configuration.
> 06:05:55,770 |-INFO in 
> ch.qos.logback.classic.joran.JoranConfigurator@6aaceffd - Registering current 
> configuration as safe fallback point
> INFO  org.apache.spark.SparkContext Running Spark version 1.3.0
> WARN  o.a.hadoop.util.NativeCodeLoader Unable to load native-hadoop library 
> for your platform... using builtin-java classes where applicable
> INFO  org.apache.spark.SecurityManager Changing view acls to: jon
> INFO  org.apache.spark.SecurityManager Changing modify acls to: jon
> INFO  org.apache.spark.SecurityManager SecurityManager: authentication 
> disabled; ui acls disabled; users with view permissions: Set(jon); users with 
> modify permissions: Set(jon)
> INFO  akka.event.slf4j.Slf4jLogger Slf4jLogger started
> 

[jira] [Commented] (SPARK-6554) Cannot use partition columns in where clause

2015-03-26 Thread Cheng Lian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6554?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14381788#comment-14381788
 ] 

Cheng Lian commented on SPARK-6554:
---

Hi [~jonchase], did you happen to turn on Parquet filter push-down by setting 
"spark.sql.parquet.filterPushdown" to true? The reason behind this is that, in 
your case, the partition column doesn't exist in the Parquet data files, so the 
Parquet filter push-down logic sees it as an invalid column. We should remove 
all predicates that touch partition columns that don't exist in the Parquet 
data files before applying the push-down optimization.
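
As a stopgap until a fix lands, a minimal sketch of the workaround implied here 
(assuming an existing Scala {{SQLContext}} bound to {{sqlContext}}; the conf key 
is the one named above, and the query is the reporter's own):

{code}
// Workaround sketch (assumption: a plain SQLContext named sqlContext):
// keep Parquet filter push-down disabled until the fix is merged.
sqlContext.setConf("spark.sql.parquet.filterPushdown", "false")

// Queries filtering on partition columns should then run without the
// "Column [...] was not found in schema!" failure.
sqlContext.sql("select probeTypeId from df where probeTypeId = 1 limit 1")
{code}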

> Cannot use partition columns in where clause
> 
>
> Key: SPARK-6554
> URL: https://issues.apache.org/jira/browse/SPARK-6554
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.3.0
>Reporter: Jon Chase
>Assignee: Cheng Lian
>
> I'm having trouble referencing partition columns in my queries with Parquet.  
> In the following example, 'probeTypeId' is a partition column.  For example, 
> the directory structure looks like this:
> {noformat}
> /mydata
> /probeTypeId=1
> ...files...
> /probeTypeId=2
> ...files...
> {noformat}
> I see the column when I load a DF using the /mydata directory and 
> call df.printSchema():
> {noformat}
>  |-- probeTypeId: integer (nullable = true)
> {noformat}
> Parquet is also aware of the column:
> {noformat}
>  optional int32 probeTypeId;
> {noformat}
> And this works fine:
> {code}
> sqlContext.sql("select probeTypeId from df limit 1");
> {code}
> ...as does {{df.show()}} - it shows the correct values for the partition 
> column.
> However, when I try to use a partition column in a where clause, I get an 
> exception stating that the column was not found in the schema:
> {noformat}
> sqlContext.sql("select probeTypeId from df where probeTypeId = 1 limit 1");
> ...
> ...
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
> stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 
> (TID 0, localhost): java.lang.IllegalArgumentException: Column [probeTypeId] 
> was not found in schema!
>   at parquet.Preconditions.checkArgument(Preconditions.java:47)
>   at 
> parquet.filter2.predicate.SchemaCompatibilityValidator.getColumnDescriptor(SchemaCompatibilityValidator.java:172)
>   at 
> parquet.filter2.predicate.SchemaCompatibilityValidator.validateColumn(SchemaCompatibilityValidator.java:160)
>   at 
> parquet.filter2.predicate.SchemaCompatibilityValidator.validateColumnFilterPredicate(SchemaCompatibilityValidator.java:142)
>   at 
> parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:76)
>   at 
> parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:41)
>   at parquet.filter2.predicate.Operators$Eq.accept(Operators.java:162)
>   at 
> parquet.filter2.predicate.SchemaCompatibilityValidator.validate(SchemaCompatibilityValidator.java:46)
>   at parquet.filter2.compat.RowGroupFilter.visit(RowGroupFilter.java:41)
>   at parquet.filter2.compat.RowGroupFilter.visit(RowGroupFilter.java:22)
>   at 
> parquet.filter2.compat.FilterCompat$FilterPredicateCompat.accept(FilterCompat.java:108)
>   at 
> parquet.filter2.compat.RowGroupFilter.filterRowGroups(RowGroupFilter.java:28)
>   at 
> parquet.hadoop.ParquetRecordReader.initializeInternalReader(ParquetRecordReader.java:158)
> ...
> ...
> {noformat}
> Here's the full stack trace:
> {noformat}
> using local[*] for master
> 06:05:55,675 |-INFO in 
> ch.qos.logback.classic.joran.action.ConfigurationAction - debug attribute not 
> set
> 06:05:55,683 |-INFO in ch.qos.logback.core.joran.action.AppenderAction - 
> About to instantiate appender of type [ch.qos.logback.core.ConsoleAppender]
> 06:05:55,694 |-INFO in ch.qos.logback.core.joran.action.AppenderAction - 
> Naming appender as [STDOUT]
> 06:05:55,721 |-INFO in 
> ch.qos.logback.core.joran.action.NestedComplexPropertyIA - Assuming default 
> type [ch.qos.logback.classic.encoder.PatternLayoutEncoder] for [encoder] 
> property
> 06:05:55,768 |-INFO in ch.qos.logback.classic.joran.action.RootLoggerAction - 
> Setting level of ROOT logger to INFO
> 06:05:55,768 |-INFO in ch.qos.logback.core.joran.action.AppenderRefAction - 
> Attaching appender named [STDOUT] to Logger[ROOT]
> 06:05:55,769 |-INFO in 
> ch.qos.logback.classic.joran.action.ConfigurationAction - End of 
> configuration.
> 06:05:55,770 |-INFO in 
> ch.qos.logback.classic.joran.JoranConfigurator@6aaceffd - Registering current 
> configuration as safe fallback point
> INFO  org.apache.spark.SparkContext Running Spark version 1.3.0
> WARN  o.a.hadoop.util.NativeCodeLoader Unable to loa

[jira] [Updated] (SPARK-6554) Cannot use partition columns in where clause

2015-03-26 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6554?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-6554:
--
Priority: Critical  (was: Major)

> Cannot use partition columns in where clause
> 
>
> Key: SPARK-6554
> URL: https://issues.apache.org/jira/browse/SPARK-6554
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.3.0
>Reporter: Jon Chase
>Assignee: Cheng Lian
>Priority: Critical
>
> I'm having trouble referencing partition columns in my queries with Parquet.  
> In the following example, 'probeTypeId' is a partition column.  For example, 
> the directory structure looks like this:
> {noformat}
> /mydata
> /probeTypeId=1
> ...files...
> /probeTypeId=2
> ...files...
> {noformat}
> I see the column when I load a DF using the /mydata directory and 
> call df.printSchema():
> {noformat}
>  |-- probeTypeId: integer (nullable = true)
> {noformat}
> Parquet is also aware of the column:
> {noformat}
>  optional int32 probeTypeId;
> {noformat}
> And this works fine:
> {code}
> sqlContext.sql("select probeTypeId from df limit 1");
> {code}
> ...as does {{df.show()}} - it shows the correct values for the partition 
> column.
> However, when I try to use a partition column in a where clause, I get an 
> exception stating that the column was not found in the schema:
> {noformat}
> sqlContext.sql("select probeTypeId from df where probeTypeId = 1 limit 1");
> ...
> ...
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
> stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 
> (TID 0, localhost): java.lang.IllegalArgumentException: Column [probeTypeId] 
> was not found in schema!
>   at parquet.Preconditions.checkArgument(Preconditions.java:47)
>   at 
> parquet.filter2.predicate.SchemaCompatibilityValidator.getColumnDescriptor(SchemaCompatibilityValidator.java:172)
>   at 
> parquet.filter2.predicate.SchemaCompatibilityValidator.validateColumn(SchemaCompatibilityValidator.java:160)
>   at 
> parquet.filter2.predicate.SchemaCompatibilityValidator.validateColumnFilterPredicate(SchemaCompatibilityValidator.java:142)
>   at 
> parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:76)
>   at 
> parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:41)
>   at parquet.filter2.predicate.Operators$Eq.accept(Operators.java:162)
>   at 
> parquet.filter2.predicate.SchemaCompatibilityValidator.validate(SchemaCompatibilityValidator.java:46)
>   at parquet.filter2.compat.RowGroupFilter.visit(RowGroupFilter.java:41)
>   at parquet.filter2.compat.RowGroupFilter.visit(RowGroupFilter.java:22)
>   at 
> parquet.filter2.compat.FilterCompat$FilterPredicateCompat.accept(FilterCompat.java:108)
>   at 
> parquet.filter2.compat.RowGroupFilter.filterRowGroups(RowGroupFilter.java:28)
>   at 
> parquet.hadoop.ParquetRecordReader.initializeInternalReader(ParquetRecordReader.java:158)
> ...
> ...
> {noformat}
> Here's the full stack trace:
> {noformat}
> using local[*] for master
> 06:05:55,675 |-INFO in 
> ch.qos.logback.classic.joran.action.ConfigurationAction - debug attribute not 
> set
> 06:05:55,683 |-INFO in ch.qos.logback.core.joran.action.AppenderAction - 
> About to instantiate appender of type [ch.qos.logback.core.ConsoleAppender]
> 06:05:55,694 |-INFO in ch.qos.logback.core.joran.action.AppenderAction - 
> Naming appender as [STDOUT]
> 06:05:55,721 |-INFO in 
> ch.qos.logback.core.joran.action.NestedComplexPropertyIA - Assuming default 
> type [ch.qos.logback.classic.encoder.PatternLayoutEncoder] for [encoder] 
> property
> 06:05:55,768 |-INFO in ch.qos.logback.classic.joran.action.RootLoggerAction - 
> Setting level of ROOT logger to INFO
> 06:05:55,768 |-INFO in ch.qos.logback.core.joran.action.AppenderRefAction - 
> Attaching appender named [STDOUT] to Logger[ROOT]
> 06:05:55,769 |-INFO in 
> ch.qos.logback.classic.joran.action.ConfigurationAction - End of 
> configuration.
> 06:05:55,770 |-INFO in 
> ch.qos.logback.classic.joran.JoranConfigurator@6aaceffd - Registering current 
> configuration as safe fallback point
> INFO  org.apache.spark.SparkContext Running Spark version 1.3.0
> WARN  o.a.hadoop.util.NativeCodeLoader Unable to load native-hadoop library 
> for your platform... using builtin-java classes where applicable
> INFO  org.apache.spark.SecurityManager Changing view acls to: jon
> INFO  org.apache.spark.SecurityManager Changing modify acls to: jon
> INFO  org.apache.spark.SecurityManager SecurityManager: authentication 
> disabled; ui acls disabled; users with view permissions: Set(jon); users with 
> modify permissions: Set(jon)
> INFO  akka.event.slf

[jira] [Updated] (SPARK-6554) Cannot use partition columns in where clause when Parquet filter push-down is enabled

2015-03-26 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6554?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-6554:
--
Summary: Cannot use partition columns in where clause when Parquet filter 
push-down is enabled  (was: Cannot use partition columns in where clause)

> Cannot use partition columns in where clause when Parquet filter push-down is 
> enabled
> -
>
> Key: SPARK-6554
> URL: https://issues.apache.org/jira/browse/SPARK-6554
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.3.0
>Reporter: Jon Chase
>Assignee: Cheng Lian
>Priority: Critical
>
> I'm having trouble referencing partition columns in my queries with Parquet.  
> In the following example, 'probeTypeId' is a partition column.  For example, 
> the directory structure looks like this:
> {noformat}
> /mydata
> /probeTypeId=1
> ...files...
> /probeTypeId=2
> ...files...
> {noformat}
> I see the column when I load a DF using the /mydata directory and 
> call df.printSchema():
> {noformat}
>  |-- probeTypeId: integer (nullable = true)
> {noformat}
> Parquet is also aware of the column:
> {noformat}
>  optional int32 probeTypeId;
> {noformat}
> And this works fine:
> {code}
> sqlContext.sql("select probeTypeId from df limit 1");
> {code}
> ...as does {{df.show()}} - it shows the correct values for the partition 
> column.
> However, when I try to use a partition column in a where clause, I get an 
> exception stating that the column was not found in the schema:
> {noformat}
> sqlContext.sql("select probeTypeId from df where probeTypeId = 1 limit 1");
> ...
> ...
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
> stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 
> (TID 0, localhost): java.lang.IllegalArgumentException: Column [probeTypeId] 
> was not found in schema!
>   at parquet.Preconditions.checkArgument(Preconditions.java:47)
>   at 
> parquet.filter2.predicate.SchemaCompatibilityValidator.getColumnDescriptor(SchemaCompatibilityValidator.java:172)
>   at 
> parquet.filter2.predicate.SchemaCompatibilityValidator.validateColumn(SchemaCompatibilityValidator.java:160)
>   at 
> parquet.filter2.predicate.SchemaCompatibilityValidator.validateColumnFilterPredicate(SchemaCompatibilityValidator.java:142)
>   at 
> parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:76)
>   at 
> parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:41)
>   at parquet.filter2.predicate.Operators$Eq.accept(Operators.java:162)
>   at 
> parquet.filter2.predicate.SchemaCompatibilityValidator.validate(SchemaCompatibilityValidator.java:46)
>   at parquet.filter2.compat.RowGroupFilter.visit(RowGroupFilter.java:41)
>   at parquet.filter2.compat.RowGroupFilter.visit(RowGroupFilter.java:22)
>   at 
> parquet.filter2.compat.FilterCompat$FilterPredicateCompat.accept(FilterCompat.java:108)
>   at 
> parquet.filter2.compat.RowGroupFilter.filterRowGroups(RowGroupFilter.java:28)
>   at 
> parquet.hadoop.ParquetRecordReader.initializeInternalReader(ParquetRecordReader.java:158)
> ...
> ...
> {noformat}
> Here's the full stack trace:
> {noformat}
> using local[*] for master
> 06:05:55,675 |-INFO in 
> ch.qos.logback.classic.joran.action.ConfigurationAction - debug attribute not 
> set
> 06:05:55,683 |-INFO in ch.qos.logback.core.joran.action.AppenderAction - 
> About to instantiate appender of type [ch.qos.logback.core.ConsoleAppender]
> 06:05:55,694 |-INFO in ch.qos.logback.core.joran.action.AppenderAction - 
> Naming appender as [STDOUT]
> 06:05:55,721 |-INFO in 
> ch.qos.logback.core.joran.action.NestedComplexPropertyIA - Assuming default 
> type [ch.qos.logback.classic.encoder.PatternLayoutEncoder] for [encoder] 
> property
> 06:05:55,768 |-INFO in ch.qos.logback.classic.joran.action.RootLoggerAction - 
> Setting level of ROOT logger to INFO
> 06:05:55,768 |-INFO in ch.qos.logback.core.joran.action.AppenderRefAction - 
> Attaching appender named [STDOUT] to Logger[ROOT]
> 06:05:55,769 |-INFO in 
> ch.qos.logback.classic.joran.action.ConfigurationAction - End of 
> configuration.
> 06:05:55,770 |-INFO in 
> ch.qos.logback.classic.joran.JoranConfigurator@6aaceffd - Registering current 
> configuration as safe fallback point
> INFO  org.apache.spark.SparkContext Running Spark version 1.3.0
> WARN  o.a.hadoop.util.NativeCodeLoader Unable to load native-hadoop library 
> for your platform... using builtin-java classes where applicable
> INFO  org.apache.spark.SecurityManager Changing view acls to: jon
> INFO  org.apache.spark.SecurityManager Changing modify acls to: jon
> INFO 

[jira] [Commented] (SPARK-6435) spark-shell --jars option does not add all jars to classpath

2015-03-26 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6435?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14381797#comment-14381797
 ] 

Sean Owen commented on SPARK-6435:
--

Yes, this is a moot point in 1.4 and later, but I'd love to get a working 
solution for 1.3, even if it's a bit hacky. [~tsudukim], do you think the 
square-bracket syntax will fail to work in some cases? At least, it seems to 
fix this issue. [~vjapache], does the alternative syntax above work, without 
the "x"?

> spark-shell --jars option does not add all jars to classpath
> 
>
> Key: SPARK-6435
> URL: https://issues.apache.org/jira/browse/SPARK-6435
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell, Windows
>Affects Versions: 1.3.0
> Environment: Win64
>Reporter: vijay
>
> Not all jars supplied via the --jars option will be added to the driver (and 
> presumably executor) classpath.  The first jar(s) will be added, but not all.
> To reproduce this, just add a few jars (I tested 5) to the --jars option, and 
> then try to import a class from the last jar.  This fails.  A simple 
> reproducer: 
> Create a bunch of dummy jars:
> jar cfM jar1.jar log.txt
> jar cfM jar2.jar log.txt
> jar cfM jar3.jar log.txt
> jar cfM jar4.jar log.txt
> Start the spark-shell with the dummy jars and guava at the end:
> %SPARK_HOME%\bin\spark-shell --master local --jars 
> jar1.jar,jar2.jar,jar3.jar,jar4.jar,c:\code\lib\guava-14.0.1.jar
> In the shell, try importing from guava; you'll get an error:
> {code}
> scala> import com.google.common.base.Strings
> :19: error: object Strings is not a member of package 
> com.google.common.base
>import com.google.common.base.Strings
>   ^
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6551) Incorrect aggregate results if seqOp(...) mutates its first argument

2015-03-26 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6551?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14381807#comment-14381807
 ] 

Sean Owen commented on SPARK-6551:
--

FWIW an equivalent example works as expected in Scala, and the scaladoc says 
you should be able to mutate the first argument. 

{code}
val data = sc.parallelize(0 until 10)

def seqOp(a: Array[Int], b: Int) = {
  a(0) += 1
  a
}

def combOp(a: Array[Int], b: Array[Int]) = {
  a(0) += b(0)
  a
}

data.aggregate(new Array[Int](1))(seqOp, combOp)

10
{code}

> Incorrect aggregate results if seqOp(...) mutates its first argument
> 
>
> Key: SPARK-6551
> URL: https://issues.apache.org/jira/browse/SPARK-6551
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.3.0
> Environment: Amazon EMR, AMI version 3.5
>Reporter: Jarno Seppanen
>
> The Python RDD.aggregate method doesn't match its documentation w.r.t. seqOp 
> mutating its first argument:
> * the results are incorrect if seqOp mutates its first argument
> * additionally, the zero value is modified if combOp mutates its first 
> argument (this is slightly surprising and would be nice to document)
> I'm aggregating the RDD into a nontrivial data structure, and it would be 
> wasteful to copy the whole data structure into a new instance in every seqOp, 
> so mutation is an important feature.
> I'm seeing the following behavior:
> {code}
> def inc_mutate(counter, item):
> counter[0] += 1
> return counter
> def inc_pure(counter, item):
> return [counter[0] + 1]
> def merge_mutate(c1, c2):
> c1[0] += c2[0]
> return c1
> def merge_pure(c1, c2):
> return [c1[0] + c2[0]]
> # correct answer, when neither function mutates their arguments
> init = [0]
> sc.parallelize(range(10)).aggregate(init, inc_pure, merge_pure)
> # [10]
> init
> # [0]
> # incorrect answer if seqOp mutates its first argument
> init = [0]
> sc.parallelize(range(10)).aggregate(init, inc_mutate, merge_pure)
> # [20] <- WRONG
> init
> # [0]
> # zero value is modified if combOp mutates its first argument
> init = [0]
> sc.parallelize(range(10)).aggregate(init, inc_pure, merge_mutate)
> # [10]
> init
> # [10]
> # for completeness
> init = [0]
> sc.parallelize(range(10)).aggregate(init, inc_mutate, merge_mutate)
> # [20]
> init
> # [20]
> {code}
> I'm running on an EMR cluster launched with:
> {code}
> aws emr create-cluster --name jarno-spark \
>  --ami-version 3.5 \
>  --instance-type c3.8xlarge \
>  --instance-count 5 \
>  --ec2-attributes KeyName=foo \
>  --applications Name=Ganglia \
>  --log-uri s3://foo/log \
>  --bootstrap-actions 
> Path=s3://support.elasticmapreduce/spark/install-spark,Args=[-g,-x,-l,ERROR]
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6481) Set "In Progress" when a PR is opened for an issue

2015-03-26 Thread Cheng Lian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14381811#comment-14381811
 ] 

Cheng Lian commented on SPARK-6481:
---

Aha, so I'm not the only one! Although I just started doing this pretty 
recently :P 

> Set "In Progress" when a PR is opened for an issue
> --
>
> Key: SPARK-6481
> URL: https://issues.apache.org/jira/browse/SPARK-6481
> Project: Spark
>  Issue Type: Bug
>  Components: Project Infra
>Reporter: Michael Armbrust
>Assignee: Nicholas Chammas
>
> [~pwendell] and I are not sure if this is possible, but it would be really 
> helpful if the JIRA status was updated to "In Progress" when we do the 
> linking to an open pull request.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6554) Cannot use partition columns in where clause when Parquet filter push-down is enabled

2015-03-26 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6554?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14381814#comment-14381814
 ] 

Apache Spark commented on SPARK-6554:
-

User 'liancheng' has created a pull request for this issue:
https://github.com/apache/spark/pull/5210

> Cannot use partition columns in where clause when Parquet filter push-down is 
> enabled
> -
>
> Key: SPARK-6554
> URL: https://issues.apache.org/jira/browse/SPARK-6554
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.3.0
>Reporter: Jon Chase
>Assignee: Cheng Lian
>Priority: Critical
>
> I'm having trouble referencing partition columns in my queries with Parquet.  
> In the following example, 'probeTypeId' is a partition column.  For example, 
> the directory structure looks like this:
> {noformat}
> /mydata
> /probeTypeId=1
> ...files...
> /probeTypeId=2
> ...files...
> {noformat}
> I see the column when I load a DF using the /mydata directory and 
> call df.printSchema():
> {noformat}
>  |-- probeTypeId: integer (nullable = true)
> {noformat}
> Parquet is also aware of the column:
> {noformat}
>  optional int32 probeTypeId;
> {noformat}
> And this works fine:
> {code}
> sqlContext.sql("select probeTypeId from df limit 1");
> {code}
> ...as does {{df.show()}} - it shows the correct values for the partition 
> column.
> However, when I try to use a partition column in a where clause, I get an 
> exception stating that the column was not found in the schema:
> {noformat}
> sqlContext.sql("select probeTypeId from df where probeTypeId = 1 limit 1");
> ...
> ...
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
> stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 
> (TID 0, localhost): java.lang.IllegalArgumentException: Column [probeTypeId] 
> was not found in schema!
>   at parquet.Preconditions.checkArgument(Preconditions.java:47)
>   at 
> parquet.filter2.predicate.SchemaCompatibilityValidator.getColumnDescriptor(SchemaCompatibilityValidator.java:172)
>   at 
> parquet.filter2.predicate.SchemaCompatibilityValidator.validateColumn(SchemaCompatibilityValidator.java:160)
>   at 
> parquet.filter2.predicate.SchemaCompatibilityValidator.validateColumnFilterPredicate(SchemaCompatibilityValidator.java:142)
>   at 
> parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:76)
>   at 
> parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:41)
>   at parquet.filter2.predicate.Operators$Eq.accept(Operators.java:162)
>   at 
> parquet.filter2.predicate.SchemaCompatibilityValidator.validate(SchemaCompatibilityValidator.java:46)
>   at parquet.filter2.compat.RowGroupFilter.visit(RowGroupFilter.java:41)
>   at parquet.filter2.compat.RowGroupFilter.visit(RowGroupFilter.java:22)
>   at 
> parquet.filter2.compat.FilterCompat$FilterPredicateCompat.accept(FilterCompat.java:108)
>   at 
> parquet.filter2.compat.RowGroupFilter.filterRowGroups(RowGroupFilter.java:28)
>   at 
> parquet.hadoop.ParquetRecordReader.initializeInternalReader(ParquetRecordReader.java:158)
> ...
> ...
> {noformat}
> Here's the full stack trace:
> {noformat}
> using local[*] for master
> 06:05:55,675 |-INFO in 
> ch.qos.logback.classic.joran.action.ConfigurationAction - debug attribute not 
> set
> 06:05:55,683 |-INFO in ch.qos.logback.core.joran.action.AppenderAction - 
> About to instantiate appender of type [ch.qos.logback.core.ConsoleAppender]
> 06:05:55,694 |-INFO in ch.qos.logback.core.joran.action.AppenderAction - 
> Naming appender as [STDOUT]
> 06:05:55,721 |-INFO in 
> ch.qos.logback.core.joran.action.NestedComplexPropertyIA - Assuming default 
> type [ch.qos.logback.classic.encoder.PatternLayoutEncoder] for [encoder] 
> property
> 06:05:55,768 |-INFO in ch.qos.logback.classic.joran.action.RootLoggerAction - 
> Setting level of ROOT logger to INFO
> 06:05:55,768 |-INFO in ch.qos.logback.core.joran.action.AppenderRefAction - 
> Attaching appender named [STDOUT] to Logger[ROOT]
> 06:05:55,769 |-INFO in 
> ch.qos.logback.classic.joran.action.ConfigurationAction - End of 
> configuration.
> 06:05:55,770 |-INFO in 
> ch.qos.logback.classic.joran.JoranConfigurator@6aaceffd - Registering current 
> configuration as safe fallback point
> INFO  org.apache.spark.SparkContext Running Spark version 1.3.0
> WARN  o.a.hadoop.util.NativeCodeLoader Unable to load native-hadoop library 
> for your platform... using builtin-java classes where applicable
> INFO  org.apache.spark.SecurityManager Changing view acls to: jon
> INFO  org.apache.spark.SecurityManager Changing modify acls to: 

[jira] [Assigned] (SPARK-6554) Cannot use partition columns in where clause when Parquet filter push-down is enabled

2015-03-26 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6554?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-6554:
---

Assignee: Cheng Lian  (was: Apache Spark)

> Cannot use partition columns in where clause when Parquet filter push-down is 
> enabled
> -
>
> Key: SPARK-6554
> URL: https://issues.apache.org/jira/browse/SPARK-6554
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.3.0
>Reporter: Jon Chase
>Assignee: Cheng Lian
>Priority: Critical
>
> I'm having trouble referencing partition columns in my queries with Parquet.  
> In the following example, 'probeTypeId' is a partition column.  For example, 
> the directory structure looks like this:
> {noformat}
> /mydata
> /probeTypeId=1
> ...files...
> /probeTypeId=2
> ...files...
> {noformat}
> I see the column when I load a DF using the /mydata directory and 
> call df.printSchema():
> {noformat}
>  |-- probeTypeId: integer (nullable = true)
> {noformat}
> Parquet is also aware of the column:
> {noformat}
>  optional int32 probeTypeId;
> {noformat}
> And this works fine:
> {code}
> sqlContext.sql("select probeTypeId from df limit 1");
> {code}
> ...as does {{df.show()}} - it shows the correct values for the partition 
> column.
> However, when I try to use a partition column in a where clause, I get an 
> exception stating that the column was not found in the schema:
> {noformat}
> sqlContext.sql("select probeTypeId from df where probeTypeId = 1 limit 1");
> ...
> ...
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
> stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 
> (TID 0, localhost): java.lang.IllegalArgumentException: Column [probeTypeId] 
> was not found in schema!
>   at parquet.Preconditions.checkArgument(Preconditions.java:47)
>   at 
> parquet.filter2.predicate.SchemaCompatibilityValidator.getColumnDescriptor(SchemaCompatibilityValidator.java:172)
>   at 
> parquet.filter2.predicate.SchemaCompatibilityValidator.validateColumn(SchemaCompatibilityValidator.java:160)
>   at 
> parquet.filter2.predicate.SchemaCompatibilityValidator.validateColumnFilterPredicate(SchemaCompatibilityValidator.java:142)
>   at 
> parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:76)
>   at 
> parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:41)
>   at parquet.filter2.predicate.Operators$Eq.accept(Operators.java:162)
>   at 
> parquet.filter2.predicate.SchemaCompatibilityValidator.validate(SchemaCompatibilityValidator.java:46)
>   at parquet.filter2.compat.RowGroupFilter.visit(RowGroupFilter.java:41)
>   at parquet.filter2.compat.RowGroupFilter.visit(RowGroupFilter.java:22)
>   at 
> parquet.filter2.compat.FilterCompat$FilterPredicateCompat.accept(FilterCompat.java:108)
>   at 
> parquet.filter2.compat.RowGroupFilter.filterRowGroups(RowGroupFilter.java:28)
>   at 
> parquet.hadoop.ParquetRecordReader.initializeInternalReader(ParquetRecordReader.java:158)
> ...
> ...
> {noformat}
> Here's the full stack trace:
> {noformat}
> using local[*] for master
> 06:05:55,675 |-INFO in 
> ch.qos.logback.classic.joran.action.ConfigurationAction - debug attribute not 
> set
> 06:05:55,683 |-INFO in ch.qos.logback.core.joran.action.AppenderAction - 
> About to instantiate appender of type [ch.qos.logback.core.ConsoleAppender]
> 06:05:55,694 |-INFO in ch.qos.logback.core.joran.action.AppenderAction - 
> Naming appender as [STDOUT]
> 06:05:55,721 |-INFO in 
> ch.qos.logback.core.joran.action.NestedComplexPropertyIA - Assuming default 
> type [ch.qos.logback.classic.encoder.PatternLayoutEncoder] for [encoder] 
> property
> 06:05:55,768 |-INFO in ch.qos.logback.classic.joran.action.RootLoggerAction - 
> Setting level of ROOT logger to INFO
> 06:05:55,768 |-INFO in ch.qos.logback.core.joran.action.AppenderRefAction - 
> Attaching appender named [STDOUT] to Logger[ROOT]
> 06:05:55,769 |-INFO in 
> ch.qos.logback.classic.joran.action.ConfigurationAction - End of 
> configuration.
> 06:05:55,770 |-INFO in 
> ch.qos.logback.classic.joran.JoranConfigurator@6aaceffd - Registering current 
> configuration as safe fallback point
> INFO  org.apache.spark.SparkContext Running Spark version 1.3.0
> WARN  o.a.hadoop.util.NativeCodeLoader Unable to load native-hadoop library 
> for your platform... using builtin-java classes where applicable
> INFO  org.apache.spark.SecurityManager Changing view acls to: jon
> INFO  org.apache.spark.SecurityManager Changing modify acls to: jon
> INFO  org.apache.spark.SecurityManager SecurityManager: authentication 
> disabled; ui acls disabled;

[jira] [Assigned] (SPARK-6554) Cannot use partition columns in where clause when Parquet filter push-down is enabled

2015-03-26 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6554?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-6554:
---

Assignee: Apache Spark  (was: Cheng Lian)

> Cannot use partition columns in where clause when Parquet filter push-down is 
> enabled
> -
>
> Key: SPARK-6554
> URL: https://issues.apache.org/jira/browse/SPARK-6554
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.3.0
>Reporter: Jon Chase
>Assignee: Apache Spark
>Priority: Critical
>
> I'm having trouble referencing partition columns in my queries with Parquet.  
> In the following example, 'probeTypeId' is a partition column.  For example, 
> the directory structure looks like this:
> {noformat}
> /mydata
> /probeTypeId=1
> ...files...
> /probeTypeId=2
> ...files...
> {noformat}
> I see the column when I load a DF using the /mydata directory and 
> call df.printSchema():
> {noformat}
>  |-- probeTypeId: integer (nullable = true)
> {noformat}
> Parquet is also aware of the column:
> {noformat}
>  optional int32 probeTypeId;
> {noformat}
> And this works fine:
> {code}
> sqlContext.sql("select probeTypeId from df limit 1");
> {code}
> ...as does {{df.show()}} - it shows the correct values for the partition 
> column.
> However, when I try to use a partition column in a where clause, I get an 
> exception stating that the column was not found in the schema:
> {noformat}
> sqlContext.sql("select probeTypeId from df where probeTypeId = 1 limit 1");
> ...
> ...
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
> stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 
> (TID 0, localhost): java.lang.IllegalArgumentException: Column [probeTypeId] 
> was not found in schema!
>   at parquet.Preconditions.checkArgument(Preconditions.java:47)
>   at 
> parquet.filter2.predicate.SchemaCompatibilityValidator.getColumnDescriptor(SchemaCompatibilityValidator.java:172)
>   at 
> parquet.filter2.predicate.SchemaCompatibilityValidator.validateColumn(SchemaCompatibilityValidator.java:160)
>   at 
> parquet.filter2.predicate.SchemaCompatibilityValidator.validateColumnFilterPredicate(SchemaCompatibilityValidator.java:142)
>   at 
> parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:76)
>   at 
> parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:41)
>   at parquet.filter2.predicate.Operators$Eq.accept(Operators.java:162)
>   at 
> parquet.filter2.predicate.SchemaCompatibilityValidator.validate(SchemaCompatibilityValidator.java:46)
>   at parquet.filter2.compat.RowGroupFilter.visit(RowGroupFilter.java:41)
>   at parquet.filter2.compat.RowGroupFilter.visit(RowGroupFilter.java:22)
>   at 
> parquet.filter2.compat.FilterCompat$FilterPredicateCompat.accept(FilterCompat.java:108)
>   at 
> parquet.filter2.compat.RowGroupFilter.filterRowGroups(RowGroupFilter.java:28)
>   at 
> parquet.hadoop.ParquetRecordReader.initializeInternalReader(ParquetRecordReader.java:158)
> ...
> ...
> {noformat}
> Here's the full stack trace:
> {noformat}
> using local[*] for master
> 06:05:55,675 |-INFO in 
> ch.qos.logback.classic.joran.action.ConfigurationAction - debug attribute not 
> set
> 06:05:55,683 |-INFO in ch.qos.logback.core.joran.action.AppenderAction - 
> About to instantiate appender of type [ch.qos.logback.core.ConsoleAppender]
> 06:05:55,694 |-INFO in ch.qos.logback.core.joran.action.AppenderAction - 
> Naming appender as [STDOUT]
> 06:05:55,721 |-INFO in 
> ch.qos.logback.core.joran.action.NestedComplexPropertyIA - Assuming default 
> type [ch.qos.logback.classic.encoder.PatternLayoutEncoder] for [encoder] 
> property
> 06:05:55,768 |-INFO in ch.qos.logback.classic.joran.action.RootLoggerAction - 
> Setting level of ROOT logger to INFO
> 06:05:55,768 |-INFO in ch.qos.logback.core.joran.action.AppenderRefAction - 
> Attaching appender named [STDOUT] to Logger[ROOT]
> 06:05:55,769 |-INFO in 
> ch.qos.logback.classic.joran.action.ConfigurationAction - End of 
> configuration.
> 06:05:55,770 |-INFO in 
> ch.qos.logback.classic.joran.JoranConfigurator@6aaceffd - Registering current 
> configuration as safe fallback point
> INFO  org.apache.spark.SparkContext Running Spark version 1.3.0
> WARN  o.a.hadoop.util.NativeCodeLoader Unable to load native-hadoop library 
> for your platform... using builtin-java classes where applicable
> INFO  org.apache.spark.SecurityManager Changing view acls to: jon
> INFO  org.apache.spark.SecurityManager Changing modify acls to: jon
> INFO  org.apache.spark.SecurityManager SecurityManager: authentication 
> disabled; ui acls disable

[jira] [Resolved] (SPARK-6515) Use while(true) in OpenHashSet.getPos

2015-03-26 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6515?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-6515.
--
   Resolution: Fixed
Fix Version/s: 1.4.0

(Looks like this was merged in 
https://github.com/apache/spark/commit/c14ddd97ed662a8429b9b9078bd7c1a5a1dd3d6c 
)

> Use while(true) in OpenHashSet.getPos
> -
>
> Key: SPARK-6515
> URL: https://issues.apache.org/jira/browse/SPARK-6515
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
>Priority: Minor
> Fix For: 1.4.0
>
>
> Though I don't see any bug in the existing code, using `while (true)` makes 
> the code read better.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-6468) Fix the race condition of subDirs in DiskBlockManager

2015-03-26 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6468?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-6468.
--
   Resolution: Fixed
Fix Version/s: 1.4.0

Issue resolved by pull request 5136
[https://github.com/apache/spark/pull/5136]

> Fix the race condition of subDirs in DiskBlockManager
> -
>
> Key: SPARK-6468
> URL: https://issues.apache.org/jira/browse/SPARK-6468
> Project: Spark
>  Issue Type: Bug
>  Components: Block Manager
>Affects Versions: 1.3.0
>Reporter: Shixiong Zhu
>Priority: Minor
> Fix For: 1.4.0
>
>
> There are two race conditions on subDirs in DiskBlockManager:
> 1. `getAllFiles` does not use correct locks to read the contents of 
> `subDirs`. Although it's designed for testing, it's still worth adding 
> correct locks to eliminate the race condition.
> 2. The double-check in `getFile(filename: String)` has a race condition. If a 
> thread finds `subDirs(dirId)(subDirId)` is not null outside the `synchronized` 
> block, it may not see the correct contents of the File instance pointed to by 
> `subDirs(dirId)(subDirId)` according to the Java memory model 
> (there is no volatile variable here).
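
For illustration only, a minimal Scala sketch of the second pattern described 
above; the class and names here ({{LazySlots}}, {{slots}}) are hypothetical 
stand-ins, not the actual DiskBlockManager code:

{code}
// Simplified double-checked lazy initialization over a plain (non-volatile)
// array. The fast path reads slots(i) without synchronization, so, per the
// issue description, the Java memory model does not guarantee the reader sees
// a fully initialized File even when it observes a non-null reference.
class LazySlots(n: Int) {
  private val slots = new Array[java.io.File](n) // no volatile/atomic publication

  def get(i: Int): java.io.File = {
    if (slots(i) == null) {        // unsynchronized read (fast path)
      slots.synchronized {
        if (slots(i) == null) {
          slots(i) = new java.io.File("dir-" + i)
        }
      }
    }
    slots(i)
  }
}
{code}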



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6468) Fix the race condition of subDirs in DiskBlockManager

2015-03-26 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6468?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-6468:
-
Assignee: Shixiong Zhu

> Fix the race condition of subDirs in DiskBlockManager
> -
>
> Key: SPARK-6468
> URL: https://issues.apache.org/jira/browse/SPARK-6468
> Project: Spark
>  Issue Type: Bug
>  Components: Block Manager
>Affects Versions: 1.3.0
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>Priority: Minor
> Fix For: 1.4.0
>
>
> There are two race conditions on subDirs in DiskBlockManager:
> 1. `getAllFiles` does not use correct locks to read the contents of 
> `subDirs`. Although it's designed for testing, it's still worth adding 
> correct locks to eliminate the race condition.
> 2. The double-check in `getFile(filename: String)` has a race condition. If a 
> thread finds `subDirs(dirId)(subDirId)` is not null outside the `synchronized` 
> block, it may not see the correct contents of the File instance pointed to by 
> `subDirs(dirId)(subDirId)` according to the Java memory model 
> (there is no volatile variable here).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-6480) histogram() bucket function is wrong in some simple edge cases

2015-03-26 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6480?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-6480:
---

Assignee: Sean Owen  (was: Apache Spark)

> histogram() bucket function is wrong in some simple edge cases
> --
>
> Key: SPARK-6480
> URL: https://issues.apache.org/jira/browse/SPARK-6480
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.3.0
>Reporter: Sean Owen
>Assignee: Sean Owen
>
> (Credit to a customer report here) This test would fail now: 
> {code}
> val rdd = sc.parallelize(Seq(1, 1, 1, 2, 3, 3))
> assert(Array(3, 1, 2) === rdd.map(_.toDouble).histogram(3)._2)
> {code}
> Because it returns 3, 1, 0. The problem ultimately traces to the 'fast' 
> bucket function that judges buckets based on a multiple of the gap between 
> first and second elements. Errors multiply and the end of the final bucket 
> fails to include the max.
> Fairly plausible use case actually.
> This can be tightened up easily with a slightly better expression. It will 
> also fix this test, which is actually expecting the wrong answer:
> {code}
> val rdd = sc.parallelize(6 to 99)
> val (histogramBuckets, histogramResults) = rdd.histogram(9)
> val expectedHistogramResults =
>   Array(11, 10, 11, 10, 10, 11, 10, 10, 11)
> {code}
> (Should be {{Array(11, 10, 10, 11, 10, 10, 11, 10, 11)}})
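
One way to tighten the bucket computation, sketched here as an illustration 
rather than the exact expression used in the fix: derive each boundary directly 
from min, max and the bucket index instead of repeatedly adding an increment, 
so rounding error cannot accumulate and the last boundary is exactly the max.

{code}
// Hedged sketch: per-index bucket boundaries for an equal-width histogram.
def bucketBoundaries(min: Double, max: Double, count: Int): Array[Double] =
  Array.tabulate(count + 1) { i =>
    if (i == count) max else min + (max - min) * i / count
  }

// For example, bucketBoundaries(1.0, 3.0, 3) spans [1.0, 3.0] in three equal
// buckets, and the final boundary is exactly 3.0, so the max is included.
{code}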



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-6480) histogram() bucket function is wrong in some simple edge cases

2015-03-26 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6480?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-6480:
---

Assignee: Apache Spark  (was: Sean Owen)

> histogram() bucket function is wrong in some simple edge cases
> --
>
> Key: SPARK-6480
> URL: https://issues.apache.org/jira/browse/SPARK-6480
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.3.0
>Reporter: Sean Owen
>Assignee: Apache Spark
>
> (Credit to a customer report here) This test would fail now: 
> {code}
> val rdd = sc.parallelize(Seq(1, 1, 1, 2, 3, 3))
> assert(Array(3, 1, 2) === rdd.map(_.toDouble).histogram(3)._2)
> {code}
> Because it returns 3, 1, 0. The problem ultimately traces to the 'fast' 
> bucket function that judges buckets based on a multiple of the gap between 
> first and second elements. Errors multiply and the end of the final bucket 
> fails to include the max.
> Fairly plausible use case actually.
> This can be tightened up easily with a slightly better expression. It will 
> also fix this test, which is actually expecting the wrong answer:
> {code}
> val rdd = sc.parallelize(6 to 99)
> val (histogramBuckets, histogramResults) = rdd.histogram(9)
> val expectedHistogramResults =
>   Array(11, 10, 11, 10, 10, 11, 10, 10, 11)
> {code}
> (Should be {{Array(11, 10, 10, 11, 10, 10, 11, 10, 11)}})



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6554) Cannot use partition columns in where clause when Parquet filter push-down is enabled

2015-03-26 Thread Jon Chase (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6554?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14381831#comment-14381831
 ] 

Jon Chase commented on SPARK-6554:
--

"spark.sql.parquet.filterPushdown" was the problem.  Leaving it set to false 
works around the problem for now.  

Thanks for jumping on this.

> Cannot use partition columns in where clause when Parquet filter push-down is 
> enabled
> -
>
> Key: SPARK-6554
> URL: https://issues.apache.org/jira/browse/SPARK-6554
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.3.0
>Reporter: Jon Chase
>Assignee: Cheng Lian
>Priority: Critical
>
> I'm having trouble referencing partition columns in my queries with Parquet.  
> In the following example, 'probeTypeId' is a partition column.  For example, 
> the directory structure looks like this:
> {noformat}
> /mydata
> /probeTypeId=1
> ...files...
> /probeTypeId=2
> ...files...
> {noformat}
> I see the column when I load a DF using the /mydata directory and 
> call df.printSchema():
> {noformat}
>  |-- probeTypeId: integer (nullable = true)
> {noformat}
> Parquet is also aware of the column:
> {noformat}
>  optional int32 probeTypeId;
> {noformat}
> And this works fine:
> {code}
> sqlContext.sql("select probeTypeId from df limit 1");
> {code}
> ...as does {{df.show()}} - it shows the correct values for the partition 
> column.
> However, when I try to use a partition column in a where clause, I get an 
> exception stating that the column was not found in the schema:
> {noformat}
> sqlContext.sql("select probeTypeId from df where probeTypeId = 1 limit 1");
> ...
> ...
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
> stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 
> (TID 0, localhost): java.lang.IllegalArgumentException: Column [probeTypeId] 
> was not found in schema!
>   at parquet.Preconditions.checkArgument(Preconditions.java:47)
>   at 
> parquet.filter2.predicate.SchemaCompatibilityValidator.getColumnDescriptor(SchemaCompatibilityValidator.java:172)
>   at 
> parquet.filter2.predicate.SchemaCompatibilityValidator.validateColumn(SchemaCompatibilityValidator.java:160)
>   at 
> parquet.filter2.predicate.SchemaCompatibilityValidator.validateColumnFilterPredicate(SchemaCompatibilityValidator.java:142)
>   at 
> parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:76)
>   at 
> parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:41)
>   at parquet.filter2.predicate.Operators$Eq.accept(Operators.java:162)
>   at 
> parquet.filter2.predicate.SchemaCompatibilityValidator.validate(SchemaCompatibilityValidator.java:46)
>   at parquet.filter2.compat.RowGroupFilter.visit(RowGroupFilter.java:41)
>   at parquet.filter2.compat.RowGroupFilter.visit(RowGroupFilter.java:22)
>   at 
> parquet.filter2.compat.FilterCompat$FilterPredicateCompat.accept(FilterCompat.java:108)
>   at 
> parquet.filter2.compat.RowGroupFilter.filterRowGroups(RowGroupFilter.java:28)
>   at 
> parquet.hadoop.ParquetRecordReader.initializeInternalReader(ParquetRecordReader.java:158)
> ...
> ...
> {noformat}
> Here's the full stack trace:
> {noformat}
> using local[*] for master
> 06:05:55,675 |-INFO in 
> ch.qos.logback.classic.joran.action.ConfigurationAction - debug attribute not 
> set
> 06:05:55,683 |-INFO in ch.qos.logback.core.joran.action.AppenderAction - 
> About to instantiate appender of type [ch.qos.logback.core.ConsoleAppender]
> 06:05:55,694 |-INFO in ch.qos.logback.core.joran.action.AppenderAction - 
> Naming appender as [STDOUT]
> 06:05:55,721 |-INFO in 
> ch.qos.logback.core.joran.action.NestedComplexPropertyIA - Assuming default 
> type [ch.qos.logback.classic.encoder.PatternLayoutEncoder] for [encoder] 
> property
> 06:05:55,768 |-INFO in ch.qos.logback.classic.joran.action.RootLoggerAction - 
> Setting level of ROOT logger to INFO
> 06:05:55,768 |-INFO in ch.qos.logback.core.joran.action.AppenderRefAction - 
> Attaching appender named [STDOUT] to Logger[ROOT]
> 06:05:55,769 |-INFO in 
> ch.qos.logback.classic.joran.action.ConfigurationAction - End of 
> configuration.
> 06:05:55,770 |-INFO in 
> ch.qos.logback.classic.joran.JoranConfigurator@6aaceffd - Registering current 
> configuration as safe fallback point
> INFO  org.apache.spark.SparkContext Running Spark version 1.3.0
> WARN  o.a.hadoop.util.NativeCodeLoader Unable to load native-hadoop library 
> for your platform... using builtin-java classes where applicable
> INFO  org.apache.spark.SecurityManager Changing view acls to: jon
> INFO  org.apache.spark.Securi

[jira] [Commented] (SPARK-6508) error handling issue running python in yarn cluster mode

2015-03-26 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6508?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14381838#comment-14381838
 ] 

Thomas Graves commented on SPARK-6508:
--

[~TomStewart] are you running on YARN?  If so, setting the SPARK_HOME part is 
easy.  Just add the following configs to spark-defaults.conf, or add them on 
the command line.  The location doesn't have to point to anything real; it 
just needs to be set.

spark.yarn.appMasterEnv.SPARK_HOME /bogus
spark.executorEnv.SPARK_HOME /bogus

> error handling issue running python in yarn cluster mode 
> -
>
> Key: SPARK-6508
> URL: https://issues.apache.org/jira/browse/SPARK-6508
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 1.3.0
>Reporter: Thomas Graves
>
> I was running python in yarn cluster mode and didn't have the SPARK_HOME 
> environment variable set.  The client reported a failure of: 
> java.io.FileNotFoundException: File does not exist: 
> hdfs://axonitered-nn1.red.ygrid.yahoo.com:8020/user/tgraves/.sparkStaging/application_1425530846697_59578/pi.py
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem$19.doCall(DistributedFileSystem.java:1201)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem$19.doCall(DistributedFileSystem.java:1193)
> at 
> org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
>
> But when you look at the application master log:
> Log Contents:
> Traceback (most recent call last):
>   File "pi.py", line 29, in 
> sc = SparkContext(appName="PythonPi")
>   File 
> "/grid/11/tmp/yarn-local/usercache/tgraves/filecache/37/spark-assembly-1.3.0.0-hadoop2.6.0.6.1502061521.jar/pyspark/context.py",
>  line 108, in __init__
>   File 
> "/grid/11/tmp/yarn-local/usercache/tgraves/filecache/37/spark-assembly-1.3.0.0-hadoop2.6.0.6.1502061521.jar/pyspark/context.py",
>  line 222, in _ensure_initialized
>   File 
> "/grid/11/tmp/yarn-local/usercache/tgraves/filecache/37/spark-assembly-1.3.0.0-hadoop2.6.0.6.1502061521.jar/pyspark/java_gateway.py",
>  line 32, in launch_gateway
>   File "/usr/lib64/python2.6/UserDict.py", line 22, in __getitem__
> raise KeyError(key)
> KeyError: 'SPARK_HOME'
> But the application master thought it succeeded and removed the staging 
> directory when it shouldn't have:
> 15/03/24 14:50:07 INFO yarn.ApplicationMaster: Waiting for spark context 
> initialization ... 
> 15/03/24 14:50:08 INFO yarn.ApplicationMaster: Final app status: SUCCEEDED, 
> exitCode: 0, (reason: Shutdown hook called before final status was reported.)
> 15/03/24 14:50:08 INFO yarn.ApplicationMaster: Unregistering 
> ApplicationMaster with SUCCEEDED (diag message: Shutdown hook called before 
> final status was reported.)
> 15/03/24 14:50:08 INFO yarn.ApplicationMaster: Deleting staging directory 
> .sparkStaging/application_1425530846697_59578



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-6471) Metastore schema should only be a subset of parquet schema to support dropping of columns using replace columns

2015-03-26 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6471?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian resolved SPARK-6471.
---
   Resolution: Fixed
Fix Version/s: 1.3.1

Issue resolved by pull request 5141
[https://github.com/apache/spark/pull/5141]

> Metastore schema should only be a subset of parquet schema to support 
> dropping of columns using replace columns
> ---
>
> Key: SPARK-6471
> URL: https://issues.apache.org/jira/browse/SPARK-6471
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.3.0
>Reporter: Yash Datta
> Fix For: 1.3.1, 1.4.0
>
>
> Currently, in the parquet relation 2 implementation, an error is thrown when the 
> merged schema is not exactly the same as the metastore schema. 
> But to support cases like dropping a column via the replace columns command, we 
> can relax the restriction so that the query still works when the metastore schema 
> is a subset of the merged parquet schema.
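
In other words, the check becomes "every metastore column must appear in the 
merged Parquet schema" instead of strict equality. A minimal sketch of such a 
relaxed check (illustrative only, not the actual ParquetRelation2 code):

{code}
import org.apache.spark.sql.types.StructType

// Relaxed compatibility check: the metastore schema only needs to be a subset
// of the merged Parquet schema (names compared case-insensitively, since Hive
// lower-cases metastore column names).
def metastoreIsSubsetOf(metastoreSchema: StructType, mergedParquetSchema: StructType): Boolean = {
  val parquetTypesByName =
    mergedParquetSchema.fields.map(f => f.name.toLowerCase -> f.dataType).toMap
  metastoreSchema.fields.forall { f =>
    parquetTypesByName.get(f.name.toLowerCase).exists(_ == f.dataType)
  }
}
{code}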



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6471) Metastore schema should only be a subset of parquet schema to support dropping of columns using replace columns

2015-03-26 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6471?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-6471:
--
Assignee: Yash Datta

> Metastore schema should only be a subset of parquet schema to support 
> dropping of columns using replace columns
> ---
>
> Key: SPARK-6471
> URL: https://issues.apache.org/jira/browse/SPARK-6471
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.3.0
>Reporter: Yash Datta
>Assignee: Yash Datta
> Fix For: 1.3.1, 1.4.0
>
>
> Currently, in the parquet relation 2 implementation, an error is thrown when the 
> merged schema is not exactly the same as the metastore schema. 
> But to support cases like dropping a column via the replace columns command, we 
> can relax the restriction so that the query still works when the metastore schema 
> is a subset of the merged parquet schema.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6506) python support yarn cluster mode requires SPARK_HOME to be set

2015-03-26 Thread Lianhui Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6506?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14381844#comment-14381844
 ] 

Lianhui Wang commented on SPARK-6506:
-

hi [~tgraves] I'm running 1.3.0. If I don't set SPARK_HOME on every node, I 
get the following exception in every executor:
Error from python worker:
  /usr/bin/python: No module named pyspark
PYTHONPATH was:
  
/data/yarnenv/local/usercache/lianhui/filecache/296/spark-assembly-1.3.0-hadoop2.2.0.jar/python
java.io.EOFException
at java.io.DataInputStream.readInt(DataInputStream.java:392)
at 
org.apache.spark.api.python.PythonWorkerFactory.startDaemon(PythonWorkerFactory.scala:164)
at 
org.apache.spark.api.python.PythonWorkerFactory.createThroughDaemon(PythonWorkerFactory.scala:86)
at 
org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:62)
at org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:105)
at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:69)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)

From the exception I can see that the pyspark packaged inside the Spark assembly 
jar on the NodeManager doesn't work, and I don't know why. [~andrewor14] can you 
help me? So I think for now we should put the Spark dirs into SPARK_HOME on every 
node.

> python support yarn cluster mode requires SPARK_HOME to be set
> --
>
> Key: SPARK-6506
> URL: https://issues.apache.org/jira/browse/SPARK-6506
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 1.3.0
>Reporter: Thomas Graves
>
> We added support for python running in yarn cluster mode in 
> https://issues.apache.org/jira/browse/SPARK-5173, but it requires that 
> SPARK_HOME be set in the environment variables for application master and 
> executor.  It doesn't have to be set to anything real but it fails if it's not 
> set.  See the command at the end of: https://github.com/apache/spark/pull/3976



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6471) Metastore schema should only be a subset of parquet schema to support dropping of columns using replace columns

2015-03-26 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6471?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-6471:
--
Priority: Blocker  (was: Major)
Target Version/s: 1.3.1, 1.4.0

> Metastore schema should only be a subset of parquet schema to support 
> dropping of columns using replace columns
> ---
>
> Key: SPARK-6471
> URL: https://issues.apache.org/jira/browse/SPARK-6471
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.3.0
>Reporter: Yash Datta
>Assignee: Yash Datta
>Priority: Blocker
> Fix For: 1.3.1, 1.4.0
>
>
> Currently, in the parquet relation 2 implementation, an error is thrown when the 
> merged schema is not exactly the same as the metastore schema. 
> But to support cases like dropping a column via the replace columns command, we 
> can relax the restriction so that the query still works when the metastore schema 
> is a subset of the merged parquet schema.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6471) Metastore schema should only be a subset of parquet schema to support dropping of columns using replace columns

2015-03-26 Thread Cheng Lian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6471?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14381846#comment-14381846
 ] 

Cheng Lian commented on SPARK-6471:
---

Bumped to blocker level since this is actually a regression from 1.2.

> Metastore schema should only be a subset of parquet schema to support 
> dropping of columns using replace columns
> ---
>
> Key: SPARK-6471
> URL: https://issues.apache.org/jira/browse/SPARK-6471
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.3.0
>Reporter: Yash Datta
>Assignee: Yash Datta
>Priority: Blocker
> Fix For: 1.3.1, 1.4.0
>
>
> Currently, in the parquet relation 2 implementation, an error is thrown when the 
> merged schema is not exactly the same as the metastore schema. 
> But to support cases like dropping a column via the replace columns command, we 
> can relax the restriction so that the query still works when the metastore schema 
> is a subset of the merged parquet schema.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-6506) python support yarn cluster mode requires SPARK_HOME to be set

2015-03-26 Thread Lianhui Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6506?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14381844#comment-14381844
 ] 

Lianhui Wang edited comment on SPARK-6506 at 3/26/15 1:17 PM:
--

hi [~tgraves] I'm running 1.3.0. If I don't set SPARK_HOME on every node, I 
get the following exception in every executor:
Error from python worker:
  /usr/bin/python: No module named pyspark
PYTHONPATH was:
  
/data/yarnenv/local/usercache/lianhui/filecache/296/spark-assembly-1.3.0-hadoop2.2.0.jar/python
java.io.EOFException
at java.io.DataInputStream.readInt(DataInputStream.java:392)
at 
org.apache.spark.api.python.PythonWorkerFactory.startDaemon(PythonWorkerFactory.scala:164)
at 
org.apache.spark.api.python.PythonWorkerFactory.createThroughDaemon(PythonWorkerFactory.scala:86)
at 
org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:62)
at org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:105)
at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:69)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)

From the exception I can see that the pyspark packaged inside the Spark assembly 
jar on the NodeManager doesn't work, and I don't know why. [~andrewor14] can you 
help me? So I think for now we should put the Spark dirs into PYTHONPATH or 
SPARK_HOME on every node.


was (Author: lianhuiwang):
hi [~tgraves] I'm running 1.3.0. If I don't set SPARK_HOME on every node, I 
get the following exception in every executor:
Error from python worker:
  /usr/bin/python: No module named pyspark
PYTHONPATH was:
  
/data/yarnenv/local/usercache/lianhui/filecache/296/spark-assembly-1.3.0-hadoop2.2.0.jar/python
java.io.EOFException
at java.io.DataInputStream.readInt(DataInputStream.java:392)
at 
org.apache.spark.api.python.PythonWorkerFactory.startDaemon(PythonWorkerFactory.scala:164)
at 
org.apache.spark.api.python.PythonWorkerFactory.createThroughDaemon(PythonWorkerFactory.scala:86)
at 
org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:62)
at org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:105)
at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:69)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)

From the exception I can see that the pyspark packaged inside the Spark assembly 
jar on the NodeManager doesn't work, and I don't know why. [~andrewor14] can you 
help me? So I think for now we should put the Spark dirs into SPARK_HOME on every 
node.

> python support yarn cluster mode requires SPARK_HOME to be set
> --
>
> Key: SPARK-6506
> URL: https://issues.apache.org/jira/browse/SPARK-6506
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 1.3.0
>Reporter: Thomas Graves
>
> We added support for python running in yarn cluster mode in 
> https://issues.apache.org/jira/browse/SPARK-5173, but it requires that 
> SPARK_HOME be set in the environment variables for application master and 
> executor.  It doesn't have to be set to anything real but it fails if it's not 
> set.  See the command at the end of: https://github.com/apache/spark/pull/3976



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6554) Cannot use partition columns in where clause when Parquet filter push-down is enabled

2015-03-26 Thread Cheng Lian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6554?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14381850#comment-14381850
 ] 

Cheng Lian commented on SPARK-6554:
---

Marked this as critical rather than blocker mostly because Parquet filter 
push-down is not enabled by default in 1.3.0.

> Cannot use partition columns in where clause when Parquet filter push-down is 
> enabled
> -
>
> Key: SPARK-6554
> URL: https://issues.apache.org/jira/browse/SPARK-6554
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.3.0
>Reporter: Jon Chase
>Assignee: Cheng Lian
>Priority: Critical
>
> I'm having trouble referencing partition columns in my queries with Parquet.  
> In the following example, 'probeTypeId' is a partition column.  For example, 
> the directory structure looks like this:
> {noformat}
> /mydata
> /probeTypeId=1
> ...files...
> /probeTypeId=2
> ...files...
> {noformat}
> I see the column when I load a DF using the /mydata directory and 
> call df.printSchema():
> {noformat}
>  |-- probeTypeId: integer (nullable = true)
> {noformat}
> Parquet is also aware of the column:
> {noformat}
>  optional int32 probeTypeId;
> {noformat}
> And this works fine:
> {code}
> sqlContext.sql("select probeTypeId from df limit 1");
> {code}
> ...as does {{df.show()}} - it shows the correct values for the partition 
> column.
> However, when I try to use a partition column in a where clause, I get an 
> exception stating that the column was not found in the schema:
> {noformat}
> sqlContext.sql("select probeTypeId from df where probeTypeId = 1 limit 1");
> ...
> ...
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
> stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 
> (TID 0, localhost): java.lang.IllegalArgumentException: Column [probeTypeId] 
> was not found in schema!
>   at parquet.Preconditions.checkArgument(Preconditions.java:47)
>   at 
> parquet.filter2.predicate.SchemaCompatibilityValidator.getColumnDescriptor(SchemaCompatibilityValidator.java:172)
>   at 
> parquet.filter2.predicate.SchemaCompatibilityValidator.validateColumn(SchemaCompatibilityValidator.java:160)
>   at 
> parquet.filter2.predicate.SchemaCompatibilityValidator.validateColumnFilterPredicate(SchemaCompatibilityValidator.java:142)
>   at 
> parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:76)
>   at 
> parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:41)
>   at parquet.filter2.predicate.Operators$Eq.accept(Operators.java:162)
>   at 
> parquet.filter2.predicate.SchemaCompatibilityValidator.validate(SchemaCompatibilityValidator.java:46)
>   at parquet.filter2.compat.RowGroupFilter.visit(RowGroupFilter.java:41)
>   at parquet.filter2.compat.RowGroupFilter.visit(RowGroupFilter.java:22)
>   at 
> parquet.filter2.compat.FilterCompat$FilterPredicateCompat.accept(FilterCompat.java:108)
>   at 
> parquet.filter2.compat.RowGroupFilter.filterRowGroups(RowGroupFilter.java:28)
>   at 
> parquet.hadoop.ParquetRecordReader.initializeInternalReader(ParquetRecordReader.java:158)
> ...
> ...
> {noformat}
> Here's the full stack trace:
> {noformat}
> using local[*] for master
> 06:05:55,675 |-INFO in 
> ch.qos.logback.classic.joran.action.ConfigurationAction - debug attribute not 
> set
> 06:05:55,683 |-INFO in ch.qos.logback.core.joran.action.AppenderAction - 
> About to instantiate appender of type [ch.qos.logback.core.ConsoleAppender]
> 06:05:55,694 |-INFO in ch.qos.logback.core.joran.action.AppenderAction - 
> Naming appender as [STDOUT]
> 06:05:55,721 |-INFO in 
> ch.qos.logback.core.joran.action.NestedComplexPropertyIA - Assuming default 
> type [ch.qos.logback.classic.encoder.PatternLayoutEncoder] for [encoder] 
> property
> 06:05:55,768 |-INFO in ch.qos.logback.classic.joran.action.RootLoggerAction - 
> Setting level of ROOT logger to INFO
> 06:05:55,768 |-INFO in ch.qos.logback.core.joran.action.AppenderRefAction - 
> Attaching appender named [STDOUT] to Logger[ROOT]
> 06:05:55,769 |-INFO in 
> ch.qos.logback.classic.joran.action.ConfigurationAction - End of 
> configuration.
> 06:05:55,770 |-INFO in 
> ch.qos.logback.classic.joran.JoranConfigurator@6aaceffd - Registering current 
> configuration as safe fallback point
> INFO  org.apache.spark.SparkContext Running Spark version 1.3.0
> WARN  o.a.hadoop.util.NativeCodeLoader Unable to load native-hadoop library 
> for your platform... using builtin-java classes where applicable
> INFO  org.apache.spark.SecurityManager Changing view acls to: jon
> INFO  org.apache.spark.SecurityManager Changing m

[jira] [Comment Edited] (SPARK-6506) python support yarn cluster mode requires SPARK_HOME to be set

2015-03-26 Thread Lianhui Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6506?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14381844#comment-14381844
 ] 

Lianhui Wang edited comment on SPARK-6506 at 3/26/15 1:18 PM:
--

hi [~tgraves] I'm running 1.3.0. If I don't set SPARK_HOME on every node, I 
get the following exception in every executor:
Error from python worker:
  /usr/bin/python: No module named pyspark
PYTHONPATH was:
  
/data/yarnenv/local/usercache/lianhui/filecache/296/spark-assembly-1.3.0-hadoop2.2.0.jar/python
java.io.EOFException
at java.io.DataInputStream.readInt(DataInputStream.java:392)
at 
org.apache.spark.api.python.PythonWorkerFactory.startDaemon(PythonWorkerFactory.scala:164)
at 
org.apache.spark.api.python.PythonWorkerFactory.createThroughDaemon(PythonWorkerFactory.scala:86)
at 
org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:62)
at org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:105)
at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:69)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)

From the exception I can see that the pyspark packaged inside the Spark assembly 
jar on the NodeManager doesn't work, and I don't know why it doesn't work. 
[~andrewor14] can you help me? So I think for now we should put the Spark dirs 
into PYTHONPATH or SPARK_HOME on every node.


was (Author: lianhuiwang):
hi [~tgraves] I'm running 1.3.0. If I don't set SPARK_HOME on every node, I 
get the following exception in every executor:
Error from python worker:
  /usr/bin/python: No module named pyspark
PYTHONPATH was:
  
/data/yarnenv/local/usercache/lianhui/filecache/296/spark-assembly-1.3.0-hadoop2.2.0.jar/python
java.io.EOFException
at java.io.DataInputStream.readInt(DataInputStream.java:392)
at 
org.apache.spark.api.python.PythonWorkerFactory.startDaemon(PythonWorkerFactory.scala:164)
at 
org.apache.spark.api.python.PythonWorkerFactory.createThroughDaemon(PythonWorkerFactory.scala:86)
at 
org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:62)
at org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:105)
at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:69)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)

From the exception I can see that the pyspark packaged inside the Spark assembly 
jar on the NodeManager doesn't work, and I don't know why. [~andrewor14] can you 
help me? So I think for now we should put the Spark dirs into PYTHONPATH or 
SPARK_HOME on every node.

> python support yarn cluster mode requires SPARK_HOME to be set
> --
>
> Key: SPARK-6506
> URL: https://issues.apache.org/jira/browse/SPARK-6506
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 1.3.0
>Reporter: Thomas Graves
>
> We added support for python running in yarn cluster mode in 
> https://issues.apache.org/jira/browse/SPARK-5173, but it requires that 
> SPARK_HOME be set in the environment variables for application master and 
> executor.  It doesn't have to be set to anything real but it fails if it's not 
> set.  See the command at the end of: https://github.com/apache/spark/pull/3976



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6506) python support yarn cluster mode requires SPARK_HOME to be set

2015-03-26 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6506?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14381860#comment-14381860
 ] 

Thomas Graves commented on SPARK-6506:
--

If you are running on yarn you just have to set SPARK_HOME like this:
spark.yarn.appMasterEnv.SPARK_HOME /bogus
spark.executorEnv.SPARK_HOME /bogus

But the error you pasted above isn't about that.  I've seen this when building 
the assembly with jdk7 or jdk8 due to the python stuff not being packaged 
properly in the assembly jar.  I have to use jdk6 to package it.  see 
https://issues.apache.org/jira/browse/SPARK-1920
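
One way to sanity-check an assembly (a sketch; adjust the jar path to your build) 
is to list its contents and make sure the pyspark modules actually made it in:

jar tf assembly/target/scala-2.10/spark-assembly-1.3.0-hadoop2.2.0.jar | grep 'pyspark/'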

> python support yarn cluster mode requires SPARK_HOME to be set
> --
>
> Key: SPARK-6506
> URL: https://issues.apache.org/jira/browse/SPARK-6506
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 1.3.0
>Reporter: Thomas Graves
>
> We added support for python running in yarn cluster mode in 
> https://issues.apache.org/jira/browse/SPARK-5173, but it requires that 
> SPARK_HOME be set in the environment variables for application master and 
> executor.  It doesn't have to be set to anything real but it fails if it's not 
> set.  See the command at the end of: https://github.com/apache/spark/pull/3976



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6554) Cannot use partition columns in where clause when Parquet filter push-down is enabled

2015-03-26 Thread Cheng Lian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6554?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14381862#comment-14381862
 ] 

Cheng Lian commented on SPARK-6554:
---

Parquet filter push-down isn't enabled by default in 1.3.0 because the most 
recent Parquet version available as of the Spark 1.3.0 release (1.6.0rc3) suffers 
from two bugs (PARQUET-136 & PARQUET-173), so it's generally not recommended for 
production use yet. These two bugs have been fixed in Parquet master, and the 
official 1.6.0 release should be out pretty soon. We will probably upgrade to 
Parquet 1.6.0 in Spark 1.4.0.
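
Until then, a sketch of a workaround for the report below (assuming the 
spark.sql.parquet.filterPushdown flag is what was turned on):

{code}
// Keep Parquet filter push-down at its 1.3.0 default (disabled), so predicates
// on partition columns are handled by partition pruning rather than being
// pushed into the Parquet record reader.
sqlContext.setConf("spark.sql.parquet.filterPushdown", "false")
sqlContext.sql("select probeTypeId from df where probeTypeId = 1 limit 1").show()
{code}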

> Cannot use partition columns in where clause when Parquet filter push-down is 
> enabled
> -
>
> Key: SPARK-6554
> URL: https://issues.apache.org/jira/browse/SPARK-6554
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.3.0
>Reporter: Jon Chase
>Assignee: Cheng Lian
>Priority: Critical
>
> I'm having trouble referencing partition columns in my queries with Parquet.  
> In the following example, 'probeTypeId' is a partition column.  For example, 
> the directory structure looks like this:
> {noformat}
> /mydata
> /probeTypeId=1
> ...files...
> /probeTypeId=2
> ...files...
> {noformat}
> I see the column when I load a DF using the /mydata directory and 
> call df.printSchema():
> {noformat}
>  |-- probeTypeId: integer (nullable = true)
> {noformat}
> Parquet is also aware of the column:
> {noformat}
>  optional int32 probeTypeId;
> {noformat}
> And this works fine:
> {code}
> sqlContext.sql("select probeTypeId from df limit 1");
> {code}
> ...as does {{df.show()}} - it shows the correct values for the partition 
> column.
> However, when I try to use a partition column in a where clause, I get an 
> exception stating that the column was not found in the schema:
> {noformat}
> sqlContext.sql("select probeTypeId from df where probeTypeId = 1 limit 1");
> ...
> ...
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
> stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 
> (TID 0, localhost): java.lang.IllegalArgumentException: Column [probeTypeId] 
> was not found in schema!
>   at parquet.Preconditions.checkArgument(Preconditions.java:47)
>   at 
> parquet.filter2.predicate.SchemaCompatibilityValidator.getColumnDescriptor(SchemaCompatibilityValidator.java:172)
>   at 
> parquet.filter2.predicate.SchemaCompatibilityValidator.validateColumn(SchemaCompatibilityValidator.java:160)
>   at 
> parquet.filter2.predicate.SchemaCompatibilityValidator.validateColumnFilterPredicate(SchemaCompatibilityValidator.java:142)
>   at 
> parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:76)
>   at 
> parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:41)
>   at parquet.filter2.predicate.Operators$Eq.accept(Operators.java:162)
>   at 
> parquet.filter2.predicate.SchemaCompatibilityValidator.validate(SchemaCompatibilityValidator.java:46)
>   at parquet.filter2.compat.RowGroupFilter.visit(RowGroupFilter.java:41)
>   at parquet.filter2.compat.RowGroupFilter.visit(RowGroupFilter.java:22)
>   at 
> parquet.filter2.compat.FilterCompat$FilterPredicateCompat.accept(FilterCompat.java:108)
>   at 
> parquet.filter2.compat.RowGroupFilter.filterRowGroups(RowGroupFilter.java:28)
>   at 
> parquet.hadoop.ParquetRecordReader.initializeInternalReader(ParquetRecordReader.java:158)
> ...
> ...
> {noformat}
> Here's the full stack trace:
> {noformat}
> using local[*] for master
> 06:05:55,675 |-INFO in 
> ch.qos.logback.classic.joran.action.ConfigurationAction - debug attribute not 
> set
> 06:05:55,683 |-INFO in ch.qos.logback.core.joran.action.AppenderAction - 
> About to instantiate appender of type [ch.qos.logback.core.ConsoleAppender]
> 06:05:55,694 |-INFO in ch.qos.logback.core.joran.action.AppenderAction - 
> Naming appender as [STDOUT]
> 06:05:55,721 |-INFO in 
> ch.qos.logback.core.joran.action.NestedComplexPropertyIA - Assuming default 
> type [ch.qos.logback.classic.encoder.PatternLayoutEncoder] for [encoder] 
> property
> 06:05:55,768 |-INFO in ch.qos.logback.classic.joran.action.RootLoggerAction - 
> Setting level of ROOT logger to INFO
> 06:05:55,768 |-INFO in ch.qos.logback.core.joran.action.AppenderRefAction - 
> Attaching appender named [STDOUT] to Logger[ROOT]
> 06:05:55,769 |-INFO in 
> ch.qos.logback.classic.joran.action.ConfigurationAction - End of 
> configuration.
> 06:05:55,770 |-INFO in 
> ch.qos.logback.classic.joran.JoranConfigurator@6aaceffd - Registering current 
> configuration as safe fallback point
> INFO  org.apache.spark.Sp

[jira] [Updated] (SPARK-6554) Cannot use partition columns in where clause when Parquet filter push-down is enabled

2015-03-26 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6554?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-6554:
--
Issue Type: Sub-task  (was: Bug)
Parent: SPARK-5463

> Cannot use partition columns in where clause when Parquet filter push-down is 
> enabled
> -
>
> Key: SPARK-6554
> URL: https://issues.apache.org/jira/browse/SPARK-6554
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 1.3.0
>Reporter: Jon Chase
>Assignee: Cheng Lian
>Priority: Critical
>
> I'm having trouble referencing partition columns in my queries with Parquet.  
> In the following example, 'probeTypeId' is a partition column.  For example, 
> the directory structure looks like this:
> {noformat}
> /mydata
> /probeTypeId=1
> ...files...
> /probeTypeId=2
> ...files...
> {noformat}
> I see the column when I load a DF using the /mydata directory and 
> call df.printSchema():
> {noformat}
>  |-- probeTypeId: integer (nullable = true)
> {noformat}
> Parquet is also aware of the column:
> {noformat}
>  optional int32 probeTypeId;
> {noformat}
> And this works fine:
> {code}
> sqlContext.sql("select probeTypeId from df limit 1");
> {code}
> ...as does {{df.show()}} - it shows the correct values for the partition 
> column.
> However, when I try to use a partition column in a where clause, I get an 
> exception stating that the column was not found in the schema:
> {noformat}
> sqlContext.sql("select probeTypeId from df where probeTypeId = 1 limit 1");
> ...
> ...
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
> stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 
> (TID 0, localhost): java.lang.IllegalArgumentException: Column [probeTypeId] 
> was not found in schema!
>   at parquet.Preconditions.checkArgument(Preconditions.java:47)
>   at 
> parquet.filter2.predicate.SchemaCompatibilityValidator.getColumnDescriptor(SchemaCompatibilityValidator.java:172)
>   at 
> parquet.filter2.predicate.SchemaCompatibilityValidator.validateColumn(SchemaCompatibilityValidator.java:160)
>   at 
> parquet.filter2.predicate.SchemaCompatibilityValidator.validateColumnFilterPredicate(SchemaCompatibilityValidator.java:142)
>   at 
> parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:76)
>   at 
> parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:41)
>   at parquet.filter2.predicate.Operators$Eq.accept(Operators.java:162)
>   at 
> parquet.filter2.predicate.SchemaCompatibilityValidator.validate(SchemaCompatibilityValidator.java:46)
>   at parquet.filter2.compat.RowGroupFilter.visit(RowGroupFilter.java:41)
>   at parquet.filter2.compat.RowGroupFilter.visit(RowGroupFilter.java:22)
>   at 
> parquet.filter2.compat.FilterCompat$FilterPredicateCompat.accept(FilterCompat.java:108)
>   at 
> parquet.filter2.compat.RowGroupFilter.filterRowGroups(RowGroupFilter.java:28)
>   at 
> parquet.hadoop.ParquetRecordReader.initializeInternalReader(ParquetRecordReader.java:158)
> ...
> ...
> {noformat}
> Here's the full stack trace:
> {noformat}
> using local[*] for master
> 06:05:55,675 |-INFO in 
> ch.qos.logback.classic.joran.action.ConfigurationAction - debug attribute not 
> set
> 06:05:55,683 |-INFO in ch.qos.logback.core.joran.action.AppenderAction - 
> About to instantiate appender of type [ch.qos.logback.core.ConsoleAppender]
> 06:05:55,694 |-INFO in ch.qos.logback.core.joran.action.AppenderAction - 
> Naming appender as [STDOUT]
> 06:05:55,721 |-INFO in 
> ch.qos.logback.core.joran.action.NestedComplexPropertyIA - Assuming default 
> type [ch.qos.logback.classic.encoder.PatternLayoutEncoder] for [encoder] 
> property
> 06:05:55,768 |-INFO in ch.qos.logback.classic.joran.action.RootLoggerAction - 
> Setting level of ROOT logger to INFO
> 06:05:55,768 |-INFO in ch.qos.logback.core.joran.action.AppenderRefAction - 
> Attaching appender named [STDOUT] to Logger[ROOT]
> 06:05:55,769 |-INFO in 
> ch.qos.logback.classic.joran.action.ConfigurationAction - End of 
> configuration.
> 06:05:55,770 |-INFO in 
> ch.qos.logback.classic.joran.JoranConfigurator@6aaceffd - Registering current 
> configuration as safe fallback point
> INFO  org.apache.spark.SparkContext Running Spark version 1.3.0
> WARN  o.a.hadoop.util.NativeCodeLoader Unable to load native-hadoop library 
> for your platform... using builtin-java classes where applicable
> INFO  org.apache.spark.SecurityManager Changing view acls to: jon
> INFO  org.apache.spark.SecurityManager Changing modify acls to: jon
> INFO  org.apache.spark.SecurityManager SecurityManager: authentication 
> disabled; ui ac

[jira] [Updated] (SPARK-5463) Fix Parquet filter push-down

2015-03-26 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5463?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-5463:
--
Affects Version/s: 1.3.0

> Fix Parquet filter push-down
> 
>
> Key: SPARK-5463
> URL: https://issues.apache.org/jira/browse/SPARK-5463
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.2.0, 1.2.1, 1.2.2, 1.3.0
>Reporter: Cheng Lian
>Assignee: Cheng Lian
>Priority: Blocker
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-6491) Spark will put the current working dir to the CLASSPATH

2015-03-26 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6491?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-6491.
--
   Resolution: Fixed
Fix Version/s: 1.3.1
 Assignee: Liangliang Gu

Resolved by https://github.com/apache/spark/pull/5156

> Spark will put the current working dir to the CLASSPATH
> ---
>
> Key: SPARK-6491
> URL: https://issues.apache.org/jira/browse/SPARK-6491
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit
>Affects Versions: 1.3.0
>Reporter: Liangliang Gu
>Assignee: Liangliang Gu
> Fix For: 1.3.1
>
>
> When running "bin/computer-classpath.sh", the output will be:
> :/spark/conf:/spark/assembly/target/scala-2.10/spark-assembly-1.3.0-hadoop2.5.0-cdh5.2.0.jar:/spark/lib_managed/jars/datanucleus-rdbms-3.2.9.jar:/spark/lib_managed/jars/datanucleus-api-jdo-3.2.6.jar:/spark/lib_managed/jars/datanucleus-core-3.2.10.jar
> Java will add the current working dir to the CLASSPATH when the classpath starts 
> with ":" (i.e. the first entry is empty), which is not what Spark users expect.
> For example, if I launch spark-shell from /root and a "core-site.xml" exists under 
> /root/, Spark will pick up that file as the Hadoop conf file, even though I have 
> already set HADOOP_CONF_DIR=/etc/hadoop/conf.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-6538) Add missing nullable Metastore fields when merging a Parquet schema

2015-03-26 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6538?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-6538:
---

Assignee: (was: Apache Spark)

> Add missing nullable Metastore fields when merging a Parquet schema
> ---
>
> Key: SPARK-6538
> URL: https://issues.apache.org/jira/browse/SPARK-6538
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.3.0
>Reporter: Adam Budde
> Fix For: 1.3.1
>
>
> When Spark SQL infers a schema for a DataFrame, it will take the union of all 
> field types present in the structured source data (e.g. an RDD of JSON data). 
> When the source data for a row doesn't define a particular field on the 
> DataFrame's schema, a null value will simply be assumed for this field. This 
> workflow makes it very easy to construct tables and query over a set of 
> structured data with a nonuniform schema. However, this behavior is not 
> consistent in some cases when dealing with Parquet files and an external 
> table managed by an external Hive metastore.
> In our particular use case, we use Spark Streaming to parse and transform our 
> input data and then apply a window function to save an arbitrary-sized batch 
> of data as a Parquet file, which itself will be added as a partition to an 
> external Hive table via an "ALTER TABLE... ADD PARTITION..." statement. Since 
> our input data is nonuniform, it is expected that not every partition batch 
> will contain every field present in the table's schema obtained from the Hive 
> metastore. As such, we expect that the schema of some of our Parquet files 
> may not contain the same set of fields present in the full metastore schema.
> In such cases, it seems natural that Spark SQL would simply assume null 
> values for any missing fields in the partition's Parquet file, assuming these 
> fields are specified as nullable by the metastore schema. This is not the 
> case in the current implementation of ParquetRelation2. The 
> mergeMetastoreParquetSchema() method used to reconcile differences between a 
> Parquet file's schema and a schema retrieved from the Hive metastore will 
> raise an exception if the Parquet file doesn't match the same set of fields 
> specified by the metastore.
> I propose altering this implementation in order to allow for any missing 
> metastore fields marked as nullable to be merged into the Parquet file's 
> schema before continuing with the checks present in 
> mergeMetastoreParquetSchema().
> Classifying this as a bug as it exposes inconsistent behavior, IMHO. If you 
> feel this should be an improvement or new feature instead, please feel free 
> to reclassify this issue.
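
A minimal sketch of the proposed merge step (names are illustrative, not the 
actual mergeMetastoreParquetSchema() code):

{code}
import org.apache.spark.sql.types.StructType

// Append any nullable metastore field that is missing from the Parquet file's
// schema, so the existing compatibility checks can pass and the missing
// columns simply read back as null.
def mergeMissingNullableFields(metastoreSchema: StructType, parquetSchema: StructType): StructType = {
  val parquetFieldNames = parquetSchema.fieldNames.map(_.toLowerCase).toSet
  val missingNullableFields = metastoreSchema.fields.filter { f =>
    !parquetFieldNames.contains(f.name.toLowerCase) && f.nullable
  }
  StructType(parquetSchema.fields ++ missingNullableFields)
}
{code}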



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-6538) Add missing nullable Metastore fields when merging a Parquet schema

2015-03-26 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6538?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-6538:
---

Assignee: Apache Spark

> Add missing nullable Metastore fields when merging a Parquet schema
> ---
>
> Key: SPARK-6538
> URL: https://issues.apache.org/jira/browse/SPARK-6538
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.3.0
>Reporter: Adam Budde
>Assignee: Apache Spark
> Fix For: 1.3.1
>
>
> When Spark SQL infers a schema for a DataFrame, it will take the union of all 
> field types present in the structured source data (e.g. an RDD of JSON data). 
> When the source data for a row doesn't define a particular field on the 
> DataFrame's schema, a null value will simply be assumed for this field. This 
> workflow makes it very easy to construct tables and query over a set of 
> structured data with a nonuniform schema. However, this behavior is not 
> consistent in some cases when dealing with Parquet files and an external 
> table managed by an external Hive metastore.
> In our particular use case, we use Spark Streaming to parse and transform our 
> input data and then apply a window function to save an arbitrary-sized batch 
> of data as a Parquet file, which itself will be added as a partition to an 
> external Hive table via an "ALTER TABLE... ADD PARTITION..." statement. Since 
> our input data is nonuniform, it is expected that not every partition batch 
> will contain every field present in the table's schema obtained from the Hive 
> metastore. As such, we expect that the schema of some of our Parquet files 
> may not contain the same set of fields present in the full metastore schema.
> In such cases, it seems natural that Spark SQL would simply assume null 
> values for any missing fields in the partition's Parquet file, assuming these 
> fields are specified as nullable by the metastore schema. This is not the 
> case in the current implementation of ParquetRelation2. The 
> mergeMetastoreParquetSchema() method used to reconcile differences between a 
> Parquet file's schema and a schema retrieved from the Hive metastore will 
> raise an exception if the Parquet file doesn't match the same set of fields 
> specified by the metastore.
> I propose altering this implementation in order to allow for any missing 
> metastore fields marked as nullable to be merged into the Parquet file's 
> schema before continuing with the checks present in 
> mergeMetastoreParquetSchema().
> Classifying this as a bug as it exposes inconsistent behavior, IMHO. If you 
> feel this should be an improvement or new feature instead, please feel free 
> to reclassify this issue.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6532) LDAModel.scala fails scalastyle on Windows

2015-03-26 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14381897#comment-14381897
 ] 

Apache Spark commented on SPARK-6532:
-

User 'srowen' has created a pull request for this issue:
https://github.com/apache/spark/pull/5211

> LDAModel.scala fails scalastyle on Windows
> --
>
> Key: SPARK-6532
> URL: https://issues.apache.org/jira/browse/SPARK-6532
> Project: Spark
>  Issue Type: Bug
>  Components: Build, Windows
>Affects Versions: 1.3.0
> Environment: Windows 7, Maven 3.1.0
>Reporter: Brian O'Keefe
>Priority: Minor
> Fix For: 1.3.1, 1.4.0
>
>
> When executing mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.5.2 -DskipTests -X 
> clean package, the build fails with the error:
> [DEBUG] Configuring mojo org.scalastyle:scalastyle-maven-plugin:0.4.0:check 
> from plugin realm ClassRealm[plugin>org.sca
> astyle:scalastyle-maven-plugin:0.4.0, parent: 
> sun.misc.Launcher$AppClassLoader@1174ec5]
> [DEBUG] Configuring mojo 'org.scalastyle:scalastyle-maven-plugin:0.4.0:check' 
> with basic configurator -->
> [DEBUG]   (f) baseDirectory = C:\Users\u6013553\spark-1.3.0\spark-1.3.0\mllib
> [DEBUG]   (f) buildDirectory = 
> C:\Users\u6013553\spark-1.3.0\spark-1.3.0\mllib\target
> [DEBUG]   (f) configLocation = scalastyle-config.xml
> [DEBUG]   (f) failOnViolation = true
> [DEBUG]   (f) failOnWarning = false
> [DEBUG]   (f) includeTestSourceDirectory = false
> [DEBUG]   (f) outputEncoding = UTF-8
> [DEBUG]   (f) outputFile = 
> C:\Users\u6013553\spark-1.3.0\spark-1.3.0\mllib\scalastyle-output.xml
> [DEBUG]   (f) quiet = false
> [DEBUG]   (f) skip = false
> [DEBUG]   (f) sourceDirectory = 
> C:\Users\u6013553\spark-1.3.0\spark-1.3.0\mllib\src\main\scala
> [DEBUG]   (f) testSourceDirectory = 
> C:\Users\u6013553\spark-1.3.0\spark-1.3.0\mllib\src\test\scala
> [DEBUG]   (f) verbose = false
> [DEBUG]   (f) project = MavenProject: org.apache.spark:spark-mllib_2.10:1.3.0 
> @ C:\Users\u6013553\spark-1.3.0\spark-1.3
> 0\mllib\dependency-reduced-pom.xml
> [DEBUG] -- end configuration --
> [DEBUG] failOnWarning=false
> [DEBUG] verbose=false
> [DEBUG] quiet=false
> [DEBUG] 
> sourceDirectory=C:\Users\u6013553\spark-1.3.0\spark-1.3.0\mllib\src\main\scala
> [DEBUG] includeTestSourceDirectory=false
> [DEBUG] buildDirectory=C:\Users\u6013553\spark-1.3.0\spark-1.3.0\mllib\target
> [DEBUG] baseDirectory=C:\Users\u6013553\spark-1.3.0\spark-1.3.0\mllib
> [DEBUG] 
> outputFile=C:\Users\u6013553\spark-1.3.0\spark-1.3.0\mllib\scalastyle-output.xml
> [DEBUG] outputEncoding=UTF-8
> [DEBUG] inputEncoding=null
> [DEBUG] processing 
> sourceDirectory=C:\Users\u6013553\spark-1.3.0\spark-1.3.0\mllib\src\main\scala
>  encoding=null
> error 
> file=C:\Users\u6013553\spark-1.3.0\spark-1.3.0\mllib\src\main\scala\org\apache\spark\mllib\clustering\LDAModel.sc
> la message=Input length = 1
> Saving to 
> outputFile=C:\Users\u6013553\spark-1.3.0\spark-1.3.0\mllib\scalastyle-output.xml
> Processed 143 file(s)
> Found 1 errors
> Found 0 warnings
> Found 0 infos
> Finished in 1571 ms
> scalastyle-output.xml reports the error against 
> C:\Users\u6013553\spark-1.3.0\spark-1.3.0\mllib\src\main\scala\org\apache\spark\mllib\clustering\LDAModel.scala



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6548) Adding stddev to DataFrame functions

2015-03-26 Thread sdfox (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6548?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14381901#comment-14381901
 ] 

sdfox commented on SPARK-6548:
--

I will take it.

> Adding stddev to DataFrame functions
> 
>
> Key: SPARK-6548
> URL: https://issues.apache.org/jira/browse/SPARK-6548
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Reynold Xin
>  Labels: DataFrame, starter
> Fix For: 1.4.0
>
>
> Add it to the list of aggregate functions:
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/functions.scala
> Also add it to 
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/GroupedData.scala
> We can either add a Stddev Catalyst expression, or just compute it using 
> existing functions like here: 
> https://github.com/apache/spark/commit/5bbcd1304cfebba31ec6857a80d3825a40d02e83#diff-c3d0394b2fc08fb2842ff0362a5ac6c9R776
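
For the second option, a sketch of computing it from existing aggregate functions 
(the DataFrame df and numeric column "x" are just example names; this is the 
population variant, sqrt(E[x^2] - E[x]^2)):

{code}
import org.apache.spark.sql.functions._

// Population standard deviation built from avg() and sqrt().
val x = df("x")
df.agg(sqrt(avg(x * x) - avg(x) * avg(x)).as("stddev_x")).show()
{code}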



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-2475) Check whether #cores > #receivers in local mode

2015-03-26 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2475?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-2475:
---

Assignee: (was: Apache Spark)

> Check whether #cores > #receivers in local mode
> ---
>
> Key: SPARK-2475
> URL: https://issues.apache.org/jira/browse/SPARK-2475
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Reporter: Tathagata Das
>
> When the number of slots in local mode is not more than the number of 
> receivers, then the system should throw an error. Otherwise the system just 
> keeps waiting for resources to process the received data.
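
A sketch of what such a check could look like (illustrative only, not the actual 
streaming code):

{code}
// Fail fast when a local[N] master cannot both run the receivers and process
// the received batches.
def checkReceiverSlots(master: String, numReceivers: Int): Unit = {
  val LocalN = """local\[([0-9]+)\]""".r
  master match {
    case "local" =>
      require(numReceivers == 0,
        "local has a single core, which the receiver(s) would fully occupy")
    case LocalN(n) =>
      require(numReceivers < n.toInt,
        s"local[$n] provides only $n cores for $numReceivers receivers; no core is left to process data")
    case _ => // local[*] and cluster masters are not checked here
  }
}
{code}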



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-2475) Check whether #cores > #receivers in local mode

2015-03-26 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2475?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-2475:
---

Assignee: Apache Spark

> Check whether #cores > #receivers in local mode
> ---
>
> Key: SPARK-2475
> URL: https://issues.apache.org/jira/browse/SPARK-2475
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Reporter: Tathagata Das
>Assignee: Apache Spark
>
> When the number of slots in local mode is not more than the number of 
> receivers, then the system should throw an error. Otherwise the system just 
> keeps waiting for resources to process the received data.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2475) Check whether #cores > #receivers in local mode

2015-03-26 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2475?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14381906#comment-14381906
 ] 

Apache Spark commented on SPARK-2475:
-

User 'ArcherShao' has created a pull request for this issue:
https://github.com/apache/spark/pull/5212

> Check whether #cores > #receivers in local mode
> ---
>
> Key: SPARK-2475
> URL: https://issues.apache.org/jira/browse/SPARK-2475
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Reporter: Tathagata Das
>
> When the number of slots in local mode is not more than the number of 
> receivers, then the system should throw an error. Otherwise the system just 
> keeps waiting for resources to process the received data.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6481) Set "In Progress" when a PR is opened for an issue

2015-03-26 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14381934#comment-14381934
 ] 

Nicholas Chammas commented on SPARK-6481:
-

I'm willing to update this if there is a better approach, but AFAIK we
*have* to assign the issue to the Spark user in order to change the issue's
state. We then change it back to preserve the original assignee, if any.

If there is a better flow I missed, let me know and I would love to fix it.
Otherwise I think this is what we're stuck with. An option might be to
filter out these assignee changes in your email client, though I know that
is suboptimal.

Nick



> Set "In Progress" when a PR is opened for an issue
> --
>
> Key: SPARK-6481
> URL: https://issues.apache.org/jira/browse/SPARK-6481
> Project: Spark
>  Issue Type: Bug
>  Components: Project Infra
>Reporter: Michael Armbrust
>Assignee: Nicholas Chammas
>
> [~pwendell] and I are not sure if this is possible, but it would be really 
> helpful if the JIRA status was updated to "In Progress" when we do the 
> linking to an open pull request.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6481) Set "In Progress" when a PR is opened for an issue

2015-03-26 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14381939#comment-14381939
 ] 

Nicholas Chammas commented on SPARK-6481:
-

Also, there was a one-time mass update for all issues triggered yesterday
by Josh, so this will be the only time you see this volume of assignee
changes at once.
On Thu, Mar 26, 2015 at 10:20 AM, Nicholas Chammas



> Set "In Progress" when a PR is opened for an issue
> --
>
> Key: SPARK-6481
> URL: https://issues.apache.org/jira/browse/SPARK-6481
> Project: Spark
>  Issue Type: Bug
>  Components: Project Infra
>Reporter: Michael Armbrust
>Assignee: Nicholas Chammas
>
> [~pwendell] and I are not sure if this is possible, but it would be really 
> helpful if the JIRA status was updated to "In Progress" when we do the 
> linking to an open pull request.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6481) Set "In Progress" when a PR is opened for an issue

2015-03-26 Thread Josh Rosen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14381944#comment-14381944
 ] 

Josh Rosen commented on SPARK-6481:
---

Actually, I haven't triggered the mass update quite yet; it's only been 
updating issues one-by-one as they're updated in GitHub.

> Set "In Progress" when a PR is opened for an issue
> --
>
> Key: SPARK-6481
> URL: https://issues.apache.org/jira/browse/SPARK-6481
> Project: Spark
>  Issue Type: Bug
>  Components: Project Infra
>Reporter: Michael Armbrust
>Assignee: Nicholas Chammas
>
> [~pwendell] and I are not sure if this is possible, but it would be really 
> helpful if the JIRA status was updated to "In Progress" when we do the 
> linking to an open pull request.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


