[jira] [Reopened] (SPARK-21096) Pickle error when passing a member variable to Spark executors

2017-06-15 Thread Irina Truong (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21096?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Irina Truong reopened SPARK-21096:
--

The two methods I described should be equivalent, but they are not.
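
As a Spark-free illustration of the difference, here is a minimal sketch (not 
from the ticket; it assumes the standalone {{cloudpickle}} package, which is 
what PySpark uses internally to serialize closures, and uses a 
{{threading.Lock}} to stand in for the unpicklable SparkContext):

{code}
# Minimal sketch: a lambda that mentions self.multiplier closes over the
# whole `self`, so everything on it must be picklable. A threading.Lock
# plays the role of the SparkContext held in `self.sc`.
import threading

import cloudpickle  # assumption: `pip install cloudpickle`

class Job(object):
    def __init__(self, multiplier):
        self.multiplier = multiplier
        self.sc = threading.Lock()  # unpicklable member, like a SparkContext

    def fail_fn(self):
        # the closure cell holds `self`, dragging self.sc along
        return lambda row: row * self.multiplier

    def ok_fn(self):
        mult = self.multiplier  # a plain int; the closure holds only this
        return lambda row: row * mult

job = Job(3)
cloudpickle.dumps(job.ok_fn())    # succeeds
cloudpickle.dumps(job.fail_fn())  # raises TypeError: cannot pickle a lock
{code}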

> Pickle error when passing a member variable to Spark executors
> --
>
> Key: SPARK-21096
> URL: https://issues.apache.org/jira/browse/SPARK-21096
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.1
>Reporter: Irina Truong
>
> There is a pickle error when submitting a Spark job that references a member 
> variable in a lambda, even when the member variable is a simple type that 
> should be serializable.
> Here is a minimal example:
> https://gist.github.com/j-bennet/8390c6d9a81854696f1a9b42a4ea8278
> In the gist above, this method will throw an exception:
> {code}
> def build_fail(self):
>     processed = self.rdd.map(lambda row: process_row(row, self.multiplier))
>     return processed.collect()
> {code}
> While this method will run just fine:
> {code}
> def build_ok(self):
>     mult = self.multiplier
>     processed = self.rdd.map(lambda row: process_row(row, mult))
>     return processed.collect()
> {code}
> In this example, {{self.multiplier}} is just an int. However, referencing it 
> inside a lambda throws a pickle error, because the lambda's closure captures 
> the whole {{self}}, and that contains {{sc}}.
> If this is the expected behavior, then why should re-assigning 
> {{self.multiplier}} to a variable make a difference?






[jira] [Commented] (SPARK-21096) Pickle error when passing a member variable to Spark executors

2017-06-15 Thread Irina Truong (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21096?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16050609#comment-16050609
 ] 

Irina Truong commented on SPARK-21096:
--

I am not passing in {{self}}. I am passing in {{self.multiplier}}, an integer 
value.

If this Spark behavior is correct, why does the second method not break?

{code}
def build_ok(self):
    mult = self.multiplier
    processed = self.rdd.map(lambda row: process_row(row, mult))
    return processed.collect()
{code}
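
The difference is visible without Spark at all. In a stdlib-only sketch (the 
{{Job}} class below is hypothetical), the failing lambda's closure cell holds 
the whole instance, while the working one holds just the int:

{code}
# Stdlib-only sketch: inspect what each lambda actually closes over.
class Job(object):
    def __init__(self):
        self.multiplier = 3

    def fail_fn(self):
        return lambda row: row * self.multiplier

    def ok_fn(self):
        mult = self.multiplier
        return lambda row: row * mult

job = Job()
print([c.cell_contents for c in job.fail_fn().__closure__])
# [<__main__.Job object at 0x...>]   <- the whole object, `sc` and all
print([c.cell_contents for c in job.ok_fn().__closure__])
# [3]                                <- only the int
{code}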

> Pickle error when passing a member variable to Spark executors
> --
>
> Key: SPARK-21096
> URL: https://issues.apache.org/jira/browse/SPARK-21096
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.1
>Reporter: Irina Truong
>
> There is a pickle error when submitting a Spark job that references a member 
> variable in a lambda, even when the member variable is a simple type that 
> should be serializable.
> Here is a minimal example:
> https://gist.github.com/j-bennet/8390c6d9a81854696f1a9b42a4ea8278
> In the gist above, this method will throw an exception:
> {code}
> def build_fail(self):
>     processed = self.rdd.map(lambda row: process_row(row, self.multiplier))
>     return processed.collect()
> {code}
> While this method will run just fine:
> {code}
> def build_ok(self):
>     mult = self.multiplier
>     processed = self.rdd.map(lambda row: process_row(row, mult))
>     return processed.collect()
> {code}
> In this example, {{self.multiplier}} is just an int. However, referencing it 
> inside a lambda throws a pickle error, because the lambda's closure captures 
> the whole {{self}}, and that contains {{sc}}.
> If this is the expected behavior, then why should re-assigning 
> {{self.multiplier}} to a variable make a difference?






[jira] [Created] (SPARK-21096) Pickle error when passing a member variable to Spark executors

2017-06-14 Thread Irina Truong (JIRA)
Irina Truong created SPARK-21096:


 Summary: Pickle error when passing a member variable to Spark 
executors
 Key: SPARK-21096
 URL: https://issues.apache.org/jira/browse/SPARK-21096
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.1.1
Reporter: Irina Truong


There is a pickle error when submitting a Spark job that references a member 
variable in a lambda, even when the member variable is a simple type that 
should be serializable.

Here is a minimal example:

https://gist.github.com/j-bennet/8390c6d9a81854696f1a9b42a4ea8278

In the gist above, this method will throw an exception:

{code}
def build_fail(self):
    processed = self.rdd.map(lambda row: process_row(row, self.multiplier))
    return processed.collect()
{code}

While this method will run just fine:

{code}
def build_ok(self):
    mult = self.multiplier
    processed = self.rdd.map(lambda row: process_row(row, mult))
    return processed.collect()
{code}

In this example, {{self.multiplier}} is just an int. However, referencing it 
inside a lambda throws a pickle error, because the lambda's closure captures 
the whole {{self}}, and that contains {{sc}}.

If this is the expected behavior, then why should re-assigning 
{{self.multiplier}} to a variable make a difference?
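
Besides the local-variable workaround above, another sketch (hypothetical, 
mirroring the gist's names and assuming {{process_row}} takes the multiplier as 
a keyword argument) is to bind the value eagerly with {{functools.partial}}, so 
only the int is captured and {{self}} never enters the closure:

{code}
# Sketch: functools.partial evaluates self.multiplier once, up front, so the
# callable shipped to executors holds only the int, never `self`.
from functools import partial

def build_partial(self):
    fn = partial(process_row, multiplier=self.multiplier)
    processed = self.rdd.map(fn)
    return processed.collect()
{code}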






[jira] [Commented] (SPARK-16784) Configurable log4j settings

2017-06-07 Thread Irina Truong (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16041856#comment-16041856
 ] 

Irina Truong commented on SPARK-16784:
--

In 2.1.0, setting {{spark.driver.extraJavaOptions}} to 
{{-Dlog4j.configuration=file:/home/hadoop/log4j.properties}} in a SparkConf 
seemed to work.

In 2.1.1, it no longer works, but setting it via {{--driver-java-options}} 
still works.

Is this a bug in 2.1.1?
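
For reference, a sketch of the two approaches being compared (paths 
hypothetical). Note that in client mode the driver JVM may already be running 
by the time a SparkConf set in application code is read, which could explain 
why only the command-line flag takes effect:

{code}
# 1) Via SparkConf in the application -- what seemed to work in 2.1.0:
from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName("log4j-demo")
        .set("spark.driver.extraJavaOptions",
             "-Dlog4j.configuration=file:/home/hadoop/log4j.properties"))
sc = SparkContext(conf=conf)

# 2) Via spark-submit -- still works in 2.1.1:
#    spark-submit \
#      --driver-java-options \
#      "-Dlog4j.configuration=file:/home/hadoop/log4j.properties" \
#      my_app.py
{code}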

> Configurable log4j settings
> ---
>
> Key: SPARK-16784
> URL: https://issues.apache.org/jira/browse/SPARK-16784
> Project: Spark
>  Issue Type: Improvement
>Affects Versions: 2.0.0, 2.1.0
>Reporter: Michael Gummelt
>
> I often want to change the logging configuration on a single spark job.  This 
> is easy in client mode.  I just modify log4j.properties.  It's difficult in 
> cluster mode, because I need to modify the log4j.properties in the 
> distribution in which the driver runs.  I'd like a way of setting this 
> dynamically, such as a java system property.  Some brief searching showed 
> that log4j doesn't seem to accept such a property, but I'd like to open up 
> this idea for further comment.  Maybe we can find a solution.






[jira] [Commented] (SPARK-19307) SPARK-17387 caused ignorance of conf object passed to SparkContext:

2017-06-07 Thread Irina Truong (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19307?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16041040#comment-16041040
 ] 

Irina Truong commented on SPARK-19307:
--

Is this available in 2.1.1? I could not find it in the release notes.

> SPARK-17387 caused ignorance of conf object passed to SparkContext:
> ---
>
> Key: SPARK-19307
> URL: https://issues.apache.org/jira/browse/SPARK-19307
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.1.0
>Reporter: yuriy_hupalo
>Assignee: Marcelo Vanzin
> Attachments: SPARK-19307.patch
>
>
> After the patch for SPARK-17387 was applied, the SparkConf object is ignored 
> when launching a SparkContext programmatically via Python from spark-submit:
> https://github.com/apache/spark/blob/master/python/pyspark/context.py#L128:
> When running {{SparkContext(conf=xxx)}} from spark-submit, {{conf}} is set but 
> {{conf._jconf}} is None, so the conf object passed as an argument is ignored 
> (and is used only when launching the java_gateway).
> How to fix (python/pyspark/context.py:132):
> {code:title=python/pyspark/context.py:132}
> if conf is not None and conf._jconf is not None:
>     # conf has been initialized in JVM properly, so use conf directly.
>     # This represents the scenario where the JVM has been launched before
>     # SparkConf is created (e.g. SparkContext is created and then stopped,
>     # and we create a new SparkConf and new SparkContext again)
>     self._conf = conf
> else:
>     self._conf = SparkConf(_jvm=SparkContext._jvm)
> +   if conf:
> +       for key, value in conf.getAll():
> +           self._conf.set(key, value)
> +           print(key, value)
> {code}






[jira] [Comment Edited] (SPARK-4296) Throw "Expression not in GROUP BY" when using same expression in group by clause and select clause

2017-03-21 Thread Irina Truong (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4296?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15935409#comment-15935409
 ] 

Irina Truong edited comment on SPARK-4296 at 3/21/17 9:59 PM:
--

I have the same exception with pyspark when my expression uses a compiled and 
registered Scala UDF. This is how it's registered:

{noformat}
sqlContext.registerJavaFunction("round_date", 'my.package.RoundDate')
{noformat}

And this is how it's called:

{noformat}
ipdb> sqlContext.sql("SELECT round_date(t.ts, '1day') from (select 
timestamp('2017-02-02T10:11:12') as ts union select 
timestamp('2017-02-02T10:19:00') as ts) as t group by round_date(t.ts, 
'1day')").show()
*** AnalysisException: u"expression 't.`ts`' is neither present in the group 
by, nor is it an aggregate function. Add to group by or wrap in first() (or 
first_value) if you don't care which value you get.;;\nAggregate [UDF(ts#80, 
1day)], [UDF(ts#80, 1day) AS UDF(ts, 1day)#82]\n+- SubqueryAlias t\n   +- 
Distinct\n  +- Union\n :- Project [cast(2017-02-02T10:11:12 as 
timestamp) AS ts#80]\n :  +- OneRowRelation$\n +- Project 
[cast(2017-02-02T10:19:00 as timestamp) AS ts#81]\n+- 
OneRowRelation$\n"
{noformat}
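
A possible workaround for this class of analyzer error (a sketch, not verified 
against this exact UDF): compute the expression once in a subquery with an 
alias, then group by that column, so the outer SELECT and GROUP BY resolve to 
the same attribute:

{code}
# Sketch (hypothetical, same round_date UDF as above): alias the UDF result
# in an inner query, then group by the resulting column.
sqlContext.sql("""
    SELECT day
    FROM (SELECT round_date(t.ts, '1day') AS day
          FROM (SELECT timestamp('2017-02-02T10:11:12') AS ts
                UNION
                SELECT timestamp('2017-02-02T10:19:00') AS ts) AS t) AS r
    GROUP BY day
""").show()
{code}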


was (Author: irinatruong):
I have the same exception with pyspark when my expression uses a compiled and 
registered Scala UDF:

sqlContext.registerJavaFunction("round_date", 'my.package.RoundDate')

ipdb> sqlContext.sql("SELECT round_date(t.ts, '1day') from (select 
timestamp('2017-02-02T10:11:12') as ts union select 
timestamp('2017-02-02T10:19:00') as ts) as t group by round_date(t.ts, 
'1day')").show()
*** AnalysisException: u"expression 't.`ts`' is neither present in the group 
by, nor is it an aggregate function. Add to group by or wrap in first() (or 
first_value) if you don't care which value you get.;;\nAggregate [UDF(ts#80, 
1day)], [UDF(ts#80, 1day) AS UDF(ts, 1day)#82]\n+- SubqueryAlias t\n   +- 
Distinct\n  +- Union\n :- Project [cast(2017-02-02T10:11:12 as 
timestamp) AS ts#80]\n :  +- OneRowRelation$\n +- Project 
[cast(2017-02-02T10:19:00 as timestamp) AS ts#81]\n+- 
OneRowRelation$\n"




> Throw "Expression not in GROUP BY" when using same expression in group by 
> clause and  select clause
> ---
>
> Key: SPARK-4296
> URL: https://issues.apache.org/jira/browse/SPARK-4296
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.1.0, 1.1.1, 1.2.0
>Reporter: Shixiong Zhu
>Assignee: Cheng Lian
>Priority: Blocker
> Fix For: 1.2.1, 1.3.0
>
>
> When the input data has a complex structure, using same expression in group 
> by clause and  select clause will throw "Expression not in GROUP BY".
> {code:java}
> val sqlContext = new org.apache.spark.sql.SQLContext(sc)
> import sqlContext.createSchemaRDD
> case class Birthday(date: String)
> case class Person(name: String, birthday: Birthday)
> val people = sc.parallelize(List(Person("John", Birthday("1990-01-22")), 
> Person("Jim", Birthday("1980-02-28"))))
> people.registerTempTable("people")
> val year = sqlContext.sql("select count(*), upper(birthday.date) from people 
> group by upper(birthday.date)")
> year.collect
> {code}
> Here is the plan of year:
> {code:java}
> SchemaRDD[3] at RDD at SchemaRDD.scala:105
> == Query Plan ==
> == Physical Plan ==
> org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Expression 
> not in GROUP BY: Upper(birthday#1.date AS date#9) AS c1#3, tree:
> Aggregate [Upper(birthday#1.date)], [COUNT(1) AS c0#2L,Upper(birthday#1.date 
> AS date#9) AS c1#3]
>  Subquery people
>   LogicalRDD [name#0,birthday#1], MapPartitionsRDD[1] at mapPartitions at 
> ExistingRDD.scala:36
> {code}
> The bug is the equality test for `Upper(birthday#1.date)` and 
> `Upper(birthday#1.date AS date#9)`.
> Maybe Spark SQL needs a mechanism to compare Alias expression and non-Alias 
> expression.





