[jira] [Reopened] (SPARK-21096) Pickle error when passing a member variable to Spark executors
[ https://issues.apache.org/jira/browse/SPARK-21096?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Irina Truong reopened SPARK-21096:
----------------------------------

The two methods I described should be equivalent, but they are not.

> Pickle error when passing a member variable to Spark executors
> --------------------------------------------------------------
>
>                 Key: SPARK-21096
>                 URL: https://issues.apache.org/jira/browse/SPARK-21096
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 2.1.1
>            Reporter: Irina Truong
>
> There is a pickle error when submitting a Spark job that references a member
> variable in a lambda, even when the member variable is a simple type that
> should be serializable.
>
> Here is a minimal example:
> https://gist.github.com/j-bennet/8390c6d9a81854696f1a9b42a4ea8278
>
> In the gist above, this method will throw an exception:
> {quote}
> def build_fail(self):
>     processed = self.rdd.map(lambda row: process_row(row, self.multiplier))
>     return processed.collect()
> {quote}
>
> While this method will run just fine:
> {quote}
> def build_ok(self):
>     mult = self.multiplier
>     processed = self.rdd.map(lambda row: process_row(row, mult))
>     return processed.collect()
> {quote}
>
> In this example, {{self.multiplier}} is just an int. However, passing it into
> a lambda throws a pickle error, because Spark is trying to pickle the whole
> {{self}}, and that contains {{sc}}.
>
> If this is the expected behavior, then why should re-assigning
> {{self.multiplier}} to a variable make a difference?

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
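The difference between the two methods comes down to what each lambda closes over, which can be checked outside Spark by inspecting the closure's free variables. A minimal sketch (the {{Builder}} class and names below are illustrative stand-ins for the code in the gist, not copied from it):

```python
# A lambda that mentions self.multiplier closes over `self`, not over the int.
class Builder:
    def __init__(self, multiplier):
        self.multiplier = multiplier

    def fail_style(self):
        # Mirrors build_fail: the attribute access keeps `self` in the closure
        return lambda row: row * self.multiplier

    def ok_style(self):
        mult = self.multiplier  # copy the int into a local first
        # Mirrors build_ok: only the plain int is captured
        return lambda row: row * mult


b = Builder(3)
print(b.fail_style().__code__.co_freevars)  # ('self',)
print(b.ok_style().__code__.co_freevars)    # ('mult',)
```

When PySpark serializes the closure to ship it to executors, the first variant drags the whole object - including the non-serializable SparkContext - into the pickle, which matches the error described above.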
[jira] [Commented] (SPARK-21096) Pickle error when passing a member variable to Spark executors
[ https://issues.apache.org/jira/browse/SPARK-21096?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16050609#comment-16050609 ]

Irina Truong commented on SPARK-21096:
--------------------------------------

I am not passing in {{self}}. I am passing in {{self.multiplier}} - an integer value. If this Spark behavior is correct, why does the second method not break?

{quote}
def build_ok(self):
    mult = self.multiplier
    processed = self.rdd.map(lambda row: process_row(row, mult))
    return processed.collect()
{quote}
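The asymmetry the comment asks about can be reproduced with the standard library alone. In the sketch below, an unpicklable attribute stands in for the SparkContext; this is an analogy, not the exact code path (PySpark serializes closures with cloudpickle), but the principle is the same:

```python
import pickle


class Job:
    def __init__(self):
        self.multiplier = 2
        # A locally defined lambda is not picklable by the stdlib, much like
        # a SparkContext is not serializable; it stands in for `sc` here.
        self.sc = lambda: None


job = Job()

# Pickling the whole object fails, because it drags self.sc along.
try:
    pickle.dumps(job)
    whole_object_ok = True
except Exception:
    whole_object_ok = False

# Pickling just the int succeeds.
mult = job.multiplier
bare_int_ok = pickle.dumps(mult) is not None

print(whole_object_ok, bare_int_ok)  # False True
```

Referencing {{self.multiplier}} inside the lambda makes {{self}} a free variable of the closure, so the serializer must pickle the entire object; copying the int into {{mult}} first means only the int has to travel.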
[jira] [Updated] (SPARK-21096) Pickle error when passing a member variable to Spark executors
[ https://issues.apache.org/jira/browse/SPARK-21096?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Irina Truong updated SPARK-21096:
---------------------------------
    Description: (edited the wiki markup of the code snippets; the description text is unchanged)
[jira] [Updated] (SPARK-21096) Pickle error when passing a member variable to Spark executors
[ https://issues.apache.org/jira/browse/SPARK-21096?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Irina Truong updated SPARK-21096:
---------------------------------
    Description: (edited the wiki markup of the code snippets; the description text is unchanged)
[jira] [Updated] (SPARK-21096) Pickle error when passing a member variable to Spark executors
[ https://issues.apache.org/jira/browse/SPARK-21096?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Irina Truong updated SPARK-21096:
---------------------------------
    Description: (edited the wiki markup of the code snippets; the description text is unchanged)
[jira] [Updated] (SPARK-21096) Pickle error when passing a member variable to Spark executors
[ https://issues.apache.org/jira/browse/SPARK-21096?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Irina Truong updated SPARK-21096:
---------------------------------
    Description: (edited the wiki markup of the code snippets; the description text is unchanged)
[jira] [Updated] (SPARK-21096) Pickle error when passing a member variable to Spark executors
[ https://issues.apache.org/jira/browse/SPARK-21096?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Irina Truong updated SPARK-21096:
---------------------------------
    Description: (edited the wiki markup of the code snippets; the description text is unchanged)
[jira] [Updated] (SPARK-21096) Pickle error when passing a member variable to Spark executors
[ https://issues.apache.org/jira/browse/SPARK-21096?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Irina Truong updated SPARK-21096:
---------------------------------
    Description: (fixed a stray {{}}}} in the markup of the code snippets; the description text is unchanged)
[jira] [Updated] (SPARK-21096) Pickle error when passing a member variable to Spark executors
[ https://issues.apache.org/jira/browse/SPARK-21096?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Irina Truong updated SPARK-21096:
---------------------------------
    Description: (edited the wiki markup of the code snippets; the description text is unchanged)
[jira] [Updated] (SPARK-21096) Pickle error when passing a member variable to Spark executors
[ https://issues.apache.org/jira/browse/SPARK-21096?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Irina Truong updated SPARK-21096:
---------------------------------
    Description: (edited the wiki markup of the code snippets; the description text is unchanged)
[jira] [Created] (SPARK-21096) Pickle error when passing a member variable to Spark executors
Irina Truong created SPARK-21096: Summary: Pickle error when passing a member variable to Spark executors Key: SPARK-21096 URL: https://issues.apache.org/jira/browse/SPARK-21096 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 2.1.1 Reporter: Irina Truong There is a pickle error when submitting a spark job that references a member variable in a lambda, even when the member variable is a simple type that should be serializable. Here is a minimal example: https://gist.github.com/j-bennet/8390c6d9a81854696f1a9b42a4ea8278 In the gist above, this method will throw an exception: {{def build_fail(self): processed = self.rdd.map(lambda row: process_row(row, self.multiplier)) return processed.collect() }} While this method will run just fine: {{def build_ok(self): mult = self.multiplier processed = self.rdd.map(lambda row: process_row(row, mult)) return processed.collect() }} In this example, {{self.multiplier}} is just an int. However, passing it into a lambda throws a pickle error, because it is trying to pickle the whole {{self}}, and that contains {{sc}}. If this is the expected behavior, then why should re-assigning {{self.multiplier}} to a variable make a difference? -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
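The difference between the two methods follows from Python closure semantics rather than anything Spark-specific: a lambda whose body mentions {{self.multiplier}} closes over {{self}} itself, not over the int, so the serializer has to pickle the entire object, {{sc}} included. Copying the int into a local first means the closure holds only the int. A minimal sketch, no Spark required ({{Builder}} and {{Unpicklable}} are hypothetical stand-ins for the gist's class and for SparkContext):

```python
class Unpicklable:
    """Stand-in for SparkContext: refuses to be pickled, like sc."""
    def __reduce__(self):
        raise TypeError("cannot pickle Unpicklable (plays the role of sc)")

class Builder:
    def __init__(self):
        self.sc = Unpicklable()
        self.multiplier = 3

    def build_fail_style(self):
        # The lambda body mentions self, so 'self' is its free variable:
        # serializing this closure drags in the whole object, sc included.
        return lambda row: row * self.multiplier

    def build_ok_style(self):
        mult = self.multiplier           # copy the int into a local first
        return lambda row: row * mult    # the closure now holds only the int

b = Builder()
print(b.build_fail_style().__code__.co_freevars)  # ('self',)
print(b.build_ok_style().__code__.co_freevars)    # ('mult',)
```

So the re-assignment is not cosmetic: Spark's cloudpickle serializes whatever the closure's free variables reference, and the local rebinding changes that from the whole {{self}} to a plain int.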
[jira] [Commented] (SPARK-16784) Configurable log4j settings
[ https://issues.apache.org/jira/browse/SPARK-16784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16041856#comment-16041856 ] Irina Truong commented on SPARK-16784: -- In 2.1.0, setting "spark.driver.extraJavaOptions" to "-Dlog4j.configuration=file:/home/hadoop/log4j.properties" in SparkConfig seemed to work. In 2.1.1, it does not work anymore, but setting it via "--driver-java-options" still works. Is this a bug in 2.1.1? > Configurable log4j settings > --- > > Key: SPARK-16784 > URL: https://issues.apache.org/jira/browse/SPARK-16784 > Project: Spark > Issue Type: Improvement >Affects Versions: 2.0.0, 2.1.0 >Reporter: Michael Gummelt > > I often want to change the logging configuration on a single spark job. This > is easy in client mode. I just modify log4j.properties. It's difficult in > cluster mode, because I need to modify the log4j.properties in the > distribution in which the driver runs. I'd like a way of setting this > dynamically, such as a java system property. Some brief searching showed > that log4j doesn't seem to accept such a property, but I'd like to open up > this idea for further comment. Maybe we can find a solution. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
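For concreteness, the two ways of passing the driver JVM option discussed in this comment would look roughly like this on the command line (the log4j.properties path matches the comment; the app name is illustrative):

```shell
# Set as a Spark conf entry (reported working on 2.1.0 but not on 2.1.1):
spark-submit \
  --conf "spark.driver.extraJavaOptions=-Dlog4j.configuration=file:/home/hadoop/log4j.properties" \
  app.py

# Set via the dedicated flag (reported still working on 2.1.1):
spark-submit \
  --driver-java-options "-Dlog4j.configuration=file:/home/hadoop/log4j.properties" \
  app.py
```

One possibly relevant caveat from the Spark configuration docs: in client mode, {{spark.driver.extraJavaOptions}} must not be set through SparkConf inside the application, because the driver JVM has already started by then; {{--driver-java-options}} or {{spark-defaults.conf}} is the supported route there. Whether that accounts for the observed 2.1.0 to 2.1.1 difference is the open question here.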
[jira] [Commented] (SPARK-19307) SPARK-17387 caused ignorance of conf object passed to SparkContext:
[ https://issues.apache.org/jira/browse/SPARK-19307?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16041040#comment-16041040 ] Irina Truong commented on SPARK-19307: -- Is this available in 2.1.1? I could not find it in release notes. > SPARK-17387 caused ignorance of conf object passed to SparkContext: > --- > > Key: SPARK-19307 > URL: https://issues.apache.org/jira/browse/SPARK-19307 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.1.0 >Reporter: yuriy_hupalo >Assignee: Marcelo Vanzin > Attachments: SPARK-19307.patch > > > after patch SPARK-17387 was applied -- SparkConf object is ignored when > launching SparkContext programmatically via python from spark-submit: > https://github.com/apache/spark/blob/master/python/pyspark/context.py#L128: > in case when we are running python SparkContext(conf=xxx) from spark-submit: > conf is set, conf._jconf is None > passed as arg conf object is ignored (and used only when we are > launching java_gateway). > how to fix: > python/pyspark/context.py:132 > {code:title=python/pyspark/context.py:132} > if conf is not None and conf._jconf is not None: > # conf has been initialized in JVM properly, so use conf > directly. This represent the > # scenario that JVM has been launched before SparkConf is created > (e.g. SparkContext is > # created and then stopped, and we create a new SparkConf and new > SparkContext again) > self._conf = conf > else: > self._conf = SparkConf(_jvm=SparkContext._jvm) > + if conf: > + for key, value in conf.getAll(): > + self._conf.set(key,value) > + print(key,value) > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
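The {{+}} lines in the patch above overlay the user-passed conf onto the freshly created one. The proposed merge semantics, isolated with plain dicts standing in for {{SparkConf}} (a hypothetical sketch, not the actual pyspark code):

```python
def merge_conf(passed_conf, jvm_side_conf):
    """Start from the JVM-side conf and overlay every (key, value) pair
    from the conf object the caller passed in, so user settings win."""
    merged = dict(jvm_side_conf)
    if passed_conf:  # mirrors the 'if conf:' guard in the patch
        for key, value in passed_conf.items():  # stands in for conf.getAll()
            merged[key] = value
    return merged

defaults = {"spark.app.name": "default", "spark.master": "local[2]"}
user = {"spark.app.name": "my-app"}
print(merge_conf(user, defaults))
# {'spark.app.name': 'my-app', 'spark.master': 'local[2]'}
```

The point of the fix is exactly this overlay: without it, a SparkConf passed from spark-submit-launched Python code (where {{conf._jconf}} is None) was silently discarded.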
[jira] [Comment Edited] (SPARK-4296) Throw "Expression not in GROUP BY" when using same expression in group by clause and select clause
[ https://issues.apache.org/jira/browse/SPARK-4296?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15935409#comment-15935409 ] Irina Truong edited comment on SPARK-4296 at 3/21/17 10:01 PM: --- I have the same exception with pyspark when my expression uses a compiled and registered Scala UDF. This is how it's registered: {noformat} sqlContext.registerJavaFunction("round_date", 'my.package.RoundDate') {noformat} And this is how it's called: {noformat} ipdb> sqlContext.sql("SELECT round_date(t.ts, '1day') from (select timestamp('2017-02-02T10:11:12') as ts union select timestamp('2017-02-02T10:19:00') as ts) as t group by round_date(t.ts, '1day')").show() *** AnalysisException: u"expression 't.`ts`' is neither present in the group by, nor is it an aggregate function. Add to group by or wrap in first() (or first_value) if you don't care which value you get.;;\nAggregate [UDF(ts#80, 1day)], [UDF(ts#80, 1day) AS UDF(ts, 1day)#82]\n+- SubqueryAlias t\n +- Distinct\n +- Union\n :- Project [cast(2017-02-02T10:11:12 as timestamp) AS ts#80]\n : +- OneRowRelation$\n +- Project [cast(2017-02-02T10:19:00 as timestamp) AS ts#81]\n+- OneRowRelation$\n" {noformat} was (Author: irinatruong): I have the same exception with pyspark when my expression uses a compiled and registered Scala UDF. This is how it's registered: {noformat} sqlContext.registerJavaFunction("round_date", 'my.package.RoundDate') {noformat} And this is how it's called: {noformat} ipdb> sqlContext.sql("SELECT round_date(t.ts, '1day') from (select timestamp('2017-02-02T10:11:12') as ts union select timestamp('2017-02-02T10:19:00') as ts) as t group by round_date(t.ts, '1day')").show() *** AnalysisException: u"expression 't.`ts`' is neither present in the group by, nor is it an aggregate function. 
Add to group by or wrap in first() (or first_value) if you don't care which value you get.;;\nAggregate [UDF(ts#80, 1day)], [UDF(ts#80, 1day) AS UDF(ts, 1day)#82]\n+- SubqueryAlias t\n +- Distinct\n +- Union\n :- Project [cast(2017-02-02T10:11:12 as timestamp) AS ts#80]\n : +- OneRowRelation$\n +- Project [cast(2017-02-02T10:19:00 as timestamp) AS ts#81]\n+- OneRowRelation$\n" {noformat} > Throw "Expression not in GROUP BY" when using same expression in group by > clause and select clause > --- > > Key: SPARK-4296 > URL: https://issues.apache.org/jira/browse/SPARK-4296 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.1.0, 1.1.1, 1.2.0 >Reporter: Shixiong Zhu >Assignee: Cheng Lian >Priority: Blocker > Fix For: 1.2.1, 1.3.0 > > > When the input data has a complex structure, using same expression in group > by clause and select clause will throw "Expression not in GROUP BY". > {code:java} > val sqlContext = new org.apache.spark.sql.SQLContext(sc) > import sqlContext.createSchemaRDD > case class Birthday(date: String) > case class Person(name: String, birthday: Birthday) > val people = sc.parallelize(List(Person("John", Birthday("1990-01-22")), > Person("Jim", Birthday("1980-02-28")))) > people.registerTempTable("people") > val year = sqlContext.sql("select count(*), upper(birthday.date) from people > group by upper(birthday.date)") > year.collect > {code} > Here is the plan of year: > {code:java} > SchemaRDD[3] at RDD at SchemaRDD.scala:105 > == Query Plan == > == Physical Plan == > org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Expression > not in GROUP BY: Upper(birthday#1.date AS date#9) AS c1#3, tree: > Aggregate [Upper(birthday#1.date)], [COUNT(1) AS c0#2L,Upper(birthday#1.date > AS date#9) AS c1#3] > Subquery people > LogicalRDD [name#0,birthday#1], MapPartitionsRDD[1] at mapPartitions at > ExistingRDD.scala:36 > {code} > The bug is the equality test for `Upper(birthday#1.date)` and > `Upper(birthday#1.date AS date#9)`. 
> Maybe Spark SQL needs a mechanism to compare Alias expression and non-Alias > expression. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-4296) Throw "Expression not in GROUP BY" when using same expression in group by clause and select clause
[ https://issues.apache.org/jira/browse/SPARK-4296?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15935409#comment-15935409 ] Irina Truong edited comment on SPARK-4296 at 3/21/17 9:59 PM: -- I have the same exception with pyspark when my expression uses a compiled and registered Scala UDF. This is how it's registered: {noformat} sqlContext.registerJavaFunction("round_date", 'my.package.RoundDate') {noformat} And this is how it's called: {noformat} ipdb> sqlContext.sql("SELECT round_date(t.ts, '1day') from (select timestamp('2017-02-02T10:11:12') as ts union select timestamp('2017-02-02T10:19:00') as ts) as t group by round_date(t.ts, '1day')").show() *** AnalysisException: u"expression 't.`ts`' is neither present in the group by, nor is it an aggregate function. Add to group by or wrap in first() (or first_value) if you don't care which value you get.;;\nAggregate [UDF(ts#80, 1day)], [UDF(ts#80, 1day) AS UDF(ts, 1day)#82]\n+- SubqueryAlias t\n +- Distinct\n +- Union\n :- Project [cast(2017-02-02T10:11:12 as timestamp) AS ts#80]\n : +- OneRowRelation$\n +- Project [cast(2017-02-02T10:19:00 as timestamp) AS ts#81]\n+- OneRowRelation$\n" {noformat} was (Author: irinatruong): I'm have the same exception with pyspark when my expression uses a compiled and registered Scala UDF: sqlContext.registerJavaFunction("round_date", 'my.package.RoundDate') ipdb> sqlContext.sql("SELECT round_date(t.ts, '1day') from (select timestamp('2017-02-02T10:11:12') as ts union select timestamp('2017-02-02T10:19:00') as ts) as t group by round_date(t.ts, '1day')").show() *** AnalysisException: u"expression 't.`ts`' is neither present in the group by, nor is it an aggregate function. 
Add to group by or wrap in first() (or first_value) if you don't care which value you get.;;\nAggregate [UDF(ts#80, 1day)], [UDF(ts#80, 1day) AS UDF(ts, 1day)#82]\n+- SubqueryAlias t\n +- Distinct\n +- Union\n :- Project [cast(2017-02-02T10:11:12 as timestamp) AS ts#80]\n : +- OneRowRelation$\n +- Project [cast(2017-02-02T10:19:00 as timestamp) AS ts#81]\n+- OneRowRelation$\n" > Throw "Expression not in GROUP BY" when using same expression in group by > clause and select clause > --- > > Key: SPARK-4296 > URL: https://issues.apache.org/jira/browse/SPARK-4296 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.1.0, 1.1.1, 1.2.0 >Reporter: Shixiong Zhu >Assignee: Cheng Lian >Priority: Blocker > Fix For: 1.2.1, 1.3.0 > > > When the input data has a complex structure, using same expression in group > by clause and select clause will throw "Expression not in GROUP BY". > {code:java} > val sqlContext = new org.apache.spark.sql.SQLContext(sc) > import sqlContext.createSchemaRDD > case class Birthday(date: String) > case class Person(name: String, birthday: Birthday) > val people = sc.parallelize(List(Person("John", Birthday("1990-01-22")), > Person("Jim", Birthday("1980-02-28")))) > people.registerTempTable("people") > val year = sqlContext.sql("select count(*), upper(birthday.date) from people > group by upper(birthday.date)") > year.collect > {code} > Here is the plan of year: > {code:java} > SchemaRDD[3] at RDD at SchemaRDD.scala:105 > == Query Plan == > == Physical Plan == > org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Expression > not in GROUP BY: Upper(birthday#1.date AS date#9) AS c1#3, tree: > Aggregate [Upper(birthday#1.date)], [COUNT(1) AS c0#2L,Upper(birthday#1.date > AS date#9) AS c1#3] > Subquery people > LogicalRDD [name#0,birthday#1], MapPartitionsRDD[1] at mapPartitions at > ExistingRDD.scala:36 > {code} > The bug is the equality test for `Upper(birthday#1.date)` and > `Upper(birthday#1.date AS date#9)`. 
> Maybe Spark SQL needs a mechanism to compare Alias expression and non-Alias > expression. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4296) Throw "Expression not in GROUP BY" when using same expression in group by clause and select clause
[ https://issues.apache.org/jira/browse/SPARK-4296?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15935409#comment-15935409 ] Irina Truong commented on SPARK-4296: - I have the same exception with pyspark when my expression uses a compiled and registered Scala UDF: sqlContext.registerJavaFunction("round_date", 'my.package.RoundDate') ipdb> sqlContext.sql("SELECT round_date(t.ts, '1day') from (select timestamp('2017-02-02T10:11:12') as ts union select timestamp('2017-02-02T10:19:00') as ts) as t group by round_date(t.ts, '1day')").show() *** AnalysisException: u"expression 't.`ts`' is neither present in the group by, nor is it an aggregate function. Add to group by or wrap in first() (or first_value) if you don't care which value you get.;;\nAggregate [UDF(ts#80, 1day)], [UDF(ts#80, 1day) AS UDF(ts, 1day)#82]\n+- SubqueryAlias t\n +- Distinct\n +- Union\n :- Project [cast(2017-02-02T10:11:12 as timestamp) AS ts#80]\n : +- OneRowRelation$\n +- Project [cast(2017-02-02T10:19:00 as timestamp) AS ts#81]\n+- OneRowRelation$\n" > Throw "Expression not in GROUP BY" when using same expression in group by > clause and select clause > --- > > Key: SPARK-4296 > URL: https://issues.apache.org/jira/browse/SPARK-4296 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.1.0, 1.1.1, 1.2.0 >Reporter: Shixiong Zhu >Assignee: Cheng Lian >Priority: Blocker > Fix For: 1.2.1, 1.3.0 > > > When the input data has a complex structure, using same expression in group > by clause and select clause will throw "Expression not in GROUP BY". 
> {code:java} > val sqlContext = new org.apache.spark.sql.SQLContext(sc) > import sqlContext.createSchemaRDD > case class Birthday(date: String) > case class Person(name: String, birthday: Birthday) > val people = sc.parallelize(List(Person("John", Birthday("1990-01-22")), > Person("Jim", Birthday("1980-02-28")))) > people.registerTempTable("people") > val year = sqlContext.sql("select count(*), upper(birthday.date) from people > group by upper(birthday.date)") > year.collect > {code} > Here is the plan of year: > {code:java} > SchemaRDD[3] at RDD at SchemaRDD.scala:105 > == Query Plan == > == Physical Plan == > org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Expression > not in GROUP BY: Upper(birthday#1.date AS date#9) AS c1#3, tree: > Aggregate [Upper(birthday#1.date)], [COUNT(1) AS c0#2L,Upper(birthday#1.date > AS date#9) AS c1#3] > Subquery people > LogicalRDD [name#0,birthday#1], MapPartitionsRDD[1] at mapPartitions at > ExistingRDD.scala:36 > {code} > The bug is the equality test for `Upper(birthday#1.date)` and > `Upper(birthday#1.date AS date#9)`. > Maybe Spark SQL needs a mechanism to compare Alias expression and non-Alias > expression. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
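A workaround that often sidesteps this class of analyzer error is to evaluate the expression once in a subquery, alias it, and group by the alias, so the outer aggregate never has to re-match the raw {{t.ts}} reference. Sketched below with SQLite and {{substr}} as a stand-in for the {{round_date}} UDF (this demonstrates the shape of the rewrite, not Spark itself):

```python
import sqlite3

con = sqlite3.connect(":memory:")
rows = con.execute(
    """
    SELECT d, COUNT(*)
    FROM (SELECT substr(ts, 1, 10) AS d   -- stand-in for round_date(t.ts, '1day')
          FROM (SELECT '2017-02-02T10:11:12' AS ts
                UNION
                SELECT '2017-02-02T10:19:00' AS ts))
    GROUP BY d
    """
).fetchall()
print(rows)  # [('2017-02-02', 2)]
```

In Spark SQL the analogous rewrite projects the UDF result as a named column in an inner select and groups the outer query by that column, which keeps the grouped expression and the selected expression textually identical for the analyzer.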