[jira] [Commented] (SPARK-20077) Documentation for ml.stats.Correlation
[ https://issues.apache.org/jira/browse/SPARK-20077?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16046023#comment-16046023 ]

Michael Patterson commented on SPARK-20077:
---

Is this task still open? Are you thinking of documentation like the one that already exists for MLlib (https://spark.apache.org/docs/latest/mllib-statistics.html#correlations)?

> Documentation for ml.stats.Correlation
> --
>
> Key: SPARK-20077
> URL: https://issues.apache.org/jira/browse/SPARK-20077
> Project: Spark
> Issue Type: Sub-task
> Components: ML
> Affects Versions: 2.1.0
> Reporter: Timothy Hunter
>
> Now that (Pearson) correlations are available in spark.ml, we need to write
> some documentation to go along with this feature. For now, the unit tests can
> serve as examples.

--
This message was sent by Atlassian JIRA (v6.3.15#6346)
-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
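[Editor's note] The kind of example such documentation could include might look like the following minimal sketch. It is not part of the thread: it assumes a running SparkSession bound to `spark` and the `pyspark.ml.stat.Correlation` API that shipped after this issue was filed (Spark 2.2+).

```python
from pyspark.ml.linalg import Vectors
from pyspark.ml.stat import Correlation

# Correlation.corr expects a single vector-valued column
data = [(Vectors.dense([1.0, 0.5, 2.0]),),
        (Vectors.dense([2.0, 1.0, 1.0]),),
        (Vectors.dense([4.0, 2.5, 3.0]),)]
df = spark.createDataFrame(data, ['features'])

# Pearson correlation matrix of the 'features' column
# (method='spearman' is also supported)
matrix = Correlation.corr(df, 'features').head()[0]
print(matrix)
```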
[jira] [Updated] (SPARK-20456) Add examples for functions collection for pyspark
[ https://issues.apache.org/jira/browse/SPARK-20456?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael Patterson updated SPARK-20456:
--

Description:
Document sql.functions.py:
1. Add examples for the common string functions (upper, lower, and reverse)
2. Rename columns in datetime examples to be more informative (e.g. from 'd' to 'date')
3. Add examples for unix_timestamp, from_unixtime, rand, randn, collect_list, collect_set, and lit
4. Add note to all trigonometry functions that units are radians.
5. Add links between functions (e.g. add a link to radians from toRadians)

was:
Document `sql.functions.py`:
1. Add examples for the common aggregate functions (`min`, `max`, `mean`, `count`, `collect_set`, `collect_list`, `stddev`, `variance`)
2. Rename columns in datetime examples.
3. Add examples for `unix_timestamp` and `from_unixtime`
4. Add note to all trigonometry functions that units are radians.
5. Add example for `lit`

> Add examples for functions collection for pyspark
> -
>
> Key: SPARK-20456
> URL: https://issues.apache.org/jira/browse/SPARK-20456
> Project: Spark
> Issue Type: Documentation
> Components: Documentation, PySpark
> Affects Versions: 2.1.0
> Reporter: Michael Patterson
> Priority: Minor
>
> Document sql.functions.py:
> 1. Add examples for the common string functions (upper, lower, and reverse)
> 2. Rename columns in datetime examples to be more informative (e.g. from 'd' to 'date')
> 3. Add examples for unix_timestamp, from_unixtime, rand, randn, collect_list, collect_set, and lit
> 4. Add note to all trigonometry functions that units are radians.
> 5. Add links between functions (e.g. add a link to radians from toRadians)
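[Editor's note] A hedged sketch of the kind of docstring examples items 1-3 describe (not from the thread; assumes a running SparkSession bound to `spark`, and the column names are illustrative):

```python
from pyspark.sql import functions as F

df = spark.createDataFrame([('Spark', '2017-04-25 10:30:00')], ['name', 'date'])

# Item 1: common string functions
df.select(F.upper(df.name), F.lower(df.name), F.reverse(df.name)).show()

# Item 3: datetime conversion (column named 'date' rather than 'd', per item 2)
df.select(F.unix_timestamp(df.date).alias('unix_time')).show()

# Item 3: lit wraps a Python literal as a Column
df.select(F.lit(3.14).alias('pi')).show()
```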
[jira] [Updated] (SPARK-20456) Add examples for functions collection for pyspark
[ https://issues.apache.org/jira/browse/SPARK-20456?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael Patterson updated SPARK-20456:
--

Summary: Add examples for functions collection for pyspark (was: Document major aggregation functions for pyspark)

> Add examples for functions collection for pyspark
> -
>
> Key: SPARK-20456
> URL: https://issues.apache.org/jira/browse/SPARK-20456
> Project: Spark
> Issue Type: Documentation
> Components: Documentation
> Affects Versions: 2.1.0
> Reporter: Michael Patterson
> Priority: Minor
>
> Document `sql.functions.py`:
> 1. Document the common aggregate functions (`min`, `max`, `mean`, `count`, `collect_set`, `collect_list`, `stddev`, `variance`)
> 2. Rename columns in datetime examples.
> 3. Add examples for `unix_timestamp` and `from_unixtime`
> 4. Add note to all trigonometry functions that units are radians.
> 5. Document `lit`
[jira] [Updated] (SPARK-20456) Add examples for functions collection for pyspark
[ https://issues.apache.org/jira/browse/SPARK-20456?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael Patterson updated SPARK-20456:
--

Description:
Document `sql.functions.py`:
1. Add examples for the common aggregate functions (`min`, `max`, `mean`, `count`, `collect_set`, `collect_list`, `stddev`, `variance`)
2. Rename columns in datetime examples.
3. Add examples for `unix_timestamp` and `from_unixtime`
4. Add note to all trigonometry functions that units are radians.
5. Add example for `lit`

was:
Document `sql.functions.py`:
1. Document the common aggregate functions (`min`, `max`, `mean`, `count`, `collect_set`, `collect_list`, `stddev`, `variance`)
2. Rename columns in datetime examples.
3. Add examples for `unix_timestamp` and `from_unixtime`
4. Add note to all trigonometry functions that units are radians.
5. Document `lit`

> Add examples for functions collection for pyspark
> -
>
> Key: SPARK-20456
> URL: https://issues.apache.org/jira/browse/SPARK-20456
> Project: Spark
> Issue Type: Documentation
> Components: Documentation
> Affects Versions: 2.1.0
> Reporter: Michael Patterson
> Priority: Minor
>
> Document `sql.functions.py`:
> 1. Add examples for the common aggregate functions (`min`, `max`, `mean`, `count`, `collect_set`, `collect_list`, `stddev`, `variance`)
> 2. Rename columns in datetime examples.
> 3. Add examples for `unix_timestamp` and `from_unixtime`
> 4. Add note to all trigonometry functions that units are radians.
> 5. Add example for `lit`
[jira] [Commented] (SPARK-20456) Document major aggregation functions for pyspark
[ https://issues.apache.org/jira/browse/SPARK-20456?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15983590#comment-15983590 ]

Michael Patterson commented on SPARK-20456:
---

I saw that there are short docstrings for the aggregate functions, but I think they can be unclear for people new to Spark or to relational algebra. For example, some of my coworkers didn't know you could call `df.agg(mean(col))` without doing a `groupby`. There are also no links to `groupby` in any of the aggregate functions. I also didn't know about `collect_set` for a long time. I think adding examples would help with visibility and understanding.

The same thing applies to `lit`. It took me a while to learn what it did.

For the datetime functions, this line, for example, has a column named 'd': https://github.com/map222/spark/blob/master/python/pyspark/sql/functions.py#L926 I think it would be more informative to name it 'date' or 'time'.

Do these sound reasonable?

> Document major aggregation functions for pyspark
> 
>
> Key: SPARK-20456
> URL: https://issues.apache.org/jira/browse/SPARK-20456
> Project: Spark
> Issue Type: Documentation
> Components: Documentation
> Affects Versions: 2.1.0
> Reporter: Michael Patterson
> Priority: Minor
>
> Document `sql.functions.py`:
> 1. Document the common aggregate functions (`min`, `max`, `mean`, `count`, `collect_set`, `collect_list`, `stddev`, `variance`)
> 2. Rename columns in datetime examples.
> 3. Add examples for `unix_timestamp` and `from_unixtime`
> 4. Add note to all trigonometry functions that units are radians.
> 5. Document `lit`
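[Editor's note] As a hedged illustration of the comment's point (not part of the thread; assumes a running SparkSession bound to `spark`), aggregating an entire DataFrame without a `groupBy` looks like this:

```python
from pyspark.sql import functions as F

df = spark.createDataFrame([(1,), (2,), (2,)], ['value'])

# Whole-DataFrame aggregation: no groupBy required
df.agg(F.mean('value').alias('mean'),
       F.collect_set('value').alias('distinct_values')).show()

# lit() turns a Python literal into a Column, e.g. for use in expressions
df.select((df.value + F.lit(10)).alias('value_plus_ten')).show()
```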
[jira] [Created] (SPARK-20456) Document major aggregation functions for pyspark
Michael Patterson created SPARK-20456:
-

Summary: Document major aggregation functions for pyspark
Key: SPARK-20456
URL: https://issues.apache.org/jira/browse/SPARK-20456
Project: Spark
Issue Type: Documentation
Components: Documentation
Affects Versions: 2.1.0
Reporter: Michael Patterson

Document `sql.functions.py`:
1. Document the common aggregate functions (`min`, `max`, `mean`, `count`, `collect_set`, `collect_list`, `stddev`, `variance`)
2. Rename columns in datetime examples.
3. Add examples for `unix_timestamp` and `from_unixtime`
4. Add note to all trigonometry functions that units are radians.
5. Document `lit`
[jira] [Issue Comment Deleted] (SPARK-20132) Add documentation for column string functions
[ https://issues.apache.org/jira/browse/SPARK-20132?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael Patterson updated SPARK-20132:
--

Comment: was deleted

(was: I have a commit with the documentation: https://github.com/map222/spark/commit/ac91b654555f9a07021222f2f1a162634d81be5b I will make a more formal PR tonight.)

> Add documentation for column string functions
> -
>
> Key: SPARK-20132
> URL: https://issues.apache.org/jira/browse/SPARK-20132
> Project: Spark
> Issue Type: Documentation
> Components: PySpark, SQL
> Affects Versions: 2.1.0
> Reporter: Michael Patterson
> Priority: Minor
> Labels: documentation, newbie
>
> Four Column string functions do not have documentation for PySpark:
> rlike
> like
> startswith
> endswith
> These functions are called through the _bin_op interface, which allows the
> passing of a docstring. I have added docstrings with examples to each of the
> four functions.
[jira] [Commented] (SPARK-20132) Add documentation for column string functions
[ https://issues.apache.org/jira/browse/SPARK-20132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15946130#comment-15946130 ]

Michael Patterson commented on SPARK-20132:
---

I have a commit with the documentation: https://github.com/map222/spark/commit/ac91b654555f9a07021222f2f1a162634d81be5b

I will make a more formal PR tonight.

> Add documentation for column string functions
> -
>
> Key: SPARK-20132
> URL: https://issues.apache.org/jira/browse/SPARK-20132
> Project: Spark
> Issue Type: Documentation
> Components: PySpark, SQL
> Affects Versions: 2.1.0
> Reporter: Michael Patterson
> Priority: Minor
> Labels: documentation, newbie
>
> Four Column string functions do not have documentation for PySpark:
> rlike
> like
> startswith
> endswith
> These functions are called through the _bin_op interface, which allows the
> passing of a docstring. I have added docstrings with examples to each of the
> four functions.
[jira] [Created] (SPARK-20132) Add documentation for column string functions
Michael Patterson created SPARK-20132:
-

Summary: Add documentation for column string functions
Key: SPARK-20132
URL: https://issues.apache.org/jira/browse/SPARK-20132
Project: Spark
Issue Type: Documentation
Components: PySpark, SQL
Affects Versions: 2.1.0
Reporter: Michael Patterson
Priority: Minor

Four Column string functions do not have documentation for PySpark:
rlike
like
startswith
endswith

These functions are called through the _bin_op interface, which allows the passing of a docstring. I have added docstrings with examples to each of the four functions.
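[Editor's note] A sketch of the kind of docstring examples this describes, using the four Column methods named above (illustrative only; assumes a running SparkSession bound to `spark`):

```python
df = spark.createDataFrame([('Alice',), ('Bob',)], ['name'])

df.filter(df.name.startswith('Al')).show()  # names starting with 'Al'
df.filter(df.name.endswith('b')).show()     # names ending with 'b'
df.filter(df.name.like('%li%')).show()      # SQL LIKE pattern match
df.filter(df.name.rlike('^B')).show()       # regular-expression match
```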
[jira] [Updated] (SPARK-18014) Filters are incorrectly being grouped together when there is processing in between
[ https://issues.apache.org/jira/browse/SPARK-18014?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael Patterson updated SPARK-18014:
--

Environment: Pyspark 2.0.1, Ipython 4.2 (was: Pyspark 2.0.0, Ipython 4.2)

> Filters are incorrectly being grouped together when there is processing in
> between
> --
>
> Key: SPARK-18014
> URL: https://issues.apache.org/jira/browse/SPARK-18014
> Project: Spark
> Issue Type: Bug
> Affects Versions: 2.0.1
> Environment: Pyspark 2.0.1, Ipython 4.2
> Reporter: Michael Patterson
> Priority: Minor
>
> I created a dataframe that needed to filter the data on columnA, create a new
> columnB by applying a user-defined function to columnA, and then filter on
> columnB. However, the two filters were being grouped together in the
> execution plan after the withColumn statement, which was causing errors due
> to unexpected input to the withColumn statement.
>
> Example code to reproduce:
> {code}
> import pyspark.sql.functions as F
> import pyspark.sql.types as T
> from functools import partial
>
> data = [{'input':0}, {'input':1}, {'input':2}]
> input_df = sc.parallelize(data).toDF()
>
> my_dict = {1:'first', 2:'second'}
> def apply_dict(input_dict, value):
>     return input_dict[value]
> test_udf = F.udf(partial(apply_dict, my_dict), T.StringType())
>
> test_df = input_df.filter('input > 0').withColumn('output', test_udf('input')).filter(F.col('output').rlike('^s'))
> test_df.explain(True)
> {code}
>
> Execution plan:
> {code}
> == Analyzed Logical Plan ==
> input: bigint, output: string
> Filter output#4 RLIKE ^s
> +- Project [input#0L, partial(input#0L) AS output#4]
>    +- Filter (input#0L > cast(0 as bigint))
>       +- LogicalRDD [input#0L]
>
> == Optimized Logical Plan ==
> Project [input#0L, partial(input#0L) AS output#4]
> +- Filter ((isnotnull(input#0L) && (input#0L > 0)) && partial(input#0L) RLIKE ^s)
>    +- LogicalRDD [input#0L]
> {code}
>
> Executing test_df.show() after the above code in pyspark 2.0.1 yields:
> KeyError: 0
>
> Executing test_df.show() in pyspark 1.6.2 yields:
> {code}
> +-----+------+
> |input|output|
> +-----+------+
> |    2|second|
> +-----+------+
> {code}

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-18014) Filters are incorrectly being grouped together when there is processing in between
[ https://issues.apache.org/jira/browse/SPARK-18014?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael Patterson updated SPARK-18014:
--

Environment: Pyspark 2.0.0, Ipython 4.2 (was: Pyspark 2.0.1, Ipython 4.2)

> Filters are incorrectly being grouped together when there is processing in
> between
> --
>
> Key: SPARK-18014
> URL: https://issues.apache.org/jira/browse/SPARK-18014
> Project: Spark
> Issue Type: Bug
> Affects Versions: 2.0.1
> Environment: Pyspark 2.0.0, Ipython 4.2
> Reporter: Michael Patterson
> Priority: Minor
>
> I created a dataframe that needed to filter the data on columnA, create a new
> columnB by applying a user-defined function to columnA, and then filter on
> columnB. However, the two filters were being grouped together in the
> execution plan after the withColumn statement, which was causing errors due
> to unexpected input to the withColumn statement.
>
> Example code to reproduce:
> {code}
> import pyspark.sql.functions as F
> import pyspark.sql.types as T
> from functools import partial
>
> data = [{'input':0}, {'input':1}, {'input':2}]
> input_df = sc.parallelize(data).toDF()
>
> my_dict = {1:'first', 2:'second'}
> def apply_dict(input_dict, value):
>     return input_dict[value]
> test_udf = F.udf(partial(apply_dict, my_dict), T.StringType())
>
> test_df = input_df.filter('input > 0').withColumn('output', test_udf('input')).filter(F.col('output').rlike('^s'))
> test_df.explain(True)
> {code}
>
> Execution plan:
> {code}
> == Analyzed Logical Plan ==
> input: bigint, output: string
> Filter output#4 RLIKE ^s
> +- Project [input#0L, partial(input#0L) AS output#4]
>    +- Filter (input#0L > cast(0 as bigint))
>       +- LogicalRDD [input#0L]
>
> == Optimized Logical Plan ==
> Project [input#0L, partial(input#0L) AS output#4]
> +- Filter ((isnotnull(input#0L) && (input#0L > 0)) && partial(input#0L) RLIKE ^s)
>    +- LogicalRDD [input#0L]
> {code}
>
> Executing test_df.show() after the above code in pyspark 2.0.1 yields:
> KeyError: 0
>
> Executing test_df.show() in pyspark 1.6.2 yields:
> {code}
> +-----+------+
> |input|output|
> +-----+------+
> |    2|second|
> +-----+------+
> {code}
[jira] [Updated] (SPARK-18014) Filters are incorrectly being grouped together when there is processing in between
[ https://issues.apache.org/jira/browse/SPARK-18014?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael Patterson updated SPARK-18014:
--

Description:
I created a dataframe that needed to filter the data on columnA, create a new columnB by applying a user-defined function to columnA, and then filter on columnB. However, the two filters were being grouped together in the execution plan after the withColumn statement, which was causing errors due to unexpected input to the withColumn statement.

Example code to reproduce:
```
import pyspark.sql.functions as F
import pyspark.sql.types as T
from functools import partial

data = [{'input':0}, {'input':1}, {'input':2}]
input_df = sc.parallelize(data).toDF()

my_dict = {1:'first', 2:'second'}
def apply_dict(input_dict, value):
    return input_dict[value]
test_udf = F.udf(partial(apply_dict, my_dict), T.StringType())

test_df = input_df.filter('input > 0').withColumn('output', test_udf('input')).filter(F.col('output').rlike('^s'))
test_df.explain(True)
```

Execution plan:
== Analyzed Logical Plan ==
input: bigint, output: string
Filter output#4 RLIKE ^s
+- Project [input#0L, partial(input#0L) AS output#4]
   +- Filter (input#0L > cast(0 as bigint))
      +- LogicalRDD [input#0L]

== Optimized Logical Plan ==
Project [input#0L, partial(input#0L) AS output#4]
+- Filter ((isnotnull(input#0L) && (input#0L > 0)) && partial(input#0L) RLIKE ^s)
   +- LogicalRDD [input#0L]

Executing test_df.show() after the above code in pyspark 2.0.1 yields:
KeyError: 0

Executing test_df.show() in pyspark 1.6.2 yields:
+-----+------+
|input|output|
+-----+------+
|    2|second|
+-----+------+

was:
I created a dataframe that needed to filter the data on columnA, create a new columnB by applying a user-defined function to columnA, and then filter on columnB. However, the two filters were being grouped together in the execution plan after the withColumn statement, which was causing errors due to unexpected input to the withColumn statement.

Example code to reproduce:
import pyspark.sql.functions as F
import pyspark.sql.types as T
from functools import partial
data = [{'input':0}, {'input':1}, {'input':2}]
input_df = sc.parallelize(data).toDF()
my_dict = {1:'first', 2:'second'}
def apply_dict(input_dict, value):
    return input_dict[value]
test_udf = F.udf(partial(apply_dict, my_dict), T.StringType())
test_df = input_df.filter('input > 0').withColumn('output', test_udf('input')).filter(F.col('output').rlike('^s'))
test_df.explain(True)

Execution plan:
== Analyzed Logical Plan ==
input: bigint, output: string
Filter output#4 RLIKE ^s
+- Project [input#0L, partial(input#0L) AS output#4]
   +- Filter (input#0L > cast(0 as bigint))
      +- LogicalRDD [input#0L]

== Optimized Logical Plan ==
Project [input#0L, partial(input#0L) AS output#4]
+- Filter ((isnotnull(input#0L) && (input#0L > 0)) && partial(input#0L) RLIKE ^s)
   +- LogicalRDD [input#0L]

Executing test_df.show() after the above code in pyspark 2.0.1 yields:
KeyError: 0

Executing test_df.show() in pyspark 1.6.2 yields:
+-----+------+
|input|output|
+-----+------+
|    2|second|
+-----+------+

> Filters are incorrectly being grouped together when there is processing in
> between
> --
>
> Key: SPARK-18014
> URL: https://issues.apache.org/jira/browse/SPARK-18014
> Project: Spark
> Issue Type: Bug
> Affects Versions: 2.0.1
> Environment: Pyspark 2.0.1, Ipython 4.2
> Reporter: Michael Patterson
> Priority: Minor
>
> I created a dataframe that needed to filter the data on columnA, create a new
> columnB by applying a user-defined function to columnA, and then filter on
> columnB. However, the two filters were being grouped together in the
> execution plan after the withColumn statement, which was causing errors due
> to unexpected input to the withColumn statement.
>
> Example code to reproduce:
> ```
> import pyspark.sql.functions as F
> import pyspark.sql.types as T
> from functools import partial
> data = [{'input':0}, {'input':1}, {'input':2}]
> input_df = sc.parallelize(data).toDF()
> my_dict = {1:'first', 2:'second'}
> def apply_dict(input_dict, value):
>     return input_dict[value]
> test_udf = F.udf(partial(apply_dict, my_dict), T.StringType())
> test_df = input_df.filter('input > 0').withColumn('output', test_udf('input')).filter(F.col('output').rlike('^s'))
> test_df.explain(True)
> ```
>
> Execution plan:
> == Analyzed Logical Plan ==
> input: bigint, output: string
> Filter output#4 RLIKE ^s
> +- Project [input#0L, partial(input#0L) AS output#4]
>    +- Filter (input#0L > cast(0 as bigint))
>       +- LogicalRDD [input#0L]
> == Optimized Logical Plan ==
> Project [input#0L, partial(input#0L) AS output#4]
> +- Filter ((isnotnull(input#0L) && (input#
[jira] [Created] (SPARK-18014) Filters are incorrectly being grouped together when there is processing in between
Michael Patterson created SPARK-18014:
-

Summary: Filters are incorrectly being grouped together when there is processing in between
Key: SPARK-18014
URL: https://issues.apache.org/jira/browse/SPARK-18014
Project: Spark
Issue Type: Bug
Affects Versions: 2.0.1
Environment: Pyspark 2.0.1, Ipython 4.2
Reporter: Michael Patterson
Priority: Minor

I created a dataframe that needed to filter the data on columnA, create a new columnB by applying a user-defined function to columnA, and then filter on columnB. However, the two filters were being grouped together in the execution plan after the withColumn statement, which was causing errors due to unexpected input to the withColumn statement.

Example code to reproduce:
```
import pyspark.sql.functions as F
import pyspark.sql.types as T
from functools import partial

data = [{'input':0}, {'input':1}, {'input':2}]
input_df = sc.parallelize(data).toDF()

my_dict = {1:'first', 2:'second'}
def apply_dict(input_dict, value):
    return input_dict[value]
test_udf = F.udf(partial(apply_dict, my_dict), T.StringType())

test_df = input_df.filter('input > 0').withColumn('output', test_udf('input')).filter(F.col('output').rlike('^s'))
test_df.explain(True)
```

Execution plan:
== Analyzed Logical Plan ==
input: bigint, output: string
Filter output#4 RLIKE ^s
+- Project [input#0L, partial(input#0L) AS output#4]
   +- Filter (input#0L > cast(0 as bigint))
      +- LogicalRDD [input#0L]

== Optimized Logical Plan ==
Project [input#0L, partial(input#0L) AS output#4]
+- Filter ((isnotnull(input#0L) && (input#0L > 0)) && partial(input#0L) RLIKE ^s)
   +- LogicalRDD [input#0L]

Executing test_df.show() after the above code in pyspark 2.0.1 yields:
KeyError: 0

Executing the above code in pyspark 1.6.2 yields:
+-----+------+
|input|output|
+-----+------+
|    2|second|
+-----+------+
[jira] [Updated] (SPARK-18014) Filters are incorrectly being grouped together when there is processing in between
[ https://issues.apache.org/jira/browse/SPARK-18014?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael Patterson updated SPARK-18014:
--

Description:
I created a dataframe that needed to filter the data on columnA, create a new columnB by applying a user-defined function to columnA, and then filter on columnB. However, the two filters were being grouped together in the execution plan after the withColumn statement, which was causing errors due to unexpected input to the withColumn statement.

Example code to reproduce:
```
import pyspark.sql.functions as F
import pyspark.sql.types as T
from functools import partial

data = [{'input':0}, {'input':1}, {'input':2}]
input_df = sc.parallelize(data).toDF()

my_dict = {1:'first', 2:'second'}
def apply_dict(input_dict, value):
    return input_dict[value]
test_udf = F.udf(partial(apply_dict, my_dict), T.StringType())

test_df = input_df.filter('input > 0').withColumn('output', test_udf('input')).filter(F.col('output').rlike('^s'))
test_df.explain(True)
```

Execution plan:
== Analyzed Logical Plan ==
input: bigint, output: string
Filter output#4 RLIKE ^s
+- Project [input#0L, partial(input#0L) AS output#4]
   +- Filter (input#0L > cast(0 as bigint))
      +- LogicalRDD [input#0L]

== Optimized Logical Plan ==
Project [input#0L, partial(input#0L) AS output#4]
+- Filter ((isnotnull(input#0L) && (input#0L > 0)) && partial(input#0L) RLIKE ^s)
   +- LogicalRDD [input#0L]

Executing test_df.show() after the above code in pyspark 2.0.1 yields:
KeyError: 0

Executing the above code in pyspark 1.6.2 yields:
+-----+------+
|input|output|
+-----+------+
|    2|second|
+-----+------+

was:
I created a dataframe that needed to filter the data on columnA, create a new columnB by applying a user-defined function to columnA, and then filter on columnB. However, the two filters were being grouped together in the execution plan after the withColumn statement, which was causing errors due to unexpected input to the withColumn statement.

Example code to reproduce:
```
import pyspark.sql.functions as F
import pyspark.sql.types as T
from functools import partial
data = [{'input':0}, {'input':1}, {'input':2}]
input_df = sc.parallelize(data).toDF()
my_dict = {1:'first', 2:'second'}
def apply_dict(input_dict, value):
    return input_dict[value]
test_udf = F.udf(partial(apply_dict, my_dict), T.StringType())
test_df = input_df.filter('input > 0').withColumn('output', test_udf('input')).filter(F.col('output').rlike('^s'))
test_df.explain(True)
```

Execution plan:
== Analyzed Logical Plan ==
input: bigint, output: string
Filter output#4 RLIKE ^s
+- Project [input#0L, partial(input#0L) AS output#4]
   +- Filter (input#0L > cast(0 as bigint))
      +- LogicalRDD [input#0L]

== Optimized Logical Plan ==
Project [input#0L, partial(input#0L) AS output#4]
+- Filter ((isnotnull(input#0L) && (input#0L > 0)) && partial(input#0L) RLIKE ^s)
   +- LogicalRDD [input#0L]

Executing test_df.show() after the above code in pyspark 2.0.1 yields:
KeyError: 0

Executing the above code in pyspark 1.6.2 yields:
+-----+------+
|input|output|
+-----+------+
|    2|second|
+-----+------+

> Filters are incorrectly being grouped together when there is processing in
> between
> --
>
> Key: SPARK-18014
> URL: https://issues.apache.org/jira/browse/SPARK-18014
> Project: Spark
> Issue Type: Bug
> Affects Versions: 2.0.1
> Environment: Pyspark 2.0.1, Ipython 4.2
> Reporter: Michael Patterson
> Priority: Minor
>
> I created a dataframe that needed to filter the data on columnA, create a new
> columnB by applying a user-defined function to columnA, and then filter on
> columnB. However, the two filters were being grouped together in the
> execution plan after the withColumn statement, which was causing errors due
> to unexpected input to the withColumn statement.
>
> Example code to reproduce:
> ```
> import pyspark.sql.functions as F
> import pyspark.sql.types as T
> from functools import partial
> data = [{'input':0}, {'input':1}, {'input':2}]
> input_df = sc.parallelize(data).toDF()
> my_dict = {1:'first', 2:'second'}
> def apply_dict(input_dict, value):
>     return input_dict[value]
> test_udf = F.udf(partial(apply_dict, my_dict), T.StringType())
> test_df = input_df.filter('input > 0').withColumn('output', test_udf('input')).filter(F.col('output').rlike('^s'))
> test_df.explain(True)
> ```
>
> Execution plan:
> == Analyzed Logical Plan ==
> input: bigint, output: string
> Filter output#4 RLIKE ^s
> +- Project [input#0L, partial(input#0L) AS output#4]
>    +- Filter (input#0L > cast(0 as bigint))
>       +- LogicalRDD [input#0L]
> == Optimized Logical Plan ==
> Project [input#0L, partial(input#0L) AS output#4]
> +- Filter ((isnotnull(input#0L) && (input#0L >
[jira] [Updated] (SPARK-18014) Filters are incorrectly being grouped together when there is processing in between
[ https://issues.apache.org/jira/browse/SPARK-18014?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Patterson updated SPARK-18014:
--
Description:

I created a dataframe that needed to filter the data on columnA, create a new columnB by applying a user-defined function to columnA, and then filter on columnB. However, the two filters were grouped together in the execution plan after the withColumn statement, which caused errors due to unexpected input to the user-defined function.

Example code to reproduce:

{code}
import pyspark.sql.functions as F
import pyspark.sql.types as T
from functools import partial

data = [{'input': 0}, {'input': 1}, {'input': 2}]
input_df = sc.parallelize(data).toDF()

my_dict = {1: 'first', 2: 'second'}

def apply_dict(input_dict, value):
    return input_dict[value]

test_udf = F.udf(partial(apply_dict, my_dict), T.StringType())

test_df = input_df.filter('input > 0').withColumn('output', test_udf('input')).filter(F.col('output').rlike('^s'))
test_df.explain(True)
{code}

Execution plan:

{code}
== Analyzed Logical Plan ==
input: bigint, output: string
Filter output#4 RLIKE ^s
+- Project [input#0L, partial(input#0L) AS output#4]
   +- Filter (input#0L > cast(0 as bigint))
      +- LogicalRDD [input#0L]

== Optimized Logical Plan ==
Project [input#0L, partial(input#0L) AS output#4]
+- Filter ((isnotnull(input#0L) && (input#0L > 0)) && partial(input#0L) RLIKE ^s)
   +- LogicalRDD [input#0L]
{code}

Executing test_df.show() after the above code in pyspark 2.0.1 yields:

KeyError: 0

Executing test_df.show() in pyspark 1.6.2 yields:

{code}
+-----+------+
|input|output|
+-----+------+
|    2|second|
+-----+------+
{code}

> Filters are incorrectly being grouped together when there is processing in
> between
> --
>
>                 Key: SPARK-18014
>                 URL: https://issues.apache.org/jira/browse/SPARK-18014
>             Project: Spark
>          Issue Type: Bug
>    Affects Versions: 2.0.1
>        Environment: Pyspark 2.0.1, Ipython 4.2
>           Reporter: Michael Patterson
>          Priority: Minor
>
> I created a dataframe that needed to filter the data on columnA, create a new
> columnB by applying a user-defined function to columnA, and then filter on
> columnB. However, the two filters were grouped together in the execution plan
> after the withColumn statement, which caused errors due to unexpected input
> to the user-defined function.
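Editor's note: the KeyError occurs because the optimizer merges both filters into one predicate, so the UDF is evaluated on rows the first filter should have excluded (here input 0, which is not a key of my_dict). The following plain-Python sketch (no Spark needed) reproduces the failure mode and shows one defensive workaround: a total lookup via dict.get that tolerates being called on unfiltered values. The names mirror the reproduction above; apply_dict_safe is a hypothetical variant suggested here, not a fix from the Spark project.

```python
from functools import partial

my_dict = {1: 'first', 2: 'second'}

def apply_dict(input_dict, value):
    # Raises KeyError for values outside the dict (e.g. 0), which is
    # exactly what happens when the filter is reordered past the UDF.
    return input_dict[value]

def apply_dict_safe(input_dict, value):
    # Defensive variant: dict.get returns None for missing keys, so the
    # UDF survives being evaluated before the 'input > 0' filter.
    return input_dict.get(value)

lookup = partial(apply_dict, my_dict)
safe_lookup = partial(apply_dict_safe, my_dict)

try:
    lookup(0)  # mirrors the UDF being applied to the unfiltered row
except KeyError:
    pass       # the KeyError: 0 reported against pyspark 2.0.1

safe_lookup(0)  # → None: no crash on the unfiltered row
safe_lookup(2)  # → 'second'
```

A None result (a SQL null) is then dropped by the subsequent rlike filter, so the final output is unchanged. Separately, later Spark releases (2.3+) expose UserDefinedFunction.asNondeterministic(), which is documented to keep the optimizer from duplicating or reordering a UDF past filters; whether that fully addresses this 2.0.1 plan is not confirmed here.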