[jira] [Updated] (SPARK-37449) Side effects between PySpark, Numpy and Pygeos
     [ https://issues.apache.org/jira/browse/SPARK-37449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Carlos Gameiro updated SPARK-37449:
-----------------------------------
    Priority: Critical  (was: Major)

> Side effects between PySpark, Numpy and Pygeos
> ----------------------------------------------
>
>                 Key: SPARK-37449
>                 URL: https://issues.apache.org/jira/browse/SPARK-37449
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 3.1.2
>            Reporter: Carlos Gameiro
>            Priority: Critical
>              Labels: NumPy, Pandas, Pygeos, UDF, applyInPandas
>
> I'm using pygeos 0.11.1.
> Let's create a simple pandas DataFrame with a single column named 'id'
> holding a range:
> {code:java}
> df = pd.DataFrame(np.arange(0,1000), columns=['id'])
> {code}
> Consider this simple function that selects the first four values of the
> 'id' column by index and that, for some reason, calls a Pygeos operation
> at the beginning.
> {code:java}
> def udf_example(df):
>     geo = pygeos.from_wkt(np.array(['POINT (20 30)', 'POINT (34 -2)', 'POINT (20 30)']))
>
>     some_index = np.array([0, 1, 2, 3])
>     values = df['id'].values[some_index]
>
>     df = pd.DataFrame(values, columns=['id'])
>     return df
> {code}
> If I apply this function in PySpark I get this result:
> {code:java}
> schema = t.StructType([t.StructField('id', t.LongType(), True)])
> df_spark = spark.createDataFrame(df).groupBy().applyInPandas(udf_example, schema)
> display(df_spark)
> # id
> # 125
> # 126
> # 127
> # 128
> {code}
> If I apply it in plain Python I get the correct and expected result:
> {code:java}
> udf_example(df)
> # id
> # 0
> # 1
> # 2
> # 3
> {code}
> Using a Pygeos function together with Spark causes side effects on NumPy
> indexing operations.

--
This message was sent by Atlassian Jira
(v8.20.1#820001)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Updated] (SPARK-37449) Side effects between PySpark, Numpy and Pygeos
     [ https://issues.apache.org/jira/browse/SPARK-37449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Carlos Gameiro updated SPARK-37449:
-----------------------------------
    Labels: applyInPandas  (was: )

> Side effects between PySpark, Numpy and Pygeos
> ----------------------------------------------
>
>                 Key: SPARK-37449
>                 URL: https://issues.apache.org/jira/browse/SPARK-37449
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 3.1.2
>            Reporter: Carlos Gameiro
>            Priority: Major
>              Labels: applyInPandas

[jira] [Updated] (SPARK-37449) Side effects between PySpark, Numpy and Pygeos
     [ https://issues.apache.org/jira/browse/SPARK-37449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Carlos Gameiro updated SPARK-37449:
-----------------------------------
    Labels: NumPy Pandas Pygeos UDF applyInPandas  (was: applyInPandas)

> Side effects between PySpark, Numpy and Pygeos
> ----------------------------------------------
>
>                 Key: SPARK-37449
>                 URL: https://issues.apache.org/jira/browse/SPARK-37449
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 3.1.2
>            Reporter: Carlos Gameiro
>            Priority: Major
>              Labels: NumPy, Pandas, Pygeos, UDF, applyInPandas

[jira] [Updated] (SPARK-37449) Side effects between PySpark, Numpy and Pygeos
     [ https://issues.apache.org/jira/browse/SPARK-37449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Carlos Gameiro updated SPARK-37449:
-----------------------------------
    Description: 
I'm using pygeos 0.11.1.

Let's create a simple pandas DataFrame with a single column named 'id' holding a range:
{code:java}
df = pd.DataFrame(np.arange(0,1000), columns=['id'])
{code}
Consider this simple function that selects the first four values of the 'id' column by index and that, for some reason, calls a Pygeos operation at the beginning.
{code:java}
def udf_example(df):
    geo = pygeos.from_wkt(np.array(['POINT (20 30)', 'POINT (34 -2)', 'POINT (20 30)']))

    some_index = np.array([0, 1, 2, 3])
    values = df['id'].values[some_index]

    df = pd.DataFrame(values, columns=['id'])
    return df
{code}
If I apply this function in PySpark I get this result:
{code:java}
schema = t.StructType([t.StructField('id', t.LongType(), True)])
df_spark = spark.createDataFrame(df).groupBy().applyInPandas(udf_example, schema)
display(df_spark)
# id
# 125
# 126
# 127
# 128
{code}
If I apply it in plain Python I get the correct and expected result:
{code:java}
udf_example(df)
# id
# 0
# 1
# 2
# 3
{code}
Using a Pygeos function together with Spark causes side effects on NumPy indexing operations.

  was:
I'm using pygeos 0.11.1.

Let's create a simple pandas DataFrame with a single column named 'id' holding a range:
{code:java}
df = pd.DataFrame(np.arange(0,1000), columns=['id'])
{code}
Consider this simple function that selects the first four values of the 'id' column by index and that, for some reason, calls a Pygeos operation at the beginning.
{code:java}
def udf_example(df):
    geo = pygeos.from_wkt(np.array(['POINT (20 30)', 'POINT (34 -2)', 'POINT (20 30)']))

    some_index = np.array([0, 1, 2, 3])
    values = df['id'].values[some_index]

    df = pd.DataFrame(values, columns=['id'])
    return df
{code}
If I apply this function in PySpark I get this result:
{code:java}
schema = t.StructType([t.StructField('id', t.LongType(), True)])
df_spark = spark.createDataFrame(df).groupBy().applyInPandas(udf_example, schema)
display(df_spark)
# id
# 125
# 126
# 127
# 128
{code}
If I apply it in plain Python I get the correct and expected result:
{code:java}
udf_example(df)
# id
# 0
# 1
# 2
# 3
{code}
Using a Pygeos function together with Spark causes side effects on NumPy indexing operations.


> Side effects between PySpark, Numpy and Pygeos
> ----------------------------------------------
>
>                 Key: SPARK-37449
>                 URL: https://issues.apache.org/jira/browse/SPARK-37449
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 3.1.2
>            Reporter: Carlos Gameiro
>            Priority: Major
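
The function in the report picks rows by position (`values[some_index]`), so its output depends on the order in which rows reach it. The sketch below is not a diagnosis of this issue. It only shows a way to rule out arrival order as a factor, on the assumption that rows handed to a grouped pandas function may not arrive in the order of the original pandas DataFrame. The names `first_four_by_id` and `rotated` are hypothetical and do not appear in the report, and the rotation by 125 rows is an artificial stand-in chosen to match the first value in the reported output.

```python
import numpy as np
import pandas as pd


def first_four_by_id(pdf: pd.DataFrame) -> pd.DataFrame:
    """Return the four smallest 'id' values, whatever the input row order."""
    # Sort explicitly so that positional indexing no longer depends on the
    # order in which the rows were delivered to this function.
    ordered = pdf.sort_values('id').reset_index(drop=True)

    some_index = np.array([0, 1, 2, 3])
    values = ordered['id'].values[some_index]
    return pd.DataFrame(values, columns=['id'])


# Same input as in the report.
df = pd.DataFrame(np.arange(0, 1000), columns=['id'])

# Artificially reorder the rows so that id 125 comes first (an assumption
# made only to imitate a different arrival order).
rotated = pd.concat([df.iloc[125:], df.iloc[:125]], ignore_index=True)

# Positional indexing without sorting follows the arrival order.
naive = rotated['id'].values[np.array([0, 1, 2, 3])]

# Sorting first makes the result independent of the arrival order.
result = first_four_by_id(rotated)

print(naive.tolist())
print(result['id'].tolist())
```

If a variant of `udf_example` with such a sort still returned unexpected values under `applyInPandas`, arrival order could be excluded and attention would return to the Pygeos call.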