[jira] [Updated] (SPARK-37449) Side effects between PySpark, Numpy and Pygeos

2021-11-23 Thread Carlos Gameiro (Jira)


 [ https://issues.apache.org/jira/browse/SPARK-37449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Carlos Gameiro updated SPARK-37449:
---
Priority: Critical  (was: Major)

> Side effects between PySpark, Numpy and Pygeos
> --
>
> Key: SPARK-37449
> URL: https://issues.apache.org/jira/browse/SPARK-37449
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.1.2
>Reporter: Carlos Gameiro
>Priority: Critical
>  Labels: NumPy, Pandas, Pygeos, UDF, applyInPandas
>
> I'm using pygeos 0.11.1.
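> The snippets below assume the usual imports; roughly the following (my reconstruction, not shown in the original report — the {code}t{code} alias matches the later use of {code}t.StructType{code}):
> {code:java}
> import numpy as np
> import pandas as pd
> import pygeos
> 
> from pyspark.sql import types as t
> {code}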
> Let's create a simple Pandas DataFrame with a single column named 'id' containing a range of integers:
> {code:java}
> df = pd.DataFrame(np.arange(0,1000), columns=['id']){code}
> Consider this simple function that selects the first 4 elements of the 'id' column by index and, for some reason, calls a Pygeos operation at the beginning.
> {code:java}
> def udf_example(df):
>   # Unrelated Pygeos call (result unused); this is what appears to trigger the problem.
>   geo = pygeos.from_wkt(np.array(['POINT (20 30)', 'POINT (34 -2)', 'POINT (20 30)']))
> 
>   # Select the first 4 elements of the 'id' column by integer index.
>   some_index = np.array([0, 1, 2, 3])
>   values = df['id'].values[some_index]
> 
>   df = pd.DataFrame(values, columns=['id'])
>   return df{code}
> If I apply this function in PySpark I get this result:
> {code:java}
> schema = t.StructType([t.StructField('id', t.LongType(), True)])
> df_spark = spark.createDataFrame(df).groupBy().applyInPandas(udf_example, schema)
> display(df_spark)
> # id
> # 125
> # 126
> # 127
> # 128
> {code}
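> Note that display() is a notebook helper (e.g. in Databricks); outside a notebook the same reproduction can be run roughly as below. The local SparkSession setup is my assumption, not part of the original report:
> {code:java}
> from pyspark.sql import SparkSession, types as t
> 
> spark = SparkSession.builder.master('local[2]').getOrCreate()
> 
> schema = t.StructType([t.StructField('id', t.LongType(), True)])
> df_spark = spark.createDataFrame(df).groupBy().applyInPandas(udf_example, schema)
> df_spark.show(4)  # print the first rows instead of the notebook display() helper
> {code}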
> If I apply it in Python I get the correct and expected result:
> {code:java}
> udf_example(df)
> # id
> # 0
> # 1
> # 2
> # 3
> {code}
> Calling a Pygeos function inside a Spark pandas UDF causes side effects in NumPy indexing operations.
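> A simple way to isolate the trigger is to run the same job with and without the Pygeos call; a diagnostic sketch (the variant names are mine, not from the original report):
> {code:java}
> def udf_with_pygeos(df):
>   pygeos.from_wkt(np.array(['POINT (20 30)', 'POINT (34 -2)', 'POINT (20 30)']))
>   some_index = np.array([0, 1, 2, 3])
>   return pd.DataFrame(df['id'].values[some_index], columns=['id'])
> 
> def udf_without_pygeos(df):
>   some_index = np.array([0, 1, 2, 3])
>   return pd.DataFrame(df['id'].values[some_index], columns=['id'])
> 
> # If the Pygeos call is indeed the trigger, only the first variant should
> # return the wrong rows under applyInPandas; both return 0..3 when called
> # directly in plain Python.
> spark.createDataFrame(df).groupBy().applyInPandas(udf_without_pygeos, schema).show(4)
> spark.createDataFrame(df).groupBy().applyInPandas(udf_with_pygeos, schema).show(4)
> {code}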



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-37449) Side effects between PySpark, Numpy and Pygeos

2021-11-23 Thread Carlos Gameiro (Jira)


 [ https://issues.apache.org/jira/browse/SPARK-37449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Carlos Gameiro updated SPARK-37449:
---
Labels: applyInPandas  (was: )

> Side effects between PySpark, Numpy and Pygeos
> --
>
> Key: SPARK-37449
> URL: https://issues.apache.org/jira/browse/SPARK-37449
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.1.2
>Reporter: Carlos Gameiro
>Priority: Major
>  Labels: applyInPandas
>






[jira] [Updated] (SPARK-37449) Side effects between PySpark, Numpy and Pygeos

2021-11-23 Thread Carlos Gameiro (Jira)


 [ https://issues.apache.org/jira/browse/SPARK-37449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Carlos Gameiro updated SPARK-37449:
---
Labels: NumPy Pandas Pygeos UDF applyInPandas  (was: applyInPandas)

> Side effects between PySpark, Numpy and Pygeos
> --
>
> Key: SPARK-37449
> URL: https://issues.apache.org/jira/browse/SPARK-37449
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.1.2
>Reporter: Carlos Gameiro
>Priority: Major
>  Labels: NumPy, Pandas, Pygeos, UDF, applyInPandas
>






[jira] [Updated] (SPARK-37449) Side effects between PySpark, Numpy and Pygeos

2021-11-23 Thread Carlos Gameiro (Jira)


 [ https://issues.apache.org/jira/browse/SPARK-37449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Carlos Gameiro updated SPARK-37449:
---
Description: 
I'm using pygeos 0.11.1.

Let's create a simple Pandas DataFrame with a single column named 'id' containing a range of integers:
{code:java}
df = pd.DataFrame(np.arange(0,1000), columns=['id']){code}
Consider this simple function that selects the first 4 elements of the 'id' column by index and, for some reason, calls a Pygeos operation at the beginning.
{code:java}
def udf_example(df):
  # Unrelated Pygeos call (result unused); this is what appears to trigger the problem.
  geo = pygeos.from_wkt(np.array(['POINT (20 30)', 'POINT (34 -2)', 'POINT (20 30)']))

  # Select the first 4 elements of the 'id' column by integer index.
  some_index = np.array([0, 1, 2, 3])
  values = df['id'].values[some_index]

  df = pd.DataFrame(values, columns=['id'])
  return df{code}
If I apply this function in PySpark I get this result:
{code:java}
schema = t.StructType([t.StructField('id', t.LongType(), True)])
df_spark = spark.createDataFrame(df).groupBy().applyInPandas(udf_example, schema)
display(df_spark)
# id
# 125
# 126
# 127
# 128
{code}
If I apply it in Python I get the correct and expected result:
{code:java}
udf_example(df)
# id
# 0
# 1
# 2
# 3
{code}
Calling a Pygeos function inside a Spark pandas UDF causes side effects in NumPy indexing operations.
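For completeness, the exact library versions involved are easy to capture alongside the reproduction; a small sketch (which packages to report is my choice — the report itself only names pygeos 0.11.1 and Spark 3.1.2):
{code:java}
import numpy, pandas, pygeos, pyspark

print('numpy  ', numpy.__version__)
print('pandas ', pandas.__version__)
print('pygeos ', pygeos.__version__)
print('pyspark', pyspark.__version__)
{code}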



> Side effects between PySpark, Numpy and Pygeos
> --
>
> Key: SPARK-37449
> URL: https://issues.apache.org/jira/browse/SPARK-37449
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.1.2
>Reporter: Carlos Gameiro
>Priority: Major
>



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org