[jira] [Commented] (SPARK-37181) pyspark.pandas.read_csv() should support latin-1 encoding

2021-11-15 Thread Chuck Connell (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17444117#comment-17444117
 ] 

Chuck Connell commented on SPARK-37181:
---

That would be a good solution: just map latin-1 silently to ISO-8859-1. 
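For what it's worth, Python's own codec registry already treats latin-1 as an alias of ISO-8859-1, so a silent mapping would match standard-library behavior; a quick standard-library sketch:

```python
import codecs

# "latin-1" and "iso-8859-1" resolve to the same canonical codec,
# while "windows-1252" is a distinct codec ("cp1252").
print(codecs.lookup("latin-1").name)       # iso8859-1
print(codecs.lookup("iso-8859-1").name)    # iso8859-1
print(codecs.lookup("windows-1252").name)  # cp1252
```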

> pyspark.pandas.read_csv() should support latin-1 encoding
> -
>
> Key: SPARK-37181
> URL: https://issues.apache.org/jira/browse/SPARK-37181
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Chuck Connell
>Priority: Major
>
> In regular pandas, you can say {{read_csv(encoding='latin-1')}}. This encoding 
> is not recognized in pyspark.pandas. You have to use Windows-1252 instead, 
> which is almost the same but not identical. 



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-37198) pyspark.pandas read_csv() and to_csv() should handle local files

2021-11-04 Thread Chuck Connell (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17438789#comment-17438789
 ] 

Chuck Connell edited comment on SPARK-37198 at 11/4/21, 3:42 PM:
-

There are many hints and tech tips on the Internet which say that 
{{file://local_path}} already works to read and write local files from a Spark 
cluster. But in my testing (from Databricks) this is not true. I have never 
gotten it to work.

If there is already a way to read/write local files, please say the exact, 
tested method to do so. 


was (Author: chconnell):
There are many hints and tech tips on the Internet which say that 
{{file://local_path}} already works to read and write local files from a Spark 
cluster. But in my testing (from Databricks) this is not true. I have never 
gotten it to work.

If there is already a way to read/write local files, please say the exact, 
tested method to do so. 

> pyspark.pandas read_csv() and to_csv() should handle local files 
> -
>
> Key: SPARK-37198
> URL: https://issues.apache.org/jira/browse/SPARK-37198
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Chuck Connell
>Priority: Major
>
> Pandas programmers who move their code to Spark would like to import and 
> export text files to and from their local disk. I know there are technical 
> hurdles to this (since Spark is usually in a cluster that does not know where 
> your local computer is) but it would really help code migration. 
> For read_csv() and to_csv(), the syntax {{file://c:/Temp/my_file.csv}} (or 
> something like this) should import and export to the local disk on Windows. 
> Similarly for Mac and Linux. 






[jira] [Commented] (SPARK-37198) pyspark.pandas read_csv() and to_csv() should handle local files

2021-11-04 Thread Chuck Connell (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17438789#comment-17438789
 ] 

Chuck Connell commented on SPARK-37198:
---

There are many hints and tech tips on the Internet which say that 
{{file://local_path}} already works to read and write local files from a Spark 
cluster. But in my testing (from Databricks) this is not true. I have never 
gotten it to work.

If there is already a way to read/write local files, please say the exact, 
tested method to do so. 

> pyspark.pandas read_csv() and to_csv() should handle local files 
> -
>
> Key: SPARK-37198
> URL: https://issues.apache.org/jira/browse/SPARK-37198
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Chuck Connell
>Priority: Major
>
> Pandas programmers who move their code to Spark would like to import and 
> export text files to and from their local disk. I know there are technical 
> hurdles to this (since Spark is usually in a cluster that does not know where 
> your local computer is) but it would really help code migration. 
> For read_csv() and to_csv(), the syntax {{file://c:/Temp/my_file.csv}} (or 
> something like this) should import and export to the local disk on Windows. 
> Similarly for Mac and Linux. 






[jira] [Updated] (SPARK-37197) PySpark pandas recent issues from chconnell

2021-11-02 Thread Chuck Connell (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37197?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chuck Connell updated SPARK-37197:
--
Description: 
SPARK-37180  PySpark.pandas should support __version__

SPARK-37181  pyspark.pandas.read_csv() should support latin-1 encoding
  
 SPARK-37183  pyspark.pandas.DataFrame.map() should support .fillna()
  
 SPARK-37184  pyspark.pandas should support 
DF["column"].str.split("some_suffix").str[0]
 SPARK-37186  pyspark.pandas should support tseries.offsets 

SPARK-37187  pyspark.pandas fails to create a histogram of one column from a 
large DataFrame

SPARK-37188  pyspark.pandas histogram accepts the title option but does not add 
a title to the plot

SPARK-37189  pyspark.pandas histogram accepts the range option but does not use 
it

SPARK-37198  pyspark.pandas read_csv() and to_csv() should handle local files

 

  was:
SPARK-37180  PySpark.pandas should support __version__

SPARK-37181  pyspark.pandas.read_csv() should support latin-1 encoding
 
SPARK-37183  pyspark.pandas.DataFrame.map() should support .fillna()
 
SPARK-37184  pyspark.pandas should support 
DF["column"].str.split("some_suffix").str[0]
SPARK-37186  pyspark.pandas should support tseries.offsets 

SPARK-37187  pyspark.pandas fails to create a histogram of one column from a 
large DataFrame

SPARK-37188  pyspark.pandas histogram accepts the title option but does not add 
a title to the plot

SPARK-37189  pyspark.pandas histogram accepts the range option but does not use 
it

 


> PySpark pandas recent issues from chconnell
> ---
>
> Key: SPARK-37197
> URL: https://issues.apache.org/jira/browse/SPARK-37197
> Project: Spark
>  Issue Type: Umbrella
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Chuck Connell
>Priority: Major
>
> SPARK-37180  PySpark.pandas should support __version__
> SPARK-37181  pyspark.pandas.read_csv() should support latin-1 encoding
>   
>  SPARK-37183  pyspark.pandas.DataFrame.map() should support .fillna()
>   
>  SPARK-37184  pyspark.pandas should support 
> DF["column"].str.split("some_suffix").str[0]
>  SPARK-37186  pyspark.pandas should support tseries.offsets 
> SPARK-37187  pyspark.pandas fails to create a histogram of one column from a 
> large DataFrame
> SPARK-37188  pyspark.pandas histogram accepts the title option but does not 
> add a title to the plot
> SPARK-37189  pyspark.pandas histogram accepts the range option but does not 
> use it
> SPARK-37198  pyspark.pandas read_csv() and to_csv() should handle local files
>  






[jira] [Created] (SPARK-37198) pyspark.pandas read_csv() and to_csv() should handle local files

2021-11-02 Thread Chuck Connell (Jira)
Chuck Connell created SPARK-37198:
-

 Summary: pyspark.pandas read_csv() and to_csv() should handle 
local files 
 Key: SPARK-37198
 URL: https://issues.apache.org/jira/browse/SPARK-37198
 Project: Spark
  Issue Type: Improvement
  Components: PySpark
Affects Versions: 3.2.0
Reporter: Chuck Connell


Pandas programmers who move their code to Spark would like to import and export 
text files to and from their local disk. I know there are technical hurdles to 
this (since Spark is usually in a cluster that does not know where your local 
computer is) but it would really help code migration. 

For read_csv() and to_csv(), the syntax {{file://c:/Temp/my_file.csv}} (or 
something like this) should import and export to the local disk on Windows. 
Similarly for Mac and Linux. 
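On the syntax question alone, Python's pathlib can build a well-formed file:// URI from a local path on either platform; whether a given Spark cluster accepts such a URI is a separate matter, so this only illustrates the URI form:

```python
from pathlib import PureWindowsPath, PurePosixPath

# Construct well-formed file:// URIs from local paths on each platform.
win_uri = PureWindowsPath("c:/Temp/my_file.csv").as_uri()
posix_uri = PurePosixPath("/tmp/my_file.csv").as_uri()
print(win_uri)    # file:///c:/Temp/my_file.csv
print(posix_uri)  # file:///tmp/my_file.csv
```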






[jira] [Comment Edited] (SPARK-37189) pyspark.pandas histogram accepts the range option but does not use it

2021-11-02 Thread Chuck Connell (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17437329#comment-17437329
 ] 

Chuck Connell edited comment on SPARK-37189 at 11/2/21, 1:18 PM:
-

Ok, will do. 


was (Author: chconnell):
Ok, will do. FYI, getting covid shot today, so I may be tired for a few days.

> pyspark.pandas histogram accepts the range option but does not use it
> -
>
> Key: SPARK-37189
> URL: https://issues.apache.org/jira/browse/SPARK-37189
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Chuck Connell
>Priority: Major
>
> In pyspark.pandas if you write a line like this
> {quote}DF.plot.hist(bins=30, range=[0, 20], title="US Counties -- 
> DeathsPer100k (<20)")
> {quote}
> it compiles and runs, but the plot does not respect the range. All the values 
> are shown.
> The workaround is to create a new DataFrame that pre-selects just the rows 
> you want, but the line above should work also.






[jira] [Created] (SPARK-37197) PySpark pandas recent issues from chconnell

2021-11-02 Thread Chuck Connell (Jira)
Chuck Connell created SPARK-37197:
-

 Summary: PySpark pandas recent issues from chconnell
 Key: SPARK-37197
 URL: https://issues.apache.org/jira/browse/SPARK-37197
 Project: Spark
  Issue Type: Umbrella
  Components: PySpark
Affects Versions: 3.2.0
Reporter: Chuck Connell


SPARK-37180  PySpark.pandas should support __version__

SPARK-37181  pyspark.pandas.read_csv() should support latin-1 encoding
 
SPARK-37183  pyspark.pandas.DataFrame.map() should support .fillna()
 
SPARK-37184  pyspark.pandas should support 
DF["column"].str.split("some_suffix").str[0]
SPARK-37186  pyspark.pandas should support tseries.offsets 

SPARK-37187  pyspark.pandas fails to create a histogram of one column from a 
large DataFrame

SPARK-37188  pyspark.pandas histogram accepts the title option but does not add 
a title to the plot

SPARK-37189  pyspark.pandas histogram accepts the range option but does not use 
it

 






[jira] [Commented] (SPARK-37189) pyspark.pandas histogram accepts the range option but does not use it

2021-11-02 Thread Chuck Connell (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17437329#comment-17437329
 ] 

Chuck Connell commented on SPARK-37189:
---

Ok, will do. FYI, getting covid shot today, so I may be tired for a few days.

> pyspark.pandas histogram accepts the range option but does not use it
> -
>
> Key: SPARK-37189
> URL: https://issues.apache.org/jira/browse/SPARK-37189
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Chuck Connell
>Priority: Major
>
> In pyspark.pandas if you write a line like this
> {quote}DF.plot.hist(bins=30, range=[0, 20], title="US Counties -- 
> DeathsPer100k (<20)")
> {quote}
> it compiles and runs, but the plot does not respect the range. All the values 
> are shown.
> The workaround is to create a new DataFrame that pre-selects just the rows 
> you want, but the line above should work also.






[jira] [Updated] (SPARK-37189) pyspark.pandas histogram accepts the range option but does not use it

2021-11-01 Thread Chuck Connell (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37189?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chuck Connell updated SPARK-37189:
--
Description: 
In pyspark.pandas if you write a line like this
{quote}DF.plot.hist(bins=30, range=[0, 20], title="US Counties -- DeathsPer100k 
(<20)")
{quote}
it compiles and runs, but the plot does not respect the range. All the values 
are shown.

The workaround is to create a new DataFrame that pre-selects just the rows you 
want, but the line above should work also.
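In plain-pandas terms (plain pandas, not pyspark.pandas, and with illustrative column name and values), the pre-selection workaround looks roughly like this:

```python
import pandas as pd

# Hypothetical data standing in for the real per-county DataFrame.
df = pd.DataFrame({"DeathsPer100k": [1.0, 5.0, 12.0, 19.0, 45.0, 90.0]})

# Pre-select the rows in the desired range, then histogram the subset.
subset = df[df["DeathsPer100k"] < 20]
# subset.plot.hist(bins=30, title="US Counties -- DeathsPer100k (<20)")
print(len(subset))  # 4 rows fall below the cutoff
```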

  was:
In pyspark.pandas if you write a line like this
{quote}DF.plot.hist(bins=20, title="US Counties -- FullVaxPer100")
{quote}
it compiles and runs, but the plot has no title.


> pyspark.pandas histogram accepts the range option but does not use it
> -
>
> Key: SPARK-37189
> URL: https://issues.apache.org/jira/browse/SPARK-37189
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Chuck Connell
>Priority: Major
>
> In pyspark.pandas if you write a line like this
> {quote}DF.plot.hist(bins=30, range=[0, 20], title="US Counties -- 
> DeathsPer100k (<20)")
> {quote}
> it compiles and runs, but the plot does not respect the range. All the values 
> are shown.
> The workaround is to create a new DataFrame that pre-selects just the rows 
> you want, but the line above should work also.






[jira] [Created] (SPARK-37189) CLONE - pyspark.pandas histogram accepts the title option but does not add a title to the plot

2021-11-01 Thread Chuck Connell (Jira)
Chuck Connell created SPARK-37189:
-

 Summary: CLONE - pyspark.pandas histogram accepts the title option 
but does not add a title to the plot
 Key: SPARK-37189
 URL: https://issues.apache.org/jira/browse/SPARK-37189
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 3.2.0
Reporter: Chuck Connell


In pyspark.pandas if you write a line like this
{quote}DF.plot.hist(bins=20, title="US Counties -- FullVaxPer100")
{quote}
it compiles and runs, but the plot has no title.






[jira] [Updated] (SPARK-37189) pyspark.pandas histogram accepts the range option but does not use it

2021-11-01 Thread Chuck Connell (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37189?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chuck Connell updated SPARK-37189:
--
Summary: pyspark.pandas histogram accepts the range option but does not use 
it  (was: CLONE - pyspark.pandas histogram accepts the title option but does 
not add a title to the plot)

> pyspark.pandas histogram accepts the range option but does not use it
> -
>
> Key: SPARK-37189
> URL: https://issues.apache.org/jira/browse/SPARK-37189
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Chuck Connell
>Priority: Major
>
> In pyspark.pandas if you write a line like this
> {quote}DF.plot.hist(bins=20, title="US Counties -- FullVaxPer100")
> {quote}
> it compiles and runs, but the plot has no title.






[jira] [Created] (SPARK-37188) pyspark.pandas histogram accepts the title option but does not add a title to the plot

2021-11-01 Thread Chuck Connell (Jira)
Chuck Connell created SPARK-37188:
-

 Summary: pyspark.pandas histogram accepts the title option but 
does not add a title to the plot
 Key: SPARK-37188
 URL: https://issues.apache.org/jira/browse/SPARK-37188
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 3.2.0
Reporter: Chuck Connell


In pyspark.pandas if you write a line like this
{quote}DF.plot.hist(bins=20, title="US Counties -- FullVaxPer100")
{quote}
it compiles and runs, but the plot has no title.






[jira] [Created] (SPARK-37187) pyspark.pandas fails to create a histogram of one column from a large DataFrame

2021-11-01 Thread Chuck Connell (Jira)
Chuck Connell created SPARK-37187:
-

 Summary: pyspark.pandas fails to create a histogram of one column 
from a large DataFrame
 Key: SPARK-37187
 URL: https://issues.apache.org/jira/browse/SPARK-37187
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 3.2.0
Reporter: Chuck Connell


 

When trying to create a histogram from one column of a large DataFrame, 
pyspark.pandas fails. So this line
{quote}DF.plot.hist(column="FullVaxPer100", bins=20)  # there are many other 
columns
{quote}
yields this error
{quote}cannot resolve 'least(min(EndDate), min(EndDeaths), min(`STATE-COUNTY`), 
min(StartDate), min(StartDeaths), min(POPESTIMATE2020), min(ST_ABBR), 
min(VaxStartDate), min(Series_Complete_Yes_Start), 
min(Administered_Dose1_Recip_Start), min(VaxEndDate), 
min(Series_Complete_Yes_End), min(Administered_Dose1_Recip_End), min(Deaths), 
min(Series_Complete_Yes_Mid), min(Administered_Dose1_Recip_Mid), 
min(FullVaxPer100), min(OnePlusVaxPer100), min(DeathsPer100k))' due to data 
type mismatch: The expressions should all have the same type, got 
LEAST(timestamp, bigint, string, timestamp, bigint, bigint, string, timestamp, 
bigint, bigint, timestamp, bigint, bigint, bigint, double, double, double, 
double, double).;
{quote}
The odd thing is that pyspark.pandas seems to be operating on all the columns 
when only one is needed.

As a workaround, you can first create a one-column DataFrame that selects just 
the field you want, then make a histogram of that. But the command above should 
work also.
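The one-column workaround, sketched in plain pandas with hypothetical columns and values standing in for the real dataset:

```python
import pandas as pd

# Hypothetical mixed-type frame standing in for the real dataset.
df = pd.DataFrame({
    "ST_ABBR": ["MA", "NH", "VT"],
    "FullVaxPer100": [68.0, 61.5, 70.2],
})

# Workaround: select just the one numeric column first, then plot it,
# so no aggregate ever touches the string or timestamp columns.
one_col = df[["FullVaxPer100"]]
# one_col.plot.hist(bins=20)
print(list(one_col.columns))  # ['FullVaxPer100']
```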

I can supply the complete program and datasets that demonstrate the error.






[jira] [Created] (SPARK-37186) pyspark.pandas should support tseries.offsets

2021-11-01 Thread Chuck Connell (Jira)
Chuck Connell created SPARK-37186:
-

 Summary: pyspark.pandas should support tseries.offsets
 Key: SPARK-37186
 URL: https://issues.apache.org/jira/browse/SPARK-37186
 Project: Spark
  Issue Type: Improvement
  Components: PySpark
Affects Versions: 3.2.0
Reporter: Chuck Connell


In regular pandas you can use pandas.offsets to create a time delta. This 
allows a line like
{quote}this_period_start = OVERALL_START_DATE + pd.offsets.Day(NN)
{quote}
But this does not work in pyspark.pandas.

There are good workarounds, such as datetime.timedelta(days=NN), but pandas 
programmers would like to move code to pyspark without changing it.
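The standard-library workaround mentioned above looks like this (dates and the NN value are illustrative):

```python
from datetime import datetime, timedelta

# Workaround using the standard library instead of pd.offsets.Day(NN).
OVERALL_START_DATE = datetime(2021, 1, 1)
NN = 30
this_period_start = OVERALL_START_DATE + timedelta(days=NN)
print(this_period_start.date())  # 2021-01-31
```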






[jira] [Issue Comment Deleted] (SPARK-37182) pyspark.pandas.to_numeric() should support the errors option

2021-11-01 Thread Chuck Connell (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37182?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chuck Connell updated SPARK-37182:
--
Comment: was deleted

(was: https://issues.apache.org/jira/browse/SPARK-36609)

> pyspark.pandas.to_numeric() should support the errors option
> 
>
> Key: SPARK-37182
> URL: https://issues.apache.org/jira/browse/SPARK-37182
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Chuck Connell
>Priority: Major
>
> In regular pandas you can say to_numeric(errors='coerce'). But the errors 
> option is not recognized by pyspark.pandas.
> FYI, the errors option is recognized by pyspark.pandas.to_datetime()






[jira] [Resolved] (SPARK-37182) pyspark.pandas.to_numeric() should support the errors option

2021-11-01 Thread Chuck Connell (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37182?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chuck Connell resolved SPARK-37182.
---
Resolution: Duplicate

https://issues.apache.org/jira/browse/SPARK-36609

> pyspark.pandas.to_numeric() should support the errors option
> 
>
> Key: SPARK-37182
> URL: https://issues.apache.org/jira/browse/SPARK-37182
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Chuck Connell
>Priority: Major
>
> In regular pandas you can say to_numeric(errors='coerce'). But the errors 
> option is not recognized by pyspark.pandas.
> FYI, the errors option is recognized by pyspark.pandas.to_datetime()






[jira] [Commented] (SPARK-37182) pyspark.pandas.to_numeric() should support the errors option

2021-11-01 Thread Chuck Connell (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37182?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17436992#comment-17436992
 ] 

Chuck Connell commented on SPARK-37182:
---

Duplicate of https://issues.apache.org/jira/browse/SPARK-36609

> pyspark.pandas.to_numeric() should support the errors option
> 
>
> Key: SPARK-37182
> URL: https://issues.apache.org/jira/browse/SPARK-37182
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Chuck Connell
>Priority: Major
>
> In regular pandas you can say to_numeric(errors='coerce'). But the errors 
> option is not recognized by pyspark.pandas.
> FYI, the errors option is recognized by pyspark.pandas.to_datetime()






[jira] [Created] (SPARK-37184) pyspark.pandas should support DF["column"].str.split("some_suffix").str[0]

2021-11-01 Thread Chuck Connell (Jira)
Chuck Connell created SPARK-37184:
-

 Summary:  pyspark.pandas should support 
DF["column"].str.split("some_suffix").str[0]
 Key: SPARK-37184
 URL: https://issues.apache.org/jira/browse/SPARK-37184
 Project: Spark
  Issue Type: Improvement
  Components: PySpark
Affects Versions: 3.2.0
Reporter: Chuck Connell


In regular pandas you can say
{quote}DF["column"] = DF["column"].str.split("suffix").str[0]
{quote}
in order to strip off a suffix.

With pyspark.pandas, this syntax does not work. You have to say something like
{quote}DF["column"] = DF["column"].str.replace("suffix", '', 1)
{quote}
which works fine if the suffix only appears once at the end, but is not really 
the same.
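A small plain-pandas example of why the two forms are not equivalent, with illustrative values (the suffix here is hypothetical):

```python
import pandas as pd

s = pd.Series(["county_total", "mid_total_total"])

# split().str[0] keeps everything before the FIRST occurrence...
before_first = s.str.split("_total").str[0]
print(before_first.tolist())  # ['county', 'mid']

# ...while replace(..., n=1) removes only the first occurrence and
# keeps whatever follows it, so the results can differ.
first_removed = s.str.replace("_total", "", n=1, regex=False)
print(first_removed.tolist())  # ['county', 'mid_total']
```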






[jira] [Created] (SPARK-37183) pyspark.pandas.DataFrame.map() should support .fillna()

2021-11-01 Thread Chuck Connell (Jira)
Chuck Connell created SPARK-37183:
-

 Summary: pyspark.pandas.DataFrame.map() should support .fillna()
 Key: SPARK-37183
 URL: https://issues.apache.org/jira/browse/SPARK-37183
 Project: Spark
  Issue Type: Improvement
  Components: PySpark
Affects Versions: 3.2.0
Reporter: Chuck Connell


In regular pandas you can say 
{quote}DF["new_column"] = DF["column"].map(some_map).fillna(DF["column"])
{quote}
in order to use the existing value if the mapping key is not found.

But this does not work in pyspark.pandas.
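For reference, the plain-pandas behavior being requested (the map and values are illustrative):

```python
import pandas as pd

s = pd.Series(["MA", "NH", "XX"])
state_names = {"MA": "Massachusetts", "NH": "New Hampshire"}

# Keys missing from the map come back as NaN; fillna() then falls
# back to the original value for those rows.
mapped = s.map(state_names).fillna(s)
print(mapped.tolist())  # ['Massachusetts', 'New Hampshire', 'XX']
```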






[jira] [Created] (SPARK-37182) pyspark.pandas.to_numeric() should support the errors option

2021-11-01 Thread Chuck Connell (Jira)
Chuck Connell created SPARK-37182:
-

 Summary: pyspark.pandas.to_numeric() should support the errors 
option
 Key: SPARK-37182
 URL: https://issues.apache.org/jira/browse/SPARK-37182
 Project: Spark
  Issue Type: Improvement
  Components: PySpark
Affects Versions: 3.2.0
Reporter: Chuck Connell


In regular pandas you can say to_numeric(errors='coerce'). But the errors 
option is not recognized by pyspark.pandas.

FYI, the errors option is recognized by pyspark.pandas.to_datetime()
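The plain-pandas behavior being requested, for reference (values are illustrative):

```python
import pandas as pd

raw = pd.Series(["1", "2.5", "oops"])

# errors='coerce' turns unparseable values into NaN instead of raising.
nums = pd.to_numeric(raw, errors="coerce")
print(nums.isna().tolist())  # [False, False, True]
```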






[jira] [Created] (SPARK-37181) pyspark.pandas.read_csv() should support latin-1 encoding

2021-11-01 Thread Chuck Connell (Jira)
Chuck Connell created SPARK-37181:
-

 Summary: pyspark.pandas.read_csv() should support latin-1 encoding
 Key: SPARK-37181
 URL: https://issues.apache.org/jira/browse/SPARK-37181
 Project: Spark
  Issue Type: Improvement
  Components: PySpark
Affects Versions: 3.2.0
Reporter: Chuck Connell


In regular pandas, you can say {{read_csv(encoding='latin-1')}}. This encoding is 
not recognized in pyspark.pandas. You have to use Windows-1252 instead, which 
is almost the same but not identical.
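The "almost the same" difference sits in bytes 0x80-0x9F, which Windows-1252 maps to printable characters while latin-1 (ISO-8859-1) maps to C1 control codes; a standard-library sketch:

```python
# 0x93/0x94 are curly quotes in Windows-1252 but C1 control
# characters in latin-1, so the same bytes decode differently.
data = b"\x93hello\x94"
print(data.decode("windows-1252"))   # “hello”
print(repr(data.decode("latin-1")))  # '\x93hello\x94'
```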






[jira] [Updated] (SPARK-37180) PySpark.pandas should support __version__

2021-11-01 Thread Chuck Connell (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37180?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chuck Connell updated SPARK-37180:
--
Description: 
In regular pandas you can say
{quote}{{pd.__version__}}
{quote}
to get the pandas version number. PySpark pandas should support the same.

  was:
In regular pandas you can say
{quote}{{pd.__version__ }}{quote}
to get the pandas version number. PySpark pandas should support the same.


> PySpark.pandas should support __version__
> -
>
> Key: SPARK-37180
> URL: https://issues.apache.org/jira/browse/SPARK-37180
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Chuck Connell
>Priority: Major
>
> In regular pandas you can say
> {quote}{{pd.__version__}}
> {quote}
> to get the pandas version number. PySpark pandas should support the same.






[jira] [Updated] (SPARK-37180) PySpark.pandas should support __version__

2021-11-01 Thread Chuck Connell (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37180?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chuck Connell updated SPARK-37180:
--
Description: 
In regular pandas you can say
{quote}{{pd.__version__ }}{quote}
to get the pandas version number. PySpark pandas should support the same.

  was:In regular pandas you can say pd.__version__ to get the pandas version 
number. PySpark pandas should support the same.


> PySpark.pandas should support __version__
> -
>
> Key: SPARK-37180
> URL: https://issues.apache.org/jira/browse/SPARK-37180
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Chuck Connell
>Priority: Major
>
> In regular pandas you can say
> {quote}{{pd.__version__ }}{quote}
> to get the pandas version number. PySpark pandas should support the same.






[jira] [Created] (SPARK-37180) PySpark.pandas should support __version__

2021-11-01 Thread Chuck Connell (Jira)
Chuck Connell created SPARK-37180:
-

 Summary: PySpark.pandas should support __version__
 Key: SPARK-37180
 URL: https://issues.apache.org/jira/browse/SPARK-37180
 Project: Spark
  Issue Type: Improvement
  Components: PySpark
Affects Versions: 3.2.0
Reporter: Chuck Connell


In regular pandas you can say pd.__version__ to get the pandas version number. 
PySpark pandas should support the same.
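The plain-pandas behavior, for reference (the printed value depends on the installed version):

```python
import pandas as pd

# Plain pandas exposes its version as a string attribute.
print(pd.__version__)  # e.g. "1.3.4"
```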


