[jira] [Commented] (SPARK-34544) pyspark toPandas() should return pd.DataFrame

2021-02-25 Thread Daniel Himmelstein (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34544?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17291040#comment-17291040
 ] 

Daniel Himmelstein commented on SPARK-34544:


SPARK-34540 is an example. 
{{[DataFrameLike|https://github.com/apache/spark/blob/4a3200b08ac3e7733b5a3dc7271d35e6872c5967/python/pyspark/sql/pandas/_typing/protocols/frame.pyi#L37-L428]}}
 is missing the {{pd.DataFrame.convert_dtypes}} method. It's also missing 
{{pd.DataFrame.head}} and column attribute access 
({{pd.DataFrame.my_column_name}}).

Keeping up with every upstream pandas.DataFrame API change seems like an 
impossible task, and a single protocol can't accommodate the different pandas 
versions in use by end users.
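
For illustration, a minimal sketch of the workarounds this currently forces (the DataFrame and column name {{my_column}} are made up, and the exact mypy error code may differ):
{code:python}
from typing import cast

import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sdf = spark.createDataFrame([(1,)], ["my_column"])

# mypy (with the 3.1 stubs): "DataFrameLike" has no attribute "convert_dtypes"
pdf = sdf.toPandas()
pdf = pdf.convert_dtypes()  # type: ignore[attr-defined]

# or recover the real type with a cast (an assert on the type would also work)
pdf2 = cast(pd.DataFrame, sdf.toPandas())
pdf2 = pdf2.convert_dtypes()  # type-checks cleanly
{code}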

> pyspark toPandas() should return pd.DataFrame
> -
>
> Key: SPARK-34544
> URL: https://issues.apache.org/jira/browse/SPARK-34544
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.0.1
>Reporter: Rafal Wojdyla
>Priority: Critical
>
> Right now {{toPandas()}} returns {{DataFrameLike}}, which is an incomplete 
> "view" of pandas {{DataFrame}}. This leads to cases where mypy reports that 
> certain pandas methods are not present in {{DataFrameLike}}, even though those 
> methods are valid methods on pandas {{DataFrame}}, which is the actual type 
> of the object. This requires type ignore comments or asserts.






[jira] [Commented] (SPARK-26325) Interpret timestamp fields in Spark while reading json (timestampFormat)

2021-02-02 Thread Daniel Himmelstein (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-26325?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17277405#comment-17277405
 ] 

Daniel Himmelstein commented on SPARK-26325:


h1. Solution in pyspark 3.0.1

Turns out there is an {{inferTimestamp}} option that must be enabled. From the 
Spark [migration 
guide|https://spark.apache.org/docs/latest/sql-migration-guide.html#upgrading-from-spark-sql-30-to-301]:
{quote}In Spark 3.0, JSON datasource and JSON function {{schema_of_json}} infer 
TimestampType from string values if they match to the pattern defined by the 
JSON option {{timestampFormat}}. Since version 3.0.1, the timestamp type 
inference is disabled by default. Set the JSON option {{inferTimestamp}} to 
{{true}} to enable such type inference.
{quote}
I'm surprised this change would land in a patch release and is not yet 
reflected in the [latest 
docs|https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/DataFrameReader.html].
 But it looks like timestamp inference was correlated with a major performance 
regression, so it was turned off by default: 
[apache/spark#28966|https://github.com/apache/spark/pull/28966], SPARK-26325, 
and SPARK-32130.

So in pyspark 3.0.1:
{code:python}
line = '{"time_field" : "2017-09-30 04:53:39.412496Z"}'
rdd = spark.sparkContext.parallelize([line])
(
    spark.read
    .option("inferTimestamp", "true")
    .option("timestampFormat", "yyyy-MM-dd HH:mm:ss.SSSSSS'Z'")
    .json(path=rdd)
){code}
Returns:
{code:java}
DataFrame[time_field: timestamp]
{code}
Yay!
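
(As an aside: passing an explicit schema sidesteps inference entirely, and with it the performance cost that led to {{inferTimestamp}} being disabled by default. A rough sketch, not verified, reusing the same sample data; {{timestampFormat}} should still control how the string is parsed into the {{TimestampType}} column.)
{code:python}
from pyspark.sql.types import StructField, StructType, TimestampType

line = '{"time_field" : "2017-09-30 04:53:39.412496Z"}'
rdd = spark.sparkContext.parallelize([line])
schema = StructType([StructField("time_field", TimestampType())])
(
    spark.read
    .schema(schema)
    .option("timestampFormat", "yyyy-MM-dd HH:mm:ss.SSSSSS'Z'")
    .json(path=rdd)
)
{code}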

> Interpret timestamp fields in Spark while reading json (timestampFormat)
> 
>
> Key: SPARK-26325
> URL: https://issues.apache.org/jira/browse/SPARK-26325
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Veenit Shah
>Priority: Major
>
> I am trying to read a pretty-printed json which has time fields in it. I want 
> to interpret the timestamp columns as timestamp fields while reading the 
> json itself. However, it's still reading them as string when I {{printSchema}}.
> E.g. Input json file -
> {code:java}
> [{
> "time_field" : "2017-09-30 04:53:39.412496Z"
> }]
> {code}
> Code -
> {code:java}
> df = spark.read.option("multiLine", "true") \
>     .option("timestampFormat", "yyyy-MM-dd HH:mm:ss.SSSSSS'Z'") \
>     .json('path_to_json_file')
> {code}
> Output of df.printSchema() -
> {code:java}
> root
>  |-- time_field: string (nullable = true)
> {code}






[jira] [Comment Edited] (SPARK-26325) Interpret timestamp fields in Spark while reading json (timestampFormat)

2021-02-01 Thread Daniel Himmelstein (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-26325?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17276711#comment-17276711
 ] 

Daniel Himmelstein edited comment on SPARK-26325 at 2/1/21, 10:53 PM:
--

Here's the code from the original post, but using an RDD rather than JSON file 
and applying [~maxgekk]'s suggestion to "try Z instead of 'Z'":
{code:python}
line = '{"time_field" : "2017-09-30 04:53:39.412496Z"}'
rdd = spark.sparkContext.parallelize([line])
(
    spark.read
    .option("timestampFormat", "yyyy-MM-dd HH:mm:ss.SSSSSSZ")
    .json(path=rdd)
){code}
The output I get with pyspark 3.0.1 is {{DataFrame[time_field: string]}}, so it 
looks like the issue remains.

I'd be interested to see any examples where Spark infers a date or timestamp 
from a JSON string, or to learn whether dateFormat and timestampFormat do not 
work at all.


was (Author: dhimmel):
Here's the code from the original post, but using an RDD rather than JSON file 
and applying [~maxgekk]'s suggestion to "try Z instead of 'Z'":
{code:python}
line = '{"time_field" : "2017-09-30 04:53:39.412496Z"}'
rdd = spark.sparkContext.parallelize([line])
(
    spark.read
    .option("timestampFormat", "yyyy-MM-dd HH:mm:ss.SSSSSSZ")
    .json(path=rdd)
){code}
The output I get with pyspark 3.0.1 is {{DataFrame[time_field: string]}}, so it 
looks like the issue remains.

I'd be interested to see any examples where Spark infers a timestamp from a 
JSON string, or to learn whether timestampFormat does not work at all.

> Interpret timestamp fields in Spark while reading json (timestampFormat)
> 
>
> Key: SPARK-26325
> URL: https://issues.apache.org/jira/browse/SPARK-26325
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Veenit Shah
>Priority: Major
>
> I am trying to read a pretty-printed json which has time fields in it. I want 
> to interpret the timestamp columns as timestamp fields while reading the 
> json itself. However, it's still reading them as string when I {{printSchema}}.
> E.g. Input json file -
> {code:java}
> [{
> "time_field" : "2017-09-30 04:53:39.412496Z"
> }]
> {code}
> Code -
> {code:java}
> df = spark.read.option("multiLine", "true") \
>     .option("timestampFormat", "yyyy-MM-dd HH:mm:ss.SSSSSS'Z'") \
>     .json('path_to_json_file')
> {code}
> Output of df.printSchema() -
> {code:java}
> root
>  |-- time_field: string (nullable = true)
> {code}






[jira] [Commented] (SPARK-26325) Interpret timestamp fields in Spark while reading json (timestampFormat)

2021-02-01 Thread Daniel Himmelstein (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-26325?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17276711#comment-17276711
 ] 

Daniel Himmelstein commented on SPARK-26325:


Here's the code from the original post, but using an RDD rather than JSON file 
and applying [~maxgekk]'s suggestion to "try Z instead of 'Z'":
{code:python}
line = '{"time_field" : "2017-09-30 04:53:39.412496Z"}'
rdd = spark.sparkContext.parallelize([line])
(
    spark.read
    .option("timestampFormat", "yyyy-MM-dd HH:mm:ss.SSSSSSZ")
    .json(path=rdd)
){code}
The output I get with pyspark 3.0.1 is {{DataFrame[time_field: string]}}, so it 
looks like the issue remains.

I'd be interested to see any examples where Spark infers a timestamp from a 
JSON string, or to learn whether timestampFormat does not work at all.

> Interpret timestamp fields in Spark while reading json (timestampFormat)
> 
>
> Key: SPARK-26325
> URL: https://issues.apache.org/jira/browse/SPARK-26325
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Veenit Shah
>Priority: Major
>
> I am trying to read a pretty-printed json which has time fields in it. I want 
> to interpret the timestamp columns as timestamp fields while reading the 
> json itself. However, it's still reading them as string when I {{printSchema}}.
> E.g. Input json file -
> {code:java}
> [{
> "time_field" : "2017-09-30 04:53:39.412496Z"
> }]
> {code}
> Code -
> {code:java}
> df = spark.read.option("multiLine", "true") \
>     .option("timestampFormat", "yyyy-MM-dd HH:mm:ss.SSSSSS'Z'") \
>     .json('path_to_json_file')
> {code}
> Output of df.printSchema() -
> {code:java}
> root
>  |-- time_field: string (nullable = true)
> {code}






[jira] [Updated] (SPARK-33310) Relax pyspark typing for sql str functions

2020-10-31 Thread Daniel Himmelstein (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33310?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Himmelstein updated SPARK-33310:
---
Description: 
Several pyspark.sql.functions have overly strict typing, in that the type is 
more restrictive than the functionality. Specifically, each function allows 
specifying the column to operate on with a pyspark.sql.Column or a str. This is 
handled internally by 
[_to_java_column|https://github.com/apache/spark/blob/491a0fb08b0c57a99894a0b33c5814854db8de3d/python/pyspark/sql/column.py#L39-L50],
 which accepts a Column or string.

There is a pre-existing type for this: 
[ColumnOrName|https://github.com/apache/spark/blob/72ad9dcd5d484a8dd64c08889de85ef9de2a6077/python/pyspark/sql/_typing.pyi#L37].
 ColumnOrName is used for many of the type definitions of pyspark.sql.functions 
arguments, but [not 
for|https://github.com/apache/spark/blob/72ad9dcd5d484a8dd64c08889de85ef9de2a6077/python/pyspark/sql/functions.pyi#L158-L162]
 locate, lpad, rpad, repeat, and split.
{code:java}
def locate(substr: str, str: Column, pos: int = ...) -> Column: ...
def lpad(col: Column, len: int, pad: str) -> Column: ...
def rpad(col: Column, len: int, pad: str) -> Column: ...
def repeat(col: Column, n: int) -> Column: ...
def split(str: Column, pattern: str, limit: int = ...) -> Column: ...{code}
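
For comparison, a sketch of what the relaxed annotations could look like (the linked PRs may differ in the details):
{code:java}
def locate(substr: str, str: ColumnOrName, pos: int = ...) -> Column: ...
def lpad(col: ColumnOrName, len: int, pad: str) -> Column: ...
def rpad(col: ColumnOrName, len: int, pad: str) -> Column: ...
def repeat(col: ColumnOrName, n: int) -> Column: ...
def split(str: ColumnOrName, pattern: str, limit: int = ...) -> Column: ...{code}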
ColumnOrName was not used for these functions by [~zero323] since Maciej "was 
concerned that this might be confusing or ambiguous", because these functions 
take a column to operate on as well as strings which are used in the operation.

But I think ColumnOrName makes it clear that this variable refers to the column 
and not a string parameter. There are also other ways to address confusion, 
such as the docstring or renaming the column argument from str to col.

Finally, there's considerable convenience for users in not having to wrap column 
names in pyspark.sql.functions.col. Elsewhere the API seems pretty consistent in 
its willingness to accept columns by name as well as by Column object (at least 
when a string value has no alternative meaning; an exception would be 
.when/.otherwise).

For example, we were calling pyspark.sql.functions.split with a string value 
for the str argument (specifying which column to split). I noticed this when we 
enforced typing with pyspark-stubs in preparation for pyspark 3.1; for users who 
enable typing in 3.1, this is a restriction in functionality.
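
A minimal sketch of the pattern in question (the DataFrame and column name are made up); it runs fine, but under the current stubs a type checker rejects passing a str for the column:
{code:python}
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("a,b,c",)], ["csv_col"])

# accepted at runtime (the name is resolved via _to_java_column),
# but the current stub types split()'s first argument as Column only
df.select(F.split("csv_col", ",")).show()
{code}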

Pre-existing PRs to address this:
 * [https://github.com/apache/spark/pull/30209]
 * [https://github.com/zero323/pyspark-stubs/pull/420]

  was:
Several pyspark.sql.functions have overly strict typing, in that the type is 
more restrictive than the functionality. Specifically, the function allows 
specifying the column to operate on with a pyspark.sql.Column or a str. This is 
handled internally by 
[_to_java_column|https://github.com/apache/spark/blob/491a0fb08b0c57a99894a0b33c5814854db8de3d/python/pyspark/sql/column.py#L39-L50],
 which accepts a Column or string.

There is a pre-existing type for this: 
[ColumnOrName|https://github.com/apache/spark/blob/72ad9dcd5d484a8dd64c08889de85ef9de2a6077/python/pyspark/sql/_typing.pyi#L37].
 ColumnOrName is used for many of the type definitions of pyspark.sql.functions 
arguments, but [not 
for|https://github.com/apache/spark/blob/72ad9dcd5d484a8dd64c08889de85ef9de2a6077/python/pyspark/sql/functions.pyi#L158-L162]
 locate, lpad, rpad, repeat, and split.
{code:java}
def locate(substr: str, str: Column, pos: int = ...) -> Column: ...
def lpad(col: Column, len: int, pad: str) -> Column: ...
def rpad(col: Column, len: int, pad: str) -> Column: ...
def repeat(col: Column, n: int) -> Column: ...
def split(str: Column, pattern: str, limit: int = ...) -> Column: ...{code}
ColumnOrName was not used for these functions by [~zero323] since Maciej "was 
concerned that this might be confusing or ambiguous", because these functions 
take a column to operate on as well as strings which are used in the operation.

But I think ColumnOrName makes it clear that this variable refers to the column 
and not a string parameter. There are also other ways to address confusion, 
such as the docstring or renaming the column argument from str to col.

Finally, there's considerable convenience for users in not having to wrap column 
names in pyspark.sql.functions.col. Elsewhere the API seems pretty consistent in 
its willingness to accept columns by name as well as by Column object (at least 
when a string value has no alternative meaning; an exception would be 
.when/.otherwise).

For example, we were calling pyspark.sql.functions.split with a string value 
for the str argument (specifying which column to split). I noticed this when we 
enforced typing with pyspark-stubs in preparation for pyspark 3.1.

Pre-existing PRs to address this:
 * https://github.com/apache/spark/pull/30209
 * 

[jira] [Created] (SPARK-33310) Relax pyspark typing for sql str functions

2020-10-31 Thread Daniel Himmelstein (Jira)
Daniel Himmelstein created SPARK-33310:
--

 Summary: Relax pyspark typing for sql str functions
 Key: SPARK-33310
 URL: https://issues.apache.org/jira/browse/SPARK-33310
 Project: Spark
  Issue Type: Wish
  Components: PySpark
Affects Versions: 3.1.0
Reporter: Daniel Himmelstein
 Fix For: 3.1.0


Several pyspark.sql.functions have overly strict typing, in that the type is 
more restrictive than the functionality. Specifically, the function allows 
specifying the column to operate on with a pyspark.sql.Column or a str. This is 
handled internally by 
[_to_java_column|https://github.com/apache/spark/blob/491a0fb08b0c57a99894a0b33c5814854db8de3d/python/pyspark/sql/column.py#L39-L50],
 which accepts a Column or string.

There is a pre-existing type for this: 
[ColumnOrName|https://github.com/apache/spark/blob/72ad9dcd5d484a8dd64c08889de85ef9de2a6077/python/pyspark/sql/_typing.pyi#L37].
 ColumnOrName is used for many of the type definitions of pyspark.sql.functions 
arguments, but [not 
for|https://github.com/apache/spark/blob/72ad9dcd5d484a8dd64c08889de85ef9de2a6077/python/pyspark/sql/functions.pyi#L158-L162]
 locate, lpad, rpad, repeat, and split.
{code:java}
def locate(substr: str, str: Column, pos: int = ...) -> Column: ...
def lpad(col: Column, len: int, pad: str) -> Column: ...
def rpad(col: Column, len: int, pad: str) -> Column: ...
def repeat(col: Column, n: int) -> Column: ...
def split(str: Column, pattern: str, limit: int = ...) -> Column: ...{code}
ColumnOrName was not used for these functions by [~zero323] since Maciej "was 
concerned that this might be confusing or ambiguous", because these functions 
take a column to operate on as well as strings which are used in the operation.

But I think ColumnOrName makes it clear that this variable refers to the column 
and not a string parameter. There are also other ways to address confusion, 
such as the docstring or renaming the column argument from str to col.

Finally, there's considerable convenience for users in not having to wrap column 
names in pyspark.sql.functions.col. Elsewhere the API seems pretty consistent in 
its willingness to accept columns by name as well as by Column object (at least 
when a string value has no alternative meaning; an exception would be 
.when/.otherwise).

For example, we were calling pyspark.sql.functions.split with a string value 
for the str argument (specifying which column to split). I noticed this when we 
enforced typing with pyspark-stubs in preparation for pyspark 3.1.

Pre-existing PRs to address this:
 * https://github.com/apache/spark/pull/30209
 * https://github.com/zero323/pyspark-stubs/pull/420


