[ https://issues.apache.org/jira/browse/SPARK-11758?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Leandro Ferrado updated SPARK-11758:
------------------------------------
    Comment: was deleted

(was: Hi Holden.
First, I would add just a single line to avoid the bad conversion of 'datetime' objects (so far, DataFrame.to_records(index=False) converts a Date column into a LongInt column). The idea is to first convert all columns to string types, so that DataFrame.to_records(index=False) does not make bad conversions of datetime.datetime objects. However, that can be done only if we define a pyspark.sql.dataframe.DataFrame with a schema of strings, or if we don't define a schema (in that case, the function creates a schema of strings). So, the modification applies only under the condition 'schema is None', and the snippet would be:
-------
if has_pandas and isinstance(data, pandas.DataFrame):
    if schema is None:  # begin if clause #
        schema = [str(x) for x in data.columns]
        data = data.astype(str)  # Convert all fields to strings because we don't have a defined schema
    # end if clause #
    data = [r.tolist() for r in data.to_records(index=False)]
-------
If the schema contains timestamps (e.g. TimestampType() or DateType()), a prior conversion is needed from Python datetime.datetime objects to a format suitable for PySpark DataFrames.
Regarding the 'index=False' argument, so far I can't think of a scenario in which a per-row index is needed on a DataFrame. So that argument may be fine as it is; I'm not sure.)

> Missing Index column while creating a DataFrame from Pandas
> ------------------------------------------------------------
>
>                 Key: SPARK-11758
>                 URL: https://issues.apache.org/jira/browse/SPARK-11758
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark, SQL
>    Affects Versions: 1.5.1
>         Environment: Linux Debian, PySpark, in local testing.
>            Reporter: Leandro Ferrado
>            Priority: Minor
>   Original Estimate: 5h
>  Remaining Estimate: 5h
>
> In PySpark's SQLContext, when createDataFrame() is invoked with a
> pandas.DataFrame and a 'schema' of StructFields, the function
> _createFromLocal() converts the pandas.DataFrame but ignores two points:
> - The index column, because of the flag index=False.
> - Timestamp records, because a Date column can't be the index and Pandas
> doesn't convert its records to a Timestamp type.
> So, converting a DataFrame from Pandas to SQL works poorly in scenarios with
> temporal records.
> Doc:
> http://pandas.pydata.org/pandas-docs/version/0.17.0/generated/pandas.DataFrame.to_records.html
> Affected code:
> def _createFromLocal(self, data, schema):
>     """
>     Create an RDD for DataFrame from a list or pandas.DataFrame, returns
>     the RDD and schema.
>     """
>     if has_pandas and isinstance(data, pandas.DataFrame):
>         if schema is None:
>             schema = [str(x) for x in data.columns]
>         data = [r.tolist() for r in data.to_records(index=False)]  # HERE
>     # ...

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
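
For illustration, the change proposed in the deleted comment can be tried outside Spark. The sketch below is not the Spark implementation: pandas_to_rows is a hypothetical helper that copies only the pandas branch of _createFromLocal, and the sample DataFrame is made up. It assumes pandas is installed. Whether the raw to_records() path returns long integers for the datetime column depends on the pandas and NumPy versions in use, so the script prints that result rather than asserting it.

```python
import datetime

import pandas as pd


def pandas_to_rows(data, schema=None):
    """Hypothetical stand-in for the pandas branch of _createFromLocal.

    When no schema is given, every column is cast to str first, as the
    deleted comment proposes, so to_records() has no datetime to convert.
    """
    if schema is None:
        schema = [str(x) for x in data.columns]
        data = data.astype(str)  # proposed extra line
    rows = [r.tolist() for r in data.to_records(index=False)]
    return rows, schema


# Made-up sample data with one temporal column.
df = pd.DataFrame({
    "id": [1, 2],
    "when": [
        datetime.datetime(2015, 11, 16, 12, 0),
        datetime.datetime(2015, 11, 17, 8, 30),
    ],
})

# Current behaviour: records are built straight from the DataFrame.
raw_rows = [r.tolist() for r in df.to_records(index=False)]
print("without astype(str):", raw_rows)

# Proposed behaviour when schema is None: all values become strings.
rows, schema = pandas_to_rows(df)
print("with astype(str):   ", rows)
print("schema:             ", schema)
```

Casting to strings sidesteps the conversion problem only for the schema-less case; as the comment notes, a schema that declares TimestampType() or DateType() would still need a separate conversion step.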