Hi, I'm new to Spark and am trying to create a Spark DataFrame from a pandas DataFrame with ~5 million rows, using Spark 1.4.1.
When I type:

    df = sqlContext.createDataFrame(pandas_df.where(pd.notnull(pandas_df), None))

(the .where is a hack I found on the Spark JIRA to avoid NaN values producing mixed column types) I get:

    TypeError: cannot create an RDD from type: <type 'list'>

Converting a smaller pandas DataFrame (~2000 rows) works fine. Has anyone hit this issue?

This is already a workaround -- ideally I'd like to read the Spark DataFrame from a Hive table, but that's currently not an option for my setup.

I also tried reading the data into Spark from a CSV using spark-csv, but haven't been able to make that work yet. I launch

    $ pyspark --jars path/to/spark-csv_2.11-1.2.0.jar

and when I attempt to read the CSV I get:

    Py4JJavaError: An error occurred while calling o22.load.
    : java.lang.NoClassDefFoundError: org/apache/commons/csv/CSVFormat
    ...

Other options I can think of:
- Convert my CSV to JSON (using Pig?) and read that into Spark
- Read it in over a JDBC connection from Postgres

But I want to make sure I'm not misusing Spark or missing something obvious.

Thanks!
Charlie
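For context, here's a minimal, self-contained sketch of the NaN-to-None replacement I'm doing before handing the frame to Spark (column names and data are made up; I also found that newer pandas coerces None back to NaN in numeric columns, so I cast to object first):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for my real ~5M-row DataFrame (made-up data)
pandas_df = pd.DataFrame({"price": [1.5, np.nan, 3.0],
                          "label": ["a", None, "c"]})

# The JIRA hack: replace every NaN with None so Spark's schema
# inference doesn't see mixed float/NaN column types.
# Casting to object first keeps pandas from coercing None back
# to NaN in the numeric column.
cleaned = pandas_df.astype(object).where(pandas_df.notnull(), None)

print(cleaned["price"].tolist())  # the NaN is now a real Python None
```

This is then what gets passed to sqlContext.createDataFrame(cleaned).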
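In case it matters, here are the exact launch commands: the --jars form I used, and a --packages form I'm considering trying next, since I gather --packages resolves transitive dependencies from Maven (commons-csv among them, which the NoClassDefFoundError suggests is missing). I also wonder whether the _2.11 artifact even matches my Spark build, since the prebuilt 1.4.1 binaries are Scala 2.10:

```shell
# What I run now: only the spark-csv jar itself, no transitive deps
pyspark --jars path/to/spark-csv_2.11-1.2.0.jar

# What I plan to try: let Spark fetch spark-csv plus its dependencies
# (including commons-csv) from Maven; note the Scala 2.10 artifact
# to match the prebuilt Spark 1.4.1 binaries
pyspark --packages com.databricks:spark-csv_2.10:1.2.0
```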