If you would like to try using spark-csv, please use `pyspark --packages com.databricks:spark-csv_2.11:1.2.0`
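The distinction matters because `--packages` resolves the jar's transitive dependencies (including `commons-csv`, which provides the `CSVFormat` class reported missing below) from Maven Central, while `--jars` ships only the single jar you name. A sketch of the two invocations (the jar path is the placeholder from the original post):

```shell
# Pulls spark-csv AND its transitive dependencies (commons-csv among them)
# from Maven Central, so org/apache/commons/csv/CSVFormat is on the classpath:
pyspark --packages com.databricks:spark-csv_2.11:1.2.0

# By contrast, --jars adds only the named jar; commons-csv is still absent,
# which is what produces the NoClassDefFoundError quoted below:
# pyspark --jars path/to/spark-csv_2.11-1.2.0.jar
```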
You're missing a dependency.

Best,
Burak

On Thu, Aug 20, 2015 at 1:08 PM, Charlie Hack <charles.t.h...@gmail.com> wrote:
> Hi,
>
> I'm new to Spark and am trying to create a Spark DataFrame from a pandas
> DataFrame with ~5 million rows, using Spark 1.4.1.
>
> When I type:
>
> df = sqlContext.createDataFrame(pandas_df.where(pd.notnull(pandas_df), None))
>
> (the df.where is a hack I found on the Spark JIRA to avoid a problem with
> NaN values producing mixed column types)
>
> I get:
>
> TypeError: cannot create an RDD from type: <type 'list'>
>
> Converting a smaller pandas DataFrame (~2000 rows) works fine. Has anyone
> else had this issue?
>
> This is already a workaround -- ideally I'd like to read the Spark
> DataFrame from a Hive table, but that is currently not an option for my
> setup.
>
> I also tried reading the data into Spark from a CSV using spark-csv, but
> haven't been able to make this work yet. I launch
>
> $ pyspark --jars path/to/spark-csv_2.11-1.2.0.jar
>
> and when I attempt to read the CSV I get:
>
> Py4JJavaError: An error occurred while calling o22.load. :
> java.lang.NoClassDefFoundError: org/apache/commons/csv/CSVFormat ...
>
> Other options I can think of:
>
> - Convert my CSV to JSON (using Pig?) and read it into Spark
> - Read it in over a JDBC connection from Postgres
>
> But I want to make sure I'm not misusing Spark or missing something obvious.
>
> Thanks!
>
> Charlie
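The NaN workaround quoted above can be reproduced with plain pandas (a minimal sketch; the column names and data are made up, and the `createDataFrame` call is shown only as a comment since it needs a running SQLContext):

```python
import pandas as pd

# A small frame with missing values. NaN in a column that should hold
# strings is what trips up Spark's type inference in createDataFrame.
pandas_df = pd.DataFrame({"name": ["alice", None, "bob"],
                          "score": [1.5, 2.0, float("nan")]})

# The JIRA workaround: DataFrame.where keeps values where the condition is
# True and substitutes the second argument elsewhere, so missing entries
# become plain None (pandas upcasts affected columns to object dtype).
cleaned = pandas_df.where(pd.notnull(pandas_df), None)

# With a SQLContext available, one would then call:
# df = sqlContext.createDataFrame(cleaned)
```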