The easiest option I found is to put the jars on the SPARK_CLASSPATH.
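For example, a minimal sketch (jar locations are placeholders, and note that spark-csv also needs its commons-csv dependency on the classpath -- the missing dependency Burak mentions below):

    export SPARK_CLASSPATH=/path/to/spark-csv_2.11-1.2.0.jar:/path/to/commons-csv-x.y.jar
    pyspark

Spark will warn that SPARK_CLASSPATH is deprecated in favor of spark.driver.extraClassPath and spark.executor.extraClassPath, but as a quick local fix it works.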
On 21 Aug 2015 06:20, "Burak Yavuz" <brk...@gmail.com> wrote:

> If you would like to try using spark-csv, please use
> `pyspark --packages com.databricks:spark-csv_2.11:1.2.0`
>
> You're missing a dependency.
>
> Best,
> Burak
>
> On Thu, Aug 20, 2015 at 1:08 PM, Charlie Hack <charles.t.h...@gmail.com>
> wrote:
>
>> Hi,
>>
>> I'm new to Spark and am trying to create a Spark df from a pandas df
>> with ~5 million rows, using Spark 1.4.1.
>>
>> When I type:
>>
>> df = sqlContext.createDataFrame(pandas_df.where(pd.notnull(pandas_df), None))
>>
>> (the .where is a hack I found on the Spark JIRA to avoid a problem with
>> NaN values producing mixed column types)
>>
>> I get:
>>
>> TypeError: cannot create an RDD from type: <type 'list'>
>>
>> Converting a smaller pandas dataframe (~2000 rows) works fine. Has
>> anyone hit this issue?
>>
>> This is already a workaround -- ideally I'd like to read the Spark
>> dataframe from a Hive table, but that is currently not an option for my
>> setup.
>>
>> I also tried reading the data into Spark from a CSV using spark-csv, but
>> I haven't been able to make this work yet. I launch
>>
>> $ pyspark --jars path/to/spark-csv_2.11-1.2.0.jar
>>
>> and when I attempt to read the CSV I get:
>>
>> Py4JJavaError: An error occurred while calling o22.load. :
>> java.lang.NoClassDefFoundError: org/apache/commons/csv/CSVFormat ...
>>
>> Other options I can think of:
>>
>> - Convert my CSV to JSON (use Pig?) and read it into Spark
>> - Read it in over a JDBC connection from Postgres
>>
>> But I want to make sure I'm not misusing Spark or missing something
>> obvious.
>>
>> Thanks!
>>
>> Charlie
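For completeness, a minimal sketch of the two approaches discussed above, assuming Spark 1.4.x with spark-csv pulled in via --packages (the CSV path is a placeholder):

    $ pyspark --packages com.databricks:spark-csv_2.11:1.2.0

    # in the pyspark shell:
    import pandas as pd

    # pandas route (fine for smaller frames): swap NaN for None first
    # so createDataFrame doesn't see mixed types within a column
    pandas_df = pd.read_csv('/path/to/data.csv')
    df = sqlContext.createDataFrame(pandas_df.where(pd.notnull(pandas_df), None))

    # spark-csv route: skip pandas and let Spark read the file directly,
    # which avoids materializing ~5M rows on the driver
    df = sqlContext.read.format('com.databricks.spark.csv') \
        .options(header='true', inferSchema='true') \
        .load('/path/to/data.csv')

At 5 million rows the second route is likely the better fit, since the pandas conversion has to ship the whole frame through the driver.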