Pipeline in pyspark
Hi,

I came across the documentation for creating a pipeline in the MLlib library of pyspark. I wanted to know if something similar exists for pyspark input transformations. I have a use case where my input files come in different formats; I would like to convert them to RDDs, keep them in memory, and run certain custom tasks as a pipeline without writing anything back to disk at any step. I came across Luigi (http://luigi.readthedocs.org/en/latest/), but found that it stores intermediate contents on disk and reloads them for the next phase of the pipeline.

-- Thanks and regards, Suraj
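To make the use case concrete, here is a rough sketch of the kind of chained, in-memory pipeline I have in mind. The parsing and filtering functions are illustrative placeholders, and the plain-Python `map`/`filter` calls stand in for the equivalent RDD transformations:

```python
# Sketch: chain custom stages in memory, the way
# sc.textFile(...).map(parse).filter(valid) chains RDD transformations
# without writing intermediate results to disk.
# (Plain-Python stand-in; parse/valid are hypothetical examples.)

def parse(line):
    # hypothetical parser: split a CSV-like line into fields
    return line.strip().split(",")

def valid(fields):
    # hypothetical filter: keep rows whose first field is non-empty
    return len(fields) > 0 and fields[0] != ""

def pipeline(lines, stages):
    # apply each (kind, fn) stage lazily, mirroring RDD laziness
    data = iter(lines)
    for kind, fn in stages:
        data = map(fn, data) if kind == "map" else filter(fn, data)
    return list(data)

rows = pipeline(["a,1", ",2", "b,3"], [("map", parse), ("filter", valid)])
# rows is [["a", "1"], ["b", "3"]]
```

The point is that nothing is materialized between stages; each stage consumes the previous one's output directly in memory.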
DataFrame from a MySQL database in pyspark
Hi,

Is there any way of loading a MySQL database into a DataFrame from pyspark? I was able to find a post that converts a MySQL table to a DataFrame in spark-shell (http://www.infoobjects.com/spark-sql-jdbcrdd/) using JDBC. I have been through the official documentation and can't find any pointers in this regard.

-- Regards, Suraj
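What I am hoping for is something along the lines of the generic JDBC data source (assuming Spark 1.3+ and the MySQL JDBC driver on the classpath). The connection details below are illustrative placeholders, not a working configuration:

```python
# Sketch of JDBC connection options for the generic "jdbc" data source.
# All values are hypothetical placeholders.
jdbc_options = {
    "url": "jdbc:mysql://localhost:3306/mydb",  # hypothetical database
    "dbtable": "my_table",                      # hypothetical table
    "user": "user",
    "password": "secret",
    "driver": "com.mysql.jdbc.Driver",
}

# With a live SQLContext this would (if supported) be:
# df = sqlContext.read.format("jdbc").options(**jdbc_options).load()
```

If this route works for MySQL from pyspark, a pointer to the relevant documentation would be much appreciated.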
Query regarding inferring data types in pyspark
Hi,

In pyspark, when I read a JSON file using sqlContext, I find that the date field is not inferred as a date; instead it is converted to a string. And when I try to convert it to a timestamp using df.withColumn("DateCol", df.DateCol.cast("timestamp")), it does not parse successfully and puts a null there instead. Should I use a UDF to convert the date? Is this expected behaviour (returning null rather than throwing an error when a cast fails)?

-- Regards, Suraj
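For reference, the UDF-based workaround I am considering looks roughly like this. The column name and date format are assumptions; the parsing function itself is plain Python, with the pyspark wrapping shown as comments:

```python
from datetime import datetime

# Sketch of a UDF-style parser (date format "%Y-%m-%d" is an assumption;
# adjust to the actual JSON field).
def parse_date(s):
    # return None for unparseable values rather than raising,
    # matching cast()'s null-on-failure behaviour
    try:
        return datetime.strptime(s, "%Y-%m-%d")
    except (TypeError, ValueError):
        return None

# With a live SQLContext this would be wrapped as a UDF:
# from pyspark.sql.functions import udf
# from pyspark.sql.types import TimestampType
# parse_date_udf = udf(parse_date, TimestampType())
# df = df.withColumn("DateCol", parse_date_udf(df.DateCol))
```

Is this the recommended approach, or is there a way to get the cast itself to honour a date format?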