Pipeline in pyspark

2015-04-22 Thread Suraj Shetiya
Hi,

I came across documentation for creating a pipeline in the MLlib
library of pyspark. I wanted to know if something similar exists for
pyspark input transformations. I have a use case where my input files
are in different formats; I would like to convert them to RDDs, keep
them in memory, and run certain custom tasks in a pipeline without
writing anything back to disk at any step. I came across Luigi
(http://luigi.readthedocs.org/en/latest/), but found that it stores
the contents on disk and reloads them for the next phase of the
pipeline.
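
For reference, this is roughly the in-memory chain I have in mind:
plain RDD transformations are lazy, and .cache() should keep the
intermediate result in memory rather than on disk. The file paths and
parsing logic below are just placeholders:

    from pyspark import SparkContext

    sc = SparkContext(appName="InMemoryPipeline")

    # Hypothetical inputs: a CSV file and a tab-separated log file.
    csv_rdd = (sc.textFile("data/input.csv")          # stage 1: load
                 .map(lambda line: line.split(",")))  # stage 2: parse
    log_rdd = (sc.textFile("data/input.log")
                 .map(lambda line: line.split("\t")))

    # Stage 3: combine both sources and keep the result in memory.
    combined = csv_rdd.union(log_rdd).cache()

    # Stage 4: custom per-record work; nothing touches the disk until
    # an action (count, collect, saveAsTextFile, ...) is invoked.
    result = combined.filter(lambda fields: len(fields) > 1).count()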

-- 
Thanks and regards,
Suraj


Dataframe from mysql database in pyspark

2015-04-16 Thread Suraj Shetiya
Hi,

Is there any means of transforming a MySQL database into a DataFrame
from pyspark?
I was able to find a document that converts a MySQL database to a
DataFrame in spark-shell (http://www.infoobjects.com/spark-sql-jdbcrdd/)
using JDBC, but I have been through the official documentation and
can't find any pointers for doing the same from pyspark.
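
Based on the spark-shell example, I would expect the pyspark
equivalent to look something like the sketch below, using the JDBC
data source, but I could not confirm this from the docs. The
connection details are placeholders, and the MySQL JDBC driver jar
has to be on the classpath (e.g. via --driver-class-path):

    from pyspark import SparkContext
    from pyspark.sql import SQLContext

    sc = SparkContext(appName="MySQLToDataFrame")
    sqlContext = SQLContext(sc)

    # Load one table of a MySQL database as a DataFrame over JDBC.
    df = sqlContext.load(
        source="jdbc",
        url="jdbc:mysql://localhost:3306/mydb?user=me&password=secret",
        dbtable="orders",
    )

    df.printSchema()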

-- 
Regards,
Suraj


Query regarding inferring data types in pyspark

2015-04-10 Thread Suraj Shetiya
Hi,

In pyspark, when I read a JSON file using sqlContext, I find that the
date field is not inferred as a date; it is loaded as a string
instead. And when I try to convert it using
df.withColumn("DateCol", df.DateCol.cast("timestamp")), the cast does
not parse it successfully and puts a null there instead. Should I use
a UDF to convert the date? Is this expected behaviour (not throwing
an error when the cast fails for some fields)?
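
For reference, this is the UDF workaround I am considering; a minimal
sketch, assuming a DateCol field formatted like 2015-04-10 (the file
path, column name, and format string are only guesses at my data):

    from datetime import datetime
    from pyspark.sql.functions import udf
    from pyspark.sql.types import TimestampType

    # Parse the string column into a timestamp; None on bad input.
    def to_ts(s):
        try:
            return datetime.strptime(s, "%Y-%m-%d")
        except (TypeError, ValueError):
            return None

    parse_date = udf(to_ts, TimestampType())

    df = sqlContext.jsonFile("data/input.json")  # DateCol comes in as string
    df = df.withColumn("DateTs", parse_date(df.DateCol))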

-- 
Regards,
Suraj