Updating reference data once a day in Spark Streaming job

2016-03-07 Thread Karthikeyan Muthukumar
Hi, We have reference data that a Sqoop job pulls from an RDBMS into the analytics platform once a day. We have a Spark Streaming job that reads the reference data at startup and then joins it with continuously flowing event data. When t

Make Spark Streaming DStream as SQL table

2015-12-13 Thread Karthikeyan Muthukumar
Hi, The aim here is as follows: - read data from a socket using Spark Streaming every N seconds - register the received data as a SQL table - more data will be read from HDFS etc. as reference data, which will also be registered as SQL tables - the idea is to perform arbitrary SQL queries on the combi

Stratified sampling with DataFrames

2015-05-11 Thread Karthikeyan Muthukumar
Hi, I'm on Spark 1.3.0 and my data is in DataFrames. I need operations like sampleByKey() and sampleByKeyExact(). I saw the JIRA "Add approximate stratified sampling to DataFrame" (https://issues.apache.org/jira/browse/SPARK-7157). That's targeted for Spark 1.5; until that comes through, what's the eas

Workflow layer for Spark

2015-03-13 Thread Karthikeyan Muthukumar
Hi, We are building a machine learning platform based on MLlib in Spark. We will be using Scala for the development. We need a thin workflow layer where we can easily configure the different actions to be done, the configuration for those actions (like load-data, clean-data, split-data, etc.), and the or