Sorting within partitions is not maintained in parquet?

2016-08-10 Thread Jason Moore
Hi, it seems that something changed between Spark 1.6.2 and 2.0.0 that I wasn't expecting. If I have a DataFrame with records sorted within each partition, and I write it to Parquet and then read it back from the Parquet, previously the records would be iterated through in the same order they were written.

Re: Serving Spark ML models via a regular Python web app

2016-08-10 Thread Michael Allman
Nick, check out MLeap: https://github.com/TrueCar/mleap . It's not Python, but we use it in production to serve a random forest model trained by a Spark ML pipeline. Thanks, Michael

Serving Spark ML models via a regular Python web app

2016-08-10 Thread Nicholas Chammas
Are there any existing JIRAs covering the possibility of serving up Spark ML models via, for example, a regular Python web app? The story goes like this: you train your model with Spark on several TB of data, and now you want to use it in a prediction service that you're building, say with Flask.
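One common workaround is to export the trained model's parameters and re-implement the (usually tiny) scoring function in plain Python. The sketch below uses only the standard library; the coefficients are stand-ins for values you would export from the fitted Spark ML model (e.g. a LogisticRegressionModel's coefficients and intercept), and the endpoint shape is hypothetical.

```python
# Sketch: serving an exported linear model from a plain Python web app.
# COEFFICIENTS/INTERCEPT are placeholders for values exported from Spark ML.
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

COEFFICIENTS = [0.5, -1.2, 2.0]  # stand-in for lrModel.coefficients
INTERCEPT = 0.1                  # stand-in for lrModel.intercept

def predict(features):
    """Binary decision from the raw linear margin, as a linear model computes it."""
    margin = INTERCEPT + sum(c * x for c, x in zip(COEFFICIENTS, features))
    return 1.0 if margin > 0 else 0.0

class PredictHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Expects a JSON body like {"features": [1.0, 0.0, 0.0]}.
        length = int(self.headers.get("Content-Length", 0))
        features = json.loads(self.rfile.read(length))["features"]
        body = json.dumps({"prediction": predict(features)}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(body)

# To serve: HTTPServer(("localhost", 8080), PredictHandler).serve_forever()
```

The same predict function drops into a Flask route unchanged; the point is that serving needs only the exported parameters, not a running Spark cluster.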

Re: Use cases around image/video processing in spark

2016-08-10 Thread Benjamin Fradet
Hi, check out the Thunder project.

Use cases around image/video processing in spark

2016-08-10 Thread Deepak Sharma
Hi, does anyone use or know about a GitHub repo that can help me get started with image and video processing using Spark? The images/videos will be stored in S3, and I am planning to use S3 with Spark. In this case, how will Spark achieve distributed processing? Any code base or references are really appreciated.

Re: Get data from CSV files to feed SparkML library methods

2016-08-10 Thread Minudika Malshan
Thanks a lot, Yanbo! I will try it. Best regards.

Re: Get data from CSV files to feed SparkML library methods

2016-08-10 Thread Yanbo Liang
You can load a dataset from a CSV file and use VectorAssembler to assemble the necessary columns into a single column of vector type. The output column of VectorAssembler will be the features column, which should be fed into an ML estimator for model training. You can refer to the VectorAssembler documentation: http://s

Get data from CSV files to feed SparkML library methods

2016-08-10 Thread Minudika Malshan
Hi all, I'm using the Spark ML library and need to train a model using data extracted from a CSV file. I found that we can load datasets from LibSVM files into Spark ML methods. As far as I understood, the data should be represented as labeled points in order to feed the ML methods. Is there a way to load a CSV dataset in the same way?