Sorting within partitions is not maintained in parquet?

2016-08-10 Thread Jason Moore
Hi,

It seems that something changed between Spark 1.6.2 and 2.0.0 that I wasn't 
expecting.

If I have a DataFrame with records sorted within each partition, and I write it 
to parquet and read back from the parquet, previously the records would be 
iterated through in the same order they were written (assuming no shuffle has 
taken place).  But this doesn't seem to be the case anymore.  Below is the code 
to reproduce in a spark-shell.

Was this change expected?

Thanks,
Jason.


import org.apache.spark.sql._

// Returns true iff, within every partition of `self`, the key extracted by
// `mapping` is non-decreasing.
def isSorted[T](self: DataFrame, mapping: Row => T)(implicit ordering: Ordering[T]) = {
  import self.sqlContext.implicits._
  import ordering._
  self
    .mapPartitions { rows =>
      val isSorted = rows
        .map(mapping)
        .sliding(2) // all adjacent pairs
        .forall {
          case x :: y :: Nil => x <= y
          case x :: Nil => true
          case Nil => true
        }
      Iterator(isSorted)
    }
    .reduce(_ && _)
}

// in Spark 2.0.0
spark.range(10).toDF("id").registerTempTable("input")
spark.sql("SELECT id FROM input DISTRIBUTE BY id SORT BY 
id").write.mode("overwrite").parquet("input.parquet")
isSorted(spark.read.parquet("input.parquet"), _.getAs[Long]("id"))
// FALSE

// in Spark 1.6.2
sqlContext.range(10).toDF("id").registerTempTable("input")
sqlContext.sql("SELECT id FROM input DISTRIBUTE BY id SORT BY 
id").write.mode("overwrite").parquet("input.parquet")
isSorted(sqlContext.read.parquet("input.parquet"), _.getAs[Long]("id"))
// TRUE
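
A possible workaround (a sketch only, not verified against 2.0.0): re-establish
the order after reading with Dataset.sortWithinPartitions, which sorts each
partition locally without a shuffle.

// sketch: re-sort within partitions after the read, rather than relying
// on the order in which records come back from parquet
val df = spark.read.parquet("input.parquet").sortWithinPartitions("id")
isSorted(df, _.getAs[Long]("id"))
// expected: TRUE, since each partition is sorted locally by construction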



Re: Serving Spark ML models via a regular Python web app

2016-08-10 Thread Michael Allman
Nick,

Check out MLeap: https://github.com/TrueCar/mleap. It's not Python, but we use
it in production to serve a random forest model trained by a Spark ML pipeline.

Thanks,

Michael

> On Aug 10, 2016, at 7:50 PM, Nicholas Chammas wrote:
> 
> Are there any existing JIRAs covering the possibility of serving up Spark ML 
> models via, for example, a regular Python web app?
> 
> The story goes like this: You train your model with Spark on several TB of 
> data, and now you want to use it in a prediction service that you’re 
> building, say with Flask. In principle, you don’t 
> need Spark anymore since you’re just passing individual data points to your 
> model and looking for it to spit some prediction back.
> 
> I assume this is something people do today, right? I presume Spark needs to 
> run in their web service to serve up the model. (Sorry, I’m new to the ML 
> side of Spark. 😅)
> 
> Are there any JIRAs discussing potential improvements to this story? I did a 
> search, but I’m not sure what exactly to look for. SPARK-4587 (model 
> import/export) looks relevant, but doesn’t address the story directly.
> 
> Nick
> 



Serving Spark ML models via a regular Python web app

2016-08-10 Thread Nicholas Chammas
Are there any existing JIRAs covering the possibility of serving up Spark
ML models via, for example, a regular Python web app?

The story goes like this: You train your model with Spark on several TB of
data, and now you want to use it in a prediction service that you’re
building, say with Flask. In principle, you don’t
need Spark anymore since you’re just passing individual data points to your
model and looking for it to spit some prediction back.

I assume this is something people do today, right? I presume Spark needs to
run in their web service to serve up the model. (Sorry, I’m new to the ML
side of Spark. 😅)

Are there any JIRAs discussing potential improvements to this story? I did
a search, but I’m not sure what exactly to look for. SPARK-4587 (model
import/export) looks relevant, but doesn’t address the story directly.

Nick
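
For reference, the "run Spark inside the web service" approach described here
looks roughly like the sketch below (a minimal sketch only: the model path and
column names are hypothetical, and it assumes a pipeline saved earlier with
model.save):

import org.apache.spark.ml.PipelineModel
import org.apache.spark.sql.SparkSession

// A local SparkSession embedded in the serving process.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("model-serving")
  .getOrCreate()

// Load a pipeline previously saved with model.save(...). Path is made up.
val model = PipelineModel.load("/models/my-pipeline")

// Score one data point by wrapping it in a one-row DataFrame. The column
// names must match those the pipeline was trained on.
def predict(f1: Double, f2: Double): Double = {
  import spark.implicits._
  val df = Seq((f1, f2)).toDF("f1", "f2")
  model.transform(df).select("prediction").head().getDouble(0)
}

The per-request cost of building a DataFrame and running a Spark job is much
of the reason projects like MLeap (mentioned in the reply above) exist.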


Re: Use cases around image/video processing in spark

2016-08-10 Thread Benjamin Fradet
Hi,

Check out the thunder project.


On Wed, Aug 10, 2016 at 5:20 PM, Deepak Sharma wrote:

> Hi
> Is anyone using Spark for image and video processing, or does anyone know of
> a GitHub repo that could help me get started?
> The images/videos will be stored in S3, and I am planning to use S3 with
> Spark.
> In this case, how will Spark achieve distributed processing?
> Any code base or references would be much appreciated.
>
> --
> Thanks
> Deepak
>



-- 
Ben Fradet.


Use cases around image/video processing in spark

2016-08-10 Thread Deepak Sharma
Hi
Is anyone using Spark for image and video processing, or does anyone know of
a GitHub repo that could help me get started?
The images/videos will be stored in S3, and I am planning to use S3 with
Spark.
In this case, how will Spark achieve distributed processing?
Any code base or references would be much appreciated.

-- 
Thanks
Deepak
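
The distribution comes from splitting the file listing across tasks. A minimal
sketch of that pattern (the bucket name is hypothetical, and it assumes the s3a
filesystem is on the classpath and configured with credentials):

import javax.imageio.ImageIO
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("images").getOrCreate()
val sc = spark.sparkContext

// binaryFiles partitions the set of objects across the cluster; each task
// reads and decodes only its own subset of files.
val images = sc.binaryFiles("s3a://my-bucket/images/*.png")

val dims = images.map { case (path, pds) =>
  val in = pds.open() // stream the object from S3 on the executor
  try {
    val img = ImageIO.read(in)
    (path, img.getWidth, img.getHeight)
  } finally in.close()
}

dims.take(5).foreach(println)

Video generally needs a frame-extraction library available on each executor;
the same binaryFiles pattern applies, but the JDK does not cover the decoding.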


Re: Get data from CSV files to feed SparkML library methods

2016-08-10 Thread Minudika Malshan
Thanks a lot, Yanbo!
I will try it.

Best Regards.


On Wed, Aug 10, 2016 at 7:09 PM, Yanbo Liang wrote:

> You can load the dataset from a CSV file and use VectorAssembler to assemble
> the necessary columns into a single column of vector type. The output column
> of VectorAssembler will be the features column, which should be fed into the
> ML estimator for model training. You can refer to the VectorAssembler
> documentation:
> http://spark.apache.org/docs/latest/ml-features.html#vectorassembler
>
> Thanks
> Yanbo
>
> 2016-08-10 4:16 GMT-07:00 Minudika Malshan :
>
>> Hi all,
>>
>> I'm using the Spark ML library and need to train a model using data extracted
>> from a CSV file.
>> I found that we can load datasets from LibSVM files into Spark ML methods.
>> As far as I understand, the data should be represented as labeled points
>> in order to feed the ML methods.
>> Is there a way to load a dataset from a CSV file instead of a LibSVM file?
>> Or do I need to convert the CSV file to LibSVM format? If so, could you
>> please let me know a way to do that?
>> Your help would be much appreciated.
>>
>> Thank you!
>> Minudika
>>
>
>


Re: Get data from CSV files to feed SparkML library methods

2016-08-10 Thread Yanbo Liang
You can load the dataset from a CSV file and use VectorAssembler to assemble
the necessary columns into a single column of vector type. The output column
of VectorAssembler will be the features column, which should be fed into the
ML estimator for model training. You can refer to the VectorAssembler
documentation:
http://spark.apache.org/docs/latest/ml-features.html#vectorassembler
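
For concreteness, a minimal sketch along those lines in the Spark 2.0 shell
(the file path and column names are hypothetical, and LogisticRegression
stands in for whatever estimator is actually being trained):

import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.VectorAssembler

// Load the CSV with a header row, letting Spark infer the column types.
val raw = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("data/train.csv")

// Assemble the raw feature columns into a single vector-typed column.
val assembler = new VectorAssembler()
  .setInputCols(Array("col1", "col2", "col3"))
  .setOutputCol("features")
val assembled = assembler.transform(raw)

// Any ML estimator can now be fit against the features/label columns.
val lr = new LogisticRegression()
  .setFeaturesCol("features")
  .setLabelCol("label")
val model = lr.fit(assembled)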

Thanks
Yanbo

2016-08-10 4:16 GMT-07:00 Minudika Malshan:

> Hi all,
>
> I'm using the Spark ML library and need to train a model using data extracted
> from a CSV file.
> I found that we can load datasets from LibSVM files into Spark ML methods.
> As far as I understand, the data should be represented as labeled points
> in order to feed the ML methods.
> Is there a way to load a dataset from a CSV file instead of a LibSVM file?
> Or do I need to convert the CSV file to LibSVM format? If so, could you
> please let me know a way to do that?
> Your help would be much appreciated.
>
> Thank you!
> Minudika
>


Get data from CSV files to feed SparkML library methods

2016-08-10 Thread Minudika Malshan
Hi all,

I'm using the Spark ML library and need to train a model using data extracted
from a CSV file.
I found that we can load datasets from LibSVM files into Spark ML methods.
As far as I understand, the data should be represented as labeled points
in order to feed the ML methods.
Is there a way to load a dataset from a CSV file instead of a LibSVM file?
Or do I need to convert the CSV file to LibSVM format? If so, could you
please let me know a way to do that?
Your help would be much appreciated.

Thank you!
Minudika