Fwd: [MLlib] Performance problem in GeneralizedLinearAlgorithm

2015-02-17 Thread Josh Devins
Cross-posting as I got no response on the users mailing list last
week. Any response would be appreciated :)

Josh


-- Forwarded message --
From: Josh Devins 
Date: 9 February 2015 at 15:59
Subject: [MLlib] Performance problem in GeneralizedLinearAlgorithm
To: "u...@spark.apache.org" 


I've been looking into a performance problem when using
LogisticRegressionWithLBFGS (and in turn GeneralizedLinearAlgorithm).
Here's an outline of what I've figured out so far. It would be great
to get some confirmation of the problem, some input on how widespread
it might be, and any ideas on a nice way to fix it.

Context:
- I will reference `branch-1.1` as we are currently on v1.1.1; however,
this still appears to be a problem on `master`
- The cluster is run on YARN, on bare-metal hardware (no VMs)
- I've not filed a Jira issue yet but can do so
- This problem affects all algorithms based on
GeneralizedLinearAlgorithm (GLA) that use feature scaling, e.g.
LogisticRegressionWithLBFGS (and, to a lesser extent, those that don't)

Problem Outline:
- Starting at GLA line 177
(https://github.com/apache/spark/blob/branch-1.1/mllib/src/main/scala/org/apache/spark/mllib/regression/GeneralizedLinearAlgorithm.scala#L177),
a feature scaler is created using the `input` RDD
- Refer next to line 186 which then maps over the `input` RDD and
produces a new `data` RDD
(https://github.com/apache/spark/blob/branch-1.1/mllib/src/main/scala/org/apache/spark/mllib/regression/GeneralizedLinearAlgorithm.scala#L186)
- If you are using feature scaling or adding intercepts, the user's
`input` RDD is mapped over *after* the user has (hopefully) persisted
it and *before* it goes into the (iterative) optimizer on line
204 
(https://github.com/apache/spark/blob/branch-1.1/mllib/src/main/scala/org/apache/spark/mllib/regression/GeneralizedLinearAlgorithm.scala#L204)
- Since the RDD `data` that is iterated over in the optimizer is
unpersisted, when we are running the cost function in the optimizer
(e.g. LBFGS -- 
https://github.com/apache/spark/blob/branch-1.1/mllib/src/main/scala/org/apache/spark/mllib/optimization/LBFGS.scala#L198),
the map phase will actually first go back and rerun the feature
scaling (map tasks on `input`) and then map with the cost function
(two maps pipelined into one stage)
- As a result, parts of the StandardScaler will actually be run again
(perhaps only because the variable is `lazy`?) and this can be costly,
see line 84 
(https://github.com/apache/spark/blob/branch-1.1/mllib/src/main/scala/org/apache/spark/mllib/feature/StandardScaler.scala#L84)
- For small datasets and/or few iterations this is not really a
problem; however, we found that by adding a `data.persist()` right
before running the optimizer (see the sketch below), the map
iterations in the optimizer went from 5:30 down to 0:45
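
To make the workaround concrete, here is a minimal sketch (not the
actual GLA code; `scale` and `optimize` below are placeholders standing
in for the StandardScalerModel transform and the iterative LBFGS loop):

    import org.apache.spark.mllib.linalg.Vector
    import org.apache.spark.rdd.RDD
    import org.apache.spark.storage.StorageLevel

    // `scale` stands in for StandardScalerModel.transform, `optimize` for the
    // iterative optimizer; both are placeholders, not Spark APIs.
    def prepareAndOptimize(
        input: RDD[(Double, Vector)],
        scale: Vector => Vector,
        optimize: RDD[(Double, Vector)] => Vector): Vector = {
      val data = input.map { case (label, features) => (label, scale(features)) }
      // The one-line fix discussed above: without this, every optimizer
      // iteration re-runs the scaling map on `input`.
      data.persist(StorageLevel.MEMORY_AND_DISK)
      try optimize(data) finally data.unpersist()
    }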

I had a very tough time coming up with a nice way to describe my
debugging sessions in an email so I hope this gets the main points
across. Happy to clarify anything if necessary (also by live
debugging/Skype/phone if that's helpful).

Thanks,

Josh




Re: [MLlib] Performance problem in GeneralizedLinearAlgorithm

2015-02-17 Thread Evan R. Sparks
Josh - thanks for the detailed write up - this seems a little funny to me.
I agree that with the current code path more work is being done than
needs to be (e.g. the features are re-scaled at every iteration), but the
relatively costly process of fitting the StandardScaler should not be
re-done at each iteration. Instead, at each iteration, all points are
re-scaled according to the pre-computed standard deviations in the
StandardScalerModel, and then an intercept is appended.

Just to be clear - you're currently calling .persist() before you pass data
to LogisticRegressionWithLBFGS?
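
For reference, the pattern in question looks roughly like the following
minimal sketch against the 1.1-era public API (iteration count and
storage level are arbitrary):

    import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS
    import org.apache.spark.mllib.regression.LabeledPoint
    import org.apache.spark.rdd.RDD
    import org.apache.spark.storage.StorageLevel

    def train(points: RDD[LabeledPoint]) = {
      // User-side persist of the input, before handing it to MLlib.
      points.persist(StorageLevel.MEMORY_AND_DISK)
      val lr = new LogisticRegressionWithLBFGS().setIntercept(true)
      lr.optimizer.setNumIterations(100)
      // Feature scaling still happens inside run(), on a derived, unpersisted RDD.
      lr.run(points)
    }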

Also - can you give some parameters about the problem/cluster size you're
solving this on? How much memory per node? How big are n and d, what is
the data's sparsity (if any), and how many iterations are you running for?
Is 0:45 the per-iteration time or the total time for some number of iterations?

A useful test might be to call GeneralizedLinearAlgorithm with useFeatureScaling
set to false (and maybe also addIntercept set to false) on persisted data,
and see if you see the same performance wins. If that's the case, we've
isolated the issue and can start profiling to see where all the time is
going.

It would be great if you can open a JIRA.

Thanks!





Re: [MLlib] Performance problem in GeneralizedLinearAlgorithm

2015-02-17 Thread Peter Rudenko

It was fixed today: https://github.com/apache/spark/pull/4593

Thanks,
Peter Rudenko



Re: org.apache.spark.sql.sources.DDLException: Unsupported dataType: [1.1] failure: ``varchar'' expected but identifier char found in spark-sql

2015-02-17 Thread Yin Huai
Hi Qiuzhuang,

Right now, char is not supported in DDL. Can you try varchar or string?
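
For example, something along these lines should parse (a minimal
sketch; the table and column names here are made up for illustration):

    import org.apache.spark.sql.hive.HiveContext

    // Hypothetical table: declare the id column as string (or varchar) rather
    // than char(32), which the DDL parser currently rejects.
    def createTable(hc: HiveContext): Unit = {
      hc.sql(
        """CREATE TABLE IF NOT EXISTS assistants_import (
          |  id string,
          |  assistant_no varchar(20)
          |)""".stripMargin)
    }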

Thanks,

Yin

On Mon, Feb 16, 2015 at 10:39 PM, Qiuzhuang Lian 
wrote:

> Hi,
>
> I am not sure whether this has been reported already, but I ran into this
> error in the spark-sql shell, built from the newest Spark git trunk:
>
> spark-sql> describe qiuzhuang_hcatlog_import;
> 15/02/17 14:38:36 ERROR SparkSQLDriver: Failed in [describe
> qiuzhuang_hcatlog_import]
> org.apache.spark.sql.sources.DDLException: Unsupported dataType: [1.1]
> failure: ``varchar'' expected but identifier char found
>
> char(32)
> ^
> at org.apache.spark.sql.sources.DDLParser.parseType(ddl.scala:52)
> at
>
> org.apache.spark.sql.hive.MetastoreRelation$SchemaAttribute.toAttribute(HiveMetastoreCatalog.scala:664)
> at
>
> org.apache.spark.sql.hive.MetastoreRelation$$anonfun$23.apply(HiveMetastoreCatalog.scala:674)
> at
>
> org.apache.spark.sql.hive.MetastoreRelation$$anonfun$23.apply(HiveMetastoreCatalog.scala:674)
> at
>
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
> at
>
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
> at scala.collection.Iterator$class.foreach(Iterator.scala:727)
> at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
> at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
> at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
> at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
> at scala.collection.AbstractTraversable.map(Traversable.scala:105)
> at
>
> org.apache.spark.sql.hive.MetastoreRelation.<init>(HiveMetastoreCatalog.scala:674)
> at
>
> org.apache.spark.sql.hive.HiveMetastoreCatalog.lookupRelation(HiveMetastoreCatalog.scala:185)
> at org.apache.spark.sql.hive.HiveContext$$anon$2.org
>
> $apache$spark$sql$catalyst$analysis$OverrideCatalog$$super$lookupRelation(HiveContext.scala:234)
>
> In the Hive 0.13.1 console, this command works:
>
> hive> describe qiuzhuang_hcatlog_import;
> OK
> id              char(32)
> assistant_no    varchar(20)
> assistant_name  varchar(32)
> assistant_type  int
> grade           int
> shop_no         varchar(20)
> shop_name       varchar(64)
> organ_no        varchar(20)
> organ_name      varchar(20)
> entry_date      string
> education       int
> commission      decimal(8,2)
> tel             varchar(20)
> address         varchar(100)
> identity_card   varchar(25)
> sex             int
> birthday        string
> employee_type   int
> status          int
> remark          varchar(255)
> create_user_no  varchar(20)
> create_user     varchar(32)
> create_time     string
> update_user_no  varchar(20)
> update_user     varchar(32)
> update_time     string
> Time taken: 0.49 seconds, Fetched: 26 row(s)
> hive>
>
>
> Regards,
> Qiuzhuang
>


Re: [ml] Lost persistence for fold in crossvalidation.

2015-02-17 Thread Xiangrui Meng
There are three different regParams defined in the grid and there are
three folds. For simplicity, we don't split the dataset into three
parts and reuse them; instead, we do the split for each fold, so we
end up caching 3*3 times. Note that the pipeline API is not yet
optimized for performance. It would be nice to optimize its
performance in 1.4.
-Xiangrui
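
As a rough illustration of the caching pattern described above (a
sketch only, using MLUtils.kFold rather than the actual CrossValidator
internals):

    import org.apache.spark.mllib.regression.LabeledPoint
    import org.apache.spark.mllib.util.MLUtils
    import org.apache.spark.rdd.RDD

    // If each fold's training split were cached once and reused across the grid,
    // the data would be cached numFolds times instead of numFolds * numParams.
    def cachePerFold(data: RDD[LabeledPoint], regParams: Seq[Double]): Unit = {
      MLUtils.kFold(data, numFolds = 3, seed = 42).foreach {
        case (training, validation) =>
          training.cache()               // cached once per fold...
          regParams.foreach { rp =>
            // ...and reused for every regParam in the grid (fit a model here)
          }
          training.unpersist()
      }
    }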

On Wed, Feb 11, 2015 at 11:13 AM, Peter Rudenko  wrote:
> Hi, I have a problem using Spark 1.2 with Pipeline + GridSearch +
> LogisticRegression. I’ve reimplemented the LogisticRegression.fit method and
> commented out instances.unpersist():
>
> override def fit(dataset: SchemaRDD, paramMap: ParamMap): LogisticRegressionModel = {
>   println(s"Fitting dataset ${dataset.take(1000).toSeq.hashCode()} with ParamMap $paramMap.")
>   transformSchema(dataset.schema, paramMap, logging = true)
>   import dataset.sqlContext._
>   val map = this.paramMap ++ paramMap
>   val instances = dataset.select(map(labelCol).attr, map(featuresCol).attr)
>     .map { case Row(label: Double, features: Vector) =>
>       LabeledPoint(label, features)
>     }
>
>   if (instances.getStorageLevel == StorageLevel.NONE) {
>     println("Instances not persisted")
>     instances.persist(StorageLevel.MEMORY_AND_DISK)
>   }
>
>   val lr = (new LogisticRegressionWithLBFGS)
>     .setValidateData(false)
>     .setIntercept(true)
>   lr.optimizer
>     .setRegParam(map(regParam))
>     .setNumIterations(map(maxIter))
>   val lrm = new LogisticRegressionModel(this, map, lr.run(instances).weights)
>   // instances.unpersist()
>   // copy model params
>   Params.inheritValues(map, this, lrm)
>   lrm
> }
>
> CrossValidator feeds the same SchemaRDD for each parameter (same hash code),
> but somewhere the cache is being flushed. There is enough memory. Here’s the output:
>
> Fitting dataset 2051470010 with ParamMap {
>   DRLogisticRegression-f35ae4d3-regParam: 0.1
> }.
> Instances not persisted
> Fitting dataset 2051470010 with ParamMap {
>   DRLogisticRegression-f35ae4d3-regParam: 0.01
> }.
> Instances not persisted
> Fitting dataset 2051470010 with ParamMap {
>   DRLogisticRegression-f35ae4d3-regParam: 0.001
> }.
> Instances not persisted
> Fitting dataset 802615223 with ParamMap {
>   DRLogisticRegression-f35ae4d3-regParam: 0.1
> }.
> Instances not persisted
> Fitting dataset 802615223 with ParamMap {
>   DRLogisticRegression-f35ae4d3-regParam: 0.01
> }.
> Instances not persisted
>
> I have 3 parameters in GridSearch and 3 folds for CrossValidation:
>
> val paramGrid = new ParamGridBuilder()
>   .addGrid(model.regParam, Array(0.1, 0.01, 0.001))
>   .build()
>
> crossval.setEstimatorParamMaps(paramGrid)
> crossval.setNumFolds(3)
>
> I assume that the data should be read and cached 3 times ((1 to
> numFolds).combinations(2)) and that this should be independent of the number
> of parameters. But the data ends up being read and cached 9 times.
>
> Thanks,
> Peter Rudenko
>
>




Re: mllib.recommendation Design

2015-02-17 Thread Xiangrui Meng
The current ALS implementation allows pluggable solvers for the
NormalEquation, where we provide a CholeskySolver and an NNLS solver. Please
check the current implementation and let us know how your constraint
solver would fit. For a general matrix factorization package, let's
make a JIRA and move our discussion there. -Xiangrui
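
The shape of that plug-in point is roughly the following (a loose
sketch only; the real trait in ml.recommendation.ALS is package-private
and its names differ):

    // Loose sketch: a solver receives the accumulated normal equations
    // A^T A x = A^T b for one block and returns the factor vector.
    trait NormalEquationSolverSketch {
      def solve(ata: Array[Double], atb: Array[Double], regParam: Double): Array[Float]
    }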

On Fri, Feb 13, 2015 at 7:46 AM, Debasish Das  wrote:
> Hi,
>
> I am a bit confused about the MLlib design on master. I thought that core
> algorithms would stay in mllib and ml would define the pipelines over the
> core algorithms, but it looks like in master ALS has moved from mllib to ml...
>
> I am refactoring my PR into a factorization package and I want to build it on
> top of ml.recommendation.ALS (possibly extending ml.recommendation.ALS,
> since the first version will use very similar RDD handling to ALS and a
> proximal solver that's being added to Breeze)
>
> https://issues.apache.org/jira/browse/SPARK-2426
> https://github.com/scalanlp/breeze/pull/321
>
> Basically I am not sure if we should merge it with recommendation.ALS since
> this is more generic than recommendation. I am considering calling it
> ConstrainedALS where user can specify different constraint for user and
> product factors (Similar to GraphLab CF structure).
>
> I am also working on ConstrainedALM where the underlying algorithm is no
> longer ALS but nonlinear alternating minimization with constraints.
> https://github.com/scalanlp/breeze/pull/364
> This will let us do large rank matrix completion where there is no need to
> construct gram matrices. I will open up the JIRA soon after getting initial
> results
>
> I am a bit confused about where I should add the factorization package. It
> will use the current ALS test-cases, and I will have to construct more
> test-cases for sparse coding and PLSA formulations.
>
> Thanks.
> Deb




Re: Batch prediction for ALS

2015-02-17 Thread Xiangrui Meng
It may be too late to merge it into 1.3. I'm going to make another
pass on your PR today. -Xiangrui

On Tue, Feb 10, 2015 at 8:01 AM, Debasish Das  wrote:
> Hi,
>
> Will it be possible to merge this PR into 1.3?
>
> https://github.com/apache/spark/pull/3098
>
> The batch prediction API in ALS will be useful for those of us who want to
> cross-validate on prec@k and MAP...
>
> Thanks.
> Deb




Re: Batch prediction for ALS

2015-02-17 Thread Debasish Das
It will really help us if we merge it, but I guess it has already diverged
from the new ALS... I will also take a look at it again and try to update it
against the new ALS...



Re: mllib.recommendation Design

2015-02-17 Thread Debasish Das
There is a usability difference... I am not sure whether recommendation.ALS
would want to add both userConstraint and productConstraint? GraphLab CF, for
example, has them, and we are ready to support all of the features for modest
ranks where Gram matrices can be formed...

For large ranks I am still working on the code
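
To make the userConstraint/productConstraint idea concrete, a purely
hypothetical sketch (none of these names exist in Spark) could look
like:

    // Hypothetical API sketch: independent constraints for the user and
    // product factor matrices, along the lines of GraphLab CF.
    sealed trait FactorConstraint
    case object Unconstrained extends FactorConstraint
    case object NonNegative extends FactorConstraint
    case class L1Regularized(lambda: Double) extends FactorConstraint

    case class ConstrainedALSParams(
        rank: Int,
        userConstraint: FactorConstraint,
        productConstraint: FactorConstraint)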



Re: Replacing Jetty with TomCat

2015-02-17 Thread Niranda Perera
Hi Sean,
The main issue we have is running two web servers in a single product; we
think it would not be an elegant solution.

Could you please point me to the main areas where the Jetty server is tightly
coupled, or the extension points where I could plug in Tomcat instead of Jetty?
If successful, I could contribute it to the Spark project. :-)

cheers



On Mon, Feb 16, 2015 at 4:51 PM, Sean Owen  wrote:

> There's no particular reason you have to remove the embedded Jetty
> server, right? It doesn't prevent you from using it inside another app
> that happens to run in Tomcat. You won't be able to switch it out
> without rewriting a fair bit of code, no, but you don't need to.
>
> On Mon, Feb 16, 2015 at 5:08 AM, Niranda Perera
>  wrote:
> > Hi,
> >
> > We are thinking of integrating Spark server inside a product. Our current
> > product uses Tomcat as its webserver.
> >
> > Is it possible to switch the Jetty webserver in Spark to Tomcat
> > off-the-shelf?
> >
> > Cheers
> >
> > --
> > Niranda
>



-- 
Niranda


Re: Replacing Jetty with TomCat

2015-02-17 Thread Patrick Wendell
Hey Niranda,

It seems to me like a lot of effort to support multiple libraries inside of
Spark like this, so I'm not sure that's a great solution.

If you are building an application that embeds Spark, is it not
possible for you to continue to use Jetty for Spark's internal servers
and use Tomcat for your own servers? I would guess that many complex
applications end up embedding multiple server libraries in various
places (Spark itself has different transport mechanisms, etc.).

- Patrick




Re: Replacing Jetty with TomCat

2015-02-17 Thread Corey Nolet
Niranda,

I'm not sure I'd say Spark's use of Jetty to expose its UI monitoring
layer constitutes "two web servers in a single product". Hadoop uses Jetty,
as do many other applications today that need embedded HTTP layers for
serving up their monitoring UIs to users. This is completely aside from any
web container an application developer would use to interact with Spark and
Hadoop and serve domain-specific content to users. The two are disjoint.

Many applications use Thrift as a means of establishing socket connections
between clients and across servers. One alternative to Thrift is Protobuf.
You wouldn't say "I want to swap out thrift for protobuf in Cassandra
because I want to use protobuf in my application and there shouldn't be two
different socket layer abstractions on my cluster."

I could understand wanting to do this if you were being forced to deploy a
war file to a web container in order to do the monitoring, but Spark's UI is
embedded within the code. If you are worried about having the Jetty
libraries on your classpath, you can exclude the Jetty dependencies from
your servlet code if you want to interact with a SparkContext in Tomcat.
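
For instance, a dependency exclusion along these lines is one way to
keep Jetty off the servlet classpath (sbt syntax; the artifact names and
version are approximate, and you would likely also want to disable the
Spark UI via spark.ui.enabled=false so the driver never tries to start
Jetty):

    // build.sbt sketch (assumed coordinates): depend on spark-core but exclude
    // its Jetty artifacts when the app already runs inside Tomcat.
    libraryDependencies += ("org.apache.spark" %% "spark-core" % "1.2.1")
      .exclude("org.eclipse.jetty", "jetty-server")
      .exclude("org.eclipse.jetty", "jetty-util")
      .exclude("org.eclipse.jetty", "jetty-servlet")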





JavaRDD Aggregate initial value - Closure-serialized zero value reasoning?

2015-02-17 Thread Matt Cheah
Hi everyone,

I was using JavaPairRDD's combineByKey() to compute all of my aggregations
before, since I assumed that every aggregation required a key. However, I
realized I could do my analysis using JavaRDD's aggregate() instead and not
use a key.

I have set spark.serializer to use Kryo. As a result, JavaPairRDD's combineByKey
requires that a "createCombiner" function is provided, and the return value
from that function must be serializable using Kryo. When I switched to using
rdd.aggregate I assumed that the zero value would also be strictly Kryo
serialized, as it is a data item and not part of a closure or the
aggregation functions. However, I got a serialization exception because the
closure serializer (the only valid serializer is the Java serializer) was used
instead.

I was wondering the following:
1. What is the rationale for making the zero value be serialized using the
closure serializer? It isn't part of the closure, but is an initial data
item.
2. Would it make sense for us to perhaps write a version of rdd.aggregate()
that takes a function as a parameter that generates the zero value (a rough
sketch follows below)? It would be more intuitive for that function to be
serialized using the closure serializer.
I believe aggregateByKey is also affected.
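
A minimal sketch of what point 2 could look like (a hypothetical helper
written against the public RDD API, not an existing Spark method):

    import scala.reflect.ClassTag
    import org.apache.spark.rdd.RDD

    // The zero value is produced by a factory inside the task closure, so only
    // the factory function (not the zero object itself) goes through the
    // closure serializer. Unlike aggregate(), this fails on an RDD with no
    // partitions.
    def aggregateWithZeroFactory[T, U: ClassTag](rdd: RDD[T])(zero: () => U)(
        seqOp: (U, T) => U, combOp: (U, U) => U): U =
      rdd.mapPartitions { iter =>
        Iterator(iter.foldLeft(zero())(seqOp))
      }.reduce(combOp)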

Thanks,

-Matt Cheah



