Hi Tsai,

Thank you for pointing out the implementation details that I missed.
Yes, I saw several JIRA issues about the intercept, regularization, and
standardization; I just didn't realize they made such a big impact.
Thanks again.

2015-10-13 4:32 GMT+08:00 DB Tsai <dbt...@dbtsai.com>:
> Hi Liu,
>
> In ML, even after the data is extracted into an RDD, the MLlib and ML
> implementations are quite different. Due to legacy design, MLlib uses an
> Updater to handle regularization, and this layer of abstraction also does
> adaptive step sizing, which only applies to SGD. To get it working with
> LBFGS, some hacks were added here and there. In Updater, all the
> components, including the intercept, are regularized, which is not
> desirable in many cases. Also, in the legacy design it's hard for us to do
> in-place standardization to improve the convergence rate. As a result, at
> some point we decided to ditch those abstractions and customize them for
> each algorithm. (Even LiR and LoR use different tricks to get better
> numerical optimization performance, so it was hard to share code at that
> time. But I can see the point that we have working code now, so it's time
> to try to refactor that code to share more.)
>
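To make the intercept point above concrete, here is a rough sketch. It is
not the MLlib or ML code: the function name fit_ridge_1d, the toy data, and
the choice of one-feature least squares with an L2 penalty are all
assumptions made for illustration. It solves the penalized problem in closed
form, once with the intercept left out of the penalty and once with it
included.

```python
def fit_ridge_1d(xs, ys, lam, penalize_intercept):
    """Fit y ~ w*x + b by minimizing
    (1/2n) * sum((w*x + b - y)^2) + (lam/2) * (w^2 + c*b^2),
    where c is 1 if the intercept is penalized and 0 otherwise.
    Hypothetical helper for illustration only."""
    n = len(xs)
    sx = sum(xs) / n                                # mean of x
    sy = sum(ys) / n                                # mean of y
    sxx = sum(x * x for x in xs) / n                # mean of x^2
    sxy = sum(x * y for x, y in zip(xs, ys)) / n    # mean of x*y
    c = 1.0 if penalize_intercept else 0.0

    # Setting the gradient to zero gives a 2x2 linear system:
    #   (sxx + lam) * w + sx * b            = sxy
    #   sx * w          + (1 + lam * c) * b = sy
    a11 = sxx + lam
    a22 = 1.0 + lam * c
    det = a11 * a22 - sx * sx

    # Solve with Cramer's rule.
    w = (sxy * a22 - sx * sy) / det
    b = (a11 * sy - sx * sxy) / det
    return w, b


# Toy data generated from y = 2*x + 10, so the true intercept is far from 0.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [2.0 * x + 10.0 for x in xs]

w_free, b_free = fit_ridge_1d(xs, ys, lam=1.0, penalize_intercept=False)
w_pen, b_pen = fit_ridge_1d(xs, ys, lam=1.0, penalize_intercept=True)

print("intercept not penalized: w=%.3f b=%.3f" % (w_free, b_free))
print("intercept penalized:     w=%.3f b=%.3f" % (w_pen, b_pen))
```

On this toy data, penalizing the intercept drags b toward zero and pushes
the slope away from its true value to compensate, which is the kind of
distortion that makes regularizing every component undesirable when the
data is not centered.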
>
> Sincerely,
>
> DB Tsai
> ----------------------------------------------------------
> Blog: https://www.dbtsai.com
> PGP Key ID: 0xAF08DF8D
>
> On Mon, Oct 12, 2015 at 1:24 AM, YiZhi Liu <javeli...@gmail.com> wrote:
>>
>> Hi Joseph,
>>
>> Thank you for clarifying the motivation for setting up a different API
>> for ml pipelines; it sounds great. But I still think we could extract
>> some common parts of the training & inference procedures for ml and
>> mllib. In ml.classification.LogisticRegression, you simply transform
>> the DataFrame into an RDD and follow the same procedures as in
>> mllib.optimization.{LBFGS,OWLQN}, right?
>>
>> My suggestion, if I may, is that the ml package should focus on the
>> public API and leave the underlying implementations, e.g. numerical
>> optimization, to the mllib package.
>>
>> Please let me know if I have misunderstood anything. Thank you!
>>
>> 2015-10-08 1:15 GMT+08:00 Joseph Bradley <jos...@databricks.com>:
>> > Hi YiZhi Liu,
>> >
>> > The spark.ml classes are part of the higher-level "Pipelines" API, which
>> > works with DataFrames.  When creating this API, we decided to
>> > separate it from the old API to avoid confusion.  You can read more
>> > about it here:
>> > http://spark.apache.org/docs/latest/ml-guide.html
>> >
>> > For (3): We use Breeze, but we have to modify it in order to do
>> > distributed optimization based on Spark.
>> >
>> > Joseph
>> >
>> > On Tue, Oct 6, 2015 at 11:47 PM, YiZhi Liu <javeli...@gmail.com> wrote:
>> >>
>> >> Hi everyone,
>> >>
>> >> I'm curious about the difference between
>> >> ml.classification.LogisticRegression and
>> >> mllib.classification.LogisticRegressionWithLBFGS. Both of them are
>> >> optimized using LBFGS; the only difference I see is that
>> >> LogisticRegression takes a DataFrame while
>> >> LogisticRegressionWithLBFGS takes an RDD.
>> >>
>> >> So I wonder,
>> >> 1. Why not simply add a DataFrame training interface to
>> >> LogisticRegressionWithLBFGS?
>> >> 2. What's the difference between the ml.classification and
>> >> mllib.classification packages?
>> >> 3. Why doesn't ml.classification.LogisticRegression call
>> >> mllib.optimization.LBFGS / mllib.optimization.OWLQN directly? Instead,
>> >> it uses breeze.optimize.LBFGS and re-implements most of the procedures
>> >> in mllib.optimization.{LBFGS,OWLQN}.
>> >>
>> >> Thank you.
>> >>
>> >> Best,
>> >>
>> >> --
>> >> Yizhi Liu
>> >> Senior Software Engineer / Data Mining
>> >> www.mvad.com, Shanghai, China
>> >>
>> >> ---------------------------------------------------------------------
>> >> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> >> For additional commands, e-mail: user-h...@spark.apache.org
>> >>
>> >
>>
>>
>



-- 
Yizhi Liu
Senior Software Engineer / Data Mining
www.mvad.com, Shanghai, China

