In my case, the model is fairly small (100k training samples), though
the feature count is roughly 100k populated out of 10 million possible
features.

In this case it does not help me to distribute the training process,
since the data size is so small. I just need a good core solver to
train each model serially. On the other hand, I have to train millions
of such models independently, so there is plenty of opportunity for
load balancing.
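
A minimal sketch of what I mean (one model per task, trained with a
serial single-machine solver; `loadTrainingData` and `trainSerially`
are hypothetical placeholders for your own I/O and solver):

    import org.apache.spark.SparkContext
    import org.apache.spark.rdd.RDD

    // Hypothetical helpers: load one model's ~100k samples, and fit
    // them with any good serial solver.
    def loadTrainingData(id: Int): Seq[(Double, Map[Int, Double])] = ???
    def trainSerially(data: Seq[(Double, Map[Int, Double])]): Array[Double] = ???

    // Parallelize over model ids: each task trains one model serially,
    // so the cluster provides load balancing across models rather than
    // distributing any single model's training.
    def trainAll(sc: SparkContext, numModels: Int): RDD[(Int, Array[Double])] =
      sc.parallelize(0 until numModels, numSlices = 1000)
        .map(id => (id, trainSerially(loadTrainingData(id))))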

On Tue, Oct 11, 2016 at 3:09 AM, Nick Pentreath <nick.pentre...@gmail.com>
wrote:

> That's a good point about shuffle data compression. Still, it would be
> good to benchmark the ideas behind
> https://github.com/apache/spark/pull/12761, I think.
>
> For many datasets, even within one partition the gradient sums etc. can
> remain very sparse. For example, the Criteo DAC data is extremely
> sparse, with roughly 5% of features active per partition. However,
> you're correct that as the coefficients (and intermediate stats
> counters) get aggregated they will become more and more dense. But
> there is also the intermediate memory overhead of the dense structures,
> though that only comes into play in the range of hundreds of millions
> to billions of features.
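>
> As a sketch of the kind of structure I have in mind (hypothetical, not
> the actual encoding used in that PR): accumulate in a hash map while
> the sum is sparse, and densify once a density threshold is crossed.
>
>     import scala.collection.mutable
>
>     // Hypothetical gradient accumulator: stays sparse (a hash map)
>     // until the fraction of non-zeros crosses `threshold`, then
>     // switches to a dense array for cheap further updates.
>     class GradientAcc(dim: Int, threshold: Double = 0.1) {
>       private val sparse = mutable.LongMap.empty[Double]
>       private var dense: Array[Double] = null
>
>       def add(i: Int, v: Double): Unit = {
>         if (dense != null) dense(i) += v
>         else {
>           sparse(i) = sparse.getOrElse(i, 0.0) + v
>           if (sparse.size > threshold * dim) densify()
>         }
>       }
>
>       private def densify(): Unit = {
>         dense = new Array[Double](dim)
>         sparse.foreach { case (k, v) => dense(k.toInt) = v }
>         sparse.clear()
>       }
>     }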
>
> The situation in the PR above is actually different in that even the
> coefficient vector itself is truly sparse (through some encoding they
> did, IIRC). This is not an uncommon scenario, however: for
> high-dimensional features, users may want to use feature hashing, which
> can result in genuinely sparse coefficient vectors. With hashing, the
> feature dimension will often be chosen as a power of 2 that is higher
> (in some cases significantly higher) than the true feature dimension,
> to reduce collisions. So sparsity is critical here for storage
> efficiency.
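>
> For example, a minimal hashed-features setup (the column names and
> dimension here are placeholders):
>
>     import org.apache.spark.ml.feature.HashingTF
>
>     // Hash raw feature tokens into a 2^24-dimensional space; the power
>     // of 2 is chosen well above the true dimension to limit collisions.
>     val hasher = new HashingTF()
>       .setInputCol("tokens")
>       .setOutputCol("features")
>       .setNumFeatures(1 << 24)
>     val hashed = hasher.transform(tokenized)  // `tokenized`: a DataFrame
>                                               // with an array column "tokens"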
>
> Your result for the final stage does seem to indicate something can be
> improved. Perhaps it is due to the level of fetch parallelism, so more
> partitions may fetch more data in parallel? Because with just the
> default setting for `treeAggregate` I was seeing much faster times for
> the final stage with a 34 million feature dimension (though my final
> shuffle size seems to be 50% of yours with 2x the features; this is
> with Spark 2.0.1, I haven't tested out master yet with this data).
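>
> For reference, this is the kind of call I mean by the default setting
> (the gradient RDD and dimension are placeholders):
>
>     import org.apache.spark.rdd.RDD
>
>     // Sum per-record gradients with a two-level aggregation tree; a
>     // larger `depth` adds executor-side combine rounds before the
>     // final fetch to the driver.
>     def sumGradients(grads: RDD[Array[Double]], dim: Int): Array[Double] =
>       grads.treeAggregate(new Array[Double](dim))(
>         seqOp = (acc, g) => { var i = 0; while (i < dim) { acc(i) += g(i); i += 1 }; acc },
>         combOp = (a, b) => { var i = 0; while (i < dim) { a(i) += b(i); i += 1 }; a },
>         depth = 2)  // the default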
>
> [image: Screen Shot 2016-10-11 at 12.03.55 PM.png]
>
> On Fri, 7 Oct 2016 at 08:11 DB Tsai <dbt...@dbtsai.com> wrote:
>
>> Hi Nick,
>>
>> I'm also working on benchmarking linear models in Spark. :)
>>
>> One thing I saw is that for sparse features (14 million features) with
>> multi-depth aggregation, the final aggregation to the driver is
>> extremely slow. See the attachment. The amount of data exchanged
>> between executors is significantly larger than the amount collected
>> into the driver, yet collecting the data back to the driver takes 4
>> minutes while the aggregation between executors takes only 20 seconds.
>> It seems the code path is different, and I suspect there may be
>> something in Spark core that we can optimize.
>>
>> Regarding using a sparse data structure for aggregation, I'm not sure
>> how much it will improve performance. After computing the gradient sum
>> for all the data in one partition, the vector will no longer be very
>> sparse, and even if it is, after a couple of levels of aggregation it
>> will become very dense. Also, we perform compression in the shuffle
>> phase, so if there are many zeros, even a dense vector representation
>> should take around the same size as a sparse one. I could be wrong,
>> since I have never done a study on this, and I wonder how much
>> performance we can gain in practice by using sparse vectors for
>> aggregating the gradients.
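>>
>> One rough way to check the compression intuition (a sketch, not a
>> proper study; gzip here is just a stand-in for the shuffle codec):
>>
>>     import java.io.{ByteArrayOutputStream, ObjectOutputStream}
>>     import java.util.zip.GZIPOutputStream
>>     import org.apache.spark.ml.linalg.Vectors
>>
>>     // Serialize and gzip an object, returning the compressed size.
>>     def gzippedSize(v: AnyRef): Int = {
>>       val bytes = new ByteArrayOutputStream()
>>       val out = new ObjectOutputStream(new GZIPOutputStream(bytes))
>>       out.writeObject(v)
>>       out.close()
>>       bytes.size()
>>     }
>>
>>     // A 1M-dim vector with 5% non-zeros, in dense and sparse form.
>>     val dense  = Vectors.dense(Array.tabulate(1000000)(i => if (i % 20 == 0) 1.0 else 0.0))
>>     val sparse = dense.toSparse
>>     println(s"dense: ${gzippedSize(dense)}, sparse: ${gzippedSize(sparse)}")
>>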
>> Sincerely,
>>
>> DB Tsai
>> ----------------------------------------------------------
>> Web: https://www.dbtsai.com
>> PGP Key ID: 0xAF08DF8D
>>
>> On Thu, Oct 6, 2016 at 4:09 AM, Nick Pentreath <nick.pentre...@gmail.com>
>> wrote:
>>
>> > I'm currently working on various performance tests for large, sparse
>> > feature spaces.
>> >
>> > For the Criteo DAC data (45.8 million rows, 34.3 million features,
>> > categorical and extremely sparse), the time per iteration for
>> > ml.LogisticRegression is about 20-30s.
>>
>> > This is with 4 worker nodes, 48 cores & 120GB RAM each. I haven't
>> > yet tuned the tree aggregation depth, but the number of partitions
>> > can make a difference: generally fewer is better, since the cost is
>> > mostly communication of the gradient (the gradient computation is
>> > < 10% of the per-iteration time).
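>> >
>> > A minimal sketch of this kind of run (the path and parameter values
>> > are placeholders, not the actual benchmark setup):
>> >
>> >     import org.apache.spark.ml.classification.LogisticRegression
>> >
>> >     // Fewer partitions reduce the per-iteration aggregation cost when
>> >     // the gradient computation is cheap relative to communication.
>> >     val training = spark.read.format("libsvm")
>> >       .load("hdfs:///path/to/criteo-dac")  // placeholder path
>> >       .coalesce(8)                         // e.g. two partitions per worker
>> >
>> >     val lr = new LogisticRegression().setMaxIter(10).setRegParam(0.01)
>> >     val model = lr.fit(training)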
>>
>> > Note that the current impl forces dense arrays for intermediate data
>> > structures, increasing the communication cost significantly. See this
>> > PR for info: https://github.com/apache/spark/pull/12761. Once sparse
>> > data structures are supported for this, the linear models will be
>> > orders of magnitude more scalable for sparse data.
>> >
>> > On Wed, 5 Oct 2016 at 23:37 DB Tsai <dbt...@dbtsai.com> wrote:
>>
>> >> With the latest code in the current master, we're successfully
>> >> training LOR using Spark ML's implementation with 14M sparse
>> >> features. You need to tune the depth of aggregation to make it
>> >> efficient.
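>> >>
>> >> Concretely, something like this (a sketch against current master,
>> >> which exposes an aggregation-depth param on the estimator; the
>> >> values and the `training` DataFrame are placeholders):
>> >>
>> >>     import org.apache.spark.ml.classification.LogisticRegression
>> >>
>> >>     val lr = new LogisticRegression()
>> >>       .setMaxIter(100)
>> >>       .setAggregationDepth(3)  // default is 2; a deeper tree adds
>> >>                                // executor-side combine rounds before
>> >>                                // the final fetch to the driver
>> >>     val model = lr.fit(training)  // label/features DataFrame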
>> >>
>> >> Sincerely,
>> >>
>> >> DB Tsai
>> >> ----------------------------------------------------------
>> >> Web: https://www.dbtsai.com
>> >> PGP Key ID: 0x9DCC1DBD7FC7BBB2
>> >>
>> >> On Wed, Oct 5, 2016 at 12:00 PM, Yang <teddyyyy...@gmail.com> wrote:
>>
>> >> > Has anybody had actual experience applying it to real problems of
>> >> > this scale?
>> >> >
>> >> > Thanks
