Hi,

Just one simple question: Is the org.apache.mahout.math.BinarySearch.binarySearch() function an optimized version of Arrays.binarySearch()? If it is not, why implement it again?
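
For reference, a range-limited binary search over a primitive array -- one
common reason projects keep their own helper -- would look roughly like this
(illustrative sketch only, not Mahout's code):

    // Illustrative sketch only, not Mahout's implementation: a binary search
    // restricted to the slice [from, to) of a sorted long[] array.
    static int rangeBinarySearch(long[] a, int from, int to, long key) {
      int lo = from, hi = to - 1;
      while (lo <= hi) {
        int mid = (lo + hi) >>> 1;
        if (a[mid] < key) {
          lo = mid + 1;
        } else if (a[mid] > key) {
          hi = mid - 1;
        } else {
          return mid;            // found
        }
      }
      return -(lo + 1);          // not found: encoded insertion point,
    }                            // same convention as Arrays.binarySearch()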

Yours Peng

On 13-07-17 06:31 PM, Sebastian Schelter wrote:
You are completely right, the simple interface would only be usable for
readonly / batch-updatable recommenders. Online recommenders might need
something different. I tried to widen the discussion here to discuss all
kinds of API changes in the recommenders that would be necessary in the
future.



2013/7/17 Peng Cheng <pc...@uowmail.edu.au>

One thing that suddenly comes to my mind is that, for a simple interface
like FactorizablePreferences, sequential READ in real time may be possible,
but sequential WRITE in O(1) time is utopian. You need to flush out the old
preference with the same user and item ID (in the worst case this requires
a search, e.g. an interpolation search); otherwise you are permitting a
user to rate an item twice with different values. Considering how
FileDataModel is supposed to work (new files flush old files), the simple
interface may have fewer advantages than we used to believe.
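
Roughly what I mean for the write path, assuming preferences are kept in
parallel arrays sorted by (userID, itemID) -- all names below are made up
for illustration:

    // Hypothetical sketch: an update must first find and overwrite any
    // existing entry for the same (user, item), so a write costs a search
    // (plus an insertion shift for new entries) rather than O(1).
    // values, indexOf and insertEntry are made-up names.
    void setPreference(long userID, long itemID, float value) {
      int pos = indexOf(userID, itemID);       // binary/interpolation search
      if (pos >= 0) {
        values[pos] = value;                   // flush out the old preference
      } else {
        int insertAt = -(pos + 1);
        insertEntry(insertAt, userID, itemID, value);  // shifts array tails, O(n)
      }
    }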


On 13-07-17 04:58 PM, Sebastian Schelter wrote:

Hi Peng,

I never wanted to discard the old interface, I just wanted to split it up.
I want to have a simple interface that only supports sequential access (and
allows for very memory-efficient implementations, e.g. by the use of
primitive arrays). DataModel should *extend* this interface and provide
sequential and random access (basically what it already does).
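
Something like the following is how I picture the split -- just a sketch,
the names and signatures are not meant to be final:

    // Sketch only, not the actual API. A minimal, sequential-access-only
    // view of the preference data:
    interface SequentialPreferences {
      long getNumUsers();
      long getNumItems();
      long getNumPreferences();
      Iterable<Preference> getPreferences();   // one pass over all triples
    }

    // The existing DataModel keeps its random-access methods on top of it:
    interface DataModel extends SequentialPreferences {
      PreferenceArray getPreferencesFromUser(long userID) throws TasteException;
      PreferenceArray getPreferencesForItem(long itemID) throws TasteException;
      Float getPreferenceValue(long userID, long itemID) throws TasteException;
    }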

Then a recommender such as SGD could state that it only needs sequential
access to the preferences, and you can either feed it a DataModel (so we
don't break backwards compatibility) or a memory-efficient sequential
access thingy.
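
In code that would look roughly like this (again just a sketch):

    // Sketch: the factorizer only asks for the narrow interface, so both a
    // full DataModel and a slim array-backed implementation can be passed in.
    class SGDFactorizer {
      void factorize(SequentialPreferences prefs) {
        for (Preference p : prefs.getPreferences()) {
          update(p.getUserID(), p.getItemID(), p.getValue());  // one SGD step
        }
      }
      private void update(long userID, long itemID, float value) { /* ... */ }
    }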

Does that make sense for you?


2013/7/17 Peng Cheng <pc...@uowmail.edu.au>

I see, OK, so we shouldn't use the old implementation. But I mean, the old
interface doesn't have to be discarded. The discrepancy between your
FactorizablePreferences and DataModel is that your model supports
getPreferences(), which returns all preferences as an iterator, while
DataModel supports a few old functions that return preferences for an
individual user or item.

My point is that it is not hard for each of them to implement what they
lack: the old DataModel can implement getPreferences() with just a loop in
an abstract class, and your new FactorizablePreferences can implement the
old functions with a binary search that takes O(log n) time, or an
interpolation search that takes O(log log n) time on average. The same goes
for the online update. It would just be a matter of different speed and
space, not a different interface standard, and we could keep the old unit
tests, old examples, old everything. We would also be more flexible in
writing ensemble recommenders.
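
To make that concrete, here is the kind of adapter I have in mind
(fragments only; the helper names are made up):

    // The old DataModel side: getPreferences() as a plain loop in an
    // abstract class (sketch, exception handling simplified).
    public Iterable<Preference> getPreferences() throws TasteException {
      List<Preference> all = new ArrayList<Preference>();
      LongPrimitiveIterator userIDs = getUserIDs();
      while (userIDs.hasNext()) {
        for (Preference p : getPreferencesFromUser(userIDs.nextLong())) {
          all.add(p);
        }
      }
      return all;
    }

    // The FactorizablePreferences side: answer the old per-user lookup with
    // a binary (or interpolation) search over data sorted by user ID.
    public PreferenceArray getPreferencesFromUser(long userID) {
      int first = firstIndexOfUser(userID);   // O(log n) search, made-up helper
      int last = lastIndexOfUser(userID);     // made-up helper
      return copyOfUserRange(first, last);    // made-up helper
    }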

Just a few thoughts, I'll have to validate the idea first before creating
a new JIRA ticket.

Yours Peng



On 13-07-16 02:51 PM, Sebastian Schelter wrote:

I completely agree, Netflix is less than one gigabyte in a smart
representation; 12x more memory is a no-go. The techniques used in
FactorizablePreferences allow a much more memory-efficient representation,
tested on the KDD Music dataset, which is approx. 2.5 times the size of
Netflix and fits into 3GB with that approach.
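
For reference, the kind of layout I mean is roughly the following (a
sketch, not the actual FactorizablePreferences code):

    // Sketch of a compact sequential layout: preferences grouped by user and
    // stored in parallel primitive arrays, no per-entry object overhead.
    class CompactPreferences {
      final int[] userOffsets;  // where each user's block starts in the arrays
      final int[] itemIDs;      // item of the k-th preference
      final float[] values;     // rating of the k-th preference

      CompactPreferences(int numUsers, int numPreferences) {
        this.userOffsets = new int[numUsers + 1];
        this.itemIDs = new int[numPreferences];
        this.values = new float[numPreferences];
      }
    }
    // Roughly 8 bytes per preference plus 4 bytes per user, so ~100M Netflix
    // ratings land near 800MB, and less with delta/short encodings.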


2013/7/16 Ted Dunning <ted.dunn...@gmail.com>

Netflix is a small dataset.  12G for that seems quite excessive.

Note also that this is before you have done any work.

Ideally, 100 million observations should take << 1GB.
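
(Back of the envelope, assuming users are grouped so their IDs need not be
repeated: 4-byte item IDs plus 1-byte ratings give 100M x 5 bytes = 500MB,
and delta-encoding the item IDs within each user's block pushes that well
below 500MB.)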

On Tue, Jul 16, 2013 at 8:19 AM, Peng Cheng <pc...@uowmail.edu.au>
wrote:

The second idea is indeed splendid; we should separate the
time-complexity-first and space-complexity-first implementations. What I'm
not quite sure about is whether we really need to create two interfaces
instead of one. Personally, I think 12G of heap space is not that high,
right? Most new laptops can already handle that (emphasis on laptop). And
if we replace the hash map (the culprit of the high memory consumption)
with a list/linked list, lookups would simply degrade to an O(n) linear
search, which is not too bad either. The current DataModel is the result of
careful thought and has undergone extensive testing; it is easier to expand
on top of it than to subvert it.
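
Concretely, I mean something like this toy lookup -- not proposed code,
just to show the trade-off:

    // Toy sketch: O(n) linear scan over a plain list instead of a hash map;
    // slower lookups, but no per-entry HashMap overhead.
    Float getPreferenceValue(long userID, long itemID, List<Preference> prefs) {
      for (Preference p : prefs) {
        if (p.getUserID() == userID && p.getItemID() == itemID) {
          return p.getValue();
        }
      }
      return null;  // no preference recorded for this (user, item) pair
    }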




