Hi Ryan,

That data is very high-dimensional, with lots of holes. See [1] for some of
its stats. This initial exploration also indicates that a lot of the data
is very non-uniform (skewed, peaked, etc.). For NuPIC, this is either a)
great! or b) awful!
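
To make "holes" and "skewed" concrete, here is a minimal sketch of the kind
of per-field check that sits behind stats like those in [1]. It is not taken
from that thread: the function name column_stats and the toy values are made
up for illustration, and None is assumed to mark a missing entry.

```python
def column_stats(values):
    """Return (missing_fraction, skewness) for one numeric field.

    ASSUMPTION: `values` is a non-empty list and None marks a hole.
    Skewness is the standard third standardized moment (population form).
    """
    present = [v for v in values if v is not None]
    missing_fraction = 1.0 - len(present) / len(values)
    n = len(present)
    if n == 0:
        return missing_fraction, 0.0  # nothing to measure
    mean = sum(present) / n
    var = sum((v - mean) ** 2 for v in present) / n
    if var == 0:
        return missing_fraction, 0.0  # constant field has no skew
    third = sum((v - mean) ** 3 for v in present) / n
    return missing_fraction, third / var ** 1.5


# Made-up toy field: mostly small counts, one large outlier, two holes.
toy_field = [0, 1, None, 1, 2, None, 1, 40]
missing, skew = column_stats(toy_field)
```

Fields with a large missing fraction or a skewness far from zero are the
ones described above as holey and non-uniform.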

The size and shape of the dataset suggest that swarming is unlikely
to do anything meaningful for you, and the deliberate obfuscation of the
field meanings prevents using common sense to help design the
encoding, field-combination strategy, or hierarchy for NuPIC.

(On the point in your most recent email, if the SP can discover these
correlations, then changing the encodings is unnecessary!)

David, you're correct in theory, but in practice NuPIC might learn so slowly
that you're better off using other ML methods.

I have a feeling that Chetan's geospatial encoder might shed some light on
this kind of problem, but I'm only beginning to think about how this might
work...

Regards,

Fergal Byrne

[1]
https://www.kaggle.com/c/criteo-display-ad-challenge/forums/t/9651/training-data-statistics


On Fri, Aug 8, 2014 at 2:39 PM, cogmission1 . <[email protected]>
wrote:

> > Each row is data related to a single display of an advertisement.
> > You're trying to predict whether the ad will be clicked or not.
>
> Ryan, I'm also new here. What I've gleaned from discussions about shaping
> the input encoding is that it revolves around understanding how the input
> data expresses semantic meaning. The type of encoding would depend on the
> dimensions of the data (how many variations can be expected in the input
> types), together with how many bits are needed to express that breadth of
> variation (window size of the bits, etc.). As long as the encoding can
> differentiate those, you can just let the HTM "discover" the various
> relationships and distinctions in the data (i.e. it will just work).
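
As a rough illustration of the trade-off between the number of variations
and the window size mentioned above, here is a minimal category-style
encoder. It is a standalone sketch and not NuPIC's encoder API:
encode_category, overlap, n_categories, and w are hypothetical names chosen
for this example.

```python
def encode_category(index, n_categories, w):
    """Encode category `index` as a bit list with `w` contiguous bits set.

    ASSUMPTION: each category gets its own non-overlapping window, so the
    total width is n_categories * w and distinct categories share no bits.
    """
    if not 0 <= index < n_categories:
        raise ValueError("index out of range")
    bits = [0] * (n_categories * w)
    for i in range(index * w, (index + 1) * w):
        bits[i] = 1
    return bits


def overlap(a, b):
    """Number of positions where both encodings have a 1 bit."""
    return sum(x & y for x, y in zip(a, b))


first = encode_category(0, n_categories=3, w=2)
second = encode_category(1, n_categories=3, w=2)
```

The total width grows as n_categories * w, which is why fields with a very
large number of distinct values make the input dimension expensive.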
>
> Following this you could work backwards once trained, and figure out what
> it all means?
>
> David
>
>
> On Fri, Aug 8, 2014 at 8:18 AM, Ryan Belcher <[email protected]> wrote:
>
>> I think the "urge to click" depends on the person browsing and the
>> content of the ad.  The words on the page act as a proxy for the person.
>> If you search for "brake pads" then it's very likely you're a person
>> looking for brake pads.  But now the ad companies are collecting more and
>> more information about people, so the words on the page aren't needed as
>> much.  They know you're interested in diapers even if the page has nothing
>> to do with babies.
>>
>> None of that matters for the Criteo competition since they're not saying
>> what any of the data means.  The fields are named I1, I2, C1, etc.  So all
>> you can do is look for correlations in the data.
>>
>>
>> On Fri, Aug 8, 2014 at 8:58 AM, David Ray <[email protected]>
>> wrote:
>>
>>> That seems to assume that the "urge to click" is somehow related to
>>> the pattern of word occurrences on a page? This could be true, and it
>>> would be interesting to find a correlation.
>>>
>>> You could maybe come up with a general theory for "click attraction" and
>>> patterns associated with word occurrence and web browsing in general....
>>>
>>> Sent from my iPhone
>>>
>>> On Aug 8, 2014, at 7:44 AM, Ryan Belcher <[email protected]> wrote:
>>>
>>> I'm looking at the Criteo Kaggle competition.  Each row is data
>>> related to a single display of an advertisement.  You're trying to
>>> predict whether the ad will be clicked or not.
>>>
>>> Am I trying to categorize?  Yes and no.  I'm trying to predict whether
>>> the ad will be clicked, but the way I'm trying to do that is by
>>> categorizing the rows into buckets and calculating probability based on the
>>> category.
>>>
>>> I'm not sure how else you'd go about it.
>>>
>>>
>>> On Thu, Aug 7, 2014 at 5:44 PM, Jim Bridgewater <[email protected]>
>>> wrote:
>>>
>>>> Hi Ryan,
>>>>
>>>> For classification problems it sounds like you are headed in the right
>>>> direction, but I'm unclear about what your objective is.  Are you just
>>>> trying to categorize each row in the data set?
>>>>
>>>>
>>>>
>>>> On Thu, Aug 7, 2014 at 1:33 PM, Ryan Belcher <[email protected]> wrote:
>>>> > I've been playing around with NuPIC for a while and am still trying
>>>> > to wrap my head around how to use it.  Right now I'm playing with
>>>> > some prediction scenarios where you have a number of input fields and
>>>> > you're trying to predict one output.
>>>> >
>>>> > My understanding is that if the inputs aren't related temporally,
>>>> > then it's a Spatial Pooling problem.  If there are common patterns in
>>>> > the data, then it may be helpful to create hierarchies of SPs.
>>>> >
>>>> > The data I'm looking at right now probably doesn't have common
>>>> > patterns.  It's basically a bunch of categorical data from which
>>>> > you're trying to predict a boolean outcome.  There are about 15M rows
>>>> > in the training set.
>>>> >
>>>> > So my thinking is to create 1 SP where the inputDimensions is wide
>>>> > enough to accommodate all of the fields and columnDimensions sized so
>>>> > that rows get grouped together.  (If there were 100k columns, then on
>>>> > average 150 rows would be pooled together.)
>>>> >
>>>> > In theory I could run all of the training data through the SP, then
>>>> > run it through again (without learning) and calculate an outcome
>>>> > probability for each column.  Then I could run the test data through
>>>> > and its probability would be the probability of the column it
>>>> > matches.
>>>> >
>>>> > Is that a reasonable approach or am I way out in left field?
>>>> >
>>>> > Thanks,
>>>> > Ryan
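
The approach outlined in the message above (group rows into buckets, then
estimate a click probability per bucket) can be sketched without NuPIC. In
this sketch bucket_of is a hypothetical stand-in for the SP and simply
hashes the row; a real Spatial Pooler activates a set of columns and maps
similar inputs to overlapping sets, which a hash does not do. All names and
the toy data are assumptions made for illustration.

```python
import zlib
from collections import defaultdict


def bucket_of(row, n_buckets):
    """Hypothetical stand-in for the SP: map a row of field values to a bucket.

    ASSUMPTION: a plain hash, so only identical rows are grouped together.
    """
    key = "|".join(str(v) for v in row).encode("utf-8")
    return zlib.crc32(key) % n_buckets


def fit_bucket_rates(rows, clicks, n_buckets):
    """One pass over the training data: count displays and clicks per bucket."""
    shown = defaultdict(int)
    clicked = defaultdict(int)
    for row, click in zip(rows, clicks):
        b = bucket_of(row, n_buckets)
        shown[b] += 1
        clicked[b] += click
    return shown, clicked


def predict(row, shown, clicked, n_buckets, prior):
    """Click rate of the bucket the row falls into, or `prior` if unseen."""
    b = bucket_of(row, n_buckets)
    n = shown.get(b, 0)
    return clicked.get(b, 0) / n if n else prior


# Made-up toy data: the same ad context shown four times, clicked twice.
train_rows = [("C1=a", "C2=x")] * 4
train_clicks = [1, 0, 0, 1]
shown, clicked = fit_bucket_rates(train_rows, train_clicks, n_buckets=1000)
p = predict(("C1=a", "C2=x"), shown, clicked, n_buckets=1000, prior=0.25)
```

The prior is a fallback for test rows that land in a bucket never seen in
training. With the figures quoted above, 15M rows over 100k buckets gives
150 rows per bucket on average, so individual bucket estimates may be noisy.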
>>>> >
>>>> > _______________________________________________
>>>> > nupic mailing list
>>>> > [email protected]
>>>> > http://lists.numenta.org/mailman/listinfo/nupic_lists.numenta.org
>>>> >
>>>>
>>>>
>>>>
>>>> --
>>>> James Bridgewater, PhD
>>>> Arizona State University
>>>> 480-227-9592
>>>>
>>>
>>
>


-- 

Fergal Byrne, Brenter IT

Author, Real Machine Intelligence with Clortex and NuPIC
https://leanpub.com/realsmartmachines

Speaking on Clortex and HTM/CLA at euroClojure Krakow, June 2014:
http://euroclojure.com/2014/
and at LambdaJam Chicago, July 2014: http://www.lambdajam.com

http://inbits.com - Better Living through Thoughtful Technology
http://ie.linkedin.com/in/fergbyrne/ - https://github.com/fergalbyrne

e:[email protected] t:+353 83 4214179
Join the quest for Machine Intelligence at http://numenta.org
Formerly of Adnet [email protected] http://www.adnet.ie
