Another way in which removing spurious order from data is beneficial:

Parallel algorithms

When I recently tried running cmix on the OBWB, I noticed it ran on only
one core (and incredibly slowly).

I'm not sure how the OBWB is organized, but if it is a collection of
"articles" whose order conveys little information, it seems breaking it out
into separate files, one per article, would be worth it (a sketch follows).
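
As a rough illustration of the payoff, here is a minimal sketch, with bz2
standing in for a real compressor like cmix, and "articles.txt" plus the
blank-line delimiter being assumptions about the corpus layout rather than
its actual format:

import bz2
from pathlib import Path
from multiprocessing import Pool

def split_articles(corpus="articles.txt", out_dir="articles"):
    # Assumed layout: one article per blank-line-separated block.
    out = Path(out_dir)
    out.mkdir(exist_ok=True)
    text = Path(corpus).read_text(encoding="utf-8", errors="replace")
    paths = []
    for i, article in enumerate(text.split("\n\n")):
        p = out / f"article_{i:06d}.txt"
        p.write_text(article, encoding="utf-8")
        paths.append(p)
    return paths

def compress_one(path):
    # bz2 is only a stand-in; the point is that each file is independent work.
    path.with_name(path.name + ".bz2").write_bytes(bz2.compress(path.read_bytes(), 9))

if __name__ == "__main__":
    with Pool() as pool:              # one worker per core by default
        pool.map(compress_one, split_articles())

Once the corpus is a pile of independent files, any per-file compressor can
be fanned out across cores; order only has to be reimposed (if at all) when
the archive is reassembled.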


On Wed, Jan 15, 2020 at 1:06 PM James Bowery <[email protected]> wrote:

> Right.  The example is not so much to suggest a practical/significant
> improvement in the Hutter Prize as to address a general problem:
> Specification of information sometimes requires deliberately leaving out
> ordering information -- as in set literals like
>
> cats = {fluffy, scruffy, paws, claws, ...}
>
> This is to make it clear that the data being described has nothing to do
> with the order in which the cats' names are listed.
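>
> In Python, for instance, the two orderings below denote the same value (an
> illustrative check only, with made-up names):
>
> a = {"fluffy", "scruffy", "paws", "claws"}
> b = {"claws", "paws", "scruffy", "fluffy"}
> assert a == b    # equal: listing order carries no information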
>
> Hierarchical data structures frequently contain elements like this, but
> even widely used standards like XML conflate serial syntactic structure
> with semantics, making it problematic to avoid "specifying" unwanted order.
> I had something of a knock-down drag-out fight about this with some Perl
> Monks 8 years ago regarding serialization of HTML documents
> <https://www.perlmonks.org/?node_id=879166>.
>
> If XML standards abstracted out ordering information where needed, it
> would, I think, help the data-description world quite a bit, and it might
> even have significant practical benefits for data modeling.
>
> On Wed, Jan 15, 2020 at 12:30 PM Matt Mahoney <[email protected]>
> wrote:
>
>> Removing the ordering constraint on enwik8 should reduce the compressed
>> size by about 50K bytes, or 2 bytes per article. But it wouldn't affect the
>> nature of the research. Here is more about the data.
>> http://mattmahoney.net/dc/textdata.html
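>>
>> (A back-of-the-envelope check of that figure, with N = 25,000 articles as
>> an assumed round number rather than enwik8's exact count:
>>
>> from math import lgamma, log
>> N = 25_000
>> bits = lgamma(N + 1) / log(2)    # log2(N!) bits to single out one ordering
>> print(bits / 8, "bytes total;", bits / 8 / N, "bytes per article")
>>
>> which comes to roughly 40K bytes, i.e. a couple of bytes per article.)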
>>
>> On Tue, Jan 14, 2020, 7:59 AM James Bowery <[email protected]> wrote:
>>
>>> Here's a simple modification to The Hutter Prize
>>> <http://prize.hutter1.net/> and the Large Text Compression Benchmark
>>> <http://mattmahoney.net/dc/text.html> to illustrate my point:
>>>
>>> Split the Wikipedia corpus into separate files, one per Wikipedia
>>> article.  An entry qualifies only if the set of checksums of the files
>>> produced by the self-extracting archive matches that of the original corpus.
>>>
>>> This reduces the over-constraint imposed by the strictly serialized
>>> corpus.
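>>>
>>> A minimal sketch of such an acceptance check (SHA-256 via hashlib is an
>>> assumed choice of checksum; the directory names are placeholders):
>>>
>>> import hashlib
>>> from pathlib import Path
>>>
>>> def checksum_set(directory):
>>>     # Order-free fingerprint: the set of per-file SHA-256 digests.
>>>     return {hashlib.sha256(p.read_bytes()).hexdigest()
>>>             for p in Path(directory).glob("*.txt")}
>>>
>>> qualifies = checksum_set("original_articles") == checksum_set("extracted_articles")
>>>
>>> Because the comparison is between sets, the self-extractor may emit the
>>> articles in any order, or in parallel, and still qualify.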
>>>
>>>
>>> On Sun, Jan 5, 2020 at 12:12 PM James Bowery <[email protected]> wrote:
>>>
>>>> In reality, sensors and effectors exist in space as well as time.
>>>> Serializing the spatial dimension of observations, so that they conform
>>>> to the serialized input of a Universal Turing Machine and their
>>>> Kolmogorov Complexity can be formalized, over-constrains them: it
>>>> introduces order irrelevant to their natural information content and
>>>> thereby artificially inflates the so-defined KC.
>>>>
>>>> Virtually all models in machine learning are built on tabular data;
>>>> even when the data can be cast as a time series row-indexed by a
>>>> timestamp, each row is an observation with multiple dimensions. So it
>>>> seems rather interesting, if not frustrating, that the default
>>>> assumption in Algorithmic Information Theory is a serial UTM.
>>>>
>>>>

------------------------------------------
Artificial General Intelligence List: AGI
Permalink: 
https://agi.topicbox.com/groups/agi/Tc33b8ed7189d2a18-M481c37c6d4d331643142412f
Delivery options: https://agi.topicbox.com/groups/agi/subscription
