Another way that removing spurious order from data is beneficial: Parallel algorithms
When I recently tried running cmix on the OBWB, I noticed it ran on only one core (and incredibly slowly). I'm not sure how the OBWB is organized, but if it is a bunch of "articles" whose order conveys little information, it seems breaking it out into separate files, one per article, would be worthwhile.

On Wed, Jan 15, 2020 at 1:06 PM James Bowery <[email protected]> wrote:

> Right. The example is not so much to suggest a practical/significant
> improvement in the Hutter Prize as to address a general problem:
> specification of information sometimes requires deliberately leaving out
> ordering information -- as in set literals like
>
>     cats = {"fluffy", "scruffy", "paws", "claws", ...}
>
> This makes it clear that the data being described has nothing to do
> with the order in which the cats' names are listed.
>
> Hierarchical data structures will frequently contain elements like this,
> but even widely used standards like XML insist on conflating syntactic
> structures, rendering it problematic to avoid "specifying" unwanted order.
> I had something of a knock-down, drag-out fight about this with some Perl
> Monks eight years ago regarding serialization of HTML documents
> <https://www.perlmonks.org/?node_id=879166>.
>
> If XML standards abstracted out ordering information where needed, it
> would, I think, help the data-description world quite a bit and may even
> have significant positive practical implications for data modeling.
>
> On Wed, Jan 15, 2020 at 12:30 PM Matt Mahoney <[email protected]> wrote:
>
>> Removing the ordering constraint on enwik8 should reduce the compressed
>> size by about 50K bytes, or 2 bytes per article. But it wouldn't affect
>> the nature of the research. Here is more about the data:
>> http://mattmahoney.net/dc/textdata.html
>>
>> On Tue, Jan 14, 2020, 7:59 AM James Bowery <[email protected]> wrote:
>>
>>> Here's a simple modification to the Hutter Prize
>>> <http://prize.hutter1.net/> and the Large Text Compression Benchmark
>>> <http://mattmahoney.net/dc/text.html> to illustrate my point:
>>>
>>> Split the Wikipedia corpus into separate files, one per Wikipedia
>>> article. An entry qualifies only if the set of checksums of the files
>>> produced by the self-extracting archive matches that of the original
>>> corpus.
>>>
>>> This reduces the over-constraint imposed by the strictly serialized
>>> corpus.
>>>
>>> On Sun, Jan 5, 2020 at 12:12 PM James Bowery <[email protected]> wrote:
>>>
>>>> In reality, sensors and effectors exist in space as well as time.
>>>> Serializing the spatial dimension of observations to formalize their
>>>> Kolmogorov Complexity, so they conform to the serialized input of a
>>>> Universal Turing machine, over-constrains the observations, introducing
>>>> order not relevant to their natural information content and hence
>>>> artificially inflating the so-defined KC.
>>>>
>>>> Since virtually all models in machine learning are based on tabular
>>>> data, even when they can be cast as time series row-indexed by a
>>>> timestamp, each row is an observation with multiple dimensions. So it
>>>> seems rather interesting, if not frustrating, that the default
>>>> assumption in Algorithmic Information Theory is a serial UTM.
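Bowery's proposed qualification rule — the set of per-article checksums from the self-extracting archive must match that of the original corpus, in any order — can be sketched in a few lines of Python. This is an illustrative sketch, not part of any actual prize rules; the function names and the choice of SHA-256 are my assumptions.

```python
import hashlib

def checksum_set(articles):
    """Order-insensitive fingerprint of a corpus.

    `articles` is an iterable of byte strings, one per article file.
    A set of digests compares equal regardless of the order in which
    the self-extracting archive emits the files. (If duplicate
    articles were possible, a multiset such as collections.Counter
    would be the safer choice, since a set silently collapses them.)
    """
    return {hashlib.sha256(a).hexdigest() for a in articles}

def qualifies(original_articles, extracted_articles):
    # An entry qualifies iff the extracted files reproduce exactly
    # the original articles, in any order.
    return checksum_set(original_articles) == checksum_set(extracted_articles)
```

Because only set equality is tested, the compressor is free to reorder, cluster, or parallelize across articles however it likes.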
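Mahoney's "about 50K bytes, or 2 bytes per article" figure can be sanity-checked: the information needed to specify one ordering of n distinct articles is log2(n!) bits. A minimal sketch, taking n = 25,000 as a hypothetical article count implied by dividing those two figures (the real count is on the textdata page):

```python
import math

def ordering_info_bytes(n):
    """Bytes of information needed to pick out one ordering of n
    distinct items: log2(n!) bits. Computed via lgamma so large n
    is cheap and free of big-integer overflow concerns."""
    bits = math.lgamma(n + 1) / math.log(2)
    return bits / 8

n = 25_000  # hypothetical article count, not the verified enwik8 figure
total_bytes = ordering_info_bytes(n)   # roughly 41 KB
per_article = total_bytes / n          # roughly 1.6 bytes
```

The result lands in the same ballpark as the quoted estimate, which is what one would expect: per article the cost is about log2(n/e) bits, growing only logarithmically with corpus size.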
------------------------------------------
Artificial General Intelligence List: AGI
Permalink: https://agi.topicbox.com/groups/agi/Tc33b8ed7189d2a18-M481c37c6d4d331643142412f
