On Tue, Jan 16, 2024 at 06:54:56PM +0000, Renato via Digitalmars-d-learn wrote: > On Tuesday, 16 January 2024 at 16:56:04 UTC, Siarhei Siamashka wrote: [...] > > You are not allowed to emit "1" as the first token in the output as > > long as there are any dictionary word matches at that position. The > > relevant paragraph from the problem statement:
Ohhh now I get it. Initially I misunderstood that as saying that if the rest of the phone number has at least one match, then a digit is not allowed. Now I see that what it's actually saying is that even if some random dictionary word matches at that position, even if it does not lead to any full matches, then a digit is excluded. [...] > > I also spent a bit of time trying to figure out this nuance when > > implementing my solution. It doesn't make much sense visually (no > > back-to-back digits in the output either way), but that's how it is. > > Exactly, this is one of the things that make this problem a bit > annoying to solve :) It's a strange requirement, for sure, but I don't think it's annoying. It makes the problem more Interesting(tm). ;-) Anyway, I've fixed the problem, now my program produces the exact same output as Renato's repo. Code is posted below. Interestingly enough, the running time has now halved to about 0.9 seconds for 1 million phone numbers. I guess that's caused by the more stringent requirement excluding many more matching possibilities, effectively pruning away large parts of the search tree. > @"H. S. Teoh" you implemented the solution as a Trie!! Nice, that's > also what I did when I "participated" in the study. Here's [my Trie > solution in > Java](https://github.com/renatoathaydes/prechelt-phone-number-encoding/blob/fastest-implementations-print-or-count/src/java/Main.java). > > These are basically the two common approaches to the problem: a Trie > or a numeric-based table. According to the study, people who use > scripting languages almost always go with the numeric approach, while > people coming from lower level languages tend to use a data structure > like Trie (if they don't know Trie, they come up with something > similar which is fascinating), which is harder to implement but more > efficient in general. Interesting. I guess my C/C++ background is showing. ;-) I'm not sure what exactly motivated me to go this route; I guess it was just my default preference of choosing the path of least work as far as the algorithm is concerned: I chose the algorithm that did the least amount of work needed to produce the right answer. Scanning through sections of the dictionary to find a match was therefore excluded; so my first thought was an AA. But then how much of the initial prefix to look up an in AA? Since it can't be known beforehand, I'd have to gradually lengthen the prefix to search for, which does a lot of repetitive work (we keep looking up the first few digits repeatedly each time we search for a longer prefix). Plus, multiple consecutive AA lookups is not cache-friendly. So my next thought was, what should I do such that I don't have to look at the initial digits anymore once I already processed it? This line of thought naturally led to a trie structure. Once I arrived at a trie structure, the next question was how exactly dictionary entries would be stored in it. Again, in the vein of doing the least amount of work I could get away with, I thought, if I stored words in the trie directly, with each edge encoding a letter, then during the search I'd have to repeatedly convert letters to the corresponding phone number digit and vice versa. So why not do this conversion beforehand, and store only phone digits in the trie? This also had the additional benefit of letting me effectively search multiple letters simultaneously, since multiple letters map to the same digit, so scanning a digit is equivalent to searching multiple letters at the same time. The output, of course, required the original form of the words -- so the obvious solution was to attach the original words as a list of words attached to the trie node representing the end of that word. Once this was all decided, the only remaining question was the search algorithm. This turned out to take the most time in solving this problem, due to the recursive nature of the search, I had to grapple with where and how to make the recursive calls, and how to propagate return values correctly. The initial implementation only found word matches, and did not allow the single digits. Happily, the recursive algorithm turned out to have enough structure to encode the single digit requirements as well, although it took a bit of trial and error to find the correct implementation. > Can I ask you why didn't you use the [D stdlib > Trie](https://dlang.org/phobos/std_uni.html#codepointTrie)? Not sure > that would've worked, but did you consider that? Haha, I didn't even think of that. :-D I wouldn't have wanted to use it anyway, because it was optimized for single codepoint lookups, and wouldn't have matched what I wanted to do. Besides, I also understood it as an internal helper structure used by std.uni, so it isn't really designed as a trie implementation for general use. Also, since performance was an issue in your OP, I wanted to avoid off-the-bat any hidden costs that may be present in a pre-existing trie implementation that I did not fully understand. Maybe I'm just affected with NIH syndrome, but IME, when performance is important you usually want to write a custom data structure for your inner loop, one that's fine-tuned to your specific algorithm. Otherwise the impedance mismatch may give you suboptimal results. T -- Knowledge is that area of ignorance that we arrange and classify. -- Ambrose Bierce