On Friday, May 1, 2015 at 1:25:41 AM UTC-4, Jeff Bezanson wrote:

> It is true that we have not yet done enough to optimize the worst and
> worse performance cases. The bright side of that is that we have room
> to improve; it's not that we've run out of ideas and techniques.
>
> Tim is right that the complexity of our dispatch system makes julia
> potentially slower than python. But in dispatch-heavy code I've seen
> cases where we are faster or slower; it depends.
>
> Python's string and dictionary operations, in particular, are really
> fast. This is not surprising considering what the language was
> designed for, and that they have a big library of well-tuned C code
> for these things.
>
> I still maintain that it is misleading to describe an *asymptotic*
> slowdown as "800x slower". If you name a constant factor, it sounds
> like you're talking about a constant factor slowdown. But the number
> is arbitrary, because it depends on data size. In theory, of course,
> an asymptotic slowdown is *much worse* than a constant factor
> slowdown. However in the systems world constant factors are often more
> important, and are often what we talk about.
No, that was just my very first test comparing Julia & Python, using a size that matched the record sizes I'd typically seen from way too many years of benchmarking (database / string processing operations).

> You say "a lot of the algorithms are O(n) instead of O(1)". Are there
> any examples other than length()?

Actually, it's worse than that... length, finding a particular character by character position, and getting a substring by character position (some of the most frequent operations for what I deal with) are O(n) instead of O(1), and things like conversions are O(n^2), not O(n) [and the conversions are much more complex, due to the string representation in Julia, unlike Python 3].

The conversions I am fixing, so that they are not O(n^2) but rather O(n) [still slower than Python, again because of the representation, but not asymptotically]. The reason they are O(n^2), like the string concatenation problem I ran into right when I first started to evaluate Julia, is the way the conversion functions are written: initially creating a 0-length array, then doing push! to successively add characters to the array, and finally calling UTF8String, UTF16String, or UTF32String to convert the Vector{UInt8}, Vector{UInt16}, or Vector{Char} respectively into an immutable string. As the string grows, Julia's internals end up having to reallocate the memory and sometimes copy it to a new location, hence the O(n^2) nature of the code.

My changes, which hopefully will be accepted (after I check in my next round of pure Julia optimizations), solve that by first validating the input UTF-8, UTF-16, or UTF-32 string at the same time as calculating how many characters of the different ranges are present, so that the memory can be allocated once, at exactly the size needed. This also frequently allows dispatching to simpler conversion code, when it is known that all of the characters in the string just need to be widened (zero-extended) or narrowed.
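A minimal sketch of the two approaches described above, written for current Julia (where the concrete UTF8String/UTF16String/UTF32String types have since been merged into String; the function names here are hypothetical, not the actual conversion code): the first grows a byte buffer incrementally with push!, the second does a sizing pass first so the output buffer is allocated exactly once.

```julia
# Hypothetical sketch: building a UTF-8 byte buffer one code unit at a time.
# Each push! may force the buffer to be reallocated and copied as it grows.
function to_utf8_push(chars::Vector{Char})
    buf = UInt8[]
    for c in chars, b in codeunits(string(c))
        push!(buf, b)                      # may reallocate + copy the buffer
    end
    return String(buf)
end

# Two-pass alternative: first scan the input to compute the exact number of
# UTF-8 bytes needed (the same pass where validation would happen), then
# allocate the output buffer once, at exactly the size needed, and fill it.
function to_utf8_twopass(chars::Vector{Char})
    nbytes = sum(ncodeunits, chars)        # pass 1: exact UTF-8 size
    buf = Vector{UInt8}(undef, nbytes)     # single, exact-size allocation
    i = 1
    for c in chars, b in codeunits(string(c))
        buf[i] = b                         # pass 2: fill, no reallocation
        i += 1
    end
    return String(buf)
end

to_utf8_twopass(['J', 'ü', 'l', 'i', 'a'])   # "Jülia"
```

(Note that push! growth is geometrically amortized, so in practice the cost of the first version depends on the allocator; the two-pass version avoids the question entirely by never growing the buffer.)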
> I disagree that UTF-8 has no space savings over UTF-32 when using the
> full range of unicode. The reason is that strings often have only a
> small percentage of non-BMP characters, with lots of spaces and
> newlines etc. You don't want your whole file to use 4x the space just
> to use one emoji.

Please read my statement more carefully...

> UTF-8 *can* take up to 50% more storage than UTF-16 if you are just
> dealing with BMP characters.
> If you have some field that needs to hold *a certain number of Unicode
> characters*, for the full range of Unicode,
> you need to allocate 4 bytes for every character, so no savings compared
> to UTF-16 or UTF-32.

My point was that if you have to allocate a buffer to hold a certain # of characters, say because you have a CHAR, NCHAR, WCHAR, VARCHAR, etc. field from a DBMS, then for UTF-8 you need to allocate at least 4 bytes per character, so no savings over UTF-16 or UTF-32 for those operations...

I spent over two years going back and forth to Japan when I designed (and was the main implementor of) the Unicode support for a database system / language, and spent a lot of time looking at just how much storage space different representations would take... Note, at that time Unicode 2.0 was not out, so the choice was between UCS-2 (no surrogates then), UTF-8, some combination thereof, or some new encoding. My first version, finally released in 1997, used either 8-bit (ANSI Latin-1) or UCS-2 to store data... In the next release, I came up with a new, much more compact encoding for Unicode (at the insistence of the Japanese customers, who didn't want their storage requirements to increase because of moving from SJIS and EUC to Unicode).
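The "up to 50% more storage" point is easy to check on current Julia with transcode from Base (the helper names here are mine, purely for illustration): BMP-only CJK text takes 3 bytes per character in UTF-8 but only 2 in UTF-16.

```julia
# Byte counts for one string under the three encodings discussed above.
utf8_bytes(s::String)  = ncodeunits(s)                    # UTF-8 code units are 1 byte
utf16_bytes(s::String) = 2 * length(transcode(UInt16, s)) # UTF-16 code units are 2 bytes
utf32_bytes(s::String) = 4 * length(transcode(UInt32, s)) # UTF-32 code units are 4 bytes

s = "日本語"        # three BMP (CJK) characters
utf8_bytes(s)      # 9 bytes: 3 per character in UTF-8
utf16_bytes(s)     # 6 bytes: 2 per character in UTF-16, i.e. UTF-8 is 50% larger
utf32_bytes(s)     # 12 bytes: 4 per character in UTF-32
```

For ASCII-heavy text the comparison flips (1 byte per character in UTF-8 vs 2 in UTF-16), which is exactly why the answer depends so much on the data, as both sides of this thread are arguing.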
In memory, all strings were UCS-2 (or really UTF-16, but like Java, because I designed it before Unicode added surrogates and non-BMP characters, it treats a surrogate pair as 2 characters; however, the conversions to/from UTF-8 were fixed to handle surrogate pairs correctly [4 bytes in UTF-8 instead of two 3-byte encoded surrogates], and functions were added to get the # of logical characters, to do pattern matching taking surrogates into account, etc.).

For the Japanese customers (and this was verified over and over again), their data would have been much, much larger if we had tried to store it with UTF-8 [SJIS encodes the Katakana characters in 1 byte, hence the very large difference, 3x for much of their data].

I'm not trying to throw bricks at the Julia implementors; I understand quite well that they had/have different priorities (numerical / scientific computing). However, I think Julia has great potential as a general programming language, and even as a string processing / database language, if certain things can be addressed. I would like to help make that happen, if they are willing to let me contribute...

As an aside, I spent 29 years on the other side of the fence, being the main compiler/language designer/implementer, and having people tell me... "why don't you do this, this is too slow, it is not suitable for our needs"... In particular, we had to shoehorn in support for IEEE binary floating point, which wasn't easy at all, because up to that point the language had essentially only one data type: string (internally, numeric values were stored with an integer or scaled decimal representation, but that was normally not visible to the programmer). A + B means take the numeric interpretation of A & B, and then add them together...
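The surrogate-pair arithmetic above can be illustrated on current Julia (where String is UTF-8): a non-BMP character is one logical character, four UTF-8 bytes, and two UTF-16 code units, and encoding each half of the surrogate pair separately as UTF-8 (the mistake the corrected conversions avoided, sometimes called CESU-8) would cost 6 bytes instead of 4.

```julia
# A non-BMP character: U+1F600 (😀), outside the Basic Multilingual Plane.
s = "😀"

length(s)                      # 1 logical character
ncodeunits(s)                  # 4 bytes in UTF-8
length(transcode(UInt16, s))   # 2 UTF-16 code units (a surrogate pair)

# Treating the pair as two separate 3-byte UTF-8 sequences (CESU-8 style)
# would cost 2 * 3 = 6 bytes instead of the correct 4.
```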
It had never been a priority in the past, since most of our customer base required decimal arithmetic, and we didn't want to break the clean conceptual model (everything is a string) by adding any "types" that were visible to the user, but we ended up having to, so that customers who wanted to store scientific data could do so...

Scott

P.S. Kudos to the whole Julia team for the focus on performance... even when we might sometimes disagree on how best to achieve it! ;-)