On Friday, May 1, 2015 at 1:25:41 AM UTC-4, Jeff Bezanson wrote:
>
> It is true that we have not yet done enough to optimize the worst and 
> worse performance cases. The bright side of that is that we have room 
> to improve; it's not that we've run out of ideas and techniques. 
>
> Tim is right that the complexity of our dispatch system makes julia 
> potentially slower than python. But in dispatch-heavy code I've seen 
> cases where we are faster or slower; it depends. 
>
> Python's string and dictionary operations, in particular, are really 
> fast. This is not surprising considering what the language was 
> designed for, and that they have a big library of well-tuned C code 
> for these things. 
>
> I still maintain that it is misleading to describe an *asymptotic* 
> slowdown as "800x slower". If you name a constant factor, it sounds 
> like you're talking about a constant factor slowdown. But the number 
> is arbitrary, because it depends on data size. In theory, of course, 
> an asymptotic slowdown is *much worse* than a constant factor 
> slowdown. However in the systems world constant factors are often more 
> important, and are often what we talk about. 
>

No, that was just my very first test comparing Julia & Python, using a size 
that matched the record sizes I'd typically seen from way too many years of 
benchmarking (database / string processing operations).
 

> You say "a lot of the algorithms are O(n) instead of O(1)". Are there 
> any examples other than length()? 
>

Actually, it's worse than that... length, finding a particular character by 
character position, and getting a substring by character position, some of 
the most frequent operations for what I deal with, are O(n) instead of 
O(1), and things like conversions are O(n^2), not O(n) [and the conversions 
are much more complex, due to the string representation in Julia, unlike 
Python 3].
The conversions I am fixing, so that they are not O(n^2) but rather O(n) 
[still slower than Python, again because of the representation, but not 
asymptotically].
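To make the O(n) behavior concrete, here is a small sketch using current Julia function names (in 2015 the same held for the UTF8String type): because UTF-8 characters have variable width, counting characters or finding the k-th one requires scanning bytes from the start.

```julia
s = "αβγ" ^ 1000          # a UTF-8 string of 3000 two-byte characters
length(s)                  # 3000 characters: must scan all the bytes, O(n)
sizeof(s)                  # 6000 bytes of storage
i = nextind(s, 0, 1500)    # byte index of the 1500th character: also an O(n) scan
s[i]                       # 'γ'
```

With a fixed-width representation such as UTF-32, each of these would be a single arithmetic step instead of a scan.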
The reason they are O(n^2), like the string concatenation problem I ran 
into right when I first started to evaluate Julia, is the way the 
conversion functions are written: they initially create a 0-length array, 
do push! to successively add characters to the array, and finally call 
UTF8String, UTF16String, or UTF32String to convert the Vector{UInt8}, 
Vector{UInt16}, or Vector{Char}, respectively, into an immutable string.
As the string grows, Julia's internals end up having to reallocate the 
memory and sometimes copy it to a new location, hence the O(n^2) nature of 
the code.
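The pattern described above looks roughly like this sketch (the function name is hypothetical, not the actual Base conversion code, and BMP-only input is assumed so each Char fits in a UInt16):

```julia
# Illustrative only: grow a 0-length buffer one element at a time.
function widen_by_push(s::AbstractString)
    buf = UInt16[]                 # starts at length 0
    for c in s
        push!(buf, UInt16(c))      # each push! may force a reallocation + copy
    end
    return buf                     # would then be wrapped into a string type
end
```

Every time the buffer outgrows its allocation, the whole contents must be copied to a new block of memory, which is where the repeated-copying cost comes from.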

My changes, which hopefully will be accepted (after I check in my next 
round of pure Julia optimizations), solve that by first validating the 
input UTF-8, UTF-16, or UTF-32 string while simultaneously calculating how 
many characters of the different ranges are present, so that the memory can 
be allocated once, at exactly the size needed. This also frequently allows 
dispatching to simpler conversion code, when it is known that all of the 
characters in the string just need to be widened (zero-extended) or 
narrowed.
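The two-pass idea can be sketched as follows (hypothetical names, BMP-only input assumed; the real code also validates the input and classifies character ranges during the first pass):

```julia
# First pass counts, second pass fills a single exact-size allocation.
function widen_twopass(s::AbstractString)
    n = length(s)                      # pass 1: count (validation would go here)
    buf = Vector{UInt16}(undef, n)     # one allocation, exactly the size needed
    i = 0
    for c in s
        buf[i += 1] = UInt16(c)        # pass 2: fill, no reallocation ever
    end
    return buf
end
```

The cost is one extra scan of the input, which is still O(n) overall, versus the repeated reallocate-and-copy of the push!-based version.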

> I disagree that UTF-8 has no space savings over UTF-32 when using the 
> full range of unicode. The reason is that strings often have only a 
> small percentage of non-BMP characters, with lots of spaces and 
> newlines etc. You don't want your whole file to use 4x the space just 
> to use one emoji. 
>

Please read my statement more carefully...

> UTF-8 *can* take up to 50% more storage than UTF-16 if you are just 
> dealing with BMP characters.
> If you have some field that needs to hold *a certain number of Unicode 
> characters*, for the full range of Unicode,
> you need to allocate 4 bytes for every character, so no savings compared 
> to UTF-16 or UTF-32.


My point was that if you have to allocate a buffer to hold a certain # of 
characters, say because you have a CHAR, NCHAR, WCHAR, VARCHAR, etc. field 
from a DBMS, then for UTF-8 you need to allocate at least 4 bytes per 
character, so no savings over UTF-16 or UTF-32 for those operations...
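The worst-case arithmetic behind that point is simple (the function names here are just for illustration): to hold n arbitrary Unicode characters, all three encodings need the same buffer size.

```julia
# Worst-case buffer size, in bytes, to hold n arbitrary Unicode characters:
utf8_worst(n)  = 4n   # UTF-8: up to 4 bytes per code point
utf16_worst(n) = 4n   # UTF-16: up to 2 code units (a surrogate pair) × 2 bytes
utf32_worst(n) = 4n   # UTF-32: always exactly 4 bytes per code point
# e.g. a CHAR(80) field needs a 320-byte buffer in every one of them
```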

I spent over two years going back and forth to Japan when I designed (and 
was the main implementor of) the Unicode support of a database system / 
language, and spent a lot of time looking at just how much storage space 
different representations would take... Note, at that time, Unicode 2.0 was 
not out, so the choice was between UCS-2 (no surrogates then), UTF-8, some 
combination thereof, or some new encoding.

My first version, finally released in 1997, used either 8-bit (ANSI Latin 
1) or UCS-2 to store data... In the next release, I came up with a new 
encoding for Unicode that was much more compact (at the insistence of the 
Japanese customers, who didn't want their storage requirements to increase 
because of moving from SJIS and EUC to Unicode).
In memory, all strings were UCS-2 (or really UTF-16; but, like Java, 
because I designed it before Unicode added surrogates and non-BMP 
characters, it treats a surrogate pair as 2 characters). The conversions 
to/from UTF-8 were fixed to handle surrogate pairs correctly [4 bytes in 
UTF-8, instead of two 3-byte encoded surrogates], and functions were added 
to get the # of logical characters, to do pattern matching taking 
surrogates into account, etc.
For the Japanese customers (and this was verified over and over again), 
their data would have been much, much larger if we had tried to store it 
with UTF-8 [SJIS encodes the Katakana characters in 1 byte, hence the very 
large difference: 3x for much of their data].

I'm not trying to throw bricks at the Julia implementors; I understand 
quite well that they had/have different priorities (numerical / scientific 
computing). However, I think Julia has great potential as a general 
programming language, and even as a string processing / database language, 
if certain things can be addressed.
I would like to help make that happen, if they are willing to let me 
contribute...

As an aside, I spent 29 years on the other side of the fence, being the 
main compiler/language designer/implementer, and having people tell me... 
"why don't you do this, this is too slow, it is not suitable for our 
needs"... In particular, we had to shoehorn in support for IEEE binary 
floating point, which wasn't easy at all, because up to that point the 
language had essentially only one data type... string (internally, numeric 
values were stored with an integer or scaled decimal representation; 
however, that was normally not visible to the programmer). A + B means take 
the numeric interpretation of A & B, and then add them together...
It had never been a priority in the past, since most of our customer base 
required decimal arithmetic, and we didn't want to break the clean 
conceptual model (everything is a string) and add any "types" that were 
visible to the user, but we ended up having to, so that customers who 
wanted to store scientific data could do so...

Scott

P.S. Kudos to the whole Julia team for the focus on performance... even 
when we might sometimes disagree on how best to achieve it! ;-)
