Re: [Pharo-users] protobufs for Pharo

2015-05-11 Thread Benjamin Pollack
On Mon, May 11, 2015, at 12:27 PM, Paul DeBruicker wrote:
 If you've gotta have protobuf then you've gotta have protobuf, but there
 is a thrift implementation (https://thrift.apache.org/) and also a
 message pack implementation (https://code.google.com/p/stomp-serializer/)
 

There are unfortunately a lot of protocols (including the one I
mentioned used by RethinkDB) that are protobuf-based, so using one of
those won't work too well. If interfacing with a third-party tool
*weren't* necessary, I'd honestly probably just use Fuel, anyway.

--Benjamin



[Pharo-users] protobufs for Pharo

2015-05-09 Thread Benjamin Pollack
Hey all,

Has anyone implemented protobufs for Pharo yet? I’ve not managed to find any 
examples, so I assume the answer is “no”, but I thought I’d check here before 
rolling up my sleeves and writing one myself. (My specific target here is to 
write a RethinkDB driver, so if someone has magically also done *that*, please 
let me know, too.)

Thanks,
--Benjamin


Re: [Pharo-users] Spur images

2014-10-09 Thread Benjamin Pollack
On Sun, 05 Oct 2014 16:51:25 -0400, Sven Van Caekenberghe s...@stfx.eu  
wrote:

[snip]
Apart from that, the tokenisation is not very efficient, #lines is a  
copy of your whole contents, so is the #split: and #trimmed. The  
algorithm sounds a bit lazy as well, writing it 'on purpose' with an eye  
for performance might yield better results.


So I was reflecting on this more.  If String and WideString were  
immutable, then it'd be easy to avoid all of these copies; you could  
instead pass around very tiny objects that had only three members (a  
String, a start position, a stop position), and avoid copying very much  
data.  It's the fact that String and WideString are mutable that precludes
that.  For fun, since I know I won't mutate the strings in this example, I
actually did a quick spike where I replaced #copyFrom:to: with a new
method I introduced called #viewFrom:to: that returned a StringView.  I'll
post the code when I have a chance to clean it up if there's interest, but
it looks like it pretty handily chops off 120-150ms from that runtime
(i.e., double the speed).
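
A minimal sketch of the view idea, in Python rather than Pharo and purely
for illustration.  StringView, view and split_views are hypothetical names
chosen here; this is not the code from the spike, only the three-member
object (a string, a start position, a stop position) described above:

```python
class StringView:
    """Read-only window onto a string: a (source, start, stop) triple."""

    __slots__ = ("source", "start", "stop")

    def __init__(self, source, start=0, stop=None):
        self.source = source
        self.start = start
        self.stop = len(source) if stop is None else stop

    def __len__(self):
        return self.stop - self.start

    def __getitem__(self, index):
        # Only non-negative integer indexes are supported in this sketch.
        if not 0 <= index < len(self):
            raise IndexError(index)
        return self.source[self.start + index]

    def view(self, start, stop):
        """Narrow the window without copying any characters."""
        return StringView(self.source, self.start + start, self.start + stop)

    def __str__(self):
        # The only place where characters are actually copied.
        return self.source[self.start:self.stop]


def split_views(text, separator):
    """Yield one StringView per field of text, split on a single character."""
    begin = 0
    for position, char in enumerate(text):
        if char == separator:
            yield StringView(text, begin, position)
            begin = position + 1
    yield StringView(text, begin, len(text))


fields = [str(v) for v in split_views("alpha,beta,gamma", ",")]
print(fields)
```

The views are only safe because nothing mutates the underlying string while
they are alive, which is the point about immutability made above.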


Has there been any thought to introducing some immutable collections?  Or  
maybe I'm just missing them?  They'd be useful not just for String and  
WideString, but really for basically any of the collection types.  The  
implementation in most cases would be as simple as overriding #at:put: and  
friends to throw self shouldNotImplement, and then providing  
methods/classes like the one I introduced to allow taking advantage of the  
newfound immutability.
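
As a rough Python analogue of overriding #at:put: and friends (an
illustration only, not a proposed Pharo implementation), a read-only
sequence can expose reads and reject every write.  FrozenSequence is a
hypothetical name:

```python
from collections.abc import Sequence


class FrozenSequence(Sequence):
    """Sequence wrapper that allows reads and refuses writes."""

    def __init__(self, items):
        # Private copy, so later changes to the caller's list cannot leak in.
        self._items = list(items)

    def __len__(self):
        return len(self._items)

    def __getitem__(self, index):
        return self._items[index]

    def __setitem__(self, index, value):
        # Counterpart of answering "self shouldNotImplement" from #at:put:.
        raise TypeError("FrozenSequence does not support item assignment")

    def __delitem__(self, index):
        raise TypeError("FrozenSequence does not support item deletion")


letters = FrozenSequence("abc")
try:
    letters[0] = "z"
    write_rejected = False
except TypeError:
    write_rejected = True
print(list(letters), write_rejected)
```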


If there's interest, I'd be happy to submit a Slice we could use as a  
concrete RFC of what this could look like.


--Benjamin



[Pharo-users] Spur images

2014-10-03 Thread Benjamin Pollack
My apologies if this is already spelled out somewhere and I simply can't
find it, but are there any Spur images for Pharo yet?  I can only find
Squeak ones.  It's not a big deal either way; I was just playing with an
algorithm that, after very heavy optimization, I was able to get down to
about 278ms per pass (down from ~700ms for a naive
implementation).  For contrast, the equivalent Python runs in ~80ms. 
Looking at MessageTally, it seems at least half of that 278ms is spent
noodling around in WideStrings and other small things that the Spur
object format ought to help with, so it'd be interesting to see how much
of a speed boost that gives without any further work.



Re: [Pharo-users] Ridiculous we are

2014-09-24 Thread Benjamin Pollack

On Tue, 23 Sep 2014 08:51:54 -0400, Hilaire hila...@drgeo.eu wrote:


Le 23/09/2014 14:09, Damien Cassou a écrit :

I recently read documents about utf-8 encoding. In all of them, the
author says that pathnames should be kept as is because you never know
which encoding the filesystem uses. So, a filename should probably be
a bytearray.



yes, but a #é should be encoded in two bytes.


As noted in my previous message, é could be represented as either one or  
two Unicode code points, and these in turn could validly be either two or  
three bytes in UTF-8.  My gut says that $é should be U+00E9, because  
otherwise you would have to use two Characters ($e and $´), but you could
legitimately argue otherwise as well, and at any rate, #é could definitely  
be either.  This is likely the core of the issue you're hitting.
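
The counts are easy to check.  A small Python sketch (Python chosen only
for illustration; the variable names are made up here) that measures both
spellings of é:

```python
import unicodedata

composed = "\u00e9"      # U+00E9 LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"   # U+0065 followed by U+0301 COMBINING ACUTE ACCENT

for label, text in (("composed", composed), ("decomposed", decomposed)):
    code_points = len(text)
    utf8_bytes = len(text.encode("utf-8"))
    print(f"{label}: {code_points} code point(s), {utf8_bytes} UTF-8 byte(s)")

# The two spellings compare unequal until one is normalized to the other.
equal_raw = composed == decomposed
equal_after_nfc = unicodedata.normalize("NFC", decomposed) == composed
print(equal_raw, equal_after_nfc)
```

The composed form is one code point and two UTF-8 bytes; the decomposed
form is two code points and three UTF-8 bytes.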




Re: [Pharo-users] Ridiculous we are

2014-09-24 Thread Benjamin Pollack
On Mon, 22 Sep 2014 17:58:41 -0400, Sven Van Caekenberghe s...@stfx.eu  
wrote:


I also find the way some problems are reported quite disturbing. How  
much testing did you do ? On which platforms ?


I can do this (in Pharo 3) without any problems (we're talking about  
arbitrary Unicode characters in path names):


('/tmp' asFileReference / 'été') ensureCreateDirectory.
'/tmp/été' asFileReference exists.
('/tmp/été' asFileReference / 'Ελλάδα.txt') writeStreamDo: [ :out |
  out << 'What about Greece ?' ].
('/tmp/été' asFileReference / 'Ελλάδα.txt') exists.
('/tmp/été' asFileReference / 'Ελλάδα.txt') contents.

And in a terminal, I get:

$ ls /tmp/été/Ελλάδα.txt
/tmp/été/Ελλάδα.txt

$ cat !$
cat /tmp/été/Ελλάδα.txt
What about Greece ?

This is on Mac OS X.

So this part fundamentally works in the image and on one VM. There might  
of course be problems in how paths are used in certain places or on  
certain VM/platforms.




Focusing purely on Unicode itself (not the encoding systems), a letter  
like é can be represented as U+00E9 (LATIN SMALL LETTER E WITH ACUTE), or  
as U+0065 (LATIN SMALL LETTER E) followed by U+0301 (combining acute  
accent).  These will appear identical to the user, but are emphatically  
*not* identical for most software.  The way you're testing here, you will  
not hit any error relating to this concept, ever, because you're using  
Pharo for both generating and consuming the strings.  At the very least,  
we'd need to generate a file named été with both forms explicitly and  
see what happens.
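
One way to run that experiment from outside Pharo, sketched in Python
under the assumption that the file system accepts arbitrary UTF-8 byte
names.  It creates été in a scratch directory once in NFC and once in NFD,
then lists what the file system kept:

```python
import os
import tempfile
import unicodedata

name_nfc = unicodedata.normalize("NFC", "été")   # precomposed U+00E9
name_nfd = unicodedata.normalize("NFD", "été")   # e + U+0301

with tempfile.TemporaryDirectory() as scratch:
    directory = os.fsencode(scratch)
    for name in (name_nfc, name_nfd):
        # Byte paths are used so that no locale setting re-encodes the name.
        path = os.path.join(directory, name.encode("utf-8"))
        with open(path, "wb") as handle:
            handle.write(b"What about Greece ?\n")
    created = sorted(os.listdir(directory))

for raw in created:
    print(len(raw), "bytes:", raw)
```

How many entries come back depends on the file system: one that stores
names as plain bytes should keep both, while one that normalizes names may
keep a single entry.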


Things get even more exciting, though, because Unix says that file names  
are simply arbitrary byte patterns that do not contain the null byte.*   
Thus, you can trivially create a file named été using Latin-1 encoding,  
and again using UTF-8 encoding, and again using UTF-7 encoding, and these  
might all be shown to the user as identically named, but I guarantee you  
that Pharo will not act sanely with all of them.  Even on Windows,  
where things are a bit saner (NTFS mandates UTF-16), and where an explicit  
normalization form is preferred (NFC), I just explicitly verified that I  
can trivially inject other normalization forms into the file system.   
Thus, you can still have two files named été that nevertheless have  
different names as far as the OS is concerned.
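
To make the byte-level difference concrete, a short Python sketch (for
illustration only) that encodes the same three characters in each of the
encodings mentioned and then reads them back under the wrong assumption:

```python
name = "\u00e9t\u00e9"   # "été" with precomposed U+00E9

encoded = {codec: name.encode(codec) for codec in ("latin-1", "utf-8", "utf-7")}
for codec, raw in encoded.items():
    print(f"{codec:8} {len(raw)} bytes  {raw!r}")

# Bytes written as Latin-1 are not valid UTF-8 at all here.
try:
    encoded["latin-1"].decode("utf-8")
    latin1_is_valid_utf8 = True
except UnicodeDecodeError:
    latin1_is_valid_utf8 = False

# Bytes written as UTF-8 do decode as Latin-1, but to the wrong characters.
mojibake = encoded["utf-8"].decode("latin-1")
print(latin1_is_valid_utf8, mojibake)
```

Three different byte strings, hence three different file names as far as
Unix is concerned, even though each is a faithful spelling of été in its
own encoding.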


In this case, as far as I can tell, Pharo assumes that all path names are  
Unicode, and does not do any work to convert strings to or from the  
various normalization schemes (looking in Path  
class>>canonicalizeElements:, Path class>>from:delimiter:, and  
FileSystemStore>>pathFromString: here).


There's therefore a pretty straightforward fix that Pharo could do:

  1. Path would use ByteArrays as the actual canonical store, and
     provide convenience methods to see what the array decodes to
     in various encodings.  The developer and application can make
     decisions about what encoding system they want to use.
  2. The VM likely needs to be modified to handle this (didn't check).
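
A rough sketch of what point 1 could look like, written in Python rather
than Pharo and not modelled on the real Path API.  BytePath, from_text and
decoded are hypothetical names; the assumption is simply that the raw
bytes are canonical and text is a view on them:

```python
class BytePath:
    """Hypothetical path type: raw bytes are the store, text is a view."""

    def __init__(self, raw):
        if b"\x00" in raw:
            raise ValueError("path names may not contain the null byte")
        self.raw = bytes(raw)

    @classmethod
    def from_text(cls, text, encoding="utf-8"):
        """Convenience constructor for the common case of UTF-8 text."""
        return cls(text.encode(encoding))

    def decoded(self, encoding="utf-8", errors="strict"):
        """Show what the bytes decode to under a caller-chosen encoding."""
        return self.raw.decode(encoding, errors)

    def __truediv__(self, child):
        child_raw = child.raw if isinstance(child, BytePath) else child
        return BytePath(self.raw.rstrip(b"/") + b"/" + child_raw)

    def __eq__(self, other):
        # Equality is byte equality: no normalization, no re-encoding.
        return isinstance(other, BytePath) and self.raw == other.raw

    def __hash__(self):
        return hash(self.raw)


latin1_path = BytePath(b"/tmp") / b"\xe9t\xe9"          # "été" in Latin-1
utf8_path = BytePath.from_text("/tmp/\u00e9t\u00e9")    # "été" in UTF-8

print(latin1_path.decoded("latin-1"), utf8_path.decoded())
print(latin1_path == utf8_path)
```

Both paths display as /tmp/été, yet they stay distinct objects, and the
application decides which encoding to assume when showing them.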

As much as I wish Hilaire provided more details in his bug report, it's  
worth keeping in mind that not all users, or even all programmers,  
understand the full implications of things like how various Unicode  
normalization and encoding schemes interact in practice with Unix's very  
vague concept of what a file name actually is, so I usually try to  
approach these bug reports carefully and with an open mind.


--Benjamin

* On OS X, HFS+ uses UTF-16 with an Apple-specific variant of NFD, whereas  
I do not believe this holds for e.g. UFS or FUSE-backed file systems, so  
things are a bit subtler there, but the general rule holds.




Re: [Pharo-users] Ridiculous we are

2014-09-24 Thread Benjamin Pollack
On Wed, 24 Sep 2014 13:03:57 -0400, Sven Van Caekenberghe s...@stfx.eu  
wrote:




Did you read the actual conversation in the issue ?

 
https://pharo.fogbugz.com/f/cases/14054/Issue-with-path-with-accented-characters

It has been renamed and there is a fix (as a change set, not as a slice,  
yet). Basically, there was a primitive call into a plugin that failed to  
do encoding.




No, I apologize; I missed the bug link.  Thanks for reposting it.

Now regarding the issues you raised. Pharo does not do Unicode  
canonicalisation or any of that other fancy stuff (like categorisation,  
proper ordering and so on). This is another orthogonal and way more  
general issue.


Regarding the pathnames encoding: if the OS itself does not know it, how  
can we ?


That's actually the argument *against* using UTF-8 as the standard Pharo  
way to represent filenames--at least on Unix systems.  If Pharo used  
ByteArrays to represent paths, with convenience methods for working with  
UTF-8 (since I do agree that's the most likely thing for a user/dev to  
want), then you'd be able to work with all files no matter what, *and*  
have a convenient way of doing so for the common case.


This is an old discussion, and I do see both sides of it.  In terms of  
SCMs, Mercurial and Git both just say it's a collection of bytes,  
whereas Subversion says it's Unicode code points.  This has some  
uncomfortable implications for both systems when working on multiple  
platforms.


--Benjamin