Re: RFC: std.json successor
On 2/5/15 1:07 AM, Jakob Ovrum wrote: On Thursday, 21 August 2014 at 22:35:18 UTC, Sönke Ludwig wrote: ... Added to the review queue as a work in progress with relevant links: http://wiki.dlang.org/Review_Queue Yay! -- Andrei
Re: RFC: std.json successor
On 05.02.2015 at 10:07, Jakob Ovrum wrote: On Thursday, 21 August 2014 at 22:35:18 UTC, Sönke Ludwig wrote: ... Added to the review queue as a work in progress with relevant links: http://wiki.dlang.org/Review_Queue Thanks! It should be ready for an official review in one or two weeks, when my schedule relaxes a little bit.
Re: RFC: std.json successor
On Thursday, 21 August 2014 at 22:35:18 UTC, Sönke Ludwig wrote: ... Added to the review queue as a work in progress with relevant links: http://wiki.dlang.org/Review_Queue
Re: RFC: std.json successor
On Saturday, 18 October 2014 at 19:53:23 UTC, Sean Kelly wrote: Python 0.499547972114 0.499779920774 0.499811461578 12.01s, 1355.1Mb I assume this is the standard json module? I'm wondering how ujson performs, as it's considered the fastest Python JSON module.
Re: RFC: std.json successor
On 10/18/14, 4:53 PM, Sean Kelly wrote: On Friday, 17 October 2014 at 18:27:34 UTC, Ary Borenszweig wrote: Once it's done you can compare its performance against other languages with this benchmark: https://github.com/kostya/benchmarks/tree/master/json Wow, the C++ Rapid parser is really impressive. I threw together a test with my own parser for comparison, and Rapid still beat it. It's the first parser I've encountered that's faster.
Ruby       0.4995479721139979 0.49977992077421846 0.49981146157805545  7.53s, 2330.9Mb
Python     0.499547972114 0.499779920774 0.499811461578  12.01s, 1355.1Mb
C++ Rapid  0.499548 0.49978 0.499811  1.75s, 1009.0Mb
JEP (mine) 0.49954797 0.49977992 0.49981146  2.38s, 203.4Mb
Yes, C++ Rapid seems to be really, really fast. It has some SSE2/SSE4-specific optimizations and I guess a lot more. I have to investigate more in order to do something similar :-)
Re: RFC: std.json successor
On Friday, 17 October 2014 at 18:27:34 UTC, Ary Borenszweig wrote: Once it's done you can compare its performance against other languages with this benchmark: https://github.com/kostya/benchmarks/tree/master/json Wow, the C++ Rapid parser is really impressive. I threw together a test with my own parser for comparison, and Rapid still beat it. It's the first parser I've encountered that's faster.
Ruby       0.4995479721139979 0.49977992077421846 0.49981146157805545  7.53s, 2330.9Mb
Python     0.499547972114 0.499779920774 0.499811461578  12.01s, 1355.1Mb
C++ Rapid  0.499548 0.49978 0.499811  1.75s, 1009.0Mb
JEP (mine) 0.49954797 0.49977992 0.49981146  2.38s, 203.4Mb
Re: RFC: std.json successor
On Saturday, 18 October 2014 at 19:53:23 UTC, Sean Kelly wrote: On Friday, 17 October 2014 at 18:27:34 UTC, Ary Borenszweig wrote: Once it's done you can compare its performance against other languages with this benchmark: https://github.com/kostya/benchmarks/tree/master/json Wow, the C++ Rapid parser is really impressive. I threw together a test with my own parser for comparison, and Rapid still beat it. It's the first parser I've encountered that's faster.
C++ Rapid  0.499548 0.49978 0.499811  1.75s, 1009.0Mb
JEP (mine) 0.49954797 0.49977992 0.49981146  2.38s, 203.4Mb
I just commented out the sscanf() call that was parsing the float and re-ran the test to see what the difference would be. Here's the new timing:
JEP (mine) 0. 0. 0.  1.23s, 203.1Mb
So nearly half of the total execution time was spent simply parsing floats. For this reason, I'm starting to think that this isn't the best benchmark of JSON parser performance. The other issue with my parser is that it's written in C, so all of the user-defined bits are called via a bank of function pointers. If it were converted to C++ or D, where this could be done via templates, it would be much faster. Just as a test I nulled out the function pointers I'd set to see what the cost of indirection was, and here's the result:
JEP (mine) nan nan nan  0.57s, 109.4Mb
The memory difference is interesting, and I can't entirely explain it other than to say that it's probably an artifact of mapping the file in as virtual memory rather than reading it into an allocated buffer. Either way, roughly 0.60s can be attributed to indirect function calls and the bit of logic on the other side, which seems like a good candidate for optimization.
Re: RFC: std.json successor
On 8/21/14, 7:35 PM, Sönke Ludwig wrote: Following up on the recent std.jgrandson thread [1], I've picked up the work (a lot earlier than anticipated) and finished a first version of a loose blend of said std.jgrandson, vibe.data.json and some changes that I had planned for vibe.data.json for a while. I'm quite pleased by the results so far, although without a serialization framework it still misses a very important building block. Code: https://github.com/s-ludwig/std_data_json Docs: http://s-ludwig.github.io/std_data_json/ DUB: http://code.dlang.org/packages/std_data_json Destroy away! ;) [1]: http://forum.dlang.org/thread/lrknjl$co7$1...@digitalmars.com Once it's done you can compare its performance against other languages with this benchmark: https://github.com/kostya/benchmarks/tree/master/json
Re: RFC: std.json successor
On 12.10.2014 20:17, Andrei Alexandrescu wrote: Here's my destruction of std.data.json. * lexer.d: ** Beautifully done. From what I understand, if the input is string or immutable(ubyte)[] then the strings are carved out as slices of the input, as opposed to newly allocated. Awesome. ** The string after lexing is correctly scanned and stored in raw format (escapes are not rewritten) and decoded on demand. Problem with decoding is that it may allocate memory, and it would be great (and not difficult) to make the lexer 100% lazy/non-allocating. To achieve that, lexer.d should define TWO Kinds of strings at the lexer level: regular string and undecoded string. The former is lexer.d's way of saying I got lucky in the sense that it didn't detect any '\\' so the raw and decoded strings are identical. No need for anyone to do any further processing in the majority of cases = win. The latter means the lexer lexed the string, saw at least one '\\', and leaves it to the caller to do the actual decoding. This is actually more or less done in unescapeStringLiteral() - if it doesn't find any '\\', it just returns the original string. Also, JSONString allows accessing its .rawValue without doing any decoding/allocations. https://github.com/s-ludwig/std_data_json/blob/master/source/stdx/data/json/lexer.d#L1421 Unfortunately .rawValue can't be @nogc because the raw value might have to be constructed first when the input is not a string (in this case unescaping is done on-the-fly for efficiency reasons). ** After moving the decoding business out of lexer.d, a way to take this further would be to qualify lexer methods as @nogc if the input is string/immutable(ubyte)[]. I wonder how to implement a conditional attribute. We'll probably need a language enhancement for that. Isn't @nogc inferred? Everything is templated, so that should be possible. Or does attribute inference only work for template functions and not for methods of templated types? Should it?
** The implementation uses manually-defined tagged unions for work. Could we use Algebraic instead - dogfooding and all that? I recall there was a comment in Sönke's original work that Algebraic has a specific issue (was it false pointers?) - so the question arises, should we fix Algebraic and use it thus helping other uses as well? I had started on an implementation of a type- and ID-safe TaggedAlgebraic that uses Algebraic for its internal storage. If we can get that in first, it should be no problem to use it instead (with no or minimal API breakage). However, it uses a struct instead of an enum to define the Kind (which is the only nice way I could conceive to safely couple enum value and type at compile time), so it's not as nice in the generated documentation. ** I see the boolean kind, should we instead have the true_ and false_ kinds? I always found it cumbersome and awkward to work like that. What would be the reason to go that route? ** Long story short I couldn't find any major issue with this module, and I looked! I do think the decoding logic should be moved outside of lexer.d or at least the JSONLexerRange. * generator.d: looking good, no special comments. Like the consistent use of structs filled with options as template parameters. * foundation.d: ** At four words per token, Location seems pretty bulky. How about reducing line and column to uint? Single line JSON files >64k (or line counts >64k) are no exception, so that would only work in a limited way. My thought about this was that it is quite unusual to actually store the tokens for most purposes (especially when directly serializing to a native D type), so that it should have minimal impact on performance or memory consumption. ** Could JSONException create the message string in toString (i.e. when/if used) as opposed to in the constructor? That could of course be done, but then you'd not get the full error message using ex.msg, only with ex.toString(), which usually prints a call trace instead.
Alternatively, it's also possible to completely avoid using exceptions with LexOptions.noThrow. * parser.d: ** How about using .init instead of .defaults for options? I'd slightly tend to prefer the more explicit defaults, especially because init could mean either defaults or none (currently it means none). But another idea would be to invert the option values so that defaults==none... any objections? ** I'm a bit surprised by JSONParserNode.Kind. E.g. the objectStart/End markers shouldn't appear as nodes. There should be an object node only. I guess that's needed for laziness. While you could infer the end of an object in the parser range by looking for the first entry that doesn't start with a key node, the same would not be possible for arrays, so in general the end marker *is* required. Note that the parser range is a StAX style parser, which is still very close to the lexical structure of the document. I was also wondering if
Re: RFC: std.json successor
On 12.10.2014 21:04, Sean Kelly wrote: I'd like to see unescapeStringLiteral() made public. Then I can unescape multiple strings to the same preallocated destination, or even unescape in place (guaranteed to work since the result will always be smaller than the input). Will do. Same for the inverse functions.
Re: RFC: std.json successor
On 12.10.2014 23:52, Sean Kelly wrote: Oh, it looks like you aren't checking for 0x7F (DEL) as a control character. It doesn't get mentioned in the JSON spec, so I left it out. But I guess there's no harm in adding it anyway.
Re: RFC: std.json successor
On 13/10/14 09:39, Sönke Ludwig wrote: ** At four words per token, Location seems pretty bulky. How about reducing line and column to uint? Single line JSON files >64k (or line counts >64k) are no exception 64k? -- /Jacob Carlborg
Re: RFC: std.json successor
On 22/08/14 00:35, Sönke Ludwig wrote: Following up on the recent std.jgrandson thread [1], I've picked up the work (a lot earlier than anticipated) and finished a first version of a loose blend of said std.jgrandson, vibe.data.json and some changes that I had planned for vibe.data.json for a while. I'm quite pleased by the results so far, although without a serialization framework it still misses a very important building block. Code: https://github.com/s-ludwig/std_data_json JSONToken.Kind and JSONParserNode.Kind could be ubyte to save space. -- /Jacob Carlborg
Re: RFC: std.json successor
On 13.10.2014 13:33, Jacob Carlborg wrote: On 13/10/14 09:39, Sönke Ludwig wrote: ** At four words per token, Location seems pretty bulky. How about reducing line and column to uint? Single line JSON files >64k (or line counts >64k) are no exception 64k? Oh, I read that as both line and column packed into a single uint, because of "four words per token" - taking word == 16 bit, but Andrei obviously meant word == (void*).sizeof. If simply using uint instead of size_t is meant, then that's of course a different thing.
Re: RFC: std.json successor
On 13.10.2014 13:37, Jacob Carlborg wrote: On 22/08/14 00:35, Sönke Ludwig wrote: Following up on the recent std.jgrandson thread [1], I've picked up the work (a lot earlier than anticipated) and finished a first version of a loose blend of said std.jgrandson, vibe.data.json and some changes that I had planned for vibe.data.json for a while. I'm quite pleased by the results so far, although without a serialization framework it still misses a very important building block. Code: https://github.com/s-ludwig/std_data_json JSONToken.Kind and JSONParserNode.Kind could be ubyte to save space. But it won't save space in practice, at least on x86, due to alignment, and depending on what the compiler assumes, the access can also be slower that way.
Re: RFC: std.json successor
On 10/13/14, 4:48 AM, Sönke Ludwig wrote: On 13.10.2014 13:37, Jacob Carlborg wrote: On 22/08/14 00:35, Sönke Ludwig wrote: Following up on the recent std.jgrandson thread [1], I've picked up the work (a lot earlier than anticipated) and finished a first version of a loose blend of said std.jgrandson, vibe.data.json and some changes that I had planned for vibe.data.json for a while. I'm quite pleased by the results so far, although without a serialization framework it still misses a very important building block. Code: https://github.com/s-ludwig/std_data_json JSONToken.Kind and JSONParserNode.Kind could be ubyte to save space. But it won't save space in practice, at least on x86, due to alignment, and depending on what the compiler assumes, the access can also be slower that way. Correct. -- Andrei
Re: RFC: std.json successor
On 10/13/14, 4:45 AM, Sönke Ludwig wrote: On 13.10.2014 13:33, Jacob Carlborg wrote: On 13/10/14 09:39, Sönke Ludwig wrote: ** At four words per token, Location seems pretty bulky. How about reducing line and column to uint? Single line JSON files >64k (or line counts >64k) are no exception 64k? Oh, I read that as both line and column packed into a single uint, because of "four words per token" - taking word == 16 bit, but Andrei obviously meant word == (void*).sizeof. If simply using uint instead of size_t is meant, then that's of course a different thing. Yah, one uint for each. -- Andrei
Re: RFC: std.json successor
Here's my destruction of std.data.json. * lexer.d: ** Beautifully done. From what I understand, if the input is string or immutable(ubyte)[] then the strings are carved out as slices of the input, as opposed to newly allocated. Awesome. ** The string after lexing is correctly scanned and stored in raw format (escapes are not rewritten) and decoded on demand. Problem with decoding is that it may allocate memory, and it would be great (and not difficult) to make the lexer 100% lazy/non-allocating. To achieve that, lexer.d should define TWO Kinds of strings at the lexer level: regular string and undecoded string. The former is lexer.d's way of saying I got lucky in the sense that it didn't detect any '\\' so the raw and decoded strings are identical. No need for anyone to do any further processing in the majority of cases = win. The latter means the lexer lexed the string, saw at least one '\\', and leaves it to the caller to do the actual decoding. ** After moving the decoding business out of lexer.d, a way to take this further would be to qualify lexer methods as @nogc if the input is string/immutable(ubyte)[]. I wonder how to implement a conditional attribute. We'll probably need a language enhancement for that. ** The implementation uses manually-defined tagged unions for work. Could we use Algebraic instead - dogfooding and all that? I recall there was a comment in Sönke's original work that Algebraic has a specific issue (was it false pointers?) - so the question arises, should we fix Algebraic and use it thus helping other uses as well? ** I see the boolean kind, should we instead have the true_ and false_ kinds? ** Long story short I couldn't find any major issue with this module, and I looked! I do think the decoding logic should be moved outside of lexer.d or at least the JSONLexerRange. * generator.d: looking good, no special comments. Like the consistent use of structs filled with options as template parameters. 
* foundation.d: ** At four words per token, Location seems pretty bulky. How about reducing line and column to uint? ** Could JSONException create the message string in toString (i.e. when/if used) as opposed to in the constructor? * parser.d: ** How about using .init instead of .defaults for options? ** I'm a bit surprised by JSONParserNode.Kind. E.g. the objectStart/End markers shouldn't appear as nodes. There should be an object node only. I guess that's needed for laziness. ** It's unclear where memory is being allocated in the parser. @nogc annotations wherever appropriate would be great. * value.d: ** Looks like this is/may be the only place where memory is being managed, at least if the input is string/immutable(ubyte)[]. Right? ** Algebraic ftw. Overall: This is very close to everything I hoped! A bit more care to @nogc would be awesome, especially with the upcoming focus on memory management going forward. After one more pass it would be great to move forward for review. Andrei
Re: RFC: std.json successor
On Sunday, 12 October 2014 at 18:17:29 UTC, Andrei Alexandrescu wrote: ** The string after lexing is correctly scanned and stored in raw format (escapes are not rewritten) and decoded on demand. Problem with decoding is that it may allocate memory, and it would be great (and not difficult) to make the lexer 100% lazy/non-allocating. To achieve that, lexer.d should define TWO Kinds of strings at the lexer level: regular string and undecoded string. The former is lexer.d's way of saying I got lucky in the sense that it didn't detect any '\\' so the raw and decoded strings are identical. No need for anyone to do any further processing in the majority of cases = win. The latter means the lexer lexed the string, saw at least one '\\', and leaves it to the caller to do the actual decoding. I'd like to see unescapeStringLiteral() made public. Then I can unescape multiple strings to the same preallocated destination, or even unescape in place (guaranteed to work since the result will always be smaller than the input).
Re: RFC: std.json successor
Oh, it looks like you aren't checking for 0x7F (DEL) as a control character.
Re: RFC: std.json successor
Been using it for a bit now. I think the only thing I have to say is that having to insert all of those `JSONValue`s everywhere is tiresome, and I never know when I have to do it. Atila On Thursday, 21 August 2014 at 22:35:18 UTC, Sönke Ludwig wrote: Following up on the recent std.jgrandson thread [1], I've picked up the work (a lot earlier than anticipated) and finished a first version of a loose blend of said std.jgrandson, vibe.data.json and some changes that I had planned for vibe.data.json for a while. I'm quite pleased by the results so far, although without a serialization framework it still misses a very important building block. Code: https://github.com/s-ludwig/std_data_json Docs: http://s-ludwig.github.io/std_data_json/ DUB: http://code.dlang.org/packages/std_data_json The new code contains:
- Lazy lexer in the form of a token input range (using slices of the input if possible)
- Lazy streaming parser (StAX style) in the form of a node input range
- Eager DOM style parser returning a JSONValue
- Range based JSON string generator taking either a token range, a node range, or a JSONValue
- Opt-out location tracking (line/column) for tokens, nodes and values
- No opDispatch() for JSONValue - this has been shown to do more harm than good in vibe.data.json
The DOM style JSONValue type is based on std.variant.Algebraic. This currently has a few usability issues that can be solved by upgrading/fixing Algebraic:
- Operator overloading only works sporadically
- No tag enum is supported, so that switch()ing on the type of a value doesn't work and an if-else cascade is required
- Operations and conversions between different Algebraic types are not conveniently supported, which becomes important when other similar formats get supported (e.g. BSON)
Assuming that those points are solved, I'd like to get some early feedback before going for an official review. One open issue is how to handle unescaping of string literals.
Currently it always unescapes immediately, which is more efficient for general input ranges when the unescaped result is needed, but less efficient for string inputs when the unescaped result is not needed. Maybe a flag could be used to conditionally switch behavior depending on the input range type. Destroy away! ;) [1]: http://forum.dlang.org/thread/lrknjl$co7$1...@digitalmars.com
Re: RFC: std.json successor
On Wednesday, 27 August 2014 at 23:51:54 UTC, Walter Bright wrote: On 8/26/2014 12:24 AM, Don wrote: On Monday, 25 August 2014 at 23:29:21 UTC, Walter Bright wrote: On 8/25/2014 4:15 PM, Ola Fosheim Grøstad ola.fosheim.grostad+dl...@gmail.com wrote: On Monday, 25 August 2014 at 21:24:11 UTC, Walter Bright wrote: I didn't know that. But recall I did implement it in DMC++, and it turned out to simply not be useful. I'd be surprised if the new C++ support for it does anything worthwhile. Well, one should initialize with signaling NaN. Then you get an exception if you try to compute using uninitialized values. That's the theory. The practice doesn't work out so well. To be more concrete: Processors from AMD have signalling NaN behaviour which is different from processors from Intel. And the situation is worst on most other architectures. It's a lost cause, I think. The other issues were just when the snan = qnan conversion took place. This is quite unclear given the extensive constant folding, CTFE, etc., that D does. It was also affected by how dmd generates code. Some code gen on floating point doesn't need the FPU, such as toggling the sign bit. But then what happens with snan = qnan? The whole thing is an undefined, unmanageable mess. I think the way to think of it is, to the programmer, there is *no such thing* as an snan value. It's an implementation detail that should be invisible. Semantically, a signalling nan is a qnan value with a hardware breakpoint on it. An SNAN should never enter the CPU. The CPU always converts them to QNAN if you try. You're kind of not supposed to know that SNAN exists. Because of this, I think SNAN only ever makes sense for static variables. Setting local variables to snan doesn't make sense, since the snan has to enter the CPU. Making that work without triggering the snan is very painful. Making it trigger the snan on all forms of access is even worse.
If float.init exists, it cannot be an snan, since you are allowed to use float.init.
Re: RFC: std.json successor
On Thursday, 28 August 2014 at 11:09:16 UTC, Don wrote: I think the way to think of it is, to the programmer, there is *no such thing* as an snan value. It's an implementation detail that should be invisible. Semantically, a signalling nan is a qnan value with a hardware breakpoint on it. I disagree with this view.
QNAN: there is a value, but it does not result in a real.
SNAN: the value is missing for an unspecified reason.
AFAIK some x86 ops such as ROUNDPD allow you to treat SNAN as QNAN or throw an exception. So there is a builtin test if needed. Other ops such as reciprocals don't throw any FP exceptions and will treat SNAN as QNAN. An SNAN should never enter the CPU. The CPU always converts them to QNAN if you try. You're kind of not supposed to know that SNAN exists. I'm not sure how you reached this interpretation? The solution should be to emit a test for SNAN explicitly or implicitly if you cannot prove that SNAN is impossible.
Re: RFC: std.json successor
Or to be more explicit: If you have SNAN then there is no point in trying to recompute the expression using a different algorithm. If you have QNAN then you might want to recompute the expression using a different algorithm (e.g. complex numbers or analytically). ?
Re: RFC: std.json successor
On Thursday, 28 August 2014 at 12:10:58 UTC, Ola Fosheim Grøstad wrote: Or to be more explicit: If you have SNAN then there is no point in trying to recompute the expression using a different algorithm. If you have QNAN then you might want to recompute the expression using a different algorithm (e.g. complex numbers or analytically). ? No. Once you load an SNAN, it isn't an SNAN any more! It is a QNAN. You cannot have an SNAN in a floating-point register (unless you do a nasty hack to pass it in). It gets converted during loading.
const float x = snan;
x = x; // x is now a qnan.
Re: RFC: std.json successor
On Thursday, 28 August 2014 at 14:43:30 UTC, Don wrote: No. Once you load an SNAN, it isn't an SNAN any more! It is a QNAN. By which definition? It is only if you consume the SNAN with an fp-exception-free arithmetic op that it should be turned into a QNAN. If you compute with an op that throws then it should throw an exception. MOV should not be viewed as a computation… It also makes sense to save SNAN to file when converting corrupted data-files. SNAN could then mean corrupted and QNAN could mean absent. You should not get an exception for loading a file. You should get an exception if you start computing on the SNAN in the file. You cannot have an SNAN in a floating-point register (unless you do a nasty hack to pass it in). It gets converted during loading. I don't understand this position. If you cannot load SNAN then why does SSE handle SNAN in arithmetic ops and compares? const float x = snan; x = x; // x is now a qnan. I disagree (and why const?) Assignment does nothing, it should not consume the SNAN. Assignment is just naming. It is not computing.
Re: RFC: std.json successor
Let me try again:
SNAN = unfortunately absent
QNAN = deliberately absent
So you can have:
compute(SNAN) = handle(exception) {
  if (can turn unfortunate situation into deliberate)
    compute(QNAN)
  else
    throw
}
Re: RFC: std.json successor
Kahan states this in a 1997 paper: «[…]An SNaN may be moved ( copied ) without incident, but any other arithmetic operation upon an SNaN is an INVALID operation ( and so is loading one onto the ix87's stack ) that must trap or else produce a new nonsignaling NaN. ( Another way to turn an SNaN into a NaN is to turn 0xxx...xxx into 1xxx...xxx with a logical OR.) Intended for, among other things, data missing from statistical collections, and for uninitialized variables[…]» ( http://www.eecs.berkeley.edu/~wkahan/ieee754status/IEEE754.PDF) x87 is legacy, it predates IEEE754 by 5 years and should be forgotten. Note also that the string representation for a signalling nan is NANS, so it reasonable to save it to file if you need to represent missing data. NAN represents 0/0, sqrt(-1), not missing data. I'm not really sure how it can be interpreted differently? Ola.
Re: RFC: std.json successor
On 8/26/2014 12:24 AM, Don wrote: On Monday, 25 August 2014 at 23:29:21 UTC, Walter Bright wrote: On 8/25/2014 4:15 PM, Ola Fosheim Grøstad ola.fosheim.grostad+dl...@gmail.com wrote: On Monday, 25 August 2014 at 21:24:11 UTC, Walter Bright wrote: I didn't know that. But recall I did implement it in DMC++, and it turned out to simply not be useful. I'd be surprised if the new C++ support for it does anything worthwhile. Well, one should initialize with signaling NaN. Then you get an exception if you try to compute using uninitialized values. That's the theory. The practice doesn't work out so well. To be more concrete: Processors from AMD have signalling NaN behaviour which is different from processors from Intel. And the situation is worst on most other architectures. It's a lost cause, I think. The other issues were just when the snan = qnan conversion took place. This is quite unclear given the extensive constant folding, CTFE, etc., that D does. It was also affected by how dmd generates code. Some code gen on floating point doesn't need the FPU, such as toggling the sign bit. But then what happens with snan = qnan? The whole thing is an undefined, unmanageable mess.
Re: RFC: std.json successor
On 25/08/14 21:49, simendsjo wrote: So ldc can remove quite a substantial amount of code in some cases. It's because the latest release of LDC has the --gc-sections flag enabled by default. -- /Jacob Carlborg
Re: RFC: std.json successor
On Monday, 25 August 2014 at 23:29:21 UTC, Walter Bright wrote: On 8/25/2014 4:15 PM, Ola Fosheim Grøstad ola.fosheim.grostad+dl...@gmail.com wrote: On Monday, 25 August 2014 at 21:24:11 UTC, Walter Bright wrote: I didn't know that. But recall I did implement it in DMC++, and it turned out to simply not be useful. I'd be surprised if the new C++ support for it does anything worthwhile. Well, one should initialize with signaling NaN. Then you get an exception if you try to compute using uninitialized values. That's the theory. The practice doesn't work out so well. To be more concrete: Processors from AMD have signalling NaN behaviour which is different from processors from Intel. And the situation is worst on most other architectures. It's a lost cause, I think.
Re: RFC: std.json successor
On Tuesday, 26 August 2014 at 07:24:19 UTC, Don wrote: Processors from AMD have signalling NaN behaviour which is different from processors from Intel. And the situation is worst on most other architectures. It's a lost cause, I think. I disagree. AFAIK signaling NaN was standardized in IEEE 754-2008. So it receives attention.
Re: RFC: std.json successor
On 25.08.2014 23:53, Ola Fosheim Grøstad ola.fosheim.grostad+dl...@gmail.com wrote: On Monday, 25 August 2014 at 21:27:42 UTC, Sönke Ludwig wrote: But why should UTF validation be the job of the lexer in the first place? Because you want to save time, it is faster to integrate validation? The most likely use scenario is to receive REST data over HTTP that needs validation. Well, so then I agree with Andrei… array of bytes it is. ;-) added as a separate proxy range. But if we end up going for validating in the lexer, it would indeed be enough to validate inside strings, because the rest of the grammar assumes a subset of ASCII. Not assumes, but defines! :-) I guess it depends on if you look at the grammar as productions or comprehensions (right term?) ;) If you have to validate UTF before lexing then you will end up needlessly scanning lots of ascii if the file contains lots of non-strings or is from an encoder that only sends pure ascii. That's true. So the ideal solution would be to *assume* UTF-8 when the input is char based and to *validate* if the input is numeric. If you want to have plugin validation of strings then you also need to differentiate strings so that the user can select which data should be just ascii, utf8, numbers, ids etc. Otherwise the user will end up doing double validation (you have to bypass 7F followed by string-end anyway). The advantage of integrated validation is that you can use 16-byte SIMD registers on the buffer. I presume you can load 16 bytes and do BITWISE-AND on the MSB, then match against string-end and carefully use this to boost performance of simultaneous UTF validation, escape-scanning, and string-end scan. A bit tricky, of course. Well, that's something that's definitely out of the scope of this proposal. Definitely an interesting direction to pursue, though. At least no UTF validation is needed.
Since all non-ASCII characters will always be composed of bytes > 0x7F, a sequence \uXXXX can be assumed to be valid wherever in the string it occurs, and all other bytes that don't belong to an escape sequence are just passed through as-is. You cannot assume \u… to be valid if you convert it. I meant X to stand for a hex digit. The point was just that you don't have to worry about interacting in a bad way with UTF sequences when you find \uXXXX.
Re: RFC: std.json sucessor
On Tuesday, 26 August 2014 at 07:51:04 UTC, Sönke Ludwig wrote: That's true. So the ideal solution would be to *assume* UTF-8 when the input is char based and to *validate* if the input is numeric. I think you should validate JSON-strings to be UTF-8 encoded even if you allow illegal Unicode values. Basically ensuring that a byte > 0x7f has the right number of bytes after it, so you don't get a byte > 0x7f as the last byte in a string etc. Well, that's something that's definitely out of the scope of this proposal. Definitely an interesting direction to pursue, though. Maybe the interface/code structure is or could be designed so that the implementation could later be version()'ed to SIMD where possible. You cannot assume \u… to be valid if you convert it. I meant X to stand for a hex digit. The point was just that you don't have to worry about interacting in a bad way with UTF sequences when you find \uXXXX. When you convert \uXXXX to UTF-8 bytes, is it then validated as a legal code point? I guess it is not necessary. Btw, I believe rapidJSON achieves high speed by converting strings in situ, so that if the prefix is escape-free it just converts in place when it hits the first escape. Thus avoiding some moving.
Re: RFC: std.json sucessor
Am 26.08.2014 03:31, schrieb Entusiastic user: Hi! Thanks for the effort you've put into this. I am having problems building with LDC 0.14.0. DMD 2.066.0 seems to work fine (all unit tests pass). Do you have any ideas why? I've fixed all errors on DMD 2.065 now. Hopefully that should also fix LDC.
Re: RFC: std.json sucessor
Am 26.08.2014 10:24, schrieb Ola Fosheim Grøstad ola.fosheim.grostad+dl...@gmail.com: On Tuesday, 26 August 2014 at 07:51:04 UTC, Sönke Ludwig wrote: That's true. So the ideal solution would be to *assume* UTF-8 when the input is char based and to *validate* if the input is numeric. I think you should validate JSON-strings to be UTF-8 encoded even if you allow illegal Unicode values. Basically ensuring that a byte > 0x7f has the right number of bytes after it, so you don't get a byte > 0x7f as the last byte in a string etc. I think this is a misunderstanding. What I mean is that if the input range passed to the lexer is char/wchar/dchar based, the lexer should assume that the input is well-formed UTF. After all this is how D strings are defined. When on the other hand a ubyte/ushort/uint range is used, the lexer should validate all string literals. Well, that's something that's definitely out of the scope of this proposal. Definitely an interesting direction to pursue, though. Maybe the interface/code structure is or could be designed so that the implementation could later be version()'ed to SIMD where possible. I guess that shouldn't be an issue. From the outside it's just a generic range that is passed in and internally it's always possible to add special cases for array inputs. If someone else wants to play around with this idea, we could of course also integrate it right away, it's just that I personally don't have the time to go to the extreme here. You cannot assume \u… to be valid if you convert it. I meant X to stand for a hex digit. The point was just that you don't have to worry about interacting in a bad way with UTF sequences when you find \uXXXX. When you convert \uXXXX to UTF-8 bytes, is it then validated as a legal code point? I guess it is not necessary. What is validated is that it forms valid UTF-16 surrogate pairs, and those are converted to a single dchar instead (if applicable). This is necessary, because otherwise the lexer would produce invalid UTF-8 for valid inputs.
Apart from that, the value is used verbatim as a dchar. Btw, I believe rapidJSON achieves high speed by converting strings in situ, so that if the prefix is escape free it just converts in place when it hits the first escape. Thus avoiding some moving. The same is true for this lexer, at least for array inputs. It actually currently just stores a slice of the string literal in all cases and lazily decodes on the first access. While doing that, it first skips any escape sequence free prefix and returns a slice if the whole string is escape sequence free.
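The surrogate-pair handling described above can be illustrated standalone. This Python helper is hypothetical (not the lexer's actual code): it combines a high-surrogate \uXXXX escape with its low partner into a single code point, which is the step needed to emit valid UTF-8 for inputs like "\uD834\uDD1E".

```python
def decode_u_escape(hex1: str, hex2: str = None) -> str:
    """Decode one \\uXXXX escape (plus a partner escape for surrogates)."""
    unit = int(hex1, 16)
    if 0xD800 <= unit <= 0xDBFF:          # high surrogate: needs a partner
        if hex2 is None:
            raise ValueError("unpaired high surrogate")
        low = int(hex2, 16)
        if not 0xDC00 <= low <= 0xDFFF:
            raise ValueError("invalid low surrogate")
        # combine the pair into a single code point above U+FFFF
        return chr(0x10000 + ((unit - 0xD800) << 10) + (low - 0xDC00))
    if 0xDC00 <= unit <= 0xDFFF:
        raise ValueError("unpaired low surrogate")
    return chr(unit)                       # BMP code point, used verbatim

assert decode_u_escape("0041") == "A"
assert decode_u_escape("D834", "DD1E") == "\U0001D11E"  # musical G clef
```

Without this pairing step, naively emitting each escape as its own code point would produce ill-formed UTF-8 for perfectly valid JSON input.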
Re: RFC: std.json sucessor
On Tuesday, 26 August 2014 at 09:05:05 UTC, Sönke Ludwig wrote: When on the other hand a ubyte/ushort/uint range is used, the lexer should validate all string literals. Yes, so this will be supported? Because this is what is most useful.
Re: RFC: std.json sucessor
I tried using -disable-linker-strip-dead, but it had no effect. From the error messages it seems the problem is compile-time and not link-time... On Tuesday, 26 August 2014 at 07:01:09 UTC, Jacob Carlborg wrote: On 25/08/14 21:49, simendsjo wrote: So ldc can remove quite a substantial amount of code in some cases. It's because the latest release of LDC has the --gc-sections flag enabled by default.
Re: RFC: std.json sucessor
Am 26.08.2014 11:11, schrieb Ola Fosheim Grøstad ola.fosheim.grostad+dl...@gmail.com: On Tuesday, 26 August 2014 at 09:05:05 UTC, Sönke Ludwig wrote: When on the other hand a ubyte/ushort/uint range is used, the lexer should validate all string literals. Yes, so this will be supported? Because this is what is most useful. If nobody plays a veto card, I'll implement it that way.
Re: RFC: std.json sucessor
On Tuesday, 26 August 2014 at 07:34:05 UTC, Ola Fosheim Grøstad wrote: On Tuesday, 26 August 2014 at 07:24:19 UTC, Don wrote: Processors from AMD have signalling NaN behaviour which is different from processors from Intel. And the situation is worse on most other architectures. It's a lost cause, I think. I disagree. AFAIK signaling NaN was standardized in IEEE 754-2008. So it receives attention. It was always in IEEE754. The decision in 754-2008 was simply to not remove it from the spec (a lot of people wanted to remove it). I don't think anything has changed. The point is, existing hardware does not support it consistently. It's not possible at reasonable cost.

---
real uninitialized_var = real.snan;

void foo() {
    real other_var = void;
    asm {
        fld uninitialized_var;
        fstp other_var;
    }
}
---

will signal on AMD, but not Intel. I'd love for this to work, but the hardware is fighting against us. I think it's useful only for debugging.
Re: RFC: std.json sucessor
On Thursday, 21 August 2014 at 22:35:18 UTC, Sönke Ludwig wrote: Following up on the recent std.jgrandson thread [1], I've picked up the work (a lot earlier than anticipated) and finished a first version of a loose blend of said std.jgrandson, vibe.data.json and some changes that I had planned for vibe.data.json for a while. I'm quite pleased by the results so far, although without a serialization framework it still misses a very important building block. Code: https://github.com/s-ludwig/std_data_json Docs: http://s-ludwig.github.io/std_data_json/ DUB: http://code.dlang.org/packages/std_data_json Do we have any benchmarks for this yet? Note that the main motivation for a new json parser was that std.json is remarkably slow in comparison to python's json or ujson.
Re: RFC: std.json sucessor
On Tuesday, 26 August 2014 at 10:55:20 UTC, Don wrote: It was always in IEEE754. The decision in 754-2008 was simply to not remove it from the spec (a lot of people wanted to remove it). I don't think anything has changed. It was implementation defined before. I think they specified the bit in 2008. fld uninitialized_var; fstp other_var; This is not SSE, but I guess MOVSS does not create exceptions either. AVX is quite complicated, but searching for signaling gives some hints about the semantics you can rely on. https://software.intel.com/en-us/articles/introduction-to-intel-advanced-vector-extensions https://software.intel.com/sites/default/files/managed/c6/a9/319433-020.pdf Ola.
Re: RFC: std.json sucessor
On Tuesday, 26 August 2014 at 12:37:58 UTC, Ola Fosheim Grøstad wrote: either. AVX is quite complicated, but searching for signaling gives some hints about the semantics you can rely on. … https://software.intel.com/sites/default/files/managed/c6/a9/319433-020.pdf (Actually, searching for SNAN is better…)
Re: RFC: std.json sucessor
With the danger of being noisy, these instructions are subject to floating point exceptions according to my (perhaps sloppy) reading of Intel Architecture Instruction Set Extensions Programming Reference (2012): (V)ADDPD, (V)ADDPS, (V)ADDSUBPD, (V)ADDSUBPS, (V)CMPPD, (V)CMPPS, (V)CVTDQ2PS, (V)CVTPD2DQ, (V)CVTPD2PS, (V)CVTPS2DQ, (V)CVTTPD2DQ, (V)CVTTPS2DQ, (V)DIVPD, (V)DIVPS, (V)DPPD*, (V)DPPS*, VFMADD132PD, VFMADD213PD, VFMADD231PD, VFMADD132PS, VFMADD213PS, VFMADD231PS, VFMADDSUB132PD, VFMADDSUB213PD, VFMADDSUB231PD, VFMADDSUB132PS, VFMADDSUB213PS, VFMADDSUB231PS, VFMSUBADD132PD, VFMSUBADD213PD, VFMSUBADD231PD, VFMSUBADD132PS, VFMSUBADD213PS, VFMSUBADD231PS, VFMSUB132PD, VFMSUB213PD, VFMSUB231PD, VFMSUB132PS, VFMSUB213PS, VFMSUB231PS, VFNMADD132PD, VFNMADD213PD, VFNMADD231PD, VFNMADD132PS, VFNMADD213PS, VFNMADD231PS, VFNMSUB132PD, VFNMSUB213PD, VFNMSUB231PD, VFNMSUB132PS, VFNMSUB213PS, VFNMSUB231PS, (V)HADDPD, (V)HADDPS, (V)HSUBPD, (V)HSUBPS, (V)MAXPD, (V)MAXPS, (V)MINPD, (V)MINPS, (V)MULPD, (V)MULPS, (V)ROUNDPS, (V)ROUNDPS, (V)SQRTPD, (V)SQRTPS, (V)SUBPD, (V)SUBPS (V)ADDSD, (V)ADDSS, (V)CMPSD, (V)CMPSS, (V)COMISD, (V)COMISS, (V)CVTPS2PD, (V)CVTSD2SI, (V)CVTSD2SS, (V)CVTSI2SD, (V)CVTSI2SS, (V)CVTSS2SD, (V)CVTSS2SI, (V)CVTTSD2SI, (V)CVTTSS2SI, (V)DIVSD, (V)DIVSS, VFMADD132SD, VFMADD213SD, VFMADD231SD, VFMADD132SS, VFMADD213SS, VFMADD231SS, VFMSUB132SD, VFMSUB213SD, VFMSUB231SD, VFMSUB132SS, VFMSUB213SS, VFMSUB231SS, VFNMADD132SD, VFNMADD213SD, VFNMADD231SD, VFNMADD132SS, VFNMADD213SS, VFNMADD231SS, VFNMSUB132SD, VFNMSUB213SD, VFNMSUB231SD, VFNMSUB132SS, VFNMSUB213SS, VFNMSUB231SS, (V)MAXSD, (V)MAXSS, (V)MINSD, (V)MINSS, (V)MULSD, (V)MULSS, (V)ROUNDSD, (V)ROUNDSS, (V)SQRTSD, (V)SQRTSS, (V)SUBSD, (V)SUBSS, (V)UCOMISD, (V)UCOMISS VCVTPH2PS, VCVTPS2PH So I guess Intel floating point exceptions trigger on computations, but not on moves? Ola.
Re: RFC: std.json sucessor
On Tuesday, 26 August 2014 at 12:37:58 UTC, Ola Fosheim Grøstad wrote: On Tuesday, 26 August 2014 at 10:55:20 UTC, Don wrote: It was always in IEEE754. The decision in 754-2008 was simply to not remove it from the spec (a lot of people wanted to remove it). I don't think anything has changed. It was implementation defined before. I think they specified the bit in 2008. fld uninitialized_var; fstp other_var; This is not SSE, but I guess MOVSS does not create exceptions either. No, it's more subtle. On the original x87, signalling NaNs are triggered for 64-bit loads, but not for 80-bit loads. You have to read the fine print to discover this. I don't think the behaviour was intentional.
Re: RFC: std.json sucessor
On Tuesday, 26 August 2014 at 13:24:11 UTC, Don wrote: No, it's more subtle. On the original x87, signalling NaNs are triggered for 64-bit loads, but not for 80-bit loads. You have to read the fine print to discover this. You are right, but it happens for loads from the FP-stack too: «Source operand is an SNaN. Does not occur if the source operand is in double extended-precision floating-point format (FLD m80fp or FLD ST(i)).» I don't think the behaviour was intentional. It seems reasonable, you need to load/save NaNs without exceptions if you do a context switch? I don't think the extended format was meant for end users. Anyway, the x87 FP stack is history, even MOVSS is considered legacy by Intel…
Re: RFC: std.json sucessor
On Monday, 25 August 2014 at 14:04:12 UTC, Sönke Ludwig wrote: Am 25.08.2014 15:07, schrieb Don: On Thursday, 21 August 2014 at 22:35:18 UTC, Sönke Ludwig wrote: Following up on the recent std.jgrandson thread [1], I've picked up the work (a lot earlier than anticipated) and finished a first version of a loose blend of said std.jgrandson, vibe.data.json and some changes that I had planned for vibe.data.json for a while. I'm quite pleased by the results so far, although without a serialization framework it still misses a very important building block. Code: https://github.com/s-ludwig/std_data_json Docs: http://s-ludwig.github.io/std_data_json/ DUB: http://code.dlang.org/packages/std_data_json The new code contains: - Lazy lexer in the form of a token input range (using slices of the input if possible) - Lazy streaming parser (StAX style) in the form of a node input range - Eager DOM style parser returning a JSONValue - Range based JSON string generator taking either a token range, a node range, or a JSONValue - Opt-out location tracking (line/column) for tokens, nodes and values - No opDispatch() for JSONValue - this has shown to do more harm than good in vibe.data.json The DOM style JSONValue type is based on std.variant.Algebraic. This currently has a few usability issues that can be solved by upgrading/fixing Algebraic: - Operator overloading only works sporadically - No tag enum is supported, so that switch()ing on the type of a value doesn't work and an if-else cascade is required - Operations and conversions between different Algebraic types is not conveniently supported, which gets important when other similar formats get supported (e.g. BSON) Assuming that those points are solved, I'd like to get some early feedback before going for an official review. One open issue is how to handle unescaping of string literals. 
Currently it always unescapes immediately, which is more efficient for general input ranges when the unescaped result is needed, but less efficient for string inputs when the unescaped result is not needed. Maybe a flag could be used to conditionally switch behavior depending on the input range type. Destroy away! ;) [1]: http://forum.dlang.org/thread/lrknjl$co7$1...@digitalmars.com One missing feature (which is also missing from the existing std.json) is support for NaN and Infinity as JSON values. Although they are not part of the formal JSON spec (which is a ridiculous omission, the argument given for excluding them is fallacious), they do get generated if you use Javascript's toString to create the JSON. Many JSON libraries (eg Google's) also generate them, so they are frequently encountered in practice. So a JSON parser should at least be able to lex them. ie this should be parsable: {foo: NaN, bar: Infinity, baz: -Infinity} This would probably best added as another (CT) optional feature. I think the default should strictly adhere to the JSON specification, though. Yes, it should be optional, but not a compile-time option. I think it should parse it, and based on a runtime flag, throw an error (perhaps an OutOfRange error or something, and use the same thing for values that exceed the representable range). An app may accept these non-standard values under certain circumstances and not others. In real-world code, you see a *lot* of these guys. Part of the reason these are important, is that NaN or Infinity generally means some Javascript code just has an uninitialized variable. Any other kind of invalid JSON typically means something very nasty has happened. It's important to distinguish these.
Re: RFC: std.json sucessor
Am 26.08.2014 15:43, schrieb Don: On Monday, 25 August 2014 at 14:04:12 UTC, Sönke Ludwig wrote: Am 25.08.2014 15:07, schrieb Don: ie this should be parsable: {foo: NaN, bar: Infinity, baz: -Infinity} This would probably best added as another (CT) optional feature. I think the default should strictly adhere to the JSON specification, though. Yes, it should be optional, but not a compile-time option. I think it should parse it, and based on a runtime flag, throw an error (perhaps an OutOfRange error or something, and use the same thing for values that exceed the representable range). An app may accept these non-standard values under certain circumstances and not others. In real-world code, you see a *lot* of these guys. Why not a compile time option? That sounds to me like such an app should simply enable parsing those values and manually test for NaN at places where it matters. For all other (the majority) of applications, encountering NaN/Infinity will simply mean that there is a bug, so it makes sense to not accept those at all by default. Apart from that I don't think that it's a good idea for the lexer in general to accept non-standard input by default. Part of the reason these are important, is that NaN or Infinity generally means some Javascript code just has an uninitialized variable. Any other kind of invalid JSON typically means something very nasty has happened. It's important to distinguish these. As far as I understood, JavaScript will output those special values as null (at least when not using external JSON libraries). But even if not, an uninitialized variable can also be very nasty, so it's hard to see why that kind of bug should be silently supported (by default).
Re: RFC: std.json sucessor
On Tuesday, 26 August 2014 at 13:43:56 UTC, Ola Fosheim Grøstad wrote: Anyway, the x87 FP stack is history, even MOVSS is considered legacy by Intel… Sorry for being off-topic, but MOVSS and VMOVSS on AMD don't throw FP exceptions either, but calculations do. So it seems like AMD and Intel are sufficiently close for D to support NaNs, IMHO. Forget the legacy… http://amd-dev.wpengine.netdna-cdn.com/wordpress/media/2012/10/26568_APM_v41.pdf Ola.
Re: RFC: std.json sucessor
On Tuesday, 26 August 2014 at 14:06:42 UTC, Sönke Ludwig wrote: Am 26.08.2014 15:43, schrieb Don: On Monday, 25 August 2014 at 14:04:12 UTC, Sönke Ludwig wrote: Am 25.08.2014 15:07, schrieb Don: ie this should be parsable: {foo: NaN, bar: Infinity, baz: -Infinity} This would probably best added as another (CT) optional feature. I think the default should strictly adhere to the JSON specification, though. Yes, it should be optional, but not a compile-time option. I think it should parse it, and based on a runtime flag, throw an error (perhaps an OutOfRange error or something, and use the same thing for values that exceed the representable range). An app may accept these non-standard values under certain circumstances and not others. In real-world code, you see a *lot* of these guys. Why not a compile time option? That sounds to me like such an app should simply enable parsing those values and manually test for NaN at places where it matters. For all other (the majority) of applications, encountering NaN/Infinity will simply mean that there is a bug, so it makes sense to not accept those at all by default. Apart from that I don't think that it's a good idea for the lexer in general to accept non-standard input by default. Please note, I've been talking about the lexer. I'm choosing my words very carefully. Part of the reason these are important, is that NaN or Infinity generally means some Javascript code just has an uninitialized variable. Any other kind of invalid JSON typically means something very nasty has happened. It's important to distinguish these. As far as I understood, JavaScript will output those special values as null (at least when not using external JSON libraries). No. Javascript generates them directly. Naive JS code generates these guys. That's why they're so important. But even if not, an uninitialized variable can also be very nasty, so it's hard to see why that kind of bug should be silently supported (by default). 
I never said it should be accepted by default. I said it is a situation which should be *lexed*. Ideally, by default it should give a different error from simply 'invalid JSON'. I believe it should ALWAYS be lexed, even if an error is ultimately generated. This is the difference: if you get NaN or Infinity, there's probably a straightforward bug in the Javascript code, but your D code is fine. Any other kind of JSON parsing error means you've got a garbage string that isn't JSON at all. They are very different errors. It's a diagnostics issue.
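For comparison, Python's standard json module behaves the way Don suggests: the non-standard constants are always lexed, and a runtime hook decides whether to accept them or fail with an error distinct from a generic syntax error. (The keys are quoted here; unquoted keys as in the earlier example are a separate extension that Python does not accept.)

```python
import json
import math

doc = '{"foo": NaN, "bar": Infinity, "baz": -Infinity}'

# Accepted by default (a deliberate, non-standard extension):
v = json.loads(doc)
assert math.isnan(v["foo"]) and v["bar"] == math.inf

# Rejected with a *specific* diagnostic via the parse_constant hook,
# which is only invoked for 'NaN', 'Infinity' and '-Infinity':
def reject(name):
    raise ValueError("non-standard JSON constant: " + name)

try:
    json.loads(doc, parse_constant=reject)
    accepted = True
except ValueError:
    accepted = False
assert not accepted
```

The hook never fires for ordinary garbage input, so "JS emitted an uninitialized value" and "this isn't JSON at all" stay distinguishable, which is exactly the diagnostics point being made.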
Re: RFC: std.json sucessor
Am 26.08.2014 16:40, schrieb Don: On Tuesday, 26 August 2014 at 14:06:42 UTC, Sönke Ludwig wrote: Am 26.08.2014 15:43, schrieb Don: On Monday, 25 August 2014 at 14:04:12 UTC, Sönke Ludwig wrote: Am 25.08.2014 15:07, schrieb Don: ie this should be parsable: {foo: NaN, bar: Infinity, baz: -Infinity} This would probably best added as another (CT) optional feature. I think the default should strictly adhere to the JSON specification, though. Yes, it should be optional, but not a compile-time option. I think it should parse it, and based on a runtime flag, throw an error (perhaps an OutOfRange error or something, and use the same thing for values that exceed the representable range). An app may accept these non-standard values under certain circumstances and not others. In real-world code, you see a *lot* of these guys. Why not a compile time option? That sounds to me like such an app should simply enable parsing those values and manually test for NaN at places where it matters. For all other (the majority) of applications, encountering NaN/Infinity will simply mean that there is a bug, so it makes sense to not accept those at all by default. Apart from that I don't think that it's a good idea for the lexer in general to accept non-standard input by default. Please note, I've been talking about the lexer. I'm choosing my words very carefully. I've been talking about the lexer, too. Sorry for the confusing use of the term parsing (after all, the lexer is also a parser, but anyway). Part of the reason these are important, is that NaN or Infinity generally means some Javascript code just has an uninitialized variable. Any other kind of invalid JSON typically means something very nasty has happened. It's important to distinguish these. As far as I understood, JavaScript will output those special values as null (at least when not using external JSON libraries). No. Javascript generates them directly. Naive JS code generates these guys. That's why they're so important. 
JSON.stringify(0/0) == null holds for all browsers that I've tested. But even if not, an uninitialized variable can also be very nasty, so it's hard to see why that kind of bug should be silently supported (by default). I never said it should be accepted by default. I said it is a situation which should be *lexed*. Ideally, by default it should give a different error from simply 'invalid JSON'. I believe it should ALWAYS be lexed, even if an error is ultimately generated. This is the difference: if you get NaN or Infinity, there's probably a straightforward bug in the Javascript code, but your D code is fine. Any other kind of JSON parsing error means you've got a garbage string that isn't JSON at all. They are very different errors. It's a diagnostics issue. The error will be more like "filename(line:column): Invalid token" - possibly the text following the line/column could also be displayed. Wouldn't that be sufficient?
Re: RFC: std.json sucessor
On Tuesday, 26 August 2014 at 14:40:02 UTC, Don wrote: This is the difference: if you get NaN or Infinity, there's probably a straightforward bug in the Javascript code, but your D code is fine. Any other kind of JSON parsing error means you've got a garbage string that isn't JSON at all. They are very different errors. I don't care either way, but JSON.stringify() has the following support: IE8 and up Firefox 3.5 and up Safari 4 and up Chrome So not using it is very much legacy…
Re: RFC: std.json sucessor
Am 26.08.2014 16:51, schrieb Sönke Ludwig: Am 26.08.2014 16:40, schrieb Don: This is the difference: if you get NaN or Infinity, there's probably a straightforward bug in the Javascript code, but your D code is fine. Any other kind of JSON parsing error means you've got a garbage string that isn't JSON at all. They are very different errors. It's a diagnostics issue. The error will be more like filename(line:column): Invalid token - possibly the text following the line/column could also be displayed. Wouldn't that be sufficient? One argument against supporting it in the parser is that the parser currently works without any configuration, but the user would then have to specify two sets of configuration options with this added.
Re: RFC: std.json sucessor
I've added support (compile time option [1]) for long and BigInt in the lexer (and parser), see [2]. JSONValue currently still only stores double for numbers. There are two options for extending JSONValue: 1. Add long and BigInt to the set of supported types for JSONValue. This preserves all features of Algebraic and would later still allow transparent conversion to other similar value types (e.g. BSONValue). On the other hand it would be necessary to always check the actual type before accessing a number, or the Algebraic would throw. 2. Instead of double, store a JSONNumber in the Algebraic. This enables all the transparent conversions of JSONNumber and would thus be more convenient, but blocks the way for possible automatic conversions in the future. I'm leaning towards 1, because allowing generic conversion between different JSONValue-like types was one of my prime goals for the new module. [1]: http://s-ludwig.github.io/std_data_json/stdx/data/json/lexer/LexOptions.html [2]: http://s-ludwig.github.io/std_data_json/stdx/data/json/lexer/JSONNumber.html
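The caller-side cost of option 1 can be seen in Python's json module, which similarly keeps integers and floats as distinct parsed types, so consuming code has to check which one it actually got before using it (an illustrative analogy, not the proposed JSONValue API):

```python
import json

# Analogous to option 1: the DOM value keeps distinct number types,
# and the consumer checks the stored type before accessing the number.
v = json.loads('{"n": 12345678901234567890, "x": 1.5}')
assert isinstance(v["n"], int)     # arbitrary-precision integer preserved
assert isinstance(v["x"], float)

def as_double(x):
    # the explicit conversion step option 1 forces on the caller
    return float(x) if isinstance(x, int) else x

assert isinstance(as_double(v["n"]), float)
```

Option 2 would correspond to a single wrapper number type that converts transparently on access; option 1 trades that convenience for keeping the value set open to generic conversion between JSONValue-like types.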
Re: RFC: std.json sucessor
On Monday, 25 August 2014 at 11:30:15 UTC, Sönke Ludwig wrote: I've added support (compile time option [1]) for long and BigInt in the lexer (and parser), see [2]. JSONValue currently still only stores double for numbers. It can be very useful to have a base 10 exponent representation in certain situations where you need to have the exact same results in two systems (like a third-party ERP server versus a client side application). Base 2 exponents are tricky (incorrect) when you read ASCII. E.g. I have resorted to using Decimal in Python just to avoid the weird round-off issues when calculating prices where the price is given in fractions of the order unit. Perhaps a marginal problem, but could be important for some serious application areas where you need to integrate D with existing systems (for which you don't have the source code).
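The Decimal workaround mentioned above looks like this in Python: number literals parsed as binary doubles pick up base-2 representation error, while handing the untouched literal to Decimal keeps base-10 semantics end to end.

```python
import json
from decimal import Decimal

# The classic binary round-off: 0.1 and 0.2 are not exact as doubles.
assert 0.1 + 0.2 != 0.3
assert Decimal("0.1") + Decimal("0.2") == Decimal("0.3")

# json can pass the raw number literal straight to Decimal, so the
# value never goes through a double at all:
order = json.loads('{"unit_price": 19.99, "qty": 3}', parse_float=Decimal)
total = order["unit_price"] * order["qty"]
assert total == Decimal("59.97")    # exact, reproducible across systems
```

A future Decimal option in the lexer would serve the same purpose: two systems parsing the same document get bit-identical price arithmetic.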
Re: RFC: std.json sucessor
On Thursday, 21 August 2014 at 22:35:18 UTC, Sönke Ludwig wrote: Following up on the recent std.jgrandson thread [1], I've picked up the work (a lot earlier than anticipated) and finished a first version of a loose blend of said std.jgrandson, vibe.data.json and some changes that I had planned for vibe.data.json for a while. I'm quite pleased by the results so far, although without a serialization framework it still misses a very important building block. Code: https://github.com/s-ludwig/std_data_json Docs: http://s-ludwig.github.io/std_data_json/ DUB: http://code.dlang.org/packages/std_data_json The new code contains: - Lazy lexer in the form of a token input range (using slices of the input if possible) - Lazy streaming parser (StAX style) in the form of a node input range - Eager DOM style parser returning a JSONValue - Range based JSON string generator taking either a token range, a node range, or a JSONValue - Opt-out location tracking (line/column) for tokens, nodes and values - No opDispatch() for JSONValue - this has shown to do more harm than good in vibe.data.json The DOM style JSONValue type is based on std.variant.Algebraic. This currently has a few usability issues that can be solved by upgrading/fixing Algebraic: - Operator overloading only works sporadically - No tag enum is supported, so that switch()ing on the type of a value doesn't work and an if-else cascade is required - Operations and conversions between different Algebraic types is not conveniently supported, which gets important when other similar formats get supported (e.g. BSON) Assuming that those points are solved, I'd like to get some early feedback before going for an official review. One open issue is how to handle unescaping of string literals. Currently it always unescapes immediately, which is more efficient for general input ranges when the unescaped result is needed, but less efficient for string inputs when the unescaped result is not needed. 
Maybe a flag could be used to conditionally switch behavior depending on the input range type. Destroy away! ;) [1]: http://forum.dlang.org/thread/lrknjl$co7$1...@digitalmars.com One missing feature (which is also missing from the existing std.json) is support for NaN and Infinity as JSON values. Although they are not part of the formal JSON spec (which is a ridiculous omission, the argument given for excluding them is fallacious), they do get generated if you use Javascript's toString to create the JSON. Many JSON libraries (eg Google's) also generate them, so they are frequently encountered in practice. So a JSON parser should at least be able to lex them. ie this should be parsable: {foo: NaN, bar: Infinity, baz: -Infinity} You should also put tests in for what happens when you pass NaN or infinity to toJSON. It shouldn't silently generate invalid JSON.
Re: RFC: std.json sucessor
On Monday, 25 August 2014 at 13:07:08 UTC, Don wrote: practice. So a JSON parser should at least be able to lex them. ie this should be parsable: {foo: NaN, bar: Infinity, baz: -Infinity} You should also put tests in for what happens when you pass NaN or infinity to toJSON. It shouldn't silently generate invalid JSON. I believe you are allowed to use very high exponents, though. Like: 1E999. So you need to decide if those should be mapped to +Infinity or to the max value… NaN also comes in two forms with differing semantics: signalling (NaNs) and quiet (NaN). NaN is used for 0/0 and sqrt(-1), but NaNs is used for illegal values and failure. For some reason D does not seem to support this aspect of IEEE754? I cannot find .nans listed on the page http://dlang.org/property.html The distinction is important when you do conditional branching. With NaNs you might not be able to figure out which branch to take since you might have missed out on a real value, with NaN you got the value (which is known to be not real) and you might be able to branch.
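The very-high-exponent case is worth pinning down with an example: 1E999 is grammatically valid JSON even though no double can hold it, and Python's parser answers the mapping question above by overflowing to infinity rather than clamping to the max value or rejecting.

```python
import json
import math

# Syntactically valid JSON whose value exceeds the double range:
assert json.loads("1E999") == math.inf
assert json.loads("-1E999") == -math.inf

# Values inside the double range still round-trip normally:
assert json.loads("1E308") == 1e308
```

Whatever a new std.json lexer chooses here (overflow to infinity, clamp, or a range error), the behaviour deserves an explicit decision and a test, since such literals can appear in standard-conforming input.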
Re: RFC: std.json sucessor
Am 25.08.2014 14:12, schrieb Ola Fosheim Grøstad ola.fosheim.grostad+dl...@gmail.com: On Monday, 25 August 2014 at 11:30:15 UTC, Sönke Ludwig wrote: I've added support (compile time option [1]) for long and BigInt in the lexer (and parser), see [2]. JSONValue currently still only stores double for numbers. It can be very useful to have a base 10 exponent representation in certain situations where you need to have the exact same results in two systems (like a third party ERP server versus a client side application). Base 2 exponents are tricky (incorrect) when you read ascii. E.g. I have resorted to using Decimal in Python just to avoid the weird round off issues when calculating prices where the price is given in fractions of the order unit. Perhaps a marginal problem, but could be important for some serious application areas where you need to integrate D with existing systems (for which you don't have the source code). In fact, I've already prepared the code for that, but commented it out for now, because I wanted to have an efficient algorithm for converting double to Decimal and because we should probably first add a Decimal type to Phobos instead of adding it to the JSON module.
Re: RFC: std.json sucessor
Am 25.08.2014 15:07, schrieb Don: On Thursday, 21 August 2014 at 22:35:18 UTC, Sönke Ludwig wrote: Following up on the recent std.jgrandson thread [1], I've picked up the work (a lot earlier than anticipated) and finished a first version of a loose blend of said std.jgrandson, vibe.data.json and some changes that I had planned for vibe.data.json for a while. I'm quite pleased by the results so far, although without a serialization framework it still misses a very important building block.

Code: https://github.com/s-ludwig/std_data_json
Docs: http://s-ludwig.github.io/std_data_json/
DUB: http://code.dlang.org/packages/std_data_json

The new code contains:
- Lazy lexer in the form of a token input range (using slices of the input if possible)
- Lazy streaming parser (StAX style) in the form of a node input range
- Eager DOM style parser returning a JSONValue
- Range based JSON string generator taking either a token range, a node range, or a JSONValue
- Opt-out location tracking (line/column) for tokens, nodes and values
- No opDispatch() for JSONValue - this has shown to do more harm than good in vibe.data.json

The DOM style JSONValue type is based on std.variant.Algebraic. This currently has a few usability issues that can be solved by upgrading/fixing Algebraic:
- Operator overloading only works sporadically
- No tag enum is supported, so that switch()ing on the type of a value doesn't work and an if-else cascade is required
- Operations and conversions between different Algebraic types are not conveniently supported, which gets important when other similar formats get supported (e.g. BSON)

Assuming that those points are solved, I'd like to get some early feedback before going for an official review. One open issue is how to handle unescaping of string literals. Currently it always unescapes immediately, which is more efficient for general input ranges when the unescaped result is needed, but less efficient for string inputs when the unescaped result is not needed. Maybe a flag could be used to conditionally switch behavior depending on the input range type. Destroy away! ;)

[1]: http://forum.dlang.org/thread/lrknjl$co7$1...@digitalmars.com

One missing feature (which is also missing from the existing std.json) is support for NaN and Infinity as JSON values. Although they are not part of the formal JSON spec (which is a ridiculous omission, the argument given for excluding them is fallacious), they do get generated if you use Javascript's toString to create the JSON. Many JSON libraries (e.g. Google's) also generate them, so they are frequently encountered in practice. So a JSON parser should at least be able to lex them, i.e. this should be parsable: {foo: NaN, bar: Infinity, baz: -Infinity}

This would probably be best added as another (CT) optional feature. I think the default should strictly adhere to the JSON specification, though.

You should also put tests in for what happens when you pass NaN or infinity to toJSON. It shouldn't silently generate invalid JSON.

Good point. The current solution to just use formattedWrite("%.16g") is also not ideal.
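For comparison, Python's standard json module faces exactly this trade-off: its generator emits the non-standard NaN/Infinity literals by default and only rejects them when asked to be strict, and its lexer accepts them. A small Python illustration (unrelated to the D code under review, just to show the design space):

```python
import json
import math

# By default, Python's json module emits the non-standard literals that
# JavaScript's toString produces - technically invalid JSON:
print(json.dumps({"foo": float("nan"), "bar": float("inf")}))
# {"foo": NaN, "bar": Infinity}

# With allow_nan=False the generator refuses to emit invalid JSON:
try:
    json.dumps({"foo": float("nan")}, allow_nan=False)
except ValueError as e:
    print("rejected:", e)

# The lexer side: loads() accepts these literals by default, so
# round-tripping works even though the text is not strict JSON.
value = json.loads('{"foo": NaN, "bar": Infinity, "baz": -Infinity}')
assert math.isnan(value["foo"]) and math.isinf(value["bar"])
```

So both directions (lex the literals, but refuse to silently generate them unless explicitly enabled) coexist behind option flags, much like the CT-optional feature discussed above.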
Re: RFC: std.json sucessor
Am 25.08.2014 16:04, schrieb Sönke Ludwig: Am 25.08.2014 15:07, schrieb Don: On Thursday, 21 August 2014 at 22:35:18 UTC, Sönke Ludwig wrote: ...

One missing feature (which is also missing from the existing std.json) is support for NaN and Infinity as JSON values. ... So a JSON parser should at least be able to lex them, i.e. this should be parsable: {foo: NaN, bar: Infinity, baz: -Infinity}

This would probably be best added as another (CT) optional feature. I think the default should strictly adhere to the JSON specification, though. http://s-ludwig.github.io/std_data_json/stdx/data/json/lexer/LexOptions.specialFloatLiterals.html

You should also put tests in for what happens when you pass NaN or infinity to toJSON. It shouldn't silently generate invalid JSON.

Good point. The current solution to just use formattedWrite("%.16g") is also not ideal. By default, floating-point special values are now output as 'null', according to the ECMAScript standard. Optionally, they will be emitted as 'NaN' and 'Infinity': http://s-ludwig.github.io/std_data_json/stdx/data/json/generator/GeneratorOptions.specialFloatLiterals.html
Re: RFC: std.json sucessor
On Monday, 25 August 2014 at 15:34:29 UTC, Sönke Ludwig wrote: By default, floating-point special values are now output as 'null', according to the ECMAScript standard. Optionally, they will be emitted as 'NaN' and 'Infinity': ECMAScript presumes double. I think one should base Phobos on language-independent standards. I suggest: http://tools.ietf.org/html/rfc7159 For a web server it would be most useful to get an exception, since otherwise you risk ending up with web clients not working and nothing logged. It is better to have an exception and log an error so the problem can be fixed.
Re: RFC: std.json sucessor
On Monday, 25 August 2014 at 15:46:12 UTC, Ola Fosheim Grøstad wrote: For a web server it would be most useful to get an exception since you risk ending up with web-clients not working with no logging. It is better to have an exception and log an error so the problem can be fixed.

Let me expand a bit on the difference between web clients and servers, assuming D is used on the server:
* Web servers have to check all input and log illegal activity. It is either a bug or an attack.
* Web clients don't have to check input from the server (at most a crypto check) and should not do double work if servers validate anyway.
* Web servers detect errors and send the error as a response to the client that displays it as a warning to the user. This is the uncommon case so you don't want to burden the client with it.

From this we can infer:
- It makes more sense for ECMAScript to turn illegal values into null since it runs on the client.
- The server needs efficient validation of input so that it can have faster response.
- The more integration of validation of typedness you can have in the parser, the better.

Thus it would be an advantage to be able to configure the validation done in the parser (through template mechanisms):
1. On write: throw an exception on all illegal values or values that cannot be represented in the format. If the values are illegal then the client should not receive it. It could cause legal problems (like wrong prices).
2. On read: add the ability to configure the validation of typedness on many parameters:
- no nulls, no dicts, only nesting arrays etc.
- predetermined key-values and automatic mapping to structs on exact match.
- require all leaf arrays to be uniform (array of strings, array of numbers).
- match a predefined grammar etc.
Re: RFC: std.json sucessor
On 8/23/2014 6:32 PM, Brad Roberts via Digitalmars-d wrote: I'm not convinced that using an adapter algorithm won't be just as fast. Consider your own talks on optimizing the existing dmd lexer. In those talks you've talked about the evils of additional processing on every byte. That's what you're talking about here. While it's possible that the inliner and other optimizer steps might be able to integrate the two phases and remove some overhead, I'll believe it when I see the resulting assembly code. On the other hand, deadalnix demonstrated that the ldc optimizer was able to remove the extra code. I have a reasonable faith that optimization can be improved where necessary to cover this.
Re: RFC: std.json sucessor
On 8/23/2014 3:51 PM, Andrei Alexandrescu wrote: An adapter would solve the wrong problem here. There's nothing to adapt from and to. An adapter would be good if e.g. the stream uses UTF-16 or some Windows encoding. Bytes are the natural input for a json parser. The adaptation is to take arbitrary byte input in an unknown encoding and produce valid UTF. Note that many html readers scan the bytes to see if it is ASCII, UTF, some code page encoding, Shift-JIS, etc., and translate accordingly. I do not see why that is less costly to put inside the JSON lexer than as an adapter.
Re: RFC: std.json sucessor
On 8/25/2014 6:23 AM, Ola Fosheim Grøstad ola.fosheim.grostad+dl...@gmail.com wrote: On Monday, 25 August 2014 at 13:07:08 UTC, Don wrote: practice. So a JSON parser should at least be able to lex them. ie this should be parsable: {foo: NaN, bar: Infinity, baz: -Infinity} You should also put tests in for what happens when you pass NaN or infinity to toJSON. It shouldn't silently generate invalid JSON. I believe you are allowed to use very high exponents, though. Like: 1E999. So you need to decide if those should be mapped to +Infinity or to the max value… Infinity. Mapping to max value would be a horrible bug. NaN also comes in two forms with differing semantics: signalling (NaNs) and quiet (NaN). Quiet NaN is used for 0/0 and sqrt(-1), while signalling NaN is used for illegal values and failure. For some reason D does not seem to support this aspect of IEEE754? I cannot find .nans listed on the page http://dlang.org/property.html Because I tried supporting them in C++. It doesn't work for various reasons. Nobody else supports them, either.
Re: RFC: std.json sucessor
On 08/25/2014 09:35 PM, Walter Bright wrote: On 8/23/2014 6:32 PM, Brad Roberts via Digitalmars-d wrote: I'm not convinced that using an adapter algorithm won't be just as fast. Consider your own talks on optimizing the existing dmd lexer. In those talks you've talked about the evils of additional processing on every byte. That's what you're talking about here. While it's possible that the inliner and other optimizer steps might be able to integrate the two phases and remove some overhead, I'll believe it when I see the resulting assembly code. On the other hand, deadalnix demonstrated that the ldc optimizer was able to remove the extra code. I have a reasonable faith that optimization can be improved where necessary to cover this.

I just happened to write a very small script yesterday and tested with the compilers (with dub --build=release):
dmd: 2.8 MB
gdc: 3.3 MB
ldc: 0.5 MB
So ldc can remove quite a substantial amount of code in some cases.
Re: RFC: std.json sucessor
On Monday, 25 August 2014 at 19:38:05 UTC, Walter Bright wrote: The adaptation is to take arbitrary byte input in an unknown encoding and produce valid UTF. I agree. For a restful http service the encoding should be specified in the http header and the input rejected if it isn't UTF compatible. For that use scenario you only want validation, not conversion. However some validation is free, like if you only accept numbers you could just turn off parsing of strings in the template… If files are read from storage then you can reread the file if it fails validation on the first pass. I wonder, in which use scenario it is that both of these conditions fail?
1. unspecified character-set and cannot assume UTF for JSON
2. unable to re-parse
Re: RFC: std.json sucessor
On Monday, 25 August 2014 at 19:42:03 UTC, Walter Bright wrote: Infinity. Mapping to max value would be a horrible bug. Yes… but then you are reading an illegal value that JSON does not support… For some reason D does not seem to support this aspect of IEEE754? I cannot find .nans listed on the page http://dlang.org/property.html Because I tried supporting them in C++. It doesn't work for various reasons. Nobody else supports them, either. I haven't tested, but Python is supposed to throw on NaNs. gcc has support for nans in their documentation: https://gcc.gnu.org/onlinedocs/gcc/Other-Builtins.html IBM Fortran supports it… I think supporting signaling NaN is important for correctness.
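The quiet/signalling distinction is only a single bit of the IEEE 754 binary64 encoding: a NaN has an all-ones exponent and a nonzero mantissa, and the top mantissa bit acts as the quiet flag on most current platforms. A rough, plain-Python illustration of that bit layout (nothing D-specific, and the quiet-bit convention is an assumption about common hardware):

```python
import math
import struct

def classify_nan(x: float) -> str:
    """Classify a binary64 value by its raw IEEE 754 bits."""
    bits = struct.unpack("<Q", struct.pack("<d", x))[0]
    exponent = (bits >> 52) & 0x7FF
    mantissa = bits & ((1 << 52) - 1)
    if exponent != 0x7FF or mantissa == 0:
        return "not a NaN"
    # Top mantissa bit set -> quiet NaN; clear -> signalling NaN
    # (the convention on x86/ARM and most other IEEE 754 platforms).
    return "quiet NaN" if bits & (1 << 51) else "signalling NaN"

assert classify_nan(float("nan")) == "quiet NaN"
assert classify_nan(1.5) == "not a NaN"

# A signalling NaN built directly from its bit pattern; note that doing
# arithmetic on it would typically quieten it (or trap, if enabled).
snan = struct.unpack("<d", struct.pack("<Q", 0x7FF0000000000001))[0]
assert math.isnan(snan)
```

This also shows why library support is awkward: the signalling bit survives only as long as the value is moved around as raw bits, which matches Walter's experience that sNaN support tends not to hold up in practice.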
Re: RFC: std.json sucessor
Am 25.08.2014 17:46, schrieb Ola Fosheim Grøstad ola.fosheim.grostad+dl...@gmail.com: On Monday, 25 August 2014 at 15:34:29 UTC, Sönke Ludwig wrote: By default, floating-point special values are now output as 'null', according to the ECMA-script standard. Optionally, they will be emitted as 'NaN' and 'Infinity': ECMAScript presumes double. I think one should base Phobos on language-independent standards. I suggest: http://tools.ietf.org/html/rfc7159 Well, of course it's based on that RFC, did you seriously think something else? However, that standard has no mention of infinity or NaN, and since JSON is designed to be a subset of ECMA script, it's basically the only thing that comes close. For a web server it would be most useful to get an exception since you risk ending up with web-clients not working with no logging. It is better to have an exception and log an error so the problem can be fixed. Although you have a point there of course, it's also highly unlikely that those clients would work correctly if we presume that JSON supported infinity/NaN. So it would really be just coincidence to detect a bug like that. But I generally agree, it's just that the anti-exception voices are pretty loud these days (including Walter's), so that I opted for a non-throwing solution instead. I guess it wouldn't hurt though to default to throwing an exception, while still providing the GeneratorOptions.specialFloatLiterals option to handle those values without exception overhead, but in a non standard-conforming way.
Re: RFC: std.json sucessor
On Monday, 25 August 2014 at 20:04:10 UTC, Ola Fosheim Grøstad wrote: I think supporting signaling NaN is important for correctness. It is defined in C++11: http://en.cppreference.com/w/cpp/types/numeric_limits/signaling_NaN
Re: RFC: std.json sucessor
- It makes more sense for ECMAScript to turn illegal values into null since it runs on the client. Like... node.js? Sorry, just kidding. I don't think it makes sense for clients to be less strict about such things, but I do agree with your assessment about being as strict as possible on the server. I also do think that exceptions are a perfect tool especially for server applications and that instead of avoiding them because they are slow, they should better be made fast enough to not be an issue.
Re: RFC: std.json sucessor
Am 25.08.2014 21:50, schrieb Ola Fosheim Grøstad ola.fosheim.grostad+dl...@gmail.com: On Monday, 25 August 2014 at 19:38:05 UTC, Walter Bright wrote: The adaptation is to take arbitrary byte input in an unknown encoding and produce valid UTF. I agree. For a restful http service the encoding should be specified in the http header and the input rejected if it isn't UTF compatible. For that use scenario you only want validation, not conversion. However some validation is free, like if you only accept numbers you could just turn off parsing of strings in the template… If files are read from storage then you can reread the file if it fails validation on the first pass. I wonder, in which use scenario it is that both of these conditions fail? 1. unspecified character-set and cannot assume UTF for JSON 2. unable to re-parse BTW, JSON is *required* to be UTF encoded anyway as per RFC-7159, which is another argument for just letting the lexer assume valid UTF.
Re: RFC: std.json sucessor
On Monday, 25 August 2014 at 20:21:01 UTC, Sönke Ludwig wrote: Well, of course it's based on that RFC, did you seriously think something else? I made no assumptions, just responded to what you wrote :-). It would be reasonable in the context of vibe.d to assume the ECMAScript spec. But I generally agree, it's just that the anti-exception voices are pretty loud these days (including Walter's), so that I opted for a non-throwing solution instead. Yes, the minimum requirement is to just get "did not validate" directly as a single value. One can create a wrapper to get exceptions. I guess it wouldn't hurt though to default to throwing an exception, while still providing the GeneratorOptions.specialFloatLiterals option to handle those values without exception overhead, but in a non standard-conforming way. What I care most about is getting all the free validation that can be added with no extra cost. That will make writing web services easier. Like if you can define constraints like:
- root is array, values are strings.
- root is array, second level only arrays, third level is numbers.
- root is dict, all arrays contain only numbers.
What is a bit annoying about generic libs is that you have no idea what you are getting, so you have to spend time creating dull validation code. But maybe StructuredJSON should be a separate library. It would be useful for REST services to specify the grammar and auto-generate both javascript and D structures to hold it along with validation code. However, just turning off parsing of true, false, null, [, { etc. seems like a cheap addition that also can improve parsing speed if the compiler can make do with two if statements instead of a switch. Ola.
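A constraint of that kind is cheap to express as a post-parse check, even if the lexer-integrated version argued for above would avoid building the DOM at all. A hypothetical sketch (function name invented for illustration, using Python's stdlib json in place of the D API) of the "root is array, values are strings" rule:

```python
import json

def require_string_array(text: str) -> list:
    """Parse JSON and enforce: root is an array, all values are strings.

    A hypothetical post-parse validator; an integrated lexer-level
    check could reject mismatches without building the DOM at all.
    """
    value = json.loads(text)
    if not isinstance(value, list):
        raise ValueError("root must be an array")
    for i, item in enumerate(value):
        if not isinstance(item, str):
            raise ValueError(f"element {i} must be a string")
    return value

assert require_string_array('["a", "b"]') == ["a", "b"]
try:
    require_string_array('["a", 1]')
except ValueError as e:
    print("rejected:", e)
```

The point of generating such validators from a declared grammar, as suggested above, is precisely to avoid writing this dull code by hand for every service.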
Re: RFC: std.json sucessor
Am 25.08.2014 22:21, schrieb Sönke Ludwig: that standard has no mention of infinity or NaN Sorry, to be precise, it has no suggestion of how to *handle* infinity or NaN.
Re: RFC: std.json sucessor
On Monday, 25 August 2014 at 20:35:32 UTC, Sönke Ludwig wrote: BTW, JSON is *required* to be UTF encoded anyway as per RFC-7159, which is another argument for just letting the lexer assume valid UTF. The lexer cannot assume valid UTF since the client might be a rogue, but it can just bail out if the lookahead isn't JSON? So UTF-validation is limited to strings. You have to parse the strings because of the \u escapes of course, so some basic validation is unavoidable? But I guess full validation of string content could be another useful option, along with an "ignore escapes" mode for the case where you want to avoid decode-encode scenarios (like for a proxy, or if you store pre-escaped unicode in a database).
Re: RFC: std.json sucessor
On 8/25/2014 1:21 PM, Ola Fosheim Grøstad ola.fosheim.grostad+dl...@gmail.com wrote: On Monday, 25 August 2014 at 20:04:10 UTC, Ola Fosheim Grøstad wrote: I think supporting signaling NaN is important for correctness. It is defined in C++11: http://en.cppreference.com/w/cpp/types/numeric_limits/signaling_NaN I didn't know that. But recall I did implement it in DMC++, and it turned out to simply not be useful. I'd be surprised if the new C++ support for it does anything worthwhile.
Re: RFC: std.json sucessor
On 8/25/2014 1:35 PM, Sönke Ludwig wrote: BTW, JSON is *required* to be UTF encoded anyway as per RFC-7159, which is another argument for just letting the lexer assume valid UTF. I think that settles it.
Re: RFC: std.json sucessor
Am 25.08.2014 22:51, schrieb Ola Fosheim Grøstad ola.fosheim.grostad+dl...@gmail.com: On Monday, 25 August 2014 at 20:35:32 UTC, Sönke Ludwig wrote: BTW, JSON is *required* to be UTF encoded anyway as per RFC-7159, which is another argument for just letting the lexer assume valid UTF. The lexer cannot assume valid UTF since the client might be a rogue, but it can just bail out if the lookahead isn't JSON? So UTF-validation is limited to strings. But why should UTF validation be the job of the lexer in the first place? D's string type is also defined to be UTF-8, so given that, it would of course be free to assume valid UTF-8. I agree with Walter there that validation/conversion should be added as a separate proxy range. But if we end up going for validating in the lexer, it would indeed be enough to validate inside strings, because the rest of the grammar assumes a subset of ASCII. You have to parse the strings because of the \u escapes of course, so some basic validation is unavoidable? At least no UTF validation is needed. Since all non-ASCII characters will always be composed of bytes > 0x7F, a \uXXXX escape sequence can be assumed to be valid wherever in the string it occurs, and all other bytes that don't belong to an escape sequence are just passed through as-is. But I guess full validation of string content could be another useful option along with ignore escapes for the case where you want to avoid decode-encode scenarios. (like for a proxy, or if you store pre-escaped unicode in a database)
Re: RFC: std.json sucessor
On 8/25/2014 12:49 PM, simendsjo wrote: I just happened to write a very small script yesterday and tested with the compilers (with dub --build=release). dmd: 2.8 mb gdc: 3.3 mb ldc 0.5 mb So ldc can remove quite a substantial amount of code in some cases. Speed optimizations are different.
Re: RFC: std.json sucessor
On Monday, 25 August 2014 at 21:27:42 UTC, Sönke Ludwig wrote: But why should UTF validation be the job of the lexer in the first place? Because you want to save time, it is faster to integrate validation? The most likely use scenario is to receive REST data over HTTP that needs validation. Well, so then I agree with Andrei… array of bytes it is. ;-) added as a separate proxy range. But if we end up going for validating in the lexer, it would indeed be enough to validate inside strings, because the rest of the grammar assumes a subset of ASCII. Not assumes, but defines! :-) If you have to validate UTF before lexing then you will end up needlessly scanning lots of ascii if the file contains lots of non-strings or is from an encoder that only sends pure ascii. If you want to have plugin validation of strings then you also need to differentiate strings so that the user can select which data should be just ascii, utf8, numbers, ids etc. Otherwise the user will end up doing double validation (you have to bypass bytes > 0x7F followed by string-end anyway). The advantage of integrated validation is that you can use 16 byte SIMD registers on the buffer. I presume you can load 16 bytes and do BITWISE-AND on the MSB, then match against string-end and carefully use this to boost performance of simultaneous UTF validation, escape-scanning, and string-end scan. A bit tricky, of course. You cannot assume \u… to be valid if you convert it.
Re: RFC: std.json sucessor
On Monday, 25 August 2014 at 21:53:50 UTC, Ola Fosheim Grøstad wrote: I presume you can load 16 bytes and do BITWISE-AND on the MSB, then match against string-end and carefully use this to boost performance of simultanous UTF validation, escape-scanning, and string-end scan. A bit tricky, of course. I think it is doable and worth it… https://software.intel.com/sites/landingpage/IntrinsicsGuide/ e.g.: __mmask16 _mm_cmpeq_epu8_mask (__m128i a, __m128i b) __mmask32 _mm256_cmpeq_epu8_mask (__m256i a, __m256i b) __mmask64 _mm512_cmpeq_epu8_mask (__m512i a, __m512i b) __mmask16 _mm_test_epi8_mask (__m128i a, __m128i b) etc. So you can: 1. preload registers with … , \\… and \0\0\0… 2. then compare signed/unsigned/equal whatever. 3. then load 16,32 or 64 bytes of data and stream until the masks trigger 4. tests masks 5. resolve any potential issues, goto 3
Re: RFC: std.json sucessor
On Monday, 25 August 2014 at 21:24:11 UTC, Walter Bright wrote: I didn't know that. But recall I did implement it in DMC++, and it turned out to simply not be useful. I'd be surprised if the new C++ support for it does anything worthwhile. Well, one should initialize with signaling NaN. Then you get an exception if you try to compute using uninitialized values.
Re: RFC: std.json sucessor
On Monday, 25 August 2014 at 22:40:00 UTC, Ola Fosheim Grøstad wrote: On Monday, 25 August 2014 at 21:53:50 UTC, Ola Fosheim Grøstad wrote: I presume you can load 16 bytes and do BITWISE-AND on the MSB, then match against string-end and carefully use this to boost performance of simultaneous UTF validation, escape-scanning, and string-end scan. A bit tricky, of course. I think it is doable and worth it… ...

D:YAML uses a similar approach, but with 8 bytes (a plain ulong - portable) to detect how many ASCII chars there are before the first non-ASCII UTF-8 sequence, and it significantly improves performance (I didn't keep any numbers unfortunately, but it decreases decoding overhead to a fraction for most inputs, since YAML (and JSON) files tend to be mostly ASCII with non-ASCII from time to time in strings; if we know that we have e.g. 100 chars incoming that are plain ASCII, we can use a fast path for them and only consider decoding after that). See the countASCII() function in https://github.com/kiith-sa/D-YAML/blob/master/source/dyaml/reader.d However, this approach is useful only if you decode the whole buffer at once, not if you do something like foreach(dchar ch; "asdsššdfáľäô") {}, which is the most obvious way to decode in D.

FWIW, decoding _was_ a significant overhead in D:YAML (again, didn't keep numbers, but at a time it was around 10% in the profiler), and I didn't like the fact that it prevented making my code @nogc - I ended up copying chunks of std.utf and making them @nogc nothrow (D:YAML as a whole is not @nogc, but I use @nogc in some parts basically as @noalloc to ensure I don't allocate anything).
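The 8-bytes-at-a-time trick is portable because a single 64-bit AND against 0x8080808080808080 reveals whether any of the 8 bytes has its high bit set, i.e. whether any byte starts or continues a multi-byte UTF-8 sequence. A rough Python rendition of the idea (not the actual D:YAML code):

```python
def count_leading_ascii(data: bytes) -> int:
    """Count the bytes before the first non-ASCII byte, checking
    8 bytes at a time via a high-bit mask (the portable ulong trick)."""
    HIGH_BITS = 0x8080808080808080
    i = 0
    # Fast path: an 8-byte chunk with no high bit set is pure ASCII.
    while i + 8 <= len(data):
        chunk = int.from_bytes(data[i:i + 8], "little")
        if chunk & HIGH_BITS:
            break
        i += 8
    # Slow path: scan the remaining bytes individually.
    while i < len(data) and data[i] < 0x80:
        i += 1
    return i

assert count_leading_ascii(b"hello, world") == 12
assert count_leading_ascii("abcdefghij\u00e9x".encode("utf-8")) == 10
```

A real implementation would reinterpret the buffer in place instead of copying bytes out per chunk, and the same mask generalizes to 16/32/64-byte SIMD registers as discussed above.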
Re: RFC: std.json sucessor
On 8/25/2014 4:15 PM, Ola Fosheim Grøstad ola.fosheim.grostad+dl...@gmail.com wrote: On Monday, 25 August 2014 at 21:24:11 UTC, Walter Bright wrote: I didn't know that. But recall I did implement it in DMC++, and it turned out to simply not be useful. I'd be surprised if the new C++ support for it does anything worthwhile. Well, one should initialize with signaling NaN. Then you get an exception if you try to compute using uninitialized values. That's the theory. The practice doesn't work out so well.
Re: RFC: std.json sucessor
Btw, maybe it would be a good idea to take a look at the JSON that various browsers generate to see if there are any differences? Then one could tune optimizations to the most common coding, like this:
1. start parsing assuming browser-style restricted JSON grammar.
2. on failure jump to the slower generic JSON parser.
Chrome does not seem to generate whitespace in JSON.stringify(). And I would not be surprised if the encoding of double is similar across browsers. Ola.
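Browser JSON.stringify output indeed contains no insignificant whitespace. The same compact form is what e.g. Python's json module produces with explicit separators, which illustrates concretely what a "no whitespace between tokens" fast path would be tuned for (an analogy only, nothing to do with the D lexer):

```python
import json

data = {"a": [1, 2.5, None], "b": "text"}

# Python's default output still inserts a space after ':' and ','...
print(json.dumps(data))
# {"a": [1, 2.5, null], "b": "text"}

# ...while separators=(",", ":") matches browser JSON.stringify output:
compact = json.dumps(data, separators=(",", ":"))
print(compact)
# {"a":[1,2.5,null],"b":"text"}
assert " " not in compact
```

A parser could speculatively assume the compact grammar (no whitespace handling at all) and restart with the generic path on the first mismatch, as sketched in steps 1 and 2 above.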
Re: RFC: std.json sucessor
On Monday, 25 August 2014 at 23:24:43 UTC, Kiith-Sa wrote: D:YAML uses a similar approach, but with 8 bytes (plain ulong - portable) to detect how many ASCII chars are there before the first non-ASCII UTF-8 sequence, and it significantly improves performance (didn't keep any numbers unfortunately, but it Cool! I think often you will have an array of numbers so you could subtract 0…, then parse offset-bytes and convert the mantissa/exponent using shuffles and simd. Somehow…
Re: RFC: std.json sucessor
Hi! Thanks for the effort you've put in this. I am having problems with building with LDC 0.14.0. DMD 2.066.0 seems to work fine (all unit tests pass). Do you have any ideas why? I am using Ubuntu 3.10 (Linux 3.11.0-15-generic x86_64). Master was at 6a9f8e62e456c3601fe8ff2e1fbb640f38793d08.

$ dub fetch std_data_json --version=~master
$ cd std_data_json-master/
$ dub test --compiler=ldc2
Generating test runner configuration '__test__library__' for 'library' (library).
Building std_data_json ~master configuration __test__library__, build type unittest.
Running ldc2...
source/stdx/data/json/parser.d(77): Error: safe function 'stdx.data.json.parser.__unittestL68_22' cannot call system function 'object.AssociativeArray!(string, JSONValue).AssociativeArray.length'
source/stdx/data/json/parser.d(124): Error: safe function 'stdx.data.json.parser.__unittestL116_24' cannot call system function 'object.AssociativeArray!(string, JSONValue).AssociativeArray.length'
source/stdx/data/json/parser.d(341): Error: function stdx.data.json.parser.JSONParserRange!(JSONLexerRange!string).JSONParserRange.opAssign is not callable because it is annotated with @disable
source/stdx/data/json/parser.d(341): Error: safe function 'stdx.data.json.parser.__unittestL318_32' cannot call system function 'stdx.data.json.parser.JSONParserRange!(JSONLexerRange!string).JSONParserRange.opAssign'
source/stdx/data/json/parser.d(633): Error: function stdx.data.json.lexer.JSONToken.opAssign is not callable because it is annotated with @disable
source/stdx/data/json/parser.d(633): Error: 'stdx.data.json.lexer.JSONToken.opAssign' is not nothrow
source/stdx/data/json/parser.d(630): Error: function 'stdx.data.json.parser.JSONParserNode.literal' is nothrow yet may throw
FAIL .dub/build/__test__library__-unittest-linux.posix-x86_64-ldc2-0F620B217010475A5A4E545A57CDD09A/ __test__library__ executable
Error executing command test: ldc2 failed with exit code 1.
Thanks
Re: RFC: std.json sucessor
... I am using Ubuntu 3.10 (Linux 3.11.0-15-generic x86_64). ... I meant Ubuntu 13.10 :D
Re: RFC: std.json sucessor
On Saturday, 23 August 2014 at 05:28:55 UTC, Walter Bright wrote: On 8/22/2014 9:48 PM, Ola Fosheim Grøstad wrote: On Saturday, 23 August 2014 at 04:36:34 UTC, Walter Bright wrote: On 8/22/2014 9:01 PM, Ola Fosheim Grøstad wrote: Does this mean that D is getting resizable stack allocations in lower stack frames? That has a lot of implications for code gen. scopebuffer does not require resizeable stack allocations. So you cannot use the stack for resizable allocations. Please, take a look at how scopebuffer works. I have? It requires an upper bound to stay on the stack, that creates a big hole in the stack. I don't think wasting the stack or moving to the heap is a nice predictable solution. It would be better to just have a couple of regions that do reverse stack allocations, but the most efficient solution is the one I outlined. With JSON you might be able to create an upper bound of say 4-8 times the size of the source iff you know the file size. You don't if you are streaming. (scopebuffer is too unpredictable for real time, a pure stack solution is predictable)
Re: RFC: std.json sucessor
On 8/22/2014 11:25 PM, Ola Fosheim Grøstad wrote: On Saturday, 23 August 2014 at 05:28:55 UTC, Walter Bright wrote: On 8/22/2014 9:48 PM, Ola Fosheim Grøstad wrote: On Saturday, 23 August 2014 at 04:36:34 UTC, Walter Bright wrote: On 8/22/2014 9:01 PM, Ola Fosheim Grøstad wrote: Does this mean that D is getting resizable stack allocations in lower stack frames? That has a lot of implications for code gen. scopebuffer does not require resizeable stack allocations. So you cannot use the stack for resizable allocations. Please, take a look at how scopebuffer works. I have? It requires an upper bound to stay on the stack, that creates a big hole in the stack. I don't think wasting the stack or moving to the heap is a nice predictable solution. It would be better to just have a couple of regions that do reverse stack allocations, but the most efficient solution is the one I outlined. Scopebuffer is extensively used in Warp, and works very well. The hole in the stack is not a significant problem. With JSON you might be able to create an upper bound of say 4-8 times the size of the source iff you know the file size. You don't if you are streaming. (scopebuffer is too unpredictable for real time, a pure stack solution is predictable) You can always implement your own buffering system and pass it in - that's the point, it's under user control.
Re: RFC: std.json sucessor
On Saturday, 23 August 2014 at 06:41:11 UTC, Walter Bright wrote: Scopebuffer is extensively used in Warp, and works very well. The hole in the stack is not a significant problem. Well, on a webserver you don't want to push out the caches for no good reason. You can always implement your own buffering system and pass it in - that's the point, it's under user control. My point is that you need compiler support to get good buffering options on the stack. Something like an @alloca_inline: auto buffer = @alloca_inline getstuff(); process(buffer); I think all memory allocation should be under compiler control, the library solutions are bound to be suboptimal, i.e. slower.
Re: RFC: std.json sucessor
On 8/22/14, Sönke Ludwig digitalmars-d@puremagic.com wrote: Hmmm, but it *is* a string. Isn't the problem more the use of with in this case? Yeah, maybe so. I thought for a second it was a tuple, but then I saw the square brackets and was left scratching my head. :)
Re: RFC: std.json sucessor
On 23.08.2014 03:05, Walter Bright wrote: On 8/22/2014 2:27 PM, Sönke Ludwig wrote: On 22.08.2014 20:08, Walter Bright wrote: 1. There's no mention of what will happen if it is passed malformed JSON strings. I presume an exception is thrown. Exceptions are both slow and consume GC memory. I suggest an alternative would be to emit an Error token instead; this would be much like how the UTF decoding algorithms emit a replacement char for invalid UTF sequences. The latest version now features a LexOptions.noThrow option which causes an error token to be emitted instead. After popping the error token, the range is always empty. Having a nothrow option may prevent the functions from being attributed as nothrow. It's a compile-time option, so that shouldn't be an issue. There is also just a single throw statement in the source, so it's easy to isolate.
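The error-token pattern described here can be sketched in a few lines. The following self-contained toy lexer is purely illustrative; the `Kind`/`Token` names are assumptions, not the proposed module's actual token API:

```d
// Toy lexer illustrating the noThrow mode discussed above: malformed
// input produces a terminal `error` token instead of an exception, so
// no GC allocation for a Throwable is needed on the error path.
enum Kind { number, error }
struct Token { Kind kind; string text; }

Token[] lexDigits(string input)
{
    Token[] tokens;
    foreach (i, c; input)
    {
        if (c < '0' || c > '9')
        {
            // emit an error token carrying the offending remainder;
            // after this token is popped, the stream is empty
            tokens ~= Token(Kind.error, input[i .. $]);
            return tokens;
        }
        tokens ~= Token(Kind.number, input[i .. i + 1]);
    }
    return tokens;
}

unittest
{
    auto ts = lexDigits("12x3");
    assert(ts.length == 3);               // '1', '2', then the error token
    assert(ts[$ - 1].kind == Kind.error);
    assert(ts[$ - 1].text == "x3");
}
```

Since the choice between throwing and emitting an error token is a compile-time template parameter in the proposal, the throwing instantiation can still be inferred `nothrow`-incompatible while the noThrow instantiation is not.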
Re: RFC: std.json sucessor
On 23.08.2014 04:23, deadalnix wrote: First, thank you for your work. std.json is horrible to use right now, so a replacement is more than welcome. I haven't played with your code yet, so I may be asking for something that already exists, but did you have a look at jsvar by Adam? You can find it here: https://github.com/adamdruppe/arsd/blob/master/jsvar.d One of the big pains when working with a format like JSON is that you go from the untyped world to the typed world (the same problem occurs with XML and various config formats as well). I think Adam got the right balance in jsvar. It behaves closely enough to JavaScript that it is convenient to manipulate, while removing the most dangerous behavior (concatenation is still done using ~ and not + as in JS). If that is not already the case, I'd love for the elements I get out of my JSON to behave that way. If you can do that, you have a user. Setting the issue of opDispatch aside, one of the goals was to use Algebraic to store values. It is probably not quite as flexible as jsvar, but still transparently enables a lot of operations (with those pull requests merged, at least). But it has another big advantage, which is that we can later define other types based on Algebraic, such as BSONValue, and those can be transparently runtime-converted between each other in a generic way. A special-case type, on the other hand, produces nasty dependencies between the formats. Main issues of using opDispatch: - Prone to bugs where a normal field/method of the JSONValue struct is accessed instead of a JSON field - On top of that, the var.field syntax gives the wrong impression that you are working with static typing, while var[field] makes it clear that runtime indexing is going on - Every interface change of JSONValue would be a silent breaking change, because the whole string domain is used up for opDispatch
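The first pitfall in that list can be demonstrated with a toy value type; the `Var` struct below is hypothetical and deliberately minimal, not jsvar's or JSONValue's actual implementation:

```d
// Toy type demonstrating the opDispatch pitfall: a JSON field whose
// name collides with a real member of the value type silently resolves
// to the member, never reaching opDispatch.
struct Var
{
    string[string] fields;

    size_t length() { return fields.length; } // ordinary member

    // catches all other member names at compile time
    string opDispatch(string name)() { return fields[name]; }
}

unittest
{
    Var v;
    v.fields["city"] = "Berlin";
    v.fields["length"] = "42";

    assert(v.city == "Berlin");         // opDispatch lookup works...
    assert(v.length == 2);              // ...but this calls the method,
                                        // NOT the JSON field "length"
    assert(v.fields["length"] == "42"); // explicit indexing is unambiguous
}
```

This is also why `var["field"]` is argued to be the honest syntax: it cannot collide with the value type's own interface, and it visibly signals a runtime lookup.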
Re: RFC: std.json sucessor
On Saturday, 23 August 2014 at 09:22:01 UTC, Sönke Ludwig wrote: Main issues of using opDispatch: - Prone to bugs where a normal field/method of the JSONValue struct is accessed instead of a JSON field - On top of that the var.field syntax gives the wrong impression that you are working with static typing, while var[field] makes it clear that runtime indexing is going on - Every interface change of JSONValue would be a silent breaking change, because the whole string domain is used up for opDispatch I have seen similar issues with simplexml in PHP. Using opDispatch to match all possible names except a few doesn't work so well. I'm not sure if you've changed it already, but I agree with the earlier comment about changing the flag for pretty printing from a boolean to an enum value. Booleans in interfaces are one of my pet peeves.
Re: RFC: std.json sucessor
On 23.08.2014 14:19, w0rp wrote: I'm not sure if you've changed it already, but I agree with the earlier comment about changing the flag for pretty printing from a boolean to an enum value. Booleans in interfaces are one of my pet peeves. It's split into two separate functions now. I guess having to type out a full enum value would be too distracting in this case, since these functions will be used pretty frequently.
Re: RFC: std.json sucessor
On 22.08.2014 20:08, Walter Bright wrote: (...) 2. The escape-sequenced strings presumably consume GC memory. This will be a problem for high performance code. I suggest either leaving them undecoded in the token stream, and letting higher level code decide what to do about them, or provide a hook that the user can override with his own allocation scheme. If we don't make it possible to use std.json without invoking the GC, I believe the module will fail in the long term. I've added two new types now to abstract away how strings and numbers are represented in memory. For string literals this means that for the input types string and immutable(ubyte)[] they will always be stored as slices of the input buffer. JSONValue has a .rawValue property to access them, as well as an alias this'd .value property that transparently unescapes. At that point it would also be easy to provide a method that takes an arbitrary output range to unescape without allocations. Documentation and code are both updated (also added a note about exception behavior).
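The output-range-based unescaping mentioned at the end could look roughly like this. This is a simplified, self-contained sketch that handles only a few escape sequences; it is not the module's actual code:

```d
// Sketch of unescaping a raw (still-escaped) input slice into an
// arbitrary output range, so no intermediate string is allocated.
// Unicode escapes (\uXXXX) and error handling are omitted for brevity.
import std.range.primitives : put;

void unescapeInto(R)(string raw, ref R sink)
{
    for (size_t i = 0; i < raw.length; ++i)
    {
        if (raw[i] == '\\' && i + 1 < raw.length)
        {
            ++i;
            switch (raw[i])
            {
                case 'n':  put(sink, '\n'); break;
                case 't':  put(sink, '\t'); break;
                case '"':  put(sink, '"');  break;
                case '\\': put(sink, '\\'); break;
                default:   put(sink, raw[i]); break; // simplified
            }
        }
        else
        {
            put(sink, raw[i]);
        }
    }
}

unittest
{
    import std.array : appender;
    auto app = appender!string();
    unescapeInto(`line\nbreak`, app); // escaped slice of the input buffer
    assert(app.data == "line\nbreak"); // unescaped into the caller's sink
}
```

With such a primitive, `.rawValue` stays a zero-copy slice of the input, and the caller decides whether unescaping allocates at all (e.g. by passing a stack-backed sink).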
Re: RFC: std.json sucessor
On 22.08.2014 21:00, Marc Schütz schue...@gmx.net wrote: On Friday, 22 August 2014 at 18:08:34 UTC, Sönke Ludwig wrote: On 22.08.2014 19:57, Marc Schütz schue...@gmx.net wrote: The easiest and cleanest way would be to add a function in std.data.json: auto parse(Target, Source)(Source input) if(is(Target == JSONValue)) { return ...; } The various overloads of `std.conv.parse` already have mutually exclusive template constraints; they will not collide with our function. Okay, for parse that may work, but what about to!()? What's the problem with to!()? to!() definitely doesn't have a template constraint that excludes JSONValue. Instead, it will convert any struct type that doesn't define toString() to a D-like representation.
Re: RFC: std.json sucessor
On Saturday, 23 August 2014 at 16:49:23 UTC, Sönke Ludwig wrote: On 22.08.2014 21:00, Marc Schütz schue...@gmx.net wrote: On Friday, 22 August 2014 at 18:08:34 UTC, Sönke Ludwig wrote: On 22.08.2014 19:57, Marc Schütz schue...@gmx.net wrote: The easiest and cleanest way would be to add a function in std.data.json: auto parse(Target, Source)(Source input) if(is(Target == JSONValue)) { return ...; } The various overloads of `std.conv.parse` already have mutually exclusive template constraints; they will not collide with our function. Okay, for parse that may work, but what about to!()? What's the problem with to!()? to!() definitely doesn't have a template constraint that excludes JSONValue. Instead, it will convert any struct type that doesn't define toString() to a D-like representation. For converting a JSONValue to a different type, JSONValue can implement `opCast`, which is the regular interface that std.conv.to uses if it's available. For converting something _to_ a JSONValue, std.conv.to will simply create an instance of it by calling the constructor.
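The opCast route can be illustrated with a stand-in type. This assumes std.conv.to's documented behavior of routing through a member `opCast` when the source type provides one; `Value` below is hypothetical, not the proposed JSONValue:

```d
// Stand-in type showing how a member opCast lets std.conv.to convert
// away from a value type without to!() needing any special knowledge
// of it. Real JSONValue conversions would of course be richer.
import std.conv : to;

struct Value
{
    double payload;

    T opCast(T)() const if (is(T : double))
    {
        return cast(T) payload;
    }
}

unittest
{
    auto v = Value(3.5);
    assert(v.to!double == 3.5); // routed through opCast!double
    assert(v.to!int == 3);      // opCast!int truncates
}
```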
Re: RFC: std.json sucessor
On 23.08.2014 19:25, Marc Schütz schue...@gmx.net wrote: On Saturday, 23 August 2014 at 16:49:23 UTC, Sönke Ludwig wrote: On 22.08.2014 21:00, Marc Schütz schue...@gmx.net wrote: On Friday, 22 August 2014 at 18:08:34 UTC, Sönke Ludwig wrote: On 22.08.2014 19:57, Marc Schütz schue...@gmx.net wrote: The easiest and cleanest way would be to add a function in std.data.json: auto parse(Target, Source)(Source input) if(is(Target == JSONValue)) { return ...; } The various overloads of `std.conv.parse` already have mutually exclusive template constraints; they will not collide with our function. Okay, for parse that may work, but what about to!()? What's the problem with to!()? to!() definitely doesn't have a template constraint that excludes JSONValue. Instead, it will convert any struct type that doesn't define toString() to a D-like representation. For converting a JSONValue to a different type, JSONValue can implement `opCast`, which is the regular interface that std.conv.to uses if it's available. For converting something _to_ a JSONValue, std.conv.to will simply create an instance of it by calling the constructor. That would just introduce the aforementioned dependency cycle between JSONValue, the parser and the lexer. Possible, but not particularly pretty. Also, using the JSONValue constructor to parse an input string would contradict the intuitive behavior of just storing the string value.
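The constructor objection can be made concrete with a toy type (all names hypothetical):

```d
// Toy type making the ambiguity concrete: if to!() created a value by
// calling the constructor, and that constructor parsed its argument,
// the intuitive "store this string as a string value" meaning is lost.
struct Value
{
    string str;
    this(string s) { str = s; } // stores the string, does NOT parse it
}

unittest
{
    auto v = Value(`{"not": "parsed"}`);
    // intuitive behavior: the argument is stored verbatim as a string
    // value, so parsing has to live in a separate entry point (a parse
    // function), not in the constructor that to!() would call.
    assert(v.str == `{"not": "parsed"}`);
}
```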