Re: RFC: std.json successor
On 2/5/15 1:07 AM, Jakob Ovrum wrote: On Thursday, 21 August 2014 at 22:35:18 UTC, Sönke Ludwig wrote: ... Added to the review queue as a work in progress with relevant links: http://wiki.dlang.org/Review_Queue Yay! -- Andrei
Re: RFC: std.json successor
On 05.02.2015 at 10:07, Jakob Ovrum wrote: On Thursday, 21 August 2014 at 22:35:18 UTC, Sönke Ludwig wrote: ... Added to the review queue as a work in progress with relevant links: http://wiki.dlang.org/Review_Queue Thanks! It should be ready for an official review in one or two weeks, when my schedule relaxes a little bit.
Re: RFC: std.json successor
On Thursday, 21 August 2014 at 22:35:18 UTC, Sönke Ludwig wrote: ... Added to the review queue as a work in progress with relevant links: http://wiki.dlang.org/Review_Queue
Re: RFC: std.json successor
On Saturday, 18 October 2014 at 19:53:23 UTC, Sean Kelly wrote: Python 0.499547972114 0.499779920774 0.499811461578 12.01s, 1355.1Mb I assume this is the standard json module? I'm wondering how ujson performs, as it's considered the fastest Python JSON module.
Re: RFC: std.json successor
On 10/18/14, 4:53 PM, Sean Kelly wrote: On Friday, 17 October 2014 at 18:27:34 UTC, Ary Borenszweig wrote: Once it's done you can compare its performance against other languages with this benchmark: https://github.com/kostya/benchmarks/tree/master/json Wow, the C++ Rapid parser is really impressive. I threw together a test with my own parser for comparison, and Rapid still beat it. It's the first parser I've encountered that's faster.
Ruby       0.4995479721139979 0.49977992077421846 0.49981146157805545  7.53s, 2330.9Mb
Python     0.499547972114 0.499779920774 0.499811461578  12.01s, 1355.1Mb
C++ Rapid  0.499548 0.49978 0.499811  1.75s, 1009.0Mb
JEP (mine) 0.49954797 0.49977992 0.49981146  2.38s, 203.4Mb
Yes, C++ Rapid seems to be really, really fast. It has some SSE2/SSE4-specific optimizations and I guess a lot more. I have to investigate more in order to do something similar :-)
Re: RFC: std.json successor
On Friday, 17 October 2014 at 18:27:34 UTC, Ary Borenszweig wrote: Once it's done you can compare its performance against other languages with this benchmark: https://github.com/kostya/benchmarks/tree/master/json Wow, the C++ Rapid parser is really impressive. I threw together a test with my own parser for comparison, and Rapid still beat it. It's the first parser I've encountered that's faster.
Ruby       0.4995479721139979 0.49977992077421846 0.49981146157805545  7.53s, 2330.9Mb
Python     0.499547972114 0.499779920774 0.499811461578  12.01s, 1355.1Mb
C++ Rapid  0.499548 0.49978 0.499811  1.75s, 1009.0Mb
JEP (mine) 0.49954797 0.49977992 0.49981146  2.38s, 203.4Mb
Re: RFC: std.json successor
On Saturday, 18 October 2014 at 19:53:23 UTC, Sean Kelly wrote: On Friday, 17 October 2014 at 18:27:34 UTC, Ary Borenszweig wrote: Once it's done you can compare its performance against other languages with this benchmark: https://github.com/kostya/benchmarks/tree/master/json Wow, the C++ Rapid parser is really impressive. I threw together a test with my own parser for comparison, and Rapid still beat it. It's the first parser I've encountered that's faster.
C++ Rapid  0.499548 0.49978 0.499811  1.75s, 1009.0Mb
JEP (mine) 0.49954797 0.49977992 0.49981146  2.38s, 203.4Mb
I just commented out the sscanf() call that was parsing the float and re-ran the test to see what the difference would be. Here's the new timing:
JEP (mine) 0. 0. 0.  1.23s, 203.1Mb
So nearly half of the total execution time was spent simply parsing floats. For this reason, I'm starting to think that this isn't the best benchmark of JSON parser performance. The other issue with my parser is that it's written in C, so all of the user-defined bits are called via a bank of function pointers. If it were converted to C++ or D, where this could be done via templates, it would be much faster. Just as a test I nulled out the function pointers I'd set to see what the cost of indirection was, and here's the result:
JEP (mine) nan nan nan  0.57s, 109.4Mb
The memory difference is interesting, and I can't entirely explain it other than to say that it's probably an artifact of mapping the file in as virtual memory rather than reading it into an allocated buffer. Either way, roughly 0.60s can be attributed to indirect function calls and the bit of logic on the other side, which seems like a good candidate for optimization.
Re: RFC: std.json successor
On 8/21/14, 7:35 PM, Sönke Ludwig wrote: Following up on the recent std.jgrandson thread [1], I've picked up the work (a lot earlier than anticipated) and finished a first version of a loose blend of said std.jgrandson, vibe.data.json and some changes that I had planned for vibe.data.json for a while. I'm quite pleased by the results so far, although without a serialization framework it still misses a very important building block. Code: https://github.com/s-ludwig/std_data_json Docs: http://s-ludwig.github.io/std_data_json/ DUB: http://code.dlang.org/packages/std_data_json Destroy away! ;) [1]: http://forum.dlang.org/thread/lrknjl$co7$1...@digitalmars.com Once it's done you can compare its performance against other languages with this benchmark: https://github.com/kostya/benchmarks/tree/master/json
Re: RFC: std.json successor
On 12.10.2014 20:17, Andrei Alexandrescu wrote: Here's my destruction of std.data.json. * lexer.d: ** Beautifully done. From what I understand, if the input is string or immutable(ubyte)[] then the strings are carved out as slices of the input, as opposed to newly allocated. Awesome. ** The string after lexing is correctly scanned and stored in raw format (escapes are not rewritten) and decoded on demand. Problem with decoding is that it may allocate memory, and it would be great (and not difficult) to make the lexer 100% lazy/non-allocating. To achieve that, lexer.d should define TWO Kinds of strings at the lexer level: regular string and undecoded string. The former is lexer.d's way of saying I got lucky in the sense that it didn't detect any '\\' so the raw and decoded strings are identical. No need for anyone to do any further processing in the majority of cases = win. The latter means the lexer lexed the string, saw at least one '\\', and leaves it to the caller to do the actual decoding. This is actually more or less done in unescapeStringLiteral() - if it doesn't find any '\\', it just returns the original string. Also, JSONString allows accessing its .rawValue without doing any decoding/allocations. https://github.com/s-ludwig/std_data_json/blob/master/source/stdx/data/json/lexer.d#L1421 Unfortunately .rawValue can't be @nogc because the raw value might have to be constructed first when the input is not a string (in this case unescaping is done on-the-fly for efficiency reasons). ** After moving the decoding business out of lexer.d, a way to take this further would be to qualify lexer methods as @nogc if the input is string/immutable(ubyte)[]. I wonder how to implement a conditional attribute. We'll probably need a language enhancement for that. Isn't @nogc inferred? Everything is templated, so that should be possible. Or does attribute inference only work for template functions and not for methods of templated types? Should it?
** The implementation uses manually-defined tagged unions for work. Could we use Algebraic instead - dogfooding and all that? I recall there was a comment in Sönke's original work that Algebraic has a specific issue (was it false pointers?) - so the question arises, should we fix Algebraic and use it thus helping other uses as well? I had started on an implementation of a type- and ID-safe TaggedAlgebraic that uses Algebraic for its internal storage. If we can get that in first, it should be no problem to use it instead (with no or minimal API breakage). However, it uses a struct instead of an enum to define the Kind (which is the only nice way I could conceive to safely couple enum value and type at compile time), so it's not as nice in the generated documentation. ** I see the boolean kind, should we instead have the true_ and false_ kinds? I always found it cumbersome and awkward to work like that. What would be the reason to go that route? ** Long story short I couldn't find any major issue with this module, and I looked! I do think the decoding logic should be moved outside of lexer.d or at least the JSONLexerRange. * generator.d: looking good, no special comments. Like the consistent use of structs filled with options as template parameters. * foundation.d: ** At four words per token, Location seems pretty bulky. How about reducing line and column to uint? Single line JSON files >64k (or line counts >64k) are no exception, so that would only work in a limited way. My thought about this was that it is quite unusual to actually store the tokens for most purposes (especially when directly serializing to a native D type), so that it should have minimal impact on performance or memory consumption. ** Could JSONException create the message string in toString (i.e. when/if used) as opposed to in the constructor? That could of course be done, but then you'd not get the full error message using ex.msg, only with ex.toString(), which usually prints a call trace instead.
Alternatively, it's also possible to completely avoid using exceptions with LexOptions.noThrow. * parser.d: ** How about using .init instead of .defaults for options? I'd slightly tend to prefer the more explicit defaults, especially because init could mean either defaults or none (currently it means none). But another idea would be to invert the option values so that defaults==none... any objections? ** I'm a bit surprised by JSONParserNode.Kind. E.g. the objectStart/End markers shouldn't appear as nodes. There should be an object node only. I guess that's needed for laziness. While you could infer the end of an object in the parser range by looking for the first entry that doesn't start with a key node, the same would not be possible for arrays, so in general the end marker *is* required. Note that the parser range is a StAX style parser, which is still very close to the lexical structure of the document. I was also wondering if
Re: RFC: std.json successor
On 12.10.2014 21:04, Sean Kelly wrote: I'd like to see unescapeStringLiteral() made public. Then I can unescape multiple strings to the same preallocated destination, or even unescape in place (guaranteed to work since the result will always be smaller than the input). Will do. Same for the inverse functions.
Re: RFC: std.json successor
On 12.10.2014 23:52, Sean Kelly wrote: Oh, it looks like you aren't checking for 0x7F (DEL) as a control character. It doesn't get mentioned in the JSON spec, so I left it out. But I guess there's no harm in adding it anyway.
Re: RFC: std.json successor
On 13/10/14 09:39, Sönke Ludwig wrote: ** At four words per token, Location seems pretty bulky. How about reducing line and column to uint? Single line JSON files >64k (or line counts >64k) are no exception 64k? -- /Jacob Carlborg
Re: RFC: std.json successor
On 22/08/14 00:35, Sönke Ludwig wrote: Following up on the recent std.jgrandson thread [1], I've picked up the work (a lot earlier than anticipated) and finished a first version of a loose blend of said std.jgrandson, vibe.data.json and some changes that I had planned for vibe.data.json for a while. I'm quite pleased by the results so far, although without a serialization framework it still misses a very important building block. Code: https://github.com/s-ludwig/std_data_json JSONToken.Kind and JSONParserNode.Kind could be ubyte to save space. -- /Jacob Carlborg
Re: RFC: std.json successor
On 13.10.2014 13:33, Jacob Carlborg wrote: On 13/10/14 09:39, Sönke Ludwig wrote: ** At four words per token, Location seems pretty bulky. How about reducing line and column to uint? Single line JSON files >64k (or line counts >64k) are no exception 64k? Oh, I read that as both line and column packed into a single uint, because of "four words per token" - taking word == 16 bit, but Andrei obviously meant word == (void*).sizeof. If simply using uint instead of size_t is meant, then that's of course a different thing.
Re: RFC: std.json successor
On 13.10.2014 13:37, Jacob Carlborg wrote: On 22/08/14 00:35, Sönke Ludwig wrote: Following up on the recent std.jgrandson thread [1], I've picked up the work (a lot earlier than anticipated) and finished a first version of a loose blend of said std.jgrandson, vibe.data.json and some changes that I had planned for vibe.data.json for a while. I'm quite pleased by the results so far, although without a serialization framework it still misses a very important building block. Code: https://github.com/s-ludwig/std_data_json JSONToken.Kind and JSONParserNode.Kind could be ubyte to save space. But it won't save space in practice, at least on x86, due to alignment, and depending on what the compiler assumes, the access can also be slower that way.
Re: RFC: std.json successor
On 10/13/14, 4:48 AM, Sönke Ludwig wrote: On 13.10.2014 13:37, Jacob Carlborg wrote: On 22/08/14 00:35, Sönke Ludwig wrote: Following up on the recent std.jgrandson thread [1], I've picked up the work (a lot earlier than anticipated) and finished a first version of a loose blend of said std.jgrandson, vibe.data.json and some changes that I had planned for vibe.data.json for a while. I'm quite pleased by the results so far, although without a serialization framework it still misses a very important building block. Code: https://github.com/s-ludwig/std_data_json JSONToken.Kind and JSONParserNode.Kind could be ubyte to save space. But it won't save space in practice, at least on x86, due to alignment, and depending on what the compiler assumes, the access can also be slower that way. Correct. -- Andrei
Re: RFC: std.json successor
On 10/13/14, 4:45 AM, Sönke Ludwig wrote: On 13.10.2014 13:33, Jacob Carlborg wrote: On 13/10/14 09:39, Sönke Ludwig wrote: ** At four words per token, Location seems pretty bulky. How about reducing line and column to uint? Single line JSON files >64k (or line counts >64k) are no exception 64k? Oh, I read that as both line and column packed into a single uint, because of "four words per token" - taking word == 16 bit, but Andrei obviously meant word == (void*).sizeof. If simply using uint instead of size_t is meant, then that's of course a different thing. Yah, one uint for each. -- Andrei
Re: RFC: std.json successor
Here's my destruction of std.data.json. * lexer.d: ** Beautifully done. From what I understand, if the input is string or immutable(ubyte)[] then the strings are carved out as slices of the input, as opposed to newly allocated. Awesome. ** The string after lexing is correctly scanned and stored in raw format (escapes are not rewritten) and decoded on demand. Problem with decoding is that it may allocate memory, and it would be great (and not difficult) to make the lexer 100% lazy/non-allocating. To achieve that, lexer.d should define TWO Kinds of strings at the lexer level: regular string and undecoded string. The former is lexer.d's way of saying I got lucky in the sense that it didn't detect any '\\' so the raw and decoded strings are identical. No need for anyone to do any further processing in the majority of cases = win. The latter means the lexer lexed the string, saw at least one '\\', and leaves it to the caller to do the actual decoding. ** After moving the decoding business out of lexer.d, a way to take this further would be to qualify lexer methods as @nogc if the input is string/immutable(ubyte)[]. I wonder how to implement a conditional attribute. We'll probably need a language enhancement for that. ** The implementation uses manually-defined tagged unions for work. Could we use Algebraic instead - dogfooding and all that? I recall there was a comment in Sönke's original work that Algebraic has a specific issue (was it false pointers?) - so the question arises, should we fix Algebraic and use it thus helping other uses as well? ** I see the boolean kind, should we instead have the true_ and false_ kinds? ** Long story short I couldn't find any major issue with this module, and I looked! I do think the decoding logic should be moved outside of lexer.d or at least the JSONLexerRange. * generator.d: looking good, no special comments. Like the consistent use of structs filled with options as template parameters. 
* foundation.d: ** At four words per token, Location seems pretty bulky. How about reducing line and column to uint? ** Could JSONException create the message string in toString (i.e. when/if used) as opposed to in the constructor? * parser.d: ** How about using .init instead of .defaults for options? ** I'm a bit surprised by JSONParserNode.Kind. E.g. the objectStart/End markers shouldn't appear as nodes. There should be an object node only. I guess that's needed for laziness. ** It's unclear where memory is being allocated in the parser. @nogc annotations wherever appropriate would be great. * value.d: ** Looks like this is/may be the only place where memory is being managed, at least if the input is string/immutable(ubyte)[]. Right? ** Algebraic ftw. Overall: This is very close to everything I hoped! A bit more care to @nogc would be awesome, especially with the upcoming focus on memory management going forward. After one more pass it would be great to move forward for review. Andrei
Re: RFC: std.json successor
On Sunday, 12 October 2014 at 18:17:29 UTC, Andrei Alexandrescu wrote: ** The string after lexing is correctly scanned and stored in raw format (escapes are not rewritten) and decoded on demand. Problem with decoding is that it may allocate memory, and it would be great (and not difficult) to make the lexer 100% lazy/non-allocating. To achieve that, lexer.d should define TWO Kinds of strings at the lexer level: regular string and undecoded string. The former is lexer.d's way of saying I got lucky in the sense that it didn't detect any '\\' so the raw and decoded strings are identical. No need for anyone to do any further processing in the majority of cases = win. The latter means the lexer lexed the string, saw at least one '\\', and leaves it to the caller to do the actual decoding. I'd like to see unescapeStringLiteral() made public. Then I can unescape multiple strings to the same preallocated destination, or even unescape in place (guaranteed to work since the result will always be smaller than the input).
Re: RFC: std.json successor
Oh, it looks like you aren't checking for 0x7F (DEL) as a control character.
Re: RFC: std.json successor
Been using it for a bit now. I think the only thing I have to say is that having to insert all of those `JSONValue`s everywhere is tiresome, and I never know when I have to do it. Atila On Thursday, 21 August 2014 at 22:35:18 UTC, Sönke Ludwig wrote: Following up on the recent std.jgrandson thread [1], I've picked up the work (a lot earlier than anticipated) and finished a first version of a loose blend of said std.jgrandson, vibe.data.json and some changes that I had planned for vibe.data.json for a while. I'm quite pleased by the results so far, although without a serialization framework it still misses a very important building block. Code: https://github.com/s-ludwig/std_data_json Docs: http://s-ludwig.github.io/std_data_json/ DUB: http://code.dlang.org/packages/std_data_json The new code contains:
- Lazy lexer in the form of a token input range (using slices of the input if possible)
- Lazy streaming parser (StAX style) in the form of a node input range
- Eager DOM style parser returning a JSONValue
- Range based JSON string generator taking either a token range, a node range, or a JSONValue
- Opt-out location tracking (line/column) for tokens, nodes and values
- No opDispatch() for JSONValue - this has been shown to do more harm than good in vibe.data.json
The DOM style JSONValue type is based on std.variant.Algebraic. This currently has a few usability issues that can be solved by upgrading/fixing Algebraic:
- Operator overloading only works sporadically
- No tag enum is supported, so that switch()ing on the type of a value doesn't work and an if-else cascade is required
- Operations and conversions between different Algebraic types are not conveniently supported, which becomes important when other similar formats get supported (e.g. BSON)
Assuming that those points are solved, I'd like to get some early feedback before going for an official review. One open issue is how to handle unescaping of string literals.
Currently it always unescapes immediately, which is more efficient for general input ranges when the unescaped result is needed, but less efficient for string inputs when the unescaped result is not needed. Maybe a flag could be used to conditionally switch behavior depending on the input range type. Destroy away! ;) [1]: http://forum.dlang.org/thread/lrknjl$co7$1...@digitalmars.com
Re: RFC: std.json successor
On Wednesday, 27 August 2014 at 23:51:54 UTC, Walter Bright wrote: On 8/26/2014 12:24 AM, Don wrote: On Monday, 25 August 2014 at 23:29:21 UTC, Walter Bright wrote: On 8/25/2014 4:15 PM, Ola Fosheim Grøstad ola.fosheim.grostad+dl...@gmail.com wrote: On Monday, 25 August 2014 at 21:24:11 UTC, Walter Bright wrote: I didn't know that. But recall I did implement it in DMC++, and it turned out to simply not be useful. I'd be surprised if the new C++ support for it does anything worthwhile. Well, one should initialize with signaling NaN. Then you get an exception if you try to compute using uninitialized values. That's the theory. The practice doesn't work out so well. To be more concrete: Processors from AMD have signalling NaN behaviour which is different from processors from Intel. And the situation is worst on most other architectures. It's a lost cause, I think. The other issues were just when the snan = qnan conversion took place. This is quite unclear given the extensive constant folding, CTFE, etc., that D does. It was also affected by how dmd generates code. Some code gen on floating point doesn't need the FPU, such as toggling the sign bit. But then what happens with snan = qnan? The whole thing is an undefined, unmanageable mess. I think the way to think of it is, to the programmer, there is *no such thing* as an snan value. It's an implementation detail that should be invisible. Semantically, a signalling nan is a qnan value with a hardware breakpoint on it. An SNAN should never enter the CPU. The CPU always converts them to QNAN if you try. You're kind of not supposed to know that SNAN exists. Because of this, I think SNAN only ever makes sense for static variables. Setting local variables to snan doesn't make sense, since the snan has to enter the CPU. Making that work without triggering the snan is very painful. Making it trigger the snan on all forms of access is even worse.
If float.init exists, it cannot be an snan, since you are allowed to use float.init.
Re: RFC: std.json successor
On Thursday, 28 August 2014 at 11:09:16 UTC, Don wrote: I think the way to think of it is, to the programmer, there is *no such thing* as an snan value. It's an implementation detail that should be invisible. Semantically, a signalling nan is a qnan value with a hardware breakpoint on it. I disagree with this view.
QNAN: there is a value, but it does not result in a real.
SNAN: the value is missing for an unspecified reason.
AFAIK some x86 ops such as ROUNDPD allow you to treat SNAN as QNAN or throw an exception. So there is a builtin test if needed. Other ops such as reciprocals don't throw any FP exceptions and will treat SNAN as QNAN. An SNAN should never enter the CPU. The CPU always converts them to QNAN if you try. You're kind of not supposed to know that SNAN exists. I'm not sure how you reached this interpretation? The solution should be to emit a test for SNAN explicitly or implicitly if you cannot prove that SNAN is impossible.
Re: RFC: std.json successor
Or to be more explicit: If you have SNAN then there is no point in trying to recompute the expression using a different algorithm. If you have QNAN then you might want to recompute the expression using a different algorithm (e.g. complex numbers or analytically). ?
Re: RFC: std.json successor
On Thursday, 28 August 2014 at 12:10:58 UTC, Ola Fosheim Grøstad wrote: Or to be more explicit: If you have SNAN then there is no point in trying to recompute the expression using a different algorithm. If you have QNAN then you might want to recompute the expression using a different algorithm (e.g. complex numbers or analytically). ? No. Once you load an SNAN, it isn't an SNAN any more! It is a QNAN. You cannot have an SNAN in a floating-point register (unless you do a nasty hack to pass it in). It gets converted during loading.
const float x = snan;
x = x; // x is now a qnan.
Re: RFC: std.json successor
On Thursday, 28 August 2014 at 14:43:30 UTC, Don wrote: No. Once you load an SNAN, it isn't an SNAN any more! It is a QNAN. By which definition? It is only if you consume the SNAN with an fp-exception-free arithmetic op that it should be turned into a QNAN. If you compute with an op that throws then it should throw an exception. MOV should not be viewed as a computation… It also makes sense to save SNAN to file when converting corrupted data-files. SNAN could then mean corrupted and QNAN could mean absent. You should not get an exception for loading a file. You should get an exception if you start computing on the SNAN in the file. You cannot have an SNAN in a floating-point register (unless you do a nasty hack to pass it in). It gets converted during loading. I don't understand this position. If you cannot load SNAN then why does SSE handle SNAN in arithmetic ops and compares? const float x = snan; x = x; // x is now a qnan. I disagree (and why const?) Assignment does nothing, it should not consume the SNAN. Assignment is just naming. It is not computing.
Re: RFC: std.json successor
Let me try again:
SNAN = unfortunately absent
QNAN = deliberately absent
So you can have:
compute(SNAN) = handle(exception) {
  if (can turn unfortunate situation into deliberate)
    compute(QNAN)
  else
    throw
}
Re: RFC: std.json successor
Kahan states this in a 1997 paper: «[…]An SNaN may be moved ( copied ) without incident, but any other arithmetic operation upon an SNaN is an INVALID operation ( and so is loading one onto the ix87's stack ) that must trap or else produce a new nonsignaling NaN. ( Another way to turn an SNaN into a NaN is to turn 0xxx...xxx into 1xxx...xxx with a logical OR.) Intended for, among other things, data missing from statistical collections, and for uninitialized variables[…]» ( http://www.eecs.berkeley.edu/~wkahan/ieee754status/IEEE754.PDF) x87 is legacy, it predates IEEE754 by 5 years and should be forgotten. Note also that the string representation for a signalling nan is NANS, so it reasonable to save it to file if you need to represent missing data. NAN represents 0/0, sqrt(-1), not missing data. I'm not really sure how it can be interpreted differently? Ola.
Re: RFC: std.json successor
On 8/26/2014 12:24 AM, Don wrote: On Monday, 25 August 2014 at 23:29:21 UTC, Walter Bright wrote: On 8/25/2014 4:15 PM, Ola Fosheim Grøstad ola.fosheim.grostad+dl...@gmail.com wrote: On Monday, 25 August 2014 at 21:24:11 UTC, Walter Bright wrote: I didn't know that. But recall I did implement it in DMC++, and it turned out to simply not be useful. I'd be surprised if the new C++ support for it does anything worthwhile. Well, one should initialize with signaling NaN. Then you get an exception if you try to compute using uninitialized values. That's the theory. The practice doesn't work out so well. To be more concrete: Processors from AMD have signalling NaN behaviour which is different from processors from Intel. And the situation is worst on most other architectures. It's a lost cause, I think. The other issues were just when the snan = qnan conversion took place. This is quite unclear given the extensive constant folding, CTFE, etc., that D does. It was also affected by how dmd generates code. Some code gen on floating point doesn't need the FPU, such as toggling the sign bit. But then what happens with snan = qnan? The whole thing is an undefined, unmanageable mess.
Re: RFC: std.json successor
On 25/08/14 21:49, simendsjo wrote: So ldc can remove quite a substantial amount of code in some cases. It's because the latest release of LDC has the --gc-sections flag enabled by default. -- /Jacob Carlborg
Re: RFC: std.json successor
On Monday, 25 August 2014 at 23:29:21 UTC, Walter Bright wrote: On 8/25/2014 4:15 PM, Ola Fosheim Grøstad ola.fosheim.grostad+dl...@gmail.com wrote: On Monday, 25 August 2014 at 21:24:11 UTC, Walter Bright wrote: I didn't know that. But recall I did implement it in DMC++, and it turned out to simply not be useful. I'd be surprised if the new C++ support for it does anything worthwhile. Well, one should initialize with signaling NaN. Then you get an exception if you try to compute using uninitialized values. That's the theory. The practice doesn't work out so well. To be more concrete: Processors from AMD have signalling NaN behaviour which is different from processors from Intel. And the situation is worst on most other architectures. It's a lost cause, I think.
Re: RFC: std.json successor
On Tuesday, 26 August 2014 at 07:24:19 UTC, Don wrote: Processors from AMD have signalling NaN behaviour which is different from processors from Intel. And the situation is worst on most other architectures. It's a lost cause, I think. I disagree. AFAIK signaling NaN was standardized in IEEE 754-2008. So it receives attention.
Re: RFC: std.json successor
On 25.08.2014 23:53, Ola Fosheim Grøstad ola.fosheim.grostad+dl...@gmail.com wrote: On Monday, 25 August 2014 at 21:27:42 UTC, Sönke Ludwig wrote: But why should UTF validation be the job of the lexer in the first place? Because you want to save time, it is faster to integrate validation? The most likely use scenario is to receive REST data over HTTP that needs validation. Well, so then I agree with Andrei… array of bytes it is. ;-) added as a separate proxy range. But if we end up going for validating in the lexer, it would indeed be enough to validate inside strings, because the rest of the grammar assumes a subset of ASCII. Not assumes, but defines! :-) I guess it depends on if you look at the grammar as productions or comprehensions (right term?) ;) If you have to validate UTF before lexing then you will end up needlessly scanning lots of ascii if the file contains lots of non-strings or is from an encoder that only sends pure ascii. That's true. So the ideal solution would be to *assume* UTF-8 when the input is char based and to *validate* if the input is numeric. If you want to have plugin validation of strings then you also need to differentiate strings so that the user can select which data should be just ascii, utf8, numbers, ids etc. Otherwise the user will end up doing double validation (you have to bypass 7F followed by string-end anyway). The advantage of integrated validation is that you can use 16-byte SIMD registers on the buffer. I presume you can load 16 bytes and do BITWISE-AND on the MSB, then match against string-end and carefully use this to boost performance of simultaneous UTF validation, escape-scanning, and string-end scan. A bit tricky, of course. Well, that's something that's definitely out of the scope of this proposal. Definitely an interesting direction to pursue, though. At least no UTF validation is needed.
Since all non-ASCII characters will always be composed of bytes > 0x7F, a sequence \uXXXX can be assumed to be valid wherever in the string it occurs, and all other bytes that don't belong to an escape sequence are just passed through as-is. You cannot assume \u… to be valid if you convert it. I meant X to stand for a hex digit. The point was just that you don't have to worry about interacting in a bad way with UTF sequences when you find \uXXXX.
Re: RFC: std.json sucessor
On Tuesday, 26 August 2014 at 07:51:04 UTC, Sönke Ludwig wrote: That's true. So the ideal solution would be to *assume* UTF-8 when the input is char based and to *validate* if the input is numeric. I think you should validate JSON-strings to be UTF-8 encoded even if you allow illegal Unicode values. Basically ensuring that a byte > 0x7f has the right number of bytes after it, so you don't get a byte > 0x7f as the last byte in a string etc. Well, that's something that's definitely out of the scope of this proposal. Definitely an interesting direction to pursue, though. Maybe the interface/code structure is or could be designed so that the implementation could later be version()'ed to SIMD where possible. You cannot assume \u… to be valid if you convert it. I meant X to stand for a hex digit. The point was just that you don't have to worry about interacting in a bad way with UTF sequences when you find \uXXXX. When you convert \uXXXX to UTF-8 bytes, is it then validated as a legal code point? I guess it is not necessary. Btw, I believe rapidJSON achieves high speed by converting strings in situ, so that if the prefix is escape-free it just converts in place when it hits the first escape. Thus avoiding some moving.
Re: RFC: std.json sucessor
Am 26.08.2014 03:31, schrieb Entusiastic user: Hi! Thanks for the effort you've put into this. I am having problems building with LDC 0.14.0. DMD 2.066.0 seems to work fine (all unit tests pass). Do you have any ideas why? I've fixed all errors on DMD 2.065 now. Hopefully that should also fix LDC.
Re: RFC: std.json sucessor
Am 26.08.2014 10:24, schrieb Ola Fosheim Grøstad ola.fosheim.grostad+dl...@gmail.com: On Tuesday, 26 August 2014 at 07:51:04 UTC, Sönke Ludwig wrote: That's true. So the ideal solution would be to *assume* UTF-8 when the input is char based and to *validate* if the input is numeric. I think you should validate JSON-strings to be UTF-8 encoded even if you allow illegal Unicode values. Basically ensuring that a byte > 0x7f has the right number of bytes after it, so you don't get a byte > 0x7f as the last byte in a string etc. I think this is a misunderstanding. What I mean is that if the input range passed to the lexer is char/wchar/dchar based, the lexer should assume that the input is well-formed UTF. After all this is how D strings are defined. When on the other hand a ubyte/ushort/uint range is used, the lexer should validate all string literals. Well, that's something that's definitely out of the scope of this proposal. Definitely an interesting direction to pursue, though. Maybe the interface/code structure is or could be designed so that the implementation could later be version()'ed to SIMD where possible. I guess that shouldn't be an issue. From the outside it's just a generic range that is passed in and internally it's always possible to add special cases for array inputs. If someone else wants to play around with this idea, we could of course also integrate it right away, it's just that I personally don't have the time to go to the extreme here. You cannot assume \u… to be valid if you convert it. I meant X to stand for a hex digit. The point was just that you don't have to worry about interacting in a bad way with UTF sequences when you find \uXXXX. When you convert \uXXXX to UTF-8 bytes, is it then validated as a legal code point? I guess it is not necessary. What is validated is that it forms valid UTF-16 surrogate pairs, and those are converted to a single dchar instead (if applicable). This is necessary, because otherwise the lexer would produce invalid UTF-8 for valid inputs.
Apart from that, the value is used verbatim as a dchar. Btw, I believe rapidJSON achieves high speed by converting strings in situ, so that if the prefix is escape free it just converts in place when it hits the first escape. Thus avoiding some moving. The same is true for this lexer, at least for array inputs. It actually currently just stores a slice of the string literal in all cases and lazily decodes on the first access. While doing that, it first skips any escape sequence free prefix and returns a slice if the whole string is escape sequence free.
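The surrogate-pair handling described above can be illustrated standalone. This Python helper is hypothetical (not the lexer's actual code): it combines a high-surrogate \uXXXX escape with its low partner into a single code point, which is the step needed to emit valid UTF-8 for inputs like "\uD834\uDD1E".

```python
def decode_u_escape(hex1: str, hex2: str = None) -> str:
    """Decode one \\uXXXX escape (plus a partner escape for surrogates)."""
    unit = int(hex1, 16)
    if 0xD800 <= unit <= 0xDBFF:          # high surrogate: needs a partner
        if hex2 is None:
            raise ValueError("unpaired high surrogate")
        low = int(hex2, 16)
        if not 0xDC00 <= low <= 0xDFFF:
            raise ValueError("invalid low surrogate")
        # combine the pair into a single code point above U+FFFF
        return chr(0x10000 + ((unit - 0xD800) << 10) + (low - 0xDC00))
    if 0xDC00 <= unit <= 0xDFFF:
        raise ValueError("unpaired low surrogate")
    return chr(unit)                       # BMP code point, used verbatim

assert decode_u_escape("0041") == "A"
assert decode_u_escape("D834", "DD1E") == "\U0001D11E"  # musical G clef
```

Without this pairing step, naively emitting each escape as its own code point would produce ill-formed UTF-8 for perfectly valid JSON input.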
Re: RFC: std.json sucessor
On Tuesday, 26 August 2014 at 09:05:05 UTC, Sönke Ludwig wrote: When on the other hand a ubyte/ushort/uint range is used, the lexer should validate all string literals. Yes, so this will be supported? Because this is what is most useful.
Re: RFC: std.json sucessor
I tried using -disable-linker-strip-dead, but it had no effect. From the error messages it seems the problem is compile-time and not link-time... On Tuesday, 26 August 2014 at 07:01:09 UTC, Jacob Carlborg wrote: On 25/08/14 21:49, simendsjo wrote: So ldc can remove quite a substantial amount of code in some cases. It's because the latest release of LDC has the --gc-sections flag enabled by default.
Re: RFC: std.json sucessor
Am 26.08.2014 11:11, schrieb Ola Fosheim Grøstad ola.fosheim.grostad+dl...@gmail.com: On Tuesday, 26 August 2014 at 09:05:05 UTC, Sönke Ludwig wrote: When on the other hand a ubyte/ushort/uint range is used, the lexer should validate all string literals. Yes, so this will be supported? Because this is what is most useful. If nobody plays a veto card, I'll implement it that way.
Re: RFC: std.json sucessor
On Tuesday, 26 August 2014 at 07:34:05 UTC, Ola Fosheim Grøstad wrote: On Tuesday, 26 August 2014 at 07:24:19 UTC, Don wrote: Processors from AMD have signalling NaN behaviour which is different from processors from Intel. And the situation is worse on most other architectures. It's a lost cause, I think. I disagree. AFAIK signaling NaN was standardized in IEEE 754-2008. So it receives attention. It was always in IEEE754. The decision in 754-2008 was simply to not remove it from the spec (a lot of people wanted to remove it). I don't think anything has changed. The point is, existing hardware does not support it consistently. It's not possible at reasonable cost.

---
real uninitialized_var = real.snan;

void foo() {
    real other_var = void;
    asm {
        fld uninitialized_var;
        fstp other_var;
    }
}
---

will signal on AMD, but not Intel. I'd love for this to work, but the hardware is fighting against us. I think it's useful only for debugging.
Re: RFC: std.json sucessor
On Thursday, 21 August 2014 at 22:35:18 UTC, Sönke Ludwig wrote: Following up on the recent std.jgrandson thread [1], I've picked up the work (a lot earlier than anticipated) and finished a first version of a loose blend of said std.jgrandson, vibe.data.json and some changes that I had planned for vibe.data.json for a while. I'm quite pleased by the results so far, although without a serialization framework it still misses a very important building block. Code: https://github.com/s-ludwig/std_data_json Docs: http://s-ludwig.github.io/std_data_json/ DUB: http://code.dlang.org/packages/std_data_json Do we have any benchmarks for this yet? Note that the main motivation for a new json parser was that std.json is remarkably slow in comparison to python's json or ujson.
Re: RFC: std.json sucessor
On Tuesday, 26 August 2014 at 10:55:20 UTC, Don wrote: It was always in IEEE754. The decision in 754-2008 was simply to not remove it from the spec (a lot of people wanted to remove it). I don't think anything has changed. It was implementation defined before. I think they specified the bit in 2008. fld uninitialized_var; fstp other_var; This is not SSE, but I guess MOVSS does not create exceptions either. AVX is quite complicated, but searching for signaling gives some hints about the semantics you can rely on. https://software.intel.com/en-us/articles/introduction-to-intel-advanced-vector-extensions https://software.intel.com/sites/default/files/managed/c6/a9/319433-020.pdf Ola.
Re: RFC: std.json sucessor
On Tuesday, 26 August 2014 at 12:37:58 UTC, Ola Fosheim Grøstad wrote: either. AVX is quite complicated, but searching for signaling gives some hints about the semantics you can rely on. … https://software.intel.com/sites/default/files/managed/c6/a9/319433-020.pdf (Actually, searching for SNAN is better…)
Re: RFC: std.json sucessor
With the danger of being noisy, these instructions are subject to floating point exceptions according to my (perhaps sloppy) reading of Intel Architecture Instruction Set Extensions Programming Reference (2012): (V)ADDPD, (V)ADDPS, (V)ADDSUBPD, (V)ADDSUBPS, (V)CMPPD, (V)CMPPS, (V)CVTDQ2PS, (V)CVTPD2DQ, (V)CVTPD2PS, (V)CVTPS2DQ, (V)CVTTPD2DQ, (V)CVTTPS2DQ, (V)DIVPD, (V)DIVPS, (V)DPPD*, (V)DPPS*, VFMADD132PD, VFMADD213PD, VFMADD231PD, VFMADD132PS, VFMADD213PS, VFMADD231PS, VFMADDSUB132PD, VFMADDSUB213PD, VFMADDSUB231PD, VFMADDSUB132PS, VFMADDSUB213PS, VFMADDSUB231PS, VFMSUBADD132PD, VFMSUBADD213PD, VFMSUBADD231PD, VFMSUBADD132PS, VFMSUBADD213PS, VFMSUBADD231PS, VFMSUB132PD, VFMSUB213PD, VFMSUB231PD, VFMSUB132PS, VFMSUB213PS, VFMSUB231PS, VFNMADD132PD, VFNMADD213PD, VFNMADD231PD, VFNMADD132PS, VFNMADD213PS, VFNMADD231PS, VFNMSUB132PD, VFNMSUB213PD, VFNMSUB231PD, VFNMSUB132PS, VFNMSUB213PS, VFNMSUB231PS, (V)HADDPD, (V)HADDPS, (V)HSUBPD, (V)HSUBPS, (V)MAXPD, (V)MAXPS, (V)MINPD, (V)MINPS, (V)MULPD, (V)MULPS, (V)ROUNDPS, (V)ROUNDPS, (V)SQRTPD, (V)SQRTPS, (V)SUBPD, (V)SUBPS (V)ADDSD, (V)ADDSS, (V)CMPSD, (V)CMPSS, (V)COMISD, (V)COMISS, (V)CVTPS2PD, (V)CVTSD2SI, (V)CVTSD2SS, (V)CVTSI2SD, (V)CVTSI2SS, (V)CVTSS2SD, (V)CVTSS2SI, (V)CVTTSD2SI, (V)CVTTSS2SI, (V)DIVSD, (V)DIVSS, VFMADD132SD, VFMADD213SD, VFMADD231SD, VFMADD132SS, VFMADD213SS, VFMADD231SS, VFMSUB132SD, VFMSUB213SD, VFMSUB231SD, VFMSUB132SS, VFMSUB213SS, VFMSUB231SS, VFNMADD132SD, VFNMADD213SD, VFNMADD231SD, VFNMADD132SS, VFNMADD213SS, VFNMADD231SS, VFNMSUB132SD, VFNMSUB213SD, VFNMSUB231SD, VFNMSUB132SS, VFNMSUB213SS, VFNMSUB231SS, (V)MAXSD, (V)MAXSS, (V)MINSD, (V)MINSS, (V)MULSD, (V)MULSS, (V)ROUNDSD, (V)ROUNDSS, (V)SQRTSD, (V)SQRTSS, (V)SUBSD, (V)SUBSS, (V)UCOMISD, (V)UCOMISS VCVTPH2PS, VCVTPS2PH So I guess Intel floating point exceptions trigger on computations, but not on moves? Ola.
Re: RFC: std.json sucessor
On Tuesday, 26 August 2014 at 12:37:58 UTC, Ola Fosheim Grøstad wrote: On Tuesday, 26 August 2014 at 10:55:20 UTC, Don wrote: It was always in IEEE754. The decision in 754-2008 was simply to not remove it from the spec (a lot of people wanted to remove it). I don't think anything has changed. It was implementation defined before. I think they specified the bit in 2008. fld uninitialized_var; fstp other_var; This is not SSE, but I guess MOVSS does not create exceptions either. No, it's more subtle. On the original x87, signalling NaNs are triggered for 64-bit loads, but not for 80-bit loads. You have to read the fine print to discover this. I don't think the behaviour was intentional.
Re: RFC: std.json sucessor
On Tuesday, 26 August 2014 at 13:24:11 UTC, Don wrote: No, it's more subtle. On the original x87, signalling NaNs are triggered for 64-bit loads, but not for 80-bit loads. You have to read the fine print to discover this. You are right, but it happens for loads from the FP-stack too: «Source operand is an SNaN. Does not occur if the source operand is in double extended-precision floating-point format (FLD m80fp or FLD ST(i)).» I don't think the behaviour was intentional. It seems reasonable, you need to load/save NaNs without exceptions if you do a context switch? I don't think the extended format was meant for end users. Anyway, the x87 FP stack is history, even MOVSS is considered legacy by Intel…
Re: RFC: std.json sucessor
On Monday, 25 August 2014 at 14:04:12 UTC, Sönke Ludwig wrote: Am 25.08.2014 15:07, schrieb Don: On Thursday, 21 August 2014 at 22:35:18 UTC, Sönke Ludwig wrote: Following up on the recent std.jgrandson thread [1], I've picked up the work (a lot earlier than anticipated) and finished a first version of a loose blend of said std.jgrandson, vibe.data.json and some changes that I had planned for vibe.data.json for a while. I'm quite pleased by the results so far, although without a serialization framework it still misses a very important building block. Code: https://github.com/s-ludwig/std_data_json Docs: http://s-ludwig.github.io/std_data_json/ DUB: http://code.dlang.org/packages/std_data_json The new code contains: - Lazy lexer in the form of a token input range (using slices of the input if possible) - Lazy streaming parser (StAX style) in the form of a node input range - Eager DOM style parser returning a JSONValue - Range based JSON string generator taking either a token range, a node range, or a JSONValue - Opt-out location tracking (line/column) for tokens, nodes and values - No opDispatch() for JSONValue - this has shown to do more harm than good in vibe.data.json The DOM style JSONValue type is based on std.variant.Algebraic. This currently has a few usability issues that can be solved by upgrading/fixing Algebraic: - Operator overloading only works sporadically - No tag enum is supported, so that switch()ing on the type of a value doesn't work and an if-else cascade is required - Operations and conversions between different Algebraic types is not conveniently supported, which gets important when other similar formats get supported (e.g. BSON) Assuming that those points are solved, I'd like to get some early feedback before going for an official review. One open issue is how to handle unescaping of string literals. 
Currently it always unescapes immediately, which is more efficient for general input ranges when the unescaped result is needed, but less efficient for string inputs when the unescaped result is not needed. Maybe a flag could be used to conditionally switch behavior depending on the input range type. Destroy away! ;) [1]: http://forum.dlang.org/thread/lrknjl$co7$1...@digitalmars.com One missing feature (which is also missing from the existing std.json) is support for NaN and Infinity as JSON values. Although they are not part of the formal JSON spec (which is a ridiculous omission, the argument given for excluding them is fallacious), they do get generated if you use Javascript's toString to create the JSON. Many JSON libraries (eg Google's) also generate them, so they are frequently encountered in practice. So a JSON parser should at least be able to lex them. ie this should be parsable: {foo: NaN, bar: Infinity, baz: -Infinity} This would probably best added as another (CT) optional feature. I think the default should strictly adhere to the JSON specification, though. Yes, it should be optional, but not a compile-time option. I think it should parse it, and based on a runtime flag, throw an error (perhaps an OutOfRange error or something, and use the same thing for values that exceed the representable range). An app may accept these non-standard values under certain circumstances and not others. In real-world code, you see a *lot* of these guys. Part of the reason these are important, is that NaN or Infinity generally means some Javascript code just has an uninitialized variable. Any other kind of invalid JSON typically means something very nasty has happened. It's important to distinguish these.
Re: RFC: std.json sucessor
Am 26.08.2014 15:43, schrieb Don: On Monday, 25 August 2014 at 14:04:12 UTC, Sönke Ludwig wrote: Am 25.08.2014 15:07, schrieb Don: ie this should be parsable: {foo: NaN, bar: Infinity, baz: -Infinity} This would probably best added as another (CT) optional feature. I think the default should strictly adhere to the JSON specification, though. Yes, it should be optional, but not a compile-time option. I think it should parse it, and based on a runtime flag, throw an error (perhaps an OutOfRange error or something, and use the same thing for values that exceed the representable range). An app may accept these non-standard values under certain circumstances and not others. In real-world code, you see a *lot* of these guys. Why not a compile time option? That sounds to me like such an app should simply enable parsing those values and manually test for NaN at places where it matters. For all other (the majority) of applications, encountering NaN/Infinity will simply mean that there is a bug, so it makes sense to not accept those at all by default. Apart from that I don't think that it's a good idea for the lexer in general to accept non-standard input by default. Part of the reason these are important, is that NaN or Infinity generally means some Javascript code just has an uninitialized variable. Any other kind of invalid JSON typically means something very nasty has happened. It's important to distinguish these. As far as I understood, JavaScript will output those special values as null (at least when not using external JSON libraries). But even if not, an uninitialized variable can also be very nasty, so it's hard to see why that kind of bug should be silently supported (by default).
Re: RFC: std.json sucessor
On Tuesday, 26 August 2014 at 13:43:56 UTC, Ola Fosheim Grøstad wrote: Anyway, the x87 FP stack is history, even MOVSS is considered legacy by Intel… Sorry for being off-topic, but MOVSS and VMOVSS on AMD don't throw FP exceptions either, but calculations do. So it seems like AMD and Intel are sufficiently close for D to support NaNs, IMHO. Forget the legacy… http://amd-dev.wpengine.netdna-cdn.com/wordpress/media/2012/10/26568_APM_v41.pdf Ola.
Re: RFC: std.json sucessor
On Tuesday, 26 August 2014 at 14:06:42 UTC, Sönke Ludwig wrote: Am 26.08.2014 15:43, schrieb Don: On Monday, 25 August 2014 at 14:04:12 UTC, Sönke Ludwig wrote: Am 25.08.2014 15:07, schrieb Don: ie this should be parsable: {foo: NaN, bar: Infinity, baz: -Infinity} This would probably best added as another (CT) optional feature. I think the default should strictly adhere to the JSON specification, though. Yes, it should be optional, but not a compile-time option. I think it should parse it, and based on a runtime flag, throw an error (perhaps an OutOfRange error or something, and use the same thing for values that exceed the representable range). An app may accept these non-standard values under certain circumstances and not others. In real-world code, you see a *lot* of these guys. Why not a compile time option? That sounds to me like such an app should simply enable parsing those values and manually test for NaN at places where it matters. For all other (the majority) of applications, encountering NaN/Infinity will simply mean that there is a bug, so it makes sense to not accept those at all by default. Apart from that I don't think that it's a good idea for the lexer in general to accept non-standard input by default. Please note, I've been talking about the lexer. I'm choosing my words very carefully. Part of the reason these are important, is that NaN or Infinity generally means some Javascript code just has an uninitialized variable. Any other kind of invalid JSON typically means something very nasty has happened. It's important to distinguish these. As far as I understood, JavaScript will output those special values as null (at least when not using external JSON libraries). No. Javascript generates them directly. Naive JS code generates these guys. That's why they're so important. But even if not, an uninitialized variable can also be very nasty, so it's hard to see why that kind of bug should be silently supported (by default). 
I never said it should be accepted by default. I said it is a situation which should be *lexed*. Ideally, by default it should give a different error from simply 'invalid JSON'. I believe it should ALWAYS be lexed, even if an error is ultimately generated. This is the difference: if you get NaN or Infinity, there's probably a straightforward bug in the Javascript code, but your D code is fine. Any other kind of JSON parsing error means you've got a garbage string that isn't JSON at all. They are very different errors. It's a diagnostics issue.
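For comparison, Python's standard json module behaves the way Don suggests: the non-standard constants are always lexed, and a runtime hook decides whether to accept them or fail with an error distinct from a generic syntax error. (The keys are quoted here; unquoted keys as in the earlier example are a separate extension that Python does not accept.)

```python
import json
import math

doc = '{"foo": NaN, "bar": Infinity, "baz": -Infinity}'

# Accepted by default (a deliberate, non-standard extension):
v = json.loads(doc)
assert math.isnan(v["foo"]) and v["bar"] == math.inf

# Rejected with a *specific* diagnostic via the parse_constant hook,
# which is only invoked for 'NaN', 'Infinity' and '-Infinity':
def reject(name):
    raise ValueError("non-standard JSON constant: " + name)

try:
    json.loads(doc, parse_constant=reject)
    accepted = True
except ValueError:
    accepted = False
assert not accepted
```

The hook never fires for ordinary garbage input, so "JS emitted an uninitialized value" and "this isn't JSON at all" stay distinguishable, which is exactly the diagnostics point being made.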
Re: RFC: std.json sucessor
Am 26.08.2014 16:40, schrieb Don: On Tuesday, 26 August 2014 at 14:06:42 UTC, Sönke Ludwig wrote: Am 26.08.2014 15:43, schrieb Don: On Monday, 25 August 2014 at 14:04:12 UTC, Sönke Ludwig wrote: Am 25.08.2014 15:07, schrieb Don: ie this should be parsable: {foo: NaN, bar: Infinity, baz: -Infinity} This would probably best added as another (CT) optional feature. I think the default should strictly adhere to the JSON specification, though. Yes, it should be optional, but not a compile-time option. I think it should parse it, and based on a runtime flag, throw an error (perhaps an OutOfRange error or something, and use the same thing for values that exceed the representable range). An app may accept these non-standard values under certain circumstances and not others. In real-world code, you see a *lot* of these guys. Why not a compile time option? That sounds to me like such an app should simply enable parsing those values and manually test for NaN at places where it matters. For all other (the majority) of applications, encountering NaN/Infinity will simply mean that there is a bug, so it makes sense to not accept those at all by default. Apart from that I don't think that it's a good idea for the lexer in general to accept non-standard input by default. Please note, I've been talking about the lexer. I'm choosing my words very carefully. I've been talking about the lexer, too. Sorry for the confusing use of the term parsing (after all, the lexer is also a parser, but anyway). Part of the reason these are important, is that NaN or Infinity generally means some Javascript code just has an uninitialized variable. Any other kind of invalid JSON typically means something very nasty has happened. It's important to distinguish these. As far as I understood, JavaScript will output those special values as null (at least when not using external JSON libraries). No. Javascript generates them directly. Naive JS code generates these guys. That's why they're so important. 
JSON.stringify(0/0) == null holds for all browsers that I've tested. But even if not, an uninitialized variable can also be very nasty, so it's hard to see why that kind of bug should be silently supported (by default). I never said it should be accepted by default. I said it is a situation which should be *lexed*. Ideally, by default it should give a different error from simply 'invalid JSON'. I believe it should ALWAYS be lexed, even if an error is ultimately generated. This is the difference: if you get NaN or Infinity, there's probably a straightforward bug in the Javascript code, but your D code is fine. Any other kind of JSON parsing error means you've got a garbage string that isn't JSON at all. They are very different errors. It's a diagnostics issue. The error will be more like "filename(line:column): Invalid token" - possibly the text following the line/column could also be displayed. Wouldn't that be sufficient?
Re: RFC: std.json sucessor
On Tuesday, 26 August 2014 at 14:40:02 UTC, Don wrote: This is the difference: if you get NaN or Infinity, there's probably a straightforward bug in the Javascript code, but your D code is fine. Any other kind of JSON parsing error means you've got a garbage string that isn't JSON at all. They are very different errors. I don't care either way, but JSON.stringify() has the following support: IE8 and up Firefox 3.5 and up Safari 4 and up Chrome So not using it is very much legacy…
Re: RFC: std.json sucessor
Am 26.08.2014 16:51, schrieb Sönke Ludwig: Am 26.08.2014 16:40, schrieb Don: This is the difference: if you get NaN or Infinity, there's probably a straightforward bug in the Javascript code, but your D code is fine. Any other kind of JSON parsing error means you've got a garbage string that isn't JSON at all. They are very different errors. It's a diagnostics issue. The error will be more like filename(line:column): Invalid token - possibly the text following the line/column could also be displayed. Wouldn't that be sufficient? One argument against supporting it in the parser is that the parser currently works without any configuration, but the user would then have to specify two sets of configuration options with this added.
Re: RFC: std.json sucessor
I've added support (compile time option [1]) for long and BigInt in the lexer (and parser), see [2]. JSONValue currently still only stores double for numbers. There are two options for extending JSONValue: 1. Add long and BigInt to the set of supported types for JSONValue. This preserves all features of Algebraic and would later still allow transparent conversion to other similar value types (e.g. BSONValue). On the other hand it would be necessary to always check the actual type before accessing a number, or the Algebraic would throw. 2. Instead of double, store a JSONNumber in the Algebraic. This enables all the transparent conversions of JSONNumber and would thus be more convenient, but blocks the way for possible automatic conversions in the future. I'm leaning towards 1, because allowing generic conversion between different JSONValue-like types was one of my prime goals for the new module. [1]: http://s-ludwig.github.io/std_data_json/stdx/data/json/lexer/LexOptions.html [2]: http://s-ludwig.github.io/std_data_json/stdx/data/json/lexer/JSONNumber.html
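The caller-side cost of option 1 can be seen in Python's json module, which similarly keeps integers and floats as distinct parsed types, so consuming code has to check which one it actually got before using it (an illustrative analogy, not the proposed JSONValue API):

```python
import json

# Analogous to option 1: the DOM value keeps distinct number types,
# and the consumer checks the stored type before accessing the number.
v = json.loads('{"n": 12345678901234567890, "x": 1.5}')
assert isinstance(v["n"], int)     # arbitrary-precision integer preserved
assert isinstance(v["x"], float)

def as_double(x):
    # the explicit conversion step option 1 forces on the caller
    return float(x) if isinstance(x, int) else x

assert isinstance(as_double(v["n"]), float)
```

Option 2 would correspond to a single wrapper number type that converts transparently on access; option 1 trades that convenience for keeping the value set open to generic conversion between JSONValue-like types.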
Re: RFC: std.json sucessor
On Monday, 25 August 2014 at 11:30:15 UTC, Sönke Ludwig wrote: I've added support (compile time option [1]) for long and BigInt in the lexer (and parser), see [2]. JSONValue currently still only stores double for numbers. It can be very useful to have a base 10 exponent representation in certain situations where you need to have the exact same results in two systems (like a third-party ERP server versus a client side application). Base 2 exponents are tricky (incorrect) when you read ASCII. E.g. I have resorted to using Decimal in Python just to avoid the weird round-off issues when calculating prices where the price is given in fractions of the order unit. Perhaps a marginal problem, but could be important for some serious application areas where you need to integrate D with existing systems (for which you don't have the source code).
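The Decimal workaround mentioned above looks like this in Python: number literals parsed as binary doubles pick up base-2 representation error, while handing the untouched literal to Decimal keeps base-10 semantics end to end.

```python
import json
from decimal import Decimal

# The classic binary round-off: 0.1 and 0.2 are not exact as doubles.
assert 0.1 + 0.2 != 0.3
assert Decimal("0.1") + Decimal("0.2") == Decimal("0.3")

# json can pass the raw number literal straight to Decimal, so the
# value never goes through a double at all:
order = json.loads('{"unit_price": 19.99, "qty": 3}', parse_float=Decimal)
total = order["unit_price"] * order["qty"]
assert total == Decimal("59.97")    # exact, reproducible across systems
```

A future Decimal option in the lexer would serve the same purpose: two systems parsing the same document get bit-identical price arithmetic.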
Re: RFC: std.json sucessor
On Thursday, 21 August 2014 at 22:35:18 UTC, Sönke Ludwig wrote: Following up on the recent std.jgrandson thread [1], I've picked up the work (a lot earlier than anticipated) and finished a first version of a loose blend of said std.jgrandson, vibe.data.json and some changes that I had planned for vibe.data.json for a while. I'm quite pleased by the results so far, although without a serialization framework it still misses a very important building block. Code: https://github.com/s-ludwig/std_data_json Docs: http://s-ludwig.github.io/std_data_json/ DUB: http://code.dlang.org/packages/std_data_json The new code contains: - Lazy lexer in the form of a token input range (using slices of the input if possible) - Lazy streaming parser (StAX style) in the form of a node input range - Eager DOM style parser returning a JSONValue - Range based JSON string generator taking either a token range, a node range, or a JSONValue - Opt-out location tracking (line/column) for tokens, nodes and values - No opDispatch() for JSONValue - this has shown to do more harm than good in vibe.data.json The DOM style JSONValue type is based on std.variant.Algebraic. This currently has a few usability issues that can be solved by upgrading/fixing Algebraic: - Operator overloading only works sporadically - No tag enum is supported, so that switch()ing on the type of a value doesn't work and an if-else cascade is required - Operations and conversions between different Algebraic types is not conveniently supported, which gets important when other similar formats get supported (e.g. BSON) Assuming that those points are solved, I'd like to get some early feedback before going for an official review. One open issue is how to handle unescaping of string literals. Currently it always unescapes immediately, which is more efficient for general input ranges when the unescaped result is needed, but less efficient for string inputs when the unescaped result is not needed. 
Maybe a flag could be used to conditionally switch behavior depending on the input range type. Destroy away! ;) [1]: http://forum.dlang.org/thread/lrknjl$co7$1...@digitalmars.com One missing feature (which is also missing from the existing std.json) is support for NaN and Infinity as JSON values. Although they are not part of the formal JSON spec (which is a ridiculous omission, the argument given for excluding them is fallacious), they do get generated if you use Javascript's toString to create the JSON. Many JSON libraries (eg Google's) also generate them, so they are frequently encountered in practice. So a JSON parser should at least be able to lex them. ie this should be parsable: {foo: NaN, bar: Infinity, baz: -Infinity} You should also put tests in for what happens when you pass NaN or infinity to toJSON. It shouldn't silently generate invalid JSON.
Re: RFC: std.json sucessor
On Monday, 25 August 2014 at 13:07:08 UTC, Don wrote: practice. So a JSON parser should at least be able to lex them. ie this should be parsable: {foo: NaN, bar: Infinity, baz: -Infinity} You should also put tests in for what happens when you pass NaN or infinity to toJSON. It shouldn't silently generate invalid JSON. I believe you are allowed to use very high exponents, though. Like: 1E999. So you need to decide if those should be mapped to +Infinity or to the max value… NaN also comes in two forms with differing semantics: signalling (NaNs) and quiet (NaN). NaN is used for 0/0 and sqrt(-1), but NaNs is used for illegal values and failure. For some reason D does not seem to support this aspect of IEEE754? I cannot find .nans listed on the page http://dlang.org/property.html The distinction is important when you do conditional branching. With NaNs you might not be able to figure out which branch to take since you might have missed out on a real value, with NaN you got the value (which is known to be not real) and you might be able to branch.
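The very-high-exponent case is worth pinning down with an example: 1E999 is grammatically valid JSON even though no double can hold it, and Python's parser answers the mapping question above by overflowing to infinity rather than clamping to the max value or rejecting.

```python
import json
import math

# Syntactically valid JSON whose value exceeds the double range:
assert json.loads("1E999") == math.inf
assert json.loads("-1E999") == -math.inf

# Values inside the double range still round-trip normally:
assert json.loads("1E308") == 1e308
```

Whatever a new std.json lexer chooses here (overflow to infinity, clamp, or a range error), the behaviour deserves an explicit decision and a test, since such literals can appear in standard-conforming input.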
Re: RFC: std.json sucessor
Am 25.08.2014 14:12, schrieb Ola Fosheim Grøstad ola.fosheim.grostad+dl...@gmail.com: On Monday, 25 August 2014 at 11:30:15 UTC, Sönke Ludwig wrote: I've added support (compile time option [1]) for long and BigInt in the lexer (and parser), see [2]. JSONValue currently still only stores double for numbers. It can be very useful to have a base 10 exponent representation in certain situations where you need to have the exact same results in two systems (like a third party ERP server versus a client side application). Base 2 exponents are tricky (incorrect) when you read ascii. E.g. I have resorted to using Decimal in Python just to avoid the weird round off issues when calculating prices where the price is given in fractions of the order unit. Perhaps a marginal problem, but could be important for some serious application areas where you need to integrate D with existing systems (for which you don't have the source code). In fact, I've already prepared the code for that, but commented it out for now, because I wanted to have an efficient algorithm for converting double to Decimal and because we should probably first add a Decimal type to Phobos instead of adding it to the JSON module.
Re: RFC: std.json sucessor
Am 25.08.2014 15:07, schrieb Don: On Thursday, 21 August 2014 at 22:35:18 UTC, Sönke Ludwig wrote: Following up on the recent std.jgrandson thread [1], I've picked up the work (a lot earlier than anticipated) and finished a first version of a loose blend of said std.jgrandson, vibe.data.json and some changes that I had planned for vibe.data.json for a while. I'm quite pleased by the results so far, although without a serialization framework it still misses a very important building block.

Code: https://github.com/s-ludwig/std_data_json
Docs: http://s-ludwig.github.io/std_data_json/
DUB: http://code.dlang.org/packages/std_data_json

The new code contains:
- Lazy lexer in the form of a token input range (using slices of the input if possible)
- Lazy streaming parser (StAX style) in the form of a node input range
- Eager DOM style parser returning a JSONValue
- Range based JSON string generator taking either a token range, a node range, or a JSONValue
- Opt-out location tracking (line/column) for tokens, nodes and values
- No opDispatch() for JSONValue - this has shown to do more harm than good in vibe.data.json

The DOM style JSONValue type is based on std.variant.Algebraic. This currently has a few usability issues that can be solved by upgrading/fixing Algebraic:
- Operator overloading only works sporadically
- No tag enum is supported, so that switch()ing on the type of a value doesn't work and an if-else cascade is required
- Operations and conversions between different Algebraic types are not conveniently supported, which gets important when other similar formats get supported (e.g. BSON)

Assuming that those points are solved, I'd like to get some early feedback before going for an official review. One open issue is how to handle unescaping of string literals. Currently it always unescapes immediately, which is more efficient for general input ranges when the unescaped result is needed, but less efficient for string inputs when the unescaped result is not needed. Maybe a flag could be used to conditionally switch behavior depending on the input range type. Destroy away! ;)

[1]: http://forum.dlang.org/thread/lrknjl$co7$1...@digitalmars.com

One missing feature (which is also missing from the existing std.json) is support for NaN and Infinity as JSON values. Although they are not part of the formal JSON spec (which is a ridiculous omission, the argument given for excluding them is fallacious), they do get generated if you use Javascript's toString to create the JSON. Many JSON libraries (e.g. Google's) also generate them, so they are frequently encountered in practice. So a JSON parser should at least be able to lex them, i.e. this should be parsable: {foo: NaN, bar: Infinity, baz: -Infinity}

This would probably be best added as another (CT) optional feature. I think the default should strictly adhere to the JSON specification, though.

You should also put tests in for what happens when you pass NaN or infinity to toJSON. It shouldn't silently generate invalid JSON.

Good point. The current solution to just use formattedWrite("%.16g") is also not ideal.
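For comparison, Python's standard json module faces exactly this trade-off: its generator emits the non-standard NaN/Infinity literals by default and only rejects them when asked to be strict, and its lexer accepts them. A small Python illustration (unrelated to the D code under review, just to show the design space):

```python
import json
import math

# By default, Python's json module emits the non-standard literals that
# JavaScript's toString produces - technically invalid JSON:
print(json.dumps({"foo": float("nan"), "bar": float("inf")}))
# {"foo": NaN, "bar": Infinity}

# With allow_nan=False the generator refuses to emit invalid JSON:
try:
    json.dumps({"foo": float("nan")}, allow_nan=False)
except ValueError as e:
    print("rejected:", e)

# The lexer side: loads() accepts these literals by default, so
# round-tripping works even though the text is not strict JSON.
value = json.loads('{"foo": NaN, "bar": Infinity, "baz": -Infinity}')
assert math.isnan(value["foo"]) and math.isinf(value["bar"])
```

So both directions (lex the literals, but refuse to silently generate them unless explicitly enabled) coexist behind option flags, much like the CT-optional feature discussed above.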
Re: RFC: std.json sucessor
Am 25.08.2014 16:04, schrieb Sönke Ludwig: Am 25.08.2014 15:07, schrieb Don: On Thursday, 21 August 2014 at 22:35:18 UTC, Sönke Ludwig wrote: ...

One missing feature (which is also missing from the existing std.json) is support for NaN and Infinity as JSON values. ... So a JSON parser should at least be able to lex them, i.e. this should be parsable: {foo: NaN, bar: Infinity, baz: -Infinity}

This would probably be best added as another (CT) optional feature. I think the default should strictly adhere to the JSON specification, though. http://s-ludwig.github.io/std_data_json/stdx/data/json/lexer/LexOptions.specialFloatLiterals.html

You should also put tests in for what happens when you pass NaN or infinity to toJSON. It shouldn't silently generate invalid JSON.

Good point. The current solution to just use formattedWrite("%.16g") is also not ideal. By default, floating-point special values are now output as 'null', according to the ECMAScript standard. Optionally, they will be emitted as 'NaN' and 'Infinity': http://s-ludwig.github.io/std_data_json/stdx/data/json/generator/GeneratorOptions.specialFloatLiterals.html
Re: RFC: std.json sucessor
On Monday, 25 August 2014 at 15:34:29 UTC, Sönke Ludwig wrote: By default, floating-point special values are now output as 'null', according to the ECMAScript standard. Optionally, they will be emitted as 'NaN' and 'Infinity': ECMAScript presumes double. I think one should base Phobos on language-independent standards. I suggest: http://tools.ietf.org/html/rfc7159 For a web server it would be most useful to get an exception, since otherwise you risk ending up with web clients not working and nothing logged. It is better to have an exception and log an error so the problem can be fixed.
Re: RFC: std.json sucessor
On Monday, 25 August 2014 at 15:46:12 UTC, Ola Fosheim Grøstad wrote: For a web server it would be most useful to get an exception since you risk ending up with web-clients not working with no logging. It is better to have an exception and log an error so the problem can be fixed.

Let me expand a bit on the difference between web clients and servers, assuming D is used on the server:
* Web servers have to check all input and log illegal activity. It is either a bug or an attack.
* Web clients don't have to check input from the server (at most a crypto check) and should not do double work if servers validate anyway.
* Web servers detect errors and send the error as a response to the client that displays it as a warning to the user. This is the uncommon case so you don't want to burden the client with it.

From this we can infer:
- It makes more sense for ECMAScript to turn illegal values into null since it runs on the client.
- The server needs efficient validation of input so that it can have faster response.
- The more integration of validation of typedness you can have in the parser, the better.

Thus it would be an advantage to be able to configure the validation done in the parser (through template mechanisms):
1. On write: throw an exception on all illegal values or values that cannot be represented in the format. If the values are illegal then the client should not receive it. It could cause legal problems (like wrong prices).
2. On read: add the ability to configure the validation of typedness on many parameters:
- no nulls, no dicts, only nesting arrays etc.
- predetermined key-values and automatic mapping to structs on exact match.
- require all leaf arrays to be uniform (array of strings, array of numbers).
- match a predefined grammar etc.
Re: RFC: std.json sucessor
On 8/23/2014 6:32 PM, Brad Roberts via Digitalmars-d wrote: I'm not convinced that using an adapter algorithm won't be just as fast. Consider your own talks on optimizing the existing dmd lexer. In those talks you've talked about the evils of additional processing on every byte. That's what you're talking about here. While it's possible that the inliner and other optimizer steps might be able to integrate the two phases and remove some overhead, I'll believe it when I see the resulting assembly code. On the other hand, deadalnix demonstrated that the ldc optimizer was able to remove the extra code. I have a reasonable faith that optimization can be improved where necessary to cover this.
Re: RFC: std.json sucessor
On 8/23/2014 3:51 PM, Andrei Alexandrescu wrote: An adapter would solve the wrong problem here. There's nothing to adapt from and to. An adapter would be good if e.g. the stream uses UTF-16 or some Windows encoding. Bytes are the natural input for a json parser. The adaptation is to take arbitrary byte input in an unknown encoding and produce valid UTF. Note that many html readers scan the bytes to see if it is ASCII, UTF, some code page encoding, Shift-JIS, etc., and translate accordingly. I do not see why that is less costly to put inside the JSON lexer than as an adapter.
Re: RFC: std.json sucessor
On 8/25/2014 6:23 AM, Ola Fosheim Grøstad ola.fosheim.grostad+dl...@gmail.com wrote: On Monday, 25 August 2014 at 13:07:08 UTC, Don wrote: practice. So a JSON parser should at least be able to lex them. ie this should be parsable: {foo: NaN, bar: Infinity, baz: -Infinity} You should also put tests in for what happens when you pass NaN or infinity to toJSON. It shouldn't silently generate invalid JSON. I believe you are allowed to use very high exponents, though. Like: 1E999. So you need to decide if those should be mapped to +Infinity or to the max value… Infinity. Mapping to max value would be a horrible bug. NaN also comes in two forms with differing semantics: signalling (NaNs) and quiet (NaN). Quiet NaN is used for 0/0 and sqrt(-1), while signalling NaN is used for illegal values and failure. For some reason D does not seem to support this aspect of IEEE754? I cannot find .nans listed on the page http://dlang.org/property.html Because I tried supporting them in C++. It doesn't work for various reasons. Nobody else supports them, either.
Re: RFC: std.json sucessor
On 08/25/2014 09:35 PM, Walter Bright wrote: On 8/23/2014 6:32 PM, Brad Roberts via Digitalmars-d wrote: I'm not convinced that using an adapter algorithm won't be just as fast. Consider your own talks on optimizing the existing dmd lexer. In those talks you've talked about the evils of additional processing on every byte. That's what you're talking about here. While it's possible that the inliner and other optimizer steps might be able to integrate the two phases and remove some overhead, I'll believe it when I see the resulting assembly code. On the other hand, deadalnix demonstrated that the ldc optimizer was able to remove the extra code. I have a reasonable faith that optimization can be improved where necessary to cover this.

I just happened to write a very small script yesterday and tested with the compilers (with dub --build=release):
dmd: 2.8 MB
gdc: 3.3 MB
ldc: 0.5 MB
So ldc can remove quite a substantial amount of code in some cases.
Re: RFC: std.json sucessor
On Monday, 25 August 2014 at 19:38:05 UTC, Walter Bright wrote: The adaptation is to take arbitrary byte input in an unknown encoding and produce valid UTF. I agree. For a restful http service the encoding should be specified in the http header and the input rejected if it isn't UTF compatible. For that use scenario you only want validation, not conversion. However some validation is free, like if you only accept numbers you could just turn off parsing of strings in the template… If files are read from storage then you can reread the file if it fails validation on the first pass. I wonder, in which use scenario it is that both of these conditions fail?
1. unspecified character-set and cannot assume UTF for JSON
2. unable to re-parse
Re: RFC: std.json sucessor
On Monday, 25 August 2014 at 19:42:03 UTC, Walter Bright wrote: Infinity. Mapping to max value would be a horrible bug. Yes… but then you are reading an illegal value that JSON does not support… For some reason D does not seem to support this aspect of IEEE754? I cannot find .nans listed on the page http://dlang.org/property.html Because I tried supporting them in C++. It doesn't work for various reasons. Nobody else supports them, either. I haven't tested, but Python is supposed to throw on NaNs. gcc has support for nans in their documentation: https://gcc.gnu.org/onlinedocs/gcc/Other-Builtins.html IBM Fortran supports it… I think supporting signaling NaN is important for correctness.
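The quiet/signalling distinction is only a single bit of the IEEE 754 binary64 encoding: a NaN has an all-ones exponent and a nonzero mantissa, and the top mantissa bit acts as the quiet flag on most current platforms. A rough, plain-Python illustration of that bit layout (nothing D-specific, and the quiet-bit convention is an assumption about common hardware):

```python
import math
import struct

def classify_nan(x: float) -> str:
    """Classify a binary64 value by its raw IEEE 754 bits."""
    bits = struct.unpack("<Q", struct.pack("<d", x))[0]
    exponent = (bits >> 52) & 0x7FF
    mantissa = bits & ((1 << 52) - 1)
    if exponent != 0x7FF or mantissa == 0:
        return "not a NaN"
    # Top mantissa bit set -> quiet NaN; clear -> signalling NaN
    # (the convention on x86/ARM and most other IEEE 754 platforms).
    return "quiet NaN" if bits & (1 << 51) else "signalling NaN"

assert classify_nan(float("nan")) == "quiet NaN"
assert classify_nan(1.5) == "not a NaN"

# A signalling NaN built directly from its bit pattern; note that doing
# arithmetic on it would typically quieten it (or trap, if enabled).
snan = struct.unpack("<d", struct.pack("<Q", 0x7FF0000000000001))[0]
assert math.isnan(snan)
```

This also shows why library support is awkward: the signalling bit survives only as long as the value is moved around as raw bits, which matches Walter's experience that sNaN support tends not to hold up in practice.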
Re: RFC: std.json sucessor
Am 25.08.2014 17:46, schrieb Ola Fosheim Grøstad ola.fosheim.grostad+dl...@gmail.com: On Monday, 25 August 2014 at 15:34:29 UTC, Sönke Ludwig wrote: By default, floating-point special values are now output as 'null', according to the ECMA-script standard. Optionally, they will be emitted as 'NaN' and 'Infinity': ECMAScript presumes double. I think one should base Phobos on language-independent standards. I suggest: http://tools.ietf.org/html/rfc7159 Well, of course it's based on that RFC, did you seriously think something else? However, that standard has no mention of infinity or NaN, and since JSON is designed to be a subset of ECMA script, it's basically the only thing that comes close. For a web server it would be most useful to get an exception since you risk ending up with web-clients not working with no logging. It is better to have an exception and log an error so the problem can be fixed. Although you have a point there of course, it's also highly unlikely that those clients would work correctly if we presume that JSON supported infinity/NaN. So it would really be just coincidence to detect a bug like that. But I generally agree, it's just that the anti-exception voices are pretty loud these days (including Walter's), so that I opted for a non-throwing solution instead. I guess it wouldn't hurt though to default to throwing an exception, while still providing the GeneratorOptions.specialFloatLiterals option to handle those values without exception overhead, but in a non standard-conforming way.
Re: RFC: std.json sucessor
On Monday, 25 August 2014 at 20:04:10 UTC, Ola Fosheim Grøstad wrote: I think supporting signaling NaN is important for correctness. It is defined in C++11: http://en.cppreference.com/w/cpp/types/numeric_limits/signaling_NaN
Re: RFC: std.json sucessor
- It makes more sense for ECMAScript to turn illegal values into null since it runs on the client. Like... node.js? Sorry, just kidding. I don't think it makes sense for clients to be less strict about such things, but I do agree with your assessment about being as strict as possible on the server. I also do think that exceptions are a perfect tool especially for server applications and that instead of avoiding them because they are slow, they should better be made fast enough to not be an issue.
Re: RFC: std.json sucessor
Am 25.08.2014 21:50, schrieb Ola Fosheim Grøstad ola.fosheim.grostad+dl...@gmail.com: On Monday, 25 August 2014 at 19:38:05 UTC, Walter Bright wrote: The adaptation is to take arbitrary byte input in an unknown encoding and produce valid UTF. I agree. For a restful http service the encoding should be specified in the http header and the input rejected if it isn't UTF compatible. For that use scenario you only want validation, not conversion. However some validation is free, like if you only accept numbers you could just turn off parsing of strings in the template… If files are read from storage then you can reread the file if it fails validation on the first pass. I wonder, in which use scenario it is that both of these conditions fail? 1. unspecified character-set and cannot assume UTF for JSON 2. unable to re-parse BTW, JSON is *required* to be UTF encoded anyway as per RFC-7159, which is another argument for just letting the lexer assume valid UTF.
Re: RFC: std.json sucessor
On Monday, 25 August 2014 at 20:21:01 UTC, Sönke Ludwig wrote: Well, of course it's based on that RFC, did you seriously think something else? I made no assumptions, just responded to what you wrote :-). It would be reasonable in the context of vibe.d to assume the ECMAScript spec. But I generally agree, it's just that the anti-exception voices are pretty loud these days (including Walter's), so that I opted for a non-throwing solution instead. Yes, the minimum requirement is to just get "did not validate" directly as a single value. One can create a wrapper to get exceptions. I guess it wouldn't hurt though to default to throwing an exception, while still providing the GeneratorOptions.specialFloatLiterals option to handle those values without exception overhead, but in a non standard-conforming way. What I care most about is getting all the free validation that can be added with no extra cost. That will make writing web services easier. Like if you can define constraints like:
- root is array, values are strings.
- root is array, second level only arrays, third level is numbers.
- root is dict, all arrays contain only numbers.
What is a bit annoying about generic libs is that you have no idea what you are getting, so you have to spend time creating dull validation code. But maybe StructuredJSON should be a separate library. It would be useful for REST services to specify the grammar and auto-generate both javascript and D structures to hold it along with validation code. However, just turning off parsing of true, false, null, [, { etc. seems like a cheap addition that also can improve parsing speed if the compiler can make do with two if statements instead of a switch. Ola.
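A constraint of that kind is cheap to express as a post-parse check, even if the lexer-integrated version argued for above would avoid building the DOM at all. A hypothetical sketch (function name invented for illustration, using Python's stdlib json in place of the D API) of the "root is array, values are strings" rule:

```python
import json

def require_string_array(text: str) -> list:
    """Parse JSON and enforce: root is an array, all values are strings.

    A hypothetical post-parse validator; an integrated lexer-level
    check could reject mismatches without building the DOM at all.
    """
    value = json.loads(text)
    if not isinstance(value, list):
        raise ValueError("root must be an array")
    for i, item in enumerate(value):
        if not isinstance(item, str):
            raise ValueError(f"element {i} must be a string")
    return value

assert require_string_array('["a", "b"]') == ["a", "b"]
try:
    require_string_array('["a", 1]')
except ValueError as e:
    print("rejected:", e)
```

The point of generating such validators from a declared grammar, as suggested above, is precisely to avoid writing this dull code by hand for every service.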
Re: RFC: std.json sucessor
Am 25.08.2014 22:21, schrieb Sönke Ludwig: that standard has no mention of infinity or NaN Sorry, to be precise, it has no suggestion of how to *handle* infinity or NaN.
Re: RFC: std.json sucessor
On Monday, 25 August 2014 at 20:35:32 UTC, Sönke Ludwig wrote: BTW, JSON is *required* to be UTF encoded anyway as per RFC-7159, which is another argument for just letting the lexer assume valid UTF. The lexer cannot assume valid UTF since the client might be a rogue, but it can just bail out if the lookahead isn't JSON? So UTF-validation is limited to strings. You have to parse the strings because of the \u escapes of course, so some basic validation is unavoidable? But I guess full validation of string content could be another useful option, along with an "ignore escapes" mode for the case where you want to avoid decode-encode scenarios (like for a proxy, or if you store pre-escaped unicode in a database).
Re: RFC: std.json sucessor
On 8/25/2014 1:21 PM, Ola Fosheim Grøstad ola.fosheim.grostad+dl...@gmail.com wrote: On Monday, 25 August 2014 at 20:04:10 UTC, Ola Fosheim Grøstad wrote: I think supporting signaling NaN is important for correctness. It is defined in C++11: http://en.cppreference.com/w/cpp/types/numeric_limits/signaling_NaN I didn't know that. But recall I did implement it in DMC++, and it turned out to simply not be useful. I'd be surprised if the new C++ support for it does anything worthwhile.
Re: RFC: std.json sucessor
On 8/25/2014 1:35 PM, Sönke Ludwig wrote: BTW, JSON is *required* to be UTF encoded anyway as per RFC-7159, which is another argument for just letting the lexer assume valid UTF. I think that settles it.
Re: RFC: std.json sucessor
Am 25.08.2014 22:51, schrieb Ola Fosheim Grøstad ola.fosheim.grostad+dl...@gmail.com: On Monday, 25 August 2014 at 20:35:32 UTC, Sönke Ludwig wrote: BTW, JSON is *required* to be UTF encoded anyway as per RFC-7159, which is another argument for just letting the lexer assume valid UTF. The lexer cannot assume valid UTF since the client might be a rogue, but it can just bail out if the lookahead isn't JSON? So UTF-validation is limited to strings. But why should UTF validation be the job of the lexer in the first place? D's string type is also defined to be UTF-8, so given that, it would of course be free to assume valid UTF-8. I agree with Walter there that validation/conversion should be added as a separate proxy range. But if we end up going for validating in the lexer, it would indeed be enough to validate inside strings, because the rest of the grammar assumes a subset of ASCII. You have to parse the strings because of the \u escapes of course, so some basic validation is unavoidable? At least no UTF validation is needed. Since all non-ASCII characters will always be composed of bytes > 0x7F, a \uXXXX escape sequence can be assumed to be valid wherever in the string it occurs, and all other bytes that don't belong to an escape sequence are just passed through as-is. But I guess full validation of string content could be another useful option along with ignore escapes for the case where you want to avoid decode-encode scenarios. (like for a proxy, or if you store pre-escaped unicode in a database)
Re: RFC: std.json sucessor
On 8/25/2014 12:49 PM, simendsjo wrote: I just happened to write a very small script yesterday and tested with the compilers (with dub --build=release). dmd: 2.8 mb gdc: 3.3 mb ldc 0.5 mb So ldc can remove quite a substantial amount of code in some cases. Speed optimizations are different.
Re: RFC: std.json sucessor
On Monday, 25 August 2014 at 21:27:42 UTC, Sönke Ludwig wrote: But why should UTF validation be the job of the lexer in the first place? Because you want to save time, it is faster to integrate validation? The most likely use scenario is to receive REST data over HTTP that needs validation. Well, so then I agree with Andrei… array of bytes it is. ;-) added as a separate proxy range. But if we end up going for validating in the lexer, it would indeed be enough to validate inside strings, because the rest of the grammar assumes a subset of ASCII. Not assumes, but defines! :-) If you have to validate UTF before lexing then you will end up needlessly scanning lots of ascii if the file contains lots of non-strings or is from an encoder that only sends pure ascii. If you want to have plugin validation of strings then you also need to differentiate strings so that the user can select which data should be just ascii, utf8, numbers, ids etc. Otherwise the user will end up doing double validation (you have to bypass bytes > 0x7F followed by string-end anyway). The advantage of integrated validation is that you can use 16 byte SIMD registers on the buffer. I presume you can load 16 bytes and do BITWISE-AND on the MSB, then match against string-end and carefully use this to boost performance of simultaneous UTF validation, escape-scanning, and string-end scan. A bit tricky, of course. You cannot assume \u… to be valid if you convert it.
Re: RFC: std.json sucessor
On Monday, 25 August 2014 at 21:53:50 UTC, Ola Fosheim Grøstad wrote: I presume you can load 16 bytes and do BITWISE-AND on the MSB, then match against string-end and carefully use this to boost performance of simultanous UTF validation, escape-scanning, and string-end scan. A bit tricky, of course. I think it is doable and worth it… https://software.intel.com/sites/landingpage/IntrinsicsGuide/ e.g.: __mmask16 _mm_cmpeq_epu8_mask (__m128i a, __m128i b) __mmask32 _mm256_cmpeq_epu8_mask (__m256i a, __m256i b) __mmask64 _mm512_cmpeq_epu8_mask (__m512i a, __m512i b) __mmask16 _mm_test_epi8_mask (__m128i a, __m128i b) etc. So you can: 1. preload registers with … , \\… and \0\0\0… 2. then compare signed/unsigned/equal whatever. 3. then load 16,32 or 64 bytes of data and stream until the masks trigger 4. tests masks 5. resolve any potential issues, goto 3
Re: RFC: std.json sucessor
On Monday, 25 August 2014 at 21:24:11 UTC, Walter Bright wrote: I didn't know that. But recall I did implement it in DMC++, and it turned out to simply not be useful. I'd be surprised if the new C++ support for it does anything worthwhile. Well, one should initialize with signaling NaN. Then you get an exception if you try to compute using uninitialized values.
Re: RFC: std.json sucessor
On Monday, 25 August 2014 at 22:40:00 UTC, Ola Fosheim Grøstad wrote: On Monday, 25 August 2014 at 21:53:50 UTC, Ola Fosheim Grøstad wrote: I presume you can load 16 bytes and do BITWISE-AND on the MSB, then match against string-end and carefully use this to boost performance of simultaneous UTF validation, escape-scanning, and string-end scan. A bit tricky, of course. I think it is doable and worth it… ...

D:YAML uses a similar approach, but with 8 bytes (a plain ulong - portable) to detect how many ASCII chars there are before the first non-ASCII UTF-8 sequence, and it significantly improves performance (I didn't keep any numbers unfortunately, but it decreases decoding overhead to a fraction for most inputs, since YAML (and JSON) files tend to be mostly ASCII with non-ASCII from time to time in strings; if we know that we have e.g. 100 chars incoming that are plain ASCII, we can use a fast path for them and only consider decoding after that). See the countASCII() function in https://github.com/kiith-sa/D-YAML/blob/master/source/dyaml/reader.d However, this approach is useful only if you decode the whole buffer at once, not if you do something like foreach(dchar ch; "asdsššdfáľäô") {}, which is the most obvious way to decode in D.

FWIW, decoding _was_ a significant overhead in D:YAML (again, didn't keep numbers, but at a time it was around 10% in the profiler), and I didn't like the fact that it prevented making my code @nogc - I ended up copying chunks of std.utf and making them @nogc nothrow (D:YAML as a whole is not @nogc, but I use @nogc in some parts basically as @noalloc to ensure I don't allocate anything).
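The 8-bytes-at-a-time trick is portable because a single 64-bit AND against 0x8080808080808080 reveals whether any of the 8 bytes has its high bit set, i.e. whether any byte starts or continues a multi-byte UTF-8 sequence. A rough Python rendition of the idea (not the actual D:YAML code):

```python
def count_leading_ascii(data: bytes) -> int:
    """Count the bytes before the first non-ASCII byte, checking
    8 bytes at a time via a high-bit mask (the portable ulong trick)."""
    HIGH_BITS = 0x8080808080808080
    i = 0
    # Fast path: an 8-byte chunk with no high bit set is pure ASCII.
    while i + 8 <= len(data):
        chunk = int.from_bytes(data[i:i + 8], "little")
        if chunk & HIGH_BITS:
            break
        i += 8
    # Slow path: scan the remaining bytes individually.
    while i < len(data) and data[i] < 0x80:
        i += 1
    return i

assert count_leading_ascii(b"hello, world") == 12
assert count_leading_ascii("abcdefghij\u00e9x".encode("utf-8")) == 10
```

A real implementation would reinterpret the buffer in place instead of copying bytes out per chunk, and the same mask generalizes to 16/32/64-byte SIMD registers as discussed above.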
Re: RFC: std.json sucessor
On 8/25/2014 4:15 PM, Ola Fosheim Grøstad ola.fosheim.grostad+dl...@gmail.com wrote: On Monday, 25 August 2014 at 21:24:11 UTC, Walter Bright wrote: I didn't know that. But recall I did implement it in DMC++, and it turned out to simply not be useful. I'd be surprised if the new C++ support for it does anything worthwhile. Well, one should initialize with signaling NaN. Then you get an exception if you try to compute using uninitialized values. That's the theory. The practice doesn't work out so well.
Re: RFC: std.json sucessor
Btw, maybe it would be a good idea to take a look at the JSON that various browsers generate to see if there are any differences? Then one could tune optimizations to the most common coding, like this:
1. start parsing assuming browser-style restricted JSON grammar.
2. on failure jump to the slower generic JSON parser.
Chrome does not seem to generate whitespace in JSON.stringify(). And I would not be surprised if the encoding of double is similar across browsers. Ola.
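Browser JSON.stringify output indeed contains no insignificant whitespace. The same compact form is what e.g. Python's json module produces with explicit separators, which illustrates concretely what a "no whitespace between tokens" fast path would be tuned for (an analogy only, nothing to do with the D lexer):

```python
import json

data = {"a": [1, 2.5, None], "b": "text"}

# Python's default output still inserts a space after ':' and ','...
print(json.dumps(data))
# {"a": [1, 2.5, null], "b": "text"}

# ...while separators=(",", ":") matches browser JSON.stringify output:
compact = json.dumps(data, separators=(",", ":"))
print(compact)
# {"a":[1,2.5,null],"b":"text"}
assert " " not in compact
```

A parser could speculatively assume the compact grammar (no whitespace handling at all) and restart with the generic path on the first mismatch, as sketched in steps 1 and 2 above.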
Re: RFC: std.json sucessor
On Monday, 25 August 2014 at 23:24:43 UTC, Kiith-Sa wrote: D:YAML uses a similar approach, but with 8 bytes (plain ulong - portable) to detect how many ASCII chars are there before the first non-ASCII UTF-8 sequence, and it significantly improves performance (didn't keep any numbers unfortunately, but it Cool! I think often you will have an array of numbers so you could subtract 0…, then parse offset-bytes and convert the mantissa/exponent using shuffles and simd. Somehow…
Re: RFC: std.json sucessor
Hi! Thanks for the effort you've put in this. I am having problems with building with LDC 0.14.0. DMD 2.066.0 seems to work fine (all unit tests pass). Do you have any ideas why? I am using Ubuntu 3.10 (Linux 3.11.0-15-generic x86_64). Master was at 6a9f8e62e456c3601fe8ff2e1fbb640f38793d08.

$ dub fetch std_data_json --version=~master
$ cd std_data_json-master/
$ dub test --compiler=ldc2
Generating test runner configuration '__test__library__' for 'library' (library).
Building std_data_json ~master configuration __test__library__, build type unittest.
Running ldc2...
source/stdx/data/json/parser.d(77): Error: safe function 'stdx.data.json.parser.__unittestL68_22' cannot call system function 'object.AssociativeArray!(string, JSONValue).AssociativeArray.length'
source/stdx/data/json/parser.d(124): Error: safe function 'stdx.data.json.parser.__unittestL116_24' cannot call system function 'object.AssociativeArray!(string, JSONValue).AssociativeArray.length'
source/stdx/data/json/parser.d(341): Error: function stdx.data.json.parser.JSONParserRange!(JSONLexerRange!string).JSONParserRange.opAssign is not callable because it is annotated with @disable
source/stdx/data/json/parser.d(341): Error: safe function 'stdx.data.json.parser.__unittestL318_32' cannot call system function 'stdx.data.json.parser.JSONParserRange!(JSONLexerRange!string).JSONParserRange.opAssign'
source/stdx/data/json/parser.d(633): Error: function stdx.data.json.lexer.JSONToken.opAssign is not callable because it is annotated with @disable
source/stdx/data/json/parser.d(633): Error: 'stdx.data.json.lexer.JSONToken.opAssign' is not nothrow
source/stdx/data/json/parser.d(630): Error: function 'stdx.data.json.parser.JSONParserNode.literal' is nothrow yet may throw
FAIL .dub/build/__test__library__-unittest-linux.posix-x86_64-ldc2-0F620B217010475A5A4E545A57CDD09A/ __test__library__ executable
Error executing command test: ldc2 failed with exit code 1.
Thanks
Re: RFC: std.json sucessor
... I am using Ubuntu 3.10 (Linux 3.11.0-15-generic x86_64). ... I meant Ubuntu 13.10 :D
Re: RFC: std.json sucessor
On Saturday, 23 August 2014 at 05:28:55 UTC, Walter Bright wrote: On 8/22/2014 9:48 PM, Ola Fosheim Grøstad wrote: On Saturday, 23 August 2014 at 04:36:34 UTC, Walter Bright wrote: On 8/22/2014 9:01 PM, Ola Fosheim Grøstad wrote: Does this mean that D is getting resizable stack allocations in lower stack frames? That has a lot of implications for code gen. scopebuffer does not require resizeable stack allocations. So you cannot use the stack for resizable allocations. Please, take a look at how scopebuffer works. I have? It requires an upper bound to stay on the stack, that creates a big hole in the stack. I don't think wasting the stack or moving to the heap is a nice predictable solution. It would be better to just have a couple of regions that do reverse stack allocations, but the most efficient solution is the one I outlined. With JSON you might be able to create an upper bound of say 4-8 times the size of the source iff you know the file size. You don't if you are streaming. (scopebuffer is too unpredictable for real time, a pure stack solution is predictable)
Re: RFC: std.json sucessor
On 8/22/2014 11:25 PM, Ola Fosheim Grøstad wrote: On Saturday, 23 August 2014 at 05:28:55 UTC, Walter Bright wrote: On 8/22/2014 9:48 PM, Ola Fosheim Grøstad wrote: On Saturday, 23 August 2014 at 04:36:34 UTC, Walter Bright wrote: On 8/22/2014 9:01 PM, Ola Fosheim Grøstad wrote: Does this mean that D is getting resizable stack allocations in lower stack frames? That has a lot of implications for code gen. scopebuffer does not require resizeable stack allocations. So you cannot use the stack for resizable allocations. Please, take a look at how scopebuffer works. I have? It requires an upper bound to stay on the stack, that creates a big hole in the stack. I don't think wasting the stack or moving to the heap is a nice predictable solution. It would be better to just have a couple of regions that do reverse stack allocations, but the most efficient solution is the one I outlined. Scopebuffer is extensively used in Warp, and works very well. The hole in the stack is not a significant problem. With JSON you might be able to create an upper bound of say 4-8 times the size of the source iff you know the file size. You don't if you are streaming. (scopebuffer is too unpredictable for real time, a pure stack solution is predictable) You can always implement your own buffering system and pass it in - that's the point, it's under user control.
Re: RFC: std.json sucessor
On Saturday, 23 August 2014 at 06:41:11 UTC, Walter Bright wrote: Scopebuffer is extensively used in Warp, and works very well. The hole in the stack is not a significant problem. Well, on a webserver you don't want to push out the caches for no good reason. You can always implement your own buffering system and pass it in - that's the point, it's under user control. My point is that you need compiler support to get good buffering options on the stack. Something like an @alloca_inline: auto buffer = @alloca_inline getstuff(); process(buffer); I think all memory allocation should be under compiler control, the library solutions are bound to be suboptimal, i.e. slower.
Re: RFC: std.json sucessor
On 8/22/14, Sönke Ludwig digitalmars-d@puremagic.com wrote: Hmmm, but it *is* a string. Isn't the problem more the use of with in this case? Yeah, maybe so. I thought for a second it was a tuple, but then I saw the square brackets and was left scratching my head. :)
Re: RFC: std.json sucessor
On 23.08.2014 03:05, Walter Bright wrote: On 8/22/2014 2:27 PM, Sönke Ludwig wrote: On 22.08.2014 20:08, Walter Bright wrote: 1. There's no mention of what will happen if it is passed malformed JSON strings. I presume an exception is thrown. Exceptions are both slow and consume GC memory. I suggest an alternative would be to emit an Error token instead; this would be much like how the UTF decoding algorithms emit a replacement char for invalid UTF sequences. The latest version now features a LexOptions.noThrow option which causes an error token to be emitted instead. After popping the error token, the range is always empty. Having a nothrow option may prevent the functions from being attributed as nothrow. It's a compile-time option, so that shouldn't be an issue. There is also just a single throw statement in the source, so it's easy to isolate.
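The error-token pattern described here can be sketched in a few lines. The following self-contained toy lexer is purely illustrative; the `Kind`/`Token` names are assumptions, not the proposed module's actual token API:

```d
// Toy lexer illustrating the noThrow mode discussed above: malformed
// input produces a terminal `error` token instead of an exception, so
// no GC allocation for a Throwable is needed on the error path.
enum Kind { number, error }
struct Token { Kind kind; string text; }

Token[] lexDigits(string input)
{
    Token[] tokens;
    foreach (i, c; input)
    {
        if (c < '0' || c > '9')
        {
            // emit an error token carrying the offending remainder;
            // after this token is popped, the stream is empty
            tokens ~= Token(Kind.error, input[i .. $]);
            return tokens;
        }
        tokens ~= Token(Kind.number, input[i .. i + 1]);
    }
    return tokens;
}

unittest
{
    auto ts = lexDigits("12x3");
    assert(ts.length == 3);               // '1', '2', then the error token
    assert(ts[$ - 1].kind == Kind.error);
    assert(ts[$ - 1].text == "x3");
}
```

Since the choice between throwing and emitting an error token is a compile-time template parameter in the proposal, the throwing instantiation can still be inferred `nothrow`-incompatible while the noThrow instantiation is not.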
Re: RFC: std.json sucessor
On 23.08.2014 04:23, deadalnix wrote: First, thank you for your work. std.json is horrible to use right now, so a replacement is more than welcome. I haven't played with your code yet, so I may be asking for something that already exists, but did you have a look at jsvar by Adam? You can find it here: https://github.com/adamdruppe/arsd/blob/master/jsvar.d One of the big pains when working with a format like JSON is that you go from the untyped world to the typed world (the same problem occurs with XML and various config formats as well). I think Adam got the right balance in jsvar. It behaves closely enough to JavaScript that it is convenient to manipulate, while removing the most dangerous behavior (concatenation is still done using ~ and not + as in JS). If that is not already the case, I'd love for the elements I get out of my JSON to behave that way. If you can do that, you have a user. Setting the issue of opDispatch aside, one of the goals was to use Algebraic to store values. It is probably not quite as flexible as jsvar, but still transparently enables a lot of operations (with those pull requests merged, at least). But it has another big advantage, which is that we can later define other types based on Algebraic, such as BSONValue, and those can be transparently runtime-converted between each other in a generic way. A special-case type, on the other hand, produces nasty dependencies between the formats. Main issues of using opDispatch: - Prone to bugs where a normal field/method of the JSONValue struct is accessed instead of a JSON field - On top of that, the var.field syntax gives the wrong impression that you are working with static typing, while var[field] makes it clear that runtime indexing is going on - Every interface change of JSONValue would be a silent breaking change, because the whole string domain is used up for opDispatch
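The first pitfall in that list can be demonstrated with a toy value type; the `Var` struct below is hypothetical and deliberately minimal, not jsvar's or JSONValue's actual implementation:

```d
// Toy type demonstrating the opDispatch pitfall: a JSON field whose
// name collides with a real member of the value type silently resolves
// to the member, never reaching opDispatch.
struct Var
{
    string[string] fields;

    size_t length() { return fields.length; } // ordinary member

    // catches all other member names at compile time
    string opDispatch(string name)() { return fields[name]; }
}

unittest
{
    Var v;
    v.fields["city"] = "Berlin";
    v.fields["length"] = "42";

    assert(v.city == "Berlin");         // opDispatch lookup works...
    assert(v.length == 2);              // ...but this calls the method,
                                        // NOT the JSON field "length"
    assert(v.fields["length"] == "42"); // explicit indexing is unambiguous
}
```

This is also why `var["field"]` is argued to be the honest syntax: it cannot collide with the value type's own interface, and it visibly signals a runtime lookup.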
Re: RFC: std.json sucessor
On Saturday, 23 August 2014 at 09:22:01 UTC, Sönke Ludwig wrote: Main issues of using opDispatch: - Prone to bugs where a normal field/method of the JSONValue struct is accessed instead of a JSON field - On top of that the var.field syntax gives the wrong impression that you are working with static typing, while var[field] makes it clear that runtime indexing is going on - Every interface change of JSONValue would be a silent breaking change, because the whole string domain is used up for opDispatch I have seen similar issues with simplexml in PHP. Using opDispatch to match all possible names except a few doesn't work so well. I'm not sure if you've changed it already, but I agree with the earlier comment about changing the flag for pretty printing from a boolean to an enum value. Booleans in interfaces are one of my pet peeves.
Re: RFC: std.json sucessor
On 23.08.2014 14:19, w0rp wrote: I'm not sure if you've changed it already, but I agree with the earlier comment about changing the flag for pretty printing from a boolean to an enum value. Booleans in interfaces are one of my pet peeves. It's split into two separate functions now. I guess having to type out a full enum value would be too distracting in this case, since these functions will be used pretty frequently.
Re: RFC: std.json sucessor
On 22.08.2014 20:08, Walter Bright wrote: (...) 2. The escape-sequenced strings presumably consume GC memory. This will be a problem for high performance code. I suggest either leaving them undecoded in the token stream, and letting higher level code decide what to do about them, or provide a hook that the user can override with his own allocation scheme. If we don't make it possible to use std.json without invoking the GC, I believe the module will fail in the long term. I've added two new types now to abstract away how strings and numbers are represented in memory. For string literals this means that for the input types string and immutable(ubyte)[] they will always be stored as slices of the input buffer. JSONValue has a .rawValue property to access them, as well as an alias this'd .value property that transparently unescapes. At that point it would also be easy to provide a method that takes an arbitrary output range to unescape without allocations. Documentation and code are both updated (also added a note about exception behavior).
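The output-range-based unescaping mentioned at the end could look roughly like this. This is a simplified, self-contained sketch that handles only a few escape sequences; it is not the module's actual code:

```d
// Sketch of unescaping a raw (still-escaped) input slice into an
// arbitrary output range, so no intermediate string is allocated.
// Unicode escapes (\uXXXX) and error handling are omitted for brevity.
import std.range.primitives : put;

void unescapeInto(R)(string raw, ref R sink)
{
    for (size_t i = 0; i < raw.length; ++i)
    {
        if (raw[i] == '\\' && i + 1 < raw.length)
        {
            ++i;
            switch (raw[i])
            {
                case 'n':  put(sink, '\n'); break;
                case 't':  put(sink, '\t'); break;
                case '"':  put(sink, '"');  break;
                case '\\': put(sink, '\\'); break;
                default:   put(sink, raw[i]); break; // simplified
            }
        }
        else
        {
            put(sink, raw[i]);
        }
    }
}

unittest
{
    import std.array : appender;
    auto app = appender!string();
    unescapeInto(`line\nbreak`, app); // escaped slice of the input buffer
    assert(app.data == "line\nbreak"); // unescaped into the caller's sink
}
```

With such a primitive, `.rawValue` stays a zero-copy slice of the input, and the caller decides whether unescaping allocates at all (e.g. by passing a stack-backed sink).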
Re: RFC: std.json sucessor
On 22.08.2014 21:00, Marc Schütz schue...@gmx.net wrote: On Friday, 22 August 2014 at 18:08:34 UTC, Sönke Ludwig wrote: On 22.08.2014 19:57, Marc Schütz schue...@gmx.net wrote: The easiest and cleanest way would be to add a function in std.data.json: auto parse(Target, Source)(Source input) if(is(Target == JSONValue)) { return ...; } The various overloads of `std.conv.parse` already have mutually exclusive template constraints; they will not collide with our function. Okay, for parse that may work, but what about to!()? What's the problem with to!()? to!() definitely doesn't have a template constraint that excludes JSONValue. Instead, it will convert any struct type that doesn't define toString() to a D-like representation.
Re: RFC: std.json sucessor
On Saturday, 23 August 2014 at 16:49:23 UTC, Sönke Ludwig wrote: On 22.08.2014 21:00, Marc Schütz schue...@gmx.net wrote: On Friday, 22 August 2014 at 18:08:34 UTC, Sönke Ludwig wrote: On 22.08.2014 19:57, Marc Schütz schue...@gmx.net wrote: The easiest and cleanest way would be to add a function in std.data.json: auto parse(Target, Source)(Source input) if(is(Target == JSONValue)) { return ...; } The various overloads of `std.conv.parse` already have mutually exclusive template constraints; they will not collide with our function. Okay, for parse that may work, but what about to!()? What's the problem with to!()? to!() definitely doesn't have a template constraint that excludes JSONValue. Instead, it will convert any struct type that doesn't define toString() to a D-like representation. For converting a JSONValue to a different type, JSONValue can implement `opCast`, which is the regular interface that std.conv.to uses if it's available. For converting something _to_ a JSONValue, std.conv.to will simply create an instance of it by calling the constructor.
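The opCast route can be illustrated with a stand-in type. This assumes std.conv.to's documented behavior of routing through a member `opCast` when the source type provides one; `Value` below is hypothetical, not the proposed JSONValue:

```d
// Stand-in type showing how a member opCast lets std.conv.to convert
// away from a value type without to!() needing any special knowledge
// of it. Real JSONValue conversions would of course be richer.
import std.conv : to;

struct Value
{
    double payload;

    T opCast(T)() const if (is(T : double))
    {
        return cast(T) payload;
    }
}

unittest
{
    auto v = Value(3.5);
    assert(v.to!double == 3.5); // routed through opCast!double
    assert(v.to!int == 3);      // opCast!int truncates
}
```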
Re: RFC: std.json sucessor
On 23.08.2014 19:25, Marc Schütz schue...@gmx.net wrote: On Saturday, 23 August 2014 at 16:49:23 UTC, Sönke Ludwig wrote: On 22.08.2014 21:00, Marc Schütz schue...@gmx.net wrote: On Friday, 22 August 2014 at 18:08:34 UTC, Sönke Ludwig wrote: On 22.08.2014 19:57, Marc Schütz schue...@gmx.net wrote: The easiest and cleanest way would be to add a function in std.data.json: auto parse(Target, Source)(Source input) if(is(Target == JSONValue)) { return ...; } The various overloads of `std.conv.parse` already have mutually exclusive template constraints; they will not collide with our function. Okay, for parse that may work, but what about to!()? What's the problem with to!()? to!() definitely doesn't have a template constraint that excludes JSONValue. Instead, it will convert any struct type that doesn't define toString() to a D-like representation. For converting a JSONValue to a different type, JSONValue can implement `opCast`, which is the regular interface that std.conv.to uses if it's available. For converting something _to_ a JSONValue, std.conv.to will simply create an instance of it by calling the constructor. That would just introduce the aforementioned dependency cycle between JSONValue, the parser and the lexer. Possible, but not particularly pretty. Also, using the JSONValue constructor to parse an input string would contradict the intuitive behavior of just storing the string value.
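The constructor objection can be made concrete with a toy type (all names hypothetical):

```d
// Toy type making the ambiguity concrete: if to!() created a value by
// calling the constructor, and that constructor parsed its argument,
// the intuitive "store this string as a string value" meaning is lost.
struct Value
{
    string str;
    this(string s) { str = s; } // stores the string, does NOT parse it
}

unittest
{
    auto v = Value(`{"not": "parsed"}`);
    // intuitive behavior: the argument is stored verbatim as a string
    // value, so parsing has to live in a separate entry point (a parse
    // function), not in the constructor that to!() would call.
    assert(v.str == `{"not": "parsed"}`);
}
```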