Re: [PR] feat: add projection support to TapeDecoder for skipping unknown fields in json parsing (1.4x speedup) [arrow-rs]

via GitHub Wed, 07 Jan 2026 14:33:48 -0800


scovich commented on code in PR #9097:
URL: https://github.com/apache/arrow-rs/pull/9097#discussion_r2670273139



##########
arrow-json/src/reader/tape.rs:
##########
@@ -237,8 +238,21 @@ enum DecoderState {
     ///
     /// Consists of `(literal, decoded length)`
     Literal(Literal, u8),
+    /// Skipping a value (for unprojected fields)
+    ///
+    /// Consists of:
+    /// - `depth`: Nesting level of objects/arrays being skipped (u32)
+    /// - `flags`: Bit-packed flags (in_string: bit 0, escape: bit 1)
+    SkipValue {
+        depth: u32,
+        flags: u8,
+    },
 }
 
+// Bit flags for SkipValue state
+const SKIP_IN_STRING: u8 = 1 << 0; // 0x01
+const SKIP_ESCAPE: u8 = 1 << 1; // 0x02

Review Comment:
   I threw an LLM at this whole situation during a boring meeting, and arrived 
at a surprisingly different potential approach, if you're game to try it out?
   
   The short version is:
   * Keep the existing (highly optimized and efficient) decoding logic, but 
factor it out to a helper method that is generic over `const SKIP: bool` that 
says whether to actually store the parsed output.
   * Wrap that helper in `decode` and `decode_skip` methods, with a clean 
transition between the two: enter at the `:` match (like the PR does today), 
and `decode_skip` breaks back out to `decode` when the stack length drops back 
down.
   * We need a new boolean `skipping` field to handle cases where input bytes 
were exhausted while skipping (so the next call to `decode` can jump straight 
to `decode_skip` when starting the next buffer of bytes)
   * The state stack tracks everything related to skipping (small memory cost 
but very efficient).
   * No new tape decoder enum variants needed.
   
   In theory, the approach should be simpler (less duplicated source code) 
while also having friendlier branching (fewer and/or more predictable 
branches). 
   
   Is that something you'd want me to put a bit more time into exploring 
further? 
   Or something you'd prefer to dig into yourself?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Re: [PR] feat: add projection support to TapeDecoder for skipping unknown fields in json parsing (1.4x speedup) [arrow-rs]

Reply via email to