Re: [I] [C++][Parquet] Encoding tools for variant type [arrow]

2025-06-02 Thread via GitHub


zeroshade commented on issue #46555:
URL: https://github.com/apache/arrow/issues/46555#issuecomment-2930948588

   @PinkCrow007 I plan on eventually building higher-level nested builder APIs 
on top of the single-buffer design so that a user can choose between 
efficiency and API ergonomics.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [I] [C++][Parquet] Encoding tools for variant type [arrow]

2025-06-02 Thread via GitHub


PinkCrow007 commented on issue #46555:
URL: https://github.com/apache/arrow/issues/46555#issuecomment-2929299578

   I really like the single-buffer builder design, @zeroshade. It's 
impressively efficient, avoids extra allocations, and makes lifecycle 
management a lot simpler.
   
   I've been working on a nested builder in 
[Rust](https://github.com/apache/arrow-rs/issues/7424), and honestly, it's 
significantly less efficient than this approach. It uses intermediate buffers 
and multiple copies, which hurts performance. The main benefit, though, is API 
ergonomics - nested builders feel more intuitive to users. (Still not entirely 
user-friendly: due to the borrow checker, users must `.finish()` one nested 
object before starting another.)
   
   Regarding key sorting: I believe the problem stems from the Variant format 
itself. Since field_ids are determined by the sorted dictionary in the metadata, 
we either need to sort all keys upfront or patch field_ids after the fact. In my 
current implementation, I follow the latter approach - when sorted_strings is 
set to true, the builder walks back and patches every field header during the 
finalization of the root builder, but this isn't very efficient or safe. I'm 
still looking for better ways to handle this.
   
   Overall, I think this design space presents a tradeoff: Single builders 
optimize for efficiency and buffer reuse, but expose a lower-level API; Nested 
builders are easier for users, but are harder to make efficient. It’s been 
really helpful to see this Go implementation and the surrounding discussion. I 
agree that establishing best practices here will be crucial as more language 
implementations emerge.
   
   





Re: [I] [C++][Parquet] Encoding tools for variant type [arrow]

2025-05-25 Thread via GitHub


mapleFU commented on issue #46555:
URL: https://github.com/apache/arrow/issues/46555#issuecomment-2908741862

   Another interesting (but maybe useless) finding is that the variant format 
allows out-of-order offsets for array/object fields in the value, e.g.:
   1. Field 1 can come after field 2.
   2. If field 1 is exactly equal to field 2, it seems they can share the same 
variant sub-object.
   
   It's a bit tricky for writers to implement these features. For an array or 
object where multiple fields can have duplicate values, a writer could offer an 
`addRepeatsXXX(T item, uint32_t repeats_num)` interface. But maybe that can be 
regarded as a future optimization.
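   The duplicate-sharing idea could be sketched like this (a hypothetical 
`addRepeats`-style helper, not an API in any existing implementation): since the 
format lets multiple offset entries point at the same bytes, the value is 
encoded once and only the offset entry is repeated.
   
   ```go
   package main
   
   import "fmt"
   
   // addRepeats is a hypothetical helper: it encodes the item's bytes a single
   // time into the value buffer and appends the same offset entry `repeats`
   // times, so every repeated element shares one copy of the encoded value.
   func addRepeats(values []byte, offsets []uint32, item []byte, repeats uint32) ([]byte, []uint32) {
       off := uint32(len(values))
       values = append(values, item...) // encode the value once
       for i := uint32(0); i < repeats; i++ {
           offsets = append(offsets, off) // every repeat shares that offset
       }
       return values, offsets
   }
   
   func main() {
       // e.g. a 2-byte encoded value, repeated three times
       vals, offs := addRepeats(nil, nil, []byte{0x0C, 0x05}, 3)
       fmt.Println(len(vals), offs) // value stored once, offset entry repeated
   }
   ```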





Re: [I] [C++][Parquet] Encoding tools for variant type [arrow]

2025-05-23 Thread via GitHub


zeroshade commented on issue #46555:
URL: https://github.com/apache/arrow/issues/46555#issuecomment-2905956585

   > For implementation, the buffer handling stack is inspiring; I think for 
!sorted this is a great example. I may need to think about using a tmp buffer 
when building sub-objects, since it could change memmove to memcpy. But when 
the object/array stack size grows, the required buffer also grows, which makes 
it even worse.
   
   Notice that for building the sub-objects I'm simply growing the single 
buffer and then shifting the built variant data down to make room for the 
header. This works just fine for nested objects as long as the user calls 
`FinishObject` in the correct order (the ordering would still need to be 
careful with sub-objects). What does the temp buffer get you instead? 
Eventually you'll need to grow the top buffer to accommodate it regardless when 
you copy it back to the parent builder, so you end up with more allocations 
happening by using the sub-objects. 
   
   > For sorted, maybe it's easy to do for a one-layer object; for a 
multi-layer object, a sorted & unique builder might need to maintain an "object 
builder tree" and finish it in build order 😅
   
   Well, my current implementation does guarantee *uniqueness* for the keys, 
but sorting is definitely much more complex. There are only two ways I can 
think of to efficiently get the final metadata sorted:
   
   The first case: If the user adds all the keys they are going to use 
up-front, the builder could provide a function to sort the keys and give them 
new ids based on the sorted positions. If you require this before creating 
*any* objects, everything else becomes simple. But requiring ALL of the keys 
up-front is not really a reasonable thing IMO.
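   The first case could be sketched roughly as follows (a hypothetical helper, 
not the actual arrow-go API): sort the full key set once, then assign each key 
the id of its position in the sorted dictionary, so no header ever needs 
patching afterwards.
   
   ```go
   package main
   
   import (
       "fmt"
       "sort"
   )
   
   // assignSortedIDs is a hypothetical helper: given ALL keys up-front, sort
   // them once and map key -> field_id by sorted position. Any object built
   // afterwards can use these ids directly, so no header patching is needed.
   func assignSortedIDs(keys []string) map[string]uint32 {
       sorted := append([]string(nil), keys...)
       sort.Strings(sorted)
       ids := make(map[string]uint32, len(sorted))
       for i, k := range sorted {
           ids[k] = uint32(i) // field_id is the position in the sorted dictionary
       }
       return ids
   }
   
   func main() {
       fmt.Println(assignSortedIDs([]string{"b", "a", "c"})) // map[a:0 b:1 c:2]
   }
   ```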
   
   The more complex case: You have to maintain your object builder tree with 
nested builders and children and so on, but NONE of them can construct their 
object header until the entire builder is being finished. At which point you do 
a depth-first iteration of the tree recursively constructing the object headers 
and copying the buffer data into the parents as you go. 
   
   The reason for this is that the header for an object contains the IDs 
for each key in the object, and the ID is based on the order it is written to 
the metadata. This means every time a new key is added to the dictionary, you 
would potentially have to go through all of the objects and update ALL of the 
key ids (in the worst-case scenario). The only way to do this efficiently is to 
not write any object headers until you *know* you have ALL of the keys, and 
thus that the IDs won't change. For a simple value this probably isn't 
difficult, but for a complex, multi-level nested object it could get pretty 
inefficient real quick.
   
   In the case of the rust code you linked, they are basically keeping multiple 
buffers, one for each value/object/element and then if the keys are sorted it 
has to go rewrite the header of every child object, every time it finishes one 
of the parent objects (to update the IDs).
   
   In the Go code, I just didn't bother even attempting to sort the keys. I 
just check on the final result call whether or not the keys *happen* to be 
sorted; if they are, then I set the sorted bit in the metadata lol
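   That final check could look roughly like this (a hypothetical helper, not 
the actual arrow-go code): inspect the dictionary keys in insertion order and 
set the sorted bit only if they already happen to be in lexicographic order.
   
   ```go
   package main
   
   import (
       "fmt"
       "sort"
   )
   
   // metadataIsSorted is a hypothetical helper: given the dictionary keys in
   // insertion order, it reports whether they already happen to be sorted, in
   // which case the builder could set the sorted_strings bit in the metadata
   // header without any re-sorting or header patching.
   func metadataIsSorted(keys []string) bool {
       return sort.StringsAreSorted(keys)
   }
   
   func main() {
       fmt.Println(metadataIsSorted([]string{"a", "b", "c"})) // true: set the bit
       fmt.Println(metadataIsSorted([]string{"b", "a"}))      // false: leave it unset
   }
   ```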





Re: [I] [C++][Parquet] Encoding tools for variant type [arrow]

2025-05-23 Thread via GitHub


mapleFU commented on issue #46555:
URL: https://github.com/apache/arrow/issues/46555#issuecomment-2905829507

   When reading this I realized I had forgotten to design how to build the 
sorted metadata in this scenario, lol
   
   The API looks nice!
   
   For implementation, the buffer handling stack is inspiring; I think for 
!sorted this is a great example. I may need to think about using a tmp buffer 
when building sub-objects, since it could change memmove to memcpy. But when 
the object/array stack size grows, the required buffer also grows, which makes 
it even worse.
   
   For sorted, maybe it's easy to do for a one-layer object; for a multi-layer 
object, a sorted & unique builder might need to maintain an "object builder 
tree" and finish it in build order 😅 This is really tricky for nested complex 
objects...





Re: [I] [C++][Parquet] Encoding tools for variant type [arrow]

2025-05-23 Thread via GitHub


zeroshade commented on issue #46555:
URL: https://github.com/apache/arrow/issues/46555#issuecomment-2905730430

   Yup! :) That's from inspiration I got from the Spark implementation.





Re: [I] [C++][Parquet] Encoding tools for variant type [arrow]

2025-05-23 Thread via GitHub


mapleFU commented on issue #46555:
URL: https://github.com/apache/arrow/issues/46555#issuecomment-2905679475

   It's impressive to use one `buf` while building the variant object!
   
   ```go
   // Builder is used to construct Variant values by appending data of various types.
   // It manages an internal buffer for the value data and a dictionary for field keys.
   type Builder struct {
       buf             bytes.Buffer
       dict            map[string]uint32
       dictKeys        [][]byte
       allowDuplicates bool
   }
   ```





Re: [I] [C++][Parquet] Encoding tools for variant type [arrow]

2025-05-23 Thread via GitHub


zeroshade commented on issue #46555:
URL: https://github.com/apache/arrow/issues/46555#issuecomment-2905639920

   @mapleFU I've put up my changes finally by updating the PR at 
https://github.com/apache/arrow-go/pull/344 complete with my updated design and 
docs for the builder and encoding/decoding values. Let me know what you think 
there! Inspiration there includes your C++ code and the Spark and parquet-java 
implementations.
   
   @alamb Since you've brought this to the attention of the Rust devs, I'd 
also be curious about your (and their!) thoughts on what I ended up coming up 
with.





Re: [I] [C++][Parquet] Encoding tools for variant type [arrow]

2025-05-23 Thread via GitHub


zeroshade commented on issue #46555:
URL: https://github.com/apache/arrow/issues/46555#issuecomment-2905074401

   I think the work we're doing is gonna end up defining the best practices, 
since so far we're seeing a divergence in approach between Spark, parquet-java, 
and others. :smile: I guess we're gonna have to compare and contrast and see 
what the community likes. In addition, there may be language-specific 
divergences in what makes the most sense.





Re: [I] [C++][Parquet] Encoding tools for variant type [arrow]

2025-05-23 Thread via GitHub


mapleFU commented on issue #46555:
URL: https://github.com/apache/arrow/issues/46555#issuecomment-2905069929

   Thanks for sharing! It's really inspiring! It looks like this kind of API 
has some best practices. And since a variant object is complex, maybe we can't 
expect building it to be fast; maybe the bottleneck will be memory allocation 
on the main path, and it can't be vectorized in most cases (since there are so 
many memory dependencies on the main path).





Re: [I] [C++][Parquet] Encoding tools for variant type [arrow]

2025-05-23 Thread via GitHub


zeroshade commented on issue #46555:
URL: https://github.com/apache/arrow/issues/46555#issuecomment-2905049574

   The only thing I like better about the nested impl approach is that the 
developer API is slightly easier to use and better (in some ways). Instead of 
needing the `NextElement` and `NextField` calls, a user can just make the 
`Append*` calls on the `ArrayBuilder` or `ObjectBuilder`. Though it adds the 
complexity of the extra child builder objects, it does make things slightly 
easier in some ways even though it's less efficient.
   
   I guess I'll finish this up and put the PR up and see what people think





Re: [I] [C++][Parquet] Encoding tools for variant type [arrow]

2025-05-23 Thread via GitHub


zeroshade commented on issue #46555:
URL: https://github.com/apache/arrow/issues/46555#issuecomment-2904995318

   So, I used that `DecimalValue` type for how I return the decimal types:
   
   ```go
   type DecimalValue[T decimal.DecimalTypes] struct {
       Scale uint8
       Value decimal.Num[T]
   }
   ```
   
   But the builder currently has:
   
   ```go
   func (b *Builder) AppendDecimal4(scale uint8, v decimal.Decimal32) error {...}
   func (b *Builder) AppendDecimal8(scale uint8, v decimal.Decimal64) error {...}
   func (b *Builder) AppendDecimal16(scale uint8, v decimal.Decimal128) error {...}
   ```
   
   > (can float/double detect whether they fit the smallest type? Perhaps not?)
   
   I'll try that, shouldn't be difficult to set it up for that.
   
   > Also, I've listed a part for output and internal buffer handling. An 
arena-like data structure would be used here. Previously I thought about how to 
reuse buffers and vectorize building an array of Variant, but now I've found 
it's complex to vectorize building such a complex type. So maybe I should focus 
on reusing buffers and decreasing dynamic buffer allocation.
   
   Another reason why I avoided the "child builder" construction is to improve 
buffer reuse and reduce dynamic allocation. There's a single buffer in the 
builder that everything writes to. When `FinishObject` / `FinishArray` is 
called, the buffer is expanded by the calculated size of the header, the raw 
data is shifted down by that many bytes, and the header is written in the new 
space. That way we aren't creating a separate buffer for each child builder 
that we then have to copy back to the main one (which is what 
https://github.com/apache/parquet-java/blob/1f1e07bbf750fba228851c2d63470c3da5726831/parquet-variant/src/main/java/org/apache/parquet/variant/VariantObjectBuilder.java
 does)
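   The grow-and-shift step described above could be sketched like this (a 
hypothetical simplification, not the actual arrow-go implementation; the real 
header layout is more involved): children are appended first, then the buffer 
grows by the header size, the child data slides down, and the header is written 
into the gap.
   
   ```go
   package main
   
   import "fmt"
   
   // finishContainer grows the single buffer by len(header) bytes, shifts the
   // child data (everything from `start` onward) down to make room, and writes
   // the header into the gap -- no per-child buffers, no extra copies back.
   func finishContainer(buf []byte, start int, header []byte) []byte {
       n := len(header)
       buf = append(buf, make([]byte, n)...) // grow by the header size
       copy(buf[start+n:], buf[start:])      // shift child data down (copy handles overlap)
       copy(buf[start:], header)             // write the header into the gap
       return buf
   }
   
   func main() {
       buf := []byte("meta")
       start := len(buf)
       buf = append(buf, []byte("childdata")...) // children written first
       buf = finishContainer(buf, start, []byte{0x02, 0x01}) // fake 2-byte header
       fmt.Printf("%q\n", buf)
   }
   ```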





Re: [I] [C++][Parquet] Encoding tools for variant type [arrow]

2025-05-23 Thread via GitHub


mapleFU commented on issue #46555:
URL: https://github.com/apache/arrow/issues/46555#issuecomment-2905035400

   Yeah:
   1. For array and object, the size is unknown before `finishObject`.
   2. For elements in an array or object, an "arena"-like object is likely used.
   
   I found that parquet-java uses a nested impl [1], and velocypack uses an 
internal arena stack [2]. [1] is easier to implement; [2] seems more efficient 
to me.
   
   [1] 
https://github.com/apache/parquet-java/blob/1f1e07bbf750fba228851c2d63470c3da5726831/parquet-variant/src/main/java/org/apache/parquet/variant/VariantBuilder.java#L437
   [2] 
https://github.com/arangodb/velocypack/blob/8f0fc652e7641046a9f12b7d8a9917dcc507a5c2/include/velocypack/Builder.h#L98-L99





Re: [I] [C++][Parquet] Encoding tools for variant type [arrow]

2025-05-23 Thread via GitHub


mapleFU commented on issue #46555:
URL: https://github.com/apache/arrow/issues/46555#issuecomment-2904893130

   @zeroshade This builder API sounds OK to me for the primitive API, though I 
found that maybe only integers will use the smallest type (can float/double 
detect whether they fit the smallest type? Perhaps not?)
   
   I also wonder what the decimal value in the builder API would look like, 
since Arrow doesn't have Java's `BigDecimal` type. I drafted:
   
   ```
   template <typename ArrowDecimalType>
   struct DecimalValue {
     ArrowDecimalType decimal;
     uint8_t scale;
   };
   ```
   
   What would be better here?
   
   Also, I've listed a part for output and internal buffer handling. An 
arena-like data structure would be used here. Previously I thought about how to 
reuse buffers and vectorize building an array of Variant, but now I've found 
it's complex to vectorize building such a complex type. So maybe I should focus 
on reusing buffers and decreasing dynamic buffer allocation.





Re: [I] [C++][Parquet] Encoding tools for variant type [arrow]

2025-05-23 Thread via GitHub


zeroshade commented on issue #46555:
URL: https://github.com/apache/arrow/issues/46555#issuecomment-2904786416

   I've been building out an API for a Variant Builder for the Go 
implementation, and so far I have based it on the [Spark Variant 
Builder](https://github.com/apache/spark/blob/master/common/variant/src/main/java/org/apache/spark/types/variant/VariantBuilder.java).
   
   The API looks like this for the primitive types:
   
   ```go
   func (b *Builder) AppendNull() error {...}
   func (b *Builder) AppendBool(v bool) error {...}
   // uses the smallest int type it can based on the value
   func (b *Builder) AppendInt(v int64) error {...}
   func (b *Builder) AppendTimestamp(v arrow.Timestamp, useMicro, useUTC bool) 
error {...}
   // and so on
   ```
   
   There are also methods for adding a key to the metadata:
   
   ```go
   func (b *Builder) AddKey(k string) (id uint32) {...}
   func (b *Builder) AddKeys(k []string) (id []uint32) {...}
   ```
   
   These just add the keys so they will get put into the metadata when you 
finally build it, and return the IDs.
   
   For managing objects and arrays, I decided against using sub-builders, as 
that would require managing the lifetime of multiple objects and sub-builders, 
which could get very complex. Instead the pattern is the following:
   
   1. Get the starting offset via `Builder.Offset()`
   2. Use the starting offset and helper functions to build a list of 
fields/offsets while appending the values
   3. Call `FinishArray` / `FinishObject` with the start offset + list of 
offsets/fields.
   
   Code-wise it looks like this to build a list:
   
   ```go
   var b variant.Builder
   start := b.Offset()
   offsets := make([]int, 0)
   
   offsets = append(offsets, b.NextElement(start))
   b.AppendInt(5)
   
   offsets = append(offsets, b.NextElement(start))
   b.AppendBool(true)
   
   offsets = append(offsets, b.NextElement(start))
   b.AppendInt(9)
   
   b.FinishArray(start, offsets)
   
   v, err := b.Build()
   ```
   
   Resulting in a variant that is equivalent to `[5, true, 9]`.
   
   For an object it looks like:
   
   ```go
   var b variant.Builder
   start := b.Offset()
   fields := make([]variant.FieldEntry, 0)
   
   fields = append(fields, b.NextField(start, "int_field"))
   b.AppendInt(1)
   
   fields = append(fields, b.NextField(start, "string_field"))
   b.AppendString("Variant")
   
   fields = append(fields, b.NextField(start, "null_field"))
   b.AppendNull()
   
   b.FinishObject(start, fields)
   
   v, err := b.Build()
   ```
   
   This would create a variant that is equivalent to `{ "int_field": 1, 
"string_field": "Variant", "null_field": null }`
   
   What do you think?





Re: [I] [C++][Parquet] Encoding tools for variant type [arrow]

2025-05-23 Thread via GitHub


mapleFU commented on issue #46555:
URL: https://github.com/apache/arrow/issues/46555#issuecomment-2903939496

   cc @pitrou @alamb @wgtmac @emkornfield for some interface design ideas

