Re: [I] [C++][Parquet] Encoding tools for variant type [arrow]
zeroshade commented on issue #46555: URL: https://github.com/apache/arrow/issues/46555#issuecomment-2930948588 @PinkCrow007 I plan on eventually building higher-level nested builder APIs on top of the single-buffer design, so that a user can choose between efficiency and API ergonomics. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
PinkCrow007 commented on issue #46555: URL: https://github.com/apache/arrow/issues/46555#issuecomment-2929299578 I really like the single-buffer builder design @zeroshade . It's impressively efficient, avoids extra allocations, and makes lifecycle management a lot simpler. I've been working on a nested builder in [Rust](https://github.com/apache/arrow-rs/issues/7424), and honestly, it's significantly less efficient than this approach. It uses intermediate buffers and multiple copies, which hurts performance. The main benefit, though, is API ergonomics: nested builders feel more intuitive to users. (Still not entirely user-friendly: due to the borrow checker, users must .finish() one nested object before starting another.) Regarding key sorting: I believe the problem stems from the Variant format itself. Since field_ids are determined by the sorted dictionary in the metadata, we either need to sort all keys upfront or patch the field_ids after the fact. My current implementation follows the latter approach: when sorted_strings is set to true, the builder walks back and patches every field header during finalization of the root builder, but this isn't very efficient or safe. I'm still looking for better ways to handle this. Overall, I think this design space presents a tradeoff: single builders optimize for efficiency and buffer reuse, but expose a lower-level API; nested builders are easier for users, but are harder to make efficient. It's been really helpful to see this Go implementation and the surrounding discussion. I agree that establishing best practices here will be crucial as more language implementations emerge.
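The "sort all keys upfront" alternative mentioned above could look roughly like the sketch below: sorting the insertion-order dictionary once and producing an id remap table, instead of walking back and patching field headers afterwards. This is illustrative only; the function name and layout are hypothetical and not from any of the linked implementations.

```go
package main

import (
	"fmt"
	"sort"
)

// sortKeysUpfront sketches the "sort all keys upfront" approach: given the
// keys in insertion order, return the sorted dictionary plus a remap table
// from insertion id to sorted id, which objects could use when writing their
// headers. Hypothetical helper, not actual arrow-go or arrow-rs code.
func sortKeysUpfront(keys []string) (sorted []string, remap []uint32) {
	sorted = append([]string(nil), keys...)
	sort.Strings(sorted)
	pos := make(map[string]uint32, len(sorted))
	for i, k := range sorted {
		pos[k] = uint32(i)
	}
	remap = make([]uint32, len(keys))
	for i, k := range keys {
		remap[i] = pos[k]
	}
	return sorted, remap
}

func main() {
	sorted, remap := sortKeysUpfront([]string{"b", "a", "c"})
	fmt.Println(sorted, remap) // [a b c] [1 0 2]
}
```

The catch, as discussed below in the thread, is that this only works if every key is known before any object header is written.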
mapleFU commented on issue #46555: URL: https://github.com/apache/arrow/issues/46555#issuecomment-2908741862 Another interesting (though maybe useless) finding is that the variant format allows out-of-order offsets for array/object fields in a value. For example: 1. Field 1 can appear after field 2 in the value bytes. 2. If field 1 is byte-identical to field 2, it seems they can share the same variant sub-object. These features are a bit tricky for writers to implement. For arrays or objects where multiple fields can hold duplicate values, a writer might offer an `addRepeatsXXX(T item, uint32_t repeats_num)` interface. But maybe that can be regarded as a future optimization.
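The offset-sharing idea could be sketched like this: since two offsets may point at the same encoded bytes, a repeated element can be written once and referenced many times. This is a toy illustration of a hypothetical `addRepeats`-style API with plain int offsets into a flat buffer, not the real variant encoding.

```go
package main

import "fmt"

// arrayBuilder is a toy model: values holds encoded element bytes, offsets
// holds one entry per logical element pointing into values.
type arrayBuilder struct {
	values  []byte
	offsets []uint32
}

// appendRepeats encodes item once and makes every repeat alias the same
// bytes, instead of encoding it `repeats` times. (Hypothetical API.)
func (b *arrayBuilder) appendRepeats(item []byte, repeats uint32) {
	off := uint32(len(b.values))
	b.values = append(b.values, item...) // written once
	for i := uint32(0); i < repeats; i++ {
		b.offsets = append(b.offsets, off) // referenced many times
	}
}

func main() {
	var b arrayBuilder
	b.appendRepeats([]byte("xy"), 3)
	fmt.Println(len(b.values), b.offsets) // 2 [0 0 0]
}
```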
zeroshade commented on issue #46555: URL: https://github.com/apache/arrow/issues/46555#issuecomment-2905956585

> The implementation and the buffer-handling stack are inspiring; for the !sorted case this is a great example. I may need to think about using a temp buffer when building sub-objects, which could change a memmove into a memcpy. But as the object/array stack grows, the required buffer also grows, which makes it even worse

Notice that for building the sub-objects I'm simply growing the single buffer and then shifting the built variant data down to make room for the header. This works just fine for nested objects as long as the user calls `FinishObject` in the correct order (the ordering would still need to be careful with sub-objects). What does the temp buffer get you instead? Eventually you'll need to grow the top buffer to accommodate it regardless when you copy it back to the parent builder, so you end up with more allocations happening by using the sub-objects.

> For sorted, it may be easy in a one-layer object, but for a multi-layer object a sorted & unique builder might need to maintain an "object builder tree" and finish it in build order 😅

Well, my current implementation does guarantee *uniqueness* for the keys, but sorting is definitely much more complex. There are only two ways I can think of to efficiently get the final metadata sorted. The first case: if the user adds all the keys they are going to use up-front, the builder could provide a function to sort the keys and assign new ids based on the sorted positions. If you require this before creating *any* objects, everything else becomes simple. But requiring ALL of the keys up-front is not really reasonable IMO. The more complex case: you have to maintain your object builder tree with nested builders and children and so on, but NONE of them can construct their object header until the entire builder is being finished.
At that point you do a depth-first iteration of the tree, recursively constructing the object headers and copying the buffer data into the parents as you go. The reason is that the header for an object contains the IDs for each key in the object, and an ID is based on the order it is written to the metadata. That means every time a new key is added to the dictionary, you would potentially have to go through all of the objects and update ALL of the key ids (in the worst-case scenario). The only way to do this efficiently is to not write any object headers until you *know* you have ALL of the keys, and thus the IDs won't change. For a simple value, this probably isn't difficult. But for a complex, multi-level nested object, this could get pretty inefficient real quick. In the case of the rust code you linked, they are basically keeping multiple buffers, one for each value/object/element, and then if the keys are sorted it has to go rewrite the header of every child object every time it finishes one of the parent objects (to update the IDs). In the Go code, I just didn't bother even attempting to sort the keys. I just check on the final result call whether or not the keys *happen* to be sorted; if they are, then I set the sorted bit in the metadata lol
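The "grow the single buffer and shift the data down to make room for the header" step described above can be sketched as follows. The header bytes here are placeholders and the helper name is hypothetical; the real variant object header encoding is more involved.

```go
package main

import "fmt"

// finishObject illustrates the single-buffer approach: the field data for an
// object is already in buf starting at `start`; grow the buffer by the header
// size, shift that data right (the memmove step), and write the header into
// the freed space. Simplified sketch, not the actual arrow-go code.
func finishObject(buf []byte, start int, header []byte) []byte {
	n := len(header)
	buf = append(buf, make([]byte, n)...) // grow by header size
	// Go's copy handles overlapping slices like memmove.
	copy(buf[start+n:], buf[start:len(buf)-n])
	copy(buf[start:], header) // header lands in front of the data
	return buf
}

func main() {
	buf := []byte("....DATA") // "DATA" is field data written from offset 4
	buf = finishObject(buf, 4, []byte{0xAB, 0xCD})
	fmt.Printf("% X\n", buf)
}
```

Because the shift happens in place, no per-child buffer is allocated; the cost is the memmove, which a temp-buffer design trades for an extra allocation plus a memcpy.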
mapleFU commented on issue #46555: URL: https://github.com/apache/arrow/issues/46555#issuecomment-2905829507 While reading I realized I'd forgotten to design how to build the sorted metadata in this scenario, lol. The API looks nice! The implementation and the buffer-handling stack are inspiring; for the !sorted case this is a great example. I may need to think about using a temp buffer when building sub-objects, which could change a memmove into a memcpy. But as the object/array stack grows, the required buffer also grows, which makes it even worse. For sorted, it may be easy in a one-layer object, but for a multi-layer object a sorted & unique builder might need to maintain an "object builder tree" and finish it in build order 😅 This is really tricky for nested complex objects...
zeroshade commented on issue #46555: URL: https://github.com/apache/arrow/issues/46555#issuecomment-2905730430 yup! :) That's inspiration I got from the Spark implementation.
mapleFU commented on issue #46555: URL: https://github.com/apache/arrow/issues/46555#issuecomment-2905679475 It's impressive to use one `buf` while building the variant object!

```go
// Builder is used to construct Variant values by appending data of various types.
// It manages an internal buffer for the value data and a dictionary for field keys.
type Builder struct {
	buf             bytes.Buffer
	dict            map[string]uint32
	dictKeys        [][]byte
	allowDuplicates bool
}
```
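The `dict`/`dictKeys` pair above suggests simple key interning: a map for O(1) lookup plus a slice preserving insertion order for stable ids. A minimal sketch of how an `AddKey`-style method could deduplicate keys (hypothetical code, not the actual arrow-go builder):

```go
package main

import "fmt"

// builder keeps a lookup map alongside an insertion-ordered key list so each
// distinct key gets a stable metadata id. Sketch only.
type builder struct {
	dict     map[string]uint32
	dictKeys [][]byte
}

// addKey returns the existing id for an already-seen key, otherwise assigns
// the next id and records the key for the eventual metadata dictionary.
func (b *builder) addKey(k string) uint32 {
	if b.dict == nil {
		b.dict = make(map[string]uint32)
	}
	if id, ok := b.dict[k]; ok {
		return id // already interned: reuse the id
	}
	id := uint32(len(b.dictKeys))
	b.dict[k] = id
	b.dictKeys = append(b.dictKeys, []byte(k))
	return id
}

func main() {
	var b builder
	fmt.Println(b.addKey("a"), b.addKey("b"), b.addKey("a")) // 0 1 0
}
```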
zeroshade commented on issue #46555: URL: https://github.com/apache/arrow/issues/46555#issuecomment-2905639920 @mapleFU I've finally put up my changes by updating the PR at https://github.com/apache/arrow-go/pull/344, complete with my updated design and docs for the builder and for encoding/decoding values. Let me know what you think there! Inspiration there includes your C++ code and the Spark and parquet-java implementations. @alamb Since you've brought this to the attention of the rust devs, I'd also be curious about your (and their!) thoughts on what I ended up coming up with.
zeroshade commented on issue #46555: URL: https://github.com/apache/arrow/issues/46555#issuecomment-2905074401 I think the work we're doing is gonna end up defining the best practices, since so far we're seeing a divergence in approach between Spark, Parquet-java, and others. :smile: I guess we're gonna have to compare and contrast and see what the community likes. In addition, there may be language-specific divergences in what makes the most sense.
mapleFU commented on issue #46555: URL: https://github.com/apache/arrow/issues/46555#issuecomment-2905069929 Thanks for sharing! It's really inspiring! It looks like this kind of API has some established best practices. And since a variant object is complex, maybe we can't expect building one to be fast; the bottleneck will likely be memory allocation on the main path, and it can't be vectorized in most cases (since there are so many memory dependencies on the main path).
zeroshade commented on issue #46555: URL: https://github.com/apache/arrow/issues/46555#issuecomment-2905049574 The only thing I like better about the nested-impl approach is that the developer API is slightly easier to use and better (in some ways). Instead of needing the `NextElement` and `NextField` calls, a user can just make the `Append*` calls on the `ArrayBuilder` or `ObjectBuilder`. Though it adds the complexity of the extra child builder objects, it does make things slightly easier in some ways, even if it's less efficient. I guess I'll finish this up, put the PR up, and see what people think.
zeroshade commented on issue #46555: URL: https://github.com/apache/arrow/issues/46555#issuecomment-2904995318 So, I used that `DecimalValue` type for how I return the decimal types:

```go
type DecimalValue[T decimal.DecimalTypes] struct {
	Scale uint8
	Value decimal.Num[T]
}
```

But the builder currently has:

```go
func (b *Builder) AppendDecimal4(scale uint8, v decimal.Decimal32) error {...}
func (b *Builder) AppendDecimal8(scale uint8, v decimal.Decimal64) error {...}
func (b *Builder) AppendDecimal16(scale uint8, v decimal.Decimal128) error {...}
```

> (can float/double also be shrunk to the smallest type that fits? Perhaps not?)

I'll try that; it shouldn't be difficult to set up.

> Also, I've listed a part for output and internal buffer handling. An arena-like data structure would be used here. Previously I thought about reusing buffers and vectorizing building arrays of Variants, but now I find it's too complex to vectorize building such a complex type. So maybe I should focus on reusing buffers and reducing dynamic buffer allocation

Another reason why I avoided the "child builder" construction is to improve buffer reuse and reduce dynamic allocation. There's a single buffer in the builder that everything writes to. When `FinishObject` / `FinishArray` is called, the buffer is expanded by the calculated size of the header, the raw data is shifted down by that many bytes, and the header is written into the new space. That way we aren't creating a separate buffer for each child builder that we then have to copy back to the main one (which is what https://github.com/apache/parquet-java/blob/1f1e07bbf750fba228851c2d63470c3da5726831/parquet-variant/src/main/java/org/apache/parquet/variant/VariantObjectBuilder.java does).
mapleFU commented on issue #46555: URL: https://github.com/apache/arrow/issues/46555#issuecomment-2905035400 Yeah: 1. For arrays and objects, the size is unknown before `finishObject`. 2. For elements in arrays and objects, an "arena"-like object is likely needed. I found that parquet-java uses a nested impl [1], while velocypack uses an internal arena stack [2]. [1] is easier to implement; [2] seems more efficient to me. [1] https://github.com/apache/parquet-java/blob/1f1e07bbf750fba228851c2d63470c3da5726831/parquet-variant/src/main/java/org/apache/parquet/variant/VariantBuilder.java#L437 [2] https://github.com/arangodb/velocypack/blob/8f0fc652e7641046a9f12b7d8a9917dcc507a5c2/include/velocypack/Builder.h#L98-L99
mapleFU commented on issue #46555: URL: https://github.com/apache/arrow/issues/46555#issuecomment-2904893130 @zeroshade This builder API sounds ok to me for the primitive types, though I think maybe only integers will use the smallest encoding (can float/double also be shrunk to the smallest type that fits? Perhaps not?). I also wonder what the decimal value in the builder API would look like, since arrow doesn't have Java's `BigDecimal` type. I drafted:

```cpp
template <typename ArrowDecimalType>
struct DecimalValue {
  ArrowDecimalType decimal;
  uint8_t scale;
};
```

What would be better here? Also, I've listed a part for output and internal buffer handling. An arena-like data structure would be used here. Previously I thought about reusing buffers and vectorizing building arrays of Variants, but now I find it's too complex to vectorize building such a complex type. So maybe I should focus on reusing buffers and reducing dynamic buffer allocation.
zeroshade commented on issue #46555: URL: https://github.com/apache/arrow/issues/46555#issuecomment-2904786416 I've been building out an API for a Variant Builder for the Go implementation, and so far I have based it on the [Spark Variant Builder](https://github.com/apache/spark/blob/master/common/variant/src/main/java/org/apache/spark/types/variant/VariantBuilder.java). The API looks like this for the primitive types:

```go
func (b *Builder) AppendNull() error {...}
func (b *Builder) AppendBool(v bool) error {...}
// uses the smallest int type it can based on the value
func (b *Builder) AppendInt(v int64) error {...}
func (b *Builder) AppendTimestamp(v arrow.Timestamp, useMicro, useUTC bool) error {...}
// and so on
```

There are also methods for adding a key to the metadata:

```go
func (b *Builder) AddKey(k string) (id uint32) {...}
func (b *Builder) AddKeys(k []string) (id []uint32) {...}
```

These just add the keys so they get put into the metadata when you finally build it, and return the IDs. For managing objects and arrays, I decided against using sub-builders, as that would require managing the lifetimes of multiple objects and sub-builders, which could get very complex. Instead the pattern is the following: 1. Get the starting offset via `Builder.Offset()`. 2. Use the starting offset and helper functions to build a list of fields/offsets while appending the values. 3. Call `FinishArray` / `FinishObject` with the start offset plus the list of offsets/fields. Code-wise, it looks like this to build a list:

```go
var b variant.Builder
start := b.Offset()
offsets := make([]int, 0)
offsets = append(offsets, b.NextElement(start))
b.AppendInt(5)
offsets = append(offsets, b.NextElement(start))
b.AppendBool(true)
offsets = append(offsets, b.NextElement(start))
b.AppendInt(9)
b.FinishArray(start, offsets)
v, err := b.Build()
```

Resulting in a variant that is equivalent to `[5, true, 9]`.
For an object it looks like:

```go
var b variant.Builder
start := b.Offset()
fields := make([]variant.FieldEntry, 0)
fields = append(fields, b.NextField(start, "int_field"))
b.AppendInt(1)
fields = append(fields, b.NextField(start, "string_field"))
b.AppendString("Variant")
fields = append(fields, b.NextField(start, "null_field"))
b.AppendNull()
b.FinishObject(start, fields)
v, err := b.Build()
```

This would create a variant that is equivalent to `{ "int_field": 1, "string_field": "Variant", "null_field": null }`. What do you think?
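The "smallest int type it can based on the value" rule mentioned for `AppendInt` can be sketched as a width-selection helper. The function name is hypothetical; it just shows the range checks such a builder would perform before choosing an int8/int16/int32/int64 encoding.

```go
package main

import (
	"fmt"
	"math"
)

// intWidth returns the narrowest signed width, in bytes, that can represent
// v. A builder would use this to pick the smallest integer encoding.
func intWidth(v int64) int {
	switch {
	case v >= math.MinInt8 && v <= math.MaxInt8:
		return 1
	case v >= math.MinInt16 && v <= math.MaxInt16:
		return 2
	case v >= math.MinInt32 && v <= math.MaxInt32:
		return 4
	default:
		return 8
	}
}

func main() {
	fmt.Println(intWidth(5), intWidth(300), intWidth(1<<40)) // 1 2 8
}
```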
mapleFU commented on issue #46555: URL: https://github.com/apache/arrow/issues/46555#issuecomment-2903939496 cc @pitrou @alamb @wgtmac @emkornfield for some interface design ideas