Re: [PR] Add hooks to json encoder to override default encoding or add support for unsupported types [arrow-rs]

2025-03-25 Thread via GitHub


alamb merged PR #7015:
URL: https://github.com/apache/arrow-rs/pull/7015


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] Add hooks to json encoder to override default encoding or add support for unsupported types [arrow-rs]

2025-03-25 Thread via GitHub


alamb commented on PR #7015:
URL: https://github.com/apache/arrow-rs/pull/7015#issuecomment-2752055289

   Thanks again @adriangb 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] Add hooks to json encoder to override default encoding or add support for unsupported types [arrow-rs]

2025-03-24 Thread via GitHub


alamb commented on PR #7015:
URL: https://github.com/apache/arrow-rs/pull/7015#issuecomment-2749214543

   I merged up from main and plan to merge this PR when CI passes


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] Add hooks to json encoder to override default encoding or add support for unsupported types [arrow-rs]

2025-03-19 Thread via GitHub


alamb commented on PR #7015:
URL: https://github.com/apache/arrow-rs/pull/7015#issuecomment-2738324992

   > As this is a breaking change it will need to wait until main opens up for 
such changes, which I believe should be after the current release process 
concludes.
   
   Yes I plan to complete the current release tomorrow or Friday
   - https://github.com/apache/arrow-rs/issues/7107
   
   Then I'll try and merge whatever is ready to go into main. 
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] Add hooks to json encoder to override default encoding or add support for unsupported types [arrow-rs]

2025-03-19 Thread via GitHub


alamb commented on code in PR #7015:
URL: https://github.com/apache/arrow-rs/pull/7015#discussion_r2001938813


##
arrow-json/src/writer/encoder.rs:
##
@@ -25,126 +27,322 @@ use arrow_schema::{ArrowError, DataType, FieldRef};
 use half::f16;
 use lexical_core::FormattedSize;
 use serde::Serializer;
-use std::io::Write;
 
+/// Configuration options for the JSON encoder.
 #[derive(Debug, Clone, Default)]
 pub struct EncoderOptions {
-pub explicit_nulls: bool,
-pub struct_mode: StructMode,
+/// Whether to include nulls in the output or elide them.
+explicit_nulls: bool,
+/// Whether to encode structs as JSON objects or JSON arrays of their 
values.
+struct_mode: StructMode,
+/// An optional hook for customizing encoding behavior.
+encoder_factory: Option>,
+}
+
+impl EncoderOptions {
+/// Set whether to include nulls in the output or elide them.
+pub fn with_explicit_nulls(mut self, explicit_nulls: bool) -> Self {
+self.explicit_nulls = explicit_nulls;
+self
+}
+
+/// Set whether to encode structs as JSON objects or JSON arrays of their 
values.
+pub fn with_struct_mode(mut self, struct_mode: StructMode) -> Self {
+self.struct_mode = struct_mode;
+self
+}
+
+/// Set an optional hook for customizing encoding behavior.
+pub fn with_encoder_factory(mut self, encoder_factory: Arc) -> Self {
+self.encoder_factory = Some(encoder_factory);
+self
+}
+
+/// Get whether to include nulls in the output or elide them.
+pub fn explicit_nulls(&self) -> bool {
+self.explicit_nulls
+}
+
+/// Get whether to encode structs as JSON objects or JSON arrays of their 
values.
+pub fn struct_mode(&self) -> StructMode {
+self.struct_mode
+}
+
+/// Get the optional hook for customizing encoding behavior.
+pub fn encoder_factory(&self) -> Option<&Arc> {
+self.encoder_factory.as_ref()
+}
+}
+
+/// A trait to create custom encoders for specific data types.
+///
+/// This allows overriding the default encoders for specific data types,
+/// or adding new encoders for custom data types.
+/// 
+/// # Examples
+/// 
+/// ```
+/// use std::io::Write;
+/// use arrow_array::{ArrayAccessor, Array, BinaryArray, Float64Array, 
RecordBatch};
+/// use arrow_array::cast::AsArray;
+/// use arrow_schema::{DataType, Field, Schema, FieldRef};
+/// use arrow_json::{writer::{WriterBuilder, JsonArray, 
EncoderWithNullBuffer}, StructMode};
+/// use arrow_json::{Encoder, EncoderFactory, EncoderOptions};
+/// use arrow_schema::ArrowError;
+/// use std::sync::Arc;
+/// use serde_json::json;
+/// use serde_json::Value;
+///
+/// struct IntArrayBinaryEncoder {
+/// array: B,
+/// }
+///
+/// impl<'a, B> Encoder for IntArrayBinaryEncoder
+/// where
+/// B: ArrayAccessor,
+/// {
+/// fn encode(&mut self, idx: usize, out: &mut Vec) {
+/// out.push(b'[');
+/// let child = self.array.value(idx);
+/// for (idx, byte) in child.iter().enumerate() {
+/// write!(out, "{byte}").unwrap();
+/// if idx < child.len() - 1 {
+/// out.push(b',');
+/// }
+/// }
+/// out.push(b']');
+/// }
+/// }
+///
+/// #[derive(Debug)]
+/// struct IntArayBinaryEncoderFactory;
+///
+/// impl EncoderFactory for IntArayBinaryEncoderFactory {
+/// fn make_default_encoder<'a>(
+/// &self,
+/// _field: &'a FieldRef,
+/// array: &'a dyn Array,
+/// _options: &'a EncoderOptions,
+/// ) -> Result>, ArrowError> {
+/// match array.data_type() {
+/// DataType::Binary => {
+/// let array = array.as_binary::();
+/// let encoder = IntArrayBinaryEncoder { array };
+/// let array_encoder = Box::new(encoder) as Box;
+/// let nulls = array.nulls().cloned();
+/// Ok(Some(EncoderWithNullBuffer::new(array_encoder, nulls)))
+/// }
+/// _ => Ok(None),
+/// }
+/// }
+/// }
+///
+/// let binary_array = BinaryArray::from_iter([Some(b"a".as_slice()), None, 
Some(b"b".as_slice())]);
+/// let float_array = Float64Array::from(vec![Some(1.0), Some(2.3), None]);
+/// let fields = vec![
+/// Field::new("bytes", DataType::Binary, true),
+/// Field::new("float", DataType::Float64, true),
+/// ];
+/// let batch = RecordBatch::try_new(
+/// Arc::new(Schema::new(fields)),
+/// vec![
+/// Arc::new(binary_array) as Arc,
+/// Arc::new(float_array) as Arc,
+/// ],
+/// )
+/// .unwrap();
+///
+/// let json_value: Value = {
+/// let mut buf = Vec::new();
+/// let mut writer = WriterBuilder::new()
+/// .with_encoder_factory(Arc::new(IntArayBinaryEncoderFactory))
+/// .build::<_, JsonArray>(&mut buf);
+/// writer.write_batches(&[&batch]).unwrap();
+/// writer.finish().unwrap();
+/// serde_json::from_slic

Re: [PR] Add hooks to json encoder to override default encoding or add support for unsupported types [arrow-rs]

2025-03-19 Thread via GitHub


adriangb commented on PR #7015:
URL: https://github.com/apache/arrow-rs/pull/7015#issuecomment-2731786354

   I pushed dcfae68 and here's what I get from `git checkout main && cargo 
bench --bench json_writer -- --save-baseline main && git checkout arrow-union 
&& cargo bench --bench json_writer -- --baseline=main` (sorry for not using 
criptcmp, could not get it to do what I want):
   
   ```
   bench_integer   time:   [2.8701 ms 2.9021 ms 2.9518 ms]
   change: [-1.0003% +1.4133% +3.6346%] (p = 0.26 > 
0.05)
   No change in performance detected.
   Found 1 outliers among 100 measurements (1.00%)
 1 (1.00%) high severe
   
   bench_float time:   [3.3993 ms 3.4496 ms 3.5122 ms]
   change: [+2.0880% +3.6490% +5.7300%] (p = 0.00 < 
0.05)
   Performance has regressed.
   Found 3 outliers among 100 measurements (3.00%)
 3 (3.00%) high severe
   
   bench_stringtime:   [8.6580 ms 8.7945 ms 8.9765 ms]
   change: [+1.4659% +3.3027% +5.6227%] (p = 0.00 < 
0.05)
   Performance has regressed.
   Found 3 outliers among 100 measurements (3.00%)
 1 (1.00%) high mild
 2 (2.00%) high severe
   
   bench_mixed time:   [11.691 ms 11.767 ms 11.846 ms]
   change: [-0.7671% +0.1921% +1.2132%] (p = 0.71 > 
0.05)
   No change in performance detected.
   
   bench_dict_arraytime:   [4.4199 ms 4.4658 ms 4.5184 ms]
   change: [+3.6129% +4.9766% +6.3930%] (p = 0.00 < 
0.05)
   Performance has regressed.
   Found 1 outliers among 100 measurements (1.00%)
 1 (1.00%) high severe
   
   bench_structtime:   [18.621 ms 18.735 ms 18.867 ms]
   change: [+1.1160% +1.9472% +2.7869%] (p = 0.00 < 
0.05)
   Performance has regressed.
   Found 4 outliers among 100 measurements (4.00%)
 2 (2.00%) high mild
 2 (2.00%) high severe
   
   bench_nullable_struct   time:   [8.9303 ms 8.9962 ms 9.0710 ms]
   change: [-0.5520% +0.3707% +1.3565%] (p = 0.46 > 
0.05)
   No change in performance detected.
   Found 4 outliers among 100 measurements (4.00%)
 2 (2.00%) high mild
 2 (2.00%) high severe
   
   bench_list  time:   [26.674 ms 26.809 ms 26.957 ms]
   change: [-0.4316% +0.2764% +1.0098%] (p = 0.46 > 
0.05)
   No change in performance detected.
   Found 3 outliers among 100 measurements (3.00%)
 2 (2.00%) high mild
 1 (1.00%) high severe
   
   bench_nullable_list time:   [5.5016 ms 5.5505 ms 5.6046 ms]
   change: [-0.9382% +0.3661% +1.6491%] (p = 0.57 > 
0.05)
   No change in performance detected.
   Found 5 outliers among 100 measurements (5.00%)
 3 (3.00%) high mild
 2 (2.00%) high severe
   
   bench_struct_list   time:   [2.8141 ms 2.8451 ms 2.8775 ms]
   change: [+0.4203% +2.2760% +4.2258%] (p = 0.02 < 
0.05)
   Change within noise threshold.
   Found 1 outliers among 100 measurements (1.00%)
 1 (1.00%) high mild
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] Add hooks to json encoder to override default encoding or add support for unsupported types [arrow-rs]

2025-03-18 Thread via GitHub


adriangb commented on PR #7015:
URL: https://github.com/apache/arrow-rs/pull/7015#issuecomment-2734968709

   Thank you for the careful reviews and patience!
   
   Two approvals, shall we merge?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] Add hooks to json encoder to override default encoding or add support for unsupported types [arrow-rs]

2025-03-18 Thread via GitHub


tustvold commented on PR #7015:
URL: https://github.com/apache/arrow-rs/pull/7015#issuecomment-2734969870

   As this is a breaking change it will need to wait until main opens up for 
such changes, which I believe should be after the current release process 
concludes.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] Add hooks to json encoder to override default encoding or add support for unsupported types [arrow-rs]

2025-03-18 Thread via GitHub


tustvold commented on code in PR #7015:
URL: https://github.com/apache/arrow-rs/pull/7015#discussion_r2002136298


##
arrow-json/src/writer/encoder.rs:
##
@@ -25,126 +27,322 @@ use arrow_schema::{ArrowError, DataType, FieldRef};
 use half::f16;
 use lexical_core::FormattedSize;
 use serde::Serializer;
-use std::io::Write;
 
+/// Configuration options for the JSON encoder.
 #[derive(Debug, Clone, Default)]
 pub struct EncoderOptions {
-pub explicit_nulls: bool,
-pub struct_mode: StructMode,
+/// Whether to include nulls in the output or elide them.
+explicit_nulls: bool,
+/// Whether to encode structs as JSON objects or JSON arrays of their 
values.
+struct_mode: StructMode,
+/// An optional hook for customizing encoding behavior.
+encoder_factory: Option>,
+}
+
+impl EncoderOptions {
+/// Set whether to include nulls in the output or elide them.
+pub fn with_explicit_nulls(mut self, explicit_nulls: bool) -> Self {
+self.explicit_nulls = explicit_nulls;
+self
+}
+
+/// Set whether to encode structs as JSON objects or JSON arrays of their 
values.
+pub fn with_struct_mode(mut self, struct_mode: StructMode) -> Self {
+self.struct_mode = struct_mode;
+self
+}
+
+/// Set an optional hook for customizing encoding behavior.
+pub fn with_encoder_factory(mut self, encoder_factory: Arc) -> Self {
+self.encoder_factory = Some(encoder_factory);
+self
+}
+
+/// Get whether to include nulls in the output or elide them.
+pub fn explicit_nulls(&self) -> bool {
+self.explicit_nulls
+}
+
+/// Get whether to encode structs as JSON objects or JSON arrays of their 
values.
+pub fn struct_mode(&self) -> StructMode {
+self.struct_mode
+}
+
+/// Get the optional hook for customizing encoding behavior.
+pub fn encoder_factory(&self) -> Option<&Arc> {
+self.encoder_factory.as_ref()
+}
+}
+
+/// A trait to create custom encoders for specific data types.
+///
+/// This allows overriding the default encoders for specific data types,
+/// or adding new encoders for custom data types.
+/// 
+/// # Examples
+/// 
+/// ```
+/// use std::io::Write;
+/// use arrow_array::{ArrayAccessor, Array, BinaryArray, Float64Array, 
RecordBatch};
+/// use arrow_array::cast::AsArray;
+/// use arrow_schema::{DataType, Field, Schema, FieldRef};
+/// use arrow_json::{writer::{WriterBuilder, JsonArray, 
EncoderWithNullBuffer}, StructMode};
+/// use arrow_json::{Encoder, EncoderFactory, EncoderOptions};
+/// use arrow_schema::ArrowError;
+/// use std::sync::Arc;
+/// use serde_json::json;
+/// use serde_json::Value;
+///
+/// struct IntArrayBinaryEncoder {
+/// array: B,
+/// }
+///
+/// impl<'a, B> Encoder for IntArrayBinaryEncoder
+/// where
+/// B: ArrayAccessor,
+/// {
+/// fn encode(&mut self, idx: usize, out: &mut Vec) {
+/// out.push(b'[');
+/// let child = self.array.value(idx);
+/// for (idx, byte) in child.iter().enumerate() {
+/// write!(out, "{byte}").unwrap();
+/// if idx < child.len() - 1 {
+/// out.push(b',');
+/// }
+/// }
+/// out.push(b']');
+/// }
+/// }
+///
+/// #[derive(Debug)]
+/// struct IntArayBinaryEncoderFactory;
+///
+/// impl EncoderFactory for IntArayBinaryEncoderFactory {
+/// fn make_default_encoder<'a>(
+/// &self,
+/// _field: &'a FieldRef,
+/// array: &'a dyn Array,
+/// _options: &'a EncoderOptions,
+/// ) -> Result>, ArrowError> {
+/// match array.data_type() {
+/// DataType::Binary => {
+/// let array = array.as_binary::();
+/// let encoder = IntArrayBinaryEncoder { array };
+/// let array_encoder = Box::new(encoder) as Box;
+/// let nulls = array.nulls().cloned();
+/// Ok(Some(EncoderWithNullBuffer::new(array_encoder, nulls)))
+/// }
+/// _ => Ok(None),
+/// }
+/// }
+/// }
+///
+/// let binary_array = BinaryArray::from_iter([Some(b"a".as_slice()), None, 
Some(b"b".as_slice())]);
+/// let float_array = Float64Array::from(vec![Some(1.0), Some(2.3), None]);
+/// let fields = vec![
+/// Field::new("bytes", DataType::Binary, true),
+/// Field::new("float", DataType::Float64, true),
+/// ];
+/// let batch = RecordBatch::try_new(
+/// Arc::new(Schema::new(fields)),
+/// vec![
+/// Arc::new(binary_array) as Arc,
+/// Arc::new(float_array) as Arc,
+/// ],
+/// )
+/// .unwrap();
+///
+/// let json_value: Value = {
+/// let mut buf = Vec::new();
+/// let mut writer = WriterBuilder::new()
+/// .with_encoder_factory(Arc::new(IntArayBinaryEncoderFactory))
+/// .build::<_, JsonArray>(&mut buf);
+/// writer.write_batches(&[&batch]).unwrap();
+/// writer.finish().unwrap();
+/// serde_json::from_s

Re: [PR] Add hooks to json encoder to override default encoding or add support for unsupported types [arrow-rs]

2025-03-18 Thread via GitHub


adriangb commented on code in PR #7015:
URL: https://github.com/apache/arrow-rs/pull/7015#discussion_r2002136062


##
arrow-json/src/writer/mod.rs:
##
@@ -351,8 +360,18 @@ where
 }
 
 let array = StructArray::from(batch.clone());
-let mut encoder = make_encoder(&array, &self.options)?;
+let field = Arc::new(Field::new_struct(
+"",
+batch.schema().fields().iter().cloned().collect::>(),
+false,
+));
+
+let encoder = make_encoder(&field, &array, &self.options)?;
+
+// Validate that the root is not nullable
+assert!(!encoder.has_nulls(), "root cannot be nullable");
 
+let mut encoder = make_encoder(&field, &array, &self.options)?;

Review Comment:
   Good catch 😄 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] Add hooks to json encoder to override default encoding or add support for unsupported types [arrow-rs]

2025-03-18 Thread via GitHub


tustvold commented on code in PR #7015:
URL: https://github.com/apache/arrow-rs/pull/7015#discussion_r2002134825


##
arrow-json/src/writer/mod.rs:
##
@@ -351,8 +360,18 @@ where
 }
 
 let array = StructArray::from(batch.clone());
-let mut encoder = make_encoder(&array, &self.options)?;
+let field = Arc::new(Field::new_struct(
+"",
+batch.schema().fields().iter().cloned().collect::>(),
+false,
+));
+
+let encoder = make_encoder(&field, &array, &self.options)?;
+
+// Validate that the root is not nullable
+assert!(!encoder.has_nulls(), "root cannot be nullable");
 
+let mut encoder = make_encoder(&field, &array, &self.options)?;

Review Comment:
   ```suggestion
   let mut encoder = make_encoder(&field, &array, &self.options)?;
   
   // Validate that the root is not nullable
   assert!(!encoder.has_nulls(), "root cannot be nullable");
   ```
   
   Or am I missing something subtle?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] Add hooks to json encoder to override default encoding or add support for unsupported types [arrow-rs]

2025-03-18 Thread via GitHub


tustvold commented on code in PR #7015:
URL: https://github.com/apache/arrow-rs/pull/7015#discussion_r2002134825


##
arrow-json/src/writer/mod.rs:
##
@@ -351,8 +360,18 @@ where
 }
 
 let array = StructArray::from(batch.clone());
-let mut encoder = make_encoder(&array, &self.options)?;
+let field = Arc::new(Field::new_struct(
+"",
+batch.schema().fields().iter().cloned().collect::>(),
+false,
+));
+
+let encoder = make_encoder(&field, &array, &self.options)?;
+
+// Validate that the root is not nullable
+assert!(!encoder.has_nulls(), "root cannot be nullable");
 
+let mut encoder = make_encoder(&field, &array, &self.options)?;

Review Comment:
   ```suggestion
   let mut encoder = make_encoder(&field, &array, &self.options)?;
   
   // Validate that the root is not nullable
   assert!(!encoder.has_nulls(), "root cannot be nullable");
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] Add hooks to json encoder to override default encoding or add support for unsupported types [arrow-rs]

2025-03-18 Thread via GitHub


adriangb commented on code in PR #7015:
URL: https://github.com/apache/arrow-rs/pull/7015#discussion_r2002132112


##
arrow-json/src/writer/encoder.rs:
##
@@ -25,126 +27,322 @@ use arrow_schema::{ArrowError, DataType, FieldRef};
 use half::f16;
 use lexical_core::FormattedSize;
 use serde::Serializer;
-use std::io::Write;
 
+/// Configuration options for the JSON encoder.
 #[derive(Debug, Clone, Default)]
 pub struct EncoderOptions {
-pub explicit_nulls: bool,
-pub struct_mode: StructMode,
+/// Whether to include nulls in the output or elide them.
+explicit_nulls: bool,
+/// Whether to encode structs as JSON objects or JSON arrays of their 
values.
+struct_mode: StructMode,
+/// An optional hook for customizing encoding behavior.
+encoder_factory: Option>,
+}
+
+impl EncoderOptions {
+/// Set whether to include nulls in the output or elide them.
+pub fn with_explicit_nulls(mut self, explicit_nulls: bool) -> Self {
+self.explicit_nulls = explicit_nulls;
+self
+}
+
+/// Set whether to encode structs as JSON objects or JSON arrays of their 
values.
+pub fn with_struct_mode(mut self, struct_mode: StructMode) -> Self {
+self.struct_mode = struct_mode;
+self
+}
+
+/// Set an optional hook for customizing encoding behavior.
+pub fn with_encoder_factory(mut self, encoder_factory: Arc) -> Self {
+self.encoder_factory = Some(encoder_factory);
+self
+}
+
+/// Get whether to include nulls in the output or elide them.
+pub fn explicit_nulls(&self) -> bool {
+self.explicit_nulls
+}
+
+/// Get whether to encode structs as JSON objects or JSON arrays of their 
values.
+pub fn struct_mode(&self) -> StructMode {
+self.struct_mode
+}
+
+/// Get the optional hook for customizing encoding behavior.
+pub fn encoder_factory(&self) -> Option<&Arc> {
+self.encoder_factory.as_ref()
+}
+}
+
+/// A trait to create custom encoders for specific data types.
+///
+/// This allows overriding the default encoders for specific data types,
+/// or adding new encoders for custom data types.
+///
+/// # Examples
+///
+/// ```
+/// use std::io::Write;
+/// use arrow_array::{ArrayAccessor, Array, BinaryArray, Float64Array, 
RecordBatch};
+/// use arrow_array::cast::AsArray;
+/// use arrow_schema::{DataType, Field, Schema, FieldRef};
+/// use arrow_json::{writer::{WriterBuilder, JsonArray, 
EncoderWithNullBuffer}, StructMode};
+/// use arrow_json::{Encoder, EncoderFactory, EncoderOptions};
+/// use arrow_schema::ArrowError;
+/// use std::sync::Arc;
+/// use serde_json::json;
+/// use serde_json::Value;
+///
+/// struct IntArrayBinaryEncoder {
+/// array: B,
+/// }
+///
+/// impl<'a, B> Encoder for IntArrayBinaryEncoder
+/// where
+/// B: ArrayAccessor,
+/// {
+/// fn encode(&mut self, idx: usize, out: &mut Vec) {
+/// out.push(b'[');
+/// let child = self.array.value(idx);
+/// for (idx, byte) in child.iter().enumerate() {
+/// write!(out, "{byte}").unwrap();
+/// if idx < child.len() - 1 {
+/// out.push(b',');
+/// }
+/// }
+/// out.push(b']');
+/// }
+/// }
+///
+/// #[derive(Debug)]
+/// struct IntArayBinaryEncoderFactory;
+///
+/// impl EncoderFactory for IntArayBinaryEncoderFactory {
+/// fn make_default_encoder<'a>(
+/// &self,
+/// _field: &'a FieldRef,
+/// array: &'a dyn Array,
+/// _options: &'a EncoderOptions,
+/// ) -> Result>, ArrowError> {
+/// match array.data_type() {
+/// DataType::Binary => {
+/// let array = array.as_binary::();
+/// let encoder = IntArrayBinaryEncoder { array };
+/// let array_encoder = Box::new(encoder) as Box;
+/// let nulls = array.nulls().cloned();
+/// Ok(Some(EncoderWithNullBuffer::new(array_encoder, nulls)))
+/// }
+/// _ => Ok(None),
+/// }
+/// }
+/// }
+///
+/// let binary_array = BinaryArray::from_iter([Some(b"a".as_slice()), None, 
Some(b"b".as_slice())]);
+/// let float_array = Float64Array::from(vec![Some(1.0), Some(2.3), None]);
+/// let fields = vec![
+/// Field::new("bytes", DataType::Binary, true),
+/// Field::new("float", DataType::Float64, true),
+/// ];
+/// let batch = RecordBatch::try_new(
+/// Arc::new(Schema::new(fields)),
+/// vec![
+/// Arc::new(binary_array) as Arc,
+/// Arc::new(float_array) as Arc,
+/// ],
+/// )
+/// .unwrap();
+///
+/// let json_value: Value = {
+/// let mut buf = Vec::new();
+/// let mut writer = WriterBuilder::new()
+/// .with_encoder_factory(Arc::new(IntArayBinaryEncoderFactory))
+/// .build::<_, JsonArray>(&mut buf);
+/// writer.write_batches(&[&batch]).unwrap();
+/// writer.finish().unwrap();
+/// serde_json::from_sli

Re: [PR] Add hooks to json encoder to override default encoding or add support for unsupported types [arrow-rs]

2025-03-18 Thread via GitHub


tustvold commented on code in PR #7015:
URL: https://github.com/apache/arrow-rs/pull/7015#discussion_r2002129979


##
arrow-json/src/writer/encoder.rs:
##
@@ -25,126 +27,322 @@ use arrow_schema::{ArrowError, DataType, FieldRef};
 use half::f16;
 use lexical_core::FormattedSize;
 use serde::Serializer;
-use std::io::Write;
 
+/// Configuration options for the JSON encoder.
 #[derive(Debug, Clone, Default)]
 pub struct EncoderOptions {
-pub explicit_nulls: bool,
-pub struct_mode: StructMode,
+/// Whether to include nulls in the output or elide them.
+explicit_nulls: bool,
+/// Whether to encode structs as JSON objects or JSON arrays of their 
values.
+struct_mode: StructMode,
+/// An optional hook for customizing encoding behavior.
+encoder_factory: Option>,
+}
+
+impl EncoderOptions {
+/// Set whether to include nulls in the output or elide them.
+pub fn with_explicit_nulls(mut self, explicit_nulls: bool) -> Self {
+self.explicit_nulls = explicit_nulls;
+self
+}
+
+/// Set whether to encode structs as JSON objects or JSON arrays of their 
values.
+pub fn with_struct_mode(mut self, struct_mode: StructMode) -> Self {
+self.struct_mode = struct_mode;
+self
+}
+
+/// Set an optional hook for customizing encoding behavior.
+pub fn with_encoder_factory(mut self, encoder_factory: Arc) -> Self {
+self.encoder_factory = Some(encoder_factory);
+self
+}
+
+/// Get whether to include nulls in the output or elide them.
+pub fn explicit_nulls(&self) -> bool {
+self.explicit_nulls
+}
+
+/// Get whether to encode structs as JSON objects or JSON arrays of their 
values.
+pub fn struct_mode(&self) -> StructMode {
+self.struct_mode
+}
+
+/// Get the optional hook for customizing encoding behavior.
+pub fn encoder_factory(&self) -> Option<&Arc> {
+self.encoder_factory.as_ref()
+}
+}
+
+/// A trait to create custom encoders for specific data types.
+///
+/// This allows overriding the default encoders for specific data types,
+/// or adding new encoders for custom data types.
+///
+/// # Examples
+///
+/// ```
+/// use std::io::Write;
+/// use arrow_array::{ArrayAccessor, Array, BinaryArray, Float64Array, 
RecordBatch};
+/// use arrow_array::cast::AsArray;
+/// use arrow_schema::{DataType, Field, Schema, FieldRef};
+/// use arrow_json::{writer::{WriterBuilder, JsonArray, 
EncoderWithNullBuffer}, StructMode};
+/// use arrow_json::{Encoder, EncoderFactory, EncoderOptions};
+/// use arrow_schema::ArrowError;
+/// use std::sync::Arc;
+/// use serde_json::json;
+/// use serde_json::Value;
+///
+/// struct IntArrayBinaryEncoder {
+/// array: B,
+/// }
+///
+/// impl<'a, B> Encoder for IntArrayBinaryEncoder
+/// where
+/// B: ArrayAccessor,
+/// {
+/// fn encode(&mut self, idx: usize, out: &mut Vec) {
+/// out.push(b'[');
+/// let child = self.array.value(idx);
+/// for (idx, byte) in child.iter().enumerate() {
+/// write!(out, "{byte}").unwrap();
+/// if idx < child.len() - 1 {
+/// out.push(b',');
+/// }
+/// }
+/// out.push(b']');
+/// }
+/// }
+///
+/// #[derive(Debug)]
+/// struct IntArayBinaryEncoderFactory;
+///
+/// impl EncoderFactory for IntArayBinaryEncoderFactory {
+/// fn make_default_encoder<'a>(
+/// &self,
+/// _field: &'a FieldRef,
+/// array: &'a dyn Array,
+/// _options: &'a EncoderOptions,
+/// ) -> Result>, ArrowError> {
+/// match array.data_type() {
+/// DataType::Binary => {
+/// let array = array.as_binary::();
+/// let encoder = IntArrayBinaryEncoder { array };
+/// let array_encoder = Box::new(encoder) as Box;
+/// let nulls = array.nulls().cloned();
+/// Ok(Some(EncoderWithNullBuffer::new(array_encoder, nulls)))
+/// }
+/// _ => Ok(None),
+/// }
+/// }
+/// }
+///
+/// let binary_array = BinaryArray::from_iter([Some(b"a".as_slice()), None, 
Some(b"b".as_slice())]);
+/// let float_array = Float64Array::from(vec![Some(1.0), Some(2.3), None]);
+/// let fields = vec![
+/// Field::new("bytes", DataType::Binary, true),
+/// Field::new("float", DataType::Float64, true),
+/// ];
+/// let batch = RecordBatch::try_new(
+/// Arc::new(Schema::new(fields)),
+/// vec![
+/// Arc::new(binary_array) as Arc,
+/// Arc::new(float_array) as Arc,
+/// ],
+/// )
+/// .unwrap();
+///
+/// let json_value: Value = {
+/// let mut buf = Vec::new();
+/// let mut writer = WriterBuilder::new()
+/// .with_encoder_factory(Arc::new(IntArayBinaryEncoderFactory))
+/// .build::<_, JsonArray>(&mut buf);
+/// writer.write_batches(&[&batch]).unwrap();
+/// writer.finish().unwrap();
+/// serde_json::from_sli

Re: [PR] Add hooks to json encoder to override default encoding or add support for unsupported types [arrow-rs]

2025-03-18 Thread via GitHub


adriangb commented on code in PR #7015:
URL: https://github.com/apache/arrow-rs/pull/7015#discussion_r2001188500


##
arrow-json/src/writer/encoder.rs:
##
@@ -25,126 +27,322 @@ use arrow_schema::{ArrowError, DataType, FieldRef};
 use half::f16;
 use lexical_core::FormattedSize;
 use serde::Serializer;
-use std::io::Write;
 
+/// Configuration options for the JSON encoder.
 #[derive(Debug, Clone, Default)]
 pub struct EncoderOptions {
-pub explicit_nulls: bool,
-pub struct_mode: StructMode,
+/// Whether to include nulls in the output or elide them.
+explicit_nulls: bool,
+/// Whether to encode structs as JSON objects or JSON arrays of their 
values.
+struct_mode: StructMode,
+/// An optional hook for customizing encoding behavior.
+encoder_factory: Option>,
+}
+
+impl EncoderOptions {
+/// Set whether to include nulls in the output or elide them.
+pub fn with_explicit_nulls(mut self, explicit_nulls: bool) -> Self {
+self.explicit_nulls = explicit_nulls;
+self
+}
+
+/// Set whether to encode structs as JSON objects or JSON arrays of their 
values.
+pub fn with_struct_mode(mut self, struct_mode: StructMode) -> Self {
+self.struct_mode = struct_mode;
+self
+}
+
+/// Set an optional hook for customizing encoding behavior.
+pub fn with_encoder_factory(mut self, encoder_factory: Arc) -> Self {
+self.encoder_factory = Some(encoder_factory);
+self
+}
+
+/// Get whether to include nulls in the output or elide them.
+pub fn explicit_nulls(&self) -> bool {
+self.explicit_nulls
+}
+
+/// Get whether to encode structs as JSON objects or JSON arrays of their 
values.
+pub fn struct_mode(&self) -> StructMode {
+self.struct_mode
+}
+
+/// Get the optional hook for customizing encoding behavior.
+pub fn encoder_factory(&self) -> Option<&Arc> {
+self.encoder_factory.as_ref()
+}
+}
+
+/// A trait to create custom encoders for specific data types.
+///
+/// This allows overriding the default encoders for specific data types,
+/// or adding new encoders for custom data types.
+/// 
+/// # Examples
+/// 
+/// ```
+/// use std::io::Write;
+/// use arrow_array::{ArrayAccessor, Array, BinaryArray, Float64Array, 
RecordBatch};
+/// use arrow_array::cast::AsArray;
+/// use arrow_schema::{DataType, Field, Schema, FieldRef};
+/// use arrow_json::{writer::{WriterBuilder, JsonArray, 
EncoderWithNullBuffer}, StructMode};
+/// use arrow_json::{Encoder, EncoderFactory, EncoderOptions};
+/// use arrow_schema::ArrowError;
+/// use std::sync::Arc;
+/// use serde_json::json;
+/// use serde_json::Value;
+///
+/// struct IntArrayBinaryEncoder {
+/// array: B,
+/// }
+///
+/// impl<'a, B> Encoder for IntArrayBinaryEncoder
+/// where
+/// B: ArrayAccessor,
+/// {
+/// fn encode(&mut self, idx: usize, out: &mut Vec) {
+/// out.push(b'[');
+/// let child = self.array.value(idx);
+/// for (idx, byte) in child.iter().enumerate() {
+/// write!(out, "{byte}").unwrap();
+/// if idx < child.len() - 1 {
+/// out.push(b',');
+/// }
+/// }
+/// out.push(b']');
+/// }
+/// }
+///
+/// #[derive(Debug)]
+/// struct IntArayBinaryEncoderFactory;
+///
+/// impl EncoderFactory for IntArayBinaryEncoderFactory {
+/// fn make_default_encoder<'a>(
+/// &self,
+/// _field: &'a FieldRef,
+/// array: &'a dyn Array,
+/// _options: &'a EncoderOptions,
+/// ) -> Result>, ArrowError> {
+/// match array.data_type() {
+/// DataType::Binary => {
+/// let array = array.as_binary::();
+/// let encoder = IntArrayBinaryEncoder { array };
+/// let array_encoder = Box::new(encoder) as Box;
+/// let nulls = array.nulls().cloned();
+/// Ok(Some(EncoderWithNullBuffer::new(array_encoder, nulls)))
+/// }
+/// _ => Ok(None),
+/// }
+/// }
+/// }
+///
+/// let binary_array = BinaryArray::from_iter([Some(b"a".as_slice()), None, 
Some(b"b".as_slice())]);
+/// let float_array = Float64Array::from(vec![Some(1.0), Some(2.3), None]);
+/// let fields = vec![
+/// Field::new("bytes", DataType::Binary, true),
+/// Field::new("float", DataType::Float64, true),
+/// ];
+/// let batch = RecordBatch::try_new(
+/// Arc::new(Schema::new(fields)),
+/// vec![
+/// Arc::new(binary_array) as Arc,
+/// Arc::new(float_array) as Arc,
+/// ],
+/// )
+/// .unwrap();
+///
+/// let json_value: Value = {
+/// let mut buf = Vec::new();
+/// let mut writer = WriterBuilder::new()
+/// .with_encoder_factory(Arc::new(IntArayBinaryEncoderFactory))
+/// .build::<_, JsonArray>(&mut buf);
+/// writer.write_batches(&[&batch]).unwrap();
+/// writer.finish().unwrap();
+/// serde_json::from_s

Re: [PR] Add hooks to json encoder to override default encoding or add support for unsupported types [arrow-rs]

2025-03-17 Thread via GitHub


alamb commented on code in PR #7015:
URL: https://github.com/apache/arrow-rs/pull/7015#discussion_r1999482887


##
arrow-json/src/writer/encoder.rs:
##
@@ -362,42 +480,36 @@ impl Encoder for ListEncoder<'_, O> {
 let end = self.offsets[idx + 1].as_usize();
 let start = self.offsets[idx].as_usize();
 out.push(b'[');
-match self.nulls.as_ref() {
-Some(n) => (start..end).for_each(|idx| {
-if idx != start {
-out.push(b',')
-}
-match n.is_null(idx) {
-true => out.extend_from_slice(b"null"),
-false => self.encoder.encode(idx, out),
-}
-}),
-None => (start..end).for_each(|idx| {
-if idx != start {
-out.push(b',')
-}
+

Review Comment:
   I think the previous formulation is important for performance:
   1. It checks for nulls once per output list (rather than once per output 
list element)
   2. The code for the `None` case is simpler and this more likely to be auto 
vectorized



##
arrow-json/src/writer/encoder.rs:
##
@@ -25,126 +27,230 @@ use arrow_schema::{ArrowError, DataType, FieldRef};
 use half::f16;
 use lexical_core::FormattedSize;
 use serde::Serializer;
-use std::io::Write;
 
+/// Configuration options for the JSON encoder.

Review Comment:
   I verified that this structure is not currently publically exported and thus 
this change is not a breaking API change:
   https://docs.rs/arrow-json/54.2.1/arrow_json/index.html?search=EncoderOptions



##
arrow-json/src/writer/mod.rs:
##
@@ -1953,4 +1968,317 @@ mod tests {
 "#,
 );
 }
+
+fn make_fallback_encoder_test_data() -> (RecordBatch, Arc) {
+// Note: this is not intended to be an efficient implementation.
+// Just a simple example to demonstrate how to implement a custom 
encoder.
+#[derive(Debug)]
+enum UnionValue {
+Null,
+Int32(i32),
+String(String),
+}
+
+#[derive(Debug)]
+struct UnionEncoder {

Review Comment:
   Filed a ticket to track
   - https://github.com/apache/arrow-rs/issues/7302



##
arrow-json/src/writer/encoder.rs:
##
@@ -25,126 +27,230 @@ use arrow_schema::{ArrowError, DataType, FieldRef};
 use half::f16;
 use lexical_core::FormattedSize;
 use serde::Serializer;
-use std::io::Write;
 
+/// Configuration options for the JSON encoder.
 #[derive(Debug, Clone, Default)]
 pub struct EncoderOptions {
-pub explicit_nulls: bool,
-pub struct_mode: StructMode,
+/// Whether to include nulls in the output or elide them.
+explicit_nulls: bool,
+/// Whether to encode structs as JSON objects or JSON arrays of their 
values.
+struct_mode: StructMode,
+/// An optional hook for customizing encoding behavior.
+encoder_factory: Option>,
+}
+
+impl EncoderOptions {
+/// Set whether to include nulls in the output or elide them.
+pub fn with_explicit_nulls(mut self, explicit_nulls: bool) -> Self {
+self.explicit_nulls = explicit_nulls;
+self
+}
+
+/// Set whether to encode structs as JSON objects or JSON arrays of their 
values.
+pub fn with_struct_mode(mut self, struct_mode: StructMode) -> Self {
+self.struct_mode = struct_mode;
+self
+}
+
+/// Set an optional hook for customizing encoding behavior.
+pub fn with_encoder_factory(mut self, encoder_factory: Arc) -> Self {
+self.encoder_factory = Some(encoder_factory);
+self
+}
+
+/// Get whether to include nulls in the output or elide them.
+pub fn explicit_nulls(&self) -> bool {
+self.explicit_nulls
+}
+
+/// Get whether to encode structs as JSON objects or JSON arrays of their 
values.
+pub fn struct_mode(&self) -> StructMode {
+self.struct_mode
+}
+
+/// Get the optional hook for customizing encoding behavior.
+pub fn encoder_factory(&self) -> Option<&Arc> {
+self.encoder_factory.as_ref()
+}
+}
+
+/// A trait to create custom encoders for specific data types.
+///
+/// This allows overriding the default encoders for specific data types,

Review Comment:
   It would be sweet to add some toy doc example of using this `EncoderFactory` 
in a doc comment



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] Add hooks to json encoder to override default encoding or add support for unsupported types [arrow-rs]

2025-03-17 Thread via GitHub


alamb commented on PR #7015:
URL: https://github.com/apache/arrow-rs/pull/7015#issuecomment-2730661788

   Here are some other results. 
   
   ```
   ++ critcmp main arrow-union
   grouparrow-unionmain
   ----
   bench_dict_array 1.13  7.8±0.10ms? ?/sec1.00  
6.9±0.06ms? ?/sec
   bench_float  1.00  6.2±0.01ms? ?/sec1.00  
6.2±0.05ms? ?/sec
   bench_integer1.05  5.5±0.01ms? ?/sec1.00  
5.2±0.01ms? ?/sec
   bench_list   1.05 78.8±0.31ms? ?/sec1.00 
74.9±0.36ms? ?/sec
   bench_mixed  1.08 31.2±0.35ms? ?/sec1.00 
28.9±0.27ms? ?/sec
   bench_nullable_list  1.19 11.2±0.19ms? ?/sec1.00  
9.4±0.23ms? ?/sec
   bench_nullable_struct1.12 18.2±0.16ms? ?/sec1.00 
16.2±0.21ms? ?/sec
   bench_string 1.15 15.8±0.19ms? ?/sec1.00 
13.7±0.15ms? ?/sec
   bench_struct 1.07 52.2±0.42ms? ?/sec1.00 
48.6±0.39ms? ?/sec
   bench_struct_list1.05  6.3±0.09ms? ?/sec1.00  
5.9±0.08ms? ?/sec
   ```
   
   ```
   ++ critcmp main arrow-union
   grouparrow-unionmain
   ----
   bench_dict_array 1.13  7.8±0.04ms? ?/sec1.00  
6.9±0.04ms? ?/sec
   bench_float  1.00  6.2±0.01ms? ?/sec1.01  
6.3±0.01ms? ?/sec
   bench_integer1.02  5.5±0.01ms? ?/sec1.00  
5.4±0.01ms? ?/sec
   bench_list   1.10 82.0±0.55ms? ?/sec1.00 
74.6±0.27ms? ?/sec
   bench_mixed  1.08 31.1±0.25ms? ?/sec1.00 
28.7±0.41ms? ?/sec
   bench_nullable_list  1.16 11.0±0.12ms? ?/sec1.00  
9.5±0.14ms? ?/sec
   bench_nullable_struct1.06 19.6±0.16ms? ?/sec1.00 
18.5±0.20ms? ?/sec
   bench_string 1.16 15.8±0.14ms? ?/sec1.00 
13.6±0.10ms? ?/sec
   bench_struct 1.08 52.2±0.46ms? ?/sec1.00 
48.4±0.29ms? ?/sec
   bench_struct_list1.08  6.4±0.06ms? ?/sec1.00  
5.9±0.07ms? ?/sec
   ++ popd
   ~/datafusion-benchmarking
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] Add hooks to json encoder to override default encoding or add support for unsupported types [arrow-rs]

2025-03-17 Thread via GitHub


adriangb commented on code in PR #7015:
URL: https://github.com/apache/arrow-rs/pull/7015#discussion_r1999888217


##
arrow-json/src/writer/encoder.rs:
##
@@ -25,126 +27,322 @@ use arrow_schema::{ArrowError, DataType, FieldRef};
 use half::f16;
 use lexical_core::FormattedSize;
 use serde::Serializer;
-use std::io::Write;
 
+/// Configuration options for the JSON encoder.
 #[derive(Debug, Clone, Default)]
 pub struct EncoderOptions {
-pub explicit_nulls: bool,
-pub struct_mode: StructMode,
+/// Whether to include nulls in the output or elide them.
+explicit_nulls: bool,
+/// Whether to encode structs as JSON objects or JSON arrays of their 
values.
+struct_mode: StructMode,
+/// An optional hook for customizing encoding behavior.
+encoder_factory: Option>,
+}
+
+impl EncoderOptions {
+/// Set whether to include nulls in the output or elide them.
+pub fn with_explicit_nulls(mut self, explicit_nulls: bool) -> Self {
+self.explicit_nulls = explicit_nulls;
+self
+}
+
+/// Set whether to encode structs as JSON objects or JSON arrays of their 
values.
+pub fn with_struct_mode(mut self, struct_mode: StructMode) -> Self {
+self.struct_mode = struct_mode;
+self
+}
+
+/// Set an optional hook for customizing encoding behavior.
+pub fn with_encoder_factory(mut self, encoder_factory: Arc) -> Self {
+self.encoder_factory = Some(encoder_factory);
+self
+}
+
+/// Get whether to include nulls in the output or elide them.
+pub fn explicit_nulls(&self) -> bool {
+self.explicit_nulls
+}
+
+/// Get whether to encode structs as JSON objects or JSON arrays of their 
values.
+pub fn struct_mode(&self) -> StructMode {
+self.struct_mode
+}
+
+/// Get the optional hook for customizing encoding behavior.
+pub fn encoder_factory(&self) -> Option<&Arc> {
+self.encoder_factory.as_ref()
+}
+}
+
+/// A trait to create custom encoders for specific data types.
+///
+/// This allows overriding the default encoders for specific data types,
+/// or adding new encoders for custom data types.
+/// 
+/// # Examples
+/// 
+/// ```
+/// use std::io::Write;
+/// use arrow_array::{ArrayAccessor, Array, BinaryArray, Float64Array, 
RecordBatch};
+/// use arrow_array::cast::AsArray;
+/// use arrow_schema::{DataType, Field, Schema, FieldRef};
+/// use arrow_json::{writer::{WriterBuilder, JsonArray, 
EncoderWithNullBuffer}, StructMode};
+/// use arrow_json::{Encoder, EncoderFactory, EncoderOptions};
+/// use arrow_schema::ArrowError;
+/// use std::sync::Arc;
+/// use serde_json::json;
+/// use serde_json::Value;
+///
+/// struct IntArrayBinaryEncoder {
+/// array: B,
+/// }
+///
+/// impl<'a, B> Encoder for IntArrayBinaryEncoder
+/// where
+/// B: ArrayAccessor,
+/// {
+/// fn encode(&mut self, idx: usize, out: &mut Vec) {
+/// out.push(b'[');
+/// let child = self.array.value(idx);
+/// for (idx, byte) in child.iter().enumerate() {
+/// write!(out, "{byte}").unwrap();
+/// if idx < child.len() - 1 {
+/// out.push(b',');
+/// }
+/// }
+/// out.push(b']');
+/// }
+/// }
+///
+/// #[derive(Debug)]
+/// struct IntArayBinaryEncoderFactory;
+///
+/// impl EncoderFactory for IntArayBinaryEncoderFactory {
+/// fn make_default_encoder<'a>(
+/// &self,
+/// _field: &'a FieldRef,
+/// array: &'a dyn Array,
+/// _options: &'a EncoderOptions,
+/// ) -> Result>, ArrowError> {
+/// match array.data_type() {
+/// DataType::Binary => {
+/// let array = array.as_binary::();
+/// let encoder = IntArrayBinaryEncoder { array };
+/// let array_encoder = Box::new(encoder) as Box;
+/// let nulls = array.nulls().cloned();
+/// Ok(Some(EncoderWithNullBuffer::new(array_encoder, nulls)))
+/// }
+/// _ => Ok(None),
+/// }
+/// }
+/// }
+///
+/// let binary_array = BinaryArray::from_iter([Some(b"a".as_slice()), None, 
Some(b"b".as_slice())]);
+/// let float_array = Float64Array::from(vec![Some(1.0), Some(2.3), None]);
+/// let fields = vec![
+/// Field::new("bytes", DataType::Binary, true),
+/// Field::new("float", DataType::Float64, true),
+/// ];
+/// let batch = RecordBatch::try_new(
+/// Arc::new(Schema::new(fields)),
+/// vec![
+/// Arc::new(binary_array) as Arc,
+/// Arc::new(float_array) as Arc,
+/// ],
+/// )
+/// .unwrap();
+///
+/// let json_value: Value = {
+/// let mut buf = Vec::new();
+/// let mut writer = WriterBuilder::new()
+/// .with_encoder_factory(Arc::new(IntArayBinaryEncoderFactory))
+/// .build::<_, JsonArray>(&mut buf);
+/// writer.write_batches(&[&batch]).unwrap();
+/// writer.finish().unwrap();
+/// serde_json::from_s

Re: [PR] Add hooks to json encoder to override default encoding or add support for unsupported types [arrow-rs]

2025-03-17 Thread via GitHub


adriangb commented on code in PR #7015:
URL: https://github.com/apache/arrow-rs/pull/7015#discussion_r1999797429


##
arrow-json/src/writer/encoder.rs:
##
@@ -362,42 +480,36 @@ impl Encoder for ListEncoder<'_, O> {
 let end = self.offsets[idx + 1].as_usize();
 let start = self.offsets[idx].as_usize();
 out.push(b'[');
-match self.nulls.as_ref() {
-Some(n) => (start..end).for_each(|idx| {
-if idx != start {
-out.push(b',')
-}
-match n.is_null(idx) {
-true => out.extend_from_slice(b"null"),
-false => self.encoder.encode(idx, out),
-}
-}),
-None => (start..end).for_each(|idx| {
-if idx != start {
-out.push(b',')
-}
+

Review Comment:
   Okay will try to reproduce the slowdown tomorrow and address it



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] Add hooks to json encoder to override default encoding or add support for unsupported types [arrow-rs]

2025-03-17 Thread via GitHub


alamb commented on code in PR #7015:
URL: https://github.com/apache/arrow-rs/pull/7015#discussion_r1999503132


##
arrow-json/src/writer/encoder.rs:
##
@@ -25,126 +27,230 @@ use arrow_schema::{ArrowError, DataType, FieldRef};
 use half::f16;
 use lexical_core::FormattedSize;
 use serde::Serializer;
-use std::io::Write;
 
+/// Configuration options for the JSON encoder.
 #[derive(Debug, Clone, Default)]
 pub struct EncoderOptions {
-pub explicit_nulls: bool,
-pub struct_mode: StructMode,
+/// Whether to include nulls in the output or elide them.
+explicit_nulls: bool,
+/// Whether to encode structs as JSON objects or JSON arrays of their 
values.
+struct_mode: StructMode,
+/// An optional hook for customizing encoding behavior.
+encoder_factory: Option>,
+}
+
+impl EncoderOptions {
+/// Set whether to include nulls in the output or elide them.
+pub fn with_explicit_nulls(mut self, explicit_nulls: bool) -> Self {
+self.explicit_nulls = explicit_nulls;
+self
+}
+
+/// Set whether to encode structs as JSON objects or JSON arrays of their 
values.
+pub fn with_struct_mode(mut self, struct_mode: StructMode) -> Self {
+self.struct_mode = struct_mode;
+self
+}
+
+/// Set an optional hook for customizing encoding behavior.
+pub fn with_encoder_factory(mut self, encoder_factory: Arc) -> Self {
+self.encoder_factory = Some(encoder_factory);
+self
+}
+
+/// Get whether to include nulls in the output or elide them.
+pub fn explicit_nulls(&self) -> bool {
+self.explicit_nulls
+}
+
+/// Get whether to encode structs as JSON objects or JSON arrays of their 
values.
+pub fn struct_mode(&self) -> StructMode {
+self.struct_mode
+}
+
+/// Get the optional hook for customizing encoding behavior.
+pub fn encoder_factory(&self) -> Option<&Arc> {
+self.encoder_factory.as_ref()
+}
+}
+
+/// A trait to create custom encoders for specific data types.
+///
+/// This allows overriding the default encoders for specific data types,
+/// or adding new encoders for custom data types.
+pub trait EncoderFactory: std::fmt::Debug + Send + Sync {
+/// Make an encoder that if returned runs before all of the default 
encoders.

Review Comment:
   what does "runs before all default encoders" mean?
   
   Maybe it would be clearer to say "If a decoder is returned it does not use 
the default encoder at all" 🤔 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] Add hooks to json encoder to override default encoding or add support for unsupported types [arrow-rs]

2025-03-16 Thread via GitHub


aditanase commented on PR #7015:
URL: https://github.com/apache/arrow-rs/pull/7015#issuecomment-2727490753

   Big 👍  for this PR.
   
   I have a simpler use case, I would like to implement a `to_json` UDF in 
datafusion. I have a PoC currently in which I am implementing the Write trait 
for a wrappwer around GenericStringBuilder, coupled with the current 
LineDelimiter JsonFormatter. In the write function I split on \n and call 
append_value on the builder.
   
   By simply having access to make_encoder this would be much simpler and I 
could also add support for nullable columns.
   
   Would be awesome if this lands in the next release! 🚀


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] Add hooks to json encoder to override default encoding or add support for unsupported types [arrow-rs]

2025-03-13 Thread via GitHub


alamb commented on PR #7015:
URL: https://github.com/apache/arrow-rs/pull/7015#issuecomment-2720967036

   > @alamb @tustvold could you please give this another look? I think I 
addressed the major outstanding issue of performance.
   
   I will try to do so later today, but I have several other things to do first


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] Add hooks to json encoder to override default encoding or add support for unsupported types [arrow-rs]

2025-03-12 Thread via GitHub


adriangb commented on code in PR #7015:
URL: https://github.com/apache/arrow-rs/pull/7015#discussion_r1992335818


##
arrow-json/src/writer/encoder.rs:
##
@@ -196,9 +236,18 @@ impl Encoder for StructArrayEncoder<'_> {
 let mut is_first = true;
 // Nulls can only be dropped in explicit mode
 let drop_nulls = (self.struct_mode == StructMode::ObjectOnly) && 
!self.explicit_nulls;
-for field_encoder in &mut self.encoders {
-let is_null = is_some_and(field_encoder.nulls.as_ref(), |n| 
n.is_null(idx));
-if drop_nulls && is_null {
+
+// Collect all the field nulls buffers up front to avoid dynamic 
dispatch in the loop
+// This creates a temporary Vec, but avoids repeated virtual calls 
which should be a net win

Review Comment:
   ```
   ❯ git checkout main && cargo bench --bench json_writer && git checkout 
arrow-union && cargo bench --bench json_writer
   Switched to branch 'main'
   Your branch is ahead of 'origin/main' by 6 commits.
 (use "git push" to publish your local commits)
  Compiling arrow-json v54.2.1 (/Users/adriangb/GitHub/arrow-rs/arrow-json)
  Compiling arrow v54.2.1 (/Users/adriangb/GitHub/arrow-rs/arrow)
   Finished `bench` profile [optimized] target(s) in 3.03s
Running benches/json_writer.rs 
(target/release/deps/json_writer-8d99cf5d0d08efba)
   Gnuplot not found, using plotters backend
   bench_integer   time:   [2.9239 ms 2.9488 ms 2.9810 ms]
   change: [-1.4155% -0.3843% +0.7943%] (p = 0.52 > 
0.05)
   No change in performance detected.
   Found 2 outliers among 100 measurements (2.00%)
 1 (1.00%) high mild
 1 (1.00%) high severe
   
   bench_float time:   [3.4054 ms 3.4216 ms 3.4394 ms]
   change: [-3.7264% -2.9504% -2.1609%] (p = 0.00 < 
0.05)
   Performance has improved.
   Found 15 outliers among 100 measurements (15.00%)
 8 (8.00%) high mild
 7 (7.00%) high severe
   
   bench_stringtime:   [8.8852 ms 8.9466 ms 9.0106 ms]
   change: [-1.3959% -0.5660% +0.3531%] (p = 0.21 > 
0.05)
   No change in performance detected.
   Found 2 outliers among 100 measurements (2.00%)
 2 (2.00%) high mild
   
   bench_mixed time:   [11.975 ms 12.030 ms 12.087 ms]
   change: [-3.7701% -3.0819% -2.4763%] (p = 0.00 < 
0.05)
   Performance has improved.
   Found 5 outliers among 100 measurements (5.00%)
 5 (5.00%) high mild
   
   bench_dict_arraytime:   [4.6221 ms 4.6771 ms 4.7345 ms]
   change: [-2.4806% -0.5873% +1.2096%] (p = 0.53 > 
0.05)
   No change in performance detected.
   Found 4 outliers among 100 measurements (4.00%)
 3 (3.00%) high mild
 1 (1.00%) high severe
   
   bench_structtime:   [19.521 ms 19.623 ms 19.731 ms]
   change: [-0.9078% +0.1000% +0.9733%] (p = 0.84 > 
0.05)
   No change in performance detected.
   Found 2 outliers among 100 measurements (2.00%)
 2 (2.00%) high mild
   
   bench_nullable_struct   time:   [9.4059 ms 9.4565 ms 9.5092 ms]
   change: [-3.5737% -2.7898% -1.9920%] (p = 0.00 < 
0.05)
   Performance has improved.
   Found 2 outliers among 100 measurements (2.00%)
 2 (2.00%) high mild
   
   bench_list  time:   [27.854 ms 27.976 ms 28.118 ms]
   change: [-0.8182% -0.3309% +0.1897%] (p = 0.22 > 
0.05)
   No change in performance detected.
   Found 10 outliers among 100 measurements (10.00%)
 7 (7.00%) high mild
 3 (3.00%) high severe
   
   bench_nullable_list time:   [5.9283 ms 5.9849 ms 6.0429 ms]
   change: [-8.7315% -6.8046% -5.0511%] (p = 0.00 < 
0.05)
   Performance has improved.
   
   bench_struct_list   time:   [2.9264 ms 2.9694 ms 3.0130 ms]
   change: [-5.8590% -3.8633% -1.8735%] (p = 0.00 < 
0.05)
   Performance has improved.
   
   Switched to branch 'arrow-union'
   Your branch is up to date with 'origin/arrow-union'.
  Compiling arrow-json v54.2.1 (/Users/adriangb/GitHub/arrow-rs/arrow-json)
  Compiling arrow v54.2.1 (/Users/adriangb/GitHub/arrow-rs/arrow)
   Finished `bench` profile [optimized] target(s) in 3.03s
Running benches/json_writer.rs 
(target/release/deps/json_writer-8d99cf5d0d08efba)
   Gnuplot not found, using plotters backend
   bench_integer   time:   [2.9105 ms 2.9313 ms 2.9542 ms]
   change: [-1.8789% -0.5924% +0.5939%] (p = 0.36 > 
0.05)
   No change in performance detected.
   Found 3 outliers among 100 measurements (3.00%)
 1 (1.00%) high mild

Re: [PR] Add hooks to json encoder to override default encoding or add support for unsupported types [arrow-rs]

2025-03-12 Thread via GitHub


adriangb commented on PR #7015:
URL: https://github.com/apache/arrow-rs/pull/7015#issuecomment-2719188759

   @alamb @tustvold could you please give this another look? I think I 
addressed the major outstanding issue of performance.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] Add hooks to json encoder to override default encoding or add support for unsupported types [arrow-rs]

2025-03-12 Thread via GitHub


adriangb commented on code in PR #7015:
URL: https://github.com/apache/arrow-rs/pull/7015#discussion_r1992326453


##
arrow-json/src/writer/encoder.rs:
##
@@ -196,9 +236,18 @@ impl Encoder for StructArrayEncoder<'_> {
 let mut is_first = true;
 // Nulls can only be dropped in explicit mode
 let drop_nulls = (self.struct_mode == StructMode::ObjectOnly) && 
!self.explicit_nulls;
-for field_encoder in &mut self.encoders {
-let is_null = is_some_and(field_encoder.nulls.as_ref(), |n| 
n.is_null(idx));
-if drop_nulls && is_null {
+
+// Collect all the field nulls buffers up front to avoid dynamic 
dispatch in the loop
+// This creates a temporary Vec, but avoids repeated virtual calls 
which should be a net win

Review Comment:
   I had run it and there was a measurable (~30%) slowdown. This makes sense, 
it was doing 2 dynamic dispatch calls for every row instead of just 1.
   
   I've now refactored this to avoid the dynamic dispatch altogether by 
encapsulating the dynamic encoder + null buffer in a helper struct. So it's 
basically the same as before except that things get passed around as a struct 
instead of a tuple of unrelated values. I don't love the name 
`EncoderWithNullBuffer`, open to suggestions.
   
   Benchmarks now show no difference vs. `main`.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] Add hooks to json encoder to override default encoding or add support for unsupported types [arrow-rs]

2025-03-12 Thread via GitHub


alamb commented on code in PR #7015:
URL: https://github.com/apache/arrow-rs/pull/7015#discussion_r1992264897


##
arrow-json/src/writer/encoder.rs:
##
@@ -196,9 +236,18 @@ impl Encoder for StructArrayEncoder<'_> {
 let mut is_first = true;
 // Nulls can only be dropped in explicit mode
 let drop_nulls = (self.struct_mode == StructMode::ObjectOnly) && 
!self.explicit_nulls;
-for field_encoder in &mut self.encoders {
-let is_null = is_some_and(field_encoder.nulls.as_ref(), |n| 
n.is_null(idx));
-if drop_nulls && is_null {
+
+// Collect all the field nulls buffers up front to avoid dynamic 
dispatch in the loop
+// This creates a temporary Vec, but avoids repeated virtual calls 
which should be a net win

Review Comment:
   Let's just rn
   ```shell
   cargo bench --bench json_writer
   ```
   
   And see what the numbers say?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] Add hooks to json encoder to override default encoding or add support for unsupported types [arrow-rs]

2025-03-12 Thread via GitHub


alamb commented on PR #7015:
URL: https://github.com/apache/arrow-rs/pull/7015#issuecomment-2719085000

   Should we be trying to get this change in for  the release in a day or two?
   - https://github.com/apache/arrow-rs/issues/7107
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] Add hooks to json encoder to override default encoding or add support for unsupported types [arrow-rs]

2025-03-11 Thread via GitHub


adriangb commented on code in PR #7015:
URL: https://github.com/apache/arrow-rs/pull/7015#discussion_r1988215898


##
arrow-json/src/writer/encoder.rs:
##
@@ -196,9 +236,18 @@ impl Encoder for StructArrayEncoder<'_> {
 let mut is_first = true;
 // Nulls can only be dropped in explicit mode
 let drop_nulls = (self.struct_mode == StructMode::ObjectOnly) && 
!self.explicit_nulls;
-for field_encoder in &mut self.encoders {
-let is_null = is_some_and(field_encoder.nulls.as_ref(), |n| 
n.is_null(idx));
-if drop_nulls && is_null {
+
+// Collect all the field nulls buffers up front to avoid dynamic 
dispatch in the loop
+// This creates a temporary Vec, but avoids repeated virtual calls 
which should be a net win

Review Comment:
   To be clear though, it currently does do dynamic dispatch for every non-null 
value write, right?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] Add hooks to json encoder to override default encoding or add support for unsupported types [arrow-rs]

2025-03-11 Thread via GitHub


adriangb commented on code in PR #7015:
URL: https://github.com/apache/arrow-rs/pull/7015#discussion_r1987877637


##
arrow-json/src/writer/encoder.rs:
##
@@ -25,126 +27,160 @@ use arrow_schema::{ArrowError, DataType, FieldRef};
 use half::f16;
 use lexical_core::FormattedSize;
 use serde::Serializer;
-use std::io::Write;
 
+/// Configuration options for the JSON encoder.
 #[derive(Debug, Clone, Default)]

Review Comment:
   Done in 5576ebf09



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] Add hooks to json encoder to override default encoding or add support for unsupported types [arrow-rs]

2025-03-10 Thread via GitHub


adriangb commented on PR #7015:
URL: https://github.com/apache/arrow-rs/pull/7015#issuecomment-2712204581

   > I think the changes make the null handling worse 😅
   > 
   > The API was designed the way it was for a reason, I think if we want to 
change it we need to justify doing so empirically
   
   To be clear, you're saying that the API `make_encoder(...) -> (Box, NullBuffer)` is a better API than adding a `nulls()` method to 
`Encoder` to package them up together? Or is it something else about the 
existing implementation that I'm missing? I can try to revert it if that's it.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] Add hooks to json encoder to override default encoding or add support for unsupported types [arrow-rs]

2025-03-10 Thread via GitHub


adriangb commented on code in PR #7015:
URL: https://github.com/apache/arrow-rs/pull/7015#discussion_r1988214038


##
arrow-json/src/writer/encoder.rs:
##
@@ -290,15 +344,41 @@ impl PrimitiveEncode for f16 {
 }
 }
 
+/// Extension trait providing null-related methods to `Option`
+pub trait NullBufferExt {

Review Comment:
   thanks good catch, made it `pub(crate)` since it is used in `mod.rs`.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] Add hooks to json encoder to override default encoding or add support for unsupported types [arrow-rs]

2025-03-09 Thread via GitHub


tustvold commented on code in PR #7015:
URL: https://github.com/apache/arrow-rs/pull/7015#discussion_r1986250002


##
arrow-json/src/writer/encoder.rs:
##
@@ -290,15 +344,41 @@ impl PrimitiveEncode for f16 {
 }
 }
 
+/// Extension trait providing null-related methods to `Option`
+pub trait NullBufferExt {

Review Comment:
   ```suggestion
   trait NullBufferExt {
   ```
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] Add hooks to json encoder to override default encoding or add support for unsupported types [arrow-rs]

2025-03-09 Thread via GitHub


tustvold commented on code in PR #7015:
URL: https://github.com/apache/arrow-rs/pull/7015#discussion_r1986249663


##
arrow-json/src/writer/encoder.rs:
##
@@ -196,9 +236,18 @@ impl Encoder for StructArrayEncoder<'_> {
 let mut is_first = true;
 // Nulls can only be dropped in explicit mode
 let drop_nulls = (self.struct_mode == StructMode::ObjectOnly) && 
!self.explicit_nulls;
-for field_encoder in &mut self.encoders {
-let is_null = is_some_and(field_encoder.nulls.as_ref(), |n| 
n.is_null(idx));
-if drop_nulls && is_null {
+
+// Collect all the field nulls buffers up front to avoid dynamic 
dispatch in the loop
+// This creates a temporary Vec, but avoids repeated virtual calls 
which should be a net win

Review Comment:
   We're still doing this dynamic dispatch to get null information on every 
value write... Unless we can show with benchmarks it doesn't regress 
performance I'm lukewarm on this



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] Add hooks to json encoder to override default encoding or add support for unsupported types [arrow-rs]

2025-03-08 Thread via GitHub


adriangb commented on code in PR #7015:
URL: https://github.com/apache/arrow-rs/pull/7015#discussion_r1986134439


##
arrow-json/src/writer/encoder.rs:
##
@@ -25,126 +27,160 @@ use arrow_schema::{ArrowError, DataType, FieldRef};
 use half::f16;
 use lexical_core::FormattedSize;
 use serde::Serializer;
-use std::io::Write;
 
+/// Configuration options for the JSON encoder.
 #[derive(Debug, Clone, Default)]
 pub struct EncoderOptions {
+/// Whether to include nulls in the output or elide them.
 pub explicit_nulls: bool,
+/// Whether to encode structs as JSON objects or JSON arrays of their 
values.
 pub struct_mode: StructMode,
+/// An optional hook for customizing encoding behavior.
+pub encoder_factory: Option>,
+}
+
+/// A trait to create custom encoders for specific data types.
+///
+/// This allows overriding the default encoders for specific data types,
+/// or adding new encoders for custom data types.
+pub trait EncoderFactory: std::fmt::Debug + Send + Sync {
+/// Make an encoder that if returned runs before all of the default 
encoders.
+/// This can be used to override how e.g. binary data is encoded so that 
it is an encoded string or an array of integers.
+///
+/// Note that the type of the field may not match the type of the array: 
for dictionary arrays unless the top-level dictionary is handled this
+/// will be called again for the keys and values of the dictionary, at 
which point the field type will still be the outer dictionary type but the
+/// array will have a different type.
+/// For example, `field`` might have the type `Dictionary(i32, Utf8)` but 
`array` will be `Utf8`.
+fn make_default_encoder<'a>(
+&self,
+_field: &'a FieldRef,
+_array: &'a dyn Array,
+_options: &'a EncoderOptions,
+) -> Result>, ArrowError> {
+Ok(None)
+}
 }
 
 /// A trait to format array values as JSON values
 ///
 /// Nullability is handled by the caller to allow encoding nulls implicitly, 
i.e. `{}` instead of `{"a": null}`
 pub trait Encoder {
-/// Encode the non-null value at index `idx` to `out`
+/// Encode the non-null value at index `idx` to `out`.
 ///
-/// The behaviour is unspecified if `idx` corresponds to a null index
+/// The behaviour is unspecified if `idx` corresponds to a null index.
 fn encode(&mut self, idx: usize, out: &mut Vec);
+
+/// Check if the value at `idx` is null.
+fn is_null(&self, _idx: usize) -> bool;
+
+/// Short circuiting null checks if there are none.
+fn has_nulls(&self) -> bool;

Review Comment:
   Hard to remember now but I think the primary motivation was to encapsulate 
the (Encoder, NullBuffer) into the Encoder to make the whole API easier to work 
with.
   
   I've now collapsed these into a single method that returns an 
`Option` which can be hoisted out of loops. I think that should be 
equivalent to the old implementation but with a better API for implementers.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] Add hooks to json encoder to override default encoding or add support for unsupported types [arrow-rs]

2025-03-08 Thread via GitHub


adriangb commented on PR #7015:
URL: https://github.com/apache/arrow-rs/pull/7015#issuecomment-2708409677

   Thanks for reviewing! I'll try to work out how to do the null handling 
differently.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] Add hooks to json encoder to override default encoding or add support for unsupported types [arrow-rs]

2025-03-08 Thread via GitHub


tustvold commented on code in PR #7015:
URL: https://github.com/apache/arrow-rs/pull/7015#discussion_r1986122353


##
arrow-json/src/writer/encoder.rs:
##
@@ -25,126 +27,160 @@ use arrow_schema::{ArrowError, DataType, FieldRef};
 use half::f16;
 use lexical_core::FormattedSize;
 use serde::Serializer;
-use std::io::Write;
 
+/// Configuration options for the JSON encoder.
 #[derive(Debug, Clone, Default)]
 pub struct EncoderOptions {
+/// Whether to include nulls in the output or elide them.
 pub explicit_nulls: bool,
+/// Whether to encode structs as JSON objects or JSON arrays of their 
values.
 pub struct_mode: StructMode,
+/// An optional hook for customizing encoding behavior.
+pub encoder_factory: Option>,
+}
+
+/// A trait to create custom encoders for specific data types.
+///
+/// This allows overriding the default encoders for specific data types,
+/// or adding new encoders for custom data types.
+pub trait EncoderFactory: std::fmt::Debug + Send + Sync {
+/// Make an encoder that if returned runs before all of the default 
encoders.
+/// This can be used to override how e.g. binary data is encoded so that 
it is an encoded string or an array of integers.
+///
+/// Note that the type of the field may not match the type of the array: 
for dictionary arrays unless the top-level dictionary is handled this
+/// will be called again for the keys and values of the dictionary, at 
which point the field type will still be the outer dictionary type but the
+/// array will have a different type.
+/// For example, `field`` might have the type `Dictionary(i32, Utf8)` but 
`array` will be `Utf8`.
+fn make_default_encoder<'a>(
+&self,
+_field: &'a FieldRef,
+_array: &'a dyn Array,
+_options: &'a EncoderOptions,
+) -> Result>, ArrowError> {
+Ok(None)
+}
 }
 
 /// A trait to format array values as JSON values
 ///
 /// Nullability is handled by the caller to allow encoding nulls implicitly, 
i.e. `{}` instead of `{"a": null}`
 pub trait Encoder {
-/// Encode the non-null value at index `idx` to `out`
+/// Encode the non-null value at index `idx` to `out`.
 ///
-/// The behaviour is unspecified if `idx` corresponds to a null index
+/// The behaviour is unspecified if `idx` corresponds to a null index.
 fn encode(&mut self, idx: usize, out: &mut Vec);
+
+/// Check if the value at `idx` is null.
+fn is_null(&self, _idx: usize) -> bool;
+
+/// Short circuiting null checks if there are none.
+fn has_nulls(&self) -> bool;

Review Comment:
   I guess I'm not seeing why it needs to be changed, this wasn't necessary 
before...



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] Add hooks to json encoder to override default encoding or add support for unsupported types [arrow-rs]

2025-03-08 Thread via GitHub


adriangb commented on code in PR #7015:
URL: https://github.com/apache/arrow-rs/pull/7015#discussion_r1986122184


##
arrow-json/src/writer/encoder.rs:
##
@@ -25,126 +27,160 @@ use arrow_schema::{ArrowError, DataType, FieldRef};
 use half::f16;
 use lexical_core::FormattedSize;
 use serde::Serializer;
-use std::io::Write;
 
+/// Configuration options for the JSON encoder.
 #[derive(Debug, Clone, Default)]
 pub struct EncoderOptions {
+/// Whether to include nulls in the output or elide them.
 pub explicit_nulls: bool,
+/// Whether to encode structs as JSON objects or JSON arrays of their 
values.
 pub struct_mode: StructMode,
+/// An optional hook for customizing encoding behavior.
+pub encoder_factory: Option>,
+}
+
+/// A trait to create custom encoders for specific data types.
+///
+/// This allows overriding the default encoders for specific data types,
+/// or adding new encoders for custom data types.
+pub trait EncoderFactory: std::fmt::Debug + Send + Sync {
+/// Make an encoder that if returned runs before all of the default 
encoders.
+/// This can be used to override how e.g. binary data is encoded so that 
it is an encoded string or an array of integers.
+///
+/// Note that the type of the field may not match the type of the array: 
for dictionary arrays unless the top-level dictionary is handled this
+/// will be called again for the keys and values of the dictionary, at 
which point the field type will still be the outer dictionary type but the
+/// array will have a different type.
+/// For example, `field`` might have the type `Dictionary(i32, Utf8)` but 
`array` will be `Utf8`.
+fn make_default_encoder<'a>(
+&self,
+_field: &'a FieldRef,
+_array: &'a dyn Array,
+_options: &'a EncoderOptions,
+) -> Result>, ArrowError> {
+Ok(None)
+}
 }
 
 /// A trait to format array values as JSON values
 ///
 /// Nullability is handled by the caller to allow encoding nulls implicitly, 
i.e. `{}` instead of `{"a": null}`
 pub trait Encoder {
-/// Encode the non-null value at index `idx` to `out`
+/// Encode the non-null value at index `idx` to `out`.
 ///
-/// The behaviour is unspecified if `idx` corresponds to a null index
+/// The behaviour is unspecified if `idx` corresponds to a null index.
 fn encode(&mut self, idx: usize, out: &mut Vec);
+
+/// Check if the value at `idx` is null.
+fn is_null(&self, _idx: usize) -> bool;
+
+/// Short circuiting null checks if there are none.
+fn has_nulls(&self) -> bool;

Review Comment:
   Makes sense. I guess it does need to be dynamic, but I take it you'd rather 
have a method that returns a NullBuffer the length of the array in one go?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] Add hooks to json encoder to override default encoding or add support for unsupported types [arrow-rs]

2025-03-08 Thread via GitHub


adriangb commented on code in PR #7015:
URL: https://github.com/apache/arrow-rs/pull/7015#discussion_r1986121889


##
arrow-json/src/writer/encoder.rs:
##
@@ -25,126 +27,160 @@ use arrow_schema::{ArrowError, DataType, FieldRef};
 use half::f16;
 use lexical_core::FormattedSize;
 use serde::Serializer;
-use std::io::Write;
 
+/// Configuration options for the JSON encoder.
 #[derive(Debug, Clone, Default)]

Review Comment:
   Will do



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] Add hooks to json encoder to override default encoding or add support for unsupported types [arrow-rs]

2025-03-08 Thread via GitHub


tustvold commented on code in PR #7015:
URL: https://github.com/apache/arrow-rs/pull/7015#discussion_r1986041729


##
arrow-json/src/writer/encoder.rs:
##
@@ -25,126 +27,160 @@ use arrow_schema::{ArrowError, DataType, FieldRef};
 use half::f16;
 use lexical_core::FormattedSize;
 use serde::Serializer;
-use std::io::Write;
 
+/// Configuration options for the JSON encoder.
 #[derive(Debug, Clone, Default)]

Review Comment:
   As this is now public, I think we either need to mark it `#[non_exhaustive]` 
or make it into a builder with private fields.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] Add hooks to json encoder to override default encoding or add support for unsupported types [arrow-rs]

2025-03-08 Thread via GitHub


tustvold commented on code in PR #7015:
URL: https://github.com/apache/arrow-rs/pull/7015#discussion_r1986041672


##
arrow-json/src/lib.rs:
##
@@ -70,7 +70,9 @@ pub mod reader;
 pub mod writer;
 
 pub use self::reader::{Reader, ReaderBuilder};
-pub use self::writer::{ArrayWriter, LineDelimitedWriter, Writer, 
WriterBuilder};
+pub use self::writer::{
+ArrayWriter, Encoder, EncoderFactory, LineDelimitedWriter, Writer, 
WriterBuilder,

Review Comment:
   ```suggestion
   ArrayWriter, Encoder, EncoderOptions, EncoderFactory, 
LineDelimitedWriter, Writer, WriterBuilder,
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] Add hooks to json encoder to override default encoding or add support for unsupported types [arrow-rs]

2025-03-08 Thread via GitHub


tustvold commented on code in PR #7015:
URL: https://github.com/apache/arrow-rs/pull/7015#discussion_r1986041236


##
arrow-json/src/writer/encoder.rs:
##
@@ -25,126 +27,160 @@ use arrow_schema::{ArrowError, DataType, FieldRef};
 use half::f16;
 use lexical_core::FormattedSize;
 use serde::Serializer;
-use std::io::Write;
 
+/// Configuration options for the JSON encoder.
 #[derive(Debug, Clone, Default)]
 pub struct EncoderOptions {
+/// Whether to include nulls in the output or elide them.
 pub explicit_nulls: bool,
+/// Whether to encode structs as JSON objects or JSON arrays of their 
values.
 pub struct_mode: StructMode,
+/// An optional hook for customizing encoding behavior.
+pub encoder_factory: Option>,
+}
+
+/// A trait to create custom encoders for specific data types.
+///
+/// This allows overriding the default encoders for specific data types,
+/// or adding new encoders for custom data types.
+pub trait EncoderFactory: std::fmt::Debug + Send + Sync {
+/// Make an encoder that if returned runs before all of the default 
encoders.
+/// This can be used to override how e.g. binary data is encoded so that 
it is an encoded string or an array of integers.
+///
+/// Note that the type of the field may not match the type of the array: 
for dictionary arrays unless the top-level dictionary is handled this
+/// will be called again for the keys and values of the dictionary, at 
which point the field type will still be the outer dictionary type but the
+/// array will have a different type.
+/// For example, `field`` might have the type `Dictionary(i32, Utf8)` but 
`array` will be `Utf8`.
+fn make_default_encoder<'a>(
+&self,
+_field: &'a FieldRef,
+_array: &'a dyn Array,
+_options: &'a EncoderOptions,
+) -> Result>, ArrowError> {
+Ok(None)
+}
 }
 
 /// A trait to format array values as JSON values
 ///
 /// Nullability is handled by the caller to allow encoding nulls implicitly, 
i.e. `{}` instead of `{"a": null}`
 pub trait Encoder {
-/// Encode the non-null value at index `idx` to `out`
+/// Encode the non-null value at index `idx` to `out`.
 ///
-/// The behaviour is unspecified if `idx` corresponds to a null index
+/// The behaviour is unspecified if `idx` corresponds to a null index.
 fn encode(&mut self, idx: usize, out: &mut Vec);
+
+/// Check if the value at `idx` is null.
+fn is_null(&self, _idx: usize) -> bool;
+
+/// Short circuiting null checks if there are none.
+fn has_nulls(&self) -> bool;

Review Comment:
   I suspect this change will have major performance implications, as you're 
now adding a per-value dyn dispatch in order to check nullability



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] Add hooks to json encoder to override default encoding or add support for unsupported types [arrow-rs]

2025-02-12 Thread via GitHub


adriangb commented on PR #7015:
URL: https://github.com/apache/arrow-rs/pull/7015#issuecomment-263947

   I've started trying to replace our vendored version with this + a hook to 
handle the JSON union and have a couple of notes:
   1. In order to handle unions we need `make_encoder` to be public: the union 
will need to delegate to `make_encoder` for it's children.
   2. Handling dictionary types is annoying, I would consider making 
`DictionaryEncoder` public.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] Add hooks to json encoder to override default encoding or add support for unsupported types [arrow-rs]

2025-02-11 Thread via GitHub


tustvold commented on PR #7015:
URL: https://github.com/apache/arrow-rs/pull/7015#issuecomment-2652236729

   I'm a bit swamped at the moment, but I'll try to take a look this weekend.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] Add hooks to json encoder to override default encoding or add support for unsupported types [arrow-rs]

2025-02-10 Thread via GitHub


adriangb commented on PR #7015:
URL: https://github.com/apache/arrow-rs/pull/7015#issuecomment-2649336495

   Hi @tustvold sorry to ping you. Given the use cases beyond our JSON 
serialization, would you mind taking another look at this PR when you have some 
time? Thanks!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] Add hooks to json encoder to override default encoding or add support for unsupported types [arrow-rs]

2025-02-02 Thread via GitHub


kylebarron commented on code in PR #7015:
URL: https://github.com/apache/arrow-rs/pull/7015#discussion_r1938704942


##
arrow-json/src/writer/encoder.rs:
##
@@ -25,126 +27,157 @@ use arrow_schema::{ArrowError, DataType, FieldRef};
 use half::f16;
 use lexical_core::FormattedSize;
 use serde::Serializer;
-use std::io::Write;
 
 #[derive(Debug, Clone, Default)]
 pub struct EncoderOptions {
 pub explicit_nulls: bool,
 pub struct_mode: StructMode,
+pub encoder_factory: Option>,
+}
+
+/// A trait to create custom encoders for specific data types.
+///
+/// This allows overriding the default encoders for specific data types,
+/// or adding new encoders for custom data types.
+pub trait EncoderFactory: std::fmt::Debug {
+/// Make an encoder that if returned runs before all of the default 
encoders.
+/// This can be used to override how e.g. binary data is encoded so that 
it is an encoded string or an array of integers.
+fn make_default_encoder<'a>(
+&self,
+_array: &'a dyn Array,

Review Comment:
   I believe so! As long as we can access the field that should be enough!



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] Add hooks to json encoder to override default encoding or add support for unsupported types [arrow-rs]

2025-01-31 Thread via GitHub


adriangb commented on code in PR #7015:
URL: https://github.com/apache/arrow-rs/pull/7015#discussion_r1937992380


##
arrow-json/src/writer/encoder.rs:
##
@@ -25,126 +27,157 @@ use arrow_schema::{ArrowError, DataType, FieldRef};
 use half::f16;
 use lexical_core::FormattedSize;
 use serde::Serializer;
-use std::io::Write;
 
 #[derive(Debug, Clone, Default)]
 pub struct EncoderOptions {
 pub explicit_nulls: bool,
 pub struct_mode: StructMode,
+pub encoder_factory: Option>,
+}
+
+/// A trait to create custom encoders for specific data types.
+///
+/// This allows overriding the default encoders for specific data types,
+/// or adding new encoders for custom data types.
+pub trait EncoderFactory: std::fmt::Debug {
+/// Make an encoder that if returned runs before all of the default 
encoders.
+/// This can be used to override how e.g. binary data is encoded so that 
it is an encoded string or an array of integers.
+fn make_default_encoder<'a>(
+&self,
+_array: &'a dyn Array,

Review Comment:
   @kylebarron will this work for you?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] Add hooks to json encoder to override default encoding or add support for unsupported types [arrow-rs]

2025-01-29 Thread via GitHub


adriangb commented on code in PR #7015:
URL: https://github.com/apache/arrow-rs/pull/7015#discussion_r1934375511


##
arrow-json/src/writer/encoder.rs:
##
@@ -25,126 +27,157 @@ use arrow_schema::{ArrowError, DataType, FieldRef};
 use half::f16;
 use lexical_core::FormattedSize;
 use serde::Serializer;
-use std::io::Write;
 
 #[derive(Debug, Clone, Default)]
 pub struct EncoderOptions {
 pub explicit_nulls: bool,
 pub struct_mode: StructMode,
+pub encoder_factory: Option>,
+}
+
+/// A trait to create custom encoders for specific data types.
+///
+/// This allows overriding the default encoders for specific data types,
+/// or adding new encoders for custom data types.
+pub trait EncoderFactory: std::fmt::Debug {
+/// Make an encoder that if returned runs before all of the default 
encoders.
+/// This can be used to override how e.g. binary data is encoded so that 
it is an encoded string or an array of integers.
+fn make_default_encoder<'a>(
+&self,
+_array: &'a dyn Array,

Review Comment:
   
https://github.com/apache/arrow-rs/pull/7015/commits/01fb58cce13b23b0f2f51e397ac53049f1741a9d



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] Add hooks to json encoder to override default encoding or add support for unsupported types [arrow-rs]

2025-01-29 Thread via GitHub


adriangb commented on code in PR #7015:
URL: https://github.com/apache/arrow-rs/pull/7015#discussion_r1934367478


##
arrow-json/src/writer/encoder.rs:
##
@@ -25,126 +27,157 @@ use arrow_schema::{ArrowError, DataType, FieldRef};
 use half::f16;
 use lexical_core::FormattedSize;
 use serde::Serializer;
-use std::io::Write;
 
 #[derive(Debug, Clone, Default)]
 pub struct EncoderOptions {
 pub explicit_nulls: bool,
 pub struct_mode: StructMode,
+pub encoder_factory: Option>,
+}
+
+/// A trait to create custom encoders for specific data types.
+///
+/// This allows overriding the default encoders for specific data types,
+/// or adding new encoders for custom data types.
+pub trait EncoderFactory: std::fmt::Debug {
+/// Make an encoder that if returned runs before all of the default 
encoders.
+/// This can be used to override how e.g. binary data is encoded so that 
it is an encoded string or an array of integers.
+fn make_default_encoder<'a>(
+&self,
+_array: &'a dyn Array,

Review Comment:
   Great point yes I think that should be possible 😄 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] Add hooks to json encoder to override default encoding or add support for unsupported types [arrow-rs]

2025-01-29 Thread via GitHub


kylebarron commented on code in PR #7015:
URL: https://github.com/apache/arrow-rs/pull/7015#discussion_r1934355453


##
arrow-json/src/writer/encoder.rs:
##
@@ -25,126 +27,157 @@ use arrow_schema::{ArrowError, DataType, FieldRef};
 use half::f16;
 use lexical_core::FormattedSize;
 use serde::Serializer;
-use std::io::Write;
 
 #[derive(Debug, Clone, Default)]
 pub struct EncoderOptions {
 pub explicit_nulls: bool,
 pub struct_mode: StructMode,
+pub encoder_factory: Option>,
+}
+
+/// A trait to create custom encoders for specific data types.
+///
+/// This allows overriding the default encoders for specific data types,
+/// or adding new encoders for custom data types.
+pub trait EncoderFactory: std::fmt::Debug {
+/// Make an encoder that if returned runs before all of the default 
encoders.
+/// This can be used to override how e.g. binary data is encoded so that 
it is an encoded string or an array of integers.
+fn make_default_encoder<'a>(
+&self,
+_array: &'a dyn Array,

Review Comment:
   I'd be interested in using the changes in this PR to write 
[GeoJSON](https://datatracker.ietf.org/doc/html/rfc7946) from 
[geoarrow-rs](https://github.com/geoarrow/geoarrow-rs).
   
   However this API would not be sufficient for me because it assumes that the 
physical `Array` is enough to know how to encode the data. This is not true for 
geospatial data (at least for Arrow data according to the [GeoArrow 
specification](https://geoarrow.org/)) because the same physical layout can 
describe multiple types. 
   
   E.g. an array of `LineString` and an array of `MultiPoint` would both be 
stored as an Arrow `List[FixedSizeList[2, Float64]]`, but the extension 
metadata on the `Field` would be necessary to know whether to write [`"type": 
"MultiPoint"`](https://datatracker.ietf.org/doc/html/rfc7946#section-3.1.3) or 
[`"type": 
"LineString"`](https://datatracker.ietf.org/doc/html/rfc7946#section-3.1.4) in 
each JSON object.
   
   Given that the existing json `Writer` API writes a `RecordBatch`, it should 
be possible to access the `Field` and pass that through here, instead of just 
using the `Array`?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] Add hooks to json encoder to override default encoding or add support for unsupported types [arrow-rs]

2025-01-26 Thread via GitHub


adriangb commented on PR #7015:
URL: https://github.com/apache/arrow-rs/pull/7015#issuecomment-2614521919

   Makes sense, done!
   
   There is quite a bit of code churn from changing the `Encoder` trait. Would 
you like me to break that out into a precursor PR?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] Add hooks to json encoder to override default encoding or add support for unsupported types [arrow-rs]

2025-01-26 Thread via GitHub


tustvold commented on PR #7015:
URL: https://github.com/apache/arrow-rs/pull/7015#issuecomment-2614503716

   A private newtype wrapper would avoid a non-trivial amount of code 
duplication whilst resolving the issue, and would be my preferred approach


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] Add hooks to json encoder to override default encoding or add support for unsupported types [arrow-rs]

2025-01-26 Thread via GitHub


adriangb commented on PR #7015:
URL: https://github.com/apache/arrow-rs/pull/7015#issuecomment-2614495129

   Okay I think I understand now. Looking at the implementation it seems to me 
that it wouldn't be that much extra work to add formatters for Date, Timestamp, 
Time and Decimal, then we could ditch ArrayFormatter for JSON encoding 
altogether. I can do that as a prequel PR. Wdyt?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] Add hooks to json encoder to override default encoding or add support for unsupported types [arrow-rs]

2025-01-25 Thread via GitHub


tustvold commented on PR #7015:
URL: https://github.com/apache/arrow-rs/pull/7015#issuecomment-2614089236

   `ArrayFormatter` is a type from `arrow_cast` used to format array values to 
strings. It is used by `arrow_json` to format some values, however, being a 
general formatter it does not implement JSON escaping. This isn't a problem for 
the types `arrow_json` uses it for, but as we're now making Encoder public we 
need to ensure people don't think ArrayFormatter can be used as an `Encoder` in 
the general case


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] Add hooks to json encoder to override default encoding or add support for unsupported types [arrow-rs]

2025-01-25 Thread via GitHub


adriangb commented on PR #7015:
URL: https://github.com/apache/arrow-rs/pull/7015#issuecomment-2614080029

   Thank you for the review. I've pushed a (broken) version with most comments 
addressed aside from the part about `ArrayFormater`. To be honest I don't 
understand what `ArrayFormatter` does or why it's special... would you mind 
elaborating a bit on how you propose to handle it?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] Add hooks to json encoder to override default encoding or add support for unsupported types [arrow-rs]

2025-01-25 Thread via GitHub


adriangb commented on code in PR #7015:
URL: https://github.com/apache/arrow-rs/pull/7015#discussion_r1929594652


##
arrow-json/src/writer/mod.rs:
##
@@ -1953,4 +1968,317 @@ mod tests {
 "#,
 );
 }
+
+fn make_fallback_encoder_test_data() -> (RecordBatch, Arc) {
+// Note: this is not intended to be an efficient implementation.
+// Just a simple example to demonstrate how to implement a custom 
encoder.
+#[derive(Debug)]
+enum UnionValue {
+Null,
+Int32(i32),
+String(String),
+}
+
+#[derive(Debug)]
+struct UnionEncoder {

Review Comment:
   That would be nice 😄 
   
   How about I try to write one on our end and then we can upstream it if it 
works properly?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] Add hooks to json encoder to override default encoding or add support for unsupported types [arrow-rs]

2025-01-25 Thread via GitHub


adriangb commented on code in PR #7015:
URL: https://github.com/apache/arrow-rs/pull/7015#discussion_r1929594559


##
arrow-json/src/writer/encoder.rs:
##
@@ -24,11 +26,42 @@ use arrow_schema::{ArrowError, DataType, FieldRef};
 use half::f16;
 use lexical_core::FormattedSize;
 use serde::Serializer;
-use std::io::Write;
 
 #[derive(Debug, Clone, Default)]
 pub struct EncoderOptions {
 pub explicit_nulls: bool,
+pub encoder_factory: Option>,
+}
+
+type EncoderFactoryResult<'a> =
+Result, Option)>, ArrowError>;
+
+/// A trait to create custom encoders for specific data types.
+///
+/// This allows overriding the default encoders for specific data types,
+/// or adding new encoders for custom data types.
+pub trait EncoderFactory: std::fmt::Debug {
+/// Make an encoder that if returned runs before all of the default 
encoders.
+/// This can be used to override how e.g. binary data is encoded so that 
it is an encoded string or an array of integers.
+fn make_default_encoder<'a>(
+&self,
+_array: &'a dyn Array,
+_data_type: &DataType,
+_options: &EncoderOptions,
+) -> EncoderFactoryResult<'a> {
+Ok(None)
+}
+
+/// Make an encoder that if returned runs after all of the default 
encoders.
+/// If this method returns None then an error will be returned.
+fn make_fallback_encoder<'a>(

Review Comment:
   I'll drop it in favor of a single method, I think it should work for our use 
case.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] Add hooks to json encoder to override default encoding or add support for unsupported types [arrow-rs]

2025-01-25 Thread via GitHub


adriangb commented on code in PR #7015:
URL: https://github.com/apache/arrow-rs/pull/7015#discussion_r1929592262


##
arrow-json/src/writer/encoder.rs:
##
@@ -24,11 +26,42 @@ use arrow_schema::{ArrowError, DataType, FieldRef};
 use half::f16;
 use lexical_core::FormattedSize;
 use serde::Serializer;
-use std::io::Write;
 
 #[derive(Debug, Clone, Default)]
 pub struct EncoderOptions {
 pub explicit_nulls: bool,
+pub encoder_factory: Option>,
+}
+
+type EncoderFactoryResult<'a> =
+Result, Option)>, ArrowError>;
+
+/// A trait to create custom encoders for specific data types.
+///
+/// This allows overriding the default encoders for specific data types,
+/// or adding new encoders for custom data types.
+pub trait EncoderFactory: std::fmt::Debug {
+/// Make an encoder that if returned runs before all of the default 
encoders.
+/// This can be used to override how e.g. binary data is encoded so that 
it is an encoded string or an array of integers.
+fn make_default_encoder<'a>(
+&self,
+_array: &'a dyn Array,
+_data_type: &DataType,

Review Comment:
   We don't need to, it just simplifies the implementation of the 
`EncoderFactory` a bit because almost surely all of them are going to call 
`array.data_type()`. I can remove if you'd prefer.



##
arrow-json/src/writer/mod.rs:
##
@@ -426,10 +438,13 @@ mod tests {
 
 let actual: Vec> = input
 .split(|b| *b == b'\n')
-.map(|s| (!s.is_empty()).then(|| 
serde_json::from_slice(s).unwrap()))
+.map(|s| {
+println!("{:?}", str::from_utf8(s));

Review Comment:
   upsies!



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] Add hooks to json encoder to override default encoding or add support for unsupported types [arrow-rs]

2025-01-25 Thread via GitHub


tustvold commented on code in PR #7015:
URL: https://github.com/apache/arrow-rs/pull/7015#discussion_r1929562915


##
arrow-json/src/writer/encoder.rs:
##
@@ -24,11 +26,42 @@ use arrow_schema::{ArrowError, DataType, FieldRef};
 use half::f16;
 use lexical_core::FormattedSize;
 use serde::Serializer;
-use std::io::Write;
 
 #[derive(Debug, Clone, Default)]
 pub struct EncoderOptions {
 pub explicit_nulls: bool,
+pub encoder_factory: Option>,
+}
+
+type EncoderFactoryResult<'a> =
+Result, Option)>, ArrowError>;
+
+/// A trait to create custom encoders for specific data types.
+///
+/// This allows overriding the default encoders for specific data types,
+/// or adding new encoders for custom data types.
+pub trait EncoderFactory: std::fmt::Debug {
+/// Make an encoder that if returned runs before all of the default 
encoders.
+/// This can be used to override how e.g. binary data is encoded so that 
it is an encoded string or an array of integers.
+fn make_default_encoder<'a>(
+&self,
+_array: &'a dyn Array,
+_data_type: &DataType,

Review Comment:
   Why do we pass both DataType and Array?



##
arrow-json/src/writer/encoder.rs:
##
@@ -24,11 +26,42 @@ use arrow_schema::{ArrowError, DataType, FieldRef};
 use half::f16;
 use lexical_core::FormattedSize;
 use serde::Serializer;
-use std::io::Write;
 
 #[derive(Debug, Clone, Default)]
 pub struct EncoderOptions {
 pub explicit_nulls: bool,
+pub encoder_factory: Option>,
+}
+
+type EncoderFactoryResult<'a> =
+Result, Option)>, ArrowError>;
+
+/// A trait to create custom encoders for specific data types.
+///
+/// This allows overriding the default encoders for specific data types,
+/// or adding new encoders for custom data types.
+pub trait EncoderFactory: std::fmt::Debug {
+/// Make an encoder that if returned runs before all of the default 
encoders.
+/// This can be used to override how e.g. binary data is encoded so that 
it is an encoded string or an array of integers.
+fn make_default_encoder<'a>(
+&self,
+_array: &'a dyn Array,
+_data_type: &DataType,
+_options: &EncoderOptions,
+) -> EncoderFactoryResult<'a> {
+Ok(None)
+}
+
+/// Make an encoder that if returned runs after all of the default 
encoders.
+/// If this method returns None then an error will be returned.
+fn make_fallback_encoder<'a>(

Review Comment:
   Why do we need this? AFAICT we could remove this with no real loss of 
functionality, if anything it would be more resilient as adding a new default 
encoder wouldn't break existing overrides.



##
arrow-json/src/writer/mod.rs:
##
@@ -1953,4 +1968,317 @@ mod tests {
 "#,
 );
 }
+
+fn make_fallback_encoder_test_data() -> (RecordBatch, Arc) {
+// Note: this is not intended to be an efficient implementation.
+// Just a simple example to demonstrate how to implement a custom 
encoder.
+#[derive(Debug)]
+enum UnionValue {
+Null,
+Int32(i32),
+String(String),
+}
+
+#[derive(Debug)]
+struct UnionEncoder {

Review Comment:
   FWIW adding a native UnionEncoder is likely not overly complicated 



##
arrow-json/src/writer/mod.rs:
##
@@ -426,10 +438,13 @@ mod tests {
 
 let actual: Vec> = input
 .split(|b| *b == b'\n')
-.map(|s| (!s.is_empty()).then(|| 
serde_json::from_slice(s).unwrap()))
+.map(|s| {
+println!("{:?}", str::from_utf8(s));

Review Comment:
   ?



##
arrow-json/src/writer/mod.rs:
##
@@ -1953,4 +1968,317 @@ mod tests {
 "#,
 );
 }
+
+fn make_fallback_encoder_test_data() -> (RecordBatch, Arc) {
+// Note: this is not intended to be an efficient implementation.
+// Just a simple example to demonstrate how to implement a custom 
encoder.
+#[derive(Debug)]
+enum UnionValue {
+Null,
+Int32(i32),
+String(String),
+}
+
+#[derive(Debug)]
+struct UnionEncoder {
+array: Vec,
+}
+
+impl Encoder for UnionEncoder {
+fn encode(&mut self, idx: usize, out: &mut Vec) {
+match &self.array[idx] {
+UnionValue::Null => out.extend_from_slice(b"null"),
+UnionValue::Int32(v) => 
out.extend_from_slice(v.to_string().as_bytes()),
+UnionValue::String(v) => 
out.extend_from_slice(format!("\"{}\"", v).as_bytes()),
+}
+}
+}
+
+#[derive(Debug)]
+struct UnionEncoderFactory;
+
+impl EncoderFactory for UnionEncoderFactory {
+fn make_fallback_encoder(
+&self,
+array: &dyn Array,
+data_type: &DataType,
+_options: &EncoderOptions,
+) 

Re: [PR] Add hooks to json encoder to override default encoding or add support for unsupported types [arrow-rs]

2025-01-24 Thread via GitHub


adriangb commented on PR #7015:
URL: https://github.com/apache/arrow-rs/pull/7015#issuecomment-2612771306

   > TODO: add a test showing how to change the encoding of binary arrays.
   
   Done


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] Add hooks to json encoder to override default encoding or add support for unsupported types [arrow-rs]

2025-01-24 Thread via GitHub


adriangb commented on code in PR #7015:
URL: https://github.com/apache/arrow-rs/pull/7015#discussion_r1928848832


##
arrow-json/src/writer/encoder.rs:
##
@@ -24,11 +26,42 @@ use arrow_schema::{ArrowError, DataType, FieldRef};
 use half::f16;
 use lexical_core::FormattedSize;
 use serde::Serializer;
-use std::io::Write;
 
 #[derive(Debug, Clone, Default)]
 pub struct EncoderOptions {
 pub explicit_nulls: bool,
+pub encoder_factory: Option>,
+}
+
+type EncoderFactoryResult<'a> =
+Result, Option)>, ArrowError>;

Review Comment:
   Should this be `Option<&'a NullBuffer>`?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] Add hooks to json encoder to override default encoding or add support for unsupported types [arrow-rs]

2025-01-24 Thread via GitHub


adriangb commented on PR #7015:
URL: https://github.com/apache/arrow-rs/pull/7015#issuecomment-2612435127

   cc @tustvold


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] Add hooks to json encoder to override default encoding or add support for unsupported types [arrow-rs]

2025-01-24 Thread via GitHub


adriangb commented on PR #7015:
URL: https://github.com/apache/arrow-rs/pull/7015#issuecomment-2612434533

   TODO: add a test showing how to change the encoding of binary arrays.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]