alamb commented on code in PR #4736:
URL: https://github.com/apache/arrow-rs/pull/4736#discussion_r1305888339
##########
arrow-row/src/lib.rs:
##########
@@ -1445,6 +1446,70 @@ unsafe fn decode_column(
Ok(array)
}
+macro_rules! downcast_dict {
+ ($array:ident, $key:ident) => {{
+ $array
+ .as_any()
+ .downcast_ref::<DictionaryArray<$key>>()
+ .unwrap()
+ }};
+}
+
+const LOW_CARDINALITY_THRESHOLD: usize = 10;
+
+#[derive(Debug)]
+pub struct CardinalityAwareRowConverter {
+ inner: RowConverter,
+ done: bool,
+}
+
+impl CardinalityAwareRowConverter {
+ pub fn new(fields: Vec<SortField>) -> Result<Self, ArrowError> {
+ Ok(Self {
+ inner: RowConverter::new(fields)?,
+ done: false,
+ })
+ }
+
+ pub fn size(&self) -> usize {
+ self.inner.size()
+ }
+
+    pub fn convert_rows(&self, rows: &Rows) -> Result<Vec<ArrayRef>, ArrowError> {
+ self.inner.convert_rows(rows)
+ }
+
+ pub fn convert_columns(
+ &mut self,
+ columns: &[ArrayRef]) -> Result<Rows, ArrowError> {
+ if !self.done {
+ for (i, col) in columns.iter().enumerate() {
+ if let DataType::Dictionary(k, _) = col.data_type() {
+                    // let cardinality = col.as_any().downcast_ref::<DictionaryArray<Int32Type>>().unwrap().values().len();
+ let cardinality = match k.as_ref() {
Review Comment:
   I originally thought that we should base the decision of "is this a high
   cardinality column" on the "in use" cardinality of the dictionary (aka how
   many distinct key values there actually are), as suggested on
   https://github.com/apache/arrow-datafusion/issues/7200#issuecomment-1690147397.
   However, I now realize that the number of potential key values (aka the
   length of the values array) may actually be a more robust predictor of being
   "high cardinality", since the other values in the dictionary could be used in
   subsequent batches.
   Do you have any opinion @tustvold?
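   For reference, here is a minimal sketch (not part of this PR; the function
   name is made up) contrasting the two measures for an Int32-keyed dictionary:

       use std::collections::HashSet;

       use arrow_array::{types::Int32Type, Array, DictionaryArray};

       // Sketch only: the two candidate cardinality measures discussed above.
       fn cardinality_measures(dict: &DictionaryArray<Int32Type>) -> (usize, usize) {
           // "In use" cardinality: how many distinct key values actually appear
           // in this batch (nulls ignored). This can be much smaller than the
           // values length if the dictionary carries entries this batch never
           // references.
           let in_use = dict.keys().iter().flatten().collect::<HashSet<_>>().len();

           // "Potential" cardinality: the length of the values array, i.e.
           // every value a key in this or a later batch could refer to.
           let potential = dict.values().len();

           (in_use, potential)
       }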
##########
arrow-row/src/lib.rs:
##########
@@ -1445,6 +1446,70 @@ unsafe fn decode_column(
Ok(array)
}
+macro_rules! downcast_dict {
+ ($array:ident, $key:ident) => {{
+ $array
+ .as_any()
+ .downcast_ref::<DictionaryArray<$key>>()
+ .unwrap()
+ }};
+}
+
+const LOW_CARDINALITY_THRESHOLD: usize = 10;
+
+#[derive(Debug)]
+pub struct CardinalityAwareRowConverter {
+ inner: RowConverter,
+ done: bool,
+}
+
+impl CardinalityAwareRowConverter {
+ pub fn new(fields: Vec<SortField>) -> Result<Self, ArrowError> {
+ Ok(Self {
+ inner: RowConverter::new(fields)?,
+ done: false,
+ })
+ }
+
+ pub fn size(&self) -> usize {
+ self.inner.size()
+ }
+
+    pub fn convert_rows(&self, rows: &Rows) -> Result<Vec<ArrayRef>, ArrowError> {
+ self.inner.convert_rows(rows)
+ }
+
+ pub fn convert_columns(
+ &mut self,
+ columns: &[ArrayRef]) -> Result<Rows, ArrowError> {
+ if !self.done {
+ for (i, col) in columns.iter().enumerate() {
+ if let DataType::Dictionary(k, _) = col.data_type() {
+                    // let cardinality = col.as_any().downcast_ref::<DictionaryArray<Int32Type>>().unwrap().values().len();
+ let cardinality = match k.as_ref() {
+                        DataType::Int8 => downcast_dict!(col, Int32Type).values().len(),
+                        DataType::Int16 => downcast_dict!(col, Int32Type).values().len(),
+                        DataType::Int32 => downcast_dict!(col, Int32Type).values().len(),
+                        DataType::Int64 => downcast_dict!(col, Int64Type).values().len(),
+                        DataType::UInt16 => downcast_dict!(col, UInt16Type).values().len(),
+                        DataType::UInt32 => downcast_dict!(col, UInt32Type).values().len(),
+                        DataType::UInt64 => downcast_dict!(col, UInt64Type).values().len(),
+ _ => unreachable!(),
+ };
+
+ if cardinality >= LOW_CARDINALITY_THRESHOLD {
Review Comment:
   This will effectively switch the encoding mid-stream, I think -- which will
   mean that the output can't be compared with previously created rows, which is
   not correct.
   I think the decision has to be made based on the first batch, and then that
   same decision used for encoding all subsequent rows.
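   For illustration, a rough sketch of the "decide once on the first batch"
   idea (the names here are made up, not this PR's API): the per-column choice
   is computed from the first batch and then frozen, so rows produced from
   later batches stay comparable with earlier ones.

       use arrow_array::{types::Int32Type, Array, ArrayRef, DictionaryArray};
       use arrow_schema::DataType;

       // Mirrors LOW_CARDINALITY_THRESHOLD above.
       const THRESHOLD: usize = 10;

       struct StickyDecision {
           // `None` until the first batch has been seen.
           per_column: Option<Vec<bool>>,
       }

       impl StickyDecision {
           fn new() -> Self {
               Self { per_column: None }
           }

           // Returns `true` for columns that should use the high-cardinality
           // encoding. The answer is computed once and never changes afterwards.
           fn for_batch(&mut self, columns: &[ArrayRef]) -> &[bool] {
               self.per_column
                   .get_or_insert_with(|| {
                       columns
                           .iter()
                           .map(|col| match col.data_type() {
                               // For brevity this sketch only inspects Int32 keys.
                               DataType::Dictionary(k, _) if **k == DataType::Int32 => {
                                   let dict = col
                                       .as_any()
                                       .downcast_ref::<DictionaryArray<Int32Type>>()
                                       .unwrap();
                                   dict.values().len() >= THRESHOLD
                               }
                               _ => false,
                           })
                           .collect()
                   })
                   .as_slice()
           }
       }

   A converter holding such a decision would compute it during the first call
   to convert_columns and consult the cached vector on every later call,
   instead of re-checking each batch.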
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]