Re: [PR] GH-48204: [C++][Parquet] Fix Column Reader & Writer logic to enable Parquet DB support on s390x [arrow]

2026-01-07 Thread via GitHub


Vishwanatha-HD commented on PR #48205:
URL: https://github.com/apache/arrow/pull/48205#issuecomment-3722240287

   @pitrou @kou 
   Greetings,
   I was on vacation for most of December, hence the lack of reply from me.
Could you please help with merging the pull requests that I raised, if you
have no further comments or issues?
   Thanks.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] GH-48204: [C++][Parquet] Fix Column Reader & Writer logic to enable Parquet DB support on s390x [arrow]

2026-01-07 Thread via GitHub


Vishwanatha-HD commented on PR #48205:
URL: https://github.com/apache/arrow/pull/48205#issuecomment-3722230918

   > @Vishwanatha-HD Could you please fix the number `GH-48204` in the title? 
It should be `GH-48205`.
   
   Hi @kiszk,
   Sorry for the delayed reply; I was on vacation.
   Please note that GH-48204 is the issue number associated with this pull
request: https://github.com/apache/arrow/issues/48204





Re: [PR] GH-48204: [C++][Parquet] Fix Column Reader & Writer logic to enable Parquet DB support on s390x [arrow]

2025-12-30 Thread via GitHub


kiszk commented on PR #48205:
URL: https://github.com/apache/arrow/pull/48205#issuecomment-3698769913

   @Vishwanatha-HD Could you please fix the number `GH-48204` in the title? It 
should be `GH-48205`.





Re: [PR] GH-48204: [C++][Parquet] Fix Column Reader & Writer logic to enable Parquet DB support on s390x [arrow]

2025-12-09 Thread via GitHub


Vishwanatha-HD commented on code in PR #48205:
URL: https://github.com/apache/arrow/pull/48205#discussion_r2601918022


##
cpp/src/parquet/column_writer.h:
##
@@ -267,9 +267,14 @@ inline void ArrowTimestampToImpalaTimestamp(const int64_t time, Int96* impala_ti
   }
   impala_timestamp->value[2] = static_cast<uint32_t>(julian_days);
   uint64_t last_day_nanos = static_cast<uint64_t>(last_day_units) * NanosecondsPerUnit;
+#if ARROW_LITTLE_ENDIAN
   // impala_timestamp will be unaligned every other entry so do memcpy instead
   // of assign and reinterpret cast to avoid undefined behavior.
-  std::memcpy(impala_timestamp, &last_day_nanos, sizeof(uint64_t));
+  std::memcpy(impala_timestamp, &last_day_nanos, sizeof(int64_t));
+#else
+  (*impala_timestamp).value[0] = static_cast<uint32_t>(last_day_nanos);
+  (*impala_timestamp).value[1] = static_cast<uint32_t>(last_day_nanos >> 32);

Review Comment:
   @kou I don't think that is necessary; otherwise the test case would have
failed. It's just the way the uint32_t words of "last_day_nanos" are
arranged into the "(*impala_timestamp).value" array.






Re: [PR] GH-48204: [C++][Parquet] Fix Column Reader & Writer logic to enable Parquet DB support on s390x [arrow]

2025-12-09 Thread via GitHub


pitrou commented on code in PR #48205:
URL: https://github.com/apache/arrow/pull/48205#discussion_r2601509785


##
cpp/src/parquet/column_writer.h:
##
@@ -267,9 +267,14 @@ inline void ArrowTimestampToImpalaTimestamp(const int64_t time, Int96* impala_ti
   }
   impala_timestamp->value[2] = static_cast<uint32_t>(julian_days);
   uint64_t last_day_nanos = static_cast<uint64_t>(last_day_units) * NanosecondsPerUnit;
+#if ARROW_LITTLE_ENDIAN
   // impala_timestamp will be unaligned every other entry so do memcpy instead
   // of assign and reinterpret cast to avoid undefined behavior.
-  std::memcpy(impala_timestamp, &last_day_nanos, sizeof(uint64_t));
+  std::memcpy(impala_timestamp, &last_day_nanos, sizeof(int64_t));
+#else
+  (*impala_timestamp).value[0] = static_cast<uint32_t>(last_day_nanos);
+  (*impala_timestamp).value[1] = static_cast<uint32_t>(last_day_nanos >> 32);

Review Comment:
   Sure, but the individual 32-bit values still need to be byte-swapped, right?
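   [Editor's note] For context on this point, a standalone sketch (not Arrow
code; the helper name is made up for illustration) of why a native 32-bit
store on a big-endian host yields different bytes than the little-endian
word layout Parquet uses on disk:

   ```cpp
#include <cassert>
#include <cstdint>

// StoreLittleEndian32 writes a 32-bit value in little-endian byte order
// regardless of the host. A plain memcpy of the native value would only
// match this layout on a little-endian host; on big-endian (s390x) the
// bytes would come out reversed, hence the need for a swap.
inline void StoreLittleEndian32(uint8_t* out, uint32_t v) {
  out[0] = static_cast<uint8_t>(v);
  out[1] = static_cast<uint8_t>(v >> 8);
  out[2] = static_cast<uint8_t>(v >> 16);
  out[3] = static_cast<uint8_t>(v >> 24);
}

int main() {
  uint8_t bytes[4];
  StoreLittleEndian32(bytes, 0x01234567u);
  // Least-significant byte first, independent of host endianness.
  assert(bytes[0] == 0x67 && bytes[1] == 0x45);
  assert(bytes[2] == 0x23 && bytes[3] == 0x01);
  return 0;
}
   ```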






Re: [PR] GH-48204: [C++][Parquet] Fix Column Reader & Writer logic to enable Parquet DB support on s390x [arrow]

2025-12-08 Thread via GitHub


pitrou commented on code in PR #48205:
URL: https://github.com/apache/arrow/pull/48205#discussion_r2601507548


##
cpp/src/parquet/column_writer.cc:
##
@@ -2578,13 +2579,31 @@ struct SerializeFunctor<
   if constexpr (std::is_same_v) {
 *p++ = ::arrow::bit_util::ToBigEndian(u64_in[0]);
   } else if constexpr (std::is_same_v) 
{
+#if ARROW_LITTLE_ENDIAN

Review Comment:
   I will write again what I said above:
   > If we're on a big-endian system, this entire code is unnecessary and we 
can just use the FIXED_LEN_BYTE_ARRAY SerializeFunctor.
   
   Of course, I might be missing something and some dedicated code might still 
be necessary. But I don't think that is.






Re: [PR] GH-48204: [C++][Parquet] Fix Column Reader & Writer logic to enable Parquet DB support on s390x [arrow]

2025-12-08 Thread via GitHub


Vishwanatha-HD commented on code in PR #48205:
URL: https://github.com/apache/arrow/pull/48205#discussion_r2599921254


##
cpp/src/parquet/column_writer.cc:
##
@@ -2578,13 +2579,31 @@ struct SerializeFunctor<
   if constexpr (std::is_same_v) {
 *p++ = ::arrow::bit_util::ToBigEndian(u64_in[0]);
   } else if constexpr (std::is_same_v) 
{
+#if ARROW_LITTLE_ENDIAN

Review Comment:
   @pitrou I can come up with optimized logic as below. Please let me know
if you are fine with it.
   
   ```
   FixedLenByteArray FixDecimalEndianness(const uint8_t* in, int64_t offset) {
 auto out = reinterpret_cast<uint8_t*>(scratch) + offset;

 constexpr int32_t kSize =
 std::is_same_v<ArrowType, ::arrow::Decimal32Type>   ? 4  :
 std::is_same_v<ArrowType, ::arrow::Decimal64Type>   ? 8  :
 std::is_same_v<ArrowType, ::arrow::Decimal128Type>  ? 16 :
 std::is_same_v<ArrowType, ::arrow::Decimal256Type>  ? 32 :
 0;

 static_assert(kSize != 0, "Unsupported decimal type");

 uint8_t* out_bytes = reinterpret_cast<uint8_t*>(scratch);

 // Arrow guarantees: ByteSwapBuffer swaps only on little-endian.
 // On big-endian it becomes memcpy.
 ::arrow::bit_util::ByteSwapBuffer(out_bytes, in, kSize);

 scratch += kSize;
 return FixedLenByteArray(out);
   }
   ```






Re: [PR] GH-48204: [C++][Parquet] Fix Column Reader & Writer logic to enable Parquet DB support on s390x [arrow]

2025-12-08 Thread via GitHub


Vishwanatha-HD commented on code in PR #48205:
URL: https://github.com/apache/arrow/pull/48205#discussion_r2599786362


##
cpp/src/parquet/column_writer.h:
##
@@ -267,9 +267,14 @@ inline void ArrowTimestampToImpalaTimestamp(const int64_t time, Int96* impala_ti
   }
   impala_timestamp->value[2] = static_cast<uint32_t>(julian_days);
   uint64_t last_day_nanos = static_cast<uint64_t>(last_day_units) * NanosecondsPerUnit;
+#if ARROW_LITTLE_ENDIAN
   // impala_timestamp will be unaligned every other entry so do memcpy instead
   // of assign and reinterpret cast to avoid undefined behavior.
-  std::memcpy(impala_timestamp, &last_day_nanos, sizeof(uint64_t));
+  std::memcpy(impala_timestamp, &last_day_nanos, sizeof(int64_t));
+#else
+  (*impala_timestamp).value[0] = static_cast<uint32_t>(last_day_nanos);
+  (*impala_timestamp).value[1] = static_cast<uint32_t>(last_day_nanos >> 32);

Review Comment:
   @pitrou If I am understanding correctly, this logic is trying to put the
lower 4 bytes (32 bits) of "last_day_nanos" into value[0] and the upper 4
bytes (32 bits) into value[1]. A memcpy of 8 bytes (64 bits) would keep the
value as-is, so we won't be able to use memcpy on BE machines. Hope I am
making things clear here.
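   [Editor's note] The split described above can be sketched as a standalone
example (the Int96 struct here is a simplified stand-in for parquet::Int96,
and the values are made up):

   ```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Simplified stand-in for parquet::Int96: three 32-bit words, where
// value[0]/value[1] hold the low/high halves of the nanoseconds count
// and value[2] holds the Julian day.
struct Int96 {
  uint32_t value[3];
};

int main() {
  const uint64_t last_day_nanos = 0x0123456789ABCDEFULL;

  // Endian-independent split: low 32 bits into value[0], high 32 bits
  // into value[1], as described in the comment above.
  Int96 ts{};
  ts.value[0] = static_cast<uint32_t>(last_day_nanos);
  ts.value[1] = static_cast<uint32_t>(last_day_nanos >> 32);

  assert(ts.value[0] == 0x89ABCDEFu);
  assert(ts.value[1] == 0x01234567u);

  // On a little-endian host an 8-byte memcpy happens to produce the same
  // two words; on a big-endian host (s390x) it would put the high word
  // first, which is why the #else branch of the patch splits explicitly.
  Int96 ts2{};
  std::memcpy(&ts2, &last_day_nanos, sizeof(uint64_t));
  (void)ts2;  // ts2 matches ts only on little-endian hosts.
  return 0;
}
   ```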






Re: [PR] GH-48204: [C++][Parquet] Fix Column Reader & Writer logic to enable Parquet DB support on s390x [arrow]

2025-12-03 Thread via GitHub


pitrou commented on code in PR #48205:
URL: https://github.com/apache/arrow/pull/48205#discussion_r2584474050


##
cpp/src/parquet/column_writer.h:
##
@@ -267,9 +267,14 @@ inline void ArrowTimestampToImpalaTimestamp(const int64_t time, Int96* impala_ti
   }
   impala_timestamp->value[2] = static_cast<uint32_t>(julian_days);
   uint64_t last_day_nanos = static_cast<uint64_t>(last_day_units) * NanosecondsPerUnit;
+#if ARROW_LITTLE_ENDIAN
   // impala_timestamp will be unaligned every other entry so do memcpy instead
   // of assign and reinterpret cast to avoid undefined behavior.
-  std::memcpy(impala_timestamp, &last_day_nanos, sizeof(uint64_t));
+  std::memcpy(impala_timestamp, &last_day_nanos, sizeof(int64_t));
+#else
+  (*impala_timestamp).value[0] = static_cast<uint32_t>(last_day_nanos);
+  (*impala_timestamp).value[1] = static_cast<uint32_t>(last_day_nanos >> 32);

Review Comment:
   Why is this not changing the endianness of values?






Re: [PR] GH-48204: [C++][Parquet] Fix Column Reader & Writer logic to enable Parquet DB support on s390x [arrow]

2025-12-03 Thread via GitHub


pitrou commented on code in PR #48205:
URL: https://github.com/apache/arrow/pull/48205#discussion_r2584469301


##
cpp/src/parquet/column_writer.cc:
##
@@ -2603,7 +2622,24 @@ struct SerializeFunctor<
 template <>
 struct SerializeFunctor<::parquet::FLBAType, ::arrow::HalfFloatType> {
   Status Serialize(const ::arrow::HalfFloatArray& array, ArrowWriteContext*, 
FLBA* out) {
+#if ARROW_LITTLE_ENDIAN
+return SerializeLittleEndianValues(array, array.raw_values(), out);
+#else
 const uint16_t* values = array.raw_values();
+const int64_t length = array.length();

Review Comment:
   We already have endian-fixing logic for FLBA in the decimal serializer 
above. It should be factored out and reused instead of reinventing something 
different here.






Re: [PR] GH-48204: [C++][Parquet] Fix Column Reader & Writer logic to enable Parquet DB support on s390x [arrow]

2025-12-03 Thread via GitHub


pitrou commented on code in PR #48205:
URL: https://github.com/apache/arrow/pull/48205#discussion_r2584461068


##
cpp/src/parquet/column_writer.cc:
##
@@ -2578,13 +2579,31 @@ struct SerializeFunctor<
   if constexpr (std::is_same_v) {
 *p++ = ::arrow::bit_util::ToBigEndian(u64_in[0]);
   } else if constexpr (std::is_same_v) 
{
+#if ARROW_LITTLE_ENDIAN

Review Comment:
   @Vishwanatha-HD Sure, but you haven't addressed the initial comment.






Re: [PR] GH-48204: [C++][Parquet] Fix Column Reader & Writer logic to enable Parquet DB support on s390x [arrow]

2025-12-02 Thread via GitHub


kiszk commented on code in PR #48205:
URL: https://github.com/apache/arrow/pull/48205#discussion_r2583522725


##
cpp/src/parquet/column_writer.cc:
##
@@ -2603,7 +2622,24 @@ struct SerializeFunctor<
 template <>
 struct SerializeFunctor<::parquet::FLBAType, ::arrow::HalfFloatType> {
   Status Serialize(const ::arrow::HalfFloatArray& array, ArrowWriteContext*, 
FLBA* out) {
+#if ARROW_LITTLE_ENDIAN
+return SerializeLittleEndianValues(array, array.raw_values(), out);
+#else
 const uint16_t* values = array.raw_values();
+const int64_t length = array.length();

Review Comment:
   Can we create a big-endian version of `SerializeLittleEndianValues()`
that calls `::arrow::bit_util::ToLittleEndian()` whenever `values[i]` is
accessed? Then we could use it instead of storing the converted values into
a vector, which would reduce memory accesses.
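   [Editor's note] A minimal sketch of this swap-on-access idea (the helper
names are hypothetical; ByteSwap16 stands in for what
`::arrow::bit_util::ToLittleEndian()` would do on a big-endian host):

   ```cpp
#include <cassert>
#include <cstdint>

// Swap the two bytes of a 16-bit value.
inline uint16_t ByteSwap16(uint16_t v) {
  return static_cast<uint16_t>((v << 8) | (v >> 8));
}

// Hypothetical per-element accessor: instead of materializing a
// byte-swapped copy of the whole array into a vector, swap each value as
// it is read. On a little-endian host this is a plain load; on a
// big-endian host the value is swapped on access.
template <bool kHostIsLittleEndian>
inline uint16_t LoadAsLittleEndian(const uint16_t* values, int64_t i) {
  if constexpr (kHostIsLittleEndian) {
    return values[i];
  } else {
    return ByteSwap16(values[i]);
  }
}

int main() {
  const uint16_t values[] = {0x1234, 0xABCD};
  assert(LoadAsLittleEndian<true>(values, 0) == 0x1234);   // plain load
  assert(LoadAsLittleEndian<false>(values, 1) == 0xCDAB);  // swapped load
  return 0;
}
   ```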






Re: [PR] GH-48204: [C++][Parquet] Fix Column Reader & Writer logic to enable Parquet DB support on s390x [arrow]

2025-12-02 Thread via GitHub


Vishwanatha-HD commented on code in PR #48205:
URL: https://github.com/apache/arrow/pull/48205#discussion_r2580322478


##
cpp/src/parquet/column_writer.h:
##
@@ -260,13 +261,21 @@ constexpr int64_t kJulianEpochOffsetDays = INT64_C(2440588);
 template <int64_t UnitPerDay, int64_t NanosecondsPerUnit>
 inline void ArrowTimestampToImpalaTimestamp(const int64_t time, Int96* impala_timestamp) {
   int64_t julian_days = (time / UnitPerDay) + kJulianEpochOffsetDays;
+#if ARROW_LITTLE_ENDIAN
   (*impala_timestamp).value[2] = (uint32_t)julian_days;
+#endif
 
   int64_t last_day_units = time % UnitPerDay;
   auto last_day_nanos = last_day_units * NanosecondsPerUnit;
+#if ARROW_LITTLE_ENDIAN
   // impala_timestamp will be unaligned every other entry so do memcpy instead
   // of assign and reinterpret cast to avoid undefined behavior.
   std::memcpy(impala_timestamp, &last_day_nanos, sizeof(int64_t));
+#else
+  (*impala_timestamp).value[0] = static_cast<uint32_t>(last_day_nanos);
+  (*impala_timestamp).value[1] = static_cast<uint32_t>(last_day_nanos >> 32);

Review Comment:
   Hi @kou. I tried to run the test case with the changes that you are
referring to above. Unfortunately, it's not working; I am seeing the
following test case failures on a big-endian (s390x) machine:
   
   [  FAILED  ] 5 tests, listed below:
   [  FAILED  ] TestArrowReadWrite.UseDeprecatedInt96
   [  FAILED  ] TestArrowReadWrite.DownsampleDeprecatedInt96
   [  FAILED  ] TestArrowReadWrite.ParquetVersionTimestampDifferences
   [  FAILED  ] ArrowReadWrite.NestedRequiredOuterOptional
   [  FAILED  ] TestImpalaConversion.ArrowTimestampToImpalaTimestamp
   
5 FAILED TESTS
   
   These test cases are part of the "parquet-arrow-reader-writer-test" suite.






Re: [PR] GH-48204: [C++][Parquet] Fix Column Reader & Writer logic to enable Parquet DB support on s390x [arrow]

2025-11-28 Thread via GitHub


Vishwanatha-HD commented on code in PR #48205:
URL: https://github.com/apache/arrow/pull/48205#discussion_r2571991742


##
cpp/src/parquet/column_writer.cc:
##
@@ -2578,13 +2579,31 @@ struct SerializeFunctor<
   if constexpr (std::is_same_v) {
 *p++ = ::arrow::bit_util::ToBigEndian(u64_in[0]);
   } else if constexpr (std::is_same_v) 
{
+#if ARROW_LITTLE_ENDIAN

Review Comment:
   @pitrou I have rebased this onto git main and removed the piece of code
below:
   
   ```
   -#if ARROW_LITTLE_ENDIAN
   -  if (num_bytes < 0 || num_bytes > data_size - 4) {
   -#else
      if (num_bytes < 0 || num_bytes > data_size) {   <-- only retaining this line now
   -#endif
   ```
   






Re: [PR] GH-48204: [C++][Parquet] Fix Column Reader & Writer logic to enable Parquet DB support on s390x [arrow]

2025-11-28 Thread via GitHub


Vishwanatha-HD commented on code in PR #48205:
URL: https://github.com/apache/arrow/pull/48205#discussion_r2571966345


##
cpp/src/parquet/column_writer.cc:
##
@@ -2578,13 +2579,31 @@ struct SerializeFunctor<
   if constexpr (std::is_same_v) {
 *p++ = ::arrow::bit_util::ToBigEndian(u64_in[0]);
   } else if constexpr (std::is_same_v) 
{
+#if ARROW_LITTLE_ENDIAN

Review Comment:
   @pitrou OK, sure, thanks.
   
   > @Vishwanatha-HD Please don't resolve discussions until they are actually 
resolved. This one hasn't been addressed.






Re: [PR] GH-48204: [C++][Parquet] Fix Column Reader & Writer logic to enable Parquet DB support on s390x [arrow]

2025-11-26 Thread via GitHub


pitrou commented on code in PR #48205:
URL: https://github.com/apache/arrow/pull/48205#discussion_r2565105141


##
cpp/src/parquet/column_writer.cc:
##
@@ -2578,13 +2579,31 @@ struct SerializeFunctor<
   if constexpr (std::is_same_v) {
 *p++ = ::arrow::bit_util::ToBigEndian(u64_in[0]);
   } else if constexpr (std::is_same_v) 
{
+#if ARROW_LITTLE_ENDIAN

Review Comment:
   @Vishwanatha-HD Please don't resolve discussions until they are actually 
resolved. This one hasn't been addressed.






Re: [PR] GH-48204: [C++][Parquet] Fix Column Reader & Writer logic to enable Parquet DB support on s390x [arrow]

2025-11-26 Thread via GitHub


pitrou commented on code in PR #48205:
URL: https://github.com/apache/arrow/pull/48205#discussion_r2565094873


##
cpp/src/parquet/column_reader.cc:
##
@@ -132,7 +134,11 @@ int LevelDecoder::SetData(Encoding::type encoding, int16_t 
max_level,
 "Number of buffered values too large (corrupt data page?)");
   }
  num_bytes = static_cast<int32_t>(bit_util::BytesForBits(num_bits));
+#if ARROW_LITTLE_ENDIAN
   if (num_bytes < 0 || num_bytes > data_size - 4) {
+#else
+  if (num_bytes < 0 || num_bytes > data_size) {
+#endif

Review Comment:
   @Vishwanatha-HD Can you rebase/merge from git main and remove this change? 






Re: [PR] GH-48204: [C++][Parquet] Fix Column Reader & Writer logic to enable Parquet DB support on s390x [arrow]

2025-11-25 Thread via GitHub


Vishwanatha-HD commented on code in PR #48205:
URL: https://github.com/apache/arrow/pull/48205#discussion_r2559802079


##
cpp/src/parquet/column_writer.cc:
##
@@ -2578,13 +2579,31 @@ struct SerializeFunctor<
   if constexpr (std::is_same_v) {
 *p++ = ::arrow::bit_util::ToBigEndian(u64_in[0]);
   } else if constexpr (std::is_same_v) 
{
+#if ARROW_LITTLE_ENDIAN

Review Comment:
   @pitrou Thanks for your review comments. I will work on this change in
the next pass. Thanks.






Re: [PR] GH-48204: [C++][Parquet] Fix Column Reader & Writer logic to enable Parquet DB support on s390x [arrow]

2025-11-25 Thread via GitHub


pitrou commented on code in PR #48205:
URL: https://github.com/apache/arrow/pull/48205#discussion_r2558909398


##
cpp/src/parquet/column_writer.h:
##
@@ -260,13 +261,21 @@ constexpr int64_t kJulianEpochOffsetDays = INT64_C(2440588);
 template <int64_t UnitPerDay, int64_t NanosecondsPerUnit>
 inline void ArrowTimestampToImpalaTimestamp(const int64_t time, Int96* impala_timestamp) {

Review Comment:
   Ok, let's keep it like that for now.






Re: [PR] GH-48204: [C++][Parquet] Fix Column Reader & Writer logic to enable Parquet DB support on s390x [arrow]

2025-11-24 Thread via GitHub


Vishwanatha-HD commented on code in PR #48205:
URL: https://github.com/apache/arrow/pull/48205#discussion_r2558748029


##
cpp/src/parquet/column_writer.h:
##
@@ -260,13 +261,21 @@ constexpr int64_t kJulianEpochOffsetDays = INT64_C(2440588);
 template <int64_t UnitPerDay, int64_t NanosecondsPerUnit>
 inline void ArrowTimestampToImpalaTimestamp(const int64_t time, Int96* impala_timestamp) {

Review Comment:
   @pitrou As you suggested, I tried moving the definition to
column_writer.cc and keeping the declaration in column_writer.h, but the
compiler throws the error below.
   
   The basic issue is that column_writer.h is included by many other files,
such as the stream writers and file writers, and those files now complain
that they cannot see the function definition.
   
   If I end up providing the definition everywhere just to fix these compile
errors, it will be code duplication. Hence I will leave this as is for now.
Thanks.
   
   In file included from /home/vishwa/golang/arrow/cpp/src/parquet/stream_writer.h:31,
 from /home/vishwa/golang/arrow/cpp/src/parquet/stream_writer.cc:18:
   /home/vishwa/golang/arrow/cpp/src/parquet/column_writer.h:261:13: error: 
inline function ‘void 
parquet::internal::ArrowTimestampToImpalaTimestamp(int64_t, p
   arquet::Int96*) [with long int UnitPerDay = 86400; long int 
NanosecondsPerUnit = 10; int64_t = long int]’ used but never defined 
[-Werror]
261 | inline void ArrowTimestampToImpalaTimestamp(const int64_t time, 
Int96* impala_timestamp);
  |   ^~~
   /home/vishwa/golang/arrow/cpp/src/parquet/column_writer.h:261:13: error: 
inline function ‘void 
parquet::internal::ArrowTimestampToImpalaTimestamp(int64_t, p
   arquet::Int96*) [with long int UnitPerDay = 8640; long int 
NanosecondsPerUnit = 100; int64_t = long int]’ used but never defined 
[-Werror]
   /home/vishwa/golang/arrow/cpp/src/parquet/column_writer.h:261:13: error: 
inline function ‘void 
parquet::internal::ArrowTimestampToImpalaTimestamp(int64_t, p
   arquet::Int96*) [with long int UnitPerDay = 864; long int 
NanosecondsPerUnit = 1000; int64_t = long int]’ used but never defined [-Werror]
   /home/vishwa/golang/arrow/cpp/src/parquet/column_writer.h:261:13: error: 
inline function ‘void 
parquet::internal::ArrowTimestampToImpalaTimestamp(int64_t, p
   arquet::Int96*) [with long int UnitPerDay = 864000; long int 
NanosecondsPerUnit = 1; int64_t = long int]’ used but never defined [-Werror]
   In file included from 
/home/vishwa/golang/arrow/cpp/src/parquet/file_writer.cc:29:
   /home/vishwa/golang/arrow/cpp/src/parquet/column_writer.h:261:13: error: 
inline function ‘void 
parquet::internal::ArrowTimestampToImpalaTimestamp(int64_t, p
   arquet::Int96*) [with long int UnitPerDay = 86400; long int 
NanosecondsPerUnit = 10; int64_t = long int]’ used but never defined 
[-Werror]
261 | inline void ArrowTimestampToImpalaTimestamp(const int64_t time, 
Int96* impala_timestamp);






Re: [PR] GH-48204: [C++][Parquet] Fix Column Reader & Writer logic to enable Parquet DB support on s390x [arrow]

2025-11-24 Thread via GitHub


Vishwanatha-HD commented on PR #48205:
URL: https://github.com/apache/arrow/pull/48205#issuecomment-3572600519

   > Thanks @Vishwanatha-HD , and thanks for splitting up like this.
   
   @pitrou.. Sure.. my pleasure.. Thanks a lot for spending so much time 
reviewing the code changes and giving your comments.. I appreciate it.. !!





Re: [PR] GH-48204: [C++][Parquet] Fix Column Reader & Writer logic to enable Parquet DB support on s390x [arrow]

2025-11-24 Thread via GitHub


pitrou commented on code in PR #48205:
URL: https://github.com/apache/arrow/pull/48205#discussion_r2557635517


##
cpp/src/parquet/column_writer.h:
##
@@ -260,13 +261,21 @@ constexpr int64_t kJulianEpochOffsetDays = INT64_C(2440588);
 template <int64_t UnitPerDay, int64_t NanosecondsPerUnit>
 inline void ArrowTimestampToImpalaTimestamp(const int64_t time, Int96* impala_timestamp) {

Review Comment:
   > @pitrou.. Please note that ArrowTimestampToImpalaTimestamp() is an inline 
function.. Hence I won't be able to move this to the column_writer.cc file.. If 
we have to move it, then the inlining has to be removed
   
   No, it doesn't. You just have to ensure that the definition is visible at the 
call point.
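
   For reference, the pattern @pitrou describes — declaration in the header, definition plus explicit instantiations in one .cc file — can be sketched as below. This is condensed into a single file for illustration; the function name, units, and the dropped `inline` keyword are assumptions of the sketch, not the actual Arrow code:

   ```cpp
   #include <cassert>
   #include <cstdint>

   // --- what would live in the header: declaration only, no body, no `inline` ---
   template <int64_t UnitPerDay, int64_t NanosecondsPerUnit>
   void ArrowTimestampToNanos(int64_t time, int64_t* out);  // hypothetical name

   // --- what would live in the .cc file: the definition... ---
   template <int64_t UnitPerDay, int64_t NanosecondsPerUnit>
   void ArrowTimestampToNanos(int64_t time, int64_t* out) {
     // Nanoseconds elapsed within the current day.
     *out = (time % UnitPerDay) * NanosecondsPerUnit;
   }

   // ...plus one explicit instantiation per unit actually used, so other
   // translation units can link against it without ever seeing the body.
   template void ArrowTimestampToNanos<86400, 1000000000>(int64_t, int64_t*);

   int main() {
     int64_t nanos = 0;
     ArrowTimestampToNanos<86400, 1000000000>(86401, &nanos);
     assert(nanos == 1000000000);  // one second past the day boundary
     return 0;
   }
   ```

   The "used but never defined" error quoted above comes from keeping `inline` on a declaration with no visible body; with explicit instantiations the `inline` specifier has to go.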






Re: [PR] GH-48204: [C++][Parquet] Fix Column Reader & Writer logic to enable Parquet DB support on s390x [arrow]

2025-11-24 Thread via GitHub


Vishwanatha-HD commented on code in PR #48205:
URL: https://github.com/apache/arrow/pull/48205#discussion_r2557632141


##
cpp/src/parquet/column_writer.h:
##
@@ -23,6 +23,7 @@
 
 #include "arrow/type_fwd.h"
 #include "arrow/util/compression.h"
+#include "arrow/util/endian.h"

Review Comment:
   @pitrou.. I have made the appropriate code changes.. Thanks






Re: [PR] GH-48204: [C++][Parquet] Fix Column Reader & Writer logic to enable Parquet DB support on s390x [arrow]

2025-11-24 Thread via GitHub


Vishwanatha-HD commented on code in PR #48205:
URL: https://github.com/apache/arrow/pull/48205#discussion_r2557630887


##
cpp/src/parquet/column_writer.h:
##
@@ -260,13 +261,21 @@ constexpr int64_t kJulianEpochOffsetDays = INT64_C(2440588);
 template <int64_t UnitPerDay, int64_t NanosecondsPerUnit>
 inline void ArrowTimestampToImpalaTimestamp(const int64_t time, Int96* impala_timestamp) {

Review Comment:
   @pitrou.. Please note that ArrowTimestampToImpalaTimestamp() is an inline 
function.. Hence I won't be able to move this to the column_writer.cc file.. If 
we have to move it, then the inlining has to be removed and in that case, as you 
know, we are not sure of the performance impact.. Thanks.. 






Re: [PR] GH-48204: [C++][Parquet] Fix Column Reader & Writer logic to enable Parquet DB support on s390x [arrow]

2025-11-24 Thread via GitHub


Vishwanatha-HD commented on code in PR #48205:
URL: https://github.com/apache/arrow/pull/48205#discussion_r2557590480


##
cpp/src/parquet/column_writer.cc:
##
@@ -2621,6 +2641,38 @@ struct SerializeFunctor<::parquet::FLBAType, 
::arrow::HalfFloatType> {
 return FLBA{reinterpret_cast<const uint8_t*>(value_ptr)};
   }
 };
+#else
+template <>
+struct SerializeFunctor<::parquet::FLBAType, ::arrow::HalfFloatType> {
+  Status Serialize(const ::arrow::HalfFloatArray& array, ArrowWriteContext*, 
FLBA* out) {
+const uint16_t* values = array.raw_values();
+const int64_t length = array.length();
+
+// Allocate buffer for little-endian converted values
+converted_values_.resize(length);
+
+if (array.null_count() == 0) {
+  for (int64_t i = 0; i < length; ++i) {
+converted_values_[i] = ::arrow::bit_util::ToLittleEndian(values[i]);
+out[i] = FLBA{reinterpret_cast<const uint8_t*>(&converted_values_[i])};
+  }
+} else {
+  for (int64_t i = 0; i < length; ++i) {
+if (array.IsValid(i)) {
+  converted_values_[i] = ::arrow::bit_util::ToLittleEndian(values[i]);
+  out[i] = FLBA{reinterpret_cast<const uint8_t*>(&converted_values_[i])};
+} else {
+  out[i] = FLBA{};
+}
+  }
+}
+return Status::OK();
+  }
+
+ private:
+  std::vector<uint16_t> converted_values_;
+};
+#endif

Review Comment:
   Hi @kou,
   I have made the necessary code changes as you have suggested above.. I 
tested it on s390x and it just works fine.. Thanks.. 






Re: [PR] GH-48204: [C++][Parquet] Fix Column Reader & Writer logic to enable Parquet DB support on s390x [arrow]

2025-11-24 Thread via GitHub


Vishwanatha-HD commented on code in PR #48205:
URL: https://github.com/apache/arrow/pull/48205#discussion_r2557265427


##
cpp/src/parquet/column_writer.h:
##
@@ -260,13 +261,21 @@ constexpr int64_t kJulianEpochOffsetDays = INT64_C(2440588);
 template <int64_t UnitPerDay, int64_t NanosecondsPerUnit>
 inline void ArrowTimestampToImpalaTimestamp(const int64_t time, Int96* impala_timestamp) {
   int64_t julian_days = (time / UnitPerDay) + kJulianEpochOffsetDays;
+#if ARROW_LITTLE_ENDIAN
   (*impala_timestamp).value[2] = (uint32_t)julian_days;
+#endif

Review Comment:
   @kou.. Thanks for pointing this out.. I have made an appropriate code change 
to take care of this comment.. Thanks.. 






Re: [PR] GH-48204: [C++][Parquet] Fix Column Reader & Writer logic to enable Parquet DB support on s390x [arrow]

2025-11-24 Thread via GitHub


pitrou commented on PR #48205:
URL: https://github.com/apache/arrow/pull/48205#issuecomment-3570233704

   Thanks @Vishwanatha-HD , and thanks for splitting up like this.





Re: [PR] GH-48204: [C++][Parquet] Fix Column Reader & Writer logic to enable Parquet DB support on s390x [arrow]

2025-11-24 Thread via GitHub


Vishwanatha-HD commented on PR #48205:
URL: https://github.com/apache/arrow/pull/48205#issuecomment-3570045067

   > I'm frankly surprised that so few changes are required, given that Parquet 
C++ was never successfully tested on BE systems before. @Vishwanatha-HD Did you 
try to read the files in 
https://github.com/apache/parquet-testing/tree/master/data and check the 
contents were properly decoded?
   
   Hi @pitrou.. 
   Thanks for your comments.. Please note that these are not the only changes 
required to enable support on s390x.. There are 12 other PRs that I have 
raised.. The main issue link is: 
https://github.com/apache/arrow/issues/48151
   Please check that and you will get the links to all the remaining PRs.





Re: [PR] GH-48204: [C++][Parquet] Fix Column Reader & Writer logic to enable Parquet DB support on s390x [arrow]

2025-11-24 Thread via GitHub


pitrou commented on PR #48205:
URL: https://github.com/apache/arrow/pull/48205#issuecomment-3569858830

   I'm frankly surprised that so few changes are required, given that Parquet 
C++ was never successfully tested on BE systems before. @Vishwanatha-HD Did you 
try to read the files in 
https://github.com/apache/parquet-testing/tree/master/data and check the 
contents were properly decoded?





Re: [PR] GH-48204: [C++][Parquet] Fix Column Reader & Writer logic to enable Parquet DB support on s390x [arrow]

2025-11-24 Thread via GitHub


pitrou commented on code in PR #48205:
URL: https://github.com/apache/arrow/pull/48205#discussion_r2555493413


##
cpp/src/parquet/column_writer.h:
##
@@ -23,6 +23,7 @@
 
 #include "arrow/type_fwd.h"
 #include "arrow/util/compression.h"
+#include "arrow/util/endian.h"

Review Comment:
   There shouldn't be a need to include this in this header file, IMHO.






Re: [PR] GH-48204: [C++][Parquet] Fix Column Reader & Writer logic to enable Parquet DB support on s390x [arrow]

2025-11-24 Thread via GitHub


pitrou commented on code in PR #48205:
URL: https://github.com/apache/arrow/pull/48205#discussion_r2555491624


##
cpp/src/parquet/column_writer.h:
##
@@ -260,13 +261,21 @@ constexpr int64_t kJulianEpochOffsetDays = INT64_C(2440588);
 template <int64_t UnitPerDay, int64_t NanosecondsPerUnit>
 inline void ArrowTimestampToImpalaTimestamp(const int64_t time, Int96* impala_timestamp) {

Review Comment:
   Please let's move the function definition to `column_writer.cc`.






Re: [PR] GH-48204: [C++][Parquet] Fix Column Reader & Writer logic to enable Parquet DB support on s390x [arrow]

2025-11-24 Thread via GitHub


pitrou commented on code in PR #48205:
URL: https://github.com/apache/arrow/pull/48205#discussion_r2555481395


##
cpp/src/parquet/column_writer.cc:
##
@@ -2621,6 +2641,38 @@ struct SerializeFunctor<::parquet::FLBAType, 
::arrow::HalfFloatType> {
 return FLBA{reinterpret_cast<const uint8_t*>(value_ptr)};
   }
 };
+#else
+template <>
+struct SerializeFunctor<::parquet::FLBAType, ::arrow::HalfFloatType> {
+  Status Serialize(const ::arrow::HalfFloatArray& array, ArrowWriteContext*, 
FLBA* out) {
+const uint16_t* values = array.raw_values();
+const int64_t length = array.length();
+
+// Allocate buffer for little-endian converted values
+converted_values_.resize(length);
+
+if (array.null_count() == 0) {
+  for (int64_t i = 0; i < length; ++i) {
+converted_values_[i] = ::arrow::bit_util::ToLittleEndian(values[i]);
+out[i] = FLBA{reinterpret_cast<const uint8_t*>(&converted_values_[i])};
+  }
+} else {
+  for (int64_t i = 0; i < length; ++i) {
+if (array.IsValid(i)) {
+  converted_values_[i] = ::arrow::bit_util::ToLittleEndian(values[i]);
+  out[i] = FLBA{reinterpret_cast<const uint8_t*>(&converted_values_[i])};
+} else {
+  out[i] = FLBA{};
+}
+  }
+}
+return Status::OK();
+  }
+
+ private:
+  std::vector<uint16_t> converted_values_;
+};
+#endif

Review Comment:
   Please, reuse the `AllocateScratch` approach from the Decimal serializer. 
You can introduce a mixin class if that makes things easier.






Re: [PR] GH-48204: [C++][Parquet] Fix Column Reader & Writer logic to enable Parquet DB support on s390x [arrow]

2025-11-24 Thread via GitHub


pitrou commented on code in PR #48205:
URL: https://github.com/apache/arrow/pull/48205#discussion_r2555472890


##
cpp/src/parquet/column_writer.cc:
##
@@ -2578,13 +2579,31 @@ struct SerializeFunctor<
   if constexpr (std::is_same_v) {
 *p++ = ::arrow::bit_util::ToBigEndian(u64_in[0]);
   } else if constexpr (std::is_same_v) 
{
+#if ARROW_LITTLE_ENDIAN

Review Comment:
   Please take a step back and read the comments above:
   ```c++
   // Requires a custom serializer because decimal in parquet are in big-endian
   // format. Thus, a temporary local buffer is required.
   ```
   
   If we're on a big-endian system, this entire code is unnecessary and we can 
just use the FIXED_LEN_BYTE_ARRAY `SerializeFunctor`.






Re: [PR] GH-48204: [C++][Parquet] Fix Column Reader & Writer logic to enable Parquet DB support on s390x [arrow]

2025-11-24 Thread via GitHub


pitrou commented on code in PR #48205:
URL: https://github.com/apache/arrow/pull/48205#discussion_r2555448401


##
cpp/src/parquet/column_reader.cc:
##
@@ -132,7 +134,11 @@ int LevelDecoder::SetData(Encoding::type encoding, int16_t 
max_level,
 "Number of buffered values too large (corrupt data page?)");
   }
  num_bytes = static_cast<int32_t>(bit_util::BytesForBits(num_bits));
+#if ARROW_LITTLE_ENDIAN
   if (num_bytes < 0 || num_bytes > data_size - 4) {
+#else
+  if (num_bytes < 0 || num_bytes > data_size) {
+#endif

Review Comment:
   https://github.com/apache/arrow/pull/48235






Re: [PR] GH-48204: [C++][Parquet] Fix Column Reader & Writer logic to enable Parquet DB support on s390x [arrow]

2025-11-24 Thread via GitHub


pitrou commented on code in PR #48205:
URL: https://github.com/apache/arrow/pull/48205#discussion_r2555377529


##
cpp/src/parquet/column_reader.cc:
##
@@ -132,7 +134,11 @@ int LevelDecoder::SetData(Encoding::type encoding, int16_t 
max_level,
 "Number of buffered values too large (corrupt data page?)");
   }
  num_bytes = static_cast<int32_t>(bit_util::BytesForBits(num_bits));
+#if ARROW_LITTLE_ENDIAN
   if (num_bytes < 0 || num_bytes > data_size - 4) {
+#else
+  if (num_bytes < 0 || num_bytes > data_size) {
+#endif

Review Comment:
   Thanks for the ping @kou . I've re-read through this code and I now think 
the original change was a mistake. I'll submit a separate issue/PR to fix it.






Re: [PR] GH-48204: [C++][Parquet] Fix Column Reader & Writer logic to enable Parquet DB support on s390x [arrow]

2025-11-23 Thread via GitHub


k8ika0s commented on PR #48205:
URL: https://github.com/apache/arrow/pull/48205#issuecomment-3568405152

   @Vishwanatha-HD
   
   Working through this one, I’m reminded how many odd little corners show up 
when Arrow’s layout meets Parquet’s expectations — especially around levels, 
decimals, and the legacy INT96 bits.
   
   Looking at the pieces that overlap with the work I’ve been doing, the 
overall direction makes sense. A few notes from what I’ve seen on real s390x 
hardware:
   
   **• BIT_PACKED level headers**
   Your patch keeps the `data_size - 4` bound under `ARROW_LITTLE_ENDIAN`, 
whereas my tree leans on accepting the full BIT_PACKED buffer and logging 
failures rather than early-bounding. Neither approach is wrong, but on BE 
machines I’ve found that the “minus 4” guard sometimes rejects buffers that are 
actually fine, depending on how many values the upstream encoder produced.
   
   **• Decimal serialization**
   This is one of the trickier spots. Parquet expects decimals in a big-endian 
128-bit payload, but Arrow materializes them in little-endian limbs even on BE 
hardware. In my implementation I reverse the Arrow words (`[low, high]` → 
`[high, low]`) before handing them to the writer so the final byte stream 
matches the canonical Parquet format.
   Your patch uses `ToBigEndian` on each limb directly in host order, which 
works for many cases but can produce a differently ordered representation when 
Arrow’s in-memory layout doesn’t match the 128-bit big-endian wire format. Just 
sharing that in case you’ve seen similar behavior when mixing different decimal 
widths.
   
   **• Half-floats in FLBA**
   The BE path you added with `ToLittleEndian(values[i])` aligns with the 
intent. I ended up staging the FLBA structs and the 2-byte payloads together in 
one scratch buffer, mostly because some downstream consumers treat the pointer 
lifetime very strictly. Either way, normalizing those 2-byte halves before page 
assembly helps avoid the cross-architecture drift I’ve run into.
   
   **• Paging / DoInBatches**
   Your rewrite to enforce `max_rows_per_page` is a meaningful cleanup. My 
patches didn’t touch this area, so no conflicts there — but just to mention it, 
keeping the paging logic predictable on BE made debugging the level stream 
quite a bit easier for me.
   
   **• INT96 (Impala timestamp)**
   Your implementation writes host-order limbs on BE and memcpy on LE. In my 
case I leaned heavily on always emitting LE limbs so the decode path doesn’t 
have to branch on architecture. Both approaches work as long as the 
corresponding reader expects the same convention.
   
   None of this is blocking — just trying to pass along the details I’ve seen 
crop up when running the full parquet-encode → parquet-decode cycle on 
big-endian hardware. 
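
   The limb-reversal point above can be made concrete. Below is a minimal sketch (not the Arrow implementation; the function name is made up) of turning a 128-bit decimal held as two host-order 64-bit limbs `[low, high]` into the 16-byte big-endian payload Parquet expects for FIXED_LEN_BYTE_ARRAY decimals:

   ```cpp
   #include <array>
   #include <cassert>
   #include <cstdint>

   // Emit the high limb first, then the low limb, each as big-endian bytes.
   // The shift-based extraction is host-endianness-independent.
   std::array<uint8_t, 16> ToParquetDecimalBytes(uint64_t low, uint64_t high) {
     std::array<uint8_t, 16> out{};
     for (int i = 0; i < 8; ++i) {
       out[i] = static_cast<uint8_t>(high >> (8 * (7 - i)));      // bytes 0..7
       out[8 + i] = static_cast<uint8_t>(low >> (8 * (7 - i)));   // bytes 8..15
     }
     return out;
   }

   int main() {
     auto b = ToParquetDecimalBytes(/*low=*/0x0102030405060708ULL, /*high=*/0);
     assert(b[0] == 0 && b[7] == 0);        // high limb first, big-endian
     assert(b[8] == 0x01 && b[15] == 0x08); // low limb, most significant byte first
     return 0;
   }
   ```

   Writing the limbs `[high, low]` before byte-swapping is exactly the reordering described above; doing either step alone produces a stream that round-trips on one architecture but not the other.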
   





Re: [PR] GH-48204: [C++][Parquet] Fix Column Reader & Writer logic to enable Parquet DB support on s390x [arrow]

2025-11-22 Thread via GitHub


kou commented on code in PR #48205:
URL: https://github.com/apache/arrow/pull/48205#discussion_r2553142118


##
cpp/src/parquet/column_reader.cc:
##
@@ -132,7 +134,11 @@ int LevelDecoder::SetData(Encoding::type encoding, int16_t 
max_level,
 "Number of buffered values too large (corrupt data page?)");
   }
  num_bytes = static_cast<int32_t>(bit_util::BytesForBits(num_bits));
+#if ARROW_LITTLE_ENDIAN
   if (num_bytes < 0 || num_bytes > data_size - 4) {
+#else
+  if (num_bytes < 0 || num_bytes > data_size) {
+#endif

Review Comment:
   @pitrou You added `- 4` in #6848. Do you think that we need `- 4` with big 
endian too?



##
cpp/src/parquet/column_writer.h:
##
@@ -260,13 +261,21 @@ constexpr int64_t kJulianEpochOffsetDays = INT64_C(2440588);
 template <int64_t UnitPerDay, int64_t NanosecondsPerUnit>
 inline void ArrowTimestampToImpalaTimestamp(const int64_t time, Int96* impala_timestamp) {
   int64_t julian_days = (time / UnitPerDay) + kJulianEpochOffsetDays;
+#if ARROW_LITTLE_ENDIAN
   (*impala_timestamp).value[2] = (uint32_t)julian_days;
+#endif
 
   int64_t last_day_units = time % UnitPerDay;
   auto last_day_nanos = last_day_units * NanosecondsPerUnit;
+#if ARROW_LITTLE_ENDIAN
   // impala_timestamp will be unaligned every other entry so do memcpy instead
   // of assign and reinterpret cast to avoid undefined behavior.
   std::memcpy(impala_timestamp, &last_day_nanos, sizeof(int64_t));
+#else
+  (*impala_timestamp).value[0] = static_cast<uint32_t>(last_day_nanos);
+  (*impala_timestamp).value[1] = static_cast<uint32_t>(last_day_nanos >> 32);

Review Comment:
   Can we use the following instead of `#if`?
   
   ```cpp
   auto last_day_nanos = last_day_units * NanosecondsPerUnit;
   auto last_day_nanos_little_endian = 
::arrow::bit_util::ToLittleEndian(last_day_nanos);
   std::memcpy(impala_timestamp, &last_day_nanos_little_endian, 
sizeof(int64_t));
   ```
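   
   The suggestion can be exercised standalone. The sketch below uses a hand-rolled little-endian store in place of `::arrow::bit_util::ToLittleEndian` plus `memcpy`, and a stand-in `Int96` struct; the helper names are assumptions of the sketch, not Arrow's API. It shows the endianness-neutral layout: nanoseconds-within-day in bytes 0..7, Julian day number in bytes 8..11, on any host.
   
   ```cpp
   #include <cassert>
   #include <cstdint>
   
   struct Int96 { uint32_t value[3]; };  // stand-in for parquet::Int96
   
   // Store a value as little-endian bytes regardless of host byte order.
   void StoreLittleEndian64(uint8_t* dst, uint64_t v) {
     for (int i = 0; i < 8; ++i) dst[i] = static_cast<uint8_t>(v >> (8 * i));
   }
   void StoreLittleEndian32(uint8_t* dst, uint32_t v) {
     for (int i = 0; i < 4; ++i) dst[i] = static_cast<uint8_t>(v >> (8 * i));
   }
   
   // Endianness-neutral version of the conversion discussed above
   // (seconds-resolution input for brevity).
   void TimestampToInt96(int64_t time_seconds, Int96* out) {
     constexpr int64_t kSecondsPerDay = 86400;
     constexpr int64_t kJulianEpochOffsetDays = 2440588;  // Julian day of Unix epoch
     auto julian_days =
         static_cast<uint32_t>(time_seconds / kSecondsPerDay + kJulianEpochOffsetDays);
     int64_t nanos = (time_seconds % kSecondsPerDay) * 1000000000LL;
     auto* bytes = reinterpret_cast<uint8_t*>(out->value);
     StoreLittleEndian64(bytes, static_cast<uint64_t>(nanos));
     StoreLittleEndian32(bytes + 8, julian_days);
   }
   
   int main() {
     Int96 ts{};
     TimestampToInt96(86400, &ts);  // exactly one day after the Unix epoch
     auto* b = reinterpret_cast<uint8_t*>(ts.value);
     for (int i = 0; i < 8; ++i) assert(b[i] == 0);  // zero nanos within the day
     uint32_t jd = b[8] | (b[9] << 8) | (b[10] << 16) | (uint32_t(b[11]) << 24);
     assert(jd == 2440589u);  // Julian day stored little-endian in bytes 8..11
     return 0;
   }
   ```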



##
cpp/src/parquet/column_writer.cc:
##
@@ -2621,6 +2641,38 @@ struct SerializeFunctor<::parquet::FLBAType, 
::arrow::HalfFloatType> {
 return FLBA{reinterpret_cast<const uint8_t*>(value_ptr)};
   }
 };
+#else
+template <>
+struct SerializeFunctor<::parquet::FLBAType, ::arrow::HalfFloatType> {
+  Status Serialize(const ::arrow::HalfFloatArray& array, ArrowWriteContext*, 
FLBA* out) {
+const uint16_t* values = array.raw_values();
+const int64_t length = array.length();
+
+// Allocate buffer for little-endian converted values
+converted_values_.resize(length);
+
+if (array.null_count() == 0) {
+  for (int64_t i = 0; i < length; ++i) {
+converted_values_[i] = ::arrow::bit_util::ToLittleEndian(values[i]);
+out[i] = FLBA{reinterpret_cast<const uint8_t*>(&converted_values_[i])};
+  }
+} else {
+  for (int64_t i = 0; i < length; ++i) {
+if (array.IsValid(i)) {
+  converted_values_[i] = ::arrow::bit_util::ToLittleEndian(values[i]);
+  out[i] = FLBA{reinterpret_cast<const uint8_t*>(&converted_values_[i])};
+} else {
+  out[i] = FLBA{};
+}
+  }
+}
+return Status::OK();
+  }
+
+ private:
+  std::vector<uint16_t> converted_values_;
+};
+#endif

Review Comment:
   Could you share implementation as much as possible something like:
   
   ```cpp
   template <>
   struct SerializeFunctor<::parquet::FLBAType, ::arrow::HalfFloatType> {
 Status Serialize(const ::arrow::HalfFloatArray& array, ArrowWriteContext*, 
FLBA* out) {
   #if ARROW_LITTLE_ENDIAN
   return SerializeLittleEndianValues(array.raw_values(), out);
   #else
   const uint16_t* values = array.raw_values();
   const int64_t length = array.length();
   converted_values_.resize(length);
   for (int64_t i = 0; i < length; ++i) {
 // We don't need IsValid() here. Non valid values are just ignored in 
SerializeLittleEndianValues().
 converted_values_[i] = ::arrow::bit_util::ToLittleEndian(values[i]);
   }
   return SerializeLittleEndianValues(converted_values_.data(), out);
   #endif
 }
   
private:
 Status SerializeLittleEndianValues(const uint16_t* values, FLBA* out) {
   if (array.null_count() == 0) {
 for (int64_t i = 0; i < array.length(); ++i) {
   out[i] = ToFLBA(&values[i]);
 }
   } else {
 for (int64_t i = 0; i < array.length(); ++i) {
   out[i] = array.IsValid(i) ? ToFLBA(&values[i]) : FLBA{};
 }
   }
   return Status::OK();
 }
   
 FLBA ToFLBA(const uint16_t* value_ptr) const {
    return FLBA{reinterpret_cast<const uint8_t*>(value_ptr)};
 }
   
   #if !ARROW_LITTLE_ENDIAN
  std::vector<uint16_t> converted_values_;
   #endif
   };
   ```
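   
   The `ToLittleEndian` call on the 2-byte half-float payload is the crux of the suggestion above. A portable stand-in (assuming only the standard library; `ToLittleEndian16` is a sketch, not Arrow's `bit_util`) shows why the resulting FLBA bytes come out identical on either host byte order:
   
   ```cpp
   #include <cassert>
   #include <cstdint>
   #include <cstring>
   
   // Return a value whose in-memory bytes are little-endian on any host:
   // a no-op on little-endian hosts, a byte swap on big-endian ones.
   uint16_t ToLittleEndian16(uint16_t v) {
     uint16_t probe = 1;
     uint8_t first;
     std::memcpy(&first, &probe, 1);
     bool host_is_little = (first == 1);
     return host_is_little ? v : static_cast<uint16_t>((v << 8) | (v >> 8));
   }
   
   int main() {
     // Half-float 1.0 is 0x3C00; its Parquet FLBA payload must be the
     // little-endian byte pair {0x00, 0x3C} on every architecture.
     uint16_t le = ToLittleEndian16(0x3C00);
     uint8_t bytes[2];
     std::memcpy(bytes, &le, 2);
     assert(bytes[0] == 0x00 && bytes[1] == 0x3C);
     return 0;
   }
   ```
   
   Normalizing into a scratch buffer before handing out `FLBA` pointers, as in the suggestion, also keeps the pointed-to bytes alive for the writer's lifetime requirements.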



##
cpp/src/parquet/column_writer.h:
##
@@ -260,13 +261,21 @@ constexpr int64_t kJulianEpochOffsetDays = INT64_C(2440588);
 template <int64_t UnitPerDay, int64_t NanosecondsPerUnit>
 inline void ArrowTimestampToImpalaTimestamp(const int64_t time, Int96* impala_timestamp) {
   int64_t julian_days = (time / UnitPerDay) + kJulianEpochOffsetDays;
+#if ARROW_LITTLE_ENDIAN
   (*impala_timestamp).value[2] = (uint32_t)jul