Re: [PR] GH-48204: [C++][Parquet] Fix Column Reader & Writer logic to enable Parquet DB support on s390x [arrow]
Vishwanatha-HD commented on PR #48205: URL: https://github.com/apache/arrow/pull/48205#issuecomment-3722240287 @pitrou @kou Greetings, I was on vacation for most of December, hence the lack of reply from me. Could you please help with merging the pull requests I raised, if you don't have any further comments or issues? Thanks. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
Re: [PR] GH-48204: [C++][Parquet] Fix Column Reader & Writer logic to enable Parquet DB support on s390x [arrow]
Vishwanatha-HD commented on PR #48205: URL: https://github.com/apache/arrow/pull/48205#issuecomment-3722230918 > @Vishwanatha-HD Could you please fix the number `GH-48204` in the title? It should be `GH-48205`. Hi @kiszk, sorry for the delayed reply; I was on vacation. Please note that GH-48204 is the issue number associated with this pull request: https://github.com/apache/arrow/issues/48204
Re: [PR] GH-48204: [C++][Parquet] Fix Column Reader & Writer logic to enable Parquet DB support on s390x [arrow]
kiszk commented on PR #48205: URL: https://github.com/apache/arrow/pull/48205#issuecomment-3698769913 @Vishwanatha-HD Could you please fix the number `GH-48204` in the title? It should be `GH-48205`.
Re: [PR] GH-48204: [C++][Parquet] Fix Column Reader & Writer logic to enable Parquet DB support on s390x [arrow]
Vishwanatha-HD commented on code in PR #48205:
URL: https://github.com/apache/arrow/pull/48205#discussion_r2601918022
## cpp/src/parquet/column_writer.h: ##
@@ -267,9 +267,14 @@ inline void ArrowTimestampToImpalaTimestamp(const int64_t time, Int96* impala_ti
   }
   impala_timestamp->value[2] = static_cast<uint32_t>(julian_days);
   uint64_t last_day_nanos = static_cast<uint64_t>(last_day_units) * NanosecondsPerUnit;
+#if ARROW_LITTLE_ENDIAN
   // impala_timestamp will be unaligned every other entry so do memcpy instead
   // of assign and reinterpret cast to avoid undefined behavior.
-  std::memcpy(impala_timestamp, &last_day_nanos, sizeof(uint64_t));
+  std::memcpy(impala_timestamp, &last_day_nanos, sizeof(int64_t));
+#else
+  (*impala_timestamp).value[0] = static_cast<uint32_t>(last_day_nanos);
+  (*impala_timestamp).value[1] = static_cast<uint32_t>(last_day_nanos >> 32);
Review Comment:
@kou, I don't think that is necessary; otherwise the test case would have failed. It is just the way the three uint32_t values are arranged from `last_day_nanos` into the `(*impala_timestamp).value` variable.
Re: [PR] GH-48204: [C++][Parquet] Fix Column Reader & Writer logic to enable Parquet DB support on s390x [arrow]
pitrou commented on code in PR #48205:
URL: https://github.com/apache/arrow/pull/48205#discussion_r2601509785
## cpp/src/parquet/column_writer.h: ##
@@ -267,9 +267,14 @@ inline void ArrowTimestampToImpalaTimestamp(const int64_t time, Int96* impala_ti
   }
   impala_timestamp->value[2] = static_cast<uint32_t>(julian_days);
   uint64_t last_day_nanos = static_cast<uint64_t>(last_day_units) * NanosecondsPerUnit;
+#if ARROW_LITTLE_ENDIAN
   // impala_timestamp will be unaligned every other entry so do memcpy instead
   // of assign and reinterpret cast to avoid undefined behavior.
-  std::memcpy(impala_timestamp, &last_day_nanos, sizeof(uint64_t));
+  std::memcpy(impala_timestamp, &last_day_nanos, sizeof(int64_t));
+#else
+  (*impala_timestamp).value[0] = static_cast<uint32_t>(last_day_nanos);
+  (*impala_timestamp).value[1] = static_cast<uint32_t>(last_day_nanos >> 32);
Review Comment:
Sure, but the individual 32-bit values still need to be byte-swapped, right?
Re: [PR] GH-48204: [C++][Parquet] Fix Column Reader & Writer logic to enable Parquet DB support on s390x [arrow]
pitrou commented on code in PR #48205:
URL: https://github.com/apache/arrow/pull/48205#discussion_r2601507548
## cpp/src/parquet/column_writer.cc: ##
@@ -2578,13 +2579,31 @@ struct SerializeFunctor<
if constexpr (std::is_same_v) {
*p++ = ::arrow::bit_util::ToBigEndian(u64_in[0]);
} else if constexpr (std::is_same_v)
{
+#if ARROW_LITTLE_ENDIAN
Review Comment:
I will write again what I said above:
> If we're on a big-endian system, this entire code is unnecessary and we can just use the FIXED_LEN_BYTE_ARRAY SerializeFunctor.

Of course, I might be missing something and some dedicated code might still be necessary, but I don't think it is.
Re: [PR] GH-48204: [C++][Parquet] Fix Column Reader & Writer logic to enable Parquet DB support on s390x [arrow]
Vishwanatha-HD commented on code in PR #48205:
URL: https://github.com/apache/arrow/pull/48205#discussion_r2599921254
## cpp/src/parquet/column_writer.cc: ##
@@ -2578,13 +2579,31 @@ struct SerializeFunctor<
if constexpr (std::is_same_v) {
*p++ = ::arrow::bit_util::ToBigEndian(u64_in[0]);
} else if constexpr (std::is_same_v)
{
+#if ARROW_LITTLE_ENDIAN
Review Comment:
@pitrou, I can come up with optimized logic as below. Please let me know if you are fine with it:
```
FixedLenByteArray FixDecimalEndianness(const uint8_t* in, int64_t offset) {
  auto out = reinterpret_cast<const uint8_t*>(scratch) + offset;
  constexpr int32_t kSize =
      std::is_same_v<ArrowType, ::arrow::Decimal32Type>  ? 4 :
      std::is_same_v<ArrowType, ::arrow::Decimal64Type>  ? 8 :
      std::is_same_v<ArrowType, ::arrow::Decimal128Type> ? 16 :
      std::is_same_v<ArrowType, ::arrow::Decimal256Type> ? 32 :
                                                           0;
  static_assert(kSize != 0, "Unsupported decimal type");
  uint8_t* out_bytes = reinterpret_cast<uint8_t*>(scratch);
  // Arrow guarantees: ByteSwapBuffer swaps only on little-endian.
  // On big-endian it becomes memcpy.
  ::arrow::bit_util::ByteSwapBuffer(out_bytes, in, kSize);
  scratch += kSize;
  return FixedLenByteArray(out);
}
```
Re: [PR] GH-48204: [C++][Parquet] Fix Column Reader & Writer logic to enable Parquet DB support on s390x [arrow]
Vishwanatha-HD commented on code in PR #48205:
URL: https://github.com/apache/arrow/pull/48205#discussion_r2599786362
## cpp/src/parquet/column_writer.h: ##
@@ -267,9 +267,14 @@ inline void ArrowTimestampToImpalaTimestamp(const int64_t time, Int96* impala_ti
   }
   impala_timestamp->value[2] = static_cast<uint32_t>(julian_days);
   uint64_t last_day_nanos = static_cast<uint64_t>(last_day_units) * NanosecondsPerUnit;
+#if ARROW_LITTLE_ENDIAN
   // impala_timestamp will be unaligned every other entry so do memcpy instead
   // of assign and reinterpret cast to avoid undefined behavior.
-  std::memcpy(impala_timestamp, &last_day_nanos, sizeof(uint64_t));
+  std::memcpy(impala_timestamp, &last_day_nanos, sizeof(int64_t));
+#else
+  (*impala_timestamp).value[0] = static_cast<uint32_t>(last_day_nanos);
+  (*impala_timestamp).value[1] = static_cast<uint32_t>(last_day_nanos >> 32);
Review Comment:
@pitrou, if I am understanding correctly, this logic is trying to put the lower 4 bytes (32 bits) of `last_day_nanos` into value[0] and the upper 4 bytes (32 bits) into value[1]. A memcpy of 8 bytes (64 bits) would keep the bytes as they are, hence we won't be able to use memcpy on BE machines. Hope I am making things clear here.
Re: [PR] GH-48204: [C++][Parquet] Fix Column Reader & Writer logic to enable Parquet DB support on s390x [arrow]
pitrou commented on code in PR #48205:
URL: https://github.com/apache/arrow/pull/48205#discussion_r2584474050
## cpp/src/parquet/column_writer.h: ##
@@ -267,9 +267,14 @@ inline void ArrowTimestampToImpalaTimestamp(const int64_t time, Int96* impala_ti
   }
   impala_timestamp->value[2] = static_cast<uint32_t>(julian_days);
   uint64_t last_day_nanos = static_cast<uint64_t>(last_day_units) * NanosecondsPerUnit;
+#if ARROW_LITTLE_ENDIAN
   // impala_timestamp will be unaligned every other entry so do memcpy instead
   // of assign and reinterpret cast to avoid undefined behavior.
-  std::memcpy(impala_timestamp, &last_day_nanos, sizeof(uint64_t));
+  std::memcpy(impala_timestamp, &last_day_nanos, sizeof(int64_t));
+#else
+  (*impala_timestamp).value[0] = static_cast<uint32_t>(last_day_nanos);
+  (*impala_timestamp).value[1] = static_cast<uint32_t>(last_day_nanos >> 32);
Review Comment:
Why is this not changing the endianness of values?
Re: [PR] GH-48204: [C++][Parquet] Fix Column Reader & Writer logic to enable Parquet DB support on s390x [arrow]
pitrou commented on code in PR #48205:
URL: https://github.com/apache/arrow/pull/48205#discussion_r2584469301
## cpp/src/parquet/column_writer.cc: ##
@@ -2603,7 +2622,24 @@ struct SerializeFunctor<
 template <>
 struct SerializeFunctor<::parquet::FLBAType, ::arrow::HalfFloatType> {
   Status Serialize(const ::arrow::HalfFloatArray& array, ArrowWriteContext*, FLBA* out) {
+#if ARROW_LITTLE_ENDIAN
+    return SerializeLittleEndianValues(array, array.raw_values(), out);
+#else
     const uint16_t* values = array.raw_values();
+    const int64_t length = array.length();
Review Comment:
We already have endian-fixing logic for FLBA in the decimal serializer
above. It should be factored out and reused instead of reinventing something
different here.
Re: [PR] GH-48204: [C++][Parquet] Fix Column Reader & Writer logic to enable Parquet DB support on s390x [arrow]
pitrou commented on code in PR #48205:
URL: https://github.com/apache/arrow/pull/48205#discussion_r2584461068
## cpp/src/parquet/column_writer.cc: ##
@@ -2578,13 +2579,31 @@ struct SerializeFunctor<
if constexpr (std::is_same_v) {
*p++ = ::arrow::bit_util::ToBigEndian(u64_in[0]);
} else if constexpr (std::is_same_v)
{
+#if ARROW_LITTLE_ENDIAN
Review Comment:
@Vishwanatha-HD Sure, but you haven't addressed the initial comment.
Re: [PR] GH-48204: [C++][Parquet] Fix Column Reader & Writer logic to enable Parquet DB support on s390x [arrow]
kiszk commented on code in PR #48205:
URL: https://github.com/apache/arrow/pull/48205#discussion_r2583522725
## cpp/src/parquet/column_writer.cc: ##
@@ -2603,7 +2622,24 @@ struct SerializeFunctor<
 template <>
 struct SerializeFunctor<::parquet::FLBAType, ::arrow::HalfFloatType> {
   Status Serialize(const ::arrow::HalfFloatArray& array, ArrowWriteContext*, FLBA* out) {
+#if ARROW_LITTLE_ENDIAN
+    return SerializeLittleEndianValues(array, array.raw_values(), out);
+#else
     const uint16_t* values = array.raw_values();
+    const int64_t length = array.length();
Review Comment:
Can we create a big-endian version of `SerializeLittleEndianValues()` that calls
`::arrow::bit_util::ToLittleEndian()` when `values[i]` is accessed?
Then we could use it instead of storing converted values into a vector, which
would reduce memory accesses.
Re: [PR] GH-48204: [C++][Parquet] Fix Column Reader & Writer logic to enable Parquet DB support on s390x [arrow]
Vishwanatha-HD commented on code in PR #48205:
URL: https://github.com/apache/arrow/pull/48205#discussion_r2580322478
## cpp/src/parquet/column_writer.h: ##
@@ -260,13 +261,21 @@ constexpr int64_t kJulianEpochOffsetDays = INT64_C(2440588);
 template <int64_t UnitPerDay, int64_t NanosecondsPerUnit>
 inline void ArrowTimestampToImpalaTimestamp(const int64_t time, Int96* impala_timestamp) {
   int64_t julian_days = (time / UnitPerDay) + kJulianEpochOffsetDays;
+#if ARROW_LITTLE_ENDIAN
   (*impala_timestamp).value[2] = (uint32_t)julian_days;
+#endif
   int64_t last_day_units = time % UnitPerDay;
   auto last_day_nanos = last_day_units * NanosecondsPerUnit;
+#if ARROW_LITTLE_ENDIAN
   // impala_timestamp will be unaligned every other entry so do memcpy instead
   // of assign and reinterpret cast to avoid undefined behavior.
   std::memcpy(impala_timestamp, &last_day_nanos, sizeof(int64_t));
+#else
+  (*impala_timestamp).value[0] = static_cast<uint32_t>(last_day_nanos);
+  (*impala_timestamp).value[1] = static_cast<uint32_t>(last_day_nanos >> 32);
Review Comment:
Hi @kou, I tried running the test cases with the changes you are referring to
above. Unfortunately, it's not working; I am seeing the following test
failures on a big-endian (s390x) machine:
[ FAILED ] 5 tests, listed below:
[ FAILED ] TestArrowReadWrite.UseDeprecatedInt96
[ FAILED ] TestArrowReadWrite.DownsampleDeprecatedInt96
[ FAILED ] TestArrowReadWrite.ParquetVersionTimestampDifferences
[ FAILED ] ArrowReadWrite.NestedRequiredOuterOptional
[ FAILED ] TestImpalaConversion.ArrowTimestampToImpalaTimestamp
5 FAILED TESTS
These testcases are part of "parquet-arrow-reader-writer-test" testsuites..
Re: [PR] GH-48204: [C++][Parquet] Fix Column Reader & Writer logic to enable Parquet DB support on s390x [arrow]
Vishwanatha-HD commented on code in PR #48205:
URL: https://github.com/apache/arrow/pull/48205#discussion_r2571991742
## cpp/src/parquet/column_writer.cc: ##
@@ -2578,13 +2579,31 @@ struct SerializeFunctor<
if constexpr (std::is_same_v) {
*p++ = ::arrow::bit_util::ToBigEndian(u64_in[0]);
} else if constexpr (std::is_same_v)
{
+#if ARROW_LITTLE_ENDIAN
Review Comment:
@pitrou, I have rebased this onto git main and removed the piece of code below:
```
-#if ARROW_LITTLE_ENDIAN
- if (num_bytes < 0 || num_bytes > data_size - 4) {
-#else
if (num_bytes < 0 || num_bytes > data_size) {
---> Only retaining this line now..
-#endif
```
Re: [PR] GH-48204: [C++][Parquet] Fix Column Reader & Writer logic to enable Parquet DB support on s390x [arrow]
Vishwanatha-HD commented on code in PR #48205:
URL: https://github.com/apache/arrow/pull/48205#discussion_r2571966345
## cpp/src/parquet/column_writer.cc: ##
@@ -2578,13 +2579,31 @@ struct SerializeFunctor<
if constexpr (std::is_same_v) {
*p++ = ::arrow::bit_util::ToBigEndian(u64_in[0]);
} else if constexpr (std::is_same_v)
{
+#if ARROW_LITTLE_ENDIAN
Review Comment:
@pitrou, OK, sure. Thanks.
> @Vishwanatha-HD Please don't resolve discussions until they are actually
resolved. This one hasn't been addressed.
Re: [PR] GH-48204: [C++][Parquet] Fix Column Reader & Writer logic to enable Parquet DB support on s390x [arrow]
pitrou commented on code in PR #48205:
URL: https://github.com/apache/arrow/pull/48205#discussion_r2565105141
## cpp/src/parquet/column_writer.cc: ##
@@ -2578,13 +2579,31 @@ struct SerializeFunctor<
if constexpr (std::is_same_v) {
*p++ = ::arrow::bit_util::ToBigEndian(u64_in[0]);
} else if constexpr (std::is_same_v)
{
+#if ARROW_LITTLE_ENDIAN
Review Comment:
@Vishwanatha-HD Please don't resolve discussions until they are actually
resolved. This one hasn't been addressed.
Re: [PR] GH-48204: [C++][Parquet] Fix Column Reader & Writer logic to enable Parquet DB support on s390x [arrow]
pitrou commented on code in PR #48205:
URL: https://github.com/apache/arrow/pull/48205#discussion_r2565094873
## cpp/src/parquet/column_reader.cc: ##
@@ -132,7 +134,11 @@ int LevelDecoder::SetData(Encoding::type encoding, int16_t
max_level,
"Number of buffered values too large (corrupt data page?)");
}
num_bytes = static_cast(bit_util::BytesForBits(num_bits));
+#if ARROW_LITTLE_ENDIAN
if (num_bytes < 0 || num_bytes > data_size - 4) {
+#else
+ if (num_bytes < 0 || num_bytes > data_size) {
+#endif
Review Comment:
@Vishwanatha-HD Can you rebase/merge from git main and remove this change?
Re: [PR] GH-48204: [C++][Parquet] Fix Column Reader & Writer logic to enable Parquet DB support on s390x [arrow]
Vishwanatha-HD commented on code in PR #48205:
URL: https://github.com/apache/arrow/pull/48205#discussion_r2559802079
## cpp/src/parquet/column_writer.cc: ##
@@ -2578,13 +2579,31 @@ struct SerializeFunctor<
if constexpr (std::is_same_v) {
*p++ = ::arrow::bit_util::ToBigEndian(u64_in[0]);
} else if constexpr (std::is_same_v)
{
+#if ARROW_LITTLE_ENDIAN
Review Comment:
@pitrou, thanks for your review comments. I will probably work on this change
in the next pass. Thanks.
Re: [PR] GH-48204: [C++][Parquet] Fix Column Reader & Writer logic to enable Parquet DB support on s390x [arrow]
pitrou commented on code in PR #48205:
URL: https://github.com/apache/arrow/pull/48205#discussion_r2558909398
## cpp/src/parquet/column_writer.h: ##
@@ -260,13 +261,21 @@ constexpr int64_t kJulianEpochOffsetDays = INT64_C(2440588);
 template <int64_t UnitPerDay, int64_t NanosecondsPerUnit>
 inline void ArrowTimestampToImpalaTimestamp(const int64_t time, Int96* impala_timestamp) {
Review Comment:
Ok, let's keep it like that for now.
Re: [PR] GH-48204: [C++][Parquet] Fix Column Reader & Writer logic to enable Parquet DB support on s390x [arrow]
Vishwanatha-HD commented on code in PR #48205:
URL: https://github.com/apache/arrow/pull/48205#discussion_r2558748029
## cpp/src/parquet/column_writer.h: ##
@@ -260,13 +261,21 @@ constexpr int64_t kJulianEpochOffsetDays = INT64_C(2440588);
 template <int64_t UnitPerDay, int64_t NanosecondsPerUnit>
 inline void ArrowTimestampToImpalaTimestamp(const int64_t time, Int96* impala_timestamp) {
Review Comment:
@pitrou, as you suggested, I tried moving the definition to column_writer.cc
while keeping the declaration in column_writer.h, but the compiler throws the
error below.
The basic issue is that column_writer.h is included by many other files, such
as the stream writers and file writers, and they now complain that they cannot
see the function definition.
If I end up providing the definition everywhere just to fix these compile
errors, it would be code duplication, so I will leave this as is for now.
Thanks.
In file included from
/home/vishwa/golang/arrow/cpp/src/parquet/stream_writer.h:31,
from /home/vishwa/golang/arrow/cpp/src/parquet/stream_writer.cc:18:
/home/vishwa/golang/arrow/cpp/src/parquet/column_writer.h:261:13: error:
inline function ‘void
parquet::internal::ArrowTimestampToImpalaTimestamp(int64_t, p
arquet::Int96*) [with long int UnitPerDay = 86400; long int
NanosecondsPerUnit = 10; int64_t = long int]’ used but never defined
[-Werror]
261 | inline void ArrowTimestampToImpalaTimestamp(const int64_t time,
Int96* impala_timestamp);
| ^~~
/home/vishwa/golang/arrow/cpp/src/parquet/column_writer.h:261:13: error:
inline function ‘void
parquet::internal::ArrowTimestampToImpalaTimestamp(int64_t, p
arquet::Int96*) [with long int UnitPerDay = 8640; long int
NanosecondsPerUnit = 100; int64_t = long int]’ used but never defined
[-Werror]
/home/vishwa/golang/arrow/cpp/src/parquet/column_writer.h:261:13: error:
inline function ‘void
parquet::internal::ArrowTimestampToImpalaTimestamp(int64_t, p
arquet::Int96*) [with long int UnitPerDay = 864; long int
NanosecondsPerUnit = 1000; int64_t = long int]’ used but never defined [-Werror]
/home/vishwa/golang/arrow/cpp/src/parquet/column_writer.h:261:13: error:
inline function ‘void
parquet::internal::ArrowTimestampToImpalaTimestamp(int64_t, p
arquet::Int96*) [with long int UnitPerDay = 864000; long int
NanosecondsPerUnit = 1; int64_t = long int]’ used but never defined [-Werror]
In file included from
/home/vishwa/golang/arrow/cpp/src/parquet/file_writer.cc:29:
/home/vishwa/golang/arrow/cpp/src/parquet/column_writer.h:261:13: error:
inline function ‘void
parquet::internal::ArrowTimestampToImpalaTimestamp(int64_t, p
arquet::Int96*) [with long int UnitPerDay = 86400; long int
NanosecondsPerUnit = 10; int64_t = long int]’ used but never defined
[-Werror]
261 | inline void ArrowTimestampToImpalaTimestamp(const int64_t time,
Int96* impala_timestamp);
Re: [PR] GH-48204: [C++][Parquet] Fix Column Reader & Writer logic to enable Parquet DB support on s390x [arrow]
Vishwanatha-HD commented on PR #48205: URL: https://github.com/apache/arrow/pull/48205#issuecomment-3572600519 > Thanks @Vishwanatha-HD , and thanks for splitting up like this. @pitrou, sure, my pleasure. Thanks a lot for spending so much time reviewing the code change and giving your comments. I appreciate it!
Re: [PR] GH-48204: [C++][Parquet] Fix Column Reader & Writer logic to enable Parquet DB support on s390x [arrow]
pitrou commented on code in PR #48205:
URL: https://github.com/apache/arrow/pull/48205#discussion_r2557635517
##
cpp/src/parquet/column_writer.h:
##
@@ -260,13 +261,21 @@ constexpr int64_t kJulianEpochOffsetDays = INT64_C(2440588);
template <int64_t UnitPerDay, int64_t NanosecondsPerUnit>
inline void ArrowTimestampToImpalaTimestamp(const int64_t time, Int96* impala_timestamp) {
Review Comment:
> @pitrou.. Please note that ArrowTimestampToImpalaTimestamp() is an inline function.. Hence I won't be able to move this to the column_writer.cc file.. If we have to move it, then the inlining has to be removed

No, it doesn't have to be. You just have to ensure that the definition is visible at the call point.
Re: [PR] GH-48204: [C++][Parquet] Fix Column Reader & Writer logic to enable Parquet DB support on s390x [arrow]
Vishwanatha-HD commented on code in PR #48205: URL: https://github.com/apache/arrow/pull/48205#discussion_r2557632141 ## cpp/src/parquet/column_writer.h: ## @@ -23,6 +23,7 @@ #include "arrow/type_fwd.h" #include "arrow/util/compression.h" +#include "arrow/util/endian.h" Review Comment: @pitrou.. I have made the appropriate code changes.. Thanks
Re: [PR] GH-48204: [C++][Parquet] Fix Column Reader & Writer logic to enable Parquet DB support on s390x [arrow]
Vishwanatha-HD commented on code in PR #48205:
URL: https://github.com/apache/arrow/pull/48205#discussion_r2557630887
##
cpp/src/parquet/column_writer.h:
##
@@ -260,13 +261,21 @@ constexpr int64_t kJulianEpochOffsetDays = INT64_C(2440588);
template <int64_t UnitPerDay, int64_t NanosecondsPerUnit>
inline void ArrowTimestampToImpalaTimestamp(const int64_t time, Int96* impala_timestamp) {
Review Comment:
@pitrou.. Please note that ArrowTimestampToImpalaTimestamp() is an inline function.. Hence I won't be able to move this to the column_writer.cc file.. If we have to move it, then the inlining has to be removed and in that case, as you know, we are not sure of the performance impact.. Thanks..
Re: [PR] GH-48204: [C++][Parquet] Fix Column Reader & Writer logic to enable Parquet DB support on s390x [arrow]
Vishwanatha-HD commented on code in PR #48205:
URL: https://github.com/apache/arrow/pull/48205#discussion_r2557590480
##
cpp/src/parquet/column_writer.cc:
##
@@ -2621,6 +2641,38 @@ struct SerializeFunctor<::parquet::FLBAType, ::arrow::HalfFloatType> {
    return FLBA{reinterpret_cast<const uint8_t*>(value_ptr)};
  }
};
+#else
+template <>
+struct SerializeFunctor<::parquet::FLBAType, ::arrow::HalfFloatType> {
+  Status Serialize(const ::arrow::HalfFloatArray& array, ArrowWriteContext*, FLBA* out) {
+    const uint16_t* values = array.raw_values();
+    const int64_t length = array.length();
+
+    // Allocate buffer for little-endian converted values
+    converted_values_.resize(length);
+
+    if (array.null_count() == 0) {
+      for (int64_t i = 0; i < length; ++i) {
+        converted_values_[i] = ::arrow::bit_util::ToLittleEndian(values[i]);
+        out[i] = FLBA{reinterpret_cast<const uint8_t*>(&converted_values_[i])};
+      }
+    } else {
+      for (int64_t i = 0; i < length; ++i) {
+        if (array.IsValid(i)) {
+          converted_values_[i] = ::arrow::bit_util::ToLittleEndian(values[i]);
+          out[i] = FLBA{reinterpret_cast<const uint8_t*>(&converted_values_[i])};
+        } else {
+          out[i] = FLBA{};
+        }
+      }
+    }
+    return Status::OK();
+  }
+
+ private:
+  std::vector<uint16_t> converted_values_;
+};
+#endif
Review Comment:
Hi @kou,
I have made the necessary code changes as you have suggested above.. I
tested it on s390x and it just works fine.. Thanks..
Re: [PR] GH-48204: [C++][Parquet] Fix Column Reader & Writer logic to enable Parquet DB support on s390x [arrow]
Vishwanatha-HD commented on code in PR #48205:
URL: https://github.com/apache/arrow/pull/48205#discussion_r2557265427
##
cpp/src/parquet/column_writer.h:
##
@@ -260,13 +261,21 @@ constexpr int64_t kJulianEpochOffsetDays = INT64_C(2440588);
template <int64_t UnitPerDay, int64_t NanosecondsPerUnit>
inline void ArrowTimestampToImpalaTimestamp(const int64_t time, Int96* impala_timestamp) {
  int64_t julian_days = (time / UnitPerDay) + kJulianEpochOffsetDays;
+#if ARROW_LITTLE_ENDIAN
  (*impala_timestamp).value[2] = (uint32_t)julian_days;
+#endif
Review Comment:
@kou.. Thanks for pointing this out.. I have made an appropriate code change
to take care of this comment.. Thanks..
Re: [PR] GH-48204: [C++][Parquet] Fix Column Reader & Writer logic to enable Parquet DB support on s390x [arrow]
pitrou commented on PR #48205: URL: https://github.com/apache/arrow/pull/48205#issuecomment-3570233704 Thanks @Vishwanatha-HD , and thanks for splitting up like this.
Re: [PR] GH-48204: [C++][Parquet] Fix Column Reader & Writer logic to enable Parquet DB support on s390x [arrow]
Vishwanatha-HD commented on PR #48205: URL: https://github.com/apache/arrow/pull/48205#issuecomment-3570045067 > I'm frankly surprised that so few changes are required, given that Parquet C++ was never successfully tested on BE systems before. @Vishwanatha-HD Did you try to read the files in https://github.com/apache/parquet-testing/tree/master/data and check the contents were properly decoded? Hi @pitrou.. thanks for your comments.. Please note that these are not the only changes required to enable support on s390x.. There are 12 other PRs that I have raised.. The main issue link is: https://github.com/apache/arrow/issues/48151 Please check that and you will get the links to all the other remaining PRs.
Re: [PR] GH-48204: [C++][Parquet] Fix Column Reader & Writer logic to enable Parquet DB support on s390x [arrow]
pitrou commented on PR #48205: URL: https://github.com/apache/arrow/pull/48205#issuecomment-3569858830 I'm frankly surprised that so few changes are required, given that Parquet C++ was never successfully tested on BE systems before. @Vishwanatha-HD Did you try to read the files in https://github.com/apache/parquet-testing/tree/master/data and check the contents were properly decoded?
Re: [PR] GH-48204: [C++][Parquet] Fix Column Reader & Writer logic to enable Parquet DB support on s390x [arrow]
pitrou commented on code in PR #48205: URL: https://github.com/apache/arrow/pull/48205#discussion_r2555493413 ## cpp/src/parquet/column_writer.h: ## @@ -23,6 +23,7 @@ #include "arrow/type_fwd.h" #include "arrow/util/compression.h" +#include "arrow/util/endian.h" Review Comment: There shouldn't be a need to include this in this header file, IMHO.
Re: [PR] GH-48204: [C++][Parquet] Fix Column Reader & Writer logic to enable Parquet DB support on s390x [arrow]
pitrou commented on code in PR #48205:
URL: https://github.com/apache/arrow/pull/48205#discussion_r2555491624
##
cpp/src/parquet/column_writer.h:
##
@@ -260,13 +261,21 @@ constexpr int64_t kJulianEpochOffsetDays = INT64_C(2440588);
template <int64_t UnitPerDay, int64_t NanosecondsPerUnit>
inline void ArrowTimestampToImpalaTimestamp(const int64_t time, Int96* impala_timestamp) {
Review Comment:
Please let's move the function definition to `column_writer.cc`.
Re: [PR] GH-48204: [C++][Parquet] Fix Column Reader & Writer logic to enable Parquet DB support on s390x [arrow]
pitrou commented on code in PR #48205:
URL: https://github.com/apache/arrow/pull/48205#discussion_r2555481395
##
cpp/src/parquet/column_writer.cc:
##
@@ -2621,6 +2641,38 @@ struct SerializeFunctor<::parquet::FLBAType, ::arrow::HalfFloatType> {
    return FLBA{reinterpret_cast<const uint8_t*>(value_ptr)};
  }
};
+#else
+template <>
+struct SerializeFunctor<::parquet::FLBAType, ::arrow::HalfFloatType> {
+  Status Serialize(const ::arrow::HalfFloatArray& array, ArrowWriteContext*, FLBA* out) {
+    const uint16_t* values = array.raw_values();
+    const int64_t length = array.length();
+
+    // Allocate buffer for little-endian converted values
+    converted_values_.resize(length);
+
+    if (array.null_count() == 0) {
+      for (int64_t i = 0; i < length; ++i) {
+        converted_values_[i] = ::arrow::bit_util::ToLittleEndian(values[i]);
+        out[i] = FLBA{reinterpret_cast<const uint8_t*>(&converted_values_[i])};
+      }
+    } else {
+      for (int64_t i = 0; i < length; ++i) {
+        if (array.IsValid(i)) {
+          converted_values_[i] = ::arrow::bit_util::ToLittleEndian(values[i]);
+          out[i] = FLBA{reinterpret_cast<const uint8_t*>(&converted_values_[i])};
+        } else {
+          out[i] = FLBA{};
+        }
+      }
+    }
+    return Status::OK();
+  }
+
+ private:
+  std::vector<uint16_t> converted_values_;
+};
+#endif
Review Comment:
Please, reuse the `AllocateScratch` approach from the Decimal serializer.
You can introduce a mixin class if that makes things easier.
Re: [PR] GH-48204: [C++][Parquet] Fix Column Reader & Writer logic to enable Parquet DB support on s390x [arrow]
pitrou commented on code in PR #48205:
URL: https://github.com/apache/arrow/pull/48205#discussion_r2555472890
##
cpp/src/parquet/column_writer.cc:
##
@@ -2578,13 +2579,31 @@ struct SerializeFunctor<
  if constexpr (std::is_same_v) {
    *p++ = ::arrow::bit_util::ToBigEndian(u64_in[0]);
  } else if constexpr (std::is_same_v) {
+#if ARROW_LITTLE_ENDIAN
Review Comment:
Please take a step back and read the comments above:
```c++
// Requires a custom serializer because decimal in parquet are in big-endian
// format. Thus, a temporary local buffer is required.
```
If we're on a big-endian system, this entire code is unnecessary and we can
just use the FIXED_LEN_BYTE_ARRAY `SerializeFunctor`.
Re: [PR] GH-48204: [C++][Parquet] Fix Column Reader & Writer logic to enable Parquet DB support on s390x [arrow]
pitrou commented on code in PR #48205:
URL: https://github.com/apache/arrow/pull/48205#discussion_r2555448401
##
cpp/src/parquet/column_reader.cc:
##
@@ -132,7 +134,11 @@ int LevelDecoder::SetData(Encoding::type encoding, int16_t max_level,
        "Number of buffered values too large (corrupt data page?)");
  }
  num_bytes = static_cast<int32_t>(bit_util::BytesForBits(num_bits));
+#if ARROW_LITTLE_ENDIAN
  if (num_bytes < 0 || num_bytes > data_size - 4) {
+#else
+  if (num_bytes < 0 || num_bytes > data_size) {
+#endif
Review Comment:
https://github.com/apache/arrow/pull/48235
Re: [PR] GH-48204: [C++][Parquet] Fix Column Reader & Writer logic to enable Parquet DB support on s390x [arrow]
pitrou commented on code in PR #48205:
URL: https://github.com/apache/arrow/pull/48205#discussion_r2555377529
##
cpp/src/parquet/column_reader.cc:
##
@@ -132,7 +134,11 @@ int LevelDecoder::SetData(Encoding::type encoding, int16_t max_level,
        "Number of buffered values too large (corrupt data page?)");
  }
  num_bytes = static_cast<int32_t>(bit_util::BytesForBits(num_bits));
+#if ARROW_LITTLE_ENDIAN
  if (num_bytes < 0 || num_bytes > data_size - 4) {
+#else
+  if (num_bytes < 0 || num_bytes > data_size) {
+#endif
Review Comment:
Thanks for the ping @kou . I've re-read through this code and I now think
the original change was a mistake. I'll submit a separate issue/PR to fix it.
Re: [PR] GH-48204: [C++][Parquet] Fix Column Reader & Writer logic to enable Parquet DB support on s390x [arrow]
k8ika0s commented on PR #48205: URL: https://github.com/apache/arrow/pull/48205#issuecomment-3568405152

@Vishwanatha-HD Working through this one, I'm reminded how many odd little corners show up when Arrow's layout meets Parquet's expectations, especially around levels, decimals, and the legacy INT96 bits. Looking at the pieces that overlap with the work I've been doing, the overall direction makes sense. A few notes from what I've seen on real s390x hardware:

**• BIT_PACKED level headers** Your patch keeps the `data_size - 4` bound under `ARROW_LITTLE_ENDIAN`, whereas my tree leans on accepting the full BIT_PACKED buffer and logging failures rather than early-bounding. Neither approach is wrong, but on BE machines I've found that the "minus 4" guard sometimes rejects buffers that are actually fine, depending on how many values the upstream encoder produced.

**• Decimal serialization** This is one of the trickier spots. Parquet expects decimals in a big-endian 128-bit payload, but Arrow materializes them in little-endian limbs even on BE hardware. In my implementation I reverse the Arrow words (`[low, high]` → `[high, low]`) before handing them to the writer so the final byte stream matches the canonical Parquet format. Your patch uses `ToBigEndian` on each limb directly in host order, which works for many cases but can produce a differently ordered representation when Arrow's in-memory layout doesn't match the 128-bit big-endian wire format. Just sharing that in case you've seen similar behavior when mixing different decimal widths.

**• Half-floats in FLBA** The BE path you added with `ToLittleEndian(values[i])` aligns with the intent. I ended up staging the FLBA structs and the 2-byte payloads together in one scratch buffer, mostly because some downstream consumers treat the pointer lifetime very strictly. Either way, normalizing those 2-byte halves before page assembly helps avoid the cross-architecture drift I've run into.

**• Paging / DoInBatches** Your rewrite to enforce `max_rows_per_page` is a meaningful cleanup. My patches didn't touch this area, so no conflicts there; but just to mention it, keeping the paging logic predictable on BE made debugging the level stream quite a bit easier for me.

**• INT96 (Impala timestamp)** Your implementation writes host-order limbs on BE and memcpy on LE. In my case I leaned heavily on always emitting LE limbs so the decode path doesn't have to branch on architecture. Both approaches work as long as the corresponding reader expects the same convention.

None of this is blocking; just trying to pass along the details I've seen crop up when running the full parquet-encode → parquet-decode cycle on big-endian hardware.
Re: [PR] GH-48204: [C++][Parquet] Fix Column Reader & Writer logic to enable Parquet DB support on s390x [arrow]
kou commented on code in PR #48205:
URL: https://github.com/apache/arrow/pull/48205#discussion_r2553142118
##
cpp/src/parquet/column_reader.cc:
##
@@ -132,7 +134,11 @@ int LevelDecoder::SetData(Encoding::type encoding, int16_t max_level,
        "Number of buffered values too large (corrupt data page?)");
  }
  num_bytes = static_cast<int32_t>(bit_util::BytesForBits(num_bits));
+#if ARROW_LITTLE_ENDIAN
  if (num_bytes < 0 || num_bytes > data_size - 4) {
+#else
+  if (num_bytes < 0 || num_bytes > data_size) {
+#endif
Review Comment:
@pitrou You added `- 4` in #6848. Do you think that we need `- 4` with big
endian too?
##
cpp/src/parquet/column_writer.h:
##
@@ -260,13 +261,21 @@ constexpr int64_t kJulianEpochOffsetDays = INT64_C(2440588);
template <int64_t UnitPerDay, int64_t NanosecondsPerUnit>
inline void ArrowTimestampToImpalaTimestamp(const int64_t time, Int96* impala_timestamp) {
  int64_t julian_days = (time / UnitPerDay) + kJulianEpochOffsetDays;
+#if ARROW_LITTLE_ENDIAN
  (*impala_timestamp).value[2] = (uint32_t)julian_days;
+#endif
  int64_t last_day_units = time % UnitPerDay;
  auto last_day_nanos = last_day_units * NanosecondsPerUnit;
+#if ARROW_LITTLE_ENDIAN
  // impala_timestamp will be unaligned every other entry so do memcpy instead
  // of assign and reinterpret cast to avoid undefined behavior.
  std::memcpy(impala_timestamp, &last_day_nanos, sizeof(int64_t));
+#else
+  (*impala_timestamp).value[0] = static_cast<uint32_t>(last_day_nanos);
+  (*impala_timestamp).value[1] = static_cast<uint32_t>(last_day_nanos >> 32);
Review Comment:
Can we use the following instead of `#if`?
```cpp
auto last_day_nanos = last_day_units * NanosecondsPerUnit;
auto last_day_nanos_little_endian =
    ::arrow::bit_util::ToLittleEndian(last_day_nanos);
std::memcpy(impala_timestamp, &last_day_nanos_little_endian, sizeof(int64_t));
```
##
cpp/src/parquet/column_writer.cc:
##
@@ -2621,6 +2641,38 @@ struct SerializeFunctor<::parquet::FLBAType, ::arrow::HalfFloatType> {
    return FLBA{reinterpret_cast<const uint8_t*>(value_ptr)};
  }
};
+#else
+template <>
+struct SerializeFunctor<::parquet::FLBAType, ::arrow::HalfFloatType> {
+  Status Serialize(const ::arrow::HalfFloatArray& array, ArrowWriteContext*, FLBA* out) {
+    const uint16_t* values = array.raw_values();
+    const int64_t length = array.length();
+
+    // Allocate buffer for little-endian converted values
+    converted_values_.resize(length);
+
+    if (array.null_count() == 0) {
+      for (int64_t i = 0; i < length; ++i) {
+        converted_values_[i] = ::arrow::bit_util::ToLittleEndian(values[i]);
+        out[i] = FLBA{reinterpret_cast<const uint8_t*>(&converted_values_[i])};
+      }
+    } else {
+      for (int64_t i = 0; i < length; ++i) {
+        if (array.IsValid(i)) {
+          converted_values_[i] = ::arrow::bit_util::ToLittleEndian(values[i]);
+          out[i] = FLBA{reinterpret_cast<const uint8_t*>(&converted_values_[i])};
+        } else {
+          out[i] = FLBA{};
+        }
+      }
+    }
+    return Status::OK();
+  }
+
+ private:
+  std::vector<uint16_t> converted_values_;
+};
+#endif
Review Comment:
Could you share implementation as much as possible, something like:
```cpp
template <>
struct SerializeFunctor<::parquet::FLBAType, ::arrow::HalfFloatType> {
  Status Serialize(const ::arrow::HalfFloatArray& array, ArrowWriteContext*, FLBA* out) {
#if ARROW_LITTLE_ENDIAN
    return SerializeLittleEndianValues(array, array.raw_values(), out);
#else
    const uint16_t* values = array.raw_values();
    const int64_t length = array.length();
    converted_values_.resize(length);
    for (int64_t i = 0; i < length; ++i) {
      // We don't need IsValid() here. Non valid values are just ignored in
      // SerializeLittleEndianValues().
      converted_values_[i] = ::arrow::bit_util::ToLittleEndian(values[i]);
    }
    return SerializeLittleEndianValues(array, converted_values_.data(), out);
#endif
  }

 private:
  Status SerializeLittleEndianValues(const ::arrow::HalfFloatArray& array,
                                     const uint16_t* values, FLBA* out) {
    if (array.null_count() == 0) {
      for (int64_t i = 0; i < array.length(); ++i) {
        out[i] = ToFLBA(&values[i]);
      }
    } else {
      for (int64_t i = 0; i < array.length(); ++i) {
        out[i] = array.IsValid(i) ? ToFLBA(&values[i]) : FLBA{};
      }
    }
    return Status::OK();
  }

  FLBA ToFLBA(const uint16_t* value_ptr) const {
    return FLBA{reinterpret_cast<const uint8_t*>(value_ptr)};
  }

#if !ARROW_LITTLE_ENDIAN
  std::vector<uint16_t> converted_values_;
#endif
};
```
##
cpp/src/parquet/column_writer.h:
##
@@ -260,13 +261,21 @@ constexpr int64_t kJulianEpochOffsetDays = INT64_C(2440588);
template <int64_t UnitPerDay, int64_t NanosecondsPerUnit>
inline void ArrowTimestampToImpalaTimestamp(const int64_t time, Int96* impala_timestamp) {
  int64_t julian_days = (time / UnitPerDay) + kJulianEpochOffsetDays;
+#if ARROW_LITTLE_ENDIAN
  (*impala_timestamp).value[2] = (uint32_t)jul