Re: [PR] GH-45747: [C++] Remove deprecated ObjectType and FileStatistics, refactor hdfs code [arrow]
AlenkaF commented on PR #45998: URL: https://github.com/apache/arrow/pull/45998#issuecomment-2996561344 @pitrou gentle ping. Would I be too optimistic to try to get it into 21.0.0? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
Re: [PR] GH-45747: [C++] Remove deprecated ObjectType and FileStatistics, refactor hdfs code [arrow]
AlenkaF commented on PR #45998: URL: https://github.com/apache/arrow/pull/45998#issuecomment-2958911523 @benibus @pitrou would you mind taking another look? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
Re: [PR] GH-45747: [C++] Remove deprecated ObjectType and FileStatistics, refactor hdfs code [arrow]
AlenkaF commented on code in PR #45998: URL: https://github.com/apache/arrow/pull/45998#discussion_r2137369780 ## python/pyarrow/includes/libarrow_fs.pxd: ## @@ -301,6 +301,19 @@ cdef extern from "arrow/filesystem/api.h" namespace "arrow::fs" nogil: const CIOContext& io_context, int64_t chunk_size, c_bool use_threads) +CStatus HaveLibHdfs() + +cdef enum HdfsDriver" arrow::io::HdfsDriver": +HdfsDriver_LIBHDFS" arrow::io::HdfsDriver::LIBHDFS" +HdfsDriver_LIBHDFS3" arrow::io::HdfsDriver::LIBHDFS3" Review Comment: I need to remove `HdfsDriver` as it is a leftover from: https://github.com/apache/arrow/commit/4e53749097ba687afd5e67925def2e2802c9#diff-8f9209d74ff1ddfc6fed20ab199b6d158d0052382fae2c7bcef9f0bd96cebc95 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
Re: [PR] GH-45747: [C++] Remove deprecated ObjectType and FileStatistics, refactor hdfs code [arrow]
AlenkaF commented on code in PR #45998:
URL: https://github.com/apache/arrow/pull/45998#discussion_r2136977525
##
cpp/src/arrow/filesystem/hdfs_internal.h:
##
@@ -33,8 +38,13 @@ namespace arrow {
class Status;
+using internal::IOErrorFromErrno;
+
namespace io::internal {
+class HdfsReadableFile;
+class HdfsOutputStream;
+
Review Comment:
I think I wanted to change as little as possible. But yeah, should be
changed to `fs::internal`. Will do!
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
Re: [PR] GH-45747: [C++] Remove deprecated ObjectType and FileStatistics, refactor hdfs code [arrow]
benibus commented on code in PR #45998:
URL: https://github.com/apache/arrow/pull/45998#discussion_r2136802611
##
cpp/src/arrow/filesystem/hdfs.h:
##
@@ -22,18 +22,32 @@
#include
#include "arrow/filesystem/filesystem.h"
-#include "arrow/io/hdfs.h"
#include "arrow/util/uri.h"
+namespace arrow::io::internal {
+class HdfsReadableFile;
+class HdfsOutputStream;
+
+struct HdfsPathInfo;
+} // namespace arrow::io::internal
Review Comment:
Here as well.
##
cpp/src/arrow/filesystem/hdfs_internal.h:
##
@@ -33,8 +38,13 @@ namespace arrow {
class Status;
+using internal::IOErrorFromErrno;
+
namespace io::internal {
+class HdfsReadableFile;
+class HdfsOutputStream;
+
Review Comment:
Why are we using the `io::internal` namespace here instead of `fs::internal`?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
Re: [PR] GH-45747: [C++] Remove deprecated ObjectType and FileStatistics, refactor hdfs code [arrow]
AlenkaF commented on PR #45998: URL: https://github.com/apache/arrow/pull/45998#issuecomment-2894514536 @pitrou I cleaned up the CI failures (others are not related) and am hoping this changes will not be too bad to review :) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
Re: [PR] GH-45747: [C++] Remove deprecated ObjectType and FileStatistics, refactor hdfs code [arrow]
AlenkaF commented on PR #45998: URL: https://github.com/apache/arrow/pull/45998#issuecomment-2890868468 Aha, I see! Thanks, will look into it. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
Re: [PR] GH-45747: [C++] Remove deprecated ObjectType and FileStatistics, refactor hdfs code [arrow]
pitrou commented on PR #45998:
URL: https://github.com/apache/arrow/pull/45998#issuecomment-2890868302
> * The MSVC compiler is complaining about a forward-declared friend
function I'm using in `hdfs.cc`. Do you have any advice on how to better
organise this?
Hmm, rather than trying to find the exact explanation, a simple solution
would be to change these functions into static methods, for example this:
```c++
ARROW_EXPORT Status MakeReadableFile(const std::string& path, int32_t
buffer_size,
const io::IOContext& io_context, LibHdfsShim* driver,
hdfsFS fs, hdfsFile file,
std::shared_ptr* out);
```
would become:
```c++
class ARROW_EXPORT HdfsReadableFile : public RandomAccessFile {
public:
(...)
static Result> Make(
const std::string& path, int32_t buffer_size,
const io::IOContext& io_context, LibHdfsShim* driver,
hdfsFS fs, hdfsFile file);
```
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
Re: [PR] GH-45747: [C++] Remove deprecated ObjectType and FileStatistics, refactor hdfs code [arrow]
pitrou commented on PR #45998:
URL: https://github.com/apache/arrow/pull/45998#issuecomment-2890860924
Hi @AlenkaF
> * Some C++ tests are failing in `hdfs_test`, with the following error:
I think you're misreading the output, the test is actually skipped when the
driver fails unloading, which is normal:
```
(...)
dlopen(/usr/java/latest//lib/amd64/server/libjvm.so) failed:
/usr/java/latest//lib/amd64/server/libjvm.so: cannot open shared object file:
No such file or directory
/arrow/cpp/src/arrow/filesystem/hdfs_internal.cc:294
try_dlopen(libjvm_potential_paths, "libjvm")
/arrow/cpp/src/arrow/filesystem/hdfs.cc:95 ConnectLibHdfs(&driver_)
/arrow/cpp/src/arrow/filesystem/hdfs.cc:725 ptr->impl_->Init()
/arrow/cpp/src/arrow/filesystem/hdfs_test.cc:205: Skipped
Driver not loaded, skipping
```
https://github.com/apache/arrow/actions/runs/15109276550/job/42464862030?pr=45998#step:7:3277
The problem is in the other tests, because it seems a destructor crashes:
```
[ RUN ] TestHadoopFileSystem.DeleteDirContents
!
Running '/build/cpp/debug/arrow-hdfs-test' produced core dump at
'/tmp/core.arrow-hdfs-test.25630', printing backtrace:
[New LWP 25630]
[New LWP 25631]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Core was generated by `/build/cpp/debug/arrow-hdfs-test'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0 0x7fc37b23d5be in arrow::io::internal::LibHdfsShim::Disconnect () at
/arrow/cpp/src/arrow/filesystem/hdfs_internal.cc:340
340 int LibHdfsShim::Disconnect(hdfsFS fs) { return
this->hdfsDisconnect(fs); }
[Current thread is 1 (Thread 0x7fc374633dc0 (LWP 25630))]
(...)
```
https://github.com/apache/arrow/actions/runs/15109276550/job/42464862030?pr=45998#step:7:3281
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
Re: [PR] GH-45747: [C++] Remove deprecated ObjectType and FileStatistics, refactor hdfs code [arrow]
AlenkaF commented on PR #45998: URL: https://github.com/apache/arrow/pull/45998#issuecomment-2890402955 Hi @pitrou, could you please take a quick look at the changes when you have a moment? I've done my best to implement the suggested changes, but am sure there's still room for improvement. A couple of issues I could use your input on: - Some C++ tests are failing in `hdfs_test`, with the following error: ``` /arrow/cpp/src/arrow/filesystem/hdfs_test.cc:90: HadoopFileSystem::Make failed, it is possible when we don't have proper driver on this node, err msg is IOError: Unable to load libjvm ``` I'm not sure how best to resolve this — any guidance would be appreciated. - The MSVC compiler is complaining about a forward-declared friend function I'm using in `hdfs.cc`. Do you have any advice on how to better organise this? The Python and MATLAB test failures are not related. Thanks in advance! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
Re: [PR] GH-45747: [C++] Remove deprecated ObjectType and FileStatistics, refactor hdfs code [arrow]
AlenkaF commented on code in PR #45998:
URL: https://github.com/apache/arrow/pull/45998#discussion_r2073379635
##
cpp/src/arrow/filesystem/hdfs_io.h:
##
@@ -38,18 +40,6 @@ namespace io {
class HdfsReadableFile;
class HdfsOutputStream;
-/// DEPRECATED. Use the FileSystem API in arrow::fs instead.
-struct ObjectType {
- enum type { FILE, DIRECTORY };
-};
-
-/// DEPRECATED. Use the FileSystem API in arrow::fs instead.
-struct ARROW_EXPORT FileStatistics {
- /// Size of file, -1 if finding length is unsupported
- int64_t size;
- ObjectType::type kind;
-};
-
class ARROW_EXPORT FileSystem {
Review Comment:
Ok, will go with the disappearing =)
IIUC `hdfs_io.h` will be removed altogether:
- `FileSystem` and `HadoopFileSystem` will go into `hdfs.cc`, folded into
the public `HadoopFileSystem`
- `HdfsConnectionConfig` will also go into `hdfs.cc`
- declarations that are left will go into `hdfs_internal.cc`
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
Re: [PR] GH-45747: [C++] Remove deprecated ObjectType and FileStatistics, refactor hdfs code [arrow]
AlenkaF commented on code in PR #45998: URL: https://github.com/apache/arrow/pull/45998#discussion_r2073351700 ## cpp/src/arrow/CMakeLists.txt: ## @@ -833,7 +830,11 @@ if(ARROW_FILESYSTEM) PROPERTIES SKIP_UNITY_BUILD_INCLUSION ON) endif() if(ARROW_HDFS) -list(APPEND ARROW_FILESYSTEM_SRCS filesystem/hdfs.cc) +list(APPEND + ARROW_FILESYSTEM_SRCS + filesystem/hdfs.cc + filesystem/hdfs_io.cc + filesystem/hdfs_internal.cc) Review Comment: Ah, it is there already (that explains why nothing failed =) ) https://github.com/apache/arrow/blob/53ef438333efd4a6eb1421b3e812d10e5a8ed633/cpp/src/arrow/CMakeLists.txt#L857-L861 Not sure if the line with `CMAKE_DL_LIBS` is also needed here then? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
Re: [PR] GH-45747: [C++] Remove deprecated ObjectType and FileStatistics, refactor hdfs code [arrow]
AlenkaF commented on code in PR #45998: URL: https://github.com/apache/arrow/pull/45998#discussion_r2073347504 ## cpp/src/arrow/CMakeLists.txt: ## @@ -833,7 +830,11 @@ if(ARROW_FILESYSTEM) PROPERTIES SKIP_UNITY_BUILD_INCLUSION ON) endif() if(ARROW_HDFS) -list(APPEND ARROW_FILESYSTEM_SRCS filesystem/hdfs.cc) +list(APPEND + ARROW_FILESYSTEM_SRCS + filesystem/hdfs.cc + filesystem/hdfs_io.cc + filesystem/hdfs_internal.cc) Review Comment: Hm, yeah. I will add a link as above as it makes sense. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
Re: [PR] GH-45747: [C++] Remove deprecated ObjectType and FileStatistics, refactor hdfs code [arrow]
AlenkaF commented on code in PR #45998: URL: https://github.com/apache/arrow/pull/45998#discussion_r2073309002 ## python/pyarrow/includes/libarrow_fs.pxd: ## @@ -300,6 +300,80 @@ cdef extern from "arrow/filesystem/api.h" namespace "arrow::fs" nogil: const CIOContext& io_context, int64_t chunk_size, c_bool use_threads) +cdef extern from "arrow/filesystem/hdfs_io.h" namespace "arrow::io" nogil: Review Comment: No, most of them aren't and are copied from `libarrow.pxd`. I can remove the unused ones - but am not sure if some external application can actually use them? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
Re: [PR] GH-45747: [C++] Remove deprecated ObjectType and FileStatistics, refactor hdfs code [arrow]
pitrou commented on code in PR #45998:
URL: https://github.com/apache/arrow/pull/45998#discussion_r2065856352
##
cpp/src/arrow/CMakeLists.txt:
##
@@ -408,15 +408,12 @@ arrow_add_object_library(ARROW_IO
io/caching.cc
io/compressed.cc
io/file.cc
- io/hdfs.cc
- io/hdfs_internal.cc
io/interfaces.cc
io/memory.cc
io/slow.cc
io/stdio.cc
io/transform.cc)
foreach(ARROW_IO_TARGET ${ARROW_IO_TARGETS})
- target_link_libraries(${ARROW_IO_TARGET} PRIVATE arrow::hadoop)
if(NOT MSVC)
target_link_libraries(${ARROW_IO_TARGET} PRIVATE ${CMAKE_DL_LIBS})
endif()
Review Comment:
I think `CMAKE_DL_LIBS` was for HDFS, you need to move it as well?
##
cpp/src/arrow/filesystem/CMakeLists.txt:
##
@@ -143,4 +143,12 @@ endif()
if(ARROW_HDFS)
add_arrow_test(hdfs_test EXTRA_LABELS filesystem)
+ add_arrow_test(hdfs_test_io
Review Comment:
Nit: should call this `hdfs_io_test`, or perhaps better `hdfs_internals_test`
##
cpp/src/arrow/filesystem/hdfs_io.h:
##
@@ -38,18 +40,6 @@ namespace io {
class HdfsReadableFile;
Review Comment:
Most of these declarations should IMHO go into the
`arrow::filesystem::internal` namespace, except for `HdfsConnectionConfig`
which can go into `arrow::filesystem`.
##
cpp/src/arrow/filesystem/api.h:
##
@@ -32,3 +32,6 @@
#ifdef ARROW_S3
# include "arrow/filesystem/s3fs.h" // IWYU pragma: export
#endif
+#ifdef ARROW_HDFS
+# include "arrow/filesystem/hdfs_io.h" // IWYU pragma: export
Review Comment:
I don't think we need to include this one here. (also, it's included
implicitly by `hdfs.h`)
##
cpp/src/arrow/filesystem/hdfs_io.h:
##
@@ -38,18 +40,6 @@ namespace io {
class HdfsReadableFile;
class HdfsOutputStream;
-/// DEPRECATED. Use the FileSystem API in arrow::fs instead.
-struct ObjectType {
- enum type { FILE, DIRECTORY };
-};
-
-/// DEPRECATED. Use the FileSystem API in arrow::fs instead.
-struct ARROW_EXPORT FileStatistics {
- /// Size of file, -1 if finding length is unsupported
- int64_t size;
- ObjectType::type kind;
-};
-
class ARROW_EXPORT FileSystem {
Review Comment:
Ok, but we don't want to keep those two unofficial `FileSystem` and
`HadoopFileSystem` classes which create confusion with the other (public)
filesystem classes.
Ideally, those two classes disappear and their implementation code gets
folded into the public `HadoopFileSystem` class.
If that's too annoying, we should at least merge those two classes and give
them a less ambiguous name, for example `HdfsClient`.
##
cpp/src/arrow/CMakeLists.txt:
##
@@ -833,7 +830,11 @@ if(ARROW_FILESYSTEM)
PROPERTIES SKIP_UNITY_BUILD_INCLUSION ON)
endif()
if(ARROW_HDFS)
-list(APPEND ARROW_FILESYSTEM_SRCS filesystem/hdfs.cc)
+list(APPEND
+ ARROW_FILESYSTEM_SRCS
+ filesystem/hdfs.cc
+ filesystem/hdfs_io.cc
+ filesystem/hdfs_internal.cc)
Review Comment:
Don't we need to link to `arrow::hadoop` as was done above? cc @kou for
advice
##
python/pyarrow/includes/libarrow_fs.pxd:
##
@@ -300,6 +300,80 @@ cdef extern from "arrow/filesystem/api.h" namespace
"arrow::fs" nogil:
const CIOContext& io_context,
int64_t chunk_size, c_bool use_threads)
+cdef extern from "arrow/filesystem/hdfs_io.h" namespace "arrow::io" nogil:
Review Comment:
Are all these declarations actually needed by PyArrow?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
Re: [PR] GH-45747: [C++] Remove deprecated ObjectType and FileStatistics, refactor hdfs code [arrow]
AlenkaF commented on PR #45998: URL: https://github.com/apache/arrow/pull/45998#issuecomment-2789399921 @pitrou I think this is ready for review. The failing builds have an issue opened: https://github.com/apache/arrow/issues/46077 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
