Re: [PR] GH-45747: [C++] Remove deprecated ObjectType and FileStatistics, refactor hdfs code [arrow]
AlenkaF commented on PR #45998: URL: https://github.com/apache/arrow/pull/45998#issuecomment-2996561344 @pitrou gentle ping. Would I be too optimistic to try to get it into 21.0.0? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] GH-45747: [C++] Remove deprecated ObjectType and FileStatistics, refactor hdfs code [arrow]
AlenkaF commented on PR #45998: URL: https://github.com/apache/arrow/pull/45998#issuecomment-2958911523 @benibus @pitrou would you mind taking another look? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] GH-45747: [C++] Remove deprecated ObjectType and FileStatistics, refactor hdfs code [arrow]
AlenkaF commented on code in PR #45998: URL: https://github.com/apache/arrow/pull/45998#discussion_r2137369780 ## python/pyarrow/includes/libarrow_fs.pxd: ## @@ -301,6 +301,19 @@ cdef extern from "arrow/filesystem/api.h" namespace "arrow::fs" nogil: const CIOContext& io_context, int64_t chunk_size, c_bool use_threads) +CStatus HaveLibHdfs() + +cdef enum HdfsDriver" arrow::io::HdfsDriver": +HdfsDriver_LIBHDFS" arrow::io::HdfsDriver::LIBHDFS" +HdfsDriver_LIBHDFS3" arrow::io::HdfsDriver::LIBHDFS3" Review Comment: I need to remove `HdfsDriver` as it is a leftover from: https://github.com/apache/arrow/commit/4e53749097ba687afd5e67925def2e2802c9#diff-8f9209d74ff1ddfc6fed20ab199b6d158d0052382fae2c7bcef9f0bd96cebc95 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] GH-45747: [C++] Remove deprecated ObjectType and FileStatistics, refactor hdfs code [arrow]
AlenkaF commented on code in PR #45998: URL: https://github.com/apache/arrow/pull/45998#discussion_r2136977525 ## cpp/src/arrow/filesystem/hdfs_internal.h: ## @@ -33,8 +38,13 @@ namespace arrow { class Status; +using internal::IOErrorFromErrno; + namespace io::internal { +class HdfsReadableFile; +class HdfsOutputStream; + Review Comment: I think I wanted to change as little as possible. But yeah, should be changed to `fs::internal`. Will do! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] GH-45747: [C++] Remove deprecated ObjectType and FileStatistics, refactor hdfs code [arrow]
benibus commented on code in PR #45998: URL: https://github.com/apache/arrow/pull/45998#discussion_r2136802611 ## cpp/src/arrow/filesystem/hdfs.h: ## @@ -22,18 +22,32 @@ #include #include "arrow/filesystem/filesystem.h" -#include "arrow/io/hdfs.h" #include "arrow/util/uri.h" +namespace arrow::io::internal { +class HdfsReadableFile; +class HdfsOutputStream; + +struct HdfsPathInfo; +} // namespace arrow::io::internal Review Comment: Here as well. ## cpp/src/arrow/filesystem/hdfs_internal.h: ## @@ -33,8 +38,13 @@ namespace arrow { class Status; +using internal::IOErrorFromErrno; + namespace io::internal { +class HdfsReadableFile; +class HdfsOutputStream; + Review Comment: Why are we using the `io::internal` namespace here instead of `fs::internal`? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] GH-45747: [C++] Remove deprecated ObjectType and FileStatistics, refactor hdfs code [arrow]
AlenkaF commented on PR #45998: URL: https://github.com/apache/arrow/pull/45998#issuecomment-2894514536 @pitrou I cleaned up the CI failures (others are not related) and am hoping this changes will not be too bad to review :) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] GH-45747: [C++] Remove deprecated ObjectType and FileStatistics, refactor hdfs code [arrow]
AlenkaF commented on PR #45998: URL: https://github.com/apache/arrow/pull/45998#issuecomment-2890868468 Aha, I see! Thanks, will look into it. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] GH-45747: [C++] Remove deprecated ObjectType and FileStatistics, refactor hdfs code [arrow]
pitrou commented on PR #45998: URL: https://github.com/apache/arrow/pull/45998#issuecomment-2890868302 > * The MSVC compiler is complaining about a forward-declared friend function I'm using in `hdfs.cc`. Do you have any advice on how to better organise this? Hmm, rather than trying to find the exact explanation, a simple solution would be to change these functions into static methods, for example this: ```c++ ARROW_EXPORT Status MakeReadableFile(const std::string& path, int32_t buffer_size, const io::IOContext& io_context, LibHdfsShim* driver, hdfsFS fs, hdfsFile file, std::shared_ptr* out); ``` would become: ```c++ class ARROW_EXPORT HdfsReadableFile : public RandomAccessFile { public: (...) static Result> Make( const std::string& path, int32_t buffer_size, const io::IOContext& io_context, LibHdfsShim* driver, hdfsFS fs, hdfsFile file); ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] GH-45747: [C++] Remove deprecated ObjectType and FileStatistics, refactor hdfs code [arrow]
pitrou commented on PR #45998: URL: https://github.com/apache/arrow/pull/45998#issuecomment-2890860924 Hi @AlenkaF > * Some C++ tests are failing in `hdfs_test`, with the following error: I think you're misreading the output, the test is actually skipped when the driver fails unloading, which is normal: ``` (...) dlopen(/usr/java/latest//lib/amd64/server/libjvm.so) failed: /usr/java/latest//lib/amd64/server/libjvm.so: cannot open shared object file: No such file or directory /arrow/cpp/src/arrow/filesystem/hdfs_internal.cc:294 try_dlopen(libjvm_potential_paths, "libjvm") /arrow/cpp/src/arrow/filesystem/hdfs.cc:95 ConnectLibHdfs(&driver_) /arrow/cpp/src/arrow/filesystem/hdfs.cc:725 ptr->impl_->Init() /arrow/cpp/src/arrow/filesystem/hdfs_test.cc:205: Skipped Driver not loaded, skipping ``` https://github.com/apache/arrow/actions/runs/15109276550/job/42464862030?pr=45998#step:7:3277 The problem is in the other tests, because it seems a destructor crashes: ``` [ RUN ] TestHadoopFileSystem.DeleteDirContents ! Running '/build/cpp/debug/arrow-hdfs-test' produced core dump at '/tmp/core.arrow-hdfs-test.25630', printing backtrace: [New LWP 25630] [New LWP 25631] [Thread debugging using libthread_db enabled] Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1". Core was generated by `/build/cpp/debug/arrow-hdfs-test'. Program terminated with signal SIGSEGV, Segmentation fault. #0 0x7fc37b23d5be in arrow::io::internal::LibHdfsShim::Disconnect () at /arrow/cpp/src/arrow/filesystem/hdfs_internal.cc:340 340 int LibHdfsShim::Disconnect(hdfsFS fs) { return this->hdfsDisconnect(fs); } [Current thread is 1 (Thread 0x7fc374633dc0 (LWP 25630))] (...) ``` https://github.com/apache/arrow/actions/runs/15109276550/job/42464862030?pr=45998#step:7:3281 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] GH-45747: [C++] Remove deprecated ObjectType and FileStatistics, refactor hdfs code [arrow]
AlenkaF commented on PR #45998: URL: https://github.com/apache/arrow/pull/45998#issuecomment-2890402955 Hi @pitrou, could you please take a quick look at the changes when you have a moment? I've done my best to implement the suggested changes, but am sure there's still room for improvement. A couple of issues I could use your input on: - Some C++ tests are failing in `hdfs_test`, with the following error: ``` /arrow/cpp/src/arrow/filesystem/hdfs_test.cc:90: HadoopFileSystem::Make failed, it is possible when we don't have proper driver on this node, err msg is IOError: Unable to load libjvm ``` I'm not sure how best to resolve this — any guidance would be appreciated. - The MSVC compiler is complaining about a forward-declared friend function I'm using in `hdfs.cc`. Do you have any advice on how to better organise this? The Python and MATLAB test failures are not related. Thanks in advance! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] GH-45747: [C++] Remove deprecated ObjectType and FileStatistics, refactor hdfs code [arrow]
AlenkaF commented on code in PR #45998: URL: https://github.com/apache/arrow/pull/45998#discussion_r2073379635 ## cpp/src/arrow/filesystem/hdfs_io.h: ## @@ -38,18 +40,6 @@ namespace io { class HdfsReadableFile; class HdfsOutputStream; -/// DEPRECATED. Use the FileSystem API in arrow::fs instead. -struct ObjectType { - enum type { FILE, DIRECTORY }; -}; - -/// DEPRECATED. Use the FileSystem API in arrow::fs instead. -struct ARROW_EXPORT FileStatistics { - /// Size of file, -1 if finding length is unsupported - int64_t size; - ObjectType::type kind; -}; - class ARROW_EXPORT FileSystem { Review Comment: Ok, will go with the disappearing =) IIUC `hdfs_io.h` will be removed altogether: - `FileSystem` and `HadoopFileSystem` will go into `hdfs.cc`, folded into the public `HadoopFileSystem` - `HdfsConnectionConfig` will also go into `hdfs.cc` - declarations that are left will go into `hdfs_internal.cc` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] GH-45747: [C++] Remove deprecated ObjectType and FileStatistics, refactor hdfs code [arrow]
AlenkaF commented on code in PR #45998: URL: https://github.com/apache/arrow/pull/45998#discussion_r2073351700 ## cpp/src/arrow/CMakeLists.txt: ## @@ -833,7 +830,11 @@ if(ARROW_FILESYSTEM) PROPERTIES SKIP_UNITY_BUILD_INCLUSION ON) endif() if(ARROW_HDFS) -list(APPEND ARROW_FILESYSTEM_SRCS filesystem/hdfs.cc) +list(APPEND + ARROW_FILESYSTEM_SRCS + filesystem/hdfs.cc + filesystem/hdfs_io.cc + filesystem/hdfs_internal.cc) Review Comment: Ah, it is there already (that explains why nothing failed =) ) https://github.com/apache/arrow/blob/53ef438333efd4a6eb1421b3e812d10e5a8ed633/cpp/src/arrow/CMakeLists.txt#L857-L861 Not sure if the line with `CMAKE_DL_LIBS` is also needed here then? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] GH-45747: [C++] Remove deprecated ObjectType and FileStatistics, refactor hdfs code [arrow]
AlenkaF commented on code in PR #45998: URL: https://github.com/apache/arrow/pull/45998#discussion_r2073347504 ## cpp/src/arrow/CMakeLists.txt: ## @@ -833,7 +830,11 @@ if(ARROW_FILESYSTEM) PROPERTIES SKIP_UNITY_BUILD_INCLUSION ON) endif() if(ARROW_HDFS) -list(APPEND ARROW_FILESYSTEM_SRCS filesystem/hdfs.cc) +list(APPEND + ARROW_FILESYSTEM_SRCS + filesystem/hdfs.cc + filesystem/hdfs_io.cc + filesystem/hdfs_internal.cc) Review Comment: Hm, yeah. I will add a link as above as it makes sense. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] GH-45747: [C++] Remove deprecated ObjectType and FileStatistics, refactor hdfs code [arrow]
AlenkaF commented on code in PR #45998: URL: https://github.com/apache/arrow/pull/45998#discussion_r2073309002 ## python/pyarrow/includes/libarrow_fs.pxd: ## @@ -300,6 +300,80 @@ cdef extern from "arrow/filesystem/api.h" namespace "arrow::fs" nogil: const CIOContext& io_context, int64_t chunk_size, c_bool use_threads) +cdef extern from "arrow/filesystem/hdfs_io.h" namespace "arrow::io" nogil: Review Comment: No, most of them aren't and are copied from `libarrow.pxd`. I can remove the unused ones - but am not sure if some external application can actually use them? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] GH-45747: [C++] Remove deprecated ObjectType and FileStatistics, refactor hdfs code [arrow]
pitrou commented on code in PR #45998: URL: https://github.com/apache/arrow/pull/45998#discussion_r2065856352 ## cpp/src/arrow/CMakeLists.txt: ## @@ -408,15 +408,12 @@ arrow_add_object_library(ARROW_IO io/caching.cc io/compressed.cc io/file.cc - io/hdfs.cc - io/hdfs_internal.cc io/interfaces.cc io/memory.cc io/slow.cc io/stdio.cc io/transform.cc) foreach(ARROW_IO_TARGET ${ARROW_IO_TARGETS}) - target_link_libraries(${ARROW_IO_TARGET} PRIVATE arrow::hadoop) if(NOT MSVC) target_link_libraries(${ARROW_IO_TARGET} PRIVATE ${CMAKE_DL_LIBS}) endif() Review Comment: I think `CMAKE_DL_LIBS` was for HDFS, you need to move it as well? ## cpp/src/arrow/filesystem/CMakeLists.txt: ## @@ -143,4 +143,12 @@ endif() if(ARROW_HDFS) add_arrow_test(hdfs_test EXTRA_LABELS filesystem) + add_arrow_test(hdfs_test_io Review Comment: Nit: should call this `hdfs_io_test`, or perhaps better `hdfs_internals_test` ## cpp/src/arrow/filesystem/hdfs_io.h: ## @@ -38,18 +40,6 @@ namespace io { class HdfsReadableFile; Review Comment: Most of these declarations should IMHO go into the `arrow::filesystem::internal` namespace, except for `HdfsConnectionConfig` which can go into `arrow::filesystem`. ## cpp/src/arrow/filesystem/api.h: ## @@ -32,3 +32,6 @@ #ifdef ARROW_S3 # include "arrow/filesystem/s3fs.h" // IWYU pragma: export #endif +#ifdef ARROW_HDFS +# include "arrow/filesystem/hdfs_io.h" // IWYU pragma: export Review Comment: I don't think we need to include this one here. (also, it's included implicitly by `hdfs.h`) ## cpp/src/arrow/filesystem/hdfs_io.h: ## @@ -38,18 +40,6 @@ namespace io { class HdfsReadableFile; class HdfsOutputStream; -/// DEPRECATED. Use the FileSystem API in arrow::fs instead. -struct ObjectType { - enum type { FILE, DIRECTORY }; -}; - -/// DEPRECATED. Use the FileSystem API in arrow::fs instead. -struct ARROW_EXPORT FileStatistics { - /// Size of file, -1 if finding length is unsupported - int64_t size; - ObjectType::type kind; -}; - class ARROW_EXPORT FileSystem { Review Comment: Ok, but we don't want to keep those two unofficial `FileSystem` and `HadoopFileSystem` classes which create confusion with the other (public) filesystem classes. Ideally, those two classes disappear and their implementation code gets folded into the public `HadoopFileSystem` class. If that's too annoying, we should at least merge those two classes and give them a less ambiguous name, for example `HdfsClient`. ## cpp/src/arrow/CMakeLists.txt: ## @@ -833,7 +830,11 @@ if(ARROW_FILESYSTEM) PROPERTIES SKIP_UNITY_BUILD_INCLUSION ON) endif() if(ARROW_HDFS) -list(APPEND ARROW_FILESYSTEM_SRCS filesystem/hdfs.cc) +list(APPEND + ARROW_FILESYSTEM_SRCS + filesystem/hdfs.cc + filesystem/hdfs_io.cc + filesystem/hdfs_internal.cc) Review Comment: Don't we need to link to `arrow::hadoop` as was done above? cc @kou for advice ## python/pyarrow/includes/libarrow_fs.pxd: ## @@ -300,6 +300,80 @@ cdef extern from "arrow/filesystem/api.h" namespace "arrow::fs" nogil: const CIOContext& io_context, int64_t chunk_size, c_bool use_threads) +cdef extern from "arrow/filesystem/hdfs_io.h" namespace "arrow::io" nogil: Review Comment: Are all these declarations actually needed by PyArrow? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] GH-45747: [C++] Remove deprecated ObjectType and FileStatistics, refactor hdfs code [arrow]
AlenkaF commented on PR #45998: URL: https://github.com/apache/arrow/pull/45998#issuecomment-2789399921 @pitrou I think this is ready for review. The failing builds have an issue opened: https://github.com/apache/arrow/issues/46077 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org