Re: HDFS ORC to Arrow Dataset
On further analysis:

==114164== Process terminating with default action of signal 6 (SIGABRT)
==114164==    at 0x4AD118B: raise (raise.c:51)
==114164==    by 0x4AB092D: abort (abort.c:100)
==114164==    by 0x598D768: os::abort(bool) (in /home/legion/ha_devel/hadoop-ecosystem-3x/jdk1.8.0_201/jre/lib/amd64/server/libjvm.so)
==114164==    by 0x5B52802: VMError::report_and_die() (in /home/legion/ha_devel/hadoop-ecosystem-3x/jdk1.8.0_201/jre/lib/amd64/server/libjvm.so)
==114164==    by 0x59979F4: JVM_handle_linux_signal (in /home/legion/ha_devel/hadoop-ecosystem-3x/jdk1.8.0_201/jre/lib/amd64/server/libjvm.so)
==114164==    by 0x598A8B7: signalHandler(int, siginfo*, void*) (in /home/legion/ha_devel/hadoop-ecosystem-3x/jdk1.8.0_201/jre/lib/amd64/server/libjvm.so)
==114164==    by 0x485F3BF: ??? (in /usr/lib/x86_64-linux-gnu/libpthread-2.31.so)
==114164==    by 0x5949C26: Monitor::ILock(Thread*) [clone .part.2] (in /home/legion/ha_devel/hadoop-ecosystem-3x/jdk1.8.0_201/jre/lib/amd64/server/libjvm.so)
==114164==    by 0x594B50A: Monitor::lock_without_safepoint_check() (in /home/legion/ha_devel/hadoop-ecosystem-3x/jdk1.8.0_201/jre/lib/amd64/server/libjvm.so)
==114164==    by 0x5B59660: VM_Exit::wait_if_vm_exited() (in /home/legion/ha_devel/hadoop-ecosystem-3x/jdk1.8.0_201/jre/lib/amd64/server/libjvm.so)
==114164==    by 0x574137C: jni_DetachCurrentThread (in /home/legion/ha_devel/hadoop-ecosystem-3x/jdk1.8.0_201/jre/lib/amd64/server/libjvm.so)
==114164==    by 0x4140AA4E: hdfsThreadDestructor (thread_local_storage.c:53)

It turned out that the issue was in libhdfs, so I fixed that. Now ORC JNI also works fine.

There are still many features missing in the ORC JNI, such as reading a full split or index-based reading. Do we have any plan to support those?
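For anyone hitting the same crash: the Valgrind trace shows libhdfs's per-thread destructor calling jni_DetachCurrentThread after the JVM has already exited, which aborts inside VM_Exit::wait_if_vm_exited. The actual patch applied isn't shown in this thread; the following is only a hedged sketch of the general guard pattern for this class of bug, where `GetCachedJavaVM` is a placeholder for however the surrounding code caches the `JavaVM*`:

```cpp
// Hedged sketch, NOT the actual libhdfs fix: guard a thread-local
// destructor so it does not detach from a JVM that is shutting down.
#include <jni.h>

// Placeholder (assumption): accessor for the JavaVM* cached at attach time.
extern JavaVM* GetCachedJavaVM();

static void hdfsThreadDestructorGuarded(void* arg) {
  (void)arg;
  JavaVM* vm = GetCachedJavaVM();
  if (vm == nullptr) {
    return;  // never attached, or VM pointer already cleared at shutdown
  }
  // Only detach while the thread is still known to the VM. GetEnv returns
  // JNI_EDETACHED once the thread is no longer attached; calling
  // DetachCurrentThread unconditionally is what triggers the abort above.
  JNIEnv* env = nullptr;
  jint rc = vm->GetEnv(reinterpret_cast<void**>(&env), JNI_VERSION_1_8);
  if (rc == JNI_OK && env != nullptr) {
    vm->DetachCurrentThread();
  }
}
```

This is not guaranteed to cover every shutdown race (the VM can exit between the `GetEnv` check and the detach); it only illustrates why the unconditional detach in `hdfsThreadDestructor` is the dangerous step.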
On Wed, 8 Sept 2021 at 22:06, Manoj Kumar wrote:
> Hi Wes,
>
> Thanks,
>
> [ Part 1 ]
> C++ HDFS/ORC [Completed]
> Steps which I followed:
> 1) arrow::fs::HadoopFileSystem --> create a Hadoop FS
> 2) std::shared_ptr<arrow::io::RandomAccessFile> --> then create a stream
> 3) Pass that stream to adapters::orc::ORCFileReader
>
> [ Part 2 ]
> C++ HDFS/ORC via Java JNI [Partially Completed]
> Followed the same approach in orc.jni_wrapper:
> 1) arrow::fs::HadoopFileSystem --> create a Hadoop FS
> 2) std::shared_ptr<arrow::io::RandomAccessFile> --> then create a stream
> 3) Pass that stream to adapters::orc::ORCFileReader
>
> std::unique_ptr<ORCFileReader> reader;
> arrow::Status ret;
> if (path.find("hdfs://") == 0) {
>   arrow::fs::HdfsOptions options_;
>   options_ = *arrow::fs::HdfsOptions::FromUri(path);
>   auto _fsRes = arrow::fs::HadoopFileSystem::Make(options_);
>   if (!_fsRes.ok()) {
>     std::cerr << "HadoopFileSystem::Make failed, it is possible when we don't have "
>                  "proper driver on this node, err msg is "
>               << _fsRes.status().ToString();
>   }
>   _fs = *_fsRes;
>   auto _stream = *_fs->OpenInputFile(path);
>   hadoop_fs_holder_.Insert(_fs);  // global holder in arrow::jni::ConcurrentMap, cleared during unload
>   ret = ORCFileReader::Open(_stream, arrow::default_memory_pool(), &reader);
>   if (!ret.ok()) {
>     env->ThrowNew(io_exception_class, std::string("Failed open file" + path).c_str());
>   }
>   return orc_reader_holder_.Insert(std::shared_ptr<ORCFileReader>(reader.release()));
> }
>
> JNI also works fine, but at the end of the application I am getting a
> segmentation fault.
> Do you have any idea about this? It looks like some issue with libhdfs
> connection close or cleanup.
>
> stack trace:
> /tmp/tmp3973555041947319188libarrow_orc_jni.so : ()+0xb8b1a3
> /lib/x86_64-linux-gnu/libpthread.so.0 : ()+0x153c0
> /lib/x86_64-linux-gnu/libc.so.6 : gsignal()+0xcb
> /lib/x86_64-linux-gnu/libc.so.6 : abort()+0x12b
> /home/legion/ha_devel/hadoop-ecosystem-3x/jdk1.8.0_201/jre/lib/amd64/server/libjvm.so : ()+0x90e769
> /home/legion/ha_devel/hadoop-ecosystem-3x/jdk1.8.0_201/jre/lib/amd64/server/libjvm.so : ()+0xad3803
> /home/legion/ha_devel/hadoop-ecosystem-3x/jdk1.8.0_201/jre/lib/amd64/server/libjvm.so : JVM_handle_linux_signal()+0x1a5
> /home/legion/ha_devel/hadoop-ecosystem-3x/jdk1.8.0_201/jre/lib/amd64/server/libjvm.so : ()+0x90b8b8
> /lib/x86_64-linux-gnu/libpthread.so.0 : ()+0x153c0
> /home/legion/ha_devel/hadoop-ecosystem-3x/jdk1.8.0_201/jre/lib/amd64/server/libjvm.so : ()+0x8cac27
> /home/legion/ha_devel/hadoop-ecosystem-3x/jdk1.8.0_201/jre/lib/amd64/server/libjvm.so : ()+0x8cc50b
> /home/legion/ha_devel/hadoop-ecosystem-3x/jdk1.8.0_201/jre/lib/amd64/server/libjvm.so : ()+0xada661
> /home/legion/ha_devel/hadoop-ecosystem-3x/jdk1.8.0_201/jre/lib/amd64/server/libjvm.so : ()+0x6c237d
> /home/legion/ha_devel/hadoop-ecosystem-3x/hadoop-3.1.1/lib/native/libhdfs.so : ()+0xaa4f
> /lib/x86_64-linux-gnu/libpthread.so.0 : ()+0x85a1
> /lib/x86_64-linux-gnu/libpthread.so.0 : ()+0x962a
Re: HDFS ORC to Arrow Dataset
Hi Wes,

Thanks,

[ Part 1 ]
C++ HDFS/ORC [Completed]
Steps which I followed:
1) arrow::fs::HadoopFileSystem --> create a Hadoop FS
2) std::shared_ptr<arrow::io::RandomAccessFile> --> then create a stream
3) Pass that stream to adapters::orc::ORCFileReader

[ Part 2 ]
C++ HDFS/ORC via Java JNI [Partially Completed]
Followed the same approach in orc.jni_wrapper:
1) arrow::fs::HadoopFileSystem --> create a Hadoop FS
2) std::shared_ptr<arrow::io::RandomAccessFile> --> then create a stream
3) Pass that stream to adapters::orc::ORCFileReader

std::unique_ptr<ORCFileReader> reader;
arrow::Status ret;
if (path.find("hdfs://") == 0) {
  arrow::fs::HdfsOptions options_;
  options_ = *arrow::fs::HdfsOptions::FromUri(path);
  auto _fsRes = arrow::fs::HadoopFileSystem::Make(options_);
  if (!_fsRes.ok()) {
    std::cerr << "HadoopFileSystem::Make failed, it is possible when we don't have "
                 "proper driver on this node, err msg is "
              << _fsRes.status().ToString();
  }
  _fs = *_fsRes;
  auto _stream = *_fs->OpenInputFile(path);
  hadoop_fs_holder_.Insert(_fs);  // global holder in arrow::jni::ConcurrentMap, cleared during unload
  ret = ORCFileReader::Open(_stream, arrow::default_memory_pool(), &reader);
  if (!ret.ok()) {
    env->ThrowNew(io_exception_class, std::string("Failed open file" + path).c_str());
  }
  return orc_reader_holder_.Insert(std::shared_ptr<ORCFileReader>(reader.release()));
}

JNI also works fine, but at the end of the application I am getting a
segmentation fault.
Do you have any idea about this? It looks like some issue with libhdfs
connection close or cleanup.

stack trace:
/tmp/tmp3973555041947319188libarrow_orc_jni.so : ()+0xb8b1a3
/lib/x86_64-linux-gnu/libpthread.so.0 : ()+0x153c0
/lib/x86_64-linux-gnu/libc.so.6 : gsignal()+0xcb
/lib/x86_64-linux-gnu/libc.so.6 : abort()+0x12b
/home/legion/ha_devel/hadoop-ecosystem-3x/jdk1.8.0_201/jre/lib/amd64/server/libjvm.so : ()+0x90e769
/home/legion/ha_devel/hadoop-ecosystem-3x/jdk1.8.0_201/jre/lib/amd64/server/libjvm.so : ()+0xad3803
/home/legion/ha_devel/hadoop-ecosystem-3x/jdk1.8.0_201/jre/lib/amd64/server/libjvm.so : JVM_handle_linux_signal()+0x1a5
/home/legion/ha_devel/hadoop-ecosystem-3x/jdk1.8.0_201/jre/lib/amd64/server/libjvm.so : ()+0x90b8b8
/lib/x86_64-linux-gnu/libpthread.so.0 : ()+0x153c0
/home/legion/ha_devel/hadoop-ecosystem-3x/jdk1.8.0_201/jre/lib/amd64/server/libjvm.so : ()+0x8cac27
/home/legion/ha_devel/hadoop-ecosystem-3x/jdk1.8.0_201/jre/lib/amd64/server/libjvm.so : ()+0x8cc50b
/home/legion/ha_devel/hadoop-ecosystem-3x/jdk1.8.0_201/jre/lib/amd64/server/libjvm.so : ()+0xada661
/home/legion/ha_devel/hadoop-ecosystem-3x/jdk1.8.0_201/jre/lib/amd64/server/libjvm.so : ()+0x6c237d
/home/legion/ha_devel/hadoop-ecosystem-3x/hadoop-3.1.1/lib/native/libhdfs.so : ()+0xaa4f
/lib/x86_64-linux-gnu/libpthread.so.0 : ()+0x85a1
/lib/x86_64-linux-gnu/libpthread.so.0 : ()+0x962a
/lib/x86_64-linux-gnu/libc.so.6 : clone()+0x43

On Wed, 8 Sept 2021 at 04:07, Weston Pace wrote:
> I'll just add that a PR is in progress (thanks Joris!) for adding this
> adapter: https://github.com/apache/arrow/pull/10991
>
> On Tue, Sep 7, 2021 at 12:05 PM Wes McKinney wrote:
> >
> > I'm missing context, but if you're talking about C++/Python, we are
> > currently missing a wrapper interface to the ORC reader in the Arrow
> > datasets library:
> >
> > https://github.com/apache/arrow/tree/master/cpp/src/arrow/dataset
> >
> > We have CSV, Arrow (IPC), and Parquet interfaces.
> > But we have an HDFS filesystem implementation and an ORC reader
> > implementation, so mechanically all of the pieces are there but need
> > to be connected together.
> >
> > Thanks,
> > Wes
> >
> > On Tue, Sep 7, 2021 at 8:22 AM Manoj Kumar wrote:
> > >
> > > Hi Dev-Community,
> > >
> > > Can anyone guide me on how to read ORC directly from HDFS into an
> > > Arrow dataset?
> > >
> > > Thanks
> > > Manoj
Re: HDFS ORC to Arrow Dataset
I'll just add that a PR is in progress (thanks Joris!) for adding this adapter: https://github.com/apache/arrow/pull/10991

On Tue, Sep 7, 2021 at 12:05 PM Wes McKinney wrote:
>
> I'm missing context, but if you're talking about C++/Python, we are
> currently missing a wrapper interface to the ORC reader in the Arrow
> datasets library:
>
> https://github.com/apache/arrow/tree/master/cpp/src/arrow/dataset
>
> We have CSV, Arrow (IPC), and Parquet interfaces.
>
> But we have an HDFS filesystem implementation and an ORC reader
> implementation, so mechanically all of the pieces are there but need
> to be connected together.
>
> Thanks,
> Wes
>
> On Tue, Sep 7, 2021 at 8:22 AM Manoj Kumar wrote:
> >
> > Hi Dev-Community,
> >
> > Can anyone guide me on how to read ORC directly from HDFS into an
> > Arrow dataset?
> >
> > Thanks
> > Manoj
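[Editor's note: once the adapter PR linked above lands, scanning ORC files through the datasets API should presumably mirror the existing Parquet/CSV formats. A hedged C++ sketch follows; the class name `OrcFileFormat` is taken from the in-progress PR and may not be final, and the URI is a placeholder:]

```cpp
// Hedged sketch of dataset-based ORC scanning, assuming the OrcFileFormat
// class proposed in apache/arrow#10991. Not a final API; for illustration.
#include <arrow/api.h>
#include <arrow/dataset/api.h>
#include <arrow/filesystem/api.h>

arrow::Result<std::shared_ptr<arrow::Table>> ScanOrcDirectory(
    const std::string& uri /* e.g. "hdfs://namenode:8020/warehouse/orc" */) {
  // Resolve the filesystem (HDFS, local, S3, ...) from the URI.
  std::string path;
  ARROW_ASSIGN_OR_RAISE(auto fs, arrow::fs::FileSystemFromUri(uri, &path));

  // Discover all files under the base directory.
  arrow::fs::FileSelector selector;
  selector.base_dir = path;
  selector.recursive = true;

  // OrcFileFormat is the piece the PR adds (name assumed from the branch).
  auto format = std::make_shared<arrow::dataset::OrcFileFormat>();
  ARROW_ASSIGN_OR_RAISE(
      auto factory,
      arrow::dataset::FileSystemDatasetFactory::Make(
          fs, selector, format, arrow::dataset::FileSystemFactoryOptions()));
  ARROW_ASSIGN_OR_RAISE(auto dataset, factory->Finish());

  // Scan the whole dataset into a single in-memory Table.
  ARROW_ASSIGN_OR_RAISE(auto builder, dataset->NewScan());
  ARROW_ASSIGN_OR_RAISE(auto scanner, builder->Finish());
  return scanner->ToTable();
}
```

The point of going through the datasets layer rather than ORCFileReader directly is that projection, filtering, and multi-file discovery come for free once the format class exists.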
Re: HDFS ORC to Arrow Dataset
I'm missing context, but if you're talking about C++/Python, we are currently missing a wrapper interface to the ORC reader in the Arrow datasets library:

https://github.com/apache/arrow/tree/master/cpp/src/arrow/dataset

We have CSV, Arrow (IPC), and Parquet interfaces.

But we have an HDFS filesystem implementation and an ORC reader implementation, so mechanically all of the pieces are there but need to be connected together.

Thanks,
Wes

On Tue, Sep 7, 2021 at 8:22 AM Manoj Kumar wrote:
>
> Hi Dev-Community,
>
> Can anyone guide me on how to read ORC directly from HDFS into an
> Arrow dataset?
>
> Thanks
> Manoj
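[Editor's note: connecting the two pieces Wes mentions, without the datasets layer, can be sketched in C++ roughly as below. This is a minimal sketch against the Arrow 5.x-era APIs; the HDFS URI/path are placeholders, and it assumes libhdfs and a JVM are available at runtime:]

```cpp
// Minimal sketch: HDFS filesystem + ORC adapter -> arrow::Table.
// Requires a working libhdfs/JVM environment; error paths simplified.
#include <arrow/adapters/orc/adapter.h>
#include <arrow/api.h>
#include <arrow/filesystem/hdfs.h>

arrow::Result<std::shared_ptr<arrow::Table>> ReadOrcFromHdfs(
    const std::string& uri /* e.g. "hdfs://namenode:8020" */,
    const std::string& path /* e.g. "/data/file.orc" */) {
  // 1) Build a Hadoop filesystem from the connection URI.
  ARROW_ASSIGN_OR_RAISE(auto options, arrow::fs::HdfsOptions::FromUri(uri));
  ARROW_ASSIGN_OR_RAISE(auto fs, arrow::fs::HadoopFileSystem::Make(options));

  // 2) Open a random-access input file on HDFS.
  ARROW_ASSIGN_OR_RAISE(auto input, fs->OpenInputFile(path));

  // 3) Hand the stream to the ORC adapter and read it as a Table.
  std::unique_ptr<arrow::adapters::orc::ORCFileReader> reader;
  ARROW_RETURN_NOT_OK(arrow::adapters::orc::ORCFileReader::Open(
      input, arrow::default_memory_pool(), &reader));
  std::shared_ptr<arrow::Table> table;
  ARROW_RETURN_NOT_OK(reader->Read(&table));
  return table;
}
```

This is essentially what the messages later in this thread implement; the datasets-library wrapper would wrap the same reader behind a `FileFormat` so it composes with scanning and discovery.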
Fwd: HDFS ORC to Arrow Dataset
Hi Dev-Community,

Can anyone guide me on how to read ORC directly from HDFS into an Arrow dataset?

Thanks
Manoj