Re: HDFS ORC to Arrow Dataset

2021-09-09 Thread Manoj Kumar
On further analysis:

==114164== Process terminating with default action of signal 6 (SIGABRT)
==114164==at 0x4AD118B: raise (raise.c:51)
==114164==by 0x4AB092D: abort (abort.c:100)
==114164==by 0x598D768: os::abort(bool) (in
/home/legion/ha_devel/hadoop-ecosystem-3x/jdk1.8.0_201/jre/lib/amd64/server/libjvm.so)
==114164==by 0x5B52802: VMError::report_and_die() (in
/home/legion/ha_devel/hadoop-ecosystem-3x/jdk1.8.0_201/jre/lib/amd64/server/libjvm.so)
==114164==by 0x59979F4: JVM_handle_linux_signal (in
/home/legion/ha_devel/hadoop-ecosystem-3x/jdk1.8.0_201/jre/lib/amd64/server/libjvm.so)
==114164==by 0x598A8B7: signalHandler(int, siginfo*, void*) (in
/home/legion/ha_devel/hadoop-ecosystem-3x/jdk1.8.0_201/jre/lib/amd64/server/libjvm.so)
==114164==by 0x485F3BF: ??? (in /usr/lib/x86_64-linux-gnu/
libpthread-2.31.so)
==114164==by 0x5949C26: Monitor::ILock(Thread*) [clone .part.2] (in
/home/legion/ha_devel/hadoop-ecosystem-3x/jdk1.8.0_201/jre/lib/amd64/server/libjvm.so)
==114164==by 0x594B50A: Monitor::lock_without_safepoint_check() (in
/home/legion/ha_devel/hadoop-ecosystem-3x/jdk1.8.0_201/jre/lib/amd64/server/libjvm.so)
==114164==by 0x5B59660: VM_Exit::wait_if_vm_exited() (in
/home/legion/ha_devel/hadoop-ecosystem-3x/jdk1.8.0_201/jre/lib/amd64/server/libjvm.so)
==114164==by 0x574137C: jni_DetachCurrentThread (in
/home/legion/ha_devel/hadoop-ecosystem-3x/jdk1.8.0_201/jre/lib/amd64/server/libjvm.so)
==114164==by 0x4140AA4E: hdfsThreadDestructor (thread_local_storage.c:53)

It turned out that the issue was in libhdfs, so I fixed that.

Now the ORC JNI path also works fine.
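For context, the Valgrind trace above shows hdfsThreadDestructor calling DetachCurrentThread after the VM has begun exiting, at which point VM_Exit::wait_if_vm_exited aborts the process. A guard along the following lines is one plausible shape for such a fix; this is a hypothetical sketch only (shown in C++ JNI syntax for brevity, though thread_local_storage.c itself is C), not the actual patch:

```cpp
// Hypothetical sketch (not the actual libhdfs patch): avoid calling
// DetachCurrentThread from the thread-local destructor once this thread
// is no longer attached, e.g. because the VM already tore it down
// during shutdown. All names besides the JNI API are illustrative.
#include <jni.h>

static void hdfsThreadDestructorGuarded(void* v) {
  JNIEnv* env = static_cast<JNIEnv*>(v);
  if (env == nullptr) return;

  JavaVM* vm = nullptr;
  if (env->GetJavaVM(&vm) != JNI_OK || vm == nullptr) return;

  // If GetEnv reports the thread as detached, there is nothing to do;
  // calling DetachCurrentThread here risks tripping the VM-exit abort
  // seen in the trace above.
  void* check = nullptr;
  if (vm->GetEnv(&check, JNI_VERSION_1_8) == JNI_EDETACHED) return;

  vm->DetachCurrentThread();
}
```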

There are still several features missing in the ORC JNI bindings, such as
reading a full split or index-based reading.
Do we have any plan to support those?
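In the meantime, stripe-level reads may cover part of this: if I read the adapter correctly, ORCFileReader exposes per-stripe access. A rough sketch, assuming the Status-based API used elsewhere in this thread (exact signatures vary across Arrow versions):

```cpp
// Sketch: reading a single ORC stripe (roughly a "split") through the
// C++ ORC adapter, one stripe at a time instead of the whole file.
// Signatures are from memory and may differ by Arrow version.
#include <arrow/adapters/orc/adapter.h>
#include <arrow/api.h>

arrow::Status ReadOneStripe(arrow::adapters::orc::ORCFileReader* reader,
                            int64_t stripe_index,
                            std::shared_ptr<arrow::RecordBatch>* out) {
  // Guard against out-of-range indices before asking for the stripe.
  if (stripe_index < 0 || stripe_index >= reader->NumberOfStripes()) {
    return arrow::Status::Invalid("stripe index out of range");
  }
  return reader->ReadStripe(stripe_index, out);
}
```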


On Wed, 8 Sept 2021 at 22:06, Manoj Kumar  wrote:


Re: HDFS ORC to Arrow Dataset

2021-09-08 Thread Manoj Kumar
Hi Wes,

Thanks,

*[ Part 1 ]*
*C++ HDFS/ORC  [Completed]*
Steps I followed:
1) arrow::fs::HadoopFileSystem --> create a Hadoop FS
2) std::shared_ptr<arrow::io::RandomAccessFile> --> then create a stream
3) Pass that stream to adapters::orc::ORCFileReader
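The steps above can be sketched end to end roughly as follows, assuming a reachable HDFS cluster and the Status-based Arrow C++ API used later in this message (signatures changed across Arrow releases, so adjust for your version):

```cpp
// Sketch of the Part 1 (pure C++) path: hdfs:// URI -> HadoopFileSystem
// -> input stream -> ORCFileReader -> arrow::Table. Requires a running
// HDFS cluster and a working libhdfs/JVM setup; not runnable standalone.
#include <arrow/api.h>
#include <arrow/filesystem/hdfs.h>
#include <arrow/adapters/orc/adapter.h>

arrow::Status ReadOrcFromHdfs(const std::string& uri) {
  // 1) Create the Hadoop filesystem from the hdfs:// URI.
  ARROW_ASSIGN_OR_RAISE(auto options, arrow::fs::HdfsOptions::FromUri(uri));
  ARROW_ASSIGN_OR_RAISE(auto fs, arrow::fs::HadoopFileSystem::Make(options));

  // 2) Open a random-access input stream on the ORC file.
  ARROW_ASSIGN_OR_RAISE(auto input, fs->OpenInputFile(uri));

  // 3) Hand the stream to the ORC adapter and read it as a Table.
  std::unique_ptr<arrow::adapters::orc::ORCFileReader> reader;
  ARROW_RETURN_NOT_OK(arrow::adapters::orc::ORCFileReader::Open(
      input, arrow::default_memory_pool(), &reader));

  std::shared_ptr<arrow::Table> table;
  ARROW_RETURN_NOT_OK(reader->Read(&table));
  return arrow::Status::OK();
}
```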

*[ Part 2 ]*
*C++ HDFS/ORC via Java JNI [Partially Completed]*
*Followed the same approach in orc.jni_wrapper:*
1) arrow::fs::HadoopFileSystem --> create a Hadoop FS
2) std::shared_ptr<arrow::io::RandomAccessFile> --> then create a stream
3) Pass that stream to adapters::orc::ORCFileReader

 std::unique_ptr<ORCFileReader> reader;
 arrow::Status ret;
 if (path.find("hdfs://") == 0) {
   arrow::fs::HdfsOptions options_;
   options_ = *arrow::fs::HdfsOptions::FromUri(path);
   auto _fsRes = arrow::fs::HadoopFileSystem::Make(options_);
   if (!_fsRes.ok()) {
     std::cerr << "HadoopFileSystem::Make failed, it is possible when we "
                  "don't have proper driver on this node, err msg is "
               << _fsRes.status().ToString();
   }
   _fs = *_fsRes;
   auto _stream = *_fs->OpenInputFile(path);
   hadoop_fs_holder_.Insert(_fs); // global holder in
                                  // arrow::jni::ConcurrentMap,
                                  // cleared during unload
   ret = ORCFileReader::Open(_stream, arrow::default_memory_pool(), &reader);
   if (!ret.ok()) {
     env->ThrowNew(io_exception_class,
                   std::string("Failed to open file " + path).c_str());
   }
   return orc_reader_holder_.Insert(
       std::shared_ptr<ORCFileReader>(reader.release()));
 }


The JNI path also works fine, but at the end of the application I am
getting a segmentation fault.

*Do you have any idea about this? It looks like an issue with libhdfs
connection close or cleanup.*

*stack trace:*
  /tmp/tmp3973555041947319188libarrow_orc_jni.so : ()+0xb8b1a3
  /lib/x86_64-linux-gnu/libpthread.so.0 : ()+0x153c0
  /lib/x86_64-linux-gnu/libc.so.6 : gsignal()+0xcb
  /lib/x86_64-linux-gnu/libc.so.6 : abort()+0x12b

/home/legion/ha_devel/hadoop-ecosystem-3x/jdk1.8.0_201/jre/lib/amd64/server/libjvm.so
: ()+0x90e769

/home/legion/ha_devel/hadoop-ecosystem-3x/jdk1.8.0_201/jre/lib/amd64/server/libjvm.so
: ()+0xad3803

/home/legion/ha_devel/hadoop-ecosystem-3x/jdk1.8.0_201/jre/lib/amd64/server/libjvm.so
: JVM_handle_linux_signal()+0x1a5

/home/legion/ha_devel/hadoop-ecosystem-3x/jdk1.8.0_201/jre/lib/amd64/server/libjvm.so
: ()+0x90b8b8
  /lib/x86_64-linux-gnu/libpthread.so.0 : ()+0x153c0

/home/legion/ha_devel/hadoop-ecosystem-3x/jdk1.8.0_201/jre/lib/amd64/server/libjvm.so
: ()+0x8cac27

/home/legion/ha_devel/hadoop-ecosystem-3x/jdk1.8.0_201/jre/lib/amd64/server/libjvm.so
: ()+0x8cc50b

/home/legion/ha_devel/hadoop-ecosystem-3x/jdk1.8.0_201/jre/lib/amd64/server/libjvm.so
: ()+0xada661

/home/legion/ha_devel/hadoop-ecosystem-3x/jdk1.8.0_201/jre/lib/amd64/server/libjvm.so
: ()+0x6c237d
*
/home/legion/ha_devel/hadoop-ecosystem-3x/hadoop-3.1.1/lib/native/libhdfs.so
: ()+0xaa4f*
  /lib/x86_64-linux-gnu/libpthread.so.0 : ()+0x85a1
  /lib/x86_64-linux-gnu/libpthread.so.0 : ()+0x962a
  /lib/x86_64-linux-gnu/libc.so.6 : clone()+0x43



On Wed, 8 Sept 2021 at 04:07, Weston Pace  wrote:

> I'll just add that a PR is in progress (thanks Joris!) for adding this
> adapter: https://github.com/apache/arrow/pull/10991


Re: HDFS ORC to Arrow Dataset

2021-09-07 Thread Weston Pace
I'll just add that a PR is in progress (thanks Joris!) for adding this
adapter: https://github.com/apache/arrow/pull/10991

On Tue, Sep 7, 2021 at 12:05 PM Wes McKinney  wrote:


Re: HDFS ORC to Arrow Dataset

2021-09-07 Thread Wes McKinney
I'm missing context but if you're talking about C++/Python, we are
currently missing a wrapper interface to the ORC reader in the Arrow
datasets library

https://github.com/apache/arrow/tree/master/cpp/src/arrow/dataset

We have CSV, Arrow (IPC), and Parquet interfaces.

But we have an HDFS filesystem implementation and an ORC reader
implementation, so mechanically all of the pieces are there but need
to be connected together.

Thanks,
Wes

On Tue, Sep 7, 2021 at 8:22 AM Manoj Kumar  wrote:


Fwd: HDFS ORC to Arrow Dataset

2021-09-07 Thread Manoj Kumar
Hi Dev-Community,

Can anyone help guide me on how to read ORC directly from HDFS into an
Arrow dataset?

Thanks
Manoj