[GitHub] [arrow-datafusion] alamb commented on pull request #1223: Add support of HDFS as remote object store

GitBox Fri, 05 Nov 2021 12:30:04 -0700


alamb commented on pull request #1223:
URL: https://github.com/apache/arrow-datafusion/pull/1223#issuecomment-961315661

Thank you very much @yahoNanJing .

I personally would like to see such integration / connector code live in
other repos (rather than a feature flag on datafusion). My primary rationale is
to keep DataFusion and Ballista development easier.

DataFusion and Ballista already have fairly substantial overhead to working
on this codebase, so adding another integration will make testing / coding
changes in this repo take longer.

For example, I don't know how to run / test hdfs locally. I tried on this
branch:

```
-*- mode: compilation; default-directory: "~/Software/arrow-datafusion/" -*-
cargo test --features=hdfs -p datafusion --test sql
Compiling fs-hdfs v0.1.3
Compiling parquet v6.0.0
error: failed to run custom build command for `fs-hdfs v0.1.3`

Caused by:
process didn't exit successfully:
`/Volumes/RAMDisk/df-target/debug/build/fs-hdfs-e451fc9b245fa518/build-script-build`
(exit status: 101)
--- stderr
thread 'main' panicked at 'JAVA_HOME shell environment must be set:
environment variable not found',
/Users/alamb/.cargo/registry/src/github.com-1ecc6299db9ec823/fs-hdfs-0.1.3/build.rs:132:13
note: run with `RUST_BACKTRACE=1` environment variable to display a
backtrace
warning: build failed, waiting for other jobs to finish...
```

I am sure I could go figure out how to setup the HDFS dependency, but I
don't have the time right now.

Putting the HDFS code in this repo will help ensure we keep DataFusion's API
resonably stable as you mentioned on
https://github.com/apache/arrow-datafusion/pull/1062#discussion_r741575976 and
effectivly make the maintenance of the hdfs connector easier.

However, in total, I selfishly would like to optimize for the maintenance of
DataFusion as it currently exists rather than the hdfs connector.

Thoughts @houqp @Dandandan @Jimexist @rdettai ?

BTW if there is consensus for putting this code into the arrow-datafusion
repo, prior I would like to see:
1. CI coverage (aka testing that this works as part of the github checks --
I may have missed it but I don't think this is covered now)
2. A developer's guide explaining how to configure / run the tests

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

[GitHub] [arrow-datafusion] alamb commented on pull request #1223: Add support of HDFS as remote object store

Reply via email to