hudi-bot opened a new issue, #14584:
URL: https://github.com/apache/hudi/issues/14584
Test presto integration for HDFS environment as well in addition to S3.
Blockers faced so far
[~bdscheller] I tried to apply your presto patch to test MOR queries on
Presto. The way I set it up was to create a docker image from your presto patch
and use that image in Hudi's local docker environment. I observed a couple of
issues there:
* I got NoClassDefFoundError for these classes:
** org/apache/parquet/avro/AvroSchemaConverter
** org/apache/parquet/hadoop/ParquetFileReader
** org/apache/parquet/io/InputFile
** org/apache/parquet/format/TypeDefinedOrder
I was able to get around the first three errors by shading
org.apache.parquet inside hudi-presto-bundle and changing presto-hive to depend
on the hudi-presto-bundle. However, for the last one shading didn't help because
it is already a Thrift-generated class. I am wondering whether you also ran into
similar issues while testing on S3.
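For reference, one quick way to check whether a given class actually made it into the bundle (and under which relocated package) is to list the jar contents. This is only a diagnostic sketch; the jar path and version below are assumptions from my local build, not anything fixed by the project:
{quote}# Diagnostic only: list the bundle contents and look for the parquet classes in question.
# The jar path/version are assumptions from a local 0.6.0-SNAPSHOT build.
jar tf packaging/hudi-presto-bundle/target/hudi-presto-bundle-0.6.0-SNAPSHOT.jar | grep 'parquet/avro/AvroSchemaConverter'
# The Thrift-generated classes live in the separate parquet-format artifact, so check for them explicitly:
jar tf packaging/hudi-presto-bundle/target/hudi-presto-bundle-0.6.0-SNAPSHOT.jar | grep 'parquet/format/TypeDefinedOrder'{quote}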
Could you please elaborate on your test setup so we can do something similar for
HDFS as well? If we need to add more changes to hudi-presto-bundle, we would
need to prioritize that for the 0.5.3 release ASAP.
## JIRA info
- Link: https://issues.apache.org/jira/browse/HUDI-907
- Type: Sub-task
- Affects version(s):
- 0.9.0
---
## Comments
20/May/20 07:46;bhavanisudha;[~bdscheller] These are the steps to recreate
and test.
h4. Setup to test mor queries through Presto on HDFS data
I made some changes to the original presto patch -
[https://github.com/bschell/presto/commit/a3fb658c1cd70fd72f0a3021b3d994fe383303aa]
* Rebased it on top of the latest Presto master, which brings in hudi as a
compile-time dependency.
* Changed the function isHudiInputFormat and renamed it to
isHudiParquetInputFormat. The new behavior: for a COW table query it takes the
HoodieROTablePathFilter route, while for a MOR table query it invokes
HoodieParquetRealtimeInputFormat.getSplits().
You can find the changes here -
[https://github.com/bhasudha/presto/commit/ce961a6ee10e154dd98f28615d628c2cf995a3c7]
Next I took these changes and tried to run a query. I got a NoClassDefFoundError
at runtime for AvroSchemaConverter. From here the options were either
* adding additional deps on org.apache.parquet:parquet-avro and
org.apache.avro:avro inside the presto-hive module, *OR*
* adding a compile-time dep on hudi-presto-bundle, which already shades these deps.
I took the second route and changed the root presto pom to depend on
`hudi-presto-bundle` instead of `hudi-hadoop-mr`, and made similar changes
inside the presto-hive module's pom. At this point, when trying to build presto, I
got conflict errors between hudi's version of parquet and presto's version
of parquet. So I tried relocating the shaded parquet inside hudi-presto-bundle
and also added deps on `parquet-common`, `parquet-encoding`, `parquet-column`,
`parquet-hadoop` etc. inside hudi-presto-bundle. The next build ran fine, but I
saw a NoClassDefFoundError for `org/apache/parquet/format/TypeDefinedOrder`, which
is a Thrift-generated class in parquet-format. At this point I was blocked.
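One way to see where the conflicting parquet versions come from is Maven's dependency tree. A rough diagnostic sketch (it assumes the presto and hudi modules have already been installed to the local .m2 repo so that single-module resolution works; `presto-hive` and `packaging/hudi-presto-bundle` are the module paths in the respective source trees):
{quote}# From the presto source root: which parquet artifacts/versions does presto-hive resolve
# after the hudi-presto-bundle dependency is added?
mvn dependency:tree -pl presto-hive -Dincludes=org.apache.parquet
# From the hudi source root: what does the bundle itself pull in?
mvn dependency:tree -pl packaging/hudi-presto-bundle -Dincludes=org.apache.parquet{quote}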
h5. *Docker setup*
* I built hudi locally with the changes (if any, as described above) in
hudi-presto-bundle's pom.
* I published it to the local .m2 maven repo using the command `mvn install:install-file
-Dfile=./hudi-presto-bundle-0.6.0-SNAPSHOT.jar -DgroupId=org.apache.hudi
-DartifactId=hudi-presto-bundle -Dversion=0.6.0-SNAPSHOT -Dpackaging=jar`.
* Built presto normally with the changes from your patch (described above).
This picks up the hudi version just installed in the local .m2 repo.
* Copied presto-server/target/presto-server-0.236-SNAPSHOT.tar.gz and
presto-cli-0.236-SNAPSHOT-executable.jar to a temporary directory and served it with a
simple HTTP server
([https://www.pythonforbeginners.com/modules-in-python/how-to-use-simplehttpserver/]),
for example `python -m SimpleHTTPServer 1234`. This serves as the webserver URL from
which the local Presto docker image is built in the next steps.
* Built a Presto docker image using this patch (replace x.x.x.x with your
host IP):
{quote}diff --git a/docker/hoodie/hadoop/prestobase/Dockerfile
b/docker/hoodie/hadoop/prestobase/Dockerfile
index 43b989e6..98b5dc7c 100644
--- a/docker/hoodie/hadoop/prestobase/Dockerfile
+++ b/docker/hoodie/hadoop/prestobase/Dockerfile
@@ -22,10 +22,9 @@ ARG HADOOP_VERSION=2.8.4
ARG HIVE_VERSION=2.3.3
FROM apachehudi/hudi-hadoop_${HADOOP_VERSION}-base:latest as hadoop-base
-ARG PRESTO_VERSION=0.217
-
+ARG PRESTO_VERSION=0.236
ENV PRESTO_VERSION ${PRESTO_VERSION}
-ENV PRESTO_HOME /opt/presto-server-${PRESTO_VERSION}
+ENV PRESTO_HOME /opt/presto-server-${PRESTO_VERSION}-SNAPSHOT
ENV PRESTO_CONF_DIR ${PRESTO_HOME}/etc
ENV PRESTO_LOG_DIR /var/log/presto
ENV PRESTO_JVM_MAX_HEAP 2G
@@ -53,11 +52,11 @@ RUN set -x \
gosu \
&& rm -rf /var/lib/apt/lists/* \
## presto-server
- && wget -q -O - https://repo1.maven.org/maven2/com/facebook/presto/presto-server/${PRESTO_VERSION}/presto-server-${PRESTO_VERSION}.tar.gz \
+ && wget -q -O - http://x.x.x.x:1234/presto-server-${PRESTO_VERSION}.tar.gz \
| tar -xzf - -C /opt/ \
&& mkdir -p /var/hoodie/ws/docker/hoodie/hadoop/prestobase/target/ \
## presto-client
- && wget -q -O /usr/local/bin/presto https://repo1.maven.org/maven2/com/facebook/presto/presto-cli/${PRESTO_VERSION}/presto-cli-${PRESTO_VERSION}-executable.jar \
+ && wget -q -O /usr/local/bin/presto http://x.x.x.x:1234/presto-cli-${PRESTO_VERSION}-executable.jar \
&& chmod +x /usr/local/bin/presto \
## user/dir/permmsion
&& adduser --shell /sbin/nologin --uid 1000 docker \
@@ -76,10 +75,6 @@ COPY bin/* /usr/local/bin/
COPY lib/* /usr/local/lib/
RUN chmod +x /usr/local/bin/entrypoint.sh
-ADD target/ /var/hoodie/ws/docker/hoodie/hadoop/prestobase/target/
-ENV HUDI_PRESTO_BUNDLE /var/hoodie/ws/docker/hoodie/hadoop/prestobase/target/hudi-presto-bundle.jar
-RUN cp ${HUDI_PRESTO_BUNDLE} ${PRESTO_HOME}/plugin/hive-hadoop2/
-
VOLUME ["${PRESTO_LOG_DIR}"]
WORKDIR ${PRESTO_HOME}
diff --git a/docker/hoodie/hadoop/prestobase/bin/entrypoint.sh
b/docker/hoodie/hadoop/prestobase/bin/entrypoint.sh
index 58b55085..c457f646 100755
--- a/docker/hoodie/hadoop/prestobase/bin/entrypoint.sh
+++ b/docker/hoodie/hadoop/prestobase/bin/entrypoint.sh
@@ -54,10 +54,6 @@ do
conf_file=${template%.mustache}
cat ${conf_file}.mustache | mustache.sh > ${conf_file}
done
-
-# Copy the presto bundle at run time so that locally built bundle overrides the one that is present in the image
-cp ${HUDI_PRESTO_BUNDLE} ${PRESTO_HOME}/plugin/hive-hadoop2/
-
case "$1" in
"coordinator" | "worker" )
server_role="$1"{quote}
Now build the image using
{quote}cd docker/hoodie/hadoop/prestobase
docker build .{quote}
* The next step is to push this to a local docker registry using these steps:
{quote}docker run -d -p 5000:5000 --restart=always --name registry registry:2
docker tag <ImageID> localhost:5000/prestobase:latest
docker push localhost:5000/prestobase:latest
{quote}
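Optionally, before editing the compose file, it is worth confirming the image is actually visible in the local registry. A small sanity-check sketch (these are the standard Docker Registry V2 API endpoints; the repository name matches the tag pushed above):
{quote}# The local registry should list the repository that was just pushed.
curl http://localhost:5000/v2/_catalog
# Expected output should include: {"repositories":["prestobase"]}
# And the tags available for it:
curl http://localhost:5000/v2/prestobase/tags/list{quote}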
Now edit docker/compose/docker-compose_hadoop284_hive233_spark244.yml to replace
`image: apachehudi/hudi-hadoop_2.8.4-prestobase_0.217:latest` with
`image: localhost:5000/prestobase:latest` (the tag pushed above), and run the docker
setup as described here - [https://hudi.apache.org/docs/docker_demo.html]
This way Presto queries can be tested in an HDFS environment locally using Hudi's
docker setup.
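For reference, the compose edit and demo bring-up can also be scripted. This is only a rough sketch: the sed replacement mirrors the image swap described above, and setup_demo.sh is the bring-up script referenced in the docker demo page linked above.
{quote}# From the hudi repo root: point the compose file at the locally pushed image (GNU sed).
sed -i 's|apachehudi/hudi-hadoop_2.8.4-prestobase_0.217:latest|localhost:5000/prestobase:latest|' \
  docker/compose/docker-compose_hadoop284_hive233_spark244.yml
# Bring up the demo stack.
cd docker && ./setup_demo.sh{quote}
Once the containers are up, the MOR queries can be run through the Presto CLI in the demo containers, as described in the docker demo docs.;;;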
---
25/May/20 12:38;shivnarayan;[~bhavanisudha]: do we need this in 0.5.3? Can
I move it to 0.6.0?;;;
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]