Aside from this, the low network performance could also stem from the Java application receiving the JDBC stream (not threaded, not efficiently implemented, etc.). That being said, do not use JDBC for this.
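To illustrate the client-side angle, here is a minimal sketch of consuming a Hive JDBC result set with an explicit fetch size and a tight read loop, assuming the hive-jdbc driver is on the classpath. The URL, credentials and query are the placeholders from the thread, and the fetch-size value of 10000 is only an assumption to experiment with, not a recommendation.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class HiveFetchSketch {
        public static void main(String[] args) throws Exception {
            // Placeholder URL/credentials from the thread; adjust for your cluster.
            try (Connection conn = DriverManager.getConnection(
                    "jdbc:hive2://srv:10000/db", "user", "password");
                 Statement stmt = conn.createStatement()) {

                // A larger fetch size means fewer Thrift round trips per batch of rows.
                stmt.setFetchSize(10000);

                try (ResultSet rs = stmt.executeQuery(
                        "SELECT <a lot of columns> FROM `db`.`table` "
                        + "WHERE (year=2016 AND month=6 AND day=1 AND hour=10)")) {
                    long rows = 0;
                    while (rs.next()) {
                        // Consume rows as cheaply as possible here; any expensive
                        // per-row work on this thread throttles the whole stream.
                        rows++;
                    }
                    System.out.println("Fetched rows: " + rows);
                }
            }
        }
    }

Even with a tuned fetch size, every row still has to pass through HiveServer2 and the Thrift/JDBC layer batch by batch, which is why a file-based export is the more robust route for 100 million rows.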
> On 20 Jun 2016, at 17:28, Jörn Franke <jornfra...@gmail.com> wrote:
>
> Hello,
>
> For no database (including traditional ones) is it advisable to fetch this
> amount through JDBC. JDBC is not designed for this (neither for import nor
> for export of large data volumes). It is a highly questionable approach from
> a reliability point of view.
>
> Export it as a file to HDFS and fetch it from there, or use Oozie to dump the
> file from HDFS to an SFTP or other server. There are alternatives depending on
> your use case.
>
> Best regards
>
>> On 20 Jun 2016, at 16:43, David Nies <david.n...@adition.com> wrote:
>>
>> Dear Hive mailing list,
>>
>> in my setup, network throughput from the HiveServer2 to the client seems to
>> be the bottleneck and I'm seeking a way to increase throughput. Let me
>> elaborate my use case:
>>
>> I'm using Hive version 1.1.0 as bundled with Cloudera 5.5.1.
>>
>> I want to fetch a huge amount of data from our Hive cluster. By huge I mean
>> something around 100 million rows. The Hive table I'm querying is an
>> external table whose data is stored in .avro files. On HDFS, the data I want
>> to fetch (i.e. the aforementioned 100 million rows) is about 5 GB in size.
>> A cleverer filtering strategy (to reduce the amount of data) is sadly not an
>> option, since I need all the data.
>>
>> I was able to reduce the time the MapReduce job takes to an agreeable
>> interval by fiddling around with
>> `mapreduce.input.fileinputformat.split.maxsize`. The part that is taking
>> ages comes after MapReduce: I'm observing that the Hadoop namenode that is
>> hosting the HiveServer2 is sending data at merely around 3 MB/sec. Our
>> network is capable of much more. Playing around with `fetchSize` did not
>> increase throughput.
>>
>> As I identified network throughput to be the bottleneck, I restricted my
>> efforts to trying to increase it. For this, I simply run the query I'd
>> normally run through JDBC (from Clojure/Java) via `beeline`, dumping the
>> output to `/dev/null`. My `beeline` query looks something like this:
>>
>>   beeline \
>>     -u jdbc:hive2://srv:10000/db \
>>     -n user -p password \
>>     --outputformat=csv2 \
>>     --incremental=true \
>>     --hiveconf mapreduce.input.fileinputformat.split.maxsize=33554432 \
>>     -e 'SELECT <a lot of columns> FROM `db`.`table` WHERE (year=2016 AND
>>     month=6 AND day=1 AND hour=10)' > /dev/null
>>
>> I already tried playing around with additional `--hiveconf`s:
>>
>>   --hiveconf hive.exec.compress.output=true \
>>   --hiveconf mapred.output.compression.type=BLOCK \
>>   --hiveconf mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec \
>>
>> without success.
>>
>> In all cases, Hive is only able to utilize a tiny fraction of the bandwidth
>> that is available. Is there a possibility to increase network throughput?
>>
>> Thank you in advance!
>>
>> Yours
>>
>> David Nies
>> Developer Business Intelligence
>>
>> ADITION technologies AG
>>
>> Oststraße 55, D-40211 Düsseldorf
>> Schwarzwaldstraße 78b, D-79117 Freiburg im Breisgau
>>
>> T +49 211 987400 30
>> F +49 211 987400 33
>> E david.n...@adition.com
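For completeness, the export-to-HDFS route suggested above could look roughly like the sketch below. Only the short export statement travels over JDBC; the data itself is written by the cluster to HDFS and is then pulled off with `hdfs dfs -get` or WebHDFS. The target directory, delimiter and query are assumptions for illustration, and the same statement can just as well be run through beeline instead of JDBC.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.Statement;

    public class HiveExportSketch {
        public static void main(String[] args) throws Exception {
            // Only this short statement goes over JDBC; the ~5 GB of result data
            // is written by the cluster directly to HDFS.
            String export =
                "INSERT OVERWRITE DIRECTORY '/tmp/export_2016_06_01_10' "   // assumed target path
              + "ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' "
              + "SELECT <a lot of columns> FROM `db`.`table` "
              + "WHERE (year=2016 AND month=6 AND day=1 AND hour=10)";

            try (Connection conn = DriverManager.getConnection(
                    "jdbc:hive2://srv:10000/db", "user", "password");
                 Statement stmt = conn.createStatement()) {
                stmt.execute(export);
            }

            // Afterwards, fetch the result files off the cluster, e.g.:
            //   hdfs dfs -get /tmp/export_2016_06_01_10 .
            // or via WebHDFS/HttpFS if the client has no direct HDFS access.
        }
    }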