Hi David,

What are you actually trying to do with the data?

Hive and MapReduce are notoriously slow for this type of operation. Hive is good as a storage layer; that is what I vouch for. For pulling data out in bulk, there are other alternatives.
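Since your table is external and stored as Avro on HDFS (as you describe below), one such alternative is to bypass HiveServer2 entirely and read the partition's .avro files straight off HDFS, so the extract runs at HDFS speed rather than through a single Thrift connection. A minimal sketch of my own, not a tested recipe; the partition path is a placeholder you would adjust to your table's location:

    // Read an external table's Avro data files directly from HDFS,
    // bypassing HiveServer2. The partition path below is hypothetical.
    import org.apache.avro.file.DataFileStream;
    import org.apache.avro.generic.GenericDatumReader;
    import org.apache.avro.generic.GenericRecord;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class AvroPartitionDump {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            Path part = new Path("/data/table/year=2016/month=6/day=1/hour=10");
            for (FileStatus f : fs.listStatus(part)) {
                if (!f.getPath().getName().endsWith(".avro")) continue;
                try (DataFileStream<GenericRecord> in = new DataFileStream<>(
                        fs.open(f.getPath()), new GenericDatumReader<GenericRecord>())) {
                    while (in.hasNext()) {
                        GenericRecord rec = in.next();
                        // consume the record here, e.g. write it out as CSV
                    }
                }
            }
        }
    }

This also parallelizes trivially: run one reader per file.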
HTH

Dr Mich Talebzadeh

LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
http://talebzadehmich.wordpress.com


On 20 June 2016 at 15:43, David Nies <david.n...@adition.com> wrote:

> Dear Hive mailing list,
>
> in my setup, network throughput from the HiveServer2 to the client seems
> to be the bottleneck, and I am seeking a way to increase it. Let me
> elaborate my use case:
>
> I am using Hive version 1.1.0 as bundled with Cloudera 5.5.1.
>
> I want to fetch a huge amount of data from our Hive cluster; by huge I
> mean around 100 million rows. The Hive table I am querying is an external
> table whose data is stored in .avro. On HDFS, the data I want to fetch
> (i.e. the aforementioned 100 million rows) is about 5 GB in size. A
> cleverer filtering strategy (to reduce the amount of data) is not an
> option, sadly, since I need all the data.
>
> I was able to reduce the time the MapReduce job takes to an agreeable
> interval by fiddling with `mapreduce.input.fileinputformat.split.maxsize`.
> The part that takes ages comes after MapReduce: I am observing that the
> Hadoop namenode hosting the HiveServer2 is sending data at only around
> 3 MB/s. Our network is capable of much more. Playing around with
> `fetchSize` did not increase throughput.
>
> Having identified network throughput as the bottleneck, I restricted my
> efforts to increasing it. For this, I simply run the query I would
> normally issue over JDBC (from Clojure/Java) via `beeline` instead,
> dumping the output to `/dev/null`. My `beeline` invocation looks
> something like this:
>
> beeline \
>   -u jdbc:hive2://srv:10000/db \
>   -n user -p password \
>   --outputformat=csv2 \
>   --incremental=true \
>   --hiveconf mapreduce.input.fileinputformat.split.maxsize=33554432 \
>   -e 'SELECT <a lot of columns> FROM `db`.`table` WHERE (year=2016 AND
>       month=6 AND day=1 AND hour=10)' > /dev/null
>
> I have already tried playing around with additional `--hiveconf`s:
>
> --hiveconf hive.exec.compress.output=true \
> --hiveconf mapred.output.compression.type=BLOCK \
> --hiveconf mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec \
>
> without success.
>
> In all cases, Hive is only able to utilize a tiny fraction of the
> available bandwidth. Is there a way to increase network throughput?
>
> Thank you in advance!
>
> Yours
>
> David Nies
> Business Intelligence Developer
> ADITION technologies AG
>
> Oststraße 55, D-40211 Düsseldorf
> Schwarzwaldstraße 78b, D-79117 Freiburg im Breisgau
>
> T +49 211 987400 30
> F +49 211 987400 33
> E david.n...@adition.com
>
> Technical support is available on +49 1805 2348466 (landline: 14 ct/min;
> mobile: at most 42 ct/min)
>
> Follow us on XING or visit us at www.adition.com.
>
> Executive Board: Andreas Kleiser, Jörg Klekamp, Dr. Lutz Lowis, Marcus Schlüter
> Chairman of the Supervisory Board: Joachim Schneidmadl
> Registered with the Amtsgericht Düsseldorf under HRB 54076
> VAT ID: DE 218 858 434
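P.S. On the `fetchSize` you mention: it is set per statement on the JDBC side, and the Hive 1.x driver otherwise pulls rows from HiveServer2 in fairly small batches per Thrift round trip. A minimal sketch; the URL, credentials, column list and the value 10000 are placeholders mirroring your beeline call, not recommendations:

    // Set the JDBC fetch size for a bulk read from HiveServer2.
    // Connection details are placeholders, mirroring the beeline call above.
    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class HiveBulkFetch {
        public static void main(String[] args) throws Exception {
            Class.forName("org.apache.hive.jdbc.HiveDriver");
            try (Connection conn = DriverManager.getConnection(
                    "jdbc:hive2://srv:10000/db", "user", "password");
                 Statement stmt = conn.createStatement()) {
                // Rows fetched per Thrift round trip; placeholder value.
                stmt.setFetchSize(10000);
                try (ResultSet rs = stmt.executeQuery(
                        "SELECT col1, col2 FROM db.table"
                        + " WHERE year=2016 AND month=6 AND day=1 AND hour=10")) {
                    while (rs.next()) {
                        // consume the row, e.g. rs.getString(1)
                    }
                }
            }
        }
    }

If a larger fetch size still leaves you at around 3 MB/s, it is worth checking whether HiveServer2's per-row serialization, rather than the wire, is the limiting factor.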