Aside from this, the low network performance could also stem from the Java 
application receiving the JDBC stream (not threaded, not efficiently 
implemented, etc.). That being said, do not use JDBC for this.
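
If JDBC must be used for the time being, at least make sure the client streams 
rows as they arrive and experiment with the statement fetch size. A minimal, 
untested sketch of what I mean (connection details copied from the beeline 
example below; the query and the fetch size are placeholders to tune by 
measurement):

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    // Requires the hive-jdbc driver on the classpath.
    public class HiveFetch {
        public static void main(String[] args) throws Exception {
            // Older drivers may need explicit registration:
            Class.forName("org.apache.hive.jdbc.HiveDriver");
            try (Connection conn = DriverManager.getConnection(
                         "jdbc:hive2://srv:10000/db", "user", "password");
                 Statement stmt = conn.createStatement()) {
                // Ask the driver to pull rows from HiveServer2 in larger
                // batches than the default; the best value depends on row
                // width, so measure.
                stmt.setFetchSize(10000);
                try (ResultSet rs = stmt.executeQuery(
                        "SELECT <a lot of columns> FROM db.table "
                            + "WHERE year=2016 AND month=6 AND day=1 AND hour=10")) {
                    while (rs.next()) {
                        // Consume each row as it arrives instead of
                        // collecting 100 million rows in memory.
                        handle(rs.getString(1));
                    }
                }
            }
        }

        private static void handle(String value) {
            // application-specific handling goes here
        }
    }

Even a well-tuned client will not turn JDBC into a bulk transfer channel, 
though, which is why the advice below stands.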

> On 20 Jun 2016, at 17:28, Jörn Franke <jornfra...@gmail.com> wrote:
> 
> Hello,
> 
> For no database (including traditional ones) is it advisable to fetch this 
> amount of data through JDBC. JDBC is not designed for this (neither for 
> import nor for export of large data volumes), and it is a highly 
> questionable approach from a reliability point of view.
> 
> Export it as a file to HDFS and fetch it from there, or use Oozie to ship 
> the file from HDFS to an SFTP or other server. There are other alternatives 
> depending on your use case.
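> 
> As an illustration of the HDFS route, something along these lines might work 
> (untested sketch; the output path, delimiter, and column list are 
> placeholders):
> 
>     beeline \
>         -u jdbc:hive2://srv:10000/db \
>         -n user -p password \
>         -e "INSERT OVERWRITE DIRECTORY '/tmp/hive_export'
>             ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
>             SELECT <a lot of columns> FROM db.table
>             WHERE year=2016 AND month=6 AND day=1 AND hour=10"
> 
>     # then fetch the result files directly from HDFS:
>     hdfs dfs -getmerge /tmp/hive_export /local/path/export.csv
> 
> Here only the statement travels over JDBC; the rows are written in parallel 
> by the job directly to HDFS and never pass through the Thrift result-set 
> stream.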
> 
> Best regards
> 
>> On 20 Jun 2016, at 16:43, David Nies <david.n...@adition.com> wrote:
>> 
>> Dear Hive mailing list,
>> 
>> In my setup, network throughput from HiveServer2 to the client seems to be 
>> the bottleneck, and I’m seeking a way to increase it. Let me elaborate on 
>> my use case:
>> 
>> I’m using Hive version 1.1.0 as bundled with Cloudera 5.5.1.
>> 
>> I want to fetch a huge amount of data from our Hive cluster; by huge I mean 
>> around 100 million rows. The Hive table I’m querying is an external table 
>> whose data is stored in .avro files. On HDFS, the data I want to fetch 
>> (i.e. the aforementioned 100 million rows) is about 5 GB in size. A 
>> cleverer filtering strategy (to reduce the amount of data) is sadly not an 
>> option, since I need all the data.
>> 
>> I was able to reduce the time the MapReduce job takes to an agreeable 
>> interval by fiddling with 
>> `mapreduce.input.fileinputformat.split.maxsize`. The part that takes ages 
>> comes after MapReduce: I’m observing that the Hadoop namenode hosting 
>> HiveServer2 is sending data at merely around 3 MB/s, while our network is 
>> capable of much more. Playing around with `fetchSize` did not increase 
>> throughput. 
>> 
>> As I identified network throughput as the bottleneck, I restricted my 
>> efforts to trying to increase it. For this, I simply run the query I’d 
>> normally issue through JDBC (from Clojure/Java) via `beeline` instead, 
>> dumping the output to `/dev/null`. My `beeline` invocation looks something 
>> like this:
>> 
>> beeline \
>>     -u jdbc:hive2://srv:10000/db \
>>     -n user -p password \
>>     --outputformat=csv2 \
>>     --incremental=true \
>>     --hiveconf mapreduce.input.fileinputformat.split.maxsize=33554432 \
>>     -e 'SELECT <a lot of columns> FROM `db`.`table` WHERE (year=2016 AND
>>         month=6 AND day=1 AND hour=10)' > /dev/null
>> 
>> I already tried playing around with additional `--hiveconf`s:
>> 
>>     --hiveconf hive.exec.compress.output=true \
>>     --hiveconf mapred.output.compression.type=BLOCK \
>>     --hiveconf mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec \
>> 
>> without success. 
>> 
>> In all cases, Hive is only able to utilize a tiny fraction of the available 
>> bandwidth. Is there a way to increase network throughput?
>> 
>> Thank you in advance!
>> 
>> Yours
>> 
>> David Nies
>> Developer, Business Intelligence
>>  
>> ADITION technologies AG
>>  
>> Oststraße 55, D-40211 Düsseldorf
>> Schwarzwaldstraße 78b, D-79117 Freiburg im Breisgau
>>  
>> T +49 211 987400 30
>> F +49 211 987400 33
>> E david.n...@adition.com
>>  
