Hello,

For no database (including traditional ones) is it advisable to fetch this 
amount of data through JDBC. JDBC is not designed for this (neither for import 
nor for export of large data volumes), and it is a highly questionable approach 
from a reliability point of view.

Export the result as a file to HDFS and fetch it from there, or use Oozie to 
push the file from HDFS to an SFTP or other server; see the sketch below. There 
are further alternatives depending on your use case.
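
For example, a minimal sketch of the first option (the staging directory 
/tmp/hive_export and the CSV delimiter are just assumptions; adjust paths, 
columns and delimiters to your environment):

beeline \
    -u jdbc:hive2://srv:10000/db \
    -n user -p password \
    -e 'INSERT OVERWRITE DIRECTORY "/tmp/hive_export"
        ROW FORMAT DELIMITED FIELDS TERMINATED BY ","
        SELECT <a lot of columns> FROM `db`.`table`
        WHERE (year=2016 AND month=6 AND day=1 AND hour=10)'

# then pull the resulting part files down as one local file, bypassing JDBC
hdfs dfs -getmerge /tmp/hive_export ./export.csv

From there you can scp/sftp the file wherever it is needed, or let an Oozie 
workflow handle the HDFS-to-SFTP step on a schedule.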

Best regards

> On 20 Jun 2016, at 16:43, David Nies <david.n...@adition.com> wrote:
> 
> Dear Hive mailing list,
> 
> in my setup, network throughput from the HiveServer2 to the client seems to 
> be the bottleneck, and I’m seeking a way to increase it. Let me elaborate on 
> my use case:
> 
> I’m using Hive version 1.1.0, as bundled with Cloudera 5.5.1.
> 
> I want to fetch a huge amount of data from our Hive cluster. By huge I mean 
> something around 100 million rows. The Hive table I’m querying is an external 
> table whose data is stored in .avro files. On HDFS, the data I want to fetch 
> (i.e. the aforementioned 100 million rows) is about 5 GB in size. A cleverer 
> filtering strategy (to reduce the amount of data) is sadly not an option, 
> since I need all the data.
> 
> I was able to reduce the time the MapReduce job takes to an acceptable 
> interval by fiddling around with 
> `mapreduce.input.fileinputformat.split.maxsize`. The part that takes ages 
> comes after MapReduce: I’m observing that the Hadoop namenode hosting the 
> HiveServer2 is sending data at only around 3 MB/s. Our network is capable of 
> much more. Playing around with `fetchSize` did not increase throughput. 
> 
> As I identified network throughput to be the bottleneck, I restricted my 
> efforts to trying to increase it. For this, I simply ran the query I’d 
> normally issue through JDBC (from Clojure/Java) via `beeline` instead, 
> dumping the output to `/dev/null`. My `beeline` invocation looks something 
> like this:
> 
> beeline \
>     -u jdbc:hive2://srv:10000/db \
>     -n user -p password \
>     --outputformat=csv2 \
>     --incremental=true \
>     --hiveconf mapreduce.input.fileinputformat.split.maxsize=33554432 \
>     -e 'SELECT <a lot of columns> FROM `db`.`table` WHERE (year=2016 AND 
> month=6 AND day=1 AND hour=10)' > /dev/null
> 
> I already tried playing around with additional `--hiveconf`s:
> 
>     --hiveconf hive.exec.compress.output=true \
>     --hiveconf mapred.output.compression.type=BLOCK \
>     --hiveconf mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec \
> 
> without success. 
> 
> In all cases, Hive is only able to utilize a tiny fraction of the available 
> bandwidth. Is there a way to increase network throughput?
> 
> Thank you in advance!
> 
> Yours
> 
> David Nies
> Developer, Business Intelligence
>  
> ADITION technologies AG
>  
> Oststraße 55, D-40211 Düsseldorf
> Schwarzwaldstraße 78b, D-79117 Freiburg im Breisgau
>  
> T +49 211 987400 30
> F +49 211 987400 33
> E david.n...@adition.com
>  
> Technical support is available at +49 1805 2348466
> (landline: 14 ct/min; mobile: max. 42 ct/min)
>  
> Follow us on XING or visit us at www.adition.com.
>  
> Executive board: Andreas Kleiser, Jörg Klekamp, Dr. Lutz Lowis, Marcus Schlüter 
> Chairman of the supervisory board: Joachim Schneidmadl
> Registered with the District Court of Düsseldorf under HRB 54076 
> VAT ID no.: DE 218 858 434
> 
