> On 21.06.2016 at 08:59, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>
> is the underlying table partitioned i.e.
>
> 'SELECT <a lot of columns> FROM `db`.`table` WHERE (year=2016 AND month=6 AND day=1 AND hour=10)'

Yes, it is; year, month, day and hour are partition columns.
>
> and also what is the RS size that is expected.

I am not sure I understand the question. Do you mean whether I expect the result set to be that size? As I mentioned in my email that started this thread, I am expecting around 100 million rows. The raw data on HDFS is about 5 GB in size.

> JDBC on its own should work. Is this an ORC table?
>
> What version of Hive are you using?

Kindly find the answers to these questions in my first email :)

> HTH

-David

> Dr Mich Talebzadeh
>
> LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>
> http://talebzadehmich.wordpress.com
>
>
> On 21 June 2016 at 07:52, David Nies <david.n...@adition.com> wrote:
>
> In my test case below, I'm using `beeline` as the Java application receiving the JDBC stream. As I understand it, this is the reference command-line interface to Hive. Are you saying that the reference command-line interface is not efficiently implemented? :)
>
> -David Nies
>
>> On 20.06.2016 at 17:46, Jörn Franke <jornfra...@gmail.com> wrote:
>>
>> Aside from this, the low network performance could also stem from the Java application receiving the JDBC stream (not threaded / not efficiently implemented, etc.). That being said, do not use JDBC for this.
>>
>> On 20 Jun 2016, at 17:28, Jörn Franke <jornfra...@gmail.com> wrote:
>>
>>> Hello,
>>>
>>> For no database (including traditional ones) is it advisable to fetch this amount of data through JDBC. JDBC is not designed for this (neither for import nor for export of large data volumes), and it is a highly questionable approach from a reliability point of view.
>>>
>>> Export the data as a file to HDFS and fetch it from there, or use Oozie to move the file from HDFS to an SFTP or other server. There are alternatives depending on your use case.
>>>
>>> Best regards
>>>
>>> On 20 Jun 2016, at 16:43, David Nies <david.n...@adition.com> wrote:
>>>
>>>> Dear Hive mailing list,
>>>>
>>>> in my setup, network throughput from the HiveServer2 to the client seems to be the bottleneck, and I'm looking for a way to increase throughput. Let me elaborate on my use case:
>>>>
>>>> I'm using Hive version 1.1.0 as bundled with Cloudera 5.5.1.
>>>>
>>>> I want to fetch a huge amount of data from our Hive cluster. By huge I mean something around 100 million rows. The Hive table I'm querying is an external table whose data is stored in .avro files. On HDFS, the data I want to fetch (i.e. the aforementioned 100 million rows) is about 5 GB in size. A cleverer filtering strategy (to reduce the amount of data) is sadly not an option, since I need all the data.
>>>>
>>>> I was able to reduce the time the MapReduce job takes to an acceptable interval by fiddling with `mapreduce.input.fileinputformat.split.maxsize`. The part that takes ages comes after MapReduce: I'm observing that the Hadoop namenode that is hosting the HiveServer2 is sending data at only around 3 MB/sec. Our network is capable of much more. Playing around with `fetchSize` did not increase throughput.
>>>>
>>>> As I identified network throughput to be the bottleneck, I restricted my efforts to trying to increase it.
>>>> For this, I simply run the query I'd normally run through JDBC (from Clojure/Java) via `beeline` and dump the output to `/dev/null`. My `beeline` invocation looks something like this:
>>>>
>>>> beeline \
>>>>     -u jdbc:hive2://srv:10000/db \
>>>>     -n user -p password \
>>>>     --outputformat=csv2 \
>>>>     --incremental=true \
>>>>     --hiveconf mapreduce.input.fileinputformat.split.maxsize=33554432 \
>>>>     -e 'SELECT <a lot of columns> FROM `db`.`table` WHERE (year=2016 AND month=6 AND day=1 AND hour=10)' > /dev/null
>>>>
>>>> I already tried playing around with additional `--hiveconf`s:
>>>>
>>>>     --hiveconf hive.exec.compress.output=true \
>>>>     --hiveconf mapred.output.compression.type=BLOCK \
>>>>     --hiveconf mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec \
>>>>
>>>> without success.
>>>>
>>>> In all cases, Hive is only able to utilize a tiny fraction of the available bandwidth. Is there a way to increase network throughput?
>>>>
>>>> Thank you in advance!
>>>>
>>>> Yours
>>>>
>>>> David Nies
>>>> Developer, Business Intelligence
>>>> ADITION technologies AG
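
The export-to-HDFS route suggested in the thread could look roughly like the sketch below. It is untested and only illustrative: the staging directory, the comma delimiter and the column list are placeholders, and specifying ROW FORMAT DELIMITED on INSERT OVERWRITE DIRECTORY assumes a Hive version that supports it (it should be available in Hive 1.1.0).

    # 1) Have Hive write the result set to an HDFS directory instead of
    #    streaming it back through HiveServer2/JDBC.
    beeline \
        -u jdbc:hive2://srv:10000/db \
        -n user -p password \
        --hiveconf mapreduce.input.fileinputformat.split.maxsize=33554432 \
        -e "INSERT OVERWRITE DIRECTORY '/tmp/hive_export/day1_hour10'
            ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
            SELECT <a lot of columns>
            FROM \`db\`.\`table\`
            WHERE year=2016 AND month=6 AND day=1 AND hour=10"

    # 2) Pull the result files off HDFS with the plain HDFS client, which does
    #    not go through HiveServer2 row serialization.
    hdfs dfs -getmerge /tmp/hive_export/day1_hour10 ./result.csv

    # 3) Remove the staging directory when done.
    hdfs dfs -rm -r /tmp/hive_export/day1_hour10

Moving the bulk transfer to the HDFS client (or to an SFTP push driven by Oozie, as also suggested) takes HiveServer2, the component identified in the thread as the bottleneck, out of the data path.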
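Since the table is an external Avro table and every row of the partition is needed anyway, another way to realize "fetch it from HDFS" would be to skip the extra query and copy the partition's underlying files directly. The path below is hypothetical; the real location can be looked up with DESCRIBE FORMATTED on the table.

    # Hypothetical partition location -- check the table's "Location" with
    # DESCRIBE FORMATTED before using this.
    PARTITION_PATH=/path/to/external/table/year=2016/month=6/day=1/hour=10

    # Copy the raw .avro files to the local machine; throughput is then bounded
    # by HDFS and the network rather than by HiveServer2.
    hdfs dfs -get "$PARTITION_PATH" ./partition_2016-06-01_hour10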