> On 21.06.2016 at 08:59, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
> 
> is the underlying table partitioned i.e.
> 
> 'SELECT <a lot of columns> FROM `db`.`table` WHERE (year=2016 AND month=6 AND 
> day=1 AND hour=10)'
Yes, it is; year, month, day and hour are the partition columns.
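
For what it’s worth, a quick sanity check along these lines (reusing the 
connection details from my beeline command quoted further down; the partition 
spec is just the one from the query) lists exactly that partition:

beeline \
    -u jdbc:hive2://srv:10000/db \
    -n user -p password \
    -e 'SHOW PARTITIONS `db`.`table` PARTITION(year=2016, month=6, day=1, hour=10)'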

> 
> and also, what result set (RS) size is expected?
I’m not sure I understand the question. Are you asking whether I expect the 
result set to be of that size? As I mentioned in the email that started this 
thread, I’m expecting around 100 million rows; the raw data on HDFS is about 
5 GB in size (so roughly 50 bytes per row on average).

> 
> JDBC on its own should work. Is this an ORC table?
> 
> What version of Hive are you using?
Kindly find the answers to these questions in my first email :)

> 
> HTH

-David

> 
> Dr Mich Talebzadeh
> 
> LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> 
> http://talebzadehmich.wordpress.com
> 
> 
> On 21 June 2016 at 07:52, David Nies <david.n...@adition.com> wrote:
> In my test case below, I’m using `beeline` as the Java application receiving 
> the JDBC stream. As I understand it, this is the reference command line 
> interface to Hive. Are you saying that the reference command line interface 
> is not efficiently implemented? :)
> 
> -David Nies
> 
>> On 20.06.2016 at 17:46, Jörn Franke <jornfra...@gmail.com> wrote:
>> 
>> Aside from this, the low network performance could also stem from the Java 
>> application receiving the JDBC stream (e.g. not threaded or not efficiently 
>> implemented). That being said, do not use JDBC for this.
>> 
>> On 20 Jun 2016, at 17:28, Jörn Franke <jornfra...@gmail.com> wrote:
>> 
>>> Hello,
>>> 
>>> For no database (including traditional ones) is it advisable to fetch this 
>>> amount of data through JDBC. JDBC is not designed for this, neither for 
>>> import nor for export of large data volumes, and it is a highly 
>>> questionable approach from a reliability point of view.
>>> 
>>> Export the data as a file to HDFS and fetch it from there, or use Oozie to 
>>> move the file from HDFS to an SFTP or other server. There are alternatives 
>>> depending on your use case; a minimal sketch of the first route follows.
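>>> 
>>> A minimal sketch of that route (the target path and tab delimiter are only 
>>> placeholders; ROW FORMAT DELIMITED on INSERT OVERWRITE DIRECTORY requires 
>>> Hive 0.11 or later, so your 1.1.0 is fine):
>>> 
>>> beeline \
>>>     -u jdbc:hive2://srv:10000/db \
>>>     -n user -p password \
>>>     -e "INSERT OVERWRITE DIRECTORY '/tmp/table_export'
>>>         ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
>>>         SELECT <a lot of columns> FROM db.table
>>>         WHERE year=2016 AND month=6 AND day=1 AND hour=10"
>>> 
>>> # afterwards, pull the result files off HDFS directly,
>>> # bypassing HiveServer2 and JDBC entirely:
>>> hdfs dfs -get /tmp/table_export /some/local/directory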
>>> 
>>> Best regards
>>> 
>>> On 20 Jun 2016, at 16:43, David Nies <david.n...@adition.com> wrote:
>>> 
>>>> Dear Hive mailing list,
>>>> 
>>>> in my setup, network throughput from the HiveServer2 to the client seems 
>>>> to be the bottleneck and I’m seeking a way to increase it. Let me 
>>>> elaborate on my use case:
>>>> 
>>>> I’m using Hive version 1.1.0 as bundled with Cloudera CDH 5.5.1.
>>>> 
>>>> I want to fetch a huge amount of data from our Hive cluster. By huge I 
>>>> mean something around 100 million rows. The Hive table I’m querying is an 
>>>> external table whose data is stored in Avro (.avro) files. On HDFS, the 
>>>> data I want to fetch (i.e. the aforementioned 100 million rows) is about 
>>>> 5 GB in size. A cleverer filtering strategy to reduce the amount of data 
>>>> is sadly not an option, since I need all the data.
>>>> 
>>>> I was able to reduce the time the MapReduce job takes to an acceptable 
>>>> interval by fiddling around with 
>>>> `mapreduce.input.fileinputformat.split.maxsize`. The part that takes ages 
>>>> comes after MapReduce: I’m observing that the Hadoop namenode hosting the 
>>>> HiveServer2 is sending data at merely around 3 MB/s; at that rate, the 
>>>> ~5 GB take roughly half an hour to transfer. Our network is capable of 
>>>> much more. Playing around with `fetchSize` did not increase throughput.
>>>> 
>>>> As I identified network throughput as the bottleneck, I restricted my 
>>>> efforts to trying to increase it. To that end, I simply run the query I’d 
>>>> normally run through JDBC (from Clojure/Java) via `beeline` and dump the 
>>>> output to `/dev/null`. My `beeline` invocation looks something like this:
>>>> 
>>>> beeline \
>>>>     -u jdbc:hive2://srv:10000/db \
>>>>     -n user -p password \
>>>>     --outputformat=csv2 \
>>>>     --incremental=true \
>>>>     --hiveconf mapreduce.input.fileinputformat.split.maxsize=33554432 \
>>>>     -e 'SELECT <a lot of columns> FROM `db`.`table` WHERE (year=2016 AND 
>>>> month=6 AND day=1 AND hour=10)' > /dev/null
>>>> 
>>>> I already tried playing around with additional `--hiveconf` settings:
>>>> 
>>>>     --hiveconf hive.exec.compress.output=true \
>>>>     --hiveconf mapred.output.compression.type=BLOCK \
>>>>     --hiveconf 
>>>> mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec \
>>>> 
>>>> without success. 
>>>> 
>>>> In all cases, Hive is only able to utilize a tiny fraction of the 
>>>> available bandwidth. Is there a way to increase network throughput?
>>>> 
>>>> Thank you in advance!
>>>> 
>>>> Yours
>>>> 
>>>> David Nies
>>>> Developer, Business Intelligence
>>>>  
>>>> ADITION technologies AG
>>>>  
>>>> Oststraße 55, D-40211 Düsseldorf
>>>> Schwarzwaldstraße 78b, D-79117 Freiburg im Breisgau
>>>>  
>>>> T +49 211 987400 30
>>>> F +49 211 987400 33
>>>> E david.n...@adition.com
>>>>  
>>>> Technical support: +49 1805 2348466
>>>> (landline: 14 ct/min; mobile: up to 42 ct/min)
>>>>  
>>>> Follow us on XING or visit us at www.adition.com.
>>>>  
>>>> Executive board: Andreas Kleiser, Jörg Klekamp, Dr. Lutz Lowis, Marcus Schlüter 
>>>> Chairman of the supervisory board: Joachim Schneidmadl
>>>> Registered with the District Court of Düsseldorf, HRB 54076 
>>>> VAT ID: DE 218 858 434
>>>> 