He lives on after all! And thanks for the continued feedback. We need the answers to these questions using HS2:

1. what is the output of "ps -ef | grep -i hiveserver2" on your system? in particular, what is the value of -Xmx?

2. does "select * from table limit 1" work?

Thanks,
Stephen.
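For question 2, beeline straight at HS2 is the quickest test. The host, port, and user below are placeholders for your setup, and "bigtable" is the test table you named earlier in the thread:

    $ beeline -u jdbc:hive2://localhost:10000 -n <user>
    0: jdbc:hive2://localhost:10000> select * from bigtable limit 1;

if that one full-width row comes back, the failure is bound to the row count rather than the column count.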
On Tue, Feb 18, 2014 at 6:32 AM, David Gayou <david.ga...@kxen.com> wrote:

> I'm so sorry, I wrote an answer and forgot to send it,
> and I haven't been able to work on this for a few days.
>
> So far:
>
> I have a 15k-column table with 50k rows.
>
> I do not see any change if I change the storage format.
>
> *Hive 0.12.0*
>
> My test query is "select * from bigtable".
>
> If I use the Hive CLI, it works fine.
> If I use hiveserver1 + ODBC, it works fine.
> If I use hiveserver2 + ODBC or hiveserver2 + beeline, I get this Java
> exception:
>
> 2014-02-18 13:22:22,571 ERROR thrift.ProcessFunction (ProcessFunction.java:process(41)) - Internal error processing FetchResults
> java.lang.OutOfMemoryError: Java heap space
>         at java.util.Arrays.copyOf(Arrays.java:2734)
>         at java.util.ArrayList.ensureCapacity(ArrayList.java:167)
>         at java.util.ArrayList.add(ArrayList.java:351)
>         at org.apache.hive.service.cli.thrift.TRow.addToColVals(TRow.java:160)
>         at org.apache.hive.service.cli.RowBasedSet.addRow(RowBasedSet.java:60)
>         at org.apache.hive.service.cli.RowBasedSet.addRow(RowBasedSet.java:32)
>         at org.apache.hive.service.cli.operation.SQLOperation.prepareFromRow(SQLOperation.java:270)
>         at org.apache.hive.service.cli.operation.SQLOperation.decode(SQLOperation.java:262)
>         at org.apache.hive.service.cli.operation.SQLOperation.getNextRowSet(SQLOperation.java:246)
>         at org.apache.hive.service.cli.operation.OperationManager.getOperationNextRowSet(OperationManager.java:171)
>         at org.apache.hive.service.cli.session.HiveSessionImpl.fetchResults(HiveSessionImpl.java:438)
>         at org.apache.hive.service.cli.CLIService.fetchResults(CLIService.java:346)
>         at org.apache.hive.service.cli.thrift.ThriftCLIService.FetchResults(ThriftCLIService.java:407)
>         at org.apache.hive.service.cli.thrift.TCLIService$Processor$FetchResults.getResult(TCLIService.java:1373)
>
> *From the SVN trunk* (for HIVE-3746):
>
> With the Maven change, most of the documentation and wiki are out of date.
> Compiling from trunk was not that easy and I may have missed some steps,
> but it has the same behavior: it works in the CLI and hiveserver1;
> it fails with hiveserver2.
>
> Regards,
>
> David Gayou
>
> On Thu, Feb 13, 2014 at 3:11 AM, Navis류승우 <navis....@nexr.com> wrote:
>
>> With HIVE-3746, which will be included in hive-0.13, HiveServer2 takes
>> less memory than before.
>>
>> Could you try it with the version in trunk?
>>
>> 2014-02-13 10:49 GMT+09:00 Stephen Sprague <sprag...@gmail.com>:
>>
>>> question to the original poster. closure appreciated!
>>>
>>> On Fri, Jan 31, 2014 at 12:22 PM, Stephen Sprague <sprag...@gmail.com> wrote:
>>>
>>>> Thanks, Ed. And on a separate tack, let's look at HiveServer2.
>>>>
>>>> @OP>
>>>>
>>>> *I've tried to look around on how I can change the thrift heap size but
>>>> haven't found anything.*
>>>>
>>>> Looking at my hiveserver2, I find this:
>>>>
>>>> $ ps -ef | grep -i hiveserver2
>>>> dwr 9824 20479 0 12:11 pts/1 00:00:00 grep -i hiveserver2
>>>> dwr 28410 1 0 00:05 ? 00:01:04 /usr/lib/jvm/java-6-sun/jre/bin/java
>>>> *-Xmx256m* -Dhadoop.log.dir=/usr/lib/hadoop/logs -Dhadoop.log.file=hadoop.log
>>>> -Dhadoop.home.dir=/usr/lib/hadoop -Dhadoop.id.str=
>>>> -Dhadoop.root.logger=INFO,console
>>>> -Djava.library.path=/usr/lib/hadoop/lib/native
>>>> -Dhadoop.policy.file=hadoop-policy.xml -Djava.net.preferIPv4Stack=true
>>>> -Dhadoop.security.logger=INFO,NullAppender org.apache.hadoop.util.RunJar
>>>> /usr/lib/hive/lib/hive-service-0.12.0.jar
>>>> org.apache.hive.service.server.HiveServer2
>>>>
>>>> questions:
>>>>
>>>> 1. what is the output of "ps -ef | grep -i hiveserver2" on your
>>>> system? in particular, what is the value of -Xmx?
>>>>
>>>> 2. can you restart your hiveserver with -Xmx1g? or some value that
>>>> makes sense for your system?
>>>>
>>>> Lots of questions now. we await your answers! :)
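>>>>
>>>> (If question 1 turns up -Xmx256m like mine: with the stock 0.12
>>>> scripts, the HiveServer2 heap is usually inherited from
>>>> HADOOP_HEAPSIZE, which the hive launcher passes down to the hadoop
>>>> script. A minimal sketch; the file location is distro-dependent:
>>>>
>>>>   # conf/hive-env.sh (or export in the shell that starts the service)
>>>>   export HADOOP_HEAPSIZE=1024   # megabytes; shows up as -Xmx1024m
>>>>
>>>>   # restart and verify the new flag
>>>>   hive --service hiveserver2 &
>>>>   ps -ef | grep -i hiveserver2
>>>>
>>>> if the -Xmx value doesn't change after that, something else in your
>>>> env scripts is overriding it.)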
>>>>
>>>> On Fri, Jan 31, 2014 at 11:51 AM, Edward Capriolo <edlinuxg...@gmail.com> wrote:
>>>>
>>>>> Final table compression should not affect the deserialized size of
>>>>> the data over the wire.
>>>>>
>>>>> On Fri, Jan 31, 2014 at 2:49 PM, Stephen Sprague <sprag...@gmail.com> wrote:
>>>>>
>>>>>> Excellent progress, David. So the most important thing we learned
>>>>>> here is that it works (!) when running Hive in local mode, and that
>>>>>> this error is a limitation in HiveServer2. That's important.
>>>>>>
>>>>>> so, textfile storage handler, and having issues converting it to ORC.
>>>>>> hmmm.
>>>>>>
>>>>>> follow-ups:
>>>>>>
>>>>>> 1. what is your query that fails?
>>>>>>
>>>>>> 2. can you add a "limit 1" to the end of your query and tell us if
>>>>>> that works? this'll tell us if it's column- or row-bound.
>>>>>>
>>>>>> 3. bonus points. run these in local mode:
>>>>>>   > set hive.exec.compress.output=true;
>>>>>>   > set mapred.output.compression.type=BLOCK;
>>>>>>   > set mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;
>>>>>>   > create table blah stored as ORC as select * from <your table>;  # i'm curious if this'll work.
>>>>>>   > show create table blah;  # send output back if the previous step worked.
>>>>>>
>>>>>> 4. extra bonus. change ORC to SEQUENCEFILE in #3 and see if that works
>>>>>> any differently.
>>>>>>
>>>>>> I'm wondering if compression would have any effect on the size of the
>>>>>> internal ArrayList the thrift server uses.
>>>>>>
>>>>>> On Fri, Jan 31, 2014 at 9:21 AM, David Gayou <david.ga...@kxen.com> wrote:
>>>>>>
>>>>>>> OK, so here is some news:
>>>>>>>
>>>>>>> I tried to boost HADOOP_HEAPSIZE to 8192,
>>>>>>> and I also set mapred.child.java.opts to 512M,
>>>>>>> and it doesn't seem to have any effect.
>>>>>>>
>>>>>>> ------
>>>>>>>
>>>>>>> I tried it using an ODBC driver => fails after a few minutes.
>>>>>>> Using local JDBC (beeline) => runs forever without any error.
>>>>>>>
>>>>>>> Both through hiveserver2.
>>>>>>>
>>>>>>> If I use local mode, it works! (But that's not really what I need,
>>>>>>> as I don't really know how to access it from my software.)
>>>>>>>
>>>>>>> ------
>>>>>>>
>>>>>>> I use a text file as storage.
>>>>>>> I tried to use ORC, but I can't populate it with LOAD DATA (it
>>>>>>> returns a file-format error).
>>>>>>>
>>>>>>> Using an "ALTER TABLE orange_large_train_3 SET FILEFORMAT ORC" after
>>>>>>> populating the table, I get a file-format error on select.
>>>>>>>
>>>>>>> ------
>>>>>>>
>>>>>>> @Edward:
>>>>>>>
>>>>>>> I've tried to look around for how to change the thrift heap size
>>>>>>> but haven't found anything.
>>>>>>> Same thing for my client (haven't found how to change its heap size).
>>>>>>>
>>>>>>> My use case is really to have as many columns as possible.
>>>>>>>
>>>>>>> Thanks a lot for your help.
>>>>>>>
>>>>>>> Regards,
>>>>>>>
>>>>>>> David
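>>>>>>>
>>>>>>> (A note on the ORC attempts above: LOAD DATA only moves files into
>>>>>>> the table directory, it never rewrites them, so text files behind a
>>>>>>> table declared ORC, whether via LOAD or via ALTER TABLE ... SET
>>>>>>> FILEFORMAT, fail with exactly this kind of file-format error. The
>>>>>>> usual pattern is a text-backed staging table plus a converting
>>>>>>> insert; a sketch with hypothetical names and a cut-down column list:
>>>>>>>
>>>>>>>   CREATE TABLE stage_txt (c1 STRING, c2 STRING)  -- all 15k cols in practice
>>>>>>>     ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
>>>>>>>     STORED AS TEXTFILE;
>>>>>>>   LOAD DATA INPATH '/tmp/data.csv' INTO TABLE stage_txt;
>>>>>>>
>>>>>>>   CREATE TABLE big_orc STORED AS ORC AS          -- CTAS rewrites the data
>>>>>>>     SELECT * FROM stage_txt;
>>>>>>>
>>>>>>> unlike LOAD DATA, the CTAS runs a job that actually writes ORC
>>>>>>> files.)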
>>>>>>>
>>>>>>> On Fri, Jan 31, 2014 at 1:12 AM, Edward Capriolo <edlinuxg...@gmail.com> wrote:
>>>>>>>
>>>>>>>> OK, here are the problem(s): Thrift has frame-size limits, and
>>>>>>>> Thrift has to buffer rows into memory.
>>>>>>>>
>>>>>>>> The Hive Thrift server has a heap size; it needs to be big in this case.
>>>>>>>>
>>>>>>>> Your client needs a big heap size as well.
>>>>>>>>
>>>>>>>> The way to do this query, if it is possible, may be turning the row
>>>>>>>> lateral, potentially by treating it as a list. It will make queries
>>>>>>>> on it awkward.
>>>>>>>>
>>>>>>>> Good luck.
>>>>>>>>
>>>>>>>> On Thursday, January 30, 2014, Stephen Sprague <sprag...@gmail.com> wrote:
>>>>>>>> > oh. thinking some more about this, I forgot to ask some other
>>>>>>>> > basic questions.
>>>>>>>> >
>>>>>>>> > a) what storage format are you using for the table (text,
>>>>>>>> > sequence, rcfile, orc, or custom)? "show create table <table>"
>>>>>>>> > would yield that.
>>>>>>>> >
>>>>>>>> > b) what command is causing the stack trace?
>>>>>>>> >
>>>>>>>> > my thinking here is that rcfile and orc are column-based (I
>>>>>>>> > think), and if you don't select all the columns, that could very
>>>>>>>> > well limit the size of the "row" being returned, and hence the
>>>>>>>> > size of the internal ArrayList. OTOH, if you're using "select *",
>>>>>>>> > um, you have my sympathies. :)
>>>>>>>> >
>>>>>>>> > On Thu, Jan 30, 2014 at 11:33 AM, Stephen Sprague <sprag...@gmail.com> wrote:
>>>>>>>> >
>>>>>>>> > thanks for the information. Up-to-date Hive. Cluster on the
>>>>>>>> > smallish side. And, well, it sure looks like a memory issue :)
>>>>>>>> > rather than an inherent Hive limitation, that is.
>>>>>>>> >
>>>>>>>> > So. I can only speak as a user (i.e. not a Hive developer), but
>>>>>>>> > what I'd be interested in knowing next is: this is via running
>>>>>>>> > Hive in local mode, correct? (i.e. not through hiveserver1/2.)
>>>>>>>> > And it looks like it boinks on array processing, which I assume
>>>>>>>> > to be internal code arrays and not Hive data arrays - your 15K
>>>>>>>> > columns are all scalar/simple types, correct? It's clearly
>>>>>>>> > fetching results and looks to be trying to store them in a Java
>>>>>>>> > array - and not just one row but a *set* of rows (ArrayList).
>>>>>>>> >
>>>>>>>> > three things to try.
>>>>>>>> >
>>>>>>>> > 1. boost the heap size. try 8192. And I don't know if
>>>>>>>> > HADOOP_HEAPSIZE is the controller of that. I woulda hoped it was
>>>>>>>> > called something like "HIVE_HEAPSIZE". :) Anyway, can't hurt to try.
>>>>>>>> >
>>>>>>>> > 2. trim down the number of columns and see where the breaking
>>>>>>>> > point is. is it 10K? is it 5K? The idea is to confirm it's _the
>>>>>>>> > number of columns_ that is causing the memory to blow and not
>>>>>>>> > some other artifact unbeknownst to us. (see the sketch after this
>>>>>>>> > message for one quick way to run that test.)
>>>>>>>> >
>>>>>>>> > 3. Google around the Hive namespace for something that might
>>>>>>>> > limit or otherwise control the number of rows stored at once in
>>>>>>>> > Hive's internal buffer. I'll snoop around too.
>>>>>>>> >
>>>>>>>> > That's all I've got for now, and maybe we'll get lucky and
>>>>>>>> > someone on this list will know something or another about this. :)
>>>>>>>> >
>>>>>>>> > cheers,
>>>>>>>> > Stephen.
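>>>>>>>> >
>>>>>>>> > (sketch for #2, assuming the columns share a prefix like var1 ...
>>>>>>>> > var15000; adjust the pattern to your real names. Before 0.13, a
>>>>>>>> > backtick-quoted column name is treated as a regex, so you can
>>>>>>>> > carve off roughly the first thousand columns without typing them:
>>>>>>>> >
>>>>>>>> >   create table width_test as
>>>>>>>> >     select `var[0-9]{1,3}` from orange_large_train_3;
>>>>>>>> >
>>>>>>>> >   select * from width_test;   -- re-run through HS2: OOM or not?
>>>>>>>> >
>>>>>>>> > widen the pattern to {1,4} for ~10K columns and bisect from there.)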
>>>>>>>> >
>>>>>>>> > On Thu, Jan 30, 2014 at 2:32 AM, David Gayou <david.ga...@kxen.com> wrote:
>>>>>>>> >
>>>>>>>> > We are using Hive 0.12.0, but it doesn't work any better on Hive
>>>>>>>> > 0.11.0 or Hive 0.10.0.
>>>>>>>> > Our Hadoop version is 1.1.2.
>>>>>>>> > Our cluster is 1 master + 4 slaves, each with 1 dual-core Xeon
>>>>>>>> > CPU (with hyperthreading, so 4 cores per machine) + 16GB RAM.
>>>>>>>> >
>>>>>>>> > The error message I get is:
>>>>>>>> >
>>>>>>>> > 2014-01-29 12:41:09,086 ERROR thrift.ProcessFunction (ProcessFunction.java:process(41)) - Internal error processing FetchResults
>>>>>>>> > java.lang.OutOfMemoryError: Java heap space
>>>>>>>> >         at java.util.Arrays.copyOf(Arrays.java:2734)
>>>>>>>> >         at java.util.ArrayList.ensureCapacity(ArrayList.java:167)
>>>>>>>> >         at java.util.ArrayList.add(ArrayList.java:351)
>>>>>>>> >         at org.apache.hive.service.cli.Row.<init>(Row.java:47)
>>>>>>>> >         at org.apache.hive.service.cli.RowSet.addRow(RowSet.java:61)
>>>>>>>> >         at org.apache.hive.service.cli.operation.SQLOperation.getNextRowSet(SQLOperation.java:235)
>>>>>>>> >         at org.apache.hive.service.cli.operation.OperationManager.getOperationNextRowSet(OperationManager.java:170)
>>>>>>>> >         at org.apache.hive.service.cli.session.HiveSessionImpl.fetchResults(HiveSessionImpl.java:417)
>>>>>>>> >         at org.apache.hive.service.cli.CLIService.fetchResults(CLIService.java:306)
>>>>>>>> >         at org.apache.hive.service.cli.thrift.ThriftCLIService.FetchResults(ThriftCLIService.java:386)
>>>>>>>> >         at org.apache.hive.service.cli.thrift.TCLIService$Processor$FetchResults.getResult(TCLIService.java:1373)
>>>>>>>> >         at org.apache.hive.service.cli.thrift.TCLIService$Processor$FetchResults.getResult(TCLIService.java:1358)
>>>>>>>> >         at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:39)
>>>>>>>> >         at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:39)
>>>>>>>> >         at org.apache.hive.service.auth.TUGIContainingProcessor$1.run(TUGIContainingProcessor.java:58)
>>>>>>>> >         at org.apache.hive.service.auth.TUGIContainingProcessor$1.run(TUGIContainingProcessor.java:55)
>>>>>>>> >         at java.security.AccessCont
>>>>>>>>
>>>>>>>> --
>>>>>>>> Sorry, this was sent from mobile. Will do less grammar and spell
>>>>>>>> check than usual.
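PS - circling back to Ed's "turn the row lateral" idea, since it keeps coming up: one sketch of it (names invented here, and assuming the values can all be treated as strings) is to collapse the wide row into a single array column, so each fetched row carries one value instead of 15k:

    create table bigtable_lateral as
    select array(col1, col2, col3) as vals   -- extend the list to all 15k columns
    from bigtable;

    select vals[0], vals[2] from bigtable_lateral limit 10;

queries against individual fields get awkward, as Ed says, but the per-row object the Thrift server has to buffer shrinks to a single column value.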