Re: Issue with Hive and table with lots of column

2014-02-19 Thread Stephen Sprague
ok. thanks.

so given everything we know the choices i see are:

1. increase your heapsize some more. (And of course confirm that the process
you reported with -Xmx8192M really is the HiveServer2 process.)

2. modify your query such that it doesn't use select *

3. modify your query such that it does its own buffering.  maybe stream it?

4. create a jira ticket and request that the internal buffer size that the
hiveserver2 uses for staging results be configurable.

That's all _i_ got left in the tank for this issue.  I think we need an SME
who is familiar with the code now.

Regards,
Stephen.
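
For option 1, a minimal sketch of bumping the heap, assuming (as later messages in the thread suggest) that the value comes from hive-env.sh and that HiveServer2 is started with the stock launcher; 8192 is simply the value already in use here, not a recommendation:

    # hive-env.sh -- heap size (in MB) picked up by the hive launcher scripts
    export HADOOP_HEAPSIZE=8192

    # restart HiveServer2 so the new setting takes effect
    hive --service hiveserver2 &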


Re: Issue with Hive and table with lots of column

2014-02-18 Thread David Gayou
I'm so sorry, i wrote an answer and i forgot to send it.
And i haven't been able to work on this for a few days.


So far :

I have a table with 15k columns and 50k rows.

I do not see any change in behavior if i change the storage format.


*Hive 12.0*

My test query is select * from bigtable


If i use the hive cli, it works fine.

If i use hiveserver1 + ODBC : it works fine

If i use hiveserver2 + ODBC or hiveserver2 + beeline, i get this java
exception :

2014-02-18 13:22:22,571 ERROR thrift.ProcessFunction (ProcessFunction.java:process(41)) - Internal error processing FetchResults
java.lang.OutOfMemoryError: Java heap space
        at java.util.Arrays.copyOf(Arrays.java:2734)
        at java.util.ArrayList.ensureCapacity(ArrayList.java:167)
        at java.util.ArrayList.add(ArrayList.java:351)
        at org.apache.hive.service.cli.thrift.TRow.addToColVals(TRow.java:160)
        at org.apache.hive.service.cli.RowBasedSet.addRow(RowBasedSet.java:60)
        at org.apache.hive.service.cli.RowBasedSet.addRow(RowBasedSet.java:32)
        at org.apache.hive.service.cli.operation.SQLOperation.prepareFromRow(SQLOperation.java:270)
        at org.apache.hive.service.cli.operation.SQLOperation.decode(SQLOperation.java:262)
        at org.apache.hive.service.cli.operation.SQLOperation.getNextRowSet(SQLOperation.java:246)
        at org.apache.hive.service.cli.operation.OperationManager.getOperationNextRowSet(OperationManager.java:171)
        at org.apache.hive.service.cli.session.HiveSessionImpl.fetchResults(HiveSessionImpl.java:438)
        at org.apache.hive.service.cli.CLIService.fetchResults(CLIService.java:346)
        at org.apache.hive.service.cli.thrift.ThriftCLIService.FetchResults(ThriftCLIService.java:407)
        at org.apache.hive.service.cli.thrift.TCLIService$Processor$FetchResults.getResult(TCLIService.java:1373)




*From the SVN trunk* : (for the HIVE-3746)

With the move to maven, most of the documentation and wiki are out of date.
Compiling from trunk was not that easy and i may have missed some steps, but:

It has the same behavior. It works in CLI and hiveserver1.
It fails with hiveserver 2.


Regards

David Gayou
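
Since the CLI works on this table while the HiveServer2 fetch path runs out of heap, one possible workaround (a sketch only, not something verified in this thread; the export path is made up) is to write the full result to files from the CLI instead of pulling every row back through Thrift:

    # run in local/CLI mode, which works for this table
    hive -e "INSERT OVERWRITE LOCAL DIRECTORY '/tmp/bigtable_export'
             SELECT * FROM bigtable;"
    # the rows land as plain text files under /tmp/bigtable_export on the machine running the CLI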





Re: Issue with Hive and table with lots of column

2014-02-18 Thread Stephen Sprague
He lives on after all! and thanks for the continued feedback.

We need the answers to these questions using HS2:


   1. what is the output of ps -ef | grep -i hiveserver2 on your system?
in particular what is the value of -Xmx ?

   2. does select * from table limit 1 work?

Thanks,
Stephen.



Re: Issue with Hive and table with lots of column

2014-02-18 Thread David Gayou
1. I have no process with hiveserver2 ...

ps -ef | grep -i hive returns some pretty long command with a -Xmx8192,
and that's the value set in hive-env.sh


2. The select * from table limit 1 or even 100 is working correctly.


David.
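
If the process really is running but the grep misses it, one way to look for it (a sketch; it assumes HiveServer2 was started through the stock RunJar launcher shown later in the thread) is to match the main class or the service jar instead of the word hiveserver2, and to ask ps for the full, unwrapped command line:

    # -ww = don't truncate the command line; "grep -v grep" drops the grep itself
    ps -efww | grep 'org.apache.hive.service.server.HiveServer2' | grep -v grep
    # or, more loosely, match the service jar
    ps -efww | grep 'hive-service-' | grep -v grep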


Re: Issue with Hive and table with lots of column

2014-02-18 Thread Stephen Sprague
thanks.

re #1.  we need to find that Hiveserver2 process. For all i know the one
you reported is hiveserver1 (which works). Chances are they use the same
-Xmx value, but we really shouldn't make any assumptions.

try wide format on the ps command (eg. ps -efw | grep -i Hiveserver2)

re #2.  okay.  so that tells us it's not the number of columns blowing the
heap but rather the combination of rows + columns.  There's no way it
stores the full result set on the heap even under normal circumstances, so
my guess is there's an internal number of rows it buffers, sorta like how
unix buffers stdout.  How and where that's set is out of my league.
However, maybe you can get around it by upping your heapsize again, if you
have the available memory of course.
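
One way to confirm it is the HiveServer2 heap filling up during the fetch (a sketch; jstat ships with the JDK, and the pgrep pattern assumes the stock HiveServer2 main class) is to watch the server JVM's garbage-collector occupancy while the result set is being pulled:

    # print heap-region utilisation (%) for the HS2 JVM every 5 seconds
    jstat -gcutil $(pgrep -f 'org.apache.hive.service.server.HiveServer2') 5s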


Re: Issue with Hive and table with lots of column

2014-02-18 Thread Stephen Sprague
oh. i just noticed the -Xmx value you reported.

there's no M or G after that number??  I'd like to see -Xmx8192M or
-Xmx8G.  That *is* very important.

thanks,
Stephen.
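
The reason the suffix matters: without one the JVM reads the -Xmx value as bytes, so a plain -Xmx8192 would ask for an 8 KB heap and the JVM refuses to start, whereas the intended setting is 8 GB. A quick way to see the difference against any JDK at hand:

    java -Xmx8192  -version   # 8192 bytes  -> rejected as far too small a heap
    java -Xmx8192M -version   # 8192 MB, i.e. 8 GB
    java -Xmx8G    -version   # same 8 GB, written differently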


Re: Issue with Hive and table with lots of column

2014-02-18 Thread David Gayou
Sorry, i reported it badly. It's 8192M

Thanks,

David.
Re: Issue with Hive and table with lots of column

2014-02-12 Thread Navis류승우
With HIVE-3746, which will be included in hive-0.13, HiveServer2 takes less
memory than before.

Could you try it with the version in trunk?
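
For reference, building a runnable HiveServer2 from the post-mavenization trunk looked roughly like the sketch below at the time; the hadoop-1 profile matches the Hadoop 1.1.2 cluster mentioned elsewhere in the thread, but treat the exact profile and flag names as assumptions, since (as David notes) the docs were out of date:

    cd hive-trunk                                     # an existing checkout of the svn trunk
    mvn clean install -DskipTests -Phadoop-1          # hadoop-1 profile for a Hadoop 1.x cluster
    mvn clean package -DskipTests -Phadoop-1 -Pdist   # distribution tarball ends up under packaging/target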


Re: Issue with Hive and table with lots of column

2014-01-31 Thread David Gayou
Ok, so here are some news :

I tried boosting HADOOP_HEAPSIZE to 8192,
and i also set mapred.child.java.opts to 512M.

It doesn't seem to have any effect.
 --

I tried it using an ODBC driver = it fails after a few minutes.
Using a local JDBC client (beeline) = it runs forever without any error.

Both go through hiveserver2.

If i use local mode : it works!   (but that's not really what i need, as
i don't really know how to access it from my software)

--
I use a text file as storage.
I tried to use ORC, but i can't populate it with a load data (it returns a
file format error).

Using an ALTER TABLE orange_large_train_3 SET FILEFORMAT ORC after
populating the table, i get a file format error on select.

--

@Edward :

I've tried to look around for how i can change the thrift heap size but
haven't found anything.
Same thing for my client (i haven't found how to change its heap size).

My use case really requires having as many columns as possible.


Thanks a lot for your help


Regards

David
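
On the ORC point: LOAD DATA only moves files into place, so loading a text file into an ORC table (or ALTERing a text-backed table to ORC afterwards) leaves text bytes behind an ORC reader, hence the file format errors. The usual conversion, which is also what Stephen suggests in the next message, is to rewrite the data through a query (a sketch, reusing the table name from this thread):

    hive -e "CREATE TABLE orange_large_train_3_orc STORED AS ORC
             AS SELECT * FROM orange_large_train_3;"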





Re: Issue with Hive and table with lots of column

2014-01-31 Thread Stephen Sprague
Excellent progress David.   So.  The most important thing we learned here is
that it works (!) when running hive in local mode, and that this error is a
limitation in HiveServer2.  That's important.

so textfile storage handler and having issues converting it to ORC. hmmm.

follow-ups.

1. what is your query that fails?

2. can you add a limit 1 to the end of your query and tell us if that
works? this'll tell us if it's column or row bound.

3. bonus points. run these in local mode:
   set hive.exec.compress.output=true;
   set mapred.output.compression.type=BLOCK;
   set mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;
   create table blah stored as ORC as select * from your table;   #i'm curious if this'll work.
   show create table blah;   #send output back if previous step worked.

4. extra bonus.  change ORC to SEQUENCEFILE in #3 and see if that works any
differently.



I'm wondering if compression would have any effect on the size of the
internal ArrayList the thrift server uses.



Re: Issue with Hive and table with lots of column

2014-01-31 Thread Edward Capriolo
Final table compression should not affect the deserialized size of the
data over the wire.


Re: Issue with Hive and table with lots of column

2014-01-31 Thread Stephen Sprague
thanks Ed. And on a separate tack let's look at Hiveserver2.


@OP

*I've tried to look around on how i can change the thrift heap size but
haven't found anything.*


looking at my hiveserver2 i find this:

   $ ps -ef | grep -i hiveserver2
   dwr   9824 20479  0 12:11 pts/1    00:00:00 grep -i hiveserver2
   dwr  28410     1  0 00:05 ?        00:01:04 /usr/lib/jvm/java-6-sun/jre/bin/java *-Xmx256m* -Dhadoop.log.dir=/usr/lib/hadoop/logs -Dhadoop.log.file=hadoop.log -Dhadoop.home.dir=/usr/lib/hadoop -Dhadoop.id.str= -Dhadoop.root.logger=INFO,console -Djava.library.path=/usr/lib/hadoop/lib/native -Dhadoop.policy.file=hadoop-policy.xml -Djava.net.preferIPv4Stack=true -Dhadoop.security.logger=INFO,NullAppender org.apache.hadoop.util.RunJar /usr/lib/hive/lib/hive-service-0.12.0.jar org.apache.hive.service.server.HiveServer2




questions:

   1. what is the output of ps -ef | grep -i hiveserver2 on your system?
in particular what is the value of -Xmx ?

   2. can you restart your hiveserver with -Xmx1g? or some value that makes
sense to your system?



Lots of questions now.  we await your answers! :)
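
A one-liner that answers question 1 directly (a sketch; it just splits the long command line apart and keeps the heap flag):

    # the [h] keeps grep from matching its own command line
    ps -efww | grep -i '[h]iveserver2' | tr -s ' ' '\n' | grep -- '-Xmx'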



On Fri, Jan 31, 2014 at 11:51 AM, Edward Capriolo edlinuxg...@gmail.comwrote:

 Final table compression should not effect the de serialized size of the
 data over the wire.


 On Fri, Jan 31, 2014 at 2:49 PM, Stephen Sprague sprag...@gmail.comwrote:

 Excellent progress David.   So.  What the most important thing here we
 learned was that it works (!) by running hive in local mode and that this
 error is a limitation in the HiveServer2.  That's important.

 so textfile storage handler and having issues converting it to ORC. hmmm.

 follow-ups.

 1. what is your query that fails?

 2. can you add a limit 1 to the end of your query and tell us if that
 works? this'll tell us if it's column or row bound.

 3. bonus points. run these in local mode:
set hive.exec.compress.output=true;
set mapred.output.compression.type=BLOCK;
set
 mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;
create table blah stored as ORC as select * from your table;
 #i'm curious if this'll work.
show create table blah;  #send output back if previous step
 worked.

 4. extra bonus.  change ORC to SEQUENCEFILE in #3 see if that works any
 differently.



 I'm wondering if compression would have any effect on the size of the
 internal ArrayList the thrift server uses.



 On Fri, Jan 31, 2014 at 9:21 AM, David Gayou david.ga...@kxen.comwrote:

 Ok, so here is some news:

 I tried to boost the HADOOP_HEAPSIZE to 8192,
 I also set the mapred.child.java.opts to 512M

 And it doesn't seem to have any effect.
  --

 I tried it using an ODBC driver = fails after a few minutes.
 Using a local JDBC client (beeline) = runs forever without any error.

 Both through hiveserver2

 If i use the local mode: it works!   (but that's not really what i need,
 as i don't really know how to access it from my software)

 --
 I use a text file as storage.
 I tried to use ORC, but i can't populate it with a LOAD DATA (it returns
 a file format error).

 Using an ALTER TABLE orange_large_train_3 SET FILEFORMAT ORC after
 populating the table, i get a file format error on select.
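
 fwiw, LOAD DATA only moves files into the table's directory and ALTER TABLE
 ... SET FILEFORMAT only changes the metadata, so neither actually rewrites
 the text data as ORC. A sketch of the usual conversion, with the _orc table
 name made up:

    CREATE TABLE orange_large_train_3_orc STORED AS ORC
    AS SELECT * FROM orange_large_train_3;

    -- or, if an ORC table with the same schema already exists:
    -- INSERT OVERWRITE TABLE orange_large_train_3_orc
    -- SELECT * FROM orange_large_train_3;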

 --

 @Edward :

 I've tried to look around on how i can change the thrift heap size but
 haven't found anything.
 Same thing for my client (haven't found how to change the heap size)

 My use case really is to have as many columns as possible.


 Thanks a lot for your help


 Regards

 David





 On Fri, Jan 31, 2014 at 1:12 AM, Edward Capriolo 
 edlinuxg...@gmail.comwrote:

 Ok here are the problem(s). Thrift has frame size limits, thrift has to
 buffer rows into memory.

 Hive thrift has a heap size; it needs to be big in this case.

 Your client needs a big heap size as well.

 The way to do this query, if it is possible, may be turning the row lateral,
 potentially by treating it as a list; it will make queries on it awkward.

 Good luck


 On Thursday, January 30, 2014, Stephen Sprague sprag...@gmail.com
 wrote:
  oh. thinking some more about this i forgot to ask some other basic
 questions.
 
  a) what storage format are you using for the table (text, sequence,
 rcfile, orc or custom)?   show create table table would yield that.
 
  b) what command is causing the stack trace?
 
  my thinking here is rcfile and orc are column based (i think) and if
 you don't select all the columns that could very well limit the size of the
 row being returned and hence the size of the internal ArrayList.  OTOH,
 if you're using select *, um, you have my sympathies. :)
 
 
 
 
  On Thu, Jan 30, 2014 at 11:33 AM, Stephen Sprague sprag...@gmail.com
 wrote:
 
  thanks for the information. Up-to-date hive. Cluster on the smallish
 side. And, well, sure looks like a memory issue. :)  rather than an
 inherent hive limitation that is.
 
  So.  I can only speak as a user (ie. not a hive developer) but what
 i'd be interested in knowing next 

Re: Issue with Hive and table with lots of column

2014-01-30 Thread David Gayou
We are using the Hive 0.12.0, but it doesn't work better on hive 0.11.0 or
hive 0.10.0
Our hadoop version is 1.1.2.
Our cluster is 1 master + 4 slaves with 1 dual core xeon CPU (with
hyperthreading so 4 cores per machine) + 16Gb Ram each

The error message i get is :

2014-01-29 12:41:09,086 ERROR thrift.ProcessFunction
(ProcessFunction.java:process(41)) - Internal error processing FetchResults
java.lang.OutOfMemoryError: Java heap space
at java.util.Arrays.copyOf(Arrays.java:2734)
at java.util.ArrayList.ensureCapacity(ArrayList.java:167)
at java.util.ArrayList.add(ArrayList.java:351)
at org.apache.hive.service.cli.Row.init(Row.java:47)
at org.apache.hive.service.cli.RowSet.addRow(RowSet.java:61)
at
org.apache.hive.service.cli.operation.SQLOperation.getNextRowSet(SQLOperation.java:235)
at
org.apache.hive.service.cli.operation.OperationManager.getOperationNextRowSet(OperationManager.java:170)
at
org.apache.hive.service.cli.session.HiveSessionImpl.fetchResults(HiveSessionImpl.java:417)
at
org.apache.hive.service.cli.CLIService.fetchResults(CLIService.java:306)
at
org.apache.hive.service.cli.thrift.ThriftCLIService.FetchResults(ThriftCLIService.java:386)
at
org.apache.hive.service.cli.thrift.TCLIService$Processor$FetchResults.getResult(TCLIService.java:1373)
at
org.apache.hive.service.cli.thrift.TCLIService$Processor$FetchResults.getResult(TCLIService.java:1358)
at
org.apache.thrift.ProcessFunction.process(ProcessFunction.java:39)
at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:39)
at
org.apache.hive.service.auth.TUGIContainingProcessor$1.run(TUGIContainingProcessor.java:58)
at
org.apache.hive.service.auth.TUGIContainingProcessor$1.run(TUGIContainingProcessor.java:55)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1149)
at
org.apache.hadoop.hive.shims.HadoopShimsSecure.doAs(HadoopShimsSecure.java:526)
at
org.apache.hive.service.auth.TUGIContainingProcessor.process(TUGIContainingProcessor.java:55)
at
org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:206)
at
java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918)
at java.lang.Thread.run(Thread.java:662)

My HADOOP_HEAPSIZE is set to 4096 in hive-env.sh


We are doing some machine learning on a row-by-row basis on those datasets, so
basically the more columns we have the better it is.

We are coming from the SQL world, and Hive is the closest to SQL syntax.
We'd like to keep some SQL manipulation on the data.

Thanks for the Help,

Regards,

David Gayou

On Tue, Jan 28, 2014 at 8:35 PM, Stephen Sprague sprag...@gmail.com wrote:

 there's always a use case out there that stretches the imagination isn't
 there?   gotta love it.

 first things first.  can you share the error message? the hive version?
 and the number of nodes in your cluster?

 then a couple of things come to my mind.   Might you consider pivoting the
 data such that you represent one row of 15K columns as 15K rows of, say, 3
 columns (id, column_name, column_value) before you even load it into hive?
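
 A sketch of what that could look like - table and column names made up:

    CREATE TABLE orange_large_train_long (
      row_id       BIGINT,
      column_name  STRING,
      column_value STRING
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
    STORED AS TEXTFILE;

    -- a single original cell is then fetched as, e.g.
    -- SELECT column_value FROM orange_large_train_long
    -- WHERE row_id = 1 AND column_name = 'Var42';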

 the other thing is when i hear 15K columns the first thing i think is
 HBase (their motto is millions of columns and billions of rows)

  Anyway, lets see what you got for the first question! :)

 cheers,
 Stephen.


 On Tue, Jan 28, 2014 at 3:20 AM, David Gayou david.ga...@kxen.com wrote:

 Hello,

 I'm trying to test Hive with tables including quite a lot of columns.

 We are using the data from the KDD Cup 2009, based on an anonymised
 real-case dataset.
 http://www.sigkdd.org/kdd-cup-2009-customer-relationship-prediction

 The aim is to be able to create and manipulate a table with 15,000
 columns.

 We were actually able to create the table and to load data inside it.
 You can find the create statement inside the attached file.
 The data file is pretty big, but i can share it if anyone wants it.


 The statement
 SELECT * FROM orange_large_train_3 LIMIT 1000
 is working fine,

 But the
 SELECT * FROM orange_large_train_3
 doesn't work.


 We have tried several options for creating the table, including creating it
 with the ColumnarSerde row format, but couldn't make it work.

 Do any of you have a server configuration or storage format to use when
 creating the table
 in order to make it work with such a number of columns?



 Regards,

 David Gayou





Re: Issue with Hive and table with lots of column

2014-01-30 Thread Stephen Sprague
thanks for the information. Up-to-date hive. Cluster on the smallish side.
And, well, sure looks like a memory issue. :)  rather than an inherent hive
limitation that is.

So.  I can only speak as a user (ie. not a hive developer) but what i'd be
interested in knowing next is: is this via running hive in local mode,
correct? (eg. not through hiveserver1/2).  And it looks like it boinks on
array processing, which i assume to be internal code arrays and not hive
data arrays - your 15K columns are all scalar/simple types, correct?  It's
clearly fetching results and looks to be trying to store them in a java array
- and not just one row but a *set* of rows (ArrayList)

a few things to try.

1. boost the heap-size. try 8192. And I don't know if HADOOP_HEAPSIZE is
the controller of that. I woulda hoped it was called something like
HIVE_HEAPSIZE. :)  Anyway, can't hurt to try.

2. trim down the number of columns and see where the breaking point is.  is
it 10K? is it 5K?   The idea is to confirm it's _the number of columns_ that
is causing the memory to blow and not some other artifact unbeknownst to us.

3. Google around the Hive namespace for something that might limit or
otherwise control the number of rows stored at once in Hive's internal
buffer. I'll snoop around too.
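
e.g. something like this to snoop for candidate settings (sketch only - the
path is a guess, adjust for your install):

   grep -i -E -B1 -A3 'fetch|resultset|buffer' \
       /usr/lib/hive/conf/hive-default.xml.template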


That's all i got for now and maybe we'll get lucky and someone on this list
will know something or another about this. :)

cheers,
Stephen.



On Thu, Jan 30, 2014 at 2:32 AM, David Gayou david.ga...@kxen.com wrote:


 We are using the Hive 0.12.0, but it doesn't work better on hive 0.11.0 or
 hive 0.10.0
 Our hadoop version is 1.1.2.
 Our cluster is 1 master + 4 slaves with 1 dual core xeon CPU (with
 hyperthreading so 4 cores per machine) + 16Gb Ram each

 The error message i get is :

 2014-01-29 12:41:09,086 ERROR thrift.ProcessFunction
 (ProcessFunction.java:process(41)) - Internal error processing FetchResults
 java.lang.OutOfMemoryError: Java heap space
 at java.util.Arrays.copyOf(Arrays.java:2734)
 at java.util.ArrayList.ensureCapacity(ArrayList.java:167)
 at java.util.ArrayList.add(ArrayList.java:351)
 at org.apache.hive.service.cli.Row.init(Row.java:47)
 at org.apache.hive.service.cli.RowSet.addRow(RowSet.java:61)
 at
 org.apache.hive.service.cli.operation.SQLOperation.getNextRowSet(SQLOperation.java:235)
 at
 org.apache.hive.service.cli.operation.OperationManager.getOperationNextRowSet(OperationManager.java:170)
 at
 org.apache.hive.service.cli.session.HiveSessionImpl.fetchResults(HiveSessionImpl.java:417)
 at
 org.apache.hive.service.cli.CLIService.fetchResults(CLIService.java:306)
 at
 org.apache.hive.service.cli.thrift.ThriftCLIService.FetchResults(ThriftCLIService.java:386)
 at
 org.apache.hive.service.cli.thrift.TCLIService$Processor$FetchResults.getResult(TCLIService.java:1373)
 at
 org.apache.hive.service.cli.thrift.TCLIService$Processor$FetchResults.getResult(TCLIService.java:1358)
 at
 org.apache.thrift.ProcessFunction.process(ProcessFunction.java:39)
 at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:39)
 at
 org.apache.hive.service.auth.TUGIContainingProcessor$1.run(TUGIContainingProcessor.java:58)
 at
 org.apache.hive.service.auth.TUGIContainingProcessor$1.run(TUGIContainingProcessor.java:55)
 at java.security.AccessController.doPrivileged(Native Method)
 at javax.security.auth.Subject.doAs(Subject.java:396)
 at
 org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1149)
 at
 org.apache.hadoop.hive.shims.HadoopShimsSecure.doAs(HadoopShimsSecure.java:526)
 at
 org.apache.hive.service.auth.TUGIContainingProcessor.process(TUGIContainingProcessor.java:55)
 at
 org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:206)
 at
 java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895)
 at
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918)
 at java.lang.Thread.run(Thread.java:662)

 My HADOOP_HEAPSIZE is set to 4096 in hive-env.sh


 We are doing some machine learning on a row-by-row basis on those datasets,
 so basically the more columns we have the better it is.

 We are coming from the SQL world, and Hive is the closest to SQL syntax.
 We'd like to keep some SQL manipulation on the data.

 Thanks for the Help,

 Regards,

 David Gayou

 On Tue, Jan 28, 2014 at 8:35 PM, Stephen Sprague sprag...@gmail.comwrote:

 there's always a use case out there that stretches the imagination isn't
 there?   gotta love it.

 first things first.  can you share the error message? the hive version?
 and the number of nodes in your cluster?

 then a couple of things come to my mind.   Might you consider pivoting
 the data such that you represent one row of 15K columns as  15K rows as,
 say, 3 columns (id, column_name, column_value) before 

Re: Issue with Hive and table with lots of column

2014-01-30 Thread Stephen Sprague
oh. thinking some more about this i forgot to ask some other basic
questions.

a) what storage format are you using for the table (text, sequence, rcfile,
orc or custom)?   show create table table would yield that.

b) what command is causing the stack trace?

my thinking here is rcfile and orc are column based (i think) and if you
don't select all the columns that could very well limit the size of the
row being returned and hence the size of the internal ArrayList.  OTOH,
if you're using select *, um, you have my sympathies. :)




On Thu, Jan 30, 2014 at 11:33 AM, Stephen Sprague sprag...@gmail.comwrote:

 thanks for the information. Up-to-date hive. Cluster on the smallish side.
 And, well, sure looks like a memory issue. :)  rather than an inherent hive
 limitation that is.

 So.  I can only speak as a user (ie. not a hive developer) but what i'd be
 interested in knowing next is: is this via running hive in local mode,
 correct? (eg. not through hiveserver1/2).  And it looks like it boinks on
 array processing, which i assume to be internal code arrays and not hive
 data arrays - your 15K columns are all scalar/simple types, correct?  It's
 clearly fetching results and looks to be trying to store them in a java array
 - and not just one row but a *set* of rows (ArrayList)

 a few things to try.

 1. boost the heap-size. try 8192. And I don't know if HADOOP_HEAPSIZE is
 the controller of that. I woulda hoped it was called something like
 HIVE_HEAPSIZE. :)  Anyway, can't hurt to try.

 2. trim down the number of columns and see where the breaking point is.
 is it 10K? is it 5K?   The idea is to confirm it's _the number of columns_
 that is causing the memory to blow and not some other artifact unbeknownst
 to us.

 3. Google around the Hive namespace for something that might limit or
 otherwise control the number of rows stored at once in Hive's internal
 buffer. I'll snoop around too.


 That's all i got for now and maybe we'll get lucky and someone on this
 list will know something or another about this. :)

 cheers,
 Stephen.



 On Thu, Jan 30, 2014 at 2:32 AM, David Gayou david.ga...@kxen.com wrote:


 We are using the Hive 0.12.0, but it doesn't work better on hive 0.11.0
 or hive 0.10.0
 Our hadoop version is 1.1.2.
 Our cluster is 1 master + 4 slaves with 1 dual core xeon CPU (with
 hyperthreading so 4 cores per machine) + 16Gb Ram each

 The error message i get is :

 2014-01-29 12:41:09,086 ERROR thrift.ProcessFunction
 (ProcessFunction.java:process(41)) - Internal error processing FetchResults
 java.lang.OutOfMemoryError: Java heap space
 at java.util.Arrays.copyOf(Arrays.java:2734)
 at java.util.ArrayList.ensureCapacity(ArrayList.java:167)
 at java.util.ArrayList.add(ArrayList.java:351)
 at org.apache.hive.service.cli.Row.init(Row.java:47)
 at org.apache.hive.service.cli.RowSet.addRow(RowSet.java:61)
 at
 org.apache.hive.service.cli.operation.SQLOperation.getNextRowSet(SQLOperation.java:235)
 at
 org.apache.hive.service.cli.operation.OperationManager.getOperationNextRowSet(OperationManager.java:170)
 at
 org.apache.hive.service.cli.session.HiveSessionImpl.fetchResults(HiveSessionImpl.java:417)
 at
 org.apache.hive.service.cli.CLIService.fetchResults(CLIService.java:306)
 at
 org.apache.hive.service.cli.thrift.ThriftCLIService.FetchResults(ThriftCLIService.java:386)
 at
 org.apache.hive.service.cli.thrift.TCLIService$Processor$FetchResults.getResult(TCLIService.java:1373)
 at
 org.apache.hive.service.cli.thrift.TCLIService$Processor$FetchResults.getResult(TCLIService.java:1358)
 at
 org.apache.thrift.ProcessFunction.process(ProcessFunction.java:39)
 at
 org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:39)
 at
 org.apache.hive.service.auth.TUGIContainingProcessor$1.run(TUGIContainingProcessor.java:58)
 at
 org.apache.hive.service.auth.TUGIContainingProcessor$1.run(TUGIContainingProcessor.java:55)
 at java.security.AccessController.doPrivileged(Native Method)
 at javax.security.auth.Subject.doAs(Subject.java:396)
 at
 org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1149)
 at
 org.apache.hadoop.hive.shims.HadoopShimsSecure.doAs(HadoopShimsSecure.java:526)
 at
 org.apache.hive.service.auth.TUGIContainingProcessor.process(TUGIContainingProcessor.java:55)
 at
 org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:206)
 at
 java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895)
 at
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918)
 at java.lang.Thread.run(Thread.java:662)

 My HADOOP_HEAPSIZE is set to 4096 in hive-env.sh


 We are doing some machine learning on a row-by-row basis on those datasets,
 so basically the more columns we have the better it is.

 We are coming from the SQL 

Re: Issue with Hive and table with lots of column

2014-01-30 Thread Edward Capriolo
Ok here are the problem(s). Thrift has frame size limits, thrift has to
buffer rows into memory.

Hive thrift has a heap size; it needs to be big in this case.

Your client needs a big heap size as well.

The way to do this query, if it is possible, may be turning the row lateral,
potentially by treating it as a list; it will make queries on it awkward.
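
For example, a sketch of that layout (names made up); each original column is
then addressed by position, which is the awkward part:

   CREATE TABLE orange_large_train_arr (
     row_id BIGINT,
     vals   ARRAY<STRING>
   )
   ROW FORMAT DELIMITED
     FIELDS TERMINATED BY '\t'
     COLLECTION ITEMS TERMINATED BY ','
   STORED AS TEXTFILE;

   -- "column" 42 of row 1:
   SELECT row_id, vals[41] FROM orange_large_train_arr WHERE row_id = 1;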

Good luck

On Thursday, January 30, 2014, Stephen Sprague sprag...@gmail.com wrote:
 oh. thinking some more about this i forgot to ask some other basic
questions.

 a) what storage format are you using for the table (text, sequence,
rcfile, orc or custom)?   show create table table would yield that.

 b) what command is causing the stack trace?

 my thinking here is rcfile and orc are column based (i think) and if you
don't select all the columns that could very well limit the size of the
row being returned and hence the size of the internal ArrayList.  OTOH,
if you're using select *, um, you have my sympathies. :)




 On Thu, Jan 30, 2014 at 11:33 AM, Stephen Sprague sprag...@gmail.com
wrote:

 thanks for the information. Up-to-date hive. Cluster on the smallish
side. And, well, sure looks like a memory issue. :)  rather than an
inherent hive limitation that is.

 So.  I can only speak as a user (ie. not a hive developer) but what i'd
 be interested in knowing next is: is this via running hive in local mode,
 correct? (eg. not through hiveserver1/2).  And it looks like it boinks on
 array processing, which i assume to be internal code arrays and not hive
 data arrays - your 15K columns are all scalar/simple types, correct?  It's
 clearly fetching results and looks to be trying to store them in a java array
 - and not just one row but a *set* of rows (ArrayList)

 a few things to try.

 1. boost the heap-size. try 8192. And I don't know if HADOOP_HEAPSIZE is
 the controller of that. I woulda hoped it was called something like
 HIVE_HEAPSIZE. :)  Anyway, can't hurt to try.

 2. trim down the number of columns and see where the breaking point is.
 is it 10K? is it 5K?   The idea is to confirm it's _the number of columns_
 that is causing the memory to blow and not some other artifact unbeknownst
 to us.

 3. Google around the Hive namespace for something that might limit or
 otherwise control the number of rows stored at once in Hive's internal
 buffer. I'll snoop around too.


 That's all i got for now and maybe we'll get lucky and someone on this
list will know something or another about this. :)

 cheers,
 Stephen.



 On Thu, Jan 30, 2014 at 2:32 AM, David Gayou david.ga...@kxen.com wrote:

 We are using the Hive 0.12.0, but it doesn't work better on hive 0.11.0
or hive 0.10.0
 Our hadoop version is 1.1.2.
 Our cluster is 1 master + 4 slaves with 1 dual core xeon CPU (with
hyperthreading so 4 cores per machine) + 16Gb Ram each

 The error message i get is :

 2014-01-29 12:41:09,086 ERROR thrift.ProcessFunction
(ProcessFunction.java:process(41)) - Internal error processing FetchResults
 java.lang.OutOfMemoryError: Java heap space
 at java.util.Arrays.copyOf(Arrays.java:2734)
 at java.util.ArrayList.ensureCapacity(ArrayList.java:167)
 at java.util.ArrayList.add(ArrayList.java:351)
 at org.apache.hive.service.cli.Row.init(Row.java:47)
 at org.apache.hive.service.cli.RowSet.addRow(RowSet.java:61)
 at
org.apache.hive.service.cli.operation.SQLOperation.getNextRowSet(SQLOperation.java:235)
 at
org.apache.hive.service.cli.operation.OperationManager.getOperationNextRowSet(OperationManager.java:170)
 at
org.apache.hive.service.cli.session.HiveSessionImpl.fetchResults(HiveSessionImpl.java:417)
 at
org.apache.hive.service.cli.CLIService.fetchResults(CLIService.java:306)
 at
org.apache.hive.service.cli.thrift.ThriftCLIService.FetchResults(ThriftCLIService.java:386)
 at
org.apache.hive.service.cli.thrift.TCLIService$Processor$FetchResults.getResult(TCLIService.java:1373)
 at
org.apache.hive.service.cli.thrift.TCLIService$Processor$FetchResults.getResult(TCLIService.java:1358)
 at
org.apache.thrift.ProcessFunction.process(ProcessFunction.java:39)
 at
org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:39)
 at
org.apache.hive.service.auth.TUGIContainingProcessor$1.run(TUGIContainingProcessor.java:58)
 at
org.apache.hive.service.auth.TUGIContainingProcessor$1.run(TUGIContainingProcessor.java:55)
 at java.security.AccessCont

-- 
Sorry this was sent from mobile. Will do less grammar and spell check than
usual.