Hive 0.13 vs LZO index vs hive.hadoop.supports.splittable.combineinputformat issue

2015-01-07 Thread Nathalie Blais
Hello Hive support team,

Happy new year to you!

Quick question regarding combining small LZO files in Hive.  Since some of our
HDFS files are indexed (not all, but there are always a few .lzo.index files in
the directory structure), we are experiencing the problematic behavior
described in JIRA MAPREDUCE-5537
(https://issues.apache.org/jira/browse/MAPREDUCE-5537); the case is 100%
reproducible.

We have a separate aggregation process that runs on the cluster to take care of 
the “small files issue”.  However, in between runs, in order to reduce the 
number of mappers (and busy containers), we would have loved to set 
hive.hadoop.supports.splittable.combineinputformat to true and allow Hive to 
combine small files by itself.
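
For context, the setup we had in mind looks roughly like this (a sketch only;
the table name, location, and split size are illustrative, while the
input/output format classes are the usual hadoop-lzo ones):

-- Enable combining of small input files into fewer splits (fewer mappers).
SET hive.hadoop.supports.splittable.combineinputformat=true;
SET mapreduce.input.fileinputformat.split.maxsize=268435456;  -- ~256 MB per combined split

-- Illustrative LZO-backed external table; our real tables follow this pattern.
CREATE EXTERNAL TABLE logs_lzo (line STRING)
STORED AS
  INPUTFORMAT 'com.hadoop.mapred.DeprecatedLzoTextInputFormat'
  OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION '/data/logs_lzo';

-- With combining on, this scan should launch far fewer mappers, unless the
-- .lzo.index files trigger the MAPREDUCE-5537 behavior described above.
SELECT COUNT(*) FROM logs_lzo;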

We are using the Cloudera distribution CDH 5.2.0, and ideally we would like to
avoid building hadoop-core manually.  Do you know whether the patch from
MAPREDUCE-5537 has ever been included in an official release?

Looking forward to hearing from you.

Thank you very much,

Nathalie Blais
Ubisoft Montreal


Nathalie Blais
BI Developer - DNA (http://technologygroup/dna)
Technology Group Online – Ubisoft Montreal








Hive returns different results with/without LZO index when hive.hadoop.supports.splittable.combineinputformat=true

2014-12-08 Thread Nathalie Blais
Hello,

We are experiencing this old issue in our current installation:

https://issues.apache.org/jira/browse/MAPREDUCE-5537

All our data is LZO-compressed and indexed; the case is 100% reproducible on
our CDH 5.2.0 cluster (using MR2 and YARN).

Do you know if we might be missing a patch, or if this particular problem has
found its way back into the code?
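
To illustrate the symptom, here is a minimal sketch of the check we run (the
table name logs_lzo is illustrative, not our actual table):

-- Baseline: with combining disabled, the count is correct.
SET hive.hadoop.supports.splittable.combineinputformat=false;
SELECT COUNT(*) FROM logs_lzo;  -- expected row count

-- With combining enabled and .lzo.index files present in the table
-- directory, the same query returns a different result (MAPREDUCE-5537).
SET hive.hadoop.supports.splittable.combineinputformat=true;
SELECT COUNT(*) FROM logs_lzo;  -- wrong row count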

Best regards,

Nathalie Blais
B.I. Developer - Ubisoft Montreal





RE: Hiveserver2 crash with RStudio (using RJDBC)

2014-10-06 Thread Nathalie Blais
Hello Vaibhav,

Sorry for the delay in getting back to you on this.  We now have an
up-and-running test cluster with a Hive server I can “crash at will”.  I have
been able to reproduce the crash on this new server by following the steps
mentioned below; I will now try to grab a heap dump.

In the meantime, I have observed that HiveServer2 crashes *after* the
Map/Reduce job has completed successfully.  Something in the “gymnastics” of
returning the rows to RStudio through RJDBC makes it crash.  Such a crash does
not happen with other JDBC clients; I have tried several: SQuirreL, SQL
Workbench/J, Aqua Data Studio, etc.  They all work fine with HiveServer2
through JDBC.
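
If it helps the investigation, here is the kind of chunked retrieval we have
been sketching on the client side, to see whether materializing all rows at
once is the trigger (RJDBC's standard dbSendQuery/fetch interface; the chunk
size is arbitrary and conn is the connection from the repro steps below):

# Sketch: fetch the result set in chunks instead of a single dbGetQuery()
# call, to test whether returning all rows at once triggers the crash.
library(RJDBC)
res  <- dbSendQuery(conn, "select * from test_table_connection_crash order by col1 desc")
rows <- list()
repeat {
  chunk <- fetch(res, n = 10000)      # arbitrary chunk size
  if (nrow(chunk) == 0) break         # empty data.frame: no more rows
  rows[[length(rows) + 1]] <- chunk
}
r <- do.call(rbind, rows)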

Again, thank you very much for your patience and collaboration; I’ll return
shortly with the heap dump.

Best regards,

-- Nathalie


RE: Hiveserver2 crash with RStudio (using RJDBC)

2014-10-06 Thread Nathalie Blais
Hello,

My heap dump file is very large (400 MB).  Even heavily compressed with 7z, it
still weighs 15 MB; policies @Ubisoft currently prevent me from sending it via
email =(

Would you know of a way I could attach it to the support thread other than by
email?

Thanks a lot!

-- Nathalie


RE: Hiveserver2 crash with RStudio (using RJDBC)

2014-09-25 Thread Nathalie Blais
Hello Vaibhav,

Thanks a lot for your quick response!

I will grab a heap dump as soon as I have “the OK to crash the server” and
attach it to this thread.  In the meantime, regarding our metastore, it looks
like it is remote (excerpt from our hive-site.xml below):

<property>
  <name>hive.metastore.local</name>
  <value>false</value>
</property>
<property>
  <name>hive.metastore.uris</name>
  <value>thrift://server_name:9083</value>
</property>
<property>
  <name>hive.metastore.client.socket.timeout</name>
  <value>300</value>
</property>
<property>
  <name>hive.metastore.warehouse.dir</name>
  <value>/user/hive/warehouse</value>
</property>
<property>
  <name>hive.warehouse.subdir.inherit.perms</name>
  <value>true</value>
</property>

On a side note, the forum might have received my inquiry several times.  I had 
a bit of trouble sending it and I retried a few times; please disregard any 
dupes of this request.

Thanks!

-- Nathalie

From: Vaibhav Gumashta [mailto:vgumas...@hortonworks.com]
Sent: 25 septembre 2014 03:52
To: user@hive.apache.org
Subject: Re: Hiveserver2 crash with RStudio (using RJDBC)

Nathalie,

Can you grab a heap dump at the time the server crashes? (Export this into
your environment: HADOOP_CLIENT_OPTS="-XX:+HeapDumpOnOutOfMemoryError
-XX:HeapDumpPath=give-your-path-here $HADOOP_CLIENT_OPTS".) What type of
metastore are you using with HiveServer2 - embedded (if you specify -hiveconf
hive.metastore.uris=" " in the HiveServer2 startup command, it uses the
embedded metastore) or remote?

Thanks,
--Vaibhav



Hiveserver crash with RStudio (using RJDBC)

2014-09-23 Thread Nathalie Blais
Hello,

We are currently experiencing a severe, reproducible HiveServer2 crash when
using the RJDBC connector in RStudio (please refer to the description below
for the detailed test case).  We are having a hard time pinpointing the source
of the problem, and we are wondering whether this is a known issue or a glitch
in our configuration; we would sincerely appreciate your input on this case.

Case
Severe Hiveserver2 crash when returning a certain volume of data (really not 
that big) to RStudio through RJDBC

Config Versions
Hadoop Distribution: Cloudera - cdh5.0.1p0.47
HiveServer2: 0.12
RStudio: 0.98.1056
RJDBC: 0.2-4

How to Reproduce

1.   In a SQL client application (Aqua Data Studio was used for the purpose 
of this example), create Hive test table

a.   create table test_table_connection_crash(col1 string);

2.   Load data into table (data file attached)

a.   LOAD DATA INPATH '/user/test/testFile.txt' INTO TABLE 
test_table_connection_crash;

3.   Verify row count

a.   select count(*) nbRows from test_table_connection_crash;

b.  720,000 rows

4.   Display all rows

a.   select * from test_table_connection_crash order by col1 desc

b.  All the rows are returned by the Map/Reduce job to the client and
displayed properly in the interface

5.   Open RStudio

6.   Create connection to Hive

a.   library(RJDBC)

b.  drv <- JDBC(driverClass = "org.apache.hive.jdbc.HiveDriver",
        classPath = list.files("D:/myJavaDriversFolderFromClusterInstall/",
                               pattern = "jar$", full.names = TRUE),
        identifier.quote = "`")

c.   conn <- dbConnect(drv,
        "jdbc:hive2://server_name:1/default;ssl=true;sslTrustStore=C:/Progra~1/Java/jdk1.7.0_60/jre/lib/security/cacerts;trustStorePassword=pswd",
        "user", "password")

7.   Verify connection with a small query

a.   r <- dbGetQuery(conn, "select * from test_table_connection_crash order
        by col1 desc limit 100")

b.  print(r)

c.   100 rows are returned to RStudio and properly displayed in the console 
interface

8.   Remove the limit and try the original query (as performed in the SQL 
client application)

a.   r <- dbGetQuery(conn, "select * from test_table_connection_crash order
        by col1 desc")

b.  Query starts running

c.   *** hiveserver crash ***

At worst, if the RStudio desktop client cannot handle such a volume of data,
we would expect the desktop application to crash, not the whole HiveServer2.

Please let us know whether or not you are aware of any issues of the kind.  
Also, please do not hesitate to request any configuration file you might need 
to examine.

Thank you very much!

Best regards,

Nathalie



Nathalie Blais
B.I. Developer | Technology Group
Ubisoft Montreal





Attachment: testFile.7z