Hello,

We are currently experiencing a severe, reproducible HiveServer2 crash when 
using the RJDBC connector in RStudio (please refer to the description below for 
the detailed test case). We are having a hard time pinpointing the source of 
the problem, and we are wondering whether this is a known issue or a glitch in 
our configuration. We would sincerely appreciate your input on this case.

Case
Severe HiveServer2 crash when returning a modest volume of data (720 000 
single-column rows) to RStudio through RJDBC

Config Versions
Hadoop Distribution: Cloudera - cdh5.0.1p0.47
HiveServer2: 0.12
RStudio: 0.98.1056
RJDBC: 0.2-4

How to Reproduce

1.       In a SQL client application (Aqua Data Studio was used for this 
example), create a Hive test table

a.       create table test_table_connection_crash(col1 string);

2.       Load data into table (data file attached)

a.       LOAD DATA INPATH '/user/test/testFile.txt' INTO TABLE 
test_table_connection_crash;

3.       Verify row count

a.       select count(*) nbRows from test_table_connection_crash;

b.      720 000 rows

4.       Display all rows

a.       select * from test_table_connection_crash order by col1 desc

b.      The MapReduce job returns all the rows to the client, and they are 
displayed properly in the interface

5.       Open RStudio

6.       Create connection to Hive

a.       library(RJDBC)

b.      drv <- JDBC(driverClass="org.apache.hive.jdbc.HiveDriver", 
classPath=list.files("D:/myJavaDriversFolderFromClusterInstall/", 
pattern="jar$", full.names=TRUE), identifier.quote="`")

c.       conn <- dbConnect(drv, 
"jdbc:hive2://server_name:10000/default;ssl=true;sslTrustStore=C:/Progra~1/Java/jdk1.7.0_60/jre/lib/security/cacerts;trustStorePassword=pswd",
 "user", "password")

7.       Verify the connection with a small query

a.       r <- dbGetQuery(conn, "select * from test_table_connection_crash order 
by col1 desc limit 100")

b.      print(r)

c.       100 rows are returned to RStudio and properly displayed in the console 
interface

8.       Remove the limit and try the original query (as performed in the SQL 
client application)

a.       r <- dbGetQuery(conn, "select * from test_table_connection_crash order 
by col1 desc")

b.      Query starts running

c.       *** HiveServer2 crash ***
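
As a possible additional diagnostic, step 8 could be repeated with the result 
set pulled in fixed-size chunks rather than in a single dbGetQuery() call, to 
see whether the crash depends on how the rows are retrieved. The following is 
only a minimal, untested sketch: collect_chunks and fetch_in_chunks are 
hypothetical helper names (not part of RJDBC or DBI), the chunk size of 10000 
is an arbitrary assumption, and conn is assumed to be the connection created 
in step 6 with RJDBC already loaded.

```r
# Hypothetical helper: call next_chunk(n) repeatedly and stack the results.
# Stops as soon as a chunk comes back empty or shorter than chunk_size.
collect_chunks <- function(next_chunk, chunk_size) {
  chunks <- list()
  repeat {
    chunk <- next_chunk(chunk_size)
    if (nrow(chunk) == 0) break
    chunks[[length(chunks) + 1]] <- chunk
    if (nrow(chunk) < chunk_size) break
  }
  # Returns NULL if the query produced no rows at all.
  do.call(rbind, chunks)
}

# Hypothetical wrapper: run the query with dbSendQuery() and pull the rows
# chunk_size at a time with fetch(), releasing the result set on exit.
fetch_in_chunks <- function(conn, sql, chunk_size = 10000) {
  res <- DBI::dbSendQuery(conn, sql)
  on.exit(DBI::dbClearResult(res))
  collect_chunks(function(n) DBI::fetch(res, n = n), chunk_size)
}
```

With the connection from step 6, the call would look like 
r <- fetch_in_chunks(conn, "select * from test_table_connection_crash order by col1 desc"). 
Whether this changes the server-side behavior is not known.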

In the worst case, if the RStudio desktop client cannot handle this amount of 
data, we would expect the desktop application to crash, not the whole 
HiveServer2 instance.

Please let us know whether you are aware of any issue of this kind. Also, 
please do not hesitate to request any configuration file you might need to 
examine.

Thank you very much!

Best regards,

Nathalie



Nathalie Blais
B.I. Developer | Technology Group
Ubisoft Montreal



Attachment: testFile.7z
