Problem creating an external table over a sequence file on HDFS
Hi,

I am trying to create an external table over a sequence file on HDFS. I have my own input format and a SerDe, which are compiled into a jar and added in Hive. In spite of this, I get the following error. Please suggest.

hive> create table seq (key STRING, value STRING)
ROW FORMAT SERDE 'com.org.SequenceFileKeyRecordReader'
STORED AS
INPUTFORMAT 'com.org.SequenceFileKeyInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.mapred.SequenceFileOutputFormat'
location '/user/hiveFiles/';

FAILED: Error in metadata: Cannot validate serde: com.org.SequenceFileKeyRecordReader
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask
hive>

Thanks,
Ranjitha.
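The error suggests Hive could not validate com.org.SequenceFileKeyRecordReader as a SerDe: the ROW FORMAT SERDE clause must name a class implementing Hive's SerDe interface, and this class name looks like the record reader rather than the SerDe mentioned above. A sketch of the corrected DDL, assuming the SerDe class in the jar is called com.org.SequenceFileKeySerDe (a hypothetical name) and substituting Hive's own sequence-file output format, since the OUTPUTFORMAT clause generally expects a HiveOutputFormat implementation:

ADD JAR /path/to/custom-serde.jar;   -- jar containing the custom classes (path is illustrative)

CREATE EXTERNAL TABLE seq (key STRING, value STRING)
ROW FORMAT SERDE 'com.org.SequenceFileKeySerDe'   -- hypothetical SerDe class name
STORED AS
INPUTFORMAT 'com.org.SequenceFileKeyInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat'
LOCATION '/user/hiveFiles/';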
Where should the Hive process be installed?
Hi,

I am using Hive 0.9.0 + Hadoop 2.0.1 with two machines:

Machine-1: NameNode, SecondaryNameNode, ResourceManager, and Hive
Machine-2: Proxy server, JHS (JobHistoryServer), DataNode, and NodeManager

Problem: when I execute job queries, e.g. "select count(key) from src" for the table src, the job gets killed. The exception in the application log (container_1365412074766_0031_01_03, stderr) was:

WARNING: org.apache.hadoop.metrics.jvm.EventCounter is deprecated. Please use org.apache.hadoop.log.metrics.EventCounter in all the log4j.properties files.
java.io.FileNotFoundException: /tmp/rohith/hive_2013-04-08_15-36-36_587_2956773547832982704/-mr-10001/41253f2c-9b15-49e0-a419-32ea4559e4e6 (No such file or directory)
at java.io.FileInputStream.open(Native Method)
at java.io.FileInputStream.<init>(FileInputStream.java:120)
at java.io.FileInputStream.<init>(FileInputStream.java:79)
at org.apache.hadoop.hive.ql.exec.Utilities.getMapRedWork(Utilities.java:215)
at org.apache.hadoop.hive.ql.io.HiveInputFormat.init(HiveInputFormat.java:255)
at org.apache.hadoop.hive.ql.io.HiveInputFormat.pushProjectionsAndFilters(HiveInputFormat.java:381)
at org.apache.hadoop.hive.ql.io.HiveInputFormat.pushProjectionsAndFilters(HiveInputFormat.java:374)
at org.apache.hadoop.hive.ql.io.CombineHiveInputFormat.getRecordReader(CombineHiveInputFormat.java:536)
at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.<init>(MapTask.java:161)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:382)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:335)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:154)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1232)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:149)

The syslog also shows a truncated warning:
2013-04-08 15:45:52,293 WARN [main] org.apache.hadoop.conf.Configuration: job

Please let us know where exactly HiveServer should run.

Thanks & Regards,
Rohith Sharma K S
Re: Hive Problems Reading Avro+Snappy Data
Thanks Chuck. I think the problem is the job configuration on the query. I logged back into the system this morning, started a new Hive client shell, and issued a series of more complex queries against the Avro+Snappy table; they all worked fine. So I started trying to recall what could have been different between my new Hive client shell and the old one returning NULLs. I am able to reproduce the NULLs by running SET hive.exec.compress.output=true;. A brand new Hive client has that set to false and all the queries come back normal, but the second I set it to true the NULLs return.

Best,
Matt

From: Connell, Chuck <chuck.conn...@nuance.com>
Reply-To: user@hive.apache.org
Date: Sunday, April 7, 2013 7:32 PM
To: user@hive.apache.org
Subject: RE: Hive Problems Reading Avro+Snappy Data

When you do SELECT *, Hive does not run a real MapReduce job, so it is not a good test. Something is wrong with the SerDe or InputFormat.

Chuck

From: Thomas, Matthew [mailto:mtho...@verisign.com]
Sent: Sunday, April 07, 2013 5:41 PM
To: user@hive.apache.org
Subject: Hive Problems Reading Avro+Snappy Data

Hive users,

I am having problems performing complex queries on Avro+Snappy data. If I do a SELECT * FROM Blah LIMIT 50, I see the data coming back as it should. But if I perform any more complex query, such as SELECT count(*) FROM Blah, I receive several rows of NULL values. My workflow for creating the table is described below, along with some of the setup. I am running CDH4.2 with Avro 1.7.3.

hive> select * From mthomas_testavro limit 1;
OK
Field1	Field2
03-19-2013	a
03-19-2013	b
03-19-2013	c
03-19-2013	c
Time taken: 0.103 seconds

hive> select count(*) From mthomas_testavro;
...
Total MapReduce CPU Time Spent: 6 seconds 420 msec
OK
NULL
NULL
NULL
NULL
Time taken: 17.634 seconds
...

CREATE EXTERNAL TABLE mthomas_testavro
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS
INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
LOCATION '/tmp/testavro/'
TBLPROPERTIES ('avro.schema.literal'='{
  "namespace": "hello.world",
  "name": "some_schema",
  "type": "record",
  "fields": [
    {"name": "field1", "type": "string"},
    {"name": "field2", "type": "string"}
  ]
}');

SET avro.output.codec=snappy;
SET mapred.output.compression.type=BLOCK;
SET hive.exec.compress.output=true;
SET mapred.map.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;

INSERT OVERWRITE TABLE mthomas_testavro SELECT * FROM identical_table_inGzip_format;

If I cat the output file in the external table, I see Objavro.codec^Lsnappyavro.schema?{type... at the beginning, followed by the rest of the schema and binary data, so I am assuming the Snappy compression worked. Furthermore, I also tried to query this table via Impala and both queries worked just fine. Maybe it is related to https://issues.apache.org/jira/browse/HIVE-3308? Any ideas?

Thanks.

Matt
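For anyone trying to reproduce this, a minimal sketch of the toggle Matt describes (table name as in the thread; behavior as reported above on CDH4.2):

SET hive.exec.compress.output=false;
SELECT count(*) FROM mthomas_testavro;   -- fresh-shell default: correct result

SET hive.exec.compress.output=true;
SELECT count(*) FROM mthomas_testavro;   -- the NULL rows come back

Since hive.exec.compress.output only controls whether job output is compressed, leaving it false in read-only sessions and enabling it only when writing compressed Avro is a workable stopgap.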
RE: Hive Problems Reading Avro+Snappy Data
Just so you know, there is still at least one open bug with Avro plus compression codecs like Snappy: https://issues.apache.org/jira/browse/HIVE-3308. There is a simple one-line patch, but unfortunately it has not been committed yet.
StackOverflowError when adding JARs concurrently from multiple threads
Hi All,

Recently we found that when multiple JDBC connections run ADD JAR concurrently, HiveServer throws a StackOverflowError while serializing the MapRedWork to HDFS (serializeMapRedWork). I found the related issue HIVE-2666, which is similar, but I think it missed the concurrent scenario. I believe it happens because the classloader is changed, which leads to an infinite loop. Has anyone met this issue? Any suggestions?

Regards,
Wenli
RE: Syntax for filters on timestamp data type
Hi Mark,

Correct. I just did some tests and the cast is the way to go. While implicit casts work for comparison operations (equal, not equal, ...), this is not the case for the IN clause. I think it should be, as an IN list eventually translates to a disjunction of comparisons, so it should behave the same.

Anyway, I have a working solution now. For the record, I paste two working example queries below.

Thanks a lot for your help!

Steffen

Example 1:
SELECT * FROM table1 WHERE datecol = CAST('2009-01-17 00:00:00' AS timestamp)

Example 2:
SELECT * FROM table1 WHERE datecol IN (CAST('2009-01-11 00:00:00' AS timestamp), CAST('2009-01-08 00:00:00' AS timestamp))

From: Mark Grover [mailto:grover.markgro...@gmail.com]
Sent: 05 April 2013 18:43
To: user@hive.apache.org
Subject: Re: Syntax for filters on timestamp data type

Steffen,

One thing that may be different is that equals can cast its operands, but that may not be true for IN. FWIW, this is just me speculating; I haven't looked at the code yet. Perhaps you could use explicit casting to get around this?

On Fri, Apr 5, 2013 at 7:36 AM, LUTTER, Steffen <steffen.lut...@sap.com> wrote:

Equal, not equal, less than, less or equal, greater than, greater or equal all work. The function execution in the IN clause also seems to work, as the error message states that the result type is bigint. Following the error message, it expects the input as timestamp, but I couldn't find a syntax to express timestamps in HiveQL. Two questions remain:

1) How to express timestamps in HiveQL?
2) Why doesn't the IN clause support comparisons between timestamp and bigint, if "equal" and so on does?

Thanks for any thoughts on this,

Steffen

From: Nitin Pawar [mailto:nitinpawar...@gmail.com]
Sent: 05 April 2013 16:11
To: user@hive.apache.org
Subject: Re: Syntax for filters on timestamp data type

I am not sure the IN clause supports executing functions in the query. Did it fail when you tried the less-than / greater-than type?

On Fri, Apr 5, 2013 at 7:36 PM, LUTTER, Steffen <steffen.lut...@sap.com> wrote:

Hi,

I have a question regarding filters on timestamps. The syntax seems to be UNIX_TIMESTAMP('yyyy-MM-dd hh:mm:ss'); is there another way to express a datetime type? The problem is that I get an exception when using the IN list syntax, while the equal comparison works without problems.

Example:
SELECT * FROM table1 WHERE datecol IN (UNIX_TIMESTAMP('2009-01-05 00:00:00'), UNIX_TIMESTAMP('2009-01-10 00:00:00'))

Throws exception:
Caused by: java.sql.SQLException: Query returned non-zero code: 10014, cause: FAILED: SemanticException [Error 10014]: Line 5:21 Wrong arguments ''2009-01-10 00:00:00'': The arguments for IN should be the same type! Types are: {timestamp IN (bigint, bigint)}
at org.apache.hadoop.hive.jdbc.HiveStatement.executeQuery(HiveStatement.java:189)
at org.apache.hadoop.hive.jdbc.HiveStatement.execute(HiveStatement.java:127)
at com.sap.connectivity.cs.java.drivers.jdbc.JDBCAPI$Statement.execute(JDBCAPI.java:1648)
at com.sap.connectivity.cs.java.drivers.jdbc.JDBCDriver.prepare(JDBCDriver.java:1760)
... 15 more

The following query works:
SELECT * FROM table1 WHERE datecol = UNIX_TIMESTAMP('2009-01-17 00:00:00')

Is there another syntax for datetime types? Could it be a bug in the filter "IN list" operation?

Thanks in advance,

Steffen

--
Nitin Pawar
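Steffen's point that an IN list is logically a disjunction of comparisons suggests a second workaround worth noting (a sketch, not tested in the thread): expanding the IN list into ORed equality comparisons, each of which has the exact form already confirmed to work above.

SELECT * FROM table1
WHERE datecol = UNIX_TIMESTAMP('2009-01-05 00:00:00')
   OR datecol = UNIX_TIMESTAMP('2009-01-10 00:00:00')

Each disjunct is a timestamp-to-bigint equality that Hive implicitly casts, so this sidesteps the stricter same-type check that IN applies to its whole argument list.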
Re: Syntax for filters on timestamp data type
Great. So can we assume that the equals comparison can auto-cast, but the IN clause cannot?

--
Nitin Pawar
RE: Syntax for filters on timestamp data type
Exactly.

From: Nitin Pawar [mailto:nitinpawar...@gmail.com]
Sent: 08 April 2013 16:06
To: user@hive.apache.org
Subject: Re: Syntax for filters on timestamp data type

Great. So can we assume that the equals comparison can auto-cast, but the IN clause cannot?
Re: Huge join performance issue
Did you verify that all your available mappers are running (and the reducers too)? If you have a small number of partitions with huge files, you might be underutilizing mappers (check that the files are being split). Also, it might be optimal to have a single wave of reducers by setting the number of reduce tasks appropriately.

You might also consider optimizing a simpler query first:

select t1.a, count(*)
from (select a from baseTB1 where ...) t1   -- filter by partition as well
join (select a from baseTB2 where ...) t2   -- filter by partition as well
on t1.a = t2.a
group by t1.a

just to give you an idea how much overhead the extra columns are adding. If the columns are pretty big, they could be causing the slowdown.

igor
decide.com

On Sat, Apr 6, 2013 at 2:30 PM, Gabi D <gabi...@gmail.com> wrote:

Thank you for your answer, Nitin. Does anyone have additional insight into this? It would be greatly appreciated.

On Thu, Apr 4, 2013 at 3:39 PM, Nitin Pawar <nitinpawar...@gmail.com> wrote:

You don't really need subqueries to join tables that have common columns; it's additional overhead. The best way to filter your data and speed up processing is how you lay out your data. When you have larger tables, I would use partitioning and bucketing to trim down the data and improve join performance.

distribute by is mainly used when you have custom map-reduce scripts and you want to use the transform functionality in Hive. I have not used it a lot, so I am not sure on that part. It also helps to write where clauses in join statements to reduce the dataset you want to join.

On Thu, Apr 4, 2013 at 5:53 PM, Gabi D <gabi...@gmail.com> wrote:

Hi all,

I have two tables I need to join and then summarize. They are both huge (about 1B rows each, in the relevant partitions) and the query runs for over 2 hours, creating 5T of intermediate data. The current query looks like this:

select t1.b, t1.c, t2.d, t2.e, count(*)
from (select a, b, c from baseTB1 where ...) t1   -- filter by partition as well
join (select a, d, e from baseTB2 where ...) t2   -- filter by partition as well
on t1.a = t2.a
group by t1.b, t1.c, t2.d, t2.e

Two questions:
1. Would joining baseTB1 and baseTB2 directly (instead of via subqueries) be better in any way? (I know subqueries cause a lot of writes of intermediate data, but we also understand it's best to filter down the data being joined. Which is more correct?)
2. Can I use 'distribute by' and/or 'sort by' in some way that would help? My understanding at the moment is that the problem lies in the fact that the reduces are keyed on column a while the group by is on column b ...

Any thoughts would be appreciated.

--
Nitin Pawar
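As a sketch of the layout Nitin suggests (table and column names from the thread; the partition column dt, the bucket count, and the settings are illustrative assumptions), partitioning both tables lets Hive prune the data read, and bucketing them on the join key a enables a bucket map join instead of a full shuffle:

CREATE TABLE baseTB1 (a STRING, b STRING, c STRING)
PARTITIONED BY (dt STRING)
CLUSTERED BY (a) SORTED BY (a) INTO 64 BUCKETS;

CREATE TABLE baseTB2 (a STRING, d STRING, e STRING)
PARTITIONED BY (dt STRING)
CLUSTERED BY (a) SORTED BY (a) INTO 64 BUCKETS;

-- enforce bucketing while loading the tables
SET hive.enforce.bucketing=true;

-- at query time, join bucket-by-bucket on the shared key
SET hive.optimize.bucketmapjoin=true;
SET hive.optimize.bucketmapjoin.sortedmerge=true;   -- sorted-merge variant; depending on version this may also require BucketizedHiveInputFormat

With both sides bucketed and sorted on a, matching buckets can be joined directly, which is one way to attack the 5T of shuffled intermediate data.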
Should the SQL schema for HIVE be identical to the SQL schema for a relational database?
Hi,

It would be terrific to get some advice on migrating a schema from an RDBMS to Hive. Should the SQL schema for Hive be identical to the SQL schema for a PostgreSQL/MySQL database?

Specifically: I have an application that generates events that look like the following:

{"ts": N+1, "userId": 123, "event": "location", "payload": {"verticalAccuracy": 10, "longitude": -73.99718090313884, "latitude": 40.72473278788106, "altitude": 27.79653739929199, "horizontalAccuracy": 65}}
{"ts": N+2, "userId": 123, "event": "addProduct", "payload": ["cart", "osprey-kestrel-48"]}
...

Events are written to persistent storage (AWS S3). A 'worker' wakes up periodically, reads the new events, and inserts them into a PostgreSQL database. The database has user, product, user_product (user_id, product_id, action:(viewed|wishlist...), timestamp), location, etc. tables.

We are migrating to Hive. Should we also create user, product, user_product, location, etc. as Hive tables and have a MapReduce job process the event files to populate them? Or should/can we implement a different schema that would allow external Hive tables to map directly to the generated event files? Or a mix of both?

Thank you for your help!
-matt
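For reference, a minimal sketch of the "external tables mapping directly to the event files" option (the bucket path and table name are hypothetical): each event is read as one JSON string, and fields are extracted at query time with Hive's built-in get_json_object UDF.

CREATE EXTERNAL TABLE raw_events (json STRING)
LOCATION 's3n://my-bucket/events/';   -- hypothetical S3 path

SELECT get_json_object(json, '$.userId') AS user_id,
       get_json_object(json, '$.event')  AS event,
       get_json_object(json, '$.ts')     AS ts
FROM raw_events;

A JSON SerDe would expose typed columns directly instead of a raw string; either way, a raw external table like this commonly feeds periodic INSERTs into managed, partitioned tables for the heavier queries.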
RE: Should the SQL schema for HIVE be identical to the SQL schema for a relational database?
Some general guidance: aim for minimal JOINs in your regular queries. The biggest change from a normal-form RDBMS schema is to denormalize, such that joins do not come into play until a query has already performed data reduction via filtering or aggregation. This implies a star schema comprising one primary fact table that has sufficient data in it to sensibly partition it and to support direct filtering and aggregation. Supplementing the main table will be dimension tables that provide additional data to flesh out result sets. It looks like your data will be highly amenable to this.

-mike.
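A sketch of the star schema Mike describes, applied to this event data (all table, column, and partition names are illustrative assumptions): one denormalized, date-partitioned fact table supports filtering and aggregation directly, and small dimension tables are joined only to enrich the reduced result.

-- fact table: one row per event, partitioned for pruning
CREATE TABLE fact_event (
  user_id    BIGINT,
  product_id STRING,
  action     STRING,   -- viewed, wishlist, cart, ...
  ts         BIGINT
)
PARTITIONED BY (dt STRING);

-- small dimension table, consulted only after the fact table is reduced
CREATE TABLE dim_product (
  product_id STRING,
  name       STRING,
  category   STRING
);

-- aggregate first, then join the dimension to flesh out the result
SELECT d.category, count(*) AS cart_adds
FROM (
  SELECT product_id
  FROM fact_event
  WHERE dt = '2013-04-08' AND action = 'cart'
) f
JOIN dim_product d ON (f.product_id = d.product_id)
GROUP BY d.category;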