Re: Maintaining big and complex Hive queries
Hi Elliot,

Thanks for letting me know. HPL/SQL sounded particularly interesting. But in the documentation I could not see any way to pass output generated by one Hive query to the next one. The tool looks good as a homogeneous PL/SQL platform for multiple big-data systems (http://www.hplsql.org/about). However, in order to break up a single complex Hive query, DDLs look to be the only way in HPL/SQL too. Or is there an alternate way that I might have missed?

-- Saumitra S. Shahapure

On Thu, Dec 15, 2016 at 6:21 PM, Elliot West wrote:
> I notice that HPL/SQL is not mentioned on the page I referenced, however I
> expect that is another approach that you could use to modularise:
>
> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=59690156
> http://www.hplsql.org/doc
>
> On 15 December 2016 at 17:17, Elliot West wrote:
>
>> Some options are covered here, although there is no definitive guidance
>> as far as I know:
>>
>> https://cwiki.apache.org/confluence/display/Hive/Unit+Testing+Hive+SQL#UnitTestingHiveSQL-Modularisation
>>
>> On 15 December 2016 at 17:08, Saumitra Shahapure <
>> saumitra.offic...@gmail.com> wrote:
>>
>>> Hello,
>>>
>>> We are running and maintaining a quite big and complex Hive SELECT query
>>> right now. It's basically a single SELECT query which performs a JOIN of
>>> about ten other SELECT query outputs.
>>>
>>> The simplest way to refactor that we can think of is to break this query
>>> down into multiple views and then join the views. There is a similar
>>> possibility to create intermediate tables.
>>>
>>> However, creating multiple DDLs in order to maintain a single DML is not
>>> very smooth. We would end up polluting the metastore database by creating
>>> views / intermediate tables which are used in just this ETL.
>>>
>>> What are the other efficient ways to maintain complex SQL queries
>>> written in Hive? Are there better ways to break a Hive query into multiple
>>> modules?
>>>
>>> -- Saumitra S. Shahapure
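A note on the HPL/SQL question above: the HPL/SQL documentation does describe variables and SELECT ... INTO, which may be one way to pass a scalar result from one query into the next without creating any view or table. A minimal sketch, assuming hypothetical tables sales and big_sales; this only covers scalar (or single-row) results, not full result sets:

```sql
-- HPL/SQL sketch (not plain HiveQL): capture one query's output in a
-- variable and feed it into the next query. Table and column names
-- here are made up for illustration.
DECLARE threshold INT;

-- First query: its scalar output lands in the variable.
SELECT max(amount) INTO threshold FROM sales WHERE dt = '2016-12-14';

-- Second query: consumes the variable instead of a materialised
-- view or intermediate table.
INSERT OVERWRITE TABLE big_sales
SELECT * FROM sales WHERE amount >= threshold;
```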
Maintaining big and complex Hive queries
Hello,

We are running and maintaining a quite big and complex Hive SELECT query right now. It's basically a single SELECT query which performs a JOIN of about ten other SELECT query outputs.

The simplest way to refactor that we can think of is to break this query down into multiple views and then join the views. There is a similar possibility to create intermediate tables.

However, creating multiple DDLs in order to maintain a single DML is not very smooth. We would end up polluting the metastore database by creating views / intermediate tables which are used in just this ETL.

What are the other efficient ways to maintain complex SQL queries written in Hive? Are there better ways to break a Hive query into multiple modules?

-- Saumitra S. Shahapure
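One option discussed elsewhere for this kind of refactoring, which avoids registering anything in the metastore, is Hive's WITH clause (common table expressions, available since Hive 0.13): each sub-SELECT becomes a named block inside the single statement. A minimal sketch with made-up table and column names; note it modularises within one statement rather than across files:

```sql
-- Each former sub-SELECT becomes a named CTE; nothing is created in
-- the metastore. Tables clicks/orders and their columns are
-- hypothetical, standing in for the ~10 sub-queries.
WITH
  clicks_agg AS (
    SELECT user_id, count(*) AS clicks
    FROM   clicks
    GROUP BY user_id
  ),
  orders_agg AS (
    SELECT user_id, sum(amount) AS spend
    FROM   orders
    GROUP BY user_id
  )
SELECT c.user_id, c.clicks, o.spend
FROM   clicks_agg c
JOIN   orders_agg o ON c.user_id = o.user_id;
```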
java.lang.ArrayIndexOutOfBoundsException in getSplitHosts
Hello,

I am using Hive 0.13.1 in EMR and trying to create a Hive table on top of our custom file system (which is a thin wrapper on top of S3), and I am getting an error while accessing the data in the table. Stack trace and command history are below. I suspected that CombineFileInputFormat was trying to access the splits in an incorrect way, but HiveInputFormat causes the same problem. Has anyone seen such a problem before? Note that the SerDe and FileSystem are both custom. Could either of those be causing this problem?

hive> add jar /home/hadoop/logprocessing-pig-combined.jar;
Added /home/hadoop/logprocessing-pig-combined.jar to class path
Added resource: /home/hadoop/logprocessing-pig-combined.jar
hive> Create external table nulf
    > (
    > tm STRING
    > )
    > ROW FORMAT SERDE 'logprocessing.nulf.basic.BasicHiveSerDe'
    > location 'cda://path/to/logs/';
OK
Time taken: 6.706 seconds
hive> set hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat;
hive> select count(*) from nulf;
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=
In order to set a constant number of reducers:
  set mapreduce.job.reduces=
java.lang.ArrayIndexOutOfBoundsException: 1
    at org.apache.hadoop.mapred.FileInputFormat.getSplitHosts(FileInputFormat.java:529)
    at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:320)
    at org.apache.hadoop.hive.ql.io.HiveInputFormat.addSplitsForGroup(HiveInputFormat.java:290)
    at org.apache.hadoop.hive.ql.io.HiveInputFormat.getSplits(HiveInputFormat.java:371)
    at org.apache.hadoop.mapreduce.JobSubmitter.writeOldSplits(JobSubmitter.java:520)
    at org.apache.hadoop.mapreduce.JobSubmitter.writeSplits(JobSubmitter.java:512)
    at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:394)
    at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1285)
    at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1282)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
    at org.apache.hadoop.mapreduce.Job.submit(Job.java:1282)
    at org.apache.hadoop.mapred.JobClient$1.run(JobClient.java:562)
    at org.apache.hadoop.mapred.JobClient$1.run(JobClient.java:557)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
    at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:557)
    at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:548)
    at org.apache.hadoop.hive.ql.exec.mr.ExecDriver.execute(ExecDriver.java:420)
    at org.apache.hadoop.hive.ql.exec.mr.MapRedTask.execute(MapRedTask.java:136)
    at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:153)
    at org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:85)
    at org.apache.hadoop.hive.ql.Driver.launchTask(Driver.java:1503)
    at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:1270)
    at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1088)
    at org.apache.hadoop.hive.ql.Driver.run(Driver.java:911)
    at org.apache.hadoop.hive.ql.Driver.run(Driver.java:901)
    at org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:275)
    at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:227)
    at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:430)
    at org.apache.hadoop.hive.cli.CliDriver.executeDriver(CliDriver.java:803)
    at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:697)
    at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:636)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.hadoop.util.RunJar.main(RunJar.java:212)
Job Submission failed with exception 'java.lang.ArrayIndexOutOfBoundsException(1)'
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask

-- Saumitra S. Shahapure
Re: Spark performance for small queries
Hi Gopal,

Thanks for the informative answer, but my question was about the difference in the processing of Spark SQL and Hive. Right now I am not trying to optimize either. I totally agree that Hive can perform much better than the numbers I got. I was just wondering: even though both systems would generate quite similar execution plans for this query, what exactly is making the difference? My question is from the point of view of understanding both systems. Answering your questions inline,

-- Regards,
Saumitra Shahapure

On Fri, Jan 23, 2015 at 5:01 AM, Gopal V wrote:
> On 1/22/15, 3:03 AM, Saumitra Shahapure (Vizury) wrote:
>
>> We were comparing performance of some of our production hive queries
>> between Hive and Spark. We compared Hive(0.13)+hadoop (1.2.1) against both
>> Spark 0.9 and 1.1. We could see that the performance gains have been good
>> in Spark.
>
> Is there any particular reason you are using an ancient & slow Hadoop-1.x
> version instead of a modern YARN 2.0 cluster?

The cluster I was experimenting on is a legacy cluster in our system. We are already in the process of migrating everything from here to Hadoop 2.

>> We tried a very simple query,
>> select count(*) from T where col3=123
>> in both sparkSQL and Hive (with hive.map.aggr=true) and found that Spark
>> performance had been 2x better than Hive (120sec vs 60sec). Table T is
>> stored in S3 and contains 600MB single GZIP file.
>
> Not sure if you understand that what you're doing is one of the worst
> cases for both the platforms.
>
> Using a big single gzip file is like a massive anti-pattern.
>
> I'm assuming what you want is fast SQL in Hive (since this is the hive
> list) along with all the other lead/lag functions there.
>
> You need a SQL oriented columnar format like ORC, mix with YARN and add
> Tez, that is going to be somewhere near 10-12 seconds.
>
> Oh, and that's a ball-park figure for a single node.

Agree on that end as well. Smaller gzipped files or uncompressed files will give better performance. This specific query is just a kind-of-rare test case that one of our jobs encounters sometimes.

> Cheers,
> Gopal
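Gopal's suggestion of a columnar format can be sketched in plain HiveQL: copy the gzip-backed text table into an ORC-backed table once, then point queries at the ORC copy. A minimal sketch; the table T is the one from the thread, while t_orc is a hypothetical name:

```sql
-- One-time conversion of the gzip-backed table into ORC.
-- ORC files are splittable and columnar, so a filtered count(*)
-- reads far less data than scanning a single gzip stream.
CREATE TABLE t_orc STORED AS ORC AS
SELECT * FROM T;

-- The original query, now against the columnar copy:
SELECT count(*) FROM t_orc WHERE col3 = 123;
```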
Re: Spark performance for small queries
Hello,

We were comparing performance of some of our production hive queries between Hive and Spark. We compared Hive (0.13) + hadoop (1.2.1) against both Spark 0.9 and 1.1. We could see that the performance gains have been good in Spark.

We tried a very simple query,

select count(*) from T where col3=123

in both sparkSQL and Hive (with hive.map.aggr=true) and found that Spark performance had been 2x better than Hive (120 sec vs 60 sec). Table T is stored in S3 and contains a single 600MB GZIP file.

My question is, why is Spark faster than Hive here? In both cases, the file will be downloaded, uncompressed, and the lines will be counted by a single process. In Hive's case, the reducer will be the identity function since hive.map.aggr is true. Note that disk spills and network I/O are minimal in Hive's case as well.
Simplest way to create partition hierarchies
Hello,

We are using Hive 0.12. We have a large directory in S3 where we dump daily logs of various application servers:

s3://logs/20141005/machine1/log.gz
s3://logs/20141005/machine2/log.gz
s3://logs/20141006/machine1/log.gz
s3://logs/20141006/machine2/log.gz

In Hive, we have mapped it to a table with date and machine-name as partition columns. We add new partitions for all machines every day manually by

ALTER TABLE applogs ADD PARTITION (machine = 8, dt = '20141006') LOCATION xyz;

Now the problem is that we have such partitions for several years for 100s of machines, which is now clogging Hive metadata. Is there any way to simplify this?

Note that our current approach has these advantages:
1. Partition addition does not require launching any mapreduce job; it just adds a mapping in metadata.
2. Querying on a specific date range and a specific machine range is easily possible. Queries on only one date and only one machine are extremely fast.

We do not want to lose these advantages. Also, changing the directory hierarchy is costly for us. We already looked into dynamic partitioning <https://cwiki.apache.org/confluence/display/Hive/DynamicPartitions> but it launches a mapreduce job to create the data hierarchies. In our case, the hierarchy is already created.

-- Regards,
Saumitra Shahapure
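For what it's worth, metastore-only partition registration can at least be batched: a single ALTER TABLE statement may add many partitions, still without launching any mapreduce job. A sketch against the applogs table above (partition values and locations are illustrative); the commented MSCK alternative would only apply if the directories were relaid out in Hive's key=value convention:

```sql
-- Several partitions registered in one metastore statement;
-- no mapreduce job runs, same as single-partition ADD PARTITION.
ALTER TABLE applogs ADD
  PARTITION (dt = '20141006', machine = 1) LOCATION 's3://logs/20141006/machine1/'
  PARTITION (dt = '20141006', machine = 2) LOCATION 's3://logs/20141006/machine2/';

-- Only if the layout were .../dt=20141006/machine=1/... could Hive
-- discover partitions itself, avoiding explicit ADD PARTITION:
-- MSCK REPAIR TABLE applogs;
```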
Re: HDFS file system size issue
Hi Rahman,

These are a few lines from hadoop fsck / -blocks -files -locations:

/mnt/hadoop/hive/warehouse/user.db/table1/000255_0 44323326 bytes, 1 block(s): OK
0. blk_-7919979022650423857_446500 len=44323326 repl=3 [ip1:50010, ip2:50010, ip3:50010]
/mnt/hadoop/hive/warehouse/user.db/table1/000256_0 44566965 bytes, 1 block(s): OK
0. blk_-576894812882540_446288 len=44566965 repl=3 [ip1:50010, ip2:50010, ip4:50010]

Biswa may have guessed the replication factor from the fsck summary that I posted earlier. I am posting it again for today's run:

Status: HEALTHY
 Total size: 58143055251 B
 Total dirs: 307
 Total files: 5093
 Total blocks (validated): 3903 (avg. block size 14897016 B)
 Minimally replicated blocks: 3903 (100.0 %)
 Over-replicated blocks: 0 (0.0 %)
 Under-replicated blocks: 92 (2.357161 %)
 Mis-replicated blocks: 0 (0.0 %)
 Default replication factor: 2
 Average block replication: 3.1401486
 Corrupt blocks: 0
 Missing replicas: 92 (0.75065273 %)
 Number of data-nodes: 9
 Number of racks: 1
FSCK ended at Tue Apr 15 13:20:25 UTC 2014 in 655 milliseconds

The filesystem under path '/' is HEALTHY

I have not overridden dfs.datanode.du.reserved. It defaults to 0.

$ less $HADOOP_HOME/conf/hdfs-site.xml | grep -A3 'dfs.datanode.du.reserved'
$ less $HADOOP_HOME/src/hdfs/hdfs-default.xml | grep -A3 'dfs.datanode.du.reserved'
dfs.datanode.du.reserved
0
Reserved space in bytes per volume. Always leave this much space free for non dfs use.

Below is du -h on every node. FYI, my dfs.data.dir is /mnt/hadoop/dfs/data and all hadoop/hive logs are dumped in /mnt/logs in various directories. All machines have 400GB for /mnt.
$ for i in `echo $dfs_slaves`; do ssh $i 'du -sh /mnt/hadoop; du -sh /mnt/hadoop/dfs/data; du -sh /mnt/logs;'; done
225G /mnt/hadoop   224G /mnt/hadoop/dfs/data   61M /mnt/logs
281G /mnt/hadoop   281G /mnt/hadoop/dfs/data   63M /mnt/logs
139G /mnt/hadoop   139G /mnt/hadoop/dfs/data   68M /mnt/logs
135G /mnt/hadoop   134G /mnt/hadoop/dfs/data   92M /mnt/logs
165G /mnt/hadoop   164G /mnt/hadoop/dfs/data   75M /mnt/logs
137G /mnt/hadoop   137G /mnt/hadoop/dfs/data   95M /mnt/logs
160G /mnt/hadoop   160G /mnt/hadoop/dfs/data   74M /mnt/logs
180G /mnt/hadoop   122G /mnt/hadoop/dfs/data   23M /mnt/logs
139G /mnt/hadoop   138G /mnt/hadoop/dfs/data   76M /mnt/logs

All these numbers are for today, and may differ a bit from yesterday's. Today hadoop dfs -dus is 58GB and the namenode is reporting DFS Used as 1.46TB.

Pardon me for making the mail dirty with a lot of copy-pastes; hope it's still readable,

-- Saumitra S. Shahapure

On Tue, Apr 15, 2014 at 2:57 AM, Abdelrahman Shettia <ashet...@hortonworks.com> wrote:
> Hi Biswa,
>
> Are you sure that the replication factor of the files is three? Please
> run a 'hadoop fsck / -blocks -files -locations' and see the replication
> factor for each file. Also, post the configuration of
> dfs.datanode.du.reserved and please check the real space presented by a
> DataNode by running 'du -h'.
>
> Thanks,
> Rahman
>
> On Apr 14, 2014, at 2:07 PM, Saumitra wrote:
>
> Hello,
>
> Biswanath, looks like we have confusion in the calculation; 1TB would be
> equal to 1024GB, not 114GB.
>
> Sandeep, I checked the log directory size as well. Log directories are
> hardly a few GBs; I have configured log4j properties so that logs won't
> be too large.
>
> In our slave machines, we have a 450GB disk partition for hadoop logs and
> DFS. Over there the logs directory is < 10GB and the rest of the space is
> occupied by DFS. A 10GB partition is for /.
>
> Let me quote my confusion point once again:
>
>>> Basically I wanted to point out the discrepancy between the name node
>>> status page and hadoop dfs -dus. In my case, the former reports DFS
>>> usage as 1TB and the latter reports it to be 35GB. What are the factors
>>> that can cause this difference? And why is just 35GB data causing DFS
>>> to hit its limits?
>
> I am talking about the name node status page on port 50070. Here is the
> screenshot of my name node status page
>
> As I understand, 'DFS used' is the space taken by DFS; non-DFS used is
> space taken by non-DFS data like logs or other local files from users.
> Namenode shows that DFS used is ~1TB but hadoop dfs -dus shows it to be
> ~38GB.
>
> On 14-Apr-2014, at 12:33 pm, Sandeep Nemuri wrote:
>
> Please check your logs directory usage.
>
> On Mon, Apr 14, 2014 at 12:08 PM, Biswajit Nayak <
> biswajit.na...@inmobi.com> wrote:
>
>> What's the replication factor you have? I believe it should be 3. hadoop
>> dus shows the disk usage without replication, while the name node ui page
>> gives it with replication.
>>
>> 38gb * 3 =114gb ~ 1TB
>>
>> ~Biswa
>> -oThe important thing is not to stop questioning o-
>>
>> On Mon, Apr 14, 2014 at 9:38 AM, Saumitra wrote:
>>
>>> Hi Biswajeet,
>>>
>>> Non-dfs usage is ~100GB over the clu
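Stepping back from the quoted thread: the raw footprint implied by the fsck summary can be checked directly (logical size × average block replication). With the numbers posted above it comes out to roughly 170GB, well short of the 1.46TB the namenode reports, which supports the point that replication alone does not explain the gap. A quick sanity check:

```shell
# Logical bytes and average replication, copied from the fsck summary above.
logical_bytes=58143055251
avg_replication=3.1401486

# Raw bytes on disk = logical bytes * average block replication.
raw_gb=$(awk -v b="$logical_bytes" -v r="$avg_replication" \
  'BEGIN { printf "%.0f", b * r / (1024*1024*1024) }')
echo "expected raw usage: ~${raw_gb} GB"
# prints: expected raw usage: ~170 GB
```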
Re: Handling hierarchical data in Hive
Hi Nitin/Prasan,

Thanks for your replies, I appreciate your help :)

Clustering looks to be quite close to what we want. However, one main gap is that we need to fire a hive query to populate the clusters. In our case, the clustered data is already there, so the computation in the Hive query would be redundant. If

CREATE TABLE analyze (generated_by INT, other_representative_field INT)
PARTITIONED BY (dt STRING)
CLUSTERED BY (generated_by) INTO 100 BUCKETS;

just accepted the s3 directory hierarchy that we have (as explained in the first mail), our problem would be solved.

Another interesting solution seems to be creating a partition on the dt field and creating a Hive index/view on the *generated_by* field. If anyone has insights around these, they would be really helpful. Meanwhile we will try to solve our problem with buckets/indices.

-- Regards,
Saumitra Shahapure

On Tue, Mar 25, 2014 at 7:44 PM, Prasan Samtani wrote:
> Hi Saumitra,
>
> You might want to look into clustering within the partition. That is,
> partition by "day", but cluster by "generated_by" (within those
> partitions), and see if that improves performance. Refer to the CLUSTER BY
> command in the Hive Language Manual.
>
> -Prasan
>
> On Mar 25, 2014, at 4:26 AM, "Saumitra Shahapure (Vizury)" <
> saumitra.shahap...@vizury.com> wrote:
>
> Hi Nitin,
>
> We are not facing the small files problem since the data is in S3. Also,
> we do not want to merge files. Merging files to create a large analyze
> table for, say, one day would slow down queries fired on a specific day
> and *generated_by*.
>
> Let me explain my problem in other words.
> Right now we are over-partitioning our table. Over-partitioning gives us
> the benefit that a query on 1-2 partitions is very fast. Its side-effect
> is that if we try to query a large number of partitions, the query is too
> slow. Is there a way to get good performance in both of the scenarios?
>
> --
> Regards,
> Saumitra Shahapure
>
> On Tue, Mar 25, 2014 at 4:25 PM, Nitin Pawar wrote:
>
>> see if this is what you are looking for
>> https://github.com/sskaje/hive_merge
>>
>> On Tue, Mar 25, 2014 at 4:21 PM, Saumitra Shahapure (Vizury) <
>> saumitra.shahap...@vizury.com> wrote:
>>
>>> Hello,
>>>
>>> We are using Hive to query S3 data. For one of our tables, named analyze,
>>> we generate data hierarchically. The first level of the hierarchy is date
>>> and the second level is a field named *generated_by*. E.g. for 20 March
>>> we may have S3 directories such as
>>> s3://analyze/20140320/111/
>>> s3://analyze/20140320/222/
>>> s3://analyze/20140320/333/
>>> The size of the files in each folder is typically small.
>>>
>>> Till now we have been using static partitioning so that queries on a
>>> specific date and *generated_by* would be faster.
>>>
>>> Now the problem is that the number of *generated_by* folders has
>>> increased to 1000s. Every day we end up adding 1000s of partitions to
>>> Hive. So queries on analyze over one month are slowed down.
>>>
>>> Is there any way to get rid of partitions, and at the same time maintain
>>> good performance for queries which are fired on a specific day and
>>> *generated_by*?
>>> --
>>> Regards,
>>> Saumitra Shahapure
>>
>> --
>> Nitin Pawar
Re: Handling hierarchical data in Hive
Hi Nitin,

We are not facing the small files problem since the data is in S3. Also, we do not want to merge files. Merging files to create a large analyze table for, say, one day would slow down queries fired on a specific day and *generated_by*.

Let me explain my problem in other words. Right now we are over-partitioning our table. Over-partitioning gives us the benefit that a query on 1-2 partitions is very fast. Its side-effect is that if we try to query a large number of partitions, the query is too slow. Is there a way to get good performance in both of the scenarios?

-- Regards,
Saumitra Shahapure

On Tue, Mar 25, 2014 at 4:25 PM, Nitin Pawar wrote:
> see if this is what you are looking for
> https://github.com/sskaje/hive_merge
>
> On Tue, Mar 25, 2014 at 4:21 PM, Saumitra Shahapure (Vizury) <
> saumitra.shahap...@vizury.com> wrote:
>
>> Hello,
>>
>> We are using Hive to query S3 data. For one of our tables, named analyze,
>> we generate data hierarchically. The first level of the hierarchy is date
>> and the second level is a field named *generated_by*. E.g. for 20 March
>> we may have S3 directories such as
>> s3://analyze/20140320/111/
>> s3://analyze/20140320/222/
>> s3://analyze/20140320/333/
>> The size of the files in each folder is typically small.
>>
>> Till now we have been using static partitioning so that queries on a
>> specific date and *generated_by* would be faster.
>>
>> Now the problem is that the number of *generated_by* folders has
>> increased to 1000s. Every day we end up adding 1000s of partitions to
>> Hive. So queries on analyze over one month are slowed down.
>>
>> Is there any way to get rid of partitions, and at the same time maintain
>> good performance for queries which are fired on a specific day and
>> *generated_by*?
>> --
>> Regards,
>> Saumitra Shahapure
>
> --
> Nitin Pawar
Handling hierarchical data in Hive
Hello,

We are using Hive to query S3 data. For one of our tables, named analyze, we generate data hierarchically. The first level of the hierarchy is date and the second level is a field named *generated_by*. E.g. for 20 March we may have S3 directories such as

s3://analyze/20140320/111/
s3://analyze/20140320/222/
s3://analyze/20140320/333/

The size of the files in each folder is typically small.

Till now we have been using static partitioning so that queries on a specific date and *generated_by* would be faster.

Now the problem is that the number of *generated_by* folders has increased to 1000s. Every day we end up adding 1000s of partitions to Hive. So queries on analyze over one month are slowed down.

Is there any way to get rid of partitions, and at the same time maintain good performance for queries which are fired on a specific day and *generated_by*?

-- Regards,
Saumitra Shahapure
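The bucketing approach suggested in replies elsewhere in this thread (partition only on the date, bucket on *generated_by* within each partition, so per-day partition counts stop growing with the number of generators) can be sketched as follows. Column types, the bucket count, and the staging table name are illustrative:

```sql
-- One partition per day instead of thousands; generated_by values are
-- hashed into a fixed number of buckets within each partition.
CREATE TABLE analyze (
  generated_by INT,
  other_representative_field INT
)
PARTITIONED BY (dt STRING)
CLUSTERED BY (generated_by) INTO 100 BUCKETS;

-- Populating the buckets does require a query per load; analyze_staging
-- is a hypothetical table over the raw S3 hierarchy.
SET hive.enforce.bucketing = true;
INSERT OVERWRITE TABLE analyze PARTITION (dt = '20140320')
SELECT generated_by, other_representative_field
FROM   analyze_staging
WHERE  dt = '20140320';
```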