Re: Maintaining big and complex Hive queries

2016-12-21 Thread Saumitra Shahapure
Hi Elliot,

Thanks for letting me know. HPL/SQL sounds particularly interesting, but in
the documentation I could not see any way to pass the output generated by
one Hive query to the next one. The tool looks good as a homogeneous PL/SQL
platform for multiple big-data systems (http://www.hplsql.org/about).

However, in order to break up a single complex Hive query, DDLs look to be
the only way in HPL/SQL too. Or is there an alternative that I might have
missed?
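
For scalar values, something along these lines seems possible with HPL/SQL
variables (the table name here is made up), but I could not find an
equivalent for passing a full result set:

```sql
-- HPL/SQL sketch; `applogs` is a hypothetical table.
-- A scalar result is captured in a variable and reused in the next query.
DECLARE max_dt STRING;
SELECT MAX(dt) INTO max_dt FROM applogs;
SELECT COUNT(*) FROM applogs WHERE dt = max_dt;
```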

-- Saumitra S. Shahapure

On Thu, Dec 15, 2016 at 6:21 PM, Elliot West  wrote:

> I notice that HPL/SQL is not mentioned on the page I referenced, however I
> expect that is another approach that you could use to modularise:
>
> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=59690156
> http://www.hplsql.org/doc
>
> On 15 December 2016 at 17:17, Elliot West  wrote:
>
>> Some options are covered here, although there is no definitive guidance
>> as far as I know:
>>
>> https://cwiki.apache.org/confluence/display/Hive/Unit+Testing+Hive+SQL#UnitTestingHiveSQL-Modularisation
>>
>> On 15 December 2016 at 17:08, Saumitra Shahapure <
>> saumitra.offic...@gmail.com> wrote:
>>
>>> Hello,
>>>
>>> We are running and maintaining a rather big and complex Hive SELECT
>>> query right now. It is basically a single SELECT query that joins the
>>> outputs of about ten other SELECT queries.
>>>
>>> The simplest way to refactor that we can think of is to break this query
>>> down into multiple views and then join the views. A similar possibility
>>> is to create intermediate tables.
>>>
>>> However, creating multiple DDLs in order to maintain a single DML is not
>>> very clean. We would end up polluting the metastore by creating views /
>>> intermediate tables which are used in just this one ETL.
>>>
>>> What other efficient ways are there to maintain complex SQL queries
>>> written in Hive? Are there better ways to break a Hive query into
>>> multiple modules?
>>>
>>> -- Saumitra S. Shahapure
>>>
>>
>>
>


Maintaining big and complex Hive queries

2016-12-15 Thread Saumitra Shahapure
Hello,

We are running and maintaining a rather big and complex Hive SELECT query
right now. It is basically a single SELECT query that joins the outputs of
about ten other SELECT queries.

The simplest way to refactor that we can think of is to break this query
down into multiple views and then join the views. A similar possibility is
to create intermediate tables.

However, creating multiple DDLs in order to maintain a single DML is not
very clean. We would end up polluting the metastore by creating views /
intermediate tables which are used in just this one ETL.

What other efficient ways are there to maintain complex SQL queries written
in Hive? Are there better ways to break a Hive query into multiple modules?
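
For illustration, the kind of views-based refactoring we have in mind looks
roughly like this (all table and column names are made up):

```sql
-- Hypothetical sketch of breaking one large SELECT into named views.
CREATE VIEW daily_spend AS
  SELECT user_id, SUM(amount) AS total FROM orders GROUP BY user_id;

CREATE VIEW active_users AS
  SELECT user_id, region FROM users WHERE active = 1;

-- The big SELECT then becomes a join of the named pieces.
SELECT a.region, SUM(d.total)
FROM active_users a
JOIN daily_spend d ON a.user_id = d.user_id
GROUP BY a.region;
```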

-- Saumitra S. Shahapure


java.lang.ArrayIndexOutOfBoundsException in getSplitHosts

2016-04-25 Thread Saumitra Shahapure
Hello,

I am using Hive 0.13.1 in EMR and trying to create a Hive table on top of
our custom file system (which is a thin wrapper on top of S3), and I am
getting an error while accessing the data in the table. Stack trace and
command history below.

I suspected that CombineFileInputFormat was accessing the splits
incorrectly, but HiveInputFormat causes the same problem. Has anyone seen
such a problem before? Note that the SerDe and FileSystem are both custom.
Could either of those be causing this problem?


hive> add jar /home/hadoop/logprocessing-pig-combined.jar;
> Added /home/hadoop/logprocessing-pig-combined.jar to class path
> Added resource: /home/hadoop/logprocessing-pig-combined.jar
> hive> Create external table nulf
> > (
> > tm STRING
> > )
> > ROW FORMAT SERDE 'logprocessing.nulf.basic.BasicHiveSerDe'
> > location 'cda://path/to/logs/';
> OK
> Time taken: 6.706 seconds
> hive> set hive.input.format= org.apache.hadoop.hive.ql.io.HiveInputFormat;
> hive> select count(*) from nulf;
> Total jobs = 1
> Launching Job 1 out of 1
> Number of reduce tasks determined at compile time: 1
> In order to change the average load for a reducer (in bytes):
>   set hive.exec.reducers.bytes.per.reducer=
> In order to limit the maximum number of reducers:
>   set hive.exec.reducers.max=
> In order to set a constant number of reducers:
>   set mapreduce.job.reduces=
> java.lang.ArrayIndexOutOfBoundsException: 1
> at
> org.apache.hadoop.mapred.FileInputFormat.getSplitHosts(FileInputFormat.java:529)
> at
> org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:320)
> at
> org.apache.hadoop.hive.ql.io.HiveInputFormat.addSplitsForGroup(HiveInputFormat.java:290)
> at
> org.apache.hadoop.hive.ql.io.HiveInputFormat.getSplits(HiveInputFormat.java:371)
> at
> org.apache.hadoop.mapreduce.JobSubmitter.writeOldSplits(JobSubmitter.java:520)
> at
> org.apache.hadoop.mapreduce.JobSubmitter.writeSplits(JobSubmitter.java:512)
> at
> org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:394)
> at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1285)
> at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1282)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:415)
> at
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
> at org.apache.hadoop.mapreduce.Job.submit(Job.java:1282)
> at org.apache.hadoop.mapred.JobClient$1.run(JobClient.java:562)
> at org.apache.hadoop.mapred.JobClient$1.run(JobClient.java:557)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:415)
> at
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
> at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:557)
> at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:548)
> at
> org.apache.hadoop.hive.ql.exec.mr.ExecDriver.execute(ExecDriver.java:420)
> at
> org.apache.hadoop.hive.ql.exec.mr.MapRedTask.execute(MapRedTask.java:136)
> at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:153)
> at
> org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:85)
> at org.apache.hadoop.hive.ql.Driver.launchTask(Driver.java:1503)
> at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:1270)
> at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1088)
> at org.apache.hadoop.hive.ql.Driver.run(Driver.java:911)
> at org.apache.hadoop.hive.ql.Driver.run(Driver.java:901)
> at org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:275)
> at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:227)
> at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:430)
> at org.apache.hadoop.hive.cli.CliDriver.executeDriver(CliDriver.java:803)
> at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:697)
> at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:636)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at org.apache.hadoop.util.RunJar.main(RunJar.java:212)
> Job Submission failed with exception
> 'java.lang.ArrayIndexOutOfBoundsException(1)'
> FAILED: Execution Error, return code 1 from
> org.apache.hadoop.hive.ql.exec.mr.MapRedTask


-- Saumitra S. Shahapure


Re: Spark performance for small queries

2015-01-23 Thread Saumitra Shahapure (Vizury)
Hi Gopal,

Thanks for the informative answer, but my question was about the difference
in processing between Spark SQL and Hive. Right now I am not trying to
optimize either. I totally agree that Hive can perform much better than the
number I got.

I was just wondering: even though both systems would generate quite similar
execution plans for this query, what exactly is making the difference? My
question is from the point of view of understanding both systems.

Answering your questions inline,

--
Regards,
Saumitra Shahapure

On Fri, Jan 23, 2015 at 5:01 AM, Gopal V  wrote:

> On 1/22/15, 3:03 AM, Saumitra Shahapure (Vizury) wrote:
>
>> We were comparing performance of some of our production hive queries
>> between Hive and Spark. We compared Hive(0.13)+hadoop (1.2.1) against both
>> Spark 0.9 and 1.1. We could see that the performance gains have been good
>> in Spark.
>>
>
> Is there any particular reason you are using an ancient & slow Hadoop-1.x
> version instead of a modern YARN 2.0 cluster?


The cluster I was experimenting on is a legacy cluster in our system. We
are already in the process of migrating everything from it to Hadoop 2.


>
>
>  We tried a very simple query,
>> select count(*) from T where col3=123
>> in both sparkSQL and Hive (with hive.map.aggr=true) and found that Spark
>> performance had been 2x better than Hive (120sec vs 60sec). Table T is
>> stored in S3 and contains 600MB single GZIP file.
>>
>
> Not sure if you understand that what you're doing is one of the worst
> cases for both the platforms.
>
> Using a big single gzip file is like a massive anti-pattern.
>
> I'm assuming what you want is fast SQL in Hive (since this is the hive
> list) along with all the other lead/lag functions there.
>
> You need a SQL oriented columnar format like ORC, mix with YARN and add
> Tez, that is going to be somewhere near 10-12 seconds.
>
> Oh, and that's a ball-park figure for a single node.
>

Agreed on that end as well. Smaller gzipped files or uncompressed files will
give better performance. This specific query is just a rather rare test
case that one of our jobs encounters sometimes.


>
> Cheers,
> Gopal
>


Re: Spark performance for small queries

2015-01-22 Thread Saumitra Shahapure (Vizury)
Hello,

We were comparing performance of some of our production hive queries
between Hive and Spark. We compared Hive(0.13)+hadoop (1.2.1) against both
Spark 0.9 and 1.1. We could see that the performance gains have been good
in Spark.

We tried a very simple query,
select count(*) from T where col3=123
in both Spark SQL and Hive (with hive.map.aggr=true) and found that Spark's
performance was 2x better than Hive's (120 sec for Hive vs 60 sec for
Spark). Table T is stored in S3 and consists of a single 600 MB GZIP file.

My question is: why is Spark faster than Hive here? In both cases, the file
will be downloaded, uncompressed, and the lines counted by a single
process. In Hive's case, the reducer will be an identity function since
hive.map.aggr is true.

Note that disk spills and network I/O are very low in Hive's case as well.


Simplest way to create partition hierarchies

2014-10-05 Thread Saumitra Shahapure (Vizury)
Hello,

We are using Hive 0.12. We have a large directory in S3 where we dump daily
logs of various application servers:
s3://logs/20141005/machine1/log.gz
s3://logs/20141005/machine2/log.gz
s3://logs/20141006/machine1/log.gz
s3://logs/20141006/machine2/log.gz

In Hive, we have mapped it to a table with date and machine name as
partition columns. We add new partitions for all machines every day
manually by:
ALTER TABLE applogs ADD PARTITION (machine = 8, dt = '20141006') LOCATION
xyz;

Now the problem is that we have such partitions for several years for
hundreds of machines, which is now clogging the Hive metastore.

Is there any way to simplify this?

Note that our current approach has these advantages:
1. Partition addition does not require launching any MapReduce job; it just
adds a mapping in the metastore.
2. Querying on a specific date range and a specific machine range is easily
possible. Queries on only one date and only one machine are extremely fast.
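
For example, with the applogs table above, the pattern we need to keep fast
is a query pruned on both partition columns:

```sql
-- Hits exactly one partition via metastore pruning;
-- no other dates or machines are scanned.
SELECT COUNT(*) FROM applogs WHERE dt = '20141006' AND machine = 8;
```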

We do not want to lose these advantages. Also, changing the directory
hierarchy is costly for us.

We already looked into dynamic partitioning
<https://cwiki.apache.org/confluence/display/Hive/DynamicPartitions> but it
launches a MapReduce job to create the data hierarchies. In our case, the
hierarchy is already created.
--
Regards,
Saumitra Shahapure


Re: HDFS file system size issue

2014-04-15 Thread Saumitra Shahapure
Hi Rahman,

These are a few lines from hadoop fsck / -blocks -files -locations:

/mnt/hadoop/hive/warehouse/user.db/table1/000255_0 44323326 bytes, 1
block(s):  OK
0. blk_-7919979022650423857_446500 len=44323326 repl=3 [ip1:50010,
ip2:50010, ip3:50010]

/mnt/hadoop/hive/warehouse/user.db/table1/000256_0 44566965 bytes, 1
block(s):  OK
0. blk_-576894812882540_446288 len=44566965 repl=3 [ip1:50010,
ip2:50010, ip4:50010]


Biswa may have guessed the replication factor from the fsck summary that I
posted earlier. I am posting it again for today's run:

Status: HEALTHY
 Total size:58143055251 B
 Total dirs:307
 Total files:   5093
 Total blocks (validated):  3903 (avg. block size 14897016 B)
 Minimally replicated blocks:   3903 (100.0 %)
 Over-replicated blocks:0 (0.0 %)
 Under-replicated blocks:   92 (2.357161 %)
 Mis-replicated blocks: 0 (0.0 %)
 Default replication factor:2
 Average block replication: 3.1401486
 Corrupt blocks:0
 Missing replicas:  92 (0.75065273 %)
 Number of data-nodes:  9
 Number of racks:   1
FSCK ended at Tue Apr 15 13:20:25 UTC 2014 in 655 milliseconds


The filesystem under path '/' is HEALTHY

I have not overridden dfs.datanode.du.reserved. It defaults to 0.

$ less $HADOOP_HOME/conf/hdfs-site.xml |grep -A3 'dfs.datanode.du.reserved'
$ less $HADOOP_HOME/src/hdfs/hdfs-default.xml |grep -A3
'dfs.datanode.du.reserved'
  <name>dfs.datanode.du.reserved</name>
  <value>0</value>
  <description>Reserved space in bytes per volume. Always leave this much
space free for non dfs use.</description>

Below is du -h on every node. FYI, my dfs.data.dir is /mnt/hadoop/dfs/data
and all hadoop/hive logs are dumped in /mnt/logs in various directories.
All machines have 400GB for /mnt.

$for i in `echo $dfs_slaves`; do  ssh $i 'du -sh /mnt/hadoop; du -sh
/mnt/hadoop/dfs/data; du -sh /mnt/logs;'; done


225G/mnt/hadoop
224G/mnt/hadoop/dfs/data
61M /mnt/logs

281G/mnt/hadoop
281G/mnt/hadoop/dfs/data
63M /mnt/logs

139G/mnt/hadoop
139G/mnt/hadoop/dfs/data
68M /mnt/logs

135G/mnt/hadoop
134G/mnt/hadoop/dfs/data
92M /mnt/logs

165G/mnt/hadoop
164G/mnt/hadoop/dfs/data
75M /mnt/logs

137G/mnt/hadoop
137G/mnt/hadoop/dfs/data
95M /mnt/logs

160G/mnt/hadoop
160G/mnt/hadoop/dfs/data
74M /mnt/logs

180G/mnt/hadoop
122G/mnt/hadoop/dfs/data
23M /mnt/logs

139G/mnt/hadoop
138G/mnt/hadoop/dfs/data
76M /mnt/logs



All these numbers are for today and may differ a bit from yesterday's.

Today hadoop dfs -dus reports 58 GB while the namenode is reporting DFS
Used as 1.46 TB.
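
A quick back-of-the-envelope check on the numbers quoted above (the GB
figures are read off the du output; this is only a sketch):

```python
# Per-node dfs.data.dir sizes from the `du -h` output above, in GB.
data_dir_gb = [224, 281, 139, 134, 164, 137, 160, 122, 138]
total_on_disk_gb = sum(data_dir_gb)  # what the datanodes actually hold

# Totals from the fsck summary above.
logical_gib = 58143055251 / 1024**3    # "Total size" converted to GiB
avg_repl = 3.1401486                   # "Average block replication"
expected_gib = logical_gib * avg_repl  # space HDFS should occupy with replicas

print(total_on_disk_gb, round(expected_gib))
```

So the data directories hold roughly 1.5 TB, which matches the namenode's
1.46 TB figure, while the fsck logical size times replication accounts for
only about 170 GB; that gap is exactly the discrepancy I cannot explain.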

Pardon me for making the mail messy with a lot of copy-pastes; I hope it's
still readable.

-- Saumitra S. Shahapure


On Tue, Apr 15, 2014 at 2:57 AM, Abdelrahman Shettia <
ashet...@hortonworks.com> wrote:

> Hi Biswa,
>
> Are you sure that the replication factor of the files are three? Please
> run a ‘hadoop fsck / -blocks -files -locations’ and see the replication
> factor for each file.  Also, Post the configuration of dfs.datanode.
> du.reserved and please check the real space presented by a
> DataNode by running ‘du -h’
>
> Thanks,
> Rahman
>
> On Apr 14, 2014, at 2:07 PM, Saumitra  wrote:
>
> Hello,
>
> Biswanath, looks like we have confusion in calculation, 1TB would be equal
> to 1024GB, not 114GB.
>
>
> Sandeep, I checked log directory size as well. Log directories are hardly
> in few GBs, I have configured log4j properties so that logs won’t be too
> large.
>
> In our slave machines, we have 450GB disk partition for hadoop logs and
> DFS. Over there logs directory is < 10GBs and rest space is occupied by
> DFS. 10GB partition is for /.
>
> Let me quote my confusion point once again:
>
>  Basically I wanted to point out discrepancy in name node status page and 
> hadoop
>>> dfs -dus. In my case, earlier one reports DFS usage as 1TB and later
>>> one reports it to be 35GB. What are the factors that can cause this
>>> difference? And why is just 35GB data causing DFS to hit its limits?
>>>
>>
>
> I am talking about name node status page on 50070 port. Here is the
> screenshot of my name node status page
>
> 
>
> As I understand, 'DFS used’ is the space taken by DFS, non-DFS used is
> spaces taken by non-DFS data like logs or other local files from users.
> Namenode shows that DFS used is ~1TB but hadoop dfs -dus shows it to be
> ~38GB.
>
>
>
> On 14-Apr-2014, at 12:33 pm, Sandeep Nemuri  wrote:
>
>  Please check your logs directory usage.
>
>
>
> On Mon, Apr 14, 2014 at 12:08 PM, Biswajit Nayak <
> biswajit.na...@inmobi.com> wrote:
>
>> Whats the replication factor you have? I believe it should be 3. hadoop
>> dus shows that disk usage without replication. While name node ui page
>> gives with replication.
>>
>> 38gb * 3 =114gb ~ 1TB
>>
>> ~Biswa
>> -oThe important thing is not to stop questioning o-
>>
>>
>> On Mon, Apr 14, 2014 at 9:38 AM, Saumitra wrote:
>>
>>> Hi Biswajeet,
>>>
>>> Non-dfs usage is ~100GB over the clu

Re: Handling hierarchical data in Hive

2014-03-25 Thread Saumitra Shahapure (Vizury)
Hi Nitin/Prasan,

Thanks for your replies, I appreciate your help :)

Clustering looks to be quite close to what we want. However, one main gap
is that we need to fire a Hive query to populate the clusters. In our case,
the clustered data is already there, so the computation in the Hive query
would be redundant. If

CREATE TABLE analyze (generated_by INT, other_representative_field INT)
PARTITIONED BY (dt STRING)
CLUSTERED BY (generated_by) INTO 100 BUCKETS;

just accepted the S3 directory hierarchy that we have (as explained in the
first mail), our problem would be solved.
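
For reference, populating such a bucketed table normally needs an insert
along these lines (a sketch; `analyze_staging` is a made-up source table),
which is exactly the redundant computation I mean:

```sql
-- Sketch only: bucketing is enforced while rewriting data that,
-- in our case, already sits clustered in S3.
SET hive.enforce.bucketing = true;
INSERT OVERWRITE TABLE analyze PARTITION (dt = '20140320')
SELECT generated_by, other_representative_field
FROM analyze_staging
WHERE dt = '20140320';
```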

Another interesting solution seems to be creating a partition on the dt
field and a Hive index/view on the *generated_by* field.

If anyone has insights around these, they would be really helpful.
Meanwhile we will try to solve our problem with buckets/indices.


--
Regards,
Saumitra Shahapure


On Tue, Mar 25, 2014 at 7:44 PM, Prasan Samtani wrote:

> Hi Saumitra,
>
> You might want to look into clustering within the partition. That is,
> partition by "day", but cluster by "generated by" (within those
> partitions), and see if that improves performance. Refer to the CLUSTER BY
> command in the Hive language Manual.
>
> -Prasan
>
>
> On Mar 25, 2014, at 4:26 AM, "Saumitra Shahapure (Vizury)" <
> saumitra.shahap...@vizury.com> wrote:
>
> Hi Nitin,
>
> We are not facing the small files problem since the data is in S3. Also we
> do not want to merge files: merging files, i.e. creating one large analyze
> table for, say, one day, would slow down queries fired on a specific day
> and *generated_by*.
>
> Let me explain my problem in other words.
> Right now we are over-partitioning our table. Over-partitioning gives us
> the benefit that a query on 1-2 partitions is very fast. Its side effect
> is that if we try to query a large number of partitions, the query is too
> slow. Is there a way to get good performance in both scenarios?
>
> --
> Regards,
> Saumitra Shahapure
>
>
> On Tue, Mar 25, 2014 at 4:25 PM, Nitin Pawar wrote:
>
>> see if this is what you are looking for
>> https://github.com/sskaje/hive_merge
>>
>>
>>
>>
>> On Tue, Mar 25, 2014 at 4:21 PM, Saumitra Shahapure (Vizury) <
>> saumitra.shahap...@vizury.com> wrote:
>>
>>> Hello,
>>>
>>> We are using Hive to query S3 data. For one of our tables, named
>>> analyze, we generate data hierarchically. The first level of the
>>> hierarchy is the date and the second level is a field named
>>> *generated_by*, e.g. for 20 March we may have S3 directories such as
>>> s3://analyze/20140320/111/
>>> s3://analyze/20140320/222/
>>> s3://analyze/20140320/333/
>>> The size of the files in each folder is typically small.
>>>
>>> Till now we have been using static partitioning so that queries on a
>>> specific date and *generated_by* would be faster.
>>>
>>> Now the problem is that the number of *generated_by* folders has grown
>>> into the thousands. Every day we end up adding thousands of partitions
>>> to Hive, so queries on analyze over one month are slowed down.
>>>
>>> Is there any way to get rid of the partitions and at the same time
>>> maintain good performance for queries fired on a specific day and
>>> *generated_by*?
>>> --
>>> Regards,
>>> Saumitra Shahapure
>>>
>>
>>
>>
>> --
>> Nitin Pawar
>>
>
>


Re: Handling hierarchical data in Hive

2014-03-25 Thread Saumitra Shahapure (Vizury)
Hi Nitin,

We are not facing the small files problem since the data is in S3. Also we
do not want to merge files: merging files, i.e. creating one large analyze
table for, say, one day, would slow down queries fired on a specific day
and *generated_by*.

Let me explain my problem in other words.
Right now we are over-partitioning our table. Over-partitioning gives us
the benefit that a query on 1-2 partitions is very fast. Its side effect is
that if we try to query a large number of partitions, the query is too
slow. Is there a way to get good performance in both scenarios?

--
Regards,
Saumitra Shahapure


On Tue, Mar 25, 2014 at 4:25 PM, Nitin Pawar wrote:

> see if this is what you are looking for
> https://github.com/sskaje/hive_merge
>
>
>
>
> On Tue, Mar 25, 2014 at 4:21 PM, Saumitra Shahapure (Vizury) <
> saumitra.shahap...@vizury.com> wrote:
>
>> Hello,
>>
>> We are using Hive to query S3 data. For one of our tables, named
>> analyze, we generate data hierarchically. The first level of the
>> hierarchy is the date and the second level is a field named
>> *generated_by*, e.g. for 20 March we may have S3 directories such as
>> s3://analyze/20140320/111/
>> s3://analyze/20140320/222/
>> s3://analyze/20140320/333/
>> The size of the files in each folder is typically small.
>>
>> Till now we have been using static partitioning so that queries on a
>> specific date and *generated_by* would be faster.
>>
>> Now the problem is that the number of *generated_by* folders has grown
>> into the thousands. Every day we end up adding thousands of partitions to
>> Hive, so queries on analyze over one month are slowed down.
>>
>> Is there any way to get rid of the partitions and at the same time
>> maintain good performance for queries fired on a specific day and
>> *generated_by*?
>> --
>> Regards,
>> Saumitra Shahapure
>>
>
>
>
> --
> Nitin Pawar
>


Handling hierarchical data in Hive

2014-03-25 Thread Saumitra Shahapure (Vizury)
Hello,

We are using Hive to query S3 data. For one of our tables, named analyze,
we generate data hierarchically. The first level of the hierarchy is the
date and the second level is a field named *generated_by*, e.g. for 20
March we may have S3 directories such as
s3://analyze/20140320/111/
s3://analyze/20140320/222/
s3://analyze/20140320/333/
The size of the files in each folder is typically small.

Till now we have been using static partitioning so that queries on a
specific date and *generated_by* would be faster.
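
The static partitions we add look roughly like this (the table and
partition-column names are a sketch inferred from the paths above):

```sql
-- Sketch: one ADD PARTITION per generated_by folder per day; names assumed.
ALTER TABLE analyze ADD PARTITION (dt = '20140320', generated_by = '111')
LOCATION 's3://analyze/20140320/111/';
```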

Now the problem is that the number of *generated_by* folders has grown into
the thousands. Every day we end up adding thousands of partitions to Hive,
so queries on analyze over one month are slowed down.

Is there any way to get rid of the partitions and at the same time maintain
good performance for queries fired on a specific day and *generated_by*?
--
Regards,
Saumitra Shahapure