Re: Hive on Spark - Mesos

2016-09-15 Thread John Omernik
Did you run it on Mesos? Your presentation doesn't mention Mesos at all...

John


On Thu, Sep 15, 2016 at 4:20 PM, Mich Talebzadeh <mich.talebza...@gmail.com>
wrote:

> Yes you can. Hive on Spark, meaning Hive using Spark as its execution
> engine, works fine. The combination that I managed to make work is any Hive
> version > 1.2 with Spark 1.3.1.
>
> You  need to build Spark from the source code excluding Hive libraries.
>
> Check my attached presentation.
>
>  HTH
>
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
> On 15 September 2016 at 22:10, John Omernik <j...@omernik.com> wrote:
>
>> Hey all, I was experimenting with some bleeding edge Hive.  (2.1) and
>> trying to get it to run on bleeding edge Spark (2.0).
>>
>> Spark is working fine, I can query the data all is setup, however, I
>> can't get Hive on Spark to work. I understand it's not really a thing (Hive
>> on Spark on Mesos) but I am thinking... why not?  Thus I am posting here.
>> (I.e. is there some reason why this shouldn't work other than it just
>> hasn't been attempted?)
>>
>> The error I am getting is odd.. (see below) not sure why that would pop
>> up, everything seems right other wise... any help would be appreciated.
>>
>> John
>>
>>
>>
>>
>> at java.lang.ClassLoader.defineClass1(Native Method)
>>
>> at java.lang.ClassLoader.defineClass(ClassLoader.java:760)
>>
>> at java.security.SecureClassLoader.defineClass(SecureClassLoade
>> r.java:142)
>>
>> at java.net.URLClassLoader.defineClass(URLClassLoader.java:467)
>>
>> at java.net.URLClassLoader.access$100(URLClassLoader.java:73)
>>
>> at java.net.URLClassLoader$1.run(URLClassLoader.java:368)
>>
>> at java.net.URLClassLoader$1.run(URLClassLoader.java:362)
>>
>> at java.security.AccessController.doPrivileged(Native Method)
>>
>> at java.net.URLClassLoader.findClass(URLClassLoader.java:361)
>>
>> at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
>>
>> at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
>>
>> at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
>>
>> at java.lang.Class.forName0(Native Method)
>>
>> at java.lang.Class.forName(Class.java:348)
>>
>> at org.apache.spark.util.Utils$.classForName(Utils.scala:225)
>>
>> at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy
>> $SparkSubmit$$runMain(SparkSubmit.scala:686)
>>
>> at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit
>> .scala:185)
>>
>> at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:210)
>>
>> at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:124)
>>
>> at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
>>
>> Caused by: java.lang.ClassNotFoundException:
>> org.apache.spark.JavaSparkListener
>>
>> at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
>>
>> at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
>>
>> at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
>>
>> at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
>>
>> ... 20 more
>>
>>
>> at org.apache.hive.spark.client.rpc.RpcServer.cancelClient(RpcS
>> erver.java:179)
>>
>> at org.apache.hive.spark.client.SparkClientImpl$3.run(SparkClie
>> ntImpl.java:465)
>>
>
>


Hive on Spark - Mesos

2016-09-15 Thread John Omernik
Hey all, I was experimenting with some bleeding-edge Hive (2.1) and
trying to get it to run on bleeding-edge Spark (2.0).

Spark is working fine and I can query the data, so everything is set up; however, I
can't get Hive on Spark to work. I understand it's not really a thing (Hive on
Spark on Mesos), but I am thinking... why not?  Thus I am posting here.
(I.e. is there some reason why this shouldn't work, other than that it just
hasn't been attempted?)

The error I am getting is odd (see below); I'm not sure why it would pop up,
as everything otherwise seems right... any help would be appreciated.

John




at java.lang.ClassLoader.defineClass1(Native Method)

at java.lang.ClassLoader.defineClass(ClassLoader.java:760)

at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)

at java.net.URLClassLoader.defineClass(URLClassLoader.java:467)

at java.net.URLClassLoader.access$100(URLClassLoader.java:73)

at java.net.URLClassLoader$1.run(URLClassLoader.java:368)

at java.net.URLClassLoader$1.run(URLClassLoader.java:362)

at java.security.AccessController.doPrivileged(Native Method)

at java.net.URLClassLoader.findClass(URLClassLoader.java:361)

at java.lang.ClassLoader.loadClass(ClassLoader.java:424)

at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)

at java.lang.ClassLoader.loadClass(ClassLoader.java:357)

at java.lang.Class.forName0(Native Method)

at java.lang.Class.forName(Class.java:348)

at org.apache.spark.util.Utils$.classForName(Utils.scala:225)

at
org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:686)

at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:185)

at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:210)

at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:124)

at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

Caused by: java.lang.ClassNotFoundException:
org.apache.spark.JavaSparkListener

at java.net.URLClassLoader.findClass(URLClassLoader.java:381)

at java.lang.ClassLoader.loadClass(ClassLoader.java:424)

at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)

at java.lang.ClassLoader.loadClass(ClassLoader.java:357)

... 20 more


at
org.apache.hive.spark.client.rpc.RpcServer.cancelClient(RpcServer.java:179)

at
org.apache.hive.spark.client.SparkClientImpl$3.run(SparkClientImpl.java:465)


Re: Re: [VOTE] Hive 2.0 release plan

2015-11-30 Thread John Omernik
Agreed, any plans for Hive 1.3?  Will Hive 2.0 be a breaking release for
those running 1.x?




On Sun, Nov 15, 2015 at 7:07 PM, Wangwenli  wrote:

> Good News,   *Any release plan for hive 1.3*  ???
>
> --
> Wangwenli
>
>
> *From:* Gopal Vijayaraghavan 
> *Date:* 2015-11-14 14:21
> *To:* d...@hive.apache.org
> *CC:* user@hive.apache.org
> *Subject:* Re: [VOTE] Hive 2.0 release plan
>
> (+user@)
>
> +1.
>
> Cheers,
> Gopal
>
> On 11/13/15, 5:54 PM, "Lefty Leverenz"  wrote:
>
> >The Hive bylaws require this to be submitted on the user@hive mailing
> list
> >(even though users don't get to vote).  See Release Plan in Actions
> >.
> >
> >-- Lefty
> ...
> >> > On Fri, Nov 13, 2015 at 1:38 PM, Sergey Shelukhin <
> >> ser...@hortonworks.com>
> >> > wrote:
> >> >
> >> >> Hi.
> >> >> With no strong objections on DISCUSS thread, some issues raised and
> >> >> addressed, and a reminder from Carl about the bylaws for the release
> >> >> process, I propose we release the first version of Hive 2 (2.0), and
> >> >> nominate myself as release manager.
> >> >> The goal is to have the first release of Hive with aggressive set of
> >>new
> >> >> features, some of which are ready to use and some are at experimental
> >> >> stage and will be developed in future Hive 2 releases, in line with
> >>the
> >> >> Hive-1-Hive-2 split discussion.
> >> >> If the vote passes, the timeline to create a branch should be around
> >>the
> >> >> end of next week (to minimize merging in the wake of the release),
> >>and
> >> the
> >> >> timeline to release would be around the end of November, depending on
> >> the
> >> >> issues found during the RC cutting process, as usual.
> >> >>
> >> >> Please vote:
> >> >> +1 proceed with the release plan
>> >> +-0 don't care
>> >> -1 don't proceed with the release plan, for such and such reasons
> >> >>
> >> >> The vote will run for 3 days.
> >> >>
> >> >>
> >>
>
>
>


Hive on Spark on Mesos

2015-09-09 Thread John Omernik
The docs for Hive on Spark appear to have instructions only for
YARN.  Will there be instructions for, or the ability to run, Hive on Spark with
Mesos implementations of Spark?  Is it possible now and just not
documented? What are the issues in running it this way?

John


Parquet Files in Hive - Settings

2015-08-18 Thread John Omernik
Is there a good writeup on the settings that can be tweaked in Hive as they
pertain to writing Parquet files? For example, on some obscure
pages I've found settings like parquet.compression,
parquet.dictionary.page.size and parquet.enable.dictionary, but they were
in reference to stock MapReduce jobs, not Hive, and thus I don't even
know what the defaults for these are when using Hive.  I tried doing hive
-e "set" | grep parquet\. but these settings aren't there.

Any documentation on what these are, what hive uses as defaults etc, and
how I can optimize my parquet writing with hive would be appreciated.
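
A minimal sketch of how such properties are typically passed, assuming a Hive version that honors parquet.compression as a table property (on older releases it is set per session for the MapReduce writers instead); the table and columns here are made up:

-- session-level override (one route)
SET parquet.compression=SNAPPY;

-- table-level override (another route); table and columns are hypothetical
CREATE TABLE events_parquet (id STRING, payload STRING)
  STORED AS PARQUET
  TBLPROPERTIES ("parquet.compression"="SNAPPY");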


Re: Hive 1.0 vs. 0.15

2015-02-09 Thread John Omernik
Can you point me to the blog, I didn't see that in the official onlist
announce email, and probably need to bookmark said blogs if they are
valuable.

Thanks!

On Mon, Feb 9, 2015 at 1:46 PM, DU DU will...@gmail.com wrote:
 as said in the blog, 0.15.0 maps to hive 1.1.0

 On Mon, Feb 9, 2015 at 2:41 PM, John Omernik j...@omernik.com wrote:

 Hey all, I was monitoring a specific JIRA for Hive
 (https://issues.apache.org/jira/browse/HIVE-7073) and saw that it was
 resolved in Hive 0.15.  But now with the Hive 1.0.0 release after Hive
 0.14, am confused how where the Binary column support for Parquet will
 be moved into mainline.   Thoughts?

 John




 --
 Thanks,
 Dayong


Re: Hive 1.0 vs. 0.15

2015-02-09 Thread John Omernik
Oh, is that an the hive blog? I guess, I was looking for that
information in the Apache release notes (I try to stay Vendor Agnostic
with the releases etc).

Thanks!


On Mon, Feb 9, 2015 at 2:11 PM, DU DU will...@gmail.com wrote:
 http://blog.cloudera.com/blog/2015/02/apache-hive-1-0-0-has-been-released/

 On Mon, Feb 9, 2015 at 2:52 PM, John Omernik j...@omernik.com wrote:

 Can you point me to the blog, I didn't see that in the official onlist
 announce email, and probably need to bookmark said blogs if they are
 valuable.

 Thanks!

 On Mon, Feb 9, 2015 at 1:46 PM, DU DU will...@gmail.com wrote:
  as said in the blog, 0.15.0 maps to hive 1.1.0
 
  On Mon, Feb 9, 2015 at 2:41 PM, John Omernik j...@omernik.com wrote:
 
  Hey all, I was monitoring a specific JIRA for Hive
  (https://issues.apache.org/jira/browse/HIVE-7073) and saw that it was
  resolved in Hive 0.15.  But now with the Hive 1.0.0 release after Hive
  0.14, am confused how where the Binary column support for Parquet will
  be moved into mainline.   Thoughts?
 
  John
 
 
 
 
  --
  Thanks,
  Dayong




 --
 Thanks,
 Dayong


Hive 1.0 vs. 0.15

2015-02-09 Thread John Omernik
Hey all, I was monitoring a specific JIRA for Hive
(https://issues.apache.org/jira/browse/HIVE-7073) and saw that it was
resolved in Hive 0.15.  But now, with the Hive 1.0.0 release after Hive
0.14, I am confused about where the binary column support for Parquet will
be moved into mainline.  Thoughts?

John


Files Per Partition Causing Slowness

2014-12-02 Thread John Omernik
I am running Hive 0.12 in production. I have a table that has 1100
partitions (flat, no multi-level partitions), and in those partitions some
have a small number of files (5-10) and others have quite a few files (up
to 120).  The total table size is not huge, around 285 GB.

While this is not terrible to my eyes, when I try to run a query on lots of
partitions, say all 1100, the time from query start to the time the query is
submitted to the jobtracker is horribly slow.  For example, it can take up
to 3.5 minutes just to get to the point where the job is seen in the job
tracker.
Is the number of files here what's hurting me? Is there some sort of
per-file enumeration going on under the hood in Hive?  I ran Hive with debug
mode on and saw lots of file calls for each individual file... I guess I am
curious, for others out there who may have similar tables, would a query
like that take a horribly long time for you as well? Is this normal, or am
I seeing issues here?
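
As an aside on keeping file counts down, a minimal sketch of the small-file merge settings Hive exposes for the jobs that write such a table; the property names are the stock ones from this era and the sizes are illustrative, not recommendations:

-- merge small output files at the end of map-only and map-reduce jobs
SET hive.merge.mapfiles=true;
SET hive.merge.mapredfiles=true;
-- target size of merged files, and the average output size below which a merge job is triggered
SET hive.merge.size.per.task=256000000;
SET hive.merge.smallfiles.avgsize=64000000;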


Fwd: Files Per Partition Causing Slowness

2014-12-02 Thread John Omernik
-- Forwarded message --
From: John Omernik j...@omernik.com
Date: Tue, Dec 2, 2014 at 1:58 PM
Subject: Re: Files Per Partition Causing Slowness
To: user@hive.apache.org


Thank you Edward, I knew the number of partitions mattered, but I
didn't think 1000 would be too much.  However, I didn't realize the
number of files per partition was also a factor prior to job submission.
I am looking at reducing some of those now too.

Out of curiosity, if I have a per-day partition for three years of
data, how would I set up bucketing to keep my partition count lower? I am
struggling to find a way to approach this problem.


Thanks!

On Tue, Dec 2, 2014 at 12:28 PM, John Omernik j...@omernik.com wrote:

 Thank you Edward, I knew the number of partitions mattered, and knew I was 
 getting high, however, I didn't realize the number of files per partition was 
 also a factor prior to job submission.

 Thanks!

 John

 On Tue, Dec 2, 2014 at 11:35 AM, Edward Capriolo edlinuxg...@gmail.com 
 wrote:

 This is discussed in the programming hive book. The more files the longer it 
 takes the job tracker to plan the job. The more tasks the more things the 
 job tracker has to track. The more partitions the more metastore lookups are 
 required. All of these things limit throughput. I do not like tables with 
 more than 100 partitions; above that I would switch to bucketing or some 
 other mechanism (application-level partitioning)

 On Tue, Dec 2, 2014 at 12:25 PM, John Omernik j...@omernik.com wrote:

 I am running Hive 0.12 in production, I have a table that ha 1100 
 partitions, (flat, no multi level partitions) and in those partitions some 
 have a small number of files (5- 10) and others have quite a few files (up 
 to 120).   The total table size is not huge around 285 GB.

 While this is not terrible to my eyes, when I try to run a query on lots of 
 partition say all 1100, the time from query start to the time the query is 
 submitted to the jobtracker is horribly slow.  For example, it can take up 
 to 3.5 minutes just to get to the point where the job is seen in the job 
 tracker.
 Is the number of files here what's hurting me? Is there some sort of per 
 file enumeration going on under the hood in Hive?  I ran Hive with debug 
 mode on and saw lots of file calls for each individual file... I guess I am 
 curious for others out there who may have similar tables, would a query 
 like that take a horribly long time for you as well? Is this normal or am 
 I seeing issues here?






Re: Running hive inside a bash script

2014-12-02 Thread John Omernik
That's not what I've found:

$ hive -e "show tables"
table1
table2

$ echo $?
0

$ hive -e "show partitions notable"
FAILED: SemanticException [Error 10001]: Table not found notable

$ echo $?
17


In a bash script:

hive -e "show partitions notable"
hiveresult=$?
if [ $hiveresult -ne 0 ]; then
    echo "We have a hive Error"
    echo "Hive Error Num $hiveresult"
    exit 1
else
    echo "Great, no Hive error, Moving on"
fi



On Tue, Dec 2, 2014 at 12:16 PM, Daniel Haviv
daniel.ha...@veracity-group.com wrote:
 Hi,
 I have a bash script that runs a hive query and I would like it to do
 something if the query succeeds and something else if it fails.
 My testing shows that a query failure does not change Hive's exit code;
 what's the right way to achieve this?

 Thanks,
 Daniel


Re: bug in hive

2014-09-22 Thread John Omernik
Shushant -

What I believe Stephen is sarcastically trying to say is that some
organizational education may be in order here. Hive itself is not even at
version 1.0; those of us who use Hive in production know this and have to
accept that there will be bugs like the one you are trying to address.
There MAY be a workaround that takes more hours and introduces other bugs
into your environment; alternatively, take the time to explain why moving
forward from Hive 0.10 to Hive 0.14 really is in the best interest of your
organization.  Perhaps there is a way you can do a proof of
concept using Hive 0.14, i.e. copy the metastore to another SQL server and
try moving the table data to another location, so you can prove out the
fix for your issue. Also, perhaps there is a way to test the current
workflows that work on 0.10 in 0.14, so you can show that this change really
is the right way to move.

Being at this level in an open source project has huge benefits, but
challenges as well. On one hand you can be much more nimble in your
environment because open source is fluid, but if you are trying to do this
within an environment that doesn't allow you to move like you need to, you may
be losing a long-term war while winning short-term battles.  I guess what
I am saying is similar to Stephen's point, but I highly recommend you work with the
team that sets the policy and develop a new way to address how Hive and
other similar projects live within your change management policies.  You
will benefit greatly in the long run.

John



On Sun, Sep 21, 2014 at 1:26 AM, Shushant Arora shushantaror...@gmail.com
wrote:

 Hi Stephen

 We have cloudera setup deployed in our cluster, which we cannot update due
 to orgs policy.
 Till the time its not updated to version 0.14, How can I achieve the
 locking feature please suggest.


 On Sun, Sep 21, 2014 at 10:40 AM, Stephen Sprague sprag...@gmail.com
 wrote:

 great policy. install open source software that's not even version 1.0
 into production and then not allow the ability to improve it (but of course
 reap all the rewards of its benefits.)  so instead of actually fixing the
 problem the right way introduce a super-hack work-around cuz, you know,
 that's much more stable.

 Gotta luv it.   Good luck.

 On Sat, Sep 20, 2014 at 8:00 AM, Shushant Arora 
 shushantaror...@gmail.com wrote:

 Hi Alan

 I have 0.10 version of hive deployed in my org's cluster, I cannot
 update that because of org's policy.
 How can I achieve exclusive lock functionality while inserting in
 dynamic partition on hive 0.10 ?
 Does calling hive scripts via some sort of java api with patched jar
 included will help ?
 Moreover hive does not release locks in 0.10 when hive session is killed
 . User has to explicitly unlock a table.
 Can i specify any sort of max expiry time while taking a lock.

 Thanks
 Shushant

 On Sat, Sep 20, 2014 at 8:11 PM, Alan Gates ga...@hortonworks.com
 wrote:

 Up until Hive 0.13 locks in Hive were really advisory only, since as
 you note any user can remove any other user's lock.  In Hive 0.13 a new
 type of locking was introduced, see
 https://cwiki.apache.org/confluence/display/Hive/Hive+Transactions#HiveTransactions-LockManager
 This new locking is automatic and ignores both LOCK and UNLOCK commands.
 Note that it is off by default; you have to configure Hive to use the new
 DbTxnManager to turn on this locking.  In 0.13 it still has the bug you
 describe as far as acquiring the wrong lock for dynamic partitioning, but I
 believe I've fixed that in 0.14.

 Alan.

   Shushant Arora shushantaror...@gmail.com
  September 20, 2014 at 5:39

 Hive version 0.9 and later has a bug



 While inserting in a hive table Hive takes an exclusive lock. But if
 table is partitioned , and insert is in dynamic partition , it will take
 shared lock on table but if all partitions are static then hive takes
 exclusive lock on partitions in which data is being inserted

 and shared lock on table.

 https://issues.apache.org/jira/browse/HIVE-3509


 1.What if I want to take exclusive lock on table while inserting in
 dynamic partition ?


 I tried to take explicit lock using :

 LOCK TABLE tablename EXCLUSIVE;


 But it made table to be disabled.

 I cannot even read from table anymore even is same session until I do

 unlock table tablename in another session;


 2. moreover whats lock level in hive , I mean any user can remove any
 other users lock. that too seems buggy.


 Thanks

 Shushant




 --
 Sent with Postbox http://www.getpostbox.com
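
A minimal sketch of the configuration Alan refers to, assuming the property names documented for the Hive 0.13 DbTxnManager (it is off by default; the compactor settings normally live on the metastore side):

SET hive.support.concurrency=true;
SET hive.txn.manager=org.apache.hadoop.hive.ql.lockmgr.DbTxnManager;
-- compaction, normally enabled in the metastore's hive-site.xml
SET hive.compactor.initiator.on=true;
SET hive.compactor.worker.threads=1;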


Weird Error on Inserting in Table [ORC, MESOS, HIVE]

2014-09-09 Thread John Omernik
I am doing a dynamic partition load in Hive 0.13 using ORC files. This has
always worked in the past both with MapReduce V1 and YARN. I am working
with Mesos now, and trying to trouble shoot this weird error:



Failed with exception AlreadyExistsException(message:Partition already
exists



What's odd is that my insert is an insert (without OVERWRITE), so it's like
two different reducers have data to go into the same partition, but then
there is a collision of some sort? Perhaps there is a situation where the
partition doesn't exist prior to the run, but when two reducers have data,
they both think they should be the one to create the partition? If a
partition already exists, shouldn't the reducer just copy its file into the
partition?  I am struggling to see why this would be an issue with Mesos,
but not on YARN or MRv1.


Any thoughts would be welcome.


John


Re: Weird Error on Inserting in Table [ORC, MESOS, HIVE]

2014-09-09 Thread John Omernik
I ran with debug logging, and this is interesting: there was a loss of
connection to the metastore client RIGHT before the partition mentioned
above, as data was being moved around... I wonder if the timing
on that is bad?

14/09/09 12:47:37 [main]: INFO exec.MoveTask: Partition is: {day=null,
source=null}

14/09/09 12:47:38 [main]: INFO metadata.Hive: Renaming
src:maprfs:/user/hive/scratch/hive-mapr/hive_2014-09-09_12-38-30_860_3555291990145206535-1/-ext-1/day=2012-11-30/source=20121119_SWAirlines_Spam/04_0;dest:
maprfs:/user/hive/warehouse/intel_flow.db/pcaps/day=2012-11-30/source=20121119_SWAirlines_Spam/04_0;Status:true

14/09/09 12:48:02 [main]: WARN metastore.RetryingMetaStoreClient:
MetaStoreClient lost connection. Attempting to reconnect.

org.apache.thrift.transport.TTransportException:
java.net.SocketTimeoutException: Read timed out

at
org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:129)




On Tue, Sep 9, 2014 at 11:02 AM, John Omernik j...@omernik.com wrote:

 I am doing a dynamic partition load in Hive 0.13 using ORC files. This has
 always worked in the past both with MapReduce V1 and YARN. I am working
 with Mesos now, and trying to trouble shoot this weird error:



 Failed with exception AlreadyExistsException(message:Partition already
 exists



 What's odd is is my insert is an insert (without Overwrite) so it's like
 two different reducers have data to go into the same partition, but then
 there is a collision of some sort? Perhaps there is a situation where the
 partition doesn't exist prior to the run, but when two reducers have data,
 they both think they should be the one to create the partition? Shouldn't
 if a partition already exists, the reducer just copies it's file into the
 partition?  I am struggling to see why this would be an issue with Mesos,
 but not on Yarn, or MRv1.


 Any thoughts would be welcome.


 John



Re: Weird Error on Inserting in Table [ORC, MESOS, HIVE]

2014-09-09 Thread John Omernik
Well, here I am talking to myself, but in case someone else runs across
this: I changed the Hive metastore connect timeout to 600 seconds (per the
JIRA below, for Hive 0.14) and now my problem has gone away. It looks like
the timeout was causing some craziness.

https://issues.apache.org/jira/browse/HIVE-7140
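
For reference, a minimal sketch of the change described above, assuming it is applied per session (it can equally go in hive-site.xml); in Hive 0.13 the value is interpreted as seconds:

-- raise the metastore client socket timeout to 600 seconds
SET hive.metastore.client.socket.timeout=600;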



On Tue, Sep 9, 2014 at 1:00 PM, John Omernik j...@omernik.com wrote:

 I ran with debug logging, and this is interesting, there was a loss of
 connection to the metastore client RIGHT before the partition mention
 above... as data was looking to be moved around... I wonder if the timing
 on that is bad?

 14/09/09 12:47:37 [main]: INFO exec.MoveTask: Partition is: {day=null,
 source=null}

 14/09/09 12:47:38 [main]: INFO metadata.Hive: Renaming
 src:maprfs:/user/hive/scratch/hive-mapr/hive_2014-09-09_12-38-30_860_3555291990145206535-1/-ext-1/day=2012-11-30/source=20121119_SWAirlines_Spam/04_0;dest:
 maprfs:/user/hive/warehouse/intel_flow.db/pcaps/day=2012-11-30/source=20121119_SWAirlines_Spam/04_0;Status:true

 14/09/09 12:48:02 [main]: WARN metastore.RetryingMetaStoreClient:
 MetaStoreClient lost connection. Attempting to reconnect.

 org.apache.thrift.transport.TTransportException:
 java.net.SocketTimeoutException: Read timed out

 at
 org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:129)




 On Tue, Sep 9, 2014 at 11:02 AM, John Omernik j...@omernik.com wrote:

 I am doing a dynamic partition load in Hive 0.13 using ORC files. This
 has always worked in the past both with MapReduce V1 and YARN. I am working
 with Mesos now, and trying to trouble shoot this weird error:



 Failed with exception AlreadyExistsException(message:Partition already
 exists



 What's odd is is my insert is an insert (without Overwrite) so it's like
 two different reducers have data to go into the same partition, but then
 there is a collision of some sort? Perhaps there is a situation where the
 partition doesn't exist prior to the run, but when two reducers have data,
 they both think they should be the one to create the partition? Shouldn't
 if a partition already exists, the reducer just copies it's file into the
 partition?  I am struggling to see why this would be an issue with Mesos,
 but not on Yarn, or MRv1.


 Any thoughts would be welcome.


 John





Re: ORC file in Hive 0.13 throws Java heap space error

2014-05-16 Thread John Omernik
When I created the table, I had to reduce the orc.compress.size quite a bit
to make my table with many columns work. This was on Hive 0.12 (I thought
it was supposed to be fixed in Hive 0.13, but 3k+ columns is huge).  The
default orc.compress.size is quite a bit larger (I think in the 268k
range). Try moving it smaller and smaller if that level doesn't work.
Good luck.

STORED AS ORC TBLPROPERTIES ("orc.compress.size"="8192");


On Thu, May 15, 2014 at 8:11 PM, Premal Shah premal.j.s...@gmail.comwrote:

 I have a table in hive stored as text file with 3283 columns. All columns
 are of string data type.

 I'm trying to convert that table into an orc file table using this command
 *create table orc_table stored as orc as select * from text_table;*

 This is the setting under mapred-site.xml

 <property>
   <name>mapred.child.java.opts</name>
   <value>-Xmx4G -XX:+UseConcMarkSweepGC -XX:+CMSIncrementalMode
   -verbose:gc -Xloggc:/mnt/hadoop/@taskid@.gc</value>
   <final>true</final>
 </property>

 The tasks die with this error

 2014-05-16 00:53:42,424 FATAL org.apache.hadoop.mapred.Child: Error running 
 child : java.lang.OutOfMemoryError: Java heap space
  at java.nio.HeapByteBuffer.<init>(HeapByteBuffer.java:39)
   at java.nio.ByteBuffer.allocate(ByteBuffer.java:312)
   at 
 org.apache.hadoop.hive.ql.io.orc.OutStream.getNewOutputBuffer(OutStream.java:117)
   at org.apache.hadoop.hive.ql.io.orc.OutStream.spill(OutStream.java:168)
   at org.apache.hadoop.hive.ql.io.orc.OutStream.flush(OutStream.java:239)
   at 
 org.apache.hadoop.hive.ql.io.orc.RunLengthByteWriter.flush(RunLengthByteWriter.java:58)
   at 
 org.apache.hadoop.hive.ql.io.orc.BitFieldWriter.flush(BitFieldWriter.java:44)
   at 
 org.apache.hadoop.hive.ql.io.orc.WriterImpl$TreeWriter.writeStripe(WriterImpl.java:553)
   at 
 org.apache.hadoop.hive.ql.io.orc.WriterImpl$StringTreeWriter.writeStripe(WriterImpl.java:1012)
   at 
 org.apache.hadoop.hive.ql.io.orc.WriterImpl$ListTreeWriter.writeStripe(WriterImpl.java:1455)
   at 
 org.apache.hadoop.hive.ql.io.orc.WriterImpl$StructTreeWriter.writeStripe(WriterImpl.java:1400)
   at 
 org.apache.hadoop.hive.ql.io.orc.WriterImpl.flushStripe(WriterImpl.java:1780)
   at 
 org.apache.hadoop.hive.ql.io.orc.WriterImpl.checkMemory(WriterImpl.java:221)
   at 
 org.apache.hadoop.hive.ql.io.orc.MemoryManager.notifyWriters(MemoryManager.java:168)
   at 
 org.apache.hadoop.hive.ql.io.orc.MemoryManager.addedRow(MemoryManager.java:157)
   at 
 org.apache.hadoop.hive.ql.io.orc.WriterImpl.addRow(WriterImpl.java:2028)
   at 
 org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat$OrcRecordWriter.write(OrcOutputFormat.java:86)
   at 
 org.apache.hadoop.hive.ql.exec.FileSinkOperator.processOp(FileSinkOperator.java:622)
   at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:793)
   at 
 org.apache.hadoop.hive.ql.exec.SelectOperator.processOp(SelectOperator.java:87)
   at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:793)
   at 
 org.apache.hadoop.hive.ql.exec.TableScanOperator.processOp(TableScanOperator.java:92)
   at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:793)
   at 
 org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:540)
   at org.apache.hadoop.hive.ql.exec.mr.ExecMapper.map(ExecMapper.java:177)
   at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
   at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:430)
   at org.apache.hadoop.mapred.MapTask.run(MapTask.java:366)
   at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
   at java.security.AccessController.doPrivileged(Native Method)
   at javax.security.auth.Subject.doAs(Subject.java:396)
   at 
 org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1190)


 This is the GC output for a task that ran out of memory

 0.690: [GC 17024K-768K(83008K), 0.0019170 secs]
 0.842: [GC 8488K(83008K), 0.0066800 secs]
 1.031: [GC 17792K-1481K(83008K), 0.0015400 secs]
 1.352: [GC 17142K(83008K), 0.0041840 secs]
 1.371: [GC 18505K-2249K(83008K), 0.0097240 secs]
 34.779: [GC 28384K(4177280K), 0.0014050 secs]


 Anything I can tweak to make it work?

 --
 Regards,
 Premal Shah.



Count(distinct col) in Windowing

2014-05-16 Thread John Omernik
Is there a reason why I can't use

select col1, col2, count(distinct col3) over (PARTITION by col4 order by
col5 ROWS BETWEEN 5 PRECEDING AND FOLLOWING) as col1 from table

?

I am trying to see, for any given window, if there is a lot of variability in
col4, and it just doesn't work with count distinct; count(1) works fine,
but that's not the data I am looking for.


Re: Hive 0.12 ORC Heap Issues on Write

2014-04-28 Thread John Omernik
Prasanth -

This is easily the best and most complete explanation I've received to any
online posted question ever.  I know that sounds like an overstatement,
but this answer is awesome.  :)  I really appreciate your insight on this.
My only follow-up is asking how the memory.pool percentage plays a role in
my success vs. failure. I.e. in my data, when I got down to 16k but had the
default memory pool of .50, it failed; when I scaled that back to .25, it
was successful at 16k.  Thoughts?

Thanks again for your research on this.




On Sun, Apr 27, 2014 at 11:07 PM, Prasanth Jayachandran 
pjayachand...@hortonworks.com wrote:

 Hi John

 I prepared a presentation earlier that explains the impact of changing
 compression buffer size on the overall size of ORC file. It should help you
 understand all the questions that you had.

 In Hive 0.13, a new optimization is added that should avoid this OOM
 issue. https://issues.apache.org/jira/browse/HIVE-6455
 Unfortunately, hive 0.12 does not support this optimization. Hence
 reducing the compression size is the only option. As you can see from the
 PPT, reducing the compression buffer size does not have significant impact
 in file size or query execution time.



 Thanks
 Prasanth Jayachandran

 On Apr 27, 2014, at 3:06 PM, John Omernik j...@omernik.com wrote:

 So one more follow-up:

 The 16-.25-Success turns to a fail if I throw more data (and hence more
 partitions) at the problem. Could there be some sort of issue that rears
 it's head based on the number of output dynamic partitions?

 Thanks all!




 On Sun, Apr 27, 2014 at 3:33 PM, John Omernik j...@omernik.com wrote:

 Here is some testing, I focused on two variables (Not really
 understanding what they do)
 orc.compress.size (256k by default)
 hive.exec.orc.memory.pool (0.50 by default).

 The job I am running is a admittedly complex job running through a Python
 Transform script.  However, as noted above, RCFile writes have NO issues.
 Another point... the results of this job end up being is LOTs of Dynamic
 partitions.  I am not sure if that plays a role here, or could help in
 troubleshooting.

 So for these two I ran a bunch of tests, the results are in the format
 (compress.size in k-memory.pool-Success/fail)
 256-0.50-Fail
 128-0.50-Fail
64-0.50-Fail
32-0.50-Fail
16-0.50-Fail
16-0.25-Success
32-0.25-Fail
16-0.35-Success
16-0.45-Success


 So after doing this I have questions:
 1. On the memory.pool what is happening when I change this? Is this
 affecting the written files on subsequent reads?
 2. Does the hive memory pool change the speed of things? (I'll take
 slower speed if it works)
 3. On the compress.size, do I hurt subsequent reads with the smaller
 compress size?
 4. These two variables, changed by themselves do not fix the problem, but
 together they seem to... lucky? Or are they related?
 5. Is there a better approach I can take on this?
 6. Any other variables I could look at?









 On Sun, Apr 27, 2014 at 11:56 AM, John Omernik j...@omernik.com wrote:

 Hello all,

 I am working with Hive 0.12 right now on YARN.  When I am writing a
 table that is admittedly quite wide (there are lots of columns, near 60,
 including one binary field that can get quite large).   Some tasks will
 fail on ORC file write with Java Heap Space Issues.

 I have confirmed that using RCFiles on the same data produces no
 failures.

 This led me down the path of experimenting with the table properties.
 Obviously, living on the cutting edge here makes it so there is not tons of
 documentation on what these settings do, I have lots of slide shows showing
 me the settings that be used to tune ORC, but not what they do, or what the
 ramifications may be.

 For example, I've gone ahead and reduced the orc.compress.size to 64k
 This seems to address lots of the failures, (all other things being
 unchanged). But what does that mean for me in the long run? Larger files?
  More files?  How is this negatively affecting me from a file perspective?

 In addition, would this be a good time to try SNAPPY over ZLIB as my
 default compression? I tried to find some direct memory comparisons but
 didn't see anything.

 So, give my data and the issues on write for my wide table, how would
 you recommend I address this? Is the compress.size the way to go?  What are
 the long term affects of this?  Any thoughts would be welcome

Hive 0.12 ORC Heap Issues on Write

2014-04-27 Thread John Omernik
Hello all,

I am working with Hive 0.12 right now on YARN.  I am writing a table
that is admittedly quite wide (there are lots of columns, near 60,
including one binary field that can get quite large), and some tasks will
fail on the ORC file write with Java heap space issues.

I have confirmed that using RCFiles on the same data produces no failures.

This led me down the path of experimenting with the table properties.
Obviously, living on the cutting edge here means there is not a ton of
documentation on what these settings do; I have lots of slide shows showing
me the settings that can be used to tune ORC, but not what they do or what the
ramifications may be.

For example, I've gone ahead and reduced the orc.compress.size to 64k. This
seems to address lots of the failures (all other things being unchanged).
But what does that mean for me in the long run? Larger files?  More files?
How is this negatively affecting me from a file perspective?

In addition, would this be a good time to try SNAPPY over ZLIB as my
default compression? I tried to find some direct memory comparisons but
didn't see anything.

So, given my data and the issues on write for my wide table, how would you
recommend I address this? Is the compress.size the way to go?  What are the
long-term effects of this?  Any thoughts would be welcome.

Thanks!

John


Re: Hive 0.12 ORC Heap Issues on Write

2014-04-27 Thread John Omernik
Here is some testing, I focused on two variables (Not really understanding
what they do)
orc.compress.size (256k by default)
hive.exec.orc.memory.pool (0.50 by default).

The job I am running is an admittedly complex job running through a Python
Transform script.  However, as noted above, RCFile writes have NO issues.
Another point... the results of this job end up being LOTs of dynamic
partitions.  I am not sure if that plays a role here, or could help in
troubleshooting.

So for these two I ran a bunch of tests, the results are in the format
(compress.size in k-memory.pool-Success/fail)
256-0.50-Fail
128-0.50-Fail
   64-0.50-Fail
   32-0.50-Fail
   16-0.50-Fail
   16-0.25-Success
   32-0.25-Fail
   16-0.35-Success
   16-0.45-Success


So after doing this I have questions:
1. On the memory.pool what is happening when I change this? Is this
affecting the written files on subsequent reads?
2. Does the hive memory pool change the speed of things? (I'll take slower
speed if it works)
3. On the compress.size, do I hurt subsequent reads with the smaller
compress size?
4. These two variables, changed by themselves do not fix the problem, but
together they seem to... lucky? Or are they related?
5. Is there a better approach I can take on this?
6. Any other variables I could look at?
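
A minimal sketch of the two knobs exercised above, assuming a session-level override for the memory pool and a table-level property for the compression buffer; the table definition is made up and the values mirror the passing 16k / 0.25 combination:

-- fraction of the JVM heap the ORC writers may use
SET hive.exec.orc.memory.pool=0.25;

-- per-table compression buffer size in bytes (hypothetical wide table)
CREATE TABLE wide_orc (id STRING, payload BINARY)
  PARTITIONED BY (day STRING)
  STORED AS ORC
  TBLPROPERTIES ("orc.compress.size"="16384");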









On Sun, Apr 27, 2014 at 11:56 AM, John Omernik j...@omernik.com wrote:

 Hello all,

 I am working with Hive 0.12 right now on YARN.  When I am writing a table
 that is admittedly quite wide (there are lots of columns, near 60,
 including one binary field that can get quite large).   Some tasks will
 fail on ORC file write with Java Heap Space Issues.

 I have confirmed that using RCFiles on the same data produces no failures.

 This led me down the path of experimenting with the table properties.
 Obviously, living on the cutting edge here makes it so there is not tons of
 documentation on what these settings do, I have lots of slide shows showing
 me the settings that be used to tune ORC, but not what they do, or what the
 ramifications may be.

 For example, I've gone ahead and reduced the orc.compress.size to 64k This
 seems to address lots of the failures, (all other things being unchanged).
 But what does that mean for me in the long run? Larger files?  More files?
  How is this negatively affecting me from a file perspective?

 In addition, would this be a good time to try SNAPPY over ZLIB as my
 default compression? I tried to find some direct memory comparisons but
 didn't see anything.

 So, give my data and the issues on write for my wide table, how would you
 recommend I address this? Is the compress.size the way to go?  What are the
 long term affects of this?  Any thoughts would be welcome.

 Thanks!

 John



Re: Hive 0.12 ORC Heap Issues on Write

2014-04-27 Thread John Omernik
So one more follow-up:

The 16-0.25-Success turns to a fail if I throw more data (and hence more
partitions) at the problem. Could there be some sort of issue that rears
its head based on the number of output dynamic partitions?

Thanks all!




On Sun, Apr 27, 2014 at 3:33 PM, John Omernik j...@omernik.com wrote:

 Here is some testing, I focused on two variables (Not really understanding
 what they do)
 orc.compress.size (256k by default)
 hive.exec.orc.memory.pool (0.50 by default).

 The job I am running is a admittedly complex job running through a Python
 Transform script.  However, as noted above, RCFile writes have NO issues.
 Another point... the results of this job end up being is LOTs of Dynamic
 partitions.  I am not sure if that plays a role here, or could help in
 troubleshooting.

 So for these two I ran a bunch of tests, the results are in the format
 (compress.size in k-memory.pool-Success/fail)
 256-0.50-Fail
 128-0.50-Fail
64-0.50-Fail
32-0.50-Fail
16-0.50-Fail
16-0.25-Success
32-0.25-Fail
16-0.35-Success
16-0.45-Success


 So after doing this I have questions:
 1. On the memory.pool what is happening when I change this? Is this
 affecting the written files on subsequent reads?
 2. Does the hive memory pool change the speed of things? (I'll take slower
 speed if it works)
 3. On the compress.size, do I hurt subsequent reads with the smaller
 compress size?
 4. These two variables, changed by themselves do not fix the problem, but
 together they seem to... lucky? Or are they related?
 5. Is there a better approach I can take on this?
 6. Any other variables I could look at?









 On Sun, Apr 27, 2014 at 11:56 AM, John Omernik j...@omernik.com wrote:

 Hello all,

 I am working with Hive 0.12 right now on YARN.  When I am writing a table
 that is admittedly quite wide (there are lots of columns, near 60,
 including one binary field that can get quite large).   Some tasks will
 fail on ORC file write with Java Heap Space Issues.

 I have confirmed that using RCFiles on the same data produces no
 failures.

 This led me down the path of experimenting with the table properties.
 Obviously, living on the cutting edge here makes it so there is not tons of
 documentation on what these settings do, I have lots of slide shows showing
 me the settings that be used to tune ORC, but not what they do, or what the
 ramifications may be.

 For example, I've gone ahead and reduced the orc.compress.size to 64k
 This seems to address lots of the failures, (all other things being
 unchanged). But what does that mean for me in the long run? Larger files?
  More files?  How is this negatively affecting me from a file perspective?

 In addition, would this be a good time to try SNAPPY over ZLIB as my
 default compression? I tried to find some direct memory comparisons but
 didn't see anything.

 So, give my data and the issues on write for my wide table, how would you
 recommend I address this? Is the compress.size the way to go?  What are the
 long term affects of this?  Any thoughts would be welcome.

 Thanks!

 John





Re: What are all the factors that go into the number of mappers - ORC

2014-02-03 Thread John Omernik
No, the size is closer to 10 GB; the difference between the tables is only
around 2000 bytes.  I will try to get exact numbers for you soon; I am
traveling right now, but I'll get you better data to work with shortly.

Thanks!



On Mon, Feb 3, 2014 at 12:22 AM, Prasanth Jayachandran 
pjayachand...@hortonworks.com wrote:

 Hi John

 Number of mappers is equal to the number of splits generated. Following
 are the factors that go into split generation
 1) HDFS block size
 2) Max split size

 a split is cut when
 1) the cumulative size of all adjacent stripes are greater than HDFS block
 size
 2) the cumulative size of all adjacent stripes are greater than max split
 size

 HDFS block size for ORC files will be min(1.5GB, 2*stripe_size) in the
 current version of hive (and probably hive 0.12 too). In older versions,
 HDFS block size = min(2GB, 2*stripe_size).

 The other important thing to note is ORC split is generated only when
 HiveInputFormat is used. By default hive uses CombineHiveInputFormat which
 uses a different strategy to generate splits. In CombineHiveInputFormat,
 many small files are combined together to form a large logical split.

 In any case for the size you had mentioned (2000 bytes) there should be
 only one mapper. Can you provide the value for following configs so that we
 can understand it better?

 1) hive.input.format
 2) hive.min.split.size
 3) hive.max.split.size
 4) total size on disk for the table

 Thanks
 Prasanth Jayachandran

 On Feb 2, 2014, at 5:25 PM, John Omernik j...@omernik.com wrote:

  I have two clusters, but small dev clusters, and I loaded the same
 dataset into both of them.   The data size on disk is within 2000 Bytes.
 Both are ORC, one is Hive 11 and one is Hive 12.  One is allocating about 8
 more mappers to the exact same query. I am just curious what settings would
 change that. I checked through all my setting, but can't see what would
 cause the discrepancy. Is this an ORC v11 vs v12 thing?
 
  I'd be curious on the thoughts of the group.
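
A minimal sketch of how the split-related settings Prasanth asks about can be inspected or overridden per session; the class name is the stock HiveInputFormat and the split size is illustrative only:

-- show the current values
SET hive.input.format;
SET mapred.max.split.size;

-- switch from the default CombineHiveInputFormat to per-file splits
SET hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat;
SET mapred.max.split.size=268435456;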





What are all the factors that go into the number of mappers - ORC

2014-02-02 Thread John Omernik
I have two clusters, albeit small dev clusters, and I loaded the same dataset
into both of them.  The data size on disk is within 2000 bytes. Both are
ORC; one is Hive 11 and one is Hive 12.  One is allocating about 8 more
mappers to the exact same query. I am just curious what settings would
change that. I checked through all my settings, but can't see what would
cause the discrepancy. Is this an ORC v11 vs. v12 thing?

I'd be curious on the thoughts of the group.


Survey: What do you use to interface with Hive?

2013-12-04 Thread John Omernik
This can be an interesting subject; I know orgs that are all over on this
question. I'd be interested in hearing what you use, how it works for you,
and what you wish you had in your interface.

I'll start:

We've used a number of things:
- CLI for scheduled jobs.  Pros: Solid running, fairly bug free. Cons: not
for analysis of data, clunky in that regard.

- SQL Squirrel via JDBC: Pros: Supported platform. Some nice analysis
features (keeping old results, sorting of results once obtained, keeping of
old queries). Cons: Buggy with Hive; sometimes it just crashes for no reason,
can be frustrating with lots of tabs, and is hard to extend and add little
features for how you work (from my perspective).

- Custom web-based tools: Pros: designed around how we interact with our
data. Cons: no support; it currently has memory leak issues, etc.

- Apache Hue/Beeswax: Just starting to look into this now.

I'd be curious on what you are using and challenges/wins you've had.

Thanks!

John


Re: Table creation for logfile data

2013-11-24 Thread John Omernik
Put the logfile into a location on HDFS, and create an external table
pointing to that location. The external table should just have one column,
a string:
CREATE EXTERNAL TABLE logfile_etl (message STRING) LOCATION '/etl/logfile';

I think that should work.

Then Create another table

CREATE TABLE logfile (ts STRING, ADD STRING, files STRING) PARTITIONED BY
(DAY STRING) STORED AS ORC;

copy files into /etl/logfile

run this hive file:

SET hive.exec.dynamic.partition=true;

SET hive.exec.dynamic.partition.mode=nonstrict;

SET hive.exec.max.dynamic.partitions=1;

SET hive.exec.max.dynamic.partitions.pernode=1000;

INSERT INTO TABLE logfile PARTITION (day)
select substring(message, 0, 17) as ts,
regexp_extract(message, '\\[([^\\]]+)\\]') as ADD,
regexp_extract(message, '\\] \\[([^\\]]+)\\]') as files,
concat('20', substring(message, 0, 8)) as day
from logfile_etl;

delete the files in /etl/logfile (or move them to an archive)

That will get you a day-partitioned ORC file table (I added the 20 in front of your date
so that string sorts well, although it probably would without it; it's
early, and I have not had coffee yet), with compression and
ORC goodness. The regexes are a little messy, but based on your one line of
data, they should work. Also: if you have data from pre-2000, obviously the
concat('20' thing needs to be updated.  Note, I didn't use a regex on the
date... why? It appears to be properly padded data, therefore a substring
is fast. This type of stuff has so many ways to skin a cat, so your way may
be totally different from my way, but this is how I'd approach it long
term. (If it's a one-time thing, I may not create the managed table,
but if so, having partitions and ORC files will make things faster.)  If
there are syntax errors I apologize; see earlier disclaimer about lack of
proper bean-sourced stimulants.





On Sat, Nov 23, 2013 at 7:36 AM, Baahu bahub...@gmail.com wrote:

 Hi,
 I have a messy log file which I want to use to create a table, I am only
 interested to retrieve 3 columns (time,ADD,files),which are in bold.
 Sample entry from log file
 *: 13-11-23 06:23:45 [ADD] [file1.zip|file2.zip] *  junkjunk|2013-11-23
 06:23:44:592 EST|file3.zip xyz|2013-11-23 06:23:44:592 EST|file3.zip

 Can you please let me know how I should go about, regex seems to be way
 out,but I am struggling with that as well !!

 Thanks,
 Baahu




Re: Table creation for logfile data

2013-11-24 Thread John Omernik
I wouldn't worry about efficiency too much:

concat('20', split(date_field, '\\/')[2], '-', split(date_field, '\\/')[1],
'-', split(date_field, '\\/')[0]) as proper_date -- yyyy-MM-dd




On Sun, Nov 24, 2013 at 12:13 PM, Baahu bahub...@gmail.com wrote:

 Hi John,
 Thanks for the reply,I have been given a new format of data and now the
 logs aren't as messy as they were earlier, but yes your mail gave me
 pointers which helped me is handling the new data.

 Now I am stuck while handling a format of date: I am getting the date in the
 form 22/11/13, which is dd/mm/yy, and I have to rearrange this to yyyy/mm/dd;
 can you please shed some light on this. I think we need to use split() to
 get the tokens and then rearrange, but I am not able to think of an
 efficient way to do this.

 Thanks.


 On Sun, Nov 24, 2013 at 5:25 PM, John Omernik j...@omernik.com wrote:

 Put the logile into a location on HDFS, and create an external table
 pointing to that location. The External table should just have one column,
 a string,
 CREATE EXTERNAL TABLE logfile_etl (message STRING) LOCATION
 '/etl/logfile'

 I think that should work.

 Then Create another table

 CREATE TABLE logfile (ts STRING, ADD STRING, files STRING) PARTITIONED BY
 (DAY STRING) STORED AS ORC;

 copy files into /etc/logfile

 run this hive file:

 SET hive.exec.dynamic.partition=true;

 SET hive.exec.dynamic.partition.mode=nonstrict;

 SET hive.exec.max.dynamic.partitions=1;

 SET hive.exec.max.dynamic.partitions.pernode=1000;

 INSERT INTO logfile
 select substring(message_line, 0, 17) as ts
 regexp_extract(message_line, '\\[([^\\]+)\\]') as ADD,
 regexp_extract(message_line,'\\] \\[([^\\]]+)\\]') as files,
 concat('20', substring(messageline, 0, 8)) as day
 from logfile_etl

 delete the the files /etl/logfile (or move them to an archival)

 That will get you a day partitioned (I added the 20 in front of your date
 so that string sorts well, although it probably would without it, it'
 early, and I have not had coffee yet) ORC file table (with compression and
 ORC good ness. The regexs are a little messy, by based on your one line of
 data, should work. Also: If you have data from pre 2000 obviously, the
 concat('20' thing needs to be  updated.  Note, I didn't use a regex on the
 date... why? It appears to be properly padded data, therefore a substring
 is fast. This type of stuff has so many ways to skin a cat, so your way may
 be totally different from my way, but this is how I'd approach it long
 term. (if it's a one time thing, I may not create the managed the table,
 but if so, having partitions and ORC files will make things faster).  If
 there are syntax errors I apologize, see earlier disclaimer about lack of
 proper bean sourced stimulants.





 On Sat, Nov 23, 2013 at 7:36 AM, Baahu bahub...@gmail.com wrote:

 Hi,
 I have a messy log file which I want to use to create a table, I am only
 interested to retrieve 3 columns (time,ADD,files),which are in bold.
 Sample entry from log file
 *: 13-11-23 06:23:45 [ADD] [file1.zip|file2.zip] *  junkjunk|2013-11-23
 06:23:44:592 EST|file3.zip xyz|2013-11-23 06:23:44:592 EST|file3.zip

 Can you please let me know how I should go about, regex seems to be way
 out,but I am struggling with that as well !!

 Thanks,
 Baahu





 --
 Twitter:http://twitter.com/Baahu




Is this a Bug in from_utc_timestamp?

2013-11-19 Thread John Omernik
There are some discussions on this
https://issues.apache.org/jira/browse/HIVE-3822

However, one person is stating there is not an issue with timestamp, thus I
am asking this question: is this a bug in from_utc_timestamp?

Example:

I have a column starttime with the value 1384495201 in it (it's stored as
string, hence the cast)

I select it out

select starttime, from_utc_timestamp(cast(starttime as double), 'UTC') from
table

Here is how I describe what SHOULD happen:

starttime is epoch time, thus it is, by nature, timezoneless.  There is no
timezone in epoch time because it is defined as the number of seconds from
a certain point in time, at a certain timezone. Ok, so I am casting that
into a Hive timestamp, which is also timezoneless. Right?  Based on that,
I want to print the timestamp using from_utc_timestamp (which has no
timezone attached) in a human-readable format, applying a timezone (UTC).  Ok.

What should the result be? 2013-11-15 06:00:01.   I had an epoch time
(timezoneless), I converted it to a timestamp (timezoneless), and I wanted to
print that time in human-readable form applying the timezone UTC, thus
2013-11-15 06:00:01 is correct.

What prints?

2013-11-15 00:00:01

It appears that somewhere in that process there is an implicit conversion
using the timezone of my cluster (-6). This makes NO sense: nothing from epoch
time to timestamp should be affected by timezone, since both are timezoneless, and
going from timestamp to human readable, with a specified timezone, should take the
specified timezone.   Should I open a JIRA, or is there another way to
look at/address this?
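
A sketch of the same conversion done step by step, assuming the cluster's local zone is America/Chicago (the -6 offset mentioned above), that starttime holds epoch seconds, and a hypothetical table name; from_unixtime renders the epoch value in the cluster's zone, and to_utc_timestamp then converts that local rendering to UTC:

-- 1384495201 is 2013-11-15 06:00:01 UTC
SELECT from_unixtime(CAST(starttime AS BIGINT)) AS local_ts,   -- 2013-11-15 00:00:01 in US Central
       to_utc_timestamp(from_unixtime(CAST(starttime AS BIGINT)),
                        'America/Chicago') AS utc_ts           -- 2013-11-15 06:00:01
FROM some_table;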


Re: ORC Tuning - Examples?

2013-11-13 Thread John Omernik
Yin -

Fantastic! That is exactly the type of explanation of settings I'd like to
see. More than just what it does, but the tradeoffs, and how things are
applied in the real world.  Have you played with the stride length at all?


On Wed, Nov 13, 2013 at 1:13 PM, Yin Huai huaiyin@gmail.com wrote:

 Hi John,

 Here is my experience on the stripe size. For a given table, when the
 stripe size is increased, the size of a column in a stripe increases, which
 means the ORC reader can read a column from disks in a more efficient way
 because the reader can sequentially read more data (assuming the reader and
 the HDFS block are co-located). But, a larger stripe size may decrease the
 number of concurrent Map tasks reading an ORC file because a Map task needs
 to process at least one stripe (seems a stripe is not splitable right now).
 If you can get enough degree of parallelism, I think increasing the stripe
 size generally gives you better data reading efficiency in one task.
 However, on HDDs, the benefit from increasing the stripe size on data
 reading efficiency in a Map task is getting smaller with the increase of
 the stripe size. So, for a table with only a few columns (assuming a single
 ORC file is used), using a smaller stripe size may not significantly affect
 data reading efficiency in a task, and you can potentially have more
 concurrent tasks to read this ORC file. So, I think you need to tradeoff
 the data reading efficiency in a single task (larger stripe size - better
 data reading efficiency in a task) and the degree of parallelism (smaller
 stripe size - more concurrent tasks to read an ORC file) when determining
 the right stripe size.

 btw, I have a paper studying file formats and it has some related
 contents. Here is the link:
 http://www.cse.ohio-state.edu/hpcs/WWW/HTML/publications/papers/TR-13-5.pdf
 .

 Thanks,

 Yin


 On Tue, Nov 12, 2013 at 8:51 PM, Lefty Leverenz 
 leftylever...@gmail.com wrote:

 If you get some useful advice, let's improve the doc.

 -- Lefty


 On Tue, Nov 12, 2013 at 6:15 PM, John Omernik j...@omernik.com wrote:

 I am looking for guidance (read examples) on tuning ORC settings for my
 data.  I see the documentation that shows the defaults, as well as a brief
 description of what it is.  What I am looking for is some examples of
 things to try.  *Note: I understand that nobody wants to make sweeping
 declarations like set this setting without knowing the data.*  That said, I would
 love to see some examples, specifically around:

 orc.row.index.stride

 orc.compress.size

 orc.stripe.size


 For example, I'd love to see some statements like:


 If your data has lots of columns of small data, and you'd like better x,
 try changing y setting because this allows hive to do z when querying.


 If your data has few columns of large data, try changing y and this
 allows hive to do z while querying.


 It would be really neat to see some examples so we can get in and tune
 our data. Right now, everything is a crapshoot for me, and I don't know if
 there are detrimental effects that may make themselves known later.


 Any input would be welcome.






ORC Tuning - Examples?

2013-11-12 Thread John Omernik
I am looking for guidance (read examples) on tuning ORC settings for my
data.  I see the documentation that shows the defaults, as well as a brief
description of what it is.  What I am looking for is some examples of
things to try.  *Note: I understand that nobody wants to make sweeping
declarations like set this setting without knowing the data.*  That said, I would
love to see some examples, specifically around:

orc.row.index.stride

orc.compress.size

orc.stripe.size


For example, I'd love to see some statements like:


If your data has lots of columns of small data, and you'd like better x,
try changing y setting because this allows hive to do z when querying.


If your data has few columns of large data, try changing y and this allows
hive to do z while querying.


It would be really neat to see some examples so we can get in and tune our
data. Right now, everything is a crapshoot for me, and I don't know if
there are detrimental effects that may make themselves known later.


Any input would be welcome.
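
For reference, here is where these three knobs actually live; a minimal sketch
with placeholder table/column names, and property values written out only to
show the syntax (they mirror the documented defaults, not a recommendation):

CREATE TABLE logs_orc (
  loginid   STRING,
  starttime BIGINT
)
STORED AS ORC
TBLPROPERTIES (
  'orc.compress'         = 'ZLIB',       -- compression codec
  'orc.compress.size'    = '262144',     -- compression chunk size in bytes
  'orc.stripe.size'      = '268435456',  -- stripe size in bytes
  'orc.row.index.stride' = '10000'       -- rows between row-index entries
);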


Suggestion for Metastore Oprerations around ORC Files

2013-11-09 Thread John Omernik
I was testing out the conversion of a table to ORC.  Using previous posts,
I did alter table tablename set fileformat ORC; and this worked great.  All
new partitions created were ORC, and the RC and ORC files played nicely next
to each other.

Then I had a hypothesis. I have tables that almost always have Hive jobs
running and inserting data. Ideally, I don't want to stop those.  In my
head, I saw a problem: if I converted the table mid-INSERT job, what would
happen?

Ideally, the RC format that existed when the job started would be honored,
the files would be written as RC files, and all would be well.  What I
think actually happened is that the setting was not honored: either the
writers changed to ORC mid-file, causing major borkage, or, and this is
what I suspect happened, the writers used the RC file format but the
partition metadata was updated to ORC. Either way, I am not an expert,
but I could cause all subsequent queries to fail when I did that.

Like I said, almost everything about the conversion to ORC is going well,
but I'd recommend a change: allow the setting to be changed while currently
running jobs honor the old setting for their partitions, and any new jobs
use the new settings.
Also, for the group: how do things respond when you are doing insert-append
operations, and the first job's files were RC while other files in the same
partition are ORC?

Thanks!
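
(For anyone hitting the same thing: one way to see what the metastore thinks
a given partition's format is, versus the table default, is something like
the following; the table and partition names are placeholders.)

DESCRIBE FORMATTED my_table;
DESCRIBE FORMATTED my_table PARTITION (day='2013-11-01', source='src1');
-- Compare the InputFormat / OutputFormat / SerDe lines: partitions written
-- before the ALTER should still report RCFile, newer ones ORC.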


ORC Files: Does this get me anything?

2013-10-16 Thread John Omernik
So I am experimenting with ORC files, and I have a fast little table that
has login events.  Out of curiosity, I was wondering if based on what we
all knew about ORC files, if did the below, would the per file indexing get
me anything? Now, before people complain about small files, let's toss that
aside for now.

I have
set mapred.reduce.tasks=26;
insert into table ogintable
select * from main_table where loginid != ''
distribute by (abs(hash(substring(loginid, 0, 1))) % 26)
sort by loginid


Basically, I am thinking that if I distribute by what I put, each letter
will get its own file and thus act as a mini index. Am I overthinking
this? I know if I do just the sort by I get 3 to 4 files; with this method
I get more files, and since loginid is an extremely common where-clause
member, I was thinking this may be a good thing? Maybe I am wrong; figured
I'd send it out to the group to get made fun of/ridiculed in public :)


Bug in Hive Split function (Tested on Hive 0.9 and 0.11)

2013-10-09 Thread John Omernik
Hello all, I think I have outlined a bug in the hive split function:

Summary: When calling split on a string of data, it will only return all
array items if the last array item has a value. For example, if I have
a string of text delimited by tab with 7 columns, and the first four are
filled, but the last three are blank, split will only return a 4 position
array. If  any number of middle columns are empty, but the last item
still has a value, then it will return the proper number of columns.  This
was tested in Hive 0.9 and hive 0.11.

Data:
(Note: \t represents a tab char, \x09; the line endings should be \n (UNIX
style), not sure what email will do to them).  Basically my data is 7 lines
of data with the first 7 letters separated by tab.  On some lines I've left
out certain letters, but kept the number of tabs exactly the same.

input.txt
a\tb\tc\td\te\tf\tg
a\tb\tc\td\te\t\tg
a\tb\t\td\t\tf\tg
\t\t\td\te\tf\tg
a\tb\tc\td\t\t\t
a\t\t\t\te\tf\tg
a\t\t\td\t\t\tg

I then created a table with one column from that data:


DROP TABLE tmp_jo_tab_test;

CREATE TABLE tmp_jo_tab_test (message_line STRING)
STORED AS TEXTFILE;

LOAD DATA LOCAL INPATH '/tmp/input.txt'
OVERWRITE INTO TABLE tmp_jo_tab_test;


Ok just to validate I created a python counting script:


#!/usr/bin/python

import sys

for line in sys.stdin:
    # strip the trailing newline, then count the tab-separated fields
    line = line[0:-1]
    out = line.split('\t')
    print len(out)


The output there is :

$ cat input.txt |./cnt_tabs.py

7

7

7

7

7

7

7


Based on that information, split on tab should return me 7 for each line as
well:


hive -e "select size(split(message_line, '\\t')) from tmp_jo_tab_test;"

7

7

7

7

4

7

7


However it does not.  It would appear that the line where only the first
four letters are filled in (and blank is passed in for the last three) only
returns 4 splits, where there should technically be 7: 4 for the letters
included and three blanks.


a\tb\tc\td\t\t\t
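
For what it's worth, the behavior looks like plain Java String.split()
semantics rather than anything Hive-specific; a small standalone illustration
of my own (not Hive code):

public class SplitDemo {
    public static void main(String[] args) {
        String line = "a\tb\tc\td\t\t\t";
        // with the default limit of 0, trailing empty strings are discarded
        System.out.println(line.split("\t").length);      // prints 4
        // a negative limit keeps the trailing empty strings
        System.out.println(line.split("\t", -1).length);  // prints 7
    }
}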


Re: Bug in Hive Split function (Tested on Hive 0.9 and 0.11)

2013-10-09 Thread John Omernik
I opened a JIRA on this: https://issues.apache.org/jira/browse/HIVE-5506




On Wed, Oct 9, 2013 at 9:44 AM, John Omernik j...@omernik.com wrote:

 Hello all, I think I have outlined a bug in the hive split function:

 Summary: When calling split on a string of data, it will only return all
 array items if the the last array item has a value. For example, if I have
 a string of text delimited by tab with 7 columns, and the first four are
 filled, but the last three are blank, split will only return a 4 position
 array. If  any number of middle columns are empty, but the last item
 still has a value, then it will return the proper number of columns.  This
 was tested in Hive 0.9 and hive 0.11.

 Data:
 (Note \t represents a tab char, \x09 the line endings should be \n (UNIX
 style) not sure what email will do to them).  Basically my data is 7 lines
 of data with the first 7 letters separated by tab.  On some lines I've left
 out certain letters, but kept the number of tabs exactly the same.

 input.txt
 a\tb\tc\td\te\tf\tg
 a\tb\tc\td\te\t\tg
 a\tb\t\td\t\tf\tg
 \t\t\td\te\tf\tg
 a\tb\tc\td\t\t\t
 a\t\t\t\te\tf\tg
 a\t\t\td\t\t\tg

 I then created a table with one column from that data:


 DROP TABLE tmp_jo_tab_test;

 CREATE TABLE tmp_jo_tab_test (message_line STRING)
 STORED AS TEXTFILE;

 LOAD DATA LOCAL INPATH '/tmp/input.txt'
 OVERWRITE INTO TABLE tmp_jo_tab_test;


 Ok just to validate I created a python counting script:


 #!/usr/bin/python

 import sys

 for line in sys.stdin:
     # strip the trailing newline, then count the tab-separated fields
     line = line[0:-1]
     out = line.split('\t')
     print len(out)


 The output there is :

 $ cat input.txt |./cnt_tabs.py

 7

 7

 7

 7

 7

 7

 7


 Based on that information, split on tab should return me 7 for each line
 as well:


 hive -e "select size(split(message_line, '\\t')) from tmp_jo_tab_test;"

 7

 7

 7

 7

 4

 7

 7


 However it does not.  It would appear that the line where only the first
 four letters are filled in(and blank is passed in on the last three) only
 returns 4 splits, where there should technically be 7, 4 for letters
 included, and three blanks.


 a\tb\tc\td\t\t\t









RC - ORC INSERT OVERWRITE metastore Heap Error

2013-08-10 Thread John Omernik
I am doing some testing going from table_rc to table_orc. The
table/partition structure is the same, and there is a two-level partition:
day= then source=

I am doing a single day (including all 10 or so sources in the day). This
worked just fine in one environment, but now I am getting strange errors
before things even get to the data (or so it seems). The error seems to be
related to the metastore, but on the table_rc table rather than the ORC one.
I guess I am curious why this error is happening. This is on the MapR
distribution of Hadoop, but it seems more related to metastore operations
than to Hadoop filesystem operations.


INSERT OVERWRITE TABLE table_orc PARTITION (day, source)
SELECT * from table_rc where day = '2013-08-04'
13/08/10 08:44:41 INFO parse.ParseDriver: Parse Completed
13/08/10 08:44:41 INFO parse.SemanticAnalyzer: Starting Semantic Analysis
13/08/10 08:44:41 INFO parse.SemanticAnalyzer: Completed phase 1 of
Semantic Analysis
13/08/10 08:44:41 INFO parse.SemanticAnalyzer: Get metadata for source
tables
13/08/10 08:44:41 INFO metastore.HiveMetaStore: 0: get_table : db=hive_flow
tbl=pcaps
13/08/10 08:44:41 INFO HiveMetaStore.audit: ugi=root
ip=unknown-ip-addr cmd=get_table
: db=hive_flow tbl=table_rc
13/08/10 08:44:41 INFO parse.SemanticAnalyzer: Get metadata for subqueries
13/08/10 08:44:41 INFO parse.SemanticAnalyzer: Get metadata for destination
tables
13/08/10 08:44:41 INFO metastore.HiveMetaStore: 0: get_table : db=hive_flow
tbl=table_orc
13/08/10 08:44:41 INFO HiveMetaStore.audit: ugi=root
ip=unknown-ip-addr cmd=get_table
: db=hive_flow tbl=table_orc
13/08/10 08:44:41 INFO parse.SemanticAnalyzer: Completed getting MetaData
in Semantic Analysis
13/08/10 08:44:41 INFO metastore.HiveMetaStore: 0: get_partitions_with_auth
: db=hive_flow tbl=table_rc
13/08/10 08:44:41 INFO HiveMetaStore.audit: ugi=root
ip=unknown-ip-addr cmd=get_partitions_with_auth
: db=hive_flow tbl=table_rc
13/08/10 08:52:12 ERROR metastore.RetryingHMSHandler:
java.lang.OutOfMemoryError: Java heap space
at com.mysql.jdbc.MysqlIO.buildResultSetWithRows(MysqlIO.java:2098)
 at com.mysql.jdbc.MysqlIO.getResultSet(MysqlIO.java:459)
at com.mysql.jdbc.MysqlIO.readResultsForQueryOrUpdate(MysqlIO.java:2076)
 at com.mysql.jdbc.MysqlIO.readAllResults(MysqlIO.java:1451)
at com.mysql.jdbc.MysqlIO.sqlQueryDirect(MysqlIO.java:1787)
 at com.mysql.jdbc.Connection.execSQL(Connection.java:3283)
at
com.mysql.jdbc.PreparedStatement.executeInternal(PreparedStatement.java:1332)
 at
com.mysql.jdbc.PreparedStatement.executeQuery(PreparedStatement.java:1467)
at
org.apache.commons.dbcp.DelegatingPreparedStatement.executeQuery(DelegatingPreparedStatement.java:96)
 at
org.apache.commons.dbcp.DelegatingPreparedStatement.executeQuery(DelegatingPreparedStatement.java:96)
at
org.datanucleus.store.rdbms.SQLController.executeStatementQuery(SQLController.java:457)
 at
org.datanucleus.store.rdbms.scostore.RDBMSMapEntrySetStore.iterator(RDBMSMapEntrySetStore.java:248)
at
org.datanucleus.sco.SCOUtils.populateMapDelegateWithStoreData(SCOUtils.java:987)
 at org.datanucleus.sco.backed.Map.loadFromStore(Map.java:258)
at org.datanucleus.sco.backed.Map.keySet(Map.java:509)
 at
org.datanucleus.store.fieldmanager.LoadFieldManager.internalFetchObjectField(LoadFieldManager.java:118)
at
org.datanucleus.store.fieldmanager.AbstractFetchFieldManager.fetchObjectField(AbstractFetchFieldManager.java:114)
 at
org.datanucleus.state.AbstractStateManager.replacingObjectField(AbstractStateManager.java:1183)
at
org.apache.hadoop.hive.metastore.model.MSerDeInfo.jdoReplaceField(MSerDeInfo.java)
 at
org.apache.hadoop.hive.metastore.model.MSerDeInfo.jdoReplaceFields(MSerDeInfo.java)
at
org.datanucleus.jdo.state.JDOStateManagerImpl.replaceFields(JDOStateManagerImpl.java:2860)
 at
org.datanucleus.jdo.state.JDOStateManagerImpl.replaceFields(JDOStateManagerImpl.java:2879)
at
org.datanucleus.jdo.state.JDOStateManagerImpl.loadFieldsInFetchPlan(JDOStateManagerImpl.java:1647)
 at
org.datanucleus.ObjectManagerImpl.performDetachAllOnTxnEndPreparation(ObjectManagerImpl.java:3552)
at org.datanucleus.ObjectManagerImpl.preCommit(ObjectManagerImpl.java:3291)
 at
org.datanucleus.TransactionImpl.internalPreCommit(TransactionImpl.java:369)
at org.datanucleus.TransactionImpl.commit(TransactionImpl.java:256)
 at org.datanucleus.jdo.JDOTransaction.commit(JDOTransaction.java:83)
at
org.apache.hadoop.hive.metastore.ObjectStore.commitTransaction(ObjectStore.java:345)
 at
org.apache.hadoop.hive.metastore.ObjectStore.getPartitionsWithAuth(ObjectStore.java:1405)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)

13/08/10 08:52:12 INFO ql.Driver: /PERFLOG method=compile
start=1376138681224 end=137613913 duration=450998
Exception in thread main java.lang.OutOfMemoryError: Java heap space
at com.mysql.jdbc.MysqlIO.buildResultSetWithRows(MysqlIO.java:2098)
 at com.mysql.jdbc.MysqlIO.getResultSet(MysqlIO.java:459)
at 

Optimizing ORC Sorting - Replace two level Partitions with one?

2013-08-10 Thread John Omernik
I have a table that currently uses RC files and has two levels of
partitions: day and source.  The table is first partitioned by day, then
within each day there are 6-15 source partitions.  This makes for a lot of
crazy partitions, and I was wondering if there'd be a way to optimize this
with ORC files and some sorting.

Specifically, would there be a way in a new table to make source a field
(removing the partition) and somehow, as I am inserting into this new setup,
sort by source in such a way that it helps separate the files/indexes and
gives me almost the same performance as ORC with the two-level partitions?
Just trying to optimize here and curious what people think.

John
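
Roughly what I have in mind, as a sketch only (table and column names are
made up, and I don't know yet whether ORC's row-group indexes would actually
prune on source this way):

-- day stays as the only partition; source becomes a regular column
CREATE TABLE events_orc (
  source  STRING,
  payload STRING
)
PARTITIONED BY (day STRING)
STORED AS ORC;

INSERT OVERWRITE TABLE events_orc PARTITION (day='2013-08-04')
SELECT source, payload
FROM events_rc
WHERE day = '2013-08-04'
DISTRIBUTE BY source
SORT BY source;  -- keeps each source's rows clustered within the ORC files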


Re: Optimizing ORC Sorting - Replace two level Partitions with one?

2013-08-10 Thread John Omernik
One issue with the bucketing is that the number of sources on any given day
is dynamic. On some days it's 4, on others it's 14, and it's also constantly
changing.  I am hoping to use some of the features of the ORC files to
almost make virtual partitions, but apparently I am going to run into
issues either way.

On another note, is there a limit to hive and partitions? I am hovering
around 10k partitions on one table right now. It's still working, but some
metadata operations can take a long time. The Sub-Partitions are going to
hurt me here going forward I am guessing, so it may be worth flattening out
to only days, even at the expense of read queries... thoughts?



On Sat, Aug 10, 2013 at 11:46 AM, Nitin Pawar nitinpawar...@gmail.com wrote:

 Agree with Edward,

 whole purpose of bucketing for me is to prune the data in where clause.
 Else it totally defeats the purpose of splitting data into finite number of
 identifiable distributions to improve the performance.

 But is my understanding correct that it does help in that the number of
 sub-partitions we create at the bottom of the table can be limited, if we
 know the pattern does not exceed a finite number of values in that
 partition? (Even if it crosses this limit, bucketing does take care of it
 up to some volume.)


 On Sat, Aug 10, 2013 at 10:09 PM, Edward Capriolo 
 edlinuxg...@gmail.comwrote:

 So there is one thing to be really careful about with bucketing. Say you
 bucket a table into 10 buckets: select with where does not actually prune
 the input buckets, so many queries scan all the buckets.


 On Sat, Aug 10, 2013 at 12:34 PM, Nitin Pawar nitinpawar...@gmail.com wrote:

 will bucketing help? if you know finite # partiotions ?


 On Sat, Aug 10, 2013 at 9:26 PM, John Omernik j...@omernik.com wrote:

 I have a table that currently uses RC files and has two levels of
 partitions.  day and source.  The table is first partitioned by day, then
 within each day there are 6-15 source partitions.  This makes for a lot of
 crazy partitions and was wondering if there'd be a way to optimize this
 with ORC files and some sorting.

 Specifically, would there be a way in a new table to make source a
 field (removing the partition)and somehow, as I am inserting into this new
 setup sort by source in such a way that will help separate the
 files/indexes in a way that gives me almost the same performance as ORC
 with the two level partitions?  Just trying to optimize here and curious
 what people think.

 John




 --
 Nitin Pawar





 --
 Nitin Pawar



Re: Optimizing ORC Sorting - Replace two level Partitions with one?

2013-08-10 Thread John Omernik
Are there any effective limits on the number of partitions? Partitions are
the answer we chose because it makes logical sense, i.e. I have days, and on
a given day I have a number of sources. Sometimes I want to query by day
and search all sources; other times I want to focus on specific sources.
With bucketing, will it prune on the column automatically, the way partitions
do? (Remember, this is specific to the ORC files I am working with here.)
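
For reference, the bucketed layout being discussed would look roughly like
this (names and the bucket count are made up); per Edward's point below, a
plain WHERE clause does not automatically prune to a single bucket the way a
partition predicate does:

CREATE TABLE events_bucketed (
  source  STRING,
  payload STRING
)
PARTITIONED BY (day STRING)
CLUSTERED BY (source) SORTED BY (source) INTO 16 BUCKETS
STORED AS ORC;

-- hive.enforce.bucketing makes inserts honor the bucket definition
SET hive.enforce.bucketing = true;
INSERT OVERWRITE TABLE events_bucketed PARTITION (day='2013-08-04')
SELECT source, payload FROM events_rc WHERE day = '2013-08-04';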


On Sat, Aug 10, 2013 at 12:19 PM, Edward Capriolo edlinuxg...@gmail.com wrote:

 Bucketing does deal with that: if you bucket on a column you always get
 the bucket number of files, because you're hashing the value into a bucket.

 A query scanning many partitions and files is needlessly slow from MR
 overhead.


 On Sat, Aug 10, 2013 at 12:58 PM, John Omernik j...@omernik.com wrote:

 One issue with the bucketing is that the number of sources on any given
 day is dynamic. On some days it's 4, others it's 14 and it's also
 constantly changing.  I am hoping to use some of the features of the ORC
 files to almost make virtual partitions, but apparently I am going to run
 into issues either way.

 On another note, is there a limit to hive and partitions? I am hovering
 around 10k partitions on one table right now. It's still working, but some
 metadata operations can take a long time. The Sub-Partitions are going to
 hurt me here going forward I am guessing, so it may be worth flattening out
 to only days, even at the expense of read queries... thoughts?



 On Sat, Aug 10, 2013 at 11:46 AM, Nitin Pawar nitinpawar...@gmail.com wrote:

 Agree with Edward,

 whole purpose of bucketing for me is to prune the data in where clause.
 Else it totally defeats the purpose of splitting data into finite number of
 identifiable distributions to improve the performance.

 But is my understanding correct that it  does help in reducing the
 number of sub partitions we create at the bottom of table can be limited if
 we identify the pattern does not exceed a finite number of values on that
 partitions? (even if it cross this limit bucketting does take care of it
 upto some volume)


 On Sat, Aug 10, 2013 at 10:09 PM, Edward Capriolo edlinuxg...@gmail.com
  wrote:

 So there is one thing to be really carefully about bucketing. Say you
 bucket a table into 10 buckets, select with where does not actually prune
 the input buckets so many queries scan all the buckets.


 On Sat, Aug 10, 2013 at 12:34 PM, Nitin Pawar 
 nitinpawar...@gmail.com wrote:

 will bucketing help? if you know finite # partiotions ?


 On Sat, Aug 10, 2013 at 9:26 PM, John Omernik j...@omernik.com wrote:

 I have a table that currently uses RC files and has two levels of
 partitions.  day and source.  The table is first partitioned by day, then
 within each day there are 6-15 source partitions.  This makes for a lot 
 of
 crazy partitions and was wondering if there'd be a way to optimize this
 with ORC files and some sorting.

 Specifically, would there be a way in a new table to make source a
 field (removing the partition)and somehow, as I am inserting into this 
 new
 setup sort by source in such a way that will help separate the
 files/indexes in a way that gives me almost the same performance as ORC
 with the two level partitions?  Just trying to optimize here and curious
 what people think.

 John




 --
 Nitin Pawar





 --
 Nitin Pawar






Re: Large Scale Table Reprocess

2013-07-26 Thread John Omernik
Can you give some examples of how to alter partitions for different input
types? I'd appreciate it :)
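
Something along these lines is what I'm picturing; a sketch only, with
placeholder names, following the approach Alan describes below (new partitions
in ORC, old partitions left alone as RCFile):

-- make ORC the table default so newly created partitions use it
ALTER TABLE my_table SET FILEFORMAT ORC;

-- an individual existing partition can also be altered explicitly; note this
-- only changes metadata, it does not rewrite the partition's data files
ALTER TABLE my_table PARTITION (day='2013-07-01') SET FILEFORMAT ORC;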


On Fri, Jul 26, 2013 at 3:29 PM, Alan Gates ga...@hortonworks.com wrote:

 A table can definitely have partitions with different input
 formats/serdes.  We test this all the time.

 Assuming your old data doesn't stay for ever and most of your queries are
 on more recent data (which is usually the case) I'd advise you to not
 reprocess any data, just alter the table to store new partitions in ORC.
  Then with time you'll slowly transition the table to ORC.  This avoids all
 the issues you noted.  And since most queries probably only access recent
 data you'll see speed ups soon after the switch.

 Alan.

 On Jul 25, 2013, at 4:45 PM, John Omernik wrote:

  Just finishing up testing with Hive 11 and ORC. Thank you to Owen and
 all those who have put hard work into this. Just ORC files, when compared
 to RC files in Hive 9, 10, and 11 saw a huge increase in performance, it
 was amazing.  That said, now we gotta reprocess.
 
 
  We have a large table with lots of partitions. I'd love to be able to
 reprocess into a new table, like table_orc, and then at the end of it all,
 just drop the original table. That said, I see it being hard to do from a
 space perspective. and I will have to do partition at a time.  But then
 theirs production issues, if I update a partition, insert overwrite int the
 ORC table, then I have delete the original and production users will be
 missing data decisions decisions.
 
  So any ideas? Can a table have some partitions in one file type and
 other partitions in another? That sounds scary.  Anywho, a good problem to
 have... that performance will be worth it.
 
 




Re: Large Scale Table Reprocess

2013-07-26 Thread John Omernik
More specifically, we have a table that is currently defined as RCFile; to
do this, I'd like to define all new partitions as ORC.  With the advent of
ORC, these types of problems are going to come up for many folks, so any
guidance would be appreciated ...

Also, based on the strategic goals of ORC files, do you see ORC files
changing significantly (i.e. to the point where we have to do another
reprocess)?



On Fri, Jul 26, 2013 at 5:09 PM, John Omernik j...@omernik.com wrote:

 Can you give some examples of how to alter partitions for different input
 types? I'd appreciate it :)


 On Fri, Jul 26, 2013 at 3:29 PM, Alan Gates ga...@hortonworks.com wrote:

 A table can definitely have partitions with different input
 formats/serdes.  We test this all the time.

 Assuming your old data doesn't stay for ever and most of your queries are
 on more recent data (which is usually the case) I'd advise you to not
 reprocess any data, just alter the table to store new partitions in ORC.
  Then with time you'll slowly transition the table to ORC.  This avoids all
 the issues you noted.  And since most queries probably only access recent
 data you'll see speed ups soon after the switch.

 Alan.

 On Jul 25, 2013, at 4:45 PM, John Omernik wrote:

  Just finishing up testing with Hive 11 and ORC. Thank you to Owen and
 all those who have put hard work into this. Just ORC files, when compared
 to RC files in Hive 9, 10, and 11 saw a huge increase in performance, it
 was amazing.  That said, now we gotta reprocess.
 
 
  We have a large table with lots of partitions. I'd love to be able to
 reprocess into a new table, like table_orc, and then at the end of it all,
 just drop the original table. That said, I see it being hard to do from a
 space perspective. and I will have to do partition at a time.  But then
 theirs production issues, if I update a partition, insert overwrite int the
 ORC table, then I have delete the original and production users will be
 missing data decisions decisions.
 
  So any ideas? Can a table have some partitions in one file type and
 other partitions in another? That sounds scary.  Anywho, a good problem to
 have... that performance will be worth it.
 
 





Large Scale Table Reprocess

2013-07-25 Thread John Omernik
Just finishing up testing with Hive 11 and ORC. Thank you to Owen and all
those who have put hard work into this. ORC files alone, when compared to RC
files in Hive 9, 10, and 11, saw a huge increase in performance; it was
amazing.  That said, now we gotta reprocess.


We have a large table with lots of partitions. I'd love to be able to
reprocess into a new table, like table_orc, and then at the end of it all,
just drop the original table. That said, I see it being hard to do from a
space perspective, and I will have to do a partition at a time.  But then
there are production issues: if I update a partition and insert overwrite
into the ORC table, then I have to delete the original, and production users
will be missing data... decisions, decisions.

So any ideas? Can a table have some partitions in one file type and other
partitions in another? That sounds scary.  Anywho, a good problem to
have... that performance will be worth it.


Java Courses for Scripters/Big Data Geeks

2013-07-17 Thread John Omernik
Hey all -

I was wondering if there were any "shortcut" Java courses out there.  As
in, I am not looking for a holistic "learn everything about Java" course, but
more of a "So you are a big data/Hive geek and you get Python/Perl pretty
well, but when you try to understand Java your head explodes and it feels
like you are missing something entry-level and basic, thus you need these
basic things and you'll be fine" course.


Any thoughts?


Re: Difference between like %A% and %a%

2013-05-24 Thread John Omernik
I have mentioned this before, and I think this is a big miss by the Hive
team.  LIKE, by default in many SQL RDBMSs (like MSSQL or MySQL), is not case
sensitive. Thus when you have new users moving over to Hive, if they see a
command called LIKE they will assume similarity (as with many other SQL-like
qualities), and false negatives may ensue.  Even though it's different by
default (I am ok with this, I guess; my personal preference is that it match
the defaults on other systems, but in the end I am fine with it, just
grumbly :) ), give us the ability to set that behavior in hive-site.xml.
That way, when an org realizes that it is different and their users are all
getting false negatives, they can just update the hive-site and fix the
problem rather than have to include it in training that may or may not work.
I've added this comment to
https://issues.apache.org/jira/browse/HIVE-4070#comment-13666278 for fun.
:)

Please? :)
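
In the meantime, the usual workaround is to normalize the case of the column
and write the pattern in lower case; a trivial sketch with made-up
table/column names:

-- case-insensitive match, at the cost of a lower() call per row
SELECT * FROM states
WHERE lower(state_name) LIKE '%a%';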




On Fri, May 24, 2013 at 7:53 AM, Dean Wampler deanwamp...@gmail.com wrote:

 Your where clause looks at the abbreviation, requiring 'A', not the state
 name. You got the correct answer.


 On Fri, May 24, 2013 at 6:21 AM, Sai Sai saigr...@yahoo.in wrote:

 But it should get more results for this:

 %a%

 than for

 %A%

 Please let me know if i am missing something.
 Thanks
 Sai


--
  *From:* Jov am...@amutu.com
 *To:* user@hive.apache.org; Sai Sai saigr...@yahoo.in
 *Sent:* Friday, 24 May 2013 4:39 PM
 *Subject:* Re: Difference between like %A% and %a%


 2013/5/24 Sai Sai saigr...@yahoo.in

 abbreviation l


 Unlike MySQL, strings in Hive are case sensitive, so '%A%' is not equal to
 '%a%'.


 --
 Jov
 blog: http://amutu.com/blog





 --
 Dean Wampler, Ph.D.
 @deanwampler
 http://polyglotprogramming.com



Re: Hive Authorization and Views

2013-05-16 Thread John Omernik
I am curious about the thoughts of the community here; this seems like
something many enterprises would drool over with Hive... I am not a coder,
so the level of coding involved in something like this is unknown to me.


On Sat, May 4, 2013 at 8:31 AM, John Omernik j...@omernik.com wrote:

 We were doing some tests this past week with hive authorization, one of
 our current use challenges is when we have an underlying, well managed
 and partitioned table, and we want to allow access to certain columns in
 that table.  Our first thoughts went to VIEWs as that's a common use case
 with Relational Databases, (i.e. setup a view with only the columns you
 want the user to access) and set the permissions appropriately.

 In testing, and this is not surprising given the the newness of Hive
 Authorization, a VIEW can not be created as to allow access to to a table
 without granting access to the underlying table, defeating the idea of the
 view as tool to manage that access.

 So I wanted to put to the user group: I've done some JIRA searching and
 didn't find anything (I will admit my JIRA search Foo is not stellar), but
 is there an option that could be thrown together in Hive that would allow
 that use case?  Perhaps a configuration setting that would allow views to
 execute as a specific user (perhaps a global user, or perhaps a user
 specified as view creation).  This could allow the view to have access to
 underlying table, but since the view is created, and it couldn't be changed
 by the user, and thus you could set view read permissions to your user or
 group of users you want access.

 I suppose this has challenges i.e. can a user just create a view to
 bypass table level restrictions? Perhaps if this model was taken, the
 privilege for CREATING/MODIFYING views could be created and granted only to
 a superuser of some sort.  I am really just walking through ideas here as
 this is the one last stumbling blocks we have with Hive from an Enterprise
 ready point of view. Heck, if done right, you could almost do data masking
 at the view level. You have a column in your source data that is sensitive,
 so instead of returning that column you do a MD5 (can we have a native MD5
 function? :) of that column or you blank that column. If we put in strong
 security on the creation, modification of views, and allow views to execute
 as a different user that has access to source data, you have a powerful way
 to represent your data to all levels within your org.

 Also: Since I am just brain storming here, I'd love to hear what others
 maybe doing around this area. Perhaps the Hive User Community can come up
 with a strategic plan, while at the same time share some shorter term
 workarounds.

 Thanks!



Re: Hive Authorization and Views

2013-05-16 Thread John Omernik
Edward - I agree that Hive and RDBMSs are different animals, so looking at
the current workaround for Hive authorization, I get that the user would
still have access to the underlying file system.  We have to assume that
permissions are only enforced from a metadata perspective.  But given that
it's high on the list of questions around enterprise adoption of Hive as a
data warehousing solution, it may provide enough of a control to pass audit
requirements if views could be used as the control: users can access the
data directly (outside of Hive), but within Hive they can't access the table
directly, only the view.   I need to think it through some more; even in an
RDBMS, certain users (administrators etc.) can sometimes access the files of
the data store directly but are controlled when accessing the data through
the RDBMS.   Great discussion, I love stuff like this. Hive is awesome, and
it's the community discussion that makes it kick ass (excuse the language) :)



On Thu, May 16, 2013 at 4:19 PM, Sanjay Subramanian 
sanjay.subraman...@wizecommerce.com wrote:

  Also we have all external tables to ensure that accidental dropping of
 tables does not delete data…Plus the good part of HDFS architecture is data
 is immutable….which means u cannot update rows….u can move partitions or
 delete/insert data from hdfs which IMHO is very cool….but may not solve all
 use cases
 Regards
 sanjay

   From: Edward Capriolo edlinuxg...@gmail.com
 Reply-To: user@hive.apache.org user@hive.apache.org
 Date: Thursday, May 16, 2013 2:05 PM
 To: user@hive.apache.org user@hive.apache.org
 Subject: Re: Hive Authorization and Views

   The largest issue is that the RDBMS security model does not match with
 Hive. Hive/Hadoop has file permissions; RDBMSs have column and sometimes
 row-level permissions.

  When you physically have access to the underlying file, (row level)
 permissions are not enforceable. The only way to enforce this type of
 security is to force users through a turnstile that changes how Hive
 currently works.




 On Thu, May 16, 2013 at 4:42 PM, John Omernik j...@omernik.com wrote:

 I am curious on the thoughts of the community here, this seems like
 something many enterprises would drool over with Hive... I am not a coder
 so the level coding involved something like this is unknown.


 On Sat, May 4, 2013 at 8:31 AM, John Omernik j...@omernik.com wrote:

 We were doing some tests this past week with hive authorization, one of
 our current use challenges is when we have an underlying, well managed
 and partitioned table, and we want to allow access to certain columns in
 that table.  Our first thoughts went to VIEWs as that's a common use case
 with Relational Databases, (i.e. setup a view with only the columns you
 want the user to access) and set the permissions appropriately.

  In testing, and this is not surprising given the the newness of Hive
 Authorization, a VIEW can not be created as to allow access to to a table
 without granting access to the underlying table, defeating the idea of the
 view as tool to manage that access.

  So I wanted to put to the user group: I've done some JIRA searching
 and didn't find anything (I will admit my JIRA search Foo is not stellar),
 but is there an option that could be thrown together in Hive that would
 allow that use case?  Perhaps a configuration setting that would allow
 views to execute as a specific user (perhaps a global user, or perhaps a
 user specified as view creation).  This could allow the view to have
 access to underlying table, but since the view is created, and it couldn't
 be changed by the user, and thus you could set view read permissions to
 your user or group of users you want access.

  I suppose this has challenges i.e. can a user just create a view to
 bypass table level restrictions? Perhaps if this model was taken, the
 privilege for CREATING/MODIFYING views could be created and granted only to
 a superuser of some sort.  I am really just walking through ideas here as
 this is the one last stumbling blocks we have with Hive from an Enterprise
 ready point of view. Heck, if done right, you could almost do data masking
 at the view level. You have a column in your source data that is sensitive,
 so instead of returning that column you do a MD5 (can we have a native MD5
 function? :) of that column or you blank that column. If we put in strong
 security on the creation, modification of views, and allow views to execute
 as a different user that has access to source data, you have a powerful way
 to represent your data to all levels within your org.

  Also: Since I am just brain storming here, I'd love to hear what
 others maybe doing around this area. Perhaps the Hive User Community can
 come up with a strategic plan, while at the same time share some shorter
 term workarounds.

  Thanks!





HIVE-3979 in Hive 0.11

2013-05-04 Thread John Omernik
I see in the release notes for HIVE -3979

[HIVE-3979 https://issues.apache.org/jira/browse/HIVE-3979] - Provide
syntax for unescaped regex on rlike, and other regexp_* functions

Yet when I click on that JIRA there are no notes, etc.  Could it be that
this was included by mistake?  If this isn't a mistake, I am curious how
it was implemented.


Thanks!

John


Hive Authorization and Views

2013-05-04 Thread John Omernik
We were doing some tests this past week with hive authorization, one of our
current use challenges is when we have an underlying, well managed and
partitioned table, and we want to allow access to certain columns in that
table.  Our first thoughts went to VIEWs as that's a common use case with
Relational Databases, (i.e. setup a view with only the columns you want the
user to access) and set the permissions appropriately.

In testing, and this is not surprising given the newness of Hive
Authorization, a VIEW cannot be created so as to allow access to a table
without granting access to the underlying table, defeating the idea of the
view as a tool to manage that access.

So I wanted to put to the user group: I've done some JIRA searching and
didn't find anything (I will admit my JIRA search Foo is not stellar), but
is there an option that could be thrown together in Hive that would allow
that use case?  Perhaps a configuration setting that would allow views to
execute as a specific user (perhaps a global user, or perhaps a user
specified as view creation).  This could allow the view to have access to
underlying table, but since the view is created, and it couldn't be changed
by the user, and thus you could set view read permissions to your user or
group of users you want access.

I suppose this has challenges i.e. can a user just create a view to bypass
table level restrictions? Perhaps if this model was taken, the privilege
for CREATING/MODIFYING views could be created and granted only to a
superuser of some sort.  I am really just walking through ideas here as
this is the one last stumbling block we have with Hive from an enterprise-
readiness point of view. Heck, if done right, you could almost do data masking
at the view level. You have a column in your source data that is sensitive,
so instead of returning that column you do a MD5 (can we have a native MD5
function? :) of that column or you blank that column. If we put in strong
security on the creation, modification of views, and allow views to execute
as a different user that has access to source data, you have a powerful way
to represent your data to all levels within your org.

Also: Since I am just brain storming here, I'd love to hear what others
maybe doing around this area. Perhaps the Hive User Community can come up
with a strategic plan, while at the same time share some shorter term
workarounds.

Thanks!
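
To make the use case concrete, this is roughly what we were attempting
(placeholder table, column, and user names; the GRANT uses the current Hive
authorization syntax):

CREATE VIEW customer_safe AS
SELECT customer_id, signup_date   -- only the non-sensitive columns
FROM customer;                    -- e.g. ssn stays hidden

GRANT SELECT ON TABLE customer_safe TO USER analyst1;
-- As described above, in the current model analyst1 still cannot use the
-- view without also being granted access on the underlying customer table.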


Upgrade from Hive 0.9 to Hive 0.10 Heap Error on show tables;

2013-04-03 Thread John Omernik
Not sure what the issue is; the conf is good, I validated that I can log in
to MySQL with the username in the hive-site, and I ran the metastore update
scripts.

show tables;
java.lang.OutOfMemoryError: Java heap space
 at
org.apache.thrift.protocol.TBinaryProtocol.readStringBody(TBinaryProtocol.java:353)
at
org.apache.thrift.protocol.TBinaryProtocol.readMessageBegin(TBinaryProtocol.java:215)
 at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:69)
at
org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.recv_get_database(ThriftHiveMetastore.java:412)
 at
org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.get_database(ThriftHiveMetastore.java:399)
at
org.apache.hadoop.hive.metastore.HiveMetaStoreClient.getDatabase(HiveMetaStoreClient.java:736)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
 at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
 at
org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.invoke(RetryingMetaStoreClient.java:74)
at $Proxy6.getDatabase(Unknown Source)
 at org.apache.hadoop.hive.ql.metadata.Hive.getDatabase(Hive.java:1110)
at org.apache.hadoop.hive.ql.metadata.Hive.databaseExists(Hive.java:1099)
 at org.apache.hadoop.hive.ql.exec.DDLTask.showTables(DDLTask.java:2206)
at org.apache.hadoop.hive.ql.exec.DDLTask.execute(DDLTask.java:334)
 at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:138)
at
org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:57)
 at org.apache.hadoop.hive.ql.Driver.launchTask(Driver.java:1336)
at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:1122)
 at org.apache.hadoop.hive.ql.Driver.run(Driver.java:935)
at org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:259)
 at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:216)
at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:412)
 at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:755)
at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:613)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
 at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
 at org.apache.hadoop.util.RunJar.main(RunJar.java:197)
FAILED: Execution Error, return code -101 from
org.apache.hadoop.hive.ql.exec.DDLTask


Thoughts?


Re: Upgrade from Hive 0.9 to Hive 0.10 Heap Error on show tables;

2013-04-03 Thread John Omernik
Ya, the issue ended up being that I had some thrift settings that came with
previous versions of MapR:

<!--
 <property>
   <name>hive.metastore.local</name>
   <value>true</value>
   <description>controls whether to connect to remove metastore server or
   open a new metastore server in Hive Client JVM</description>
 </property>
-->
<!--
 <property>
   <name>hive.metastore.uris</name>
   <value>thrift://localhost:3306</value>
 </property>
-->

Once commented out, it worked... *shrug* I did read some stuff about these
settings changing in 0.10


On Wed, Apr 3, 2013 at 6:59 PM, Richard Nadeau strout...@gmail.com wrote:

 Hi John,

 Do you have a copy of the MySQL JDBC driver in your Hive library path?

 Rick
 On Apr 3, 2013 3:57 PM, John Omernik j...@omernik.com wrote:

 Not sure what the issues is, conf is good, validated I can log in to
 mysql with username in the hive-site, and I ran the metastore update
 scripts.

 show tables;
 java.lang.OutOfMemoryError: Java heap space
  at
 org.apache.thrift.protocol.TBinaryProtocol.readStringBody(TBinaryProtocol.java:353)
 at
 org.apache.thrift.protocol.TBinaryProtocol.readMessageBegin(TBinaryProtocol.java:215)
  at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:69)
 at
 org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.recv_get_database(ThriftHiveMetastore.java:412)
  at
 org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.get_database(ThriftHiveMetastore.java:399)
 at
 org.apache.hadoop.hive.metastore.HiveMetaStoreClient.getDatabase(HiveMetaStoreClient.java:736)
  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
  at
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
 at java.lang.reflect.Method.invoke(Method.java:597)
  at
 org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.invoke(RetryingMetaStoreClient.java:74)
 at $Proxy6.getDatabase(Unknown Source)
  at org.apache.hadoop.hive.ql.metadata.Hive.getDatabase(Hive.java:1110)
 at org.apache.hadoop.hive.ql.metadata.Hive.databaseExists(Hive.java:1099)
  at org.apache.hadoop.hive.ql.exec.DDLTask.showTables(DDLTask.java:2206)
 at org.apache.hadoop.hive.ql.exec.DDLTask.execute(DDLTask.java:334)
  at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:138)
 at
 org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:57)
  at org.apache.hadoop.hive.ql.Driver.launchTask(Driver.java:1336)
 at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:1122)
  at org.apache.hadoop.hive.ql.Driver.run(Driver.java:935)
 at
 org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:259)
  at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:216)
 at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:412)
  at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:755)
 at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:613)
  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
  at
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
 at java.lang.reflect.Method.invoke(Method.java:597)
  at org.apache.hadoop.util.RunJar.main(RunJar.java:197)
 FAILED: Execution Error, return code -101 from
 org.apache.hadoop.hive.ql.exec.DDLTask


 Thoughts?




Re: Books and good starting point for Hive

2013-02-24 Thread John Omernik
Hello William - Dean Wampler posts quite often on this list and has done (to
my eye) a great job of separating his business (he and other authors have
written a Hive book) from the community aspect (he participates freely
on the list without a lot of self-promotion).

Therefore, I will give him some unself (is that a word?) promotion for his
book.  Dean, and others have written a Hive book published by O'Reilly (I
included a link to it on Amazon)

A few thoughts. Prior to getting the book, I had used Hive for about a year
from the standpoint of a strong SQL background and no big data experience.
Much of what Hive is and can do was abstracted away from me, and I
just wrote queries. After getting the Programming Hive book I truly
understood Hive AND I also understood Hadoop better. I had purchased the
Hadoop O'Reilly book when I started my new position, but found Hadoop to be
pretty beyond me (at the time).  Since then, experience and the O'Reilly
Hive book have been awesome for my understanding of what makes Hive great.
 Give it a shot. I have no financial interest, and Dean and the other
authors don't know me at all; this is just a freely given thought on the
book.



http://www.amazon.com/Programming-Hive-Edward-Capriolo/dp/1449319335/






On Sun, Feb 24, 2013 at 10:43 AM, William Kang weliam.cl...@gmail.com wrote:

 Hi All,
 I just get started with Hive.

 I already installed Hive and finished a few examples.

 Would you please give me some suggestions on a few good books or
 tutorials for me to continue?

 Many thanks.


 William



Using Reflect: A thread for ideas

2013-02-13 Thread John Omernik
I stumbled across the little-documented reflect function today. I've always
known about it, but Java scares me if it's not in a cup, so I didn't dig.
 Well, today I dug, and found an awesome use case for reflect (for me) and
wanted to share.  I also thought it would be nice to validate some thoughts
I had on reflect, and to see how we could share ideas on reflect so that
folks could get more use out of this great feature of Hive.

Here's my example: A simple URL decode function:

select url, reflect('java.net.URLDecoder', 'decode', url, 'utf-8') as
decoded_url from logs

Basically I am using the decode function of the java.net.URLDecoder class.
 Pretty awesome, works great, no files to distribute either.  Even works
through JDBC!

Ok, that being said, I realize now that the function I am trying to call
has to return data as a simple data type.  For example, I struggle to come
up with a simple reflect() for making a hex MD5 out of a string, because
the built-in functions return an object whose methods can return
what I am looking for. Which is great, but then I have to compile Java
code, distribute a jar, and then run the code. I am looking for something
simple, like the URLDecoding function.
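
(Following up on my own question: one candidate that seems to fit the
single-call pattern, assuming Apache commons-codec is on the classpath as it
usually is with Hadoop, is DigestUtils.md5Hex, which returns a plain String.
I have not exhaustively tested this.)

select url, reflect('org.apache.commons.codec.digest.DigestUtils', 'md5Hex', url) as url_md5 from logs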

I love this reflect feature, but I think it's probably underutilized due to
the perceived usability issues for beginners.  So that leads me to my next
thought. What if we brainstorm here handy Java functions that are not
included in the standard Hive language, that translate well to Hive via the
reflect function, and then show an example of their use? I went
first with my URLDecode, and obviously I will be looking for more, but have
you seen some examples that were neat and worked well for you? Can you share?


Perhaps if we get enough examples we could roll some of these into a wiki
page on the hive wiki that folks can use to get over the perceived
complexity of using java reflect?

Thanks to those who have worked hard to implement features like this, it is
truly awesome.


Re: Union in Multi Insert

2013-02-12 Thread John Omernik
You are correct on what I am hoping to do: basically emit two records
for every row.  What was interesting was that when I just did the union in
the from, it didn't seem to do a double table scan. I ended up doing:

INSERT OVERWRITE TABLE table_summary
select col1, unioned_col, count(distinct col4) from
( select col1, col2 as unioned_col, col4 from table
UNION ALL
select col1, col3 as unioned_col, col4 from table
) a
group by col1, unioned_col

When I ran that, it did not run two sets of maps (as I expected it would,
one for each part of the union). Is there something that Hive is doing under
the covers to optimize that for me?

I didn't think to use the transform, but you are right, that would be a
very simple transform in any language.

Thoughts on my resultant query? Perhaps I am just not getting all the data
I should be.

On Mon, Feb 11, 2013 at 11:08 PM, Mark Grover
grover.markgro...@gmail.com wrote:

 John,
 Please correct me if I didn't understand the problem correctly.

 I think in this scenario, it's best to think about the query in terms of
 MapReduce. In this case, you would want for each record sent as input to
 your mapper, two records to be emitted, one with col2's value and one with
 col3's value. Then, if you did whatever count/distinct magic you wanted to
 do in the reduce phase, you would have read the table only once.

 With the above in mind, if I were doing this, I would consider using
 Hive's transform functionality to use a custom mapper.

 Also, FWIW, this part seems unnecessary:
  from (
 select
col1, col2, col3, col4
 from table
 ) a

 I think you might just be able to do (this is from top of my head, no
 guarantees):

 from table
 insert overwrite...
 

 And, if I misunderstood your problem, my apologies. If you could provide
 an example with sample data and expected output, that might be helpful.


 Mark

 On Mon, Feb 11, 2013 at 7:34 PM, John Omernik j...@omernik.com wrote:

 I am trying to do a union, group by, and multi insert all at once. I know
 this convoluted but I what I am trying to do is avoid having to scan
 through the original table more than once... if I can get all my data from
 two columns that I want to pull together, in one round of mappers, I win...
  Basically, col2 and col3 are they type of data, one is src, one is dst, I
 want a single record for every uniq value of both col2 and col3.

 Any thoughts?



 from (
 select
col1, col2, col3, col4
 from table
 ) a
 INSERT overwrite TABLE table_summary
 select col1, unioned_col, count(1), count(distinct col4) from
 (select col1, col2 as unioned_col, col4
 UNION ALL
 select col1, col3 as unioned_col, col4
 ) b
 group by col1, unioned_col





Union in Multi Insert

2013-02-11 Thread John Omernik
I am trying to do a union, group by, and multi insert all at once. I know
this is convoluted, but what I am trying to do is avoid having to scan
through the original table more than once... if I can get all my data from
the two columns that I want to pull together in one round of mappers, I win...
 Basically, col2 and col3 are the same type of data, one is src, one is dst,
and I want a single record for every unique value of both col2 and col3.

Any thoughts?



from (
select
   col1, col2, col3, col4
from table
) a
INSERT overwrite TABLE table_summary
select col1, unioned_col, count(1), count(distinct col4) from
(select col1, col2 as unioned_col, col4
UNION ALL
select col1, col3 as unioned_col, col4
) b
group by col1, unioned_col


Re: The dreaded Heap Space Issue on a Transform

2013-01-30 Thread John Omernik
So just a follow-up. I am less looking for specific troubleshooting on how
to fix my problem, and more looking for a general understanding of heap
space usage with Hive.  When I get an error like this, is it heap space on
a node, or heap space on my hive server?  Is it the heap space of the
tasktracker? Heap of the job kicked off on the node?  Which heap is being
affected? If it's not clear in my output, where can I better understand
this? I am sorely out of my league here when it comes to understanding the
JVM interactions of Hive and Hadoop, i.e. where hive is run, vs where task
trackers are run etc.

Thanks in advance!



On Tue, Jan 29, 2013 at 7:43 AM, John Omernik j...@omernik.com wrote:

 I am running a transform script that parses through a bunch of binary
 data. In 99% of the cases it runs, it runs fine, but on certain files I get
 a failure (as seen below).  Funny thing is, I can run a job with only the
 problem source file and it will work fine, but when it runs as part of a
 group of files, I get these warnings.  I guess what I am asking here is
 this: Where is the
 heap error? Is this occurring on the nodes themselves or, since this is
 where the script is emitting records (and potentially large ones at that)
 and in this case my hive server running the job may be memory light, could
 the issue actually be due to heap on the hive server itself?   My setup is
 1 Hive node (that is woefully underpowered, under memoried, and under disk
 I/Oed) and 4 beefy hadoop nodes.  I guess, my question is the heap issue on
 the sender or the receiver :)




 2013-01-29 08:20:24,107 INFO org.apache.hadoop.hive.ql.io.CodecPool: Got
 brand-new compressor
 2013-01-29 08:20:24,107 INFO
 org.apache.hadoop.hive.ql.exec.SelectOperator: 12 forwarding 1 rows
 2013-01-29 08:20:24,410 INFO
 org.apache.hadoop.hive.ql.exec.ScriptOperator: 3 forwarding 10 rows
 2013-01-29 08:20:24,410 INFO
 org.apache.hadoop.hive.ql.exec.SelectOperator: 4 forwarding 10 rows
 2013-01-29 08:20:24,411 INFO
 org.apache.hadoop.hive.ql.exec.SelectOperator: 5 forwarding 10 rows
 2013-01-29 08:20:24,411 INFO
 org.apache.hadoop.hive.ql.exec.SelectOperator: 6 forwarding 10 rows
 2013-01-29 08:20:24,411 INFO
 org.apache.hadoop.hive.ql.exec.FilterOperator: 8 forwarding 10 rows
 2013-01-29 08:20:24,411 INFO
 org.apache.hadoop.hive.ql.exec.SelectOperator: 9 forwarding 10 rows
 2013-01-29 08:20:24,411 INFO
 org.apache.hadoop.hive.ql.exec.SelectOperator: 10 forwarding 10 rows
 2013-01-29 08:20:24,412 INFO
 org.apache.hadoop.hive.ql.exec.SelectOperator: 12 forwarding 10 rows
 2013-01-29 08:20:27,170 INFO
 org.apache.hadoop.hive.ql.exec.ScriptOperator: 3 forwarding 100 rows
 2013-01-29 08:20:27,170 INFO
 org.apache.hadoop.hive.ql.exec.SelectOperator: 4 forwarding 100 rows
 2013-01-29 08:20:27,170 INFO
 org.apache.hadoop.hive.ql.exec.SelectOperator: 5 forwarding 100 rows
 2013-01-29 08:20:27,171 INFO
 org.apache.hadoop.hive.ql.exec.SelectOperator: 6 forwarding 100 rows
 2013-01-29 08:20:27,171 INFO
 org.apache.hadoop.hive.ql.exec.FilterOperator: 8 forwarding 100 rows
 2013-01-29 08:20:27,171 INFO
 org.apache.hadoop.hive.ql.exec.SelectOperator: 9 forwarding 100 rows
 2013-01-29 08:20:27,171 INFO
 org.apache.hadoop.hive.ql.exec.SelectOperator: 10 forwarding 100 rows
 2013-01-29 08:20:27,171 INFO
 org.apache.hadoop.hive.ql.exec.SelectOperator: 12 forwarding 100 rows
 2013-01-29 08:21:16,247 INFO
 org.apache.hadoop.hive.ql.exec.ScriptOperator: 3 forwarding 1000 rows
 2013-01-29 08:21:16,247 INFO
 org.apache.hadoop.hive.ql.exec.SelectOperator: 4 forwarding 1000 rows
 2013-01-29 08:21:16,247 INFO
 org.apache.hadoop.hive.ql.exec.SelectOperator: 5 forwarding 1000 rows
 2013-01-29 08:21:16,247 INFO
 org.apache.hadoop.hive.ql.exec.SelectOperator: 6 forwarding 1000 rows
 2013-01-29 08:21:16,248 INFO
 org.apache.hadoop.hive.ql.exec.FilterOperator: 8 forwarding 1000 rows
 2013-01-29 08:21:16,248 INFO
 org.apache.hadoop.hive.ql.exec.SelectOperator: 9 forwarding 1000 rows
 2013-01-29 08:21:16,248 INFO
 org.apache.hadoop.hive.ql.exec.SelectOperator: 10 forwarding 1000 rows
 2013-01-29 08:21:16,248 INFO
 org.apache.hadoop.hive.ql.exec.SelectOperator: 12 forwarding 1000 rows
 2013-01-29 08:25:47,532 INFO
 org.apache.hadoop.hive.ql.exec.ScriptOperator: 3 forwarding 1 rows
 2013-01-29 08:25:47,532 INFO
 org.apache.hadoop.hive.ql.exec.SelectOperator: 4 forwarding 1 rows
 2013-01-29 08:25:47,532 INFO
 org.apache.hadoop.hive.ql.exec.SelectOperator: 5 forwarding 1 rows
 2013-01-29 08:25:47,532 INFO
 org.apache.hadoop.hive.ql.exec.SelectOperator: 6 forwarding 1 rows
  2013-01-29 08:25:47,532 INFO
 org.apache.hadoop.hive.ql.exec.FilterOperator: 8 forwarding 1 rows
 2013-01-29 08:25:47,532 INFO
 org.apache.hadoop.hive.ql.exec.SelectOperator: 9 forwarding 1 rows
 2013-01-29 08:25:47,532 INFO
 org.apache.hadoop.hive.ql.exec.SelectOperator: 10 forwarding 1 rows
 2013-01-29 08:25:47,532 INFO
 org.apache.hadoop.hive.ql.exec.SelectOperator: 12 forwarding 1 rows
 2013-01-29 08

Re: The dreaded Heap Space Issue on a Transform

2013-01-30 Thread John Omernik
I am realizing one of my challenges is that I have quite a few cores and
map tasks per node, but (I didn't set it up) I am only running 4 GB per
physical core (12) with 18 map slots.  I am guessing right now that at any
given time, with 18 map slots, the 1.8 GB total of RAM I am assigning to
the sort stuff is undersized, yet I am constrained on memory, so I can't
just up it. Working on getting things upgraded. Thanks for all, I appreciate
the thoughts.
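
A minimal sketch of setting the sort-related knobs named further down this
thread, plus the per-task heap, from the Hive CLI (Hadoop 1.x property names;
the values are illustrative only and have to fit inside whatever heap each of
the 18 map slots actually gets):

set mapred.child.java.opts=-Xmx1024m;  -- heap for each forked map/reduce task
set io.sort.mb=128;                    -- sort buffer carved out of that heap
set io.sort.factor=25;                 -- number of streams merged at once
set io.file.buffer.size=131072;        -- general I/O buffer size, in bytes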



On Wed, Jan 30, 2013 at 10:40 AM, Dean Wampler 
dean.wamp...@thinkbiganalytics.com wrote:

 We didn't ask yet, but to be sure, are all the slave nodes configured the
 same, both in terms of hardware and any other apps that may be running on
 them?


 On Wed, Jan 30, 2013 at 10:14 AM, Richard Nadeau strout...@gmail.comwrote:

 What do you have set in core-site.xml for io.sort.mb, io.sort.factor, and
 io.file.buffer.size? You should be able to adjust these and get past the
 heap issue. Be careful about how much RAM you have though, and don't set
 them too high.

 Rick
 On Jan 30, 2013 8:55 AM, John Omernik j...@omernik.com wrote:

 So it's filling up at the emitting stage, so I need to look at the task
 logs and/or my script that's printing to stdout as the likely culprits, I am
 guessing.



 On Wed, Jan 30, 2013 at 9:11 AM, Philip Tromans 
 philip.j.trom...@gmail.com wrote:

 That particular OutOfMemoryError is happening on one of your hadoop
 nodes. It's the heap within the process forked by the hadoop tasktracker, I
 think.

 Phil.


 On 30 January 2013 14:28, John Omernik j...@omernik.com wrote:

 So just a follow-up. I am less looking for specific troubleshooting on
 how to fix my problem, and more looking for a general understanding of 
 heap
 space usage with Hive.  When I get an error like this, is it heap space on
 a node, or heap space on my hive server?  Is it the heap space of the
 tasktracker? Heap of the job kicked off on the node?  Which heap is being
 affected? If it's not clear in my output, where can I better understand
 this? I am sorely out of my league here when it comes to understanding the
 JVM interactions of Hive and Hadoop, i.e. where hive is run, vs where task
 trackers are run etc.

 Thanks in advance!



 On Tue, Jan 29, 2013 at 7:43 AM, John Omernik j...@omernik.comwrote:

 I am running a transform script that parses through a bunch of binary
 data. In 99% of the cases it runs, it runs fine, but on certain files I get
 a failure (as seen below).  Funny thing is, I can run a job with only the
 problem source file and it will work fine, but when run as a group of files,
 I get these warnings.  I guess what I am asking here is this: Where is the
 heap error? Is this occurring on the nodes themselves or, since this is
 where the script is emitting records (and potentially large ones at that)
 and in this case my hive server running the job may be memory light, could
 the issue actually be due to heap on the hive server itself?   My setup is
 1 Hive node (that is woefully underpowered, under-memoried, and under-disk-
 I/Oed) and 4 beefy hadoop nodes.  I guess my question is: is the heap issue
 on the sender or the receiver? :)




 13-01-29 08:20:24,107 INFO org.apache.hadoop.hive.ql.io.CodecPool:
 Got brand-new compressor
 2013-01-29 08:20:24,107 INFO
 org.apache.hadoop.hive.ql.exec.SelectOperator: 12 forwarding 1 rows
 2013-01-29 08:20:24,410 INFO
 org.apache.hadoop.hive.ql.exec.ScriptOperator: 3 forwarding 10 rows
 2013-01-29 08:20:24,410 INFO
 org.apache.hadoop.hive.ql.exec.SelectOperator: 4 forwarding 10 rows
 2013-01-29 08:20:24,411 INFO
 org.apache.hadoop.hive.ql.exec.SelectOperator: 5 forwarding 10 rows
 2013-01-29 08:20:24,411 INFO
 org.apache.hadoop.hive.ql.exec.SelectOperator: 6 forwarding 10 rows
 2013-01-29 08:20:24,411 INFO
 org.apache.hadoop.hive.ql.exec.FilterOperator: 8 forwarding 10 rows
 2013-01-29 08:20:24,411 INFO
 org.apache.hadoop.hive.ql.exec.SelectOperator: 9 forwarding 10 rows
 2013-01-29 08:20:24,411 INFO
 org.apache.hadoop.hive.ql.exec.SelectOperator: 10 forwarding 10 rows
 2013-01-29 08:20:24,412 INFO
 org.apache.hadoop.hive.ql.exec.SelectOperator: 12 forwarding 10 rows
 2013-01-29 08:20:27,170 INFO
 org.apache.hadoop.hive.ql.exec.ScriptOperator: 3 forwarding 100 rows
 2013-01-29 08:20:27,170 INFO
 org.apache.hadoop.hive.ql.exec.SelectOperator: 4 forwarding 100 rows
 2013-01-29 08:20:27,170 INFO
 org.apache.hadoop.hive.ql.exec.SelectOperator: 5 forwarding 100 rows
 2013-01-29 08:20:27,171 INFO
 org.apache.hadoop.hive.ql.exec.SelectOperator: 6 forwarding 100 rows
 2013-01-29 08:20:27,171 INFO
 org.apache.hadoop.hive.ql.exec.FilterOperator: 8 forwarding 100 rows
 2013-01-29 08:20:27,171 INFO
 org.apache.hadoop.hive.ql.exec.SelectOperator: 9 forwarding 100 rows
 2013-01-29 08:20:27,171 INFO
 org.apache.hadoop.hive.ql.exec.SelectOperator: 10 forwarding 100 rows
 2013-01-29 08:20:27,171 INFO
 org.apache.hadoop.hive.ql.exec.SelectOperator: 12 forwarding 100 rows
 2013-01-29 08:21:16,247 INFO

The dreaded Heap Space Issue on a Transform

2013-01-29 Thread John Omernik
I am running a transform script that parses through a bunch of binary data.
In 99% of the cases it runs, it runs fine, but on certain files I get a
failure (as seen below).  Funny thing is, I can run a job with only the
problem source file and it will work fine, but when run as a group of files, I
get these warnings.  I guess what I am asking here is this: Where is the
heap error? Is this occurring on the nodes themselves or, since this is
where the script is emitting records (and potentially large ones at that)
and in this case my hive server running the job may be memory light, could
the issue actually be due to heap on the hive server itself?   My setup is
1 Hive node (that is woefully underpowered, under memoried, and under disk
I/Oed) and 4 beefy hadoop nodes.  I guess my question is: is the heap issue on
the sender or the receiver? :)




13-01-29 08:20:24,107 INFO org.apache.hadoop.hive.ql.io.CodecPool: Got
brand-new compressor
2013-01-29 08:20:24,107 INFO org.apache.hadoop.hive.ql.exec.SelectOperator:
12 forwarding 1 rows
2013-01-29 08:20:24,410 INFO org.apache.hadoop.hive.ql.exec.ScriptOperator:
3 forwarding 10 rows
2013-01-29 08:20:24,410 INFO org.apache.hadoop.hive.ql.exec.SelectOperator:
4 forwarding 10 rows
2013-01-29 08:20:24,411 INFO org.apache.hadoop.hive.ql.exec.SelectOperator:
5 forwarding 10 rows
2013-01-29 08:20:24,411 INFO org.apache.hadoop.hive.ql.exec.SelectOperator:
6 forwarding 10 rows
2013-01-29 08:20:24,411 INFO org.apache.hadoop.hive.ql.exec.FilterOperator:
8 forwarding 10 rows
2013-01-29 08:20:24,411 INFO org.apache.hadoop.hive.ql.exec.SelectOperator:
9 forwarding 10 rows
2013-01-29 08:20:24,411 INFO org.apache.hadoop.hive.ql.exec.SelectOperator:
10 forwarding 10 rows
2013-01-29 08:20:24,412 INFO org.apache.hadoop.hive.ql.exec.SelectOperator:
12 forwarding 10 rows
2013-01-29 08:20:27,170 INFO org.apache.hadoop.hive.ql.exec.ScriptOperator:
3 forwarding 100 rows
2013-01-29 08:20:27,170 INFO org.apache.hadoop.hive.ql.exec.SelectOperator:
4 forwarding 100 rows
2013-01-29 08:20:27,170 INFO org.apache.hadoop.hive.ql.exec.SelectOperator:
5 forwarding 100 rows
2013-01-29 08:20:27,171 INFO org.apache.hadoop.hive.ql.exec.SelectOperator:
6 forwarding 100 rows
2013-01-29 08:20:27,171 INFO org.apache.hadoop.hive.ql.exec.FilterOperator:
8 forwarding 100 rows
2013-01-29 08:20:27,171 INFO org.apache.hadoop.hive.ql.exec.SelectOperator:
9 forwarding 100 rows
2013-01-29 08:20:27,171 INFO org.apache.hadoop.hive.ql.exec.SelectOperator:
10 forwarding 100 rows
2013-01-29 08:20:27,171 INFO org.apache.hadoop.hive.ql.exec.SelectOperator:
12 forwarding 100 rows
2013-01-29 08:21:16,247 INFO org.apache.hadoop.hive.ql.exec.ScriptOperator:
3 forwarding 1000 rows
2013-01-29 08:21:16,247 INFO org.apache.hadoop.hive.ql.exec.SelectOperator:
4 forwarding 1000 rows
2013-01-29 08:21:16,247 INFO org.apache.hadoop.hive.ql.exec.SelectOperator:
5 forwarding 1000 rows
2013-01-29 08:21:16,247 INFO org.apache.hadoop.hive.ql.exec.SelectOperator:
6 forwarding 1000 rows
2013-01-29 08:21:16,248 INFO org.apache.hadoop.hive.ql.exec.FilterOperator:
8 forwarding 1000 rows
2013-01-29 08:21:16,248 INFO org.apache.hadoop.hive.ql.exec.SelectOperator:
9 forwarding 1000 rows
2013-01-29 08:21:16,248 INFO org.apache.hadoop.hive.ql.exec.SelectOperator:
10 forwarding 1000 rows
2013-01-29 08:21:16,248 INFO org.apache.hadoop.hive.ql.exec.SelectOperator:
12 forwarding 1000 rows
2013-01-29 08:25:47,532 INFO org.apache.hadoop.hive.ql.exec.ScriptOperator:
3 forwarding 1 rows
2013-01-29 08:25:47,532 INFO org.apache.hadoop.hive.ql.exec.SelectOperator:
4 forwarding 1 rows
2013-01-29 08:25:47,532 INFO org.apache.hadoop.hive.ql.exec.SelectOperator:
5 forwarding 1 rows
2013-01-29 08:25:47,532 INFO org.apache.hadoop.hive.ql.exec.SelectOperator:
6 forwarding 1 rows
2013-01-29 08:25:47,532 INFO org.apache.hadoop.hive.ql.exec.FilterOperator:
8 forwarding 1 rows
2013-01-29 08:25:47,532 INFO org.apache.hadoop.hive.ql.exec.SelectOperator:
9 forwarding 1 rows
2013-01-29 08:25:47,532 INFO org.apache.hadoop.hive.ql.exec.SelectOperator:
10 forwarding 1 rows
2013-01-29 08:25:47,532 INFO org.apache.hadoop.hive.ql.exec.SelectOperator:
12 forwarding 1 rows
2013-01-29 08:27:34,276 WARN org.apache.hadoop.hive.ql.exec.ScriptOperator:
Exception in StreamThread.run(): Java heap space
Cause: null
2013-01-29 08:27:34,277 WARN org.apache.hadoop.hive.ql.exec.ScriptOperator:
java.lang.OutOfMemoryError: Java heap space
at java.util.Arrays.copyOfRange(Arrays.java:3209)
 at java.lang.String.init(String.java:215)
at java.nio.HeapCharBuffer.toString(HeapCharBuffer.java:542)
 at java.nio.CharBuffer.toString(CharBuffer.java:1157)
at org.apache.hadoop.io.Text.decode(Text.java:350)
 at org.apache.hadoop.io.Text.decode(Text.java:327)
at org.apache.hadoop.io.Text.toString(Text.java:254)
 at java.lang.String.valueOf(String.java:2826)
at java.lang.StringBuilder.append(StringBuilder.java:115)
 at

Interaction between Java and Transform Scripts on Hive

2013-01-16 Thread John Omernik
I am perplexed: if I run a transform script on a file by itself, it runs
fine, outputs to standard out, and life is good. If I run the transform script
on that same file (with the path and filename being passed into the script
via transform, so that the python script is doing the exact same thing) I
get a java heap space error. This process works on 99% of files, and I just
can't figure out why this file is different.  How does, say, a python
transform script run in the java process (if that is even what it is
doing) such that it causes a heap error as a transform script but does not
when run without java around?

I am curious what steps I can take to troubleshoot or eliminate this
problem.


Re: Timestamp, Epoch Time, Functions and other Frustrations

2013-01-05 Thread John Omernik
Mark - I see your discussion with Mr. Harris here:

https://issues.apache.org/jira/browse/HIVE-3822

I agree that the result of the from_unixtime() function would return the ts
based on the system time, but I am struggling with cast(int as
timestamp) returning a value affected by the system time.  This does not
make sense: if we have a value that is an integer, it is timezone-less, and if
we are casting the value to a timezone-less value (timestamp), then it
should not be affected by any system timezone. This is counter-intuitive
and requires a user to set workarounds (setting the timezone of a JVM etc.)
that may cause further heartburn down the road.  I completely understand
from_unixtime() using the timezone, but not the cast.  I think the
difference is that when a date is converted to a human-readable form, it
is acceptable, even normal, to use the timezone of the system, whereas if
the conversion is to a type such as timestamp, which is by design
timezone-less, we should not apply a timezone to it (unless specified
through the helper functions).

 I am open to seeing where I am looking at things wrong.

On Fri, Jan 4, 2013 at 12:06 PM, John Omernik j...@omernik.com wrote:

 So I read that JIRA, and also found this linked JIRA:

 https://issues.apache.org/jira/browse/HIVE-3454

  So I decided to try the * 1.0 workaround.

 select
 starttime,
 from_unixtime(starttime) as unixtime,
  cast((starttime * 1.0)  as timestamp) as castts,
 from_utc_timestamp(starttime * 1.0, 'GMT') as fromtsgmt,
 from_utc_timestamp(starttime * 1.0, 'CST') as fromtscst
 from table

  Hypothesis, given starttime = 1356588013 (and based on the epoch converter
 website):

 unixtime = 2012-12-27 00:00:13 # This is because unix time displays the
 time in the system time zone
 castts = 2012-12-27 06:00:13.0  # This is because timestamp is a UTC time,
 it should match the GMT time
 fromtsgmt = 2012-12-27 06:00:13.0 # This should be exactly what the TS is
 so it should be the same as the cast
 fromtsCST =2012-12-27 00:00:13.0 # This should be the same (time based)
 result as from from_unixtime

 Actual Results:

  unixtime = 2012-12-27 00:00:13 # 1 for 1!
 castts = 2012-12-27 00:00:13.0 # What? Why is this the same as unixtime?
 fromtsgmt = 2012-12-27 00:00:13.0 # Why is THIS the same as unixtime?
 fromtscst = 2012-12-26 18:00:13.0 # This is 6 hours behind? Why did my
 epoch time get converted to timestamp as if we added 6 to the hour?

  That makes NO sense. Even ignoring the bug in the conversion requiring
 a float, am I doing this wrong, or is there a different bug in how this is
 approached?





 On Fri, Jan 4, 2013 at 10:30 AM, Mark Grover 
 grover.markgro...@gmail.comwrote:

 Brad is correct, there is a JIRA about this already:
 https://issues.apache.org/jira/browse/HIVE-3822

 Sorry for the inconvenience.

 Mark

 On Fri, Jan 4, 2013 at 8:25 AM, Brad Cavanagh brad.cavan...@gmail.com
 wrote:
  Try multiplying your values by 1000, then running the conversions. I bet
  they expect milliseconds since the epoch instead of seconds.
 
  Brad.
 
 
  On 2013-01-04, at 8:03 AM, John Omernik j...@omernik.com wrote:
 
  Greetings all. I am getting frustrated with the documentation and lack
 of
  intuitiveness in Hive relating to timestamps and was hoping I could post
  here and get some clarification or other ideas.
 
  I have a field that is a string, but is actually a 10 digit int
  representation of epoch time, I am going to list out the results of
 various
  functions.
 
  Value = 1356588013
 
  Hive:
 
  from_unixtime(Value) = 2012-12-27 00:00:13 (Timezone CST on the system
 time,
  so that works)
  cast(value as timestamp) = 1970-01-16 10:49:48.013
  cast(cast(value as int) as timestamp) = 1970-01-16 10:49:48.013
  from_utc_timestamp(starttime, 'GMT') = 1970-01-16 10:49:48.013
  from_utc_timestamp(starttime, 'CST') = 1970-01-16 04:49:48.013
 
 
  Epoch Converter - http://www.epochconverter.com/
 
  Thu, 27 Dec 2012 06:00:13 GMT - GMT Representation of the time
  Thu Dec 27 2012 00:00:13 GMT-6 - My Timezone representation
 
  Ok Given all of these representations... how do I get the Value ( a
 valid
  epoch time) into a GMT time basically, 2012-12-27 06:00:13 without just
  doing math. (Math is error prone on system as we move across timezone).
 Why
  doesn't the casting of the value to timestamp or even the casting of
 the int
  cast of the time stamp work?   Why does it read 1970?  This is very
  frustrating and should be more intuitive.  Please advise.
 
 





Re: Timestamp, Epoch Time, Functions and other Frustrations

2013-01-04 Thread John Omernik
One more test:

to_utc_timestamp(from_unixtime(value), 'CST') as to_from provided the
proper timestamp for me; however, I still had to provide the timezone, which
I should NOT have to do. I know that this data coming in is in epoch time,
therefore I should be able to create a timestamp without knowing a timezone
or timezone offset.
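
A sketch of that workaround, assuming a string column starttime holding epoch
seconds and a placeholder table name some_table; the system timezone ('CST'
here) still has to be supplied by hand:

select
  from_unixtime(cast(starttime as bigint))                          as local_time, -- rendered in the system timezone
  to_utc_timestamp(from_unixtime(cast(starttime as bigint)), 'CST') as utc_time    -- shifted back to UTC
from some_table;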



On Fri, Jan 4, 2013 at 10:03 AM, John Omernik j...@omernik.com wrote:

 Greetings all. I am getting frustrated with the documentation and lack of
 intuitiveness in Hive relating to timestamps and was hoping I could post
 here and get some clarification or other ideas.

 I have a field that is a string, but is actually a 10 digit int
 representation of epoch time, I am going to list out the results of various
 functions.

 Value = 1356588013

 Hive:

 from_unixtime(Value) = 2012-12-27 00:00:13 (Timezone CST on the system
 time, so that works)
 cast(value as timestamp) = 1970-01-16 10:49:48.013
 cast(cast(value as int) as timestamp) = 1970-01-16 10:49:48.013
 from_utc_timestamp(starttime, 'GMT') = 1970-01-16 10:49:48.013
 from_utc_timestamp(starttime, 'CST') = 1970-01-16 04:49:48.013


 Epoch Converter - http://www.epochconverter.com/

 Thu, 27 Dec 2012 06:00:13 GMT - GMT Representation of the time
 Thu Dec 27 2012 00:00:13 GMT-6 - My Timezone representation

 OK, given all of these representations... how do I get the Value (a valid
 epoch time) into a GMT time, basically 2012-12-27 06:00:13, without just
 doing math? (Math is error-prone as systems move across timezones.) Why
 doesn't casting the value to timestamp, or even casting the int cast of the
 value to timestamp, work?   Why does it read 1970?  This is very
 frustrating and should be more intuitive.  Please advise.





Re: Running commands at hive cli or hive thirft startup

2012-12-10 Thread John Omernik
Will that work for my thrift server connections?
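
A minimal sketch of what such a .hiverc (the approach suggested below) might
contain, reusing the statements from this thread; the jar path and database
name are placeholders:

-- ~/.hiverc, executed by the Hive CLI at startup
ADD JAR /path/to/analytics-udfs.jar;
CREATE TEMPORARY FUNCTION uridecode AS 'org.domain.analytics.URIDECODE';
USE mydatabase;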

On Sun, Dec 9, 2012 at 7:56 PM, विनोद सिंह vi...@vinodsingh.com wrote:

 Put a .hiverc file in your home directory containing commands, Hive CLI
 will execute all of them at startup.

 Thanks,
 Vinod

 On Sun, Dec 9, 2012 at 10:25 PM, John Omernik j...@omernik.com wrote:

 I am looking for ways to streamline some of my analytics. One thing I
 notice is that when I use hive cli, or connect to my hive thrift server,
 there are a some commands I always end up running for my session.  If I
 have multiple CLIs or connections to Thrift, then I have to run it each
 time. If I lose a connection to hive thrift, I have to run them.  Etc etc.

 My thought was, is there a way that upon opening a hive cli or connection
 to a hive thrift server, could I have certain commands be executed?

 These commands include a use command to get me to a specific database
 (perhaps there is a default database config variable?) or loading up all
 the temporary functions I use (UDFs) .

 For example, I have a UDF to do URL decoding:

 CREATE TEMPORARY FUNCTION uridecode AS 'org.domain.analytics.URIDECODE';

 Can I get this to run auto magically at hive cli start or thrift server
 connection?

 If not, could we build it in so that we can add UDFs to hive that stay in
 permanently, without doing a recompile?

 I would welcome discussion on this!






Running commands at hive cli or hive thirft startup

2012-12-09 Thread John Omernik
I am looking for ways to streamline some of my analytics. One thing I
notice is that when I use hive cli, or connect to my hive thrift server,
there are a some commands I always end up running for my session.  If I
have multiple CLIs or connections to Thrift, then I have to run it each
time. If I lose a connection to hive thrift, I have to run them.  Etc etc.

My thought was, is there a way that upon opening a hive cli or connection
to a hive thrift server, could I have certain commands be executed?

These commands include a use command to get me to a specific database
(perhaps there is a default database config variable?) or loading up all
the temporary functions I use (UDFs) .

For example, I have a UDF to do URL decoding:

CREATE TEMPORARY FUNCTION uridecode AS 'org.domain.analytics.URIDECODE';

Can I get this to run auto magically at hive cli start or thrift server
connection?

If not, could we build it in so that we can add UDFs to hive that stay in
permanently, without doing a recompile?

I would welcome discussion on this!


Re: BINARY column type

2012-12-02 Thread John Omernik
Ya, for me it's pcap data, so I had to take the data and process it out of
the pcaps into something serialized for hive anyhow.  In that case, I took
the pcaps and loaded them with a transform.  My transform script took a
single file name in on STDIN, then read the PCAP, parsed out the packets
in the format I wanted, and then took the raw data for each packet and
hexed it as it outputted it to STDOUT.  My insert statement took the
results of the pcap parsing script (including the hexed data) and then
unhexed it at insert.  There may be a better way to do this, but for me it
works well. *shrug*
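
A rough sketch of that pattern (the table, column, and script names here are
made up; the script is whatever emits the hex-encoded packet fields to STDOUT):

INSERT OVERWRITE TABLE packets
SELECT t.srcip, t.dstip, unhex(t.hexdata)  -- unhex() turns the hex string back into raw bytes at insert time
FROM (
  SELECT TRANSFORM (filename)
  USING 'pcap_parser.sh' AS (srcip, dstip, hexdata)
  FROM pcap_files
) t;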



On Sun, Dec 2, 2012 at 9:00 AM, Connell, Chuck chuck.conn...@nuance.comwrote:

  The hex idea is clever. But does this mean that the files you brought
 into Hive (with a LOAD statement) were essentially ascii (hexed), not raw
 binary?

  --
 *From:* John Omernik [j...@omernik.com]
 *Sent:* Saturday, December 01, 2012 11:58 PM

 *To:* user@hive.apache.org
 *Subject:* Re: BINARY column type

  No, I didn't remove any newline characters; newline became 0A.  By using
 perl or python in a transform, if I had "Hi how are you\n" it would become
 486920686f772061726520796f75200A.

  From there it would pass that to the unhex() function in hive in the
 insert statement. That allowed me to move the data with newline around
 easily, but on the final step (on insert) it would unhex it and put it in
 as actual binary, no bytes were harmed in the hexing (or unhexing) of my
 data.



 On Sat, Dec 1, 2012 at 4:11 PM, Connell, Chuck 
 chuck.conn...@nuance.comwrote:

  Thanks John. When you say hexed data, do you mean binary encoded to
 ascii hex? This would remove the raw newline characters.

 We considered Base64 encoding our data, a similar idea, which would also
 remove raw newlines. But my preference is to put real binary data into
 Hive, and find a way to make this work.

 Chuck

  --
 *From:* John Omernik [j...@omernik.com]
 *Sent:* Saturday, December 01, 2012 4:22 PM
 *To:* user@hive.apache.org
 *Subject:* Re: BINARY column type

   Hi Chuck -

  I've used binary columns with Newlines in the data. I used RCFile
 format for my storage method. Works great so far. Whether or not this is
 the way to get data in, I use hexed data (my transform script outputs hex
 encoded) and the final insert into the table gets a unhex(sourcedata).
  That's never been a problem for me, seems a bit hackish, but works well.

 On Sat, Dec 1, 2012 at 10:50 AM, Connell, Chuck chuck.conn...@nuance.com
  wrote:

  I am trying to use BINARY columns and believe I have the perfect
 use-case for it, but I am missing something. Has anyone used this for true
 binary data (which may contain newlines)?


  Here is the background... I have some files that each contain just one
 logical field, which is a binary object. (The files are Google Protobuf
 format.) I want to put these binary files into a larger file, where each
 protobuf is a logical record. Then I want to define a Hive table that
 stores each protobuf as one row, with the entire protobuf object in one
 BINARY column. Then I will use a custom UDF to select/query the binary
 object.


  This is about as simple as can be for putting binary data into Hive.


  What file format should I use to package the binary rows? What should
 the Hive table definition be? Which SerDe option (LazySimpleBinary?). I
 cannot use TEXTFILE, since the binary may contain newlines. Many of my
 attempts have choked on the newlines.


  Thank you,

 Chuck Connell

 Nuance

 Burlington, MA






Re: BINARY column type

2012-12-01 Thread John Omernik
No, I didn't remove any newline characters; newline became 0A.  By using
perl or python in a transform, if I had "Hi how are you\n" it would become
486920686f772061726520796f75200A.

From there it would pass that to the unhex() function in hive in the insert
statement. That allowed me to move the data with newline around easily, but
on the final step (on insert) it would unhex it and put it in as actual
binary, no bytes were harmed in the hexing (or unhexing) of my data.



On Sat, Dec 1, 2012 at 4:11 PM, Connell, Chuck chuck.conn...@nuance.comwrote:

  Thanks John. When you say hexed data, do you mean binary encoded to
 ascii hex? This would remove the raw newline characters.

 We considered Base64 encoding our data, a similar idea, which would also
 remove raw newlines. But my preference is to put real binary data into
 Hive, and find a way to make this work.

 Chuck

  --
 *From:* John Omernik [j...@omernik.com]
 *Sent:* Saturday, December 01, 2012 4:22 PM
 *To:* user@hive.apache.org
 *Subject:* Re: BINARY column type

  Hi Chuck -

  I've used binary columns with Newlines in the data. I used RCFile format
 for my storage method. Works great so far. Whether or not this is the way
 to get data in, I use hexed data (my transform script outputs hex encoded)
 and the final insert into the table gets a unhex(sourcedata).  That's never
 been a problem for me, seems a bit hackish, but works well.

 On Sat, Dec 1, 2012 at 10:50 AM, Connell, Chuck 
 chuck.conn...@nuance.comwrote:

  I am trying to use BINARY columns and believe I have the perfect
 use-case for it, but I am missing something. Has anyone used this for true
 binary data (which may contain newlines)?


  Here is the background... I have some files that each contain just one
 logical field, which is a binary object. (The files are Google Protobuf
 format.) I want to put these binary files into a larger file, where each
 protobuf is a logical record. Then I want to define a Hive table that
 stores each protobuf as one row, with the entire protobuf object in one
 BINARY column. Then I will use a custom UDF to select/query the binary
 object.


  This is about as simple as can be for putting binary data into Hive.


  What file format should I use to package the binary rows? What should
 the Hive table definition be? Which SerDe option (LazySimpleBinary?). I
 cannot use TEXTFILE, since the binary may contain newlines. Many of my
 attempts have choked on the newlines.


  Thank you,

 Chuck Connell

 Nuance

 Burlington, MA





Re: Hive table backed by a txt file on S3

2012-10-25 Thread John Omernik
Try pointing the location to the directory the file is in. If there are
other files you don't want included, make a subdir.
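
In other words, something like this (a sketch based on the DDL quoted below,
with users.txt moved into its own subdirectory):

create external table S3_users(user_name string, age int) ROW FORMAT
DELIMITED FIELDS TERMINATED BY '\t' location 's3://mybucket/users/';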

On Thu, Oct 25, 2012 at 6:26 AM, Nitin Pawar nitinpawar...@gmail.comwrote:

 In that case, it looks like when you do a select * ... it's just a cat
 operation.

 What's the error you are getting when you put something in the where clause?
 If there is no error and you are not getting any result but you expect
 a result, then the data loaded into the table is not in the format you
 wanted.

 It will help if you can share sample data and the table structure.

 On Thu, Oct 25, 2012 at 3:49 AM, Binesh Gummadi
 binesh.gumm...@gmail.com wrote:
  Folks,
 
  I have been using Hive with no issues until now with lzo and gz files
  backing Hive tables. Compressed files are stored in Amazon S3 and I am
 using
  EMR to process my data.
 
  However I am having problem pointing S3 table to a flat text file. Text
  files contains name and age separated by a tab located in
  s3://mybucket/users.txt
  Eg:
  name1 31
  name2 56
  name3 40
 
  Created hive table as follows.
  create external table S3_users(user_name string, age int) ROW FORMAT
  DELIMITED FIELDS TERMINATED BY '\t' location 's3://mybucket/';
 
  I couldn't create table pointing to a text file like so:
  create external table S3_users(user_name string, age int) ROW FORMAT
  DELIMITED FIELDS TERMINATED BY '\t' location 's3://mybucket/users.txt';
 
 
  select * from S3_users LIMIT 10; -- Works
  select * from S3_users where user_name = 'name1' --  Doesn't Work
 
  Exception is follows
  java.io.FileNotFoundException: File /users.txt does not exist
 
  What am I doing wrong here? How do I create a hive table pointing to a
 txt
  file?
 
  Thanks
  Binnyg



 --
 Nitin Pawar



Writing Custom Serdes for Hive

2012-10-16 Thread John Omernik
We have a maybe obvious question about a serde. When a serde is invoked,
does it have access to the original hive query?  Ideally the original query
could provide the Serde some hints on how to access the data on the
backend.

Also, are there any good links/documentation on how to write Serdes?  Kinda
hard to google for some reason.


Re: Writing Custom Serdes for Hive

2012-10-16 Thread John Omernik
The reason I am asking (and maybe YC reads this list and can chime in) is
that he has written a connector for MongoDB.  It's simple: basically it
connects to a MongoDB, maps columns (primitives only) to mongodb fields,
and allows you to select out of Mongo. Pretty sweet actually, and with
Mongo, things are really fast for small tables.


That being said, I noticed that his connector basically gets all rows from
a MongoDB collection every time it's run.  And we wanted to see if we
could extend it to do some simple MongoDB level filtering based on the
passed query.  Basically have a fail open approach... if it saw something
it thought it could optimize in the mongodb query to limit data, it would,
otherwise, it would default to the original approach of getting all the
data.


For example:

select * from mongo_table where name rlike 'Bobby\\sWhite'

Current method: the connection do db.collection.find() gets all the
documents from MongoDB, and then hive does the regex.

The thing we want to try: oh, one of our defined mongo columns has an rlike,
OK, send this instead: db.collection.find({name: /Bobby\sWhite/});   less data
would need to be transferred. Yes, Hive would still run the rlike on
the data... shrug, at least it's running it on far less data.   Basically
if we could determine shortcuts, we could use them.


Just trying to understand Serdes and how we are completely not using them
as intended :)




On Tue, Oct 16, 2012 at 10:42 AM, Connell, Chuck
chuck.conn...@nuance.comwrote:

  A serde is actually used the other way around… Hive parses the query,
 writes MapReduce code to solve the query, and the generated code uses the
 serde for field access.


 Standard way to write a serde is to start from the trunk regex serde, then
 modify as needed…



 http://svn.apache.org/viewvc/hive/trunk/contrib/src/java/org/apache/hadoop/hive/contrib/serde2/RegexSerDe.java?revision=1131106view=markup

 

 Also, nice article by Roberto Congiu…


 http://www.congiu.com/a-json-readwrite-serde-for-hive/


 Chuck Connell

 Nuance RD Data Team

 Burlington, MA


 *From:* John Omernik [mailto:j...@omernik.com]
 *Sent:* Tuesday, October 16, 2012 11:30 AM
 *To:* user@hive.apache.org
 *Subject:* Writing Custom Serdes for Hive


 We have a maybe obvious question about a serde. When a serde in invoked,
 does it have access to the original hive query?  Ideally the original query
 could provide the Serde some hints on how to access the data on the
 backend.  


 Also, are there any good links/documention on how to write Serdes?  Kinda
 hard to google on for some reason. 




Re: Writing Custom Serdes for Hive

2012-10-16 Thread John Omernik
AWESOME This is exactly what we were looking for. Sorry that I was looking
in the wrong spot!



On Tue, Oct 16, 2012 at 11:09 AM, shrikanth shankar sshan...@qubole.comwrote:

 I think what you need is a custom Input Format/ Record Reader. By the time
 the SerDe is called the row has been fetched. I believe the record reader
 can get access to predicates. The code to access HBase from Hive needs it
 for the same reasons as you would need with Mongo and might be a good place
 to start.

 thanks,
 Shrikanth

 On Oct 16, 2012, at 8:54 AM, John Omernik wrote:

 There reason I am asking (and maybe YC reads this list and can chime in)
 but he has written a connector for MongoDB.  It's simple, basically it
 connects to a MongoDB, maps columns (primitives only) to mongodb fields,
 and allows you to select out of Mongo. Pretty sweet actually, and with
 Mongo, things are really fast for small tables.


 That being said, I noticed that his connector basically gets all rows from
 a Mongo DB collection every time it's ran.  And we wanted to see if we
 could extend it to do some simple MongoDB level filtering based on the
 passed query.  Basically have a fail open approach... if it saw something
 it thought it could optimize in the mongodb query to limit data, it would,
 otherwise, it would default to the original approach of getting all the
 data.


 For example:

 select * from mongo_table where name rlike 'Bobby\\sWhite'

 Current method: the connection do db.collection.find() gets all the
 documents from MongoDB, and then hive does the regex.

 Thing we want to try Oh one of our defined mongo columns has a rlike, ok
 send this instead: db.collection.find(name:/Bobby\sWhite);   less data
 that would need to be transfered. Yes, Hive would still run the rlike on
 the data... shrug at least it's running it on far less data.   Basically
 if we could determine shortcuts, we could use them.


 Just trying to understand Serdes and how we are completely not using them
 as intended :)




 On Tue, Oct 16, 2012 at 10:42 AM, Connell, Chuck chuck.conn...@nuance.com
  wrote:

  A serde is actually used the other way around… Hive parses the query,
 writes MapReduce code to solve the query, and the generated code uses the
 serde for field access.


 Standard way to write a serde is to start from the trunk regex serde,
 then modify as needed…



 http://svn.apache.org/viewvc/hive/trunk/contrib/src/java/org/apache/hadoop/hive/contrib/serde2/RegexSerDe.java?revision=1131106view=markup

 

 Also, nice article by Roberto Congiu…


 http://www.congiu.com/a-json-readwrite-serde-for-hive/


 Chuck Connell

 Nuance RD Data Team

 Burlington, MA


 *From:* John Omernik [mailto:j...@omernik.com]
 *Sent:* Tuesday, October 16, 2012 11:30 AM
 *To:* user@hive.apache.org
 *Subject:* Writing Custom Serdes for Hive


 We have a maybe obvious question about a serde. When a serde in invoked,
 does it have access to the original hive query?  Ideally the original query
 could provide the Serde some hints on how to access the data on the
 backend.  


 Also, are there any good links/documention on how to write Serdes?  Kinda
 hard to google on for some reason. 







Hive Issues RCFile - Binary fields Corruption or field storage issues?

2012-10-15 Thread John Omernik
I am putting binary data into binary columns in hive and using RCFile. Most
data is just fine in my very large table, however queries over certain time
frames get me RCFile/Compression issues.  The data goes in fine.  Is this a
FS level corruption issue? Is this something tunable? How would I even go
about troubleshooting something like this?

Hive Runtime Error while processing writable SEQ
-org.apache.hadoop.hive.ql.io.RCFile$KeyBuffer/org.apache.hadoop.hive.ql.io.RCFile$ValueBuffer
'org.apache.hadoop.io.compress.GzipCodec hive.io.rcfile.column.number 72 T6J


Re: NEED HELP in Hive Query

2012-10-14 Thread John Omernik
select NAME, DATE, URL, SUM(HITCOUNT) as HITCOUNT from yourtable group by
NAME, DATE, URL

That's the HIVE answer. Not sure the PIG answer.




On Sun, Oct 14, 2012 at 9:54 AM, yogesh dhari yogeshdh...@live.com wrote:

  Hi all,

 I have this file. I want to perform this operation in *HIVE & PIG*

 NAME            DATE        URL                                                      HITCOUNT
 timesascent.in  2008-08-27  http://timesascent.in/index.aspx?page=tparchives         15
 timesascent.in  2008-08-27  http://timesascent.in/index.aspx?page=articlesectid=1contentid=200812182008121814134447219270b26  20
 timesascent.in  2008-08-27  http://timesascent.in/                                   37
 timesascent.in  2008-08-27  http://timesascent.in/section/39/Job%20Wise              14
 timesascent.in  2008-08-27  http://timesascent.in/article/7/2011062120110621171709769aacc537/Work-environment--Employee-productivity.html  20
 timesascent.in  2008-08-27  http://timesascent.in/                                   17
 timesascent.in  2008-08-27  http://timesascent.in/section/2/Interviews               15
 timesascent.in  2008-08-27  http://timesascent.in/                                   17
 timesascent.in  2008-08-27  http://timesascent.in/                                   27
 timesascent.in  2008-08-27  http://timesascent.in/                                   37
 timesascent.in  2008-08-27  http://timesascent.in/                                   27
 timesascent.in  2008-08-27  http://www.timesascent.in/                               16
 timesascent.in  2008-08-27  http://timesascent.in/section/2/Interviews               14
 timesascent.in  2008-08-27  http://timesascent.in/                                   14
 timesascent.in  2008-08-27  http://timesascent.in/                                   22


 I want to *add all HITCOUNT for the same NAME, DATE & URL*

 like

 timesascent.in  2008-08-27  http://timesascent.in/  (addition of
 all HITCOUNT under the same name, date, url: 37+17+17+27+...)

 Please suggest me is there any method to perform this query.


 Thanks  Regards
 Yogesh Kumar






Custom Serde/Connector Null Pointer Exception

2012-10-13 Thread John Omernik
Greetings all. I am not sure if this is a hive issue or a custom serde
issue. So I will ask in both places.  I am trying to use the MongoDB
connector written by yc-huang.  This could have great potential with our
data. The link is here.

https://github.com/yc-huang/Hive-mongo


I followed the instructions for compiling and running the first CREATE TABLE.
I have verified that the mongo db on 192.168.0.11 is exposed on the proper
port and that the referenced collection exists.

That being said, this is a hard one for me, a non-java expert, to
troubleshoot, and I wanted to get thoughts on whether this was a connector
problem or a hive issue. I am running Hive 0.9.0 on MapR.  Note: all other
aspects of Hive are working for me right now.  I look forward to any
thoughts.

Thanks!






hive> create external table mongo_users(name STRING, age INT)
 COMMENT 'Ya boy'
 stored by "org.yong3.hive.mongo.MongoStorageHandler"
 with serdeproperties ( "mongo.column.mapping" = "name,age" )
 tblproperties (
"mongo.host"="192.168.0.11","mongo.port"="27017","mongo.db"="test","mongo.collection"="users");
12/10/13 14:30:13 INFO ql.Driver: PERFLOG method=Driver.run
12/10/13 14:30:13 INFO ql.Driver: PERFLOG method=compile
12/10/13 14:30:13 DEBUG parse.VariableSubstitution: Substitution is on:
create external table mongo_users(name STRING, age INT)
COMMENT 'Ya boy'
stored by org.yong3.hive.mongo.MongoStorageHandler
with serdeproperties ( mongo.column.mapping = name,age )
tblproperties (
mongo.host=192.168.0.11,mongo.port=27017,mongo.db=test,mongo.collection=users)
12/10/13 14:30:13 INFO parse.ParseDriver: Parsing command: create external
table mongo_users(name STRING, age INT)
COMMENT 'Ya boy'
stored by org.yong3.hive.mongo.MongoStorageHandler
with serdeproperties ( mongo.column.mapping = name,age )
tblproperties (
mongo.host=192.168.0.11,mongo.port=27017,mongo.db=test,mongo.collection=users)
12/10/13 14:30:13 INFO parse.ParseDriver: Parse Completed
12/10/13 14:30:13 INFO parse.SemanticAnalyzer: Starting Semantic Analysis
12/10/13 14:30:13 INFO parse.SemanticAnalyzer: Creating table mongo_users
position=22
12/10/13 14:30:13 INFO ql.Driver: Semantic Analysis Completed
12/10/13 14:30:13 DEBUG parse.SemanticAnalyzer: validation start
12/10/13 14:30:13 INFO ql.Driver: Returning Hive schema:
Schema(fieldSchemas:null, properties:null)
12/10/13 14:30:13 INFO ql.Driver: /PERFLOG method=compile
start=1350156613478 end=1350156613635 duration=157
12/10/13 14:30:13 INFO ql.Driver: PERFLOG method=Driver.execute
12/10/13 14:30:13 INFO ql.Driver: Starting command: create external table
mongo_users(name STRING, age INT)
COMMENT 'Ya boy'
stored by org.yong3.hive.mongo.MongoStorageHandler
with serdeproperties ( mongo.column.mapping = name,age )
tblproperties (
mongo.host=192.168.0.11,mongo.port=27017,mongo.db=test,mongo.collection=users)
12/10/13 14:30:13 INFO exec.DDLTask: Use StorageHandler-supplied
org.yong3.hive.mongo.MongoSerDe for table mongo_users
12/10/13 14:30:13 INFO hive.log: DDL: struct mongo_users { string name, i32
age}
FAILED: Error in metadata: java.lang.NullPointerException
12/10/13 14:30:13 ERROR exec.Task: FAILED: Error in metadata:
java.lang.NullPointerException
org.apache.hadoop.hive.ql.metadata.HiveException:
java.lang.NullPointerException
 at org.apache.hadoop.hive.ql.metadata.Hive.createTable(Hive.java:544)
at org.apache.hadoop.hive.ql.exec.DDLTask.createTable(DDLTask.java:3305)
 at org.apache.hadoop.hive.ql.exec.DDLTask.execute(DDLTask.java:242)
at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:134)
 at
org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:57)
at org.apache.hadoop.hive.ql.Driver.launchTask(Driver.java:1326)
 at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:1118)
at org.apache.hadoop.hive.ql.Driver.run(Driver.java:951)
 at org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:258)
at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:215)
 at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:406)
at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:689)
 at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:557)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
 at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.util.RunJar.main(RunJar.java:197)
Caused by: java.lang.NullPointerException
at
org.apache.hadoop.hive.serde2.objectinspector.StandardStructObjectInspector.init(StandardStructObjectInspector.java:116)
 at
org.apache.hadoop.hive.serde2.objectinspector.StandardStructObjectInspector.init(StandardStructObjectInspector.java:106)
at
org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorFactory.getStandardStructObjectInspector(ObjectInspectorFactory.java:274)
 at

Re: Book 'Programming Hive' from O'Reilly now available!

2012-10-11 Thread John Omernik
Read the book cover to cover.

I feel like so many different areas have been filled in with my Hive
knowledge. WONDERFUL book.

One question:  With RCFile, does block compression with GZIP work like it
does for SequenceFiles, where by using RCFILE + BLOCK + GZIP you actually get
some of the benefits of splittable compression due to how the records are
compressed?

(The reason for the question is that it was one question I had after reading
the sections on Sequence files and RCFile)
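
For reference, a sketch of the settings typically used to write such
gzip-compressed RCFile data from Hive (Hadoop 1.x / Hive 0.x property names;
whether the block-level benefits carry over to RCFile is exactly the question
above):

set hive.exec.compress.output=true;                                           -- compress the job's final output
set mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;  -- use the gzip codec
set mapred.output.compression.type=BLOCK;                                     -- block compression (mainly meaningful for SequenceFile output)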




On Wed, Oct 10, 2012 at 4:23 AM, Alexander Lorenz wget.n...@gmail.comwrote:

 Awesome! Order placed.

 Great stuff Ed and Mark!

 On Oct 10, 2012, at 11:06 AM, Navis류승우 navis@nexr.com wrote:

  Great works!
 
  And I've heard our team will translate your book into Korean. Let's sell
 it
  a lot. ^^
 
  Regards,
  Navis
 
  2012/10/1 Aniket Mokashi aniket...@gmail.com
 
  +1. Great work guys. Congrats!
  I just placed an order.
 
  ~Aniket
 
 
  On Sun, Sep 30, 2012 at 11:37 AM, varun kumar varun@gmail.com
 wrote:
 
  Hi Edward,
 
  May i know the password to open the pdf file.
 
  Regards,
  Varun
 
  On Sun, Sep 30, 2012 at 5:21 AM, Edward Capriolo 
 edlinuxg...@gmail.comwrote:
 
  Hello all,
 
  I wanted to let you know that Programming Hive from O'Reilly is now
  available!
 
  http://shop.oreilly.com/product/0636920023555.do
 
  I could not have said enough in the dedication about how absolutely
  awesome Hive and Hadoop are! It is a great joy to work with hive and
  to get to write the book.
 
  One of the things I love about the book is that it had three authors,
  five case studies, and two entire sections written by others. This is
  much like the code base itself, which has a number of major
  contributions by a number of people and organizations.
 
  There is a long run down of people to thank for both the great work in
  hive as well as those who supported my involvement. I hope the book
  does all your great work justice!
 
  Thank you,
  Edward
 
 
 
 
  --
  Regards,
  Varun Kumar.P
 
 
 
 
  --
  ...:::Aniket:::... Quetzalco@tl
 

 --
 Alexander Alten-Lorenz
 http://mapredit.blogspot.com
 German Hadoop LinkedIn Group: http://goo.gl/N8pCF




Re: View Partition Pruning not Occurring during transform

2012-10-11 Thread John Omernik
I did try nesting, the problem is that I am trying to do it in a view and I
think something gets lost in translation...
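
The nested-from rewrite Edward suggests below would look something like this
(a sketch only, reusing the table and partition from the original post):

SELECT TRANSFORM (t.day, t.ip) USING 'cat' AS (day, ip)
FROM (
  SELECT day, ip
  FROM source_table
  WHERE day = '2012-10-08'  -- pruning happens here, before rows reach the transform
) t;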

On Thu, Oct 11, 2012 at 8:32 AM, Edward Capriolo edlinuxg...@gmail.comwrote:

 Have you considered rewriting the query using nested from clauses.
 Generally if hive is not 'pushing down' as you would assume nesting froms
 make the query happen in a specific way.


 On Wednesday, October 10, 2012, John Omernik j...@omernik.com wrote:
  Agreed. That's the conclusion we came to as well. So it's less of a bug
 and more of a feature request. I think one of the main advantages of hive
 is the flexibility in allowing non-technical users to run basic queries
 without having to think about the transform stuff. (i.e. we in the IT shop
 can setup the transform)  I like the annotation idea that some how the
 partition specs can be pushed through (identified in some other way etc).
  I am new to the Apache/JIRA world, what would you recommend for getting
 this into a feature request for consideration? I am not a Java programmer,
 so my idea may need to be paired with a champion to help implement it :)
 
 
  On Wed, Oct 10, 2012 at 3:24 PM, shrikanth shankar sshan...@qubole.com
 wrote:
 
  I assume the reason for this is that the Hive compiler has no way of
 determining that the 'day' that is input into the transform script is the
 same 'day' that is output from the transform script. Even if it did, its
 unclear if pushing down would be legal without knowing the semantics of the
 transformation. Any optimization to be done here will likely need an
 annotation somewhere to say that certain columns in the output of a
 transform refer to specific columns in the input of a transform for
 predicate push down purposes (and that such pushdown is legal for this
 transformation)
 
  thanks,
  Shrikanth
  On Oct 10, 2012, at 12:04 PM, John Omernik wrote:
 
   Greetings all, I am trying to incorporate a TRANSFORM into a view (so
 we can abstract the transform script away from the user)
  
  
  
   As a Test, I have a table partitioned on day (in -MM-DD formated)
 with lots of partitions
  
   and I tried this
  
   CREATE VIEW view_transform as
   Select TRANSFORM (day, ip) using 'cat' as (day, ip) from source_table;
  
   The reason I used 'cat' in my test is if this works, I will
 distribute my transform scripts to each node manually, I know each node has
 cat, so this works as a test.
  
   When run
  
   SELECT * from view_transform where day = '2012-10-08'  10,432 map
 tasks get spun up.
  
   If I rewrite the view to be
  
   CREATE VIEW view_transform as
   Select TRANSFORM (day, ip) using 'cat' as (day, ip) from source_table
 where day = '2012-10-08';
  
   Then only 16 map tasks get spun up (the desired behavior, but the
 pruning is happening in the view not in the query)
  
   Thus I wanted input on whether this should be considered a bug.  I.e.
 Should we be able to define a partition spec in a view that uses a
 transform that allows normal pruning to occur even though the partition
 spec will be passed to the transfrom script?  I think we should, and it's
 likely doable some how. This would be awesome for a number of situations
 where you may want to expose transformed data to analysis without the
 mess of having them format their script for transform.
  
  
 
 
 



View Partition Pruning not Occurring during transform

2012-10-10 Thread John Omernik
Greetings all, I am trying to incorporate a TRANSFORM into a view (so we
can abstract the transform script away from the user)



As a test, I have a table partitioned on day (in YYYY-MM-DD format) with
lots of partitions

and I tried this

CREATE VIEW view_transform as
Select TRANSFORM (day, ip) using 'cat' as (day, ip) from source_table;

The reason I used 'cat' in my test is if this works, I will distribute my
transform scripts to each node manually, I know each node has cat, so this
works as a test.

When run

SELECT * from view_transform where day = '2012-10-08'  10,432 map tasks get
spun up.

If I rewrite the view to be

CREATE VIEW view_transform as
Select TRANSFORM (day, ip) using 'cat' as (day, ip) from source_table where
day = '2012-10-08';

Then only 16 map tasks get spun up (the desired behavior, but the pruning
is happening in the view not in the query)

Thus I wanted input on whether this should be considered a bug.  I.e.,
should we be able to define a partition spec in a view that uses a
transform and have normal pruning occur, even though the partition
spec will be passed to the transform script?  I think we should, and it's
likely doable somehow. This would be awesome for a number of situations
where you may want to expose transformed data for analysis without the
mess of having users format their script for transform.


Re: View Partition Pruning not Occurring during transform

2012-10-10 Thread John Omernik
Agreed. That's the conclusion we came to as well. So it's less of a bug and
more of a feature request. I think one of the main advantages of hive is
the flexibility in allowing non-technical users to run basic queries
without having to think about the transform stuff. (i.e. we in the IT shop
can setup the transform)  I like the annotation idea that some how the
partition specs can be pushed through (identified in some other way etc).
 I am new to the Apache/JIRA world, what would you recommend for getting
this into a feature request for consideration? I am not a Java programmer,
so my idea may need to be paired with a champion to help implement it :)



On Wed, Oct 10, 2012 at 3:24 PM, shrikanth shankar sshan...@qubole.comwrote:

 I assume the reason for this is that the Hive compiler has no way of
 determining that the 'day' that is input into the transform script is the
 same 'day' that is output from the transform script. Even if it did, its
 unclear if pushing down would be legal without knowing the semantics of the
 transformation. Any optimization to be done here will likely need an
 annotation somewhere to say that certain columns in the output of a
 transform refer to specific columns in the input of a transform for
 predicate push down purposes (and that such pushdown is legal for this
 transformation)

 thanks,
 Shrikanth
 On Oct 10, 2012, at 12:04 PM, John Omernik wrote:

  Greetings all, I am trying to incorporate a TRANSFORM into a view (so we
 can abstract the transform script away from the user)
 
 
 
  As a Test, I have a table partitioned on day (in -MM-DD formated)
 with lots of partitions
 
  and I tried this
 
  CREATE VIEW view_transform as
  Select TRANSFORM (day, ip) using 'cat' as (day, ip) from source_table;
 
  The reason I used 'cat' in my test is if this works, I will distribute
 my transform scripts to each node manually, I know each node has cat, so
 this works as a test.
 
  When run
 
  SELECT * from view_transform where day = '2012-10-08'  10,432 map tasks
 get spun up.
 
  If I rewrite the view to be
 
  CREATE VIEW view_transform as
  Select TRANSFORM (day, ip) using 'cat' as (day, ip) from source_table
 where day = '2012-10-08';
 
  Then only 16 map tasks get spun up (the desired behavior, but the
 pruning is happening in the view not in the query)
 
  Thus I wanted input on whether this should be considered a bug.  I.e.
 Should we be able to define a partition spec in a view that uses a
 transform that allows normal pruning to occur even though the partition
 spec will be passed to the transfrom script?  I think we should, and it's
 likely doable some how. This would be awesome for a number of situations
 where you may want to expose transformed data to analysis without the
 mess of having them format their script for transform.
 
 




Hive File Sizes, Merging, and Splits

2012-09-25 Thread John Omernik
I am really struggling trying to make heads or tails out of how to optimize
the data in my tables for the best query times.  I have a partition of
compressed (Gzip) RCFile data in two files:

total 421877
263715 -rwxr-xr-x 1 darkness darkness 270044140 2012-09-25 13:32 00_0
158162 -rwxr-xr-x 1 darkness darkness 161956948 2012-09-25 13:32 01_0



No matter what I set my split settings to prior to the job, I always get
three mappers.  My block size is 268435456, but the setting doesn't seem to
change anything. I can set the split size huge or small with no apparent
effect on the data.


I know there are many esoteric items here, but is there any good
documentation on setting these things to make my queries on this data more
efficient? I am not sure why it needs three map tasks on this data; it
should really just grab two mappers. Not to mention, I thought gzip wasn't
splittable anyhow.  So, from that standpoint, how does it even send data to
three mappers?  If you know of some secret cache of documentation for hive,
I'd love to read it.

Thanks
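
For reference, a sketch of the knobs that usually govern the split count
(Hadoop 1.x / Hive 0.x property names, values illustrative). RCFile compresses
within each row group, so a gzip codec should not, by itself, make the file
monolithic the way a gzipped text file is:

set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;  -- allow small files/splits to be combined
set mapred.min.split.size=268435456;  -- raise the lower bound to discourage tiny splits
set mapred.max.split.size=268435456;  -- upper bound on split size, in bytes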


Hive Transform Scripts Ending Cleanly

2012-09-21 Thread John Omernik
Greetings All -

I have a transform script that does some awesome stuff (at least to my eyes).

Basically, here is the SQL


  SELECT TRANSFORM (filename)
  USING 'worker.sh' as (col1, col2, col3, col4, col5)
  FROM mysource_filetable


worker.sh is actually a wrapper script that

looks like this:

#!/bin/bash

# Read one file path per line from TRANSFORM's stdin and hand each to the
# parser, which writes its rows to stdout for Hive to pick up.
while read line; do
    filename="$line"
    python /mnt/node_scripts/parser.py -i "$filename" -o STDOUT
done

The reason for calling the python script from a bash wrapper is so I
can read off stdin, process the data, and then shoot it off to standard
OUT.  There are some other reasons... but it works great, most of the time.

Sometimes, for whatever reason, we have a situation where the hive
listener (I don't know what else to call it) gets bored listening for
data. The python script can take a long time depending on the data being
sent to it.  It gives up listening for STDOUT, the task times out, and the
job retries that file somewhere else, where it succeeds. No big deal.
However, the python script and the java that's calling it seem to still be
running, using up resources. If it doesn't exit cleanly, it kinda wigs out
and goes on to TRANSFORM THE WORLD (said in a loud echoing booming voice).
 Anywho, just curious if there are ways I can monitor for that. Perhaps
check for things in my worker.sh, maybe run python directly from hive?
Settings in hive that will force-kill the runaways?  Transform and its
capabilities are AWESOME, but like much in hive, the documentation is all
over the place.
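
One related knob (a sketch, not a fix for the orphaned python processes): the
timeout after which a task that reports no progress is killed and retried can
be set per job from Hive:

set mapred.task.timeout=1200000;  -- milliseconds of silence before the tasktracker kills the task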


Hive job not distributing

2012-09-01 Thread John Omernik
I have a job that has lots of tiny map tasks that finish very fast (I think
my max time was 9 seconds). I understand that I should change my input to
avoid that... and it's difficult because these processes are using a
transform script on binary data, so it makes it difficult to pull off (long
story). That being said, the cluster isn't distributing the job. Only one
node is running the map tasks. They are finishing quickly, but I am just
wondering what would cause that... I know the job tracker and task trackers
on the other nodes are running, it's just odd.


Force number of records per map task

2012-08-31 Thread John Omernik
This is going to sound very odd, but I am hoping to use a transform script
in such a way that I pass a filepath to the transform script, which then
reads the file and produces a bunch of rows in hive.  In this case the data
is pcaps.  I have a location accessible to all nodes, and I want to have my
transform script read in a file location and then spit out, for example,
the IP addresses that were seen in the packet capture (using a script I've
already written).   Can I do something whereby I load my file locations
into a table in hive (one file per row), read that table into a
transform script, and have only one map task per source row?  I don't want
my script to parse several files. It may make for some poor
parallelization, but I am having trouble forcing such a small record count
per map task.
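
To make the idea concrete, this is roughly the shape of what I am after
(table, column, and script names are made up for the example, and I don't
know whether forcing a tiny split size is even the right lever, which is
really the question):

-- One file path per row; hypothetical names, just to illustrate the pattern
CREATE TABLE pcap_paths (filepath STRING);

-- Goal: one map task per row, so each pcap is parsed by exactly one script invocation
SET mapred.max.split.size=1;   -- my (possibly naive) attempt to force tiny splits
SELECT TRANSFORM (filepath)
USING 'pcap_to_ips.sh' AS (srcip, dstip)
FROM pcap_paths;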

Thoughts?


Troubles with Heap Space Issues on Insert

2012-08-29 Thread John Omernik
I am running some data that isn't huge per se, but I am performing
processing on it to get it into my final table (RCFile).

One of the challenges is that it comes in large blocks of data; for
example, I may have a 70 MB chunk of binary data that I want to put in. My
process that generates this data hexes it, so that 70 MB becomes a 140 MB
string of data. Then when I insert into the binary field I use unhex.  Now,
my nodes are not huge: I have 8 nodes with 6 GB of RAM each.  A typical load
reads the hex-encoded data from an external load table and then inserts it
(no joins etc.).  Most data loads fine, but when I get chunks above 32 MB in
raw size I'll get failures.  I am working on getting some adjustments made
to my source data to minimize those large chunks.

That being said, what are some things I can do at the hive/insert level
that can reduce the heap space issues? I've tried playing with split size,
reusing JVMs, and heap space.  But it's all trial and error, and I'd like
to have more real-world examples of conditions where one setting makes
sense and another does not.  I am not looking for a googling here, just
some examples (even links to examples) showing that with this type of data
or setup, you can get less memory usage, faster performance, etc. by
tweaking these settings.   I think my issue is that there are so many
settings that say do this or that, and they don't really provide real-world
examples, which makes it tough to know where to start.
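
For context, the kinds of knobs I have been twiddling look like this (the
values are just where I happen to be right now, not a recommendation):

-- Trial-and-error settings I have been playing with for the heap-heavy inserts
SET mapred.child.java.opts=-Xmx2048m;   -- bigger task heap (limited by the 6 GB nodes)
SET mapred.job.reuse.jvm.num.tasks=1;   -- no JVM reuse, so each task starts with a fresh heap
SET mapred.max.split.size=134217728;    -- smaller splits, so less data lands in any one task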


Re: Join with OR condition in hive

2012-08-29 Thread John Omernik
How would you join two tables on a condition that doesn't reference both
sides of the = (a.col2 = 0 only touches table a)?  Can you describe a bit
more of what you are trying to get out of the data? I am having a hard time
wrapping my head around this...




On Wed, Aug 29, 2012 at 4:44 PM, sonia gehlot sonia.geh...@gmail.com wrote:

 Hi All,

 I am joining 2 tables in hive, with or condition. for example:

 select blah
 from table a
 Join table b
 *on (a.col1 = b.col2 or a.col2 = 0)*
 but this is giving me an error that OR is not supported in hive currently.

 Any suggestion how I can handle this in hive query.

 Thanks,
 Sonia



Hive 0.9 and Indexing

2012-07-26 Thread John Omernik
I am playing with Hive indexing and am a little discouraged by the gap
between the potential I see and the amount of documentation around it. I am
running Hive 0.9 and started playing with indexing as follows:

I have a table, logs, that has a bunch of fields, but for this let's say
three: sessionutc, srcip, and dstip, partitioned by day.

CREATE TABLE logs (sessionutc STRING, srcip STRING, dstip STRING)
PARTITIONED BY (day STRING);

The field I am hoping to index is srcip, so I created this:

CREATE INDEX idx_srcip ON TABLE logs(srcip) as
'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler' WITH DEFERRED
REBUILD;

ALTER INDEX idx_srcip ON logs REBUILD;

This ALTER INDEX ran overnight. Also be warned, the data in my table is
522 GB.  The building of the index took all night and wrote a ton of data
(49 GB) to the hive history file in the CLI... that just ain't right if you
ask me. :) Can we limit that somehow?

The actual index table ended up being around 1.8 GB on 522 GB of data. That
wasn't too bad, I guess.

Then time for queries... I thought Hive 0.9 just applied indexes to
queries automatically, so the old-school DB guy in me just ran a query:

SELECT sessionutc, srcip, dstip FROM logs WHERE srcip='127.0.0.1'

It started running 1028 map tasks... obviously not using an index, takes
forever, hmmm...

So I google, and find this:

INSERT OVERWRITE DIRECTORY '/tmp/index_result'
SELECT `_bucketname`, `_offsets` FROM web__logs_idx_srcip__ WHERE srcip = '127.0.0.1';
SET hive.index.compact.file=/tmp/index_result;
SET hive.input.format=org.apache.hadoop.hive.ql.index.compact.HiveCompactIndexInputFormat;

SELECT sessionutc, srcip, dstip FROM logs WHERE srcip='127.0.0.1';

43 map tasks, done quickly. Indexes WORK! WOOO.

So, do indexes work when you have a non-indexed field in your WHERE clause?

Hypothesis: if you AND another predicate with your indexed field, it will
still hit the index (like partition pruning).

SELECT sessionutc, srcip, dstip FROM logs WHERE srcip='127.0.0.1' and
sessionutc like '2012-04-04%'
43 map tasks, quick, hits the index. Hypothesis: true.

Hypothesis: if you OR another predicate with your indexed field, it will
not hit the index (like partition pruning).
SELECT sessionutc, srcip, dstip FROM logs WHERE srcip='127.0.0.1' or
sessionutc like '2012-04-04%'
43 map tasks, quick, hits the index. Hypothesis: false.

What the heck? HOW DOES THAT WORK? That seems impossible. The data returned
really does include rows that don't match srcip = 127.0.0.1, and it comes
from all partitions.  I am so confused there.

OK, other questions:

How are indexes updated?  How does INSERT OVERWRITE or INSERT APPEND affect
currently built indexes? Are indexes rebuilt automatically, or do you have
to run a separate job to rebuild the indexes on partitions that you've added
data to? If index updating is automagic, does it slow INSERT times? I need
to do some testing here.
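
(If it turns out rebuilds are not automatic, I assume the per-partition
rebuild would look something like this, though I haven't verified the
syntax against 0.9:)

-- Assumed syntax for rebuilding the index on just one newly loaded partition
ALTER INDEX idx_srcip ON logs PARTITION (day='2012-04-04') REBUILD;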

Also: is there a better way to enable index hits automagically, without
having to do the separate index query and the two SET commands prior to the
query I want to hit the index? I'd like to utilize the indexes to help our
operations staff write better queries (read: use less cluster resources),
but teaching that process may be difficult.
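
I have seen a property mentioned that is supposed to turn on automatic
index use, but I haven't confirmed whether it covers the compact index
handler in 0.9:

-- Reportedly enables automatic use of indexes at query time (unverified on 0.9)
SET hive.optimize.index.filter=true;
SET hive.optimize.index.filter.compact.minsize=0;   -- consider the index even for small inputs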

Are there any other good, non-googlable sources of information on indexes?
The reason I posted my results and my questions here is the general lack of
information around this topic.


Re: Searching for a string off a group by query

2012-07-17 Thread John Omernik
Then I think array_contains is your best bet.  See my example query
below.  I used a subquery (always handy), and I replaced count(activityId)
in your SQL with count(1).  The results returned are exactly the same (at
least in my data set), and for some reason my cluster seemed to run faster
with count(1); I am not sure why. It also didn't make sense to me to count
the field I am grouping on; it's confusing whether you want the count of
rows in each group or something else (obviously the row count per group is
what's returned, but a person reading/viewing your query may be confused).

Original
select activityId, count(activityId), *find_in_set(CCC,
collect_set(msgBody))* from ActivityStream group by activityId;

Suggested:
select * from (
  select
    activityId,
    count(1) as activityid_count,
    collect_set(msgBody) as msg_bodies
  from
    ActivityStream
  group by activityId
) b where array_contains(msg_bodies, 'CCC')


On Tue, Jul 17, 2012 at 4:32 AM, Tharindu Mathew mcclou...@gmail.com wrote:

 I'm basically trying to search for a string (ex: CCC) in a set of
 strings (ex: collect_set(msgBody)) that comes as a result of a group
 by query.


 On Tue, Jul 17, 2012 at 8:50 AM, John Omernik j...@omernik.com wrote:

 Not sure what you are trying to do, but you may want to check out the
 array_contains function. Also, if you are using Hive 9 you can use the
 concat_ws() function.  This is taken from a google search:

 select concat_ws('.', array('www','apache','org')) from src limit 1;
 www.apache.org


 https://cwiki.apache.org/Hive/presentations.data/WhatsNewInHive090HadoopSummit2012BoF.pdf

 On the array_contains:


 https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-CollectionFunctions

 boolean array_contains(Array<T>, value): Returns TRUE if the array
 contains the value.



 On Mon, Jul 16, 2012 at 3:56 PM, Tharindu Mathew mcclou...@gmail.com wrote:

 Hi everyone,

 I'd like to do $subject and was approaching it with the following query:

 select activityId, count(activityId), *find_in_set(CCC,
 collect_set(msgBody))* from ActivityStream group by activityId;

 But find_in_set doesn't seem to accept arrays. Is there a way to cast
 this string array into a string list or a string so I can conduct a search?
 Maybe, there's another way to do this.

 Thanks in advance.

 --
 Regards,

 Tharindu

 blog: http://mackiemathew.com/





 --
 Regards,

 Tharindu

 blog: http://mackiemathew.com/




Re: Searching for a string off a group by query

2012-07-16 Thread John Omernik
Not sure what you are trying to do, but you may want to check out the
array_contains function. Also, if you are using Hive 9 you can use the
concat_ws() function.  This is taken from a google search:

select concat_ws('.', array('www','apache','org')) from src limit 1;
www.apache.org

https://cwiki.apache.org/Hive/presentations.data/WhatsNewInHive090HadoopSummit2012BoF.pdf

On the array_contains:

https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-CollectionFunctions

boolean array_contains(Array<T>, value): Returns TRUE if the array
contains the value.



On Mon, Jul 16, 2012 at 3:56 PM, Tharindu Mathew mcclou...@gmail.com wrote:

 Hi everyone,

 I'd like to do $subject and was approaching it with the following query:

 select activityId, count(activityId), *find_in_set(CCC,
 collect_set(msgBody))* from ActivityStream group by activityId;

 But find_in_set doesn't seem to accept arrays. Is there a way to cast this
 string array into a string list or a string so I can conduct a search?
 Maybe, there's another way to do this.

 Thanks in advance.

 --
 Regards,

 Tharindu

 blog: http://mackiemathew.com/




Re: Anything wrong with this query?

2012-07-14 Thread John Omernik
LEFT JOIN and RIGHT JOIN on their own are invalid in Hive; you need LEFT
OUTER JOIN, RIGHT OUTER JOIN, or just JOIN (there is no INNER JOIN keyword).
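
As for the OR itself, Hive only accepts equality conditions in the ON
clause right now, so one workaround, and only if inner-join semantics are
actually acceptable (moving the predicate to WHERE throws away the
unmatched left-side rows), is something like:

SELECT *
FROM   TestingTable1 tt1 JOIN TestingTable2 tt2
ON tt1.buyer_id = tt2.user_id
WHERE tt1.item_id = tt2.product_id
   OR unix_timestamp(tt1.created_time) = tt2.timestamps;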



On Fri, Jul 13, 2012 at 11:06 AM, Raihan Jamal jamalrai...@gmail.com wrote:

 Whenever I try to run this below query, I always get error as OR is not
 supported in JOIN? Is there any other way of doing this query?


SELECT *
 FROM   TestingTable1 tt1 LEFT JOIN TestingTable2 tt2
 ON tt1.buyer_id = tt2.user_id
   AND (tt1.item_id  = tt2.product_id
 OR unix_timestamp(tt1.created_time) = tt2.timestamps)


 Any suggestions will be appreciated.


 *Raihan Jamal*




Re: LIKE Statement

2012-05-14 Thread John Omernik
Well, the link provided isn't really about what I originally asked about.  I
have not come across a SQL implementation (Postgres, MySQL, and MSSQL are
the ones I have experience in) where LIKE was case sensitive with wildcards
by default.  That being said, I'm not the type to base my assertions solely
on my own experience.

Even so, I understand that in other implementations users/administrators
have the option to specify whether LIKE is case sensitive (I saw that in
the posted link).  Could we have that? If we can't agree on what other
databases are doing, can we at least agree that some other databases do it
differently by default, and that for organizations making the move from a
more traditional system to Hive, having the option may help their
transition?
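
In the meantime, the obvious workaround (ugly, but it works) is to
normalize case on both sides; the table and column names here are just
placeholders:

-- Case-insensitive matching by lower-casing both the column and the pattern
SELECT * FROM mytable WHERE lower(somecol) LIKE lower('%SomeValue%');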



On Mon, May 14, 2012 at 4:36 PM, Keith Wiley kwi...@keithwiley.com wrote:

 Thanks for the followup.


 
 Keith Wiley kwi...@keithwiley.com keithwiley.com
 music.keithwiley.com

 I used to be with it, but then they changed what it was.  Now, what I'm
 with
 isn't it, and what's it seems weird and scary to me.
   --  Abe (Grandpa) Simpson

 




Re: Hive ODBC - Microsoft's Involvement

2012-02-01 Thread John Omernik
I see that, but will that Hive ODBC driver work with a standard Hive
install, or will it be limited to Microsoft's cloud version of Hadoop/Hive?
 Has anyone tried the driver?

On Wed, Feb 1, 2012 at 4:23 PM, Tucker, Matt matt.tuc...@disney.com wrote:

 The Hive driver that Microsoft will be releasing is ODBC, so you should be
 able to interact with Hive just like you would with any other relational
 database.


 *From:* John Omernik [mailto:j...@omernik.com]
 *Sent:* Wednesday, February 01, 2012 3:22 PM
 *To:* user@hive.apache.org
 *Subject:* Hive ODBC - Microsoft's Involvement


 Does anyone know if the driver Microsoft is talking about with their
 Azure-based Hadoop/Hive setup would work for connecting Windows
 applications (Excel, .NET web apps, etc.) to Apache Hive running on Unix?
 I'm looking for a way to connect .NET web apps to Hive for some process
 flow upgrades.


 Thanks!
