Re: Getting job progress in java application

2012-04-29 Thread Bill Graham
Take a look at the JobClient API. You can use that to get the current
progress of a running job.
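
A minimal sketch of that kind of polling loop, assuming the old
org.apache.hadoop.mapred API from the 0.20.x line; the JobTracker address and
the job ID below are placeholders, not values from this thread:

import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.JobID;
import org.apache.hadoop.mapred.RunningJob;

public class JobProgressPoller {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf();
        conf.set("mapred.job.tracker", "jobtracker.example.com:9001"); // placeholder address
        JobClient client = new JobClient(conf);
        // placeholder job id; in practice keep the id of the job you submitted
        RunningJob job = client.getJob(JobID.forName("job_201204290000_0001"));
        while (!job.isComplete()) {
            // mapProgress()/reduceProgress() return a fraction between 0 and 1
            System.out.printf("map %.0f%%  reduce %.0f%%%n",
                    job.mapProgress() * 100, job.reduceProgress() * 100);
            Thread.sleep(5000); // poll every 5 seconds
        }
    }
}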

On Sunday, April 29, 2012, Ondřej Klimpera wrote:

 Hello, I'd like to ask what the preferred way is of getting running
 jobs' progress from the Java application that has executed them.

 I'm using Hadoop 0.20.203 and tried the job.end.notification.url property. It
 works well, but as the property name says, it sends only job end
 notifications.

 What I need is to get updates on map() and reduce() progress.

 Please help how to do this.

 Thanks.
 Ondrej Klimpera



-- 
*Note that I'm no longer using my Yahoo! email address. Please email me at
billgra...@gmail.com going forward.*


Re: Feedback on real world production experience with Flume

2012-04-22 Thread Bill Graham
+1 on Edward's comment.

The MapR comment was relevant and informative and the original poster never
said he was only interested in open source options.

On Sunday, April 22, 2012, Michael Segel wrote:

 Gee Edward, what about putting a link to a company website or your blog in
 your signature... ;-)

 Seriously, one could also mention fuse, right?  ;-)


 Sent from my iPhone

 On Apr 22, 2012, at 7:15 AM, Edward Capriolo edlinuxg...@gmail.com
 wrote:

  I think this is valid to talk about. For example, one does not need a
  decentralized collector if they can just write logs directly to
  files in a distributed file system. In any case it was
  not even a hard vendor pitch. It was someone describing how they
  handle centralized logging. It stated facts and it was informative.
 
  Let's face it: if fuse-mounting HDFS or directly soft-mounting NFS
  performed well, many of the use cases for Flume- and Scribe-like
  tools would be gone (not all, but many).
 
  I never knew there was a rule against discussing alternative software on
  a mailing list. It seems like a closed-minded thing. I also doubt the
  ASF would back a rule like that. Are we not allowed to talk about EMR
  or S3, or am I not even allowed to mention S3?
 
  Can Flume run on EC2 and log to S3? (Oops, party foul, I guess I can't ask
 that.)
 
  Edward
 
  On Sun, Apr 22, 2012 at 12:59 AM, Alexander Lorenz
  wget.n...@googlemail.com wrote:
  no. That is the Flume open source mailing list. Not a vendor list.
 
  NFS logging has nothing to do with decentralized collectors like Flume,
 JMS or Scribe.
 
  sent via my mobile device
 
  On Apr 22, 2012, at 12:23 AM, Edward Capriolo edlinuxg...@gmail.com
 wrote:
 
  It seems pretty relevant. If you can directly log via NFS that is a
  viable alternative.
 
  On Sat, Apr 21, 2012 at 11:42 AM, alo alt wget.n...@googlemail.com
 wrote:
  We decided NO product and vendor advertising on Apache mailing lists!
  I do not understand why you'd put that closed source stuff from your
 employer in the room. It has nothing to do with Flume or the use cases!
 
  --
  Alexander Lorenz
  http://mapredit.blogspot.com
 
  On Apr 21, 2012, at 4:06 PM, M. C. Srivas wrote:
 
  Karl,
 
  since you did ask for alternatives, people using MapR prefer to use the
  NFS access to directly deposit data (or access it). It works seamlessly
  from all Linuxes, Solaris, Windows, AIX and a myriad of other legacy
  systems without having to load any agents on those machines. And it is
  fully automatic HA.
 
  Since compression is built into MapR, the data gets compressed coming in
  over NFS automatically without much fuss.
 
  With regard to performance, you can get about 870 MB/s per node if you have
  10GigE attached (of course, with compression, the effective throughput will
  surpass that based on how well the data can be squeezed).
 
 
  On Fri, Apr 20, 2012 at 3:14 PM, Karl Hennig khen...@baynote.com
 wrote:
 
  I am investigating automated methods of moving our data from the
 web tier
  into HDFS for processing, a process that's performed periodically.
 
  I am looking for feedback from anyone who has actually used Flume
 in a
  production setup (redundant, failover) successfully.  I understand
 it is
  now being largely rearchitected during its incubation as Apache
 Flume-NG,
  so I don't have full confidence in the old, stable releases.
 
  The other option would be to write our own tools.  What methods are
 you
  using for these kinds of tasks?  Did you write your own or does
 Flume (or
  something else) work for you?
 
  I'm a





Re: [Blog Post]: Accumulo and Pig play together now

2012-03-02 Thread Bill Graham
- bcc: u...@nutch.apache.org common-user@hadoop.apache.org

This is great Jason. One thing to add though is this line in your Pig
script:

SET mapred.map.tasks.speculative.execution false

Otherwise you're likely going to get duplicate writes into Accumulo.


On Fri, Mar 2, 2012 at 5:48 AM, Jason Trost jason.tr...@gmail.com wrote:

 For anyone interested...

 Accumulo and Pig play together now:
 http://www.covert.io/post/18605091231/accumulo-and-pig
   and
 https://github.com/jt6211/accumulo-pig

 --Jason






Re: Writing small files to one big file in hdfs

2012-02-21 Thread Bill Graham
You might want to check out File Crusher:
http://www.jointhegrid.com/hadoop_filecrush/index.jsp

I've never used it, but it sounds like it could be helpful.

On Tue, Feb 21, 2012 at 10:25 AM, Bejoy Ks bejoy.had...@gmail.com wrote:

 Hi Mohit
  AFAIK XMLLoader in Pig won't be suited for sequence files. Please
 post the same to the Pig user group for a workaround.
 SequenceFile is a preferred option when we want to store small
 files in HDFS that need to be processed by MapReduce, as it stores data in
 key-value format. Since SequenceFileInputFormat is available at your
 disposal you don't need any custom input formats for processing the same
 using MapReduce. It is a cleaner and better approach compared to just
 appending small xml file contents into a big file.
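
A minimal sketch of that packing step, assuming the 0.20-era SequenceFile API;
the HDFS paths are placeholders and each small XML file is read whole as the
value:

import java.io.ByteArrayOutputStream;
import java.io.InputStream;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class XmlToSequenceFile {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        // key = original file name, value = full XML content of that file
        SequenceFile.Writer writer = SequenceFile.createWriter(
                fs, conf, new Path("/data/xml-packed.seq"), Text.class, Text.class);
        try {
            for (FileStatus stat : fs.listStatus(new Path("/data/xml-incoming"))) {
                ByteArrayOutputStream buf = new ByteArrayOutputStream();
                InputStream in = fs.open(stat.getPath());
                IOUtils.copyBytes(in, buf, conf, true); // true also closes the input stream
                writer.append(new Text(stat.getPath().getName()), new Text(buf.toString("UTF-8")));
            }
        } finally {
            writer.close();
        }
    }
}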

 On Tue, Feb 21, 2012 at 11:00 PM, Mohit Anchlia mohitanch...@gmail.com
 wrote:

  On Tue, Feb 21, 2012 at 9:25 AM, Bejoy Ks bejoy.had...@gmail.com
 wrote:
 
   Mohit
 Rather than just appending the content into a normal text file or
   so, you can create a sequence file with the individual smaller file
  content
   as values.
  
Thanks. I was planning to use pig's
  org.apache.pig.piggybank.storage.XMLLoader
  for processing. Would it work with sequence file?
 
  This text file that I was referring to would be in hdfs itself. Is it
 still
  different than using sequence file?
 
   Regards
   Bejoy.K.S
  
   On Tue, Feb 21, 2012 at 10:45 PM, Mohit Anchlia 
 mohitanch...@gmail.com
   wrote:
  
We have small xml files. Currently I am planning to append these
 small
files to one file in hdfs so that I can take advantage of splits,
  larger
blocks and sequential IO. What I am unsure is if it's ok to append
 one
   file
at a time to this hdfs file
   
Could someone suggest if this is ok? Would like to know how other do
  it.
   
  
 






Re: How to delete files older than X days in HDFS/Hadoop

2011-11-27 Thread Bill Graham
If you're able to put your data in directories named by date (i.e.
yyyyMMdd), you can take advantage of the fact that the HDFS client will
return directories in sort order of the name, which returns the most recent
dirs last. You can then cron a bash script that deletes all but the last N
directories returned, where N is how many days you want to keep.
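
A minimal sketch of that cron'd cleanup, done in Java rather than bash, assuming
date-named (yyyyMMdd) subdirectories under one base path; the base path and N
are placeholders:

import java.util.Arrays;
import java.util.Comparator;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class PruneOldDirs {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        FileStatus[] dirs = fs.listStatus(new Path("/data/daily")); // placeholder base dir
        // sort by name; yyyyMMdd names sort oldest first
        Arrays.sort(dirs, new Comparator<FileStatus>() {
            public int compare(FileStatus a, FileStatus b) {
                return a.getPath().getName().compareTo(b.getPath().getName());
            }
        });
        int keep = 7; // keep the most recent N directories
        for (int i = 0; i < dirs.length - keep; i++) {
            fs.delete(dirs[i].getPath(), true); // recursive delete of an old day's directory
        }
    }
}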



On Sat, Nov 26, 2011 at 8:26 PM, Ronnie Dove ron...@oceansync.com wrote:

 Hello Raimon,

 I like the idea of being able to search through files on HDFS so that we
 can find keywords or timestamp criteria, something that OceanSync will be
 doing in the future as a tool option.  The others have told you some great
 ideas but I wanted to help you out from a Java API perspective.  If you are
 a Java programmer, you would utilize FileSystem.listFiles() which returns
 the directory listing in a FileStatus[] format.  You would crawl through
 the FileStatus array, checking whether each FileStatus is a file or a
 directory.  If it is a file, you will check the timestamp of the file
 using FileStatus.getModificationTime().  If it's a directory then it
 will be processed again using a while loop to check the contents of that
 directory.  This sounds tough, but from testing it, it is fairly
 fast and accurate.  Below are the two APIs that are needed as part of this
 method:


 http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/fs/FileStatus.html

 http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/fs/FileSystem.html
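
A minimal sketch of the walk described above, using FileSystem.listStatus() and
FileStatus.getModificationTime(); the starting path and the 7-day cutoff are
placeholders:

import java.util.LinkedList;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class DeleteOldFiles {
    public static void main(String[] args) throws Exception {
        long cutoff = System.currentTimeMillis() - 7L * 24 * 60 * 60 * 1000; // 7 days ago
        FileSystem fs = FileSystem.get(new Configuration());
        LinkedList<Path> pending = new LinkedList<Path>();
        pending.add(new Path("/data")); // placeholder starting directory
        while (!pending.isEmpty()) {
            for (FileStatus stat : fs.listStatus(pending.removeFirst())) {
                if (stat.isDir()) {
                    pending.add(stat.getPath()); // descend into subdirectories
                } else if (stat.getModificationTime() < cutoff) {
                    fs.delete(stat.getPath(), false); // delete one stale file
                }
            }
        }
    }
}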

 
 Ronnie Dove
 OceanSync Management Developer
 http://www.oceansync.com
 RDove on irc.freenode.net #Hadoop


 - Original Message -
 From: Raimon Bosch raimon.bo...@gmail.com
 To: common-user@hadoop.apache.org
 Cc:
 Sent: Saturday, November 26, 2011 10:01 AM
 Subject: How to delete files older than X days in HDFS/Hadoop

 Hi,

 I'm wondering how to delete files older than X days with HDFS/Hadoop. On
 Linux we can do it with the following command:

 find ~/datafolder/* -mtime +7 -exec rm {} \;

 Any ideas?




Re: Why hadoop should be built on JAVA?

2011-08-16 Thread Bill Graham
There was a fairly long discussion on this topic at the beginning of the
year FYI:

http://search-hadoop.com/m/JvSQe2wNlY11

On Mon, Aug 15, 2011 at 9:00 PM, Chris Song sjh...@gmail.com wrote:

 Why should Hadoop be built in Java?

 For integrity and stability, it is good for Hadoop to be implemented in
 Java.

 But when it comes to the speed issue, I have a question...

 How would it be if Hadoop were implemented in C or Python?



Re: Distcp failure - Server returned HTTP response code: 500

2011-05-18 Thread Bill Graham
Are you able to distcp folders that don't have special characters?

What are the versions of the two clusters, and are you running the distcp on
the destination cluster if there's a mismatch? If there is, you'll need to use
hftp:

http://hadoop.apache.org/common/docs/current/distcp.html#cpver

On Wed, May 18, 2011 at 12:44 PM, sonia gehlot sonia.geh...@gmail.comwrote:

 Hi Guys

 I am trying to copy Hadoop data from one cluster to another but I keep
 getting this error: *Server returned HTTP response code: 500 for URL*
 My distcp command is:
 scripts/hadoop.sh distcp

 hftp://c13-hadoop1-nn-r0-n1:50070/user/dwadmin/live/data/warehouse/facts/page_events/
 *day=2011-05-17* hdfs://phx1-rb-dev40-pipe1.cnet.com:9000/user/sgehlot

 In here I have *day=2011-05-17* in my file path

 I found this online:  https://issues.apache.org/jira/browse/HDFS-31

 Does this issue still exist? Could this be the reason for my job
 failure?

 Job Error log:

 2011-05-18 11:34:56,505 INFO org.apache.hadoop.metrics.jvm.JvmMetrics:
 Initializing JVM Metrics with processName=MAP, sessionId=
 2011-05-18 11:34:56,713 INFO org.apache.hadoop.mapred.MapTask:
 numReduceTasks: 0
 2011-05-18 11:34:57,039 INFO org.apache.hadoop.tools.DistCp: FAIL

 day=2011-05-17/_logs/history/c13-hadoop1-nn-r0-n1_1291919715221_job_201012091035_41977_dwadmin_CopyFactsToHive%3A+page_events+day%3D2011-05-17
 : java.io.IOException: *Server returned HTTP response code: 500 for URL*:

 http://c13-hadoop1-wkr-r10-n4.cnet.com:50075/streamFile?filename=/user/dwadmin/live/data/warehouse/facts/page_events/day=2011-05-17/_logs/history/c13-hadoop1-nn-r0-n1_1291919715221_job_201012091035_41977_dwadmin_CopyFactsToHive%253A+page_events+day%253D2011-05-17ugi=sgehlot,user
  at

 sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1313)
 at org.apache.hadoop.hdfs.HftpFileSystem.open(HftpFileSystem.java:157)
  at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:398)
 at org.apache.hadoop.tools.DistCp$CopyFilesMapper.copy(DistCp.java:410)
  at org.apache.hadoop.tools.DistCp$CopyFilesMapper.map(DistCp.java:537)
 at org.apache.hadoop.tools.DistCp$CopyFilesMapper.map(DistCp.java:306)
  at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
 at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358)
  at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
 at org.apache.hadoop.mapred.Child.main(Child.java:170)

 2011-05-18 11:35:06,118 WARN org.apache.hadoop.mapred.TaskTracker: Error
 running child
 java.io.IOException: Copied: 0 Skipped: 5 Failed: 1
 at org.apache.hadoop.tools.DistCp$CopyFilesMapper.close(DistCp.java:572)
  at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:57)
 at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358)
  at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
 at org.apache.hadoop.mapred.Child.main(Child.java:170)
 2011-05-18 11:35:06,124 INFO org.apache.hadoop.mapred.TaskRunner: Runnning
 cleanup for the task

 Any help is appreciated.

 Thanks,
 Sonia



Re: Including Additional Jars

2011-04-06 Thread Bill Graham
If you could share more specifics regarding just how it's not working
(i.e., job specifics, stack traces, how you're invoking it, etc), you
might get more assistance in troubleshooting.


On Wed, Apr 6, 2011 at 1:44 AM, Shuja Rehman shujamug...@gmail.com wrote:
 Neither -libjars nor the distributed cache is working; any other
 solution??

 On Mon, Apr 4, 2011 at 11:40 PM, James Seigel ja...@tynt.com wrote:

 James’ quick and dirty, get your job running guideline:

 -libjars -- for jars you want accessible by the mappers and reducers
 classpath or bundled in the main jar -- for jars you want accessible to
 the runner

 Cheers
 James.



 On 2011-04-04, at 12:31 PM, Shuja Rehman wrote:

  well...i think to put in distributed cache is good idea. do u have any
  working example how to put extra jars in distributed cache and how to
 make
  available these jars for job?
  Thanks
 
  On Mon, Apr 4, 2011 at 10:20 PM, Mark Kerzner markkerz...@gmail.com
 wrote:
 
  I think you can put them either in your jar or in distributed cache.
 
  As Allen pointed out, my idea of putting them into hadoop lib jar was
  wrong.
 
  Mark
 
  On Mon, Apr 4, 2011 at 12:16 PM, Marco Didonna m.didonn...@gmail.com
  wrote:
 
  On 04/04/2011 07:06 PM, Allen Wittenauer wrote:
 
 
  On Apr 4, 2011, at 8:06 AM, Shuja Rehman wrote:
 
  Hi All
 
  I have created a map reduce job and to run on it on the cluster, i
 have
  bundled all jars(hadoop, hbase etc) into single jar which increases
 the
  size
  of overall file. During the development process, i need to copy again
  and
  again this complete file which is very time consuming so is there any
  way
  that i just copy the program jar only and do not need to copy the lib
  files
  again and again. i am using net beans to develop the program.
 
  kindly let me know how to solve this issue?
 
 
        This was in the FAQ, but in a non-obvious place.  I've updated
 it
  to be more visible (hopefully):
 
 
 
 
 http://wiki.apache.org/hadoop/FAQ#How_do_I_submit_extra_content_.28jars.2C_static_files.2C_etc.29_for_my_job_to_use_during_runtime.3F
 
 
  Does the same apply to jar containing libraries? Let's suppose I need
  lucene-core.jar to run my project. Can I put my this jar into my job
 jar
  and
  have hadoop see lucene's classes? Or should I use distributed cache??
 
  MD
 
 
 
 
 
 
  --
  Regards
  Shuja-ur-Rehman Baig
  http://pk.linkedin.com/in/shujamughal




 --
 Regards
 Shuja-ur-Rehman Baig
 http://pk.linkedin.com/in/shujamughal



Re: Including Additional Jars

2011-04-06 Thread Bill Graham
You need to pass the mainClass after the jar:

http://hadoop.apache.org/common/docs/r0.21.0/commands_manual.html#jar

On Wed, Apr 6, 2011 at 11:31 AM, Shuja Rehman shujamug...@gmail.com wrote:
 i am using the following command

 hadoop jar myjar.jar -libjars /home/shuja/lib/mylib.jar  param1 param2
 param3

 but the program is still giving the error and does not find mylib.jar. Can
 you confirm the syntax of the command?
 Thanks



 On Wed, Apr 6, 2011 at 8:29 PM, Bill Graham billgra...@gmail.com wrote:

 If you could share more specifics regarding just how it's not working
 (i.e., job specifics, stack traces, how you're invoking it, etc), you
 might get more assistance in troubleshooting.


 On Wed, Apr 6, 2011 at 1:44 AM, Shuja Rehman shujamug...@gmail.com
 wrote:
  -libjars is not working nor distributed cache, any other
  solution??
 
  On Mon, Apr 4, 2011 at 11:40 PM, James Seigel ja...@tynt.com wrote:
 
  James’ quick and dirty, get your job running guideline:
 
  -libjars -- for jars you want accessible by the mappers and reducers
  classpath or bundled in the main jar -- for jars you want accessible
  to
  the runner
 
  Cheers
  James.
 
 
 
  On 2011-04-04, at 12:31 PM, Shuja Rehman wrote:
 
   well...i think to put in distributed cache is good idea. do u have
   any
   working example how to put extra jars in distributed cache and how to
  make
   available these jars for job?
   Thanks
  
   On Mon, Apr 4, 2011 at 10:20 PM, Mark Kerzner markkerz...@gmail.com
  wrote:
  
   I think you can put them either in your jar or in distributed cache.
  
   As Allen pointed out, my idea of putting them into hadoop lib jar
   was
   wrong.
  
   Mark
  
   On Mon, Apr 4, 2011 at 12:16 PM, Marco Didonna
   m.didonn...@gmail.com
   wrote:
  
   On 04/04/2011 07:06 PM, Allen Wittenauer wrote:
  
  
   On Apr 4, 2011, at 8:06 AM, Shuja Rehman wrote:
  
   Hi All
  
   I have created a map reduce job and to run on it on the cluster,
   i
  have
   bundled all jars(hadoop, hbase etc) into single jar which
   increases
  the
   size
   of overall file. During the development process, i need to copy
   again
   and
   again this complete file which is very time consuming so is there
   any
   way
   that i just copy the program jar only and do not need to copy the
   lib
   files
   again and again. i am using net beans to develop the program.
  
   kindly let me know how to solve this issue?
  
  
         This was in the FAQ, but in a non-obvious place.  I've
   updated
  it
   to be more visible (hopefully):
  
  
  
  
 
  http://wiki.apache.org/hadoop/FAQ#How_do_I_submit_extra_content_.28jars.2C_static_files.2C_etc.29_for_my_job_to_use_during_runtime.3F
  
  
   Does the same apply to jar containing libraries? Let's suppose I
   need
   lucene-core.jar to run my project. Can I put my this jar into my
   job
  jar
   and
   have hadoop see lucene's classes? Or should I use distributed
   cache??
  
   MD
  
  
  
  
  
  
   --
   Regards
   Shuja-ur-Rehman Baig
   http://pk.linkedin.com/in/shujamughal
 
 
 
 
  --
  Regards
  Shuja-ur-Rehman Baig
  http://pk.linkedin.com/in/shujamughal
 



 --
 Regards
 Shuja-ur-Rehman Baig





Re: Including Additional Jars

2011-04-04 Thread Bill Graham
Shuja, I haven't tried this, but from what I've read it seems you
could just add all your jars required by the Mapper and Reducer to
HDFS and then add them to the classpath in your run() method like
this:

DistributedCache.addFileToClassPath(new Path("/myapp/mylib.jar"), job);

I think that's all there is to it, but like I said, I haven't tried
it. Just be sure your run() method isn't in the same class as your
mapper/reducer if they import packages from any of the distributed
cache jars.
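
A minimal sketch of that setup, assuming a Tool-based runner and the new-style
Job API; the class names, paths and job wiring here are placeholders, not the
poster's actual job:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class MyJobRunner extends Configured implements Tool {
    public int run(String[] args) throws Exception {
        Job job = new Job(getConf(), "my-job");
        job.setJarByClass(MyJobRunner.class);
        // Make /myapp/mylib.jar (already copied to HDFS) visible on the task classpath.
        DistributedCache.addFileToClassPath(new Path("/myapp/mylib.jar"), job.getConfiguration());
        // ... set input/output paths and mapper/reducer classes here ...
        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        System.exit(ToolRunner.run(new MyJobRunner(), args));
    }
}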


On Mon, Apr 4, 2011 at 11:40 AM, James Seigel ja...@tynt.com wrote:
 James’ quick and dirty, get your job running guideline:

 -libjars -- for jars you want accessible by the mappers and reducers
 classpath or bundled in the main jar -- for jars you want accessible to the 
 runner

 Cheers
 James.



 On 2011-04-04, at 12:31 PM, Shuja Rehman wrote:

 well...i think to put in distributed cache is good idea. do u have any
 working example how to put extra jars in distributed cache and how to make
 available these jars for job?
 Thanks

 On Mon, Apr 4, 2011 at 10:20 PM, Mark Kerzner markkerz...@gmail.com wrote:

 I think you can put them either in your jar or in distributed cache.

 As Allen pointed out, my idea of putting them into hadoop lib jar was
 wrong.

 Mark

 On Mon, Apr 4, 2011 at 12:16 PM, Marco Didonna m.didonn...@gmail.com
 wrote:

 On 04/04/2011 07:06 PM, Allen Wittenauer wrote:


 On Apr 4, 2011, at 8:06 AM, Shuja Rehman wrote:

 Hi All

 I have created a map reduce job and to run on it on the cluster, i have
 bundled all jars(hadoop, hbase etc) into single jar which increases the
 size
 of overall file. During the development process, i need to copy again
 and
 again this complete file which is very time consuming so is there any
 way
 that i just copy the program jar only and do not need to copy the lib
 files
 again and again. i am using net beans to develop the program.

 kindly let me know how to solve this issue?


       This was in the FAQ, but in a non-obvious place.  I've updated it
 to be more visible (hopefully):



 http://wiki.apache.org/hadoop/FAQ#How_do_I_submit_extra_content_.28jars.2C_static_files.2C_etc.29_for_my_job_to_use_during_runtime.3F


 Does the same apply to jar containing libraries? Let's suppose I need
 lucene-core.jar to run my project. Can I put my this jar into my job jar
 and
 have hadoop see lucene's classes? Or should I use distributed cache??

 MD






 --
 Regards
 Shuja-ur-Rehman Baig
 http://pk.linkedin.com/in/shujamughal




Re: Chukwa setup issues

2011-04-01 Thread Bill Graham
Unfortunately conf/collectors is used in two different ways in Chukwa,
each with a different syntax. This should really be fixed.

1. The script that starts the collectors looks at it for a list of
hostnames (no ports) to start collectors on. To start it just on one
host, set it to localhost.
2. The agent looks at that file for the list of collectors to attempt
to communicate with. In that case the format is a list of HTTP URLs
with the collectors' ports.

Can you telnet to port ? It looks like it's listening, but
nothing's being sent. Is there anything in logs/collector.log?

On Fri, Apr 1, 2011 at 1:09 PM, bikash sharma sharmabiks...@gmail.com wrote:
 Hi,
 I am trying to setup Chukwa for a 16-node Hadoop cluster.
 I followed the admin guide -
 http://incubator.apache.org/chukwa/docs/r0.4.0/admin.html#Agents
 However, I ran into the following two issues:
 1. What should be the collector port that needs to be specified in
 conf/collectors file
 2. Am unable to see the collector running via web browser

 Am I missing something?

 Thanks in advance.

 -bikash

 p.s. - after i run collector, nothing happens
 % bin/chukwa collector
 2011-04-01 16:07:16.410::INFO:  Logging to STDERR via
 org.mortbay.log.StdErrLog
 2011-04-01 16:07:16.523::INFO:  jetty-6.1.11
 2011-04-01 16:07:17.707::INFO:  Started SelectChannelConnector@0.0.0.0:
 started Chukwa http collector on port 



Re: Chukwa - Lightweight agents

2011-03-20 Thread Bill Graham
Yes, we run lightweight Chukwa agents and collectors only, using
Chukwa just as you describe. We've been doing so for over a year or so
without many issues. The code is fairly easy to extend when needed. We
rolled our own collector, agent and demux RPMs.

The monitoring peice of chukwa is optional and we don't use that part.


On Sun, Mar 20, 2011 at 6:47 PM, Ted Dunning tdunn...@maprtech.com wrote:
 Bummer.

 On Sun, Mar 20, 2011 at 10:12 AM, Mark static.void@gmail.com wrote:

  We tried Flume; however, there are some pretty strange bugs occurring which
 prevent us from using it.


 http://groups.google.com/a/cloudera.org/group/flume-user/browse_thread/thread/66c6aecec9d1869b


 On 3/20/11 10:03 AM, Ted Dunning wrote:

 OpenTSDB is purely a monitoring solution which is the primary mission of
 chukwa.

  If you are looking for data import, what about Flume?

 On Sun, Mar 20, 2011 at 9:59 AM, Mark static.void@gmail.com wrote:

  Thanks but we need Chukwa to aggregate and store files from across our
 app servers into Hadoop. Doesn't really look like opentsdb is meant for
 that. I could be wrong though?


 On 3/20/11 9:49 AM, Ted Dunning wrote:

 Take a look at openTsDb at http://opentsdb.net/

  It provides lots of the capability in a MUCH simpler package.

 On Sun, Mar 20, 2011 at 8:43 AM, Mark static.void@gmail.com wrote:

 Sorry but it doesn't look like Chukwa mailing list exists anymore?

 Is there an easy way to set up lightweight agents on cluster of machines
 instead of downloading the full Chukwa source (+50mb)?

 Has anyone build separate RPM's for the agents/collectors?

 Thanks







Re: Chukwa?

2011-03-20 Thread Bill Graham
Chukwa hasn't had a release since moving from Hadoop to incubator so
there are no releases in the /incubator repos. Follow the link on the
Chukwa homepage to the downloads repos:

http://incubator.apache.org/chukwa/
http://www.apache.org/dyn/closer.cgi/hadoop/chukwa/chukwa-0.4.0


On Sun, Mar 20, 2011 at 9:38 AM, Mark static.void@gmail.com wrote:
 What's the deal with Chukwa? The mailing list doesn't look like it's alive,
 and neither do any of the download options:
 http://www.apache.org/dyn/closer.cgi/incubator/chukwa/

 Is this project dead?



Re: Hadoop exercises

2011-01-05 Thread Bill Graham
For the even lazier, you could give both the test data and the
expected output data. That way they know for sure if they got it
right. This also promotes a good testing best practice, which is to
assert against an expected set of results.


On Wed, Jan 5, 2011 at 9:19 AM, Mark Kerzner markkerz...@gmail.com wrote:
 Thank you, Harsh, for the suggestion. It also gave me an idea to add one
 more exercise, generate test data - which is helpful in its own right,
 since it brings out the idea of Hadoop testing philosophy: think of large
 tests.

 Mark

 On Wed, Jan 5, 2011 at 10:39 AM, Harsh J qwertyman...@gmail.com wrote:

 Providing a data sample after describing it would be good for the yet-lazy.

 On Wed, Jan 5, 2011 at 9:19 PM, Mark Kerzner markkerz...@gmail.com
 wrote:
  Hi,
 
  what would you think of these
  exercises
 http://hadoopinpractice.blogspot.com/2011/01/exercises-for-chapter-1-how-do-they.html
 for
  the Hadoop intro chapter?
 
  Thank you,
  Mark
 



 --
 Harsh J
 www.harshj.com




Re: Which patch to apply out of multiple available ?

2010-09-07 Thread Bill Graham
You typically want the last one only. Generally higher numbered patches are
revisions of previous ones and they're cumulative.

On Tue, Sep 7, 2010 at 10:34 AM, Shrijeet Paliwal
shrij...@rocketfuel.comwrote:

 Probably a silly question, If I want to apply a patch to a version I am
 running and there are multiple patches attached to the Jira - which one
 should I pick? The latest one?

 Example:
 https://issues.apache.org/jira/browse/HIVE-1019?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel

 Thanks,
 -Shrijeet





PigServer.executeBatch throwing NotSerializableException

2010-09-07 Thread Bill Graham
Hi,

I've just deployed some new Pig jobs live (Pig version 0.7.0) and I'm
getting the error shown below. Has anyone seen this before? What's strange
is that I have a tier of 4 load-balanced machines that serve as my job
runners, all with identical code and hardware, but 2 of the four will fail
consistently with this error. The other 2 run the same job without problem.
I'm racking my brain to understand why...

java.io.NotSerializableException:
org.apache.commons.logging.impl.Log4JLogger
at
java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1156)
at
java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1509)
at
java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1474)
at
java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1392)
at
java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1150)
at
java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:326)
at java.util.HashMap.writeObject(HashMap.java:1000)
at sun.reflect.GeneratedMethodAccessor624.invoke(Unknown Source)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at
java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:945)
at
java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1461)
at
java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1392)
at
java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1150)
at
java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1509)
at
java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1474)
at
java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1392)
at
java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1150)
at
java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1509)
at
java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1474)
at
java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1392)
at
java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1150)
at
java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:326)
at
org.apache.pig.impl.util.ObjectSerializer.serialize(ObjectSerializer.java:40)
at
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler.getJob(JobControlCompiler.java:511)
at
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler.compile(JobControlCompiler.java:246)
at
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher.launchPig(MapReduceLauncher.java:131)
at
org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.execute(HExecutionEngine.java:308)
at
org.apache.pig.PigServer.executeCompiledLogicalPlan(PigServer.java:844)
at org.apache.pig.PigServer.execute(PigServer.java:837)
at org.apache.pig.PigServer.access$100(PigServer.java:107)
at org.apache.pig.PigServer$Graph.execute(PigServer.java:1089)
at org.apache.pig.PigServer.executeBatch(PigServer.java:290)

thanks,
Bill


Re: PigServer.executeBatch throwing NotSerializableException

2010-09-07 Thread Bill Graham
Thanks Yan! That definitely looks like it could be the culprit. I'll give
that a shot.

FYI, I also found an instance of Log in one of my UDFs that wasn't static.
From what I gather, changing that to static will keep serialization from
occurring, so I'm giving that a shot as well.
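
A minimal sketch of that static-Log pattern in a Pig EvalFunc, just to
illustrate the change; the UDF class itself and what it does are made up:

import java.io.IOException;
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

public class UpperCaseUdf extends EvalFunc<String> {
    // static, so the non-serializable Log4JLogger is never part of the serialized UDF state
    private static final Log LOG = LogFactory.getLog(UpperCaseUdf.class);

    @Override
    public String exec(Tuple input) throws IOException {
        if (input == null || input.size() == 0 || input.get(0) == null) {
            return null;
        }
        LOG.debug("processing one tuple");
        return ((String) input.get(0)).toUpperCase();
    }
}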

On Tue, Sep 7, 2010 at 10:33 AM, Yan Zhou y...@yahoo-inc.com wrote:

 Most likely you were using the older version 1.0.3 of the log4j logger when
 you built Pig. See https://jira.jboss.org/browse/JBAS-1781. The root cause
 might be due to https://issues.apache.org/jira/browse/PIG-1582. So the
 first thing you probably need to do is upgrade your local copy of the Pig trunk.

 Yan

 -Original Message-
 From: Bill Graham [mailto:billgra...@gmail.com]
 Sent: Tuesday, September 07, 2010 10:00 AM
 To: pig-user@hadoop.apache.org
 Subject: PigServer.executeBatch throwing NotSerializableException

 Hi,

 I've just deployed some new Pig jobs live (Pig version 0.7.0) and I'm
 getting the error shown below. Has anyone seen this before? What's strange
 is that I have a tier of 4 load-balanced machines that serve as my job
 runners, all with identical code and hardware, but 2 of the four will fail
 consistently with this error. The other 2 run the same job without problem.
 I'm racking my brain to understand why...

 java.io.NotSerializableException:
 org.apache.commons.logging.impl.Log4JLogger
at
 java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1156)
at
 java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1509)
at
 java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1474)
at

 java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1392)
at
 java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1150)
at
 java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:326)
at java.util.HashMap.writeObject(HashMap.java:1000)
at sun.reflect.GeneratedMethodAccessor624.invoke(Unknown Source)
at

 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at
 java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:945)
at
 java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1461)
at

 java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1392)
at
 java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1150)
at
 java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1509)
at
 java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1474)
at

 java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1392)
at
 java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1150)
at
 java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1509)
at
 java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1474)
at

 java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1392)
at
 java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1150)
at
 java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:326)
at

 org.apache.pig.impl.util.ObjectSerializer.serialize(ObjectSerializer.java:40)
at

 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler.getJob(JobControlCompiler.java:511)
at

 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler.compile(JobControlCompiler.java:246)
at

 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher.launchPig(MapReduceLauncher.java:131)
at

 org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.execute(HExecutionEngine.java:308)
at
 org.apache.pig.PigServer.executeCompiledLogicalPlan(PigServer.java:844)
at org.apache.pig.PigServer.execute(PigServer.java:837)
at org.apache.pig.PigServer.access$100(PigServer.java:107)
at org.apache.pig.PigServer$Graph.execute(PigServer.java:1089)
at org.apache.pig.PigServer.executeBatch(PigServer.java:290)

 thanks,
 Bill



Re: Real-time log processing in Hadoop

2010-09-06 Thread Bill Graham
We're using Chukwa to do steps a-d before writing summary data into MySQL.
Data is written into new directories every 5 minutes. Our MR jobs and data
load into MySQL take less than 5 minutes, so after a 5-minute window closes, we
typically have summary data from that interval in MySQL a few minutes
later.

But as Ranjib points out, how fast you can process your data depends on both
cluster size and data rate.

thanks,
Bill

On Sun, Sep 5, 2010 at 10:42 PM, Ranjib Dey ranj...@thoughtworks.comwrote:

 We are using Hadoop for log crunching, and the mined data feeds one of our
 apps. It's not exactly real time; it is basically a mail responder which
 provides certain services given an e-mail (with a prescribed format) sent to
 it (a...@xxx.com). We have been able to bring down the response time to 30
 mins. This includes automated Hadoop job submission, processing the output,
 and intermediate status notification. From our experience we have
 learned the entire response time is dependent on your data size, your
 Hadoop cluster's strength, etc. And you need to do the performance
 optimization at each level (as required), which includes JVM tuning
 (different tuning in name nodes / data nodes) to app-level code refactoring
 (like using har on HDFS for smaller files, etc).

 regards
 ranjib

 On Mon, Sep 6, 2010 at 10:32 AM, Ricky Ho rickyphyl...@yahoo.com wrote:

  Can anyone share their experience in doing real-time log processing using
  Chukwa/Scribe + Hadoop ?
 
  I am wondering how real-time this can be, given Hadoop is designed for
  batch rather than stream processing:
  1) The startup / teardown time of running Hadoop jobs typically takes
  minutes.
  2) Data is typically stored in HDFS as large files, and it takes some
  time to accumulate data to that size.
 
  All these add up to the latencies of Hadoop.  So I am wondering what the
  shortest latencies are that people are achieving when doing log processing
  in real life.
 
  To my understanding, the Chukwa/Scribe model accumulates log requests (from
  many machines) and writes them to HDFS (inside a directory).  After the
  logger switches to a new directory, the old one is ready for Map/Reduce
  processing, which then produces the result.
 
  So the latency is ...
  a) Accumulate enough data to fill an HDFS block size
  b) Write the block to HDFS
  c) Keep doing (b) until the criteria of switching to a new directory is
 met
  d) Start the Map/Reduce processing in the old directory
  e) Write the processed data to the output directory
  f) Load the output to a queriable form.
 
  I think the above can easily be a 30-minute or 1-hour duration.  Is this
  ballpark in line with the real-life projects that you have done?
 
  Rgds,
  Ricky
 
 
 
 
 



JIRA down

2010-08-25 Thread Bill Graham
JIRA seems to be down FYI. Database errors are being returned:

*Cause: *
org.apache.commons.lang.exception.NestableRuntimeException:
com.atlassian.jira.exception.DataAccessException:
org.ofbiz.core.entity.GenericDataSourceException: Unable to esablish a
connection with the database. (FATAL: database is not accepting commands to
avoid wraparound data loss in database postgres)

*Stack Trace: *

org.apache.commons.lang.exception.NestableRuntimeException:
com.atlassian.jira.exception.DataAccessException:
org.ofbiz.core.entity.GenericDataSourceException: Unable to esablish a
connection with the database. (FATAL: database is not accepting
commands to avoid wraparound data loss in database postgres)
at 
com.atlassian.jira.web.component.TableLayoutFactory.getUserColumns(TableLayoutFactory.java:239)
at 
com.atlassian.jira.web.component.TableLayoutFactory.getStandardLayout(TableLayoutFactory.java:42)
at 
org.apache.jsp.includes.navigator.table_jsp._jspService(table_jsp.java:178)
at org.apache.jasper.runtime.HttpJspBase.service(HttpJspBase.java:70)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:717)
at 
org.apache.jasper.servlet.JspServletWrapper.service(JspServletWrapper.java:374)
at 
org.apache.jasper.servlet.JspServlet.serviceJspFile(JspServlet.java:342)
at org.apache.jasper.servlet.JspServlet.service(JspServlet.java:267)


Re: Reopen and append to SequenceFile

2010-08-23 Thread Bill Graham
Chukwa also has a JMSAdaptor that can listen to an AMQ queue and stream the
messages to one or more collectors, where they are then persisted as sequence files.



On Fri, Aug 20, 2010 at 3:29 AM, cliff palmer palmercl...@gmail.com wrote:

 You may want to consider using something like the *nix tee command to save
 a
 copy of each message in a log directory.  A periodic job (like Flume)
 would load the logged messages into sequence files.
 HTH
 Cliff

 On Fri, Aug 20, 2010 at 3:32 AM, skantsoni shashikant_s...@mindtree.com
 wrote:

 
  Hi, I am fairly new to Hadoop and HDFS and am trying to do the following:
  1. consume some information being published by a system from AMQP
  2. write these to SequenceFile as Text, Text into a sequence file.
 
  Periodically these files would be consumed by another system to generate
  reports.
  The problem is our system which consumes messages is distributed and runs
  across multiple machines, and I cannot keep the writer on a SequenceFile
  open for a long time to keep appending. I want to open the file for a
  message and then close it for each message that I receive (I don't know if
  this is the correct approach for HDFS). But if I close the writer once, I
  cannot reopen it to append to it. I saw a few threads talking about merging
  these files but I felt that may be an overhead.
 
  I feel I am missing something about the fundamental usage of sequence files,
  or is there another way to do this. Can someone please point me in the
  right direction?
  Thanks in advance
 
 



Re: Pig and Cassandra

2010-08-13 Thread Bill Graham
I've seen that exception in other cases where there is an unmet
dependency on a superclass that is included in a separate (and not
provided) jar. Check the thrift source to see if that's the case.

On Friday, August 13, 2010, Christian Decker decker.christ...@gmail.com wrote:
 Hi all,

 I'm pretty new to Pig and Hadoop so excuse me if this is trivial, but I
 couldn't find anyone able to help me.
 I'm trying to get Pig to read data from a Cassandra cluster, which I thought
 trivial since Cassandra already provides me with the CassandraStorage class
 [1]. Problem is that once I try executing a simple script like this:

  register /path/to/pig-0.7.0-core.jar;
  register /path/to/libthrift-r917130.jar;
  register /path/to/cassandra_loadfunc.jar;
  rows = LOAD 'cassandra://Keyspace1/Standard1' USING org.apache.cassandra.hadoop.pig.CassandraStorage();
  cols = FOREACH rows GENERATE flatten($1);
  colnames = FOREACH cols GENERATE $0;
  namegroups = GROUP colnames BY $0;
  namecounts = FOREACH namegroups GENERATE COUNT($1), group;
  orderednames = ORDER namecounts BY $0;
  topnames = LIMIT orderednames 50;
  dump topnames;

 I just end up with a NoClassDefFoundError:

 ERROR org.apache.pig.tools.grunt.Grunt -
 org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1066: Unable to
 open iterator for alias topnames
 at org.apache.pig.PigServer.openIterator(PigServer.java:521)
  at org.apache.pig.tools.grunt.GruntParser.processDump(GruntParser.java:544)
 at
 org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:241)
  at
 org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:162)
 at
 org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:138)
  at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:89)
 at org.apache.pig.Main.main(Main.java:391)
 Caused by: org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1002:
 Unable to store alias topnames
 at org.apache.pig.PigServer.store(PigServer.java:577)
  at org.apache.pig.PigServer.openIterator(PigServer.java:504)
 ... 6 more
 Caused by: org.apache.pig.backend.executionengine.ExecException: ERROR 2117:
 Unexpected error when launching map reduce job.
  at
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher.launchPig(MapReduceLauncher.java:209)
 at
 org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.execute(HExecutionEngine.java:308)
  at org.apache.pig.PigServer.executeCompiledLogicalPlan(PigServer.java:835)
 at org.apache.pig.PigServer.store(PigServer.java:569)
  ... 7 more
 Caused by: java.lang.RuntimeException: Could not resolve error that occured
 when launching map reduce job: java.lang.NoClassDefFoundError:
 org/apache/thrift/TBase
  at
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher$JobControlThreadExceptionHandler.uncaughtException(MapReduceLauncher.java:510)
  at java.lang.Thread.dispatchUncaughtException(Thread.java:1845)

 (sorry for posting all the error message).
 I cannot think of a reason as to why. As far as I understood it Pig takes
 the jar files in the script, unpackages them, creates the execution plan for
 the script itself and then bundles it into a single jar again, then submits
 it to the HDFS from where it will be executed in Hadoop, right?
 I also checked that the class in question actually is in the libthrift jar,
 so what's going wrong?

 Regards,
 Chris

 [1]
 http://svn.apache.org/viewvc/cassandra/trunk/contrib/pig/src/java/org/apache/cassandra/hadoop/pig/CassandraStorage.java?revision=984904view=markup



Re: Changing hostnames of tasktracker/datanode nodes - any problems?

2010-08-10 Thread Bill Graham
Sorry to hijack the thread but I have a similar use case.

In a few months we're going to be moving colos. The new cluster will be the
same size as the current cluster and some downtime is acceptable. The
hostnames will be different. From what I've read in this thread it seems
like it would be safe to do the following:

1. Build out the new cluster without starting it.
2. Shut down the entire old cluster (NN, SNN, DNs)
3. scp the relevant data and name dirs for each host to the new hardware.
4. Start the new cluster

Is it correct to say that that would work fine? We have a replication factor
of 2, so we'd be copying twice as much data as we'd need to so I'm sure
there's a more efficient approach.

What about adding the new nodes in the new colo to the existing cluster,
rebalancing and then decommissioning the old cluster nodes before finally
migrating the NN/SNN? I know Hadoop isn't intended to run cross-colo, but
would this be a more efficient approach than the one above?



On Tue, Aug 10, 2010 at 8:59 AM, Allen Wittenauer
awittena...@linkedin.comwrote:


 On Aug 10, 2010, at 7:07 AM, Brian Bockelman wrote:

  Hi Erik,
 
  You can also do this one-by-one (aka, a rolling reboot).  Shut it down,
 wait for it to be recognized as dead, then bring it back up with a new
 hostname.  It will take a much longer time, but you won't have any decrease
 in availability, just some minor decrease in capacity.

 ... and potentially problems with dfs.hosts.





Re: Changing hostnames of tasktracker/datanode nodes - any problems?

2010-08-10 Thread Bill Graham
Ahh yes of course, distcp. Thanks!

On Tue, Aug 10, 2010 at 11:01 AM, Allen Wittenauer awittena...@linkedin.com
 wrote:


 On Aug 10, 2010, at 10:54 AM, Bill Graham wrote:
  Is it correct to say that that would work fine? We have a replication
 factor
  of 2, so we'd be copying twice as much data as we'd need to so I'm sure
  there's a more efficient approach.

 It should work fine.  But yes, highly inefficient.

  What about adding the new nodes in the new colo to the existing cluster,
  rebalancing and then decommissioning the old cluster nodes before finally
  migrating the NN/SNN? I know Hadoop isn't intended to run cross-colo, but
  would this be a more efficient approach than the one above?

 If you can keep both grids up at the same time, use distcp to do the copy.
  This will make sure the blocks get copied once, will keep permissions with
 -p, keep the replication factor, redistribute data (free balancing!), etc,
 etc, etc.







Re: how to convert java program to pig UDF, and important statements in the pig script

2010-08-09 Thread Bill Graham
http://hadoop.apache.org/pig/docs/r0.7.0/udf.html#How+to+Write+a+Simple+Eval+Function

Replace 'return str.toUpperCase()' in the example with 'return str + "*"'
and you have a star UDF.
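
A minimal sketch of such a star UDF along those lines, assuming Pig 0.7's
EvalFunc API; the class name is illustrative:

import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

public class STAR extends EvalFunc<String> {
    @Override
    public String exec(Tuple input) throws IOException {
        // return null on empty or null input, as the Pig UDF docs suggest
        if (input == null || input.size() == 0 || input.get(0) == null) {
            return null;
        }
        return (String) input.get(0) + "*";
    }
}

You would then register the jar containing it and call it from the script, e.g.
b = FOREACH a GENERATE STAR(word); (using the fully qualified class name if the
class lives in a package).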

On Mon, Aug 9, 2010 at 1:40 PM, Ifeanyichukwu Osuji
osujii...@potsdam.eduwrote:

 I have a simple Java file that adds a star to a word and prints the
 result (this is the simplest Java program I could think of):

 import java.util.*;

 public class Star {

    public static void main (String[] args) {
        if (args.length == 1) {
            addStar(args[0]);
        } else {
            System.exit(1);
        }
    }

    public static void addStar (String word) {
        String str = "";
        str += word + "*";
        System.out.println(str);
    }

 }

    How can I use Pig to execute this Java program? The answer I can come
 up with on my own is converting the method addStar to a UDF, but I don't
 know how to do it (please help). The documentation wasn't that helpful.

    Re-wording the question: let's say I have a file words.log that
 contains a column of words (all of which I want to add a star to). I
 would like to use Pig to pass each word in the log through the Java
 program above. How can I do this?

    If I were to write a Pig script, would it be like this?
 myscript.pig

 a = load 'words.log' as (word:chararray);
 b = foreach a generate star(word);... (I don't know what to do, please help)
 dump b;

 ubuntu-user
 ife






Re: mapred.min.split.size

2010-08-05 Thread Bill Graham
FYI, Chukwa support for Pig 0.7.0 was just committed last week:

https://issues.apache.org/jira/browse/CHUKWA-495

The patch was built on Chukwa 0.4.0, but you could try applying the patch
against Chukwa 0.3.0. I don't think the relevant code changed much between
0.3 and 0.4.


On Thu, Aug 5, 2010 at 4:40 PM, Richard Ding rd...@yahoo-inc.com wrote:

 What version of Pig are you on? The ChukwaStorage loader for Pig 0.7 uses
 Hadoop FileInputFormat to generate splits, so the mapred.min.split.size
 property should work.

 But judging from the release date, Chukwa 0.3 doesn't seem to be built on Pig 0.7.

 Thanks,
 -Richard

 -Original Message-
 From: Corbin Hoenes [mailto:cor...@tynt.com]
 Sent: Thursday, August 05, 2010 3:50 PM
 To: pig-user@hadoop.apache.org
 Subject: Re: mapred.min.split.size

 I am using the ChukwaStorage loader from chukwa 0.3.  Is it the loader's
 responsibility to deal with input splits?

 On Aug 5, 2010, at 4:14 PM, Richard Ding wrote:

  I misunderstood your earlier question. If you have one large file, setting the
 mapred.min.split.size property will help to increase the file split size.
 Pig will pass system properties to Hadoop. What loader are you using?
 
  Thanks,
  -Richard
 
  -Original Message-
  From: Corbin Hoenes [mailto:cor...@tynt.com]
  Sent: Thursday, August 05, 2010 1:22 PM
  To: pig-user@hadoop.apache.org
  Subject: Re: mapred.min.split.size
 
  So what does pig do when I have a 5 gig file?  Does it simply hardcode
 the split size to block size?   Is there no way to tell it to just operate
 on a larger split size?
 
 
  On Jul 27, 2010, at 3:41 PM, Richard Ding wrote:
 
  For Pig loaders, each split can have at most one file, doesn't matter
 what split size is.
 
  You can concatenate the input files before loading them.
 
  Thanks,
  -Richard
  -Original Message-
  From: Corbin Hoenes [mailto:cor...@tynt.com]
  Sent: Tuesday, July 27, 2010 2:09 PM
  To: pig-user@hadoop.apache.org
  Subject: mapred.min.split.size
 
  Is there a way to set the mapred.min.split.size property in Pig? I set
 it but it doesn't seem to have changed the mapper's HDFS_BYTES_READ counter.
  My mappers are finishing in ~10 secs.  I have ~20,000 of them.
 
 
 
 




Re: why I can't reply email

2010-08-04 Thread Bill Graham
Try sending your email as text and not HTML, if you're not already. Others
have also had issues on Apache lists with HTML emails getting flagged as
spam more easily.


On Wed, Aug 4, 2010 at 3:30 PM, Todd Lee ronnietodd...@gmail.com wrote:

 maybe qq.com got blacklisted :)

 T

 2010/8/4 我很快乐 896923...@qq.com

  I can send email to hive-user@hadoop.apache.org, but after other people
  reply to my email, I can't reply to their email, and I receive the message
  below:
 host mx1.eu.apache.org[192.87.106.230] said: 552 spam score (14.4)
 exceeded threshold (in reply to end of DATA command) .


  Could anybody tell me what the reason is?


 Thanks,


 LiuLei





Re: Chukwa questions

2010-07-09 Thread Bill Graham
Your understanding of how Chukwa works is correct.

Hadoop by itself is a system that contains both the HDFS and the MapReduce
systems. The other projects you list are all projects built upon Hadoop,
but you don't need them to run or to get value out of Hadoop by itself.

To run the Chukwa agent on a data-source node you do not need to have Hadoop
on that node. The Chukwa agent contains Hadoop jars in its run-time
distribution, and those will be used by the agent, but none of the Hadoop
daemons are needed on that node.

CC chukwa-us...@hadoop.apache.org list, where this discussion should
probably move to if there are follow-up Chukwa questions.

Bill



On Fri, Jul 9, 2010 at 8:33 AM, Blargy zman...@hotmail.com wrote:


 I am looking into Chukwa to collect/aggregate our search logs from across
 multiple hosts. As I understand it, I need to have an agent/adaptor running
 on each host, which then in turn forwards this to a collector (across the
 network) which will then write out to HDFS. Correct?

 Does Hadoop need to be installed on the host machines that are running the
 agent/adaptors, or just Chukwa? Is Hadoop by itself anything, or is Hadoop
 just a collection of tools... HDFS, Hive, Chukwa, Mahout, etc?

 Thanks



Re: Problem in chukwa output

2010-06-03 Thread Bill Graham
FYI, the TsProcessor is not the default processor, so if you want to use it
you need to explicitly configure it to be used. If you have done that, then
the default time format of the TsProcessor is 'yyyy-MM-dd HH:mm:ss,SSS',
which is not what you have. If you process logs like you show with the
TsProcessor without overriding the default time format, you will get many
InError files as output.

Here's the code:

http://svn.apache.org/viewvc/hadoop/chukwa/trunk/src/java/org/apache/hadoop/chukwa/extraction/demux/processor/mapper/TsProcessor.java?view=markup

And here's how to configure the times format expected by the processor:
https://issues.apache.org/jira/browse/CHUKWA-472

And here's how to set the default processor to something other than what's
hardcoded (which is DefaultProcessor):
https://issues.apache.org/jira/browse/CHUKWA-473

On Thu, Jun 3, 2010 at 10:15 AM, Jerome Boulon jbou...@netflix.com wrote:

  The default TsProcessor expects that every record/line starts with a date.

  The only thing that matters is the record delimiter. All current readers are
 using \n as a record delimiter.
 So for your specific case, is \n the right record delimiter?
 If yes, then, there's a bug in the reader, create a Jira for that.
 If \n is not a record delimiter then you have to write your own reader or
 change your log format to use \n as a record delimiter or escape the \n
 as we are doing in the log4j appender.

 /Jerome.


 On 6/3/10 12:14 AM, Stuti Awasthi stuti_awas...@persistent.co.in
 wrote:

  Hi,
 
  I checked the new TsProcessor class but I don't think that I have to change
  the date format, as I'm using standard SysLog-type log files.
 
  In my case, I am using TsProcessor. It is able to partially parse the log
  files correctly and generate .evt files beneath the repos/ dir. However,
  there is also an error directory and most of the data is going into that
  directory. I am getting the date parse exception.
 
  I tried to find out why some of the data could be parsed and the
 remaining
  could not be parsed. Then I found out that this is because the data is
 getting
  divided into chunks as follows:
 
  Suppose the contents of the log file are as follows:
 
  May 29 13:09:02 ps3156 /USR/SBIN/CRON[19815]: (root) CMD (  [ -x
  /usr/lib/php5/maxlifetime ]  [ -d /var/lib/php5 ]  find
 /var/lib/php5/
  -type f -cmin +$(/usr/lib/php5/maxlifetime) -print0 | xargs -n 200 -r -0
 rm)
  May 29 13:09:02 ps3156 /USR/SBIN/CRON[19815]: (root) CMD (  [ -x
  /usr/lib/php5/maxlifetime ]  [ -d /var/lib/php5 ]  find
 /var/lib/php5/
  -type f -cmin +$(/usr/lib/php5/maxlifetime) -print0 | xargs -n 200 -r -0
 rm)
  May 29 13:09:02 ps3156 /USR/SBIN/CRON[19815]: (root) CMD (  [ -x
  /usr/lib/php5/maxlifetime ]  [ -d /var/lib/php5 ]  find
 /var/lib/php5/
  -type f -cmin +$(/usr/lib/php5/maxlifetime) -print0 | xargs -n 200 -r -0
 rm)
  May 29 13:09:02 ps3156 /USR/SBIN/CRON[19815]: (root) CMD (  [ -x
  /usr/lib/php5/maxlifetime ]  [ -d /var/lib/php5 ]  find
 /var/lib/php5/
  -type f -cmin +$(/usr/lib/php5/maxlifetime) -print0 | xargs -n 200 -r -0
 rm)
 
 
  Chunk 1:
 
  May 29 13:09:02 ps3156 /USR/SBIN/CRON[19815]: (root) CMD (  [ -x
  /usr/lib/php5/maxlifetime ]  [ -d /var/lib/php5 ]  find
 /var/lib/php5/
  -type f -cmin +$(/usr/lib/php5/maxlifetime) -print0 | xargs -n 200 -r -0
 rm)
  May 29 13:09:02 ps3156 /USR/SBIN/CRON[19815]: (root) CMD (  [ -x
  /usr/lib/php5/maxlifetime ]  [ -d /var/lib/php5 ]  find
 /var/lib/php5/
  -type f -cmin +$(/usr/lib/php5/maxlifetime) -print0 |
 
 
  Chunk 2:
 
  xargs -n 200 -r -0 rm)
  May 29 13:09:02 ps3156 /USR/SBIN/CRON[19815]: (root) CMD (  [ -x
  /usr/lib/php5/maxlifetime ]  [ -d /var/lib/php5 ]  find
 /var/lib/php5/
  -type f -cmin +$(/usr/lib/php5/maxlifetime) -print0 | xargs -n 200 -r -0
 rm)
  May 29 13:09:02 ps3156 /USR/SBIN/CRON[19815]: (root) CMD (  [ -x
  /usr/lib/php5/maxlifetime ]  [ -d /var/lib/php5 ]  find
 /var/lib/php5/
  -type f -cmin +$(/usr/lib/php5/maxlifetime) -print0 | xargs -n 200 -r -0
 rm)
 
  There is no problem with the first chunk. It gets parsed properly and
 .evt
  file is created.
  But the second chunk starts with  xargs -n 200 -r -0 rm) which is not a
  valid date format. So the date parse exception is thrown.
  So the problem is with the way data is getting divided into chunks.
 
  So is there any way to divide the chunks evenly? Any pointers in this
 case
  would help.
 
  -Original Message-
  From: Bill Graham [mailto:billgra...@gmail.com]
  Sent: Tuesday, June 01, 2010 5:36 AM
  To: chukwa-user@hadoop.apache.org
  Cc: Jerome Boulon
  Subject: Re: Problem in chukwa output
 
  The unparseable date errors are due to the map processor not being
  able to properly extract the date from the record. Look at the
  TsProcessor (on the trunk) and the latest demux configs for examples
  of how to configure a processor for a given date format.
 
  I'm away from my computer now, but if you search for jiras assigned to
  me, you should find the relevant

Re: Problem in chukwa output

2010-05-31 Thread Bill Graham
The unparseable date errors are due to the map processor not being
able to properly extract the date from the record. Look at the
TsProcessor (on the trunk) and the latest demux configs for examples
of how to configure a processor for a given date format.

I'm away from my computer now, but if you search for jiras assigned to
me, you should find the relevant patches.

On Friday, May 28, 2010, Stuti Awasthi stuti_awas...@persistent.co.in wrote:

 Hi,

 Sorry for replying late, I was trying what you have suggested.
 Yes, it worked for me. The rotation factor increased my file size, but now I
 have another issue :)

 @Issue:

 When the chukwa demuxer gets the log for processing, the output gets
 distributed into 2 directories:

 1) After correct processing, it generates .evt files.
 2) When the chukwa parser does not parse the data properly, it ends up in an
 ...InError directory.

 Rotation Time: 5 min to 1 Hour

 1. SYSTEM LOGS
 Log file used: message1
 Datatype used: SysLog
 Error: java.text.ParseException: Unparseable date: y  4 06:12:38 p

 2. Hadoop Logs
 Log files used: Hadoop datanode logs, Hadoop TaskTracker logs
 Datatype used: HadoopLog
 Error: java.text.ParseException: Unparseable date: 0 for block blk_1617125

 3. Chukwa Agent Logs
 Log file used: Chukwa agent logs
 Datatype used: chukwaAgent
 Error: org.json.JSONException: A JSONObject text must begin with '{' at
 character 1 of post thread ChukwaHttpSender - collected 1 chunks

 I am wondering why data is getting into these InError directories. Is there
 any way we can get correct .evt files after demuxing rather than these
 InError.evt files?

 Thanks
 Stuti





 From: Jerome Boulon
 [mailto:jbou...@netflix.com]
 Sent: Thursday, May 27, 2010 1:01 AM
 To: chukwa-user@hadoop.apache.org
 Subject: Re: Problem in chukwa output







  Hi,
  The demux is grouping your data per date/hour/TimeWindow, so yes, one .done file
  could be split into multiple .evt files depending on the content/timestamp of
  your data.
  Generally, if you have a SysLogInError directory, it's because the parser
  throws an exception, and you should have some files in there.

 You may want to take a look at this wiki page to get an idea of Demux data
 flow.
 http://wiki.apache.org/hadoop/Chukwa_Processes_and_Data_Flow

 Regards,
 /Jerome.

 On 5/26/10 10:55 AM, Stuti Awasthi stuti_awas...@persistent.co.in
 wrote:

 Hi all,
 I am facing some problems in chukwa output.

 The following are the process flow in Collector :
 I worked with single .done file of 16MB in size for the analysis

 1) Logs were collected in /logs directory.

 2) After demux processing the output was stored in
 /repos directory.

  Following is the structure inside repos:

    /repos
      /SysLog           (Total Size: 1MB)
        /20100503/*.evt
        /20100504/*.evt
      /SysLogInError    (Total Size: 15MB)
        /…./*.evt

  I have 2 doubts:

  1) I noticed that my single log file was split into multiple .evt files, and
  my output contained 2 folders inside /SysLog. Is it the correct behaviour
  that a single .done file is split into n .evt files and into a different
  directory structure?

  2) A SysLogInError directory was generated, but there was no ERROR in the
  log file. I am not sure when this directory gets created.

 Any pointers will be helpful.
 Thanks
 Stuti




Re: how to determine hive version

2010-05-31 Thread Bill Graham
The getVersion method in the hive jdbc driver should give you the
version, which is read from the manifest version in the META-INF
folder of the hive jar.
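
For illustration, here is a minimal sketch of reading the version over JDBC.
The driver class name and connection URL below are assumptions for a default
local HiveServer setup, and whether the 0.4-era driver implements these
standard metadata calls isn't confirmed in this thread:

import java.sql.Connection;
import java.sql.DatabaseMetaData;
import java.sql.DriverManager;

public class HiveVersionCheck {
  public static void main(String[] args) throws Exception {
    // Assumed driver class and URL; adjust for your HiveServer setup.
    Class.forName("org.apache.hadoop.hive.jdbc.HiveDriver");
    Connection conn = DriverManager.getConnection("jdbc:hive://localhost:10000/default", "", "");
    try {
      DatabaseMetaData meta = conn.getMetaData();
      // Standard JDBC metadata calls; per the note above, the Hive driver
      // reads its version from the jar's META-INF manifest.
      System.out.println("Product version: " + meta.getDatabaseProductVersion());
      System.out.println("Driver version:  " + meta.getDriverVersion());
    } finally {
      conn.close();
    }
  }
}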

On Monday, May 31, 2010, Arvind Prabhakar arv...@cloudera.com wrote:
 Hello Kortni,
 One way to find out which version of Hive you are using is to look at the
 hive-default.xml file under the conf directory. In this file, check out the value
 of the property hive.hwi.war.file, which should be of the format:

 <value>lib/hive-hwi-VERSION.war</value>

 From there you can infer the version.
 If you want a more direct means of finding out the version of Hive, please
 file a Jira as an enhancement request.

 Arvind

 On Thu, May 27, 2010 at 2:58 PM, Kortni Smith ksm...@abebooks.com wrote:

 Hello,

 How can you tell what version of hive is running?

  I'm working with hive and EMR, and know that it's hive 0.4 from the EMR
  job's first step configuration (s3://elasticmapreduce/libs/hive/0.4/install-hive),
  but I need to know if it's 0.4.1 or 0.4.0.



 Thanks





 Kortni Smith | Software Developer
 AbeBooks.com  Passion for
 books.

 ksm...@abebooks.com
 phone: 250.412.3272  |  fax: 250.475.6014

 Suite 500 - 655 Tyee Rd. Victoria, BC. Canada V9A 6X5

 www.abebooks.com  |  www.abebooks.co.uk  |  www.abebooks.de
  www.abebooks.fr  |  www.abebooks.it  |  www.iberlibro.com


Re: including multiple delimited fields (of unknown count) into one

2010-05-20 Thread Bill Graham
Correct, I don't need to know the arity of the tuple and if I LOAD without
specifying the fields like you show I should be able to effectively STORE
the same data.  The problem though is that I need to include both the tuple
and the timestamp in the grouping (but not the count), then sum the counts.

As an example, this:

127120140   3  1770162 5
127120140   4  2000162 100
127120170   3  1770162 5
127120170   4  2000162 100

Would become this (where 127119960 is the hour that the two timestamps
both roll up to):

127119960   6  1770162 5
127119960   8  2000162 100

So in my case I'd like to be able to load timetamps, count and tuple and
then group on timestamp and tuple and output in the same format of
timestamp, count, tuple.

The easiest hack I've come up with for now is to dynamically insert the
field definitions in my script before I run it. So in the example above I
would insert 'f1, f2, f3' everywhere I need to reference the tuple. Another
run might insert 'f1, f2' for an input that only has 2 extra fields.


On Thu, May 20, 2010 at 12:39 AM, Mridul Muralidharan mrid...@yahoo-inc.com
 wrote:



  I am not sure what the processing is once the grouping is done, but each
  tuple has a size() (for arity) method which gives us the number of fields in
  that tuple [if using it in a udf].
  So that can be used to aid in computation.
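
  For example, a minimal sketch of a Java UDF that uses Tuple.size() to deal
  with whatever arity it is handed (the class name is made up for illustration):

  import java.io.IOException;
  import org.apache.pig.EvalFunc;
  import org.apache.pig.data.Tuple;

  // Illustrative UDF: returns the arity of the tuple it is passed, so a
  // script can reason about a variable number of trailing fields.
  public class Arity extends EvalFunc<Integer> {
    @Override
    public Integer exec(Tuple input) throws IOException {
      if (input == null) {
        return null;
      }
      return input.size();
    }
  }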


 If you are interested in aggregating and simply storing it - you dont
 really need to know the arity of a tuple, right ? (That is, group by
 timestamp, and store - PigStorage should continue to store the variable
 number of fields as was present in input).



 Regards,
 Mridul


 On Thursday 20 May 2010 05:39 AM, Bill Graham wrote:

 Thanks Mridul, but how would I access the items in the numbered fields
 3..N where I don't know what N is? Are you suggesting I pass A to a
 custom UDF to convert to a tuple of [time, count, rest_of_line]?


 On Wed, May 19, 2010 at 4:11 PM, Mridul Muralidharan
 mrid...@yahoo-inc.com mailto:mrid...@yahoo-inc.com wrote:


You can simply skip specifying schema in the load - and access the
fields either through the udf or through $0, etc positional indexes.


Like :

A = load 'myfile' USING PigStorage();
B = GROUP A by round_hour($0) PARALLEL $PARALLELISM;
C = ...



Regards,
Mridul


On Thursday 20 May 2010 04:07 AM, Bill Graham wrote:

Hi,

Is there a way to read a collection (of unknown size) of
tab-delimited
values into a single data type (tuple?) during the LOAD phase?

Here's specifically what I'm looking to do. I have a given input
file format
of tab-delimited fields like so:

[timestamp] [count] [field1] [field2] [field2] .. [fieldN]

I'm writing a pig job to take many small files and roll up the
counts for a
given time increment of a lesser granularity. For example, many
files with
timestamps rounded to 5 minute intervals will be rolled into a
single file
with 1 hour granularity.

I'm able to do this by grouping on the timestamp (rounded down
to the hour)
and each of the fields shown if I know the number of fields and
I list them
all explicitly.  I'd like to write this script though that would
work on
different input formats, some which might have N fields, where
others have
M. For a given job run, the number of fields in the input files
passed would
be fixed.

So I'd like to be able to do something like this in pseudo code:

LOAD USING PigStorage('\t') AS (timestamp, count, rest_of_line)
...
GROUP BY round_hour(timestamp), rest_of_line
[flatten group and sum counts]
...
STORE round_hour(timestamp), totalCount, rest_of_line

Where I know nothing about how many tokens are in next_of_line.
Any ideas
besides subclassing PigStorage or writing a new FileInputLoadFunc?

thanks,
Bill







Re: Add user define jars

2010-05-19 Thread Bill Graham
Hi Jerome,

I haven't had to use external jars yet (I've pushed the logic I need into
chukwa instead), but I would like to have a solution where including an
external jar is just a matter of adding a jar to a local directory. Similar
to how Hive has the auxlibs/ dir. It sounds like your solution aligns with
this approach. If that's the case, I'm in favor of it.

thanks,
Bill

On Wed, May 19, 2010 at 6:02 PM, Kirk True k...@mustardgrain.com wrote:

  Hi Jerome,

  I'm trying to use the 'stick your JAR in the HDFS chukwa/demux directory'
  approach. I'm not able to get it working (see the Chukwa can't find Demux
  class thread in the mailing list).

 Thanks,
 Kirk


 On 5/11/10 5:29 PM, Jerome Boulon wrote:

 Hi,
 I would like to get feedback from people using their own external jars with
 Demux.
 The current implementation is using the distributed cache to upload jars to
 Hadoop.
 - Is it working well?
 - Do you have any problem with this feature?

  I'm asking this because we solved this requirement in a different way in
  Honu and I wonder if that's something we need to improve in Chukwa.

 Thanks in advance,
   /Jerome.




including multiple delimited fields (of unknown count) into one

2010-05-19 Thread Bill Graham
Hi,

Is there a way to read a collection (of unknown size) of tab-delimited
values into a single data type (tuple?) during the LOAD phase?

Here's specifically what I'm looking to do. I have a given input file format
of tab-delimited fields like so:

[timestamp] [count] [field1] [field2] [field2] .. [fieldN]

I'm writing a pig job to take many small files and roll up the counts for a
given time increment of a lesser granularity. For example, many files with
timestamps rounded to 5 minute intervals will be rolled into a single file
with 1 hour granularity.

I'm able to do this by grouping on the timestamp (rounded down to the hour)
and each of the fields shown if I know the number of fields and I list them
all explicitly.  I'd like to write this script though that would work on
different input formats, some which might have N fields, where others have
M. For a given job run, the number of fields in the input files passed would
be fixed.

So I'd like to be able to do something like this in pseudo code:

LOAD USING PigStorage('\t') AS (timestamp, count, rest_of_line)
...
GROUP BY round_hour(timestamp), rest_of_line
[flatten group and sum counts]
...
STORE round_hour(timestamp), totalCount, rest_of_line

Where I know nothing about how many tokens are in next_of_line. Any ideas
besides subclassing PigStorage or writing a new FileInputLoadFunc?

thanks,
Bill


HttpTriggerAction - configuring N objects

2010-04-21 Thread Bill Graham
Hi,

As a follow up to CHUKWA-477, I'm writing a class called HttpTriggerAction
that can hit one or more URLs upon successful completion of a demux job. I'd
like to contribute it back unless anyone objects. Anyway, I'm looking for
feedback though on how to configure this object.

The issue is that since the class can hit N urls, it needs N sets of
key-value configurations. The hadoop configurations model is just name-value
pairs though, so I'm kicking around ideas around the best way to handle
this.

Specifically, I need to configure values for url, an optional HTTP method
(default is GET), an optional collection of HTTP headers and an optional
post body. I was thinking of just making a convention where key values could
be incremented like below, but wanted to see if there were better
suggestions out there.

chukwa.trigger.action.[eventName].http.1.url=http://site.com/firstTrigger
chukwa.trigger.action.[eventName].http.1.headers=User-Agent:chukwa

chukwa.trigger.action.[eventName].http.2.url=http://site.com/secondTrigger
chukwa.trigger.action.[eventName].http.2.method=POST
chukwa.trigger.action.[eventName].http.2.headers=User-Agent:chukwa,Accepts:text/plain
chukwa.trigger.action.[eventName].http.2.body=Some post body to submit

chukwa.trigger.action.[eventName].http.N.url=
chukwa.trigger.action.[eventName].http.N.method=
chukwa.trigger.action.[eventName].http.N.headers=
chukwa.trigger.action.[eventName].http.N.body=
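
As a sketch of how an action could consume such incremented keys from a Hadoop
Configuration (illustrative only, not the actual HttpTriggerAction code; the
prefix argument would be something like chukwa.trigger.action.[eventName]):

import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.conf.Configuration;

public class HttpTriggerConfigReader {
  // Collects url/method/headers/body for http.1, http.2, ... until no url is found.
  public static List<String[]> readHttpTriggers(Configuration conf, String prefix) {
    List<String[]> triggers = new ArrayList<String[]>();
    for (int i = 1; ; i++) {
      String base = prefix + ".http." + i;
      String url = conf.get(base + ".url");
      if (url == null) {
        break; // no more numbered entries
      }
      String method = conf.get(base + ".method", "GET");
      String headers = conf.get(base + ".headers", "");
      String body = conf.get(base + ".body", "");
      triggers.add(new String[] { url, method, headers, body });
    }
    return triggers;
  }
}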

Since the action could potentially be used by other types of events, the
event name should be included. This implies that we should add an eventName
field to the TriggerAction.execute method in CHUKWA-477.

Thoughts?

thanks,
Bill


Re: HttpTriggerAction - configuring N objects

2010-04-21 Thread Bill Graham
Thanks guys. Eric, I think I'm on a similar path to what you're suggesting,
except I'm using a two-step config to specify a.) the action classes, and
b.) their configs. Jerome, I considered that, but it gets tricky with
multi-value sets (like headers). At some point we get into delimiter
overload.

To get to specifics, here's an example of what I currently have:

- A comma separated list of TriggerAction classes to be invoked upon
successful completion of a successful demux run. This is similar to the
current data loader pattern and can also be used wherever else we later want
to fire TriggerActions.

  <property>
    <name>chukwa.post.demux.success.action</name>
    <value>org.apache.hadoop.chukwa.extraction.demux.HttpTriggerAction,some.other.TriggerAction</value>
  </property>

- Then when the HttpTriggerAction runs, it knows how to look for its
configs in the way that we've been discussing. For the above action, the
trigger event name will be passed to TriggerAction as an enum. The enum
would have a name like 'postDemuxSuccess', as well as a pointer to the base
config string for the event, like 'chukwa.trigger.post.demux.success'.

- HttpTriggerAction would in this case then know to look for its configs
under chukwa.trigger.post.demux.success.http and look for values like this:

  <property>
    <name>chukwa.trigger.post.demux.success.http.1.url</name>
    <value>http://site.com/firstTrigger</value>
  </property>

so the syntax of the config key then is:

chukwa.[eventName].[actionNS].N.[actionKeys]

Although other actions are free to implement all parts below eventName
in whatever way makes the most sense for their needs.

thanks,
Bill

On Wed, Apr 21, 2010 at 4:50 PM, Jerome Boulon jbou...@netflix.com wrote:

  Hi,
 You can have a root configuration key that will give the list of keys to
 look for:
 Or you can look for the existence of http.N key in the configuration
 object.

 Ex
  <property>
    <name>myConfig.eventName.list</name>
    <value>http.1, http.2, ..., http.x</value>
  </property>


 Can you clarify what the eventName is?

 /Jerome.


 On 4/21/10 3:31 PM, Bill Graham billgra...@gmail.com wrote:

 Hi,

 As a follow up to CHUKWA-477, I'm writing a class called HttpTriggerAction
 that can hit one or more URLs upon successful completion of a demux job. I'd
 like to contribute it back unless anyone objects. Anyway, I'm looking for
 feedback though on how to configure this object.

 The issue is that since the class can hit N urls, it needs N sets of
 key-value configurations. The hadoop configurations model is just name-value
 pairs though, so I'm kicking around ideas around the best way to handle
 this.

 Specifically, I need to configure values for url, an optional HTTP method
 (default is GET), an optional collection of HTTP headers and an optional
 post body. I was thinking of just making a convention where key values could
 be incremented like below, but wanted to see if there were better
 suggestions out there.

 chukwa.trigger.action.[eventName].http.1.url=http://site.com/firstTrigger
 chukwa.trigger.action.[eventName].http.1.headers=User-Agent:chukwa

 chukwa.trigger.action.[eventName].http.2.url=http://site.com/secondTrigger
 chukwa.trigger.action.[eventName].http.2.method=POST

 chukwa.trigger.action.[eventName].http.2.headers=User-Agent:chukwa,Accepts:text/plain
 chukwa.trigger.action.[eventName].http.2.body=Some post body to submit
 
 chukwa.trigger.action.[eventName].http.N.url=
 chukwa.trigger.action.[eventName].http.N.method=
 chukwa.trigger.action.[eventName].http.N.headers=
 chukwa.trigger.action.[eventName].http.N.body=

 Since the action could potentially be used by other types of events, the
 event name should be included. This implies that we should add an eventName
 field to the TriggerAction.execute method in CHUKWA-477.

 Thoughts?

 thanks,
 Bill






Re: How to write log4j output to HDFS?

2010-04-21 Thread Bill Graham
Hi,

Check out Chukwa:
http://hadoop.apache.org/chukwa/docs/r0.3.0/design.html#Introduction

It allows you to run agents which tail log4j output and send the data to
collectors, which write the data to HDFS.


thanks,
Bill


On Wed, Apr 21, 2010 at 3:43 AM, Dhanya Aishwarya Palanisamy 
dhanya.aishwa...@gmail.com wrote:

 Hi,

   Has anyone tried to write a log4j log file directly to HDFS? If yes,
  please reply with how to achieve this.
  I am trying to create an appender. Is this the way? My necessity is to write
  logs to a file at particular intervals and query that data at a later
  stage.

 Thanks,
 Dhanya



Re: Demux trigger

2010-04-19 Thread Bill Graham
Thanks, Eric.

Looking into the PostProcessorManager code a little more, it seems the
chukwa.post.demux.data.loader loaders get called before the post processor
moves the finished files into place. I need a trigger that fires after
they're in place in the repos/ dir.

This is the code I'm referring to from PostProcessorManager.

if ( processDemuxPigOutput(directoryToBeProcessed) == true) {
  if (movetoMainRepository(directoryToBeProcessed,chukwaRootReposDir) ==
true) {
deleteDirectory(directoryToBeProcessed);
...
continue;
  }
}

The data loaders get called as part of processDemuxPigOutput. Is this
sequence correct or am I missing something?

If this is in fact the case, I'd like to add a hook to take some post action
once the files are in the /repos dir. From a user's perspective, that's what
'post demux' implies. I'm open to suggestions re the best way to do that
and what to call the configs. One thought is to follow a similar pattern as
how DataLoaders are configured, but use a new interface that's more generic
than loading data. Not 'Action', but something that denotes that.

thanks,
Bill


On Sat, Apr 17, 2010 at 1:35 PM, Eric Yang ey...@yahoo-inc.com wrote:

  Yes.

 Regards,
 Eric



 On 4/16/10 9:56 PM, Bill Graham billgra...@gmail.com wrote:

 Thanks Eric, I'm glad I emailed before writing code.

 I can see how data loaders get triggered, but I don't see one that makes an
 HTTP request like I'm proposing. Are you suggesting I implement a new
 DataLoader that doesn't actually load data, but makes an HTTP request
 instead?


 On Fri, Apr 16, 2010 at 7:30 PM, Eric Yang ey...@yahoo-inc.com wrote:

 Hi Bill,

  This already exists in Chukwa.  Take a look at DataLoaderFactory.java and
  SocketDataLoader.java; they are triggered after demux jobs.  Hence, you can
  use PostProcessManager as a trigger, and configure it through
  chukwa-demux-conf.xml, chukwa.post.demux.data.loader.  Hope this helps.

 Regards,
 Eric


 On 4/16/10 4:39 PM, Bill Graham billgra...@gmail.com 
 http://billgra...@gmail.com  wrote:

 Hi,

 I'd like to add a feature to the DemuxManager where you can configure an
 HTTP request to be fired after a Demux run. It would be similar to what's
 currently there for Nagios alerts, only this would be HTTP (the Nagios alert
 is a raw TCP socket call). You'd configure the host, port, (POST|GET) and
 uri for this first pass.

 Some metadata about the job would also go along for the ride. Maybe things
 like status code and job name.

 The use case is to trigger a dependent job to run elsewhere upon
 completion. The same functionality could potentially be ported to some of
 the other chukwa processor jobs if the need arose.

 Thoughts?

 thanks,
 Bill






Re: Demux trigger

2010-04-19 Thread Bill Graham
Sure, I can make that change. I named it processDataLoaders since it
handles a collection.

I created a JIRA with a first pass of the implementation, along with a few
points of discussion:

https://issues.apache.org/jira/browse/CHUKWA-477

Let me know what you think.

thanks,
Bill

On Mon, Apr 19, 2010 at 10:23 AM, Eric Yang ey...@yahoo-inc.com wrote:

 Yes, this is indeed the case.  +1 on adding a hook inside the if block to
 generate post-move triggers.  Could you also rename processDemuxPigOutput to
 processDataLoader?  The name is more concise.  Thanks

 Regards,
 Eric


 On 4/19/10 9:54 AM, Bill Graham billgra...@gmail.com wrote:

  Thanks, Eric.
 
  Looking into the PostProcessorManager code a little more, it seems the
  chukwa.post.demux.data.loader loaders get called before the post
 processor
  moves the finished files into place. I need a trigger that fires after
 they're
  in place in the repos/ dir.
 
  This is the code I'm referring to from PostProcessorManager.
 
  if ( processDemuxPigOutput(directoryToBeProcessed) == true) {
if (movetoMainRepository(directoryToBeProcessed,chukwaRootReposDir) ==
 true)
  {
  deleteDirectory(directoryToBeProcessed);
  ...
  continue;
}
  }
 
  The data loaders get called as part of processDemuxPigOutput. Is this
 sequence
  correct or am I missing something?
 
  If this is in fact the case, I'd like to add a hook to take some post
 action
  once the files are in the /repos dir. From a users perspective, that's
 what
  'post demux' implies. I'm open for suggestions re the best way to do that
 and
  what to call the configs. One thought is to follow a similar pattern as
 how
  DataLoaders are configured, but use a new interface that's more generic
 than
  loading data. Not 'Action', but something that denotes that.
 
  thanks,
  Bill
 
 
  On Sat, Apr 17, 2010 at 1:35 PM, Eric Yang ey...@yahoo-inc.com wrote:
  Yes.
 
  Regards,
  Eric
 
 
 
  On 4/16/10 9:56 PM, Bill Graham billgra...@gmail.com
  http://billgra...@gmail.com  wrote:
 
  Thanks Eric, I'm glad I emailed before writing code.
 
  I can see how data loaders get triggered, but I don't see one that
 makes an
  HTTP request like I'm proposing. Are you suggesting I implement a new
  DataLoader that doesn't actually load data, but makes an HTTP request
  instead?
 
 
  On Fri, Apr 16, 2010 at 7:30 PM, Eric Yang ey...@yahoo-inc.com
  http://ey...@yahoo-inc.com  wrote:
  Hi Bill,
 
   This already exists in Chukwa.  Take a look at DataLoaderFactory.java and
   SocketDataLoader.java; they are triggered after demux jobs.  Hence, you can
   use PostProcessManager as a trigger, and configure it through
   chukwa-demux-conf.xml, chukwa.post.demux.data.loader.  Hope this helps.
 
  Regards,
  Eric
 
 
  On 4/16/10 4:39 PM, Bill Graham billgra...@gmail.com
  http://billgra...@gmail.com  http://billgra...@gmail.com  wrote:
 
  Hi,
 
  I'd like to add a feature to the DemuxManager where you can configure
 an
  HTTP request to be fired after a Demux run. It would be similar to
 what's
  currently there for Nagios alerts, only this would be HTTP (the
 Nagios
  alert is a raw TCP socket call). You'd configure the host, port,
  (POST|GET) and uri for this first pass.
 
  Some metadata about the job would also go along for the ride. Maybe
 things
  like status code and job name.
 
  The use case is to trigger a dependent job to run elsewhere upon
  completion. The same functionality could potentially be ported to
 some of
  the other chukwa processor jobs if the need arose.
 
  Thoughts?
 
  thanks,
  Bill
 
 
 
 
 




Re: Demux trigger

2010-04-16 Thread Bill Graham
Thanks Eric, I'm glad I emailed before writing code.

I can see how data loaders get triggered, but I don't see one that makes an
HTTP request like I'm proposing. Are you suggesting I implement a new
DataLoader that doesn't actually load data, but makes an HTTP request
instead?


On Fri, Apr 16, 2010 at 7:30 PM, Eric Yang ey...@yahoo-inc.com wrote:

  Hi Bill,

  This already exists in Chukwa.  Take a look at DataLoaderFactory.java and
  SocketDataLoader.java; they are triggered after demux jobs.  Hence, you can
  use PostProcessManager as a trigger, and configure it through
  chukwa-demux-conf.xml, chukwa.post.demux.data.loader.  Hope this helps.

 Regards,
 Eric


 On 4/16/10 4:39 PM, Bill Graham billgra...@gmail.com wrote:

 Hi,

 I'd like to add a feature to the DemuxManager where you can configure an
 HTTP request to be fired after a Demux run. It would be similar to what's
 currently there for Nagios alerts, only this would be HTTP (the Nagios alert
 is a raw TCP socket call). You'd configure the host, port, (POST|GET) and
 uri for this first pass.

 Some metadata about the job would also go along for the ride. Maybe things
 like status code and job name.

 The use case is to trigger a dependent job to run elsewhere upon
 completion. The same functionality could potentially be ported to some of
 the other chukwa processor jobs if the need arose.

 Thoughts?

 thanks,
 Bill




Re: Making TsProcessor's date format configurable

2010-04-06 Thread Bill Graham
Sure, thanks Jerome. I assigned you the JobConf work:
https://issues.apache.org/jira/browse/CHUKWA-471

And I've got the date format for TsProcessor JIRA;
https://issues.apache.org/jira/browse/CHUKWA-472

As well as making the default processor configurable:
https://issues.apache.org/jira/browse/CHUKWA-473

For this last one how about a config like this:

<property>
  <name>chukwa.demux.default.processor</name>
  <value>org.apache.hadoop.chukwa.extraction.demux.processor.mapper.DefaultProcessor</value>
</property>


On Tue, Apr 6, 2010 at 10:57 AM, Jerome Boulon jbou...@netflix.com wrote:

 Hi,
 When you'll create a Jira for that, can you create a separate one for
 JobConf?
 I'll submit a patch for it.
 Thanks,
   /Jerome


 On 4/6/10 10:50 AM, Jerome Boulon jbou...@netflix.com wrote:

  Hi,
 
   Could you also make sure that you force sdf to be GMT?
   sdf.setTimeZone(TimeZone.getTimeZone("GMT"));
 
  - Instead of an If/then/else you could use the default value in
  conf.get(key,defaultVal) to set the default format.
 
   - You can load jobConf directly from the mapper/reducer, but you will have
   to add a new method to the AbstractProcessor/Reducer; then any parser/reducer
   class will have access to it. We don't need a distributed cache to do that.
 
  Thanks,
/Jerome.
 
  On 4/6/10 10:18 AM, Eric Yang ey...@yahoo-inc.com wrote:
 
  Hi Bill,
 
   We can introduce some configuration like this in chukwa-demux-conf.xml:

   <property>
     <name>TsProcessor.time.format.some_data_type</name>
     <value>yyyy-MM-dd HH:mm:ss,SSS</value>
   </property>
 
  Move the SimpleDateFormat outside of constructor.
 
   StringBuilder format = new StringBuilder();
   format.append("TsProcessor.time.format.");
   format.append(chunk.getDataType());
   if (conf.get(format.toString()) != null) {
     sdf = new SimpleDateFormat(conf.get(format.toString()));
   } else {
     sdf = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss,SSS");
   }
 
  It will require changes the MapperFactory class to include the running
  JobConf has a HashMap or load the JobConf from distributed cache.
 
  Regards,
  Eric
 
  On 4/6/10 9:55 AM, Bill Graham billgra...@gmail.com wrote:
 
  Hi,
 
  I'd like to be able to configure the date format for TSProcessor.
 Looking at
  the code, others have had the same thought:
 
     public TsProcessor() {
       // TODO move that to config
       sdf = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss,SSS");
     }
 
  I can write a patch to support this change, but how do we want to make
 the
  date configurable? Currently there is a single config (AFAIK) that
 binds the
  processor class to a data type in chukwa-demux-conf.xml that looks like
  this:
 
     <property>
       <name>some_data_type</name>
       <value>org.apache.hadoop.chukwa.extraction.demux.processor.mapper.TsProcessor</value>
       <description>Parser for some_data_type</description>
     </property>
 
 
  Any suggestions for how we'd incorporate date format into that config?
 Or
  perhaps it would be a separate conf. Are there any examples in the code
 of
  processors that take configurations currently?
 
  As a side note, I'd also like to add a configuration for what the
 default
  processor should be, since I'd prefer to change ours from
 DefaultProcessor
  to
  TsProcessor. Maybe 'chukwa.demux.default.processor'? Thoughts?
 
  thanks,
  Bill
 
 
 
 
 




Re: Web/Data Analytics and Data Collection using Hadoop

2010-03-22 Thread Bill Graham
Hi Utku,

We're using Chukwa to collect and aggregate data as you describe and so far
it's working well. Typically chukwa collectors are deployed to all data
nodes, so there is no master write-bottleneck with this approach actually.

There have been discussions lately on the Chukwa list regarding how to write
data into HBase using Chukwa collectors or data processors that you might
want to check out.

thanks,
Bill


On Mon, Mar 22, 2010 at 4:50 AM, Utku Can Topçu u...@topcu.gen.tr wrote:

 Hey All,

 Currently in a project I'm involved, we're about to make design choices
 regarding the use of Hadoop as a scalable and distributed data analytics
 framework.
 Basically the application would be the base of a Web Analytics tool, so I
 do
 have the vision that Hadoop would be the finest choice for analyzing the
 collected data.
 But for the collection of data is somewhat a different issue to consider,
 there needs to be serious design decision taken for the data collection
 architecture.

 Actually, I'd like to have a distributed and scalable data collection in
 production. The current situation is like we have multiple of servers in
 3-4
 different locations, each collect some sort of data.
 The basic approach on analyzing this distributed data would be: logging
 them
 into structured text files so that we'll be able to transfer them to the
 hadoop cluster and analyze them using some MapReduce jobs.
 The basic process I define follows like this
 - Transfer log files to Hadoop master, (collectors to master)
 - Put the files on the master to the HDFS, (master to the cluster)
 As it's clear there's an overhead in the transfer of the log files. And the
 big log files will have to be analyzed even if you'll somehow need a small
 portion of the data.

 One better other option is, logging directly to a distributed database like
 Cassandra and HBase, so the MapReduce jobs would be fetching the data from
 the databases and doing the analysis. And the data will also be randomly
 accessible and open to queries in real-time.
 I'm not that much familiar in this area of distributed databases, however I
 can see that,
 -If we're using cassandra for storing logging information, we won't have a
 connection overhead for writing the data to the Cassandra cluster, since
 all
 nodes in the cluster are able to accept incoming write requests. However in
 HBase I'm afraid we'll need to write to the master only, so in such
 situation, there seems to be a connection overhead on the master and we can
 only scale up-to the levels that the through-put of master. Logging to
 HBase
 doesn't seem scalable from this point of view.
 -On the other hand, using a different Cassandra cluster which is not
 directly from the ecosystem of Hadoop, I'm afraid we'll lose the concept of
 data locality while using the data for analysis in MapReduce jobs if
 Cassandra was the choice for keeping the log data. However in the case of
 HBase we'll be able to use the data locality since it's directly related to
 the HDFS.
 -Is there a stable way for integrating Cassandra with Hadoop?

 So finally Chukwa seems to be a good choice for such kind of a data
 collection. Where each server that can be defined as sources will be
 running
 Agents on them, so they can transfer the data to the Collectors that reside
 close to the HDFS. After series of pipe-lined processes the data would be
 clearly available for analysis using MapReduce jobs.
 I see some connection overhead due to the through-put of master in this
 scenario and the files that need to be analyzed will also be again
 available
 in big files, so a sample range of the data analysis would require the
 reading of the full files.

 I feel like these are the brief options I figured out till now. Actually
 all
 decision will come with some kind of a drawback and provide some decision
 specific more functionality compared to the others.

 Is there anyone on the list who solved the need in such functionality
 previously? I'm open to all kind of comments and suggestions,

 Best Regards,
 Utku



Re: Web/Data Analytics and Data Collection using Hadoop

2010-03-22 Thread Bill Graham
Sure, any framework that writes data into HDFS will need to communicate with
the namenode. So yes, there can potentially be large numbers of connections
to the namenode.

I (possibly mistakenly) thought you were speaking specifically of a
bottleneck caused by writes through a single master node. The actually data
does not go through the name node though, so there is no bottleneck in the
data flow.


On Mon, Mar 22, 2010 at 10:50 AM, Utku Can Topçu u...@topcu.gen.tr wrote:

 Hi Bill,

 Thank you for your comments,
 The main thing about the Chukwa installation on top of Hadoop is I guess,
 you somehow need to connect to the namenode from the collectors.
 Isn't it the case while trying to reach the HDFS, or the Chukwa collectors
 are writing on the local drives instead of HDFS?

 Best,
 Utku


 On Mon, Mar 22, 2010 at 6:34 PM, Bill Graham billgra...@gmail.com wrote:

 Hi Utku,

 We're using Chukwa to collect and aggregate data as you describe and so
 far
 it's working well. Typically chukwa collectors are deployed to all data
 nodes, so there is no master write-bottleneck with this approach actually.

 There have been discussions lately on the Chukwa list regarding how to
 write
 data into HBase using Chukwa collectors or data processors that you might
 want to check out.

 thanks,
 Bill


 On Mon, Mar 22, 2010 at 4:50 AM, Utku Can Topçu u...@topcu.gen.tr
 wrote:

  Hey All,
 
  Currently in a project I'm involved, we're about to make design choices
  regarding the use of Hadoop as a scalable and distributed data analytics
  framework.
  Basically the application would be the base of a Web Analytics tool, so
 I
  do
  have the vision that Hadoop would be the finest choice for analyzing the
  collected data.
  But for the collection of data is somewhat a different issue to
 consider,
  there needs to be serious design decision taken for the data collection
  architecture.
 
  Actually, I'd like to have a distributed and scalable data collection in
  production. The current situation is like we have multiple of servers in
  3-4
  different locations, each collect some sort of data.
  The basic approach on analyzing this distributed data would be: logging
  them
  into structured text files so that we'll be able to transfer them to the
  hadoop cluster and analyze them using some MapReduce jobs.
  The basic process I define follows like this
  - Transfer log files to Hadoop master, (collectors to master)
  - Put the files on the master to the HDFS, (master to the cluster)
  As it's clear there's an overhead in the transfer of the log files. And
 the
  big log files will have to be analyzed even if you'll somehow need a
 small
  portion of the data.
 
  One better other option is, logging directly to a distributed database
 like
  Cassandra and HBase, so the MapReduce jobs would be fetching the data
 from
  the databases and doing the analysis. And the data will also be randomly
  accessible and open to queries in real-time.
  I'm not that much familiar in this area of distributed databases,
 however I
  can see that,
  -If we're using cassandra for storing logging information, we won't have
 a
  connection overhead for writing the data to the Cassandra cluster, since
  all
  nodes in the cluster are able to accept incoming write requests. However
 in
  HBase I'm afraid we'll need to write to the master only, so in such
  situation, there seems to be a connection overhead on the master and we
 can
  only scale up-to the levels that the through-put of master. Logging to
  HBase
  doesn't seem scalable from this point of view.
  -On the other hand, using a different Cassandra cluster which is not
  directly from the ecosystem of Hadoop, I'm afraid we'll lose the concept
 of
  data locality while using the data for analysis in MapReduce jobs if
  Cassandra was the choice for keeping the log data. However in the case
 of
  HBase we'll be able to use the data locality since it's directly related
 to
  the HDFS.
  -Is there a stable way for integrating Cassandra with Hadoop?
 
  So finally Chukwa seems to be a good choice for such kind of a data
  collection. Where each server that can be defined as sources will be
  running
  Agents on them, so they can transfer the data to the Collectors that
 reside
  close to the HDFS. After series of pipe-lined processes the data would
 be
  clearly available for analysis using MapReduce jobs.
  I see some connection overhead due to the through-put of master in this
  scenario and the files that need to be analyzed will also be again
  available
  in big files, so a sample range of the data analysis would require the
  reading of the full files.
 
  I feel like these are the brief options I figured out till now. Actually
  all
  decision will come with some kind of a drawback and provide some
 decision
  specific more functionality compared to the others.
 
  Is there anyone on the list who solved the need in such functionality
  previously? I'm open to all kind of comments

Re: PigServer memory leak

2010-03-19 Thread Bill Graham
I believe I've found the cause of my Pig memory leak so I wanted to report
back. I profiled my app after letting it run for a couple of days and found
that the static toDelete Stack in the FileLocalizer object was growing over
time without getting flushed. I had thousands of HFile objects in that
stack. This produced a memory leak both in my app and in HDFS.

The fix seems straightforward enough in my app. I suspect calling
FileLocalizer.deleteTempFiles() after each usage of PigServer for a given
execution of a given pig script will do the trick.

This seems to be a major gotcha though that will likely burn others. I
suggest we add FileLocalizer.deleteTempFiles() to the shutdown() method of
PigServer. Thoughts?

Currently shutdown isn't doing much:

public void shutdown() {
// clean-up activities
// TODO: reclaim scope to free up resources. Currently
// this is not implemented and throws an exception
// hence, for now, we won't call it.
//
// pigContext.getExecutionEngine().reclaimScope(this.scope);
}
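
A minimal sketch of the suggested change (assuming FileLocalizer.deleteTempFiles()
behaves as described above; the real patch may need to guard against multiple
PigServer instances sharing temp files):

public void shutdown() {
    // clean-up activities
    // Flush the static FileLocalizer.toDelete stack so temp files don't
    // accumulate across PigServer usages in a long-lived VM.
    FileLocalizer.deleteTempFiles();
}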

thanks,
Bill



On Wed, Mar 10, 2010 at 12:15 PM, Bill Graham billgra...@gmail.com wrote:

 Yes, these errors appear in the Pig client and the jobs are definitely
 being executed on the cluster. I can see the data in HDFS and the jobs in
 the JobTracker UI of the cluster.


 On Wed, Mar 10, 2010 at 10:54 AM, Ashutosh Chauhan 
 ashutosh.chau...@gmail.com wrote:

 [Low Memory Detector] [INFO] SpillableMemoryManager.java:143 low memory
 handler called

  Are you seeing this warning on the client side, in the pig logs? If so, then
  are you sure your job is actually running on a real hadoop cluster?
  Because these logs should appear in task-tracker logs, not in client
  logs. This may imply that your job is getting executed locally in local
  mode and not actually submitted to the cluster. Look for the very first
  lines in the client logs, where Pig tries to connect to the cluster.
  See if it's successful in doing so.



 On Wed, Mar 10, 2010 at 10:15, Ashutosh Chauhan
 ashutosh.chau...@gmail.com wrote:
  Posting for Bill.
 
 
  -- Forwarded message --
  From: Bill Graham billgra...@gmail.com
  Date: Wed, Mar 10, 2010 at 10:11
  Subject: Re: PigServer memory leak
  To: ashutosh.chau...@gmail.com
 
 
  Thanks for the reply, Ashutosh.
 
  [hadoop.apache.org keeps flagging my reply as spam, so I'm replying
  directly to you. Feel free to push this conversation back onto the
  list, if you can. :)]
 
  I'm running the same two scripts, one after the other, every 5
  minutes. The scripts have dynamic tokens substituted to change the
  input and output directories. Besides that, they have the same logic.
 
  I will try to execute the script from grunt next time it happens, but
  I don't see how a lack of pig MR optimizations could cause a memory
  issue on the client? If I bounce my daemon, the next jobs to run
  executes without a problem upon start, so I would also expect a script
  run through grunt at that time to run without a problem as well.
 
  I reverted back to re-initializing PigServer for every run. I have
  other places in my scheduled workflow where I interact with HDFS which
  I've now modified to re-use an instance of Hadoop's Configuration
  object for the life of the VM. I was re-initializing that many times
  per run. Looking at the Configuration code it seems to re-parse the
  XML configs into a DOM every time it's called, so this certainly looks
  like a place for a potential leak. If nothing else it should give me
  an optimization. Configuration seems to be stateless and read-only
  after initiation so this seems safe.
 
  Anyway, here are my two scripts. The first generates summaries, the
  second makes a report from the summaries and they run in separate
  PigServer instances via registerQuery(..). Let me know if you see
  anything that seems off:
 
 
   define chukwaLoader org.apache.hadoop.chukwa.ChukwaStorage();
  define tokenize com.foo.hadoop.mapreduce.pig.udf.TOKENIZE();
  define regexMatch   com.foo.hadoop.mapreduce.pig.udf.REGEX_MATCH();
   define timePeriod   org.apache.hadoop.chukwa.TimePartition('@TIME_PERIOD@');
 
  raw = LOAD '@HADOOP_INPUT_LOCATION@'
  USING chukwaLoader AS (ts: long, fields);
  bodies = FOREACH raw GENERATE tokenize((chararray)fields#'body') as
  tokens, timePeriod(ts) as time;
 
  -- pull values out of the URL
  tokens1 = FOREACH bodies GENERATE
    (int)regexMatch($0.token4, '(?:[?])ptId=([^&]*)', 1) as pageTypeId,
    (int)regexMatch($0.token4, '(?:[?])sId=([^&]*)', 1) as siteId,
    (int)regexMatch($0.token4, '(?:[?])aId=([^&]*)', 1) as assetId,
    time,
    regexMatch($0.token4, '(?:[?])tag=([^&]*)', 1) as tagValue;
 
  -- filter out entries without an assetId
  tokens2 = FILTER tokens1 BY
  (assetId is not null) AND (pageTypeId is not null) AND (siteId is
 not null);
 
  -- group by tagValue, time, assetId and flatten to get counts
  grouped = GROUP tokens2 BY (tagValue

Re: Is there a way to suppress the attempt logs?

2010-03-17 Thread Bill Graham
Not sure if what you're asking is possible or not, but you could experiment
with these params to see if you could achieve a similar effect.

<property>
  <name>mapred.userlog.limit.kb</name>
  <value>0</value>
  <description>The maximum size of user-logs of each task in KB. 0 disables
  the cap.
  </description>
</property>

<property>
  <name>mapred.userlog.retain.hours</name>
  <value>24</value>
  <description>The maximum time, in hours, for which the user-logs are to be
  retained.
  </description>
</property>


On Mon, Mar 15, 2010 at 5:54 PM, abhishek sharma absha...@usc.edu wrote:

 Hi all,

  Hadoop creates a directory (and some files) for each map and reduce
  task attempt in logs/userlogs on each tasktracker.

 Is there a way to configure Hadoop not to create these attempt logs?

 Thanks,
 Abhishek



Re: Something wrong the pig-user mail list ?

2010-03-11 Thread Bill Graham
Yesterday I couldn't send emails to this list. Google was reporting that
apache was blocking them as spam. We'll see if this goes through...

On Wed, Mar 10, 2010 at 6:09 PM, Dmitriy Ryaboy dvrya...@gmail.com wrote:

 This one appears to have worked..

 On Wed, Mar 10, 2010 at 6:07 PM, Jeff Zhang zjf...@gmail.com wrote:

   I always receive the failed delivery message from google.
 
  --
  Best Regards
 
  Jeff Zhang
 



Re: PigServer memory leak

2010-03-10 Thread Bill Graham
Yes, these errors appear in the Pig client and the jobs are definitely being
executed on the cluster. I can see the data in HDFS and the jobs in the
JobTracker UI of the cluster.

On Wed, Mar 10, 2010 at 10:54 AM, Ashutosh Chauhan 
ashutosh.chau...@gmail.com wrote:

 [Low Memory Detector] [INFO] SpillableMemoryManager.java:143 low memory
 handler called

  Are you seeing this warning on the client side, in the pig logs? If so, then
  are you sure your job is actually running on a real hadoop cluster?
  Because these logs should appear in task-tracker logs, not in client
  logs. This may imply that your job is getting executed locally in local
  mode and not actually submitted to the cluster. Look for the very first
  lines in the client logs, where Pig tries to connect to the cluster.
  See if it's successful in doing so.



 On Wed, Mar 10, 2010 at 10:15, Ashutosh Chauhan
 ashutosh.chau...@gmail.com wrote:
  Posting for Bill.
 
 
  -- Forwarded message --
  From: Bill Graham billgra...@gmail.com
  Date: Wed, Mar 10, 2010 at 10:11
  Subject: Re: PigServer memory leak
  To: ashutosh.chau...@gmail.com
 
 
  Thanks for the reply, Ashutosh.
 
  [hadoop.apache.org keeps flagging my reply as spam, so I'm replying
  directly to you. Feel free to push this conversation back onto the
  list, if you can. :)]
 
  I'm running the same two scripts, one after the other, every 5
  minutes. The scripts have dynamic tokens substituted to change the
  input and output directories. Besides that, they have the same logic.
 
  I will try to execute the script from grunt next time it happens, but
  I don't see how a lack of pig MR optimizations could cause a memory
  issue on the client? If I bounce my daemon, the next jobs to run
  executes without a problem upon start, so I would also expect a script
  run through grunt at that time to run without a problem as well.
 
  I reverted back to re-initializing PigServer for every run. I have
  other places in my scheduled workflow where I interact with HDFS which
  I've now modified to re-use an instance of Hadoop's Configuration
  object for the life of the VM. I was re-initializing that many times
  per run. Looking at the Configuration code it seems to re-parse the
  XML configs into a DOM every time it's called, so this certainly looks
  like a place for a potential leak. If nothing else it should give me
  an optimization. Configuration seems to be stateless and read-only
  after initiation so this seems safe.
 
  Anyway, here are my two scripts. The first generates summaries, the
  second makes a report from the summaries and they run in separate
  PigServer instances via registerQuery(..). Let me know if you see
  anything that seems off:
 
 
  define chukwaLoader org.apache.hadoop.chukwa.ChukwaStorage();
  define tokenize com.foo.hadoop.mapreduce.pig.udf.TOKENIZE();
  define regexMatch   com.foo.hadoop.mapreduce.pig.udf.REGEX_MATCH();
  define timePeriod   org.apache.hadoop.chukwa.TimePartition('@TIME_PERIOD@');
 
  raw = LOAD '@HADOOP_INPUT_LOCATION@'
  USING chukwaLoader AS (ts: long, fields);
  bodies = FOREACH raw GENERATE tokenize((chararray)fields#'body') as
  tokens, timePeriod(ts) as time;
 
  -- pull values out of the URL
  tokens1 = FOREACH bodies GENERATE
    (int)regexMatch($0.token4, '(?:[?])ptId=([^&]*)', 1) as pageTypeId,
    (int)regexMatch($0.token4, '(?:[?])sId=([^&]*)', 1) as siteId,
    (int)regexMatch($0.token4, '(?:[?])aId=([^&]*)', 1) as assetId,
    time,
    regexMatch($0.token4, '(?:[?])tag=([^&]*)', 1) as tagValue;
 
  -- filter out entries without an assetId
  tokens2 = FILTER tokens1 BY
  (assetId is not null) AND (pageTypeId is not null) AND (siteId is not
 null);
 
  -- group by tagValue, time, assetId and flatten to get counts
  grouped = GROUP tokens2 BY (tagValue, time, assetId, pageTypeId, siteId);
  flattened = FOREACH grouped GENERATE
  FLATTEN(group) as (tagValue, time, assetId, pageTypeId, siteId),
  COUNT(tokens2) as count;
 
  shifted = FOREACH flattened GENERATE time, count, assetId, pageTypeId,
  siteId, tagValue;
 
  -- order and store
  ordered = ORDER shifted BY tagValue ASC, count DESC, assetId DESC,
  pageTypeId ASC, siteId ASC, time DESC;
  STORE ordered INTO '@HADOOP_OUTPUT_LOCATION@';
 
 
 
 
 
  raw = LOAD '@HADOOP_INPUT_LOCATION@' USING PigStorage('\t') AS
  (ts: long, count: int, assetId: int, pageTypeId: chararray,
  siteId: int, tagValue: chararray);
 
  -- now store most popular overall - filtered by pageTypeId
  most_popular_filtered = FILTER raw BY
  (siteId == 162) AND (pageTypeId matches
  '(2100)|(1606)|(1801)|(2300)|(2718)');
  most_popular = GROUP most_popular_filtered BY (ts, assetId, pageTypeId);
  most_popular_flattened = FOREACH most_popular GENERATE
  FLATTEN(group) as (ts, assetId, pageTypeId),
  SUM(most_popular_filtered.count) as count;
  most_popular_shifted = FOREACH most_popular_flattened
  GENERATE ts, count, assetId, (int)pageTypeId

PigServer memory leak

2010-03-09 Thread Bill Graham
hi,

I've got a long-running daemon application that periodically kicks off Pig
jobs via quartz (Pig version 0.4.0). It uses a wrapper class that initializes
an instance of PigServer before parsing and executing a pig script. As
implemented, the app would leak memory and after a while jobs would fail to
run with messages like this appearing in the logs:

[Low Memory Detector] [INFO] SpillableMemoryManager.java:143 low memory
handler called

To fix the issue, I created an instance of PigServer at application
initialization and I re-use that instance for all jobs for the life of the
daemon. Problem solved.

So my question is, is this a bug in PigServer that it leaks memory when
multiple instances are created, or is that just improper use of the class?

thanks,
Bill


Re: PigServer memory leak

2010-03-09 Thread Bill Graham
Actually, upon closer investigation, re-using PigServer isn't working as
well as I thought. I'm digging into the issue.

To step back a bit though, I want to pose a different question: What is the
intended usage of PigServer and PigContext w.r.t. its scope? Should a new
instance of each be used for every job or is one or the other intended for
re-use throughout the lifecycle of the VM instance?

Digging into the code of PigServer it seems like it's intended to be used
for a single script's execution only, but it's not entirely clear if that's
the case.



On Tue, Mar 9, 2010 at 9:29 AM, Bill Graham billgra...@gmail.com wrote:

 hi,

  I've got a long-running daemon application that periodically kicks off Pig
  jobs via quartz (Pig version 0.4.0). It uses a wrapper class that initializes
  an instance of PigServer before parsing and executing a pig script. As
  implemented, the app would leak memory and after a while jobs would fail to
  run with messages like this appearing in the logs:

 [Low Memory Detector] [INFO] SpillableMemoryManager.java:143 low memory
 handler called

 To fix the issue, I created an instance of PigServer at application
 initialization and I re-use that instance for all jobs for the life of the
 daemon. Problem solved.

 So my question is, is this a bug in PigServer that it leaks memory when
 multiple instances are created, or is that just improper use of the class?

 thanks,
 Bill



Re: filter/join by sql like %pattern condition

2010-02-25 Thread Bill Graham
You could specify a condition using the the RegexMatch or RegexExtract UDF
in piggybank:

http://svn.apache.org/repos/asf/hadoop/pig/trunk/contrib/piggybank/java/src/main/java/org/apache/pig/piggybank/evaluation/string/RegexMatch.java

http://svn.apache.org/repos/asf/hadoop/pig/trunk/contrib/piggybank/java/src/main/java/org/apache/pig/piggybank/evaluation/string/RegexExtract.java

On Thu, Feb 25, 2010 at 10:17 AM, Jan Zimmek jan.zim...@toptarif.de wrote:

 hi,

 i recently found pig, really like it and want to use it for one of our
 actual projects.

  getting the basics running was easy, but now i am struggling with a problem.

 i am trying to get customers whose email is not blacklisted.

  blacklist entries can be specified as:

 n...@domain.de

 or wildcarded

 @domain.de

 in sql i would solve this by:

 

 select
  *
 from
  customer c
 left join blacklist b
 on
   c.email like concat('%', b.email)
 where
  b.email is null

 

 this is the structure of my input files:

 raw_customer = LOAD 'customer.csv' USING PigStorage('\t') AS (id: long,
 email: chararray);
 raw_blacklist = LOAD 'blacklist.csv' USING PigStorage('\t') AS (email:
 chararray);


 how would i solve this using pig ? - especially handling the like %
 condition.

 i already looked into udf, but need some advice how to implement this.


 any help would be really appreciated.

 regards,
 jan




Re: Some additions to the hive jdbc driver.

2010-02-03 Thread Bill Graham
This would certainly be useful. When creating the JIRA you can make it a
child task to this one:

https://issues.apache.org/jira/browse/HIVE-576

On Wed, Feb 3, 2010 at 1:18 AM, 김영우 warwit...@gmail.com wrote:

 Hi Bennie,

 Sounds great! That should be very useful for users.
 It would be nice to have more jdbc functionality on hive jdbc.

 Thanks,
 Youngwoo

 2010/2/3 Bennie Schut bsc...@ebuddy.com

 I've been using the hive jdbc driver more and more and was missing some
 functionality which I added

 HiveDatabaseMetaData.getTables
 Using show tables to get the info from hive.

 HiveDatabaseMetaData.getColumns
 Using describe tablename to get the columns.

 This makes using something like SQuirreL a lot nicer since you have the
 list of tables and just click on the content tab to see what's in the table.


 I also implemented
  HiveResultSet.getObject(String columnName) so you can call most get* methods
 based on the column name.

 Is it worth making a patch for this?





Re: ChukwaArchiveManager and DemuxManager

2010-02-02 Thread Bill Graham
I had a lot of questions regarding the data flow as well. I spent a while
reverse engineering it and wrote something up on our internal wiki. I
believe this is what's happening. If others with more knowledge could verify
what I have below, I'll gladly move it to a wiki on the Chukwa site.

Regarding your specific question, I believe the DemuxManager process is the
first step in aggregating the data sink files. It moves the chunks to the
dataSinkArchives directory once it's done with them. The ArchiveManager
later archives those chunks.


    1. Collectors write chunks to logs/*.chukwa files until a 64MB chunk size
       is reached or a given time interval is reached.
       - to: logs/*.chukwa
    2. Collectors close chunks and rename them to *.done
       - from: logs/*.chukwa
       - to: logs/*.done
    3. DemuxManager wakes up every 20 seconds, runs M/R to merge *.done files
       and moves them.
       - from: logs/*.done
       - to: demuxProcessing/mrInput
       - to: demuxProcessing/mrOutput
       - to: dataSinkArchives/[yyyyMMdd]/*/*.done
    4. PostProcessManager wakes up every few minutes and aggregates, orders
       and de-dups record files.
       - from: postProcess/demuxOutputDir_*/[clusterName]/[dataType]/[dataType]_[yyyyMMdd]_[HH].R.evt
       - to: repos/[clusterName]/[dataType]/[yyyyMMdd]/[HH]/[mm]/[dataType]_[yyyyMMdd]_[HH]_[N].[N].evt
    5. HourlyChukwaRecordRolling runs M/R jobs at 16 past the hour to group 5
       minute logs to hourly.
       - from: repos/[clusterName]/[dataType]/[yyyyMMdd]/[HH]/[mm]/[dataType]_[yyyyMMdd]_[mm].[N].evt
       - to: temp/hourlyRolling/[clusterName]/[dataType]/[yyyyMMdd]
       - to: repos/[clusterName]/[dataType]/[yyyyMMdd]/[HH]/[dataType]_HourlyDone_[yyyyMMdd]_[HH].[N].evt
       - leaves: repos/[clusterName]/[dataType]/[yyyyMMdd]/[HH]/rotateDone/
    6. DailyChukwaRecordRolling runs M/R jobs at 1:30AM to group hourly logs
       to daily.
       - from: repos/[clusterName]/[dataType]/[yyyyMMdd]/[HH]/[dataType]_[yyyyMMdd]_[HH].[N].evt
       - to: temp/dailyRolling/[clusterName]/[dataType]/[yyyyMMdd]
       - to: repos/[clusterName]/[dataType]/[yyyyMMdd]/[dataType]_DailyDone_[yyyyMMdd].[N].evt
       - leaves: repos/[clusterName]/[dataType]/[yyyyMMdd]/rotateDone/
    7. ChukwaArchiveManager every half hour or so aggregates and removes
       dataSinkArchives data using M/R.
       - from: dataSinkArchives/[yyyyMMdd]/*/*.done
       - to: archivesProcessing/mrInput
       - to: archivesProcessing/mrOutput
       - to: finalArchives/[yyyyMMdd]/*/chukwaArchive-part-*


thanks,
Bill

On Tue, Feb 2, 2010 at 10:21 AM, Corbin Hoenes cor...@tynt.com wrote:

 I am trying to understand the flow of data inside hdfs as it's processed by
 the data processor script.
 I see the archive.sh and demux.sh are run which runs ArchiveManager and
 DemuxManager. It appears, just from reading the code, that they both are
 looking at the data sink (default /chukwa/logs).

 Can someone shed some light on how ArchiveManager and DemuxManager
 interact?  E.g. I was under the impression that the data flowed through the
 archiving process first then got fed into the demuxing after it had created
 .arc files.




Re: ChukwaArchiveManager and DemuxManager

2010-02-02 Thread Bill Graham
I'm doing the same thing with Pig and log files.

If the date format/location of your log entries doesn't match the chukwa
date format found in the TsProcessor, you'll need to write your own. The
TsProcessor is a good example to follow. You'll need to configure your
processor to be used for your datatype in chukwa-demux-conf.xml. Even if you
use the TsProcessor, you'll need to configure that in chukwa-demux-conf.xml,
since the default processor is DefaultProcessor (despite a bug in the wiki
documentation that says TsProcessor).

If you write your own processor, be aware of this JIRA:
https://issues.apache.org/jira/browse/CHUKWA-440

The processor should be the only Chukwa customization you need to do. If you
follow the TsProcessor example, all you're doing is determining the
timestamp of the record. All the downstream processes should work fine
without customization.
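
To make that concrete, a processor along the lines of TsProcessor can be as
small as the following (rough sketch only; the class name, date pattern and
the exact parse/buildGenericRecord signatures are from memory, so check them
against your Chukwa version):

package org.apache.hadoop.chukwa.extraction.demux.processor.mapper;

import java.text.SimpleDateFormat;
import java.util.Date;
import org.apache.hadoop.chukwa.extraction.engine.ChukwaRecord;
import org.apache.hadoop.chukwa.extraction.engine.ChukwaRecordKey;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

// Pulls the leading timestamp out of each log line and emits one ChukwaRecord,
// letting buildGenericRecord create the time-partitioned key used by the roll-ups.
public class MyLogProcessor extends AbstractProcessor {
  private SimpleDateFormat sdf = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss,SSS");

  protected void parse(String recordEntry,
      OutputCollector<ChukwaRecordKey, ChukwaRecord> output,
      Reporter reporter) throws Throwable {
    Date d = sdf.parse(recordEntry); // parses the leading timestamp, ignores the rest
    ChukwaRecord record = new ChukwaRecord();
    buildGenericRecord(record, recordEntry, d.getTime(), chunk.getDataType());
    output.collect(key, record);
  }
}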

If your processor is working properly you should see *.evt files written
beneath the repos/ dir. If it's not working, the data will go into an error
directory, probably because the date parsing failed (the MR logs will
indicate the cause). These are the files you'll write your Pig scripts
against using the ChukwaStorage class to read the Chukwa sequence files.
Here's an example of the start of a script which normalizes the timestamp of
the record down to 5 minutes:

define chukwaLoader org.apache.hadoop.chukwa.ChukwaStorage();
define timePeriod   org.apache.hadoop.chukwa.TimePartition('30');

raw = LOAD '/chukwa/repos/path/to/evt/files' USING chukwaLoader AS (ts:
long, fields);
bodies = FOREACH raw GENERATE (chararray)fields#'body' as body,
timePeriod(ts) as time;

Also, if you want to generate a sequence file from an apache log for
testing, without setting up the chukwa cluster you can use the utility
discussed here FYI:

https://issues.apache.org/jira/browse/CHUKWA-449

HTH,
Bill


On Tue, Feb 2, 2010 at 11:18 AM, Corbin Hoenes cor...@tynt.com wrote:

 This is exactly what I've been trying to create so that I can understand
 how we can use the data once in chukwa.
 Our goal is to use pig to process our apache logs.  It looks like I need to
 customize the demux with a custom processor to create a chukwa record per
 line in the log file since right now we get a chukwa record per chunk which
 isn't useful to our pig scripts.

 I noticed in another conversation you've written a custom processor.  What
 kinds of data are you processing?  Did you find you had to split up the
 chunked data into individual ChukwaRecords?  And how does this affect the
 rest of processing (archiving,postprocessing etc...)  I am trying to
 understand how much customization I'm going to have to do.


 On Feb 2, 2010, at 11:56 AM, Bill Graham wrote:

 I had a lot of questions regarding the data flow as well. I spent a while
 reverse engineering it and wrote something up on our internal wiki. I
 believe this is what's happening. If others with more knowledge could verify
 what I have below, I'll gladly move it to a wiki on the Chukwa site.

 Regarding your specific question, I believe the DemuxManager process is the
 first step in aggregating the data sink files. It moves the chunks to the
 dataSinkArchives directory once it's done with them. The ArchiveManager
 later archives those chunks.


1. Collectors write chunks to logs/*.chukwa files until a 64MB chunk
size is reached or a given time interval is reached.
   - to: logs/*.chukwa
2. Collectors close chunks and rename them to *.done
   - from: logs/*.chukwa
   - to: logs/*.done
3. DemuxManager wakes up every 20 seconds, runs M/R to merges *.donefiles 
 and moves them.
   - from: logs/*.done
   - to: demuxProcessing/mrInput
   - to: demuxProcessing/mrOutput
   - to: dataSinkArchives/[MMdd]/*/*.done
4. PostProcessManager wakes up every few minutes and aggregates, orders
and de-dups record files.
   - from:
   
 postProcess/demuxOutputDir_*/[clusterName]/[dataType]/[dataType]_[MMdd]_[HH].R.evt
   - to:
   
 repos/[clusterName]/[dataType]/[MMdd]/[HH]/[mm]/[dataType]_[MMdd]_[HH]_[N].[N].evt
5. HourlyChukwaRecordRolling runs M/R jobs at 16 past the hour to group
5 minute logs to hourly.
   - from:
   
 repos/[clusterName]/[dataType]/[MMdd]/[HH]/[mm]/[dataType]_[MMdd]_[mm].[N].evt
   - to: temp/hourlyRolling/[clusterName]/[dataType]/[MMdd]
   - to:
   
 repos/[clusterName]/[dataType]/[MMdd]/[HH]/[dataType]_HourlyDone_[MMdd]_[HH].[N].evt
   - leaves: repos/[clusterName]/[dataType]/[MMdd]/[HH]/rotateDone/
6. DailyChukwaRecordRolling runs M/R jobs at 1:30AM to group hourly
logs to daily.
   - from:
   
 repos/[clusterName]/[dataType]/[MMdd]/[HH]/[dataType]_[MMdd]_[HH].[N].evt
   - to: temp/dailyRolling/[clusterName]/[dataType]/[MMdd]
   - to:
   
 repos/[clusterName]/[dataType]/[MMdd]/[dataType]_DailyDone_[MMdd].[N].evt
   - leaves

NN fails to start with LeaseManager errors

2010-02-02 Thread Bill Graham
Hi,

This morning the namenode of my hadoop cluster shut itself down after the
logs/ directory had filled itself with job configs, log files and all the
other fun things hadoop leaves there. It had been running for a few months.
I deleted all of the job configs and attempt log directories and tried to
restart the namenode, but it failed due to many LeaseManager errors.

Does anyone know what needs to be done to fix this and get the namenode back
up?

Here's what the logs report. I'm using Cloudera's 0.18.3 distro.

STARTUP_MSG: Starting NameNode
STARTUP_MSG:   host = my-host-name.com/10.15.137.204
STARTUP_MSG:   args = []
STARTUP_MSG:   version = 0.18.3-2
STARTUP_MSG:   build =  -r ; compiled by 'httpd' on Fri Jun 12 15:27:43 PDT
2009
/
2010-02-02 13:38:31,199 INFO org.apache.hadoop.ipc.metrics.RpcMetrics:
Initializing RPC Metrics with hostName=NameNode, port=9000
2010-02-02 13:38:31,208 INFO org.apache.hadoop.dfs.NameNode: Namenode up at:
my-host-name.com/10.15.137.204:9000
2010-02-02 13:38:31,212 INFO org.apache.hadoop.metrics.jvm.JvmMetrics:
Initializing JVM Metrics with processName=NameNode, sessionId=null
2010-02-02 13:38:31,218 INFO org.apache.hadoop.dfs.NameNodeMetrics:
Initializing NameNodeMeterics using context
object:org.apache.hadoop.metrics.spi.NullContext
2010-02-02 13:38:31,318 INFO org.apache.hadoop.fs.FSNamesystem:
fsOwner=app,app
2010-02-02 13:38:31,319 INFO org.apache.hadoop.fs.FSNamesystem:
supergroup=supergroup
2010-02-02 13:38:31,319 INFO org.apache.hadoop.fs.FSNamesystem:
isPermissionEnabled=true
2010-02-02 13:38:31,329 INFO org.apache.hadoop.dfs.FSNamesystemMetrics:
Initializing FSNamesystemMeterics using context
object:org.apache.hadoop.metrics.spi.NullContext
2010-02-02 13:38:31,331 INFO org.apache.hadoop.fs.FSNamesystem: Registered
FSNamesystemStatusMBean
2010-02-02 13:38:31,375 INFO org.apache.hadoop.dfs.Storage: Number of files
= 248675
2010-02-02 13:38:36,932 INFO org.apache.hadoop.dfs.Storage: Number of files
under construction = 2
2010-02-02 13:38:37,008 INFO org.apache.hadoop.dfs.Storage: Image file of
size 42924164 loaded in 5 seconds.
2010-02-02 13:38:37,020 ERROR org.apache.hadoop.dfs.LeaseManager:
/path/on/hdfs/_logs/history/my-host-name.com_1261508934685_job_200912221108_15967_conf.xml
not found in lease.paths
(=[/path/on/hdfs/_logs/history/my-host-name.com_1261508934685_job_200912221108_15967_app_MyJobName_20100202_10_59])

[[ a bunch more errors like the one above ]]

2010-02-02 13:38:37,076 ERROR org.apache.hadoop.fs.FSNamesystem:
FSNamesystem initialization failed.
java.io.EOFException
at java.io.DataInputStream.readInt(DataInputStream.java:375)
at org.apache.hadoop.dfs.FSEditLog.loadFSEdits(FSEditLog.java:585)
at org.apache.hadoop.dfs.FSImage.loadFSEdits(FSImage.java:846)
at org.apache.hadoop.dfs.FSImage.loadFSImage(FSImage.java:675)
at
org.apache.hadoop.dfs.FSImage.recoverTransitionRead(FSImage.java:289)
at
org.apache.hadoop.dfs.FSDirectory.loadFSImage(FSDirectory.java:80)
at
org.apache.hadoop.dfs.FSNamesystem.initialize(FSNamesystem.java:294)
at org.apache.hadoop.dfs.FSNamesystem.init(FSNamesystem.java:273)
at org.apache.hadoop.dfs.NameNode.initialize(NameNode.java:148)
at org.apache.hadoop.dfs.NameNode.init(NameNode.java:193)
at org.apache.hadoop.dfs.NameNode.init(NameNode.java:179)
at org.apache.hadoop.dfs.NameNode.createNameNode(NameNode.java:830)
at org.apache.hadoop.dfs.NameNode.main(NameNode.java:839)
2010-02-02 13:38:37,077 INFO org.apache.hadoop.ipc.Server: Stopping server
on 9000
2010-02-02 13:38:37,081 ERROR org.apache.hadoop.dfs.NameNode:
java.io.EOFException
at java.io.DataInputStream.readInt(DataInputStream.java:375)
at org.apache.hadoop.dfs.FSEditLog.loadFSEdits(FSEditLog.java:585)
at org.apache.hadoop.dfs.FSImage.loadFSEdits(FSImage.java:846)
at org.apache.hadoop.dfs.FSImage.loadFSImage(FSImage.java:675)
at
org.apache.hadoop.dfs.FSImage.recoverTransitionRead(FSImage.java:289)
at
org.apache.hadoop.dfs.FSDirectory.loadFSImage(FSDirectory.java:80)
at
org.apache.hadoop.dfs.FSNamesystem.initialize(FSNamesystem.java:294)
at org.apache.hadoop.dfs.FSNamesystem.init(FSNamesystem.java:273)
at org.apache.hadoop.dfs.NameNode.initialize(NameNode.java:148)
at org.apache.hadoop.dfs.NameNode.init(NameNode.java:193)
at org.apache.hadoop.dfs.NameNode.init(NameNode.java:179)
at org.apache.hadoop.dfs.NameNode.createNameNode(NameNode.java:830)
at org.apache.hadoop.dfs.NameNode.main(NameNode.java:839)

2010-02-02 13:38:37,082 INFO org.apache.hadoop.dfs.NameNode: SHUTDOWN_MSG:
/
SHUTDOWN_MSG: Shutting down NameNode at my-host-name.com/10.15.137.204
/

thanks,
Bill


Re: piggybank build problem

2010-01-29 Thread Bill Graham
I also was unable to build piggybank with similar errors. I did the
following:

$ svn co 
http://svn.apache.org/repos/asf/hadoop/pig/trunk/contrib/piggybank piggybank-trunk
$ cd piggybank-trunk
$ ant compile

The problem is that the classpath references the following files:

<property name="pigjar" value="../../../pig.jar" />
<property name="pigjar-withouthadoop"
value="../../../pig-withouthadoop.jar" />
<property name="hadoopjar" value="../../../lib/hadoop20.jar" />
<property name="pigtest" value="../../../build/test/classes" />

You need to check out the entire pig project then cd to
contrib/piggybank/java to build piggybank. It won't work if you check out
just piggybank itself.

thanks,
Bill

On Thu, Jan 28, 2010 at 10:22 AM, Dmitriy Ryaboy dvrya...@gmail.com wrote:

 You should be able to compile piggybank itself (just ant jar).
 To compile and run the tests, you also need to compile Pig's test
 classes -- so for that you need to first run ant jar compile-test in
 the top-level pig directory.

 -D

 On Wed, Jan 27, 2010 at 11:08 PM, felix gao gre1...@gmail.com wrote:
  OK I checked out the version 5 's piggybank and still can't compile it.
 
  /usr/local/pig  svn co
 
 http://svn.apache.org/repos/asf/hadoop/pig/tags/release-0.5.0/contrib/piggybank piggybank
 
  /usr/local/pig/piggybank/java  ant jar compile-test
  Buildfile: build.xml
 
  init:
 [mkdir] Created dir:
  /Users/felixgao/mapreduce/pig-0.5.0/piggybank/java/build
 [mkdir] Created dir:
  /Users/felixgao/mapreduce/pig-0.5.0/piggybank/java/build/classes
 [mkdir] Created dir:
  /Users/felixgao/mapreduce/pig-0.5.0/piggybank/java/build/test
 [mkdir] Created dir:
  /Users/felixgao/mapreduce/pig-0.5.0/piggybank/java/build/test/classes
 [mkdir] Created dir:
  /Users/felixgao/mapreduce/pig-0.5.0/piggybank/java/build/docs/api
 
  compile:
  [echo]  *** Compiling Pig UDFs ***
 [javac] Compiling 97 source files to
  /Users/felixgao/mapreduce/pig-0.5.0/piggybank/java/build/classes
 
  jar:
  [echo]  *** Creating pigudf.jar ***
   [jar] Building jar:
  /Users/felixgao/mapreduce/pig-0.5.0/piggybank/java/piggybank.jar
 
  init:
 
  compile:
  [echo]  *** Compiling Pig UDFs ***
 [javac] Compiling 97 source files to
  /Users/felixgao/mapreduce/pig-0.5.0/piggybank/java/build/classes
 
  compile-test:
  [echo]  *** Compiling UDF tests ***
 [javac] Compiling 20 source files to
  /Users/felixgao/mapreduce/pig-0.5.0/piggybank/java/build/test/classes
 [javac]
 
 /Users/felixgao/mapreduce/pig-0.5.0/piggybank/java/src/test/java/org/apache/pig/piggybank/test/evaluation/string/TestLookupInFiles.java:31:
  package org.apache.pig.test does not exist
 [javac] import org.apache.pig.test.MiniCluster;
 [javac]   ^
 [javac]
 
 /Users/felixgao/mapreduce/pig-0.5.0/piggybank/java/src/test/java/org/apache/pig/piggybank/test/evaluation/string/TestLookupInFiles.java:32:
  package org.apache.pig.test does not exist
 [javac] import org.apache.pig.test.Util;
 [javac]   ^
 [javac]
 
 /Users/felixgao/mapreduce/pig-0.5.0/piggybank/java/src/test/java/org/apache/pig/piggybank/test/evaluation/string/TestLookupInFiles.java:38:
  cannot find symbol
 [javac] symbol  : class MiniCluster
 [javac] location: class
  org.apache.pig.piggybank.test.evaluation.string.TestLookupInFiles
 [javac] MiniCluster cluster = MiniCluster.buildCluster();
 [javac] ^
 [javac]
 
 /Users/felixgao/mapreduce/pig-0.5.0/piggybank/java/src/test/java/org/apache/pig/piggybank/test/storage/TestSequenceFileLoader.java:38:
  package org.apache.pig.test does not exist
 [javac] import org.apache.pig.test.Util;
 [javac]   ^
 [javac]
 
 /Users/felixgao/mapreduce/pig-0.5.0/piggybank/java/src/test/java/org/apache/pig/piggybank/test/evaluation/string/TestLookupInFiles.java:38:
  cannot find symbol
 [javac] symbol  : variable MiniCluster
 [javac] location: class
  org.apache.pig.piggybank.test.evaluation.string.TestLookupInFiles
 [javac] MiniCluster cluster = MiniCluster.buildCluster();
 [javac]   ^
 [javac]
 
 /Users/felixgao/mapreduce/pig-0.5.0/piggybank/java/src/test/java/org/apache/pig/piggybank/test/evaluation/string/TestLookupInFiles.java:73:
  cannot find symbol
 [javac] symbol  : variable Util
 [javac] location: class
  org.apache.pig.piggybank.test.evaluation.string.TestLookupInFiles
 [javac] pigServer.registerQuery(A = LOAD ' +
  Util.generateURI(tmpFile.toString()) + ' AS (key:chararray););
 [javac]^
 [javac]
 
 /Users/felixgao/mapreduce/pig-0.5.0/piggybank/java/src/test/java/org/apache/pig/piggybank/test/storage/TestSequenceFileLoader.java:84:
  cannot find symbol
 [javac] symbol  : variable Util
 [javac] location: class
  org.apache.pig.piggybank.test.storage.TestSequenceFileLoader
 [javac] 

Re: How to cleanup old Job jars

2010-01-27 Thread Bill Graham
Thanks Rekha.

These issues seem to be related to cleaning up Pig/Hadoop files upon shutdown
of the VM. I just checked and when I shut down the VM, all files are cleaned
up as expected.

My issue is that I have Pig jobs that run in an app server which are
triggered by quartz. It might be days or weeks between app server bounces.
If anyone knows a way to configure or kick off some sort of cleanup process
without shutting down the VM, please let me know.

Otherwise, I need to deploy a hacky crontab script like this:

find /tmp/Job[0-9]*.jar -type f -mmin +50 -exec rm {} \;
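
Or, as a rough in-VM equivalent that quartz itself could trigger (sketch only;
the /tmp path, Job*.jar pattern and 50-minute cutoff just mirror the find
command above):

import java.io.File;
import java.io.FilenameFilter;

// Deletes Pig's leftover /tmp/Job*.jar files older than 50 minutes.
public class JobJarCleaner {
  public static void main(String[] args) {
    long cutoff = System.currentTimeMillis() - 50L * 60L * 1000L; // 50 minutes
    File[] jars = new File("/tmp").listFiles(new FilenameFilter() {
      public boolean accept(File dir, String name) {
        return name.matches("Job\\d+\\.jar");
      }
    });
    if (jars == null) {
      return; // /tmp missing or unreadable
    }
    for (File jar : jars) {
      if (jar.lastModified() < cutoff) {
        jar.delete();
      }
    }
  }
}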


On Tue, Jan 26, 2010 at 8:40 PM, Rekha Joshi rekha...@yahoo-inc.com wrote:

  You might like to check out PIG-116 and HADOOP-5175. Also, I think there is a
 JobCleanup task which takes care of cleaning, AFAIK, unless it's a failed
 job.
 Cheers,
 /R



 On 1/27/10 12:01 AM, Bill Graham billgra...@gmail.com wrote:

 Hi,

 Every time I run a Pig script I get a number of Job jars left in the /tmp
 directory of my client, 1 per MR job it seems. The file names look like
 /tmp/Job875278192.jar.

 I have scripts that run every five minutes and fire 10 MR jobs each, so the
 amount of space used by these jars grows rapidly. Is there a way to tell
 Pig
 to clean up after itself and remove these jars, or do I need to just write
 my own clean-up script?

 thanks,
 Bill




How to cleanup old Job jars

2010-01-26 Thread Bill Graham
Hi,

Every time I run a Pig script I get a number of Job jars left in the /tmp
directory of my client, 1 per MR job it seems. The file names look like
/tmp/Job875278192.jar.

I have scripts that run every five minutes and fire 10 MR jobs each, so the
amount of space used by these jars grows rapidly. Is there a way to tell Pig
to clean up after itself and remove these jars, or do I need to just write
my own clean-up script?

thanks,
Bill


Re: Deleted input files after load

2010-01-22 Thread Bill Graham
Hive doesn't delete the files upon load, it moves them to a location under
the Hive warehouse directory. Try looking under
/user/hive/warehouse/t_word_count.

On Fri, Jan 22, 2010 at 10:44 AM, Shiva shiv...@gmail.com wrote:

 Hi,
 For the first time I used Hive to load a couple of word count data input
 files into tables with and without OVERWRITE.  Both times the input file
 in HDFS got deleted. Is that expected behavior? I couldn't find any
 definitive answer on the Hive wiki.

 hive LOAD  DATA  INPATH  '/user/vmplanet/output/part-0'  OVERWRITE
 INTO  TABLE  t_word_count;

 Env.: Using Hadoop 0.20.1 and latest Hive on Ubuntu 9.10 running in VMware.

 Thanks,
 Shiva



Re: how to generate a Chukwa SequenceFile

2010-01-21 Thread Bill Graham
Here's a JIRA with a patch. Let me know if you think I should refactor any
parts of it:

https://issues.apache.org/jira/browse/CHUKWA-449
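
The heart of it is roughly along these lines (simplified sketch, not the
actual patch; the local path, data type and field values are made up, and the
ChukwaRecordKey/ChukwaRecord setters should be double-checked against your
Chukwa version):

import org.apache.hadoop.chukwa.extraction.engine.ChukwaRecord;
import org.apache.hadoop.chukwa.extraction.engine.ChukwaRecordKey;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;

// Writes ChukwaRecordKey/ChukwaRecord pairs (what demux produces) to a local
// sequence file, so Pig tests can read it via ChukwaStorage without a cluster.
public class LocalChukwaSeqFileWriter {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.getLocal(conf);
    SequenceFile.Writer writer = SequenceFile.createWriter(fs, conf,
        new Path("/tmp/MyDataType_sample.evt"),
        ChukwaRecordKey.class, ChukwaRecord.class);

    long ts = System.currentTimeMillis();
    ChukwaRecordKey key = new ChukwaRecordKey();
    key.setReduceType("MyDataType");
    key.setKey(String.valueOf(ts));
    ChukwaRecord record = new ChukwaRecord();
    record.setTime(ts);
    record.add("body", "2009-11-18 09:32:43,000 color=blue");

    writer.append(key, record);
    writer.close();
  }
}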

On Tue, Jan 19, 2010 at 6:03 PM, Ariel Rabkin asrab...@gmail.com wrote:

 Yes, if by processing you mean demux.  Which should be renamed, I
 think, at some point.

 --Ari

 On Tue, Jan 19, 2010 at 4:53 PM, Bill Graham billgra...@gmail.com wrote:
  Thanks Ari, that helps. The TempFileUtil.writeASinkFile method seems
 similar
  to what I want actually.
 
  From looking at the code though it seems that a sink file contains
  ChukwaArchiveKey - ChunkImpl key value pairs, but a processed file
 instead
  contains ChukwaRecordKey - ChukwaRecord pairs.
 
  If I followed that code as an example, but just created the latter k/v
 pairs
  instead of the former I'd be good to go, correct?
 
 
  On Tue, Jan 19, 2010 at 3:59 PM, Ariel Rabkin asrab...@gmail.com
 wrote:
 
  There isn't a polished utility for this, and there should be.  I think
  it'll be entirely straightforward, depending on your specific
  requirements.
 
  If you look in
  org.apache.hadoop.chukwa.util.TempFileUtil.RandSeqFileWriter
  there's an example of code that writes out a sequence file for test
  purposes.
 
  --Ari
 
  On Tue, Jan 19, 2010 at 3:46 PM, Bill Graham billgra...@gmail.com
 wrote:
   Hi,
  
   Is there an easy way (maybe using a utility class or the chukwa API)
 to
   manually create a sequence file of chukwa records from a log file
   without
   the need for HDFS?
  
   My use case is this: I've got pig unit tests that read input sequence
   file
   input using ChukwaStorage from local disk. I generated these files by
   putting data into the cluster an waiting for the data processor to
 run.
   We're looking to change the log format though, and I'd like to be able
   to
   write and run the unit tests without putting the new data into the
   cluster.
  
   If there were a command line way that I could do this that would be
 very
   helpful. Or if anyone could point me to the relevant classes, I could
   write
   such a utility and contribute it back.
  
   thanks,
   Bill
  
 
 
 
  --
  Ari Rabkin asrab...@gmail.com
  UC Berkeley Computer Science Department
 
 



 --
 Ari Rabkin asrab...@gmail.com
 UC Berkeley Computer Science Department



LOAD from multiple directories

2010-01-21 Thread Bill Graham
Hi,

I have summary data created in directories every 10 minutes and I have a job
that needs to LOAD from all directories in a one hour period. I was hoping
to use Hadoop file path globbing, but it doesn't seem to allow the glob
patterns with slashes '/' in them. If my directory structure looks like what
I show below, does anyone have any suggestions for how I could write a LOAD
command that would load all directories from 10:30-11:20, for example?


/20100121/10/00
/20100121/10/10
/20100121/10/20
/20100121/10/30 --
/20100121/10/40 --
/20100121/10/50 --
/20100121/11/00 --
/20100121/11/10 --
/20100121/11/20 --
/20100121/11/30
/20100121/11/40


thanks,
Bill


Re: LOAD from multiple directories

2010-01-21 Thread Bill Graham
Thanks for the union suggestion, Thejas.

Dmitry, how were you envisioning that globs can be used for this use case?
Globs with slashes like this don't work:

{10/30,10/40,10/50,11/00,11/10,11/20}


On Thu, Jan 21, 2010 at 11:57 AM, Dmitriy Ryaboy dvrya...@gmail.com wrote:

 you should be able to use globs:

 http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/fs/FileSystem.html#globStatus%28org.apache.hadoop.fs.Path%29

 {ab,c{de,fh}}
Matches a string from the string set {ab, cde, cfh}

 -D

 On Thu, Jan 21, 2010 at 11:29 AM, Thejas Nair te...@yahoo-inc.com wrote:
  I was going to suggest -
  /20100121/{10,11}/{30,40,50,00,10,20} but that would not work because
 it
  will also match - /20100121/10/00 . I don't think hadoop file path
 globing
  can be used for this use case.
 
  You can use multiple loads followed by a union .
 
  -Thejas
 
 
 
  On 1/21/10 11:02 AM, Bill Graham billgra...@gmail.com wrote:
 
  Hi,
 
  I have summary data created in directories every 10 minutes and I have a
 job
  that needs to LOAD from all directories in a one hour period. I was
 hoping
  to use Hadoop file path globing, but it doesn't seem to allow the glob
  patterns with slashes '/' in them. If my directory structure looks like
 what
  I show below, does anyone have any suggestions for how I could write a
 LOAD
  command that would load all directories from 10:30-11:20, for example?
 
 
  /20100121/10/00
  /20100121/10/10
  /20100121/10/20
  /20100121/10/30 --
  /20100121/10/40 --
  /20100121/10/50 --
  /20100121/11/00 --
  /20100121/11/10 --
  /20100121/11/20 --
  /20100121/11/30
  /20100121/11/40
 
 
  thanks,
  Bill
 
 



Re: LOAD from multiple directories

2010-01-21 Thread Bill Graham
Note to those that are interested. As of 0.19.0, globs with slashes do work:

http://issues.apache.org/jira/browse/HADOOP-3498

Of course we're on 0.18.3. Sigh...
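
For anyone on 0.19 or later, the glob is then a one-liner (rough sketch; the
paths just mirror the layout above):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Expands a glob whose {} alternatives contain slashes (works as of 0.19).
public class GlobCheck {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    FileStatus[] matches = fs.globStatus(
        new Path("/20100121/{10/30,10/40,10/50,11/00,11/10,11/20}"));
    if (matches != null) {
      for (FileStatus status : matches) {
        System.out.println(status.getPath());
      }
    }
  }
}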


On Thu, Jan 21, 2010 at 12:09 PM, Bill Graham billgra...@gmail.com wrote:

 Thanks for the union suggestion, Thejas.

 Dmitry, how were you envisioning that globs can be used for this use case?
 Globs with slashes like this don't work:

 {10/30,10/40,10/50,11/00,11/10,11/20}



 On Thu, Jan 21, 2010 at 11:57 AM, Dmitriy Ryaboy dvrya...@gmail.com wrote:

 you should be able to use globs:

 http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/fs/FileSystem.html#globStatus%28org.apache.hadoop.fs.Path%29

 {ab,c{de,fh}}
Matches a string from the string set {ab, cde, cfh}

 -D

 On Thu, Jan 21, 2010 at 11:29 AM, Thejas Nair te...@yahoo-inc.com
 wrote:
  I was going to suggest -
  /20100121/{10,11}/{30,40,50,00,10,20} but that would not work because
 it
  will also match - /20100121/10/00 . I don't think hadoop file path
 globing
  can be used for this use case.
 
  You can use multiple loads followed by a union .
 
  -Thejas
 
 
 
  On 1/21/10 11:02 AM, Bill Graham billgra...@gmail.com wrote:
 
  Hi,
 
  I have summary data created in directories every 10 minutes and I have
 a job
  that needs to LOAD from all directories in a one hour period. I was
 hoping
  to use Hadoop file path globing, but it doesn't seem to allow the glob
  patterns with slashes '/' in them. If my directory structure looks like
 what
  I show below, does anyone have any suggestions for how I could write a
 LOAD
  command that would load all directories from 10:30-11:20, for example?
 
 
  /20100121/10/00
  /20100121/10/10
  /20100121/10/20
  /20100121/10/30 --
  /20100121/10/40 --
  /20100121/10/50 --
  /20100121/11/00 --
  /20100121/11/10 --
  /20100121/11/20 --
  /20100121/11/30
  /20100121/11/40
 
 
  thanks,
  Bill
 
 





Re: Google has obtained the patent over mapreduce

2010-01-20 Thread Bill Graham
Typically companies will patent their IP as a defensive measure to protect
themselves from being sued, as has been pointed out already.  Another
typical reason is to exercise the patent against companies that present a
challenge to their core business.

I would bet that unless you're making a noticeable dent in google's
search/ad business, then you really don't need to worry about them enforcing
the patent against you.


On Wed, Jan 20, 2010 at 1:42 PM, Colin Freas colinfr...@gmail.com wrote:

 Developers do themselves, their code, and their users a disservice if they
 lack some understanding of intellectual property law.  It can be
 complicated, but it isn't rocket science.

 In the United States, Google is protected by the first to
 invent (http://en.wikipedia.org/wiki/First_to_file_and_first_to_invent)
 principle: they can safely publish anything they want about their invention
 prior to applying for a patent if they can prove they came up with the
 invention first.

 As others have pointed out, it isn't something to panic over.  This is
 Google, not Rambus.  It would be nice to see Google proactively and
 explicitly say "We're not going to enforce this patent."

 But this patent and a lot of other software and business process patents
 could be in danger of being summarily overturned, depending on how the US
 Supreme Court rules in the Bilski case.  It's possible they wanted to
 acquire this patent before that ruling, since it would give them standing
 to
 challenge a lot of potentially unfavorable outcomes.



 On Wed, Jan 20, 2010 at 4:07 PM, brien colwell xcolw...@gmail.com wrote:

   Personally, it
  seems like they gave away too much information before they had the
  patent.
 
  I'm not a patent lawyer, but I'd expect they submitted the patent
  application or a provisional before they submitted their academic paper
 or
  other public disclosure.
 
 
 
 
  On Wed, Jan 20, 2010 at 12:09 PM, Edward Capriolo edlinuxg...@gmail.com
  wrote:
 
   Interesting situation.
  
    I try to compare mapreduce to the camera. Let's argue Google is Kodak,
   Apache is Polaroid, and MapReduce is a Camera. Imagine Kodak invented
   the camera privately, never sold it to anyone, but produced some
   document describing what a camera did.
  
   Polaroid followed the document and produced a camera and sold it
   publicly. Kodak later patents a camera, even though no one outside of
   Kodak can confirm Kodak ever made a camera before Polaroid.
  
   Not saying that is what happened here, but google releasing the GFS
   pdf was a large factor in causing hadoop to happen. Personally, it
   seems like they gave away too much information before they had the
   patent.
  
    The patent system faces many problems including this 'back to the
    future' issue, where it takes so long to get a patent that no one can wait;
    by the time a patent is issued there are already multiple viable
    implementations of it.
  
    I am no patent lawyer or anything, but I notice the phrase "master
    process" all over the claims. Maybe if a piece of software (hadoop)
   had a distributed process that would be sufficient to say hadoop
   technology does not infringe on this patent.
  
   I think it would be interesting to look deeply at each claim and
   determine if hadoop could be designed to not infringe on these
   patents, to deal with what if scenarios.
  
  
  
   On Wed, Jan 20, 2010 at 11:29 AM, Ravi ravindra.babu.rav...@gmail.com
 
   wrote:
Hi,
 I too read about that news. I don't think that it will be any
 problem.
However Google didn't invent the model.
   
Thanks.
   
On Wed, Jan 20, 2010 at 9:47 PM, Udaya Lakshmi udaya...@gmail.com
   wrote:
   
Hi,
  As an user of hadoop, Is there anything to worry about Google
   obtaining
the patent over mapreduce?
   
Thanks.
   
   
  
 



How to deploy a custom processor to demux

2009-12-22 Thread Bill Graham
Hi,

I've written my own Processor to handle my log format per this wiki and I've
run into a couple of gotchas:
http://wiki.apache.org/hadoop/DemuxModification

1. The default processor is not the TsProcessor as documented, but the
DefaultProcessor (see line 83 of Demux.java). This causes headaches because
when using DefaultProcessor data always goes under minute 0 in hdfs,
regardless of when in the hour it was created.

2. When implementing a custom parser as shown in the wiki, how do you
register the class so it gets included in the job that's submitted to the
hadoop cluster? The only way I've been able to do this is to put my class in
the package org.apache.hadoop.chukwa.extraction.demux.processor.mapper and
then manually add that class to the chukwa-core-0.3.0.jar that  is on my
data processor, which is a pretty rough hack. Otherwise, I get class not
found exceptions in my mapper.

thanks,
Bill


Re: How to deploy a custom processor to demux

2009-12-22 Thread Bill Graham
Thanks for your quick reply Eric.

The TsProcessor does use buildGenericRecord and has been working fine for me
(at least I thought it was). I've mapped it to my dataType as you described
without problems. My only point with issue #1 was just that the
documentation is off and that the DefaultProcessor yields what I think is
unexpected behavior.

 There is an plan to load parser class from class path by using Java
annotation.
 It is still in the initial phase of planning.  Design participation are
welcome.

Yes, annotations would be useful. Or what about just having an extensions
directory (maybe lib/ext/) or something similar where custom jars could be
placed that are to be submitted by demux M/R? Do you know where the code
resides that handles adding the chukwa-core jar? I poked around a bit but
couldn't find it.

Finally, is there a JIRA for this issue that you know of? If not I'll create
one. This is going to become a pain point for us soon, so if we have a
design I might be able to contribute a patch.

thanks,
Bill


On Tue, Dec 22, 2009 at 2:14 PM, Eric Yang ey...@yahoo-inc.com wrote:

 On 12/22/09 1:36 PM, Bill Graham billgra...@gmail.com wrote:

  I've written my own Processor to handle my log format per this wiki and
 I've
  run into a couple of gotchast:
  http://wiki.apache.org/hadoop/DemuxModification
 
  1. The default processor is not the TsProcessor as documented, but the
  DefaultProcessor (see line 83 of Demux.java). This causes headaches
 because
  when using DefaultProcessor  data always goes under minute 0 in hdfs,
  regardless of when in the hour it was created.
 

 There is a generic method to build the record, like:

 buildGenericRecord(record, recordEntry, timestamp, recordType);

 This method will build up key like:

 Time partition/Primary Key/timestamp

 When all records are rolled up into a large sequence file by the end of the hour
 and
 end of the day, the sequence file is sorted by time partition and primary
 key.  This arrangement of data structure was put in place to assist data
 scanning.  When data is retrieved, use record.getTimestamp() to find the
 real timestamp for the record.

 TsProcessor is incomplete for now because the key in ChukwaRecord is used
 in hourly and daily roll up.  Without using buildGenericRecord, hourly and
 daily roll up will not work correctly.

  2. When implementing a custom parser as shown in the wiki, how do you
 register
  the class so it gets included in the job that's submitted to the hadoop
  cluster? The only way I've been able to do this is to put my class in the
  package org.apache.hadoop.chukwa.extraction.demux.processor.mapper and
 then
  manually add that class to the chukwa-core-0.3.0.jar that  is on my data
  processor, which is a pretty rough hack. Otherwise, I get class not found
  exceptions in my mapper.

 The demux process is controlled by $CHUKWA_HOME/conf/chukwa-demux-conf.xml,
 which maps the recordType to your parser class.  There is a plan to load
 parser classes from the class path by using Java annotations.  It is still in
 the initial phase of planning.  Design participation is welcome.  Hope this
 helps.  :)

 Regards,
 Eric




Log data written to [DataType]InError

2009-12-21 Thread Bill Graham
Hi,

For some reason data that I put into Chukwa appears in the following
directory after the demux/post-processor processes run:

/chukwa/repos[clusterName]/[dataType]InError/

instead of

/chukwa/repos[clusterName]/[dataType]/

There's no explanation in the logs as to why this is happening, nor are
there any exceptions. Any idea why this happens or how I can troubleshoot?

thanks,
Bill


Re: dynamically calling STORE

2009-12-16 Thread Bill Graham
Thanks Dmitriy, this is exactly what I need.

There was one bug I ran into though FYI, which is when making a request like
this, as documented in the JavaDocs:

STORE A INTO '/my/home/output' USING MultiStorage('/my/home/output','0',
'none', '\t');

Pig would create a file '/my/home/output' and then an exception would be
thrown when MultiStorage tried to make a directory under '/my/home/output'.
The workaround that worked for me was to instead specify a dummy location as
the first path like so:

STORE A INTO '/my/home/output/temp' USING
MultiStorage('/my/home/output','0', 'none', '\t');


On Tue, Dec 15, 2009 at 1:06 PM, Dmitriy Ryaboy dvrya...@gmail.com wrote:

 Bill,
 A custom storefunc should do the trick. See
 https://issues.apache.org/jira/browse/PIG-958  (aka
 piggybank.storage.MultiStorage) for a jumping-off point.

 -D

 On Tue, Dec 15, 2009 at 1:59 PM, Bill Graham billgra...@gmail.com wrote:
  Hi,
 
  I'm pretty sure the answer to my question is no, but I have to ask. Is it
  possible within Pig to store different groups of data into different
 output
  files where the grouping is dynamic (i.e. not known ahead of time)?
 Here's
  what I'm trying to do...
 
  I've got a script that reads log files of URLs and generates counts for a
  given time period. The urls might have a 'tag' querystring param though,
 and
  in that case I want to get the most popular urls for each tag output to
 it's
  own file.
 
  My data looks like this and is ordered by tag asc, count desc:
 
  [tag] [timeinterval] [url] [count]
 
  I need to do something like so:
 
  for each tag group found
   store all records in file foo_[tag].txt
 
  I ultimately need these files on local disk and I'm looking for a better
 way
  to do so than generating a file of N unique tags in HDFS, reading it from
  Java, submitting N jobs with the tag name substituted into a script file,
  followed by N copyToLocal calls.
 
  At least two possible solutions come to mind, but am curious if there's
  another that I'm overlooking:
  1. In java submit pig dynamic commands to an instance of PigServer. I'd
  still need a unique tag file for this case.
  2. Maybe with a custom store function??
 
  thanks,
  Bill
 



dynamically calling STORE

2009-12-15 Thread Bill Graham
Hi,

I'm pretty sure the answer to my question is no, but I have to ask. Is it
possible within Pig to store different groups of data into different output
files where the grouping is dynamic (i.e. not known ahead of time)? Here's
what I'm trying to do...

I've got a script that reads log files of URLs and generates counts for a
given time period. The urls might have a 'tag' querystring param though, and
in that case I want to get the most popular urls for each tag output to its
own file.

My data looks like this and is ordered by tag asc, count desc:

[tag] [timeinterval] [url] [count]

I need to do something like so:

for each tag group found
  store all records in file foo_[tag].txt

I ultimately need these files on local disk and I'm looking for a better way
to do so than generating a file of N unique tags in HDFS, reading it from
Java, submitting N jobs with the tag name substituted into a script file,
followed by N copyToLocal calls.

At least two possible solutions come to mind, but am curious if there's
another that I'm overlooking:
1. In java submit pig dynamic commands to an instance of PigServer (a rough
sketch follows below). I'd still need a unique tag file for this case.
2. Maybe with a custom store function??
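
For option 1, I'm picturing something roughly like this (untested sketch; the
tag values, paths and script text are made up):

import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

// Submits one FILTER + STORE per tag from Java, driven by the unique-tag list.
public class PerTagStore {
  public static void main(String[] args) throws Exception {
    String[] tags = { "sports", "news" }; // would be read from the unique-tag file
    PigServer pig = new PigServer(ExecType.MAPREDUCE);
    pig.registerQuery("raw = LOAD '/path/to/counts' AS " +
        "(tag:chararray, interval:chararray, url:chararray, count:long);");
    for (String tag : tags) {
      pig.registerQuery("t_" + tag + " = FILTER raw BY tag == '" + tag + "';");
      pig.store("t_" + tag, "/output/foo_" + tag);
    }
  }
}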

thanks,
Bill


HDFS move instead of LOAD DATA INPATH?

2009-12-08 Thread Bill Graham
Hi,

When the LOAD DATA INPATH is issued, does Hive modify the metastore data,
or do anything else special besides just moving the files in HDFS?

I've got a daily MR job that runs before I need to load data into a daily
Hive partition and using the FileSystem class to move the files from Java
would be pretty easy.
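
The move itself would be just a rename (rough sketch; the paths are made up,
and of course it touches nothing in the metastore, which is exactly my
question):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Moves a day's MR output under the table's partition directory via rename.
public class MoveIntoPartition {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    Path src = new Path("/user/me/daily-output/20091208");
    Path dest = new Path("/user/hive/warehouse/my_table/ds=20091208");
    fs.mkdirs(dest.getParent());
    if (!fs.rename(src, dest)) {
      throw new RuntimeException("rename failed: " + src + " -> " + dest);
    }
  }
}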

thanks,
Bill


broken links to Hive documentation

2009-11-20 Thread Bill Graham
The links to documentation for releases 3.0 and 4.0 on the left nav of the
Hive homepage are broken FYI:

http://hadoop.apache.org/hive/

They send to you these pages that show white apache directory listing pages:

http://hadoop.apache.org/hive/docs/r0.3.0/
http://hadoop.apache.org/hive/docs/r0.4.0/


Re: Tracking files deletions in HDFS

2009-11-19 Thread Bill Graham
I don't know about the auditing tools, but I have seen files get randomly
deleted in dev setups when using hadoop with the default hadoop.tmp.dir
setting, which is /tmp/hadoop-${user.name}.



On Thu, Nov 19, 2009 at 9:03 AM, Stas Oskin stas.os...@gmail.com wrote:

 Hi.

 I have a strange case of missing files, which most probably were randomly
 deleted by my application.

 Does HDFS provides any auditing tools for tracking who deleted what and
 when?

 Thanks in advance.



Accessing a bag of token tuples from TOKENIZE

2009-11-18 Thread Bill Graham
Hi,

I'm struggling to get the tokens out of a bag of tuples created by the
TOKENIZE UDF and could use some help. I want to tokenize and then be able to
reference the tokens by their position. Is this even possible? Since the
token count is non-deterministic, I'm questioning whether I can use positional
parameters to dig them out.

Anyway, here's what I'm doing, starting with a chararray where each record is:

grunt describe B;
B: {body: chararray}
grunt dump B;
(2009-11-18 09:32:43,000 color=blue)
(2009-11-18 09:32:43,000 color=red)
(2009-11-18 09:32:44,000 color=red)
(2009-11-18 09:32:45,000 color=green)

grunt C = FOREACH B GENERATE TOKENIZE((chararray)body) as
B1:bag{T1:tuple(T:chararray)};
grunt describe C;
C: {B1: {T1: (T: chararray)}}

grunt D = FOREACH C GENERATE B1.$0 as date;
grunt describe D;
D: {date: {T: chararray}}

grunt dump D;
...
({(2009-11-18),(09:32:43),(000),(color=blue)})
({(2009-11-18),(09:32:43),(000),(color=red)})
({(2009-11-18),(09:32:44),(000),(color=red)})
({(2009-11-18),(09:32:45),(000),(color=green)})

What I'd expect to see is just the date values.

Any ideas?

thanks,
Bill


Re: Accessing a bag of token tuples from TOKENIZE

2009-11-18 Thread Bill Graham
Thanks Zaki, I think you're right about bag values lacking order and not
being able to be accessed by position.

I'll take a look at the regex UDF. What I'm ultimately trying to get is a
handle to each token in the body though, I'm just using date as an example.
I'd like to be able to pull these values out with one UDF execution per line
(as opposed to per field).

My input is basically access log entries and I need to get the different
space-delimited values in it. Seems like the thing to do would be to write
my own UDF that returns a tuple from the space-delimited tokens for each
line passed.
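
Something roughly like this is what I have in mind (untested sketch; the class
name and the plain whitespace split are assumptions):

import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;
import org.apache.pig.data.TupleFactory;

// Splits one log line on whitespace and returns the tokens as a single tuple,
// so downstream statements can reference fields by position ($0, $1, ...).
public class SplitLine extends EvalFunc<Tuple> {
  private static final TupleFactory FACTORY = TupleFactory.getInstance();

  public Tuple exec(Tuple input) throws IOException {
    if (input == null || input.size() == 0 || input.get(0) == null) {
      return null;
    }
    String[] tokens = ((String) input.get(0)).split("\\s+");
    Tuple out = FACTORY.newTuple(tokens.length);
    for (int i = 0; i < tokens.length; i++) {
      out.set(i, tokens[i]);
    }
    return out;
  }
}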

I'm sure this problem has been solved a million times before though, so if
anyone has a better suggestion I'd love to hear it. I recall talk about an
access log UDF at one point (maybe it was in hive), but I can't find any
references to it at the moment.


On Wed, Nov 18, 2009 at 12:38 PM, zaki rahaman zaki.raha...@gmail.com wrote:

 Hm,

 I may be wrong about this, but from what I recall, there are no 'fields' in
 the bag of tokens (and no ordering) created by TOKENIZE. As such, I don't
 think there's a way to accomplish what you're trying to do the way it's
 written. As an alternative approach, you might try using FLATTEN to unnest
 the TOKENIZE output and give you tuples for each token and then filter the
 tokens to those that match your date pattern. Alternatively, you could
 accomplish this in one step with a regex extract UDF (there's one in
 piggybank if I recall correctly and something similar in amazon's pig
 function jar). If the data you described below is your input data, then you
 could remove the projection step altogether by using a RegEx LoadFunc to get
 the date field. Hope this helps, and others feel free to correct me if I'm
 wrong, as I'm sure there's probably a better/more elegant way.

 -- Zaki


 On Wed, Nov 18, 2009 at 3:03 PM, Bill Graham billgra...@gmail.com wrote:

 Hi,

 I'm struggling to get the tokens out of a bag of tuples created by the
 TOKENIZE UDF and could use some help. I want to tokenize and then be able
 to
 reference the tokens by their position. Is this even possible? Since the
 token count is non-deterministic, I'm question whether I can use
 positional
 parameters to dig them out.

 Anyway, here's what I'm doing, starting with a chararray where each:

 grunt describe B;
 B: {body: chararray}
 grunt dump B;
 (2009-11-18 09:32:43,000 color=blue)
 (2009-11-18 09:32:43,000 color=red)
 (2009-11-18 09:32:44,000 color=red)
 (2009-11-18 09:32:45,000 color=green)

 grunt C = FOREACH B GENERATE TOKENIZE((chararray)body) as
 B1:bag{T1:tuple(T:chararray)};
 grunt describe C;
 C: {B1: {T1: (T: chararray)}}

 grunt D = FOREACH C GENERATE B1.$0 as date;
 grunt describe D;
 D: {date: {T: chararray}}

 grunt dump D;
 ...
 ({(2009-11-18),(09:32:43),(000),(color=blue)})
 ({(2009-11-18),(09:32:43),(000),(color=red)})
 ({(2009-11-18),(09:32:44),(000),(color=red)})
 ({(2009-11-18),(09:32:45),(000),(color=green)})

 What I'd expect to see is just the date values.

 Any ideas?

 thanks,
 Bill




 --
 Zaki Rahaman




Re: Accessing a bag of token tuples from TOKENIZE

2009-11-18 Thread Bill Graham
This is exactly what I need, thanks! I had checked piggybank previously, but
didn't catch these the first time.

On Wed, Nov 18, 2009 at 1:15 PM, zaki rahaman zaki.raha...@gmail.com wrote:

 Hey Bill,

 If you look in piggybank (http://wiki.apache.org/pig/PiggyBank) in
 the contrib dir of your pig installation, you'll find several functions
 that
 might help. I haven't used any myself, but in
 org.apache.pig.piggybank.storage you'll find RegExLoader and MyRegExLoader.
 If you pass a reg exp with capturing groups I believe you can simply use
 these functions directly. There are also apache log specific load funcs, I
 think there's Common and Combined Log Loaders... simply set up your scripts
 to use those functions to load your input data and you'll have what you
 need
 I believe.



 On Wed, Nov 18, 2009 at 4:03 PM, Mridul Muralidharan
  mrid...@yahoo-inc.com wrote:

 
 
  You are right, there is no ordering of tuples within a bag by default
  (except in some cases - like output of ORDER BY).
 
  For the specific purpose of pulling the date field - you could just use
  some regexp udf instead of tokenize to pick the value you are interested
 in.
 
  There should be udf's in piggy bank which do this ...
 
 
 
  Or is this a more general question regarding accessing tuples within a
 bag
  in some ordered fashion ?
 
 
  Regards,
  Mridul
 
 
 
 
  Bill Graham wrote:
 
  Hi,
 
  I'm struggling to get the tokens out of a bag of tuples created by the
  TOKENIZE UDF and could use some help. I want to tokenize and then be
 able
  to
  reference the tokens by their position. Is this even possible? Since the
  token count is non-deterministic, I'm question whether I can use
  positional
  parameters to dig them out.
 
  Anyway, here's what I'm doing, starting with a chararray where each:
 
  grunt describe B;
  B: {body: chararray}
  grunt dump B;
  (2009-11-18 09:32:43,000 color=blue)
  (2009-11-18 09:32:43,000 color=red)
  (2009-11-18 09:32:44,000 color=red)
  (2009-11-18 09:32:45,000 color=green)
 
  grunt C = FOREACH B GENERATE TOKENIZE((chararray)body) as
  B1:bag{T1:tuple(T:chararray)};
  grunt describe C;
  C: {B1: {T1: (T: chararray)}}
 
  grunt D = FOREACH C GENERATE B1.$0 as date;
  grunt describe D;
  D: {date: {T: chararray}}
 
  grunt dump D;
  ...
  ({(2009-11-18),(09:32:43),(000),(color=blue)})
  ({(2009-11-18),(09:32:43),(000),(color=red)})
  ({(2009-11-18),(09:32:44),(000),(color=red)})
  ({(2009-11-18),(09:32:45),(000),(color=green)})
 
  What I'd expect to see is just the date values.
 
  Any ideas?
 
  thanks,
  Bill
 
 
 


 --
 Zaki Rahaman



Re: Problem regarding hive-jdbc driver

2009-10-26 Thread Bill Graham
Not all methods in the JDBC spec are implemented, as you're noticing.
Statement.setMaxRows(int) is implemented though, so maybe that would work
for your needs. Or you could just specify a limit in your sql.
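
Something like this (rough sketch; the URL and query are made up, and paging
over the returned rows is left to the client):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

// Caps the number of rows the Hive JDBC driver returns, instead of setFetchSize().
public class BoundedHiveQuery {
  public static void main(String[] args) throws Exception {
    Class.forName("org.apache.hadoop.hive.jdbc.HiveDriver");
    Connection con =
        DriverManager.getConnection("jdbc:hive://localhost:10000/default", "", "");
    Statement stmt = con.createStatement();
    stmt.setMaxRows(100); // or put a LIMIT in the query itself
    ResultSet rs = stmt.executeQuery("SELECT * FROM my_table");
    while (rs.next()) {
      System.out.println(rs.getString(1));
    }
    con.close();
  }
}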

Bill

On Mon, Oct 26, 2009 at 6:21 AM, Mohan Agarwal mohan.agarwa...@gmail.com wrote:

 Hi everyone,
 I am writing a java program to create a Query Editor to
 execute queries through hive. I am using the hive-jdbc driver for database
 connection and query execution, but I am facing a problem regarding the
 java.sql.Statement class.  When I use setFetchSize() of the Statement
 class, it gives a java.sql.SQLException: Method not supported
 exception. I have to implement paging in the user interface; I can't show all
 the data to the user in a single page. Also, the java.sql.Statement class does
 not support scrollable ResultSets.
  Can someone help me solve this problem, so that I can
 implement paging in the user interface?

 Thanking You
 Mohan Agarwal



Re: Hive query web service

2009-10-21 Thread Bill Graham
There is no J2EE web server or SOAP web service in this equation. The Hive
JDBC client connects to the Hive Server, which can be started with a script
like the following, run from your $HIVE_HOME/build/dist directory:

export HADOOP_HOME=/path/to/hadoop
HIVE_PORT=1 ./bin/hive --service hiveserver

No war files, or WEB-INF/ directories at all in this case.

On Wed, Oct 21, 2009 at 5:24 AM, Arijit Mukherjee ariji...@gmail.com wrote:

 If I understood the concept of standalone and embedded properly,
 then a Web Service which connects to the Hive/Thrift server via JDBC
 (jdbc:hive://host:port...) is actually a standalone client. The
 difference from a Java standalone client is that - in this case, the
 whole thing is packaged as a Web Service and deployed on a web server
 such as JBoss/GlassFish - and the connection is initiated only after a
 SOAP request is received from the web service client. If that is the
 case, then the Web Service should not require the conf files, or jpox
 libraries - is it not? Or did I misunderstand the concept?

 Arijit

 2009/10/21 Arijit Mukherjee ariji...@gmail.com:
  Update: I did a clean/build/deploy - the config files are within the
  Web Service WEB-INF/classes folder, and the libraries (including the
  jpox ones) are inside WEB-INF/lib - which are standard for any web
  application. But the config related exception is still there:-((
 
  Arijit
 
  2009/10/21 Arijit Mukherjee ariji...@gmail.com:
  Thanx Bill. I copied the jpox jars from the 0.3.0 distribution and
  added them to the web service archive, and they are in the classpath,
  but the config related exception is still there. Let me do a clean
  build/deploy and I'll get back again.
 
  Arijit
 
  2009/10/20 Bill Graham billgra...@gmail.com:
  The Hive JDBC client can run in two different modes: standalone and
  embedded.
 
  Standalone mode is where the client connects to a separate standalone
  HiveServer by specifying the host:port of the server in the jdbc URL
 like
  this: jdbc:hive://localhost:1/default. In this case the hive
 configs are
  not needed by the client, since the client is making thrift requests to
 the
  server which has the Hive configs. the Hive Server knows how to resolve
 the
  metastore.
 
  Embedded mode is where the JDBC client connects to itself so to speak
  using a JDBC url like this: jdbc:hive://. It's as if the client is
 running
  an embedded server that only it communicates with. In this case the
 client
  needs the Hive configs since it needs to resolve the metastore, amongst
  other things. The metastore dependency in this case is what will cause
 you
  to see jpox errors appear if those jars aren't found.
 
  HTH,
  Bill
 
  On Tue, Oct 20, 2009 at 4:14 AM, Arijit Mukherjee ariji...@gmail.com
  wrote:
 
  BTW - the service is working though, in spite of those exceptions. I'm
  able to run queries and get results.
 
  Arijit
 
  2009/10/20 Arijit Mukherjee ariji...@gmail.com:
   I created a hive-site.xml using the outline given in the Hive Web
   Interface tutorial - now that file is in the classpath of the Web
   Service - and the service can find the file. But, now there's
 another
   exception -
  
   2009-10-20 14:27:30,914 DEBUG [httpSSLWorkerThread-14854-0]
   HiveQueryService - connecting to Hive using URL:
   jdbc:hive://localhost:1/default
   2009-10-20 14:27:30,969 DEBUG [httpSSLWorkerThread-14854-0]
   Configuration - java.io.IOException: config()
  at
   org.apache.hadoop.conf.Configuration.init(Configuration.java:176)
  at
   org.apache.hadoop.conf.Configuration.init(Configuration.java:164)
  at
 org.apache.hadoop.hive.conf.HiveConf.init(HiveConf.java:287)
  at
  
 org.apache.hadoop.hive.jdbc.HiveConnection.init(HiveConnection.java:63)
  at
   org.apache.hadoop.hive.jdbc.HiveDriver.connect(HiveDriver.java:109)
  at
 java.sql.DriverManager.getConnection(DriverManager.java:582)
  at
 java.sql.DriverManager.getConnection(DriverManager.java:185)
  at
  
 com.ctva.poc.hive.service.HiveQueryService.getConnection(HiveQueryService.java:134)
  at
  
 com.ctva.poc.hive.service.HiveQueryService.connectDB(HiveQueryService.java:43)
  
   Apparently, something goes wrong during the config routine. Do I
 need
   something more within the service?
  
   Regards
   Arijit
  
   2009/10/20 Arijit Mukherjee ariji...@gmail.com:
   Hi
  
   I'm trying to create a Web Service which will access Hive (0.4.0
    release) using JDBC. I used the sample JDBC code from the wiki
  
   (
 http://wiki.apache.org/hadoop/Hive/HiveClient#head-fd2d8ae9e17fdc3d9b7048d088b2c23a53a6857d
 ),
    but when I'm trying to connect to the DB using the DriverManager,
   there's an exception which seems to relate to hive-site.xml
 (HiveConf
   - hive-site.xml not found.). But I could not find any hive-site.xml
 in
   $HIVE_HOME/conf - there's only hive-default.xml. The wiki page also
   speaks about couple of jpox JAR files, which aren't

Re: Hive query web service

2009-10-20 Thread Bill Graham
The Hive JDBC client can run in two different modes: standalone and
embedded.

Standalone mode is where the client connects to a separate standalone
HiveServer by specifying the host:port of the server in the jdbc URL like
this: jdbc:hive://localhost:1/default. In this case the hive configs are
not needed by the client, since the client is making thrift requests to the
server, which has the Hive configs. The Hive Server knows how to resolve the
metastore.

Embedded mode is where the JDBC client connects to itself so to speak
using a JDBC url like this: jdbc:hive://. It's as if the client is running
an embedded server that only it communicates with. In this case the client
needs the Hive configs since it needs to resolve the metastore, amongst
other things. The metastore dependency in this case is what will cause you
to see jpox errors appear if those jars aren't found.

HTH,
Bill

On Tue, Oct 20, 2009 at 4:14 AM, Arijit Mukherjee ariji...@gmail.com wrote:

 BTW - the service is working though, in spite of those exceptions. I'm
 able to run queries and get results.

 Arijit

 2009/10/20 Arijit Mukherjee ariji...@gmail.com:
  I created a hive-site.xml using the outline given in the Hive Web
  Interface tutorial - now that file is in the classpath of the Web
  Service - and the service can find the file. But, now there's another
  exception -
 
  2009-10-20 14:27:30,914 DEBUG [httpSSLWorkerThread-14854-0]
  HiveQueryService - connecting to Hive using URL:
  jdbc:hive://localhost:1/default
  2009-10-20 14:27:30,969 DEBUG [httpSSLWorkerThread-14854-0]
  Configuration - java.io.IOException: config()
 at
 org.apache.hadoop.conf.Configuration.init(Configuration.java:176)
 at
 org.apache.hadoop.conf.Configuration.init(Configuration.java:164)
 at org.apache.hadoop.hive.conf.HiveConf.init(HiveConf.java:287)
 at
 org.apache.hadoop.hive.jdbc.HiveConnection.init(HiveConnection.java:63)
 at
 org.apache.hadoop.hive.jdbc.HiveDriver.connect(HiveDriver.java:109)
 at java.sql.DriverManager.getConnection(DriverManager.java:582)
 at java.sql.DriverManager.getConnection(DriverManager.java:185)
 at
 com.ctva.poc.hive.service.HiveQueryService.getConnection(HiveQueryService.java:134)
 at
 com.ctva.poc.hive.service.HiveQueryService.connectDB(HiveQueryService.java:43)
 
  Apparently, something goes wrong during the config routine. Do I need
  something more within the service?
 
  Regards
  Arijit
 
  2009/10/20 Arijit Mukherjee ariji...@gmail.com:
  Hi
 
  I'm trying to create a Web Service which will access Hive (0.4.0
  release) using JDBC. I used the sample JDBC code from the wiki
  (
 http://wiki.apache.org/hadoop/Hive/HiveClient#head-fd2d8ae9e17fdc3d9b7048d088b2c23a53a6857d
 ),
  but when I'm trying to connect to the DB using the DriverManager,
  there's an exception which seems to relate to hive-site.xml (HiveConf
  - hive-site.xml not found.). But I could not find any hive-site.xml in
  $HIVE_HOME/conf - there's only hive-default.xml. The wiki page also
  speaks about couple of jpox JAR files, which aren't in the lib folder
  either.
 
  Am I missing something here?
 
  Regards
  Arijit
 
  --
  And when the night is cloudy,
  There is still a light that shines on me,
  Shine on until tomorrow, let it be.
 
 
 
 
  --
  And when the night is cloudy,
  There is still a light that shines on me,
  Shine on until tomorrow, let it be.
 



 --
 And when the night is cloudy,
 There is still a light that shines on me,
 Shine on until tomorrow, let it be.



input20.q unit test failure

2009-10-02 Thread Bill Graham
Hi,

I'm trying to run the unit tests before submitting a patch and I'm getting a
test failure. I've tried running the same test on a fresh checkout and it
also fails. Below is an excerpt of the output. Is anyone else able to run
this test?


[grah...@bgrahammbproosx:~/ws/hive-svn/hive-trunk-clean]$ svn up
...
Updated to revision 821099.
[grah...@bgrahammbproosx:~/ws/hive-svn/hive-trunk-clean]$ svn stat
[grah...@bgrahammbproosx:~/ws/hive-svn/hive-trunk-clean]$ svn diff
[grah...@bgrahammbproosx:~/ws/hive-svn/hive-trunk-clean]$ svn info
Path: .
URL: http://svn.apache.org/repos/asf/hadoop/hive/trunk
Repository Root: http://svn.apache.org/repos/asf
Repository UUID: 13f79535-47bb-0310-9956-ffa450edef68
Revision: 821099
Node Kind: directory
Schedule: normal
Last Changed Author: zshao
Last Changed Rev: 820823
Last Changed Date: 2009-10-01 15:19:08 -0700 (Thu, 01 Oct 2009)
[grah...@bgrahammbproosx:~/ws/hive-svn/hive-trunk-clean]$ ant clean test
-Dtestcase=TestCliDriver -Dqfile=input20.q
...
[junit]  NULL  432
[junit]  NULL  435
[junit]  NULL  436
[junit] junit.framework.AssertionFailedError: Client execution results
failed with error code = 1
[junit] at junit.framework.Assert.fail(Assert.java:47)
[junit] at
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_input20(TestCliDriver.java:96)
[junit] at sun.reflect.NativeMethodAccessorImpl.invoke0(Native
Method)
[junit] at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
[junit] at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
[junit] at java.lang.reflect.Method.invoke(Method.java:597)
[junit] at junit.framework.TestCase.runTest(TestCase.java:154)
[junit] at junit.framework.TestCase.runBare(TestCase.java:127)
[junit] at junit.framework.TestResult$1.protect(TestResult.java:106)
[junit] at
junit.framework.TestResult.runProtected(TestResult.java:124)
[junit] at junit.framework.TestResult.run(TestResult.java:109)
[junit] at junit.framework.TestCase.run(TestCase.java:118)
[junit] at junit.framework.TestSuite.runTest(TestSuite.java:208)
[junit] at junit.framework.TestSuite.run(TestSuite.java:203)
[junit] at
org.apache.tools.ant.taskdefs.optional.junit.JUnitTestRunner.run(JUnitTestRunner.java:421)
[junit] at
org.apache.tools.ant.taskdefs.optional.junit.JUnitTestRunner.launch(JUnitTestRunner.java:912)
[junit] at
org.apache.tools.ant.taskdefs.optional.junit.JUnitTestRunner.main(JUnitTestRunner.java:766)
[junit]  NULL  437
[junit]  NULL  438
[junit]  NULL  439
[junit]  NULL  44
...
[junit]  5 348_348
[junit]  5 401_401
[junit]  5 469_469
[junit] Tests run: 1, Failures: 1, Errors: 0, Time elapsed: 22.597 sec
[junit] Test org.apache.hadoop.hive.cli.TestCliDriver FAILED

BUILD FAILED
/Users/grahamb/ws/hive-svn/hive-trunk-clean/build.xml:142: The following
error occurred while executing this line:
/Users/grahamb/ws/hive-svn/hive-trunk-clean/build.xml:89: The following
error occurred while executing this line:
/Users/grahamb/ws/hive-svn/hive-trunk-clean/build-common.xml:316: Tests
failed!

Total time: 1 minute 28 seconds

thanks,
Bill


Re: input20.q unit test failure

2009-10-02 Thread Bill Graham
Thanks Namit for giving it a shot. I just updated again and it's still
failing. I tried a fresh checkout and it also failed. Any pointers on how to
troubleshoot this? I'm reluctant to submit a patch with this failure, even
though it's happening without any of my modifications.


On Fri, Oct 2, 2009 at 10:36 AM, Namit Jain nj...@facebook.com wrote:

  I tried the test on the latest version and it ran fine for me.





 [nj...@dev029 hive4]$ svn info

 Path: .

 URL: http://svn.apache.org/repos/asf/hadoop/hive/trunk

 Repository Root: http://svn.apache.org/repos/asf

 Repository UUID: 13f79535-47bb-0310-9956-ffa450edef68

 Revision: 821103

 Node Kind: directory

 Schedule: normal

 Last Changed Author: zshao

 Last Changed Rev: 820823

 Last Changed Date: 2009-10-01 15:19:08 -0700 (Thu, 01 Oct 2009)



 Can you do update your repository ?





 *From:* Bill Graham [mailto:billgra...@gmail.com]
 *Sent:* Friday, October 02, 2009 10:19 AM
 *To:* hive-user@hadoop.apache.org
 *Subject:* input20.q unit test failure



 Hi,

 I'm trying to run the unit tests before submitting a patch and I'm getting
 a test failure. I've tried running the same test on a fresh checkout and it
 also fails. Below is an excerpt of the output. Is anyone else able to run
 this test?


 [grah...@bgrahammbproosx:~/ws/hive-svn/hive-trunk-clean]$ svn up
 ...
 Updated to revision 821099.
 [grah...@bgrahammbproosx:~/ws/hive-svn/hive-trunk-clean]$ svn stat
 [grah...@bgrahammbproosx:~/ws/hive-svn/hive-trunk-clean]$ svn diff
 [grah...@bgrahammbproosx:~/ws/hive-svn/hive-trunk-clean]$ svn info
 Path: .
 URL: http://svn.apache.org/repos/asf/hadoop/hive/trunk
 Repository Root: http://svn.apache.org/repos/asf
 Repository UUID: 13f79535-47bb-0310-9956-ffa450edef68
 Revision: 821099
 Node Kind: directory
 Schedule: normal
 Last Changed Author: zshao
 Last Changed Rev: 820823
 Last Changed Date: 2009-10-01 15:19:08 -0700 (Thu, 01 Oct 2009)
 [grah...@bgrahammbproosx:~/ws/hive-svn/hive-trunk-clean]$ ant clean test
 -Dtestcase=TestCliDriver -Dqfile=input20.q
 ...
 [junit]  NULL  432
 [junit]  NULL  435
 [junit]  NULL  436
 [junit] junit.framework.AssertionFailedError: Client execution results
 failed with error code = 1
 [junit] at junit.framework.Assert.fail(Assert.java:47)
 [junit] at
 org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_input20(TestCliDriver.java:96)
 [junit] at sun.reflect.NativeMethodAccessorImpl.invoke0(Native
 Method)
 [junit] at
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
 [junit] at
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
 [junit] at java.lang.reflect.Method.invoke(Method.java:597)
 [junit] at junit.framework.TestCase.runTest(TestCase.java:154)
 [junit] at junit.framework.TestCase.runBare(TestCase.java:127)
 [junit] at
 junit.framework.TestResult$1.protect(TestResult.java:106)
 [junit] at
 junit.framework.TestResult.runProtected(TestResult.java:124)
 [junit] at junit.framework.TestResult.run(TestResult.java:109)
 [junit] at junit.framework.TestCase.run(TestCase.java:118)
 [junit] at junit.framework.TestSuite.runTest(TestSuite.java:208)
 [junit] at junit.framework.TestSuite.run(TestSuite.java:203)
 [junit] at
 org.apache.tools.ant.taskdefs.optional.junit.JUnitTestRunner.run(JUnitTestRunner.java:421)
 [junit] at
 org.apache.tools.ant.taskdefs.optional.junit.JUnitTestRunner.launch(JUnitTestRunner.java:912)
 [junit] at
 org.apache.tools.ant.taskdefs.optional.junit.JUnitTestRunner.main(JUnitTestRunner.java:766)
 [junit]  NULL  437
 [junit]  NULL  438
 [junit]  NULL  439
 [junit]  NULL  44
 ...
 [junit]  5 348_348
 [junit]  5 401_401
 [junit]  5 469_469
 [junit] Tests run: 1, Failures: 1, Errors: 0, Time elapsed: 22.597 sec
 [junit] Test org.apache.hadoop.hive.cli.TestCliDriver FAILED

 BUILD FAILED
 /Users/grahamb/ws/hive-svn/hive-trunk-clean/build.xml:142: The following
 error occurred while executing this line:
 /Users/grahamb/ws/hive-svn/hive-trunk-clean/build.xml:89: The following
 error occurred while executing this line:
 /Users/grahamb/ws/hive-svn/hive-trunk-clean/build-common.xml:316: Tests
 failed!

 Total time: 1 minute 28 seconds

 thanks,
 Bill



Re: Should all processors return a DriverResponse?

2009-09-30 Thread Bill Graham
Looking again at the solution to HIVE-795 I see that adding the runCommand
method to Driver
 worked for that class, but deviated from the approach used by the
CommandProcessor interface. Hence the issue you're running into.

Driver got this new method:

public DriverResponse runCommand(String command)

But the CommandProcessor interface, which Driver implements, has this method:

public int run(String command)


Other implementations of CommandProcessor should also return a composite
response object instead of an int. I think the ideal solution would be for
the CommandProcessor interface to instead have either of these methods:

public CommandProcessorResponse runCommand(String command);

or

public CommandProcessorResponse run(String command);

And CommandProcessorResponse had attributes for responseCode and
errorMessage. DriverResponse could extend CommandProcessorResponse (as could
other response types as needed) and have the SQLState attribute.
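
To make that concrete, here's a rough sketch of the shape I have in mind
(class and field names below are only illustrative, nothing here is committed
code):

public class CommandProcessorResponse {
  private final int responseCode;
  private final String errorMessage;

  public CommandProcessorResponse(int responseCode, String errorMessage) {
    this.responseCode = responseCode;
    this.errorMessage = errorMessage;
  }

  public int getResponseCode() { return responseCode; }
  public String getErrorMessage() { return errorMessage; }
}

// DriverResponse would just layer the SQLState on top of the shared fields.
public class DriverResponse extends CommandProcessorResponse {
  private final String sqlState;

  public DriverResponse(int responseCode, String errorMessage, String sqlState) {
    super(responseCode, errorMessage);
    this.sqlState = sqlState;
  }

  public String getSQLState() { return sqlState; }
}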

If we move towards a solution like this, then the question becomes how hard
do we try to maintain backward compatibility with the CommandProcessor
interface? Do we just change the interface (which would be easier and result
in cleaner code) or do we migrate to a new interface and deprecate the old?
I lean towards the former, but would like to hear what others have to say.
Although it's a public method, I'd expect that there probably aren't many
implementations outside of the Hive code base that are written against
CommandProcessor, and the fact that we're at a 0.0.x version should imply
that internal interfaces might change from release to release. Other
thoughts?

thanks,
Bill


On Tue, Sep 29, 2009 at 7:42 PM, Edward Capriolo edlinuxg...@gmail.comwrote:

 All,

 I am looking to integrate HWI with
 https://issues.apache.org/jira/browse/HIVE-795
 Should all Processors return a DriverResponse? The web interface
 allows a list of commands, as the CLI would take.

 I was storing this in List<Int>

 I was looking to change this to List<DriverResult>

 I also have to extend the class...

  class DriverResponseWrapper extends DriverResponse {
public DriverResponseWrapper (int x){
  super(x);
}
  }

 Should DriverResponse and CommandResponse exist maybe as subclasses of
 Response?

 Edward



Re: Should all processors return a DriverResponse?

2009-09-30 Thread Bill Graham
Edward, your approach looks good, but I have a few comments.

- Since we agree that we don't need to worry about backward compatibility,
then why have both methods as part of the CommandProcessor interface? Seems
we should get rid of the method that returns an int. If we decide to keep
it, then we're saying that backward compatibility does matter to us. In that
case we should keep it and deprecate it.

- You included SQLState in the response base class, which makes sense. If
that's the case though, do we need to create DriverResponse and
CommandResponse subclasses? If the subclasses need to have add'l methods,
then yes we do. But if they don't then I don't see the need. We could always
subclass later if the need arises. Maybe we keep DriverResponse as a subclass
just for backward compatibility, but otherwise we don't need to?

thanks,
Bill

On Wed, Sep 30, 2009 at 1:45 PM, Edward Capriolo edlinuxg...@gmail.comwrote:

 On Wed, Sep 30, 2009 at 4:32 PM, Bill Graham billgra...@gmail.com wrote:
  Looking again at the solution to HIVE-795 I see that adding the
 runCommand
  method to Driver
   worked for that class, but deviated from the approach used by the
  CommandProcessor interface. Hence the issue you're running into.
 
  Driver got this new method:
 
  public DriverResponse runCommand(String command)
 
  But the CommandProcessor interface, which Driver implements has this
 method:
 
  public int run(String command)
 
 
  Other implementations CommandProcessor should also return a composite
  response object instead of an int. I think the ideal solution would be if
  the CommandProcessor interface instead had a method either of these:
 
  public CommandProcessorResponse runCommand(String command);
 
  or
 
  public CommandProcessorResponse run(String command);
 
  And CommandProcessorResponse had attributes for responseCode and
  errorMessage. DriverResponse could extend CommandProcessorResponse (as
 could
  other response types as needed) and have the SQLState attribute.
 
  If we move towards a solution like this, then the question becomes how
 hard
  to we try to maintain backward compatibility with the CommandProcessor
  interface? Do we just change the interface (which would be easier and
 result
  in cleaner code) or do we migrate to a new interface and deprecate the
 old?
  I lean towards the former, but would like to hear what others have to
 say.
  Although it's a public method, I'd expect that there probably aren't many
  implementations outside of the Hive code base that are written against
  CommandProcessor, and the fact that we're at a 0.0.x version should imply
  that internal interfaces might change from release to release. Other
  thoughts?
 
  thanks,
  Bill
 
 
  On Tue, Sep 29, 2009 at 7:42 PM, Edward Capriolo edlinuxg...@gmail.com
  wrote:
 
  All,
 
  I am looking to integrate HWI with
  https://issues.apache.org/jira/browse/HIVE-795
  Should all Processors return a DriverResponse? the web interface
  allows a list of commands as the CLI would take.
 
  I was storing this in ListInt
 
  I was looking to change this to ListDriverResult
 
  I also have to extend the class...
 
   class DriverResponseWrapper extends DriverResponse {
 public DriverResponseWrapper (int x){
   super(x);
 }
   }
 
  Should DriverReponse and CommandResponse exist maybe as a subclass of
  Response.
 
  Edward
 
 

 Thanks Bill,

 I wanted to open a JIRA, but it seems to have been having issues the past
 two days. I agree that there are few/no implementations of CommandProcessor
 outside of the CLI. I do not think we need to support backwards
 compatibility for the change, but doing it like DriverResponse is logical.

 Have the old method be a chained call to the new method

 I have a slight variation of your suggestion but essentially the same idea.

  Driver.java
  int run (String command)
  DriverResponse runCommand(command)


  We should do the same with CommandProcessor
  CommandProcessor.java
  int run (String command)
  CommandResponse runCommand(command)
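
 To spell out the chained call, a quick sketch (the class name here is only
 illustrative and assumes the new CommandResponse type below):

  public abstract class CommandProcessorBase {
    // old signature kept as a thin wrapper over the new one
    public int run(String command) {
      return runCommand(command).getResponseCode();
    }

    // new signature that concrete processors implement
    public abstract CommandResponse runCommand(String command);
  }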

 The refactoring would involve creating an abstract base class
 Response. All the private members would move to the base class and
 become protected.

 public abstract class Response {
   protected int responseCode;
   protected String errorMessage;
   protected String SQLState;

   public Response(int responseCode) {
     this(responseCode, null, null);
   }

   public Response(int responseCode, String errorMessage, String SQLState) {
     this.responseCode = responseCode;
     this.errorMessage = errorMessage;
     this.SQLState = SQLState;
   }

   public int getResponseCode() { return responseCode; }
   public String getErrorMessage() { return errorMessage; }
   public String getSQLState() { return SQLState; }
 }


 public class DriverResponse extends Response{

public DriverResponse(int responseCode) {
  this(responseCode, null, null);
}

public DriverResponse(int responseCode, String errorMessage,
 String

Re: Problems using hive JDBC driver on Windows (with Squirrel SQL Client)

2009-09-16 Thread Bill Graham
Additional help could certainly be used on the JDBC driver. I agree that
getting the JDBC support full-featured is a good place to focus your energy.
Guys, correct me if I'm wrong, but the approach has generally been to pick a
SQL client desktop app and implement the JDBC methods needed to support it.
Raghu did that with Pentaho and I did the same for Squirrel.

I don't normally use Squirrel, but I implemented against it since it's open
source, it gives great visibility in the logs/admin window into which JDBC
methods are being called that are not yet implemented, and it's fairly
forgiving. (I'm a fan of DBVisualizer, but it doesn't give very much useful
info when something in the driver isn't implemented, plus it tends to
fail-fast when an optional metadata method fails.)

Regarding what's currently being worked on in the driver, I'm working on
HIVE-795 to get better error messaging and SQLStates passed from the Hive
server to the JDBC driver. Others can comment on what else is active, but
I'd recommend searching for open JIRAs where Component/s is 'Client'. This
issue has also become somewhat of an umbrella bug for JDBC, and contains a
number of TODOs:

https://issues.apache.org/jira/browse/HIVE-576

HTH.

thanks,
Bill


On Wed, Sep 16, 2009 at 8:36 PM, Vijay tec...@gmail.com wrote:

 This may be something i'd be interested in working on. My idea in general
 would be to make the jdbc driver as thin and full featured as reasonable.
 I'm convinced this is the best way to integrate hive with many of the
 excellent tools available out there.

 Although I could get the jdbc driver to work with a couple of these tools
 I'm having trouble getting it to work with many others. Especially on Windows
 it is slightly more painful. Is there some work in progress along these
 lines? Any thoughts or pointers?

 Thanks in advance,
 Vijay

 On Sep 15, 2009 10:59 PM, Prasad Chakka pcha...@facebook.com wrote:

  SerDe may be needed to deserialize the result from Hive server. But I
 also thought the results are in delimited form (LazySimpleSerDe or
 MetadataTypedColumnSetSerDe or something like that) so it is possible to
 decouple hadoop jar but will take some work.


 --
 *From: *Bill Graham billgra...@gmail.com
 *Reply-To: *hive-user@hadoop.apache.org, billgra...@gmail.com
 *Date: *Tue, 15 Sep 2009 22:54:50 -0700
 *To: *Vijay tec...@gmail.com
 *Cc: *hive-user@hadoop.apache.org
 *Subject: *Re: Problems using hive JDBC driver on Windows (with Squirrel
 SQL  Client)

 Good question. The JDBC package relies on the hive serde2 package, which
 has at least the followin...

   Thanks a lot Bill! That was my problem. I thought my code base was
 pretty recent. Apparently not...




Re: Problems using hive JDBC driver on Windows (with Squirrel SQL Client)

2009-09-15 Thread Bill Graham
Hi Vijay,

Check your classpath to make sure you've got the correct hive-jdbc.jar
version built using either the trunk or the current 0.4 branch.
HiveStatement.java:390 used to throw 'java.sql.SQLException: Method not
supported' before HIVE-679 was committed. In the current code base after the
commit, the setMaxRows method is on lines 422-425.
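
For reference, a bare-bones connection through the JDBC driver looks roughly
like this (the host and port below are just placeholders for wherever your
Hive server is running):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbcCheck {
  public static void main(String[] args) throws Exception {
    // Register the Hive driver, then connect to a running Hive server.
    Class.forName("org.apache.hadoop.hive.jdbc.HiveDriver");
    Connection con =
        DriverManager.getConnection("jdbc:hive://localhost:10000/default", "", "");
    Statement st = con.createStatement();
    ResultSet rs = st.executeQuery("show tables");
    while (rs.next()) {
      System.out.println(rs.getString(1));
    }
    con.close();
  }
}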

thanks,
Bill

On Tue, Sep 15, 2009 at 2:13 PM, Vijay tec...@gmail.com wrote:

 I'm having trouble getting the hive jdbc driver to work on Windows. I'm
 following the Squirrel SQL Client setup from the Hive/HiveJDBCInterface wiki
 page. Everything works great when Squirrel SQL Client is running on Linux
 but on Windows it doesn't. Squirrel can connect to the hive server fine so
 the setup is alright. However, when I issue a query, it doesn't seem to
 execute at all. I see this exception in the Squirrel client:

 2009-09-15 14:10:35,268 [Thread-5] ERROR
 net.sourceforge.squirrel_sql.client.session.SQLExecuterTask  - Can't Set
 MaxRows
 java.sql.SQLException: Method not supported
 at
 org.apache.hadoop.hive.jdbc.HiveStatement.setMaxRows(HiveStatement.java:390)
 at
 net.sourceforge.squirrel_sql.client.session.SQLExecuterTask.setMaxRows(SQLExecuterTask.java:318)
 at
 net.sourceforge.squirrel_sql.client.session.SQLExecuterTask.run(SQLExecuterTask.java:157)
 at
 net.sourceforge.squirrel_sql.fw.util.TaskExecuter.run(TaskExecuter.java:82)
 at java.lang.Thread.run(Thread.java:619)

 I don't seem to get this exception on Linux. I can't get the Squirrel
 client to not set max rows but I'm not entirely sure that's the real
 problem.

 Any help is appreciated.
 Vijay



Re: which thrift reversion do you use ?

2009-09-11 Thread Bill Graham
+1

I've been struggling with thrift versions as well, see:
https://issues.apache.org/jira/browse/HIVE-795?focusedCommentId=12754020page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12754020

Any insight into which version of thrift the Hive trunk is using would be
helpful.



On Fri, Sep 11, 2009 at 1:21 AM, Min Zhou coderp...@gmail.com wrote:

 Hi all,

 we've tried the newest one from trunk and r760184; neither of them produces
 the same code as the hive trunk.
 Which thrift revision do you use?


 Thanks,
 Min
 --
 My research interests are distributed systems, parallel computing and
 bytecode based virtual machine.

 My profile:
 http://www.linkedin.com/in/coderplay
 My blog:
 http://coderplay.javaeye.com



Re: Adding jar files when running hive in hwi mode or hiveserver mode

2009-08-26 Thread Bill Graham
+1 for the HWI - HiveServer approach.

Building out rich APIs in the HiveServer (thrift currently, and possible
REST at some point), would allow the HiveServer to focus on the functional
API. The HWI (and others) could then focus on rich UI functionality. The two
would have a clean decoupling, which would reduce complexity of the
codebases and help abide by the KISS principle.



On Wed, Aug 26, 2009 at 2:42 PM, Edward Capriolo edlinuxg...@gmail.comwrote:

 On Wed, Aug 26, 2009 at 3:25 PM, Raghu Murthyrmur...@facebook.com wrote:
  Even if we decided to have multiple HiveServers, wouldn't it be possible
 for
  HWI to randomly pick a HiveServer to connect to per query/client?
 
  On 8/26/09 12:16 PM, Ashish Thusoo athu...@facebook.com wrote:
 
  +1 for ajaxing this baby.
 
  On the broader question of whether we should combine HWI and HiveServer
 - I
  think there are definite deployment and code reuse advantages in doing
 so,
  however keeping them separate also has the advantage that we can cluster
  HiveServers independently from HWI. Since the HiveServer sits in the
 data
  path, the independent scaling may have advantages. I am not sure how
 strong of
  an argument that is to not put them together. Simplicity obviously
 indicates
  that we should have them together.
 
  Thoughts?
 
  Ashish
 
  -Original Message-
  From: Edward Capriolo [mailto:edlinuxg...@gmail.com]
  Sent: Wednesday, August 26, 2009 9:45 AM
  To: hive-user@hadoop.apache.org
  Subject: Re: Adding jar files when running hive in hwi mode or
 hiveserver mode
 
  On Tue, Aug 25, 2009 at 8:13 PM, Vijaytec...@gmail.com wrote:
  Yep, I got it and now it works perfectly! I like hwi btw! It
  definitely makes things easier for a wider audience to try out hive.
  Your new session result bucket idea is very nice as well. I will keep
  trying more things and see if anything else comes up but so far it
 looks
  great!
  Thanks Edward!
 
  On Tue, Aug 25, 2009 at 7:25 AM, Edward Capriolo
  edlinuxg...@gmail.com
  wrote:
 
  On Tue, Aug 25, 2009 at 10:18 AM, Edward
  Caprioloedlinuxg...@gmail.com
  wrote:
  On Mon, Aug 24, 2009 at 10:13 PM, Vijaytec...@gmail.com wrote:
  Probably spoke too soon :) I added this comment to the JIRA ticket
  above.
 
  Hi, I tried the latest patch on trunk and there seems to be a
 problem.
 
  I was interested in using the add jar  command to add jar files
  to the path. However, by the time the command flows through the
  SessionState to the AddResourceProcessor (in
 
  ./ql/src/java/org/apache/hadoop/hive/ql/processors/AddResourceProc
  essor.java), the command word add is not being stripped so the
  resource processor is trying to find a ResourceType of ADD.
 
  I'm not sure if this was an existing bug or was a result of the
  current set of changes.
 
  Vijay added a comment - 24/Aug/09 07:12 PM Hi, I tried the latest
  patch on trunk and there seems to be a problem. I was interested
  in using the add jar  command to add jar files to the path.
  However, by the time the command flows through the SessionState to
  the AddResourceProcessor (in
 
  ./ql/src/java/org/apache/hadoop/hive/ql/processors/AddResourceProc
  essor.java), the command word add is not being stripped so the
  resource processor is trying to find a ResourceType of ADD. I'm
  not sure if this was an existing bug or was a result of the
  current set of changes.
  On Mon, Aug 24, 2009 at 5:30 PM, Vijay tec...@gmail.com wrote:
 
  That's awesome and looks like exactly what I needed. Local file
  system requirement is perfectly ok for now. I will check it out
 right
  away!
  Hopefully it will be checked in soon.
 
  Thanks Edward!
 
  On Mon, Aug 24, 2009 at 5:14 PM, Edward Capriolo
  edlinuxg...@gmail.com
  wrote:
 
  On Mon, Aug 24, 2009 at 8:09 PM, Prasad
  Chakkapcha...@facebook.com
  wrote:
  Vijay, there is no solution for it yet. There may be a jira
  open but AFAIK, no one is working on it. You are welcome to
  contribute this feature.
 
  Prasad
 
 
  
  From: Vijay tec...@gmail.com
  Reply-To: hive-user@hadoop.apache.org
  Date: Mon, 24 Aug 2009 16:59:28 -0700
  To: hive-user@hadoop.apache.org
  Subject: Re: Adding jar files when running hive in hwi mode or
  hiveserver mode
 
  Hi, is there any solution for this? How does everybody include
  custom jar files running hive in a non-cli mode?
 
  Thanks in advance,
  Vijay
 
  On Sat, Aug 22, 2009 at 6:19 PM, Vijay tec...@gmail.com wrote:
 
  When I run hive in cli mode, I add the hive_contrib.jar file
  using this
  command:
 
  hive add jar lib/hive_contrib.jar
 
  Is there a way to do this automatically when running hive in
  hwi or hiveserver modes? Or do I have to add the jar file
  explicitly to any of the startup scripts?
 
 
 
 
  Vijay,
 
  Currently HWI does not support this. The changes in
  https://issues.apache.org/jira/browse/HIVE-716 will make this
  possible (although I did not test but it should work as the cli
  does). The 

Re: Adding jar files when running hive in hwi mode or hiveserver mode

2009-08-26 Thread Bill Graham
The JDBC driver is now able to integrate with some SQL desktop tools for
basic querying, FYI. That still requires the user to know SQL, but at least
it doesn't require working on the command line. The SQuirrel SQL client has
been tested with the current JDBC release:

http://wiki.apache.org/hadoop/Hive/HiveJDBCInterface#head-98f2bc43312161b56e267773267546c080f4fb27

There's also ODBC driver work being done that has been tested with
MicroStrategy, but it only supports Linux currently:

http://wiki.apache.org/hadoop/Hive/HiveODBC

On Wed, Aug 26, 2009 at 3:40 PM, Vijay tec...@gmail.com wrote:

 Having played around with hive cli/hiveserver/hwi for a few weeks I think I
 understand the various pieces better now. Can people provide some real world
 scenarios where they use the different modes?

 As far as UI and more importantly making hive accessible to users that are
 not super familiar with SQL goes, it seems to me like hive JDBC might be the
 best option since that way hive can be relatively seamlessly integrated with
 many sophisticated reporting tools. I haven't explored that option much yet.

 Hive cli was good enough for me to play around with the framework and I can
 keep using it for real work. However, having a simple GUI like hwi is better
 for many reasons but I don't think it can ever be a replacement for all the
 available reporting tools.

 So, I guess I'm kind of conflicted at this point :) My ultimate goal is to
 put the power of hadoop and hive into the hands of non-technical (business)
 users. At first I thought I could probably build a simple UI (which is a
 bunch of php files really) using the php thrift API but that API did not
 seem suited for short lived web applications without some sort of
 sophisticated session management.

 Any thoughts/ideas are greatly appreciated.


 On Wed, Aug 26, 2009 at 2:50 PM, Bill Graham billgra...@gmail.com wrote:

 +1 for the HWI - HiveServer approach.

 Building out rich APIs in the HiveServer (thrift currently, and possible
 REST at some point), would allow the HiveServer to focus on the functional
 API. The HWI (and others) could then focus on rich UI functionality. The two
 would have a clean decoupling, which would reduce complexity of the
 codebases and help abid by the KISS principle.




 On Wed, Aug 26, 2009 at 2:42 PM, Edward Capriolo 
 edlinuxg...@gmail.comwrote:

 On Wed, Aug 26, 2009 at 3:25 PM, Raghu Murthyrmur...@facebook.com
 wrote:
  Even if we decided to have multiple HiveServers, wouldn't it be
 possible for
  HWI to randomly pick a HiveServer to connect to per query/client?
 
  On 8/26/09 12:16 PM, Ashish Thusoo athu...@facebook.com wrote:
 
  +1 for ajaxing this baby.
 
  On the broader question of whether we should combine HWI and
 HiveServer - I
  think there are definite deployment and code reuse advantages in doing
 so,
  however keeping them separate also has the advantage that we can
 cluster
  HiveServers independently from HWI. Since the HiveServer sits in the
 data
  path, the independent scaling may have advantages. I am not sure how
 strong of
  an argument that is to not put them together. Simplicity obviously
 indicates
  that we should have them together.
 
  Thoughts?
 
  Ashish
 
  -Original Message-
  From: Edward Capriolo [mailto:edlinuxg...@gmail.com]
  Sent: Wednesday, August 26, 2009 9:45 AM
  To: hive-user@hadoop.apache.org
  Subject: Re: Adding jar files when running hive in hwi mode or
 hiveserver mode
 
  On Tue, Aug 25, 2009 at 8:13 PM, Vijaytec...@gmail.com wrote:
  Yep, I got it and now it works perfectly! I like hwi btw! It
  definitely makes things easier for a wider audience to try out hive.
  Your new session result bucket idea is very nice as well. I will keep
  trying more things and see if anything else comes up but so far it
 looks
  great!
  Thanks Edward!
 
  On Tue, Aug 25, 2009 at 7:25 AM, Edward Capriolo
  edlinuxg...@gmail.com
  wrote:
 
  On Tue, Aug 25, 2009 at 10:18 AM, Edward
  Caprioloedlinuxg...@gmail.com
  wrote:
  On Mon, Aug 24, 2009 at 10:13 PM, Vijaytec...@gmail.com wrote:
  Probably spoke too soon :) I added this comment to the JIRA ticket
  above.
 
  Hi, I tried the latest patch on trunk and there seems to be a
 problem.
 
  I was interested in using the add jar  command to add jar files
  to the path. However, by the time the command flows through the
  SessionState to the AddResourceProcessor (in
 
  ./ql/src/java/org/apache/hadoop/hive/ql/processors/AddResourceProc
  essor.java), the command word add is not being stripped so the
  resource processor is trying to find a ResourceType of ADD.
 
  I'm not sure if this was an existing bug or was a result of the
  current set of changes.
 
  Vijay added a comment - 24/Aug/09 07:12 PM Hi, I tried the latest
  patch on trunk and there seems to be a problem. I was interested
  in using the add jar  command to add jar files to the path.
  However, by the time the command flows through the SessionState

Re: Adding jar files when running hive in hwi mode or hiveserver mode

2009-08-26 Thread Bill Graham
How does HWI go about caching query results for others to view? Are the
results durable given a bounce of HWI or are they held in memory?

We have a process where we build daily summaries from Hive queries that get
emailed. Instead I'd like a way to persist/cache the query results on a
server and build a custom AJAXy web UI to expose them. Just wondering if HWI
could help with this...


On Wed, Aug 26, 2009 at 5:46 PM, Edward Capriolo edlinuxg...@gmail.comwrote:

 On Wed, Aug 26, 2009 at 7:31 PM, Bill Grahambillgra...@gmail.com wrote:
  The JDBC driver now is now able to integrate with some SQL desktop tools
 for
  basic querying FYI. That still requires the user to know SQL, but at
 least
  it doesn't require working on the command line. The SQuirrel SQL client
 has
  been tested with the current JDBC release:
 
 
 http://wiki.apache.org/hadoop/Hive/HiveJDBCInterface#head-98f2bc43312161b56e267773267546c080f4fb27
 
  There's also ODBC driver work being done that has been tested with
  MicroStrategy, but it only supports Linux currently:
 
  http://wiki.apache.org/hadoop/Hive/HiveODBC
 
  On Wed, Aug 26, 2009 at 3:40 PM, Vijay tec...@gmail.com wrote:
 
  Having played around with hive cli/hiveserver/hwi for a few weeks I
 think
  I understand the various pieces better now. Can people provide some real
  world scenarios where they use the different modes?
 
  As far as UI and more importantly making hive accessible to users that
 are
  not super familiar with SQL goes, it seems to me like hive JDBC might be
 the
  best option since that way hive can be relatively seamlessly integrated
 with
  many sophisticated reporting tools. I haven't explored that option much
 yet.
 
  Hive cli was good enough for me to play around with the framework and I
  can keep using it for real work. However, having a simple GUI like hwi
 is
  better for many reasons but I don't think it can ever be a replacement
 for
  all the available reporting tools.
 
  So, I guess I'm kind of conflicted at this point :) My ultimate goal is
 to
  put the power of hadoop and hive into the hands of non-technical
 (business)
  users. At first I thought I could probably build a simple UI (which is a
  bunch of php files really) using the php thrift API but that API did not
  seem suited for short lived web applications without some sort of
  sophisticated session management.
 
  Any thoughts/ideas are greatly appreciated.
 
  On Wed, Aug 26, 2009 at 2:50 PM, Bill Graham billgra...@gmail.com
 wrote:
 
  +1 for the HWI - HiveServer approach.
 
  Building out rich APIs in the HiveServer (thrift currently, and
 possible
  REST at some point), would allow the HiveServer to focus on the
 functional
  API. The HWI (and others) could then focus on rich UI functionality.
 The two
  would have a clean decoupling, which would reduce complexity of the
  codebases and help abid by the KISS principle.
 
 
 
  On Wed, Aug 26, 2009 at 2:42 PM, Edward Capriolo 
 edlinuxg...@gmail.com
  wrote:
 
  On Wed, Aug 26, 2009 at 3:25 PM, Raghu Murthyrmur...@facebook.com
  wrote:
   Even if we decided to have multiple HiveServers, wouldn't it be
   possible for
   HWI to randomly pick a HiveServer to connect to per query/client?
  
   On 8/26/09 12:16 PM, Ashish Thusoo athu...@facebook.com wrote:
  
   +1 for ajaxing this baby.
  
   On the broader question of whether we should combine HWI and
   HiveServer - I
   think there are definite deployment and code reuse advantages in
   doing so,
   however keeping them separate also has the advantage that we can
   cluster
   HiveServers independently from HWI. Since the HiveServer sits in
 the
   data
   path, the independent scaling may have advantages. I am not sure
 how
   strong of
   an argument that is to not put them together. Simplicity obviously
   indicates
   that we should have them together.
  
   Thoughts?
  
   Ashish
  
   -Original Message-
   From: Edward Capriolo [mailto:edlinuxg...@gmail.com]
   Sent: Wednesday, August 26, 2009 9:45 AM
   To: hive-user@hadoop.apache.org
   Subject: Re: Adding jar files when running hive in hwi mode or
   hiveserver mode
  
   On Tue, Aug 25, 2009 at 8:13 PM, Vijaytec...@gmail.com wrote:
   Yep, I got it and now it works perfectly! I like hwi btw! It
   definitely makes things easier for a wider audience to try out
 hive.
   Your new session result bucket idea is very nice as well. I will
   keep
   trying more things and see if anything else comes up but so far it
   looks
   great!
   Thanks Edward!
  
   On Tue, Aug 25, 2009 at 7:25 AM, Edward Capriolo
   edlinuxg...@gmail.com
   wrote:
  
   On Tue, Aug 25, 2009 at 10:18 AM, Edward
   Caprioloedlinuxg...@gmail.com
   wrote:
   On Mon, Aug 24, 2009 at 10:13 PM, Vijaytec...@gmail.com
 wrote:
   Probably spoke too soon :) I added this comment to the JIRA
   ticket
   above.
  
   Hi, I tried the latest patch on trunk and there seems to be a
   problem.
  
   I was interested in using the add jar  command to add jar

Errors creating MySQL metastore

2009-08-05 Thread Bill Graham
Hi,

I'm trying to set up a MySQL metastore for Hive and I'm getting the
exceptions shown below. If anyone could shed some insight as to why this is
happening, it would be greatly appreciated.

My hive-sites.xml is also attached below. This is how it looks when I start
the Hive client with an empty db. The schema gets created, but the errors
attached below appear. After the db is created I change
datanucleus.autoCreateSchema to false before restarting the client. The same
errors appear whenever I restart the client and run show tables, which
takes about 30 seconds to complete.

Any ideas how to fix this? I've experimented with many combinations of
datanucleus.autoCreateColumns and datanucleus.identifier.case, but nothing
makes a difference.

<property>
  <name>hive.metastore.local</name>
  <value>true</value>
</property>

<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://x:11000/hive</value>
</property>

<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>com.mysql.jdbc.Driver</value>
</property>

<property>
  <name>javax.jdo.option.ConnectionUserName</name>
  <value></value>
</property>

<property>
  <name>javax.jdo.option.ConnectionPassword</name>
  <value></value>
</property>

<property>
  <name>datanucleus.autoCreateSchema</name>
  <value>true</value>
</property>
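
(For clarity, the datanucleus.autoCreateColumns and datanucleus.identifier.case
experiments mentioned above were just additional property blocks of the same
form, along the lines of the following; the value shown is only an example.)

<property>
  <name>datanucleus.autoCreateColumns</name>
  <value>true</value>
</property>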


2009-08-05 10:05:46,543 ERROR Datastore.Schema (Log4JLogger.java:error(115))
- Failed to validate SchemaTable for Schema . Either it doesnt exist, or
doesnt validate : Required columns missing from table NUCLEUS_TABLES :
`TABLE_NAME`, VERSION, CLASS_NAME, INTERFACE_NAME, OWNER, `TYPE`. Perhaps
your MetaData is incorrect, or you havent enabled
datanucleus.autoCreateColumns.
Required columns missing from table NUCLEUS_TABLES : `TABLE_NAME`,
VERSION, CLASS_NAME, INTERFACE_NAME, OWNER, `TYPE`. Perhaps your MetaData is
incorrect, or you havent enabled datanucleus.autoCreateColumns.
org.datanucleus.store.rdbms.exceptions.MissingColumnException: Required
columns missing from table NUCLEUS_TABLES : `TABLE_NAME`, VERSION,
CLASS_NAME, INTERFACE_NAME, OWNER, `TYPE`. Perhaps your MetaData is
incorrect, or you havent enabled datanucleus.autoCreateColumns.
at
org.datanucleus.store.rdbms.table.TableImpl.validateColumns(TableImpl.java:280)
at
org.datanucleus.store.rdbms.table.TableImpl.validate(TableImpl.java:173)
at
org.datanucleus.store.rdbms.SchemaAutoStarter.init(SchemaAutoStarter.java:101)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native
Method)
at
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
at
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
at
org.datanucleus.plugin.NonManagedPluginRegistry.createExecutableExtension(NonManagedPluginRegistry.java:576)
at
org.datanucleus.plugin.PluginManager.createExecutableExtension(PluginManager.java:300)
at
org.datanucleus.store.AbstractStoreManager.initialiseAutoStart(AbstractStoreManager.java:486)
at
org.datanucleus.store.rdbms.RDBMSManager.initialiseSchema(RDBMSManager.java:821)
at
org.datanucleus.store.rdbms.RDBMSManager.init(RDBMSManager.java:394)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native
Method)
at
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
at
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
at
org.datanucleus.plugin.NonManagedPluginRegistry.createExecutableExtension(NonManagedPluginRegistry.java:576)
at
org.datanucleus.plugin.PluginManager.createExecutableExtension(PluginManager.java:300)
at
org.datanucleus.store.FederationManager.initialiseStoreManager(FederationManager.java:106)
at
org.datanucleus.store.FederationManager.init(FederationManager.java:68)
at
org.datanucleus.ObjectManagerFactoryImpl.initialiseStoreManager(ObjectManagerFactoryImpl.java:152)
at
org.datanucleus.jdo.JDOPersistenceManagerFactory.freezeConfiguration(JDOPersistenceManagerFactory.java:529)
at
org.datanucleus.jdo.JDOPersistenceManagerFactory.getPersistenceManagerFactory(JDOPersistenceManagerFactory.java:175)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at javax.jdo.JDOHelper$16.run(JDOHelper.java:1956)
at java.security.AccessController.doPrivileged(Native Method)
at javax.jdo.JDOHelper.invoke(JDOHelper.java:1951)
at

Re: Errors creating MySQL metastore

2009-08-05 Thread Bill Graham
Yes, I'm using the latest hive-default.xml. What I'm showing is just the
contents of my hive-sites.xml.

The NUCLEUS_TABLES table and all of its columns listed in the exception exist
in the DB, which is what's puzzling me.

On Wed, Aug 5, 2009 at 10:51 AM, Prasad Chakka pcha...@facebook.com wrote:

  Are you using the latest hive-default.xml? It should contain more
 datanucleus properties than below. It is looking for a table called
 ‘NUCLEUS_TABLES’ which contains list of tables that got created when the
 original schema was created.

 Prasad

 --
 *From: *Bill Graham billgra...@gmail.com
 *Reply-To: *hive-user@hadoop.apache.org, billgra...@gmail.com
 *Date: *Wed, 5 Aug 2009 10:19:24 -0700
 *To: *hive-user@hadoop.apache.org
 *Subject: *Errors creating MySQL metastore


 Hi,

 I'm trying to set up a MySQL metastore for Hive and I'm getting the
 exceptions shown below. If anyone could shed some insight as to why this is
 happening, it would be greatly appreciated.

 My hive-sites.xml is also attached below. This is how it looks when I start
 the Hive client with an empty db. The schema gets created, but the errors
 attached below appear. After the db is created I change
 datanucleus.autoCreateSchema to false before restarting the client. The same
 errors appear whenever I restart the client and run show tables, which
 takes about 30 seconds to complete.

 Any ideas how to fix this? I've experimented with many combinations of
 datanucleus.autoCreateColumns and datanucleus.identifier.case, but nothing
 makes a difference.

 <property>
   <name>hive.metastore.local</name>
   <value>true</value>
 </property>

 <property>
   <name>javax.jdo.option.ConnectionURL</name>
   <value>jdbc:mysql://x:11000/hive</value>
 </property>

 <property>
   <name>javax.jdo.option.ConnectionDriverName</name>
   <value>com.mysql.jdbc.Driver</value>
 </property>

 <property>
   <name>javax.jdo.option.ConnectionUserName</name>
   <value></value>
 </property>

 <property>
   <name>javax.jdo.option.ConnectionPassword</name>
   <value></value>
 </property>

 <property>
   <name>datanucleus.autoCreateSchema</name>
   <value>true</value>
 </property>


 2009-08-05 10:05:46,543 ERROR Datastore.Schema
 (Log4JLogger.java:error(115)) - Failed to validate SchemaTable for Schema
 . Either it doesnt exist, or doesnt validate : Required columns missing
 from table NUCLEUS_TABLES : `TABLE_NAME`, VERSION, CLASS_NAME,
 INTERFACE_NAME, OWNER, `TYPE`. Perhaps your MetaData is incorrect, or you
 havent enabled datanucleus.autoCreateColumns.
 Required columns missing from table NUCLEUS_TABLES : `TABLE_NAME`,
 VERSION, CLASS_NAME, INTERFACE_NAME, OWNER, `TYPE`. Perhaps your MetaData is
 incorrect, or you havent enabled datanucleus.autoCreateColumns.
 org.datanucleus.store.rdbms.exceptions.MissingColumnException: Required
 columns missing from table NUCLEUS_TABLES : `TABLE_NAME`, VERSION,
 CLASS_NAME, INTERFACE_NAME, OWNER, `TYPE`. Perhaps your MetaData is
 incorrect, or you havent enabled datanucleus.autoCreateColumns.
 at
 org.datanucleus.store.rdbms.table.TableImpl.validateColumns(TableImpl.java:280)
 at
 org.datanucleus.store.rdbms.table.TableImpl.validate(TableImpl.java:173)
 at
 org.datanucleus.store.rdbms.SchemaAutoStarter.init(SchemaAutoStarter.java:101)
 at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native
 Method)
 at
 sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
 at
 sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
 at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
 at
 org.datanucleus.plugin.NonManagedPluginRegistry.createExecutableExtension(NonManagedPluginRegistry.java:576)
 at
 org.datanucleus.plugin.PluginManager.createExecutableExtension(PluginManager.java:300)
 at
 org.datanucleus.store.AbstractStoreManager.initialiseAutoStart(AbstractStoreManager.java:486)
 at
 org.datanucleus.store.rdbms.RDBMSManager.initialiseSchema(RDBMSManager.java:821)
 at
 org.datanucleus.store.rdbms.RDBMSManager.init(RDBMSManager.java:394)
 at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native
 Method)
 at
 sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
 at
 sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
 at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
 at
 org.datanucleus.plugin.NonManagedPluginRegistry.createExecutableExtension(NonManagedPluginRegistry.java:576)
 at
 org.datanucleus.plugin.PluginManager.createExecutableExtension(PluginManager.java:300)
 at
 org.datanucleus.store.FederationManager.initialiseStoreManager(FederationManager.java:106)
 at
 org.datanucleus.store.FederationManager.init(FederationManager.java:68

Re: JDBC: Infinite while(rs.next()) loop

2009-08-04 Thread Bill Graham
I would love to see nightly/periodic builds published somewhere, especially
if it's going to be some time before Hive 0.4 is released.

It seems like people new to Hive get the "check out and build from the
trunk" or "apply this patch" answer often on this list. Having nightly
builds would make life easier on these folks as well.


On Tue, Aug 4, 2009 at 10:33 AM, Edward Capriolo edlinuxg...@gmail.comwrote:

 On Tue, Aug 4, 2009 at 1:20 PM, Saurabh Nandasaurabhna...@gmail.com
 wrote:
  Is there any possibility of having a nightly build off the trunk,
  before hive 0.4 is officially released?
 
  On 8/4/09, Edward Capriolo edlinuxg...@gmail.com wrote:
  On Tue, Aug 4, 2009 at 10:43 AM, Bill Grahambillgra...@gmail.com
 wrote:
  +1
 
  I agree. I do not know the answer to that one. Can anyone comment on
 the
  future Hive release schedule?
 
 
  On Tue, Aug 4, 2009 at 7:39 AM, Saurabh Nanda saurabhna...@gmail.com
  wrote:
 
  I was dreading this response! Are there any plans to push out a new
 Hive
  build with the latest features  bug fixes? Building from trunk is not
  everyone's cup of tea, you know :-)
 
  Any nightly builds that I can pick up?
 
  Saurabh.
 
  On Tue, Aug 4, 2009 at 8:05 PM, Bill Graham billgra...@gmail.com
 wrote:
 
  This bug has been fixed on the trunk. Check out the hive trunk and
 build
  the JDBC driver and you should be fine.
 
  On Tue, Aug 4, 2009 at 12:47 AM, Saurabh Nanda 
 saurabhna...@gmail.com
  wrote:
 
  Here's what I'm trying:
 
  ResultSet rs = st.executeQuery("show tables");
   while (rs.next()) {
     System.out.println(rs.getString(1));
   }
 
  The while loop never terminates, after going through the list of
  tables,
  it keeps printing the last table name over  over again. Am I doing
  something wrong over here, or have I hit a bug? I'm on
  hive-0.3.0-hadoop-0.18.0-bin
 
  Saurabh.
  --
  http://nandz.blogspot.com
  http://foodieforlife.blogspot.com
 
 
 
 
  --
  http://nandz.blogspot.com
  http://foodieforlife.blogspot.com
 
 
 
  Hive 0.4 will be a release candidate soon. The largest major blocker
  that I know of is dealing with Hadoop 0.20. See:
 
  https://issues.apache.org/jira/browse/HIVE-487
 
  Soon after that there should be a release candidate, then voting.
 
  Edward
 
 
 
  --
  http://nandz.blogspot.com
  http://foodieforlife.blogspot.com
 

 The major thing on that is we have to build releases for every Hadoop
  major/minor version and possibly one off the trunk. I was thinking of doing
  something similar on my site, since accomplishing this is possible with
  Hudson. Does anyone think adding this to Hadoop Hive is something we
  should do?



Re: partitions not being created

2009-07-31 Thread Bill Graham
I just completely removed all of my Hive tables and folders in HDFS, as
well as metadata_db. I then re-built Hive from the latest from the trunk.
After replacing my Hive server with the contents of build/dist, and doing
the same for my client, I created new tables from scratch and again tried to
migrate from ApiUsageTemp -- ApiUsage. I got the same get_partition
failed: unknown result error.

I decided to skip the table migration and just load data directly into a
partitioned table. That also gives the same error. Below is what I tried.
Any ideas?

hive CREATE TABLE ApiUsage
 (user STRING, restResource STRING, statusCode INT, requestDate
STRING, requestHour INT, numRequests STRING, responseTime STRING,
numSlowRequests STRING)
 PARTITIONED BY (dt STRING)
 ROW FORMAT DELIMITED FIELDS TERMINATED BY ' ';
OK
Time taken: 0.27 seconds

hive describe extended
ApiUsage;

OK
userstring
restresourcestring
statuscode  int
requestdate string
requesthour int
numrequests string
responsetimestring
numslowrequests string
dt  string

Detailed Table Information  Table(tableName:apiusage, dbName:default,
owner:grahamb, createTime:1249073147, lastAccessTime:0, retention:0,
sd:StorageDescriptor(cols:[FieldSchema(name:user, type:string,
comment:null), FieldSchema(name:restresource, type:string, comment:null),
FieldSchema(name:statuscode, type:int, comment:null),
FieldSchema(name:requestdate, type:string, comment:null),
FieldSchema(name:requesthour, type:int, comment:null),
FieldSchema(name:numrequests, type:string, comment:null),
FieldSchema(name:responsetime, type:string, comment:null),
FieldSchema(name:numslowrequests, type:string, comment:null)],
location:hdfs://xxx:9000/user/hive/warehouse/apiusage,
inputFormat:org.apache.hadoop.mapred.TextInputFormat,
outputFormat:org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat,
compressed:false, numBuckets:-1, serdeInfo:SerDeInfo(name:null,
serializationLib:org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe,
parameters:{field.delim= , serialization.format= }), bucketCols:[],
sortCols:[], parameters:{}), partitionKeys:[FieldSchema(name:dt,
type:string, comment:null)], parameters:{})
Time taken: 0.276 seconds

hive LOAD DATA INPATH sample_data/apilogs/summary-small/2009/05/18 INTO
TABLE ApiUsage PARTITION (dt = 20090518 );
Loading data to table apiusage partition {dt=20090518}
Failed with exception org.apache.thrift.TApplicationException: get_partition
failed: unknown result
FAILED: Execution Error, return code 1 from
org.apache.hadoop.hive.ql.exec.MoveTask

On Thu, Jul 30, 2009 at 1:24 PM, Prasad Chakka pcha...@facebook.com wrote:

  This is not backward compatibility issue. Check HIVE-592 for details.
 Before this patch, a rename doesn’t change the name of the hdfs directory
 and if you create a new table with the old name of the renamed  table then
 both tables will be pointing to the same directory thus causing problems.
 HIVE-592 fixes this to rename directories correctly. So if you have created
 all tables after HIVE-592 patch went in, you should be fine.


 --
 *From: *Bill Graham billgra...@gmail.com
 *Reply-To: *billgra...@gmail.com
 *Date: *Thu, 30 Jul 2009 13:09:03 -0700
 *To: *Prasad Chakka pcha...@facebook.com
 *Cc: *hive-user@hadoop.apache.org
 *Subject: *Re: partitions not being created

 I sent my last try reply before seeing your last email.

 Thanks, that seems possible. I did initially create ApiUsageTemp using the
 most recent Hive release. Then while working on a JIRA I updated my Hive
 client and server to the more recent builds from the trunk.

 If that could cause such a problem, this is troubling though, since it
 implies that we can't upgrade Hive without possibly corrupting our metadata
 store.

 I'll try again from scratch though and see if it works, thanks.


 On Thu, Jul 30, 2009 at 1:04 PM, Bill Graham billgra...@gmail.com wrote:

 Prasad,

 My setup is Hive client - Hive Server (with local metastore) - Hadoop. I
 was also suspecting metastore issues, so I've tried multiple times with
 newly created destination tables and I see the same thing happening.

 All of the log info I've been able to find I've included already in this
 thread. Let me know if there's anywhere else I could look for clues.

 I've included from the client:
 - /tmp/$USER/hive.log

 And from the hive server:
 - Stdout/err logs

 - /tmp/$USER/hive_job_log*.txt

 Is there anything else I should be looking at? All of the M/R logs don't
 show any exceptions or anything suspect.

 Thanks for your time and insights on this issue, I appreciate it.

 thanks,
 Bill


 On Thu, Jul 30, 2009 at 11:57 AM, Prasad Chakka pcha...@facebook.com
 wrote:

 Bill,

 The real error is happening on the Hive Metastore Server or Hive Server
  (depending on the setup you are using). Error logs on it must have
 different stack trace. From the information below I am guessing that the way
 the destination table hdfs

Re: partitions not being created

2009-07-30 Thread Bill Graham
I'm trying to set a string to a string and I'm seeing this error. I also had
an attempt where it was a string to an int, and I also saw the same error.

The /tmp/$USER/hive_job_log*.txt file doesn't contain any exceptions, but
I've included its output below. Only the Hive server logs show the
exceptions listed above. (Note that the table I'm loading from in this log
output is ApiUsageSmall, which is identical to ApiUsageTemp. For some reason
the data from ApiUsageTemp is now gone.)

QueryStart QUERY_STRING=INSERT OVERWRITE TABLE ApiUsage PARTITION (dt =
20090518) SELECT `(requestDate)?+.+` FROM ApiUsageSmall WHERE requestDate
= '2009/05/18' QUERY_ID=app_20090730104242 TIME=1248975752235
TaskStart TASK_NAME=org.apache.hadoop.hive.ql.exec.ExecDriver
TASK_ID=Stage-1 QUERY_ID=app_20090730104242 TIME=1248975752235
TaskProgress TASK_HADOOP_PROGRESS=2009-07-30 10:42:34,783 map = 0%,  reduce
=0% TASK_NAME=org.apache.hadoop.hive.ql.exec.ExecDriver
TASK_COUNTERS=Job Counters .Launched map tasks:1,Job Counters .Data-local
map tasks:1 TASK_ID=Stage-1 QUERY_ID=app_20090730104242
TASK_HADOOP_ID=job_200906301559_0409 TIME=1248975754785
TaskProgress ROWS_INSERTED=apiusage~296 TASK_HADOOP_PROGRESS=2009-07-30
10:42:43,031 map = 40%,  reduce =0%
TASK_NAME=org.apache.hadoop.hive.ql.exec.ExecDriver TASK_COUNTERS=File
Systems.HDFS bytes read:23019,File Systems.HDFS bytes written:19178,Job
Counters .Rack-local map tasks:2,Job Counters .Launched map tasks:5,Job
Counters .Data-local map
tasks:3,org.apache.hadoop.hive.ql.exec.FilterOperator$Counter.PASSED:592,org.apache.hadoop.hive.ql.exec.FilterOperator$Counter.FILTERED:6,org.apache.hadoop.hive.ql.exec.FileSinkOperator$TableIdEnum.TABLE_ID_1_ROWCOUNT:296,org.apache.hadoop.hive.ql.exec.MapOperator$Counter.DESERIALIZE_ERRORS:0,Map-Reduce
Framework.Map input records:302,Map-Reduce Framework.Map input
bytes:23019,Map-Reduce Framework.Map output records:0 TASK_ID=Stage-1
QUERY_ID=app_20090730104242 TASK_HADOOP_ID=job_200906301559_0409
TIME=1248975763033
TaskProgress ROWS_INSERTED=apiusage~1471 TASK_HADOOP_PROGRESS=2009-07-30
10:42:44,068 map = 100%,  reduce =100%
TASK_NAME=org.apache.hadoop.hive.ql.exec.ExecDriver TASK_COUNTERS=File
Systems.HDFS bytes read:114068,File Systems.HDFS bytes written:95275,Job
Counters .Rack-local map tasks:2,Job Counters .Launched map tasks:5,Job
Counters .Data-local map
tasks:3,org.apache.hadoop.hive.ql.exec.FilterOperator$Counter.PASSED:2942,org.apache.hadoop.hive.ql.exec.FilterOperator$Counter.FILTERED:27,org.apache.hadoop.hive.ql.exec.FileSinkOperator$TableIdEnum.TABLE_ID_1_ROWCOUNT:1471,org.apache.hadoop.hive.ql.exec.MapOperator$Counter.DESERIALIZE_ERRORS:0,Map-Reduce
Framework.Map input records:1498,Map-Reduce Framework.Map input
bytes:114068,Map-Reduce Framework.Map output records:0 TASK_ID=Stage-1
QUERY_ID=app_20090730104242 TASK_HADOOP_ID=job_200906301559_0409
TIME=1248975764071
TaskEnd ROWS_INSERTED=apiusage~1471 TASK_RET_CODE=0
TASK_HADOOP_PROGRESS=2009-07-30 10:42:44,068 map = 100%,  reduce =100%
TASK_NAME=org.apache.hadoop.hive.ql.exec.ExecDriver TASK_COUNTERS=File
Systems.HDFS bytes read:114068,File Systems.HDFS bytes written:95275,Job
Counters .Rack-local map tasks:2,Job Counters .Launched map tasks:5,Job
Counters .Data-local map
tasks:3,org.apache.hadoop.hive.ql.exec.FilterOperator$Counter.PASSED:2942,org.apache.hadoop.hive.ql.exec.FilterOperator$Counter.FILTERED:27,org.apache.hadoop.hive.ql.exec.FileSinkOperator$TableIdEnum.TABLE_ID_1_ROWCOUNT:1471,org.apache.hadoop.hive.ql.exec.MapOperator$Counter.DESERIALIZE_ERRORS:0,Map-Reduce
Framework.Map input records:1498,Map-Reduce Framework.Map input
bytes:114068,Map-Reduce Framework.Map output records:0 TASK_ID=Stage-1
QUERY_ID=app_20090730104242 TASK_HADOOP_ID=job_200906301559_0409
TIME=1248975764199
TaskStart TASK_NAME=org.apache.hadoop.hive.ql.exec.ConditionalTask
TASK_ID=Stage-4 QUERY_ID=app_20090730104242 TIME=1248975764199
TaskEnd TASK_RET_CODE=0
TASK_NAME=org.apache.hadoop.hive.ql.exec.ConditionalTask TASK_ID=Stage-4
QUERY_ID=app_20090730104242 TIME=1248975782277
TaskStart TASK_NAME=org.apache.hadoop.hive.ql.exec.MoveTask
TASK_ID=Stage-0 QUERY_ID=app_20090730104242 TIME=1248975782277
TaskEnd TASK_RET_CODE=1
TASK_NAME=org.apache.hadoop.hive.ql.exec.MoveTask TASK_ID=Stage-0
QUERY_ID=app_20090730104242 TIME=1248975782473
QueryEnd ROWS_INSERTED=apiusage~1471 QUERY_STRING=INSERT OVERWRITE TABLE
ApiUsage PARTITION (dt = 20090518) SELECT `(requestDate)?+.+` FROM
ApiUsageSmall WHERE requestDate = '2009/05/18'
QUERY_ID=app_20090730104242 QUERY_NUM_TASKS=2 TIME=1248975782474



On Thu, Jul 30, 2009 at 10:09 AM, Prasad Chakka pcha...@facebook.comwrote:

  Are you sure you are getting the same error even with the schema below
 (i.e. trying to set a string to an int column?). Can you give the full stack
 trace that you might see in /tmp/$USER/hive.log?


 --
 *From: *Bill Graham billgra...@gmail.com
 *Reply-To: *hive-user@hadoop.apache.org, billgra...@gmail.com
 *Date: *Thu

HiveServer and client user accounts

2009-07-30 Thread Bill Graham
Hi,

I've found that if I access Hive via the HiveServer (using either the Hive
shell or the JDBC client), tables are created as the user who is running the
Hive server, not the user who is executing the commands. I understand why
this happens, but it doesn't seem like the expected behavior to me.

If I were to run a Hive Server on the NN as user 'hive' I'd expect other
users to be able to connect to it and add/remove tables as themselves, which
isn't the case currently. If from the command line I use hadoop to add files
to HDFS (as user 'bill') and then use hive to create a table (as user
'hive'), I can't put my data into it, due to ownership conflicts.

It brings two questions to mind:

1. Is this a bug? If so I'll create a JIRA.

2. How are other people dealing with this issue in production?*


* Granted, the HiveServer currently doesn't support more than one client at a
time, which is a completely separate issue that I'm curious about w.r.t.
production use. Is the answer that the Hive Server just isn't production
ready? If that is the case, then how are people using Hive in a multi-user
environment? Does each client just connect directly to a central
metastore db?


thanks,
Bill

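A minimal sketch of the JDBC-client scenario described above: it assumes the
Thrift-based HiveServer of that era listening on port 10000 and the
org.apache.hadoop.hive.jdbc.HiveDriver class; the driver class, connection URL,
and table name are illustrative assumptions rather than details taken from this
thread. Whichever OS user launched the HiveServer process ends up owning the new
table's warehouse directory in HDFS, regardless of who runs the client.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class HiveServerOwnershipSketch {
    public static void main(String[] args) throws Exception {
        // Hypothetical driver class and URL for the old Thrift HiveServer; adjust for your build.
        Class.forName("org.apache.hadoop.hive.jdbc.HiveDriver");
        Connection con = DriverManager.getConnection(
                "jdbc:hive://localhost:10000/default", "", "");
        Statement stmt = con.createStatement();
        // The resulting HDFS directory is owned by the user that started the
        // HiveServer process, not by the user running this client.
        stmt.executeQuery("CREATE TABLE ownership_demo (a STRING, b INT)");
        con.close();
    }
}

Running the same sketch as two different OS users still produces directories
owned by the HiveServer user, which is the ownership conflict described above.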

Re: partitions not being created

2009-07-30 Thread Bill Graham
That file contains a similar error as the Hive Server logs:

2009-07-30 11:44:21,095 WARN  mapred.JobClient
(JobClient.java:configureCommandLineOptions(510)) - Use GenericOptionsParser
for parsing the arguments. Applications should implement Tool for the same.
2009-07-30 11:44:48,070 WARN  mapred.JobClient
(JobClient.java:configureCommandLineOptions(510)) - Use GenericOptionsParser
for parsing the arguments. Applications should implement Tool for the same.
2009-07-30 11:45:27,796 ERROR metadata.Hive (Hive.java:getPartition(588)) -
org.apache.thrift.TApplicationException: get_partition failed: unknown
result
at
org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.recv_get_partition(ThriftHiveMetastore.java:784)
at
org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.get_partition(ThriftHiveMetastore.java:752)
at
org.apache.hadoop.hive.metastore.HiveMetaStoreClient.getPartition(HiveMetaStoreClient.java:415)
at
org.apache.hadoop.hive.ql.metadata.Hive.getPartition(Hive.java:579)
at
org.apache.hadoop.hive.ql.metadata.Hive.loadPartition(Hive.java:466)
at
org.apache.hadoop.hive.ql.exec.MoveTask.execute(MoveTask.java:135)
at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:335)
at org.apache.hadoop.hive.ql.Driver.run(Driver.java:241)
at
org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:122)
at
org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:165)
at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:258)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.util.RunJar.main(RunJar.java:155)
at org.apache.hadoop.mapred.JobShell.run(JobShell.java:54)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
at org.apache.hadoop.mapred.JobShell.main(JobShell.java:68)

2009-07-30 11:45:27,797 ERROR exec.MoveTask
(SessionState.java:printError(279)) - Failed with exception
org.apache.thrift.TApplicationException: get_partition failed: unknown
result
org.apache.hadoop.hive.ql.metadata.HiveException:
org.apache.thrift.TApplicationException: get_partition failed: unknown
result
at
org.apache.hadoop.hive.ql.metadata.Hive.getPartition(Hive.java:589)
at
org.apache.hadoop.hive.ql.metadata.Hive.loadPartition(Hive.java:466)
at
org.apache.hadoop.hive.ql.exec.MoveTask.execute(MoveTask.java:135)
at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:335)
at org.apache.hadoop.hive.ql.Driver.run(Driver.java:241)
at
org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:122)
at
org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:165)
at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:258)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.util.RunJar.main(RunJar.java:155)
at org.apache.hadoop.mapred.JobShell.run(JobShell.java:54)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
at org.apache.hadoop.mapred.JobShell.main(JobShell.java:68)
Caused by: org.apache.thrift.TApplicationException: get_partition failed:
unknown result
at
org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.recv_get_partition(ThriftHiveMetastore.java:784)
at
org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.get_partition(ThriftHiveMetastore.java:752)
at
org.apache.hadoop.hive.metastore.HiveMetaStoreClient.getPartition(HiveMetaStoreClient.java:415)
at
org.apache.hadoop.hive.ql.metadata.Hive.getPartition(Hive.java:579)
... 16 more

2009-07-30 11:45:27,798 ERROR ql.Driver (SessionState.java:printError(279))
- FAILED: Execution Error, return code 1 from
org.apache.hadoop.hive.ql.exec.MoveTask

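The GenericOptionsParser warning repeated in the log above is harmless here; it
only means the submitting class does not implement Hadoop's Tool interface. A
minimal sketch of what the warning asks for, with a made-up class name and the
job setup elided:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class MyHadoopTool extends Configured implements Tool {
    public int run(String[] args) throws Exception {
        // ToolRunner has already folded generic options (-D, -fs, -jt, ...) into getConf().
        Configuration conf = getConf();
        // Job configuration and submission would go here.
        return 0;
    }

    public static void main(String[] args) throws Exception {
        System.exit(ToolRunner.run(new Configuration(), new MyHadoopTool(), args));
    }
}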
On Thu, Jul 30, 2009 at 11:33 AM, Prasad Chakka pcha...@facebook.com wrote:


 The hive logs go into /tmp/$USER/hive.log not hive_job_log*.txt.

 --
 *From: *Bill Graham billgra...@gmail.com
 *Reply-To: *billgra...@gmail.com
 *Date: *Thu, 30 Jul 2009 10:52:06 -0700
 *To: *Prasad Chakka pcha...@facebook.com
 *Cc: *hive-user@hadoop.apache.org, Zheng Shao zsh...@gmail.com
 *Subject: *Re: partitions not being created

 I'm trying to set a string to a string and I'm seeing

Re: partitions not being created

2009-07-30 Thread Bill Graham
I sent my last reply before seeing your last email.

Thanks, that seems possible. I did initially create ApiUsageTemp using the
most recent Hive release. Then while working on a JIRA I updated my Hive
client and server to the more recent builds from the trunk.

If that could cause such a problem, this is troubling though, since it
implies that we can't upgrade Hive without possibly corrupting our metadata
store.

I'll try again from scratch though and see if it works, thanks.


On Thu, Jul 30, 2009 at 1:04 PM, Bill Graham billgra...@gmail.com wrote:

 Prasad,

 My setup is Hive client -> Hive Server (with local metastore) -> Hadoop. I
 was also suspecting metastore issues, so I've tried multiple times with
 newly created destination tables and I see the same thing happening.

 All of the log info I've been able to find I've included already in this
 thread. Let me know if there's anywhere else I could look for clues.

 I've included from the client:
 - /tmp/$USER/hive.log

 And from the hive server:
 - Stdout/err logs
 - /tmp/$USER/hive_job_log*.txt

 Is there anything else I should be looking at? None of the M/R logs show
 any exceptions or anything suspect.

 Thanks for your time and insights on this issue, I appreciate it.

 thanks,
 Bill


 On Thu, Jul 30, 2009 at 11:57 AM, Prasad Chakka pcha...@facebook.com wrote:

  Bill,

 The real error is happening on the Hive Metastore Server or the Hive Server
 (depending on the setup you are using). The error logs there should have a
 different stack trace. From the information below I am guessing that the
 destination table's HDFS directories got created with some problems. Can you
 drop that table (and make sure that there is no corresponding HDFS directory
 left over for either the integer or the string type partitions you created)
 and retry the query?

 If you don't want to drop the destination table, then send me the logs from
 the Hive Server.

 Prasad

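A rough sketch of the HDFS side of that cleanup, assuming the default warehouse
location /user/hive/warehouse and using apiusage purely as an example table
name (the real path depends on hive.metastore.warehouse.dir); it should only be
run after the DROP TABLE has gone through Hive:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class RemoveStaleTableDir {
    public static void main(String[] args) throws Exception {
        // Assumes the cluster's *-site.xml files are on the classpath.
        FileSystem fs = FileSystem.get(new Configuration());
        // Illustrative path; check the warehouse config for the real one.
        Path tableDir = new Path("/user/hive/warehouse/apiusage");
        if (fs.exists(tableDir)) {
            // Recursive delete also removes any leftover dt=... partition directories.
            fs.delete(tableDir, true);
        }
    }
}

This only clears the HDFS directories; the metastore entries themselves are
removed by the DROP TABLE statement issued through Hive.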

 --
 *From: *Bill Graham billgra...@gmail.com
 *Reply-To: *billgra...@gmail.com
 *Date: *Thu, 30 Jul 2009 11:47:41 -0700
 *To: *Prasad Chakka pcha...@facebook.com
 *Cc: *hive-user@hadoop.apache.org
 *Subject: *Re: partitions not being created

 That file contains a similar error as the Hive Server logs:

 2009-07-30 11:44:21,095 WARN  mapred.JobClient
 (JobClient.java:configureCommandLineOptions(510)) - Use GenericOptionsParser
 for parsing the arguments. Applications should implement Tool for the same.
 2009-07-30 11:44:48,070 WARN  mapred.JobClient
 (JobClient.java:configureCommandLineOptions(510)) - Use GenericOptionsParser
 for parsing the arguments. Applications should implement Tool for the same.
 2009-07-30 11:45:27,796 ERROR metadata.Hive (Hive.java:getPartition(588))
 - org.apache.thrift.TApplicationException: get_partition failed: unknown
 result
 at
 org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.recv_get_partition(ThriftHiveMetastore.java:784)
 at
 org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.get_partition(ThriftHiveMetastore.java:752)
 at
 org.apache.hadoop.hive.metastore.HiveMetaStoreClient.getPartition(HiveMetaStoreClient.java:415)
 at
 org.apache.hadoop.hive.ql.metadata.Hive.getPartition(Hive.java:579)
 at
 org.apache.hadoop.hive.ql.metadata.Hive.loadPartition(Hive.java:466)
 at
 org.apache.hadoop.hive.ql.exec.MoveTask.execute(MoveTask.java:135)
 at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:335)
 at org.apache.hadoop.hive.ql.Driver.run(Driver.java:241)
 at
 org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:122)
 at
 org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:165)
 at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:258)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
 at
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
 at java.lang.reflect.Method.invoke(Method.java:597)
 at org.apache.hadoop.util.RunJar.main(RunJar.java:155)
 at org.apache.hadoop.mapred.JobShell.run(JobShell.java:54)
 at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
 at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
 at org.apache.hadoop.mapred.JobShell.main(JobShell.java:68)

 2009-07-30 11:45:27,797 ERROR exec.MoveTask
 (SessionState.java:printError(279)) - Failed with exception
 org.apache.thrift.TApplicationException: get_partition failed: unknown
 result
 org.apache.hadoop.hive.ql.metadata.HiveException:
 org.apache.thrift.TApplicationException: get_partition failed: unknown
 result
 at
 org.apache.hadoop.hive.ql.metadata.Hive.getPartition(Hive.java:589)
 at
 org.apache.hadoop.hive.ql.metadata.Hive.loadPartition(Hive.java:466)
 at
 org.apache.hadoop.hive.ql.exec.MoveTask.execute

partitions not being created

2009-07-28 Thread Bill Graham
Hi,

I'm trying to create a partitioned table and the partition is not appearing
for some reason. Am I doing something wrong, or is this a bug? Below are the
commands I'm executing with their output. Note that the 'show partitions'
command is not returning anything. If I were to try to load data into this
table I'd get a 'get_partition failed' error. I'm using bleeding-edge Hive,
built from the trunk.

hive> create table partTable (a string, b int) partitioned by (dt int);
OK
Time taken: 0.308 seconds
hive> show partitions partTable;
OK
Time taken: 0.329 seconds
hive> describe partTable;
OK
a   string
b   int
dt  int
Time taken: 0.181 seconds

thanks,
Bill


Re: partitions not being created

2009-07-28 Thread Bill Graham
at
org.apache.hadoop.hive.ql.metadata.Hive.getPartition(Hive.java:589)
at
org.apache.hadoop.hive.ql.metadata.Hive.loadPartition(Hive.java:466)
at
org.apache.hadoop.hive.ql.exec.MoveTask.execute(MoveTask.java:135)
at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:335)
at org.apache.hadoop.hive.ql.Driver.run(Driver.java:241)
at
org.apache.hadoop.hive.service.HiveServer$HiveServerHandler.execute(HiveServer.java:105)
at
org.apache.hadoop.hive.service.ThriftHive$Processor$execute.process(ThriftHive.java:264)
at
org.apache.hadoop.hive.service.ThriftHive$Processor.process(ThriftHive.java:252)
at
org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:252)
at
java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:619)
Caused by: org.apache.thrift.TApplicationException: get_partition failed:
unknown result
at
org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.recv_get_partition(ThriftHiveMetastore.java:784)
at
org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.get_partition(ThriftHiveMetastore.java:752)
at
org.apache.hadoop.hive.metastore.HiveMetaStoreClient.getPartition(HiveMetaStoreClient.java:415)
at
org.apache.hadoop.hive.ql.metadata.Hive.getPartition(Hive.java:579)
... 11 more

FAILED: Execution Error, return code 1 from
org.apache.hadoop.hive.ql.exec.MoveTask
09/07/28 18:06:15 ERROR ql.Driver: FAILED: Execution Error, return code 1
from org.apache.hadoop.hive.ql.exec.MoveTask




On Tue, Jul 28, 2009 at 5:57 PM, Namit Jain nj...@facebook.com wrote:

  There are no partitions in the table –



 Can you post the output you get while loading the data ?



 *From:* Bill Graham [mailto:billgra...@gmail.com]
 *Sent:* Tuesday, July 28, 2009 5:54 PM
 *To:* hive-user@hadoop.apache.org
 *Subject:* partitions not being created



 Hi,

 I'm trying to create a partitioned table and the partition is not appearing
 for some reason. Am I doing something wrong, or is this a bug? Below are the
 commands I'm executing with their output. Note that the 'show partitions'
 command is not returning anything. If I were to try to load data into this
 table I'd get a 'get_partition failed' error. I'm using bleeding-edge Hive,
 built from the trunk.

 hive> create table partTable (a string, b int) partitioned by (dt int);
 OK
 Time taken: 0.308 seconds
 hive> show partitions partTable;
 OK
 Time taken: 0.329 seconds
 hive> describe partTable;
 OK
 a   string
 b   int
 dt  int
 Time taken: 0.181 seconds

 thanks,
 Bill



Re: partitions not being created

2009-07-28 Thread Bill Graham
Thanks for the tip, but it fails in the same way when I use a string.

On Tue, Jul 28, 2009 at 6:21 PM, David Lerman dler...@videoegg.com wrote:

  hive> create table partTable (a string, b int) partitioned by (dt int);

  INSERT OVERWRITE TABLE ApiUsage PARTITION (dt = 20090518)
  SELECT `(requestDate)?+.+` FROM ApiUsageTemp WHERE requestDate =
 '2009/05/18'

 The table has an int partition column (dt), but you're trying to set a
 string value (dt = 20090518).

 Try :

 create table partTable (a string, b int) partitioned by (dt string);

 and then do your insert.




Re: Apply a patch to hadoop

2009-07-27 Thread Bill Graham
http://wiki.apache.org/hadoop/HowToContribute

Search for "Applying a patch" and you'll find this:

patch -p0 < cool_patch.patch

On Mon, Jul 27, 2009 at 2:33 PM, Gopal Gandhi gopal.gandhi2...@yahoo.com wrote:

 I am going to apply a patch to hadoop (version 18.3). I searched online
 but could not find a step-by-step how-to manual. Would any hadoop guru tell
 me how to apply a patch, say HADOOP-.patch, to hadoop? Thanks a lot.





Re: Block not found

2009-07-02 Thread Bill Graham
I ran into the same issue when using the default settings for dfs.data.dir,
which is under the /tmp directory. Files in this directory will be cleaned
out periodically as needed by the OS, which will break HDFS.

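A quick way to check whether a cluster is in this situation is to print where
the data directories actually resolve. This is only a sketch: it assumes
hadoop-core and the cluster's configuration files are on the classpath, the
class name is made up, and because some Hadoop versions don't load
hdfs-default.xml into a plain Configuration, it falls back to the shipped
defaults explicitly.

import org.apache.hadoop.conf.Configuration;

public class DataDirCheck {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // The shipped default for dfs.data.dir is ${hadoop.tmp.dir}/dfs/data,
        // and hadoop.tmp.dir defaults to /tmp/hadoop-${user.name}.
        String tmpDir = conf.get("hadoop.tmp.dir",
                "/tmp/hadoop-" + System.getProperty("user.name"));
        String dataDirs = conf.get("dfs.data.dir", tmpDir + "/dfs/data");
        System.out.println("dfs.data.dir resolves to: " + dataDirs);
    }
}

If this resolves to something under /tmp, moving the data directories to
persistent disks avoids the OS cleanup problem described above.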
On Thu, Jul 2, 2009 at 7:01 AM, Gross, Danny danny.gr...@spansion.com wrote:

 Hello Johnson,

 I have observed similar error messages when my system ran out of disk
 space on an HDFS node, or in hadoop.tmp.dir.

 Hope it helps.

 Best regards,

 Danny

 -Original Message-
 From: Johnson Chen [mailto:dong...@gmail.com]
 Sent: Thursday, July 02, 2009 3:47 AM
 To: core-u...@hadoop.apache.org
 Subject: Block not found

 Hi ,

 My hadoop program stops at 66% in the Reduce phase

 09/07/02 16:41:37 INFO mapred.JobClient:  map 0% reduce 0%
 09/07/02 16:41:48 INFO mapred.JobClient:  map 50% reduce 0%
 09/07/02 16:41:51 INFO mapred.JobClient:  map 100% reduce 0%
 09/07/02 16:41:58 INFO mapred.JobClient:  map 100% reduce 66%

 And I found that a lot of error messages pop up in the namenode log.

 Here are the error messages:

 2009-07-02 16:43:44,461 INFO
 org.apache.hadoop.hdfs.server.namenode.FSNamesystem:
 commitBlockSynchronization(lastblock=blk_-7800669485846603924_144754,
 newgenerationstamp=0, newlength=0, newtargets=[], closeFile=false,
 deleteBlock=true)
 2009-07-02 16:43:44,461 INFO org.apache.hadoop.ipc.Server: IPC Server
 handler 3 on 50040, call
 commitBlockSynchronization(blk_-7800669485846603924_144754, 0, 0, false,
 true, [Lorg.apache.hadoop.hdfs.protocol.DatanodeID;@2df2888) from
 192.168.151.231:39976: error: java.io.IOException: Block
 (=blk_-7800669485846603924_144754) not found
 java.io.IOException: Block (=blk_-7800669485846603924_144754) not found
at
 org.apache.hadoop.hdfs.server.namenode.FSNamesystem.commitBlockSynchroni
 zation(FSNamesystem.java:1906)
at
 org.apache.hadoop.hdfs.server.namenode.NameNode.commitBlockSynchronizati
 on(NameNode.java:410)
at sun.reflect.GeneratedMethodAccessor3.invoke(Unknown Source)
at
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessor
 Impl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:481)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:894)
 2009-07-02 16:43:44,967 INFO
 org.apache.hadoop.hdfs.server.namenode.FSNamesystem:
 commitBlockSynchronization(lastblock=blk_-8293900342000823254_748370,
 newgenerationstamp=0, newlength=0, newtargets=[], closeFile=false,
 deleteBlock=true)
 2009-07-02 16:43:44,967 INFO org.apache.hadoop.ipc.Server: IPC Server
 handler 0 on 50040, call
 commitBlockSynchronization(blk_-8293900342000823254_748370, 0, 0, false,
 true, [Lorg.apache.hadoop.hdfs.protocol.DatanodeID;@8ddfa31) from
 192.168.151.232:45283: error: java.io.IOException: Block
 (=blk_-8293900342000823254_748370) not found
 java.io.IOException: Block (=blk_-8293900342000823254_748370) not found
at
 org.apache.hadoop.hdfs.server.namenode.FSNamesystem.commitBlockSynchroni
 zation(FSNamesystem.java:1906)
at
 org.apache.hadoop.hdfs.server.namenode.NameNode.commitBlockSynchronizati
 on(NameNode.java:410)
at sun.reflect.GeneratedMethodAccessor3.invoke(Unknown Source)
at
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessor
 Impl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:481)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:894)


 Can anyone help me ? Thanks.

 --
 Best wishes,
 Johnson Chen



Re: Pig ClassCastException trying to cast to org.apache.pig.data.DataBag

2009-06-25 Thread Bill Graham
That's the strange thing, though: I have the entire logic of my UDF wrapped
in a try/catch, and nothing is thrown there. This exception seems to be
thrown elsewhere.

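One way to narrow this down is to have the UDF verify its argument types before
casting, so a stray DataByteArray is reported with some context instead of
surfacing as a ClassCastException deep in the reduce pipeline. The sketch below
is only illustrative and is not the actual PEARSON implementation; the class
name, the null handling, and the omitted correlation math are all assumptions.

import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.DataBag;
import org.apache.pig.data.Tuple;

public class SafePearson extends EvalFunc<Double> {
    public Double exec(Tuple input) throws IOException {
        if (input == null || input.size() < 2) {
            return null;
        }
        Object left = input.get(0);
        Object right = input.get(1);
        // Fail with a descriptive message instead of a bare ClassCastException.
        if (!(left instanceof DataBag) || !(right instanceof DataBag)) {
            throw new IOException("Expected two bags but got "
                    + (left == null ? "null" : left.getClass().getName()) + " and "
                    + (right == null ? "null" : right.getClass().getName()));
        }
        DataBag a = (DataBag) left;
        DataBag b = (DataBag) right;
        // ... the Pearson correlation over the two bags would go here ...
        return 0.0;
    }
}

In Pig of this vintage, fields that never get a schema applied tend to travel
as DataByteArray, so it may also be worth double-checking that the AS clauses
in the LOAD statements really match the stored data.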
2009/6/26 zjffdu zjf...@gmail.com

 I think this may be caused by your UDF. You can add a try/catch in your UDF
 and log more context information.



 -Original Message-
 From: Bill Graham [mailto:billgra...@gmail.com]
 Sent: June 25, 2009 9:28
 To: pig-user@hadoop.apache.org
 Subject: Pig ClassCastException trying to cast to
 org.apache.pig.data.DataBag

 Hello Pig fans,

 I've implemented a collaborative filtering job in Pig using CROSS and
 FOREACH with a UDF. It works great until my dataset grows to a certain
 size,
 at which point I start to get Pig ClassCastExceptions in the logs. I know
 that CROSS can be expensive and difficult to scale, but it's strange to me
 that when things fall over, it's due to a Pig ClassCastException. Any
 insights as to why this is happening or how I should go about
 troubleshooting?

 Here's the script:

 userAssets1 = LOAD 'sample_data/userAssets' AS (user:bytearray,
 userAssetRatings: bag {T: tuple(user:bytearray, asset:chararray,
 rating:double)});
 userAssets2 = LOAD 'sample_data/userAssets' AS (user:bytearray,
 userAssetRatings: bag {T: tuple(user:bytearray, asset:chararray,
 rating:double)});
 X = CROSS userAssets1, userAssets2 PARALLEL 20;
 userToUserFilter = FILTER X BY userAssets1::user != userAssets2::user;
 REGISTER pearson.jar;
 dist = FOREACH userToUserFilter GENERATE userAssets1::user,
 userAssets2::user,
  cnwk.grahamb.pig.PEARSON(userAssets1::userAssetRatings,
 userAssets2::userAssetRatings);
 similarUsers = FILTER dist BY ($2 != 0.0);
 STORE similarUsers INTO 'sample_data/userSimilarityPearson';

 Once the number of userAssets values grows to about 28K, the mapper
 succeeds, but the reducer fails after around 60% complete. There are 558K
 input records for the reducer in this case. The exceptions look like this:

  2009-06-24 11:34:47,854 [main] ERROR
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.Launcher -
 Error message from task (reduce)
 task_200906171500_0141_r_12java.lang.ClassCastException:
 org.apache.pig.data.DataByteArray cannot be cast to
 org.apache.pig.data.DataBag
at

 org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperat
 ors.POProject.processInputBag(POProject.java:368)
at

 org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperat
 ors.POProject.getNext(POProject.java:171)
at

 org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperat
 ors.POUserFunc.processInput(POUserFunc.java:129)
at

 org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperat
 ors.POUserFunc.getNext(POUserFunc.java:181)
at

 org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperat
 ors.POUserFunc.getNext(POUserFunc.java:235)
at

 org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperat
 ors.POForEach.processPlan(POForEach.java:262)
at

 org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperat
 ors.POForEach.getNext(POForEach.java:197)
at

 org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator
 .processInput(PhysicalOperator.java:226)
at

 org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperat
 ors.POFilter.getNext(POFilter.java:95)
at

 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Re
 duce.runPipeline(PigMapReduce.java:280)
at

 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Re
 duce.processOnePackageOutput(PigMapReduce.java:247)
at

 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Re
 duce.reduce(PigMapReduce.java:216)
at

 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Re
 duce.reduce(PigMapReduce.java:136)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:318)
at
 org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2210)

 thanks,
 Bill



