Re: Getting job progress in java application
Take a look at the JobClient API. You can use that to get the current progress of a running job.

On Sunday, April 29, 2012, Ondřej Klimpera wrote: Hello, I'd like to ask what the preferred way is of getting a running job's progress from the Java application that executed it. I'm using Hadoop 0.20.203 and tried the job.end.notification.url property, which works well, but as the property name says, it sends only job-end notifications. What I need is to get updates on map() and reduce() progress. Please advise how to do this. Thanks. Ondrej Klimpera
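For reference, a minimal polling loop against the old 0.20 mapred API might look like this (an untested sketch; the job ID string is a placeholder). If your application submitted the job itself via JobClient.submitJob(), you already hold the RunningJob handle and can poll it directly:

    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.JobID;
    import org.apache.hadoop.mapred.RunningJob;

    public class ProgressPoller {
        public static void main(String[] args) throws Exception {
            // JobConf picks up the jobtracker address from the config on the classpath
            JobClient client = new JobClient(new JobConf());
            RunningJob job = client.getJob(JobID.forName("job_201204290000_0001"));
            while (!job.isComplete()) {
                // mapProgress()/reduceProgress() return a fraction between 0 and 1
                System.out.printf("map: %.1f%% reduce: %.1f%%%n",
                        job.mapProgress() * 100, job.reduceProgress() * 100);
                Thread.sleep(5000);
            }
        }
    }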
Re: Feedback on real world production experience with Flume
+1 on Edward's comment. The MapR comment was relevant and informative, and the original poster never said he was only interested in open source options.

On Sunday, April 22, 2012, Michael Segel wrote: Gee Edward, what about putting a link to a company website or your blog in your signature... ;-) Seriously, one could also mention fuse, right? ;-) Sent from my iPhone

On Apr 22, 2012, at 7:15 AM, Edward Capriolo edlinuxg...@gmail.com wrote: I think this is valid to talk about. For example, one does not need a decentralized collector if they can just write logs directly to files in a decentralized file system. In any case it was not even a hard vendor pitch. It was someone describing how they handle centralized logging. It stated facts and it was informative. Let's face it: if fuse-mounting HDFS or directly soft-mounting NFS performed well, many of the use cases for Flume- and Scribe-like tools would be gone (not all, but many). I never knew there was a rule against discussing alternative software on a mailing list. It seems like a closed-minded thing, and I also doubt the ASF would back a rule like that. Are we not allowed to talk about EMR or S3, or am I not even allowed to mention S3? Can Flume run on EC2 and log to S3? (Oops, party foul, I guess I can't ask that.) Edward

On Sun, Apr 22, 2012 at 12:59 AM, Alexander Lorenz wget.n...@googlemail.com wrote: no. That is the Flume open source mailing list. Not a vendor list. NFS logging has nothing to do with decentralized collectors like Flume, JMS or Scribe. sent via my mobile device

On Apr 22, 2012, at 12:23 AM, Edward Capriolo edlinuxg...@gmail.com wrote: It seems pretty relevant. If you can directly log via NFS, that is a viable alternative.

On Sat, Apr 21, 2012 at 11:42 AM, alo alt wget.n...@googlemail.com wrote: We decided NO product and vendor advertising on Apache mailing lists! I do not understand why you'll put that closed source stuff from your employer in the room. It has nothing to do with Flume or the use cases! -- Alexander Lorenz http://mapredit.blogspot.com

On Apr 21, 2012, at 4:06 PM, M. C. Srivas wrote: Karl, since you did ask for alternatives: people using MapR prefer to use the NFS access to directly deposit data (or access it). It works seamlessly from all Linuxes, Solaris, Windows, AIX and a myriad of other legacy systems without having to load any agents on those machines, and it is fully automatic HA. Since compression is built into MapR, the data gets compressed coming in over NFS automatically without much fuss. With regard to performance, one can get about 870 MB/s per node with 10GigE attached (of course, with compression, the effective throughput will surpass that based on how well the data can be squeezed).

On Fri, Apr 20, 2012 at 3:14 PM, Karl Hennig khen...@baynote.com wrote: I am investigating automated methods of moving our data from the web tier into HDFS for processing, a process that's performed periodically. I am looking for feedback from anyone who has actually used Flume in a production setup (redundant, failover) successfully. I understand it is now being largely rearchitected during its incubation as Apache Flume-NG, so I don't have full confidence in the old, stable releases. The other option would be to write our own tools. What methods are you using for these kinds of tasks? Did you write your own, or does Flume (or something else) work for you?
Re: [Blog Post]: Accumulo and Pig play together now
- bcc: u...@nutch.apache.org common-user@hadoop.apache.org

This is great, Jason. One thing to add though is this line in your Pig script: SET mapred.map.tasks.speculative.execution false. Otherwise you're likely to get duplicate writes into Accumulo.

On Fri, Mar 2, 2012 at 5:48 AM, Jason Trost jason.tr...@gmail.com wrote: For anyone interested... Accumulo and Pig play together now: http://www.covert.io/post/18605091231/accumulo-and-pig and https://github.com/jt6211/accumulo-pig --Jason
Re: Writing small files to one big file in hdfs
You might want to check out File Crusher: http://www.jointhegrid.com/hadoop_filecrush/index.jsp I've never used it, but it sounds like it could be helpful.

On Tue, Feb 21, 2012 at 10:25 AM, Bejoy Ks bejoy.had...@gmail.com wrote: Hi Mohit, AFAIK XMLLoader in Pig won't be suited for sequence files. Please post the same to the Pig user group for a workaround. SequenceFile is a preferred option when we want to store small files in HDFS that need to be processed by MapReduce, as it stores data in key/value format. Since SequenceFileInputFormat is available at your disposal, you don't need any custom input formats for processing it with MapReduce. It is a cleaner and better approach compared to just appending small XML file contents into a big file.

On Tue, Feb 21, 2012 at 11:00 PM, Mohit Anchlia mohitanch...@gmail.com wrote: Thanks. I was planning to use Pig's org.apache.pig.piggybank.storage.XMLLoader for processing. Would it work with a sequence file? This text file that I was referring to would be in HDFS itself. Is it still different than using a sequence file?

On Tue, Feb 21, 2012 at 9:25 AM, Bejoy Ks bejoy.had...@gmail.com wrote: Mohit, rather than just appending the content into a normal text file or so, you can create a sequence file with the individual smaller file contents as values. Regards, Bejoy.K.S

On Tue, Feb 21, 2012 at 10:45 PM, Mohit Anchlia mohitanch...@gmail.com wrote: We have small XML files. Currently I am planning to append these small files to one file in HDFS so that I can take advantage of splits, larger blocks and sequential IO. What I am unsure of is whether it's OK to append one file at a time to this HDFS file. Could someone suggest if this is OK? I would like to know how others do it.
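For what it's worth, writing the small files into one SequenceFile keyed by file name could look roughly like this (an untested sketch; the /incoming/xml and /data/xml.seq paths are placeholders):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IOUtils;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;

    public class XmlToSequenceFile {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);
            SequenceFile.Writer writer = SequenceFile.createWriter(
                    fs, conf, new Path("/data/xml.seq"), Text.class, Text.class);
            try {
                for (FileStatus stat : fs.listStatus(new Path("/incoming/xml"))) {
                    byte[] buf = new byte[(int) stat.getLen()];
                    FSDataInputStream in = fs.open(stat.getPath());
                    try {
                        IOUtils.readFully(in, buf, 0, buf.length);
                    } finally {
                        in.close();
                    }
                    // key = file name, value = full file contents
                    writer.append(new Text(stat.getPath().getName()), new Text(buf));
                }
            } finally {
                writer.close();
            }
        }
    }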
Re: How to delete files older than X days in HDFS/Hadoop
If you're able to put your data in directories named by date (e.g. yyyyMMdd), you can take advantage of the fact that the HDFS client returns directories in sort order of the name, which returns the most recent dirs last. You can then cron a bash script that deletes all but the last N directories returned, where N is how many days you want to keep.

On Sat, Nov 26, 2011 at 8:26 PM, Ronnie Dove ron...@oceansync.com wrote: Hello Raimon, I like the idea of being able to search through files on HDFS so that we can find keywords or timestamp criteria, something that OceanSync will be doing in the future as a tool option. The others have given you some great ideas, but I wanted to help you out from a Java API perspective. If you are a Java programmer, you would use FileSystem.listStatus(), which returns the directory listing as a FileStatus[] array. You would crawl through the FileStatus array checking whether each FileStatus is a file or a directory. If it is a file, you check its timestamp using FileStatus.getModificationTime(). If it is a directory, it is processed again in a loop to check the contents of that directory. This sounds tough, but from testing it is fairly fast and accurate. Below are the two APIs that are needed as part of this method: http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/fs/FileStatus.html http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/fs/FileSystem.html Ronnie Dove OceanSync Management Developer http://www.oceansync.com RDove on irc.freenode.net #Hadoop

- Original Message - From: Raimon Bosch raimon.bo...@gmail.com To: common-user@hadoop.apache.org Sent: Saturday, November 26, 2011 10:01 AM Subject: How to delete files older than X days in HDFS/Hadoop Hi, I'm wondering how to delete files older than X days with HDFS/Hadoop. On Linux we can do it with the following command: find ~/datafolder/* -mtime +7 -exec rm {} \; Any ideas?
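Putting the Java-API suggestion together, an untested sketch might look like this (the /logs path is a placeholder; it deletes top-level entries whose modification time is older than N days):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class DeleteOlderThan {
        public static void main(String[] args) throws Exception {
            int days = Integer.parseInt(args[0]);
            long cutoff = System.currentTimeMillis() - days * 24L * 60 * 60 * 1000;
            FileSystem fs = FileSystem.get(new Configuration());
            for (FileStatus stat : fs.listStatus(new Path("/logs"))) {
                if (stat.getModificationTime() < cutoff) {
                    fs.delete(stat.getPath(), true); // true = recursive delete
                }
            }
        }
    }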
Re: Why hadoop should be built on JAVA?
There was a fairly long discussion on this topic at the beginning of the year, FYI: http://search-hadoop.com/m/JvSQe2wNlY11

On Mon, Aug 15, 2011 at 9:00 PM, Chris Song sjh...@gmail.com wrote: Why should Hadoop be built in Java? For integrity and stability, it is good for Hadoop to be implemented in Java. But when it comes to the speed issue, I have a question... How would it be if Hadoop were implemented in C or Python?
Re: Distcp failure - Server returned HTTP response code: 500
Are you able to distcp folders that don't have special characters? What are the versions of the two clusters, and are you running distcp on the destination cluster if there's a version mismatch? If there is a mismatch, you'll need to use hftp: http://hadoop.apache.org/common/docs/current/distcp.html#cpver

On Wed, May 18, 2011 at 12:44 PM, sonia gehlot sonia.geh...@gmail.com wrote: Hi Guys, I am trying to copy Hadoop data from one cluster to another but I keep getting this error: Server returned HTTP response code: 500 for URL. My distcp command is: scripts/hadoop.sh distcp hftp://c13-hadoop1-nn-r0-n1:50070/user/dwadmin/live/data/warehouse/facts/page_events/day=2011-05-17 hdfs://phx1-rb-dev40-pipe1.cnet.com:9000/user/sgehlot In here I have day=2011-05-17 in my file path. I found this online: https://issues.apache.org/jira/browse/HDFS-31 Does this issue still exist? Could this be the reason for my job failure? Job error log:

2011-05-18 11:34:56,505 INFO org.apache.hadoop.metrics.jvm.JvmMetrics: Initializing JVM Metrics with processName=MAP, sessionId=
2011-05-18 11:34:56,713 INFO org.apache.hadoop.mapred.MapTask: numReduceTasks: 0
2011-05-18 11:34:57,039 INFO org.apache.hadoop.tools.DistCp: FAIL day=2011-05-17/_logs/history/c13-hadoop1-nn-r0-n1_1291919715221_job_201012091035_41977_dwadmin_CopyFactsToHive%3A+page_events+day%3D2011-05-17 : java.io.IOException: Server returned HTTP response code: 500 for URL: http://c13-hadoop1-wkr-r10-n4.cnet.com:50075/streamFile?filename=/user/dwadmin/live/data/warehouse/facts/page_events/day=2011-05-17/_logs/history/c13-hadoop1-nn-r0-n1_1291919715221_job_201012091035_41977_dwadmin_CopyFactsToHive%253A+page_events+day%253D2011-05-17ugi=sgehlot,user
at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1313)
at org.apache.hadoop.hdfs.HftpFileSystem.open(HftpFileSystem.java:157)
at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:398)
at org.apache.hadoop.tools.DistCp$CopyFilesMapper.copy(DistCp.java:410)
at org.apache.hadoop.tools.DistCp$CopyFilesMapper.map(DistCp.java:537)
at org.apache.hadoop.tools.DistCp$CopyFilesMapper.map(DistCp.java:306)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
at org.apache.hadoop.mapred.Child.main(Child.java:170)
2011-05-18 11:35:06,118 WARN org.apache.hadoop.mapred.TaskTracker: Error running child java.io.IOException: Copied: 0 Skipped: 5 Failed: 1
at org.apache.hadoop.tools.DistCp$CopyFilesMapper.close(DistCp.java:572)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:57)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
at org.apache.hadoop.mapred.Child.main(Child.java:170)
2011-05-18 11:35:06,124 INFO org.apache.hadoop.mapred.TaskRunner: Runnning cleanup for the task

Any help is appreciated. Thanks, Sonia
Re: Including Additional Jars
If you could share more specifics regarding just how it's not working (i.e., job specifics, stack traces, how you're invoking it, etc.), you might get more assistance in troubleshooting.

On Wed, Apr 6, 2011 at 1:44 AM, Shuja Rehman shujamug...@gmail.com wrote: -libjars is not working, nor is the distributed cache. Any other solution?

On Mon, Apr 4, 2011 at 11:40 PM, James Seigel ja...@tynt.com wrote: James' quick and dirty, get-your-job-running guideline: -libjars -- for jars you want accessible by the mappers and reducers; classpath or bundled in the main jar -- for jars you want accessible to the runner. Cheers, James.

On 2011-04-04, at 12:31 PM, Shuja Rehman wrote: Well... I think putting them in the distributed cache is a good idea. Do you have any working example of how to put extra jars in the distributed cache and how to make these jars available to the job? Thanks

On Mon, Apr 4, 2011 at 10:20 PM, Mark Kerzner markkerz...@gmail.com wrote: I think you can put them either in your jar or in the distributed cache. As Allen pointed out, my idea of putting them into the Hadoop lib directory was wrong. Mark

On Mon, Apr 4, 2011 at 12:16 PM, Marco Didonna m.didonn...@gmail.com wrote: On 04/04/2011 07:06 PM, Allen Wittenauer wrote: On Apr 4, 2011, at 8:06 AM, Shuja Rehman wrote: Hi All, I have created a MapReduce job, and to run it on the cluster I have bundled all jars (Hadoop, HBase, etc.) into a single jar, which increases the size of the overall file. During the development process I need to copy this complete file again and again, which is very time consuming, so is there any way that I can copy only the program jar and not the lib files every time? I am using NetBeans to develop the program. Kindly let me know how to solve this issue.

This was in the FAQ, but in a non-obvious place. I've updated it to be more visible (hopefully): http://wiki.apache.org/hadoop/FAQ#How_do_I_submit_extra_content_.28jars.2C_static_files.2C_etc.29_for_my_job_to_use_during_runtime.3F

Does the same apply to a jar containing libraries? Let's suppose I need lucene-core.jar to run my project. Can I put this jar into my job jar and have Hadoop see Lucene's classes? Or should I use the distributed cache? MD

-- Regards, Shuja-ur-Rehman Baig http://pk.linkedin.com/in/shujamughal
Re: Including Additional Jars
You need to pass the mainClass after the jar: http://hadoop.apache.org/common/docs/r0.21.0/commands_manual.html#jar

On Wed, Apr 6, 2011 at 11:31 AM, Shuja Rehman shujamug...@gmail.com wrote: I am using the following command: hadoop jar myjar.jar -libjars /home/shuja/lib/mylib.jar param1 param2 param3 but the program is still giving the error and does not find mylib.jar. Can you confirm the syntax of the command? Thanks

On Wed, Apr 6, 2011 at 8:29 PM, Bill Graham billgra...@gmail.com wrote: If you could share more specifics regarding just how it's not working (i.e., job specifics, stack traces, how you're invoking it, etc.), you might get more assistance in troubleshooting.
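Concretely, the command above fails because -libjars appears where the main class should be. Assuming the jar's main class is com.example.MyMain (a placeholder), the invocation would look like:

    hadoop jar myjar.jar com.example.MyMain -libjars /home/shuja/lib/mylib.jar param1 param2 param3

Note that -libjars is only honored when the main class runs its arguments through GenericOptionsParser, e.g. via ToolRunner.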
Re: Including Additional Jars
Shuja, I haven't tried this, but from what I've read it seems you could just add all the jars required by the Mapper and Reducer to HDFS and then add them to the classpath in your run() method like this: DistributedCache.addFileToClassPath(new Path("/myapp/mylib.jar"), job); I think that's all there is to it, but like I said, I haven't tried it. Just be sure your run() method isn't in the same class as your mapper/reducer if they import packages from any of the distributed cache jars.

On Mon, Apr 4, 2011 at 11:40 AM, James Seigel ja...@tynt.com wrote: James' quick and dirty, get-your-job-running guideline: -libjars -- for jars you want accessible by the mappers and reducers; classpath or bundled in the main jar -- for jars you want accessible to the runner. Cheers, James.
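A fuller, untested sketch of that approach inside a Tool's run() method (class and path names are placeholders; the jar must already be in HDFS, and the cache entry must be added to the conf before the Job is created):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.conf.Configured;
    import org.apache.hadoop.filecache.DistributedCache;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.util.Tool;
    import org.apache.hadoop.util.ToolRunner;

    public class MyJobRunner extends Configured implements Tool {
        public int run(String[] args) throws Exception {
            Configuration conf = getConf();
            // adds the HDFS jar to the task classpath via the distributed cache
            DistributedCache.addFileToClassPath(new Path("/myapp/mylib.jar"), conf);
            Job job = new Job(conf, "my job");
            job.setJarByClass(MyJobRunner.class);
            // ... set mapper/reducer, input/output paths ...
            return job.waitForCompletion(true) ? 0 : 1;
        }

        public static void main(String[] args) throws Exception {
            System.exit(ToolRunner.run(new MyJobRunner(), args));
        }
    }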
Re: Chukwa setup issues
Unfortunately conf/collectors is used in two different ways in Chukwa, each with a different syntax. This should really be fixed.

1. The script that starts the collectors looks at it for a list of hostnames (no ports) to start collectors on. To start it just on one host, set it to localhost.
2. The agent looks at that file for the list of collectors to attempt to communicate with. In that case the format is a list of HTTP URLs with the ports of the collectors.

Can you telnet to the port? It looks like it's listening, but nothing's being sent. Is there anything in logs/collector.log?

On Fri, Apr 1, 2011 at 1:09 PM, bikash sharma sharmabiks...@gmail.com wrote: Hi, I am trying to set up Chukwa for a 16-node Hadoop cluster. I followed the admin guide: http://incubator.apache.org/chukwa/docs/r0.4.0/admin.html#Agents However, I ran into the following two issues: 1. What should be the collector port that needs to be specified in the conf/collectors file? 2. I am unable to see the collector running via a web browser. Am I missing something? Thanks in advance. -bikash p.s. - after I run the collector, nothing happens: % bin/chukwa collector 2011-04-01 16:07:16.410::INFO: Logging to STDERR via org.mortbay.log.StdErrLog 2011-04-01 16:07:16.523::INFO: jetty-6.1.11 2011-04-01 16:07:17.707::INFO: Started SelectChannelConnector@0.0.0.0: started Chukwa http collector on port
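For example (hostnames and the port are placeholders for whatever your setup uses), the copy of conf/collectors the startup script reads would contain bare hostnames, one per line:

    localhost

while the copy the agents read would contain full HTTP URLs with ports:

    http://collector1.example.com:8080
    http://collector2.example.com:8080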
Re: Chukwa - Lightweight agents
Yes, we run lightweight Chukwa agents and collectors only, using Chukwa just as you describe. We've been doing so for over a year without many issues. The code is fairly easy to extend when needed. We rolled our own collector, agent and demux RPMs. The monitoring piece of Chukwa is optional and we don't use that part.

On Sun, Mar 20, 2011 at 6:47 PM, Ted Dunning tdunn...@maprtech.com wrote: Bummer.

On Sun, Mar 20, 2011 at 10:12 AM, Mark static.void@gmail.com wrote: We tried Flume, however there are some pretty strange bugs occurring which prevent us from using it. http://groups.google.com/a/cloudera.org/group/flume-user/browse_thread/thread/66c6aecec9d1869b

On 3/20/11 10:03 AM, Ted Dunning wrote: OpenTSDB is purely a monitoring solution, which is the primary mission of Chukwa. If you are looking for data import, what about Flume?

On Sun, Mar 20, 2011 at 9:59 AM, Mark static.void@gmail.com wrote: Thanks, but we need Chukwa to aggregate and store files from across our app servers into Hadoop. It doesn't really look like OpenTSDB is meant for that. I could be wrong though?

On 3/20/11 9:49 AM, Ted Dunning wrote: Take a look at OpenTSDB at http://opentsdb.net/ It provides lots of the capability in a MUCH simpler package.

On Sun, Mar 20, 2011 at 8:43 AM, Mark static.void@gmail.com wrote: Sorry, but it doesn't look like the Chukwa mailing list exists anymore? Is there an easy way to set up lightweight agents on a cluster of machines instead of downloading the full Chukwa source (+50MB)? Has anyone built separate RPMs for the agents/collectors? Thanks
Re: Chukwa?
Chukwa hasn't had a release since moving from Hadoop to the Incubator, so there are no releases in the /incubator repos. Follow the link on the Chukwa homepage to the download repos: http://incubator.apache.org/chukwa/ http://www.apache.org/dyn/closer.cgi/hadoop/chukwa/chukwa-0.4.0

On Sun, Mar 20, 2011 at 9:38 AM, Mark static.void@gmail.com wrote: What's the deal with Chukwa? The mailing list doesn't look like it's alive, nor do any of the download options: http://www.apache.org/dyn/closer.cgi/incubator/chukwa/ Is this project dead?
Re: Hadoop exercises
For the even lazier, you could give both the test data and the expected output data. That way they know for sure if they got it right. This also promotes a good testing practice, which is to assert against an expected set of results.

On Wed, Jan 5, 2011 at 9:19 AM, Mark Kerzner markkerz...@gmail.com wrote: Thank you, Harsh, for the suggestion. It also gave me an idea to add one more exercise, generating test data, which is helpful in its own right, since it brings out the idea of the Hadoop testing philosophy: think of large tests. Mark

On Wed, Jan 5, 2011 at 10:39 AM, Harsh J qwertyman...@gmail.com wrote: Providing a data sample after describing it would be good for the yet-lazy.

On Wed, Jan 5, 2011 at 9:19 PM, Mark Kerzner markkerz...@gmail.com wrote: Hi, what would you think of these exercises for the Hadoop intro chapter? http://hadoopinpractice.blogspot.com/2011/01/exercises-for-chapter-1-how-do-they.html Thank you, Mark
Re: Which patch to apply out of multiple available ?
You typically want the last one only. Generally, higher-numbered patches are revisions of previous ones, and they're cumulative.

On Tue, Sep 7, 2010 at 10:34 AM, Shrijeet Paliwal shrij...@rocketfuel.com wrote: Probably a silly question: if I want to apply a patch to a version I am running and there are multiple patches attached to the JIRA, which one should I pick? The latest one? Example: https://issues.apache.org/jira/browse/HIVE-1019?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel Thanks, -Shrijeet
PigServer.executeBatch throwing NotSerializableException
Hi, I've just deployed some new Pig jobs live (Pig version 0.7.0) and I'm getting the error shown below. Has anyone seen this before? What's strange is that I have a tier of 4 load-balanced machines that serve as my job runners, all with identical code and hardware, but 2 of the four will fail consistently with this error. The other 2 run the same job without problem. I'm racking my brain to understand why...

java.io.NotSerializableException: org.apache.commons.logging.impl.Log4JLogger
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1156)
at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1509)
at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1474)
at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1392)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1150)
at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:326)
at java.util.HashMap.writeObject(HashMap.java:1000)
at sun.reflect.GeneratedMethodAccessor624.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:945)
at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1461)
at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1392)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1150)
at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1509)
at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1474)
at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1392)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1150)
at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1509)
at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1474)
at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1392)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1150)
at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:326)
at org.apache.pig.impl.util.ObjectSerializer.serialize(ObjectSerializer.java:40)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler.getJob(JobControlCompiler.java:511)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler.compile(JobControlCompiler.java:246)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher.launchPig(MapReduceLauncher.java:131)
at org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.execute(HExecutionEngine.java:308)
at org.apache.pig.PigServer.executeCompiledLogicalPlan(PigServer.java:844)
at org.apache.pig.PigServer.execute(PigServer.java:837)
at org.apache.pig.PigServer.access$100(PigServer.java:107)
at org.apache.pig.PigServer$Graph.execute(PigServer.java:1089)
at org.apache.pig.PigServer.executeBatch(PigServer.java:290)

thanks, Bill
Re: PigServer.executeBatch throwing NotSerializableException
Thanks Yan! That definitely looks like it could be the culprit. I'll give that a shot. FYI, I also found an instance of Log in one of my UDFs that wasn't static. From what I gather, changing that to static will keep serialization from occurring, so I'm giving that a shot as well.

On Tue, Sep 7, 2010 at 10:33 AM, Yan Zhou y...@yahoo-inc.com wrote: Most likely you were using the older version 1.0.3 of the log4j logger when you built Pig. See https://jira.jboss.org/browse/JBAS-1781. The root cause might be https://issues.apache.org/jira/browse/PIG-1582. So the first thing you probably need to do is upgrade your local copy of the Pig trunk. Yan

-Original Message- From: Bill Graham [mailto:billgra...@gmail.com] Sent: Tuesday, September 07, 2010 10:00 AM To: pig-user@hadoop.apache.org Subject: PigServer.executeBatch throwing NotSerializableException (quoting the message and stack trace above)
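For reference, the non-static-Log pitfall looks like this. This is a hypothetical UDF sketch, not the actual job's code: a static field is skipped by Java serialization, so the Log4JLogger instance never gets written out with the UDF:

    import java.io.IOException;
    import org.apache.commons.logging.Log;
    import org.apache.commons.logging.LogFactory;
    import org.apache.pig.EvalFunc;
    import org.apache.pig.data.Tuple;

    public class MyUdf extends EvalFunc<String> {
        // static: not part of instance state, so never serialized with the UDF
        private static final Log LOG = LogFactory.getLog(MyUdf.class);

        public String exec(Tuple input) throws IOException {
            if (input == null || input.size() == 0) return null;
            LOG.debug("processing tuple");
            return input.get(0).toString();
        }
    }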
Re: Real-time log processing in Hadoop
We're using Chukwa to do steps a-d before writing summary data into MySQL. Data is written into new directories every 5 minutes. Our MR jobs and data load into MySQL take 5 minutes, so after a 5-minute window closes, we typically have summary data from that interval in MySQL a few minutes later. But as Ranjib points out, how fast you can process your data depends on both cluster size and data rate. thanks, Bill

On Sun, Sep 5, 2010 at 10:42 PM, Ranjib Dey ranj...@thoughtworks.com wrote: We are using Hadoop for log crunching, and the mined data feeds one of our apps. It's not exactly real time; the app is basically a mail responder which provides certain services given an e-mail (with a prescribed format) sent to it (a...@xxx.com). We have been able to bring down the response time to 30 mins. This includes automated Hadoop job submission, processing the output, and intermediate status notification. From our experience, the entire response time is dependent on your data size, your Hadoop cluster's strength, etc. And you need to do performance optimization at each level (as required), which includes JVM tuning (different tuning for name nodes / data nodes) and app-level code refactoring (like using har on HDFS for smaller files, etc.). regards, ranjib

On Mon, Sep 6, 2010 at 10:32 AM, Ricky Ho rickyphyl...@yahoo.com wrote: Can anyone share their experience in doing real-time log processing using Chukwa/Scribe + Hadoop? I am wondering how real-time this can be, given Hadoop is designed for batch rather than stream processing: 1) The startup/teardown time of running Hadoop jobs typically takes minutes. 2) Data is typically stored in HDFS in large files; it takes some time to accumulate data to that size. All these add up to the latencies of Hadoop. So I am wondering what the shortest latencies are that people see doing log processing in real life. To my understanding, the Chukwa/Scribe model accumulates log requests (from many machines) and writes them to HDFS (inside a directory). After the logger switches to a new directory, the old one is ready for Map/Reduce processing, which then produces the result. So the latency is:

a) Accumulate enough data to fill an HDFS block size.
b) Write the block to HDFS.
c) Keep doing (b) until the criteria for switching to a new directory are met.
d) Start the Map/Reduce processing in the old directory.
e) Write the processed data to the output directory.
f) Load the output into a queriable form.

I think the above can easily be a 30-minute or 1-hour duration. Is this ballpark in line with the real-life projects that you have done? Rgds, Ricky
JIRA down
JIRA seems to be down FYI. Database errors are being returned:

Cause: org.apache.commons.lang.exception.NestableRuntimeException: com.atlassian.jira.exception.DataAccessException: org.ofbiz.core.entity.GenericDataSourceException: Unable to esablish a connection with the database. (FATAL: database is not accepting commands to avoid wraparound data loss in database postgres)

Stack Trace:
org.apache.commons.lang.exception.NestableRuntimeException: com.atlassian.jira.exception.DataAccessException: org.ofbiz.core.entity.GenericDataSourceException: Unable to esablish a connection with the database. (FATAL: database is not accepting commands to avoid wraparound data loss in database postgres)
at com.atlassian.jira.web.component.TableLayoutFactory.getUserColumns(TableLayoutFactory.java:239)
at com.atlassian.jira.web.component.TableLayoutFactory.getStandardLayout(TableLayoutFactory.java:42)
at org.apache.jsp.includes.navigator.table_jsp._jspService(table_jsp.java:178)
at org.apache.jasper.runtime.HttpJspBase.service(HttpJspBase.java:70)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:717)
at org.apache.jasper.servlet.JspServletWrapper.service(JspServletWrapper.java:374)
at org.apache.jasper.servlet.JspServlet.serviceJspFile(JspServlet.java:342)
at org.apache.jasper.servlet.JspServlet.service(JspServlet.java:267)
Re: Reopen and append to SequenceFile
Chukwa also has a JMSAdaptor that can listen to an AMQ queue and stream the messages to one or more collectors, to then be persisted as sequence files.

On Fri, Aug 20, 2010 at 3:29 AM, cliff palmer palmercl...@gmail.com wrote: You may want to consider using something like the *nix tee command to save a copy of each message in a log directory. A periodic job (like Flume) would load the logged messages into sequence files. HTH Cliff

On Fri, Aug 20, 2010 at 3:32 AM, skantsoni shashikant_s...@mindtree.com wrote: Hi, I am fairly new to Hadoop and HDFS and am trying to do the following: 1. Consume some information being published by a system over AMQP. 2. Write these as Text, Text pairs into a sequence file. Periodically these files would be consumed by another system to generate reports. The problem is that our system which consumes messages is distributed and runs across multiple machines, and I cannot keep the writer on a sequence file open for a long time to keep appending. I want to open the file and then close it for each message that I receive (I don't know if this is the correct approach for HDFS). But once I close the writer I cannot reopen it to append. I saw a few threads talking about merging these files, but I felt that might be an overhead. I feel I am missing something about the fundamental usage of sequence files, or is there another way to do this? Can someone please point me in the correct direction? Thanks in advance
Re: Pig and Cassandra
I've seen that exception in other cases where there is an unmet dependency on a superclass that is included in a separate (and not provided) jar. Check the thrift source to see if that's the case.

On Friday, August 13, 2010, Christian Decker decker.christ...@gmail.com wrote: Hi all, I'm pretty new to Pig and Hadoop so excuse me if this is trivial, but I couldn't find anyone able to help me. I'm trying to get Pig to read data from a Cassandra cluster, which I thought trivial since Cassandra already provides me with the CassandraStorage class [1]. The problem is that once I try executing a simple script like this:

register /path/to/pig-0.7.0-core.jar;
register /path/to/libthrift-r917130.jar;
register /path/to/cassandra_loadfunc.jar;
rows = LOAD 'cassandra://Keyspace1/Standard1' USING org.apache.cassandra.hadoop.pig.CassandraStorage();
cols = FOREACH rows GENERATE flatten($1);
colnames = FOREACH cols GENERATE $0;
namegroups = GROUP colnames BY $0;
namecounts = FOREACH namegroups GENERATE COUNT($1), group;
orderednames = ORDER namecounts BY $0;
topnames = LIMIT orderednames 50;
dump topnames;

I just end up with a NoClassDefFoundError:

ERROR org.apache.pig.tools.grunt.Grunt - org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1066: Unable to open iterator for alias topnames
at org.apache.pig.PigServer.openIterator(PigServer.java:521)
at org.apache.pig.tools.grunt.GruntParser.processDump(GruntParser.java:544)
at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:241)
at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:162)
at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:138)
at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:89)
at org.apache.pig.Main.main(Main.java:391)
Caused by: org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1002: Unable to store alias topnames
at org.apache.pig.PigServer.store(PigServer.java:577)
at org.apache.pig.PigServer.openIterator(PigServer.java:504)
... 6 more
Caused by: org.apache.pig.backend.executionengine.ExecException: ERROR 2117: Unexpected error when launching map reduce job.
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher.launchPig(MapReduceLauncher.java:209)
at org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.execute(HExecutionEngine.java:308)
at org.apache.pig.PigServer.executeCompiledLogicalPlan(PigServer.java:835)
at org.apache.pig.PigServer.store(PigServer.java:569)
... 7 more
Caused by: java.lang.RuntimeException: Could not resolve error that occured when launching map reduce job: java.lang.NoClassDefFoundError: org/apache/thrift/TBase
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher$JobControlThreadExceptionHandler.uncaughtException(MapReduceLauncher.java:510)
at java.lang.Thread.dispatchUncaughtException(Thread.java:1845)

(Sorry for posting the whole error message.) I cannot think of a reason why. As far as I understood it, Pig takes the jar files in the script, unpackages them, creates the execution plan for the script itself, bundles it into a single jar again, then submits it to HDFS from where it will be executed in Hadoop, right? I also checked that the class in question actually is in the libthrift jar, so what's going wrong? Regards, Chris [1] http://svn.apache.org/viewvc/cassandra/trunk/contrib/pig/src/java/org/apache/cassandra/hadoop/pig/CassandraStorage.java?revision=984904&view=markup
Re: Changing hostnames of tasktracker/datanode nodes - any problems?
Sorry to hijack the thread, but I have a similar use case. In a few months we're going to be moving colos. The new cluster will be the same size as the current cluster, and some downtime is acceptable. The hostnames will be different. From what I've read in this thread it seems like it would be safe to do the following:

1. Build out the new cluster without starting it.
2. Shut down the entire old cluster (NN, SNN, DNs).
3. scp the relevant data and name dirs for each host to the new hardware.
4. Start the new cluster.

Is it correct to say that that would work fine? We have a replication factor of 2, so we'd be copying twice as much data as we need to, so I'm sure there's a more efficient approach. What about adding the new nodes in the new colo to the existing cluster, rebalancing, and then decommissioning the old cluster nodes before finally migrating the NN/SNN? I know Hadoop isn't intended to run cross-colo, but would this be a more efficient approach than the one above?

On Tue, Aug 10, 2010 at 8:59 AM, Allen Wittenauer awittena...@linkedin.com wrote: On Aug 10, 2010, at 7:07 AM, Brian Bockelman wrote: Hi Erik, You can also do this one-by-one (aka a rolling reboot). Shut it down, wait for it to be recognized as dead, then bring it back up with a new hostname. It will take a much longer time, but you won't have any decrease in availability, just some minor decrease in capacity.

... and potentially problems with dfs.hosts.
Re: Changing hostnames of tasktracker/datanode nodes - any problems?
Ahh yes, of course, distcp. Thanks!

On Tue, Aug 10, 2010 at 11:01 AM, Allen Wittenauer awittena...@linkedin.com wrote: On Aug 10, 2010, at 10:54 AM, Bill Graham wrote: Is it correct to say that that would work fine? We have a replication factor of 2, so we'd be copying twice as much data as we need to, so I'm sure there's a more efficient approach.

It should work fine. But yes, highly inefficient.

What about adding the new nodes in the new colo to the existing cluster, rebalancing and then decommissioning the old cluster nodes before finally migrating the NN/SNN? I know Hadoop isn't intended to run cross-colo, but would this be a more efficient approach than the one above?

If you can keep both grids up at the same time, use distcp to do the copy. This will make sure the blocks get copied once, will keep permissions with -p, keep the replication factor, redistribute data (free balancing!), etc., etc.
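For instance, run from the new cluster, the migration copy might look like this (hostnames and ports are placeholders; reading over hftp sidesteps any RPC version mismatch between the clusters):

    hadoop distcp -p -update hftp://old-nn.example.com:50070/ hdfs://new-nn.example.com:9000/

Here -p preserves status such as permissions and replication factor, and -update lets you re-run the copy to pick up files that changed since the last pass.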
Re: how to convert java program to pig UDF, and important statements in the pig script
http://hadoop.apache.org/pig/docs/r0.7.0/udf.html#How+to+Write+a+Simple+Eval+Function Replace 'return str.toUpperCase()' in the example with 'return str + "*"' and you have a star UDF.

On Mon, Aug 9, 2010 at 1:40 PM, Ifeanyichukwu Osuji osujii...@potsdam.edu wrote: I have a simple Java file that adds a star to a word and prints the result (this is the simplest Java program I could think of):

import java.util.*;

public class Star {
    public static void main(String[] args) {
        if (args.length == 1) {
            addStar(args[0]);
        } else {
            System.exit(1);
        }
    }

    public static void addStar(String word) {
        String str = "";
        str += word + "*";
        System.out.println(str);
    }
}

How can I use Pig to execute this Java program? The answer I can come up with on my own is converting the method addStar to a UDF, but I don't know how to do it (please help). The documentation wasn't that helpful. Rewording the question: let's say I have a file words.log that contains a column of words (all of which I want to add a star to). I would like to use Pig to pass each word in the log through the Java program above. How can I do this? If I were to write a Pig script, would it be like this?

myscript.pig:
a = load 'words.log' as (word:chararray);
b = foreach a generate star(word); ... (I don't know what to do, please help)
dump b;

ubuntu-user ife
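Following the linked doc, an untested sketch of addStar as an eval function would be:

    import java.io.IOException;
    import org.apache.pig.EvalFunc;
    import org.apache.pig.data.Tuple;

    public class Star extends EvalFunc<String> {
        public String exec(Tuple input) throws IOException {
            // guard against empty input tuples
            if (input == null || input.size() == 0) return null;
            String word = (String) input.get(0);
            return word + "*";
        }
    }

After packaging it into a jar, REGISTER the jar in the script and call it in place of star(word) above, using the class's fully qualified name if it isn't in the default package.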
Re: mapred.min.split.size
FYI, Chukwa support for Pig 0.7.0 was just committed last week: https://issues.apache.org/jira/browse/CHUKWA-495 The patch was built on Chukwa 0.4.0, but you could try applying the patch against Chukwa 0.3.0. I don't think the relevant code changed much between 0.3 and 0.4.

On Thu, Aug 5, 2010 at 4:40 PM, Richard Ding rd...@yahoo-inc.com wrote: What version of Pig are you on? The ChukwaStorage loader for Pig 0.7 uses the Hadoop FileInputFormat to generate splits, so the mapred.min.split.size property should work. But from the release date, Chukwa 0.3 seems not to be on Pig 0.7. Thanks, -Richard

-Original Message- From: Corbin Hoenes [mailto:cor...@tynt.com] Sent: Thursday, August 05, 2010 3:50 PM To: pig-user@hadoop.apache.org Subject: Re: mapred.min.split.size I am using the ChukwaStorage loader from Chukwa 0.3. Is it the loader's responsibility to deal with input splits?

On Aug 5, 2010, at 4:14 PM, Richard Ding wrote: I misunderstood your earlier question. If you have one large file, setting the mapred.min.split.size property will help to increase the file split size. Pig will pass system properties to Hadoop. What loader are you using? Thanks, -Richard

-Original Message- From: Corbin Hoenes [mailto:cor...@tynt.com] Sent: Thursday, August 05, 2010 1:22 PM To: pig-user@hadoop.apache.org Subject: Re: mapred.min.split.size So what does Pig do when I have a 5 GB file? Does it simply hardcode the split size to the block size? Is there no way to tell it to just operate on a larger split size?

On Jul 27, 2010, at 3:41 PM, Richard Ding wrote: For Pig loaders, each split can have at most one file, no matter what the split size is. You can concatenate the input files before loading them. Thanks, -Richard

-Original Message- From: Corbin Hoenes [mailto:cor...@tynt.com] Sent: Tuesday, July 27, 2010 2:09 PM To: pig-user@hadoop.apache.org Subject: mapred.min.split.size Is there a way to set the mapred.min.split.size property in Pig? I set it, but it doesn't seem to have changed the mappers' HDFS_BYTES_READ counter. My mappers are finishing in ~10 secs. I have ~20,000 of them.
Re: why I can't reply email
Try sending your email as text and not HTML, if you're not already. Others have also had issues on Apache lists with HTML emails getting flagged as spam more easily.

On Wed, Aug 4, 2010 at 3:30 PM, Todd Lee ronnietodd...@gmail.com wrote: Maybe qq.com got blacklisted :) T

2010/8/4 我很快乐 896923...@qq.com: I can send email to hive-user@hadoop.apache.org, but after other people reply to my email, I can't reply to their emails, and I receive the message below: host mx1.eu.apache.org[192.87.106.230] said: 552 spam score (14.4) exceeded threshold (in reply to end of DATA command). Could anybody tell me what the reason is? Thanks, LiuLei
Re: Chukwa questions
Your understanding of how Chukwa works is correct. Hadoop by itself is a system that contains both the HDFS and MapReduce systems. The other projects you list are all projects built upon Hadoop, but you don't need them to run Hadoop or to get value out of it by itself. To run the Chukwa agent on a data-source node you do not need Hadoop on that node. The Chukwa agent's run-time distribution contains the Hadoop jars, and those will be used by the agent, but none of the Hadoop daemons are needed on that node. CC'ing the chukwa-us...@hadoop.apache.org list, where this discussion should probably move if there are follow-up Chukwa questions. Bill

On Fri, Jul 9, 2010 at 8:33 AM, Blargy zman...@hotmail.com wrote: I am looking into Chukwa to collect/aggregate our search logs from across multiple hosts. As I understand it, I need to have an agent/adaptor running on each host, which then in turn forwards this to a collector (across the network), which will then write out to HDFS. Correct? Does Hadoop need to be installed on the host machines that are running the agents/adaptors, or just Chukwa? Is Hadoop by itself anything, or is Hadoop just a collection of tools: HDFS, Hive, Chukwa, Mahout, etc.? Thanks
Re: Problem in chukwa output
FYI, the TsProcessor is not the default processor, so if you want to use it you need to explicitly configure it to be used. If you have done that, then note that the default time format of the TsProcessor is 'yyyy-MM-dd HH:mm:ss,SSS', which is not what you have. If you process logs like the ones you show with the TsProcessor without overriding the default time format, you will get many InError files as output. Here's the code: http://svn.apache.org/viewvc/hadoop/chukwa/trunk/src/java/org/apache/hadoop/chukwa/extraction/demux/processor/mapper/TsProcessor.java?view=markup And here's how to configure the time format expected by the processor: https://issues.apache.org/jira/browse/CHUKWA-472 And here's how to set the default processor to something other than what's hardcoded (which is DefaultProcessor): https://issues.apache.org/jira/browse/CHUKWA-473

On Thu, Jun 3, 2010 at 10:15 AM, Jerome Boulon jbou...@netflix.com wrote: The default TsProcessor expects that every record/line starts with a date. The only thing that matters is the record delimiter. All current readers use \n as the record delimiter. So for your specific case, is \n the right record delimiter? If yes, then there's a bug in the reader; create a JIRA for that. If \n is not a record delimiter, then you have to write your own reader, change your log format to use \n as a record delimiter, or escape the \n as we do in the log4j appender. /Jerome.

On 6/3/10 12:14 AM, Stuti Awasthi stuti_awas...@persistent.co.in wrote: Hi, I checked the new TsProcessor class, but I don't think that I have to change the date format, as I'm using standard syslog-type log files. In my case, I am using TsProcessor. It is able to partially parse the log files correctly and generate .evt files beneath the repos/ dir. However, there is also an error directory, and most of the data is going into that directory. I am getting the date parse exception. I tried to find out why some of the data could be parsed and the remaining could not. Then I found out that this is because the data is getting divided into chunks as follows.

Suppose the contents of the log file are as follows:

May 29 13:09:02 ps3156 /USR/SBIN/CRON[19815]: (root) CMD ( [ -x /usr/lib/php5/maxlifetime ] && [ -d /var/lib/php5 ] && find /var/lib/php5/ -type f -cmin +$(/usr/lib/php5/maxlifetime) -print0 | xargs -n 200 -r -0 rm)
May 29 13:09:02 ps3156 /USR/SBIN/CRON[19815]: (root) CMD ( [ -x /usr/lib/php5/maxlifetime ] && [ -d /var/lib/php5 ] && find /var/lib/php5/ -type f -cmin +$(/usr/lib/php5/maxlifetime) -print0 | xargs -n 200 -r -0 rm)
May 29 13:09:02 ps3156 /USR/SBIN/CRON[19815]: (root) CMD ( [ -x /usr/lib/php5/maxlifetime ] && [ -d /var/lib/php5 ] && find /var/lib/php5/ -type f -cmin +$(/usr/lib/php5/maxlifetime) -print0 | xargs -n 200 -r -0 rm)
May 29 13:09:02 ps3156 /USR/SBIN/CRON[19815]: (root) CMD ( [ -x /usr/lib/php5/maxlifetime ] && [ -d /var/lib/php5 ] && find /var/lib/php5/ -type f -cmin +$(/usr/lib/php5/maxlifetime) -print0 | xargs -n 200 -r -0 rm)

Chunk 1:

May 29 13:09:02 ps3156 /USR/SBIN/CRON[19815]: (root) CMD ( [ -x /usr/lib/php5/maxlifetime ] && [ -d /var/lib/php5 ] && find /var/lib/php5/ -type f -cmin +$(/usr/lib/php5/maxlifetime) -print0 | xargs -n 200 -r -0 rm)
May 29 13:09:02 ps3156 /USR/SBIN/CRON[19815]: (root) CMD ( [ -x /usr/lib/php5/maxlifetime ] && [ -d /var/lib/php5 ] && find /var/lib/php5/ -type f -cmin +$(/usr/lib/php5/maxlifetime) -print0 |

Chunk 2:

xargs -n 200 -r -0 rm)
May 29 13:09:02 ps3156 /USR/SBIN/CRON[19815]: (root) CMD ( [ -x /usr/lib/php5/maxlifetime ] && [ -d /var/lib/php5 ] && find /var/lib/php5/ -type f -cmin +$(/usr/lib/php5/maxlifetime) -print0 | xargs -n 200 -r -0 rm)
May 29 13:09:02 ps3156 /USR/SBIN/CRON[19815]: (root) CMD ( [ -x /usr/lib/php5/maxlifetime ] && [ -d /var/lib/php5 ] && find /var/lib/php5/ -type f -cmin +$(/usr/lib/php5/maxlifetime) -print0 | xargs -n 200 -r -0 rm)

There is no problem with the first chunk. It gets parsed properly and a .evt file is created. But the second chunk starts with "xargs -n 200 -r -0 rm)", which is not a valid date format, so the date parse exception is thrown. So the problem is with the way data is getting divided into chunks. Is there any way to divide the chunks evenly? Any pointers in this case would help.
Re: Problem in chukwa output
The unparseable date errors are due to the map processor not being able to properly extract the date from the record. Look at the TsProcessor (on the trunk) and the latest demux configs for examples of how to configure a processor for a given date format. I'm away from my computer now, but if you search for jiras asignex to me, you should find the relevant patches. On Friday, May 28, 2010, Stuti Awasthi stuti_awas...@persistent.co.in wrote: Hi, Sorry for replying late I was trying with what you have suggested. Yes it worked for me. Rotation factor increased my file size but now have other issue J @Issue : When chukwa demuxer gets the log for the processing , it is getting distributed in 2 directories : 1) After correct processing , it generates .evt files. 2) Chuwa parser does not parse the data properly and end up giving ..InError directory. Rotation Time : 5 min to 1 Hour 1. SYSTEM LOGS Log File used : message1 Datatype used : SysLog Error : java.text.ParseException: Unparseable date: y 4 06:12:38 p 2. Hadoop Logs Log File Used : Hadoop datanode logs , Hadoop TaskTracker logs Datatype Used : HadoopLog Error : java.text.ParseException: Unparseable date: 0 for block blk_1617125 3. Chuwa Agent Logs Log File Used : Chuwa Agent logs Datatype Used : chuwaAgent Error : org.json.JSONException: A JSONObject text must begin with '{' at character 1 of post thread ChukwaHttpSender - collected 1 chunks I am wondering why data is getting into these INError directory. Is there any way we can get correct evt files after demuxing rather than these INError.evt files. Thanks Stuti From: Jerome Boulon [mailto:jbou...@netflix.com] Sent: Thursday, May 27, 2010 1:01 AM To: chukwa-user@hadoop.apache.org Subject: Re: Problem in chukwa output Hi, The demux is grouping you data per date/hour/TimeWindow so yes, 1 .done file could be split into multiple .evt file depending on the content/timestamp of your data. Generally, if you have a SysLogInError directory, it’s because the parser throws an exception and you should have some files in there. You may want to take a look at this wiki page to get an idea of Demux data flow. http://wiki.apache.org/hadoop/Chukwa_Processes_and_Data_Flow Regards, /Jerome. On 5/26/10 10:55 AM, Stuti Awasthi stuti_awas...@persistent.co.in wrote: Hi all, I am facing some problems in chukwa output. The following are the process flow in Collector : I worked with single .done file of 16MB in size for the analysis 1) Logs were collected in /logs directory. 2) After demux processing the output was stored in /repos directory. Following is the structure inside repos: /repos /SysLog Total Size : 1MB /20100503/ *.evt /20100504/*.evt /SysLogInError Total Size : 15MB /…./*.evt I have 2 doubts : I noticed that my single log file was spilt into multiple .evt file. My output file contained 2 folders inside / SysLog .Is this the correct behaviour that a single .done file is split into n number of .evt files and in different directory structure? There was a directory of SysLogInError generated but there was no ERROR in the log file. I was not sure when this directory gets created? Any pointers will be helpful. Thanks Stuti DISCLAIMER == This e-mail may contain privileged and confidential information which is the property of Persistent Systems Ltd. It is intended only for the use of the individual or entity to which it is addressed. If you are not the intended recipient, you are not authorized to read, retain, copy, print, distribute or use this message. 
Re: how to determine hive version
The getVersion method in the hive jdbc driver should give you the version, which is read from the manifest version in the META-INF folder of the hive jar. On Monday, May 31, 2010, Arvind Prabhakar arv...@cloudera.com wrote: Hello Kortni, One way to find out which version of Hive you are using is to look at the hive-default.xml file under the conf directory. In this file, check out the value of the property hive.hwi.war.file, which should be of the format: <value>lib/hive-hwi-VERSION.war</value> From there you can infer the version. If you want a more direct means of finding out the version of Hive, please file a Jira as an enhancement request. Arvind On Thu, May 27, 2010 at 2:58 PM, Kortni Smith ksm...@abebooks.com wrote: Hello, How can you tell what version of hive is running? I'm working with hive and EMR, and know that it's hive 0.4 from the EMR job's first step configuration (s3://elasticmapreduce/libs/hive/0.4/install-hive), but I need to know if it's 0.4.1 or 0.4.0. Thanks Kortni Smith | Software Developer AbeBooks.com Passion for books. ksm...@abebooks.com phone: 250.412.3272 | fax: 250.475.6014 Suite 500 - 655 Tyee Rd. Victoria, BC. Canada V9A 6X5 www.abebooks.com | www.abebooks.co.uk | www.abebooks.de www.abebooks.fr | www.abebooks.it | www.iberlibro.com
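For a programmatic check, the manifest-backed version mentioned above is also reachable through standard JDBC metadata calls. The snippet below is a hedged sketch: the host, port and credentials are placeholders, and whether getDatabaseProductVersion/getDriverVersion are actually implemented depends on the Hive release in use (some older drivers throw "Method not supported" for metadata calls).

import java.sql.Connection;
import java.sql.DatabaseMetaData;
import java.sql.DriverManager;

public class HiveVersionCheck {
    public static void main(String[] args) throws Exception {
        // driver class and URL form used by the Hive JDBC driver of that era
        Class.forName("org.apache.hadoop.hive.jdbc.HiveDriver");
        Connection conn =
            DriverManager.getConnection("jdbc:hive://localhost:10000/default", "", "");
        try {
            DatabaseMetaData md = conn.getMetaData();
            System.out.println("product version: " + md.getDatabaseProductVersion());
            System.out.println("driver version: " + md.getDriverVersion());
        } finally {
            conn.close();
        }
    }
}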
Re: including multiple delimited fields (of unknown count) into one
Correct, I don't need to know the arity of the tuple and if I LOAD without specifying the fields like you show I should be able to effectively STORE the same data. The problem though is that I need to include both the tuple and the timestamp in the grouping (but not the count), then sum the counts. As an example, this: 127120140 3 1770162 5 127120140 4 2000162 100 127120170 3 1770162 5 127120170 4 2000162 100 Would become this (where 127119960 is the hour that the two timestamps both roll up to): 127119960 6 1770162 5 127119960 8 2000162 100 So in my case I'd like to be able to load timestamp, count and tuple and then group on timestamp and tuple and output in the same format of timestamp, count, tuple. The easiest hack I've come up with for now is to dynamically insert the field definitions in my script before I run it. So in the example above I would insert 'f1, f2, f3' everywhere I need to reference the tuple. Another run might insert 'f1, f2' for an input that only has 2 extra fields. On Thu, May 20, 2010 at 12:39 AM, Mridul Muralidharan mrid...@yahoo-inc.com wrote: I am not sure what the processing is once the group'ing is done, but each tuple has a size() (for arity) method which gives us the number of fields in that tuple [if using in udf]. So that can be used to aid in computation. If you are interested in aggregating and simply storing it - you don't really need to know the arity of a tuple, right? (That is, group by timestamp, and store - PigStorage should continue to store the variable number of fields as was present in input). Regards, Mridul On Thursday 20 May 2010 05:39 AM, Bill Graham wrote: Thanks Mridul, but how would I access the items in the numbered fields 3..N where I don't know what N is? Are you suggesting I pass A to a custom UDF to convert to a tuple of [time, count, rest_of_line]? On Wed, May 19, 2010 at 4:11 PM, Mridul Muralidharan mrid...@yahoo-inc.com mailto:mrid...@yahoo-inc.com wrote: You can simply skip specifying schema in the load - and access the fields either through the udf or through $0, etc positional indexes. Like : A = load 'myfile' USING PigStorage(); B = GROUP A by round_hour($0) PARALLEL $PARALLELISM; C = ... Regards, Mridul On Thursday 20 May 2010 04:07 AM, Bill Graham wrote: Hi, Is there a way to read a collection (of unknown size) of tab-delimited values into a single data type (tuple?) during the LOAD phase? Here's specifically what I'm looking to do. I have a given input file format of tab-delimited fields like so: [timestamp] [count] [field1] [field2] [field3] .. [fieldN] I'm writing a pig job to take many small files and roll up the counts for a given time increment of a lesser granularity. For example, many files with timestamps rounded to 5 minute intervals will be rolled into a single file with 1 hour granularity. I'm able to do this by grouping on the timestamp (rounded down to the hour) and each of the fields shown if I know the number of fields and I list them all explicitly. I'd like to write this script though so that it would work on different input formats, some of which might have N fields, where others have M. For a given job run, the number of fields in the input files passed would be fixed. So I'd like to be able to do something like this in pseudo code: LOAD USING PigStorage('\t') AS (timestamp, count, rest_of_line) ... GROUP BY round_hour(timestamp), rest_of_line [flatten group and sum counts] ... STORE round_hour(timestamp), totalCount, rest_of_line Where I know nothing about how many tokens are in rest_of_line.
Any ideas besides subclassing PigStorage or writing a new FileInputLoadFunc? thanks, Bill
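As Mridul notes, a UDF can use Tuple.size() to cope with a variable number of trailing fields. The class below is a hypothetical helper (its name and tab-joining behavior are mine, not from the thread) that re-emits fields 2..N as a single chararray so a script can group on the timestamp plus that value without knowing N; the EvalFunc/Tuple API is as in Pig 0.x.

import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

public class RestOfLine extends EvalFunc<String> {
    @Override
    public String exec(Tuple input) throws IOException {
        if (input == null || input.size() <= 2) {
            return null; // nothing beyond timestamp and count
        }
        StringBuilder sb = new StringBuilder();
        for (int i = 2; i < input.size(); i++) {
            if (i > 2) {
                sb.append('\t');
            }
            Object field = input.get(i);
            sb.append(field == null ? "" : field.toString());
        }
        return sb.toString();
    }
}

Grouping on (round_hour(timestamp), RestOfLine(*)) and summing the counts would then reproduce the rolled-up rows shown above, regardless of how many extra fields a given input happens to carry.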
Re: Add user define jars
Hi Jerome, I haven't had to use external jars yet (I've pushed the logic I need into chukwa instead), but I would like to have a solution where including an external jar is just a matter of adding a jar to a local directory. Similar to how Hive has the auxlibs/ dir. It sounds like your solution aligns with this approach. If that's the case, I'm in favor of it. thanks, Bill On Wed, May 19, 2010 at 6:02 PM, Kirk True k...@mustardgrain.com wrote: Hi Jerome, I'm trying to use the 'stick your JAR in the HDFS chukwa/demux directory' approach. I'm not able to get it working (see the "Chukwa can't find Demux class" thread in the mailing list). Thanks, Kirk On 5/11/10 5:29 PM, Jerome Boulon wrote: Hi, I would like to get feedback from people using their own external jars with Demux. The current implementation is using the distributed cache to upload jars to Hadoop. - Is it working well? - Do you have any problem with this feature? I'm asking this because we solved this requirement in a different way in Honu and I wonder if that's something we need to improve in Chukwa. Thanks in advance, /Jerome.
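For context, the "current implementation" Jerome mentions relies on Hadoop's distributed cache. The fragment below shows only that generic Hadoop mechanism, not Chukwa's own jar-loading code, and the HDFS path is invented: a jar already sitting in HDFS can be put on a job's task classpath this way.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;

public class AddUdfJar {
    public static void configure(Configuration conf) throws Exception {
        // hypothetical location of a user-supplied parser jar in HDFS
        Path jarInHdfs = new Path("/chukwa/demux/my-parsers.jar");
        // puts the jar on the classpath of the job's map/reduce tasks
        DistributedCache.addFileToClassPath(jarInHdfs, conf);
    }
}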
including multiple delimited fields (of unknown count) into one
Hi, Is there a way to read a collection (of unknown size) of tab-delimited values into a single data type (tuple?) during the LOAD phase? Here's specifically what I'm looking to do. I have a given input file format of tab-delimited fields like so: [timestamp] [count] [field1] [field2] [field3] .. [fieldN] I'm writing a pig job to take many small files and roll up the counts for a given time increment of a lesser granularity. For example, many files with timestamps rounded to 5 minute intervals will be rolled into a single file with 1 hour granularity. I'm able to do this by grouping on the timestamp (rounded down to the hour) and each of the fields shown if I know the number of fields and I list them all explicitly. I'd like to write this script though so that it would work on different input formats, some of which might have N fields, where others have M. For a given job run, the number of fields in the input files passed would be fixed. So I'd like to be able to do something like this in pseudo code: LOAD USING PigStorage('\t') AS (timestamp, count, rest_of_line) ... GROUP BY round_hour(timestamp), rest_of_line [flatten group and sum counts] ... STORE round_hour(timestamp), totalCount, rest_of_line Where I know nothing about how many tokens are in rest_of_line. Any ideas besides subclassing PigStorage or writing a new FileInputLoadFunc? thanks, Bill
HttpTriggerAction - configuring N objects
Hi, As a follow up to CHUKWA-477, I'm writing a class called HttpTriggerAction that can hit one or more URLs upon successful completion of a demux job. I'd like to contribute it back unless anyone objects. Anyway, I'm looking for feedback though on how to configure this object. The issue is that since the class can hit N urls, it needs N sets of key-value configurations. The hadoop configurations model is just name-value pairs though, so I'm kicking around ideas around the best way to handle this. Specifically, I need to configure values for url, an optional HTTP method (default is GET), an optional collection of HTTP headers and an optional post body. I was thinking of just making a convention where key values could be incremented like below, but wanted to see if there were better suggestions out there. chukwa.trigger.action.[eventName].http.1.url=http://site.com/firstTrigger chukwa.trigger.action.[eventName].http.1.headers=User-Agent:chukwa chukwa.trigger.action.[eventName].http.2.url=http://site.com/secondTrigger chukwa.trigger.action.[eventName].http.2.method=POST chukwa.trigger.action.[eventName].http.2.headers=User-Agent:chukwa,Accepts:text/plain chukwa.trigger.action.[eventName].http.2.body=Some post body to submit chukwa.trigger.action.[eventName].http.N.url= chukwa.trigger.action.[eventName].http.N.method= chukwa.trigger.action.[eventName].http.N.headers= chukwa.trigger.action.[eventName].http.N.body= Since the action could potentially be used by other types of events, the event name should be included. This implies that we should add an eventName field to the TriggerAction.execute method in CHUKWA-477. Thoughts? thanks, Bill
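One way to consume the incremented keys proposed above is simply to walk the indexes until a .url key is missing. The sketch below is an assumption about how the action could read its configuration, not the eventual CHUKWA-477 code; the prefix string and the HttpRequest holder class are invented, while Configuration.get() is the standard Hadoop call.

import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.conf.Configuration;

public class HttpTriggerConfigReader {

    // simple holder for one configured request; hypothetical, for illustration only
    public static class HttpRequest {
        String url;
        String method;
        String headers;
        String body;
    }

    public static List<HttpRequest> read(Configuration conf, String eventName) {
        String prefix = "chukwa.trigger.action." + eventName + ".http.";
        List<HttpRequest> requests = new ArrayList<HttpRequest>();
        for (int i = 1; ; i++) {
            String url = conf.get(prefix + i + ".url");
            if (url == null) {
                break; // stop at the first missing index
            }
            HttpRequest r = new HttpRequest();
            r.url = url;
            r.method = conf.get(prefix + i + ".method", "GET");
            r.headers = conf.get(prefix + i + ".headers");
            r.body = conf.get(prefix + i + ".body");
            requests.add(r);
        }
        return requests;
    }
}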
Re: HttpTriggerAction - configuring N objects
Thanks guys. Eric, I think I'm on a similar path to what you're suggesting, except I'm using a two-step config to specify a.) the action classes, and b.) their configs. Jerome, I considered that, but it gets tricky with multi-value sets (like headers). At some point we get into delimiter overload. To get to specifics, here's an example of what I currently have: - A comma separated list of TriggerAction classes to be invoked upon successful completion of a demux run. This is similar to the current data loader pattern and can also be used wherever else we later want to fire TriggerActions. <property> <name>chukwa.post.demux.success.action</name> <value>org.apache.hadoop.chukwa.extraction.demux.HttpTriggerAction,some.other.TriggerAction</value> </property> - Then when the HttpTriggerAction runs, it knows how to look for its configs in the way that we've been discussing. For the above action, the trigger event name will be passed to TriggerAction as an enum. The enum would have a name like 'postDemuxSuccess', as well as a pointer to the base config string for the event, like 'chukwa.trigger.post.demux.success'. - HttpTriggerAction would in this case then know to look for its configs under chukwa.trigger.post.demux.success.http and look for values like this: <property> <name>chukwa.trigger.post.demux.success.http.1.url</name> <value>http://site.com/firstTrigger</value> </property> so the syntax of the config key then is: chukwa.[eventName].[actionNS].N.[actionKeys] Although other actions are free to implement all parts below eventName in whatever way makes the most sense for their needs. thanks, Bill On Wed, Apr 21, 2010 at 4:50 PM, Jerome Boulon jbou...@netflix.com wrote: Hi, You can have a root configuration key that will give the list of keys to look for: Or you can look for the existence of http.N key in the configuration object. Ex <property> <name>myConfig.eventName.list</name> <value>http.1, http.2, ..., http.x</value> </property> Can you clarify what the eventName is? /Jerome. On 4/21/10 3:31 PM, Bill Graham billgra...@gmail.com wrote: Hi, As a follow up to CHUKWA-477, I'm writing a class called HttpTriggerAction that can hit one or more URLs upon successful completion of a demux job. I'd like to contribute it back unless anyone objects. Anyway, I'm looking for feedback though on how to configure this object. The issue is that since the class can hit N urls, it needs N sets of key-value configurations. The hadoop configurations model is just name-value pairs though, so I'm kicking around ideas about the best way to handle this. Specifically, I need to configure values for url, an optional HTTP method (default is GET), an optional collection of HTTP headers and an optional post body. I was thinking of just making a convention where key values could be incremented like below, but wanted to see if there were better suggestions out there.
chukwa.trigger.action.[eventName].http.1.url=http://site.com/firstTrigger chukwa.trigger.action.[eventName].http.1.headers=User-Agent:chukwa chukwa.trigger.action.[eventName].http.2.url=http://site.com/secondTrigger chukwa.trigger.action.[eventName].http.2.method=POST chukwa.trigger.action.[eventName].http.2.headers=User-Agent:chukwa,Accepts:text/plain chukwa.trigger.action.[eventName].http.2.body=Some post body to submit chukwa.trigger.action.[eventName].http.N.url= chukwa.trigger.action.[eventName].http.N.method= chukwa.trigger.action.[eventName].http.N.headers= chukwa.trigger.action.[eventName].http.N.body= Since the action could potentially be used by other types of events, the event name should be included. This implies that we should add an eventName field to the TriggerAction.execute method in CHUKWA-477. Thoughts? thanks, Bill
Re: How to write log4j output to HDFS?
Hi, Check out Chukwa: http://hadoop.apache.org/chukwa/docs/r0.3.0/design.html#Introduction It allows you to run agents which tail log4j output and send the data to collectors, which write the data to HDFS. thanks, Bill On Wed, Apr 21, 2010 at 3:43 AM, Dhanya Aishwarya Palanisamy dhanya.aishwa...@gmail.com wrote: Hi, Has anyone tried to write a log4j log file directly to HDFS? If yes, please reply with how to achieve this. I am trying to create an appender. Is this the way? What I need is to write logs to a file at particular intervals and query that data at a later stage. Thanks, Dhanya
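Chukwa is the more robust route (agents, retries, file rotation). For the "write my own appender" approach the question asks about, the sketch below shows what a minimal custom log4j appender writing straight to HDFS might look like. It is only a rough sketch: the namenode URI and output path are invented, there is no rolling or reconnect handling, and it assumes a single writer.

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.log4j.AppenderSkeleton;
import org.apache.log4j.spi.LoggingEvent;

public class HdfsAppender extends AppenderSkeleton {
    private FileSystem fs;
    private FSDataOutputStream out;

    @Override
    public void activateOptions() {
        try {
            // placeholder namenode URI and output path
            fs = FileSystem.get(URI.create("hdfs://namenode:9000/"), new Configuration());
            out = fs.create(new Path("/logs/app.log"), true);
        } catch (Exception e) {
            errorHandler.error("could not open HDFS output", e, 0);
        }
    }

    @Override
    protected void append(LoggingEvent event) {
        if (out == null || this.layout == null) {
            return;
        }
        try {
            out.write(this.layout.format(event).getBytes("UTF-8"));
        } catch (Exception e) {
            errorHandler.error("could not write to HDFS", e, 0);
        }
    }

    @Override
    public void close() {
        try {
            if (out != null) out.close();
        } catch (Exception ignored) {
            // best effort on shutdown
        }
    }

    @Override
    public boolean requiresLayout() {
        return true;
    }
}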
Re: Demux trigger
Thanks, Eric. Looking into the PostProcessorManager code a little more, it seems the chukwa.post.demux.data.loader loaders get called before the post processor moves the finished files into place. I need a trigger that fires after they're in place in the repos/ dir. This is the code I'm referring to from PostProcessorManager. if ( processDemuxPigOutput(directoryToBeProcessed) == true) { if (movetoMainRepository(directoryToBeProcessed,chukwaRootReposDir) == true) { deleteDirectory(directoryToBeProcessed); ... continue; } } The data loaders get called as part of processDemuxPigOutput. Is this sequence correct or am I missing something? If this is in fact the case, I'd like to add a hook to take some post action once the files are in the /repos dir. From a users perspective, that's what 'post demux' implies. I'm open for suggestions re the best way to do that and what to call the configs. One thought is to follow a similar pattern as how DataLoaders are configured, but use a new interface that's more generic than loading data. Not 'Action', but something that denotes that. thanks, Bill On Sat, Apr 17, 2010 at 1:35 PM, Eric Yang ey...@yahoo-inc.com wrote: Yes. Regards, Eric On 4/16/10 9:56 PM, Bill Graham billgra...@gmail.com wrote: Thanks Eric, I'm glad I emailed before writing code. I can see how data loaders get triggered, but I don't see one that makes an HTTP request like I'm proposing. Are you suggesting I implement a new DataLoader that doesn't actually load data, but makes an HTTP request instead? On Fri, Apr 16, 2010 at 7:30 PM, Eric Yang ey...@yahoo-inc.com wrote: Hi Bill, This already exist in Chukwa. Take a look in DataLoaderFactory.java and SocketDataLoader.java, they are triggered after demux jobs. Hence, you can use PostProcessManager as triggers, and configure it through chukwa-demux-conf.xml, chukwa.post.demux.data.loader. Hope this helps. Regards, Eric On 4/16/10 4:39 PM, Bill Graham billgra...@gmail.com http://billgra...@gmail.com wrote: Hi, I'd like to add a feature to the DemuxManager where you can configure an HTTP request to be fired after a Demux run. It would be similar to what's currently there for Nagios alerts, only this would be HTTP (the Nagios alert is a raw TCP socket call). You'd configure the host, port, (POST|GET) and uri for this first pass. Some metadata about the job would also go along for the ride. Maybe things like status code and job name. The use case is to trigger a dependent job to run elsewhere upon completion. The same functionality could potentially be ported to some of the other chukwa processor jobs if the need arose. Thoughts? thanks, Bill
Re: Demux trigger
Sure, I can make that changes. I named it processDataLoaders since it handles a collection. I created a JIRA with a first pass of the implementation, along with a few points of discussion: https://issues.apache.org/jira/browse/CHUKWA-477 Let me know what you think. thanks, Bill On Mon, Apr 19, 2010 at 10:23 AM, Eric Yang ey...@yahoo-inc.com wrote: Yes this is indeed the case. +1 on adding hook inside the if block, to generate post move triggers. Could you also rename processDemuxPigOutput to processDataLoader? The name is more concise. Thanks Regards, Eric On 4/19/10 9:54 AM, Bill Graham billgra...@gmail.com wrote: Thanks, Eric. Looking into the PostProcessorManager code a little more, it seems the chukwa.post.demux.data.loader loaders get called before the post processor moves the finished files into place. I need a trigger that fires after they're in place in the repos/ dir. This is the code I'm referring to from PostProcessorManager. if ( processDemuxPigOutput(directoryToBeProcessed) == true) { if (movetoMainRepository(directoryToBeProcessed,chukwaRootReposDir) == true) { deleteDirectory(directoryToBeProcessed); ... continue; } } The data loaders get called as part of processDemuxPigOutput. Is this sequence correct or am I missing something? If this is in fact the case, I'd like to add a hook to take some post action once the files are in the /repos dir. From a users perspective, that's what 'post demux' implies. I'm open for suggestions re the best way to do that and what to call the configs. One thought is to follow a similar pattern as how DataLoaders are configured, but use a new interface that's more generic than loading data. Not 'Action', but something that denotes that. thanks, Bill On Sat, Apr 17, 2010 at 1:35 PM, Eric Yang ey...@yahoo-inc.com wrote: Yes. Regards, Eric On 4/16/10 9:56 PM, Bill Graham billgra...@gmail.com http://billgra...@gmail.com wrote: Thanks Eric, I'm glad I emailed before writing code. I can see how data loaders get triggered, but I don't see one that makes an HTTP request like I'm proposing. Are you suggesting I implement a new DataLoader that doesn't actually load data, but makes an HTTP request instead? On Fri, Apr 16, 2010 at 7:30 PM, Eric Yang ey...@yahoo-inc.com http://ey...@yahoo-inc.com wrote: Hi Bill, This already exist in Chukwa. Take a look in DataLoaderFactory.java and SocketDataLoader.java, they are triggered after demux jobs. Hence, you can use PostProcessManager as triggers, and configure it through chukwa-demux-conf.xml, chukwa.post.demux.data.loader. Hope this helps. Regards, Eric On 4/16/10 4:39 PM, Bill Graham billgra...@gmail.com http://billgra...@gmail.com http://billgra...@gmail.com wrote: Hi, I'd like to add a feature to the DemuxManager where you can configure an HTTP request to be fired after a Demux run. It would be similar to what's currently there for Nagios alerts, only this would be HTTP (the Nagios alert is a raw TCP socket call). You'd configure the host, port, (POST|GET) and uri for this first pass. Some metadata about the job would also go along for the ride. Maybe things like status code and job name. The use case is to trigger a dependent job to run elsewhere upon completion. The same functionality could potentially be ported to some of the other chukwa processor jobs if the need arose. Thoughts? thanks, Bill
Re: Demux trigger
Thanks Eric, I'm glad I emailed before writing code. I can see how data loaders get triggered, but I don't see one that makes an HTTP request like I'm proposing. Are you suggesting I implement a new DataLoader that doesn't actually load data, but makes an HTTP request instead? On Fri, Apr 16, 2010 at 7:30 PM, Eric Yang ey...@yahoo-inc.com wrote: Hi Bill, This already exist in Chukwa. Take a look in DataLoaderFactory.java and SocketDataLoader.java, they are triggered after demux jobs. Hence, you can use PostProcessManager as triggers, and configure it through chukwa-demux-conf.xml, chukwa.post.demux.data.loader. Hope this helps. Regards, Eric On 4/16/10 4:39 PM, Bill Graham billgra...@gmail.com wrote: Hi, I'd like to add a feature to the DemuxManager where you can configure an HTTP request to be fired after a Demux run. It would be similar to what's currently there for Nagios alerts, only this would be HTTP (the Nagios alert is a raw TCP socket call). You'd configure the host, port, (POST|GET) and uri for this first pass. Some metadata about the job would also go along for the ride. Maybe things like status code and job name. The use case is to trigger a dependent job to run elsewhere upon completion. The same functionality could potentially be ported to some of the other chukwa processor jobs if the need arose. Thoughts? thanks, Bill
Re: Making TsProcessor's date format configurable
Sure, thanks Jerome. I assigned you the JobConf work: https://issues.apache.org/jira/browse/CHUKWA-471 And I've got the date format for TsProcessor JIRA: https://issues.apache.org/jira/browse/CHUKWA-472 As well as making the default processor configurable: https://issues.apache.org/jira/browse/CHUKWA-473 For this last one how about a config like this: <property> <name>chukwa.demux.default.processor</name> <value>org.apache.hadoop.chukwa.extraction.demux.processor.mapper.DefaultProcessor</value> </property> On Tue, Apr 6, 2010 at 10:57 AM, Jerome Boulon jbou...@netflix.com wrote: Hi, When you create a Jira for that, can you create a separate one for JobConf? I'll submit a patch for it. Thanks, /Jerome On 4/6/10 10:50 AM, Jerome Boulon jbou...@netflix.com wrote: Hi, Could you also make sure that you force sdf to be GMT? sdf.setTimeZone(TimeZone.getTimeZone("GMT")); - Instead of an if/then/else you could use the default value in conf.get(key, defaultVal) to set the default format. - You can load the jobConf directly from the mapper/reducer but you will have to add a new method to the AbstractProcessor/Reducer; then any parser/reducer class will have access to it. We don't need a distributed cache to do that. Thanks, /Jerome. On 4/6/10 10:18 AM, Eric Yang ey...@yahoo-inc.com wrote: Hi Bill, We can introduce some configuration like this in chukwa-demux-conf.xml: <property> <name>TsProcessor.time.format.some_data_type</name> <value>yyyy-MM-dd HH:mm:ss,SSS</value> </property> Move the SimpleDateFormat outside of the constructor: StringBuilder format = new StringBuilder(); format.append("TsProcessor.time.format."); format.append(chunk.getDataType()); if (conf.get(format.toString()) != null) { sdf = new SimpleDateFormat(conf.get(format.toString())); } else { sdf = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss,SSS"); } It will require changing the MapperFactory class to include the running JobConf as a HashMap, or loading the JobConf from the distributed cache. Regards, Eric On 4/6/10 9:55 AM, Bill Graham billgra...@gmail.com wrote: Hi, I'd like to be able to configure the date format for TsProcessor. Looking at the code, others have had the same thought: public TsProcessor() { // TODO move that to config sdf = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss,SSS"); } I can write a patch to support this change, but how do we want to make the date configurable? Currently there is a single config (AFAIK) that binds the processor class to a data type in chukwa-demux-conf.xml that looks like this: <property> <name>some_data_type</name> <value>org.apache.hadoop.chukwa.extraction.demux.processor.mapper.TsProcessor</value> <description>Parser for some_data_type</description> </property> Any suggestions for how we'd incorporate date format into that config? Or perhaps it would be a separate conf. Are there any examples in the code of processors that take configurations currently? As a side note, I'd also like to add a configuration for what the default processor should be, since I'd prefer to change ours from DefaultProcessor to TsProcessor. Maybe 'chukwa.demux.default.processor'? Thoughts? thanks, Bill
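Pulling the thread's suggestions together (Eric's per-datatype key, Jerome's conf.get(key, default) and GMT points), the per-datatype lookup could look roughly like this. This is not the committed CHUKWA-472 patch, just a sketch; the key prefix and default format follow the emails above, and the helper's name and shape are made up.

import java.text.SimpleDateFormat;
import java.util.TimeZone;
import org.apache.hadoop.conf.Configuration;

public class TsProcessorDateFormats {
    public static SimpleDateFormat forDataType(Configuration conf, String dataType) {
        // fall back to the current hard-coded format when no per-datatype key is set
        String format = conf.get("TsProcessor.time.format." + dataType,
                "yyyy-MM-dd HH:mm:ss,SSS");
        SimpleDateFormat sdf = new SimpleDateFormat(format);
        sdf.setTimeZone(TimeZone.getTimeZone("GMT"));
        return sdf;
    }
}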
Re: Web/Data Analytics and Data Collection using Hadoop
Hi Utku, We're using Chukwa to collect and aggregate data as you describe and so far it's working well. Typically chukwa collectors are deployed to all data nodes, so there is no master write-bottleneck with this approach actually. There have been discussions lately on the Chukwa list regarding how to write data into HBase using Chukwa collectors or data processors that you might want to check out. thanks, Bill On Mon, Mar 22, 2010 at 4:50 AM, Utku Can Topçu u...@topcu.gen.tr wrote: Hey All, Currently in a project I'm involved, we're about to make design choices regarding the use of Hadoop as a scalable and distributed data analytics framework. Basically the application would be the base of a Web Analytics tool, so I do have the vision that Hadoop would be the finest choice for analyzing the collected data. But for the collection of data is somewhat a different issue to consider, there needs to be serious design decision taken for the data collection architecture. Actually, I'd like to have a distributed and scalable data collection in production. The current situation is like we have multiple of servers in 3-4 different locations, each collect some sort of data. The basic approach on analyzing this distributed data would be: logging them into structured text files so that we'll be able to transfer them to the hadoop cluster and analyze them using some MapReduce jobs. The basic process I define follows like this - Transfer log files to Hadoop master, (collectors to master) - Put the files on the master to the HDFS, (master to the cluster) As it's clear there's an overhead in the transfer of the log files. And the big log files will have to be analyzed even if you'll somehow need a small portion of the data. One better other option is, logging directly to a distributed database like Cassandra and HBase, so the MapReduce jobs would be fetching the data from the databases and doing the analysis. And the data will also be randomly accessible and open to queries in real-time. I'm not that much familiar in this area of distributed databases, however I can see that, -If we're using cassandra for storing logging information, we won't have a connection overhead for writing the data to the Cassandra cluster, since all nodes in the cluster are able to accept incoming write requests. However in HBase I'm afraid we'll need to write to the master only, so in such situation, there seems to be a connection overhead on the master and we can only scale up-to the levels that the through-put of master. Logging to HBase doesn't seem scalable from this point of view. -On the other hand, using a different Cassandra cluster which is not directly from the ecosystem of Hadoop, I'm afraid we'll lose the concept of data locality while using the data for analysis in MapReduce jobs if Cassandra was the choice for keeping the log data. However in the case of HBase we'll be able to use the data locality since it's directly related to the HDFS. -Is there a stable way for integrating Cassandra with Hadoop? So finally Chukwa seems to be a good choice for such kind of a data collection. Where each server that can be defined as sources will be running Agents on them, so they can transfer the data to the Collectors that reside close to the HDFS. After series of pipe-lined processes the data would be clearly available for analysis using MapReduce jobs. 
I see some connection overhead due to the through-put of master in this scenario and the files that need to be analyzed will also be again available in big files, so a sample range of the data analysis would require the reading of the full files. I feel like these are the brief options I figured out till now. Actually all decision will come with some kind of a drawback and provide some decision specific more functionality compared to the others. Is there anyone on the list who solved the need in such functionality previously? I'm open to all kind of comments and suggestions, Best Regards, Utku
Re: Web/Data Analytics and Data Collection using Hadoop
Sure, any framework that writes data into HDFS will need to communicate with the namenode. So yes, there can potentially be large numbers of connections to the namenode. I (possibly mistakenly) thought you were speaking specifically of a bottleneck caused by writes through a single master node. The actually data does not go through the name node though, so there is no bottleneck in the data flow. On Mon, Mar 22, 2010 at 10:50 AM, Utku Can Topçu u...@topcu.gen.tr wrote: Hi Bill, Thank you for your comments, The main thing about the Chukwa installation on top of Hadoop is I guess, you somehow need to connect to the namenode from the collectors. Isn't it the case while trying to reach the HDFS, or the Chukwa collectors are writing on the local drives instead of HDFS? Best, Utku On Mon, Mar 22, 2010 at 6:34 PM, Bill Graham billgra...@gmail.com wrote: Hi Utku, We're using Chukwa to collect and aggregate data as you describe and so far it's working well. Typically chukwa collectors are deployed to all data nodes, so there is no master write-bottleneck with this approach actually. There have been discussions lately on the Chukwa list regarding how to write data into HBase using Chukwa collectors or data processors that you might want to check out. thanks, Bill On Mon, Mar 22, 2010 at 4:50 AM, Utku Can Topçu u...@topcu.gen.tr wrote: Hey All, Currently in a project I'm involved, we're about to make design choices regarding the use of Hadoop as a scalable and distributed data analytics framework. Basically the application would be the base of a Web Analytics tool, so I do have the vision that Hadoop would be the finest choice for analyzing the collected data. But for the collection of data is somewhat a different issue to consider, there needs to be serious design decision taken for the data collection architecture. Actually, I'd like to have a distributed and scalable data collection in production. The current situation is like we have multiple of servers in 3-4 different locations, each collect some sort of data. The basic approach on analyzing this distributed data would be: logging them into structured text files so that we'll be able to transfer them to the hadoop cluster and analyze them using some MapReduce jobs. The basic process I define follows like this - Transfer log files to Hadoop master, (collectors to master) - Put the files on the master to the HDFS, (master to the cluster) As it's clear there's an overhead in the transfer of the log files. And the big log files will have to be analyzed even if you'll somehow need a small portion of the data. One better other option is, logging directly to a distributed database like Cassandra and HBase, so the MapReduce jobs would be fetching the data from the databases and doing the analysis. And the data will also be randomly accessible and open to queries in real-time. I'm not that much familiar in this area of distributed databases, however I can see that, -If we're using cassandra for storing logging information, we won't have a connection overhead for writing the data to the Cassandra cluster, since all nodes in the cluster are able to accept incoming write requests. However in HBase I'm afraid we'll need to write to the master only, so in such situation, there seems to be a connection overhead on the master and we can only scale up-to the levels that the through-put of master. Logging to HBase doesn't seem scalable from this point of view. 
-On the other hand, using a different Cassandra cluster which is not directly from the ecosystem of Hadoop, I'm afraid we'll lose the concept of data locality while using the data for analysis in MapReduce jobs if Cassandra was the choice for keeping the log data. However in the case of HBase we'll be able to use the data locality since it's directly related to the HDFS. -Is there a stable way for integrating Cassandra with Hadoop? So finally Chukwa seems to be a good choice for such kind of a data collection. Where each server that can be defined as sources will be running Agents on them, so they can transfer the data to the Collectors that reside close to the HDFS. After series of pipe-lined processes the data would be clearly available for analysis using MapReduce jobs. I see some connection overhead due to the through-put of master in this scenario and the files that need to be analyzed will also be again available in big files, so a sample range of the data analysis would require the reading of the full files. I feel like these are the brief options I figured out till now. Actually all decision will come with some kind of a drawback and provide some decision specific more functionality compared to the others. Is there anyone on the list who solved the need in such functionality previously? I'm open to all kind of comments
Re: PigServer memory leak
I believe I've found the cause of my Pig memory leak so I wanted to report back. I profiled my app after letting it run for a couple of days and found that the static toDelete Stack in the FileLocalizer object was growing over time without getting flushed. I had thousands of HFile objects in that stack. This produced a memory leak both in my app and in HDFS. The fix seems straightforward enough in my app. I suspect calling FileLocalizer.deleteTempFiles() after each usage of PigServer for a given execution of a given pig script will do the trick. This seems to be a major gotcha though that will likely burn others. I suggest we add FileLocalizer.deleteTempFiles() to the shutdown() method of PigServer. Thoughts? Currently shutdown isn't doing much: public void shutdown() { // clean-up activities // TODO: reclaim scope to free up resources. Currently // this is not implemented and throws an exception // hence, for now, we won't call it. // // pigContext.getExecutionEngine().reclaimScope(this.scope); } thanks, Bill On Wed, Mar 10, 2010 at 12:15 PM, Bill Graham billgra...@gmail.com wrote: Yes, these errors appear in the Pig client and the jobs are definitely being executed on the cluster. I can see the data in HDFS and the jobs in the JobTracker UI of the cluster. On Wed, Mar 10, 2010 at 10:54 AM, Ashutosh Chauhan ashutosh.chau...@gmail.com wrote: [Low Memory Detector] [INFO] SpillableMemoryManager.java:143 low memory handler called Are you seeing this warning on client side, in pig logs? If so, then are you sure your job is actually running on real hadoop cluster. Because these logs should appear in task-tracker logs not in client logs. This may imply that you job is getting executed locally in local mode and not actually submitted to cluster. Look for the very first lines in the client logs, where Pig tries to connect to the cluster. See, if its successful in doing so. On Wed, Mar 10, 2010 at 10:15, Ashutosh Chauhan ashutosh.chau...@gmail.com wrote: Posting for Bill. -- Forwarded message -- From: Bill Graham billgra...@gmail.com Date: Wed, Mar 10, 2010 at 10:11 Subject: Re: PigServer memory leak To: ashutosh.chau...@gmail.com Thanks for the reply, Ashutosh. [hadoop.apache.org keeps flagging my reply as spam, so I'm replying directly to you. Feel free to push this conversation back onto the list, if you can. :)] I'm running the same two scripts, one after the other, every 5 minutes. The scripts have dynamic tokens substituted to change the input and output directories. Besides that, they have the same logic. I will try to execute the script from grunt next time it happens, but I don't see how a lack of pig MR optimizations could cause a memory issue on the client? If I bounce my daemon, the next jobs to run executes without a problem upon start, so I would also expect a script run through grunt at that time to run without a problem as well. I reverted back to re-initializing PigServer for every run. I have other places in my scheduled workflow where I interact with HDFS which I've now modified to re-use an instance of Hadoop's Configuration object for the life of the VM. I was re-initializing that many times per run. Looking at the Configuration code it seems to re-parse the XML configs into a DOM every time it's called, so this certainly looks like a place for a potential leak. If nothing else it should give me an optimization. Configuration seems to be stateless and read-only after initiation so this seems safe. Anyway, here are my two scripts. 
The first generates summaries, the second makes a report from the summaries and they run in separate PigServer instances via registerQuery(..). Let me know if you see anything that seems off: define chukwaLoader org.apache.hadoop.chukwa. ChukwaStorage(); define tokenize com.foo.hadoop.mapreduce.pig.udf.TOKENIZE(); define regexMatch com.foo.hadoop.mapreduce.pig.udf.REGEX_MATCH(); define timePeriod org.apache.hadoop.chukwa.TimePartition('@TIME_PERIOD@'); raw = LOAD '@HADOOP_INPUT_LOCATION@' USING chukwaLoader AS (ts: long, fields); bodies = FOREACH raw GENERATE tokenize((chararray)fields#'body') as tokens, timePeriod(ts) as time; -- pull values out of the URL tokens1 = FOREACH bodies GENERATE (int)regexMatch($0.token4, '(?:[?])ptId=([^]*)', 1) as pageTypeId, (int)regexMatch($0.token4, '(?:[?])sId=([^]*)', 1) as siteId, (int)regexMatch($0.token4, '(?:[?])aId=([^]*)', 1) as assetId, time, regexMatch($0.token4, '(?:[?])tag=([^]*)', 1) as tagValue; -- filter out entries without an assetId tokens2 = FILTER tokens1 BY (assetId is not null) AND (pageTypeId is not null) AND (siteId is not null); -- group by tagValue, time, assetId and flatten to get counts grouped = GROUP tokens2 BY (tagValue
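For anyone wiring this into a scheduled daemon, the workaround described above boils down to clearing Pig's temp-file bookkeeping after each run. The fragment below is a sketch under those assumptions: the query and the input/output paths are placeholders, and the PigServer/FileLocalizer calls are as exposed in the Pig 0.x line discussed here.

import org.apache.pig.PigServer;
import org.apache.pig.impl.io.FileLocalizer;

public class ScheduledPigRun {
    public static void runOnce() throws Exception {
        PigServer pig = new PigServer("mapreduce");
        try {
            // placeholder script; in the daemon these come from the templated .pig files
            pig.registerQuery("raw = LOAD '/some/input' USING PigStorage('\\t');");
            pig.store("raw", "/some/output");
        } finally {
            // without this, FileLocalizer's static toDelete stack grows for the life of the VM
            FileLocalizer.deleteTempFiles();
            pig.shutdown();
        }
    }
}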
Re: Is there a way to suppress the attempt logs?
Not sure if what you're asking is possible or not, but you could experiment with these params to see if you could achieve a similar effect. <property> <name>mapred.userlog.limit.kb</name> <value>0</value> <description>The maximum size of user-logs of each task in KB. 0 disables the cap.</description> </property> <property> <name>mapred.userlog.retain.hours</name> <value>24</value> <description>The maximum time, in hours, for which the user-logs are to be retained.</description> </property> On Mon, Mar 15, 2010 at 5:54 PM, abhishek sharma absha...@usc.edu wrote: Hi all, Hadoop creates a directory (and some files) for each map and reduce task attempt in logs/userlogs on each tasktracker. Is there a way to configure Hadoop not to create these attempt logs? Thanks, Abhishek
Re: Something wrong the pig-user mail list ?
Yesterday I couldn't send emails to this list. Google was reporting that apache was blocking them as spam. We'll see if this goes through... On Wed, Mar 10, 2010 at 6:09 PM, Dmitriy Ryaboy dvrya...@gmail.com wrote: This one appears to have worked.. On Wed, Mar 10, 2010 at 6:07 PM, Jeff Zhang zjf...@gmail.com wrote: I always receive the failed delivery message from google. -- Best Regards Jeff Zhang
Re: PigServer memory leak
Yes, these errors appear in the Pig client and the jobs are definitely being executed on the cluster. I can see the data in HDFS and the jobs in the JobTracker UI of the cluster. On Wed, Mar 10, 2010 at 10:54 AM, Ashutosh Chauhan ashutosh.chau...@gmail.com wrote: [Low Memory Detector] [INFO] SpillableMemoryManager.java:143 low memory handler called Are you seeing this warning on client side, in pig logs? If so, then are you sure your job is actually running on real hadoop cluster. Because these logs should appear in task-tracker logs not in client logs. This may imply that you job is getting executed locally in local mode and not actually submitted to cluster. Look for the very first lines in the client logs, where Pig tries to connect to the cluster. See, if its successful in doing so. On Wed, Mar 10, 2010 at 10:15, Ashutosh Chauhan ashutosh.chau...@gmail.com wrote: Posting for Bill. -- Forwarded message -- From: Bill Graham billgra...@gmail.com Date: Wed, Mar 10, 2010 at 10:11 Subject: Re: PigServer memory leak To: ashutosh.chau...@gmail.com Thanks for the reply, Ashutosh. [hadoop.apache.org keeps flagging my reply as spam, so I'm replying directly to you. Feel free to push this conversation back onto the list, if you can. :)] I'm running the same two scripts, one after the other, every 5 minutes. The scripts have dynamic tokens substituted to change the input and output directories. Besides that, they have the same logic. I will try to execute the script from grunt next time it happens, but I don't see how a lack of pig MR optimizations could cause a memory issue on the client? If I bounce my daemon, the next jobs to run executes without a problem upon start, so I would also expect a script run through grunt at that time to run without a problem as well. I reverted back to re-initializing PigServer for every run. I have other places in my scheduled workflow where I interact with HDFS which I've now modified to re-use an instance of Hadoop's Configuration object for the life of the VM. I was re-initializing that many times per run. Looking at the Configuration code it seems to re-parse the XML configs into a DOM every time it's called, so this certainly looks like a place for a potential leak. If nothing else it should give me an optimization. Configuration seems to be stateless and read-only after initiation so this seems safe. Anyway, here are my two scripts. The first generates summaries, the second makes a report from the summaries and they run in separate PigServer instances via registerQuery(..). Let me know if you see anything that seems off: define chukwaLoader org.apache.hadoop.chukwa. 
ChukwaStorage(); define tokenize com.foo.hadoop.mapreduce.pig.udf.TOKENIZE(); define regexMatch com.foo.hadoop.mapreduce.pig.udf.REGEX_MATCH(); define timePeriod org.apache.hadoop.chukwa.TimePartition('@TIME_PERIOD@ '); raw = LOAD '@HADOOP_INPUT_LOCATION@' USING chukwaLoader AS (ts: long, fields); bodies = FOREACH raw GENERATE tokenize((chararray)fields#'body') as tokens, timePeriod(ts) as time; -- pull values out of the URL tokens1 = FOREACH bodies GENERATE (int)regexMatch($0.token4, '(?:[?])ptId=([^]*)', 1) as pageTypeId, (int)regexMatch($0.token4, '(?:[?])sId=([^]*)', 1) as siteId, (int)regexMatch($0.token4, '(?:[?])aId=([^]*)', 1) as assetId, time, regexMatch($0.token4, '(?:[?])tag=([^]*)', 1) as tagValue; -- filter out entries without an assetId tokens2 = FILTER tokens1 BY (assetId is not null) AND (pageTypeId is not null) AND (siteId is not null); -- group by tagValue, time, assetId and flatten to get counts grouped = GROUP tokens2 BY (tagValue, time, assetId, pageTypeId, siteId); flattened = FOREACH grouped GENERATE FLATTEN(group) as (tagValue, time, assetId, pageTypeId, siteId), COUNT(tokens2) as count; shifted = FOREACH flattened GENERATE time, count, assetId, pageTypeId, siteId, tagValue; -- order and store ordered = ORDER shifted BY tagValue ASC, count DESC, assetId DESC, pageTypeId ASC, siteId ASC, time DESC; STORE ordered INTO '@HADOOP_OUTPUT_LOCATION@'; raw = LOAD '@HADOOP_INPUT_LOCATION@' USING PigStorage('\t') AS (ts: long, count: int, assetId: int, pageTypeId: chararray, siteId: int, tagValue: chararray); -- now store most popular overall - filtered by pageTypeId most_popular_filtered = FILTER raw BY (siteId == 162) AND (pageTypeId matches '(2100)|(1606)|(1801)|(2300)|(2718)'); most_popular = GROUP most_popular_filtered BY (ts, assetId, pageTypeId); most_popular_flattened = FOREACH most_popular GENERATE FLATTEN(group) as (ts, assetId, pageTypeId), SUM(most_popular_filtered.count) as count; most_popular_shifted = FOREACH most_popular_flattened GENERATE ts, count, assetId, (int)pageTypeId
PigServer memory leak
hi, I've got a long-running daemon application that periodically kicks off Pig jobs via quartz (Pig version 0.4.0). It uses a wrapper class that initializes an instance of PigServer before parsing and executing a pig script. As implemented, the app would leak memory and after a while jobs would fail to run with messages like this appearing in the logs: [Low Memory Detector] [INFO] SpillableMemoryManager.java:143 low memory handler called To fix the issue, I created an instance of PigServer at application initialization and I re-use that instance for all jobs for the life of the daemon. Problem solved. So my question is, is this a bug in PigServer that it leaks memory when multiple instances are created, or is that just improper use of the class? thanks, Bill
Re: PigServer memory leak
Actually, upon closer investigation, re-using PigServer isn't working as well as I thought. I'm digging into the issue. To step back a bit though, I want to pose a different question: What is the intended usage of PigServer and PigContext w.r.t. its scope? Should a new instance of each be used for every job or is one or the other intended for re-use throughout the lifecycle of the VM instance? Digging into the code of PigServer it seems like it's intended to be used for a single script's execution only, but it's not entirely clear if that's the case. On Tue, Mar 9, 2010 at 9:29 AM, Bill Graham billgra...@gmail.com wrote: hi, I've got a long-running daemon application that periodically kicks off Pig jobs via quartz (Pig version 0.4.0). It uses a wrapper class that initializes an instance of PigServer before parsing and executing a pig script. As implemented, the app would leak memory and after a while jobs would fail to run with messages like this appearing in the logs: [Low Memory Detector] [INFO] SpillableMemoryManager.java:143 low memory handler called To fix the issue, I created an instance of PigServer at application initialization and I re-use that instance for all jobs for the life of the daemon. Problem solved. So my question is, is this a bug in PigServer that it leaks memory when multiple instances are created, or is that just improper use of the class? thanks, Bill
Re: filter/join by sql like %pattern condition
You could specify a condition using the RegexMatch or RegexExtract UDF in piggybank: http://svn.apache.org/repos/asf/hadoop/pig/trunk/contrib/piggybank/java/src/main/java/org/apache/pig/piggybank/evaluation/string/RegexMatch.java http://svn.apache.org/repos/asf/hadoop/pig/trunk/contrib/piggybank/java/src/main/java/org/apache/pig/piggybank/evaluation/string/RegexExtract.java On Thu, Feb 25, 2010 at 10:17 AM, Jan Zimmek jan.zim...@toptarif.de wrote: hi, i recently found pig, really like it and want to use it for one of our actual projects. getting the basics running was easy, but now i am struggling with a problem. i am trying to get customers whose email is not blacklisted. blacklist entries can be specified as: n...@domain.de or wildcarded @domain.de in sql i would solve this by: select * from customer c left join blacklist b on c.email like concat('%', b.email) where b.email is null this is the structure of my input files: raw_customer = LOAD 'customer.csv' USING PigStorage('\t') AS (id: long, email: chararray); raw_blacklist = LOAD 'blacklist.csv' USING PigStorage('\t') AS (email: chararray); how would i solve this using pig? - especially handling the like % condition. i already looked into udf, but need some advice how to implement this. any help would be really appreciated. regards, jan
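The matching that the SQL LIKE '%...' expresses is just a suffix test, which is what a regex-based UDF ends up doing. The snippet below only illustrates that logic in plain Java (the sample addresses reuse the redacted ones from the mail); it is not the piggybank RegexMatch code itself.

import java.util.regex.Pattern;

public class BlacklistMatch {
    // true if the email ends with the blacklist entry, i.e. LIKE concat('%', entry)
    public static boolean matches(String email, String blacklistEntry) {
        Pattern p = Pattern.compile(".*" + Pattern.quote(blacklistEntry));
        return p.matcher(email).matches();
    }

    public static void main(String[] args) {
        System.out.println(matches("n...@domain.de", "@domain.de"));     // true
        System.out.println(matches("n...@domain.de", "n...@domain.de")); // true
        System.out.println(matches("someone@other.de", "@domain.de"));   // false
    }
}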
Re: Some additions to the hive jdbc driver.
This would certainly be useful. When creating the JIRA you can make it a child task to this one: https://issues.apache.org/jira/browse/HIVE-576 On Wed, Feb 3, 2010 at 1:18 AM, 김영우 warwit...@gmail.com wrote: Hi Bennie, Sounds great! That should be very useful for users. It would be nice to have more jdbc functionality on hive jdbc. Thanks, Youngwoo 2010/2/3 Bennie Schut bsc...@ebuddy.com I've been using the hive jdbc driver more and more and was missing some functionality, which I added: HiveDatabaseMetaData.getTables, using "show tables" to get the info from hive, and HiveDatabaseMetaData.getColumns, using "describe tablename" to get the columns. This makes using something like SQuirreL a lot nicer since you have the list of tables and can just click on the content tab to see what's in the table. I also implemented HiveResultSet.getObject(String columnName) so you can call most get* methods based on the column name. Is it worth making a patch for this?
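For a sense of what Bennie's getTables addition boils down to, the sketch below issues the same HiveQL command through a plain JDBC Statement and collects the first column of the result. It is a client-side illustration only, not the HiveDatabaseMetaData implementation, and whether "show tables" returns rows through the driver depends on the Hive version in use.

import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.Statement;
import java.util.ArrayList;
import java.util.List;

public class HiveTableList {
    public static List<String> listTables(Connection conn) throws Exception {
        List<String> tables = new ArrayList<String>();
        Statement stmt = conn.createStatement();
        try {
            // same command HiveDatabaseMetaData.getTables is described as wrapping
            ResultSet rs = stmt.executeQuery("show tables");
            while (rs.next()) {
                tables.add(rs.getString(1));
            }
        } finally {
            stmt.close();
        }
        return tables;
    }
}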
Re: ChukwaArchiveManager and DemuxManager
I had a lot of questions regarding the data flow as well. I spent a while reverse engineering it and wrote something up on our internal wiki. I believe this is what's happening. If others with more knowledge could verify what I have below, I'll gladly move it to a wiki on the Chukwa site. Regarding your specific question, I believe the DemuxManager process is the first step in aggregating the data sink files. It moves the chunks to the dataSinkArchives directory once it's done with them. The ArchiveManager later archives those chunks. 1. Collectors write chunks to logs/*.chukwa files until a 64MB chunk size is reached or a given time interval is reached. - to: logs/*.chukwa 2. Collectors close chunks and rename them to *.done - from: logs/*.chukwa - to: logs/*.done 3. DemuxManager wakes up every 20 seconds, runs M/R to merges *.donefiles and moves them. - from: logs/*.done - to: demuxProcessing/mrInput - to: demuxProcessing/mrOutput - to: dataSinkArchives/[MMdd]/*/*.done 4. PostProcessManager wakes up every few minutes and aggregates, orders and de-dups record files. - from: postProcess/demuxOutputDir_*/[clusterName]/[dataType]/[dataType]_[MMdd]_[HH].R.evt - to: repos/[clusterName]/[dataType]/[MMdd]/[HH]/[mm]/[dataType]_[MMdd]_[HH]_[N].[N].evt 5. HourlyChukwaRecordRolling runs M/R jobs at 16 past the hour to group 5 minute logs to hourly. - from: repos/[clusterName]/[dataType]/[MMdd]/[HH]/[mm]/[dataType]_[MMdd]_[mm].[N].evt - to: temp/hourlyRolling/[clusterName]/[dataType]/[MMdd] - to: repos/[clusterName]/[dataType]/[MMdd]/[HH]/[dataType]_HourlyDone_[MMdd]_[HH].[N].evt - leaves: repos/[clusterName]/[dataType]/[MMdd]/[HH]/rotateDone/ 6. DailyChukwaRecordRolling runs M/R jobs at 1:30AM to group hourly logs to daily. - from: repos/[clusterName]/[dataType]/[MMdd]/[HH]/[dataType]_[MMdd]_[HH].[N].evt - to: temp/dailyRolling/[clusterName]/[dataType]/[MMdd] - to: repos/[clusterName]/[dataType]/[MMdd]/[dataType]_DailyDone_[MMdd].[N].evt - leaves: repos/[clusterName]/[dataType]/[MMdd]/rotateDone/ 7. ChukwaArchiveManager every half hour or so aggregates and removes dataSinkArchives data using M/R. - from: dataSinkArchives/[MMdd]/*/*.done - to: archivesProcessing/mrInput - to: archivesProcessing/mrOutput - to: finalArchives/[MMdd]/*/chukwaArchive-part-* thanks, Bill On Tue, Feb 2, 2010 at 10:21 AM, Corbin Hoenes cor...@tynt.com wrote: I am trying to understand the flow of data inside hdfs as it's processed by the data processor script. I see the archive.sh and demux.sh are run which runs ArchiveManager and DemuxManager. It appears to that just reading the code that they both are looking at the data sink (default /chukwa/logs). Can someone shed some light on how ArchiveManager and DemuxManager interact? E.g. I was under the impression that the data flowed through the archiving process first then got fed into the demuxing after it had created .arc files.
Re: ChukwaArchiveManager and DemuxManager
I'm doing the same thing with Pig and log files. If the date format/location of your log entries doesn't match the chukwa date format found in the TsProcessor, you'll need to write your own. The TsProcessor is a good example to follow. You'll need to configure your processor to be used for your datatype in chukwa-demux-conf.xml. Even if you use the TsProcessor, you'll need to configure that in chukwa-demux-conf.xml, since the default processor is DefaultProcessor (despite a bug in the wiki documentation that says TsProcessor). If you write your own processor, be aware of this JIRA: https://issues.apache.org/jira/browse/CHUKWA-440 The processor should be the only Chukwa customization you need to do. If you follow the TsProcessor example, all you're doing is determining the timestamp of the record. All the downstream processes should work fine without customization. If your processor is working properly you should see *.evt files written beneith the repos/ dir. If it's not working, the data will go into an error directory, probably because the date parsing failed (the MR logs will indicate the cause). These are the files you'll write your Pig scripts against using the ChukwaStorage class to read the Chukwa sequence files. Here's an example of the start of a script which normalizes the timestamp of the record down to 5 minutes: define chukwaLoader org.apache.hadoop.chukwa.ChukwaStorage(); define timePeriod org.apache.hadoop.chukwa.TimePartition('30'); raw = LOAD '/chukwa/repos/path/to/evt/files' USING chukwaLoader AS (ts: long, fields); bodies = FOREACH raw GENERATE (chararray)fields#'body' as body, timePeriod(ts) as time; Also, if you want to generate a sequence file from an apache log for testing, without setting up the chukwa cluster you can use the utility discussed here FYI: https://issues.apache.org/jira/browse/CHUKWA-449 HTH, Bill On Tue, Feb 2, 2010 at 11:18 AM, Corbin Hoenes cor...@tynt.com wrote: This is exactly what I've been trying to create so that I can understand how we can use the data once in chukwa. Our goal is to use pig to process our apache logs. It looks like I need to customize the demux with a custom processor to create a chukwa record per line in the log file since right now we get a chukwa record per chunk which isn't useful to our pig scripts. I noticed in another conversation you've written a custom processor. What kinds of data are you processing? Did you find you had to split up the chunked data into individual ChukwaRecords? And how does this affect the rest of processing (archiving,postprocessing etc...) I am trying to understand how much customization I'm going to have to do. On Feb 2, 2010, at 11:56 AM, Bill Graham wrote: I had a lot of questions regarding the data flow as well. I spent a while reverse engineering it and wrote something up on our internal wiki. I believe this is what's happening. If others with more knowledge could verify what I have below, I'll gladly move it to a wiki on the Chukwa site. Regarding your specific question, I believe the DemuxManager process is the first step in aggregating the data sink files. It moves the chunks to the dataSinkArchives directory once it's done with them. The ArchiveManager later archives those chunks. 1. Collectors write chunks to logs/*.chukwa files until a 64MB chunk size is reached or a given time interval is reached. - to: logs/*.chukwa 2. Collectors close chunks and rename them to *.done - from: logs/*.chukwa - to: logs/*.done 3. 
DemuxManager wakes up every 20 seconds, runs M/R to merges *.donefiles and moves them. - from: logs/*.done - to: demuxProcessing/mrInput - to: demuxProcessing/mrOutput - to: dataSinkArchives/[MMdd]/*/*.done 4. PostProcessManager wakes up every few minutes and aggregates, orders and de-dups record files. - from: postProcess/demuxOutputDir_*/[clusterName]/[dataType]/[dataType]_[MMdd]_[HH].R.evt - to: repos/[clusterName]/[dataType]/[MMdd]/[HH]/[mm]/[dataType]_[MMdd]_[HH]_[N].[N].evt 5. HourlyChukwaRecordRolling runs M/R jobs at 16 past the hour to group 5 minute logs to hourly. - from: repos/[clusterName]/[dataType]/[MMdd]/[HH]/[mm]/[dataType]_[MMdd]_[mm].[N].evt - to: temp/hourlyRolling/[clusterName]/[dataType]/[MMdd] - to: repos/[clusterName]/[dataType]/[MMdd]/[HH]/[dataType]_HourlyDone_[MMdd]_[HH].[N].evt - leaves: repos/[clusterName]/[dataType]/[MMdd]/[HH]/rotateDone/ 6. DailyChukwaRecordRolling runs M/R jobs at 1:30AM to group hourly logs to daily. - from: repos/[clusterName]/[dataType]/[MMdd]/[HH]/[dataType]_[MMdd]_[HH].[N].evt - to: temp/dailyRolling/[clusterName]/[dataType]/[MMdd] - to: repos/[clusterName]/[dataType]/[MMdd]/[dataType]_DailyDone_[MMdd].[N].evt - leaves
NN fails to start with LeaseManager errors
Hi, This morning the namenode of my hadoop cluster shut itself down after the logs/ directory had filled itself with job configs, log files and all the other fun things hadoop leaves there. It had been running for a few months. I deleted all off the job configs and attempt log directories and tried to restart the namenode, but it failed due to many LeaseManager errors. Does anyone know what needs to be done to fix this and get the namenode back up? Here's what the logs report. I'm using Cloudera's 0.18.3 distro. STARTUP_MSG: Starting NameNode STARTUP_MSG: host = my-host-name.com/10.15.137.204 STARTUP_MSG: args = [] STARTUP_MSG: version = 0.18.3-2 STARTUP_MSG: build = -r ; compiled by 'httpd' on Fri Jun 12 15:27:43 PDT 2009 / 2010-02-02 13:38:31,199 INFO org.apache.hadoop.ipc.metrics.RpcMetrics: Initializing RPC Metrics with hostName=NameNode, port=9000 2010-02-02 13:38:31,208 INFO org.apache.hadoop.dfs.NameNode: Namenode up at: my-host-name.com/10.15.137.204:9000 2010-02-02 13:38:31,212 INFO org.apache.hadoop.metrics.jvm.JvmMetrics: Initializing JVM Metrics with processName=NameNode, sessionId=null 2010-02-02 13:38:31,218 INFO org.apache.hadoop.dfs.NameNodeMetrics: Initializing NameNodeMeterics using context object:org.apache.hadoop.metrics.spi.NullContext 2010-02-02 13:38:31,318 INFO org.apache.hadoop.fs.FSNamesystem: fsOwner=app,app 2010-02-02 13:38:31,319 INFO org.apache.hadoop.fs.FSNamesystem: supergroup=supergroup 2010-02-02 13:38:31,319 INFO org.apache.hadoop.fs.FSNamesystem: isPermissionEnabled=true 2010-02-02 13:38:31,329 INFO org.apache.hadoop.dfs.FSNamesystemMetrics: Initializing FSNamesystemMeterics using context object:org.apache.hadoop.metrics.spi.NullContext 2010-02-02 13:38:31,331 INFO org.apache.hadoop.fs.FSNamesystem: Registered FSNamesystemStatusMBean 2010-02-02 13:38:31,375 INFO org.apache.hadoop.dfs.Storage: Number of files = 248675 2010-02-02 13:38:36,932 INFO org.apache.hadoop.dfs.Storage: Number of files under construction = 2 2010-02-02 13:38:37,008 INFO org.apache.hadoop.dfs.Storage: Image file of size 42924164 loaded in 5 seconds. 2010-02-02 13:38:37,020 ERROR org.apache.hadoop.dfs.LeaseManager: /path/on/hdfs/_logs/history/my-host-name.com_1261508934685_job_200912221108_15967_conf.xml not found in lease.paths (=[/path/on/hdfs/_logs/history/my-host-name.com_1261508934685_job_200912221108_15967_app_MyJobName_20100202_10_59]) [[ a bunch more errors like the one above ]] 2010-02-02 13:38:37,076 ERROR org.apache.hadoop.fs.FSNamesystem: FSNamesystem initialization failed. 
java.io.EOFException at java.io.DataInputStream.readInt(DataInputStream.java:375) at org.apache.hadoop.dfs.FSEditLog.loadFSEdits(FSEditLog.java:585) at org.apache.hadoop.dfs.FSImage.loadFSEdits(FSImage.java:846) at org.apache.hadoop.dfs.FSImage.loadFSImage(FSImage.java:675) at org.apache.hadoop.dfs.FSImage.recoverTransitionRead(FSImage.java:289) at org.apache.hadoop.dfs.FSDirectory.loadFSImage(FSDirectory.java:80) at org.apache.hadoop.dfs.FSNamesystem.initialize(FSNamesystem.java:294) at org.apache.hadoop.dfs.FSNamesystem.init(FSNamesystem.java:273) at org.apache.hadoop.dfs.NameNode.initialize(NameNode.java:148) at org.apache.hadoop.dfs.NameNode.init(NameNode.java:193) at org.apache.hadoop.dfs.NameNode.init(NameNode.java:179) at org.apache.hadoop.dfs.NameNode.createNameNode(NameNode.java:830) at org.apache.hadoop.dfs.NameNode.main(NameNode.java:839) 2010-02-02 13:38:37,077 INFO org.apache.hadoop.ipc.Server: Stopping server on 9000 2010-02-02 13:38:37,081 ERROR org.apache.hadoop.dfs.NameNode: java.io.EOFException at java.io.DataInputStream.readInt(DataInputStream.java:375) at org.apache.hadoop.dfs.FSEditLog.loadFSEdits(FSEditLog.java:585) at org.apache.hadoop.dfs.FSImage.loadFSEdits(FSImage.java:846) at org.apache.hadoop.dfs.FSImage.loadFSImage(FSImage.java:675) at org.apache.hadoop.dfs.FSImage.recoverTransitionRead(FSImage.java:289) at org.apache.hadoop.dfs.FSDirectory.loadFSImage(FSDirectory.java:80) at org.apache.hadoop.dfs.FSNamesystem.initialize(FSNamesystem.java:294) at org.apache.hadoop.dfs.FSNamesystem.init(FSNamesystem.java:273) at org.apache.hadoop.dfs.NameNode.initialize(NameNode.java:148) at org.apache.hadoop.dfs.NameNode.init(NameNode.java:193) at org.apache.hadoop.dfs.NameNode.init(NameNode.java:179) at org.apache.hadoop.dfs.NameNode.createNameNode(NameNode.java:830) at org.apache.hadoop.dfs.NameNode.main(NameNode.java:839) 2010-02-02 13:38:37,082 INFO org.apache.hadoop.dfs.NameNode: SHUTDOWN_MSG: / SHUTDOWN_MSG: Shutting down NameNode at my-host-name.com/10.15.137.204 / thanks, Bill
Re: piggybank build problem
I also was unable to build piggybank with similar errors. I did the following: $ svn co http://svn.apache.org/repos/asf/hadoop/pig/trunk/contrib/piggybank piggybank-trunk $ cd piggybank-trunk $ ant compile The problem is that the classpath references the following files: <property name="pigjar" value="../../../pig.jar" /> <property name="pigjar-withouthadoop" value="../../../pig-withouthadoop.jar" /> <property name="hadoopjar" value="../../../lib/hadoop20.jar" /> <property name="pigtest" value="../../../build/test/classes" /> You need to check out the entire pig project then cd to contrib/piggybank/java to build piggybank. It won't work if you check out just piggybank itself. thanks, Bill On Thu, Jan 28, 2010 at 10:22 AM, Dmitriy Ryaboy dvrya...@gmail.com wrote: You should be able to compile piggybank itself (just ant jar). To compile and run the tests, you also need to compile Pig's test classes -- so for that you need to first run ant jar compile-test in the top-level pig directory. -D On Wed, Jan 27, 2010 at 11:08 PM, felix gao gre1...@gmail.com wrote: OK I checked out the version 5's piggybank and still can't compile it. /usr/local/pig svn co http://svn.apache.org/repos/asf/hadoop/pig/tags/release-0.5.0/contrib/piggybank piggybank /usr/local/pig/piggybank/java ant jar compile-test Buildfile: build.xml init: [mkdir] Created dir: /Users/felixgao/mapreduce/pig-0.5.0/piggybank/java/build [mkdir] Created dir: /Users/felixgao/mapreduce/pig-0.5.0/piggybank/java/build/classes [mkdir] Created dir: /Users/felixgao/mapreduce/pig-0.5.0/piggybank/java/build/test [mkdir] Created dir: /Users/felixgao/mapreduce/pig-0.5.0/piggybank/java/build/test/classes [mkdir] Created dir: /Users/felixgao/mapreduce/pig-0.5.0/piggybank/java/build/docs/api compile: [echo] *** Compiling Pig UDFs *** [javac] Compiling 97 source files to /Users/felixgao/mapreduce/pig-0.5.0/piggybank/java/build/classes jar: [echo] *** Creating pigudf.jar *** [jar] Building jar: /Users/felixgao/mapreduce/pig-0.5.0/piggybank/java/piggybank.jar init: compile: [echo] *** Compiling Pig UDFs *** [javac] Compiling 97 source files to /Users/felixgao/mapreduce/pig-0.5.0/piggybank/java/build/classes compile-test: [echo] *** Compiling UDF tests *** [javac] Compiling 20 source files to /Users/felixgao/mapreduce/pig-0.5.0/piggybank/java/build/test/classes [javac] /Users/felixgao/mapreduce/pig-0.5.0/piggybank/java/src/test/java/org/apache/pig/piggybank/test/evaluation/string/TestLookupInFiles.java:31: package org.apache.pig.test does not exist [javac] import org.apache.pig.test.MiniCluster; [javac] ^ [javac] /Users/felixgao/mapreduce/pig-0.5.0/piggybank/java/src/test/java/org/apache/pig/piggybank/test/evaluation/string/TestLookupInFiles.java:32: package org.apache.pig.test does not exist [javac] import org.apache.pig.test.Util; [javac] ^ [javac] /Users/felixgao/mapreduce/pig-0.5.0/piggybank/java/src/test/java/org/apache/pig/piggybank/test/evaluation/string/TestLookupInFiles.java:38: cannot find symbol [javac] symbol : class MiniCluster [javac] location: class org.apache.pig.piggybank.test.evaluation.string.TestLookupInFiles [javac] MiniCluster cluster = MiniCluster.buildCluster(); [javac] ^ [javac] /Users/felixgao/mapreduce/pig-0.5.0/piggybank/java/src/test/java/org/apache/pig/piggybank/test/storage/TestSequenceFileLoader.java:38: package org.apache.pig.test does not exist [javac] import org.apache.pig.test.Util; [javac] ^ [javac] /Users/felixgao/mapreduce/pig-0.5.0/piggybank/java/src/test/java/org/apache/pig/piggybank/test/evaluation/string/TestLookupInFiles.java:38: cannot 
find symbol [javac] symbol : variable MiniCluster [javac] location: class org.apache.pig.piggybank.test.evaluation.string.TestLookupInFiles [javac] MiniCluster cluster = MiniCluster.buildCluster(); [javac] ^ [javac] /Users/felixgao/mapreduce/pig-0.5.0/piggybank/java/src/test/java/org/apache/pig/piggybank/test/evaluation/string/TestLookupInFiles.java:73: cannot find symbol [javac] symbol : variable Util [javac] location: class org.apache.pig.piggybank.test.evaluation.string.TestLookupInFiles [javac] pigServer.registerQuery(A = LOAD ' + Util.generateURI(tmpFile.toString()) + ' AS (key:chararray);); [javac]^ [javac] /Users/felixgao/mapreduce/pig-0.5.0/piggybank/java/src/test/java/org/apache/pig/piggybank/test/storage/TestSequenceFileLoader.java:84: cannot find symbol [javac] symbol : variable Util [javac] location: class org.apache.pig.piggybank.test.storage.TestSequenceFileLoader [javac]
Re: How to cleanup old Job jars
Thanks Rekha. These issues seem to be related to cleaning up Pig/Hadoop files upon shutdown of the VM. I just checked and when I shut down the VM, all files are cleaned up as expected. My issue is that I have Pig jobs that run in an app server which are triggered by quartz. It might be days or weeks between app server bounces. If anyone knows a way to configure or kick off some sort of cleanup process without shutting down the VM, please let me know. Otherwise, I need to deploy a hacky crontab script like this: find /tmp/Job[0-9]*.jar -type f -mmin +50 -exec rm {} \; On Tue, Jan 26, 2010 at 8:40 PM, Rekha Joshi rekha...@yahoo-inc.com wrote: You might like to check up PIG-116 and HADOOP-5175. Also think there is a JobCleanup task which takes care of cleaning.., AFAIK.., unless its failed job. Cheers, /R On 1/27/10 12:01 AM, Bill Graham billgra...@gmail.com wrote: Hi, Every time I run a Pig script I get a number of Job jars left in the /tmp directory of my client, 1 per MR job it seems. The file names look like /tmp/Job875278192.jar. I have scripts that run every five minutes and fire 10 MR jobs each, so the amount of space used by these jars grows rapidly. Is there a way to tell Pig to clean up after itself and remove these jars, or do I need to just write my own clean-up script? thanks, Bill
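For reference, a minimal Java sketch of the kind of in-VM sweep being asked about, mirroring the crontab workaround above; the class name, the /tmp location and the 50-minute cutoff are only illustrative, and this is plain java.io.File rather than any Pig-provided cleanup hook:

import java.io.File;

public class JobJarSweeper {
    // Delete Pig-generated Job*.jar files in /tmp that are older than 50 minutes.
    // Schedule this from quartz (or any timer) instead of an external cron job.
    public static void sweep() {
        long cutoff = System.currentTimeMillis() - 50L * 60L * 1000L;
        File[] candidates = new File("/tmp").listFiles();
        if (candidates == null) {
            return;
        }
        for (File f : candidates) {
            if (f.isFile() && f.getName().matches("Job\\d+\\.jar") && f.lastModified() < cutoff) {
                f.delete();
            }
        }
    }
}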
How to cleanup old Job jars
Hi, Every time I run a Pig script I get a number of Job jars left in the /tmp directory of my client, 1 per MR job it seems. The file names look like /tmp/Job875278192.jar. I have scripts that run every five minutes and fire 10 MR jobs each, so the amount of space used by these jars grows rapidly. Is there a way to tell Pig to clean up after itself and remove these jars, or do I need to just write my own clean-up script? thanks, Bill
Re: Deleted input files after load
Hive doesn't delete the files upon load, it moves them to a location under the Hive warehouse directory. Try looking under /user/hive/warehouse/t_word_count. On Fri, Jan 22, 2010 at 10:44 AM, Shiva shiv...@gmail.com wrote: Hi, For the first time I used Hive to load a couple of word count data input files into tables with and without OVERWRITE. Both the times the input file in HDFS got deleted. Is that an expected behavior? And couldn't find any definitive answer on the Hive wiki. hive LOAD DATA INPATH '/user/vmplanet/output/part-0' OVERWRITE INTO TABLE t_word_count; Env.: Using Hadoop 0.20.1 and latest Hive on Ubuntu 9.10 running in VMware. Thanks, Shiva
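If it helps to verify that the files were moved rather than deleted, here is a small sketch that lists the table directory; it assumes the default warehouse location mentioned above and uses only the stock Hadoop FileSystem API, nothing Hive-specific:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ListWarehouseTableDir {
    public static void main(String[] args) throws Exception {
        // List /user/hive/warehouse/t_word_count to confirm the loaded files landed there.
        FileSystem fs = FileSystem.get(new Configuration());
        for (FileStatus status : fs.listStatus(new Path("/user/hive/warehouse/t_word_count"))) {
            System.out.println(status.getPath() + " (" + status.getLen() + " bytes)");
        }
    }
}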
Re: how to generate a Chukwa SequenceFile
Here's a JIRA with a patch. Let me know if you think I should refactor any parts of it: https://issues.apache.org/jira/browse/CHUKWA-449 On Tue, Jan 19, 2010 at 6:03 PM, Ariel Rabkin asrab...@gmail.com wrote: Yes, if by processing you mean demux. Which should be renamed, I think, at some point. --Ari On Tue, Jan 19, 2010 at 4:53 PM, Bill Graham billgra...@gmail.com wrote: Thanks Ari, that helps. The TempFileUtil.writeASinkFile method seems similar to what I want actually. From looking at the code though it seems that a sink file contains ChukwaArchiveKey - ChunkImpl key value pairs, but a processed file instead contains ChukwaRecordKey - ChukwaRecord pairs. If I followed that code as an example, but just created the latter k/v pairs instead of the former I'd be good to go, correct? On Tue, Jan 19, 2010 at 3:59 PM, Ariel Rabkin asrab...@gmail.com wrote: There isn't a polished utility for this, and there should be. I think it'll be entirely straightforward, depending on your specific requirements. If you look in org.apache.hadoop.chukwa.util.TempFileUtil.RandSeqFileWriter there's an example of code that writes out a sequence file for test purposes. --Ari On Tue, Jan 19, 2010 at 3:46 PM, Bill Graham billgra...@gmail.com wrote: Hi, Is there an easy way (maybe using a utility class or the chukwa API) to manually create a sequence file of chukwa records from a log file without the need for HDFS? My use case is this: I've got pig unit tests that read sequence file input using ChukwaStorage from local disk. I generated these files by putting data into the cluster and waiting for the data processor to run. We're looking to change the log format though, and I'd like to be able to write and run the unit tests without putting the new data into the cluster. If there were a command line way that I could do this that would be very helpful. Or if anyone could point me to the relevant classes, I could write such a utility and contribute it back. thanks, Bill -- Ari Rabkin asrab...@gmail.com UC Berkeley Computer Science Department -- Ari Rabkin asrab...@gmail.com UC Berkeley Computer Science Department
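For readers following along, a rough sketch of what such a utility might look like before the CHUKWA-449 patch: it writes ChukwaRecordKey/ChukwaRecord pairs to a local SequenceFile. The setter names and the key layout shown here are assumptions based on Chukwa 0.3-era classes, so treat it as an outline rather than a tested recipe:

import org.apache.hadoop.chukwa.extraction.engine.ChukwaRecord;
import org.apache.hadoop.chukwa.extraction.engine.ChukwaRecordKey;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;

public class WriteLocalChukwaRecordFile {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.getLocal(conf);  // local disk, no HDFS needed
        SequenceFile.Writer writer = SequenceFile.createWriter(
            fs, conf, new Path("/tmp/test-records.evt"),
            ChukwaRecordKey.class, ChukwaRecord.class);

        long time = System.currentTimeMillis();
        ChukwaRecordKey key = new ChukwaRecordKey();
        key.setReduceType("MyDataType");               // the record type demux would assign
        key.setKey("somePartition/someSource/" + time);

        ChukwaRecord record = new ChukwaRecord();
        record.setTime(time);
        record.add("body", "2010-01-19 15:46:00,000 some log line");

        writer.append(key, record);
        writer.close();
    }
}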
LOAD from multiple directories
Hi, I have summary data created in directories every 10 minutes and I have a job that needs to LOAD from all directories in a one hour period. I was hoping to use Hadoop file path globbing, but it doesn't seem to allow the glob patterns with slashes '/' in them. If my directory structure looks like what I show below, does anyone have any suggestions for how I could write a LOAD command that would load all directories from 10:30-11:20, for example? /20100121/10/00 /20100121/10/10 /20100121/10/20 /20100121/10/30 -- /20100121/10/40 -- /20100121/10/50 -- /20100121/11/00 -- /20100121/11/10 -- /20100121/11/20 -- /20100121/11/30 /20100121/11/40 thanks, Bill
Re: LOAD from multiple directories
Thanks for the union suggestion, Thejas. Dmitry, how were you envisioning that globs can be used for this use case? Globs with slashes like this don't work: {10/30,10/40,10/50,11/00,11/10,11/20} On Thu, Jan 21, 2010 at 11:57 AM, Dmitriy Ryaboy dvrya...@gmail.com wrote: you should be able to use globs: http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/fs/FileSystem.html#globStatus%28org.apache.hadoop.fs.Path%29 {ab,c{de,fh}} Matches a string from the string set {ab, cde, cfh} -D On Thu, Jan 21, 2010 at 11:29 AM, Thejas Nair te...@yahoo-inc.com wrote: I was going to suggest - /20100121/{10,11}/{30,40,50,00,10,20} but that would not work because it will also match - /20100121/10/00 . I don't think hadoop file path globing can be used for this use case. You can use multiple loads followed by a union . -Thejas On 1/21/10 11:02 AM, Bill Graham billgra...@gmail.com wrote: Hi, I have summary data created in directories every 10 minutes and I have a job that needs to LOAD from all directories in a one hour period. I was hoping to use Hadoop file path globing, but it doesn't seem to allow the glob patterns with slashes '/' in them. If my directory structure looks like what I show below, does anyone have any suggestions for how I could write a LOAD command that would load all directories from 10:30-11:20, for example? /20100121/10/00 /20100121/10/10 /20100121/10/20 /20100121/10/30 -- /20100121/10/40 -- /20100121/10/50 -- /20100121/11/00 -- /20100121/11/10 -- /20100121/11/20 -- /20100121/11/30 /20100121/11/40 thanks, Bill
Re: LOAD from multiple directories
Note to those that are interested. As of 0.19.0, globs with slashes do work: http://issues.apache.org/jira/browse/HADOOP-3498 Of course we're on 0.18.3. Sigh... On Thu, Jan 21, 2010 at 12:09 PM, Bill Graham billgra...@gmail.com wrote: Thanks for the union suggestion, Thejas. Dmitry, how were you envisioning that globs can be used for this use case? Globs with slashes like this don't work: {10/30,10/40,10/50,11/00,11/10,11/20} On Thu, Jan 21, 2010 at 11:57 AM, Dmitriy Ryaboy dvrya...@gmail.comwrote: you should be able to use globs: http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/fs/FileSystem.html#globStatus%28org.apache.hadoop.fs.Path%29 {ab,c{de,fh}} Matches a string from the string set {ab, cde, cfh} -D On Thu, Jan 21, 2010 at 11:29 AM, Thejas Nair te...@yahoo-inc.com wrote: I was going to suggest - /20100121/{10,11}/{30,40,50,00,10,20} but that would not work because it will also match - /20100121/10/00 . I don't think hadoop file path globing can be used for this use case. You can use multiple loads followed by a union . -Thejas On 1/21/10 11:02 AM, Bill Graham billgra...@gmail.com wrote: Hi, I have summary data created in directories every 10 minutes and I have a job that needs to LOAD from all directories in a one hour period. I was hoping to use Hadoop file path globing, but it doesn't seem to allow the glob patterns with slashes '/' in them. If my directory structure looks like what I show below, does anyone have any suggestions for how I could write a LOAD command that would load all directories from 10:30-11:20, for example? /20100121/10/00 /20100121/10/10 /20100121/10/20 /20100121/10/30 -- /20100121/10/40 -- /20100121/10/50 -- /20100121/11/00 -- /20100121/11/10 -- /20100121/11/20 -- /20100121/11/30 /20100121/11/40 thanks, Bill
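For completeness, this is roughly how the glob from this thread could be exercised directly against the FileSystem API; it assumes Hadoop 0.19.0 or later per HADOOP-3498, and the date paths are just the examples from the original message:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class GlobAcrossDirs {
    public static void main(String[] args) throws Exception {
        // A brace glob spanning two directory levels (hour/minute), covering 10:30-11:20.
        FileSystem fs = FileSystem.get(new Configuration());
        FileStatus[] matches = fs.globStatus(
            new Path("/20100121/{10/30,10/40,10/50,11/00,11/10,11/20}"));
        for (FileStatus status : matches) {
            System.out.println(status.getPath());
        }
    }
}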
Re: Google has obtained the patent over mapreduce
Typically companies will patent their IP as a defensive measure to protect themselves from being sued, as has been pointed out already. Another typical reason is to exercise the patent against companies that present a challenge to their core business. I would bet that unless you're making a noticeable dent in google's search/ad business, then you really don't need to worry about them enforcing the patent against you. On Wed, Jan 20, 2010 at 1:42 PM, Colin Freas colinfr...@gmail.com wrote: Developers do themselves, their code, and their users a disservice if they lack some understanding of intellectual property law. It can be complicated, but it isn't rocket science. In the United States, Google is protected by the first to invent (http://en.wikipedia.org/wiki/First_to_file_and_first_to_invent) principle: they can safely publish anything they want about their invention prior to applying for a patent if they can prove they came up with the invention first. As others have pointed out, it isn't something to panic over. This is Google, not Rambus. It would be nice to see Google proactively and explicitly say We're not going to enforce this patent. But this patent and a lot of other software and business process patents could be in danger of being summarily overturned, depending on how the US Supreme Court rules in the Bilski case. It's possible they wanted to acquire this patent before that ruling, since it would give them standing to challenge a lot of potentially unfavorable outcomes. On Wed, Jan 20, 2010 at 4:07 PM, brien colwell xcolw...@gmail.com wrote: Personally, it seems like they gave away too much information before they had the patent. I'm not a patent lawyer, but I'd expect they submitted the patent application or a provisional before they submitted their academic paper or other public disclosure. On Wed, Jan 20, 2010 at 12:09 PM, Edward Capriolo edlinuxg...@gmail.com wrote: Interesting situation. I try to compare mapreduce to the camera. Let's argue Google is Kodak, Apache is Polaroid, and MapReduce is a Camera. Imagine Kodak invented the camera privately, never sold it to anyone, but produced some document describing what a camera did. Polaroid followed the document and produced a camera and sold it publicly. Kodak later patents a camera, even though no one outside of Kodak can confirm Kodak ever made a camera before Polaroid. Not saying that is what happened here, but google releasing the GFS pdf was a large factor in causing hadoop to happen. Personally, it seems like they gave away too much information before they had the patent. The patent system faces many problems including this 'back to the future' issue. Where it takes so long to get a patent no one can wait, by the time a patent is issued there are already multiple viable implementations of a patent. I am no patent lawyer or anything, but I notice the phrase master process all over the claims. Maybe if a piece of software (hadoop) had a distributed process that would be sufficient to say hadoop technology does not infringe on this patent. I think it would be interesting to look deeply at each claim and determine if hadoop could be designed to not infringe on these patents, to deal with what if scenarios. On Wed, Jan 20, 2010 at 11:29 AM, Ravi ravindra.babu.rav...@gmail.com wrote: Hi, I too read about that news. I don't think that it will be any problem. However Google didn't invent the model. Thanks.
On Wed, Jan 20, 2010 at 9:47 PM, Udaya Lakshmi udaya...@gmail.com wrote: Hi, As an user of hadoop, Is there anything to worry about Google obtaining the patent over mapreduce? Thanks.
How to deploy a custom processor to demux
Hi, I've written my own Processor to handle my log format per this wiki and I've run into a couple of gotchas: http://wiki.apache.org/hadoop/DemuxModification 1. The default processor is not the TsProcessor as documented, but the DefaultProcessor (see line 83 of Demux.java). This causes headaches because when using DefaultProcessor data always goes under minute 0 in hdfs, regardless of when in the hour it was created. 2. When implementing a custom parser as shown in the wiki, how do you register the class so it gets included in the job that's submitted to the hadoop cluster? The only way I've been able to do this is to put my class in the package org.apache.hadoop.chukwa.extraction.demux.processor.mapper and then manually add that class to the chukwa-core-0.3.0.jar that is on my data processor, which is a pretty rough hack. Otherwise, I get class not found exceptions in my mapper. thanks, Bill
Re: How to deploy a custom processor to demux
Thanks for your quick reply Eric. The TsProcessor does use buildGenericRecord and has been working fine for me (at least I thought it was). I've mapped it to my dataType as you described without problems. My only point with issue #1 was just that the documentation is off and that the DefaultProcessor yields what I think is unexpected behavior. There is an plan to load parser class from class path by using Java annotation. It is still in the initial phase of planning. Design participation are welcome. Yes, annotations would be useful. Or what about just having an extensions directory (maybe lib/ext/) or something similar where custom jars could be placed that are to be submitted by demux M/R? Do you know where the code resides that handles adding the chukwa-core jar? I poked around bit but couldn't find it. Finally, is there a JIRA for this issue that you know of? If not I'll create one. This is going to become a pain point for us soon, so if we have a design I might be able to contribute a patch. thanks, Bill On Tue, Dec 22, 2009 at 2:14 PM, Eric Yang ey...@yahoo-inc.com wrote: On 12/22/09 1:36 PM, Bill Graham billgra...@gmail.com wrote: I've written my own Processor to handle my log format per this wiki and I've run into a couple of gotchast: http://wiki.apache.org/hadoop/DemuxModification 1. The default processor is not the TsProcessor as documented, but the DefaultProcessor (see line 83 of Demux.java). This causes headaches because when using DefaultProcessor data always goes under minute 0 in hdfs, regardless of when in the hour it was created. There is a generic method to build the record, like: buildGenericRecord(record, recordEntry, timestamp, recordType); This method will build up key like: Time partition/Primary Key/timestamp When all records are roll up into large sequence file by end of the hour and end of the day, the sequence file is sorted by time partition and primary key. This arrangement of data structure was put in place to assist data scanning. When data is retrieved, use record.getTimestamp() to find the real timestamp for the record. TsProcessor is incompleted for now because the key in ChukwaRecord is used in hourly and daily roll up. Without using buildGenericRecord, hourly and daily roll up will not work correctly. 2. When implementing a custom parser as shown in the wiki, how do you register the class so it gets included in the job that's submitted to the hadoop cluster? The only way I've been able to do this is to put my class in the package org.apache.hadoop.chukwa.extraction.demux.processor.mapper and then manually add that class to the chukwa-core-0.3.0.jar that is on my data processor, which is a pretty rough hack. Otherwise, I get class not found exceptions in my mapper. The demux process is controlled by $CHUKWA_HOME/conf/chukwa-demux-conf.xml, and map the recordType to your parser class. There is an plan to load parser class from class path by using Java annotation. It is still in the initial phase of planning. Design participation are welcome. Hope this helps. :) Regards, Eric
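To make the discussion concrete, here is a bare-bones sketch of a custom mapper processor that follows the buildGenericRecord(record, recordEntry, timestamp, recordType) call Eric describes. The base-class details (the parse signature, the protected key field) are assumptions based on Chukwa 0.3-era AbstractProcessor and may differ in other versions, and the package placement is the workaround Bill mentions rather than a recommended practice:

package org.apache.hadoop.chukwa.extraction.demux.processor.mapper;

import org.apache.hadoop.chukwa.extraction.engine.ChukwaRecord;
import org.apache.hadoop.chukwa.extraction.engine.ChukwaRecordKey;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class MyLogProcessor extends AbstractProcessor {
    @Override
    protected void parse(String recordEntry,
                         OutputCollector<ChukwaRecordKey, ChukwaRecord> output,
                         Reporter reporter) throws Throwable {
        // In a real processor, parse the timestamp out of the log line; a wall-clock
        // value keeps the sketch short.
        long timestamp = System.currentTimeMillis();
        ChukwaRecord record = new ChukwaRecord();
        // buildGenericRecord builds the time-partition/primary-key/timestamp key
        // that the hourly and daily rollups expect.
        buildGenericRecord(record, recordEntry, timestamp, "MyDataType");
        output.collect(key, record);
    }
}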
Log data written to [DataType]InError
Hi, For some reason data that I put into Chukwa appears in the following directory after the demux/post processor processes run: /chukwa/repos/[clusterName]/[dataType]InError/ instead of /chukwa/repos/[clusterName]/[dataType]/ There's no explanation in the logs as to why this is happening, nor are there any exceptions. Any idea why this happens or how I can troubleshoot? thanks, Bill
Re: dynamically calling STORE
Thanks Dmitriy, this is exactly what I need. There was one bug I ran into though FYI, which is when making a request like this, as documented in the JavaDocs: STORE A INTO '/my/home/output' USING MultiStorage('/my/home/output','0', 'none', '\t'); Pig would create a file '/my/home/output' and then an exception would be thrown when MultiStorage tried to make a directory under '/my/home/output'. The workaround that worked for me was to instead specify a dummy location as the first path like so: STORE A INTO '/my/home/output/temp' USING MultiStorage('/my/home/output','0', 'none', '\t'); On Tue, Dec 15, 2009 at 1:06 PM, Dmitriy Ryaboy dvrya...@gmail.com wrote: Bill, A custom storefunc should do the trick. See https://issues.apache.org/jira/browse/PIG-958 (aka piggybank.storage.MultiStorage) for a jumping-off point. -D On Tue, Dec 15, 2009 at 1:59 PM, Bill Graham billgra...@gmail.com wrote: Hi, I'm pretty sure the answer to my question is no, but I have to ask. Is it possible within Pig to store different groups of data into different output files where the grouping is dynamic (i.e. not known ahead of time)? Here's what I'm trying to do... I've got a script that reads log files of URLs and generates counts for a given time period. The urls might have a 'tag' querystring param though, and in that case I want to get the most popular urls for each tag output to it's own file. My data looks like this and is ordered by tag asc, count desc: [tag] [timeinterval] [url] [count] I need to do something like so: for each tag group found store all records in file foo_[tag].txt I ultimately need these files on local disk and I'm looking for a better way to do so than generating a file of N unique tags in HDFS, reading it from Java, submitting N jobs with the tag name substituted into a script file, followed by N copyToLocal calls. At least two possible solutions come to mind, but am curious if there's another that I'm overlooking: 1. In java submit pig dynamic commands to an instance of PigServer. I'd still need a unique tag file for this case. 2. Maybe with a custom store function?? thanks, Bill
dynamically calling STORE
Hi, I'm pretty sure the answer to my question is no, but I have to ask. Is it possible within Pig to store different groups of data into different output files where the grouping is dynamic (i.e. not known ahead of time)? Here's what I'm trying to do... I've got a script that reads log files of URLs and generates counts for a given time period. The urls might have a 'tag' querystring param though, and in that case I want to get the most popular urls for each tag output to it's own file. My data looks like this and is ordered by tag asc, count desc: [tag] [timeinterval] [url] [count] I need to do something like so: for each tag group found store all records in file foo_[tag].txt I ultimately need these files on local disk and I'm looking for a better way to do so than generating a file of N unique tags in HDFS, reading it from Java, submitting N jobs with the tag name substituted into a script file, followed by N copyToLocal calls. At least two possible solutions come to mind, but am curious if there's another that I'm overlooking: 1. In java submit pig dynamic commands to an instance of PigServer. I'd still need a unique tag file for this case. 2. Maybe with a custom store function?? thanks, Bill
HDFS move instead of LOAD DATA INPATH?
Hi, When the LOAD DATA INPATH is issued, does Hive modify the metastore data, or do anything else special besides just moving the files in HDFS? I've got a daily MR job that runs before I need to load data into a daily Hive partition and using the FileSystem class to move the files from Java would be pretty easy. thanks, Bill
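For the record, the FileSystem-based move would look something like the sketch below; both paths are made up for illustration, and whether this fully replaces LOAD DATA INPATH (metastore updates, ALTER TABLE ... ADD PARTITION for new partitions) is exactly the open question in this post:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class MoveIntoPartition {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        // Move the daily MR output directly under the table's partition directory,
        // assuming the default warehouse layout for a table partitioned by dt.
        boolean moved = fs.rename(
            new Path("/user/me/daily-job-output/2009-10-15"),
            new Path("/user/hive/warehouse/my_table/dt=2009-10-15"));
        System.out.println("moved: " + moved);
    }
}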
broken links to Hive documentation
The links to documentation for releases 0.3.0 and 0.4.0 on the left nav of the Hive homepage are broken FYI: http://hadoop.apache.org/hive/ They send you to these pages that show white apache directory listing pages: http://hadoop.apache.org/hive/docs/r0.3.0/ http://hadoop.apache.org/hive/docs/r0.4.0/
Re: Tracking files deletions in HDFS
I don't know about the auditing tools, but I have seen files get randomly deleted in dev setups when using hadoop with the default hadoop.tmp.dir setting, which is /tmp/hadoop-${user.name}. On Thu, Nov 19, 2009 at 9:03 AM, Stas Oskin stas.os...@gmail.com wrote: Hi. I have a strange case of missing files, which most probably were randomly deleted by my application. Does HDFS provide any auditing tools for tracking who deleted what and when? Thanks in advance.
Accessing a bag of token tuples from TOKENIZE
Hi, I'm struggling to get the tokens out of a bag of tuples created by the TOKENIZE UDF and could use some help. I want to tokenize and then be able to reference the tokens by their position. Is this even possible? Since the token count is non-deterministic, I'm questioning whether I can use positional parameters to dig them out. Anyway, here's what I'm doing, starting with a chararray where each: grunt describe B; B: {body: chararray} grunt dump B; (2009-11-18 09:32:43,000 color=blue) (2009-11-18 09:32:43,000 color=red) (2009-11-18 09:32:44,000 color=red) (2009-11-18 09:32:45,000 color=green) grunt C = FOREACH B GENERATE TOKENIZE((chararray)body) as B1:bag{T1:tuple(T:chararray)}; grunt describe C; C: {B1: {T1: (T: chararray)}} grunt D = FOREACH C GENERATE B1.$0 as date; grunt describe D; D: {date: {T: chararray}} grunt dump D; ... ({(2009-11-18),(09:32:43),(000),(color=blue)}) ({(2009-11-18),(09:32:43),(000),(color=red)}) ({(2009-11-18),(09:32:44),(000),(color=red)}) ({(2009-11-18),(09:32:45),(000),(color=green)}) What I'd expect to see is just the date values. Any ideas? thanks, Bill
Re: Accessing a bag of token tuples from TOKENIZE
Thanks Zaki, I think you're right about bag values lacking order and not being able to be accessed by position. I'll take a look at the regex UDF. What I'm ultimately trying to get is a handle to each token in the body though, I'm just using date as an example. I'd like to be able to pull these values out with one UDF execution per line (as opposed to per field). My input is basically access log entries and I need to get the different space-delimited values in it. Seems like the thing to do would be to write my own UDF that returns a tuple from the space-delimited tokens for each line passed. I'm sure this problem has been solved a million times before though, so if anyone has a better suggestion I'd love to hear it. I recall talk about an access log UDF at one point (maybe it was in hive), but I can't find any references to it at the moment. On Wed, Nov 18, 2009 at 12:38 PM, zaki rahaman zaki.raha...@gmail.comwrote: Hm, I may be wrong about this, but from what I recall, there are no 'fields' in the bag of tokens (and no ordering) created by TOKENIZE. As such, I don't think there's a way to accomplish what you're trying to do the way it's written. As an alternative approach, you might try using FLATTEN to unnest the TOKENIZE output and give you tuples for each token and then filter the tokens to those that match your date pattern. Alternatively, you could accomplish this in one step with a regex extract UDF (there's one in piggybank if I recall correctly and something similar in amazon's pig function jar). If the data you described below is your input data, then you could remove the projection step altogether by using a RegEx LoadFunc to get the date field. Hope this helps, and others feel free to correct me if I'm wrong, as I'm sure there's probably a better/more elegant way. -- Zaki On Wed, Nov 18, 2009 at 3:03 PM, Bill Graham billgra...@gmail.com wrote: Hi, I'm struggling to get the tokens out of a bag of tuples created by the TOKENIZE UDF and could use some help. I want to tokenize and then be able to reference the tokens by their position. Is this even possible? Since the token count is non-deterministic, I'm question whether I can use positional parameters to dig them out. Anyway, here's what I'm doing, starting with a chararray where each: grunt describe B; B: {body: chararray} grunt dump B; (2009-11-18 09:32:43,000 color=blue) (2009-11-18 09:32:43,000 color=red) (2009-11-18 09:32:44,000 color=red) (2009-11-18 09:32:45,000 color=green) grunt C = FOREACH B GENERATE TOKENIZE((chararray)body) as B1:bag{T1:tuple(T:chararray)}; grunt describe C; C: {B1: {T1: (T: chararray)}} grunt D = FOREACH C GENERATE B1.$0 as date; grunt describe D; D: {date: {T: chararray}} grunt dump D; ... ({(2009-11-18),(09:32:43),(000),(color=blue)}) ({(2009-11-18),(09:32:43),(000),(color=red)}) ({(2009-11-18),(09:32:44),(000),(color=red)}) ({(2009-11-18),(09:32:45),(000),(color=green)}) What I'd expect to see is just the date values. Any ideas? thanks, Bill -- Zaki Rahaman
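For anyone wanting the UDF route described above, a minimal sketch of such a tokenizing EvalFunc; the class name is made up, and it targets the Pig 0.5-era UDF API discussed in this thread:

import java.io.IOException;
import java.util.Arrays;

import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;
import org.apache.pig.data.TupleFactory;

public class SplitLine extends EvalFunc<Tuple> {
    private static final TupleFactory tupleFactory = TupleFactory.getInstance();

    // Split one chararray line on whitespace and return the tokens as a single
    // tuple, so the fields can be addressed by position in the script.
    @Override
    public Tuple exec(Tuple input) throws IOException {
        if (input == null || input.size() == 0 || input.get(0) == null) {
            return null;
        }
        String line = (String) input.get(0);
        return tupleFactory.newTuple(Arrays.asList(line.split("\\s+")));
    }
}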
Re: Accessing a bag of token tuples from TOKENIZE
This is exactly what I need, thanks! I had checked piggybank previously, but didn't catch these the first time. On Wed, Nov 18, 2009 at 1:15 PM, zaki rahaman zaki.raha...@gmail.comwrote: Hey Bill, If you look in piggybank (http://wiki.apache.org/pig/PiggyBank look in) in the contrib dir of your pig installation, you'll find several functions that might help. I haven't used any myself, but in org.apache.pig.piggybank.storage you'll find RegExLoader and MyRegExLoader. If you pass a reg exp with capturing groups I believe you can simply use these functions directly. There are also apache log specific load funcs, I think theres Common and Combined Log Loaders... simply set up your scripts to use those functions to load your input data and you'll have what you need I believe. On Wed, Nov 18, 2009 at 4:03 PM, Mridul Muralidharan mrid...@yahoo-inc.comwrote: You are right, there is no ordering of tuples within a bag by default (except in some cases - like output of ORDER BY). For the specific purpose of pulling the date field - you could just use some regexp udf instead of tokenize to pick the value you are interested in. There should be udf's in piggy bank which do this ... Or is this a more general question regarding accessing tuples within a bag in some ordered fashion ? Regards, Mridul Bill Graham wrote: Hi, I'm struggling to get the tokens out of a bag of tuples created by the TOKENIZE UDF and could use some help. I want to tokenize and then be able to reference the tokens by their position. Is this even possible? Since the token count is non-deterministic, I'm question whether I can use positional parameters to dig them out. Anyway, here's what I'm doing, starting with a chararray where each: grunt describe B; B: {body: chararray} grunt dump B; (2009-11-18 09:32:43,000 color=blue) (2009-11-18 09:32:43,000 color=red) (2009-11-18 09:32:44,000 color=red) (2009-11-18 09:32:45,000 color=green) grunt C = FOREACH B GENERATE TOKENIZE((chararray)body) as B1:bag{T1:tuple(T:chararray)}; grunt describe C; C: {B1: {T1: (T: chararray)}} grunt D = FOREACH C GENERATE B1.$0 as date; grunt describe D; D: {date: {T: chararray}} grunt dump D; ... ({(2009-11-18),(09:32:43),(000),(color=blue)}) ({(2009-11-18),(09:32:43),(000),(color=red)}) ({(2009-11-18),(09:32:44),(000),(color=red)}) ({(2009-11-18),(09:32:45),(000),(color=green)}) What I'd expect to see is just the date values. Any ideas? thanks, Bill -- Zaki Rahaman
Re: Problem regarding hive-jdbc driver
Not all methods in the JDBC spec are implemented, as you're noticing. Statement.setMaxRows(int) is implemented though, so maybe that would work for your needs. Or you could just specify a limit in your SQL. Bill On Mon, Oct 26, 2009 at 6:21 AM, Mohan Agarwal mohan.agarwa...@gmail.com wrote: Hi everyone, I am writing a java program to create a Query Editor to execute a query through hive. I am using hive-jdbc driver for database connection and query execution, but I am facing a problem regarding java.sql.Statement class. When I am using setFetchSize() of Statement class, it is giving java.sql.SQLException: Method not supported exception. I have to implement paging in user interface, I can't show all the data to the user in a single page. Also java.sql.Statement class is not supporting scrollable Result Set. Can someone help me to solve this problem, so that I can implement paging in user interface. Thanking You Mohan Agarwal
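A short sketch of both workarounds together, in case it helps; the host, the port (10000 is only the conventional HiveServer default) and the table name are placeholders:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class PagedHiveQuery {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hadoop.hive.jdbc.HiveDriver");
        Connection con = DriverManager.getConnection("jdbc:hive://localhost:10000/default", "", "");
        Statement stmt = con.createStatement();
        stmt.setMaxRows(100);  // implemented by the Hive driver, unlike setFetchSize()
        ResultSet rs = stmt.executeQuery("SELECT * FROM my_table LIMIT 100");
        while (rs.next()) {
            System.out.println(rs.getString(1));
        }
        con.close();
    }
}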
Re: Hive query web service
There is no J2EE web server or SOAP web service in this equation. The Hive JDBC client connects to the Hive Server, which can be started with a script like so run from your $HIVE_HOME/build/dist directory: export HADOOP_HOME=/path/to/hadoop HIVE_PORT=1 ./bin/hive --service hiveserver No war files, or WEB-INF/ directories at all in this case. On Wed, Oct 21, 2009 at 5:24 AM, Arijit Mukherjee ariji...@gmail.comwrote: If I understood the concept of standalone and embedded properly, then a Web Service which connects to the Hive/Thrift server via JDBC (jdbc:hive://host:port...) is actually a standalone client. The difference from a Java standalone client is that - in this case, the whole thing is packaged as a Web Service and deployed on a web server such as JBoss/GlassFish - and the connection is initiated only after a SOAP request is received from the web service client. If that is the case, then the Web Service should not require the conf files, or jpox libraries - is it not? Or did I misunderstand the concept? Arijit 2009/10/21 Arijit Mukherjee ariji...@gmail.com: Update: I did a clean/build/deploy - the config files are within the Web Service WEB-INF/classes folder, and the libraries (including the jpox ones) are inside WEB-INF/lib - which are standard for any web application. But the config related exception is still there:-(( Arijit 2009/10/21 Arijit Mukherjee ariji...@gmail.com: Thanx Bill. I copied the jpox jars from the 0.3.0 distribution and added them to the web service archive, and they are in the classpath, but the config related exception is still there. Let me do a clean build/deploy and I'll get back again. Arijit 2009/10/20 Bill Graham billgra...@gmail.com: The Hive JDBC client can run in two different modes: standalone and embedded. Standalone mode is where the client connects to a separate standalone HiveServer by specifying the host:port of the server in the jdbc URL like this: jdbc:hive://localhost:1/default. In this case the hive configs are not needed by the client, since the client is making thrift requests to the server which has the Hive configs. the Hive Server knows how to resolve the metastore. Embedded mode is where the JDBC client connects to itself so to speak using a JDBC url like this: jdbc:hive://. It's as if the client is running an embedded server that only it communicates with. In this case the client needs the Hive configs since it needs to resolve the metastore, amongst other things. The metastore dependency in this case is what will cause you to see jpox errors appear if those jars aren't found. HTH, Bill On Tue, Oct 20, 2009 at 4:14 AM, Arijit Mukherjee ariji...@gmail.com wrote: BTW - the service is working though, in spite of those exceptions. I'm able to run queries and get results. Arijit 2009/10/20 Arijit Mukherjee ariji...@gmail.com: I created a hive-site.xml using the outline given in the Hive Web Interface tutorial - now that file is in the classpath of the Web Service - and the service can find the file. 
But, now there's another exception - 2009-10-20 14:27:30,914 DEBUG [httpSSLWorkerThread-14854-0] HiveQueryService - connecting to Hive using URL: jdbc:hive://localhost:1/default 2009-10-20 14:27:30,969 DEBUG [httpSSLWorkerThread-14854-0] Configuration - java.io.IOException: config() at org.apache.hadoop.conf.Configuration.init(Configuration.java:176) at org.apache.hadoop.conf.Configuration.init(Configuration.java:164) at org.apache.hadoop.hive.conf.HiveConf.init(HiveConf.java:287) at org.apache.hadoop.hive.jdbc.HiveConnection.init(HiveConnection.java:63) at org.apache.hadoop.hive.jdbc.HiveDriver.connect(HiveDriver.java:109) at java.sql.DriverManager.getConnection(DriverManager.java:582) at java.sql.DriverManager.getConnection(DriverManager.java:185) at com.ctva.poc.hive.service.HiveQueryService.getConnection(HiveQueryService.java:134) at com.ctva.poc.hive.service.HiveQueryService.connectDB(HiveQueryService.java:43) Apparently, something goes wrong during the config routine. Do I need something more within the service? Regards Arijit 2009/10/20 Arijit Mukherjee ariji...@gmail.com: Hi I'm trying to create a Web Service which will access Hive (0.4.0 release) using JDBC. I used to sample JDBC code from the wiki ( http://wiki.apache.org/hadoop/Hive/HiveClient#head-fd2d8ae9e17fdc3d9b7048d088b2c23a53a6857d ), but when I'm trying to connect the the DB using the DriverManager, there's an exception which seems to relate to hive-site.xml (HiveConf - hive-site.xml not found.). But I could not find any hive-site.xml in $HIVE_HOME/conf - there's only hive-default.xml. The wiki page also speaks about couple of jpox JAR files, which aren't
Re: Hive query web service
The Hive JDBC client can run in two different modes: standalone and embedded. Standalone mode is where the client connects to a separate standalone HiveServer by specifying the host:port of the server in the jdbc URL like this: jdbc:hive://localhost:1/default. In this case the hive configs are not needed by the client, since the client is making thrift requests to the server which has the Hive configs. the Hive Server knows how to resolve the metastore. Embedded mode is where the JDBC client connects to itself so to speak using a JDBC url like this: jdbc:hive://. It's as if the client is running an embedded server that only it communicates with. In this case the client needs the Hive configs since it needs to resolve the metastore, amongst other things. The metastore dependency in this case is what will cause you to see jpox errors appear if those jars aren't found. HTH, Bill On Tue, Oct 20, 2009 at 4:14 AM, Arijit Mukherjee ariji...@gmail.comwrote: BTW - the service is working though, in spite of those exceptions. I'm able to run queries and get results. Arijit 2009/10/20 Arijit Mukherjee ariji...@gmail.com: I created a hive-site.xml using the outline given in the Hive Web Interface tutorial - now that file is in the classpath of the Web Service - and the service can find the file. But, now there's another exception - 2009-10-20 14:27:30,914 DEBUG [httpSSLWorkerThread-14854-0] HiveQueryService - connecting to Hive using URL: jdbc:hive://localhost:1/default 2009-10-20 14:27:30,969 DEBUG [httpSSLWorkerThread-14854-0] Configuration - java.io.IOException: config() at org.apache.hadoop.conf.Configuration.init(Configuration.java:176) at org.apache.hadoop.conf.Configuration.init(Configuration.java:164) at org.apache.hadoop.hive.conf.HiveConf.init(HiveConf.java:287) at org.apache.hadoop.hive.jdbc.HiveConnection.init(HiveConnection.java:63) at org.apache.hadoop.hive.jdbc.HiveDriver.connect(HiveDriver.java:109) at java.sql.DriverManager.getConnection(DriverManager.java:582) at java.sql.DriverManager.getConnection(DriverManager.java:185) at com.ctva.poc.hive.service.HiveQueryService.getConnection(HiveQueryService.java:134) at com.ctva.poc.hive.service.HiveQueryService.connectDB(HiveQueryService.java:43) Apparently, something goes wrong during the config routine. Do I need something more within the service? Regards Arijit 2009/10/20 Arijit Mukherjee ariji...@gmail.com: Hi I'm trying to create a Web Service which will access Hive (0.4.0 release) using JDBC. I used to sample JDBC code from the wiki ( http://wiki.apache.org/hadoop/Hive/HiveClient#head-fd2d8ae9e17fdc3d9b7048d088b2c23a53a6857d ), but when I'm trying to connect the the DB using the DriverManager, there's an exception which seems to relate to hive-site.xml (HiveConf - hive-site.xml not found.). But I could not find any hive-site.xml in $HIVE_HOME/conf - there's only hive-default.xml. The wiki page also speaks about couple of jpox JAR files, which aren't in the lib folder either. Am I missing something here? Regards Arijit -- And when the night is cloudy, There is still a light that shines on me, Shine on until tomorrow, let it be. -- And when the night is cloudy, There is still a light that shines on me, Shine on until tomorrow, let it be. -- And when the night is cloudy, There is still a light that shines on me, Shine on until tomorrow, let it be.
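Side by side, the two connection modes described above look something like this; the standalone port is only the conventional default, so adjust it for your HiveServer:

import java.sql.Connection;
import java.sql.DriverManager;

public class HiveConnectionModes {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hadoop.hive.jdbc.HiveDriver");

        // Standalone: talk to a separately started HiveServer; the client does not
        // need hive-site.xml or the metastore jars.
        Connection standalone =
            DriverManager.getConnection("jdbc:hive://localhost:10000/default", "", "");

        // Embedded: no host/port, so the client itself needs the Hive configs and
        // the metastore (jpox) jars on its classpath.
        Connection embedded = DriverManager.getConnection("jdbc:hive://", "", "");

        standalone.close();
        embedded.close();
    }
}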
input20.q unit test failure
Hi, I'm trying to run the unit tests before submitting a patch and I'm getting a test failure. I've tried running the same test on a fresh checkout and it also fails. Below is an excerpt of the output. Is anyone else able to run this test? [grah...@bgrahammbproosx:~/ws/hive-svn/hive-trunk-clean]$ svn up ... Updated to revision 821099. [grah...@bgrahammbproosx:~/ws/hive-svn/hive-trunk-clean]$ svn stat [grah...@bgrahammbproosx:~/ws/hive-svn/hive-trunk-clean]$ svn diff [grah...@bgrahammbproosx:~/ws/hive-svn/hive-trunk-clean]$ svn info Path: . URL: http://svn.apache.org/repos/asf/hadoop/hive/trunk Repository Root: http://svn.apache.org/repos/asf Repository UUID: 13f79535-47bb-0310-9956-ffa450edef68 Revision: 821099 Node Kind: directory Schedule: normal Last Changed Author: zshao Last Changed Rev: 820823 Last Changed Date: 2009-10-01 15:19:08 -0700 (Thu, 01 Oct 2009) [grah...@bgrahammbproosx:~/ws/hive-svn/hive-trunk-clean]$ ant clean test -Dtestcase=TestCliDriver -Dqfile=input20.q ... [junit] NULL 432 [junit] NULL 435 [junit] NULL 436 [junit] junit.framework.AssertionFailedError: Client execution results failed with error code = 1 [junit] at junit.framework.Assert.fail(Assert.java:47) [junit] at org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_input20(TestCliDriver.java:96) [junit] at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) [junit] at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) [junit] at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) [junit] at java.lang.reflect.Method.invoke(Method.java:597) [junit] at junit.framework.TestCase.runTest(TestCase.java:154) [junit] at junit.framework.TestCase.runBare(TestCase.java:127) [junit] at junit.framework.TestResult$1.protect(TestResult.java:106) [junit] at junit.framework.TestResult.runProtected(TestResult.java:124) [junit] at junit.framework.TestResult.run(TestResult.java:109) [junit] at junit.framework.TestCase.run(TestCase.java:118) [junit] at junit.framework.TestSuite.runTest(TestSuite.java:208) [junit] at junit.framework.TestSuite.run(TestSuite.java:203) [junit] at org.apache.tools.ant.taskdefs.optional.junit.JUnitTestRunner.run(JUnitTestRunner.java:421) [junit] at org.apache.tools.ant.taskdefs.optional.junit.JUnitTestRunner.launch(JUnitTestRunner.java:912) [junit] at org.apache.tools.ant.taskdefs.optional.junit.JUnitTestRunner.main(JUnitTestRunner.java:766) [junit] NULL 437 [junit] NULL 438 [junit] NULL 439 [junit] NULL 44 ... [junit] 5 348_348 [junit] 5 401_401 [junit] 5 469_469 [junit] Tests run: 1, Failures: 1, Errors: 0, Time elapsed: 22.597 sec [junit] Test org.apache.hadoop.hive.cli.TestCliDriver FAILED BUILD FAILED /Users/grahamb/ws/hive-svn/hive-trunk-clean/build.xml:142: The following error occurred while executing this line: /Users/grahamb/ws/hive-svn/hive-trunk-clean/build.xml:89: The following error occurred while executing this line: /Users/grahamb/ws/hive-svn/hive-trunk-clean/build-common.xml:316: Tests failed! Total time: 1 minute 28 seconds thanks, Bill
Re: input20.q unit test failure
Thanks Namit for giving it a shot. I just updated again and it's still failing. I tried a fresh checkout and it also failed. Any pointers on how to troubleshoot this? I'm reluctant to submit a patch with this failure, even though it's happening without any of my modifications. On Fri, Oct 2, 2009 at 10:36 AM, Namit Jain nj...@facebook.com wrote: I tried the test on the latest version and it ran fine for me. [nj...@dev029 hive4]$ svn info Path: . URL: http://svn.apache.org/repos/asf/hadoop/hive/trunk Repository Root: http://svn.apache.org/repos/asf Repository UUID: 13f79535-47bb-0310-9956-ffa450edef68 Revision: 821103 Node Kind: directory Schedule: normal Last Changed Author: zshao Last Changed Rev: 820823 Last Changed Date: 2009-10-01 15:19:08 -0700 (Thu, 01 Oct 2009) Can you do update your repository ? *From:* Bill Graham [mailto:billgra...@gmail.com] *Sent:* Friday, October 02, 2009 10:19 AM *To:* hive-user@hadoop.apache.org *Subject:* input20.q unit test failure Hi, I'm trying to run the unit tests before submitting a patch and I'm getting a test failure. I've tried running the same test on a fresh checkout and it also fails. Below is an excerpt of the output. Is anyone else able to run this test? [grah...@bgrahammbproosx:~/ws/hive-svn/hive-trunk-clean]$ svn up ... Updated to revision 821099. [grah...@bgrahammbproosx:~/ws/hive-svn/hive-trunk-clean]$ svn stat [grah...@bgrahammbproosx:~/ws/hive-svn/hive-trunk-clean]$ svn diff [grah...@bgrahammbproosx:~/ws/hive-svn/hive-trunk-clean]$ svn info Path: . URL: http://svn.apache.org/repos/asf/hadoop/hive/trunk Repository Root: http://svn.apache.org/repos/asf Repository UUID: 13f79535-47bb-0310-9956-ffa450edef68 Revision: 821099 Node Kind: directory Schedule: normal Last Changed Author: zshao Last Changed Rev: 820823 Last Changed Date: 2009-10-01 15:19:08 -0700 (Thu, 01 Oct 2009) [grah...@bgrahammbproosx:~/ws/hive-svn/hive-trunk-clean]$ ant clean test -Dtestcase=TestCliDriver -Dqfile=input20.q ... [junit] NULL 432 [junit] NULL 435 [junit] NULL 436 [junit] junit.framework.AssertionFailedError: Client execution results failed with error code = 1 [junit] at junit.framework.Assert.fail(Assert.java:47) [junit] at org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_input20(TestCliDriver.java:96) [junit] at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) [junit] at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) [junit] at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) [junit] at java.lang.reflect.Method.invoke(Method.java:597) [junit] at junit.framework.TestCase.runTest(TestCase.java:154) [junit] at junit.framework.TestCase.runBare(TestCase.java:127) [junit] at junit.framework.TestResult$1.protect(TestResult.java:106) [junit] at junit.framework.TestResult.runProtected(TestResult.java:124) [junit] at junit.framework.TestResult.run(TestResult.java:109) [junit] at junit.framework.TestCase.run(TestCase.java:118) [junit] at junit.framework.TestSuite.runTest(TestSuite.java:208) [junit] at junit.framework.TestSuite.run(TestSuite.java:203) [junit] at org.apache.tools.ant.taskdefs.optional.junit.JUnitTestRunner.run(JUnitTestRunner.java:421) [junit] at org.apache.tools.ant.taskdefs.optional.junit.JUnitTestRunner.launch(JUnitTestRunner.java:912) [junit] at org.apache.tools.ant.taskdefs.optional.junit.JUnitTestRunner.main(JUnitTestRunner.java:766) [junit] NULL 437 [junit] NULL 438 [junit] NULL 439 [junit] NULL 44 ... 
[junit] 5 348_348 [junit] 5 401_401 [junit] 5 469_469 [junit] Tests run: 1, Failures: 1, Errors: 0, Time elapsed: 22.597 sec [junit] Test org.apache.hadoop.hive.cli.TestCliDriver FAILED BUILD FAILED /Users/grahamb/ws/hive-svn/hive-trunk-clean/build.xml:142: The following error occurred while executing this line: /Users/grahamb/ws/hive-svn/hive-trunk-clean/build.xml:89: The following error occurred while executing this line: /Users/grahamb/ws/hive-svn/hive-trunk-clean/build-common.xml:316: Tests failed! Total time: 1 minute 28 seconds thanks, Bill
Re: Should all processors return a DriverResponse?
Looking again at the solution to HIVE-795 I see that adding the runCommand method to Driver worked for that class, but deviated from the approach used by the CommandProcessor interface. Hence the issue you're running into. Driver got this new method: public DriverResponse runCommand(String command) But the CommandProcessor interface, which Driver implements, has this method: public int run(String command) Other implementations of CommandProcessor should also return a composite response object instead of an int. I think the ideal solution would be if the CommandProcessor interface instead had either of these methods: public CommandProcessorResponse runCommand(String command); or public CommandProcessorResponse run(String command); And CommandProcessorResponse had attributes for responseCode and errorMessage. DriverResponse could extend CommandProcessorResponse (as could other response types as needed) and have the SQLState attribute. If we move towards a solution like this, then the question becomes how hard do we try to maintain backward compatibility with the CommandProcessor interface? Do we just change the interface (which would be easier and result in cleaner code) or do we migrate to a new interface and deprecate the old? I lean towards the former, but would like to hear what others have to say. Although it's a public method, I'd expect that there probably aren't many implementations outside of the Hive code base that are written against CommandProcessor, and the fact that we're at a 0.0.x version should imply that internal interfaces might change from release to release. Other thoughts? thanks, Bill On Tue, Sep 29, 2009 at 7:42 PM, Edward Capriolo edlinuxg...@gmail.com wrote: All, I am looking to integrate HWI with https://issues.apache.org/jira/browse/HIVE-795 Should all Processors return a DriverResponse? the web interface allows a list of commands as the CLI would take. I was storing this in ListInt I was looking to change this to ListDriverResult I also have to extend the class... class DriverResponseWrapper extends DriverResponse { public DriverResponseWrapper (int x){ super(x); } } Should DriverReponse and CommandResponse exist maybe as a subclass of Response. Edward
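Sketched out, the shape being proposed here would be roughly the following (three separate source files collapsed into one listing); this is the discussion's proposal, not the Hive API as committed at the time:

// CommandProcessorResponse.java -- base response every processor returns.
public class CommandProcessorResponse {
    private final int responseCode;
    private final String errorMessage;

    public CommandProcessorResponse(int responseCode, String errorMessage) {
        this.responseCode = responseCode;
        this.errorMessage = errorMessage;
    }

    public int getResponseCode() { return responseCode; }
    public String getErrorMessage() { return errorMessage; }
}

// DriverResponse.java -- Driver-specific subclass carrying the SQLState.
public class DriverResponse extends CommandProcessorResponse {
    private final String sqlState;

    public DriverResponse(int responseCode, String errorMessage, String sqlState) {
        super(responseCode, errorMessage);
        this.sqlState = sqlState;
    }

    public String getSQLState() { return sqlState; }
}

// CommandProcessor.java -- the interface change under discussion.
public interface CommandProcessor {
    CommandProcessorResponse run(String command);
}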
Re: Should all processors return a DriverResponse?
Edward your approach looks good, but I have a few comments. - Since we agree that we don't need to worry about backward compatibility, then why have both methods as part of the CommandProcessor interface? Seems we should get rid of the method that returns an int. If we decide to keep it, then we're saying that backward compatibility does matter to us. In that case we should keep it and deprecate it. - You included SQLState in the response base class, which makes sense. If that's the case though, do we need to create DriverResponse and CommandResponse subclasses? If the subclasses need to have add'l methods, then yes we do. But if they don't then I don't see the need. We could always subclass later if the need arises. Maybe we extend for DriverResponse just for backward compatibility, but otherwise we don't need to? thanks, Bill On Wed, Sep 30, 2009 at 1:45 PM, Edward Capriolo edlinuxg...@gmail.comwrote: On Wed, Sep 30, 2009 at 4:32 PM, Bill Graham billgra...@gmail.com wrote: Looking again at the solution to HIVE-795 I see that adding the runCommand method to Driver worked for that class, but deviated from the approach used by the CommandProcessor interface. Hence the issue you're running into. Driver got this new method: public DriverResponse runCommand(String command) But the CommandProcessor interface, which Driver implements has this method: public int run(String command) Other implementations CommandProcessor should also return a composite response object instead of an int. I think the ideal solution would be if the CommandProcessor interface instead had a method either of these: public CommandProcessorResponse runCommand(String command); or public CommandProcessorResponse run(String command); And CommandProcessorResponse had attributes for responseCode and errorMessage. DriverResponse could extend CommandProcessorResponse (as could other response types as needed) and have the SQLState attribute. If we move towards a solution like this, then the question becomes how hard to we try to maintain backward compatibility with the CommandProcessor interface? Do we just change the interface (which would be easier and result in cleaner code) or do we migrate to a new interface and deprecate the old? I lean towards the former, but would like to hear what others have to say. Although it's a public method, I'd expect that there probably aren't many implementations outside of the Hive code base that are written against CommandProcessor, and the fact that we're at a 0.0.x version should imply that internal interfaces might change from release to release. Other thoughts? thanks, Bill On Tue, Sep 29, 2009 at 7:42 PM, Edward Capriolo edlinuxg...@gmail.com wrote: All, I am looking to integrate HWI with https://issues.apache.org/jira/browse/HIVE-795 Should all Processors return a DriverResponse? the web interface allows a list of commands as the CLI would take. I was storing this in ListInt I was looking to change this to ListDriverResult I also have to extend the class... class DriverResponseWrapper extends DriverResponse { public DriverResponseWrapper (int x){ super(x); } } Should DriverReponse and CommandResponse exist maybe as a subclass of Response. Edward Thanks Bill, I wanted to open a Jira but it seems like an issue the past two days. I agree that there are few/no implementations outside of the CLI of CommandProcessor. I do not think we need to support backwards compatibility for the change, but doing it like DriverResponse is logical. 
Have the old method be a chained call to the new method. I have a slight variation of your suggestion but essentially the same idea. Driver.java int run (String command) DriverResponse runCommand(command) We should do the same with CommandProcessor CommandProcessor.java int run (String command) CommandResponse runCommand(command) The refactoring would involve creating an abstract base class Response. All the private members would move to the base class and become protected. public abstract class Response { protected int responseCode; protected String errorMessage; protected String SQLState; public Response(int responseCode) { this(responseCode, null, null); } public Response(int responseCode, String errorMessage, String SQLState) { this.responseCode = responseCode; this.errorMessage = errorMessage; this.SQLState = SQLState; } public int getResponseCode() { return responseCode; } public String getErrorMessage() { return errorMessage; } public String getSQLState() { return SQLState; } } public class DriverResponse extends Response { public DriverResponse(int responseCode) { this(responseCode, null, null); } public DriverResponse(int responseCode, String errorMessage, String
Re: Problems using hive JDBC driver on Windows (with Squirrel SQL Client)
Additional help could certainly be used on the JDBC driver. I agree that getting the JDBC support full-featured is a good place to focus your energy. Guys, correct me if I'm wrong, but the approach has generally been to pick a SQL client desktop app and implement the JDBC methods needed to support it. Raghu did that with Pentaho and I did the same for Squirrel. I don't normally use Squirrel, but I implemented against it since it's open source, it gives great visibility in the logs/admin window into which JDBC methods are being called that are not yet implemented, and it's fairly forgiving. (I'm a fan of DBVisualizer, but it doesn't give very much useful info when something in the driver isn't implemented, plus it tends to fail fast when an optional metadata method fails.) Regarding what's currently being worked on in the driver, I'm working on HIVE-795 to get better error messaging and SQLStates passed from the Hive server to the JDBC driver. Others can comment on what else is active, but I'd recommend searching for open JIRAs where Component/s is 'Client'. This issue has also become somewhat of an umbrella bug for JDBC, and contains a number of TODOs: https://issues.apache.org/jira/browse/HIVE-576 HTH. thanks, Bill On Wed, Sep 16, 2009 at 8:36 PM, Vijay tec...@gmail.com wrote: This may be something I'd be interested in working on. My idea in general would be to make the JDBC driver as thin and full-featured as reasonable. I'm convinced this is the best way to integrate hive with many of the excellent tools available out there. Although I could get the JDBC driver to work with a couple of these tools, I'm having trouble getting it working with many others. Especially on Windows it is slightly more painful. Is there some work in progress along these lines? Any thoughts or pointers? Thanks in advance, Vijay On Sep 15, 2009 10:59 PM, Prasad Chakka pcha...@facebook.com wrote: SerDe may be needed to deserialize the result from the Hive server. But I also thought the results are in delimited form (LazySimpleSerDe or MetadataTypedColumnSetSerDe or something like that), so it is possible to decouple the hadoop jar but it will take some work. -- *From: *Bill Graham billgra...@gmail.com *Reply-To: *hive-user@hadoop.apache.org, billgra...@gmail.com *Date: *Tue, 15 Sep 2009 22:54:50 -0700 *To: *Vijay tec...@gmail.com *Cc: *hive-user@hadoop.apache.org *Subject: *Re: Problems using hive JDBC driver on Windows (with Squirrel SQL Client) Good question. The JDBC package relies on the hive serde2 package, which has at least the followin... Thanks a lot Bill! That was my problem. I thought my code base was pretty recent. Apparently not...
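For anyone who wants to exercise the driver from code rather than from a desktop client, a minimal connection sketch looks roughly like the following. The host, port and database are placeholders, and it assumes the pre-HiveServer2 driver class name discussed in this thread (org.apache.hadoop.hive.jdbc.HiveDriver) is on the classpath along with its dependencies.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbcSmokeTest {
  public static void main(String[] args) throws Exception {
    // Register the Hive JDBC driver (class name used by the 0.x driver).
    Class.forName("org.apache.hadoop.hive.jdbc.HiveDriver");

    // Placeholder host/port/database; adjust for your Hive server.
    Connection con = DriverManager.getConnection("jdbc:hive://localhost:10000/default", "", "");
    Statement stmt = con.createStatement();

    // A simple query to confirm the round trip works.
    ResultSet rs = stmt.executeQuery("show tables");
    while (rs.next()) {
      System.out.println(rs.getString(1));
    }
    con.close();
  }
}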
Re: Problems using hive JDBC driver on Windows (with Squirrel SQL Client)
Hi Vijay, Check your classpath to make sure you've got the correct hive-jdbc.jar version built using either the trunk or the current 0.4 branch. HiveStatement.java:390 used to throw 'java.sql.SQLException: Method not supported' before HIVE-679 was committed. In the current code base after the commit, the setMaxRows method is on lines 422-425. thanks, Bill On Tue, Sep 15, 2009 at 2:13 PM, Vijay tec...@gmail.com wrote: I'm having trouble getting the hive jdbc driver to work on Windows. I'm following the Squirrel SQL Client setup from the Hive/HiveJDBCInterface wiki page. Everything works great when Squirrel SQL Client is running on Linux but on Windows it doesn't. Squirrel can connect to the hive server fine so the setup is alright. However, when I issue a query, it doesn't seem to execute at all. I see this exception in the Squirrel client: 2009-09-15 14:10:35,268 [Thread-5] ERROR net.sourceforge.squirrel_sql.client.session.SQLExecuterTask - Can't Set MaxRows java.sql.SQLException: Method not supported at org.apache.hadoop.hive.jdbc.HiveStatement.setMaxRows(HiveStatement.java:390) at net.sourceforge.squirrel_sql.client.session.SQLExecuterTask.setMaxRows(SQLExecuterTask.java:318) at net.sourceforge.squirrel_sql.client.session.SQLExecuterTask.run(SQLExecuterTask.java:157) at net.sourceforge.squirrel_sql.fw.util.TaskExecuter.run(TaskExecuter.java:82) at java.lang.Thread.run(Thread.java:619) I don't seem to get this exception on Linux. I can't get the Squirrel client to not set max rows but I'm not entirely sure that's the real problem. Any help is appreciated. Vijay
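A related note for people writing their own JDBC client code (it won't help inside Squirrel, whose calls you can't change): older builds of the driver throw SQLException("Method not supported") from optional Statement methods such as setMaxRows, so it can be worth guarding those calls. A small sketch of that guard, assuming nothing beyond standard java.sql:

import java.sql.SQLException;
import java.sql.Statement;

public final class JdbcCompat {
  private JdbcCompat() {}

  // Older Hive JDBC drivers throw SQLException("Method not supported")
  // from optional Statement methods such as setMaxRows. Treat that as
  // a no-op instead of failing the whole query.
  public static void setMaxRowsIfSupported(Statement stmt, int maxRows) {
    try {
      stmt.setMaxRows(maxRows);
    } catch (SQLException e) {
      System.err.println("setMaxRows not supported by this driver: " + e.getMessage());
    }
  }
}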
Re: which thrift revision do you use?
+1 I've been struggling with thrift versions as well, see: https://issues.apache.org/jira/browse/HIVE-795?focusedCommentId=12754020&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12754020 Any insight into which version of thrift the Hive trunk is using would be helpful. On Fri, Sep 11, 2009 at 1:21 AM, Min Zhou coderp...@gmail.com wrote: Hi all, we've tried the newest one from trunk and r760184, and neither of them can produce the same code as the hive trunk. Which thrift revision do you use? Thanks, Min -- My research interests are distributed systems, parallel computing and bytecode based virtual machine. My profile: http://www.linkedin.com/in/coderplay My blog: http://coderplay.javaeye.com
Re: Adding jar files when running hive in hwi mode or hiveserver mode
+1 for the HWI - HiveServer approach. Building out rich APIs in the HiveServer (thrift currently, and possible REST at some point), would allow the HiveServer to focus on the functional API. The HWI (and others) could then focus on rich UI functionality. The two would have a clean decoupling, which would reduce complexity of the codebases and help abid by the KISS principle. On Wed, Aug 26, 2009 at 2:42 PM, Edward Capriolo edlinuxg...@gmail.comwrote: On Wed, Aug 26, 2009 at 3:25 PM, Raghu Murthyrmur...@facebook.com wrote: Even if we decided to have multiple HiveServers, wouldn't it be possible for HWI to randomly pick a HiveServer to connect to per query/client? On 8/26/09 12:16 PM, Ashish Thusoo athu...@facebook.com wrote: +1 for ajaxing this baby. On the broader question of whether we should combine HWI and HiveServer - I think there are definite deployment and code reuse advantages in doing so, however keeping them separate also has the advantage that we can cluster HiveServers independently from HWI. Since the HiveServer sits in the data path, the independent scaling may have advantages. I am not sure how strong of an argument that is to not put them together. Simplicity obviously indicates that we should have them together. Thoughts? Ashish -Original Message- From: Edward Capriolo [mailto:edlinuxg...@gmail.com] Sent: Wednesday, August 26, 2009 9:45 AM To: hive-user@hadoop.apache.org Subject: Re: Adding jar files when running hive in hwi mode or hiveserver mode On Tue, Aug 25, 2009 at 8:13 PM, Vijaytec...@gmail.com wrote: Yep, I got it and now it works perfectly! I like hwi btw! It definitely makes things easier for a wider audience to try out hive. Your new session result bucket idea is very nice as well. I will keep trying more things and see if anything else comes up but so far it looks great! Thanks Edward! On Tue, Aug 25, 2009 at 7:25 AM, Edward Capriolo edlinuxg...@gmail.com wrote: On Tue, Aug 25, 2009 at 10:18 AM, Edward Caprioloedlinuxg...@gmail.com wrote: On Mon, Aug 24, 2009 at 10:13 PM, Vijaytec...@gmail.com wrote: Probably spoke too soon :) I added this comment to the JIRA ticket above. Hi, I tried the latest patch on trunk and there seems to be a problem. I was interested in using the add jar command to add jar files to the path. However, by the time the command flows through the SessionState to the AddResourceProcessor (in ./ql/src/java/org/apache/hadoop/hive/ql/processors/AddResourceProc essor.java), the command word add is not being stripped so the resource processor is trying to find a ResourceType of ADD. I'm not sure if this was an existing bug or was a result of the current set of changes. [ Show ] Vijay added a comment - 24/Aug/09 07:12 PM Hi, I tried the latest patch on trunk and there seems to be a problem. I was interested in using the add jar command to add jar files to the path. However, by the time the command flows through the SessionState to the AddResourceProcessor (in ./ql/src/java/org/apache/hadoop/hive/ql/processors/AddResourceProc essor.java), the command word add is not being stripped so the resource processor is trying to find a ResourceType of ADD. I'm not sure if this was an existing bug or was a result of the current set of changes. On Mon, Aug 24, 2009 at 5:30 PM, Vijay tec...@gmail.com wrote: That's awesome and looks like exactly what I needed. Local file system requirement is perfectly ok for now. I will check it out right away! Hopefully it will be checked in soon. Thanks Edward! 
On Mon, Aug 24, 2009 at 5:14 PM, Edward Capriolo edlinuxg...@gmail.com wrote: On Mon, Aug 24, 2009 at 8:09 PM, Prasad Chakkapcha...@facebook.com wrote: Vijay, there is no solution for it yet. There may be a jira open but AFAIK, no one is working on it. You are welcome to contribute this feature. Prasad From: Vijay tec...@gmail.com Reply-To: hive-user@hadoop.apache.org Date: Mon, 24 Aug 2009 16:59:28 -0700 To: hive-user@hadoop.apache.org Subject: Re: Adding jar files when running hive in hwi mode or hiveserver mode Hi, is there any solution for this? How does everybody include custom jar files running hive in a non-cli mode? Thanks in advance, Vijay On Sat, Aug 22, 2009 at 6:19 PM, Vijay tec...@gmail.com wrote: When I run hive in cli mode, I add the hive_contrib.jar file using this command: hive add jar lib/hive_contrib.jar Is there a way to do this automatically when running hive in hwi or hiveserver modes? Or do I have to add the jar file explicitly to any of the startup scripts? Vijay, Currently HWI does not support this. The changes in https://issues.apache.org/jira/browse/HIVE-716 will make this possible (although I did not test but it should work as the cli does). The
Re: Adding jar files when running hive in hwi mode or hiveserver mode
The JDBC driver now is now able to integrate with some SQL desktop tools for basic querying FYI. That still requires the user to know SQL, but at least it doesn't require working on the command line. The SQuirrel SQL client has been tested with the current JDBC release: http://wiki.apache.org/hadoop/Hive/HiveJDBCInterface#head-98f2bc43312161b56e267773267546c080f4fb27 There's also ODBC driver work being done that has been tested with MicroStrategy, but it only supports Linux currently: http://wiki.apache.org/hadoop/Hive/HiveODBC On Wed, Aug 26, 2009 at 3:40 PM, Vijay tec...@gmail.com wrote: Having played around with hive cli/hiveserver/hwi for a few weeks I think I understand the various pieces better now. Can people provide some real world scenarios where they use the different modes? As far as UI and more importantly making hive accessible to users that are not super familiar with SQL goes, it seems to me like hive JDBC might be the best option since that way hive can be relatively seamlessly integrated with many sophisticated reporting tools. I haven't explored that option much yet. Hive cli was good enough for me to play around with the framework and I can keep using it for real work. However, having a simple GUI like hwi is better for many reasons but I don't think it can ever be a replacement for all the available reporting tools. So, I guess I'm kind of conflicted at this point :) My ultimate goal is to put the power of hadoop and hive into the hands of non-technical (business) users. At first I thought I could probably build a simple UI (which is a bunch of php files really) using the php thrift API but that API did not seem suited for short lived web applications without some sort of sophisticated session management. Any thoughts/ideas are greatly appreciated. On Wed, Aug 26, 2009 at 2:50 PM, Bill Graham billgra...@gmail.com wrote: +1 for the HWI - HiveServer approach. Building out rich APIs in the HiveServer (thrift currently, and possible REST at some point), would allow the HiveServer to focus on the functional API. The HWI (and others) could then focus on rich UI functionality. The two would have a clean decoupling, which would reduce complexity of the codebases and help abid by the KISS principle. On Wed, Aug 26, 2009 at 2:42 PM, Edward Capriolo edlinuxg...@gmail.comwrote: On Wed, Aug 26, 2009 at 3:25 PM, Raghu Murthyrmur...@facebook.com wrote: Even if we decided to have multiple HiveServers, wouldn't it be possible for HWI to randomly pick a HiveServer to connect to per query/client? On 8/26/09 12:16 PM, Ashish Thusoo athu...@facebook.com wrote: +1 for ajaxing this baby. On the broader question of whether we should combine HWI and HiveServer - I think there are definite deployment and code reuse advantages in doing so, however keeping them separate also has the advantage that we can cluster HiveServers independently from HWI. Since the HiveServer sits in the data path, the independent scaling may have advantages. I am not sure how strong of an argument that is to not put them together. Simplicity obviously indicates that we should have them together. Thoughts? Ashish -Original Message- From: Edward Capriolo [mailto:edlinuxg...@gmail.com] Sent: Wednesday, August 26, 2009 9:45 AM To: hive-user@hadoop.apache.org Subject: Re: Adding jar files when running hive in hwi mode or hiveserver mode On Tue, Aug 25, 2009 at 8:13 PM, Vijaytec...@gmail.com wrote: Yep, I got it and now it works perfectly! I like hwi btw! It definitely makes things easier for a wider audience to try out hive. 
Your new session result bucket idea is very nice as well. I will keep trying more things and see if anything else comes up but so far it looks great! Thanks Edward! On Tue, Aug 25, 2009 at 7:25 AM, Edward Capriolo edlinuxg...@gmail.com wrote: On Tue, Aug 25, 2009 at 10:18 AM, Edward Caprioloedlinuxg...@gmail.com wrote: On Mon, Aug 24, 2009 at 10:13 PM, Vijaytec...@gmail.com wrote: Probably spoke too soon :) I added this comment to the JIRA ticket above. Hi, I tried the latest patch on trunk and there seems to be a problem. I was interested in using the add jar command to add jar files to the path. However, by the time the command flows through the SessionState to the AddResourceProcessor (in ./ql/src/java/org/apache/hadoop/hive/ql/processors/AddResourceProc essor.java), the command word add is not being stripped so the resource processor is trying to find a ResourceType of ADD. I'm not sure if this was an existing bug or was a result of the current set of changes. [ Show ] Vijay added a comment - 24/Aug/09 07:12 PM Hi, I tried the latest patch on trunk and there seems to be a problem. I was interested in using the add jar command to add jar files to the path. However, by the time the command flows through the SessionState
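On Vijay's point above about driving Hive from a web tier through Thrift: a long-lived Java service can talk to a HiveServer with the Thrift client classes shipped in the hive service jar. The host and port below are placeholders, and the method names (execute, fetchAll) are assumed from the ThriftHive service interface that appears in stack traces elsewhere in these threads, so treat this as a rough sketch rather than a tested program.

import java.util.List;

import org.apache.hadoop.hive.service.HiveClient;
import org.apache.thrift.protocol.TBinaryProtocol;
import org.apache.thrift.transport.TSocket;

public class HiveThriftExample {
  public static void main(String[] args) throws Exception {
    // Placeholder host/port for a running HiveServer.
    TSocket transport = new TSocket("localhost", 10000);
    transport.open();

    HiveClient client = new HiveClient(new TBinaryProtocol(transport));

    // Run a query and pull back the delimited result rows.
    client.execute("show tables");
    List<String> rows = client.fetchAll();
    for (String row : rows) {
      System.out.println(row);
    }

    transport.close();
  }
}

Unlike a short-lived PHP page, a service like this can hold the Thrift connection open across requests, which is the session-management concern Vijay raises.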
Re: Adding jar files when running hive in hwi mode or hiveserver mode
How does HWI go about caching query results for others to view? Are the results durable given a bounce of HWI or are they held in memory? We have a process where we build daily summaries from Hive queries that get emailed. Instead I'd like a way to persist/cache the query results on a server and build a custom AJAXy web UI to expose them. Just wondering if HWI could help with this... On Wed, Aug 26, 2009 at 5:46 PM, Edward Capriolo edlinuxg...@gmail.comwrote: On Wed, Aug 26, 2009 at 7:31 PM, Bill Grahambillgra...@gmail.com wrote: The JDBC driver now is now able to integrate with some SQL desktop tools for basic querying FYI. That still requires the user to know SQL, but at least it doesn't require working on the command line. The SQuirrel SQL client has been tested with the current JDBC release: http://wiki.apache.org/hadoop/Hive/HiveJDBCInterface#head-98f2bc43312161b56e267773267546c080f4fb27 There's also ODBC driver work being done that has been tested with MicroStrategy, but it only supports Linux currently: http://wiki.apache.org/hadoop/Hive/HiveODBC On Wed, Aug 26, 2009 at 3:40 PM, Vijay tec...@gmail.com wrote: Having played around with hive cli/hiveserver/hwi for a few weeks I think I understand the various pieces better now. Can people provide some real world scenarios where they use the different modes? As far as UI and more importantly making hive accessible to users that are not super familiar with SQL goes, it seems to me like hive JDBC might be the best option since that way hive can be relatively seamlessly integrated with many sophisticated reporting tools. I haven't explored that option much yet. Hive cli was good enough for me to play around with the framework and I can keep using it for real work. However, having a simple GUI like hwi is better for many reasons but I don't think it can ever be a replacement for all the available reporting tools. So, I guess I'm kind of conflicted at this point :) My ultimate goal is to put the power of hadoop and hive into the hands of non-technical (business) users. At first I thought I could probably build a simple UI (which is a bunch of php files really) using the php thrift API but that API did not seem suited for short lived web applications without some sort of sophisticated session management. Any thoughts/ideas are greatly appreciated. On Wed, Aug 26, 2009 at 2:50 PM, Bill Graham billgra...@gmail.com wrote: +1 for the HWI - HiveServer approach. Building out rich APIs in the HiveServer (thrift currently, and possible REST at some point), would allow the HiveServer to focus on the functional API. The HWI (and others) could then focus on rich UI functionality. The two would have a clean decoupling, which would reduce complexity of the codebases and help abid by the KISS principle. On Wed, Aug 26, 2009 at 2:42 PM, Edward Capriolo edlinuxg...@gmail.com wrote: On Wed, Aug 26, 2009 at 3:25 PM, Raghu Murthyrmur...@facebook.com wrote: Even if we decided to have multiple HiveServers, wouldn't it be possible for HWI to randomly pick a HiveServer to connect to per query/client? On 8/26/09 12:16 PM, Ashish Thusoo athu...@facebook.com wrote: +1 for ajaxing this baby. On the broader question of whether we should combine HWI and HiveServer - I think there are definite deployment and code reuse advantages in doing so, however keeping them separate also has the advantage that we can cluster HiveServers independently from HWI. Since the HiveServer sits in the data path, the independent scaling may have advantages. 
I am not sure how strong of an argument that is to not put them together. Simplicity obviously indicates that we should have them together. Thoughts? Ashish -Original Message- From: Edward Capriolo [mailto:edlinuxg...@gmail.com] Sent: Wednesday, August 26, 2009 9:45 AM To: hive-user@hadoop.apache.org Subject: Re: Adding jar files when running hive in hwi mode or hiveserver mode On Tue, Aug 25, 2009 at 8:13 PM, Vijaytec...@gmail.com wrote: Yep, I got it and now it works perfectly! I like hwi btw! It definitely makes things easier for a wider audience to try out hive. Your new session result bucket idea is very nice as well. I will keep trying more things and see if anything else comes up but so far it looks great! Thanks Edward! On Tue, Aug 25, 2009 at 7:25 AM, Edward Capriolo edlinuxg...@gmail.com wrote: On Tue, Aug 25, 2009 at 10:18 AM, Edward Caprioloedlinuxg...@gmail.com wrote: On Mon, Aug 24, 2009 at 10:13 PM, Vijaytec...@gmail.com wrote: Probably spoke too soon :) I added this comment to the JIRA ticket above. Hi, I tried the latest patch on trunk and there seems to be a problem. I was interested in using the add jar command to add jar
Errors creating MySQL metastore
Hi, I'm trying to set up a MySQL metastore for Hive and I'm getting the exceptions shown below. If anyone could shed some insight as to why this is happening, it would be greatly appreciated. My hive-sites.xml is also attached below. This is how it looks when I start the Hive client with an empty db. The schema gets created, but the errors attached below appear. After the db is created I change datanucleus.autoCreateSchema to false before restarting the client. The same errors appear whenever I restart the client and run show tables, which takes about 30 seconds to complete. Any ideas how to fix this? I've experimented with many combinations of datanucleus.autoCreateColumns and datanucleus.identifier.case, but nothing makes a difference.

<property>
  <name>hive.metastore.local</name>
  <value>true</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://x:11000/hive</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>com.mysql.jdbc.Driver</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionUserName</name>
  <value></value>
</property>
<property>
  <name>javax.jdo.option.ConnectionPassword</name>
  <value></value>
</property>
<property>
  <name>datanucleus.autoCreateSchema</name>
  <value>true</value>
</property>

2009-08-05 10:05:46,543 ERROR Datastore.Schema (Log4JLogger.java:error(115)) - Failed to validate SchemaTable for Schema . Either it doesnt exist, or doesnt validate : Required columns missing from table NUCLEUS_TABLES : `TABLE_NAME`, VERSION, CLASS_NAME, INTERFACE_NAME, OWNER, `TYPE`. Perhaps your MetaData is incorrect, or you havent enabled datanucleus.autoCreateColumns. Required columns missing from table NUCLEUS_TABLES : `TABLE_NAME`, VERSION, CLASS_NAME, INTERFACE_NAME, OWNER, `TYPE`. Perhaps your MetaData is incorrect, or you havent enabled datanucleus.autoCreateColumns. org.datanucleus.store.rdbms.exceptions.MissingColumnException: Required columns missing from table NUCLEUS_TABLES : `TABLE_NAME`, VERSION, CLASS_NAME, INTERFACE_NAME, OWNER, `TYPE`. Perhaps your MetaData is incorrect, or you havent enabled datanucleus.autoCreateColumns.
at org.datanucleus.store.rdbms.table.TableImpl.validateColumns(TableImpl.java:280) at org.datanucleus.store.rdbms.table.TableImpl.validate(TableImpl.java:173) at org.datanucleus.store.rdbms.SchemaAutoStarter.init(SchemaAutoStarter.java:101) at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39) at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27) at java.lang.reflect.Constructor.newInstance(Constructor.java:513) at org.datanucleus.plugin.NonManagedPluginRegistry.createExecutableExtension(NonManagedPluginRegistry.java:576) at org.datanucleus.plugin.PluginManager.createExecutableExtension(PluginManager.java:300) at org.datanucleus.store.AbstractStoreManager.initialiseAutoStart(AbstractStoreManager.java:486) at org.datanucleus.store.rdbms.RDBMSManager.initialiseSchema(RDBMSManager.java:821) at org.datanucleus.store.rdbms.RDBMSManager.init(RDBMSManager.java:394) at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39) at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27) at java.lang.reflect.Constructor.newInstance(Constructor.java:513) at org.datanucleus.plugin.NonManagedPluginRegistry.createExecutableExtension(NonManagedPluginRegistry.java:576) at org.datanucleus.plugin.PluginManager.createExecutableExtension(PluginManager.java:300) at org.datanucleus.store.FederationManager.initialiseStoreManager(FederationManager.java:106) at org.datanucleus.store.FederationManager.init(FederationManager.java:68) at org.datanucleus.ObjectManagerFactoryImpl.initialiseStoreManager(ObjectManagerFactoryImpl.java:152) at org.datanucleus.jdo.JDOPersistenceManagerFactory.freezeConfiguration(JDOPersistenceManagerFactory.java:529) at org.datanucleus.jdo.JDOPersistenceManagerFactory.getPersistenceManagerFactory(JDOPersistenceManagerFactory.java:175) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at javax.jdo.JDOHelper$16.run(JDOHelper.java:1956) at java.security.AccessController.doPrivileged(Native Method) at javax.jdo.JDOHelper.invoke(JDOHelper.java:1951) at
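One way to narrow this down is to rule out connectivity or permission problems by hitting the MySQL instance over JDBC the same way DataNucleus does, outside of Hive. This is only a diagnostic sketch and won't by itself fix the NUCLEUS_TABLES validation error; the URL and credentials are placeholders that should match whatever the metastore is configured with.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class MetastoreConnCheck {
  public static void main(String[] args) throws Exception {
    // Same driver class configured in the hive-sites.xml above.
    Class.forName("com.mysql.jdbc.Driver");

    // Placeholder URL/credentials; use the metastore's actual values.
    Connection con = DriverManager.getConnection("jdbc:mysql://x:11000/hive", "hiveuser", "hivepass");
    Statement stmt = con.createStatement();

    // Check whether the columns DataNucleus complains about are visible
    // to this user over a plain JDBC connection.
    ResultSet rs = stmt.executeQuery("SHOW COLUMNS FROM NUCLEUS_TABLES");
    while (rs.next()) {
      System.out.println(rs.getString(1));
    }
    con.close();
  }
}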
Re: Errors creating MySQL metastore
Yes, I'm using the latest hive-default.xml. What I'm showing is just the contents of my hive-sites.xml. The NUCLEUS_TABLES and all of it's columns listed in the exception exists in the DB, which is what's puzzling me. On Wed, Aug 5, 2009 at 10:51 AM, Prasad Chakka pcha...@facebook.com wrote: Are you using the latest hive-default.xml? It should contain more datanucleus properties than below. It is looking for a table called ‘NUCLEUS_TABLES’ which contains list of tables that got created when the original schema was created. Prasad -- *From: *Bill Graham billgra...@gmail.com *Reply-To: *hive-user@hadoop.apache.org, billgra...@gmail.com *Date: *Wed, 5 Aug 2009 10:19:24 -0700 *To: *hive-user@hadoop.apache.org *Subject: *Errors creating MySQL metastore Hi, I'm trying to set up a MySQL metastore for Hive and I'm getting the exceptions shown below. If anyone could shed some insight as to why this is happening, it would be greatly appreciated. My hive-sites.xml is also attached below. This is how it looks when I start the Hive client with an empty db. The schema gets created, but the errors attached below appear. After the db is created I change datanucleus.autoCreateSchema to false before restarting the client. The same errors appear whenever I restart the client and run show tables, which takes about 30 seconds to complete. Any ideas how to fix this? I've experimented with many combinations of datanucleus.autoCreateColumns and datanucleus.identifier.case, but nothing makes a difference. property namehive.metastore.local/name valuetrue/value /property property namejavax.jdo.option.ConnectionURL/name valuejdbc:mysql://x:11000/hive/value /property property namejavax.jdo.option.ConnectionDriverName/name valuecom.mysql.jdbc.Driver/value /property property namejavax.jdo.option.ConnectionUserName/name value/value /property property namejavax.jdo.option.ConnectionPassword/name value/value /property property namedatanucleus.autoCreateSchema/name valuetrue/value /property 2009-08-05 10:05:46,543 ERROR Datastore.Schema (Log4JLogger.java:error(115)) - Failed to validate SchemaTable for Schema . Either it doesnt exist, or doesnt validate : Required columns missing from table NUCLEUS_TABLES : `TABLE_NAME`, VERSION, CLASS_NAME, INTERFACE_NAME, OWNER, `TYPE`. Perhaps your MetaData is incorrect, or you havent enabled datanucleus.autoCreateColumns. Required columns missing from table NUCLEUS_TABLES : `TABLE_NAME`, VERSION, CLASS_NAME, INTERFACE_NAME, OWNER, `TYPE`. Perhaps your MetaData is incorrect, or you havent enabled datanucleus.autoCreateColumns. org.datanucleus.store.rdbms.exceptions.MissingColumnException: Required columns missing from table NUCLEUS_TABLES : `TABLE_NAME`, VERSION, CLASS_NAME, INTERFACE_NAME, OWNER, `TYPE`. Perhaps your MetaData is incorrect, or you havent enabled datanucleus.autoCreateColumns. 
at org.datanucleus.store.rdbms.table.TableImpl.validateColumns(TableImpl.java:280) at org.datanucleus.store.rdbms.table.TableImpl.validate(TableImpl.java:173) at org.datanucleus.store.rdbms.SchemaAutoStarter.init(SchemaAutoStarter.java:101) at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39) at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27) at java.lang.reflect.Constructor.newInstance(Constructor.java:513) at org.datanucleus.plugin.NonManagedPluginRegistry.createExecutableExtension(NonManagedPluginRegistry.java:576) at org.datanucleus.plugin.PluginManager.createExecutableExtension(PluginManager.java:300) at org.datanucleus.store.AbstractStoreManager.initialiseAutoStart(AbstractStoreManager.java:486) at org.datanucleus.store.rdbms.RDBMSManager.initialiseSchema(RDBMSManager.java:821) at org.datanucleus.store.rdbms.RDBMSManager.init(RDBMSManager.java:394) at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39) at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27) at java.lang.reflect.Constructor.newInstance(Constructor.java:513) at org.datanucleus.plugin.NonManagedPluginRegistry.createExecutableExtension(NonManagedPluginRegistry.java:576) at org.datanucleus.plugin.PluginManager.createExecutableExtension(PluginManager.java:300) at org.datanucleus.store.FederationManager.initialiseStoreManager(FederationManager.java:106) at org.datanucleus.store.FederationManager.init(FederationManager.java:68
Re: JDBC: Infinite while(rs.next()) loop
I would love to see nightly/periodic builds published somewhere, especially if it's going to be some time before Hive 0.4 is released. It seems like people new to Hive get the "check out and build from the trunk" or "apply this patch" answer often on this list. Having nightly builds would make life easier on these folks as well. On Tue, Aug 4, 2009 at 10:33 AM, Edward Capriolo edlinuxg...@gmail.com wrote: On Tue, Aug 4, 2009 at 1:20 PM, Saurabh Nanda saurabhna...@gmail.com wrote: Is there any possibility of having a nightly build off the trunk, before hive 0.4 is officially released? On 8/4/09, Edward Capriolo edlinuxg...@gmail.com wrote: On Tue, Aug 4, 2009 at 10:43 AM, Bill Graham billgra...@gmail.com wrote: +1 I agree. I do not know the answer to that one. Can anyone comment on the future Hive release schedule? On Tue, Aug 4, 2009 at 7:39 AM, Saurabh Nanda saurabhna...@gmail.com wrote: I was dreading this response! Are there any plans to push out a new Hive build with the latest features and bug fixes? Building from trunk is not everyone's cup of tea, you know :-) Any nightly builds that I can pick up? Saurabh. On Tue, Aug 4, 2009 at 8:05 PM, Bill Graham billgra...@gmail.com wrote: This bug has been fixed on the trunk. Check out the hive trunk and build the JDBC driver and you should be fine. On Tue, Aug 4, 2009 at 12:47 AM, Saurabh Nanda saurabhna...@gmail.com wrote: Here's what I'm trying:

ResultSet rs = st.executeQuery("show tables");
while (rs.next()) {
  System.out.println(rs.getString(1));
}

The while loop never terminates; after going through the list of tables, it keeps printing the last table name over and over again. Am I doing something wrong over here, or have I hit a bug? I'm on hive-0.3.0-hadoop-0.18.0-bin Saurabh. -- http://nandz.blogspot.com http://foodieforlife.blogspot.com -- http://nandz.blogspot.com http://foodieforlife.blogspot.com Hive 0.4.0 will be a release candidate soon. The largest major blocker that I know of is dealing with Hadoop 0.20. See: https://issues.apache.org/jira/browse/HIVE-487 Soon after that there should be a release candidate, then voting. Edward -- http://nandz.blogspot.com http://foodieforlife.blogspot.com The major thing on that is we have to build releases for every hadoop major/minor and possibly one off the trunk. I was thinking of doing something similar on my site since accomplishing this is possible with hudson. Does anyone think adding this to hadoop hive is something we should do?
Re: partitions not being created
I just completely removed my all of my Hive tables and folders in HDFS, as well as metadata_db. I then re-built Hive from the latest from the trunk. After replacing my Hive server with the contents of build/dist, and doing the same for my client, I created new tables from scratch and again tried to migrate from ApiUsageTemp -- ApiUsage. I got the same get_partition failed: unknown result error. I decided to skip the table migration and just load data directly into a partitioned table. That also gives the same error. Below is what I tried. Any ideas? hive CREATE TABLE ApiUsage (user STRING, restResource STRING, statusCode INT, requestDate STRING, requestHour INT, numRequests STRING, responseTime STRING, numSlowRequests STRING) PARTITIONED BY (dt STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY ' '; OK Time taken: 0.27 seconds hive describe extended ApiUsage; OK userstring restresourcestring statuscode int requestdate string requesthour int numrequests string responsetimestring numslowrequests string dt string Detailed Table Information Table(tableName:apiusage, dbName:default, owner:grahamb, createTime:1249073147, lastAccessTime:0, retention:0, sd:StorageDescriptor(cols:[FieldSchema(name:user, type:string, comment:null), FieldSchema(name:restresource, type:string, comment:null), FieldSchema(name:statuscode, type:int, comment:null), FieldSchema(name:requestdate, type:string, comment:null), FieldSchema(name:requesthour, type:int, comment:null), FieldSchema(name:numrequests, type:string, comment:null), FieldSchema(name:responsetime, type:string, comment:null), FieldSchema(name:numslowrequests, type:string, comment:null)], location:hdfs://xxx:9000/user/hive/warehouse/apiusage, inputFormat:org.apache.hadoop.mapred.TextInputFormat, outputFormat:org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat, compressed:false, numBuckets:-1, serdeInfo:SerDeInfo(name:null, serializationLib:org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, parameters:{field.delim= , serialization.format= }), bucketCols:[], sortCols:[], parameters:{}), partitionKeys:[FieldSchema(name:dt, type:string, comment:null)], parameters:{}) Time taken: 0.276 seconds hive LOAD DATA INPATH sample_data/apilogs/summary-small/2009/05/18 INTO TABLE ApiUsage PARTITION (dt = 20090518 ); Loading data to table apiusage partition {dt=20090518} Failed with exception org.apache.thrift.TApplicationException: get_partition failed: unknown result FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.MoveTask On Thu, Jul 30, 2009 at 1:24 PM, Prasad Chakka pcha...@facebook.com wrote: This is not backward compatibility issue. Check HIVE-592 for details. Before this patch, a rename doesn’t change the name of the hdfs directory and if you create a new table with the old name of the renamed table then both tables will be pointing to the same directory thus causing problems. HIVE-592 fixes this to rename directories correctly. So if you have created all tables after HIVE-592 patch went in, you should be fine. -- *From: *Bill Graham billgra...@gmail.com *Reply-To: *billgra...@gmail.com *Date: *Thu, 30 Jul 2009 13:09:03 -0700 *To: *Prasad Chakka pcha...@facebook.com *Cc: *hive-user@hadoop.apache.org *Subject: *Re: partitions not being created I sent my last try reply before seeing your last email. Thanks, that seems possible. I did initially create ApiUsageTemp using the most recent Hive release. Then while working on a JIRA I updated my Hive client and server to the more recent builds from the trunk. 
If that could cause such a problem, this is troubling though, since it implies that we can't upgrade Hive without possibly corrupting our metadata store. I'll try again from scratch though and see if it works, thanks. On Thu, Jul 30, 2009 at 1:04 PM, Bill Graham billgra...@gmail.com wrote: Prasad, My setup is Hive client - Hive Server (with local metastore) - Hadoop. I was also suspecting metastore issues, so I've tried multiple times with newly created destination tables and I see the same thing happening. All of the log info I've been able to find I've included already in this thread. Let me know if there's anywhere else I could look for clues. I've included from the client: - /tmp/$USER/hive.log And from the hive server: - Stdout/err logs - /tmp/$USER/hive_job_log*.txt Is there anything else I should be looking at? All of the M/R logs don't show any exceptions anything suspect. Thanks for your time and insights on this issue, I appreciate it. thanks, Bill On Thu, Jul 30, 2009 at 11:57 AM, Prasad Chakka pcha...@facebook.com wrote: Bill, The real error is happening on the Hive Metastore Server or Hive Server (depending on the setup you are using). Error logs on it must have different stack trace. From the information below I am guessing that the way the destination table hdfs
Re: partitions not being created
I'm trying to set a string to a string and I'm seeing this error. I also had an attempt where it was a string to an int, and I also saw the same error. The /tmp/$USER/hive_job_log*.txt file doesn't contain any exceptions, but I've included it's output below. Only the Hive server logs show the exceptions listed above. (Note that the table I'm loading from in this log output is ApiUsageSmall, which is identical to ApiUsageTemp. For some reason the data from ApiUsageTemp is now gone.) QueryStart QUERY_STRING=INSERT OVERWRITE TABLE ApiUsage PARTITION (dt = 20090518) SELECT `(requestDate)?+.+` FROM ApiUsageSmall WHERE requestDate = '2009/05/18' QUERY_ID=app_20090730104242 TIME=1248975752235 TaskStart TASK_NAME=org.apache.hadoop.hive.ql.exec.ExecDriver TASK_ID=Stage-1 QUERY_ID=app_20090730104242 TIME=1248975752235 TaskProgress TASK_HADOOP_PROGRESS=2009-07-30 10:42:34,783 map = 0%, reduce =0% TASK_NAME=org.apache.hadoop.hive.ql.exec.ExecDriver TASK_COUNTERS=Job Counters .Launched map tasks:1,Job Counters .Data-local map tasks:1 TASK_ID=Stage-1 QUERY_ID=app_20090730104242 TASK_HADOOP_ID=job_200906301559_0409 TIME=1248975754785 TaskProgress ROWS_INSERTED=apiusage~296 TASK_HADOOP_PROGRESS=2009-07-30 10:42:43,031 map = 40%, reduce =0% TASK_NAME=org.apache.hadoop.hive.ql.exec.ExecDriver TASK_COUNTERS=File Systems.HDFS bytes read:23019,File Systems.HDFS bytes written:19178,Job Counters .Rack-local map tasks:2,Job Counters .Launched map tasks:5,Job Counters .Data-local map tasks:3,org.apache.hadoop.hive.ql.exec.FilterOperator$Counter.PASSED:592,org.apache.hadoop.hive.ql.exec.FilterOperator$Counter.FILTERED:6,org.apache.hadoop.hive.ql.exec.FileSinkOperator$TableIdEnum.TABLE_ID_1_ROWCOUNT:296,org.apache.hadoop.hive.ql.exec.MapOperator$Counter.DESERIALIZE_ERRORS:0,Map-Reduce Framework.Map input records:302,Map-Reduce Framework.Map input bytes:23019,Map-Reduce Framework.Map output records:0 TASK_ID=Stage-1 QUERY_ID=app_20090730104242 TASK_HADOOP_ID=job_200906301559_0409 TIME=1248975763033 TaskProgress ROWS_INSERTED=apiusage~1471 TASK_HADOOP_PROGRESS=2009-07-30 10:42:44,068 map = 100%, reduce =100% TASK_NAME=org.apache.hadoop.hive.ql.exec.ExecDriver TASK_COUNTERS=File Systems.HDFS bytes read:114068,File Systems.HDFS bytes written:95275,Job Counters .Rack-local map tasks:2,Job Counters .Launched map tasks:5,Job Counters .Data-local map tasks:3,org.apache.hadoop.hive.ql.exec.FilterOperator$Counter.PASSED:2942,org.apache.hadoop.hive.ql.exec.FilterOperator$Counter.FILTERED:27,org.apache.hadoop.hive.ql.exec.FileSinkOperator$TableIdEnum.TABLE_ID_1_ROWCOUNT:1471,org.apache.hadoop.hive.ql.exec.MapOperator$Counter.DESERIALIZE_ERRORS:0,Map-Reduce Framework.Map input records:1498,Map-Reduce Framework.Map input bytes:114068,Map-Reduce Framework.Map output records:0 TASK_ID=Stage-1 QUERY_ID=app_20090730104242 TASK_HADOOP_ID=job_200906301559_0409 TIME=1248975764071 TaskEnd ROWS_INSERTED=apiusage~1471 TASK_RET_CODE=0 TASK_HADOOP_PROGRESS=2009-07-30 10:42:44,068 map = 100%, reduce =100% TASK_NAME=org.apache.hadoop.hive.ql.exec.ExecDriver TASK_COUNTERS=File Systems.HDFS bytes read:114068,File Systems.HDFS bytes written:95275,Job Counters .Rack-local map tasks:2,Job Counters .Launched map tasks:5,Job Counters .Data-local map 
tasks:3,org.apache.hadoop.hive.ql.exec.FilterOperator$Counter.PASSED:2942,org.apache.hadoop.hive.ql.exec.FilterOperator$Counter.FILTERED:27,org.apache.hadoop.hive.ql.exec.FileSinkOperator$TableIdEnum.TABLE_ID_1_ROWCOUNT:1471,org.apache.hadoop.hive.ql.exec.MapOperator$Counter.DESERIALIZE_ERRORS:0,Map-Reduce Framework.Map input records:1498,Map-Reduce Framework.Map input bytes:114068,Map-Reduce Framework.Map output records:0 TASK_ID=Stage-1 QUERY_ID=app_20090730104242 TASK_HADOOP_ID=job_200906301559_0409 TIME=1248975764199 TaskStart TASK_NAME=org.apache.hadoop.hive.ql.exec.ConditionalTask TASK_ID=Stage-4 QUERY_ID=app_20090730104242 TIME=1248975764199 TaskEnd TASK_RET_CODE=0 TASK_NAME=org.apache.hadoop.hive.ql.exec.ConditionalTask TASK_ID=Stage-4 QUERY_ID=app_20090730104242 TIME=1248975782277 TaskStart TASK_NAME=org.apache.hadoop.hive.ql.exec.MoveTask TASK_ID=Stage-0 QUERY_ID=app_20090730104242 TIME=1248975782277 TaskEnd TASK_RET_CODE=1 TASK_NAME=org.apache.hadoop.hive.ql.exec.MoveTask TASK_ID=Stage-0 QUERY_ID=app_20090730104242 TIME=1248975782473 QueryEnd ROWS_INSERTED=apiusage~1471 QUERY_STRING=INSERT OVERWRITE TABLE ApiUsage PARTITION (dt = 20090518) SELECT `(requestDate)?+.+` FROM ApiUsageSmall WHERE requestDate = '2009/05/18' QUERY_ID=app_20090730104242 QUERY_NUM_TASKS=2 TIME=1248975782474 On Thu, Jul 30, 2009 at 10:09 AM, Prasad Chakka pcha...@facebook.comwrote: Are you sure you are getting the same error even with the schema below (i.e. trying to set a string to an int column?). Can you give the full stack trace that you might see in /tmp/$USER/hive.log? -- *From: *Bill Graham billgra...@gmail.com *Reply-To: *hive-user@hadoop.apache.org, billgra...@gmail.com *Date: *Thu
HiveServer and client user accounts
Hi, I've found that if I access Hive via the HiveServer (using either the Hive shell or the JDBC client), tables are created as the user who is running the Hive server, not the user who is executing the commands. I understand why this happens, but it doesn't seem like the expected behavior to me. If I were to run a Hive Server on the NN as user 'hive', I'd expect other users to be able to connect to it and add/remove tables as themselves, which isn't the case currently. If from the command line I use hadoop to add files to HDFS (as user 'bill') and then use hive to create a table (as user 'hive'), I can't put my data into it, due to ownership conflicts. It brings two questions to mind: 1. Is this a bug? If so I'll create a JIRA. 2. How are other people dealing with this issue in production?* * Granted, the HiveServer currently doesn't support more than one client at a time, which is a completely separate issue that I'm curious about w.r.t. production use. Is the answer that the Hive Server just isn't production ready? If that is the case, then how are people using Hive in a multi-user environment? Does each client just connect directly to a central metastore db? thanks, Bill
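One quick way to see the ownership mismatch described here is to inspect the warehouse directory from the client side with the Hadoop FileSystem API. This is only a diagnostic sketch; the table directory path is a placeholder, and it assumes the client's Hadoop configuration points at the same cluster the HiveServer uses.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class WarehouseOwnerCheck {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    // Placeholder table directory under the default warehouse location.
    Path tableDir = new Path("/user/hive/warehouse/apiusage");
    FileStatus status = fs.getFileStatus(tableDir);

    // If this prints the HiveServer's user rather than yours, loading
    // your own HDFS files into the table will hit the ownership conflict
    // described above.
    System.out.println(tableDir + " owned by " + status.getOwner() + ":"
        + status.getGroup() + " " + status.getPermission());
  }
}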
Re: partitions not being created
That file contains a similar error as the Hive Server logs: 2009-07-30 11:44:21,095 WARN mapred.JobClient (JobClient.java:configureCommandLineOptions(510)) - Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same. 2009-07-30 11:44:48,070 WARN mapred.JobClient (JobClient.java:configureCommandLineOptions(510)) - Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same. 2009-07-30 11:45:27,796 ERROR metadata.Hive (Hive.java:getPartition(588)) - org.apache.thrift.TApplicationException: get_partition failed: unknown result at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.recv_get_partition(ThriftHiveMetastore.java:784) at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.get_partition(ThriftHiveMetastore.java:752) at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.getPartition(HiveMetaStoreClient.java:415) at org.apache.hadoop.hive.ql.metadata.Hive.getPartition(Hive.java:579) at org.apache.hadoop.hive.ql.metadata.Hive.loadPartition(Hive.java:466) at org.apache.hadoop.hive.ql.exec.MoveTask.execute(MoveTask.java:135) at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:335) at org.apache.hadoop.hive.ql.Driver.run(Driver.java:241) at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:122) at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:165) at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:258) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.hadoop.util.RunJar.main(RunJar.java:155) at org.apache.hadoop.mapred.JobShell.run(JobShell.java:54) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79) at org.apache.hadoop.mapred.JobShell.main(JobShell.java:68) 2009-07-30 11:45:27,797 ERROR exec.MoveTask (SessionState.java:printError(279)) - Failed with exception org.apache.thrift.TApplicationException: get_partition failed: unknown result org.apache.hadoop.hive.ql.metadata.HiveException: org.apache.thrift.TApplicationException: get_partition failed: unknown result at org.apache.hadoop.hive.ql.metadata.Hive.getPartition(Hive.java:589) at org.apache.hadoop.hive.ql.metadata.Hive.loadPartition(Hive.java:466) at org.apache.hadoop.hive.ql.exec.MoveTask.execute(MoveTask.java:135) at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:335) at org.apache.hadoop.hive.ql.Driver.run(Driver.java:241) at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:122) at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:165) at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:258) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.hadoop.util.RunJar.main(RunJar.java:155) at org.apache.hadoop.mapred.JobShell.run(JobShell.java:54) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79) at org.apache.hadoop.mapred.JobShell.main(JobShell.java:68) Caused by: org.apache.thrift.TApplicationException: get_partition 
failed: unknown result at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.recv_get_partition(ThriftHiveMetastore.java:784) at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.get_partition(ThriftHiveMetastore.java:752) at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.getPartition(HiveMetaStoreClient.java:415) at org.apache.hadoop.hive.ql.metadata.Hive.getPartition(Hive.java:579) ... 16 more 2009-07-30 11:45:27,798 ERROR ql.Driver (SessionState.java:printError(279)) - FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.MoveTask On Thu, Jul 30, 2009 at 11:33 AM, Prasad Chakka pcha...@facebook.comwrote: The hive logs go into /tmp/$USER/hive.log not hive_job_log*.txt. -- *From: *Bill Graham billgra...@gmail.com *Reply-To: *billgra...@gmail.com *Date: *Thu, 30 Jul 2009 10:52:06 -0700 *To: *Prasad Chakka pcha...@facebook.com *Cc: *hive-user@hadoop.apache.org, Zheng Shao zsh...@gmail.com *Subject: *Re: partitions not being created I'm trying to set a string to a string and I'm seeing
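Since the stack trace bottoms out in HiveMetaStoreClient.getPartition, one way to tell whether the problem is in the MoveTask or in the metastore itself is to call the metastore client directly and look at the raw exception. This is only a debugging sketch; the database/table/partition values are placeholders from the thread, and the constructor and method signatures are assumed from the class shown in the trace, so they may need adjusting against your build.

import java.util.Arrays;

import org.apache.hadoop.hive.conf.HiveConf;
import org.apache.hadoop.hive.metastore.HiveMetaStoreClient;
import org.apache.hadoop.hive.metastore.api.Partition;

public class GetPartitionCheck {
  public static void main(String[] args) throws Exception {
    // Uses the same hive-site.xml the failing client uses.
    HiveConf conf = new HiveConf(GetPartitionCheck.class);
    HiveMetaStoreClient client = new HiveMetaStoreClient(conf);

    // Placeholder db/table/partition values from the thread above.
    Partition p = client.getPartition("default", "apiusage", Arrays.asList("20090518"));
    System.out.println(p == null ? "partition not found" : p.getSd().getLocation());
  }
}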
Re: partitions not being created
I sent my last try reply before seeing your last email. Thanks, that seems possible. I did initially create ApiUsageTemp using the most recent Hive release. Then while working on a JIRA I updated my Hive client and server to the more recent builds from the trunk. If that could cause such a problem, this is troubling though, since it implies that we can't upgrade Hive without possibly corrupting our metadata store. I'll try again from scratch though and see if it works, thanks. On Thu, Jul 30, 2009 at 1:04 PM, Bill Graham billgra...@gmail.com wrote: Prasad, My setup is Hive client - Hive Server (with local metastore) - Hadoop. I was also suspecting metastore issues, so I've tried multiple times with newly created destination tables and I see the same thing happening. All of the log info I've been able to find I've included already in this thread. Let me know if there's anywhere else I could look for clues. I've included from the client: - /tmp/$USER/hive.log And from the hive server: - Stdout/err logs - /tmp/$USER/hive_job_log*.txt Is there anything else I should be looking at? All of the M/R logs don't show any exceptions anything suspect. Thanks for your time and insights on this issue, I appreciate it. thanks, Bill On Thu, Jul 30, 2009 at 11:57 AM, Prasad Chakka pcha...@facebook.comwrote: Bill, The real error is happening on the Hive Metastore Server or Hive Server (depending on the setup you are using). Error logs on it must have different stack trace. From the information below I am guessing that the way the destination table hdfs directories that got created has some problems. Can you drop that table (and make sure that there is no corresponding HDFS directory for both integer and string type partitions that you created) and retry the query. If you don’t want to drop the destination table then send me the logs on Hive Server. Prasad -- *From: *Bill Graham billgra...@gmail.com *Reply-To: *billgra...@gmail.com *Date: *Thu, 30 Jul 2009 11:47:41 -0700 *To: *Prasad Chakka pcha...@facebook.com *Cc: *hive-user@hadoop.apache.org *Subject: *Re: partitions not being created That file contains a similar error as the Hive Server logs: 2009-07-30 11:44:21,095 WARN mapred.JobClient (JobClient.java:configureCommandLineOptions(510)) - Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same. 2009-07-30 11:44:48,070 WARN mapred.JobClient (JobClient.java:configureCommandLineOptions(510)) - Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same. 
2009-07-30 11:45:27,796 ERROR metadata.Hive (Hive.java:getPartition(588)) - org.apache.thrift.TApplicationException: get_partition failed: unknown result at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.recv_get_partition(ThriftHiveMetastore.java:784) at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.get_partition(ThriftHiveMetastore.java:752) at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.getPartition(HiveMetaStoreClient.java:415) at org.apache.hadoop.hive.ql.metadata.Hive.getPartition(Hive.java:579) at org.apache.hadoop.hive.ql.metadata.Hive.loadPartition(Hive.java:466) at org.apache.hadoop.hive.ql.exec.MoveTask.execute(MoveTask.java:135) at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:335) at org.apache.hadoop.hive.ql.Driver.run(Driver.java:241) at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:122) at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:165) at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:258) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.hadoop.util.RunJar.main(RunJar.java:155) at org.apache.hadoop.mapred.JobShell.run(JobShell.java:54) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79) at org.apache.hadoop.mapred.JobShell.main(JobShell.java:68) 2009-07-30 11:45:27,797 ERROR exec.MoveTask (SessionState.java:printError(279)) - Failed with exception org.apache.thrift.TApplicationException: get_partition failed: unknown result org.apache.hadoop.hive.ql.metadata.HiveException: org.apache.thrift.TApplicationException: get_partition failed: unknown result at org.apache.hadoop.hive.ql.metadata.Hive.getPartition(Hive.java:589) at org.apache.hadoop.hive.ql.metadata.Hive.loadPartition(Hive.java:466) at org.apache.hadoop.hive.ql.exec.MoveTask.execute
partitions not being created
Hi, I'm trying to create a partitioned table and the partition is not appearing for some reason. Am I doing something wrong, or is this a bug? Below are the commands I'm executing with their output. Note that the 'show partitions' command is not returning anything. If I were to try to load data into this table I'd get a 'get_partition failed' error. I'm using bleeding-edge Hive, built from the trunk. hive create table partTable (a string, b int) partitioned by (dt int); OK Time taken: 0.308 seconds hive show partitions partTable; OK Time taken: 0.329 seconds hive describe partTable; OK a string b int dt int Time taken: 0.181 seconds thanks, Bill
Re: partitions not being created
at org.apache.hadoop.hive.ql.metadata.Hive.getPartition(Hive.java:589) at org.apache.hadoop.hive.ql.metadata.Hive.loadPartition(Hive.java:466) at org.apache.hadoop.hive.ql.exec.MoveTask.execute(MoveTask.java:135) at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:335) at org.apache.hadoop.hive.ql.Driver.run(Driver.java:241) at org.apache.hadoop.hive.service.HiveServer$HiveServerHandler.execute(HiveServer.java:105) at org.apache.hadoop.hive.service.ThriftHive$Processor$execute.process(ThriftHive.java:264) at org.apache.hadoop.hive.service.ThriftHive$Processor.process(ThriftHive.java:252) at org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:252) at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908) at java.lang.Thread.run(Thread.java:619) Caused by: org.apache.thrift.TApplicationException: get_partition failed: unknown result at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.recv_get_partition(ThriftHiveMetastore.java:784) at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.get_partition(ThriftHiveMetastore.java:752) at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.getPartition(HiveMetaStoreClient.java:415) at org.apache.hadoop.hive.ql.metadata.Hive.getPartition(Hive.java:579) ... 11 more FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.MoveTask 09/07/28 18:06:15 ERROR ql.Driver: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.MoveTask On Tue, Jul 28, 2009 at 5:57 PM, Namit Jain nj...@facebook.com wrote: There are no partitions in the table – Can you post the output you get while loading the data ? *From:* Bill Graham [mailto:billgra...@gmail.com] *Sent:* Tuesday, July 28, 2009 5:54 PM *To:* hive-user@hadoop.apache.org *Subject:* partitions not being created Hi, I'm trying to create a partitioned table and the partition is not appearing for some reason. Am I doing something wrong, or is this a bug? Below are the commands I'm executing with their output. Note that the 'show partitions' command is not returning anything. If I were to try to load data into this table I'd get a 'get_partition failed' error. I'm using bleeding-edge Hive, built from the trunk. hive create table partTable (a string, b int) partitioned by (dt int); OK Time taken: 0.308 seconds hive show partitions partTable; OK Time taken: 0.329 seconds hive describe partTable; OK a string b int dt int Time taken: 0.181 seconds thanks, Bill
Re: partitions not being created
Thanks for the tip, but it fails in the same way when I use a string. On Tue, Jul 28, 2009 at 6:21 PM, David Lerman dler...@videoegg.com wrote: hive create table partTable (a string, b int) partitioned by (dt int); INSERT OVERWRITE TABLE ApiUsage PARTITION (dt = 20090518) SELECT `(requestDate)?+.+` FROM ApiUsageTemp WHERE requestDate = '2009/05/18' The table has an int partition column (dt), but you're trying to set a string value (dt = 20090518). Try : create table partTable (a string, b int) partitioned by (dt string); and then do your insert.
Re: Apply a patch to hadoop
http://wiki.apache.org/hadoop/HowToContribute Search for "Applying a patch" and you'll find this: patch -p0 < cool_patch.patch On Mon, Jul 27, 2009 at 2:33 PM, Gopal Gandhi gopal.gandhi2...@yahoo.com wrote: I am going to apply a patch to hadoop (version 18.3). I searched online but could not find a step-by-step how-to manual. Would any hadoop guru tell me how to apply a patch, say HADOOP-.patch, to hadoop? Thanks a lot.
Re: Block not found
I ran into the same issue when using the default settings for dfs.data.dir, which is under the /tmp directory. Files in this directory will be cleaned our periodically as needed by the OS, which will break HDFS. On Thu, Jul 2, 2009 at 7:01 AM, Gross, Danny danny.gr...@spansion.comwrote: Hello Johnson, I have observed similar error messages when my system ran out of disk space on an HDFS node, or in hadoop.tmp.dir. Hope it helps. Best regards, Danny -Original Message- From: Johnson Chen [mailto:dong...@gmail.com] Sent: Thursday, July 02, 2009 3:47 AM To: core-u...@hadoop.apache.org Subject: Block not found Hi , My hadoop program stop at 66% Reduce phrase 09/07/02 16:41:37 INFO mapred.JobClient: map 0% reduce 0% 09/07/02 16:41:48 INFO mapred.JobClient: map 50% reduce 0% 09/07/02 16:41:51 INFO mapred.JobClient: map 100% reduce 0% 09/07/02 16:41:58 INFO mapred.JobClient: map 100% reduce 66% And I found it pop up a lot of error message in namenode log . here is the error messages : 2009-07-02 16:43:44,461 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: commitBlockSynchronization(lastblock=blk_-7800669485846603924_144754, newgenerationstamp=0, newlength=0, newtargets=[], closeFile=false, deleteBlock=true) 2009-07-02 16:43:44,461 INFO org.apache.hadoop.ipc.Server: IPC Server handler 3 on 50040, call commitBlockSynchronization(blk_-7800669485846603924_144754, 0, 0, false, true, [Lorg.apache.hadoop.hdfs.protocol.DatanodeID;@2df2888) from 192.168.151.231:39976: error: java.io.IOException: Block (=blk_-7800669485846603924_144754) not found java.io.IOException: Block (=blk_-7800669485846603924_144754) not found at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.commitBlockSynchroni zation(FSNamesystem.java:1906) at org.apache.hadoop.hdfs.server.namenode.NameNode.commitBlockSynchronizati on(NameNode.java:410) at sun.reflect.GeneratedMethodAccessor3.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessor Impl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:481) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:894) 2009-07-02 16:43:44,967 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: commitBlockSynchronization(lastblock=blk_-8293900342000823254_748370, newgenerationstamp=0, newlength=0, newtargets=[], closeFile=false, deleteBlock=true) 2009-07-02 16:43:44,967 INFO org.apache.hadoop.ipc.Server: IPC Server handler 0 on 50040, call commitBlockSynchronization(blk_-8293900342000823254_748370, 0, 0, false, true, [Lorg.apache.hadoop.hdfs.protocol.DatanodeID;@8ddfa31) from 192.168.151.232:45283: error: java.io.IOException: Block (=blk_-8293900342000823254_748370) not found java.io.IOException: Block (=blk_-8293900342000823254_748370) not found at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.commitBlockSynchroni zation(FSNamesystem.java:1906) at org.apache.hadoop.hdfs.server.namenode.NameNode.commitBlockSynchronizati on(NameNode.java:410) at sun.reflect.GeneratedMethodAccessor3.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessor Impl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:481) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:894) Can anyone help me ? Thanks. -- Best wishes, Johnson Chen
Re: Pig ClassCastException trying to cast to org.apache.pig.data.DataBag
That's the strange thing though: I have the entire logic of my UDF wrapped in a try/catch, and nothing is thrown there. This exception seems to be thrown elsewhere. 2009/6/26 zjffdu zjf...@gmail.com I think this may be caused by your UDF. You can write a try/catch in your UDF and log more context information. -Original Message- From: Bill Graham [mailto:billgra...@gmail.com] Sent: June 25, 2009 9:28 To: pig-user@hadoop.apache.org Subject: Pig ClassCastException trying to cast to org.apache.pig.data.DataBag Hello Pig fans, I've implemented a collaborative filtering job in Pig using CROSS and FOREACH with a UDF. It works great until my dataset grows to a certain size, at which point I start to get Pig ClassCastExceptions in the logs. I know that CROSS can be expensive and difficult to scale, but it's strange to me that when things fall over, it's due to a Pig ClassCastException. Any insights as to why this is happening or how I should go about troubleshooting? Here's the script:
userAssets1 = LOAD 'sample_data/userAssets' AS (user:bytearray, userAssetRatings: bag {T: tuple(user:bytearray, asset:chararray, rating:double)});
userAssets2 = LOAD 'sample_data/userAssets' AS (user:bytearray, userAssetRatings: bag {T: tuple(user:bytearray, asset:chararray, rating:double)});
X = CROSS userAssets1, userAssets2 PARALLEL 20;
userToUserFilter = FILTER X BY userAssets1::user != userAssets2::user;
REGISTER pearson.jar;
dist = FOREACH userToUserFilter GENERATE userAssets1::user, userAssets2::user, cnwk.grahamb.pig.PEARSON(userAssets1::userAssetRatings, userAssets2::userAssetRatings);
similarUsers = FILTER dist BY ($2 != 0.0);
STORE similarUsers INTO 'sample_data/userSimilarityPearson';
Once the number of userAssets values grows to about 28K, the map phase succeeds, but the reduce fails after reaching around 60% completion. There are 558K input records for the reducer in this case.
The exceptions look like this:
2009-06-24 11:34:47,854 [main] ERROR org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.Launcher - Error message from task (reduce) task_200906171500_0141_r_12 java.lang.ClassCastException: org.apache.pig.data.DataByteArray cannot be cast to org.apache.pig.data.DataBag
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POProject.processInputBag(POProject.java:368)
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POProject.getNext(POProject.java:171)
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.processInput(POUserFunc.java:129)
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:181)
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:235)
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.processPlan(POForEach.java:262)
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNext(POForEach.java:197)
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(PhysicalOperator.java:226)
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POFilter.getNext(POFilter.java:95)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Reduce.runPipeline(PigMapReduce.java:280)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Reduce.processOnePackageOutput(PigMapReduce.java:247)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Reduce.reduce(PigMapReduce.java:216)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Reduce.reduce(PigMapReduce.java:136)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:318)
at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2210)
thanks, Bill
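As a sketch of what zjffdu is suggesting (not Bill's actual PEARSON code; the class name and the placeholder computation are made up), the UDF can log the runtime type of each argument before casting and wrap any failure with the offending tuple, so the task log shows exactly what the UDF received:

import java.io.IOException;
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.DataBag;
import org.apache.pig.data.Tuple;

public class PearsonSketch extends EvalFunc<Double> {
    private static final Log LOG = LogFactory.getLog(PearsonSketch.class);

    @Override
    public Double exec(Tuple input) throws IOException {
        if (input == null || input.size() < 2) {
            return null;
        }
        try {
            Object first = input.get(0);
            Object second = input.get(1);
            // Log the actual runtime types before casting; a DataByteArray here
            // (instead of a DataBag) would mean the bad value reaches the UDF itself.
            LOG.info("arg0 type: " + (first == null ? "null" : first.getClass().getName())
                    + ", arg1 type: " + (second == null ? "null" : second.getClass().getName()));
            DataBag ratings1 = (DataBag) first;
            DataBag ratings2 = (DataBag) second;
            // ... compute the correlation from the two bags ...
            return 0.0; // placeholder result
        } catch (Exception e) {
            // Re-throw with the offending tuple so the task log carries the context.
            IOException ioe = new IOException("PEARSON failed on input: " + input);
            ioe.initCause(e);
            throw ioe;
        }
    }
}

That said, the stack trace above fails inside POProject.processInputBag, i.e. while Pig is projecting the bag to hand to the UDF, before exec() is ever entered, which would explain why Bill's own try/catch never fires.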