Re: cross product of 2 data sets
http://ofps.oreilly.com/titles/9781449302641/advanced_pig_latin.html and search on "cross matches".

Alan.

On Sep 1, 2011, at 11:44 AM, Marc Sturlese wrote:

> Hey there,
> I would like to do the cross product of two data sets, neither of which fits in
> memory. I've seen Pig has the cross operation. Can someone please explain to me
> how it is implemented?
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/cross-product-of-2-data-sets-tp3302160p3302160.html
> Sent from the Hadoop lucene-users mailing list archive at Nabble.com.
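The chapter linked above covers how Pig parallelizes CROSS across map/reduce tasks; semantically, though, CROSS is just the Cartesian product of two relations. A minimal Python sketch of what it computes (the relation contents here are made up for illustration):

```python
from itertools import product

# Two small relations, standing in for Pig aliases A and B.
A = [(1, "a"), (2, "b")]
B = [("x",), ("y",)]

# CROSS pairs every tuple of A with every tuple of B, so the
# output has len(A) * len(B) tuples -- which is why neither
# input needs to fit in memory, but the output can be huge.
crossed = [a + b for a, b in product(A, B)]

print(crossed)
# [(1, 'a', 'x'), (1, 'a', 'y'), (2, 'b', 'x'), (2, 'b', 'y')]
```

The quadratic output size is the real cost of CROSS, regardless of how the work is distributed.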
Re: Re: Can pig-0.8.1 work with junit 4.3.1 or 4.8.1 or 4.8.2?
When I download the Pig 0.8.1 tarball I don't find any junit class files, just a license file (which probably doesn't need to be there). If you build it, it will pull those via Ivy, but they are not in the tarball. AFAIK it will work with any JUnit 4.x, but 4.5 is what we use in our testing. In any case JUnit is only used in testing, so if you are just using Pig this doesn't matter at all.

Also, questions like this are better asked on u...@pig.apache.org.

Alan.

On Aug 22, 2011, at 4:54 AM, shiju...@gmail.com wrote:

> Sent from HTC
>
> - Reply message -
> From: "lulynn_2008"
> To: "u...@pig.apache.org",
> "common-user@hadoop.apache.org"
> Subject: Can pig-0.8.1 work with junit 4.3.1 or 4.8.1 or 4.8.2?
> Date: Sunday, August 7, 2011, 23:52
>
> Hello,
> I found pig-0.8.1 included junit-4.5 class files.
> Could you please give me some suggestions on my questions:
> Can pig-0.8.1 work with junit 4.3.1 or 4.8.1 or 4.8.2?
> Why are the junit-4.5 classes included?
> Thank you.
Re: Research projects with Hadoop
Luan,

Pig keeps a list at http://wiki.apache.org/pig/PigJournal of all the Pig projects we know of. Many of these are more project based, but some could be turned into actual research. If you do choose one of these, please let us know (over on pig-...@hadoop.apache.org) so we can mark that you're working on it.

Good luck in your studies.

Alan.

On Sep 6, 2010, at 8:02 PM, Luan Cestari wrote:

> Hi buddies,
> I'm a CS student and I would like to ask if you guys have some ideas for a research project that can be done with Hadoop or other projects like HBase, Hive, KosmosFS, Pig. In my country, during the master's degree, you need to do a project. Well, I have some time to think about that, as I'll start the master's degree at the beginning of next year, but I'm very fascinated and enthusiastic about all those projects and all the possibilities for innovation that distributed systems bring. As I'm new to these kinds of projects, I don't know what exactly can be done in a period like 3 years. Any ideas?
> Thanks for the help.
> Best Regards,
> Luan Cestari
> P.S.: I know this is an unusual question, but I think it is pertinent to the forum and can be good for the forum.
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Research-projects-with-Hadoop-tp1430287p1430287.html
> Sent from the Hadoop lucene-users mailing list archive at Nabble.com.
Re: Why hadoop-u...@lucene.a.o ?
Ancient history. Hadoop started as a subproject of Lucene.

Alan.

On Jun 17, 2010, at 10:22 PM, Otis Gospodnetic wrote:

> Hello,
> I've noticed people send emails to the following address: hadoop-u...@lucene.apache.org
> Why? Is this supposed to be related to the common-user@hadoop.apache.org list? But why would any Hadoop mailing list be @lucene.a.o?
>
> Thanks,
> Otis
> Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
> Hadoop ecosystem search :: http://search-hadoop.com/
Re: Bible Code and some input format ideas
I'm guessing that you want to set the width of the text to avoid the issue where, if you split by block, all splits but the first will have an unknown offset. Most texts have natural divisions in them which I'm guessing you'll want to respect anyway. In the Bible this would be the different books; in more recent books it would be different chapters. Could you instead set up your InputFormat to split on these divisions in the text? Then you don't have to go through this single-threaded step. And in most cases the divisions in the text will be small enough to be handled by a single mapper (though not necessarily well balanced).

Alan.

On Jan 11, 2010, at 11:52 AM, Edward Capriolo wrote:

> Hey all,
> I saw a special on Discovery about the Bible code: http://en.wikipedia.org/wiki/Bible_code
> I am designing something in Hadoop to do Bible code on any text (not just the Bible). I have a rough idea of how to make all the parts efficient in map reduce. I have a little challenge I originally thought I could solve with a custom InputFormat, but it seems I may have to do this in a stand-alone program.
>
> Let's assume your input looks like this:
>
> Is there any bible-code in this text?
> I don't know.
>
> The end result might look like this (assuming I take every 5th letter):
>
> irbcn
> tdn__
>
> The first part of the process: given an input text, we have to strip out a user-configured list of characters: '\t' '-' '.' '?'. That I have no problem with.
>
> The second part of the process: I would like to get the data to be the proper width, in this case 5 characters. This is a challenge because, assuming a line is 5 characters, e.g. 'done?', once it is cleaned it will be 4 characters, 'done'. This -1 offset changes the rest of the data; the next line might have another offset, and so on. Originally I was thinking I could create an NCharacterInputFormat, but it seems like this stage of the process can not easily be done in map/reduce.
>
> I guess I need to write a single-threaded program to read through the data and make the correct offsets (5 characters per line). Unless someone else has an idea.
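For reference, the single-process version of the pipeline Edward describes is short. This is a hedged Python sketch (function names, the pad character, and the exact strip set are my own choices, not from the thread); with lowercasing and apostrophes added to the strip set, it reproduces his "irbcn / tdn__" example:

```python
def clean(text, strip_chars="\t-.?' \n"):
    """Remove a user-configured set of characters and fold case."""
    return "".join(ch for ch in text.lower() if ch not in strip_chars)

def every_nth(text, n, start=0):
    """The equidistant letter sequence: every nth letter of the cleaned text."""
    return text[start::n]

def reflow(text, width=5, pad="_"):
    """Lay text out in fixed-width rows, padding the final row."""
    rows = [text[i:i + width] for i in range(0, len(text), width)]
    if rows and len(rows[-1]) < width:
        rows[-1] = rows[-1].ljust(width, pad)
    return rows

cleaned = clean("Is there any bible-code in this text? I don't know.")
print(reflow(every_nth(cleaned, 5)))
# ['irbcn', 'tdn__']
```

The point of Alan's suggestion is that only `clean` and `reflow` need a global view of the text; run per natural division (book or chapter), each division can be handled independently by one mapper.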
Re: map side Vs. Reduce side join
Usually doing a join on the map side depends on exploiting some characteristic of the data: one input is small enough that it can fit in memory and be replicated to every map, or both inputs are already sorted on the same key, or both inputs are already partitioned into the same number of partitions using an identical hash function. If none of those is true, then you have to preprocess. That preprocessing is equivalent to doing your join on the reduce side, using Hadoop to group your keys.

As a side note, Pig and Hive both already implement joins in the map and reduce phases, so you could take a look at how those are implemented and when they recommend choosing each.

Alan.

On Jul 14, 2009, at 1:49 PM, bonito wrote:

> Hello.
> I would like to ask if there is any 'scenario' in which the reduce-side join is preferable to the map-side join.
> One may claim that the map-side join requires preprocessing of the input sources. Is there any other reason or reasons?
> I am really interested in this.
> Thank you.
>
> --
> View this message in context:
> http://www.nabble.com/map-side-Vs.-Reduce-side-join-tp24487391p24487391.html
> Sent from the Hadoop core-user mailing list archive at Nabble.com.
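The first case Alan mentions (the small side fits in memory and is replicated to every map task) can be sketched in a few lines. This is a minimal single-process illustration with made-up relation contents, not Pig's or Hive's actual implementation:

```python
from collections import defaultdict

# Small relation: (key, name). In Hadoop this would be shipped to every
# map task, e.g. via the distributed cache, and loaded into memory.
small = [("u1", "alice"), ("u2", "bob")]

# Large relation: (key, event). This side is only ever streamed.
large = [("u1", "login"), ("u2", "logout"), ("u1", "click"), ("u3", "login")]

def replicated_join(small, large):
    # Build an in-memory hash table on the small side once.
    lookup = defaultdict(list)
    for key, value in small:
        lookup[key].append(value)
    # Stream the large side past it: no shuffle or sort is needed,
    # because each "mapper" holds the entire small side.
    for key, value in large:
        for small_value in lookup.get(key, []):
            yield (key, small_value, value)

print(list(replicated_join(small, large)))
# [('u1', 'alice', 'login'), ('u2', 'bob', 'logout'), ('u1', 'alice', 'click')]
```

Note that this produces an inner join (the unmatched "u3" row is dropped), and it is only viable while the small side's hash table fits in each mapper's memory; otherwise you are back to a reduce-side join.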