Re: cross product of 2 data sets

2011-09-01 Thread Alan Gates
http://ofps.oreilly.com/titles/9781449302641/advanced_pig_latin.html
search on "cross matches"

Alan.

On Sep 1, 2011, at 11:44 AM, Marc Sturlese wrote:

> Hey there,
> I would like to do the cross product of two data sets, neither of which fits in
> memory. I've seen Pig has the cross operation. Can someone please explain
> how it is implemented?
> 



Re: Re: Can pig-0.8.1 work with JUnit 4.3.1 or 4.8.1 or 4.8.2?

2011-08-22 Thread Alan Gates
When I download the Pig 0.8.1 tarball I don't find any JUnit class files, just
a license file (which probably doesn't need to be there).  If you build it, Ivy
will pull them in, but they are not in the tarball.

AFAIK it will work with any JUnit 4.x, but 4.5 is what we use in our testing.
In any case, JUnit is only used for testing, so if you are just using Pig this
doesn't matter at all.

Also, questions like this are better asked on u...@pig.apache.org.

Alan.

On Aug 22, 2011, at 4:54 AM, shiju...@gmail.com wrote:

> 
> 
> Sent from HTC
> 
> - Reply message -
> From: "lulynn_2008" 
> To: "u...@pig.apache.org" , 
> "common-user@hadoop.apache.org" 
> Subject: Can pig-0.8.1 work with JUnit 4.3.1 or 4.8.1 or 4.8.2?
> Date: Sunday, August 7, 2011, 23:52
> 
> 
> Hello,
> I found that pig-0.8.1 includes junit-4.5 class files.
> Could you please give me some suggestions on my questions:
> Can pig-0.8.1 work with JUnit 4.3.1, 4.8.1, or 4.8.2?
> Why are JUnit 4.5 classes included?
> Thank you.



Re: Research projects with Hadoop

2010-09-07 Thread Alan Gates

Luan,

Pig keeps a list at http://wiki.apache.org/pig/PigJournal of all the Pig
projects we know of.  Many of these are more project-based, but some could be
turned into actual research.  If you do choose one of these, please let us
know (over on pig-...@hadoop.apache.org) so we can mark that you're working
on it.  Good luck in your studies.


Alan.

On Sep 6, 2010, at 8:02 PM, Luan Cestari wrote:



Hi buddies,

I'm a CS student and I would like to ask if you have some ideas for research
projects that could be done with Hadoop or related projects like HBase, Hive,
KosmosFS, or Pig. In my country you need to do a project during the master's
degree. I have some time to think about it, as I'll start the master's degree
at the beginning of next year, but I'm fascinated and enthusiastic about all
these projects and all the possibilities for innovation that distributed
systems bring. As I'm new to this kind of project, I don't know exactly what
can be done in a period of about 3 years. Any ideas?


Thanks for the help.

Best Regards,
Luan Cestari

P.S.: I know this is an unusual question, but I think it is pertinent to the
forum and could be valuable to it.




Re: Why hadoop-u...@lucene.a.o ?

2010-06-18 Thread Alan Gates

Ancient history.  Hadoop started as a subproject of Lucene.

Alan.

On Jun 17, 2010, at 10:22 PM, Otis Gospodnetic wrote:


Hello,

I've noticed people send emails to the following address:

   hadoop-u...@lucene.apache.org

Why?
Is this supposed to be related to the common-user@hadoop.apache.org list?
But why would any Hadoop mailing list be @lucene.a.o?

Thanks,
Otis

Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Hadoop ecosystem search :: http://search-hadoop.com/





Re: Bible Code and some input format ideas

2010-01-12 Thread Alan Gates
I'm guessing that you want to set the width of the text to avoid the issue
where, if you split by block, all splits but the first will have an unknown
offset.


Most texts have natural divisions in them, which I'm guessing you'll want to
respect anyway.  In the Bible this would be the different books; in more
recent books it would be different chapters.  Could you instead set up your
InputFormat to split on these divisions in the text?  Then you don't have to
go through this single-threaded step.  And in most cases the divisions in the
text will be small enough to be handled by a single mapper (though not
necessarily well balanced).
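
As a rough sketch of what the per-division mapper might look like (assuming a
custom InputFormat, not shown, that hands each mapper one whole book or
chapter as a (division name, full text) record; the class name, configuration
keys, and default strip set below are made up for illustration):

import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class FixedWidthMapper extends Mapper<Text, Text, Text, Text> {
  private int width;      // letters per output row
  private String strip;   // characters to remove before re-wrapping

  @Override
  protected void setup(Context ctx) {
    width = ctx.getConfiguration().getInt("biblecode.width", 5);
    strip = ctx.getConfiguration().get("biblecode.strip", " \t\r\n-.?,;:'\"!");
  }

  @Override
  protected void map(Text division, Text body, Context ctx)
      throws IOException, InterruptedException {
    // Drop the configured characters.
    StringBuilder cleaned = new StringBuilder();
    for (char c : body.toString().toCharArray()) {
      if (strip.indexOf(c) < 0) {
        cleaned.append(c);
      }
    }
    // The whole division arrives as one record, so offsets stay consistent:
    // emit consecutive rows of exactly `width` letters.
    for (int i = 0; i + width <= cleaned.length(); i += width) {
      ctx.write(division, new Text(cleaned.substring(i, i + width)));
    }
  }
}

Because each division is a single record, the letter offsets never straddle a
split boundary, which is the problem you describe below.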


Alan.

On Jan 11, 2010, at 11:52 AM, Edward Capriolo wrote:


Hey all,
I saw a special on discovery about bible code.
http://en.wikipedia.org/wiki/Bible_code

I am designing something in Hadoop to do bible code on any text (not
just the Bible). I have a rough idea of how to make all the parts
efficient in map reduce. I have a little challenge I originally
thought I could solve with a custom InputFormat, but it seems I
may have to do this in a stand-alone program.

Lets assume your input looks like this:

Is there any
bible-code in this
text? I don't know.

The end result might look like this (assuming I take every 5th letter):


irbcn
tdn__

The first part of the process: given an input text, we have to strip out a
user-configured list of characters such as '\t', '-', '.', and '?'. That I
have no problem with.

In the second part of the process, I would like to get the data to the proper
width, in this case 5 characters. This is a challenge because, assuming a
line is 5 characters, e.g. 'done?', once it is cleaned it will be 4
characters: 'done'. This -1 offset shifts the rest of the data, the next line
might have another offset, and so on.

Originally I was thinking I could create an NCharacterInputFormat, but it
seems this stage of the process cannot easily be done in map/reduce. I guess
I need to write a single-threaded program to read through the data and make
the correct offsets (5 characters per line), unless someone else has an idea.
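
For what it's worth, the single-threaded pass I have in mind would be roughly
the following: read the already-cleaned text from the first step, skip line
breaks and leftover whitespace, and re-wrap to exactly N letters per line.
File names, the skip set, and the underscore padding are just placeholders.

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;
import java.io.Reader;
import java.io.Writer;

public class Rewrap {
  public static void main(String[] args) throws IOException {
    final int width = 5;
    final String skip = " \t\r\n";        // whitespace left over from step one
    StringBuilder row = new StringBuilder();
    Reader in = new BufferedReader(new FileReader("cleaned_input.txt"));
    Writer out = new BufferedWriter(new FileWriter("fixed_width.txt"));
    int c;
    while ((c = in.read()) != -1) {
      if (skip.indexOf(c) >= 0) {
        continue;                         // drop line breaks and whitespace
      }
      row.append((char) c);
      if (row.length() == width) {        // emit a full row
        out.write(row.toString());
        out.write('\n');
        row.setLength(0);
      }
    }
    if (row.length() > 0) {               // pad the final partial row
      while (row.length() < width) {
        row.append('_');
      }
      out.write(row.toString());
      out.write('\n');
    }
    in.close();
    out.close();
  }
}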




Re: map side Vs. Reduce side join

2009-07-14 Thread Alan Gates
Usually doing a join on the map side depends on exploiting some
characteristic of the data (such as one input being small enough to fit in
memory and be replicated to every map, or both inputs already being sorted on
the same key, or both inputs already being partitioned into the same number
of partitions using an identical hash function).  If none of those is true,
then you have to preprocess.  That preprocessing is equivalent to doing your
join on the reduce side, using Hadoop to group your keys.
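
For the in-memory case, a minimal sketch of a fragment-replicate (map-side)
join looks something like the following. It assumes the small input has
already been shipped to every task (for example via the distributed cache) as
a local file of tab-separated key/value lines, and that the big input is
tab-separated with the join key first; the class and file names are
illustrative only.

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class ReplicatedJoinMapper
    extends Mapper<LongWritable, Text, Text, Text> {
  private final Map<String, String> small = new HashMap<String, String>();

  @Override
  protected void setup(Context ctx) throws IOException {
    // Load the replicated (small) input into memory once per task.
    BufferedReader in = new BufferedReader(new FileReader("small_input.txt"));
    String line;
    while ((line = in.readLine()) != null) {
      String[] parts = line.split("\t", 2);
      small.put(parts[0], parts[1]);
    }
    in.close();
  }

  @Override
  protected void map(LongWritable offset, Text record, Context ctx)
      throws IOException, InterruptedException {
    String[] parts = record.toString().split("\t", 2);
    String match = small.get(parts[0]);   // probe the in-memory table
    if (match != null) {                  // inner join: emit only matches
      ctx.write(new Text(parts[0]), new Text(parts[1] + "\t" + match));
    }
  }
}

Because the join happens entirely in the mappers, no shuffle or reduce phase
is needed; the reduce-side join is what you fall back to when neither input
has a property like this to exploit.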


As a side note, Pig and Hive both already implement join in Map and  
Reduce phases, so you could take a look at how those are implemented  
and when they recommend choosing each.


Alan.

On Jul 14, 2009, at 1:49 PM, bonito wrote:



Hello.
I would like to ask if there is any scenario in which the reduce-side join is
preferable to the map-side join.
One may claim that the map-side join requires preprocessing of the input
sources.
Are there any other reasons?
I am really interested in this.
Thank you.