Re: Distributed Lucene Questions
Thanks a lot.

Dev

On Wed, Jun 3, 2009 at 11:45 AM, Paco NATHAN wrote:
> Our Katta cluster at ShareThis runs on EC2.
> Managed using RightScale templates
> (thanks @mikebabineau)
>
> On Tue, Jun 2, 2009 at 19:46, Devajyoti Sarkar wrote:
> > Can Katta be used on an EC2 cluster?
> >
> > The reason why I ask this is that it appears to use ZooKeeper, which
> > ideally needs to have a dedicated drive. That may not be possible in a
> > shared environment. Is this a non-issue with respect to Katta?
> >
> > I would appreciate any input in this regard.
> >
> > Dev
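For context on the dedicated-drive concern raised above: ZooKeeper's transaction log is fsync-bound, which is why its documentation recommends giving the log its own device where possible. A minimal zoo.cfg sketch of that layout (paths hypothetical):

    # Minimal zoo.cfg sketch (paths hypothetical)
    tickTime=2000
    initLimit=10
    syncLimit=5
    clientPort=2181
    # Snapshots can share a disk with other data...
    dataDir=/var/zookeeper/data
    # ...but the fsync-heavy transaction log benefits from a dedicated device.
    dataLogDir=/mnt/zk-txnlog

On shared infrastructure like EC2, where a truly dedicated spindle may not be available, the practical question is whether the workload's write rate makes log latency matter.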
Re: Distributed Lucene Questions
Can Katta be used on an EC2 cluster?

The reason why I ask this is that it appears to use ZooKeeper, which ideally
needs to have a dedicated drive. That may not be possible in a shared
environment. Is this a non-issue with respect to Katta?

I would appreciate any input in this regard.

Dev

On Wed, Jun 3, 2009 at 2:03 AM, Ted Dunning wrote:
> Just a quick plug for Katta. We use it extensively (and have been sending
> back some patches).
>
> See www.deepdyve.com for a test drive.
>
> At my previous job, we had utter fits working with SOLR using sharded
> retrieval. Katta is designed to address the sharding problem very well, and
> we have been very happy. Our extensions have adapted Katta so that it is a
> general sharding and replication engine that supports general queries. For
> some things we use a modified Lucene; for other things, we use our own
> code. Katta handles that really well.
>
> On Tue, Jun 2, 2009 at 9:23 AM, Tarandeep Singh wrote:
> > Thanks all for your replies. I am checking out Katta...
> >
> > -Tarandeep
> >
> > On Tue, Jun 2, 2009 at 8:05 AM, Stefan Groschupf wrote:
> > > Hi,
> > > you might want to check out:
> > > http://katta.sourceforge.net/
> > >
> > > Stefan
> > >
> > > ~~~
> > > Hadoop training and consulting
> > > http://www.scaleunlimited.com
> > > http://www.101tec.com
> > >
> > > On Jun 1, 2009, at 9:54 AM, Tarandeep Singh wrote:
> > >
> > >> Hi All,
> > >>
> > >> I am trying to build a distributed system to build and serve Lucene
> > >> indexes. I came across the Distributed Lucene project:
> > >> http://wiki.apache.org/hadoop/DistributedLucene
> > >> https://issues.apache.org/jira/browse/HADOOP-3394
> > >>
> > >> and have a couple of questions. It would be really helpful if someone
> > >> could provide some insights.
> > >>
> > >> 1) Is this code production ready?
> > >> 2) Does someone have performance data for this project?
> > >> 3) It allows searches and updates/deletes to be performed at the same
> > >> time. How well will the system perform if there are frequent updates?
> > >> Will it handle the search and update load easily, or would it be
> > >> better to rebuild or update the indexes on different machines and then
> > >> deploy them back to the machines that are serving the indexes?
> > >>
> > >> Basically, I am trying to choose between two approaches:
> > >>
> > >> 1) Use Hadoop to build and/or update Lucene indexes and then deploy
> > >> them on a separate cluster that takes care of load balancing, fault
> > >> tolerance, etc. There is a package in Hadoop contrib that does this,
> > >> so I can use that code.
> > >>
> > >> 2) Use and/or modify the Distributed Lucene code.
> > >>
> > >> I am expecting daily updates to our index, so I am not sure whether
> > >> the Distributed Lucene code (which allows searches and updates on the
> > >> same indexes) will be able to handle the search and update load
> > >> efficiently.
> > >>
> > >> Any suggestions?
> > >>
> > >> Thanks,
> > >> Tarandeep

--
Ted Dunning, CTO
DeepDyve

111 West Evelyn Ave. Ste. 202
Sunnyvale, CA 94086
http://www.deepdyve.com
858-414-0013 (m)
408-773-0220 (fax)
Re: Sharing an object across mappers
Hi Owen,

Thanks a lot for the pointers. In order to use the MultithreadedMapRunner, if
I change the setMapRunnerClass() call on the JobConf, does the rest of my
code remain the same (apart from making it thread-safe)?

Thanks in advance,
Dev

On Sat, Oct 4, 2008 at 12:29 AM, Owen O'Malley <[EMAIL PROTECTED]> wrote:
>
> On Oct 3, 2008, at 7:49 AM, Devajyoti Sarkar wrote:
>
>> Briefly going through the DistributedCache information, it seems to be a
>> way to distribute files to mappers/reducers.
>
> Sure, but it handles the distribution problem for you.
>
>> One still needs to read the contents into each map/reduce task VM.
>
> If the data is straight binary data, you could just mmap it from the
> various tasks. It would be pretty efficient.
>
> The other direction is to use the MultithreadedMapRunner and run multiple
> maps as threads in the same VM. But unless your maps are CPU heavy or
> contacting external servers, it probably won't help as much as you'd like.
>
> -- Owen
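For readers of the archive, a minimal sketch of the change being discussed, using the old org.apache.hadoop.mapred API of this era (job and class names hypothetical). The only changes from a normal job are the runner class and the thread-count property; the mapper itself must be thread-safe because several map() calls run concurrently in one JVM:

    import java.io.IOException;

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;
    import org.apache.hadoop.mapred.lib.MultithreadedMapRunner;

    public class MultithreadedJob {

      // A stateless (and hence trivially thread-safe) mapper.
      public static class MyMapper extends MapReduceBase
          implements Mapper<LongWritable, Text, Text, LongWritable> {
        public void map(LongWritable key, Text value,
                        OutputCollector<Text, LongWritable> out,
                        Reporter reporter) throws IOException {
          out.collect(value, new LongWritable(1));
        }
      }

      public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(MultithreadedJob.class);
        conf.setJobName("multithreaded-example");

        // Swap in the multithreaded runner; the rest of the job is unchanged.
        conf.setMapRunnerClass(MultithreadedMapRunner.class);
        // Concurrent map threads per task JVM (the default is 10).
        conf.setInt("mapred.map.multithreadedrunner.threads", 8);

        conf.setMapperClass(MyMapper.class);
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(LongWritable.class);
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));
        JobClient.runJob(conf);
      }
    }

As Owen notes, this only helps if the map threads are CPU-bound or blocked on external services; it does not by itself reduce per-JVM memory use unless shared state is held in a singleton.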
Re: Sharing an object across mappers
Hi Arun,

Briefly going through the DistributedCache information, it seems to be a way
to distribute files to mappers/reducers. One still needs to read the contents
into each map/reduce task VM, so the data gets replicated across the VMs on a
single node. It seems it does not address my basic problem, which is to have
a large shared object across multiple map/reduce tasks at a given node
without having to replicate it across the VMs. Is there a setting in Hadoop
that tells Hadoop to create the individual map/reduce tasks in the same JVM?

Thanks,
Dev

On Fri, Oct 3, 2008 at 10:32 PM, Arun C Murthy <[EMAIL PROTECTED]> wrote:
>
> On Oct 3, 2008, at 1:10 AM, Devajyoti Sarkar wrote:
>
>> Hi Alan,
>>
>> Thanks for your message.
>>
>> The object can be read-only once it is initialized - I do not need to
>> modify
>
> Please take a look at DistributedCache:
> http://hadoop.apache.org/core/docs/current/mapred_tutorial.html#DistributedCache
>
> An example:
> http://hadoop.apache.org/core/docs/current/mapred_tutorial.html#Example%3A+WordCount+v2.0
>
> Arun
>
>> it. Essentially it is an object that allows me to analyze/modify the data
>> I am mapping/reducing. It comes to about 3-4GB of RAM. The problem I have
>> is that if I run multiple mappers, this object gets replicated in the
>> different VMs and I run out of memory on my node. I pretty much need to
>> have the full object in memory to do my processing. It is possible
>> (though quite difficult) to keep it partially on disk and query it (like
>> a Lucene store implementation), but there is a significant performance
>> hit. As an example, say I use the xlarge CPU instance at Amazon (8 CPUs,
>> 8GB RAM). In this scenario, I can really only have 1 mapper per node even
>> though there are 8 CPUs. But if the overhead of sharing the object (e.g.
>> RMI) or persisting it (e.g. Lucene) makes access more than 8 times slower
>> than memory speed, then it is cheaper to run 1 mapper/node. I tried
>> sharing with Terracotta and I was getting roughly a 600-times decrease in
>> performance versus in-memory access.
>>
>> So ideally, if I could have all the mappers in the same VM, then I could
>> create a singleton and still have multiple mappers access it at memory
>> speeds.
>>
>> Please do let me know if I am looking at this correctly and if the above
>> is possible.
>>
>> Thanks a lot for all your help.
>>
>> Cheers,
>> Dev
>>
>> On Fri, Oct 3, 2008 at 12:49 PM, Alan Ho <[EMAIL PROTECTED]> wrote:
>>
>>> It really depends on what type of data you are sharing, how you are
>>> looking up the data, whether the data is read-write, and whether you
>>> care about consistency. If you don't care about consistency, I suggest
>>> that you shove the data into a BDB store (for key-value lookup) or a
>>> Lucene store, and copy the data to all the nodes. That way all data
>>> access will be in-process, no GC problems, and you will get very fast
>>> results. BDB and Lucene both have easy replication strategies.
>>>
>>> If the data is RW and you need consistency, you should probably forget
>>> about MapReduce and just run everything on big iron.
>>>
>>> Regards,
>>> Alan Ho
>>>
>>> ----- Original Message -----
>>> From: Devajyoti Sarkar <[EMAIL PROTECTED]>
>>> To: core-user@hadoop.apache.org
>>> Sent: Thursday, October 2, 2008 8:41:04 PM
>>> Subject: Sharing an object across mappers
>>>
>>> I think each mapper/reducer runs in its own JVM, which makes it
>>> impossible to share objects. I need to share a large object so that I
>>> can access it at memory speeds across all the mappers. Is it possible to
>>> have all the mappers run in the same VM? Or is there a way to do this
>>> across VMs at high speed? I guess RMI and other such methods will be
>>> just too slow.
>>>
>>> Thanks,
>>> Dev
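Putting Arun's DistributedCache pointer together with Owen's mmap suggestion from the follow-up above, here is a hedged sketch (file names hypothetical). The point of mmap here is that the OS page cache holds one copy of the file shared by every task JVM on the node, rather than each JVM reading a private heap copy:

    import java.io.RandomAccessFile;
    import java.net.URI;
    import java.nio.MappedByteBuffer;
    import java.nio.channels.FileChannel;

    import org.apache.hadoop.filecache.DistributedCache;
    import org.apache.hadoop.mapred.JobConf;

    public class SharedBinaryData {

      // Job-submission side: ship the file once per node via DistributedCache.
      public static void register(JobConf conf) throws Exception {
        // The "#model.bin" fragment creates a symlink in each task's
        // working directory (path and name are hypothetical).
        DistributedCache.addCacheFile(new URI("/data/model.bin#model.bin"), conf);
        DistributedCache.createSymlink(conf);
      }

      // Task side (e.g. in Mapper.configure()): mmap instead of reading the
      // data onto the heap. Note FileChannel.map() is capped at 2GB per
      // mapping, so a 3-4GB file must be mapped in several chunks.
      public static MappedByteBuffer openFirstChunk() throws Exception {
        RandomAccessFile raf = new RandomAccessFile("model.bin", "r");
        FileChannel ch = raf.getChannel();
        long size = Math.min(ch.size(), Integer.MAX_VALUE);
        return ch.map(FileChannel.MapMode.READ_ONLY, 0, size);
      }
    }

This suits "straight binary data" as Owen says; a structure full of Java object references would still have to be deserialized per JVM.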
Re: Sharing an object across mappers
Hi Alan,

Thanks for your message.

The object can be read-only once it is initialized - I do not need to modify
it. Essentially it is an object that allows me to analyze/modify the data I
am mapping/reducing. It comes to about 3-4GB of RAM. The problem I have is
that if I run multiple mappers, this object gets replicated in the different
VMs and I run out of memory on my node. I pretty much need to have the full
object in memory to do my processing. It is possible (though quite difficult)
to keep it partially on disk and query it (like a Lucene store
implementation), but there is a significant performance hit. As an example,
say I use the xlarge CPU instance at Amazon (8 CPUs, 8GB RAM). In this
scenario, I can really only have 1 mapper per node even though there are 8
CPUs. But if the overhead of sharing the object (e.g. RMI) or persisting it
(e.g. Lucene) makes access more than 8 times slower than memory speed, then
it is cheaper to run 1 mapper/node. I tried sharing with Terracotta and I was
getting roughly a 600-times decrease in performance versus in-memory access.

So ideally, if I could have all the mappers in the same VM, then I could
create a singleton and still have multiple mappers access it at memory
speeds.

Please do let me know if I am looking at this correctly and if the above is
possible.

Thanks a lot for all your help.

Cheers,
Dev

On Fri, Oct 3, 2008 at 12:49 PM, Alan Ho <[EMAIL PROTECTED]> wrote:
> It really depends on what type of data you are sharing, how you are looking
> up the data, whether the data is read-write, and whether you care about
> consistency. If you don't care about consistency, I suggest that you shove
> the data into a BDB store (for key-value lookup) or a Lucene store, and
> copy the data to all the nodes. That way all data access will be
> in-process, no GC problems, and you will get very fast results. BDB and
> Lucene both have easy replication strategies.
>
> If the data is RW and you need consistency, you should probably forget
> about MapReduce and just run everything on big iron.
>
> Regards,
> Alan Ho
>
> ----- Original Message -----
> From: Devajyoti Sarkar <[EMAIL PROTECTED]>
> To: core-user@hadoop.apache.org
> Sent: Thursday, October 2, 2008 8:41:04 PM
> Subject: Sharing an object across mappers
>
> I think each mapper/reducer runs in its own JVM, which makes it impossible
> to share objects. I need to share a large object so that I can access it at
> memory speeds across all the mappers. Is it possible to have all the
> mappers run in the same VM? Or is there a way to do this across VMs at high
> speed? I guess RMI and other such methods will be just too slow.
>
> Thanks,
> Dev
>
> __
> Instant Messaging, free SMS, sharing photos and more... Try the new Yahoo!
> Canada Messenger at http://ca.beta.messenger.yahoo.com/
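To make Alan's BDB suggestion concrete, here is a minimal read-only key-value lookup sketch using Berkeley DB Java Edition (environment path, database name, and key are hypothetical). The data is copied to each node ahead of time, so every lookup is in-process with no network hop:

    import java.io.File;

    import com.sleepycat.je.Database;
    import com.sleepycat.je.DatabaseConfig;
    import com.sleepycat.je.DatabaseEntry;
    import com.sleepycat.je.Environment;
    import com.sleepycat.je.EnvironmentConfig;
    import com.sleepycat.je.LockMode;
    import com.sleepycat.je.OperationStatus;

    public class LocalLookup {
      public static void main(String[] args) throws Exception {
        // Open a read-only environment over data copied to the local node.
        EnvironmentConfig envCfg = new EnvironmentConfig();
        envCfg.setReadOnly(true);
        Environment env = new Environment(new File("/local/bdb-data"), envCfg);

        DatabaseConfig dbCfg = new DatabaseConfig();
        dbCfg.setReadOnly(true);
        Database db = env.openDatabase(null, "lookup", dbCfg);

        // In-process lookup: no RMI/socket round trip, and the store's
        // cache, not the Java heap, holds the bulk of the data.
        DatabaseEntry key = new DatabaseEntry("some-key".getBytes("UTF-8"));
        DatabaseEntry val = new DatabaseEntry();
        if (db.get(null, key, val, LockMode.DEFAULT) == OperationStatus.SUCCESS) {
          System.out.println(new String(val.getData(), "UTF-8"));
        }

        db.close();
        env.close();
      }
    }

This trades the in-memory access speed Dev wants for bounded heap use; as the thread notes, whether that trade wins depends on how much slower the store is than direct memory access.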
Sharing an object across mappers
I think each mapper/reducer runs in its own JVM, which makes it impossible to
share objects. I need to share a large object so that I can access it at
memory speeds across all the mappers. Is it possible to have all the mappers
run in the same VM? Or is there a way to do this across VMs at high speed? I
guess RMI and other such methods will be just too slow.

Thanks,
Dev
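The replies above converge on a singleton shared within one JVM. A minimal sketch of that pattern, under the assumption that map tasks actually do share a JVM - e.g. via the MultithreadedMapRunner discussed earlier, or via JVM reuse (conf.setInt("mapred.job.reuse.jvm.num.tasks", -1), added in Hadoop releases after this thread, which reuses one JVM for sequential tasks of a job and so amortizes the load cost rather than sharing concurrently). BigModel and its loader are hypothetical stand-ins for the 3-4GB object described in the thread:

    // Lazily initialized, thread-safe singleton holding the large
    // read-only object. With MultithreadedMapRunner the map threads
    // share it concurrently; with JVM reuse, successive tasks in the
    // same JVM avoid reloading it.
    public class ModelHolder {
      private static volatile BigModel model; // BigModel is hypothetical

      public static BigModel get() {
        if (model == null) {
          synchronized (ModelHolder.class) {
            if (model == null) {
              // Hypothetical loader for the structure described above.
              model = BigModel.loadFromLocalFile("model.bin");
            }
          }
        }
        return model;
      }
    }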
Re: Page Ranking, Hadoop And MPI.
Hi Ted,

I am really interested in learning more about the symmetry in
power-law/fractal graphs in general and how matrix math (SVD, etc.) allows us
to detect/exploit it. Would you have any recommendations for books or papers
that explore this question? I remember doing a standard linear algebra course
way back in college, but it never discussed the links to graphs/symmetry...

Thanks and best regards,
Dev

On Wed, Apr 16, 2008 at 2:33 AM, Ted Dunning <[EMAIL PROTECTED]> wrote:
>
> Power law algorithms are ideal for this kind of parallelized problem.
>
> The basic idea is that hub-and-authority style algorithms are intimately
> related to eigenvector or singular value decompositions (depending on
> whether the links are symmetrical). This also means that there is a close
> relationship to the asymptotic behavior of random walks on the graph.
>
> Represent the linkage of the web by a matrix with columns representing the
> source page and rows representing the target page, with a 1 wherever the
> source page has a link pointing to the target page. If you start with a
> vector that has a single non-zero element equal to 1, representing a single
> page, then multiplying by the linkage matrix gives you a vector with 1 in
> the positions corresponding to the pages the original page linked to. If
> you multiply again, you get all the pages you can reach in two steps from
> the original page.
>
> Mathematically, if we call the original vector x and the linkage matrix A,
> the pages that x links to are just Ax. The pages that are two steps from x
> are A(Ax) = A^2 x.
>
> The eigenvector decomposition of A is just a way of writing A as a product
> of three matrices:
>
>    A = U S U'
>
> U' is the transpose of U, and U has the special property that U'U = I (it
> is called ortho-normal because of this).
>
> S is a diagonal matrix.
>
> There is lots of deep mathematical machinery and beautiful symmetry
> available here, but for now we can just take this as given.
>
> The pages n steps from x are
>
>    x_n = A^n x = (U S U')^n x
>        = (U S U')^(n-2) (U S U') (U S U') x
>        = (U S U')^(n-2) (U S (U'U) S U') x
>        = (U S U')^(n-2) (U S^2 U') x
>        = U S^n U' x
>
> This is really cool because S^n can be computed by just raising each
> diagonal element to the n-th power.
>
> Eigenvector decompositions have other really deep connections. For
> instance, if you take the elements of S (call the i-th one s_i), then
>
>    sum_i s_i^n
>
> is the number of closed walks of length n (walks that return to their
> starting page), since it equals the trace of A^n.
>
> Connected (or nearly connected) clusters of pages can also be derived from
> the eigenvector decomposition. This is the basis of so-called spectral
> clustering. For some very impressive examples of spectral clustering, see
>
> http://citeseer.ist.psu.edu/ng01spectral.html
>
> So eigenvectors are cool. But how can we compute them?
>
> First, note that if A^n = U S^n U' and some of the s_i are bigger than
> others, the big ones quickly dominate the rest. That is, pretty quickly
> A^n \approx u_1 s_1^n u_1'. This means we can compute an approximation of
> u_1 by just computing A^n x, where x is some random vector. Moreover, we
> can compute u_2 by starting with a different random vector and iterating
> the same way, but with an additional step where we forbid the result from
> drifting toward u_1. With just a few additional wrinkles, this gives us
> what is called the Lanczos algorithm. Golub and van Loan's excellent book
> Matrix Computations gives a lot of information on these algorithms.
>
> The cool thing here is that our random vector can represent a single page,
> and we can approximate the final result by following links. Following
> links is just a (human-readable) way of saying sparse matrix
> multiplication. If we do this multiplication against lots of different
> random starting points, we can quickly build parallel algorithms to
> compute things like PageRank.
>
> I hope this helps. I know it is a big bite to take at one time.
>
> On 4/15/08 9:31 AM, "Chaman Singh Verma" <[EMAIL PROTECTED]> wrote:
>
> > Hello,
> >
> > After googling for many days, I couldn't get an answer from the many
> > published reports on Google's ranking algorithm. Since Google uses GFS
> > for fault-tolerance purposes, what communication libraries might they be
> > using to solve such a large matrix? I presume that standard Message
> > Passing (MPI) may not be suitable for this purpose (fault-tolerance
> > issues).
> >
> > Suppose we implement the ranking algorithm on top of Hadoop. What would
> > be the best way / best distributed algorithm / library, etc.?
> >
> > With regards,
> >
> > Chaman Singh Verma
> > Poona, India
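As a concrete companion to Ted's point that "A^n x approaches u_1" for a random x, here is a small self-contained Java sketch of power iteration over a sparse link matrix. The four-page link graph is a toy example invented for illustration; each pass is exactly the "follow links" sparse multiplication Ted describes:

    import java.util.Random;

    // Power iteration: repeatedly apply a sparse link matrix to a random
    // vector and normalize; the iterate converges to the dominant
    // eigenvector u_1 (the idea underlying PageRank-style scoring).
    public class PowerIteration {
      public static void main(String[] args) {
        // Sparse links: links[i] lists the pages that page i points to.
        int[][] links = { {1, 2}, {2}, {0}, {0, 2} };
        int n = links.length;

        // Random starting vector x.
        double[] x = new double[n];
        Random rnd = new Random(42);
        for (int i = 0; i < n; i++) x[i] = rnd.nextDouble();

        for (int iter = 0; iter < 50; iter++) {
          // y = A x, where A[target][source] = 1 for each link.
          double[] next = new double[n];
          for (int src = 0; src < n; src++)
            for (int dst : links[src])
              next[dst] += x[src];

          // Normalize so the iterate neither overflows nor vanishes.
          double norm = 0;
          for (double v : next) norm += v * v;
          norm = Math.sqrt(norm);
          for (int i = 0; i < n; i++) x[i] = next[i] / norm;
        }

        for (int i = 0; i < n; i++)
          System.out.printf("page %d: %.4f%n", i, x[i]);
      }
    }

The inner double loop is the part that parallelizes naturally: edges (and many random starting vectors) can be partitioned across machines, which is the link to the Hadoop question that opened the thread.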