Re: Distributed Lucene Questions

2009-06-02 Thread Devajyoti Sarkar
Thanks a lot.

Dev

On Wed, Jun 3, 2009 at 11:45 AM, Paco NATHAN  wrote:

> Our Katta cluster at ShareThis runs on EC2.
> Managed using RightScale templates
> (thanks @mikebabineau)
>
>
> On Tue, Jun 2, 2009 at 19:46, Devajyoti Sarkar  wrote:
> > Can Katta be used on an EC2 cluster?
> >
> > The reason I ask is that Katta appears to use ZooKeeper, which ideally
> > needs a dedicated drive. That may not be possible in a shared
> > environment. Is this a non-issue with respect to Katta?
> >
> > I would appreciate any input in this regard.
> >
> > Dev
>


Re: Distributed Lucene Questions

2009-06-02 Thread Devajyoti Sarkar
Can Katta be used on an EC2 cluster?

The reason I ask is that Katta appears to use ZooKeeper, which ideally needs
a dedicated drive. That may not be possible in a shared environment. Is this
a non-issue with respect to Katta?

I would appreciate any input in this regard.

Dev


On Wed, Jun 3, 2009 at 2:03 AM, Ted Dunning  wrote:

> Just a quick plug for Katta.  We use it extensively (and have been sending
> back some patches).
>
> See www.deepdyve.com for a test drive.
>
> At my previous job, we had utter fits working with SOLR using sharded
> retrieval.  Katta is designed to address the sharding problem very well and
> we have been very happy.  Our extensions have been to adapt Katta so that
> it
> is a general sharding and replication engine that supports general queries.
> For some things we use a modified Lucene, for other things, we use our own
> code.  Katta handles that really well.
>
> On Tue, Jun 2, 2009 at 9:23 AM, Tarandeep Singh 
> wrote:
>
> > thanks all for your replies. I am checking Katta...
> >
> > -Tarandeep
> >
> > On Tue, Jun 2, 2009 at 8:05 AM, Stefan Groschupf  wrote:
> >
> > > Hi,
> > > you might want to check out:
> > > http://katta.sourceforge.net/
> > >
> > > Stefan
> > >
> > > ~~~
> > > Hadoop training and consulting
> > > http://www.scaleunlimited.com
> > > http://www.101tec.com
> > >
> > >
> > >
> > >
> > > On Jun 1, 2009, at 9:54 AM, Tarandeep Singh wrote:
> > >
> > >  Hi All,
> > >>
> > >> I am trying to build a distributed system to build and serve Lucene
> > >> indexes.
> > >> I came across the Distributed Lucene project:
> > >> http://wiki.apache.org/hadoop/DistributedLucene
> > >> https://issues.apache.org/jira/browse/HADOOP-3394
> > >>
> > >> and have a couple of questions. It will be really helpful if someone
> > >> can provide some insights.
> > >>
> > >> 1) Is this code production ready?
> > >> 2) Does anyone have performance data for this project?
> > >> 3) It allows searches and updates/deletes to be performed at the same
> > >> time.
> > >> How well will the system perform if there are frequent updates to the
> > >> system? Will it handle the search and update load easily, or will it be
> > >> better to rebuild or update the indexes on different machines and then
> > >> deploy the indexes back to the machines that are serving the indexes?
> > >>
> > >> Basically I am trying to choose between the 2 approaches-
> > >>
> > >> 1) Use Hadoop to build and/or update Lucene indexes and then deploy them
> > >> on a separate cluster that will take care of load balancing, fault
> > >> tolerance, etc.
> > >> There is a package in Hadoop contrib that does this, so I can use that
> > >> code.
> > >>
> > >> 2) Use and/or modify the Distributed Lucene code.
> > >>
> > >> I am expecting daily updates to our index, so I am not sure if the
> > >> Distributed Lucene code (which allows searches and updates on the same
> > >> indexes) will be able to handle the search and update load efficiently.
> > >>
> > >> Any suggestions ?
> > >>
> > >> Thanks,
> > >> Tarandeep
> > >>
> > >
> > >
> >
>
>
>
> --
> Ted Dunning, CTO
> DeepDyve
>
> 111 West Evelyn Ave. Ste. 202
> Sunnyvale, CA 94086
> http://www.deepdyve.com
> 858-414-0013 (m)
> 408-773-0220 (fax)
>


Re: Sharing an object across mappers

2008-10-03 Thread Devajyoti Sarkar
Hi Owen,

Thanks a lot for the pointers.

In order to use the MultiThreadedMapRunner, if I set it via the
setMapRunnerClass() method on the JobConf, does the rest of my code remain
the same (apart from making it thread-safe)?

Thanks in advance,
Dev


On Sat, Oct 4, 2008 at 12:29 AM, Owen O'Malley <[EMAIL PROTECTED]> wrote:

>
> On Oct 3, 2008, at 7:49 AM, Devajyoti Sarkar wrote:
>
>  Briefly going through the DistributedCache information, it seems to be a
>> way
>> to distribute files to mappers/reducers.
>>
>
> Sure, but it handles the distribution problem for you.
>
>  One still needs to read the
>> contents into each map/reduce task VM.
>>
>
> If the data is straight binary data, you could just mmap it from the
> various tasks. It would be pretty efficient.
>
> The other direction is to use the MultiThreadedMapRunner and run multiple
> maps as threads in the same VM. But unless your maps are CPU heavy or
> contacting external servers, it probably won't help as much as you'd like.
>
> -- Owen
>
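
For reference, a minimal sketch of the configuration change being discussed,
against the old org.apache.hadoop.mapred API used in this thread. The class in
that API is MultithreadedMapRunner (in org.apache.hadoop.mapred.lib); the
thread-count property name below is from memory and may differ between Hadoop
versions. The mapper implementation itself is reused unchanged; it just has to
be thread-safe, because several map() calls then run concurrently in one JVM:

import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.MultithreadedMapRunner;

public class MultithreadedJobSetup {
    public static JobConf configure(JobConf conf) {
        // Swap the default, single-threaded MapRunner for the multithreaded
        // one; map() calls will then run on several threads in one task JVM.
        conf.setMapRunnerClass(MultithreadedMapRunner.class);
        // Number of concurrent map threads per task JVM (property name may
        // vary between versions).
        conf.setInt("mapred.map.multithreadedrunner.threads", 8);
        return conf;
    }
}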


Re: Sharing an object across mappers

2008-10-03 Thread Devajyoti Sarkar
Hi Arun,

Briefly going through the DistributedCache information, it seems to be a way
to distribute files to mappers/reducers. One still needs to read the
contents into each map/reduce task VM. Therefore, the data gets replicated
across the VMs on a single node. It does not seem to address my basic
problem, which is to have a large shared object across multiple map/reduce
tasks on a given node without having to replicate it in each VM.

Is there a setting in Hadoop where one can tell Hadoop to create the
individual map/reduce tasks in the same JVM?

Thanks,
Dev


On Fri, Oct 3, 2008 at 10:32 PM, Arun C Murthy <[EMAIL PROTECTED]> wrote:

>
> On Oct 3, 2008, at 1:10 AM, Devajyoti Sarkar wrote:
>
>  Hi Alan,
>>
>> Thanks for your message.
>>
>> The object can be read-only once it is initialized - I do not need to
>> modify
>>
>
> Please take a look at DistributedCache:
>
> http://hadoop.apache.org/core/docs/current/mapred_tutorial.html#DistributedCache
>
> An example:
>
> http://hadoop.apache.org/core/docs/current/mapred_tutorial.html#Example%3A+WordCount+v2.0
>
> Arun
>
>
>
>> it. Essentially it is an object that allows me to analyze/modify data that
>> I
>> am mapping/reducing. It comes to about 3-4GB of RAM. The problem I have is
>> that if I run multiple mappers, this object gets replicated in the
>> different
>> VMs and I run out of memory on my node. I pretty much need to have the
>> full
>> object in memory to do my processing. It is possible (though quite
>> difficult) to have it partially on disk and query it (like a lucene store
>> implementation) but there is a significant performance hit. As an example,
>> say I use the xlarge CPU instance at Amazon (8 CPUs, 8 GB RAM). In this
>> scenario, I can really only have 1 mapper per node whereas there are 8
>> CPUs. But if the overhead of sharing the object (e.g. RMI) or persisting
>> the object (e.g. Lucene) slows access down by more than a factor of 8
>> relative to memory speed, then it is cheaper to run 1 mapper/node. I tried
>> sharing with Terracotta and I was getting roughly a 600x slowdown versus
>> in-memory access.
>>
>> So ideally, if I could have all the mappers in the same VM, then I can
>> create a singleton and still have multiple mappers access it at memory
>> speeds.
>>
>> Please do let me know if I am looking at this correctly and if the above
>> is
>> possible.
>>
>> Thanks a lot for all your help.
>>
>> Cheers,
>> Dev
>>
>>
>>
>>
>> On Fri, Oct 3, 2008 at 12:49 PM, Alan Ho <[EMAIL PROTECTED]> wrote:
>>
>>  It really depends on what type of data you are sharing, how you are
>>> looking
>>> up the data, whether the data is Read-write, and whether you care about
>>> consistency. If you don't care about consistency, I suggest that you
>>> shove
>>> the data into a BDB store (for key-value lookup) or a lucene store, and
>>> copy
>>> the data to all the nodes. That way all data access will be in-process,
>>> no
>>> gc problems, and you will get very fast results. BDB and lucene both have
>>> easy replication strategies.
>>>
>>> If the data is RW, and you need consistency, you should probably forget
>>> about MapReduce and just run everything on big-iron.
>>>
>>> Regards,
>>> Alan Ho
>>>
>>>
>>>
>>>
>>> - Original Message 
>>> From: Devajyoti Sarkar <[EMAIL PROTECTED]>
>>> To: core-user@hadoop.apache.org
>>> Sent: Thursday, October 2, 2008 8:41:04 PM
>>> Subject: Sharing an object across mappers
>>>
>>> I think each mapper/reducer runs in its own JVM, which makes it impossible
>>> to share objects. I need to share a large object so that I can access it
>>> at memory speeds across all the mappers. Is it possible to have all the
>>> mappers run in the same VM? Or is there a way to do this across VMs at
>>> high speed? I guess RMI and other such methods will just be too slow.
>>>
>>> Thanks,
>>> Dev
>>>
>>>
>>>
>>>
>
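
For concreteness, here is a rough sketch of the DistributedCache pattern Arun
points to, combined with the mmap suggestion from Owen's reply in this thread:
ship the file once per node, then map it read-only from each task so the OS
page cache, rather than each JVM heap, holds the bytes. It is written against
the old org.apache.hadoop.mapred API; the HDFS path and class names are
placeholders invented for illustration, and it assumes the file fits in a
single MappedByteBuffer (which tops out at 2 GB, so a 3-4 GB file would need
several mappings):

import java.io.IOException;
import java.io.RandomAccessFile;
import java.net.URI;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class SharedDataMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, Text> {

    private MappedByteBuffer sharedData;  // read-only view of the cached file

    // At job submission time: register the file once. Hadoop then copies it
    // to each node's local disk, not once per task attempt.
    public static void registerCache(JobConf conf) throws IOException {
        DistributedCache.addCacheFile(
            URI.create("hdfs:///data/shared-lookup.bin"), conf);  // placeholder path
    }

    @Override
    public void configure(JobConf conf) {
        try {
            Path[] cached = DistributedCache.getLocalCacheFiles(conf);
            // Map the local copy read-only. The pages live in the OS page
            // cache and are shared by every task JVM on the node.
            RandomAccessFile raf = new RandomAccessFile(cached[0].toString(), "r");
            sharedData = raf.getChannel()
                            .map(FileChannel.MapMode.READ_ONLY, 0, raf.length());
            raf.close();  // the mapping stays valid after the file is closed
        } catch (IOException e) {
            throw new RuntimeException("Could not map cached file", e);
        }
    }

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, Text> output, Reporter reporter)
            throws IOException {
        // ... consult sharedData while processing each record ...
    }
}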


Re: Sharing an object across mappers

2008-10-03 Thread Devajyoti Sarkar
Hi Alan,

Thanks for your message.

The object can be read-only once it is initialized - I do not need to modify
it. Essentially it is an object that allows me to analyze/modify data that I
am mapping/reducing. It comes to about 3-4GB of RAM. The problem I have is
that if I run multiple mappers, this object gets replicated in the different
VMs and I run out of memory on my node. I pretty much need to have the full
object in memory to do my processing. It is possible (though quite
difficult) to have it partially on disk and query it (like a lucene store
implementation) but there is a significant performance hit. As an example, say
I use the xlarge CPU instance at Amazon (8 CPUs, 8 GB RAM). In this scenario,
I can really only have 1 mapper per node whereas there are 8 CPUs. But if the
overhead of sharing the object (e.g. RMI) or persisting the object (e.g.
Lucene) slows access down by more than a factor of 8 relative to memory speed,
then it is cheaper to run 1 mapper/node. I tried sharing with Terracotta and I
was getting roughly a 600x slowdown versus in-memory access.

So ideally, if I could have all the mappers in the same VM, then I can
create a singleton and still have multiple mappers access it at memory
speeds.

Please do let me know if I am looking at this correctly and if the above is
possible.

Thanks a lot for all your help.

Cheers,
Dev




On Fri, Oct 3, 2008 at 12:49 PM, Alan Ho <[EMAIL PROTECTED]> wrote:

> It really depends on what type of data you are sharing, how you are looking
> up the data, whether the data is Read-write, and whether you care about
> consistency. If you don't care about consistency, I suggest that you shove
> the data into a BDB store (for key-value lookup) or a lucene store, and copy
> the data to all the nodes. That way all data access will be in-process, no
> gc problems, and you will get very fast results. BDB and lucene both have
> easy replication strategies.
>
> If the data is RW, and you need consistency, you should probably forget
> about MapReduce and just run everything on big-iron.
>
> Regards,
> Alan Ho
>
>
>
>
> - Original Message 
> From: Devajyoti Sarkar <[EMAIL PROTECTED]>
> To: core-user@hadoop.apache.org
> Sent: Thursday, October 2, 2008 8:41:04 PM
> Subject: Sharing an object across mappers
>
> I think each mapper/reducer runs in its own JVM, which makes it impossible
> to share objects. I need to share a large object so that I can access it at
> memory speeds across all the mappers. Is it possible to have all the mappers
> run in the same VM? Or is there a way to do this across VMs at high speed? I
> guess RMI and other such methods will just be too slow.
>
> Thanks,
> Dev
>
>
>
>


Sharing an object across mappers

2008-10-02 Thread Devajyoti Sarkar
I think each mapper/reducer runs in its own JVM, which makes it impossible to
share objects. I need to share a large object so that I can access it at
memory speeds across all the mappers. Is it possible to have all the mappers
run in the same VM? Or is there a way to do this across VMs at high speed? I
guess RMI and other such methods will just be too slow.

Thanks,
Dev


Re: Page Ranking, Hadoop And MPI.

2008-04-15 Thread Devajyoti Sarkar
Hi Ted,

I am really interested in learning more about the symmetry in power
law/fractal graphs in general and how matrix math (SVD, etc.) allows us to
detect/exploit it. Would you have any recommendations for books or papers
that I could read that explore this question? I remember doing a standard
Linear Algebra course way back in college, but it never discussed the links
to graphs/symmetry...

Thanks and best regards,
Dev



On Wed, Apr 16, 2008 at 2:33 AM, Ted Dunning <[EMAIL PROTECTED]> wrote:

>
> Power law algorithms are ideal for this kind of parallelized problem.
>
> The basic idea is that hub and authority style algorithms are intimately
> related to eigenvector or singular value decompositions (depending on
> whether the links are symmetrical).  This also means that there is a close
> relationship to the asymptotic behavior of random walks on the graph.
>
> If you represent the linkage of the web by a matrix with columns
> representing the source page and rows representing the target page, with a
> 1 wherever the source page has a link pointing to the target page, and you
> start with a vector whose single non-zero element (equal to 1) represents a
> single page, then multiplying by the linkage matrix gives you a vector with
> 1 in the positions corresponding to the pages the original page linked to.
> If you multiply again, you get all the pages that you can reach in two
> steps from the original page.
>
> Mathematically, if we call the original vector x and the linkage matrix A,
> the pages that x links to are just Ax.   The pages that are two steps from
> x
> are A(Ax) = A^2 x.
>
> The eigenvector decomposition of A is just a way of writing A as a product
> of three matrices:
>
>A = U S U'
>
> U' is the transpose of U, and U has the special property that U'U = I (it
> is
> called ortho-normal because of this).
>
> S is a diagonal matrix.
>
> There is lots of deep mathematical machinery and beautiful symmetry
> available here, but for now we can just take this as given.
>
> The pages n steps from x are
>
>   x_n = A^n x = (U S U')^n x = (U S U')^(n-2) (U S U') (U S U') x
>   = (U S U')^(n-2) (U S (U'U) S U') x = (U S U')^(n-2) (U S^2 U') x
>   = U S^n U' x
>
> This is really cool because S^n can be computed by just taking each
> diagonal
> element and raising it to a power.
>
> Eigenvector decompositions have other, really deep connections.  For
> instance, if you take the elements of S (call the i-th one s_i) then
>
>    sum_i s_i^n
>
> is the number of closed walks of length n in the graph (it is the trace of
> A^n).
>
> Connected (or nearly connected) clusters of pages can also be derived from
> the eigenvector decomposition.  This is the basis of so-called spectral
> clustering.  For some very impressive examples of spectral clustering see
>
> http://citeseer.ist.psu.edu/ng01spectral.html
>
> So eigenvectors are cool.  But how can we compute them?
>
> First, note that if A^n = U S^n U' and if some of the s_i are bigger than
> others, the big ones will quickly dominate the others.  That is, pretty
> quickly, A^n \approx u_1 s_1^n u_1'.  This means that we can compute an
> approximation of u_1 by just doing A^n x where x is some random vector.
> Moreover, we can compute u_2 by starting with a different random vector
> and
> iterating the same way, but with an additional step where we forbid the
> result from going towards u_1.  With just a few additional wrinkles, this
> gives us what is called the Lanczos algorithm.  Golub and van Loan's
> excellent book Matrix Computations gives a lot of information on these
> algorithms.
>
> The cool thing here is that our random vector can represent a single page
> and we can approximate the final result by following links.  Following
> links
> is just a (human-readable) way of saying sparse matrix multiplication.  If
> we do this multiplication against lots of different random starting
> points,
> we can quickly build parallel algorithms to compute things like page rank.
>
> I hope this helps.  I know it is a big byte to take at one time.
>
>
>
>
>
>
>
> On 4/15/08 9:31 AM, "Chaman Singh Verma" <[EMAIL PROTECTED]> wrote:
>
> > Hello,
> >
> > After googling for many days, I couldn't find an answer in the many
> > published reports on Google's ranking algorithm. Since Google uses GFS
> > for fault-tolerance purposes, what communication libraries might they be
> > using to solve such a large matrix? I presume that standard Message
> > Passing (MPI) may not be suitable for this purpose (fault-tolerance
> > issues).
> >
> > Suppose we implement a ranking algorithm on top of Hadoop. What would be
> > the best approach/distributed algorithm/library, etc.?
> >
> > With regards.
> >
> > Chaman Singh Verma
> > Poona, India
> >
> >
> >
> >
>
>
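
As a concrete footnote to the explanation above, here is a toy, self-contained
Java sketch of the power iteration Ted describes: repeatedly multiply a vector
by the column-normalized link matrix and it converges toward the dominant
eigenvector, which is the PageRank-style score. The 4-page link graph and all
numbers are invented for illustration, the damping factor of real PageRank is
omitted, and in a Hadoop job each multiply would be one MapReduce pass over
the sparse link data rather than a dense in-memory loop:

public class PowerIteration {
    public static void main(String[] args) {
        // a[target][source] = 1 if the source page links to the target page.
        double[][] a = {
            {0, 0, 1, 0},
            {1, 0, 0, 0},
            {1, 1, 0, 1},
            {0, 1, 1, 0},
        };
        int n = a.length;

        // Column-normalize so each source spreads its weight over its links.
        for (int j = 0; j < n; j++) {
            double colSum = 0;
            for (int i = 0; i < n; i++) colSum += a[i][j];
            if (colSum > 0) {
                for (int i = 0; i < n; i++) a[i][j] /= colSum;
            }
        }

        // Start from a uniform vector and iterate x <- A x, renormalizing.
        double[] x = new double[n];
        java.util.Arrays.fill(x, 1.0 / n);
        for (int iter = 0; iter < 50; iter++) {
            double[] next = new double[n];
            for (int i = 0; i < n; i++) {
                for (int j = 0; j < n; j++) {
                    next[i] += a[i][j] * x[j];
                }
            }
            double sum = 0;
            for (double v : next) sum += v;
            for (int i = 0; i < n; i++) next[i] /= sum;  // keep it a distribution
            x = next;
        }

        // x now approximates the dominant eigenvector of A, i.e. the
        // stationary distribution of a random surfer on this tiny graph.
        for (int i = 0; i < n; i++) {
            System.out.printf("page %d: %.4f%n", i, x[i]);
        }
    }
}

The refinements Ted mentions (Lanczos, spectral clustering) build on the same
multiply-and-renormalize loop; the heavy lifting stays sparse matrix-vector
products, which is what makes the whole computation MapReduce-friendly.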