That is exactly what I want.

I want the Hadoop TaskTracker to run on the same server that holds that
shard's local Solr index. This way there is no need to move any data
around; I think other people call this feature the 'data locality' of
map/reduce.
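
As far as I understand it, Hadoop gets that locality from
InputSplit.getLocations(): the scheduler tries to run each map task on a
host that the split reports. So a custom split that advertises the host
serving a Solr shard should be all the scheduler needs. A rough sketch
(SolrShardSplit and its fields are made-up names; nothing like this
exists in Solr4 today):

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapreduce.InputSplit;

// Hypothetical split: one per Solr shard, reporting the host that serves
// the shard so the scheduler can place the map task on that machine.
public class SolrShardSplit extends InputSplit implements Writable {

    private Text shardName = new Text(); // e.g. "shard1"
    private Text hostName = new Text();  // host holding the shard's index

    public SolrShardSplit() {}           // required for deserialization

    public SolrShardSplit(String shard, String host) {
        shardName.set(shard);
        hostName.set(host);
    }

    @Override
    public long getLength() {
        return 0; // size hint only; could return the index size in bytes
    }

    @Override
    public String[] getLocations() {
        // This is the data-locality hint Hadoop uses when scheduling.
        return new String[] { hostName.toString() };
    }

    public void write(DataOutput out) throws IOException {
        shardName.write(out);
        hostName.write(out);
    }

    public void readFields(DataInput in) throws IOException {
        shardName.readFields(in);
        hostName.readFields(in);
    }
}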

I believe the HBase/Hadoop integration works exactly like this. The only
difference here is that we would be substituting the distributed Solr
indexes for HDFS.

Since Solr4 already manages the sharded/distributed index files, it is
doing essentially the same placement work that HDFS does. In theory, this
should be achievable.
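
On the Hadoop side this would only take an InputFormat whose getSplits()
enumerates the collection's shards, with the shard-to-host map read from
the SolrCloud cluster state in ZooKeeper (stubbed out here). Again, this
SolrShardInputFormat is only a hedged sketch building on the split above,
not an existing API:

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import org.apache.hadoop.mapreduce.InputFormat;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;

// Hypothetical InputFormat: one split per Solr shard. The record reader
// (not shown) would open the shard's Lucene index directly from local
// disk, e.g. with DirectoryReader, instead of querying Solr over HTTP.
public abstract class SolrShardInputFormat<K, V> extends InputFormat<K, V> {

    // Stub: would read the collection's shard -> host map from the
    // SolrCloud cluster state in ZooKeeper.
    protected abstract Map<String, String> shardToHost(JobContext context)
            throws IOException;

    @Override
    public List<InputSplit> getSplits(JobContext context) throws IOException {
        List<InputSplit> splits = new ArrayList<InputSplit>();
        for (Map.Entry<String, String> e : shardToHost(context).entrySet()) {
            splits.add(new SolrShardSplit(e.getKey(), e.getValue()));
        }
        return splits;
    }
}

The record reader would then read each shard's index straight off the
local disk instead of going through the Solr HTTP endpoint, much as
HBase's TableInputFormat places map tasks next to the region servers that
hold the data.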

On Thu, Jul 26, 2012 at 7:51 PM, Lance Norskog <goks...@gmail.com> wrote:

> No. This is just a Hadoop file input class. Distributed Hadoop has to
> get files from a distributed file service. It sounds like you want
> some kind of distributed file service that maps a TaskNode (??) on a
> given server to the files available on that server. There might be
> something that does this. HDFS works very hard at doing this; are you
> sure it is not good enough? I am endlessly amazed at the speed of
> these distributed apps.
>
> Have you done a proof of concept?
>
> On Thu, Jul 26, 2012 at 7:40 PM, Trung Pham <tr...@phamcom.com> wrote:
> > Can it read distributed Lucene indexes in SolrCloud?
> > On Jul 26, 2012 7:11 PM, "Lance Norskog" <goks...@gmail.com> wrote:
> >
> >> Mahout includes a file reader for Lucene indexes. It will read from
> >> HDFS or local disks.
> >>
> >> On Thu, Jul 26, 2012 at 6:57 PM, Darren Govoni <dar...@ontrenet.com>
> >> wrote:
> >> > You raise an interesting possibility. A map/reduce Solr handler over
> >> > SolrCloud...
> >> >
> >> > On Thu, 2012-07-26 at 18:52 -0700, Trung Pham wrote:
> >> >
> >> >> I think the performance should be close to Hadoop running on HDFS,
> >> >> if somehow the Hadoop job can directly read the Solr index files
> >> >> while executing the job on the local Solr node.
> >> >>
> >> >> Kinda like how HBase and Cassandra integrate with Hadoop.
> >> >>
> >> >> Plus, we can run the map reduce job on a standby Solr4 cluster.
> >> >>
> >> >> This way, the documents in Solr will be our primary source of
> >> >> truth. And we have the ability to run near real time search queries
> >> >> and analytics on it. No need to export data around.
> >> >>
> >> >> Solr4 is becoming a very interesting solution to many web scale
> >> >> problems. Just missing the map/reduce component. :)
> >> >>
> >> >> On Thu, Jul 26, 2012 at 3:01 PM, Darren Govoni <dar...@ontrenet.com>
> >> >> wrote:
> >> >>
> >> >> > Of course you can do it, but the question is whether this will
> >> >> > produce the performance results you expect. I've seen talk about
> >> >> > this in other forums, so you might find some prior work here.
> >> >> >
> >> >> > Solr and HDFS serve somewhat different purposes. The key issue
> >> >> > would be whether your map and reduce code overloads the Solr
> >> >> > endpoint. Even using SolrCloud, I believe all requests will have
> >> >> > to go through a single URL (to be routed), so if you have
> >> >> > thousands of map/reduce jobs all running simultaneously, the
> >> >> > question is whether your Solr is architected to handle that
> >> >> > amount of throughput.
> >> >> >
> >> >> >
> >> >> > On Thu, 2012-07-26 at 14:55 -0700, Trung Pham wrote:
> >> >> >
> >> >> > > Is it possible to run map reduce jobs directly on Solr4?
> >> >> > >
> >> >> > > I'm asking this because I want to use Solr4 as the primary
> >> >> > > storage engine, and I want to be able to run near real time
> >> >> > > analytics against it as well, rather than export Solr4 data out
> >> >> > > to a Hadoop cluster.
> >> >> >
> >> >> >
> >> >> >
> >> >
> >> >
> >>
> >>
> >>
> >> --
> >> Lance Norskog
> >> goks...@gmail.com
> >>
>
>
>
> --
> Lance Norskog
> goks...@gmail.com
>
