You will still need to devise a way to partition the data source even if
you are scheduling multiple jobs; otherwise, you might end up digesting the
same data again and again.
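A minimal sketch of one way to do that partitioning (folder names and job count are hypothetical): hash each top-level folder path to a job id, so every scheduled job indexes a disjoint subset of folders and no file is digested twice.

```python
import zlib

def job_for_folder(folder: str, num_jobs: int) -> int:
    """Deterministically map a folder to one of num_jobs indexing jobs."""
    # A stable checksum (unlike Python's salted hash()) keeps the
    # assignment identical across runs and machines.
    return zlib.crc32(folder.encode("utf-8")) % num_jobs

# Hypothetical flow folders on the data server cluster.
folders = ["/data/flow_a", "/data/flow_b", "/data/flow_c", "/data/flow_d"]
num_jobs = 2

# Each scheduled job only indexes the folders assigned to it, so the
# jobs never overlap and never re-digest the same data.
my_job_id = 0
my_folders = [f for f in folders if job_for_folder(f, num_jobs) == my_job_id]
print(my_folders)
```

Because the assignment is a pure function of the path, no coordination between the jobs is needed.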

On Fri, May 25, 2018 at 12:46 AM, Raymond Xie <xie3208...@gmail.com> wrote:

> Thank you all for the suggestions. I'm now leaning toward not using
> traditional parallel indexing. My data are JSON files with metadata
> extracted from raw data received and archived on our data server cluster.
> Those data come in various flows and reside in their respective folders;
> splitting them might introduce unnecessary extra work and could end up in
> trouble. So instead, maybe it would be easier to simply schedule
> multiple indexing jobs separately?
>
> Thanks.
>
> Raymond
>
>
> Rahul Singh <rahul.xavier.si...@gmail.com> wrote on Thu, May 24, 2018 at 11:23 AM:
>
> > Resending to list to help more people..
> >
> > This is an architectural pattern to solve the same issue that arises over
> > and over again. The queue can be anything: a table in a database, even a
> > collection in Solr.
> >
> > And yes, I have implemented it. I did it in C# using a SQL Server
> > table-based queue (http://github.com/appleseed/search-stack) and then
> > made the indexer able to write to Lucene, Elasticsearch, or Solr
> > depending on config. I'm not actively maintaining it right now, but will
> > consider porting it to a Kafka + Spark + Kafka Connect based system when
> > I find time.
> >
> > With Kafka, however, you have a lot of potential in Kafka Connect. Here
> > is an example using Cassandra. The premise is the same: Kafka Connect
> > has libraries of connectors for different sources/sinks. It may not work
> > for files, but for pure raw data Kafka Connect is good.
> >
> > Here’s a project that may guide you best.
> >
> >
> > http://saumitra.me/blog/tweet-search-and-analysis-with-kafka-solr-cassandra/
> >
> > I don't know where this guy's code went, but the content is there with
> > code samples.
> >
> >
> >
> >
> > --
> >
> > On May 23, 2018, 8:37 PM -0500, Raymond Xie <xie3208...@gmail.com>, wrote:
> >
> > Thank you, Rahul, though that's very high level.
> >
> > No offense, but do you have a successful implementation, or is this just
> > an unproven idea? I have never used RabbitMQ or Kafka before, but I would
> > be very interested in more detail on the Kafka idea, as Kafka is
> > available in my environment.
> >
> > Thank you again, and I look forward to hearing more from you or anyone
> > in this Solr community.
> >
> >
> > *------------------------------------------------*
> > *Sincerely yours,*
> >
> >
> > *Raymond*
> >
> > On Wed, May 23, 2018 at 8:15 AM, Rahul Singh <rahul.xavier.si...@gmail.com> wrote:
> >
> >> Enumerate the file locations (map), put them in a queue like RabbitMQ
> >> or Kafka (persist the map), and have a bunch of threads, workers,
> >> containers, whatever, pop items off the queue and process them (reduce).
> >>
> >>
> >> --
> >> Rahul Singh
> >> rahul.si...@anant.us
> >>
> >> Anant Corporation
> >>
> >> On May 20, 2018, 7:24 AM -0400, Raymond Xie <xie3208...@gmail.com>, wrote:
> >>
> >> I know how to index on the file system, like a single file or folder,
> >> but how do I do that in parallel? The data I need to index is of huge
> >> volume and can't be put on HDFS.
> >>
> >> Thank you
> >>
> >> *------------------------------------------------*
> >> *Sincerely yours,*
> >>
> >>
> >> *Raymond*
> >>
> >>
> >
>
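The enumerate/queue/worker pattern Rahul describes above can be sketched with Python's standard library alone; `queue.Queue` stands in for RabbitMQ or Kafka, and `index_document` is a hypothetical stand-in for your actual Solr client call:

```python
import queue
import threading

def index_document(path: str) -> None:
    # Hypothetical stand-in for posting the file's JSON to Solr.
    print(f"indexed {path}")

# Map: enumerate the file locations and put them in a queue.
work = queue.Queue()
for path in ["/data/flow_a/1.json", "/data/flow_a/2.json", "/data/flow_b/1.json"]:
    work.put(path)

# Reduce: a pool of workers pops items off the queue and processes them.
def worker() -> None:
    while True:
        try:
            path = work.get_nowait()
        except queue.Empty:
            return  # queue drained, worker exits
        index_document(path)
        work.task_done()

threads = [threading.Thread(target=worker) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

With a real broker, the queue also persists the enumeration, so workers can crash and resume without re-indexing what was already done.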



-- 

Best regards,
Adhyan Arizki
