Right, that's why you need a place to persist the task list / graph. If you use a table, you can set a "processed" / "unprocessed" value on each item; if you use a queue, each item is delivered only once. Otherwise you have to check the indexed date in Solr for every file, and waste a Solr call.
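Roughly, the table version could look like the sketch below (untested; the index_tasks table, its columns, and the JDBC URL are made-up placeholders):

import java.sql.*;

public class TaskTable {

    // Claim one unprocessed file path, or return null when nothing is left to do.
    static String claimNext(Connection db) throws SQLException {
        while (true) {
            String path;
            try (Statement s = db.createStatement();
                 ResultSet rs = s.executeQuery(
                     "SELECT file_path FROM index_tasks WHERE status = 'UNPROCESSED' LIMIT 1")) {
                if (!rs.next()) return null;        // nothing left unprocessed
                path = rs.getString(1);
            }
            // Optimistic claim: only one worker can win this UPDATE, which gives the
            // same "delivered only once" behaviour you would get from a queue.
            try (PreparedStatement u = db.prepareStatement(
                    "UPDATE index_tasks SET status = 'PROCESSING' " +
                    "WHERE file_path = ? AND status = 'UNPROCESSED'")) {
                u.setString(1, path);
                if (u.executeUpdate() == 1) return path;   // claimed it
            }
            // Another worker claimed that row first; loop and try the next one.
        }
    }

    // Mark the row done only after the document has actually been committed to Solr.
    static void markProcessed(Connection db, String path) throws SQLException {
        try (PreparedStatement u = db.prepareStatement(
                "UPDATE index_tasks SET status = 'PROCESSED' WHERE file_path = ?")) {
            u.setString(1, path);
            u.executeUpdate();
        }
    }

    public static void main(String[] args) throws Exception {
        try (Connection db = DriverManager.getConnection("jdbc:postgresql://localhost/tasks")) {
            String path;
            while ((path = claimNext(db)) != null) {
                // ... parse the JSON file at `path` and send it to Solr here ...
                markProcessed(db, path);
            }
        }
    }
}

Each worker just loops claim / index / mark-processed, and a row stuck in PROCESSING is easy to spot and re-queue later.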
--
Rahul Singh
rahul.si...@anant.us

Anant Corporation

On May 24, 2018, 12:54 PM -0500, Adhyan Arizki <a.ari...@gmail.com>, wrote:
> You will still need to devise a way to partition the data source even if you
> are scheduling multiple jobs; otherwise, you might end up digesting the same
> data again and again.
>
> On Fri, May 25, 2018 at 12:46 AM, Raymond Xie <xie3208...@gmail.com> wrote:
> > Thank you all for the suggestions. I'm now leaning towards not using
> > traditional parallel indexing. My data are JSON files with metadata
> > extracted from raw data received and archived into our data server
> > cluster. Those data come in various flows and reside in their respective
> > folders; splitting them might introduce unnecessary extra work and could
> > end up causing trouble. So instead of that, maybe it would be easier to
> > simply schedule multiple indexing jobs separately?
> >
> > Thanks.
> >
> > Raymond
> >
> > Rahul Singh <rahul.xavier.si...@gmail.com> 于 2018年5月24日周四 上午11:23写道:
> > > Resending to the list to help more people.
> > >
> > > This is an architectural pattern for solving the same issue that arises
> > > over and over again. The queue can be anything: a table in a database,
> > > even a Solr collection.
> > >
> > > And yes, I have implemented it. I did it in C# using a SQL Server
> > > table-based queue (http://github.com/appleseed/search-stack) and then
> > > made the indexer able to write to Lucene, Elasticsearch, or Solr
> > > depending on config. I'm not actively maintaining it right now, but I
> > > will consider porting it to a Kafka + Spark + Kafka Connect based
> > > system when I find time.
> > >
> > > With Kafka, however, you have a lot of potential with Kafka Connect.
> > > The example below uses Cassandra, but the premise is the same: Kafka
> > > Connect has libraries of connectors for different sources / sinks. It
> > > may not work for files, but for pure raw data Kafka Connect is good.
> > >
> > > Here's a project that may guide you best:
> > >
> > > http://saumitra.me/blog/tweet-search-and-analysis-with-kafka-solr-cassandra/
> > >
> > > I don't know where this guy's code went, but the content is there with
> > > code samples.
> > >
> > > --
> > >
> > > On May 23, 2018, 8:37 PM -0500, Raymond Xie <xie3208...@gmail.com>, wrote:
> > >
> > > Thank you Rahul, although that's very high level.
> > >
> > > With no offense, do you have a successful implementation, or is it just
> > > an unproven idea? I have never used RabbitMQ or Kafka before, but I
> > > would be very interested in learning more about the Kafka idea, as
> > > Kafka is available in my environment.
> > >
> > > Thank you again, and I look forward to hearing more from you or anyone
> > > in this Solr community.
> > >
> > > ------------------------------------------------
> > > Sincerely yours,
> > >
> > > Raymond
> > >
> > > On Wed, May 23, 2018 at 8:15 AM, Rahul Singh
> > > <rahul.xavier.si...@gmail.com> wrote:
> > >
> > > > Enumerate the file locations (map), put them in a queue like RabbitMQ
> > > > or Kafka (persist the map), and have a bunch of threads, workers,
> > > > containers, whatever, pop items off the queue and process each one
> > > > (reduce).
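To make that concrete, here is a rough, untested sketch of one such worker, with Kafka as the queue and SolrJ as the writer; the topic name, collection name, field names, and URLs are placeholders, not code from the project above:

import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class IndexWorker {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "solr-indexers");  // all workers share one consumer group
        props.put("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("enable.auto.commit", "false");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
             SolrClient solr = new HttpSolrClient.Builder(
                     "http://localhost:8983/solr/files").build()) {

            // Each message value is one file path produced by the enumeration (map) step.
            consumer.subscribe(Collections.singletonList("files-to-index"));

            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                if (records.isEmpty()) continue;

                for (ConsumerRecord<String, String> record : records) {
                    String path = record.value();
                    SolrInputDocument doc = new SolrInputDocument();
                    doc.addField("id", path);
                    doc.addField("content_txt", new String(
                            Files.readAllBytes(Paths.get(path)), StandardCharsets.UTF_8));
                    solr.add(doc);
                }
                solr.commit();
                consumer.commitSync();  // acknowledge only after Solr has the batch
            }
        }
    }
}

You can run as many copies of this as you like; because they share a consumer group, Kafka splits the enumerated file paths between them, so each path is handed to only one worker at a time.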
> > > > --
> > > > Rahul Singh
> > > > rahul.si...@anant.us
> > > >
> > > > Anant Corporation
> > > >
> > > > On May 20, 2018, 7:24 AM -0400, Raymond Xie <xie3208...@gmail.com>, wrote:
> > > >
> > > > I know how to index a file system, such as a single file or folder,
> > > > but how do I do that in parallel? The data I need to index is of huge
> > > > volume and can't be put on HDFS.
> > > >
> > > > Thank you
> > > >
> > > > ------------------------------------------------
> > > > Sincerely yours,
> > > >
> > > > Raymond
>
> --
> Best regards,
> Adhyan Arizki