Right, that's why you need a place to persist the task list / graph. If you use a table, you can set a "processed" / "unprocessed" value on each item; if you use a queue, each item is delivered only once. Otherwise you have to check the indexed date in Solr for every file, and waste a Solr call.
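Roughly, the table version could look like the sketch below (untested; the index_tasks table, its columns, and the JDBC URL are made-up placeholders):

import java.sql.*;

public class TaskTable {

    // Claim one unprocessed file path, or return null when nothing is left to do.
    static String claimNext(Connection db) throws SQLException {
        while (true) {
            String path;
            try (Statement s = db.createStatement();
                 ResultSet rs = s.executeQuery(
                     "SELECT file_path FROM index_tasks WHERE status = 'UNPROCESSED' LIMIT 1")) {
                if (!rs.next()) return null;        // nothing left unprocessed
                path = rs.getString(1);
            }
            // Optimistic claim: only one worker can win this UPDATE, which gives the
            // same "delivered only once" behaviour you would get from a queue.
            try (PreparedStatement u = db.prepareStatement(
                    "UPDATE index_tasks SET status = 'PROCESSING' " +
                    "WHERE file_path = ? AND status = 'UNPROCESSED'")) {
                u.setString(1, path);
                if (u.executeUpdate() == 1) return path;   // claimed it
            }
            // Another worker claimed that row first; loop and try the next one.
        }
    }

    // Mark the row done only after the document has actually been committed to Solr.
    static void markProcessed(Connection db, String path) throws SQLException {
        try (PreparedStatement u = db.prepareStatement(
                "UPDATE index_tasks SET status = 'PROCESSED' WHERE file_path = ?")) {
            u.setString(1, path);
            u.executeUpdate();
        }
    }

    public static void main(String[] args) throws Exception {
        try (Connection db = DriverManager.getConnection("jdbc:postgresql://localhost/tasks")) {
            String path;
            while ((path = claimNext(db)) != null) {
                // ... parse the JSON file at `path` and send it to Solr here ...
                markProcessed(db, path);
            }
        }
    }
}

Each worker just loops claim / index / mark-processed, and a row stuck in PROCESSING is easy to spot and re-queue later.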
--
Rahul Singh
rahul.si...@anant.us

Anant Corporation

On May 24, 2018, 12:54 PM -0500, Adhyan Arizki <a.ari...@gmail.com>, wrote:
> You will still need to devise a way to partition the data source even if you
> are scheduling multiple jobs; otherwise, you might end up digesting the same
> data again and again.
>
> On Fri, May 25, 2018 at 12:46 AM, Raymond Xie <xie3208...@gmail.com> wrote:
> > Thank you all for the suggestions. I'm now leaning towards not using
> > traditional parallel indexing. My data are JSON files with metadata
> > extracted from raw data received and archived into our data server
> > cluster. Those data come in various flows and reside in their respective
> > folders; splitting them might introduce unnecessary extra work and could
> > end up causing trouble. So instead of that, maybe it would be easier to
> > simply schedule multiple indexing jobs separately?
> >
> > Thanks.
> >
> > Raymond
> >
> > Rahul Singh <rahul.xavier.si...@gmail.com> 于 2018年5月24日周四 上午11:23写道:
> > > Resending to the list to help more people.
> > >
> > > This is an architectural pattern for solving the same issue that arises
> > > over and over again. The queue can be anything: a table in a database,
> > > even a Solr collection.
> > >
> > > And yes, I have implemented it. I did it in C# using a SQL Server
> > > table-based queue (http://github.com/appleseed/search-stack) and then
> > > made the indexer able to write to Lucene, Elasticsearch, or Solr
> > > depending on config. I'm not actively maintaining it right now, but I
> > > will consider porting it to a Kafka + Spark + Kafka Connect based
> > > system when I find time.
> > >
> > > With Kafka, however, you have a lot of potential with Kafka Connect.
> > > The example below uses Cassandra, but the premise is the same: Kafka
> > > Connect has libraries of connectors for different sources / sinks. It
> > > may not work for files, but for pure raw data Kafka Connect is good.
> > >
> > > Here's a project that may guide you best:
> > >
> > > http://saumitra.me/blog/tweet-search-and-analysis-with-kafka-solr-cassandra/
> > >
> > > I don't know where this guy's code went, but the content is there with
> > > code samples.
> > >
> > > --
> > >
> > > On May 23, 2018, 8:37 PM -0500, Raymond Xie <xie3208...@gmail.com>, wrote:
> > >
> > > Thank you Rahul, although that's very high level.
> > >
> > > With no offense, do you have a successful implementation, or is it just
> > > an unproven idea? I have never used RabbitMQ or Kafka before, but I
> > > would be very interested in learning more about the Kafka idea, as
> > > Kafka is available in my environment.
> > >
> > > Thank you again, and I look forward to hearing more from you or anyone
> > > in this Solr community.
> > >
> > > ------------------------------------------------
> > > Sincerely yours,
> > >
> > > Raymond
> > >
> > > On Wed, May 23, 2018 at 8:15 AM, Rahul Singh
> > > <rahul.xavier.si...@gmail.com> wrote:
> > >
> > > > Enumerate the file locations (map), put them in a queue like RabbitMQ
> > > > or Kafka (persist the map), and have a bunch of threads, workers,
> > > > containers, whatever, pop items off the queue and process each one
> > > > (reduce).
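To make that concrete, here is a rough, untested sketch of one such worker, with Kafka as the queue and SolrJ as the writer; the topic name, collection name, field names, and URLs are placeholders, not code from the project above:

import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class IndexWorker {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "solr-indexers");  // all workers share one consumer group
        props.put("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("enable.auto.commit", "false");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
             SolrClient solr = new HttpSolrClient.Builder(
                     "http://localhost:8983/solr/files").build()) {

            // Each message value is one file path produced by the enumeration (map) step.
            consumer.subscribe(Collections.singletonList("files-to-index"));

            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                if (records.isEmpty()) continue;

                for (ConsumerRecord<String, String> record : records) {
                    String path = record.value();
                    SolrInputDocument doc = new SolrInputDocument();
                    doc.addField("id", path);
                    doc.addField("content_txt", new String(
                            Files.readAllBytes(Paths.get(path)), StandardCharsets.UTF_8));
                    solr.add(doc);
                }
                solr.commit();
                consumer.commitSync();  // acknowledge only after Solr has the batch
            }
        }
    }
}

You can run as many copies of this as you like; because they share a consumer group, Kafka splits the enumerated file paths between them, so each path is handed to only one worker at a time.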
> > > > --
> > > > Rahul Singh
> > > > rahul.si...@anant.us
> > > >
> > > > Anant Corporation
> > > >
> > > > On May 20, 2018, 7:24 AM -0400, Raymond Xie <xie3208...@gmail.com>, wrote:
> > > >
> > > > I know how to index a file system, such as a single file or folder,
> > > > but how do I do that in parallel? The data I need to index is of huge
> > > > volume and can't be put on HDFS.
> > > >
> > > > Thank you
> > > >
> > > > ------------------------------------------------
> > > > Sincerely yours,
> > > >
> > > > Raymond
>
> --
> Best regards,
> Adhyan Arizki