If your data consists only of an id (the full filename) and a filename field (indexed, tokenized), 40M of those will fit comfortably into a single shard, given enough RAM to operate.
I know SolrJ is tossed out there a lot as a/the way to index - but if you've got a directory tree of files and want to index _just_ the file names, then a shell script that generates a CSV could be easy and clean. It's trivial to `bin/post -c <your collection> data.csv`

Erik Hatcher, Senior Solutions Architect
http://www.lucidworks.com

> On Aug 4, 2015, at 5:51 AM, Mugeesh Husain <muge...@gmail.com> wrote:
>
> Thanks @Alexandre, Erickson, and Hatcher.
>
> I will generate an MD5 id from the filename using Java. I can do that
> nicely with SolrJ because I am a Java developer. Beyond that, the
> question is that the data is large; I think it will break into
> multiple shards (cores). With multi-core indexing, how can I check for
> duplicate ids while reindexing the whole set (using SolrJ), and how
> can I tell how much data one core holds versus another?
>
> I have decided to do it with SolrJ because I don't have a good
> understanding of DIH for the kind of operation my requirement needs.
> I googled but was unable to find a DIH example that I could apply to
> my problem.
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Can-Apache-Solr-Handle-TeraByte-Large-Data-tp3656484p4220673.html
> Sent from the Solr - User mailing list archive at Nabble.com.
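A minimal sketch of the shell-script + `bin/post` approach, folding in the MD5-of-filename id idea from the thread. The demo directory (created under mktemp) and the collection name `files` are assumptions for illustration; point `find` at your real tree and use your own collection name:

```shell
#!/bin/sh
# Demo tree stands in for the real directory of files (hypothetical).
ROOT=$(mktemp -d)
mkdir -p "$ROOT/docs"
touch "$ROOT/docs/report.pdf" "$ROOT/docs/notes.txt"

# Build a CSV of (id, filename) rows: the id is an MD5 of the full
# path (stable across reindexing, so duplicates overwrite in place),
# and the bare file name goes in the tokenized filename field.
echo "id,filename" > data.csv
find "$ROOT" -type f | while read -r f; do
  id=$(printf '%s' "$f" | md5sum | cut -d' ' -f1)
  printf '%s,%s\n' "$id" "$(basename "$f")" >> data.csv
done

# Then index it in one shot (bin/post ships with Solr):
# bin/post -c files data.csv
```

Because the MD5 id is derived deterministically from the full path, re-running the script and re-posting simply overwrites the same Solr documents, which sidesteps the duplicate-id worry raised in the quoted message.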