If your data consists only of an id (the full filename) and a filename field 
(indexed, tokenized), 40M of those documents will fit comfortably into a 
single shard, provided there is enough RAM to operate.

I know SolrJ is tossed out there a lot as a/the way to index, but if you've 
got a directory tree of files and want to index _just_ the file names, then a 
shell script that generates a CSV could be easy and clean.  It's then trivial 
to run `bin/post -c <your collection> data.csv`.
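For illustration, a minimal sketch of that approach (the directory path and 
collection name are placeholders, and filenames containing quotes, commas, or 
newlines would need real CSV escaping):

    #!/bin/sh
    # Walk the directory tree and write a two-column CSV:
    # id = the full path, filename = the file's base name.
    echo "id,filename" > data.csv
    find /path/to/files -type f | while read -r f; do
      echo "\"$f\",\"$(basename "$f")\"" >> data.csv
    done

    # Post the CSV into an existing collection (called "files" here).
    bin/post -c files data.csv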

—
Erik Hatcher, Senior Solutions Architect
http://www.lucidworks.com

> On Aug 4, 2015, at 5:51 AM, Mugeesh Husain <muge...@gmail.com> wrote:
> 
> Thanks @Alexandre, Erickson, and Hatcher.
> 
> I will generate an MD5 ID from the filename using Java.
> I can do that nicely with SolrJ, because I am a Java developer. Apart from
> this, the question raised is that the data is too large; I think it will
> break into multiple shards (cores).
> With multi-core indexing, how can I check for duplicate IDs while
> reindexing the whole set (using SolrJ), and
> how will I know how much data one core contains compared to another, etc.?
> 
> I have decided to do it with SolrJ, because I don't have a good
> understanding of DIH for the kind of operation my requirement needs. I
> googled but was unable to find a DIH example of this type that I could
> apply to my problem.
