Re: Recursively scan documents for indexing in a folder in SolrJ

Zheng Lin Edwin Yeo Fri, 16 Oct 2015 16:55:56 -0700

Thanks for your advice. I also found this method, which so far has been able
to traverse all the documents in the folder so that they can be indexed in Solr.


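// Requires: import java.io.File;
public static void showFiles(File[] files) {
    if (files == null) {
        return; // listFiles() returns null if the directory cannot be read.
    }
    for (File file : files) {
        if (file.isDirectory()) {
            System.out.println("Directory: " + file.getName());
            showFiles(file.listFiles()); // Recurse into the subdirectory.
        } else {
            System.out.println("File: " + file.getName());
        }
    }
}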

The problem with this is that it indexes all the files regardless of format,
instead of just the formats that post.jar handles. So I guess I still have
to "steal" some code from there to detect the file format?
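
One possible way to do this without SimplePostTool is to filter by file
extension during the traversal. The sketch below is only an illustration:
the class name, the method names and the two extension lists are my own
assumptions and are not copied from SimplePostTool. The endpoint split
follows Jan's note (xml, csv and json to /update, office docs to
/update/extract). It only collects files and picks an endpoint; the actual
SolrJ indexing call is left out.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Optional;
import java.util.Set;

class IndexableFiles {
    // Illustrative extension lists (assumptions, not taken from SimplePostTool).
    static final Set<String> STRUCTURED =
            new HashSet<>(Arrays.asList("xml", "csv", "json"));
    static final Set<String> RICH =
            new HashSet<>(Arrays.asList("pdf", "doc", "docx", "ppt", "pptx", "xls", "xlsx"));

    /** Returns the lower-cased extension, or "" if the name has none. */
    static String extensionOf(String fileName) {
        int dot = fileName.lastIndexOf('.');
        if (dot < 0 || dot == fileName.length() - 1) {
            return "";
        }
        return fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
    }

    /** Picks an endpoint for a file name, or empty if the format is not listed. */
    static Optional<String> endpointFor(String fileName) {
        String ext = extensionOf(fileName);
        if (STRUCTURED.contains(ext)) {
            return Optional.of("/update");
        }
        if (RICH.contains(ext)) {
            return Optional.of("/update/extract");
        }
        return Optional.empty();
    }

    /** Recursively collects only the files whose extension is listed above. */
    static void collect(File[] files, List<File> out) {
        if (files == null) {
            return; // listFiles() returns null if the directory cannot be read.
        }
        for (File file : files) {
            if (file.isDirectory()) {
                collect(file.listFiles(), out);
            } else if (endpointFor(file.getName()).isPresent()) {
                out.add(file);
            }
        }
    }

    public static void main(String[] args) {
        List<File> found = new ArrayList<>();
        File root = new File(args.length > 0 ? args[0] : ".");
        collect(new File[] { root }, found);
        for (File f : found) {
            System.out.println(endpointFor(f.getName()).get() + " <- " + f.getPath());
        }
    }
}
```

Matching on the extension is cheap but trusts the file name; files with a
wrong or missing extension would be skipped.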

As for files that contain non-English characters (e.g. Chinese characters),
it is currently not able to read them, and they all come out as a series
of "???". Any idea how to solve this problem?
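
One common cause of "???" output, though I have not confirmed it is the
cause here, is decoding the file bytes with the platform default charset
instead of the charset the file was written in. The sketch below assumes
the files are UTF-8 plain text; the class and method names are
hypothetical. It decodes with an explicit charset rather than relying on
the default.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class Utf8Text {
    /** Decodes bytes as UTF-8, independent of the platform default charset. */
    static String decodeUtf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    /** Reads a whole file and decodes it as UTF-8 (assumes the file is UTF-8). */
    static String readUtf8(Path path) {
        try {
            return decodeUtf8(Files.readAllBytes(path));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Two Chinese characters, written as escapes so the source stays ASCII.
        byte[] utf8 = "\u4e2d\u6587".getBytes(StandardCharsets.UTF_8);

        // Decoding with the wrong charset loses the characters.
        String wrong = new String(utf8, StandardCharsets.US_ASCII);
        String right = decodeUtf8(utf8);

        System.out.println("decoded correctly: " + right.equals("\u4e2d\u6587"));
        System.out.println("wrong charset matches: " + wrong.equals(right));
    }
}
```

If the text is decoded correctly but still prints as "???", the console
encoding may be the culprit rather than the file reading.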

Thank you.

Regards,
Edwin


On 16 October 2015 at 21:16, Duck Geraint (ext) GBJH <
geraint.d...@syngenta.com> wrote:

> Also, check this link for SolrJ example code (including the recursion):
> https://lucidworks.com/blog/2012/02/14/indexing-with-solrj/
>
> Geraint
>
>
> Geraint Duck
> Data Scientist
> Toxicology and Health Sciences
> Syngenta UK
> Email: geraint.d...@syngenta.com
>
> -----Original Message-----
> From: Jan Høydahl [mailto:jan....@cominvent.com]
> Sent: 16 October 2015 12:14
> To: solr-user@lucene.apache.org
> Subject: Re: Recursively scan documents for indexing in a folder in SolrJ
>
> SolrJ does not have any file crawler built in.
> But you are free to steal code from SimplePostTool.java related to
> directory traversal, and then index each document found using SolrJ.
>
> Note that SimplePostTool.java tries to be smart about which endpoint to post
> files to: xml, csv and json content will be posted to /update, while office
> docs go to /update/extract.
>
> --
> Jan Høydahl, search solution architect
> Cominvent AS - www.cominvent.com
>
> > 16. okt. 2015 kl. 05.22 skrev Zheng Lin Edwin Yeo <edwinye...@gmail.com
> >:
> >
> > Hi,
> >
> > I understand that in SimplePostTool (post.jar), there is this command
> > to automatically detect content types in a folder, and recursively
> > scan it for documents for indexing into a collection:
> > bin/post -c gettingstarted afolder/
> >
> > This has been useful for me to do mass indexing of all the files that
> > are in the folder. Now I'm moving to production and plan to use
> > SolrJ to do the indexing, as it can do more things like robustness
> > checks and retries for indexing requests that fail.
> >
> > However, I can't seem to find a way to do the same in SolrJ. Is it
> > possible for this to be done in SolrJ? I'm using Solr 5.3.0.
> >
> > Thank you.
> >
> > Regards,
> > Edwin
>
>
