Re: Recursively scan documents for indexing in a folder in SolrJ
Yes, I've managed to "steal" some code from post.jar so that only rich-text document formats are sent to /update/extract.

I've also changed the Eclipse setting at Window -> Preferences -> General -> Workspace: under "Text file encoding", select "Other" and choose UTF-8. Eclipse is now able to read the Chinese characters successfully.

Thank you for your help.

Regards,
Edwin

On 19 October 2015 at 16:33, Duck Geraint (ext) GBJH <geraint.d...@syngenta.com> wrote:

> If you've not worked it out yourself yet, try something like:
>
> http://docs.oracle.com/javase/7/docs/api/java/io/File.html#listFiles(java.io.FilenameFilter)
>
> http://stackoverflow.com/questions/5751335/using-file-listfiles-with-filenameextensionfilter
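On the "???" symptom discussed in this thread: one common way to end up with question marks is for text to pass through a character set that cannot represent it. The following is only a minimal sketch of that effect, not the code used in the thread; the class and method names are made up for illustration, and whether this was the exact cause in Eclipse is an assumption.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; not part of Solr, SolrJ, or post.jar.
final class EncodingSketch {

    // Encodes the text with the given charset, then decodes it with the same charset.
    static String roundTrip(String text, Charset charset) {
        return new String(text.getBytes(charset), charset);
    }

    public static void main(String[] args) {
        // Two Chinese characters, written as escapes so the source file encoding does not matter.
        String chinese = "\u4e2d\u6587";

        // US-ASCII cannot represent these characters, so each one is replaced by '?'.
        System.out.println(roundTrip(chinese, StandardCharsets.US_ASCII));

        // UTF-8 can represent them, so the round trip returns the original text.
        System.out.println(roundTrip(chinese, StandardCharsets.UTF_8).equals(chinese));
    }
}
```

If the characters survive a UTF-8 round trip but not the one your tool actually performs, the encoding configured for that tool (here, the Eclipse workspace) is a reasonable place to look.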
RE: Recursively scan documents for indexing in a folder in SolrJ
On 17 October 2015 at 00:55, Zheng Lin Edwin Yeo <edwinye...@gmail.com> wrote:

> The problem with this is that it indexes all the files regardless of
> format, instead of just the formats handled by post.jar. So I guess I
> still have to "steal" some code from there to detect the file format?

If you've not worked it out yourself yet, try something like:

http://docs.oracle.com/javase/7/docs/api/java/io/File.html#listFiles(java.io.FilenameFilter)

http://stackoverflow.com/questions/5751335/using-file-listfiles-with-filenameextensionfilter

Geraint

Geraint Duck
Data Scientist
Toxicology and Health Sciences
Syngenta UK
Email: geraint.d...@syngenta.com

Syngenta Limited, Registered in England No 2710846; Registered Office: Syngenta Limited, European Regional Centre, Priestley Road, Surrey Research Park, Guildford, Surrey, GU2 7YH, United Kingdom

This message may contain confidential information. If you are not the designated recipient, please notify the sender immediately, and delete the original and any copies. Any use of the message by you is prohibited.
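The first link above points at File.listFiles(FilenameFilter). As a rough sketch of how that could be used to keep only certain extensions: the class name and the extension list below are assumptions made for illustration, and they are not the list of types that post.jar recognises.

```java
import java.io.File;
import java.io.FilenameFilter;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical filter; the accepted extensions are chosen by the caller.
final class ExtensionFilterSketch implements FilenameFilter {

    private final Set<String> allowed;

    ExtensionFilterSketch(String... extensions) {
        this.allowed = new HashSet<>(Arrays.asList(extensions));
    }

    // Returns the lower-cased text after the last dot, or "" if the name has no dot.
    static String extensionOf(String name) {
        int dot = name.lastIndexOf('.');
        return dot < 0 ? "" : name.substring(dot + 1).toLowerCase(Locale.ROOT);
    }

    @Override
    public boolean accept(File dir, String name) {
        return allowed.contains(extensionOf(name));
    }

    public static void main(String[] args) {
        File folder = new File(args.length > 0 ? args[0] : ".");
        // listFiles returns null if the path is not a readable directory.
        File[] matches = folder.listFiles(new ExtensionFilterSketch("pdf", "doc", "docx"));
        if (matches == null) {
            System.err.println("Not a readable directory: " + folder);
            return;
        }
        for (File match : matches) {
            System.out.println(match.getName());
        }
    }
}
```

Note that this only filters one directory level and matches on the name alone, so a directory called "old.pdf" would also be accepted; a real crawler would combine it with a directory check.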
Re: Recursively scan documents for indexing in a folder in SolrJ
Thanks for your advice. I also found this method, which so far has been able to traverse all the documents in the folder and index them in Solr.

    // Requires: import java.io.File;
    public static void showFiles(File[] files) {
        if (files == null) {
            return; // listFiles() returns null for an unreadable directory
        }
        for (File file : files) {
            if (file.isDirectory()) {
                System.out.println("Directory: " + file.getName());
                showFiles(file.listFiles()); // Recurse into the subdirectory.
            } else {
                System.out.println("File: " + file.getName());
            }
        }
    }

The problem with this is that it indexes all the files regardless of format, instead of just the formats handled by post.jar. So I guess I still have to "steal" some code from there to detect the file format?

As for files that contain non-English characters (e.g. Chinese characters), it is currently unable to read them; they all come out as a series of "???". Any idea how to solve this problem?

Thank you.

Regards,
Edwin

On 16 October 2015 at 21:16, Duck Geraint (ext) GBJH <geraint.d...@syngenta.com> wrote:

> Also, check this link for SolrJ example code (including the recursion):
> https://lucidworks.com/blog/2012/02/14/indexing-with-solrj/
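One way the traversal in this message could be narrowed to particular formats is to check each file name while recursing. This is only a sketch under assumptions: the class name, the method names and the suffix list are hypothetical, and the actual indexing call is left out.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper; collects matching files instead of indexing them.
final class RecursiveScanSketch {

    // Walks dir recursively and adds every regular file whose name ends with one of the suffixes.
    static void collect(File dir, String[] suffixes, List<File> out) {
        File[] children = dir.listFiles();
        if (children == null) {
            return; // not a directory, or it could not be read
        }
        for (File child : children) {
            if (child.isDirectory()) {
                collect(child, suffixes, out);
            } else if (hasSuffix(child.getName(), suffixes)) {
                out.add(child);
            }
        }
    }

    // Case-insensitive suffix check; suffixes are expected in lower case, e.g. ".pdf".
    static boolean hasSuffix(String name, String[] suffixes) {
        String lower = name.toLowerCase(Locale.ROOT);
        for (String suffix : suffixes) {
            if (lower.endsWith(suffix)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<File> found = new ArrayList<>();
        collect(new File(args.length > 0 ? args[0] : "."), new String[] {".pdf", ".docx"}, found);
        for (File file : found) {
            System.out.println(file.getPath()); // each of these would be handed to the indexer
        }
    }
}
```

Collecting the files first, rather than indexing inside the recursion, keeps the traversal separate from the Solr calls, which makes retries easier to add later.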
RE: Recursively scan documents for indexing in a folder in SolrJ
Also, check this link for SolrJ example code (including the recursion):
https://lucidworks.com/blog/2012/02/14/indexing-with-solrj/

Geraint

Geraint Duck
Data Scientist
Toxicology and Health Sciences
Syngenta UK
Email: geraint.d...@syngenta.com

-----Original Message-----
From: Jan Høydahl [mailto:jan@cominvent.com]
Sent: 16 October 2015 12:14
To: solr-user@lucene.apache.org
Subject: Re: Recursively scan documents for indexing in a folder in SolrJ

SolrJ does not have any file crawler built in. But you are free to steal code from SimplePostTool.java related to directory traversal, and then index each document found using SolrJ.
Re: Recursively scan documents for indexing in a folder in SolrJ
SolrJ does not have any file crawler built in. But you are free to steal code from SimplePostTool.java related to directory traversal, and then index each document found using SolrJ.

Note that SimplePostTool.java tries to be smart about which endpoint to post files to: XML, CSV and JSON content is posted to /update, while office docs go to /update/extract.

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com

> On 16 Oct 2015, at 05:22, Zheng Lin Edwin Yeo wrote:
>
> However, I can't seem to find a way to do the same in SolrJ. Is it
> possible for this to be done in SolrJ? I'm using Solr 5.3.0
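The endpoint choice Jan describes can be sketched as a small lookup on the file extension. This is an illustration only, not code from SimplePostTool.java: the class and method names are hypothetical, and sending every other extension to /update/extract is a simplifying assumption (the real tool keeps its own list of supported types).

```java
import java.util.Locale;

// Hypothetical helper for illustration; not taken from SimplePostTool.java.
final class EndpointSketch {

    // Picks a Solr request path based on the file name's extension.
    static String endpointFor(String fileName) {
        int dot = fileName.lastIndexOf('.');
        String ext = dot < 0 ? "" : fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
        switch (ext) {
            case "xml":
            case "csv":
            case "json":
                return "/update";
            default:
                // Assumption: anything else is treated as a rich document.
                return "/update/extract";
        }
    }

    public static void main(String[] args) {
        for (String name : new String[] {"books.csv", "data.JSON", "report.docx"}) {
            System.out.println(name + " -> " + endpointFor(name));
        }
    }
}
```

In a SolrJ client, the returned path would then decide which kind of update request to build for each file found by the directory traversal.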
Recursively scan documents for indexing in a folder in SolrJ
Hi,

I understand that SimplePostTool (post.jar) has this command to automatically detect content types in a folder and recursively scan it for documents to index into a collection:

bin/post -c gettingstarted afolder/

This has been useful for mass indexing of all the files in the folder. Now that I'm moving to production, I plan to use SolrJ for the indexing, as it can do more things, like robustness checks and retries for indexing attempts that fail.

However, I can't seem to find a way to do the same in SolrJ. Is it possible for this to be done in SolrJ? I'm using Solr 5.3.0.

Thank you.

Regards,
Edwin