Re: Indexing directories and files in a File System. (Fetched: 2, Processed: 0)

2013-03-06 Thread Syao Work
So you are suggesting that I iterate over the file system and index the FS
tree entities (directory names, file names, file sizes, etc.) and then post
them to Solr?
I need to index the FS tree, not the file contents.

On Tue, Mar 5, 2013 at 5:54 PM, Erik Hatcher erik.hatc...@gmail.com wrote:

> Would Solr's post.jar work for you? It has a directory recurse option.
> The usage/help output is pasted below.
>
> Here's what should work for you: java -Dauto -Drecursive -jar post.jar
> /some/folder
>
> Erik







Re: Indexing directories and files in a File System. (Fetched: 2, Processed: 0)

2013-03-06 Thread Otis Gospodnetic
Hi Syao,

You should just write a simple (Java) app that traverses the dir tree, gets
info about each file, uses it to construct Solr doc objects
(SolrInputDocuments if you are working in Java with SolrJ) and sends them
to Solr for indexing.  Should be about 30 minutes of work or less.

Otis
--
Solr & ElasticSearch Support
http://sematext.com/
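
A minimal sketch of the traversal step suggested above, using only the JDK (Java 7 `Files.walkFileTree`). It collects one field map per directory and file. The field names `path`, `name`, `size` and `updated_at` are taken from the data-config in the original post; `is_dir` and the class name `FsTreeDocs` are hypothetical. Sending the maps to Solr is not shown: with SolrJ, each map would presumably be copied into a `SolrInputDocument` and added through a Solr server client, and the schema would need matching fields plus whatever unique key it requires.

```java
import java.io.IOException;
import java.nio.file.FileVisitResult;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.SimpleFileVisitor;
import java.nio.file.attribute.BasicFileAttributes;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class FsTreeDocs {

    /** Turns one file system entry into a flat map of metadata fields. */
    static Map<String, Object> toDoc(Path p, BasicFileAttributes attrs) {
        Map<String, Object> doc = new LinkedHashMap<>();
        doc.put("path", p.toAbsolutePath().toString());
        // getFileName() is null for a file system root such as "/".
        doc.put("name", p.getFileName() == null ? "" : p.getFileName().toString());
        doc.put("size", attrs.isDirectory() ? 0L : attrs.size());
        doc.put("updated_at", attrs.lastModifiedTime().toString());
        doc.put("is_dir", attrs.isDirectory()); // hypothetical extra field
        return doc;
    }

    /** Walks the tree under root and returns one map per directory and file. */
    public static List<Map<String, Object>> collect(Path root) throws IOException {
        final List<Map<String, Object>> docs = new ArrayList<>();
        Files.walkFileTree(root, new SimpleFileVisitor<Path>() {
            @Override
            public FileVisitResult preVisitDirectory(Path dir, BasicFileAttributes attrs) {
                docs.add(toDoc(dir, attrs));
                return FileVisitResult.CONTINUE;
            }

            @Override
            public FileVisitResult visitFile(Path file, BasicFileAttributes attrs) {
                docs.add(toDoc(file, attrs));
                return FileVisitResult.CONTINUE;
            }

            @Override
            public FileVisitResult visitFileFailed(Path file, IOException exc) {
                // Skip unreadable entries instead of aborting the whole walk.
                return FileVisitResult.CONTINUE;
            }
        });
        return docs;
    }

    public static void main(String[] args) throws IOException {
        Path root = Paths.get(args.length > 0 ? args[0] : ".");
        for (Map<String, Object> doc : collect(root)) {
            System.out.println(doc);
        }
    }
}
```

Running `java FsTreeDocs /srv/nfs/test` prints one map per entry, which makes it easy to check the metadata before anything is sent to Solr.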





On Wed, Mar 6, 2013 at 3:37 AM, Syao Work syao.w...@gmail.com wrote:

> So you are suggesting that I iterate over the file system and index the FS
> tree entities (directory names, file names, file sizes, etc.) and then post
> them to Solr?
> I need to index the FS tree, not the file contents.




Indexing directories and files in a File System. (Fetched: 2, Processed: 0)

2013-03-05 Thread Syao Work
Hello,

I am trying to index an FS folder tree.
I spent 2 days trying to find what the problem could be and got nothing :)
There are not many examples of indexing a file system.
In the logs I can't find any exceptions explaining why it does not process
the info.
The data import configuration and debug response are attached.


Using:
1. solr web admin tool,
2. Java version 1.7.0_09-icedtea
   OpenJDK Runtime Environment (fedora-2.3.7.0.fc17-x86_64)
   OpenJDK 64-Bit Server VM (build 23.7-b01, mixed mode)

Thank you for your time,
Ro

P.S. Excuse my bad English, I am not a native English speaker.

data-config.xml
<dataConfig>
  <dataSource type="FileDataSource" />
  <document>

    <entity name="file_entity"
            processor="FileListEntityProcessor"
            fileName=".*"
            baseDir="/srv/nfs/test"
            recursive="true"
            rootEntity="false"
            onError="skip">

      <field column="fileAbsolutePath" name="path" />
      <field column="fileSize" name="size" />
      <field column="fileLastModified" name="updated_at" />
      <field column="file" name="name" />
      <field column="baseDir" name="title" />

    </entity>
  </document>
</dataConfig>


import-debug-response.json
Description: application/json


Re: Indexing directories and files in a File System. (Fetched: 2, Processed: 0)

2013-03-05 Thread Gora Mohanty
On 5 March 2013 15:08, Syao Work syao.w...@gmail.com wrote:
> Hello,
>
> I am trying to index an FS folder tree.
> I spent 2 days trying to find what the problem could be and got nothing :)
> There are not many examples of indexing a file system.
> In the logs I can't find any exceptions explaining why it does not process
> the info.
> The data import configuration and debug response are attached.
[...]

Please look more closely at the sample data configuration file at
http://wiki.apache.org/solr/DataImportHandler#FileListEntityProcessor
You need to use something like XPathEntityProcessor to define
entities for indexing. Other entity processors, such as
PlainTextEntityProcessor,
can instead be used if you are not using XML files. Also, make sure
that the field definitions in your schema.xml match the field names
here.

Regards,
Gora


Re: Indexing directories and files in a File System. (Fetched: 2, Processed: 0)

2013-03-05 Thread Syao Work
And what if I need to index the file name, path, size and/or MIME type?

On Tue, Mar 5, 2013 at 2:45 PM, Gora Mohanty g...@mimirtech.com wrote:

> On 5 March 2013 15:08, Syao Work syao.w...@gmail.com wrote:
> > Hello,
> >
> > I am trying to index an FS folder tree.
> > I spent 2 days trying to find what the problem could be and got nothing :)
> > There are not many examples of indexing a file system.
> > In the logs I can't find any exceptions explaining why it does not process
> > the info.
> > The data import configuration and debug response are attached.
> [...]
>
> Please look more closely at the sample data configuration file at
> http://wiki.apache.org/solr/DataImportHandler#FileListEntityProcessor
> You need to use something like XPathEntityProcessor to define
> entities for indexing. Other entity processors, such as
> PlainTextEntityProcessor,
> can instead be used if you are not using XML files. Also, make sure
> that the field definitions in your schema.xml match the field names
> here.
>
> Regards,
> Gora



Re: Indexing directories and files in a File System. (Fetched: 2, Processed: 0)

2013-03-05 Thread Gora Mohanty
On 5 March 2013 18:22, Syao Work syao.w...@gmail.com wrote:
> And what if I need to index the file name, path, size and/or MIME type?
[...]

You would need to create separate entities for each field that
you need to index. The referenced Wiki page on DIH has
other examples of configurations with multiple entities.

Regards,
Gora


Re: Indexing directories and files in a File System. (Fetched: 2, Processed: 0)

2013-03-05 Thread Syao Work
Can you send an example?

On Tue, Mar 5, 2013 at 5:11 PM, Gora Mohanty g...@mimirtech.com wrote:

> On 5 March 2013 18:22, Syao Work syao.w...@gmail.com wrote:
> > And what if I need to index the file name, path, size and/or MIME type?
> [...]
>
> You would need to create separate entities for each field that
> you need to index. The referenced Wiki page on DIH has
> other examples of configurations with multiple entities.
>
> Regards,
> Gora



Re: Indexing directories and files in a File System. (Fetched: 2, Processed: 0)

2013-03-05 Thread Erik Hatcher
Would Solr's post.jar work for you?   It has a directory recurse option.  The 
usage/help output is pasted below.

Here's what should work for you: java -Dauto -Drecursive -jar post.jar 
/some/folder

Erik



exampledocs> java -jar post.jar --help
SimplePostTool version 1.5
Usage: java [SystemProperties] -jar post.jar [-h|-] [<file|folder|url|arg> [<file|folder|url|arg>...]]

Supported System Properties and their defaults:
  -Ddata=files|web|args|stdin (default=files)
  -Dtype=<content-type> (default=application/xml)
  -Durl=<solr-update-url> (default=http://localhost:8983/solr/update)
  -Dauto=yes|no (default=no)
  -Drecursive=yes|no|<depth> (default=0)
  -Ddelay=<seconds> (default=0 for files, 10 for web)
  -Dfiletypes=<type>[,<type>,...] (default=xml,json,csv,pdf,doc,docx,ppt,pptx,xls,xlsx,odt,odp,ods,ott,otp,ots,rtf,htm,html,txt,log)
  -Dparams="<key>=<value>[&<key>=<value>...]" (values must be URL-encoded)
  -Dcommit=yes|no (default=yes)
  -Doptimize=yes|no (default=no)
  -Dout=yes|no (default=no)

This is a simple command line tool for POSTing raw data to a Solr
port.  Data can be read from files specified as commandline args,
URLs specified as args, as raw commandline arg strings or via STDIN.
Examples:
  java -jar post.jar *.xml
  java -Ddata=args  -jar post.jar '<delete><id>42</id></delete>'
  java -Ddata=stdin -jar post.jar < hd.xml
  java -Ddata=web -jar post.jar http://example.com/
  java -Dtype=text/csv -jar post.jar *.csv
  java -Dtype=application/json -jar post.jar *.json
  java -Durl=http://localhost:8983/solr/update/extract -Dparams=literal.id=a -Dtype=application/pdf -jar post.jar a.pdf
  java -Dauto -jar post.jar *
  java -Dauto -Drecursive -jar post.jar afolder
  java -Dauto -Dfiletypes=ppt,html -jar post.jar afolder
The options controlled by System Properties include the Solr
URL to POST to, the Content-Type of the data, whether a commit
or optimize should be executed, and whether the response should
be written to STDOUT. If auto=yes the tool will try to set type
and url automatically from file name. When posting rich documents
the file name will be propagated as resource.name and also used
as literal.id. You may override these or any other request parameter
through the -Dparams property. To do a commit only, use - as argument.
The web mode is a simple crawler following links within domain, default
delay=10s.
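
For readers who want to see what "POSTing raw data to a Solr port" amounts to, here is a rough sketch in plain Java of posting one metadata-only document in Solr's XML update format. It is not how post.jar is implemented. The class name `SolrXmlPost`, the sample values, and the assumption that the schema accepts fields named `path`, `name` and `size` (the names used in the data-config from the original post) are all illustrative; a real schema will usually also require a unique key field.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

public class SolrXmlPost {

    /** Escapes the characters that are special in XML text and attribute values. */
    public static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;")
                .replace(">", "&gt;").replace("\"", "&quot;");
    }

    /** Builds an XML update message that adds a single document. */
    public static String buildAddXml(Map<String, String> fields) {
        StringBuilder sb = new StringBuilder("<add><doc>");
        for (Map.Entry<String, String> e : fields.entrySet()) {
            sb.append("<field name=\"").append(escape(e.getKey())).append("\">")
              .append(escape(e.getValue())).append("</field>");
        }
        return sb.append("</doc></add>").toString();
    }

    /** POSTs the XML body to the update URL and returns the HTTP status code. */
    public static int post(String updateUrl, String xml) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) new URL(updateUrl).openConnection();
        conn.setRequestMethod("POST");
        conn.setDoOutput(true);
        // application/xml is also the default content type listed in the help text.
        conn.setRequestProperty("Content-Type", "application/xml; charset=UTF-8");
        try (OutputStream out = conn.getOutputStream()) {
            out.write(xml.getBytes(StandardCharsets.UTF_8));
        }
        return conn.getResponseCode();
    }

    public static void main(String[] args) throws IOException {
        Map<String, String> doc = new LinkedHashMap<>();
        doc.put("path", "/srv/nfs/test/report.txt"); // hypothetical sample values
        doc.put("name", "report.txt");
        doc.put("size", "1024");
        String xml = buildAddXml(doc);
        System.out.println(xml);
        // Only contact a server when an update URL is passed on the command line.
        if (args.length > 0) {
            System.out.println("HTTP " + post(args[0], xml));
        }
    }
}
```

Passing something like `http://localhost:8983/solr/update?commit=true` as the argument would send the document and ask Solr to commit it. Without an argument the program only prints the XML, which is a safe way to inspect the payload first.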

