I’d urge you to consider moving away from ExtractingRequestHandler (i.e. just sending the data to Solr) and running the Tika parsing externally instead. ExtractingRequestHandler is a great way to get started, but I’ve often found that I need much finer control over the process.
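
One thing that kind of control makes possible is deciding for yourself what each document's id is, so that every id can be traced back to a file. The snippet here is only a minimal sketch of that one step, under stated assumptions: it is a Python stand-in rather than SolrJ code, the function name `build_docs`, the `source_file` field and the `!` separator are all hypothetical, and the Tika parsing and the actual update request to Solr are left out. Plain files get their full path as the id, and entries inside a `.zip` get the zip path plus the entry name.

```python
import os
import zipfile


def build_docs(root):
    """Yield one dict per indexable item found under root.

    Assumptions (hypothetical, for illustration only):
      - a plain file uses its full path as the id
      - an entry inside a .zip uses "<zip path>!<entry name>" as the id
      - "source_file" always names the file on disk the item came from
    """
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            # Match on the extension only: formats such as docx and xlsx are
            # zip containers too, and should be treated as single documents.
            if name.lower().endswith(".zip") and zipfile.is_zipfile(path):
                with zipfile.ZipFile(path) as archive:
                    for info in archive.infolist():
                        if not info.is_dir():
                            yield {
                                "id": f"{path}!{info.filename}",
                                "source_file": path,
                            }
            else:
                yield {"id": path, "source_file": path}
            # Text extraction and sending the document to Solr would happen
            # here in a real indexing program; both are omitted in this sketch.
```

Calling `list(build_docs("/DATA_FOLDER"))` would give one entry per file or zip member, each with an id that points back at its origin.
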
Here’s the full treatment: https://lucidworks.com/post/indexing-with-solrj/

Best,
Erick

> On Dec 18, 2019, at 11:15 AM, Jörn Franke <jornfra...@gmail.com> wrote:
>
> This depends on your ingestion process. Usually the unique ids that are not
> filenames either do not come from a file, or your ingestion process does not
> provide the file name. In this case the collection seems to be configured to
> generate a unique identifier.
>
> Maybe you can describe in more detail how you process the files.
>
> A wild speculation could be that they come from inside a zip file. In this
> case metadata from Tika could be used as an id, where you concatenate
> zip file + file inside zip file.
> However, we don’t know what you have defined or what your ingestion process
> looks like, so this is pure speculation on my side.
>
>> On 18.12.2019 at 16:40, Nan Yu <n...@z-geoinfo.com> wrote:
>>
>> Sorry, I just found out that the mailing list takes plain text and my
>> previous post looks really messy, so I reformatted it.
>>
>> Hi,
>> I did a simple indexing of a directory that contains a lot of pdf, text,
>> doc, zip etc. files. The content of the files has no structure; I would
>> like to index them and later search for "key words" within the files.
>>
>> After creating the core, I indexed the files in the directory using the
>> following command:
>>
>> bin/post -p 8983 -m 10g -c myCore /DATA_FOLDER > solr_indexing.log
>>
>> The log file shows something like below (the first and last few lines in
>> the log file):
>>
>> java -classpath /solr/solr-8.3.0/dist/solr-core-8.3.0.jar -Dauto=yes
>> -Dport=8983 -Dm=15g -Dc=myCore -Ddata=files -Drecursive=yes
>> org.apache.solr.util.SimplePostTool /DATA_FOLDER
>> SimplePostTool version 5.0.0
>> Posting files to [base] url http://localhost:8983/solr/myCore/update...
>> Entering auto mode. File endings considered are
>> xml,json,jsonl,csv,pdf,doc,docx,ppt,pptx,xls,xlsx,odt,odp,ods,ott,otp,ots,rtf,htm,html,txt,log
>> ...
>> ...
>> ...
>> POSTing file Report.pdf (application/pdf) to [base]/extract
>> 47256 files indexed.
>> COMMITting Solr index changes to http://localhost:8983/solr/myCore/update...
>> Time spent: 1:03:59.587
>>
>> But when using a browser to look at the result, the "overview"
>> (http://localhost:8983/solr/#/myCore/core-overview) shows:
>> Num Docs: 47648
>>
>> Most of the indexed files have a metadata id whose value is the full path
>> of the indexed file, such as /DATA_FOLDER/20180321/Report.pdf
>>
>> But for about 400 of them, the id looks like:
>> 232d7bd6-c586-4726-8d2b-bc9b1febcff4.
>>
>> So my questions are:
>> (1) Why are the two numbers different (in the log file vs. in the overview)?
>> (2) For those ids that are not the full path of a file, how do I know where
>> they come from (the original file)?
>>
>> Thanks for your help!
>> Nan
>>
>> PS: a few examples of query results for those strange ids:
>>
>> {
>>   "bolt-small-online":["Test strip-north"],
>>   "3696714.008":[3702848.584],
>>   "380614.564":[376900.143],
>>   "100.038":[111.074],
>>   "gpo-bolt":["teststrip"],
>>   "id":"232d7bd6-c586-4726-8d2b-bc9b1febcff4",
>>   "_version_":1652839231413813252
>> }
>>
>> {
>>   "Date":["8/24/2001"],
>>   "EXT31":[0],
>>   "EXT32":[0.12],
>>   "Aggregate":[0.12],
>>   "Pounds_Vap":[37],
>>   "Gallons_Vap":[5.8],
>>   "Gallons_Liq":[0],
>>   "Gallons_Tot":[5.8],
>>   "Avg_Rate":[1.8],
>>   "Gallons_Rec":[577],
>>   "Water":[577],
>>   "id":"840c05af-caf0-4407-8753-dcc6957abcc5",
>>   "Well_s_":["EXT31;EXT32"],
>>   "Time__hrs_":[3.25],
>>   "_version_":1652898731969740800
>> }
>>
>> {
>>   "2":[4],
>>   "SFS1":["PLM1"],
>>   "1.00":[1.0],
>>   "69":[79],
>>   "id":"e675a6f5-0a3e-41b1-b1fe-b3098d0be725",
>>   "_version_":1652825435791163395
>> }