Re: Can't index all docs in a local folder with DIH in Solr 5.0.0
Alex,

I've created a JIRA ticket: https://issues.apache.org/jira/browse/SOLR-7174

In response to your suggestions below:

1. No exceptions are reported, even with onError removed.
2. ProcessMonitor shows that only the very first epub file is being read (repeatedly).
3. I can reproduce this on Ubuntu (14.04) by following the same steps.
4. Ticket raised (https://issues.apache.org/jira/browse/SOLR-7174).

Additionally (and I've added this to the ticket), if I change the dataConfig to use FileDataSource and PlainTextEntityProcessor, and just list *.txt files, it works!

    <dataConfig>
      <dataSource type="FileDataSource" name="bin" />
      <document>
        <entity name="files" dataSource="null" rootEntity="false"
                processor="FileListEntityProcessor"
                baseDir="c:/Users/gt/Documents/HackerMonthly/epub"
                fileName=".*txt">
          <field column="fileAbsolutePath" name="id" />
          <field column="fileSize" name="size" />
          <field column="fileLastModified" name="lastModified" />
          <entity name="documentImport" processor="PlainTextEntityProcessor"
                  url="${files.fileAbsolutePath}" format="text"
                  dataSource="bin">
            <field column="plainText" name="content"/>
          </entity>
        </entity>
      </document>
    </dataConfig>

So it's something related to BinFileDataSource and TikaEntityProcessor.

Thanks,
Gary.

On 26/02/2015 14:24, Gary Taylor wrote:

Alex, That's great. Thanks for the pointers. I'll try and get more info on this and file a JIRA issue.

Kind regards,
Gary.

On 26/02/2015 14:16, Alexandre Rafalovitch wrote:

On 26 February 2015 at 08:32, Gary Taylor g...@inovem.com wrote:

Alex, Same results on recursive=true / recursive=false. I also tried importing plain text files instead of epub (still using TikaEntityProcessor though) and get exactly the same result - ie. all files fetched, but only one document indexed in Solr.

To me, this would indicate that the problem is with the inner DIH entity then. As a next set of steps, I would probably:

1) remove both onError statements and see if there is an exception that is being swallowed.
2) run the import under ProcessMonitor and see if the other files are actually being read: https://technet.microsoft.com/en-us/library/bb896645.aspx
3) assume a Windows bug and test this on Mac/Linux.
4) file a JIRA with a replication case. If there is a full replication setup, I'll test it on machines I have access to with full debugger step-through.

For example, I wonder if BinFileDataSource is somehow not cleaning up after the first file properly on Windows and fails to open the second one.

Regards,
Alex.
Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter: http://www.solr-start.com/

--
Gary Taylor | www.inovem.com | www.kahootz.com
INOVEM Ltd is registered in England and Wales No 4228932
Registered Office 1, Weston Court, Weston, Berkshire. RG20 8JE
kahootz.com is a trading name of INOVEM Ltd.
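One way to confirm how many documents an import actually left in the index is to ask the core for its document count. The following is only a sketch: the base URL and core name are the ones used in this thread, the helper names are made up for illustration, and the response parsing assumes Solr's standard JSON response shape with a `response.numFound` value. It builds the query URL and reads the count from a response body, without making the HTTP call itself.

```python
import json
from urllib.parse import urlencode


def build_count_url(base_url: str, core: str) -> str:
    """Build a select URL that matches all documents but returns no rows."""
    query = urlencode({"q": "*:*", "rows": 0, "wt": "json"})
    return f"{base_url}/{core}/select?{query}"


def num_found(response_body: str) -> int:
    """Read the total hit count from a Solr JSON select response."""
    return json.loads(response_body)["response"]["numFound"]


if __name__ == "__main__":
    # Hypothetical values taken from the thread's setup.
    print(build_count_url("http://localhost:8983/solr", "hn2"))
    # Hand-written sample body, not real output from the reported system.
    sample = '{"responseHeader": {"status": 0}, "response": {"numFound": 1, "start": 0, "docs": []}}'
    print(num_found(sample))
```

Comparing this count before and after a full import gives a check that does not depend on the DIH status panel.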
Re: Can't index all docs in a local folder with DIH in Solr 5.0.0
Alex,

Same results on recursive=true / recursive=false.

I also tried importing plain text files instead of epub (still using TikaEntityProcessor though) and get exactly the same result - ie. all files fetched, but only one document indexed in Solr.

With verbose output, I get a row for each file in the directory, but only the first one has a non-empty documentImport entity. All subsequent documentImport entities just have an empty document#2 entry. eg:

    verbose-output: [
      entity:files, [
        null, --- row #1-,
        fileSize, 2609004,
        fileLastModified, 2015-02-25T11:37:25.217Z,
        fileAbsolutePath, c:\\Users\\gt\\Documents\\epub\\issue018.epub,
        fileDir, c:\\Users\\gt\\Documents\\epub,
        file, issue018.epub,
        null, -,
        entity:documentImport, [
          document#1, [
            query, c:\\Users\\gt\\Documents\\epub\\issue018.epub,
            time-taken, 0:0:0.0,
            null, --- row #1-,
            text, ... parsed epub text - snip ...
            title, Issue 18 title,
            Author, Author text,
            null, -
          ],
          document#2, []
        ],
        null, --- row #2-,
        fileSize, 4428804,
        fileLastModified, 2015-02-25T11:37:36.399Z,
        fileAbsolutePath, c:\\Users\\gt\\Documents\\epub\\issue019.epub,
        fileDir, c:\\Users\\gt\\Documents\\epub,
        file, issue019.epub,
        null, -,
        entity:documentImport, [
          document#2, []
        ],
        null, --- row #3-,
        fileSize, 2580266,
        fileLastModified, 2015-02-25T11:37:41.188Z,
        fileAbsolutePath, c:\\Users\\gt\\Documents\\epub\\issue020.epub,
        fileDir, c:\\Users\\gt\\Documents\\epub,
        file, issue020.epub,
        null, -,
        entity:documentImport, [
          document#2, []
        ],
Re: Can't index all docs in a local folder with DIH in Solr 5.0.0
Alex,

That's great. Thanks for the pointers. I'll try and get more info on this and file a JIRA issue.

Kind regards,
Gary.

On 26/02/2015 14:16, Alexandre Rafalovitch wrote:

To me, this would indicate that the problem is with the inner DIH entity then. As a next set of steps, I would probably:

1) remove both onError statements and see if there is an exception that is being swallowed.
2) run the import under ProcessMonitor and see if the other files are actually being read: https://technet.microsoft.com/en-us/library/bb896645.aspx
3) assume a Windows bug and test this on Mac/Linux.
4) file a JIRA with a replication case. If there is a full replication setup, I'll test it on machines I have access to with full debugger step-through.

For example, I wonder if BinFileDataSource is somehow not cleaning up after the first file properly on Windows and fails to open the second one.
Re: Can't index all docs in a local folder with DIH in Solr 5.0.0
On 26 February 2015 at 08:32, Gary Taylor g...@inovem.com wrote:

Alex, Same results on recursive=true / recursive=false. I also tried importing plain text files instead of epub (still using TikaEntityProcessor though) and get exactly the same result - ie. all files fetched, but only one document indexed in Solr.

To me, this would indicate that the problem is with the inner DIH entity then. As a next set of steps, I would probably:

1) remove both onError statements and see if there is an exception that is being swallowed.
2) run the import under ProcessMonitor and see if the other files are actually being read: https://technet.microsoft.com/en-us/library/bb896645.aspx
3) assume a Windows bug and test this on Mac/Linux.
4) file a JIRA with a replication case. If there is a full replication setup, I'll test it on machines I have access to with full debugger step-through.

For example, I wonder if BinFileDataSource is somehow not cleaning up after the first file properly on Windows and fails to open the second one.

Regards,
Alex.
Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter: http://www.solr-start.com/
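A cheap cross-check alongside the steps above is to list, outside Solr, the files that the outer entity ought to find, and compare that number with the "Fetched" count. This is a minimal sketch, not DIH's own code: the function name is made up, and it assumes the fileName attribute is treated as a regular expression searched for anywhere in the file name (hence `re.search`), which should be verified against the FileListEntityProcessor documentation.

```python
import os
import re


def list_matching_files(base_dir: str, file_name_regex: str, recursive: bool = True) -> list:
    """Return sorted paths under base_dir whose file name matches the regex."""
    pattern = re.compile(file_name_regex)
    matches = []
    for root, _dirs, files in os.walk(base_dir):
        for name in files:
            # Assumption: the pattern may match anywhere in the name.
            if pattern.search(name):
                matches.append(os.path.join(root, name))
        if not recursive:
            break  # os.walk yields the top-level directory first
    return sorted(matches)


if __name__ == "__main__":
    # baseDir and fileName values as given in the thread's data-import.xml.
    found = list_matching_files("c:/Users/gt/Documents/epub", ".*epub")
    print(len(found), "files match")
```

If this count agrees with "Fetched" while "Processed" stays at 1, the file listing is fine and the inner entity is the likelier suspect, which is what the reasoning above suggests.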
Can't index all docs in a local folder with DIH in Solr 5.0.0
I can't get the FileListEntityProcessor and TikaEntityProcessor to correctly add a Solr document for each epub file in my local directory.

I've just downloaded Solr 5.0.0, on a Windows 7 PC. I ran "solr start" and then "solr create -c hn2" to create a new core.

I want to index a load of epub files that I've got in a directory. So I created a data-import.xml (in solr\hn2\conf):

    <dataConfig>
      <dataSource type="BinFileDataSource" name="bin" />
      <document>
        <entity name="files" dataSource="null" rootEntity="false"
                processor="FileListEntityProcessor"
                baseDir="c:/Users/gt/Documents/epub" fileName=".*epub"
                onError="skip" recursive="true">
          <field column="fileAbsolutePath" name="id" />
          <field column="fileSize" name="size" />
          <field column="fileLastModified" name="lastModified" />
          <entity name="documentImport" processor="TikaEntityProcessor"
                  url="${files.fileAbsolutePath}" format="text"
                  dataSource="bin" onError="skip">
            <field column="file" name="fileName"/>
            <field column="Author" name="author" meta="true"/>
            <field column="title" name="title" meta="true"/>
            <field column="text" name="content"/>
          </entity>
        </entity>
      </document>
    </dataConfig>

In my solrconfig.xml, I added a requestHandler entry to reference my data-import.xml:

    <requestHandler name="/dataimport"
                    class="org.apache.solr.handler.dataimport.DataImportHandler">
      <lst name="defaults">
        <str name="config">data-import.xml</str>
      </lst>
    </requestHandler>

I renamed managed-schema to schema.xml, and ensured the following doc fields were set up:

    <field name="id" type="string" indexed="true" stored="true" required="true" multiValued="false" />
    <field name="fileName" type="string" indexed="true" stored="true" />
    <field name="author" type="string" indexed="true" stored="true" />
    <field name="title" type="string" indexed="true" stored="true" />
    <field name="size" type="long" indexed="true" stored="true" />
    <field name="lastModified" type="date" indexed="true" stored="true" />
    <field name="content" type="text_en" indexed="false" stored="true" multiValued="false"/>
    <field name="text" type="text_en" indexed="true" stored="false" multiValued="true"/>
    <copyField source="content" dest="text"/>

I copied all the jars from dist and contrib\* into server\solr\lib.

Stopping and restarting solr then creates a new managed-schema file and renames schema.xml to schema.xml.back

All good so far. Now I go to the web admin for dataimport (http://localhost:8983/solr/#/hn2/dataimport//dataimport) and try to execute a full import.

But the results show "Requests: 0, Fetched: 58, Skipped: 0, Processed: 1" - ie. it only adds one document (the very first one) even though it's iterated over 58!

No errors are reported in the logs.

I can search on the contents of that first epub document, so it's extracting OK in Tika, but there's a problem somewhere in my config that's causing only 1 document to be indexed in Solr.

Thanks for any assistance / pointers.

Regards,
Gary
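The mismatch in that status line can be checked mechanically when repeating the import many times while debugging. The sketch below simply parses a status string of the form quoted above; the function names are made up for illustration, and it treats "Fetched" as the number of files iterated, as this post does, which is an assumption about how DIH counts rows.

```python
import re


def parse_dih_counts(status: str) -> dict:
    """Extract 'Name: number' pairs from a DIH status line into a dict."""
    return {name: int(value) for name, value in re.findall(r"(\w+):\s*(\d+)", status)}


def missing_documents(status: str) -> int:
    """Return how many fetched rows did not end up as processed documents."""
    counts = parse_dih_counts(status)
    return counts["Fetched"] - counts["Processed"]


if __name__ == "__main__":
    # Status line as reported in this post.
    status = "Requests: 0, Fetched: 58, Skipped: 0, Processed:1"
    print(parse_dih_counts(status))
    print("missing:", missing_documents(status))
```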
Re: Can't index all docs in a local folder with DIH in Solr 5.0.0
Alex,

Thanks for the suggestions. It always indexes just 1 doc, regardless of which epub file it sees first. Debug / verbose don't show anything obvious to me. I can include the output here if you think it would help.

I tried using the SimplePostTool first (java -Dtype=application/epub+zip -Durl=http://localhost:8983/solr/hn1/update/extract -jar post.jar \Users\gt\Documents\epub\*.epub) to index the docs and check the Tika parsing, and that works OK, so I don't think it's the epubs.

I was trying to use DIH so that I could more easily specify the schema fields and store content in the index in preparation for trying out the search highlighting. I couldn't work out how to do that with post.jar.

Thanks,
Gary

On 25/02/2015 17:09, Alexandre Rafalovitch wrote:

Try removing that first epub from the directory and rerunning. If you now index 0 documents, then there is something unexpected about them and DIH skips. If it indexes 1 document again but a different one, then it is definitely something about the repeat logic. Also, try running with debug and verbose modes and see if something specific shows up.
Re: Can't index all docs in a local folder with DIH in Solr 5.0.0
What about recursive=true? Do you have subdirectories that could make a difference? Your SimplePostTool would not look at subdirectories (great comparison, BTW).

However, you do have lots of mapping options with the /update/extract handler as well; look at the example and the documentation. There is lots of mapping there.

Regards,
Alex.
Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter: http://www.solr-start.com/

On 25 February 2015 at 12:24, Gary Taylor g...@inovem.com wrote:

Alex, Thanks for the suggestions. It always indexes just 1 doc, regardless of which epub file it sees first. Debug / verbose don't show anything obvious to me. I can include the output here if you think it would help.

I tried using the SimplePostTool first (java -Dtype=application/epub+zip -Durl=http://localhost:8983/solr/hn1/update/extract -jar post.jar \Users\gt\Documents\epub\*.epub) to index the docs and check the Tika parsing, and that works OK, so I don't think it's the epubs.

I was trying to use DIH so that I could more easily specify the schema fields and store content in the index in preparation for trying out the search highlighting. I couldn't work out how to do that with post.jar.
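As a rough illustration of the mapping options mentioned for /update/extract, the sketch below builds a request URL that sets the document id with a `literal.` parameter and renames an extracted field with an `fmap.` parameter. It only constructs the URL and does not post a file. The function name, the id value and the Author-to-author mapping are illustrative assumptions, and the metadata names Tika actually emits for a given epub should be checked before relying on any mapping.

```python
from urllib.parse import urlencode


def build_extract_url(base_url, core, doc_id, field_map=None, commit=True):
    """Build an /update/extract URL with literal.id and fmap.* parameters."""
    params = [("literal.id", doc_id)]
    for source, target in (field_map or {}).items():
        # fmap.<source>=<target> renames an extracted field before indexing.
        params.append((f"fmap.{source}", target))
    if commit:
        params.append(("commit", "true"))
    return f"{base_url}/{core}/update/extract?{urlencode(params)}"


if __name__ == "__main__":
    # Hypothetical values based on the core and file names in this thread.
    print(build_extract_url(
        "http://localhost:8983/solr", "hn2", "issue018.epub",
        field_map={"Author": "author"},
    ))
```

The file itself would then be sent as the request body, for example with post.jar by pointing its -Durl option at the generated URL.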
Re: Can't index all docs in a local folder with DIH in Solr 5.0.0
Try removing that first epub from the directory and rerunning. If you now index 0 documents, then there is something unexpected about them and DIH skips. If it indexes 1 document again but a different one, then it is definitely something about the repeat logic.

Also, try running with debug and verbose modes and see if something specific shows up.

Regards,
Alex.
Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter: http://www.solr-start.com/

On 25 February 2015 at 11:14, Gary Taylor g...@inovem.com wrote:

I can't get the FileListEntityProcessor and TikaEntityProcessor to correctly add a Solr document for each epub file in my local directory.