Re: Can't index all docs in a local folder with DIH in Solr 5.0.0
Alex,

I've created a JIRA ticket: https://issues.apache.org/jira/browse/SOLR-7174

In response to your suggestions below:

1. No exceptions are reported, even with onError removed.
2. ProcessMonitor shows only the very first epub file is being read (repeatedly).
3. I can repeat this on Ubuntu (14.04) by following the same steps.
4. Ticket raised (https://issues.apache.org/jira/browse/SOLR-7174).

Additionally (and I've added this on the ticket), if I change the dataConfig to use FileDataSource and PlainTextEntityProcessor, and just list *.txt files, it works!

    baseDir="c:/Users/gt/Documents/HackerMonthly/epub" fileName=".*txt">
    processor="PlainTextEntityProcessor" url="${files.fileAbsolutePath}" format="text" dataSource="bin">

So it's something related to BinFileDataSource and TikaEntityProcessor.

Thanks,
Gary.

--
Gary Taylor | www.inovem.com | www.kahootz.com
INOVEM Ltd is registered in England and Wales No 4228932
Registered Office 1, Weston Court, Weston, Berkshire. RG20 8JE
kahootz.com is a trading name of INOVEM Ltd.
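The XML element names in the working fragment above were lost in the archive. A minimal sketch of what the complete FileDataSource / PlainTextEntityProcessor variant might look like follows. The attribute values come from the message, and the entity names `files` and `documentImport` appear elsewhere in the thread. The `<dataSource>` line, the `dataSource="null"` and `rootEntity="false"` attributes on the outer entity, and the `<field>` mappings (including the target field names `id` and `text`) are assumptions, not taken from Gary's actual file.

```xml
<dataConfig>
  <!-- Assumption: a character-based data source registered under the name "bin",
       matching the dataSource="bin" attribute kept in the fragment above -->
  <dataSource type="FileDataSource" name="bin"/>
  <document>
    <!-- Outer entity: lists the files; produces no Solr documents itself -->
    <entity name="files" dataSource="null" rootEntity="false"
            processor="FileListEntityProcessor"
            baseDir="c:/Users/gt/Documents/HackerMonthly/epub" fileName=".*txt">
      <!-- Hypothetical mapping: use the absolute path as the unique key -->
      <field column="fileAbsolutePath" name="id"/>
      <!-- Inner entity: one Solr document per file listed by the outer entity -->
      <entity name="documentImport" processor="PlainTextEntityProcessor"
              url="${files.fileAbsolutePath}" format="text" dataSource="bin">
        <!-- PlainTextEntityProcessor exposes the file content as the column "plainText";
             the target field name "text" is hypothetical -->
        <field column="plainText" name="text"/>
      </entity>
    </entity>
  </document>
</dataConfig>
```

The only differences from the failing setup are the data source type and the inner entity's processor, which is why the comparison points at BinFileDataSource and TikaEntityProcessor.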
Re: Can't index all docs in a local folder with DIH in Solr 5.0.0
On 26/02/2015 14:24, Gary Taylor wrote:

Alex,

That's great. Thanks for the pointers. I'll try and get more info on this and file a JIRA issue.

Kind regards,
Gary.
Re: Can't index all docs in a local folder with DIH in Solr 5.0.0
On 26/02/2015 14:16, Alexandre Rafalovitch wrote:

On 26 February 2015 at 08:32, Gary Taylor wrote:
> Alex,
>
> Same results on recursive=true / recursive=false.
>
> I also tried importing plain text files instead of epub (still using
> TikaEntityProcessor though) and get exactly the same result - ie. all files
> fetched, but only one document indexed in Solr.

To me, this would indicate that the problem is with the inner DIH entity then. As a next set of steps, I would probably:

1) Remove both onError statements and see if there is an exception that is being swallowed.
2) Run the import under ProcessMonitor and see if the other files are actually being read: https://technet.microsoft.com/en-us/library/bb896645.aspx
3) Assume a Windows bug and test this on Mac/Linux.
4) File a JIRA with a replication case. If there is a full replication setup, I'll test it on machines I have access to with full debugger step-through.

For example, I wonder if BinFileDataSource is somehow not cleaning up after the first file properly on Windows and fails to open the second one.

Regards,
   Alex.
----
Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter:
http://www.solr-start.com/
Re: Can't index all docs in a local folder with DIH in Solr 5.0.0
On 26 February 2015 at 08:32, Gary Taylor wrote:

Alex,

Same results on recursive=true / recursive=false.

I also tried importing plain text files instead of epub (still using TikaEntityProcessor though) and get exactly the same result - ie. all files fetched, but only one document indexed in Solr.

With verbose output, I get a row for each file in the directory, but only the first one has a non-empty documentImport entity. All subsequent documentImport entities just have an empty document#2 entry. eg:

    "verbose-output": [
      "entity:files",
      [
        null,
        "--- row #1-",
        "fileSize", 2609004,
        "fileLastModified", "2015-02-25T11:37:25.217Z",
        "fileAbsolutePath", "c:\\Users\\gt\\Documents\\epub\\issue018.epub",
        "fileDir", "c:\\Users\\gt\\Documents\\epub",
        "file", "issue018.epub",
        null,
        "-",
        "entity:documentImport",
        [
          "document#1",
          [
            "query", "c:\\Users\\gt\\Documents\\epub\\issue018.epub",
            "time-taken", "0:0:0.0",
            null,
            "--- row #1-",
            "text", "< ... parsed epub text - snip ... >",
            "title", "Issue 18 title",
            "Author", "Author text",
            null,
            "-"
          ],
          "document#2",
          []
        ],
        null,
        "--- row #2-",
        "fileSize", 4428804,
        "fileLastModified", "2015-02-25T11:37:36.399Z",
        "fileAbsolutePath", "c:\\Users\\gt\\Documents\\epub\\issue019.epub",
        "fileDir", "c:\\Users\\gt\\Documents\\epub",
        "file", "issue019.epub",
        null,
        "-",
        "entity:documentImport",
        [
          "document#2",
          []
        ],
        null,
        "--- row #3-",
        "fileSize", 2580266,
        "fileLastModified", "2015-02-25T11:37:41.188Z",
        "fileAbsolutePath", "c:\\Users\\gt\\Documents\\epub\\issue020.epub",
        "fileDir", "c:\\Users\\gt\\Documents\\epub",
        "file", "issue020.epub",
        null,
        "-",
        "entity:documentImport",
        [
          "document#2",
          []
        ],
Re: Can't index all docs in a local folder with DIH in Solr 5.0.0
On 25 February 2015 at 12:24, Gary Taylor wrote:

Alex,

Thanks for the suggestions. It always indexes just 1 doc, whichever epub file it sees first. Debug / verbose don't show anything obvious to me. I can include the output here if you think it would help.

I tried using the SimplePostTool first to index the docs and check the Tika parsing:

    java -Dtype=application/epub+zip -Durl=http://localhost:8983/solr/hn1/update/extract -jar post.jar \Users\gt\Documents\epub\*.epub

That works OK, so I don't think it's the epubs.

I was trying to use DIH so that I could more easily specify the schema fields and store content in the index in preparation for trying out the search highlighting. I couldn't work out how to do that with post.jar.

Thanks,
Gary
Re: Can't index all docs in a local folder with DIH in Solr 5.0.0
What about "recursive=true"? Do you have subdirectories that could make a difference? Your SimplePostTool would not look at subdirectories (great comparison, BTW).

However, you do have lots of mapping options as well with the /update/extract handler; look at the examples and documentation. There is lots of mapping there.

Regards,
   Alex.

On 25 February 2015 at 11:14, Gary Taylor wrote:
> I want to index a load of epub files that I've got in a directory. So I
> created a data-import.xml (in solr\hn2\conf):
>
> processor="FileListEntityProcessor"
> baseDir="c:/Users/gt/Documents/epub" fileName=".*epub"
> onError="skip"
> recursive="true">
>
> processor="TikaEntityProcessor"
> url="${files.fileAbsolutePath}" format="text"
> dataSource="bin" onError="skip">
>
> In my solrconfig.xml, I added a requestHandler entry to reference my
> data-import.xml:
>
> class="org.apache.solr.handler.dataimport.DataImportHandler">
> data-import.xml
>
> I renamed managed-schema to schema.xml, and ensured the following doc
> fields were setup:
>
> required="true" multiValued="false" />
> />
> stored="true" />
> multiValued="false"/>
> multiValued="true"/>
Re: Can't index all docs in a local folder with DIH in Solr 5.0.0
On 25/02/2015 17:09, Alexandre Rafalovitch wrote:

Try removing that first epub from the directory and rerunning. If you now index 0 documents, then there is something unexpected about the remaining files and DIH skips them. If it indexes 1 document again but a different one, then it is definitely something about the repeat logic.

Also, try running with debug and verbose modes and see if something specific shows up.

Regards,
   Alex.
Can't index all docs in a local folder with DIH in Solr 5.0.0
On 25 February 2015 at 11:14, Gary Taylor wrote:

I can't get the FileListEntityProcessor and TikaEntityProcessor to correctly add a Solr document for each epub file in my local directory.

I've just downloaded Solr 5.0.0, on a Windows 7 PC. I ran "solr start" and then "solr create -c hn2" to create a new core.

I want to index a load of epub files that I've got in a directory. So I created a data-import.xml (in solr\hn2\conf):

    processor="FileListEntityProcessor" baseDir="c:/Users/gt/Documents/epub" fileName=".*epub" onError="skip" recursive="true">
    processor="TikaEntityProcessor" url="${files.fileAbsolutePath}" format="text" dataSource="bin" onError="skip">

In my solrconfig.xml, I added a requestHandler entry to reference my data-import.xml:

    class="org.apache.solr.handler.dataimport.DataImportHandler">
    data-import.xml

I renamed managed-schema to schema.xml, and ensured the following doc fields were set up:

    required="true" multiValued="false" />
    stored="true" />
    stored="true" multiValued="false"/>
    multiValued="true"/>

I copied all the jars from dist and contrib\* into server\solr\lib.

Stopping and restarting Solr then creates a new managed-schema file and renames schema.xml to schema.xml.back.

All good so far.

Now I go to the web admin for dataimport (http://localhost:8983/solr/#/hn2/dataimport//dataimport) and try to execute a full import.

But the results show "Requests: 0, Fetched: 58, Skipped: 0, Processed:1" - ie. it only adds one document (the very first one) even though it has iterated over 58!

No errors are reported in the logs.

I can search on the contents of that first epub document, so it's extracting OK in Tika, but there's a problem somewhere in my config that's causing only 1 document to be indexed in Solr.

Thanks for any assistance / pointers.

Regards,
Gary
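The data-import.xml in the message above survives only as attribute fragments, so here is a minimal sketch of how a FileListEntityProcessor / TikaEntityProcessor configuration of this shape is typically laid out. The attribute values are the ones quoted in the thread, and the entity names `files` and `documentImport` and the BinFileDataSource type are mentioned in later replies. The `dataSource="null"` and `rootEntity="false"` attributes and every `<field>` mapping (including the target field names) are assumptions for illustration, not the poster's actual file.

```xml
<dataConfig>
  <!-- Binary data source named "bin", referenced by the inner entity -->
  <dataSource type="BinFileDataSource" name="bin"/>
  <document>
    <!-- Outer entity: walks the directory and emits one row per matching file.
         rootEntity="false" (assumed) makes the inner entity the one that creates documents -->
    <entity name="files" dataSource="null" rootEntity="false"
            processor="FileListEntityProcessor"
            baseDir="c:/Users/gt/Documents/epub" fileName=".*epub"
            onError="skip" recursive="true">
      <!-- Hypothetical mappings from the file-list columns to schema fields -->
      <field column="fileAbsolutePath" name="id"/>
      <field column="fileSize" name="size"/>
      <field column="fileLastModified" name="lastModified"/>
      <!-- Inner entity: parses each file with Tika -->
      <entity name="documentImport" processor="TikaEntityProcessor"
              url="${files.fileAbsolutePath}" format="text"
              dataSource="bin" onError="skip">
        <!-- meta="true" marks columns taken from Tika metadata; target names are hypothetical -->
        <field column="Author" name="author" meta="true"/>
        <field column="title" name="title" meta="true"/>
        <field column="text" name="content"/>
      </entity>
    </entity>
  </document>
</dataConfig>
```

With a layout like this, "Fetched: 58" would count rows from the outer `files` entity, while "Processed: 1" would count documents actually built by the inner `documentImport` entity, which matches the symptom reported.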