Alex,

I've created JIRA ticket: https://issues.apache.org/jira/browse/SOLR-7174

In response to your suggestions below:

1. No exceptions are reported, even with onError removed.
2. ProcessMonitor shows only the very first epub file is being read (repeatedly).
3. I can repeat this on Ubuntu (14.04) by following the same steps.
4. Ticket raised (https://issues.apache.org/jira/browse/SOLR-7174)

Additionally (and I've added this on the ticket), if I change the dataConfig to use FileDataSource and PlainTextEntityProcessor, and just list *.txt files, it works!

<dataConfig>
    <dataSource type="FileDataSource" name="bin" />
    <document>
        <entity name="files" dataSource="null" rootEntity="false"
            processor="FileListEntityProcessor"
            baseDir="c:/Users/gt/Documents/HackerMonthly/epub" fileName=".*txt">
            <field column="fileAbsolutePath" name="id" />
            <field column="fileSize" name="size" />
            <field column="fileLastModified" name="lastModified" />

            <entity name="documentImport" processor="PlainTextEntityProcessor"
                url="${files.fileAbsolutePath}" format="text" dataSource="bin">
                <field column="plainText" name="content"/>
            </entity>
        </entity>
    </document>
</dataConfig>

So it's something related to BinFileDataSource and TikaEntityProcessor.
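
For comparison, the failing setup would presumably differ from the config above only in the data source type, the inner processor, the content column and the file pattern. This is a sketch under those assumptions, not a copy of the actual failing file: the paths and field names are carried over from the working config, "text" is the column TikaEntityProcessor uses for the extracted body, and the recursive and onError attributes discussed earlier in the thread are left out.

<dataConfig>
    <!-- Assumption: binary data source in place of FileDataSource -->
    <dataSource type="BinFileDataSource" name="bin" />
    <document>
        <!-- Outer entity only lists files; pattern assumed to match epub -->
        <entity name="files" dataSource="null" rootEntity="false"
            processor="FileListEntityProcessor"
            baseDir="c:/Users/gt/Documents/HackerMonthly/epub" fileName=".*epub">
            <field column="fileAbsolutePath" name="id" />
            <field column="fileSize" name="size" />
            <field column="fileLastModified" name="lastModified" />

            <!-- Inner entity: Tika extracts the body of each listed file -->
            <entity name="documentImport" processor="TikaEntityProcessor"
                url="${files.fileAbsolutePath}" format="text" dataSource="bin">
                <field column="text" name="content"/>
            </entity>
        </entity>
    </document>
</dataConfig>

With a config of this shape, the outer entity fetches every file but only one document ends up in the index.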

Thanks,
Gary.

On 26/02/2015 14:24, Gary Taylor wrote:
Alex,

That's great. Thanks for the pointers. I'll try and get more info on this and file a JIRA issue.

Kind regards,
Gary.

On 26/02/2015 14:16, Alexandre Rafalovitch wrote:
On 26 February 2015 at 08:32, Gary Taylor <g...@inovem.com> wrote:
Alex,

Same results on recursive=true / recursive=false.

I also tried importing plain text files instead of epub (still using
TikaEntityProcessor though) and get exactly the same result, i.e. all files
fetched, but only one document indexed in Solr.

To me, this would indicate that there is a problem with the inner
DIH entity then. As a next set of steps, I would probably:
1) remove both onError statements and see if there is an exception
that is being swallowed.
2) run the import under ProcessMonitor and see if the other files are
actually being read
https://technet.microsoft.com/en-us/library/bb896645.aspx
3) Assume a Windows bug and test this on Mac/Linux
4) File a JIRA with a replication case. If there is a full replication
setup, I'll test it on machines I have access to with full debugger
step-through

For example, I wonder if BinFileDataSource is somehow not cleaning up
after the first file properly on Windows and fails to open the second
one.

Regards,
    Alex.

----
Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter:
http://www.solr-start.com/



--
Gary Taylor | www.inovem.com | www.kahootz.com

INOVEM Ltd is registered in England and Wales No 4228932
Registered Office 1, Weston Court, Weston, Berkshire. RG20 8JE
kahootz.com is a trading name of INOVEM Ltd.
