Re: Can't index all docs in a local folder with DIH in Solr 5.0.0

2015-02-27 Thread Gary Taylor

Alex,

I've created JIRA ticket: https://issues.apache.org/jira/browse/SOLR-7174

In response to your suggestions below:

1. No exceptions are reported, even with onError removed.
2. ProcessMonitor shows only the very first epub file is being read 
(repeatedly).
3. I can repeat this on Ubuntu (14.04) by following the same steps.
4. Ticket raised (https://issues.apache.org/jira/browse/SOLR-7174).

Additionally (and I've added this on the ticket), if I change the 
dataConfig to use FileDataSource and PlainTextEntityProcessor, and just 
list *.txt files, it works!





baseDir="c:/Users/gt/Documents/HackerMonthly/epub" 
fileName=".*txt">

processor="PlainTextEntityProcessor"
url="${files.fileAbsolutePath}" format="text" 
dataSource="bin">







So it's something related to BinFileDataSource and TikaEntityProcessor.
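
Since the list archive stripped the element tags from the config above, here is a rough sketch of what a working dataConfig of that shape might look like. The entity names (files, documentImport) and the attributes shown in the surviving fragments come from this thread; the rootEntity and dataSource="null" settings and the field mappings (id, content) are assumptions for illustration, not the exact file.

```xml
<dataConfig>
  <!-- Plain file data source, referenced below as dataSource="bin" -->
  <dataSource type="FileDataSource" name="bin"/>
  <document>
    <!-- Outer entity only lists files; rootEntity="false" is an assumption -->
    <entity name="files" dataSource="null" rootEntity="false"
            processor="FileListEntityProcessor"
            baseDir="c:/Users/gt/Documents/HackerMonthly/epub"
            fileName=".*txt">
      <!-- Hypothetical mapping of the file path to the unique key -->
      <field column="fileAbsolutePath" name="id"/>

      <!-- Inner entity reads each listed file as plain text -->
      <entity name="documentImport" processor="PlainTextEntityProcessor"
              url="${files.fileAbsolutePath}" format="text"
              dataSource="bin">
        <!-- plainText is the column PlainTextEntityProcessor emits;
             the target field name "content" is hypothetical -->
        <field column="plainText" name="content"/>
      </entity>
    </entity>
  </document>
</dataConfig>
```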

Thanks,
Gary.


--
Gary Taylor | www.inovem.com | www.kahootz.com

INOVEM Ltd is registered in England and Wales No 4228932
Registered Office 1, Weston Court, Weston, Berkshire. RG20 8JE
kahootz.com is a trading name of INOVEM Ltd.



Re: Can't index all docs in a local folder with DIH in Solr 5.0.0

2015-02-26 Thread Gary Taylor

Alex,

That's great.  Thanks for the pointers.  I'll try and get more info on 
this and file a JIRA issue.


Kind regards,
Gary.




Re: Can't index all docs in a local folder with DIH in Solr 5.0.0

2015-02-26 Thread Alexandre Rafalovitch
On 26 February 2015 at 08:32, Gary Taylor wrote:
> Alex,
>
> Same results on recursive=true / recursive=false.
>
> I also tried importing plain text files instead of epub (still using
> TikaEntityProcessor though) and get exactly the same result - i.e. all files
> fetched, but only one document indexed in Solr.

To me, this indicates a problem with the inner DIH entity. As next
steps, I would probably:
1) Remove both onError statements and see if there is an exception
that is being swallowed.
2) Run the import under ProcessMonitor and see if the other files are
actually being read:
https://technet.microsoft.com/en-us/library/bb896645.aspx
3) Assume a Windows bug and test this on Mac/Linux.
4) File a JIRA with a replication case. If there is a full replication
setup, I'll test it on machines I have access to with a full debugger
step-through.

For example, I wonder if BinFileDataSource is somehow not cleaning up
after the first file properly on Windows and fails to open the second
one.

Regards,
   Alex.


Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter:
http://www.solr-start.com/


Re: Can't index all docs in a local folder with DIH in Solr 5.0.0

2015-02-26 Thread Gary Taylor

Alex,

Same results on recursive=true / recursive=false.

I also tried importing plain text files instead of epub (still using 
TikaEntityProcessor though) and got exactly the same result, i.e. all 
files fetched, but only one document indexed in Solr.

With verbose output, I get a row for each file in the directory, but 
only the first one has a non-empty documentImport entity. All 
subsequent documentImport entities just have an empty document#2 entry, e.g.:


 
  "verbose-output": [
    "entity:files",
    [
      null,
      "--- row #1-",
      "fileSize",
      2609004,
      "fileLastModified",
      "2015-02-25T11:37:25.217Z",
      "fileAbsolutePath",
      "c:\\Users\\gt\\Documents\\epub\\issue018.epub",
      "fileDir",
      "c:\\Users\\gt\\Documents\\epub",
      "file",
      "issue018.epub",
      null,
      "-",
      "entity:documentImport",
      [
        "document#1",
        [
          "query",
          "c:\\Users\\gt\\Documents\\epub\\issue018.epub",
          "time-taken",
          "0:0:0.0",
          null,
          "--- row #1-",
          "text",
          "< ... parsed epub text - snip ... >",
          "title",
          "Issue 18 title",
          "Author",
          "Author text",
          null,
          "-"
        ],
        "document#2",
        []
      ],
      null,
      "--- row #2-",
      "fileSize",
      4428804,
      "fileLastModified",
      "2015-02-25T11:37:36.399Z",
      "fileAbsolutePath",
      "c:\\Users\\gt\\Documents\\epub\\issue019.epub",
      "fileDir",
      "c:\\Users\\gt\\Documents\\epub",
      "file",
      "issue019.epub",
      null,
      "-",
      "entity:documentImport",
      [
        "document#2",
        []
      ],
      null,
      "--- row #3-",
      "fileSize",
      2580266,
      "fileLastModified",
      "2015-02-25T11:37:41.188Z",
      "fileAbsolutePath",
      "c:\\Users\\gt\\Documents\\epub\\issue020.epub",
      "fileDir",
      "c:\\Users\\gt\\Documents\\epub",
      "file",
      "issue020.epub",
      null,
      "-",
      "entity:documentImport",
      [
        "document#2",
        []
      ],






Re: Can't index all docs in a local folder with DIH in Solr 5.0.0

2015-02-25 Thread Gary Taylor

Alex,

Thanks for the suggestions.  It always indexes just 1 doc, whichever 
epub file it sees first.  Debug / verbose don't show anything 
obvious to me.  I can include the output here if you think it would help.

I tried using the SimplePostTool first (java 
-Dtype=application/epub+zip 
-Durl=http://localhost:8983/solr/hn1/update/extract -jar post.jar 
\Users\gt\Documents\epub\*.epub) to index the docs and check the Tika 
parsing, and that works OK, so I don't think it's the epubs.

I was trying to use DIH so that I could more easily specify the schema 
fields and store content in the index in preparation for trying out the 
search highlighting. I couldn't work out how to do that with post.jar.


Thanks,
Gary




Re: Can't index all docs in a local folder with DIH in Solr 5.0.0

2015-02-25 Thread Alexandre Rafalovitch
What about "recursive=true"? Do you have subdirectories that could
make a difference? Your SimplePostTool would not look at
subdirectories (great comparison, BTW).

However, you also have lots of mapping options with the
/update/extract handler; look at the examples and documentation.
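
As a rough illustration of those mapping options, a handler definition in solrconfig.xml might look like the sketch below. The parameter names (lowernames, uprefix, fmap.content) are standard ExtractingRequestHandler options; the target field "content" and the "ignored_" dynamic-field prefix are assumptions and would have to exist in your schema.

```xml
<requestHandler name="/update/extract"
                class="solr.extraction.ExtractingRequestHandler">
  <lst name="defaults">
    <!-- Normalise Tika metadata names to lowercase with underscores -->
    <str name="lowernames">true</str>
    <!-- Prefix for metadata fields not defined in the schema;
         "ignored_" assumes a matching dynamicField exists -->
    <str name="uprefix">ignored_</str>
    <!-- Map Tika's extracted body text to a schema field;
         "content" is a hypothetical field name -->
    <str name="fmap.content">content</str>
  </lst>
</requestHandler>
```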

Regards,
   Alex.





Re: Can't index all docs in a local folder with DIH in Solr 5.0.0

2015-02-25 Thread Alexandre Rafalovitch
Try removing that first epub from the directory and rerunning. If you
now index 0 documents, then there is something unexpected about the
remaining files and DIH skips them. If it indexes 1 document again but
a different one, then it is definitely something about the repeat logic.

Also, try running with debug and verbose modes and see if something
specific shows up.

Regards,
   Alex.



On 25 February 2015 at 11:14, Gary Taylor wrote:
> I can't get the FileListEntityProcessor and TikaEntityProcessor to correctly
> add a Solr document for each epub file in my local directory.
>
> I've just downloaded Solr 5.0.0, on a Windows 7 PC.   I ran "solr start" and
> then "solr create -c hn2" to create a new core.
>
> I want to index a load of epub files that I've got in a directory. So I
> created a data-import.xml (in solr\hn2\conf):
>
> 
> 
> 
>  processor="FileListEntityProcessor"
> baseDir="c:/Users/gt/Documents/epub" fileName=".*epub"
> onError="skip"
> recursive="true">
> 
> 
> 
>
>  url="${files.fileAbsolutePath}" format="text"
> dataSource="bin" onError="skip">
> 
> 
> 
> 
> 
> 
> 
> 
>
> In my solrconfig.xml, I added a requestHandler entry to reference my
> data-import.xml:
>
>class="org.apache.solr.handler.dataimport.DataImportHandler">
>   
>   data-import.xml
>   
>   
>
> I renamed managed-schema to schema.xml, and ensured the following doc fields
> were setup:
>
>required="true" multiValued="false" />
>   
>   
>   
>
>   
>   
>
>multiValued="false"/>
>multiValued="true"/>
>
> 
>
> I copied all the jars from dist and contrib\* into server\solr\lib.
>
> Stopping and restarting solr then creates a new managed-schema file and
> renames schema.xml to schema.xml.back
>
> All good so far.
>
> Now I go to the web admin for dataimport
> (http://localhost:8983/solr/#/hn2/dataimport//dataimport) and try and
> execute a full import.
>
> But, the results show "Requests: 0, Fetched: 58, Skipped: 0, Processed:1" -
> ie. it only adds one document (the very first one) even though it's iterated
> over 58!
>
> No errors are reported in the logs.
>
> I can search on the contents of that first epub document, so it's extracting
> OK in Tika, but there's a problem somewhere in my config that's causing only
> 1 document to be indexed in Solr.
>
> Thanks for any assistance / pointers.
>
> Regards,
> Gary
>
>


Can't index all docs in a local folder with DIH in Solr 5.0.0

2015-02-25 Thread Gary Taylor
I can't get the FileListEntityProcessor and TikaEntityProcessor to 
correctly add a Solr document for each epub file in my local directory.


I've just downloaded Solr 5.0.0, on a Windows 7 PC.   I ran "solr start" 
and then "solr create -c hn2" to create a new core.


I want to index a load of epub files that I've got in a directory. So I 
created a data-import.xml (in solr\hn2\conf):










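processor="FileListEntityProcessor"
baseDir="c:/Users/gt/Documents/epub" fileName=".*epub"
onError="skip"
recursive="true">

processor="TikaEntityProcessor"
url="${files.fileAbsolutePath}" format="text" 
dataSource="bin" onError="skip">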
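
The list archive stripped the element tags from this config, so here is a rough sketch of what a data-import.xml of this shape might look like. The processors, baseDir, fileName, onError, recursive, url, format and dataSource="bin" values come from fragments that survive in this thread; the entity names (files, documentImport) and Tika columns (text, title, Author) appear in the verbose output posted later. The BinFileDataSource declaration, rootEntity="false", dataSource="null" and all target field names are assumptions for illustration.

```xml
<dataConfig>
  <!-- Binary source for TikaEntityProcessor, referenced below as dataSource="bin" (assumed declaration) -->
  <dataSource type="BinFileDataSource" name="bin"/>
  <document>
    <!-- Outer entity: one row per matching file; it does not create documents itself -->
    <entity name="files" dataSource="null" rootEntity="false"
            processor="FileListEntityProcessor"
            baseDir="c:/Users/gt/Documents/epub" fileName=".*epub"
            onError="skip" recursive="true">
      <!-- Hypothetical mappings from file metadata to schema fields -->
      <field column="fileAbsolutePath" name="id"/>
      <field column="fileSize" name="size"/>
      <field column="fileLastModified" name="lastModified"/>

      <!-- Inner entity: parse each listed file with Tika -->
      <entity name="documentImport" processor="TikaEntityProcessor"
              url="${files.fileAbsolutePath}" format="text"
              dataSource="bin" onError="skip">
        <!-- Column names as seen in the verbose output; target names are hypothetical -->
        <field column="text" name="content"/>
        <field column="title" name="title"/>
        <field column="Author" name="author"/>
      </entity>
    </entity>
  </document>
</dataConfig>
```

Removing both onError="skip" attributes from a config like this is one way to make any swallowed exception surface in the logs.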










In my solrconfig.xml, I added a requestHandler entry to reference my 
data-import.xml:


  class="org.apache.solr.handler.dataimport.DataImportHandler">
  data-import.xml
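
The tags of this entry were also lost in the archive. A typical DataImportHandler registration looks roughly like the sketch below; the handler name "/dataimport" is inferred from the admin URL used in this message, and the surrounding lst/str structure is the usual form rather than a verbatim copy of the original file.

```xml
<requestHandler name="/dataimport"
                class="org.apache.solr.handler.dataimport.DataImportHandler">
  <lst name="defaults">
    <!-- Path is resolved relative to the core's conf directory -->
    <str name="config">data-import.xml</str>
  </lst>
</requestHandler>
```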
  
  

I renamed managed-schema to schema.xml, and ensured the following doc 
fields were set up:

  required="true" multiValued="false" />
  stored="true" />
  stored="true" multiValued="false"/>
  multiValued="true"/>




I copied all the jars from dist and contrib\* into server\solr\lib.

Stopping and restarting solr then creates a new managed-schema file and 
renames schema.xml to schema.xml.back


All good so far.

Now I go to the web admin for dataimport 
(http://localhost:8983/solr/#/hn2/dataimport//dataimport) and try and 
execute a full import.


But the results show "Requests: 0, Fetched: 58, Skipped: 0, 
Processed:1", i.e. it only adds one document (the very first one) even 
though it has iterated over 58!


No errors are reported in the logs.

I can search on the contents of that first epub document, so it's 
extracting OK in Tika, but there's a problem somewhere in my config 
that's causing only 1 document to be indexed in Solr.


Thanks for any assistance / pointers.

Regards,
Gary
