Re: Can't index all docs in a local folder with DIH in Solr 5.0.0

2015-02-27 Thread Gary Taylor

Alex,

I've created JIRA ticket: https://issues.apache.org/jira/browse/SOLR-7174

In response to your suggestions below:

1. No exceptions are reported, even with onError removed.
2. ProcessMonitor shows only the very first epub file being read
(repeatedly).
3. I can reproduce this on Ubuntu (14.04) by following the same steps.
4. Ticket raised (https://issues.apache.org/jira/browse/SOLR-7174)

Additionally (and I've added this on the ticket), if I change the 
dataConfig to use FileDataSource and PlainTextEntityProcessor, and just 
list *.txt files, it works!


<dataConfig>
    <dataSource type="FileDataSource" name="bin" />
    <document>
        <entity name="files" dataSource="null" rootEntity="false"
                processor="FileListEntityProcessor"
                baseDir="c:/Users/gt/Documents/HackerMonthly/epub"
                fileName=".*txt">

            <field column="fileAbsolutePath" name="id" />
            <field column="fileSize" name="size" />
            <field column="fileLastModified" name="lastModified" />

            <entity name="documentImport"
                    processor="PlainTextEntityProcessor"
                    url="${files.fileAbsolutePath}" format="text"
                    dataSource="bin">

                <field column="plainText" name="content"/>
            </entity>
        </entity>
    </document>
</dataConfig>

So it's something related to BinFileDataSource and TikaEntityProcessor.
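For what it's worth, here is a toy sketch (Python, purely illustrative; this is not Solr's code, and `LeakyDataSource` is a made-up name) of how a data source that hands every sub-entity the same unreset stream would produce exactly this symptom: the first file yields content, and every later one comes back empty:

```python
import io

# Hypothetical data source that opens one stream and never resets it
# between files -- an illustration of the suspected cleanup bug, not
# an excerpt from BinFileDataSource.
class LeakyDataSource:
    def __init__(self):
        self._stream = None

    def get_data(self, payload: bytes) -> bytes:
        if self._stream is None:       # only the first request opens a stream
            self._stream = io.BytesIO(payload)
        return self._stream.read()     # already exhausted on every later call

source = LeakyDataSource()
first = source.get_data(b"issue018 contents")
second = source.get_data(b"issue019 contents")
print(len(first), len(second))  # the second "file" comes back empty
```

Under that failure mode, DIH would still iterate (and "fetch") every file, but only the first would produce a non-empty document.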

Thanks,
Gary.

On 26/02/2015 14:24, Gary Taylor wrote:

Alex,

That's great.  Thanks for the pointers.  I'll try and get more info on 
this and file a JIRA issue.


Kind regards,
Gary.

On 26/02/2015 14:16, Alexandre Rafalovitch wrote:

On 26 February 2015 at 08:32, Gary Taylor g...@inovem.com wrote:

Alex,

Same results on recursive=true / recursive=false.

I also tried importing plain text files instead of epub (still using
TikaEntityProcessor though) and get exactly the same result - i.e. all
files fetched, but only one document indexed in Solr.

To me, this would indicate that the problem is with the inner
DIH entity. As a next set of steps, I would probably
1) remove both onError statements and see if there is an exception
that is being swallowed.
2) run the import under ProcessMonitor and see if the other files are
actually being read
https://technet.microsoft.com/en-us/library/bb896645.aspx
3) Assume a Windows bug and test this on Mac/Linux
4) File a JIRA with a replication case. If there is a full replication
setup, I'll test it on machines I have access to with full debugger
step-through

For example, I wonder if BinFileDataSource is somehow not cleaning up
after the first file properly on Windows and fails to open the second
one.

Regards,
Alex.


Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter:
http://www.solr-start.com/





--
Gary Taylor | www.inovem.com | www.kahootz.com

INOVEM Ltd is registered in England and Wales No 4228932
Registered Office 1, Weston Court, Weston, Berkshire. RG20 8JE
kahootz.com is a trading name of INOVEM Ltd.



Re: Can't index all docs in a local folder with DIH in Solr 5.0.0

2015-02-26 Thread Gary Taylor

Alex,

Same results on recursive=true / recursive=false.

I also tried importing plain text files instead of epub (still using
TikaEntityProcessor though) and get exactly the same result - i.e. all
files fetched, but only one document indexed in Solr.


With verbose output, I get a row for each file in the directory, but
only the first one has a non-empty documentImport entity.  All
subsequent documentImport entities just have an empty document#2 entry, e.g.:


 
  verbose-output: [
entity:files,
[
  null,
  --- row #1-,
  fileSize,
  2609004,
  fileLastModified,
  2015-02-25T11:37:25.217Z,
  fileAbsolutePath,
  c:\\Users\\gt\\Documents\\epub\\issue018.epub,
  fileDir,
  c:\\Users\\gt\\Documents\\epub,
  file,
  issue018.epub,
  null,
  -,
  entity:documentImport,
  [
document#1,
[
  query,
  c:\\Users\\gt\\Documents\\epub\\issue018.epub,
  time-taken,
  0:0:0.0,
  null,
  --- row #1-,
  text,
   ... parsed epub text - snip ... 
  title,
  Issue 18 title,
  Author,
  Author text,
  null,
  -
],
document#2,
[]
  ],
  null,
  --- row #2-,
  fileSize,
  4428804,
  fileLastModified,
  2015-02-25T11:37:36.399Z,
  fileAbsolutePath,
  c:\\Users\\gt\\Documents\\epub\\issue019.epub,
  fileDir,
  c:\\Users\\gt\\Documents\\epub,
  file,
  issue019.epub,
  null,
  -,
  entity:documentImport,
  [
document#2,
[]
  ],
  null,
  --- row #3-,
  fileSize,
  2580266,
  fileLastModified,
  2015-02-25T11:37:41.188Z,
  fileAbsolutePath,
  c:\\Users\\gt\\Documents\\epub\\issue020.epub,
  fileDir,
  c:\\Users\\gt\\Documents\\epub,
  file,
  issue020.epub,
  null,
  -,
  entity:documentImport,
  [
document#2,
[]
  ],
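As an aside, the flat name/value shape of that debug output can be tallied mechanically. A rough sketch (Python; `files_entity` below is a hand-abbreviated stand-in for the verbose-output list above) counting how many documentImport entries actually contain field data:

```python
# Abbreviated stand-in for the "verbose-output" structure shown above:
# the files entity is a flat list where each "entity:documentImport"
# name is followed by a [document#N, [row fields...]] list.
files_entity = [
    "fileAbsolutePath", "c:\\Users\\gt\\Documents\\epub\\issue018.epub",
    "entity:documentImport",
    ["document#1", ["text", "...parsed epub text..."], "document#2", []],
    "fileAbsolutePath", "c:\\Users\\gt\\Documents\\epub\\issue019.epub",
    "entity:documentImport", ["document#2", []],
    "fileAbsolutePath", "c:\\Users\\gt\\Documents\\epub\\issue020.epub",
    "entity:documentImport", ["document#2", []],
]

def count_filled_documents(entity):
    """Count document#N entries whose row list is non-empty."""
    filled = 0
    for name, value in zip(entity, entity[1:]):
        if name == "entity:documentImport":
            rows = value[1::2]  # every other element is a row list
            filled += sum(1 for row in rows if row)
    return filled

print(count_filled_documents(files_entity))  # -> 1: only the first file made a document
```

That tally matches the reported counters: 58 files fetched, 1 processed.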






Re: Can't index all docs in a local folder with DIH in Solr 5.0.0

2015-02-26 Thread Gary Taylor

Alex,

That's great.  Thanks for the pointers.  I'll try and get more info on 
this and file a JIRA issue.


Kind regards,
Gary.

On 26/02/2015 14:16, Alexandre Rafalovitch wrote:

On 26 February 2015 at 08:32, Gary Taylor g...@inovem.com wrote:

Alex,

Same results on recursive=true / recursive=false.

I also tried importing plain text files instead of epub (still using
TikaEntityProcessor though) and get exactly the same result - i.e. all files
fetched, but only one document indexed in Solr.

To me, this would indicate that the problem is with the inner
DIH entity. As a next set of steps, I would probably
1) remove both onError statements and see if there is an exception
that is being swallowed.
2) run the import under ProcessMonitor and see if the other files are
actually being read
https://technet.microsoft.com/en-us/library/bb896645.aspx
3) Assume a Windows bug and test this on Mac/Linux
4) File a JIRA with a replication case. If there is a full replication
setup, I'll test it on machines I have access to with full debugger
step-through

For example, I wonder if BinFileDataSource is somehow not cleaning up
after the first file properly on Windows and fails to open the second
one.

Regards,
Alex.


Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter:
http://www.solr-start.com/






Re: Can't index all docs in a local folder with DIH in Solr 5.0.0

2015-02-26 Thread Alexandre Rafalovitch
On 26 February 2015 at 08:32, Gary Taylor g...@inovem.com wrote:
 Alex,

 Same results on recursive=true / recursive=false.

 I also tried importing plain text files instead of epub (still using
 TikaEntityProcessor though) and get exactly the same result - i.e. all files
 fetched, but only one document indexed in Solr.

To me, this would indicate that the problem is with the inner
DIH entity. As a next set of steps, I would probably
1) remove both onError statements and see if there is an exception
that is being swallowed.
2) run the import under ProcessMonitor and see if the other files are
actually being read
https://technet.microsoft.com/en-us/library/bb896645.aspx
3) Assume a Windows bug and test this on Mac/Linux
4) File a JIRA with a replication case. If there is a full replication
setup, I'll test it on machines I have access to with full debugger
step-through

For example, I wonder if BinFileDataSource is somehow not cleaning up
after the first file properly on Windows and fails to open the second
one.

Regards,
   Alex.


Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter:
http://www.solr-start.com/


Can't index all docs in a local folder with DIH in Solr 5.0.0

2015-02-25 Thread Gary Taylor
I can't get the FileListEntityProcessor and TikaEntityProcessor to
correctly add a Solr document for each epub file in my local directory.


I've just downloaded Solr 5.0.0, on a Windows 7 PC.   I ran solr start 
and then solr create -c hn2 to create a new core.


I want to index a load of epub files that I've got in a directory. So I 
created a data-import.xml (in solr\hn2\conf):


<dataConfig>
    <dataSource type="BinFileDataSource" name="bin" />
    <document>
        <entity name="files" dataSource="null" rootEntity="false"
                processor="FileListEntityProcessor"
                baseDir="c:/Users/gt/Documents/epub" fileName=".*epub"
                onError="skip"
                recursive="true">
            <field column="fileAbsolutePath" name="id" />
            <field column="fileSize" name="size" />
            <field column="fileLastModified" name="lastModified" />

            <entity name="documentImport" processor="TikaEntityProcessor"
                    url="${files.fileAbsolutePath}" format="text"
                    dataSource="bin" onError="skip">
                <field column="file" name="fileName"/>
                <field column="Author" name="author" meta="true"/>
                <field column="title" name="title" meta="true"/>
                <field column="text" name="content"/>
            </entity>
        </entity>
    </document>
</dataConfig>
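One detail worth knowing about this config: FileListEntityProcessor treats fileName as a regular expression, not a shell glob, so ".*epub" means "anything ending in the letters epub" rather than the glob "*.epub". A quick sketch of what that pattern accepts (Python's re module standing in for Java's regex engine, so treat the edge cases as approximate):

```python
import re

# fileName=".*epub" is a regex: ".*" matches any characters, then the
# literal letters "epub". Note the match is case-sensitive, and no
# escaped dot is needed before "epub".
pattern = re.compile(r".*epub")

names = ["issue018.epub", "issue019.EPUB", "notes.txt", "epub"]
matches = [n for n in names if pattern.fullmatch(n)]
print(matches)  # -> ['issue018.epub', 'epub']
```

So the pattern used here does pick up all the .epub files, which is consistent with the "Fetched: 58" counter below.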

In my solrconfig.xml, I added a requestHandler entry to reference my 
data-import.xml:


  <requestHandler name="/dataimport"
                  class="org.apache.solr.handler.dataimport.DataImportHandler">
      <lst name="defaults">
          <str name="config">data-import.xml</str>
      </lst>
  </requestHandler>

I renamed managed-schema to schema.xml, and ensured the following doc 
fields were setup:


  <field name="id" type="string" indexed="true" stored="true"
         required="true" multiValued="false" />

  <field name="fileName" type="string" indexed="true" stored="true" />
  <field name="author" type="string" indexed="true" stored="true" />
  <field name="title" type="string" indexed="true" stored="true" />

  <field name="size" type="long" indexed="true" stored="true" />
  <field name="lastModified" type="date" indexed="true" stored="true" />

  <field name="content" type="text_en" indexed="false" stored="true"
         multiValued="false"/>
  <field name="text" type="text_en" indexed="true" stored="false"
         multiValued="true"/>

<copyField source="content" dest="text"/>

I copied all the jars from dist and contrib\* into server\solr\lib.

Stopping and restarting solr then creates a new managed-schema file and 
renames schema.xml to schema.xml.back


All good so far.

Now I go to the web admin for dataimport 
(http://localhost:8983/solr/#/hn2/dataimport//dataimport) and try and 
execute a full import.
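The same full import can also be kicked off without the admin UI by hitting the /dataimport handler directly. A minimal sketch (Python stdlib; the host, port, and core name just mirror this setup, and clean/commit are the usual DIH request parameters) that builds the request URL - uncomment the urlopen line to actually fire it against a running Solr:

```python
from urllib.parse import urlencode
# from urllib.request import urlopen  # uncomment to actually send the request

def dataimport_url(core: str, command: str = "full-import") -> str:
    """Build a DataImportHandler request URL for the given core."""
    params = urlencode({"command": command, "clean": "true", "commit": "true"})
    return f"http://localhost:8983/solr/{core}/dataimport?{params}"

url = dataimport_url("hn2")
print(url)
# progress can be polled with the status command:
print(dataimport_url("hn2", command="status"))
# urlopen(url).read()  # would trigger the import
```

The status response carries the same Requests/Fetched/Processed counters the admin page shows.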


But the results show Requests: 0, Fetched: 58, Skipped: 0,
Processed: 1 - i.e. it only adds one document (the very first one) even
though it iterated over 58!


No errors are reported in the logs.

I can search on the contents of that first epub document, so it's 
extracting OK in Tika, but there's a problem somewhere in my config 
that's causing only 1 document to be indexed in Solr.


Thanks for any assistance / pointers.

Regards,
Gary




Re: Can't index all docs in a local folder with DIH in Solr 5.0.0

2015-02-25 Thread Gary Taylor

Alex,

Thanks for the suggestions.  It always indexes just 1 doc, regardless of 
which epub file it sees first.  Debug / verbose don't show anything 
obvious to me.  I can include the output here if you think it would help.


I tried using the SimplePostTool first (java -Dtype=application/epub+zip 
-Durl=http://localhost:8983/solr/hn1/update/extract -jar post.jar 
\Users\gt\Documents\epub\*.epub) to index the docs and check the Tika 
parsing, and that works OK, so I don't think it's the epubs.


I was trying to use DIH so that I could more easily specify the schema 
fields and store content in the index in preparation for trying out the 
search highlighting.  Couldn't work out how to do that with post.jar.
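For the record, /update/extract can map fields without DIH: the ExtractingRequestHandler accepts literal.<field> parameters to set a field verbatim and fmap.<source> parameters to rename Tika-extracted fields. A sketch of the URL such a request would use (Python stdlib; the core name and field names are taken from this thread's schema, and the URL is only built, not sent):

```python
from urllib.parse import urlencode

# literal.* sets a field value verbatim; fmap.* renames a Tika-extracted
# field -- both are standard ExtractingRequestHandler parameters.
params = urlencode({
    "literal.id": "c:/Users/gt/Documents/epub/issue018.epub",
    "fmap.content": "content",
    "commit": "true",
})
url = f"http://localhost:8983/solr/hn2/update/extract?{params}"
print(url)  # POST the epub bytes to this URL with Content-Type application/epub+zip
```

post.jar itself forwards extra -Dparams, but building the request directly makes the field mapping explicit.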


Thanks,
Gary

On 25/02/2015 17:09, Alexandre Rafalovitch wrote:

Try removing that first epub from the directory and rerunning. If you
now index 0 documents, then there is something unexpected about them
and DIH skips. If it indexes 1 document again but a different one,
then it is definitely something about the repeat logic.

Also, try running with debug and verbose modes and see if something
specific shows up.

Regards,
Alex.

Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter:
http://www.solr-start.com/


On 25 February 2015 at 11:14, Gary Taylor g...@inovem.com wrote:

I can't get the FileListEntityProcessor and TikaEntityProcessor to correctly
add a Solr document for each epub file in my local directory.

I've just downloaded Solr 5.0.0, on a Windows 7 PC.   I ran solr start and
then solr create -c hn2 to create a new core.

I want to index a load of epub files that I've got in a directory. So I
created a data-import.xml (in solr\hn2\conf):

<dataConfig>
    <dataSource type="BinFileDataSource" name="bin" />
    <document>
        <entity name="files" dataSource="null" rootEntity="false"
                processor="FileListEntityProcessor"
                baseDir="c:/Users/gt/Documents/epub" fileName=".*epub"
                onError="skip"
                recursive="true">
            <field column="fileAbsolutePath" name="id" />
            <field column="fileSize" name="size" />
            <field column="fileLastModified" name="lastModified" />

            <entity name="documentImport" processor="TikaEntityProcessor"
                    url="${files.fileAbsolutePath}" format="text"
                    dataSource="bin" onError="skip">
                <field column="file" name="fileName"/>
                <field column="Author" name="author" meta="true"/>
                <field column="title" name="title" meta="true"/>
                <field column="text" name="content"/>
            </entity>
        </entity>
    </document>
</dataConfig>

In my solrconfig.xml, I added a requestHandler entry to reference my
data-import.xml:

   <requestHandler name="/dataimport"
                   class="org.apache.solr.handler.dataimport.DataImportHandler">
       <lst name="defaults">
           <str name="config">data-import.xml</str>
       </lst>
   </requestHandler>

I renamed managed-schema to schema.xml, and ensured the following doc fields
were setup:

   <field name="id" type="string" indexed="true" stored="true"
          required="true" multiValued="false" />
   <field name="fileName" type="string" indexed="true" stored="true" />
   <field name="author" type="string" indexed="true" stored="true" />
   <field name="title" type="string" indexed="true" stored="true" />

   <field name="size" type="long" indexed="true" stored="true" />
   <field name="lastModified" type="date" indexed="true" stored="true" />

   <field name="content" type="text_en" indexed="false" stored="true"
          multiValued="false"/>
   <field name="text" type="text_en" indexed="true" stored="false"
          multiValued="true"/>

 <copyField source="content" dest="text"/>

I copied all the jars from dist and contrib\* into server\solr\lib.

Stopping and restarting solr then creates a new managed-schema file and
renames schema.xml to schema.xml.back

All good so far.

Now I go to the web admin for dataimport
(http://localhost:8983/solr/#/hn2/dataimport//dataimport) and try and
execute a full import.

But the results show Requests: 0, Fetched: 58, Skipped: 0, Processed: 1 -
i.e. it only adds one document (the very first one) even though it iterated
over 58!

No errors are reported in the logs.

I can search on the contents of that first epub document, so it's extracting
OK in Tika, but there's a problem somewhere in my config that's causing only
1 document to be indexed in Solr.

Thanks for any assistance / pointers.

Regards,
Gary




Re: Can't index all docs in a local folder with DIH in Solr 5.0.0

2015-02-25 Thread Alexandre Rafalovitch
What about recursive=true? Do you have subdirectories that could
make a difference? Your SimplePostTool run would not look at
subdirectories (great comparison, BTW).

However, you do have lots of mapping options as well with the
/update/extract handler - look at the examples and documentation.

Regards,
   Alex.

Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter:
http://www.solr-start.com/


On 25 February 2015 at 12:24, Gary Taylor g...@inovem.com wrote:
 Alex,

 Thanks for the suggestions.  It always indexes just 1 doc, regardless of
 which epub file it sees first.  Debug / verbose don't show anything obvious to me.
 I can include the output here if you think it would help.

 I tried using the SimplePostTool first (java -Dtype=application/epub+zip
 -Durl=http://localhost:8983/solr/hn1/update/extract -jar post.jar
 \Users\gt\Documents\epub\*.epub) to index the docs and check the Tika
 parsing and that works OK so I don't think it's the epubs.

 I was trying to use DIH so that I could more easily specify the schema
 fields and store content in the index in preparation for trying out the
 search highlighting. Couldn't work out how to do that with post.jar 

 Thanks,
 Gary


 On 25/02/2015 17:09, Alexandre Rafalovitch wrote:

 Try removing that first epub from the directory and rerunning. If you
 now index 0 documents, then there is something unexpected about them
 and DIH skips. If it indexes 1 document again but a different one,
 then it is definitely something about the repeat logic.

 Also, try running with debug and verbose modes and see if something
 specific shows up.

 Regards,
 Alex.
 
 Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter:
 http://www.solr-start.com/


 On 25 February 2015 at 11:14, Gary Taylor g...@inovem.com wrote:

 I can't get the FileListEntityProcessor and TikaEntityProcessor to correctly
 add a Solr document for each epub file in my local directory.

 I've just downloaded Solr 5.0.0, on a Windows 7 PC.   I ran solr start
 and
 then solr create -c hn2 to create a new core.

 I want to index a load of epub files that I've got in a directory. So I
 created a data-import.xml (in solr\hn2\conf):

 <dataConfig>
     <dataSource type="BinFileDataSource" name="bin" />
     <document>
         <entity name="files" dataSource="null" rootEntity="false"
                 processor="FileListEntityProcessor"
                 baseDir="c:/Users/gt/Documents/epub" fileName=".*epub"
                 onError="skip"
                 recursive="true">
             <field column="fileAbsolutePath" name="id" />
             <field column="fileSize" name="size" />
             <field column="fileLastModified" name="lastModified" />

             <entity name="documentImport" processor="TikaEntityProcessor"
                     url="${files.fileAbsolutePath}" format="text"
                     dataSource="bin" onError="skip">
                 <field column="file" name="fileName"/>
                 <field column="Author" name="author" meta="true"/>
                 <field column="title" name="title" meta="true"/>
                 <field column="text" name="content"/>
             </entity>
         </entity>
     </document>
 </dataConfig>

 In my solrconfig.xml, I added a requestHandler entry to reference my
 data-import.xml:

    <requestHandler name="/dataimport"
                    class="org.apache.solr.handler.dataimport.DataImportHandler">
        <lst name="defaults">
            <str name="config">data-import.xml</str>
        </lst>
    </requestHandler>

 I renamed managed-schema to schema.xml, and ensured the following doc
 fields were setup:

    <field name="id" type="string" indexed="true" stored="true"
           required="true" multiValued="false" />
    <field name="fileName" type="string" indexed="true" stored="true" />
    <field name="author" type="string" indexed="true" stored="true" />
    <field name="title" type="string" indexed="true" stored="true" />

    <field name="size" type="long" indexed="true" stored="true" />
    <field name="lastModified" type="date" indexed="true" stored="true" />

    <field name="content" type="text_en" indexed="false" stored="true"
           multiValued="false"/>
    <field name="text" type="text_en" indexed="true" stored="false"
           multiValued="true"/>

  <copyField source="content" dest="text"/>

 I copied all the jars from dist and contrib\* into server\solr\lib.

 Stopping and restarting solr then creates a new managed-schema file and
 renames schema.xml to schema.xml.back

 All good so far.

 Now I go to the web admin for dataimport
 (http://localhost:8983/solr/#/hn2/dataimport//dataimport) and try and
 execute a full import.

 But the results show Requests: 0, Fetched: 58, Skipped: 0, Processed: 1 -
 i.e. it only adds one document (the very first one) even though it
 iterated over 58!

 No errors are reported in the logs.

 I can search on the contents of that first epub document, so it's
 extracting
 OK in Tika, but there's a problem somewhere in my config that's causing
 only
 1 document to be indexed in Solr.

 Thanks for any assistance / pointers.

 Regards,
 Gary


Re: Can't index all docs in a local folder with DIH in Solr 5.0.0

2015-02-25 Thread Alexandre Rafalovitch
Try removing that first epub from the directory and rerunning. If you
now index 0 documents, then there is something unexpected about them
and DIH skips. If it indexes 1 document again but a different one,
then it is definitely something about the repeat logic.

Also, try running with debug and verbose modes and see if something
specific shows up.

Regards,
   Alex.

Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter:
http://www.solr-start.com/


On 25 February 2015 at 11:14, Gary Taylor g...@inovem.com wrote:
 I can't get the FileListEntityProcessor and TikaEntityProcessor to correctly
 add a Solr document for each epub file in my local directory.

 I've just downloaded Solr 5.0.0, on a Windows 7 PC.   I ran solr start and
 then solr create -c hn2 to create a new core.

 I want to index a load of epub files that I've got in a directory. So I
 created a data-import.xml (in solr\hn2\conf):

 <dataConfig>
     <dataSource type="BinFileDataSource" name="bin" />
     <document>
         <entity name="files" dataSource="null" rootEntity="false"
                 processor="FileListEntityProcessor"
                 baseDir="c:/Users/gt/Documents/epub" fileName=".*epub"
                 onError="skip"
                 recursive="true">
             <field column="fileAbsolutePath" name="id" />
             <field column="fileSize" name="size" />
             <field column="fileLastModified" name="lastModified" />

             <entity name="documentImport" processor="TikaEntityProcessor"
                     url="${files.fileAbsolutePath}" format="text"
                     dataSource="bin" onError="skip">
                 <field column="file" name="fileName"/>
                 <field column="Author" name="author" meta="true"/>
                 <field column="title" name="title" meta="true"/>
                 <field column="text" name="content"/>
             </entity>
         </entity>
     </document>
 </dataConfig>

 In my solrconfig.xml, I added a requestHandler entry to reference my
 data-import.xml:

    <requestHandler name="/dataimport"
                    class="org.apache.solr.handler.dataimport.DataImportHandler">
        <lst name="defaults">
            <str name="config">data-import.xml</str>
        </lst>
    </requestHandler>

 I renamed managed-schema to schema.xml, and ensured the following doc fields
 were setup:

    <field name="id" type="string" indexed="true" stored="true"
           required="true" multiValued="false" />
    <field name="fileName" type="string" indexed="true" stored="true" />
    <field name="author" type="string" indexed="true" stored="true" />
    <field name="title" type="string" indexed="true" stored="true" />

    <field name="size" type="long" indexed="true" stored="true" />
    <field name="lastModified" type="date" indexed="true" stored="true" />

    <field name="content" type="text_en" indexed="false" stored="true"
           multiValued="false"/>
    <field name="text" type="text_en" indexed="true" stored="false"
           multiValued="true"/>

 <copyField source="content" dest="text"/>

 I copied all the jars from dist and contrib\* into server\solr\lib.

 Stopping and restarting solr then creates a new managed-schema file and
 renames schema.xml to schema.xml.back

 All good so far.

 Now I go to the web admin for dataimport
 (http://localhost:8983/solr/#/hn2/dataimport//dataimport) and try and
 execute a full import.

 But the results show Requests: 0, Fetched: 58, Skipped: 0, Processed: 1 -
 i.e. it only adds one document (the very first one) even though it iterated
 over 58!

 No errors are reported in the logs.

 I can search on the contents of that first epub document, so it's extracting
 OK in Tika, but there's a problem somewhere in my config that's causing only
 1 document to be indexed in Solr.

 Thanks for any assistance / pointers.

 Regards,
 Gary
