Re: Solr hanging when extracting some broken .doc files

2013-12-19 Thread Augusto Camarotti
You could try setting the log level of the Tika classes
to FINEST in the Solr console; maybe that can be helpful.

Best,
Andrea
On 17 Dec 2013 16:30, Augusto Camarotti augu...@prpb.mpf.gov.br wrote:

 [...]

 So, how do I prevent Solr from hanging when trying to index broken files?

 Regards,

 Augusto Camarotti



Solr hanging when extracting some broken .doc files

2013-12-17 Thread Augusto Camarotti
Hi guys,
 
   I'm having a problem with Solr when trying to index some broken .doc files.
   I have set up a test case using Solr to index all the files that users save
on the shared directories of the company I work for, and Solr hangs when
trying to index one file in particular (the one I'm attaching to this e-mail).
There are other broken .doc files that Solr indexes by name without a problem,
even though it logs some Tika errors during the process, but when it reaches
this particular file, it hangs and I have to cancel the upload.
   I cannot guarantee the directories will never hold a broken .doc file, or a
broken file with some other extension, so I guess Solr could just return a
failure message, or something like that.
   These are the log messages Solr is recording:
 
 

03:38:23 ERROR SolrCore org.apache.solr.common.SolrException: 
org.apache.tika.exception.TikaException: Unexpected RuntimeException from 
org.apache.tika.parser.microsoft.OfficeParser@386f9474
03:38:25 ERROR SolrDispatchFilter null:org.apache.solr.common.SolrException: 
org.apache.tika.exception.TikaException: Unexpected RuntimeException from 
org.apache.tika.parser.microsoft.OfficeParser@386f9474
   
So, how do I prevent Solr from hanging when trying to index broken files?
 
Regards,
 
Augusto Camarotti 
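
One way to avoid this class of hang is to run Tika on the client with a hard
timeout and send only the extracted text to Solr, so a pathological file never
reaches the server. A minimal Tika/SolrJ sketch along those lines (the Solr
URL, the field names, and the 60-second timeout are illustrative assumptions,
not from this thread):

  import java.io.FileInputStream;
  import java.io.InputStream;
  import java.util.concurrent.Callable;
  import java.util.concurrent.ExecutorService;
  import java.util.concurrent.Executors;
  import java.util.concurrent.Future;
  import java.util.concurrent.TimeUnit;
  import java.util.concurrent.TimeoutException;

  import org.apache.solr.client.solrj.impl.HttpSolrServer;
  import org.apache.solr.common.SolrInputDocument;
  import org.apache.tika.metadata.Metadata;
  import org.apache.tika.parser.AutoDetectParser;
  import org.apache.tika.parser.ParseContext;
  import org.apache.tika.sax.BodyContentHandler;

  public class SafeExtractIndexer {

    public static void main(String[] args) throws Exception {
      final String path = args[0];
      ExecutorService pool = Executors.newSingleThreadExecutor();

      // Run the Tika parse in its own thread so it can be abandoned on timeout.
      Future<String> extraction = pool.submit(new Callable<String>() {
        public String call() throws Exception {
          InputStream in = new FileInputStream(path);
          try {
            BodyContentHandler text = new BodyContentHandler(-1); // -1 = no write limit
            new AutoDetectParser().parse(in, text, new Metadata(), new ParseContext());
            return text.toString();
          } finally {
            in.close();
          }
        }
      });

      String content;
      try {
        content = extraction.get(60, TimeUnit.SECONDS); // illustrative timeout
      } catch (TimeoutException e) {
        extraction.cancel(true); // interrupt the worker (see note below)
        System.err.println("Extraction timed out, skipping: " + path);
        return;
      } finally {
        pool.shutdownNow();
      }

      // Only the extracted text travels to Solr; the broken file itself never does.
      HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr"); // illustrative URL
      SolrInputDocument doc = new SolrInputDocument();
      doc.addField("id", path);      // illustrative field names
      doc.addField("text", content);
      solr.add(doc);
      solr.commit();
      solr.shutdown();
    }
  }

Note that cancel(true) only interrupts the worker thread; a parse that never
checks the interrupt flag can keep that thread alive, in which case running
the extraction in a separate worker process is the more robust variant of the
same idea.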


Re: Tika not extracting content from ODT / ODS (open document / libreoffice) in Solr 4.2.1

2013-12-04 Thread Augusto Camarotti
Hello everybody,
 
First of all, sorry about my bad English.
 
To give an update on this bug: I may have found a solution for it, and I
would like to have opinions on it.
I have found out that Tika, when reading .odt files, returns more than
one document: the first for content.xml, which holds the actual content
of the file, and the second for styles.xml.
To test this, modify an .odt file to remove styles.xml (an .odt is a zip
archive, so any zip tool can delete the entry) and Solr should parse its
content normally.
When Solr receives the second document (styles.xml), it erases everything
it has read before. In general, styles.xml doesn't have any text in it,
so Solr ends up with just some spaces.
I modified the method in 'SolrContentHandler.java' that erases the
content of the previous document. Instead of clearing, it now just
appends a whitespace and keeps any previous content, so every document
Tika returns is accumulated.
I guess this behavior will also work for the previous cases, but I need
your opinion on this.
 
Here is the only modification I made to 'SolrContentHandler.java':
 
  @Override
  public void startDocument() throws SAXException {
    document.clear();
    // Augusto Camarotti - 28-11-2013
    // Tika may parse more than one document per file, so every document it
    // returns has to be appended. Instead of clearing the builder, only a
    // whitespace separator is added; otherwise Solr would keep just the
    // last document of the file.
    // catchAllBuilder.setLength(0);
    catchAllBuilder.append(' ');
    for (StringBuilder builder : fieldBuilders.values()) {
      builder.setLength(0);
    }
    bldrStack.clear();
    bldrStack.add(catchAllBuilder);
  }
 
 
Regards, 
 
Augusto Camarotti

 Alexandre Rafalovitch arafa...@gmail.com 10/05/2013 21:13 
I would try DIH with the flags as in the jira issue I linked to, if
possible. Just in case.

Regards,
Alex
On 10 May 2013 19:53, Sebastián Ramírez
sebastian.rami...@senseta.com
wrote:

 OK Jack, I'll switch to MS Office ...hahaha

 Many thanks for your interest and help... and the bug report in
JIRA.

 Best,

 Sebastián Ramírez


 On Fri, May 10, 2013 at 5:48 PM, Jack Krupansky
j...@basetechnology.com
 wrote:

  I filed SOLR-4809 - OpenOffice document body is not indexed by
  SolrCell, including some test files.
 
  https://issues.apache.org/jira/browse/SOLR-4809
 
  Yeah, at this stage, switching to Microsoft Office seems like the best
  bet!
 
 
  -- Jack Krupansky
 
  -Original Message- From: Sebastián Ramírez
  Sent: Friday, May 10, 2013 6:33 PM
  To: solr-user@lucene.apache.org
  Subject: Re: Tika not extracting content from ODT / ODS (open document /
  libreoffice) in Solr 4.2.1
 
 
  Many thanks, Jack, for your attention and effort in solving the problem.
 
  Best,
 
  Sebastián Ramírez
 
 
  On Fri, May 10, 2013 at 5:23 PM, Jack Krupansky j...@basetechnology.com
  wrote:
 
   I downloaded the latest Apache OpenOffice 3.4.1 and it does in fact fail
  to index the proper content, both for .ODP and .ODT files.
 
  If I do extractOnly=true&extractFormat=text, I see the extracted text
  clearly in addition to the metadata.
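 
  For instance, such a check could look like this (the URL and file name are
  illustrative; extractFormat=text returns plain text instead of XHTML):
 
  curl 'http://localhost:8983/solr/update/extract?extractOnly=true&extractFormat=text' \
    -H 'Content-type: application/vnd.oasis.opendocument.spreadsheet' \
    --data-binary @test_ods.ods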
 
  I tested on 4.3, and then tested on Solr 3.6.1, and it also exhibited the
  problem. I just see spaces in both cases.
 
  But whether the problem is due to Solr or Tika, is not apparent.
 
  In any case, a Jira is warranted.
 
 
  -- Jack Krupansky
 
  -Original Message- From: Sebastián Ramírez
  Sent: Friday, May 10, 2013 11:24 AM
  To: solr-user@lucene.apache.org
  Subject: Tika not extracting content from ODT / ODS (open document /
  libreoffice) in Solr 4.2.1
 
  Hello everyone,
 
  I'm having a problem indexing content from OpenDocument format files,
  the files created with OpenOffice and LibreOffice (odt, ods, ...).
 
  Tika is able to read the files, but Solr is not indexing the content.
 
  It's not a problem of committing or something like that: after I post a
  file, it is indexed and all the metadata is indexed/stored, but the
  content isn't there.
 
 
- I modified the solrconfig.xml file to catch everything:
 
  <requestHandler name="/update/extract" ...>
 
    <!-- here is the interesting part -->
 
    <!-- <str name="uprefix">ignored_</str> -->
    <str name="defaultField">all_txt</str>
- Then I submitted the file to Solr:
 
  curl 'http://localhost:8983/solr/update/extract?commit=true&literal.id=newods' \
    -H 'Content-type: application/vnd.oasis.opendocument.spreadsheet' \
    --data-binary @test_ods.ods
 
 
 
- Now when I do a search in Solr I get this result; there is something
  in the content, but that's not the actual content of the original file:
 
  <result name="response" numFound="1" start="0">
    <doc>
      <str name=...

Re: SolrCell maximum file size

2012-02-06 Thread Augusto Camarotti
Thanks for the tips, Erick. I'm really talking about 2.5 GB files full of data
to be indexed, like .csv, .xls, .ods files and so on. I guess I will try a
large increase in the memory the JVM is able to use.
 
Regards,
 
Augusto

 Erick Erickson erickerick...@gmail.com 1/27/2012 1:22 pm 
Hmmm, I'd go considerably higher than 2.5G. Problem is, the Tika
processing will need memory, I have no idea how much. Then you'll
have a bunch of overhead for Solr to index it, etc.

But I also suspect that this will be about useless to index (assuming
you're talking about lots of data, not, say, just the meta-data associated
with a video or something). How do you provide a meaningful snippet
of such a huge amount of data?

If it *is*, say, a video or whatever, where almost all of the data won't
make it into the index anyway, you're probably better off using
Tika directly on the client and only sending the bits to Solr that
you need in the form of a SolrInputDocument (I'm thinking that
you'll be doing this in SolrJ) rather than transmitting 2.5G over the
network and throwing almost all of it away.

If the entire 2.5G is data to be indexed, you'll probably want to
consider breaking it up into smaller chunks in order to make it
useful.

Best
Erick
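
To illustrate that last point, here is a rough sketch of client-side chunking
with SolrJ: each chunk becomes its own document, so neither the client nor
Solr ever holds the whole 2.5G in memory at once (the Solr URL, field names,
chunk size, and id scheme are illustrative assumptions):

  import java.io.BufferedReader;
  import java.io.FileReader;

  import org.apache.solr.client.solrj.impl.HttpSolrServer;
  import org.apache.solr.common.SolrInputDocument;

  public class ChunkIndexer {

    public static void main(String[] args) throws Exception {
      HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr"); // illustrative URL
      BufferedReader in = new BufferedReader(new FileReader(args[0]));
      StringBuilder chunk = new StringBuilder();
      int part = 0;
      String line;
      while ((line = in.readLine()) != null) {
        chunk.append(line).append('\n');
        if (chunk.length() >= 1000000) {          // ~1 MB of text per document
          addChunk(solr, args[0], part++, chunk.toString());
          chunk.setLength(0);
        }
      }
      if (chunk.length() > 0) {
        addChunk(solr, args[0], part, chunk.toString());
      }
      in.close();
      solr.commit();
      solr.shutdown();
    }

    // One Solr document per chunk, keyed by file name and chunk number.
    private static void addChunk(HttpSolrServer solr, String file, int part, String text)
        throws Exception {
      SolrInputDocument doc = new SolrInputDocument();
      doc.addField("id", file + "#" + part);  // illustrative id scheme
      doc.addField("text", text);
      solr.add(doc);
    }
  }

Searching then matches an individual chunk rather than one gigantic document,
which also makes snippets meaningful.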



Re: SolrCell maximum file size

2012-01-27 Thread Augusto Camarotti
I'm talking about 2 GB files. Does that mean I'll have to allocate something 
bigger than that for the JVM? Something like 2.5 GB?
 
Thanks,
 
Augusto Camarotti

 Erick Erickson erickerick...@gmail.com 1/25/2012 1:48 pm 
Mostly it depends on your container settings; quite often that's
where the limits are. I don't think Solr imposes any restrictions.

What size are we talking about anyway? There are implicit
issues with how much memory parsing the file requires, but you
can allocate lots of memory to the JVM to handle that.

Best
Erick
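
For completeness: Solr's request parsers do have a configurable upload cap in
solrconfig.xml (the stock example may simply set it high enough not to bite).
A sketch of the relevant stanza, with an illustrative ~2 GB value:

  <requestDispatcher>
    <requestParsers enableRemoteStreaming="false"
                    multipartUploadLimitInKB="2097152" /> <!-- ~2 GB, illustrative -->
  </requestDispatcher>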



SolrCell maximum file size

2012-01-24 Thread Augusto Camarotti
Hi everybody
 
Does anyone know if there is a maximum file size that can be uploaded to the 
ExtractingRequestHandler via an HTTP request?
 
Thanks in advance,
 
Augusto Camarotti