q=doc_content?Try q=id:""
Solr Cell and DIH are comparable (in that they are about getting content into
Solr) but "unrelated" to TVRH. TVRH is about inspecting indexed content,
regardless of how it got in.
Erik
> On May 1, 2019, at 3:14 PM, Geoffrey Will
I am using Solr in a web app to extract text from .pdf and .docx files. I was
wondering if I can access the TermFreq and TermPosition vectors via the HTTP
interface exposed by Solr Cell. I'm posting and getting documents fine, and I've
enabled the TV, TFV, etc. in the managed schema:
http://localhost
Several things:
1> Please don’t use add-unknown…. It’s fine for prototyping, but it guesses
field definitions.
2> The solrconfig appears to be malformed; I’m surprised it fires up at all.
This never terminates, for instance:
in the right direction
more quickly than I can.
Here is her original inquiry:
I am pulling data from a local drive for indexing. I am using solr cell and
tika in schemaless mode. I am attempting to rewrite certain field information
prior to indexing using html-strip and regex
The tika.config param is documented here:
https://lucene.apache.org/solr/guide/7_5/uploading-data-with-solr-cell-using-apache-tika.html#configuring-the-solr-extractingrequesthandler
I notice that the code
(https://github.com/apache/lucene-solr/blob/964cc88cee7d62edf03a923e3217809d630af5d5/solr
Robertson, Eric J :
> Hello all,
>
> Currently trying to define a tika config to use when posting a pdf to Solr
> Cell as we may want to override the default tika configuration depending on
> type of document being ingested.
>
> In the docs it lists tika.config as an input param
Hello all,
Currently trying to define a tika config to use when posting a pdf to Solr Cell
as we may want to override the default tika configuration depending on type of
document being ingested.
In the docs it lists tika.config as an input parameter to the Solr Cell
endpoint. Though in my
process can improve the overall stability of the Solr service.
--
Rahul Singh
rahul.si...@anant.us
Anant Corporation
On Apr 25, 2018, 12:49 PM -0400, Shawn Heisey <apa...@elyograg.org>, wrote:
> On 4/25/2018 4:02 AM, Lee Carroll wrote:
> > *We don't recommend using solr-cell for prod
On 4/25/2018 4:02 AM, Lee Carroll wrote:
*We don't recommend using solr-cell for production indexing.*
Ok. Are the reasons for:
Performance. I think we have rather modest index requirement (1000 a day...
on a busy day)
Security. The index workflow is: upload files to a public-facing server
greed. The app will have a few implementations for storing the binary
file. Easiest for a user to configure for prototyping would be the
store-in-index impl. A live impl would probably be fs
On 4/24/2018 10:26 AM, Lee Carroll wrote:
> Does the solr cell contrib give access to the files raw content along with
> the extracted metadata?
That's not usually the kind of information you want to have in a Solr
index. Most of the time, there will be an entry in the Solr index that
Does the solr cell contrib give access to the files raw content along with
the extracted metadata?
cheers Lee C
(Solr Cell) to extract content from HTML document
instead of Solr's MostlyPassthroughHtmlMapper ?
As a bonus here's a Dropwizard Tika wrapper that gives you a Tika web service
https://github.com/mattflax/dropwizard-tika-server written by a colleague of
mine at Flax. Hope this is useful.
Cheers
> > We should add a chatbot to the list that includes Charlie's advice and
> the link to Erick's blog post whenever Tika is used.
> >
> >
> > -Original Message-
> > From: Charlie Hull [mailto:char...@flax.co.uk]
> > Sent: Monday, April 9, 2018 12:44 PM
>
> -Original Message-
> From: Charlie Hull [mailto:char...@flax.co.uk]
> Sent: Monday, April 9, 2018 12:44 PM
> To: solr-user@lucene.apache.org
> Subject: Re: How to use Tika (Solr Cell) to extract content from HTML
> document instead
Oh this is great! Saves me a whole bunch of manual work.
Thanks!
-Original Message-
From: Charlie Hull [mailto:char...@flax.co.uk]
Sent: Monday, April 09, 2018 2:15 PM
To: solr-user@lucene.apache.org
Subject: [EXT] Re: How to use Tika (Solr Cell) to extract content from HTML
document
Thank you Charlie, Tim.
I will integrate Tika in my Java app and use SolrJ to send data to Solr.
-Original Message-
From: Allison, Timothy B. [mailto:talli...@mitre.org]
Sent: Monday, April 09, 2018 11:24 AM
To: solr-user@lucene.apache.org
Subject: [EXT] RE: How to use Tika (Solr Cell
PM
To: solr-user@lucene.apache.org
Subject: Re: How to use Tika (Solr Cell) to extract content from HTML document
instead of Solr's MostlyPassthroughHtmlMapper ?
I'd recommend you run Tika externally to Solr, which will allow you to
catch this kind of problem and prevent it bringing down your Solr
installation.
Cheers
Charlie
On 9 April 2018 at 16:59, Hanjan, Harinder
wrote:
> Hello!
>
> Solr (i.e. Tika) throws a "zip bomb"
Hello!
Solr (i.e. Tika) throws a "zip bomb" exception with certain documents we have
in our SharePoint system. I have used the tika-app.jar directly to extract the
document in question, and it does _not_ throw an exception; it extracts the
contents just fine. So it would seem Solr is doing
Hi all, I have been having an issue with Solr, using the
>> ExtractingRequestHandler. Basically, when indexing a PDF (for
>> example) I get all the metadata mixed into the "content" field along
>> with the content. See:
>> <https://stackoverflow.com/questions/479342
field along
> with the content. See:
> <https://stackoverflow.com/questions/47934257/importing-files-with-solr-cell-tika-is-mixing-metadata-fields-with-content>
> for the gory details.
>
> I'm guessing this is the same basic issue as
> <https://issues.apache.org/jira/bro
Hi all, I have been having an issue with Solr, using the
ExtractingRequestHandler. Basically, when indexing a PDF (for
example) I get all the metadata mixed into the "content" field along
with the content. See:
<https://stackoverflow.com/questions/47934257/importing-files-with-s
<gilhulys...@gmail.com> wrote:
> Hello,
>
> I have been successfully able to index archive files (zip, tar, and the
> like) using solr cell, but the archive is returned as a single document
> when I do queries. Is there a way to configure it so that files are
> extracted recursively
Hello,
I have been successfully able to index archive files (zip, tar, and the
like) using solr cell, but the archive is returned as a single document
when I do queries. Is there a way to configure it so that files are
extracted recursively, and indexed separately?
I know that if I set
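Solr Cell treats the archive as a single stream, so one workaround is to unpack the archive client-side and post each entry as its own document. A minimal sketch of the unpacking half, plain JDK only (class and method names are illustrative, and the actual post to /update/extract is left out):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

public class ZipWalker {
    // Collect the names of the files inside a zip; each entry could then be
    // streamed to /update/extract as a separate document.
    public static List<String> entryNames(byte[] zipBytes) throws IOException {
        List<String> names = new ArrayList<>();
        try (ZipInputStream zin = new ZipInputStream(new ByteArrayInputStream(zipBytes))) {
            ZipEntry e;
            while ((e = zin.getNextEntry()) != null) {
                if (!e.isDirectory()) names.add(e.getName());
            }
        }
        return names;
    }

    // Helper that builds a tiny in-memory zip for demonstration.
    public static byte[] sampleZip() throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ZipOutputStream zos = new ZipOutputStream(bos)) {
            zos.putNextEntry(new ZipEntry("a.txt"));
            zos.write("hello".getBytes());
            zos.closeEntry();
            zos.putNextEntry(new ZipEntry("b.txt"));
            zos.write("world".getBytes());
            zos.closeEntry();
        }
        return bos.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        System.out.println(entryNames(sampleZip())); // [a.txt, b.txt]
    }
}
```

Each name (plus its bytes) would then be indexed on its own, giving one Solr document per archived file.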
I have a MySQL table with over 300M blog articles. The records are in HTML
format. Is it possible to import these records using only Solr Cell + Tika + DIH
into some Solr with a schema? I mean, when importing, can I map the schema in
MySQL to the schema in Solr?
scott.chu,scott@udngroup.com
2016/5/20 (週五)
yyyy-MM-dd hh:mm:ss
yyyy-MM-dd HH:mm:ss
EEE MMM d hh:mm:ss z yyyy
EEE, dd MMM yyyy HH:mm:ss zzz
EEEE, dd-MMM-yy HH:mm:ss zzz
EEE MMM d HH:mm:ss yyyy
See:
https://cwiki.apache.org/confluence/display/solr/Uploading+Data+with+Solr+Cell+using+Apache+Tika
-- Jack Krupansky
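The patterns above are java.text.SimpleDateFormat patterns. A quick sketch of how one of them behaves in plain Java, independent of Solr (the class name and sample timestamp are made up for illustration):

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Locale;
import java.util.TimeZone;

public class DateFormatsDemo {
    // Parse a timestamp with one of the patterns from the list above and
    // return it as milliseconds since the epoch.
    public static long parseMillis(String value) throws ParseException {
        SimpleDateFormat fmt = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss", Locale.ROOT);
        fmt.setTimeZone(TimeZone.getTimeZone("UTC"));
        return fmt.parse(value).getTime();
    }

    public static void main(String[] args) throws ParseException {
        System.out.println(parseMillis("1970-01-01 00:00:05")); // 5000 (5 s past epoch)
    }
}
```

A date string that matches none of the configured patterns is the usual reason a field fails to index as a date.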
-Original Message-
From: ienjreny
Sent
I'm trying to index a set of PDF documents with Solr 4.5.0. So far I can
get Solr to ingest the entire document as one long string, stored in the
index as content. However, I want to index structure within the documents.
I know that the ExtractingRequestHandler uses Apache Tika to convert the
You can have a look here:
http://solr.pl/en/2011/04/04/indexing-files-like-doc-pdf-solr-and-tika-integration/
2013/10/10 Peter Bleackley bleackl...@zooey.co.uk
I'm trying to index a set of PDF documents with Solr 4.5.0. So far I can
get Solr to ingest the entire document as one long string,
Thanks Erick, This is how I was doing it but when I saw the Solr Cell
stuff I figured I'd give it a go. What I ended up doing is the following
ModifiableSolrParams params = indexer.index(artifact);
params.add("fmap.content", "my_custom_field");
params.add("extractFormat", "text"
program with indexing from a DB mixed in, but
it shouldn't be hard at all to pull the DB parts out.
http://searchhub.org/dev/2012/02/14/indexing-with-solrj/
FWIW,
Erick
On Thu, Sep 5, 2013 at 5:28 PM, Jamie Johnson jej2...@gmail.com wrote:
Is it possible to configure solr cell to only extract
Is it possible to configure solr cell to only extract and store the body of
a document when indexing? I'm currently doing the following which I
thought would work
ModifiableSolrParams params = new ModifiableSolrParams();
params.set("defaultField", "content");
params.set("xpath", "/xhtml:html
Niklas,
In Linux, the API for watching for filesystem changes is called
inotify. You'd need to write something to listen to those events and
react accordingly.
Here's a brief discussion about it:
http://stackoverflow.com/questions/4062806/inotify-how-to-use-it-linux
Michael Della Bitta
Take a look at ManifoldCF, which has a file system crawler that can track
changed files.
-- Jack Krupansky
-Original Message-
From: Niklas Langvig
Sent: Friday, March 15, 2013 11:10 AM
To: solr-user@lucene.apache.org
Subject: solr cell
We have all our documents (doc, docx, pdf
Another options similar to this would be the new file system
WatchService available in java 7:
http://docs.oracle.com/javase/tutorial/essential/io/notification.html
Arcadius.
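The WatchService approach from the tutorial above boils down to registering a directory and then reacting to events. A minimal registration sketch in plain JDK (names are illustrative; the event-polling loop and the actual re-indexing step are left out):

```java
import java.io.IOException;
import java.nio.file.FileSystems;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardWatchEventKinds;
import java.nio.file.WatchKey;
import java.nio.file.WatchService;

public class DirWatcher {
    // Register a directory for create/modify/delete events and report
    // whether the registration took effect. A real watcher would then loop
    // on watcher.take() and re-submit changed files to Solr.
    public static boolean register(Path dir) throws IOException {
        WatchService watcher = FileSystems.getDefault().newWatchService();
        WatchKey key = dir.register(watcher,
                StandardWatchEventKinds.ENTRY_CREATE,
                StandardWatchEventKinds.ENTRY_MODIFY,
                StandardWatchEventKinds.ENTRY_DELETE);
        boolean ok = key.isValid();
        watcher.close();
        return ok;
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("watch-demo");
        System.out.println(register(dir)); // true
    }
}
```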
On 15 March 2013 15:22, Michael Della Bitta
michael.della.bi...@appinions.com wrote:
Niklas,
In Linux, the API for
Hi Chris, thank you for replying. My content field in the schema is
stored=true and indexed=false because I am copying the content field
into the text field, which is indexed=true by default.
My query was that I am able to search the HTML documents I had
fed to Solr, but as the results
: Hi everyone, i am new to solr technology and not getting a way to get back
: the original HTML document with Hits highlighted into it. what
: configuration and where i can do to instruct SolrCell/ Tika so that it does
: not strips down the tags of HTML document in the content field.
I _think_
- From: Divyanand Tiwari
Sent: Monday, February 18, 2013 10:52 PM
To: solr-user@lucene.apache.org
Subject: Re: How can i instruct the Solr/ Solr Cell to output the original
HTML document which was fed to it.?
Thank you for replying sir !!!
I have two queries related with this -
1) So
for
highlighting.
See:
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.HTMLStripCharFilterFactory
-- Jack Krupansky
-Original Message-
From: Divyanand Tiwari
Sent: Monday, February 18, 2013 7:28 AM
To: solr-user@lucene.apache.org
Subject: How can i instruct the Solr/ Solr
Thank you for replying sir !!!
I have two queries related with this -
1) So in this case, which request handler do I have to use? Because
'ExtractingRequestHandler' by default strips the HTML content, and the
default handler 'UpdateRequestHandler' does not accept HTML content.
2) How can I
Thank you Erick for your response.
I've already tried what you've suggested and got some out-of-memory
exceptions. Because of this I like the Solr Cell solution, where I can
send the files directly to Solr via a stream and don't collect them in my
memory.
And another question that came to my
, 2012 at 5:23 AM, johannes.schwendin...@blum.com wrote:
Thank you Erick for your respone,
I've already tried what you've suggested and got some out of memory
exceptions. Because of this i like the solution with solr Cell where i can
send the file directly to solr via stream and don't collect
The difference with Solr Cell is that I'm sending every single document
to Solr Cell and don't collect them until I have a couple of them in my
memory.
I'm using mainly the code from here:
http://wiki.apache.org/solr/ExtractingRequestHandler#SolrJ
Erick Erickson erickerick...@gmail.com wrote:
with Solr Cell to index files to Solr. During
this some questions came up.
1. Is it possible (and wise) to connect to Solr Cell with multiple threads
at the same time, to index several documents at the same time?
This question came up because my program takes about 6 hours to index
around 35,000
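On question 1: the usual way to cut that wall-clock time is to fan the per-file posts across a small thread pool. A sketch of the fan-out alone, with the real HTTP call to /update/extract stubbed as a placeholder (all names here are illustrative):

```java
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class ParallelIndexer {
    // Submit one "index this file" task per document and wait for the pool
    // to drain; returns how many tasks completed.
    public static int indexAll(List<String> files, int threads) throws InterruptedException {
        AtomicInteger done = new AtomicInteger();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (String f : files) {
            pool.submit(() -> {
                indexOne(f);
                done.incrementAndGet();
            });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.MINUTES);
        return done.get();
    }

    private static void indexOne(String file) {
        // Placeholder: here you would stream `file` to /update/extract.
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(indexAll(List.of("a.pdf", "b.pdf", "c.pdf"), 2)); // 3
    }
}
```

Whether this is wise depends on the server: each concurrent request ties up a Tika parse on the Solr side, so a handful of threads is usually the right order of magnitude.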
as a
separate process) to minimize thread issues, GC issues, hung parsers, etc.
-- Jack Krupansky
-Original Message-
From: Alexandre Rafalovitch
Sent: Tuesday, September 25, 2012 10:24 AM
To: solr-user@lucene.apache.org
Subject: Re: Solr Cell Questions
Are you by any chance committing
...
FWIW,
Erick
On Tue, Sep 25, 2012 at 10:04 AM, johannes.schwendin...@blum.com wrote:
The difference with solr cell is, that i'am sending every single document
to solr cell and don't collect them until i have a couple of them in my
memory.
Using mainly the code form here:
http
Hi,
I'm currently experimenting with Solr Cell to index files to Solr. During
this some questions came up.
1. Is it possible (and wise) to connect to Solr Cell with multiple threads
at the same time, to index several documents at the same time?
This question came up because my program takes
Best
Erick
On Mon, Sep 24, 2012 at 10:04 AM, johannes.schwendin...@blum.com wrote:
Hi,
Im currently experimenting with Solr Cell to index files to Solr. During
this some questions came up.
1. Is it possible (and wise) to connect to Solr Cell with multiple Threads
at the same time to index
@lucene.apache.org
Subject: Re: Indexing PDF-Files using Solr Cell
Thank you for your response.
I'm writing my Bachelor's thesis about Solr and my company doesn't want me to
use a beta version.
I don't want to be annoying, but how do I direct the content to a stored
field and so on... in the URL I use
Hello *,
I've got a problem indexing and searching PDF files.
It seems like Solr doesn't index the name of the file.
In return I only get:
<result name="response" numFound="1" start="0"><doc>
  <str name="author">A28240</str>
  <arr name="content_type"><str>application/pdf</str></arr>
  <str name="id">doc5</str><date
-Original Message- From: Alexander Troost
Sent: Sunday, September 16, 2012 10:16 PM
To: solr-user@lucene.apache.org
Subject: Indexing PDF-Files using Solr Cell
Hello *,
I've got a problem indexing and searching PDF-Files.
It seems like Solr doesn't index the name of the file
-
From: Alexander Troost
Sent: Sunday, September 16, 2012 11:59 PM
To: solr-user@lucene.apache.org
Subject: Re: Indexing PDF-Files using Solr Cell
Hi, first of all: Thank you for that quick response!
But i am not sure if i am doing this right.
For my point of view the command now has to look like
It's pretty easy to accidentally run into the AWT stuff if you're
doing anything that involves image processing, which I would expect a
generic RTF parser might do.
Michael Della Bitta
Appinions | 18 East 41st St., Suite 1806 | New York, NY 10017
The backstory here is that Tika uses a library that for some crazy
reason is inside the Java AWT graphics toolkit. (I think the RTF
parser?)
On Wed, Aug 15, 2012 at 5:57 AM, Ahmet Arslan iori...@yahoo.com wrote:
You can try passing
-Djava.awt.headless=true as one of the arguments
when you
When I send a scanned pdf to extraction request
handler, below icon appears in my Dock.
http://tinypic.com/r/2mpmo7o/6
http://tinypic.com/r/28ukxhj/6
I found that text-extractable PDF files trigger the weird icon above, too.
curl
Ahmet,
the dock icon appears when AWT starts, e.g. when a font is loaded.
You can prevent it using the headless mode but this is likely to trigger an
exception.
Same if your user is not UI-logged-in.
hope it helps.
Paul
On 15 August 2012 at 01:30, Ahmet Arslan wrote:
Hi All,
I have set
the dock icon appears when AWT starts, e.g. when a font is
loaded.
You can prevent it using the headless mode but this is
likely to trigger an exception.
Same if your user is not UI-logged-in.
Hi Paul, thanks for the explanation. So is it nothing to worry about?
On 15 August 2012 at 13:03, Ahmet Arslan wrote:
Hi Paul, thanks for the explanation. So is it nothing to worry about?
it is nothing to worry about except to remember that you can't run this step in
a daemon-like process.
(on Linux, I had to set-up a VNC-server for similar tasks)
paul
You can try passing -Djava.awt.headless=true as one of the arguments
when you start Jetty to see if you can get this to go away with no ill
effects.
Michael Della Bitta
Appinions | 18 East 41st St., Suite 1806 | New York, NY 10017
You can try passing
-Djava.awt.headless=true as one of the arguments
when you start Jetty to see if you can get this to go away
with no ill
effects.
I started Jetty using 'java -Djava.awt.headless=true -jar start.jar' and
successfully indexed two PDF files. That icon didn't appear :)
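Setting the same property programmatically, before any AWT class loads, has the same effect as the -D flag, and it can be sanity-checked from code. A small sketch (the class name is illustrative):

```java
import java.awt.GraphicsEnvironment;

public class HeadlessCheck {
    // Setting java.awt.headless before the first AWT class initializes is
    // equivalent to passing -Djava.awt.headless=true on the command line;
    // isHeadless() reports what AWT will actually do.
    public static boolean forceHeadless() {
        System.setProperty("java.awt.headless", "true");
        return GraphicsEnvironment.isHeadless();
    }

    public static void main(String[] args) {
        System.out.println(forceHeadless());
    }
}
```

Note the caveat earlier in the thread: if a parser genuinely needs a display (e.g. for font metrics), headless mode may turn the Dock icon into an exception instead.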
Hi All,
I have a set of rich documents. Some of them are scanned PDF files. When I send a
scanned PDF to the extraction request handler, the icon below appears in my Dock.
http://tinypic.com/r/2mpmo7o/6
http://tinypic.com/r/28ukxhj/6
Does anyone know what this is?
curl
a normal PDF for comparison.
-- Jack Krupansky
-Original Message-
From: Ahmet Arslan
Sent: Tuesday, August 14, 2012 7:30 PM
To: solr-user@lucene.apache.org
Subject: scanned pdf with solr cell
Hi All,
I have a set of rich documents. Some of them are scanned PDF files. When I
send
Hi,
I'm using Solr Cell (SolrJ) to index plain text files, but am encountering an
IllegalCharsetNameException. Could you please point out if anything should
be added in the schema.xml file? I could index the other MIME types
fine. I gave the field type as text:
<fieldType name="text" class
Hi John,
See discussion about the issue of indexing contents of ZIP files:
https://issues.apache.org/jira/browse/SOLR-2416
Depending on your use case, you may be able to write a Tika parser which
handles your specific case, such as uncompressing a GZIP file and using
AutoDetect on its
Is it possible to extract content for file types that Tika doesn’t support
without changing and rebuilding Tika? Do I need to specify a tika.config
file in the solrconfig.xml file, and if so, what is the format of that file?
One example that I’m trying to solve is for a document management
: Caused by: org.apache.solr.common.SolrException: Error loading class
'solr.extraction.ExtractingRequestHandler'
:
: With the jetty and the provided example, I have no problem. It all happens
when I use tomcat and solr.
:
: My setup is as follows:
:
: I downloaded the apache-solr-3.3.0 and
-3.3.0.war and copied everything from the
contrib/extraction/lib into dist.
I would greatly appreciate it if you can possibly point me to the right
direction. I have read everything on the wiki page and the documentation but no
luck!
Can you please guide me through step-by-step usage of Solr Cell
I'm batching documents into solr using solr cell with the 'stream.url'
parameter. Everything is working fine until I get to about 5k documents
in and then it starts issuing 'read timeout 500' errors on every document.
The sysadmin says there's plenty of CPU, memory, and no paging so
I am new to both Solr and Cell, so sorry if I am misusing some of the
terminologies. So the problem I am trying to solve is to index a PDF document
using Solr Cell where I want to exclude part of it via XPATH. I am using Solr
release 3.1. When researching the user list, I came across one entry
Hello everyone,
I have just started extracting information from files with Solr Cell. Some of
the files we are indexing are large and have much content. I would like to
limit the amount of data I index to a specified limit of characters (for
example, 300 chars) which I will use as a document
information from files with Solr Cell. Some of
the files we are indexing are large, and have much content. I would like to
limit the amount of data I index to a specified limit of characters (example
300 chars) which I will use as a document preview. Is this possible to set as
a parameter
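One schema-side option worth checking is a copyField with a maxChars attribute; failing that, the truncation can be done client-side before indexing. A trivial sketch of the client-side version (names are illustrative):

```java
public class PreviewField {
    // Truncate extracted text to a fixed-length preview, guarding against
    // null and short content so substring() never throws.
    public static String preview(String content, int maxChars) {
        if (content == null) return "";
        return content.length() <= maxChars ? content : content.substring(0, maxChars);
    }

    public static void main(String[] args) {
        System.out.println(preview("abcdefghij", 5)); // abcde
    }
}
```

The truncated string would be sent as a separate preview field alongside the full content field.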
Hello everyone,
We have our infrastructure on Amazon cloud servers, and we use the S3 file
system. We need to index files using Solr Cell. From what I have read, we need
to stream files to Solr in order for it to extract the metadata into the index.
If we stream data through a public url
AS - www.cominvent.com
On 30. mai 2011, at 22.46, Greg Georges wrote:
Hello everyone,
We have our infrastructure on Amazon cloud servers, and we use the S3 file
system. We need to index files using Solr Cell. From what I have read, we
need to stream files to Solr in order for it to extract the metadata
Hi,
I have a question about Solr Cell, please.
I index some files. For example, if I want to extract the filename, then use
a hash function on it like MD5 and then store it in Solr; is the correct way
to use Tika "manually" to extract the metadata I want, do the
transformations
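The hashing step itself needs nothing beyond the JDK, wherever in the pipeline it ends up (a Tika wrapper, an UpdateRequestProcessor, or client code). A sketch of MD5-ing a filename to a hex string (class and method names are illustrative):

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class FileNameHasher {
    // Hex-encoded MD5 of a filename, suitable for storing as a Solr field.
    public static String md5Hex(String fileName) throws NoSuchAlgorithmException {
        byte[] digest = MessageDigest.getInstance("MD5")
                .digest(fileName.getBytes(StandardCharsets.UTF_8));
        StringBuilder sb = new StringBuilder();
        for (byte b : digest) sb.append(String.format("%02x", b & 0xff));
        return sb.toString();
    }

    public static void main(String[] args) throws NoSuchAlgorithmException {
        System.out.println(md5Hex("hello")); // 5d41402abc4b2a76b9719d911017c592
    }
}
```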
Cores with Solr Cell for indexing documents
Sounds like the Tika jar is not on the class path. Add it to a directory
where Solr's looking for libs.
On Thursday 24 March 2011 16:24:17 Brandon Waterloo wrote:
Hello everyone,
I've been trying for several hours now to set up Solr
Jelsma [markus.jel...@openindex.io]
Sent: Friday, March 25, 2011 1:23 PM
To: solr-user@lucene.apache.org
Cc: Upayavira
Subject: Re: Multiple Cores with Solr Cell for indexing documents
You can only set properties for a lib dir that must be used in solrconfig.xml.
You can use sharedLib in solr.xml
@lucene.apache.org
Cc: Upayavira
Subject: Re: Multiple Cores with Solr Cell for indexing documents
You can only set properties for a lib dir that must be used in solrconfig.xml.
You can use sharedLib in solr.xml though.
There's options in solr.xml that point to lib dirs. Make sure you get
them
Hello everyone,
I've been trying for several hours now to set up Solr with multiple cores with
Solr Cell working on each core. The only items being indexed are PDF, DOC, and
TXT files (with the possibility of expanding this list, but for now, just
assume the only things in the index should
Sounds like the Tika jar is not on the class path. Add it to a directory where
Solr's looking for libs.
On Thursday 24 March 2011 16:24:17 Brandon Waterloo wrote:
Hello everyone,
I've been trying for several hours now to set up Solr with multiple cores
with Solr Cell working on each core
From: Markus Jelsma [markus.jel...@openindex.io]
Sent: Thursday, March 24, 2011 11:29 AM
To: solr-user@lucene.apache.org
Cc: Brandon Waterloo
Subject: Re: Multiple Cores with Solr Cell for indexing documents
Sounds like the Tika jar is not on the class path. Add it to a directory where
Solr's looking
Hi,
I'm using Solr 1.4.1.
The scenario involves user uploading multiple files. These have content
extracted using SolrCell, then indexed by Solr along with other information
about the user.
ContentStreamUpdateRequest seemed like the right choice for this - use
addFile() to send file data, and
In case the exact problem was not clear to somebody:
The problem with FileUpload interpreting file data as regular form fields is
that, Solr thinks there are no content streams in the request and throws a
missing_content_stream exception.
On Thu, Mar 10, 2011 at 10:59 AM, Karthik Shiraly
Working with the latest Solr trunk code, it seems the Tika handlers
for Solr Cell (ExtractingDocumentLoader.java) and the Data Import Handler
(TikaEntityProcessor.java) fail to index zip file contents again.
They just index the file names again.
This issue was addressed some time back, late last
/2009-09/msg00037.html
Looking at my libraries, it seems I am using PDFBox 0.7.3. I am using Maven
for building, and PDFBox 0.7.3 appears to have come from the tika-parsers 0.4
POM file, which in turn appears to have come from the solr-cell 1.4.0 POM file.
In my project's Maven POM file I have the following
Hi,
I'm trying to index a txt file (~150 MB) using Solr Cell/Tika.
The curl command aborts due to a java.lang.OutOfMemoryError:
java.lang.OutOfMemoryError: Java heap space
at java.util.Arrays.copyOfRange(Arrays.java:3209
On Thu, 12 Aug 2010 14:32:19 +0200
Lannig Carina lan...@ssi-schaefer-noell.com wrote:
Hi,
I'm trying to index a txt-File (~150MB) using Solr Cell/Tika.
The curl command aborts due to a java.lang.OutOfMemoryError.
[...]
AFAIK Tika keeps the whole file in RAM and posts it as one single
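That buffering behavior is the crux: anything that reads the whole 150 MB into one array blows the heap, while a loop over a fixed-size buffer keeps memory bounded. A plain-JDK sketch of the bounded-buffer pattern (it doesn't change what Tika does server-side; names are illustrative):

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

public class ChunkedCounter {
    // Count characters while reading through a fixed-size buffer, so memory
    // use stays O(buffer size) instead of O(file size).
    public static long countChars(Reader source) throws IOException {
        char[] buf = new char[8192];
        long total = 0;
        try (BufferedReader in = new BufferedReader(source)) {
            int n;
            while ((n = in.read(buf)) != -1) total += n;
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        System.out.println(countChars(new StringReader("hello world"))); // 11
    }
}
```

For the Solr Cell case the practical fixes are raising the JVM heap (-Xmx) or pre-processing/splitting such files client-side before posting.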
: Subject: Indexing large files using Solr Cell causes OutOfMemory error
: References: aanlktinfbtudv4lpjh40vjzderto1-dn7gztnjxfv...@mail.gmail.com
: In-Reply-To: aanlktinfbtudv4lpjh40vjzderto1-dn7gztnjxfv...@mail.gmail.com
http://people.apache.org/~hossman/#threadhijack
Thread Hijacking
: Alessandro Benedetti [mailto:benedetti.ale...@gmail.com]
Sent: Tuesday, July 27, 2010 6:09 AM
To: solr-user@lucene.apache.org
Subject: Re: Extracting PDF text/comment/callout/typewriter boxes with Solr
CELL/Tika/PDFBox
Hi Jon,
Over the last few days we faced the same problem.
Using Solr 1.4.1
-Original Message-
From: Tommaso Teofili [mailto:tommaso.teof...@gmail.com]
Sent: Wednesday, July 28, 2010 3:31 AM
To: solr-user@lucene.apache.org
Subject: Re: Extracting PDF text/comment/callout/typewriter boxes with Solr
CELL/Tika/PDFBox
I attached a patch for Solr 1.4.1 release
@lucene.apache.org
Subject: Re: Extracting PDF text/comment/callout/typewriter boxes with Solr
CELL/Tika/PDFBox
I attached a patch for Solr 1.4.1 release on
https://issues.apache.org/jira/browse/SOLR-1902 that made things work for
me.
This strange behaviour for me was due to the fact that I copied