Re: Term Freq Vector with SOLR cell?

2019-05-01 Thread Erik Hatcher
q=doc_content?Try q=id:"" Solr Cell and DIH are comparable (in that they are about getting content into Solr) but "unrelated" to TVRH. TVRH is about inspecting indexed content, regardless of how it got in. Erik > On May 1, 2019, at 3:14 PM, Geoffrey Will

Term Freq Vector with SOLR cell?

2019-05-01 Thread Geoffrey Willis
I am using Solr in a web app to extract text from .pdf, and docx files. I was wondering if I can access the TermFreq and TermPosition vectors via the HTTP interface exposed by Solr Cell. I’m posting/getting documents fine, I’ve enabled the TV, TFV etc in the managed schema: http://localhost
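
For readers following this thread, a minimal sketch of a Term Vector Component request is shown here. It only builds the URL. It assumes a `/tvrh` handler is registered in solrconfig.xml (as in Solr's sample configs) and that the field was indexed with term vectors enabled; the base URL, the core name `docs`, and the field name are placeholders, not values from the original post.

```python
from urllib.parse import urlencode


def tvrh_url(base, core, doc_id, field):
    """Build a Term Vector Component request URL for a single document.

    Assumes a /tvrh request handler exists and that `field` has
    termVectors (and termPositions) enabled in the schema.
    """
    params = {
        "q": f'id:"{doc_id}"',   # select one document by its unique key
        "tv": "true",            # switch the term vector component on
        "tv.fl": field,          # field(s) to return vectors for
        "tv.tf": "true",         # include term frequencies
        "tv.positions": "true",  # include term positions
        "wt": "json",
    }
    return f"{base}/{core}/tvrh?{urlencode(params)}"


if __name__ == "__main__":
    print(tvrh_url("http://localhost:8983/solr", "docs", "doc5", "doc_content"))
```

The point made in the reply holds here too: the request inspects indexed content, so it works the same way whether the document arrived through Solr Cell, DIH, or a plain update.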

Re: Solr Cell, Tika and UpdateProcessorChains

2019-02-21 Thread Erick Erickson
Several things: 1> Please don’t use add-unknown…. It’s fine for prototyping, but guesses field definitions. 2> the solrconfig appears to be malformed, I’m surprised it fires up at all. This never terminates for instance:

Solr Cell, Tika and UpdateProcessorChains

2019-02-21 Thread Demian Katz
in the right direction more quickly than I can. Here is her original inquiry: I am pulling data from a local drive for indexing. I am using solr cell and tika in schemaless mode. I am attempting to rewrite certain field information prior to indexing using html-strip and regex

Re: Solr Cell Input Parameter tika.config

2018-11-07 Thread Jan Høydahl
The tika.config param is documented here: https://lucene.apache.org/solr/guide/7_5/uploading-data-with-solr-cell-using-apache-tika.html#configuring-the-solr-extractingrequesthandler I notice that the code (https://github.com/apache/lucene-solr/blob/964cc88cee7d62edf03a923e3217809d630af5d5/solr

Re: Solr Cell Input Parameter tika.config

2018-10-25 Thread Yasufumi Mizoguchi
Robertson, Eric J : > Hello all, > > Currently trying to define a tika config to use when posting a pdf to Solr > Cell as we may want to override the default tika configuration depending on > type of document being ingested. > > In the docs it lists tika.config as an input param

Solr Cell Input Parameter tika.config

2018-10-25 Thread Robertson, Eric J
Hello all, Currently trying to define a tika config to use when posting a pdf to Solr Cell as we may want to override the default tika configuration depending on type of document being ingested. In the docs it lists tika.config as an input parameter to the Solr Cell endpoint. Though in my

Re: solr cell: write entire file content binary to index along with metadata

2018-04-25 Thread Rahul Singh
process can improve the overall stability of the SolR service. -- Rahul Singh rahul.si...@anant.us Anant Corporation On Apr 25, 2018, 12:49 PM -0400, Shawn Heisey <apa...@elyograg.org>, wrote: > On 4/25/2018 4:02 AM, Lee Carroll wrote: > > *We don't recommend using solr-cell for prod

Re: solr cell: write entire file content binary to index along with metadata

2018-04-25 Thread Shawn Heisey
On 4/25/2018 4:02 AM, Lee Carroll wrote: *We don't recommend using solr-cell for production indexing.* Ok. Are the reasons for: Performance. I think we have rather modest index requirement (1000 a day... on a busy day) Security. The index workflow is, upload files to public facing server

Re: solr cell: write entire file content binary to index along with metadata

2018-04-25 Thread Lee Carroll
greed. The app will have a few implementations for storing the binary file. Easiest for a user to configure for proto-typing would be store in index impl. A live impl would probably be fs *We don't recommend using solr-cell for production indexing.* Ok. Are the reasons for: Performance. I thi

Re: solr cell: write entire file content binary to index along with metadata

2018-04-24 Thread Shawn Heisey
On 4/24/2018 10:26 AM, Lee Carroll wrote: > Does the solr cell contrib give access to the files raw content along with > the extracted metadata? That's not usually the kind of information you want to have in a Solr index.  Most of the time, there will be an entry in the Solr index that

solr cell: write entire file content binary to index along with metadata

2018-04-24 Thread Lee Carroll
Does the solr cell contrib give access to the files raw content along with the extracted metadata? cheers Lee C

RE: How to use Tika (Solr Cell) to extract content from HTML document instead of Solr's MostlyPassthroughHtmlMapper ?

2018-04-12 Thread Allison, Timothy B.
(Solr Cell) to extract content from HTML document instead of Solr's MostlyPassthroughHtmlMapper ? As a bonus here's a Dropwizard Tika wrapper that gives you a Tika web service https://github.com/mattflax/dropwizard-tika-server written by a colleague of mine at Flax. Hope this is useful. Cheers

Re: How to use Tika (Solr Cell) to extract content from HTML document instead of Solr's MostlyPassthroughHtmlMapper ?

2018-04-10 Thread David Hastings
> > We should add a chatbot to the list that includes Charlie's advice and > the link to Erick's blog post whenever Tika is used.  > > > > > > -Original Message- > > From: Charlie Hull [mailto:char...@flax.co.uk] > > Sent: Monday, April 9, 2018 12:44 PM >

Re: How to use Tika (Solr Cell) to extract content from HTML document instead of Solr's MostlyPassthroughHtmlMapper ?

2018-04-10 Thread Alexandre Rafalovitch
st whenever Tika is used.  > > > -Original Message- > From: Charlie Hull [mailto:char...@flax.co.uk] > Sent: Monday, April 9, 2018 12:44 PM > To: solr-user@lucene.apache.org > Subject: Re: How to use Tika (Solr Cell) to extract content from HTML > document instead

RE: [EXT] Re: How to use Tika (Solr Cell) to extract content from HTML document instead of Solr's MostlyPassthroughHtmlMapper ?

2018-04-09 Thread Hanjan, Harinder
Oh this is great! Saves me a whole bunch of manual work. Thanks! -Original Message- From: Charlie Hull [mailto:char...@flax.co.uk] Sent: Monday, April 09, 2018 2:15 PM To: solr-user@lucene.apache.org Subject: [EXT] Re: How to use Tika (Solr Cell) to extract content from HTML document

Re: How to use Tika (Solr Cell) to extract content from HTML document instead of Solr's MostlyPassthroughHtmlMapper ?

2018-04-09 Thread Charlie Hull
whenever Tika is used.  > > > > > > -Original Message- > > From: Charlie Hull [mailto:char...@flax.co.uk] > > Sent: Monday, April 9, 2018 12:44 PM > > To: solr-user@lucene.apache.org > > Subject: Re: How to use Tika (Solr Cell) to extract content

RE: How to use Tika (Solr Cell) to extract content from HTML document instead of Solr's MostlyPassthroughHtmlMapper ?

2018-04-09 Thread Hanjan, Harinder
Thank you Charlie, Tim. I will integrate Tika in my Java app and use SolrJ to send data to Solr. -Original Message- From: Allison, Timothy B. [mailto:talli...@mitre.org] Sent: Monday, April 09, 2018 11:24 AM To: solr-user@lucene.apache.org Subject: [EXT] RE: How to use Tika (Solr Cell

RE: How to use Tika (Solr Cell) to extract content from HTML document instead of Solr's MostlyPassthroughHtmlMapper ?

2018-04-09 Thread Allison, Timothy B.
PM To: solr-user@lucene.apache.org Subject: Re: How to use Tika (Solr Cell) to extract content from HTML document instead of Solr's MostlyPassthroughHtmlMapper ? I'd recommend you run Tika externally to Solr, which will allow you to catch this kind of problem and prevent it bringing down your

Re: How to use Tika (Solr Cell) to extract content from HTML document instead of Solr's MostlyPassthroughHtmlMapper ?

2018-04-09 Thread Charlie Hull
I'd recommend you run Tika externally to Solr, which will allow you to catch this kind of problem and prevent it bringing down your Solr installation. Cheers Charlie On 9 April 2018 at 16:59, Hanjan, Harinder wrote: > Hello! > > Solr (i.e. Tika) throws a "zip bomb"

How to use Tika (Solr Cell) to extract content from HTML document instead of Solr's MostlyPassthroughHtmlMapper ?

2018-04-09 Thread Hanjan, Harinder
Hello! Solr (i.e. Tika) throws a "zip bomb" exception with certain documents we have in our Sharepoint system. I have used the tika-app.jar directly to extract the document in question and it does _not_ throw an exception and extract the contents just fine. So it would seem Solr is doing

Re: Issue with Solr Cell mixing metadata and content together

2017-12-21 Thread Phillip Rhodes
Hi all, I have been having an issue with Solr, using the >> ExtractingRequestHandler. Basically, when indexing a PDF (for >> example) I get all the metadata mixed into the "content" field along >> with the content. See: >> <https://stackoverflow.com/questions/479342

Re: Issue with Solr Cell mixing metadata and content together

2017-12-21 Thread Erick Erickson
ield along > with the content. See: > <https://stackoverflow.com/questions/47934257/importing-files-with-solr-cell-tika-is-mixing-metadata-fields-with-content> > for the gory details. > > I'm guessing this is the same basic issue as > <https://issues.apache.org/jira/bro

Issue with Solr Cell mixing metadata and content together

2017-12-21 Thread Phillip Rhodes
Hi all, I have been having an issue with Solr, using the ExtractingRequestHandler. Basically, when indexing a PDF (for example) I get all the metadata mixed into the "content" field along with the content. See: <https://stackoverflow.com/questions/47934257/importing-files-with-s

Re: Recursive archive indexing using solr cell

2017-12-13 Thread Erick Erickson
<gilhulys...@gmail.com> wrote: > Hello, > > I have been successfully able to index archive files (zip, tar, and the > like) using solr cell, but the archive is returned as a single document > when I do queries. Is there a way to configure it so that files are > extracted recursively

Recursive archive indexing using solr cell

2017-12-13 Thread Sean Gilhuly
Hello, I have been successfully able to index archive files (zip, tar, and the like) using solr cell, but the archive is returned as a single document when I do queries. Is there a way to configure it so that files are extracted recursively, and indexed separately? I know that if I set
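
One way to get one Solr document per archived file is to unpack the archive on the client before indexing. The sketch below is not a Solr Cell setting; it is a client-side alternative using only Python's standard `zipfile` module, and the `!/` separator in the generated names is an arbitrary convention chosen for illustration.

```python
import io
import zipfile


def archive_entries(data, prefix=""):
    """Yield (name, payload) for every file in a zip, descending into nested zips.

    Caveat: any payload that is itself a zip container (for example .docx)
    is also expanded, so filter by extension first if that is not wanted.
    """
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        for info in zf.infolist():
            if info.is_dir():
                continue  # directories carry no content to index
            payload = zf.read(info)
            name = prefix + info.filename
            if zipfile.is_zipfile(io.BytesIO(payload)):
                # Nested archive: recurse and keep the outer path as a prefix.
                yield from archive_entries(payload, prefix=name + "!/")
            else:
                yield name, payload
```

Each yielded pair can then be sent to the extract handler (or to an external Tika) as its own document, with the name used as or folded into the unique key.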

Re: Import html data in mysql and map schemas using only Solr CELL+TIKA+DIH [scottchu]

2016-05-20 Thread Siddhartha Singh Sandhu
ords using only Solr > CELL+TIKA+DIH to some Solr with schema? I mean when importing, I can map > schema on mysql to schema in Solr? > > scott.chu,scott@udngroup.com > 2016/5/20 (週五) >

Import html data in mysql and map schemas using only Solr CELL+TIKA+DIH [scottchu]

2016-05-20 Thread scott.chu
I have a mysql table with over 300M blog articles. The records are in html format. Is it possible to import these records using only Solr CELL+TIKA+DIH to some Solr with schema? I mean when importing, I can map schema on mysql to schema in Solr? scott.chu,scott@udngroup.com 2016/5/20 (週五)

Solr Cell Tika - date.formats

2014-05-28 Thread ienjreny
this message in context: http://lucene.472066.n3.nabble.com/Solr-Cell-Tika-date-formats-tp4138478.html Sent from the Solr - User mailing list archive at Nabble.com.

Re: Solr Cell Tika - date.formats

2014-05-28 Thread Jack Krupansky
-MM-dd hh:mm:ss -MM-dd HH:mm:ss EEE MMM d hh:mm:ss z EEE, dd MMM HH:mm:ss zzz , dd-MMM-yy HH:mm:ss zzz EEE MMM d HH:mm:ss See: https://cwiki.apache.org/confluence/display/solr/Uploading+Data+with+Solr+Cell+using+Apache+Tika -- Jack Krupansky -Original Message

Re: Solr Cell Tika - date.formats

2014-05-28 Thread ienjreny
HH:mm:ss EEE MMM d hh:mm:ss z EEE, dd MMM HH:mm:ss zzz , dd-MMM-yy HH:mm:ss zzz EEE MMM d HH:mm:ss See: https://cwiki.apache.org/confluence/display/solr/Uploading+Data+with+Solr+Cell+using+Apache+Tika -- Jack Krupansky -Original Message- From: ienjreny Sent

Using Solr Cell to index the internal structure of a PDF

2013-10-10 Thread Peter Bleackley
I'm trying to index a set of PDF documents with Solr 4.5.0. So far I can get Solr to ingest the entire document as one long string, stored in the index as content. However, I want to index structure within the documents. I know that the ExtractingRequestHandler uses Apache Tika to convert the

Re: Using Solr Cell to index the internal structure of a PDF

2013-10-10 Thread Furkan KAMACI
You can have a look here: http://solr.pl/en/2011/04/04/indexing-files-like-doc-pdf-solr-and-tika-integration/ 2013/10/10 Peter Bleackley bleackl...@zooey.co.uk I'm trying to index a set of PDF documents with Solr 4.5.0. So far I can get Solr to ingest the entire document as one long string,

Re: Solr Cell Question

2013-09-09 Thread Jamie Johnson
Thanks Erick, This is how I was doing it but when I saw the Solr Cell stuff I figured I'd give it a go. What I ended up doing is the following ModifiableSolrParams params = indexer.index(artifact); params.add(fmap.content, my_custom_field); params.add(extractFormat, text

Re: Solr Cell Question

2013-09-06 Thread Erick Erickson
program with indexing from a DB mixed in, but it shouldn't be hard at all to pull the DB parts out. http://searchhub.org/dev/2012/02/14/indexing-with-solrj/ FWIW, Erick On Thu, Sep 5, 2013 at 5:28 PM, Jamie Johnson jej2...@gmail.com wrote: Is it possible to configure solr cell to only extract

Solr Cell Question

2013-09-05 Thread Jamie Johnson
Is it possible to configure solr cell to only extract and store the body of a document when indexing? I'm currently doing the following which I thought would work ModifiableSolrParams params = new ModifiableSolrParams(); params.set(defaultField, content); params.set(xpath, /xhtml:html

Re: solr cell

2013-03-15 Thread Michael Della Bitta
Niklas, In Linux, the API for watching for filesystem changes is called inotify. You'd need to write something to listen to those events and react accordingly. Here's a brief discussion about it: http://stackoverflow.com/questions/4062806/inotify-how-to-use-it-linux Michael Della Bitta

Re: solr cell

2013-03-15 Thread Jack Krupansky
Take a look at ManifoldCF, which has a file system crawler which can track changed files. -- Jack Krupansky -Original Message- From: Niklas Langvig Sent: Friday, March 15, 2013 11:10 AM To: solr-user@lucene.apache.org Subject: solr cell We have all our documents (doc, docx, pdf

Re: solr cell

2013-03-15 Thread Arcadius Ahouansou
Another option similar to this would be the new file system WatchService available in java 7: http://docs.oracle.com/javase/tutorial/essential/io/notification.html Arcadius. On 15 March 2013 15:22, Michael Della Bitta michael.della.bi...@appinions.com wrote: Niklas, In Linux, the API for
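
The replies in this thread point to event-based APIs (inotify, Java 7's WatchService) and to ManifoldCF. Where none of those is available, a plain polling comparison can stand in. The sketch below is that simpler substitute, not any of the tools named above: it snapshots modification time and size for every file under a directory and reports what changed between two snapshots.

```python
import os


def snapshot(root):
    """Map every file path under root to (mtime in ns, size in bytes)."""
    state = {}
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            full = os.path.join(dirpath, name)
            st = os.stat(full)
            state[full] = (st.st_mtime_ns, st.st_size)
    return state


def diff(old, new):
    """Return (added, changed, removed) path lists between two snapshots."""
    added = sorted(set(new) - set(old))
    removed = sorted(set(old) - set(new))
    changed = sorted(p for p in new if p in old and new[p] != old[p])
    return added, changed, removed
```

A caller would take a snapshot on each pass, re-index the added and changed paths, and delete the removed ones from the index. Polling misses nothing as long as either mtime or size changes, but it costs a full directory walk per pass.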

Re: How can i instruct the Solr/ Solr Cell to output the original HTML document which was fed to it.?

2013-03-05 Thread Divyanand Tiwari
Hi Chris thank you for replying. My content field in the schema is stored=true and indexed=false because I am copying the content field in text field which is by default indexed=true. I was having a query that I am able to search in the html documents I had fed to the solr, but as the results

Re: How can i instruct the Solr/ Solr Cell to output the original HTML document which was fed to it.?

2013-02-21 Thread Chris Hostetter
: Hi everyone, i am new to solr technology and not getting a way to get back : the original HTML document with Hits highlighted into it. what : configuration and where i can do to instruct SolrCell/ Tika so that it does : not strips down the tags of HTML document in the content field. I _think_

Re: How can i instruct the Solr/ Solr Cell to output the original HTML document which was fed to it.?

2013-02-19 Thread Divyanand Tiwari
- From: Divyanand Tiwari Sent: Monday, February 18, 2013 10:52 PM To: solr-user@lucene.apache.org Subject: Re: How can i instruct the Solr/ Solr Cell to output the original HTML document which was fed to it.? Thank you for replying sir !!! I have two queries related with this - 1) So

Re: How can i instruct the Solr/ Solr Cell to output the original HTML document which was fed to it.?

2013-02-18 Thread Jack Krupansky
for highlighting. See: http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.HTMLStripCharFilterFactory -- Jack Krupansky -Original Message- From: Divyanand Tiwari Sent: Monday, February 18, 2013 7:28 AM To: solr-user@lucene.apache.org Subject: How can i instruct the Solr/ Solr

Re: How can i instruct the Solr/ Solr Cell to output the original HTML document which was fed to it.?

2013-02-18 Thread Divyanand Tiwari
Thank you for replying sir !!! I have two queries related with this - 1) So in this case which request handler I have to use because 'ExtractingRequestHandler' by default strips the html content and the default handler 'UpdateRequestHandler' does not accept the HTML contents. 2) How can I

Re: How can i instruct the Solr/ Solr Cell to output the original HTML document which was fed to it.?

2013-02-18 Thread Jack Krupansky
-Original Message- From: Divyanand Tiwari Sent: Monday, February 18, 2013 10:52 PM To: solr-user@lucene.apache.org Subject: Re: How can i instruct the Solr/ Solr Cell to output the original HTML document which was fed to it.? Thank you for replying sir !!! I have two queries related

Antwort: Re: Solr Cell Questions

2012-09-25 Thread Johannes . Schwendinger
Thank you Erick for your response. I've already tried what you've suggested and got some out-of-memory exceptions. Because of this I like the solution with Solr Cell, where I can send the file directly to Solr via stream and don't collect them in my memory. And another question that came to my

Re: Re: Solr Cell Questions

2012-09-25 Thread Erick Erickson
, 2012 at 5:23 AM, johannes.schwendin...@blum.com wrote: Thank you Erick for your response. I've already tried what you've suggested and got some out-of-memory exceptions. Because of this I like the solution with Solr Cell, where I can send the file directly to Solr via stream and don't collect

Antwort: Re: Re: Solr Cell Questions

2012-09-25 Thread Johannes . Schwendinger
The difference with Solr Cell is that I am sending every single document to Solr Cell and don't collect them until I have a couple of them in my memory. I am mainly using the code from here: http://wiki.apache.org/solr/ExtractingRequestHandler#SolrJ Erick Erickson erickerick...@gmail.com wrote

Re: Solr Cell Questions

2012-09-25 Thread Alexandre Rafalovitch
with Solr Cell to index files to Solr. During this some questions came up. 1. Is it possible (and wise) to connect to Solr Cell with multiple threads at the same time to index several documents at the same time? This question came up because my program takes about 6 hours to index around 35000

Re: Solr Cell Questions

2012-09-25 Thread Jack Krupansky
as a separate process) to minimize thread issues, GC issues, hung parsers, etc. -- Jack Krupansky -Original Message- From: Alexandre Rafalovitch Sent: Tuesday, September 25, 2012 10:24 AM To: solr-user@lucene.apache.org Subject: Re: Solr Cell Questions Are you by any chance committing

Re: Re: Re: Solr Cell Questions

2012-09-25 Thread Erick Erickson
... FWIW, Erick On Tue, Sep 25, 2012 at 10:04 AM, johannes.schwendin...@blum.com wrote: The difference with Solr Cell is that I am sending every single document to Solr Cell and don't collect them until I have a couple of them in my memory. I am mainly using the code from here: http

Solr Cell Questions

2012-09-24 Thread Johannes . Schwendinger
Hi, I'm currently experimenting with Solr Cell to index files to Solr. During this some questions came up. 1. Is it possible (and wise) to connect to Solr Cell with multiple threads at the same time to index several documents at the same time? This question came up because my program takes
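
As an illustration of the multi-threaded approach asked about here, the sketch below fans files out over a small thread pool and commits once at the end rather than after every document. The `send` and `commit` callables are hypothetical placeholders for whatever client code actually posts a file to the extract handler and issues a commit; nothing here calls Solr directly.

```python
from concurrent.futures import ThreadPoolExecutor


def index_all(paths, send, commit, workers=4):
    """Send each path with `send` on a thread pool, then call `commit` once.

    `send(path)` and `commit()` are caller-supplied (hypothetical) hooks.
    Returns a list of (path, exception) for files that failed.
    """
    failures = []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {pool.submit(send, p): p for p in paths}
        for fut, path in futures.items():
            try:
                fut.result()  # re-raises any exception from the worker
            except Exception as exc:
                failures.append((path, exc))
    commit()  # one commit for the whole batch
    return failures
```

Collecting failures instead of aborting keeps one unparseable file from stopping a run over tens of thousands of documents; the worker count is a guess to be tuned against the server.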

Re: Solr Cell Questions

2012-09-24 Thread Erick Erickson
Best Erick On Mon, Sep 24, 2012 at 10:04 AM, johannes.schwendin...@blum.com wrote: Hi, I'm currently experimenting with Solr Cell to index files to Solr. During this some questions came up. 1. Is it possible (and wise) to connect to Solr Cell with multiple threads at the same time to index

Re: Indexing PDF-Files using Solr Cell

2012-09-17 Thread Jack Krupansky
@lucene.apache.org Subject: Re: Indexing PDF-Files using Solr Cell Thank you for your response. I'm writing my Bachelor-Thesis about Solr and my company doesn't want me to use a beta-version. I dont want to be annoying, but how do i direct the content to a stored filed and so on... in the URL i use

Indexing PDF-Files using Solr Cell

2012-09-16 Thread Alexander Troost
Hello *, I've got a problem indexing and searching PDF files. It seems like Solr doesn't index the name of the file. In return I only get <result name="response" numFound="1" start="0"><doc><str name="author">A28240</str><arr name="content_type"><str>application/pdf</str></arr><str name="id">doc5</str><date

Re: Indexing PDF-Files using Solr Cell

2012-09-16 Thread Jack Krupansky
using Solr Cell Hello *, I've got a problem indexing and searching PDF files. It seems like Solr doesn't index the name of the file. In return I only get <result name="response" numFound="1" start="0"><doc><str name="author">A28240</str><arr name="content_type"><str>application/pdf</str></arr><str name="id">doc5</str><date

Re: Indexing PDF-Files using Solr Cell

2012-09-16 Thread Alexander Troost
-Original Message- From: Alexander Troost Sent: Sunday, September 16, 2012 10:16 PM To: solr-user@lucene.apache.org Subject: Indexing PDF-Files using Solr Cell Hello *, I've got a problem indexing and searching PDF files. It seems like Solr doesn't index the name of the file

Re: Indexing PDF-Files using Solr Cell

2012-09-16 Thread Jack Krupansky
- From: Alexander Troost Sent: Sunday, September 16, 2012 11:59 PM To: solr-user@lucene.apache.org Subject: Re: Indexing PDF-Files using Solr Cell Hi, first of all: Thank you for that quick response! But i am not sure if i am doing this right. For my point of view the command now has to look like

Re: Indexing PDF-Files using Solr Cell

2012-09-16 Thread Alexander Troost
-Original Message- From: Alexander Troost Sent: Sunday, September 16, 2012 11:59 PM To: solr-user@lucene.apache.org Subject: Re: Indexing PDF-Files using Solr Cell Hi, first of all: Thank you for that quick response! But i am not sure if i am doing this right. For my point

Re: scanned pdf with solr cell

2012-08-20 Thread Michael Della Bitta
It's pretty easy to accidentally run into the AWT stuff if you're doing anything that involves image processing, which I would expect a generic RTF parser might do. Michael Della Bitta Appinions | 18 East 41st St., Suite 1806 | New York, NY 10017

Re: scanned pdf with solr cell

2012-08-19 Thread Lance Norskog
The backstory here is that Tika uses a library that for some crazy reason is inside the Java AWT graphics toolkit. (I think the RTF parser?) On Wed, Aug 15, 2012 at 5:57 AM, Ahmet Arslan iori...@yahoo.com wrote: You can try passing -Djava.awt.headless=true as one of the arguments when you

Re: scanned pdf with solr cell

2012-08-15 Thread Ahmet Arslan
When I send a scanned pdf to extraction request handler, below icon appears in my Dock. http://tinypic.com/r/2mpmo7o/6 http://tinypic.com/r/28ukxhj/6 I found that text-extractable pdf files triggers above weird icon too. curl

Re: scanned pdf with solr cell

2012-08-15 Thread Paul Libbrecht
Ahmet, the dock icon appears when AWT starts, e.g. when a font is loaded. You can prevent it using the headless mode but this is likely to trigger an exception. Same if your user is not UI-logged-in. hope it helps. Paul On 15 August 2012 at 01:30, Ahmet Arslan wrote: Hi All, I have set

Re: scanned pdf with solr cell

2012-08-15 Thread Ahmet Arslan
the dock icon appears when AWT starts, e.g. when a font is loaded. You can prevent it using the headless mode but this is likely to trigger an exception. Same if your user is not UI-logged-in. Hi Paul, thanks for the explanation. So is it nothing to worry about?

Re: scanned pdf with solr cell

2012-08-15 Thread Paul Libbrecht
On 15 August 2012 at 13:03, Ahmet Arslan wrote: Hi Paul, thanks for the explanation. So is it nothing to worry about? it is nothing to worry about except to remember that you can't run this step in a daemon-like process. (on Linux, I had to set up a VNC server for similar tasks) paul

Re: scanned pdf with solr cell

2012-08-15 Thread Michael Della Bitta
You can try passing -Djava.awt.headless=true as one of the arguments when you start Jetty to see if you can get this to go away with no ill effects. Michael Della Bitta Appinions | 18 East 41st St., Suite 1806 | New York, NY 10017

Re: scanned pdf with solr cell

2012-08-15 Thread Ahmet Arslan
You can try passing -Djava.awt.headless=true as one of the arguments when you start Jetty to see if you can get this to go away with no ill effects. I started jetty using : 'java -Djava.awt.headless=true -jar start.jar' and successfully indexed two pdf files. That icon didn't appear :)

scanned pdf with solr cell

2012-08-14 Thread Ahmet Arslan
Hi All, I have a set of rich documents. Some of them are scanned pdf files. When I send a scanned pdf to the extraction request handler, the icon below appears in my Dock. http://tinypic.com/r/2mpmo7o/6 http://tinypic.com/r/28ukxhj/6 Does anyone know what this is? curl

Re: scanned pdf with solr cell

2012-08-14 Thread Jack Krupansky
a normal PDF for comparison. -- Jack Krupansky -Original Message- From: Ahmet Arslan Sent: Tuesday, August 14, 2012 7:30 PM To: solr-user@lucene.apache.org Subject: scanned pdf with solr cell Hi All, I have set of rich documents. Some of them are scanned pdf files. When I send

[Error] Indexing with solr cell

2012-07-03 Thread savitha sundaramurthy
Hi , I'm using solr cell(solrj) to index plain text files, but am encountering IllegalCharsetNameException: Could you please point out if anything should be added in schema.xml file. I could index the other mime types efficiently. I gave the field type as text. fieldType name=text class

Re: Custom content extractor for Solr Cell

2011-12-27 Thread Jan Høydahl
Hi John, See discussion about the issue of indexing contents of ZIP files: https://issues.apache.org/jira/browse/SOLR-2416 Depending on your use case, you may be able to write a Tika parser which handles your specific case, such as uncompressing a GZIP file and using AutoDetect on its

Custom content extractor for Solr Cell

2011-12-05 Thread John Bartak
Is it possible to extract content for file types that Tika doesn’t support without changing and rebuilding Tika? Do I need to specify a tika.config file in the solrconfig.xml file, and if so, what is the format of that file? One example that I’m trying to solve is for a document management

Re: Can you please guide me through step-by-step installation of Solr Cell ?

2011-11-03 Thread Chris Hostetter
: Caused by: org.apache.solr.common.SolrException: Error loading class 'solr.extraction.ExtractingRequestHandler' : : With the jetty and the provided example, I have no problem. It all happens when I use tomcat and solr. : : My setup is as follows: : : I downloaded the apache-solr-3.3.0 and

Can you please guide me through step-by-step installation of Solr Cell ?

2011-10-17 Thread Sina Fakhraee
-3.3.0.war and copied everything from the contrib/extraction/lib into dist. I would greatly appreciate it if you can possibly point me in the right direction. I have read everything on the wiki page and the documentation but no luck! Can you please guide me through step-by-step usage of Solr Cell

Please help - Solr Cell using 'stream.url'

2011-10-07 Thread Tod
I'm batching documents into solr using solr cell with the 'stream.url' parameter. Everything is working fine until I get to about 5k documents in and then it starts issuing 'read timeout 500' errors on every document. The sysadmin says there's plenty of CPU, memory, and no paging so
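
For context, a request of the kind described here might be assembled as in the sketch below. It only builds the URL for the extract handler with `stream.url` and `literal.id`; the base URL, core name, and document URL are placeholders, and it assumes remote streaming is enabled on the server. It does not address the timeouts themselves.

```python
from urllib.parse import urlencode


def extract_url(base, core, doc_id, stream_url, commit=False):
    """Build an /update/extract request that has Solr fetch the document itself.

    Assumes remote streaming is enabled in the Solr configuration.
    """
    params = {
        "literal.id": doc_id,      # unique key for the resulting document
        "stream.url": stream_url,  # location Solr will read the content from
    }
    if commit:
        params["commit"] = "true"  # usually left off when batching
    return f"{base}/{core}/update/extract?{urlencode(params)}"
```

When batching thousands of documents, leaving `commit` off per request and committing once per batch keeps the server from doing that work on every call.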

Question on XPATH use in Solr Cell.

2011-06-15 Thread Koorosh Vakhshoori
I am new to both Solr and Cell, so sorry if I am misusing some of the terminologies. So the problem I am trying to solve is to index a PDF document using Solr Cell where I want to exclude part of it via XPATH. I am using Solr release 3.1. When researching the user list, I came across one entry

Limit data stored from fmap.content with Solr cell

2011-06-01 Thread Greg Georges
Hello everyone, I have just gotten extracting information from files with Solr Cell. Some of the files we are indexing are large, and have much content. I would like to limit the amount of data I index to a specified limit of characters (example 300 chars) which I will use as a document

Re: Limit data stored from fmap.content with Solr cell

2011-06-01 Thread Erick Erickson
information from files with Solr Cell. Some of the files we are indexing are large, and have much content. I would like to limit the amount of data I index to a specified limit of characters (example 300 chars) which I will use as a document preview. Is this possible to set as a parameter
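
Since the reply above is cut off, the following is only one possible approach, not necessarily the one suggested in the thread: build the preview on the client from the extracted text and store it in its own field. The 300-character limit comes from the question; the function name is made up for illustration.

```python
def preview(text, limit=300):
    """Collapse whitespace and cut text to at most `limit` characters.

    When the cut would land mid-word, back off to the previous space.
    """
    collapsed = " ".join(text.split())
    if len(collapsed) <= limit:
        return collapsed
    cut = collapsed[:limit]
    space = cut.rfind(" ")
    # Fall back to a hard cut when there is no space to break on.
    return cut[:space] if space > 0 else cut
```

Keeping the preview in a separate stored field leaves the full content field free to be indexed but not stored.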

Indexing files Solr cell and Amazon S3

2011-05-30 Thread Greg Georges
Hello everyone, We have our infrastructure on Amazon cloud servers, and we use the S3 file system. We need to index files using Solr Cell. From what I have read, we need to stream files to Solr in order for it to extract the metadata into the index. If we stream data through a public url

Re: Indexing files Solr cell and Amazon S3

2011-05-30 Thread Jan Høydahl
AS - www.cominvent.com On 30. mai 2011, at 22.46, Greg Georges wrote: Hello everyone, We have our infrastructure on Amazon cloud servers, and we use the S3 file system. We need to index files using Solr Cell. From what I have read, we need to stream files to Solr in order for it to extract the metadata

Solr Cell and operations on metadata extracted

2011-05-16 Thread Olivier Tavard
Hi, I have a question about Solr Cell please. I index some files. For example, if I want to extract the filename, then use a hash function on it like MD5 and then store it on Solr ; the correct way is to use Tika « manually » to extract the metadata I want, do the transformations
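
The transformation described here (take the filename, hash it with MD5, store the result) is easy to do on the client before the document is sent. The sketch below shows just that step; the field names `id` and `filename` are assumptions for illustration, not fields from the poster's schema.

```python
import hashlib
import os


def filename_doc(path):
    """Return a dict holding the bare filename and the MD5 hex digest of that name."""
    name = os.path.basename(path)
    digest = hashlib.md5(name.encode("utf-8")).hexdigest()
    return {"id": digest, "filename": name}
```

Note that hashing only the bare filename means two files with the same name in different directories map to the same digest; hash the full path instead if that matters.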

Re: Multiple Cores with Solr Cell for indexing documents

2011-03-25 Thread Upayavira
Cores with Solr Cell for indexing documents Sounds like the Tika jar is not on the class path. Add it to a directory where Solr's looking for libs. On Thursday 24 March 2011 16:24:17 Brandon Waterloo wrote: Hello everyone, I've been trying for several hours now to set up Solr

RE: Multiple Cores with Solr Cell for indexing documents

2011-03-25 Thread Brandon Waterloo
Jelsma [markus.jel...@openindex.io] Sent: Friday, March 25, 2011 1:23 PM To: solr-user@lucene.apache.org Cc: Upayavira Subject: Re: Multiple Cores with Solr Cell for indexing documents You can only set properties for a lib dir that must be used in solrconfig.xml. You can use sharedLib in solr.xml

Re: Multiple Cores with Solr Cell for indexing documents

2011-03-25 Thread Erick Erickson
@lucene.apache.org Cc: Upayavira Subject: Re: Multiple Cores with Solr Cell for indexing documents You can only set properties for a lib dir that must be used in solrconfig.xml. You can use sharedLib in solr.xml though. There's options in solr.xml that point to lib dirs. Make sure you get them

Multiple Cores with Solr Cell for indexing documents

2011-03-24 Thread Brandon Waterloo
Hello everyone, I've been trying for several hours now to set up Solr with multiple cores with Solr Cell working on each core. The only items being indexed are PDF, DOC, and TXT files (with the possibility of expanding this list, but for now, just assume the only things in the index should

Re: Multiple Cores with Solr Cell for indexing documents

2011-03-24 Thread Markus Jelsma
Sounds like the Tika jar is not on the class path. Add it to a directory where Solr's looking for libs. On Thursday 24 March 2011 16:24:17 Brandon Waterloo wrote: Hello everyone, I've been trying for several hours now to set up Solr with multiple cores with Solr Cell working on each core

RE: Multiple Cores with Solr Cell for indexing documents

2011-03-24 Thread Brandon Waterloo
From: Markus Jelsma [markus.jel...@openindex.io] Sent: Thursday, March 24, 2011 11:29 AM To: solr-user@lucene.apache.org Cc: Brandon Waterloo Subject: Re: Multiple Cores with Solr Cell for indexing documents Sounds like the Tika jar is not on the class path. Add it to a directory where Solr's looking

Multiple Cores with Solr Cell for indexing documents

2011-03-24 Thread Brandon Waterloo
From: Markus Jelsma [markus.jel...@openindex.io] Sent: Thursday, March 24, 2011 11:29 AM To: solr-user@lucene.apache.org Cc: Brandon Waterloo Subject: Re: Multiple Cores with Solr Cell for indexing documents Sounds like the Tika jar is not on the class path. Add it to a directory where Solr's looking

Re: Multiple Cores with Solr Cell for indexing documents

2011-03-24 Thread Markus Jelsma
, March 24, 2011 11:29 AM To: solr-user@lucene.apache.org Cc: Brandon Waterloo Subject: Re: Multiple Cores with Solr Cell for indexing documents Sounds like the Tika jar is not on the class path. Add it to a directory where Solr's looking for libs. On Thursday 24 March 2011 16:24:17 Brandon

Multiple Cores with Solr Cell for indexing documents

2011-03-22 Thread Brandon Waterloo
Hello everyone, I've been trying for several hours now to set up Solr with multiple cores with Solr Cell working on each core. The only items being indexed are PDF, DOC, and TXT files (with the possibility of expanding this list, but for now, just assume the only things in the index should

Solr Cell: Content extraction problem with ContentStreamUpdateRequest and multiple files

2011-03-09 Thread Karthik Shiraly
Hi, I'm using Solr 1.4.1. The scenario involves user uploading multiple files. These have content extracted using SolrCell, then indexed by Solr along with other information about the user. ContentStreamUpdateRequest seemed like the right choice for this - use addFile() to send file data, and

Re: Solr Cell: Content extraction problem with ContentStreamUpdateRequest and multiple files

2011-03-09 Thread Karthik Shiraly
In case the exact problem was not clear to somebody: The problem with FileUpload interpreting file data as regular form fields is that Solr thinks there are no content streams in the request and throws a missing_content_stream exception. On Thu, Mar 10, 2011 at 10:59 AM, Karthik Shiraly

Solr Cell DataImport Tika handler broken - fails to index Zip file contents

2011-03-07 Thread Jayendra Patil
Working with the latest Solr trunk code, it seems the Tika handlers for Solr Cell (ExtractingDocumentLoader.java) and the Data Import handler (TikaEntityProcessor.java) fail to index the zip file contents again. They just index the file names again. This issue was addressed some time back, late last

Exception being thrown indexing a specific pdf document using Solr Cell

2010-10-15 Thread Shaun Campbell
/2009-09/msg00037.html Looking at my libraries it seems I am using pdfbox 0.7.3. I am using maven for building and pdfbox 0.7.3 appears to have come from the tika-parsers 0.4 pom file which in turn appears to have come from the solr-cell 1.4.0 pom file. In my project's maven pom file I have the following

Indexing large files using Solr Cell causes OutOfMemory error

2010-08-12 Thread Lannig Carina
Hi, I'm trying to index a txt-File (~150MB) using Solr Cell/Tika. The curl command aborts due to a java.lang.OutOfMemoryError. * java.lang.OutOfMemoryError: Java heap space at java.util.Arrays.copyOfRange(Arrays.java:3209

Re: Indexing large files using Solr Cell causes OutOfMemory error

2010-08-12 Thread Gora Mohanty
On Thu, 12 Aug 2010 14:32:19 +0200 Lannig Carina lan...@ssi-schaefer-noell.com wrote: Hi, I'm trying to index a txt-File (~150MB) using Solr Cell/Tika. The curl command aborts due to a java.lang.OutOfMemoryError. [...] AFAIK Tika keeps the whole file in RAM and posts it as one single

Re: Indexing large files using Solr Cell causes OutOfMemory error

2010-08-12 Thread Chris Hostetter
: Subject: Indexing large files using Solr Cell causes OutOfMemory error : References: aanlktinfbtudv4lpjh40vjzderto1-dn7gztnjxfv...@mail.gmail.com : In-Reply-To: aanlktinfbtudv4lpjh40vjzderto1-dn7gztnjxfv...@mail.gmail.com http://people.apache.org/~hossman/#threadhijack Thread Hijacking

Re: Extracting PDF text/comment/callout/typewriter boxes with Solr CELL/Tika/PDFBox

2010-07-28 Thread Tommaso Teofili
: Alessandro Benedetti [mailto:benedetti.ale...@gmail.com] Sent: Tuesday, July 27, 2010 6:09 AM To: solr-user@lucene.apache.org Subject: Re: Extracting PDF text/comment/callout/typewriter boxes with Solr CELL/Tika/PDFBox Hi Jon, Over the last few days we faced the same problem. Using Solr 1.4.1

RE: Extracting PDF text/comment/callout/typewriter boxes with Solr CELL/Tika/PDFBox

2010-07-28 Thread David Thibault
-Original Message- From: Tommaso Teofili [mailto:tommaso.teof...@gmail.com] Sent: Wednesday, July 28, 2010 3:31 AM To: solr-user@lucene.apache.org Subject: Re: Extracting PDF text/comment/callout/typewriter boxes with Solr CELL/Tika/PDFBox I attached a patch for Solr 1.4.1 release

Re: Extracting PDF text/comment/callout/typewriter boxes with Solr CELL/Tika/PDFBox

2010-07-28 Thread Alessandro Benedetti
@lucene.apache.org Subject: Re: Extracting PDF text/comment/callout/typewriter boxes with Solr CELL/Tika/PDFBox I attached a patch for Solr 1.4.1 release on https://issues.apache.org/jira/browse/SOLR-1902 that made things work for me. This strange behaviour for me was due to the fact that I copied
