RE: Indexing data from multiple datasources

2011-06-09 Thread Greg Georges
No, from what I understand the way Solr does an update is to delete the 
document and then recreate all of its fields; there is no partial update of an 
existing document, perhaps because of performance or locking concerns.
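To illustrate (a minimal SolrJ sketch; the URL and field names are made up): since there is no partial update, "updating" means re-adding the complete document under the same uniqueKey, which replaces the previous version.

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class FullReindexSketch {
    public static void main(String[] args) throws Exception {
        SolrServer server = new CommonsHttpSolrServer("http://localhost:8080/solr");

        // Re-send every field; Solr drops the old version with the same
        // uniqueKey ("id") and indexes this document in its place.
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "doc-42");
        doc.addField("title", "Updated title");
        doc.addField("tags", "new-tag"); // omitted fields would simply be gone

        server.add(doc);
        server.commit();
    }
}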

-Original Message-
From: David Ross [mailto:davidtr...@hotmail.com] 
Sent: 9 juin 2011 15:23
To: solr-user@lucene.apache.org
Subject: RE: Indexing data from multiple datasources


This thread got me thinking a bit...
Does SOLR support the concept of "partial updates" to documents? By this I 
mean updating a subset of fields in a document that already exists in the 
index, without having to resubmit the entire document.
An example would be storing/indexing user tags associated with documents. These 
tags will not be available when the document is initially presented to SOLR, 
and may or may not come along at a later time. When that time comes, can we 
just submit the tag data (and document identifier I'd imagine), or do we have 
to import the entire document?
new to SOLR...

> Date: Thu, 9 Jun 2011 14:00:43 -0400
> Subject: Re: Indexing data from multiple datasources
> From: erickerick...@gmail.com
> To: solr-user@lucene.apache.org
> 
> How are you using it? Streaming the files to Solr via HTTP? You can use Tika
> on the client to extract the various bits from the structured documents, and
> use SolrJ to assemble the bits of data Tika exposes into a Solr document that
> you then send to Solr. At the point where you transfer data from the Tika
> parse into the Solr document, you could add any data you wanted from your
> database.
> 
> The result is that you'd be indexing the complete Solr document only once.
> 
> You're right that updating a document in Solr overwrites the previous
> version, and any data in the previous version is lost.
> 
> Best
> Erick
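A rough sketch of the approach described above (assumptions: the Tika and SolrJ jars are on the client classpath; the file path, field names, and DB values are illustrative, with the literals standing in for a real JDBC lookup):

import java.io.FileInputStream;
import java.io.InputStream;

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.common.SolrInputDocument;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.sax.BodyContentHandler;

public class TikaPlusDbIndexer {
    public static void main(String[] args) throws Exception {
        SolrServer server = new CommonsHttpSolrServer("http://localhost:8080/solr");

        // 1) Parse the file on the client with Tika.
        Metadata metadata = new Metadata();
        BodyContentHandler text = new BodyContentHandler(-1); // no write limit
        InputStream in = new FileInputStream("/data/files/contract-42.pdf");
        new AutoDetectParser().parse(in, text, metadata);
        in.close();

        // 2) Assemble the Solr document from what Tika exposes...
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "contract-42");
        doc.addField("content", text.toString());
        doc.addField("title", metadata.get("title"));

        // 3) ...and merge in whatever comes from the database for that id.
        doc.addField("category", "Policies & Documentation");
        doc.addField("type", "contract");

        server.add(doc);
        server.commit();
    }
}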
> 
> On Thu, Jun 9, 2011 at 1:20 PM, Greg Georges  wrote:
> > Hello Erick,
> >
> > Thanks for the response. No, I am using the extract handler to extract the 
> > data from my text files. In your second approach, you say I could use a DIH 
> > to update the index which would have been created by the extract handler in 
> > the first phase. My concern is that if I then get info from the DB and 
> > update the index using the document ID, I will overwrite and lose the 
> > initial data from the extract handler phase. Is that right? Thanks
> >
> > Greg
> >
> > -Original Message-
> > From: Erick Erickson [mailto:erickerick...@gmail.com]
> > Sent: 9 juin 2011 12:15
> > To: solr-user@lucene.apache.org
> > Subject: Re: Indexing data from multiple datasources
> >
> > Hmmm, when you say you use Tika, are you using some custom Java code?
> > Because if you are, the best thing to do is query your database at that
> > point and add whatever information you need to the document.
> >
> > If you're using DIH to do the crawl, consider implementing a Transformer
> > to do the database querying and modify the document as necessary. This is
> > pretty simple to do; we can chat a bit more depending on whether either
> > approach makes sense.
> >
> > Best
> > Erick
> >
> >
> >
> > On Thu, Jun 9, 2011 at 10:43 AM, Greg Georges  
> > wrote:
> >> Hello all,
> >>
> >> I have checked the forums to see if it is possible to create an index 
> >> from multiple datasources. I have found references to SOLR-1358, but I 
> >> don't think it fits my scenario. We have an application where we upload 
> >> files. On file upload, I use the Tika extract handler to save metadata 
> >> from the file (_attr, literal values, etc.). We also have a database which 
> >> has information on the uploaded files, like the category, type, etc. I 
> >> would like to update the index to include this information from the db 
> >> for each document. If I run a dataimporthandler after the extract phase, 
> >> I am afraid that updating the doc in the index by its id will just 
> >> overwrite the old information with the info from the DB (what I 
> >> understand is that Solr updates its index by ID by deleting first, then 
> >> recreating the info).
> >>
> >> Anyone have any pointers? Is there a clean way to do this, or must I find 
> >> a way to pass the db metadata to the extract handler and save it as 
> >> literal fields?
> >>
> >> Thanks in advance
> >>
> >> Greg
> >>
> >
  


RE: Indexing data from multiple datasources

2011-06-09 Thread Greg Georges
Hello Erick,

Thanks for the response. No, I am using the extract handler to extract the data 
from my text files. In your second approach, you say I could use a DIH to 
update the index which would have been created by the extract handler in the 
first phase. My concern is that if I then get info from the DB and update the 
index using the document ID, I will overwrite and lose the initial data from 
the extract handler phase. Is that right? Thanks

Greg

-Original Message-
From: Erick Erickson [mailto:erickerick...@gmail.com] 
Sent: 9 juin 2011 12:15
To: solr-user@lucene.apache.org
Subject: Re: Indexing data from multiple datasources

Hmmm, when you say you use Tika, are you using some custom Java code? Because
if you are, the best thing to do is query your database at that point and add
whatever information you need to the document.

If you're using DIH to do the crawl, consider implementing a Transformer to do
the database querying and modify the document as necessary. This is pretty
simple to do; we can chat a bit more depending on whether either approach makes
sense.

Best
Erick
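A bare-bones sketch of the Transformer approach mentioned above (the class name, field names, and lookups are illustrative; a real implementation would do an actual JDBC query, ideally over a shared connection):

package com.example;

import java.util.Map;

import org.apache.solr.handler.dataimport.Context;
import org.apache.solr.handler.dataimport.Transformer;

// Referenced from data-config.xml via transformer="com.example.DbEnrichTransformer"
public class DbEnrichTransformer extends Transformer {

    @Override
    public Object transformRow(Map<String, Object> row, Context context) {
        Object id = row.get("id");

        // Stand-ins for a real database lookup keyed on the document id.
        row.put("category", lookupCategory(id));
        row.put("type", lookupType(id));

        return row; // the modified row is what gets indexed
    }

    private String lookupCategory(Object id) { return "Policies & Documentation"; }

    private String lookupType(Object id) { return "contract"; }
}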



On Thu, Jun 9, 2011 at 10:43 AM, Greg Georges  wrote:
> Hello all,
>
> I have checked the forums to see if it is possible to create an index from 
> multiple datasources. I have found references to SOLR-1358, but I don't think 
> it fits my scenario. We have an application where we upload files. On file 
> upload, I use the Tika extract handler to save metadata from the file (_attr, 
> literal values, etc.). We also have a database which has information on the 
> uploaded files, like the category, type, etc. I would like to update the 
> index to include this information from the db for each document. If I run a 
> dataimporthandler after the extract phase, I am afraid that updating the doc 
> in the index by its id will just overwrite the old information with the info 
> from the DB (what I understand is that Solr updates its index by ID by 
> deleting first, then recreating the info).
>
> Anyone have any pointers? Is there a clean way to do this, or must I find a 
> way to pass the db metadata to the extract handler and save it as literal 
> fields?
>
> Thanks in advance
>
> Greg
>


Indexing data from multiple datasources

2011-06-09 Thread Greg Georges
Hello all,

I have checked the forums to see if it is possible to create an index from 
multiple datasources. I have found references to SOLR-1358, but I don't think 
it fits my scenario. We have an application where we upload files. On file 
upload, I use the Tika extract handler to save metadata from the file (_attr, 
literal values, etc.). We also have a database which has information on the 
uploaded files, like the category, type, etc. I would like to update the index 
to include this information from the db for each document. If I run a 
dataimporthandler after the extract phase, I am afraid that updating the doc in 
the index by its id will just overwrite the old information with the info from 
the DB (what I understand is that Solr updates its index by ID by deleting 
first, then recreating the info).

Anyone have any pointers? Is there a clean way to do this, or must I find a way 
to pass the db metadata to the extract handler and save it as literal fields?

Thanks in advance

Greg
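For reference, passing the db metadata to the extract handler as literal fields might look roughly like the following SolrJ sketch (the URL, file path, field names, and values are illustrative, with the literals standing in for data fetched from the database):

import java.io.File;

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.request.AbstractUpdateRequest;
import org.apache.solr.client.solrj.request.ContentStreamUpdateRequest;

public class ExtractWithLiteralsSketch {
    public static void main(String[] args) throws Exception {
        SolrServer server = new CommonsHttpSolrServer("http://localhost:8080/solr");

        // Stream the file to the extract handler and attach the DB metadata
        // as literal.* parameters, so everything lands in one document.
        ContentStreamUpdateRequest req =
                new ContentStreamUpdateRequest("/update/extract");
        req.addFile(new File("/data/files/contract-42.pdf"));
        req.setParam("literal.id", "contract-42");
        req.setParam("literal.category", "Policies & Documentation"); // from the DB
        req.setParam("literal.type", "contract");                     // from the DB
        req.setAction(AbstractUpdateRequest.ACTION.COMMIT, true, true);

        server.request(req);
    }
}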


Limit data stored from fmap.content with Solr cell

2011-06-01 Thread Greg Georges
Hello everyone,

I have just gotten extraction of information from files working with Solr Cell. 
Some of the files we are indexing are large and have a lot of content. I would 
like to limit the amount of data I index to a specified number of characters 
(for example, 300 chars), which I will use as a document preview. Is this 
possible to set as a parameter with the fmap.content param, or must I index it 
all and then do a copyField with a specified number of characters? Thanks in 
advance

Greg


Indexing files Solr cell and Amazon S3

2011-05-30 Thread Greg Georges
Hello everyone,

We have our infrastructure on Amazon cloud servers, and we use the S3 file 
system. We need to index files using Solr Cell. From what I have read, we need 
to stream files to Solr in order for it to extract the metadata into the index. 
If we stream data through a public URL, there will be costs associated with the 
transfer on the Amazon cloud. We plan to have a directory with the files; is it 
possible to tell Solr to add documents from a specific folder location, or must 
we stream them into Solr? In SolrJ I see that the only option is streaming. 
Thank you very much.

Greg
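One option sometimes used for the "specific folder location" case is Solr's stream.file parameter, which makes Solr read a file from a path local to the Solr server (this is an assumption about the setup: it requires enableRemoteStreaming="true" in solrconfig.xml, and the path, URL, and id below are illustrative). A minimal sketch using plain HTTP:

import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLEncoder;

public class StreamFileSketch {
    public static void main(String[] args) throws Exception {
        // Ask the extract handler to pull the file from the Solr server's disk
        // instead of uploading the bytes over HTTP.
        String path = URLEncoder.encode("/mnt/shared/docs/contract-42.pdf", "UTF-8");
        URL url = new URL("http://localhost:8080/solr/update/extract"
                + "?literal.id=contract-42&commit=true&stream.file=" + path);

        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        System.out.println("Extract request returned HTTP " + conn.getResponseCode());
        conn.disconnect();
    }
}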


RE: DataImportHandler on 2 tables

2011-05-02 Thread Greg Georges
No, it is not a problem; I just wanted to confirm my question before looking 
into Solr cores more closely. Thanks for your advice and confirmation.

Greg
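For reference, once the second core exists, a minimal SolrJ sketch of keeping the two tables in separate, independent indexes (the core names and URLs are illustrative):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;

public class TwoCoresSketch {
    public static void main(String[] args) throws Exception {
        // One SolrServer per core; each core has its own schema.xml,
        // solrconfig.xml and DIH data-config, so the indexes stay independent.
        SolrServer tableA = new CommonsHttpSolrServer("http://localhost:8080/solr/tableA");
        SolrServer tableB = new CommonsHttpSolrServer("http://localhost:8080/solr/tableB");

        System.out.println("tableA docs: "
                + tableA.query(new SolrQuery("*:*")).getResults().getNumFound());
        System.out.println("tableB docs: "
                + tableB.query(new SolrQuery("*:*")).getResults().getNumFound());
    }
}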

-Original Message-
From: lboutros [mailto:boutr...@gmail.com] 
Sent: 2 mai 2011 16:43
To: solr-user@lucene.apache.org
Subject: Re: DataImportHandler on 2 tables

OK, so it seems you should create a new index and core, as you said.

See here for core management:

http://wiki.apache.org/solr/CoreAdmin

But it seems that is a problem for you. Is it?

Ludovic.


2011/5/2 Greg Georges [via Lucene] <
ml-node+2891277-472183207-383...@n3.nabble.com>

> No, the data has no relationship between each other, they are both
> independant with no joins. I want to search separately
>
> Greg
>
>


-
Jouve
France.


RE: DataImportHandler on 2 tables

2011-05-02 Thread Greg Georges
No, the data has no relationship; the two tables are independent, with no 
joins. I want to search them separately.

Greg

-Original Message-
From: lboutros [mailto:boutr...@gmail.com] 
Sent: 2 mai 2011 16:29
To: solr-user@lucene.apache.org
Subject: Re: DataImportHandler on 2 tables

Do you want to search on the data from the two tables together or separately? 
Is there a join between the two tables?

Ludovic.

2011/5/2 Greg Georges [via Lucene] <
ml-node+2891256-222073995-383...@n3.nabble.com>

> Hello all,
>
> I have a system where I have a dataimporthandler defined for one table in
> my database. I need to also index data from another table, and therefore I
> will need another index to search on. Does this mean I must configure
> another Solr instance (another schema.xml file, dataimporthandler SQL file,
> etc.)? Do I need another Solr core for this? Thanks
>
> Greg
>
>
>


-
Jouve
France.


DataImportHandler on 2 tables

2011-05-02 Thread Greg Georges
Hello all,

I have a system where I have a dataimporthandler defined for one table in my 
database. I need to also index data from another table, and therefore I will 
need another index to search on. Does this mean I must configure another Solr 
instance (another schema.xml file, dataimporthandler SQL file, etc.)? Do I need 
another Solr core for this? Thanks

Greg


RE: Question concerning the updating of my solr index

2011-05-02 Thread Greg Georges
Yeah, you are right; I have changed that to add documents one at a time rather 
than a list of documents. It still works pretty fast, and I will continue to 
test settings to see if I can tweak it further. Thanks

Greg
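A self-contained sketch of that pattern (the URL, queue/thread settings, and field values are illustrative):

import org.apache.solr.client.solrj.impl.StreamingUpdateSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class StreamingAddSketch {
    public static void main(String[] args) throws Exception {
        // Queue size 1000, 4 background threads: SUSS buffers the adds and
        // streams them to Solr asynchronously, so there is no need to batch
        // the documents yourself.
        StreamingUpdateSolrServer server = new StreamingUpdateSolrServer(
                "http://localhost:8080/apache-solr-1.4.1/", 1000, 4);

        for (int i = 0; i < 20000; i++) {
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "doc-" + i);
            doc.addField("title", "Document " + i);
            server.add(doc); // one document at a time
        }

        server.commit();
    }
}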

-Original Message-
From: Otis Gospodnetic [mailto:otis_gospodne...@yahoo.com] 
Sent: 2 mai 2011 14:56
To: solr-user@lucene.apache.org
Subject: Re: Question concerning the updating of my solr index

Greg,

I believe the point of SUSS is that you can just add docs to it one by one, so 
that SUSS can asynchronously send them to the backend Solr instead of you 
batching the docs.

Otis

Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/



- Original Message 
> From: Greg Georges 
> To: "solr-user@lucene.apache.org" 
> Sent: Mon, May 2, 2011 2:45:40 PM
> Subject: RE: Question concerning the updating of my solr index
> 
> Oops, here is the code
> 
> SolrServer server = new StreamingUpdateSolrServer(
>         "http://localhost:8080/apache-solr-1.4.1/", 1000, 4);
> 
> Collection<SolrInputDocument> docs = new ArrayList<SolrInputDocument>();
> 
> for (Iterator iterator = documents.iterator(); iterator.hasNext();) {
>     Document document = (Document) iterator.next();
>     SolrInputDocument solrDoc = SolrUtils.createDocsSolrDocument(document);
>     docs.add(solrDoc);
> }
> 
> server.add(docs);
> server.commit();
> server.optimize();
> 
> Greg
> 
> -Original Message-
> From: Greg  Georges [mailto:greg.geor...@biztree.com] 
> Sent: 2  mai 2011 14:44
> To: solr-user@lucene.apache.org
> Subject:  RE: Question concerning the updating of my solr index
> 
> Ok, I had seen this in the wiki; performance has gone from 19 seconds to 13.
> I have configured it like this, but I wonder what the best settings would be
> with 20,000 docs to update. Higher or lower queue value? Higher or lower
> thread value? Thanks
> 
> Greg
> 
> -Original Message-
> From: Otis Gospodnetic  [mailto:otis_gospodne...@yahoo.com] 
> Sent: 2 mai 2011 13:59
> To: solr-user@lucene.apache.org
> Subject:  Re: Question concerning the updating of my solr index
> 
> Greg,
> 
> You could use StreamingUpdateSolrServer instead of that UpdateRequest class -
> http://search-lucene.com/?q=StreamingUpdateSolrServer+&fc_project=Solr
> Your index won't be locked in the sense that you could have multiple apps or
> threads adding docs to the same index simultaneously, and searches can be
> executed against the index concurrently.
> 
> Otis
> 
> Sematext  :: http://sematext.com/ ::  Solr - Lucene - Nutch
> Lucene ecosystem search :: http://search-lucene.com/
> 
> 
> 
> - Original Message  
> > From: Greg Georges 
> >  To: "solr-user@lucene.apache.org"  
> >  Sent: Mon, May 2, 2011 1:33:30 PM
> > Subject: Question concerning the  updating of my solr index
> > 
> > Hello all,
> > 
> > I have integrated Solr into my project with success. I use a 
> > dataimporthandler to first import the data, mapping the fields to my 
> > schema.xml. I use SolrJ to query the data and also use faceting. Works great.
> > 
> > The question I have now is a general one on updating the index and how it 
> > works. Right now, I have a thread which runs a couple of times a day to 
> > update the index. My index is composed of about 20,000 documents, and when 
> > this thread runs it takes the data for those documents from the db, I 
> > create a SolrInputDocument for each, and I then use this code to update 
> > the index.
> > 
> > SolrServer server = new CommonsHttpSolrServer(
> >         "http://localhost:8080/apache-solr-1.4.1/");
> > Collection<SolrInputDocument> docs = new ArrayList<SolrInputDocument>();
> > 
> > for (Iterator iterator = documents.iterator(); iterator.hasNext();) {
> >     Document document = (Document) iterator.next();
> >     SolrInputDocument solrDoc = SolrUtils.createDocsSolrDocument(document);
> >     docs.add(solrDoc);
> > }
> > 
> > UpdateRequest req = new UpdateRequest();
> > req.setAction(UpdateRequest.ACTION.COMMIT, false, false);
> > req.add(docs);
> > UpdateResponse rsp = req.process(server);
> > 
> > server.optimize();
> > 
> > This process takes 19 seconds, which is 10 seconds faster than my older 
> > solution using Compass (another open-source search project we used). Is 
> > this the best way to update the index? If I understand correctly, an update 
> > is actually a delete in the index followed by an add. During the 19 
> > seconds, will my index be locked only on the document being updated, or 
> > could the whole index be locked? I am not in production yet with this 
> > solution, so I want to make sure my update process makes sense. Thanks
> > 
> > Greg
> > 
> 


RE: Question concerning the updating of my solr index

2011-05-02 Thread Greg Georges
Oops, here is the code

SolrServer server = new StreamingUpdateSolrServer(
        "http://localhost:8080/apache-solr-1.4.1/", 1000, 4);

Collection<SolrInputDocument> docs = new ArrayList<SolrInputDocument>();

for (Iterator iterator = documents.iterator(); iterator.hasNext();) {
    Document document = (Document) iterator.next();
    SolrInputDocument solrDoc = SolrUtils.createDocsSolrDocument(document);
    docs.add(solrDoc);
}

server.add(docs);
server.commit();
server.optimize();

Greg

-Original Message-
From: Greg Georges [mailto:greg.geor...@biztree.com] 
Sent: 2 mai 2011 14:44
To: solr-user@lucene.apache.org
Subject: RE: Question concerning the updating of my solr index

Ok, I had seen this in the wiki; performance has gone from 19 seconds to 13. I 
have configured it like this, but I wonder what the best settings would be with 
20,000 docs to update. Higher or lower queue value? Higher or lower thread 
value? Thanks

Greg

-Original Message-
From: Otis Gospodnetic [mailto:otis_gospodne...@yahoo.com] 
Sent: 2 mai 2011 13:59
To: solr-user@lucene.apache.org
Subject: Re: Question concerning the updating of my solr index

Greg,

You could use StreamingUpdateSolrServer instead of that UpdateRequest class - 
http://search-lucene.com/?q=StreamingUpdateSolrServer+&fc_project=Solr
Your index won't be locked in the sense that you could have multiple apps or 
threads adding docs to the same index simultaneously and that searches can be 
executed against the index concurrently.

Otis

Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/



- Original Message 
> From: Greg Georges 
> To: "solr-user@lucene.apache.org" 
> Sent: Mon, May 2, 2011 1:33:30 PM
> Subject: Question concerning the updating of my solr index
> 
> Hello all,
> 
> I have integrated Solr into my project with success. I use a dataimporthandler 
> to first import the data, mapping the fields to my schema.xml. I use SolrJ to 
> query the data and also use faceting. Works great.
> 
> The question I have now is a general one on updating the index and how it 
> works. Right now, I have a thread which runs a couple of times a day to update 
> the index. My index is composed of about 20,000 documents, and when this 
> thread runs it takes the data for those documents from the db, I create a 
> SolrInputDocument for each, and I then use this code to update the index.
> 
> SolrServer server = new CommonsHttpSolrServer(
>         "http://localhost:8080/apache-solr-1.4.1/");
> Collection<SolrInputDocument> docs = new ArrayList<SolrInputDocument>();
> 
> for (Iterator iterator = documents.iterator(); iterator.hasNext();) {
>     Document document = (Document) iterator.next();
>     SolrInputDocument solrDoc = SolrUtils.createDocsSolrDocument(document);
>     docs.add(solrDoc);
> }
> 
> UpdateRequest req = new UpdateRequest();
> req.setAction(UpdateRequest.ACTION.COMMIT, false, false);
> req.add(docs);
> UpdateResponse rsp = req.process(server);
> 
> server.optimize();
> 
> This process takes 19 seconds, which is 10 seconds faster than my older 
> solution using Compass (another open-source search project we used). Is this 
> the best way to update the index? If I understand correctly, an update is 
> actually a delete in the index followed by an add. During the 19 seconds, will 
> my index be locked only on the document being updated, or could the whole 
> index be locked? I am not in production yet with this solution, so I want to 
> make sure my update process makes sense. Thanks
> 
> Greg
> 


RE: Question concerning the updating of my solr index

2011-05-02 Thread Greg Georges
Ok, I had seen this in the wiki; performance has gone from 19 seconds to 13. I 
have configured it like this, but I wonder what the best settings would be with 
20,000 docs to update. Higher or lower queue value? Higher or lower thread 
value? Thanks

Greg

-Original Message-
From: Otis Gospodnetic [mailto:otis_gospodne...@yahoo.com] 
Sent: 2 mai 2011 13:59
To: solr-user@lucene.apache.org
Subject: Re: Question concerning the updating of my solr index

Greg,

You could use StreamingUpdateSolrServer instead of that UpdateRequest class - 
http://search-lucene.com/?q=StreamingUpdateSolrServer+&fc_project=Solr
Your index won't be locked in the sense that you could have multiple apps or 
threads adding docs to the same index simultaneously and that searches can be 
executed against the index concurrently.

Otis

Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/



- Original Message 
> From: Greg Georges 
> To: "solr-user@lucene.apache.org" 
> Sent: Mon, May 2, 2011 1:33:30 PM
> Subject: Question concerning the updating of my solr index
> 
> Hello all,
> 
> I have integrated Solr into my project with success. I use a dataimporthandler 
> to first import the data, mapping the fields to my schema.xml. I use SolrJ to 
> query the data and also use faceting. Works great.
> 
> The question I have now is a general one on updating the index and how it 
> works. Right now, I have a thread which runs a couple of times a day to update 
> the index. My index is composed of about 20,000 documents, and when this 
> thread runs it takes the data for those documents from the db, I create a 
> SolrInputDocument for each, and I then use this code to update the index.
> 
> SolrServer server = new CommonsHttpSolrServer(
>         "http://localhost:8080/apache-solr-1.4.1/");
> Collection<SolrInputDocument> docs = new ArrayList<SolrInputDocument>();
> 
> for (Iterator iterator = documents.iterator(); iterator.hasNext();) {
>     Document document = (Document) iterator.next();
>     SolrInputDocument solrDoc = SolrUtils.createDocsSolrDocument(document);
>     docs.add(solrDoc);
> }
> 
> UpdateRequest req = new UpdateRequest();
> req.setAction(UpdateRequest.ACTION.COMMIT, false, false);
> req.add(docs);
> UpdateResponse rsp = req.process(server);
> 
> server.optimize();
> 
> This process takes 19 seconds, which is 10 seconds faster than my older 
> solution using Compass (another open-source search project we used). Is this 
> the best way to update the index? If I understand correctly, an update is 
> actually a delete in the index followed by an add. During the 19 seconds, will 
> my index be locked only on the document being updated, or could the whole 
> index be locked? I am not in production yet with this solution, so I want to 
> make sure my update process makes sense. Thanks
> 
> Greg
> 


Question concerning the updating of my solr index

2011-05-02 Thread Greg Georges
Hello all,

I have integrated Solr into my project with success. I use a dataimporthandler 
to first import the data, mapping the fields to my schema.xml. I use SolrJ to 
query the data and also use faceting. Works great.

The question I have now is a general one on updating the index and how it 
works. Right now, I have a thread which runs a couple of times a day to update 
the index. My index is composed of about 20,000 documents, and when this thread 
runs it takes the data for those documents from the db, I create a 
SolrInputDocument for each, and I then use this code to update the index.

SolrServer server = new CommonsHttpSolrServer(
        "http://localhost:8080/apache-solr-1.4.1/");
Collection<SolrInputDocument> docs = new ArrayList<SolrInputDocument>();

for (Iterator iterator = documents.iterator(); iterator.hasNext();) {
    Document document = (Document) iterator.next();
    SolrInputDocument solrDoc = SolrUtils.createDocsSolrDocument(document);
    docs.add(solrDoc);
}

UpdateRequest req = new UpdateRequest();
req.setAction(UpdateRequest.ACTION.COMMIT, false, false);
req.add(docs);
UpdateResponse rsp = req.process(server);

server.optimize();

This process takes 19 seconds, which is 10 seconds faster than my older 
solution using Compass (another open-source search project we used). Is this 
the best way to update the index? If I understand correctly, an update is 
actually a delete in the index followed by an add. During the 19 seconds, will 
my index be locked only on the document being updated, or could the whole index 
be locked? I am not in production yet with this solution, so I want to make 
sure my update process makes sense. Thanks

Greg


Embedded Solr

2011-03-21 Thread Greg Georges
Hello all,

I am using Solr in a Java architecture right now, and the results are great. 
The app development team has asked me if it is possible to embed Solr, but the 
request is to embed it into a C++ app and a Mac app using Objective-C. I do not 
have much knowledge of embedded Solr. Does it need a JVM? Is what they are 
asking possible, and are there any resources for it? Thanks

Greg
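For context, "embedded Solr" in SolrJ means EmbeddedSolrServer, which runs the Solr core inside the calling Java process, so it does need a JVM. A minimal sketch (assuming a Solr 1.4-style solr home on disk; the path is illustrative):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.embedded.EmbeddedSolrServer;
import org.apache.solr.core.CoreContainer;

public class EmbeddedSolrSketch {
    public static void main(String[] args) throws Exception {
        // Point at an existing solr home (conf/schema.xml, conf/solrconfig.xml).
        System.setProperty("solr.solr.home", "/opt/solr-home");
        CoreContainer container = new CoreContainer.Initializer().initialize();

        // "" selects the default core in a single-core setup.
        EmbeddedSolrServer server = new EmbeddedSolrServer(container, "");
        long hits = server.query(new SolrQuery("*:*")).getResults().getNumFound();
        System.out.println("Documents in the embedded index: " + hits);

        container.shutdown();
    }
}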


Indexing a text string for faceting

2011-03-09 Thread Greg Georges
Hello all,

I have a small problem with my faceting fields. I create a new faceting field 
which is indexed and not stored, and populate it with copyField. The problem is 
that I facet on category names which look like this:

Policies & Documentation (37)
Forms & Checklists (22)

Right now my fields use the string type, which is not good because I think by 
default it is using a tokenizer etc. I think I must define a new field type so 
that my category names will be properly indexed as a facet field. Here is what 
I have now









Can someone give me a type configuration which will support my category names 
which have whitespaces and ampersands?

Thanks in advance

Greg


Indexing languages, dataimporthandler

2011-02-22 Thread Greg Georges
Hello all,

I have just gone through the mailing list and have set up the different field 
type analyzers for my 6 languages in my schema.xml. Here is my question: I am 
using the dataimporthandler to import data from my database into my index. In 
my table, the documentname column's data can be in any of the 6 languages. 
Let's say I want to index this data and apply the different language analyzers 
in certain cases; what would be the best way to do that? The real problem is 
that I do not know the language of the string in the documentname column when I 
create my index, so I cannot apply the correct field type. Should I create a 
custom transformer?

Thanks

Greg


Question regarding indexing multiple languages, stopwords, etc.

2011-02-21 Thread Greg Georges
Hello all,

I have gotten my DataImportHandler to index my data from my MySQL database. I 
was looking at the schema tool and noticed that stopwords in different 
languages are being indexed as terms. The 6 languages we have are English, 
French, Spanish, Chinese, German and Italian.

Right now I am using the basic schema configuration for English. How do I 
define analyzers for the other languages? I have looked at the wiki page 
(http://wiki.apache.org/solr/LanguageAnalysis), but I would like to have an 
example configuration for all the languages I need. I also need a list of 
stopwords for these languages. So far I have this


  







  

Thanks in advance

Greg


RE: Question regarding inner entity in dataimporthandler

2011-02-15 Thread Greg Georges
OK, I think I found some information: supposedly TemplateTransformer will 
return an empty string if the value of a variable is null. Some people say to 
use the RegexTransformer instead; can anyone clarify this? Thanks

-Original Message-
From: Greg Georges [mailto:greg.geor...@biztree.com] 
Sent: 15 février 2011 13:38
To: solr-user@lucene.apache.org
Subject: Question regarding inner entity in dataimporthandler

Hello all,

I have searched the forums for the question I am about to ask and never found 
any concrete results. This is my case: I am defining the data config file with 
the document and entity tags. I define with success a basic entity mapped to my 
MySQL database, and I then add some inner entities. The problem I have is with 
the one-to-one relationship between my "document" entity and its 
"documentcategory" entity. In my document table, the documentcategory foreign 
key is optional. Here is my mapping


   

   
   
   
   
   
   
   
   
   
   
   
   
   
   
   
   
   
   
   
   
   
   
   

   

   
   
   
   
   

   



My first document entity in the database does not have a documentcategory. 
When I run the dataimporthandler I get this error message:

Unable to execute query: select CategoryID as id, CategoryName as categoryName, 
MetaTitle as categoryMetaTitle, MetaDescription as categoryMetaDescription, 
MetaKeywords as categoryMetakeywords from documentcategory where CategoryID =  
Processing Document # 1

Caused by: com.mysql.jdbc.exceptions.MySQLSyntaxErrorException: You have an 
error in your SQL syntax; check the manual that corresponds to your MySQL 
server version for the right syntax to use near '' at line 1

It seems that since document.categoryId is null, an empty string is used in the 
query. In other words, the importer does not behave like a left join, which 
would return a row even when the child is null. Does anyone know a possible 
solution? Maybe instead of using inner entities, I could define a left join 
directly in my document query? Thanks

BTW: I already tested the config with another child element and everything 
works fine. Only the documentcategory case, which is sometimes null, causes 
problems.

Greg



Question regarding inner entity in dataimporthandler

2011-02-15 Thread Greg Georges
Hello all,

I have searched the forums for the question I am about to ask and never found 
any concrete results. This is my case: I am defining the data config file with 
the document and entity tags. I define with success a basic entity mapped to my 
MySQL database, and I then add some inner entities. The problem I have is with 
the one-to-one relationship between my "document" entity and its 
"documentcategory" entity. In my document table, the documentcategory foreign 
key is optional. Here is my mapping


   

   
   
   
   
   
   
   
   
   
   
   
   
   
   
   
   
   
   
   
   
   
   
   

   

   
   
   
   
   

   



My first document entity in the database does not have a documentcategory. 
When I run the dataimporthandler I get this error message:

Unable to execute query: select CategoryID as id, CategoryName as categoryName, 
MetaTitle as categoryMetaTitle, MetaDescription as categoryMetaDescription, 
MetaKeywords as categoryMetakeywords from documentcategory where CategoryID =  
Processing Document # 1

Caused by: com.mysql.jdbc.exceptions.MySQLSyntaxErrorException: You have an 
error in your SQL syntax; check the manual that corresponds to your MySQL 
server version for the right syntax to use near '' at line 1

It seems that since document.categoryId is null, an empty string is used in the 
query. In other words, the importer does not behave like a left join, which 
would return a row even when the child is null. Does anyone know a possible 
solution? Maybe instead of using inner entities, I could define a left join 
directly in my document query? Thanks

BTW: I already tested the config with another child element and everything 
works fine. Only the documentcategory case, which is sometimes null, causes 
problems.

Greg



Difference between Solr and Lucidworks distribution

2011-02-11 Thread Greg Georges
Hello all,

I just started watching the webinars from Lucidworks, and they mention their 
distribution, which has an installer, etc. Are there any other differences? Is 
it a good idea to use this free distribution?

Greg


Solr design decisions

2011-02-11 Thread Greg Georges
Hello all,

I have just finished the book "Solr 1.4 Enterprise Search Server". I now 
understand most of the basics of Solr and also how we can scale the solution. 
Our goal is to have a centralized search service for a multitude of apps.

Our first application which we want to index, is a system in which we must 
index documents through Solr Cell. These documents are associated to certain 
clients (companies). Each client can have a multitude of users, and each user 
can be part of a group of users. We have permissions on each physical document 
in the system, and we want this to also be present in our enterprise search for 
the system.

I read that we can associate roles and ids with Solr documents in order to show 
only a subset of search results to a particular user. The question I am asking 
is this: a best practice in Solr is to batch commits. The problem in my case is 
that if we change a document's permissions (role) and we batch the commit, 
there can be a period where the document in the search results is still 
associated with the old role. What should I do in this case? Should I just 
commit the change right away? If this action is done many times by many 
clients, will performance still scale even if I do not batch my commits? 
Thanks

Greg
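As a small illustration of the role-based filtering mentioned above, a SolrJ sketch (the field name "role", the values, and the URL are illustrative) of restricting results to the roles a given user holds:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

public class RoleFilterSketch {
    public static void main(String[] args) throws Exception {
        SolrServer server = new CommonsHttpSolrServer("http://localhost:8080/solr");

        // Each indexed document carries a multi-valued "role" field; the filter
        // query restricts results to the roles the current user belongs to.
        SolrQuery query = new SolrQuery("contract");
        query.addFilterQuery("role:(manager OR hr)");

        QueryResponse rsp = server.query(query);
        System.out.println("Visible documents: " + rsp.getResults().getNumFound());
    }
}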


RE: Architecture decisions with Solr

2011-02-09 Thread Greg Georges
From what I understand about multicore, each of the indexes is independent of 
the others, right? Or would one index have access to the info of the other? My 
requirement is, as you mention, that a client has access only to the search 
data based on their own documents. Other clients have no access to the index of 
other clients.

Greg

-Original Message-
From: Darren Govoni [mailto:dar...@ontrenet.com] 
Sent: 9 février 2011 14:28
To: solr-user@lucene.apache.org
Subject: Re: Architecture decisions with Solr

What about standing up a VM (a search appliance that you would make) for each 
client? If there's no data sharing across clients, then using the same Solr 
server/index doesn't seem necessary.

Solr will easily meet your needs though; it's the best there is.

On Wed, 2011-02-09 at 14:23 -0500, Greg Georges wrote:

> Hello all,
> 
> I am looking into an enterprise search solution for our architecture, and I 
> am very pleased to see all the features Solr provides. In our case, we will 
> have a need for a highly scalable application for multiple clients. This 
> application will be built to serve many users, each of whom will have a 
> client account. Each client will have a multitude of documents to index 
> (0-1000s of documents). After discussion, we were talking about going 
> multicore and having one index per client account. The reason for this is 
> that security is achieved by having a separate index for each client, etc. 
> Is this the best approach? How feasible is it (dynamically creating indexes 
> on client account creation)? Or is it better to go the faceted search 
> capabilities route? Thanks for your help
> 
> Greg




Architecture decisions with Solr

2011-02-09 Thread Greg Georges
Hello all,

I am looking into an enterprise search solution for our architecture, and I am 
very pleased to see all the features Solr provides. In our case, we will have a 
need for a highly scalable application for multiple clients. This application 
will be built to serve many users, each of whom will have a client account. 
Each client will have a multitude of documents to index (0-1000s of documents). 
After discussion, we were talking about going multicore and having one index 
per client account. The reason for this is that security is achieved by having 
a separate index for each client, etc. Is this the best approach? How feasible 
is it (dynamically creating indexes on client account creation)? Or is it 
better to go the faceted search capabilities route? Thanks for your help

Greg
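On the "dynamically create indexes on client account creation" point, SolrJ exposes the CoreAdmin API; a hedged sketch (assuming a multicore setup with a persistent solr.xml and an instanceDir that already contains a conf/ directory; the names and paths are illustrative):

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.request.CoreAdminRequest;

public class CreateClientCoreSketch {
    public static void main(String[] args) throws Exception {
        // Talk to the Solr instance itself (the CoreAdmin handler), not to a core.
        SolrServer admin = new CommonsHttpSolrServer("http://localhost:8080/solr");

        // Create a dedicated core (i.e. a separate index) for a new client account.
        String clientId = "client-1234";
        CoreAdminRequest.createCore(clientId, "/opt/solr/cores/" + clientId, admin);
    }
}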