Re: Question concerning date fields

2012-04-20 Thread Gora Mohanty
On 21 April 2012 09:12, Bill Bell  wrote:
> We are loading a long (number of seconds since 1970?) value into Solr using
> Java and SolrJ. What is the best way to convert this into the right Solr date
> fields?
[...]

There are various options, depending on the source of
your data, and how you are indexing the field into Solr:
* If you are fetching the no. of seconds from a database,
  most DBs will have date-conversion functions that you
  can use in the SELECT statement.
* If you are using SolrJ, java.sql.Date has a constructor that takes
  milliseconds since the Unix epoch (see the sketch after this list).
* If you are using DIH, you can use a transformer to
  convert the number of seconds to a date.
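
For the SolrJ case, a minimal sketch might look like this (the "created_at"
date field name is an assumption about your schema):

import java.util.Date;

import org.apache.solr.common.SolrInputDocument;

public class EpochToSolrDate {
    // Convert a Unix timestamp in seconds into a java.util.Date; SolrJ
    // serializes the Date into the ISO form Solr date fields expect.
    public static SolrInputDocument toDoc(String id, long epochSeconds) {
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", id);
        doc.addField("created_at", new Date(epochSeconds * 1000L)); // seconds -> ms
        return doc;
    }
}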

Regards,
Gora


Question concerning date fields

2012-04-20 Thread Bill Bell
We are loading a long (number of seconds since 1970?) value into Solr using
Java and SolrJ. What is the best way to convert this into the right Solr date
fields?

Sent from my Mobile device
720-256-8076


Re: Storing the md5 hash of pdf files as a field in the index

2012-04-20 Thread Otis Gospodnetic
Hi Joe,

You could write a custom URP - Update Request Processor.  This URP would take 
the value from one SolrDocument field (say the one that has the full path to 
your PDF and is thus unique), compute MD5 using Java API for doing that, and 
would stick that MD5 value in some field that you've defined as string to hold 
that value.
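
A rough sketch of such a URP follows; it is not a drop-in implementation. The
"path" source field and "md5" target field are assumptions, and you would still
register the factory in an update chain in solrconfig.xml:

import java.io.IOException;
import java.security.MessageDigest;

import org.apache.solr.common.SolrInputDocument;
import org.apache.solr.request.SolrQueryRequest;
import org.apache.solr.response.SolrQueryResponse;
import org.apache.solr.update.AddUpdateCommand;
import org.apache.solr.update.processor.UpdateRequestProcessor;
import org.apache.solr.update.processor.UpdateRequestProcessorFactory;

public class Md5SignatureProcessorFactory extends UpdateRequestProcessorFactory {
  @Override
  public UpdateRequestProcessor getInstance(SolrQueryRequest req,
      SolrQueryResponse rsp, UpdateRequestProcessor next) {
    return new UpdateRequestProcessor(next) {
      @Override
      public void processAdd(AddUpdateCommand cmd) throws IOException {
        SolrInputDocument doc = cmd.getSolrInputDocument();
        Object path = doc.getFieldValue("path");        // assumed source field
        if (path != null) {
          doc.setField("md5", md5Hex(path.toString())); // assumed target field
        }
        super.processAdd(cmd);                          // hand off to the next URP
      }
    };
  }

  private static String md5Hex(String s) {
    try {
      byte[] digest = MessageDigest.getInstance("MD5").digest(s.getBytes("UTF-8"));
      StringBuilder sb = new StringBuilder();
      for (byte b : digest) {
        sb.append(String.format("%02x", b));  // hex-encode each byte
      }
      return sb.toString();
    } catch (Exception e) {
      throw new RuntimeException(e);
    }
  }
}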

Otis

Performance Monitoring SaaS for Solr - 
http://sematext.com/spm/solr-performance-monitoring/index.html



>
> From: "kuchenbr...@mail.org" 
>To: solr-user@lucene.apache.org 
>Sent: Friday, April 20, 2012 10:07 AM
>Subject: Storing the md5 hash of pdf files as a field in the index
> 
>Hi,
>
>I want to build an index of quite a number of pdf and msword files using the 
>Data Import Request Handler and the Tika Entity Processor. It works very well. 
>Now I would like to use the md5 digest of the binary (pdf/word) file as the 
>unique key in the index. But I do not know how to implement this. In the 
>data-config.xml configuring the FileListEntityProcessor I have access to the 
>absolute file name of a pdf to be indexed. I'm sitting on a Linux box and so 
>there is an easy way to calculate the md5 hash using the operating system 
>command md5sum. But how can I trigger this calculation and store the result 
>as a field in my index?
>
>Any tips or ideas are really appreciated.
>
>Thanks.
>Joe
>
>
>

Re: Crawling an SCM to update a Solr index

2012-04-20 Thread Otis Gospodnetic
Kristian,

For what it's worth, for http://search-lucene.com and http://search-hadoop.com 
we simply check out the source code from the SCM and index from the file 
system.  It works reasonably well.  The only issue that I can recall us having 
is with the source code organization under SCM: modules get moved around, and 
sometimes this requires us to update stuff on our end to match those changes.

Otis

Performance Monitoring for Solr - 
http://sematext.com/spm/solr-performance-monitoring/index.html



>
> From: "Van Tassell, Kristian" 
>To: "solr-user@lucene.apache.org"  
>Sent: Friday, April 20, 2012 3:26 PM
>Subject: Crawling an SCM to update a Solr index
> 
>Hello everyone,
>
>I'm in the process of pulling together requirements for a SCM (source code 
>manager) crawling mechanism for our Solr index. I probably don't need to argue 
>the need for a crawler, but to be specific, we have an index which receives 
>its updates from a custom built application. I would, however, like to 
>periodically crawl the SCM to ensure the index is up to date. In addition, if 
>updates are made which require a complete reindex (such as schema.xml 
>modifications), I could utilize this crawler to update everything or specific 
>areas.
>
>I'm wondering if there are any initiatives, tools (like Nutch) or whitepapers 
>out there, which crawl an SCM. More specifically, I'm looking for a Perforce 
>solution. I'm guessing that there is nothing specific and I'm prepared to 
>design to our specific requirements, but wanted to check with the Solr 
>community prior to getting too far in.
>
>I'm most likely going to build the solution to interact with the SCM directly 
>(via their API) versus sync'ing the SCM repository to the filesystem and crawl 
>that way, since there could be filesystem problem syncing the data and because 
>there may be relevant metadata information that can be retrieved from the SCM.
>
>Thanks in advance for any information you may have,
>Kristian
>
>
>

Re: How to index pdf's content with SolrJ?

2012-04-20 Thread Erick Erickson
This might help:
http://www.lucidimagination.com/blog/2012/02/14/indexing-with-solrj/

The key bit is that you have to have Tika parse your file yourself,
and then extract the content to send to Solr...
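
A condensed sketch of that approach; the Solr URL, field names, and the
single-file main() are only for illustration:

import java.io.File;
import java.io.FileInputStream;
import java.io.InputStream;

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.common.SolrInputDocument;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.BodyContentHandler;

public class TikaSolrJIndexer {
  public static void main(String[] args) throws Exception {
    SolrServer server = new CommonsHttpSolrServer("http://localhost:8080/solr");

    // Parse the PDF locally with Tika and collect the body text.
    File pdf = new File(args[0]);
    BodyContentHandler handler = new BodyContentHandler(-1); // no write limit
    Metadata metadata = new Metadata();
    InputStream in = new FileInputStream(pdf);
    try {
      new AutoDetectParser().parse(in, handler, metadata, new ParseContext());
    } finally {
      in.close();
    }

    // Send the extracted text to Solr as a plain document.
    SolrInputDocument doc = new SolrInputDocument();
    doc.addField("id", pdf.getName());            // assumed unique key field
    doc.addField("content", handler.toString());  // extracted body text
    server.add(doc);
    server.commit();
  }
}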

Best
Erick

On Fri, Apr 20, 2012 at 7:36 PM, vasuj  wrote:
>
> I'm trying to index a few pdf documents using SolrJ as described at
> http://wiki.apache.org/solr/ContentStreamUpdateRequestExample, below there's
> the code:
>
> import static
> org.apache.solr.handler.extraction.ExtractingParams.LITERALS_PREFIX;
> import static
> org.apache.solr.handler.extraction.ExtractingParams.MAP_PREFIX;
> import static
> org.apache.solr.handler.extraction.ExtractingParams.UNKNOWN_FIELD_PREFIX;
>
> import org.apache.solr.client.solrj.SolrServer;
> import org.apache.solr.client.solrj.SolrServerException;
> import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
> import org.apache.solr.client.solrj.request.AbstractUpdateRequest;
> import org.apache.solr.client.solrj.request.ContentStreamUpdateRequest;
> import org.apache.solr.common.util.NamedList;
> ...
> public static void indexFilesSolrCell(String fileName) throws IOException,
> SolrServerException {
>
>  String urlString = "http://localhost:8080/solr";
>  SolrServer server = new CommonsHttpSolrServer(urlString);
>
>  ContentStreamUpdateRequest up = new
> ContentStreamUpdateRequest("/update/extract");
>  up.addFile(new File(fileName));
>  String id = fileName.substring(fileName.lastIndexOf('/')+1);
>  System.out.println(id);
>
>  up.setParam(LITERALS_PREFIX + "id", id);
>  up.setParam(LITERALS_PREFIX + "location", fileName); // this field doesn't
> exist in schema.xml; it'll be created as attr_location
>  up.setParam(UNKNOWN_FIELD_PREFIX, "attr_");
>  up.setParam(MAP_PREFIX + "content", "attr_content");
>  up.setAction(AbstractUpdateRequest.ACTION.COMMIT, true, true);
>
>  NamedList<Object> request = server.request(up);
>  for (Entry<String, Object> entry : request) {
>    System.out.println(entry.getKey());
>    System.out.println(entry.getValue());
>  }
> }
> Unfortunately when querying for *:* I get the list of indexed documents but
> the content field is empty. How can I change the code above to extract also
> the document's content?
>
> Below there's the xml fragment that describes this document:
>
> 
>  
>                
>  
>  
>    /home/alex/Documents/lsp.pdf
>  
>  
>    stream_size
>    31203
>    Content-Type
>    application/pdf
>  
>  
>    31203
>  
>  
>    application/pdf
>  
>  lsp.pdf
> 
> I don't think that this problem is related to an incorrect installation of
> Apache Tika, because previously I had a few ServerException but now I've
> installed the required jars in the correct path. Moreover I've tried to
> index a txt file using the same class but the attr_content field is always
> empty.
>
> Also tried In the schema.xml file, "stored= true" in the content field,
>
>  required="false" multiValued="true"/>
>
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/How-to-index-pdf-s-content-with-SolrJ-tp3927284p3927284.html
> Sent from the Solr - User mailing list archive at Nabble.com.


Re: Date granularity

2012-04-20 Thread Erick Erickson
Well, that's just the way Solr works. You can tune range
performance by playing with precisionStep; Trie
fields are built to make range queries perform well.
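
For reference, the Trie-based date type from the stock example schema looks
roughly like this (the attribute values are the usual example defaults, so
treat them as illustrative); a smaller precisionStep indexes more precision
levels, which speeds up range queries at the cost of a slightly larger index:

<fieldType name="tdate" class="solr.TrieDateField" omitNorms="true"
           precisionStep="6" positionIncrementGap="0"/>
<field name="timestamp" type="tdate" indexed="true" stored="true"/>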

Best
Erick

On Fri, Apr 20, 2012 at 10:20 AM, vybe3142  wrote:
> ... Inelegant as opposed to the possibility of using /DAY to specify day
> granularity on a single term query
>
> In any case, if that's how SOLR works, that's fine
>
> Any rough idea of the performance of range queries vs truncated day queries?
> Otherwise, I might just write up a quick program to compare them
>
> Thanks
>
>
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/Date-granularity-tp3920890p3926165.html
> Sent from the Solr - User mailing list archive at Nabble.com.


Re: Convert a SolrDocumentList to DocList

2012-04-20 Thread Erick Erickson
OK, this description really sounds like an XY problem. Why do you
want to do this? What is the higher-level problem you're trying to solve?

Best
Erick

On Fri, Apr 20, 2012 at 9:18 AM, Ramprakash Ramamoorthy
 wrote:
> Dear all,
>
>        Is there any way I can convert a SolrDocumentList to a DocList and
> set it in the QueryResult object?
>
>        Or, the workaround adding a SolrDocumentList object to the
> QueryResult object?
>
> --
> With Thanks and Regards,
> Ramprakash Ramamoorthy,
> Project Trainee,
> Zoho Corporation.
> +91 9626975420


How to index pdf's content with SolrJ?

2012-04-20 Thread vasuj

I'm trying to index a few pdf documents using SolrJ as described at
http://wiki.apache.org/solr/ContentStreamUpdateRequestExample, below there's
the code:

import static
org.apache.solr.handler.extraction.ExtractingParams.LITERALS_PREFIX;
import static
org.apache.solr.handler.extraction.ExtractingParams.MAP_PREFIX;
import static
org.apache.solr.handler.extraction.ExtractingParams.UNKNOWN_FIELD_PREFIX;

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.SolrServerException;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.request.AbstractUpdateRequest;
import org.apache.solr.client.solrj.request.ContentStreamUpdateRequest;
import org.apache.solr.common.util.NamedList;
...
public static void indexFilesSolrCell(String fileName) throws IOException,
SolrServerException {

  String urlString = "http://localhost:8080/solr";
  SolrServer server = new CommonsHttpSolrServer(urlString);

  ContentStreamUpdateRequest up = new
ContentStreamUpdateRequest("/update/extract");
  up.addFile(new File(fileName));
  String id = fileName.substring(fileName.lastIndexOf('/')+1);
  System.out.println(id);

  up.setParam(LITERALS_PREFIX + "id", id);
  up.setParam(LITERALS_PREFIX + "location", fileName); // this field doesn't
exist in schema.xml; it'll be created as attr_location
  up.setParam(UNKNOWN_FIELD_PREFIX, "attr_");
  up.setParam(MAP_PREFIX + "content", "attr_content");
  up.setAction(AbstractUpdateRequest.ACTION.COMMIT, true, true);

  NamedList<Object> request = server.request(up);
  for (Entry<String, Object> entry : request) {
System.out.println(entry.getKey());
System.out.println(entry.getValue());
  }
}
Unfortunately when querying for *:* I get the list of indexed documents but
the content field is empty. How can I change the code above to extract also
the document's content?

Below there's the xml fragment that describes this document:


  

  
  
/home/alex/Documents/lsp.pdf
  
  
stream_size
31203
Content-Type
application/pdf
  
  
31203
  
  
application/pdf
  
  lsp.pdf

I don't think that this problem is related to an incorrect installation of
Apache Tika, because previously I had a few ServerException but now I've
installed the required jars in the correct path. Moreover I've tried to
index a txt file using the same class but the attr_content field is always
empty.

Also tried In the schema.xml file, "stored= true" in the content field, 



--
View this message in context: 
http://lucene.472066.n3.nabble.com/How-to-index-pdf-s-content-with-SolrJ-tp3927284p3927284.html
Sent from the Solr - User mailing list archive at Nabble.com.


null pointer error with solr deduplication

2012-04-20 Thread Peter Markey
Hello,

I have been trying out deduplication in Solr by following:
http://wiki.apache.org/solr/Deduplication. I have defined a signature field
to hold the values of the signature created based on a few other fields in a
document, and the idea seems to work like a charm in a single Solr instance.
But when I have multiple cores and try to do a distributed search (
http://localhost:8080/solr/core0/select?q=*&shards=localhost:8080/solr/dedupe,localhost:8080/solr/dedupe2&facet=true&facet.field=doc_id)
I get the error pasted below. While normal search (with just q) works fine,
the facet/stats queries seem to be the culprit. The doc_id contains
duplicate ids since I'm testing the same set of documents indexed in both
the cores(dedupe, dedupe2). Any insights would be highly appreciated.

Thanks



20-Apr-2012 11:39:35 PM org.apache.solr.common.SolrException log
SEVERE: java.lang.NullPointerException
at
org.apache.solr.handler.component.QueryComponent.mergeIds(QueryComponent.java:887)
at
org.apache.solr.handler.component.QueryComponent.handleRegularResponses(QueryComponent.java:633)
at
org.apache.solr.handler.component.QueryComponent.handleResponses(QueryComponent.java:612)
at
org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:307)
at
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:1540)
at
org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:435)
at
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:256)
at
org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:243)
at
org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:210)
at
org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:224)
at
org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:169)
at
org.apache.catalina.authenticator.AuthenticatorBase.invoke(AuthenticatorBase.java:472)
at
org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:168)
at
org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:98)
at org.apache.catalina.valves.AccessLogValve.invoke(AccessLogValve.java:927)
at
org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:118)
at
org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:407)
at
org.apache.coyote.http11.AbstractHttp11Processor.process(AbstractHttp11Processor.java:987)
at
org.apache.coyote.AbstractProtocol$AbstractConnectionHandler.process(AbstractProtocol.java:579)
at
org.apache.tomcat.util.net.JIoEndpoint$SocketProcessor.run(JIoEndpoint.java:307)
at
java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)


Re: Opposite to MoreLikeThis?

2012-04-20 Thread Darren Govoni
You could run the MLT for the document in question, then gather all
the doc ids in the MLT results and negate those in a subsequent
query. Not sure how well that would work with very large result sets,
but it's something to try.
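
A rough SolrJ sketch of that first idea; it assumes an "/mlt" handler is
configured and that "id" and "text" are the relevant fields, and it omits
error handling and paging:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.common.SolrDocument;
import org.apache.solr.common.SolrDocumentList;

public class LessLikeThis {
    public static SolrDocumentList lessLikeThis(SolrServer server, String docId,
            String userQuery) throws Exception {
        // 1. Ask the MLT handler for documents similar to the unwanted one.
        SolrQuery mlt = new SolrQuery("id:" + docId);
        mlt.set("qt", "/mlt");      // assumed MoreLikeThis handler name
        mlt.set("mlt.fl", "text");  // field(s) to derive interesting terms from
        mlt.setRows(100);
        SolrDocumentList similar = server.query(mlt).getResults();

        // 2. Exclude those ids (and the seed document) from the user's query.
        StringBuilder q = new StringBuilder(userQuery);
        q.append(" -id:").append(docId);
        for (SolrDocument d : similar) {
            q.append(" -id:").append(d.getFieldValue("id"));
        }
        return server.query(new SolrQuery(q.toString())).getResults();
    }
}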

Another approach would be to gather the "interesting terms" from the
document in question and then negate those terms in subsequent queries.
Perhaps with many negated terms, Solr will rank the results based on
most negated terms above less negated terms, simulating a ranked "less
like" effect.

On Fri, 2012-04-20 at 15:38 -0700, Charlie Maroto wrote:
> Hi all,
> 
> Is there a way to implement the opposite to MoreLikeThis (LessLikeThis, I
> guess :).  The requirement we have is to remove all documents with content
> like that of a given document id or a text provided by the end-user.  In
> the current index implementation (not using Solr), the user can narrow
> results by indicating what document(s) are not relevant to him and then
> request to remove from the search results any document whose content is
> like that of the selected document(s)
> 
> Our index has close to 100 million documents and they cover multiple topics
> that are not related to one another.  So, a search for some broad terms may
> retrieve documents about engineering, agriculture, communications, etc.  As
> the user is trying to discover the relevant documents, he may select an
> agriculture-related document to exclude it and those documents like it from
> the results set; same w/ engineering-like content, etc. until most of the
> documents are about communications.
> 
> Of course, some exclusions may actually remove relevant content but those
> filters can be removed to go back to the previous set of results.
> 
> Any ideas from similar implementations or suggestions are welcomed!
> Thanks,
> Carlos




Re: Language Identification

2012-04-20 Thread Jan Høydahl
Hi,

Solr just reuses Tika's language identifier. But you are of course free to do 
your language detection on the Nutch side if you choose and not invoke the one 
in Solr.
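
For reference, the Tika-based identifier is wired into an update processor
chain in solrconfig.xml roughly like this (the field names are illustrative):

<updateRequestProcessorChain name="langid">
  <processor class="org.apache.solr.update.processor.TikaLanguageIdentifierUpdateProcessorFactory">
    <str name="langid.fl">title,body</str>
    <str name="langid.langField">language_s</str>
  </processor>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>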

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com
Solr Training - www.solrtraining.com

On 20. apr. 2012, at 21:49, Bai Shen wrote:

> I'm working on using Shuyo's work to improve the language identification of
> our search.  Apparently, it's been moved from Nutch to Solr.  Is there a
> reason for this?
> 
> http://code.google.com/p/language-detection/issues/detail?id=34
> 
> I would prefer to have the processing done in Nutch as that has the benefit
> of more hardware and not interfering with Solr latency.
> 
> Thanks.



Re: SolrCloud indexing question

2012-04-20 Thread Jamie Johnson
I believe the SolrJ code round-robins which server the request is sent
to, and as such probably wouldn't send to the same server in your case.
But if you had an HttpSolrServer, for instance, and were pointing to
only one particular instance, my guess would be that those would be 5
separate requests from the server you hit, especially since in all
likelihood those documents wouldn't be destined for the same shard as
the others (unless of course you only had 1 shard and you sent these
to the replica).
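
To make the "send to any node and let Solr forward" point concrete, a minimal
SolrJ sketch (the URL and field names are assumptions):

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class AnyNodeIndexing {
    public static void main(String[] args) throws Exception {
        // Point at any node in the cluster; the receiving node forwards the
        // update to the leader of the target shard, which then distributes
        // it to the replicas.
        SolrServer server = new HttpSolrServer("http://localhost:8983/solr");

        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "doc-1");
        doc.addField("text", "hello solrcloud");
        server.add(doc);
        server.commit();
    }
}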

On Fri, Apr 20, 2012 at 3:02 PM, Darren Govoni  wrote:
> Gotcha.
>
> Now does that mean if I have 5 threads all writing to a local shard,
> will that shard piggyhop those index requests onto a SINGLE connection
> to the leader? Or will they spawn 5 connections from the shard to the
> leader? I really hope the formerthe latter won't scale well.
>
> On Fri, 2012-04-20 at 10:28 -0400, Jamie Johnson wrote:
>> my understanding is that you can send your updates/deletes to any
>> shard and they will be forwarded to the leader automatically.  That
>> being said your leader will always be the place where the index
>> happens and then distributed to the other replicas.
>>
>> On Fri, Apr 20, 2012 at 7:54 AM, Darren Govoni  wrote:
>> > Hi,
>> >  I just wanted to make sure I understand how distributed indexing works
>> > in solrcloud.
>> >
>> > Can I index locally at each shard to avoid throttling a central port? Or
>> > all the indexing has to go through a single shard leader?
>> >
>> > thanks
>> >
>> >
>>
>
>


Re: How to escape “<” character in regex in Solr schema.xml?

2012-04-20 Thread smooth almonds
Thanks Jeevanandam. I couldn't get any regex pattern to work except a basic
one to look for sentence-ending punctuation followed by whitespace:

[.!?](?=\s)

However, this isn't good enough for my needs so I'm switching tactics at the
moment and working on plugging in OpenNLP's SentenceDetector into either
Solr or Nutch. 

--
View this message in context: 
http://lucene.472066.n3.nabble.com/How-to-escape-character-in-regex-in-Solr-schema-xml-tp3921961p3927003.html
Sent from the Solr - User mailing list archive at Nabble.com.


Language Identification

2012-04-20 Thread Bai Shen
I'm working on using Shuyo's work to improve the language identification of
our search.  Apparently, it's been moved from Nutch to Solr.  Is there a
reason for this?

http://code.google.com/p/language-detection/issues/detail?id=34

I would prefer to have the processing done in Nutch as that has the benefit
of more hardware and not interfering with Solr latency.

Thanks.


Crawling an SCM to update a Solr index

2012-04-20 Thread Van Tassell, Kristian
Hello everyone,

I'm in the process of pulling together requirements for a SCM (source code 
manager) crawling mechanism for our Solr index. I probably don't need to argue 
the need for a crawler, but to be specific, we have an index which receives its 
updates from a custom built application. I would, however, like to periodically 
crawl the SCM to ensure the index is up to date. In addition, if updates are 
made which require a complete reindex (such as schema.xml modifications), I 
could utilize this crawler to update everything or specific areas.

I'm wondering if there are any initiatives, tools (like Nutch) or whitepapers 
out there, which crawl an SCM. More specifically, I'm looking for a Perforce 
solution. I'm guessing that there is nothing specific and I'm prepared to 
design to our specific requirements, but wanted to check with the Solr 
community prior to getting too far in.

I'm most likely going to build the solution to interact with the SCM directly 
(via their API) versus sync'ing the SCM repository to the filesystem and crawl 
that way, since there could be filesystem problem syncing the data and because 
there may be relevant metadata information that can be retrieved from the SCM.

Thanks in advance for any information you may have,
Kristian


Re: SolrCloud indexing question

2012-04-20 Thread Darren Govoni
Gotcha.

Now does that mean if I have 5 threads all writing to a local shard,
will that shard piggyback those index requests onto a SINGLE connection
to the leader? Or will they spawn 5 connections from the shard to the
leader? I really hope the former... the latter won't scale well.

On Fri, 2012-04-20 at 10:28 -0400, Jamie Johnson wrote:
> my understanding is that you can send your updates/deletes to any
> shard and they will be forwarded to the leader automatically.  That
> being said your leader will always be the place where the index
> happens and then distributed to the other replicas.
> 
> On Fri, Apr 20, 2012 at 7:54 AM, Darren Govoni  wrote:
> > Hi,
> >  I just wanted to make sure I understand how distributed indexing works
> > in solrcloud.
> >
> > Can I index locally at each shard to avoid throttling a central port? Or
> > all the indexing has to go through a single shard leader?
> >
> > thanks
> >
> >
> 




Re: String ordering appears different with sort vs range query

2012-04-20 Thread Cat Bieber
Thanks for looking at this. I'll see if we can sneak an upgrade to 3.6 
into the project to get this working.

-Cat

On 04/20/2012 12:03 PM, Erick Erickson wrote:

BTW, nice problem statement...

Anyway, I see this too in 3.5. I do NOT see
this in 3.6 or trunk, so it looks like a bug that got fixed
in the 3.6 time-frame. Don't have the time right now
to go back over the JIRA's to see...

Best
Erick

On Thu, Apr 19, 2012 at 3:39 PM, Cat Bieber  wrote:
   

I'm trying to use a Solr query to find the next title in alphabetical order
after a given string. The issue I'm facing is that the sort param seems to
sort non-alphanumeric characters in a different order from the ordering used
by a range filter in the q or fq param. I can't filter the non-alphanumeric
characters out because they're integral to the data and it would not be a
useful ordering if it were based only on the alphanumeric portion of the
strings.

I'm running Solr version 3.5.

In my current approach, I have a field that is a unique string for each
document:












I'm passing the value for the current document in a range to query
everything after the current string, sorted ascending:

/select?fl=uniqueSortString&sort=uniqueSortString+asc&q=uniqueSortString:["$1+ZX+Spectrum+HOBETA+format+file"+TO+*]&wt=xml&rows=5&version=2.2

In theory, I expect the first result to be the current item and the second
result to be the next one. However, I'm finding that the sort and the range
filter seem to use different ordering:



$1 ZX Spectrum - Emulator


$1 ZX Spectrum HOBETA format file


$1 ZX Spectrum Hobetta Picture Format


$? TR-DOS ZX Spectrum file in HOBETA
format


$A AutoCAD Autosave File ( Autodesk Inc.)



Based on the results ordering, sort believes - precedes H, but the range
filter should have excluded that first result if it ordered in the same way.
Digging through the code, I think it looks like sorting uses
String.compareTo() for ordering on a text/string field. However I haven't
been able to track down where the range filter code is. If someone can point
me in the right direction to find that code I'd love to look through it. Or,
if anyone has suggestions regarding a different approach or changes I can
make to this query/field, that would be very helpful.

Thanks for your time.
-Cat Bieber
 


How can I get the top term in solr?

2012-04-20 Thread neosky
Actually I would like to know about top terms at two levels: the document
level and the index level.
1. Top terms at the document level: I would like to know the terms with the
highest document frequency across all documents (each term counted only once
per document). The Solr schema.jsp page seems to provide the top 10 terms,
but it only works on a small index; when the index gets large it is hard to
get the result. Suppose I want to use SolrJ to get the top 20 terms, what
should I do? I have reviewed schema.jsp, but I have no idea how it does this.

2. I would also like to know how many times a specific term appears in the
whole index, i.e. the total number = sum over all documents of (how many
times the term appears in each document).

Any idea will be appreciated.



--
View this message in context: 
http://lucene.472066.n3.nabble.com/How-can-I-get-the-top-term-in-solr-tp3926536p3926536.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Special characters in synonyms.txt on Solr 3.5

2012-04-20 Thread Robert Muir
On Fri, Apr 20, 2012 at 12:10 PM, carl.nordenf...@bwinparty.com
 wrote:
> Directly injecting the letter "ö" into synonyms like so:
> island, ön
> island, "ön"
>
> renders the following exception on startup (both lines renders the same 
> error):
>
> java.lang.RuntimeException: java.nio.charset.MalformedInputException: Input 
> length = 3
>                             at 
> org.apache.solr.analysis.FSTSynonymFilterFactory.inform(FSTSynonymFilterFactory.java:92)
>                             at 
> org.apache.solr.analysis.SynonymFilterFactory.inform(SynonymFilterFactory.java:50)

Synonyms file needs to be in UTF-8 encoding.
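
If the synonyms file is generated programmatically, writing it with an
explicit UTF-8 writer avoids inheriting the platform default encoding; a
minimal Java sketch, with the file path assumed:

import java.io.FileOutputStream;
import java.io.OutputStreamWriter;
import java.io.Writer;

public class WriteSynonyms {
  public static void main(String[] args) throws Exception {
    Writer w = new OutputStreamWriter(
        new FileOutputStream("conf/synonyms.txt"), "UTF-8"); // assumed path
    try {
      w.write("island, \u00F6n\n"); // "ön" is written as UTF-8 bytes
    } finally {
      w.close();
    }
  }
}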



-- 
lucidimagination.com


Special characters in synonyms.txt on Solr 3.5

2012-04-20 Thread carl.nordenf...@bwinparty.com
Hi,

I'm having issues with special characters in synonyms.txt on Solr 3.5.

I'm running a multi-lingual index and need certain terms to give results across 
all languages no matter what language the user uses.
I figured that this should be easily resolved by just adding the different 
words to synonyms.txt.
This works great as long as I don't use special characters such as åäö.

I've tried a couple of things so far but now I'm completely stuck.

This is completely ignored by Solr:
island, "\u00F6"
and alternatively:
island, "\u00F6n" (this should translate to "ön" which means "the island")

A search for island gives me results with the word "island" but not containing 
the word "ö" (island in Swedish) and vice versa.

Directly injecting the letter "ö" into synonyms like so:
island, ön
island, "ön"

renders the following exception on startup (both lines renders the same error):

java.lang.RuntimeException: java.nio.charset.MalformedInputException: Input 
length = 3
 at 
org.apache.solr.analysis.FSTSynonymFilterFactory.inform(FSTSynonymFilterFactory.java:92)
 at 
org.apache.solr.analysis.SynonymFilterFactory.inform(SynonymFilterFactory.java:50)
 at 
org.apache.solr.core.SolrResourceLoader.inform(SolrResourceLoader.java:546)
 at 
org.apache.solr.schema.IndexSchema.(IndexSchema.java:126)
 at 
org.apache.solr.core.CoreContainer.create(CoreContainer.java:461)
 at 
org.apache.solr.core.CoreContainer.load(CoreContainer.java:316)
 at 
org.apache.solr.core.CoreContainer.load(CoreContainer.java:207)
 at 
org.apache.solr.core.CoreContainer$Initializer.initialize(CoreContainer.java:130)
 at 
org.apache.solr.servlet.SolrDispatchFilter.init(SolrDispatchFilter.java:94)
 at 
org.mortbay.jetty.servlet.FilterHolder.doStart(FilterHolder.java:97)
 at 
org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:50)
 at 
org.mortbay.jetty.servlet.ServletHandler.initialize(ServletHandler.java:713)
 at 
org.mortbay.jetty.servlet.Context.startContext(Context.java:140)
 at 
org.mortbay.jetty.webapp.WebAppContext.startContext(WebAppContext.java:1282)
 at 
org.mortbay.jetty.handler.ContextHandler.doStart(ContextHandler.java:518)
 at 
org.mortbay.jetty.webapp.WebAppContext.doStart(WebAppContext.java:499)
 at 
org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:50)
 at 
org.mortbay.jetty.handler.HandlerCollection.doStart(HandlerCollection.java:152)
 at 
org.mortbay.jetty.handler.ContextHandlerCollection.doStart(ContextHandlerCollection.java:156)
 at 
org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:50)
 at 
org.mortbay.jetty.handler.HandlerCollection.doStart(HandlerCollection.java:152)
 at 
org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:50)
 at 
org.mortbay.jetty.handler.HandlerWrapper.doStart(HandlerWrapper.java:130)
 at 
org.mortbay.jetty.Server.doStart(Server.java:224)
 at 
org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:50)
 at 
org.mortbay.xml.XmlConfiguration.main(XmlConfiguration.java:985)
 at 
sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at 
sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
 at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
 at java.lang.reflect.Method.invoke(Unknown Source)
 at org.mortbay.start.Main.invokeMain(Main.java:194)
 at org.mortbay.start.Main.start(Main.java:534)
 at org.mortbay.start.Main.start(Main.java:441)
 at org.mortbay.start.Main.main(Main.java:119)
Caused by: java.nio.charset.MalformedInputException: Input length = 3
 at 
java.nio.charset.CoderResult.throwException(Unknown Source)
 at sun.nio.cs.StreamDecoder.implRead(Unknown 
Source)
 at sun.nio.cs.StreamDecoder.read(Unknown Source)
 at java.io.InputStreamReader.read(Unknown Source)
 at java.io.BufferedReader.fill(Unknown Source)
 at java.io.BufferedReader.readLine(Unknown Source)
 at

Re: String ordering appears different with sort vs range query

2012-04-20 Thread Erick Erickson
BTW, nice problem statement...

Anyway, I see this too in 3.5. I do NOT see
this in 3.6 or trunk, so it looks like a bug that got fixed
in the 3.6 time-frame. Don't have the time right now
to go back over the JIRA's to see...

Best
Erick

On Thu, Apr 19, 2012 at 3:39 PM, Cat Bieber  wrote:
> I'm trying to use a Solr query to find the next title in alphabetical order
> after a given string. The issue I'm facing is that the sort param seems to
> sort non-alphanumeric characters in a different order from the ordering used
> by a range filter in the q or fq param. I can't filter the non-alphanumeric
> characters out because they're integral to the data and it would not be a
> useful ordering if it were based only on the alphanumeric portion of the
> strings.
>
> I'm running Solr version 3.5.
>
> In my current approach, I have a field that is a unique string for each
> document:
>
>  sortMissingLast="true" omitNorms="true">
> 
> 
> 
> 
> 
> 
> 
>
>  stored="true"/>
>
> I'm passing the value for the current document in a range to query
> everything after the current string, sorted ascending:
>
> /select?fl=uniqueSortString&sort=uniqueSortString+asc&q=uniqueSortString:["$1+ZX+Spectrum+HOBETA+format+file"+TO+*]&wt=xml&rows=5&version=2.2
>
> In theory, I expect the first result to be the current item and the second
> result to be the next one. However, I'm finding that the sort and the range
> filter seem to use different ordering:
>
> 
> 
> $1 ZX Spectrum - Emulator
> 
> 
> $1 ZX Spectrum HOBETA format file
> 
> 
> $1 ZX Spectrum Hobetta Picture Format
> 
> 
> $? TR-DOS ZX Spectrum file in HOBETA
> format
> 
> 
> $A AutoCAD Autosave File ( Autodesk Inc.)
> 
> 
>
> Based on the results ordering, sort believes - precedes H, but the range
> filter should have excluded that first result if it ordered in the same way.
> Digging through the code, I think it looks like sorting uses
> String.compareTo() for ordering on a text/string field. However I haven't
> been able to track down where the range filter code is. If someone can point
> me in the right direction to find that code I'd love to look through it. Or,
> if anyone has suggestions regarding a different approach or changes I can
> make to this query/field, that would be very helpful.
>
> Thanks for your time.
> -Cat Bieber


Re: Further questions about behavior in ReversedWildcardFilterFactory

2012-04-20 Thread neosky
I have to discard this method at this time. Thank you all the same.

--
View this message in context: 
http://lucene.472066.n3.nabble.com/Further-questions-about-behavior-in-ReversedWildcardFilterFactory-tp3905416p3926423.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Dismax request handler and Dismax query parser

2012-04-20 Thread Erick Erickson
Right, this is often a source of confusion and there's a discussion about
this on the dev list (but the URL escapes me)..

Anyway, qt and defType have pretty much completely different meanings.
Saying "defType=dismax" means you're providing all the dismax
parameters on the URL.

Saying "qt=handlername" is looking for a handler (defined in your
solrconfig.xml)
_named_ "dismax" (e.g. <requestHandler name="dismax" class="solr.SearchHandler">)


The fact that we used to ship the example with a requesthandler _named_
edismax makes this especially confusing.
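
For illustration, a handler that is both named "dismax" and parsed with the
dismax parser would be declared in solrconfig.xml roughly like this (the
handler name and qf are only examples):

<requestHandler name="dismax" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">dismax</str>
    <str name="qf">name^2.3 text</str>
  </lst>
</requestHandler>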

Does that help?

Best
Erick



On Thu, Apr 19, 2012 at 7:03 AM, mechravi25  wrote:
> Hi,
>
> If I give the search string "type list", I want my search to match both
> "type" and "list". The following is the search query we are using:
>
> /select/?qf=name%5e2.3+text+r_name%5e0.3+id%5e0.3+uid%5e0.3&fl=*&qf=name%5e2.3+text+r_name%5e0.3+id%5e0.3+uid%5e0.3&fl=*&qt=dismax&f.typeFacet.facet.mincount=1&facet.field=typeFacet&f.rFacet.facet.mincount=1&facet.field=rFacet&facet=true&hl.fl=*&hl=true&rows=10&start=0&q=type+list&debugQuery=on
>
> and this does not return any results.
>
> But if we remove qt=dismax in the above case and replace it with
> defType=dismax then, we are getting results for the same
> search string. The request Handlers used for the standard and dismax is as
> follows.
>
>   default="true">
>
>     
>       explicit
>     
>  
>
>  
>    
>     dismax
>     explicit
>     
>        id,score
>     
>
>     *:*
>     0
>     name
>     regex
>    
>  
>
> Im hitting the above request query for a common core usings the shards
> concept (in this case im using 2 cores to be combined in the common core).
> When I use the debugQuery=On, I get the following response in the back end
> (while hitting the different cores from the common core).
>
> INFO: [corex] webapp=/solr path=/select
> params={facet=true&qf=name^2.3+text+r_name^0.3+id^0.3+uid^0.3&q.alt=*:*&hl.fl=*&wt=javabin&hl=false&defType=dismax&rows=10&version=1&f.rFacet.facet.limit=160&fl=uid,score&start=0&f.typeFacet.facet.limit=160&q=type+list&f.text.hl.fragmenter=regex&f.name.hl.fragsize=0&facet.field=typeFacet&facet.field=rFacet&f.name.hl.alternateField=name&isShard=true&fsv=true}
> hits=0 status=0 QTime=6
>
> INFO: [corey] webapp=/solr path=/select
> params={facet=true&qf=name^2.3+text+r_name^0.3+id^0.3+uid^0.3&q.alt=*:*&hl.fl=*&wt=javabin&hl=false&defType=dismax&rows=10&version=1&f.rFacet.facet.limit=160&fl=uid,score&start=0&f.typeFacet.facet.limit=160&q=type+list&f.text.hl.fragmenter=regex&f.name.hl.fragsize=0&facet.field=typeFacet&facet.field=rFacet&f.name.hl.alternateField=name&isShard=true&fsv=true}
> hits=0 status=0 QTime=6
>
> So, here I can see that defType=dismax is being used in the query string
> while querying the individual cores even if we use qt=dismax on the common
> core. If this is the case, why is it not returning any values?
> Am I missing anything? Can you guide me on this?
>
> I've used defType=dismax in the defaults section of the dismax handler
> definition, but still I'm not getting the required results. In our scenario,
> we would like to use the dismax request handler along with the dismax query parser.
> Can you tell me how this can be done?
>
> Regards,
> Sivaganesh
> 
>
>
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/Dismax-request-handler-and-Dismax-query-parser-tp3922708p3922708.html
> Sent from the Solr - User mailing list archive at Nabble.com.


Re: Abbreviations with KeywordTokenizerFactory

2012-04-20 Thread Erick Erickson
Yeah, this is a pretty ugly problem. You have two
problems, neither of which is all that amenable to
simple solutions.

1> context at index time. St, in your example, is
either Saint or Street. Solr has nothing built
in to it to distinguish this, so you need to do some
processing "somewhere else" to get the proper
substitutions.
2> Query time. Same issue, but you have virtually no
 context to figure this out...

But, it is NOT the case that "Synonyms only work
with Whitespace tokenizer". Synonyms will work
with any tokenizer; the problem is that the tokens
produced have to match when they get to the
SynonymFilter. Even KeywordTokenizer will
"work with synonyms", with the caveat that
you'd have to have single-word input

The admin/analysis page will help you
see how all this fits together. For instance,
if you have the stemmer _before_ the
synonym filter, and your original input contains, say,
"story", by the time it gets to the synonym filter, the
word being matched will be something like "stori".

But even getting synonyms working with other
tokenizers won't help you with the context problem
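
To illustrate the tokenizer point only (this does not solve the Saint/Street
ambiguity), a synonym-enabled field type might look roughly like this, with
synonyms.txt containing lines such as "st, saint, street":

<fieldType name="text_syn" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
            ignoreCase="true" expand="true"/>
  </analyzer>
</fieldType>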

Best
Erick

On Thu, Apr 19, 2012 at 4:25 AM, Daniel Persson  wrote:
> Hi solr users.
>
> I'm trying to create an index of geographic data to search with solr.
>
> And I get a problem with searches with abbreviations.
>
> At the moment I use an index filter with
>
>      
>        
>        
>      
>
> This is because my searches at the moment need to be full keywords to
> enable correct hits and ranking.
>
> I have other tokenizers for other types of searches.
>
> The problem I got now is with a streets with names like
>
> East Saint James Street.
>
> This could be abbreviated as
>
> E St James St
>
> Anyone got a suggestion what to try?
>
> My guess was to use synonyms but that seems to work only with
> WhitespaceTokenizer. I've thought about PatternReplaceCharFilter but that
> will be a lot of rules to cover all abbreviations.
>
> Best regards
>
> Daniel


RE: Maximum Open Cursors using JdbcDataSource and cacheImpl

2012-04-20 Thread Keith Naas
I have removed most of the file to protect the innocent.   As you can see we 
have a high-level item that has a subentity called skus, and then those skus 
contain subentities for size/width/etc.  The database is configured for only 10 
open cursors, and voila, when the 11th item is being processed we get an 
exception from Oracle about violating the max open cursors.




It seems like EntityProcessorBase should set the rowIterator and query to null 
after fetching the cached data.

protected Map<String, Object> getNext() {
  if (cacheSupport == null) {
    ...
  } else {
    Map<String, Object> next = cacheSupport.getCacheData(context, query, rowIterator);
    query = null;
    rowIterator = null;
    return next;
  }
}

Cheers,
Keith Naas
614-238-4139



Storing the md5 hash of pdf files as a field in the index

2012-04-20 Thread kuchenbrett
Hi,

I want to build an index of quite a number of pdf and msword files using the 
Data Import Request Handler and the Tika Entity Processor. It works very well. 
Now I would like to use the md5 digest of the binary (pdf/word) file as the 
unique key in the index. But I do not know how to implement this. In the 
data-config.xml configuring the FileListEntityProcessor I have access to the 
absolute file name of a pdf to be indexed. I'm sitting on a Linux box and so 
there is an easy way to calculate the md5 hash using the operating system 
command md5sum. But how can I trigger this calculation and store the result 
as a field in my index?

 Any tips or ideas are really appreciated.

 Thanks.
 Joe


Re: How can I use a function or fieldvalue as the default for query(subquery, default)?

2012-04-20 Thread jimtronic
I was able to use solr 3.1 functions to accomplish this logic:

/solr/select?q=_val_:sum(query("{!dismax qf=text v='solr
rocks'}"),product(map(query("{!dismax qf=text v='solr
rocks'}",-1),0,100,0,1), product(this_field,that_field)))





--
View this message in context: 
http://lucene.472066.n3.nabble.com/How-can-I-use-a-function-or-fieldvalue-as-the-default-for-query-subquery-default-tp3924172p3926183.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: SolrCloud indexing question

2012-04-20 Thread Jamie Johnson
my understanding is that you can send your updates/deletes to any
shard and they will be forwarded to the leader automatically.  That
being said your leader will always be the place where the index
happens and then distributed to the other replicas.

On Fri, Apr 20, 2012 at 7:54 AM, Darren Govoni  wrote:
> Hi,
>  I just wanted to make sure I understand how distributed indexing works
> in solrcloud.
>
> Can I index locally at each shard to avoid throttling a central port? Or
> all the indexing has to go through a single shard leader?
>
> thanks
>
>


Re: Date granularity

2012-04-20 Thread vybe3142
... Inelegant as opposed to the possibility of using /DAY to specify day
granularity on a single term query

In any case, if that's how SOLR works, that's fine

Any rough idea of the performance of range queries vs truncated day queries?
Otherwise, I might just write up a quick program to compare them

Thanks


--
View this message in context: 
http://lucene.472066.n3.nabble.com/Date-granularity-tp3920890p3926165.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Large Index and OutOfMemoryError: Map failed

2012-04-20 Thread Gopal Patwa
We cannot avoid auto soft commit, since we need the Lucene NRT feature. And I
use StreamingUpdateSolrServer for adding/updating the index.

On Thu, Apr 19, 2012 at 7:42 AM, Boon Low  wrote:

> Hi,
>
> Also came across this error recently, while indexing with > 10 DIH
> processes in parallel + default index setting. The JVM grinds to a halt and
> throws this error. Checking the index of a core reveals thousands of files!
> Tuning the default autocommit from 15000ms to 90ms solved the problem
> for us. (no 'autosoftcommit').
>
> Boon
>
> -
> Boon Low
> Search UX and Engine Developer
> brightsolid Online Publishing
>
> On 14 Apr 2012, at 17:40, Gopal Patwa wrote:
>
> > I checked it was "MMapDirectory.UNMAP_SUPPORTED=true" and below are my
> > system data. Is their any existing test case to reproduce this issue? I
> am
> > trying understand how I can reproduce this issue with unit/integration
> test
> >
> > I will try recent solr trunk build too,  if it is some bug in solr or
> > lucene keeping old searcher open then how to reproduce it?
> >
> > SYSTEM DATA
> > ===
> > PROCESSOR: Intel(R) Xeon(R) CPU E5504 @ 2.00GHz
> > SYSTEM ID: x86_64
> > CURRENT CPU SPEED: 1600.000 MHz
> > CPUS: 8 processor(s)
> > MEMORY: 49449296 kB
> > DISTRIBUTION: CentOS release 5.3 (Final)
> > KERNEL NAME: 2.6.18-128.el5
> > UPTIME: up 71 days
> > LOAD AVERAGE: 1.42, 1.45, 1.53
> > JBOSS Version: Implementation-Version: 4.2.2.GA (build:
> > SVNTag=JBoss_4_2_2_GA date=20
> > JAVA Version: java version "1.6.0_24"
> >
> >
> > On Thu, Apr 12, 2012 at 3:07 AM, Michael McCandless <
> > luc...@mikemccandless.com> wrote:
> >
> >> Your largest index has 66 segments (690 files) ... biggish but not
> >> insane.  With 64K maps you should be able to have ~47 searchers open
> >> on each core.
> >>
> >> Enabling compound file format (not the opposite!) will mean fewer maps
> >> ... ie should improve this situation.
> >>
> >> I don't understand why Solr defaults to compound file off... that
> >> seems dangerous.
> >>
> >> Really we need a Solr dev here... to answer "how long is a stale
> >> searcher kept open".  Is it somehow possible 46 old searchers are
> >> being left open...?
> >>
> >> I don't see any other reason why you'd run out of maps.  Hmm, unless
> >> MMapDirectory didn't think it could safely invoke unmap in your JVM.
> >> Which exact JVM are you using?  If you can print the
> >> MMapDirectory.UNMAP_SUPPORTED constant, we'd know for sure.
> >>
> >> Yes, switching away from MMapDir will sidestep the "too many maps"
> >> issue, however, 1) MMapDir has better perf than NIOFSDir, and 2) if
> >> there really is a leak here (Solr not closing the old searchers or a
> >> Lucene bug or something...) then you'll eventually run out of file
> >> descriptors (ie, same  problem, different manifestation).
> >>
> >> Mike McCandless
> >>
> >> http://blog.mikemccandless.com
> >>
> >> 2012/4/11 Gopal Patwa :
> >>>
> >>> I have not change the mergefactor, it was 10. Compound index file is
> >> disable
> >>> in my config but I read from below post, that some one had similar
> issue
> >> and
> >>> it was resolved by switching from compound index file format to
> >> non-compound
> >>> index file.
> >>>
> >>> and some folks resolved by "changing lucene code to disable
> >> MMapDirectory."
> >>> Is this best practice to do, if so is this can be done in
> configuration?
> >>>
> >>>
> >>
> http://lucene.472066.n3.nabble.com/MMapDirectory-failed-to-map-a-23G-compound-index-segment-td3317208.html
> >>>
> >>> I have index document of core1 = 5 million, core2=8million and
> >>> core3=3million and all index are hosted in single Solr instance
> >>>
> >>> I am going to use Solr for our site StubHub.com, see attached "ls -l"
> >> list
> >>> of index files for all core
> >>>
> >>> SolrConfig.xml:
> >>>
> >>>
> >>>  
> >>>  false
> >>>  10
> >>>  2147483647
> >>>  1
> >>>  4096
> >>>  10
> >>>  1000
> >>>  1
> >>>  single
> >>>
> >>>   class="org.apache.lucene.index.TieredMergePolicy">
> >>>0.0
> >>>10.0
> >>>  
> >>>
> >>>  
> >>>false
> >>>0
> >>>  
> >>>
> >>>  
> >>>
> >>>
> >>>  
> >>>  1000
> >>>   
> >>> 90
> >>> false
> >>>   
> >>>   
> >>>
> >> ${inventory.solr.softcommit.duration:1000}
> >>>   
> >>>
> >>>  
> >>>
> >>>
> >>> Forwarded conversation
> >>> Subject: Large Index and OutOfMemoryError: Map failed
> >>> 
> >>>
> >>> From: Gopal Patwa 
> >>> Date: Fri, Mar 30, 2012 at 10:26 PM
> >>> To: solr-user@lucene.apache.org
> >>>
> >>>
> >>> I need help!!
> >>>
> >>>
> >>>
> >>>
> >>>
> >>> I am using Solr 4.0 nightly build with NRT and I often get this error
> >> during
> >>> auto commit "java.lang.OutOfMemoryError: Map failed". I have search
> this
> >>> fo

Convert a SolrDocumentList to DocList

2012-04-20 Thread Ramprakash Ramamoorthy
Dear all,

Is there any way I can convert a SolrDocumentList to a DocList and
set it in the QueryResult object?

Or, the workaround adding a SolrDocumentList object to the
QueryResult object?

-- 
With Thanks and Regards,
Ramprakash Ramamoorthy,
Project Trainee,
Zoho Corporation.
+91 9626975420


Re: Solr file size limit?

2012-04-20 Thread Bram Rongen
Hmm, reading your reply again I see that Solr only uses the first 10k
tokens from each field, so field length should not be a problem per se. It
could be that my documents contain very large and unorganized tokens;
could that trip up Solr?

On Fri, Apr 20, 2012 at 2:03 PM, Bram Rongen  wrote:

> Yeah, I'm indexing some PDF documents.. I've extracted the text through
> tika (pre-indexing).. and the largest field in my DB is 20MB. That's quite
> extensive ;) My Solution for the moment is to cut this text to the first
> 500KB, that should be enough for a decent index and search capabilities..
> Should I increase the buffer size for these sizes as well or will 32MB
> suffice?
>
> FYI, output of ulimit -a is
> core file size  (blocks, -c) 0
> data seg size   (kbytes, -d) unlimited
> scheduling priority (-e) 20
> *file size   (blocks, -f) unlimited*
> pending signals (-i) 16382
> max locked memory   (kbytes, -l) 64
> max memory size (kbytes, -m) unlimited
> open files  (-n) 1024
> pipe size(512 bytes, -p) 8
> POSIX message queues (bytes, -q) 819200
> real-time priority  (-r) 0
> stack size  (kbytes, -s) 8192
> cpu time   (seconds, -t) unlimited
> max user processes  (-u) unlimited
> virtual memory  (kbytes, -v) unlimited
> file locks  (-x) unlimited
>
>
> Kind regards!
> Bram
>
> On Fri, Apr 20, 2012 at 12:15 PM, Lance Norskog  wrote:
>
>> Good point! Do you store the large file in your documents, or just index
>> them?
>>
>> Do you have a "largest file" limit in your environment? Try this:
>> ulimit -a
>>
>> What is the "file size"?
>>
>> On Thu, Apr 19, 2012 at 8:04 AM, Shawn Heisey  wrote:
>> > On 4/19/2012 7:49 AM, Bram Rongen wrote:
>> >>
>> >> Yesterday I've started indexing again but this time on Solr 3.6.. Again
>> >> Solr is failing around the same time, but not exactly (now the largest
>> fdt
>> >> file is 4.8G).. It's right after the moment I receive memory-errors at
>> the
>> >> Drupal side which make me suspicious that it maybe has something to do
>> >> with
>> >> a huge document.. Is that possible? I was indexing 1500 documents at
>> once
>> >> every minute. Drupal builds them all up in memory before submitting
>> them
>> >> to
>> >> Solr. At some point it runs out of memory and I have to switch to 10/20
>> >> documents per minute for a while.. then I can switch back to 1000
>> >> documents
>> >> per minute.
>> >>
>> >> The disk is a software RAID1 over 2 disks. But I've also run into the
>> same
>> >> problem at another server.. This was a VM-server with only 1GB ram and
>> >> 40GB
>> >> of disk. With this server the merge-repeat happened at an earlier
>> stage.
>> >>
>> >> I've also let Solr continue with merging for about two days before
>>  (in an
>> >> earlier attempt), without submitting new documents. The merging kept
>> >> repeating.
>> >>
>> >> Somebody suggested it could be because I'm using Jetty, could that be
>> >> right?
>> >
>> >
>> > I am using Jetty for my Solr installation and it handles very large
>> indexes
>> > without a problem.  I have created a single index with all my data
>> (nearly
>> > 70 million documents, total index size over 100GB).  Aside from how
>> long it
>> > takes to build and the fact that I don't have enough RAM to cache it for
>> > good performance, Solr handled it just fine.  For production I use a
>> > distributed index on multiple servers.
>> >
>> > I don't know why you are seeing a merge that continually restarts,
>> that's
>> > truly odd.  I've never used drupal, don't know a lot about it.  From my
>> > small amount of research just now, I assume that it uses Tika, also
>> another
>> > tool that I have no experience with.  I am guessing that you store the
>> > entire text of your documents into solr, and that they are indexed up
>> to a
>> > maximum of 1 tokens (the default value of maxFieldLength in
>> > solrconfig.xml), based purely on speculation about the "body" field in
>> your
>> > schema.
>> >
>> > A document that's 100MB in size, if the whole thing gets stored, will
>> > completely overwhelm a 32MB buffer, and might even be enough to
>> overwhelm a
>> > 256MB buffer as well, because it will basically have to build the entire
>> > index segment in RAM, with term vectors, indexed data, and stored data
>> for
>> > all fields.
>> >
>> > With such large documents, you may have to increase the maxFieldLength,
>> or
>> > you won't be able to search on the entire document text.  Depending on
>> the
>> > content of those documents, it may or may not be a problem that only the
>> > first 10,000 tokens will get indexed.  Large documents tend to be
>> repetitive
>> > and there might not be any search value after the introduction and
>> initial
>> > words.  Your documents may be different, so you'll have to make that
>> > decision.
>> >
>> > To test whether my current thoughts

Re: Solr file size limit?

2012-04-20 Thread Bram Rongen
Yeah, I'm indexing some PDF documents. I've extracted the text through
Tika (pre-indexing), and the largest field in my DB is 20MB. That's quite
extensive ;) My solution for the moment is to cut this text to the first
500KB, which should be enough for a decent index and search capabilities.
Should I increase the buffer size for these sizes as well, or will 32MB
suffice?

FYI, output of ulimit -a is
core file size  (blocks, -c) 0
data seg size   (kbytes, -d) unlimited
scheduling priority (-e) 20
*file size   (blocks, -f) unlimited*
pending signals (-i) 16382
max locked memory   (kbytes, -l) 64
max memory size (kbytes, -m) unlimited
open files  (-n) 1024
pipe size(512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority  (-r) 0
stack size  (kbytes, -s) 8192
cpu time   (seconds, -t) unlimited
max user processes  (-u) unlimited
virtual memory  (kbytes, -v) unlimited
file locks  (-x) unlimited


Kind regards!
Bram

On Fri, Apr 20, 2012 at 12:15 PM, Lance Norskog  wrote:

> Good point! Do you store the large file in your documents, or just index
> them?
>
> Do you have a "largest file" limit in your environment? Try this:
> ulimit -a
>
> What is the "file size"?
>
> On Thu, Apr 19, 2012 at 8:04 AM, Shawn Heisey  wrote:
> > On 4/19/2012 7:49 AM, Bram Rongen wrote:
> >>
> >> Yesterday I've started indexing again but this time on Solr 3.6.. Again
> >> Solr is failing around the same time, but not exactly (now the largest
> fdt
> >> file is 4.8G).. It's right after the moment I receive memory-errors at
> the
> >> Drupal side which make me suspicious that it maybe has something to do
> >> with
> >> a huge document.. Is that possible? I was indexing 1500 documents at
> once
> >> every minute. Drupal builds them all up in memory before submitting them
> >> to
> >> Solr. At some point it runs out of memory and I have to switch to 10/20
> >> documents per minute for a while.. then I can switch back to 1000
> >> documents
> >> per minute.
> >>
> >> The disk is a software RAID1 over 2 disks. But I've also run into the
> same
> >> problem at another server.. This was a VM-server with only 1GB ram and
> >> 40GB
> >> of disk. With this server the merge-repeat happened at an earlier stage.
> >>
> >> I've also let Solr continue with merging for about two days before  (in
> an
> >> earlier attempt), without submitting new documents. The merging kept
> >> repeating.
> >>
> >> Somebody suggested it could be because I'm using Jetty, could that be
> >> right?
> >
> >
> > I am using Jetty for my Solr installation and it handles very large
> indexes
> > without a problem.  I have created a single index with all my data
> (nearly
> > 70 million documents, total index size over 100GB).  Aside from how long
> it
> > takes to build and the fact that I don't have enough RAM to cache it for
> > good performance, Solr handled it just fine.  For production I use a
> > distributed index on multiple servers.
> >
> > I don't know why you are seeing a merge that continually restarts, that's
> > truly odd.  I've never used drupal, don't know a lot about it.  From my
> > small amount of research just now, I assume that it uses Tika, also
> another
> > tool that I have no experience with.  I am guessing that you store the
> > entire text of your documents into solr, and that they are indexed up to
> a
> > maximum of 10,000 tokens (the default value of maxFieldLength in
> > solrconfig.xml), based purely on speculation about the "body" field in
> your
> > schema.
> >
> > A document that's 100MB in size, if the whole thing gets stored, will
> > completely overwhelm a 32MB buffer, and might even be enough to
> overwhelm a
> > 256MB buffer as well, because it will basically have to build the entire
> > index segment in RAM, with term vectors, indexed data, and stored data
> for
> > all fields.
> >
> > With such large documents, you may have to increase the maxFieldLength,
> or
> > you won't be able to search on the entire document text.  Depending on
> the
> > content of those documents, it may or may not be a problem that only the
> > first 10,000 tokens will get indexed.  Large documents tend to be
> repetitive
> > and there might not be any search value after the introduction and
> initial
> > words.  Your documents may be different, so you'll have to make that
> > decision.
> >
> > To test whether my current thoughts are right, I recommend that you try
> with
> > the following settings during the initial full import:  ramBufferSizeMB:
> > 1024 (or maybe higher), autoCommit maxTime: 0, autoCommit maxDocs: 0.
>  This
> > will mean that unless the indexing process issues manual commits (either
> in
> > the middle of indexing or at the end), you will have to do a manual one.
> >  Once you have the initial index built and it is only doing updates, you
> > will probably be able to go back to using autoCommit.

SolrCloud indexing question

2012-04-20 Thread Darren Govoni
Hi,
  I just wanted to make sure I understand how distributed indexing works
in solrcloud.

Can I index locally at each shard to avoid throttling a central port? Or
does all the indexing have to go through a single shard leader?

thanks




Re: Solr Cloud vs sharding vs grouping

2012-04-20 Thread Martijn v Groningen
Hi Jean-Sebastien,

For some grouping features (like total group count and grouped
faceting), the distributed grouping requires you to partition your
documents into the right shard. Basically groups can't cross shards.
Otherwise the group counts or grouped facet counts may not be correct.
If you use the basic grouping functionality then this limitation
doesn't apply.

I think right now that SolrCloud partitions documents based on the
unique id (id % number_shards). You need to modify this somehow or
maybe do the distributed indexing yourself.

Martijn
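
For illustration, a distributed grouping request might look like the sketch
below. The host names, core layout, and the "category" field are placeholders,
not taken from this thread; group=true and group.field give the basic grouping
that works across shards, while group.ngroups is one of the features that
assumes every document of a group lives on the same shard.

# basic grouping across two shards; ngroups may be wrong if a group spans shards
curl "http://host1:8983/solr/select?q=*:*&group=true&group.field=category&group.ngroups=true&shards=host1:8983/solr,host2:8983/solr"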

On 20 April 2012 12:07, Lance Norskog  wrote:
> The implementation of grouping in the trunk is completely different
> from 236. Grouping works across distributed search:
> https://issues.apache.org/jira/browse/SOLR-2066
>
> committed last September.
>
> On Thu, Apr 19, 2012 at 6:04 PM, Jean-Sebastien Vachon
>  wrote:
>> Hi All,
>>
>> I am currently trying out SolrCloud on a small cluster and I'm enjoying Solr 
>> more than ever. Thanks to all the contributors.
>>
>> That being said, one very important feature for us is the 
>> grouping/collapsing of results on a specific field value on a distributed 
>> index. We are currently using Solr 1.4 with Patch 236 and it does the job as 
>> long as all documents with a common field value are on the same shard. 
>> Otherwise grouping on a distributed index will not work as expected.
>>
>> I looked everywhere if this limitation was still present in the trunk but 
>> found no mention of it.
>> Is this still a requirement for grouping results on a distributed index?
>>
>> Thanks
>>
>
>
>
> --
> Lance Norskog
> goks...@gmail.com



-- 
Met vriendelijke groet,

Martijn van Groningen


Re: Importing formats - Which works best with Solr?

2012-04-20 Thread Erick Erickson
CSV files can also be imported, which may be more
compact.

Best
Erick
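
A minimal CSV upload might look like this sketch (the file and field names are
made up; it assumes the stock /update/csv handler from the example
solrconfig.xml):

# books.csv starts with a header row naming the fields, e.g. id,name,price
curl "http://localhost:8983/solr/update/csv?commit=true" \
     --data-binary @books.csv -H 'Content-Type: text/plain; charset=utf-8'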

On Fri, Apr 20, 2012 at 6:01 AM, Dmitry Kan  wrote:
> James,
>
> You could create xml files of format:
>
> <add>
>   <doc>
>     <field name="id">1</field>
>     <field name="Name">...</field>
>     <field name="Surname">...</field>
>   </doc>
> </add>
>
> and then post them to SOLR using, for example, the post.sh utility from
> SOLR's binary distribution.
>
> HTH,
> Dmitry
>
> On Fri, Apr 20, 2012 at 12:35 PM, Spadez  wrote:
>
>> Hi,
>>
>> I am designing a custom scraping solution. I need to store my data, do
>> some
>> post processing on it and then import it into SOLR.
>>
>> If I want to import data into SOLR in the quickest, easiest way possible,
>> what format should I be saving my scraped data in? I get the impression
>> that .XML would be the best choice but I don’t really have much grounding
>> for that.
>>
>> Any input would be appreciated.
>>
>> James
>>
>>
>> --
>> View this message in context:
>> http://lucene.472066.n3.nabble.com/Importing-formats-Which-works-best-with-Solr-tp3925557p3925557.html
>> Sent from the Solr - User mailing list archive at Nabble.com.
>>
>
>
>
> --
> Regards,
>
> Dmitry Kan


Re: Date granularity

2012-04-20 Thread Erick Erickson
The only way to get more "elegant" would be to
index the dates with the granularity you want, i.e.
truncate to DAY at index time then truncate
to DAY at query time as well.

Why do you consider ranges inelegant? How else
would you imagine it would be done?

Best
Erick
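
A sketch of both options, reusing the field names from the question quoted
below (everything else is a placeholder):

# range query, with date math truncating both endpoints to the day
q=METADATA.DATE_ABC:[2009-07-31T00:00:00Z/DAY TO 2009-07-31T00:00:00Z/DAY+1DAY]
# note: both endpoints are inclusive, so this also matches the first instant
# of the next day

# or, with a separate day-truncated field indexed up front, a simple term query
q=METADATA.DATE_DAY_ABC:"2009-07-31T00:00:00Z"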

On Thu, Apr 19, 2012 at 4:07 PM, vybe3142  wrote:
> Also, what's the performance impact of range queries vs. querying for a
> particular DAY (as described in my last post)  when the index contains ,
> say, 10 million docs ?
>
> If the range queries result in a significant performance hit, one option for
> us would be to define additional DAY fields when indexing TIME data
> eg. when indexing METADATA.DATE_ABC= 2009-07-31T15:25:45Z , also create and
> index something like METADATA.DATE_DAY_ABC= 2009-07-31T00:00:00Z
>
>
>
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/Date-granularity-tp3920890p3924290.html
> Sent from the Solr - User mailing list archive at Nabble.com.


Re: Solr with UIMA

2012-04-20 Thread dsy99
Hi Rahul,
Thank you for the reply. I tried by modifying the
updateRequestProcessorChain as follows:



 But still I am not able to see the UIMA fields in the result. I executed
the following curl command to index a file named "test.docx"

curl
"http://localhost:8983/solr/update/extract?fmap.content=content&literal.id=doc47&commit=true";
-F "file=@test.docx"

When I searched the same document with
"http://localhost:8983/solr/select?q=id:doc47"; command, got the following
result.


  
 divakar
 

  
application/vnd.openxmlformats-officedocument.wordprocessingml.document

 
 doc47
 2012-04-18T14:19:00Z
  


Could you please help where I am wrong?

With Thanks & Regards:
Divakar

--
View this message in context: 
http://lucene.472066.n3.nabble.com/Solr-with-UIMA-tp3863324p3925670.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: How sorlcloud distribute data among shards of the same cluster?

2012-04-20 Thread Boon Low
Thanks. My colleague also pointed out a previous thread and the solution: add a
new update.chain for the data import/update handlers to bypass the distributed
update processor.
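
A rough sketch of that idea (the chain name is made up; the point is simply
that solr.DistributedUpdateProcessorFactory is left out, and the relevant
update or DIH handler is pointed at the chain via update.chain):

  <updateRequestProcessorChain name="local-only">
    <!-- no solr.DistributedUpdateProcessorFactory here -->
    <processor class="solr.LogUpdateProcessorFactory" />
    <processor class="solr.RunUpdateProcessorFactory" />
  </updateRequestProcessorChain>

  <!-- in the handler's defaults -->
  <lst name="defaults">
    <str name="update.chain">local-only</str>
  </lst>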

A simpler use case for SolrCloud newcomers could be distributed search: firing up
a single-node SolrCloud, quickly re-creating multiple logical cores with existing
tools, and testing them within a search collection, to experience the features of
the cloud-based version and how it compares with the existing distributed search.
The default distributed update processor may be confusing in this case; at the
very least, how to opt out of it should be explained on the wiki.

Impressed with the distributed search and real-time update! 

On 19 Apr 2012, at 21:03, Mark Miller wrote:

> You can remove the distrib update processor and just distrib the data 
> yourself.
> 
> Eventually the hash implementation will also be pluggable I think.
> 
> On Apr 19, 2012, at 10:30 AM, Boon Low wrote:
> 
>> Hi,
>> 
>> Is there any mechanism in SolrCloud for controlling how the data is 
>> distributed among the shards? For example, I'd like to create logical 
>> (standalone) shards ('A', 'B', 'C') to make up a collection ('A-C"), and be 
>> able query both a particular shard (e.g. 'A') or the collection entirely. At 
>> the moment, my test suggests 'A' data is distributed to evenly to all shards 
>> in SolrCloud.
>> 
>> Regards,
>> 
>> Boon
>> 
>> -
>> Boon Low
>> Search UX and Engine Developer
>> brightsolid Online Publishing
>> 
>> On 18 Apr 2012, at 12:41, Erick Erickson wrote:
>> 
>>> Try looking at DistributedUpdateProcessor, there's
>>> a "hash(cmd)" method in there.
>>> 
>>> Best
>>> Erick
>>> 
>>> On Tue, Apr 17, 2012 at 4:45 PM, emma1023  wrote:
 Thanks for your reply. In Solr 3.x, we need to manually hash the doc Id to
 the server. How does SolrCloud do this instead? I am working on a project
 using SolrCloud, but we need to monitor how SolrCloud distributes the
 data. I cannot find which part of the source code does this. Is it
 in the cloud part? Thanks.
 
 
 On Tue, Apr 17, 2012 at 3:16 PM, Mark Miller-3 [via Lucene] <
 ml-node+s472066n3918192...@n3.nabble.com> wrote:
 
> 
> On Apr 17, 2012, at 9:56 AM, emma1023 wrote:
> 
> It hashes the id. The doc distribution is fairly even - but sizes may be
> fairly different.
> 
>> How does solrcloud manage and distribute data among shards of the same
>> cluster when you query? Does it distribute the data equally? What is the
>> basis? Which part of the code can I find this in? Thank you so much!
>> 
>> 
>> --
>> View this message in context:
> http://lucene.472066.n3.nabble.com/How-sorlcloud-distribute-data-among-shards-of-the-same-cluster-tp3917323p3917323.html
>> Sent from the Solr - User mailing list archive at Nabble.com.
> 
> - Mark Miller
> lucidimagination.com
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> --
> 
 
 
 --
 View this message in context: 
 http://lucene.472066.n3.nabble.com/How-sorlcloud-distribute-data-among-shards-of-the-same-cluster-tp3917323p3918348.html
 Sent from the Solr - User mailing list archive at Nabble.com.
>>> 

Re: Solr file size limit?

2012-04-20 Thread Lance Norskog
Good point! Do you store the large file in your documents, or just index them?

Do you have a "largest file" limit in your environment? Try this:
ulimit -a

What is the "file size"?

On Thu, Apr 19, 2012 at 8:04 AM, Shawn Heisey  wrote:
> On 4/19/2012 7:49 AM, Bram Rongen wrote:
>>
>> Yesterday I've started indexing again but this time on Solr 3.6.. Again
>> Solr is failing around the same time, but not exactly (now the largest fdt
>> file is 4.8G).. It's right after the moment I receive memory-errors at the
>> Drupal side which make me suspicious that it maybe has something to do
>> with
>> a huge document.. Is that possible? I was indexing 1500 documents at once
>> every minute. Drupal builds them all up in memory before submitting them
>> to
>> Solr. At some point it runs out of memory and I have to switch to 10/20
>> documents per minute for a while.. then I can switch back to 1000
>> documents
>> per minute.
>>
>> The disk is a software RAID1 over 2 disks. But I've also run into the same
>> problem at another server.. This was a VM-server with only 1GB ram and
>> 40GB
>> of disk. With this server the merge-repeat happened at an earlier stage.
>>
>> I've also let Solr continue with merging for about two days before  (in an
>> earlier attempt), without submitting new documents. The merging kept
>> repeating.
>>
>> Somebody suggested it could be because I'm using Jetty, could that be
>> right?
>
>
> I am using Jetty for my Solr installation and it handles very large indexes
> without a problem.  I have created a single index with all my data (nearly
> 70 million documents, total index size over 100GB).  Aside from how long it
> takes to build and the fact that I don't have enough RAM to cache it for
> good performance, Solr handled it just fine.  For production I use a
> distributed index on multiple servers.
>
> I don't know why you are seeing a merge that continually restarts, that's
> truly odd.  I've never used drupal, don't know a lot about it.  From my
> small amount of research just now, I assume that it uses Tika, also another
> tool that I have no experience with.  I am guessing that you store the
> entire text of your documents into solr, and that they are indexed up to a
> maximum of 10,000 tokens (the default value of maxFieldLength in
> solrconfig.xml), based purely on speculation about the "body" field in your
> schema.
>
> A document that's 100MB in size, if the whole thing gets stored, will
> completely overwhelm a 32MB buffer, and might even be enough to overwhelm a
> 256MB buffer as well, because it will basically have to build the entire
> index segment in RAM, with term vectors, indexed data, and stored data for
> all fields.
>
> With such large documents, you may have to increase the maxFieldLength, or
> you won't be able to search on the entire document text.  Depending on the
> content of those documents, it may or may not be a problem that only the
> first 10,000 tokens will get indexed.  Large documents tend to be repetitive
> and there might not be any search value after the introduction and initial
> words.  Your documents may be different, so you'll have to make that
> decision.
>
> To test whether my current thoughts are right, I recommend that you try with
> the following settings during the initial full import:  ramBufferSizeMB:
> 1024 (or maybe higher), autoCommit maxTime: 0, autoCommit maxDocs: 0.  This
> will mean that unless the indexing process issues manual commits (either in
> the middle of indexing or at the end), you will have to do a manual one.
>  Once you have the initial index built and it is only doing updates, you
> will probably be able to go back to using autoCommit.
>
> It's possible that I have no understanding of the real problem here, and my
> recommendation above may result in no improvement.  General recommendations,
> no matter what the current problem might be:
>
> 1) Get a lot more RAM.  Ideally you want to have enough free memory to cache
> your entire index.  That may not be possible, but you want to get as close
> to that goal as you can.
> 2) If you can, see what you can do to increase your IOPS.  Using mirrored
> high RPM SAS is an easy solution, and might be slightly cheaper than SATA
> RAID10, which is my solution.  SSD is easy and very fast, but expensive and
> not redundant -- I am currently not aware of any SSD RAID solutions that
> have OS TRIM support.  RAID10 with high RPM SAS would be best, but very
> expensive.  On the extreme high end, you could go with a high performance
> SAN.
>
> Thanks,
> Shawn
>



-- 
Lance Norskog
goks...@gmail.com
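
In solrconfig.xml terms, the bulk-load settings Shawn suggests above might
look roughly like this (a sketch only, following the stock 3.x example
config; the values are the ones from the thread):

  <indexDefaults>
    <ramBufferSizeMB>1024</ramBufferSizeMB>
  </indexDefaults>

  <updateHandler class="solr.DirectUpdateHandler2">
    <!-- autoCommit left out (maxDocs/maxTime effectively 0) for the initial
         full import; issue a manual commit when the load finishes -->
  </updateHandler>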


Re: Wrong categorization with DIH

2012-04-20 Thread Lance Norskog
Working with the DIH is a little easier if you make a database view and
load from that. You can set all of the field names and see exactly
what the DIH gets.
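
For instance, a view built from the query in this thread (the view name is
made up) lets the DIH entity be a plain SELECT * FROM product_with_category:

  -- every column gets an explicit alias, so the Solr field names are unambiguous
  CREATE VIEW product_with_category AS
  SELECT p.title     AS title,
         p.id        AS id,
         p.pic_thumb AS pic_thumb,
         c.name      AS category,
         c.id        AS category_id
  FROM product p
  JOIN category c ON p.category_id = c.id;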

On Thu, Apr 19, 2012 at 10:11 AM, Ramo Karahasan
 wrote:
> Hi,
>
> Yes, I use every one of them.
>
> Thanks for your hint... I'll have a look at this and try to configure it
> correctly.
>
> Thank you,
> Ramo
>
> -Original Message-
> From: Jeevanandam Madanagopal [mailto:je...@myjeeva.com]
> Sent: Thursday, 19 April 2012 18:42
> To: solr-user@lucene.apache.org
> Subject: Re: Wrong categorization with DIH
>
> Ramo -
>
> Are you using all the selected columns from the query?
>
> select p.title as title, p.id, p.category_id, p.pic_thumb, c.name as
> category, c.id as category_id from product p, category c ...
>
> I see the following attributes 'p.id', 'p.category_id' & 'p.pic_thumb' don't
> have an alias defined.
>
> Pointers:
> 
> - Select only required field in the sql query
> - Ensure sql alias name and attribute name in the schema.xml should match
>      or
> - If you like to do explicit mapping for every column in DIH config as
> follows: <field column="..." name="SOLR-SCHEMA-ATTRIBUTE-NAME-HERE" />
>
> Detailed Info refer this: http://wiki.apache.org/solr/DataImportHandler
>
> -Jeevanandam
>
>
> On Apr 19, 2012, at 9:37 PM, Ramo Karahasan wrote:
>
>> Hi,
>>
>> my config is just the following:
>>
>> <dataConfig>
>>   <dataSource driver="com.mysql.jdbc.Driver"
>>               url="jdbc:mysql://xx/asdx"
>>               user=""
>>               password=""/>
>>   <document>
>>     <entity name="..."
>>             query="select p.title as title, p.id, p.category_id,
>>                    p.pic_thumb, c.name as category, c.id as category_id
>>                    from product p, category c
>>                    WHERE p.category_id = c.id AND '${dataimporter.request.clean}' != 'false'
>>                    OR updated_at > '${dataimporter.last_index_time}'">
>>     </entity>
>>   </document>
>> </dataConfig>
>>
>> I'm doing it as described on:
>>
>> http://wiki.apache.org/solr/DataImportHandlerDeltaQueryViaFullImport
>>
>> Any ideas?
>>
>> Best regards,
>> Ramo
>>
>> -Original Message-
>> From: Jeevanandam Madanagopal [mailto:je...@myjeeva.com]
>> Sent: Thursday, 19 April 2012 17:44
>> To: solr-user@lucene.apache.org
>> Subject: Re: Wrong categorization with DIH
>>
>> Ramo -
>>
>> Please share DIH configuration with us.
>>
>> -Jeevanandam
>>
>> On Apr 19, 2012, at 7:46 PM, Ramo Karahasan wrote:
>>
>>> Does anyone has an idea what's going wrong here?
>>>
>>> Thanks,
>>> Ramo
>>>
>>> -Original Message-
>>> From: Gora Mohanty [mailto:g...@mimirtech.com]
>>> Sent: Tuesday, 17 April 2012 11:34
>>> To: solr-user@lucene.apache.org
>>> Subject: Re: Wrong categorization with DIH
>>>
>>> On 17 April 2012 14:47, Ramo Karahasan
>>> 
>>> wrote:
 Hi,



 i currently face the followin issue:

 Testing the following sql statement which is also used in SOLR (DIH)
 leads to a wrong categorization in solr:

 select p.title as title, p.id, p.category_id, p.pic_thumb, c.name as
 category, c.id as category_id from product p, category c WHERE
 p.category_id = c.id AND p.id = 3091328



 This returns in my sql client:

 Apple MacBook Pro MD313D/A 33,8 cm (13,3 Zoll) Notebook (Intel Core
 i5-2435M, 2,4GHz, 4GB RAM, 500GB HDD, Intel HD 3000, Mac OS),
 3091328, 1003,
 http://m-d.ww.cdn.com/images/I/41teWbp-uAL._SL75_.jpg,
 Computer,
 1003



 As you see, the categoryid 1003 points to "Computer"



 Via the solr searchadmin i get the following result when searchgin
 for
 id:3091328

 Sport

 1003
>>> [...]
>>>
>>> Please share with us the rest of the DIH configuration file, i.e.,
>>> the part where these data are saved to the Solr index.
>>>
>>> Regards,
>>> Gora
>>>
>>
>>
>
>



-- 
Lance Norskog
goks...@gmail.com


Re: Solr Cloud vs sharding vs grouping

2012-04-20 Thread Lance Norskog
The implementation of grouping in the trunk is completely different
from 236. Grouping works across distributed search:
https://issues.apache.org/jira/browse/SOLR-2066

committed last September.

On Thu, Apr 19, 2012 at 6:04 PM, Jean-Sebastien Vachon
 wrote:
> Hi All,
>
> I am currently trying out SolrCloud on a small cluster and I'm enjoying Solr 
> more than ever. Thanks to all the contributors.
>
> That being said, one very important feature for us is the grouping/collapsing 
> of results on a specific field value on a distributed index. We are currently 
> using Solr 1.4 with Patch 236 and it does the job as long as all documents 
> with a common field value are on the same shard. Otherwise grouping on a 
> distributed index will not work as expected.
>
> I looked everywhere if this limitation was still present in the trunk but 
> found no mention of it.
> Is this still a requirement for grouping results on a distributed index?
>
> Thanks
>



-- 
Lance Norskog
goks...@gmail.com


Re: Importing formats - Which works best with Solr?

2012-04-20 Thread Dmitry Kan
James,

You could create xml files of format:


<add>
  <doc>
    <field name="id">1</field>
    <field name="Name">...</field>
    <field name="Surname">...</field>
  </doc>
</add>

and then post them to SOLR using, for example, the post.sh utility from
SOLR's binary distribution.

HTH,
Dmitry
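
If post.sh is not handy, a plain curl against the XML update handler does the
same job (docs.xml is a placeholder for a file in the <add><doc> format shown
above):

curl "http://localhost:8983/solr/update?commit=true" \
     -H 'Content-Type: text/xml' --data-binary @docs.xml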

On Fri, Apr 20, 2012 at 12:35 PM, Spadez  wrote:

> Hi,
>
> I am designing a custom scraping solution. I need to store my data, do
> some
> post processing on it and then import it into SOLR.
>
> If I want to import data into SOLR in the quickest, easiest way possible,
> what format should I be saving my scraped data in? I get the impression
> that .XML would be the best choice but I don’t really have much grounding
> for that.
>
> Any input would be appreciated.
>
> James
>
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Importing-formats-Which-works-best-with-Solr-tp3925557p3925557.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>



-- 
Regards,

Dmitry Kan


Re: PolySearcher in Solr

2012-04-20 Thread Lance Norskog
The PolySearcher in Lucy seems to do exactly what is "Distributed
Search" in Solr.

On Fri, Apr 20, 2012 at 2:58 AM, Lance Norskog  wrote:
> In Solr&Lucene, a "shard" is one part of an "index". There cannot be
> "multiple indices in one shard".
>
> All of the shards in an index share the same schema, and no document
> is in two or more shards. "distributed search" as implemented by solr
> searches several shards in one index.
>
> On Thu, Apr 19, 2012 at 11:17 PM, Ramprakash Ramamoorthy
>  wrote:
>> On Thu, Apr 19, 2012 at 9:21 PM, Jeevanandam Madanagopal
>> wrote:
>>
>>> Please have a look
>>>
>>> http://wiki.apache.org/solr/DistributedSearch
>>>
>>> -Jeevanandam
>>>
>>> On Apr 19, 2012, at 9:14 PM, Ramprakash Ramamoorthy wrote:
>>>
>>> > Dear all,
>>> >
>>> >
>>> > I came across this while browsing through lucy
>>> >
>>> > http://lucy.apache.org/docs/perl/Lucy/Search/PolySearcher.html
>>> >
>>> > Does solr have an equivalent of this? My usecase is exactly the same
>>> > (reading through multiple indices in a single shard and perform a
>>> > distribution across shards).
>>> >
>>> > If not can someone give me a hint? I tried swapping readers for a single
>>> > searcher, but didn't help.
>>> >
>>> > --
>>> > With Thanks and Regards,
>>> > Ramprakash Ramamoorthy,
>>> > Project Trainee,
>>> > Zoho Corporation.
>>> > +91 9626975420
>>>
>>>
>> Dear Jeevanandam,
>>
>>         Thanks for the response, but come on, I am aware of it. Try
>> reading my mail again. I will have to read through multiple indices in a
>> single shard, and have a distributed search across all shards.
>>
>> --
>> With Thanks and Regards,
>> Ramprakash Ramamoorthy,
>> Project Trainee,
>> Zoho Corporation.
>> +91 9626975420
>
>
>
> --
> Lance Norskog
> goks...@gmail.com



-- 
Lance Norskog
goks...@gmail.com


Re: PolySearcher in Solr

2012-04-20 Thread Lance Norskog
In Solr&Lucene, a "shard" is one part of an "index". There cannot be
"multiple indices in one shard".

All of the shards in an index share the same schema, and no document
is in two or more shards. "distributed search" as implemented by solr
searches several shards in one index.
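
A distributed request is just an ordinary query with a shards parameter listing
the cores to fan out to, e.g. (hosts and core names are placeholders):

# one request, results merged from both shards
curl "http://server1:8983/solr/select?q=ipod&shards=server1:8983/solr,server2:8983/solr"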

On Thu, Apr 19, 2012 at 11:17 PM, Ramprakash Ramamoorthy
 wrote:
> On Thu, Apr 19, 2012 at 9:21 PM, Jeevanandam Madanagopal
> wrote:
>
>> Please have a look
>>
>> http://wiki.apache.org/solr/DistributedSearch
>>
>> -Jeevanandam
>>
>> On Apr 19, 2012, at 9:14 PM, Ramprakash Ramamoorthy wrote:
>>
>> > Dear all,
>> >
>> >
>> > I came across this while browsing through lucy
>> >
>> > http://lucy.apache.org/docs/perl/Lucy/Search/PolySearcher.html
>> >
>> > Does solr have an equivalent of this? My usecase is exactly the same
>> > (reading through multiple indices in a single shard and perform a
>> > distribution across shards).
>> >
>> > If not can someone give me a hint? I tried swapping readers for a single
>> > searcher, but didn't help.
>> >
>> > --
>> > With Thanks and Regards,
>> > Ramprakash Ramamoorthy,
>> > Project Trainee,
>> > Zoho Corporation.
>> > +91 9626975420
>>
>>
> Dear Jeevanandam,
>
>         Thanks for the response, but come on, I am aware of it. Try
> reading my mail again. I will have to read through multiple indices in a
> single shard, and have a distributed search across all shards.
>
> --
> With Thanks and Regards,
> Ramprakash Ramamoorthy,
> Project Trainee,
> Zoho Corporation.
> +91 9626975420



-- 
Lance Norskog
goks...@gmail.com


Importing formats - Which works best with Solr?

2012-04-20 Thread Spadez
Hi,

I am designing a custom scraping solution. I need to store my data, do some
post-processing on it and then import it into SOLR.

If I want to import data into SOLR in the quickest, easiest way possible,
what format should I be saving my scraped data in? I get the impression
that .XML would be the best choice but I don’t really have much grounding
for that.

Any input would be appreciated.

James


--
View this message in context: 
http://lucene.472066.n3.nabble.com/Importing-formats-Which-works-best-with-Solr-tp3925557p3925557.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: # open files with SolrCloud

2012-04-20 Thread Sami Siren
On Thu, Apr 19, 2012 at 3:12 PM, Sami Siren  wrote:
> I have a simple solrcloud setup from trunk with default configs; 1
> shard with one replica. As few other people have reported there seems
> to be some kind of leak somewhere that causes the number of open files
> to grow over time when doing indexing.
>
> One thing that correlates with the open file count that the jvm
> reports is the count of deleted files that solr still keeps open (not
> sure if the problem is this or something else). The deleted but not
> closed files are all ending with "nrm.cfs", for example
>
> /solr/data/index/_jwk_nrm.cfs (deleted)
>
> Any ideas about what could be the cause for this? I don't even know
> where to start looking...

I think the problem is not just the (deleted) files that count
towards the max open files the user is allowed to have. There's also
the mmapped files that are kept open and count towards the limit
configured in /proc/sys/vm/max_map_count and eventually lead to
exceptions like java.io.IOException: Map failed

there's plenty of those too visible to lsof (all kinds of lucene index
files), for example:

java32624  sam  DELREG8,36425004
/home/sam/example/solr/data/index/_4.fdx
java32624  sam  DELREG8,36424981
/home/sam/example/solr/data/index/_4.fdt
java32624  sam  DELREG8,36425019
/home/sam/example/solr/data/index/_4_nrm.cfs
java32624  sam  DELREG8,36425016
/home/sam/example/solr/data/index/_4_0.tim
java32624  sam  DELREG8,36425015
/home/sam/example/solr/data/index/_4_0.prx
java32624  sam  DELREG8,36425014
/home/sam/example/solr/data/index/_4_0.frq

Why are all these files, regular and mmapped, kept open even if merging
has occurred in the background and the files are deleted?

-
 Sami Siren
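
Two quick checks for anyone trying to reproduce this (the PID and index path
are placeholders; both commands are plain Linux tooling, nothing
Solr-specific):

# count index files that are deleted but still held open by the Solr JVM
lsof -p <solr-pid> | grep 'data/index' | grep -c '(deleted)'

# current per-process limit on memory-mapped regions
cat /proc/sys/vm/max_map_count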