Solr Queries
Hi, I am new to Solr. I have the following queries:

1. Does Solr work in a distributed environment? If yes, how do I configure it?
2. Does Solr have Hadoop support? If yes, how do I set it up with Hadoop/HDFS? (Note: I am familiar with Hadoop.)
3. I have 1 TB of employee information (id, name, address, cell number, personal info). To post (index) this data to the Solr server, do I have to create an XML file with the data and then post it to Solr? Or is there a more optimal way? In the future my data will grow to 10 TB; how can I index that much data? (Creating XML files is a headache.)

Thanks in advance
-Pravin
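For question 3, one way to avoid hand-building XML files is the SolrJ Java client, which posts documents to Solr programmatically over HTTP. A minimal sketch against the Solr 1.4-era SolrJ API follows; the server URL and field names are illustrative, not taken from any actual schema:

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class EmployeeIndexer {
    public static void main(String[] args) throws Exception {
        // Point this at your own Solr instance.
        SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");

        // In practice, stream rows out of the employee database here and
        // send them in batches of a few hundred documents per add() call.
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "emp-1");
        doc.addField("name", "Jane Doe");
        doc.addField("address", "1 Main St");
        doc.addField("cellno", "555-0100");
        server.add(doc);

        // Commit once at the end (or periodically), not per document.
        server.commit();
    }
}

Other XML-free routes worth looking at are the CSV update handler and the DataImportHandler, which can pull rows straight from a JDBC source.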
Re: solr 1.4 formats last_index_time for SQL differently than 1.3 ?!?
really? I don't remember that being changed. What difference do you notice?

On Wed, Oct 7, 2009 at 2:30 AM, michael8 wrote:
>
> Just looking for confirmation from others, but it appears that the formatting
> of last_index_time from dataimport.properties (using DataImportHandler) is
> different in 1.4 vs. that in 1.3. I was troubleshooting why delta imports
> are no longer working for me after moving over to solr 1.4 (10/2 nightly) and
> noticed that the format is different.
>
> Michael
> --
> View this message in context:
> http://www.nabble.com/solr-1.4-formats-last_index_time-for-SQL-differently-than-1.3--%21--tp25776496p25776496.html
> Sent from the Solr - User mailing list archive at Nabble.com.

--
- Noble Paul | Principal Engineer| AOL | http://aol.com
Re: Weird Facet and KeywordTokenizerFactory Issue
Hello Mr. Hostetter, Thank you for patiently reading through my post; I apologize for being cryptic in my previous messages.

>>when you cut/pasted the facet output, you excluded the field names. based
>>on the schema & solrconfig.xml snippets you posted later, i'm assuming
>>they are usstate, and keyword, but you have to be explicit so that people can help correlate the
>>results you are getting with the schema you posted

I had to be brief as my facets are on the order of 100K over 800K documents, and I was afraid that if I posted the complete schema.xml nobody would read my long message :-) Hence I showed only the relevant pieces of the result, showing different fields having the same problem.

>>i'm assuming they are usstate, and keyword, but you have to be explicit so that people can help correlate the
>>results you are getting with the schema you posted -- for example, you haven't posted anything that would verify that the usstate
>>field actually uses your keywordText field

Yes, you are right. Here is the complete relevant snippet regarding keywordText and associated fields. keyword, keywordlower and keywordformatted are all aggregations of all other fields like person, personformatted, organization, location. location itself is an aggregation of usstate and country. The aggregation is done separately in custom code even before indexing into Solr.

>>A huge gap is in what your synonym files contain ... something weird in
>>there could easily explain superfluous terms getting added to your data.

Here are my synonym entries ---

#Persons
barack obama, barak obama, barack h. obama, barack hussein obama, barak hussein obama
hillary clinton, hillary r. clinton, hillary rodham clinton
timothy geithner, tim geithner, timothy f. geithner, geithner, timothy franz geithner
vladimir putin, putin

#Organizations
U.N, U.N., u.n, un, UN, United Nations => U.N
DHS,D.H.S,D.H.S.,Department of Homeland Security,Homeland Security => D.H.S
USCIS, United States Citizenship and Immigration Services, U.S.C.I.S. => United States Citizenship and Immigration Services, U.S.C.I.S
SEC, Securities and Exchange Commission, S.E.C, S.E.C, SEC. => Securities and Exchange Commission, S.E.C
FCC, Federal Communications Commission, F.C.C, F.C.C. => Federal Communications Commission, F.C.C
GSA, General Services Administration, G.S.A, G.S.A. => General Services Administration, G.S.A
SBA, Small Business Administration, S.B.A, S.B.A. => Small Business Administration, S.B.A.
FEMA, Federal Emergency Management Agency, FEMA. => FEMA
AT&T, ATT, ATT., AT&T., AT&T Wireless => AT&T
BBC, British Broadcasting Corporation, B.B.C, B.B.C. => B.B.C,BBC
Bank of America, BOA, B.O.A, Bank of America Corp, Bank of America Corp. => B.O.A
General Motors, G.M., G.M, GM, General Motors Corp., General Motors Corp => General Motors, G.M
NFL, National Football League, N.F.L, N.F.L. => N.F.L
Exxon Mobil, Exxon Mobil Corp => Exxon Mobil
Google, Google Inc, Google Inc. => Google
AIG, A.I.G, A.I.G., American International Group => American International Group, A.I.G
Goldman Sachs, Goldman Sachs Inc., Goldman Sachs Group Inc, Goldman Sachs Group Inc. => Goldman Sachs
GE, General Electric Co., General Electric Co, G.E, G.E., General Electric => G.E, General Electric
General Dynamics, General Dynamics Corp,General Dynamics Corp., General Dynamics Information Technology, General Dynamics Advanced Information Systems => General Dynamics
HP, Hewlett Packard Co,Hewlett Packard Co., Hewlett Packard, Hewlett-Packard, Hewlett-Packard Corp,H.P, H.P. => Hewlett Packard, H.P
IBM, International Business Machines, I.B.M, International Business Machines Corp => I.B.M
Johns Hopkins University, Johns Hopkins, JHU, J.H.U, J.H.U. => Johns Hopkins University, JHU, J.H.U
J.C. Penney, J.C. Penney Co. => J.C. Penney
JPMorgan Chase, JPMorgan Chase & Co., JPMorgan Chase & Co, JPMorgan => JPMorgan Chase & Co.
Lockheed Martin, Lockheed Martin Corp, Lockheed Martin Corp., Lockheed, Lockheed VH => Lockheed Martin
Merrill Lynch, Merrill Lynch & Co., Merrill, Merrill. => Merrill Lynch
Microsoft, Microsoft Corp., Microsoft Corp, Microsoft. => Microsoft
Northrop Grumman, Northrop Grumman Corp., Northrop Grumman Corp, Northrop, Northrop Corp. => Northrop Grumman
Smyth Co., Smyth Co
Sony, Sony Corp., Sony Corp => Sony Corp.
TJX Companies, TJX, TJX Cos. => TJX Companies
Target Corp., Target Corp, Target Corp stores => Target Corp.
Walmart, WalMart Inc, WalMart Stores, WalMart Stores Inc, WalMart Stores Inc. => WalMart Inc.
Yahoo, Yahoo Inc co, Yahoo Inc. => Yahoo Inc.
AP, AP., A.P, A.P., Associated Press => Associated Press

#Countries
USA,USA.,U.S.A.,u.s.a,u.s.a.,U.S,U.S.,US,US.,u.s, u.s.,United States,United States of America,United States Of America,united states,united states of america,
Re: DataImportHandler problem: Feeding the XPathEntityProcessor with the FieldReaderDataSource
hi Lance. db.blob is the correct field name so that is fine. you can probably open an issue and provide the testcase as a patch. That can help us track this better

On Wed, Oct 7, 2009 at 12:45 AM, Lance Norskog wrote:
> A side note that might help: if I change the dataField from 'db.blob'
> to 'blob', this DIH stack emits no documents.
>
> On 10/5/09, Lance Norskog wrote:
>> I've added a unit test for the problem down below. It feeds document
>> field data into the XPathEntityProcessor via the
>> FieldReaderDataSource, and the XPath EP does not emit unpacked fields.
>>
>> Running this under the debugger, I can see the supplied StringReader,
>> with the XML string, being piped into the XPath EP. But somehow the
>> XPath EP does not pick it apart the right way.
>>
>> Here is the DIH configuration file separately.
>>
>> processor='XPathEntityProcessor'
>> forEach='/names' dataField='db.blob'>
>>
>> Any ideas?
>>
>> ---
>>
>> package org.apache.solr.handler.dataimport;
>>
>> import static org.apache.solr.handler.dataimport.AbstractDataImportHandlerTest.createMap;
>> import junit.framework.TestCase;
>>
>> import java.util.ArrayList;
>> import java.util.HashMap;
>> import java.util.List;
>> import java.util.Map;
>>
>> import org.apache.solr.common.SolrInputDocument;
>> import org.apache.solr.common.SolrInputField;
>> import org.apache.solr.handler.dataimport.TestDocBuilder.SolrWriterImpl;
>> import org.junit.Test;
>>
>> /*
>>  * Demonstrate problem feeding XPathEntity from a FieldReaderDatasource
>>  */
>> public class TestFieldReaderXPath extends TestCase {
>>     static final String KISSINGER = "<names><name>Henry</name></names>";
>>
>>     static final String[][][] DBDOCS = {
>>         {{"dbid", "1"}, {"blob", KISSINGER}},
>>     };
>>
>>     /*
>>      * Receive a row from SQL and fetch a row from Solr - no value matching
>>      * stolen from TestDocBuilder
>>      */
>>     @Test
>>     public void testSolrEmbedded() throws Exception {
>>         try {
>>             DataImporter di = new DataImporter();
>>             di.loadDataConfig(dih_config_FR_into_XP);
>>             DataImporter.RequestParams rp = new DataImporter.RequestParams();
>>             rp.command = "full-import";
>>             rp.requestParams = new HashMap<String, Object>();
>>
>>             DataConfig cfg = di.getConfig();
>>             DataConfig.Entity entity = cfg.document.entities.get(0);
>>             List<Map<String, Object>> l = new ArrayList<Map<String, Object>>();
>>             addDBDocuments(l);
>>             MockDataSource.setIterator("select * from x", l.iterator());
>>             entity.dataSrc = new MockDataSource();
>>             entity.isDocRoot = true;
>>             SolrWriterImpl swi = new SolrWriterImpl();
>>             di.runCmd(rp, swi);
>>
>>             assertEquals(1, swi.docs.size());
>>             SolrInputDocument doc = swi.docs.get(0);
>>             SolrInputField field;
>>             field = doc.getField("dbid");
>>             assertEquals(field.getValue().toString(), "1");
>>             field = doc.getField("blob");
>>             assertEquals(field.getValue().toString(), KISSINGER);
>>             field = doc.getField("name");
>>             assertNotNull(field);
>>             assertEquals(field.getValue().toString(), "Henry");
>>         } finally {
>>             MockDataSource.clearCache();
>>         }
>>     }
>>
>>     private void addDBDocuments(List<Map<String, Object>> l) {
>>         for (String[][] dbdoc : DBDOCS) {
>>             l.add(createMap(dbdoc[0][0], dbdoc[0][1], dbdoc[1][0], dbdoc[1][1]));
>>         }
>>     }
>>
>>     String dih_config_FR_into_XP = "\r\n" +
>>         " \r\n" +
>>         " \r\n" +
>>         " \r\n" +
>>         " dataSource='db'>\r\n" +
>>         " \r\n" +
>>         " \r\n" +
>>         " \r\n" +
>>         " processor='XPathEntityProcessor'\r\n" +
>>         " forEach='/names' dataField='db.blob'>\r\n" +
>>         " \r\n" +
>>         " \r\n" +
>>         " \r\n" +
>>         " \r\n" +
>>         "\r\n";
>> }
>
> --
> Lance Norskog
> goks...@gmail.com

--
-
Noble Paul | Principal Engineer| AOL | http://aol.com
Re: Problems with DIH XPath flatten
send a small sample xml snippet you are trying to index and it may help

On Tue, Oct 6, 2009 at 9:29 PM, Adam Foltzer wrote:
> Hi all,
>
> I'm trying to set up DataImportHandler to index some XML documents available
> over web services. The XML includes both content and metadata, so for the
> indexable content, I'm trying to just index everything under the content
> tag:
>
> <entity name="document" url="resturl" processor="XPathEntityProcessor"
>         forEach="/document" transformer="HTMLStripTransformer" flatten="true">
>   <field column="title" xpath="/document/kbml/kbq" />
>   <field column="content" xpath="/document/kbml/body" flatten="true" stripHTML="true" />
> </entity>
>
> The result of this is that the title field gets populated and indexed (there
> are no child nodes of /document/kbml/kbq), but content does not get indexed
> at all. Since /document/kbml/body has many children, I expected that
> flatten="true" would store all of the body text in the field. Instead, it
> stores nothing at all. I've tried this with many combinations of
> transformers and flatten options, and the result is the same each time.
>
> Here are the relevant field declarations from the schema (the type="text" is
> just the one from the example's schema.xml). I have tried combinations here
> as well of stored= and multiValued=, with the same result each time.
>
> <field name="title" type="text" indexed="true" stored="true" multiValued="true" />
> <field name="content" type="text" indexed="true" stored="true" multiValued="true" />
>
> If it would help troubleshooting, I could send along some sample XML. I
> don't want to spam the list with an attachment unless it's necessary, though
> :)
>
> Thanks in advance for your help,
>
> Adam Foltzer

-- - Noble Paul | Principal Engineer| AOL | http://aol.com
Re: search by some functionality
Maybe I'm missing something, but function queries aren't involved in determining whether a document matches or not, only its score. How is a custom function / value-source going to filter?

~ David Smiley

hossman wrote:
>
> : I read about this chapter before. It did not mention how to create my
> : own customized function.
> : Can you point me to some instructions?
>
> The first step is to figure out how you can code your custom functionality
> as an extension of the ValueSource class...
>
> http://lucene.apache.org/solr/api/org/apache/solr/search/function/ValueSource.html
>
> ...which has to be able to generate a DocValues implementation for an
> IndexReader. DocValues is where you would compute some numeric score for each
> document based on whatever criteria you wanted (including arguments passed
> in when your ValueSource is constructed, like field names and constants)
>
> Then you need a simple ValueSourceParser class to be able to specify when
> to use your ValueSource, and that's what you register in solrconfig.xml...
>
> http://wiki.apache.org/solr/SolrPlugins#ValueSourceParser
>
> -Hoss

--
View this message in context: http://www.nabble.com/search-by-some-functionality-tp25721533p25779702.html
Sent from the Solr - User mailing list archive at Nabble.com.
Re: Importing CSV file slow/crashes
Hello Yonik,

Thank you for looking into this. Your question about whether I'm using stock Solr put me in the right direction. I am in fact using a patched version of Solr to get hierarchical facet support (http://issues.apache.org/jira/browse/SOLR-64). I took out the 4 hiefacet fields from the schema and the import was back to normal times of less than a minute. This same configuration worked fine with the 5/1 patched build.

Here is the field definition: omitNorms="true" positionIncrementGap="0" indexed="true" stored="false" delimiter="/" /> multiValued="true"/> stored="true" multiValued="true"/> stored="false" multiValued="true"/> stored="false" multiValued="true"/>

CSV file snippet:

category,category_seo
"T-Shirt Mens/Crew Neck/","t-shirt-mens/crew-neck/"

Thanks again!
Nasseam

On Oct 6, 2009, at 3:22 PM, Yonik Seeley wrote:

On Tue, Oct 6, 2009 at 1:06 PM, Nasseam Elkarra wrote:
I had a dev build of 1.4 from 5/1/2009 and importing a 20K row file took less than a minute. Updating to the latest as of yesterday, the import is really slow and I had to cancel it after a half hour. This prevented me from upgrading a few months ago as well.

I haven't had any success at replicating this problem. I just tried a 100K row CSV file, consisting of an id and a few text fields. The total size of the file is 79MB.

On trunk (today): 22 seconds to index, another 5-7 seconds to commit
5/21 version: 28 seconds to index, another 8 seconds to commit

Then I modified the 5/1 schema to closer match the trunk schema (removing defaults, copyfields that could slow things down).

Modified 5/1 version: 25 seconds to index, another 8 seconds to commit

I only did 2 runs with trunk and 2 with one from 5/1, so the accuracy is probably low... but good enough to see there wasn't a problem in this test.

We really need more info to help reproduce this. Are you using stock solr? Do you have any custom plugins, analyzers, token filters, etc? You're going to need to provide something so others can reproduce this.

-Yonik
http://www.lucidimagination.com
RE: Need "OR" in DisMax Query
Hi David,

See this thread for how I use OR with Dismax.
http://www.mail-archive.com/solr-user@lucene.apache.org/msg19375.html

-- Dean

-Original Message-
From: Ingo Renner [mailto:i...@typo3.org]
Sent: 06 October 2009 05:00
To: solr-user@lucene.apache.org
Subject: Re: Need "OR" in DisMax Query

Am 05.10.2009 um 20:36 schrieb David Giffin:

Hi David,

> Maybe I'm missing something, but I can't seem to get the dismax
> request handler to perform an OR query. It appears that OR is removed
> by the stop words.

It's not the stop words; Dismax simply doesn't do any boolean operations. The only things you can do are using +searchWord and -searchWord, or changing to the standard request handler.

best
Ingo

--
Ingo Renner
TYPO3 Core Developer, Release Manager TYPO3 4.2
Apache Solr for TYPO3: http://www.typo3-solr.com
Re: Authentication/Authorization with Master-Slave over HTTP
: I want to be able to have SOLR Slave instance on publicly available host
: (accessible via HTTP), and synchronize with Master securely (via HTTP)

HTTP based replication only works with the new ReplicationHandler ... if you set up a proxy in front of your Master (either as a separate daemon, or using a custom ServletFilter, or by running on special settings in your ServletContainer) that can require HTTP Basic Authentication, you can then configure the slave to use an arbitrary username/password of your choice (look for httpBasicAuthUser/httpBasicAuthPassword in the example slave configs)

-Hoss
Re: Why isn't the DateField implementation of ISO 8601 broader?
: I would expect field:2001-03 to be a hit on a partial match such as
: field:[2001-02-28T00:00:00Z TO 2001-03-13T00:00:00Z]. I suppose that my
: expectation would be that field:2001-03 would be counted once per day for each
: day in its range. It would follow that a user looking for documents relating

...meanwhile someone else might expect no match unless the ambiguous date is entirely contained within the range being queried on. (your implication of counting once per day would have pretty weird results on faceting, by the way)

with unambiguous dates, you can have exactly what you want just by being a little more verbose when indexing/querying (and someone else can have exactly what they want by being equally verbose using slightly different options/queries).

in your case: i would suggest that you use two fields: date_low and date_high ... when you have an exact date (down to the smallest level of granularity you care about) you put the same value in both fields; when you have an ambiguous value (like 2001-03) you put the largest value possible in date_high and the lowest value possible in date_low (ie: date_low:2001-03-01T00:00:00Z & date_high:2001-03-31T23:59:59.999Z)

then a query for anything *overlapping* the range from feb28 to march 13 would be...

+date_low:[* TO 2001-03-13T00:00:00Z] +date_high:[2001-02-28T00:00:00Z TO *]

...it works for ambiguous dates, and it works for exact dates. (someone else who only wants to see matches if the ranges *completely* overlap would just swap which end point they queried against which field)

-Hoss
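The same two-field trick expressed with SolrJ, for reference -- the document values and server URL are illustrative, and the field names follow the suggestion above:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class AmbiguousDateDemo {
    public static void main(String[] args) throws Exception {
        SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");

        // "March 2001" indexed as the widest range it could possibly cover
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "issue-2001-03");
        doc.addField("date_low", "2001-03-01T00:00:00Z");
        doc.addField("date_high", "2001-03-31T23:59:59.999Z");
        server.add(doc);
        server.commit();

        // find anything overlapping Feb 28 .. Mar 13
        SolrQuery q = new SolrQuery(
            "+date_low:[* TO 2001-03-13T00:00:00Z] " +
            "+date_high:[2001-02-28T00:00:00Z TO *]");
        System.out.println(server.query(q).getResults().getNumFound());
    }
}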
Re: TermsComponent or auto-suggest with filter
Nice. In comparison, how do you do it with faceting?

> "Two other approaches are to use either the TermsComponent (new in Solr
> 1.4) or faceting."

On Wed, Oct 7, 2009 at 1:51 AM, Jay Hill wrote:
> Have a look at a blog I posted on how to use EdgeNGrams to build an
> auto-suggest tool:
>
> http://www.lucidimagination.com/blog/2009/09/08/auto-suggest-from-popular-queries-using-edgengrams/
>
> You could easily add filter queries to this approach. For example, the
> query used in the blog could add filter queries like this:
>
> http://localhost:8983/solr/select/?q=user_query:"i"&wt=json&fl=user_query&indent=on&echoParams=none&rows=10&sort=count desc&fq=yourField:yourQuery&fq=anotherField:anotherQuery
>
> -Jay
> http://www.lucidimagination.com
>
> On Tue, Oct 6, 2009 at 4:40 AM, R. Tan wrote:
>
> > Hello,
> > What's the best way to get auto-suggested terms/keywords that is filtered
> > by one or more fields? TermsComponent should have been the solution but
> > filters are not supported.
> >
> > Thanks,
> > Rihaed
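For reference, the faceting variant of auto-suggest works by faceting on the field that holds the logged queries and constraining facet.prefix to the characters typed so far; the top counts become the suggestions, and filter queries apply as usual. A SolrJ sketch, with field and filter names borrowed from Jay's example and otherwise illustrative:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

public class FacetSuggest {
    public static void main(String[] args) throws Exception {
        SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");

        SolrQuery q = new SolrQuery("*:*");
        q.setRows(0);                              // only the facet counts matter
        q.setFacet(true);
        q.addFacetField("user_query");
        q.setFacetPrefix("i");                     // the characters typed so far
        q.setFacetLimit(10);
        q.addFilterQuery("yourField:yourQuery");   // filtering works, unlike TermsComponent

        QueryResponse rsp = server.query(q);
        System.out.println(rsp.getFacetField("user_query").getValues());
    }
}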
Re: Weird Facet and KeywordTokenizerFactory Issue
A few comments about the info you've provided...

when you cut/pasted the facet output, you excluded the field names. based on the schema & solrconfig.xml snippets you posted later, i'm assuming they are usstate, and keyword, but you have to be explicit so that people can help correlate the results you are getting with the schema you posted -- for example, you haven't posted anything that would verify that the usstate field actually uses your keywordText field, for all we know it has a different field type by mistake (which would explain your problem). ... you have to post everything that would let us connect the dots from input to output in order to see where things might be going wrong.

A huge gap is in what your synonym files contain ... something weird in there could easily explain superfluous terms getting added to your data.

all that said: my best guess is that you have old data in your index from an older version of your schema when you had different analyzers configured. if a term is showing up in the facet counts, you can search on it -- find the first doc that matches, verify that the term isn't actually in the data, and then reindex that one doc -- if it stops matching your search (and the facet count drops by one) then i'm right, just reindex everything.

(this is where a timestamp field recording exactly when each doc was added to the index comes in handy, you can compare it with the file modification time on your schema.xml and be certain which docs were indexed prior to your changes)

-Hoss
Re: search by some functionality
: I read about this chapter before. It did not mention how to create my
: own customized function.
: Can you point me to some instructions?

The first step is to figure out how you can code your custom functionality as an extension of the ValueSource class...

http://lucene.apache.org/solr/api/org/apache/solr/search/function/ValueSource.html

...which has to be able to generate a DocValues implementation for an IndexReader. DocValues is where you would compute some numeric score for each document based on whatever criteria you wanted (including arguments passed in when your ValueSource is constructed, like field names and constants)

Then you need a simple ValueSourceParser class to be able to specify when to use your ValueSource, and that's what you register in solrconfig.xml...

http://wiki.apache.org/solr/SolrPlugins#ValueSourceParser

-Hoss
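To make that concrete, here is a minimal sketch of the pair of classes described above: a ValueSource that scores every document with a constant taken from the function's argument, plus the parser that builds it. The class names are made up, and the exact abstract-method signatures vary between Solr versions (this assumes the 1.3-style getValues(IndexReader)):

import java.io.IOException;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.queryParser.ParseException;
import org.apache.solr.search.FunctionQParser;
import org.apache.solr.search.ValueSourceParser;
import org.apache.solr.search.function.DocValues;
import org.apache.solr.search.function.ValueSource;

// Registered in solrconfig.xml as, e.g.:
//   <valueSourceParser name="myconst" class="com.example.ConstParser"/>
// and then used in queries as {!func}myconst(3.5)
public class ConstParser extends ValueSourceParser {
    public ValueSource parse(FunctionQParser fp) throws ParseException {
        return new ConstValueSource(fp.parseFloat());
    }
}

class ConstValueSource extends ValueSource {
    private final float constant;

    ConstValueSource(float constant) {
        this.constant = constant;
    }

    // The DocValues returned here hands back our score for any document.
    public DocValues getValues(IndexReader reader) throws IOException {
        return new DocValues() {
            public float floatVal(int doc) { return constant; }
            public String toString(int doc) { return description(); }
        };
    }

    public String description() { return "myconst(" + constant + ")"; }

    public boolean equals(Object o) {
        return o instanceof ConstValueSource
            && ((ConstValueSource) o).constant == constant;
    }

    public int hashCode() { return Float.floatToIntBits(constant); }
}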
Re: Why isn't the DateField implementation of ISO 8601 broader?
On 6 Oct 09, at 5:31 PM, Chris Hostetter wrote:

...your expectations may be different than everyone else's. by requiring that the dates be explicit there is no ambiguity, you are in control of the behavior.

The power of some of the other formulas in ISO 8601 is that you don't introduce false levels of precision. The "October 2009" issue of a magazine is precisely tagged as "200910" or "2009-10". It doesn't have a day, hour or minute. Most books come with a copyright year: no month, no day ... In the library/book/periodical world these are a common set of expectations.

Walter
Re: Why isn't the DateField implementation of ISO 8601 broader?
Thanks for making me think about this a little bit deeper, Hoss. Comments in-line.

Chris Hostetter wrote:

because those would be ambiguous. if you just indexed field:2001-03 would you expect it to match field:[2001-02-28T00:00:00Z TO 2001-03-13T00:00:00Z] ... what about date faceting, what should the counts be if you facet per day?

I would expect field:2001-03 to be a hit on a partial match such as field:[2001-02-28T00:00:00Z TO 2001-03-13T00:00:00Z]. I suppose that my expectation would be that field:2001-03 would be counted once per day for each day in its range. It would follow that a user looking for documents relating to 1919 might also be interested in 1910. But conversely a user looking for documents relating to 1919 might really only want documents specifically related to 1919. Maybe the implementation would be smart (or configurable) about precision so that it wouldn't be counted when the precision asked to be represented by facets had more significant figures than the indexed/stored value. Maybe there would be another facet category at each precision for "others" -- the documents that have less precision than the current date facet precision. I'm envisioning a hierarchical system that starts general with century, with click-throughs drilling down eventually to days.

...your expectations may be different than everyone else's. by requiring that the dates be explicit there is no ambiguity, you are in control of the behavior.

I can see your point, but surely there are others out there with non-explicit date data? Does my use case make sense to anyone else?

you can always just index the first date of whatever block of time (month, year, century, etc..) and then facet normally.

Until a better solution presents itself we've gone the route of creating more fields for faceting on different blocks of time. So fields for century, decade, year, month, and day will let us facet on each of these time periods as needed. Documents with dates with less precision will not show up in date facets with more precision. I was hoping there was an elegant hack for faceting on a prefix of a defined number of characters (prefix=*, prefix=**, prefix=***, ...) without having to explicitly specify ..., prefix=188, prefix=189, prefix=190, prefix=191, ...

Regards,
Tricia
Re: Question about PatternReplace filter and automatic Synonym generation
: I'll try to explain with an example. Given the term 'it!' in the title, it
: should match both 'it' and 'it!' in the query as an exact match. Currently,
: this is done by using a synonym entry (and index time SynonymFilter) as
: follows:
:
: it! => it, it!
:
: Now, the above holds true for all cases where you have a title token of the
: form [aA-zZ]*!. Handling all of those cases requires adding synonyms
: manually for each case which is not easy to manage and does not scale.
:
: I am hoping to do the same by using an index time filter that takes in a
: pattern like the PatternReplace filter and adds the newly created token
: instead of replacing the original one. Does this make sense? Am I missing
: something that would break this approach?

something like this would be fairly easy to implement in Lucene, but somewhat confusing to try and configure in Solr.

I was going to suggest that you use something like... ..and then have a subsequent filter that splits the tokens on the whitespace (or any other special character you could use in the replacement) ... but apparently we don't have any built in filters that will just split tokens on a character/pattern for you. that would also be fairly easy to write if someone wants to submit a patch.

-Hoss
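Since no such splitting filter ships out of the box, here is a rough sketch of what one could look like against the Lucene 2.9-era attribute API. The class name is invented, offsets are deliberately left untouched, and the split pieces are stacked at the same position like synonyms:

import java.io.IOException;
import java.util.LinkedList;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;

// Splits each incoming token on a single character and emits the pieces
// as separate tokens at the same position (offsets are not adjusted).
public final class SplitOnCharFilter extends TokenFilter {
    private final TermAttribute termAtt = addAttribute(TermAttribute.class);
    private final PositionIncrementAttribute posAtt =
        addAttribute(PositionIncrementAttribute.class);
    private final LinkedList<String> pending = new LinkedList<String>();
    private final char delim;

    public SplitOnCharFilter(TokenStream input, char delim) {
        super(input);
        this.delim = delim;
    }

    public boolean incrementToken() throws IOException {
        if (!pending.isEmpty()) {
            // emit a queued piece, stacked on the previous token's position
            termAtt.setTermBuffer(pending.removeFirst());
            posAtt.setPositionIncrement(0);
            return true;
        }
        if (!input.incrementToken()) {
            return false;
        }
        String term = termAtt.term();
        if (term.indexOf(delim) >= 0) {
            // keep the first piece for this call, queue the rest
            for (String piece : term.split("\\Q" + delim + "\\E")) {
                if (piece.length() > 0) {
                    pending.add(piece);
                }
            }
            if (!pending.isEmpty()) {
                termAtt.setTermBuffer(pending.removeFirst());
            }
        }
        return true;
    }
}

A small factory extending Solr's BaseTokenFilterFactory would then make it configurable from schema.xml right after the pattern-replace step.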
Re: Importing CSV file slow/crashes
On Tue, Oct 6, 2009 at 1:06 PM, Nasseam Elkarra wrote:
> I had a dev build of 1.4 from 5/1/2009 and importing a 20K row file took less
> than a minute. Updating to the latest as of yesterday, the import is really
> slow and I had to cancel it after a half hour. This prevented me from
> upgrading a few months ago as well.

I haven't had any success at replicating this problem. I just tried a 100K row CSV file, consisting of an id and a few text fields. The total size of the file is 79MB.

On trunk (today): 22 seconds to index, another 5-7 seconds to commit
5/21 version: 28 seconds to index, another 8 seconds to commit

Then I modified the 5/1 schema to closer match the trunk schema (removing defaults, copyfields that could slow things down).

Modified 5/1 version: 25 seconds to index, another 8 seconds to commit

I only did 2 runs with trunk and 2 with one from 5/1, so the accuracy is probably low... but good enough to see there wasn't a problem in this test.

We really need more info to help reproduce this. Are you using stock solr? Do you have any custom plugins, analyzers, token filters, etc? You're going to need to provide something so others can reproduce this.

-Yonik
http://www.lucidimagination.com
Re: conditional sorting
: I tried to simplify the problem, but the point is that I could have
: really complex requirements. For instance, "if in the first 5 results
: none are older than one year, use sort by X, otherwise sort by Y".

First 5 in what order? X, Y, or something else?

: So, the question is, is there a way to make Solr recognize complex
: situations and apply different sorting criterion.

your question may seem simple to you, but unless you codify all the examples of what you consider a "complex situation" and how you expect those to be specified at run time, it's pretty much impossible to give you an answer as to what the best way to achieve your goal is.

the simplest answer based on the information available: if you can express your requirements in java, and put them in a custom Search Component, then Solr can do it.

-Hoss
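As a sketch of that last suggestion, here is what such a component could look like, choosing the sort in prepare() from the request parameters. The class name, parameter, and field names are illustrative; a result-dependent rule (like inspecting the first 5 hits) would instead have to run the query and re-sort inside process():

import java.io.IOException;
import org.apache.solr.common.params.CommonParams;
import org.apache.solr.common.params.ModifiableSolrParams;
import org.apache.solr.common.params.SolrParams;
import org.apache.solr.handler.component.ResponseBuilder;
import org.apache.solr.handler.component.SearchComponent;

public class ConditionalSortComponent extends SearchComponent {

    public void prepare(ResponseBuilder rb) throws IOException {
        SolrParams params = rb.req.getParams();
        ModifiableSolrParams mp = new ModifiableSolrParams(params);
        // Illustrative rule: a request flag decides between two sorts.
        if ("true".equals(params.get("recentFirst"))) {
            mp.set(CommonParams.SORT, "date desc");
        } else {
            mp.set(CommonParams.SORT, "score desc");
        }
        rb.req.setParams(mp);
    }

    public void process(ResponseBuilder rb) throws IOException {
        // nothing to do at process time for this simple rule
    }

    public String getDescription() { return "conditional sorting"; }
    public String getSource() { return "$Source$"; }
    public String getSourceId() { return "$Id$"; }
    public String getVersion() { return "1.0"; }
}

It would be registered with a <searchComponent> element in solrconfig.xml and listed in a handler's first-components.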
Re: Weird Facet and KeywordTokenizerFactory Issue
Got it. Sorry for not having an answer for your problem. On 10/06/2009 04:58 PM, Ravi Kiran wrote: You dont see any facet fields in my query because I have configured them in the solrconfig.xml to give specific fields as facets by default in the dismax and standard handlers so that I dont have to specify all those fields individually everytime I query, all I need to do is just set facet=true thats all dismax explicit 0.01 systemid^20.0 headline^20.0 keyword^18.0 person^18.0 organization^18.0 usstate^18.0 country^18.0 subject^18.0 quote^18.0 blurb^15.0 articlesubhead^8.0 byline^7.0 articleblurb^2.0 body^1.5 multimediablurb^1.5 headline^20.5 keyword^18.5 person^18.5 organization^18.5 usstate^18.5 country^18.5 subject^18.5 quote^18.5 blurb^15.5 articlesubhead^8.5 byline^7.5 articleblurb^2.5 body^2.0 multimediablurb^2.0 recip(rord(pubdatetime),1,1000,1000)^1.0 * 2<-1 5<-3 6<90% 100 *:* keyword 0 keyword regex false 1 5 5 5 5 5 5 contenttype keyword keywordlower keywordformatted person personformatted organization usstate country subject On Tue, Oct 6, 2009 at 5:45 PM, Christian Zambranowrote: I am stumped then. I had a similar issue when I was using a field that was being heavily tokenized, but I corrected the issue by using a field(generated using copyField) that doesn't get analyzed at all. On the query you provided before I didn't see the parameters to tell solr for which field it should produce facets. Something like: http://localhost:8080/solr-admin/topicscore/select/?facet=true&facet.limit=-1&*facet.field=location* On 10/06/2009 04:09 PM, Ravi Kiran wrote: Yes Exactly the same On Tue, Oct 6, 2009 at 4:52 PM, Christian Zambrano wrote: And you had the analyzer for that field set-up the same way as shown on your previous e-mail when you indexed the data? 
On 10/06/2009 03:46 PM, Ravi Kiran wrote: I did infact check it out any there is no weirdness in analysis page...see result below Index Analyzer org.apache.solr.analysis.KeywordTokenizerFactory {} term position 1 term text New York term type word source start,end 0,8 payload org.apache.solr.analysis.TrimFilterFactory {} term position 1 term text New York term type word source start,end 0,8 payload org.apache.solr.analysis.StopFilterFactory {words=entity-stopwords.txt, ignoreCase=true, enablePositionIncrements=true} term position 1 term text New York term type word source start,end 0,8 payload org.apache.solr.analysis.SynonymFilterFactory {synonyms=synonyms.txt, expand=false, ignoreCase=true} term position 1 term text New York term type word source start,end 0,8 payload org.apache.solr.analysis.RemoveDuplicatesTokenFilterFactory {} term position 1 term text New York term type word source start,end 0,8 payload Query Analyzer org.apache.solr.analysis.KeywordTokenizerFactory {} term position 1 term text New York term type word source start,end 0,8 payload org.apache.solr.analysis.TrimFilterFactory {} term position 1 term text New York term type word source start,end 0,8 payload org.apache.solr.analysis.StopFilterFactory {words=entity-stopwords.txt, ignoreCase=true, enablePositionIncrements=true} term position 1 term text New York term type word source start,end 0,8 payload org.apache.solr.analysis.SynonymFilterFactory {synonyms=synonyms.txt, expand=false, ignoreCase=true} term position 1 term text New York term type word source start,end 0,8 payload org.apache.solr.analysis.RemoveDuplicatesTokenFilterFactory {} term position 1 term text New York term type word source start,end 0,8 payload On Tue, Oct 6, 2009 at 4:19 PM, Christian Zambrano wrote: Have you tried using the Analysis page to see what tokens are generated for the string "New York"? It could be one of the token filter is adding the token 'new' for all strings that start with 'new' On 10/06/2009 02:54 PM, Ravi Kiran wrote: Hello All, Iam getting some ghost facets in solr 1.4. Can anybody kindly help me understand why I get them and how to eliminate them. My schema.xml snippet is given at the end. Iam indexing Named Entities extracted via OpenNLP into solr. My understanding regarding KeywordTokenizerFactory is that it will use all words as a single token, am I right ? for example: "New York" will be indexed as 'New York' and will not be split right??? However I see then splitup in facets as follows when running the query " http://localhost:8080/solr-admin/topicscore/select/?facet=true&facet.limit=-1 "...but when I search with standard handler qt=standard&q=keyword:"New" I dont find any doc which has
Re: Weird Facet and KeywordTokenizerFactory Issue
You dont see any facet fields in my query because I have configured them in the solrconfig.xml to give specific fields as facets by default in the dismax and standard handlers so that I dont have to specify all those fields individually everytime I query, all I need to do is just set facet=true thats all dismax explicit 0.01 systemid^20.0 headline^20.0 keyword^18.0 person^18.0 organization^18.0 usstate^18.0 country^18.0 subject^18.0 quote^18.0 blurb^15.0 articlesubhead^8.0 byline^7.0 articleblurb^2.0 body^1.5 multimediablurb^1.5 headline^20.5 keyword^18.5 person^18.5 organization^18.5 usstate^18.5 country^18.5 subject^18.5 quote^18.5 blurb^15.5 articlesubhead^8.5 byline^7.5 articleblurb^2.5 body^2.0 multimediablurb^2.0 recip(rord(pubdatetime),1,1000,1000)^1.0 * 2<-1 5<-3 6<90% 100 *:* keyword 0 keyword regex false 1 5 5 5 5 5 5 contenttype keyword keywordlower keywordformatted person personformatted organization usstate country subject On Tue, Oct 6, 2009 at 5:45 PM, Christian Zambrano wrote: > I am stumped then. I had a similar issue when I was using a field that was > being heavily tokenized, but I corrected the issue by using a > field(generated using copyField) that doesn't get analyzed at all. > > On the query you provided before I didn't see the parameters to tell solr > for which field it should produce facets. > > Something like: > > > http://localhost:8080/solr-admin/topicscore/select/?facet=true&facet.limit=-1&*facet.field=location* > > > > > On 10/06/2009 04:09 PM, Ravi Kiran wrote: > >> Yes Exactly the same >> >> On Tue, Oct 6, 2009 at 4:52 PM, Christian Zambrano> >wrote: >> >> >> >>> And you had the analyzer for that field set-up the same way as shown on >>> your previous e-mail when you indexed the data? >>> >>> >>> >>> >>> On 10/06/2009 03:46 PM, Ravi Kiran wrote: >>> >>> >>> I did infact check it out any there is no weirdness in analysis page...see result below Index Analyzer org.apache.solr.analysis.KeywordTokenizerFactory {} term position 1 term text New York term type word source start,end 0,8 payload org.apache.solr.analysis.TrimFilterFactory {} term position 1 term text New York term type word source start,end 0,8 payload org.apache.solr.analysis.StopFilterFactory {words=entity-stopwords.txt, ignoreCase=true, enablePositionIncrements=true} term position 1 term text New York term type word source start,end 0,8 payload org.apache.solr.analysis.SynonymFilterFactory {synonyms=synonyms.txt, expand=false, ignoreCase=true} term position 1 term text New York term type word source start,end 0,8 payload org.apache.solr.analysis.RemoveDuplicatesTokenFilterFactory {} term position 1 term text New York term type word source start,end 0,8 payload Query Analyzer org.apache.solr.analysis.KeywordTokenizerFactory {} term position 1 term text New York term type word source start,end 0,8 payload org.apache.solr.analysis.TrimFilterFactory {} term position 1 term text New York term type word source start,end 0,8 payload org.apache.solr.analysis.StopFilterFactory {words=entity-stopwords.txt, ignoreCase=true, enablePositionIncrements=true} term position 1 term text New York term type word source start,end 0,8 payload org.apache.solr.analysis.SynonymFilterFactory {synonyms=synonyms.txt, expand=false, ignoreCase=true} term position 1 term text New York term type word source start,end 0,8 payload org.apache.solr.analysis.RemoveDuplicatesTokenFilterFactory {} term position 1 term text New York term type word source start,end 0,8 payload On Tue, Oct 6, 2009 at 4:19 PM, Christian Zambrano>>> > 
wrote: > > > Have you tried using the Analysis page to see what tokens are generated > for > the string "New York"? It could be one of the token filter is adding > the > token 'new' for all strings that start with 'new' > > > On 10/06/2009 02:54 PM, Ravi Kiran wrote: > > > > > >> Hello All, >> Iam getting some ghost facets in solr 1.4. Can anybody >> kindly >> help me understand why I get them and how to eliminate them. My >> schema.xml >> snippet is given at the end. Iam indexing Named Entities extracted via >> OpenNLP into solr. My understanding regarding KeywordTokenizerFactory >> is >> that it will use all words as a single token, am I right ? for >> example: >> "New >> York" will be indexed as 'New York' and will not be split right??? >> However >> I >>
Re: What to set in query.setMaxRows()?
: Sorry about asking this here, but I can't reach wiki.apache.org right now.
: What do I set in query.setMaxRows() to get all the rows?

http://wiki.apache.org/solr/FAQ#How_can_I_get_ALL_the_matching_documents_back.3F_..._How_can_I_return_an_unlimited_number_of_rows.3F

How can I get ALL the matching documents back? ... How can I return an unlimited number of rows?

This is impractical in most cases. People typically only want to do this when they know they are dealing with an index whose size guarantees the result sets will always be small enough that they can feasibly be transmitted in a manageable amount of time -- but if that's the case, just specify what you consider a "manageable amount" as your rows param and get the best of both worlds (all the results when your assumption is right, and a sanity cap on the result size if it turns out your assumptions are wrong)

-Hoss
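If you really do need to walk every match, the usual client-side pattern is to page through with rows/start rather than asking for one unbounded response. A SolrJ sketch (server URL and page size are illustrative):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.common.SolrDocumentList;

public class FetchAll {
    public static void main(String[] args) throws Exception {
        SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");
        SolrQuery query = new SolrQuery("*:*");
        query.setRows(500);                         // a manageable page size

        long fetched = 0;
        SolrDocumentList page;
        do {
            query.setStart((int) fetched);
            page = server.query(query).getResults();
            fetched += page.size();
            // process this page of documents here
        } while (page.size() > 0 && fetched < page.getNumFound());
    }
}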
Re: stats page slow in latest nightly
: When I was working on it, I was actually going to default to not show
: the size, and make you click a link that added a param to get the sizes
: in the display too. But I foolishly didn't bring it up when Hoss made my
: life easier with his simpler patch.

we can always turn the size estimator off ... or turn it on only when doing the insanity checks (so normal stats are fast, but if anything is duplicated you'll get info on the size of the discrepancy)

-Hoss
Re: Weird Facet and KeywordTokenizerFactory Issue
I am stumped then. I had a similar issue when I was using a field that was being heavily tokenized, but I corrected the issue by using a field(generated using copyField) that doesn't get analyzed at all. On the query you provided before I didn't see the parameters to tell solr for which field it should produce facets. Something like: http://localhost:8080/solr-admin/topicscore/select/?facet=true&facet.limit=-1&*facet.field=location* On 10/06/2009 04:09 PM, Ravi Kiran wrote: Yes Exactly the same On Tue, Oct 6, 2009 at 4:52 PM, Christian Zambranowrote: And you had the analyzer for that field set-up the same way as shown on your previous e-mail when you indexed the data? On 10/06/2009 03:46 PM, Ravi Kiran wrote: I did infact check it out any there is no weirdness in analysis page...see result below Index Analyzer org.apache.solr.analysis.KeywordTokenizerFactory {} term position 1 term text New York term type word source start,end 0,8 payload org.apache.solr.analysis.TrimFilterFactory {} term position 1 term text New York term type word source start,end 0,8 payload org.apache.solr.analysis.StopFilterFactory {words=entity-stopwords.txt, ignoreCase=true, enablePositionIncrements=true} term position 1 term text New York term type word source start,end 0,8 payload org.apache.solr.analysis.SynonymFilterFactory {synonyms=synonyms.txt, expand=false, ignoreCase=true} term position 1 term text New York term type word source start,end 0,8 payload org.apache.solr.analysis.RemoveDuplicatesTokenFilterFactory {} term position 1 term text New York term type word source start,end 0,8 payload Query Analyzer org.apache.solr.analysis.KeywordTokenizerFactory {} term position 1 term text New York term type word source start,end 0,8 payload org.apache.solr.analysis.TrimFilterFactory {} term position 1 term text New York term type word source start,end 0,8 payload org.apache.solr.analysis.StopFilterFactory {words=entity-stopwords.txt, ignoreCase=true, enablePositionIncrements=true} term position 1 term text New York term type word source start,end 0,8 payload org.apache.solr.analysis.SynonymFilterFactory {synonyms=synonyms.txt, expand=false, ignoreCase=true} term position 1 term text New York term type word source start,end 0,8 payload org.apache.solr.analysis.RemoveDuplicatesTokenFilterFactory {} term position 1 term text New York term type word source start,end 0,8 payload On Tue, Oct 6, 2009 at 4:19 PM, Christian Zambrano wrote: Have you tried using the Analysis page to see what tokens are generated for the string "New York"? It could be one of the token filter is adding the token 'new' for all strings that start with 'new' On 10/06/2009 02:54 PM, Ravi Kiran wrote: Hello All, Iam getting some ghost facets in solr 1.4. Can anybody kindly help me understand why I get them and how to eliminate them. My schema.xml snippet is given at the end. Iam indexing Named Entities extracted via OpenNLP into solr. My understanding regarding KeywordTokenizerFactory is that it will use all words as a single token, am I right ? for example: "New York" will be indexed as 'New York' and will not be split right??? However I see then splitup in facets as follows when running the query " http://localhost:8080/solr-admin/topicscore/select/?facet=true&facet.limit=-1 "...but when I search with standard handler qt=standard&q=keyword:"New" I dont find any doc which has just "New". After digging in a bit I found that if several keywords have a common starting word it is being pulled out as another facet like the following. 
Any help is greatly appreciated Result 47 >Ghost 7 16 10 147 23 8 5 6 8 10 8 5 7 7 -->Ghost 5 5 7 -->Ghost 6 26 6 27 8 7 12 Schema.xml -
Re: Why isn't the DateField implementation of ISO 8601 broader?
: My question is why isn't the DateField implementation of ISO 8601 broader
: so that it could include YYYY and YYYY-MM as acceptable date strings? What

because those would be ambiguous. if you just indexed field:2001-03 would you expect it to match field:[2001-02-28T00:00:00Z TO 2001-03-13T00:00:00Z] ... what about date faceting, what should the counts be if you facet per day?

...your expectations may be different than everyone else's. by requiring that the dates be explicit there is no ambiguity, you are in control of the behavior.

: would it take to do so? Are there any work-arounds for faceting by century,
: year, month without creating new fields in my schema? The last resort would

you can always just index the first date of whatever block of time (month, year, century, etc..) and then facet normally.

-Hoss
Re: Merging multicore indexes
On Wed, Oct 7, 2009 at 2:40 AM, Paul Rosen wrote:

> Shalin Shekhar Mangar wrote:
>
>> The path on the wiki page was wrong. You need to use the adminPath in the
>> url. Look at the adminPath attribute in solr.xml. It is typically
>> /admin/cores
>>
>> So the correct path for you would be:
>>
>> http://localhost:8983/solr/admin/cores?action=mergeindexes&core=merged&indexDir=/Users/my/path/solr_1.4/solr/data/reindexed_marc/index&indexDir=/Users/my/path/solr_1.4/solr/data/reindexed_rdf/index
>>
>> I've fixed the wiki too.
>
> I think I've got it working. The only difference to the above is that it
> seems to want a relative path, so when I took off the
> "/Users/my/path/solr_1.4/" part I stopped getting errors.

There's no reason why it won't work with an absolute path. Can you post the error? Also, did you correctly urlencode the parameters (if you are using the browser to make such a request, perhaps the '/' character is causing a problem)?

> (Also, I had an insidious problem when using the interface to the browser
> in FF 3.5. It cached my results, so when I queried the core with "*:*" I got
> no results until I cleared my cache. - Hopefully that will save someone else
> a little time.)

Yeah, if you are not using any HTTP caches, you can turn it off by adding the following in the <requestDispatcher> section of solrconfig.xml:

<httpCaching never304="true" />

--
Regards,
Shalin Shekhar Mangar.
Re: Solr Trunk Heap Space Issues
Mark Miller wrote:
> Jeff Newburn wrote:
>
>> So could that potentially explain our use of more ram on indexing? Or is
>> this a rare edge case.
>
> I think it could explain the JVM using more RAM while indexing - but it
> should be fairly easily recoverable from what I can tell - so no
> explanation on the OOM yet. Still looking at that one.
>
> Is your system basically stock, or do you have custom plugins in it?

No matter what I try with however many cores, I can't duplicate your problem.

--
- Mark
http://www.lucidimagination.com
Re: Problems with DIH XPath flatten
Hi Shalin, Good question; sorry I forgot it in the initial post. I have tried with both a nightly build from earlier this month (Oct 2 I believe) as well as a build from the trunk as of yesterday afternoon. Adam On Tue, Oct 6, 2009 at 5:04 PM, Shalin Shekhar Mangar < shalinman...@gmail.com> wrote: > On Tue, Oct 6, 2009 at 9:29 PM, Adam Foltzer wrote: > > > Hi all, > > > > I'm trying to set up DataImportHandler to index some XML documents > > available > > over web services. The XML includes both content and metadata, so for the > > indexable content, I'm trying to just index everything under the content > > tag: > > > > >url="resturl" processor="XPathEntityProcessor" > >forEach="/document" transformer="HTMLStripTransformer" > > flatten="true"> > > > flatten="true" stripHTML="true" /> > > > > > > > > The result of this is that the title field gets populated and indexed > > (there > > are no child nodes of /document/kbml/kbq), but content does not get > indexed > > at all. Since /document/kbml/body has many children, I expected that > > flatten="true" would store all of the body text in the field. Instead, it > > stores nothing at all. I've tried this with many combinations of > > transformers and flatten options, and the result is the same each time. > > > > > Which Solr version are you using? The flatten attribute was introduced > after > 1.3 released. > > -- > Regards, > Shalin Shekhar Mangar. >
Re: Merging multicore indexes
Shalin Shekhar Mangar wrote: The path on the wiki page was wrong. You need to use the adminPath in the url. Look at the adminPath attribute in solr.xml. It is typically /admin/cores So the correct path for you would be: http://localhost:8983/solr/admin/cores?action=mergeindexes&core=merged&indexDir=/Users/my/path/solr_1.4/solr/data/reindexed_marc/index&indexDir=/Users/my/path/solr_1.4/solr/data/reindexed_rdf/index
Re: Weird Facet and KeywordTokenizerFactory Issue
Yes Exactly the same On Tue, Oct 6, 2009 at 4:52 PM, Christian Zambrano wrote: > And you had the analyzer for that field set-up the same way as shown on > your previous e-mail when you indexed the data? > > > > > On 10/06/2009 03:46 PM, Ravi Kiran wrote: > >> I did infact check it out any there is no weirdness in analysis page...see >> result below >> >> Index Analyzer org.apache.solr.analysis.KeywordTokenizerFactory {} term >> position 1 term text New York term type word source start,end 0,8 payload >> org.apache.solr.analysis.TrimFilterFactory {} term position 1 term text >> New >> York term type word source start,end 0,8 payload >> org.apache.solr.analysis.StopFilterFactory {words=entity-stopwords.txt, >> ignoreCase=true, enablePositionIncrements=true} term position 1 term text >> New >> York term type word source start,end 0,8 payload >> org.apache.solr.analysis.SynonymFilterFactory {synonyms=synonyms.txt, >> expand=false, ignoreCase=true} term position 1 term text New York term >> type >> word source start,end 0,8 payload >> org.apache.solr.analysis.RemoveDuplicatesTokenFilterFactory {} term >> position 1 term text New York term type word source start,end 0,8 payload >> Query Analyzer org.apache.solr.analysis.KeywordTokenizerFactory {} term >> position 1 term text New York term type word source start,end 0,8 payload >> org.apache.solr.analysis.TrimFilterFactory {} term position 1 term text >> New >> York term type word source start,end 0,8 payload >> org.apache.solr.analysis.StopFilterFactory {words=entity-stopwords.txt, >> ignoreCase=true, enablePositionIncrements=true} term position 1 term text >> New >> York term type word source start,end 0,8 payload >> org.apache.solr.analysis.SynonymFilterFactory {synonyms=synonyms.txt, >> expand=false, ignoreCase=true} term position 1 term text New York term >> type >> word source start,end 0,8 payload >> org.apache.solr.analysis.RemoveDuplicatesTokenFilterFactory {} term >> position 1 term text New York term type word source start,end 0,8 payload >> >> >> On Tue, Oct 6, 2009 at 4:19 PM, Christian Zambrano> >wrote: >> >> >> >>> Have you tried using the Analysis page to see what tokens are generated >>> for >>> the string "New York"? It could be one of the token filter is adding the >>> token 'new' for all strings that start with 'new' >>> >>> >>> On 10/06/2009 02:54 PM, Ravi Kiran wrote: >>> >>> >>> Hello All, Iam getting some ghost facets in solr 1.4. Can anybody kindly help me understand why I get them and how to eliminate them. My schema.xml snippet is given at the end. Iam indexing Named Entities extracted via OpenNLP into solr. My understanding regarding KeywordTokenizerFactory is that it will use all words as a single token, am I right ? for example: "New York" will be indexed as 'New York' and will not be split right??? However I see then splitup in facets as follows when running the query " http://localhost:8080/solr-admin/topicscore/select/?facet=true&facet.limit=-1 "...but when I search with standard handler qt=standard&q=keyword:"New" I dont find any doc which has just "New". After digging in a bit I found that if several keywords have a common starting word it is being pulled out as another facet like the following. 
Any help is greatly appreciated Result 47 > Ghost 7 16 10 147 23 8 5 6 8 10 8 5 7 7 --> Ghost 5 5 7--> Ghost 6 26 6 27 8 7 12 Schema.xml - >>> sortMissingLast="true" omitNorms="true" positionIncrementGap="100"> >>> words="stopwords.txt,entity-stopwords.txt" enablePositionIncrements="true"/> >>> synonyms="synonyms.txt" ignoreCase="true" expand="false" /> >>> words="stopwords.txt,entity-stopwords.txt" enablePositionIncrements="true" /> >>> synonyms="synonyms.txt" ignoreCase="true" expand="false" /> >>> multiValued="true" termVectors="false" termPositions="false" termOffsets="false"/> >>> stored="true" multiValued="true" termVectors="false" termPositions="false" termOffsets="false"/> >>> stored="true" multiValued="true" termVectors="false" termPositions="false" termOffsets="false"/> >>> stored="true" multiValued="true" termVectors="false" termPositions="false" termOffsets="false"/> >>> >>> >> >> >
Re: Problems with DIH XPath flatten
On Tue, Oct 6, 2009 at 9:29 PM, Adam Foltzer wrote:

> Hi all,
>
> I'm trying to set up DataImportHandler to index some XML documents available
> over web services. The XML includes both content and metadata, so for the
> indexable content, I'm trying to just index everything under the content
> tag:
>
> <entity name="document" url="resturl" processor="XPathEntityProcessor"
>         forEach="/document" transformer="HTMLStripTransformer" flatten="true">
>   <field column="title" xpath="/document/kbml/kbq" />
>   <field column="content" xpath="/document/kbml/body" flatten="true" stripHTML="true" />
> </entity>
>
> The result of this is that the title field gets populated and indexed (there
> are no child nodes of /document/kbml/kbq), but content does not get indexed
> at all. Since /document/kbml/body has many children, I expected that
> flatten="true" would store all of the body text in the field. Instead, it
> stores nothing at all. I've tried this with many combinations of
> transformers and flatten options, and the result is the same each time.

Which Solr version are you using? The flatten attribute was introduced after 1.3 was released.

--
Regards,
Shalin Shekhar Mangar.
solr 1.4 formats last_index_time for SQL differently than 1.3 ?!?
Just looking for confirmation from others, but it appears that the formatting of last_index_time from dataimport.properties (using DataImportHandler) is different in 1.4 vs. that in 1.3. I was troubleshooting why delta imports are no longer working for me after moving over to solr 1.4 (10/2 nightly) and noticed that the format is different.

Michael
--
View this message in context: http://www.nabble.com/solr-1.4-formats-last_index_time-for-SQL-differently-than-1.3--%21--tp25776496p25776496.html
Sent from the Solr - User mailing list archive at Nabble.com.
Re: Weird Facet and KeywordTokenizerFactory Issue
And you had the analyzer for that field set up the same way as shown in your previous e-mail when you indexed the data?

On 10/06/2009 03:46 PM, Ravi Kiran wrote:

I did in fact check it out and there is no weirdness in the analysis page... see the result below

Index Analyzer
org.apache.solr.analysis.KeywordTokenizerFactory {}: position 1, text "New York", type word, start,end 0,8, payload (empty)
org.apache.solr.analysis.TrimFilterFactory {}: position 1, text "New York", type word, start,end 0,8, payload (empty)
org.apache.solr.analysis.StopFilterFactory {words=entity-stopwords.txt, ignoreCase=true, enablePositionIncrements=true}: position 1, text "New York", type word, start,end 0,8, payload (empty)
org.apache.solr.analysis.SynonymFilterFactory {synonyms=synonyms.txt, expand=false, ignoreCase=true}: position 1, text "New York", type word, start,end 0,8, payload (empty)
org.apache.solr.analysis.RemoveDuplicatesTokenFilterFactory {}: position 1, text "New York", type word, start,end 0,8, payload (empty)

Query Analyzer
(identical output at every stage: KeywordTokenizerFactory, TrimFilterFactory, StopFilterFactory, SynonymFilterFactory, and RemoveDuplicatesTokenFilterFactory all show the single token "New York", position 1, start,end 0,8)

On Tue, Oct 6, 2009 at 4:19 PM, Christian Zambrano wrote:

Have you tried using the Analysis page to see what tokens are generated for the string "New York"? It could be that one of the token filters is adding the token 'new' for all strings that start with 'new'

On 10/06/2009 02:54 PM, Ravi Kiran wrote:

Hello All,
I am getting some ghost facets in solr 1.4. Can anybody kindly help me understand why I get them and how to eliminate them. My schema.xml snippet is given at the end. I am indexing Named Entities extracted via OpenNLP into solr. My understanding regarding KeywordTokenizerFactory is that it will use all words as a single token, am I right? for example: "New York" will be indexed as 'New York' and will not be split right??? However I see them split up in facets as follows when running the query "http://localhost:8080/solr-admin/topicscore/select/?facet=true&facet.limit=-1" ...but when I search with the standard handler qt=standard&q=keyword:"New" I don't find any doc which has just "New". After digging in a bit I found that if several keywords have a common starting word it is being pulled out as another facet like the following.

Any help is greatly appreciated

Result 47 > Ghost 7 16 10 147 23 8 5 6 8 10 8 5 7 7 --> Ghost 5 5 7 --> Ghost 6 26 6 27 8 7 12

Schema.xml -
Re: stats page slow in latest nightly
thx much guys, no biggie for me, i just wanted to get to the bottom of it in case i had screwed something else up.. --joe On Tue, Oct 6, 2009 at 1:19 PM, Mark Miller wrote: > I was worried about that actually. I havn't tested how fast the RAM > estimator is on huge String FieldCaches - it will be fast on everything > else, but it checks the size of each String in the array. > > When I was working on it, I was actually going to default to not show > the size, and make you click a link that added a param to get the sizes > in the display too. But I foolishly didn't bring it up when Hoss made my > life easier with his simpler patch. > > Yonik Seeley wrote: >> Might be the new Lucene fieldCache stats stuff that was recently added? >> >> -Yonik >> http://www.lucidimagination.com >> >> >> On Tue, Oct 6, 2009 at 3:56 PM, Joe Calderon wrote: >> >>> hello *, ive been noticing that /admin/stats.jsp is really slow in the >>> recent builds, has anyone else encountered this? >>> >>> >>> --joe >>> > > > -- > - Mark > > http://www.lucidimagination.com > > > >
RE: Solr Timeouts
Yeah, that's exactly right, Mark. What does the "maxCommitsToKeep" parameter (from SolrDeletionPolicy in solrconfig.xml) actually do? Increasing this value seems to have helped a little, but I'm wary of cranking it without having a better understanding of what it does.

-Original Message-
From: Mark Miller [mailto:markrmil...@gmail.com]
Sent: Tuesday, October 06, 2009 4:44 PM
To: solr-user@lucene.apache.org
Subject: Re: Solr Timeouts

It sounds like he is indexing on a local disk, but reading the files to be indexed from NFS - which would be fine. You can get Lucene indexes to work on NFS (though it is still not recommended), but you need to use a custom IndexDeletionPolicy to keep older commit points around longer, and be sure not to use NIOFSDirectory.

Feak, Todd wrote:
> I seem to recall hearing something about *not* putting a Solr index directory
> on an NFS mount. Might want to search on that.
>
> That, of course, doesn't have anything to do with commits showing up
> unexpectedly in stack traces, per your original email.
>
> -Todd
>
> -Original Message-
> From: Giovanni Fernandez-Kincade [mailto:gfernandez-kinc...@capitaliq.com]
> Sent: Tuesday, October 06, 2009 12:39 PM
> To: solr-user@lucene.apache.org; yo...@lucidimagination.com
> Subject: RE: Solr Timeouts
>
> That thread was blocking for an hour while all other threads were idle or
> blocked.
>
> -Original Message-
> From: ysee...@gmail.com [mailto:ysee...@gmail.com] On Behalf Of Yonik Seeley
> Sent: Tuesday, October 06, 2009 3:07 PM
> To: solr-user@lucene.apache.org
> Subject: Re: Solr Timeouts
>
> This specific thread was blocked for an hour?
> If so, I'd echo Lance... this is a local disk right?
>
> -Yonik
> http://www.lucidimagination.com
>
>
> On Mon, Oct 5, 2009 at 2:11 PM, Giovanni Fernandez-Kincade wrote:
>
>> I just grabbed another stack trace for a thread that has been similarly
>> blocking for over an hour. 
Notice that there is no Commit in this one: >> >> http-8080-Processor67 [RUNNABLE] CPU time: 1:02:05 >> org.apache.lucene.index.TermBuffer.read(IndexInput, FieldInfos) >> org.apache.lucene.index.SegmentTermEnum.next() >> org.apache.lucene.index.SegmentTermEnum.scanTo(Term) >> org.apache.lucene.index.TermInfosReader.get(Term, boolean) >> org.apache.lucene.index.TermInfosReader.get(Term) >> org.apache.lucene.index.SegmentTermDocs.seek(Term) >> org.apache.lucene.index.DocumentsWriter.applyDeletes(IndexReader, int) >> org.apache.lucene.index.DocumentsWriter.applyDeletes(SegmentInfos) >> org.apache.lucene.index.IndexWriter.applyDeletes() >> org.apache.lucene.index.IndexWriter.doFlushInternal(boolean, boolean) >> org.apache.lucene.index.IndexWriter.doFlush(boolean, boolean) >> org.apache.lucene.index.IndexWriter.flush(boolean, boolean, boolean) >> org.apache.lucene.index.IndexWriter.updateDocument(Term, Document, Analyzer) >> org.apache.lucene.index.IndexWriter.updateDocument(Term, Document) >> org.apache.solr.update.DirectUpdateHandler2.addDoc(AddUpdateCommand) >> org.apache.solr.update.processor.RunUpdateProcessor.processAdd(AddUpdateCommand) >> org.apache.solr.handler.extraction.ExtractingDocumentLoader.doAdd(SolrContentHandler, >> AddUpdateCommand) >> org.apache.solr.handler.extraction.ExtractingDocumentLoader.addDoc(SolrContentHandler) >> org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(SolrQueryRequest, >> SolrQueryResponse, ContentStream) >> org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(SolrQueryRequest, >> SolrQueryResponse) >> org.apache.solr.handler.RequestHandlerBase.handleRequest(SolrQueryRequest, >> SolrQueryResponse) >> org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(SolrQueryRequest, >> SolrQueryResponse) >> org.apache.solr.core.SolrCore.execute(SolrRequestHandler, SolrQueryRequest, >> SolrQueryResponse) >> org.apache.solr.servlet.SolrDispatchFilter.execute(HttpServletRequest, >> SolrRequestHandler, SolrQueryRequest, SolrQueryResponse) >> org.apache.solr.servlet.SolrDispatchFilter.doFilter(ServletRequest, >> ServletResponse, FilterChain) >> org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ServletRequest, >> ServletResponse) >> org.apache.catalina.core.ApplicationFilterChain.doFilter(ServletRequest, >> ServletResponse) >> org.apache.catalina.core.StandardWrapperValve.invoke(Request, Response) >> org.apache.catalina.core.StandardContextValve.invoke(Request, Response) >> org.apache.catalina.core.StandardHostValve.invoke(Request, Response) >> org.apache.catalina.valves.ErrorReportValve.invoke(Request, Response) >> org.apache.catalina.core.StandardEngineValve.invoke(Request, Response) >> org.apache.catalina.connector.CoyoteAdapter.service(Request, Response) >> org.apache.coyote.http11.Http11Processor.process(InputStream, OutputStream) >> org.apache.coyote.http11.Http11BaseProtocol$Http11ConnectionHandler.processConnection(TcpConnection, >> Object[]) >> org.apache.tomcat.util.net.PoolTcpEndpoint.processSocket(Sock
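On the maxCommitsToKeep question at the top of this message: SolrDeletionPolicy decides which Lucene commit points (and the index files they reference) survive each new commit. The Solr 1.4 example solrconfig.xml configures it like this (the values shown are the stock ones there):

    <deletionPolicy class="solr.SolrDeletionPolicy">
      <!-- number of commit points to keep -->
      <str name="maxCommitsToKeep">1</str>
      <!-- number of optimized commit points to keep -->
      <str name="maxOptimizedCommitsToKeep">0</str>
    </deletionPolicy>

Increasing maxCommitsToKeep keeps older commit points - and their files - alive longer instead of deleting them as soon as a new commit supersedes them. That costs disk space, but it is what lets a reader (an NFS client, a snapshooter, a long-running searcher) keep using files from a superseded commit, which is why raising it helps here.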
Re: Weird Facet and KeywordTokenizerFactory Issue
I did in fact check it out and there is no weirdness in the analysis page... see the result below:

Index Analyzer
org.apache.solr.analysis.KeywordTokenizerFactory {}
  term position 1 | term text: New York | term type: word | source start,end: 0,8
org.apache.solr.analysis.TrimFilterFactory {}
  term position 1 | term text: New York | term type: word | source start,end: 0,8
org.apache.solr.analysis.StopFilterFactory {words=entity-stopwords.txt, ignoreCase=true, enablePositionIncrements=true}
  term position 1 | term text: New York | term type: word | source start,end: 0,8
org.apache.solr.analysis.SynonymFilterFactory {synonyms=synonyms.txt, expand=false, ignoreCase=true}
  term position 1 | term text: New York | term type: word | source start,end: 0,8
org.apache.solr.analysis.RemoveDuplicatesTokenFilterFactory {}
  term position 1 | term text: New York | term type: word | source start,end: 0,8

Query Analyzer
org.apache.solr.analysis.KeywordTokenizerFactory {}
  term position 1 | term text: New York | term type: word | source start,end: 0,8
org.apache.solr.analysis.TrimFilterFactory {}
  term position 1 | term text: New York | term type: word | source start,end: 0,8
org.apache.solr.analysis.StopFilterFactory {words=entity-stopwords.txt, ignoreCase=true, enablePositionIncrements=true}
  term position 1 | term text: New York | term type: word | source start,end: 0,8
org.apache.solr.analysis.SynonymFilterFactory {synonyms=synonyms.txt, expand=false, ignoreCase=true}
  term position 1 | term text: New York | term type: word | source start,end: 0,8
org.apache.solr.analysis.RemoveDuplicatesTokenFilterFactory {}
  term position 1 | term text: New York | term type: word | source start,end: 0,8

On Tue, Oct 6, 2009 at 4:19 PM, Christian Zambrano wrote:
> Have you tried using the Analysis page to see what tokens are generated for
> the string "New York"? It could be that one of the token filters is adding the
> token 'new' for all strings that start with 'new'.
>
> On 10/06/2009 02:54 PM, Ravi Kiran wrote:
>> Hello All,
>> I am getting some ghost facets in solr 1.4. Can anybody kindly
>> help me understand why I get them and how to eliminate them. My schema.xml
>> snippet is given at the end. I am indexing Named Entities extracted via
>> OpenNLP into solr. My understanding regarding KeywordTokenizerFactory is
>> that it will keep the whole field value as a single token, am I right? For example,
>> "New York" will be indexed as 'New York' and will not be split, right? However,
>> I see them split up in facets as follows when running the query
>> http://localhost:8080/solr-admin/topicscore/select/?facet=true&facet.limit=-1
>> ... but when I search with the standard handler (qt=standard&q=keyword:"New")
>> I don't find any doc which has just "New". After digging in a bit I found that if
>> several keywords have a common starting word, it is being pulled out as another
>> facet, like the following. 
>> Any help is greatly appreciated
>>
>> Result
>>
>> 47 > Ghost
>> 7
>> 16
>> 10
>> 147
>> 23
>> 8
>> 5
>> 6
>> 8
>> 10
>> 8
>> 5
>> 7
>>
>> 7 --> Ghost
>> 5
>> 5
>>
>> 7 --> Ghost
>> 6
>> 26
>> 6
>>
>> 27
>> 8
>> 7
>> 12
>>
>> Schema.xml
>> -
>>
>> sortMissingLast="true" omitNorms="true" positionIncrementGap="100">
>> words="stopwords.txt,entity-stopwords.txt" enablePositionIncrements="true"/>
>> ignoreCase="true" expand="false" />
>> words="stopwords.txt,entity-stopwords.txt" enablePositionIncrements="true" />
>> ignoreCase="true" expand="false" />
>> multiValued="true" termVectors="false" termPositions="false" termOffsets="false"/>
>> stored="true" multiValued="true" termVectors="false" termPositions="false" termOffsets="false"/>
>> multiValued="true" termVectors="false" termPositions="false" termOffsets="false"/>
>> multiValued="true" termVectors="false" termPositions="false" termOffsets="false"/>
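The XML element names in the schema snippet above were eaten by the list archive; judging from the attribute fragments that survived and the factory names shown on the analysis page, the fieldType was presumably along these lines (a reconstruction, not the poster's verbatim schema):

    <fieldType name="keywordText" class="solr.TextField" sortMissingLast="true" omitNorms="true" positionIncrementGap="100">
      <analyzer type="index">
        <tokenizer class="solr.KeywordTokenizerFactory"/>
        <filter class="solr.TrimFilterFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt,entity-stopwords.txt" enablePositionIncrements="true"/>
        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="false"/>
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.KeywordTokenizerFactory"/>
        <filter class="solr.TrimFilterFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt,entity-stopwords.txt" enablePositionIncrements="true"/>
        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="false"/>
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
      </analyzer>
    </fieldType>

The four field definitions whose fragments follow evidently all used this type with multiValued="true" and term vectors/positions/offsets disabled.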
Re: Solr Timeouts
It sounds like he is indexing on a local disk, but reading the files to be index from NFS - which would be fine. You can get Lucene indexes to work on NFS (though still not recommended) , but you need to use a custom IndexDeletionPolicy to keep older commit points around longer and be sure not to use NIOFSDirectory. Feak, Todd wrote: > I seem to recall hearing something about *not* putting a Solr index directory > on an NFS mount. Might want to search on that. > > That, of course, doesn't have anything to do with commits showing up > unexpectedly in stack traces, per your original email. > > -Todd > > -Original Message- > From: Giovanni Fernandez-Kincade [mailto:gfernandez-kinc...@capitaliq.com] > Sent: Tuesday, October 06, 2009 12:39 PM > To: solr-user@lucene.apache.org; yo...@lucidimagination.com > Subject: RE: Solr Timeouts > > That thread was blocking for an hour while all other threads were idle or > blocked. > > -Original Message- > From: ysee...@gmail.com [mailto:ysee...@gmail.com] On Behalf Of Yonik Seeley > Sent: Tuesday, October 06, 2009 3:07 PM > To: solr-user@lucene.apache.org > Subject: Re: Solr Timeouts > > This specific thread was blocked for an hour? > If so, I'd echo Lance... this is a local disk right? > > -Yonik > http://www.lucidimagination.com > > > On Mon, Oct 5, 2009 at 2:11 PM, Giovanni Fernandez-Kincade > wrote: > >> I just grabbed another stack trace for a thread that has been similarly >> blocking for over an hour. Notice that there is no Commit in this one: >> >> http-8080-Processor67 [RUNNABLE] CPU time: 1:02:05 >> org.apache.lucene.index.TermBuffer.read(IndexInput, FieldInfos) >> org.apache.lucene.index.SegmentTermEnum.next() >> org.apache.lucene.index.SegmentTermEnum.scanTo(Term) >> org.apache.lucene.index.TermInfosReader.get(Term, boolean) >> org.apache.lucene.index.TermInfosReader.get(Term) >> org.apache.lucene.index.SegmentTermDocs.seek(Term) >> org.apache.lucene.index.DocumentsWriter.applyDeletes(IndexReader, int) >> org.apache.lucene.index.DocumentsWriter.applyDeletes(SegmentInfos) >> org.apache.lucene.index.IndexWriter.applyDeletes() >> org.apache.lucene.index.IndexWriter.doFlushInternal(boolean, boolean) >> org.apache.lucene.index.IndexWriter.doFlush(boolean, boolean) >> org.apache.lucene.index.IndexWriter.flush(boolean, boolean, boolean) >> org.apache.lucene.index.IndexWriter.updateDocument(Term, Document, Analyzer) >> org.apache.lucene.index.IndexWriter.updateDocument(Term, Document) >> org.apache.solr.update.DirectUpdateHandler2.addDoc(AddUpdateCommand) >> org.apache.solr.update.processor.RunUpdateProcessor.processAdd(AddUpdateCommand) >> org.apache.solr.handler.extraction.ExtractingDocumentLoader.doAdd(SolrContentHandler, >> AddUpdateCommand) >> org.apache.solr.handler.extraction.ExtractingDocumentLoader.addDoc(SolrContentHandler) >> org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(SolrQueryRequest, >> SolrQueryResponse, ContentStream) >> org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(SolrQueryRequest, >> SolrQueryResponse) >> org.apache.solr.handler.RequestHandlerBase.handleRequest(SolrQueryRequest, >> SolrQueryResponse) >> org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(SolrQueryRequest, >> SolrQueryResponse) >> org.apache.solr.core.SolrCore.execute(SolrRequestHandler, SolrQueryRequest, >> SolrQueryResponse) >> org.apache.solr.servlet.SolrDispatchFilter.execute(HttpServletRequest, >> SolrRequestHandler, SolrQueryRequest, SolrQueryResponse) >> 
org.apache.solr.servlet.SolrDispatchFilter.doFilter(ServletRequest, >> ServletResponse, FilterChain) >> org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ServletRequest, >> ServletResponse) >> org.apache.catalina.core.ApplicationFilterChain.doFilter(ServletRequest, >> ServletResponse) >> org.apache.catalina.core.StandardWrapperValve.invoke(Request, Response) >> org.apache.catalina.core.StandardContextValve.invoke(Request, Response) >> org.apache.catalina.core.StandardHostValve.invoke(Request, Response) >> org.apache.catalina.valves.ErrorReportValve.invoke(Request, Response) >> org.apache.catalina.core.StandardEngineValve.invoke(Request, Response) >> org.apache.catalina.connector.CoyoteAdapter.service(Request, Response) >> org.apache.coyote.http11.Http11Processor.process(InputStream, OutputStream) >> org.apache.coyote.http11.Http11BaseProtocol$Http11ConnectionHandler.processConnection(TcpConnection, >> Object[]) >> org.apache.tomcat.util.net.PoolTcpEndpoint.processSocket(Socket, >> TcpConnection, Object[]) >> org.apache.tomcat.util.net.LeaderFollowerWorkerThread.runIt(Object[]) >> org.apache.tomcat.util.threads.ThreadPool$ControlRunnable.run() >> java.lang.Thread.run() >> >> >> -Original Message- >> From: ysee...@gmail.com [mailto:ysee...@gmail.com] On Behalf Of Yonik Seeley >> Sent: Monday, October 05, 2009 1:18 PM >> To: solr-user@lucene.apache.org >> Subject: Re: Solr Timeouts >> >> OK... next step
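A minimal sketch of the custom IndexDeletionPolicy Mark mentions, against the Lucene 2.9-era API (the class name and the keep-last-N strategy are illustrative, not anyone's actual code):

    import java.io.IOException;
    import java.util.List;
    import org.apache.lucene.index.IndexCommit;
    import org.apache.lucene.index.IndexDeletionPolicy;

    /**
     * Keeps the N most recent commit points instead of only the latest,
     * so readers (e.g. NFS clients) that still hold files from a recent
     * commit do not see them deleted out from under them.
     */
    public class KeepLastNCommitsPolicy implements IndexDeletionPolicy {
        private final int numToKeep;

        public KeepLastNCommitsPolicy(int numToKeep) {
            this.numToKeep = numToKeep;
        }

        public void onInit(List<? extends IndexCommit> commits) throws IOException {
            prune(commits);
        }

        public void onCommit(List<? extends IndexCommit> commits) throws IOException {
            prune(commits);
        }

        // Lucene passes the commits sorted oldest-first; delete everything
        // except the newest numToKeep commit points.
        private void prune(List<? extends IndexCommit> commits) {
            for (int i = 0; i < commits.size() - numToKeep; i++) {
                commits.get(i).delete();
            }
        }
    }

Solr users would normally get the same effect declaratively via maxCommitsToKeep in SolrDeletionPolicy (see the reply above) rather than writing a policy by hand.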
RE: Solr Timeouts
I seem to recall hearing something about *not* putting a Solr index directory on an NFS mount. Might want to search on that. That, of course, doesn't have anything to do with commits showing up unexpectedly in stack traces, per your original email. -Todd -Original Message- From: Giovanni Fernandez-Kincade [mailto:gfernandez-kinc...@capitaliq.com] Sent: Tuesday, October 06, 2009 12:39 PM To: solr-user@lucene.apache.org; yo...@lucidimagination.com Subject: RE: Solr Timeouts That thread was blocking for an hour while all other threads were idle or blocked. -Original Message- From: ysee...@gmail.com [mailto:ysee...@gmail.com] On Behalf Of Yonik Seeley Sent: Tuesday, October 06, 2009 3:07 PM To: solr-user@lucene.apache.org Subject: Re: Solr Timeouts This specific thread was blocked for an hour? If so, I'd echo Lance... this is a local disk right? -Yonik http://www.lucidimagination.com On Mon, Oct 5, 2009 at 2:11 PM, Giovanni Fernandez-Kincade wrote: > I just grabbed another stack trace for a thread that has been similarly > blocking for over an hour. Notice that there is no Commit in this one: > > http-8080-Processor67 [RUNNABLE] CPU time: 1:02:05 > org.apache.lucene.index.TermBuffer.read(IndexInput, FieldInfos) > org.apache.lucene.index.SegmentTermEnum.next() > org.apache.lucene.index.SegmentTermEnum.scanTo(Term) > org.apache.lucene.index.TermInfosReader.get(Term, boolean) > org.apache.lucene.index.TermInfosReader.get(Term) > org.apache.lucene.index.SegmentTermDocs.seek(Term) > org.apache.lucene.index.DocumentsWriter.applyDeletes(IndexReader, int) > org.apache.lucene.index.DocumentsWriter.applyDeletes(SegmentInfos) > org.apache.lucene.index.IndexWriter.applyDeletes() > org.apache.lucene.index.IndexWriter.doFlushInternal(boolean, boolean) > org.apache.lucene.index.IndexWriter.doFlush(boolean, boolean) > org.apache.lucene.index.IndexWriter.flush(boolean, boolean, boolean) > org.apache.lucene.index.IndexWriter.updateDocument(Term, Document, Analyzer) > org.apache.lucene.index.IndexWriter.updateDocument(Term, Document) > org.apache.solr.update.DirectUpdateHandler2.addDoc(AddUpdateCommand) > org.apache.solr.update.processor.RunUpdateProcessor.processAdd(AddUpdateCommand) > org.apache.solr.handler.extraction.ExtractingDocumentLoader.doAdd(SolrContentHandler, > AddUpdateCommand) > org.apache.solr.handler.extraction.ExtractingDocumentLoader.addDoc(SolrContentHandler) > org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(SolrQueryRequest, > SolrQueryResponse, ContentStream) > org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(SolrQueryRequest, > SolrQueryResponse) > org.apache.solr.handler.RequestHandlerBase.handleRequest(SolrQueryRequest, > SolrQueryResponse) > org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(SolrQueryRequest, > SolrQueryResponse) > org.apache.solr.core.SolrCore.execute(SolrRequestHandler, SolrQueryRequest, > SolrQueryResponse) > org.apache.solr.servlet.SolrDispatchFilter.execute(HttpServletRequest, > SolrRequestHandler, SolrQueryRequest, SolrQueryResponse) > org.apache.solr.servlet.SolrDispatchFilter.doFilter(ServletRequest, > ServletResponse, FilterChain) > org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ServletRequest, > ServletResponse) > org.apache.catalina.core.ApplicationFilterChain.doFilter(ServletRequest, > ServletResponse) > org.apache.catalina.core.StandardWrapperValve.invoke(Request, Response) > org.apache.catalina.core.StandardContextValve.invoke(Request, Response) > 
org.apache.catalina.core.StandardHostValve.invoke(Request, Response) > org.apache.catalina.valves.ErrorReportValve.invoke(Request, Response) > org.apache.catalina.core.StandardEngineValve.invoke(Request, Response) > org.apache.catalina.connector.CoyoteAdapter.service(Request, Response) > org.apache.coyote.http11.Http11Processor.process(InputStream, OutputStream) > org.apache.coyote.http11.Http11BaseProtocol$Http11ConnectionHandler.processConnection(TcpConnection, > Object[]) > org.apache.tomcat.util.net.PoolTcpEndpoint.processSocket(Socket, > TcpConnection, Object[]) > org.apache.tomcat.util.net.LeaderFollowerWorkerThread.runIt(Object[]) > org.apache.tomcat.util.threads.ThreadPool$ControlRunnable.run() > java.lang.Thread.run() > > > -Original Message- > From: ysee...@gmail.com [mailto:ysee...@gmail.com] On Behalf Of Yonik Seeley > Sent: Monday, October 05, 2009 1:18 PM > To: solr-user@lucene.apache.org > Subject: Re: Solr Timeouts > > OK... next step is to verify that SolrCell doesn't have a bug that > causes it to commit. > I'll try and verify today unless someone else beats me to it. > > -Yonik > http://www.lucidimagination.com > > On Mon, Oct 5, 2009 at 1:04 PM, Giovanni Fernandez-Kincade > wrote: >> I'm fairly certain that all of the indexing jobs are calling SOLR with >> commit=false. They all construct the indexing URLs using a CLR function I >> wrote, which takes in a Commit parameter, which is
Re: Weird Facet and KeywordTokenizerFactory Issue
Have you tried using the Analysis page to see what tokens are generated for the string "New York"? It could be that one of the token filters is adding the token 'new' for all strings that start with 'new'.

On 10/06/2009 02:54 PM, Ravi Kiran wrote:
> Hello All,
> I am getting some ghost facets in solr 1.4. Can anybody kindly help me
> understand why I get them and how to eliminate them. My schema.xml snippet
> is given at the end. I am indexing Named Entities extracted via OpenNLP into
> solr. My understanding regarding KeywordTokenizerFactory is that it will keep
> the whole field value as a single token, am I right? For example, "New York"
> will be indexed as 'New York' and will not be split, right? However, I see them
> split up in facets as follows when running the query
> http://localhost:8080/solr-admin/topicscore/select/?facet=true&facet.limit=-1
> ... but when I search with the standard handler (qt=standard&q=keyword:"New")
> I don't find any doc which has just "New". After digging in a bit I found that if
> several keywords have a common starting word, it is being pulled out as another
> facet, like the following. Any help is greatly appreciated.
>
> Result
> 47 > Ghost 7 16 10 147 23 8 5 6 8 10 8 5 7 7--> Ghost 5 5 7 --> Ghost 6 26 6 27 8 7 12
>
> Schema.xml
> -
Re: stats page slow in latest nightly
I was worried about that, actually. I haven't tested how fast the RAM estimator is on huge String FieldCaches - it will be fast on everything else, but it checks the size of each String in the array.

When I was working on it, I was actually going to default to not showing the size, and make you click a link that added a param to get the sizes in the display too. But I foolishly didn't bring it up when Hoss made my life easier with his simpler patch.

Yonik Seeley wrote:
> Might be the new Lucene fieldCache stats stuff that was recently added?
>
> -Yonik
> http://www.lucidimagination.com
>
>
> On Tue, Oct 6, 2009 at 3:56 PM, Joe Calderon wrote:
>
>> hello *, ive been noticing that /admin/stats.jsp is really slow in the
>> recent builds, has anyone else encountered this?
>>
>> --joe

--
- Mark

http://www.lucidimagination.com
RE: Solr and Garbage Collection
Master-Slave replica: new caches will be warmed & prepopulated _before_ the new IndexReader is made available for _new_ requests and _before_ the old one is discarded - it means that the theoretical sizing for the FieldCache (which is defined by the number of docs in an index and the cardinality of a field) should be doubled... and of course we need to play with GC options too for performance tuning (mostly).

> > I read pretty much all posts on this thread (before and after this one). Looks
> > like the main suggestion from you and others is to keep max heap size (-Xmx)
> > as small as possible (as long as you don't see OOM exception).
>
> I suggested the absolute opposite; please note also that "as small as possible"
> does not have any meaning in the multiuser environment of Tomcat. It depends on
> query types (10 documents per request? Or maybe 1?) AND it depends on
> average server loading (one concurrent request? Or maybe 200 threads trying to
> deal with 2000 concurrent requests?) AND it depends on whether it is the Master
> (used for updates - parses tons of docs in a single file?) - and it depends on
> unpredictable memory fragmentation - it all depends on the use case too(!!!),
> in addition to schema / index size.
>
> Please note also, such stuff depends on the JVM vendor too: what if it
> precompiles everything into CPU-native code (including memory dealloc after
> each call)? Some do!
>
> -Fuad
> http://www.linkedin.com/in/liferay
>
> ...but 'core' constantly disagrees with me :)
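A rough back-of-the-envelope for why that doubling matters (illustrative numbers, not from this thread): a sorted String field over a 10M-document index costs, per reader, on the order of

    ord array:    10,000,000 docs x 4 bytes           ~  40 MB
    term values:  say 1,000,000 unique terms x ~60 B  ~  60 MB
                                           per reader ~ 100 MB

and while the new searcher warms, the old and new IndexReaders each hold their own FieldCache entry for the field, so the transient peak for that one field is roughly 200 MB. Heap sizing has to accommodate the peak, not the steady state.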
Re: stats page slow in latest nightly
Might be the new Lucene fieldCache stats stuff that was recently added? -Yonik http://www.lucidimagination.com On Tue, Oct 6, 2009 at 3:56 PM, Joe Calderon wrote: > hello *, ive been noticing that /admin/stats.jsp is really slow in the > recent builds, has anyone else encountered this? > > > --joe
Re: Solr Trunk Heap Space Issues
Jeff Newburn wrote:
> So could that potentially explain our use of more RAM on indexing? Or is
> this a rare edge case?
>
I think it could explain the JVM using more RAM while indexing - but it should be fairly easily recoverable from what I can tell - so no explanation for the OOM yet. Still looking at that one. Is your system basically stock, or do you have custom plugins in it?

--
- Mark

http://www.lucidimagination.com
stats page slow in latest nightly
hello *, ive been noticing that /admin/stats.jsp is really slow in the recent builds, has anyone else encountered this? --joe
Weird Facet and KeywordTokenizerFactory Issue
Hello All,
I am getting some ghost facets in solr 1.4. Can anybody kindly help me understand why I get them and how to eliminate them. My schema.xml snippet is given at the end. I am indexing Named Entities extracted via OpenNLP into solr. My understanding regarding KeywordTokenizerFactory is that it will keep the whole field value as a single token, am I right? For example, "New York" will be indexed as 'New York' and will not be split, right? However, I see them split up in facets as follows when running the query http://localhost:8080/solr-admin/topicscore/select/?facet=true&facet.limit=-1 ... but when I search with the standard handler (qt=standard&q=keyword:"New") I don't find any doc which has just "New". After digging in a bit I found that if several keywords have a common starting word, it is being pulled out as another facet, like the following. Any help is greatly appreciated.

Result
47> Ghost 7 16 10 147 23 8 5 6 8 10 8 5 7 7--> Ghost 5 5 7 --> Ghost 6 26 6 27 8 7 12

Schema.xml
-
Re: Solr Trunk Heap Space Issues
So could that potentially explain our use of more ram on indexing? Or is this a rare edge case. -- Jeff Newburn Software Engineer, Zappos.com jnewb...@zappos.com - 702-943-7562 > From: Mark Miller > Reply-To: > Date: Tue, 06 Oct 2009 15:30:50 -0400 > To: > Subject: Re: Solr Trunk Heap Space Issues > > This is looking like its just a Lucene oddity you get when adding a > single doc due to some changes with the NRT stuff. > > Mark Miller wrote: >> Okay - I'm sorry - serves me right for working sick. >> >> Now that I have put on my glasses and correctly tagged my two eclipse tests: >> >> It still appears that trunk likes to use more RAM. >> >> I switched both tests to one million iterations and watched the heap. >> >> The test from the build around may 5th (I promise :) ) regularly GC's >> down to about 70-80MB after a fair time >> of running. It doesn't appear to climb - keeps GC'ing back to 70-80 >> (after starting at by GC'ing down to 40 for a bit). >> >> The test from trunk, after a fair time of running, keeps GC'ing down to >> about 120-150MB - 150 at the end, slowly working its >> way up from 90-110 at the beginning. >> >> Don't know what that means yet - but it appears trunk likes to use a bit >> more RAM while indexing. Odd that its so much more because these docs >> are tiny: >> >> String[] fields = {"text","simple" >> ,"text","test" >> ,"text","how now brown cow" >> ,"text","what's that?" >> ,"text","radical!" >> ,"text","what's all this about, anyway?" >> ,"text","just how fast is this text indexing?" >> }; >> >> Mark Miller wrote: >> >>> Okay, I juggled the tests in eclipse and flipped the results. So they >>> make sense. >>> >>> Sorry - goose chase on this one. >>> >>> Yonik Seeley wrote: >>> >>> I don't see this with trunk... I just tried TestIndexingPerformance with 1M docs, and it seemed to work fine. Memory use stabilized at 40MB. Most memory use was for indexing (not analysis). char[] topped out at 4.5MB -Yonik http://www.lucidimagination.com On Tue, Oct 6, 2009 at 12:31 PM, Mark Miller wrote: > Yeah - I was wondering about that ... not sure how these guys are > stacking up ... > > Yonik Seeley wrote: > > > >> TestIndexingPerformance? >> What the heck... that's not even multi-threaded! >> >> -Yonik >> http://www.lucidimagination.com >> >> >> >> On Tue, Oct 6, 2009 at 12:17 PM, Mark Miller >> wrote: >> >> >> >> >>> Darnit - didn't finish that email. This is after running your old short >>> doc perf test for 10,000 iterations. You see the same thing with 1000 >>> iterations but much less pronounced eg gettin' worse with more >>> iterations. >>> >>> Mark Miller wrote: >>> >>> >>> >>> A little before and after. The before is around may 5th'is - the after is trunk. http://myhardshadow.com/memanalysis/before.png http://myhardshadow.com/memanalysis/after.png Mark Miller wrote: > Took a peak at the checkout around the time he says he's using. > > CharTokenizer appears to be holding onto much large char[] arrays now > than before. Same with snowball.Among - used to be almost nothing, now > its largio. > > The new TokenStream stuff appears to be clinging. Needs to find some > inner peace. > > > >>> >>> >> >> >> > > > -- > - Mark > > http://www.lucidimagination.com > > >
RE: Solr Timeouts
That thread was blocking for an hour while all other threads were idle or blocked. -Original Message- From: ysee...@gmail.com [mailto:ysee...@gmail.com] On Behalf Of Yonik Seeley Sent: Tuesday, October 06, 2009 3:07 PM To: solr-user@lucene.apache.org Subject: Re: Solr Timeouts This specific thread was blocked for an hour? If so, I'd echo Lance... this is a local disk right? -Yonik http://www.lucidimagination.com On Mon, Oct 5, 2009 at 2:11 PM, Giovanni Fernandez-Kincade wrote: > I just grabbed another stack trace for a thread that has been similarly > blocking for over an hour. Notice that there is no Commit in this one: > > http-8080-Processor67 [RUNNABLE] CPU time: 1:02:05 > org.apache.lucene.index.TermBuffer.read(IndexInput, FieldInfos) > org.apache.lucene.index.SegmentTermEnum.next() > org.apache.lucene.index.SegmentTermEnum.scanTo(Term) > org.apache.lucene.index.TermInfosReader.get(Term, boolean) > org.apache.lucene.index.TermInfosReader.get(Term) > org.apache.lucene.index.SegmentTermDocs.seek(Term) > org.apache.lucene.index.DocumentsWriter.applyDeletes(IndexReader, int) > org.apache.lucene.index.DocumentsWriter.applyDeletes(SegmentInfos) > org.apache.lucene.index.IndexWriter.applyDeletes() > org.apache.lucene.index.IndexWriter.doFlushInternal(boolean, boolean) > org.apache.lucene.index.IndexWriter.doFlush(boolean, boolean) > org.apache.lucene.index.IndexWriter.flush(boolean, boolean, boolean) > org.apache.lucene.index.IndexWriter.updateDocument(Term, Document, Analyzer) > org.apache.lucene.index.IndexWriter.updateDocument(Term, Document) > org.apache.solr.update.DirectUpdateHandler2.addDoc(AddUpdateCommand) > org.apache.solr.update.processor.RunUpdateProcessor.processAdd(AddUpdateCommand) > org.apache.solr.handler.extraction.ExtractingDocumentLoader.doAdd(SolrContentHandler, > AddUpdateCommand) > org.apache.solr.handler.extraction.ExtractingDocumentLoader.addDoc(SolrContentHandler) > org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(SolrQueryRequest, > SolrQueryResponse, ContentStream) > org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(SolrQueryRequest, > SolrQueryResponse) > org.apache.solr.handler.RequestHandlerBase.handleRequest(SolrQueryRequest, > SolrQueryResponse) > org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(SolrQueryRequest, > SolrQueryResponse) > org.apache.solr.core.SolrCore.execute(SolrRequestHandler, SolrQueryRequest, > SolrQueryResponse) > org.apache.solr.servlet.SolrDispatchFilter.execute(HttpServletRequest, > SolrRequestHandler, SolrQueryRequest, SolrQueryResponse) > org.apache.solr.servlet.SolrDispatchFilter.doFilter(ServletRequest, > ServletResponse, FilterChain) > org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ServletRequest, > ServletResponse) > org.apache.catalina.core.ApplicationFilterChain.doFilter(ServletRequest, > ServletResponse) > org.apache.catalina.core.StandardWrapperValve.invoke(Request, Response) > org.apache.catalina.core.StandardContextValve.invoke(Request, Response) > org.apache.catalina.core.StandardHostValve.invoke(Request, Response) > org.apache.catalina.valves.ErrorReportValve.invoke(Request, Response) > org.apache.catalina.core.StandardEngineValve.invoke(Request, Response) > org.apache.catalina.connector.CoyoteAdapter.service(Request, Response) > org.apache.coyote.http11.Http11Processor.process(InputStream, OutputStream) > org.apache.coyote.http11.Http11BaseProtocol$Http11ConnectionHandler.processConnection(TcpConnection, > Object[]) > 
org.apache.tomcat.util.net.PoolTcpEndpoint.processSocket(Socket, > TcpConnection, Object[]) > org.apache.tomcat.util.net.LeaderFollowerWorkerThread.runIt(Object[]) > org.apache.tomcat.util.threads.ThreadPool$ControlRunnable.run() > java.lang.Thread.run() > > > -Original Message- > From: ysee...@gmail.com [mailto:ysee...@gmail.com] On Behalf Of Yonik Seeley > Sent: Monday, October 05, 2009 1:18 PM > To: solr-user@lucene.apache.org > Subject: Re: Solr Timeouts > > OK... next step is to verify that SolrCell doesn't have a bug that > causes it to commit. > I'll try and verify today unless someone else beats me to it. > > -Yonik > http://www.lucidimagination.com > > On Mon, Oct 5, 2009 at 1:04 PM, Giovanni Fernandez-Kincade > wrote: >> I'm fairly certain that all of the indexing jobs are calling SOLR with >> commit=false. They all construct the indexing URLs using a CLR function I >> wrote, which takes in a Commit parameter, which is always set to false. >> >> Also, I don't see any calls to commit in the Tomcat logs (whereas normally >> when I make a commit call I do). >> >> This suggests that Solr is doing it automatically, but the extract handler >> doesn't seem to be the problem: >> > class="org.apache.solr.handler.extraction.ExtractingRequestHandler" >> startup="lazy"> >> >> ignored_ >> fileData >> >> >> >> >> There is no external config file specified, and I don't see anything about >>
RE: Solr Timeouts
Yeah this is Java 1.6. The indexes are being written to a local disk, but they files being indexed live on a NFS. -Original Message- From: Lance Norskog [mailto:goks...@gmail.com] Sent: Tuesday, October 06, 2009 2:59 PM To: solr-user@lucene.apache.org Subject: Re: Solr Timeouts Is this Java 1.5? There are known threading bugs in 1.5 that were fixed in Java 1.6. Also, there was one short series of 1.6 releases that wrote bogus Lucene index files. So, make sure you use the latest Java 1.6 release. Also, I hope this is a local disk. Some shops try running over NFS or Windows file sharing and this often does not work well. Lance On 10/6/09, Giovanni Fernandez-Kincade wrote: > Is it possible that deletions are triggering these commits? Some of the > documents that I'm making indexing requests for already exist in the index, > so they would result in deletions. I tried messing with some of these > parameters but I'm still running into the same problem: > > > > false > > 100 > > > > This is happening like every 30-40minutes and it's really hampering the > indexing progress... > > > -Original Message- > From: Giovanni Fernandez-Kincade [mailto:gfernandez-kinc...@capitaliq.com] > Sent: Monday, October 05, 2009 2:11 PM > To: solr-user@lucene.apache.org; yo...@lucidimagination.com > Subject: RE: Solr Timeouts > > I just grabbed another stack trace for a thread that has been similarly > blocking for over an hour. Notice that there is no Commit in this one: > > http-8080-Processor67 [RUNNABLE] CPU time: 1:02:05 > org.apache.lucene.index.TermBuffer.read(IndexInput, FieldInfos) > org.apache.lucene.index.SegmentTermEnum.next() > org.apache.lucene.index.SegmentTermEnum.scanTo(Term) > org.apache.lucene.index.TermInfosReader.get(Term, boolean) > org.apache.lucene.index.TermInfosReader.get(Term) > org.apache.lucene.index.SegmentTermDocs.seek(Term) > org.apache.lucene.index.DocumentsWriter.applyDeletes(IndexReader, int) > org.apache.lucene.index.DocumentsWriter.applyDeletes(SegmentInfos) > org.apache.lucene.index.IndexWriter.applyDeletes() > org.apache.lucene.index.IndexWriter.doFlushInternal(boolean, boolean) > org.apache.lucene.index.IndexWriter.doFlush(boolean, boolean) > org.apache.lucene.index.IndexWriter.flush(boolean, boolean, boolean) > org.apache.lucene.index.IndexWriter.updateDocument(Term, Document, Analyzer) > org.apache.lucene.index.IndexWriter.updateDocument(Term, Document) > org.apache.solr.update.DirectUpdateHandler2.addDoc(AddUpdateCommand) > org.apache.solr.update.processor.RunUpdateProcessor.processAdd(AddUpdateCommand) > org.apache.solr.handler.extraction.ExtractingDocumentLoader.doAdd(SolrContentHandler, > AddUpdateCommand) > org.apache.solr.handler.extraction.ExtractingDocumentLoader.addDoc(SolrContentHandler) > org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(SolrQueryRequest, > SolrQueryResponse, ContentStream) > org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(SolrQueryRequest, > SolrQueryResponse) > org.apache.solr.handler.RequestHandlerBase.handleRequest(SolrQueryRequest, > SolrQueryResponse) > org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(SolrQueryRequest, > SolrQueryResponse) > org.apache.solr.core.SolrCore.execute(SolrRequestHandler, SolrQueryRequest, > SolrQueryResponse) > org.apache.solr.servlet.SolrDispatchFilter.execute(HttpServletRequest, > SolrRequestHandler, SolrQueryRequest, SolrQueryResponse) > org.apache.solr.servlet.SolrDispatchFilter.doFilter(ServletRequest, > ServletResponse, FilterChain) > 
org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ServletRequest, > ServletResponse) > org.apache.catalina.core.ApplicationFilterChain.doFilter(ServletRequest, > ServletResponse) > org.apache.catalina.core.StandardWrapperValve.invoke(Request, Response) > org.apache.catalina.core.StandardContextValve.invoke(Request, Response) > org.apache.catalina.core.StandardHostValve.invoke(Request, Response) > org.apache.catalina.valves.ErrorReportValve.invoke(Request, Response) > org.apache.catalina.core.StandardEngineValve.invoke(Request, Response) > org.apache.catalina.connector.CoyoteAdapter.service(Request, Response) > org.apache.coyote.http11.Http11Processor.process(InputStream, OutputStream) > org.apache.coyote.http11.Http11BaseProtocol$Http11ConnectionHandler.processConnection(TcpConnection, > Object[]) > org.apache.tomcat.util.net.PoolTcpEndpoint.processSocket(Socket, > TcpConnection, Object[]) > org.apache.tomcat.util.net.LeaderFollowerWorkerThread.runIt(Object[]) > org.apache.tomcat.util.threads.ThreadPool$ControlRunnable.run() > java.lang.Thread.run() > > > -Original Message- > From: ysee...@gmail.com [mailto:ysee...@gmail.com] On Behalf Of Yonik Seeley > Sent: Monday, October 05, 2009 1:18 PM > To: solr-user@lucene.apache.org > Subject: Re: Solr Timeouts > > OK... next step is to verify that SolrCell doesn't have a bug that > causes it to commit. >
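A note on the knob relevant to the applyDeletes stalls in the traces above: every add of a document whose uniqueKey already exists buffers a delete term, and Lucene applies the buffered deletes when it flushes - these are flushes, not commits, so nothing becomes searchable and no commit shows in the logs. In a Solr 1.3/1.4-style solrconfig.xml, flush frequency is controlled under <indexDefaults>; a sketch with an illustrative value (not the poster's config):

    <indexDefaults>
      <!-- flush (and apply buffered delete terms) when the RAM buffer
           fills, rather than every N documents: fewer, but individually
           longer, applyDeletes passes -->
      <ramBufferSizeMB>128</ramBufferSizeMB>
      <mergeFactor>10</mergeFactor>
    </indexDefaults>

The trade-off is latency shape, not total work: a larger buffer makes the pauses rarer but each one scans more buffered delete terms.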
Re: Solr Trunk Heap Space Issues
This is looking like its just a Lucene oddity you get when adding a single doc due to some changes with the NRT stuff. Mark Miller wrote: > Okay - I'm sorry - serves me right for working sick. > > Now that I have put on my glasses and correctly tagged my two eclipse tests: > > It still appears that trunk likes to use more RAM. > > I switched both tests to one million iterations and watched the heap. > > The test from the build around may 5th (I promise :) ) regularly GC's > down to about 70-80MB after a fair time > of running. It doesn't appear to climb - keeps GC'ing back to 70-80 > (after starting at by GC'ing down to 40 for a bit). > > The test from trunk, after a fair time of running, keeps GC'ing down to > about 120-150MB - 150 at the end, slowly working its > way up from 90-110 at the beginning. > > Don't know what that means yet - but it appears trunk likes to use a bit > more RAM while indexing. Odd that its so much more because these docs > are tiny: > > String[] fields = {"text","simple" > ,"text","test" > ,"text","how now brown cow" > ,"text","what's that?" > ,"text","radical!" > ,"text","what's all this about, anyway?" > ,"text","just how fast is this text indexing?" > }; > > Mark Miller wrote: > >> Okay, I juggled the tests in eclipse and flipped the results. So they >> make sense. >> >> Sorry - goose chase on this one. >> >> Yonik Seeley wrote: >> >> >>> I don't see this with trunk... I just tried TestIndexingPerformance >>> with 1M docs, and it seemed to work fine. >>> Memory use stabilized at 40MB. >>> Most memory use was for indexing (not analysis). >>> char[] topped out at 4.5MB >>> >>> -Yonik >>> http://www.lucidimagination.com >>> >>> >>> On Tue, Oct 6, 2009 at 12:31 PM, Mark Miller wrote: >>> >>> >>> Yeah - I was wondering about that ... not sure how these guys are stacking up ... Yonik Seeley wrote: > TestIndexingPerformance? > What the heck... that's not even multi-threaded! > > -Yonik > http://www.lucidimagination.com > > > > On Tue, Oct 6, 2009 at 12:17 PM, Mark Miller > wrote: > > > > >> Darnit - didn't finish that email. This is after running your old short >> doc perf test for 10,000 iterations. You see the same thing with 1000 >> iterations but much less pronounced eg gettin' worse with more >> iterations. >> >> Mark Miller wrote: >> >> >> >> >>> A little before and after. The before is around may 5th'is - the after >>> is trunk. >>> >>> http://myhardshadow.com/memanalysis/before.png >>> http://myhardshadow.com/memanalysis/after.png >>> >>> Mark Miller wrote: >>> >>> >>> >>> >>> Took a peak at the checkout around the time he says he's using. CharTokenizer appears to be holding onto much large char[] arrays now than before. Same with snowball.Among - used to be almost nothing, now its largio. The new TokenStream stuff appears to be clinging. Needs to find some inner peace. >> >> > > > -- - Mark http://www.lucidimagination.com
RE: Solr and Garbage Collection
> I read pretty much all posts on this thread (before and after this one). Looks
> like the main suggestion from you and others is to keep max heap size (-Xmx)
> as small as possible (as long as you don't see OOM exception).

I suggested the absolute opposite; please note also that "as small as possible" does not have any meaning in the multiuser environment of Tomcat. It depends on query types (10 documents per request? Or maybe 1?) AND it depends on average server loading (one concurrent request? Or maybe 200 threads trying to deal with 2000 concurrent requests?) AND it depends on whether it is the Master (used for updates - parses tons of docs in a single file?) - and it depends on unpredictable memory fragmentation - it all depends on the use case too(!!!), in addition to schema / index size.

Please note also, such stuff depends on the JVM vendor too: what if it precompiles everything into CPU-native code (including memory dealloc after each call)? Some do!

-Fuad
http://www.linkedin.com/in/liferay

...but 'core' constantly disagrees with me :)
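For the concrete option side of this, a common starting point for a search workload on a recent 1.6 JVM looks like the following (the sizes are purely illustrative and, as noted above, must be derived from your own query mix and concurrency):

    java -Xms4g -Xmx4g \
         -XX:+UseConcMarkSweepGC -XX:+UseParNewGC \
         -XX:CMSInitiatingOccupancyFraction=70 \
         -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps \
         -jar start.jar

Setting -Xms equal to -Xmx avoids heap resizing, the CMS/ParNew pair trades throughput for shorter pauses, and the GC logging flags give the data needed to tune rather than guess.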
RE: Geo Coding Service
If you are looking for a (simplified) ZIP/postal-code -> longitude/latitude mapping (North America), check this: http://www.zipcodedownload.com

I am using it for service-area calculations for casaGURU renovation professionals at http://www.casaguru.com

They even have an API library (including stored procedures for MySQL, Oracle, etc., plus a Java API) to calculate the distance between two postal codes, execute queries, etc.

-Fuad
http://www.linkedin.com/in/liferay

> -----Original Message-----
> From: ram_sj [mailto:rpachaiyap...@gmail.com]
> Sent: October-06-09 2:33 PM
> To: solr-user@lucene.apache.org
> Subject: Geo Coding Service
>
> Hi,
>
> Can someone suggest a good geo-coding service or software for commercial
> use? I want to find geocodes for a large collection of addresses. I'm looking for
> a good long-term service.
>
> Thanks
> Ram
> --
> View this message in context: http://www.nabble.com/Geo-Coding-Service-tp25774277p25774277.html
> Sent from the Solr - User mailing list archive at Nabble.com.
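For anyone rolling their own distance check on top of such a latitude/longitude table, the usual great-circle (haversine) formula is short enough to inline; a minimal sketch in Java (class and method names are made up for illustration):

    import static java.lang.Math.*;

    public final class GeoDistance {
        private static final double EARTH_RADIUS_KM = 6371.0;

        /** Great-circle distance between two lat/lon points, in kilometers. */
        public static double haversineKm(double lat1, double lon1,
                                         double lat2, double lon2) {
            double dLat = toRadians(lat2 - lat1);
            double dLon = toRadians(lon2 - lon1);
            // a = sin^2(dLat/2) + cos(lat1) * cos(lat2) * sin^2(dLon/2)
            double a = sin(dLat / 2) * sin(dLat / 2)
                     + cos(toRadians(lat1)) * cos(toRadians(lat2))
                     * sin(dLon / 2) * sin(dLon / 2);
            return 2 * EARTH_RADIUS_KM * asin(sqrt(a));
        }
    }

For example, haversineKm(40.7128, -74.0060, 34.0522, -118.2437) returns roughly 3,940 km (New York to Los Angeles).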
Re: Importing CSV file slow/crashes
: Is it possible to narrow down what fields/field-types are causing the problems? : Or perhaps profile and see what's taking up time compared to the older version? Or: could you post your solrconfig + schema + csv files online so other people could help debug the problem? : : -Yonik : http://www.lucidimagination.com : : : : On Tue, Oct 6, 2009 at 1:48 PM, Nasseam Elkarra wrote: : > Hello Erick, : > : > Sorry about that. I'm using the CSV update handler. Uploading a local CSV : > using the stream.file parameter. There are 94 fields and 36 copyFields. : > : > Thank you, : > Nasseam : > : > On Oct 6, 2009, at 10:09 AM, Erick Erickson wrote: : > : >> Well, without some better idea of *how* you're doing the import, it's a : >> little hard to say anything meaningful (hint, hint). : >> Best : >> Erick : >> : >> On Tue, Oct 6, 2009 at 1:06 PM, Nasseam Elkarra : >> wrote: : >> : >>> Hello all, : >>> : >>> I had a dev build of 1.4 from 5/1/2009 and importing a 20K row took less : >>> than a minute. Updating to the latest as of yesterday, the import is : >>> really : >>> slow and I had to cancel it after a half hour. This prevented me from : >>> upgrading a few months ago as well. : >>> : >>> Any ideas as to the cause of this? : >>> : >>> Thank you, : >>> : >>> Nasseam Elkarra : >>> http://bodukai.com/boutique/ : >>> The fastest possible shopping experience. : >>> : >>> : > : > : -Hoss
Re: DataImportHandler problem: Feeding the XPathEntityProcessor with the FieldReaderDataSource
A side note that might help: if I change the dataField from 'db.blob' to 'blob', this DIH stack emits no documents. On 10/5/09, Lance Norskog wrote: > I've added a unit test for the problem down below. It feeds document > field data into the XPathEntityProcessor via the > FieldReaderDataSource, and the XPath EP does not emit unpacked fields. > > Running this under the debugger, I can see the supplied StringReader, > with the XML string, being piped into the XPath EP. But somehow the > XPath EP does not pick it apart the right way. > > Here is the DIH configuration file separately. > > > > > > > > > > processor='XPathEntityProcessor' > forEach='/names' dataField='db.blob'> > > > > > > > Any ideas? > > --- > > package org.apache.solr.handler.dataimport; > > import static > org.apache.solr.handler.dataimport.AbstractDataImportHandlerTest.createMap; > import junit.framework.TestCase; > > import java.util.ArrayList; > import java.util.HashMap; > import java.util.List; > import java.util.Map; > > import org.apache.solr.common.SolrInputDocument; > import org.apache.solr.common.SolrInputField; > import org.apache.solr.handler.dataimport.TestDocBuilder.SolrWriterImpl; > import org.junit.Test; > > /* > * Demonstrate problem feeding XPathEntity from a FieldReaderDatasource > */ > > public class TestFieldReaderXPath extends TestCase { > static final String KISSINGER = "Henry"; > > static final String[][][] DBDOCS = { > {{"dbid", "1"}, {"blob", KISSINGER}}, > }; > > /* >* Receive a row from SQL and fetch a row from Solr - no value matching >* stolen from TestDocBuilder >* */ > > @Test > public void testSolrEmbedded() throws Exception { > try { > DataImporter di = new DataImporter(); > di.loadDataConfig(dih_config_FR_into_XP); > DataImporter.RequestParams rp = new > DataImporter.RequestParams(); > rp.command = "full-import"; > rp.requestParams = new HashMap(); > > DataConfig cfg = di.getConfig(); > DataConfig.Entity entity = cfg.document.entities.get(0); > List> l = new > ArrayList>(); > addDBDocuments(l); > MockDataSource.setIterator("select * from x", > l.iterator()); > entity.dataSrc = new MockDataSource(); > entity.isDocRoot = true; > SolrWriterImpl swi = new SolrWriterImpl(); > di.runCmd(rp, swi); > > assertEquals(1, swi.docs.size()); > SolrInputDocument doc = swi.docs.get(0); > SolrInputField field; > field = doc.getField("dbid"); > assertEquals(field.getValue().toString(), "1"); > field = doc.getField("blob"); > assertEquals(field.getValue().toString(), KISSINGER); > field = doc.getField("name"); > assertNotNull(field); > assertEquals(field.getValue().toString(), "Henry"); > } finally { > MockDataSource.clearCache(); > } > } > > > private void addDBDocuments(List> l) { > for(String[][] dbdoc: DBDOCS) { > l.add(createMap(dbdoc[0][0], dbdoc[0][1], dbdoc[1][0], > dbdoc[1][1])); > } > } > >String dih_config_FR_into_XP = "\r\n" + >" \r\n" + >" \r\n" + >" \r\n" + >" \r\n" > + >"\r\n" + >"\r\n" + >"\r\n" + >" processor='XPathEntityProcessor'\r\n" + >" forEach='/names' dataField='db.blob'>\r\n" + >" \r\n" + >"\r\n" + >" \r\n" + >" \r\n" + >"\r\n" >; > > > } > -- Lance Norskog goks...@gmail.com
Re: Different sort behavior on same code
Lucene's test for multi-valued fields is crude... it's essentially if the number of values (un-inverted term instances) becomes greater than the number of documents. -Yonik http://www.lucidimagination.com On Tue, Oct 6, 2009 at 3:04 PM, wojtekpia wrote: > > Hi, > > I'm running Solr version 1.3.0.2009.07.08.08.05.45 in 2 environments. I have > a field defined as: > > multiValued="true"/> > > The two environments have different data, but both have single and multi > valued entries for myDate. > > On one environment sorting by myDate works (sort seems to be by the 'last' > value if multi valued). > > On the other environment I get: > HTTP Status 500 - there are more terms than documents in field "myDate", but > it's impossible to sort on tokenized fields java.lang.RuntimeException: > there are more terms than documents in field > > I've read that I shouldn't sort by multi-valued fields, so my solution will > be to add a single-valued date field for sorting. But I don't understand why > my two environments behave differently, and it doesn't seem like the error > message makes sense (are date fields tokenized?). Any thoughts? > > Thanks, > > Wojtek > -- > View this message in context: > http://www.nabble.com/Different-sort-behavior-on-same-code-tp25774769p25774769.html > Sent from the Solr - User mailing list archive at Nabble.com. > >
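Concretely (illustrative numbers): with 3 documents where doc1 has myDate values {2009-01-01, 2009-02-01} and docs 2 and 3 have one value each, un-inverting the field yields 4 term instances against 3 documents, so the check trips and the sort throws. In an index where every document happens to carry a single myDate value, the counts stay equal and the same sort works - which is consistent with the two environments behaving differently on identical code.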
Re: Solr Timeouts
This specific thread was blocked for an hour? If so, I'd echo Lance... this is a local disk right? -Yonik http://www.lucidimagination.com On Mon, Oct 5, 2009 at 2:11 PM, Giovanni Fernandez-Kincade wrote: > I just grabbed another stack trace for a thread that has been similarly > blocking for over an hour. Notice that there is no Commit in this one: > > http-8080-Processor67 [RUNNABLE] CPU time: 1:02:05 > org.apache.lucene.index.TermBuffer.read(IndexInput, FieldInfos) > org.apache.lucene.index.SegmentTermEnum.next() > org.apache.lucene.index.SegmentTermEnum.scanTo(Term) > org.apache.lucene.index.TermInfosReader.get(Term, boolean) > org.apache.lucene.index.TermInfosReader.get(Term) > org.apache.lucene.index.SegmentTermDocs.seek(Term) > org.apache.lucene.index.DocumentsWriter.applyDeletes(IndexReader, int) > org.apache.lucene.index.DocumentsWriter.applyDeletes(SegmentInfos) > org.apache.lucene.index.IndexWriter.applyDeletes() > org.apache.lucene.index.IndexWriter.doFlushInternal(boolean, boolean) > org.apache.lucene.index.IndexWriter.doFlush(boolean, boolean) > org.apache.lucene.index.IndexWriter.flush(boolean, boolean, boolean) > org.apache.lucene.index.IndexWriter.updateDocument(Term, Document, Analyzer) > org.apache.lucene.index.IndexWriter.updateDocument(Term, Document) > org.apache.solr.update.DirectUpdateHandler2.addDoc(AddUpdateCommand) > org.apache.solr.update.processor.RunUpdateProcessor.processAdd(AddUpdateCommand) > org.apache.solr.handler.extraction.ExtractingDocumentLoader.doAdd(SolrContentHandler, > AddUpdateCommand) > org.apache.solr.handler.extraction.ExtractingDocumentLoader.addDoc(SolrContentHandler) > org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(SolrQueryRequest, > SolrQueryResponse, ContentStream) > org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(SolrQueryRequest, > SolrQueryResponse) > org.apache.solr.handler.RequestHandlerBase.handleRequest(SolrQueryRequest, > SolrQueryResponse) > org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(SolrQueryRequest, > SolrQueryResponse) > org.apache.solr.core.SolrCore.execute(SolrRequestHandler, SolrQueryRequest, > SolrQueryResponse) > org.apache.solr.servlet.SolrDispatchFilter.execute(HttpServletRequest, > SolrRequestHandler, SolrQueryRequest, SolrQueryResponse) > org.apache.solr.servlet.SolrDispatchFilter.doFilter(ServletRequest, > ServletResponse, FilterChain) > org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ServletRequest, > ServletResponse) > org.apache.catalina.core.ApplicationFilterChain.doFilter(ServletRequest, > ServletResponse) > org.apache.catalina.core.StandardWrapperValve.invoke(Request, Response) > org.apache.catalina.core.StandardContextValve.invoke(Request, Response) > org.apache.catalina.core.StandardHostValve.invoke(Request, Response) > org.apache.catalina.valves.ErrorReportValve.invoke(Request, Response) > org.apache.catalina.core.StandardEngineValve.invoke(Request, Response) > org.apache.catalina.connector.CoyoteAdapter.service(Request, Response) > org.apache.coyote.http11.Http11Processor.process(InputStream, OutputStream) > org.apache.coyote.http11.Http11BaseProtocol$Http11ConnectionHandler.processConnection(TcpConnection, > Object[]) > org.apache.tomcat.util.net.PoolTcpEndpoint.processSocket(Socket, > TcpConnection, Object[]) > org.apache.tomcat.util.net.LeaderFollowerWorkerThread.runIt(Object[]) > org.apache.tomcat.util.threads.ThreadPool$ControlRunnable.run() > java.lang.Thread.run() > > > -Original Message- > From: ysee...@gmail.com 
[mailto:ysee...@gmail.com] On Behalf Of Yonik Seeley > Sent: Monday, October 05, 2009 1:18 PM > To: solr-user@lucene.apache.org > Subject: Re: Solr Timeouts > > OK... next step is to verify that SolrCell doesn't have a bug that > causes it to commit. > I'll try and verify today unless someone else beats me to it. > > -Yonik > http://www.lucidimagination.com > > On Mon, Oct 5, 2009 at 1:04 PM, Giovanni Fernandez-Kincade > wrote: >> I'm fairly certain that all of the indexing jobs are calling SOLR with >> commit=false. They all construct the indexing URLs using a CLR function I >> wrote, which takes in a Commit parameter, which is always set to false. >> >> Also, I don't see any calls to commit in the Tomcat logs (whereas normally >> when I make a commit call I do). >> >> This suggests that Solr is doing it automatically, but the extract handler >> doesn't seem to be the problem: >> > class="org.apache.solr.handler.extraction.ExtractingRequestHandler" >> startup="lazy"> >> >> ignored_ >> fileData >> >> >> >> >> There is no external config file specified, and I don't see anything about >> commits here. >> >> I've tried setting up more detailed indexer logging but haven't been able to >> get it to work: >> true >> >> I tried relative and absolute paths, but no dice so far. >> >> Any other ideas? >> >> -Gio. >> >> -Original Message- >> From: ysee...@gmail.com [mail
Different sort behavior on same code
Hi,

I'm running Solr version 1.3.0.2009.07.08.08.05.45 in 2 environments. I have a field defined as:

<field name="myDate" type="date" indexed="true" stored="true" multiValued="true"/>

The two environments have different data, but both have single- and multi-valued entries for myDate.

On one environment, sorting by myDate works (the sort seems to be by the 'last' value if multi-valued).

On the other environment I get:
HTTP Status 500 - there are more terms than documents in field "myDate", but it's impossible to sort on tokenized fields java.lang.RuntimeException: there are more terms than documents in field

I've read that I shouldn't sort by multi-valued fields, so my solution will be to add a single-valued date field for sorting. But I don't understand why my two environments behave differently, and it doesn't seem like the error message makes sense (are date fields tokenized?). Any thoughts?

Thanks,

Wojtek
--
View this message in context: http://www.nabble.com/Different-sort-behavior-on-same-code-tp25774769p25774769.html
Sent from the Solr - User mailing list archive at Nabble.com.
Re: Solr Timeouts
Is this Java 1.5? There are known threading bugs in 1.5 that were fixed in Java 1.6. Also, there was one short series of 1.6 releases that wrote bogus Lucene index files. So, make sure you use the latest Java 1.6 release. Also, I hope this is a local disk. Some shops try running over NFS or Windows file sharing and this often does not work well. Lance On 10/6/09, Giovanni Fernandez-Kincade wrote: > Is it possible that deletions are triggering these commits? Some of the > documents that I'm making indexing requests for already exist in the index, > so they would result in deletions. I tried messing with some of these > parameters but I'm still running into the same problem: > > > > false > > 100 > > > > This is happening like every 30-40minutes and it's really hampering the > indexing progress... > > > -Original Message- > From: Giovanni Fernandez-Kincade [mailto:gfernandez-kinc...@capitaliq.com] > Sent: Monday, October 05, 2009 2:11 PM > To: solr-user@lucene.apache.org; yo...@lucidimagination.com > Subject: RE: Solr Timeouts > > I just grabbed another stack trace for a thread that has been similarly > blocking for over an hour. Notice that there is no Commit in this one: > > http-8080-Processor67 [RUNNABLE] CPU time: 1:02:05 > org.apache.lucene.index.TermBuffer.read(IndexInput, FieldInfos) > org.apache.lucene.index.SegmentTermEnum.next() > org.apache.lucene.index.SegmentTermEnum.scanTo(Term) > org.apache.lucene.index.TermInfosReader.get(Term, boolean) > org.apache.lucene.index.TermInfosReader.get(Term) > org.apache.lucene.index.SegmentTermDocs.seek(Term) > org.apache.lucene.index.DocumentsWriter.applyDeletes(IndexReader, int) > org.apache.lucene.index.DocumentsWriter.applyDeletes(SegmentInfos) > org.apache.lucene.index.IndexWriter.applyDeletes() > org.apache.lucene.index.IndexWriter.doFlushInternal(boolean, boolean) > org.apache.lucene.index.IndexWriter.doFlush(boolean, boolean) > org.apache.lucene.index.IndexWriter.flush(boolean, boolean, boolean) > org.apache.lucene.index.IndexWriter.updateDocument(Term, Document, Analyzer) > org.apache.lucene.index.IndexWriter.updateDocument(Term, Document) > org.apache.solr.update.DirectUpdateHandler2.addDoc(AddUpdateCommand) > org.apache.solr.update.processor.RunUpdateProcessor.processAdd(AddUpdateCommand) > org.apache.solr.handler.extraction.ExtractingDocumentLoader.doAdd(SolrContentHandler, > AddUpdateCommand) > org.apache.solr.handler.extraction.ExtractingDocumentLoader.addDoc(SolrContentHandler) > org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(SolrQueryRequest, > SolrQueryResponse, ContentStream) > org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(SolrQueryRequest, > SolrQueryResponse) > org.apache.solr.handler.RequestHandlerBase.handleRequest(SolrQueryRequest, > SolrQueryResponse) > org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(SolrQueryRequest, > SolrQueryResponse) > org.apache.solr.core.SolrCore.execute(SolrRequestHandler, SolrQueryRequest, > SolrQueryResponse) > org.apache.solr.servlet.SolrDispatchFilter.execute(HttpServletRequest, > SolrRequestHandler, SolrQueryRequest, SolrQueryResponse) > org.apache.solr.servlet.SolrDispatchFilter.doFilter(ServletRequest, > ServletResponse, FilterChain) > org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ServletRequest, > ServletResponse) > org.apache.catalina.core.ApplicationFilterChain.doFilter(ServletRequest, > ServletResponse) > org.apache.catalina.core.StandardWrapperValve.invoke(Request, Response) > 
org.apache.catalina.core.StandardContextValve.invoke(Request, Response) > org.apache.catalina.core.StandardHostValve.invoke(Request, Response) > org.apache.catalina.valves.ErrorReportValve.invoke(Request, Response) > org.apache.catalina.core.StandardEngineValve.invoke(Request, Response) > org.apache.catalina.connector.CoyoteAdapter.service(Request, Response) > org.apache.coyote.http11.Http11Processor.process(InputStream, OutputStream) > org.apache.coyote.http11.Http11BaseProtocol$Http11ConnectionHandler.processConnection(TcpConnection, > Object[]) > org.apache.tomcat.util.net.PoolTcpEndpoint.processSocket(Socket, > TcpConnection, Object[]) > org.apache.tomcat.util.net.LeaderFollowerWorkerThread.runIt(Object[]) > org.apache.tomcat.util.threads.ThreadPool$ControlRunnable.run() > java.lang.Thread.run() > > > -Original Message- > From: ysee...@gmail.com [mailto:ysee...@gmail.com] On Behalf Of Yonik Seeley > Sent: Monday, October 05, 2009 1:18 PM > To: solr-user@lucene.apache.org > Subject: Re: Solr Timeouts > > OK... next step is to verify that SolrCell doesn't have a bug that > causes it to commit. > I'll try and verify today unless someone else beats me to it. > > -Yonik > http://www.lucidimagination.com > > On Mon, Oct 5, 2009 at 1:04 PM, Giovanni Fernandez-Kincade > wrote: >> I'm fairly certain that all of the indexing jobs are calling SOLR with >> commit=false. They all construct
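The applyDeletes frames in both traces show buffered deletions being applied during a flush, which matches the suspicion above. One related knob from the example solrconfig.xml of this era caps how many deletions Solr buffers before applying them; a sketch (whether tuning it helps in this particular case is unverified):

<updateHandler class="solr.DirectUpdateHandler2">
  <!-- limit the number of deletions Solr will buffer during doc updating;
       lower values bound memory but pay the applyDeletes cost more often -->
  <maxPendingDeletes>100000</maxPendingDeletes>
</updateHandler>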
Re: Importing CSV file slow/crashes
Is it possible to narrow down what fields/field-types are causing the problems? Or perhaps profile and see what's taking up time compared to the older version? -Yonik http://www.lucidimagination.com On Tue, Oct 6, 2009 at 1:48 PM, Nasseam Elkarra wrote: > Hello Erick, > > Sorry about that. I'm using the CSV update handler. Uploading a local CSV > using the stream.file parameter. There are 94 fields and 36 copyFields. > > Thank you, > Nasseam > > On Oct 6, 2009, at 10:09 AM, Erick Erickson wrote: > >> Well, without some better idea of *how* you're doing the import, it's a >> little hard to say anything meaningful (hint, hint). >> Best >> Erick >> >> On Tue, Oct 6, 2009 at 1:06 PM, Nasseam Elkarra >> wrote: >> >>> Hello all, >>> >>> I had a dev build of 1.4 from 5/1/2009 and importing a 20K-row file took less >>> than a minute. Updating to the latest as of yesterday, the import is >>> really >>> slow and I had to cancel it after a half hour. This prevented me from >>> upgrading a few months ago as well. >>> >>> Any ideas as to the cause of this? >>> >>> Thank you, >>> >>> Nasseam Elkarra >>> http://bodukai.com/boutique/ >>> The fastest possible shopping experience. >>> >>> > >
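For reference, the setup being described boils down to an invocation like this (path hypothetical; stream.file only works with remote streaming enabled via enableRemoteStreaming="true" in solrconfig.xml's requestParsers element):

curl 'http://localhost:8983/solr/update/csv?stream.file=/tmp/products.csv&commit=true'

With 94 fields and 36 copyFields per row, per-document analysis cost is a plausible suspect, which is why profiling against the older build was suggested.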
Geo Coding Service
Hi, Can someone suggest a good geocoding service or software for commercial use? I want to find geocodes for a large collection of addresses. I'm looking for a good long-term service. Thanks Ram -- View this message in context: http://www.nabble.com/Geo-Coding-Service-tp25774277p25774277.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: TermsComponent or auto-suggest with filter
Have a look at a blog I posted on how to use EdgeNGrams to build an auto-suggest tool: http://www.lucidimagination.com/blog/2009/09/08/auto-suggest-from-popular-queries-using-edgengrams/ You could easily add filter queries to this approach. For example, the query used in the blog could add filter queries like this: http://localhost:8983/solr/select/?q=user_query:"i"&wt=json&fl=user_query&indent=on&echoParams=none&rows=10&sort=count desc&fq=yourField:yourQuery&fq=anotherField:anotherQuery -Jay http://www.lucidimagination.com On Tue, Oct 6, 2009 at 4:40 AM, R. Tan wrote: > Hello, > What's the best way to get auto-suggested terms/keywords that are filtered by > one or more fields? TermsComponent should have been the solution, but filters > are not supported. > > Thanks, > Rihaed >
Re: Importing CSV file slow/crashes
Hello Erick, Sorry about that. I'm using the CSV update handler. Uploading a local CSV using the stream.file parameter. There are 94 fields and 36 copyFields. Thank you, Nasseam On Oct 6, 2009, at 10:09 AM, Erick Erickson wrote: Well, without some better idea of *how* you're doing the import, it's a little hard to say anything meaningful (hint, hint). Best Erick On Tue, Oct 6, 2009 at 1:06 PM, Nasseam Elkarra wrote: Hello all, I had a dev build of 1.4 from 5/1/2009 and importing a 20K-row file took less than a minute. Updating to the latest as of yesterday, the import is really slow and I had to cancel it after a half hour. This prevented me from upgrading a few months ago as well. Any ideas as to the cause of this? Thank you, Nasseam Elkarra http://bodukai.com/boutique/ The fastest possible shopping experience.
Re: Solr Trunk Heap Space Issues
Okay - I'm sorry - serves me right for working sick. Now that I have put on my glasses and correctly tagged my two eclipse tests: It still appears that trunk likes to use more RAM. I switched both tests to one million iterations and watched the heap. The test from the build around may 5th (I promise :) ) regularly GC's down to about 70-80MB after a fair time of running. It doesn't appear to climb - keeps GC'ing back to 70-80 (after starting at by GC'ing down to 40 for a bit). The test from trunk, after a fair time of running, keeps GC'ing down to about 120-150MB - 150 at the end, slowly working its way up from 90-110 at the beginning. Don't know what that means yet - but it appears trunk likes to use a bit more RAM while indexing. Odd that its so much more because these docs are tiny: String[] fields = {"text","simple" ,"text","test" ,"text","how now brown cow" ,"text","what's that?" ,"text","radical!" ,"text","what's all this about, anyway?" ,"text","just how fast is this text indexing?" }; Mark Miller wrote: > Okay, I juggled the tests in eclipse and flipped the results. So they > make sense. > > Sorry - goose chase on this one. > > Yonik Seeley wrote: > >> I don't see this with trunk... I just tried TestIndexingPerformance >> with 1M docs, and it seemed to work fine. >> Memory use stabilized at 40MB. >> Most memory use was for indexing (not analysis). >> char[] topped out at 4.5MB >> >> -Yonik >> http://www.lucidimagination.com >> >> >> On Tue, Oct 6, 2009 at 12:31 PM, Mark Miller wrote: >> >> >>> Yeah - I was wondering about that ... not sure how these guys are >>> stacking up ... >>> >>> Yonik Seeley wrote: >>> >>> TestIndexingPerformance? What the heck... that's not even multi-threaded! -Yonik http://www.lucidimagination.com On Tue, Oct 6, 2009 at 12:17 PM, Mark Miller wrote: > Darnit - didn't finish that email. This is after running your old short > doc perf test for 10,000 iterations. You see the same thing with 1000 > iterations but much less pronounced eg gettin' worse with more iterations. > > Mark Miller wrote: > > > >> A little before and after. The before is around may 5th'is - the after >> is trunk. >> >> http://myhardshadow.com/memanalysis/before.png >> http://myhardshadow.com/memanalysis/after.png >> >> Mark Miller wrote: >> >> >> >> >>> Took a peak at the checkout around the time he says he's using. >>> >>> CharTokenizer appears to be holding onto much large char[] arrays now >>> than before. Same with snowball.Among - used to be almost nothing, now >>> its largio. >>> >>> The new TokenStream stuff appears to be clinging. Needs to find some >>> inner peace. >>> >>> > > > -- - Mark http://www.lucidimagination.com
Re: Importing CSV file slow/crashes
Well, without some better idea of *how* you're doing the import, it's a little hard to say anything meaningful (hint, hint). Best Erick On Tue, Oct 6, 2009 at 1:06 PM, Nasseam Elkarra wrote: > Hello all, > > I had a dev build of 1.4 from 5/1/2009 and importing a 20K-row file took less > than a minute. Updating to the latest as of yesterday, the import is really > slow and I had to cancel it after a half hour. This prevented me from > upgrading a few months ago as well. > > Any ideas as to the cause of this? > > Thank you, > > Nasseam Elkarra > http://bodukai.com/boutique/ > The fastest possible shopping experience. > >
Re: Solr Trunk Heap Space Issues
Okay, I juggled the tests in eclipse and flipped the results. So they make sense. Sorry - goose chase on this one. Yonik Seeley wrote: > I don't see this with trunk... I just tried TestIndexingPerformance > with 1M docs, and it seemed to work fine. > Memory use stabilized at 40MB. > Most memory use was for indexing (not analysis). > char[] topped out at 4.5MB > > -Yonik > http://www.lucidimagination.com > > > On Tue, Oct 6, 2009 at 12:31 PM, Mark Miller wrote: > >> Yeah - I was wondering about that ... not sure how these guys are >> stacking up ... >> >> Yonik Seeley wrote: >> >>> TestIndexingPerformance? >>> What the heck... that's not even multi-threaded! >>> >>> -Yonik >>> http://www.lucidimagination.com >>> >>> >>> >>> On Tue, Oct 6, 2009 at 12:17 PM, Mark Miller wrote: >>> >>> Darnit - didn't finish that email. This is after running your old short doc perf test for 10,000 iterations. You see the same thing with 1000 iterations but much less pronounced eg gettin' worse with more iterations. Mark Miller wrote: > A little before and after. The before is around may 5th'is - the after > is trunk. > > http://myhardshadow.com/memanalysis/before.png > http://myhardshadow.com/memanalysis/after.png > > Mark Miller wrote: > > > >> Took a peak at the checkout around the time he says he's using. >> >> CharTokenizer appears to be holding onto much large char[] arrays now >> than before. Same with snowball.Among - used to be almost nothing, now >> its largio. >> >> The new TokenStream stuff appears to be clinging. Needs to find some >> inner peace. >> -- - Mark http://www.lucidimagination.com
Re: Solr Trunk Heap Space Issues
I don't see this with trunk... I just tried TestIndexingPerformance with 1M docs, and it seemed to work fine. Memory use stabilized at 40MB. Most memory use was for indexing (not analysis). char[] topped out at 4.5MB -Yonik http://www.lucidimagination.com On Tue, Oct 6, 2009 at 12:31 PM, Mark Miller wrote: > Yeah - I was wondering about that ... not sure how these guys are > stacking up ... > > Yonik Seeley wrote: >> TestIndexingPerformance? >> What the heck... that's not even multi-threaded! >> >> -Yonik >> http://www.lucidimagination.com >> >> >> >> On Tue, Oct 6, 2009 at 12:17 PM, Mark Miller wrote: >> >>> Darnit - didn't finish that email. This is after running your old short >>> doc perf test for 10,000 iterations. You see the same thing with 1000 >>> iterations but much less pronounced eg gettin' worse with more iterations. >>> >>> Mark Miller wrote: >>> A little before and after. The before is around may 5th'is - the after is trunk. http://myhardshadow.com/memanalysis/before.png http://myhardshadow.com/memanalysis/after.png Mark Miller wrote: > Took a peak at the checkout around the time he says he's using. > > CharTokenizer appears to be holding onto much large char[] arrays now > than before. Same with snowball.Among - used to be almost nothing, now > its largio. > > The new TokenStream stuff appears to be clinging. Needs to find some > inner peace.
De-basing / re-basing docIDs, or how to effectively pass calculated values from a Scorer or Filter up to (Solr's) QueryComponent.process
(Posted here, per Yonik's suggestion) In the code I'm working with, I generate a cache of calculated values as a by-product within a Filter.getDocIdSet implementation (and within a Query-ized version of the filter and its Scorer method). These values are keyed off the IndexReader's docID values, since that's all that's accessible at that level. Ultimately, however, I need to be able to access these values much higher up in the stack (Solr's QueryComponent.process method), so that I can inject the dynamic values into the response as a fake field. The IDs available there, however, are for the entire index and not just relative to the current IndexReader. I'm still fairly new to Lucene and I've been scratching my head a bit trying to find a reliable way to map these values into the same space, without having to hack up too many base classes. I noticed that there was a related discussion at: http://issues.apache.org/jira/browse/LUCENE-1821?focusedCommentId=12745041#action_12745041 ... but also a bit of disagreement on the suggested strategies. Ideally, I'm also hoping there's a strategy that won't require me to hack up too much of the core product; subclassing IndexSearcher in the way suggested would basically require me to change all of the various SearchComponents I use in Solr, and that sounds like it'd end up a real maintenance nightmare. I was looking at the Collector class as a possible solution, since it has knowledge of the docBase, but it looks like I'd then need to change every derived collector the code ultimately uses, including the various anonymous Collectors in Solr, and that also looks like a fairly ghoulish solution. I suppose I'm being wishful, or lazy, but is there a reasonable and reliable way to do this without having to fork the core code? If not, any suggestions on the best strategy to accomplish this without adding too much overhead every time I want to up-rev the core Lucene and/or Solr code to the latest version? Thanks a ton, Aaron
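For what it's worth, a sketch of the usual re-basing trick against the Lucene 2.9 API, assuming the top-level reader is visible when the cache is built. It does not handle nested MultiReaders, and the class and field names are illustrative:

import java.util.IdentityHashMap;
import java.util.Map;
import org.apache.lucene.index.IndexReader;

public class DocBaseMap {
    private final Map<IndexReader, Integer> bases = new IdentityHashMap<IndexReader, Integer>();

    public DocBaseMap(IndexReader top) {
        IndexReader[] leaves = top.getSequentialSubReaders();
        if (leaves == null) {
            // atomic reader: a single "segment" starting at base 0
            bases.put(top, 0);
            return;
        }
        int base = 0;
        for (IndexReader leaf : leaves) {
            bases.put(leaf, base);
            base += leaf.maxDoc();
        }
    }

    // call from Filter.getDocIdSet(IndexReader) when caching a value
    public int toGlobal(IndexReader segmentReader, int segmentDocId) {
        return bases.get(segmentReader) + segmentDocId;
    }
}

Built once per top-level reader, this lets values cached under per-segment docIDs be re-keyed to the index-wide docIDs that QueryComponent.process sees, without subclassing IndexSearcher or the Collectors.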
Re: Highlighting exact phrases with Solr
Please try the hl.usePhraseHighlighter=true parameter. (It should be true by default if you use the latest nightly, but I think you don't.) Koji Antonio Calò wrote: Hi Guys I'm getting crazy with the highlighting in Solr. The problem is the following: when I submit an exact phrase query, I get the related results and the related snippets with highlighting. But I've noticed that the *single terms of the phrase are highlighted too*. Here is an example: If I run a search for "quick brown fox", I obtain the correct result with the doc which contains the phrase, but the snippets come back to me like this: The quick brown fox jumps over the lazy dog. The fox is a nice animal. Also, with some documents, only single terms are highlighted instead of the exact sentence, even if the exact phrase is contained in the document, i.e.: The fox is a nice animal. My understanding of highlighting is that if I search for an exact phrase, only the exact phrase should be highlighted. Here is an extract of my solrconfig.xml & schema.xml solrconfig.xml: 500 700 0.5 [-\w ,/\n\"']{20,200} true true schema.xml: Maybe I'm missing something, or my understanding of the highlighting feature is not correct. Any idea? As always, thanks for your support! Regards, Antonio
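For reference, a request of the shape Koji suggests (field name hypothetical):

http://localhost:8983/solr/select?q=content:"quick brown fox"&hl=true&hl.fl=content&hl.usePhraseHighlighter=true

With the phrase highlighter enabled, only terms that take part in the phrase match get highlighted, which is the behavior Antonio expects.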
RE: Solr Timeouts
Is it possible that deletions are triggering these commits? Some of the documents that I'm making indexing requests for already exist in the index, so they would result in deletions. I tried messing with some of these parameters but I'm still running into the same problem: false 100 This is happening like every 30-40minutes and it's really hampering the indexing progress... -Original Message- From: Giovanni Fernandez-Kincade [mailto:gfernandez-kinc...@capitaliq.com] Sent: Monday, October 05, 2009 2:11 PM To: solr-user@lucene.apache.org; yo...@lucidimagination.com Subject: RE: Solr Timeouts I just grabbed another stack trace for a thread that has been similarly blocking for over an hour. Notice that there is no Commit in this one: http-8080-Processor67 [RUNNABLE] CPU time: 1:02:05 org.apache.lucene.index.TermBuffer.read(IndexInput, FieldInfos) org.apache.lucene.index.SegmentTermEnum.next() org.apache.lucene.index.SegmentTermEnum.scanTo(Term) org.apache.lucene.index.TermInfosReader.get(Term, boolean) org.apache.lucene.index.TermInfosReader.get(Term) org.apache.lucene.index.SegmentTermDocs.seek(Term) org.apache.lucene.index.DocumentsWriter.applyDeletes(IndexReader, int) org.apache.lucene.index.DocumentsWriter.applyDeletes(SegmentInfos) org.apache.lucene.index.IndexWriter.applyDeletes() org.apache.lucene.index.IndexWriter.doFlushInternal(boolean, boolean) org.apache.lucene.index.IndexWriter.doFlush(boolean, boolean) org.apache.lucene.index.IndexWriter.flush(boolean, boolean, boolean) org.apache.lucene.index.IndexWriter.updateDocument(Term, Document, Analyzer) org.apache.lucene.index.IndexWriter.updateDocument(Term, Document) org.apache.solr.update.DirectUpdateHandler2.addDoc(AddUpdateCommand) org.apache.solr.update.processor.RunUpdateProcessor.processAdd(AddUpdateCommand) org.apache.solr.handler.extraction.ExtractingDocumentLoader.doAdd(SolrContentHandler, AddUpdateCommand) org.apache.solr.handler.extraction.ExtractingDocumentLoader.addDoc(SolrContentHandler) org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(SolrQueryRequest, SolrQueryResponse, ContentStream) org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(SolrQueryRequest, SolrQueryResponse) org.apache.solr.handler.RequestHandlerBase.handleRequest(SolrQueryRequest, SolrQueryResponse) org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(SolrQueryRequest, SolrQueryResponse) org.apache.solr.core.SolrCore.execute(SolrRequestHandler, SolrQueryRequest, SolrQueryResponse) org.apache.solr.servlet.SolrDispatchFilter.execute(HttpServletRequest, SolrRequestHandler, SolrQueryRequest, SolrQueryResponse) org.apache.solr.servlet.SolrDispatchFilter.doFilter(ServletRequest, ServletResponse, FilterChain) org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ServletRequest, ServletResponse) org.apache.catalina.core.ApplicationFilterChain.doFilter(ServletRequest, ServletResponse) org.apache.catalina.core.StandardWrapperValve.invoke(Request, Response) org.apache.catalina.core.StandardContextValve.invoke(Request, Response) org.apache.catalina.core.StandardHostValve.invoke(Request, Response) org.apache.catalina.valves.ErrorReportValve.invoke(Request, Response) org.apache.catalina.core.StandardEngineValve.invoke(Request, Response) org.apache.catalina.connector.CoyoteAdapter.service(Request, Response) org.apache.coyote.http11.Http11Processor.process(InputStream, OutputStream) org.apache.coyote.http11.Http11BaseProtocol$Http11ConnectionHandler.processConnection(TcpConnection, Object[]) 
org.apache.tomcat.util.net.PoolTcpEndpoint.processSocket(Socket, TcpConnection, Object[]) org.apache.tomcat.util.net.LeaderFollowerWorkerThread.runIt(Object[]) org.apache.tomcat.util.threads.ThreadPool$ControlRunnable.run() java.lang.Thread.run() -Original Message- From: ysee...@gmail.com [mailto:ysee...@gmail.com] On Behalf Of Yonik Seeley Sent: Monday, October 05, 2009 1:18 PM To: solr-user@lucene.apache.org Subject: Re: Solr Timeouts OK... next step is to verify that SolrCell doesn't have a bug that causes it to commit. I'll try and verify today unless someone else beats me to it. -Yonik http://www.lucidimagination.com On Mon, Oct 5, 2009 at 1:04 PM, Giovanni Fernandez-Kincade wrote: > I'm fairly certain that all of the indexing jobs are calling SOLR with > commit=false. They all construct the indexing URLs using a CLR function I > wrote, which takes in a Commit parameter, which is always set to false. > > Also, I don't see any calls to commit in the Tomcat logs (whereas normally > when I make a commit call I do). > > This suggests that Solr is doing it automatically, but the extract handler > doesn't seem to be the problem: > class="org.apache.solr.handler.extraction.ExtractingRequestHandler" > startup="lazy"> > > ignored_ > fileData > > > > > There is no external config file specified, and I don't see anything about > co
Re: Solr Trunk Heap Space Issues
Yeah - I was wondering about that ... not sure how these guys are stacking up ... Yonik Seeley wrote: > TestIndexingPerformance? > What the heck... that's not even multi-threaded! > > -Yonik > http://www.lucidimagination.com > > > > On Tue, Oct 6, 2009 at 12:17 PM, Mark Miller wrote: > >> Darnit - didn't finish that email. This is after running your old short >> doc perf test for 10,000 iterations. You see the same thing with 1000 >> iterations but much less pronounced eg gettin' worse with more iterations. >> >> Mark Miller wrote: >> >>> A little before and after. The before is around may 5th'is - the after >>> is trunk. >>> >>> http://myhardshadow.com/memanalysis/before.png >>> http://myhardshadow.com/memanalysis/after.png >>> >>> Mark Miller wrote: >>> >>> Took a peak at the checkout around the time he says he's using. CharTokenizer appears to be holding onto much large char[] arrays now than before. Same with snowball.Among - used to be almost nothing, now its largio. The new TokenStream stuff appears to be clinging. Needs to find some inner peace. Yonik Seeley wrote: > On Mon, Oct 5, 2009 at 4:54 PM, Jeff Newburn wrote: > > > > >> Ok we have done some more testing on this issue. When I only have the 1 >> core the reindex completes fine. However, when I added a second core >> with >> no documents it runs out of heap again. This time the heap was 322Mb of >> LRUCache. The 1 query that warms returns exactly 2 documents so I have >> no >> idea where the LRUCache is getting its information or what is even in >> there. >> >> >> >> > I guess the obvious thing to check would be the custom search component. > Does it access documents? I don't see how else the document cache > could self populate with so many entries (assuming it is the document > cache again). > > -Yonik > http://www.lucidimagination.com > > > > > > > > >> -- >> Jeff Newburn >> Software Engineer, Zappos.com >> jnewb...@zappos.com - 702-943-7562 >> >> >> >> >> >> >>> From: Yonik Seeley >>> Reply-To: >>> Date: Mon, 5 Oct 2009 13:32:32 -0400 >>> To: >>> Subject: Re: Solr Trunk Heap Space Issues >>> >>> On Mon, Oct 5, 2009 at 1:00 PM, Jeff Newburn >>> wrote: >>> >>> >>> >>> Ok I have eliminated all queries for warming and am still getting the heap space dump. Any ideas at this point what could be wrong? This seems like a huge increase in memory to go from indexing without issues to not being able to even with warming off. >>> Do you have any custom Analyzers, Tokenizers, TokenFilters? >>> Another change is that token streams are reused by caching in a >>> thread-local, so every thread in your server could potentially have a >>> copy of an analysis chain (token stream) per field that you have used. >>> This normally shouldn't be an issue since these will be small. Also, >>> how many unique fields do you have? >>> >>> -Yonik >>> http://www.lucidimagination.com >>> >>> >>> >>> >>> >>> >>> Jeff Newburn Software Engineer, Zappos.com jnewb...@zappos.com - 702-943-7562 > From: Jeff Newburn > Reply-To: > Date: Thu, 01 Oct 2009 08:41:18 -0700 > To: "solr-user@lucene.apache.org" > Subject: Solr Trunk Heap Space Issues > > I am trying to update to the newest version of solr from trunk as of > May > 5th. I updated and compiled from trunk as of yesterday (09/30/2009). > When > I try to do a full import I am receiving a GC heap error after > changing > nothing in the configuration files. Why would this happen in the most > recent versions but not in the version from a few months ago. The > stack > trace is below. 
> > Oct 1, 2009 8:34:32 AM > org.apache.solr.update.processor.LogUpdateProcessor > finish > INFO: {add=[166400, 166608, 166698, 166800, 166811, 167097, 167316, > 167353, > ...(83 more)]} 0 35991 > Oct 1, 2009 8:34:32 AM org.apache.solr.common.SolrException log > SEVERE: java.lang.OutOfMemoryError: GC overhead limit exceeded > at java.util.Arrays.copyOfRange(Arrays.java:3209) > at java.lang.String.(String.java:215) > at > com.ctc.wstx.util.TextBuffer.contentsAsString(TextBuffer.java:384) > at > com.ctc.wstx.s
Re: Solr Trunk Heap Space Issues
TestIndexingPerformance? What the heck... that's not even multi-threaded! -Yonik http://www.lucidimagination.com On Tue, Oct 6, 2009 at 12:17 PM, Mark Miller wrote: > Darnit - didn't finish that email. This is after running your old short > doc perf test for 10,000 iterations. You see the same thing with 1000 > iterations but much less pronounced eg gettin' worse with more iterations. > > Mark Miller wrote: >> A little before and after. The before is around may 5th'is - the after >> is trunk. >> >> http://myhardshadow.com/memanalysis/before.png >> http://myhardshadow.com/memanalysis/after.png >> >> Mark Miller wrote: >> >>> Took a peak at the checkout around the time he says he's using. >>> >>> CharTokenizer appears to be holding onto much large char[] arrays now >>> than before. Same with snowball.Among - used to be almost nothing, now >>> its largio. >>> >>> The new TokenStream stuff appears to be clinging. Needs to find some >>> inner peace. >>> >>> Yonik Seeley wrote: >>> >>> On Mon, Oct 5, 2009 at 4:54 PM, Jeff Newburn wrote: > Ok we have done some more testing on this issue. When I only have the 1 > core the reindex completes fine. However, when I added a second core with > no documents it runs out of heap again. This time the heap was 322Mb of > LRUCache. The 1 query that warms returns exactly 2 documents so I have no > idea where the LRUCache is getting its information or what is even in > there. > > > I guess the obvious thing to check would be the custom search component. Does it access documents? I don't see how else the document cache could self populate with so many entries (assuming it is the document cache again). -Yonik http://www.lucidimagination.com > -- > Jeff Newburn > Software Engineer, Zappos.com > jnewb...@zappos.com - 702-943-7562 > > > > > >> From: Yonik Seeley >> Reply-To: >> Date: Mon, 5 Oct 2009 13:32:32 -0400 >> To: >> Subject: Re: Solr Trunk Heap Space Issues >> >> On Mon, Oct 5, 2009 at 1:00 PM, Jeff Newburn wrote: >> >> >> >>> Ok I have eliminated all queries for warming and am still getting the >>> heap >>> space dump. Any ideas at this point what could be wrong? This seems >>> like a >>> huge increase in memory to go from indexing without issues to not being >>> able >>> to even with warming off. >>> >>> >>> >> Do you have any custom Analyzers, Tokenizers, TokenFilters? >> Another change is that token streams are reused by caching in a >> thread-local, so every thread in your server could potentially have a >> copy of an analysis chain (token stream) per field that you have used. >> This normally shouldn't be an issue since these will be small. Also, >> how many unique fields do you have? >> >> -Yonik >> http://www.lucidimagination.com >> >> >> >> >> >> >>> Jeff Newburn >>> Software Engineer, Zappos.com >>> jnewb...@zappos.com - 702-943-7562 >>> >>> >>> >>> >>> From: Jeff Newburn Reply-To: Date: Thu, 01 Oct 2009 08:41:18 -0700 To: "solr-user@lucene.apache.org" Subject: Solr Trunk Heap Space Issues I am trying to update to the newest version of solr from trunk as of May 5th. I updated and compiled from trunk as of yesterday (09/30/2009). When I try to do a full import I am receiving a GC heap error after changing nothing in the configuration files. Why would this happen in the most recent versions but not in the version from a few months ago. The stack trace is below. 
Oct 1, 2009 8:34:32 AM org.apache.solr.update.processor.LogUpdateProcessor finish INFO: {add=[166400, 166608, 166698, 166800, 166811, 167097, 167316, 167353, ...(83 more)]} 0 35991 Oct 1, 2009 8:34:32 AM org.apache.solr.common.SolrException log SEVERE: java.lang.OutOfMemoryError: GC overhead limit exceeded at java.util.Arrays.copyOfRange(Arrays.java:3209) at java.lang.String.(String.java:215) at com.ctc.wstx.util.TextBuffer.contentsAsString(TextBuffer.java:384) at com.ctc.wstx.sr.BasicStreamReader.getText(BasicStreamReader.java:821) at org.apache.solr.handler.XMLLoader.readDoc(XMLLoader.java:280) at org.apache.solr.handler.XMLLoader.processUpdate(XMLLoader.java:139) at org.apache.solr.handler.XMLLoader.load(XMLLoader.java:69) at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentSt reamHandlerBase.java:54) at org.apache.solr.handler.RequestHan
Re: Solr Trunk Heap Space Issues
Darnit - didn't finish that email. This is after running your old short doc perf test for 10,000 iterations. You see the same thing with 1000 iterations but much less pronounced eg gettin' worse with more iterations. Mark Miller wrote: > A little before and after. The before is around may 5th'is - the after > is trunk. > > http://myhardshadow.com/memanalysis/before.png > http://myhardshadow.com/memanalysis/after.png > > Mark Miller wrote: > >> Took a peak at the checkout around the time he says he's using. >> >> CharTokenizer appears to be holding onto much large char[] arrays now >> than before. Same with snowball.Among - used to be almost nothing, now >> its largio. >> >> The new TokenStream stuff appears to be clinging. Needs to find some >> inner peace. >> >> Yonik Seeley wrote: >> >> >>> On Mon, Oct 5, 2009 at 4:54 PM, Jeff Newburn wrote: >>> >>> >>> Ok we have done some more testing on this issue. When I only have the 1 core the reindex completes fine. However, when I added a second core with no documents it runs out of heap again. This time the heap was 322Mb of LRUCache. The 1 query that warms returns exactly 2 documents so I have no idea where the LRUCache is getting its information or what is even in there. >>> I guess the obvious thing to check would be the custom search component. >>> Does it access documents? I don't see how else the document cache >>> could self populate with so many entries (assuming it is the document >>> cache again). >>> >>> -Yonik >>> http://www.lucidimagination.com >>> >>> >>> >>> >>> >>> >>> -- Jeff Newburn Software Engineer, Zappos.com jnewb...@zappos.com - 702-943-7562 > From: Yonik Seeley > Reply-To: > Date: Mon, 5 Oct 2009 13:32:32 -0400 > To: > Subject: Re: Solr Trunk Heap Space Issues > > On Mon, Oct 5, 2009 at 1:00 PM, Jeff Newburn wrote: > > > >> Ok I have eliminated all queries for warming and am still getting the >> heap >> space dump. Any ideas at this point what could be wrong? This seems >> like a >> huge increase in memory to go from indexing without issues to not being >> able >> to even with warming off. >> >> >> > Do you have any custom Analyzers, Tokenizers, TokenFilters? > Another change is that token streams are reused by caching in a > thread-local, so every thread in your server could potentially have a > copy of an analysis chain (token stream) per field that you have used. > This normally shouldn't be an issue since these will be small. Also, > how many unique fields do you have? > > -Yonik > http://www.lucidimagination.com > > > > > > >> Jeff Newburn >> Software Engineer, Zappos.com >> jnewb...@zappos.com - 702-943-7562 >> >> >> >> >> >>> From: Jeff Newburn >>> Reply-To: >>> Date: Thu, 01 Oct 2009 08:41:18 -0700 >>> To: "solr-user@lucene.apache.org" >>> Subject: Solr Trunk Heap Space Issues >>> >>> I am trying to update to the newest version of solr from trunk as of May >>> 5th. I updated and compiled from trunk as of yesterday (09/30/2009). >>> When >>> I try to do a full import I am receiving a GC heap error after changing >>> nothing in the configuration files. Why would this happen in the most >>> recent versions but not in the version from a few months ago. The stack >>> trace is below. 
>>> >>> Oct 1, 2009 8:34:32 AM >>> org.apache.solr.update.processor.LogUpdateProcessor >>> finish >>> INFO: {add=[166400, 166608, 166698, 166800, 166811, 167097, 167316, >>> 167353, >>> ...(83 more)]} 0 35991 >>> Oct 1, 2009 8:34:32 AM org.apache.solr.common.SolrException log >>> SEVERE: java.lang.OutOfMemoryError: GC overhead limit exceeded >>> at java.util.Arrays.copyOfRange(Arrays.java:3209) >>> at java.lang.String.(String.java:215) >>> at >>> com.ctc.wstx.util.TextBuffer.contentsAsString(TextBuffer.java:384) >>> at >>> com.ctc.wstx.sr.BasicStreamReader.getText(BasicStreamReader.java:821) >>> at org.apache.solr.handler.XMLLoader.readDoc(XMLLoader.java:280) >>> at >>> org.apache.solr.handler.XMLLoader.processUpdate(XMLLoader.java:139) >>> at org.apache.solr.handler.XMLLoader.load(XMLLoader.java:69) >>> at >>> org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentSt >>> reamHandlerBase.java:54) >>> at >>> org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase. >>> java:131) >>> at org.apache.solr.core.SolrCore.execute(SolrCore.java:1316
Re: Solr Trunk Heap Space Issues
A little before and after. The before is around may 5th'is - the after is trunk. http://myhardshadow.com/memanalysis/before.png http://myhardshadow.com/memanalysis/after.png Mark Miller wrote: > Took a peak at the checkout around the time he says he's using. > > CharTokenizer appears to be holding onto much large char[] arrays now > than before. Same with snowball.Among - used to be almost nothing, now > its largio. > > The new TokenStream stuff appears to be clinging. Needs to find some > inner peace. > > Yonik Seeley wrote: > >> On Mon, Oct 5, 2009 at 4:54 PM, Jeff Newburn wrote: >> >> >>> Ok we have done some more testing on this issue. When I only have the 1 >>> core the reindex completes fine. However, when I added a second core with >>> no documents it runs out of heap again. This time the heap was 322Mb of >>> LRUCache. The 1 query that warms returns exactly 2 documents so I have no >>> idea where the LRUCache is getting its information or what is even in there. >>> >>> >> I guess the obvious thing to check would be the custom search component. >> Does it access documents? I don't see how else the document cache >> could self populate with so many entries (assuming it is the document >> cache again). >> >> -Yonik >> http://www.lucidimagination.com >> >> >> >> >> >> >>> -- >>> Jeff Newburn >>> Software Engineer, Zappos.com >>> jnewb...@zappos.com - 702-943-7562 >>> >>> >>> >>> From: Yonik Seeley Reply-To: Date: Mon, 5 Oct 2009 13:32:32 -0400 To: Subject: Re: Solr Trunk Heap Space Issues On Mon, Oct 5, 2009 at 1:00 PM, Jeff Newburn wrote: > Ok I have eliminated all queries for warming and am still getting the heap > space dump. Any ideas at this point what could be wrong? This seems > like a > huge increase in memory to go from indexing without issues to not being > able > to even with warming off. > > Do you have any custom Analyzers, Tokenizers, TokenFilters? Another change is that token streams are reused by caching in a thread-local, so every thread in your server could potentially have a copy of an analysis chain (token stream) per field that you have used. This normally shouldn't be an issue since these will be small. Also, how many unique fields do you have? -Yonik http://www.lucidimagination.com > Jeff Newburn > Software Engineer, Zappos.com > jnewb...@zappos.com - 702-943-7562 > > > > >> From: Jeff Newburn >> Reply-To: >> Date: Thu, 01 Oct 2009 08:41:18 -0700 >> To: "solr-user@lucene.apache.org" >> Subject: Solr Trunk Heap Space Issues >> >> I am trying to update to the newest version of solr from trunk as of May >> 5th. I updated and compiled from trunk as of yesterday (09/30/2009). >> When >> I try to do a full import I am receiving a GC heap error after changing >> nothing in the configuration files. Why would this happen in the most >> recent versions but not in the version from a few months ago. The stack >> trace is below. 
>> >> Oct 1, 2009 8:34:32 AM >> org.apache.solr.update.processor.LogUpdateProcessor >> finish >> INFO: {add=[166400, 166608, 166698, 166800, 166811, 167097, 167316, >> 167353, >> ...(83 more)]} 0 35991 >> Oct 1, 2009 8:34:32 AM org.apache.solr.common.SolrException log >> SEVERE: java.lang.OutOfMemoryError: GC overhead limit exceeded >> at java.util.Arrays.copyOfRange(Arrays.java:3209) >> at java.lang.String.(String.java:215) >> at com.ctc.wstx.util.TextBuffer.contentsAsString(TextBuffer.java:384) >> at >> com.ctc.wstx.sr.BasicStreamReader.getText(BasicStreamReader.java:821) >> at org.apache.solr.handler.XMLLoader.readDoc(XMLLoader.java:280) >> at >> org.apache.solr.handler.XMLLoader.processUpdate(XMLLoader.java:139) >> at org.apache.solr.handler.XMLLoader.load(XMLLoader.java:69) >> at >> org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentSt >> reamHandlerBase.java:54) >> at >> org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase. >> java:131) >> at org.apache.solr.core.SolrCore.execute(SolrCore.java:1316) >> at >> org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:3 >> 38) >> at >> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java: >> 241) >> at >> org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(Application >> FilterChain.java:235) >> at >> org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterCh >> ain.java:206) >> at >> org.apache.catalina.core.StandardWrappe
How to retrieve the index of a string within a field?
Hi, I have a field that holds a sentence. If the user types in a word or a phrase, how can I return the index of this word, or the index of the first word of the phrase? I tried to use &bf=ord..., but it does not work as I expected. Thanks. Elaine
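Solr itself doesn't return term positions in query responses in this era, and bf only influences scoring, it cannot surface a computed value in the response, which is why the ord() attempt didn't behave as hoped. One hedged option is to compute the position client-side from the stored field value. A small Java sketch (simple whitespace word counting; analysis differences between index and query are ignored):

// Returns the 0-based word index of the first occurrence of 'needle'
// in the stored field text, or -1 if it is absent.
static int wordIndex(String storedText, String needle) {
    String[] words = storedText.toLowerCase().split("\\s+");
    String target = needle.toLowerCase();
    for (int i = 0; i < words.length; i++) {
        if (words[i].equals(target)) {
            return i;
        }
    }
    return -1;
}

For a phrase, pass its first word as the needle and verify the remaining words follow, starting from the returned index.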
Re: Solr Trunk Heap Space Issues
Took a peek at the checkout around the time he says he's using. CharTokenizer appears to be holding onto much larger char[] arrays now than before. Same with snowball.Among - used to be almost nothing, now it's largio. The new TokenStream stuff appears to be clinging. Needs to find some inner peace. Yonik Seeley wrote: > On Mon, Oct 5, 2009 at 4:54 PM, Jeff Newburn wrote: > >> Ok we have done some more testing on this issue. When I only have the 1 >> core the reindex completes fine. However, when I added a second core with >> no documents it runs out of heap again. This time the heap was 322Mb of >> LRUCache. The 1 query that warms returns exactly 2 documents so I have no >> idea where the LRUCache is getting its information or what is even in there. >> > > I guess the obvious thing to check would be the custom search component. > Does it access documents? I don't see how else the document cache > could self populate with so many entries (assuming it is the document > cache again). > > -Yonik > http://www.lucidimagination.com > > > > > >> -- >> Jeff Newburn >> Software Engineer, Zappos.com >> jnewb...@zappos.com - 702-943-7562 >> >> >> >>> From: Yonik Seeley >>> Reply-To: >>> Date: Mon, 5 Oct 2009 13:32:32 -0400 >>> To: >>> Subject: Re: Solr Trunk Heap Space Issues >>> >>> On Mon, Oct 5, 2009 at 1:00 PM, Jeff Newburn wrote: >>> Ok I have eliminated all queries for warming and am still getting the heap space dump. Any ideas at this point what could be wrong? This seems like a huge increase in memory to go from indexing without issues to not being able to even with warming off. >>> Do you have any custom Analyzers, Tokenizers, TokenFilters? >>> Another change is that token streams are reused by caching in a >>> thread-local, so every thread in your server could potentially have a >>> copy of an analysis chain (token stream) per field that you have used. >>> This normally shouldn't be an issue since these will be small. Also, >>> how many unique fields do you have? >>> >>> -Yonik >>> http://www.lucidimagination.com >>> >>> >>> >>> Jeff Newburn Software Engineer, Zappos.com jnewb...@zappos.com - 702-943-7562 > From: Jeff Newburn > Reply-To: > Date: Thu, 01 Oct 2009 08:41:18 -0700 > To: "solr-user@lucene.apache.org" > Subject: Solr Trunk Heap Space Issues > > I am trying to update to the newest version of solr from trunk as of May > 5th. I updated and compiled from trunk as of yesterday (09/30/2009). > When > I try to do a full import I am receiving a GC heap error after changing > nothing in the configuration files. Why would this happen in the most > recent versions but not in the version from a few months ago? The stack > trace is below.
> > Oct 1, 2009 8:34:32 AM org.apache.solr.update.processor.LogUpdateProcessor > finish > INFO: {add=[166400, 166608, 166698, 166800, 166811, 167097, 167316, > 167353, > ...(83 more)]} 0 35991 > Oct 1, 2009 8:34:32 AM org.apache.solr.common.SolrException log > SEVERE: java.lang.OutOfMemoryError: GC overhead limit exceeded > at java.util.Arrays.copyOfRange(Arrays.java:3209) > at java.lang.String.(String.java:215) > at com.ctc.wstx.util.TextBuffer.contentsAsString(TextBuffer.java:384) > at > com.ctc.wstx.sr.BasicStreamReader.getText(BasicStreamReader.java:821) > at org.apache.solr.handler.XMLLoader.readDoc(XMLLoader.java:280) > at org.apache.solr.handler.XMLLoader.processUpdate(XMLLoader.java:139) > at org.apache.solr.handler.XMLLoader.load(XMLLoader.java:69) > at > org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentSt > reamHandlerBase.java:54) > at > org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase. > java:131) > at org.apache.solr.core.SolrCore.execute(SolrCore.java:1316) > at > org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:3 > 38) > at > org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java: > 241) > at > org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(Application > FilterChain.java:235) > at > org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterCh > ain.java:206) > at > org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.ja > va:233) > at > org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.ja > va:175) > at > org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:128 > ) > at > org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102 > ) > at > org.apache.catalina.core.StandardEngineValve.invoke(S
Problems with DIH XPath flatten
Hi all, I'm trying to set up DataImportHandler to index some XML documents available over web services. The XML includes both content and metadata, so for the indexable content, I'm trying to just index everything under the content tag: The result of this is that the title field gets populated and indexed (there are no child nodes of /document/kbml/kbq), but content does not get indexed at all. Since /document/kbml/body has many children, I expected that flatten="true" would store all of the body text in the field. Instead, it stores nothing at all. I've tried this with many combinations of transformers and flatten options, and the result is the same each time. Here are the relevant field declarations from the schema (the type="text" is just the one from the example's schema.xml). I have tried combinations here as well of stored= and multiValued=, with the same result each time. If it would help troubleshooting, I could send along some sample XML. I don't want to spam the list with an attachment unless it's necessary, though :) Thanks in advance for your help, Adam Foltzer
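The DIH configuration itself was stripped by the list archive, so here is a sketch of the shape being described (the xpaths come from the post; the entity attributes around them are illustrative):

<entity name="doc"
        processor="XPathEntityProcessor"
        url="http://example.com/kb/doc.xml"
        forEach="/document">
  <field column="title" xpath="/document/kbml/kbq"/>
  <field column="content" xpath="/document/kbml/body" flatten="true"/>
</entity>

flatten="true" is supposed to concatenate the text of all child nodes under the matched node into the column, which is what makes the empty result surprising here.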
Fwd: Questions about synonyms and highlighting
Hello, Even short/partial answers could satisfy me :) Nourredine. >Hi, >Can you please give me some answers to these questions: > >1 - How can I get the synonyms found for a keyword? > >I mean I search "foo" and I have in my synonyms.txt file the following tokens: "foo, foobar, fee" (with expand = true) >My index contains "foo" and "foobar". I want to display a message in a result page, on the header for example, with only the 2 matched tokens and not >"fee", >like "Results found for foo and foobar" > >2 - Can Solr run analysis on an index to extract associations between tokens? > >For example, if "foo" often appears with "fee" in a field, it will associate the 2 tokens. > >3 - Is it possible, and if so how can I configure Solr, to enable or disable highlighting for tokens with diacritics? > >Settings for "vélo" (all highlighted) ==> the two words "vélo" and >"velo" are highlighted >Settings for "vélo" ==> the first word "vélo" is highlighted but not >the second: "velo" > >4 - The same question for highlighting with lemmatisation? > >Settings for "manage" (all highlighted) ==> the two words "manage" and >"management" are highlighted >Settings for "manage" ==> the first word "manage" is highlighted but >not the second: "management" > > >Thanks in advance. > >Regards > >Nourredine.
Re: FACET_SORT_INDEX descending?
Reverse alphabetical ordering. The option "index" provides alphabetical ordering. I have a year_facet field that I would like to display in reverse order (most recent years first). Perhaps there is some other way to accomplish this. Thanks. --Gerald Chris Hostetter wrote: : Is there any value for the "f.my_year_facet.facet.sort" parameter that will : return the facet values in descending order? So far I only see "index" and : "count" as the choices. descending what? (count is descending order by count) -Hoss
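There is no built-in descending index sort at this point, so a common workaround is to fetch the values in index order and reverse them client-side. A SolrJ sketch (assumes an initialized SolrServer named server; the field name is from the post):

// request facet values in index (alphabetical) order, then flip them
SolrQuery q = new SolrQuery("*:*");
q.setFacet(true);
q.addFacetField("year_facet");
q.set("f.year_facet.facet.sort", "index");
QueryResponse rsp = server.query(q);
List<FacetField.Count> years =
    new ArrayList<FacetField.Count>(rsp.getFacetField("year_facet").getValues());
Collections.reverse(years); // most recent year first

Since years sort the same lexicographically and numerically, the reversed list is exactly "most recent first".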
Re: Creating cores using SolrJ
Yeah, that is missing. I've just committed a setter/getter for dataDir in the create command. Do this:

CoreAdminRequest.Create req = new CoreAdminRequest.Create();
req.setCoreName(name);
req.setInstanceDir(instanceDir);
req.setDataDir(dataDir);
return req.process(solrServer);

2009/10/6 Licinio Fernández Maurelo : > Hi there, > > I want to create cores using SolrJ, but I also want to create them in a given dataDir. How can I do this? Looking at the CoreAdminRequest methods, I only found: > > > - createCore(name, instanceDir, server) > - createCore(name, instanceDir, server, configFile, schemaFile) > > None of the above methods allows a dataDir param. > > Thx > > -- > Lici > -- - Noble Paul | Principal Engineer| AOL | http://aol.com
Creating cores using SolrJ
Hi there, I want to create cores using SolrJ, but I also want to create them in a given dataDir. How can I do this? Looking at the CoreAdminRequest methods, I only found: - createCore(name, instanceDir, server) - createCore(name, instanceDir, server, configFile, schemaFile) None of the above methods allows a dataDir param. Thx -- Lici
RE: using regular expressions in solr query
Any particular reason for the double quotes in the 2nd and 3rd query examples, but not the 1st, or is this just an artifact of your email? -Todd -Original Message- From: Rakhi Khatwani [mailto:rkhatw...@gmail.com] Sent: Tuesday, October 06, 2009 2:26 AM To: solr-user@lucene.apache.org Subject: using regular expressions in solr query Hi, I have an example in which I want to use a regular expression in my Solr query. For example, suppose I want to search on a sample: raakhi rajnish ninad goureya sheetal ritesh rajnish ninad goureya sheetal where my content field is of type text. When I type in QUERY: content:raa* RESPONSE: raakhi rajnish ninad goureya sheetal QUERY: content:"ra*" RESPONSE: 0 results Because of this I am facing problems with the next query: QUERY: content:"r* rajnish" RESPONSE: 0 results which should ideally return both the results. Any pointers? Regards, Raakhi
Re: search by some functionality
Hi Elaine, You can implement a function query in Solr in two ways: 1. Using the dismax request handler (with the bf parameter). 2. Using the standard request handler (with the _val_ pseudo-field). I recommend the first option. Sandeep Elaine Li wrote: > > Hi Sandeep, > > I read about this chapter before. It did not mention how to create my > own customized function. > Can you point me to some instructions? > Thanks. > Elaine > > -- View this message in context: http://www.nabble.com/search-by-some-functionality-tp25721533p25767741.html Sent from the Solr - User mailing list archive at Nabble.com.
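For illustration, with a hypothetical numeric field named popularity, the two flavors look like:

...&q=laptop&defType=dismax&qf=name&bf=log(popularity)
...&q=laptop _val_:"log(popularity)"

As for Elaine's follow-up about custom functions: Solr 1.4 adds a ValueSourceParser plugin point, registered in solrconfig.xml, for exposing your own functions; earlier versions have no supported hook and effectively require patching the built-in function parser.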
solr optimize - no space left on device
I am attempting to optimize a large shard on solr 1.4 and repeatedly get java.io.IOException: No space left on device. The shard, after a final commit before optimize, shows a size of about 192GB on a 400GB volume. I have successfully optimized 2 other shards that were similarly large without this problem on identical hardware boxes. Before the optimize I see: % df -B1 . Filesystem 1B-blocks Used Available Use% Mounted on /dev/mapper/internal-solr--build--2 435440427008 205681356800 225335255040 48% /l/solrs/build-2 slurm-4:/l/solrs/build-2/data/index % du -B1 205441486848 . There's a slight discrepancy between the du and df which appears to be orphaned inodes. But the du says there should be enough space to handle the doubling in size during optimization. However, for the second time we run out of space and the du and df are wildly different at that point and the volume is at 100% % df -B1 . Filesystem 1B-blocks Used Available Use% Mounted on /dev/mapper/internal-solr--build--2 435440427008 430985760768 30851072 100% /l/solrs/build-2 slurm-4:/l/solrs/build-2/data/index % du -B1 252552298496. At this point it appears orphaned inodes are consuming space and not being freed-up. Any clue as to whether this is a lucene bug a solr bug or some other problem. Error traces follow. Thanks! Phil --- Oct 6, 2009 2:12:37 AM org.apache.solr.update.processor.LogUpdateProcessor finish INFO: {} 0 9110523 Oct 6, 2009 2:12:37 AM org.apache.solr.common.SolrException log SEVERE: java.io.IOException: background merge hit exception: _ojl:C151080 _169w:C141302 _1j36:C80405 _1j35:C2043 _1j34:C192 into _1j37 [optimize] at org.apache.lucene.index.IndexWriter.optimize(IndexWriter.java:2737) at org.apache.lucene.index.IndexWriter.optimize(IndexWriter.java:2658) at org.apache.solr.update.DirectUpdateHandler2.commit(DirectUpdateHandler2.java:401) at org.apache.solr.update.processor.RunUpdateProcessor.processCommit(RunUpdateProcessorFactory.java:85) at org.apache.solr.handler.XMLLoader.processUpdate(XMLLoader.java:168) at org.apache.solr.handler.XMLLoader.load(XMLLoader.java:69) at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:54) at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131) at org.apache.solr.core.SolrCore.execute(SolrCore.java:1299) at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:338) at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:241) at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:215) at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:188) at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:213) at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:172) at org.apache.catalina.valves.AccessLogValve.invoke(AccessLogValve.java:548) at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127) at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:117) at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:108) at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:174) at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:875) at org.apache.coyote.http11.Http11BaseProtocol$Http11ConnectionHandler.processConnection(Http11BaseProtocol.java:665) at 
org.apache.tomcat.util.net.PoolTcpEndpoint.processSocket(PoolTcpEndpoint.java:528) at org.apache.tomcat.util.net.LeaderFollowerWorkerThread.runIt(LeaderFollowerWorkerThread.java:81) at org.apache.tomcat.util.threads.ThreadPool$ControlRunnable.run(ThreadPool.java:689) at java.lang.Thread.run(Thread.java:619) Caused by: java.io.IOException: No space left on device at java.io.RandomAccessFile.writeBytes(Native Method) at java.io.RandomAccessFile.write(RandomAccessFile.java:466) at org.apache.lucene.store.FSDirectory$FSIndexOutput.flushBuffer(FSDirectory.java:719) at org.apache.lucene.store.BufferedIndexOutput.flushBuffer(BufferedIndexOutput.java:96) at org.apache.lucene.store.BufferedIndexOutput.flush(BufferedIndexOutput.java:85) at org.apache.lucene.store.BufferedIndexOutput.seek(BufferedIndexOutput.java:124) at org.apache.lucene.store.FSDirectory$FSIndexOutput.seek(FSDirectory.java:744) at org.apache.lucene.index.TermInfosWriter.close(TermInfosWriter.java:220) at org.apache.lucene.index.FormatPostingsFieldsWriter.finish(FormatPostingsFieldsWriter.java:70) at org.apache.lucene.index.SegmentMerger.mergeTerms(SegmentMerger.java:493) at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:140) at org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWri
TermsComponent or auto-suggest with filter
Hello, What's the best way to get auto-suggested terms/keywords that are filtered by one or more fields? TermsComponent should have been the solution, but filters are not supported. Thanks, Rihaed
ISOLatin1AccentFilter before or after Snowball?
Hi all, from reading through previous posts on that subject, it seems like the accent filter has to come before the snowball filter. I'd just like to make sure this is so. If it is the case, I'm wondering whether snowball filters for, e.g., French process accented language correctly at all, or whether they remove accents anyway... Or whether accents should be removed whenever making use of snowball filters. And also: it really is meant to take UTF-8 as input, even though it is named ISOLatin1AccentFilter, isn't it? Thanks in advance! Chantal -- Chantal Ackermann
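For concreteness, the ordering being asked about looks like this in schema.xml (a sketch; whether the French Snowball stemmer does better with accents intact is exactly the open question here):

<fieldType name="text_fr" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- accent folding placed before stemming, per the earlier posts -->
    <filter class="solr.ISOLatin1AccentFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="French"/>
  </analyzer>
</fieldType>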
Re: Re : Re : wildcard searches
You are right, Angel. The problem would still persist. Why don't you consider putting the original data in some field? While querying, you can query on both the fields - the analyzed and the original one. Wildcard queries will not give you any results from the analyzed field but would match the data in your original field. Works? Cheers Avlesh On Tue, Oct 6, 2009 at 2:27 PM, Angel Ice wrote: > Ah yes, got it. > But I'm not sure this will solve my problem. > Because I'm also using the IsoLatin1 filter, which removes accented > characters. > So I will have the same problem with accented characters, because the > original token is not stored with this filter. > > Laurent > > From: Avlesh Singh > To: solr-user@lucene.apache.org > Sent: Tuesday, October 6, 2009, 10:41:56 > Subject: Re: Re: wildcard searches > > You are processing your tokens in the filter that you wrote. I am assuming > it is the first filter being applied and removes the character 'h' from > tokens. When you are doing that, you can preserve the original token in the > same field as well. Because as of now, you are simply removing the > character. Subsequent filters don't even know that there was an 'h' > character in the original token. > > Since wildcard queries are not analyzed, the 'h' character in the query > "hésita*" does NOT get removed during query time. This means that unless the > original token was preserved in the field it wouldn't find any matches. > > This helps? > > Cheers > Avlesh > > On Tue, Oct 6, 2009 at 2:02 PM, Angel Ice wrote: > > > Hi. > > > > Thanks for your answers Christian and Avlesh. > > > > But I don't understand what you mean by: > > "If you want to enable wildcard queries, preserving the original token > > (while processing each token in your filter) might work." > > > > Could you explain this point please? > > > > Laurent > > > > From: Avlesh Singh > > To: solr-user@lucene.apache.org > > Sent: Monday, October 5, 2009, 20:30:54 > > Subject: Re: wildcard searches > > > > Zambrano is right, Laurent. The analyzers for a field are not invoked for > > wildcard queries. Your custom filter is not even getting executed at > > query-time. > > If you want to enable wildcard queries, preserving the original token > > (while processing each token in your filter) might work. > > > > Cheers > > Avlesh > > > > On Mon, Oct 5, 2009 at 10:39 PM, Angel Ice wrote: > > > Hi everyone, > > > > I have a little question regarding the search engine when a wildcard > > character is used in the query. > > Let's take the following example: > > > > - I have sent in indexation the word Hésitation (with an accent on the "e") > > - The filters applied to the field that will handle this word result in > > the indexation of "esit" (the mute H is suppressed (home-made filter), the > > accent too (IsoLatin1Filter), and the SnowballPorterFilter suppresses the > > "ation"). > > > > When I search for "hesitation", "esitation", "ésitation", etc., all is OK; > > the document is returned. > > But as soon as I use a wildcard, like "hésita*", the document is not > > returned. In fact, I have to put the wildcard in a manner that matches the > > indexed term exactly (example "esi*"). > > > > Does the search engine apply the filters to the word that prefixes the > > wildcard? Or does it use this prefix verbatim? > > > > Thanks for your help. > > > > Laurent
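A schema-level sketch of this suggestion (all names illustrative): keep a second copy of the field whose analysis is just tokenization plus lowercasing, and send wildcard queries at both fields. Note that wildcard terms are not analyzed, so the client should lowercase them before querying:

<field name="content" type="text_fr" indexed="true" stored="true"/>
<!-- same tokens, but no accent stripping, h-removal, or stemming -->
<field name="content_orig" type="text_ws_lower" indexed="true" stored="false"/>
<copyField source="content" dest="content_orig"/>

A query like content:esit* OR content_orig:hésita* would then match the document in Laurent's example, since content_orig still holds "hésitation".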
using regular expressions in solr query
Hi, I have an example in which I want to use a regular expression in my Solr query. For example, suppose I want to search on a sample: raakhi rajnish ninad goureya sheetal ritesh rajnish ninad goureya sheetal where my content field is of type text. When I type in QUERY: content:raa* RESPONSE: raakhi rajnish ninad goureya sheetal QUERY: content:"ra*" RESPONSE: 0 results Because of this I am facing problems with the next query: QUERY: content:"r* rajnish" RESPONSE: 0 results which should ideally return both the results. Any pointers? Regards, Raakhi
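Neither the standard nor the dismax query parser of this era supports wildcards inside a quoted phrase, which is why the second and third queries return nothing rather than erroring. A hedged approximation that drops the adjacency requirement:

content:r* AND content:rajnish

This matches both sample documents, but it would also match documents where the r* term and "rajnish" are far apart.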
Re: Need "OR" in DisMax Query
On 05.10.2009 at 20:36, David Giffin wrote:
> Maybe I'm missing something, but I can't seem to get the dismax request handler to perform an OR query. It appears that OR is removed by the stop words.

Hi David,

It's not the stop words. Dismax simply doesn't do any boolean operations; the only things you can do are using +searchWord and -searchWord, or changing to the standard request handler.

best
Ingo

--
Ingo Renner
TYPO3 Core Developer, Release Manager TYPO3 4.2
Apache Solr for TYPO3: http://www.typo3-solr.com
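A sketch of the two options Ingo mentions, assuming Solr 1.3 or later (the query words here are only examples). With dismax, the mm ("minimum should match") parameter set to 1 requires only one of the query words to match, which behaves like an OR across all terms:

  q=ipod+phone&defType=dismax&mm=1

For genuine per-clause boolean control, the standard request handler accepts Lucene query syntax directly:

  q=name:(ipod OR phone)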
Re : Re : wildcard searches
Ah yes, got it.
But I'm not sure this will solve my problem, because I'm also using the IsoLatin1 filter, which removes accented characters. So I will have the same problem with accented characters, since the original token is not stored with this filter.

Laurent

From: Avlesh Singh
To: solr-user@lucene.apache.org
Sent: Tuesday, 6 October 2009, 10:41:56
Subject: Re: Re : wildcard searches

You are processing your tokens in the filter that you wrote. I am assuming it is the first filter being applied and that it removes the character 'h' from tokens. When you are doing that, you can preserve the original token in the same field as well, because as of now you are simply removing the character. Subsequent filters don't even know that there was an 'h' character in the original token.

Since wildcard queries are not analyzed, the 'h' character in the query "hésita*" does NOT get removed at query time. This means that unless the original token was preserved in the field, it wouldn't find any matches.

This helps?

Cheers
Avlesh

On Tue, Oct 6, 2009 at 2:02 PM, Angel Ice wrote:
> [...]
Re: Date field being null
> I am defining a field:

indexed="false" and stored="false"? Really? This field is as good as nothing. What would you use it for?

> Can I have a null for such a field?

Yes, you can. Moreover, as you have sortMissingLast="true" specified in your field type definition, documents having null values in this field will appear at the end for any kind of sorting.

Cheers
Avlesh

On Tue, Oct 6, 2009 at 1:16 PM, Pooja Verlani wrote:
> Hi,
> My fieldtype definition is like:
> [field type definition stripped by the mailing list; it included sortMissingLast="true" and omitNorms="true"]
>
> I am defining a field:
> [field definition stripped by the mailing list]
>
> Can I have a null for such a field? Or is there a way I can use it as a date field only if the value is null? I can't put the field as a string type, as I have to apply a recency sort and some filters on that field.
> Regards,
> Pooja
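A minimal sketch of what Avlesh describes (the type and field names here are hypothetical, since the original XML was stripped by the mailing list): a date field whose type declares sortMissingLast="true", so documents that omit the field sort after all documents that have a value, for both ascending and descending sorts:

  <fieldType name="rdate" class="solr.DateField" sortMissingLast="true" omitNorms="true"/>
  <field name="publish_date" type="rdate" indexed="true" stored="true"/>

  Recency sort, with null dates last:
  q=*:*&sort=publish_date desc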
Re: Re : wildcard searches
You are processing your tokens in the filter that you wrote. I am assuming it is the first filter being applied and that it removes the character 'h' from tokens. When you are doing that, you can preserve the original token in the same field as well, because as of now you are simply removing the character. Subsequent filters don't even know that there was an 'h' character in the original token.

Since wildcard queries are not analyzed, the 'h' character in the query "hésita*" does NOT get removed at query time. This means that unless the original token was preserved in the field, it wouldn't find any matches.

This helps?

Cheers
Avlesh

On Tue, Oct 6, 2009 at 2:02 PM, Angel Ice wrote:
> Hi.
>
> Thanks for your answers Christian and Avlesh.
>
> But I don't understand what you mean by:
> "If you want to enable wildcard queries, preserving the original token (while processing each token in your filter) might work."
>
> Could you explain this point please?
>
> Laurent
>
> [...]
Re : wildcard searches
Hi.

Thanks for your answers Christian and Avlesh.

But I don't understand what you mean by:
"If you want to enable wildcard queries, preserving the original token (while processing each token in your filter) might work."

Could you explain this point please?

Laurent

From: Avlesh Singh
To: solr-user@lucene.apache.org
Sent: Monday, 5 October 2009, 20:30:54
Subject: Re: wildcard searches

Zambrano is right, Laurent. The analyzers for a field are not invoked for wildcard queries. Your custom filter is not even getting executed at query time.
If you want to enable wildcard queries, preserving the original token (while processing each token in your filter) might work.

Cheers
Avlesh

On Mon, Oct 5, 2009 at 10:39 PM, Angel Ice wrote:
> Hi everyone,
>
> I have a little question regarding the search engine when a wildcard character is used in the query. Let's take the following example:
>
> - I have sent for indexing the word "Hésitation" (with an accent on the "e").
> - The filters applied to the field that handles this word result in the indexation of "esit" (the mute H is suppressed by a home-made filter, the accent by the IsoLatin1 filter, and the SnowballPorterFilter suppresses the "ation").
>
> When I search for "hesitation", "esitation", "ésitation" etc., all is OK, the document is returned. But as soon as I use a wildcard, like "hésita*", the document is not returned. In fact, I have to put the wildcard in a manner that matches the indexed term exactly (for example "esi*").
>
> Does the search engine apply the filters to the word that prefixes the wildcard? Or does it use this prefix verbatim?
>
> Thanks for your help.
>
> Laurent
Date field being null
Hi,
My fieldtype definition is like:
[field type definition stripped by the mailing list]

I am defining a field:
[field definition stripped by the mailing list]

Can I have a null for such a field? Or is there a way I can use it as a date field only if the value is null? I can't put the field as a string type, as I have to apply a recency sort and some filters on that field.

Regards,
Pooja
solr reporting tool adapter
Hi,
I wanted to query Solr and send the output to some reporting tool. Has anyone done something like that? Moreover, which reporting tool is good? Any suggestions?

Regards,
Raakhi
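One simple way to feed Solr results into an external reporting tool, assuming the stock response writers (the URL, fields and parameters here are only an example), is to request the output in a machine-readable format and let the tool consume it:

  http://localhost:8983/solr/select?q=*:*&wt=json&rows=100&fl=id,name

The wt parameter selects the response format (xml, json, php, etc. in Solr 1.4), and fl limits the returned fields to the columns the report actually needs.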