RE: Facet sort numeric values

2012-08-15 Thread Aleksander Akerø
Oh brilliant, I didn't think it was possible to configure it that way.

Had made my own untokenized type, so I guess it would be better for me to
control the datatype this way.

Bonus question (hehe): What if these field values also contain alphanumeric
values? E.g. Alpha, Bravo, Omega, ... 
How would this affect the sorting? I guess the TrieIntField is not
applicable then.

Aleksander Akerø
@ Gurusoft AS
Mobil: 944 89 054 
QR-Code (Kontaktinfo)

-Original Message-
From: Chris Hostetter [mailto:hossman_luc...@fucit.org] 
Sent: 14. august 2012 17:45
To: solr-user@lucene.apache.org
Subject: Re: Facet sort numeric values


: I'm having a problem with sorting facets. I am using the facet.sort=index
: parameter and it works fine for most of the values.
...
: Example, when sorting 15, 6, 23, 7, 10, 90 it sorts like this: 10, 15,
: 23, 6, 7, 90, but what I wanted was 6, 7, 10, 15, 23, 90.

what field type are you using?

If you use one of the Trie___Field types then the facet values should sort
exactly as you describe.

<fieldType name="int" class="solr.TrieIntField" precisionStep="0" positionIncrementGap="0"/>
<fieldType name="float" class="solr.TrieFloatField" precisionStep="0" positionIncrementGap="0"/>
<fieldType name="long" class="solr.TrieLongField" precisionStep="0" positionIncrementGap="0"/>
<fieldType name="double" class="solr.TrieDoubleField" precisionStep="0" positionIncrementGap="0"/>



-Hoss



Fwd: Solr 3.5 result grouping is failing

2012-08-15 Thread chethan
Hi,

I'm trying to group (field collapse) my search results on a field called
site. The schema says that it has to be indexed: *<field name="site"
type="string" stored="false" indexed="true"/>.*
But when I try to query the results with *group.field=site&group.limit=100,
*I see only 1 group of results being returned. And the group value is null.
This seems to work on another solr instance which only has a few documents
indexed. Seems to fail on bigger indexes. Help is appreciated.

Thanks
Chethan


Sent this message again as it seemed to bounce the first time.


Re: offsets issues with multiword synonyms since LUCENE_33

2012-08-15 Thread Konrad Lötzsch

I don't know whether this was discussed previously,
but if the SynonymFilter breaks up your synonyms (which
might be the default), the parts of the synonyms get new
word positions. You could use a KeywordTokenizer to avoid that behaviour:


<filter class="solr.SynonymFilterFactory"
        synonyms="Synonyms.txt"
        ignoreCase="true"
        expand="false"
        tokenizerFactory="solr.KeywordTokenizerFactory"
/>

with regards,
konrad.

On 14.08.2012 18:51, Marc Sturlese wrote:

Well an example would be:
synonyms.txt:
huge,big size

Then I have the docs:
1- The huge fox attacks first
2- The big size fox attacks first

Then if I query for huge, the highlights for each document are:

1- The <strong>huge</strong> <strong>fox</strong> attacks first
2- The <strong>big size</strong> fox attacks first

The analyzer looks like this:
<fieldType name="sy_text" class="solr.TextField" positionIncrementGap="100">
   <analyzer type="index">
     <tokenizer class="solr.StandardTokenizerFactory"/>
     <filter class="solr.LowerCaseFilterFactory"/>
     <filter class="solr.ASCIIFoldingFilterFactory"/>
     <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
ignoreCase="false" expand="true" />
   </analyzer>
   <analyzer type="query">
     <tokenizer class="solr.StandardTokenizerFactory"/>
     <filter class="solr.LowerCaseFilterFactory"/>
     <filter class="solr.ASCIIFoldingFilterFactory"/>
     <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
ignoreCase="false" expand="true" />
   </analyzer>
 </fieldType>

This was working with a previous version of Solr (couldn't make it work with
3.6, 4-alpha nor 4-beta).



--
View this message in context: 
http://lucene.472066.n3.nabble.com/offsets-issues-with-multiword-synonyms-since-LUCENE-33-tp4001195p4001213.html
Sent from the Solr - User mailing list archive at Nabble.com.




Re: Query regarding dataimporthandler

2012-08-15 Thread Shalin Shekhar Mangar
There is no way to do it within DataImportHandler but you can configure
autoCommit in solrconfig.xml to automatically commit pending updates by
time or number of documents.
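
For example, a minimal solrconfig.xml sketch (the maxDocs/maxTime values
here are only illustrative; tune them to your indexing load):

<updateHandler class="solr.DirectUpdateHandler2">
  <autoCommit>
    <maxDocs>10000</maxDocs> <!-- commit after this many pending docs -->
    <maxTime>60000</maxTime> <!-- or after this many milliseconds -->
  </autoCommit>
</updateHandler>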

On Tue, Aug 14, 2012 at 4:11 PM, ravicv ravichandra...@gmail.com wrote:

 Hi,

 Is there any way for intermediate commits while indexing data using
 dataimport handler?
 I am using 1.4 solr version.

 My problem is :

 Sometimes while indexing huge data (about 4 GB), after indexing, while the
 commit process is going on, if any user searches the data Solr sometimes
 throws a heap space error.

 My data before the commit operation is nearly 8 GB, but after both
 commit
 and optimize are done it reduces to 4 GB. I am using the full-import option.

 Any ideas?

 Thanks,
 ravichandra



 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/Query-regarding-dataimporthandler-tp4001098.html
 Sent from the Solr - User mailing list archive at Nabble.com.




-- 
Regards,
Shalin Shekhar Mangar.


Re: scanned pdf with solr cell

2012-08-15 Thread Ahmet Arslan
 When I send a scanned pdf to extraction request
 handler, below icon appears in my Dock.
 
 http://tinypic.com/r/2mpmo7o/6
 http://tinypic.com/r/28ukxhj/6

I found that text-extractable pdf files trigger the above weird icon too.

curl "http://localhost:8983/solr/update/extract?literal.id=solr-word&commit=true" -F 
myfile=@solr-word.pdf

I wrote a standalone java program using tika. When text-extracting from all 
kinds of pdf files, that weird icon pops up :)

I will ask tika-ML about this.

 AutoDetectParser _autoParser = new AutoDetectParser();
 File file = new File("solr-word.pdf");
 BodyContentHandler textHandler = new BodyContentHandler();
 Metadata metadata = new Metadata();
 ParseContext context = new ParseContext();
 InputStream input = new FileInputStream(file);

 _autoParser.parse(input, textHandler, metadata, context);

 System.out.println("text : " + textHandler.toString());
 input.close();
 while (true) { }


Re: scanned pdf with solr cell

2012-08-15 Thread Paul Libbrecht
Ahmet,

the dock icon appears when AWT starts, e.g. when a font is loaded.
You can prevent it using the headless mode but this is likely to trigger an 
exception.
Same if your user is not UI-logged-in.

hope it helps.

Paul

On 15 August 2012 at 01:30, Ahmet Arslan wrote:

 Hi All,
 
 I have set of rich documents. Some of them are scanned pdf files. When I send 
 a scanned pdf to extraction request handler, below icon appears in my Dock.
 
 http://tinypic.com/r/2mpmo7o/6
 http://tinypic.com/r/28ukxhj/6
 
 Does anyone know what this is?
 
 curl "http://localhost:8983/solr/documents/update/extract?literal.ID=ticaret_sicil_gazetesi&literal.URL=ticaret_sicil_gazetesi&commit=true"
  -F myfile=@ticaret_sicil_gazetesi.pdf
 
 No exception is seen on solr logs. Doc is indexed, content field is: 
 
 xmpTPg:NPages 4   Creation-Date 2011-08-24T13:03:16Z   stream_source_info 
 myfile   created Wed Aug 24 16:03:16 EEST 2011   stream_content_type 
 application/octet-stream   stream_size 2302337   producer Image Recognition 
 Integrated Systems, Autoformat5,0,0,229   stream_name 
 ticaret_sicil_gazetesi.pdf   Content-Type application/pdf   creator I.R.I.S.  
  page page page page 
 
 Environment: solr-trunk, Mac OS X Version 10.7.4, Java HotSpot(TM) 64-Bit 
 Server VM (build 20.8-b03-424, mixed mode), jetty.
 
 Same thing happens with Solr 4.0-beta and Tomcat too.
 
 Thanks,



Re: scanned pdf with solr cell

2012-08-15 Thread Ahmet Arslan
 the dock icon appears when AWT starts, e.g. when a font is
 loaded.
 You can prevent it using the headless mode but this is
 likely to trigger an exception.
 Same if your user is not UI-logged-in.

Hi Paul, thanks for the explanation. So is it nothing to worry about?



Re: SOLR3.6:Field Collapsing/Grouping throws OOM

2012-08-15 Thread Tirthankar Chatterjee
Hi Erick,
You are so right on the memory calculations. I am happy that I know now that I 
was doing something wrong. Yes I am getting confused with SQL.

I will back up and let you know the use case. I am tracking file versions. And 
I want to give an option to browse your system for the latest files. So in 
order to remove dups (same filename) I used grouping.

Also, when you say sharding, is it okay if I do multiple cores, and does it mean that 
each core needs a separate Tomcat? I meant to say, can I use the same machine? 
150 mill docs have 120 mill unique paths too.

One more thing. If I need sharding and need a new box then it won't be great, 
because this system still has horsepower left which I can use.

Thanks a ton for explaining the issue.

Erick Erickson erickerick...@gmail.com wrote:


You're putting a lot of data on a single box, then
asking to group on what I presume is a string
field. That's just going to eat up a _bunch_ of
memory.

let's say your average file name is 16 bytes long. Each
unique value will take up 58 + 32 bytes (58 bytes
of overhead, I'm presuming Solr 3.X and 16*2 bytes
for the chars). So, we're up to (90 bytes/string * number
of distinct file names). Say you have, for argument's
sake, 100M distinct file names. You're up to 9G
memory requirement for sorting alone. Solr's
sorting reads all the unique values into memory whether
or not they satisfy the query...

And Grouping can also be expensive. I don't think
you really want to group in this case, I'd simply use
a filter query something like:
fq=filefolder:E\:\\pd_dst\\646c6907-a948-4b83-ac1d-d44742bb0307

Then you're also grouping on conv_sort which doesn't
make much sense, do you really want individual results returned
for _each_ file name?

What it looks like to me is you're confusing SQL with
solr search and getting into bad situations...

Also, 150M documents in a single shard is...really a lot.
You're probably at a point where you need to shard. Not
to mention that your 400G index is trying to be jammed
into 12G of memory.

This actually feels like an XY problem, can you back
up and let us know what the use-case you're
trying to solve is? Perhaps there are less memory-
consumptive solutions possible.

Best
Erick

On Tue, Aug 14, 2012 at 6:38 AM, Tirthankar Chatterjee
tchatter...@commvault.com wrote:
 Editing the query...remove smb:. I don't know where it came from while 
 I did copy/paste

 Tirthankar Chatterjee tchatter...@commvault.com wrote:


 Hi,
 I have a beefy box with 24Gb RAM (12GB for Tomcat7 which houses SOLR3.6) & 2 
 Processors Intel Xeon 64 bit Server, 30TB HDD. JDK 1.7.0_03 x64 bit


 Data Index Dir Size: 400GB
 Metadata of files is stored in it. I have around 15 schema fields.
 Total number of items:150million approx.

 I have a scenario which I will try to explain to the best of my knowledge 
 here:

 Let us consider the fields I am interested in

 Url: Entire path of a file in windows file system including the filename. 
 ex:C:\Documents\A.txt
 mtm: Modified Time of the file
 Jid:JOb ID
 conv_sort is string field type where the filename is stored.

 I run a job where the following gets inserted

 Total Items:2
 Url:C:\personal\A1.txt
 mtm:08/14/2012 12:00:00
 Jid:1
 Conv_sort:A1.txt
 ---
 Url:C:\personal\B1.txt
 mtm:08/14/2012 12:01:00
 Jid:1
 Conv_sort:B1.txt
 In the second run only one item changes:

 Url:C:\personal\A1.txt
 mtm:08/15/2012 1:00:00
 Jid:2
 Conv_sort=A1.txt

 When queried I would like to return the latest A1.txt and B1.txt back to the 
 end user. I am trying to use grouping with no luck. It keeps throwing OOM… 
 can someone please help… as it is critical for my project

 The query I am trying: under a folder there are 1000 files, and I am putting a 
 filter query param too, asking it to group by filename or url, and none of 
 them work… what am I doing wrong here?


 http://172.19.108.78:8080/solr/select/?q=*:*&version=2.2&start=0&rows=10&indent=on&group.query=filefolder:E\:\\pd_dst\\646c6907-a948-4b83-ac1d-d44742bb0307smb://pd_dst//646c6907-a948-4b83-ac1d-d44742bb0307&group=true&group.limit=1&group.field=conv_sort&group.ngroup=true


 The stack trace:


 SEVERE: java.lang.OutOfMemoryError: Java heap space
 at java.util.Arrays.copyOfRange(Unknown Source)
 at java.lang.String.<init>(Unknown Source)
 at org.apache.lucene.index.TermBuffer.toTerm(TermBuffer.java:122)
 at org.apache.lucene.index.SegmentTermEnum.term(SegmentTermEnum.java:184)
 at org.apache.lucene.search.FieldCacheImpl$StringIndexCache.createValue(FieldCacheImpl.java:882)
 at org.apache.lucene.search.FieldCacheImpl$Cache.get(FieldCacheImpl.java:233)
 at org.apache.lucene.search.FieldCacheImpl.getStringIndex(FieldCacheImpl.java:856)
 at org.apache.lucene.search.grouping.TermFirstPassGroupingCollector.setNextReader(TermFirstPassGroupingCollector.java:74)
 at ...

Re: scanned pdf with solr cell

2012-08-15 Thread Paul Libbrecht


On 15 August 2012 at 13:03, Ahmet Arslan wrote:

 Hi Paul, thanks for the explanation. So is it nothing to worry about?

it is nothing to worry about except to remember that you can't run this step in 
a daemon-like process.
(on Linux, I had to set-up a VNC-server for similar tasks)

paul

Re: Switch from Sphinx to Solr - some basics please

2012-08-15 Thread Ahmet Arslan
 Because I have posted this on Stackoverflow, I don't want
 there to be duplicate
 questions. Can you please read this post:
 
 http://stackoverflow.com/questions/11956608/sphinx-user-is-switching-to-solr

Your questions require Sphinx knowledge. I suggest you read these book(s) 
http://lucene.apache.org/solr/books.html
http://www.manning.com/hatcher3/

I have in Sphinx: min_word_len ... How to use this in Solr?

http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters/#solr.LengthFilterFactory
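
A minimal schema.xml sketch of that filter (min=3 mirrors a Sphinx
min_word_len of 3; the numbers are only illustrative):

<analyzer>
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <!-- drop tokens shorter than 3 chars, like min_word_len=3 -->
  <filter class="solr.LengthFilterFactory" min="3" max="256"/>
</analyzer>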


Re: Switch from Sphinx to Solr - some basics please

2012-08-15 Thread nnikolay
HI iorixxx, thanks for the reply.

Well you don't need sphinx knowledge to answer my questions.

I will write you what I want:

1. I need to have 2 separate indexes. On Stackoverflow I got the answer that I
need to start 2 cores, for example. How many cores can I run in Solr? I have
for example over 100 different indexes that should be seen as separate
data. These indexes should be reindexed at different times and their data
should not be mixed with each other.

You need to understand the following situation:

I have for example jobs from country A, jobs from country B and so on up to
100 countries. I need to have a separate index for each country, because if
someone searches for jobs in country A I need to query only the index for
country A. How to solve this problem?

How to do this? Is there a good tutorial? In the wiki of solr, it is very
badly explained.

2. When I get new data, for example: should I rotate the whole index
again, or can I include the new rows and delete the old rows? What is your
suggestion?

Thanks
Nik



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Switch-from-Sphinx-to-Solr-some-basics-please-tp4001234p4001379.html
Sent from the Solr - User mailing list archive at Nabble.com.


How to design index for related versioned database records

2012-08-15 Thread Stefan Burkard
Hi solr-users

I have a case where I need to build an index from a database.

***Data structure***
The data is spread across multiple tables and in each table the
records are versioned - this means that one real record can exist
multiple times in a table, each with different validFrom/validUntil
dates. Therefore it is possible to query the valid version of a record
for a given point in time.

The relations of the data are something like this:
Employee -> LinkTable (=Employment) -> Employer -> LinkTable
(=offered services) -> Service

That means I have data across 5 relations, each of them with versioned records.

***Search needs***
Now I need to be able to search for employees and employers based on
the services they offer for a given point in time.

Therefore I have built an index of all employees and employers with
their services as subentity. So I have one index entry for every
version of every employee/employer and each version collects the
offered services for the given timeframe of the employee/employer
version.

Problem: The offered services of an employee/employer can change
during its validity period. That means I do not only need to take the
version timespan of the employee/employer into account but also the
version timespans of services and the link-tables.

***Question***
I think I could continue with my strategy to have an index entry of an
employee/employer with its services for any given point in time. But
there are much more entries than now since every involved
validfrom/validuntil period (if they overlap) produces more entries.
But I am not sure if this is a good strategy, or if it would be better
to try to index the whole datastructure in an other way.

Are there any recommendations how to handle such a case?

Thanks for any help
Stephan


Re: Switch from Sphinx to Solr - some basics please

2012-08-15 Thread Ahmet Arslan

 1. I need to have 2 separate indexes. On Stackoverflow I
 got the answer that I
 need to start 2 cores, for example. How many cores can I run
 in Solr? 

Please see : http://search-lucene.com/m/6rYti2ehFZ82


 I have for example jobs from country A, jobs from country B
 and so on up to
 100 countries. I need to have a separate index for each
 country, because if
 someone searches for jobs in country A I need to query only
 the index for
 country A. How to solve this problem?
 How to do this? Is there a good tutorial? In the wiki of
 solr, it is very
 badly explained.

http://wiki.apache.org/solr/MultipleIndexes talks about different solutions. 
One big index with fq is an option too.
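
For example, assuming a country field in your schema (the field name is
just an illustration):

http://localhost:8983/solr/select?q=developer&fq=country:A

The fq clause restricts results to one country and is cached separately
from the main query.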

 2. When I get new data, for example: should I rotate the
 whole index
 again, or can I include the new rows and delete the old
 rows? What is your
 suggestion?

I don't understand this. What do you mean by "rotate the whole index"?


Re: RAMDirectoryFactory bug

2012-08-15 Thread Michael Della Bitta
Hi, Lance,

Thanks for your reply!

It seems as if RAMDirectoryFactory is being passed the correct path to
the index, as it's being logged correctly. It just doesn't recognize
it as an index.

Michael Della Bitta


Appinions | 18 East 41st St., Suite 1806 | New York, NY 10017
www.appinions.com
Where Influence Isn’t a Game


On Tue, Aug 14, 2012 at 9:57 PM, Lance Norskog goks...@gmail.com wrote:
 I can't remember the property name, but there is a Solr Java property
 that tells where to hunt for the data/ directory. You might be able to
 work around this bug using that property.

 On Tue, Aug 14, 2012 at 1:34 PM, Michael Della Bitta
 michael.della.bi...@appinions.com wrote:
 Hi everyone,

 It looks like I found a bug with RAMDirectoryFactory (I know, I know...)

 It doesn't seem to be able to load files off the disk. Every time it
 starts up, it logs:

 WARNING: [] Solr index directory 'solr/./data/index' doesn't exist.
 Creating new index...

 Even if that filesystem path exists and there's a valid index there
 (verified by switching back to StandardDirectoryFactory).

 I experienced this first on our infrastructure on AWS, but I confirmed
 this by downloading the Solr 3.6.1 distribution fresh, indexing the
 exampledocs, stopping Jetty and reconfiguring for RAMDirectoryFactory,
 and restarting Jetty. The statement above gets logged, but otherwise
 the core comes up OK, but empty.

 Should I file a bug?

 Michael Della Bitta

 
 Appinions | 18 East 41st St., Suite 1806 | New York, NY 10017
 www.appinions.com
 Where Influence Isn’t a Game



 --
 Lance Norskog
 goks...@gmail.com


Re: How to design index for related versioned database records

2012-08-15 Thread Jack Krupansky
The date checking can be implemented using a range query as a filter query, 
such as


fq=startDate:[* TO NOW] AND endDate:[NOW TO *]

(You can also use an frange query.)
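
A sketch of the frange form, assuming ms(a,b) returns a minus b in
milliseconds:

fq={!frange u=0}ms(startDate,NOW)&fq={!frange l=0}ms(endDate,NOW)

i.e. startDate is at or before NOW and endDate is at or after NOW.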

Then you will have to flatten the database tables. Your Solr schema would 
have a single merged record type. You will have to decide whether the 
different record types (tables) will have common fields versus static 
qualification by adding a prefix or suffix, e.g., name vs. employee_name 
and employer_name. The latter has the advantage that you do not have to 
separately specify a table type field since the fields would be empty for 
records of other types.
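
For instance, a minimal sketch of such a flattened schema (the field names
and types are illustrative, not prescriptive):

<field name="employee_name" type="string" indexed="true" stored="true"/>
<field name="employer_name" type="string" indexed="true" stored="true"/>
<field name="service" type="string" indexed="true" stored="true" multiValued="true"/>
<field name="startDate" type="date" indexed="true" stored="true"/>
<field name="endDate" type="date" indexed="true" stored="true"/>

A record representing an employee version would leave employer_name empty,
and vice versa.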


-- Jack Krupansky

-Original Message- 
From: Stefan Burkard

Sent: Wednesday, August 15, 2012 8:12 AM
To: solr-user@lucene.apache.org
Subject: How to design index for related versioned database records

Hi solr-users

I have a case where I need to build an index from a database.

***Data structure***
The data is spread across multiple tables and in each table the
records are versioned - this means that one real record can exist
multiple times in a table, each with different validFrom/validUntil
dates. Therefore it is possible to query the valid version of a record
for a given point in time.

The relations of the data are something like this:
Employee - LinkTable (=Employment) - Employer - LinkTable
(=offered services) - Service

That means I have data across 5 relations, each of them with versioned 
records.


***Search needs***
Now I need to be able to search for employees and employers based on
the services they offer for a given point in time.

Therefore I have built an index of all employees and employers with
their services as subentity. So I have one index entry for every
version of every employee/employer and each version collects the
offered services for the given timeframe of the employee/employer
version.

Problem: The offered services of an employee/employer can change
during its validity period. That means I do not only need to take the
version timespan of the employee/employer into account but also the
version timespans of services and the link-tables.

***Question***
I think I could continue with my strategy to have an index entry of an
employee/employer with its services for any given point in time. But
there are much more entries than now since every involved
validfrom/validuntil period (if they overlap) produces more entries.
But I am not sure if this is a good strategy, or if it would be better
to try to index the whole datastructure in an other way.

Are there any recommendations how to handle such a case?

Thanks for any help
Stephan 



Re: scanned pdf with solr cell

2012-08-15 Thread Michael Della Bitta
You can try passing -Djava.awt.headless=true as one of the arguments
when you start Jetty to see if you can get this to go away with no ill
effects.

Michael Della Bitta


Appinions | 18 East 41st St., Suite 1806 | New York, NY 10017
www.appinions.com
Where Influence Isn’t a Game


On Wed, Aug 15, 2012 at 7:07 AM, Paul Libbrecht p...@hoplahup.net wrote:


 On 15 August 2012 at 13:03, Ahmet Arslan wrote:

 Hi Paul, thanks for the explanation. So is it nothing to worry about?

 it is nothing to worry about except to remember that you can't run this step 
 in a daemon-like process.
 (on Linux, I had to set-up a VNC-server for similar tasks)

 paul


Re: scanned pdf with solr cell

2012-08-15 Thread Ahmet Arslan
 You can try passing
 -Djava.awt.headless=true as one of the arguments
 when you start Jetty to see if you can get this to go away
 with no ill
 effects.

I started jetty using: 'java -Djava.awt.headless=true -jar start.jar' and 
successfully indexed two pdf files. That icon didn't appear :) Thanks! 


Re: RAMDirectoryFactory bug

2012-08-15 Thread Mark Miller

On Aug 14, 2012, at 4:34 PM, Michael Della Bitta 
michael.della.bi...@appinions.com wrote:

 Hi everyone,
 
 It looks like I found a bug with RAMDirectoryFactory (I know, I know...)
 

Fair warning - RAMDir use in Solr is like a third class citizen. You probably 
should be using the mmap dir anyway.
See http://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html
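
If you do switch, it should just be a one-line change in solrconfig.xml;
a sketch (assuming MMapDirectoryFactory is available in your version,
which I believe it is from 3.6 on):

<directoryFactory name="DirectoryFactory" class="solr.MMapDirectoryFactory"/>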

 It doesn't seem to be able to load files off the disk. Every time it
 starts up, it logs:
 
 WARNING: [] Solr index directory 'solr/./data/index' doesn't exist.
 Creating new index...
 
 Even if that filesystem path exists and there's a valid index there
 (verified by switching back to StandardDirectoryFactory).

I think it *should* work how you want, so does sound like a bug perhaps.

 
 I experienced this first on our infrastructure on AWS, but I confirmed
 this by downloading the Solr 3.6.1 distribution fresh, indexing the
 exampledocs, stopping Jetty and reconfiguring for RAMDirectoryFactory,
 and restarting Jetty. The statement above gets logged, but otherwise
 the core comes up OK, but empty.
 
 Should I file a bug?

Sure.

 
 Michael Della Bitta
 
 
 Appinions | 18 East 41st St., Suite 1806 | New York, NY 10017
 www.appinions.com
 Where Influence Isn’t a Game

- Mark Miller
lucidimagination.com


Re: RAMDirectoryFactory bug

2012-08-15 Thread Michael Della Bitta
Yes, moving to mmap was on our roadmap. I'm in the middle of moving
our infrastructure from 1.4 to 3.6.1, and didn't want to make too many
changes at the same time. However, this bug might push us over the
edge to mmap and away from ram.

I'll file a bug regardless.

Thanks!

Michael Della Bitta


Appinions | 18 East 41st St., Suite 1806 | New York, NY 10017
www.appinions.com
Where Influence Isn’t a Game


On Wed, Aug 15, 2012 at 9:05 AM, Mark Miller markrmil...@gmail.com wrote:

 On Aug 14, 2012, at 4:34 PM, Michael Della Bitta 
 michael.della.bi...@appinions.com wrote:

 Hi everyone,

 It looks like I found a bug with RAMDirectoryFactory (I know, I know...)


 Fair warning - RAMDir use in Solr is like a third class citizen. You probably 
 should be using the mmap dir anyway.
 See http://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html

 It doesn't seem to be able to load files off the disk. Every time it
 starts up, it logs:

 WARNING: [] Solr index directory 'solr/./data/index' doesn't exist.
 Creating new index...

 Even if that filesystem path exists and there's a valid index there
 (verified by switching back to StandardDirectoryFactory).

 I think it *should* work how you want, so does sound like a bug perhaps.


 I experienced this first on our infrastructure on AWS, but I confirmed
 this by downloading the Solr 3.6.1 distribution fresh, indexing the
 exampledocs, stopping Jetty and reconfiguring for RAMDirectoryFactory,
 and restarting Jetty. The statement above gets logged, but otherwise
 the core comes up OK, but empty.

 Should I file a bug?

 Sure.


 Michael Della Bitta

 
 Appinions | 18 East 41st St., Suite 1806 | New York, NY 10017
 www.appinions.com
 Where Influence Isn’t a Game

 - Mark Miller
 lucidimagination.com


RE: Solr 4.0 - Join performance

2012-08-15 Thread David Smiley (@MITRE.org)
You would index rectangles of 0 height but that have a left edge 'x' of the
start time and a right edge 'x' of your end time.  You can index a variable
number of these per Solr document and then query by either a point or
another rectangle to find documents which intersect your query shape.  It
can't do a completely within based query, just intersection for now.  I
really look forward to seeing this wrapped up in some sort of RangeFieldType
so that users don't have to think in spatial terms.  



-
 Author: http://www.packtpub.com/apache-solr-3-enterprise-search-server/book
--
View this message in context: 
http://lucene.472066.n3.nabble.com/Solr-4-0-Join-performance-tp3998827p4001404.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Index not loading

2012-08-15 Thread Jonatan Fournier
On Tue, Aug 14, 2012 at 5:37 PM, Jonatan Fournier
jonatan.fourn...@gmail.com wrote:
 On Tue, Aug 14, 2012 at 10:25 AM, Erick Erickson
 erickerick...@gmail.com wrote:
 This is quite odd, it really sounds like you're not
 actually committing. So, some questions.

 1> What happens if you search before you shut
 down your tomcat? Do you see docs then? If so,
 somehow you're doing soft commits and never
 doing a hard commit.

Yeah, I just realized the behavior is the same as softCommit; is it the
default for commitWithin?

Cheers,

/jonathan


 2> What happens if, as the last statement in your SolrJ
 program you do a commit()?

 When using commitWithin, if I introduce server.commit() within the
 data load process the data gets committed (I didn't reproduce with my
 89G of data...), if I shutdown my EmbeddedServer and restart it and
 send a commit, like on Tomcat, all data gets wiped out too. So I guess
 that there's state loss somewhere.

 Cheers,

 /jonathan


 3> While you're indexing, what do you see in your index
 directory? You should see multiple segments being
 created, and possibly merged so the number of
 files should go up and down. If you only have a single
 set of files, you're somehow not doing a commit.

 4> Is there something really silly going on like your
 restart scripts delete the index directory? Or you're
 using a VM that restores a blank image?

 5> When you do restart, are there any files at all
 in your index directory?

 I really suspect you've got some configuration problem
 here

 Best
 Erick



 On Mon, Aug 13, 2012 at 9:11 AM, Jonatan Fournier
 jonatan.fourn...@gmail.com wrote:
 Hi,

 I'm using Solr 4.0.0-ALPHA and the EmbeddedSolrServer.

 Within my SolrJ application, the documents are added to the server
 using the commitWithin parameter (in my case 60s). After 1 day my 125
 millions document are all added to the server and I can see 89G of
 index data files. I stop my SolrJ application and reload my Solr
 instance in Tomcat.

 From the Solr admin panel related to my Core (collection1) I see this info:


 Last Modified:
 Num Docs:0
 Max Doc:0
 Version:1
 Segment Count:0
 Optimized: (green check)
 Current:  (green check)
 Master:
 Version: 0
 Gen: 1
 Size: 88.14 GB


 From the general Core Admin panel I see:

 lastModified:
 version:1
 numDocs:0
 maxDoc:0
 optimized: (red circle)
 current: (green check)
 hasDeletions: (red circle)

 If I query my index for *:* I get 0 results. If I trigger optimize it
 wipes ALL my data inside the index and resets to empty. I've played
 around with my EmbeddedServer initially using autoCommit/softCommit and it
 was working fine. Now that I've switched to commitWithin on the document
 add query, it always does that! I'm never able to reload my index within
 Tomcat/Solr.

 Any idea?

 Cheers,

 /jonathan


Re: Switch from Sphinx to Solr - some basics please

2012-08-15 Thread Walter Underwood
These do require some Sphinx knowledge. I could answer them on StackOverflow 
because I converted Chegg from Sphinx to Solr this year.

As I said there, read about Solr cores. They are independent search 
configurations and indexes within one Solr server: 
http://wiki.apache.org/solr/CoreAdmin 

For your jobs example, I would use filter queries to limit the search to a 
single country. Filter them to country:us or country:de or country:fr and you 
will only get results from that country.

Solr does not use the term "rotate" for indexes. You can delete with a query, 
so you could delete all the jobs for one country, reindex those, then commit.
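
As a sketch, assuming a country field (the name is just an illustration),
you would post:

<delete><query>country:us</query></delete>

then reindex the US jobs and finish with a <commit/>.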

Separate cores are best when you have different kinds of data. At Chegg, we 
search books and college courses. Those are in different cores and have very 
different schemas.

wunder

On Aug 15, 2012, at 5:11 AM, nnikolay wrote:

 HI iorixxx, thanks for the reply.
 
 Well you don't need sphinx knowledge to answer my questions.
 
 I will write you what I want:
 
 1. I need to have 2 separate indexes. On Stackoverflow I got the answer that I
 need to start 2 cores, for example. How many cores can I run in Solr? I have
 for example over 100 different indexes that should be seen as separate
 data. These indexes should be reindexed at different times and their data
 should not be mixed with each other.
 
 You need to understand the following situation:
 
 I have for example jobs from country A, jobs from country B and so on up to
 100 countries. I need to have a separate index for each country, because if
 someone searches for jobs in country A I need to query only the index for
 country A. How to solve this problem?
 
 How to do this? Is there a good tutorial? In the wiki of solr, it is very
 badly explained.
 
 2. When I get new data, for example: should I rotate the whole index
 again, or can I include the new rows and delete the old rows? What is your
 suggestion?
 
 Thanks
 Nik
 
 
 
 --
 View this message in context: 
 http://lucene.472066.n3.nabble.com/Switch-from-Sphinx-to-Solr-some-basics-please-tp4001234p4001379.html
 Sent from the Solr - User mailing list archive at Nabble.com.

--
Walter Underwood
wun...@wunderwood.org





Re: Duplicated facet counts in solr 4 beta: user error

2012-08-15 Thread Erick Erickson
No problem, and thanks for posting the resolution

If you have the time and energy, anyone can edit the Wiki if you
create a logon, so any clarification you'd like to provide to keep
others from having this problem would be most welcome!

Best
Erick

On Tue, Aug 14, 2012 at 6:13 PM, Buttler, David buttl...@llnl.gov wrote:
 Here are my steps:

 1)  Download apache-solr-4.0.0-BETA

 2)  Untar into a directory

 3)  cp -r example example2

 4)  cp -r example exampleB

 5)  cp -r example example2B

 6)  cd example;  java -Dbootstrap_confdir=./solr/collection1/conf 
 -Dcollection.configName=myconf -DzkRun -DnumShards=2 -jar start.jar

 7)  cd example2; java -Djetty.port=7574 -DzkHost=localhost:9983 -jar 
 start.jar

 8)  cd exampleB; java -Djetty.port=8900 -DzkHost=localhost:9983 -jar 
 start.jar

 9)  cd example2B; java -Djetty.port=7500 -DzkHost=localhost:9983 -jar 
 start.jar

 10)   cd example/exampledocs; java 
 -Durl=http://localhost:8983/solr/collection1/update -jar post.jar *.xml

 http://localhost:8983/solr/collection1/select?q=*:*&wt=xml&fq=cat:%22electronics%22
 14 results returned

 This is correct.  Let's try a slightly more circuitous route by running 
 through the solr tutorial first


 1)  Download apache-solr-4.0.0-BETA

 2)  Untar into a directory

 3)  cd example; java  -jar start.jar

 4)  cd example/exampledocs; java 
 -Durl=http://localhost:8983/solr/collection1/update -jar post.jar *.xml

 5)  kill jetty server

 6)  cp -r example example2

 7)  cp -r example exampleB

 8)  cp -r example example2B

 9)  cd example;  java -Dbootstrap_confdir=./solr/collection1/conf 
 -Dcollection.configName=myconf -DzkRun -DnumShards=2 -jar start.jar

 10)   cd example2; java -Djetty.port=7574 -DzkHost=localhost:9983 -jar 
 start.jar

 11)   cd exampleB; java -Djetty.port=8900 -DzkHost=localhost:9983 -jar 
 start.jar

 12)   cd example2B; java -Djetty.port=7500 -DzkHost=localhost:9983 -jar 
 start.jar

 13)   cd example/exampledocs; java 
 -Durl=http://localhost:8983/solr/collection1/update -jar post.jar *.xml

 With the same query as above, 22 results are returned.

 Looking at this, it is somewhat obvious that what is happening is that the 
 index was copied over from the tutorial and was not cleaned up before running 
 the cloud examples.

 Adding the debug=query parameter to the query URL produces the following:
 <lst name="debug">
 <str name="rawquerystring">*:*</str>
 <str name="querystring">*:*</str>
 <str name="parsedquery">MatchAllDocsQuery(*:*)</str>
 <str name="parsedquery_toString">*:*</str>
 <str name="QParser">LuceneQParser</str>
 <arr name="filter_queries">
 <str>cat:electronics</str>
 </arr>
 <arr name="parsed_filter_queries">
 <str>cat:electronics</str>
 </arr>
 </lst>

 So, Erick's diagnosis is correct: pilot error.  However, the straightforward 
 path through the tutorial and on to solr cloud makes it easy to make this 
 mistake. Maybe a small warning in the solr cloud page would help?

 Now, running a delete operation fixes things:
 cd example/exampledocs;
 java -Dcommit=false -Ddata=args -jar post.jar 
 "<delete><query>*:*</query></delete>"
 causes the number of results to be zero.  So, let's reload the data:
 java -Durl=http://localhost:8983/solr/collection1/update -jar post.jar *.xml
 now the number of results for our query
 http://localhost:8983/solr/collection1/select?q=*:*&wt=xml&fq=cat:electronics
 is back to the correct 14 results.

 Dave

 PS apologies for hijacking the thread earlier.


Re: Facet sort numeric values

2012-08-15 Thread Erick Erickson
the problem you're running into is that lexical ordering of
numeric data != numeric ordering. If you have mixed
alpha and numeric data, you may not care if the alpha
stuff is first, i.e.

asdb456
asdf490

sorts fine. Problems happen with
9jsdf
100ukel

the 100ukel comes first.

So if you have a mixed alpha and numeric situation,
you have to either live with it or normalize the numeric
data so its lexical ordering == numeric ordering; the most
common way is to left-pad numeric data to a fixed-width,
i.e. rather than index asb9fg, index asb009fg. Of
course you have to know what the upper limit of any digit
is for this to work...

Best
Erick

On Wed, Aug 15, 2012 at 12:33 AM, Aleksander Akerø
solraleksan...@gmail.com wrote:
 Oh brilliant, I didn't think it was possible to configure it that way.

 Had made my own untokenized type, so I guess it would be better for me to
 control the datatype this way.

 Bonus question (hehe): What if these field values also contain alphanumeric
 values? E.g. Alpha, Bravo, Omega, ... 
 How would this affect the sorting? I guess the TrieIntField is not
 applicable then.

 Aleksander Akerø
 @ Gurusoft AS
 Mobil: 944 89 054
 QR-Code (Kontaktinfo)

 -Original Message-
 From: Chris Hostetter [mailto:hossman_luc...@fucit.org]
 Sent: 14. august 2012 17:45
 To: solr-user@lucene.apache.org
 Subject: Re: Facet sort numeric values


 : I'm having a problem with sorting facets. I am using the facet.sort=index
 : parameter and it works fine for most of the values.
 ...
 : Example, when sorting 15, 6, 23, 7, 10, 90 it sorts like this: 10, 15,
 : 23, 6, 7, 90, but what I wanted was 6, 7, 10, 15, 23, 90.

 what field type are you using?

 If you use one of the Trie___Field types then the facet values should sort
 exactly as you describe.

 <fieldType name="int" class="solr.TrieIntField" precisionStep="0" positionIncrementGap="0"/>
 <fieldType name="float" class="solr.TrieFloatField" precisionStep="0" positionIncrementGap="0"/>
 <fieldType name="long" class="solr.TrieLongField" precisionStep="0" positionIncrementGap="0"/>
 <fieldType name="double" class="solr.TrieDoubleField" precisionStep="0" positionIncrementGap="0"/>



 -Hoss



Re: Solr 3.5 result grouping is failing

2012-08-15 Thread Erick Erickson
Please attach the results of adding debugQuery=on
to your query in both the success and failure case, there's
very little information to go on here. You might review:

http://wiki.apache.org/solr/UsingMailingLists

Best
Erick

On Wed, Aug 15, 2012 at 12:57 AM, chethan chethan.p...@gmail.com wrote:
 Hi,

 I'm trying to group (field collapse) my search results on a field called
 site. The schema says that it has to be indexed: *<field name="site"
 type="string" stored="false" indexed="true"/>.*
 But when I try to query the results with *group.field=site&group.limit=100,
 *I see only 1 group of results being returned. And the group value is null.
 This seems to work on another solr instance which only has a few documents
 indexed. Seems to fail on bigger indexes. Help is appreciated.

 Thanks
 Chethan


 Sent this message again as it seemed to bounce the first time.


Re: SOLR3.6:Field Collapsing/Grouping throws OOM

2012-08-15 Thread Erick Erickson
No, sharding into multiple cores on the same machine still
is limited by the physical memory available. It's still lots
of stuff on a limited box.

But try backing up and re-thinking the problem a bit.
Some possibilities off the top of my head:

1> have a new field "current". When you update a doc,
 reindex the old doc with current=0 and put current=
 1 in the new doc (boolean field). Getting one and
 only one is really simple.
2> Use external file fields (EFF) for the same purpose, that
 won't require you to re-index the doc. The trick
 here is you use the value in the EFF as a multiplier
 for the score (that's what function queries do). So older
 versions of the doc have scores of 0 and just don't
 show up.
3> Implement a custom collector that replaces older hits
 with newer hits. Actually I don't particularly like this
 because it would potentially replace a higher-scoring
document with a lower-scoring one in the results list...

Bottom line here is I don't think grouping is a good approach
for this problem

Best
Erick

On Wed, Aug 15, 2012 at 5:04 AM, Tirthankar Chatterjee
tchatter...@commvault.com wrote:
 Hi Erick,
 You are so right on the memory calculations. I am happy that I know now that 
 I was doing something wrong. Yes I am getting confused with SQL.

 I will back up and let you know the use case. I am tracking file versions. 
 And I want to give an option to browse your system for the latest files. So 
 in order to remove dups (same filename) I used grouping.

 Also, when you say sharding, is it okay if I do multiple cores, and does it mean 
 that each core needs a separate Tomcat? I meant to say, can I use the same 
 machine? 150 mill docs have 120 mill unique paths too.

 One more thing. If I need sharding and need a new box then it won't be great, 
 because this system still has horsepower left which I can use.

 Thanks a ton for explaining the issue.

 Erick Erickson erickerick...@gmail.com wrote:


 You're putting a lot of data on a single box, then
 asking to group on what I presume is a string
 field. That's just going to eat up a _bunch_ of
 memory.

 let's say your average file name is 16 bytes long. Each
 unique value will take up 58 + 32 bytes (58 bytes
 of overhead, I'm presuming Solr 3.X and 16*2 bytes
 for the chars). So, we're up to 90 bytes/string * number
 of distinct file names) Say you have, for argument's
 sake, 100M distinct file names. You're up to 9G
 memory requirement for sorting alone. Solr's
 sorting reads all the unique values into memory whether
 or not they satisfy the query...

 And Grouping can also be expensive. I don't think
 you really want to group in this case, I'd simply use
 a filter query something like:
 fq=filefolder:E\:\\pd_dst\\646c6907-a948-4b83-ac1d-d44742bb0307

 Then you're also grouping on conv_sort which doesn't
 make much sense, do you really want individual results returned
 for _each_ file name?

 What it looks like to me is you're confusing SQL with
 solr search and getting into bad situations...

 Also, 150M documents in a single shard is...really a lot.
 You're probably at a point where you need to shard. Not
 to mention that your 400G index is trying to be jammed
 into 12G of memory.

 This actually feels like an XY problem, can you back
 up and let us know what the use-case you're
 trying to solve is? Perhaps there are less memory-
 consumptive solutions possible.

 Best
 Erick

 On Tue, Aug 14, 2012 at 6:38 AM, Tirthankar Chatterjee
 tchatter...@commvault.com wrote:
 Editing the query...remove smb:. I don't know where it came from while 
 I did copy/paste

 Tirthankar Chatterjee tchatter...@commvault.com wrote:


 Hi,
 I have a beefy box with 24Gb RAM (12GB for Tomcat7 which houses SOLR3.6) & 2 
 Processors Intel Xeon 64 bit Server, 30TB HDD. JDK 1.7.0_03 x64 bit


 Data Index Dir Size: 400GB
 Metadata of files is stored in it. I have around 15 schema fields.
 Total number of items:150million approx.

 I have a scenario which I will try to explain to the best of my knowledge 
 here:

 Let us consider the fields I am interested in

 Url: Entire path of a file in windows file system including the filename. 
 ex:C:\Documents\A.txt
 mtm: Modified Time of the file
 Jid:JOb ID
 conv_sort is string field type where the filename is stored.

 I run a job where the following gets inserted

 Total Items:2
 Url:C:\personal\A1.txt
 mtm:08/14/2012 12:00:00
 Jid:1
 Conv_sort:A1.txt
 ---
 Url:C:\personal\B1.txt
 mtm:08/14/2012 12:01:00
 Jid:1
 Conv_sort:B1.txt
 In the second run only one item changes:

 Url:C:\personal\A1.txt
 mtm:08/15/2012 1:00:00
 Jid:2
 Conv_sort=A1.txt

 When queried I would like to return the latest A1.txt and B1.txt back to the 
 end user. I am trying to use grouping with no luck. It keeps throwing OOM… 
 can someone please help… as it is critical for my project

 The query I am trying is under a folder there are 1000 files and I putting a 
 filtered 

Re: question(s) re lucene spatial toolkit aka LSP aka spatial4j

2012-08-15 Thread David Smiley (@MITRE.org)
Hey solr-user, are you by chance indexing LineStrings?  That is something I
never tried with this spatial index.  Depending on which iteration of LSP
you are using, I figure you'd either end up indexing a vast number of points
along the line which would be slow to index and make the index quite big, or
you might end up with a geohash granularity that will look more like a very
blocky (i.e. pixelated) approximation of the line that is much coarser and
will thus trigger searches near the line to match the line.  I don't have
this use-case in my work so I haven't put that much thought into handling
lines -- I just do points & polygons & circles & rects.
~ David



-
 Author: http://www.packtpub.com/apache-solr-3-enterprise-search-server/book
--
View this message in context: 
http://lucene.472066.n3.nabble.com/question-s-re-lucene-spatial-toolkit-aka-LSP-aka-spatial4j-tp3997757p4001486.html
Sent from the Solr - User mailing list archive at Nabble.com.


Does DataImportHandler do any sanitizing?

2012-08-15 Thread Jon Drukman
I am pulling some fields from a mysql database using DataImportHandler and
some of them have invalid XML in them.  Does DataImportHandler do any kind
of filtering/sanitizing to ensure that it will go in OK or is it all on me?

Example bad data:  orphaned ampersands (Peanut Butter & Jelly), curly
quotes (we’re)

-jsd-


Re: Does DataImportHandler do any sanitizing?

2012-08-15 Thread Michael Della Bitta
Hi, Jon,

As far as I know, DataImportHandler doesn't transfer data to the rest
of Solr via XML so it shouldn't be a problem...

Michael Della Bitta


Appinions | 18 East 41st St., Suite 1806 | New York, NY 10017
www.appinions.com
Where Influence Isn’t a Game


On Wed, Aug 15, 2012 at 5:03 PM, Jon Drukman jdruk...@gmail.com wrote:
 I am pulling some fields from a mysql database using DataImportHandler and
 some of them have invalid XML in them.  Does DataImportHandler do any kind
 of filtering/sanitizing to ensure that it will go in OK or is it all on me?

 Example bad data:  orphaned ampersands (Peanut Butter & Jelly), curly
 quotes (we’re)

 -jsd-


custom complex field - PolyField

2012-08-15 Thread Leonardo Souza
Hi,

I have to index a tuple like ('blah', 'more blah info') in a multivalued
field type.
I have read about the PolyField type and it seems the best solution so far,
but I can't find documentation on how to use or implement a custom
field.
Any help is appreciated.


--
Leonardo S Souza


solr.xml entries got deleted when powered off

2012-08-15 Thread vempap
Hello,

  I created an index => all the schema.xml & solrconfig.xml files are
created with content (I checked that they have contents in the xml files).
But, if I power off the system & restart again - the contents of the files
are gone. It's like 0-byte files.

Even the solr.xml file, which got updated when I created a new index (with a
core), has 0 bytes & all the previous entries are lost too.

I'm using Solr 4.0

Does anyone have any idea about the scenarios where this might happen?

Thanks.



--
View this message in context: 
http://lucene.472066.n3.nabble.com/solr-xml-entries-got-deleted-when-powered-off-tp4001496.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: solr.xml entries got deleted when powered off

2012-08-15 Thread Leonardo Souza
Just guessing...
disk full?

--
Abraços,
Leonardo S Souza




2012/8/15 vempap phani.vemp...@emc.com

 Hello,

    I created an index => all the schema.xml & solrconfig.xml files are
  created with content (I checked that they have contents in the xml files).
  But, if I power off the system & restart again - the contents of the files
  are gone. It's like 0-byte files.

  Even the solr.xml file, which got updated when I created a new index (with
  a
  core), has 0 bytes & all the previous entries are lost too.

  I'm using Solr 4.0

  Does anyone have any idea about the scenarios where this might happen?

 Thanks.



 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/solr-xml-entries-got-deleted-when-powered-off-tp4001496.html
 Sent from the Solr - User mailing list archive at Nabble.com.



Re: solr.xml entries got deleted when powered off

2012-08-15 Thread vempap
nopes .. there is a good amount of space left on disk



--
View this message in context: 
http://lucene.472066.n3.nabble.com/solr-xml-entries-got-deleted-when-powered-off-tp4001496p4001502.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: solr.xml entries got deleted when powered off

2012-08-15 Thread vempap
It's happening when I'm not doing a clean shutdown. Are there any more
scenarios where it might happen?



--
View this message in context: 
http://lucene.472066.n3.nabble.com/solr-xml-entries-got-deleted-when-powered-off-tp4001496p4001503.html
Sent from the Solr - User mailing list archive at Nabble.com.


RE: solr.xml entries got deleted when powered off

2012-08-15 Thread Buttler, David
You are not putting these files in /tmp are you?  That is sometimes wiped by 
different OS's on shutdown


-Original Message-
From: vempap [mailto:phani.vemp...@emc.com] 
Sent: Wednesday, August 15, 2012 3:31 PM
To: solr-user@lucene.apache.org
Subject: Re: solr.xml entries got deleted when powered off

It's happening when I'm not doing a clean shutdown. Are there any more
scenarios where it might happen?



--
View this message in context: 
http://lucene.472066.n3.nabble.com/solr-xml-entries-got-deleted-when-powered-off-tp4001496p4001503.html
Sent from the Solr - User mailing list archive at Nabble.com.


RE: solr.xml entries got deleted when powered off

2012-08-15 Thread vempap
No, I'm not keeping them in /tmp



--
View this message in context: 
http://lucene.472066.n3.nabble.com/solr-xml-entries-got-deleted-when-powered-off-tp4001496p4001506.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: SOLR3.6:Field Collapsing/Grouping throws OOM

2012-08-15 Thread Chris Hostetter

: 2> Use external file fields (EFF) for the same purpose, that
:  won't require you to re-index the doc. The trick
:  here is you use the value in the EFF as a multiplier
:  for the score (that's what function queries do). So older
:  versions of the doc have scores of 0 and just don't
:  show up.

or use it in an fq={!frange ...} to eliminate the older versions 
completely.
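
For example, assuming the EFF/boolean field is named current and holds 1
only for the latest version (the name is just an illustration):

fq={!frange l=1}current

keeps only the docs whose current value is at least 1.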

:  I will back up and let you know the use case. I am tracking file 
: versions. And I want to give an option to browse your system for the 
: latest files. So in order to remove dups (same filename) I used 
: grouping.

based on only knowing that sentence, my starting suggestion would be to 
have two indexes: one where the filename is the unique key, thus only the 
most current versions of files are listed, and one where there is no 
unique key (or you use whatever key you use today) that lets you do the 
full historical archive search, and query whichever index makes sense for 
each user action.



-Hoss


Re: Atomic Multicore Operations - E.G. Move Docs

2012-08-15 Thread Nicholas Ball

Haven't managed to find a good way to do this yet. Does anyone have any
ideas on how I could implement this feature?
Really need to move docs across from one core to another atomically.

Many thanks,
Nicholas

On Mon, 02 Jul 2012 04:37:12 -0600, Nicholas Ball
nicholas.b...@nodelay.com wrote:
 That could work, but then how do you ensure commit is called on the two
 cores at the exact same time?
 
 Cheers,
 Nicholas
 
 On Sat, 30 Jun 2012 16:19:31 -0700, Lance Norskog goks...@gmail.com
 wrote:
 Index all documents to both cores, but do not call commit until both
 report that indexing worked. If one of the cores throws an exception,
 call roll back on both cores.
 
 On Sat, Jun 30, 2012 at 6:50 AM, Nicholas Ball
 nicholas.b...@nodelay.com wrote:

 Hey all,

 Trying to figure out the best way to perform atomic operation across
 multiple cores on the same solr instance i.e. a multi-core
environment.

 An example would be to move a set of docs from one core onto another
 core
 and ensure that a softcommit is done at the exact same time. If one
 were
 to
 fail so would the other.
 Obviously this would probably require some customization but wanted to
 know what the best way to tackle this would be and where should I be
 looking in the source.

 Many thanks for the help in advance,
 Nicholas a.k.a. incunix


Re: Atomic Multicore Operations - E.G. Move Docs

2012-08-15 Thread Li Li
On 2012-07-02 at 6:37 PM, Nicholas Ball nicholas.b...@nodelay.com wrote:


 That could work, but then how do you ensure commit is called on the two
 cores at the exact same time?
that may need something like two-phase commit as in a relational DB. Lucene
has prepareCommit, but to implement 2PC, many things need to be done.
 Also, any way to commit a specific update rather than all the back-logged
 ones?

 Cheers,
 Nicholas

 On Sat, 30 Jun 2012 16:19:31 -0700, Lance Norskog goks...@gmail.com
 wrote:
  Index all documents to both cores, but do not call commit until both
  report that indexing worked. If one of the cores throws an exception,
  call roll back on both cores.
 
  On Sat, Jun 30, 2012 at 6:50 AM, Nicholas Ball
  nicholas.b...@nodelay.com wrote:
 
  Hey all,
 
  Trying to figure out the best way to perform atomic operation across
  multiple cores on the same solr instance i.e. a multi-core environment.
 
  An example would be to move a set of docs from one core onto another
 core
  and ensure that a softcommit is done at the exact same time. If one
 were
  to
  fail so would the other.
  Obviously this would probably require some customization but wanted to
  know what the best way to tackle this would be and where should I be
  looking in the source.
 
  Many thanks for the help in advance,
  Nicholas a.k.a. incunix


Re: Atomic Multicore Operations - E.G. Move Docs

2012-08-15 Thread Li Li
Do you really need this?
Distributed transactions are a difficult problem. In 2PC, every node could
fail, including the coordinator. Something like leader election is needed to
make sure it works. You could maybe try ZooKeeper.
But if the transaction is not very, very important, like transferring money
in a bank, you can do it like this.
coordinator:
On 2012-08-16 at 7:42 AM, Nicholas Ball nicholas.b...@nodelay.com wrote:


 Haven't managed to find a good way to do this yet. Does anyone have any
 ideas on how I could implement this feature?
 Really need to move docs across from one core to another atomically.

 Many thanks,
 Nicholas

 On Mon, 02 Jul 2012 04:37:12 -0600, Nicholas Ball
 nicholas.b...@nodelay.com wrote:
  That could work, but then how do you ensure commit is called on the two
  cores at the exact same time?
 
  Cheers,
  Nicholas
 
  On Sat, 30 Jun 2012 16:19:31 -0700, Lance Norskog goks...@gmail.com
  wrote:
  Index all documents to both cores, but do not call commit until both
  report that indexing worked. If one of the cores throws an exception,
  call roll back on both cores.
 
  On Sat, Jun 30, 2012 at 6:50 AM, Nicholas Ball
  nicholas.b...@nodelay.com wrote:
 
  Hey all,
 
  Trying to figure out the best way to perform atomic operation across
  multiple cores on the same solr instance i.e. a multi-core
 environment.
 
  An example would be to move a set of docs from one core onto another
  core
  and ensure that a softcommit is done as the exact same time. If one
  were
  to
  fail so would the other.
  Obviously this would probably require some customization but wanted to
  know what the best way to tackle this would be and where should I be
  looking in the source.
 
  Many thanks for the help in advance,
  Nicholas a.k.a. incunix



Re: Atomic Multicore Operations - E.G. Move Docs

2012-08-15 Thread Li Li
http://zookeeper.apache.org/doc/r3.3.6/recipes.html#sc_recipes_twoPhasedCommit

On Thu, Aug 16, 2012 at 7:41 AM, Nicholas Ball
nicholas.b...@nodelay.com wrote:

 Haven't managed to find a good way to do this yet. Does anyone have any
 ideas on how I could implement this feature?
 Really need to move docs across from one core to another atomically.

 Many thanks,
 Nicholas

 On Mon, 02 Jul 2012 04:37:12 -0600, Nicholas Ball
 nicholas.b...@nodelay.com wrote:
 That could work, but then how do you ensure commit is called on the two
 cores at the exact same time?

 Cheers,
 Nicholas

 On Sat, 30 Jun 2012 16:19:31 -0700, Lance Norskog goks...@gmail.com
 wrote:
 Index all documents to both cores, but do not call commit until both
 report that indexing worked. If one of the cores throws an exception,
 call roll back on both cores.

 On Sat, Jun 30, 2012 at 6:50 AM, Nicholas Ball
 nicholas.b...@nodelay.com wrote:

 Hey all,

 Trying to figure out the best way to perform atomic operation across
 multiple cores on the same solr instance i.e. a multi-core
 environment.

 An example would be to move a set of docs from one core onto another
 core
 and ensure that a softcommit is done at the exact same time. If one
 were
 to
 fail so would the other.
 Obviously this would probably require some customization but wanted to
 know what the best way to tackle this would be and where should I be
 looking in the source.

 Many thanks for the help in advance,
 Nicholas a.k.a. incunix


Re: SOLR3.6:Field Collapsing/Grouping throws OOM

2012-08-15 Thread Tirthankar Chatterjee
Awesome, thanks a lot, I am already on it with option 1. We need to track deletes 
to flip the previous one to be the current.

Erick Erickson erickerick...@gmail.com wrote:


No, sharding into multiple cores on the same machine still
is limited by the physical memory available. It's still lots
of stuff on a limited box.

But try backing up and re-thinking the problem a bit.
Some possibilities off the top of my head:

1> have a new field "current". When you update a doc,
 reindex the old doc with current=0 and put current=
 1 in the new doc (boolean field). Getting one and
 only one is really simple.
2> Use external file fields (EFF) for the same purpose, that
 won't require you to re-index the doc. The trick
 here is you use the value in the EFF as a multiplier
 for the score (that's what function queries do). So older
 versions of the doc have scores of 0 and just don't
 show up.
3> Implement a custom collector that replaces older hits
 with newer hits. Actually I don't particularly like this
 because it would potentially replace a higher-scoring
document with a lower-scoring one in the results list...

Bottom line here is I don't think grouping is a good approach
for this problem

Best
Erick

On Wed, Aug 15, 2012 at 5:04 AM, Tirthankar Chatterjee
tchatter...@commvault.com wrote:
 Hi Erick,
 You are so right on the memory calculations. I am happy that I know now that 
 I was doing something wrong. Yes I am getting confused with SQL.

 I will back up and let you know the use case. I am tracking file versions. 
 And I want to give an option to browse your system for the latest files. So 
 in order to remove dups (same filename) I used grouping.

 Also, when you say sharding, is it okay if I do multiple cores, and does it mean 
 that each core needs a separate Tomcat? I meant to say, can I use the same 
 machine? 150 mill docs have 120 mill unique paths too.

 One more thing. If I need sharding and need a new box then it won't be great, 
 because this system still has horsepower left which I can use.

 Thanks a ton for explaining the issue.

 Erick Erickson erickerick...@gmail.com wrote:


 You're putting a lot of data on a single box, then
 asking to group on what I presume is a string
 field. That's just going to eat up a _bunch_ of
 memory.

 let's say your average file name is 16 bytes long. Each
 unique value will take up 58 + 32 bytes (58 bytes
 of overhead, I'm presuming Solr 3.X and 16*2 bytes
 for the chars). So, we're up to 90 bytes/string * number
 of distinct file names) Say you have, for argument's
 sake, 100M distinct file names. You're up to 9G
 memory requirement for sorting alone. Solr's
 sorting reads all the unique values into memory whether
 or not they satisfy the query...

 And Grouping can also be expensive. I don't think
 you really want to group in this case, I'd simply use
 a filter query something like:
 fq=filefolder:E\:\\pd_dst\\646c6907-a948-4b83-ac1d-d44742bb0307

 Then you're also grouping on conv_sort which doesn't
 make much sense, do you really want individual results returned
 for _each_ file name?

 What it looks like to me is you're confusing SQL with
 solr search and getting into bad situations...

 Also, 150M documents in a single shard is...really a lot.
 You're probably at a point where you need to shard. Not
 to mention that your 400G index is trying to be jammed
 into 12G of memory.

 This actually feels like an XY problem, can you back
 up and let us know what the use-case you're
 trying to solve is? Perhaps there are less memory-
 consumptive solutions possible.

 Best
 Erick

 On Tue, Aug 14, 2012 at 6:38 AM, Tirthankar Chatterjee
 tchatter...@commvault.com wrote:
 Editing the query...remove smb:. I don't know where it came from while 
 I did copy/paste

 Tirthankar Chatterjee tchatter...@commvault.com wrote:


 Hi,
 I have a beefy box with 24Gb RAM (12GB for Tomcat7 which houses SOLR3.6) & 2 
 Processors Intel Xeon 64 bit Server, 30TB HDD. JDK 1.7.0_03 x64 bit


 Data Index Dir Size: 400GB
 Metadata of files is stored in it. I have around 15 schema fields.
 Total number of items:150million approx.

 I have a scenario which I will try to explain to the best of my knowledge 
 here:

 Let us consider the fields I am interested in

 Url: Entire path of a file in windows file system including the filename. 
 ex:C:\Documents\A.txt
 mtm: Modified Time of the file
 Jid:JOb ID
 conv_sort is string field type where the filename is stored.

 I run a job where the following gets inserted

 Total Items:2
 Url:C:\personal\A1.txt
 mtm:08/14/2012 12:00:00
 Jid:1
 Conv_sort:A1.txt
 ---
 Url:C:\personal\B1.txt
 mtm:08/14/2012 12:01:00
 Jid:1
 Conv_sort:B1.txt
 In the second run only one item changes:

 Url:C:\personal\A1.txt
 mtm:08/15/2012 1:00:00
 Jid:2
 Conv_sort=A1.txt

 When queried I would like to return the latest A1.txt and B1.txt back to the 
 end user. I am trying to use grouping with no luck. 

Re: Does DataImportHandler do any sanitizing?

2012-08-15 Thread Lance Norskog
If you want to sanitize them during indexing, the regular expression
tools can do this. You would create a regular expression that matches
bogus elements. There is a regular expression transformer in the DIH,
and a regular expression CharFilter inside the Lucene text analysis
stack.
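
For example, a sketch of the CharFilter approach (the pattern is only
illustrative and would need tuning; note a CharFilter changes what gets
indexed, not the stored value):

<analyzer>
  <!-- rewrite a bare ampersand that is not part of an entity -->
  <charFilter class="solr.PatternReplaceCharFilterFactory"
              pattern="&amp;(?!\w+;)" replacement="and"/>
  <tokenizer class="solr.StandardTokenizerFactory"/>
</analyzer>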

On Wed, Aug 15, 2012 at 2:10 PM, Michael Della Bitta
michael.della.bi...@appinions.com wrote:
 Hi, Jon,

 As far as I know, DataImportHandler doesn't transfer data to the rest
 of Solr via XML so it shouldn't be a problem...

 Michael Della Bitta

 
 Appinions | 18 East 41st St., Suite 1806 | New York, NY 10017
 www.appinions.com
 Where Influence Isn’t a Game


 On Wed, Aug 15, 2012 at 5:03 PM, Jon Drukman jdruk...@gmail.com wrote:
 I am pulling some fields from a mysql database using DataImportHandler and
 some of them have invalid XML in them.  Does DataImportHandler do any kind
 of filtering/sanitizing to ensure that it will go in OK or is it all on me?

 Example bad data:  orphaned ampersands (Peanut Butter & Jelly), curly
 quotes (we’re)

 -jsd-



-- 
Lance Norskog
goks...@gmail.com