Boost documents based on criteria

2015-01-23 Thread Jorge Luis Betancourt González
Hi all,

Recently I got an interesting use case that I'm not sure how to implement. The
idea is that the client wants a fixed number of documents, let's call it N, to
appear at the top of the results. Let me explain a little: we're working with
web documents, and the idea is to promote the documents that match the user's
query and come from a given domain (wikipedia, for example) to the top of the
list. So if I apply a boost using the boost parameter:

http://localhost:8983/solr/select?q=search&fl=url&boost=map(query($type1query),0,0,1,50)&type1query=host:wikipedia

I get *all* the documents from the desired host at the top, but there is no way
of limiting how many documents from that host are boosted to the top of the
result list (which could lead to several pages of content from the same host,
which is not desired; the idea is to show only N). I was thinking of something
like field collapsing/grouping, but only for the documents that match my
$type1query parameter (host:wikipedia), and I don't see any way of doing
grouping/collapsing on only one group while leaving the other results untouched.

I also thought of using two groups, with group.query=host:wikipedia and
group.query=-host:wikipedia, but in that case there is no way of controlling
how many documents each group will return independently.
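
For concreteness, here is a rough SolrJ sketch of that two-group request (the
host value, the base URL and the limit of 3 are only illustrative). Note that
group.limit caps the number of documents returned per group, but the same cap
applies to every group, which is exactly the limitation mentioned above:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.params.GroupParams;

public class GroupedPromotionSketch {
    public static void main(String[] args) throws Exception {
        HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr");
        SolrQuery q = new SolrQuery("search");
        q.setFields("url");
        q.set(GroupParams.GROUP, true);
        // one group for the promoted host, one group for everything else
        q.add(GroupParams.GROUP_QUERY, "host:wikipedia");
        q.add(GroupParams.GROUP_QUERY, "-host:wikipedia");
        // caps the documents returned per group; applies to both groups equally
        q.set(GroupParams.GROUP_LIMIT, 3);
        QueryResponse rsp = solr.query(q);
        System.out.println(rsp.getGroupResponse().getValues());
        solr.shutdown();
    }
}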

Any thoughts or recommendations on this? 

Thank you,

Regards,





Re: Solr regex query help

2015-01-23 Thread Erick Erickson
Right. As I mentioned on the original JIRA, the regex match is happening on
_terms_. You are conflating the original input (the entire field) with the
individual terms that the regex is applied to.

I suggest that you look at the admin/analysis page. There you'll see the terms
that are actually indexed, and you'll see that the regex cannot work as
written, since it assumes it is applied to the entire input rather than to the
output of the analysis chain.

I further suggest that you explore tokenization and how individual terms are
searched. The admin/analysis page is invaluable in this endeavor.

The root cause of your confusion is that, because you're using
ClassicTokenizer, you have a bunch of individual terms that are being
searched, _not_ the whole input. So the regex is bound to fail as long as
you're thinking in terms of the entire input rather than the result of your
analysis chain, i.e. tokenization + filters as defined in schema.xml.
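
For illustration, a minimal standalone sketch (assuming the Lucene 4.10.x
analyzers-common jar is on the classpath) that prints the terms
ClassicTokenizer actually produces for the sample line from the original post;
this is roughly what the admin/analysis page shows for the tokenizer stage:

import java.io.StringReader;

import org.apache.lucene.analysis.standard.ClassicTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class ShowTerms {
    public static void main(String[] args) throws Exception {
        String field = "1b ::PIPE:: 04/14/2014 ::PIPE:: 01:32:48 ::PIPE:: BMC "
                + "Power/Reset action ::PIPE:: Delayed shutdown timer disabled ::PIPE:: Asserted";
        ClassicTokenizer tokenizer = new ClassicTokenizer(new StringReader(field));
        CharTermAttribute term = tokenizer.addAttribute(CharTermAttribute.class);
        tokenizer.reset();
        while (tokenizer.incrementToken()) {
            // each printed line is a separate indexed term; the regex is matched
            // against these terms, not against the original line as a whole
            System.out.println(term.toString());
        }
        tokenizer.end();
        tokenizer.close();
    }
}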

FWIW,
Erick

On Fri, Jan 23, 2015 at 8:58 PM, Arumugam, Suresh 
wrote:

> Hi All,
>
>
>
> We have indexed the documents to Solr & not able to query using the Regex.
>
>
>
> Our data looks like as below in a Text Field, which is indexed using the
> ClassicTokenizer.
>
>
>
> *1b ::PIPE:: 04/14/2014 ::PIPE:: 01:32:48 ::PIPE:: BMC
> Power/Reset action  ::PIPE:: Delayed shutdown timer disabled ::PIPE::
> Asserted*
>
>
>
> We tried lookup this string with the Regex.
>
> *PIPE*[0-9]{2}\/[0-9}{2}\/[0-9]{4}*Delayed shutdown*Asserted*
>
>
>
> Since the analyzer tokenized the data, the regex match is
> happening on the terms & it’s not working as we expect.
>
>
>
> Can you please help us in finding an equivalent way to query this in Solr
> ?
>
>
>
> The following are the details about our environment.
>
>
>
> 1.   Solr 4.10.3 as well as Solr 4.8
>
> 2.   JDK 1.7_51
>
> 3.   SolrConfig.xml & Schema.xml attached.
>
>
>
> The regex query as below is working
>
> msg:/[0-9]{2}/
>
>
>
> But when we want to match more than one terms the regex doesn't seems to
> be working.
>
> Please help us in resolving this issue.
>
>
>
> Thanks in advance.
>
>
>
> Regards,
>
> Suresh.A
>


Solr regex query help

2015-01-23 Thread Arumugam, Suresh
Hi All,

We have indexed documents into Solr but are not able to query them using a regex.

Our data looks like as below in a Text Field, which is indexed using the 
ClassicTokenizer.

1b ::PIPE:: 04/14/2014 ::PIPE:: 01:32:48 ::PIPE:: BMC 
Power/Reset action  ::PIPE:: Delayed shutdown timer disabled ::PIPE:: Asserted

We tried to look up this string with the following regex:
PIPE*[0-9]{2}\/[0-9}{2}\/[0-9]{4}*Delayed shutdown*Asserted

Since the analyzer tokenizes the data, the regex match happens on the
individual terms, and it's not working as we expect.

Can you please help us find an equivalent way to query this in Solr?

The following are the details about our environment.


1.   Solr 4.10.3 as well as Solr 4.8

2.   JDK 1.7_51

3.   SolrConfig.xml & Schema.xml attached.

The regex query below works:
msg:/[0-9]{2}/

But when we want to match more than one term, the regex doesn't seem to work.
Please help us in resolving this issue.

Thanks in advance.

Regards,
Suresh.A


Re: Replicas fall into recovery mode right after update

2015-01-23 Thread Nishanth S
Can you tell us what version of Solr you are using and what causes your
replicas to go into recovery?

On Fri, Jan 23, 2015 at 8:40 PM, gouthsmsimhadri 
wrote:

> I'm working with a cluster of solr-cloud servers at a configration of 10
> shards and 4 replicas on each shard in stress environment.
> Planned production configuration is 10 shards and 15 replicas on each
> shard.
>
> Current commit settings are as follows
>
> 
> 50
> 18
> 
>
> 
> 200
> 18
> false
> 
>
>
> The application requires to index approximately 90 Million docs which is
> indexed in two ways
> a)  Full indexing. It takes 4 hours to index 90 Million docs and the
> rate of
> docs coming to the searcher is around 6000 per second
> b)  Incremental indexing. It takes an hour to index delta changes.
> Roughly
> there are 3 million changes and rate of docs coming to the searchers is
> 2500
> per second
>
> I use two collections for example collection1 and collection2
> Each collection has system settings at 12 GB of available RAM and quad core
> Intel(R) Xeon(R) CPU X5570  @ 2.93GHz
>
> Full indexing is always performed on a collection which is not serving live
> traffic and Once job is completed we swap collection so the collection with
> latest data serves traffic and other is inactive.
>
> The other mode of incremental indexing  is performed  always on the
> collection which is serving live traffic.
>
> The problem is in about 10 minutes of indexing is triggered, the replicas
> goes in to recovery mode. This happens on all the shards. In about 20
> minutes or more rest of replicas start to fall into recovery mode. In about
> half an hour all replicas except the leader is in recovery mode.
>
> I cannot throttle the indexing load as that will increase our overall
> indexing time. So to overcome this issue, I remove all the replicas before
> the indexing is started and then add them after the indexing completes.
>
> The behavior(replicas falling into recovery mode) in incremental mode of
> indexing is troublesome as i cannot remove replicas during incremental
> indexing since it serves live traffic, i tried to throttle the speed at
> which documents are indexed but with no success as the cluster still goes
> on
> recovery.
>
> If i let the cluster as is the indexing  eventually completes and also
> recovers after a while, but since this is serving live traffic i just
> cannot
> let these replicas go into recovery mode since it degrades the search
> performance also (from the tests performed).
>
> I tried different commit settings like the below
> a)  No auto soft commit, no auto hard commit and a commit triggered at
> the
> end of indexing
> b)  No auto soft commit, yes auto hard commit and a commit in the end
> of
> indexing
> c)  Yes auto soft commit , no auto hard commit
> d)  Yes auto soft commit , yes auto hard commit
> e)  Different frequency setting for commits for above
>
> Unfortunately all the above yields the same behavior . The replicas still
> goes in recovery
>
> I have increased the zookeeper timeout from 30 seconds to 5 minutes and the
> problem persists.
>
> Is there any setting that would fix this issue ?
>
>
>
>
> -
>  -goutham
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/SolrCloud-Replicas-fall-into-recovery-mode-right-after-update-tp4181706.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


Replicas fall into recovery mode right after update

2015-01-23 Thread gouthsmsimhadri
I'm working with a cluster of solr-cloud servers at a configuration of 10
shards and 4 replicas on each shard in a stress environment.
The planned production configuration is 10 shards and 15 replicas on each shard.

Current commit settings are as follows


50
18



200
18
false



The application requires indexing approximately 90 million docs, which is done
in two ways:
a)  Full indexing. It takes 4 hours to index the 90 million docs, and the rate
of docs coming to the searcher is around 6000 per second.
b)  Incremental indexing. It takes an hour to index delta changes. Roughly
there are 3 million changes, and the rate of docs coming to the searchers is
2500 per second.

I use two collections, for example collection1 and collection2.
Each collection runs with 12 GB of available RAM and a quad-core
Intel(R) Xeon(R) CPU X5570 @ 2.93GHz.

Full indexing is always performed on the collection that is not serving live
traffic, and once the job is completed we swap collections, so the collection
with the latest data serves traffic and the other becomes inactive.

Incremental indexing, on the other hand, is always performed on the
collection that is serving live traffic.

The problem is that within about 10 minutes of indexing being triggered, the
replicas go into recovery mode. This happens on all the shards. In about 20
minutes or more the rest of the replicas start to fall into recovery, and in
about half an hour all replicas except the leaders are in recovery mode.

I cannot throttle the indexing load as that will increase our overall
indexing time. So to overcome this issue, I remove all the replicas before
the indexing is started and then add them after the indexing completes.

The behavior (replicas falling into recovery mode) is especially troublesome in
the incremental mode of indexing, as I cannot remove replicas during
incremental indexing since the collection serves live traffic. I tried to
throttle the speed at which documents are indexed, but with no success, as the
cluster still goes into recovery.

If I leave the cluster as is, the indexing eventually completes and the cluster
also recovers after a while, but since this is serving live traffic I just
cannot let these replicas go into recovery mode, since it also degrades search
performance (based on the tests performed).

I tried different commit settings like the below
a)  No auto soft commit, no auto hard commit and a commit triggered at the
end of indexing
b)  No auto soft commit, yes auto hard commit and a commit in the end of
indexing
c)  Yes auto soft commit , no auto hard commit 
d)  Yes auto soft commit , yes auto hard commit 
e)  Different frequency setting for commits for above

Unfortunately all of the above yield the same behavior: the replicas still
go into recovery.

I have increased the zookeeper timeout from 30 seconds to 5 minutes and the
problem persists. 

Is there any setting that would fix this issue?




-
 -goutham
--
View this message in context: 
http://lucene.472066.n3.nabble.com/SolrCloud-Replicas-fall-into-recovery-mode-right-after-update-tp4181706.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Need help importing data

2015-01-23 Thread Carl Roberts

NVM

I figured this out.  The problem was this: pk="link" in
rss-data-config.xml, but the unique key in schema.xml is not link, it is id.


From rss-data-config.xml:

https://nvd.nist.gov/feeds/xml/cve/nvdcve-2.0-2002.xml.zip";
processor="XPathEntityProcessor"
forEach="/nvd/entry">

commonField="true" />
commonField="true" />




From schema.xml:

* id

What really bothers me is that there were no errors output by Solr to 
indicate this type of misconfiguration, and all the messages that 
Solr gave indicated the import was successful.  This lack of appropriate 
error reporting is a pain, especially for someone learning Solr.


Switching pk="link" to pk="id" solved the problem and I was then able to 
import the data.



On 1/23/15, 9:39 PM, Carl Roberts wrote:

Hi,

I have set log4j logging to level DEBUG and I have also modified the 
code to see what is being imported and I can see the nextRow() 
records, and the import is successful, however I have no data. Can 
someone please help me figure this out?


Here is the logging output:

ow:  r1={{id=CVE-2002-2353, cve=CVE-2002-2353, cwe=CWE-264, 
$forEach=/nvd/entry}}
2015-01-23 21:28:04,606- 
INFO-[Thread-15]-[XPathEntityProcessor.java:251]-org.apache.solr.handler.dataimport.XPathEntityProcessor.fetchNextRow: 
r3={{id=CVE-2002-2353, cve=CVE-2002-2353, cwe=CWE-264, 
$forEach=/nvd/entry}}
2015-01-23 21:28:04,606- 
INFO-[Thread-15]-[XPathEntityProcessor.java:221]-org.apache.solr.handler.dataimport.XPathEntityProcessor.fetchNextRow: 
URL={url}
2015-01-23 21:28:04,606- 
INFO-[Thread-15]-[XPathEntityProcessor.java:227]-org.apache.solr.handler.dataimport.XPathEntityProcessor.fetchNextRow: 
r1={{id=CVE-2002-2354, cve=CVE-2002-2354, cwe=CWE-20, 
$forEach=/nvd/entry}}
2015-01-23 21:28:04,606- 
INFO-[Thread-15]-[XPathEntityProcessor.java:251]-org.apache.solr.handler.dataimport.XPathEntityProcessor.fetchNextRow: 
r3={{id=CVE-2002-2354, cve=CVE-2002-2354, cwe=CWE-20, 
$forEach=/nvd/entry}}
2015-01-23 21:28:04,606- 
INFO-[Thread-15]-[XPathEntityProcessor.java:221]-org.apache.solr.handler.dataimport.XPathEntityProcessor.fetchNextRow: 
URL={url}
2015-01-23 21:28:04,606- 
INFO-[Thread-15]-[XPathEntityProcessor.java:227]-org.apache.solr.handler.dataimport.XPathEntityProcessor.fetchNextRow: 
r1={{id=CVE-2002-2355, cve=CVE-2002-2355, cwe=CWE-255, 
$forEach=/nvd/entry}}
2015-01-23 21:28:04,606- 
INFO-[Thread-15]-[XPathEntityProcessor.java:251]-org.apache.solr.handler.dataimport.XPathEntityProcessor.fetchNextRow: 
r3={{id=CVE-2002-2355, cve=CVE-2002-2355, cwe=CWE-255, 
$forEach=/nvd/entry}}
2015-01-23 21:28:04,607- 
INFO-[Thread-15]-[XPathEntityProcessor.java:221]-org.apache.solr.handler.dataimport.XPathEntityProcessor.fetchNextRow: 
URL={url}
2015-01-23 21:28:04,607- 
INFO-[Thread-15]-[XPathEntityProcessor.java:227]-org.apache.solr.handler.dataimport.XPathEntityProcessor.fetchNextRow: 
r1={{id=CVE-2002-2356, cve=CVE-2002-2356, cwe=CWE-264, 
$forEach=/nvd/entry}}
2015-01-23 21:28:04,607- 
INFO-[Thread-15]-[XPathEntityProcessor.java:251]-org.apache.solr.handler.dataimport.XPathEntityProcessor.fetchNextRow: 
r3={{id=CVE-2002-2356, cve=CVE-2002-2356, cwe=CWE-264, 
$forEach=/nvd/entry}}
2015-01-23 21:28:04,607- 
INFO-[Thread-15]-[XPathEntityProcessor.java:221]-org.apache.solr.handler.dataimport.XPathEntityProcessor.fetchNextRow: 
URL={url}
2015-01-23 21:28:04,607- 
INFO-[Thread-15]-[XPathEntityProcessor.java:227]-org.apache.solr.handler.dataimport.XPathEntityProcessor.fetchNextRow: 
r1={{id=CVE-2002-2357, cve=CVE-2002-2357, cwe=CWE-119, 
$forEach=/nvd/entry}}
2015-01-23 21:28:04,607- 
INFO-[Thread-15]-[XPathEntityProcessor.java:251]-org.apache.solr.handler.dataimport.XPathEntityProcessor.fetchNextRow: 
r3={{id=CVE-2002-2357, cve=CVE-2002-2357, cwe=CWE-119, 
$forEach=/nvd/entry}}
2015-01-23 21:28:04,607- 
INFO-[Thread-15]-[XPathEntityProcessor.java:221]-org.apache.solr.handler.dataimport.XPathEntityProcessor.fetchNextRow: 
URL={url}
2015-01-23 21:28:04,607- 
INFO-[Thread-15]-[XPathEntityProcessor.java:227]-org.apache.solr.handler.dataimport.XPathEntityProcessor.fetchNextRow: 
r1={{id=CVE-2002-2358, cve=CVE-2002-2358, cwe=CWE-79, 
$forEach=/nvd/entry}}
2015-01-23 21:28:04,607- 
INFO-[Thread-15]-[XPathEntityProcessor.java:251]-org.apache.solr.handler.dataimport.XPathEntityProcessor.fetchNextRow: 
r3={{id=CVE-2002-2358, cve=CVE-2002-2358, cwe=CWE-79, 
$forEach=/nvd/entry}}
2015-01-23 21:28:04,607- 
INFO-[Thread-15]-[XPathEntityProcessor.java:221]-org.apache.solr.handler.dataimport.XPathEntityProcessor.fetchNextRow: 
URL={url}
2015-01-23 21:28:04,607- 
INFO-[Thread-15]-[XPathEntityProcessor.java:227]-org.apache.solr.handler.dataimport.XPathEntityProcessor.fetchNextRow: 
r1={{id=CVE-2002-2359, cve=CVE-2002-2359, cwe=CWE-79, 
$forEach=/nvd/entry}}
2015-01-23 21:28:04,607- 
INFO-[Thread-15]-[XPathEntityProcessor.java:251]-org.apache.solr.handler.dataimport.XPathEntityProcessor.fetchNextRow: 
r3={{id=CVE-20

Re: Need Help with custom ZIPURLDataSource class

2015-01-23 Thread Carl Roberts

NVM - I have this working.

The problem was this: pk="link" in rss-data-config.xml, but the unique key in
schema.xml is not link, it is id.


From rss-data-config.xml:


url="https://nvd.nist.gov/feeds/xml/cve/nvdcve-2.0-2002.xml.zip";

processor="XPathEntityProcessor"
forEach="/nvd/entry">

commonField="true" />
commonField="true" />




From schema.xml:

* id

What really bothers me is that there were no errors output by Solr to 
indicate this type of misconfiguration, and all the messages that 
Solr gave indicated the import was successful.  This lack of appropriate 
error reporting is a pain, especially for someone learning Solr.


Switching pk="link" to pk="id" solved the problem and I was then able to 
import the data.

On 1/23/15, 6:34 PM, Carl Roberts wrote:


Hi,

I created a custom ZIPURLDataSource class to unzip the content from an
http URL for an XML ZIP file and it seems to be working (at least I have
no errors), but no data is imported.

Here is my configuration in rss-data-config.xml:




https://nvd.nist.gov/feeds/xml/cve/nvdcve-2.0-2002.xml.zip";
processor="XPathEntityProcessor"
forEach="/nvd/entry"
transformer="DateFormatTransformer">




xpath="/nvd/entry/vulnerable-software-list/product" 
commonField="false" />









Attached is the ZIPURLDataSource.java file.

It actually unzips and saves the raw XML to disk, which I have 
verified to be a valid XML file.  The file has one or more entries 
(here is an example):


http://scap.nist.gov/schema/scap-core/0.1";
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance";
xmlns:patch="http://scap.nist.gov/schema/patch/0.1";
xmlns:vuln="http://scap.nist.gov/schema/vulnerability/0.4";
xmlns:cvss="http://scap.nist.gov/schema/cvss-v2/0.2";
xmlns:cpe-lang="http://cpe.mitre.org/language/2.0";
xmlns="http://scap.nist.gov/schema/feed/vulnerability/2.0";
pub_date="2015-01-10T05:37:05"
xsi:schemaLocation="http://scap.nist.gov/schema/patch/0.1
http://nvd.nist.gov/schema/patch_0.1.xsd
http://scap.nist.gov/schema/scap-core/0.1
http://nvd.nist.gov/schema/scap-core_0.1.xsd
http://scap.nist.gov/schema/feed/vulnerability/2.0
http://nvd.nist.gov/schema/nvd-cve-feed_2.0.xsd"; nvd_xml_version="2.0">

http://nvd.nist.gov/";>



























cpe:/o:freebsd:freebsd:2.2.8
cpe:/o:freebsd:freebsd:1.1.5.1
cpe:/o:freebsd:freebsd:2.2.3
cpe:/o:freebsd:freebsd:2.2.2
cpe:/o:freebsd:freebsd:2.2.5
cpe:/o:freebsd:freebsd:2.2.4
cpe:/o:freebsd:freebsd:2.0.5
cpe:/o:freebsd:freebsd:2.2.6
cpe:/o:freebsd:freebsd:2.1.6.1
cpe:/o:freebsd:freebsd:2.0.1
cpe:/o:freebsd:freebsd:2.2
cpe:/o:freebsd:freebsd:2.0
cpe:/o:openbsd:openbsd:2.3
cpe:/o:freebsd:freebsd:3.0
cpe:/o:freebsd:freebsd:1.1
cpe:/o:freebsd:freebsd:2.1.6
cpe:/o:openbsd:openbsd:2.4
cpe:/o:bsdi:bsd_os:3.1
cpe:/o:freebsd:freebsd:1.0
cpe:/o:freebsd:freebsd:2.1.7
cpe:/o:freebsd:freebsd:1.2
cpe:/o:freebsd:freebsd:2.1.5
cpe:/o:freebsd:freebsd:2.1.7.1

CVE-1999-0001
1999-12-30T00:00:00.000-05:00 

2010-12-16T00:00:00.000-05:00 




5.0
NETWORK
LOW
NONE
NONE
NONE
PARTIAL
http://nvd.nist.gov
2004-01-01T00:00:00.000-05:00 






OSVDB
http://www.osvdb.org/5707";
xml:lang="en">5707


CONFIRM
http://www.openbsd.org/errata23.html#tcpfix";
xml:lang="en">http://www.openbsd.org/errata23.html#tcpfix 



ip_input.c in BSD-derived TCP/IP implementations allows
remote attackers to cause a denial of service (crash or hang) via
crafted packets.



Here is the curl command:

curl http://127.0.0.1:8983/solr/nvd-rss/dataimport?command=full-import

And here is the output from the console for Jetty:

main{StandardDirectoryReader(segments_1:1:nrt)}
2407 [coreLoadExecutor-5-thread-1] INFO
org.apache.solr.core.CoreContainer – registering core: nvd-rss
2409 [main] INFO org.apache.solr.servlet.SolrDispatchFilter –
user.dir=/Users/carlroberts/dev/solr-4.10.3/example
2409 [main] INFO org.apache.solr.servlet.SolrDispatchFilter –
SolrDispatchFilter.init() done
2431 [main] INFO org.eclipse.jetty.server.AbstractConnector – Started
SocketConnector@0.0.0.0:8983
2450 [searcherExecutor-6-thread-1] INFO org.apache.solr.core.SolrCore –
[nvd-rss] webapp=null path=null
params={event=firstSearcher&q=static+firstSearcher+warming+in+solrconfig.xml&distrib=false} 


hits=0 status=0 QTime=43
2451 [searcherExecutor-6-thread-1] INFO org.apache.solr.core.SolrCore –
QuerySenderListener done.
2451 [searcherExecutor-6-thread-1] INFO
org.apache.solr.handler.component.SpellCheckComponent – Loading spell
index for spellchecker: default
2451 [searcherExecutor-6-thread-1] INFO
org.apache.solr.handler.component.SpellCheckComponent – Loading spell
index for spellchecker: wordbreak
2452 [searcherExecutor-6-thread-1] INFO
org.apache.solr.handler.component.SuggestComponent – Loading suggester
index for: mySuggester
2452 [searcherExecutor-6-thread-1] INFO
org.apache.solr.spelling.suggest.SolrSuggester – reload()
2452 [searcherExecutor-6-thread-1] INFO
org.apache.solr.spelling.suggest.

Need help importing data

2015-01-23 Thread Carl Roberts

Hi,

I have set log4j logging to level DEBUG and I have also modified the 
code to see what is being imported. I can see the nextRow() records 
and the import reports success; however, I have no data.  Can someone 
please help me figure this out?


Here is the logging output:

ow:  r1={{id=CVE-2002-2353, cve=CVE-2002-2353, cwe=CWE-264, 
$forEach=/nvd/entry}}
2015-01-23 21:28:04,606- 
INFO-[Thread-15]-[XPathEntityProcessor.java:251]-org.apache.solr.handler.dataimport.XPathEntityProcessor.fetchNextRow: 
r3={{id=CVE-2002-2353, cve=CVE-2002-2353, cwe=CWE-264, $forEach=/nvd/entry}}
2015-01-23 21:28:04,606- 
INFO-[Thread-15]-[XPathEntityProcessor.java:221]-org.apache.solr.handler.dataimport.XPathEntityProcessor.fetchNextRow: 
URL={url}
2015-01-23 21:28:04,606- 
INFO-[Thread-15]-[XPathEntityProcessor.java:227]-org.apache.solr.handler.dataimport.XPathEntityProcessor.fetchNextRow: 
r1={{id=CVE-2002-2354, cve=CVE-2002-2354, cwe=CWE-20, $forEach=/nvd/entry}}
2015-01-23 21:28:04,606- 
INFO-[Thread-15]-[XPathEntityProcessor.java:251]-org.apache.solr.handler.dataimport.XPathEntityProcessor.fetchNextRow: 
r3={{id=CVE-2002-2354, cve=CVE-2002-2354, cwe=CWE-20, $forEach=/nvd/entry}}
2015-01-23 21:28:04,606- 
INFO-[Thread-15]-[XPathEntityProcessor.java:221]-org.apache.solr.handler.dataimport.XPathEntityProcessor.fetchNextRow: 
URL={url}
2015-01-23 21:28:04,606- 
INFO-[Thread-15]-[XPathEntityProcessor.java:227]-org.apache.solr.handler.dataimport.XPathEntityProcessor.fetchNextRow: 
r1={{id=CVE-2002-2355, cve=CVE-2002-2355, cwe=CWE-255, $forEach=/nvd/entry}}
2015-01-23 21:28:04,606- 
INFO-[Thread-15]-[XPathEntityProcessor.java:251]-org.apache.solr.handler.dataimport.XPathEntityProcessor.fetchNextRow: 
r3={{id=CVE-2002-2355, cve=CVE-2002-2355, cwe=CWE-255, $forEach=/nvd/entry}}
2015-01-23 21:28:04,607- 
INFO-[Thread-15]-[XPathEntityProcessor.java:221]-org.apache.solr.handler.dataimport.XPathEntityProcessor.fetchNextRow: 
URL={url}
2015-01-23 21:28:04,607- 
INFO-[Thread-15]-[XPathEntityProcessor.java:227]-org.apache.solr.handler.dataimport.XPathEntityProcessor.fetchNextRow: 
r1={{id=CVE-2002-2356, cve=CVE-2002-2356, cwe=CWE-264, $forEach=/nvd/entry}}
2015-01-23 21:28:04,607- 
INFO-[Thread-15]-[XPathEntityProcessor.java:251]-org.apache.solr.handler.dataimport.XPathEntityProcessor.fetchNextRow: 
r3={{id=CVE-2002-2356, cve=CVE-2002-2356, cwe=CWE-264, $forEach=/nvd/entry}}
2015-01-23 21:28:04,607- 
INFO-[Thread-15]-[XPathEntityProcessor.java:221]-org.apache.solr.handler.dataimport.XPathEntityProcessor.fetchNextRow: 
URL={url}
2015-01-23 21:28:04,607- 
INFO-[Thread-15]-[XPathEntityProcessor.java:227]-org.apache.solr.handler.dataimport.XPathEntityProcessor.fetchNextRow: 
r1={{id=CVE-2002-2357, cve=CVE-2002-2357, cwe=CWE-119, $forEach=/nvd/entry}}
2015-01-23 21:28:04,607- 
INFO-[Thread-15]-[XPathEntityProcessor.java:251]-org.apache.solr.handler.dataimport.XPathEntityProcessor.fetchNextRow: 
r3={{id=CVE-2002-2357, cve=CVE-2002-2357, cwe=CWE-119, $forEach=/nvd/entry}}
2015-01-23 21:28:04,607- 
INFO-[Thread-15]-[XPathEntityProcessor.java:221]-org.apache.solr.handler.dataimport.XPathEntityProcessor.fetchNextRow: 
URL={url}
2015-01-23 21:28:04,607- 
INFO-[Thread-15]-[XPathEntityProcessor.java:227]-org.apache.solr.handler.dataimport.XPathEntityProcessor.fetchNextRow: 
r1={{id=CVE-2002-2358, cve=CVE-2002-2358, cwe=CWE-79, $forEach=/nvd/entry}}
2015-01-23 21:28:04,607- 
INFO-[Thread-15]-[XPathEntityProcessor.java:251]-org.apache.solr.handler.dataimport.XPathEntityProcessor.fetchNextRow: 
r3={{id=CVE-2002-2358, cve=CVE-2002-2358, cwe=CWE-79, $forEach=/nvd/entry}}
2015-01-23 21:28:04,607- 
INFO-[Thread-15]-[XPathEntityProcessor.java:221]-org.apache.solr.handler.dataimport.XPathEntityProcessor.fetchNextRow: 
URL={url}
2015-01-23 21:28:04,607- 
INFO-[Thread-15]-[XPathEntityProcessor.java:227]-org.apache.solr.handler.dataimport.XPathEntityProcessor.fetchNextRow: 
r1={{id=CVE-2002-2359, cve=CVE-2002-2359, cwe=CWE-79, $forEach=/nvd/entry}}
2015-01-23 21:28:04,607- 
INFO-[Thread-15]-[XPathEntityProcessor.java:251]-org.apache.solr.handler.dataimport.XPathEntityProcessor.fetchNextRow: 
r3={{id=CVE-2002-2359, cve=CVE-2002-2359, cwe=CWE-79, $forEach=/nvd/entry}}
2015-01-23 21:28:04,607- 
INFO-[Thread-15]-[XPathEntityProcessor.java:221]-org.apache.solr.handler.dataimport.XPathEntityProcessor.fetchNextRow: 
URL={url}
2015-01-23 21:28:04,607- 
INFO-[Thread-15]-[XPathEntityProcessor.java:227]-org.apache.solr.handler.dataimport.XPathEntityProcessor.fetchNextRow: 
r1={{id=CVE-2002-2360, cve=CVE-2002-2360, cwe=CWE-264, $forEach=/nvd/entry}}
2015-01-23 21:28:04,607- 
INFO-[Thread-15]-[XPathEntityProcessor.java:251]-org.apache.solr.handler.dataimport.XPathEntityProcessor.fetchNextRow: 
r3={{id=CVE-2002-2360, cve=CVE-2002-2360, cwe=CWE-264, $forEach=/nvd/entry}}
2015-01-23 21:28:04,607- 
INFO-[Thread-15]-[XPathEntityProcessor.java:221]-org.apache.solr.handler.dataimport.XPathEntityProcessor.fetchNextRow: 
URL={url}
2015-01-23 21:28:04,607- 
INFO-[Thread-15]-[XPath

Re: SolrCloud result correctness compared with single core

2015-01-23 Thread Erick Erickson
You might, but probably not enough to notice. At 50G, the tf/idf
stats will _probably_ be close enough that you won't be able to tell.

That said, distributed tf/idf has recently been implemented, but
you need to ask for it; see SOLR-1632. This is Solr 5.0, though.

I've rarely seen it matter except in fairly specialized situations.
Consider a single core: deleted documents still count towards
some of the tf/idf stats, so your scoring could theoretically
change after, say, an optimize.

The so-called "bottom line" is that yes, the scoring may change, but
IMO not any more radically than was possible with single cores,
and I wouldn't worry about it unless I had evidence that it was
biting me.

Best,
Erick

On Fri, Jan 23, 2015 at 2:52 PM, Yandong Yao  wrote:

> Hi Guys,
>
> As the main scoring mechanism is based tf/idf, so will same query running
> against SolrCloud return different result against running it against single
> core with same data sets as idf will only count df inside one core?
>
> eg: Assume I have 100GB data:
> A) Index those data using single core
> B) Index those data using SolrCloud with two cores (each has 50GB data
> index)
>
> Then If I query those with same query like 'apple', then will I get
> different result for A and B?
>
>
> Regards,
> Yandong
>


Re: Avoiding wildcard queries using edismax query parser

2015-01-23 Thread Jorge Luis Betancourt González
Thank you, Michael, for sharing your patch! It was really helpful, but for our 
particular requirement a SearchComponent that rewrites our query is enough (as 
suggested by Alexandre, thanks a lot); basically we just escape a 
bunch of * that we know are "problematic". 

This approach allows us to quietly avoid the wildcard query and instead treat it 
as a normal term query, rather than throwing a SyntaxError, which is more 
convenient in our case.
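
For anyone curious, a component along these lines can be quite small. The
sketch below is only illustrative (it is not the actual component we use, and
the class name is made up); it rewrites '*' in the incoming q parameter so the
character is treated literally, and it would still need to be registered as a
first-components entry on the relevant request handler:

import java.io.IOException;

import org.apache.solr.common.params.CommonParams;
import org.apache.solr.common.params.ModifiableSolrParams;
import org.apache.solr.handler.component.ResponseBuilder;
import org.apache.solr.handler.component.SearchComponent;

public class EscapeWildcardsComponent extends SearchComponent {

    @Override
    public void prepare(ResponseBuilder rb) throws IOException {
        String q = rb.req.getParams().get(CommonParams.Q);
        if (q != null && q.indexOf('*') >= 0) {
            ModifiableSolrParams params = new ModifiableSolrParams(rb.req.getParams());
            // escape the wildcard so the query parser treats it as a literal character
            params.set(CommonParams.Q, q.replace("*", "\\*"));
            rb.req.setParams(params);
        }
    }

    @Override
    public void process(ResponseBuilder rb) throws IOException {
        // nothing to do at process time; the rewrite happens in prepare()
    }

    @Override
    public String getDescription() {
        return "Escapes wildcard characters in the user query";
    }

    @Override
    public String getSource() {
        return null;
    }
}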

Regards,
 

- Original Message -
From: "Michael F. Ryan (LNG-DAY)" 
To: solr-user@lucene.apache.org
Sent: Friday, January 23, 2015 8:26:48 AM
Subject: RE: Avoiding wildcard queries using edismax query parser

Here's a Jira for this: https://issues.apache.org/jira/browse/SOLR-3031

I've attached a patch there that might be useful for you.

-Michael

-Original Message-
From: Jorge Luis Betancourt González [mailto:jlbetanco...@uci.cu] 
Sent: Thursday, January 22, 2015 4:34 PM
To: solr-user@lucene.apache.org
Subject: Avoiding wildcard queries using edismax query parser

Hello all,

Currently we are using edismax query parser in an internal application, we've 
detected that some wildcard queries including "*" are causing some performance 
issues and for this particular case we're not interested in allowing any user 
to request all the indexed documents. 

This could be easily escaped in the application level, but right now we have 
several applications (using several programming languages) consuming from Solr, 
and adding this into each application is kind of exhausting, so I'm wondering 
if there is some configuration that allow us to treat this special characters 
as normal alphanumeric characters. 

I've tried one solution that worked before, involving the WordDelimiterFilter 
an the types attribute:



and in characters.txt I've mapped the special characters into ALPHA:

+ => ALPHA 
* => ALPHA 

Any thoughts on this?





Re: Connection Reset Errors with Solr 4.4

2015-01-23 Thread Mike Drob
I'm not sure what a reasonable workaround would be. Perhaps somebody else
can brainstorm and make a suggestion, sorry.

On Tue, Jan 20, 2015 at 12:56 PM, Nishanth S 
wrote:

> Thank you Mike.Sure enough,we are running into the same issue you
> mentoined.Is there a quick fix for this other than the patch.I do not see
> the tlogs getting replayed at all.It is doing a full index recovery from
> the leader and our index size is around 200G.Would lowering the autocommit
> settings help(where the replica would go for a tlog replay as the tlogs I
> see are not huge).
>
> Thanks,
> Nishanth
>
> On Tue, Jan 20, 2015 at 10:46 AM, Mike Drob  wrote:
>
> > Are we sure this isn't SOLR-6931?
> >
> > On Tue, Jan 20, 2015 at 11:39 AM, Nishanth S 
> > wrote:
> >
> > > Hello All,
> > >
> > > We are running solr cloud 4.4 with 30 shards and 3 replicas with real
> > time
> > > indexing on rhel 6.5.The indexing rate is 3K Tps now.We are running
> into
> > an
> > > issue with replicas going into recovery mode  due to connection reset
> > > errors.Soft commit time is 2 min and auto commit is set as 5 minutes.I
> > have
> > > seen that replicas do a full index recovery which takes a long
> > > time(days).Below is the error trace that  I see.I would really
> appreciate
> > > any help in this case.
> > >
> > > g.apache.solr.client.solrj.SolrServerException: IOException occured
> when
> > > talking to server at: http://xxx:8083/solr/log_pn_shard20_replica2
> > > at
> > >
> > >
> >
> org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:435)
> > > at
> > >
> > >
> >
> org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:180)
> > > at
> > >
> > >
> >
> org.apache.solr.update.SolrCmdDistributor$1.call(SolrCmdDistributor.java:401)
> > > at
> > >
> > >
> >
> org.apache.solr.update.SolrCmdDistributor$1.call(SolrCmdDistributor.java:375)
> > > at java.util.concurrent.FutureTask.run(FutureTask.java:262)
> > > at
> > > java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
> > > at java.util.concurrent.FutureTask.run(FutureTask.java:262)
> > > at
> > >
> > >
> >
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> > > at
> > >
> > >
> >
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> > > at java.lang.Thread.run(Thread.java:745)
> > > Caused by: java.net.SocketException: Connection reset
> > > at java.net.SocketInputStream.read(SocketInputStream.java:196)
> > > at java.net.SocketInputStream.read(SocketInputStream.java:122)
> > > at
> > >
> > >
> >
> org.apache.http.impl.io.AbstractSessionInputBuffer.fillBuffer(AbstractSessionInputBuffer.java:166)
> > > at
> > >
> > >
> >
> org.apache.http.impl.io.SocketInputBuffer.fillBuffer(SocketInputBuffer.java:90)
> > > at
> > >
> > >
> >
> org.apache.http.impl.io.AbstractSessionInputBuffer.readLine(AbstractSessionInputBuffer.java:281)
> > > at
> > >
> > >
> >
> org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:92)
> > > at
> > >
> > >
> >
> org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:62)
> > > at
> > >
> > >
> >
> org.apache.http.impl.io.AbstractMessageParser.parse(AbstractMessageParser.java:254)
> > > at
> > >
> > >
> >
> org.apache.http.impl.AbstractHttpClientConnection.receiveResponseHeader(AbstractHttpClientConnection.java:289)
> > > at
> > >
> > >
> >
> org.apache.http.impl.conn.DefaultClientConnection.receiveResponseHeader(DefaultClientConnection.java:252)
> > > at
> > >
> > >
> >
> org.apache.http.impl.conn.ManagedClientConnectionImpl.receiveResponseHeader(ManagedClientConnectionImpl.java:191)
> > > at
> > >
> > >
> >
> org.apache.http.protocol.HttpRequestExecutor.doReceiveResponse(HttpRequestExecutor.java:300)
> > > at
> > >
> > >
> >
> org.apache.http.protocol.HttpRequestExecutor.execute(HttpRequestExecutor.java:127)
> > > at
> > >
> > >
> >
> org.apache.http.impl.client.DefaultRequestDirector.tryExecute(DefaultRequestDirector.java:717)
> > > at
> > >
> > >
> >
> org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:522)
> > > at
> > >
> > >
> >
> org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:906)
> > > at
> > >
> > >
> >
> org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:805)
> > > at
> > >
> > >
> >
> org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:784)
> > > at
> > >
> > >
> >
> org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:365)
> > > ... 9 more
> > >
> > >
> > > Thanks,
> > > Nishanth
> > >
> >
>


Re: Solr I/O increases over time

2015-01-23 Thread Shawn Heisey
On 1/23/2015 3:52 PM, Daniel Cukier wrote:
> I am running around eight solr servers (version 3.5) instances behind a
> Load Balancer. All servers are identical and the LB is weighted by number
> connections. The servers have around 4M documents and receive a constant
> flow of queries. When the solr server starts, it works fine. But after some
> time running, it starts to take longer respond to queries, and the server
> I/O goes crazy to 100%. Look at the New Relic graphic:
>
> [image: enter image description here]
>
> If the servers behaves well in the beginning, I it starts to fail after
> some time? Then if I restart the server, it gets back to low I/O for same
> time and this repeats over and over.

The mailing list eats almost all attachments.  We can't see your image. 
You can use http://apaste.info for images (up to 1MB) and text, or pick
another hosting provider, and include the URL in your reply.

Most performance problems like this are memory related.  The high I/O
you mentioned definitely sounds like it could be a situation where you
don't have enough RAM available for OS disk cache.  When the OS cannot
cache the index effectively, queries will result in a large amount of
real disk I/O.  If there's enough memory for caching, queries will be
entirely or mostly handled from RAM, which is *MUCH* faster than the disk.

http://wiki.apache.org/solr/SolrPerformanceProblems

Thanks,
Shawn



Need Help with custom ZIPURLDataSource class

2015-01-23 Thread Carl Roberts


Hi,

I created a custom ZIPURLDataSource class to unzip the content from an
http URL for an XML ZIP file and it seems to be working (at least I have
no errors), but no data is imported.

Here is my configuration in rss-data-config.xml:




https://nvd.nist.gov/feeds/xml/cve/nvdcve-2.0-2002.xml.zip";
processor="XPathEntityProcessor"
forEach="/nvd/entry"
transformer="DateFormatTransformer">













Attached is the ZIPURLDataSource.java file.

It actually unzips and saves the raw XML to disk, which I have verified to be a 
valid XML file.  The file has one or more entries (here is an example):

http://scap.nist.gov/schema/scap-core/0.1";
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance";
xmlns:patch="http://scap.nist.gov/schema/patch/0.1";
xmlns:vuln="http://scap.nist.gov/schema/vulnerability/0.4";
xmlns:cvss="http://scap.nist.gov/schema/cvss-v2/0.2";
xmlns:cpe-lang="http://cpe.mitre.org/language/2.0";
xmlns="http://scap.nist.gov/schema/feed/vulnerability/2.0";
pub_date="2015-01-10T05:37:05"
xsi:schemaLocation="http://scap.nist.gov/schema/patch/0.1
http://nvd.nist.gov/schema/patch_0.1.xsd
http://scap.nist.gov/schema/scap-core/0.1
http://nvd.nist.gov/schema/scap-core_0.1.xsd
http://scap.nist.gov/schema/feed/vulnerability/2.0
http://nvd.nist.gov/schema/nvd-cve-feed_2.0.xsd"; nvd_xml_version="2.0">

http://nvd.nist.gov/";>



























cpe:/o:freebsd:freebsd:2.2.8
cpe:/o:freebsd:freebsd:1.1.5.1
cpe:/o:freebsd:freebsd:2.2.3
cpe:/o:freebsd:freebsd:2.2.2
cpe:/o:freebsd:freebsd:2.2.5
cpe:/o:freebsd:freebsd:2.2.4
cpe:/o:freebsd:freebsd:2.0.5
cpe:/o:freebsd:freebsd:2.2.6
cpe:/o:freebsd:freebsd:2.1.6.1
cpe:/o:freebsd:freebsd:2.0.1
cpe:/o:freebsd:freebsd:2.2
cpe:/o:freebsd:freebsd:2.0
cpe:/o:openbsd:openbsd:2.3
cpe:/o:freebsd:freebsd:3.0
cpe:/o:freebsd:freebsd:1.1
cpe:/o:freebsd:freebsd:2.1.6
cpe:/o:openbsd:openbsd:2.4
cpe:/o:bsdi:bsd_os:3.1
cpe:/o:freebsd:freebsd:1.0
cpe:/o:freebsd:freebsd:2.1.7
cpe:/o:freebsd:freebsd:1.2
cpe:/o:freebsd:freebsd:2.1.5
cpe:/o:freebsd:freebsd:2.1.7.1

CVE-1999-0001
1999-12-30T00:00:00.000-05:00
2010-12-16T00:00:00.000-05:00


5.0
NETWORK
LOW
NONE
NONE
NONE
PARTIAL
http://nvd.nist.gov
2004-01-01T00:00:00.000-05:00




OSVDB
http://www.osvdb.org/5707";
xml:lang="en">5707


CONFIRM
http://www.openbsd.org/errata23.html#tcpfix";
xml:lang="en">http://www.openbsd.org/errata23.html#tcpfix

ip_input.c in BSD-derived TCP/IP implementations allows
remote attackers to cause a denial of service (crash or hang) via
crafted packets.



Here is the curl command:

curl http://127.0.0.1:8983/solr/nvd-rss/dataimport?command=full-import

And here is the output from the console for Jetty:

main{StandardDirectoryReader(segments_1:1:nrt)}
2407 [coreLoadExecutor-5-thread-1] INFO
org.apache.solr.core.CoreContainer – registering core: nvd-rss
2409 [main] INFO org.apache.solr.servlet.SolrDispatchFilter –
user.dir=/Users/carlroberts/dev/solr-4.10.3/example
2409 [main] INFO org.apache.solr.servlet.SolrDispatchFilter –
SolrDispatchFilter.init() done
2431 [main] INFO org.eclipse.jetty.server.AbstractConnector – Started
SocketConnector@0.0.0.0:8983
2450 [searcherExecutor-6-thread-1] INFO org.apache.solr.core.SolrCore –
[nvd-rss] webapp=null path=null
params={event=firstSearcher&q=static+firstSearcher+warming+in+solrconfig.xml&distrib=false}
hits=0 status=0 QTime=43
2451 [searcherExecutor-6-thread-1] INFO org.apache.solr.core.SolrCore –
QuerySenderListener done.
2451 [searcherExecutor-6-thread-1] INFO
org.apache.solr.handler.component.SpellCheckComponent – Loading spell
index for spellchecker: default
2451 [searcherExecutor-6-thread-1] INFO
org.apache.solr.handler.component.SpellCheckComponent – Loading spell
index for spellchecker: wordbreak
2452 [searcherExecutor-6-thread-1] INFO
org.apache.solr.handler.component.SuggestComponent – Loading suggester
index for: mySuggester
2452 [searcherExecutor-6-thread-1] INFO
org.apache.solr.spelling.suggest.SolrSuggester – reload()
2452 [searcherExecutor-6-thread-1] INFO
org.apache.solr.spelling.suggest.SolrSuggester – build()
2459 [searcherExecutor-6-thread-1] INFO org.apache.solr.core.SolrCore –
[nvd-rss] Registered new searcher Searcher@df9e84e[nvd-rss]
main{StandardDirectoryReader(segments_1:1:nrt)}
8371 [qtp1640586218-17] INFO
org.apache.solr.handler.dataimport.DataImporter – Loading DIH
Configuration: rss-data-config.xml
8379 [qtp1640586218-17] INFO
org.apache.solr.handler.dataimport.DataImporter – Data Configuration
loaded successfully
8383 [Thread-15] INFO org.apache.solr.handler.dataimport.DataImporter –
Starting Full Import
8384 [qtp1640586218-17] INFO org.apache.solr.core.SolrCore – [nvd-rss]
webapp=/solr path=/dataimport params={command=full-import} status=0 QTime=15
8396 [Thread-15] INFO
org.apache.solr.handler.dataimport.SimplePropertiesWriter – Read
dataimport.properties
23431 [commitScheduler-8-thread-1] INFO
org.apache.solr.update.UpdateHandler – start
commit{,optimize=false,openSearcher=false,waitSearcher=true,expungeDeletes=false,softCommi

Solr I/O increases over time

2015-01-23 Thread Daniel Cukier
I am running around eight Solr server (version 3.5) instances behind a
load balancer. All servers are identical and the LB is weighted by number of
connections. The servers have around 4M documents and receive a constant
flow of queries. When a Solr server starts, it works fine, but after some
time running it starts to take longer to respond to queries, and the server
I/O goes crazy to 100%. Look at the New Relic graphic:

[image: enter image description here]

Why do the servers behave well in the beginning but start to fail after
some time? If I restart a server, it goes back to low I/O for some
time, and this repeats over and over.
Daniel Cukier



SolrCloud result correctness compared with single core

2015-01-23 Thread Yandong Yao
Hi Guys,

Since the main scoring mechanism is based on tf/idf, will the same query
running against SolrCloud return different results than running it against a
single core with the same data set, given that idf only counts document
frequency inside one core?

e.g. assume I have 100GB of data:
A) Index the data using a single core
B) Index the data using SolrCloud with two cores (each holding a 50GB
index)

If I then run the same query, say 'apple', will I get different results for A
and B?


Regards,
Yandong


Re: How to inject custom response data after results have been sorted

2015-01-23 Thread tedsolr
Thank you so much for your responses Hoss and Shalin. I gather that the
DocTransformer allows manipulations of the doc list returned in the results.
That is very cool. So the transformer has access to the Solr request. I
haven't seen the hook yet, but I believe you; I'll have to keep looking. It
would certainly be cleaner to return my stats as "fields" within each doc.
My plan was to attach the stats as a map to the response and post-process
them in my app.

I was able to quickly mock up a custom SearchComponent and verify that it
receives the doc list in sorted order, and that I could retrieve objects
from the request context. So this search component would allow me to simply
"paste" the filtered map of stats onto the response.

Is there a performance benefit one way or the other? Is it just easier in
the DocTransformer since there is a transform(doc, id) method that must get
called for every returned doc?
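
For reference, here is roughly what I have in mind for the transformer route.
This is only a sketch (the "myStats" context key and the "id" uniqueKey field
are placeholders), assuming a component earlier in the chain has already
stashed the per-document stats map in the request context:

import java.util.Map;

import org.apache.solr.common.SolrDocument;
import org.apache.solr.common.params.SolrParams;
import org.apache.solr.request.SolrQueryRequest;
import org.apache.solr.response.transform.DocTransformer;
import org.apache.solr.response.transform.TransformerFactory;

public class StatsAugmenterFactory extends TransformerFactory {

    @Override
    public DocTransformer create(final String field, SolrParams params,
                                 final SolrQueryRequest req) {
        return new DocTransformer() {
            @Override
            public String getName() {
                return field;
            }

            @Override
            public void transform(SolrDocument doc, int docid) {
                // stats map placed in the request context by an earlier search component
                @SuppressWarnings("unchecked")
                Map<Object, Object> stats =
                        (Map<Object, Object>) req.getContext().get("myStats");
                Object id = doc.getFieldValue("id");
                if (stats != null && id != null && stats.containsKey(id)) {
                    doc.setField(field, stats.get(id));
                }
            }
        };
    }
}

Either way the work is proportional to the page of returned documents, so for
typical page sizes the performance difference should be negligible.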



--
View this message in context: 
http://lucene.472066.n3.nabble.com/How-to-inject-custom-response-data-after-results-have-been-sorted-tp4181545p4181602.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: multiple data source indexing through data import handler

2015-01-23 Thread Qiu Mo

Alex,

Thanks, I tried ${item.id}; it doesn't work.
However, if I hardcode an id number instead of '${item.id}', then that one row
is added to every document.

For example, with
select description from feature where item_id=3456

that single description is added to every document as a field. It seems the
parent entity's item.id is
not passed to the second entity.

Thanks,

Qiu Mo (Joe)



From: Alexandre Rafalovitch 
Sent: Friday, January 23, 2015 3:25 PM
To: solr-user
Subject: Re: multiple data source indexing through data import handler

Try ${item.id} as that's what you are mapping it to.

See also: https://issues.apache.org/jira/browse/SOLR-4383

Regards,
   Alex.

Sign up for my Solr resources newsletter at http://www.solr-start.com/


On 23 January 2015 at 15:01, Qiu Mo  wrote:
> I am indexing data from two different databases, but I can't add second 
> database to indexing, can anyone help!  below is my dats-config.xml
>
>
>  url="jdbc:mysql://XXX" user="XXX" password="XXX"/>
>
>  url="jdbc:mysql://XXX" user="XXX" password="XXX"/>
>
>
> 
>
> 
>
> 
>
>
> 
>
> 
>
> 
>
>
>
> 
>
> my log indicate that '${item.ID}' is not catch any value from entity item.
>
> Thanks,
>
> Joe Moore

Re: multiple data source indexing through data import handler

2015-01-23 Thread Alexandre Rafalovitch
Try ${item.id} as that's what you are mapping it to.

See also: https://issues.apache.org/jira/browse/SOLR-4383

Regards,
   Alex.

Sign up for my Solr resources newsletter at http://www.solr-start.com/


On 23 January 2015 at 15:01, Qiu Mo  wrote:
> I am indexing data from two different databases, but I can't add second 
> database to indexing, can anyone help!  below is my dats-config.xml
>
>
>  url="jdbc:mysql://XXX" user="XXX" password="XXX"/>
>
>  url="jdbc:mysql://XXX" user="XXX" password="XXX"/>
>
>
> 
>
> 
>
> 
>
>
> 
>
> 
>
> 
>
>
>
> 
>
> my log indicate that '${item.ID}' is not catch any value from entity item.
>
> Thanks,
>
> Joe Moore


Re: Is it possible to read multiple RSS feeds and XML Zip file feeds with DIH into one core?

2015-01-23 Thread Alexandre Rafalovitch
Unzipping things might be an issue. You may need to do that as part of
a batch job outside of Solr. For the rest, go through the
documentation first; it does answer a bunch of questions. There is
also a page on the wiki, not just in the reference guide.

Regards,
   Alex.

Sign up for my Solr resources newsletter at http://www.solr-start.com/


On 23 January 2015 at 14:51, Carl Roberts  wrote:
> Excellent - thanks Shalin.  But how does delta-import work?  Does it do a
> clean also?  Does it require a unique Id?  Does it update existing records
> and only add when necessary?
>
> And, how would I go about unzipping the content from a URL to then import
> the unzipped XML?  Is the recommended way to extend the URLDataSource class
> or is there any built-in logic to plug in pre-processing handlers?
>
>
> And,
>
> On 1/23/15, 2:39 PM, Shalin Shekhar Mangar wrote:
>>
>> If you add clean=false as a parameter to the full-import then deletion is
>> disabled. Since you are ingesting RSS there is no need for deletion at all
>> I guess.
>>
>> On Fri, Jan 23, 2015 at 7:31 PM, Carl Roberts
>> >>
>>> wrote:
>>> OK - Thanks for the doc.
>>>
>>> Is it possible to just provide an empty value to preImportDeleteQuery to
>>> disable the delete prior to import?
>>>
>>> Will the data still be deleted for each entity during a delta-import
>>> instead of full-import?
>>>
>>> Is there any capability in the handler to unzip an XML file from a URL
>>> prior to reading it or can I perhaps hook a custom pre-processing
>>> handler?
>>>
>>> Regards,
>>>
>>> Joe
>>>
>>>
>>>
>>> On 1/23/15, 1:40 PM, Alexandre Rafalovitch wrote:
>>>
 https://cwiki.apache.org/confluence/display/solr/
 Uploading+Structured+Data+Store+Data+with+the+Data+Import+Handler

 Admin UI has the interface, so you can play there once you define it.

 You do have to use Curl, there is no built-in scheduler.

 Regards,
  Alex.
 
 Sign up for my Solr resources newsletter at http://www.solr-start.com/


 On 23 January 2015 at 13:29, Carl Roberts
 
 wrote:

> Hi Alex,
>
> If I am understanding this correctly, I can define multiple entities
> like
> this?
>
> 
>   
>   
>   
>   ...
> 
>
> How would I trigger loading certain entities during start?
>
> How would I trigger loading other entities during update?
>
> Is there a way to set an auto-update for certain entities so that I
> don't
> have to invoke an update via curl?
>
> Where / how do I specify the preImportDeleteQuery to avoid deleting
> everything upon each update?
>
> Is there an example or doc that shows how to do all this?
>
> Regards,
>
> Joe
>
>
> On 1/23/15, 11:24 AM, Alexandre Rafalovitch wrote:
>
>> You can define both multiple entities in the same file and nested
>> entities if your list comes from an external source (e.g. a text file
>> of URLs).
>> You can also trigger DIH with a name of a specific entity to load just
>> that.
>> You can even pass DIH configuration file when you are triggering the
>> processing start, so you can have different files completely for
>> initial load and update. Though you can just do the same with
>> entities.
>>
>> The only thing to be aware of is that before an entity definition is
>> processed, a delete command is run. By default, it's "delete all", so
>> executing one entity will delete everything but then just populate
>> that one entity's results. You can avoid that by defining
>> preImportDeleteQuery and having a clear identifier on content
>> generated by each entity (e.g. source, either extracted or manually
>> added with TemplateTransformer).
>>
>> Regards,
>>   Alex.
>>
>> 
>> Sign up for my Solr resources newsletter at http://www.solr-start.com/
>>
>>
>> On 23 January 2015 at 11:15, Carl Roberts <
>> carl.roberts.zap...@gmail.com>
>> wrote:
>>
>>> Hi,
>>>
>>> I have the RSS DIH example working with my own RSS feed - here is the
>>> configuration for it.
>>>
>>> 
>>>
>>>
>>>>>pk="link"
>>>url="https://nvd.nist.gov/download/nvd-rss.xml";
>>>processor="XPathEntityProcessor"
>>>forEach="/RDF/item"
>>>transformer="DateFormatTransformer">
>>>
>>>>> commonField="true" />
>>>>> commonField="true"
>>> />
>>>>> commonField="true" />
>>>>> commonField="true"
>>> />
>>>
>>>
>>>
>>> 
>>>
>>> However, my problem is that I also have to load multiple XML feeds
>>> into
>>> the
>>> same

Re: Retrieving Phonetic Code as result

2015-01-23 Thread Jack Krupansky
That's what the phonetic filter is doing - transforming text into phonetic
codes at index time, and at query time as well, to do the phonetic matching in
the query. The actual phonetic codes are stored in the index for the purposes
of query matching.
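
As a rough sketch of such a chain (assuming DoubleMetaphone, since your schema
fragment below shows a maxCodeLength attribute; the type name here is made up):

  <fieldType name="text_phonetic" class="solr.TextField">
    <analyzer>
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.DoubleMetaphoneFilterFactory" inject="false" maxCodeLength="4"/>
    </analyzer>
  </fieldType>

With inject="false" only the phonetic codes are indexed as terms; with
inject="true" the original tokens are kept alongside the codes.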

-- Jack Krupansky

On Fri, Jan 23, 2015 at 12:57 PM, Amit Jha  wrote:

> Can I extend solr to add phonetic codes at time of indexing as uuid field
> getting added. Because I want to preprocess the metaphone code because I
> calculate the code on runtime will give me some performance hit.
>
> Rgds
> AJ
>
> > On Jan 23, 2015, at 5:37 PM, Jack Krupansky 
> wrote:
> >
> > Your app can use the field analysis API (FieldAnalysisRequestHandler) to
> > query Solr for what the resulting field values are for each filter in the
> > analysis chain for a given input string. This is what the Solr Admin UI
> > Analysis web page uses.
> >
> > See:
> >
> http://lucene.apache.org/solr/4_10_2/solr-core/org/apache/solr/handler/FieldAnalysisRequestHandler.html
> > and in solrconfig.xml
> >
> >
> > -- Jack Krupansky
> >
> >> On Thu, Jan 22, 2015 at 8:42 AM, Amit Jha  wrote:
> >>
> >> Hi,
> >>
> >> I need to know how can I retrieve phonetic codes. Does solr provide it
> as
> >> part of result? I need codes for record matching.
> >>
> >> *following is schema fragment:*
> >>
> >>  >> class="solr.TextField" >
> >>  
> >>
> >> >> maxCodeLength="4"/>
> >>  
> >>
> >>
> >>  stored="true"/>
> >>  
> >>  
> >>   stored="true"/>
> >>
> >> 
> >> 
> >>
>


multiple data source indexing through data import handler

2015-01-23 Thread Qiu Mo
I am indexing data from two different databases, but I can't add the second 
database to the indexing - can anyone help?  Below is my data-config.xml
























My log indicates that '${item.ID}' is not catching any value from the entity 'item'.

Thanks,

Joe Moore


Re: Is it possible to read multiple RSS feeds and XML Zip file feeds with DIH into one core?

2015-01-23 Thread Carl Roberts
Excellent - thanks Shalin.  But how does delta-import work?  Does it do 
a clean also?  Does it require a unique Id?  Does it update existing 
records and only add when necessary?


And, how would I go about unzipping the content from a URL to then 
import the unzipped XML?  Is the recommended way to extend the 
URLDataSource class or is there any built-in logic to plug in 
pre-processing handlers?



And,
On 1/23/15, 2:39 PM, Shalin Shekhar Mangar wrote:

If you add clean=false as a parameter to the full-import then deletion is
disabled. Since you are ingesting RSS there is no need for deletion at all
I guess.

On Fri, Jan 23, 2015 at 7:31 PM, Carl Roberts 
wrote:
OK - Thanks for the doc.

Is it possible to just provide an empty value to preImportDeleteQuery to
disable the delete prior to import?

Will the data still be deleted for each entity during a delta-import
instead of full-import?

Is there any capability in the handler to unzip an XML file from a URL
prior to reading it or can I perhaps hook a custom pre-processing handler?

Regards,

Joe



On 1/23/15, 1:40 PM, Alexandre Rafalovitch wrote:


https://cwiki.apache.org/confluence/display/solr/
Uploading+Structured+Data+Store+Data+with+the+Data+Import+Handler

Admin UI has the interface, so you can play there once you define it.

You do have to use Curl, there is no built-in scheduler.

Regards,
 Alex.

Sign up for my Solr resources newsletter at http://www.solr-start.com/


On 23 January 2015 at 13:29, Carl Roberts 
wrote:


Hi Alex,

If I am understanding this correctly, I can define multiple entities like
this?


  
  
  
  ...


How would I trigger loading certain entities during start?

How would I trigger loading other entities during update?

Is there a way to set an auto-update for certain entities so that I don't
have to invoke an update via curl?

Where / how do I specify the preImportDeleteQuery to avoid deleting
everything upon each update?

Is there an example or doc that shows how to do all this?

Regards,

Joe


On 1/23/15, 11:24 AM, Alexandre Rafalovitch wrote:


You can define both multiple entities in the same file and nested
entities if your list comes from an external source (e.g. a text file
of URLs).
You can also trigger DIH with a name of a specific entity to load just
that.
You can even pass DIH configuration file when you are triggering the
processing start, so you can have different files completely for
initial load and update. Though you can just do the same with
entities.

The only thing to be aware of is that before an entity definition is
processed, a delete command is run. By default, it's "delete all", so
executing one entity will delete everything but then just populate
that one entity's results. You can avoid that by defining
preImportDeleteQuery and having a clear identifier on content
generated by each entity (e.g. source, either extracted or manually
added with TemplateTransformer).

Regards,
  Alex.


Sign up for my Solr resources newsletter at http://www.solr-start.com/


On 23 January 2015 at 11:15, Carl Roberts <
carl.roberts.zap...@gmail.com>
wrote:


Hi,

I have the RSS DIH example working with my own RSS feed - here is the
configuration for it.


   
   
   https://nvd.nist.gov/download/nvd-rss.xml";
   processor="XPathEntityProcessor"
   forEach="/RDF/item"
   transformer="DateFormatTransformer">

   
   
   
   

   
   


However, my problem is that I also have to load multiple XML feeds into
the
same core.  Here is one example (there are about 10 of them):

http://static.nvd.nist.gov/feeds/xml/cve/nvdcve-2.0-2014.xml.zip


Is there any built-in functionality that would allow me to do this?
Basically, the use-case is to load and index all the XML ZIP files
first,
and then check the RSS feed every two hours and update the indexes with
any
new ones.

Regards,

Joe









Re: How to inject custom response data after results have been sorted

2015-01-23 Thread Chris Hostetter

: If you just need to transform an individual result, that can be done by a
: custom DocTransformer. But from your email, I think you need a custom
: SearchComponent.

if your PostFilter has already collected all of the info you need, and you 
now just want to return a subset of that information that corresponds to 
the individual documents being returned on the current "page" of results 
(ie: the current DocList) then a custom DocTransformer should probably be 
enough as long as your PostFilter puts the computed data in the request 
context.

see for example how the ElevatedMarkerFactory works in conjunction with 
the QueryElevationComponent...

https://cwiki.apache.org/confluence/display/solr/Transforming+Result+Documents
https://cwiki.apache.org/confluence/display/solr/The+Query+Elevation+Component
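
As a concrete reference point (this is the stock elevation marker, not your 
transformer), the factory is registered in solrconfig.xml and then requested 
via the fl param:

  <transformer name="elevated" class="org.apache.solr.response.transform.ElevatedMarkerFactory" />

  ...&fl=id,score,[elevated]

a custom DocTransformerFactory that reads whatever your PostFilter stashed in 
the request context gets wired up the same way, just with your own name and 
class.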

-Hoss
http://www.lucidworks.com/


Re: Is it possible to read multiple RSS feeds and XML Zip file feeds with DIH into one core?

2015-01-23 Thread Shalin Shekhar Mangar
If you add clean=false as a parameter to the full-import then deletion is
disabled. Since you are ingesting RSS there is no need for deletion at all
I guess.
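
For example (assuming the handler is registered at /dataimport and the core is
the nvd-rss one from your config):

  curl "http://localhost:8983/solr/nvd-rss/dataimport?command=full-import&clean=false&commit=true"

clean=false skips the initial delete-all, and commit=true makes the imported
docs visible once the run finishes.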

On Fri, Jan 23, 2015 at 7:31 PM, Carl Roberts  wrote:

> OK - Thanks for the doc.
>
> Is it possible to just provide an empty value to preImportDeleteQuery to
> disable the delete prior to import?
>
> Will the data still be deleted for each entity during a delta-import
> instead of full-import?
>
> Is there any capability in the handler to unzip an XML file from a URL
> prior to reading it or can I perhaps hook a custom pre-processing handler?
>
> Regards,
>
> Joe
>
>
>
> On 1/23/15, 1:40 PM, Alexandre Rafalovitch wrote:
>
>> https://cwiki.apache.org/confluence/display/solr/
>> Uploading+Structured+Data+Store+Data+with+the+Data+Import+Handler
>>
>> Admin UI has the interface, so you can play there once you define it.
>>
>> You do have to use Curl, there is no built-in scheduler.
>>
>> Regards,
>> Alex.
>> 
>> Sign up for my Solr resources newsletter at http://www.solr-start.com/
>>
>>
>> On 23 January 2015 at 13:29, Carl Roberts 
>> wrote:
>>
>>> Hi Alex,
>>>
>>> If I am understanding this correctly, I can define multiple entities like
>>> this?
>>>
>>> 
>>>  
>>>  
>>>  
>>>  ...
>>> 
>>>
>>> How would I trigger loading certain entities during start?
>>>
>>> How would I trigger loading other entities during update?
>>>
>>> Is there a way to set an auto-update for certain entities so that I don't
>>> have to invoke an update via curl?
>>>
>>> Where / how do I specify the preImportDeleteQuery to avoid deleting
>>> everything upon each update?
>>>
>>> Is there an example or doc that shows how to do all this?
>>>
>>> Regards,
>>>
>>> Joe
>>>
>>>
>>> On 1/23/15, 11:24 AM, Alexandre Rafalovitch wrote:
>>>
 You can define both multiple entities in the same file and nested
 entities if your list comes from an external source (e.g. a text file
 of URLs).
 You can also trigger DIH with a name of a specific entity to load just
 that.
 You can even pass DIH configuration file when you are triggering the
 processing start, so you can have different files completely for
 initial load and update. Though you can just do the same with
 entities.

 The only thing to be aware of is that before an entity definition is
 processed, a delete command is run. By default, it's "delete all", so
 executing one entity will delete everything but then just populate
 that one entity's results. You can avoid that by defining
 preImportDeleteQuery and having a clear identifier on content
 generated by each entity (e.g. source, either extracted or manually
 added with TemplateTransformer).

 Regards,
  Alex.

 
 Sign up for my Solr resources newsletter at http://www.solr-start.com/


 On 23 January 2015 at 11:15, Carl Roberts <
 carl.roberts.zap...@gmail.com>
 wrote:

> Hi,
>
> I have the RSS DIH example working with my own RSS feed - here is the
> configuration for it.
>
> 
>   
>   
>      pk="link"
>   url="https://nvd.nist.gov/download/nvd-rss.xml";
>   processor="XPathEntityProcessor"
>   forEach="/RDF/item"
>   transformer="DateFormatTransformer">
>
>    commonField="true" />
>    commonField="true"
> />
>    commonField="true" />
>    commonField="true"
> />
>
>   
>   
> 
>
> However, my problem is that I also have to load multiple XML feeds into
> the
> same core.  Here is one example (there are about 10 of them):
>
> http://static.nvd.nist.gov/feeds/xml/cve/nvdcve-2.0-2014.xml.zip
>
>
> Is there any built-in functionality that would allow me to do this?
> Basically, the use-case is to load and index all the XML ZIP files
> first,
> and then check the RSS feed every two hours and update the indexes with
> any
> new ones.
>
> Regards,
>
> Joe
>
>
>
>


-- 
Regards,
Shalin Shekhar Mangar.


Re: Is it possible to read multiple RSS feeds and XML Zip file feeds with DIH into one core?

2015-01-23 Thread Carl Roberts

OK - Thanks for the doc.

Is it possible to just provide an empty value to preImportDeleteQuery to 
disable the delete prior to import?


Will the data still be deleted for each entity during a delta-import 
instead of full-import?


Is there any capability in the handler to unzip an XML file from a URL 
prior to reading it or can I perhaps hook a custom pre-processing handler?


Regards,

Joe


On 1/23/15, 1:40 PM, Alexandre Rafalovitch wrote:

https://cwiki.apache.org/confluence/display/solr/Uploading+Structured+Data+Store+Data+with+the+Data+Import+Handler

Admin UI has the interface, so you can play there once you define it.

You do have to use Curl, there is no built-in scheduler.

Regards,
Alex.

Sign up for my Solr resources newsletter at http://www.solr-start.com/


On 23 January 2015 at 13:29, Carl Roberts  wrote:

Hi Alex,

If I am understanding this correctly, I can define multiple entities like
this?


 
 
 
 ...


How would I trigger loading certain entities during start?

How would I trigger loading other entities during update?

Is there a way to set an auto-update for certain entities so that I don't
have to invoke an update via curl?

Where / how do I specify the preImportDeleteQuery to avoid deleting
everything upon each update?

Is there an example or doc that shows how to do all this?

Regards,

Joe


On 1/23/15, 11:24 AM, Alexandre Rafalovitch wrote:

You can define both multiple entities in the same file and nested
entities if your list comes from an external source (e.g. a text file
of URLs).
You can also trigger DIH with a name of a specific entity to load just
that.
You can even pass DIH configuration file when you are triggering the
processing start, so you can have different files completely for
initial load and update. Though you can just do the same with
entities.

The only thing to be aware of is that before an entity definition is
processed, a delete command is run. By default, it's "delete all", so
executing one entity will delete everything but then just populate
that one entity's results. You can avoid that by defining
preImportDeleteQuery and having a clear identifier on content
generated by each entity (e.g. source, either extracted or manually
added with TemplateTransformer).

Regards,
 Alex.


Sign up for my Solr resources newsletter at http://www.solr-start.com/


On 23 January 2015 at 11:15, Carl Roberts 
wrote:

Hi,

I have the RSS DIH example working with my own RSS feed - here is the
configuration for it.


  
  
  https://nvd.nist.gov/download/nvd-rss.xml";
  processor="XPathEntityProcessor"
  forEach="/RDF/item"
  transformer="DateFormatTransformer">

  
  
  
  

  
  


However, my problem is that I also have to load multiple XML feeds into
the
same core.  Here is one example (there are about 10 of them):

http://static.nvd.nist.gov/feeds/xml/cve/nvdcve-2.0-2014.xml.zip


Is there any built-in functionality that would allow me to do this?
Basically, the use-case is to load and index all the XML ZIP files first,
and then check the RSS feed every two hours and update the indexes with
any
new ones.

Regards,

Joe






Re: How to inject custom response data after results have been sorted

2015-01-23 Thread Shalin Shekhar Mangar
If you just need to transform an individual result, that can be done by a
custom DocTransformer. But from your email, I think you need a custom
SearchComponent.

On Fri, Jan 23, 2015 at 6:23 PM, tedsolr  wrote:

> Hello! With the help of this community I have solved 2 problems on my way
> to
> creating a search that collapses documents based on multiple fields. The
> CollapsingQParserPlugin was key.
>
> I have a new problem now. All the custom stats I generate in my custom
> QParser makes for way to much data to simply write out in the response. I
> need to filter that data so I only return the stats the user will see on
> one
> page. Say my search returns 800K collapsed docs - in the
> DelegatingCollector's collect() method I am computing some info for each
> collapsed group - that's 800K map entries.
>
> I can't filter the stats in my post filter implementation because the
> results have not been sorted. So I need a new downstream component that can
> read the sorted results, and grab the custom stats from my post filter. Can
> someone recommend a suggestion? Is this a SearchComponent extension? Where
> is the proper hook for examining results after sorting?
>
>
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/How-to-inject-custom-response-data-after-results-have-been-sorted-tp4181545.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>



-- 
Regards,
Shalin Shekhar Mangar.


Re: Sporadic Socket Timeout Error during Import

2015-01-23 Thread Shalin Shekhar Mangar
The default is 10 seconds and you can increase it by adding a "readTimeout"
attribute (whose value is in milliseconds) in the URLDataSource e.g.

<dataSource type="URLDataSource" readTimeout="30000"/>

On Fri, Jan 23, 2015 at 6:33 PM, Carl Roberts  wrote:

> Hi,
>
> I am using the DIH RSS example and I am running into a sporadic socket
> timeout error during every 3rd or 4th request. Below is the stack trace.
> What is the default socket timeout for reads and how can I increase it?
>
>
> 15046 [Thread-17] ERROR org.apache.solr.handler.dataimport.URLDataSource
> – Exception thrown while getting data
> java.net.SocketTimeoutException: Read timed out
> at java.net.SocketInputStream.socketRead0(Native Method)
> at java.net.SocketInputStream.read(SocketInputStream.java:152)
> at java.net.SocketInputStream.read(SocketInputStream.java:122)
> at sun.security.ssl.InputRecord.readFully(InputRecord.java:442)
> at sun.security.ssl.InputRecord.read(InputRecord.java:480)
> at sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:927)
> at sun.security.ssl.SSLSocketImpl.readDataRecord(SSLSocketImpl.java:884)
> at sun.security.ssl.AppInputStream.read(AppInputStream.java:102)
> at java.io.BufferedInputStream.fill(BufferedInputStream.java:235)
> at java.io.BufferedInputStream.read1(BufferedInputStream.java:275)
> at java.io.BufferedInputStream.read(BufferedInputStream.java:334)
> at sun.net.www.http.HttpClient.parseHTTPHeader(HttpClient.java:687)
> at sun.net.www.http.HttpClient.parseHTTP(HttpClient.java:633)
> at sun.net.www.protocol.http.HttpURLConnection.getInputStream(
> HttpURLConnection.java:1323)
> at sun.net.www.protocol.https.HttpsURLConnectionImpl.getInputStream(
> HttpsURLConnectionImpl.java:254)
> at org.apache.solr.handler.dataimport.URLDataSource.
> getData(URLDataSource.java:98)
> at org.apache.solr.handler.dataimport.URLDataSource.
> getData(URLDataSource.java:42)
> at org.apache.solr.handler.dataimport.XPathEntityProcessor.initQuery(
> XPathEntityProcessor.java:283)
> at org.apache.solr.handler.dataimport.XPathEntityProcessor.fetchNextRow(
> XPathEntityProcessor.java:224)
> at org.apache.solr.handler.dataimport.XPathEntityProcessor.nextRow(
> XPathEntityProcessor.java:204)
> at org.apache.solr.handler.dataimport.EntityProcessorWrapper.nextRow(
> EntityProcessorWrapper.java:243)
> at org.apache.solr.handler.dataimport.DocBuilder.
> buildDocument(DocBuilder.java:476)
> at org.apache.solr.handler.dataimport.DocBuilder.
> buildDocument(DocBuilder.java:415)
> at org.apache.solr.handler.dataimport.DocBuilder.
> doFullDump(DocBuilder.java:330)
> at org.apache.solr.handler.dataimport.DocBuilder.execute(
> DocBuilder.java:232)
> at org.apache.solr.handler.dataimport.DataImporter.
> doFullImport(DataImporter.java:416)
> at org.apache.solr.handler.dataimport.DataImporter.
> runCmd(DataImporter.java:480)
> at org.apache.solr.handler.dataimport.DataImporter$1.run(
> DataImporter.java:461)
> 815049 [Thread-17] ERROR org.apache.solr.handler.dataimport.DocBuilder –
> Exception while processing: nvd-rss document : SolrInputDocument(fields:
> []):org.apache.solr.handler.dataimport.DataImportHandlerException:
> Exception in invoking url https://nvd.nist.gov/download/nvd-rss.xml
> Processing Document # 1
> at org.apache.solr.handler.dataimport.URLDataSource.
> getData(URLDataSource.java:115)
> at org.apache.solr.handler.dataimport.URLDataSource.
> getData(URLDataSource.java:42)
> at org.apache.solr.handler.dataimport.XPathEntityProcessor.initQuery(
> XPathEntityProcessor.java:283)
> at org.apache.solr.handler.dataimport.XPathEntityProcessor.fetchNextRow(
> XPathEntityProcessor.java:224)
> at org.apache.solr.handler.dataimport.XPathEntityProcessor.nextRow(
> XPathEntityProcessor.java:204)
> at org.apache.solr.handler.dataimport.EntityProcessorWrapper.nextRow(
> EntityProcessorWrapper.java:243)
> at org.apache.solr.handler.dataimport.DocBuilder.
> buildDocument(DocBuilder.java:476)
> at org.apache.solr.handler.dataimport.DocBuilder.
> buildDocument(DocBuilder.java:415)
> at org.apache.solr.handler.dataimport.DocBuilder.
> doFullDump(DocBuilder.java:330)
> at org.apache.solr.handler.dataimport.DocBuilder.execute(
> DocBuilder.java:232)
> at org.apache.solr.handler.dataimport.DataImporter.
> doFullImport(DataImporter.java:416)
> at org.apache.solr.handler.dataimport.DataImporter.
> runCmd(DataImporter.java:480)
> at org.apache.solr.handler.dataimport.DataImporter$1.run(
> DataImporter.java:461)
> Caused by: java.net.SocketTimeoutException: Read timed out
> at java.net.SocketInputStream.socketRead0(Native Method)
> at java.net.SocketInputStream.read(SocketInputStream.java:152)
> at java.net.SocketInputStream.read(SocketInputStream.java:122)
> at sun.security.ssl.InputRecord.readFully(InputRecord.java:442)
> at sun.security.ssl.InputRecord.read(InputRecord.java:480)
> at sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:927)
> at sun.security.ssl.SSLSocketImpl.readDataRecord(SSLSocketImpl.java:884)
> at sun.security.ssl.AppInputStream.read(AppInputStream.java:102)
> a

Re: Is it possible to read multiple RSS feeds and XML Zip file feeds with DIH into one core?

2015-01-23 Thread Alexandre Rafalovitch
https://cwiki.apache.org/confluence/display/solr/Uploading+Structured+Data+Store+Data+with+the+Data+Import+Handler

Admin UI has the interface, so you can play there once you define it.

You do have to use Curl, there is no built-in scheduler.
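
If you want the check-every-two-hours behaviour, a cron entry calling DIH is
the usual workaround (host, core name and entity name below are assumptions):

  0 */2 * * * curl -s "http://localhost:8983/solr/nvd-rss/dataimport?command=full-import&clean=false&entity=nvd-rss" > /dev/null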

Regards,
   Alex.

Sign up for my Solr resources newsletter at http://www.solr-start.com/


On 23 January 2015 at 13:29, Carl Roberts  wrote:
> Hi Alex,
>
> If I am understanding this correctly, I can define multiple entities like
> this?
>
> 
> 
> 
> 
> ...
> 
>
> How would I trigger loading certain entities during start?
>
> How would I trigger loading other entities during update?
>
> Is there a way to set an auto-update for certain entities so that I don't
> have to invoke an update via curl?
>
> Where / how do I specify the preImportDeleteQuery to avoid deleting
> everything upon each update?
>
> Is there an example or doc that shows how to do all this?
>
> Regards,
>
> Joe
>
>
> On 1/23/15, 11:24 AM, Alexandre Rafalovitch wrote:
>>
>> You can define both multiple entities in the same file and nested
>> entities if your list comes from an external source (e.g. a text file
>> of URLs).
>> You can also trigger DIH with a name of a specific entity to load just
>> that.
>> You can even pass DIH configuration file when you are triggering the
>> processing start, so you can have different files completely for
>> initial load and update. Though you can just do the same with
>> entities.
>>
>> The only thing to be aware of is that before an entity definition is
>> processed, a delete command is run. By default, it's "delete all", so
>> executing one entity will delete everything but then just populate
>> that one entity's results. You can avoid that by defining
>> preImportDeleteQuery and having a clear identifier on content
>> generated by each entity (e.g. source, either extracted or manually
>> added with TemplateTransformer).
>>
>> Regards,
>> Alex.
>>
>> 
>> Sign up for my Solr resources newsletter at http://www.solr-start.com/
>>
>>
>> On 23 January 2015 at 11:15, Carl Roberts 
>> wrote:
>>>
>>> Hi,
>>>
>>> I have the RSS DIH example working with my own RSS feed - here is the
>>> configuration for it.
>>>
>>> 
>>>  
>>>  
>>>  >>  pk="link"
>>>  url="https://nvd.nist.gov/download/nvd-rss.xml";
>>>  processor="XPathEntityProcessor"
>>>  forEach="/RDF/item"
>>>  transformer="DateFormatTransformer">
>>>
>>>  >> commonField="true" />
>>>  >> commonField="true"
>>> />
>>>  >> commonField="true" />
>>>  >> commonField="true"
>>> />
>>>
>>>  
>>>  
>>> 
>>>
>>> However, my problem is that I also have to load multiple XML feeds into
>>> the
>>> same core.  Here is one example (there are about 10 of them):
>>>
>>> http://static.nvd.nist.gov/feeds/xml/cve/nvdcve-2.0-2014.xml.zip
>>>
>>>
>>> Is there any built-in functionality that would allow me to do this?
>>> Basically, the use-case is to load and index all the XML ZIP files first,
>>> and then check the RSS feed every two hours and update the indexes with
>>> any
>>> new ones.
>>>
>>> Regards,
>>>
>>> Joe
>>>
>>>
>


Sporadic Socket Timeout Error during Import

2015-01-23 Thread Carl Roberts

Hi,

I am using the DIH RSS example and I am running into a sporadic socket 
timeout error during every 3rd or 4th request. Below is the stack trace. 
What is the default socket timeout for reads and how can I increase it?



15046 [Thread-17] ERROR org.apache.solr.handler.dataimport.URLDataSource 
– Exception thrown while getting data

java.net.SocketTimeoutException: Read timed out
at java.net.SocketInputStream.socketRead0(Native Method)
at java.net.SocketInputStream.read(SocketInputStream.java:152)
at java.net.SocketInputStream.read(SocketInputStream.java:122)
at sun.security.ssl.InputRecord.readFully(InputRecord.java:442)
at sun.security.ssl.InputRecord.read(InputRecord.java:480)
at sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:927)
at sun.security.ssl.SSLSocketImpl.readDataRecord(SSLSocketImpl.java:884)
at sun.security.ssl.AppInputStream.read(AppInputStream.java:102)
at java.io.BufferedInputStream.fill(BufferedInputStream.java:235)
at java.io.BufferedInputStream.read1(BufferedInputStream.java:275)
at java.io.BufferedInputStream.read(BufferedInputStream.java:334)
at sun.net.www.http.HttpClient.parseHTTPHeader(HttpClient.java:687)
at sun.net.www.http.HttpClient.parseHTTP(HttpClient.java:633)
at 
sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1323)
at 
sun.net.www.protocol.https.HttpsURLConnectionImpl.getInputStream(HttpsURLConnectionImpl.java:254)
at 
org.apache.solr.handler.dataimport.URLDataSource.getData(URLDataSource.java:98)
at 
org.apache.solr.handler.dataimport.URLDataSource.getData(URLDataSource.java:42)
at 
org.apache.solr.handler.dataimport.XPathEntityProcessor.initQuery(XPathEntityProcessor.java:283)
at 
org.apache.solr.handler.dataimport.XPathEntityProcessor.fetchNextRow(XPathEntityProcessor.java:224)
at 
org.apache.solr.handler.dataimport.XPathEntityProcessor.nextRow(XPathEntityProcessor.java:204)
at 
org.apache.solr.handler.dataimport.EntityProcessorWrapper.nextRow(EntityProcessorWrapper.java:243)
at 
org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:476)
at 
org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:415)
at 
org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:330)
at 
org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:232)
at 
org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:416)
at 
org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:480)
at 
org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:461)
815049 [Thread-17] ERROR org.apache.solr.handler.dataimport.DocBuilder – 
Exception while processing: nvd-rss document : SolrInputDocument(fields: 
[]):org.apache.solr.handler.dataimport.DataImportHandlerException: 
Exception in invoking url https://nvd.nist.gov/download/nvd-rss.xml 
Processing Document # 1
at 
org.apache.solr.handler.dataimport.URLDataSource.getData(URLDataSource.java:115)
at 
org.apache.solr.handler.dataimport.URLDataSource.getData(URLDataSource.java:42)
at 
org.apache.solr.handler.dataimport.XPathEntityProcessor.initQuery(XPathEntityProcessor.java:283)
at 
org.apache.solr.handler.dataimport.XPathEntityProcessor.fetchNextRow(XPathEntityProcessor.java:224)
at 
org.apache.solr.handler.dataimport.XPathEntityProcessor.nextRow(XPathEntityProcessor.java:204)
at 
org.apache.solr.handler.dataimport.EntityProcessorWrapper.nextRow(EntityProcessorWrapper.java:243)
at 
org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:476)
at 
org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:415)
at 
org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:330)
at 
org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:232)
at 
org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:416)
at 
org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:480)
at 
org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:461)

Caused by: java.net.SocketTimeoutException: Read timed out
at java.net.SocketInputStream.socketRead0(Native Method)
at java.net.SocketInputStream.read(SocketInputStream.java:152)
at java.net.SocketInputStream.read(SocketInputStream.java:122)
at sun.security.ssl.InputRecord.readFully(InputRecord.java:442)
at sun.security.ssl.InputRecord.read(InputRecord.java:480)
at sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:927)
at sun.security.ssl.SSLSocketImpl.readDataRecord(SSLSocketImpl.java:884)
at sun.security.ssl.AppInputStream.read(AppInputStream.java:102)
at java.io.BufferedInputStream.fill(BufferedInputStream.java:235)
at java.io.BufferedInputStream.read1(BufferedInputStream.java:275)
at java.io.BufferedInputStream.read(BufferedInputStream.java:334)
at sun.net.www.http.HttpClient.parseHTTPHeader(HttpClient.java:687)
at sun.net.www.http.HttpClient.parseHTTP(HttpClient.java:633)
at 
sun.net.www.protocol.http.HttpURLConnection.g

Re: Is it possible to read multiple RSS feeds and XML Zip file feeds with DIH into one core?

2015-01-23 Thread Carl Roberts

Hi Alex,

If I am understanding this correctly, I can define multiple entities 
like this?






...


How would I trigger loading certain entities during start?

How would I trigger loading other entities during update?

Is there a way to set an auto-update for certain entities so that I 
don't have to invoke an update via curl?


Where / how do I specify the preImportDeleteQuery to avoid deleting 
everything upon each update?


Is there an example or doc that shows how to do all this?

Regards,

Joe

On 1/23/15, 11:24 AM, Alexandre Rafalovitch wrote:

You can define both multiple entities in the same file and nested
entities if your list comes from an external source (e.g. a text file
of URLs).
You can also trigger DIH with a name of a specific entity to load just that.
You can even pass DIH configuration file when you are triggering the
processing start, so you can have different files completely for
initial load and update. Though you can just do the same with
entities.

The only thing to be aware of is that before an entity definition is
processed, a delete command is run. By default, it's "delete all", so
executing one entity will delete everything but then just populate
that one entity's results. You can avoid that by defining
preImportDeleteQuery and having a clear identifier on content
generated by each entity (e.g. source, either extracted or manually
added with TemplateTransformer).

Regards,
Alex.


Sign up for my Solr resources newsletter at http://www.solr-start.com/


On 23 January 2015 at 11:15, Carl Roberts  wrote:

Hi,

I have the RSS DIH example working with my own RSS feed - here is the
configuration for it.


 
 
 https://nvd.nist.gov/download/nvd-rss.xml";
 processor="XPathEntityProcessor"
 forEach="/RDF/item"
 transformer="DateFormatTransformer">

 
 
 
 

 
 


However, my problem is that I also have to load multiple XML feeds into the
same core.  Here is one example (there are about 10 of them):

http://static.nvd.nist.gov/feeds/xml/cve/nvdcve-2.0-2014.xml.zip


Is there any built-in functionality that would allow me to do this?
Basically, the use-case is to load and index all the XML ZIP files first,
and then check the RSS feed every two hours and update the indexes with any
new ones.

Regards,

Joe






How to inject custom response data after results have been sorted

2015-01-23 Thread tedsolr
Hello! With the help of this community I have solved 2 problems on my way to
creating a search that collapses documents based on multiple fields. The
CollapsingQParserPlugin was key.

I have a new problem now. All the custom stats I generate in my custom
QParser make for way too much data to simply write out in the response. I
need to filter that data so I only return the stats the user will see on one
page. Say my search returns 800K collapsed docs - in the
DelegatingCollector's collect() method I am computing some info for each
collapsed group - that's 800K map entries.

I can't filter the stats in my post filter implementation because the
results have not been sorted. So I need a new downstream component that can
read the sorted results, and grab the custom stats from my post filter. Can
someone recommend an approach? Is this a SearchComponent extension? Where
is the proper hook for examining results after sorting?



--
View this message in context: 
http://lucene.472066.n3.nabble.com/How-to-inject-custom-response-data-after-results-have-been-sorted-tp4181545.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Retrieving Phonetic Code as result

2015-01-23 Thread Amit Jha
Can I extend Solr to add phonetic codes at indexing time, the same way a uuid 
field gets added? I want to precompute the metaphone code, because calculating 
it at runtime will give me a performance hit.

Rgds
AJ

> On Jan 23, 2015, at 5:37 PM, Jack Krupansky  wrote:
> 
> Your app can use the field analysis API (FieldAnalysisRequestHandler) to
> query Solr for what the resulting field values are for each filter in the
> analysis chain for a given input string. This is what the Solr Admin UI
> Analysis web page uses.
> 
> See:
> http://lucene.apache.org/solr/4_10_2/solr-core/org/apache/solr/handler/FieldAnalysisRequestHandler.html
> and in solrconfig.xml
> 
> 
> -- Jack Krupansky
> 
>> On Thu, Jan 22, 2015 at 8:42 AM, Amit Jha  wrote:
>> 
>> Hi,
>> 
>> I need to know how can I retrieve phonetic codes. Does solr provide it as
>> part of result? I need codes for record matching.
>> 
>> *following is schema fragment:*
>> 
>> > class="solr.TextField" >
>>  
>>
>>> maxCodeLength="4"/>
>>  
>>
>> 
>> 
>>  
>>  
>>  
>> 
>> 
>> 
>> 


Re: Suggester Example In Documentation Not Working

2015-01-23 Thread Chris Hostetter

: However, you will notice on page 228, under the section "Suggester", it 
: gives an example of a suggester search component using 
: solr.SpellCheckComponet.
...
: So it would appear the solr.SuggestComponent has been around since 4.7, 
: but the documentation has not caught up with the changes. Which is the 
: source of a little confusion.

Ah -- ok .. yeah, sorry ...

You are correct, there was definitely a lag in having the ref guide updated 
to account for the new SuggestComponent -- I didn't realize that.

: Nevertheless, I had hoped to find a simple working example that I could 
: use as a starting point to get the solr.SuggestComponent working so that 
: I might play around with it and make it do what I want. The suggester 
: appears to have many parameters and options, of which, several contain 
: little or no explanation.

the params should all be documented in the ref guide *now* -- and you are 
correct, that consulting the current ref guide to understand what those 
params do will likely be helpful to you -- i guess the main take away of 
my comment "#1" was to keep in mind that you may find some params 
documented for 5.0 which didn't exist in 4.8.  (I'm not sure)

as far as starting with a simple example -- there is absolutely an example 
of using the SuggestComponent in the 4.8 sample solrconfig.xml, and if you 
index the exampledocs you can see it produce suggestions with a URL like 
this...

http://localhost:8983/solr/collection1/suggest?suggest.dictionary=mySuggester&suggest.q=elec

...but my point "#2" is still very important to keep in mind -- that URL 
gives good suggestions for "elec" precisely because of what terms exist in 
the example docs that were indexed -- the URL you posted is only going to 
give interesting suggestions if there are terms in your index (in the 
configured fields) that are relevant.  if i try this URL...

http://localhost:8983/solr/collection1/suggest?suggest.dictionary=mySuggester&suggest.q=kern

...i get no suggestions, because none of the indexed docs have any words 
starting with "kern"

in general: posting the examples of URLs you have tried and gotten no good 
suggest results from isn't enough for anyone to help give you guidance 
unless you also post the specifics of the documents you indexed.



: 2) the behavior of the suggester is very specific to the contents of the 
: dictionary built -- the examples on that page apply to the example docs 
: included with solr -- hence the techproduct data, and the example queries 
: for input like "elec" suggesting "electronics" 
: 
: no where on that page is an example using the query "kern" -- wether or 
: not that input would return a suggestion is going to be entirely dependent 
: on wether the dictionary you built contains any similar terms to suggest. 
: 
: if you can please post more details about your documents -- ideally a full 
: set of all the documents in your index (using a small test index of 
: course) that may help to understand the results you are getting. 


-Hoss
http://www.lucidworks.com/


Re: Is it possible to read multiple RSS feeds and XML Zip file feeds with DIH into one core?

2015-01-23 Thread Alexandre Rafalovitch
You can define both multiple entities in the same file and nested
entities if your list comes from an external source (e.g. a text file
of URLs).
You can also trigger DIH with a name of a specific entity to load just that.
You can even pass DIH configuration file when you are triggering the
processing start, so you can have different files completely for
initial load and update. Though you can just do the same with
entities.

The only thing to be aware of is that before an entity definition is
processed, a delete command is run. By default, it's "delete all", so
executing one entity will delete everything but then just populate
that one entity's results. You can avoid that by defining
preImportDeleteQuery and having a clear identifier on content
generated by each entity (e.g. source, either extracted or manually
added with TemplateTransformer).
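
A sketch of what that looks like (the names are invented; the important bits
are preImportDeleteQuery, the template-generated source field, and triggering
a single entity by name):

  <entity name="feed1"
          preImportDeleteQuery="source:feed1"
          transformer="TemplateTransformer"
          ...>
    <field column="source" template="feed1"/>
    ...
  </entity>

and then /dataimport?command=full-import&entity=feed1 reloads just that
entity, deleting only the docs whose source is feed1 beforehand.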

Regards,
   Alex.


Sign up for my Solr resources newsletter at http://www.solr-start.com/


On 23 January 2015 at 11:15, Carl Roberts  wrote:
> Hi,
>
> I have the RSS DIH example working with my own RSS feed - here is the
> configuration for it.
>
> 
> 
> 
>  pk="link"
> url="https://nvd.nist.gov/download/nvd-rss.xml";
> processor="XPathEntityProcessor"
> forEach="/RDF/item"
> transformer="DateFormatTransformer">
>
> 
>  />
>  commonField="true" />
>  />
>
> 
> 
> 
>
> However, my problem is that I also have to load multiple XML feeds into the
> same core.  Here is one example (there are about 10 of them):
>
> http://static.nvd.nist.gov/feeds/xml/cve/nvdcve-2.0-2014.xml.zip
>
>
> Is there any built-in functionality that would allow me to do this?
> Basically, the use-case is to load and index all the XML ZIP files first,
> and then check the RSS feed every two hours and update the indexes with any
> new ones.
>
> Regards,
>
> Joe
>
>


Is it possible to read multiple RSS feeds and XML Zip file feeds with DIH into one core?

2015-01-23 Thread Carl Roberts

Hi,

I have the RSS DIH example working with my own RSS feed - here is the 
configuration for it.





https://nvd.nist.gov/download/nvd-rss.xml";
processor="XPathEntityProcessor"
forEach="/RDF/item"
transformer="DateFormatTransformer">

commonField="true" />
commonField="true" />
commonField="true" />
commonField="true" />
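
For reference, the entity above follows the stock Solr RSS DIH example; written
out in full it is along these lines (the field columns and xpath values here
are representative, not an exact copy of my config):

  <dataConfig>
    <dataSource type="URLDataSource"/>
    <document>
      <entity name="nvd-rss"
              pk="link"
              url="https://nvd.nist.gov/download/nvd-rss.xml"
              processor="XPathEntityProcessor"
              forEach="/RDF/item"
              transformer="DateFormatTransformer">

        <field column="title"   xpath="/RDF/item/title"       commonField="true"/>
        <field column="link"    xpath="/RDF/item/link"        commonField="true"/>
        <field column="summary" xpath="/RDF/item/description" commonField="true"/>
        <field column="date"    xpath="/RDF/item/date"        commonField="true"/>
      </entity>
    </document>
  </dataConfig>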






However, my problem is that I also have to load multiple XML feeds into 
the same core.  Here is one example (there are about 10 of them):


http://static.nvd.nist.gov/feeds/xml/cve/nvdcve-2.0-2014.xml.zip


Is there any built-in functionality that would allow me to do this? 
Basically, the use-case is to load and index all the XML ZIP files 
first, and then check the RSS feed every two hours and update the 
indexes with any new ones.


Regards,

Joe




Re: SolrCloud timing out marking node as down during startup.

2015-01-23 Thread Shalin Shekhar Mangar
Hi Mike,

This is a bug which was fixed in Solr 4.10.3 via
http://issues.apache.org/jira/browse/SOLR-6610 and it slows down cluster
restarts. Since you have a single node cluster, you will run into it on
every restart.

On Thu, Jan 22, 2015 at 6:42 PM, Michael Roberts 
wrote:

> Hi,
>
> I'm seeing some odd behavior that I am hoping someone could explain to me.
>
> The configuration I'm using to repro the issue, has a ZK cluster and a
> single Solr instance. The instance has 10 Cores, and none of the cores are
> sharded.
>
> The initial startup is fine, the Solr instance comes up and we build our
> index. However if the Solr instance exits uncleanly (killed rather than
> sent a SIGINT), the next time it starts I see the following in the logs.
>
> 2015-01-22 09:56:23.236 -0800 (,,,) localhost-startStop-1 : INFO
> org.apache.solr.common.cloud.ZkStateReader - Updating cluster state from
> ZooKeeper...
> 2015-01-22 09:56:30.008 -0800 (,,,) localhost-startStop-1-EventThread :
> DEBUG org.apache.solr.common.cloud.SolrZkClient - Submitting job to respond
> to event WatchedEvent state:SyncConnected type:NodeChildrenChanged
> path:/live_nodes
> 2015-01-22 09:56:30.008 -0800 (,,,) zkCallback-2-thread-1 : DEBUG
> org.apache.solr.common.cloud.ZkStateReader - Updating live nodes... (0)
> 2015-01-22 09:57:24.102 -0800 (,,,) localhost-startStop-1 : WARN
> org.apache.solr.cloud.ZkController - Timed out waiting to see all nodes
> published as DOWN in our cluster state.
> 2015-01-22 09:57:24.102 -0800 (,,,) localhost-startStop-1 : INFO
> org.apache.solr.cloud.ZkController - Register node as live in
> ZooKeeper:/live_nodes/10.18.8.113:11000_solr
> My question is about "Timed out waiting to see all nodes published as DOWN
> in our cluster state."
>
> Cursory look at the code, we seem to iterate through all
> Collections/Shards, and mark the state as Down. These notifications are
> offered to the Overseer, who I believe updates the ZK state. We then wait
> for the ZK state to update, with the 60 second timeout.
>
> However, it looks like the Overseer is not started until after we wait for
> the timeout. So, in a single instance scenario we'll always have to wait
> for the timeout.
>
> Is this the expected behavior (and just a side effect of running a single
> instance in cloud mode), or is my understanding of the Overseer/Zk
> relationship incorrect?
>
> Thanks.
>
> .Mike
>
>


-- 
Regards,
Shalin Shekhar Mangar.


Re: How do you query a sentence composed of multiple words in a description field?

2015-01-23 Thread Walter Underwood
It isn’t that complicated. You need to understand URL escaping for working with 
any REST client. As soon as you need to read the logs, you’ll need to 
understand it.

The double quote becomes %22 and the colon becomes %3A. In a parameter, the 
spaces can be +, but in a path they need to be %20. 

http://localhost:8983/solr/nvd-rss/select?wt=json&indent=true&q=summary%3A%22Oracle+Fusion%22
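
Another option, if a POST is acceptable, is to let curl do the encoding for you:

  curl "http://localhost:8983/solr/nvd-rss/select" \
       --data-urlencode 'q=summary:"Oracle Fusion"' \
       --data-urlencode wt=json --data-urlencode indent=true

--data-urlencode switches curl to a POST, which Solr's /select handler accepts.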

wunder

Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/


On Jan 23, 2015, at 7:08 AM, Carl Roberts  wrote:

> Thanks Erick,
> 
> I think I am going to start using the browser for testing...:) Perhaps also a 
> REST client for the Mac.
> 
> Regards,
> 
> Joe
> 
> On 1/22/15, 6:56 PM, Erick Erickson wrote:
>> Have you considered using the admin/query form? Lots of escaping is done
>> there for you. Once you have the form of the query down and know what to
>> expect, it's probably easier to enter "escaping hell" with curl and the
>> like
>> 
>> And what is your schema definition for the field in question? the
>> admin/analysis page can help a lot here.
>> 
>> Best,
>> Erick
>> 
>> On Thu, Jan 22, 2015 at 3:51 PM, Carl Roberts >> wrote:
>>> Thanks Shawn - I tried this but it does not work.  I don't even get a
>>> response from curl when I try that format and when I look at the logging on
>>> the console for Jetty I don't see anything new - it seems that the request
>>> is not even making it to the server.
>>> 
>>> 
>>> 
>>> On 1/22/15, 6:43 PM, Shawn Heisey wrote:
>>> 
 On 1/22/2015 4:31 PM, Carl Roberts wrote:
 
> Hi Walter,
> 
> If I try this from my Mac shell:
> 
>  curl
> http://localhost:8983/solr/nvd-rss/select?wt=json&indent=true&q=summary
> :"Oracle
> Fusion"
> 
> I don't get a response.
> 
 Quotes are a special character to the shell on your mac, and get removed
 from what the curl command sees.  You'll need to put the whole thing in
 quotes (so that characters like & are not interpreted by the shell) and
 then escape the quotes that you want to actually be handled by curl:
 
 curl
 "http://localhost:8983/solr/nvd-rss/select?wt=json&indent=true&q=summary
 :\"Oracle
 Fusion\""
 
 Thanks,
 Shawn
 
 
> 



Re: Is Solr a good candidate to index 100s of nodes in one XML file?

2015-01-23 Thread Carl Roberts
I got the RSS DIH example to work with my own RSS feed and it works 
great - thanks for the help.


On 1/22/15, 11:20 AM, Carl Roberts wrote:

Thanks. I am looking at the RSS DIH example right now.


On 1/21/15, 3:15 PM, Alexandre Rafalovitch wrote:

Solr is just fine for this.

It even ships with an example of how to read an RSS file under the DIH
directory. DIH is also most likely what you will use for the first
implementation. Don't need to worry about Stax or anything, unless
your file format is very weird or has overlapping namespaces (DIH XML
parser does not care about namespaces).

Regards,
   Alex.

Sign up for my Solr resources newsletter at http://www.solr-start.com/


On 21 January 2015 at 14:53, Carl Roberts 
 wrote:

Hi,

Is Solr a good candidate to index 100s of nodes in one XML file?

I have an RSS feed XML file that has 100s of nodes with several 
elements in
each node that I have to index, so I was planning to parse the XML 
with Stax
and extract the data from each node and add it to Solr.  There will 
always
be only one one file to start with and then a second file as the RSS 
feeds

supplies updates.  I want to return certain fields of each node when I
search certain fields of the same node.  Is Solr overkill in this case?
Should I just use Lucene instead?

Regards,

Joe






Re: How do you query a sentence composed of multiple words in a description field?

2015-01-23 Thread Carl Roberts

Thanks Erick,

I think I am going to start using the browser for testing...:) Perhaps 
also a REST client for the Mac.


Regards,

Joe

On 1/22/15, 6:56 PM, Erick Erickson wrote:

Have you considered using the admin/query form? Lots of escaping is done
there for you. Once you have the form of the query down and know what to
expect, it's probably easier to enter "escaping hell" with curl and the
like

And what is your schema definition for the field in question? the
admin/analysis page can help a lot here.

Best,
Erick

On Thu, Jan 22, 2015 at 3:51 PM, Carl Roberts 
wrote:
Thanks Shawn - I tried this but it does not work.  I don't even get a
response from curl when I try that format and when I look at the logging on
the console for Jetty I don't see anything new - it seems that the request
is not even making it to the server.



On 1/22/15, 6:43 PM, Shawn Heisey wrote:


On 1/22/2015 4:31 PM, Carl Roberts wrote:


Hi Walter,

If I try this from my Mac shell:

  curl
http://localhost:8983/solr/nvd-rss/select?wt=json&indent=true&q=summary
:"Oracle
Fusion"

I don't get a response.


Quotes are a special character to the shell on your mac, and get removed
from what the curl command sees.  You'll need to put the whole thing in
quotes (so that characters like & are not interpreted by the shell) and
then escape the quotes that you want to actually be handled by curl:

curl
"http://localhost:8983/solr/nvd-rss/select?wt=json&indent=true&q=summary
:\"Oracle
Fusion\""

Thanks,
Shawn






solr 4.7 Converting from one boost method to another using ExternalFileField

2015-01-23 Thread Parnit Pooni
Hi,

I'm currently running into issues creating a Solr query that boosts on
two ExternalFileFields. The following query seems to work, but it is extremely
long, repeats the query terms, and does not use the approach I would like to use.


http://localhost/solr/Index/select?fl=field(externalFileField1),field(externalFileField2),score&q={!boost%20b=map(field(externalFileField1),5,15,10,3)}term+{!boost%20b=map(field(externalFileField2),70,90,25,1)}term


I would like to use bq instead of {!boost b=myBoostFunction()} format as
you can see I am repeating the term again according to the following
tutorial the formats should be compatible.
http://nolanlawson.com/2012/06/02/comparing-boost-methods-in-solr/

sample query


http://localhost/solr/Index/select?q=*&fl=field(externalFileField1),_val1_:map(field(externalFileField2),5,15,10,3),_val2_:map(field(externalFileField2),70,90,25,1),field(externalFileField2),score&bq=_val_:map(field(externalFileField1),5,15,10,3)%20_val1_:map(field(externalFileField2),70,90,25,1)

This does not seem to work and returns documents ranked by the primary keys.

I have tried multiple queries using bq with an external file field, and the
boosting does not seem to work. An additional requirement I have is to boost
on a specific range of the ExternalFileField, and the map function helps
achieve this.
Any help is greatly appreciated.

Regards,
Parnit


Re: Using tmpfs for Solr index

2015-01-23 Thread Shawn Heisey
On 1/23/2015 2:40 AM, Toke Eskildsen wrote:
> If you have a single index on a box with enough memory to fully cache
> the index data, I would recommend just using MMapDirectory without
> involving tmpfs.

If it's Solr 4.x, I have pretty much the same advice, with one small
change.  I would actually use the default directory in Solr, which is
NRTCachingDirectoryFactory.  This is a wrapper directory implementation
that provides a small amount of memory-based caching on top of another
implementation.  On 64-bit Java, the wrapped implementation will be
MMapDirectory.  Turning on updateLog is highly recommended with the NRT
implementation.
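
For reference, the relevant pieces of a stock 4.x solrconfig.xml look roughly
like this (these are the example defaults, not a tuning recommendation):

  <directoryFactory name="DirectoryFactory"
                    class="${solr.directoryFactory:solr.NRTCachingDirectoryFactory}"/>

  <updateHandler class="solr.DirectUpdateHandler2">
    <updateLog>
      <str name="dir">${solr.ulog.dir:}</str>
    </updateLog>
  </updateHandler>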

Thanks,
shawn



Re: trying to get Apache Solr working with Dovecot.

2015-01-23 Thread Shawn Heisey
On 1/23/2015 12:11 AM, Kevin Laurie wrote:
> The solr / lucene version is 4.10.2
> 
> I am trying to figure out how to see if Dovecot and Solr can contact.
> Apparently when I make searches there seems to be no contact. I might try
> to rebuild dovecot again and see if that solves the problem.
> 
> I just checked var/log/solr and its empty. Might need to enable debugging
> on Solr.
> 
> Regarding tracing, not much as I am still relatively new(might be a
> challenge) but will figure out.
> 
> Is there any well documented manual for dovecot-solr integration?

Very likely you'll need to talk to whoever made the Solr plugin for
dovecot.  If they look at your situation and tell you that the problem
is in Solr itself and they don't know what to do, then you can come back
here with the specific logs and information that they point to.

Solr, as shipped by Apache, defaults to INFO level logging (which is
very verbose), into a file named logs/solr.log.  If the solr log is in
/var/log, then I can tell you already that either you're not using Solr
directly from Apache, or someone has changed the config.  If the Solr is
packaged by someone else, they will have information that we don't, and
they'll be better situated to help you.  If it's Solr from Apache but
the config has been modified, then we need to know what modifications
were made.

If I do a google search for "dovecot solr" (without the quotes), the
very first hit that comes up looks like it's completely relevant.  The
links at the end of that page are not very helpful -- one requires
authentication and the other talks about Solr 1.4.0, which is five years
old.

http://wiki2.dovecot.org/Plugins/FTS/Solr

Thanks,
Shawn



Re: Suggester Example In Documentation Not Working

2015-01-23 Thread Charles Sanders
Well, I'm running LucidWorks 2.9.1 which contains Solr 4.8. 

I initially was working with the Solr documentation: 
http://archive.apache.org/dist/lucene/solr/ref-guide/apache-solr-ref-guide-4.8.pdf
 

However, you will notice on page 228, under the section "Suggester", it gives 
an example of a suggester search component using solr.SpellCheckComponent. 

Then our support rep at LucidWorks said we should use the documentation found 
here: 
https://cwiki.apache.org/confluence/display/solr/Suggester 

This documentation is for Solr 5.0, however, you will notice the statement: 
"Solr has long had the autosuggest functionality, but Solr 4.7 introduced a new 
approach based on a dedicated SuggestComponent . " 

So it would appear the solr.SuggestComponent has been around since 4.7, but the 
documentation has not caught up with the changes, which is the source of a 
little confusion. 

Nevertheless, I had hoped to find a simple working example that I could use as 
a starting point to get the solr.SuggestComponent working so that I might play 
around with it and make it do what I want. The suggester appears to have many 
parameters and options, several of which have little or no explanation. 

No problem. I will just have to invest a little more time to unravel how the 
component works and how I can best use it. 

Thanks for your reply. 


- Original Message -

From: "Chris Hostetter"  
To: solr-user@lucene.apache.org 
Sent: Thursday, January 22, 2015 12:50:46 PM 
Subject: Re: Suggester Example In Documentation Not Working 


1) which version of Solr are you using? (note that the online HTML ref 
guide is a DRARFT that applies to 5.0 - you may want to review the 
specific released version of the ref guide that applies to your version of 
solr: http://archive.apache.org/dist/lucene/solr/ref-guide/ 

2) the behavior of the suggester is very specific to the contents of the 
dictionary built -- the examples on that page apply to the example docs 
included with solr -- hence the techproduct data, and the example queries 
for input like "elec" suggesting "electronics" 

no where on that page is an example using the query "kern" -- wether or 
not that input would return a suggestion is going to be entirely dependent 
on wether the dictionary you built contains any similar terms to suggest. 

if you can please post more details about your documents -- ideally a full 
set of all the documents in your index (using a small test index of 
course) that may help to understand the results you are getting. 



: Date: Thu, 22 Jan 2015 11:14:43 -0500 (EST) 
: From: Charles Sanders  
: Reply-To: solr-user@lucene.apache.org 
: To: solr-user@lucene.apache.org 
: Subject: Suggester Example In Documentation Not Working 
: 
: Attempting to follow the documentation found here: 
: https://cwiki.apache.org/confluence/display/solr/Suggester 
: 
: The example given in the documentation is not working. See below my 
configuration. I only changed the field names to those in my schema. Can anyone 
provide an example for this component that actually works? 
: 
:  <searchComponent name="suggest" class="solr.SuggestComponent"> 
:    <lst name="suggester"> 
:      <str name="name">mySuggester</str> 
:      <str name="lookupImpl">FuzzyLookupFactory</str> 
:      <str name="dictionaryImpl">DocumentDictionaryFactory</str> 
:      <str name="field">sugg_allText</str> 
:      <str name="weightField">suggestWeight</str> 
:      <str name="suggestAnalyzerFieldType">string</str> 
:    </lst> 
:  </searchComponent> 
: 
:  <requestHandler name="/suggest" class="solr.SearchHandler" startup="lazy"> 
:    <lst name="defaults"> 
:      <str name="suggest">true</str> 
:      <str name="suggest.count">10</str> 
:      <str name="suggest.build">true</str> 
:    </lst> 
:    <arr name="components"> 
:      <str>suggest</str> 
:    </arr> 
:  </requestHandler> 
: 
: 
http://localhost:/solr/collection1/suggest?suggest=true&suggest.build=true&suggest.dictionary=mySuggester&wt=json&suggest.q=kern
 
: 
: 
{"responseHeader":{"status":0,"QTime":4},"command":"build","suggest":{"mySuggester":{"kern":{"numFound":0,"suggestions":[]
 
: 

-Hoss 
http://www.lucidworks.com/ 



Re: In a SolrCloud, will a solr core(shard replica) failover to its good peer when its state is not Active

2015-01-23 Thread Shawn Heisey
On 1/22/2015 11:28 PM, 汤林 wrote:
> From a testing aspect, if we would like to verify the case that a query
> request to a "down" core on a running server will be failed over to the
> good core on another running server, is there any way to make a core as
> "down" on a running server? Thanks!

I think that would depend on exactly why the core is down.  Most
failures will make the core nonexistent in Solr, thus unable to accept
queries, but if you have a problem that results in the core functioning
correctly but the cluster is unaware of that fact, then the query would
probably work.  That kind of failure shouldn't be possible, but all
software has bugs, so I'm not going to rule it out.
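
For test purposes, the usual way to get a genuinely "down" replica is to stop
the Solr process (or its servlet container) on one node and wait for its
ZooKeeper session to expire; queries against the collection are then served by
the remaining replicas. Unloading the core through the CoreAdmin API is
another option, although that removes the replica from the cluster state
rather than just marking it down -- a sketch, with host and core names purely
illustrative:

http://server2:8983/solr/admin/cores?action=UNLOAD&core=collection1_shard1_replica2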

> We tried to change the /clusterstate.json in ZooKeeper to mark an "active"
> core as "down", but it seems only change the state in ZK, while the core
> still functions in solr server.

I have no idea what will happen with manual clusterstate manipulation.
Although I do have a small cloud install, it's running an ancient
version (4.2.1), has no sharded indexes, and I do not interact with it
on a regular basis.

Thanks,
Shawn



RE: Avoiding wildcard queries using edismax query parser

2015-01-23 Thread Ryan, Michael F. (LNG-DAY)
Here's a Jira for this: https://issues.apache.org/jira/browse/SOLR-3031

I've attached a patch there that might be useful for you.

-Michael

-Original Message-
From: Jorge Luis Betancourt González [mailto:jlbetanco...@uci.cu] 
Sent: Thursday, January 22, 2015 4:34 PM
To: solr-user@lucene.apache.org
Subject: Avoiding wildcard queries using edismax query parser

Hello all,

Currently we are using the edismax query parser in an internal application.
We've detected that some wildcard queries, including "*", are causing
performance issues, and for this particular case we're not interested in
allowing any user to request all the indexed documents.

This could easily be escaped at the application level, but right now we have
several applications (written in several programming languages) consuming from
Solr, and adding this to each application is kind of exhausting, so I'm
wondering if there is some configuration that allows us to treat these special
characters as normal alphanumeric characters.

I've tried one solution that worked before, involving the WordDelimiterFilter
and the types attribute:

<filter class="solr.WordDelimiterFilterFactory" generateNumberParts="0"
        catenateWords="0" catenateNumbers="0" catenateAll="0"
        splitOnCaseChange="0" preserveOriginal="0" types="characters.txt" />

and in characters.txt I've mapped the special characters into ALPHA:

+ => ALPHA 
* => ALPHA 

Any thoughts on this?


---
XII Aniversario de la creación de la Universidad de las Ciencias Informáticas. 
12 años de historia junto a Fidel. 12 de diciembre de 2014.



Re: Retrieving Phonetic Code as result

2015-01-23 Thread Jack Krupansky
Your app can use the field analysis API (FieldAnalysisRequestHandler) to
query Solr for what the resulting field values are for each filter in the
analysis chain for a given input string. This is what the Solr Admin UI
Analysis web page uses.

See:
http://lucene.apache.org/solr/4_10_2/solr-core/org/apache/solr/handler/FieldAnalysisRequestHandler.html
and in solrconfig.xml:

  <requestHandler name="/analysis/field" startup="lazy" class="solr.FieldAnalysisRequestHandler" />
-- Jack Krupansky
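
For example (the field type name and input value are illustrative), a request
such as

http://localhost:8983/solr/collection1/analysis/field?analysis.fieldtype=text_phonetic&analysis.fieldvalue=Smith&wt=json

returns the tokens produced by each stage of that field type's analysis chain,
including the codes emitted by the phonetic filter, and the app can read the
codes out of that response.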

On Thu, Jan 22, 2015 at 8:42 AM, Amit Jha  wrote:

> Hi,
>
> I need to know how can I retrieve phonetic codes. Does solr provide it as
> part of result? I need codes for record matching.
>
> *following is schema fragment:*
>
>  class="solr.TextField" >
>   
> 
>  maxCodeLength="4"/>
>   
> 
>
>  
>   
>   
>   
>
> 
>  
>


Re: Avoiding wildcard queries using edismax query parser

2015-01-23 Thread Jack Krupansky
Presence of a wildcard in a query term is detected by the traditional Solr
and edismax query parsers and causes normal term analysis to be bypassed.
As I said, wildcards are a feature that dismax specifically
doesn't support - this has nothing to do with edismax.
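
If you do want to block or rewrite such queries on the Solr side, the "custom
first component" idea mentioned further down the thread would be wired up in
solrconfig.xml roughly like this -- the component class name is hypothetical,
and its prepare() method would rewrite or reject a bare "*" query:

<searchComponent name="wildcardGuard" class="com.example.WildcardGuardComponent" />

<requestHandler name="/select" class="solr.SearchHandler">
  <arr name="first-components">
    <str>wildcardGuard</str>
  </arr>
</requestHandler>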

-- Jack Krupansky

On Fri, Jan 23, 2015 at 12:45 AM, Jorge Luis Betancourt González <
jlbetanco...@uci.cu> wrote:

> Hi Jack!
>
> Yes, that was my point, I was thinking that being edismax an extended
> version of dismas, perhaps had a switch to turn on/off this feature or
> putting some limits. I've tried the multiterm approach but with no luck,
> the "*" keeps being treated a match all query, as far as I can see from
> enabling debug output:
>
>"rawquerystring": "*",
>"querystring": "*",
>"parsedquery": "(+MatchAllDocsQuery(*:*) ()
> FunctionQuery(1.0/(3.16E-11*float(ms(const(142198920),date(lastModified)))+1.0)))/no_coord",
>
> The query gets translated into a MatchAllDocsQuery, which I think happens
> before the textual analysis.
>
> - Original Message -
> From: "Jack Krupansky" 
> To: solr-user@lucene.apache.org
> Sent: Friday, January 23, 2015 12:02:44 AM
> Subject: Re: Avoiding wildcard queries using edismax query parser
>
> The dismax query parser does not support wildcards. It is designed to be
> simpler.
>
> -- Jack Krupansky
>
> On Thu, Jan 22, 2015 at 5:57 PM, Jorge Luis Betancourt González <
> jlbetanco...@uci.cu> wrote:
>
> > I was also suspecting something like that, the odd thing was that the
> with
> > the dismax parser this seems to work, I mean passing a single * in the
> > query just like:
> >
> >
> >
> http://localhost:8983/solr/collection1/select?q=*&wt=json&indent=true&defType=dismax
> >
> > Returns:
> >
> > {
> >   "responseHeader":{
> > "status":0,
> > "QTime":3},
> >   "response":{"numFound":0,"start":0,"docs":[]
> >   },
> >   "highlighting":{}
> > }
> >
> > Which is consisten with no "*" term indexed.
> >
> > Based on what I saw with dismax, I though that perhaps a configuration
> > option existed to accomplish the same with the edismax query parser, but
> I
> > haven't found such option.
> >
> > I'm going to test with a custom search component.
> >
> > Thanks for the quick response Alex,
> >
> > Regards,
> >
> > - Original Message -
> > From: "Alexandre Rafalovitch" 
> > To: "solr-user" 
> > Sent: Thursday, January 22, 2015 4:46:08 PM
> > Subject: Re: Avoiding wildcard queries using edismax query parser
> >
> > I suspect the special characters get caught before the analyzer chains.
> >
> > But what about pre-pending a custom search components?
> >
> > Regards,
> >Alex.
> > 
> > Sign up for my Solr resources newsletter at http://www.solr-start.com/
> >
> >
> > On 22 January 2015 at 16:33, Jorge Luis Betancourt González
> >  wrote:
> > > Hello all,
> > >
> > > Currently we are using edismax query parser in an internal application,
> > we've detected that some wildcard queries including "*" are causing some
> > performance issues and for this particular case we're not interested in
> > allowing any user to request all the indexed documents.
> > >
> > > This could be easily escaped in the application level, but right now we
> > have several applications (using several programming languages) consuming
> > from Solr, and adding this into each application is kind of exhausting,
> so
> > I'm wondering if there is some configuration that allow us to treat this
> > special characters as normal alphanumeric characters.
> > >
> > > I've tried one solution that worked before, involving the
> > WordDelimiterFilter an the types attribute:
> > >
> > >  > generateNumberParts="0" catenateWords="0"
> > > catenateNumbers="0" catenateAll="0" splitOnCaseChange="0"
> > preserveOriginal="0" types="characters.txt" />
> > >
> > > and in characters.txt I've mapped the special characters into ALPHA:
> > >
> > > + => ALPHA
> > > * => ALPHA
> > >
> > > Any thoughts on this?
> > >
> > >
> > > ---
> > > XII Aniversario de la creación de la Universidad de las Ciencias
> > Informáticas. 12 años de historia junto a Fidel. 12 de diciembre de 2014.
> > >
> >
> >
> > ---
> > XII Aniversario de la creación de la Universidad de las Ciencias
> > Informáticas. 12 años de historia junto a Fidel. 12 de diciembre de 2014.
> >
> >
>
>
> ---
> XII Aniversario de la creación de la Universidad de las Ciencias
> Informáticas. 12 años de historia junto a Fidel. 12 de diciembre de 2014.
>
>


Re: Count total frequency of a word in a SOLR index

2015-01-23 Thread Nitin Solanki
Ok.. Is there any way to use a user-defined field instead of word and freq in
the suggestion block?

On Fri, Jan 23, 2015 at 2:33 PM, Mikhail Khludnev <
mkhlud...@griddynamics.com> wrote:

> I don't think it's implemented.
> I can propose to send the first request to termsComponent, that yields
> terms by prefix, then the second request can gather totaltermfreqs.
>
> On Fri, Jan 23, 2015 at 11:51 AM, Nitin Solanki 
> wrote:
>
> > Thanks Mikhail Khludnev..
> > I tried this:
> > *
> >
> http://localhost:8983/solr/collection1/spell?q=gram:%22the%22&rows=1&fl=totaltermfreq(gram,the)
> > <
> >
> http://localhost:8983/solr/collection1/spell?q=gram:%22the%22&rows=1&fl=totaltermfreq(gram,the)
> > >*
> > and it worked.
> > I want to know more. Can we do same thing *(totaltermfreq)* on
> suggestions
> > ? I tried "th" and get "the" is suggestion. I want to retrieve term
> > frequency not document frequency even in the suggestions. Can I do that?
> >
> > *Instance of suggestions: *
> > 
> > the
> > 897  *Here -* freq is Document frequency. I need
> > Term frequency
> > 
> >
> >
> >
> > On Fri, Jan 23, 2015 at 1:53 PM, Mikhail Khludnev <
> > mkhlud...@griddynamics.com> wrote:
> >
> > > https://cwiki.apache.org/confluence/display/solr/Function+Queries
> > > totaltermfreq()
> > >
> > > of you need to sum term freq on docs from resultset?
> > >
> > >
> > > On Fri, Jan 23, 2015 at 10:56 AM, Nitin Solanki 
> > > wrote:
> > >
> > > > I indexed some text_file files in Solr as it is. Applied "
> > > > *StandardTokenizerFactory*" and "*ShingleFilterFactory*" on text_file
> > > field
> > > >
> > > > *Configuration of Schema.xml structure below :*
> > > >  > > required="true"
> > > > multiValued="false" />
> > > >  > > > required="true" multiValued="false"/>
> > > >
> > > >
> > > >
> > > >
> > > >
> > > >
> > > >
> > > >
> > > >
> > > >
> > > > * > > > positionIncrementGap="100">> > > type="index">  > > > class="solr.StandardTokenizerFactory"/>
> > >   > > > class="solr.ShingleFilterFactory" maxShingleSize="5"
> minShingleSize="2"
> > > > outputUnigrams="true"/>   > > > type="query">  > > > class="solr.StandardTokenizerFactory"/>
> > >   > > > class="solr.ShingleFilterFactory" maxShingleSize="5"
> minShingleSize="2"
> > > > outputUnigrams="true"/>  *
> > > >
> > > > *Stored Documents like:*
> > > > *[{"id":"1", "text_file": "text": "text of document"}, {"id":"2",
> > > > "text_file": "text": "text of document"} and so on ]*
> > > >
> > > > *Problem* : If I search a word in a SOLR index I get a document count
> > for
> > > > documents which contain this word, but if the word is included more
> > times
> > > > in a document, the total count is still 1 per document. I need every
> > > > returned document is counted for the number of times they have the
> > > searched
> > > > word in the field. *Example* :I see a "numFound" value of 12, but the
> > > word
> > > > "what" is included 20 times in all 12 documents. Could you help me to
> > > find
> > > > where I'm wrong, please?
> > > >
> > >
> > >
> > >
> > > --
> > > Sincerely yours
> > > Mikhail Khludnev
> > > Principal Engineer,
> > > Grid Dynamics
> > >
> > > 
> > > 
> > >
> >
>
>
>
> --
> Sincerely yours
> Mikhail Khludnev
> Principal Engineer,
> Grid Dynamics
>
> 
> 
>


Re: Using tmpfs for Solr index

2015-01-23 Thread Toke Eskildsen
On Fri, 2015-01-23 at 07:34 +0100, deniz wrote:
> Would it boost any performance in case the index has been switched from
> RAMDirectoryFactory to use tmpfs?

RAMDirectoryFactory does not perform well for non-small indexes, so ...
probably yes.

> Or it would simply do the same thing like MMap? 

A fully cached MMap of files on permanent storage should perform the
same as a MMap of files in tmpfs. The primary selling point for tmpfs in
this context is that you force the data to always be in RAM (remember to
turn off or severely limit the swap system).

This makes sense in a mixed environment; for example with two Solr
collections, one small which should always respond as fast as possible,
another large which has lower real-time requirements. Putting the index
of the smaller one on tmpfs should ensure this.

> And in case it would be better to use tmpfs rather than RAMDirectory or
> MMap, which directory factory would be the most feasible one for this
> purpose?

MMap with tmpfs would be my guess, as it should avoid copying of the
data from one memory area to another when accessing files. But it is not
something I have experience with.
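
A minimal sketch of that setup, with the mount point, size and core name purely
illustrative (Linux assumed; note that tmpfs is volatile, so the index has to
be copied or rebuilt there after every reboot):

mount -t tmpfs -o size=8g tmpfs /mnt/solr-tmpfs

and in solrconfig.xml for that core:

<dataDir>/mnt/solr-tmpfs/collection1/data</dataDir>
<directoryFactory name="DirectoryFactory" class="solr.MMapDirectoryFactory"/>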

If you have a single index on a box with enough memory to fully cache
the index data, I would recommend just using MMapDirectory without
involving tmpfs.

- Toke Eskildsen, State and University Library, Denmark




Re: Field collapsing memory usage

2015-01-23 Thread Toke Eskildsen
On Thu, 2015-01-22 at 22:52 +0100, Erick Erickson wrote:
> What do you think about folding this into the Solr (or Lucene?) code
> base? Or is it to specialized?

(writing under the assumption that DVEnabler actually works as it should
for everyone and not just us)

Right now it is an explicit tool. As such, users need to find it and
learn how to use it, which is a large barrier. Most of the time it is
easier just to re-index everything.

It seems to me that it should be possible to do seamlessly instead:
Simply change the schema and reload. Old segments would have emulated
DocValues (high speed, high memory overhead), new segments would have
pure DVs. An optimize would be optional, but highly recommended.
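
In schema.xml terms, the change would just be flipping the docValues attribute
on the affected field and reloading, e.g. (field name illustrative):

<field name="group_field" type="string" indexed="true" stored="false" docValues="true"/>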

- Toke Eskildsen, State and University Library, Denmark




Re: Count total frequency of a word in a SOLR index

2015-01-23 Thread Mikhail Khludnev
I don't think it's implemented.
I can propose sending a first request to the TermsComponent, which yields
terms by prefix; a second request can then gather the totaltermfreq values.
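
For example, using the gram field from earlier in the thread and assuming the
stock /terms handler is configured:

1) http://localhost:8983/solr/collection1/terms?terms.fl=gram&terms.prefix=th&terms.limit=10&wt=json
2) http://localhost:8983/solr/collection1/select?q=*:*&rows=1&fl=totaltermfreq(gram,'the'),totaltermfreq(gram,'this')&wt=json

The first call returns the matching terms (with their document frequencies),
and the second returns the index-wide term frequency for each term plugged
into totaltermfreq().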

On Fri, Jan 23, 2015 at 11:51 AM, Nitin Solanki 
wrote:

> Thanks Mikhail Khludnev..
> I tried this:
> *
> http://localhost:8983/solr/collection1/spell?q=gram:%22the%22&rows=1&fl=totaltermfreq(gram,the)
> <
> http://localhost:8983/solr/collection1/spell?q=gram:%22the%22&rows=1&fl=totaltermfreq(gram,the)
> >*
> and it worked.
> I want to know more. Can we do same thing *(totaltermfreq)* on suggestions
> ? I tried "th" and get "the" is suggestion. I want to retrieve term
> frequency not document frequency even in the suggestions. Can I do that?
>
> *Instance of suggestions: *
> 
> the
> 897  *Here -* freq is Document frequency. I need
> Term frequency
> 
>
>
>
> On Fri, Jan 23, 2015 at 1:53 PM, Mikhail Khludnev <
> mkhlud...@griddynamics.com> wrote:
>
> > https://cwiki.apache.org/confluence/display/solr/Function+Queries
> > totaltermfreq()
> >
> > of you need to sum term freq on docs from resultset?
> >
> >
> > On Fri, Jan 23, 2015 at 10:56 AM, Nitin Solanki 
> > wrote:
> >
> > > I indexed some text_file files in Solr as it is. Applied "
> > > *StandardTokenizerFactory*" and "*ShingleFilterFactory*" on text_file
> > field
> > >
> > > *Configuration of Schema.xml structure below :*
> > >  > required="true"
> > > multiValued="false" />
> > >  > > required="true" multiValued="false"/>
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > > * > > positionIncrementGap="100">> > type="index">  > > class="solr.StandardTokenizerFactory"/>
> >   > > class="solr.ShingleFilterFactory" maxShingleSize="5" minShingleSize="2"
> > > outputUnigrams="true"/>   > > type="query">  > > class="solr.StandardTokenizerFactory"/>
> >   > > class="solr.ShingleFilterFactory" maxShingleSize="5" minShingleSize="2"
> > > outputUnigrams="true"/>  *
> > >
> > > *Stored Documents like:*
> > > *[{"id":"1", "text_file": "text": "text of document"}, {"id":"2",
> > > "text_file": "text": "text of document"} and so on ]*
> > >
> > > *Problem* : If I search a word in a SOLR index I get a document count
> for
> > > documents which contain this word, but if the word is included more
> times
> > > in a document, the total count is still 1 per document. I need every
> > > returned document is counted for the number of times they have the
> > searched
> > > word in the field. *Example* :I see a "numFound" value of 12, but the
> > word
> > > "what" is included 20 times in all 12 documents. Could you help me to
> > find
> > > where I'm wrong, please?
> > >
> >
> >
> >
> > --
> > Sincerely yours
> > Mikhail Khludnev
> > Principal Engineer,
> > Grid Dynamics
> >
> > 
> > 
> >
>



-- 
Sincerely yours
Mikhail Khludnev
Principal Engineer,
Grid Dynamics





Re: Count total frequency of a word in a SOLR index

2015-01-23 Thread Nitin Solanki
Thanks Mikhail Khludnev..
I tried this:
*http://localhost:8983/solr/collection1/spell?q=gram:%22the%22&rows=1&fl=totaltermfreq(gram,the)
*
and it worked.
I want to know more. Can we do the same thing *(totaltermfreq)* on suggestions?
I tried "th" and got "the" as a suggestion. I want to retrieve term frequency,
not document frequency, even in the suggestions. Can I do that?

*Instance of suggestions: *
<lst>
  <str name="word">the</str>
  <int name="freq">897</int>    *Here -* freq is document frequency. I need term frequency.
</lst>



On Fri, Jan 23, 2015 at 1:53 PM, Mikhail Khludnev <
mkhlud...@griddynamics.com> wrote:

> https://cwiki.apache.org/confluence/display/solr/Function+Queries
> totaltermfreq()
>
> of you need to sum term freq on docs from resultset?
>
>
> On Fri, Jan 23, 2015 at 10:56 AM, Nitin Solanki 
> wrote:
>
> > I indexed some text_file files in Solr as it is. Applied "
> > *StandardTokenizerFactory*" and "*ShingleFilterFactory*" on text_file
> field
> >
> > *Configuration of Schema.xml structure below :*
> >  required="true"
> > multiValued="false" />
> >  > required="true" multiValued="false"/>
> >
> >
> >
> >
> >
> >
> >
> >
> >
> >
> > * > positionIncrementGap="100">> type="index">  > class="solr.StandardTokenizerFactory"/>
>   > class="solr.ShingleFilterFactory" maxShingleSize="5" minShingleSize="2"
> > outputUnigrams="true"/>   > type="query">  > class="solr.StandardTokenizerFactory"/>
>   > class="solr.ShingleFilterFactory" maxShingleSize="5" minShingleSize="2"
> > outputUnigrams="true"/>  *
> >
> > *Stored Documents like:*
> > *[{"id":"1", "text_file": "text": "text of document"}, {"id":"2",
> > "text_file": "text": "text of document"} and so on ]*
> >
> > *Problem* : If I search a word in a SOLR index I get a document count for
> > documents which contain this word, but if the word is included more times
> > in a document, the total count is still 1 per document. I need every
> > returned document is counted for the number of times they have the
> searched
> > word in the field. *Example* :I see a "numFound" value of 12, but the
> word
> > "what" is included 20 times in all 12 documents. Could you help me to
> find
> > where I'm wrong, please?
> >
>
>
>
> --
> Sincerely yours
> Mikhail Khludnev
> Principal Engineer,
> Grid Dynamics
>
> 
> 
>


Query to get A-Z index

2015-01-23 Thread Priya Rodrigues
Is there a way to get an A-Z index from a field?

e.g. if the field contains Alpha, Pogo, Zoro,

it should return A, P, Z

Found something similar here
http://stackoverflow.com/questions/8974299/solr-query-by-range-of-name

But is there a way to do this without copyField?

Thanks,
Priya


Re: Count total frequency of a word in a SOLR index

2015-01-23 Thread Mikhail Khludnev
https://cwiki.apache.org/confluence/display/solr/Function+Queries
totaltermfreq()

Or do you need to sum term freqs over the docs in the result set?
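
For example, with the field and word from the question below, termfreq() gives
the per-document count and totaltermfreq() the index-wide count:

http://localhost:8983/solr/collection1/select?q=text_file:what&fl=id,termfreq(text_file,'what'),totaltermfreq(text_file,'what')&wt=json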


On Fri, Jan 23, 2015 at 10:56 AM, Nitin Solanki 
wrote:

> I indexed some text_file files in Solr as it is. Applied "
> *StandardTokenizerFactory*" and "*ShingleFilterFactory*" on text_file field
>
> *Configuration of Schema.xml structure below :*
>  multiValued="false" />
>  required="true" multiValued="false"/>
>
>
>
>
>
>
>
>
>
>
> * positionIncrementGap="100">type="index">  class="solr.StandardTokenizerFactory"/>  class="solr.ShingleFilterFactory" maxShingleSize="5" minShingleSize="2"
> outputUnigrams="true"/>   type="query">  class="solr.StandardTokenizerFactory"/>  class="solr.ShingleFilterFactory" maxShingleSize="5" minShingleSize="2"
> outputUnigrams="true"/>  *
>
> *Stored Documents like:*
> *[{"id":"1", "text_file": "text": "text of document"}, {"id":"2",
> "text_file": "text": "text of document"} and so on ]*
>
> *Problem* : If I search a word in a SOLR index I get a document count for
> documents which contain this word, but if the word is included more times
> in a document, the total count is still 1 per document. I need every
> returned document is counted for the number of times they have the searched
> word in the field. *Example* :I see a "numFound" value of 12, but the word
> "what" is included 20 times in all 12 documents. Could you help me to find
> where I'm wrong, please?
>



-- 
Sincerely yours
Mikhail Khludnev
Principal Engineer,
Grid Dynamics