Indexing HTML Content

2008-05-22 Thread McBride, John
Hello,

In my application I wish to index articles which are stored in HTML
format.

When I index these, the HTML markup is stored along with the content of
the article, which is undesirable.

Do you know of any common way of extracting the text content from HTML
before adding it to SOLR?  I understand SOLR 1.3 has an HTML analyser,
but I am using SOLR 1.2 and won't use 1.3 until it's stable, so I am
looking for a solution that can run over a batch of files before they
are added to SOLR.

Thanks,
John
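
One possible batch pre-processing step, sketched in bash under loud
assumptions: the function name strip_html is made up for illustration,
the tag removal is a crude regular expression rather than a real HTML
parser (script/style bodies and comments containing ">" are not handled),
and only a handful of entities are decoded. It is meant to show the shape
of "clean first, then post to SOLR", not a robust converter.

```shell
#!/bin/bash
# strip_html (hypothetical helper): reads HTML on stdin, writes plain text
# on stdout. Crude by design: every <...> run is replaced with a space.
strip_html() {
  # Join lines first so that tags spanning several lines are still matched.
  tr '\n' ' ' |
    sed -e 's/<[^>]*>/ /g' \
        -e 's/&nbsp;/ /g' -e 's/&lt;/</g' -e 's/&gt;/>/g' \
        -e 's/&amp;/\&/g' \
        -e 's/  */ /g' -e 's/^ //' -e 's/ $//'
}

# Example use over a batch of files (paths are illustrative only):
#   for f in articles/*.html; do
#     strip_html < "$f" > "text/$(basename "$f" .html).txt"
#   done
```

The cleaned text would then be placed in the field of the add document
in place of the raw HTML.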


Analytics e.g. "Top 10 searches"

2008-06-06 Thread McBride, John

Hello,

Is anybody familiar with any SOLR-based analytical tools which would
allow us to extract, for example, the "top ten searches"?

I imagine the best place to log this would be at the query parse level,
after the query has been tokenized and filtered, because of the many
permutations possible at the user input level.

Is there an existing plugin to do this, or could you suggest how to
architect this?

Thanks,
John
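
Short of a plugin, one rough approach is to count queries from the
request log after the fact. The sketch below assumes each logged request
line contains the query as a q=... parameter (introduced by "?", "&" or
"{"); that log format, and the function name top_queries, are assumptions
for illustration. Lower-casing is only a crude stand-in for the
normalisation the analyzer would perform.

```shell
#!/bin/bash
# top_queries (hypothetical helper): reads request-log lines on stdin and
# prints "count<TAB>query" for the N most frequent q= values (default 10).
top_queries() {
  local n="${1:-10}"
  # The leading [?&{] keeps "fq=" from being mistaken for "q=".
  grep -o '[?&{]q=[^&} ]*' |
    sed -e 's/^.q=//' |
    tr '[:upper:]' '[:lower:]' |
    sort | uniq -c | sort -k1,1nr -k2 |
    awk '{ print $1 "\t" $2 }' |
    head -n "$n"
}

# Example use (log path is illustrative only):
#   top_queries 10 < /path/to/request.log
```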


Exposing admin through XML

2008-06-16 Thread McBride, John
Hello,

I have noticed that the solr/admin page pulls in XML status information
from add-on modules in Solr, e.g. DataImportHandler.

Is the core SOLR statistical data exposed through an XML API, such that
I could collate all SOLR Slave status pages into one consolidated admin
panel?



Thanks,
John
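
A sketch of how such a collation could start, assuming each slave serves
its statistics as XML at /solr/admin/stats.jsp (an assumption about the
deployment; check the admin pages of the installed version). The function
names stats_url and collect_stats and the host:port list are made up for
illustration.

```shell
#!/bin/bash
# stats_url (hypothetical helper): builds the assumed statistics URL.
stats_url() {
  local host="$1" port="$2"
  echo "http://${host}:${port}/solr/admin/stats.jsp"
}

# collect_stats (hypothetical helper): fetches the statistics XML from
# every host:port argument into OUT_DIR, one file per slave.
collect_stats() {
  local out_dir="$1"; shift
  local slave host port
  mkdir -p "$out_dir" || return 1
  for slave in "$@"; do
    host="${slave%%:*}"
    port="${slave##*:}"
    if ! curl -s -f --max-time 10 "$(stats_url "$host" "$port")" \
         -o "$out_dir/$host-$port.xml"; then
      echo "could not fetch stats from $slave" >&2
    fi
  done
}

# Example use (host names are illustrative only):
#   collect_stats /tmp/solr-stats slave1:8080 slave2:8080
```

The collected files could then be merged or transformed into a single
panel by whatever front end is convenient.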


Solr/bin/commit problem - fails to commit correctly and render response

2008-06-18 Thread McBride, John
Hello,

I am using the solr/bin/commit file to commit index changes after index
distribution in the collection distribution operations model.

The commit script is printed at the end of the email.

When I run the script as is, I get the following error:

commit request to Solr at port 8080 failed

This is corrected with the following addition to the line:

rs=`curl http://${solr_hostname}:${solr_port}/solr/update -s -d ""`

Becomes:

rs=`curl http://${solr_hostname}:${solr_port}/solr/update -s -d "" -H 'Content-type:text/xml; charset=utf-8'`

This works, but the log reports an error, because the response is not as
expected.
SOLR returns:  0

But the commit script expects: [regular expression]


Has anybody else had problems using this commit script?
Where can I get the latest version?  I got this script from the solr 1.2
package.

Thanks,
John

---
#!/bin/bash
#
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements.  See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License.  You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
# Shell script to force a commit of all changes since last commit
# for a Solr server

orig_dir=$(pwd)
cd ${0%/*}/..
solr_root=$(pwd)
cd ${orig_dir}

unset solr_hostname solr_port webapp_name user verbose debug
. ${solr_root}/bin/scripts-util

# set up variables
prog=${0##*/}
log=${solr_root}/logs/${prog}.log

# define usage string
USAGE="\
usage: $prog [-h hostname] [-p port] [-w webapp_name] [-u username] [-v]
   -h  specify Solr hostname
   -p  specify Solr port number
   -w  specify name of Solr webapp (defaults to solr)
   -u  specify user to sudo to before running script
   -v  increase verbosity
   -V  output debugging info
"

# parse args
while getopts h:p:w:u:vV OPTION
do
  case $OPTION in
  h)
    solr_hostname="$OPTARG"
    ;;
  p)
    solr_port="$OPTARG"
    ;;
  w)
    webapp_name="$OPTARG"
    ;;
  u)
    user="$OPTARG"
    ;;
  v)
    verbose="v"
    ;;
  V)
    debug="V"
    ;;
  *)
    echo "$USAGE"
    exit 1
  esac
done

[[ -n $debug ]] && set -x

if [[ -z ${solr_port} ]]
then
  echo "Solr port number missing in $confFile or command line."
  echo "$USAGE"
  exit 1
fi

# use default hostname if not specified
if [[ -z ${solr_hostname} ]]
then
  solr_hostname=localhost
fi

# use default webapp name if not specified
if [[ -z ${webapp_name} ]]
then
  webapp_name=solr
fi

fixUser "$@"

start=`date +"%s"`

logMessage started by $oldwhoami
logMessage command: $0 $@

rs=`curl http://${solr_hostname}:${solr_port}/solr/update -s -d ""`
if [[ $? != 0 ]]
then
  logMessage failed to connect to Solr server at port ${solr_port}
  logMessage commit failed
  logExit failed 1
fi

# check status of commit request
echo $rs | grep ' /dev/null 2>&1
if [[ $? != 0 ]]
then
  logMessage commit request to Solr at port ${solr_port} failed:
  logMessage $rs
  logExit failed 2
fi

logExit ended 0
---
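
For anyone who has to stay on the older script, the status check could
be made tolerant of both response styles. The sketch below is not the
shipped script: the function name commit_ok is made up, and the two
response shapes it accepts (an older result element carrying a status
attribute, and a newer responseHeader containing an int named status)
are assumptions that should be compared against what the server really
returns.

```shell
#!/bin/bash
# commit_ok (hypothetical helper): succeeds if the response text passed
# as the first argument reports status 0 in either assumed format.
commit_ok() {
  local rs="$1"
  printf '%s\n' "$rs" |
    grep -E -q '<result[^>]*status="0"|<int name="status">0</int>'
}

# Example use inside a commit script, where $rs holds the curl output:
#   if ! commit_ok "$rs"; then
#     echo "commit request failed: $rs" >&2
#     exit 2
#   fi
```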



RE: Solr/bin/commit problem - fails to commit correctly and render response

2008-06-18 Thread McBride, John
Ok I checked out the nightly builds and the two changes have been made.

I will use the SOLR 1.3 version of solr/bin/commit.

Thanks,
John 




snapshooter configuration

2008-06-19 Thread McBride, John
Hello,
 
In my solrconfig I have the entry:
 
 


 

I am unable to get this working - catalina.out reports that it is unable
to find snapshooter.

Do others give the full path to snapshooter?  Why do the template docs
not say /full/path/to/snapshooter?

Thanks,

John



SOLR Timeout

2008-07-09 Thread McBride, John
Hello All,
 
Prior to SOLR 1.3 and the Nutch patch integration - what actually is the
effect of a SOLR (non)-timeout?  Do the threads eventually die?  Does a
new request cause a new query thread to open, or is the system locked?
 
What causes a timeout - a complex query?
 
Is SOLR 1.2 open to DoS attacks by submitting complex queries?
 
Thanks,
John
 
 


SOLR 1.2 Multicore configuration

2008-08-13 Thread McBride, John
 
Hi,

I am deploying an application across 3 geographies - and as a result
will be running multiple solr instances on one host.

I don't want to set up separate wars running on different ports as this
will cause an increased number of firewall requests and require more
management to track the set of ports we are using.

Is it possible to configure the server so that it reads the country
from the URL, say:

Uk/solr/admin
Fr/solr/admin
De/solr/admin

(or possibly from different domain names), uses uk/solrhome etc. as the
Solr home, and passes the request on to the solr/admin handler using
that Solr home?

What is the approach here?  I am a Tomcat config newbie.


As an aside:

In order to simplify things, I am thinking of maintaining just one index
for all countries and placing a country filter on the queries.  The
implication would be throwing away stemming and having all stopwords in
one file, which may not be desirable, but seems logistically simpler -
any comments?

Thanks,
John
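
For the single-index variant, the country restriction can be expressed
as a filter query (the fq request parameter) on each search. A minimal
sketch, assuming every document carries a field holding its country
code; the field name country, the base URL and the function name
country_query_url are made up for illustration, and the query string is
assumed to be URL-encoded already.

```shell
#!/bin/bash
# country_query_url (hypothetical helper): builds a select URL that
# restricts an already URL-encoded query to one country via fq.
country_query_url() {
  local base="$1" country="$2" query="$3"
  echo "${base}/select?q=${query}&fq=country:${country}"
}

# Example use (host and field name are illustrative only):
#   curl -s "$(country_query_url http://localhost:8080/solr uk phone)"
```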


RE: SOLR 1.2 Multicore configuration

2008-08-13 Thread McBride, John
Thanks Ryan,

I think it would be high risk to move to Solr 1.3 as our ops team have a
standard 1.2 configuration.

Perhaps I should ask them...

Thanks,
John 

-Original Message-
From: Ryan McKinley [mailto:[EMAIL PROTECTED] 
Sent: 13 August 2008 16:47
To: solr-user@lucene.apache.org
Subject: Re: SOLR 1.2 Multicore configuration

Check: http://wiki.apache.org/solr/MultiCore

If you can wait a few days, there will likely be a 1.3 release candidate
out soon.

