backport Heliosearch features to Solr

2015-03-01 Thread Yonik Seeley
As many of you know, I've been doing some work in the experimental
heliosearch fork of Solr over the past year.  I think it's time to
bring some more of those changes back.

So here's a poll: Which Heliosearch features do you think should be
brought back to Apache Solr?

http://bit.ly/1E7wi1Q
(link to google form)

-Yonik


Using HDFS with Solr

2015-03-01 Thread Jou Sung-Shik
Hello.

I have a question about using HDFS with Solr.

I observed that when one of the shard nodes goes down, another node takes over
its shards, as shown in this graph from the admin console.


(10.62.65.46 is gone)

                  +- shard 1 - 10.62.65.48 (active)
collection-hdfs---+- shard 2 - 10.62.65.47 (active)
                  +- shard 3 - 10.62.65.48 (active)

So when 10.62.65.46 is restarted, shard 1 is still assigned to the 10.62.65.48
node.

Is that right?

I expected shard 1 to be assigned back to the 10.62.65.46 node instead of the
10.62.65.48 node.

Please advise.

Thanks.

-- 
-
BLOG : http://www.codingstar.net
-


solr5 - where does solr5 look for schema files?

2015-03-01 Thread Gulliver Smith
I am running the out-of-the-box solr5 as instructed in the tutorial.

The Solr documentation has nothing useful to say about the schema
file argument to core creation.

I have a schema.xml that I was using for a solr 4 installation by
manually editing the core directories as root.

When playing with solr5, I have tried a number of things without success.

a) copied my custom schema.xml to
server/solr/configsets/basic_configs/conf/custom_schema.xml
- when I typed custom_schema.xml into the schema: field in the
create core dialog, a core is created but the new schema isn't used.
Making custom_schema.xml into invalid XML doesn't break anything.

b) put custom_schema.xml in an accessible location on my server and
entered the full path into the schema field - in this case I got an
error message Error CREATEing SolrCore 'xxx': Unable to create core
... Invalid path string
/configs/gettingstarted//.../custom_schema.xml

There is no configs directory in the Solr installation. There is no
gettingstarted directory either, though there are
gettingstarted_shard1_replica1 etc. directories.

The only meaningful schema.xml seems to be
server/solr/configsets/basic_configs/conf/schema.xml.

The cores are created in example/cloud/node*/solr

There is no directory structure in the installation matching that
described in the 500-page PDF. The Files screen in the admin console
does not mention schema.xml and there doesn't seem to be any place
naming or showing schema.xml in the admin interface.

So how in the world is one to install a custom schema?

Thanks
Gulliver


Re: solr5 - where does solr5 look for schema files?

2015-03-01 Thread Erick Erickson
You haven't stated it explicitly, but I think you're running SolrCloud, right?

In which case... the configs are all stored in ZooKeeper, and you don't
edit them there. The startup scripts automate the upconfig step that
pushes your configs to ZooKeeper. Thereafter, they are read from
ZooKeeper by each Solr node on startup, but not stored locally on each
node. Otherwise, keeping all the nodes coordinated would be difficult.

You can see the uploaded configs in the Solr admin UI/Cloud/tree/configs
area.

So you keep your configs somewhere (some kind of VCS is recommended)
and, when you make changes to them, push the results to ZK and either
restart or reload your collection.
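
For example, with the scripts that ship with Solr 5 that step looks roughly
like this (zkhost, paths and names below are just placeholders; adjust them to
your setup):

  # upload (or re-upload) a config directory to ZooKeeper under the name "myconf"
  server/scripts/cloud-scripts/zkcli.sh -zkhost localhost:9983 \
    -cmd upconfig -confdir /path/to/myconf -confname myconf

  # then reload the collection that uses that config so the change takes effect
  curl "http://localhost:8983/solr/admin/collections?action=RELOAD&name=mycollection"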

Did you see the documentation at:
https://cwiki.apache.org/confluence/display/solr/Using+ZooKeeper+to+Manage+Configuration+Files?

And assuming I'm right and you're using SolrCloud, I _strongly_ suggest you
try to think in terms of replicas rather than cores. In particular, avoid
using the old, familiar core admin API and instead use the Collections API
(see the ref guide). You can do pretty much anything with the Collections API
that you used to do with the core admin API, and at the same time have a lot
less chance of getting something wrong. The Collections API makes use of the
individual core admin API calls to carry out the requested tasks as necessary.
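
For instance (collection and config names here are made up), creating a
collection against a config set you've already uploaded looks something like:

  curl "http://localhost:8983/solr/admin/collections?action=CREATE&name=mycollection&numShards=2&replicationFactor=2&collection.configName=myconf"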

All that said, the new way of doing things is a bit of a shock to
the system if you're an
old Solr hand, especially in SolrCloud.

Best,
Erick


On Sun, Mar 1, 2015 at 4:58 PM, Gulliver Smith
gulliver.m.sm...@gmail.com wrote:
 I am running the out-of-the-box solr5 as instructed in the tutorial.

 The Solr documentation has nothing useful to say about the schema
 file argument to core creation.

 I have a schema.xml that I was using for a solr 4 installation by
 manually editing the core directories as root.

 When playing with solr5, I have tried a number of things without success.

 a) copied my custom schema.xml to
 server/solr/configsets/basic_configs/conf/custom_schema.xml
 - when I typed custom_schema.xml into the schema: field in the
 create core dialog, a core is created but the new schema isn't used.
 Making custom_schema.xml into invalid XML doesn't break anything.

 b) put custom_schema.xml in an accessible location on my server and
 entered the full path into the schema field - in this case I got an
 error message Error CREATEing SolrCore 'xxx': Unable to create core
 ... Invalid path string
 /configs/gettingstarted//.../custom_schema.xml

 There is no configs directory in the Solr installation. There is no
 gettingstarted directory either, though there are
 gettingstarted_shard1_replica1 etc. directories.

 The only meaningful schema.xml seems to be
 server/solr/configsets/basic_configs/conf/schema.xml.

 The cores are created in example/cloud/node*/solr

 There is no directory structure in the installation matching that
 described in the 500-page PDF. The Files screen in the admin console
 does not mention schema.xml and there doesn't seem to be any place
 naming or showing schema.xml in the admin interface.

 So how in the world is one to install a custom schema?

 Thanks
 Gulliver


Re: Getting started with Solr

2015-03-01 Thread Baruch Kogan
OK, got it, works now.

Maybe you can advise on something more general?

I'm trying to use Solr to analyze HTML data retrieved with Nutch. I want to
crawl a list of webpages built according to a certain template, analyze
certain fields in their HTML (identified by a span class and consisting of a
number), and then output the results as CSV to generate a list of each
website's domain and the sum of the numbers in all the specified fields.

How should I set up the flow? Should I configure Nutch to only pull the
relevant fields from each page, then use Solr to add the integers in those
fields and output to a csv? Or should I use Nutch to pull in everything
from the relevant page and then use Solr to strip out the relevant fields
and process them as above? Can I do the processing strictly in Solr, using
the stuff found here
https://cwiki.apache.org/confluence/display/solr/Indexing+and+Basic+Data+Operations,
or should I use PHP through Solarium or something along those lines?
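
(From the docs it looks like Solr can at least return query results directly
as CSV via the csv response writer; something like the sketch below, with
made-up field names, though I'm not sure that covers the summing step:

  curl "http://localhost:8983/solr/mycollection/select?q=*:*&fq=domain:example.com&fl=domain,price&wt=csv&rows=1000"
)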

Your advice would be appreciated; I don't want to reinvent the wheel.

Sincerely,

Baruch Kogan
Marketing Manager
Seller Panda http://sellerpanda.com
+972(58)441-3829
baruch.kogan at Skype

On Sun, Mar 1, 2015 at 9:17 AM, Baruch Kogan bar...@sellerpanda.com wrote:

 Thanks for bearing with me.

 I start Solr with `bin/solr start -e cloud' with 2 nodes. Then I get this:

 *Welcome to the SolrCloud example!*


 *This interactive session will help you launch a SolrCloud cluster on your
 local workstation.*

 *To begin, how many Solr nodes would you like to run in your local
 cluster? (specify 1-4 nodes) [2] *
 *Ok, let's start up 2 Solr nodes for your example SolrCloud cluster.*

 *Please enter the port for node1 [8983] *
 *8983*
 *Please enter the port for node2 [7574] *
 *7574*
 *Cloning Solr home directory /home/ubuntu/crawler/solr/example/cloud/node1
 into /home/ubuntu/crawler/solr/example/cloud/node2*

 *Starting up SolrCloud node1 on port 8983 using command:*

 *solr start -cloud -s example/cloud/node1/solr -p 8983   *

 I then go to http://localhost:8983/solr/admin/cores and get the following:


 This XML file does not appear to have any style information associated
 with it. The document tree is shown below.

 <response>
   <lst name="responseHeader"><int name="status">0</int><int name="QTime">2</int></lst>
   <lst name="initFailures"/>
   <lst name="status">
     <lst name="testCollection_shard1_replica1">
       <str name="name">testCollection_shard1_replica1</str>
       <str name="instanceDir">/home/ubuntu/crawler/solr/example/cloud/node1/solr/testCollection_shard1_replica1/</str>
       <str name="dataDir">/home/ubuntu/crawler/solr/example/cloud/node1/solr/testCollection_shard1_replica1/data/</str>
       <str name="config">solrconfig.xml</str>
       <str name="schema">schema.xml</str>
       <date name="startTime">2015-03-01T06:59:12.296Z</date>
       <long name="uptime">46380</long>
       <lst name="index">
         <int name="numDocs">0</int><int name="maxDoc">0</int><int name="deletedDocs">0</int>
         <long name="indexHeapUsageBytes">0</long><long name="version">1</long>
         <int name="segmentCount">0</int><bool name="current">true</bool>
         <bool name="hasDeletions">false</bool>
         <str name="directory">org.apache.lucene.store.NRTCachingDirectory:NRTCachingDirectory(MMapDirectory@/home/ubuntu/crawler/solr/example/cloud/node1/solr/testCollection_shard1_replica1/data/index lockFactory=org.apache.lucene.store.NativeFSLockFactory@2a4f8f8b; maxCacheMB=48.0 maxMergeSizeMB=4.0)</str>
         <lst name="userData"/>
         <long name="sizeInBytes">71</long><str name="size">71 bytes</str>
       </lst>
     </lst>
     <lst name="testCollection_shard1_replica2">
       <str name="name">testCollection_shard1_replica2</str>
       <str name="instanceDir">/home/ubuntu/crawler/solr/example/cloud/node1/solr/testCollection_shard1_replica2/</str>
       <str name="dataDir">/home/ubuntu/crawler/solr/example/cloud/node1/solr/testCollection_shard1_replica2/data/</str>
       <str name="config">solrconfig.xml</str>
       <str name="schema">schema.xml</str>
       <date name="startTime">2015-03-01T06:59:12.751Z</date>
       <long name="uptime">45926</long>
       <lst name="index">
         <int name="numDocs">0</int><int name="maxDoc">0</int><int name="deletedDocs">0</int>
         <long name="indexHeapUsageBytes">0</long><long name="version">1</long>
         <int name="segmentCount">0</int><bool name="current">true</bool>
         <bool name="hasDeletions">false</bool>
         <str name="directory">org.apache.lucene.store.NRTCachingDirectory:NRTCachingDirectory(MMapDirectory@/home/ubuntu/crawler/solr/example/cloud/node1/solr/testCollection_shard1_replica2/data/index lockFactory=org.apache.lucene.store.NativeFSLockFactory@2a4f8f8b; maxCacheMB=48.0 maxMergeSizeMB=4.0)</str>
         <lst name="userData"/>
         <long name="sizeInBytes">71</long><str name="size">71 bytes</str>
       </lst>
     </lst>
     <lst name="testCollection_shard2_replica1">
       <str name="name">testCollection_shard2_replica1</str>
       <str name="instanceDir">/home/ubuntu/crawler/solr/example/cloud/node1/solr/testCollection_shard2_replica1/</str>
       <str name="dataDir">/home/ubuntu/crawler/solr/example/cloud/node1/solr/testCollection_shard2_replica1/data/</str>
       <str name="config">solrconfig.xml</str>
       <str name="schema">schema.xml</str>
       <date name="startTime">2015-03-01T06:59:12.596Z</date>
       <long name="uptime">46081</long>
       <lst name="index">
         <int name="numDocs">0</int><int name="maxDoc">0</int><int name="deletedDocs">0</int>
         <long name="indexHeapUsageBytes">0</long>

Correct connection methodology for Zookeeper/SolrCloud?

2015-03-01 Thread Julian Perry

Hi

I'm really after best practice guidelines for making queries to
an index on a Solr cluster.  I'm not calling from Java.

I have Solr 4.10.2 up and running, seems stable.

I have about 6 indexes/collections - am running SolrCloud with
two Solr instances (both currently running on the same dev. box -
just one shard each) and standalone Zookeeper with 3 instances.
All seems fine.  I can do queries against either instance, and
perform index updates and replication works fine.

I'm not using Java to talk to Solr - the web pages are built with
PHP (or something similar - happy to call zk/Solr from C).  So I
need to call Solr from the web page code.  Clearly I need
resilience and so don't want to specifically call one of the Solr
instances directly.

I could just set up a load balancer on the two Solr instances and
let client query requests use the load balancer to find a working
instance.

From what I have read though - I am supposed to make a call to
zookeeper to ask which Solr instances are running up to date and
working replicas of the collection that I need.  Is that right?
I should do that every time I need to make a query?

There seems to be a zookeeper client library in the zk dist - in
zookeeper-3.4.6/src/c/ - can I use that?  It looks like I can
pass in a list of potential zk host:port pairs and it will find
a working zk for me - is that right?

Then I need to ask the working zk which solr instance I should
connect to for the given index/collection - how do I do that -
is that held in clusterstate.json?

So the steps to make a Solr query against my cluster would be:

a) call zk client library with list of zk host/ports

b) ask zk for clusterstate.json

c) pick an active server (at random) for the relevant collection
   (is there some load balancing option in there)

d) call the Solr server returned by (c)

Is that best practice - or am I missing something?

--
Cheers
Jules.



Conditional invocation of HTMLStripCharFactory

2015-03-01 Thread SolrUser1543
Is it possible to make a conditional invocation of HTMLStripCharFactory? I
want to decide when to enable or disable it according to the value of a
specific field in my document. E.g., when the value of field A is true,
enable the filter on field B; otherwise disable it.



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Conditional-invocation-of-HTMLStripCharFactory-tp4190010.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: backport Heliosearch features to Solr

2015-03-01 Thread Otis Gospodnetic
Hi Yonik,

Now that you joined Cloudera, why not everything?

Otis
--
Monitoring * Alerting * Anomaly Detection * Centralized Log Management
Solr & Elasticsearch Support * http://sematext.com/


On Sun, Mar 1, 2015 at 4:50 PM, Yonik Seeley ysee...@gmail.com wrote:

 As many of you know, I've been doing some work in the experimental
 heliosearch fork of Solr over the past year.  I think it's time to
 bring some more of those changes back.

 So here's a poll: Which Heliosearch features do you think should be
 brought back to Apache Solr?

 http://bit.ly/1E7wi1Q
 (link to google form)

 -Yonik



Re: backport Heliosearch features to Solr

2015-03-01 Thread Yonik Seeley
On Sun, Mar 1, 2015 at 7:18 PM, Otis Gospodnetic
otis.gospodne...@gmail.com wrote:
 Hi Yonik,

 Now that you joined Cloudera, why not everything?

Everything is on the table, but from a practical point of view I
wanted to verify areas of user interest/support before doing the work
to get things back.

Even when there is user support, some things may be blocked anyway
(part of the reason why I did things under a fork in the first place).
I'll do what I can though.

-Yonik


 Otis
 --
 Monitoring * Alerting * Anomaly Detection * Centralized Log Management
 Solr & Elasticsearch Support * http://sematext.com/


 On Sun, Mar 1, 2015 at 4:50 PM, Yonik Seeley ysee...@gmail.com wrote:

 As many of you know, I've been doing some work in the experimental
 heliosearch fork of Solr over the past year.  I think it's time to
 bring some more of those changes back.

 So here's a poll: Which Heliosearch features do you think should be
 brought back to Apache Solr?

 http://bit.ly/1E7wi1Q
 (link to google form)

 -Yonik



SOLR Backup and Restore - Solr 3.6.1

2015-03-01 Thread abhi Abhishek
Hello,
   We have Solr 3.6.1 in our environment. We are trying to evaluate backup
and recovery solutions for it. Is there a way to compress the backup that is
taken?

We have looked at the replicationHandler's backup command, but as our index
runs to hundreds of GBs we would like a solution that provides compression to
reduce the storage overhead.
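
For example, would something along these lines be a reasonable approach:
trigger a backup via the replication handler and then compress the resulting
snapshot directory? (Host, core name and snapshot path below are only
placeholders.)

  curl "http://localhost:8983/solr/core1/replication?command=backup"
  tar czf /backups/core1-snapshot.tar.gz -C /var/solr/core1/data snapshot.20150301120000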

thanks in advance

Regards,
Abhishek


Re: solr cloud does not start with many collections

2015-03-01 Thread Damien Kamerman
I still see the same cloud startup issue with Solr 5.0.0. I created 4,000
collections from scratch and then attempted to stop/start the cloud.

node1:
WARN  - 2015-03-02 18:09:02.371;
org.eclipse.jetty.server.handler.RequestLogHandler; !RequestLog
WARN  - 2015-03-02 18:10:07.196; org.apache.solr.cloud.ZkController; Timed
out waiting to see all nodes published as DOWN in our cluster state.
WARN  - 2015-03-02 18:13:46.238; org.apache.solr.cloud.ZkController; Still
seeing conflicting information about the leader of shard shard1 for
collection DD-3219 after 30 seconds; our state says
http://host:8002/solr/DD-3219_shard1_replica1/, but ZooKeeper says
http://host:8000/solr/DD-3219_shard1_replica2/

node2:
WARN  - 2015-03-02 18:09:01.871;
org.eclipse.jetty.server.handler.RequestLogHandler; !RequestLog
WARN  - 2015-03-02 18:17:04.458;
org.apache.solr.common.cloud.ZkStateReader$3; ZooKeeper watch triggered,
but Solr cannot talk to ZK
stop/start
WARN  - 2015-03-02 18:53:12.725;
org.eclipse.jetty.server.handler.RequestLogHandler; !RequestLog
WARN  - 2015-03-02 18:56:30.702; org.apache.solr.cloud.ZkController; Still
seeing conflicting information about the leader of shard shard1 for
collection DD-3581 after 30 seconds; our state says
http://host:8001/solr/DD-3581_shard1_replica2/, but ZooKeeper says
http://host:8002/solr/DD-3581_shard1_replica1/

node3:
WARN  - 2015-03-02 18:09:03.022;
org.eclipse.jetty.server.handler.RequestLogHandler; !RequestLog
WARN  - 2015-03-02 18:10:08.178; org.apache.solr.cloud.ZkController; Timed
out waiting to see all nodes published as DOWN in our cluster state.
WARN  - 2015-03-02 18:13:47.737; org.apache.solr.cloud.ZkController; Still
seeing conflicting information about the leader of shard shard1 for
collection DD-2707 after 30 seconds; our state says
http://host:8002/solr/DD-2707_shard1_replica2/, but ZooKeeper says
http://host:8000/solr/DD-2707_shard1_replica1/



On 27 February 2015 at 17:48, Shawn Heisey apa...@elyograg.org wrote:

 On 2/26/2015 11:14 PM, Damien Kamerman wrote:
  I've run into an issue with starting my solr cloud with many collections.
  My setup is:
  3 nodes (solr 4.10.3 ; 64GB RAM each ; jdk1.8.0_25) running on a single
  server (256GB RAM).
  5,000 collections (1 x shard ; 2 x replica) = 10,000 cores
  1 x Zookeeper 3.4.6
  Java arg -Djute.maxbuffer=67108864 added to solr and ZK.
 
  Then I stop all nodes, then start all nodes. All replicas are in the down
  state, some have no leader. At times I have seen some (12 or so) leaders
 in
  the active state. In the solr logs I see lots of:
 
  org.apache.solr.cloud.ZkController; Still seeing conflicting information
  about the leader of shard shard1 for collection DD-4351 after 30
  seconds; our state says
 http://ftea1:8001/solr/DD-4351_shard1_replica1/,
  but ZooKeeper says http://ftea1:8000/solr/DD-4351_shard1_replica2/

 snip

  I've tried staggering the starts (1min) but does not help.
  I've reproduced with zero documents.
  Restarts are OK up to around 3,000 cores.
  Should this work?

 This is going to push SolrCloud beyond its limits.  Is this just an
 exercise to see how far you can push Solr, or are you looking at setting
 up a production install with several thousand collections?

 In Solr 4.x, the clusterstate is one giant JSON structure containing the
 state of the entire cloud.  With 5000 collections, the entire thing
 would need to be downloaded and uploaded at least 5000 times during the
 course of a successful full system startup ... and I think with
 replicationFactor set to 2, that might actually be 1 times. The
 best-case scenario is that it would take a VERY long time, the
 worst-case scenario is that concurrency problems would lead to a
 deadlock.  A deadlock might be what is happening here.

 In Solr 5.x, the clusterstate is broken up so there's a separate state
 structure for each collection.  This setup allows for faster and safer
 multi-threading and far less data transfer.  Assuming I understand the
 implications correctly, there might not be any need to increase
 jute.maxbuffer with 5.x ... although I have to assume that I might be
 wrong about that.

 I would very much recommend that you set your scenario up from scratch
 in Solr 5.0.0, to see if the new clusterstate format can eliminate the
 problem you're seeing.  If it doesn't, then we can pursue it as a likely
 bug in the 5.x branch and you can file an issue in Jira.

 Thanks,
 Shawn




-- 
Damien Kamerman


filtering tfq() function query to specific part of collection not the whole documents

2015-03-01 Thread Ali Nazemian
Hi,
I was wondering, is it possible to filter the tfq() function query to a
specific selection of the collection? Suppose I want to count all occurrences
of the term test in documents matching fq=category:2; how can I handle such a
query with the tfq() function query? It seems that applying fq=category:2 in a
select query that uses tfq() does not affect tfq(); no matter what the rest of
my query is, tfq() always returns the total term frequency for the specified
field across the whole collection. So what is the solution for this case?
Best regards.

-- 
A.Nazemian


Re: Is it possible to use multiple index data directory in Apache Solr?

2015-03-01 Thread Alexandre Rafalovitch
On 1 March 2015 at 01:03, Shawn Heisey apa...@elyograg.org wrote:
 How exactly does ES split the index files when multiple paths are
 configured?  I am very curious about exactly how this works.  Google is
 not helping me figure it out.  I even grabbed the ES master branch and
 wasn't able to trace how path.data is used after it makes it into the
 environment.

Elasticsearch automatically creates indexes and shards. So, multiple
directories are just used to distribute the shards' indexes among
them. 
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/setup-dir-layout.html
So, when a new shard is created, one of the directories is used either
randomly or usage-based.

So, to me, the question is not about matching the implementation, but about
what the OP is trying to achieve with it: replication? more even disk
utilization? something else?

Regards,
Alex.

Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter:
http://www.solr-start.com/


RE: Is it possible to use multiple index data directory in Apache Solr?

2015-03-01 Thread Susheel Kumar
Under the Solr example folder you will find a multicore folder, under which
you can create multiple core/index directories and edit solr.xml to list each
new core and its directory.
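
A minimal old-style solr.xml for that would look roughly like the sketch
below (core names and dataDir paths are just examples; the optional dataDir
attribute lets you point each core's index at a different disk):

  <solr persistent="true">
    <cores adminPath="/admin/cores">
      <core name="core0" instanceDir="core0" dataDir="/disk1/solr/core0/data" />
      <core name="core1" instanceDir="core1" dataDir="/disk2/solr/core1/data" />
    </cores>
  </solr>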

When you start Solr from the example directory, use a command line like the
one below to load Solr; you should then be able to see these multiple cores in
the Solr admin UI and index data into each core's data directory.

 java -Dsolr.solr.home=multicore -jar start.jar 

Thnx

-Original Message-
From: Jou Sung-Shik [mailto:lik...@gmail.com] 
Sent: February 28, 2015 10:03 PM
To: solr-user@lucene.apache.org
Subject: Is it possible to use multiple index data directory in Apache Solr?

I'm new to Apache Lucene/Solr.

I'm trying to move from Elasticsearch to Apache Solr.

So, I have a question about following index data location configuration.


*in Elasticsearch*

# Can optionally include more than one location, causing data to be striped across
# the locations (a la RAID 0) on a file level, favouring locations with most free
# space on creation. For example:
#
# path.data: /path/to/data1,/path/to/data2

*in Apache Solr*

<dataDir>/var/data/solr/</dataDir>


I want to configure multiple index data directories in Apache Solr, like in
Elasticsearch.

Is it possible?

How I can reach the goal?





--
-
BLOG : http://www.codingstar.net
-


Re: [ANNOUNCE] Luke 4.10.3 released

2015-03-01 Thread Dmitry Kan
Hi Tomoko,

I have just created the pivot branch off of the current master. Let's move
our discussion there:

https://github.com/DmitryKey/luke/tree/pivot-luke

Thanks,
Dmitry

On Fri, Feb 27, 2015 at 7:53 PM, Tomoko Uchida tomoko.uchida.1...@gmail.com
 wrote:

 Hi Dmitry,

 In my environment, I cannot reproduce this Pivot error on HotSpot VM
 1.7.0; please give me some time...
 Or, I'll try to make pull requests to https://github.com/DmitryKey/luke for
 the Pivot version.

 At any rate, it would be best to manage both the (current) Thinlet and
 Pivot versions in the same place, as you suggested.

 Thanks,
 Tomoko

 2015-02-26 22:15 GMT+09:00 Dmitry Kan solrexp...@gmail.com:

  Sure, it is:
 
  java version 1.7.0_76
  Java(TM) SE Runtime Environment (build 1.7.0_76-b13)
  Java HotSpot(TM) 64-Bit Server VM (build 24.76-b04, mixed mode)
 
 
  On Thu, Feb 26, 2015 at 2:39 PM, Tomoko Uchida 
  tomoko.uchida.1...@gmail.com
   wrote:
 
   Sorry, I'm afraid I have not encountered such errors when launch.
   Seems something wrong around Pivot's, but I have no idea about it.
   Would you tell me java version you're using ?
  
   Tomoko
  
   2015-02-26 21:15 GMT+09:00 Dmitry Kan solrexp...@gmail.com:
  
Thanks, Tomoko, it compiles ok!
   
Now launching produces some errors:
   
$ java -cp dist/* org.apache.lucene.luke.ui.LukeApplication
Exception in thread main java.lang.ExceptionInInitializerError
at org.apache.lucene.luke.ui.LukeApplication.main(Unknown
  Source)
Caused by: java.lang.NumberFormatException: For input string: 3
  1644336
   
at
   
   
  
 
 java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
at java.lang.Integer.parseInt(Integer.java:492)
at java.lang.Byte.parseByte(Byte.java:148)
at java.lang.Byte.parseByte(Byte.java:174)
at org.apache.pivot.util.Version.decode(Version.java:156)
at
   
   
  
 
 org.apache.pivot.wtk.ApplicationContext.clinit(ApplicationContext.java:1704)
... 1 more
   
   
On Thu, Feb 26, 2015 at 1:48 PM, Tomoko Uchida 
tomoko.uchida.1...@gmail.com
 wrote:
   
 Thank you for checking out it!
 Sorry, I've forgot to note important information...

 ivy jar is needed to compile. Packaging process needs to be
  organized,
but
 for now, I'm borrowing it from lucene's tools/lib.
 In my environment, Fedora 20 and OpenJDK 1.7.0_71, it can be
 compiled
   and
 run as follows.
 If there are any problems, please let me know.

 

 $ svn co http://svn.apache.org/repos/asf/lucene/sandbox/luke/
 $ cd luke/

 // copy ivy jar to lib/tools
 $ cp /path/to/lucene_solr_4_10_3/lucene/tools/lib/ivy-2.3.0.jar
lib/tools/
 $ ls lib/tools/
 ivy-2.3.0.jar

 $ java -version
 java version 1.7.0_71
 OpenJDK Runtime Environment (fedora-2.5.3.3.fc20-x86_64 u71-b14)
 OpenJDK 64-Bit Server VM (build 24.65-b04, mixed mode)

 $ ant ivy-resolve
 ...
 BUILD SUCCESSFUL

 // compile and make jars and run
 $ ant dist
 ...
 BUILD SUCCESSFULL
 $ java -cp dist/* org.apache.lucene.luke.ui.LukeApplication
 ...
 

 Thanks,
 Tomoko

 2015-02-26 16:39 GMT+09:00 Dmitry Kan solrexp...@gmail.com:

  Hi Tomoko,
 
  Thanks for the link. Do you have build instructions somewhere?
  When I
  executed ant with no params, I get:
 
  BUILD FAILED
  /home/dmitry/projects/svn/luke/build.xml:40:
  /home/dmitry/projects/svn/luke/lib-ivy does not exist.
 
 
  On Thu, Feb 26, 2015 at 2:27 AM, Tomoko Uchida 
  tomoko.uchida.1...@gmail.com
   wrote:
 
   Thanks!
  
   Would you announce at LUCENE-2562 to me and all watchers
  interested
in
  this
   issue, when the branch is ready? :)
   As you know, current pivots's version (that supports Lucene
  4.10.3)
is
   here.
   http://svn.apache.org/repos/asf/lucene/sandbox/luke/
  
   Regards,
   Tomoko
  
   2015-02-25 18:37 GMT+09:00 Dmitry Kan solrexp...@gmail.com:
  
Ok, sure. The plan is to make the pivot branch in the current
github
  repo
and update its structure accordingly.
Once it is there, I'll let you know.
   
Thank you,
Dmitry
   
On Tue, Feb 24, 2015 at 5:26 PM, Tomoko Uchida 
tomoko.uchida.1...@gmail.com
 wrote:
   
 Hi Dmitry,

 Thank you for the detailed clarification!

 Recently, I've created a few patches to Pivot
version(LUCENE-2562),
  so
I'd
 like to some more work and keep up to date it.

  If you would like to work on the Pivot version, may I
  suggest
you
  to
fork
  the github's version? The ultimate goal is to donate this
  to
  Apache,
 

Re: About solr recovery

2015-03-01 Thread Erick Erickson
Several. One is if your network has trouble and Zookeeper times out a Solr node.

Can you describe your problem though? Or is this just an informational
question? Because I'm not quite sure how to respond helpfully here.

Best,
Erick

On Fri, Feb 27, 2015 at 10:37 PM, 龚俊衡 junheng.g...@icloud.com wrote:
 HI,

 Our production Solr's replica went offline for some time, but both ZooKeeper
 and the network were OK, and the Solr JVM was normal.

 My question: is there any other reason that would put a Solr replica into the
 recovering state?


Re: Correct connection methodology for Zookeeper/SolrCloud?

2015-03-01 Thread Erick Erickson
bq: I could just set up a load balancer on the two Solr instances and
let client query requests use the load balancer to find a working
instance.

That's all you need to do. The client shouldn't really even have to be
aware that ZooKeeper exists; there's no need to query ZK and route your
requests yourself. The _Solr_ instances query ZK, know about each other's
state, and are notified of any problems, i.e. nodes going up/down etc.
Once a request hits any running Solr node, it'll be routed around any
problems. In the setup you describe, i.e. not using SolrJ, your client
really shouldn't even need to be aware ZK exists.

Your load balancer should know what nodes are up and route your
requests around any hosed machines.

If you _do_ decide to use SolrJ sometime, CloudSolrServer (renamed
CloudSolrClient in 5x) _does_ take the ZK ensemble and do some smart
routing on the client side, including simple load balancing, and
responds to any solr nodes going up/down for you.

Putting a load balancer in front or some other type of connection,
though, will accomplish much the same thing if Java isn't an option.
The SolrJ stuff is more sophisticated though.
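
For example, from PHP or curl, something like the following is all the client
needs to do (the load-balancer hostname here is just a placeholder); whichever
live Solr node the balancer picks will forward the request to the right
shards/replicas internally:

  curl "http://solr-lb.example.com/solr/mycollection/select?q=*:*&wt=json"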

Best,
Erick

On Sun, Mar 1, 2015 at 3:51 AM, Julian Perry ju...@limitless.co.uk wrote:
 Hi

 I'm really after best practice guidelines for making queries to
 an index on a Solr cluster.  I'm not calling from Java.

 I have Solr 4.10.2 up and running, seems stable.

 I have about 6 indexes/collections - am running SolrCloud with
 two Solr instances (both currently running on the same dev. box -
 just one shard each) and standalone Zookeeper with 3 instances.
 All seems fine.  I can do queries against either instance, and
 perform index updates and replication works fine.

 I'm not using Java to talk to Solr - the web pages are built with
 PHP (or something similar - happy to call zk/Solr from C).  So I
 need to call Solr from the web page code.  Clearly I need
 resilience and so don't want to specifically call one of the Solr
 instances directly.

 I could just set up a load balancer on the two Solr instances and
 let client query requests use the load balancer to find a working
 instance.

 From what I have read though - I am supposed to make a call to
 zookeeper to ask which Solr instances are running up to date and
 working replicas of the collection that I need.  Is that right?
 I should do that every time I need to make a query?

 There seems to be a zookeeper client library in the zk dist - in
 zookeeper-3.4.6/src/c/ - can I use that?  It looks like I can
 pass in a list of potential zk host:port pairs and it will find
 a working zk for me - is that right?

 Then I need to ask the working zk which solr instance I should
 connect to for the given index/collection - how do I do that -
 is that held in clusterstate.json?

 So the steps to make a Solr query against my cluster would be:

 a) call zk client library with list of zk host/ports

 b) ask zk for clusterstate.json

 c) pick an active server (at random) for the relevant collection
(is there some load balancing option in there)

 d) call the Solr server returned by (c)

 Is that best practice - or am I missing something?

 --
 Cheers
 Jules.



Integrating Solr with Nutch

2015-03-01 Thread Baruch Kogan
Hi, guys,

I'm working through the tutorial here
http://wiki.apache.org/nutch/NutchTutorial#A6._Integrate_Solr_with_Nutch.
I've run a crawl on a list of webpages. Now I'm trying to index them into
Solr. Solr's installed, runs fine, indexes .json, .xml, whatever, returns
queries. I've edited the Nutch schema as per instructions. Now I hit a wall:

   -

   Save the file and restart Solr under ${APACHE_SOLR_HOME}/example:

   java -jar start.jar\


On my install (the latest Solr,) there is no such file, but there is a
solr.sh file in the /bin which I can start. So I pasted it into
solr/example/ and ran it from there. Solr cranks over. Now I need to:


   -

   run the Solr Index command from ${NUTCH_RUNTIME_HOME}:

   bin/nutch solrindex http://127.0.0.1:8983/solr/ crawl/crawldb
-linkdb crawl/linkdb crawl/segments/


and I get this:

*ubuntu@ubuntu-VirtualBox:~/crawler/nutch$ bin/nutch solrindex
http://127.0.0.1:8983/solr/ crawl/crawldb
-linkdb crawl/linkdb crawl/segments/*
*Indexer: starting at 2015-03-01 19:51:09*
*Indexer: deleting gone documents: false*
*Indexer: URL filtering: false*
*Indexer: URL normalizing: false*
*Active IndexWriters :*
*SOLRIndexWriter*
* solr.server.url : URL of the SOLR instance (mandatory)*
* solr.commit.size : buffer size when sending to SOLR (default 1000)*
* solr.mapping.file : name of the mapping file for fields (default
solrindex-mapping.xml)*
* solr.auth : use authentication (default false)*
* solr.auth.username : use authentication (default false)*
* solr.auth : username for authentication*
* solr.auth.password : password for authentication*


*Indexer: org.apache.hadoop.mapred.InvalidInputException: Input path does
not exist: file:/home/ubuntu/crawler/nutch/crawl/segments/crawl_fetch*
*Input path does not exist:
file:/home/ubuntu/crawler/nutch/crawl/segments/crawl_parse*
*Input path does not exist:
file:/home/ubuntu/crawler/nutch/crawl/segments/parse_data*
*Input path does not exist:
file:/home/ubuntu/crawler/nutch/crawl/segments/parse_text*
*Input path does not exist:
file:/home/ubuntu/crawler/nutch/crawl/crawldb/current*
*Input path does not exist:
file:/home/ubuntu/crawler/nutch/crawl/linkdb/current*
* at
org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:197)*
* at
org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:40)*
* at
org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:208)*
* at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:1081)*
* at org.apache.hadoop.mapred.JobClient.writeSplits(JobClient.java:1073)*
* at org.apache.hadoop.mapred.JobClient.access$700(JobClient.java:179)*
* at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:983)*
* at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:936)*
* at java.security.AccessController.doPrivileged(Native Method)*
* at javax.security.auth.Subject.doAs(Subject.java:415)*
* at
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1190)*
* at
org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:936)*
* at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:910)*
* at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1353)*
* at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:114)*
* at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:176)*
* at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)*
* at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:186)*

What am I doing wrong?

Sincerely,

Baruch Kogan
Marketing Manager
Seller Panda http://sellerpanda.com
+972(58)441-3829
baruch.kogan at Skype


RE: Integrating Solr with Nutch

2015-03-01 Thread Markus Jelsma
Hello Baruch!

You are pointing to a directory of segments, not a specific segment.

You must either point to a directory with the -dir option:
   bin/nutch solrindex http://127.0.0.1:8983/solr/ crawl/crawldb
-linkdb crawl/linkdb -dir crawl/segments/

Or point to a segment:

   bin/nutch solrindex http://127.0.0.1:8983/solr/ crawl/crawldb
-linkdb crawl/linkdb crawl/segments/YOUR_SEGMENT

Cheers
 
 
-Original message-
 From:Baruch Kogan bar...@sellerpanda.com
 Sent: Sunday 1st March 2015 18:57
 To: solr-user@lucene.apache.org
 Subject: Integrating Solr with Nutch
 
 Hi, guys,
 
 I'm working through the tutorial here
 http://wiki.apache.org/nutch/NutchTutorial#A6._Integrate_Solr_with_Nutch.
 I've run a crawl on a list of webpages. Now I'm trying to index them into
 Solr. Solr's installed, runs fine, indexes .json, .xml, whatever, returns
 queries. I've edited the Nutch schema as per instructions. Now I hit a wall:
 
-
 
Save the file and restart Solr under ${APACHE_SOLR_HOME}/example:
 
java -jar start.jar\
 
 
 On my install (the latest Solr,) there is no such file, but there is a
 solr.sh file in the /bin which I can start. So I pasted it into
 solr/example/ and ran it from there. Solr cranks over. Now I need to:
 
 
-
 
run the Solr Index command from ${NUTCH_RUNTIME_HOME}:
 
bin/nutch solrindex http://127.0.0.1:8983/solr/ crawl/crawldb
 -linkdb crawl/linkdb crawl/segments/
 
 
 and I get this:
 
 *ubuntu@ubuntu-VirtualBox:~/crawler/nutch$ bin/nutch solrindex
 http://127.0.0.1:8983/solr/ crawl/crawldb
 -linkdb crawl/linkdb crawl/segments/*
 *Indexer: starting at 2015-03-01 19:51:09*
 *Indexer: deleting gone documents: false*
 *Indexer: URL filtering: false*
 *Indexer: URL normalizing: false*
 *Active IndexWriters :*
 *SOLRIndexWriter*
 * solr.server.url : URL of the SOLR instance (mandatory)*
 * solr.commit.size : buffer size when sending to SOLR (default 1000)*
 * solr.mapping.file : name of the mapping file for fields (default
 solrindex-mapping.xml)*
 * solr.auth : use authentication (default false)*
 * solr.auth.username : use authentication (default false)*
 * solr.auth : username for authentication*
 * solr.auth.password : password for authentication*
 
 
 *Indexer: org.apache.hadoop.mapred.InvalidInputException: Input path does
 not exist: file:/home/ubuntu/crawler/nutch/crawl/segments/crawl_fetch*
 *Input path does not exist:
 file:/home/ubuntu/crawler/nutch/crawl/segments/crawl_parse*
 *Input path does not exist:
 file:/home/ubuntu/crawler/nutch/crawl/segments/parse_data*
 *Input path does not exist:
 file:/home/ubuntu/crawler/nutch/crawl/segments/parse_text*
 *Input path does not exist:
 file:/home/ubuntu/crawler/nutch/crawl/crawldb/current*
 *Input path does not exist:
 file:/home/ubuntu/crawler/nutch/crawl/linkdb/current*
 * at
 org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:197)*
 * at
 org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:40)*
 * at
 org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:208)*
 * at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:1081)*
 * at org.apache.hadoop.mapred.JobClient.writeSplits(JobClient.java:1073)*
 * at org.apache.hadoop.mapred.JobClient.access$700(JobClient.java:179)*
 * at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:983)*
 * at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:936)*
 * at java.security.AccessController.doPrivileged(Native Method)*
 * at javax.security.auth.Subject.doAs(Subject.java:415)*
 * at
 org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1190)*
 * at
 org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:936)*
 * at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:910)*
 * at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1353)*
 * at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:114)*
 * at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:176)*
 * at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)*
 * at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:186)*
 
 What am I doing wrong?
 
 Sincerely,
 
 Baruch Kogan
 Marketing Manager
 Seller Panda http://sellerpanda.com
 +972(58)441-3829
 baruch.kogan at Skype