[jira] Commented: (SOLR-1375) BloomFilter on a field

2010-03-16 Thread Ted Dunning (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12846166#action_12846166
 ] 

Ted Dunning commented on SOLR-1375:
---

Sorry to comment late here, but when indexing in Hadoop it is really nice to 
avoid any central dependency.  It is also nice to focus the map-side join on 
items likely to match.  Thirdly, reduce-side indexing is typically really 
important.

The conclusions from these three considerations vary by duplication rate.  
Using reduce-side indexing gets rid of most of the problems of duplicate 
versions of a single document (with the same sort key) since the reducer can 
scan to see whether it has the final copy handy before adding a document to the 
index.

There remain problems where we have to not index documents that already exist 
in the index or to generate a deletion list that can assist in applying the 
index update.  The former problem is usually the more severe one because it 
isn't unusual for data sources to just include a full dump of all documents and 
assume that the consumer will figure out which are new or updated.  Here you 
would like to only index new and modified documents.  

My own preference here is to avoid the complication of a Bloom-filter-based 
map-side join and simply export a very simple list of stub documents that 
correspond to the documents already in the index.  These stub documents should 
be much smaller than the average document (unless you are indexing tweets), 
which makes passing around great masses of stub documents not such a problem, 
since Hadoop shuffle, copy and sort times are all dominated by Lucene index 
times.  Passing stub documents allows the reducer to simply iterate through 
all documents with the same key, keeping the latest version or noting any stub 
that is encountered.  For documents without a stub, normal indexing can be 
done, with the slight addition of exporting a list of stub documents for the 
new additions.
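A minimal sketch of that reducer logic (plain Python; the record shapes and the `stub`/`doc` markers are illustrative assumptions, not any Solr or Hadoop API):

```python
# Sketch of reduce-side dedup with stub documents. All records for one
# key arrive at the same reducer; a stub means "already in the index",
# so the whole group is skipped. Otherwise the newest version is kept
# for indexing and a new stub is exported for the next run.

def reduce_key(key, records):
    """records: iterable of dicts like {'kind': 'stub'} or
    {'kind': 'doc', 'version': int, 'body': str}."""
    latest = None
    for rec in records:
        if rec['kind'] == 'stub':
            return None, None          # already indexed: emit nothing
        if latest is None or rec['version'] > latest['version']:
            latest = rec
    # index `latest` and export a stub so future runs skip this key
    return latest, {'kind': 'stub', 'key': key}

doc, stub = reduce_key('d1', [{'kind': 'doc', 'version': 1, 'body': 'a'},
                              {'kind': 'doc', 'version': 2, 'body': 'b'}])
```

The same scan handles both cases in one pass, which is why shipping stubs through the shuffle is cheap relative to the indexing work itself.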

The same thing could be done with a map-side join, but the trade-off is that 
you then need considerably more memory for the mapper to store the entire 
bitmap in memory, as opposed to needing (somewhat) more time to pass the stub 
documents around.  How that trade-off plays out in the real world isn't clear.  
My personal preference is to keep heap space small, since the time cost is 
pretty minimal for me.

This problem also turns up in our PDF conversion pipeline, where we keep 
checksums of each PDF that has already been converted to viewable forms.  In 
that case, the ratio of real document size to stub size is even more 
pronounced.
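The checksum-based skip logic can be sketched in a few lines (names and the in-memory `seen` set are illustrative; a real pipeline would persist the checksums):

```python
# Skip work whose input was already processed: hash the content and
# only convert inputs whose digest has not been seen before.
import hashlib

def needs_conversion(content: bytes, seen: set) -> bool:
    digest = hashlib.sha256(content).hexdigest()
    if digest in seen:
        return False                 # already converted: skip
    seen.add(digest)
    return True

seen = set()
needs_conversion(b'%PDF-1.4 ...', seen)   # first time: convert
needs_conversion(b'%PDF-1.4 ...', seen)   # duplicate: skip
```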


 BloomFilter on a field
 --

 Key: SOLR-1375
 URL: https://issues.apache.org/jira/browse/SOLR-1375
 Project: Solr
  Issue Type: New Feature
  Components: update
Affects Versions: 1.4
Reporter: Jason Rutherglen
Priority: Minor
 Fix For: 1.5

 Attachments: SOLR-1375.patch, SOLR-1375.patch, SOLR-1375.patch, 
 SOLR-1375.patch, SOLR-1375.patch

   Original Estimate: 120h
  Remaining Estimate: 120h

 * A Bloom filter is a read-only probabilistic set. It is useful
 for verifying that a key exists in a set, though it can return false
 positives. http://en.wikipedia.org/wiki/Bloom_filter 
 * The use case is indexing in Hadoop and checking for duplicates
 against a Solr cluster, which (when using the term dictionary or a
 query) is too slow and exceeds the time consumed for indexing.
 When a match is found, the host, segment, and term are returned.
 If the same term is found on multiple servers, multiple results
 are returned by the distributed process. (We'll need to add in
 the core name I just realized). 
 * When new segments are created, and commit is called, a new
 bloom filter is generated from a given field (default:id) by
 iterating over the term dictionary values. There's a bloom
 filter file per segment, which is managed on each Solr shard.
 When segments are merged away, their corresponding .blm files are
 also removed. In a future version we'll have a central server
 for the bloom filters so we're not abusing the thread pool of
 the Solr proxy and the networking of the Solr cluster (this will
 be done sooner than later after testing this version). I held
 off because the central server requires syncing the Solr
 servers' files (which is like reverse replication). 
 * The patch uses the BloomFilter from Hadoop 0.20. I want to jar
 up only the necessary classes so we don't have a giant Hadoop
 jar in lib.
 http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/util/bloom/BloomFilter.html
 * Distributed code is added and seems to work; I extended
 TestDistributedSearch to test over multiple HTTP servers. I
 chose this approach rather than the manual method used by (for
 example) TermVectorComponent.testDistributed because I'm new to
 Solr's distributed search.
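The key property the issue relies on is that a Bloom filter never gives false negatives: "not in filter" safely means "not yet indexed". A toy illustration (plain Python, not the Hadoop 0.20 BloomFilter class the patch actually uses):

```python
# Toy Bloom filter for the duplicate pre-check described above.
# k hash positions are derived per key; membership tests can yield
# false positives but never false negatives.
import hashlib

class BloomFilter:
    def __init__(self, size_bits=1 << 16, hashes=4):
        self.size = size_bits
        self.hashes = hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, key: str):
        for i in range(self.hashes):
            h = hashlib.sha256(f'{i}:{key}'.encode()).digest()
            yield int.from_bytes(h[:8], 'big') % self.size

    def add(self, key: str):
        for p in self._positions(key):
            self.bits[p // 8] |= 1 << (p % 8)

    def might_contain(self, key: str) -> bool:
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(key))

bf = BloomFilter()
bf.add('doc-42')
assert bf.might_contain('doc-42')   # added keys are always found
```

Per segment, such a filter would be built by iterating the term dictionary of the chosen field, as the description says.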

[jira] Commented: (SOLR-1814) select count(distinct fieldname) in SOLR

2010-03-10 Thread Ted Dunning (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1814?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12843632#action_12843632
 ] 

Ted Dunning commented on SOLR-1814:
---


Trove is GPL.

The Mahout project has a partial set of replacements for Trove collections in 
case you want to go forward with this.  Our plan is to consider breaking out 
the collections package from Mahout at some point in case you don't want to 
drag along the rest of Mahout.


 select count(distinct fieldname) in SOLR
 

 Key: SOLR-1814
 URL: https://issues.apache.org/jira/browse/SOLR-1814
 Project: Solr
  Issue Type: New Feature
  Components: SearchComponents - other
Affects Versions: 1.4, 1.5, 1.6, 2.0
Reporter: Marcus Herou
 Fix For: 1.4, 1.5, 1.6, 2.0

 Attachments: CountComponent.java


 I have seen questions on the mailing list about having the functionality for 
 counting distinct values on a field. We at Tailsweep want that as well, for 
 example in our blog search.
 Example:
 You had 1345 hits on 244 blogs
 The 244 part is not possible in SOLR today (correct me if I am wrong). So 
 I've written a component which does this. Attaching it.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (SOLR-1724) Real Basic Core Management with Zookeeper

2010-02-19 Thread Ted Dunning (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12835963#action_12835963
 ] 

Ted Dunning commented on SOLR-1724:
---


Will this http access also allow a cluster with incrementally updated cores to 
replicate a core after a node failure?

 Real Basic Core Management with Zookeeper
 -

 Key: SOLR-1724
 URL: https://issues.apache.org/jira/browse/SOLR-1724
 Project: Solr
  Issue Type: New Feature
  Components: multicore
Affects Versions: 1.4
Reporter: Jason Rutherglen
 Fix For: 1.5

 Attachments: commons-lang-2.4.jar, gson-1.4.jar, 
 hadoop-0.20.2-dev-core.jar, hadoop-0.20.2-dev-test.jar, SOLR-1724.patch, 
 SOLR-1724.patch, SOLR-1724.patch, SOLR-1724.patch, SOLR-1724.patch, 
 SOLR-1724.patch, SOLR-1724.patch


 Though we're implementing cloud, I need something real soon I can
 play with and deploy. So this'll be a patch that only deploys
 new cores, and that's about it. The arch is real simple:
 On Zookeeper there'll be a directory that contains files that
 represent the state of the cores of a given set of servers which
 will look like the following:
 /production/cores-1.txt
 /production/cores-2.txt
 /production/core-host-1-actual.txt (ephemeral node per host)
 Where each core-N.txt file contains:
 hostname,corename,instanceDir,coredownloadpath
 coredownloadpath is a URL such as file://, http://, hftp://, hdfs://, ftp://, 
 etc
 and
 core-host-actual.txt contains:
 hostname,corename,instanceDir,size
 Every time a new core-N.txt file is added, the listening host
 finds its entry in the list and begins the process of trying to
 match the entries. Upon completion, it updates its
 /core-host-1-actual.txt file to its completed state or logs an error.
 When all host actual files are written (without errors), then a
 new core-1-actual.txt file is written which can be picked up by
 another process that can create a new core proxy.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (SOLR-1301) Solr + Hadoop

2010-02-02 Thread Ted Dunning (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12828961#action_12828961
 ] 

Ted Dunning commented on SOLR-1301:
---

{quote}
Based on these observations, I have a few questions. (I am a beginner to the 
Hadoop & Solr world. So, please forgive me if my questions are silly):
1. As per the above observations, the SOLR-1045 patch is functionally better 
(performance I have not verified yet). Can anyone tell me what actual 
advantage the SOLR-1301 patch offers over the SOLR-1045 patch?
2. If both the jira issues are trying to solve the same problem, do we really 
need 2 separate issues?
{quote}

In the katta community, the recommended practice started with SOLR-1045 (what I 
call map-side indexing) behavior, but I think that the consensus now is that 
SOLR-1301 behavior (what I call reduce side indexing) is much, much better.  
This is not necessarily the obvious result given your observations.  There are 
some operational differences between katta and SOLR that might make the 
conclusions different, but what I have observed is the following:

a) index merging is a really bad idea.  It seems very attractive to begin 
with, but it is actually pretty expensive and doesn't solve the real problem 
of bad document distribution across shards.  It is much better to simply have 
lots of shards per machine (aka micro-sharding) and use one reducer per shard.  
For large indexes, this gives entirely acceptable performance.  On a pretty 
small cluster, we can index 50-100 million large documents in multiple ways in 
2-3 hours.  Index merging gives you no benefit compared to reduce-side 
indexing and just increases code complexity.

b) map-side indexing leaves you with indexes that are heavily skewed by being 
composed of documents from a single input split.  At retrieval time, this 
means that different shards have very different term frequency profiles and 
very different numbers of relevant documents.  This makes lots of statistics 
very difficult, including term frequency computation, term weighting and 
determining the number of documents to retrieve.  Map-side indexing virtually 
guarantees that you have to do two cluster queries, one to gather term 
frequency statistics and another to do the actual query.  With reduce-side 
indexing, you can provide strong probabilistic bounds on how different the 
statistics in each shard can be, so you can use local term statistics, and you 
can depend on the score distribution being the same, which radically decreases 
the number of documents you need to retrieve from each shard.

c) reduce-side indexing improves the balance of computation during retrieval.  
If (as is the rule) some document subset is hotter than other document subset 
due, say to data-source boosting or recency boosting, you will have very bad 
cluster utilization with skewed shards from map-side indexing while all shards 
will cost about the same for any query leading to good cluster utilization and 
faster queries with reduce-side indexing.

d) reduce-side indexing has properties that can be mathematically stated 
and proved.  Map-side indexing only has comparable properties if you make 
unrealistic assumptions about your original data.

e) micro-sharding allows very simple and very effective use of multiple cores 
on multiple machines in a search cluster.  This can be very difficult to do 
with large shards or a single index.

Now, as you say, these advantages may evaporate if you are looking to produce 
a single output index.  That seems, however, to contradict the whole point of 
scaling.  If you need to scale indexing, presumably you also need to scale 
search speed and throughput.  As such, you probably want to have many shards 
rather than few.  Conversely, if you can stand to search a single index, then 
you probably can stand to index on a single machine.

Another thing to think about is the fact SOLR doesn't yet do micro-sharding or 
clustering very well and, in particular, doesn't handle multiple shards per 
core.  That will be changing before long, however, and it is very dangerous to 
design for the past rather than the future.

In case you didn't notice, I strongly suggest you stick with reduce-side 
indexing.
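The core of the statistical argument above is that routing documents to shards by a stable hash, rather than by input split, makes every shard a uniform sample of the corpus. A small sketch (plain Python; shard counts and names are illustrative):

```python
# Micro-sharding sketch: route each document id to one of many small
# shards with a stable hash, one reducer per shard. Because the hash
# scatters ids uniformly, every shard sees a statistically similar
# mix of documents, so local term statistics are usable.
import hashlib

def shard_for(doc_id: str, num_shards: int) -> int:
    h = hashlib.md5(doc_id.encode()).digest()
    return int.from_bytes(h[:8], 'big') % num_shards

counts = [0] * 16
for i in range(10_000):
    counts[shard_for(f'doc-{i}', 16)] += 1
# each of the 16 shards receives roughly 10_000 / 16 ≈ 625 documents
```

Contrast this with map-side indexing, where each shard is one input split and inherits whatever skew that split has.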

 Solr + Hadoop
 -

 Key: SOLR-1301
 URL: https://issues.apache.org/jira/browse/SOLR-1301
 Project: Solr
  Issue Type: Improvement
Affects Versions: 1.4
Reporter: Andrzej Bialecki 
 Fix For: 1.5

 Attachments: commons-logging-1.0.4.jar, 
 commons-logging-api-1.0.4.jar, hadoop-0.19.1-core.jar, hadoop.patch, 
 log4j-1.2.15.jar, README.txt, SOLR-1301.patch, SOLR-1301.patch, 
 SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, 
 SOLR-1301.patch, SolrRecordWriter.java


 This patch contains a contrib module that provides distributed indexing 
 (using Hadoop) 

[jira] Commented: (SOLR-1301) Solr + Hadoop

2010-01-29 Thread Ted Dunning (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12806547#action_12806547
 ] 

Ted Dunning commented on SOLR-1301:
---


It is critical to put indexes in the task local area on both local and hdfs 
storage areas not just because of task cleanup, but also because a task may be 
run more than once.  Hadoop handles all the race conditions that would 
otherwise happen as a result.
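The re-run safety can be illustrated with the usual attempt-directory-plus-atomic-promote pattern that Hadoop's output committers use (this sketch is a simplified stand-in, not Hadoop's actual committer code; paths are illustrative):

```python
# Each task attempt writes into its own scratch directory; on commit
# the result is atomically renamed into place. A speculative or
# re-executed attempt therefore cannot clobber a committed index.
import os
import shutil
import tempfile

def run_attempt(job_dir: str, task_id: str, attempt: int, data: str) -> str:
    attempt_dir = os.path.join(job_dir, f'_tmp_{task_id}_{attempt}')
    os.makedirs(attempt_dir)
    with open(os.path.join(attempt_dir, 'index'), 'w') as f:
        f.write(data)                      # stand-in for index writing
    return attempt_dir

def commit(job_dir: str, task_id: str, attempt_dir: str):
    final_dir = os.path.join(job_dir, task_id)
    if not os.path.exists(final_dir):      # first committer wins
        os.rename(attempt_dir, final_dir)
    else:
        shutil.rmtree(attempt_dir)         # duplicate attempt discarded

job = tempfile.mkdtemp()
a0 = run_attempt(job, 'task-0', 0, 'index-v0')
a1 = run_attempt(job, 'task-0', 1, 'index-v0')  # speculative re-run
commit(job, 'task-0', a0)
commit(job, 'task-0', a1)                  # harmless: already committed
```

In real Hadoop the framework serializes the commit step itself, which is exactly the race handling the comment refers to.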


 Solr + Hadoop
 -

 Key: SOLR-1301
 URL: https://issues.apache.org/jira/browse/SOLR-1301
 Project: Solr
  Issue Type: Improvement
Affects Versions: 1.4
Reporter: Andrzej Bialecki 
 Fix For: 1.5

 Attachments: commons-logging-1.0.4.jar, 
 commons-logging-api-1.0.4.jar, hadoop-0.19.1-core.jar, hadoop.patch, 
 log4j-1.2.15.jar, README.txt, SOLR-1301.patch, SOLR-1301.patch, 
 SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SolrRecordWriter.java


 This patch contains a contrib module that provides distributed indexing 
 (using Hadoop) to Solr EmbeddedSolrServer. The idea behind this module is 
 twofold:
 * provide an API that is familiar to Hadoop developers, i.e. that of 
 OutputFormat
 * avoid unnecessary export and (de)serialization of data maintained on HDFS. 
 SolrOutputFormat consumes data produced by reduce tasks directly, without 
 storing it in intermediate files. Furthermore, by using an 
 EmbeddedSolrServer, the indexing task is split into as many parts as there 
 are reducers, and the data to be indexed is not sent over the network.
 Design
 --
 Key/value pairs produced by reduce tasks are passed to SolrOutputFormat, 
 which in turn uses SolrRecordWriter to write this data. SolrRecordWriter 
 instantiates an EmbeddedSolrServer, and it also instantiates an 
 implementation of SolrDocumentConverter, which is responsible for turning 
 Hadoop (key, value) into a SolrInputDocument. This data is then added to a 
 batch, which is periodically submitted to EmbeddedSolrServer. When reduce 
 task completes, and the OutputFormat is closed, SolrRecordWriter calls 
 commit() and optimize() on the EmbeddedSolrServer.
 The API provides facilities to specify an arbitrary existing solr.home 
 directory, from which the conf/ and lib/ files will be taken.
 This process results in the creation of as many partial Solr home directories 
 as there were reduce tasks. The output shards are placed in the output 
 directory on the default filesystem (e.g. HDFS). Such part-N directories 
 can be used to run N shard servers. Additionally, users can specify the 
 number of reduce tasks, in particular 1 reduce task, in which case the output 
 will consist of a single shard.
 An example application is provided that processes large CSV files and uses 
 this API. It uses a custom CSV processing to avoid (de)serialization overhead.
 This patch relies on hadoop-core-0.19.1.jar - I attached the jar to this 
 issue, you should put it in contrib/hadoop/lib.
 Note: the development of this patch was sponsored by an anonymous contributor 
 and approved for release under Apache License.
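The batching pattern the design describes can be sketched as follows (plain Python; `FakeServer` is a stand-in for EmbeddedSolrServer, and the method names on the writer are illustrative, not the actual SolrRecordWriter API):

```python
# Sketch of the SolrRecordWriter pattern: convert each (key, value)
# into a document, buffer documents, flush batches to an embedded
# server, and commit on close.

class FakeServer:
    """Stand-in for EmbeddedSolrServer."""
    def __init__(self):
        self.docs, self.committed = [], False
    def add(self, batch):
        self.docs.extend(batch)
    def commit(self):
        self.committed = True

class RecordWriter:
    def __init__(self, server, batch_size=2):
        self.server, self.batch_size, self.batch = server, batch_size, []
    def write(self, key, value):
        # stand-in for SolrDocumentConverter turning (key, value)
        # into a SolrInputDocument
        self.batch.append({'id': key, 'text': value})
        if len(self.batch) >= self.batch_size:
            self.flush()
    def flush(self):
        if self.batch:
            self.server.add(self.batch)
            self.batch = []
    def close(self):
        self.flush()
        self.server.commit()

srv = FakeServer()
w = RecordWriter(srv)
for k, v in [('1', 'a'), ('2', 'b'), ('3', 'c')]:
    w.write(k, v)
w.close()
```

Because each reducer owns its own embedded server and local directory, no indexed data crosses the network, which is the point of the design.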

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (SOLR-1724) Real Basic Core Management with Zookeeper

2010-01-21 Thread Ted Dunning (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12803371#action_12803371
 ] 

Ted Dunning commented on SOLR-1724:
---

{quote}
... I agree, I'm not really into ephemeral
ZK nodes for Solr hosts/nodes. The reason is contact with ZK is
highly superficial and can be intermittent. 
{quote}
I have found that when I was having trouble with ZK connectivity, the problems 
were simply surfacing issues that I had anyway.  You do have to configure the 
ZK client to not have long pauses (how is that incompatible with SOLR?) and 
you may need to adjust the timeouts on the ZK side.  More importantly, any 
issues with ZK connectivity will have their parallels with any other heartbeat 
mechanism, and replicating a heartbeat system that tries to match ZK for 
reliability is going to be a significant source of very nasty bugs.  Better 
not to rewrite what already works.  Keep in mind that ZK *connection* issues 
are not the same as session expiration.  Katta has a fairly important set of 
bugfixes now to make that distinction, and ZK will soon handle connection loss 
on its own. 

It isn't a bad idea to keep shards around for a while if a node goes down.  
That can seriously decrease the cost of momentary outages such as for a 
software upgrade.  The idea is that when the node comes back, it can advertise 
availability of some shards and replication of those shards should cease.



 Real Basic Core Management with Zookeeper
 -

 Key: SOLR-1724
 URL: https://issues.apache.org/jira/browse/SOLR-1724
 Project: Solr
  Issue Type: New Feature
  Components: multicore
Affects Versions: 1.4
Reporter: Jason Rutherglen
 Fix For: 1.5

 Attachments: SOLR-1724.patch


 Though we're implementing cloud, I need something real soon I can
 play with and deploy. So this'll be a patch that only deploys
 new cores, and that's about it. The arch is real simple:
 On Zookeeper there'll be a directory that contains files that
 represent the state of the cores of a given set of servers which
 will look like the following:
 /production/cores-1.txt
 /production/cores-2.txt
 /production/core-host-1-actual.txt (ephemeral node per host)
 Where each core-N.txt file contains:
 hostname,corename,instanceDir,coredownloadpath
 coredownloadpath is a URL such as file://, http://, hftp://, hdfs://, ftp://, 
 etc
 and
 core-host-actual.txt contains:
 hostname,corename,instanceDir,size
 Every time a new core-N.txt file is added, the listening host
 finds its entry in the list and begins the process of trying to
 match the entries. Upon completion, it updates its
 /core-host-1-actual.txt file to its completed state or logs an error.
 When all host actual files are written (without errors), then a
 new core-1-actual.txt file is written which can be picked up by
 another process that can create a new core proxy.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (SOLR-1724) Real Basic Core Management with Zookeeper

2010-01-16 Thread Ted Dunning (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12801296#action_12801296
 ] 

Ted Dunning commented on SOLR-1724:
---

{quote}
We actually started out that way... (when a node went down there wasn't really 
any trace it ever existed) but have been moving away from it.
ZK may not just be a reflection of the cluster but may also control certain 
aspects of the cluster that you want persistent. For example, marking a node as 
disabled (i.e. don't use it). One could create APIs on the node to enable and 
disable and have that reflected in ZK, but it seems like more work than simply 
saying change this znode.
{quote}

I see this as a conflation of two or three goals that leads to trouble.  All 
of the goals are worthy and important, but the conflation of them leads to 
difficult problems.  Taken separately, the goals are easily met.

One goal is the reflection of current cluster state.  That is most reliably 
done using ephemeral files roughly as I described.

Another goal is the reflection of constraints or desired state of the cluster.  
This is best handled as you describe, with permanent files since you don't want 
this desired state to disappear when a node disappears.

The real issue is making sure that the source of any piece of information is 
the thing most directly connected to the physical manifestation of that 
information.  Moreover, it is important in some cases (node state, for 
instance) that the state stay correct even when the source of that state loses 
control by crashing, hanging or becoming otherwise indisposed.  Inserting an 
intermediary into this chain of control is a bad idea.  Replicating ZK's 
rather well implemented ephemeral state mechanism with ad hoc heartbeats is 
also a bad idea (remember how *many* bugs there have been in Hadoop relative 
to heartbeats and the namenode?).

A somewhat secondary issue is whether the cluster master has to be involved in 
every query.  That seems like a really bad bottleneck to me and Katta provides 
a proof of existence that this is not necessary.

After trying several options in production, what I find works best is for the 
master to lay down a statement of desired state while the nodes publish their 
status in a different, ephemeral fashion.  The master can record a history, 
and there may be general directives such as your disabled list, however you 
like, but those shouldn't be mixed into the node status; otherwise you get 
into a situation where ephemeral files can no longer be used for what they are 
good at.
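The split being argued for can be modeled in a few lines (a tiny in-memory stand-in for ZooKeeper, purely illustrative; real ZK enforces ephemeral-node cleanup server-side on session expiry):

```python
# Model of the persistent/ephemeral split: desired state lives in
# persistent znodes; live node status lives in ephemeral znodes that
# vanish automatically when the owning session dies.

class MiniZK:
    def __init__(self):
        self.nodes = {}        # path -> (owner_session or None, data)
    def create(self, path, data, session=None):
        # session=None models a persistent znode
        self.nodes[path] = (session, data)
    def expire_session(self, session):
        # ephemeral znodes owned by the dead session disappear
        self.nodes = {p: (s, d) for p, (s, d) in self.nodes.items()
                      if s != session}
    def exists(self, path):
        return path in self.nodes

zk = MiniZK()
zk.create('/desired/host1', 'serve shards 1,2')           # persistent
zk.create('/status/host1', 'up', session='host1-session') # ephemeral
zk.expire_session('host1-session')                        # host1 crashes
# /status/host1 disappears on its own; /desired/host1 survives
```

Mixing the disabled list into `/status/...` would break exactly this property: the constraint would vanish with the session.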



 Real Basic Core Management with Zookeeper
 -

 Key: SOLR-1724
 URL: https://issues.apache.org/jira/browse/SOLR-1724
 Project: Solr
  Issue Type: New Feature
  Components: multicore
Affects Versions: 1.4
Reporter: Jason Rutherglen
 Fix For: 1.5


 Though we're implementing cloud, I need something real soon I can
 play with and deploy. So this'll be a patch that only deploys
 new cores, and that's about it. The arch is real simple:
 On Zookeeper there'll be a directory that contains files that
 represent the state of the cores of a given set of servers which
 will look like the following:
 /production/cores-1.txt
 /production/cores-2.txt
 /production/core-host-1-actual.txt (ephemeral node per host)
 Where each core-N.txt file contains:
 hostname,corename,instanceDir,coredownloadpath
 coredownloadpath is a URL such as file://, http://, hftp://, hdfs://, ftp://, 
 etc
 and
 core-host-actual.txt contains:
 hostname,corename,instanceDir,size
 Every time a new core-N.txt file is added, the listening host
 finds its entry in the list and begins the process of trying to
 match the entries. Upon completion, it updates its
 /core-host-1-actual.txt file to its completed state or logs an error.
 When all host actual files are written (without errors), then a
 new core-1-actual.txt file is written which can be picked up by
 another process that can create a new core proxy.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (SOLR-1724) Real Basic Core Management with Zookeeper

2010-01-15 Thread Ted Dunning (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12801051#action_12801051
 ] 

Ted Dunning commented on SOLR-1724:
---


Katta had some interesting issues in the design of this.

These are discussed here: http://oss.101tec.com/jira/browse/KATTA-43

The basic design consideration is that the failure of a node needs to 
automagically update the ZK state accordingly.  This also allows all important 
updates to files to flow in one direction.


 Real Basic Core Management with Zookeeper
 -

 Key: SOLR-1724
 URL: https://issues.apache.org/jira/browse/SOLR-1724
 Project: Solr
  Issue Type: New Feature
  Components: multicore
Affects Versions: 1.4
Reporter: Jason Rutherglen
 Fix For: 1.5


 Though we're implementing cloud, I need something real soon I can
 play with and deploy. So this'll be a patch that only deploys
 new cores, and that's about it. The arch is real simple:
 On Zookeeper there'll be a directory that contains files that
 represent the state of the cores of a given set of servers which
 will look like the following:
 /production/cores-1.txt
 /production/cores-2.txt
 /production/core-host-1-actual.txt (ephemeral node per host)
 Where each core-N.txt file contains:
 hostname,corename,instanceDir,coredownloadpath
 coredownloadpath is a URL such as file://, http://, hftp://, hdfs://, ftp://, 
 etc
 and
 core-host-actual.txt contains:
 hostname,corename,instanceDir,size
 Every time a new core-N.txt file is added, the listening host
 finds its entry in the list and begins the process of trying to
 match the entries. Upon completion, it updates its
 /core-host-1-actual.txt file to its completed state or logs an error.
 When all host actual files are written (without errors), then a
 new core-1-actual.txt file is written which can be picked up by
 another process that can create a new core proxy.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.