Re: Replication for SolrCloud

2015-04-18 Thread Jürgen Wagner (DVT)
Replication on the storage layer will provide reliable storage for the
index and other data of Solr. However, this replication does not
guarantee that your index files are consistent at any given time, as
there may be intermediate states that are only partially replicated.
Replication is only a convergent process, not an instant, atomic
operation. With frequent changes, this becomes an issue.

Replication inside SolrCloud, at the application level, will not only
maintain the consistency of the search-level interfaces to your indexes,
but also scale in application terms (query throughput).

Imagine a database: if you change one record, this may also result in an
index change. If the record and the index are stored in different
storage blocks, one will get replicated first. However, the replication
target will only be consistent again once both have been replicated. So,
you would have to suspend all access until the entire replication has
completed. That's undesirable. If you replicate on the application
(database management system) level, the application will employ a more
fine-grained approach to replication, guaranteeing application consistency.

Consequently, HDFS will allow you to scale storage and possibly even
replicate static indexes that won't change, but it won't help much with
live index replication. That's where SolrCloud jumps in.

Cheers,
--Jürgen

On 18.04.2015 08:44, gengmao wrote:
> I wonder why one needs to use SolrCloud replication on HDFS at all, given HDFS
> already provides replication and availability? The way to optimize
> performance and scalability should be tweaking shards, just like tweaking
> regions on HBase - which doesn't provide "region replication" either, does
> it?
>
> I have this question for a while and I didn't find clear answer about it.
> Could some experts please explain a bit?
>
> Best regards,
> Mao Geng
>
>


-- 

Mit freundlichen Grüßen/Kind regards/Cordialement vôtre/Atentamente/С
уважением
*i.A. Jürgen Wagner*
Head of Competence Center "Intelligence"
& Senior Cloud Consultant

Devoteam GmbH, Industriestr. 3, 70565 Stuttgart, Germany
Phone: +49 6151 868-8725, Fax: +49 711 13353-53, Mobile: +49 171 864 1543
E-Mail: juergen.wag...@devoteam.com, URL: www.devoteam.de



Managing Board: Jürgen Hatzipantelis (CEO)
Address of Record: 64331 Weiterstadt, Germany; Commercial Register:
Amtsgericht Darmstadt HRB 6450; Tax Number: DE 172 993 071




Re: Deploying multiple ZooKeeper ensemble on a single machine

2015-04-08 Thread Jürgen Wagner (DVT)
To be precise: create one zoo.cfg for each of the instances. One config
file for all is a bad idea.

In each config file, use the same server.X lines, but use a unique
clientPort.

As you will also have separate data directories, I would recommend
having one root directory .../zookeeper where you create a subdirectory
for each instance. Each of these subdirectories holds that instance's
zoo.cfg. To start a Zookeeper instance, simply point ZOOCFGDIR to
the proper relative path, change to the respective directory and start
Zookeeper.
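
For illustration, a minimal sketch of such a layout for three instances
on one machine (paths and ports are assumptions, adjust as needed):

# Assumed layout: /opt/zookeeper/{zk1,zk2,zk3}, each with its own zoo.cfg
# and data directory. All three zoo.cfg files contain the same server.X
# lines, but each has its own clientPort and dataDir, e.g. for instance 1:
#
#   dataDir=/opt/zookeeper/zk1/data
#   clientPort=2181
#   server.1=localhost:2888:3888
#   server.2=localhost:2889:3889
#   server.3=localhost:2890:3890
#
# Note that the peer ports (2888:3888, ...) must also be unique per
# instance when everything runs on one machine. The myid file in each
# dataDir holds that instance's X.
echo 1 > /opt/zookeeper/zk1/data/myid

# Start instance 1 from its own directory, with ZOOCFGDIR pointing there:
cd /opt/zookeeper/zk1
ZOOCFGDIR=. zkServer.sh start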

Best regards,
--Jürgen

On 08.04.2015 11:22, Swaraj Kumar wrote:
> Hi Zheng,
>
> I am not sure if this command *"zkServer.cmd start zoo.cfg" * works in
> windows or not, but in zkServer.cmd it calls zkEnv.cmd where "
> *ZOOCFG=%ZOOCFGDIR%\zoo.cfg*" is set. So, if you want to run multiple
> instances of zookeeper, change zoo.cfg to your config file and start
> zookeeper.
> The command will not include any start.
>
>
>
> Regards,
>
>
> Swaraj Kumar
> Senior Software Engineer I
> MakeMyTrip.com
> Mob No- 9811774497
>
> On Wed, Apr 8, 2015 at 12:29 PM, Zheng Lin Edwin Yeo 
> wrote:
>
>





Re: Using SolrCloud to implement a kind of federated search

2015-01-20 Thread Jürgen Wagner (DVT)
Hello Charlie,
  theoretically, things may work as you describe them. A few big
HOWEVERs exist as far as I can see:

1. Attributes: as different organisations may use different schemata
(document attributes), the consolidation of results from multiple
sources may present a problem. This may not arise with common attributes
(for which there may be some sort of standardization, e.g., the
Dublin Core metadata standard), but especially with very specific attributes
that pertain to the different focal work areas of the institutions
running the individual systems you want to federate.

2. Values: different organisations will work on different topics. There
may be large similarities, but as the staff involved is different, there
will be an inherent difference in the actual semantic domain dealt with.
Consequently, it is very likely that you won't have a homogeneous
ontology for all pieces of information across all federated sources.
This makes it hard to consolidate results in a semantically correct way.

3. Cardinality: there may be rather large collections and some smaller
collections in the federation. If you use SolrCloud to obtain results,
those from the smaller collections will carry more weight in the merged
result list than those from the larger collections, as relevance
is computed relative to each federated source.

4. Uniqueness: different systems may index the same documents. The idea
of having a globally unique identifier should take this into account,
i.e., it won't suffice to simply prefix each (locally unique) document
id with a source identifier (see the sketch after this list). The federated
sources must be aware of being federated and of possibly having overlaps.
Otherwise, you will get multiple occurrences of very popular documents.

5. Security: security in SolrCloud is through filtering. If you simply
use the SolrCloud distributed query mechanism, each source would have to
trust each federation instance to properly enforce security filters
through the respective entitlement groups. If one such federation system
doesn't comply and simply issues wild queries, there won't be any security.

6. Orchestration: there will be some issues with the orchestration of
these services. Zookeeper won't scale to a multi-datacenter
topology, effectively leaving node discovery to some other mechanism yet
to be defined.
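
To illustrate the uniqueness point (4), a minimal sketch (my own
assumption, not something the federation gives you): derive the global
identifier from a hash of the canonical document content, so the same
document indexed by two sources collapses to the same id, which a
source-prefixed local id would not:

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;

// Sketch: a content-derived global id. Two federated sources indexing
// the same canonical content produce the same id, so popular documents
// do not show up once per source.
public final class GlobalId {
    public static String of(String canonicalContent) throws Exception {
        MessageDigest md = MessageDigest.getInstance("SHA-256");
        byte[] hash = md.digest(canonicalContent.getBytes(StandardCharsets.UTF_8));
        StringBuilder sb = new StringBuilder("doc-");
        for (byte b : hash) {
            sb.append(String.format("%02x", b));
        }
        return sb.toString();
    }
}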

These are the issues that quickly come to my mind. There may be more.

Also have a look at tribe nodes in Elasticsearch, although these don't
fully address all issues I listed above.

In my experience, there is a clear distinction between "technical"
federated search (possibly something like the tribe nodes) and
"semantic" federated search (requiring special processing of results
obtained from different sources, ready to be consolidated). FAST Unity
used to have elaborate (but still limited) mechanisms to handle this,
but they disappeared in the course of the Microsoft takeover.

Best regards,
--Jürgen


On 20.01.2015 15:13, Charlie Hull wrote:
> Hi all,
>
> We've been discussing a way of implementing a federated search by
> leveraging the distributed query parts of SolrCloud. I've written this up
> at
> http://www.flax.co.uk/blog/2015/01/20/solr-superclusters-for-improved-federated-search/
> and would welcome any comments or feedback. So far, two committers have
> failed to see any major flaw in our plan, which makes me slightly nervous :)
>
> cheers
>
> Charlie
>






Re: Frequent deletions

2015-01-11 Thread Jürgen Wagner (DVT)
Maybe you should consider creating different generations of indexes and
not keep everything in one index. If the likelihood of documents being
deleted is rather high in, e.g., the first week or so, you could have
one index for the documents with a high probability of deletion (the
fresh ones) and a second one for the potentially longer-lived documents.
Without knowing the temporal distribution of deletion probabilities, it
is hard to say what the ideal index topology would be.
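
A minimal sketch of that generational routing with SolrJ (core names and
the one-week boundary are assumptions):

import java.util.Date;

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

// Sketch: route fresh (deletion-prone) documents to a small "fresh"
// index and longer-lived documents to an "archive" index. The small
// fresh index can then be optimized or rebuilt cheaply.
public class GenerationalRouter {
    private static final long ONE_WEEK_MS = 7L * 24 * 60 * 60 * 1000;

    private final SolrServer fresh =
            new HttpSolrServer("http://localhost:8983/solr/fresh");
    private final SolrServer archive =
            new HttpSolrServer("http://localhost:8983/solr/archive");

    public void index(SolrInputDocument doc, Date created) throws Exception {
        boolean isFresh =
                System.currentTimeMillis() - created.getTime() < ONE_WEEK_MS;
        (isFresh ? fresh : archive).add(doc);
    }
}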

Apart from that, it has been my experience that in some cases where
Solr would produce the notorious out-of-memory exceptions, Elasticsearch
seems to be a bit more robust. You may want to give it a try as well.

Best regards,
--Jürgen

On 11.01.2015 07:46, ig01 wrote:
> Thank you all for your response,
> The thing is that we have a 180G index while half of it is deleted documents.
> We tried to run an optimization in order to shrink the index size but it
> crashes on ‘out of memory’ when the process reaches 120G.
> Is it possible to optimize parts of the index? 
> Please advise what can we do in this situation.
>
>
>
>
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/Frequent-deletions-tp4176689p4178700.html
> Sent from the Solr - User mailing list archive at Nabble.com.






Re: PDF search functionality using Solr

2015-01-06 Thread Jürgen Wagner (DVT)
Hello,
  no matter which search platform you use, this will pose two
challenges:

- The size of the documents will render search less and less useful, as
the likelihood of matches increases with document size. So, without
proper semantic extraction (e.g., using decent NER or relationship
extraction with a commercial text mining product), I doubt you will get
the required precision to make this overly useful.

- PDFs can have their own character sets based on the characters
actually used. Such file-specific character sets are almost impossible
to parse, i.e., if your PDFs happen to use this "feature" of the PDF
format, you won't have much luck getting any meaningful text out of them.

My suggestion is to use the Jira REST API to collect all necessary
documents and index the resulting XML or attachment formats. As the REST
API provides filtering capabilities, you could easily create incremental
feeds to avoid humongous re-indexing every time there's new information in
Jira. Dumping Jira tickets as PDF seems to me the least suitable way
of handling this.
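
A rough sketch of such an incremental feed (the search endpoint and JQL
syntax follow the public Jira REST API; the host, field list and the
mapping to Solr are assumptions):

import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLEncoder;
import java.util.Scanner;

// Sketch: pull only recently updated issues from Jira's REST API instead
// of indexing PDF dumps. Each issue's fields (ticket number, status,
// submitter, ...) would then be mapped one-to-one to Solr fields.
public class JiraIncrementalFeed {
    public static void main(String[] args) throws Exception {
        String jql = URLEncoder.encode("updated >= -1d", "UTF-8");
        URL url = new URL("https://jira.example.com/rest/api/2/search?jql="
                + jql + "&fields=key,summary,status,reporter,description");
        HttpURLConnection con = (HttpURLConnection) url.openConnection();
        con.setRequestProperty("Accept", "application/json");
        try (InputStream in = con.getInputStream();
             Scanner s = new Scanner(in, "UTF-8").useDelimiter("\\A")) {
            String json = s.hasNext() ? s.next() : "";
            // Parse with your JSON library of choice and feed each issue
            // to Solr as a SolrInputDocument, one field per attribute.
            System.out.println(json);
        }
    }
}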

Best regards,
--Jürgen


On 06.01.2015 18:30, ganesh.ya...@sungard.com wrote:
> Hello Solr-users and developers,
> Can you please suggest,
>
> 1.   What I should do to index PDF content information column wise?
>
> 2.   Do I need to extract the contents using one of the Analyzer, 
> Tokenize and Filter combination and then add it to Index? How can test the 
> results on command prompt? I do not know the selection of specific Analyzer, 
> Tokenizer and Filter for this purpose
>
> 3.   How can I verify that the needed column info is extracted out of PDF 
> and is indexed?
>
> 4.   So for example How to verify Ticket number is extracted in 
> Ticket_number tag and is indexed?
>
> 5.   Is it ok to post 4 GB worth of PDF to be imported and indexed by 
> Solr? I think I saw some posts complaining on how large size that can be 
> posted ?
>
> 6.   What will enable Solr to search in any PDF out of many, with 
> different words such as "Runtime" "Error" "" and result will provide the 
> link to the PDF
>
> My PDFs are nothing but Jira ticket system.
> PDF has info on
> Ticket Number:
> Desc:
> Client:
> Status:
> Submitter:
> And so on:
>
>
> 1.   I imported PDF document in Solr and it does the necessary searching 
> and I can test some of it using the browse client interface provided.
>
> 2.   I have 80 GB worth of PDFs.
>
> 3.   Total number of PDFs are about 200
>
> 4.   Many PDFs are of size 4 GB
>
> 5.   What do you suggest me to import such a large PDFs? What tools can 
> you suggest to extract PDF contents first in some XML format and later Post 
> that XML to be indexed by Solr.?
>
>
>
>
>
>
>
> Your early response is much appreciated.
>
>
>
> Thanks
>
> G
>
>






Re: Hardware requirement for 500 million documents

2015-01-04 Thread Jürgen Wagner (DVT)
Hi Ali,
  the sizing is not determined solely by the number of indexed documents
(and even less by the number of concurrent users).

- Document volume (number of documents, amount of text data to be
indexed with each document, number and types of fields, the cardinality
of fields) guides you toward the number of primary shards or collections
you want to have in your environment.

- Query volume determines the replication factor needed to achieve proper
response times.

- The amount of concurrency (e.g., do you have primarily insertions of
new documents and then queries, or is there also a significant deletion
process running in parallel - partial updates count as
deletion+insertion) and the frequency of required index updates also
influence the sizing.

- Usually, processing (document-to-text conversion, extraction, enrichment, ...)
will be handled outside Solr (and has to be taken into account when
scaling the hardware of the entire platform).

Some figures you may want to know before tackling this project are

- Are there different types of documents (e.g., text, media, data) that
have different textual amounts for indexing (e.g., plain text ~100%,
HTML ~90%, Microsoft Word ~15%, PDF ~10%, ...) to be handled?

- What are the size distributions (possibly over these types of documents)?

- What is the expected update frequency? Can you do incremental crawling?

- What types of attributes and facets are you planning to have for these
documents?

- How fresh an index do you need?

- Is this concurrent indexing and querying or will indexing happen,
e.g., at night, while during the day, users will query the platform?

- What are the types of typical queries issued by users?

- Will you have to take security into account (possibly leading to large
Boolean expressions added to queries to filter by entitlement groups)?

This should give you a first direction. Then run a prototype to
measure representative figures for scaling and refine your estimates.

Best regards,
--Jürgen




On 04.01.2015 15:36, Ali Nazemian wrote:
> Hi,
> I was wondering what is the hardware requirement for indexing 500 million
> documents in Solr? Suppose maximum number of concurrent users in peak time
> would be 20.
> Thank you very much.
>






Re: Solr HTTP client authentication

2014-11-17 Thread Jürgen Wagner (DVT)
Why rely on the default http client? Why not create one with

HttpClients.custom()
.setDefaultSocketConfig(socketConfig)
.setDefaultRequestConfig(requestConfig)
.setSSLSocketFactory(sslsf)
.build();

that has the SSLConnectionSocketFactory property set up with an
SSLContext that has the trust store and key store loaded properly?
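
Fleshed out, that could look as follows (a sketch against HttpClient
4.3+; store paths and passwords are placeholders):

import java.io.FileInputStream;
import java.security.KeyStore;

import javax.net.ssl.SSLContext;

import org.apache.http.conn.ssl.SSLConnectionSocketFactory;
import org.apache.http.conn.ssl.SSLContexts;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;

public class SolrSslClientFactory {
    // Sketch: build an HttpClient whose SSLContext carries both the trust
    // store (server certificates we accept) and the key store (our client
    // certificate). The result can be handed to SolrJ, e.g.
    // new HttpSolrServer(url, httpClient).
    public static CloseableHttpClient build() throws Exception {
        KeyStore trustStore = KeyStore.getInstance(KeyStore.getDefaultType());
        try (FileInputStream in = new FileInputStream("/path/to/truststore.jks")) {
            trustStore.load(in, "trustpass".toCharArray());
        }
        KeyStore keyStore = KeyStore.getInstance(KeyStore.getDefaultType());
        try (FileInputStream in = new FileInputStream("/path/to/keystore.jks")) {
            keyStore.load(in, "keypass".toCharArray());
        }
        SSLContext sslContext = SSLContexts.custom()
                .loadTrustMaterial(trustStore)
                .loadKeyMaterial(keyStore, "keypass".toCharArray())
                .build();
        SSLConnectionSocketFactory sslsf =
                new SSLConnectionSocketFactory(sslContext);
        return HttpClients.custom()
                .setSSLSocketFactory(sslsf)
                .build();
    }
}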

Best,
--J.

On 17.11.2014 18:41, Bai Shen wrote:
> I had seen where I could pass in an HttpClient to the SolrServer.  The
> problem is that the HttpClient only receives the authentication information
> through the execute method using the context. See the example located here.
>
> https://hc.apache.org/httpcomponents-client-4.3.x/tutorial/html/authentication.html
>
> DefaultHttpClient has methods to set the authentication information but the
> class is deprecated.
>
> Thanks.
>
> On Mon, Nov 17, 2014 at 11:35 AM, Fuad Efendi  wrote:
>
>>>  I can
>>> manually create an httpclient and set up authentication but then I can't
>> use solrj.
>>
>> Yes; correct; except that you _can_ use solj with this custom HttpClient
>> instance (which will intercept authentication, which will support cookies,
>> SSL or plain HTTP, Keep-Alive, and etc.)
>>
>> You can provide to SolrJ custom HttpClient at construction:
>>
>> final HttpSolrServer myHttpSolrServer =
>> new HttpSolrServer(
>> SOLR_URL_BASE + "/" + SOLR_CORE_NAME,
>> myHttpClient);
>>
>>
>> Best Regards,
>>
>> http://www.tokenizer.ca
>>
>>
>> -Original Message-
>> From: Anurag Sharma [mailto:anura...@gmail.com]
>> Sent: November-17-14 11:21 AM
>> To: solr-user@lucene.apache.org
>> Subject: Re: Solr HTTP client authentication
>>
>> I think Solr encourages SSL rather than authentication
>>
>> On Mon, Nov 17, 2014 at 6:08 PM, Bai Shen  wrote:
>>
>>> I am using solrj to connect to my solr server.  However I need to
>>> authenticate against the server and can not find out how to do so
>>> using solrj.  Is this possible or do I need to drop solrj?  I can
>>> manually create an httpclient and set up authentication but then I can't
>> use solrj.
>>> Thanks.
>>>
>>






Re: Restrict search to subset (a list of aprrox 40,000 ids from an external service) of corpus

2014-11-14 Thread Jürgen Wagner (DVT)
Hi,
  there's not much of a search operation here. Why not store the
documents in a key/value store and simply fetch them by matching ids?

Another approach: as there is no query, you could easily partition the
set of ids and fetch the results in multiple batches, as sketched below.

The maximum number of clauses defaults to 1024. You can set it to a higher
value using the respective method in
org.apache.lucene.search.BooleanQuery (I've never done that one before,
though).
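
A sketch of the batching approach with SolrJ (batch size and URL are
arbitrary; it assumes the ids contain no query syntax characters):

import java.util.List;

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrDocument;

// Sketch: instead of one query with 40,000 OR clauses, issue many small
// id:(...) queries that stay well below the clause limit.
public class BatchedIdFetch {
    public static void fetch(List<String> ids) throws Exception {
        SolrServer server =
                new HttpSolrServer("http://localhost:8983/solr/collection1");
        int batchSize = 500; // comfortably below the 1024-clause default
        for (int from = 0; from < ids.size(); from += batchSize) {
            List<String> batch =
                    ids.subList(from, Math.min(from + batchSize, ids.size()));
            SolrQuery q =
                    new SolrQuery("id:(" + String.join(" OR ", batch) + ")");
            q.setRows(batch.size());
            for (SolrDocument doc : server.query(q).getResults()) {
                System.out.println(doc.getFieldValue("id"));
            }
        }
    }
}

Solr 4.10 also added a terms query parser ({!terms f=id}id1,id2,...)
that bypasses the Boolean clause limit entirely.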

Now, your mileage may vary. What is the idea behind this retrieval? Do you
really want to fetch objects by id? Check out MemcacheDB, Apache
Cassandra or Apache CouchDB, depending on your application and the type
of information you want to store.

Best regards,
--Jürgen

On 14.11.2014 17:51, henry cleland wrote:
> Hi guys,
> How do I search only a subset of my corpus based on a large list of non
> consecutive unique key ids (cannot do a range query).
> Is there a way around doing this  q=id:(id1 OR id2 OR id3 OR id4 ... OR
> id4 ) AND name:*
>
> Also what is the limit of "OR"s i can apply on the query if that is the
> only way out, i don't suppose it is infinity.
> Thanks
>






Re: One ZooKeeper and many Solr clouds

2014-11-14 Thread Jürgen Wagner (DVT)
Hello Enrico,
  you may use the chroot feature of Zookeeper to root the different
SolrCloud instances differently. Instead of zoohost1:2181, you can use
zoohost1:2181/cluster1 as the Zookeeper location. Unless there is a load
issue with high rates of updates and other data traffic, a single
Zookeeper ensemble can very well handle multiple SolrCloud instances.

https://wiki.apache.org/solr/SolrCloud#Zookeeper_chroot
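
In SolrJ terms, each application then connects with its own chroot
suffix (a sketch; host and collection names are examples):

import org.apache.solr.client.solrj.impl.CloudSolrServer;

// Sketch: one Zookeeper ensemble, partitioned by chroot. Each SolrCloud
// cluster (whose nodes are started with the same chrooted zkHost, e.g.
// -DzkHost=zoohost1:2181/solr1) only sees the subtree it is rooted under.
public class ChrootExample {
    public static void main(String[] args) throws Exception {
        CloudSolrServer solr1 = new CloudSolrServer(
                "zoohost1:2181,zoohost2:2181,zoohost3:2181/solr1");
        solr1.setDefaultCollection("collection1");
        solr1.connect();
        System.out.println(solr1.getZkStateReader().getClusterState());
    }
}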

Best regards,
--Jürgen


On 14.11.2014 13:41, Enrico Trucco wrote:
> Hello
>
> I am considering to start using Solr Cloud and to share a single ZooKeeper
> between different Solr clouds and eventually other software.
>
> In all the examples I see online, the configuration of a Solr cloud is
> stored in the root node of ZooKeeper.
> I was wondering if it is possible to specify the node under which Solr
> stores its configuration.
> For example, let's suppose that i have 2 Solr clouds (solr1 and solr2)
> and another software sharing the same zookeeper instance and storing files
> under other1.
>
> I would like to have 3 nodes like
> /solr1
> /solr2
> /other1
> Each of them containing files of a single entity.
> I will manage isolation through ZooKeeper ACL functionalities.
>
> Is there a way to achieve this?
>
> Regards
> Enrico
>






Re: Best Practices for open source pipeline/connectors

2014-11-04 Thread Jürgen Wagner (DVT)
Hello Dan,
  ManifoldCF is a connector framework, not a processing framework.
Therefore, you may try your own lightweight connectors (which usually
are not really rocket science and may take less time to write than the
time to configure a super-generic connector of some sort), any connector
out there (including Nutch and others), or even commercial offerings from
some companies. That, however, won't make you very happy all by itself -
my guess.

Key to really creating value out of data dragged into a search
platform is the processing pipeline. Depending on the scale of data and
the amount of processing you need to do, you may take a simplistic
approach with just some more or less configurable Java components
massaging your data until it can be sent to Solr (without using Tika or
any other processing in Solr), or you can employ frameworks like Apache
Spark to really heavily transform and enrich data before feeding them
into Solr.

I prefer to have a clear separation between connectors, processing,
indexing/querying and front-end visualization/interaction. Only the
indexing/querying task I grant to Solr (or naked Lucene or
Elasticsearch). Each of the different task types has entirely different
scaling requirements and computing/networking properties, so you
definitely don't want them depend on each other too much. Addressing the
needs of several customers, one needs to even swap one or the other
component in favour of what a customer prefers or needs.

So, my answer is YES. But we've also tried Nutch, our own specialized
crawlers and a number of elaborate connectors for special customer
applications. In any case, the result of that connector won't go into
Solr directly. It will go into processing. From there it will go into Solr. I
suspect that connectors won't be the challenge in your project. Solr
requires a bit of tuning and tweaking, but you'll be fine eventually.
Document processing will be the fun part. Once you start scaling the zoo
of components, this will become evident :-)

What is the volume and influx rate in your scenario?

Best regards,
--Jürgen


On 04.11.2014 22:01, Dan Davis wrote:
> I'm trying to do research for my organization on the best practices for
> open source pipeline/connectors.   Since we need Web Crawls, File System
> crawls, and Databases, it seems to me that Manifold CF might be the best
> case.
>
> Has anyone combined ManifestCF with Solr UpdateRequestProcessors or
> DataImportHandler?   It would be nice to decide in ManifestCF which
> resultHandler should receive a document or id, barring that, you can post
> some fields including an URL and have Data Import Handler handle it - it
> already supports scripts whereas ManifestCF may not at this time.
>
> Suggestions and ideas?
>
> Thanks,
>
> Dan
>






Re: Consul instead of ZooKeeper anyone?

2014-11-04 Thread Jürgen Wagner (DVT)
Hello Greg,
  we run Zookeeper not on dedicated Zookeeper machines, but rather on
admin nodes in search application clusters (that makes two instances),
plus on at least one more node that does not have much load (e.g., a
crawling node). Also, as long as you don't stuff too much data into
Zookeeper yourself, the memory footprint of 2 GB seems to be a bit
generous to support SolrCloud.

Best regards,
--Jürgen


On 04.11.2014 20:23, Greg Solovyev wrote:
> Thanks for the answers Erick. I can see that this is a significant effort and 
> I am certainly not asking the community to undertake this work. I was 
> actually going to take a stab at it myself. Regarding $$ savings from not 
> requiring ZK my assumption is that ZK in production demands a dedicated host 
> and requires 2GB RAM/instance while Consul runs on less than 100MB 
> RAM/instance. So, for ISPs, BSP and large enterprise deployments, the savings 
> come would from reduced resource requirements. 
>
> Thanks,
> Greg
>
>





Re: Consul instead of ZooKeeper anyone?

2014-11-01 Thread Jürgen Wagner (DVT)
Hello Greg,
  Consul and Zookeeper are quite similar in what they offer with respect
to what SolrCloud needs. Service discovery, watches on distributed
cluster state, and updates of configuration could all be handled through
Consul. Plus, Consul offers built-in capabilities for
multi-datacenter scenarios and encryption. Also, the capability to
query Consul via DNS, i.e., without any client-side library
requirements, is quite compelling. One could integrate Java, C/C++,
C#/.NET, Python, Ruby and other types of clients without much effort.

The largest benefit, however, I would see for the zoo of services around
Solr. At least in my experience, SolrCloud for serious applications is
never deployed by itself. There will be numerous services for data
collection, semantic processing, log management, monitoring,
administration, reporting and user front-ends around the core SolrCloud.
This zoo is hard to manage, and especially the coordination of
configuration and cluster consistency is difficult. Consul could
help here, as it comes from the more operations-type level of managing an
elastic set of services in data centers.

So, after singing the praises, why have I not started using Consul then? :-)

First and foremost: Zookeeper from the Hadoop/Apache ecosystem is
already integrated with SolrCloud. Ripping it out and replacing it with
something similar but not quite the same would require significant
effort, especially for testing this thoroughly. My clients are not willing to
pay for basic groundwork.

Second: Consul looks nice, but the documentation leaves many questions open.
Once you start setting it up, there will be questions where you have to
dive into the code for answers. Consul does not give me the same
"mature" impression as Zookeeper. So, I am still using our own service
management framework for the zoo of services in typical search clouds.
Consul is young, however, and may evolve. The version is 0.4.1, and I
don't use anything with a zero in front to manage a serious customer
infrastructure. Would you trust a customer's 50-100 TB of source
data to a set of SolrClouds based on a 0.x Consul? ;-)

Third: Consul lacks decent integration with log management. In any
distributed environment, you don't just want to keep a snapshot of the
moment, but rather a possibly long history of state changes and
statistics, so there is a chance not just to monitor, but also to act.
In that respect, we would need more cloud-management recipes
integrated, without having to pull out the entire Puppet or Chef stack
that comes with its own view of the world. That again is a topic of
maturity and being fit for real-life requirements. I would love to see
Consul evolve into that type of lightweight cloud management with basic
services integrated. But: some way to go still.

There are other issues, but these are the major ones from my perspective.

So, the concept is nice, Hashimoto et al. are known to be creative
minds, and therefore I will keep watching what's happening there, but I
won't use Consul for any real customer projects yet - not even for the
part that is not SolrCloud-dependent.

Best regards,
--Jürgen



On 01.11.2014 00:08, Greg Solovyev wrote:
> I am investigating a project to make SolrCloud run on Consul instead of 
> ZooKeeper. So far, my research revealed no such efforts, but I wanted to 
> check with this list to make sure I am not going to be reinventing the wheel. 
> Have anyone attempted using Consul instead of ZK to coordinate SolrCloud 
> nodes? 
>
> Thanks, 
> Greg 
>






Re: Indexing documents/files for production use

2014-10-28 Thread Jürgen Wagner (DVT)
Hello Olivier,
  for real production use, you won't really want to use toys like
post.jar or curl. You want a decent connector to whatever data source
there is, that fetches data, possibly massages it a bit, and then feeds
it into Solr - by means of SolrJ or directly into the web service of
Solr via binary protocols. This way, you can properly handle incremental
feeding, processing of data from remote locations (with the connector
being closer to the data source), and also source data security. Also
think about what happens if you do processing of incoming documents in
Solr: what happens if Tika runs out of memory because of PDF problems?
What if this crashes your Solr node? In our Solr projects, we generally
do not do any sizable processing within Solr, as document processing and
document indexing or querying all have different scaling properties.

"Production use" most typically is not achieved by deploying a vanilla
Solr, but rather by adding a bit more glue and wrappage, so the whole will
fit your requirements in terms of functionality, scaling, monitoring and
robustness. Some similar platforms like Elasticsearch try to alleviate
these pains of going to a production-style infrastructure, but that's at
the expense of flexibility and comes with limitations.

For proof-of-concept or demonstrator-style applications, the plain tools
out of the box will be fine. For production applications, you want to
have more robust components.
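
For comparison, the SolrJ side of such a feed is simple enough (a
minimal sketch; URL, queue size and thread count are placeholders to be
tuned):

import org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrServer;
import org.apache.solr.common.SolrInputDocument;

// Sketch: a connector hands over already-processed documents; SolrJ
// batches and streams them to Solr. No Tika or other heavy processing
// happens inside Solr itself.
public class Feeder {
    public static void main(String[] args) throws Exception {
        ConcurrentUpdateSolrServer server = new ConcurrentUpdateSolrServer(
                "http://localhost:8983/solr/collection1",
                1000, // internal queue size
                4);   // concurrent sender threads
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "doc-1");
        doc.addField("title", "Example document");
        server.add(doc);
        server.commit(); // in production, prefer autoCommit settings
        server.shutdown();
    }
}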

Best regards,
--Jürgen

On 28.10.2014 22:12, Olivier Austina wrote:
> Hi All,
>
> I am reading the solr documentation. I have understood that post.jar
> 
> is not meant for production use, cURL
> 
> is not recommended. Is SolrJ better for production?  Thank you.
> Regards
> Olivier
>






Re: CoreAdminRequest in SolrCloud

2014-10-20 Thread Jürgen Wagner (DVT)
Hi Nabil,
  you can get /clusterstate.json from Zookeeper. Check
CloudSolrServer.getZkStateReader():

http://lucene.apache.org/solr/4_10_1/solr-solrj/org/apache/solr/client/solrj/impl/CloudSolrServer.html
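
A sketch of walking the cluster state that way (using the zkHost from
your snippet; API as of SolrJ 4.10):

import org.apache.solr.client.solrj.impl.CloudSolrServer;
import org.apache.solr.common.cloud.ClusterState;
import org.apache.solr.common.cloud.Replica;
import org.apache.solr.common.cloud.Slice;

// Sketch: read the cluster state through the ZkStateReader and walk all
// collections, shards and replicas across the whole cluster.
public class ClusterWalk {
    public static void main(String[] args) throws Exception {
        CloudSolrServer server = new CloudSolrServer("10.0.1.4:2181");
        server.connect(); // establishes the Zookeeper connection
        ClusterState state = server.getZkStateReader().getClusterState();
        for (String collection : state.getCollections()) {
            for (Slice slice : state.getSlices(collection)) {
                for (Replica replica : slice.getReplicas()) {
                    System.out.println(collection + " / " + slice.getName()
                            + " -> " + replica.getNodeName());
                }
            }
        }
    }
}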

Best regards,
--Jürgen

On 20.10.2014 15:16, nabil Kouici wrote:
> Hi Jürgen,
>
> As you can see,  I'm not using direct connection to node. It's a CloudServer. 
> Do you have example to how to get Cluster status from solrJ.
>
> Regards,
> Nabil.
>
>
> On Monday, 20 October 2014 at 13:44, Jürgen Wagner (DVT)
>  wrote:
>  
>
>
> Hello Nabil,
>   isn't that what should be expected? Cores are local to nodes, so
>   you only get the core status from the node you're asking. Cluster
>   status refers to the entire SolrCloud cluster, so you will get the
>   status over all collection/nodes/shards[=cores]. Check the Core
>   Admin REST interface for comparison.
>
> Cheers,
> --Jürgen
>
> On 20.10.2014 11:41, nabil Kouici wrote:
>
> Hi,
> I'm trying to get all shard statistics in cloud configuration. I've used
> CoreAdminRequest, but the problem is I get statistics only for the shards (or
> cores) in one node (I've 2 nodes):
>
> String zkHostString = "10.0.1.4:2181";
> CloudSolrServer solrServer = new CloudSolrServer(zkHostString);
> CoreAdminRequest request = new CoreAdminRequest();
> request.setAction(CoreAdminAction.STATUS);
> CoreAdminResponse cores = request.process(solrServer);
> for (int i = 0; i < cores.getCoreStatus().size(); i++) {
>   NamedList ll = cores.getCoreStatus().getVal(i);
>   System.out.println(ll.toString());
> }
>
> Any idea?
>
> Regards,
> Nabil.
>
>






Re: CoreAdminRequest in SolrCloud

2014-10-20 Thread Jürgen Wagner (DVT)
Hello Nabil,
  isn't that what should be expected? Cores are local to nodes, so you
only get the core status from the node you're asking. Cluster status
refers to the entire SolrCloud cluster, so you will get the status over
all collection/nodes/shards[=cores]. Check the Core Admin REST interface
for comparison.

Cheers,
--Jürgen

On 20.10.2014 11:41, nabil Kouici wrote:
> Hi,
> I'm trying to get all shard statistics in cloud configuration. I've used 
> CoreAdminRequest, but the problem is I get statistics only for the shards (or 
> cores) in one node (I've 2 nodes):
>
> String zkHostString = "10.0.1.4:2181";
> CloudSolrServer solrServer= new CloudSolrServer(zkHostString);
> CoreAdminRequest request = new CoreAdminRequest();
> request.setAction(CoreAdminAction.STATUS);
> CoreAdminResponse cores = request.process(solrServer);
> for (int i = 0; i < cores.getCoreStatus().size(); i++) {
>
>  NamedList ll=cores.getCoreStatus().getVal(i);
>  System.out.println(ll.toString());
> } 
>
> Any idea?
>
> Regards,
> Nabil. 
>






Re: issue in launching SolrCloud windows/cygwin

2014-10-19 Thread Jürgen Wagner (DVT)
Hello Anurag,
  the CRLF problem with Cygwin can be cured by running the scripts all
through this filter:

tr -d '\r' < $script > $script.new ; mv $script.new $script

with $script holding the path of the script to be massaged.

Generally, however, I would advise using the standard scripts only for
testing or demonstration purposes, as you're very likely to have to
change parameters or settings for your production environment anyway.
Using the latest Jetty is one such example.

Best regards,
--Jürgen

On 19.10.2014 08:51, Anurag Sharma wrote:
> Here is the issue am facing issue in using the 'solr' script on Windows
> with cygwin terminal:
>
> $ bin/solr -e cloud
> bin/solr: line 16: $'\r': command not found
> bin/solr: line 17: $'\r': command not found
> bin/solr: line 46: $'\r': command not found
> which: no lsof in
> (/usr/local/bin:/usr/bin:/cygdrive/c/Windows/system32:/cygdrive/c/Windows:/cygdrive/c/Windows/System32/Wbem:/cygdrive/c/Windows/System32/WindowsPowerShell/v1.0:/cygdrive/c/Program
> Files/TortoiseSVN/bin:/cygdrive/c/Program
> Files/Java/jdk1.7.0_51/bin:/cygdrive/c/Program
> Files/apache-ant-1.9.3/bin:/cygdrive/c/Program Files
> (x86)/Python-27:/cygdrive/c/Program Files (x86)/Python-27/Scripts)
> bin/solr: line 52: $'\r': command not found
> bin/solr: line 87: syntax error near unexpected token `"$HOME/.solr.in.sh"'
> 'in/solr: line 87: `   "$HOME/.solr.in.sh" \
>
>
> further
> $ bin/solr start -cloud -d node1 -p 8983
> bin/solr: line 16: $'\r': command not found
> bin/solr: line 17: $'\r': command not found
> bin/solr: line 46: $'\r': command not found
> which: no lsof in
> (/usr/local/bin:/usr/bin:/cygdrive/c/Windows/system32:/cygdrive/c/Windows:/cygdrive/c/Windows/System32/Wbem:/cygdrive/c/Windows/System32/WindowsPowerShell/v1.0:/cygdrive/c/Program
> Files/TortoiseSVN/bin:/cygdrive/c/Program
> Files/Java/jdk1.7.0_51/bin:/cygdrive/c/Program
> Files/apache-ant-1.9.3/bin:/cygdrive/c/Program Files
> (x86)/Python-27:/cygdrive/c/Program Files (x86)/Python-27/Scripts)
> bin/solr: line 52: $'\r': command not found
> bin/solr: line 87: syntax error near unexpected token `"$HOME/.solr.in.sh"'
> 'in/solr: line 87: `   "$HOME/.solr.in.sh" \
>
> Is there any other way I can run the SolrCloud using "java -jar start.jar"
> options?
>






Re: Frequent recovery of nodes in SolrCloud

2014-10-16 Thread Jürgen Wagner (DVT)
Hello,
  you have one shard and 11 replicas? Hmm...

- Why do you have to keep two nodes on some machines?
- Physical hardware or virtual machines?
- What is the size of this index?
- Is this all on a local network or are there links with potential
outages or failures in between?
- What is the query load?
- Have you had a look at garbage collection?
- Do you use the internal Zookeeper?
- How many nodes?
- Any observers?
- What kind of load does Zookeeper show?
- How much RAM do these nodes have available?
- Do some servers get into swapping?
- ...

How about some more details in terms of sizing and topology?

Cheers,
--Jürgen

On 16.10.2014 18:41, sachinpkale wrote:
> Hi,
>
> Recently we have shifted to SolrCloud (4.10.1) from traditional Master-Slave
> configuration. We have only one collection and it has only only one shard.
> Cloud Cluster contains total 12 nodes (on 8 machines. On 4 machiens, we have
> two instances running on each) out of which one is leader. 
>
> Whenever I see the cluster status using http://:/solr/#/~cloud, it
> shows at least one (sometimes, it is 2-3) node status as recovering. We are
> using HAProxy load balancer and there also many times, it is showing the
> nodes are recovering. This is happening for all nodes in the cluster. 
>
> What would be the problem here? How do I check this in logs?
>
>
>
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/Frequent-recovery-of-nodes-in-SolrCloud-tp4164541.html
> Sent from the Solr - User mailing list archive at Nabble.com.






Re: Access solr cloud via ssh tunnel?

2014-09-16 Thread Jürgen Wagner (DVT)
In a test scenario, I used stunnel for connections between some
Zookeeper observers and the central ensemble, as well as between a SolrJ
4.9.0 client and the central Zookeepers. This is entirely transparent,
modulo performance penalties due to network latency and SSL overhead. I
finally ended up placing the observer node close to the SolrJ client.

Depending on what kind of network connection is between the SolrJ client
and the cluster, you may run into TCP MTU issues or packet fragmentation
problems. Hard to say what's happening without knowing any details on
the nature of the tunnel.

Try testing some four-letter commands from the SolrJ client machine,
e.g. "echo ruok | nc localhost 2181". Does that work?
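
For reference, a sketch of what a tunnel setup would have to cover (host
names and ports are assumptions): SolrJ first talks to Zookeeper, then
connects directly to the Solr node addresses recorded in the cluster
state, so those ports need forwarding as well.

# Forward the Zookeeper client port and the Solr node port through the
# bastion host. Tunneling 2181 alone is not enough; the node host:port
# recorded in clusterstate.json must also be reachable, i.e., the node
# hostname must resolve to localhost on the client (e.g., via /etc/hosts).
ssh -N -L 2181:zkhost1:2181 -L 8983:solrnode1:8983 user@bastion

# Verify the tunneled Zookeeper with a four-letter command:
echo ruok | nc localhost 2181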

Best regards,
--Jürgen

On 16.09.2014 21:25, Michael Joyner wrote:
> I am in a situation where I need to access a solrcloud behind a firewall.
>
> I have a tunnel enabled to one of the zookeeper as a starting points
> and the following test code:
>
> CloudSolrServer server = new CloudSolrServer("localhost:2181");
> server.setDefaultCollection("test");
> SolrPingResponse p = server.ping();
> System.out.println(p.getRequestUrl());
>
> Right now it just "hangs" without any errors... what additional ports
> need forwarding and other configurations need setting to access a
> solrcloud over a ssh tunnel or tunnels?






Re: Performance of Unsorted Queries

2014-09-16 Thread Jürgen Wagner (DVT)
Depending on the size of the individual records returned, I'd use a
decent window size (to minimize network and marshalling/unmarshalling
overhead) of maybe 1000-10000 items sorted by id, and use that in
combination with cursorMark. That will be easier on the server side in
terms of garbage collection.
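
A sketch of that pattern in SolrJ (4.7+, where cursorMark was
introduced; it assumes ID is the unique key, as in your example):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;
import org.apache.solr.common.params.CursorMarkParams;

// Sketch: deep paging with cursorMark. The sort must include the unique
// key so the cursor is stable.
public class CursorPaging {
    public static void main(String[] args) throws Exception {
        HttpSolrServer server =
                new HttpSolrServer("http://localhost:8983/solr/collection1");
        SolrQuery q = new SolrQuery("ACCT_ID:1153");
        q.setFields("ID");
        q.setRows(1000);
        q.setSort(SolrQuery.SortClause.asc("ID"));
        String cursorMark = CursorMarkParams.CURSOR_MARK_START;
        while (true) {
            q.set(CursorMarkParams.CURSOR_MARK_PARAM, cursorMark);
            QueryResponse rsp = server.query(q);
            for (SolrDocument doc : rsp.getResults()) {
                System.out.println(doc.getFieldValue("ID"));
            }
            String next = rsp.getNextCursorMark();
            if (cursorMark.equals(next)) break; // no further results
            cursorMark = next;
        }
    }
}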

Best regards,
--Jürgen

On 16.09.2014 17:03, Ilya Bernshteyn wrote:
> If I query for IDs and I do not care about order, should I still expect
> better performance paging the results? (e.g. rows=1000 or rows=10000) The
> use case is that I need to get all of the IDs regardless (there will be
> thousands, maybe 10s of thousands, but not millions)
>
> Example query:
>
> http://domain/solr/select?q=ACCT_ID%3A1153&fq=SOME_FIELD%3SomeKeyword%2C+SOME_FIELD_2%3ASomeKeyword&rows=1&fl=ID&wt=json
>
> With this kind of query, I notice that rows=10 returns in 5ms, while
> rows=10000 (producing about 7000 results) returns in about 500ms.
>
> Another way to word my question: if I have 100k unordered IDs to
> retrieve, is performance better getting 1k at a time or all 100k at the
> same time?
>
> Thanks,
>
> Ilya
>






Re: FAST-like document vector data structures in Solr?

2014-09-05 Thread Jürgen Wagner (DVT)
Thanks for posting this. I was just about to send off a message of
similar content :-)

Important to add:

- In FAST ESP, you could have more than one such docvector associated
with a document, in order to reflect different metrics.

- Term weights in docvectors are document-relative, not absolute.

- Processing is done in the search processor (close to the index), not
in the QR server (providing transformations on the result list).

This docvector could be used for unsupervised clustering,
related-to/similarity search, tag clouds or more weird stuff like
identifying experts on topics contained in a particular document.

With Solr, it seems I have to handcraft the term vectors to reflect the
right weights, to approximate the effect of FAST docvectors, e.g., by
normalizing them to [0...1). Processing performance would still be
different from that of the classical FAST docvectors. The space consumption
may become ugly for a 200+ GB range shard; however, FAST has also been
quite generous with disk space anyway.
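
For concreteness, the query side of that approximation would look
roughly like this (a sketch; it assumes an MLT handler registered at
/mlt and a field "docvector" indexed with termVectors="true"):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.params.MoreLikeThisParams;

// Sketch: find documents similar to a reference document via the MLT
// handler, with a term-vector-enabled field as the docvector stand-in.
public class SimilarDocs {
    public static void main(String[] args) throws Exception {
        HttpSolrServer server =
                new HttpSolrServer("http://localhost:8983/solr/collection1");
        SolrQuery q = new SolrQuery("id:doc-42"); // the similarity reference
        q.setRequestHandler("/mlt");
        q.set(MoreLikeThisParams.SIMILARITY_FIELDS, "docvector");
        q.set(MoreLikeThisParams.MIN_TERM_FREQ, 1);
        q.set(MoreLikeThisParams.MIN_DOC_FREQ, 1);
        QueryResponse rsp = server.query(q);
        System.out.println(rsp.getResults());
    }
}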

So, the interesting question is whether there is a more canonical way of
handling this in Solr/Lucene, or whether something like it is planned for 5.0+.

Best regards,
--Jürgen

On 05.09.2014 16:02, Jack Krupansky wrote:
> For reference:
>
> “Item Similarity Vector Reference
>
> This property represents a similarity reference when searching for similar 
> items. This is a similarity vector representation that is returned for each 
> item in the query result in the docvector managed property.
>
> The value is a string formatted according to the following format:
>
> [string1,weight1][string2,weight2]...[stringN,weightN]
>
> When performing a find similar query, the SimilarTo element should contain a 
> string parameter with the value of the docvector managed property of the item 
> that is to be used as the similarity reference. The similarity vector 
> consists of a set of "term,weight" expressions, indicating the most important 
> terms or concepts in the item and the corresponding perceived importance 
> (weight). Terms can be single words or phrases.
>
> The weight is a float value between 0 and 1, where 1 indicates the highest 
> relevance.
>
> The similarity vector is created during item processing and indicates the 
> most important terms or concepts in the item and the corresponding weight.”
>
> See:
> http://msdn.microsoft.com/en-us/library/office/ff521597(v=office.14).aspx
>
> -- Jack Krupansky



Re: FAST-like document vector data structures in Solr?

2014-09-05 Thread Jürgen Wagner (DVT)
Hello Jim,
  yes, I am aware of the TermVector and MoreLikeThis stuff. I am
presently mapping docvectors to these mechanisms and create term vectors
myself from third-party text mining components.

However, it's not quite like the FAST docvectors. Particularly, the
performance of MoreLikeThis queries based on TermVectors is suboptimal
on large document sets, so more efficient support for such retrievals
in the Lucene core would be preferable.

Cheers,
--Jürgen

On 05.09.2014 10:55, jim ferenczi wrote:
> Hi,
> Something like ?:
> https://cwiki.apache.org/confluence/display/solr/The+Term+Vector+Component
> And just to show some impressive search functionality of the wiki: ;)
> https://cwiki.apache.org/confluence/dosearchsite.action?where=solr&spaceSearch=true&queryString=document+vectors
>
> Cheers,
> Jim
>
>
> 2014-09-05 9:44 GMT+02:00 "Jürgen Wagner (DVT)" > :
>> Hello all,
>>   as the migration from FAST to Solr is a relevant topic for several of
>> our customers, there is one issue that does not seem to be addressed by
>> Lucene/Solr: document vectors FAST-style. These document vectors are
>> used to form metrics of similarity, i.e., they may be used as a
>> "semantic fingerprint" of documents to define similarity relations. I
>> can think of several ways of approximating a mapping of this mechanism
>> to Solr, but there are always drawbacks - mostly performance-wise.
>>
>> Has anybody else encountered and possibly approached this challenge so far?
>>
>> Is there anything in the roadmap of Solr that has not revealed itself to
>> me, addressing this issue?
>>
>> Your input is greatly appreciated!
>>
>> Cheers,
>> --Jürgen
>>
>>






FAST-like document vector data structures in Solr?

2014-09-05 Thread Jürgen Wagner (DVT)
Hello all,
  as the migration from FAST to Solr is a relevant topic for several of
our customers, there is one issue that does not seem to be addressed by
Lucene/Solr: document vectors FAST-style. These document vectors are
used to form metrics of similarity, i.e., they may be used as a
"semantic fingerprint" of documents to define similarity relations. I
can think of several ways of approximating a mapping of this mechanism
to Solr, but there are always drawbacks - mostly performance-wise.

Has anybody else encountered and possibly approached this challenge so far?

Is there anything in the roadmap of Solr that has not revealed itself to
me, addressing this issue?

Your input is greatly appreciated!

Cheers,
--Jürgen



Re: Create collection dynamically in my program

2014-09-03 Thread Jürgen Wagner (DVT)
Hello Xinwu,
  does it change anything if you use an underscore instead of the dash in
the collection name?

What is the result of the call? Any status or error message?

Did you actually feed data into the collection?
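
For reference, a sketch of creating such a daily collection through the
Collections API from SolrJ (4.8+; zkHost, config name and shard counts
are assumptions):

import org.apache.solr.client.solrj.impl.CloudSolrServer;
import org.apache.solr.client.solrj.request.CollectionAdminRequest;

// Sketch: create a per-day collection via the Collections API, so the
// cluster state in Zookeeper stays authoritative.
public class DailyCollection {
    public static void main(String[] args) throws Exception {
        CloudSolrServer server = new CloudSolrServer("zkhost:2181");
        server.connect();
        CollectionAdminRequest.Create create = new CollectionAdminRequest.Create();
        create.setCollectionName("myCollection_20140903");
        create.setConfigName("myConfig");
        create.setNumShards(1);
        create.setReplicationFactor(1);
        System.out.println(create.process(server)); // inspect status/errors
    }
}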

Cheers,
--Jürgen

On 03.09.2014 11:21, xinwu wrote:
> Hi , all:
> I created a collection per day dynamically in my program. Like this:
>   
> But,when I searched data with "collection=myCollection-20140903",it
> showed "Collection not found:myCollection-20140903 ".
> I checked the "clusterState" in debug mode , there was not
> "myCollection-20140903" in it.
> But,there was "myCollection-20140903" in zk "clusterstate.json"
> actually.
>
> Is there something wrong in my way?
> Is there a new or better way to create collections dynamically?
>
> Thanks!
> -Xinwu
>
>
>
>
>
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/Create-collection-dynamically-in-my-program-tp4156601.html
> Sent from the Solr - User mailing list archive at Nabble.com.






Re: How can I set shard members?

2014-09-02 Thread Jürgen Wagner (DVT)
Hello,
  have you tried the "createNodeSet" option of collection/shard creation
and the "node" option of replica creation in Solr 4.9.0+?
As you're just testing, I would strongly recommend going to the latest
version.

https://cwiki.apache.org/confluence/display/solr/Collections+API

This is useful to provide underlying topology information. We use this
in customer scenarios to partition the set of servers into at least two
groups, so all shards of a SolrCloud cluster will have replica X of a
shard located in server group X (usually, X = 2). The two server groups
then correspond to two separate physical ESX clusters, so if one VM
cluster goes down, at least one replica of each shard will still be
available.
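
As an illustration (host names are placeholders), restricting placement
at creation time looks like this:

# Sketch: restrict placement of a 2-shard, 2-replica collection to a
# given set of nodes via createNodeSet (nodes given as host:port_solr).
curl "http://server1:8983/solr/admin/collections?action=CREATE&name=mycollection&numShards=2&replicationFactor=2&createNodeSet=server1:8983_solr,server2:8983_solr,server3:8983_solr,server4:8983_solr"

# Solr 4.8+ can also place an additional replica on a chosen node:
curl "http://server1:8983/solr/admin/collections?action=ADDREPLICA&collection=mycollection&shard=shard1&node=server2:8983_solr"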

Cheers,
--Jürgen


On 03.09.2014 06:00, Lee Chunki wrote:
> Hi,
>
> I am trying to test Solr Cloud with version 4.1.0.
> (  
> http://wiki.apache.org/solr/SolrCloud#Example_C:_Two_shard_cluster_with_shard_replicas_and_zookeeper_ensemble
>  )
>
> Is there any way to set shards & shard members?
>
> for example.
> server1, server2 for shard1
> server3, server4 for shard2
>
> when I tested the example, shard member depend on running Solr order.
> i.e. run server1 -> server2 -> server3 -> server4 then server1, 3 are shard1 
> and server 2,4 are shard2
> of course, from second time there is no dependency of running Solr order.
>
> and I tried "-DshardId=shard1” but it is not working.
>
> Thanks,
> Chunki.