Re: Finding similar documents with Elasticsearch
Do you get the same results if you compare against the file.file field?

--
David ;-)
Twitter : @dadoonet / @elasticsearchfr / @scrutmydocs

On 19 Jan 2014, at 07:13, Zoran Jeremic wrote:

> Hi guys, I'm trying to develop a service that stores uploaded files as
> attachments (the file is one field in the document). This part works fine,
> as I can search these files using like_text as input. However, the second
> part of this service should compare a file that has just been uploaded with
> the existing files in order to find duplicates or very similar files. The
> problem is that I always get the same results regardless of the input I'm
> using, and these results are wrong: the exact same file very often has the
> smallest score.
>
> [...]
Finding similar documents with Elasticsearch
Hi guys,

I'm trying to develop a service that stores uploaded files as attachments (the file is one field in the document). This part works fine, as I can search these files using like_text as input. However, the second part of this service should compare a file that has just been uploaded with the existing files in order to find duplicates or very similar files.

The problem is that I always get the same results regardless of the input I'm using, and these results are wrong: the exact same file very often has the smallest score. It looks like the like_text extracted from the uploaded file is always the same, and none of the documents has the expected score, which I believe should be 1 for identical documents. The scores I get are always less than 0.2.

Could you please check if there is something wrong with my code?

    // Load the mapping and read the uploaded file into a byte array.
    String mapping = copyToStringFromClasspath(
            "/org/prosolo/services/indexing/documents-mapping.json");
    byte[] txt = org.elasticsearch.common.io.Streams.copyToByteArray(file);

    Client client = ElasticSearchFactory.getClient();
    client.admin().indices()
            .putMapping(putMappingRequest(indexName).type(indexType).source(mapping))
            .actionGet();

    // Index the document; the "file" field holds the attachment bytes.
    IndexResponse iResponse = client.index(indexRequest(indexName).type(indexType)
            .source(jsonBuilder()
                    .startObject()
                    .field("file", txt)
                    .field("title", title)
                    .field("visibility", visibilityType.name().toLowerCase())
                    .field("ownerId", ownerId)
                    .field("description", description)
                    .field("contentType", DocumentType.DOCUMENT.name().toLowerCase())
                    .field("dateCreated", dateCreated)
                    .field("url", link)
                    .field("relatedToType", relatedToType)
                    .field("relatedToId", relatedToId)
                    .endObject()))
            .actionGet();

    // Refresh so that the new document is visible to search.
    client.admin().indices().refresh(refreshRequest()).actionGet();

    // Ask for documents similar to the one just indexed, based on the "file" field.
    MoreLikeThisRequestBuilder mltRequestBuilder = new MoreLikeThisRequestBuilder(
            client, ESIndexNames.INDEX_DOCUMENTS, ESIndexTypes.DOCUMENT, iResponse.getId());
    mltRequestBuilder.setField("file");
    SearchResponse response = client.moreLikeThis(mltRequestBuilder.request()).actionGet();

    // Print every hit with its title and score.
    SearchHits searchHits = response.getHits();
    System.out.println("getTotalHits:" + searchHits.getTotalHits());
    Iterator<SearchHit> hitsIter = searchHits.iterator();
    while (hitsIter.hasNext()) {
        SearchHit searchHit = hitsIter.next();
        System.out.println("FOUND DOCUMENT:" + searchHit.getId()
                + " title:" + searchHit.getSource().get("title")
                + " score:" + searchHit.score());
    }

And this is the mapping I was using:

    {
      "document": {
        "properties": {
          "title":         { "type": "string", "store": true },
          "description":   { "type": "string", "store": "yes" },
          "contentType":   { "type": "string", "store": "yes" },
          "dateCreated":   { "store": "yes", "type": "date" },
          "url":           { "store": "yes", "type": "string" },
          "visibility":    { "store": "yes", "type": "string" },
          "ownerId":       { "type": "long", "store": "yes" },
          "relatedToType": { "type": "string", "store": "yes" },
          "relatedToId":   { "type": "long", "store": "yes" },
          "file": {
            "path": "full",
            "type": "attachment",
            "fields": {
              "author":         { "type": "string" },
              "title":          { "store": true, "type": "string" },
              "keywords":       { "type": "string" },
              "file":           { "store": true, "term_vector": "with_positions_offsets", "type": "string" },
              "name":           { "type": "string" },
              "content_length": { "type": "integer" },
              "date":           { "format": "dateOptionalTime", "type": "date" },
              "content_type":   { "type": "string" }
            }
          }
        }
      }
    }

Thanks,
Zoran

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscr...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/79e29c89-62ea-42f3-be93-3e215a75860a%40googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.
[ANN] Elasticsearch Extended Analyze plugin update for 1.0.0.RC1
Hi,

I have just released version 1.0.0.RC1 of the extended analyze plugin, for Elasticsearch 1.0.0.RC1.

https://github.com/johtani/elasticsearch-extended-analyze

For more information about the plugin, see the GitHub page. Feedback, comments and issues are most welcome.

Best,
Jun

Jun Ohtani
joht...@gmail.com
blog : http://blog.johtani.info
twitter : http://twitter.com/johtani
start, end and gap in Price using Aggregations
I have been experimenting with the new aggregations feature and I'm wondering whether the following use case is possible. A sample gist and sample query are available here:

https://gist.github.com/hariinfo/8487083

This generates a sample price facet in the UI:

    Price
    0 - 5 (1)
    5 - 10 (2)
    10 - 20 (1)
    20 - 30 (2)

As you can see, I'm defining the price ranges in the query, so they are more or less static. In other words, when I query for products in a category, I already have to know the start, end and gap for prices in that category in order to generate meaningful price ranges.

Question:

# How do I set up a facet that takes into account the actual values of the price field (min and max from the search context) and sets up the ranges dynamically, maybe based on some rules?
# This turns out to be a typical use case in the e-commerce domain, where the products in a category may have a dynamic price range and I don't know the start, end or gap in advance.
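One possible way to approach this, sketched under assumptions rather than taken from the thread: issue a first request that only asks for the minimum and maximum price in the current search context (for example with min and max metric aggregations), derive the range boundaries in the application, and then send those boundaries in a second request with a range aggregation. The Java sketch below covers only the middle step. The class name PriceRanges, the method names and the "1, 2, 5, 10" rounding rule are hypothetical choices for illustration.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: derives price buckets from an observed min and max. */
final class PriceRanges {

    /** Rounds a raw gap up to a "nice" step: 1, 2, 5 or 10 times a power of ten. */
    static double niceGap(double rawGap) {
        double magnitude = Math.pow(10, Math.floor(Math.log10(rawGap)));
        double fraction = rawGap / magnitude;
        double nice = fraction <= 1 ? 1 : fraction <= 2 ? 2 : fraction <= 5 ? 5 : 10;
        return nice * magnitude;
    }

    /**
     * Returns half-open [from, to) pairs that cover min..max.
     * targetBuckets is only a hint; rounding may yield one bucket more or fewer.
     */
    static List<double[]> ranges(double min, double max, int targetBuckets) {
        List<double[]> result = new ArrayList<>();
        if (max <= min || targetBuckets < 1) {
            // Degenerate input: fall back to a single bucket.
            result.add(new double[] {min, max});
            return result;
        }
        double gap = niceGap((max - min) / targetBuckets);
        double start = Math.floor(min / gap) * gap;
        // Multiply instead of accumulating to avoid floating-point drift.
        for (int i = 0; start + i * gap <= max; i++) {
            result.add(new double[] {start + i * gap, start + (i + 1) * gap});
        }
        return result;
    }

    public static void main(String[] args) {
        // Example: prices in the current category span 3 to 28.
        for (double[] r : ranges(3, 28, 5)) {
            System.out.println(String.format("%.0f - %.0f", r[0], r[1]));
        }
    }
}
```

Each returned pair can be mapped to one from/to entry of a range aggregation. If equal-width buckets are enough, a histogram aggregation with the computed gap as its interval may avoid listing the ranges explicitly.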
Re: Question about performance of has_child filter
I will upgrade. I didn't really see anything relating to this in the release notes, though, except possibly this one:

https://github.com/elasticsearch/elasticsearch/issues/4592

Thanks

On Saturday, January 18, 2014 1:36:03 PM UTC, David Pilato wrote:

> As far as I remember, some memory improvements have been made since then.
> I would suggest upgrading to 0.90.10.
>
> --
> David ;-)
> Twitter : @dadoonet / @elasticsearchfr / @scrutmydocs
>
> [...]
how to save a term facet result?
What I want to do is reuse the facet result of one query as the terms filter of another query, preferably without sending the data to the client over the internet. Is this possible?
Re: Question about performance of has_child filter
As far as I remember, some memory improvements have been made since then. I would suggest upgrading to 0.90.10.

--
David ;-)
Twitter : @dadoonet / @elasticsearchfr / @scrutmydocs

On 18 Jan 2014, at 13:29, Josh F wrote:

> Hey all,
>
> I am running a 5 node ES cluster (version 0.90.4) and recently experienced
> an OutOfMemory issue where all nodes simultaneously went out of memory.
> I am wondering if this could be related to my use of the has_child filter.
>
> [...]
Question about performance of has_child filter
Hey all,

I am running a 5 node ES cluster (version 0.90.4) and recently experienced an OutOfMemory issue where all nodes simultaneously ran out of memory. I am wondering if this could be related to my use of the has_child filter.

I have two document types: 'Item' and 'ItemInList'. There is a parent-child relationship here: ItemInList is a child of Item. Any item can be in several lists. There are around 300,000 Item documents and 21 million ItemInList documents.

I frequently execute a has_child filter (maybe around 200-400 times per minute) to retrieve a list, i.e. to get all the items that have an ItemInList child containing a given listId. In Java this looks like:

    FilterBuilders.hasChildFilter("ItemInList",
            FilterBuilders.termFilter("ListId", listId));

Are there any known performance issues with doing this? Is the has_child filter cached by default? Does anyone have any tips to prevent out-of-memory issues with this type of query?

Thanks for any advice,
Josh
Sorting details
Hello,

I am trying to sort documents using the location attribute of a sub-document, with the following structure:

    {
      attribute1: value1,
      attribute2: value2,
      attribute3: [
        { attribute1: value1, location: { lat: latitude, lon: longitude } },
        { attribute1: value1, location: { lat: latitude, lon: longitude } }
      ]
    },
    {
      attribute1: value1,
      attribute2: value2,
      attribute3: [
        { attribute1: value1, location: { lat: latitude, lon: longitude } },
        { attribute1: value1, location: { lat: latitude, lon: longitude } }
      ]
    }

So basically I have the location attribute indexed as a geo_point in the mappings, and I'm sorting the parent documents based on these values. Strangely enough, it works just fine without any alterations to the search/sort code. The problem is that I want to know which sub-document (inside the array) caused its parent document to be the first (or second, or third) result, so that I can keep that one as the only sub-document in the array.

I don't know if that's doable via Elasticsearch directly (probably not), but is there any way to get some details about the sorting itself at run time, so that I can do it at the application level?

Regards,
Ahmed.
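If nothing in the response identifies the winning sub-document, one application-level fallback is to recompute the distances for each returned hit and keep the closest sub-document. The Java sketch below assumes the sort is a geo-distance sort from a known origin point and that the closest location decides a parent's position; neither is confirmed in the message above. The class NearestSubDocument, its method names and the spherical Earth radius are illustrative assumptions, and the distance formula (haversine) may differ slightly from what Elasticsearch computes.

```java
/** Hypothetical helper: finds which sub-document location is closest to an origin. */
final class NearestSubDocument {

    /** Mean Earth radius in kilometres (spherical approximation). */
    static final double EARTH_RADIUS_KM = 6371.0;

    /** Great-circle distance between two lat/lon points, using the haversine formula. */
    static double haversineKm(double lat1, double lon1, double lat2, double lon2) {
        double dLat = Math.toRadians(lat2 - lat1);
        double dLon = Math.toRadians(lon2 - lon1);
        double a = Math.sin(dLat / 2) * Math.sin(dLat / 2)
                + Math.cos(Math.toRadians(lat1)) * Math.cos(Math.toRadians(lat2))
                * Math.sin(dLon / 2) * Math.sin(dLon / 2);
        return 2 * EARTH_RADIUS_KM * Math.asin(Math.sqrt(a));
    }

    /**
     * locations holds one {lat, lon} pair per sub-document, in array order.
     * Returns the index of the closest one, or -1 if there are none.
     */
    static int indexOfNearest(double originLat, double originLon, double[][] locations) {
        int best = -1;
        double bestDistance = Double.POSITIVE_INFINITY;
        for (int i = 0; i < locations.length; i++) {
            double d = haversineKm(originLat, originLon, locations[i][0], locations[i][1]);
            if (d < bestDistance) {
                bestDistance = d;
                best = i;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Locations as they would be read from the attribute3 array of one hit.
        double[][] locations = { {10, 10}, {1, 1}, {5, 5} };
        System.out.println("closest sub-document index: " + indexOfNearest(0, 0, locations));
    }
}
```

The index returned by indexOfNearest can then be used to pick the matching entry from the attribute3 array of each hit's source.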
Re: Understanding Index Stats API Better
Thanks for the clarification, Luca! :)

Vaidik Kapoor
vaidikkapoor.info

On 18 January 2014 00:17, Luca Cavanna wrote:

> Yes, nested documents are separate Lucene documents that are never
> returned (for now) by the ordinary Elasticsearch search APIs, as they are
> filtered out by default. Only a single document gets returned (and
> counted), which contains both parent and nested docs. In the Lucene index,
> though, you have separate docs stored in the same block.
>
> The _stats API reads the number of documents from Lucene (including
> nested), while the count and search APIs execute a query and filter the
> nested docs out.
>
> Hope this clarifies things.
>
> On Friday, January 17, 2014 2:53:56 PM UTC+1, Vaidik Kapoor wrote:
>>
>> Hi guys,
>>
>> I am trying to understand the Index Stats API better. I have two
>> indices, both with the same data. However, the mappings differ. One of them
>> has some fields of type *nested*. The number of documents shown in the
>> Index Stats API response for the index that has nested fields is higher
>> than for the one that does not. However, when I do a
>> GET /INDEX_NAME/_search?search_type=count for both indices, I get the
>> same count in the response.
>>
>> Does this mean that the Index Stats API is counting nested documents
>> separately?
>>
>> I would appreciate some clarification on this.
>>
>> Thanks,
>> Vaidik Kapoor
>> vaidikkapoor.info
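As a small worked example of the explanation quoted above: if every top-level document is one Lucene document and every nested object adds one more, the two counts can be reproduced with simple arithmetic. The Java sketch below is only an illustration of that bookkeeping. The class NestedDocCount and its method names are hypothetical, and it assumes a single level of nesting.

```java
/** Hypothetical illustration of why _stats and search counts differ with nested fields. */
final class NestedDocCount {

    /**
     * nestedPerDocument[i] is the number of nested objects inside top-level document i.
     * Each top-level document counts once, plus once per nested object.
     */
    static long luceneDocs(int[] nestedPerDocument) {
        long total = 0;
        for (int nested : nestedPerDocument) {
            total += 1 + nested;
        }
        return total;
    }

    /** Search and count requests only see the top-level documents. */
    static long searchCount(int[] nestedPerDocument) {
        return nestedPerDocument.length;
    }

    public static void main(String[] args) {
        int[] nestedPerDocument = {2, 0, 3};
        System.out.println("docs as seen by _stats: " + luceneDocs(nestedPerDocument));
        System.out.println("docs as seen by search: " + searchCount(nestedPerDocument));
    }
}
```

With three top-level documents holding 2, 0 and 3 nested objects, the Lucene-level total is 8 while a count request reports 3, which matches the kind of difference described in the question.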
Re: Naming an index or a type with URIs
I think I would base64 encode the URL and lowercase it. I have not tried it, though.

My 2 cents.

--
David ;-)
Twitter : @dadoonet / @elasticsearchfr / @scrutmydocs

On 18 Jan 2014, at 11:07, Olivier Rossel wrote:

> Is it technically possible to name an index or a type with a URI string
> (containing dots, slashes, colons, ...)?
>
> [...]
Re: Naming an index or a type with URIs
Index names must be lowercase and must not contain characters that are problematic in file names, because they are used as directory names. Type names are not used in the file system, but they are restricted in the same way to stay consistent with index names.

I also use URIs in Elasticsearch. You can use them as field names and field values. In field values, they can be used for linking data. But I recommend building URIs dynamically in the middleware where addressing is implemented, so that you can move or reassign URIs more easily, just by changing a global config, without the need to reindex or reset aliases.

Jörg
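To make the encoding idea from this thread concrete, here is a minimal Java sketch. It swaps in hexadecimal encoding instead of the base64 suggestion, because Base64 is case-sensitive (lowercasing its output cannot be undone and can make two different URIs collide) and its standard alphabet contains '/' and '+'. Hex output uses only 0-9 and a-f, so it is already lowercase and reversible. The class UriIndexNames and its method names are hypothetical, and the sketch does not check any other naming limits Elasticsearch may impose, such as length.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical helper: maps a URI to a lowercase, file-system-safe name and back. */
final class UriIndexNames {

    /** Encodes the UTF-8 bytes of the URI as lowercase hexadecimal. */
    static String toIndexName(String uri) {
        byte[] bytes = uri.getBytes(StandardCharsets.UTF_8);
        StringBuilder name = new StringBuilder(bytes.length * 2);
        for (byte b : bytes) {
            name.append(Character.forDigit((b >> 4) & 0xf, 16));
            name.append(Character.forDigit(b & 0xf, 16));
        }
        return name.toString();
    }

    /** Decodes a name produced by toIndexName back into the original URI. */
    static String fromIndexName(String name) {
        byte[] bytes = new byte[name.length() / 2];
        for (int i = 0; i < bytes.length; i++) {
            bytes[i] = (byte) Integer.parseInt(name.substring(2 * i, 2 * i + 2), 16);
        }
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String uri = "http://example.org/Foo";
        String name = toIndexName(uri);
        System.out.println(name);
        System.out.println(fromIndexName(name));
    }
}
```

The trade-off is that names become twice as long as the URI and are not human-readable, which is one more reason to consider keeping URIs in field values, as suggested above.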
Naming an index or a type with URIs
Is it technically possible to name an index or a type with a URI string (containing dots, slashes, colons, ...)?

I have tried to do some curl stuff with a URL like http://localhost:9200///... but it didn't seem to work very well.

Are there technical issues with this kind of naming? Or is it possible to name indices and types with URIs? In that case, what would a sample Elasticsearch URL look like?

I appreciate all advice from the community. Thanks!