Re: Finding similar documents with Elasticsearch

2014-01-18 Thread David Pilato
Do you get the same results if you compare against the file.file field?

--
David ;-)
Twitter : @dadoonet / @elasticsearchfr / @scrutmydocs



Finding similar documents with Elasticsearch

2014-01-18 Thread Zoran Jeremic
Hi guys,

I'm trying to develop a service that stores uploaded files as attachments 
(file is one field in the document). This part works fine, as I can search these 
files using like_text as input. However, the second part of this service 
should compare the file that was just uploaded with the existing files in 
order to find duplicates or very similar files. The problem is that I 
always get the same results regardless of the input I'm using, and these 
results are wrong, as the identical file very often has the lowest score. 
It looks as if the like_text extracted from the uploaded file is always the same, 
and none of the documents has the expected score, which I believe should be 1 
for identical documents. The scores I get are always less than 0.2. 
Could you please check if there is something wrong with my code?

String mapping = copyToStringFromClasspath(
        "/org/prosolo/services/indexing/documents-mapping.json");
byte[] txt = org.elasticsearch.common.io.Streams.copyToByteArray(file);
Client client = ElasticSearchFactory.getClient();
client.admin().indices()
        .putMapping(putMappingRequest(indexName).type(indexType).source(mapping))
        .actionGet();

// Index the uploaded file together with its metadata.
IndexResponse iResponse = client.index(indexRequest(indexName).type(indexType)
        .source(jsonBuilder()
                .startObject()
                .field("file", txt)
                .field("title", title)
                .field("visibility", visibilityType.name().toLowerCase())
                .field("ownerId", ownerId)
                .field("description", description)
                .field("contentType", DocumentType.DOCUMENT.name().toLowerCase())
                .field("dateCreated", dateCreated)
                .field("url", link)
                .field("relatedToType", relatedToType)
                .field("relatedToId", relatedToId)
                .endObject()))
        .actionGet();
// Refresh so that the new document is visible to search.
client.admin().indices().refresh(refreshRequest()).actionGet();

// Ask for documents similar to the one just indexed, comparing on "file".
MoreLikeThisRequestBuilder mltRequestBuilder = new MoreLikeThisRequestBuilder(
        client, ESIndexNames.INDEX_DOCUMENTS, ESIndexTypes.DOCUMENT, iResponse.getId());
mltRequestBuilder.setField("file");
SearchResponse response = client.moreLikeThis(mltRequestBuilder.request()).actionGet();
SearchHits searchHits = response.getHits();
System.out.println("getTotalHits:" + searchHits.getTotalHits());
Iterator<SearchHit> hitsIter = searchHits.iterator();
while (hitsIter.hasNext()) {
    SearchHit searchHit = hitsIter.next();
    System.out.println("FOUND DOCUMENT:" + searchHit.getId()
            + " title:" + searchHit.getSource().get("title")
            + " score:" + searchHit.score());
}

And this is the mapping I was using:

{
    "document": {
        "properties": {
            "title": {
                "type": "string",
                "store": true
            },
            "description": {
                "type": "string",
                "store": "yes"
            },
            "contentType": {
                "type": "string",
                "store": "yes"
            },
            "dateCreated": {
                "store": "yes",
                "type": "date"
            },
            "url": {
                "store": "yes",
                "type": "string"
            },
            "visibility": {
                "store": "yes",
                "type": "string"
            },
            "ownerId": {
                "type": "long",
                "store": "yes"
            },
            "relatedToType": {
                "type": "string",
                "store": "yes"
            },
            "relatedToId": {
                "type": "long",
                "store": "yes"
            },
            "file": {
                "path": "full",
                "type": "attachment",
                "fields": {
                    "author": {
                        "type": "string"
                    },
                    "title": {
                        "store": true,
                        "type": "string"
                    },
                    "keywords": {
                        "type": "string"
                    },
                    "file": {
                        "store": true,
                        "term_vector": "with_positions_offsets",
                        "type": "string"
                    },
                    "name": {
                        "type": "string"
                    },
                    "content_length": {
                        "type": "integer"
                    },
                    "date": {
                        "format": "dateOptionalTime",
                        "type": "date"
                    },
                    "content_type": {
                        "type": "string"
                    }
                }
            }
        }
    }
}

Thanks,
Zoran

-- 
You received this message because you are subscribed to the Google Groups 
"elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to elasticsearch+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/elasticsearch/79e29c89-62ea-42f3-be93-3e215a75860a%40googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.


[ANN]Elasticsearch Extended Analyze plugin update for 1.0.0.RC1

2014-01-18 Thread Jun Ohtani
Hi,

I just released version 1.0.0.RC1 of the extended analyze plugin, for Elasticsearch 1.0.0.RC1.

https://github.com/johtani/elasticsearch-extended-analyze

For more info about the plugin, see the GitHub page.

Feedback, comments and issues are most welcome.

Best,
Jun 

Jun Ohtani
joht...@gmail.com
blog : http://blog.johtani.info
twitter : http://twitter.com/johtani





start, end and gap in Price using Aggregations

2014-01-18 Thread Hariharan Vadivelu
I have been experimenting with the new aggregations feature and I'm 
wondering if this use case is possible.
A sample gist with a sample query is available here:
https://gist.github.com/hariinfo/8487083


This generates the following sample price facet in the UI:
Price
0 - 5 (1)
5 - 10 (2)
10 - 20 (1)
20 - 30 (2)

As you can see, I'm defining the price ranges in the query. This is more or 
less static; in other words, when I query for products in a category I 
need to know the start, end and gap for price in that category in advance to 
generate meaningful price ranges.

Questions:
# How do I set up a facet which takes into account the actual values of the 
prices field (min and max from the search context) and sets up ranges dynamically, 
maybe based on some rules?
# This turns out to be a typical use case in the e-commerce domain, where the 
products in a category may have a dynamic price range and I don't know the start, end or 
gap.
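
One possible way to approach this, sketched under assumptions rather than taken from the thread: run a first request that returns the min and max price for the category (for example with a min/max or stats aggregation), derive the range boundaries on the client, and send them in a second request with a range aggregation. The class and method names below are hypothetical, and the "round the gap to 1, 2 or 5 times a power of ten" rule is only one illustrative choice of rule.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: derives price range boundaries from an observed min and max.
class PriceRanges {

    /** Rounds a raw gap up to 1, 2, 5 or 10 times a power of ten. */
    static double niceStep(double rawStep) {
        double magnitude = Math.pow(10, Math.floor(Math.log10(rawStep)));
        double fraction = rawStep / magnitude;
        double nice;
        if (fraction <= 1) {
            nice = 1;
        } else if (fraction <= 2) {
            nice = 2;
        } else if (fraction <= 5) {
            nice = 5;
        } else {
            nice = 10;
        }
        return nice * magnitude;
    }

    /** Returns {from, to} pairs (to is exclusive) that cover min..max. */
    static List<double[]> ranges(double min, double max, int targetBuckets) {
        if (targetBuckets <= 0) {
            throw new IllegalArgumentException("targetBuckets must be positive");
        }
        List<double[]> result = new ArrayList<>();
        if (max <= min) {
            // Degenerate case: a single price, so a single range.
            result.add(new double[] {min, max});
            return result;
        }
        double step = niceStep((max - min) / targetBuckets);
        double start = Math.floor(min / step) * step;
        // Index-based loop avoids accumulating floating-point error.
        for (int i = 0; start + i * step <= max; i++) {
            result.add(new double[] {start + i * step, start + (i + 1) * step});
        }
        return result;
    }

    public static void main(String[] args) {
        // min and max would come from the first (stats) request.
        for (double[] r : ranges(3, 27, 5)) {
            System.out.println(r[0] + " - " + r[1]);
        }
    }
}
```

The cost of this design is a second round trip per category; the boundaries could also be cached per category if prices change slowly.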




Re: Question about performance of has_child filter

2014-01-18 Thread Josh F
I will upgrade. I didn't really see anything relating to this in the 
release notes, though, except possibly this one: 
https://github.com/elasticsearch/elasticsearch/issues/4592

Thanks




how to save a term facet result?

2014-01-18 Thread Valentin
What I want to do is reuse the facet results of one query as the terms 
filter of another query, preferably without sending the data to the client over 
the internet.

Is this possible?



Re: Question about performance of has_child filter

2014-01-18 Thread David Pilato
As far as I remember, some memory improvements have been made since then.
I would suggest upgrading to 0.90.10.

--
David ;-)
Twitter : @dadoonet / @elasticsearchfr / @scrutmydocs





Question about performance of has_child filter

2014-01-18 Thread Josh F
Hey all,

I am running a 5-node ES cluster (version 0.90.4) and recently experienced 
an OutOfMemory issue where all nodes ran out of memory simultaneously.
I am wondering if this could be related to my use of the has_child filter.

I have two document types: 'Item' and 'ItemInList'. There is a parent-child 
relationship here: ItemInList is a child of Item. Any item can be in 
several lists.
There are around 300,000 Item documents and 21 million ItemInList documents.

I frequently execute a has_child filter (maybe around 200-400 times per minute) 
to retrieve a list (i.e. get all the items which have an ItemInList child 
which contains a given listId).

In Java this looks like: FilterBuilders.hasChildFilter("ItemInList",
FilterBuilders.termFilter("ListId", listId));

Are there any known performance issues with doing this? Is the has_child 
filter cached by default? Does anyone have any tips to prevent out-of-memory 
issues with this type of query?

Thanks for any advice,
Josh




Sorting details

2014-01-18 Thread Ahmed Sabaa
Hello,

I am trying to sort documents using the location attribute of a 
sub-document, with the following structure:

{
    attribute1: value1,
    attribute2: value2,
    attribute3: [{
        attribute1: value1,
        location: {
            lat: latitude,
            lon: longitude
        }
    }, {
        attribute1: value1,
        location: {
            lat: latitude,
            lon: longitude
        }
    }]
}, {
    attribute1: value1,
    attribute2: value2,
    attribute3: [{
        attribute1: value1,
        location: {
            lat: latitude,
            lon: longitude
        }
    }, {
        attribute1: value1,
        location: {
            lat: latitude,
            lon: longitude
        }
    }]
}

So basically I have the location attribute indexed as a geo_point in the 
mappings and I'm sorting the parent documents based on these values. 
Strangely enough, it works just fine without any alterations to the 
search/sort code, but the problem is that I want to know which sub-document 
(inside the array) caused its parent document to rank first (or 
second or third), so that I can keep only that sub-document in the array.

I don't know if that's doable via Elasticsearch directly (probably not), but 
is there any way to get some details about the sorting itself at run time so that 
I can do it at the application level?

Regards,
Ahmed.
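
A rough application-level sketch, under assumptions the message does not confirm: that the sort is an ascending geo distance sort from a known reference point, and that the closest array entry is the one that determines the parent's position. In that case the distance can be recomputed for each sub-document of a returned hit and the closest one kept. The class and method names are hypothetical, and the haversine formula used here may differ slightly from the distance computation Elasticsearch uses internally.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: finds which {lat, lon} entry is closest to a reference point.
class NearestSubDocument {

    static final double EARTH_RADIUS_KM = 6371.0;

    /** Great-circle distance in kilometres between two points (haversine formula). */
    static double haversineKm(double lat1, double lon1, double lat2, double lon2) {
        double dLat = Math.toRadians(lat2 - lat1);
        double dLon = Math.toRadians(lon2 - lon1);
        double a = Math.sin(dLat / 2) * Math.sin(dLat / 2)
                + Math.cos(Math.toRadians(lat1)) * Math.cos(Math.toRadians(lat2))
                * Math.sin(dLon / 2) * Math.sin(dLon / 2);
        return 2 * EARTH_RADIUS_KM * Math.asin(Math.sqrt(a));
    }

    /** Returns the index of the closest location, or -1 if the list is empty. */
    static int nearestIndex(double originLat, double originLon, List<double[]> locations) {
        int best = -1;
        double bestDistance = Double.POSITIVE_INFINITY;
        for (int i = 0; i < locations.size(); i++) {
            double[] loc = locations.get(i);
            double distance = haversineKm(originLat, originLon, loc[0], loc[1]);
            if (distance < bestDistance) {
                bestDistance = distance;
                best = i;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Locations taken from the sub-documents of one search hit (example values).
        List<double[]> locations = Arrays.asList(
                new double[] {48.8566, 2.3522},    // Paris
                new double[] {51.5074, -0.1278});  // London
        // Reference point used for the sort: Brussels.
        System.out.println(nearestIndex(50.8503, 4.3517, locations));
    }
}
```

The returned index can then be used to trim the array in the hit's source down to the single matching sub-document.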



Re: Understanding Index Stats API Better

2014-01-18 Thread Vaidik Kapoor
Thanks for the clarification Luca! :)

Vaidik Kapoor
vaidikkapoor.info


On 18 January 2014 00:17, Luca Cavanna wrote:

> Yes, nested documents are separate Lucene documents that are never
> returned (for now) by the ordinary Elasticsearch search APIs, as they are
> filtered out by default. Only a single document gets returned (and
> counted), which contains both parent and nested docs. In the Lucene index,
> though, they are separate docs stored in the same block.
>
> The _stats API reads the number of documents from Lucene (including
> nested), while the count and search APIs execute a query and filter the
> nested docs out.
>
> Hope this clarifies things.
>
>
> On Friday, January 17, 2014 2:53:56 PM UTC+1, Vaidik Kapoor wrote:
>>
>> Hi Guys,
>>
>> I am trying to understand Index Stats API better here. I have two
>> indices, both with the same data. However, the mappings differ. One of them
>> has some fields that have type as *nested*. Now the number of documents
>> shown in the Index Stats API response for the one that has nested type
>> fields is more than the one that does not have nested type fields. Although
>> when I do a GET /INDEX_NAME/_search?search_type=count for both the
>> indices, I get the same count in the response.
>>
>> Does this mean that Index Stats API is counting nested documents
>> separately?
>>
>> Would appreciate some clarification on this.
>>
>> Thanks,
>> Vaidik Kapoor
>> vaidikkapoor.info
>>


Re: Naming an index or a type with URIs

2014-01-18 Thread David Pilato
I think I would base64 encode the URL and lowercase it.

I did not try it though.

My 2 cents.

--
David ;-)
Twitter : @dadoonet / @elasticsearchfr / @scrutmydocs



Re: Naming an index or a type with URIs

2014-01-18 Thread joergpra...@gmail.com
Index names must be lowercase and must not contain characters that are
problematic in file names, because they are used as directory names.

Type names are not used in the file system, but for consistency with index names
they are also restricted.

I also use URIs in Elasticsearch. You can use them as field names and field
values. In field values, they can be used for linking data.

But I recommend building URIs dynamically in the middleware where
addressing is implemented, so you can move or reassign URIs more easily,
just by changing a global config, without the need to reindex or reset an alias.

Jörg
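
If a URI does have to be turned into an index name, one point to keep in mind is that Base64 output is case-sensitive and its standard alphabet includes '/' and '+', so lowercasing it is not reversible and two different URIs could end up with the same name. A minimal sketch of an alternative, assuming a reversible and lowercase-only name is wanted: hex-encode the UTF-8 bytes of the URI. The class and method names are hypothetical, and the sketch does not check any length limit on index names.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper: maps a URI to a lowercase, file-system-safe name and back.
class UriIndexName {

    /** Hex-encodes the UTF-8 bytes of the URI using only the characters 0-9 and a-f. */
    static String encode(String uri) {
        StringBuilder sb = new StringBuilder();
        for (byte b : uri.getBytes(StandardCharsets.UTF_8)) {
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    /** Reverses encode(): turns the hex name back into the original URI. */
    static String decode(String name) {
        byte[] bytes = new byte[name.length() / 2];
        for (int i = 0; i < bytes.length; i++) {
            bytes[i] = (byte) Integer.parseInt(name.substring(2 * i, 2 * i + 2), 16);
        }
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String name = encode("http://example.org/Data");
        System.out.println(name);
        System.out.println(decode(name));
    }
}
```

The trade-off is that the hex name is twice as long as the URI in bytes and is not human-readable, which is one more argument for keeping the URI in a field value instead.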



Naming an index or a type with URIs

2014-01-18 Thread Olivier Rossel
Is it technically possible to name an index or a type with a URI string 
(containing dots, slashes, colons, ...)?
I have tried some curl requests with a URL like 
http://localhost:9200///... but it 
didn't seem to work very well.
Are there technical issues with this kind of naming? Or is it possible to name 
indices and types with URIs? In that case, what would a sample Elasticsearch 
URL look like?

I appreciate all advice from the community. Thx!
