Index design for user's activities

2015-04-11 Thread Chen Wang
I am maintaining a year of user activity, including browse and purchase 
data. Each browse/purchase entry is a JSON object: {item_id: id1, 
item_name: name1, category: c1, brand: b1, event_time: t1}.

I would like to compose different queries, such as getting all customers 
who browsed item A and/or purchased item B within a time range t1 to t2. 
There are tens of millions of customers.

My current design uses a nested object for each customer:

customer1: {
    customer_id: id1,
    name: name1,
    country: US,
    browse: [{browse_entry1_json}, {browse_entry2_json}, ...],
    purchase: [{purchase_entry1_json}, {purchase_entry2_json}, ...]
}

With this design, I can easily compose all kinds of queries using nested 
queries. The only problem is that it is hard to expire older browse/purchase 
data: I only want to keep, for example, one year of browse/purchase data. With 
this design, I would have to, at some point, read the entire index out, delete 
the expired browse/purchase entries, and write everything back.

Another design is to use a parent/child structure: type user is the parent of 
types browse and purchase, and type browse contains one document per browse 
entry. Although deleting old data seems easier with delete-by-query, the query 
above would require multiple and/or has_child queries and would be much less 
performant. In fact, I initially used a parent/child structure, but the query 
times were really long, so I gave it up and switched to nested objects.

I am also thinking about keeping nested objects but breaking the data into 
separate indices (e.g., monthly indices) so that I can expire old data easily. 
The problem with this approach is that I would have to query across those 
multiple indices and aggregate the results to get the distinct users, which I 
assume will be much slower (I haven't tried it yet). One requirement of this 
project is to return the count for such queries within an acceptable time 
frame (seconds), and I am afraid this approach may not meet it.
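
For what it's worth, the cross-index query I have in mind would look roughly 
like this (a sketch using the Python client; the index pattern, field names, 
and item id are placeholders, and it relies on the cardinality aggregation 
available since ES 1.1):

from elasticsearch import Elasticsearch

es = Elasticsearch(["localhost:9200"])

body = {
    "size": 0,
    "query": {
        "filtered": {
            "filter": {
                "nested": {
                    "path": "browse",
                    "filter": {"term": {"browse.item_id": "itemA"}}
                }
            }
        }
    },
    # approximate count of distinct customers across all matched indices
    "aggs": {"distinct_users": {"cardinality": {"field": "customer_id"}}}
}

# the wildcard fans out over the monthly indices, e.g. activity-2015-01, ...
resp = es.search(index="activity-*", body=body)
print(resp["aggregations"]["distinct_users"]["value"])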

The ES cluster is 7 machines, each with 8 cores and 32G of memory.
Any suggestions? 

Thanks in advance!
Chen



search result not expected

2015-04-10 Thread Chen Wang
Hello folks,
I have a nested query:

   "query": 
{"filtered":{"query":{"match_all":{}},"filter":{"nested":{"path":"browse","filter":{"bool":{"must":[{"term":{"browse.subcat.raw":["desktop
 
tower"]}}]}}

As you can see, browse is a nested object, and subcat.raw uses the keyword 
tokenizer. The above query returns a customer with id customerid1:

   "hits": {
  "total": 1,
  "max_score": 1,
  "hits": [
 {
"_index": "user_activity",
"_type": "2015-02-15",
"_id": "customerid1",
"_score": 1,
"_source": {
   "browse": [
  {
 "item_id": "item1",
 "subcat": "DESKTOP TOWER"
 "event_time": "2015-02-15"
  }
   ]
}
 }
  ]

Similarly, if I run another query:

"query": {
    "filtered": {
        "query": {"match_all": {}},
        "filter": {
            "nested": {
                "path": "browse",
                "filter": {
                    "bool": {
                        "must": [
                            {"term": {"browse.subcat.raw": ["knit tops"]}}
                        ]
                    }
                }
            }
        }
    }
}
It returns another customer, customerId2:
"hits": {
  "total": 1,
  "max_score": 1,
  "hits": [
 {
"_index": "user_activity_v2",
"_type": "2015-02-15",
"_id": "customerId2",
"_score": 1,
"_source": {
   "browse": [
  {
 "item_id": "item3",
 "subcat": "KNIT TOPS",
 "event_time": "2015-02-15"
  },
  {
 "item_id": "item4",
 "subcat": "ACTIVEWEAR",
 "event_time": "2015-02-15"
  },
  {
 "item_id": "item5",
 "subcat": "ACTIVEWEAR",
 "event_time": "2015-02-15"
  }
   ]
}
 }
  ]
   }

But if I combine these two queries:

"query": {
    "filtered": {
        "query": {"match_all": {}},
        "filter": {
            "nested": {
                "path": "browse",
                "filter": {
                    "bool": {
                        "must": [
                            {"term": {"browse.subcat.raw": ["knit tops", "desktop tower"]}}
                        ]
                    }
                }
            }
        }
    }
}

It only returns customerId1, which corresponds to "desktop tower". The 
combined query seems to always return only the customer matching the last 
search term in the browse.subcat.raw array.


Is this expected, or am I doing something wrong? I was hoping the combined 
query would return both customerId1 and customerId2.
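
Or, if I want to match any of several values, should I be using a terms 
(plural) filter instead of term? A sketch of what I mean (untested):

body = {
    "query": {
        "filtered": {
            "query": {"match_all": {}},
            "filter": {
                "nested": {
                    "path": "browse",
                    "filter": {
                        # terms (plural): matches if subcat.raw equals ANY listed value
                        "terms": {"browse.subcat.raw": ["knit tops", "desktop tower"]}
                    }
                }
            }
        }
    }
}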
Chen




Re: Is there a way to do scan with limit?

2015-04-03 Thread Chen Wang
I was thinking that being able to apply a limit on the ES side could reduce its query load.



Is there a way to do scan with limit?

2015-04-02 Thread Chen Wang
I want to, for example, quickly fetch 1M out of 5M records.
I am currently using:

SearchResponse scrollResp = this.client
        .prepareSearch(esQuery.indices)
        .addFields(esQuery.fields)
        .setSearchType(SearchType.SCAN)
        .setScroll(TimeValue.timeValueSeconds(this.scrollTimeInSeconds))
        .setQuery(esQuery.query)
        .setSize(this.queryBatchSizePershard)
        .execute().actionGet();


but setSize defines how many records are returned per shard in one scroll. Is 
there a way to define an overall limit, or do I have to enforce the limit in 
my code?
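
Right now I enforce it in my own code; the loop is roughly this (shown with 
the Python client for brevity, but the shape is the same with the Java scroll 
API; the index name and limit are made up):

from elasticsearch import Elasticsearch, helpers

es = Elasticsearch(["localhost:9200"])
LIMIT = 1000000

count = 0
for hit in helpers.scan(es, index="myindex", query={"query": {"match_all": {}}}):
    # ... process hit["_source"] ...
    count += 1
    if count >= LIMIT:
        # stop early; note the scroll context may stay open until it times out
        break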

Chen



Re: Very slow queries for parent-child index

2015-03-20 Thread Chen Wang
Vlad,
I tried a similar thing a while back. Of course, query performance depends 
on your ES configuration as well.
But I finally ended up giving up on parent/child and flattening the 
parent/child documents into nested objects, as parent/child never gave me the 
performance I needed.
Chen

On Wednesday, March 18, 2015 at 11:53:28 PM UTC-7, Vladi Feigin wrote:
>
> Hello,
>
> I would like to ask for advice about our index that is built on the 
> principle of Parent Child,
> Now a search word takes a very long time about 3 minutes.
> Below is the index structure.
>
> Our database (shown below the index schema) contains information about 
> purchases and reviews in different stores.
> Parent contains meta data of purchase and the children contain comments 
> regarding a purchase and names of goods (products).
> Important to note: the search is always performed for only one shop (for a 
> particular store_id), although we have only one single index, no partitions, 
> so all stores are in this index.
>
> At the moment the query to find the parent whose children have the word 
> takes up to a few minutes .
>
> Information on cluster and index:
> Number of documents: 2.7 billion; size: 360 GB
> 9 shards, one replica (18 shards in total)
> 6 physical nodes
>
> Your help is highly appreciated
>
> Vlad Feigin
>
> The index structure is :
>
>
> {
>   "Purchase": {
> "mappings": {
>   "_default_": {
> "dynamic": "false",
> "_all": {
>   "enabled": false
> },
> "_ttl": {
>   "enabled": true,
>   "default": 3456000
> },
> "_source": {
>   "enabled": false
> },
> "properties": {
>   
> }
>   },
>
>   "Parent": {
> "dynamic": "false",
> "_all": {
>   "enabled": false
> },
> "_ttl": {
>   "enabled": true,
>   "default": 3456000
> },
> "properties": {
>   "store_id": {
> "type": "string",
> "index": "not_analyzed",
> "store": true
>   },
>   "endTime": {
> "type": "long",
> "store": true
>   },
>   "startTime": {
> "type": "long",
> "store": true
>   },
>   "purchaseId": {
> "type": "string",
> "index": "not_analyzed",
> "store": true
>   }
> }
>   },
>   "comments": {
> "dynamic": "false",
> "_all": {
>   "enabled": false
> },
> "_parent": {
>   "type": "Parent"
> },
> "_routing": {
>   "required": true
> },
> "_ttl": {
>   "enabled": true,
>   "default": 3456000
> },
> "_source": {
>   "enabled": false
> },
> "properties": {
>   "text": {
> "type": "string"
>   }
> }
>   },
> "products": {
> "dynamic": "false",
> "_all": {
>   "enabled": false
> },
> "_parent": {
>   "type": "Parent"
> },
> "_routing": {
>   "required": true
> },
> "_ttl": {
>   "enabled": true,
>   "default": 3456000
> },
> "_source": {
>   "enabled": false
> },
> "properties": {
>   "name": {
> "type": "string"
>   }
> }
>   }
> }
>   }
> }
>
>



Expire nested object

2015-03-18 Thread Chen Wang
It seems that TTL cannot be set at the nested-object level; will ES add this 
feature in the future?
As of now, say I want to delete blog posts (nested objects of a user) that 
are more than 1 year old: do I need to read all the documents out, filter out 
the older posts, and then write them back?
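
Or could a scripted update prune them in place? A minimal sketch of what I 
have in mind (untested; it assumes dynamic Groovy scripting is enabled, and 
the index, type, id, and field names are made up):

from elasticsearch import Elasticsearch

es = Elasticsearch(["localhost:9200"])

# keep only nested entries newer than the cutoff
# (ISO date strings compare correctly as plain strings)
es.update(
    index="user_activity", doc_type="user", id="customer1",
    body={
        "script": "ctx._source.posts = ctx._source.posts.findAll { it.event_time >= cutoff }",
        "params": {"cutoff": "2014-03-18"}
    }
)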
Thanks,
Chen



Re: search on nested multi fields

2015-03-17 Thread Chen Wang
It's my bad; I defined the index the wrong way: the properties were mapped 
under the type user_activity_v2, but my documents were indexed with the type 
combined, so the mapping never applied to them. Once I moved the properties 
under _default_, it started working.
Chen

On Tuesday, March 17, 2015 at 5:36:57 PM UTC-7, Chen Wang wrote:
>
> the index definition is this:
>   "settings": {
> "index": {
> "number_of_shards": 7,
> "number_of_replicas": 1,
> "analysis": {
> "analyzer": {
> "analyzer_raw": {
> "tokenizer": "keyword",
> "filter": "lowercase"
> }
> }
> }
> }
> },
> "mappings": {
> "_default_": {
> "_ttl": {
> "enabled": 'true',
> "default": ttl
> }
> },
> "user_activity_v2": {
> "_id": {
> "path": "customer_id"
> },
> "properties": {
> "customer_id": {"type": "long"},
> "store_purchase": {
> "type": "nested",
> "include_in_parent": "true",
> "properties": {
> "item_id":{"type": "string"},
> "cat": {
> "type": "multi_field",
> "fields": {
> "cat": {
> "type": "string",
> },
> "original": {
> "type": "string",
> "search_analyzer": 
> "analyzer_raw",
> "index_analyzer": 
> "analyzer_raw"
> }
> }
> }
> }
>
> On Tuesday, March 17, 2015 at 5:24:04 PM UTC-7, Chen Wang wrote:
>>
>> Folks,
>> I have defined a nested object with multi_fields attribute: the "cat" in 
>> store_purchase
>>
>>
>> I loaded some data into Es:
>>  {
>>     "_index": "user_activity_v2",
>>     "_type": "combined",
>>     "_id": "1229369",
>>     "_score": 1,
>>     "_source": {
>>        "store_purchase": [
>>           {
>>              "item_id": "10423846",
>>              "subcat": "First Aid",
>>              "brand_name": "brand name",
>>              "event_time": "2015-03-09",
>>              "cat": "otc"
>>           },
>>           {
>>              "item_id": "34897214",
>>              "subcat": "coffee",
>>              "brand_name": "brand name2",
>>              "event_time": "2015-03-09",
>>              "cat": "cat2 with space"
>>           }
>>        ]
>>     }
>>  }
>>
>> However, I cannot find any data from the following search
>>
>> GET _search
>> {
>>   "query": {
>>     "bool": {
>>       "must": [
>>         {
>>           "nested": {
>>             "path": "store_purchase",
>>             "query": {
>>               "bool": {
>>                 "must": [
>>                   { "match": { "store_purchase.cat": "otc" } }
>>                 ]
>>               }
>>             }
>>           }
>>         }
>>       ]
>>     }
>>   }
>> }
>>
>> I also tried { "match": { "store_purchase.cat.original": "otc" } }; it 
>> all returns nothing.
>>
>> What am I missing here?
>> Thanks,
>> Chen
>>
>>
>>



Re: search on nested multi fields

2015-03-17 Thread Chen Wang
the index definition is this:
  "settings": {
"index": {
"number_of_shards": 7,
"number_of_replicas": 1,
"analysis": {
"analyzer": {
"analyzer_raw": {
"tokenizer": "keyword",
"filter": "lowercase"
}
}
}
}
},
"mappings": {
"_default_": {
"_ttl": {
"enabled": 'true',
"default": ttl
}
},
"user_activity_v2": {
"_id": {
"path": "customer_id"
},
"properties": {
"customer_id": {"type": "long"},
"store_purchase": {
"type": "nested",
"include_in_parent": "true",
"properties": {
"item_id":{"type": "string"},
"cat": {
"type": "multi_field",
"fields": {
"cat": {
"type": "string",
},
    "original": {
"type": "string",
"search_analyzer": 
"analyzer_raw",
"index_analyzer": "analyzer_raw"
}
}
}
}

On Tuesday, March 17, 2015 at 5:24:04 PM UTC-7, Chen Wang wrote:
>
> Folks,
> I have defined a nested object with multi_fields attribute: the "cat" in 
> store_purchase
>
>
> I loaded some data into Es:
>  {
>     "_index": "user_activity_v2",
>     "_type": "combined",
>     "_id": "1229369",
>     "_score": 1,
>     "_source": {
>        "store_purchase": [
>           {
>              "item_id": "10423846",
>              "subcat": "First Aid",
>              "brand_name": "brand name",
>              "event_time": "2015-03-09",
>              "cat": "otc"
>           },
>           {
>              "item_id": "34897214",
>              "subcat": "coffee",
>              "brand_name": "brand name2",
>              "event_time": "2015-03-09",
>              "cat": "cat2 with space"
>           }
>        ]
>     }
>  }
>
> However, I cannot find any data from the following search
>
> GET _search
> {
>   "query": {
>     "bool": {
>       "must": [
>         {
>           "nested": {
>             "path": "store_purchase",
>             "query": {
>               "bool": {
>                 "must": [
>                   { "match": { "store_purchase.cat": "otc" } }
>                 ]
>               }
>             }
>           }
>         }
>       ]
>     }
>   }
> }
>
> I also tried { "match": { "store_purchase.cat.original": "otc" } }; it 
> all returns nothing.
>
> What am I missing here?
> Thanks,
> Chen
>
>
>



search on nested multi fields

2015-03-17 Thread Chen Wang
Folks,
I have defined a nested object with a multi_field attribute: the "cat" field 
in store_purchase.


I loaded some data into ES:

 {
    "_index": "user_activity_v2",
    "_type": "combined",
    "_id": "1229369",
    "_score": 1,
    "_source": {
       "store_purchase": [
          {
             "item_id": "10423846",
             "subcat": "First Aid",
             "brand_name": "brand name",
             "event_time": "2015-03-09",
             "cat": "otc"
          },
          {
             "item_id": "34897214",
             "subcat": "coffee",
             "brand_name": "brand name2",
             "event_time": "2015-03-09",
             "cat": "cat2 with space"
          }
       ]
    }
 }

However, I cannot find any data from the following search

GET _search
{
  "query": {
    "bool": {
      "must": [
        {
          "nested": {
            "path": "store_purchase",
            "query": {
              "bool": {
                "must": [
                  { "match": { "store_purchase.cat": "otc" } }
                ]
              }
            }
          }
        }
      ]
    }
  }
}

I also tried { "match": { "store_purchase.cat.original": "otc" } }; it all 
returns nothing.

What am I missing here?
Thanks,
Chen




Re: elasticsearch-hadoop-hive exception when writing array&lt;struct&gt; column

2015-03-12 Thread Chen Wang
Costin,
Thanks for your info.
I am mapping an array of maps to nested objects in ES; in this specific case, 
the expected document in ES will look like

{
    _id: customer_id,
    store_purchase: [{item_id: 123, category: 'pants', department: 'clothes'}, ...]
}

so that I can run queries like: find all users who, between T1 and T2, 
purchased items whose department is A and category is B.
Is there any way of achieving this with es-hadoop?

Chen

On Thursday, March 12, 2015 at 9:18:14 PM UTC-7, Costin Leau wrote:
>
> The exception occurs because you are trying to extract a field (the script 
> parameter) from a complex type (an array) rather than a primitive. The 
> reason this is currently not supported is that the internal structure of a 
> complex type can get quite involved, and its serialized JSON form can come 
> out incorrect.
> Any reason why you need to pass the array of maps as a script parameter 
> instead of using primitives (you can use Hive column mapping to extract the 
> ones you need)?
>
> On Thu, Mar 12, 2015 at 11:56 PM, Chen Wang  > wrote:
>
>> Folks,
>> I am using elasticsearch-hadoop-hive-2.1.0.Beta3.jar
>>
>> I defined the external table as:.
>> CREATE EXTERNAL TABLE IF NOT EXISTS ${staging_table}(
>> customer_id STRING,
>>  store_purchase array&lt;struct&lt;...&gt;&gt;)
>> ROW FORMAT SERDE 'org.elasticsearch.hadoop.hive.EsSerDe'
>> STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'
>> TBLPROPERTIES (
>> 'es.nodes'='localhost:9200',
>> 'es.resource'='user_activity/store',
>> 'es.mapping.id'='customer_id',
>> 'es.input.json'='false',
>> 'es.write.operation'='upsert',
>> 'es.update.script'='ctx._source.store_purchase += purchase',
>> 'es.update.script.params'='purchase:store_purchase'
>> ) ;"
>>
>> I create another source table with the same column names and put some 
>> sample data.
>>
>> Running INSERT OVERWRITE TABLE ${staging_table}
>>
>> SELECT customer_id, store_purchase FROM ${test_table}
>>
>> but it throws EsHadoopIllegalArgumentException: Field [_col1] needs to be 
>> a primitive; found [array&lt;struct&lt;...&gt;&gt;]. Is array&lt;struct&gt; supported yet? 
>> If not, how can I get around this issue?
>>
>> Thanks~
>>
>>
>
>



elasticsearch-hadoop-hive exception when writing array&lt;struct&gt; column

2015-03-12 Thread Chen Wang
Folks,
I am using elasticsearch-hadoop-hive-2.1.0.Beta3.jar

I defined the external table as:.
CREATE EXTERNAL TABLE IF NOT EXISTS ${staging_table}(
customer_id STRING,
 store_purchase array&lt;struct&lt;...&gt;&gt;)
ROW FORMAT SERDE 'org.elasticsearch.hadoop.hive.EsSerDe'
STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'
TBLPROPERTIES (
'es.nodes'='localhost:9200',
'es.resource'='user_activity/store',
'es.mapping.id'='customer_id',
'es.input.json'='false',
'es.write.operation'='upsert',
'es.update.script'='ctx._source.store_purchase += purchase',
'es.update.script.params'='purchase:store_purchase'
) ;"

I create another source table with the same column names and put some 
sample data.

Running INSERT OVERWRITE TABLE ${staging_table}

SELECT customer_id, store_purchase FROM ${test_table}

but it throws EsHadoopIllegalArgumentException: Field [_col1] needs to be a 
primitive; found [array&lt;struct&lt;...&gt;&gt;]. Is array&lt;struct&gt; supported yet? If 
not, how can I get around this issue?

Thanks~



Re: index design for web activity

2015-01-14 Thread Chen Wang
Adrien,
Is there a clearer version of the video recording? I can barely see the 
slides, and I don't quite get the idea of entity-centric indexing.
Does it mean, for my use case, maintaining a single user document that 
contains the lists of activities, and at index time simply updating those 
lists? Something like:

{
    _source: {
        customer_id: 123,
        browse: [{item1, time1}, {item2, time2}],
        purchase: [{item1, time1}, {item2, time2}]
    }
}

At index time, I just update the browse/purchase lists? Then my queries 
basically become flat.

Is my understanding correct?
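
If so, I suppose the indexing side becomes a scripted upsert per event, 
something like this sketch (untested; it assumes dynamic scripting is 
enabled, and all names and values here are made up):

from elasticsearch import Elasticsearch

es = Elasticsearch(["localhost:9200"])

event = {"item_id": "item1", "event_time": "2015-01-14"}
es.update(
    index="user_activity", doc_type="user", id=123,
    body={
        "script": "ctx._source.browse += event",   # append to the browse list
        "params": {"event": event},
        # the first event for a new customer creates the document
        "upsert": {"customer_id": 123, "browse": [event], "purchase": []}
    }
)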
Chen


On Sunday, December 21, 2014 at 1:54:48 PM UTC-8, Adrien Grand wrote:
>
>
>
> On Sat, Dec 20, 2014 at 12:53 AM, Chen Wang  > wrote:
>
>> Hey Guys, 
>> Wanna seek your suggestions on the index design for web activities.
>> Lets say I have browse data,  online purchase data, and store purchase 
>> data, and I will need to save a year of them.
>> For browse data, a year of data is around 80G , online purchase data is 
>> around 50G, and offline data is around 1T.
>>
>> I have to do query like, e.g, find all the customers who browsed item A 
>> in the past X months, and also online purchased B in the past Y month. 
>> Originally I am using complicated parent/child structure, and that 
>> sometimes results in very bad performance. and I store all browse 
>> data/online purchase/store purchase in one index distributed to 7 shards.
>>
>
> Parent/child is indeed slow. Can you somehow denormalize your data to make 
> queries faster?
>  
>
>> I have 7 machines with 128G each, and 1T hard disk.
>>
>> Now, I am trying to save each of those type of data into its own index, 
>> say browse_v1, onlinepurchase_v1, storepurchase_v1. Since its time based 
>> data, how should I decide to break them into monthly , or simply yearly? 
>> for browse(70G)/online purchase(50G), i think i can just use one index and 
>> one shard for them,. or should I break them into monthly data instead? 
>> breaking into monthly indexes gives me the flexibility of adding/removing 
>> data, but it also will decrease the query performance, right? (search 
>> against 1 index now becomes search against 12 indexes).
>>
>> For store data(1T) apparently I have to break them into at least monthly 
>> index, but each monthly index still contains around 100G data. With my 
>> current cluster, how many shards should I allocate to each monthly index? I 
>> am also concerned about the query performance. 
>>
>> Then since I am now storing them into separate indexes, to achieve the 
>> query I want, I will need to do application level join. Is this the common 
>> way to handle such user case?
>>
>
> As much as possible, you should try to design you documents in such a way 
> that you don't need to perform joins at search time. Would it be possible 
> for you to adopt a more "entity-centric" approach at indexing time? 
> http://www.elasticsearch.org/videos/entity-centric-indexing-london-meetup-sep-2014/
>  
>
>> I know I should perform some testing first, but hope someone may have 
>> similar experience in handling this and could provide some guidance.
>>
>
> The Elasticsearch book has a chapter about "designing for scale" that 
> gives good advice on modeling the data and choosing the right shard 
> size and number of shards: 
> http://www.elasticsearch.org/guide/en/elasticsearch/guide/current/scale.html
>
> -- 
> Adrien Grand
>  



Design a system that maintains historical view of user activity

2015-01-12 Thread Chen Wang
Hey Guys,
I am seeking advice on designing a system that maintains a historical view of 
a user's activities over the past year. Each user can have different 
activities: email_open, email_click, item_view, add_to_cart, purchase, etc. 
An example of the kind of query I would like to run:

Find all customers who browsed item A in the past 6 months and also clicked 
an email.

I would like such a query to complete in a reasonable time frame (for 
example, within 30 minutes to retrieve 10 million matching users).

Is ES a good candidate for this problem? I thought about creating an index 
per user, but that would mean too many indices (millions). I also tried 
indexing each activity (userid, activity_type, item_id, timestamp, etc.) as 
an individual document, but that requires join operations, which turned out 
not to be efficient (I am using parent-child).

Do any of you have experience designing a similar system? This seems like a 
rather common problem that needs solving (of course, we could do it in 
map-reduce).
Any suggestion is appreciated.

Thanks in advance.
Chen



index design for web activity

2014-12-19 Thread Chen Wang
Hey Guys, 
I want to seek your suggestions on index design for web activities.
Let's say I have browse data, online purchase data, and store purchase data, 
and I need to keep a year of each. A year of browse data is around 80G, 
online purchase data around 50G, and store (offline) data around 1T.

I have to run queries like: find all the customers who browsed item A in the 
past X months and also online-purchased item B in the past Y months. 
Originally I used a complicated parent/child structure, which sometimes 
performed very badly, and I stored all browse/online purchase/store purchase 
data in one index distributed across 7 shards.

I have 7 machines with 128G each and 1T of disk.

Now I am trying to save each type of data into its own index, say browse_v1, 
onlinepurchase_v1, storepurchase_v1. Since this is time-based data, how 
should I decide between breaking it into monthly indices or simply yearly 
ones? For browse (70G) / online purchase (50G), I think I can just use one 
index with one shard each, or should I break them into monthly data instead? 
Breaking into monthly indices gives me the flexibility of adding/removing 
data, but it will also decrease query performance, right? (A search against 
1 index becomes a search against 12 indices.)

For store data (1T), I apparently have to break it into at least monthly 
indices, but each monthly index still contains around 100G of data. With my 
current cluster, how many shards should I allocate to each monthly index? I 
am also concerned about query performance.

Then, since I would be storing the types in separate indices, to run the 
query I want I would need an application-level join. Is this the common way 
to handle such a use case?

I know I should perform some testing first, but I hope someone with similar 
experience can provide some guidance.

thanks in advance,
Chen




Re: ES bulk insert time out

2014-11-24 Thread Chen Wang
Rob,
Even with 7 shards, each shard has around 100G of data, so I don't think I 
can achieve "each shard should be around 20-30 GB in size".
I am using a file for testing, so it is actually indexing sequentially, 200 
entries at a time. When I run the cat thread-pool API:

curl 'localhost:9200/_cat/thread_pool?v&h=id,host,bulk.active,bulk.rejected,bulk.completed,bulk.queue,bulk.queueSize'
id   host      bulk.active bulk.rejected bulk.completed bulk.queue bulk.queueSize
-fmG es-trgt01 0           15901         13024036       0          50
Bp9R es-trgt04 0           41            10806286       0          50
lB0j es-trgt02 0           0             6412           0          50
tW2Z es-trgt05 0           4             11000638       0          50
_qPw es-trgt06 4           0             8286           25         50
csxB es-trgt03 0           0             8314           0          50
ah7F es-trgt00 0           22009         978972         0          50

It does show a large number of rejections, but none of the queues reaches its 
size limit (50). Why would indexing fail in that case?

Another thing worth mentioning is that the documents I am indexing are 
child documents. Does this affect the bulk behavior at all?

I am going to lower the heap size to see whether it helps.
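
I may also try bumping the bulk queue size as you suggested; if I read the 
1.x docs right, threadpool settings can be changed through the cluster 
settings API, something like (the value 200 is a guess):

from elasticsearch import Elasticsearch

es = Elasticsearch(["localhost:9200"])
# transient setting: reverts on a full cluster restart
es.cluster.put_settings(body={"transient": {"threadpool.bulk.queue_size": 200}})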

Thanks,
Chen
On Monday, November 17, 2014 10:55:03 PM UTC-8, Robert Gardam wrote:
>
> There are a few things going on here.
>
> When you say 200 entries, is this per second? It might be chunking them 
> into 200 docs while you're really hitting it with more than you think. See 
> http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/cat-thread-pool.html
>
> This doc shows what the different thread pools are doing. If you notice 
> that it's rejecting large numbers of documents, you might find your bulk 
> queue is too low. Increasing it might help, but be a little careful: if 
> you increase it to something huge, you can easily break ES.
>
> First, it's always a good idea not to go above 32 GB of heap. Beyond that 
> you disable compressed pointers and memory use can run away. The file 
> system cache will happily consume the rest of your memory.
>
> You could also run without replicas while doing the bulk import and add 
> them back after the import has completed. That way you're not writing out 
> replicas while trying to bulk import. Replicas are only useful for reads, 
> not indexing.
>
> Another important thing is working out the mappings for your index. Are 
> you enumerating all fields, or are there fields that don't require 
> full-text searching, etc.?
>
> What is the refresh rate on this index? It could be getting busy 
> refreshing the index, although I wouldn't expect a cluster of this size to 
> have trouble indexing 200 docs (if it's 200 per second).
>
> I've also found that running the same number of shards as nodes can have a 
> bad impact on the cluster, as all nodes are busy trying to index and then 
> can't perform other cluster functions. To give you an idea, each shard 
> should be around 20-30 GB in size.
> Try reducing your shard count to 3 or maybe even 2, and then increase 
> replicas.
>
> I hope this helps. 
>
> Cheers, 
> Rob
>
>
> On Monday, November 17, 2014 8:04:28 PM UTC+1, Chen Wang wrote:
>>
>> Hey, Guys,
>> I am loading a hive table of around 10million records into ES regularly. 
>> Each document is small with 5-6 attributes. My Es cluster has 7 nodes, each 
>> has 4 core and 128G. ES was allocated with 60% of the memory, and I am 
>> bulking insert (use python client) every 200 entries.  My cluster is in 
>> Green status, running version  1.2.1. The index "number_of_shards" : 7, 
>> "number_of_replicas" : 1
>> But I keep getting read time out exception:
>>
>> Traceback (most recent call last):
>>   File "reduce_dotcom_browse.test.py", line 95, in 
>> helpers.bulk(es, actions)
>>   File "/usr/lib/python2.6/site-packages/elasticsearch/helpers.py", line 
>> 148, in bulk
>> for ok, item in streaming_bulk(client, actions, **kwargs):
>>   File "/usr/lib/python2.6/site-packages/elasticsearch/helpers.py", line 
>> 107, in streaming_bulk
>> resp = client.bulk(bulk_actions, **kwargs)
>>   File "/usr/lib/python2.6/site-packages/elasticsearch/client/utils.py", 
>> line 70, in _wrapped
>> return func(*args, params=params, **kwargs)
>>   File 
>> "/usr/lib/python2.6/site-packages/elasticsearch/client/__init__

ES bulk insert time out

2014-11-17 Thread Chen Wang
Hey, Guys,
I am regularly loading a Hive table of around 10 million records into ES. 
Each document is small, with 5-6 attributes. My ES cluster has 7 nodes, each 
with 4 cores and 128G. ES is allocated 60% of the memory, and I bulk insert 
(using the Python client) every 200 entries. My cluster is in green status, 
running version 1.2.1. The index has "number_of_shards": 7, 
"number_of_replicas": 1.
But I keep getting read timeout exceptions:

Traceback (most recent call last):
  File "reduce_dotcom_browse.test.py", line 95, in 
helpers.bulk(es, actions)
  File "/usr/lib/python2.6/site-packages/elasticsearch/helpers.py", line 
148, in bulk
for ok, item in streaming_bulk(client, actions, **kwargs):
  File "/usr/lib/python2.6/site-packages/elasticsearch/helpers.py", line 
107, in streaming_bulk
resp = client.bulk(bulk_actions, **kwargs)
  File "/usr/lib/python2.6/site-packages/elasticsearch/client/utils.py", 
line 70, in _wrapped
return func(*args, params=params, **kwargs)
  File "/usr/lib/python2.6/site-packages/elasticsearch/client/__init__.py", 
line 568, in bulk
params=params, body=self._bulk_body(body))
  File "/usr/lib/python2.6/site-packages/elasticsearch/transport.py", line 
274, in perform_request
status, headers, data = connection.perform_request(method, url, params, 
body, ignore=ignore)
  File 
"/usr/lib/python2.6/site-packages/elasticsearch/connection/http_urllib3.py", 
line 51, in perform_request
raise ConnectionError('N/A', str(e), e)
elasticsearch.exceptions.ConnectionError: 
ConnectionError(HTTPConnectionPool(host=u'10.93.80.216', port=9200): Read 
timed out. (read timeout=10)) caused by: 
ReadTimeoutError(HTTPConnectionPool(host=u'10.93.80.216', port=9200): Read 
timed out. (read timeout=10))

How can I troubleshoot this? In my opinion, bulk inserting 200 entries should 
be fairly easy.
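
The only client-side knob I have found so far is the read timeout; a minimal 
sketch of bumping it (60s is an arbitrary value over the client's 10s 
default, and the actions here are dummies):

from elasticsearch import Elasticsearch, helpers

es = Elasticsearch(["10.93.80.216:9200"], timeout=60)  # default read timeout is 10s

actions = [{"_index": "myindex", "_type": "mytype", "_source": {"field": i}}
           for i in range(200)]
helpers.bulk(es, actions, chunk_size=200, request_timeout=60)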
Thanks for any pointers.
Chen




efficient way to store the result of a large slow query

2014-06-20 Thread Chen Wang
Hi guys,
Just wondering: what is the most efficient way to execute a query that takes
time (parent/child documents) and returns a large number of entries, and
store the result, randomly and evenly divided into blocks, on HDFS? E.g.,
the query returns 100 million records, and I want each random 1 million
stored in a different location (file/folder) on HDFS.

I assume I could execute the query with scroll and, whenever I have received
1 million records, spawn another thread to commit them to HDFS. Is there a
way to run the query in a distributed fashion, with 100 threads querying ES
at the same time and each getting a random 1 million back (without
duplicates)? Would ES-Hadoop help in this case?
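
The single-threaded version of what I have in mind is roughly this (a sketch;
hdfs_write stands in for whatever HDFS client is used, and the index name and
block size are made up):

from elasticsearch import Elasticsearch, helpers

es = Elasticsearch(["localhost:9200"])
BLOCK = 1000000

def hdfs_write(path, records):
    # placeholder for a real HDFS writer (e.g. via WebHDFS or a local mount)
    print(path, len(records))

buf, block_no = [], 0
for hit in helpers.scan(es, index="myindex", query={"query": {"match_all": {}}}):
    buf.append(hit["_source"])
    if len(buf) >= BLOCK:
        hdfs_write("/es_export/block_%05d" % block_no, buf)
        buf, block_no = [], block_no + 1
if buf:
    hdfs_write("/es_export/block_%05d" % block_no, buf)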

Appreciate your input!
Chen



shard allocation for large amount of data

2014-06-09 Thread Chen Wang
We have a huge amount of data (5 billion records, 3TB in size) organized as
parent/child types in one index to enable joins. My first question: how
should I allocate shards for this big index to make parent/child queries
more efficient? Right now, running the queries causes out-of-memory errors
on several nodes. I have 7 VMs, each with 64G of memory and 1T of disk, and
each ES node has 32G allocated to it. The index has 20 shards.

Any insights are helpful!
Thanks,
Chen



*same* query return different result

2014-04-15 Thread Chen Wang
I am using version 1.0.0.RC2

For the following two queries, one using match_all and the other a term
query:

"query": {
    "match_all": {}
},
"aggregations": {
    "by_campaign": {
        "terms": {
            "field": "campaign_name"
        }
    }
}

"query": {
    "term": {"campaign_name.original": "snbp_407400"}
},
"aggregations": {
    "by_campaign": {
        "terms": {
            "field": "campaign_name.original"
        }
    }
}

However, the results differ. With match_all, it returns 3 records:

{
    "key": "snbp_407600",
    "doc_count": 3
},

With the term query, it returns 4 records:

{
    "key": "snbp_407400",
    "doc_count": 4
}

There are 4 records in total (meaning the term-query result is the correct
one). Any idea why this is happening?
Thanks,
Chen



Re: 1.0.0.RC2 lots of [WARN ][discovery.zen.ping.multicast] [Pixx] failed to read requesting data from ***

2014-02-04 Thread Chen Wang
Alex,
Thanks for your reply.
There are other ES instances (0.90.10) running, but I have configured mine 
with a different cluster name; still, it throws "failed to read requesting 
data" warnings. Can these warnings be safely ignored?
Thanks,
Chen

On Monday, February 3, 2014 11:47:58 PM UTC-8, Alexander Reelsen wrote:
>
> Hey,
>
> Yes, elasticsearch now starts in the foreground by default; please see 
> http://www.elasticsearch.org/guide/en/elasticsearch/reference/master/breaking-changes.html 
> for a list of breaking changes compared to 0.90.
>
> The other problem might stem from still running an old elasticsearch 
> version somewhere (maybe a node client?) or from different JVM versions 
> (but since it worked before, I rather suspect the first).
>
> Can you check that? Also, you got an IP (10.93.x.y), does that run a valid 
> elasticsearch instance or just something that tries to connect to your 
> cluster?
>
>
> --Alex
>
>
> On Tue, Feb 4, 2014 at 8:18 AM, Chen Wang 
> > wrote:
>
>> Hi,
>> It seems that the default ./bin elasticsearch are changed to run in the 
>> front instead of the back end, is this true? As of beta2, still seems to 
>> run fine. But when I upgrade to 1.0.0.Rc1, or RC2
>> when running ./bin elasticsearch, it starts to run in the front, and 
>> gives me lots of warnings like:
>> [WARN ][discovery.zen.ping.multicast] [Pixx] failed to read requesting 
>> data from 
>> /10.93.69.138:54328<http://www.google.com/url?q=http%3A%2F%2F10.93.69.138%3A54328&sa=D&sntz=1&usg=AFQjCNFFy0XQT3DnHx0A83FH4cTkHAynXQ>
>> java.io.IOException: No transport address mapped to [21623]
>> at 
>> org.elasticsearch.common.transport.TransportAddressSerializers.addressFromStream(TransportAddressSerializers.java:71)
>> at 
>> org.elasticsearch.cluster.node.DiscoveryNode.readFrom(DiscoveryNode.java:267)
>> at 
>> org.elasticsearch.cluster.node.DiscoveryNode.readNode(DiscoveryNode.java:257)
>> at 
>> org.elasticsearch.discovery.zen.ping.multicast.MulticastZenPing$Receiver.run(MulticastZenPing.java:410)
>> at java.lang.Thread.run(Thread.java:662)
>>
>> Is this expected?
>> Thanks,
>> Chen
>>
>>
>
>



1.0.0.RC2 lots of [WARN ][discovery.zen.ping.multicast] [Pixx] failed to read requesting data from ***

2014-02-03 Thread Chen Wang
Hi,
It seems that the default ./bin/elasticsearch now runs in the foreground 
instead of the background; is this true? As of beta2 it still seemed to run 
fine, but after upgrading to 1.0.0.RC1 or RC2, running ./bin/elasticsearch 
starts it in the foreground and prints lots of warnings like:
[WARN ][discovery.zen.ping.multicast] [Pixx] failed to read requesting data 
from /10.93.69.138:54328
java.io.IOException: No transport address mapped to [21623]
at 
org.elasticsearch.common.transport.TransportAddressSerializers.addressFromStream(TransportAddressSerializers.java:71)
at 
org.elasticsearch.cluster.node.DiscoveryNode.readFrom(DiscoveryNode.java:267)
at 
org.elasticsearch.cluster.node.DiscoveryNode.readNode(DiscoveryNode.java:257)
at 
org.elasticsearch.discovery.zen.ping.multicast.MulticastZenPing$Receiver.run(MulticastZenPing.java:410)
at java.lang.Thread.run(Thread.java:662)

Is this expected?
Thanks,
Chen



Re: IndexMissingException: [myindex] missing

2014-01-30 Thread Chen Wang
Ah, false alarm. I somehow thought I didn't need to create the index
beforehand...


On Thu, Jan 30, 2014 at 5:25 PM, Chen Wang wrote:

> Hey Guys,
> i am using java TransportClient to talk to ES
> when I am running
>
>  ImmutableList&lt;DiscoveryNode&gt; nodes = client.connectedNodes();
>
> if (nodes.isEmpty()) {
>
> throw new Exception("No nodes available. Verify ES is
> running!");
>
> } else {
>
> logger.info("connected to nodes: " + nodes.toString());
>
> }
>
> It prints out the
>
>  connected to nodes: [[Korvac][JymMPUw1QsqczjCznV6uPg][inet[/10.93.69.50:9300]],
> [Avarrish][BAmEnBp3SfaeDUFOrF6CgA][inet[/10.93.69.51:9301]]]
>
> I assume it means that it has successfully connected to my cluster?
>
> Then when running a query:
>
> SearchResponse sr = client.prepareSearch("myindex")
>         .setTypes("mytype").setQuery(queryBuilder).execute()
>         .actionGet();
>
>
> It throws exception
>
> org.elasticsearch.indices.IndexMissingException: [myindex] missing
>
> at
> org.elasticsearch.cluster.metadata.MetaData.convertFromWildcards(MetaData.java:634)
>
> at
> org.elasticsearch.cluster.metadata.MetaData.concreteIndices(MetaData.java:533)
>
> at
> org.elasticsearch.action.search.type.TransportSearchTypeAction$BaseAsyncAction.(TransportSearchTypeAction.java:109)
>
> at
> org.elasticsearch.action.search.type.TransportSearchQueryThenFetchAction$AsyncAction.(TransportSearchQueryThenFetchAction.java:68)
>
> at
> org.elasticsearch.action.search.type.TransportSearchQueryThenFetchAction$AsyncAction.(TransportSearchQueryThenFetchAction.java:62)
>
> at
> org.elasticsearch.action.search.type.TransportSearchQueryThenFetchAction.doExecute(TransportSearchQueryThenFetchAction.java:59)
>
> at
> org.elasticsearch.action.search.type.TransportSearchQueryThenFetchAction.doExecute(TransportSearchQueryThenFetchAction.java:49)
>
> at
> org.elasticsearch.action.support.TransportAction.execute(TransportAction.java:63)
>
> at
> org.elasticsearch.action.search.TransportSearchAction.doExecute(TransportSearchAction.java:108)
>
> at
> org.elasticsearch.action.search.TransportSearchAction.doExecute(TransportSearchAction.java:43)
>
> at
> org.elasticsearch.action.support.TransportAction.execute(TransportAction.java:63)
>
> at
> org.elasticsearch.action.search.TransportSearchAction$TransportHandler.messageReceived(TransportSearchAction.java:135)
>
> at
> org.elasticsearch.action.search.TransportSearchAction$TransportHandler.messageReceived(TransportSearchAction.java:120)
>
> at
> org.elasticsearch.transport.netty.MessageChannelHandler.handleRequest(MessageChannelHandler.java:212)
>
> at
> org.elasticsearch.transport.netty.MessageChannelHandler.messageReceived(MessageChannelHandler.java:109)
>
> at
> org.elasticsearch.common.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:70)
>
> at
> org.elasticsearch.common.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
>
> at
> org.elasticsearch.common.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendUpstream(DefaultChannelPipeline.java:791)
>
> at
> org.elasticsearch.common.netty.channel.Channels.fireMessageReceived(Channels.java:296)
>
> at
> org.elasticsearch.common.netty.handler.codec.frame.FrameDecoder.unfoldAndFireMessageReceived(FrameDecoder.java:462)
>
> at
> org.elasticsearch.common.netty.handler.codec.frame.FrameDecoder.callDecode(FrameDecoder.java:443)
>
> at
> org.elasticsearch.common.netty.handler.codec.frame.FrameDecoder.messageReceived(FrameDecoder.java:303)
>
> at
> org.elasticsearch.common.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:70)
>
> at
> org.elasticsearch.common.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
>
> at
> org.elasticsearch.common.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendUpstream(DefaultChannelPipeline.java:791)
>
> at
> org.elasticsearch.common.netty.OpenChannelsHandler.handleUpstream(OpenChannelsHandler.java:74)
>
> at
> org.elasticsearch.common.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
>
> at
> org.elasticsearch.common.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:559)
>
> at
> org.

IndexMissingException: [myindex] missing

2014-01-30 Thread Chen Wang
Hey Guys,
i am using java TransportClient to talk to ES
when I am running

ImmutableList&lt;DiscoveryNode&gt; nodes = client.connectedNodes();

if (nodes.isEmpty()) {
    throw new Exception("No nodes available. Verify ES is running!");
} else {
    logger.info("connected to nodes: " + nodes.toString());
}

It prints out the

 connected to nodes: [[Korvac][JymMPUw1QsqczjCznV6uPg][inet[/10.93.69.50:9300]],
[Avarrish][BAmEnBp3SfaeDUFOrF6CgA][inet[/10.93.69.51:9301]]]

I assume it means that it has successfully connected to my cluster?

Then when running a query:

SearchResponse sr = client.prepareSearch("myindex")
        .setTypes("mytype")
        .setQuery(queryBuilder)
        .execute()
        .actionGet();


It throws exception

org.elasticsearch.indices.IndexMissingException: [myindex] missing

at
org.elasticsearch.cluster.metadata.MetaData.convertFromWildcards(MetaData.java:634)

at
org.elasticsearch.cluster.metadata.MetaData.concreteIndices(MetaData.java:533)

at
org.elasticsearch.action.search.type.TransportSearchTypeAction$BaseAsyncAction.(TransportSearchTypeAction.java:109)

at
org.elasticsearch.action.search.type.TransportSearchQueryThenFetchAction$AsyncAction.(TransportSearchQueryThenFetchAction.java:68)

at
org.elasticsearch.action.search.type.TransportSearchQueryThenFetchAction$AsyncAction.(TransportSearchQueryThenFetchAction.java:62)

at
org.elasticsearch.action.search.type.TransportSearchQueryThenFetchAction.doExecute(TransportSearchQueryThenFetchAction.java:59)

at
org.elasticsearch.action.search.type.TransportSearchQueryThenFetchAction.doExecute(TransportSearchQueryThenFetchAction.java:49)

at
org.elasticsearch.action.support.TransportAction.execute(TransportAction.java:63)

at
org.elasticsearch.action.search.TransportSearchAction.doExecute(TransportSearchAction.java:108)

at
org.elasticsearch.action.search.TransportSearchAction.doExecute(TransportSearchAction.java:43)

at
org.elasticsearch.action.support.TransportAction.execute(TransportAction.java:63)

at
org.elasticsearch.action.search.TransportSearchAction$TransportHandler.messageReceived(TransportSearchAction.java:135)

at
org.elasticsearch.action.search.TransportSearchAction$TransportHandler.messageReceived(TransportSearchAction.java:120)

at
org.elasticsearch.transport.netty.MessageChannelHandler.handleRequest(MessageChannelHandler.java:212)

at
org.elasticsearch.transport.netty.MessageChannelHandler.messageReceived(MessageChannelHandler.java:109)

at
org.elasticsearch.common.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:70)

at
org.elasticsearch.common.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)

at
org.elasticsearch.common.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendUpstream(DefaultChannelPipeline.java:791)

at
org.elasticsearch.common.netty.channel.Channels.fireMessageReceived(Channels.java:296)

at
org.elasticsearch.common.netty.handler.codec.frame.FrameDecoder.unfoldAndFireMessageReceived(FrameDecoder.java:462)

at
org.elasticsearch.common.netty.handler.codec.frame.FrameDecoder.callDecode(FrameDecoder.java:443)

at
org.elasticsearch.common.netty.handler.codec.frame.FrameDecoder.messageReceived(FrameDecoder.java:303)

at
org.elasticsearch.common.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:70)

at
org.elasticsearch.common.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)

at
org.elasticsearch.common.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendUpstream(DefaultChannelPipeline.java:791)

at
org.elasticsearch.common.netty.OpenChannelsHandler.handleUpstream(OpenChannelsHandler.java:74)

at
org.elasticsearch.common.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)

at
org.elasticsearch.common.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:559)

at
org.elasticsearch.common.netty.channel.Channels.fireMessageReceived(Channels.java:268)

at
org.elasticsearch.common.netty.channel.Channels.fireMessageReceived(Channels.java:255)

at
org.elasticsearch.common.netty.channel.socket.nio.NioWorker.read(NioWorker.java:88)

at
org.elasticsearch.common.netty.channel.socket.nio.AbstractNioWorker.process(AbstractNioWorker.java:108)

at
org.elasticsearch.common.netty.channel.socket.nio.AbstractNioSelector.run(AbstractNioSelector.java:318)

at
org.elasticsearch.common.netty.channel.socket.nio.AbstractNioWorker.run(AbstractNioWorker.java:89)

at
org.elasticsearch.common.netty.channel.socket.nio.NioWorker.run(NioWorker.java:178)

at
org.elasticsearch.common.netty.util.Th

Re: get count by certain field based on a sub query

2014-01-24 Thread Chen Wang
I actually have a feeling this might be possible with the new aggregations
in ES 1.0. My thoughts:
1. make the campaign_id field an integer value;
2. group all session data, doing the count, and also summing the campaign_id
field;
3. keep the groups where sum(campaign_id) != 0 (meaning the session came from
a campaign), then sum the counts over the filtered groups.

But how can I implement this as an ES query...
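
Failing that, the fallback I can see is two round-trips from the client,
e.g. (a sketch; it assumes sessionId is mapped not_analyzed, and the
terms-aggregation size must be set large enough to cover all campaign
sessions):

from elasticsearch import Elasticsearch

es = Elasticsearch(["localhost:9200"])

# 1) collect the sessionIds that saw the campaign
resp = es.search(index="myindex", body={
    "size": 0,
    "query": {"term": {"campaign_id": "my_campaign"}},
    "aggs": {"sids": {"terms": {"field": "sessionId", "size": 100000}}}
})
sids = [b["key"] for b in resp["aggregations"]["sids"]["buckets"]]

# 2) count the activities inside those sessions
resp = es.search(index="myindex", body={
    "size": 0,
    "query": {"filtered": {
        "query": {"term": {"activity": "viewed"}},
        "filter": {"terms": {"sessionId": sids}}
    }}
})
print(resp["hits"]["total"])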
Chen


On Thu, Jan 23, 2014 at 1:23 PM, Adrien Grand <
adrien.gr...@elasticsearch.com> wrote:

> I don't think this is possible. To me, the way to solve this kind of issue
> would be to reindex events as soon as you know their campaign_id.
>
>
> On Thu, Jan 23, 2014 at 2:04 AM, Chen Wang wrote:
>
>> Guys,
>> I just successfully imported my data to ES, e,g. It has  looks like this:
>>   "activity": 'viewed',
>>  "sessionId": "00143198107b3fe510b041138cd33fdd9252aab9808c",
>> "campaign_id":""
>> ,
>>
>>   "activity": 'campaign_viewed',
>>  "sessionId": "00143198107b3fe510b041138cd33fdd9252aab9808c",
>> "campaign_id":"my_campaign"
>>
>> As you can see, the two entries has the same session id, and since the
>> second entry has a campaign_id, i will assume the first activity(viewed) is
>> also generated from the campaign.
>> So how can i do count like:
>> (count the activities that are generated from campaign):
>>
>> count(activity)
>> where sessionId in (select sessionid from index where
>> campaign_id="m_campaign") ?
>>
>> Thanks much!
>> Chen
>>
>>
>
>
>
> --
> Adrien Grand
>
> --
>



get count by certain field based on a sub query

2014-01-22 Thread Chen Wang
Guys,
I just successfully imported my data into ES. Entries look like this:

{
    "activity": "viewed",
    "sessionId": "00143198107b3fe510b041138cd33fdd9252aab9808c",
    "campaign_id": ""
},
{
    "activity": "campaign_viewed",
    "sessionId": "00143198107b3fe510b041138cd33fdd9252aab9808c",
    "campaign_id": "my_campaign"
}

As you can see, the two entries have the same session id, and since the
second entry has a campaign_id, I assume the first activity (viewed) was
also generated from the campaign.
So how can I do a count like this (count the activities that were generated
from the campaign):

count(activity)
where sessionId in (select sessionId from index where
campaign_id="my_campaign") ?

Thanks much!
Chen
