Re: [Hadoop][Spark] Exclude metadata fields from _source

2015-02-19 Thread Itai Yaffe
Thanks for the response, Costin!
As you mentioned, option 1, i.e. es.mapping.exclude, is more appropriate 
when working with JSON.
Since it doesn't seem to work, I've followed your advice and raised a new 
issue (https://github.com/elasticsearch/elasticsearch-hadoop/issues/381) 
including a small test application to reproduce.
I'd be happy to hear what you think of it.

Thanks again,
   Itai

On Wednesday, February 18, 2015 at 7:42:36 PM UTC+2, Costin Leau wrote:
>
> Hi Itai,
>
> Sorry I missed your email. It's not clear from your post what your documents 
> look like - can you post a gist somewhere with the JSON input you are 
> sending to Elasticsearch?
> Typically the metadata appears in the _source if it is declared that 
> way. You should be able to work around this by using:
> 1. es.mapping.exclude
> 2. in the case of Spark, specifying the metadata through the 
> `saveToEsWithMeta` methods, which allow it to stay decoupled from the 
> document itself.
>
> Since you are using JSON, option 1 is likely your best shot. If it doesn't 
> work for you, can you please raise an issue with a quick/small sample so we 
> can reproduce it?
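For reference, a minimal sketch of option 2 above (this assumes the `saveToEsWithMeta` method from the elasticsearch-spark artifact and a reachable cluster; the index name, ids, and field names are illustrative, not from the original thread):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.elasticsearch.spark._                // adds saveToEsWithMeta to pair RDDs
import org.elasticsearch.spark.rdd.Metadata._   // ID, VERSION, TTL, ...

val conf = new SparkConf().setAppName("es-meta-sketch")
  .set("es.nodes", "localhost:9200")            // placeholder address
val sc = new SparkContext(conf)

// The document body carries no metadata; the id travels as the key of a
// (metadata, document) pair, so it never ends up inside _source.
val docs = sc.makeRDD(Seq(
  (Map(ID -> "user-1"), Map("name" -> "alice")),
  (Map(ID -> "user-2"), Map("name" -> "bob"))
))

docs.saveToEsWithMeta("test/user")
```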
>
> Thanks,
>
>
> On Wed, Feb 18, 2015 at 10:27 AM, Itai Yaffe wrote:
>
>> Hey,
>> Has anyone run into this issue?
>> Perhaps Costin can help here?
>>
>> Thanks!
>>
>> On Thursday, February 12, 2015 at 8:27:14 AM UTC+2, Itai Yaffe wrote:
>>
>>> Hey,
>>> I've recently started using Elasticsearch for Spark (Scala application).
>>> I've added elasticsearch-spark_2.10 version 2.1.0.BUILD-SNAPSHOT to my 
>>> Spark application pom file, and used 
>>> org.apache.spark.rdd.RDD[String].saveJsonToEs() 
>>> to send documents to Elasticsearch.
>>> When the documents are loaded into Elasticsearch, my metadata fields (e.g. 
>>> id, index, etc.) are loaded as part of the _source field.
>>> Is there a way to exclude them from the _source?
>>> I've tried using the new "es.mapping.exclude" configuration property 
>>> (added in this commit 
>>> <https://github.com/elasticsearch/elasticsearch-hadoop/commit/aae4f0460a23bac9567ea2ad335c74245a1ba069> 
>>> - that's why I needed to take the latest build rather than using version 
>>> 2.1.0.Beta3), but it doesn't seem to have any effect (although I'm not sure 
>>> it's even possible to exclude fields I'm using for mapping, e.g. 
>>> "es.mapping.id").
>>>
>>> A code snippet (I'm using a single-node Elasticsearch cluster for 
>>> testing purposes and running the Spark app from my desktop) :
>>> val conf = new SparkConf()...
>>> conf.set("es.index.auto.create", "false")
>>> conf.set("es.nodes.discovery", "false")
>>> conf.set("es.nodes", "XXX:9200")
>>> conf.set("es.update.script", "XXX")
>>> conf.set("es.update.script.params", "param1:events")
>>> conf.set("es.update.retry.on.conflict" , "2")
>>> conf.set("es.write.operation", "upsert")
>>> conf.set("es.input.json", "true")
>>> val documentsRdd =  ...
>>> documentsRdd.saveJsonToEs("test/user",
>>>   scala.collection.Map("es.mapping.id" -> "_id", "es.mapping.exclude" -> "_id"))
>>>
>>> The JSON looks like this :
>>> {
>>>   "_id": "",
>>>   "_type": "user",
>>>   "_index": "test",
>>>   "params": {
>>>     "events": [
>>>       {
>>>         ...
>>>       }
>>>     ]
>>>   }
>>> }
>>>
>>> Thanks!
>>>
>>  -- 
>> You received this message because you are subscribed to the Google Groups 
>> "elasticsearch" group.
>> To unsubscribe from this group and stop receiving emails from it, send an 
>> email to elasticsearc...@googlegroups.com .
>> To view this discussion on the web visit 
>> https://groups.google.com/d/msgid/elasticsearch/aea88dfb-8d4b-49d1-a236-8de6d513b4f6%40googlegroups.com.
>>
>>
>> For more options, visit https://groups.google.com/d/optout.
>>
>
>

-- 
You received this message because you are subscribed to the Google Groups 
"elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to elasticsearch+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/elasticsearch/9f210b41-4a31-4dd4-aa2d-cae7aabd3a1f%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


Elasticsearch aggregations for analytics

2015-02-18 Thread Itai Yaffe
Hey,
We have an Elasticsearch cluster in production with 20 nodes, which hold a 
few TBs of data and load millions of documents a day.
We use Elasticsearch for analytics purposes, and the main thing we're 
interested in is counting unique users.

We went to production with Elasticsearch 0.9.X, when there was no 
cardinality aggregation, and were therefore bound to the document 
structure seen below.
Most of our queries are looking for the unique users count based on a date 
range and specific segments.
Some of our analytics UI screens require executing hundreds of queries in 
parallel, and one even requires thousands.

When migrating to 1.4, we hoped to start using the aggregation feature, 
but even with doc_values enabled, we see aggregation times of 
*minutes*...
We're running on c3.8xlarge EC2 instances with 60GB RAM, of which 30GB is 
allocated to the ES heap.
We have 6 indexes with 2 replicas each, each index has 20 shards.
Each aggregation/query is performed against a single index (see aggregation 
example below).

Has anyone dealt with such use cases? 

Thanks!

*Document structure* :
{
  "user": {
    "_ttl": {
      "enabled": true
    },
    "properties": {
      "events": {
        "type": "nested",
        "properties": {
          "event_time": {
            "type": "date",
            "format": "dateOptionalTime",
            "doc_values": true
          },
          "segments": {
            "properties": {
              "segment": {
                "type": "string",
                "index": "not_analyzed",
                "doc_values": true
              }
            }
          }
        }
      }
    }
  }
}

For example :
{
  "_index": "...",
  "_type": "...",
  "_id": "...",
  "_version": 1,
  "_score": 1,
  "_source": {
"events": [
  {
"event_time": "2014-11-03",
"segments": [
  {
"segment": "ALICE"
  },
  {
"segment": "BOB"
  }
]
  },
  {
"event_time": "2014-11-04",
"segments": [
  {
"segment": "RON"
  },
  {
"segment": "YULA"
  }
]
  }
]
  }
}


*Aggregation example* :
{
  "size": 0,
  "query": {
    "nested": {
      "query": {
        "filtered": {
          "query": {
            "match_all": {}
          },
          "filter": {
            "bool": {
              "must": [
                {
                  "range": {
                    "events.event_time": {
                      "from": "2014-11-17",
                      "to": "2014-11-24",
                      "include_lower": true,
                      "include_upper": true
                    }
                  }
                }
              ]
            }
          }
        }
      },
      "path": "events"
    }
  },
  "aggregations": {
    "nested": {
      "nested": {
        "path": "events"
      },
      "aggregations": {
        "segments": {
          "terms": {
            "field": "events.segments.segment",
            "size": 0
          },
          "aggregations": {
            "uu": {
              "reverse_nested": {}
            }
          }
        }
      }
    }
  }
}
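For comparison, the cardinality aggregation mentioned above (available since Elasticsearch 1.1, so usable on 1.4) could answer the unique-count question directly. A sketch, assuming a hypothetical per-document user_id field that is not in the mapping shown here; the count is approximate (HyperLogLog++-based), and precision_threshold trades memory for accuracy:

```json
{
  "size": 0,
  "aggregations": {
    "unique_users": {
      "cardinality": {
        "field": "user_id",
        "precision_threshold": 40000
      }
    }
  }
}
```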

-- 
You received this message because you are subscribed to the Google Groups 
"elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to elasticsearch+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/elasticsearch/acbc3022-8845-4170-999d-d0b2bc9dfeb3%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


Re: [Hadoop][Spark] Exclude metadata fields from _source

2015-02-18 Thread Itai Yaffe
Hey,
Has anyone run into this issue?
Perhaps Costin can help here?

Thanks!

On Thursday, February 12, 2015 at 8:27:14 AM UTC+2, Itai Yaffe wrote:
>
> Hey,
> I've recently started using Elasticsearch for Spark (Scala application).
> I've added elasticsearch-spark_2.10 version 2.1.0.BUILD-SNAPSHOT to my 
> Spark application pom file, and used 
> org.apache.spark.rdd.RDD[String].saveJsonToEs() to send documents to 
> Elasticsearch.
> When the documents are loaded into Elasticsearch, my metadata fields (e.g. 
> id, index, etc.) are loaded as part of the _source field.
> Is there a way to exclude them from the _source?
> I've tried using the new "es.mapping.exclude" configuration property 
> (added in this commit 
> <https://github.com/elasticsearch/elasticsearch-hadoop/commit/aae4f0460a23bac9567ea2ad335c74245a1ba069> 
> - that's why I needed to take the latest build rather than using version 
> 2.1.0.Beta3), but it doesn't seem to have any effect (although I'm not sure 
> it's even possible to exclude fields I'm using for mapping, e.g. 
> "es.mapping.id").
>
> A code snippet (I'm using a single-node Elasticsearch cluster for testing 
> purposes and running the Spark app from my desktop) :
> val conf = new SparkConf()...
> conf.set("es.index.auto.create", "false")
> conf.set("es.nodes.discovery", "false")
> conf.set("es.nodes", "XXX:9200")
> conf.set("es.update.script", "XXX")
> conf.set("es.update.script.params", "param1:events")
> conf.set("es.update.retry.on.conflict" , "2")
> conf.set("es.write.operation", "upsert")
> conf.set("es.input.json", "true")
> val documentsRdd =  ...
> documentsRdd.saveJsonToEs("test/user",
>   scala.collection.Map("es.mapping.id" -> "_id", "es.mapping.exclude" -> "_id"))
>
> The JSON looks like this :
> {
>   "_id": "",
>   "_type": "user",
>   "_index": "test",
>   "params": {
>     "events": [
>       {
>         ...
>       }
>     ]
>   }
> }
>
> Thanks!
>

-- 
You received this message because you are subscribed to the Google Groups 
"elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to elasticsearch+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/elasticsearch/aea88dfb-8d4b-49d1-a236-8de6d513b4f6%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


[Hadoop][Spark] Exclude metadata fields from _source

2015-02-11 Thread Itai Yaffe
Hey,
I've recently started using Elasticsearch for Spark (Scala application).
I've added elasticsearch-spark_2.10 version 2.1.0.BUILD-SNAPSHOT to my 
Spark application pom file, and used 
org.apache.spark.rdd.RDD[String].saveJsonToEs() to send documents to 
Elasticsearch.
When the documents are loaded into Elasticsearch, my metadata fields (e.g. id, 
index, etc.) are loaded as part of the _source field.
Is there a way to exclude them from the _source?
I've tried using the new "es.mapping.exclude" configuration property (added 
in this commit 
<https://github.com/elasticsearch/elasticsearch-hadoop/commit/aae4f0460a23bac9567ea2ad335c74245a1ba069> 
- that's why I needed to take the latest build rather than using version 
2.1.0.Beta3), but it doesn't seem to have any effect (although I'm not sure 
it's even possible to exclude fields I'm using for mapping, e.g. 
"es.mapping.id").

A code snippet (I'm using a single-node Elasticsearch cluster for testing 
purposes and running the Spark app from my desktop) :
val conf = new SparkConf()...
conf.set("es.index.auto.create", "false")
conf.set("es.nodes.discovery", "false")
conf.set("es.nodes", "XXX:9200")
conf.set("es.update.script", "XXX")
conf.set("es.update.script.params", "param1:events")
conf.set("es.update.retry.on.conflict" , "2")
conf.set("es.write.operation", "upsert")
conf.set("es.input.json", "true")
val documentsRdd =  ...
documentsRdd.saveJsonToEs("test/user",
  scala.collection.Map("es.mapping.id" -> "_id", "es.mapping.exclude" -> "_id"))

The JSON looks like this :
{
  "_id": "",
  "_type": "user",
  "_index": "test",
  "params": {
    "events": [
      {
        ...
      }
    ]
  }
}

Thanks!
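Until es.mapping.exclude works with JSON input, one possible workaround is to carry the id outside the document and strip the metadata keys from the body before indexing. A sketch, assuming the json4s library is on the classpath and the `saveToEsWithMeta` method from elasticsearch-spark; the helper name is hypothetical:

```scala
import org.json4s._
import org.json4s.jackson.JsonMethods._
import org.elasticsearch.spark._   // adds saveToEsWithMeta to pair RDDs

// Pull the id out of each JSON document and drop the metadata keys from
// the body; the pair key then carries the id instead of the _source.
def toIdAndBody(json: String): (String, String) = parse(json) match {
  case JObject(fields) =>
    val id = fields.collectFirst { case ("_id", JString(v)) => v }.getOrElse("")
    val meta = Set("_id", "_type", "_index")
    val body = JObject(fields.filterNot { case (k, _) => meta(k) })
    (id, compact(render(body)))
  case other => ("", compact(render(other)))
}

// documentsRdd.map(toIdAndBody).saveToEsWithMeta("test/user")
```

With es.input.json still set to true, the cleaned string would be indexed as-is and the id would come from the pair key rather than the document body.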

-- 
You received this message because you are subscribed to the Google Groups 
"elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to elasticsearch+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/elasticsearch/8055055f-8787-492b-97f4-144b2a7f7fce%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.