On Mon, Apr 8, 2013 at 2:58 AM, Christian Tzolov <[email protected] > wrote:
> Hey Josh,
>
> Thanks for the tips!
>
> I followed the HBaseSource.java for implementing the ESSource and copied
> the inputId handling approach:
>
> https://github.com/tzolov/elasticsearch-hadoop/blob/master/src/main/java/org/elasticsearch/hadoop/crunch/ESSource.java
>
> I don't completely understand the implication of the dummy Path parameter.
> In this context is the Path needed only for input equality check?
>
> The ESTarget is more tricky. I was not sure what to do with the keyClass
> parameter in the CrunchOutputs.addNamedOutput() so I've set it to String.
> The ES-Hadoop uses Jackson for JSON serializations and it fails when trying
> to serialize internal Crunch Writable types. I guess because they are not
> public. Storing internal Crunch Writable types in ES doesn't make much
> sense anyway. The current implementation expects a custom (Writable) class
> to define the JSON format. Perhaps with Avro we can try to reuse the Avro
> schema.
>
> Here is the ES-Hadoop ticket for adding Crunch to the ES-Hadoop project:
> https://github.com/elasticsearch/elasticsearch-hadoop/issues/20
>
> Shall we deploy the 0.6.0-SNAPSHOT in some public snapshot repo? The
> https://repository.apache.org/content/groups/snapshots/org/apache/crunch/ is
> empty. Perhaps we can deploy the latest Jenkins builds into this
> snapshot repo? Unless there is some policy against it?
>

I just think it means it's time to cut the 0.6.0 release. I would have
liked to get CRUNCH-165 in as well, but I don't think it's been tested
enough.

> Cheers,
> Chris
>
> On Mon, Apr 8, 2013 at 7:18 AM, Josh Wills <[email protected]> wrote:
>
> > Hey Christian,
> >
> > Supe-cool. Replies inlined.
> >
> > On Sun, Apr 7, 2013 at 8:32 PM, Christian Tzolov <[email protected]>
> > wrote:
> >
> > > I've been working on Crunch - ElasticSearch (http://www.elasticsearch.org/)
> > > integration over the weekend :)
> > >
> > > Here is my first prototype:
> > > https://github.com/tzolov/elasticsearch-hadoop#crunch and a sample
> > > application: http://bit.ly/Y7lasW.
> > >
> > > It implements ES Source and Target on top of the ES-Hadoop's
> > > (https://github.com/elasticsearch/elasticsearch-hadoop) ESInputFormat and
> > > ESOutputFormat.
> > >
> > > Not sure though what is the best/right way to build Source/Targets for new
> > > Input/Output Formats? Any suggestions, references?
> > >
> >
> > I built a Source for HCatalog last week as part of ML:
> >
> > https://github.com/cloudera/ml/blob/master/hcatalog/src/main/java/com/cloudera/science/ml/hcatalog/HCatalogSource.java
> >
> > The interesting bit is really in the configureSource method: if the inputId
> > is < 0, then it's a single-input MapReduce job, and you can essentially
> > configure the input just as you would for a regular MapReduce. If the
> > inputId >= 0, then it's a multi-input job (e.g., for a join), and you have
> > to use CrunchInputs w/a FormatBundle object. The FormatBundle wraps an
> > InputFormat or an OutputFormat w/any Configuration settings that the
> > InputFormat/OutputFormat needs. This way, you can have multiple inputs that
> > use the same InputFormat, but have different configuration settings (e.g.,
> > when you're joining multiple Avro files together and they each need to have
> > their own schema specified.)
> >
> > > The write to ES is tricky and at the moment looks more like a hack (see the
> > > doc).
> > >
> > > Cheers
> > > Chris
> > >
> > > (P.S. The prototype doesn't support AvroTypeFamily yet but I've been looking
> > > at jackson-dataformat-avro kind of solution (ES-Hadoop relies on Jackson
> > > for the JSON serialisation))
> > >
> >
> > I'd like to work on this as well-- I'll take a look tomorrow and try to put
> > together a pull req for anything that I think should be configured
> > differently.
> >
> > J
> >
> > --
> > Director of Data Science
> > Cloudera <http://www.cloudera.com>
> > Twitter: @josh_wills <http://twitter.com/josh_wills>
>

--
Director of Data Science
Cloudera <http://www.cloudera.com>
Twitter: @josh_wills <http://twitter.com/josh_wills>
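
The inputId branching described in the thread for configureSource can be sketched roughly as follows. This is only an illustrative model, not Crunch code: Job and Bundle are hypothetical stand-ins written for this note (they are not the Hadoop Job or the Crunch FormatBundle classes), and the "input.format" and "schema" keys are made up. It is meant to show why per-input bundles let two inputs share one format while keeping different settings.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/**
 * Illustrative model only: Job and Bundle are hypothetical stand-ins,
 * not the Hadoop Job or Crunch FormatBundle classes.
 */
public class InputIdSketch {

    /** Stand-in for a format plus the settings that only this input needs. */
    static final class Bundle {
        final String formatName;
        final Map<String, String> settings = new LinkedHashMap<>();

        Bundle(String formatName) {
            this.formatName = formatName;
        }

        Bundle set(String key, String value) {
            settings.put(key, value);
            return this;
        }
    }

    /** Stand-in for a job: one shared configuration plus per-input bundles. */
    static final class Job {
        final Map<String, String> conf = new LinkedHashMap<>();
        final Map<Integer, Bundle> inputs = new LinkedHashMap<>();
    }

    /** Mirrors the branching described for configureSource(job, inputId). */
    static void configureSource(Job job, int inputId, Bundle bundle) {
        if (inputId < 0) {
            // Single-input job: settings go straight into the job configuration.
            job.conf.put("input.format", bundle.formatName);
            job.conf.putAll(bundle.settings);
        } else {
            // Multi-input job: keep each input's settings in its own bundle so
            // two inputs sharing a format cannot overwrite each other.
            job.inputs.put(inputId, bundle);
        }
    }

    public static void main(String[] args) {
        // Single input: the (made-up) schema setting lands in the shared conf.
        Job single = new Job();
        configureSource(single, -1, new Bundle("AvroInputFormat").set("schema", "A"));
        System.out.println(single.conf);

        // Join of two inputs with the same format but different schemas.
        Job join = new Job();
        configureSource(join, 0, new Bundle("AvroInputFormat").set("schema", "A"));
        configureSource(join, 1, new Bundle("AvroInputFormat").set("schema", "B"));
        System.out.println(join.inputs.get(0).settings + " " + join.inputs.get(1).settings);
    }
}
```

In the multi-input case the shared configuration stays untouched, which is the property the thread attributes to wrapping each InputFormat with its own settings.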
