If you use Pig or Spark, you significantly increase complexity from an operations-management perspective. Spark should be evaluated from a platform perspective, where it makes sense. If you can do it directly with HBase/Phoenix, or with an HBase coprocessor alone, then that should be preferred; otherwise you pay more for maintenance and development.
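To illustrate the "do it directly with Phoenix" option: Phoenix's UPSERT INTO statement combines insert-and-update in one operation, so no read-before-write round trip is needed. Below is a minimal JDBC sketch; the table name, column names, and ZooKeeper quorum are placeholders, and running it requires the Phoenix driver on the classpath.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.util.Collections;
import java.util.List;

public class PhoenixUpsertSketch {

    // Build a Phoenix UPSERT statement. Phoenix treats UPSERT as
    // insert-or-update, which matches the "create or update" requirement.
    static String buildUpsert(String table, List<String> columns) {
        String cols = String.join(", ", columns);
        String params = String.join(", ",
                Collections.nCopies(columns.size(), "?"));
        return "UPSERT INTO " + table + " (" + cols + ") VALUES (" + params + ")";
    }

    // Execute the upsert over JDBC (the connection URL is a placeholder;
    // point it at your ZooKeeper quorum).
    static void upsertRow(String table, List<String> columns, List<Object> values)
            throws Exception {
        try (Connection conn =
                     DriverManager.getConnection("jdbc:phoenix:zk-host:2181");
             PreparedStatement ps =
                     conn.prepareStatement(buildUpsert(table, columns))) {
            for (int i = 0; i < values.size(); i++) {
                ps.setObject(i + 1, values.get(i));
            }
            ps.executeUpdate();
            conn.commit(); // Phoenix connections do not auto-commit by default
        }
    }

    public static void main(String[] args) {
        System.out.println(buildUpsert("CUSTOMERS", List.of("ID", "NAME", "CITY")));
        // → UPSERT INTO CUSTOMERS (ID, NAME, CITY) VALUES (?, ?, ?)
    }
}
```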
On Thu, Sep 3, 2015 at 17:16, Tao Lu <taolu2...@gmail.com> wrote:

> Yes, Ayan, your approach will work.
>
> Alternatively, use Spark and write a Scala/Java function that implements
> logic similar to your Pig UDF.
>
> Both approaches look similar.
>
> Personally, I would go with the Spark solution: it will be slightly
> faster, and easier if you already have a Spark cluster set up on top of
> the Hadoop cluster in your infrastructure.
>
> Cheers,
> Tao
>
>
> On Thu, Sep 3, 2015 at 1:15 AM, ayan guha <guha.a...@gmail.com> wrote:
>
>> Thanks for your info. I am planning to implement a Pig UDF to do record
>> lookups. Kindly let me know if this is a good idea.
>>
>> Best
>> Ayan
>>
>> On Thu, Sep 3, 2015 at 2:55 PM, Jörn Franke <jornfra...@gmail.com> wrote:
>>
>>> You may check whether it makes sense to write a coprocessor that does
>>> the upsert for you, if one does not exist already. Phoenix for HBase may
>>> support this already.
>>>
>>> Another alternative, if the records do not have a unique id, is to put
>>> them into a text index engine, such as Solr or Elasticsearch, which in
>>> that case does fast matching with relevancy scores.
>>>
>>> You can also use Spark or Pig here. However, I am not sure Spark is
>>> suitable for these single-row lookups. The same holds for Pig.
>>>
>>> On Wed, Sep 2, 2015 at 23:53, ayan guha <guha.a...@gmail.com> wrote:
>>>
>>> Hello group,
>>>
>>> I am trying to use Pig or Spark to achieve the following:
>>>
>>> 1. Write a batch process which will read from a file.
>>> 2. Look up HBase to see if the record exists. If so, compare the
>>> incoming values with HBase and update the fields which do not match;
>>> otherwise create a new record.
>>>
>>> My questions:
>>> 1. Is this a good use case for Pig or Spark?
>>> 2. Is there any way to read HBase for each incoming record in Pig
>>> without writing MapReduce code?
>>> 3. In the case of Spark, I think we have to connect to HBase for every
>>> record. Is there any other way?
>>> 4. What is the best connector for HBase which provides this
>>> functionality?
>>>
>>> Best
>>>
>>> Ayan
>>
>>
>> --
>> Best Regards,
>> Ayan Guha
>
>
> --
> ------------------------------------------------
> Thanks!
> Tao
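The lookup-compare-update step Ayan describes can be sketched independently of any particular connector. In this illustrative Java sketch a Map stands in for the fetched HBase row; in real code the read would be an HBase Get (or a Phoenix query) and the returned fields would become a Put. The function and field names are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

public class UpsertLogic {

    // Given the existing row (null if the record is absent) and the
    // incoming record, return exactly the fields that must be written:
    // the whole record for a new row, or only the fields whose values
    // differ from what is already stored.
    static Map<String, String> fieldsToWrite(Map<String, String> existing,
                                             Map<String, String> incoming) {
        if (existing == null) {
            return new HashMap<>(incoming); // new record: write everything
        }
        Map<String, String> changed = new HashMap<>();
        for (Map.Entry<String, String> e : incoming.entrySet()) {
            if (!e.getValue().equals(existing.get(e.getKey()))) {
                changed.put(e.getKey(), e.getValue());
            }
        }
        return changed; // may be empty: nothing to update
    }

    public static void main(String[] args) {
        Map<String, String> existing = Map.of("name", "x", "city", "a");
        Map<String, String> incoming = Map.of("name", "x", "city", "b");
        System.out.println(fieldsToWrite(existing, incoming)); // {city=b}
    }
}
```

On question 3: in Spark you would normally avoid opening a connection per record by doing the lookups inside mapPartitions, so that one HBase connection is shared by all records of a partition.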