If you use Pig or Spark, you significantly increase the complexity from an
operations management perspective. Spark should be evaluated from a
platform perspective, where it makes sense. If you can do it directly with
HBase/Phoenix, or with an HBase coprocessor alone, that should be
preferred. Otherwise you pay more for maintenance and development.
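
A minimal sketch of the direct Phoenix route, for illustration (the table,
columns, and the ZooKeeper quorum in the JDBC URL are made up; adjust to
your schema):

import java.sql.DriverManager

object PhoenixUpsert {
  def main(args: Array[String]): Unit = {
    // Standard Phoenix JDBC URL: jdbc:phoenix:<zookeeper-quorum>
    val conn = DriverManager.getConnection("jdbc:phoenix:zk-host:2181")
    try {
      // Phoenix UPSERT inserts the row if absent and overwrites the
      // listed columns if it exists -- no read-before-write needed.
      val stmt = conn.prepareStatement(
        "UPSERT INTO RECORDS (ID, NAME, AMOUNT) VALUES (?, ?, ?)")
      stmt.setString(1, "row-1")
      stmt.setString(2, "example")
      stmt.setInt(3, 42)
      stmt.executeUpdate()
      conn.commit() // Phoenix connections do not auto-commit by default
    } finally {
      conn.close()
    }
  }
}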

On Thu, Sep 3, 2015 at 5:16 PM, Tao Lu <taolu2...@gmail.com> wrote:

> Yes, Ayan, your approach will work.
>
> Or alternatively, use Spark and write a Scala/Java function that
> implements logic similar to your Pig UDF.
>
> Both approaches look similar.
>
> Personally, I would go with the Spark solution: it will be slightly
> faster, and easier if you already have a Spark cluster set up on top of
> the Hadoop cluster in your infrastructure.
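>
> A rough sketch of what I mean (HBase 1.x client API; the table, column
> family, and parsing are placeholders, so treat this as a starting point
> rather than a drop-in job):
>
> import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
> import org.apache.hadoop.hbase.client.{ConnectionFactory, Get, Put}
> import org.apache.hadoop.hbase.util.Bytes
> import org.apache.spark.{SparkConf, SparkContext}
>
> object HBaseUpsertJob {
>   def main(args: Array[String]): Unit = {
>     val sc = new SparkContext(new SparkConf().setAppName("hbase-upsert"))
>     // One HBase connection per partition, not per record.
>     sc.textFile("hdfs:///input/records.txt").foreachPartition { rows =>
>       val conn =
>         ConnectionFactory.createConnection(HBaseConfiguration.create())
>       val table = conn.getTable(TableName.valueOf("records"))
>       try {
>         rows.foreach { line =>
>           val Array(id, name) = line.split(",", 2) // placeholder parsing
>           val result = table.get(new Get(Bytes.toBytes(id)))
>           // null when the row or column does not exist yet
>           val current = result.getValue(Bytes.toBytes("cf"),
>                                         Bytes.toBytes("name"))
>           // Write only when the record is new or the field differs.
>           if (current == null || Bytes.toString(current) != name) {
>             val put = new Put(Bytes.toBytes(id))
>             put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("name"),
>                           Bytes.toBytes(name))
>             table.put(put)
>           }
>         }
>       } finally {
>         table.close()
>         conn.close()
>       }
>     }
>     sc.stop()
>   }
> }
>
> Opening the connection inside foreachPartition also addresses the
> per-record connection concern: you pay the connection cost once per
> partition, not once per row.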
>
> Cheers,
> Tao
>
>
> On Thu, Sep 3, 2015 at 1:15 AM, ayan guha <guha.a...@gmail.com> wrote:
>
>> Thanks for the info. I am planning to implement a Pig UDF to do record
>> lookups. Kindly let me know if this is a good idea.
>>
>> Best
>> Ayan
>>
>> On Thu, Sep 3, 2015 at 2:55 PM, Jörn Franke <jornfra...@gmail.com> wrote:
>>
>>>
>>> You may check whether it makes sense to write a coprocessor that does
>>> the upsert for you, if one does not exist already. Maybe Phoenix for
>>> HBase already supports this.
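>>>
>>> If a full coprocessor is more than you need, note that the stock HBase
>>> client already offers atomic conditional writes via checkAndPut. A
>>> sketch (all names made up):
>>>
>>> import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
>>> import org.apache.hadoop.hbase.client.{ConnectionFactory, Put}
>>> import org.apache.hadoop.hbase.util.Bytes
>>>
>>> object CheckAndPutExample {
>>>   def main(args: Array[String]): Unit = {
>>>     val conn =
>>>       ConnectionFactory.createConnection(HBaseConfiguration.create())
>>>     val table = conn.getTable(TableName.valueOf("records"))
>>>     val put = new Put(Bytes.toBytes("row-1"))
>>>     put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("name"),
>>>                   Bytes.toBytes("new-value"))
>>>     // Passing null as the expected value means "write only if the
>>>     // cell does not exist yet", i.e. insert-if-absent, atomically.
>>>     val inserted = table.checkAndPut(Bytes.toBytes("row-1"),
>>>       Bytes.toBytes("cf"), Bytes.toBytes("name"), null, put)
>>>     if (!inserted) {
>>>       table.put(put) // row exists: a plain put overwrites the column
>>>     }
>>>     table.close()
>>>     conn.close()
>>>   }
>>> }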
>>>
>>> Another alternative, if the records do not have a unique ID, is to put
>>> them into a text index engine such as Solr or Elasticsearch, which in
>>> this case does fast matching with relevance scores.
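>>>
>>> As a toy illustration of the index-engine idea (the host, index, and
>>> field names are invented), a relevance-ranked match against
>>> Elasticsearch is a single HTTP call:
>>>
>>> import java.io.OutputStreamWriter
>>> import java.net.{HttpURLConnection, URL}
>>> import scala.io.Source
>>>
>>> object EsMatch {
>>>   def main(args: Array[String]): Unit = {
>>>     val url = new URL("http://localhost:9200/records/_search")
>>>     val http = url.openConnection().asInstanceOf[HttpURLConnection]
>>>     http.setRequestMethod("POST")
>>>     http.setDoOutput(true)
>>>     http.setRequestProperty("Content-Type", "application/json")
>>>     // A standard "match" query; hits come back ordered by _score.
>>>     val query = """{"query": {"match": {"name": "incoming record"}}}"""
>>>     val out = new OutputStreamWriter(http.getOutputStream)
>>>     out.write(query)
>>>     out.close()
>>>     println(Source.fromInputStream(http.getInputStream).mkString)
>>>   }
>>> }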
>>>
>>>
>>> You can also use Spark and Pig there. However, I am not sure Spark is
>>> suitable for these one-row lookups; the same holds for Pig.
>>>
>>>
>>> On Wed, Sep 2, 2015 at 11:53 PM, ayan guha <guha.a...@gmail.com> wrote:
>>>
>>> Hello group
>>>
>>> I am trying to use Pig or Spark to achieve the following:
>>>
>>> 1. Write a batch process which reads from a file.
>>> 2. Look up HBase to see whether the record exists. If so, compare the
>>> incoming values with HBase and update the fields that do not match;
>>> otherwise create a new record.
>>>
>>> My questions:
>>> 1. Is this a good use case for Pig or Spark?
>>> 2. Is there any way to read HBase for each incoming record in Pig
>>> without writing MapReduce code?
>>> 3. In the case of Spark, I think we have to connect to HBase for every
>>> record. Is there any other way?
>>> 4. What is the best HBase connector that provides this functionality?
>>>
>>> Best
>>>
>>> Ayan
>>>
>>>
>>>
>>
>>
>> --
>> Best Regards,
>> Ayan Guha
>>
>
>
>
> --
> ------------------------------------------------
> Thanks!
> Tao
>
