You might also be interested to know that there have been discussions about 
deprecating Hive on Spark: 
https://lists.apache.org/thread/sspltkv3ovbsjmoct72p4m1ooqk2g740

On Sat, 2023-08-19 at 10:17 +0000, Aaron Grubb wrote:
Hi Mich,

It's not a question of whether we can, but rather a) is it worth converting our 
pipelines from Hive to Spark, and b) is Spark more performant than LLAP? In both 
cases the answer seems to be no. 2016 is a lifetime ago in technological terms, 
and since then there has been a major release of Hive as well as many minor 
releases. When we started looking for our "big data processor" 2 years ago, we 
evaluated Spark, Presto, AWS Athena and Hive on LLAP, and all the literature 
pointed to Hive on LLAP being the most performant, in particular when you're 
able to take advantage of the ORC footer caching. If you'd like to review some 
benchmarks, take a look at [1], but note that the direct comparison between 
Spark and LLAP is done with a fork of Hive.

Regards,
Aaron

[1] https://www.datamonad.com/post/2022-04-01-spark-hive-performance-1.4/

On Fri, 2023-08-18 at 16:06 +0100, Mich Talebzadeh wrote:
interesting!

In 2016 I gave a presentation in London at Future of Data, organised by 
Hortonworks on July 20, 2016:

Query Engines for Hive: MR, Spark, Tez with LLAP – Considerations! 
<https://talebzadehmich.files.wordpress.com/2016/08/hive_on_spark_only.pdf>


At the time I thought Spark did the best job as the underlying engine for Hive. 
However, I am not sure there have been many new developments to keep Spark as 
the underlying engine for Hive. Is there any particular reason you cannot use 
Spark as the ETL tool, with Hive providing the underlying storage? Spark has 
excellent APIs for working with Hive, including the Spark Thrift Server (which 
is, under the bonnet, the Hive Thrift Server).
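
For illustration, a minimal sketch of that route (the host, database and table 
names below are placeholders, not anything from your setup):

  # Start the Spark Thrift Server on YARN; it speaks the HiveServer2 protocol
  $SPARK_HOME/sbin/start-thriftserver.sh --master yarn

  # Connect with beeline exactly as you would to HiveServer2 (port 10000 by default)
  beeline -u jdbc:hive2://spark-thrift-host.example.com:10000 \
    -e "SELECT count(*) FROM my_db.my_table"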

HTH

Mich Talebzadeh,
Solutions Architect/Engineering Lead
London
United Kingdom

 
view my LinkedIn profile: 
https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/

 https://en.everybodywiki.com/Mich_Talebzadeh



Disclaimer: Use it at your own risk. Any and all responsibility for any loss, 
damage or destruction of data or any other property which may arise from 
relying on this email's technical content is explicitly disclaimed. The author 
will in no case be liable for any monetary damages arising from such loss, 
damage or destruction.




On Fri, 18 Aug 2023 at 15:45, Aaron Grubb <aa...@kaden.ai> wrote:
Hi Mich,

Yes, that's correct

On Fri, 2023-08-18 at 15:24 +0100, Mich Talebzadeh wrote:
Hi,

Are you using LLAP (Live Long and Process) as the Hive engine?

HTH

Mich Talebzadeh,
Solutions Architect/Engineering Lead
London
United Kingdom

 




On Fri, 18 Aug 2023 at 15:09, Aaron Grubb <aa...@kaden.ai> wrote:
For those interested, I managed to work out a way to launch the LLAP application 
master and daemons on separate, targeted machines. It was inspired by an 
article I found [1] and implemented using YARN Node Labels [2] and Placement 
Constraints [3], with a modification to the file scripts/llap/yarn/templates.py. 
Here are the basic instructions:

1. Configure YARN to enable placement constraints and node labels (the relevant 
properties are sketched below). You can use either two node labels, or one node 
label plus the default partition. The machines intended to run the daemons must 
have a label associated with them. If you use two node labels, you must set the 
default node label expression of the queue you submit LLAP to, to the label 
associated with the machine that will run the application master; note that 
this affects other applications submitted to the same queue. If you use only 
one label, the machine that will run the AM must be reachable through the 
DEFAULT_PARTITION, and it will not be specifically targeted if more than one 
machine is accessible from the DEFAULT_PARTITION, so this scenario is only 
recommended if you have a single machine dedicated to application masters, as 
in my case.
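
Roughly, the pieces involved look like the following; the label name 
"llap_daemons", the queue path "root.llap" and the host names are placeholders, 
so adjust them to your cluster:

  # yarn-site.xml: enable node labels and the placement constraint handler
  yarn.node-labels.enabled = true
  yarn.node-labels.fs-store.root-dir = hdfs:///yarn/node-labels
  yarn.resourcemanager.placement-constraints.handler = scheduler

  # Create the label and attach it to the machines that should run the daemons
  yarn rmadmin -addToClusterNodeLabels "llap_daemons(exclusive=true)"
  yarn rmadmin -replaceLabelsOnNode "worker1.example.com=llap_daemons worker2.example.com=llap_daemons"

  # capacity-scheduler.xml: give the LLAP queue access to the label; for the
  # two-label variant, default the queue to the label of the AM machine
  yarn.scheduler.capacity.root.llap.accessible-node-labels = llap_daemons
  yarn.scheduler.capacity.root.llap.accessible-node-labels.llap_daemons.capacity = 100
  yarn.scheduler.capacity.root.llap.default-node-label-expression = <INSERT AM NODE LABEL HERE>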

2. Modify scripts/llap/yarn/templates.py like so:

#SNIP

          "APP_ROOT": "<WORK_DIR>/app/install/",
          "APP_TMP_DIR": "<WORK_DIR>/tmp/"
        }
      },
      "placement_policy": {
        "constraints": [
          {
            "type": "ANTI_AFFINITY",
            "scope": "NODE",
            "target_tags": [
              "llap"
            ],
            "node_partitions": [
              "<INSERT LLAP DAEMON NODE LABEL HERE>"
            ]
          }
        ]
      }
    }
  ],
  "kerberos_principal" : {

#SNIP

Note that ANTI_AFFINITY means that only one daemon will be spawned per machine, 
but that should be the desired behaviour anyway. Read more about it in [3].

3. Launch LLAP using the hive --service llap command (an example invocation is below).
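
An invocation looks roughly like this; the instance count, sizes and queue name 
are examples only, and you can check the available flags with --help on your 
version:

  hive --service llap \
    --instances 6 \
    --size 32g \
    --executors 4 \
    --xmx 26g \
    --cache 4g \
    --name llap0 \
    --queue llap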

Hope this helps someone!
Aaron

[1] 
https://www.gresearch.com/blog/article/hive-llap-in-practice-sizing-setup-and-troubleshooting/
[2] 
https://hadoop.apache.org/docs/stable/hadoop-yarn/hadoop-yarn-site/NodeLabel.html
[3] 
https://hadoop.apache.org/docs/stable/hadoop-yarn/hadoop-yarn-site/PlacementConstraints.html

On 2023/03/22 10:19:57 Aaron Grubb wrote:
> Hi all,
>
> I have a Hadoop cluster (3.3.4) with 6 nodes of equal resource size that run 
> HDFS and YARN, and 1 node with lower resources that only runs YARN, which I use 
> for Hive AMs, the LLAP AM, Spark AMs and Hive file merge containers. The HDFS 
> nodes are set up such that the queue for LLAP on the YARN NodeManager is 
> allocated resources exactly equal to what the LLAP daemons consume. However, 
> when I need to re-launch LLAP, I currently have to stop the NodeManager 
> processes on each HDFS node, then launch LLAP to guarantee that the 
> application master ends up on the YARN-only machine, then start the 
> NodeManager processes again to let the daemons start spawning on the nodes. 
> This used not to be a problem because only Hive/LLAP was using YARN, but now 
> we've started using Spark in my company, and I'm in a position where, if LLAP 
> happens to crash, I would need to wait for Spark jobs to finish before I can 
> re-launch LLAP, which would put our ETL processes behind, potentially causing 
> unacceptable delays. I could allocate 1 vcore and 1024 MB of memory extra for 
> the LLAP queue on each machine, but that would mean 5 vcores and 5 GB of RAM 
> are reserved and unused at all times. So I was wondering: is there a way to 
> specify which node to launch the LLAP AM on, perhaps through YARN node 
> labels, similar to the Spark "spark.yarn.am.nodeLabelExpression" 
> configuration? Or even a way to specify the node through a different 
> mechanism? My Hive version is 3.1.3.
>
> Thanks,
> Aaron
>



