I remember seeing that Hive used a serialized form before switching to XML. Most likely the switch to XML was made for easier debugging. We should go either with a serialization format (Kryo, protobuf, etc.) or with JSON if we want readability (JSON would be more compact than XML). Either way, serializing and deserializing will add to the job time.
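To put rough numbers on the JSON vs. XML point, here is a minimal sketch in Python (standard library only, not HCatalog code). The record fields, the paths, and the `org.example.ExampleInputFormat` name are all made up for illustration; real partition info carries more fields, so treat the output as an indication only.

```python
import json
import zlib
import xml.etree.ElementTree as ET


def make_partitions(n):
    # Hypothetical partition records; field names are illustrative only.
    return [
        {
            "location": "hdfs://nn/warehouse/t/ds=2012-09-05/part=%d" % i,
            "input_format": "org.example.ExampleInputFormat",
        }
        for i in range(n)
    ]


def to_json(parts):
    # Compact separators avoid the default ", " and ": " padding.
    return json.dumps(parts, separators=(",", ":")).encode("utf-8")


def to_xml(parts):
    # One element per record, one child element per field.
    root = ET.Element("partitions")
    for part in parts:
        node = ET.SubElement(root, "partition")
        for key, value in part.items():
            ET.SubElement(node, key).text = value
    return ET.tostring(root, encoding="utf-8")


def sizes(n):
    # Byte counts for the same records in each encoding, raw and zlib-compressed.
    parts = make_partitions(n)
    as_json = to_json(parts)
    as_xml = to_xml(parts)
    return {
        "json": len(as_json),
        "xml": len(as_xml),
        "json+zlib": len(zlib.compress(as_json)),
        "xml+zlib": len(zlib.compress(as_xml)),
    }


if __name__ == "__main__":
    for name, size in sizes(10000).items():
        print(name, size)
```

XML repeats every field name in both the opening and the closing tag, which is where most of the difference comes from. Compression narrows the gap, at the cost of the extra CPU time mentioned above.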
-Rohini

On Wed, Sep 5, 2012 at 2:01 PM, Travis Crawford <[email protected]> wrote:

> Interesting that Hive solves this with a separate file in the
> distributed cache, I was curious how Hive dealt with it.
>
> Given that Hadoop has Jackson as a dependency, is it safe to assume
> every HCatalog user will have Jackson available? If we took the same
> approach and serialized to XML we would not require another
> dependency.
>
> Renato - what are your thoughts? We don't want to do the whole patch
> for you ;)
>
> --travis
>
>
> On Wed, Sep 5, 2012 at 1:35 PM, Rohini Palaniswamy
> <[email protected]> wrote:
> > Yes, it would be a good idea to serialize that to a separate file and
> > use the distributed cache for it. Hive does it that way by serializing
> > the plan and partition information (MapredWork) to an XML file. And we
> > should also investigate the serialized data and the way we serialize
> > it to avoid bloating or inefficiency. For example, in Hive the
> > serialization is done badly (HIVE-2988) and that makes the client
> > require at least 1G of memory when querying a table involving a large
> > number of partitions.
> >
> > I think the easier approach would be to move it to a separate file
> > first and avoid the max jobconf issue, and then work on optimizing the
> > serialized data for size. However much we optimize it, it will not
> > prove scalable to keep the partition information in the jobconf for
> > very big tables.
> >
> > -Rohini
> >
> > On Wed, Sep 5, 2012 at 7:32 AM, Travis Crawford
> > <[email protected]> wrote:
> >
> >> Thought I wrote back to this on the train but it didn't send :/
> >>
> >> Agreed the distributed cache would be a good way to distribute this
> >> info to worker tasks, since it reuses an existing MR feature for
> >> sharing files with worker tasks.
> >>
> >> Something I'm curious about is why the partition info we store in the
> >> jobconf is so large, it naively feels like we may be serializing too
> >> much stuff and that could be trimmed down.
> >>
> >> As a starting point, I'd take a look at this test which creates a
> >> table, adds a partition, and performs a query:
> >>
> >> http://svn.apache.org/viewvc/incubator/hcatalog/trunk/src/test/org/apache/hcatalog/mapreduce/TestHCatHiveThriftCompatibility.java?view=markup
> >>
> >> We could do something like this and just make a huge number of
> >> partitions and see when the jobconf becomes "too large", then profile
> >> exactly what that bloat is. We could investigate trimming it down,
> >> compressing, or using distributed cache.
> >>
> >> Going with the test-first approach would be useful to pinpoint the
> >> actual issue, then we can investigate the right approach to solve.
> >>
> >> Does this sound like a good starting point? Holler if you run into any
> >> issues along the way!
> >>
> >> --travis
> >>
> >>
> >> On Tue, Sep 4, 2012 at 5:39 PM, Alan Gates <[email protected]> wrote:
> >> >
> >> > On Sep 1, 2012, at 4:38 PM, Renato Marroquín Mogrovejo wrote:
> >> >
> >> >> Hi Travis,
> >> >>
> >> >> Thanks a ton for this issue I know I will enjoy solving this (: So I
> >> >> have some questions about this jira even though I think I understand
> >> >> what the problem is.
> >> >>
> >> >> - How do you think I should approach this? I mean if HCat can't send
> >> >> the partitions' information through the configuration object, maybe
> >> >> we should think on a different way of communicating this information
> >> >> (thrift, or the database)?
> >> > Thrift or the database aren't options. You can't count on being able
> >> > to communicate with the client from the map tasks, not to mention you
> >> > would overwhelm the client.
> >> > One of the rules of hcat is the map and reduce tasks should never
> >> > talk to the database, as it isn't sized to handle large numbers of
> >> > tasks talking to it.
> >> >
> >> > My first thought would be to use the distributed cache. You should
> >> > only use this option when you have a very large number of files. But
> >> > in that case write them to a file, put that file in the distributed
> >> > cache, and then put a pointer to that in the job conf instead of the
> >> > file list.
> >> >
> >> > Alan.
> >> >
> >> >> - I was looking at HCatLoader but I am not sure if this would be a
> >> >> good entry point for the modifications. Any suggestions?
> >> >>
> >> >> Thanks again Travis!
> >> >>
> >> >>
> >> >> Renato M.
> >> >>
> >> >>
> >> >> 2012/8/30 Travis Crawford <[email protected]>:
> >> >>> You might be interested in
> >> >>> https://issues.apache.org/jira/browse/HCATALOG-453
> >> >>>
> >> >>> The issue here is HCatalog queries the HiveMetaStore for info about
> >> >>> the partitions to process, and stores that response in the job conf.
> >> >>> When processing large numbers of partitions this bloats the job conf
> >> >>> beyond what Hadoop will allow and the job fails.
> >> >>>
> >> >>> What's interesting about this issue is you'll learn about the main
> >> >>> feature of HCatalog - translating db+table+partition_spec into a
> >> >>> list of partitions, how HCat handles that internally, and how it's
> >> >>> communicated between the frontend & backend. The actual issue is
> >> >>> straightforward, but I think spending the time to understand the
> >> >>> problem will give a great overview of how HCat works.
> >> >>>
> >> >>> Thoughts?
> >> >>>
> >> >>> --travis
> >> >>>
> >> >>>
> >> >>>
> >> >>> On Thu, Aug 30, 2012 at 4:25 PM, Renato Marroquín Mogrovejo
> >> >>> <[email protected]> wrote:
> >> >>>> Travis,
> >> >>>>
> >> >>>> Thanks a lot for your response!
> >> >>>> My master's dissertation was about using statistics to smarten up
> >> >>>> the Apache Pig rule optimizer, so I would love to help out with
> >> >>>> something related, but maybe you can suggest some interesting
> >> >>>> jiras (not complicated ones but maybe "noobie" ones) I can start
> >> >>>> with (:
> >> >>>> And yeah the labels thing is much better than creating a JIRA type
> >> >>>> for noobies. Thanks again!
> >> >>>>
> >> >>>>
> >> >>>> Renato M.
> >> >>>>
> >> >>>> 2012/8/30 Travis Crawford <[email protected]>:
> >> >>>>> Hey Renato -
> >> >>>>>
> >> >>>>> Awesome! What in particular are you interested in starting out
> >> >>>>> with? We can definitely find a starter project for you in that
> >> >>>>> area.
> >> >>>>>
> >> >>>>> JIRA issues can have a variety of attributes; the attribute I
> >> >>>>> started this thread about is the "issue type".
> >> >>>>>
> >> >>>>> JIRA also has "labels", which I think are a great place to
> >> >>>>> indicate something would be good for noobies. For example, there
> >> >>>>> could be an "issue type" of bug, with "label" noobie.
> >> >>>>>
> >> >>>>> Let us know what area you're interested in diving into and we can
> >> >>>>> help come up with a starter project for ya.
> >> >>>>>
> >> >>>>> --travis
> >> >>>>>
> >> >>>>>
> >> >>>>> On Thu, Aug 30, 2012 at 9:21 AM, Renato Marroquín Mogrovejo
> >> >>>>> <[email protected]> wrote:
> >> >>>>>> Hi all,
> >> >>>>>>
> >> >>>>>> I am new to HCatalog but I would like to get involved with the
> >> >>>>>> project, and one thing that would totally help is to create an
> >> >>>>>> issue type that indicates it is for "newbies". I saw that in
> >> >>>>>> Apache Pig they have a special type of issue for this and with
> >> >>>>>> this they try to engage more with the community. This would be
> >> >>>>>> awesome guys!
> >> >>>>>> Thanks in advance!
> >> >>>>>>
> >> >>>>>>
> >> >>>>>> Renato M.
> >> >>>>>>
> >> >>>>>> 2012/8/30 Travis Crawford <[email protected]>:
> >> >>>>>>> Hey hcat gurus -
> >> >>>>>>>
> >> >>>>>>> Filing an issue just now I noticed the list of possible option
> >> >>>>>>> types is pretty crazy long - any objection to requesting a
> >> >>>>>>> simplification to:
> >> >>>>>>>
> >> >>>>>>> PROPOSED ISSUE TYPES:
> >> >>>>>>>
> >> >>>>>>> Bug - fixing unintended behavior
> >> >>>>>>> New Feature - addition of brand-new functionality
> >> >>>>>>> Improvement - making existing functionality better
> >> >>>>>>>
> >> >>>>>>> CURRENT ISSUE TYPES:
> >> >>>>>>>
> >> >>>>>>> Bug
> >> >>>>>>> New Feature
> >> >>>>>>> Improvement
> >> >>>>>>> Test
> >> >>>>>>> Wish
> >> >>>>>>> Task
> >> >>>>>>> New JIRA Project
> >> >>>>>>> RTC
> >> >>>>>>> TCK Challenge
> >> >>>>>>> Question
> >> >>>>>>> Temp
> >> >>>>>>> Brainstorming
> >> >>>>>>> Umbrella
> >> >>>>>>> Epic
> >> >>>>>>> Dependency upgrade
> >> >>>>>>> Suitable Name Search
> >> >>>>>>>
> >> >>>>>>> If this sounds good I'll ping the infra folks and try to make
> >> >>>>>>> this happen.
> >> >>>>>>>
> >> >>>>>>> --travis
> >> >
> >> >

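For reference, the approach discussed in the quoted thread (write the partition list to a file, ship that file separately, and keep only a pointer in the job conf) can be sketched roughly as below. This is plain Python with the standard library, not Hadoop or HCatalog code: a dict stands in for the job conf, a local temp directory stands in for the distributed cache, and the `example.*` keys and the paths are hypothetical. The last function follows the test-first idea of growing the partition count until the inline conf exceeds a limit; the limit is whatever the caller passes in, not a real Hadoop setting.

```python
import gzip
import json
import os
import tempfile

# Hypothetical conf keys; not real HCatalog or Hadoop property names.
INLINE_KEY = "example.partition.info"
POINTER_KEY = "example.partition.info.file"


def make_partitions(n):
    # Made-up partition paths, for illustration only.
    return ["/warehouse/t/ds=2012-09-05/part=%d" % i for i in range(n)]


def conf_size(conf):
    # Rough stand-in for the serialized conf size: key plus value lengths.
    return sum(len(key) + len(value) for key, value in conf.items())


def store_inline(conf, partitions):
    # The problematic pattern: the whole list lives inside the conf.
    conf[INLINE_KEY] = json.dumps(partitions)


def store_as_pointer(conf, partitions, directory):
    # Write the list to a compressed side file; the conf keeps only its path.
    fd, path = tempfile.mkstemp(suffix=".json.gz", dir=directory)
    os.close(fd)
    with gzip.open(path, "wt", encoding="utf-8") as out:
        json.dump(partitions, out)
    conf[POINTER_KEY] = path
    return path


def load_from_pointer(conf):
    # What a task would do: follow the pointer and read the list back.
    with gzip.open(conf[POINTER_KEY], "rt", encoding="utf-8") as src:
        return json.load(src)


def first_count_over_limit(limit, step=1000):
    # Grow the partition count until the inline conf exceeds `limit` bytes.
    n = step
    while True:
        conf = {}
        store_inline(conf, make_partitions(n))
        if conf_size(conf) > limit:
            return n
        n += step
```

With the pointer variant the conf size depends only on the length of the file path, so it stays flat however many partitions the table has, while the inline variant grows linearly with the partition count.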