[jira] Updated: (HIVE-165) var(col) built-in to go with avg(col) and count(col)
[ https://issues.apache.org/jira/browse/HIVE-165?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jeff Hammerbacher updated HIVE-165:
-----------------------------------

    Component/s: Query Processor

Adding to Query Processor component.

var(col) built-in to go with avg(col) and count(col)
----------------------------------------------------

    Key: HIVE-165
    URL: https://issues.apache.org/jira/browse/HIVE-165
    Project: Hadoop Hive
    Issue Type: Wish
    Components: Query Processor
    Reporter: Adam Kramer
    Assignee: David Phillips
    Priority: Minor

The last step in the unholy triumvirate of statistical built-ins is the variance. We already have the n (count) and the mean (avg).

I currently have a job or two that filters all of the data into a single reducer which just computes mean/n/variance and writes it to a table...so my guess is that this would be a pretty big speed increase. Not a huge deal though, as computing the variance myself is trivial.

(Average, variance, and n can be co-computed in one pass, so if you're doing var() you can basically have avg() and count() for free.)

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
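The one-pass remark in HIVE-165 can be illustrated with a small sketch. It uses Welford's online update plus a pairwise merge step, which is one common way to do this and is not something the issue prescribes. The class name `RunningStats` and its methods are hypothetical, not Hive APIs.

```python
from dataclasses import dataclass


@dataclass
class RunningStats:
    """Hypothetical accumulator holding count, mean and squared deviations."""

    n: int = 0
    mean: float = 0.0
    m2: float = 0.0  # running sum of squared deviations from the mean

    def add(self, x: float) -> None:
        # Welford's update: move the mean, then accumulate the deviation.
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def merge(self, other: "RunningStats") -> "RunningStats":
        # Combine two partial results, e.g. from different mappers.
        if self.n == 0:
            return other
        if other.n == 0:
            return self
        n = self.n + other.n
        delta = other.mean - self.mean
        mean = self.mean + delta * other.n / n
        m2 = self.m2 + other.m2 + delta * delta * self.n * other.n / n
        return RunningStats(n, mean, m2)

    def variance(self) -> float:
        # Population variance; divide by (n - 1) instead for the sample form.
        return self.m2 / self.n if self.n else float("nan")
```

Because `merge` combines partial states, count, mean and variance all come out of the same pass, and partial results would not need to be funneled through a single reducer.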
[jira] Updated: (HIVE-160) sampling in a subquery is broken
[ https://issues.apache.org/jira/browse/HIVE-160?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jeff Hammerbacher updated HIVE-160:
-----------------------------------

    Component/s: Query Processor

Adding to Query Processor component.

sampling in a subquery is broken
--------------------------------

    Key: HIVE-160
    URL: https://issues.apache.org/jira/browse/HIVE-160
    Project: Hadoop Hive
    Issue Type: Bug
    Components: Query Processor
    Reporter: Venky Iyer
[jira] Updated: (HIVE-167) Hive: add a RegularExpressionDeserializer
[ https://issues.apache.org/jira/browse/HIVE-167?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jeff Hammerbacher updated HIVE-167:
-----------------------------------

    Component/s: Serializers/Deserializers

Adding to Serializers/Deserializers component.

Hive: add a RegularExpressionDeserializer
-----------------------------------------

    Key: HIVE-167
    URL: https://issues.apache.org/jira/browse/HIVE-167
    Project: Hadoop Hive
    Issue Type: New Feature
    Components: Serializers/Deserializers
    Reporter: Zheng Shao

We need a RegularExpressionDeserializer to read data based on a regex. This will be very useful for reading files such as Apache logs.
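To make the HIVE-167 request concrete, here is a rough sketch of regex-driven row parsing. It only illustrates the idea and is not the proposed Hive deserializer: the pattern targets the Apache common log format, and `COMMON_LOG` and `deserialize` are names made up for this example.

```python
import re

# Assumed pattern for the Apache common log format; each named group
# plays the role of one column.
COMMON_LOG = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\S+)'
)


def deserialize(line, pattern=COMMON_LOG):
    """Return a dict of column name to string, or None if the line does not match."""
    match = pattern.match(line)
    return match.groupdict() if match else None


row = deserialize(
    '127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] '
    '"GET /apache_pb.gif HTTP/1.0" 200 2326'
)
```

Returning `None` for lines that do not match is one possible policy; a real deserializer might instead raise an error or emit a row of nulls.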
[jira] Updated: (HIVE-161) for list column x that is sometimes null, select x.y will cause a nullpointerexception
[ https://issues.apache.org/jira/browse/HIVE-161?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jeff Hammerbacher updated HIVE-161:
-----------------------------------

    Component/s: Query Processor

Adding to Query Processor component.

for list column x that is sometimes null, select x.y will cause a nullpointerexception
--------------------------------------------------------------------------------------

    Key: HIVE-161
    URL: https://issues.apache.org/jira/browse/HIVE-161
    Project: Hadoop Hive
    Issue Type: Bug
    Components: Query Processor
    Reporter: Venky Iyer
    Assignee: Zheng Shao
[jira] Updated: (HIVE-137) Create tests for new date functions
[ https://issues.apache.org/jira/browse/HIVE-137?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jeff Hammerbacher updated HIVE-137:
-----------------------------------

    Component/s: Testing Infrastructure

Adding to Testing Infrastructure component.

Create tests for new date functions
-----------------------------------

    Key: HIVE-137
    URL: https://issues.apache.org/jira/browse/HIVE-137
    Project: Hadoop Hive
    Issue Type: Bug
    Components: Testing Infrastructure
    Reporter: David Phillips

Validate that the date functions actually work.
[jira] Updated: (HIVE-135) need more accurate way of tracking memory consumption on map side aggregates
[ https://issues.apache.org/jira/browse/HIVE-135?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jeff Hammerbacher updated HIVE-135:
-----------------------------------

    Component/s: Query Processor

Adding to Query Processor component.

need more accurate way of tracking memory consumption on map side aggregates
----------------------------------------------------------------------------

    Key: HIVE-135
    URL: https://issues.apache.org/jira/browse/HIVE-135
    Project: Hadoop Hive
    Issue Type: Bug
    Components: Query Processor
    Reporter: Joydeep Sen Sarma

From email thread:

Just trying it out - I am confused by one thing:

    hive> set hive.map.aggr=true;
    set hive.map.aggr=true;
    hive> explain from mytable u insert overwrite directory '/user/jssarma/tmp_agg' select u.a, avg(size(u.b)) group by u.a;

Everything looks good. Now I submit this query and this is what I see on the tracker:

    Map input records   87,912,961  0  87,912,961
    Map output records  87,912,960  0  87,912,960

This doesn't make sense. With map-side aggregates, we should be getting a vastly reduced number of rows emitted from the mapper.

I am wondering whether we should rethink our flushing logic. The freeMemory() call is not reliable (since it doesn't account for stuff that's not cleaned out by GC). Perhaps we should switch to an explicit setting for the amount of memory for hash tables (we do know the size of each hash table entry and the overall size, and should be able to guess reasonably).

From what Dhruba reported, there's no way to call the garbage collector and wait for it to complete (to get a more accurate report of free memory), so the whole route of obtaining free memory seems a little hosed.

By way of comparison, hadoop also estimates memory usage in sorting. There, the sort run is just stored in a sequential stream, and it just takes the size of the stream and compares it to the max allowed sort memory usage (which is a configuration option).
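The explicit-budget idea from HIVE-135 could look roughly like the sketch below. It is an illustration under assumptions, not Hive's flushing code: the class `BoundedHashAggregator`, the `max_bytes` and `bytes_per_entry` settings, and the `emit` callback are all hypothetical. The point is that the flush decision depends only on a configured budget and an estimated entry size, never on a free-memory query.

```python
class BoundedHashAggregator:
    """Hypothetical map-side aggregator that flushes on an explicit budget."""

    def __init__(self, max_bytes, bytes_per_entry, emit):
        # Both sizes are assumed configuration values, not measurements.
        self.max_entries = max(1, max_bytes // bytes_per_entry)
        self.emit = emit
        self.table = {}

    def add(self, key, value):
        # Flush before admitting a new key that would exceed the budget.
        if key not in self.table and len(self.table) >= self.max_entries:
            self.flush()
        # Partial state for avg(): running sum and row count per key.
        total, count = self.table.get(key, (0, 0))
        self.table[key] = (total + value, count + 1)

    def flush(self):
        # Emit every partial aggregate and start over with an empty table.
        for key, partial in self.table.items():
            self.emit(key, partial)
        self.table.clear()


emitted = []
agg = BoundedHashAggregator(
    max_bytes=200,
    bytes_per_entry=100,
    emit=lambda key, partial: emitted.append((key, partial)),
)
for key, value in [("a", 1), ("a", 3), ("b", 2), ("c", 5)]:
    agg.add(key, value)
agg.flush()
```

With a budget of two entries, the two rows for key `a` collapse into one partial aggregate before anything is emitted, which is the row reduction the report expected to see.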
[jira] Updated: (HIVE-136) SerDe should escape some special characters
[ https://issues.apache.org/jira/browse/HIVE-136?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jeff Hammerbacher updated HIVE-136:
-----------------------------------

    Component/s: Serializers/Deserializers

Adding to Serializers/Deserializers component.

SerDe should escape some special characters
-------------------------------------------

    Key: HIVE-136
    URL: https://issues.apache.org/jira/browse/HIVE-136
    Project: Hadoop Hive
    Issue Type: Bug
    Components: Serializers/Deserializers
    Reporter: Zheng Shao

MetadataTypedColumnsetSerDe and DynamicSerDe should escape some special characters like '\n' or the column/item/key separator. Otherwise the data will look corrupted.
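As an illustration of the kind of escaping HIVE-136 asks for, the sketch below round-trips fields that contain a newline or the column separator. The escape scheme (backslash followed by a letter), the choice of Ctrl-A as the separator, and all function names are assumptions made for this example; they do not describe what either SerDe does.

```python
SEP = "\x01"  # Ctrl-A, assumed here as the column separator

# Assumed escape scheme: backslash plus a letter, so that escaped fields
# never contain a raw newline or a raw separator.
_ENCODE = {"\\": "\\\\", "\n": "\\n", SEP: "\\s"}
_DECODE = {"\\": "\\", "n": "\n", "s": SEP}


def escape_field(value):
    return "".join(_ENCODE.get(ch, ch) for ch in value)


def unescape_field(value):
    out = []
    chars = iter(value)
    for ch in chars:
        if ch == "\\":
            # Consume the character after the backslash and map it back.
            nxt = next(chars, "")
            out.append(_DECODE.get(nxt, nxt))
        else:
            out.append(ch)
    return "".join(out)


def serialize_row(fields):
    return SEP.join(escape_field(f) for f in fields)


def deserialize_row(line):
    # Splitting on SEP is safe because escaped fields contain no raw SEP.
    return [unescape_field(f) for f in line.split(SEP)]


row = ["a\nb", "x" + SEP + "y", "back\\slash"]
line = serialize_row(row)
```

Escaping the backslash itself is what keeps the round trip unambiguous: a literal backslash followed by `n` in the data is not confused with an escaped newline.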
[jira] Updated: (HIVE-133) Add support for RecordIO in serde2
[ https://issues.apache.org/jira/browse/HIVE-133?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jeff Hammerbacher updated HIVE-133:
-----------------------------------

    Component/s: Serializers/Deserializers

Adding to Serializers/Deserializers component.

Add support for RecordIO in serde2
----------------------------------

    Key: HIVE-133
    URL: https://issues.apache.org/jira/browse/HIVE-133
    Project: Hadoop Hive
    Issue Type: Improvement
    Components: Serializers/Deserializers
    Affects Versions: 0.19.0
    Reporter: Johan Oskarsson
    Fix For: 0.19.0

Currently there is no support for Hadoop's RecordIO (also known as Jute) in Hive's new serde version 2. I believe quite a few Hadoop installations are using SequenceFiles with keys and values in a combination of normal Writables and generated RecordIO classes.

This issue needs to cover the following points (as suggested by Joydeep Sen Sarma):
- traditionally our serdes have ignored the keys altogether (the row is embedded in the value).
- the jute code was written for an older version of the serde interface and needs to be ported to the new interface
[jira] Updated: (HIVE-124) aggregation on empty table should still return 1 row
[ https://issues.apache.org/jira/browse/HIVE-124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jeff Hammerbacher updated HIVE-124:
-----------------------------------

    Component/s: Query Processor

Adding to Query Processor component.

aggregation on empty table should still return 1 row
----------------------------------------------------

    Key: HIVE-124
    URL: https://issues.apache.org/jira/browse/HIVE-124
    Project: Hadoop Hive
    Issue Type: Bug
    Components: Query Processor
    Reporter: Zheng Shao

The query SELECT COUNT(1) FROM f_status_update fsu WHERE FALSE should return a single row with value 0.

Our code treats that query as SELECT 1, COUNT(1) FROM f_status_update fsu WHERE FALSE GROUP BY 1, but these two queries are not equivalent because the second query will return an empty result if the input is empty.
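The difference between the two queries in HIVE-124 can be reproduced in any engine that follows standard SQL aggregate semantics. The sketch below uses SQLite through Python's sqlite3 module purely as a stand-in for Hive, with a one-column table created for the example and WHERE 0 standing in for WHERE FALSE.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Stand-in table; the real f_status_update schema is not given in the issue.
conn.execute("CREATE TABLE f_status_update (id INTEGER)")

# Aggregate without GROUP BY: always exactly one output row.
no_group = conn.execute(
    "SELECT COUNT(1) FROM f_status_update fsu WHERE 0"
).fetchall()

# Same aggregate with GROUP BY: one row per group, so no rows when no
# input row survives the filter.
with_group = conn.execute(
    "SELECT 1, COUNT(1) FROM f_status_update fsu WHERE 0 GROUP BY 1"
).fetchall()

print(no_group)    # [(0,)]
print(with_group)  # []
conn.close()
```

The first query yields a single row holding 0, while the grouped form yields nothing, which is why rewriting one into the other changes the result on empty input.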
[jira] Updated: (HIVE-125) Hive CLI should allow ; inside quotes
[ https://issues.apache.org/jira/browse/HIVE-125?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jeff Hammerbacher updated HIVE-125:
-----------------------------------

    Component/s: Clients

Adding to Clients component.

Hive CLI should allow ; inside quotes
-------------------------------------

    Key: HIVE-125
    URL: https://issues.apache.org/jira/browse/HIVE-125
    Project: Hadoop Hive
    Issue Type: Bug
    Components: Clients
    Reporter: Zheng Shao

Currently the Hive CLI splits the command line whenever it sees a ; even inside quotes. This prevents users from entering ; in string literals or scripts.
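A quote-aware splitter along the lines HIVE-125 asks for might look like the sketch below. It is not the Hive CLI's code: `split_statements` is a hypothetical helper, and the rules it assumes (single or double quotes, backslash escapes inside quotes) may differ from what the CLI should ultimately accept.

```python
def split_statements(text):
    """Split on ';' except where it appears inside a quoted string."""
    statements = []
    current = []
    quote = None  # the quote character we are inside, if any
    i = 0
    while i < len(text):
        ch = text[i]
        if quote:
            current.append(ch)
            if ch == "\\" and i + 1 < len(text):
                # Keep the escaped character so it cannot close the quote.
                current.append(text[i + 1])
                i += 1
            elif ch == quote:
                quote = None
        elif ch in ("'", '"'):
            quote = ch
            current.append(ch)
        elif ch == ";":
            statement = "".join(current).strip()
            if statement:
                statements.append(statement)
            current = []
        else:
            current.append(ch)
        i += 1
    tail = "".join(current).strip()
    if tail:
        statements.append(tail)
    return statements


parts = split_statements("select 'a;b' from t; select 2;")
```

Here `parts` holds two statements, and the semicolon inside `'a;b'` stays part of the first one.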