[jira] Commented: (HIVE-895) Add SerDe for Avro serialized data

2010-07-17 Thread Alex Rovner (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-895?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12889558#action_12889558
 ] 

Alex Rovner commented on HIVE-895:
--

Can someone please explain to me how this SerDe would work?

Specifically, how would it deserialize the data?

From what I understand, an Avro file has a header that defines the data 
stored in the file. In order to deserialize the data you need to read the 
header, which is a challenge with Hive's Deserializer interface because the 
initialize() method does not know anything about the input file. (Note: there 
is a hack that can get you the file by reading the map.input.file Hadoop 
property; this hack, however, is not good enough in Hive because someone 
might be using the CLI to run a query that does not trigger a map-reduce job.)

Does anyone know a good solution to this issue?

I am actually trying to implement a different file format, but the idea of our 
format is similar to Avro's: each file has a header that contains a "schema".
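As a toy illustration of that layout (a hypothetical sketch with assumed names, not Avro's actual container format and not Hive's SerDe API), a self-describing file can carry its schema in a header that a reader must open the file to see:

```java
import java.io.*;
import java.nio.charset.StandardCharsets;

public class HeaderSchemaDemo {
    // Write a toy file: a 4-byte schema length, the schema text, then records.
    static void write(File f, String schema, String records) throws IOException {
        try (DataOutputStream out = new DataOutputStream(new FileOutputStream(f))) {
            byte[] s = schema.getBytes(StandardCharsets.UTF_8);
            out.writeInt(s.length);   // header: schema length
            out.write(s);             // header: the schema itself
            out.write(records.getBytes(StandardCharsets.UTF_8));
        }
    }

    // Recover the schema by reading only the header. This is exactly the step
    // a Deserializer cannot perform when it never learns which file it reads.
    static String readSchema(File f) throws IOException {
        try (DataInputStream in = new DataInputStream(new FileInputStream(f))) {
            byte[] s = new byte[in.readInt()];
            in.readFully(s);
            return new String(s, StandardCharsets.UTF_8);
        }
    }
}
```

The point of the sketch is that readSchema() needs the concrete file handle, which is what initialize() is never given.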

Thanks

> Add SerDe for Avro serialized data
> --
>
> Key: HIVE-895
> URL: https://issues.apache.org/jira/browse/HIVE-895
> Project: Hadoop Hive
>  Issue Type: New Feature
>  Components: Serializers/Deserializers
>Reporter: Jeff Hammerbacher
>
> As Avro continues to mature, having a SerDe to allow HiveQL queries over Avro 
> data seems like a solid win.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (HIVE-1470) percentile_approx() fails with more than 1 reducer

2010-07-17 Thread John Sichi (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-1470?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12889551#action_12889551
 ] 

John Sichi commented on HIVE-1470:
--

Oh...if it requires more than one reducer, we'll need minimr mode in order to 
exercise it, meaning I need to finish HIVE-117.


> percentile_approx() fails with more than 1 reducer
> --
>
> Key: HIVE-1470
> URL: https://issues.apache.org/jira/browse/HIVE-1470
> Project: Hadoop Hive
>  Issue Type: Bug
>  Components: Query Processor
>Affects Versions: 0.6.0
>Reporter: Mayank Lahiri
>Assignee: Mayank Lahiri
> Fix For: 0.7.0
>
> Attachments: HIVE-1470.1.patch
>
>
> The larger issue is that a UDAF that has variable return types needs two 
> inner Evaluator classes. This patch fixes a NullPointerException bug that is 
> only encountered when partial aggregations are invoked.




[jira] Commented: (HIVE-1470) percentile_approx() fails with more than 1 reducer

2010-07-17 Thread John Sichi (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-1470?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12889550#action_12889550
 ] 

John Sichi commented on HIVE-1470:
--

Unit test?






[jira] Commented: (HIVE-1468) intermediate data produced for select queries ignores hive.exec.compress.intermediate

2010-07-17 Thread Zheng Shao (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-1468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12889528#action_12889528
 ] 

Zheng Shao commented on HIVE-1468:
--

"select queries" means "SELECT" without "INSERT", correct?

I agree that we should treat these queries differently. Specifically, no 
compression (or maybe using LZO to save bandwidth, since clients can be in 
other data centers) will be a big win.


> intermediate data produced for select queries ignores 
> hive.exec.compress.intermediate
> -
>
> Key: HIVE-1468
> URL: https://issues.apache.org/jira/browse/HIVE-1468
> Project: Hadoop Hive
>  Issue Type: Bug
>  Components: Query Processor
>Reporter: Joydeep Sen Sarma
>
> > set hive.exec.compress.intermediate=false;
> > explain extended select xxx from yyy;
> ...
> File Output Operator
>   compressed: true
>   GlobalTableId: 0
> Looks like only the intermediate locations identified during the splitting of 
> MR tasks follow this directive. This should be fixed because it forces 
> clients to always decompress output data (even if the config setting is altered).




[jira] Updated: (HIVE-1463) hive output file names are unnecessarily large

2010-07-17 Thread Joydeep Sen Sarma (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-1463?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joydeep Sen Sarma updated HIVE-1463:


Attachment: 1463.3.patch

This fixes all the issues:

1) The regex is expanded to cover both 17 and later releases; in 17, tasks are 
indeed named _map_ and _reduce_ in local mode.
2) No change to strip leading zeros in the task id, so the ordering of files 
will not be changed by this diff. The filename component being removed is 
constant per map-reduce job (job id + jobtracker id, etc.).
3) A one-line env setting in the build file allows us to control test 
execution logging from hive-log4j.

This passes all the tests. The problem with load_dyn_part2.q was due to 
incorrect regex application: the task id matching has to be applied to the 
last component of the path name only.

As an aside, replaceTaskIdFromFilename would be simpler and easier to 
understand if it just did this: cut the last component, replace the task id, 
concat back, and return.
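A minimal sketch of that suggestion (assumed names and regex, not the actual patch): apply the task-id match to the last path component only, keep the leading zeros so file ordering is unchanged, and concatenate back.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TaskIdRename {
    // Matches an attempt-style last component and captures the task number;
    // covers both _map_/_reduce_ (local mode on older releases) and _m_/_r_.
    private static final Pattern TASK_ID =
        Pattern.compile("^.*_(map|reduce|[mr])_(\\d+)(_\\d+)?$");

    // Cut the last path component, replace it with the bare task id if it
    // matches, and concat back. Leading zeros are deliberately kept so the
    // ordering of files is unchanged.
    static String replaceTaskIdFromFilename(String path) {
        int slash = path.lastIndexOf('/');
        String dir = slash >= 0 ? path.substring(0, slash + 1) : "";
        String name = path.substring(slash + 1);   // last component only
        Matcher m = TASK_ID.matcher(name);
        return dir + (m.matches() ? m.group(2) : name);
    }
}
```

Because the regex is anchored to the final component, a directory name that happens to look like a task id can never be rewritten by accident, which is the failure mode described for load_dyn_part2.q.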

> hive output file names are unnecessarily large
> --
>
> Key: HIVE-1463
> URL: https://issues.apache.org/jira/browse/HIVE-1463
> Project: Hadoop Hive
>  Issue Type: Improvement
>Reporter: Joydeep Sen Sarma
> Attachments: 1463.2.patch, 1463.3.patch, hive-1463.1.patch
>
>
> Hive's output files are named like this:
> attempt_201006221843_431854_r_00_0
> Out of all of this goop, only one character ('0') would have sufficed. We 
> should fix this; it would help environments with namenode memory 
> constraints.
