[jira] [Updated] (HIVE-7292) Hive on Spark
[ https://issues.apache.org/jira/browse/HIVE-7292?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jeff Hammerbacher updated HIVE-7292: Description: Spark as an open-source data analytics cluster computing framework has gained significant momentum recently. Many Hive users already have Spark installed as their computing backbone. To take advantage of Hive, they still need to have either MapReduce or Tez on their cluster. This initiative will provide users a new alternative so that they can consolidate their backends. Secondly, providing such an alternative further increases Hive's adoption, as it exposes Spark users to a viable, feature-rich, de facto standard SQL tool on Hadoop. Finally, allowing Hive to run on Spark also has performance benefits: Hive queries, especially those involving multiple reducer stages, will run faster, improving the user experience as Tez does. This is an umbrella JIRA which will cover many coming subtasks. The design doc will be attached here shortly, and will be on the wiki as well. Feedback from the community is greatly appreciated! was: Spark as an open-source data analytics cluster computing framework has gained significant momentum recently. Many Hive users already have Spark installed as their computing backbone. To take advantages of Hive, they still need to have either MapReduce or Tez on their cluster. This initiative will provide user a new alternative so that those user can consolidate their backend. Secondly, providing such an alternative further increases Hive's adoption as it exposes Spark users to a viable, feature-rich de facto standard SQL tools on Hadoop. Finally, allowing Hive to run on Spark also has performance benefits. Hive queries, especially those involving multiple reducer stages, will run faster, thus improving user experience as Tez does. This is an umber JIRA which will cover many coming subtask. Design doc will be attached here shortly, and will be on the wiki as well. 
Feedback from the community is greatly appreciated! Hive on Spark - Key: HIVE-7292 URL: https://issues.apache.org/jira/browse/HIVE-7292 Project: Hive Issue Type: Improvement Reporter: Xuefu Zhang Assignee: Xuefu Zhang Attachments: Hive-on-Spark.pdf Spark as an open-source data analytics cluster computing framework has gained significant momentum recently. Many Hive users already have Spark installed as their computing backbone. To take advantage of Hive, they still need to have either MapReduce or Tez on their cluster. This initiative will provide users a new alternative so that they can consolidate their backends. Secondly, providing such an alternative further increases Hive's adoption, as it exposes Spark users to a viable, feature-rich, de facto standard SQL tool on Hadoop. Finally, allowing Hive to run on Spark also has performance benefits: Hive queries, especially those involving multiple reducer stages, will run faster, improving the user experience as Tez does. This is an umbrella JIRA which will cover many coming subtasks. The design doc will be attached here shortly, and will be on the wiki as well. Feedback from the community is greatly appreciated! -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (HIVE-2997) Store sort order of table in the metastore
Jeff Hammerbacher created HIVE-2997: --- Summary: Store sort order of table in the metastore Key: HIVE-2997 URL: https://issues.apache.org/jira/browse/HIVE-2997 Project: Hive Issue Type: New Feature Components: Metastore Reporter: Jeff Hammerbacher If a table or view is sorted on a specific column, it would be useful to record this fact in the metastore. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] Commented: (HIVE-1803) Implement bitmap indexing in Hive
[ https://issues.apache.org/jira/browse/HIVE-1803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12999656#comment-12999656 ] Jeff Hammerbacher commented on HIVE-1803: - Hey, I came across a Daniel Lemire project recently that may be of use here: http://code.google.com/p/javaewah. Later, Jeff Implement bitmap indexing in Hive - Key: HIVE-1803 URL: https://issues.apache.org/jira/browse/HIVE-1803 Project: Hive Issue Type: New Feature Reporter: Marquis Wang Assignee: Marquis Wang Attachments: HIVE-1803.1.patch, HIVE-1803.2.patch, HIVE-1803.3.patch, bitmap_index_1.png, bitmap_index_2.png Implement bitmap index handler to complement compact indexing. -- This message is automatically generated by JIRA. - For more information on JIRA, see: http://www.atlassian.com/software/jira
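For readers new to the technique, here is a minimal sketch of how a bitmap index answers a conjunctive predicate. It uses java.util.BitSet for clarity; JavaEWAH's EWAHCompressedBitmap offers the same AND/OR operations over compressed words. All names below are illustrative, not taken from the HIVE-1803 patches.

```java
import java.util.BitSet;

public class BitmapIndexSketch {
    // One bitmap per distinct column value: bit i is set when row i holds that value.
    // Intersecting two bitmaps answers "colA = x AND colB = y" without touching rows.
    public static BitSet rowsMatchingBoth(BitSet valueA, BitSet valueB) {
        BitSet result = (BitSet) valueA.clone();
        result.and(valueB);
        return result;
    }

    public static void main(String[] args) {
        BitSet countryUs = new BitSet();    // rows where country = 'US'
        countryUs.set(0); countryUs.set(2); countryUs.set(5);
        BitSet statusActive = new BitSet(); // rows where status = 'active'
        statusActive.set(2); statusActive.set(3); statusActive.set(5);
        System.out.println(rowsMatchingBoth(countryUs, statusActive)); // {2, 5}
    }
}
```

The appeal of a compressed bitmap library here is that the AND/OR loops run over compressed words directly, so low-cardinality columns index cheaply in both space and time.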
[jira] Commented: (HIVE-1899) add a factory method for creating a synchronized wrapper for IMetaStoreClient
[ https://issues.apache.org/jira/browse/HIVE-1899?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12978685#action_12978685 ] Jeff Hammerbacher commented on HIVE-1899: - Hey John, Could you link this JIRA to the JIRAs for the multithreading bugs? I couldn't track them down. Thanks, Jeff add a factory method for creating a synchronized wrapper for IMetaStoreClient - Key: HIVE-1899 URL: https://issues.apache.org/jira/browse/HIVE-1899 Project: Hive Issue Type: Improvement Components: Metastore Affects Versions: 0.7.0 Reporter: John Sichi Assignee: John Sichi Fix For: 0.7.0 Attachments: HIVE-1899.1.patch There are currently some HiveMetaStoreClient multithreading bugs. This patch adds an (optional) synchronized wrapper for IMetaStoreClient using a dynamic proxy. This can be used for thread safety by multithreaded apps until all reentrancy bugs are fixed. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
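The dynamic-proxy approach the patch describes can be sketched in a few lines. This is not the HIVE-1899 code, just a generic illustration of wrapping any interface so that all calls share one lock:

```java
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.InvocationTargetException;
import java.lang.reflect.Proxy;

public class SynchronizedProxy {
    // Returns a proxy implementing iface; every method call is serialized on a
    // single private lock before being forwarded to the delegate.
    @SuppressWarnings("unchecked")
    public static <T> T wrap(Class<T> iface, T delegate) {
        final Object lock = new Object();
        InvocationHandler handler = (proxy, method, args) -> {
            synchronized (lock) {
                try {
                    return method.invoke(delegate, args);
                } catch (InvocationTargetException e) {
                    throw e.getCause(); // rethrow the delegate's real exception
                }
            }
        };
        return (T) Proxy.newProxyInstance(iface.getClassLoader(),
                                          new Class<?>[] {iface}, handler);
    }
}
```

A factory method along the lines of `SynchronizedProxy.wrap(IMetaStoreClient.class, rawClient)` would then give multithreaded callers a serialized view of the client until the underlying reentrancy bugs are fixed.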
[jira] Commented: (HIVE-1856) Implement DROP TABLE/VIEW ... IF EXISTS
[ https://issues.apache.org/jira/browse/HIVE-1856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12973998#action_12973998 ] Jeff Hammerbacher commented on HIVE-1856: - John: added your comments about patch updates to http://wiki.apache.org/hadoop/Hive/HowToContribute#Updating_a_patch Implement DROP TABLE/VIEW ... IF EXISTS Key: HIVE-1856 URL: https://issues.apache.org/jira/browse/HIVE-1856 Project: Hive Issue Type: New Feature Affects Versions: 0.7.0 Reporter: Marcel Kornacker Assignee: Marcel Kornacker Fix For: 0.7.0 Attachments: hive-1856.patch, hive-1856.patch This issue combines issues HIVE-1550/1165/1542/1551: - augment DROP TABLE/VIEW with IF EXISTS - signal an error if the table/view doesn't exist and IF EXISTS wasn't specified - introduce a flag in the configuration that allows you to turn off the new behavior -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
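The requested semantics are easy to state in code. A hedged sketch (hypothetical names, not Hive's actual implementation) of how a drop operation distinguishes the two modes:

```java
import java.util.Map;

public class DropTableSketch {
    // Drop a table: with ifExists, a missing table is a no-op; without it, an error.
    public static boolean drop(Map<String, Object> catalog, String table, boolean ifExists) {
        if (!catalog.containsKey(table)) {
            if (ifExists) {
                return false;           // DROP TABLE IF EXISTS: silently succeed
            }
            throw new IllegalStateException("Table or view not found: " + table);
        }
        catalog.remove(table);
        return true;
    }
}
```

The configuration flag the issue mentions would simply restore the old lenient behavior for queries that omit IF EXISTS.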
[jira] Commented: (HIVE-1326) RowContainer uses hard-coded '/tmp/' path for temporary files
[ https://issues.apache.org/jira/browse/HIVE-1326?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12973466#action_12973466 ] Jeff Hammerbacher commented on HIVE-1326: - Hey, Could a Hive committer assign this issue to Michael in order to keep the JIRA metadata up to date? Thanks, Jeff RowContainer uses hard-coded '/tmp/' path for temporary files - Key: HIVE-1326 URL: https://issues.apache.org/jira/browse/HIVE-1326 Project: Hive Issue Type: Bug Components: Query Processor Environment: Hadoop 0.19.2 with Hive trunk. We're using FreeBSD 7.0, but that doesn't seem relevant. Reporter: Michael Klatt Fix For: 0.6.0 Attachments: rowcontainer.patch, rowcontainer_v2.patch In our production Hadoop environment, the /tmp/ partition is actually pretty small, and we encountered a problem when a query used the RowContainer class and filled up the /tmp/ partition. I tracked down the cause to the RowContainer class putting temporary files in the '/tmp/' path instead of using the configured Hadoop temporary path. I've attached a patch to fix this. 
Here's the traceback:

2010-04-25 12:05:05,120 INFO org.apache.hadoop.hive.ql.exec.persistence.RowContainer: RowContainer created temp file /tmp/hive-rowcontainer-1244151903/RowContainer7816.tmp
2010-04-25 12:05:06,326 INFO ExecReducer: ExecReducer: processing 1000 rows: used memory = 385520312
2010-04-25 12:05:08,513 INFO ExecReducer: ExecReducer: processing 1100 rows: used memory = 341780472
2010-04-25 12:05:10,697 INFO ExecReducer: ExecReducer: processing 1200 rows: used memory = 301446768
2010-04-25 12:05:12,837 INFO ExecReducer: ExecReducer: processing 1300 rows: used memory = 399208768
2010-04-25 12:05:15,085 INFO ExecReducer: ExecReducer: processing 1400 rows: used memory = 364507216
2010-04-25 12:05:17,260 INFO ExecReducer: ExecReducer: processing 1500 rows: used memory = 332907280
2010-04-25 12:05:19,580 INFO ExecReducer: ExecReducer: processing 1600 rows: used memory = 298774096
2010-04-25 12:05:21,629 INFO ExecReducer: ExecReducer: processing 1700 rows: used memory = 396505408
2010-04-25 12:05:23,830 INFO ExecReducer: ExecReducer: processing 1800 rows: used memory = 362477288
2010-04-25 12:05:25,914 INFO ExecReducer: ExecReducer: processing 1900 rows: used memory = 327229744
2010-04-25 12:05:27,978 INFO ExecReducer: ExecReducer: processing 2000 rows: used memory = 296051904
2010-04-25 12:05:28,155 FATAL ExecReducer: org.apache.hadoop.fs.FSError: java.io.IOException: No space left on device
    at org.apache.hadoop.fs.RawLocalFileSystem$LocalFSFileOutputStream.write(RawLocalFileSystem.java:199)
    at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:65)
    at java.io.BufferedOutputStream.write(BufferedOutputStream.java:109)
    at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.write(FSDataOutputStream.java:49)
    at java.io.DataOutputStream.write(DataOutputStream.java:90)
    at org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSOutputSummer.writeChunk(ChecksumFileSystem.java:346)
    at org.apache.hadoop.fs.FSOutputSummer.writeChecksumChunk(FSOutputSummer.java:150)
    at org.apache.hadoop.fs.FSOutputSummer.flushBuffer(FSOutputSummer.java:132)
    at org.apache.hadoop.fs.FSOutputSummer.flushBuffer(FSOutputSummer.java:121)
    at org.apache.hadoop.fs.FSOutputSummer.write1(FSOutputSummer.java:112)
    at org.apache.hadoop.fs.FSOutputSummer.write(FSOutputSummer.java:86)
    at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.write(FSDataOutputStream.java:49)
    at java.io.DataOutputStream.write(DataOutputStream.java:90)
    at org.apache.hadoop.io.SequenceFile$Writer.append(SequenceFile.java:1013)
    at org.apache.hadoop.io.SequenceFile$Writer.append(SequenceFile.java:977)
    at org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat$1.write(HiveSequenceFileOutputFormat.java:70)
    at org.apache.hadoop.hive.ql.exec.persistence.RowContainer.spillBlock(RowContainer.java:343)
    at org.apache.hadoop.hive.ql.exec.persistence.RowContainer.add(RowContainer.java:163)
    at org.apache.hadoop.hive.ql.exec.JoinOperator.processOp(JoinOperator.java:118)
    at org.apache.hadoop.hive.ql.exec.Operator.process(Operator.java:456)
    at org.apache.hadoop.hive.ql.exec.ExecReducer.reduce(ExecReducer.java:244)
    at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:436)
    at org.apache.hadoop.mapred.Child.main(Child.java:158)
Caused by: java.io.IOException: No space left on device
    at java.io.FileOutputStream.writeBytes(Native Method)
    at
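The fix the reporter describes amounts to resolving the spill directory from configuration rather than hard-coding /tmp. A rough sketch under assumed names (the real patch reads Hadoop's configured temporary path; the parameter here stands in for that):

```java
import java.io.File;
import java.io.IOException;

public class SpillFileSketch {
    // Create a spill file under the configured scratch directory, falling back to
    // the JVM's java.io.tmpdir only when no directory is configured.
    public static File createSpillFile(String configuredDir) throws IOException {
        String dirName = (configuredDir != null && !configuredDir.isEmpty())
                ? configuredDir
                : System.getProperty("java.io.tmpdir");
        File dir = new File(dirName);
        if (!dir.exists() && !dir.mkdirs()) {
            throw new IOException("Cannot create spill directory: " + dir);
        }
        return File.createTempFile("RowContainer", ".tmp", dir);
    }
}
```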
[jira] Commented: (HIVE-1693) Make the compile target depend on thrift.home
[ https://issues.apache.org/jira/browse/HIVE-1693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12973468#action_12973468 ] Jeff Hammerbacher commented on HIVE-1693: - Could someone with JIRA editing privileges assign this issue to Eli to keep the metadata up to date? Thanks, Jeff Make the compile target depend on thrift.home - Key: HIVE-1693 URL: https://issues.apache.org/jira/browse/HIVE-1693 Project: Hive Issue Type: Improvement Components: Build Infrastructure Affects Versions: 0.5.0 Reporter: Eli Collins Priority: Minor Fix For: 0.6.0 Attachments: hive-1693-1.patch Per http://wiki.apache.org/hadoop/Hive/HiveODBC the ant compile targets require that thrift.home be set. Rather than failing to compile, the build should fail with a message indicating that it should be set. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
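The requested behavior, failing early with a clear message instead of an opaque compile error, looks like this in miniature; the property name and wording are illustrative only:

```java
public class BuildPropertyCheck {
    // Fail fast with an actionable message when a required build property is unset.
    public static String requireProperty(String key, String hint) {
        String value = System.getProperty(key);
        if (value == null || value.isEmpty()) {
            throw new IllegalStateException(
                "Required build property '" + key + "' is not set. " + hint);
        }
        return value;
    }

    public static void main(String[] args) {
        // e.g. requireProperty("thrift.home", "Set it with -Dthrift.home=/path/to/thrift");
    }
}
```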
[jira] Resolved: (HIVE-650) [UDAF] implement GROUP_CONCAT(expr)
[ https://issues.apache.org/jira/browse/HIVE-650?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jeff Hammerbacher resolved HIVE-650. Resolution: Duplicate Resolving as a duplicate of HIVE-707 to concentrate conversation on that ticket (since most of the discussion has happened there). [UDAF] implement GROUP_CONCAT(expr) - Key: HIVE-650 URL: https://issues.apache.org/jira/browse/HIVE-650 Project: Hive Issue Type: New Feature Reporter: Min Zhou It's a very useful udaf for us. http://dev.mysql.com/doc/refman/5.0/en/group-by-functions.html#function_group-concat GROUP_CONCAT(expr) This function returns a string result with the concatenated non-NULL values from a group. It returns NULL if there are no non-NULL values. The full syntax is as follows: GROUP_CONCAT([DISTINCT] expr [,expr ...] [ORDER BY {unsigned_integer | col_name | expr} [ASC | DESC] [,col_name ...]] [SEPARATOR str_val]) -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (HIVE-707) add group_concat
[ https://issues.apache.org/jira/browse/HIVE-707?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12935463#action_12935463 ] Jeff Hammerbacher commented on HIVE-707: Hey, Given that this JIRA has been opened three separate times, and that I have received a recent request for it in IRC, I think it would be worth bumping to near the top of the queue. Thanks, Jeff add group_concat Key: HIVE-707 URL: https://issues.apache.org/jira/browse/HIVE-707 Project: Hive Issue Type: New Feature Components: Query Processor Reporter: Namit Jain Assignee: Min Zhou Moving the discussion to a new jira: I've implemented group_cat() in a rush, and found something difficult to solve: 1. function group_cat() has an internal order by clause; currently, we can't implement such an aggregation in Hive. 2. when the strings to be group-concatenated are too large, in other words, when data skew appears, there is often not enough memory to store such a big result. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
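Setting aside the two hard problems raised above (the internal ORDER BY and memory for skewed groups), the basic per-group semantics being requested, following MySQL's documented GROUP_CONCAT behavior, can be sketched as follows; the names are illustrative:

```java
import java.util.List;
import java.util.Objects;
import java.util.stream.Collectors;

public class GroupConcatSketch {
    // Concatenate the non-NULL values of one group with a separator;
    // a group with no non-NULL values yields NULL, matching GROUP_CONCAT.
    public static String groupConcat(List<String> groupValues, String separator) {
        List<String> nonNull = groupValues.stream()
                .filter(Objects::nonNull)
                .collect(Collectors.toList());
        return nonNull.isEmpty() ? null : String.join(separator, nonNull);
    }
}
```

The data-skew concern is visible even in this toy: the whole concatenated result for a group must fit in memory before it is emitted, which is exactly what breaks when one group is enormous.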
[jira] Commented: (HIVE-1107) Generic parallel execution framework for Hive (and Pig, and ...)
[ https://issues.apache.org/jira/browse/HIVE-1107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12933174#action_12933174 ] Jeff Hammerbacher commented on HIVE-1107: - bq. I agree with Russell that Oozie seems too complicated for this task. Could you provide more color here? What aspects of Oozie make it too complicated for this task? Generic parallel execution framework for Hive (and Pig, and ...) Key: HIVE-1107 URL: https://issues.apache.org/jira/browse/HIVE-1107 Project: Hive Issue Type: New Feature Components: Query Processor Reporter: Carl Steinbach Pig and Hive each have their own libraries for handling plan execution. As we prepare to invest more time improving Hive's plan execution mechanism we should also start to consider ways of building a generic plan execution mechanism that is capable of supporting the needs of Hive and Pig, as well as other Hadoop data flow programming environments. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (HIVE-1107) Generic parallel execution framework for Hive (and Pig, and ...)
[ https://issues.apache.org/jira/browse/HIVE-1107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12933194#action_12933194 ] Jeff Hammerbacher commented on HIVE-1107: - Okay, thanks. Let me try to pull apart the issues so that I can understand them: bq. Oozie is more complex than Pig and HIVE put together Compare their manuals, both in terms of length and readability. bq. Oozie is (nearly?) turing complete XML, not easily human readable script, and scheduling one job takes far too much of it. bq. Also, there is no need to force Oozie either, people can use Azkaban etc. for workflow. Each of these objects seem moot, given that Oozie would be targeted by the Hive and Pig developers, not the Hive and Pig users. No Hive or Pig user would be required to write Oozie: the configuration files would be generated by the Hive and Pig query planners, from my understanding. bq. I believe, mid-to-long term, that Pig/Hive will get significantly smarter about the way they construct MR jobs - they will want to run some of the nodes in the DAG, wait for their output (e.g. a sampler) and then make ever more complicated decisions to modify the DAG. I believe Oozie isn't the right tool to be using for this purpose. Adaptive query optimization is indeed a noble goal. Oozie seems to think at the level of workflow rather than dataflow, so as you say, it may not be an appropriate layer for performing these optimizations. I'm not sure if it detracts from the ability of Hive or Pig to perform adaptive query optimization though, either. Anyways, thanks for the discussion. We're certainly thinking through these issues as well. Generic parallel execution framework for Hive (and Pig, and ...) Key: HIVE-1107 URL: https://issues.apache.org/jira/browse/HIVE-1107 Project: Hive Issue Type: New Feature Components: Query Processor Reporter: Carl Steinbach Pig and Hive each have their own libraries for handling plan execution. 
As we prepare to invest more time improving Hive's plan execution mechanism we should also start to consider ways of building a generic plan execution mechanism that is capable of supporting the needs of Hive and Pig, as well as other Hadoop data flow programming environments. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (HIVE-1107) Generic parallel execution framework for Hive (and Pig, and ...)
[ https://issues.apache.org/jira/browse/HIVE-1107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12933196#action_12933196 ] Jeff Hammerbacher commented on HIVE-1107: - Gah, can't edit, but of course I meant objections, not objects. Generic parallel execution framework for Hive (and Pig, and ...) Key: HIVE-1107 URL: https://issues.apache.org/jira/browse/HIVE-1107 Project: Hive Issue Type: New Feature Components: Query Processor Reporter: Carl Steinbach Pig and Hive each have their own libraries for handling plan execution. As we prepare to invest more time improving Hive's plan execution mechanism we should also start to consider ways of building a generic plan execution mechanism that is capable of supporting the needs of Hive and Pig, as well as other Hadoop data flow programming environments. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (HIVE-787) Hive Freeway - support near-realtime data processing
[ https://issues.apache.org/jira/browse/HIVE-787?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12926192#action_12926192 ] Jeff Hammerbacher commented on HIVE-787: More details on the Data Freeway implementation at Facebook: http://vimeo.com/15337985 Hive Freeway - support near-realtime data processing Key: HIVE-787 URL: https://issues.apache.org/jira/browse/HIVE-787 Project: Hive Issue Type: New Feature Reporter: Zheng Shao Most people are using Hive for daily (or at most hourly) data processing. We want to explore what the obstacles are to using Hive at 15-minute, 5-minute, or even 1-minute data processing intervals, and remove those obstacles. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.