[jira] [Updated] (HIVE-7292) Hive on Spark

2014-07-01 Thread Jeff Hammerbacher (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-7292?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jeff Hammerbacher updated HIVE-7292:


Description: 
Spark, an open-source data analytics cluster computing framework, has gained 
significant momentum recently. Many Hive users already have Spark installed as 
their computing backbone. To take advantage of Hive, they still need to have 
either MapReduce or Tez on their cluster. This initiative will provide users a 
new alternative so that they can consolidate their backends. 

Secondly, providing such an alternative further increases Hive's adoption, as it 
exposes Spark users to a viable, feature-rich, de facto standard SQL tool on 
Hadoop.

Finally, allowing Hive to run on Spark also has performance benefits. Hive 
queries, especially those involving multiple reducer stages, will run faster, 
thus improving the user experience, as Tez does.

This is an umbrella JIRA which will cover many coming subtasks. A design doc will 
be attached here shortly, and will be on the wiki as well. Feedback from the 
community is greatly appreciated!



 Hive on Spark
 -

 Key: HIVE-7292
 URL: https://issues.apache.org/jira/browse/HIVE-7292
 Project: Hive
  Issue Type: Improvement
Reporter: Xuefu Zhang
Assignee: Xuefu Zhang
 Attachments: Hive-on-Spark.pdf





--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (HIVE-2997) Store sort order of table in the metastore

2012-05-02 Thread Jeff Hammerbacher (JIRA)
Jeff Hammerbacher created HIVE-2997:
---

 Summary: Store sort order of table in the metastore
 Key: HIVE-2997
 URL: https://issues.apache.org/jira/browse/HIVE-2997
 Project: Hive
  Issue Type: New Feature
  Components: Metastore
Reporter: Jeff Hammerbacher


If a table or view is sorted on a specific column, it would be useful to record 
this fact in the metastore.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] Commented: (HIVE-1803) Implement bitmap indexing in Hive

2011-02-25 Thread Jeff Hammerbacher (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-1803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12999656#comment-12999656
 ] 

Jeff Hammerbacher commented on HIVE-1803:
-

Hey,

I came across a Daniel Lemire project recently that may be of use here: 
http://code.google.com/p/javaewah.

Later,
Jeff
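
For readers new to the technique, the core of a bitmap index is that equality 
predicates become bitwise operations over per-value bitsets. A toy sketch 
(plain Python ints as bitsets, not javaewah's compressed word-aligned EWAH 
format):

```python
# Toy bitmap index: one bitset per distinct column value, with bit i set
# when row i holds that value. Python ints serve as arbitrary-length bitsets.

def build_index(column):
    """Map each distinct value to a bitset of the rows containing it."""
    index = {}
    for row, value in enumerate(column):
        index[value] = index.get(value, 0) | (1 << row)
    return index

def matching_rows(bitset):
    """Expand a bitset back into a sorted list of row ids."""
    rows = []
    row = 0
    while bitset:
        if bitset & 1:
            rows.append(row)
        bitset >>= 1
        row += 1
    return rows

color = build_index(["red", "blue", "red", "green", "blue"])
size  = build_index(["S", "M", "M", "S", "M"])

# Predicate color = 'blue' AND size = 'M' becomes a single bitwise AND.
hits = matching_rows(color["blue"] & size["M"])  # rows 1 and 4
```

Libraries like javaewah add run-length compression on top of this idea so that 
sparse or clustered bitsets stay small.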

 Implement bitmap indexing in Hive
 -

 Key: HIVE-1803
 URL: https://issues.apache.org/jira/browse/HIVE-1803
 Project: Hive
  Issue Type: New Feature
Reporter: Marquis Wang
Assignee: Marquis Wang
 Attachments: HIVE-1803.1.patch, HIVE-1803.2.patch, HIVE-1803.3.patch, 
 bitmap_index_1.png, bitmap_index_2.png


 Implement bitmap index handler to complement compact indexing.

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] Commented: (HIVE-1899) add a factory method for creating a synchronized wrapper for IMetaStoreClient

2011-01-07 Thread Jeff Hammerbacher (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-1899?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12978685#action_12978685
 ] 

Jeff Hammerbacher commented on HIVE-1899:
-

Hey John,

Could you link this JIRA to the JIRAs for the multithreading bugs? I couldn't 
track them down.

Thanks,
Jeff

 add a factory method for creating a synchronized wrapper for IMetaStoreClient
 -

 Key: HIVE-1899
 URL: https://issues.apache.org/jira/browse/HIVE-1899
 Project: Hive
  Issue Type: Improvement
  Components: Metastore
Affects Versions: 0.7.0
Reporter: John Sichi
Assignee: John Sichi
 Fix For: 0.7.0

 Attachments: HIVE-1899.1.patch


 There are currently some HiveMetaStoreClient multithreading bugs.  This patch 
 adds an (optional) synchronized wrapper for IMetaStoreClient using a dynamic 
 proxy.  This can be used for thread safety by multithreaded apps until all 
 reentrancy bugs are fixed.
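
The pattern is not Java-specific. As a hypothetical illustration (names 
invented here, not Hive's IMetaStoreClient API), the same coarse-grained 
synchronized wrapper can be sketched in Python:

```python
import threading

class SynchronizedWrapper:
    """Wrap any object so that every method call holds a single lock,
    giving coarse-grained thread safety until reentrancy bugs are fixed."""

    def __init__(self, delegate):
        self._delegate = delegate
        self._lock = threading.Lock()

    def __getattr__(self, name):
        # Called only for attributes not found on the wrapper itself,
        # so it intercepts all delegate lookups.
        attr = getattr(self._delegate, name)
        if not callable(attr):
            return attr
        def synchronized(*args, **kwargs):
            with self._lock:
                return attr(*args, **kwargs)
        return synchronized

# Usage sketch: client = SynchronizedWrapper(RealMetaStoreClient())
```

Java's dynamic proxy plays the role of `__getattr__` here: one interception 
point wraps every interface method without hand-writing delegating stubs.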

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (HIVE-1856) Implement DROP TABLE/VIEW ... IF EXISTS

2010-12-21 Thread Jeff Hammerbacher (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-1856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12973998#action_12973998
 ] 

Jeff Hammerbacher commented on HIVE-1856:
-

John: added your comments about patch updates to 
http://wiki.apache.org/hadoop/Hive/HowToContribute#Updating_a_patch

 Implement DROP TABLE/VIEW ... IF EXISTS 
 

 Key: HIVE-1856
 URL: https://issues.apache.org/jira/browse/HIVE-1856
 Project: Hive
  Issue Type: New Feature
Affects Versions: 0.7.0
Reporter: Marcel Kornacker
Assignee: Marcel Kornacker
 Fix For: 0.7.0

 Attachments: hive-1856.patch, hive-1856.patch


 This issue combines issues HIVE-1550/1165/1542/1551:
 - augment DROP TABLE/VIEW with IF EXISTS
 - signal an error if the table/view doesn't exist and IF EXISTS wasn't 
 specified
 - introduce a flag in the configuration that allows you to turn off the new 
 behavior
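
The combined semantics above can be sketched in a few lines (hypothetical 
function and flag names, not Hive's actual code path):

```python
class NoSuchTableError(Exception):
    """Raised when dropping a missing table without IF EXISTS."""

def drop_table(catalog, name, if_exists=False, ignore_missing=False):
    """Drop `name` from `catalog`.

    Without IF EXISTS, a missing table is an error, unless the legacy
    ignore_missing configuration flag turns the new behavior off.
    """
    if name not in catalog:
        if if_exists or ignore_missing:
            return  # silently succeed
        raise NoSuchTableError(name)
    del catalog[name]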

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (HIVE-1326) RowContainer uses hard-coded '/tmp/' path for temporary files

2010-12-20 Thread Jeff Hammerbacher (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-1326?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12973466#action_12973466
 ] 

Jeff Hammerbacher commented on HIVE-1326:
-

Hey,

Could a Hive committer assign this issue to Michael in order to keep the JIRA 
metadata up to date?

Thanks,
Jeff

 RowContainer uses hard-coded '/tmp/' path for temporary files
 -

 Key: HIVE-1326
 URL: https://issues.apache.org/jira/browse/HIVE-1326
 Project: Hive
  Issue Type: Bug
  Components: Query Processor
 Environment: Hadoop 0.19.2 with Hive trunk.  We're using FreeBSD 7.0, 
 but that doesn't seem relevant.
Reporter: Michael Klatt
 Fix For: 0.6.0

 Attachments: rowcontainer.patch, rowcontainer_v2.patch


 In our production Hadoop environment, the /tmp/ partition is actually pretty 
 small, and we encountered a problem when a query used the RowContainer class 
 and filled up the /tmp/ partition.  I tracked down the cause to the RowContainer 
 class putting temporary files under the hard-coded '/tmp/' path instead of using 
 the configured Hadoop temporary path.  I've attached a patch to fix this.
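 In spirit, the fix is to thread the configured scratch directory through to 
 wherever spill files are created, instead of relying on the platform default. 
 A rough Python analogue (illustrative only; the actual patch is against the 
 Java RowContainer class):

```python
import tempfile

def create_spill_file(configured_tmp_dir=None):
    """Create a spill file under the configured scratch directory.

    Passing dir=None falls back to the platform default temp directory,
    which is exactly the hard-coded-/tmp/ behavior the patch removes.
    """
    return tempfile.NamedTemporaryFile(
        prefix="hive-rowcontainer-", suffix=".tmp", dir=configured_tmp_dir)

# Usage sketch (hypothetical config lookup):
# spill = create_spill_file(configured_tmp_dir=job_conf.get("hadoop.tmp.dir"))
```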
 Here's the traceback:
 2010-04-25 12:05:05,120 INFO 
 org.apache.hadoop.hive.ql.exec.persistence.RowContainer: RowContainer created 
 temp file /tmp/hive-rowcontainer-1244151903/RowContainer7816.tmp
 2010-04-25 12:05:06,326 INFO ExecReducer: ExecReducer: processing 1000 
 rows: used memory = 385520312
 2010-04-25 12:05:08,513 INFO ExecReducer: ExecReducer: processing 1100 
 rows: used memory = 341780472
 2010-04-25 12:05:10,697 INFO ExecReducer: ExecReducer: processing 1200 
 rows: used memory = 301446768
 2010-04-25 12:05:12,837 INFO ExecReducer: ExecReducer: processing 1300 
 rows: used memory = 399208768
 2010-04-25 12:05:15,085 INFO ExecReducer: ExecReducer: processing 1400 
 rows: used memory = 364507216
 2010-04-25 12:05:17,260 INFO ExecReducer: ExecReducer: processing 1500 
 rows: used memory = 332907280
 2010-04-25 12:05:19,580 INFO ExecReducer: ExecReducer: processing 1600 
 rows: used memory = 298774096
 2010-04-25 12:05:21,629 INFO ExecReducer: ExecReducer: processing 1700 
 rows: used memory = 396505408
 2010-04-25 12:05:23,830 INFO ExecReducer: ExecReducer: processing 1800 
 rows: used memory = 362477288
 2010-04-25 12:05:25,914 INFO ExecReducer: ExecReducer: processing 1900 
 rows: used memory = 327229744
 2010-04-25 12:05:27,978 INFO ExecReducer: ExecReducer: processing 2000 
 rows: used memory = 296051904
 2010-04-25 12:05:28,155 FATAL ExecReducer: org.apache.hadoop.fs.FSError: 
 java.io.IOException: No space left on device
   at 
 org.apache.hadoop.fs.RawLocalFileSystem$LocalFSFileOutputStream.write(RawLocalFileSystem.java:199)
   at 
 java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:65)
   at java.io.BufferedOutputStream.write(BufferedOutputStream.java:109)
   at 
 org.apache.hadoop.fs.FSDataOutputStream$PositionCache.write(FSDataOutputStream.java:49)
   at java.io.DataOutputStream.write(DataOutputStream.java:90)
   at 
 org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSOutputSummer.writeChunk(ChecksumFileSystem.java:346)
   at 
 org.apache.hadoop.fs.FSOutputSummer.writeChecksumChunk(FSOutputSummer.java:150)
   at 
 org.apache.hadoop.fs.FSOutputSummer.flushBuffer(FSOutputSummer.java:132)
   at 
 org.apache.hadoop.fs.FSOutputSummer.flushBuffer(FSOutputSummer.java:121)
   at org.apache.hadoop.fs.FSOutputSummer.write1(FSOutputSummer.java:112)
   at org.apache.hadoop.fs.FSOutputSummer.write(FSOutputSummer.java:86)
   at 
 org.apache.hadoop.fs.FSDataOutputStream$PositionCache.write(FSDataOutputStream.java:49)
   at java.io.DataOutputStream.write(DataOutputStream.java:90)
   at 
 org.apache.hadoop.io.SequenceFile$Writer.append(SequenceFile.java:1013)
   at 
 org.apache.hadoop.io.SequenceFile$Writer.append(SequenceFile.java:977)
   at 
 org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat$1.write(HiveSequenceFileOutputFormat.java:70)
   at 
 org.apache.hadoop.hive.ql.exec.persistence.RowContainer.spillBlock(RowContainer.java:343)
   at 
 org.apache.hadoop.hive.ql.exec.persistence.RowContainer.add(RowContainer.java:163)
   at 
 org.apache.hadoop.hive.ql.exec.JoinOperator.processOp(JoinOperator.java:118)
   at org.apache.hadoop.hive.ql.exec.Operator.process(Operator.java:456)
   at 
 org.apache.hadoop.hive.ql.exec.ExecReducer.reduce(ExecReducer.java:244)
   at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:436)
   at org.apache.hadoop.mapred.Child.main(Child.java:158)
 Caused by: java.io.IOException: No space left on device
   at java.io.FileOutputStream.writeBytes(Native Method)
   at 

[jira] Commented: (HIVE-1693) Make the compile target depend on thrift.home

2010-12-20 Thread Jeff Hammerbacher (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-1693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12973468#action_12973468
 ] 

Jeff Hammerbacher commented on HIVE-1693:
-

Could someone with JIRA editing privileges assign this issue to Eli to keep the 
metadata up to date? Thanks, Jeff

 Make the compile target depend on thrift.home
 -

 Key: HIVE-1693
 URL: https://issues.apache.org/jira/browse/HIVE-1693
 Project: Hive
  Issue Type: Improvement
  Components: Build Infrastructure
Affects Versions: 0.5.0
Reporter: Eli Collins
Priority: Minor
 Fix For: 0.6.0

 Attachments: hive-1693-1.patch


 Per http://wiki.apache.org/hadoop/Hive/HiveODBC, the ant compile targets 
 require thrift.home to be set. Rather than failing to compile, the build 
 should fail with a message indicating that it should be set.
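
 The requested behavior is a fail-fast configuration check. A generic sketch 
 (using an environment variable as a stand-in for the Ant thrift.home 
 property, which this is not the actual implementation of):

```python
import os

def require_setting(name, hint):
    """Fail fast with an actionable message instead of a confusing
    compile error later in the build."""
    value = os.environ.get(name)
    if not value:
        raise SystemExit(f"{name} is not set. {hint}")
    return value

# Usage sketch:
# thrift_home = require_setting("THRIFT_HOME",
#     "Point it at your Thrift installation before running the compile target.")
```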




[jira] Resolved: (HIVE-650) [UDAF] implement GROUP_CONCAT(expr)

2010-11-24 Thread Jeff Hammerbacher (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-650?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jeff Hammerbacher resolved HIVE-650.


Resolution: Duplicate

Resolving as duplicate of HIVE-707 and to concentrate conversation on that 
ticket (since most of the discussion has happened there).

 [UDAF]  implement  GROUP_CONCAT(expr)
 -

 Key: HIVE-650
 URL: https://issues.apache.org/jira/browse/HIVE-650
 Project: Hive
  Issue Type: New Feature
Reporter: Min Zhou

 It's a very useful UDAF for us. 
 http://dev.mysql.com/doc/refman/5.0/en/group-by-functions.html#function_group-concat
 GROUP_CONCAT(expr)
 This function returns a string result with the concatenated non-NULL values 
 from a group. It returns NULL if there are no non-NULL values. The full 
 syntax is as follows: 
 GROUP_CONCAT([DISTINCT] expr [,expr ...]
  [ORDER BY {unsigned_integer | col_name | expr}
  [ASC | DESC] [,col_name ...]]
  [SEPARATOR str_val])
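
 For reference, the aggregate's semantics can be mimicked in a few lines 
 outside Hive (a sketch of what GROUP_CONCAT computes per group, ignoring the 
 ORDER BY clause; not an implementation proposal):

```python
def group_concat(values, distinct=False, separator=","):
    """Concatenate the non-NULL (None) values in one group.

    Returns None when the group has no non-NULL values, matching the
    MySQL GROUP_CONCAT semantics quoted above.
    """
    kept = [str(v) for v in values if v is not None]
    if distinct:
        seen = set()
        kept = [v for v in kept if not (v in seen or seen.add(v))]
    if not kept:
        return None
    return separator.join(kept)
```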




[jira] Commented: (HIVE-707) add group_concat

2010-11-24 Thread Jeff Hammerbacher (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-707?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12935463#action_12935463
 ] 

Jeff Hammerbacher commented on HIVE-707:


Hey,

Given that this JIRA has been opened three separate times, and that I have 
received a recent request for it in IRC, I think it would be worth bumping to 
near the top of the queue.

Thanks,
Jeff

 add group_concat
 

 Key: HIVE-707
 URL: https://issues.apache.org/jira/browse/HIVE-707
 Project: Hive
  Issue Type: New Feature
  Components: Query Processor
Reporter: Namit Jain
Assignee: Min Zhou

 Moving the discussion to a new jira:
 I've implemented group_concat() in a rush, and found some things difficult to 
 solve:
 1. group_concat() has an internal ORDER BY clause; currently, we can't 
 implement such an aggregation in Hive.
 2. when the strings to be group-concatenated are too large (in other words, 
 when data skew appears), there is often not enough memory to store such a big 
 result.




[jira] Commented: (HIVE-1107) Generic parallel execution framework for Hive (and Pig, and ...)

2010-11-17 Thread Jeff Hammerbacher (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-1107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12933174#action_12933174
 ] 

Jeff Hammerbacher commented on HIVE-1107:
-

bq. I agree with Russell that Oozie seems too complicated for this task.

Could you provide more color here? What aspects of Oozie make it too 
complicated for this task?

 Generic parallel execution framework for Hive (and Pig, and ...)
 

 Key: HIVE-1107
 URL: https://issues.apache.org/jira/browse/HIVE-1107
 Project: Hive
  Issue Type: New Feature
  Components: Query Processor
Reporter: Carl Steinbach

 Pig and Hive each have their own libraries for handling plan execution. As we 
 prepare to invest more time improving Hive's plan execution mechanism we 
 should also start to consider ways of building a generic plan execution 
 mechanism that is capable of supporting the needs of Hive and Pig, as well as 
 other Hadoop data flow programming environments. 




[jira] Commented: (HIVE-1107) Generic parallel execution framework for Hive (and Pig, and ...)

2010-11-17 Thread Jeff Hammerbacher (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-1107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12933194#action_12933194
 ] 

Jeff Hammerbacher commented on HIVE-1107:
-

Okay, thanks. Let me try to pull apart the issues so that I can understand them:

bq. Oozie is more complex than Pig and Hive put together. Compare their manuals, 
both in terms of length and readability.

bq. Oozie is (nearly?) Turing-complete XML, not an easily human-readable script, 
and scheduling one job takes far too much of it.

bq. Also, there is no need to force Oozie either, people can use Azkaban etc. 
for workflow.

Each of these objects seem moot, given that Oozie would be targeted by the Hive 
and Pig developers, not the Hive and Pig users. No Hive or Pig user would be 
required to write Oozie: the configuration files would be generated by the Hive 
and Pig query planners, from my understanding.

bq. I believe, mid-to-long term, that Pig/Hive will get significantly smarter 
about the way they construct MR jobs - they will want to run some of the nodes 
in the DAG, wait for their output (e.g. a sampler) and then make ever more 
complicated decisions to modify the DAG. I believe Oozie isn't the right tool 
to be using for this purpose.

Adaptive query optimization is indeed a noble goal. Oozie seems to think at the 
level of workflow rather than dataflow, so as you say, it may not be an 
appropriate layer for performing these optimizations. I'm not sure if it 
detracts from the ability of Hive or Pig to perform adaptive query optimization 
though, either.

Anyways, thanks for the discussion. We're certainly thinking through these 
issues as well.

 Generic parallel execution framework for Hive (and Pig, and ...)
 

 Key: HIVE-1107
 URL: https://issues.apache.org/jira/browse/HIVE-1107
 Project: Hive
  Issue Type: New Feature
  Components: Query Processor
Reporter: Carl Steinbach

 Pig and Hive each have their own libraries for handling plan execution. As we 
 prepare to invest more time improving Hive's plan execution mechanism we 
 should also start to consider ways of building a generic plan execution 
 mechanism that is capable of supporting the needs of Hive and Pig, as well as 
 other Hadoop data flow programming environments. 




[jira] Commented: (HIVE-1107) Generic parallel execution framework for Hive (and Pig, and ...)

2010-11-17 Thread Jeff Hammerbacher (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-1107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12933196#action_12933196
 ] 

Jeff Hammerbacher commented on HIVE-1107:
-

Gah, can't edit, but of course I meant objections, not objects.

 Generic parallel execution framework for Hive (and Pig, and ...)
 

 Key: HIVE-1107
 URL: https://issues.apache.org/jira/browse/HIVE-1107
 Project: Hive
  Issue Type: New Feature
  Components: Query Processor
Reporter: Carl Steinbach

 Pig and Hive each have their own libraries for handling plan execution. As we 
 prepare to invest more time improving Hive's plan execution mechanism we 
 should also start to consider ways of building a generic plan execution 
 mechanism that is capable of supporting the needs of Hive and Pig, as well as 
 other Hadoop data flow programming environments. 




[jira] Commented: (HIVE-787) Hive Freeway - support near-realtime data processing

2010-10-29 Thread Jeff Hammerbacher (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-787?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12926192#action_12926192
 ] 

Jeff Hammerbacher commented on HIVE-787:


More details on the Data Freeway implementation at Facebook: 
http://vimeo.com/15337985

 Hive Freeway - support near-realtime data processing
 

 Key: HIVE-787
 URL: https://issues.apache.org/jira/browse/HIVE-787
 Project: Hive
  Issue Type: New Feature
Reporter: Zheng Shao

 Most people are using Hive for daily (or at most hourly) data processing.
 We want to explore what the obstacles are to using Hive for 15-minute, 
 5-minute, or even 1-minute data processing intervals, and remove those 
 obstacles.
