[jira] Commented: (HIVE-1482) Not all jdbc calls are threadsafe.
[ https://issues.apache.org/jira/browse/HIVE-1482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12895997#action_12895997 ]

Bennie Schut commented on HIVE-1482:

OK, I think I covered the synchronization work as you suggested. For the state part, I looked at how SessionState is used within HiveConnection. There are several spots where we pass a SessionState object (HivePreparedStatement, HiveStatement), but neither place ends up actually using it. The simplest thing would be to stop passing it. Even if we do pass it, it only contains session data, not query-specific data. Query-specific things like column names/types can be handled within a synchronized block together with the client call for the query itself, so that shouldn't be a problem. I'll upload a patch with what I have soon.

Not all jdbc calls are threadsafe.

Key: HIVE-1482
URL: https://issues.apache.org/jira/browse/HIVE-1482
Project: Hadoop Hive
Issue Type: Bug
Components: Drivers
Affects Versions: 0.7.0
Reporter: Bennie Schut
Fix For: 0.7.0

As per the JDBC spec, they should be threadsafe:
http://download.oracle.com/docs/cd/E17476_01/javase/1.3/docs/guide/jdbc/spec/jdbc-spec.frame9.html
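For readers following along, here is a minimal, hypothetical Java sketch of the pattern being discussed: calls to a shared, non-threadsafe client are serialized on one lock, and query-specific metadata is fetched inside the same synchronized block as the query itself. The class and method names are illustrative assumptions, not the actual Hive JDBC driver code or the attached patch.

import java.util.List;

// Stand-in for the Hive Thrift client interface; not the real API.
interface ThriftClientStub {
  void execute(String sql) throws Exception;
  List<String> getColumnNames() throws Exception;
}

class StatementSketch {
  private final ThriftClientStub client; // shared across statements on one connection
  private final Object clientLock;       // one lock object per connection

  StatementSketch(ThriftClientStub client, Object clientLock) {
    this.client = client;
    this.clientLock = clientLock;
  }

  List<String> executeAndDescribe(String sql) throws Exception {
    // Both round trips happen under the same lock, so another thread cannot
    // interleave its own execute() between the query and the metadata fetch.
    synchronized (clientLock) {
      client.execute(sql);
      return client.getColumnNames();
    }
  }
}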
[jira] Updated: (HIVE-1482) Not all jdbc calls are threadsafe.
[ https://issues.apache.org/jira/browse/HIVE-1482?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bennie Schut updated HIVE-1482:

Attachment: HIVE-1482-1.patch

The full test suite ran successfully on my PC, and no additional checkstyle errors were found. Anything else we can think of for this?

Not all jdbc calls are threadsafe.

Key: HIVE-1482
URL: https://issues.apache.org/jira/browse/HIVE-1482
Project: Hadoop Hive
Issue Type: Bug
Components: Drivers
Affects Versions: 0.7.0
Reporter: Bennie Schut
Fix For: 0.7.0
Attachments: HIVE-1482-1.patch

As per the JDBC spec, they should be threadsafe:
http://download.oracle.com/docs/cd/E17476_01/javase/1.3/docs/guide/jdbc/spec/jdbc-spec.frame9.html
[jira] Updated: (HIVE-1513) hive starter scripts should load admin/user supplied script for configurability
[ https://issues.apache.org/jira/browse/HIVE-1513?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Joydeep Sen Sarma updated HIVE-1513:

Attachment: 1513.1.patch

Simple change to let the hive starter script include conf/hive-env.sh; a template is provided as an example. Ran all tests on 0.20 and tested by hand that the inclusion works.

hive starter scripts should load admin/user supplied script for configurability

Key: HIVE-1513
URL: https://issues.apache.org/jira/browse/HIVE-1513
Project: Hadoop Hive
Issue Type: Improvement
Components: CLI
Reporter: Joydeep Sen Sarma
Attachments: 1513.1.patch

It's difficult to add environment variables to the Hive starter scripts except by modifying the scripts directly, which is undesirable (since they are source code). The Hive starter scripts should load an admin-supplied shell script for configurability, similar to what Hadoop does with hadoop-env.sh.
[jira] Updated: (HIVE-1513) hive starter scripts should load admin/user supplied script for configurability
[ https://issues.apache.org/jira/browse/HIVE-1513?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Joydeep Sen Sarma updated HIVE-1513:

Status: Patch Available (was: Open)
Assignee: Joydeep Sen Sarma

hive starter scripts should load admin/user supplied script for configurability

Key: HIVE-1513
URL: https://issues.apache.org/jira/browse/HIVE-1513
Project: Hadoop Hive
Issue Type: Improvement
Components: CLI
Reporter: Joydeep Sen Sarma
Assignee: Joydeep Sen Sarma
Attachments: 1513.1.patch

It's difficult to add environment variables to the Hive starter scripts except by modifying the scripts directly, which is undesirable (since they are source code). The Hive starter scripts should load an admin-supplied shell script for configurability, similar to what Hadoop does with hadoop-env.sh.
[jira] Assigned: (HIVE-1515) archive is not working when multiple partitions inside one table are archived.
[ https://issues.apache.org/jira/browse/HIVE-1515?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

He Yongqiang reassigned HIVE-1515:

Assignee: He Yongqiang

archive is not working when multiple partitions inside one table are archived.

Key: HIVE-1515
URL: https://issues.apache.org/jira/browse/HIVE-1515
Project: Hadoop Hive
Issue Type: Bug
Reporter: He Yongqiang
Assignee: He Yongqiang

set hive.exec.compress.output = true;
set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;
set mapred.min.split.size=256;
set mapred.min.split.size.per.node=256;
set mapred.min.split.size.per.rack=256;
set mapred.max.split.size=256;
set hive.archive.enabled = true;

drop table combine_3_srcpart_seq_rc;
create table combine_3_srcpart_seq_rc (key int, value string) partitioned by (ds string, hr string) stored as sequencefile;

insert overwrite table combine_3_srcpart_seq_rc partition (ds='2010-08-03', hr='00') select * from src;
insert overwrite table combine_3_srcpart_seq_rc partition (ds='2010-08-03', hr='001') select * from src;

ALTER TABLE combine_3_srcpart_seq_rc ARCHIVE PARTITION (ds='2010-08-03', hr='00');
ALTER TABLE combine_3_srcpart_seq_rc ARCHIVE PARTITION (ds='2010-08-03', hr='001');

select key, value, ds, hr from combine_3_srcpart_seq_rc where ds='2010-08-03' order by key, hr limit 30;

drop table combine_3_srcpart_seq_rc;

This will fail with:

java.io.IOException: Invalid file name: har:/data/users/heyongqiang/hive-trunk-clean/build/ql/test/data/warehouse/combine_3_srcpart_seq_rc/ds=2010-08-03/hr=001/data.har/data/users/heyongqiang/hive-trunk-clean/build/ql/test/data/warehouse/combine_3_srcpart_seq_rc/ds=2010-08-03/hr=001 in har:/data/users/heyongqiang/hive-trunk-clean/build/ql/test/data/warehouse/combine_3_srcpart_seq_rc/ds=2010-08-03/hr=00/data.har

It fails because there are 2 input paths (one for each partition) for the above query:

1) har:/Users/heyongqiang/Documents/workspace/Hive-Index/build/ql/test/data/warehouse/combine_3_srcpart_seq_rc/ds=2010-08-03/hr=00/data.har/Users/heyongqiang/Documents/workspace/Hive-Index/build/ql/test/data/warehouse/combine_3_srcpart_seq_rc/ds=2010-08-03/hr=00

2) har:/Users/heyongqiang/Documents/workspace/Hive-Index/build/ql/test/data/warehouse/combine_3_srcpart_seq_rc/ds=2010-08-03/hr=001/data.har/Users/heyongqiang/Documents/workspace/Hive-Index/build/ql/test/data/warehouse/combine_3_srcpart_seq_rc/ds=2010-08-03/hr=001

But when calling path.getFileSystem() for these 2 input paths, both return the same FileSystem instance, which points to the first caller's archive, in this case har:/Users/heyongqiang/Documents/workspace/Hive-Index/build/ql/test/data/warehouse/combine_3_srcpart_seq_rc/ds=2010-08-03/hr=00/data.har

The reason is that Hadoop's FileSystem has a global cache, and when loading a FileSystem instance for a given path it only uses the path's scheme and username to look up the cache. So when we call Path.getFileSystem for the second har path, it actually returns the file system handle for the first path.
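To make the caching behavior concrete, here is a small, hypothetical Java sketch against the standard Hadoop FileSystem API; the paths are placeholders, and the cache-bypass property shown at the end is a newer-Hadoop option mentioned only as one possible direction, not necessarily what the eventual patch does.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HarCacheSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();

    // Two different archives, placeholder paths.
    Path har1 = new Path("har:///warehouse/tbl/ds=2010-08-03/hr=00/data.har/part-00000");
    Path har2 = new Path("har:///warehouse/tbl/ds=2010-08-03/hr=001/data.har/part-00000");

    // Path.getFileSystem()/FileSystem.get() cache instances keyed by scheme
    // (and authority/user), so both lookups can hand back the instance bound
    // to whichever archive was opened first -- the problem described above.
    FileSystem fs1 = har1.getFileSystem(conf);
    FileSystem fs2 = har2.getFileSystem(conf);
    System.out.println("same cached instance: " + (fs1 == fs2));

    // In newer Hadoop releases the per-scheme cache can be bypassed, giving
    // each archive its own instance (shown only as an illustration).
    conf.setBoolean("fs.har.impl.disable.cache", true);
    FileSystem fresh = har2.getFileSystem(conf);
    System.out.println("fresh instance differs: " + (fresh != fs1));
  }
}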
[jira] Updated: (HIVE-1293) Concurrency Model for Hive
[ https://issues.apache.org/jira/browse/HIVE-1293?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Namit Jain updated HIVE-1293:

Attachment: hive.1293.2.patch

Updated the patch.

Concurrency Model for Hive

Key: HIVE-1293
URL: https://issues.apache.org/jira/browse/HIVE-1293
Project: Hadoop Hive
Issue Type: New Feature
Components: Query Processor
Reporter: Namit Jain
Assignee: Namit Jain
Fix For: 0.7.0
Attachments: hive.1293.1.patch, hive.1293.2.patch, hive_leases.txt

Concurrency model for Hive: Currently, Hive does not provide a good concurrency model. The only guarantee provided in the case of concurrent readers and writers is that a reader will not see partial data from the old version (before the write) and partial data from the new version (after the write). This has come across as a big problem, especially for background processes performing maintenance operations. The following possible solutions come to mind.

1. Locks: Acquire read/write locks - they can be acquired at the beginning of the query, or the write locks can be delayed until the move task (when the directory is actually moved). Care needs to be taken to avoid deadlocks.

2. Versioning: The writer can create a new version if the current version is being read. Note that this is not equivalent to snapshots: the old version can only be accessed by the current readers, and will be deleted when all of them have finished.

Comments?
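As a point of reference for option 1, here is a minimal, in-process Java sketch of per-object read/write locking. It is an illustrative assumption only; the real design would need locks shared across clients (e.g. through a lock manager or ZooKeeper) plus lock ordering or timeouts to handle deadlock.

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.locks.ReentrantReadWriteLock;
import java.util.function.Supplier;

class HiveLockSketch {
  // One lock per table/partition name; purely in-process for illustration.
  private final Map<String, ReentrantReadWriteLock> locks = new ConcurrentHashMap<>();

  private ReentrantReadWriteLock lockFor(String tableOrPartition) {
    return locks.computeIfAbsent(tableOrPartition, k -> new ReentrantReadWriteLock());
  }

  <T> T read(String obj, Supplier<T> query) {
    ReentrantReadWriteLock l = lockFor(obj);
    l.readLock().lock();                 // many concurrent readers are allowed
    try {
      return query.get();
    } finally {
      l.readLock().unlock();
    }
  }

  void write(String obj, Runnable moveTask) {
    ReentrantReadWriteLock l = lockFor(obj);
    l.writeLock().lock();                // exclusive: blocks readers and other writers
    try {
      moveTask.run();                    // e.g. the final directory move
    } finally {
      l.writeLock().unlock();
    }
  }
}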
[jira] Commented: (HIVE-192) Add TIMESTAMP column type
[ https://issues.apache.org/jira/browse/HIVE-192?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12896114#action_12896114 ]

Shyam Sundar Sarkar commented on HIVE-192:

I converted the TIMESTAMP type into a string type as part of the createTable method in the DDLTask.java class. Can anyone suggest how I can retain the TIMESTAMP type in the Hive metadata repository and send only a string type to the db.createTable() method, so that the SerDe takes it as a string type? I want this for back-and-forth conversions during SELECT, UPDATE, etc. It seems that the metadata is permanently modified to string type in Hive. Any suggestions will help.

-S. Sarkar

Add TIMESTAMP column type

Key: HIVE-192
URL: https://issues.apache.org/jira/browse/HIVE-192
Project: Hadoop Hive
Issue Type: New Feature
Components: Query Processor
Reporter: Johan Oskarsson
Assignee: Shyam Sundar Sarkar
Attachments: create_2.q.txt, TIMESTAMP_specification.txt

create table something2 (test timestamp);

ERROR: DDL specifying type timestamp which has not been defined
java.lang.RuntimeException: specifying type timestamp which has not been defined
    at org.apache.hadoop.hive.serde2.dynamic_type.thrift_grammar.FieldType(thrift_grammar.java:1879)
    at org.apache.hadoop.hive.serde2.dynamic_type.thrift_grammar.Field(thrift_grammar.java:1545)
    at org.apache.hadoop.hive.serde2.dynamic_type.thrift_grammar.FieldList(thrift_grammar.java:1501)
    at org.apache.hadoop.hive.serde2.dynamic_type.thrift_grammar.Struct(thrift_grammar.java:1171)
    at org.apache.hadoop.hive.serde2.dynamic_type.thrift_grammar.TypeDefinition(thrift_grammar.java:497)
    at org.apache.hadoop.hive.serde2.dynamic_type.thrift_grammar.Definition(thrift_grammar.java:439)
    at org.apache.hadoop.hive.serde2.dynamic_type.thrift_grammar.Start(thrift_grammar.java:101)
    at org.apache.hadoop.hive.serde2.dynamic_type.DynamicSerDe.initialize(DynamicSerDe.java:97)
    at org.apache.hadoop.hive.metastore.MetaStoreUtils.getDeserializer(MetaStoreUtils.java:180)
    at org.apache.hadoop.hive.ql.metadata.Table.initSerDe(Table.java:141)
    at org.apache.hadoop.hive.ql.metadata.Hive.createTable(Hive.java:202)
    at org.apache.hadoop.hive.ql.exec.DDLTask.createTable(DDLTask.java:641)
    at org.apache.hadoop.hive.ql.exec.DDLTask.execute(DDLTask.java:98)
    at org.apache.hadoop.hive.ql.Driver.run(Driver.java:215)
    at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:174)
    at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:207)
    at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:305)
[jira] Updated: (HIVE-1515) archive is not working when multiple partitions inside one table are archived.
[ https://issues.apache.org/jira/browse/HIVE-1515?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

He Yongqiang updated HIVE-1515:

Attachment: hive-1515.1.patch

archive is not working when multiple partitions inside one table are archived.

Key: HIVE-1515
URL: https://issues.apache.org/jira/browse/HIVE-1515
Project: Hadoop Hive
Issue Type: Bug
Reporter: He Yongqiang
Assignee: He Yongqiang
Attachments: hive-1515.1.patch

set hive.exec.compress.output = true;
set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;
set mapred.min.split.size=256;
set mapred.min.split.size.per.node=256;
set mapred.min.split.size.per.rack=256;
set mapred.max.split.size=256;
set hive.archive.enabled = true;

drop table combine_3_srcpart_seq_rc;
create table combine_3_srcpart_seq_rc (key int, value string) partitioned by (ds string, hr string) stored as sequencefile;

insert overwrite table combine_3_srcpart_seq_rc partition (ds='2010-08-03', hr='00') select * from src;
insert overwrite table combine_3_srcpart_seq_rc partition (ds='2010-08-03', hr='001') select * from src;

ALTER TABLE combine_3_srcpart_seq_rc ARCHIVE PARTITION (ds='2010-08-03', hr='00');
ALTER TABLE combine_3_srcpart_seq_rc ARCHIVE PARTITION (ds='2010-08-03', hr='001');

select key, value, ds, hr from combine_3_srcpart_seq_rc where ds='2010-08-03' order by key, hr limit 30;

drop table combine_3_srcpart_seq_rc;

This will fail with:

java.io.IOException: Invalid file name: har:/data/users/heyongqiang/hive-trunk-clean/build/ql/test/data/warehouse/combine_3_srcpart_seq_rc/ds=2010-08-03/hr=001/data.har/data/users/heyongqiang/hive-trunk-clean/build/ql/test/data/warehouse/combine_3_srcpart_seq_rc/ds=2010-08-03/hr=001 in har:/data/users/heyongqiang/hive-trunk-clean/build/ql/test/data/warehouse/combine_3_srcpart_seq_rc/ds=2010-08-03/hr=00/data.har

It fails because there are 2 input paths (one for each partition) for the above query:

1) har:/Users/heyongqiang/Documents/workspace/Hive-Index/build/ql/test/data/warehouse/combine_3_srcpart_seq_rc/ds=2010-08-03/hr=00/data.har/Users/heyongqiang/Documents/workspace/Hive-Index/build/ql/test/data/warehouse/combine_3_srcpart_seq_rc/ds=2010-08-03/hr=00

2) har:/Users/heyongqiang/Documents/workspace/Hive-Index/build/ql/test/data/warehouse/combine_3_srcpart_seq_rc/ds=2010-08-03/hr=001/data.har/Users/heyongqiang/Documents/workspace/Hive-Index/build/ql/test/data/warehouse/combine_3_srcpart_seq_rc/ds=2010-08-03/hr=001

But when calling path.getFileSystem() for these 2 input paths, both return the same FileSystem instance, which points to the first caller's archive, in this case har:/Users/heyongqiang/Documents/workspace/Hive-Index/build/ql/test/data/warehouse/combine_3_srcpart_seq_rc/ds=2010-08-03/hr=00/data.har

The reason is that Hadoop's FileSystem has a global cache, and when loading a FileSystem instance for a given path it only uses the path's scheme and username to look up the cache. So when we call Path.getFileSystem for the second har path, it actually returns the file system handle for the first path.
Syntax and Semantics for Continuous queries in Hive
Hello,

I am curious whether large streaming data can be queried using the syntax and semantics of continuous queries inside Hive, as defined in stream databases (e.g. StreamBase, Coral8, etc.). Continuous queries are essential for the real-time enterprise, log data, and many other applications.

Thanks,
Shyam Sarkar
[jira] Created: (HIVE-1516) optimize split sizes automatically taking into account the amount and nature of map tasks
optimize split sizes automatically taking into account the amount and nature of map tasks

Key: HIVE-1516
URL: https://issues.apache.org/jira/browse/HIVE-1516
Project: Hadoop Hive
Issue Type: Improvement
Components: Query Processor
Reporter: Joydeep Sen Sarma

Two immediate cases come to mind:

- pure filter job (i.e. no map-side sort required)
- full aggregate computations only (like count(1))

In these cases the amount of data to be sorted is zero or negligible, so mapper parallelism (and split size) should be dictated by the size of the cluster; see the sketch below. There's no point running far more mappers than a 500-node cluster can usefully run for a pure filter job.
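The following is a rough Java sketch of the kind of heuristic being suggested. The formula, method name, and parameters are illustrative assumptions only, not part of any existing Hive or Hadoop API.

// Illustrative heuristic (not an existing Hive/Hadoop API): for jobs with no
// meaningful map-side sort, derive the split size from the cluster size rather
// than a fixed default, so the mapper count stays proportional to available slots.
public class SplitSizeSketch {
  static long chooseSplitSize(long totalInputBytes,
                              int clusterNodes,
                              int mapSlotsPerNode,
                              long minSplitBytes,
                              long maxSplitBytes) {
    // Aim for roughly one wave of mappers across the cluster.
    long targetMappers = (long) clusterNodes * mapSlotsPerNode;
    long size = totalInputBytes / Math.max(1, targetMappers);
    return Math.max(minSplitBytes, Math.min(maxSplitBytes, size));
  }

  public static void main(String[] args) {
    // e.g. 1 TB of input on a 500-node cluster with 4 map slots per node,
    // clamped between 64 MB and 4 GB per split.
    long split = chooseSplitSize(1L << 40, 500, 4, 64L << 20, 4L << 30);
    System.out.println(split + " bytes per split (~" + (split >> 20) + " MB)");
  }
}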
[jira] Created: (HIVE-1517) ability to select across a database
ability to select across a database

Key: HIVE-1517
URL: https://issues.apache.org/jira/browse/HIVE-1517
Project: Hadoop Hive
Issue Type: Improvement
Components: Query Processor
Reporter: Namit Jain
Assignee: Carl Steinbach

After https://issues.apache.org/jira/browse/HIVE-675, we need a way to select across databases for this feature to be useful. For example:

use db1;
create table foo();
use db2;
select .. from db1.foo;
[jira] Commented: (HIVE-1293) Concurrency Model for Hive
[ https://issues.apache.org/jira/browse/HIVE-1293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12896182#action_12896182 ]

John Sichi commented on HIVE-1293:

Added comments here: https://review.cloudera.org/r/563/ (it doesn't seem to be adding the comments here).

Concurrency Model for Hive

Key: HIVE-1293
URL: https://issues.apache.org/jira/browse/HIVE-1293
Project: Hadoop Hive
Issue Type: New Feature
Components: Query Processor
Reporter: Namit Jain
Assignee: Namit Jain
Fix For: 0.7.0
Attachments: hive.1293.1.patch, hive.1293.2.patch, hive_leases.txt

Concurrency model for Hive: Currently, Hive does not provide a good concurrency model. The only guarantee provided in the case of concurrent readers and writers is that a reader will not see partial data from the old version (before the write) and partial data from the new version (after the write). This has come across as a big problem, especially for background processes performing maintenance operations. The following possible solutions come to mind.

1. Locks: Acquire read/write locks - they can be acquired at the beginning of the query, or the write locks can be delayed until the move task (when the directory is actually moved). Care needs to be taken to avoid deadlocks.

2. Versioning: The writer can create a new version if the current version is being read. Note that this is not equivalent to snapshots: the old version can only be accessed by the current readers, and will be deleted when all of them have finished.

Comments?
[jira] Created: (HIVE-1518) context_ngrams() UDAF for estimating top-k contextual n-grams
context_ngrams() UDAF for estimating top-k contextual n-grams

Key: HIVE-1518
URL: https://issues.apache.org/jira/browse/HIVE-1518
Project: Hadoop Hive
Issue Type: New Feature
Components: Query Processor
Affects Versions: 0.6.0
Reporter: Mayank Lahiri
Assignee: Mayank Lahiri
Fix For: 0.6.0

Create a new context_ngrams() function that generalizes the ngrams() UDAF to allow the user to specify context around n-grams. The analogy is fill-in-the-blanks, and it is best illustrated with an example:

SELECT context_ngrams(sentences(tweets), array("i", "love", null), 300) FROM twitter;

will estimate the top-300 words that follow the phrase "i love" in a database of tweets. The position of the null(s) specifies where to generate the n-gram from, and nulls can be placed anywhere. For example:

SELECT context_ngrams(sentences(tweets), array("i", "love", null, "but", "hate", null), 300) FROM twitter;

will estimate the top-300 word-pairs that fill in the blanks specified by null.

POSSIBLE USES:
1. Pre-computing search lookaheads
2. Sentiment analysis for products or entities -- e.g., querying with context = array("twitter", "is", null)
3. Navigation path analysis in URL databases
[jira] Assigned: (HIVE-1482) Not all jdbc calls are threadsafe.
[ https://issues.apache.org/jira/browse/HIVE-1482?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

John Sichi reassigned HIVE-1482:

Assignee: Bennie Schut

Not all jdbc calls are threadsafe.

Key: HIVE-1482
URL: https://issues.apache.org/jira/browse/HIVE-1482
Project: Hadoop Hive
Issue Type: Bug
Components: Drivers
Affects Versions: 0.7.0
Reporter: Bennie Schut
Assignee: Bennie Schut
Fix For: 0.7.0
Attachments: HIVE-1482-1.patch

As per the JDBC spec, they should be threadsafe:
http://download.oracle.com/docs/cd/E17476_01/javase/1.3/docs/guide/jdbc/spec/jdbc-spec.frame9.html
[jira] Updated: (HIVE-1513) hive starter scripts should load admin/user supplied script for configurability
[ https://issues.apache.org/jira/browse/HIVE-1513?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Joydeep Sen Sarma updated HIVE-1513:

Attachment: 1513.2.patch

Forgot to add one file.

hive starter scripts should load admin/user supplied script for configurability

Key: HIVE-1513
URL: https://issues.apache.org/jira/browse/HIVE-1513
Project: Hadoop Hive
Issue Type: Improvement
Components: CLI
Reporter: Joydeep Sen Sarma
Assignee: Joydeep Sen Sarma
Attachments: 1513.1.patch, 1513.2.patch

It's difficult to add environment variables to the Hive starter scripts except by modifying the scripts directly, which is undesirable (since they are source code). The Hive starter scripts should load an admin-supplied shell script for configurability, similar to what Hadoop does with hadoop-env.sh.