RE: Hive output file 000000_0

2018-08-08 Thread Ryan Harris
Also, I believe that the output format matters. If your output is TEXTFILE, I think that all of the reducers can append to the same file concurrently. However, for block-based output formats, that isn’t possible. From: Furcy Pin [mailto:pin.fu...@gmail.com] Sent: Wednesday, August 08, 2018 9:58

RE: Migrating Variable Length Files to Hive

2017-06-02 Thread Ryan Harris
are standard in case of all files. Any idea how the schema would look if I use the Stingray reader? I am guessing it would be more like string, string, string, array(strings)? -Nishanth On Fri, Jun 2, 2017 at 10:51 AM, Ryan Harris <ryan.har...@zionsbancorp.com> wrote:

RE: Migrating Variable Length Files to Hive

2017-06-02 Thread Ryan Harris
I wrote some custom Python parsing scripts using the Stingray Reader ( http://stingrayreader.sourceforge.net/cobol.html ) that read in the copybooks and use the results to automatically generate Hive table schemas based on the source copybook. The EBCDIC data is then extracted to TAB separated
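As a purely illustrative sketch (these table and column names are hypothetical, not from the thread), a generated schema for tab-separated copybook extracts might look like:

    -- Hypothetical DDL of the kind a copybook-driven generator might emit.
    CREATE EXTERNAL TABLE copybook_extract (
      acct_no   STRING,
      cust_name STRING,
      branch_cd STRING
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
    LOCATION '/data/copybook_extract';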

RE: Is it possible to use LOAD DATA INPATH with a PARTITIONED, STORED AS PARQUET table?

2017-04-07 Thread Ryan Harris
Hive just take the data and take full care of partitioning it? On Tue, Apr 4, 2017 at 6:14 PM, Ryan Harris <ryan.har...@zionsbancorp.com> wrote: For A) I’d recommend mapping an EXTERNAL table to the raw/original source files…then you can just run

RE: Is it possible to use LOAD DATA INPATH with a PARTITIONED, STORED AS PARQUET table?

2017-04-06 Thread Ryan Harris
anything specific in the input files / with the input files in order to make partitioning work, or does Hive just take the data and take full care of partitioning it? On Tue, Apr 4, 2017 at 6:14 PM, Ryan Harris <ryan.har...@zionsbancorp.com> wrote:

RE: Is it possible to use LOAD DATA INPATH with a PARTITIONED, STORED AS PARQUET table?

2017-04-04 Thread Ryan Harris
For A) I’d recommend mapping an EXTERNAL table to the raw/original source files…then you can just run a SELECT query from the EXTERNAL source and INSERT into your destination. LOAD DATA can be very useful when you are trying to move data between two tables that share the same schema but 1
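A minimal sketch of that pattern (table names, columns, and paths are hypothetical):

    -- Map an EXTERNAL table over the raw source files.
    CREATE EXTERNAL TABLE raw_src (id BIGINT, payload STRING, load_dt STRING)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
    LOCATION '/data/raw_src';

    -- Assuming dest is PARTITIONED BY (load_dt STRING) STORED AS PARQUET,
    -- select from the external source and let dynamic partitioning place the rows.
    SET hive.exec.dynamic.partition=true;
    SET hive.exec.dynamic.partition.mode=nonstrict;
    INSERT INTO TABLE dest PARTITION (load_dt)
    SELECT id, payload, load_dt FROM raw_src;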

RE: Possible Bug: to_date("2015-01-15") returns a string

2016-06-30 Thread Ryan Harris
FWIW, the wiki states that the function returns a string: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF From: Long, Andrew [loand...@amazon.com] Sent: Thursday, June 30, 2016 5:31 PM To: user@hive.apache.org

RE: Spark Streaming, Batch interval, Windows length and Sliding Interval settings

2016-05-05 Thread Ryan Harris
This is really outside of the scope of Hive and would probably be better addressed by the Spark community; however, I can say that this very much depends on your use case. Take a look at this discussion if you haven't already:

RE: Container out of memory: ORC format with many dynamic partitions

2016-05-02 Thread Ryan Harris
Reading this: "but when I add 2000 new titles with 300 rows each", I'm thinking that you are over-partitioning your data. I'm not sure exactly how that relates to the OOM error you are getting (it may not). I'd test things out partitioning by date only, or maybe date + title_type, but adding
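A sketch of the coarser layout suggested above (the schema is hypothetical):

    -- Partition by date (and at most title_type) instead of by individual title.
    CREATE TABLE titles_by_day (
      title  STRING,
      rating DOUBLE
    )
    PARTITIONED BY (dt STRING, title_type STRING)
    STORED AS ORC;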

RE: Hive query to split one row into many rows such that Row 1 will have col 1 Name, col 1 Value and Row 2 will have col 2 Name and col 2 value

2016-04-26 Thread Ryan Harris
If you are doing a GROUP BY, you could have potential duplicates in your concat_ws. Take a look at using collect_set or collect_list. If you do select col_a, collect_set(concat_ws(', ', col_b, col_c)) from t, you will have an array of unique pairs... collect_list will give you all
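A minimal sketch of the collect_set approach (table and column names are hypothetical):

    -- One row per col_a, carrying an array of the distinct b/c pairs.
    -- col_b and col_c are assumed to be STRING (concat_ws takes strings).
    SELECT col_a,
           collect_set(concat_ws(', ', col_b, col_c)) AS bc_pairs
    FROM t
    GROUP BY col_a;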

RE: Mappers spawning Hive queries

2016-04-18 Thread Ryan Harris
2016 at 1:31 PM, Ryan Harris <ryan.har...@zionsbancorp.com> wrote: My $0.02 If you are running multiple concurrent queries on the data, you are probably doing it wrong (or at least inefficiently), although this somewhat depends on what typ

RE: Mappers spawning Hive queries

2016-04-18 Thread Ryan Harris
My $0.02: If you are running multiple concurrent queries on the data, you are probably doing it wrong (or at least inefficiently), although this somewhat depends on what type of files are backing your Hive warehouse... Let's assume that your data is NOT backed by ORC/parquet files, and

RE: Best Hive Authorization Model for Shared data

2016-04-12 Thread Ryan Harris
If your only problem with #2 is the issue of creating the external table, you should be able to throw together a script, running as a more privileged user, that handles the task of creating the external table. Once the table is created, the user should be able to access the read-only data.

RE: make best use of VCore in Hive

2016-03-28 Thread Ryan Harris
In my opinion, this ultimately becomes a resource balance issue that you'll need to test. You have a fixed amount of memory (although you haven't said what it is). As you increase the number of tasks, the available memory per task will decrease. If the tasks run out of memory, they will

RE: Best way of Unpivoting of Hive table data. Any Analytic function for unpivoting

2016-03-28 Thread Ryan Harris
collect_list(col) will give you an array with all of the data from that column. However, the scalability of this approach will have limits. -Original Message- From: mahender bigdata [mailto:mahender.bigd...@outlook.com] Sent: Monday, March 28, 2016 5:47 PM To: user@hive.apache.org
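For the row-per-name/value shape in the subject line, Hive's built-in stack() UDTF is another common option; a hedged sketch with hypothetical table and column names (this is an alternative to the collect_list approach discussed in the thread):

    -- Turn each input row into two (name, value) rows.
    SELECT id, u.col_name, u.col_value
    FROM src
    LATERAL VIEW stack(2,
      'col1', col1,
      'col2', col2) u AS col_name, col_value;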

RE: Issue joining 21 HUGE Hive tables

2016-03-23 Thread Ryan Harris
The query that you are using would have to be analyzed to know how much it could be optimized. The small tables should be able to be handled with a map join; depending on the Hive version, that may be happening automatically. Hive will be doing the joins in stages. You could manually implement the

RE: Difference between RC file format & Parquet file format

2016-02-17 Thread Ryan Harris
ORC files = optimized RC files: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+ORC Parquet is similar to ORC, but a bit different: http://parquet.apache.org/documentation/latest/ Parquet is a bit more of a "standard" file format outside of Hive, while ORC files are primarily

Add partitioning to a table that is not already partitioned?

2016-02-12 Thread Ryan Harris
I'm very aware of the "textbook" approach to creating a partitioned table. I'm searching for an easy/repeatable solution for the following workflow requirements: 1) An initial complex source query, with multiple joins from different source tables, field substring extracts, type conversions, etc.

RE: Add partition data to an external ORC table.

2016-02-11 Thread Ryan Harris
If your original source is text, why don't you make your ORC-based table a Hive-managed table instead of an external table? Then you can load/partition your text data into the external table, query from that, and insert into your ORC-backed Hive-managed table. Theoretically, if you had your data
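A minimal sketch of that flow, with hypothetical names (a text-backed staging table feeding an ORC managed table):

    CREATE EXTERNAL TABLE staging_txt (id BIGINT, val STRING, dt STRING)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
    LOCATION '/data/staging_txt';

    CREATE TABLE facts_orc (id BIGINT, val STRING)
    PARTITIONED BY (dt STRING)
    STORED AS ORC;

    -- Query the staging table and insert into the managed ORC table.
    SET hive.exec.dynamic.partition=true;
    SET hive.exec.dynamic.partition.mode=nonstrict;
    INSERT INTO TABLE facts_orc PARTITION (dt)
    SELECT id, val, dt FROM staging_txt;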

RE: Sessionize using Hive

2016-02-05 Thread Ryan Harris
Hive Ryan, Can you perhaps point me to example(s) of how this is done in Hive? Thanks, J. B. Rawlings Senior Consultant C: 425.233.1315 www.societyconsulting.com From: Ryan Harris [mailto:ryan.har...@zionsbancorp.com] Sent: Monday, February 1, 2016 6

RE: Hive on Spark Engine versus Spark using Hive metastore

2016-02-02 Thread Ryan Harris
https://github.com/myui/hivemall as long as you are comfortable with Java UDFs, the sky is really the limit... it's not for everyone, and Spark does have many advantages, but they are two tools that can complement each other in numerous ways. I don't know that there is necessarily a universal

RE: Sessionize using Hive

2016-02-01 Thread Ryan Harris
It can be done in Hive... whether or not it is the "best choice" depends on whether or not you have any other reason for your data to be in Hive. If you are wondering whether Hive is the best tool for accomplishing this one task, it would probably be easier to do in Pig. From: JB Rawlings

RE: Loading data containing newlines

2016-01-15 Thread Ryan Harris
Mich, if you have a toolpath that you can use to pipeline the required edits to the source file, you can use a chain similar to this: hadoop fs -text ${hdfs_path}/${orig_filename} | iconv -f EBCDIC-US -t ASCII | sed 's/\(.\{133\}\)/\1\n/g' | gzip -c | /usr/bin/hadoop fs -put -

RE: Loop if table is not empty

2015-12-28 Thread Ryan Harris
Either use a multi-table insert to write the results of the source table into another file/table: https://cwiki.apache.org/confluence/display/Hive/GettingStarted#GettingStarted-MULTITABLEINSERT or use windowing and analytics functions to run a count over the entire table as a separate results
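A sketch of the windowing option (the table name is hypothetical; an empty OVER() treats the whole table as one partition):

    -- Carry the total row count alongside every row in a single pass.
    SELECT t.*, count(1) OVER () AS total_rows
    FROM t;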

RE: Create hive table with same schema without any data

2015-12-10 Thread Ryan Harris
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-CreateTableLike -Original Message- From: mahender bigdata [mailto:mahender.bigd...@outlook.com] Sent: Thursday, December 10, 2015 11:09 AM To: user@hive.apache.org Subject: Create hive table with same
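For reference, the linked DDL is a one-liner (table names here are hypothetical):

    -- Copies the schema and storage format of an existing table, with no data.
    CREATE TABLE my_table_copy LIKE my_table;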

RE: Help With Hive Windowing

2015-12-10 Thread Ryan Harris
Each record is being returned. For each record, the last_seen_dt is calculated for the window. It sounds like you are looking for the last record, which would be the record where hit_time = last_seen_dt; try adding that as a WHERE clause. From: Justin Workman [mailto:justinjwork...@gmail.com]
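A hedged sketch of that filter (all table and column names are hypothetical; the window alias has to be resolved in a subquery before the WHERE clause can see it):

    SELECT *
    FROM (
      SELECT h.*,
             max(hit_time) OVER (PARTITION BY session_id) AS last_seen_dt
      FROM hits h
    ) x
    WHERE hit_time = last_seen_dt;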

RE: how to get counts as a byproduct of a query

2015-12-03 Thread Ryan Harris
, a.Y, b.Z INSERT OVERWRITE TABLE count_A SELECT count(a.X) INSERT OVERWRITE TABLE count_B SELECT count(b.X); From: Ryan Harris [mailto:ryan.har...@zionsbancorp.com] Sent: Wednesday, December 02, 2015 4:20 PM To: user@hive.apache.org Subject: RE: how to get counts as a byproduct of a query

RE: how to get counts as a byproduct of a query

2015-12-02 Thread Ryan Harris
https://cwiki.apache.org/confluence/display/Hive/GettingStarted#GettingStarted-MULTITABLEINSERT From: Frank Luo [mailto:j...@merkleinc.com] Sent: Wednesday, December 02, 2015 1:26 PM To: user@hive.apache.org Subject: RE: how to get counts as a byproduct of a query Didn’t get any response, so
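A minimal multi-table insert sketch (table names are hypothetical): one scan of the source feeds both outputs:

    FROM src
    INSERT OVERWRITE TABLE dest_data   SELECT x, y
    INSERT OVERWRITE TABLE dest_counts SELECT count(1);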

RE: how to get counts as a byproduct of a query

2015-12-02 Thread Ryan Harris
Personally, I'd do it this way... https://cwiki.apache.org/confluence/display/Hive/LanguageManual+WindowingAndAnalytics Select suba.X, suba.Y, suba.countA, subb.Z, subb.countB FROM (SELECT x, y, count(1) OVER (PARTITION BY X) as countA) suba JOIN (SELECT x, z, count(1) OVER (PARTITION BY X) as
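Completing the truncated query with hypothetical source tables a and b (the FROM clauses and join condition are assumptions, since the original is cut off):

    SELECT suba.X, suba.Y, suba.countA, subb.Z, subb.countB
    FROM (SELECT X, Y, count(1) OVER (PARTITION BY X) AS countA FROM a) suba
    JOIN (SELECT X, Z, count(1) OVER (PARTITION BY X) AS countB FROM b) subb
      ON suba.X = subb.X;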

RE: Using json_tuple for Nested json Arrays

2015-10-27 Thread Ryan Harris
T 1; FAILED: UDFArgumentException explode() takes an array or a map as a parameter Thanks, Joel On Tue, Oct 27, 2015 at 3:37 PM, Ryan Harris <ryan.har...@zionsbancorp.com> wrote: Do you have an example of the query that you tried (which failed). In short,

RE: Using json_tuple for Nested json Arrays

2015-10-27 Thread Ryan Harris
ing. Thanks, Joel On Tue, Oct 27, 2015 at 4:21 PM, Ryan Harris <ryan.har...@zionsbancorp.com> wrote: looking at your sample data, you shouldn't need to use lateral view explode unless you are trying to get 1 entry per row for your media sizes (

RE: Using json_tuple for Nested json Arrays

2015-10-27 Thread Ryan Harris
Do you have an example of the query that you tried (which failed)? In short, you probably want to use the get_json_object() UDF: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-get_json_object If you need the JSON array broken into individual records, you
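A minimal get_json_object() sketch (the column, JSON path, and table are hypothetical):

    -- Pull a single field out of a JSON string column.
    SELECT get_json_object(json_col, '$.user.name') AS user_name
    FROM tweets;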

RE: Using json_tuple for Nested json Arrays

2015-10-27 Thread Ryan Harris
4g>","type":"photo","url":"http://t.co/i3004WyF4g","id":654301608994586624,"media_url_https":"https://pbs.twimg.com/media/CRSL2MQWwAAP4Qo.jpg","expanded_url":"http://twitter.com/lordlancaster/status/6543016266651

RE: Using json_tuple for Nested json Arrays

2015-10-27 Thread Ryan Harris
, Oct 27, 2015 at 5:22 PM, Ryan Harris <ryan.har...@zionsbancorp.com> wrote: hmmm... I'm not sure what the return value type of json_tuple is... I'd probably try creating a temporary table from your working query below and then work on getting the

RE: Hive SerDe regex error

2015-10-01 Thread Ryan Harris
Depending on how you are submitting the statement to Hive, you'll probably need to escape the backslash... try replacing every \ with \\ From: IT CTO [mailto:goi@gmail.com] Sent: Thursday, October 01, 2015 6:25 AM To: user@hive.apache.org Subject: Re: Hive SerDe regex error Your Regex

RE: Better way to do UDF's for Hive

2015-10-01 Thread Ryan Harris
If you want to use Python... the Python script should expect tab-separated input on stdin and it should return tab-separated columns on stdout... add file mypython.py; SELECT TRANSFORM (tbl.id, tbl.name, tbl.city) USING 'python mypython.py' AS (id, name, city, state) FROM
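Completing the truncated statement with a hypothetical table name:

    ADD FILE mypython.py;
    SELECT TRANSFORM (tbl.id, tbl.name, tbl.city)
    USING 'python mypython.py'
    AS (id, name, city, state)
    FROM some_table tbl;  -- some_table is hypothetical; the rest is from the thread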

RE: CombineHiveInputFormat not working

2015-09-30 Thread Ryan Harris
what are your values for: mapred.min.split.size mapred.max.split.size hive.hadoop.supports.splittable.combineinputformat From: Pradeep Gollakota [mailto:pradeep...@gmail.com] Sent: Wednesday, September 30, 2015 2:20 PM To: user@hive.apache.org Subject: CombineHiveInputFormat not working Hi all,
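For reference, a hypothetical combination of those settings (the values are illustrative, not recommendations):

    SET hive.hadoop.supports.splittable.combineinputformat=true;
    SET mapred.min.split.size=134217728;   -- 128 MB
    SET mapred.max.split.size=268435456;   -- 256 MB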

RE: CombineHiveInputFormat not working

2015-09-30 Thread Ryan Harris
Also... mapreduce.input.fileinputformat.split.maxsize. And what is the size of your input files? From: Ryan Harris Sent: Wednesday, September 30, 2015 2:37 PM To: 'user@hive.apache.org' Subject: RE: CombineHiveInputFormat not working what are your values for: mapred.min.split.size

RE: Hive Generic UDF invoking Hbase

2015-09-30 Thread Ryan Harris
Date: Wed, 30 Sep 2015 17:19:18 + Take a look at hive.fetch.task.conversion in https://cwiki.apache.org/confluence/display/Hive/Configuration+Properties, try setting it to "none" or "minimal" From: Ryan Harris <ryan.har...@zionsbancorp.com>

RE: Hive Generic UDF invoking Hbase

2015-09-30 Thread Ryan Harris
To: user@hive.apache.org Subject: RE: Hive Generic UDF invoking Hbase I believe it's not because of the classpath. For a single task / for streaming it's working fine, right? Sent from Outlook On Wed, Sep 30, 2015 at 1:58 PM -0700, "Ryan Harris"

RE: CombineHiveInputFormat not working

2015-09-30 Thread Ryan Harris
Harris <ryan.har...@zionsbancorp.com> wrote: Also... mapreduce.input.fileinputformat.split.maxsize and, what is the size of your input files? From: Ryan Harris Sent: Wednesday, September 30, 2015 2:37 PM To: 'user@hive.apache.org'

RE: can we add column type in where clause in a hive query?

2015-09-02 Thread Ryan Harris
The fact that you have other data in the column (like letters) implies that you have the column stored as a string, so use a regex: SELECT CAST(mycol AS BIGINT) WHERE mycol RLIKE '^-?[0-9.]+$' From: Mohit Durgapal [mailto:durgapalmo...@gmail.com] Sent: Wednesday, September 02, 2015 5:09 AM

RE: Loading multiple file format in hive

2015-08-26 Thread Ryan Harris
in necessary. On Tue, Aug 25, 2015 at 11:57 PM, Ryan Harris <ryan.har...@zionsbancorp.com> wrote: A few things.. 1) If you are using spark streaming, I don't see any reason why the output of your spark streaming can't match the necessary destination format

RE: Loading multiple file format in hive

2015-08-25 Thread Ryan Harris
A few things... 1) If you are using Spark Streaming, I don't see any reason why the output of your Spark Streaming job can't match the necessary destination format... you shouldn't need a second job to read the output from Spark Streaming and convert to Parquet. Do a search for spark streaming and

RE: Run multiple queries simultaneously

2015-08-25 Thread Ryan Harris
You need to be a bit more clear about your environment and objective here. What is your back-end execution engine: MapReduce, Spark, or Tez? What are you using for resource management: YARN or MapReduce? The running time of one query in the presence of other queries will entirely depend on

RE: Running python UDF in hive

2015-08-20 Thread Ryan Harris
Remember that transform scripts in Hive should receive data from STDIN and return results to STDOUT. So, to properly test your transform script, try this: hive -e "select id from test limit 10" > testout.txt cat testout.txt | python transform_value.py If your transform script is working correctly,

RE: Parquet Files in Hive - Settings

2015-08-18 Thread Ryan Harris
Most are Parquet settings, from https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetOutputFormat.java: # The block size is the size of a row group being buffered in memory # this limits the memory usage when writing # Larger

RE: External sorted tables

2015-08-03 Thread Ryan Harris
On Aug 3, 2015 10:47 AM, Ryan Harris <ryan.har...@zionsbancorp.com> wrote: Unless you are using bucketing and sampling, there is no benefit (that I can think of) to informing Hive that the data *is* in fact sorted... If there is something specific you are trying

RE: External sorted tables

2015-08-03 Thread Ryan Harris
Unless you are using bucketing and sampling, there is no benefit (that I can think of) to informing Hive that the data *is* in fact sorted... If there is something specific you are trying to accomplish by specifying the sort order of that column, perhaps you can elaborate on that. Otherwise,
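For contrast, a hedged sketch of the case where sort order does help, i.e. a bucketed, sorted table with sampling (names and bucket count are hypothetical):

    CREATE TABLE events_bucketed (user_id BIGINT, ts STRING)
    CLUSTERED BY (user_id) SORTED BY (user_id ASC) INTO 32 BUCKETS;

    -- Read roughly 1/32 of the data by scanning a single bucket.
    SELECT * FROM events_bucketed TABLESAMPLE (BUCKET 1 OUT OF 32 ON user_id);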

RE: im having an issue extracting some data from json within hive in hdinsight

2015-07-23 Thread Ryan Harris
You probably want to be using the UDF get_json_object(); I added to this stackoverflow post [http://stackoverflow.com/questions/24447428/parse-json-arrays-using-hive] a few months ago. The problem was specific to top-level JSON arrays, and is related to JIRA HIVE-1575

RE: DISTRIBUTE BY question

2015-07-13 Thread Ryan Harris
This should get you on the right path: https://issues.apache.org/jira/browse/HIVE-7121 From: Connell Donaghy [mailto:cdona...@pinterest.com] Sent: Monday, July 13, 2015 2:50 PM To: user@hive.apache.org Subject: DISTRIBUTE BY question Hey! I'm trying to write a tool which uses a storagehandler

Change in Abstract Syntax Tree output format with EXPLAIN EXTENDED in 0.13

2015-06-26 Thread Ryan Harris
In Hive 0.12, the Abstract Syntax Tree output format when using EXPLAIN EXTENDED matched what is in the wiki: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Explain As an example, consider the following EXPLAIN query: EXPLAIN FROM src INSERT OVERWRITE TABLE dest_g1 SELECT

RE: Updating hive metadata

2015-06-18 Thread Ryan Harris
You *should* be able to do: CREATE TABLE my_table_2 LIKE my_table; dfs -cp /user/hive/warehouse/my_table/* /user/hive/warehouse/my_table_2/; MSCK REPAIR TABLE my_table_2; From: Devopam Mittra [mailto:devo...@gmail.com] Sent: Thursday, June 18, 2015 10:12 PM To: user@hive.apache.org Subject: Re:

collect_set() with OVER clause

2015-05-12 Thread Ryan Harris
It looks like the OVER clause currently supports the aggregate functions (count, sum, min, max, avg, ntile). Is there any plan to include support for other built-in aggregate functions like collect_set()?