Also, I believe that the output format matters. If your output is TEXTFILE I
think that all of the reducers can append to the same file concurrently.
However, for block-based output formats, that isn't possible.
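For illustration, a minimal sketch of the two storage choices being contrasted
(table and column names are made up):
CREATE TABLE events_text (id BIGINT, msg STRING)
STORED AS TEXTFILE;  -- plain row-oriented text files
CREATE TABLE events_orc (id BIGINT, msg STRING)
STORED AS ORC;       -- block-based columnar format; each writer emits its own file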
From: Furcy Pin [mailto:pin.fu...@gmail.com]
Sent: Wednesday, August 08, 2018 9:58
are standard in case of all files. Any idea how the schema would look if I use
the StingRay reader? I am guessing it would be more like
string, string, string, array(strings)?
-Nishanth
On Fri, Jun 2, 2017 at 10:51 AM, Ryan Harris
<ryan.har...@zionsbancorp.com> wrote:
I wrote some custom Python parsing scripts using StingRay Reader (
http://stingrayreader.sourceforge.net/cobol.html ) that read in the copybooks
and use the results to automatically generate a Hive table schema based on the
source copybook. The EBCDIC data is then extracted to TAB-separated
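For illustration, a guess at what such a generated schema might look like (all
names here are hypothetical; a repeated COBOL OCCURS field would surface as the
array column):
CREATE TABLE copybook_extract (
  acct_no STRING,
  cust_name STRING,
  branch STRING,
  notes ARRAY<STRING>)  -- hypothetical OCCURS field
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
COLLECTION ITEMS TERMINATED BY ',';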
anything specific in the input files / with the input files in order to
make partitioning work, or does Hive just take the data and take full care of
partitioning it?
On Tue, Apr 4, 2017 at 6:14 PM, Ryan Harris
<ryan.har...@zionsbancorp.com> wrote:
For A) I’d recommend mapping an EXTERNAL table to the raw/original source
files…then you can just run a SELECT query from the EXTERNAL source and INSERT
into your destination.
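A minimal sketch of that pattern, with made-up names and paths:
CREATE EXTERNAL TABLE raw_src (id BIGINT, name STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/landing/raw_src';  -- hypothetical landing directory
INSERT INTO TABLE dest
SELECT id, name FROM raw_src;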
LOAD DATA can be very useful when you are trying to move data between two
tables that share the same schema but 1
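For example (path and partition value are made up):
LOAD DATA INPATH '/staging/part-00000'
INTO TABLE dest PARTITION (dt='2017-04-04');  -- moves the file, no rewrite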
FWIW, the wiki states that the function returns a string
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF
From: Long, Andrew [loand...@amazon.com]
Sent: Thursday, June 30, 2016 5:31 PM
To: user@hive.apache.org
This is really outside of the scope of Hive and would probably be better
addressed by the Spark community; however, I can say that this very much depends
on your use case.
Take a look at this discussion if you haven't already:
reading this:
"but when I add 2000 new titles with 300 rows each"
I'm thinking that you are over-partitioning your data.
I'm not sure exactly how that relates to the OOM error you are getting (it may
not). I'd test things out partitioning by date only, or maybe date +
title_type, but adding
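A sketch of the suggested layout, with hypothetical column names:
CREATE TABLE titles (title STRING, play_count BIGINT)
PARTITIONED BY (dt STRING, title_type STRING);  -- two coarse dimensions instead of one partition per title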
if you are doing a group by, you could have potential duplicates in your
concat_ws output. Take a look at using collect_set or collect_list. If you do:
select col_a,
collect_set(concat_ws(', ', col_b, col_c))
from t
group by col_a
you will have an array of unique collection pairs... collect_list will give you
all of the values, including duplicates.
2016 at 1:31 PM, Ryan Harris
<ryan.har...@zionsbancorp.com> wrote:
My $0.02
If you are running multiple concurrent queries on the data, you are probably
doing it wrong (or at least inefficiently), although this somewhat depends on
what type of files are backing your hive warehouse...
Let's assume that your data is NOT backed by ORC/parquet files, and
if your only problem with #2 is the issue of creating the external table, you
should be able to throw together a script running as a more privileged user
that could handle the task of creating the external table. Once the table is
created, the user should be able to access the read-only data.
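A minimal sketch of what that privileged one-time step might be (names and
location are made up):
-- run once by the privileged user; readers then query it normally
CREATE EXTERNAL TABLE IF NOT EXISTS readonly_data (id BIGINT, val STRING)
LOCATION '/data/readonly/dataset1';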
In my opinion, this ultimately becomes a resource balance issue that you'll
need to test.
You have a fixed amount of memory (although you haven't said what it is). As
you increase the number of tasks, the available memory per task will decrease.
If the tasks run out of memory, they will
collect_list(col) will give you an array with all of the data from that column.
However, the scalability of this approach will have limits.
-----Original Message-----
From: mahender bigdata [mailto:mahender.bigd...@outlook.com]
Sent: Monday, March 28, 2016 5:47 PM
To: user@hive.apache.org
the query that you are using would have to be analyzed to know how much it
could be optimized.
The small tables should be able to be handled with a map-join; depending on the
Hive version, that may happen automatically.
Hive will be doing the joins in stages.
You could manually implement the
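If you do want manual control, the classic hint looks like this (table names
are hypothetical; hive.auto.convert.join usually handles it on recent versions):
SELECT /*+ MAPJOIN(small_t) */ big_t.id, small_t.name
FROM big_t JOIN small_t ON (big_t.k = small_t.k);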
ORC files = optimized RC files
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+ORC
Parquet is similar to ORC, but a bit different.
http://parquet.apache.org/documentation/latest/
Parquet is a bit more of a "standard" file format outside of Hive, while ORC
files are primarily
I'm very aware of the "textbook" approach to creating a partitioned table.
I'm searching for an easy/repeatable solution for the following workflow
requirements
1) An initial complex source query, with multiple joins from different source
tables, field substring extracts, type conversions, etc
If your original source is text, why don't you make your ORC-based table a Hive
managed table instead of an external table?
Then you can load/partition your text data into the external table, query from
that and insert into your ORC-backed Hive managed table.
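A minimal sketch of that flow, assuming hypothetical names and a dynamic
partition column dt:
SET hive.exec.dynamic.partition.mode=nonstrict;
CREATE TABLE dest_orc (id BIGINT, name STRING)
PARTITIONED BY (dt STRING)
STORED AS ORC;
INSERT OVERWRITE TABLE dest_orc PARTITION (dt)
SELECT id, name, dt FROM text_ext;  -- text_ext is the external text table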
Theoretically, if you had your data
Ryan,
Can you perhaps point me to example(s) of how this is done in Hive?
Thanks,
J. B. Rawlings
Senior Consultant
C: 425.233.1315
www.societyconsulting.com
From: Ryan Harris [mailto:ryan.har...@zionsbancorp.com]
Sent: Monday, February 1, 2016 6
https://github.com/myui/hivemall
as long as you are comfortable with Java UDFs, the sky is really the
limit... it's not for everyone, and Spark does have many advantages, but they are
two tools that can complement each other in numerous ways.
I don't know that there is necessarily a universal
It can be done in Hive... whether or not it is the "best choice" depends on
whether or not you have any other reason for your data to be in Hive.
If you are wondering whether Hive is the best tool for accomplishing this one
task, it would probably be easier to do in Pig.
From: JB Rawlings
Mich, if you have a toolpath that you can use to pipeline the required edits to
the source file, you can use a chain similar to this:
hadoop fs -text ${hdfs_path}/${orig_filename} | iconv -f EBCDIC-US -t ASCII |
sed 's/\(.\{133\}\)/\1\n/g' | gzip -c | /usr/bin/hadoop fs -put -
either use a multi-table insert to write the results of the source table into
another file/table:
https://cwiki.apache.org/confluence/display/Hive/GettingStarted#GettingStarted-MULTITABLEINSERT
or use windowing and analytics functions to run a count over the entire table
as a separate result (see the sketch below):
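For example, a sketch with made-up names (an empty OVER() treats the whole
result set as one window on recent Hive versions):
SELECT x, y, count(1) OVER () AS total_rows
FROM t;  -- every row carries the full-table count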
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-CreateTableLike
-----Original Message-----
From: mahender bigdata [mailto:mahender.bigd...@outlook.com]
Sent: Thursday, December 10, 2015 11:09 AM
To: user@hive.apache.org
Subject: Create hive table with same
Each record is being returned.
For each record, the last_seen_dt is calculated for the window.
It sounds like you are looking for the last record, which would be the record
where hit_time = last_seen_dt.
Try adding that as a WHERE clause.
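A sketch of that WHERE-clause approach, with hypothetical table and key names:
SELECT *
FROM (
  SELECT h.*, max(hit_time) OVER (PARTITION BY user_id) AS last_seen_dt
  FROM hits h
) t
WHERE hit_time = last_seen_dt;  -- keeps only the last record per window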
From: Justin Workman [mailto:justinjwork...@gmail.com]
, a.Y, b.Z
insert OVERWRITE TABLE count_A select count(a.X)
insert OVERWRITE TABLE count_B select count(b.X)
;
From: Ryan Harris [mailto:ryan.har...@zionsbancorp.com]
Sent: Wednesday, December 02, 2015 4:20 PM
To: user@hive.apache.org
Subject: RE: how to get counts as a byproduct of a query
https://cwiki.apache.org/confluence/display/Hive/GettingStarted#GettingStarted-MULTITABLEINSERT
From: Frank Luo [mailto:j...@merkleinc.com]
Sent: Wednesday, December 02, 2015 1:26 PM
To: user@hive.apache.org
Subject: RE: how to get counts as a byproduct of a query
Didn’t get any response, so
Personally, I'd do it this way...
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+WindowingAndAnalytics
Select suba.X, suba.Y, suba.countA, subb.Z, subb.countB
FROM
(SELECT x, y, count(1) OVER (PARTITION BY x) AS countA FROM a) suba
JOIN
(SELECT x, z, count(1) OVER (PARTITION BY x) AS countB FROM b) subb
ON (suba.x = subb.x);
T 1;
FAILED: UDFArgumentException explode() takes an array or a map as a parameter
Thanks,
Joel
On Tue, Oct 27, 2015 at 3:37 PM, Ryan Harris
<ryan.har...@zionsbancorp.com> wrote:
ing.
Thanks,
Joel
On Tue, Oct 27, 2015 at 4:21 PM, Ryan Harris
<ryan.har...@zionsbancorp.com> wrote:
looking at your sample data, you shouldn't need to use lateral view explode
unless you are trying to get 1 entry per row for your media sizes (
Do you have an example of the query that you tried (which failed)?
In short, you probably want to use the get_json_object() UDF:
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-get_json_object
if you need the JSON array broken into individual records, you
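For example, a hedged sketch against the tweet payload quoted below (table and
column names are made up):
SELECT get_json_object(json_col, '$.entities.media[0].media_url')
FROM tweets;  -- pulls one field out of the nested JSON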
4g>","type":"photo","url":"http://t.co/i3004WyF4g","id":654301608994586624,"media_url_https":"https://pbs.twimg.com/media/CRSL2MQWwAAP4Qo.jpg","expanded_url":"http://twitter.com/lordlancaster/status/6543016266651
, Oct 27, 2015 at 5:22 PM, Ryan Harris
<ryan.har...@zionsbancorp.com> wrote:
hmmm...I'm not sure what the return value type of json_tuple is...
I'd probably try creating a temporary table from your working query below and
then work on getting the
depending on how you are submitting the statement to hive, you'll probably need
to escape the backslash...
try replacing every \ with \\
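For instance, in a RegexSerDe definition every backslash in the pattern gets
doubled (the table and columns here are hypothetical):
CREATE TABLE parsed_logs (host STRING, request STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES ("input.regex" = "(\\S+)\\s+(\\S+)");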
From: IT CTO [mailto:goi@gmail.com]
Sent: Thursday, October 01, 2015 6:25 AM
To: user@hive.apache.org
Subject: Re: Hive SerDe regex error
Your Regex
If you want to use python...
The python script should expect tab-separated input on stdin and should
emit tab-separated columns on stdout...
add file mypython.py;
SELECT TRANSFORM (tbl.id, tbl.name, tbl.city)
USING 'python mypython.py'
AS (id, name, city, state)
FROM tbl;
what are your values for:
mapred.min.split.size
mapred.max.split.size
hive.hadoop.supports.splittable.combineinputformat
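For reference, they are set like any other session property (the byte values
below are purely illustrative):
SET mapred.min.split.size=268435456;   -- 256 MB
SET mapred.max.split.size=536870912;   -- 512 MB
SET hive.hadoop.supports.splittable.combineinputformat=true;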
From: Pradeep Gollakota [mailto:pradeep...@gmail.com]
Sent: Wednesday, September 30, 2015 2:20 PM
To: user@hive.apache.org
Subject: CombineHiveInputFormat not working
Hi all,
Also...
mapreduce.input.fileinputformat.split.maxsize
and, what is the size of your input files?
From: Ryan Harris
Sent: Wednesday, September 30, 2015 2:37 PM
To: 'user@hive.apache.org'
Subject: RE: CombineHiveInputFormat not working
what are your values for:
mapred.min.split.size
Date: Wed, 30 Sep 2015 17:19:18 +
Take a look at hive.fetch.task.conversion in
https://cwiki.apache.org/confluence/display/Hive/Configuration+Properties; try
setting it to "none" or "minimal".
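i.e., something like:
SET hive.fetch.task.conversion=none;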
From: Ryan Harris <ryan.har...@zionsban
@hive.apache.org; user@hive.apache.org
Subject: RE: Hive Generic UDF invoking Hbase
I believe it's not because of the classpath. For a single task / for streaming
it's working fine, right?
On Wed, Sep 30, 2015 at 1:58 PM -0700, "Ryan Harris"
<ryan.har...@zionsbancorp.com> wrote:
The fact that you have other data in the column (like letters) implies that you
have the column stored as a string, so use a regex:
SELECT CAST(mycol AS BIGINT) FROM my_table WHERE mycol RLIKE '^-?[0-9.]+$';
From: Mohit Durgapal [mailto:durgapalmo...@gmail.com]
Sent: Wednesday, September 02, 2015 5:09 AM
in necessary.
On Tue, Aug 25, 2015 at 11:57 PM, Ryan Harris
<ryan.har...@zionsbancorp.com> wrote:
A few things..
1) If you are using spark streaming, I don't see any reason why the output of
your spark streaming can't match the necessary destination format...you
shouldn't need a second job to read the output from Spark Streaming and convert
to parquet. Do a search for spark streaming and
You need to be a bit more clear about your environment and objective here.
What is your back-end execution engine? MapReduce, Spark, or Tez?
What are you using for resource management? YARN or MapReduce?
The running time of one query in the presence of other queries will entirely
depend on
remember that transform scripts in hive should receive data from STDIN and
return results to STDOUT. So, to properly test your transform script, try this:
hive -e "select id from test limit 10" > testout.txt
cat testout.txt | python transform_value.py
if your transform script is working correctly,
most are parquet settings
from
https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetOutputFormat.java:
* # The block size is the size of a row group being buffered in memory
* # this limits the memory usage when writing
* # Larger
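The row-group size can be set per session; the value below is illustrative:
SET parquet.block.size=134217728;  -- 128 MB row groups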
On Aug 3, 2015 10:47 AM, Ryan Harris
<ryan.har...@zionsbancorp.com> wrote:
Unless you are using bucketing and sampling, there is no benefit (that I can
think of) to informing hive that the data *is* in fact sorted...
If there is something specific you are trying to accomplish by specifying the
sort order of that column, perhaps you can elaborate on that. Otherwise,
You probably want to be using the UDF get_json_object(), I added to this
stackoverflow post
[http://stackoverflow.com/questions/24447428/parse-json-arrays-using-hive]
a few months ago. The problem was specific to top-level JSON arrays, and is
related to JIRA HIVE-1575
this should get you on the right path:
https://issues.apache.org/jira/browse/HIVE-7121
From: Connell Donaghy [mailto:cdona...@pinterest.com]
Sent: Monday, July 13, 2015 2:50 PM
To: user@hive.apache.org
Subject: DISTRIBUTE BY question
Hey! I'm trying to write a tool which uses a storagehandler
In hive 0.12, the Abstract Syntax Tree output format when using EXPLAIN
EXTENDED matched what is in the wiki:
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Explain
As an example, consider the following EXPLAIN query:
EXPLAIN
FROM src INSERT OVERWRITE TABLE dest_g1 SELECT src.key, sum(substr(src.value,4)) GROUP BY src.key;
you *should* be able to do:
create table my_table_2 like my_table;
dfs -cp /user/hive/warehouse/my_table/* /user/hive/warehouse/my_table_2/;
MSCK REPAIR TABLE my_table_2;
From: Devopam Mittra [mailto:devo...@gmail.com]
Sent: Thursday, June 18, 2015 10:12 PM
To: user@hive.apache.org
Subject: Re:
It looks like the OVER clause currently supports the aggregate functions
(count, sum, min, max, avg, ntile).
Is there any plan to include support for other built-in aggregate functions
like collect_set()?