RE: [ANNOUNCE] New Hive Committer - Mithun Radhakrishnan

2015-04-14 Thread Rohith Sharma K S
Congratulations Mithun ☺

-Regards
Rohith Sharma K S

From: cwsteinb...@gmail.com [mailto:cwsteinb...@gmail.com] On Behalf Of Carl 
Steinbach
Sent: 15 April 2015 03:25
To: d...@hive.apache.org; user@hive.apache.org; mit...@apache.org
Subject: [ANNOUNCE] New Hive Committer - Mithun Radhakrishnan

The Apache Hive PMC has voted to make Mithun Radhakrishnan a committer on the 
Apache Hive Project.

Please join me in congratulating Mithun.

Thanks.

- Carl



Re: [ANNOUNCE] New Hive Committer - Mithun Radhakrishnan

2015-04-14 Thread Chinna Rao Lalam
Congrats Mithun!


On Wed, Apr 15, 2015 at 11:34 AM, Mohammad Islam  wrote:

> Congrats Mithun!
>
> --Mohammad
>
>
>
>   On Tuesday, April 14, 2015 9:10 PM, Prasanth Jayachandran <
> pjayachand...@hortonworks.com> wrote:
>
>
>  Congrats Mithun!
>
> Thanks
> Prasanth
>
>
>
>
> On Tue, Apr 14, 2015 at 8:51 PM -0700, "Jimmy Xiang" 
> wrote:
>
>  Congrats!
>
> On Tue, Apr 14, 2015 at 8:46 PM, Lefty Leverenz 
> wrote:
>
> Congrats Mithun -- when they gave me the cape, they called it a cloak of
> invisibility.  But the only thing it makes invisible is itself.  Maybe I
> should open a jira
>
>  -- Lefty
>
> On Tue, Apr 14, 2015 at 9:03 PM, Xu, Cheng A  wrote:
>
>  Congrats Mithun!
>
>  *From:* Gunther Hagleitner [mailto:ghagleit...@hortonworks.com]
> *Sent:* Wednesday, April 15, 2015 8:10 AM
> *To:* d...@hive.apache.org; Chris Drome; user@hive.apache.org
> *Cc:* mit...@apache.org
> *Subject:* Re: [ANNOUNCE] New Hive Committer - Mithun Radhakrishnan
>
> Congrats Mithun!
>
> Thanks,
> Gunther.
>  --
>  *From:* Chao Sun 
> *Sent:* Tuesday, April 14, 2015 3:48 PM
> *To:* d...@hive.apache.org; Chris Drome
> *Cc:* user@hive.apache.org; mit...@apache.org
> *Subject:* Re: [ANNOUNCE] New Hive Committer - Mithun Radhakrishnan
>
>Congrats Mithun!
>
>  On Tue, Apr 14, 2015 at 3:29 PM, Chris Drome <
> cdr...@yahoo-inc.com.invalid> wrote:
> Congratulations Mithun!
>
>
>
>  On Tuesday, April 14, 2015 2:57 PM, Carl Steinbach 
> wrote:
>
>
>  The Apache Hive PMC has voted to make Mithun Radhakrishnan a committer on
> the Apache Hive Project.
> Please join me in congratulating Mithun.
> Thanks.
> - Carl
>
>
>
>
>
>
>  --
>  Best,
>  Chao
>
>
>
>
>
>


-- 
Hope It Helps,
Chinna


Re: [ANNOUNCE] New Hive Committer - Mithun Radhakrishnan

2015-04-14 Thread Mohammad Islam
Congrats Mithun!
 --Mohammad


 On Tuesday, April 14, 2015 9:10 PM, Prasanth Jayachandran 
 wrote:
   

 Congrats Mithun!

Thanks
Prasanth



On Tue, Apr 14, 2015 at 8:51 PM -0700, "Jimmy Xiang"  
wrote:

Congrats!

On Tue, Apr 14, 2015 at 8:46 PM, Lefty Leverenz  wrote:

Congrats Mithun -- when they gave me the cape, they called it a cloak of 
invisibility.  But the only thing it makes invisible is itself.  Maybe I should 
open a jira
-- Lefty
On Tue, Apr 14, 2015 at 9:03 PM, Xu, Cheng A  wrote:

Congrats Mithun!

From: Gunther Hagleitner [mailto:ghagleit...@hortonworks.com]
Sent: Wednesday, April 15, 2015 8:10 AM
To: d...@hive.apache.org; Chris Drome; user@hive.apache.org
Cc: mit...@apache.org
Subject: Re: [ANNOUNCE] New Hive Committer - Mithun Radhakrishnan

Congrats Mithun!

Thanks,
Gunther.

From: Chao Sun
Sent: Tuesday, April 14, 2015 3:48 PM
To: d...@hive.apache.org; Chris Drome
Cc: user@hive.apache.org; mit...@apache.org
Subject: Re: [ANNOUNCE] New Hive Committer - Mithun Radhakrishnan

Congrats Mithun!

On Tue, Apr 14, 2015 at 3:29 PM, Chris Drome  wrote:
Congratulations Mithun!


     On Tuesday, April 14, 2015 2:57 PM, Carl Steinbach  wrote:


 The Apache Hive PMC has voted to make Mithun Radhakrishnan a committer on the 
Apache Hive Project. 
Please join me in congratulating Mithun.
Thanks.
- Carl


  

--
Best,
Chao





  

Re: [ANNOUNCE] New Hive Committer - Mithun Radhakrishnan

2015-04-14 Thread Prasanth Jayachandran
Congrats Mithun!

Thanks
Prasanth




On Tue, Apr 14, 2015 at 8:51 PM -0700, "Jimmy Xiang" 
<jxi...@cloudera.com> wrote:

Congrats!

On Tue, Apr 14, 2015 at 8:46 PM, Lefty Leverenz 
<leftylever...@gmail.com> wrote:
Congrats Mithun -- when they gave me the cape, they called it a cloak of 
invisibility.  But the only thing it makes invisible is itself.  Maybe I should 
open a jira

-- Lefty

On Tue, Apr 14, 2015 at 9:03 PM, Xu, Cheng A 
<cheng.a...@intel.com> wrote:

Congrats Mithun!

From: Gunther Hagleitner 
[mailto:ghagleit...@hortonworks.com]
Sent: Wednesday, April 15, 2015 8:10 AM
To: d...@hive.apache.org; Chris Drome; 
user@hive.apache.org
Cc: mit...@apache.org
Subject: Re: [ANNOUNCE] New Hive Committer - Mithun Radhakrishnan


Congrats Mithun!



Thanks,

Gunther.


From: Chao Sun <c...@cloudera.com>
Sent: Tuesday, April 14, 2015 3:48 PM
To: d...@hive.apache.org; Chris Drome
Cc: user@hive.apache.org; 
mit...@apache.org
Subject: Re: [ANNOUNCE] New Hive Committer - Mithun Radhakrishnan

Congrats Mithun!

On Tue, Apr 14, 2015 at 3:29 PM, Chris Drome 
<cdr...@yahoo-inc.com.invalid> wrote:
Congratulations Mithun!



 On Tuesday, April 14, 2015 2:57 PM, Carl Steinbach 
<c...@apache.org> wrote:


 The Apache Hive PMC has voted to make Mithun Radhakrishnan a committer on the 
Apache Hive Project.
Please join me in congratulating Mithun.
Thanks.
- Carl






--
Best,
Chao




Re: [ANNOUNCE] New Hive Committer - Mithun Radhakrishnan

2015-04-14 Thread Jimmy Xiang
Congrats!

On Tue, Apr 14, 2015 at 8:46 PM, Lefty Leverenz 
wrote:

> Congrats Mithun -- when they gave me the cape, they called it a cloak of
> invisibility.  But the only thing it makes invisible is itself.  Maybe I
> should open a jira
>
> -- Lefty
>
> On Tue, Apr 14, 2015 at 9:03 PM, Xu, Cheng A  wrote:
>
>>  Congrats Mithun!
>>
>>
>>
>> *From:* Gunther Hagleitner [mailto:ghagleit...@hortonworks.com]
>> *Sent:* Wednesday, April 15, 2015 8:10 AM
>> *To:* d...@hive.apache.org; Chris Drome; user@hive.apache.org
>> *Cc:* mit...@apache.org
>> *Subject:* Re: [ANNOUNCE] New Hive Committer - Mithun Radhakrishnan
>>
>>
>>
>> Congrats Mithun!
>>
>>
>>
>> Thanks,
>>
>> Gunther.
>>  --
>>
>> *From:* Chao Sun 
>> *Sent:* Tuesday, April 14, 2015 3:48 PM
>> *To:* d...@hive.apache.org; Chris Drome
>> *Cc:* user@hive.apache.org; mit...@apache.org
>> *Subject:* Re: [ANNOUNCE] New Hive Committer - Mithun Radhakrishnan
>>
>>
>>
>> Congrats Mithun!
>>
>>
>>
>> On Tue, Apr 14, 2015 at 3:29 PM, Chris Drome <
>> cdr...@yahoo-inc.com.invalid> wrote:
>>
>> Congratulations Mithun!
>>
>>
>>
>>
>>  On Tuesday, April 14, 2015 2:57 PM, Carl Steinbach 
>> wrote:
>>
>>
>>  The Apache Hive PMC has voted to make Mithun Radhakrishnan a committer
>> on the Apache Hive Project.
>> Please join me in congratulating Mithun.
>> Thanks.
>> - Carl
>>
>>
>>
>>
>>
>>
>>
>>
>> --
>>
>> Best,
>>
>> Chao
>>
>
>


Re: [ANNOUNCE] New Hive Committer - Mithun Radhakrishnan

2015-04-14 Thread Lefty Leverenz
Congrats Mithun -- when they gave me the cape, they called it a cloak of
invisibility.  But the only thing it makes invisible is itself.  Maybe I
should open a jira

-- Lefty

On Tue, Apr 14, 2015 at 9:03 PM, Xu, Cheng A  wrote:

>  Congrats Mithun!
>
>
>
> *From:* Gunther Hagleitner [mailto:ghagleit...@hortonworks.com]
> *Sent:* Wednesday, April 15, 2015 8:10 AM
> *To:* d...@hive.apache.org; Chris Drome; user@hive.apache.org
> *Cc:* mit...@apache.org
> *Subject:* Re: [ANNOUNCE] New Hive Committer - Mithun Radhakrishnan
>
>
>
> Congrats Mithun!
>
>
>
> Thanks,
>
> Gunther.
>  --
>
> *From:* Chao Sun 
> *Sent:* Tuesday, April 14, 2015 3:48 PM
> *To:* d...@hive.apache.org; Chris Drome
> *Cc:* user@hive.apache.org; mit...@apache.org
> *Subject:* Re: [ANNOUNCE] New Hive Committer - Mithun Radhakrishnan
>
>
>
> Congrats Mithun!
>
>
>
> On Tue, Apr 14, 2015 at 3:29 PM, Chris Drome 
> wrote:
>
> Congratulations Mithun!
>
>
>
>
>  On Tuesday, April 14, 2015 2:57 PM, Carl Steinbach 
> wrote:
>
>
>  The Apache Hive PMC has voted to make Mithun Radhakrishnan a committer on
> the Apache Hive Project.
> Please join me in congratulating Mithun.
> Thanks.
> - Carl
>
>
>
>
>
>
>
>
> --
>
> Best,
>
> Chao
>


Re: partition and bucket

2015-04-14 Thread Devopam Mittra
+1
quite well explained. liked it much

regards
Dev

On Mon, Apr 13, 2015 at 1:34 AM, Mich Talebzadeh 
wrote:

> Hi,
>
>
>
> I will try to have a go at your points but I am sure there are many
> experts around.
>
>
>
> As you may know already, in an RDBMS partitioning (dividing a very large table
> into sub-tables conceptually) is deployed to address three areas:
>
>
>
> 1. Availability -- each partition can reside on a different
> tablespace/device. Hence a problem with a tablespace/device will take out a
> slice of the table's data instead of the whole thing. This does not really
> apply to Hive, which has 3-block replication as standard.
>
> 2. Manageability -- partitioning provides a mechanism for splitting
> whole-table jobs into clear batches. Partition exchange can make it easier
> to bulk load data. Defragging, moving older partitions to lower-tier
> storage, updating stats, etc. Most of these benefits apply to Hive as well.
> Please check the docs.
>
> 3. Performance -- partition elimination
>
>
>
> In simplest form (excluding composite partitioning), Hive partitioning
> will be similar to “range partitioning” in RDBMS. One can partition a table
> (say *partitioned_table* as shown below which is batch loaded from
> *non_partitioned_table*) -- by country, year, month etc. Each partition
> will be stored in Hive under sub-directory *table/year/month* like below
>
>
>
> /user/hive/warehouse/scratchpad.db
> */partitioned_table/country=Italy/year=2014/month=Feb*
>
>
>
> Hive does not have the concept of indexes, local or global, as yet. So
> without partitioning a simple query in Hive will have to read the entire
> table even if it is filtering a smaller result set (WHERE clause). This
> becomes a bottleneck for running multiple MapReduce jobs over a large table.
> So partitioning will help localise the query by hitting the relevant
> sub-directory or sub-directories only. There is another important aspect
> with Hive as well. The locking granularity will be determined by the lowest
> slice in the file system (sub-directory). So entering data into the above
> partition/file will take an exclusive lock on that partition/file, but
> crucially the rest of the partitions will be available (assuming concurrency
> in Hive is enabled).
>
>
>
>
> +---------+------------+------------------------+------------------------------------+-----------+--------------+-----------------+-----------------+----------------+---------+----------+
> | Lock ID | Database   | Table                  | Partition                          | State     | Type         | Transaction ID  | Last Heartbeat  | Acquired At    | User    | Hostname |
> +---------+------------+------------------------+------------------------------------+-----------+--------------+-----------------+-----------------+----------------+---------+----------+
> | 1711    | scratchpad | non_partitioned_table  | NULL                               | ACQUIRED  | SHARED_READ  | NULL            | 1428862154670   | 1428862151904  | hduser  | rhes564  |
> | 1711    | scratchpad | partitioned_table      | country=Italy/year=2014/month=Feb  | ACQUIRED  | EXCLUSIVE    | NULL            | 1428862154670   | 1428862151905  | hduser  | rhes564  |
> +---------+------------+------------------------+------------------------------------+-----------+--------------+-----------------+-----------------+----------------+---------+----------+
>
>
>
> Now your point 2, bucketing in Hive refers to hash partitioning where a
> hashing function is applied. As in an RDBMS, Hive will apply a linear
> hashing algorithm to prevent data from clustering within specific
> partitions. Hashing is very effective if the column selected for bucketing
> has very high selectivity like an ID column where selectivity (*select
> count(distinct(column))/count(column)* ) = 1.  In this case, the created
> partitions/ files will be as evenly sized as possible. In a nutshell
> bucketing is a method to get data evenly distributed over many
> partitions/files.  One should define the number of buckets by a power of
> two -- 2^n,  like 2, 4, 8, 16 etc to achieve best results. Again bucketing
> will help concurrency in Hive. It may even allow a *partition wise join*
> i.e. a join between two tables that are bucketed on the same column with
> the same number of buckets (has anyone tried this?)
>
>
>
> One more thing. When one defines the number of buckets at table creation
> level in Hive, the number of partitions/files will be fixed. In contrast,
> with partitioning you do not have this limitation.
>
>
>
> HTH
>
>
>
> Mich
>
>
>
>
>
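A minimal HiveQL sketch of the partitioning and bucketing described above (the table names match the thread's example, but the columns, bucket count, and settings are illustrative assumptions):

-- Illustrative DDL: range-style partitioning plus hash bucketing.
CREATE TABLE partitioned_table (
  id      BIGINT,
  payload STRING
)
PARTITIONED BY (country STRING, year INT, month STRING)
CLUSTERED BY (id) SORTED BY (id) INTO 8 BUCKETS  -- a power of two, as suggested
STORED AS ORC;

-- Selectivity check for a candidate bucketing column (closer to 1 is better):
SELECT COUNT(DISTINCT id) / COUNT(id) AS selectivity
FROM non_partitioned_table;

-- Batch load with dynamic partitioning from the unpartitioned table:
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
SET hive.enforce.bucketing=true;

INSERT OVERWRITE TABLE partitioned_table PARTITION (country, year, month)
SELECT id, payload, country, year, month
FROM non_partitioned_table;

-- Settings commonly needed for the bucketed sort-merge join mentioned above:
SET hive.optimize.bucketmapjoin=true;
SET hive.optimize.bucketmapjoin.sortedmerge=true;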

RE: [ANNOUNCE] New Hive Committer - Mithun Radhakrishnan

2015-04-14 Thread Xu, Cheng A
Congrats Mithun!

From: Gunther Hagleitner [mailto:ghagleit...@hortonworks.com]
Sent: Wednesday, April 15, 2015 8:10 AM
To: d...@hive.apache.org; Chris Drome; user@hive.apache.org
Cc: mit...@apache.org
Subject: Re: [ANNOUNCE] New Hive Committer - Mithun Radhakrishnan


Congrats Mithun!



Thanks,

Gunther.


From: Chao Sun <c...@cloudera.com>
Sent: Tuesday, April 14, 2015 3:48 PM
To: d...@hive.apache.org; Chris Drome
Cc: user@hive.apache.org; 
mit...@apache.org
Subject: Re: [ANNOUNCE] New Hive Committer - Mithun Radhakrishnan

Congrats Mithun!

On Tue, Apr 14, 2015 at 3:29 PM, Chris Drome 
<cdr...@yahoo-inc.com.invalid> wrote:
Congratulations Mithun!



 On Tuesday, April 14, 2015 2:57 PM, Carl Steinbach 
<c...@apache.org> wrote:


 The Apache Hive PMC has voted to make Mithun Radhakrishnan a committer on the 
Apache Hive Project.
Please join me in congratulating Mithun.
Thanks.
- Carl






--
Best,
Chao


Re: [ANNOUNCE] New Hive Committer - Mithun Radhakrishnan

2015-04-14 Thread Mithun RK
Thank you, chaps. :] One is happy to contribute. This is an honour, and
more than a little daunting.

Many thanks,
Mithun

P.S. Where do I pick up my cape? I was told there were capes...



On Tue, Apr 14, 2015 at 5:10 PM Gunther Hagleitner <
ghagleit...@hortonworks.com> wrote:

>  Congrats Mithun!
>
>
>  Thanks,
>
> Gunther.
>  --
> *From:* Chao Sun 
> *Sent:* Tuesday, April 14, 2015 3:48 PM
> *To:* d...@hive.apache.org; Chris Drome
> *Cc:* user@hive.apache.org; mit...@apache.org
> *Subject:* Re: [ANNOUNCE] New Hive Committer - Mithun Radhakrishnan
>
>   Congrats Mithun!
>
> On Tue, Apr 14, 2015 at 3:29 PM, Chris Drome  > wrote:
>
>> Congratulations Mithun!
>>
>>
>>
>>  On Tuesday, April 14, 2015 2:57 PM, Carl Steinbach 
>> wrote:
>>
>>
>>  The Apache Hive PMC has voted to make Mithun Radhakrishnan a committer
>> on the Apache Hive Project.
>> Please join me in congratulating Mithun.
>> Thanks.
>> - Carl
>>
>>
>>
>>
>
>
>
>  --
>  Best,
> Chao
>


Re: [ANNOUNCE] New Hive Committer - Mithun Radhakrishnan

2015-04-14 Thread Gunther Hagleitner
Congrats Mithun!


Thanks,

Gunther.


From: Chao Sun 
Sent: Tuesday, April 14, 2015 3:48 PM
To: d...@hive.apache.org; Chris Drome
Cc: user@hive.apache.org; mit...@apache.org
Subject: Re: [ANNOUNCE] New Hive Committer - Mithun Radhakrishnan

Congrats Mithun!

On Tue, Apr 14, 2015 at 3:29 PM, Chris Drome 
<cdr...@yahoo-inc.com.invalid> wrote:
Congratulations Mithun!



 On Tuesday, April 14, 2015 2:57 PM, Carl Steinbach 
<c...@apache.org> wrote:


 The Apache Hive PMC has voted to make Mithun Radhakrishnan a committer on the 
Apache Hive Project.
Please join me in congratulating Mithun.
Thanks.
- Carl






--
Best,
Chao


Re: Default schema

2015-04-14 Thread matshyeq
For the following I suggested:
"…or setting param file (like hive-env.sh, hiverc or hive-site.xml…)?"
I don't know which property or variable to set up.
Would you provide an example excerpt?



Thank you,
Kind Regards
~Maciek

On Tue, Apr 14, 2015 at 10:43 PM, Bala Krishna Gangisetty <
b...@altiscale.com> wrote:

> You can also specify it in the "*.hiverc*" file.
>
> .hiverc is executed automatically when Hive CLI is launched.
>
> The file can be located at "$HIVE_CONF_DIR/.hiverc", or "$HOME/.hiverc".
> It may vary based on the distribution you're on.
>
> The below 2 Hive JIRAs can provide more details.
>
> HIVE-1414  automatically
> invoke .hiverc init script
> HIVE-2911  Move global
> .hiverc file
>
> --Bala G.
>
> On Tue, Apr 14, 2015 at 2:03 PM, Maciek  wrote:
>
>> I thought about that, but I'm not sure it's the most suitable one.
>> Would you mind sharing those other ways?
>>
>> Thanks!
>>
>> On Tue, Apr 14, 2015 at 9:49 PM, Bala Krishna Gangisetty <
>> b...@altiscale.com> wrote:
>>
>>> Yes, certainly. There are a couple of ways to do this.
>>>
>>> One such way is to define an alias for "hive --database
>>> **"
>>>
>>> --Bala G.
>>>
>>> On Tue, Apr 14, 2015 at 1:30 PM, Maciek  wrote:
>>>
 Is it possible to customize the schema a user logs on to?
 I was thinking of setting some bash environment variable
 or setting param file (like hive-env.sh, hiverc or hive-site.xml…)?

>>>
>>>
>>
>


Re: [ANNOUNCE] New Hive Committer - Mithun Radhakrishnan

2015-04-14 Thread Chao Sun
Congrats Mithun!

On Tue, Apr 14, 2015 at 3:29 PM, Chris Drome 
wrote:

> Congratulations Mithun!
>
>
>
>  On Tuesday, April 14, 2015 2:57 PM, Carl Steinbach 
> wrote:
>
>
>  The Apache Hive PMC has voted to make Mithun Radhakrishnan a committer on
> the Apache Hive Project.
> Please join me in congratulating Mithun.
> Thanks.
> - Carl
>
>
>
>



-- 
Best,
Chao


Re: [ANNOUNCE] New Hive Committer - Mithun Radhakrishnan

2015-04-14 Thread Chris Drome
Congratulations Mithun!
 


 On Tuesday, April 14, 2015 2:57 PM, Carl Steinbach  wrote:
   

 The Apache Hive PMC has voted to make Mithun Radhakrishnan a committer on the 
Apache Hive Project. 
Please join me in congratulating Mithun.
Thanks.
- Carl


  

Re: External Table with unclosed orc files.

2015-04-14 Thread Grant Overby (groverby)
IIRC the HW Trucking Demo creates a temporary table from csv files of the
new data then issues a select … insert into an orc table.

For the love of google, I can’t find this demo atm, and I’m out of time.


If I recall correctly, this strikes me as suboptimal compared to writing
orc files directly: data must first be written to disk in a bulky
intermediate format and then copied.


I’ll dig deep here as soon as I get a chance.



On 4/14/15, 6:09 PM, "Grant Overby (groverby)"  wrote:

>Submitting patches or test cases is tricky business for a Cisco employee.
>I’ll put in the legal admin effort to get approval to do this. :/ The
>majority of the issues I mentioned /should/ find their way to apache via
>hortonworks.
>
>
>Additional responses are inline.
>
>
>
>
>
>
>
>
>
>On 4/14/15, 5:28 PM, "Gopal Vijayaraghavan"  wrote:
>
>>
>>>0.14 . Acid tables have been a real pain for us. We don't believe they
>>>are
>>>production ready. At least in our use cases, Tez crashes for assorted
>>>reasons or only assigns 1 mapper to the partition. Having delta files
>>>and
>>>no base files borks mapper assignments.
>>
>>Some of the chicken-egg problems for those were solved recently in
>>HIVE-10114.
>>
>>Then TEZ-1993 is coming out in the next version of Tez, into which we're
>>plugging in HIVE-7428 (no fix yet).
>>
>>Currently delta-only splits have 0 bytes as the "file size", so they get
>>grouped together to make a 16Mb chunk (rather than a single huge 0-sized
>>split).
>>
>>Those patches are the effect of me shaving the yak from the "1 mapper"
>>issue.
>>
>>After which the writer has to follow up on HIVE-9933 to get the locality
>>of files fixed.
>
>I’ll look into this. If the 1 mapper issue is solved, that would be a huge
>win for streaming for us.
>
>
>>
>>>name are left scattered about, borking queries. Latency is higher with
>>>streaming than writing to an orc file in hdfs, forcing obscene
>>>quantities
>>>of buckets and orc files smaller than any reasonable orc stripe / hdfs
>>block size. The compactor hangs seemingly at random for no reason we've
>>>been able to discern.
>>
>>I haven't seen these issues yet, but I am not dealing with a large volume
>>insert rate, so haven't produced latency issues there.
>>
>>Since I work on Hive performance and I haven't seen too many bugs filed,
>>I haven't paid attention to the performance of ACID.
>>
>>Please file bugs when you find them, so that it appears on the radar for
>>folks like me.
>>
>>I'm poking about because I want a live stream into LLAP to work
>>seamlessly
>>& return sub-second query results when queried (pre-cache/stage & merge
>>etc).
>
>These files aren’t orc, but hive expects them to be, leading to errors.
>They are made by using the hive streaming api.
>root@twig13:~# hdfs dfs -ls -R
>/apps/hive/warehouse/events.db/connection_events4/ | grep flush | head -n
>1
>-rw-r--r-- 3 storm hadoop 200 2015-04-09 17:12
>/apps/hive/warehouse/events.db/connection_events4/dt=1428613200/delta_11714703_11714802/bucket_7_flush_length
>root@twig13:~# hdfs dfs -ls -R
>/apps/hive/warehouse/events.db/connection_events4/ | grep flush | wc -l
>283
>
>This may be addressed by HIVE-8966, which is in the 1.0.0 release. kill -9 to
>the process writing to hive is a near-guaranteed way to leave these
>orphaned flush files, but we have seen them on several occasions when
>there is no indication that .close() was skipped.
>
>Our insert rate is about 100k/s for a 4 box cluster. Storm, Kafka, Hdfs,
>Hive, etc are ‘pancaked’ on this cluster. To keep up with this insert rate
>we need somewhere between 64 and 128 buckets for streaming to support an
>equal number of threads. We can keep up this same pace when writing orc
>files directly to hdfs with only 8 threads and thus 8 orc files. The orc
>files from streaming are on the order of 5mb a piece (15min insert-time
>base partitions). Even if orc stripes this small isn’t a problem, it’s
>still going to waste a lot of disk space due to hdfs block size.
>
>
>>
>>>An orc file without a footer is junk data (or, at least, the last stripe
>>>is junk data). I suppose my question should have been 'what will the
>>>hive
>>>query do when it encounters this? Skip the stripe / file? Error out the
>>>query? Something else?'
>>
>>It should throw an exception, because that's a corrupt ORC file.
>>
>>The trucking demo uses Storm without ACID - this is likely to get better
>>once we use Apache Falcon to move the data around.
>>
>>Cheers,
>>Gopal
>>
>>
>
>I suppose the best thing to do then is to write the orc file outside of
>the partition directory, then issue an mv when the file is closed?
>
>>
>



Re: External Table with unclosed orc files.

2015-04-14 Thread Grant Overby (groverby)
Submitting patches or test cases is tricky business for a Cisco employee.
I’ll put in the legal admin effort to get approval to do this. :/ The
majority of the issues I mentioned /should/ find their way to apache via
hortonworks.


Additional responses are inline.









On 4/14/15, 5:28 PM, "Gopal Vijayaraghavan"  wrote:

>
>>0.14 . Acid tables have been a real pain for us. We don't believe they
>>are
>>production ready. At least in our use cases, Tez crashes for assorted
>>reasons or only assigns 1 mapper to the partition. Having delta files and
>>no base files borks mapper assignments.
>
>Some of the chicken-egg problems for those were solved recently in
>HIVE-10114.
>
>Then TEZ-1993 is coming out in the next version of Tez, into which we're
>plugging in HIVE-7428 (no fix yet).
>
>Currently delta-only splits have 0 bytes as the "file size", so they get
>grouped together to make a 16Mb chunk (rather than a single huge 0-sized split).
>
>Those patches are the effect of me shaving the yak from the "1 mapper"
>issue.
>
>After which the writer has to follow up on HIVE-9933 to get the locality
>of files fixed.

I’ll look into this. If the 1 mapper issue is solved, that would be a huge
win for streaming for us.


>
>>name are left scattered about, borking queries. Latency is higher with
>>streaming than writing to an orc file in hdfs, forcing obscene quantities
>>of buckets and orc files smaller than any reasonable orc stripe / hdfs
>block size. The compactor hangs seemingly at random for no reason we've
>>been able to discern.
>
>I haven't seen these issues yet, but I am not dealing with a large volume
>insert rate, so haven't produced latency issues there.
>
>Since I work on Hive performance and I haven't seen too many bugs filed,
>I haven't paid attention to the performance of ACID.
>
>Please file bugs when you find them, so that it appears on the radar for
>folks like me.
>
>I'm poking about because I want a live stream into LLAP to work seamlessly
>& return sub-second query results when queried (pre-cache/stage & merge
>etc).

These files aren’t orc, but hive expects them to be, leading to errors.
They are made by using the hive streaming api.
root@twig13:~# hdfs dfs -ls -R
/apps/hive/warehouse/events.db/connection_events4/ | grep flush | head -n 1
-rw-r--r-- 3 storm hadoop 200 2015-04-09 17:12
/apps/hive/warehouse/events.db/connection_events4/dt=1428613200/delta_11714703_11714802/bucket_7_flush_length
root@twig13:~# hdfs dfs -ls -R
/apps/hive/warehouse/events.db/connection_events4/ | grep flush | wc -l
283

This may be addressed by HIVE-8966, which is in the 1.0.0 release. kill -9 to
the process writing to hive is a near-guaranteed way to leave these
orphaned flush files, but we have seen them on several occasions when
there is no indication that .close() was skipped.

Our insert rate is about 100k/s for a 4 box cluster. Storm, Kafka, Hdfs,
Hive, etc are ‘pancaked’ on this cluster. To keep up with this insert rate
we need somewhere between 64 and 128 buckets for streaming to support an
equal number of threads. We can keep up this same pace when writing orc
files directly to hdfs with only 8 threads and thus 8 orc files. The orc
files from streaming are on the order of 5mb a piece (15min insert-time
base partitions). Even if orc stripes this small isn’t a problem, it’s
still going to waste a lot of disk space due to hdfs block size.


>
>>An orc file without a footer is junk data (or, at least, the last stripe
>>is junk data). I suppose my question should have been 'what will the hive
>>query do when it encounters this? Skip the stripe / file? Error out the
>query? Something else?'
>
>It should throw an exception, because that's a corrupt ORC file.
>
>The trucking demo uses Storm without ACID - this is likely to get better
>once we use Apache Falcon to move the data around.
>
>Cheers,
>Gopal
>
>

I suppose the best thing to do then is to write the orc file outside of
the partition directory, then issue an mv when the file is closed?
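A minimal sketch of that approach (the staging path is a placeholder; LOAD DATA INPATH moves, rather than copies, files within HDFS, so readers never see a half-written ORC file):

-- Sketch: the writer closes ORC files in a staging directory first,
-- then the closed files are moved into the partition in one step.
LOAD DATA INPATH '/staging/connection_events4/dt=1428613200'
INTO TABLE events.connection_events4 PARTITION (dt=1428613200);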

>



[ANNOUNCE] New Hive Committer - Mithun Radhakrishnan

2015-04-14 Thread Carl Steinbach
The Apache Hive PMC has voted to make Mithun Radhakrishnan a committer on
the Apache Hive Project.

Please join me in congratulating Mithun.

Thanks.

- Carl


Re: Default schema

2015-04-14 Thread Bala Krishna Gangisetty
You can also specify it in the "*.hiverc*" file.

.hiverc is executed automatically when Hive CLI is launched.

The file can be located at "$HIVE_CONF_DIR/.hiverc", or "$HOME/.hiverc". It
may vary based on the distribution you're on.

The below 2 Hive JIRAs can provide more details.

HIVE-1414  automatically
invoke .hiverc init script
HIVE-2911  Move global
.hiverc file
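
For example, a minimal .hiverc (a sketch; mydb is a placeholder, and the statements are ordinary HiveQL run at CLI startup):

-- $HOME/.hiverc (or $HIVE_CONF_DIR/.hiverc)
USE mydb;                            -- default schema for the session
SET hive.cli.print.current.db=true;  -- show the current database in the prompt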

--Bala G.

On Tue, Apr 14, 2015 at 2:03 PM, Maciek  wrote:

> I thought about that, but I'm not sure it's the most suitable one.
> Would you mind sharing those other ways?
>
> Thanks!
>
> On Tue, Apr 14, 2015 at 9:49 PM, Bala Krishna Gangisetty <
> b...@altiscale.com> wrote:
>
>> Yes, certainly. There are a couple of ways to do this.
>>
>> One such way is to define an alias for "hive --database
>> **"
>>
>> --Bala G.
>>
>> On Tue, Apr 14, 2015 at 1:30 PM, Maciek  wrote:
>>
>>> Is it possible to customize the schema a user logs on to?
>>> I was thinking of setting some bash environment variable
>>> or setting param file (like hive-env.sh, hiverc or hive-site.xml…)?
>>>
>>
>>
>


Re: External Table with unclosed orc files.

2015-04-14 Thread Grant Overby (groverby)
The remainder of my ranting paragraph is intended as an expansion on that
comment. Sorry, I wasn’t clear.


Grant Overby
Software Engineer
Cisco.com 
grove...@cisco.com
Mobile: 865 724 4910











On 4/14/15, 5:09 PM, "Mich Talebzadeh"  wrote:

>Hi Grant,
>
>Thanks for insight.
>
>You mentioned and I quote
>
>" Acid tables have been a real pain for us. We don’t believe they are
>production ready.. "
>
>Can you please elaborate on this/
>
>Thanks
>
>Mich Talebzadeh
>
>http://talebzadehmich.wordpress.com
>
>Author of the books "A Practitioner’s Guide to Upgrading to Sybase ASE
>15",
>ISBN 978-0-9563693-0-7.
>co-author "Sybase Transact SQL Guidelines Best Practices", ISBN
>978-0-9759693-0-4
>Publications due shortly:
>Creating in-memory Data Grid for Trading Systems with Oracle TimesTen and
>Coherence Cache
>Oracle and Sybase, Concepts and Contrasts, ISBN: 978-0-9563693-1-4, volume
>one out shortly
>
>
>
>-Original Message-
>From: Grant Overby (groverby) [mailto:grove...@cisco.com]
>Sent: 14 April 2015 22:02
>To: Gopal Vijayaraghavan; user@hive.apache.org
>Subject: Re: External Table with unclosed orc files.
>
>Thanks for the link to the hive streaming bolt. We rolled our own bolt
>many
>moons ago to utilize hive streaming. We’ve tried it against 0.13 and
>0.14 . Acid tables have been a real pain for us. We don’t believe they are
>production ready. At least in our use cases, Tez crashes for assorted
>reasons or only assigns 1 mapper to the partition. Having delta files and
>no
>base files borks mapper assignments.  Files containing flush in their name
>are left scattered about, borking queries. Latency is higher with
>streaming
>than writing to an orc file in hdfs, forcing obscene quantities of buckets
>and orc files smaller than any reasonable orc stripe / hdfs block size.
>The
>compactor hangs seemingly at random for no reason we’ve been able to
>discern.
>
>
>
>An orc file without a footer is junk data (or, at least, the last stripe
>is
>junk data). I suppose my question should have been 'what will the hive
>query
>do when it encounters this? Skip the stripe / file? Error out the query?
>Something else?’
>
>
>
>
>Grant Overby
>Software Engineer
>Cisco.com 
>grove...@cisco.com
>Mobile: 865 724 4910
>
>
>
>
>
>
>
>
>
>
>
>On 4/14/15, 4:23 PM, "Gopal Vijayaraghavan"  wrote:
>
>>
>>> What will Hive do if querying an external table containing orc files
>>>that are still being written to?
>>
>>Doing that directly won't work at all. Because ORC files are only
>>readable
>>after the Footer is written out, which won't be for any open files.
>>
>>> I won't be able to test these scenarios till tomorrow and would like to
>>>have some idea of what to expect this afternoon.
>>
>>If I remember correctly, your previous question was about writing ORC
>>from
>>Storm.
>>
>>If you're on a recent version of Storm, I'd advise you to look at
>>storm-hive/
>>
>>https://github.com/apache/storm/tree/master/external/storm-hive
>>
>>
>>Or alternatively, there's a "hortonworks trucking demo" which does a
>>partition insert instead.
>>
>>Cheers,
>>Gopal
>>
>>
>
>



Re: External Table with unclosed orc files.

2015-04-14 Thread Chad Dotzenrod
unsubscribe

On Tue, Apr 14, 2015 at 4:28 PM, Gopal Vijayaraghavan 
wrote:

>
> >0.14 . Acid tables have been a real pain for us. We don't believe they are
> >production ready. At least in our use cases, Tez crashes for assorted
> >reasons or only assigns 1 mapper to the partition. Having delta files and
> >no base files borks mapper assignments.
>
> Some of the chicken-egg problems for those were solved recently in
> HIVE-10114.
>
> Then TEZ-1993 is coming out in the next version of Tez, into which we're
> plugging in HIVE-7428 (no fix yet).
>
> Currently delta-only splits have 0 bytes as the "file size", so they get
> grouped together to make a 16Mb chunk (rather than a single huge 0-sized split).
>
> Those patches are the effect of me shaving the yak from the "1 mapper"
> issue.
>
> After which the writer has to follow up on HIVE-9933 to get the locality
> of files fixed.
>
> >name are left scattered about, borking queries. Latency is higher with
> >streaming than writing to an orc file in hdfs, forcing obscene quantities
> >of buckets and orc files smaller than any reasonable orc stripe / hdfs
> >block size. The compactor hangs seemingly at random for no reason we've
> >been able to discern.
>
> I haven't seen these issues yet, but I am not dealing with a large volume
> insert rate, so haven't produced latency issues there.
>
> Since I work on Hive performance and I haven't seen too many bugs filed,
> I haven't paid attention to the performance of ACID.
>
> Please file bugs when you find them, so that it appears on the radar for
> folks like me.
>
> I'm poking about because I want a live stream into LLAP to work seamlessly
> & return sub-second query results when queried (pre-cache/stage & merge
> etc).
>
> >An orc file without a footer is junk data (or, at least, the last stripe
> >is junk data). I suppose my question should have been 'what will the hive
> >query do when it encounters this? Skip the stripe / file? Error out the
> >query? Something else?'
>
> It should throw an exception, because that's a corrupt ORC file.
>
> The trucking demo uses Storm without ACID - this is likely to get better
> once we use Apache Falcon to move the data around.
>
> Cheers,
> Gopal
>
>
>


-- 
Chad J. Dotzenrod
(630)669-6095
cdotzen...@gmail.com


Re: External Table with unclosed orc files.

2015-04-14 Thread Gopal Vijayaraghavan

>0.14 . Acid tables have been a real pain for us. We don't believe they are
>production ready. At least in our use cases, Tez crashes for assorted
>reasons or only assigns 1 mapper to the partition. Having delta files and
>no base files borks mapper assignments.

Some of the chicken-egg problems for those were solved recently in
HIVE-10114.

Then TEZ-1993 is coming out in the next version of Tez, into which we're
plugging in HIVE-7428 (no fix yet).

Currently delta-only splits have 0 bytes as the "file size", so they get
grouped together to make a 16Mb chunk (rather than a single huge 0-sized split).

Those patches are the effect of me shaving the yak from the "1 mapper"
issue.

After which the writer has to follow up on HIVE-9933 to get the locality
of files fixed.

>name are left scattered about, borking queries. Latency is higher with
>streaming than writing to an orc file in hdfs, forcing obscene quantities
>of buckets and orc files smaller than any reasonable orc stripe / hdfs
>block size. The compactor hangs seemingly at random for no reason we've
>been able to discern.

I haven't seen these issues yet, but I am not dealing with a large volume
insert rate, so haven't produced latency issues there.

Since I work on Hive performance and I haven't seen too many bugs filed,
I haven't paid attention to the performance of ACID.

Please file bugs when you find them, so that it appears on the radar for
folks like me.

I'm poking about because I want a live stream into LLAP to work seamlessly
& return sub-second query results when queried (pre-cache/stage & merge
etc).

>An orc file without a footer is junk data (or, at least, the last stripe
>is junk data). I suppose my question should have been 'what will the hive
>query do when it encounters this? Skip the stripe / file? Error out the
>query? Something else?'

It should throw an exception, because that's a corrupt ORC file.

The trucking demo uses Storm without ACID - this is likely to get better
once we use Apache Falcon to move the data around.

Cheers,
Gopal




Re: Default schema

2015-04-14 Thread Maciek
I thought about that, but I'm not sure it's the most suitable one.
Would you mind sharing those other ways?

Thanks!

On Tue, Apr 14, 2015 at 9:49 PM, Bala Krishna Gangisetty  wrote:

> Yes, certainly. There are a couple of ways to do this.
>
> One such way is to define an alias for "hive --database
> **"
>
> --Bala G.
>
> On Tue, Apr 14, 2015 at 1:30 PM, Maciek  wrote:
>
>> Is it possible to customize the schema a user logs on to?
>> I was thinking of setting some bash environment variable
>> or setting param file (like hive-env.sh, hiverc or hive-site.xml…)?
>>
>
>


RE: External Table with unclosed orc files.

2015-04-14 Thread Mich Talebzadeh
Hi Grant,

Thanks for insight.

You mentioned and I quote

" Acid tables have been a real pain for us. We don’t believe they are
production ready.. "

Can you please elaborate on this/

Thanks

Mich Talebzadeh

http://talebzadehmich.wordpress.com

Author of the books "A Practitioner’s Guide to Upgrading to Sybase ASE 15",
ISBN 978-0-9563693-0-7. 
co-author "Sybase Transact SQL Guidelines Best Practices", ISBN
978-0-9759693-0-4
Publications due shortly:
Creating in-memory Data Grid for Trading Systems with Oracle TimesTen and
Coherence Cache
Oracle and Sybase, Concepts and Contrasts, ISBN: 978-0-9563693-1-4, volume
one out shortly



-Original Message-
From: Grant Overby (groverby) [mailto:grove...@cisco.com] 
Sent: 14 April 2015 22:02
To: Gopal Vijayaraghavan; user@hive.apache.org
Subject: Re: External Table with unclosed orc files.

Thanks for the link to the hive streaming bolt. We rolled our own bolt many
moons ago to utilize hive streaming. We’ve tried it against 0.13 and
0.14 . Acid tables have been a real pain for us. We don’t believe they are
production ready. At least in our use cases, Tez crashes for assorted
reasons or only assigns 1 mapper to the partition. Having delta files and no
base files borks mapper assignments.  Files containing flush in their name
are left scattered about, borking queries. Latency is higher with streaming
than writing to an orc file in hdfs, forcing obscene quantities of buckets
and orc files smaller than any reasonable orc stripe / hdfs block size. The
compactor hangs seemingly at random for no reason we’ve been able to
discern.



An orc file without a footer is junk data (or, at least, the last stripe is
junk data). I suppose my question should have been 'what will the hive query
do when it encounters this? Skip the stripe / file? Error out the query?
Something else?’




Grant Overby
Software Engineer
Cisco.com 
grove...@cisco.com
Mobile: 865 724 4910











On 4/14/15, 4:23 PM, "Gopal Vijayaraghavan"  wrote:

>
>> What will Hive do if querying an external table containing orc files
>>that are still being written to?
>
>Doing that directly won't work at all. Because ORC files are only readable
>after the Footer is written out, which won't be for any open files.
>
>> I won't be able to test these scenarios till tomorrow and would like to
>>have some idea of what to expect this afternoon.
>
>If I remember correctly, your previous question was about writing ORC from
>Storm.
>
>If you're on a recent version of Storm, I'd advise you to look at
>storm-hive/
>
>https://github.com/apache/storm/tree/master/external/storm-hive
>
>
>Or alternatively, there's a "hortonworks trucking demo" which does a
>partition insert instead.
>
>Cheers,
>Gopal
>
>




Re: External Table with unclosed orc files.

2015-04-14 Thread Grant Overby (groverby)
Thanks for the link to the hive streaming bolt. We rolled our own bolt
many moons ago to utilize hive streaming. We’ve tried it against 0.13 and
0.14 . Acid tables have been a real pain for us. We don’t believe they are
production ready. At least in our use cases, Tez crashes for assorted
reasons or only assigns 1 mapper to the partition. Having delta files and
no base files borks mapper assignments.  Files containing flush in their
name are left scattered about, borking queries. Latency is higher with
streaming than writing to an orc file in hdfs, forcing obscene quantities
of buckets and orc files smaller than any reasonable orc stripe / hdfs
block size. The compactor hangs seemingly at random for no reason we’ve
been able to discern.



An orc file without a footer is junk data (or, at least, the last stripe
is junk data). I suppose my question should have been 'what will the hive
query do when it encounters this? Skip the stripe / file? Error out the
query? Something else?’




Grant Overby
Software Engineer
Cisco.com 
grove...@cisco.com
Mobile: 865 724 4910











On 4/14/15, 4:23 PM, "Gopal Vijayaraghavan"  wrote:

>
>> What will Hive do if querying an external table containing orc files
>>that are still being written to?
>
>Doing that directly won't work at all. Because ORC files are only readable
>after the Footer is written out, which won't be for any open files.
>
>> I won't be able to test these scenarios till tomorrow and would like to
>>have some idea of what to expect this afternoon.
>
>If I remember correctly, your previous question was about writing ORC from
>Storm.
>
>If you're on a recent version of Storm, I'd advise you to look at
>storm-hive/
>
>https://github.com/apache/storm/tree/master/external/storm-hive
>
>
>Or alternatively, there's a "hortonworks trucking demo" which does a
>partition insert instead.
>
>Cheers,
>Gopal
>
>



Re: Default schema

2015-04-14 Thread Bala Krishna Gangisetty
Yes, certainly. There are a couple of ways to do this.

One such way is to define an alias for "hive --database **"

--Bala G.

On Tue, Apr 14, 2015 at 1:30 PM, Maciek  wrote:

> Is it possible to customize the schema a user logs on to?
> I was thinking of setting some bash environment variable
> or setting param file (like hive-env.sh, hiverc or hive-site.xml…)?
>


Default schema

2015-04-14 Thread Maciek
Is it possible to customize the schema a user logs on to?
I was thinking of setting some bash environment variable
or setting param file (like hive-env.sh, hiverc or hive-site.xml…)?


Re: External Table with unclosed orc files.

2015-04-14 Thread Gopal Vijayaraghavan

> What will Hive do if querying an external table containing orc files
>that are still being written to?

Doing that directly won't work at all. Because ORC files are only readable
after the Footer is written out, which won't be for any open files.

> I won't be able to test these scenarios till tomorrow and would like to
>have some idea of what to expect this afternoon.

If I remember correctly, your previous question was about writing ORC from
Storm.

If you're on a recent version of Storm, I'd advise you to look at
storm-hive/

https://github.com/apache/storm/tree/master/external/storm-hive


Or alternatively, there's a "hortonworks trucking demo" which does a
partition insert instead.

Cheers,
Gopal




Re: External Table with unclosed orc files.

2015-04-14 Thread Alan Gates
It will fail.  Orc writes info in the footers that are required to 
properly read the file.  If close hasn't been called, then that footer 
hasn't been written yet.


Alan.


Grant Overby (groverby) 
April 14, 2015 at 20:46
What will Hive do if querying an external table containing orc files 
that are still being written to?


If the process writing the orc files exits without calling .close()?


Sorry for taking the cheap way out and asking instead of testing. I 
couldn’t find anything on this via google. I won’t be able to test 
these scenarios till tomorrow and would like to have some idea of what 
to expect this afternoon.


RE: External Table with unclosed orc files.

2015-04-14 Thread Mich Talebzadeh
Hi,

 

I believe it behaves in much the same way as UNIX files/partitions.

 

If the file is opened by the first process writing to it, a swap file will
be created. If the second process is only querying it, then it will see the
data as of the first process's last save, but not the changes made after the
last save.

 

It will behave much like versioning in an RDBMS.

 

HTH

 

 

Mich Talebzadeh

 

http://talebzadehmich.wordpress.com

 

Author of the books "A Practitioner's Guide to Upgrading to Sybase ASE 15",
ISBN 978-0-9563693-0-7. 

co-author "Sybase Transact SQL Guidelines Best Practices", ISBN
978-0-9759693-0-4

Publications due shortly:

Creating in-memory Data Grid for Trading Systems with Oracle TimesTen and
Coherence Cache

Oracle and Sybase, Concepts and Contrasts, ISBN: 978-0-9563693-1-4, volume
one out shortly

 


 

From: Grant Overby (groverby) [mailto:grove...@cisco.com] 
Sent: 14 April 2015 19:46
To: user@hive.apache.org
Subject: External Table with unclosed orc files.

 

What will Hive do if querying an external table containing orc files that
are still being written to?

 

If the process writing the orc files exits without calling .close()?

 

 

Sorry for taking the cheap way out and asking instead of testing. I couldn't
find anything on this via google. I won't be able to test these scenarios
till tomorrow and would like to have some idea of what to expect this
afternoon.



External Table with unclosed orc files.

2015-04-14 Thread Grant Overby (groverby)
What will Hive do if querying an external table containing orc files that are 
still being written to?

If the process writing the orc files exits without calling .close()?


Sorry for taking the cheap way out and asking instead of testing. I couldn’t 
find anything on this via google. I won’t be able to test these scenarios till 
tomorrow and would like to have some idea of what to expect this afternoon.


Re: [Hive] Slow Loading Data Process with Parquet over 30k Partitions

2015-04-14 Thread Edward Capriolo
That is too many partitions. Way too much overhead in anything that has that
many partitions.

On Tue, Apr 14, 2015 at 12:53 PM, Tianqi Tong  wrote:

>  Hi Slava and Ferdinand,
>
> Thanks for the reply! Later when I was looking at the hive.log, I found
> Hive was indeed calculating the partition stats, and the log looks like:
>
> ….
>
> 2015-04-14 09:38:21,146 WARN  [main]: hive.log
> (MetaStoreUtils.java:updatePartitionStatsFast(296)) - Updating partition
> stats fast for: parquet_table
>
> 2015-04-14 09:38:21,147 WARN  [main]: hive.log
> (MetaStoreUtils.java:updatePartitionStatsFast(299)) - Updated size to
> 5533480
>
> 2015-04-14 09:38:44,511 WARN  [main]: hive.log
> (MetaStoreUtils.java:updatePartitionStatsFast(296)) - Updating partition
> stats fast for: parquet_table
>
> 2015-04-14 09:38:44,512 WARN  [main]: hive.log
> (MetaStoreUtils.java:updatePartitionStatsFast(299)) - Updated size to 66246
>
> 2015-04-14 09:39:07,554 WARN  [main]: hive.log
> (MetaStoreUtils.java:updatePartitionStatsFast(296)) - Updating partition
> stats fast for: parquet_table
>
> 2015-04-14 09:39:07,555 WARN  [main]: hive.log
> (MetaStoreUtils.java:updatePartitionStatsFast(299)) - Updated size to 418925
>
> ….
>
>
>
> One interesting thing is, it's getting slower and slower. Right after I
> launched the job, it took less than 1s to calculate for one partition. Now
> it's taking 20+s for each one.
>
> I tried hive.stats.autogather=false, but somehow it didn't seem to work. I
> also ended up hard-coding a little bit in the Hive source code.
>
>
>
> In my case, I have around 40k partitions with one file (varies from 1M
> to 1G) in each of them. Now it's been 4 days and the first job I launched
> is still not done yet, with partition stats.
>
>
>
> Thanks
>
> Tianqi Tong
>
>
>
> *From:* Slava Markeyev [mailto:slava.marke...@upsight.com]
> *Sent:* Monday, April 13, 2015 11:00 PM
> *To:* user@hive.apache.org
> *Cc:* Sergio Pena
> *Subject:* Re: [Hive] Slow Loading Data Process with Parquet over 30k
> Partitions
>
>
>
> This is something I've encountered when doing ETL with hive and having it
> create 10's of thousands partitions. The issue is each partition needs to
> be added to the metastore and this is an expensive operation to perform. My
> work around was adding a flag to hive that optionally disables the
> metastore partition creation step. This may not be a solution for everyone
> as that table then has no partitions and you would have to run msck repair
> but depending on your use case, you may just want the data in hdfs.
>
> If there is interest in having this be an option I'll make a ticket and
> submit the patch.
>
> -Slava
>
>
>
> On Mon, Apr 13, 2015 at 10:40 PM, Xu, Cheng A 
> wrote:
>
> Hi Tianqi,
>
> Can you attach hive.log as more detailed information?
>
> +Sergio
>
>
>
> Yours,
>
> Ferdinand Xu
>
>
>
> *From:* Tianqi Tong [mailto:tt...@brightedge.com]
> *Sent:* Friday, April 10, 2015 1:34 AM
> *To:* user@hive.apache.org
> *Subject:* [Hive] Slow Loading Data Process with Parquet over 30k
> Partitions
>
>
>
> Hello Hive,
>
> I'm a developer using Hive to process TB level data, and I'm having some
> difficulty loading the data to table.
>
> I have 2 tables now:
>
>
>
> -- table_1:
>
> CREATE EXTERNAL TABLE `table_1`(
>
>   `keyword` string,
>
>   `domain` string,
>
>   `url` string
>
>   )
>
> PARTITIONED BY (yearmonth INT, partition1 STRING)
>
> STORED AS RCfile
>
>
>
> -- table_2:
>
> CREATE EXTERNAL TABLE `table_2`(
>
>   `keyword` string,
>
>   `domain` string,
>
>   `url` string
>
>   )
>
> PARTITIONED BY (yearmonth INT, partition2 STRING)
>
> STORED AS Parquet
>
>
>
> I'm doing an INSERT OVERWRITE to table_2 from SELECT FROM table_1 with
> dynamic partitioning, and the number of partitions grows dramatically from
> 1500 to 40k (because I want to use something else as partitioning).
>
> The mapreduce job was fine.
>
> Somehow the process got stuck at " Loading data to table default.table_2
> (yearmonth=null, domain_prefix=null) ", and I've been waiting for hours.
>
>
>
> Is this expected when we have 40k partitions?
>
>
>
> --
>
> Refs - Here are the parameters that I used:
>
> export HADOOP_HEAPSIZE=16384
>
> set PARQUET_FILE_SIZE=268435456;
>
> set parquet.block.size=268435456;
>
> set dfs.blocksize=268435456;
>
> set parquet.compression=SNAPPY;
>
> SET hive.exec.dynamic.partition.mode=nonstrict;
>
> SET hive.exec.max.dynamic.partitions=50;
>
> SET hive.exec.max.dynamic.partitions.pernode=5;
>
> SET hive.exec.max.created.files=100;
>
>
>
>
>
> Thank you very much!
>
> Tianqi Tong
>
>
>
>
> --
>
> Slava Markeyev | Engineering | Upsight
>


RE: [Hive] Slow Loading Data Process with Parquet over 30k Partitions

2015-04-14 Thread Tianqi Tong
Hi Slava and Ferdinand,
Thanks for the reply! Later when I was looking at the hive.log, I found Hive 
was indeed calculating the partition stats, and the log looks like:
….
2015-04-14 09:38:21,146 WARN  [main]: hive.log 
(MetaStoreUtils.java:updatePartitionStatsFast(296)) - Updating partition stats 
fast for: parquet_table
2015-04-14 09:38:21,147 WARN  [main]: hive.log 
(MetaStoreUtils.java:updatePartitionStatsFast(299)) - Updated size to 5533480
2015-04-14 09:38:44,511 WARN  [main]: hive.log 
(MetaStoreUtils.java:updatePartitionStatsFast(296)) - Updating partition stats 
fast for: parquet_table
2015-04-14 09:38:44,512 WARN  [main]: hive.log 
(MetaStoreUtils.java:updatePartitionStatsFast(299)) - Updated size to 66246
2015-04-14 09:39:07,554 WARN  [main]: hive.log 
(MetaStoreUtils.java:updatePartitionStatsFast(296)) - Updating partition stats 
fast for: parquet_table
2015-04-14 09:39:07,555 WARN  [main]: hive.log 
(MetaStoreUtils.java:updatePartitionStatsFast(299)) - Updated size to 418925
….

One interesting thing is, it's getting slower and slower. Right after I 
launched the job, it took less than 1s to calculate for one partition. Now it's 
taking 20+s for each one.
I tried hive.stats.autogather=false, but somehow it didn't seem to work. I also 
ended up hard-coding a little bit in the Hive source code.

In my case, I have around 40k partitions with one file (varies from 1M to 1G)
in each of them. Now it's been 4 days and the first job I launched is still not 
done yet, with partition stats.

Thanks
Tianqi Tong

From: Slava Markeyev [mailto:slava.marke...@upsight.com]
Sent: Monday, April 13, 2015 11:00 PM
To: user@hive.apache.org
Cc: Sergio Pena
Subject: Re: [Hive] Slow Loading Data Process with Parquet over 30k Partitions

This is something I've encountered when doing ETL with hive and having it 
create tens of thousands of partitions. The issue is each partition needs to be
added to the metastore and this is an expensive operation to perform. My work 
around was adding a flag to hive that optionally disables the metastore 
partition creation step. This may not be a solution for everyone as that table 
then has no partitions and you would have to run msck repair but depending on 
your use case, you may just want the data in hdfs.
If there is interest in having this be an option I'll make a ticket and submit 
the patch.
-Slava
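
For reference, the repair step mentioned above is a single statement (a sketch, using the table name from this thread):

-- Re-register partitions that exist in HDFS but are missing from the metastore.
MSCK REPAIR TABLE table_2;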

On Mon, Apr 13, 2015 at 10:40 PM, Xu, Cheng A 
<cheng.a...@intel.com> wrote:
Hi Tianqi,
Can you attach hive.log as more detailed information?
+Sergio

Yours,
Ferdinand Xu

From: Tianqi Tong [mailto:tt...@brightedge.com]
Sent: Friday, April 10, 2015 1:34 AM
To: user@hive.apache.org
Subject: [Hive] Slow Loading Data Process with Parquet over 30k Partitions

Hello Hive,
I'm a developer using Hive to process TB level data, and I'm having some 
difficulty loading the data to table.
I have 2 tables now:

-- table_1:
CREATE EXTERNAL TABLE `table_1`(
  `keyword` string,
  `domain` string,
  `url` string
  )
PARTITIONED BY (yearmonth INT, partition1 STRING)
STORED AS RCfile

-- table_2:
CREATE EXTERNAL TABLE `table_2`(
  `keyword` string,
  `domain` string,
  `url` string
  )
PARTITIONED BY (yearmonth INT, partition2 STRING)
STORED AS Parquet

I'm doing an INSERT OVERWRITE to table_2 from SELECT FROM table_1 with dynamic 
partitioning, and the number of partitions grows dramatically from 1500 to 40k 
(because I want to use something else as partitioning).
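A sketch of the kind of statement described above (the actual query is not shown in the thread; deriving the new partition column from domain is an assumption):

-- Dynamic-partition rewrite from the RCFile table into the Parquet table.
-- Partition columns must come last in the SELECT list.
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

INSERT OVERWRITE TABLE table_2 PARTITION (yearmonth, partition2)
SELECT keyword, domain, url, yearmonth, substr(domain, 1, 3) AS partition2
FROM table_1;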
The mapreduce job was fine.
Somehow the process got stuck at " Loading data to table default.table_2
(yearmonth=null, domain_prefix=null) ", and I've been waiting for hours.

Is this expected when we have 40k partitions?

--
Refs - Here are the parameters that I used:
export HADOOP_HEAPSIZE=16384
set PARQUET_FILE_SIZE=268435456;
set parquet.block.size=268435456;
set dfs.blocksize=268435456;
set parquet.compression=SNAPPY;
SET hive.exec.dynamic.partition.mode=nonstrict;
SET hive.exec.max.dynamic.partitions=50;
SET hive.exec.max.dynamic.partitions.pernode=5;
SET hive.exec.max.created.files=100;


Thank you very much!
Tianqi Tong



--

Slava Markeyev | Engineering | Upsight


Re: partition and bucket

2015-04-14 Thread Ashok Kumar
Thank you sir. Much appreciated 


 On Sunday, 12 April 2015, 21:05, Mich Talebzadeh  
wrote:
   
