Re: Hive Cli ORC table read error with limit option

2016-03-06 Thread Biswajit Nayak
Hi Gopal,


I had already pasted the table format in this thread. Will repeat it again.


hive> desc formatted testdb.table_orc;
OK
# col_name              data_type       comment

row_id                  bigint
a                       int
b                       int
c                       varchar(2)
d                       bigint
e                       int
f                       bigint
g                       float
h                       int
i                       int

# Partition Information
# col_name              data_type       comment

year                    int
month                   int
day                     int

# Detailed Table Information
Database:               testdb
Owner:                  **
CreateTime:             Mon Jan 25 22:32:22 UTC 2016
LastAccessTime:         UNKNOWN
Protect Mode:           None
Retention:              0
Location:               hdfs://***:8020/hive/testdb.db/table_orc
Table Type:             MANAGED_TABLE
Table Parameters:
        last_modified_by        **
        last_modified_time      **
        orc.compress            SNAPPY
        transient_lastDdlTime   1454104669

# Storage Information
SerDe Library:          org.apache.hadoop.hive.ql.io.orc.OrcSerde
InputFormat:            org.apache.hadoop.hive.ql.io.orc.OrcInputFormat
OutputFormat:           org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat
Compressed:             No
Num Buckets:            7
Bucket Columns:         [f]
Sort Columns:           []
Storage Desc Params:
        field.delim             \t
        serialization.format    \t
Time taken: 0.105 seconds, Fetched: 46 row(s)
hive>


>>> Depends on whether any of those columns are partition columns or not &
>>> whether the table is marked transactional.

Yes, those columns are partition columns, and the table is not marked as
transactional.


>>> Usually that and a copy of --orcfiledump output to check the
>>> offsets/types.

There are around 10 files, so pasting all of the orcfiledump output here would
be a mess. Is there any way to find the defective file, so that I can isolate
it and paste just its orcfiledump output?
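
A minimal sketch of one way to narrow it down (the partition path below is an
assumption -- adjust it to your layout): run orcfiledump over each file and see
which one fails; the stack trace for the failing file should point at the bad
footer/types.

  for f in $(hdfs dfs -ls /hive/testdb.db/table_orc/year=2016/month=1/day=30 | awk '{print $8}'); do
    echo "== $f"
    hive --orcfiledump "$f" > /dev/null || echo "orcfiledump failed for $f"
  done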

Thanks
Biswa


On Sat, Mar 5, 2016 at 12:21 AM, Gopal Vijayaraghavan 
wrote:

>
> > Any one has any idea about this.. Really stuck with this.
> ...
> > hive> select h from testdb.table_orc where year = 2016 and month =1 and
> >day >29 limit 10;
>
> Depends on whether any of those columns are partition columns or not &
> whether the table is marked transactional.
>
> > Caused by: java.lang.IndexOutOfBoundsException: Index: 0
> > at java.util.Collections$EmptyList.get(Collections.java:3212)
> > at
> >org.apache.hadoop.hive.ql.io.orc.OrcProto$Type.getSubtypes(OrcProto.java:1
> >2240)
>
> If you need answers to rare problems, these emails need at least the table
> format ("desc formatted").
>
>
> Usually that and a copy of --orcfiledump output to check the offsets/types.
>
> Cheers,
> Gopal
>
>
>


Re: Updating column in table throws error

2016-03-06 Thread Mich Talebzadeh
Hi,

This update will throw an error: any column used for bucketing (in effect, hash
partitioning) cannot be updated, because it determines the physical placement
of rows in the table.

HTH

Dr Mich Talebzadeh



LinkedIn: https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com



On 6 March 2016 at 18:19, Marcin Tustin  wrote:

> Don't bucket on columns you expect to update.
>
> Potentially you could delete the whole row and reinsert it.
>
>
> On Sunday, March 6, 2016, Ashok Kumar  wrote:
>
>> Hi gurus,
>>
>> I have an ORC table bucketed on invoicenumber with "transactional"="true"
>>
>> I am trying to update invoicenumber column used for bucketing this table
>> but it comes back with
>>
>> Error: Error while compiling statement: FAILED: SemanticException [Error
>> 10302]: Updating values of bucketing columns is not supported.  Column
>> invoicenumber
>>
>> Any ideas how it can be solved?
>>
>> Thank you
>>
>
>


Re: Updating column in table throws error

2016-03-06 Thread Marcin Tustin
Don't bucket on columns you expect to update.

Potentially you could delete the whole row and reinsert it.
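
A rough sketch of that workaround, assuming an ACID (transactional) ORC table
bucketed on invoicenumber -- the table and the other column values below are
invented for illustration:

  -- Hive will not UPDATE a bucketing column, but on a transactional table
  -- you can delete the old row and insert a corrected copy instead.
  DELETE FROM invoices WHERE invoicenumber = 12345;
  INSERT INTO TABLE invoices VALUES (99999, 'ACME Ltd', 150.00, '2016-03-06');

If the remaining column values must be preserved exactly, copy the row into a
temporary table first, then delete and re-insert from that copy.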

On Sunday, March 6, 2016, Ashok Kumar  wrote:

> Hi gurus,
>
> I have an ORC table bucketed on invoicenumber with "transactional"="true"
>
> I am trying to update invoicenumber column used for bucketing this table
> but it comes back with
>
> Error: Error while compiling statement: FAILED: SemanticException [Error
> 10302]: Updating values of bucketing columns is not supported.  Column
> invoicenumber
>
> Any ideas how it can be solved?
>
> Thank you
>




Updating column in table throws error

2016-03-06 Thread Ashok Kumar
 Hi gurus,
I have an ORC table bucketed on invoicenumber with "transactional"="true"
I am trying to update invoicenumber column used for bucketing this table but it 
comes back with
Error: Error while compiling statement: FAILED: SemanticException [Error 
10302]: Updating values of bucketing columns is not supported.  Column 
invoicenumber
Any ideas how it can be solved?
Thank you

Re: Parquet versus ORC

2016-03-06 Thread Marcin Tustin
If you google, you'll find benchmarks showing each to be faster than the
other. In so far as there's any reality to which is faster in any given
comparison, it seems to be a result of each incorporating ideas from the
other, or at least going through development cycles to beat each other.

ORC is very fast for working with hive, and we use it at Handy. That said,
the broader support for parquet might enable things like performing your
own insertions into tables by dropping new files in there, or doing your
own concatenation and cleanup.

In summary, until you benchmark your own usage I'd assume performance is
the same. If you're not going to benchmark, go by what's likely to be most
convenient.

On Sun, Mar 6, 2016 at 11:06 AM, Mich Talebzadeh 
wrote:

> Hi,
>
> Thanks for that link.
>
> It appears that the main advantage of Parquet is stated as follows, and I quote:
>
> "Parquet is built to be used by anyone. The Hadoop ecosystem is rich with
> data processing frameworks, and we are not interested in playing favorites.
> We believe that an efficient, well-implemented columnar storage substrate
> should be useful to all frameworks without the cost of extensive and
> difficult to set up dependencies."
>
> Fair enough Parquet provides columnar format and compression. As I stated
> I do not know much about it. However, my understanding of ORC is that it
> provides better encoding of data, Predicate push down for some predicates
> plus support for ACID properties.
>
> As Alan Gates stated before (Hive user forum, "Difference between ORC and
> RC files" , 21 Dec 15) and I quote
>
> "Whether ORC is the best format for what you're doing depends on the data
> you're storing and how you are querying it.  If you are storing data where
> you know the schema and you are doing analytic type queries it's the best
> choice (in fairness, some would dispute this and choose Parquet, though
> much of what I said above (about ORC vs RC applies to Parquet as well).  If
> you are doing queries that select the whole row each time columnar formats
> like ORC won't be your friend.  Also, if you are storing self structured
> data such as JSON or Avro you may find text or Avro storage to be a better
> format.
>
> So what would be the main advantage(s) of Parquet over ORC please besides
> using queries that select whole row (much like "a row based" type
> relational database does).
>
>
> Cheers.
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn: https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> On 6 March 2016 at 15:34, Uli Bethke  wrote:
>
>> Curious why you think that Parquet does not have metadata at file, row
>> group or column level.
>> Please refer here to the type of metadata that Parquet supports in the
>> docs http://parquet.apache.org/documentation/latest/
>>
>>
>> On 06/03/2016 15:26, Mich Talebzadeh wrote:
>>
>> Hi.
>>
>> I have been hearing a fair bit about Parquet versus ORC tables.
>>
>> In a nutshell I can say that Parquet is a predecessor to ORC (both
>> provide columnar type storage) but I notice that it is still being used
>> especially with Spark users.
>>
>> In mitigation it appears that Spark users are reluctant to use ORC
>> despite the fact that with inbuilt Store Index it offers superior
>> optimisation with data and stats at file, stripe and row group level. Both
>> Parquet and ORC offer SNAPPY compression as well. ORC offers ZLIB as
>> default.
>>
>> There may be reasons other than technical ones for this adoption, for example
>> too much reliance on Hive plus the fact that it is easier to flatten
>> Parquet than ORC (whatever that means).
>>
>> I for myself use either text files or ORC with Hive and Spark and don't
>> really see any reason why I should adopt others like Avro, Parquet etc.
>>
>> Appreciate any verification or experience on this.
>>
>> Thanks,
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn: https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>>
>>
>>
>> --
>> ___
>> Uli Bethke
>> Chair Hadoop User Group Ireland, www.hugireland.org
>> HUG Ireland is community sponsor of Hadoop Summit Europe in Dublin 
>> http://2016.hadoopsummit.org/dublin/
>>
>>
>




Re: Parquet versus ORC

2016-03-06 Thread Mich Talebzadeh
Hi,

Thanks for that link.

It appears that the main advantage of Parquet is stated as follows, and I quote:

"Parquet is built to be used by anyone. The Hadoop ecosystem is rich with
data processing frameworks, and we are not interested in playing favorites.
We believe that an efficient, well-implemented columnar storage substrate
should be useful to all frameworks without the cost of extensive and
difficult to set up dependencies."

Fair enough, Parquet provides a columnar format and compression. As I stated, I
do not know much about it. However, my understanding of ORC is that it provides
better encoding of data, predicate push-down for some predicates, plus support
for ACID properties.
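
As a rough illustration of the predicate push-down side (the settings and the
table below are assumptions for this sketch -- check the defaults in your own
Hive version):

  SET hive.optimize.ppd=true;            -- push predicates towards the storage layer
  SET hive.optimize.index.filter=true;   -- let ORC min/max indexes skip row groups
  SELECT amount FROM sales_orc WHERE trade_date = '2016-03-06';  -- hypothetical table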

As Alan Gates stated before (Hive user forum, "Difference between ORC and
RC files" , 21 Dec 15) and I quote

"Whether ORC is the best format for what you're doing depends on the data
you're storing and how you are querying it.  If you are storing data where
you know the schema and you are doing analytic type queries it's the best
choice (in fairness, some would dispute this and choose Parquet, though
much of what I said above (about ORC vs RC applies to Parquet as well).  If
you are doing queries that select the whole row each time columnar formats
like ORC won't be your friend.  Also, if you are storing self structured
data such as JSON or Avro you may find text or Avro storage to be a better
format.

So what would be the main advantage(s) of Parquet over ORC, please, besides
queries that select the whole row each time (much as a "row based" relational
database does)?


Cheers.


Dr Mich Talebzadeh



LinkedIn: https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com



On 6 March 2016 at 15:34, Uli Bethke  wrote:

> Curious why you think that Parquet does not have metadata at file, row
> group or column level.
> Please refer here to the type of metadata that Parquet supports in the
> docs http://parquet.apache.org/documentation/latest/
>
>
> On 06/03/2016 15:26, Mich Talebzadeh wrote:
>
> Hi.
>
> I have been hearing a fair bit about Parquet versus ORC tables.
>
> In a nutshell I can say that Parquet is a predecessor to ORC (both provide
> columnar type storage) but I notice that it is still being used
> especially with Spark users.
>
> In mitigation it appears that Spark users are reluctant to use ORC despite
> the fact that with inbuilt Store Index it offers superior optimisation with
> data and stats at file, stripe and row group level. Both Parquet and ORC
> offer SNAPPY compression as well. ORC offers ZLIB as default.
>
> There may be reasons other than technical ones for this adoption, for example
> too much reliance on Hive plus the fact that it is easier to flatten
> Parquet than ORC (whatever that means).
>
> I for myself use either text files or ORC with Hive and Spark and don't
> really see any reason why I should adopt others like Avro, Parquet etc.
>
> Appreciate any verification or experience on this.
>
> Thanks,
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn: https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
>
>
> --
> ___
> Uli Bethke
> Chair Hadoop User Group Ireland, www.hugireland.org
> HUG Ireland is community sponsor of Hadoop Summit Europe in Dublin 
> http://2016.hadoopsummit.org/dublin/
>
>


Re: Parquet versus ORC

2016-03-06 Thread Uli Bethke
Curious why you think that Parquet does not have metadata at file, row
group or column level.
Please refer here to the type of metadata that Parquet supports in the 
docs http://parquet.apache.org/documentation/latest/


On 06/03/2016 15:26, Mich Talebzadeh wrote:

Hi.

I have been hearing a fair bit about Parquet versus ORC tables.

In a nutshell I can say that Parquet is a predecessor to ORC (both 
provide columnar type storage) but I notice that it is still being 
used especially with Spark users.


In mitigation it appears that Spark users are reluctant to use ORC 
despite the fact that with inbuilt Store Index it offers superior 
optimisation with data and stats at file, stripe and row group level. 
Both Parquet and ORC offer SNAPPY compression as well. ORC offers ZLIB 
as default.


There may be reasons other than technical ones for this adoption, for
example too much reliance on Hive plus the fact that it is easier to 
flatten Parquet than ORC (whatever that means).


I for myself use either text files or ORC with Hive and Spark and 
don't really see any reason why I should adopt others like Avro, 
Parquet etc.


Appreciate any verification or experience on this.

Thanks,

Dr Mich Talebzadeh

LinkedIn: https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw


http://talebzadehmich.wordpress.com 




--
___
Uli Bethke
Chair Hadoop User Group Ireland
www.hugireland.org
HUG Ireland is community sponsor of Hadoop Summit Europe in Dublin
http://2016.hadoopsummit.org/dublin/



Parquet versus ORC

2016-03-06 Thread Mich Talebzadeh
Hi.

I have been hearing a fair bit about Parquet versus ORC tables.

In a nutshell I can say that Parquet is a predecessor to ORC (both provide
columnar-type storage), but I notice that it is still widely used, especially
by Spark users.

In mitigation, it appears that Spark users are reluctant to use ORC despite the
fact that its inbuilt storage index offers superior optimisation, with data and
stats at file, stripe and row-group level. Both Parquet and ORC offer SNAPPY
compression as well; ORC uses ZLIB by default.
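
For illustration, a minimal DDL sketch of the two formats as discussed here
(the table names and columns are made up; "orc.compress" is the commonly
documented property, so verify the details against your Hive version):

  CREATE TABLE trades_orc (trade_id BIGINT, amount DOUBLE)
  STORED AS ORC TBLPROPERTIES ("orc.compress"="SNAPPY");  -- default would be ZLIB

  CREATE TABLE trades_parquet (trade_id BIGINT, amount DOUBLE)
  STORED AS PARQUET;  -- compression typically set via parquet.compression or engine defaults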

There may be reasons other than technical ones for this adoption, for example
too much reliance on Hive, plus the fact that it is easier to flatten
Parquet than ORC (whatever that means).

I myself use either text files or ORC with Hive and Spark, and don't really
see any reason why I should adopt others like Avro or Parquet.

Appreciate any verification or experience on this.

Thanks,

Dr Mich Talebzadeh



LinkedIn: https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com


Re: Which one should i use for benchmark tasks in hive & hadoop

2016-03-06 Thread dhruv kapatel
Thank you very much.


-- 


With Regards: Kapatel Dhruv v


Re: Which one should i use for benchmark tasks in hive & hadoop

2016-03-06 Thread Jiacai Liu
I have answered this question at stackoverflow.☺

On Sun, Mar 6, 2016 at 1:47 PM, dhruv kapatel 
wrote:

>
>
> Hi
>
> I am comparing the performance of Pig and Hive for weblog data.
> I was reading this Pig and Hive benchmark, in which one statement on page 10
> says: "The CPU time required by a job running on a 10 node cluster will (more
> or less) be the same as the time required to run the same job on a 1000 node
> cluster. However the real time it takes the job to complete on the 1000 node
> cluster will be 100 times less than if it were to run on a 10 node cluster."
>
> How can it take the same CPU time on clusters of different capacity?
>
> In this benchmark they have considered both real and cumulative CPU time.
> As real time is also affected by other processes, which time should I
> consider for the actual performance measure of Pig and Hive?
>
> See question below for more details.
>
>
> http://stackoverflow.com/questions/35500987/which-one-should-i-use-for-benchmark-tasks-in-hadoop-usersys-time-or-total-cpu
>
>
> http://www.ibm.com/developerworks/library/ba-pigvhive/pighivebenchmarking.pdf
> .
>
> --
>
>
> With Regards: Kapatel Dhruv v
>
>
>
>
>
>
>
>
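
(As a rough illustration of the quoted claim, with made-up numbers: a job that
needs about 10,000 CPU-seconds of work needs that much CPU no matter how many
nodes run it. Spread over 10 nodes the wall-clock time is roughly
10,000 / 10 = 1,000 seconds; spread over 1,000 nodes it is roughly
10,000 / 1,000 = 10 seconds -- about 100x less real time for the same
cumulative CPU time, ignoring scheduling and shuffle overhead.)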


Re: Problems with building hive from source code

2016-03-06 Thread Jiacai Liu
When I compile a project, errors happen now and then; most of the time I just
recompile it and everything is OK.

Also, JDK 1.8 may not work well with the Hadoop ecosystem, so I advise trying
JDK 1.7 instead.
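
For instance, a rough sketch (the JDK path and the hadoop-2 profile are
assumptions -- adjust them to your machine, OS and the Hive version you are
building):

  export JAVA_HOME=/usr/lib/jvm/java-1.7.0   # point Maven at a JDK 7 install
  export PATH=$JAVA_HOME/bin:$PATH
  mvn clean install -DskipTests -Phadoop-2
  # once the javadoc/path problem is fixed, resume from the failed module:
  mvn install -DskipTests -Phadoop-2 -rf :hive-webhcat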



On Sun, Mar 6, 2016 at 6:31 PM, Isuru Sankalpa  wrote:

> When I build Hive according to the instructions from
> https://lens.apache.org/lenshome/install-and-run.html it gives errors
> when building the project
>
>
> [INFO] Hive HCatalog Server Extensions  SUCCESS [  2.592 s]
> [INFO] Hive HCatalog Webhcat Java Client .. SUCCESS [  1.996 s]
> [INFO] Hive HCatalog Webhcat .. FAILURE [  5.603 s]
> [INFO] Hive HCatalog Streaming  SKIPPED
> [INFO] Hive HWI ... SKIPPED
> [INFO] Hive ODBC .. SKIPPED
> [INFO] Hive Shims Aggregator .. SKIPPED
> [INFO] Hive TestUtils . SKIPPED
> [INFO] Hive Packaging . SKIPPED
> [INFO]
> 
> [INFO] BUILD FAILURE
> [INFO]
> 
> [INFO] Total time: 03:47 min
> [INFO] Finished at: 2016-03-06T01:11:00-08:00
> [INFO] Final Memory: 132M/367M
> [INFO]
> 
> [ERROR] Failed to execute goal org.apache.maven.plugins:maven-javadoc-plugin:2.4:javadoc
> (resourcesdoc.xml) on project hive-webhcat: An error has occurred in JavaDocs
> report generation: Exit code: 1 - javadoc: error - Illegal package name: "drills\hive"
> [ERROR] javadoc: error - Illegal package name:
> "build\hive-hive-release-0.13.4-inm\hcatalog\webhcat\svr\target\classes/resourcedoc.xml"
> [ERROR] javadoc: warning - No source files for package with
> [ERROR]
> [ERROR] Command line was: "C:\Program Files\Java\jdk1.8.0_11\jre\..\bin\javadoc.exe" @options @packages
> [ERROR] -> [Help 1]
> [ERROR]
> [ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch.
> [ERROR] Re-run Maven using the -X switch to enable full debug logging.
> [ERROR]
> [ERROR] For more information about the errors and possible solutions, please
> read the following articles:
> [ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/MojoExecutionException
> [ERROR]
> [ERROR]
> [ERROR] After correcting the problems, you can resume the build with the
> command
>
> [ERROR]   mvn  -rf :hive-webhcat
>
> Can someone please explain the reason?
>
>


Problems with building hive from source code

2016-03-06 Thread Isuru Sankalpa
When I build Hive according to the instructions from
https://lens.apache.org/lenshome/install-and-run.html, it gives errors when
building the project


[INFO] Hive HCatalog Server Extensions  SUCCESS [  2.592 s]
[INFO] Hive HCatalog Webhcat Java Client .. SUCCESS [  1.996 s]
[INFO] Hive HCatalog Webhcat .. FAILURE [  5.603 s]
[INFO] Hive HCatalog Streaming  SKIPPED
[INFO] Hive HWI ... SKIPPED
[INFO] Hive ODBC .. SKIPPED
[INFO] Hive Shims Aggregator .. SKIPPED
[INFO] Hive TestUtils . SKIPPED
[INFO] Hive Packaging . SKIPPED
[INFO]

[INFO] BUILD FAILURE
[INFO]

[INFO] Total time: 03:47 min
[INFO] Finished at: 2016-03-06T01:11:00-08:00
[INFO] Final Memory: 132M/367M
[INFO]

[ERROR] Failed to execute goal org.apache.maven.plugins:maven-javadoc-plugin:2.4:javadoc
(resourcesdoc.xml) on project hive-webhcat: An error has occurred in JavaDocs
report generation: Exit code: 1 - javadoc: error - Illegal package name: "drills\hive"
[ERROR] javadoc: error - Illegal package name:
"build\hive-hive-release-0.13.4-inm\hcatalog\webhcat\svr\target\classes/resourcedoc.xml"
[ERROR] javadoc: warning - No source files for package with
[ERROR]
[ERROR] Command line was: "C:\Program Files\Java\jdk1.8.0_11\jre\..\bin\javadoc.exe" @options @packages
[ERROR] -> [Help 1]
[ERROR]
[ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch.
[ERROR] Re-run Maven using the -X switch to enable full debug logging.
[ERROR]
[ERROR] For more information about the errors and possible solutions, please
read the following articles:
[ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/MojoExecutionException
[ERROR]
[ERROR]
[ERROR] After correcting the problems, you can resume the build with the
command

[ERROR]   mvn  -rf :hive-webhcat

Can someone please explain the reason?