Hadoop-based product recommendations

2013-05-28 Thread Sai Sai
Just wondering if anyone would have any suggestions.
We are a group of developers who have been on the bench for a few months, trained
on Hadoop but without any projects to work on.
We would like to develop a Hadoop/Hive/Pig-based product for our company so we
can be of value to the company and not be scared of layoffs. We are wondering if
anyone could share ideas for a product that we could develop to be of value
to our company, rather than just hoping we get assigned projects. Any
help/suggestions/direction will be really appreciated.
Thanks,
Sai

Deadline Extension: 2013 Workshop on Middleware for HPC and Big Data Systems (MHPC'13)

2013-05-28 Thread MHPC 2013
we apologize if you receive multiple copies of this message

===

CALL FOR PAPERS

2013 Workshop on

Middleware for HPC and Big Data Systems

MHPC '13

as part of Euro-Par 2013, Aachen, Germany

===

Date: August 27, 2013

Workshop URL: http://m-hpc.org

Springer LNCS

SUBMISSION DEADLINE:

June 10, 2013 - LNCS Full paper submission (extended)
June 28, 2013 - Lightning Talk abstracts


SCOPE

Extremely large, diverse, and complex data sets are generated by
scientific applications, the Internet, social media and other applications.
Data may be physically distributed and shared by an ever larger community.
Collecting, aggregating, storing and analyzing large data volumes
presents major challenges. Processing such amounts of data efficiently
has become a bottleneck for scientific discovery and technological
advancement. In addition, making the data accessible, understandable and
interoperable poses unsolved problems. Novel middleware architectures,
algorithms, and application development frameworks are required.

In this workshop we are particularly interested in original work at the
intersection of HPC and Big Data with regard to middleware handling
and optimization. The scope covers existing and proposed middleware for HPC
and Big Data, including analytics libraries and frameworks.

The goal of this workshop is to bring together software architects,
middleware and framework developers, data-intensive application developers
as well as users from the scientific and engineering community to exchange
their experience in processing large datasets and to report their scientific
achievements and innovative ideas. The workshop also offers a dedicated forum
for these researchers to access the state of the art, to discuss problems
and requirements, to identify gaps in current and planned designs, and to
collaborate in strategies for scalable data-intensive computing.

The workshop will be one day in length, composed of 20-minute paper
presentations, each followed by a 10-minute discussion.
Presentations may be accompanied by interactive demonstrations.


TOPICS

Topics of interest include, but are not limited to:

- Middleware including: Hadoop, Apache Drill, YARN, Spark/Shark, Hive, Pig,
Sqoop, HBase, HDFS, S4, CIEL, Oozie, Impala, Storm and Hyracks
- Data intensive middleware architecture
- Libraries/Frameworks including: Apache Mahout, Giraph, UIMA and GraphLab
- NG Databases including Apache Cassandra, MongoDB and CouchDB/Couchbase
- Schedulers including Cascading
- Middleware for optimized data locality/in-place data processing
- Data handling middleware for deployment in virtualized HPC environments
- Parallelization and distributed processing architectures at the
middleware level
- Integration with cloud middleware and application servers
- Runtime environments and system level support for data-intensive computing
- Skeletons and patterns
- Checkpointing
- Programming models and languages
- Big Data ETL
- Stream processing middleware
- In-memory databases for HPC
- Scalability and interoperability
- Large-scale data storage and distributed file systems
- Content-centric addressing and networking
- Execution engines, languages and environments including CIEL/Skywriting
- Performance analysis, evaluation of data-intensive middleware
- In-depth analysis and performance optimizations in existing data-handling
middleware, focusing on indexing/fast storing or retrieval between compute
and storage nodes
- Highly scalable middleware optimized for minimum communication
- Use cases and experience for popular Big Data middleware
- Middleware security, privacy and trust architectures

DATES

Papers:
Rolling abstract submission
June 10, 2013 - Full paper submission (extended)
July 8, 2013 - Acceptance notification
October 3, 2013 - Camera-ready version due

Lightning Talks:
June 28, 2013 - Deadline for lightning talk abstracts
July 15, 2013 - Lightning talk notification

August 27, 2013 - Workshop Date


TPC

CHAIR

Michael Alexander (chair), TU Wien, Austria
Anastassios Nanos (co-chair), NTUA, Greece
Jie Tao (co-chair), Karlsruhe Institute of Technology, Germany
Lizhe Wang (co-chair), Chinese Academy of Sciences, China
Gianluigi Zanetti (co-chair), CRS4, Italy

PROGRAM COMMITTEE

Amitanand Aiyer, Facebook, USA
Costas Bekas, IBM, Switzerland
Jakob Blomer, CERN, Switzerland
William Gardner, University of Guelph, Canada
José Gracia, HPC Center of the University of Stuttgart, Germany
Zhenghua Guo, Indiana University, USA
Marcus Hardt, Karlsruhe Institute of Technology, Germany
Sverre Jarp, CERN, Switzerland
Christopher Jung, Karlsruhe Institute of Technology, Germany
Andreas Knüpfer, Technische Universität Dresden, Germany
Nectarios Koziris, National Technical University of Athens, Greece
Yan Ma, Chinese Academy of Sciences, China
Martin Schulz, Lawrence Livermore National Laboratory, USA

RE: how does hive find where is MR job tracker

2013-05-28 Thread Frank Luo
Thanks for reply.

Yes, I had the old server name in mapred-site.xml. Oddly enough, I couldn't find
a way to update the file through CM.

From: Sanjay Subramanian [mailto:sanjay.subraman...@wizecommerce.com]
Sent: Tuesday, May 28, 2013 12:08 PM
To: user@hive.apache.org; bejoy...@yahoo.com
Subject: Re: how does hive find where is MR job tracker

In Cloudera Manager, there is a Safety Valve feature (it's a multiline text
widget) that you can use to input the XML properties that you would use for
mapred-site.xml.

Possibly, since you changed the JobTracker machine, you have to modify the
mapred-site.xml to specify the machine name and port.


Regards

Sanjay

From: "bejoy...@yahoo.com" 
mailto:bejoy...@yahoo.com>>
Reply-To: "user@hive.apache.org" 
mailto:user@hive.apache.org>>, 
"bejoy...@yahoo.com" 
mailto:bejoy...@yahoo.com>>
Date: Tuesday, May 28, 2013 10:02 AM
To: "user@hive.apache.org" 
mailto:user@hive.apache.org>>
Subject: Re: how does hive find where is MR job tracker

Hive gets the JobTracker from the mapred-site.xml specified within your 
$HADOOP_HOME/conf.

Does your $HADOOP_HOME/conf/mapred-site.xml on the node that runs Hive have the
correct value for the JobTracker?
If not, changing that to the right one might resolve your issue.
Regards
Bejoy KS

Sent from remote device, Please excuse typos

From: Frank Luo <j...@merkleinc.com>
Date: Tue, 28 May 2013 16:49:01 +
To: user@hive.apache.org
Reply-To: user@hive.apache.org
Subject: how does hive find where is MR job tracker

I have a cloudera cluster, version 4.2.0.

In the hive configuration, I have "MapReduce Service" set to "mapreduce1", 
which is my MR service.

However, without setting "mapred.job.tracker", whenever I run a Hive command, it
always sends the job to the wrong job tracker. Here is the error:


java.net.ConnectException: Call From hqhd01ed01.pclc0.merkle.local/10.129.2.52 
to hqhd01ed01.pclc0.merkle.local:8021 failed on connection exception: 
java.net.ConnectException: Connection refused; For more details see:  
http://wiki.apache.org/hadoop/ConnectionRefused

at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)

at 
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)

And Cloudera Manager doesn't allow me to manually set "mapred.job.tracker".
So my question is how to make Hive point to the right job tracker without
setting "mapred.job.tracker" every time.

PS. Not sure it matters, but I did move the job tracker from machine A to 
machine B.

Thx!



Re: Accessing Table Properties from InputFormat

2013-05-28 Thread Edward Capriolo
ORC, Parquet, and "new ones" are ... new. They do not constitute a huge
portion of the user base if they constitute any at all.

I do see a case for what you are describing; currently there are input
formats that pass properties via the configuration to the task. Also, I feel
like some of the confusion you are describing centers around someone
building code with the goal of only one input format in mind, or one use
case. So when I see "ORC", "could benefit" and "refactor" in the same email,
alarms go off. They may be false alarms, but if the feature does not
immediately benefit two input formats, it's creating fragmentation and I am
not behind it.

On Tue, May 28, 2013 at 1:31 PM, Owen O'Malley  wrote:

> On Tue, May 28, 2013 at 9:27 AM, Edward Capriolo wrote:
>
>> The question we are diving into is how much of hive is going to be
>> designed around edge cases? Hive really was not made for columnar formats,
>> or self describing data-types. For the most part it handles them fairly
>> well.
>>
>
> I don't view columnar or self describing data-types as an edge case. I
> think in a couple years, the various columnar stores (ORC, Parquet, or new
> ones) and text will be the primary formats. Given the performance advantage
> of binary formats, text should only be used for staging tables.
>
>
>> I am not sure what I believe about refactoring all of hive's guts. How
>> much refactoring and complexity are we going to add to support special
>> cases? I do not think we can justify sweeping API changes for the sake of
>> one new input format, or something that can be done in some other way.
>>
>
> The problem is actually much bigger. We have a wide range of nested
> abstractions for input/output that all interact in various ways.
>
> org.apache.hadoop.mapred.InputFormat
> org.apache.hadoop.hive.ql.io.HiveInputFormat
> org.apache.hadoop.hive.ql.metadata.HiveStorageHandler
> org.apache.hadoop.hive.ql.io.CombineHiveInputFormat
> org.apache.hadoop.hive.serde2.SerDe
>
> I would suggest that there is a lot of confusion about the current state
> of what is allowed and what will break things. Furthermore, because
> critical functionality like accessing table properties, partition
> properties, columnar projection, and predicate pushdown has been added
> incrementally, it isn't clear at all to users what is available and how
> to take advantage of them.
>
> -- Owen
>
>


Re: Accessing Table Properties from InputFormat

2013-05-28 Thread Owen O'Malley
On Tue, May 28, 2013 at 9:27 AM, Edward Capriolo wrote:

> The question we are diving into is how much of hive is going to be
> designed around edge cases? Hive really was not made for columnar formats,
> or self describing data-types. For the most part it handles them fairly
> well.
>

I don't view columnar or self describing data-types as an edge case. I
think in a couple years, the various columnar stores (ORC, Parquet, or new
ones) and text will be the primary formats. Given the performance advantage
of binary formats, text should only be used for staging tables.


> I am not sure what I believe about refactoring all of hive's guts. How
> much refactoring and complexity are we going to add to support special
> cases? I do not think we can justify sweeping API changes for the sake of
> one new input format, or something that can be done in some other way.
>

The problem is actually much bigger. We have a wide range of nested
abstractions for input/output that all interact in various ways.

org.apache.hadoop.mapred.InputFormat
org.apache.hadoop.hive.ql.io.HiveInputFormat
org.apache.hadoop.hive.ql.metadata.HiveStorageHandler
org.apache.hadoop.hive.ql.io.CombineHiveInputFormat
org.apache.hadoop.hive.serde2.SerDe

I would suggest that there is a lot of confusion about the current state of
what is allowed and what will break things. Furthermore, because critical
functionality like accessing table properties, partition properties,
columnar projection, and predicate pushdown has been added incrementally,
it isn't clear at all to users what is available and how to take
advantage of them.

-- Owen


Re: how does hive find where is MR job tracker

2013-05-28 Thread Sanjay Subramanian
In Cloudera Manager, there is a Safety Valve feature (it's a multiline text
widget) that you can use to input the XML properties that you would use for
mapred-site.xml.

Possibly, since you changed the JobTracker machine, you have to modify the
mapred-site.xml to specify the machine name and port.


Regards

Sanjay

From: "bejoy...@yahoo.com" 
mailto:bejoy...@yahoo.com>>
Reply-To: "user@hive.apache.org" 
mailto:user@hive.apache.org>>, 
"bejoy...@yahoo.com" 
mailto:bejoy...@yahoo.com>>
Date: Tuesday, May 28, 2013 10:02 AM
To: "user@hive.apache.org" 
mailto:user@hive.apache.org>>
Subject: Re: how does hive find where is MR job tracker

Hive gets the JobTracker from the mapred-site.xml specified within your 
$HADOOP_HOME/conf.

Does your $HADOOP_HOME/conf/mapred-site.xml on the node that runs Hive have the
correct value for the JobTracker?
If not, changing that to the right one might resolve your issue.
Regards
Bejoy KS

Sent from remote device, Please excuse typos

From: Frank Luo <j...@merkleinc.com>
Date: Tue, 28 May 2013 16:49:01 +
To: user@hive.apache.org
Reply-To: user@hive.apache.org
Subject: how does hive find where is MR job tracker

I have a cloudera cluster, version 4.2.0.

In the hive configuration, I have “MapReduce Service” set to “mapreduce1”, 
which is my MR service.

However, without setting “mapred.job.tracker”, whenever I run a Hive command, it
always sends the job to the wrong job tracker. Here is the error:


java.net.ConnectException: Call From hqhd01ed01.pclc0.merkle.local/10.129.2.52 
to hqhd01ed01.pclc0.merkle.local:8021 failed on connection exception: 
java.net.ConnectException: Connection refused; For more details see:  
http://wiki.apache.org/hadoop/ConnectionRefused

at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)

at 
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)

And Cloudera Manager doesn’t allow me to manually set “mapred.job.tracker”.
So my question is how to make Hive point to the right job tracker without
setting “mapred.job.tracker” every time.

PS. Not sure it matters, but I did move the job tracker from machine A to 
machine B.

Thx!



Re: how does hive find where is MR job tracker

2013-05-28 Thread bejoy_ks
Hive gets the JobTracker from the mapred-site.xml specified within your 
$HADOOP_HOME/conf.

Does your $HADOOP_HOME/conf/mapred-site.xml on the node that runs Hive have the
correct value for the JobTracker?
If not, changing that to the right one might resolve your issue.
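
[Editor's illustration — not part of the original reply. The mapred-site.xml
entry in question would look something like the following; the host and port
shown are placeholders for the new JobTracker, not values from this thread.]

<!-- mapred-site.xml: point MapReduce (and therefore Hive) at the JobTracker -->
<property>
  <name>mapred.job.tracker</name>
  <value>newjobtracker.example.com:8021</value>  <!-- placeholder host:port -->
</property>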

Regards 
Bejoy KS

Sent from remote device, Please excuse typos

-Original Message-
From: Frank Luo 
Date: Tue, 28 May 2013 16:49:01 
To: user@hive.apache.org
Reply-To: user@hive.apache.org
Subject: how does hive find where is MR job tracker

I have a cloudera cluster, version 4.2.0.

In the hive configuration, I have "MapReduce Service" set to "mapreduce1", 
which is my MR service.

However, without setting "mapred.job.tracker", whenever I run a Hive command, it
always sends the job to the wrong job tracker. Here is the error:


java.net.ConnectException: Call From hqhd01ed01.pclc0.merkle.local/10.129.2.52 
to hqhd01ed01.pclc0.merkle.local:8021 failed on connection exception: 
java.net.ConnectException: Connection refused; For more details see:  
http://wiki.apache.org/hadoop/ConnectionRefused

at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)

at 
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)

And Cloudera Manager doesn't allow me to manually set "mapred.job.tracker".
So my question is how to make Hive point to the right job tracker without
setting "mapred.job.tracker" every time.

PS. Not sure it matters, but I did move the job tracker from machine A to 
machine B.

Thx!



how does hive find where is MR job tracker

2013-05-28 Thread Frank Luo
I have a cloudera cluster, version 4.2.0.

In the hive configuration, I have "MapReduce Service" set to "mapreduce1", 
which is my MR service.

However, without setting "mapred.job.tracker", whenever I run a Hive command, it
always sends the job to the wrong job tracker. Here is the error:


java.net.ConnectException: Call From hqhd01ed01.pclc0.merkle.local/10.129.2.52 
to hqhd01ed01.pclc0.merkle.local:8021 failed on connection exception: 
java.net.ConnectException: Connection refused; For more details see:  
http://wiki.apache.org/hadoop/ConnectionRefused

at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)

at 
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)

And Cloudera Manager doesn't allow me to manually set "mapred.job.tracker".
So my question is how to make Hive point to the right job tracker without
setting "mapred.job.tracker" every time.

PS. Not sure it matters, but I did move the job tracker from machine A to 
machine B.

Thx!


Re: Accessing Table Properties from InputFormat

2013-05-28 Thread Edward Capriolo
Right, Hive discarding the key is rather annoying. I have a series of
key+value input formats, key-only input formats, etc. Having Hive return both
the key and the value would be a breaking change, but not a very difficult
one.

The question we are diving into is how much of hive is going to be designed
around edge cases? Hive really was not made for columnar formats, or self
describing data-types. For the most part it handles them fairly well.

I am not sure what I believe about refactoring all of hive's guts. How much
refactoring and complexity are we going to add to support special cases? I
do not think we can justify sweeping API changes for the sake of one new
input format, or something that can be done in some other way.

On Tue, May 28, 2013 at 12:10 PM, Owen O'Malley  wrote:

>
>
>
> On Tue, May 28, 2013 at 8:45 AM, Edward Capriolo wrote:
>
>> That does not really make sense. You're breaking the layered approach.
>> InputFormats read/write data, serdes interpret data based on the table
>> definition. It's like asking "Why can't my input format run assembly code?"
>>
>
> The current model of:
>
> SerDe
> Input/OutputFormat
> FileSystem
>
> does well for text formats, but otherwise limits the input/output formats
> to doing binary data. That creates problems if the Input/OutputFormat has
> an integrated serialization mechanism. For example, ORC requires its SerDe
> and the OrcSerde just passes along the values through serialize and
> deserialize.
>
> Also note that other formats like SequenceFile are restricted because the
> SerDe is placed above the FileFormat. Hive's SequenceFile input format
> discards the key and requires the value to be Text or BytesWritable. That
> covers many cases, but certainly not all. On the other hand, if it was
> Hive's SequenceFile InputFormat that was creating the ObjectInspector, it
> could actually handle more complex types and let Hive usefully read a wider
> range of SequenceFiles.
>
> I would propose that it would be better to push SerDes down into the
> Input/OutputFormats that can be parameterized by the serialization. Using
> them for TextInput/OutputFormat and HBaseTableInput/OutputFormat makes a
> lot of sense, but in general that isn't true.
>
> -- Owen
>


Re: Accessing Table Properties from InputFormat

2013-05-28 Thread Owen O'Malley
On Tue, May 28, 2013 at 8:45 AM, Edward Capriolo wrote:

> That does not really make sense. You're breaking the layered approach.
> InputFormats read/write data, serdes interpret data based on the table
> definition. It's like asking "Why can't my input format run assembly code?"
>

The current model of:

SerDe
Input/OutputFormat
FileSystem

does well for text formats, but otherwise limits the input/output formats
to doing binary data. That creates problems if the Input/OutputFormat has
an integrated serialization mechanism. For example, ORC requires its SerDe
and the OrcSerde just passes along the values through serialize and
deserialize.

Also note that other formats like SequenceFile are restricted because the
SerDe is placed above the FileFormat. Hive's SequenceFile input format
discards the key and requires the value to be Text or BytesWritable. That
covers many cases, but certainly not all. On the other hand, if it was
Hive's SequenceFile InputFormat that was creating the ObjectInspector, it
could actually handle more complex types and let Hive usefully read a wider
range of SequenceFiles.

I would propose that it would be better to push SerDes down into the
Input/OutputFormats that can be parameterized by the serialization. Using
them for TextInput/OutputFormat and HBaseTableInput/OutputFormat makes a
lot of sense, but in general that isn't true.

-- Owen
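
[Editor's sketch, to make the proposal above concrete. This interface does not
exist in Hive; the name and methods are hypothetical and only illustrate an
input format that owns its serialization and describes its rows directly.]

import java.util.Properties;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.mapred.InputFormat;

// Hypothetical: an input format that is handed the table/partition
// properties and produces its own ObjectInspector, instead of relying on a
// separate SerDe layered on top of it.
public interface SelfDescribingInputFormat<K, V> extends InputFormat<K, V> {

  // Configure the format with the table properties (what a SerDe gets today).
  void initialize(Configuration conf, Properties tableProperties);

  // Describe the rows this format produces, taking over the SerDe's role.
  ObjectInspector getObjectInspector();
}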


Re: Accessing Table Properties from InputFormat

2013-05-28 Thread Edward Capriolo
That does not really make sense. You're breaking the layered approach.
InputFormats read/write data, serdes interpret data based on the table
definition. It's like asking "Why can't my input format run assembly code?"


On Tue, May 28, 2013 at 11:42 AM, Owen O'Malley  wrote:

>
>
>
> On Tue, May 28, 2013 at 7:59 AM, Peter Marron <
> peter.mar...@trilliumsoftware.com> wrote:
>
>> Hi,
>>
>> Hive 0.10.0 over Hadoop 1.0.4.
>>
>> Further to my filtering questions of before.
>>
>> I would like to be able to access the table properties from inside my
>> custom InputFormat.
>>
>> I’ve done searches and there seem to be some other people who have had a
>> similar problem.
>>
>> The closest I can see to a solution is to use
>>
>> MapredWork mrwork = Utilities.getMapRedWork(configuration);
>>
>> but this fails for me with the error below.
>>
>> I’m not truly surprised because I am trying to make sure that my query
>> runs without a map/reduce, and some of the e-mails suggest that in this
>> case:
>>
>> “…no mapred job is
>> run, so this trick doesn't work (and instead, the Configuration object
>> can be used, since it's local).”
>>
>> Any pointers would be very much appreciated.
>>
>
> Yeah, as you discovered, that only works in the MapReduce case and breaks
> on cases like "select count(*)" that don't run in MapReduce.
>
> I haven't tried it, but it looks like the best you can do with the current
> interface is to implement a SerDe which is passed the table properties in
> initialize. In terms of passing it to the InputFormat, I'd try a thread
> local variable. It looks like the getRecordReader is called soon after the
> serde.initialize although I didn't do a very deep search of the code.
>
> -- Owen
>
>
>


Re: Accessing Table Properties from InputFormat

2013-05-28 Thread Owen O'Malley
On Tue, May 28, 2013 at 7:59 AM, Peter Marron <
peter.mar...@trilliumsoftware.com> wrote:

> Hi,
>
> Hive 0.10.0 over Hadoop 1.0.4.
>
> Further to my filtering questions of before.
>
> I would like to be able to access the table properties from inside my
> custom InputFormat.
>
> I’ve done searches and there seem to be some other people who have had a
> similar problem.
>
> The closest I can see to a solution is to use
>
> MapredWork mrwork = Utilities.getMapRedWork(configuration);
>
> but this fails for me with the error below.
>
> I’m not truly surprised because I am trying to make sure that my query
> runs without a map/reduce, and some of the e-mails suggest that in this
> case:
>
> “…no mapred job is
> run, so this trick doesn't work (and instead, the Configuration object
> can be used, since it's local).”
>
> Any pointers would be very much appreciated.
>

Yeah, as you discovered, that only works in the MapReduce case and breaks
on cases like "select count(*)" that don't run in MapReduce.

I haven't tried it, but it looks like the best you can do with the current
interface is to implement a SerDe which is passed the table properties in
initialize. In terms of passing it to the InputFormat, I'd try a thread
local variable. It looks like the getRecordReader is called soon after the
serde.initialize although I didn't do a very deep search of the code.

-- Owen
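
[Editor's sketch of the thread-local idea described above — not code from the
thread. The class name MySerDe is invented, it extends LazySimpleSerDe only as
a convenient example, and error handling is omitted.]

import java.util.Properties;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hive.serde2.SerDeException;
import org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe;

public class MySerDe extends LazySimpleSerDe {

  // Table properties for the thread that initialized this SerDe, so a custom
  // InputFormat running on the same thread can read them back later.
  private static final ThreadLocal<Properties> TABLE_PROPS =
      new ThreadLocal<Properties>();

  public static Properties getTableProperties() {
    return TABLE_PROPS.get();   // may be null if initialize() never ran
  }

  @Override
  public void initialize(Configuration conf, Properties tbl) throws SerDeException {
    super.initialize(conf, tbl);
    TABLE_PROPS.set(tbl);       // stash before getRecordReader is called
  }
}

// Inside the custom InputFormat, on the same thread (per the caveat above):
//   Properties tblProps = MySerDe.getTableProperties();
//   if (tblProps != null) { /* use table properties */ }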


Accessing Table Properties from InputFormat

2013-05-28 Thread Peter Marron
Hi,

Hive 0.10.0 over Hadoop 1.0.4.

Further to my filtering questions of before.
I would like to be able to access the table properties from inside my custom 
InputFormat.
I've done searches and there seem to be some other people who have had a 
similar problem.
The closest I can see to a solution is to use
MapredWork mrwork = Utilities.getMapRedWork(configuration);
but this fails for me with the error below.
I'm not truly surprised because I am trying to make sure that my query
runs without a map/reduce, and some of the e-mails suggest that in this case:

"...no mapred job is
run, so this trick doesn't work (and instead, the Configuration object
can be used, since it's local)."

Any pointers would be very much appreciated.


java.lang.IllegalArgumentException: Can not create a Path from an empty string
at org.apache.hadoop.fs.Path.checkPathArg(Path.java:82)
at org.apache.hadoop.fs.Path.<init>(Path.java:90)
at 
org.apache.hadoop.hive.ql.exec.Utilities.getHiveJobID(Utilities.java:382)
at 
org.apache.hadoop.hive.ql.exec.Utilities.getMapRedWork(Utilities.java:205)
at SimpleInputFormat.getThingy(SimpleInputFormat.java:134)
at SimpleInputFormat.getSplits(SimpleInputFormat.java:90)
at 
org.apache.hadoop.hive.ql.exec.FetchOperator.getRecordReader(FetchOperator.java:373)
at 
org.apache.hadoop.hive.ql.exec.FetchOperator.getNextRow(FetchOperator.java:486)
at 
org.apache.hadoop.hive.ql.exec.FetchOperator.pushRow(FetchOperator.java:466)
at 
org.apache.hadoop.hive.ql.exec.FetchTask.fetch(FetchTask.java:136)
at org.apache.hadoop.hive.ql.Driver.getResults(Driver.java:1387)
at 
org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:270)
at 
org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:216)
at 
org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:412)
at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:755)
at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:613)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
Failed with exception java.io.IOException:java.lang.RuntimeException: 
java.lang.IllegalArgumentException: Can not create a Path from an empty string

Peter Marron
Trillium Software UK Limited

Tel : +44 (0) 118 940 7609
Fax : +44 (0) 118 940 7699
E: peter.mar...@trilliumsoftware.com



Combining 2 JSON objects in Hive

2013-05-28 Thread Ike Walker
Hello,

I have two JSON objects stored as strings in a Hive table.

I would like to combine them into a single JSON object in Hive.

I'm running Hive 0.7, but am planning to upgrade soon so a solution that works 
in Hive 0.8 could be acceptable as well.

For example, here's the data now:
+-------------------------------------+--------------------------------+
| col1                                | col2                           |
+-------------------------------------+--------------------------------+
| {"age":"Over 30","gender":"female"} | {"counter":"0","version":"1"}  |
| {"age":"Over 30","gender":"female"} | {"counter":"4","version":"1"}  |
| {"age":"Over 30","gender":"male"}   | {"counter":"10","version":"1"} |
+-------------------------------------+--------------------------------+

And here's what I want to select:

{"age":"Over 30","gender":"female","counter":"0","version":"1"}
{"age":"Over 30","gender":"female","counter":"4","version":"1"}
{"age":"Over 30","gender":"male","counter":"10","version":"1"}

Thanks,
Ike Walker
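
[Editor's sketch of one possible answer, assuming both columns always hold
flat, well-formed JSON objects with no overlapping keys; "json_pairs" is a
placeholder table name, not one from the thread. It uses only concat and
regexp_replace, which should be available in Hive 0.7.]

SELECT concat(
         regexp_replace(col1, '\\}\\s*$', ''),   -- drop the closing '}' of col1
         ',',
         regexp_replace(col2, '^\\s*\\{', '')    -- drop the opening '{' of col2
       ) AS merged_json
FROM json_pairs;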

IndexOutOfBoundsException with Snappy compressed SequenceFile from Flume

2013-05-28 Thread Keith Wright
Hi all,

   This is my first post to the Hive mailing list and I was hoping to get some
help with the exception I am getting below.  I am using CDH4.2 (Hive 0.10.0) to
query snappy-compressed SequenceFiles that are built using Flume (the relevant
portion of the Flume conf is below as well).  Note that I'm using a SequenceFile
as it was needed for Impala integration.  Has anyone seen this error before?  A
couple of additional points to help diagnose:

 1.  Queries seem to be able to process some mappers without issues.  In fact,
I can do a simple select * from <table> limit 10 without issue. However, if I
make the limit high enough, it will eventually fail, presumably as it needs to
read in a file that has this issue.
 2.  The same query runs in Impala without errors but appears to "skip" some 
data.  I can confirm that the missing data is present via a custom map/reduce 
job
 3.  I am able to write a map/reduce job that reads through all of the same 
data without issue and have been unable to identify data corruption
 4.  This is a partitioned table, and queries that touch ANY of the partitions
fail (and there are hundreds), so this does not appear to be a sporadic
data-integrity problem (table definition below).
 5.  We are using '\001' as our field separator.  We are also capturing other data
with SequenceFile + snappy but using '|' as our delimiter, and we do not
have any issues querying there, although we are using a different Flume source.

My next step for debugging was to disable snappy compression and see if I could 
query the data.  If not, switch from SequenceFile to simple text.
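
[Editor's illustration of that debugging step, reusing the sink name from the
Flume configuration further down; this is not from the original message.]

# Same sink, but writing uncompressed plain text instead of a snappy
# SequenceFile (drop hdfs.codeC entirely so no codec is applied):
agent.sinks.exhaustHDFSSink3.hdfs.fileType = DataStream
agent.sinks.exhaustHDFSSink3.hdfs.writeFormat = Text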

I appreciate the help!!!

CREATE EXTERNAL TABLE ORGANIC_EVENTS (
event_id BIGINT,
app_id INT,
user_id BIGINT,
type STRING,
name STRING,
value STRING,
extra STRING,
ip_address STRING,
user_agent STRING,
referrer STRING,
event_time BIGINT,
install_flag TINYINT,
first_for_user TINYINT,
cookie STRING,
year int,
month int,
day int,
hour int)  PARTITIONED BY (year int, month int, day int,hour int)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\001'
COLLECTION ITEMS TERMINATED BY '\002'
MAP KEYS TERMINATED BY '\003'
STORED AS SEQUENCEFILE
LOCATION '/events/organic';

agent.sinks.exhaustHDFSSink3.type = HDFS
agent.sinks.exhaustHDFSSink3.channel = exhaustFileChannel
agent.sinks.exhaustHDFSSink3.hdfs.path = 
hdfs://lxscdh001.nanigans.com:8020%{path}
agent.sinks.exhaustHDFSSink3.hdfs.filePrefix = 3.%{hostname}
agent.sinks.exhaustHDFSSink3.hdfs.rollInterval = 0
agent.sinks.exhaustHDFSSink3.hdfs.idleTimeout = 600
agent.sinks.exhaustHDFSSink3.hdfs.rollSize = 0
agent.sinks.exhaustHDFSSink3.hdfs.rollCount = 0
agent.sinks.exhaustHDFSSink3.hdfs.batchSize = 5000
agent.sinks.exhaustHDFSSink3.hdfs.txnEventMax = 5000
agent.sinks.exhaustHDFSSink3.hdfs.fileType = SequenceFile
agent.sinks.exhaustHDFSSink3.hdfs.maxOpenFiles = 100
agent.sinks.exhaustHDFSSink3.hdfs.codeC = snappy
agent.sinks.exhaustHDFSSink.3hdfs.writeFormat = Text


2013-05-28 12:29:00,919 WARN org.apache.hadoop.mapred.Child: Error running 
child  java.io.IOException: java.io.IOException: 
java.lang.IndexOutOfBoundsException
  at 
org.apache.hadoop.hive.io.HiveIOExceptionHandlerChain.handleRecordReaderNextException(HiveIOExceptionHandlerChain.java:121)
  at 
org.apache.hadoop.hive.io.HiveIOExceptionHandlerUtil.handleRecordReaderNextException(HiveIOExceptionHandlerUtil.java:77)
  at 
org.apache.hadoop.hive.shims.HadoopShimsSecure$CombineFileRecordReader.doNextWithExceptionHandler(HadoopShimsSecure.java:330)
  at 
org.apache.hadoop.hive.shims.HadoopShimsSecure$CombineFileRecordReader.next(HadoopShimsSecure.java:246)
  at 
org.apache.hadoop.mapred.MapTask$TrackedRecordReader.moveToNext(MapTask.java:216)
  at 
org.apache.hadoop.mapred.MapTask$TrackedRecordReader.next(MapTask.java:201)
  at 
org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:48)
  at 
org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:418)
  at 
org.apache.hadoop.mapred.MapTask.run(MapTask.java:333)
  at 
org.apache.hadoop.mapred.Child$4.run(Child.java:268)
  at 
java.security.AccessController.doPrivileged(Native Method)
  at 
javax.security.auth.Subject.doAs(Subject.java:396)
  at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1408)
  at 
org.apache.hadoop.mapred.Child.main(Child.java:262)
  Caused by: java.io.IOException: 
java.lang.IndexOutOfBoundsException
  at 
org.apache.hadoop.hive.io.HiveIOExceptionHandlerChain.handleRecordReaderNextException(HiveIOExceptionHandlerChain.java:121)
  at 

Re: Where should the Hive process be installed?

2013-05-28 Thread yuchen xie
I have just run into this.

You can try to add this configuration in hive-site.xml:

<property>
  <name>mapreduce.jobtracker.address</name>
  <value>ignorethis</value>
</property>

Or you can apply the patch from
https://issues.apache.org/jira/browse/HIVE-3029
to your Hive.



2013/4/8 rohithsharma 

> Hi
>
> I am using Hive-0.9.0 + Hadoop-2.0.1 with 2 machines. One machine contains,
> say:
>
> Machine-1 : NameNode, SecondaryNameNode, ResourceManager and Hive
>
> Machine-2 : Proxy server, JHS, DataNode and NodeManager.
>
> Problem :
>
> When I execute job queries, i.e. “select count(key) from src” for the table
> src, the job is getting killed.
>
> The exception in the application log was:
>
> (stderr of container_1365412074766_0031_01_03)
> WARNING: org.apache.hadoop.metrics.jvm.EventCounter is deprecated. Please use
> org.apache.hadoop.log.metrics.EventCounter in all the log4j.properties
> files.
> java.io.FileNotFoundException:
> /tmp/rohith/hive_2013-04-08_15-36-36_587_2956773547832982704/-mr-10001/41253f2c-9b15-49e0-a419-32ea4559e4e6
> (No such file or directory)
> at java.io.FileInputStream.open(Native Method)
> at java.io.FileInputStream.<init>(FileInputStream.java:120)
> at java.io.FileInputStream.<init>(FileInputStream.java:79)
> at
> org.apache.hadoop.hive.ql.exec.Utilities.getMapRedWork(Utilities.java:215)
> at
> org.apache.hadoop.hive.ql.io.HiveInputFormat.init(HiveInputFormat.java:255)
> at
> org.apache.hadoop.hive.ql.io.HiveInputFormat.pushProjectionsAndFilters(HiveInputFormat.java:381)
> at
> org.apache.hadoop.hive.ql.io.HiveInputFormat.pushProjectionsAndFilters(HiveInputFormat.java:374)
> at
> org.apache.hadoop.hive.ql.io.CombineHiveInputFormat.getRecordReader(CombineHiveInputFormat.java:536)
> at
> org.apache.hadoop.mapred.MapTask$TrackedRecordReader.<init>(MapTask.java:161)
> at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:382)
> at org.apache.hadoop.mapred.MapTask.run(MapTask.java:335)
> at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:154)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:396)
> at
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1232)
> at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:149)
> (syslog) 2013-04-08 15:45:52,293 WARN [main]
> org.apache.hadoop.conf.Configuration: job
>
> Please let us know: where exactly should the HiveServer run?
>
> Thanks & Regards
>
> Rohith Sharma K S
>