Cascading API

2012-03-02 Thread Shreya.Pal
Hi,



Has anyone used the Cascading data processing API? What were the advantages,
and how was the performance?



Thanks and Regards,

Shreya Pal




Re: fairscheduler : group.name doesn't work, please help

2012-03-02 Thread Austin Chungath
I tried the patch MAPREDUCE-2457 but it didn't work for my hadoop 0.20.205.
Are you sure this patch will work for 0.20.205?
According to the description, the patch works for 0.21 and 0.22,
and it says that 0.20 supports group.name without this patch...

So does this patch also apply to 0.20.205?

Thanks,
Austin

On Thu, Mar 1, 2012 at 11:24 PM, Harsh J  wrote:

> The group.name scheduler support was introduced in
> https://issues.apache.org/jira/browse/HADOOP-3892 but may have been
> broken by the security changes present in 0.20.205. You'll need the
> fix presented in  https://issues.apache.org/jira/browse/MAPREDUCE-2457
> to have group.name support.
>
> On Thu, Mar 1, 2012 at 6:42 PM, Austin Chungath 
> wrote:
> >  I am running fair scheduler on hadoop 0.20.205.0
> >
> > http://hadoop.apache.org/common/docs/r0.20.205.0/fair_scheduler.html
> > The above page talks about the following property
> >
> > *mapred.fairscheduler.poolnameproperty*
> > **
> > which I can set to *group.name*
> > The default is user.name and when a user submits a job the fair
> scheduler
> > assigns each user's job to a pool which has the name of the user.
> > I am trying to change it to group.name so that the job is submitted to a
> > pool which has the name of the user's linux group. Thus all jobs from any
> > user from a specific group go to the same pool instead of an individual
> > pool for every user.
> > But *group.name* doesn't seem to work, has anyone tried this before?
> >
> > *user.name* and *mapred.job.queue.name* works. Is group.name supported
> in
>  > 0.20.205.0 because I don't see it mentioned in the docs?
> >
> > Thanks,
> > Austin
>
>
>
> --
> Harsh J
>
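
For anyone following along, the cluster-side knob under discussion lives in the JobTracker's mapred-site.xml. A minimal sketch, assuming the fair scheduler contrib jar is already on the JobTracker classpath, might look like this:

<property>
  <name>mapred.jobtracker.taskScheduler</name>
  <value>org.apache.hadoop.mapred.FairScheduler</value>
</property>
<property>
  <!-- pool jobs by the submitting user's primary group instead of the user name -->
  <name>mapred.fairscheduler.poolnameproperty</name>
  <value>group.name</value>
</property>

As the thread notes, on 0.20.205 the group.name value may not actually be honored until the MAPREDUCE-2457 fix is in place.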


[Blog Post]: Accumulo and Pig play together now

2012-03-02 Thread Jason Trost
For anyone interested...

Accumulo and Pig play together now:
http://www.covert.io/post/18605091231/accumulo-and-pig
   and
https://github.com/jt6211/accumulo-pig

--Jason


Re: Comparison of Apache Pig Vs. Hadoop Streaming M/R

2012-03-02 Thread Subir S
Thank you Jie!

I have downloaded the "Pig Experience" paper and will read it.

On Fri, Mar 2, 2012 at 12:36 PM, Jie Li  wrote:

> Considering Pig essentially translates scripts into Map Reduce jobs, one
> can always write as good Map Reduce jobs as Pig does. You can refer to "Pig
> experience" paper to see the overhead Pig introduces, but it's been
> improved all the time.
>
> Btw if you really care about the performance, how you configure Hadoop and
> Pig can also play an important role.
>
> Thanks,
> Jie
> --
> Starfish is an intelligent performance tuning tool for Hadoop.
> Homepage: www.cs.duke.edu/starfish/
> Mailing list: http://groups.google.com/group/hadoop-starfish
>
> On Thu, Mar 1, 2012 at 11:48 PM, Subir S 
> wrote:
>
> > Hello Folks,
> >
> > Are there any pointers to such comparisons between Apache Pig and Hadoop
> > Streaming Map Reduce jobs?
> >
> > Also there was a claim in our company that Pig performs better than Map
> > Reduce jobs? Is this true? Are there any such benchmarks available
> >
> > Thanks, Subir
> >
>


Re: Comparison of Apache Pig Vs. Hadoop Streaming M/R

2012-03-02 Thread Subir S
On Fri, Mar 2, 2012 at 12:38 PM, Harsh J  wrote:

> On Fri, Mar 2, 2012 at 10:18 AM, Subir S 
> wrote:
> > Hello Folks,
> >
> > Are there any pointers to such comparisons between Apache Pig and Hadoop
> > Streaming Map Reduce jobs?
>
> I do not see why you seek to compare these two. Pig offers a language
> that lets you write data-flow operations and runs these statements as
> a series of MR jobs for you automatically (Making it a great tool to
> use to get data processing done really quick, without bothering with
> code), while streaming is something you use to write non-Java, simple
> MR jobs. Both have their own purposes.
>

Basically we are comparing these two to see the benefits and how much they
help in improving developer productivity, without jeopardizing the
performance of MR jobs.


> > Also there was a claim in our company that Pig performs better than Map
> > Reduce jobs? Is this true? Are there any such benchmarks available
>
> Pig _runs_ MR jobs. It does do job design (and some data)
> optimizations based on your queries, which is what may give it an edge
> over designing elaborate flows of plain MR jobs with tools like
> Oozie/JobControl (Which takes more time to do). But regardless, Pig
> only makes it easy doing the same thing with Pig Latin statements for
> you.
>

I knew that Pig runs MR jobs, just as Hive does. But Hive jobs become
pretty slow with a lot of joins, which we can do faster by writing raw
MR jobs. So with that context I was trying to see how Pig runs MR jobs. For
example, what kind of projects should consider Pig? Say when we have a
lot of joins, which take time to write as plain MR jobs. Thoughts?

Thank you Harsh for your comments. They are helpful!


>
> --
> Harsh J
>
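
To make the join case above concrete, here is a small, hypothetical Pig Latin sketch of the kind of work being discussed; the file names and schemas are made up for illustration:

-- hypothetical inputs; paths and schemas are illustrative only
users  = LOAD 'users.tsv'  USING PigStorage('\t')
         AS (user_id:chararray, country:chararray);
orders = LOAD 'orders.tsv' USING PigStorage('\t')
         AS (order_id:chararray, user_id:chararray, amount:double);

-- one JOIN statement replaces a hand-written reduce-side join in plain MR
joined  = JOIN orders BY user_id, users BY user_id;
grouped = GROUP joined BY users::country;
totals  = FOREACH grouped GENERATE group AS country,
                                   SUM(joined.orders::amount) AS total;
STORE totals INTO 'totals_by_country';

The equivalent reduce-side join in raw MapReduce needs its own tagging, partitioning, and grouping code, which is the productivity trade-off being weighed here.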


Hadoop pain points?

2012-03-02 Thread Kunaal
I am doing a general poll: what are the most prevalent pain points that
people run into with Hadoop? These could be performance related (memory
usage, I/O latencies), usage related, or anything really.

The goal is to look for what areas this platform could benefit the most in
the near future.

Any feedback is much appreciated.

Thanks,
Kunal.


Re: Hadoop pain points?

2012-03-02 Thread Mike Spreitzer
Interesting question.  Do you want to be asking those who use Hadoop --- 
or those who find it too painful to use?

Regards,
Mike



From:   Kunaal 
To: common-user@hadoop.apache.org
Date:   03/02/2012 11:23 AM
Subject:Hadoop pain points?
Sent by:kunaalbha...@gmail.com



I am doing a general poll on what are the most prevalent pain points that
people run into with Hadoop? These could be performance related (memory
usage, IO latencies), usage related or anything really.

The goal is to look for what areas this platform could benefit the most in
the near future.

Any feedback is much appreciated.

Thanks,
Kunal.



Re: Hadoop pain points?

2012-03-02 Thread Raj Vishwanathan
Lol!

Raj



>
> From: Mike Spreitzer 
>To: common-user@hadoop.apache.org 
>Sent: Friday, March 2, 2012 8:31 AM
>Subject: Re: Hadoop pain points?
> 
>Interesting question.  Do you want to be asking those who use Hadoop --- 
>or those who find it too painful to use?
>
>Regards,
>Mike
>
>
>
>From:   Kunaal 
>To:    common-user@hadoop.apache.org
>Date:   03/02/2012 11:23 AM
>Subject:        Hadoop pain points?
>Sent by:        kunaalbha...@gmail.com
>
>
>
>I am doing a general poll on what are the most prevalent pain points that
>people run into with Hadoop? These could be performance related (memory
>usage, IO latencies), usage related or anything really.
>
>The goal is to look for what areas this platform could benefit the most in
>the near future.
>
>Any feedback is much appreciated.
>
>Thanks,
>Kunal.
>
>
>
>

Re: failed to build trunk, what's wrong?

2012-03-02 Thread Akshay Singh
Hi Folks,

I have also run into a similar problem, while trying to set up the Hadoop 
development environment for Eclipse. 
(http://wiki.apache.org/hadoop/EclipseEnvironment)

***
...
[ERROR] Failed to execute goal 
org.apache.maven.plugins:maven-antrun-plugin:1.6:run (compile-proto) on project 
hadoop-common: An Ant BuildException has occured: exec returned: 1 -> [Help 1]
[ERROR] 
[ERROR] To see the full stack trace of the errors, re-run Maven with the -e 
switch.
[ERROR] Re-run Maven using the -X switch to enable full debug logging.
[ERROR] 
[ERROR] For more information about the errors and possible solutions, please 
read the following articles:
[ERROR] [Help 1] 
http://cwiki.apache.org/confluence/display/MAVEN/MojoExecutionException
...

Detailed logs at : http://pastebin.ca/2123540

As per the suggestions, I have protoc on the right path, and I guess I need not 
explicitly append LD_LIBRARY_PATH, as libprotobuf is in my /usr/lib

$ protoc --version
libprotoc 2.3.0

$ locate libprotobuf
...
/usr/lib/libprotobuf.so.6
/usr/lib/libprotobuf.so.6.0.0
/usr/share/doc/libprotobuf6
/usr/share/doc/libprotobuf6/changelog.Debian.gz
/usr/share/doc/libprotobuf6/changelog.gz
/usr/share/doc/libprotobuf6/copyright
/var/lib/dpkg/info/libprotobuf6.list
/var/lib/dpkg/info/libprotobuf6.md5sums

Any more suggestions ??

Thanks
-Akshay



 From: Ronald Petty 
To: common-user@hadoop.apache.org 
Sent: Monday, 16 January 2012 1:44 PM
Subject: Re: failed to build trunk, what's wrong?
 
Hello,

If you type protoc on the command line is it found?

Kindest regards.

Ron

On Sat, Jan 14, 2012 at 5:52 PM, smith jack  wrote:

> mvn compile and failed:(
> jdk version is "1.6.0_23"
> maven version is Apache Maven 3.0.3
> [ERROR] Failed to execute goal
> org.apache.maven.plugins:maven-antrun-plugin:1.6:run (compile-proto) on
> project hadoop-common: An Ant BuildException has occured: exec returned:
> 127 -> [Help 1]
> org.apache.maven.lifecycle.LifecycleExecutionException: Failed to execute
> goal org.apache.maven.plugins:maven-antrun-plugin:1.6:run (compile-proto)
> on project hadoop-common: An Ant BuildException has occured: exec returned:
> 127
>        at
>
> org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:217)
>        at
>
> org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:153)
>        at
>
> org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:145)
>        at
>
> org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject(LifecycleModuleBuilder.java:84)
>        at
>
> org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject(LifecycleModuleBuilder.java:59)
>        at
>
> org.apache.maven.lifecycle.internal.LifecycleStarter.singleThreadedBuild(LifecycleStarter.java:183)
>        at
>
> org.apache.maven.lifecycle.internal.LifecycleStarter.execute(LifecycleStarter.java:161)
>        at org.apache.maven.DefaultMaven.doExecute(DefaultMaven.java:319)
>        at org.apache.maven.DefaultMaven.execute(DefaultMaven.java:156)
>        at org.apache.maven.cli.MavenCli.execute(MavenCli.java:537)
>        at org.apache.maven.cli.MavenCli.doMain(MavenCli.java:196)
>        at org.apache.maven.cli.MavenCli.main(MavenCli.java:141)
>        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>        at
>
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>        at
>
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>        at java.lang.reflect.Method.invoke(Method.java:597)
>        at
>
> org.codehaus.plexus.classworlds.launcher.Launcher.launchEnhanced(Launcher.java:290)
>        at
> org.codehaus.plexus.classworlds.launcher.Launcher.launch(Launcher.java:230)
>        at
>
> org.codehaus.plexus.classworlds.launcher.Launcher.mainWithExitCode(Launcher.java:409)
>        at
> org.codehaus.plexus.classworlds.launcher.Launcher.main(Launcher.java:352)
> Caused by: org.apache.maven.plugin.MojoExecutionException: An Ant
> BuildException has occured: exec returned: 127
>        at
> org.apache.maven.plugin.antrun.AntRunMojo.execute(AntRunMojo.java:283)
>        at
>
> org.apache.maven.plugin.DefaultBuildPluginManager.executeMojo(DefaultBuildPluginManager.java:101)
>        at
>
> org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:209)
>        ... 19 more
> Caused by:
>
> /home/jack/home/download/build/hadoop-common/hadoop-common-project/hadoop-common/target/antrun/build-main.xml:23:
> exec returned: 127
>        at
> org.apache.tools.ant.taskdefs.ExecTask.runExecute(ExecTask.java:650)
>        at org.apache.tools.ant.taskdefs.ExecTask.runExec(ExecTask.java:676)
>        at org.apache.tools.ant.taskdefs.ExecTask.execute(ExecTask.java:502)
>        at
> org.apache.tools.ant.UnknownElement.execute(UnknownElement.java:291)
>        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>        at
>
> sun

Re: Hadoop and Hibernate

2012-03-02 Thread Geoffry Roberts
This is a tardy response.  I'm spread pretty thinly right now.

DistributedCache is
apparently deprecated.  Is there a replacement?  I didn't see anything
about this in the documentation, but then I am still using 0.21.0. I have
to for performance reasons.  1.0.1 is too slow and the client won't have
it.

Also, the DistributedCache approach
seems only to work from within a hadoop job.  i.e. From within a
Mapper or a Reducer, but not from within a Driver.  I have libraries that I
must access both from both places.  I take it that I am stuck keeping two
copies of these libraries in synch--Correct?  It's either that, or copy
them into hdfs, replacing them all at the beginning of each job run.

Looking for best practices.

Thanks

On 28 February 2012 10:17, Owen O'Malley  wrote:

> On Tue, Feb 28, 2012 at 5:15 PM, Geoffry Roberts
>  wrote:
>
> > If I create an executable jar file that contains all dependencies
> required
> > by the MR job do all said dependencies get distributed to all nodes?
>
> You can make a single jar and that will be distributed to all of the
> machines that run the task, but it is better in most cases to use the
> distributed cache.
>
> See
> http://hadoop.apache.org/common/docs/r1.0.0/mapred_tutorial.html#DistributedCache
>
> > If I specify but one reducer, which node in the cluster will the reducer
> > run on?
>
> The scheduling is done by the JobTracker and it isn't possible to
> control the location of the reducers.
>
> -- Owen
>



-- 
Geoffry Roberts
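
As a concrete illustration of the distributed cache route Owen points at, a minimal driver sketch against the 1.0 API might look like the following; the HDFS path and jar name are placeholders, not part of the original discussion:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;

public class DriverSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "cache-example");
    job.setJarByClass(DriverSketch.class);

    // The jar must already sit in HDFS; it is then placed on the
    // classpath of every map and reduce task.
    DistributedCache.addFileToClassPath(
        new Path("/libs/hibernate-core.jar"), job.getConfiguration());

    // ... set mapper, reducer, input and output paths as usual ...
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Running the job through ToolRunner and passing -libjars on the command line achieves much the same effect without code changes.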


Re: Hadoop pain points?

2012-03-02 Thread Kunaal
I am asking users who use Hadoop and love it, but would want to see it
improved in certain specific areas.

On Fri, Mar 2, 2012 at 8:31 AM, Mike Spreitzer  wrote:

> Interesting question.  Do you want to be asking those who use Hadoop ---
> or those who find it too painful to use?
>
> Regards,
> Mike
>
>
>
> From:   Kunaal 
> To: common-user@hadoop.apache.org
> Date:   03/02/2012 11:23 AM
> Subject:Hadoop pain points?
> Sent by:kunaalbha...@gmail.com
>
>
>
> I am doing a general poll on what are the most prevalent pain points that
> people run into with Hadoop? These could be performance related (memory
> usage, IO latencies), usage related or anything really.
>
> The goal is to look for what areas this platform could benefit the most in
> the near future.
>
> Any feedback is much appreciated.
>
> Thanks,
> Kunal.
>
>


-- 
"What we are is the universe's gift to us.
What we become is our gift to the universe."


Re: Hadoop and Hibernate

2012-03-02 Thread Kunaal
Are you looking to use DistributedCache for better performance?

On Fri, Mar 2, 2012 at 9:42 AM, Geoffry Roberts
wrote:

> This is a tardy response.  I'm spread pretty thinly right now.
>
> DistributedCache<
> http://hadoop.apache.org/common/docs/r1.0.0/mapred_tutorial.html#DistributedCache
> >is
> apparently deprecated.  Is there a replacement?  I didn't see anything
> about this in the documentation, but then I am still using 0.21.0. I have
> to for performance reasons.  1.0.1 is too slow and the client won't have
> it.
>
> Also, the DistributedCache<
> http://hadoop.apache.org/common/docs/r1.0.0/mapred_tutorial.html#DistributedCache
> >approach
> seems only to work from within a hadoop job.  i.e. From within a
> Mapper or a Reducer, but not from within a Driver.  I have libraries that I
> must access both from both places.  I take it that I am stuck keeping two
> copies of these libraries in synch--Correct?  It's either that, or copy
> them into hdfs, replacing them all at the beginning of each job run.
>
> Looking for best practices.
>
> Thanks
>
> On 28 February 2012 10:17, Owen O'Malley  wrote:
>
> > On Tue, Feb 28, 2012 at 5:15 PM, Geoffry Roberts
> >  wrote:
> >
> > > If I create an executable jar file that contains all dependencies
> > required
> > > by the MR job do all said dependencies get distributed to all nodes?
> >
> > You can make a single jar and that will be distributed to all of the
> > machines that run the task, but it is better in most cases to use the
> > distributed cache.
> >
> > See
> >
> http://hadoop.apache.org/common/docs/r1.0.0/mapred_tutorial.html#DistributedCache
> >
> > > If I specify but one reducer, which node in the cluster will the
> reducer
> > > run on?
> >
> > The scheduling is done by the JobTracker and it isn't possible to
> > control the location of the reducers.
> >
> > -- Owen
> >
>
>
>
> --
> Geoffry Roberts
>



-- 
"What we are is the universe's gift to us.
What we become is our gift to the universe."


Re: [Blog Post]: Accumulo and Pig play together now

2012-03-02 Thread Bill Graham
- bcc: u...@nutch.apache.org common-user@hadoop.apache.org

This is great Jason. One thing to add though is this line in your Pig
script:

SET mapred.map.tasks.speculative.execution false

Otherwise you're likely going to get duplicate writes into Accumulo.


On Fri, Mar 2, 2012 at 5:48 AM, Jason Trost  wrote:

> For anyone interested...
>
> Accumulo and Pig play together now:
> http://www.covert.io/post/18605091231/accumulo-and-pig
>   and
> https://github.com/jt6211/accumulo-pig
>
> --Jason
>



-- 
*Note that I'm no longer using my Yahoo! email address. Please email me at
billgra...@gmail.com going forward.*


Re: Hadoop pain points?

2012-03-02 Thread Iván de Prado
Hi Kunaal,

We have a compilation of some of them here:
http://www.datasalt.com/2012/02/mapreduce-hadoop-problems/

Regards,
Iván

2012/3/2 Kunaal 

> I am asking users who use Hadoop and love it, but would want to see it
> improved in certain specific areas.
>
> On Fri, Mar 2, 2012 at 8:31 AM, Mike Spreitzer 
> wrote:
>
> > Interesting question.  Do you want to be asking those who use Hadoop ---
> > or those who find it too painful to use?
> >
> > Regards,
> > Mike
> >
> >
> >
> > From:   Kunaal 
> > To: common-user@hadoop.apache.org
> > Date:   03/02/2012 11:23 AM
> > Subject:Hadoop pain points?
> > Sent by:kunaalbha...@gmail.com
> >
> >
> >
> > I am doing a general poll on what are the most prevalent pain points that
> > people run into with Hadoop? These could be performance related (memory
> > usage, IO latencies), usage related or anything really.
> >
> > The goal is to look for what areas this platform could benefit the most
> in
> > the near future.
> >
> > Any feedback is much appreciated.
> >
> > Thanks,
> > Kunal.
> >
> >
>
>
> --
> "What we are is the universe's gift to us.
> What we become is our gift to the universe."
>



-- 
Iván de Prado
CEO & Co-founder
www.datasalt.com


RE: Hadoop and Hibernate

2012-03-02 Thread Leo Leung
Geoffry,

 Hadoop distributedCache (as of now) is used to "cache" M/R application 
specific files.
 These files are used by M/R app only and not the framework. (Normally as 
side-lookup)

 You can certainly try to use Hibernate to query your SQL based back-end within 
the M/R code.
 But think of what happens when a few hundred or thousands of M/R tasks do that 
concurrently.
 Your back-end is going to cry. (if it can - before it dies)

 So IMO, prepping your M/R job with distributedCache files (pulling them down first) is 
a better approach.

 Also, MPI is pretty much out of the question (not baked into the framework).  
 You'll likely have to roll your own.  (And try to trick the JobTracker into not 
starting the same task.)

 Does anyone have a better solution for Geoffry?



-Original Message-
From: Geoffry Roberts [mailto:geoffry.robe...@gmail.com] 
Sent: Friday, March 02, 2012 9:42 AM
To: common-user@hadoop.apache.org
Subject: Re: Hadoop and Hibernate

This is a tardy response.  I'm spread pretty thinly right now.

DistributedCache is
apparently deprecated.  Is there a replacement?  I didn't see anything about 
this in the documentation, but then I am still using 0.21.0. I have to for 
performance reasons.  1.0.1 is too slow and the client won't have it.

Also, the DistributedCache approach
seems only to work from within a hadoop job.  i.e. From within a Mapper or a 
Reducer, but not from within a Driver.  I have libraries that I must access 
both from both places.  I take it that I am stuck keeping two copies of these 
libraries in synch--Correct?  It's either that, or copy them into hdfs, 
replacing them all at the beginning of each job run.

Looking for best practices.

Thanks

On 28 February 2012 10:17, Owen O'Malley  wrote:

> On Tue, Feb 28, 2012 at 5:15 PM, Geoffry Roberts 
>  wrote:
>
> > If I create an executable jar file that contains all dependencies
> required
> > by the MR job do all said dependencies get distributed to all nodes?
>
> You can make a single jar and that will be distributed to all of the 
> machines that run the task, but it is better in most cases to use the 
> distributed cache.
>
> See
> http://hadoop.apache.org/common/docs/r1.0.0/mapred_tutorial.html#Distr
> ibutedCache
>
> > If I specify but one reducer, which node in the cluster will the 
> > reducer run on?
>
> The scheduling is done by the JobTracker and it isn't possible to 
> control the location of the reducers.
>
> -- Owen
>



--
Geoffry Roberts


Re: Hadoop and Hibernate

2012-03-02 Thread Geoffry Roberts
No, I am using 0.21.0 for better performance.  I am interested in
DistributedCache so certain libraries can be found during MR processing.
As it is now, I'm getting ClassNotFoundException being thrown by the
Reducers.  The Driver throws no error, the Reducer(s) does.  It would seem
something is not being distributed across the cluster as I assumed it
would.  After all, the whole business is in a single, executable jar file.

On 2 March 2012 09:46, Kunaal  wrote:

> Are you looking to use DistributedCache for better performance?
>
> On Fri, Mar 2, 2012 at 9:42 AM, Geoffry Roberts
> wrote:
>
> > This is a tardy response.  I'm spread pretty thinly right now.
> >
> > DistributedCache<
> >
> http://hadoop.apache.org/common/docs/r1.0.0/mapred_tutorial.html#DistributedCache
> > >is
> > apparently deprecated.  Is there a replacement?  I didn't see anything
> > about this in the documentation, but then I am still using 0.21.0. I have
> > to for performance reasons.  1.0.1 is too slow and the client won't have
> > it.
> >
> > Also, the DistributedCache<
> >
> http://hadoop.apache.org/common/docs/r1.0.0/mapred_tutorial.html#DistributedCache
> > >approach
> > seems only to work from within a hadoop job.  i.e. From within a
> > Mapper or a Reducer, but not from within a Driver.  I have libraries
> that I
> > must access both from both places.  I take it that I am stuck keeping two
> > copies of these libraries in synch--Correct?  It's either that, or copy
> > them into hdfs, replacing them all at the beginning of each job run.
> >
> > Looking for best practices.
> >
> > Thanks
> >
> > On 28 February 2012 10:17, Owen O'Malley  wrote:
> >
> > > On Tue, Feb 28, 2012 at 5:15 PM, Geoffry Roberts
> > >  wrote:
> > >
> > > > If I create an executable jar file that contains all dependencies
> > > required
> > > > by the MR job do all said dependencies get distributed to all nodes?
> > >
> > > You can make a single jar and that will be distributed to all of the
> > > machines that run the task, but it is better in most cases to use the
> > > distributed cache.
> > >
> > > See
> > >
> >
> http://hadoop.apache.org/common/docs/r1.0.0/mapred_tutorial.html#DistributedCache
> > >
> > > > If I specify but one reducer, which node in the cluster will the
> > reducer
> > > > run on?
> > >
> > > The scheduling is done by the JobTracker and it isn't possible to
> > > control the location of the reducers.
> > >
> > > -- Owen
> > >
> >
> >
> >
> > --
> > Geoffry Roberts
> >
>
>
>
> --
> "What we are is the universe's gift to us.
> What we become is our gift to the universe."
>



-- 
Geoffry Roberts


Re: Hadoop and Hibernate

2012-03-02 Thread Tarjei Huse
On 03/02/2012 07:31 PM, Geoffry Roberts wrote:
> No, I am using 0.21.0 for better performance.  I am interested in
> DistributedCache so certain libraries can be found during MR processing.
> As it is now, I'm getting ClassNotFoundException being thrown by the
> Reducers.  The Driver throws no error, the Reducer(s) does.  It would seem
> something is not being distributed across the cluster as I assumed it
> would.  After all, the whole business is in a single, executable jar file.

How complex are the queries you are doing?

Have you considered one of the following:

1) Use plain jdbc instead of integrating Hibernate into Hadoop.
2) Create a local version of the db that can be in the Distributed Cache.

I tried using Hibernate with hadoop (the queries were not an important
part of the size of the jobs) but I ran up against so many issues trying
to get Hibernate to start up within the MR job that I ended up just
exporting the tables, loading them into memory and doing queries against
them with basic HashMap lookups.

My best advice is that if you can, you should consider a way to abstract
away Hibernate from the job and use something closer to the metal like
either JDBC or just dump the data to files. Getting Hibernate to run
outside of Spring and friends can quickly grow tiresome.

T
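
A rough sketch of the export-and-HashMap approach described above, assuming a tab-separated dump of the lookup table was added with DistributedCache.addCacheFile() in the driver; the file layout and field names are hypothetical:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class LookupMapper extends Mapper<LongWritable, Text, Text, Text> {
  private final Map<String, String> userCountry = new HashMap<String, String>();

  @Override
  protected void setup(Context context) throws IOException {
    // Files added with DistributedCache.addCacheFile() in the driver
    // show up as local paths on every task node.
    Path[] cached = DistributedCache.getLocalCacheFiles(context.getConfiguration());
    BufferedReader in = new BufferedReader(new FileReader(cached[0].toString()));
    String line;
    while ((line = in.readLine()) != null) {
      String[] cols = line.split("\t");          // user_id \t country
      userCountry.put(cols[0], cols[1]);
    }
    in.close();
  }

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    String[] cols = value.toString().split("\t");
    String country = userCountry.get(cols[0]);   // in-memory "join"
    if (country != null) {
      context.write(new Text(country), value);
    }
  }
}

The point is that each task reads the dump once in setup() and then joins purely in memory, so no Hibernate session or database connection is needed inside the tasks.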
>
> On 2 March 2012 09:46, Kunaal  wrote:
>
>> Are you looking to use DistributedCache for better performance?
>>
>> On Fri, Mar 2, 2012 at 9:42 AM, Geoffry Roberts
>> wrote:
>>
>>> This is a tardy response.  I'm spread pretty thinly right now.
>>>
>>> DistributedCache<
>>>
>> http://hadoop.apache.org/common/docs/r1.0.0/mapred_tutorial.html#DistributedCache
 is
>>> apparently deprecated.  Is there a replacement?  I didn't see anything
>>> about this in the documentation, but then I am still using 0.21.0. I have
>>> to for performance reasons.  1.0.1 is too slow and the client won't have
>>> it.
>>>
>>> Also, the DistributedCache<
>>>
>> http://hadoop.apache.org/common/docs/r1.0.0/mapred_tutorial.html#DistributedCache
 approach
>>> seems only to work from within a hadoop job.  i.e. From within a
>>> Mapper or a Reducer, but not from within a Driver.  I have libraries
>> that I
>>> must access both from both places.  I take it that I am stuck keeping two
>>> copies of these libraries in synch--Correct?  It's either that, or copy
>>> them into hdfs, replacing them all at the beginning of each job run.
>>>
>>> Looking for best practices.
>>>
>>> Thanks
>>>
>>> On 28 February 2012 10:17, Owen O'Malley  wrote:
>>>
 On Tue, Feb 28, 2012 at 5:15 PM, Geoffry Roberts
  wrote:

> If I create an executable jar file that contains all dependencies
 required
> by the MR job do all said dependencies get distributed to all nodes?
 You can make a single jar and that will be distributed to all of the
 machines that run the task, but it is better in most cases to use the
 distributed cache.

 See

>> http://hadoop.apache.org/common/docs/r1.0.0/mapred_tutorial.html#DistributedCache
> If I specify but one reducer, which node in the cluster will the
>>> reducer
> run on?
 The scheduling is done by the JobTracker and it isn't possible to
 control the location of the reducers.

 -- Owen

>>>
>>>
>>> --
>>> Geoffry Roberts
>>>
>>
>>
>> --
>> "What we are is the universe's gift to us.
>> What we become is our gift to the universe."
>>
>
>


-- 
Regards / Med vennlig hilsen
Tarjei Huse
Mobil: 920 63 413



Re: Hadoop and Hibernate

2012-03-02 Thread Geoffry Roberts
Thanks Leo.  I appreciate your response.

Let me explain my situation more precisely.

I am running a series of MR sub-jobs all harnessed together so they run as
a single job.  The last MR sub-job does nothing more than aggregate the
output of the previous sub-job into a single file.  It does this by
having but a single reducer.  I could eliminate this aggregation sub-job if
I could have the aforementioned previous sub-job insert its output into a
database instead of hdfs.  Doing this would also eliminate my current
dependence on MultipleOutputs.

The trouble comes when the Reducer(s) cannot find the persistent objects
hence the dreaded CNFE.  I find this odd because they are in the same
package as the Reducer.

Your comment about the back end crying is duly noted.

btw,
MPI = Message Passing Interface?

On 2 March 2012 10:30, Leo Leung
  wrote:

> Geoffry,
>
>  Hadoop distributedCache (as of now) is used to "cache" M/R application
> specific files.
>  These files are used by M/R app only and not the framework. (Normally as
> side-lookup)
>
>  You can certainly try to use Hibernate to query your SQL based back-end
> within the M/R code.
>  But think of what happens when a few hundred or thousands of M/R task do
> that concurrently.
>  Your back-end is going to cry. (if it can - before it dies)
>
>  So IMO,  prep your M/R job with distributedCache files (pull it down
> first) is a better approach.
>
>  Also, MPI is pretty much out of question (not baked into the framework).
>  You'll likely have to roll your own.  (And try to trick the JobTracker in
> not starting the same task)
>
>  Anyone has a better solution for Geoffry?
>
>
>
> -Original Message-
> From: Geoffry Roberts [mailto:geoffry.robe...@gmail.com]
> Sent: Friday, March 02, 2012 9:42 AM
> To: common-user@hadoop.apache.org
> Subject: Re: Hadoop and Hibernate
>
> This is a tardy response.  I'm spread pretty thinly right now.
>
> DistributedCache<
> http://hadoop.apache.org/common/docs/r1.0.0/mapred_tutorial.html#DistributedCache
> >is
> apparently deprecated.  Is there a replacement?  I didn't see anything
> about this in the documentation, but then I am still using 0.21.0. I have
> to for performance reasons.  1.0.1 is too slow and the client won't have it.
>
> Also, the DistributedCache<
> http://hadoop.apache.org/common/docs/r1.0.0/mapred_tutorial.html#DistributedCache
> >approach
> seems only to work from within a hadoop job.  i.e. From within a Mapper or
> a Reducer, but not from within a Driver.  I have libraries that I must
> access both from both places.  I take it that I am stuck keeping two copies
> of these libraries in synch--Correct?  It's either that, or copy them into
> hdfs, replacing them all at the beginning of each job run.
>
> Looking for best practices.
>
> Thanks
>
> On 28 February 2012 10:17, Owen O'Malley  wrote:
>
> > On Tue, Feb 28, 2012 at 5:15 PM, Geoffry Roberts
> >  wrote:
> >
> > > If I create an executable jar file that contains all dependencies
> > required
> > > by the MR job do all said dependencies get distributed to all nodes?
> >
> > You can make a single jar and that will be distributed to all of the
> > machines that run the task, but it is better in most cases to use the
> > distributed cache.
> >
> > See
> > http://hadoop.apache.org/common/docs/r1.0.0/mapred_tutorial.html#Distr
> > ibutedCache
> >
> > > If I specify but one reducer, which node in the cluster will the
> > > reducer run on?
> >
> > The scheduling is done by the JobTracker and it isn't possible to
> > control the location of the reducers.
> >
> > -- Owen
> >
>
>
>
> --
> Geoffry Roberts
>



-- 
Geoffry Roberts
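
If the aim is to have that earlier sub-job write straight to a database rather than HDFS, one stock option to weigh (keeping Leo's concurrency caveat in mind) is DBOutputFormat. A minimal, hypothetical driver sketch; the JDBC driver, URL, table, and column names are placeholders:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.db.DBConfiguration;
import org.apache.hadoop.mapreduce.lib.db.DBOutputFormat;

public class DbSinkDriver {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    DBConfiguration.configureDB(conf,
        "com.mysql.jdbc.Driver",                  // JDBC driver class (placeholder)
        "jdbc:mysql://dbhost:3306/analytics",     // connection URL (placeholder)
        "dbuser", "dbpass");

    Job job = new Job(conf, "write-to-db");
    job.setJarByClass(DbSinkDriver.class);
    job.setOutputFormatClass(DBOutputFormat.class);

    // Records are inserted into this table, binding these columns in order.
    DBOutputFormat.setOutput(job, "daily_totals", "dt", "metric", "value");

    // ... mapper/reducer and input configuration as usual ...
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

The job's output key class must implement DBWritable so DBOutputFormat can bind its fields to the generated INSERT, and keeping the reducer count low (for example job.setNumReduceTasks(1)) limits how many concurrent writers hit the database.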


Re: Hadoop and Hibernate

2012-03-02 Thread Geoffry Roberts
Queries are nothing but inserts.  Create an object, populate it, persist
it. If it worked, life would be good right now.

I've considered JDBC and may yet take that approach.

re: Hibernate outside of Spring -- I'm getting tired already.

Interesting thing:  I use EMF (Eclipse Modelling Framework).  The
supporting jar files for emf and ecore are built into the job.  They are
being found by the Driver(s) and the MR(s) no problemo.  If these work, why
not the hibernate stuff?  Mystery!

On 2 March 2012 10:50, Tarjei Huse  wrote:

> On 03/02/2012 07:31 PM, Geoffry Roberts wrote:
> > No, I am using 0.21.0 for better performance.  I am interested in
> > DistributedCache so certain libraries can be found during MR processing.
> > As it is now, I'm getting ClassNotFoundException being thrown by the
> > Reducers.  The Driver throws no error, the Reducer(s) does.  It would
> seem
> > something is not being distributed across the cluster as I assumed it
> > would.  After all, the whole business is in a single, executable jar
> file.
>
> How complex are the queries you are doing?
>
> Have you considered one of the following:
>
> 1) Use plain jdbc instead of integrating Hibernate into Hadoop.
> 2) Create a local version of the db that can be in the Distributed Cache.
>
> I tried using Hibernate with hadoop (the queries were not an important
> part of the size of the jobs) but I ran up against so many issues trying
> to get Hibernate to start up within the MR job that i ended up just
> exporting the tables, loading them into memory and doing queries against
> them with basic HashMap lookups.
>
> My best advice is that if you can, you should consider a way to abstract
> away Hibernate from the job and use something closer to the metal like
> either JDBC or just dump the data to files. Getting Hibernate to run
> outside of Spring and friends can quickly grow tiresome.
>
> T
> >
> > On 2 March 2012 09:46, Kunaal  wrote:
> >
> >> Are you looking to use DistributedCache for better performance?
> >>
> >> On Fri, Mar 2, 2012 at 9:42 AM, Geoffry Roberts
> >> wrote:
> >>
> >>> This is a tardy response.  I'm spread pretty thinly right now.
> >>>
> >>> DistributedCache<
> >>>
> >>
> http://hadoop.apache.org/common/docs/r1.0.0/mapred_tutorial.html#DistributedCache
>  is
> >>> apparently deprecated.  Is there a replacement?  I didn't see anything
> >>> about this in the documentation, but then I am still using 0.21.0. I
> have
> >>> to for performance reasons.  1.0.1 is too slow and the client won't
> have
> >>> it.
> >>>
> >>> Also, the DistributedCache<
> >>>
> >>
> http://hadoop.apache.org/common/docs/r1.0.0/mapred_tutorial.html#DistributedCache
>  approach
> >>> seems only to work from within a hadoop job.  i.e. From within a
> >>> Mapper or a Reducer, but not from within a Driver.  I have libraries
> >> that I
> >>> must access both from both places.  I take it that I am stuck keeping
> two
> >>> copies of these libraries in synch--Correct?  It's either that, or copy
> >>> them into hdfs, replacing them all at the beginning of each job run.
> >>>
> >>> Looking for best practices.
> >>>
> >>> Thanks
> >>>
> >>> On 28 February 2012 10:17, Owen O'Malley  wrote:
> >>>
>  On Tue, Feb 28, 2012 at 5:15 PM, Geoffry Roberts
>   wrote:
> 
> > If I create an executable jar file that contains all dependencies
>  required
> > by the MR job do all said dependencies get distributed to all nodes?
>  You can make a single jar and that will be distributed to all of the
>  machines that run the task, but it is better in most cases to use the
>  distributed cache.
> 
>  See
> 
> >>
> http://hadoop.apache.org/common/docs/r1.0.0/mapred_tutorial.html#DistributedCache
> > If I specify but one reducer, which node in the cluster will the
> >>> reducer
> > run on?
>  The scheduling is done by the JobTracker and it isn't possible to
>  control the location of the reducers.
> 
>  -- Owen
> 
> >>>
> >>>
> >>> --
> >>> Geoffry Roberts
> >>>
> >>
> >>
> >> --
> >> "What we are is the universe's gift to us.
> >> What we become is our gift to the universe."
> >>
> >
> >
>
>
> --
> Regards / Med vennlig hilsen
> Tarjei Huse
> Mobil: 920 63 413
>
>


-- 
Geoffry Roberts


problem running hadoop map reduce due to zookeeper ensemble not found

2012-03-02 Thread T Vinod Gupta
Can someone tell me the right way to do this? I created a jar that
creates a map reduce job and submits it, but I get this error when I run it
-

12/03/02 21:42:13 ERROR zookeeper.ZKConfig: no clientPort found in zoo.cfg
12/03/02 21:42:13 ERROR mapreduce.TableInputFormat:
org.apache.hadoop.hbase.ZooKeeperConnectionException: java.io.IOException:
Unable to determine ZooKeeper ensemble
at
org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.getZooKeeperWatcher(HConnectionManager.java:1000)
at
org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.setupZookeeperTrackers(HConnectionManager.java:303)
at
org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.(HConnectionManager.java:294)
at
org.apache.hadoop.hbase.client.HConnectionManager.getConnection(HConnectionManager.java:156)
at org.apache.hadoop.hbase.client.HTable.(HTable.java:167)
at org.apache.hadoop.hbase.client.HTable.(HTable.java:145)
at
org.apache.hadoop.hbase.mapreduce.TableInputFormat.setConf(TableInputFormat.java:91)
at
org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:62)
at
org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:117)
at
org.apache.hadoop.mapred.JobClient.writeNewSplits(JobClient.java:882)
at
org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:779)
at org.apache.hadoop.mapreduce.Job.submit(Job.java:432)
at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:448)
Caused by: java.io.IOException: Unable to determine ZooKeeper ensemble
at org.apache.hadoop.hbase.zookeeper.ZKUtil.connect(ZKUtil.java:92)
at
org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.(ZooKeeperWatcher.java:119)
at
org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.getZooKeeperWatcher(HConnectionManager.java:998)
... 17 more

This is on a standalone hbase installation. When I try to run it on a
different machine with a distributed hbase installation, I get the same
error.
I run it simply by doing
java  

thanks


Re: failed to build trunk, what's wrong?

2012-03-02 Thread Harsh J
You need protoc version 2.4+ for Hadoop 0.23 and trunk compilation.
Using 2.3 or lesser will not work.
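
A rough upgrade sequence, assuming a protobuf 2.4.x source tarball has already been downloaded and unpacked (paths and version are illustrative):

$ cd protobuf-2.4.1
$ ./configure
$ make
$ sudo make install
$ sudo ldconfig       # refresh the linker cache so the new libprotobuf is found
$ protoc --version
libprotoc 2.4.1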

On Fri, Mar 2, 2012 at 10:56 PM, Akshay Singh  wrote:
> Hi Folks,
>
> I have also run in to the similar problem, while trying to set up Hadoop 
> Develop Environment for Eclipse. 
> (http://wiki.apache.org/hadoop/EclipseEnvironment)
>
> ***
> ...
> [ERROR] Failed to execute goal 
> org.apache.maven.plugins:maven-antrun-plugin:1.6:run (compile-proto) on 
> project hadoop-common: An Ant BuildException has occured: exec returned: 1 -> 
> [Help 1]
> [ERROR]
> [ERROR] To see the full stack trace of the errors, re-run Maven with the -e 
> switch.
> [ERROR] Re-run Maven using the -X switch to enable full debug logging.
> [ERROR]
> [ERROR] For more information about the errors and possible solutions, please 
> read the following articles:
> [ERROR] [Help 1] 
> http://cwiki.apache.org/confluence/display/MAVEN/MojoExecutionException
> ...
> 
> Detailed logs at : http://pastebin.ca/2123540
>
> as per suggestions I have protoc on the right path and I guess I need not 
> explicitly appended LD_LIBRARY_PATH, as libprotobuf is in my /usr/lib
>
> $ protoc --version
> libprotoc 2.3.0
>
> $ locate libprotobuf
> ...
> /usr/lib/libprotobuf.so.6
> /usr/lib/libprotobuf.so.6.0.0
> /usr/share/doc/libprotobuf6
> /usr/share/doc/libprotobuf6/changelog.Debian.gz
> /usr/share/doc/libprotobuf6/changelog.gz
> /usr/share/doc/libprotobuf6/copyright
> /var/lib/dpkg/info/libprotobuf6.list
> /var/lib/dpkg/info/libprotobuf6.md5sums
>
> Any more suggestions ??
>
> Thanks
> -Akshay
>
>
> 
>  From: Ronald Petty 
> To: common-user@hadoop.apache.org
> Sent: Monday, 16 January 2012 1:44 PM
> Subject: Re: failed to build trunk, what's wrong?
>
> Hello,
>
> If you type protoc on the command line is it found?
>
> Kindest regards.
>
> Ron
>
> On Sat, Jan 14, 2012 at 5:52 PM, smith jack  wrote:
>
>> mvn compile and failed:(
>> jdk version is "1.6.0_23"
>> maven version is Apache Maven 3.0.3
>> [ERROR] Failed to execute goal
>> org.apache.maven.plugins:maven-antrun-plugin:1.6:run (compile-proto) on
>> project hadoop-common: An Ant BuildException has occured: exec returned:
>> 127 -> [Help 1]
>> org.apache.maven.lifecycle.LifecycleExecutionException: Failed to execute
>> goal org.apache.maven.plugins:maven-antrun-plugin:1.6:run (compile-proto)
>> on project hadoop-common: An Ant BuildException has occured: exec returned:
>> 127
>>        at
>>
>> org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:217)
>>        at
>>
>> org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:153)
>>        at
>>
>> org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:145)
>>        at
>>
>> org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject(LifecycleModuleBuilder.java:84)
>>        at
>>
>> org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject(LifecycleModuleBuilder.java:59)
>>        at
>>
>> org.apache.maven.lifecycle.internal.LifecycleStarter.singleThreadedBuild(LifecycleStarter.java:183)
>>        at
>>
>> org.apache.maven.lifecycle.internal.LifecycleStarter.execute(LifecycleStarter.java:161)
>>        at org.apache.maven.DefaultMaven.doExecute(DefaultMaven.java:319)
>>        at org.apache.maven.DefaultMaven.execute(DefaultMaven.java:156)
>>        at org.apache.maven.cli.MavenCli.execute(MavenCli.java:537)
>>        at org.apache.maven.cli.MavenCli.doMain(MavenCli.java:196)
>>        at org.apache.maven.cli.MavenCli.main(MavenCli.java:141)
>>        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>        at
>>
>> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>>        at
>>
>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>>        at java.lang.reflect.Method.invoke(Method.java:597)
>>        at
>>
>> org.codehaus.plexus.classworlds.launcher.Launcher.launchEnhanced(Launcher.java:290)
>>        at
>> org.codehaus.plexus.classworlds.launcher.Launcher.launch(Launcher.java:230)
>>        at
>>
>> org.codehaus.plexus.classworlds.launcher.Launcher.mainWithExitCode(Launcher.java:409)
>>        at
>> org.codehaus.plexus.classworlds.launcher.Launcher.main(Launcher.java:352)
>> Caused by: org.apache.maven.plugin.MojoExecutionException: An Ant
>> BuildException has occured: exec returned: 127
>>        at
>> org.apache.maven.plugin.antrun.AntRunMojo.execute(AntRunMojo.java:283)
>>        at
>>
>> org.apache.maven.plugin.DefaultBuildPluginManager.executeMojo(DefaultBuildPluginManager.java:101)
>>        at
>>
>> org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:209)
>>        ... 19 more
>> Caused by:
>>
>> /home/jack/home/download/build/hadoop-common/hadoop-common-project/hadoop-common/target/antrun/build-main.xml:23:
>> exec returned: 127
>>        at
>> org.apache.tools.ant.taskdefs.ExecTask.runExecute(ExecTask.

Re: Hadoop pain points?

2012-03-02 Thread Harsh J
Since you ask about anything in general, when I forayed into using
Hadoop, my biggest pain was lack of documentation clarity and
completeness over the MR and DFS user APIs (and other little points).

It would be nice to have some work done to have one example or
semi-example for every single Input/OutputFormat, Mapper/Reducer
implementations, etc. added to the javadocs.

I believe examples and snippets help out a ton (tons more than
explaining just behavior) to new devs.

On Fri, Mar 2, 2012 at 9:45 PM, Kunaal  wrote:
> I am doing a general poll on what are the most prevalent pain points that
> people run into with Hadoop? These could be performance related (memory
> usage, IO latencies), usage related or anything really.
>
> The goal is to look for what areas this platform could benefit the most in
> the near future.
>
> Any feedback is much appreciated.
>
> Thanks,
> Kunal.



-- 
Harsh J


Re: better partitioning strategy in hive

2012-03-02 Thread Mark Grover
Sorry about the delayed response, RK.

Here is what I think:
1) first of all why hive is not able to even submit the job? Is it taking for 
ever to query the list of partitions from the meta store? getting 43K recs 
should not be big deal at all?? 

--> Hive is possibly taking a long time to figure out what partitions it needs 
to query. I experienced the same problem when I had a lot of partitions (with 
relatively small files). I reverted to having a smaller number of 
partitions with larger file sizes, and that fixed the problem. Finding the balance 
between how many partitions you want and how big you want each partition to be 
is tricky, but, in general, it's better to have fewer partitions. 
You want to be aware of the small files problem. It has been discussed in many 
places. Some links are:
http://blog.rapleaf.com/dev/2008/11/20/give-me-liberty-or-give-me-death-but-dont-give-me-small-files/
http://www.cloudera.com/blog/2009/02/the-small-files-problem/
http://arunxjacob.blogspot.com/2011/04/hdfs-file-size-vs-allocation-other.html

2) So in order to improve my situation, what are my options? I can think of 
changing the partition strategy to daily partition instead of hourly. What 
should be the ideal partitioning strategy? 

--> I would say that's a good step forward.

3) if we have one partition per day and 24 files under it (i.e less partitions 
but same number of files), will it improve anything or i will have same issue ? 

--> You probably wouldn't have the same issue; if you still do, it wouldn't be 
as bad. Since the number of partitions has been reduced by a factor of 24, 
hive doesn't have to go through as many partitions. However, your 
queries that look for data in a particular hour on a given day would be slower 
now that you don't have hour as a partition.

4)Are there any special input formats or tricks to handle this? 

--> This is a separate question. What format, SerDe and compression you use for 
your data, is a part of the design but isn't necessarily linked to the problem 
in question.

5) When i tried to insert into a different table by selecting from whole days 
data, hive generate 164mappers with map-only jobs, hence creating many output 
files. How can force hive to create one output file instead of many. Setting 
mapred.reduce.tasks=1 is not even generating reduce tasks. What i can do to 
achieve this? 

--> mapred.reduce.tasks wouldn't help because the job is map-only and has no 
reduce tasks. You should look into hive.merge.* properties. Setting them in 
your hive-site.xml would do the trick. You can refer to this template 
(https://svn.apache.org/repos/asf/hive/trunk/conf/hive-default.xml.template) to 
see what properties exist. 

Good luck!
Mark
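
A hypothetical sketch of the daily layout plus the merge settings; the table names, columns, and size threshold are illustrative only, and the hive.merge.* properties are the ones listed in the hive-default template above:

-- merge the many small map-only output files into larger ones
SET hive.merge.mapfiles=true;
SET hive.merge.mapredfiles=true;
SET hive.merge.size.per.task=256000000;

-- daily partitions instead of hourly (columns are placeholders)
CREATE TABLE events_daily (
  event_id STRING,
  event_time STRING,
  payload STRING
)
PARTITIONED BY (dt STRING)
STORED AS SEQUENCEFILE;

INSERT OVERWRITE TABLE events_daily PARTITION (dt='2012-02-18')
SELECT event_id, event_time, payload
FROM events_hourly
WHERE dt='2012-02-18';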

Mark Grover, Business Intelligence Analyst
OANDA Corporation 

www: oanda.com www: fxtrade.com 
e: mgro...@oanda.com 

"Best Trading Platform" - World Finance's Forex Awards 2009. 
"The One to Watch" - Treasury Today's Adam Smith Awards 2009. 


- Original Message -
From: "rk vishu" 
To: cdh-u...@cloudera.org, common-user@hadoop.apache.org, u...@hive.apache.org
Sent: Saturday, February 18, 2012 4:39:48 AM
Subject: Re: better partitioning strategy in hive





Hello All, 

We have a hive table partitioned by date and hour(330 columns). We have 5 years 
worth of data for the table. Each hourly partition has around 800MB. 
So total 43,800 partitions with one file per partition. 

When we run select count(*) from table, hive is taking for ever to submit the 
job. I waited for 20 min and killed it. If i run for a month it takes little 
time to submit the job, but at least hive is able to get the work done?. 

Questions: 
1) first of all why hive is not able to even submit the job? Is it taking for 
ever to query the list of partitions from the meta store? getting 43K recs 
should not be big deal at all?? 
2) So in order to improve my situation, what are my options? I can think of 
changing the partition strategy to daily partition instead of hourly. What 
should be the ideal partitioning strategy? 
3) if we have one partition per day and 24 files under it (i.e less partitions 
but same number of files), will it improve anything or i will have same issue ? 
4)Are there any special input formats or tricks to handle this? 
5) When i tried to insert into a different table by selecting from whole days 
data, hive generate 164mappers with map-only jobs, hence creating many output 
files. How can force hive to create one output file instead of many. Setting 
mapred.reduce.tasks=1 is not even generating reduce tasks. What i can do to 
achieve this? 


-RK 








Re: better partitioning strategy in hive

2012-03-02 Thread Ravikumar MAV
Thank you very much for the reply. It helps.

As you mentioned in one of the points, I tried daily partitions with Snappy
SequenceFile compression. That performed much better than any other option. I
am able to run moderately complex queries on 25TB of data (size when uncompressed)
and see the results in under 20 minutes.

Thanks and Regards
Ravi
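
For anyone wanting to reproduce the Snappy SequenceFile setup Ravi mentions, the usual session settings are roughly the following, assuming the Snappy native libraries are installed on the cluster and the destination table is stored as SEQUENCEFILE:

SET hive.exec.compress.output=true;
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
-- block compression is what makes SequenceFiles compress well
SET mapred.output.compression.type=BLOCK;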

On Fri, Mar 2, 2012 at 8:20 AM, Mark Grover  wrote:

> Sorry about the dealyed response, RK.
>
> Here is what I think:
> 1) first of all why hive is not able to even submit the job? Is it taking
> for ever to query the list pf partitions from the meta store? getting 43K
> recs should not be big deal at all??
>
> --> Hive is possibly taking a long time to figure out what partitions it
> needs to query. I experienced the same problem when I had a lot of
> partitions (with relatively small sized files). I reverted back to having
> less number of partitions with larger file sizes, that fixed the problem.
> Finding the balance between how many partitions you want and how big you
> want each partition to be is tricky, but, in general, it's better to have
> lesser number of partitions. You want to be aware of the small files
> problem. It has been discussed at many places. Some links are:
>
> http://blog.rapleaf.com/dev/2008/11/20/give-me-liberty-or-give-me-death-but-dont-give-me-small-files/
> http://www.cloudera.com/blog/2009/02/the-small-files-problem/
>
> http://arunxjacob.blogspot.com/2011/04/hdfs-file-size-vs-allocation-other.html
>
> 2) So in order to improve my situation, what are my options? I can think
> of changing the partition strategy to daily partition instead of hourly.
> What should be the ideal partitioning strategy?
>
> --> I would say that's a good step forward.
>
> 3) if we have one partition per day and 24 files under it (i.e less
> partitions but same number of files), will it improve anything or i will
> have same issue ?
>
> --> You probably wouldn't have the same issue; if you still do, it
> wouldn't be as bad. Since the number of partitions have been reduced by a
> factor of 24, hive doesn't have to go through as many number of partitions.
> However, your queries that look for data in a particular hour on a given
> day would be slower now that you don't have hour as a partition.
>
> 4)Are there any special input formats or tricks to handle this?
>
> --> This is a separate question. What format, SerDe and compression you
> use for your data, is a part of the design but isn't necessarily linked to
> the problem in question.
>
> 5) When i tried to insert into a different table by selecting from whole
> days data, hive generate 164mappers with map-only jobs, hence creating many
> output files. How can force hive to create one output file instead of many.
> Setting mapred.reduce.tasks=1 is not even generating reduce tasks. What i
> can do to achieve this?
>
> --> mapred.reduce.tasks wouldn't help because the job is map-only and has
> no reduce tasks. You should look into hive.merge.* properties. Setting them
> in your hive-site.xml would do the trick. You can see refer to this
> template (
> https://svn.apache.org/repos/asf/hive/trunk/conf/hive-default.xml.template)
> to see what properties exist.
>
> Good luck!
> Mark
>
> Mark Grover, Business Intelligence Analyst
> OANDA Corporation
>
> www: oanda.com www: fxtrade.com
> e: mgro...@oanda.com
>
> "Best Trading Platform" - World Finance's Forex Awards 2009.
> "The One to Watch" - Treasury Today's Adam Smith Awards 2009.
>
>
> - Original Message -
> From: "rk vishu" 
> To: cdh-u...@cloudera.org, common-user@hadoop.apache.org,
> u...@hive.apache.org
> Sent: Saturday, February 18, 2012 4:39:48 AM
> Subject: Re: better partitioning strategy in hive
>
>
>
>
>
> Hello All,
>
> We have a hive table partitioned by date and hour(330 columns). We have 5
> years worth of data for the table. Each hourly partition have around 800MB.
> So total 43,800 partitions with one file per partition.
>
> When we run select count(*) from table, hive is taking for ever to submit
> the job. I waited for 20 min and killed it. If i run for a month it takes
> little time to submit the job, but at least hive is able to get the work
> done?.
>
> Questions:
> 1) first of all why hive is not able to even submit the job? Is it taking
> for ever to query the list pf partitions from the meta store? getting 43K
> recs should not be big deal at all??
> 2) So in order to improve my situation, what are my options? I can think
> of changing the partition strategy to daily partition instead of hourly.
> What should be the ideal partitioning strategy?
> 3) if we have one partition per day and 24 files under it (i.e less
> partitions but same number of files), will it improve anything or i will
> have same issue ?
> 4)Are there any special input formats or tricks to handle this?
> 5) When i tried to insert into a different table by selecting from whole
> days data, hive generate 164mappers with map-only jobs, hence creating many
> output file

Re: problem running hadoop map reduce due to zookeeper ensemble not found

2012-03-02 Thread Harsh J
Set hbase.zookeeper.quorum in your JobConf before submitting, to the
list of hosts that form your ZK quorum. Usually that's all you need to
run an HBase job and have it pick up the right cluster.
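
A minimal sketch of that suggestion; the host names and table are placeholders, and the trivial mapper only stands in for whatever per-row logic the original job has:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;

public class HBaseScanDriver {

  // placeholder mapper: stands in for the real per-row logic
  static class MyTableMapper extends TableMapper<Text, Text> {
    @Override
    protected void map(ImmutableBytesWritable row, Result value, Context context)
        throws IOException, InterruptedException {
      // per-row processing goes here
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();

    // Point the client at the ZooKeeper ensemble explicitly so it does not
    // fall back to looking for a local zoo.cfg.
    conf.set("hbase.zookeeper.quorum", "zk1.example.com,zk2.example.com,zk3.example.com");
    conf.set("hbase.zookeeper.property.clientPort", "2181");

    Job job = new Job(conf, "hbase-scan-job");
    job.setJarByClass(HBaseScanDriver.class);

    TableMapReduceUtil.initTableMapperJob(
        "my_table",               // source table name (placeholder)
        new Scan(),
        MyTableMapper.class,
        Text.class, Text.class,
        job);
    job.setNumReduceTasks(0);

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Putting the target cluster's hbase-site.xml on the submitting JVM's classpath achieves the same thing without hard-coding hosts.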

On Sat, Mar 3, 2012 at 5:10 AM, T Vinod Gupta  wrote:
> can someone tell, what the right way to do this.. i created a jar that
> creates a map reduce job and submits it. but i get this error when i run it
> -
>
> 12/03/02 21:42:13 ERROR zookeeper.ZKConfig: no clientPort found in zoo.cfg
> 12/03/02 21:42:13 ERROR mapreduce.TableInputFormat:
> org.apache.hadoop.hbase.ZooKeeperConnectionException: java.io.IOException:
> Unable to determine ZooKeeper ensemble
>        at
> org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.getZooKeeperWatcher(HConnectionManager.java:1000)
>        at
> org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.setupZookeeperTrackers(HConnectionManager.java:303)
>        at
> org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.(HConnectionManager.java:294)
>        at
> org.apache.hadoop.hbase.client.HConnectionManager.getConnection(HConnectionManager.java:156)
>        at org.apache.hadoop.hbase.client.HTable.(HTable.java:167)
>        at org.apache.hadoop.hbase.client.HTable.(HTable.java:145)
>        at
> org.apache.hadoop.hbase.mapreduce.TableInputFormat.setConf(TableInputFormat.java:91)
>        at
> org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:62)
>        at
> org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:117)
>        at
> org.apache.hadoop.mapred.JobClient.writeNewSplits(JobClient.java:882)
>        at
> org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:779)
>        at org.apache.hadoop.mapreduce.Job.submit(Job.java:432)
>        at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:448)
> Caused by: java.io.IOException: Unable to determine ZooKeeper ensemble
>        at org.apache.hadoop.hbase.zookeeper.ZKUtil.connect(ZKUtil.java:92)
>        at
> org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.(ZooKeeperWatcher.java:119)
>        at
> org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.getZooKeeperWatcher(HConnectionManager.java:998)
>        ... 17 more
>
> this is on a standalone hbase installation.. when i try to run it on a
> different machine with distributed hbase installation, i get the same
> error..
> i just it simply by doing
> java  
>
> thanks



-- 
Harsh J


Re: Hadoop pain points?

2012-03-02 Thread Mohit Anchlia
+1

On Fri, Mar 2, 2012 at 4:09 PM, Harsh J  wrote:

> Since you ask about anything in general, when I forayed into using
> Hadoop, my biggest pain was lack of documentation clarity and
> completeness over the MR and DFS user APIs (and other little points).
>
> It would be nice to have some work done to have one example or
> semi-example for every single Input/OutputFormat, Mapper/Reducer
> implementations, etc. added to the javadocs.
>
> I believe examples and snippets help out a ton (tons more than
> explaining just behavior) to new devs.
>
> On Fri, Mar 2, 2012 at 9:45 PM, Kunaal  wrote:
> > I am doing a general poll on what are the most prevalent pain points that
> > people run into with Hadoop? These could be performance related (memory
> > usage, IO latencies), usage related or anything really.
> >
> > The goal is to look for what areas this platform could benefit the most
> in
> > the near future.
> >
> > Any feedback is much appreciated.
> >
> > Thanks,
> > Kunal.
>
>
>
> --
> Harsh J
>


Re: Hadoop pain points?

2012-03-02 Thread Russell Jurney
+2

Russell Jurney http://datasyndrome.com

On Mar 2, 2012, at 4:38 PM, Mohit Anchlia  wrote:

> +1
> 
> On Fri, Mar 2, 2012 at 4:09 PM, Harsh J  wrote:
> 
>> Since you ask about anything in general, when I forayed into using
>> Hadoop, my biggest pain was lack of documentation clarity and
>> completeness over the MR and DFS user APIs (and other little points).
>> 
>> It would be nice to have some work done to have one example or
>> semi-example for every single Input/OutputFormat, Mapper/Reducer
>> implementations, etc. added to the javadocs.
>> 
>> I believe examples and snippets help out a ton (tons more than
>> explaining just behavior) to new devs.
>> 
>> On Fri, Mar 2, 2012 at 9:45 PM, Kunaal  wrote:
>>> I am doing a general poll on what are the most prevalent pain points that
>>> people run into with Hadoop? These could be performance related (memory
>>> usage, IO latencies), usage related or anything really.
>>> 
>>> The goal is to look for what areas this platform could benefit the most
>> in
>>> the near future.
>>> 
>>> Any feedback is much appreciated.
>>> 
>>> Thanks,
>>> Kunal.
>> 
>> 
>> 
>> --
>> Harsh J
>> 


Re: Hadoop pain points?

2012-03-02 Thread Leonardo Urbina
+3.14159265358979

Sent from my phone

On Mar 2, 2012, at 6:42 PM, Russell Jurney  wrote:

> +2
>
> Russell Jurney http://datasyndrome.com
>
> On Mar 2, 2012, at 4:38 PM, Mohit Anchlia  wrote:
>
>> +1
>>
>> On Fri, Mar 2, 2012 at 4:09 PM, Harsh J  wrote:
>>
>>> Since you ask about anything in general, when I forayed into using
>>> Hadoop, my biggest pain was lack of documentation clarity and
>>> completeness over the MR and DFS user APIs (and other little points).
>>>
>>> It would be nice to have some work done to have one example or
>>> semi-example for every single Input/OutputFormat, Mapper/Reducer
>>> implementations, etc. added to the javadocs.
>>>
>>> I believe examples and snippets help out a ton (tons more than
>>> explaining just behavior) to new devs.
>>>
>>> On Fri, Mar 2, 2012 at 9:45 PM, Kunaal  wrote:
 I am doing a general poll on what are the most prevalent pain points that
 people run into with Hadoop? These could be performance related (memory
 usage, IO latencies), usage related or anything really.

 The goal is to look for what areas this platform could benefit the most
>>> in
 the near future.

 Any feedback is much appreciated.

 Thanks,
 Kunal.
>>>
>>>
>>>
>>> --
>>> Harsh J
>>>


Re: Hadoop pain points?

2012-03-02 Thread Russell Jurney
+6.28318531

On Fri, Mar 2, 2012 at 7:35 PM, Leonardo Urbina  wrote:

> +3.14159265358979
>
> Sent from my phone
>
> On Mar 2, 2012, at 6:42 PM, Russell Jurney 
> wrote:
>
> > +2
> >
> > Russell Jurney http://datasyndrome.com
> >
> > On Mar 2, 2012, at 4:38 PM, Mohit Anchlia 
> wrote:
> >
> >> +1
> >>
> >> On Fri, Mar 2, 2012 at 4:09 PM, Harsh J  wrote:
> >>
> >>> Since you ask about anything in general, when I forayed into using
> >>> Hadoop, my biggest pain was lack of documentation clarity and
> >>> completeness over the MR and DFS user APIs (and other little points).
> >>>
> >>> It would be nice to have some work done to have one example or
> >>> semi-example for every single Input/OutputFormat, Mapper/Reducer
> >>> implementations, etc. added to the javadocs.
> >>>
> >>> I believe examples and snippets help out a ton (tons more than
> >>> explaining just behavior) to new devs.
> >>>
> >>> On Fri, Mar 2, 2012 at 9:45 PM, Kunaal 
> wrote:
>  I am doing a general poll on what are the most prevalent pain points
> that
>  people run into with Hadoop? These could be performance related
> (memory
>  usage, IO latencies), usage related or anything really.
> 
>  The goal is to look for what areas this platform could benefit the
> most
> >>> in
>  the near future.
> 
>  Any feedback is much appreciated.
> 
>  Thanks,
>  Kunal.
> >>>
> >>>
> >>>
> >>> --
> >>> Harsh J
> >>>
>



-- 
Russell Jurney twitter.com/rjurney russell.jur...@gmail.com datasyndrome.com


Re: Hadoop and Hibernate

2012-03-02 Thread Tarjei Huse
On 03/02/2012 07:59 PM, Geoffry Roberts wrote:
> Queries are nothing but inserts.  Create an object, populate it, persist
> it. If it worked, life would be good right now.
>
> I've considered JDBC and may yet take that approach.
I have used MyBatis on a recent project - also worth considering if you want a
more ORM-like feel to the job.
>
> re: Hibernate outside of Spring -- I'm getting tired already.
>
> Interesting thing:  I use EMF (Eclipse Modelling Framework).  The
> supporting jar files for emf and ecore are built into the job.  They are
> being found by the Driver(s) and the MR(s) no problemo.  If these work, why
> not the hibernate stuff?  Mystery!
I wish I knew. :)


T
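
A minimal sketch of the "export the tables and do plain HashMap lookups"
approach described in the quoted exchange below; the file name, tab-separated
layout and the way the file gets into the distributed cache are assumptions
for illustration only:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Assumes the driver has shipped the exported table (hypothetical name), e.g.
//   DistributedCache.addCacheFile(new URI("hdfs:///lookup/users.tsv"),
//       job.getConfiguration());
public class LookupMapper extends Mapper<LongWritable, Text, Text, Text> {
  private final Map<String, String> users = new HashMap<String, String>();

  @Override
  protected void setup(Context context) throws IOException, InterruptedException {
    // Load the exported table into memory once per task.
    Path[] cached = DistributedCache.getLocalCacheFiles(context.getConfiguration());
    BufferedReader in = new BufferedReader(new FileReader(cached[0].toString()));
    try {
      String line;
      while ((line = in.readLine()) != null) {
        String[] parts = line.split("\t", 2);
        if (parts.length == 2) {
          users.put(parts[0], parts[1]);
        }
      }
    } finally {
      in.close();
    }
  }

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    String id = value.toString().split("\t", 2)[0];
    String user = users.get(id);  // plain in-memory lookup, no ORM or JDBC on the task side
    if (user != null) {
      context.write(new Text(id), new Text(user));
    }
  }
}
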
>
> On 2 March 2012 10:50, Tarjei Huse  wrote:
>
>> On 03/02/2012 07:31 PM, Geoffry Roberts wrote:
>>> No, I am using 0.21.0 for better performance.  I am interested in
>>> DistributedCache so certain libraries can be found during MR processing.
>>> As it is now, I'm getting ClassNotFoundException being thrown by the
>>> Reducers.  The Driver throws no error, the Reducer(s) does.  It would seem
>>> something is not being distributed across the cluster as I assumed it
>>> would.  After all, the whole business is in a single, executable jar file.
>>
>> How complex are the queries you are doing?
>>
>> Have you considered one of the following:
>>
>> 1) Use plain jdbc instead of integrating Hibernate into Hadoop.
>> 2) Create a local version of the db that can be in the Distributed Cache.
>>
>> I tried using Hibernate with Hadoop (the queries were not an important
>> part of the size of the jobs) but I ran up against so many issues trying
>> to get Hibernate to start up within the MR job that I ended up just
>> exporting the tables, loading them into memory and doing queries against
>> them with basic HashMap lookups.
>>
>> My best advice is that if you can, you should consider a way to abstract
>> away Hibernate from the job and use something closer to the metal like
>> either JDBC or just dump the data to files. Getting Hibernate to run
>> outside of Spring and friends can quickly grow tiresome.
>>
>> T
>>> On 2 March 2012 09:46, Kunaal  wrote:
>>>
 Are you looking to use DistributedCache for better performance?

 On Fri, Mar 2, 2012 at 9:42 AM, Geoffry Roberts
 wrote:

> This is a tardy response.  I'm spread pretty thinly right now.
>
> DistributedCache
> (http://hadoop.apache.org/common/docs/r1.0.0/mapred_tutorial.html#DistributedCache)
> is apparently deprecated.  Is there a replacement?  I didn't see anything
> about this in the documentation, but then I am still using 0.21.0.  I have
> to for performance reasons.  1.0.1 is too slow and the client won't have it.
>
> Also, the DistributedCache
> (http://hadoop.apache.org/common/docs/r1.0.0/mapred_tutorial.html#DistributedCache)
> approach seems only to work from within a Hadoop job, i.e. from within a
> Mapper or a Reducer, but not from within a Driver.  I have libraries that I
> must access from both places.  I take it that I am stuck keeping two
> copies of these libraries in synch--Correct?  It's either that, or copy
> them into HDFS, replacing them all at the beginning of each job run.
>
> Looking for best practices.
>
> Thanks
>
> On 28 February 2012 10:17, Owen O'Malley  wrote:
>
>> On Tue, Feb 28, 2012 at 5:15 PM, Geoffry Roberts
>>  wrote:
>>
>>> If I create an executable jar file that contains all dependencies required
>>> by the MR job, do all said dependencies get distributed to all nodes?
>> You can make a single jar and that will be distributed to all of the
>> machines that run the task, but it is better in most cases to use the
>> distributed cache.
>>
>> See
>>
>> http://hadoop.apache.org/common/docs/r1.0.0/mapred_tutorial.html#DistributedCache
>>> If I specify but one reducer, which node in the cluster will the reducer
>>> run on?
>> The scheduling is done by the JobTracker and it isn't possible to
>> control the location of the reducers.
>>
>> -- Owen
>>
>
> --
> Geoffry Roberts
>

 --
 "What we are is the universe's gift to us.
 What we become is our gift to the universe."

>>>
>>
>> --
>> Regards / Med vennlig hilsen
>> Tarjei Huse
>> Mobil: 920 63 413
>>
>>
>


-- 
Regards / Med vennlig hilsen
Tarjei Huse
Mobil: 920 63 413
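
On the jar-distribution side of the same quoted exchange, a minimal driver
sketch of pushing extra library jars to the task nodes through the distributed
cache; the HDFS paths are made up, and nothing here is specific to Hibernate
or EMF:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class LibJarsDriver {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "job-with-extra-libs");
    job.setJarByClass(LibJarsDriver.class);

    // Each jar already sitting in HDFS is localized on every task node and
    // added to the task's classpath, so Mapper/Reducer code can load it.
    DistributedCache.addFileToClassPath(new Path("/libs/some-dependency-1.0.jar"),
        job.getConfiguration());
    DistributedCache.addFileToClassPath(new Path("/libs/another-dependency-2.3.jar"),
        job.getConfiguration());

    // job.setMapperClass(MyMapper.class);    // the classes that need those jars
    // job.setReducerClass(MyReducer.class);  // (hypothetical names)

    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

When a job is launched through ToolRunner/GenericOptionsParser, passing the
same jars with -libjars achieves the same effect without code changes.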