Re: question about fair scheduler

2013-08-23 Thread Sandy Ryza
That's right that the other 2 apps will end up getting 10 resources each,
but as more resources are released, the cluster will eventually converge
to a fair state.  I.e. if the first app requested additional resources
after releasing some, it would not receive any more until either
another app grew to hold more resources than it, or no other app needed
them.

Does that make sense?

-Sandy


On Thu, Aug 22, 2013 at 6:36 PM, ch huang  wrote:

> hi,i have a question about fair scheduler
> doc says "When there is a single app running, that app uses the entire
> cluster. When other apps are submitted, resources that free up are assigned
> to the new apps, so that each app gets roughly the same amount of
> resources",
> suppose I have only one big app running, so it gets 100 resources; as it
> runs, it releases 20 resources; at that time 2 short apps come in to run, so
> each gets
> 10 resources. What I think is "fairness applies to the currently available
> resources, not the whole resources" - is that right?
>


DICOM Image Processing using Hadoop

2013-08-23 Thread Shalish VJ
Hi,
 
Is it possible to process DICOM images using Hadoop?
Please help me with an example.
 
Thanks,
Shalish.

Re: DICOM Image Processing using Hadoop

2013-08-23 Thread kapil bhosale
Hi,
there is an API called HIPI (http://hipi.cs.virginia.edu/) for
processing huge images using Hadoop.
You can find more information there.
It might work in your case.
If not, please ignore.

Thanks and regards
Kapil


On Fri, Aug 23, 2013 at 3:01 PM, Shalish VJ  wrote:

> Hi,
>
> Is it possible to process DICOM images using hadoop.
> Please help me with an example.
>
> Thanks,
> Shalish.
>



-- 
kap's


Re: DICOM Image Processing using Hadoop

2013-08-23 Thread haiwei.xie-soulinfo


> Hi,
>  
> Is it possible to process DICOM images using hadoop.
> Please help me with an example.
>  
> Thanks,
> Shalish.


What is your aim in processing DICOM images?

We are developing a demo involving face recognition, and it looks difficult.
We would appreciate advice as well.

thanks,
-terrs.


Re: DICOM Image Processing using Hadoop

2013-08-23 Thread Shalish VJ
Hi,
 
I am trying to do a proof of concept on that.
Please help if anyone has some ideas.
I couldn't find anything on the internet.



From: haiwei.xie-soulinfo 
To: user@hadoop.apache.org 
Sent: Friday, August 23, 2013 3:16 PM
Subject: Re: DICOM Image Processing using Hadoop




> Hi,
>  
> Is it possible to process DICOM images using hadoop.
> Please help me with an example.
>  
> Thanks,
> Shalish.


What is your aim in processing DICOM images?

We are developing a demo involving face recognition, and it looks difficult.
We would appreciate advice as well.

thanks,
-terrs.

Re: running map tasks in remote node

2013-08-23 Thread rab ra
Thanks for the reply.

I am basically exploring possible ways to work with the Hadoop framework for
one of my use cases. I have my limitations in using HDFS, but I agree with the
fact that using MapReduce in conjunction with HDFS makes sense.

I successfully tested WholeFileInputFormat after some googling.

Now, coming to my use case. I would like to keep some files on my master
node and do some processing on the cloud nodes. The policy does not
allow us to configure and use the cloud nodes as HDFS.  However, I would like
to spawn map processes on those nodes. Hence, I set the input path to the local
file system, for example, $HOME/inputs. I have a file listing filenames (10
lines) in this input directory.  I use NLineInputFormat and spawn 10 map
processes. Each map process gets a line. The map process will then do a file
transfer and process it.  However, I get a FileNotFoundException for
$HOME/inputs in the map. I am sure this directory is present
on my master but not on the slave nodes. When I copy this input directory
to the slave nodes, it works fine. I am not able to figure out how to fix this
or the reason for the error. I do not understand why it complains that
the input directory is not present. As far as I know, a slave node gets a map
task and the map method receives the contents of the input file. This should be
fine for the map logic to work.


with regards
rabmdu
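
For reference, a minimal sketch of the driver described above (NLineInputFormat
over a local-filesystem listing file), assuming a Hadoop version that ships the
new-API org.apache.hadoop.mapreduce.lib.input.NLineInputFormat; class names and
paths here are only illustrative:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class FileListDriver {

  // Each map() call receives one line of the listing, i.e. one file name to
  // fetch and process; this sketch just echoes it back out.
  public static class FileTransferMapper
      extends Mapper<LongWritable, Text, LongWritable, Text> {
    @Override
    protected void map(LongWritable offset, Text fileName, Context context)
        throws java.io.IOException, InterruptedException {
      context.write(offset, fileName);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "process-file-list");
    job.setJarByClass(FileListDriver.class);
    job.setInputFormatClass(NLineInputFormat.class); // default: 1 line per map task
    job.setMapperClass(FileTransferMapper.class);
    job.setNumReduceTasks(0);
    job.setOutputKeyClass(LongWritable.class);
    job.setOutputValueClass(Text.class);

    // A file:// input path is re-opened by each map task on the node where the
    // task runs, so the listing must exist (or be a shared mount) on every
    // slave node, not only on the master -- which is exactly the cause of the
    // FileNotFoundException described above.
    FileInputFormat.addInputPath(job, new Path("file:///home/user/inputs"));
    FileOutputFormat.setOutputPath(job, new Path("file:///home/user/outputs"));

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}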




On Thu, Aug 22, 2013 at 4:40 PM, java8964 java8964 wrote:

> If you don't plan to use HDFS, what kind of shared file system are you
> going to use across the cluster? NFS?
> For what you want to do, even though it doesn't make much sense, you first
> need to solve the problem of the shared file system.
>
> Second, if you want to process the files file by file, instead of block by
> block in HDFS, then you need to use the WholeFileInputFormat (google this
> how to write one). So you don't need a file to list all the files to be
> processed, just put them into one folder in the sharing file system, then
> send this folder to your MR job. In this way, as long as each node can
> access it through some file system URL, each file will be processed in each
> mapper.
>
> Yong
>
> --
> Date: Wed, 21 Aug 2013 17:39:10 +0530
> Subject: running map tasks in remote node
> From: rab...@gmail.com
> To: user@hadoop.apache.org
>
>
> Hello,
>
> Here is the newbie question of the day.
>
> For one of my use cases, I want to use hadoop map reduce without HDFS.
> Here, I will have a text file containing a list of file names to process.
> Assume that I have 10 lines (10 files to process) in the input text file
> and I wish to generate 10 map tasks and execute them in parallel in 10
> nodes. I started with basic tutorial on hadoop and could setup single node
> hadoop cluster and successfully tested wordcount code.
>
> Now, I took two machines A (master) and B (slave). I did the below
> configuration in these machines to setup a two node cluster.
>
> hdfs-site.xml
>
> <configuration>
>   <property>
>     <name>dfs.replication</name>
>     <value>1</value>
>   </property>
>   <property>
>     <name>dfs.name.dir</name>
>     <value>/tmp/hadoop-bala/dfs/name</value>
>   </property>
>   <property>
>     <name>dfs.data.dir</name>
>     <value>/tmp/hadoop-bala/dfs/data</value>
>   </property>
>   <property>
>     <name>mapred.job.tracker</name>
>     <value>A:9001</value>
>   </property>
> </configuration>
>
> mapred-site.xml
>
> <configuration>
>   <property>
>     <name>mapred.job.tracker</name>
>     <value>A:9001</value>
>   </property>
>   <property>
>     <name>mapreduce.tasktracker.map.tasks.maximum</name>
>     <value>1</value>
>   </property>
> </configuration>
>
> core-site.xml
>
> <configuration>
>   <property>
>     <name>fs.default.name</name>
>     <value>hdfs://A:9000</value>
>   </property>
> </configuration>
>
>
> In A and B, I do have a file named ‘slaves’ with an entry ‘B’ in it and
> another file called ‘masters’ wherein an entry ‘A’ is there.
>
> I have kept my input file at A. I see the map method process the input
> file line by line but they are all processed in A. Ideally, I would expect
> those processing to take place in B.
>
> Can anyone highlight where I am going wrong?
>
>  regards
> rab
>


Re: Hadoop - impersonation doubts/issues while accessing from remote machine

2013-08-23 Thread Harsh J
I've answered this on the stackoverflow link:
http://stackoverflow.com/questions/18354664/spring-data-hadoop-connectivity
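
For the archive, the "is not allowed to impersonate" error quoted below is the
generic Hadoop proxy-user refusal; independent of the linked answer, the
cluster-side whitelist normally lives in core-site.xml along these lines (the
host and group values are placeholders taken from this thread, not a verified
fix, and the services must be restarted or the proxy-user configuration
refreshed to pick them up):

<property>
  <name>hadoop.proxyuser.298790.hosts</name>
  <value>INFVA03351</value>
</property>
<property>
  <name>hadoop.proxyuser.298790.groups</name>
  <value>bigdata</value>
</property>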

On Thu, Aug 22, 2013 at 1:29 PM, Omkar Joshi
 wrote:
> For readability, I haven’t posted the code, output etc. in this mail –
> please check the thread below :
>
>
>
> http://stackoverflow.com/questions/18354664/spring-data-hadoop-connectivity
>
>
>
> I'm trying to connect to a remote hadoop(1.1.2) cluster from my local
> Windows machine via Spring data(later, eclipse plug-in may also be used). In
> future, multiple such connections from several Windows machines are
> expected.
>
>
>
> On my remote(single-node) cluster, bigdata is the user for Hadoop etc.
>
> bigdata@cloudx-843-770:~$ groups bigdata
>
> bigdata : bigdata
>
> On my local Windows machine
>
> D:\>echo %username%
>
> 298790
>
> D:\>hostname
>
> INFVA03351
>
>
>
> Now if I refer to Hadoop Secure Impersonation., does it mean I need to
> create a user 298790 on the cluster, add the hostname in core-site.xml etc.
> ??? Any less-cumbersome ways out? I tried that too on the cluster but the
> (partial given)output error still persists :
>
>
>
> Aug 22, 2013 12:29:20 PM
> org.springframework.context.support.AbstractApplicationContext
> prepareRefresh
>
> INFO: Refreshing
> org.springframework.context.support.ClassPathXmlApplicationContext@1815338:
> startup date [Thu Aug 22 12:29:20 IST 2013]; root of context hierarchy
>
> Aug 22, 2013 12:29:20 PM
> org.springframework.beans.factory.xml.XmlBeanDefinitionReader
> loadBeanDefinitions
>
> INFO: Loading XML bean definitions from class path resource
> [com/hadoop/basics/applicationContext.xml]
>
> Aug 22, 2013 12:29:20 PM
> org.springframework.core.io.support.PropertiesLoaderSupport loadProperties
>
> INFO: Loading properties file from class path resource
> [resources/hadoop.properties]
>
> Aug 22, 2013 12:29:20 PM
> org.springframework.beans.factory.support.DefaultListableBeanFactory
> preInstantiateSingletons
>
> INFO: Pre-instantiating singletons in
> org.springframework.beans.factory.support.DefaultListableBeanFactory@7c197e:
> defining beans
> [org.springframework.context.support.PropertySourcesPlaceholderConfigurer#0,hadoopConfiguration,wc-job,myjobs-runner,resourceLoader];
> root of factory hierarchy
>
> Aug 22, 2013 12:29:21 PM
> org.springframework.data.hadoop.mapreduce.JobExecutor$2 run
>
> INFO: Starting job [wc-job]
>
> Aug 22, 2013 12:29:21 PM org.apache.hadoop.security.UserGroupInformation
> doAs
>
> SEVERE: PriviledgedActionException as:bigdata via 298790
> cause:org.apache.hadoop.ipc.RemoteException: User: 298790 is not allowed to
> impersonate bigdata
>
> Aug 22, 2013 12:29:21 PM
> org.springframework.data.hadoop.mapreduce.JobExecutor$2 run
>
> WARNING: Cannot start job [wc-job]
>
> org.apache.hadoop.ipc.RemoteException: User: 298790 is not allowed to
> impersonate bigdata
>
>   at org.apache.hadoop.ipc.Client.call(Client.java:1107)
>
>   at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:229)
>
>   at org.apache.hadoop.mapred.$Proxy2.getProtocolVersion(Unknown Source)
>
>   at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:411)
>
>   at
> org.apache.hadoop.mapred.JobClient.createRPCProxy(JobClient.java:499)
>
>   at org.apache.hadoop.mapred.JobClient.init(JobClient.java:490)
>
>   at org.apache.hadoop.mapred.JobClient.(JobClient.java:473)
>
>   at org.apache.hadoop.mapreduce.Job$1.run(Job.java:513)
>
>   at java.security.AccessController.doPrivileged(Native Method)
>
>   at javax.security.auth.Subject.doAs(Unknown Source)
>
>   at
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1149)
>
>   at org.apache.hadoop.mapreduce.Job.connect(Job.java:511)
>
>   at org.apache.hadoop.mapreduce.Job.submit(Job.java:499)
>
>   at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:530)
>
>   at
> org.springframework.data.hadoop.mapreduce.JobExecutor$2.run(JobExecutor.java:197)
>
>   at
> org.springframework.core.task.SyncTaskExecutor.execute(SyncTaskExecutor.java:49)
>
>   at
> org.springframework.data.hadoop.mapreduce.JobExecutor.startJobs(JobExecutor.java:168)
>
>   at
> org.springframework.data.hadoop.mapreduce.JobExecutor.startJobs(JobExecutor.java:160)
>
>   at
> org.springframework.data.hadoop.mapreduce.JobRunner.call(JobRunner.java:52)
>
>   at
> org.springframework.data.hadoop.mapreduce.JobRunner.afterPropertiesSet(JobRunner.java:44)
>
>   at
> org.springframework.beans.factory.support.AbstractAutowireCapableBeanFactory.invokeInitMethods(AbstractAutowireCapableBeanFactory.java:1541)
>
>   at
> org.springframework.beans.factory.support.AbstractAutowireCapableBeanFactory.initializeBean(AbstractAutowireCapableBeanFactory.java:1479)
>
>   at
> org.springframework.beans.factory.support.AbstractAutowireCapableBeanFactory.doCreateBean(AbstractAutowireCapableBeanFactory.java:521)
>
>   at
> org.springframework.beans.factory.support.AbstractAutowireCa

RE: Hadoop - impersonation doubts/issues while accessing from remote machine

2013-08-23 Thread Omkar Joshi
Thanks :)

Regards,
Omkar Joshi


-Original Message-
From: Harsh J [mailto:ha...@cloudera.com] 
Sent: Friday, August 23, 2013 3:52 PM
To: 
Subject: Re: Hadoop - impersonation doubts/issues while accessing from remote 
machine

I've answered this on the stackoverflow link:
http://stackoverflow.com/questions/18354664/spring-data-hadoop-connectivity

On Thu, Aug 22, 2013 at 1:29 PM, Omkar Joshi
 wrote:
> For readability, I haven't posted the code, output etc. in this mail -
> please check the thread below :
>
>
>
> http://stackoverflow.com/questions/18354664/spring-data-hadoop-connectivity
>
>
>
> I'm trying to connect to a remote hadoop(1.1.2) cluster from my local
> Windows machine via Spring data(later, eclipse plug-in may also be used). In
> future, multiple such connections from several Windows machines are
> expected.
>
>
>
> On my remote(single-node) cluster, bigdata is the user for Hadoop etc.
>
> bigdata@cloudx-843-770:~$ groups bigdata
>
> bigdata : bigdata
>
> On my local Windows machine
>
> D:\>echo %username%
>
> 298790
>
> D:\>hostname
>
> INFVA03351
>
>
>
> Now if I refer to Hadoop Secure Impersonation., does it mean I need to
> create a user 298790 on the cluster, add the hostname in core-site.xml etc.
> ??? Any less-cumbersome ways out? I tried that too on the cluster but the
> (partial given)output error still persists :
>
>
>
> Aug 22, 2013 12:29:20 PM
> org.springframework.context.support.AbstractApplicationContext
> prepareRefresh
>
> INFO: Refreshing
> org.springframework.context.support.ClassPathXmlApplicationContext@1815338:
> startup date [Thu Aug 22 12:29:20 IST 2013]; root of context hierarchy
>
> Aug 22, 2013 12:29:20 PM
> org.springframework.beans.factory.xml.XmlBeanDefinitionReader
> loadBeanDefinitions
>
> INFO: Loading XML bean definitions from class path resource
> [com/hadoop/basics/applicationContext.xml]
>
> Aug 22, 2013 12:29:20 PM
> org.springframework.core.io.support.PropertiesLoaderSupport loadProperties
>
> INFO: Loading properties file from class path resource
> [resources/hadoop.properties]
>
> Aug 22, 2013 12:29:20 PM
> org.springframework.beans.factory.support.DefaultListableBeanFactory
> preInstantiateSingletons
>
> INFO: Pre-instantiating singletons in
> org.springframework.beans.factory.support.DefaultListableBeanFactory@7c197e:
> defining beans
> [org.springframework.context.support.PropertySourcesPlaceholderConfigurer#0,hadoopConfiguration,wc-job,myjobs-runner,resourceLoader];
> root of factory hierarchy
>
> Aug 22, 2013 12:29:21 PM
> org.springframework.data.hadoop.mapreduce.JobExecutor$2 run
>
> INFO: Starting job [wc-job]
>
> Aug 22, 2013 12:29:21 PM org.apache.hadoop.security.UserGroupInformation
> doAs
>
> SEVERE: PriviledgedActionException as:bigdata via 298790
> cause:org.apache.hadoop.ipc.RemoteException: User: 298790 is not allowed to
> impersonate bigdata
>
> Aug 22, 2013 12:29:21 PM
> org.springframework.data.hadoop.mapreduce.JobExecutor$2 run
>
> WARNING: Cannot start job [wc-job]
>
> org.apache.hadoop.ipc.RemoteException: User: 298790 is not allowed to
> impersonate bigdata
>
>   at org.apache.hadoop.ipc.Client.call(Client.java:1107)
>
>   at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:229)
>
>   at org.apache.hadoop.mapred.$Proxy2.getProtocolVersion(Unknown Source)
>
>   at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:411)
>
>   at
> org.apache.hadoop.mapred.JobClient.createRPCProxy(JobClient.java:499)
>
>   at org.apache.hadoop.mapred.JobClient.init(JobClient.java:490)
>
>   at org.apache.hadoop.mapred.JobClient.(JobClient.java:473)
>
>   at org.apache.hadoop.mapreduce.Job$1.run(Job.java:513)
>
>   at java.security.AccessController.doPrivileged(Native Method)
>
>   at javax.security.auth.Subject.doAs(Unknown Source)
>
>   at
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1149)
>
>   at org.apache.hadoop.mapreduce.Job.connect(Job.java:511)
>
>   at org.apache.hadoop.mapreduce.Job.submit(Job.java:499)
>
>   at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:530)
>
>   at
> org.springframework.data.hadoop.mapreduce.JobExecutor$2.run(JobExecutor.java:197)
>
>   at
> org.springframework.core.task.SyncTaskExecutor.execute(SyncTaskExecutor.java:49)
>
>   at
> org.springframework.data.hadoop.mapreduce.JobExecutor.startJobs(JobExecutor.java:168)
>
>   at
> org.springframework.data.hadoop.mapreduce.JobExecutor.startJobs(JobExecutor.java:160)
>
>   at
> org.springframework.data.hadoop.mapreduce.JobRunner.call(JobRunner.java:52)
>
>   at
> org.springframework.data.hadoop.mapreduce.JobRunner.afterPropertiesSet(JobRunner.java:44)
>
>   at
> org.springframework.beans.factory.support.AbstractAutowireCapableBeanFactory.invokeInitMethods(AbstractAutowireCapableBeanFactory.java:1541)
>
>   at
> org.springframework.beans.factory.support.AbstractAutowireCapableBeanFactory.initializeBean(AbstractAutowireCapableBeanFactory.j

Re: running map tasks in remote node

2013-08-23 Thread Shahab Yunus
You say:
"Each map process gets a line. The map process will then do a file transfer
and process it.  "

What file, and from where to where, is being transferred in the map? Are you
sure that the mappers are not complaining about 'this' file access? Because
this seems to be separate from the initial data input that each mapper gets
(basically your understanding that the "map method contains contents of the
input file").

Regards,
Shahab


On Fri, Aug 23, 2013 at 6:13 AM, rab ra  wrote:

> Thanks for the reply.
>
> I am basically exploring possible ways to work with hadoop framework for
> one of my use case. I have my limitations in using hdfs but agree with the
> fact that using map reduce in conjunction with hdfs makes sense.
>
> I successfully tested wholeFileInputFormat by some googling.
>
> Now, coming to my use case. I would like to keep some files in my master
> node and want to do some processing in the cloud nodes. The policy does not
> allow us to configure and use cloud nodes as HDFS.  However, I would like
> to span a map process in those nodes. Hence, I set input path as local file
> system, for example, $HOME/inputs. I have a file listing filenames (10
> lines) in this input directory.  I use NLineInputFormat and span 10 map
> process. Each map process gets a line. The map process will then do a file
> transfer and process it.  However, I get an error in the map saying that
> the FileNotFoundException $HOME/inputs. I am sure this directory is present
> in my master but not in the slave nodes. When I copy this input directory
> to slave nodes, it works fine. I am not able to figure out how to fix this
> and the reason for the error. I am not understand why it complains about
> the input directory is not present. As far as I know, slave nodes get a map
> and map method contains contents of the input file. This should be fine for
> the map logic to work.
>
>
> with regards
> rabmdu
>
>
>
>
> On Thu, Aug 22, 2013 at 4:40 PM, java8964 java8964 
> wrote:
>
>> If you don't plan to use HDFS, what kind of sharing file system you are
>> going to use between cluster? NFS?
>> For what you want to do, even though it doesn't make too much sense, but
>> you need to the first problem as the shared file system.
>>
>> Second, if you want to process the files file by file, instead of block
>> by block in HDFS, then you need to use the WholeFileInputFormat (google
>> this how to write one). So you don't need a file to list all the files to
>> be processed, just put them into one folder in the sharing file system,
>> then send this folder to your MR job. In this way, as long as each node can
>> access it through some file system URL, each file will be processed in each
>> mapper.
>>
>> Yong
>>
>> --
>> Date: Wed, 21 Aug 2013 17:39:10 +0530
>> Subject: running map tasks in remote node
>> From: rab...@gmail.com
>> To: user@hadoop.apache.org
>>
>>
>> Hello,
>>
>>  Here is the new bie question of the day.
>>
>> For one of my use cases, I want to use hadoop map reduce without HDFS.
>> Here, I will have a text file containing a list of file names to process.
>> Assume that I have 10 lines (10 files to process) in the input text file
>> and I wish to generate 10 map tasks and execute them in parallel in 10
>> nodes. I started with basic tutorial on hadoop and could setup single node
>> hadoop cluster and successfully tested wordcount code.
>>
>> Now, I took two machines A (master) and B (slave). I did the below
>> configuration in these machines to setup a two node cluster.
>>
>> hdfs-site.xml
>>
>> <configuration>
>>   <property>
>>     <name>dfs.replication</name>
>>     <value>1</value>
>>   </property>
>>   <property>
>>     <name>dfs.name.dir</name>
>>     <value>/tmp/hadoop-bala/dfs/name</value>
>>   </property>
>>   <property>
>>     <name>dfs.data.dir</name>
>>     <value>/tmp/hadoop-bala/dfs/data</value>
>>   </property>
>>   <property>
>>     <name>mapred.job.tracker</name>
>>     <value>A:9001</value>
>>   </property>
>> </configuration>
>>
>> mapred-site.xml
>>
>> <configuration>
>>   <property>
>>     <name>mapred.job.tracker</name>
>>     <value>A:9001</value>
>>   </property>
>>   <property>
>>     <name>mapreduce.tasktracker.map.tasks.maximum</name>
>>     <value>1</value>
>>   </property>
>> </configuration>
>>
>> core-site.xml
>>
>> <configuration>
>>   <property>
>>     <name>fs.default.name</name>
>>     <value>hdfs://A:9000</value>
>>   </property>
>> </configuration>
>>
>>
>> In A and B, I do have a file named ‘slaves’ with an entry ‘B’ in it and
>> another file called ‘masters’ wherein an entry ‘A’ is there.
>>
>> I have kept my input file at A. I see the map method process the input
>> file line by line but they are all processed in A. Ideally, I would expect
>> those processing to take place in B.
>>
>> Can anyone highlight where I am going wrong?
>>
>>  regards
>> rab
>>
>
>


RE: running map tasks in remote node

2013-08-23 Thread java8964 java8964
It is possible to do what you are trying to do, but it only makes sense if your
MR job is very CPU intensive and you want to use the CPU resources in your
cluster rather than the IO.

You may want to do some research on HDFS's role in Hadoop. Among other things,
it provides central storage for all the files that will be processed by MR
jobs. If you don't want to use HDFS, you need to identify some storage shared
among all the nodes in your cluster. HDFS is NOT required, but a shared storage
is required in the cluster.

To keep your question simple, let's just use NFS to replace HDFS. That is
feasible for a POC to help you understand how to set it up. Assume you have a
cluster with 3 nodes (one NN, two DNs; the JT runs on the NN and a TT runs on
each DN). Instead of using HDFS, you can use NFS this way:

1) Mount /share_data on both of your data nodes. They need to have the same
mount, so /share_data on each data node points to the same NFS location. It
doesn't matter where you host this NFS share, just make sure each data node
mounts it as the same /share_data.
2) Create a folder under /share_data and put all your data into that folder.
3) When you kick off your MR job, give a full URL for the input path, like
'file:///share_data/myfolder', and a full URL for the output path, like
'file:///share_data/output'. This way, each mapper will understand that it is
reading the data from the local file system instead of HDFS. That is why you
want to make sure each task node has the same mount path, so that
'file:///share_data/myfolder' resolves to the same data on every task node.
Check this and make sure /share_data/myfolder points to the same path on each
of your task nodes.
4) You want each mapper to process one file, so instead of using the default
'TextInputFormat', use a 'WholeFileInputFormat'. This makes sure that every
file under '/share_data/myfolder' is not split and is sent whole to a single
mapper.
5) In the above setup, I don't think you need to start the NameNode or DataNode
processes any more; you only use the JobTracker and TaskTrackers.
6) Obviously, when your data is big, the NFS share will be your bottleneck.
Maybe you can replace it with shared network storage later, but the above setup
gives you a starting point.
7) Keep in mind that with this setup you lose data replication, data locality,
etc. That's why I said it ONLY makes sense if your MR job is CPU intensive. You
simply want to use the mapper/reducer tasks to process your data, rather than
any IO scalability.

Make sense?
Yong
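
For reference, a minimal sketch of the kind of WholeFileInputFormat mentioned
above, written against the new mapreduce API and modeled on the commonly cited
pattern; treat it as illustrative rather than a class that ships with Hadoop
(it also assumes each file fits in memory):

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class WholeFileInputFormat
    extends FileInputFormat<NullWritable, BytesWritable> {

  @Override
  protected boolean isSplitable(JobContext context, Path file) {
    return false;  // never split: one file -> one mapper
  }

  @Override
  public RecordReader<NullWritable, BytesWritable> createRecordReader(
      InputSplit split, TaskAttemptContext context) {
    return new WholeFileRecordReader();
  }

  static class WholeFileRecordReader
      extends RecordReader<NullWritable, BytesWritable> {
    private FileSplit split;
    private Configuration conf;
    private final BytesWritable value = new BytesWritable();
    private boolean processed = false;

    @Override
    public void initialize(InputSplit split, TaskAttemptContext context) {
      this.split = (FileSplit) split;
      this.conf = context.getConfiguration();
    }

    @Override
    public boolean nextKeyValue() throws IOException {
      if (processed) {
        return false;
      }
      // Read the entire file into a single value record.
      byte[] contents = new byte[(int) split.getLength()];
      Path file = split.getPath();
      FileSystem fs = file.getFileSystem(conf);
      FSDataInputStream in = null;
      try {
        in = fs.open(file);
        IOUtils.readFully(in, contents, 0, contents.length);
        value.set(contents, 0, contents.length);
      } finally {
        IOUtils.closeStream(in);
      }
      processed = true;
      return true;
    }

    @Override
    public NullWritable getCurrentKey() { return NullWritable.get(); }

    @Override
    public BytesWritable getCurrentValue() { return value; }

    @Override
    public float getProgress() { return processed ? 1.0f : 0.0f; }

    @Override
    public void close() { }
  }
}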

Date: Fri, 23 Aug 2013 15:43:38 +0530
Subject: Re: running map tasks in remote node
From: rab...@gmail.com
To: user@hadoop.apache.org

Thanks for the reply. 
I am basically exploring possible ways to work with hadoop framework for one of 
my use case. I have my limitations in using hdfs but agree with the fact that 
using map reduce in conjunction with hdfs makes sense.  

I successfully tested wholeFileInputFormat by some googling. 
Now, coming to my use case. I would like to keep some files in my master node 
and want to do some processing in the cloud nodes. The policy does not allow us 
to configure and use cloud nodes as HDFS.  However, I would like to span a map 
process in those nodes. Hence, I set input path as local file system, for 
example, $HOME/inputs. I have a file listing filenames (10 lines) in this input 
directory.  I use NLineInputFormat and span 10 map process. Each map process 
gets a line. The map process will then do a file transfer and process it.  
However, I get an error in the map saying that the FileNotFoundException 
$HOME/inputs. I am sure this directory is present in my master but not in the 
slave nodes. When I copy this input directory to slave nodes, it works fine. I 
am not able to figure out how to fix this and the reason for the error. I am 
not understand why it complains about the input directory is not present. As 
far as I know, slave nodes get a map and map method contains contents of the 
input file. This should be fine for the map logic to work.


with regardsrabmdu



On Thu, Aug 22, 2013 at 4:40 PM, java8964 java8964  wrote:




If you don't plan to use HDFS, what kind of sharing file system you are going 
to use between cluster? NFS?For what you want to do, even though it doesn't 
make too much sense, but you need to the first problem as the shared file 
system.

Second, if you want to process the files file by file, instead of block by 
block in HDFS, then you need to use the WholeFileInputFormat (google this how 
to write one). So you don't need a file to list all the files to be processed, 
just put them into one folder in the sharing file system, then send this folder 
to your MR job. In this way, as long as each node can access it through some 
file system URL, each file will be processed in each mapper.

Yong

Date: Wed, 21 Aug 2013 17:39:10 +0530
Subject: running map tasks in remote node
From: rab

Partitioner vs GroupComparator

2013-08-23 Thread Eugene Morozov
Hello,

I have two different types of keys emerging from Map and processed by
Reduce. These keys have some part in common, and I'd like to have similar
keys in one reducer. For that purpose I used a Partitioner and partition
everything that comes in by this common part. It seems to be fine, but MRUnit
doesn't seem to know anything about Partitioners. So, here is where
GroupComparator comes into play. It seems that MRUnit is well aware of that
one, but it surprises me: it looks like Partitioner and GroupComparator are
actually doing exactly the same thing - they both somehow group keys to have
them in one reducer.
Could you shed some light on it, please.
--


Need Help

2013-08-23 Thread Manish Kumar
Hi All,

I am new to Hadoop technology. I have used it only once, for my BE project, to
create a weblog analyzer. The entire cluster was made up of 7 nodes. I am eager
to know more about this technology and want to build my career in it.

But I am not able to make out:
1. How can I shape myself to proceed further in this field?
2. What other technologies do I need to know to work in this field?
3. What is the future scope in this field?
4. Which technical domain does this technology belong to?

Currently I am working at one of the reputed service-based companies. I can't
reveal the name because of some constraints.
Looking for help.

Thanks & Regards,
Manish


Re: Partitioner vs GroupComparator

2013-08-23 Thread Harsh J
The partitioner runs on the map-end. It assigns a partition ID
(reducer ID) to each key.
The grouping comparator runs on the reduce-end. It helps reducers,
which read off a merge-sorted single file, to understand how to break
the sequential file into reduce calls of <key, (iterable of values)>.

Typically one never overrides the GroupingComparator, and it is
usually the same as the SortComparator. But if you wish to do things
such as Secondary Sort, then overriding this comes useful - cause you
may want to sort over two parts of a key object, but only group by one
part, etc..
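
To make the split of responsibilities concrete, a small sketch for a
hypothetical Text key of the form "base#extension" (class names are
illustrative, not from this thread):

import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;
import org.apache.hadoop.mapreduce.Partitioner;

// Map side: route every key with the same "base" part to the same reducer.
class BasePartitioner extends Partitioner<Text, Text> {
  @Override
  public int getPartition(Text key, Text value, int numPartitions) {
    String base = key.toString().split("#", 2)[0];
    return (base.hashCode() & Integer.MAX_VALUE) % numPartitions;
  }
}

// Reduce side: keys that compare equal here end up in the same reduce() call,
// even though the full keys (base#extension) may differ.
class BaseGroupingComparator extends WritableComparator {
  protected BaseGroupingComparator() {
    super(Text.class, true);
  }

  @Override
  public int compare(WritableComparable a, WritableComparable b) {
    String baseA = a.toString().split("#", 2)[0];
    String baseB = b.toString().split("#", 2)[0];
    return baseA.compareTo(baseB);
  }
}

// Wiring in the driver:
//   job.setPartitionerClass(BasePartitioner.class);
//   job.setGroupingComparatorClass(BaseGroupingComparator.class);
//   // the sort comparator (full-key ordering) is what secondary sort builds on.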

On Fri, Aug 23, 2013 at 8:49 PM, Eugene Morozov
 wrote:
> Hello,
>
> I have two different types of keys emerged from Map and processed by Reduce.
> These keys have some part in common. And I'd like to have similar keys in
> one reducer. For that purpose I used Partitioner and partition everything
> gets in by this common part. It seems to be fine, but MRUnit seems doesn't
> know anything about Partitioners. So, here is where GroupComparator comes
> into play. It seems that MRUnit well aware of the guy, but it surprises me:
> it looks like Partitioner and GroupComparator are actually doing exactly
> same - they both somehow group keys to have them in one reducer.
> Could you shed some light on it, please.
> --
>



-- 
Harsh J


Re: Partitioner vs GroupComparator

2013-08-23 Thread Jan Lukavský

Hi all,

when speaking about this, has anyone ever measured how much more data 
needs to be transferred over the network when using GroupingComparator 
the way Harsh suggests? What do I mean, when you use the 
GroupingComparator, it hides you the real key that you have emitted from 
Mapper. You just see the first key in the reduce group and any data that 
was carried in the key needs to be duplicated in the value in order to 
be accessible on the reduce end.


Let's say you have key consisting of two parts (base, extension), you 
partition by the 'base' part and use GroupingComparator to group keys 
with the same base part. Then you have no other choice than to emit from 
Mapper something like this - (key: (base, extension), value: extension), 
which means the 'extension' part is duplicated in the data, that has to 
be transferred over the network. This overhead can be diminished by 
using compression between map and reduce side, but I believe that in 
some cases this can be significant.


It would be nice if the API allowed to access the 'real' key for each 
value, not only the first key of the reduce group. The only way to get 
rid of this overhead now is by not using the GroupingComparator and 
instead store some internal state in the Reducer class, that is 
persisted across multiple calls to the reduce() method, which in my opinion 
makes using GroupingComparator this way less 'preferred' way of doing 
secondary sort.


Does anyone have any experience with this overhead?

Jan

On 08/23/2013 06:05 PM, Harsh J wrote:

The partitioner runs on the map-end. It assigns a partition ID
(reducer ID) to each key.
The grouping comparator runs on the reduce-end. It helps reducers,
which read off a merge-sorted single file, to understand how to break
the sequential file into reduce calls of .

Typically one never overrides the GroupingComparator, and it is
usually the same as the SortComparator. But if you wish to do things
such as Secondary Sort, then overriding this comes useful - cause you
may want to sort over two parts of a key object, but only group by one
part, etc..

On Fri, Aug 23, 2013 at 8:49 PM, Eugene Morozov
 wrote:

Hello,

I have two different types of keys emerged from Map and processed by Reduce.
These keys have some part in common. And I'd like to have similar keys in
one reducer. For that purpose I used Partitioner and partition everything
gets in by this common part. It seems to be fine, but MRUnit seems doesn't
know anything about Partitioners. So, here is where GroupComparator comes
into play. It seems that MRUnit well aware of the guy, but it surprises me:
it looks like Partitioner and GroupComparator are actually doing exactly
same - they both somehow group keys to have them in one reducer.
Could you shed some light on it, please.
--






Hadoop upgrade

2013-08-23 Thread Viswanathan J
Hi,

We are planning to upgrade our production hdfs cluster from 1.0.4 to 1.2.1

So if I directly upgrade the cluster, it won't affect the edits, fsimage
and checkpoints?

Also, after the upgrade, will it read the blocks and files from the data nodes
properly?

Will a version ID conflict occur with the NN?

Will I lose any data after the upgrade?

Thanks in advance.

Appreciate your response.

-Viswa.J


RE: yarn-site.xml and aux-services

2013-08-23 Thread John Lilley
Are there recommended conventions for adding additional code to a stock Hadoop 
install?
It would be nice if we could piggyback on whatever mechanisms are used to 
distribute hadoop itself around the cluster.
john

From: Vinod Kumar Vavilapalli [mailto:vino...@hortonworks.com]
Sent: Thursday, August 22, 2013 6:25 PM
To: user@hadoop.apache.org
Subject: Re: yarn-site.xml and aux-services


Auxiliary services are essentially administrator-configured services. So, they 
have to be set up at install time - before NM is started.

+Vinod

On Thu, Aug 22, 2013 at 1:38 PM, John Lilley 
mailto:john.lil...@redpoint.net>> wrote:
Following up on this, how exactly does one *install* the jar(s) for auxiliary 
service?  Can it be shipped out with the LocalResources of an AM?
MapReduce's aux-service is presumably installed with Hadoop and is just sitting 
there in the right place, but if one wanted to make a whole new aux-service 
that belonged with an AM, how would one do it?

John

-Original Message-
From: John Lilley 
[mailto:john.lil...@redpoint.net]
Sent: Wednesday, June 05, 2013 11:41 AM
To: user@hadoop.apache.org
Subject: RE: yarn-site.xml and aux-services

Wow, thanks.  Is this documented anywhere other than the code?  I hate to waste 
y'alls time on things that can be RTFMed.
John


-Original Message-
From: Harsh J [mailto:ha...@cloudera.com]
Sent: Wednesday, June 05, 2013 9:35 AM
To: mailto:user@hadoop.apache.org>>
Subject: Re: yarn-site.xml and aux-services

John,

The format is ID and sub-config based:

First, you define an ID as a service, like the string "foo". This is the ID the 
applications may lookup in their container responses map we discussed over 
another thread (around shuffle handler).


<property>
  <name>yarn.nodemanager.aux-services</name>
  <value>foo</value>
</property>

Then you define an actual implementation class for that ID "foo", like so:


<property>
  <name>yarn.nodemanager.aux-services.foo.class</name>
  <value>com.mypack.MyAuxServiceClassForFoo</value>
</property>

If you have multiple services foo and bar, then it would appear like the below 
(comma separated IDs and individual configs):


<property>
  <name>yarn.nodemanager.aux-services</name>
  <value>foo,bar</value>
</property>
<property>
  <name>yarn.nodemanager.aux-services.foo.class</name>
  <value>com.mypack.MyAuxServiceClassForFoo</value>
</property>
<property>
  <name>yarn.nodemanager.aux-services.bar.class</name>
  <value>com.mypack.MyAuxServiceClassForBar</value>
</property>

On Wed, Jun 5, 2013 at 8:42 PM, John Lilley 
mailto:john.lil...@redpoint.net>> wrote:
> Good, I was hoping that would be the case.  But what are the mechanics of it? 
>  Do I just add another entry?  And what exactly is "madreduce.shuffle"?  A 
> scoped class name?  Or a key string into some map elsewhere?
>
> e.g. like:
>
> <property>
> <name>yarn.nodemanager.aux-services</name>
> <value>mapreduce.shuffle</value>
> </property>
> <property>
> <name>yarn.nodemanager.aux-services</name>
> <value>myauxserviceclassname</value>
> </property>
>
> Concerning auxiliary services -- do they communicate with NodeManager via 
> RPC?  Is there an interface to implement?  How are they opened and closed 
> with NodeManager?
>
> Thanks
> John
>
> -Original Message-
> From: Harsh J [mailto:ha...@cloudera.com]
> Sent: Tuesday, June 04, 2013 11:58 PM
> To: mailto:user@hadoop.apache.org>>
> Subject: Re: yarn-site.xml and aux-services
>
> Yes, thats what this is for. You can implement, pass in and use your own 
> AuxService. It needs to be on the NodeManager CLASSPATH to run (and NM has to 
> be restarted to apply).
>
> On Wed, Jun 5, 2013 at 4:00 AM, John Lilley 
> mailto:john.lil...@redpoint.net>> wrote:
>> I notice the yarn-site.xml
>>
>>
>>
>>   <property>
>>     <name>yarn.nodemanager.aux-services</name>
>>     <value>mapreduce.shuffle</value>
>>     <description>shuffle service that needs to be set for Map Reduce
>> to run </description>
>>   </property>
>>
>>
>>
>> Is this a general-purpose hook?
>>
>> Can I tell yarn to run *my* per-node service?
>>
>> Is there some other way (within the recommended Hadoop framework) to
>> run a per-node service that exists during the lifetime of the NodeManager?
>>
>>
>>
>> John Lilley
>>
>> Chief Architect, RedPoint Global Inc.
>>
>> 1515 Walnut Street | Suite 200 | Boulder, CO 80302
>>
>> T: +1 303 541 1516  | M: +1 720 938 
>> 5761 | F: +1 
>> 781-705-2077
>>
>> Skype: jlilley.redpoint | 
>> john.lil...@redpoint.net | 
>> www.redpoint.net
>>
>>
>
>
>
> --
> Harsh J



--
Harsh J



--
+Vinod
Hortonworks Inc.
http://hortonworks.com/



Re: Hadoop upgrade

2013-08-23 Thread Harsh J
Hi,

On Fri, Aug 23, 2013 at 10:05 PM, Viswanathan J
 wrote:
> Hi,
>
> We are planning to upgrade our production hdfs cluster from 1.0.4 to 1.2.1
>
> So if I directly upgrade the cluster, it won't affect the edits, fsimage and
> checkpoints?

Yes, an upgrade won't affect those.

> Also after upgrade is it will read the blocks, files from the data nodes
> properly?

Yes.

> Is the version id conflict occurs with NN?

Not sure what you mean, but there shouldn't be a problem as long as
all service nodes and client nodes are upgraded to the same version.

> Do I need lose the data after upgrade?

You won't lose data after an upgrade.

-- 
Harsh J


Re: Hadoop upgrade

2013-08-23 Thread Viswanathan J
Thanks Harsh for your answers.

I was asking whether the namenode layout version ID will conflict.

-Viswa.J
On Aug 23, 2013 10:12 PM, "Harsh J"  wrote:

> Hi,
>
> On Fri, Aug 23, 2013 at 10:05 PM, Viswanathan J
>  wrote:
> > Hi,
> >
> > We are planning to upgrade our production hdfs cluster from 1.0.4 to
> 1.2.1
> >
> > So if I directly upgrade the cluster, it won't affect the edits, fsimage
> and
> > checkpoints?
>
> Yes, an upgrade won't affect those.
>
> > Also after upgrade is it will read the blocks, files from the data nodes
> > properly?
>
> Yes.
>
> > Is the version id conflict occurs with NN?
>
> Not sure what you mean, but there shouldn't be a problem as long as
> all service nodes and client nodes are upgraded to the same version.
>
> > Do I need lose the data after upgrade?
>
> You won't lose data after an upgrade.
>
> --
> Harsh J
>


Re: yarn-site.xml and aux-services

2013-08-23 Thread Harsh J
The general practice is to install your deps into a custom location
such as /opt/john-jars, and extend YARN_CLASSPATH to include the jars,
while also configuring the classes under the aux-services list. You
need to take care of deploying jar versions to /opt/john-jars/
contents across the cluster though.

I think it may be a neat idea to have jars be placed on HDFS or any
other DFS, and the yarn-site.xml indicating the location plus class to
load. Similar to HBase co-processors. But I'll defer to Vinod on if
this would be a good thing to do.

(I know the right next thing with such an ability people will ask for
is hot-code-upgrades…)

On Fri, Aug 23, 2013 at 10:11 PM, John Lilley  wrote:
> Are there recommended conventions for adding additional code to a stock
> Hadoop install?
>
> It would be nice if we could piggyback on whatever mechanisms are used to
> distribute hadoop itself around the cluster.
>
> john
>
>
>
> From: Vinod Kumar Vavilapalli [mailto:vino...@hortonworks.com]
> Sent: Thursday, August 22, 2013 6:25 PM
>
>
> To: user@hadoop.apache.org
> Subject: Re: yarn-site.xml and aux-services
>
>
>
>
>
> Auxiliary services are essentially administer-configured services. So, they
> have to be set up at install time - before NM is started.
>
>
>
> +Vinod
>
>
>
> On Thu, Aug 22, 2013 at 1:38 PM, John Lilley 
> wrote:
>
> Following up on this, how exactly does one *install* the jar(s) for
> auxiliary service?  Can it be shipped out with the LocalResources of an AM?
> MapReduce's aux-service is presumably installed with Hadoop and is just
> sitting there in the right place, but if one wanted to make a whole new
> aux-service that belonged with an AM, how would one do it?
>
> John
>
>
> -Original Message-
> From: John Lilley [mailto:john.lil...@redpoint.net]
> Sent: Wednesday, June 05, 2013 11:41 AM
> To: user@hadoop.apache.org
> Subject: RE: yarn-site.xml and aux-services
>
> Wow, thanks.  Is this documented anywhere other than the code?  I hate to
> waste y'alls time on things that can be RTFMed.
> John
>
>
> -Original Message-
> From: Harsh J [mailto:ha...@cloudera.com]
> Sent: Wednesday, June 05, 2013 9:35 AM
> To: 
> Subject: Re: yarn-site.xml and aux-services
>
> John,
>
> The format is ID and sub-config based:
>
> First, you define an ID as a service, like the string "foo". This is the ID
> the applications may lookup in their container responses map we discussed
> over another thread (around shuffle handler).
>
> <property>
> <name>yarn.nodemanager.aux-services</name>
> <value>foo</value>
> </property>
>
> Then you define an actual implementation class for that ID "foo", like so:
>
> <property>
> <name>yarn.nodemanager.aux-services.foo.class</name>
> <value>com.mypack.MyAuxServiceClassForFoo</value>
> </property>
>
> If you have multiple services foo and bar, then it would appear like the
> below (comma separated IDs and individual configs):
>
> <property>
> <name>yarn.nodemanager.aux-services</name>
> <value>foo,bar</value>
> </property>
> <property>
> <name>yarn.nodemanager.aux-services.foo.class</name>
> <value>com.mypack.MyAuxServiceClassForFoo</value>
> </property>
> <property>
> <name>yarn.nodemanager.aux-services.bar.class</name>
> <value>com.mypack.MyAuxServiceClassForBar</value>
> </property>
>
> On Wed, Jun 5, 2013 at 8:42 PM, John Lilley 
> wrote:
>> Good, I was hoping that would be the case.  But what are the mechanics of
>> it?  Do I just add another entry?  And what exactly is "madreduce.shuffle"?
>> A scoped class name?  Or a key string into some map elsewhere?
>>
>> e.g. like:
>>
>> <property>
>> <name>yarn.nodemanager.aux-services</name>
>> <value>mapreduce.shuffle</value>
>> </property>
>> <property>
>> <name>yarn.nodemanager.aux-services</name>
>> <value>myauxserviceclassname</value>
>> </property>
>>
>> Concerning auxiliary services -- do they communicate with NodeManager via
>> RPC?  Is there an interface to implement?  How are they opened and closed
>> with NodeManager?
>>
>> Thanks
>> John
>>
>> -Original Message-
>> From: Harsh J [mailto:ha...@cloudera.com]
>> Sent: Tuesday, June 04, 2013 11:58 PM
>> To: 
>> Subject: Re: yarn-site.xml and aux-services
>>
>> Yes, thats what this is for. You can implement, pass in and use your own
>> AuxService. It needs to be on the NodeManager CLASSPATH to run (and NM has
>> to be restarted to apply).
>>
>> On Wed, Jun 5, 2013 at 4:00 AM, John Lilley 
>> wrote:
>>> I notice the yarn-site.xml
>>>
>>>
>>>
>>>   <property>
>>>     <name>yarn.nodemanager.aux-services</name>
>>>     <value>mapreduce.shuffle</value>
>>>     <description>shuffle service that needs to be set for Map Reduce
>>> to run </description>
>>>   </property>
>>>
>>>
>>>
>>> Is this a general-purpose hook?
>>>
>>> Can I tell yarn to run *my* per-node service?
>>>
>>> Is there some other way (within the recommended Hadoop framework) to
>>> run a per-node service that exists during the lifetime of the
>>> NodeManager?
>>>
>>>
>>>
>>> John Lilley
>>>
>>> Chief Architect, RedPoint Global Inc.
>>>
>>> 1515 Walnut Street | Suite 200 | Boulder, CO 80302
>>>
>>> T: +1 303 541 1516  | M: +1 720 938 5761 | F: +1 781-705-2077
>>>
>>> Skype: jlilley.redpoint | john.lil...@redpoint.net | www.redpoint.net
>>>
>>>
>>
>>
>>
>> --
>> Harsh J
>
>
>
> --
> Harsh J
>
>
>
>
> --
> +Vinod
> Hortonworks Inc.
> http://horton

Re: Partitioner vs GroupComparator

2013-08-23 Thread Shahab Yunus
@Jan, why not avoid sending the 'hidden' part of the key as a value? Why not
then pass the value as null, or with some other value part? Then on the
reducer side there is no duplication and you can extract the 'hidden' part of
the key yourself (which should be possible as you will be encapsulating it in
some class/object model)...?

Regards,
Shahab




On Fri, Aug 23, 2013 at 12:22 PM, Jan Lukavský  wrote:

> Hi all,
>
> when speaking about this, has anyone ever measured how much more data
> needs to be transferred over the network when using GroupingComparator the
> way Harsh suggests? What do I mean, when you use the GroupingComparator, it
> hides you the real key that you have emitted from Mapper. You just see the
> first key in the reduce group and any data that was carried in the key
> needs to be duplicated in the value in order to be accessible on the reduce
> end.
>
> Let's say you have key consisting of two parts (base, extension), you
> partition by the 'base' part and use GroupingComparator to group keys with
> the same base part. Than you have no other chance than to emit from Mapper
> something like this - (key: (base, extension), value: extension), which
> means the 'extension' part is duplicated in the data, that has to be
> transferred over the network. This overhead can be diminished by using
> compression between map and reduce side, but I believe that in some cases
> this can be significant.
>
> It would be nice if the API allowed to access the 'real' key for each
> value, not only the first key of the reduce group. The only way to get rid
> of this overhead now is by not using the GroupingComparator and instead
> store some internal state in the Reducer class, that is persisted across
> mutliple calls to reduce() method, which in my opinion makes using
> GroupingComparator this way less 'preferred' way of doing secondary sort.
>
> Does anyone have any experience with this overhead?
>
> Jan
>
>
> On 08/23/2013 06:05 PM, Harsh J wrote:
>
>> The partitioner runs on the map-end. It assigns a partition ID
>> (reducer ID) to each key.
>> The grouping comparator runs on the reduce-end. It helps reducers,
>> which read off a merge-sorted single file, to understand how to break
>> the sequential file into reduce calls of .
>>
>> Typically one never overrides the GroupingComparator, and it is
>> usually the same as the SortComparator. But if you wish to do things
>> such as Secondary Sort, then overriding this comes useful - cause you
>> may want to sort over two parts of a key object, but only group by one
>> part, etc..
>>
>> On Fri, Aug 23, 2013 at 8:49 PM, Eugene Morozov
>>  wrote:
>>
>>> Hello,
>>>
>>> I have two different types of keys emerged from Map and processed by
>>> Reduce.
>>> These keys have some part in common. And I'd like to have similar keys in
>>> one reducer. For that purpose I used Partitioner and partition everything
>>> gets in by this common part. It seems to be fine, but MRUnit seems
>>> doesn't
>>> know anything about Partitioners. So, here is where GroupComparator comes
>>> into play. It seems that MRUnit well aware of the guy, but it surprises
>>> me:
>>> it looks like Partitioner and GroupComparator are actually doing exactly
>>> same - they both somehow group keys to have them in one reducer.
>>> Could you shed some light on it, please.
>>> --
>>>
>>>
>>
>>


Re: Partitioner vs GroupComparator

2013-08-23 Thread Lukavsky, Jan
Hi Shahab, I'm not sure if I understand right, but the problem is that you need 
to put the data you want to secondary sort into your key class. But, what I 
just realized is that the original key probably IS accessible, because of the 
Writable semantics. As you iterate through the Iterable passed to the reduce 
call the Key changes its contents. Am I right? This seems a bit weird but 
probably is how it works. I just overlooked this, because of the way the API 
looks and how all howtos on doing secondary sort look. All I have seen 
duplicate the secondary part of the key in value.

Jan



 Original message 
Subject: Re: Partitioner vs GroupComparator
From: Shahab Yunus 
To: "user@hadoop.apache.org" 
CC:


@Jan, why not, not send the 'hidden' part of the key as a value? Why not then 
pass value as null or with some other value part. So in the reducer side there 
is no duplication and you can extract the 'hidden' part of the key yourself 
(which should be possible as you will be encapsulating it in a some 
class/object model...?

Regards,
Shahab




On Fri, Aug 23, 2013 at 12:22 PM, Jan Lukavský 
mailto:jan.lukav...@firma.seznam.cz>> wrote:
Hi all,

when speaking about this, has anyone ever measured how much more data needs to 
be transferred over the network when using GroupingComparator the way Harsh 
suggests? What do I mean, when you use the GroupingComparator, it hides you the 
real key that you have emitted from Mapper. You just see the first key in the 
reduce group and any data that was carried in the key needs to be duplicated in 
the value in order to be accessible on the reduce end.

Let's say you have key consisting of two parts (base, extension), you partition 
by the 'base' part and use GroupingComparator to group keys with the same base 
part. Than you have no other chance than to emit from Mapper something like 
this - (key: (base, extension), value: extension), which means the 'extension' 
part is duplicated in the data, that has to be transferred over the network. 
This overhead can be diminished by using compression between map and reduce 
side, but I believe that in some cases this can be significant.

It would be nice if the API allowed to access the 'real' key for each value, 
not only the first key of the reduce group. The only


Passing parameters to MapReduce through Oozie

2013-08-23 Thread Shailesh Samudrala
Hello,

I am trying to pass parameters to my MapReduce job from Oozie MapReduce
action. Here's how I'm declaring the parameters in Oozie workflow.xml:


<property>
  <name>param1</name>
  <value>paramValue</value>
</property>


However, I'm not sure if this is the right way. Even if it is, I don't know
how to access those parameters from within my Map-Reduce job.

I tried

configuration.get("param1");

but, it doesn't work.

I would really appreciate some advice on this.

Thank you!

Regards,

Shailesh
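
For reference, a minimal sketch of the reading side, assuming the action's
<configuration> block really does inject the property into the job
configuration (mapper and property names are illustrative):

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class ParamAwareMapper extends Mapper<LongWritable, Text, Text, Text> {

  private String param1;

  @Override
  protected void setup(Context context) {
    // Properties placed in the action's <configuration> element are merged
    // into the job configuration, so they are readable here; a null result
    // means the property never reached this job's configuration.
    param1 = context.getConfiguration().get("param1", "default-value");
  }

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    context.write(new Text(param1), value);
  }
}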


RE: Partitioner vs GroupComparator

2013-08-23 Thread java8964 java8964
As Harsh said, sometimes you want to do a secondary sort, but in MR the data can 
only be sorted by key, not by value.
A lot of the time you want the reducer output sorted by a field, but only within 
a group, somewhat like a 'windowed sort' in relational SQL. For example, if you 
have data about all the employees, you may want the MR job to sort the employees 
by salary, but within each department.
So what do you choose as the key to emit from the Mapper? Department_id? If so, 
it is hard to get the result sorted by salary. Using "Department_id + salary", 
we cannot put all the data from one department into one reducer.
In this case, you separate the way keys are composed from the way they are 
grouped. You still use 'Department_id + salary' as the key, but override the 
GroupComparator to group ONLY by "Department_id", while in the meantime you sort 
the data on both 'Department_id + salary'. The final goal is to make sure that 
all the data for the same department arrives at the same reducer, and when it 
arrives, it is sorted by salary too, by utilizing MR's built-in sort/shuffle 
ability.
Yong
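
A sketch of the composite key for this department/salary example (illustrative
names; assumes integer ids and salaries), together with the grouping comparator
that groups on the department part only:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;

public class DeptSalaryKey implements WritableComparable<DeptSalaryKey> {
  private int departmentId;
  private int salary;

  public DeptSalaryKey() { }

  public DeptSalaryKey(int departmentId, int salary) {
    this.departmentId = departmentId;
    this.salary = salary;
  }

  public int getDepartmentId() { return departmentId; }

  @Override
  public void write(DataOutput out) throws IOException {
    out.writeInt(departmentId);
    out.writeInt(salary);
  }

  @Override
  public void readFields(DataInput in) throws IOException {
    departmentId = in.readInt();
    salary = in.readInt();
  }

  // Sort comparator view: order by department first, then by salary within it.
  @Override
  public int compareTo(DeptSalaryKey o) {
    int c = Integer.compare(departmentId, o.departmentId);
    return c != 0 ? c : Integer.compare(salary, o.salary);
  }

  // Hashing only on departmentId means the default HashPartitioner already
  // sends a whole department to one reducer; a dedicated Partitioner keyed on
  // departmentId works just as well.
  @Override
  public int hashCode() { return departmentId; }
}

// Group only on departmentId, so one reduce() call sees an entire department
// while its keys (and therefore values) arrive in salary order.
class DeptGroupingComparator extends WritableComparator {
  protected DeptGroupingComparator() {
    super(DeptSalaryKey.class, true);
  }

  @Override
  public int compare(WritableComparable a, WritableComparable b) {
    return Integer.compare(((DeptSalaryKey) a).getDepartmentId(),
                           ((DeptSalaryKey) b).getDepartmentId());
  }
}

// Wiring in the driver: job.setGroupingComparatorClass(DeptGroupingComparator.class);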

Date: Fri, 23 Aug 2013 13:06:01 -0400
Subject: Re: Partitioner vs GroupComparator
From: shahab.yu...@gmail.com
To: user@hadoop.apache.org

@Jan, why not, not send the 'hidden' part of the key as a value? Why not then 
pass value as null or with some other value part. So in the reducer side there 
is no duplication and you can extract the 'hidden' part of the key yourself 
(which should be possible as you will be encapsulating it in a some 
class/object model...?

Regards,Shahab




On Fri, Aug 23, 2013 at 12:22 PM, Jan Lukavský  
wrote:

Hi all,



when speaking about this, has anyone ever measured how much more data needs to 
be transferred over the network when using GroupingComparator the way Harsh 
suggests? What do I mean, when you use the GroupingComparator, it hides you the 
real key that you have emitted from Mapper. You just see the first key in the 
reduce group and any data that was carried in the key needs to be duplicated in 
the value in order to be accessible on the reduce end.




Let's say you have key consisting of two parts (base, extension), you partition 
by the 'base' part and use GroupingComparator to group keys with the same base 
part. Than you have no other chance than to emit from Mapper something like 
this - (key: (base, extension), value: extension), which means the 'extension' 
part is duplicated in the data, that has to be transferred over the network. 
This overhead can be diminished by using compression between map and reduce 
side, but I believe that in some cases this can be significant.




It would be nice if the API allowed to access the 'real' key for each value, 
not only the first key of the reduce group. The only way to get rid of this 
overhead now is by not using the GroupingComparator and instead store some 
internal state in the Reducer class, that is persisted across mutliple calls to 
reduce() method, which in my opinion makes using GroupingComparator this way 
less 'preferred' way of doing secondary sort.




Does anyone have any experience with this overhead?



Jan



On 08/23/2013 06:05 PM, Harsh J wrote:


The partitioner runs on the map-end. It assigns a partition ID

(reducer ID) to each key.

The grouping comparator runs on the reduce-end. It helps reducers,

which read off a merge-sorted single file, to understand how to break

the sequential file into reduce calls of .



Typically one never overrides the GroupingComparator, and it is

usually the same as the SortComparator. But if you wish to do things

such as Secondary Sort, then overriding this comes useful - cause you

may want to sort over two parts of a key object, but only group by one

part, etc..



On Fri, Aug 23, 2013 at 8:49 PM, Eugene Morozov

 wrote:


Hello,



I have two different types of keys emerged from Map and processed by Reduce.

These keys have some part in common. And I'd like to have similar keys in

one reducer. For that purpose I used Partitioner and partition everything

gets in by this common part. It seems to be fine, but MRUnit seems doesn't

know anything about Partitioners. So, here is where GroupComparator comes

into play. It seems that MRUnit well aware of the guy, but it surprises me:

it looks like Partitioner and GroupComparator are actually doing exactly

same - they both somehow group keys to have them in one reducer.

Could you shed some light on it, please.

--










  

RE: yarn-site.xml and aux-services

2013-08-23 Thread John Lilley
Harsh,

Thanks for the clarification.  I would find it very convenient in this case to 
have my custom jars available in HDFS, but I can see the added complexity 
needed for YARN to maintain a cache of those on local disk.

What about having the tasks themselves start the per-node service as a child 
process?   I've been told that the NM kills the process group, but won't 
setgrp() circumvent that?  

Even given that, would the child process of one task have proper environment 
and permission to act on behalf of other tasks?  Consider a scenario analogous 
to the MR shuffle, where the persistent service serves up mapper output files 
to the reducers across the network:
1) AM spawns "mapper-like" tasks around the cluster
2) Each mapper-like task on a given node launches a "persistent service" child, 
but only if one is not already running.
3) Each mapper-like task writes one or more output files, and informs the 
service of those files (along with AM-id, Task-id etc).
4) AM spawns "reducer-like" tasks around the cluster.
5) Each reducer-like task is told which nodes contain "mapper" result data, and 
connects to services on those nodes to read the data.

There are some details missing, like how the lifetime of the temporary files is 
controlled to extend beyond the mapper-like task lifetime but still be cleaned 
up on AM exit, and how the reducer-like tasks are informed of which nodes have 
data.

John


-Original Message-
From: Harsh J [mailto:ha...@cloudera.com] 
Sent: Friday, August 23, 2013 11:00 AM
To: 
Subject: Re: yarn-site.xml and aux-services

The general practice is to install your deps into a custom location such as 
/opt/john-jars, and extend YARN_CLASSPATH to include the jars, while also 
configuring the classes under the aux-services list. You need to take care of 
deploying jar versions to /opt/john-jars/ contents across the cluster though.

I think it may be a neat idea to have jars be placed on HDFS or any other DFS, 
and the yarn-site.xml indicating the location plus class to load. Similar to 
HBase co-processors. But I'll defer to Vinod on if this would be a good thing 
to do.

(I know the right next thing with such an ability people will ask for is 
hot-code-upgrades...)

On Fri, Aug 23, 2013 at 10:11 PM, John Lilley  wrote:
> Are there recommended conventions for adding additional code to a 
> stock Hadoop install?
>
> It would be nice if we could piggyback on whatever mechanisms are used 
> to distribute hadoop itself around the cluster.
>
> john
>
>
>
> From: Vinod Kumar Vavilapalli [mailto:vino...@hortonworks.com]
> Sent: Thursday, August 22, 2013 6:25 PM
>
>
> To: user@hadoop.apache.org
> Subject: Re: yarn-site.xml and aux-services
>
>
>
>
>
> Auxiliary services are essentially administer-configured services. So, 
> they have to be set up at install time - before NM is started.
>
>
>
> +Vinod
>
>
>
> On Thu, Aug 22, 2013 at 1:38 PM, John Lilley 
> 
> wrote:
>
> Following up on this, how exactly does one *install* the jar(s) for 
> auxiliary service?  Can it be shipped out with the LocalResources of an AM?
> MapReduce's aux-service is presumably installed with Hadoop and is 
> just sitting there in the right place, but if one wanted to make a 
> whole new aux-service that belonged with an AM, how would one do it?
>
> John
>
>
> -Original Message-
> From: John Lilley [mailto:john.lil...@redpoint.net]
> Sent: Wednesday, June 05, 2013 11:41 AM
> To: user@hadoop.apache.org
> Subject: RE: yarn-site.xml and aux-services
>
> Wow, thanks.  Is this documented anywhere other than the code?  I hate 
> to waste y'alls time on things that can be RTFMed.
> John
>
>
> -Original Message-
> From: Harsh J [mailto:ha...@cloudera.com]
> Sent: Wednesday, June 05, 2013 9:35 AM
> To: 
> Subject: Re: yarn-site.xml and aux-services
>
> John,
>
> The format is ID and sub-config based:
>
> First, you define an ID as a service, like the string "foo". This is 
> the ID the applications may lookup in their container responses map we 
> discussed over another thread (around shuffle handler).
>
> 
> yarn.nodemanager.aux-services
> foo
> 
>
> Then you define an actual implementation class for that ID "foo", like so:
>
> 
> yarn.nodemanager.aux-services.foo.class
> com.mypack.MyAuxServiceClassForFoo
> 
>
> If you have multiple services foo and bar, then it would appear like 
> the below (comma separated IDs and individual configs):
>
> 
> yarn.nodemanager.aux-services
> foo,bar
> 
> 
> yarn.nodemanager.aux-services.foo.class
> com.mypack.MyAuxServiceClassForFoo
> 
> 
> yarn.nodemanager.aux-services.bar.class
> com.mypack.MyAuxServiceClassForBar
> 
>
> On Wed, Jun 5, 2013 at 8:42 PM, John Lilley 
> wrote:
>> Good, I was hoping that would be the case.  But what are the 
>> mechanics of it?  Do I just add another entry?  And what exactly is 
>> "madreduce.shuffle"?
>> A scoped class name?  Or a key string into some map elsewhere?
>>
>> e.g. like:
>>
>> 
>> yarn.node

Re: Partitioner vs GroupComparator

2013-08-23 Thread Shahab Yunus
Jan

" is that you need to put the data you want to secondary sort into your key
class. "
Yes, but then you can also avoid putting the secondary-sort column/data piece in
the value part, and this way there will be no duplication.

" But, what I just realized is that the original key probably IS
accessible, because of the Writable semantics. As you iterate through the
Iterable passed to the reduce call the Key changes its contents. Am I
right? "

Yes.

"all howtos on doing secondary sort look. All I have seen duplicate the
secondary part of the key in value."

Check this link below where 'null' value is being passed because that has
already been captured as part of the key due to secondary sort requirements.
http://www.javacodegeeks.com/2013/01/mapreduce-algorithms-secondary-sorting.html
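
In code form, the no-duplication variant looks roughly like this (a sketch, not
taken from the linked article; it assumes Text keys of the form
"base<TAB>extension", NullWritable values, and a partitioner plus grouping
comparator that both key on the base part):

import java.io.IOException;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Sketch of the reduce side when the whole (base, extension) pair rides in the
// key and the value is NullWritable, so the secondary-sort field is not
// duplicated into the value payload. Because Hadoop reuses the key object and
// refreshes its contents on every step of the values iterator, the extension
// can be read back off the key itself.
public class SecondarySortReducer
    extends Reducer<Text, NullWritable, Text, NullWritable> {

  @Override
  protected void reduce(Text key, Iterable<NullWritable> values, Context ctx)
      throws IOException, InterruptedException {
    for (NullWritable ignored : values) {
      String[] parts = key.toString().split("\t", 2);   // "base<TAB>extension"
      String base = parts[0];
      String extension = parts.length > 1 ? parts[1] : "";
      // extensions arrive here in sorted order for each base
      ctx.write(new Text(base + ":" + extension), NullWritable.get());
    }
  }
}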

Regards,
Shahab




On Fri, Aug 23, 2013 at 1:34 PM, Lukavsky, Jan  wrote:

>  Hi Shahab, I'm not sure if I understand right, but the problem is that
> you need to put the data you want to secondary sort into your key class.
> But, what I just realized is that the original key probably IS accessible,
> because of the Writable semantics. As you iterate through the Iterable
> passed to the reduce call the Key changes its contents. Am I right? This
> seems a bit weird but probably is how it works. I just overlooked this,
> because of the way the API looks and how all howtos on doing secondary sort
> look. All I have seen duplicate the secondary part of the key in value.
>
>  Jan
>
>
>
>  Original message 
> Subject: Re: Partitioner vs GroupComparator
> From: Shahab Yunus 
> To: "user@hadoop.apache.org" 
> CC:
>
>
> @Jan, why not, not send the 'hidden' part of the key as a value? Why not
> then pass value as null or with some other value part. So in the reducer
> side there is no duplication and you can extract the 'hidden' part of the
> key yourself (which should be possible as you will be encapsulating it in a
> some class/object model...?
>
>  Regards,
> Shahab
>
>
>
>
> On Fri, Aug 23, 2013 at 12:22 PM, Jan Lukavský <
> jan.lukav...@firma.seznam.cz> wrote:
>
>> Hi all,
>>
>> when speaking about this, has anyone ever measured how much more data
>> needs to be transferred over the network when using GroupingComparator the
>> way Harsh suggests? What do I mean, when you use the GroupingComparator, it
>> hides you the real key that you have emitted from Mapper. You just see the
>> first key in the reduce group and any data that was carried in the key
>> needs to be duplicated in the value in order to be accessible on the reduce
>> end.
>>
>> Let's say you have key consisting of two parts (base, extension), you
>> partition by the 'base' part and use GroupingComparator to group keys with
>> the same base part. Then you have no other choice than to emit from Mapper
>> something like this - (key: (base, extension), value: extension), which
>> means the 'extension' part is duplicated in the data, that has to be
>> transferred over the network. This overhead can be diminished by using
>> compression between map and reduce side, but I believe that in some cases
>> this can be significant.
>>
>> It would be nice if the API allowed to access the 'real' key for each
>> value, not only the first key of the reduce group. The only
>
>


Re: Partitioner vs GroupComparator

2013-08-23 Thread Lukavsky, Jan
Hi Shahab,

thanks, I just missed the fact that the key gets updated while iterating the 
values. Even after working with Hadoop for three years, there is always something 
that can surprise you. :-)

Cheers,
 Jan



 Original message 
Subject: Re: Partitioner vs GroupComparator
From: Shahab Yunus 
To: "user@hadoop.apache.org" 
CC:


Jan

" is that you need to put the data you want to secondary sort into your key 
class. "
Yes, but then you can also avoid putting the secondary-sort column/data piece in the 
value part, and this way there will be no duplication.

" But, what I just realized is that the original key probably IS accessible, 
because of the Writable semantics. As you iterate through the Iterable passed 
to the reduce call the Key changes its contents. Am I right? "

Yes.

"all howtos on doing secondary sort look. All I have seen duplicate the 
secondary part of the key in value."

Check this link below where 'null' value is being passed because that has 
already been captured as part of the key due to secondary sort requirements.
http://www.javacodegeeks.com/2013/01/mapreduce-algorithms-secondary-sorting.html

Regards,
Shahab




On Fri, Aug 23, 2013 at 1:34 PM, Lukavsky, Jan 
mailto:jan.lukav...@firma.seznam.cz>> wrote:
Hi Shahab, I'm not sure if I understand right, but the problem is that you need 
to put the data you want to secondary sort into your key class. But, what I 
just realized is that the original key probably IS accessible, because of the 
Writable semantics. As you iterate through the Iterable passed to the reduce 
call the Key changes its contents. Am I right? This seems a bit weird but 
probably is how it works. I just overlooked this, because of the way the API 
looks and how all howtos on doing secondary sort look. All I have seen 
duplicate the secondary part of the key in value.

Jan



 Original message 
Subject: Re: Partitioner vs GroupComparator
From: Shahab Yunus mailto:shahab.yu...@gmail.com>>
To: "user@hadoop.apache.org" 
mailto:user@hadoop.apache.org>>
CC:


@Jan, why not, not send the 'hidden' part of the key as a value? Why not then 
pass value as null or with some other value part. So in the reducer side there 
is no duplication and you can extract the 'hidden' part of the key yourself 
(which should be possible as you will be encapsulating it in a some 
class/object model...?

Regards,
Shahab




On Fri, Aug 23, 2013 at 12:22 PM, Jan Lukavský 
mailto:jan.lukav...@firma.seznam.cz>> wrote:
Hi all,

when speaking about this, has anyone ever measured how much more data needs to 
be transferred over the network when using GroupingComparator the way Harsh 
suggests? What do I mean, when you use the GroupingComparator, it hides you the 
real key that you have emitted from Mapper. You just see the first key in the 
reduce group and any data that was carried in the key needs to be duplicated in 
the value in order to be accessible on the reduce end.

Let's say you have key consisting of two parts (base, extension), you partition 
by the 'base' part and use GroupingComparator to group keys with the same base 
part. Then you have no other choice than to emit from Mapper something like 
this - (key: (base, extension), value: extension), which means the 'extension' 
part is duplicated in the data, that has to be transferred over the network. 
This overhead can be diminished by using compression between map and reduce 
side, but I believe that in some cases this can be significant.

It would be nice if the API allowed to access the 'real' key for each value, 
not only the first key of the reduce group. The only



Getting HBaseStorage() to work in Pig

2013-08-23 Thread Botelho, Andrew
I am trying to use the function HBaseStorage() in my Pig code in order to load 
an HBase table into Pig.

When I run my code, I get this error:

ERROR 2998: Unhandled internal error. 
org/apache/hadoop/hbase/filter/WritableByteArrayComparable


I believe the PIG_CLASSPATH needs to be extended to include the classpath for 
loading HBase, but I am not sure how to do this.  I've tried several export 
commands in the unix shell to change the PIG_CLASSPATH, but nothing seems to be 
working.

Any advice would be much appreciated.

Thanks,

Andrew Botelho



Re: Getting HBaseStorage() to work in Pig

2013-08-23 Thread Ted Yu
Please look at the example in 15.1.1 under
http://hbase.apache.org/book.html#tools


On Fri, Aug 23, 2013 at 1:41 PM, Botelho, Andrew wrote:

> I am trying to use the function HBaseStorage() in my Pig code in order to
> load an HBase table into Pig.
>
>  
>
> When I run my code, I get this error:
>
>  
>
> ERROR 2998: Unhandled internal error.
> org/apache/hadoop/hbase/filter/WritableByteArrayComparable
>
>  
>
>  
>
> I believe the PIG_CLASSPATH needs to be extended to include the classpath
> for loading HBase, but I am not sure how to do this.  I've tried several
> export commands in the unix shell to change the PIG_CLASSPATH, but nothing
> seems to be working.
>
>  
>
> Any advice would be much appreciated.
>
>  
>
> Thanks,
>
>  
>
> Andrew Botelho
>
> ** **
>


RE: Getting HBaseStorage() to work in Pig

2013-08-23 Thread Botelho, Andrew
Could you explain what is going on here:
HADOOP_CLASSPATH=`${HBASE_HOME}/bin/hbase classpath` ${HADOOP_HOME}/bin/hadoop 
jar ${HBASE_HOME}/hbase-VERSION.jar

I'm not a Unix expert by any means.
How can I use this to enable HBaseStorage() in Pig?

Thanks,

Andrew

From: Ted Yu [mailto:yuzhih...@gmail.com]
Sent: Friday, August 23, 2013 4:50 PM
To: common-u...@hadoop.apache.org
Subject: Re: Getting HBaseStorage() to work in Pig

Please look at the example in 15.1.1 under 
http://hbase.apache.org/book.html#tools

On Fri, Aug 23, 2013 at 1:41 PM, Botelho, Andrew 
mailto:andrew.bote...@emc.com>> wrote:
I am trying to use the function HBaseStorage() in my Pig code in order to load 
an HBase table into Pig.

When I run my code, I get this error:

ERROR 2998: Unhandled internal error. 
org/apache/hadoop/hbase/filter/WritableByteArrayComparable


I believe the PIG_CLASSPATH needs to be extended to include the classpath for 
loading HBase, but I am not sure how to do this.  I've tried several export 
commands in the unix shell to change the PIG_CLASSPATH, but nothing seems to be 
working.

Any advice would be much appreciated.

Thanks,

Andrew Botelho




Re: Getting HBaseStorage() to work in Pig

2013-08-23 Thread Shahab Yunus
Here you are running multiple UNIX commands, and the end result is to run
hbase-VERSION.jar using hadoop's *jar* command. So
basically you add the HBase jars to the classpath of your Hadoop environment
and then execute HBase tools using hadoop. If you get the message
specified in the doc, it means that you have successfully added the HBase
libs to your Hadoop setup.

First, you are setting your HADOOP_CLASSPATH by assigning it the classpath
of your HBase libs, obtained by executing the following command:
*`${HBASE_HOME}/bin/hbase classpath`*

Any command within backticks in a unix shell is executed and its output is
assigned to your HADOOP_CLASSPATH. Note that the 'classpath' argument
to the bin/hbase executable returns the classpath of your HBase
setup.

Then, with
*${HADOOP_HOME}/bin/hadoop jar ${HBASE_HOME}/hbase-VERSION.jar*

you are running the hadoop executable with the 'jar' command (
http://hadoop.apache.org/docs/r1.0.4/commands_manual.html#jar). The 'jar'
command takes a jar as its argument, and here you are passing it the main HBase
jar to run. Note that there are 2 parts here. The main command is:
*${HADOOP_HOME}/bin/hadoop jar*

and the argument, the jar file, is:
*${HBASE_HOME}/hbase-VERSION.jar*

${VAR} is the shell convention for referring to variables.

If you get the message from the docs, it means you have set your HBase jars
on your Hadoop classpath correctly.
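
For the original Pig question, the same idea carries over to PIG_CLASSPATH:
put the HBase classpath on it before launching Pig. A rough sketch (the
HBASE_HOME path and the script name are assumptions; adjust to your install):

# Hypothetical paths -- adjust to your installation.
export HBASE_HOME=/usr/lib/hbase
export PIG_CLASSPATH="`${HBASE_HOME}/bin/hbase classpath`:${PIG_CLASSPATH}"

# Now HBaseStorage() should be able to find the HBase and ZooKeeper classes.
pig my_script_using_hbasestorage.pig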

Regards,
Shahab


On Fri, Aug 23, 2013 at 4:57 PM, Botelho, Andrew wrote:

> Could you explain what is going on here:
>
> HADOOP_CLASSPATH=`${HBASE_HOME}/bin/hbase classpath`
> ${HADOOP_HOME}/bin/hadoop jar ${HBASE_HOME}/hbase-VERSION.jar
>
> ** **
>
> I’m not a Unix expert by any means.
>
> How can I use this to enable HBaseStorage() in Pig?
>
> ** **
>
> Thanks,
>
> ** **
>
> Andrew
>
> ** **
>
> *From:* Ted Yu [mailto:yuzhih...@gmail.com]
> *Sent:* Friday, August 23, 2013 4:50 PM
> *To:* common-u...@hadoop.apache.org
> *Subject:* Re: Getting HBaseStorage() to work in Pig
>
> ** **
>
> Please look at the example in 15.1.1 under
> http://hbase.apache.org/book.html#tools
>
> ** **
>
> On Fri, Aug 23, 2013 at 1:41 PM, Botelho, Andrew 
> wrote:
>
> I am trying to use the function HBaseStorage() in my Pig code in order to
> load an HBase table into Pig.
>
>  
>
> When I run my code, I get this error:
>
>  
>
> ERROR 2998: Unhandled internal error.
> org/apache/hadoop/hbase/filter/WritableByteArrayComparable
>
>  
>
>  
>
> I believe the PIG_CLASSPATH needs to be extended to include the classpath
> for loading HBase, but I am not sure how to do this.  I've tried several
> export commands in the unix shell to change the PIG_CLASSPATH, but nothing
> seems to be working.
>
>  
>
> Any advice would be much appreciated.
>
>  
>
> Thanks,
>
>  
>
> Andrew Botelho
>
>  
>
> ** **
>


CVE-2013-2192: Apache Hadoop Man in the Middle Vulnerability

2013-08-23 Thread Aaron T. Myers
Hello,

Please see below for the official announcement of a serious security
vulnerability which has been discovered and subsequently fixed in Apache
Hadoop releases.

Best,
Aaron

-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

CVE-2013-2192: Apache Hadoop Man in the Middle Vulnerability

Severity: Severe

Vendor: The Apache Software Foundation

Versions Affected:
All versions of Hadoop 2.x prior to Hadoop 2.0.6-alpha.
All versions of Hadoop 0.23.x prior to Hadoop 0.23.9.
All versions of Hadoop 1.x prior to Hadoop 1.2.1.

Users affected: Users who have enabled Hadoop's Kerberos security features.

Impact: RPC traffic from clients, potentially including authentication
credentials, may be intercepted by a malicious user with access to run
tasks or containers on a cluster.

Description:
The Apache Hadoop RPC protocol is intended to provide bidirectional
authentication between clients and servers. However, a malicious server or
network attacker can unilaterally disable these authentication checks. This
allows for potential reduction in the configured quality of protection of
the RPC traffic, and privilege escalation if authentication credentials are
passed over RPC.

Mitigation:
Users of Hadoop 1.x versions should immediately upgrade to 1.2.1 or later.
Users of Hadoop 0.23.x versions should immediately upgrade to 0.23.9 or
later.
Users of Hadoop 2.x versions prior to 2.0.6-alpha should immediately
upgrade to 2.0.6-alpha or later.

Credit: This issue was discovered by Kyle Leckie of Microsoft and Aaron T.
Myers of Cloudera.
-BEGIN PGP SIGNATURE-
Version: GnuPG v1.4.11 (GNU/Linux)

iQEcBAEBAgAGBQJSF84CAAoJECEaGfB4kTjfI7kH/0v4JJ992vGV4esnAKgNnTmn
A7GCj2zT7KFgF7ii6G6+5Xny9AnISTZWfMII/Szs5qaFgiaByvsNR5FoN+o5BS8s
vPWU8v5f3/cayacQgl8vxUiTlkXYZWQX+3V+8RTqAR3fPsr9IUMse4hOEcXvAjHr
 gDeWKiQaXRRhVjfmTLll1OWuKT8PmVar3qcbsg3vo/tj/yjOoVEfhV3DMOdIi+ES
pWtTxs5/fB8t+wA4hdY1r6trE7X6fys9NYC11jp83ej+ecjnHy7kmKGl41WESD+G
GOhAPYCMS9D29KGs2c6q0xCqi22R0klTs9d3Z/f7F5htGfBSAfAOpC6xPJ66/ZY=
 =4+in
-END PGP SIGNATURE-


Writable readFields question

2013-08-23 Thread Ken Sullivan
For my application I'm decoding data in readFields() of non-predetermined
length.  I've found that parsing for "4" (ASCII End Of Transmission) or "-1"
tends to mark the end of the data stream.  Is this reliable, or is there a
better way?

Thanks,
Ken


How to specify delimiter in Hive select query

2013-08-23 Thread Shailesh Samudrala
I'm querying a Hive table on my cluster (*select * from <table>;*)
and writing the select output to an output file using (*INSERT OVERWRITE
DIRECTORY*). When I open the output file, I see that the columns are
delimited by Hive's default delimiter (*^A or ctrl-A*).

So my question is, is there a way I can select data from the table and use
a tab or space delimiter?

Thanks a lot for your help!

Shailesh


Re: How to specify delimiter in Hive select query

2013-08-23 Thread Jagat Singh
This is not possible up to Hive 0.10; in Hive 0.11 there is a patch to do this.

You can insert into some temp table with the required delimiter, or use some Pig
action to replace the delimiter afterwards.

Or, best, use Hive 0.11.
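
A sketch of both routes (table and column names are made up; the 0.11 syntax
follows the patch mentioned above, so check it against your exact version):

-- Hive 0.10 workaround: stage the result in a table whose storage format
-- already uses the delimiter you want, then export that table's files.
CREATE TABLE export_tmp (col1 STRING, col2 STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;

INSERT OVERWRITE TABLE export_tmp
SELECT col1, col2 FROM source_table;

-- Hive 0.11+: the delimiter can be given on the export itself.
INSERT OVERWRITE DIRECTORY '/tmp/export'
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
SELECT col1, col2 FROM source_table;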
On 24/08/2013 10:45 AM, "Shailesh Samudrala"  wrote:

> I'm querying a Hive table on my cluster, (*select * from ;*)
> and writing this select output to output file using (*INSERT OVERWRITE
> DIRECTORY*). When the I open the output file, I see that the columns are
> delimited by Hive's default delimiter (*^A or ctrl-A*).
>
> So my question is, is there a way I can select data from the table and use
> a tab or space delimiter?
>
> Thanks a lot for your help!
>
> Shailesh
>


Re: Passing parameters to MapReduce through Oozie

2013-08-23 Thread Harsh J
If you use mapreduce-action, then is the configuration lookup done
within a Mapper/Reducer? If so, how are you grabbing the configuration
object? Via the overridden "configure(JobConf conf)" method?
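
For reference, a minimal sketch of the usual pattern on the old (mapred) API,
which the Oozie map-reduce action uses by default: the <property> entries from
the action's <configuration> block are merged into the launched job's
configuration, so they can be read in configure(). Class names and output types
below are made up for illustration.

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

// Old ("mapred") API: the property set in the Oozie action's <configuration>
// block arrives in the JobConf handed to configure().
public class ParamMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, Text> {

  private String param1;

  @Override
  public void configure(JobConf conf) {
    param1 = conf.get("param1");            // "paramValue" from workflow.xml
  }

  @Override
  public void map(LongWritable key, Text value,
                  OutputCollector<Text, Text> out, Reporter reporter)
      throws IOException {
    out.collect(new Text(param1), value);   // use the parameter however needed
  }
}

// New ("mapreduce") API equivalent, if mapred.mapper.new-api is set:
//   protected void setup(Context ctx) {
//     String param1 = ctx.getConfiguration().get("param1");
//   }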

On Fri, Aug 23, 2013 at 11:07 PM, Shailesh Samudrala
 wrote:
> Hello,
>
> I am trying to pass parameters to my MapReduce job from Oozie MapReduce
> action. Here's how I'm declaring the parameters in Oozie workflow.xml:
>
> 
>  param1
>  paramValue
> 
>
>
> However, I'm not sure if this is the right way. Even if it is, I don't know
> how to access those parameters from within my Map-Reduce job.
>
> I tried
>
> configuration.get("param1");
>
> but, it doesn't work.
>
> I would really appreciate some advise on this.
>
> Thank you!
>
> Regards,
>
> Shailesh



-- 
Harsh J


io.file.buffer.size different when not running in proper bash shell?

2013-08-23 Thread Nathan Grice
Thanks in advance for any help. I have been banging my head against the
wall on this one all day.
When I run the cmd:
hadoop fs -put /path/to/input /path/in/hdfs from the command line, the
hadoop shell dutifully copies my entire file correctly, no matter the size.


I wrote a webservice client for an external service in Python and I am
simply trying to replicate the same command after retrieving some CSV
delimited results from the webservice.

import subprocess

# LOG is the application's existing logger
cmd = ['hadoop', 'fs', '-put', '/path/to/input/', '/path/in/hdfs/']
p = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE,
                     bufsize=256 * 1024 * 1024)
output, errors = p.communicate()
if p.returncode:
    raise OSError(errors)
else:
    LOG.info(output)

without fail the hadoop shell only writes the first 4096 bytes of the input
file (which according to the documentation is the default value for
io.file.buffer.size)

I have tried almost everything including adding
-Dio.file.buffer.size=XX where XX is a really big number and
NOTHING seems to work.

Please help!


Re: Writable readFields question

2013-08-23 Thread Harsh J
When you're encoding/write()-ing the writable, do you not know the
length? If you do, store the length first, and you can solve your
problem?
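
A sketch of that length-prefix approach (class and field names are made up):

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Writable;

// Length-prefixed payload: write() records how many bytes follow, so
// readFields() knows exactly how much to read back. No sentinel bytes such as
// EOT (4) or -1 are needed, and binary payloads that happen to contain those
// byte values are handled correctly.
public class PayloadWritable implements Writable {

  private byte[] payload = new byte[0];

  @Override
  public void write(DataOutput out) throws IOException {
    out.writeInt(payload.length);  // length first
    out.write(payload);            // then the raw bytes
  }

  @Override
  public void readFields(DataInput in) throws IOException {
    int len = in.readInt();        // read the length back
    payload = new byte[len];
    in.readFully(payload);         // then exactly that many bytes
  }

  public void set(byte[] bytes) { payload = bytes; }
  public byte[] get()           { return payload; }
}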

On Sat, Aug 24, 2013 at 3:58 AM, Ken Sullivan  wrote:
> For my application I'm decoding data in readFields() of non-predetermined
> length.  I've found parsing for "4" (ASCII End Of Transmission) or "-1" tend
> to mark the end of the data stream.  Is this reliable, or is there a better
> way?
>
> Thanks,
> Ken



-- 
Harsh J


Re: Hadoop upgrade

2013-08-23 Thread Viswanathan J
Thanks Harsh. If I upgrade the cluster to the Apache Hadoop release, can I simply install or
untar the stable release directly, and will that work?

Or do I need to run the -upgrade and -finalizeUpgrade commands?
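
(For context, the commands in question refer to the usual 1.x sequence; this is
only a sketch, assuming a default layout: back up dfs.name.dir first and adapt
paths to your install.)

# After backing up dfs.name.dir and installing the new version:
$HADOOP_HOME/bin/start-dfs.sh -upgrade            # NN converts its metadata layout

# Check progress until the upgrade is complete:
$HADOOP_HOME/bin/hadoop dfsadmin -upgradeProgress status

# Once everything checks out (rollback is no longer possible afterwards):
$HADOOP_HOME/bin/hadoop dfsadmin -finalizeUpgrade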

Thanks,
Viswa.J
On Aug 23, 2013 10:26 PM, "Harsh J"  wrote:

> CDH4 is perfectly stable and is in use in several production
> installations worldwide.
>
> On Fri, Aug 23, 2013 at 10:19 PM, Viswanathan J
>  wrote:
> > Thanks Harsh, I'm looking for stable release.
> >
> > On Aug 23, 2013 10:10 PM, "Harsh J"  wrote:
> >>
> >> I'd encourage upgrading to a 2.x distro such as CDH4 instead. The
> >> feature set and performance of 1.x is far too old (2+ years old) now
> >> :)
> >>
> >> On Fri, Aug 23, 2013 at 10:05 PM, Viswanathan J
> >>  wrote:
> >> > Hi,
> >> >
> >> > We are planning to upgrade our production hdfs cluster from 1.0.4 to
> >> > 1.2.1
> >> >
> >> > So if I directly upgrade the cluster, it won't affect the edits,
> fsimage
> >> > and
> >> > checkpoints?
> >> >
> >> > Also after upgrade is it will read the blocks, files from the data
> nodes
> >> > properly?
> >> >
> >> > Is the version id conflict occurs with NN?
> >> >
> >> > Do I need lose the data after upgrade?
> >> >
> >> > Thanks in advance.
> >> >
> >> > Appreciate your response.
> >> >
> >> > -Viswa.J
> >> >
> >> > --
> >> >
> >> > ---
> >> > You received this message because you are subscribed to the Google
> >> > Groups
> >> > "CDH Users" group.
> >> > To unsubscribe from this group and stop receiving emails from it, send
> >> > an
> >> > email to cdh-user+unsubscr...@cloudera.org.
> >> > For more options, visit
> >> > https://groups.google.com/a/cloudera.org/groups/opt_out.
> >>
> >>
> >>
> >> --
> >> Harsh J
> >>
> >> --
> >>
> >> ---
> >> You received this message because you are subscribed to the Google
> Groups
> >> "CDH Users" group.
> >> To unsubscribe from this group and stop receiving emails from it, send
> an
> >> email to cdh-user+unsubscr...@cloudera.org.
> >> For more options, visit
> >> https://groups.google.com/a/cloudera.org/groups/opt_out.
> >
> > --
> >
> > ---
> > You received this message because you are subscribed to the Google Groups
> > "CDH Users" group.
> > To unsubscribe from this group and stop receiving emails from it, send an
> > email to cdh-user+unsubscr...@cloudera.org.
> > For more options, visit
> > https://groups.google.com/a/cloudera.org/groups/opt_out.
>
>
>
> --
> Harsh J
>
> --
>
> ---
> You received this message because you are subscribed to the Google Groups
> "CDH Users" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to cdh-user+unsubscr...@cloudera.org.
> For more options, visit
> https://groups.google.com/a/cloudera.org/groups/opt_out.
>


Pig upgrade

2013-08-23 Thread Viswanathan J
Hi,

I'm planning to upgrade Pig from version 0.8.0 to 0.11.0; I hope this is a
stable release.

So what are the improvements, key features, and other benefits of
upgrading?

Thanks,
Viswa.J


Re: Pig upgrade

2013-08-23 Thread Harsh J
Apache Pig's own list at u...@pig.apache.org is the right place
to ask this.

On Sat, Aug 24, 2013 at 10:22 AM, Viswanathan J
 wrote:
> Hi,
>
> I'm planning to upgrade pig version from 0.8.0 to 0.11.0, hope this is
> stable release.
>
> So what are the improvements, key features, benefits, advantages by
> upgrading this?
>
> Thanks,
> Viswa.J



-- 
Harsh J


Re: Pig upgrade

2013-08-23 Thread Viswanathan J
Thanks a lot.
On Aug 24, 2013 10:38 AM, "Harsh J"  wrote:

> The Apache Pig's own lists at u...@pig.apache.org is the right place
> to ask this.
>
> On Sat, Aug 24, 2013 at 10:22 AM, Viswanathan J
>  wrote:
> > Hi,
> >
> > I'm planning to upgrade pig version from 0.8.0 to 0.11.0, hope this is
> > stable release.
> >
> > So what are the improvements, key features, benefits, advantages by
> > upgrading this?
> >
> > Thanks,
> > Viswa.J
>
>
>
> --
> Harsh J
>