Re: Passing Command-line Parameters to the Job Submit Command

2012-09-25 Thread Mohit Anchlia
You could always write your own properties file and read it as a resource. On Tue, Sep 25, 2012 at 12:10 AM, Hemanth Yamijala wrote: > By java environment variables, do you mean the ones passed as > -Dkey=value ? That's one way of passing them. I suppose another way is > to have a client side site
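A minimal sketch of the two approaches mentioned above: reading a -Dkey=value parameter from the job Configuration via ToolRunner, and loading a bundled properties file from the classpath. The property and file names are illustrative, not part of the original thread.

  import java.io.InputStream;
  import java.util.Properties;
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.conf.Configured;
  import org.apache.hadoop.util.Tool;
  import org.apache.hadoop.util.ToolRunner;

  public class ParamDemo extends Configured implements Tool {
    @Override
    public int run(String[] args) throws Exception {
      // Values passed as -Dmy.param=value on the command line end up in the
      // job Configuration when launched through ToolRunner/GenericOptionsParser.
      Configuration conf = getConf();
      System.out.println("my.param = " + conf.get("my.param", "not set"));

      // Alternative: bundle a properties file in the job jar and read it as a
      // classpath resource (file name is hypothetical).
      Properties props = new Properties();
      InputStream in = getClass().getResourceAsStream("/job.properties");
      if (in != null) {
        props.load(in);
        in.close();
      }
      System.out.println("from properties file: " + props.getProperty("my.param"));
      return 0;
    }

    public static void main(String[] args) throws Exception {
      System.exit(ToolRunner.run(new Configuration(), new ParamDemo(), args));
    }
  }

Invoked, for example, as: hadoop jar myjob.jar ParamDemo -Dmy.param=value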

Re: Number of Maps running more than expected

2012-08-16 Thread Mohit Anchlia
It would be helpful to see some statistics out of both jobs, like bytes read, bytes written, number of errors, etc. On Thu, Aug 16, 2012 at 8:02 PM, Raj Vishwanathan wrote: > You probably have speculative execution on. Extra maps and reduce tasks > are run in case some of them fail > > Raj > > > Sent

Re: Local jobtracker in test env?

2012-08-07 Thread Mohit Anchlia
rocesses run everything is pipelined in the same process on the local file system. > A JobTracker (via MiniMRCluster) is only required for simulating > distributed tests. > > On Wed, Aug 8, 2012 at 2:27 AM, Mohit Anchlia > wrote: > > I just wrote a test where fs.default.name is fi

Local jobtracker in test env?

2012-08-07 Thread Mohit Anchlia
I just wrote a test where fs.default.name is file:/// and mapred.job.tracker is set to local. The test ran fine, and I also see that the mapper and reducer were invoked, but what I am trying to understand is how this ran without specifying the job tracker port, and which port the task tracker connected with
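A minimal sketch of a test-style driver under those same two settings, assuming the old mapred API: with fs.default.name set to file:/// and mapred.job.tracker set to local, the job runs inside the client JVM via the LocalJobRunner, so no JobTracker or TaskTracker ports are involved. Paths and the mapper are illustrative.

  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapred.FileInputFormat;
  import org.apache.hadoop.mapred.FileOutputFormat;
  import org.apache.hadoop.mapred.JobClient;
  import org.apache.hadoop.mapred.JobConf;
  import org.apache.hadoop.mapred.lib.IdentityMapper;

  public class LocalRunnerTest {
    public static void main(String[] args) throws Exception {
      JobConf conf = new JobConf(LocalRunnerTest.class);
      // Local filesystem instead of HDFS, LocalJobRunner instead of a
      // JobTracker: everything executes in this JVM, no daemon ports.
      conf.set("fs.default.name", "file:///");
      conf.set("mapred.job.tracker", "local");

      conf.setMapperClass(IdentityMapper.class);
      conf.setNumReduceTasks(0);
      conf.setOutputKeyClass(LongWritable.class);
      conf.setOutputValueClass(Text.class);

      FileInputFormat.setInputPaths(conf, new Path("target/test-input"));
      FileOutputFormat.setOutputPath(conf, new Path("target/test-output"));
      JobClient.runJob(conf);
    }
  }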

Re: Setting Configuration for local file:///

2012-08-07 Thread Mohit Anchlia
conf = new JobConf(getConf()) and I don't pass in any configuration, then is the data from the xml files on the path used? I want this to work for all the scenarios. > > On Wed, Aug 8, 2012 at 1:10 AM, Mohit Anchlia > wrote: > > I am trying to write a test on local file system but t

Setting Configuration for local file:///

2012-08-07 Thread Mohit Anchlia
I am trying to write a test on the local file system, but this test keeps picking up the xml files on the path even though I am setting a different Configuration object. Is there a way for me to override it? I thought the way I am doing it overrides the configuration, but it doesn't seem to be working: @Test publi

Re: Basic Question

2012-08-07 Thread Mohit Anchlia
et() calls to its benefit to > avoid too many object creation. Thanks! > > On Tue, Aug 7, 2012 at 11:56 PM, Mohit Anchlia > wrote: > > In Mapper I often use a Global Text object and througout the map > processing > > I just call "set" on it. My question is, what

Re: Avro

2012-08-04 Thread Mohit Anchlia
a file, you should be able to read older > data as well. Try it out. It is very straight forward. > > Hope this helps! Thanks! I am new to Avro what's the best place to see some examples of how Avro deals with schema changes? I am trying to find some examples. > > On Sun, Aug 5,

Compression and Decompression

2012-07-05 Thread Mohit Anchlia
Is the compression done on the client side or on the server side? If I run hadoop fs -text, then is it the client that decompresses the file for me?

Re: Dealing with changing file format

2012-07-02 Thread Mohit Anchlia
t will grow with you. Another is to > > store the scheme with the data when it is written out. Your code may > need > > to the dynamically adjust to when the field is there and when it is not. > > > > --Bobby Evans > > > > On 7/2/12 4:09 PM, "Mohit An

Dealing with changing file format

2012-07-02 Thread Mohit Anchlia
I am wondering what's the right way to go about designing the reading of input and output where the file format may change over time. For instance we might start with "field1,field2,field3" but at some point we add a new field4 in the input. What's the best way to deal with such scenarios? Keep a catalog of c

How to find compression codec

2012-06-25 Thread Mohit Anchlia
Is there a way to look at the sequence file or a block report to see which compression is being used?
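One way to answer this, sketched under the assumption that the file in question is a sequence file: its header records whether it is compressed, whether it is block-compressed, and which codec was used, and SequenceFile.Reader exposes all of that. The path is whatever file you want to inspect.

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.SequenceFile;
  import org.apache.hadoop.io.compress.CompressionCodec;

  public class ShowCodec {
    public static void main(String[] args) throws Exception {
      Configuration conf = new Configuration();
      Path path = new Path(args[0]);
      FileSystem fs = path.getFileSystem(conf);

      SequenceFile.Reader reader = new SequenceFile.Reader(fs, path, conf);
      try {
        // The codec and compression type are stored in the file header.
        System.out.println("compressed:       " + reader.isCompressed());
        System.out.println("block compressed: " + reader.isBlockCompressed());
        CompressionCodec codec = reader.getCompressionCodec();
        System.out.println("codec: " + (codec == null ? "none" : codec.getClass().getName()));
      } finally {
        reader.close();
      }
    }
  }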

Re: Sync and Data Replication

2012-06-10 Thread Mohit Anchlia
On Sun, Jun 10, 2012 at 9:39 AM, Harsh J wrote: > Mohit, > > On Sat, Jun 9, 2012 at 11:11 PM, Mohit Anchlia > wrote: > > Thanks Harsh for detailed info. It clears things up. Only thing from > those > > page is concerning is what happens when client crashes. It say

Re: Sync and Data Replication

2012-06-09 Thread Mohit Anchlia
the call hflush() and save on performance. One place where hsync() > is to be preferred instead of hflush() is where you use WALs (for data > reliability), and HBase is one such application. With hsync(), HBase > can survive potential failures caused by major power failure cases > (amon
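A small sketch of the distinction discussed in this reply, assuming a Hadoop version whose FSDataOutputStream exposes the Syncable hflush()/hsync() calls; the path is illustrative.

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FSDataOutputStream;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  public class FlushDemo {
    public static void main(String[] args) throws Exception {
      FileSystem fs = FileSystem.get(new Configuration());
      FSDataOutputStream out = fs.create(new Path("/tmp/flush-demo.txt"));
      out.writeBytes("a line new readers should be able to see\n");

      // hflush(): pushes the current packet through the datanode pipeline so
      // new readers can see the data; it does not force the data to disk.
      out.hflush();

      // hsync(): like hflush(), but also asks the datanodes to sync to their
      // local disks, which survives power failures at a higher cost.
      out.hsync();

      out.close();
    }
  }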

Sync and Data Replication

2012-06-08 Thread Mohit Anchlia
I am wondering about the role of sync in the replication of data to other nodes. Say a client writes a line to a file in Hadoop; at this point the file handle is open and sync has not been called. In this scenario is the data also replicated, as defined by the replication factor, to other nodes as well? I am wondering i

Re: Ideal file size

2012-06-06 Thread Mohit Anchlia
've described it doesn't cause issues with the NameNode but rather an increase in processing times if there are too many small files. Looks like I need to find that balance. It would also be interesting to see how others solve this problem when not using Flume. > > > On Wed,

Ideal file size

2012-06-06 Thread Mohit Anchlia
We have a continuous flow of data into the sequence file. I am wondering what would be the ideal file size before the file gets rolled over. I know too many small files are not good, but could someone tell me what would be the ideal size such that it doesn't overload the NameNode.

Re: Writing click stream data to hadoop

2012-05-30 Thread Mohit Anchlia
and the WAL-kinda reliability you seek. > Thanks Harsh, does flume also provide an API on top? I am getting this data as http calls; how would I go about using flume with http calls? > > On Fri, May 25, 2012 at 8:24 PM, Mohit Anchlia > wrote: > > We get click data through API calls. I

Re: Bad connect ack with firstBadLink

2012-05-04 Thread Mohit Anchlia
Please see: http://hbase.apache.org/book.html#dfs.datanode.max.xcievers On Fri, May 4, 2012 at 5:46 AM, madhu phatak wrote: > Hi, > We are running a three node cluster . From two days whenever we copy file > to hdfs , it is throwing java.IO.Exception Bad connect ack with > firstBadLink . I sea

Re: Compressing map only output

2012-04-30 Thread Mohit Anchlia
ig params are also > available at: > http://hadoop.apache.org/common/docs/current/mapred-default.html > (core-default.html, hdfs-default.html) > > On Tue, May 1, 2012 at 6:36 AM, Mohit Anchlia > wrote: > > Thanks! When I tried to search for this property I couldn't find it

Re: Compressing map only output

2012-04-30 Thread Mohit Anchlia
set > those properties in your job conf. > > > On Mon, Apr 30, 2012 at 5:25 PM, Mohit Anchlia >wrote: > > > Is there a way to compress map only jobs to compress map output that gets > > stored on hdfs as part-m-* files? In pig I used : > > > > Wou

Compressing map only output

2012-04-30 Thread Mohit Anchlia
Is there a way to compress map only jobs to compress map output that gets stored on hdfs as part-m-* files? In pig I used : Would these work for plain map reduce jobs as well? set output.compression.enabled true; set output.compression.codec org.apache.hadoop.io.compress.SnappyCodec;
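For a plain map reduce job the same effect comes from the output compression settings on the job, sketched here with the old mapred API; this assumes the Snappy codec is actually installed on the cluster, and paths are illustrative.

  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.io.compress.SnappyCodec;
  import org.apache.hadoop.mapred.FileInputFormat;
  import org.apache.hadoop.mapred.FileOutputFormat;
  import org.apache.hadoop.mapred.JobClient;
  import org.apache.hadoop.mapred.JobConf;

  public class CompressedMapOnlyJob {
    public static void main(String[] args) throws Exception {
      JobConf conf = new JobConf(CompressedMapOnlyJob.class);

      // With no reducers the mapper output is the job output (part-m-* files),
      // so compressing the job output compresses those files.
      conf.setNumReduceTasks(0);

      // Roughly the equivalent of Pig's output.compression.enabled /
      // output.compression.codec settings.
      FileOutputFormat.setCompressOutput(conf, true);
      FileOutputFormat.setOutputCompressorClass(conf, SnappyCodec.class);

      conf.setOutputKeyClass(Text.class);
      conf.setOutputValueClass(Text.class);

      FileInputFormat.setInputPaths(conf, new Path("/data/in"));
      FileOutputFormat.setOutputPath(conf, new Path("/data/out"));
      JobClient.runJob(conf);
    }
  }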

Re: DFSClient error

2012-04-29 Thread Mohit Anchlia
x.xcievers > > I.e., add: > > <property> <name>dfs.datanode.max.xcievers</name> <value>4096</value> </property> > > To your DNs' config/hdfs-site.xml and restart the DNs. > > On Mon, Apr 30, 2012 at 1:35 AM, Mohit Anchlia > wrote: > > I even tried to lower number of parallel jobs even further bu

Re: DFSClient error

2012-04-29 Thread Mohit Anchlia
4 On Fri, Apr 27, 2012 at 3:45 PM, Mohit Anchlia wrote: > After all the jobs fail I can't run anything. Once I restart the cluster I > am able to run other jobs with no problems, hadoop fs and other io > intensive jobs run just fine. > > > On Fri, Apr 27, 2012 at 3:12 PM, John

Re: DFSClient error

2012-04-27 Thread Mohit Anchlia
command? > If yes, how about a wordcount example? > '/hadoop jar hadoop-*examples*.jar wordcount input output' > > > -Original Message- > From: Mohit Anchlia > Reply-To: "common-user@hadoop.apache.org" > Date: Fri, 27 Apr 2012 14:36:49 -0700 &

Re: DFSClient error

2012-04-27 Thread Mohit Anchlia
2012-04-26 16:12:53,598 INFO org.apache.hadoop.mapred.JobInProgress: job_201204261140_0244 LOCALITY_WAIT_FACTOR=0.4 2012-04-26 16:12:53,598 INFO org.apache.hadoop.mapred.JobInProgress: Job job_201204261140_0244 initialized successfully with 1 map tasks and 0 reduce tasks. On Fri, Apr 27, 2012 at

Re: DFSClient error

2012-04-27 Thread Mohit Anchlia
ing pig script to read .gz file > On Fri, Apr 27, 2012 at 5:19 AM, Mohit Anchlia > wrote: > > I had 20 mappers in parallel reading 20 gz files and each file around > > 30-40MB data over 5 hadoop nodes and then writing to the analytics > > database. Almost midway it started

DFSClient error

2012-04-26 Thread Mohit Anchlia
I had 20 mappers in parallel reading 20 gz files and each file around 30-40MB data over 5 hadoop nodes and then writing to the analytics database. Almost midway it started to get this error: 2012-04-26 16:13:53,723 [Thread-8] INFO org.apache.hadoop.hdfs.DFSClient - Exception in createBlockOutputS

Re: Design question

2012-04-26 Thread Mohit Anchlia
Any suggestions or pointers would be helpful. Are there any best practices? On Mon, Apr 23, 2012 at 3:27 PM, Mohit Anchlia wrote: > I just wanted to check how people design their storage directories for > data that is sent to the system continuously. For eg: for a given > functionali

Design question

2012-04-23 Thread Mohit Anchlia
I just wanted to check how people design their storage directories for data that is sent to the system continuously. For eg: for a given functionality we get a data feed continuously written to a sequencefile, which is then converted to a more structured format using map reduce and stored in tab separate

Re: Get Current Block or Split ID, and using it, the Block Path

2012-04-08 Thread Mohit Anchlia
I think if you called getInputFormat on JobConf and then called getSplits you would at least get the locations. http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/mapred/InputSplit.html On Sun, Apr 8, 2012 at 9:16 AM, Deepak Nettem wrote: > Hi, > > Is it possible to get the 'id' o

Re: Doubt from the book "Definitive Guide"

2012-04-05 Thread Mohit Anchlia
estion to understand the rationale behind using local disk for final output. > Prashant > > On Apr 4, 2012, at 9:55 PM, Mohit Anchlia wrote: > > > On Wed, Apr 4, 2012 at 8:42 PM, Harsh J wrote: > > > >> Hi Mohit, > >> > >> On Thu, Apr 5, 2012

Re: Doubt from the book "Definitive Guide"

2012-04-04 Thread Mohit Anchlia
On Wed, Apr 4, 2012 at 8:42 PM, Harsh J wrote: > Hi Mohit, > > On Thu, Apr 5, 2012 at 5:26 AM, Mohit Anchlia > wrote: > > I am going through the chapter "How mapreduce works" and have some > > confusion: > > > > 1) Below description of Mapper

Re: Doubt from the book "Definitive Guide"

2012-04-04 Thread Mohit Anchlia
On Wed, Apr 4, 2012 at 5:23 PM, Prashant Kommireddi wrote: > Answers inline. > > On Wed, Apr 4, 2012 at 4:56 PM, Mohit Anchlia >wrote: > > > I am going through the chapter "How mapreduce works" and have some > > confusion: > > > > 1) Below descri

Doubt from the book "Definitive Guide"

2012-04-04 Thread Mohit Anchlia
I am going through the chapter "How mapreduce works" and have some confusion: 1) Below description of Mapper says that reducers get the output file using HTTP call. But the description under "The Reduce Side" doesn't specifically say if it's copied using HTTP. So first confusion, Is the output cop

Re: setNumTasks

2012-03-22 Thread Mohit Anchlia
"setNumTasks". There is however, > "setNumReduceTasks", which sets "mapred.reduce.tasks". > > Does this answer your question? > > On Thu, Mar 22, 2012 at 8:21 PM, Mohit Anchlia > wrote: > > Could someone please help me answer this question? >
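In code, the reply above boils down to the setter and the property being the same knob; a tiny sketch:

  import org.apache.hadoop.mapred.JobConf;

  public class ReduceCountDemo {
    public static void main(String[] args) {
      JobConf conf = new JobConf();

      // These two lines set the same thing; there is no "setNumTasks".
      conf.setNumReduceTasks(10);
      conf.set("mapred.reduce.tasks", "10");

      System.out.println(conf.get("mapred.reduce.tasks")); // prints 10
    }
  }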

Re: setNumTasks

2012-03-22 Thread Mohit Anchlia
Could someone please help me answer this question? On Wed, Mar 14, 2012 at 8:06 AM, Mohit Anchlia wrote: > What is the corresponding system property for setNumTasks? Can it be used > explicitly as system property like "mapred.tasks."?

Re: Snappy Error

2012-03-22 Thread Mohit Anchlia
Looks like org.apache.hadoop.io.compress.SnappyCodec is not in the classpath? On Thu, Mar 22, 2012 at 4:30 AM, hadoop hive wrote: > HI Folks, > > i follow all ther steps and build and install snappy and after creating > sequencetable when i m insert overwrite the data into this table its > throw

Re: EOFException

2012-03-19 Thread Mohit Anchlia
I guess I am trying to see how to debug such problems? I don't see enough info in the logs. On Mon, Mar 19, 2012 at 12:48 AM, madhu phatak wrote: > Hi, > Seems like HDFS is in safemode. > > On Fri, Mar 16, 2012 at 1:37 AM, Mohit Anchlia >wrote: > > > This is

Re: EOFException

2012-03-15 Thread Mohit Anchlia
This is actually just hadoop job over HDFS. I am assuming you also know why this is erroring out? On Thu, Mar 15, 2012 at 1:02 PM, Gopal wrote: > On 03/15/2012 03:06 PM, Mohit Anchlia wrote: > >> When I start a job to read data from HDFS I start getting these errors. >> Doe

Re: SequenceFile split question

2012-03-15 Thread Mohit Anchlia
stored in the same. Same applies in case of MR > tasks as well. > > Regards > Bejoy.K.S > > On Thu, Mar 15, 2012 at 6:17 AM, Mohit Anchlia >wrote: > > > I have a client program that creates sequencefile, which essentially > merges > > small files into a big fi

SequenceFile split question

2012-03-14 Thread Mohit Anchlia
I have a client program that creates a sequencefile, which essentially merges small files into a big file. I was wondering how the sequence file splits the data across nodes. When I start, the sequence file is empty. Does it get split when it reaches the dfs.block size? If so then does it mean that

Re: mapred.tasktracker.map.tasks.maximum not working

2012-03-10 Thread Mohit Anchlia
mapred.tasktracker.map.tasks.maximum not working > > you set the " mapred.tasktracker.map.tasks.maximum " in your job means > nothing. Because Hadoop mapreduce platform only checks this parameter when > it starts. This is a system configuration. > > You need to set it in your conf/mapred-site.xml f
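A small sketch contrasting what a job can and cannot influence, which is the point of the reply above: the per-tasktracker slot count is a daemon-side setting read once at startup from mapred-site.xml, while a handful of other knobs do take effect from the client. Values are illustrative.

  import org.apache.hadoop.mapred.JobConf;

  public class SlotSettingsDemo {
    public static void main(String[] args) {
      JobConf conf = new JobConf();

      // No effect when set from a job: each TaskTracker reads
      // mapred.tasktracker.map.tasks.maximum from its own mapred-site.xml
      // when the daemon starts, so the cluster's slot count decides how
      // many map tasks run concurrently.
      conf.setInt("mapred.tasktracker.map.tasks.maximum", 2);

      // Job-level settings that do take effect from the client side:
      conf.setNumReduceTasks(2);            // number of reduce tasks
      conf.setInt("mapred.map.tasks", 10);  // only a hint; splits decide
    }
  }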

Re: mapred.map.tasks vs mapred.tasktracker.map.tasks.maximum

2012-03-09 Thread Mohit Anchlia
l > file. > > On Fri, Mar 9, 2012 at 7:19 PM, Mohit Anchlia >wrote: > > > What's the difference between setNumMapTasks and mapred.map.tasks? > > > > On Fri, Mar 9, 2012 at 5:00 PM, Chen He wrote: > > > > > Hi Mohit > > > > &

mapred.tasktracker.map.tasks.maximum not working

2012-03-09 Thread Mohit Anchlia
I have mapred.tasktracker.map.tasks.maximum set to 2 in my job and I have 5 nodes. I was expecting this to give only 10 concurrent map tasks, but I have 30 mappers running. Does hadoop ignore this setting when supplied from the job?

Re: mapred.map.tasks vs mapred.tasktracker.map.tasks.maximum

2012-03-09 Thread Mohit Anchlia
mapred.job.reduce(maps)" means default number of reduce (map) tasks your > job will has. > > To set the number of mappers in your application. You can write like this: > > *configuration.setNumMapTasks(the number you want);* > > Chen > > Actually, you can just use
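Since setNumMapTasks is only a hint, the practical way to end up with roughly the desired number of mappers is to control the split size; a hedged sketch, with the split size value purely illustrative:

  import org.apache.hadoop.mapred.JobConf;

  public class TenMappersSketch {
    public static void main(String[] args) {
      JobConf conf = new JobConf();

      // A hint only: the real number of map tasks equals the number of
      // input splits computed by the InputFormat.
      conf.setNumMapTasks(10);

      // To actually get ~10 mappers for ~10 GB of input, raise the minimum
      // split size so each split covers ~1 GB.
      conf.setLong("mapred.min.split.size", 1024L * 1024 * 1024);
    }
  }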

mapred.map.tasks vs mapred.tasktracker.map.tasks.maximum

2012-03-09 Thread Mohit Anchlia
What's the difference between mapred.tasktracker.reduce.tasks.maximum and mapred.map.tasks? I want my data to be split across only 10 mappers in the entire cluster. Can I do that using one of the above parameters?

Re: Profiling Hadoop Job

2012-03-08 Thread Mohit Anchlia
Can you check which user you are running this process as and compare it with the ownership on the directory? On Thu, Mar 8, 2012 at 3:13 PM, Leonardo Urbina wrote: > Does anyone have any idea how to solve this problem? Regardless of whether > I'm using plain HPROF or profiling through Starfish,

Re: Java Heap space error

2012-03-06 Thread Mohit Anchlia
I am still trying to see how to narrow this down. Is it possible to set heapdumponoutofmemoryerror option on these individual tasks? On Mon, Mar 5, 2012 at 5:49 PM, Mohit Anchlia wrote: > Sorry for multiple emails. I did find: > > > 2012-03-05 17:26
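One way to do what this message asks for, sketched under the assumption that the task JVM options are controlled by mapred.child.java.opts; the heap size and dump path are illustrative.

  import org.apache.hadoop.mapred.JobConf;

  public class HeapDumpOptsDemo {
    public static void main(String[] args) {
      JobConf conf = new JobConf();

      // Task JVMs are forked with these options, so an OutOfMemoryError in a
      // map or reduce task leaves an .hprof file on that node for inspection.
      conf.set("mapred.child.java.opts",
          "-Xmx512m -XX:+HeapDumpOnOutOfMemoryError"
          + " -XX:HeapDumpPath=/tmp/task_heapdumps");
    }
  }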

Re: Java Heap space error

2012-03-05 Thread Mohit Anchlia
46 PM, Mohit Anchlia wrote: > All I see in the logs is: > > > 2012-03-05 17:26:36,889 FATAL org.apache.hadoop.mapred.TaskTracker: Task: > attempt_201203051722_0001_m_30_1 - Killed : Java heap space > > Looks like task tracker is killing the tasks. Not sure why. I increased >

Re: Java Heap space error

2012-03-05 Thread Mohit Anchlia
, 2012 at 5:03 PM, Mohit Anchlia wrote: > I currently have java.opts.mapred set to 512MB and I am getting heap space > errors. How should I go about debugging heap space issues? >

Re: AWS MapReduce

2012-03-05 Thread Mohit Anchlia
> parameters > > > you can bypass - for example blocksizes etc. - but in the end imho > > setting > > > up ec2 instances by copying images is the better alternative. > > > > > > Kind Regards > > > > > > Hannes > > > > > > On Sun

Re: AWS MapReduce

2012-03-04 Thread Mohit Anchlia
etup is done pretty fast and there are some configuration parameters > you can bypass - for example blocksizes etc. - but in the end imho setting > up ec2 instances by copying images is the better alternative. > > Kind Regards > > Hannes > > On Sun, Mar 4, 2012 at 2:31 AM,

Re: AWS MapReduce

2012-03-03 Thread Mohit Anchlia
I think found answer to this question. However, it's still not clear if HDFS is on local disk or EBS volumes. Does anyone know? On Sat, Mar 3, 2012 at 3:54 PM, Mohit Anchlia wrote: > Just want to check how many are using AWS mapreduce and understand the > pros and cons of Amazon&

Re: Hadoop pain points?

2012-03-02 Thread Mohit Anchlia
+1 On Fri, Mar 2, 2012 at 4:09 PM, Harsh J wrote: > Since you ask about anything in general, when I forayed into using > Hadoop, my biggest pain was lack of documentation clarity and > completeness over the MR and DFS user APIs (and other little points). > > It would be nice to have some work do

Re: Adding nodes

2012-03-01 Thread Mohit Anchlia
you would need to edit these files and > issue the refresh command. > > > On Mar 1, 2012, at 5:35 PM, Mohit Anchlia wrote: > > On Thu, Mar 1, 2012 at 4:57 PM, Joey Echeverria > wrote: > > Not quite. Datanodes get the namenode host from fs.default.name in > > c

Re: Adding nodes

2012-03-01 Thread Mohit Anchlia
tracker know there is a new node in the cluster. Is it initiated by namenode when slave file is edited? Or is it initiated by tasktracker when tasktracker is started? > > Sent from my iPhone > > On Mar 1, 2012, at 18:49, Mohit Anchlia wrote: > > > On Thu, Mar 1, 2012 at

Re: Adding nodes

2012-03-01 Thread Mohit Anchlia
ar 1, 2012, at 18:29, Mohit Anchlia wrote: > > > Is this the right procedure to add nodes? I took some from hadoop wiki > FAQ: > > > > http://wiki.apache.org/hadoop/FAQ > > > > 1. Update conf/slave > > 2. on the slave nodes start datanode and tasktracker

Adding nodes

2012-03-01 Thread Mohit Anchlia
Is this the right procedure to add nodes? I took some from hadoop wiki FAQ: http://wiki.apache.org/hadoop/FAQ 1. Update conf/slave 2. on the slave nodes start datanode and tasktracker 3. hadoop balancer Do I also need to run dfsadmin -refreshnodes?

kill -QUIT

2012-03-01 Thread Mohit Anchlia
When I try kill -QUIT for a job it doesn't send the stacktrace to the log files. Does anyone know why or if I am doing something wrong? I find the job using ps -ef|grep "attempt". I then go to logs/userLogs/job/attempt/

Re: Invocation exception

2012-02-29 Thread Mohit Anchlia
> sifting through this to debug your issue. > > This is also explained in Tom's book under the title "Debugging a Job" > (p154, Hadoop: The Definitive Guide, 2nd ed.). > > On Wed, Feb 29, 2012 at 1:40 AM, Mohit Anchlia > wrote: > > It looks like adding this

Re: 100x slower mapreduce compared to pig

2012-02-29 Thread Mohit Anchlia
ldn't find one. Does anyone know where stacktraces are generally sent? On Wed, Feb 29, 2012 at 1:08 PM, Mohit Anchlia wrote: > I can't seem to find what's causing this slowness. Nothing in the logs. > It's just painfuly slow. However, pig job is awesome in performance th

Re: 100x slower mapreduce compared to pig

2012-02-29 Thread Mohit Anchlia
) { // *TODO* Auto-generated catch block log.error("Invalid xml", e); *throw* *new* IllegalArgumentException("invalid xml " + e.getCause().getMessage()); } } On Wed, Feb 29, 2012 at 9:27 AM, Mohit Anchlia wrote: > I am going to try few things today. I have a JAXBContex

Re: 100x slower mapreduce compared to pig

2012-02-29 Thread Mohit Anchlia
look at what you are doing in the UDF vs > the Mapper. > > 100x slow does not make sense for the same job/logic, its either the Mapper > code or may be the cluster was busy at the time you scheduled MapReduce > job? > > Thanks, > Prashant > > On Tue, Feb 28, 2012 at 4

100x slower mapreduce compared to pig

2012-02-28 Thread Mohit Anchlia
I am comparing the runtime of similar logic. The entire logic is exactly the same, but surprisingly the map reduce job that I submit is 100x slower. For pig I use a udf, and for hadoop I use a mapper only with the same logic as the pig udf. Even the splits on the admin page are the same. Not sure why it's so slow. I am submitting

Re: Invocation exception

2012-02-28 Thread Mohit Anchlia
"/jars/analytics.jar"), conf);" but this works just fine. On Tue, Feb 28, 2012 at 11:44 AM, Mohit Anchlia wrote: > I commented reducer and combiner both and still I see the same exception. > Could it be because I have 2 jars being added? > > On Mon, Feb 27, 2012 at 8:23 PM, S

Re: Invocation exception

2012-02-28 Thread Mohit Anchlia
I commented reducer and combiner both and still I see the same exception. Could it be because I have 2 jars being added? On Mon, Feb 27, 2012 at 8:23 PM, Subir S wrote: > On Tue, Feb 28, 2012 at 4:30 AM, Mohit Anchlia >wrote: > > > For some reason I am getting invocation except

Re: Handling bad records

2012-02-27 Thread Mohit Anchlia
t(key, new Text("Chau")); On Mon, Feb 27, 2012 at 9:53 PM, Harsh J wrote: > Mohit, > > Use the MultipleOutputs API: > > http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/mapred/lib/MultipleOutputs.html > to have a named output of bad records. There i
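A hedged sketch of the MultipleOutputs approach referred to above, using the old mapred API; the named output "badrecords" and the parse step are illustrative, and the driver has to register the named output with MultipleOutputs.addNamedOutput beforehand.

  import java.io.IOException;
  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapred.JobConf;
  import org.apache.hadoop.mapred.MapReduceBase;
  import org.apache.hadoop.mapred.Mapper;
  import org.apache.hadoop.mapred.OutputCollector;
  import org.apache.hadoop.mapred.Reporter;
  import org.apache.hadoop.mapred.lib.MultipleOutputs;

  public class XmlRecordMapper extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, Text> {

    private MultipleOutputs mos;

    @Override
    public void configure(JobConf job) {
      mos = new MultipleOutputs(job);
    }

    @Override
    @SuppressWarnings("unchecked")
    public void map(LongWritable key, Text value,
        OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
      try {
        Text parsed = parseXml(value); // placeholder for the real xml parsing
        output.collect(new Text("ok"), parsed);
      } catch (IllegalArgumentException e) {
        // Bad records go to a separate named output instead of log4j.
        mos.getCollector("badrecords", reporter)
            .collect(value, new Text(String.valueOf(e.getMessage())));
      }
    }

    private Text parseXml(Text value) {
      return value; // stand-in for validation that may throw on invalid xml
    }

    @Override
    public void close() throws IOException {
      mos.close();
    }
  }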

Re: Invocation exception

2012-02-27 Thread Mohit Anchlia
ou point me to the topic in that book where I'll find this information? > Sent from my iPhone > > On Feb 27, 2012, at 8:54 PM, Mohit Anchlia wrote: > > > Does it matter if reducer is set even if the no of reducers is 0? Is > there > > a way to get more clear reason

Re: Invocation exception

2012-02-27 Thread Mohit Anchlia
Does it matter if the reducer is set even if the number of reducers is 0? Is there a way to get a clearer reason? On Mon, Feb 27, 2012 at 8:23 PM, Subir S wrote: > On Tue, Feb 28, 2012 at 4:30 AM, Mohit Anchlia >wrote: > > > For some reason I am getting invocation exception and I don't

Handling bad records

2012-02-27 Thread Mohit Anchlia
What's the best way to write records to a different file? I am doing xml processing, and during processing I might come across invalid xml format. Currently I have it under a try/catch block and am writing to log4j. But I think it would be better to just write it to an output file that just contains error

Re: dfs.block.size

2012-02-27 Thread Mohit Anchlia
How do I verify the block size of a given file? Is there a command? On Mon, Feb 27, 2012 at 7:59 AM, Joey Echeverria wrote: > dfs.block.size can be set per job. > > mapred.tasktracker.map.tasks.maximum is per tasktracker. > > -Joey > > On Mon, Feb 27, 2012 at 10:19 AM, M

Task Killed but no errors

2012-02-27 Thread Mohit Anchlia
I submitted a map reduce job that had 9 tasks killed out of 139, but I don't see any errors in the admin page. The entire job, however, has SUCCEEDED. How can I track down the reason? Also, how do I determine if this is something to worry about?

Re: dfs.block.size

2012-02-27 Thread Mohit Anchlia
Can someone please suggest if parameters like dfs.block.size, mapred.tasktracker.map.tasks.maximum are only cluster wide settings or can these be set per client job configuration? On Sat, Feb 25, 2012 at 5:43 PM, Mohit Anchlia wrote: > If I want to change the block size then can I

Re: LZO with sequenceFile

2012-02-26 Thread Mohit Anchlia
012 at 9:55 PM, Ioan Eugen Stan > wrote: > > 2012/2/26 Mohit Anchlia : > >> Thanks. Does it mean LZO is not installed by default? How can I install > LZO? > > > > The LZO library is released under GPL and I believe it can't be > > included in most distrib

Re: LZO with sequenceFile

2012-02-25 Thread Mohit Anchlia
Thanks. Does it mean LZO is not installed by default? How can I install LZO? On Sat, Feb 25, 2012 at 6:27 PM, Shi Yu wrote: > Yes, it is supported by Hadoop sequence file. It is splittable > by default. If you have installed and specified LZO correctly, > use these: > > > org.apache.hadoop.mapre

dfs.block.size

2012-02-25 Thread Mohit Anchlia
If I want to change the block size then can I use Configuration in mapreduce job and set it when writing to the sequence file or does it need to be cluster wide setting in .xml files? Also, is there a way to check the block of a given file?
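A sketch of both halves of this question: dfs.block.size is a client-side, per-file setting, so putting it in the writer's Configuration is enough, and the block size of an existing file can be read back from its FileStatus. The path and size are illustrative.

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileStatus;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.SequenceFile;
  import org.apache.hadoop.io.Text;

  public class BlockSizeDemo {
    public static void main(String[] args) throws Exception {
      Configuration conf = new Configuration();
      // The writing client's value is used for the new file; no cluster-wide
      // change or restart is needed (256 MB here is illustrative).
      conf.setLong("dfs.block.size", 256L * 1024 * 1024);

      FileSystem fs = FileSystem.get(conf);
      Path path = new Path("/data/demo.seq");

      SequenceFile.Writer writer =
          SequenceFile.createWriter(fs, conf, path, Text.class, Text.class);
      writer.append(new Text("key"), new Text("value"));
      writer.close();

      // Checking the block size of an existing file:
      FileStatus status = fs.getFileStatus(path);
      System.out.println("block size = " + status.getBlockSize());
    }
  }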

MapReduce tunning

2012-02-24 Thread Mohit Anchlia
I am looking at some hadoop tuning parameters like io.sort.mb, mapred.child.java.opts etc. - My question was where to look for the current settings - Are these settings configured cluster wide or per job? - What's the best way to look at reasons for slow performance?

Streaming job hanging

2012-02-22 Thread Mohit Anchlia
The streaming job just seems to be hanging: 12/02/22 17:35:50 INFO streaming.StreamJob: map 0% reduce 0% - On the admin page I see that it created 551 input splits. Could someone suggest a way to find out what might be causing it to hang? I increased io.sort.mb to 200 MB. I am using 5 data nod

Re: Splitting files on new line using hadoop fs

2012-02-22 Thread Mohit Anchlia
iciently. > > Regards > Bejoy K S > > From handheld, Please excuse typos. > -- > *From: *Mohit Anchlia > *Date: *Wed, 22 Feb 2012 12:29:26 -0800 > *To: *; > *Subject: *Re: Splitting files on new line using hadoop fs > > > On Wed, Feb 22,

Re: Splitting files on new line using hadoop fs

2012-02-22 Thread Mohit Anchlia
nk/java/piggybank.jar' raw = LOAD '/examples/testfile5.txt using org.apache.pig.piggybank.storage.XMLLoader('') as (document:chararray); dump raw; > > --Original Message-- > From: Mohit Anchlia > To: common-user@hadoop.apache.org > ReplyTo: common-user

Re: Writing small files to one big file in hdfs

2012-02-21 Thread Mohit Anchlia
Finally figured it out. I needed to use SequenceFileAsTextInputFormat. There is just a lack of examples, which makes it difficult when you start. On Tue, Feb 21, 2012 at 4:50 PM, Mohit Anchlia wrote: > It looks like in mapper values are coming as binary instead of Text. Is > this expecte
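A hedged driver sketch of the piece this message refers to, using the old mapred API: SequenceFileAsTextInputFormat hands the mapper Text keys and values (converted via toString) instead of the raw writables stored in the sequence file. Paths are illustrative.

  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapred.FileInputFormat;
  import org.apache.hadoop.mapred.FileOutputFormat;
  import org.apache.hadoop.mapred.JobClient;
  import org.apache.hadoop.mapred.JobConf;
  import org.apache.hadoop.mapred.SequenceFileAsTextInputFormat;

  public class SeqFileTextDriver {
    public static void main(String[] args) throws Exception {
      JobConf conf = new JobConf(SeqFileTextDriver.class);

      // Keys and values arrive in the mapper as Text, regardless of the
      // writable types the sequence file was written with.
      conf.setInputFormat(SequenceFileAsTextInputFormat.class);

      conf.setOutputKeyClass(Text.class);
      conf.setOutputValueClass(Text.class);
      conf.setNumReduceTasks(0);

      FileInputFormat.setInputPaths(conf, new Path("/data/merged.seq"));
      FileOutputFormat.setOutputPath(conf, new Path("/data/out"));
      JobClient.runJob(conf);
    }
  }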

Re: Writing small files to one big file in hdfs

2012-02-21 Thread Mohit Anchlia
It looks like in mapper values are coming as binary instead of Text. Is this expected from sequence file? I initially wrote SequenceFile with Text values. On Tue, Feb 21, 2012 at 4:13 PM, Mohit Anchlia wrote: > Need some more help. I wrote sequence file using below code but now when I &g

Re: Writing small files to one big file in hdfs

2012-02-21 Thread Mohit Anchlia
Cheers > Arko > > On Tue, Feb 21, 2012 at 2:04 PM, Mohit Anchlia >wrote: > > > Sorry may be it's something obvious but I was wondering when map or > reduce > > gets called what would be the class used for key and value? If I used > > "org.apache.hadoop

Re: Writing to SequenceFile fails

2012-02-21 Thread Mohit Anchlia
I am past this error. Looks like I needed to use CDH libraries. I changed my maven repo. Now I am stuck at *org.apache.hadoop.security.AccessControlException *since I am not writing as user that owns the file. Looking online for solutions On Tue, Feb 21, 2012 at 12:48 PM, Mohit Anchlia wrote

Re: Writing small files to one big file in hdfs

2012-02-21 Thread Mohit Anchlia
; > From inside your Map/Reduce methods, I think you should NOT be tinkering > with the input / output paths of that Map/Reduce job. > Cheers > Arko > > > On Tue, Feb 21, 2012 at 1:38 PM, Mohit Anchlia >wrote: > > > Thanks How does mapreduce work on sequence fil

Re: Writing small files to one big file in hdfs

2012-02-21 Thread Mohit Anchlia
(data);* > > * }* > > *reader.close* > > *}* > > *output.close* > > > In case you have the files in multiple directories, call the code for each > of them with different input paths. > > Hope this helps! > > Cheers > > Arko > > On Tue

Re: Writing small files to one big file in hdfs

2012-02-21 Thread Mohit Anchlia
key value format.Since SequenceFileInputFormat is available at your > disposal you don't need any custom input formats for processing the same > using map reduce. It is a cleaner and better approach compared to just > appending small xml file contents into a big file. > > On Tue, Feb

Re: Writing small files to one big file in hdfs

2012-02-21 Thread Mohit Anchlia
nning to use pig's > org.apache.pig.piggybank.storage.XMLLoader for processing. Would it work with sequence file? This text file that I was referring to would be in hdfs itself. Is it still different than using sequence file? > Regards > Bejoy.K.S > > On Tue, Feb 21, 2012 at 10:45 PM, Mohit Anchlia >wrote:

Re: Hadoop install

2012-02-18 Thread Mohit Anchlia
otes, as always, are well worth reading. > > > Tom Deutsch > Program Director > Information Management > Big Data Technologies > IBM > 3565 Harbor Blvd > Costa Mesa, CA 92626-1420 > tdeut...@us.ibm.com > > > > &

Re: Processing small xml files

2012-02-18 Thread Mohit Anchlia
g. > I can't seem to find examples of how to do xml processing in Pig. Can you please send me some pointers? Basically I need to convert my xml to more structured format using hadoop to write it to database. > > On Sat, Feb 18, 2012 at 1:18 AM, Mohit Anchlia > wrote: > &

Re: Processing small xml files

2012-02-17 Thread Mohit Anchlia
On Tue, Feb 14, 2012 at 10:56 AM, W.P. McNeill wrote: > I'm not sure what you mean by "flat format" here. > > In my scenario, I have a file input.xml that looks like this. > > > > 1 > > > 2 > > > > input.xml is a plain text file. Not a sequence file. If I read it with the

Re: Processing small xml files

2012-02-12 Thread Mohit Anchlia
On Sun, Feb 12, 2012 at 9:24 AM, W.P. McNeill wrote: > I've used the Mahout XMLInputFormat. It is the right tool if you have an > XML file with one type of section repeated over and over again and want to > turn that into Sequence file where each repeated section is a value. I've > found it helpfu

Developing MapReduce

2011-10-10 Thread Mohit Anchlia
I use eclipse. Is this http://wiki.apache.org/hadoop/EclipsePlugIn still the best way to develop mapreduce programs in hadoop? Just want to make sure before I go down this path. Or should I just add hadoop jars in my classpath of eclipse and create my own MapReduce programs. Thanks

Re: incremental loads into hadoop

2011-10-03 Thread Mohit Anchlia
This process of managing looks like more pain long term. Would it be easier to store in Hbase which has smaller block size? What's the avg. file size? On Sun, Oct 2, 2011 at 7:34 PM, Vitthal "Suhas" Gogate wrote: > Agree with Bejoy, although to minimize the processing latency you can still > cho

Re: Binary content

2011-09-01 Thread Mohit Anchlia
On Thu, Sep 1, 2011 at 1:25 AM, Dieter Plaetinck wrote: > On Wed, 31 Aug 2011 08:44:42 -0700 > Mohit Anchlia wrote: > >> Does map-reduce work well with binary contents in the file? This >> binary content is basically some CAD files and map reduce program need >> to

Binary content

2011-08-31 Thread Mohit Anchlia
Does map-reduce work well with binary contents in the file? This binary content is basically some CAD files, and the map reduce program needs to read these files using some proprietary tool, extract values and do some processing. Wondering if there are others doing similar type of processing. Best practice

Re: Question about RAID controllers and hadoop

2011-08-11 Thread Mohit Anchlia
On Thu, Aug 11, 2011 at 3:26 PM, Charles Wimmer wrote: > We currently use P410s in 12 disk system.  Each disk is set up as a RAID0 > volume.  Performance is at least as good as a bare disk. Can you please share what throughput you see with P410s? Are these SATA or SAS? > > > On 8/11/11 3:23 PM,

Re: maprd vs mapreduce api

2011-08-05 Thread Mohit Anchlia
On Fri, Aug 5, 2011 at 3:42 PM, Stevens, Keith D. wrote: > The Mapper and Reducer class in org.apache.hadoop.mapreduce implement the > identity function.  So you should be able to just do > > conf.setMapperClass(org.apache.hadoop.mapreduce.Mapper.class); > conf.setReducerClass(org.apache.hadoop.m

Re: Hadoop cluster network requirement

2011-08-01 Thread Mohit Anchlia
Assuming everything is up this solution still will not scale given the latency, tcpip buffers, sliding window etc. See BDP Sent from my iPad On Aug 1, 2011, at 4:57 PM, Michael Segel wrote: > > Yeah what he said. > Its never a good idea. > Forget about losing a NN or a Rack, but just losing c
