Re: Problem reading file with spooling directory

2013-08-07 Thread Jagadish Bihani
I too had the same problem (in flume 1.4). We had checked that the input data is actually utf-8. When we set the input charset to 'unicode' it worked. By "worked" I mean it didn't give this exception, but at the destination the data was garbage for us. Is this a known issue, or are we missing something?
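
For reference, a minimal sketch of the setting under discussion, in Flume properties form (agent and source names hypothetical, path illustrative):

    agent.sources.spoolSrc.type = spooldir
    agent.sources.spoolSrc.spoolDir = /var/spool/flume
    # Character set used when the source treats input files as text.
    # The thread's workaround was changing this from UTF-8.
    agent.sources.spoolSrc.inputCharset = UTF-8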

Apache flume 1.4.0 - Spooling directory issue

2013-07-18 Thread Jagadish Bihani
Hi I am using the spooling directory source with apache flume 1.4.0 and am having the problem that *the same configuration works on some machines and doesn't work on others.* The configuration used to work with flume 1.3.1 (only the properties related to the deserializer are changed). *Configuration
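
A sketch of the deserializer-related properties the poster says changed between 1.3.1 and 1.4.0 (names hypothetical; values are the 1.4 defaults):

    agent.sources.spoolSrc.deserializer = LINE
    # Maximum characters per event for the LINE deserializer;
    # longer lines are truncated.
    agent.sources.spoolSrc.deserializer.maxLineLength = 2048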

HDFS sink data loss possible?

2013-05-29 Thread Jagadish Bihani
Hi Based on our observations of our production flume setup: we have seen the file roll sink deliver almost 1% more events per day than the HDFS sink. (We have a replicating setup and two different file channels for the sinks). Configuration: Flume version: 1.3.1 Flume
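
A sketch of the replicating setup described, with one source fanning out to two file channels that feed the two sinks being compared (all names hypothetical; the exec source is an assumption, since the snippet does not name the source type):

    agent.sources = src
    agent.channels = fcHdfs fcRoll
    agent.sinks = hdfsSink rollSink
    agent.sources.src.type = exec
    agent.sources.src.command = tail -F /var/log/app.log
    # Replicating is the default selector; shown explicitly for clarity.
    agent.sources.src.selector.type = replicating
    agent.sources.src.channels = fcHdfs fcRoll
    agent.channels.fcHdfs.type = file
    agent.channels.fcRoll.type = file
    agent.sinks.hdfsSink.type = hdfs
    agent.sinks.hdfsSink.channel = fcHdfs
    agent.sinks.rollSink.type = file_roll
    agent.sinks.rollSink.channel = fcRoll

Comparing daily event counts between the file_roll output and the HDFS output of such a topology is how the ~1% gap above was observed.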

Re: "single source - multi channel" scenario and applying interceptor while writing to only one channel and not on others...possible approaches

2013-04-23 Thread Jagadish Bihani
zer to modify the event. -Jeff On Tue, Apr 16, 2013 at 11:12 PM, Jagadish Bihani <jagadish.bih...@pubmatic.com> wrote: Hi If anybody has any inputs on this, that will surely help. Regards, Jagadish

Re: "single source - multi channel" scenario and applying interceptor while writing to only one channel and not on others...possible approaches

2013-04-16 Thread Jagadish Bihani
Hi If anybody has any inputs on this, that will surely help. Regards, Jagadish On 04/16/2013 12:06 PM, Jagadish Bihani wrote: Hi We have a use case in which 1. spooling source reads data. 2. It needs to write events into multiple channels. It should apply an interceptor only when putting into

"single source - multi channel" scenario and applying interceptor while writing to only one channel and not on others...possible approaches

2013-04-15 Thread Jagadish Bihani
Hi We have a use case in which 1. spooling source reads data. 2. It needs to write events into multiple channels. It should apply an interceptor only when putting into one channel, and should put the event through unchanged when putting into the other channel. Possible approach we have thought of: 1. Create 2

Re: In flume-ng is there any advantages of 2-tier topology in a cluster of 30-40 nodes?

2013-02-01 Thread Jagadish Bihani
than what is written in point 1). Regards, Jagadish On 02/01/2013 06:45 PM, Alexander Alten-Lorenz wrote: Ah, I missed your response. Inline On Jan 30, 2013, at 3:43 PM, Jagadish Bihani wrote: Hi Thanks Alexander for the reply. I have added my thoughts inline. On 01/30/2013 11:56 A

Re: In flume-ng is there any advantages of 2-tier topology in a cluster of 30-40 nodes?

2013-02-01 Thread Jagadish Bihani
, Jagadish On 01/30/2013 08:13 PM, Jagadish Bihani wrote: Hi Thanks Alexander for the reply. I have added my thoughts inline. On 01/30/2013 11:56 AM, Alexander Alten-Lorenz wrote: Hi, If the agents (Tier 1) have access to HDFS, each single client can put data into HDFS. But this doesn't

Re: In flume-ng is there any advantages of 2-tier topology in a cluster of 30-40 nodes?

2013-01-30 Thread Jagadish Bihani
n then its data won't reach HDFS. Similarly in the 2-tier scenario: if a node from the 1st tier goes down then its data won't reach HDFS. Could you please elaborate if I am missing something? Cheers, Alex On Jan 30, 2013, at 7:05 AM, Jagadish Bihani wrote: Hi In our scenario there are around 3

In flume-ng is there any advantages of 2-tier topology in a cluster of 30-40 nodes?

2013-01-29 Thread Jagadish Bihani
Hi In our scenario there are around 30 machines from which we want to put data into HDFS. The approach we initially thought of was: 1. First tier: Agents which collect data from the source and then pass it to an avro sink. 2. Second tier: Let's call those agents 'collectors'; they collect data fr
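
A sketch of the two tiers described (host names, ports, and paths hypothetical):

    # Tier 1: runs on each of the ~30 source machines
    tier1.sources = logSrc
    tier1.channels = fc1
    tier1.sinks = avroOut
    tier1.sources.logSrc.type = exec
    tier1.sources.logSrc.command = tail -F /var/log/app.log
    tier1.sources.logSrc.channels = fc1
    tier1.channels.fc1.type = file
    tier1.sinks.avroOut.type = avro
    tier1.sinks.avroOut.hostname = collector.example.com
    tier1.sinks.avroOut.port = 4141
    tier1.sinks.avroOut.channel = fc1

    # Tier 2: the 'collector' agent in front of HDFS
    tier2.sources = avroIn
    tier2.channels = fc2
    tier2.sinks = hdfsOut
    tier2.sources.avroIn.type = avro
    tier2.sources.avroIn.bind = 0.0.0.0
    tier2.sources.avroIn.port = 4141
    tier2.sources.avroIn.channels = fc2
    tier2.channels.fc2.type = file
    tier2.sinks.hdfsOut.type = hdfs
    tier2.sinks.hdfsOut.hdfs.path = hdfs://namenode/flume/events
    tier2.sinks.hdfsOut.channel = fc2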

Efficient way of handling of duplicates/Deduplication strategy

2012-12-23 Thread Jagadish Bihani
Hi I was thinking of 2 possible approaches for this: Approach 1. Deduplication at the destination, using a spooling dir source - file channel - hdfs sink combination: -- After the HDFS sink has written to the HDFS directory, we can run

Re: Recommendation of parameters for better performance with File Channel

2012-12-18 Thread Jagadish Bihani
7:36 AM, Brock Noland wrote: Hi, Why not try increasing the batch size on the source and sink to 10,000? Brock On Wed, Dec 12, 2012 at 4:08 AM, Jagadish Bihani <jagadish.bih...@pubmatic.com> wrote: I am using the latest release of flume (Flume 1.3.0) and hadoop 1.0.3. On 12/12/201

Re: Recommendation of parameters for better performance with File Channel

2012-12-12 Thread Jagadish Bihani
I am using the latest release of flume (Flume 1.3.0) and hadoop 1.0.3. On 12/12/2012 03:35 PM, Jagadish Bihani wrote: Hi I am able to write at most 1.5 MB/sec to HDFS (without compression) using the File Channel. Are there any recommendations to improve the performance? Has anybody achieved

Recommendation of parameters for better performance with File Channel

2012-12-12 Thread Jagadish Bihani
Hi I am able to write at most 1.5 MB/sec to HDFS (without compression) using the File Channel. Are there any recommendations to improve the performance? Has anybody achieved around 10 MB/sec with the file channel? If yes, please share the configuration (hardware used, RAM allocated, and batch
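
The levers that usually matter here, as a fragment to merge into an existing config (paths and sizes illustrative, not a verified tuning):

    # Keep checkpoint and data directories on separate physical disks.
    agent.channels.fc.type = file
    agent.channels.fc.checkpointDir = /disk1/flume/checkpoint
    agent.channels.fc.dataDirs = /disk2/flume/data
    # The channel's transactionCapacity must cover the largest batch
    # taken from it, so raise both together.
    agent.channels.fc.transactionCapacity = 10000
    agent.sinks.hdfsSink.hdfs.batchSize = 10000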

Re: Flume bz2 issue while processing by a map reduce job

2012-11-03 Thread Jagadish Bihani
container. The latter two are splittable and properly handle several compression codecs, including Snappy, which is a great way to go if you can do it. Regards, Mike On Fri, Nov 2, 2012 at 12:50 AM, Jagadish Bihani <jagadish.bih...@pubmatic.com> wrote: Hi Any inputs
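
A fragment reflecting that advice, assuming the HDFS sink in question (SequenceFile is a splittable container format, and hdfs.codeC selects the codec used inside it):

    agent.sinks.hdfsSink.hdfs.fileType = SequenceFile
    agent.sinks.hdfsSink.hdfs.codeC = snappy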

Re: Flume bz2 issue while processing by a map reduce job

2012-11-02 Thread Jagadish Bihani
Hi Any inputs on this? It looks like a basic thing which, I guess, must have been handled in flume. On 10/30/2012 10:31 PM, Jagadish Bihani wrote: Text. A few updates on that: -- It looks like some header issue. -- When I copyToLocal the file and then copy it back to HDFS, the map reduce job

Re: Flume bz2 issue while processing by a map reduce job

2012-10-30 Thread Jagadish Bihani
/2012 09:15 PM, Brock Noland wrote: What kind of files is your sink writing out? Text, Sequence, etc? On Fri, Oct 26, 2012 at 8:02 AM, Jagadish Bihani wrote: Same thing happens even for gzip. Regards, Jagadish On 10/26/2012 04:30 PM, Jagadish Bihani wrote: Hi I have a very peculiar scenario

Re: Flume compression peculiar behaviour while processing compressed files by a map reduce job

2012-10-29 Thread Jagadish Bihani
Does anyone have any inputs on why the behaviour mentioned below might have happened? On 10/26/2012 06:32 PM, Jagadish Bihani wrote: The same thing happens even for gzip. Regards, Jagadish On 10/26/2012 04:30 PM, Jagadish Bihani wrote: Hi I have a very peculiar scenario. 1. My HDFS sink

Re: Flume bz2 issue while processing by a map reduce job

2012-10-26 Thread Jagadish Bihani
The same thing happens even for gzip. Regards, Jagadish On 10/26/2012 04:30 PM, Jagadish Bihani wrote: Hi I have a very peculiar scenario. 1. My HDFS sink creates a bz2 file. The file is perfectly fine; I can decompress it and read it. It has 0.2 million records. 2. Now I give that file to a map

Flume bz2 issue while processing by a map reduce job

2012-10-26 Thread Jagadish Bihani
Hi I have a very peculiar scenario. 1. My HDFS sink creates a bz2 file. The file is perfectly fine; I can decompress it and read it. It has 0.2 million records. 2. Now I give that file to a map-reduce job (hadoop 1.0.3) and surprisingly it only reads the first 100 records. 3. I then decompress the sam

Re: File Channel performance and fsync

2012-10-22 Thread Jagadish Bihani
on, Oct 22, 2012 at 8:59 AM, Brock Noland <br...@cloudera.com> wrote: Which version? 1.2 or trunk? On Monday, October 22, 2012 at 8:18 AM, Jagadish Bihani wrote: Hi This is the simplistic configuration with which I am getting lower performance. Even wi

Re: File Channel performance and fsync

2012-10-22 Thread Jagadish Bihani
riting in this case). Hope it's useful for you. PS: I heard the OS has a daemon thread that flushes the page cache to disk asynchronously with about a second of latency; is that effective for data where some loss is tolerable? -Regards Denny Ye 2012/10/22 Jagadish Bihani <jagadish.bih...@pubm

Re: File Channel performance and fsync

2012-10-22 Thread Jagadish Bihani
int iterations = len/PAGESIZE; int i; struct timeval t0,t1; for(i=0;i Hi, On Wed, Oct 10, 2012 at 11:22 AM, Jagadish Bihani <jagadish.bih...@pubmatic.com> wrote: Hi Brock I will surely look into 'fsync lie

File Channel performance and fsync

2012-10-22 Thread Jagadish Bihani
16 core processors, 16 GB RAM etc.) Regards, Jagadish On 10/10/2012 11:30 PM, Brock Noland wrote: Hi, On Wed, Oct 10, 2012 at 11:22 AM, Jagadish Bihani wrote: Hi Brock I will surely look into 'fsync lies'. But as per my experiments I think "file channel" is causing t

Re: Flume throughput correlation with RAM

2012-10-10 Thread Jagadish Bihani
what I am referring to. A spinning disk can do about 100 fsync operations per second (one is done at the end of every batch). That is how I estimated your event size: at 40KB/second, that is 40KB / 100 = ~409 bytes per event. Once again, if you want increased performance, you should increase the batch size. Brock
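
By that arithmetic, throughput is roughly batch size x event size x fsyncs/sec, so a 1000-event batch at ~400 bytes and ~100 fsyncs/sec gives on the order of 40 MB/sec rather than 40 KB/sec. A fragment showing the change (names and values illustrative):

    agent.sinks.hdfsSink.hdfs.batchSize = 1000
    # The file channel commits (and fsyncs) once per transaction, so its
    # transactionCapacity must be at least the sink's batch size.
    agent.channels.fc.transactionCapacity = 1000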

Re: Flume throughput correlation with RAM

2012-10-10 Thread Jagadish Bihani
Hi Yes. It is around 480 - 500 bytes. On 10/10/2012 09:24 PM, Brock Noland wrote: How big are your events? Average about 400 bytes? Brock On Wed, Oct 10, 2012 at 5:11 AM, Jagadish Bihani wrote: Hi Thanks for the inputs Brock. After doing several experiments the problem eventually boiled down

Re: Flume throughput correlation with RAM

2012-10-10 Thread Jagadish Bihani
our data is actually written to the drive. If you search for "fsync lies" you'll find more information on this. You probably want to increase the batch size to get better performance. Brock On Tue, Oct 9, 2012 at 2:46 AM, Jagadish Bihani wrote: Hi My flume setup is: Source Agent

Flume throughput correlation with RAM

2012-10-09 Thread Jagadish Bihani
Hi My flume setup is: Source Agent: cat source - File Channel - Avro Sink. Dest Agent: avro source - File Channel - HDFS Sink. There is only 1 source agent and 1 destination agent. I measure throughput as the amount of data written to HDFS per second. (I have a rolling interval of 30 sec, so if 6

HDFS sink - Property: hdfs.callTimeout

2012-10-03 Thread Jagadish Bihani
Hi What are the implications of the property "hdfs.callTimeout"? What adverse effects might changing it have? I am getting a timeout exception as: Noted checkpoint for file: /home/hadoop/flume_channel/dataDir15/log-21, id: 21, checkpoint position: 1576210481 12/10/03 23:19:45 INFO file.LogFile
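
For reference, a fragment raising the timeout (value illustrative; the property is in milliseconds and bounds the HDFS open/write/flush/close calls made by the sink):

    # Default is 10000 ms; raising it tolerates slow HDFS calls at the
    # cost of the sink blocking longer before giving up.
    agent.sinks.hdfsSink.hdfs.callTimeout = 30000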

Re: HDFS sink Bucketwriter working

2012-09-26 Thread Jagadish Bihani
en go with 1000; depending on the application you may want to go lower or higher. Regards, Mike On Wed, Sep 26, 2012 at 8:23 PM, Jagadish Bihani <jagadish.bih...@pubmatic.com> wrote: Hi I had a few doubts about the HDFS sink BucketWriter: -- How does HDFS's bu

HDFS sink Bucketwriter working

2012-09-26 Thread Jagadish Bihani
Hi I had a few doubts about the HDFS sink BucketWriter: -- How does the HDFS sink's BucketWriter work? What criteria does it use to create another bucket? -- Creation of a file in HDFS is a function of how many parameters? Initially I thought it was a function of only the rolling parameters (interval/size). But appa
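
The knobs in question, as a sketch (path hypothetical): each distinct escaped path is a separate bucket with its own BucketWriter and open file, and within a bucket the roll parameters decide when a new file starts.

    agent.sinks.hdfsSink.hdfs.path = hdfs://namenode/logs/%Y-%m-%d/%H
    # Roll a bucket's file every 30 s or at ~128 MB, whichever comes
    # first; a value of 0 disables a trigger.
    agent.sinks.hdfsSink.hdfs.rollInterval = 30
    agent.sinks.hdfsSink.hdfs.rollSize = 134217728
    agent.sinks.hdfsSink.hdfs.rollCount = 0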

Re: HDFS SINK "txnEventMax" property

2012-09-26 Thread Jagadish Bihani
But there already exists a different property called batchSize. -Harish On Wed, Sep 26, 2012 at 7:30 AM, Brock Noland <br...@cloudera.com> wrote: A better name for that property would be batchSize. Brock On Wed, Sep 26, 2012 at 5:13 AM, Jagadish Bihani

HDFS SINK "txnEventMax" property

2012-09-26 Thread Jagadish Bihani
Hi What is the significance of this property? I think because of this property almost 100 files are being created within a particular rolling interval instead of 1. If I set it to 1, what performance penalty might it cause? Regards, Jagadish
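
In the 1.2-era HDFS sink this property set how many events are taken from the channel per transaction, i.e. effectively a batch size. A fragment with the two related knobs (values illustrative):

    # Events per channel transaction; 1 means one transaction (and, on a
    # file channel, one fsync) per event, which is very slow.
    agent.sinks.hdfsSink.hdfs.txnEventMax = 1000
    # Events written before the sink flushes to HDFS.
    agent.sinks.hdfsSink.hdfs.batchSize = 1000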

File Channel related Exception

2012-09-20 Thread Jagadish Bihani
Hi In my flume agent with "Avro source - File Channel - HDFS sink" I am getting the following exception: *java.io.IOException: Failed to obtain lock for writing to the log. Try increasing the log write timeout value or disabling it by setting it to 0. [channel=fileChannel]* when I set my file r
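
The setting the exception message points at, as a fragment; the key below is the 1.x file channel's write timeout and is assumed to be in seconds (treat the name and unit as an assumption, not a verified reference):

    # Time a put/take may wait for the log write lock; 0 disables it.
    agent.channels.fileChannel.write-timeout = 30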

Re: HDFS file rolling behaviour

2012-09-19 Thread Jagadish Bihani
he default limit of 1024 is too low for nearly every modern application. In regards to the rolling, can you paste your config and describe in more detail the unexpected behavior you are seeing? Brock On Tue, Sep 18, 2012 at 7:08 AM, Jagadish Bihani <jagadish.bih...@pubmatic.com>

HDFS file rolling behaviour

2012-09-18 Thread Jagadish Bihani
seconds. Is it something to do with the HDFS batch size? Regards, Jagadish Original Message Subject: HDFS file rolling behaviour Date: Thu, 13 Sep 2012 14:26:56 +0530 From: Jagadish Bihani To: user@flume.apache.org Hi I use two flume agents: 1. flume_agent 1 which is

HDFS file rolling behaviour

2012-09-13 Thread Jagadish Bihani
Hi I use two flume agents: 1. flume_agent 1, the source, with (exec source - file channel - avro sink) 2. flume_agent 2, the destination, with (avro source - file channel - HDFS sink) I have observed that for the HDFS sink with rolling by *file size/number of events* it creates a lot of simulta
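
When rolling by size or event count only, a fragment zeroing the unused triggers so they do not also fire (values illustrative):

    agent.sinks.hdfsSink.hdfs.rollInterval = 0
    agent.sinks.hdfsSink.hdfs.rollSize = 0
    # Roll after this many events; the other two triggers are disabled.
    agent.sinks.hdfsSink.hdfs.rollCount = 100000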

Re: Flume -ng startup failed

2012-09-10 Thread Jagadish Bihani
to do that in roughly 4 spots throughout the script. That cleared it up for me. I'd love to hear of a better way to do that though :) Chris On Mon, Sep 10, 2012 at 9:26 AM, Jagadish Bihani <jagadish.bih...@pubmatic.com> wrote: Hi My flume 1.2.0 setup is working f

Flume -ng startup failed

2012-09-10 Thread Jagadish Bihani
Hi My flume 1.2.0 setup is working fine on one machine. But when I ran it on another machine it gave me a syntax error while starting the agent: "bin/flume-ng: line 81: syntax error in conditional expression: unexpected token `(' bin/flume-ng: line 81: syntax error near `^java\.library\.path=(.' bin

Re: Flume netcat source related problems

2012-09-04 Thread Jagadish Bihani
'm not entirely sure as I can't dive deeply into the source right now. I wouldn't be surprised if it's some kind of congestion problem and a lack of logging (or your log levels are just too high; try switching them to INFO or DEBUG?) that will be resolved once you get the throughput up.

Flume netcat source related problems

2012-09-04 Thread Jagadish Bihani
Hi I encountered a problem in my scenario with the netcat source. The setup is Host A: Netcat source - file channel - avro sink; Host B: Avro source - file channel - HDFS sink. But to simplify it I have created a single agent with "Netcat Source" and "file roll sink". It is: Host A: Netcat source - fi
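
The simplified single-agent setup, as a sketch (port and directory hypothetical):

    agent.sources = nc
    agent.channels = fc
    agent.sinks = roll
    agent.sources.nc.type = netcat
    agent.sources.nc.bind = 0.0.0.0
    agent.sources.nc.port = 44444
    agent.sources.nc.channels = fc
    agent.channels.fc.type = file
    agent.sinks.roll.type = file_roll
    # file_roll starts a new output file every 30 s by default.
    agent.sinks.roll.sink.directory = /tmp/flume-out
    agent.sinks.roll.channel = fc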

Re: flume-ng agent startup problem

2012-08-14 Thread Jagadish Bihani
Thanks, Jagadish On 08/10/2012 03:30 PM, Jagadish Bihani wrote: Hi Thanks all for the inputs. After the ini

Re: flume-ng agent startup problem

2012-08-11 Thread Jagadish Bihani
adish On 08/10/2012 03:30 PM, Jagadish Bihani wrote: Hi Thanks all for the inputs. After the initial problem I was able to start flume except in one scenario in which I use HDFS as sink. I have a production machine

Re: flume-ng agent startup problem

2012-08-10 Thread Jagadish Bihani
't attached logs or error messages, it's hard to say what happened. best - Alex Jagadish Bihani wrote:

flume-ng agent startup problem

2012-08-08 Thread Jagadish Bihani
Hi I have downloaded the tarball of the latest flume-ng 1.2.0. I have JAVA_HOME properly set. To begin with I have followed the instructions in "https://cwiki.apache.org/FLUME/getting-started.html" as is. And even for that basic example my flume agent gets stuck, printing the following output an

doubt in exec source specifically in tail -F

2012-07-27 Thread Jagadish Bihani
Hi In Flume-ng, is there any way, using exec (tail -F) as the source, to get only the new lines being added to the log file? (i.e. there is a growing log file and we want to transfer all the logs using flume without duplication.) I understand that if something fails, and as tail does
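
The source in question, as a fragment; because the exec source keeps no offset state for the command it runs, tail -F cannot guarantee no duplicates or gaps across restarts, which is exactly the concern raised above:

    agent.sources.tailSrc.type = exec
    # tail -F follows the file across rotation, but flume stores no
    # offset, so a restart can re-read or miss lines.
    agent.sources.tailSrc.command = tail -F /var/log/app.log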

Re: flume 1.2.0 bzip2 codec problem

2012-07-26 Thread Jagadish Bihani
to check configuration property io.compression.codecs for inclusion of org.apache.hadoop.io.compress.BZip2Codec. Jarcec On Jul 25, 2012, at 12:20 PM, Jagadish Bihani wrote: Hi I have downloaded and deployed latest flume code from repository "https://svn.apache.org/repos/asf/flume/trunk/"

flume 1.2.0 bzip2 codec problem

2012-07-25 Thread Jagadish Bihani
Hi I have downloaded and deployed the latest flume code from the repository "https://svn.apache.org/repos/asf/flume/trunk/". In the conf file I am using the following properties for the hdfs sink: agent.sinks.hdfsSink.type = hdfs agent.sinks.hdfsSink.hdfs.path= agent.sinks.hdfsSink.hdfs.fileType =Compres
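
A sketch of the compressed-stream settings around the truncated snippet (path hypothetical):

    agent.sinks.hdfsSink.type = hdfs
    agent.sinks.hdfsSink.hdfs.path = hdfs://namenode/flume/events
    agent.sinks.hdfsSink.hdfs.fileType = CompressedStream
    # Requires org.apache.hadoop.io.compress.BZip2Codec on the classpath
    # and listed in io.compression.codecs (see the reply above).
    agent.sinks.hdfsSink.hdfs.codeC = bzip2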

flume-ng failure recovery

2012-07-17 Thread Jagadish Bihani
Hi We want to deploy flume-ng in the production environment of our organization. Here is the scenario for which I am not able to find an answer: 1. We receive logs using a 'tail -f' source. 2. Now the agent process gets killed. 3. We restart it. 4. How will the restarted agent