
Re: What's the best practice of loading logs into hdfs while using hive to do log analytics?

2012-02-07 Thread Xiaobin She
hi Bejoy and Alex,

thank you for your advice.

Actually I looked at Scribe first, and I have also heard of Flume.

I just read through Flume's user guide, and Flume seems promising. As Bejoy
said, the Flume collector can dump data into HDFS when the collector buffer
reaches a particular size or after a particular time interval. This is good,
and I think it can solve the problem of data delivery latency.

But what about compression?

From Flume's user guide I see that Flume supports compression of log files,
but if Flume does not wait until the collector has collected one hour of logs
before compressing and sending them to HDFS, then it will send only part of
the hourly log to HDFS, am I right?

So if I want to use these data in Hive (assume I have an external table in
Hive), I have to specify at least two partition keys while creating the table,
one for day-month-hour and one for some finer time interval like ten minutes,
and then add Hive partitions to the existing external table with the specified
partition keys.

Is the above process right?
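
(For illustration only: the table name, partition key names, and paths below
are hypothetical.)

CREATE EXTERNAL TABLE log_1234 (line STRING)
PARTITIONED BY (dt STRING, tenmin STRING)
LOCATION '/flume/log_1234';

ALTER TABLE log_1234 ADD PARTITION (dt='2012-02-07-10', tenmin='20')
LOCATION '/flume/log_1234/dt=2012-02-07-10/tenmin=20';

The hourly analysis could then still filter on dt alone, since that prunes to
all ten-minute sub-partitions of that hour.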

If this is right, then there could be some other problems, like the ten-minute
logs after compression not being big enough to fill the HDFS block size, which
may cause lots of small files (for some of our log ids this will be the case);
or if I set the time interval to half an hour, then at the end of the hour it
may still cause the data delivery latency problem.

This does not seem like a very good solution. Am I making some mistake or
misunderstanding something here?

thank you very much!





2012/2/7 alo alt wget.n...@googlemail.com

 Hi,

 a first start with flume:

 http://mapredit.blogspot.com/2011/10/centralized-logfile-management-across.html

 Facebook's Scribe could also work for you.

 - Alex

 --
 Alexander Lorenz
 http://mapredit.blogspot.com

 On Feb 7, 2012, at 11:03 AM, Xiaobin She wrote:

  Hi all,
 
  Sorry if it is not appropriate to send one thread to two mailing lists.
 
  I'm trying to use Hadoop and Hive to do some log analytics jobs.
 
  Our system generates lots of logs every day; for example, it produced about
  370GB of logs (spread across many log files) yesterday, and the volume grows
  every day.
 
  And we want to use Hadoop and Hive to replace our old log analysis system.
 
  We distinguish our logs by logid; we have a log collector which collects
  logs from clients and then generates log files.
 
  For every logid there will be one log file every hour, and for some logids
  this hourly log file can be 1~2GB.
 
  I have set up a test cluster with Hadoop and Hive, and I have run some
  tests whose results look good for us.
 
  For reference, we will create one table in Hive for every logid, partitioned
  by hour.
 
  Now I have a question: what's the best practice for loading log files into
  HDFS or the Hive warehouse dir?
 
  My first thought is: at the beginning of every hour, compress the last
  hour's log file of every logid and then use the hive command-line tool to
  load these compressed log files into HDFS,
 
  using commands like:  LOAD DATA LOCAL INPATH '$logname' OVERWRITE INTO
  TABLE $tablename PARTITION (dt='$h')
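 
  (For illustration only: the logid table name and file path below are
  hypothetical. Hive/Hadoop read gzipped text files transparently.)
 
  LOAD DATA LOCAL INPATH '/data/collector/1234/1234-2012020710.log.gz'
  OVERWRITE INTO TABLE log_1234 PARTITION (dt='2012020710');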
 
  I think this can work, and I have run some tests on our 3-node test cluster.
 
  But the problem is that there are lots of logids, which means there are lots
  of log files, so every hour we will have to load lots of files into HDFS.
  And there is another problem: we will run hourly analysis jobs on these
  hourly collected log files. Because there are so many log files, if we load
  them all at the same time at the beginning of every hour, I think there will
  be a burst of network traffic and a data delivery latency problem.
 
  By the data delivery latency problem, I mean it will take some time for the
  log files to be copied into HDFS, and this will cause our hourly log
  analysis job to start later.
 
  So I wanted to figure out whether we can write or append logs to a compressed
  file which is already located in HDFS; I have posted a thread about this on
  the mailing list, and from what I have learned, this is not possible.
 
 
  So, what's the best practice for loading logs into HDFS while using Hive to
  do log analytics?
 
  Or what are the common methods to handle the problem I have described above?
 
  Can anyone give me some advice?
 
  Thank you very much for your help!




Re: tasktracker keeps receiving KillJobAction and then deleting unknown jobs while using hive

2012-02-01 Thread Xiaobin She
hi Alex,

I'm using JRE 1.6.0_24

with Hadoop 0.20.0
Hive 0.8.0

thx


2012/2/1 alo alt wget.n...@googlemail.com

 Hi,

 + hdfs-user (bcc'd)

 Which JRE version do you use?

 - Alex

 --
 Alexander Lorenz
 http://mapredit.blogspot.com

 On Feb 1, 2012, at 8:16 AM, Xiaobin She wrote:

  Hi,
 
 
  I'm using hive to do some log analysis, and I have encountered a problem.
 
  My cluster has 3 nodes, one for the NameNode/JobTracker and the other two
  for DataNodes/TaskTrackers.
 
  One of the tasktrackers will repeatedly receive KillJobAction and then
  delete unknown jobs.
 
  the logs look like:
 
  2012-01-31 00:35:37,640 INFO org.apache.hadoop.mapred.TaskTracker:
 Received 'KillJobAction' for job: job_201201301055_0381
  2012-01-31 00:35:37,640 WARN org.apache.hadoop.mapred.TaskTracker:
 Unknown job job_201201301055_0381 being deleted.
  2012-01-31 00:36:22,697 INFO org.apache.hadoop.mapred.TaskTracker:
 Received 'KillJobAction' for job: job_201201301055_0383
  2012-01-31 00:36:22,698 WARN org.apache.hadoop.mapred.TaskTracker:
 Unknown job job_201201301055_0383 being deleted.
  2012-01-31 01:05:34,108 INFO org.apache.hadoop.mapred.TaskTracker:
 Received 'KillJobAction' for job: job_201201301055_0384
  2012-01-31 01:05:34,108 WARN org.apache.hadoop.mapred.TaskTracker:
 Unknown job job_201201301055_0384 being deleted.
  2012-01-31 01:07:43,280 INFO org.apache.hadoop.mapred.TaskTracker:
 Received 'KillJobAction' for job: job_201201301055_0385
  2012-01-31 01:07:43,280 WARN org.apache.hadoop.mapred.TaskTracker:
 Unknown job job_201201301055_0385 being deleted.
 
  This happens occasionally, and when it does, this tasktracker will do
  nothing but keep receiving KillJobAction and deleting unknown jobs, and thus
  performance drops.
 
  To work around this problem I have to restart the cluster, but obviously
  this is not a good solution.
 
  These jobs will eventually be run on the other tasktracker, where they run
  well and succeed.
 
  Has anybody encountered this problem before? Can you give me some advice?
 
  And occasionally there will be some error logs like:
 
  2012-01-31 13:11:40,183 INFO org.apache.hadoop.ipc.Server: IPC Server
 listener on 55837: readAndProcess threw exception java.io.IOException:
 Connection reset by peer. Count of bytes read: 0
  java.io.IOException: Connection reset by peer
  at sun.nio.ch.FileDispatcher.read0(Native Method)
  at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:21)
  at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:202)
  at sun.nio.ch.IOUtil.read(IOUtil.java:175)
  at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:243)
  at org.apache.hadoop.ipc.Server.channelRead(Server.java:1211)
  at org.apache.hadoop.ipc.Server.access$2300(Server.java:77)
  at
 org.apache.hadoop.ipc.Server$Connection.readAndProcess(Server.java:799)
  at org.apache.hadoop.ipc.Server$Listener.doRead(Server.java:419)
  at org.apache.hadoop.ipc.Server$Listener.run(Server.java:328)
  2012-01-31 13:11:40,211 INFO org.apache.hadoop.mapred.JvmManager: JVM :
 jvm_201201311041_0071_r_-1096994286 exited. Number of tasks it ran: 0
  2012-01-31 13:11:40,214 INFO org.apache.hadoop.mapred.TaskTracker:
 Killing unknown JVM jvm_201201311041_0071_r_-386575334
  2012-01-31 13:11:40,221 INFO org.apache.hadoop.ipc.Server: IPC Server
 listener on 55837: readAndProcess threw exception java.io.IOException:
 Connection reset by peer. Count of bytes read: 0
  java.io.IOException: Connection reset by peer
  at sun.nio.ch.FileDispatcher.read0(Native Method)
  at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:21)
  at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:202)
  at sun.nio.ch.IOUtil.read(IOUtil.java:175)
  at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:243)
  at org.apache.hadoop.ipc.Server.channelRead(Server.java:1211)
  at org.apache.hadoop.ipc.Server.access$2300(Server.java:77)
  at
 org.apache.hadoop.ipc.Server$Connection.readAndProcess(Server.java:799)
  at org.apache.hadoop.ipc.Server$Listener.doRead(Server.java:419)
  at org.apache.hadoop.ipc.Server$Listener.run(Server.java:328)
 
  Is there some connection between these two errors?
 
  thank you very much!
 
  xiaobin




Re: how to avoid scanning the same table multiple times?

2012-01-13 Thread Xiaobin She
hi,

I used the multiple-insert method and wrote a query like this:

from td
INSERT OVERWRITE  DIRECTORY '/tmp/total.out' select count(v1)
INSERT OVERWRITE  DIRECTORY '/tmp/totaldistinct.out' select count(distinct
v1)
INSERT OVERWRITE  DIRECTORY '/tmp/distinctuin.out' select distinct v1

INSERT OVERWRITE  DIRECTORY '/tmp/v4.out' select v4 , count(v1),
count(distinct v1) group by v4
INSERT OVERWRITE  DIRECTORY '/tmp/v3v4.out' select v3, v4 , count(v1),
count(distinct v1) group by v3, v4

INSERT OVERWRITE  DIRECTORY '/tmp/v426.out' select count(v1),
count(distinct v1)  where v4=2 or v4=6
INSERT OVERWRITE  DIRECTORY '/tmp/v3v426.out' select v3, count(v1),
count(distinct v1) where v4=2 or v4=6 group by v3

INSERT OVERWRITE  DIRECTORY '/tmp/v415.out' select count(v1),
count(distinct v1)  where v4=1 or v4=5
INSERT OVERWRITE  DIRECTORY '/tmp/v3v415.out' select v3, count(v1),
count(distinct v1) where v4=1 or v4=5 group by v3


it works, and the output result is what I want.

But there is one problem: Hive generates 9 MapReduce jobs and runs them one
by one.

I ran EXPLAIN on this query, and I got the following output:


STAGE DEPENDENCIES:
  Stage-9 is a root stage
  Stage-0 depends on stages: Stage-9
  Stage-10 depends on stages: Stage-9
  Stage-1 depends on stages: Stage-10
  Stage-11 depends on stages: Stage-9
  Stage-2 depends on stages: Stage-11
  Stage-12 depends on stages: Stage-9
  Stage-3 depends on stages: Stage-12
  Stage-13 depends on stages: Stage-9
  Stage-4 depends on stages: Stage-13
  Stage-14 depends on stages: Stage-9
  Stage-5 depends on stages: Stage-14
  Stage-15 depends on stages: Stage-9
  Stage-6 depends on stages: Stage-15
  Stage-16 depends on stages: Stage-9
  Stage-7 depends on stages: Stage-16
  Stage-17 depends on stages: Stage-9
  Stage-8 depends on stages: Stage-17


It seems that stages 9-17 correspond to MapReduce jobs 0-8, but from the
EXPLAIN output above, stages 10-17 only depend on stage 9.

So I have a question: why can't jobs 1-8 run concurrently?

Or how can I make jobs 1-8 run concurrently?
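
(For what it's worth, one thing I plan to experiment with is Hive's parallel
execution of independent stages; whether it helps on my setup is just an
assumption at this point:)

set hive.exec.parallel=true;
set hive.exec.parallel.thread.number=8;

If stages 10-17 really only depend on stage 9, this should let Hive launch
several of them at the same time.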

thank you very much for your help again!

xiaobin



On Jan 13, 2012 at 12:01 PM, Xiaobin She xiaobin...@gmail.com wrote:

 to Martin, Mark and Edward,

 thank you for your advice, I will try it out.

 And to Martin, by appropriate date format, do you mean something like
 2012011202 ?

 thanks!

 xiaobin

 On Jan 12, 2012 at 10:20 PM, Martin Kuhn martin.k...@affinitas.de wrote:

 Hi there,

  Select count(*), count(distinct u), type from t where plat=1 and
  dt='2012-1-12-02' group by type
  Select count(*), count(distinct u), type from t where (type=2 or type=6)
  and dt='2012-1-12-02' group by type;

  Is there a better way to do these queries?

 You could try something like this:

 SELECT
    type
  , count(*)
  , count(DISTINCT u)
  , count(CASE WHEN plat=1 THEN u ELSE NULL END)
  , count(DISTINCT CASE WHEN plat=1 THEN u ELSE NULL END)
  , count(CASE WHEN (type=2 OR type=6) THEN u ELSE NULL END)
  , count(DISTINCT CASE WHEN (type=2 OR type=6) THEN u ELSE NULL END)
 FROM
    t
 WHERE
    dt IN ('2012-1-12-02', '2012-1-12-03')
 GROUP BY
    type
 ORDER BY
    type
 ;

 Good luck :)
 Martin Kuhn


 P.S.  You've got a strange date format there. For sorting purposes it would
 be more appropriate to use something like 2012-01-12-02.





how to avoid scanning the same table multiple times?

2012-01-12 Thread Xiaobin She
Hello, everyone,

I'm new to Hive, and I have some questions.

I have a table like this:

create table t(id int, time string, ip string, u bigint, ret int, plat int,
type int, u2 bigint, ver int)  PARTITIONED BY(dt STRING)   ROW FORMAT
DELIMITED FIELDS TERMINATED BY ','  lines TERMINATED BY '\n' ;

and I will run lots of queries on this table based on different values of the
columns, like:


Select count(*), count(distinct u), type from t where plat=1 and
dt='2012-1-12-02' group by type;

Select count(*), count(distinct u), type from t where plat=2 and
dt='2012-1-12-02' group by type;

Select count(*), count(distinct u), type from t where (type=2 or type=6) and
dt='2012-1-12-02' group by type;

Select count(*), count(distinct u), type from t where (type=1 or type=5) and
dt='2012-1-12-02' group by type;

Select count(*), count(distinct u), type from t where (type=1 or type=5) and
(dt='2012-1-12-02' or dt='2012-1-12-03') group by type;

But these queries seem inefficient, because they query the same table
multiple times, and that means Hive will scan the same files many times.

And my question is: how can I avoid this?
Is there a better way to do these queries?

Thank you very much for your help!


Re: how to avoid scanning the same table multiple times?

2012-01-12 Thread Xiaobin She
to Martin, Mark and Edward,

thank you for your advice, I will try it out.

And to Martin, by appropriate date format, do you mean something like
2012011202 ?

thanks!

xiaobin

On Jan 12, 2012 at 10:20 PM, Martin Kuhn martin.k...@affinitas.de wrote:

 Hi there,

  Select count(*), count(distinct u), type from t where plat=1 and
  dt='2012-1-12-02' group by type
  Select count(*), count(distinct u), type from t where (type=2 or type=6)
  and dt='2012-1-12-02' group by type;

  Is there a better way to do these queries?

 You could try something like this:

 SELECT
    type
  , count(*)
  , count(DISTINCT u)
  , count(CASE WHEN plat=1 THEN u ELSE NULL END)
  , count(DISTINCT CASE WHEN plat=1 THEN u ELSE NULL END)
  , count(CASE WHEN (type=2 OR type=6) THEN u ELSE NULL END)
  , count(DISTINCT CASE WHEN (type=2 OR type=6) THEN u ELSE NULL END)
 FROM
    t
 WHERE
    dt IN ('2012-1-12-02', '2012-1-12-03')
 GROUP BY
    type
 ORDER BY
    type
 ;

 Good luck :)
 Martin Kuhn


 P.S.  You've got a strange date format there. For sorting purposes it would
 be more appropriate to use something like 2012-01-12-02.