QJM and dfs.namenode.edits.dir

2013-07-16 Thread lei liu
When I use QJM for HA, do I need to save the edit log on the local filesystem as well?

I think QJM already provides high availability for the edit log, so I don't
need to configure dfs.namenode.edits.dir.


Thanks,

LiuLei


RE: spawn maps without any input data - hadoop streaming

2013-07-16 Thread Devaraj k
Hi Austin,

Here the number of maps for a Job depends on the splits returned by the 
InputFormat.getSplits() API. We can have an input format which decides the 
number of maps (by returning that many splits) for a Job according to the need.

If we use FileInputFormat, the number of splits depends on the input data for 
the Job; that's why you see that the number of mappers is proportional to the Job input size.

http://hadoop.apache.org/docs/stable/api/org/apache/hadoop/mapreduce/InputFormat.html#getSplits(org.apache.hadoop.mapreduce.JobContext)
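For illustration, a minimal sketch of such an input format (hypothetical class and
configuration-key names, written against the new org.apache.hadoop.mapreduce API
documented at the link above) could return a fixed number of empty splits so a job
runs N map tasks without reading any input:

import java.io.DataInput;
import java.io.DataOutput;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapreduce.InputFormat;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;

public class EmptySplitsInputFormat extends InputFormat<NullWritable, NullWritable> {

  // Hypothetical configuration key: the number of map tasks to launch.
  public static final String NUM_MAPS_KEY = "emptysplits.num.maps";

  @Override
  public List<InputSplit> getSplits(JobContext context) {
    int numMaps = context.getConfiguration().getInt(NUM_MAPS_KEY, 1);
    List<InputSplit> splits = new ArrayList<InputSplit>(numMaps);
    for (int i = 0; i < numMaps; i++) {
      splits.add(new EmptySplit());   // one dummy split per desired map task
    }
    return splits;
  }

  @Override
  public RecordReader<NullWritable, NullWritable> createRecordReader(
      InputSplit split, TaskAttemptContext context) {
    // A reader that delivers a single empty record, so each mapper runs once
    // with no real input and can generate its own data.
    return new RecordReader<NullWritable, NullWritable>() {
      private boolean delivered = false;
      @Override public void initialize(InputSplit s, TaskAttemptContext c) { }
      @Override public boolean nextKeyValue() {
        if (delivered) return false;
        delivered = true;
        return true;
      }
      @Override public NullWritable getCurrentKey() { return NullWritable.get(); }
      @Override public NullWritable getCurrentValue() { return NullWritable.get(); }
      @Override public float getProgress() { return delivered ? 1.0f : 0.0f; }
      @Override public void close() { }
    };
  }

  // A split with no data; it must be Writable so the framework can serialize it.
  public static class EmptySplit extends InputSplit implements Writable {
    @Override public long getLength() { return 0; }
    @Override public String[] getLocations() { return new String[0]; }
    @Override public void write(DataOutput out) { }
    @Override public void readFields(DataInput in) { }
  }
}

With something along these lines, the map count could be taken from that
configuration key (for example -D emptysplits.num.maps=20 via the generic options),
independent of any input data. Note that streaming on Hadoop 1.x may instead expect
an input format written against the older org.apache.hadoop.mapred interfaces.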

Thanks
Devaraj k

From: Austin Chungath [mailto:austi...@gmail.com]
Sent: 16 July 2013 14:40
To: user@hadoop.apache.org
Subject: spawn maps without any input data - hadoop streaming

Hi,

I am trying to generate random data using hadoop streaming & python. It's a map 
only job and I need to run a number of maps. There is no input to the map as 
it's just going to generate random data.

How do I specify the number of maps to run? (I am confused here because, if I 
am not wrong, the number of maps spawned is related to the input data size.)
Any ideas as to how this can be done?

Warm regards,
Austin


RE: Incrementally adding to existing output directory

2013-07-16 Thread Devaraj k
Hi Max,

  It can be done by customizing the output format class for your Job according 
to your expectations. You could refer to the 
OutputFormat.checkOutputSpecs(JobContext context) method, which checks the output 
specification. You can override this in your custom OutputFormat. You can also 
see the MultipleOutputs class for implementation details of how it could be done.
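As a rough sketch (hypothetical class name, assuming the new-API TextOutputFormat
base class), an output format that tolerates an existing output directory could
override checkOutputSpecs like this:

import java.io.IOException;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.InvalidJobConfException;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class AppendingTextOutputFormat<K, V> extends TextOutputFormat<K, V> {

  @Override
  public void checkOutputSpecs(JobContext job) throws IOException {
    // Keep the "an output path must be set" check, but skip the
    // FileAlreadyExistsException that the default implementation throws
    // when the output directory already exists.
    Path outDir = getOutputPath(job);
    if (outDir == null) {
      throw new InvalidJobConfException("Output directory not set.");
    }
  }
}

The job would then use it via job.setOutputFormatClass(AppendingTextOutputFormat.class).
Note that two runs writing the default part-r-NNNNN file names into the same leaf
directory would still collide, so unique sub-directories (as in the dt=... layout in
the quoted mail below) or MultipleOutputs naming still matter.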

Thanks
Devaraj k

From: Max Lebedev [mailto:ma...@actionx.com]
Sent: 16 July 2013 23:33
To: user@hadoop.apache.org
Subject: Incrementally adding to existing output directory

Hi
I'm trying to figure out how to incrementally add to an existing output 
directory using MapReduce.
I cannot specify the exact output path, as data in the input is sorted into 
categories and then written to different directories based on the contents. (In 
the examples below, token= or token=)
As an example:
When using MultipleOutput and provided that outDir does not exist yet, the 
following will work:
hadoop jar myMR.jar --input-path=inputDir/dt=2013-05-03/* --output-path=outDir
The result will be:
outDir/token=/dt=2013-05-03/
outDir/token=/dt=2013-05-03/
However, the following will fail because outDir already exists. Even though I 
am copying new inputs.
hadoop jar myMR.jar  --input-path=inputDir/dt=2013-05-04/* --output-path=outDir
will throw FileAlreadyExistsException
What I would expect is that it adds
outDir/token=/dt=2013-05-04/
outDir/token=/dt=2013-05-04/
Another possibility would be the following hack but it does not seem to be very 
elegant:
hadoop jar myMR.jar --input-path=inputDir/2013-05-04/* --output-path=tempOutDir
then copy from tempOutDir to outDir
Is there a better way to address incrementally adding to an existing hadoop 
output directory?


Re: While inserting data into Hive, why am I not able to query?

2013-07-16 Thread Nitin Pawar
Samir, try running the UNLOCK TABLE command and see if it works.


On Tue, Jul 16, 2013 at 8:42 PM, Alan Gates  wrote:

> This question should be sent to u...@hive.apache.org.
>
> Alan.
>
> On Jul 16, 2013, at 3:23 AM, samir das mohapatra wrote:
>
> > Dear All,
> >   Did anyone face this issue:
> >   While loading a huge dataset into a hive table, hive is restricting me
> > from querying the same table.
> >
> >   I have set hive.support.concurrency=true, but it still shows
> >
> > conflicting lock present for TABLENAME  mode SHARED
> >
> > <property>
> >   <name>hive.support.concurrency</name>
> >   <value>true</value>
> >   <description>Whether hive supports concurrency or not. A zookeeper
> > instance must be up and running for the default hive lock manager to
> > support read-write locks.</description>
> > </property>
> >
> >
> > If it is like that, then how do I solve the issue? Is there any row-level
> > lock?
> >
> > Regards
>
>


-- 
Nitin Pawar


Re: While inserting data into Hive, why am I not able to query?

2013-07-16 Thread Alan Gates
This question should be sent to u...@hive.apache.org.

Alan.

On Jul 16, 2013, at 3:23 AM, samir das mohapatra wrote:

> Dear All,
>   Did anyone face this issue:
>   While loading a huge dataset into a hive table, hive is restricting me from
> querying the same table.
> 
>   I have set hive.support.concurrency=true, but it still shows 
> 
> conflicting lock present for TABLENAME  mode SHARED
> 
> <property>
>   <name>hive.support.concurrency</name>
>   <value>true</value>
>   <description>Whether hive supports concurrency or not. A zookeeper instance 
> must be up and running for the default hive lock manager to support 
> read-write locks.</description>
> </property>
> 
> 
> If it is like that, then how do I solve the issue? Is there any row-level lock?
> 
> Regards



spawn maps without any input data - hadoop streaming

2013-07-16 Thread Austin Chungath
Hi,

I am trying to generate random data using hadoop streaming & python. It's a
map only job and I need to run a number of maps. There is no input to the
map as it's just going to generate random data.

How do I specify the number of maps to run? (I am confused here because,
if I am not wrong, the number of maps spawned is related to the input data
size.)
Any ideas as to how this can be done?

Warm regards,
Austin


header of a tuple/bag

2013-07-16 Thread Mix Nin
Hi,

I am trying to query a data set on HDFS using Pig.

Data = LOAD '/user/xx/20130523/*';
x = FOREACH Data GENERATE cookie_id;

I get below error.

 Invalid field projection. Projected field [cookie_id]
does not exist

How do I find the column names in the bag "Data"? The developer who
created the file says it is coookie_id.
Is there any way I could get the schema/header for this?


Thanks


Incrementally adding to existing output directory

2013-07-16 Thread Max Lebedev
Hi

I'm trying to figure out how to incrementally add to an existing output
directory using MapReduce.

I cannot specify the exact output path, as data in the input is sorted into
categories and then written to different directories based on the contents.
(In the examples below, token= or token=)

As an example:

When using MultipleOutput and provided that outDir does not exist yet, the
following will work:

hadoop jar myMR.jar
--input-path=inputDir/dt=2013-05-03/* --output-path=outDir

The result will be:

outDir/token=/dt=2013-05-03/

outDir/token=/dt=2013-05-03/

However, the following will fail because outDir already exists. Even though
I am copying new inputs.

hadoop jar myMR.jar  --input-path=inputDir/dt=2013-05-04/*
--output-path=outDir

will throw FileAlreadyExistsException

What I would expect is that it adds

outDir/token=/dt=2013-05-04/

outDir/token=/dt=2013-05-04/

Another possibility would be the following hack but it does not seem to be
very elegant:

hadoop jar myMR.jar --input-path=inputDir/2013-05-04/*
--output-path=tempOutDir

then copy from tempOutDir to outDir

Is there a better way to address incrementally adding to an existing hadoop
output directory?


Re: Collect, Spill and Merge phases insight

2013-07-16 Thread Stephen Boesch
Great questions, I am also looking forward to answers from the expert(s) here.


2013/7/16 Felix.徐 

> Hi all,
>
> I am trying to understand the process of Collect, Spill and Merge in Map,
> I've referred to a few documentations but still have a few questions.
>
> Here is my understanding about the spill phase in map:
>
> 1.The Collect function adds a record into the buffer.
> 2.If the buffer exceeds a threshold (determined by parameters like
> io.sort.mb), spill phase begins.
> 3.Spill phase includes 3 actions : sort , combine and compression.
> 4.Spill may be performed multiple times thus a few spilled files will be
> generated.
> 5.If there are more than 1 spilled files, Merge phase begins and merge
> these files into a big one.
>
> If there is any misunderstanding about these phases, please correct me,
> thanks!
> And my questions are:
>
> 1.Where is the partition being calculated (in Collect or Spill) ?  Does
> Collect simply append a record into the buffer and check whether we should
> spill the buffer?
>
> 2.At the Merge phase, since the spilled files are compressed, does it need to
> uncompress these files and compress them again? Since Merge may be
> performed more than 1 round, does it compress intermediate files?
>
> 3.Does the Merge phase at Map and Reduce side almost the same (External
> merge-sort combined with Min-Heap) ?
>
>


Re: copy files from ftp to hdfs in parallel, distcp failed

2013-07-16 Thread Hao Ren

Hi,

Actually, I tested with my own ftp host at first; however, it doesn't work.

Then I changed it to 0.0.0.0.

But I always get the "cannot access ftp" message.

Thank you .

Hao.

On 16/07/2013 17:03, Ram wrote:

Hi,
    Please replace 0.0.0.0 with your ftp host IP address and try it.

Hi,



From,
Ramesh.




On Mon, Jul 15, 2013 at 3:22 PM, Hao Ren wrote:


Thank you, Ram

I have configured core-site.xml as follows:

<configuration>

  <property>
    <name>hadoop.tmp.dir</name>
    <value>/vol/persistent-hdfs</value>
  </property>

  <property>
    <name>fs.default.name</name>
    <value>hdfs://ec2-23-23-33-234.compute-1.amazonaws.com:9010</value>
  </property>

  <property>
    <name>io.file.buffer.size</name>
    <value>65536</value>
  </property>

  <property>
    <name>fs.ftp.host</name>
    <value>0.0.0.0</value>
  </property>

  <property>
    <name>fs.ftp.host.port</name>
    <value>21</value>
  </property>

</configuration>

Then I tried hadoop fs -ls file:/// , and it works.
But hadoop fs -ls ftp://:@// doesn't work as usual:
ls: Cannot access ftp://:@//: No such file or directory.

When ignoring  as :

hadoop fs -ls ftp://:@/

There are no error msgs, but it lists nothing.


I have also checked the rights for my /home/ directory:

drwxr-xr-x 114096 jui 11 16:30 

and all the files under /home/ have rights 755.

I can easily copy the link ftp://:@// into Firefox, and it lists all the files as expected.

Any workaround here ?

Thank you.

On 12/07/2013 14:01, Ram wrote:

Please configure the following in core-site.xml and try.
   Use hadoop fs -ls file:///  -- to display local file system files
   Use hadoop fs -ls ftp://  -- to display ftp files;
if it lists files, go for distcp.

reference from

http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/core-default.xml

fs.ftp.host       0.0.0.0   FTP filesystem connects to this server
fs.ftp.host.port  21        FTP filesystem connects to fs.ftp.host on this port




--
Hao Ren
ClaraVista
www.claravista.fr



Re: copy files from ftp to hdfs in parallel, distcp failed

2013-07-16 Thread Ram
Hi,
Please replace 0.0.0.0 with your ftp host IP address and try it.

Hi,



From,
Ramesh.




On Mon, Jul 15, 2013 at 3:22 PM, Hao Ren  wrote:

>  Thank you, Ram
>
> I have configured core-site.xml as follows:
>
> <configuration>
>
> <property>
> <name>hadoop.tmp.dir</name>
> <value>/vol/persistent-hdfs</value>
> </property>
>
> <property>
> <name>fs.default.name</name>
> <value>hdfs://ec2-23-23-33-234.compute-1.amazonaws.com:9010</value>
> </property>
>
> <property>
> <name>io.file.buffer.size</name>
> <value>65536</value>
> </property>
>
> <property>
> <name>fs.ftp.host</name>
> <value>0.0.0.0</value>
> </property>
>
> <property>
> <name>fs.ftp.host.port</name>
> <value>21</value>
> </property>
>
> </configuration>
>
> Then I tried hadoop fs -ls file:/// , and it works.
> But hadoop fs -ls ftp://:@//
> doesn't work as usual:
> ls: Cannot access ftp://:@//: No such file or directory.
>
> When ignoring  as :
>
> hadoop fs -ls ftp://:@/
>
> There are no error msgs, but it lists nothing.
>
>
> I have also checked the rights for my /home/ directory:
>
> drwxr-xr-x 114096 jui 11 16:30 
>
> and all the files under /home/ have rights 755.
>
> I can easily copy the link ftp://:@// into Firefox, and it lists all the files as expected.
>
> Any workaround here ?
>
> Thank you.
>
> On 12/07/2013 14:01, Ram wrote:
>
> Please configure the following in core-site.xml and try.
>    Use hadoop fs -ls file:///  -- to display local file system files
>    Use hadoop fs -ls ftp://  -- to display ftp files;
> if it lists files, go for distcp.
>
>  reference from
> http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/core-default.xml
>
>
>fs.ftp.host 0.0.0.0 FTP filesystem connects to this server
> fs.ftp.host.port 21 FTP filesystem connects to fs.ftp.host on this port
>
>
>
> --
> Hao Ren
> ClaraVista
> www.claravista.fr
>
>


Re: java.io.IOException: error=2, No such file or directory

2013-07-16 Thread Shahab Yunus
Great. Can you please share, if possible, what was the problem and how you
solved it? Thanks.

Regards,
Shahab


On Tue, Jul 16, 2013 at 9:58 AM, Fatih Haltas  wrote:

> Thanks Shahab, I solved my problem in another way.
>


Re: java.io.IOException: error=2, No such file or directory

2013-07-16 Thread Fatih Haltas
Thanks Shahab, I solved my problem in another way.


Re: java.io.IOException: error=2, No such file or directory

2013-07-16 Thread Shahab Yunus
The error is:

Please set $HBASE_HOME to the root of your HBase installation.

Have you checked whether it is set or not? Have you verified your HBase or
Hadoop installation?

Similarly, the following:

Cannot run program "psql": java.io.IOException: error=2, No such file or
directory

This also seems to indicate that Postgres (psql) is not on the PATH. What and how are
your environment variables set? Can you access/run this independently from
this directory?

Regards,
Shahab


On Tue, Jul 16, 2013 at 6:44 AM, Fatih Haltas  wrote:

> Hi everyone,
>
> I am trying to import data from PostgreSQL to HDFS, but I am having some
> problems. Here are the problem details:
>
> Sqoop Version: 1.4.3
> Hadoop Version:1.0.4
>
>
> 1) When I use this command:
>
> ./sqoop import-all-tables --connect jdbc:postgresql://
> 192.168.194.158:5432/IMS --username pgsql -P
>
>
> Here is the exact output
> ==
> Warning: /usr/lib/hbase does not exist! HBase imports will fail.
> Please set $HBASE_HOME to the root of your HBase installation.
> Warning: $HADOOP_HOME is deprecated.
>
> Enter password:
> 13/07/16 13:52:28 INFO manager.SqlManager: Using default fetchSize of 1000
> 13/07/16 13:52:29 INFO tool.CodeGenTool: Beginning code generation
> 13/07/16 13:52:29 INFO manager.SqlManager: Executing SQL statement: SELECT
> t.* FROM "publicinici" AS t LIMIT 1
> 13/07/16 13:52:29 INFO orm.CompilationManager: HADOOP_MAPRED_HOME is
> /home/hadoop/project/hadoop-1.0.4
> Note:
> /tmp/sqoop-hadoop/compile/0f484159ce27d8ed3c2d95ca13974f1a/publicinici.java
> uses or overrides a deprecated API.
> Note: Recompile with -Xlint:deprecation for details.
> 13/07/16 13:52:30 INFO orm.CompilationManager: Writing jar file:
> /tmp/sqoop-hadoop/compile/0f484159ce27d8ed3c2d95ca13974f1a/publicinici.jar
> 13/07/16 13:52:30 WARN manager.PostgresqlManager: It looks like you are
> importing from postgresql.
> 13/07/16 13:52:30 WARN manager.PostgresqlManager: This transfer can be
> faster! Use the --direct
> 13/07/16 13:52:30 WARN manager.PostgresqlManager: option to exercise a
> postgresql-specific fast path.
> 13/07/16 13:52:30 INFO mapreduce.ImportJobBase: Beginning import of
> publicinici
> 13/07/16 13:52:31 INFO db.DataDrivenDBInputFormat: BoundingValsQuery:
> SELECT MIN("budapablik"), MAX("budapablik") FROM "publicinici"
> 13/07/16 13:52:31 INFO mapred.JobClient: Running job: job_201302261146_0525
> 13/07/16 13:52:32 INFO mapred.JobClient:  map 0% reduce 0%
> ^X[hadoop@ADUAE042-LAP-V bin]./sqoop version
> Warning: /usr/lib/hbase does not exist! HBase imports will fail.
> Please set $HBASE_HOME to the root of your HBase installation.
> Warning: $HADOOP_HOME is deprecated.
> ===
>
> I have only one table, "publicinici", which holds 2 entities in one
> row, and it is read, but at the MapReduce phase it gets stuck.
>
>
> 2) When I use this command with the --direct option added, everything
> goes well, but I get one error:
>
> ERROR tool.ImportAllTablesTool: Encountered IOException running import
> job: java.io.IOException: Cannot run program "psql": java.io.IOException:
> error=2, No such file or directory
>
> Also, in HDFS the publicinici file has been created empty.
>
> Here is the exact output.
> 
> 13/07/16 13:58:42 INFO manager.SqlManager: Using default fetchSize of 1000
> 13/07/16 13:58:42 INFO tool.CodeGenTool: Beginning code generation
> 13/07/16 13:58:43 INFO manager.SqlManager: Executing SQL statement: SELECT
> t.* FROM "publicinici" AS t LIMIT 1
> 13/07/16 13:58:43 INFO orm.CompilationManager: HADOOP_MAPRED_HOME is
> /home/hadoop/project/hadoop-1.0.4
> Note:
> /tmp/sqoop-hadoop/compile/56c2a5b04e83ace0112fe038eb4d1599/publicinici.java
> uses or overrides a deprecated API.
> Note: Recompile with -Xlint:deprecation for details.
> 13/07/16 13:58:44 INFO orm.CompilationManager: Writing jar file:
> /tmp/sqoop-hadoop/compile/56c2a5b04e83ace0112fe038eb4d1599/publicinici.jar
> 13/07/16 13:58:44 INFO manager.DirectPostgresqlManager: Beginning psql
> fast path import
> 13/07/16 13:58:44 INFO manager.SqlManager: Executing SQL statement: SELECT
> t.* FROM "publicinici" AS t LIMIT 1
> 13/07/16 13:58:44 INFO manager.DirectPostgresqlManager: Copy command is
> COPY (SELECT "budapablik" FROM "publicinici" WHERE 1=1) TO STDOUT WITH
> DELIMITER E'\54' CSV ;
> 13/07/16 13:58:44 INFO manager.DirectPostgresqlManager: Performing import
> of table publicinici from database IMS
> 13/07/16 13:58:45 INFO manager.DirectPostgresqlManager: Transfer loop
> complete.
> 13/07/16 13:58:45 INFO manager.DirectPostgresqlManager: Transferred 0
> bytes in 17,361,268.2044 seconds (0 bytes/sec)
> 13/07/16 13:58:45 ERROR tool.ImportAllTablesTool: Encountered IOException
> running import job: java.io.IOException: Cannot run program "psql":
> java.io.IOException: error=2, No such file or directory
>
>
> What should I do to be able to fi

Re: Running a single cluster in multiple datacenters

2013-07-16 Thread Azuryy Yu
Hi Bertrand,
I guess you configured two racks in total: one IDC is one rack, and the other IDC is 
another rack. 
So if you don't want re-replication to kick in while one IDC is down, you have to 
change the replica placement policy: 
if the minimum number of block replicas is already on one rack, then don't do anything. 
(Here the minimum number of replicas should be '2', which guarantees you have at least 
two replicas in one IDC.)
So you would have to configure the replication factor to '4' if you adopt my advice.



On Jul 16, 2013, at 6:37 AM, Bertrand Dechoux  wrote:

> According to your own analysis, you wouldn't be more available but that was 
> your aim.
> Did you consider having two separate clusters? One per datacenter, with an 
> automatic copy of the data?
> I understand that load balancing of work and data would not be easy but it 
> seems to me a simple strategy (that I have seen working).
> 
> However, you are stating that the two datacenters are close and linked by a 
> big network connection.
> What is the impact on the latency and the bandwidth? (between two nodes in 
> the same datacenter versus two nodes in different datacenters)
> The main question is what happens when a job will use TaskTrackers from 
> datacenter A but DataNodes from datacenter B.
> It will happen. Simply consider Reducer tasks that don't have any strategy 
> about locality because it doesn't really make sense in a general context.
> 
> Regards
> 
> Bertrand
> 
> 
> On Mon, Jul 15, 2013 at 11:56 PM,  wrote:
> Hi Niels,
> 
> it depends on the number of replicas and the Hadoop rack configuration 
> (level).
> It's possible to have replicas in the two datacenters.
> 
> What's the rack configuration that you plan? You can implement your own one 
> and define it using the topology.node.switch.mapping.impl property.
> 
> Regards
> JB
> 
> 
> On 2013-07-15 23:49, Niels Basjes wrote:
> Hi,
> 
> Last week we had a discussion at work regarding setting up our new
> Hadoop cluster(s).
> One of the things that has changed is that the importance of the
> Hadoop stack is growing so we want to be "more available".
> 
> One of the points we talked about was setting up the cluster in such a
> way that the nodes are physically located in two separate datacenters
> (on opposite sides of the same city) with a big network connection in
> between.
> 
> We're currently talking about a cluster in the 50 nodes range, but that
> 
> will grow over time.
> 
> The advantages I see:
> - More CPU power available for jobs.
> - The data is automatically copied between the datacenters as long as
> we configure them to be different racks.
> 
> 
> The disadvantages I see:
> - If the network goes out then one half is dead and the other half
> will most likely go to safemode because the recovering of the missing
> replicas will fill up the disks fast.
> 
> What things should we consider also?
> Has anyone any experience with such a setup?
> Is it a good idea to do this?
> 
> What are better options for us to consider?
> 
> Thanks for any input.
> 
> 
> 
> 
> -- 
> Bertrand Dechoux



java.io.IOException: error=2, No such file or directory

2013-07-16 Thread Fatih Haltas
Hi everyone,

I am trying to import data from PostgreSQL to HDFS, but I am having some
problems. Here are the problem details:

Sqoop Version: 1.4.3
Hadoop Version:1.0.4


1) When I use this command:

./sqoop import-all-tables --connect jdbc:postgresql://
192.168.194.158:5432/IMS --username pgsql -P


Here is the exact output
==
Warning: /usr/lib/hbase does not exist! HBase imports will fail.
Please set $HBASE_HOME to the root of your HBase installation.
Warning: $HADOOP_HOME is deprecated.

Enter password:
13/07/16 13:52:28 INFO manager.SqlManager: Using default fetchSize of 1000
13/07/16 13:52:29 INFO tool.CodeGenTool: Beginning code generation
13/07/16 13:52:29 INFO manager.SqlManager: Executing SQL statement: SELECT
t.* FROM "publicinici" AS t LIMIT 1
13/07/16 13:52:29 INFO orm.CompilationManager: HADOOP_MAPRED_HOME is
/home/hadoop/project/hadoop-1.0.4
Note:
/tmp/sqoop-hadoop/compile/0f484159ce27d8ed3c2d95ca13974f1a/publicinici.java
uses or overrides a deprecated API.
Note: Recompile with -Xlint:deprecation for details.
13/07/16 13:52:30 INFO orm.CompilationManager: Writing jar file:
/tmp/sqoop-hadoop/compile/0f484159ce27d8ed3c2d95ca13974f1a/publicinici.jar
13/07/16 13:52:30 WARN manager.PostgresqlManager: It looks like you are
importing from postgresql.
13/07/16 13:52:30 WARN manager.PostgresqlManager: This transfer can be
faster! Use the --direct
13/07/16 13:52:30 WARN manager.PostgresqlManager: option to exercise a
postgresql-specific fast path.
13/07/16 13:52:30 INFO mapreduce.ImportJobBase: Beginning import of
publicinici
13/07/16 13:52:31 INFO db.DataDrivenDBInputFormat: BoundingValsQuery:
SELECT MIN("budapablik"), MAX("budapablik") FROM "publicinici"
13/07/16 13:52:31 INFO mapred.JobClient: Running job: job_201302261146_0525
13/07/16 13:52:32 INFO mapred.JobClient:  map 0% reduce 0%
^X[hadoop@ADUAE042-LAP-V bin]./sqoop version
Warning: /usr/lib/hbase does not exist! HBase imports will fail.
Please set $HBASE_HOME to the root of your HBase installation.
Warning: $HADOOP_HOME is deprecated.
===

I have only one table, "publicinici", which holds 2 entities in one row,
and it is read, but at the MapReduce phase it gets stuck.


2) When I use this command with the --direct option added, everything
goes well, but I get one error:

ERROR tool.ImportAllTablesTool: Encountered IOException running import job:
java.io.IOException: Cannot run program "psql": java.io.IOException:
error=2, No such file or directory

Also, in HDFS the publicinici file has been created empty.

Here is the exact output.

13/07/16 13:58:42 INFO manager.SqlManager: Using default fetchSize of 1000
13/07/16 13:58:42 INFO tool.CodeGenTool: Beginning code generation
13/07/16 13:58:43 INFO manager.SqlManager: Executing SQL statement: SELECT
t.* FROM "publicinici" AS t LIMIT 1
13/07/16 13:58:43 INFO orm.CompilationManager: HADOOP_MAPRED_HOME is
/home/hadoop/project/hadoop-1.0.4
Note:
/tmp/sqoop-hadoop/compile/56c2a5b04e83ace0112fe038eb4d1599/publicinici.java
uses or overrides a deprecated API.
Note: Recompile with -Xlint:deprecation for details.
13/07/16 13:58:44 INFO orm.CompilationManager: Writing jar file:
/tmp/sqoop-hadoop/compile/56c2a5b04e83ace0112fe038eb4d1599/publicinici.jar
13/07/16 13:58:44 INFO manager.DirectPostgresqlManager: Beginning psql fast
path import
13/07/16 13:58:44 INFO manager.SqlManager: Executing SQL statement: SELECT
t.* FROM "publicinici" AS t LIMIT 1
13/07/16 13:58:44 INFO manager.DirectPostgresqlManager: Copy command is
COPY (SELECT "budapablik" FROM "publicinici" WHERE 1=1) TO STDOUT WITH
DELIMITER E'\54' CSV ;
13/07/16 13:58:44 INFO manager.DirectPostgresqlManager: Performing import
of table publicinici from database IMS
13/07/16 13:58:45 INFO manager.DirectPostgresqlManager: Transfer loop
complete.
13/07/16 13:58:45 INFO manager.DirectPostgresqlManager: Transferred 0 bytes
in 17,361,268.2044 seconds (0 bytes/sec)
13/07/16 13:58:45 ERROR tool.ImportAllTablesTool: Encountered IOException
running import job: java.io.IOException: Cannot run program "psql":
java.io.IOException: error=2, No such file or directory


What should I do to be able to fix the problem?
Any help would be appreciated. Thank you very much.


While inserting data into Hive, why am I not able to query?

2013-07-16 Thread samir das mohapatra
Dear All,
  Did anyone face this issue:
   While loading a huge dataset into a hive table, hive is restricting me from
querying the same table.

  I have set hive.support.concurrency=true, but it still shows

conflicting lock present for TABLENAME mode SHARED

<property>
  <name>hive.support.concurrency</name>
  <value>true</value>
  <description>Whether hive supports concurrency or not. A zookeeper
instance must be up and running for the default hive lock manager to
support read-write locks.</description>
</property>


If it is like that, then how do I solve the issue? Is there any row-level lock?

Regards


Collect, Spill and Merge phases insight

2013-07-16 Thread Felix . 徐
Hi all,

I am trying to understand the process of Collect, Spill and Merge in Map,
I've referred to a few documentations but still have a few questions.

Here is my understanding about the spill phase in map:

1.The Collect function adds a record into the buffer.
2.If the buffer exceeds a threshold (determined by parameters like
io.sort.mb), spill phase begins.
3.Spill phase includes 3 actions : sort , combine and compression.
4.Spill may be performed multiple times thus a few spilled files will be
generated.
5.If there are more than 1 spilled files, Merge phase begins and merge
these files into a big one.

If there is any misunderstanding about these phases, please correct me,
thanks!
And my questions are:

1.Where is the partition being calculated (in Collect or Spill) ?  Does
Collect simply append a record into the buffer and check whether we should
spill the buffer?

2.At the Merge phase, since the spilled files are compressed, does it need to
uncompress these files and compress them again? Since Merge may be
performed more than 1 round, does it compress intermediate files?

3.Does the Merge phase at Map and Reduce side almost the same (External
merge-sort combined with Min-Heap) ?


Re: hive task fails when left semi join

2013-07-16 Thread Nitin Pawar
Dev,

From what I learned in my past experience with running huge single-table queries,
one hits reduce-side memory limits or timeout limits. I will wait for Kira
to give more details on the same.
Sorry, I forgot to ask for the logs and suggested a different approach :(

Kira,
The page is in Chinese so I can't make much out of it, but the query looks like
a map join.
If you are using an older Hive version,
then the query shown on the mail thread looks good.

If you are using a newer Hive version, then
 hive.auto.convert.join=true will do the job.


On Tue, Jul 16, 2013 at 1:07 PM, Devaraj k  wrote:

>  Hi,
>
>    In the given image, I see there are some failed/killed map & reduce task
> attempts. Could you check why these are failing? You can check further
> based on the fail/kill reason.
>
> Thanks
>
> Devaraj k
>
> From: kira.w...@xiaoi.com [mailto:kira.w...@xiaoi.com]
> Sent: 16 July 2013 12:57
> To: user@hadoop.apache.org
> Subject: hive task fails when left semi join
>
> Hello,
>
> I am trying to filter out some records in a table in hive.
>
> The number of lines in this table is 4 billion+, 
>
> I make a left semi join between the above table and a small table with 1k
> lines.
>
> However, after 3 hours of job running, it turns out a fail status.
>
> My questions are as follows,
>
> 1. How could I address this problem and finally solve it?
>
> 2. Are there any other good methods that could filter out records with
> given conditions?
>
> The following picture is a snapshot of the failed job.
>
>



-- 
Nitin Pawar

RE: hive task fails when left semi join

2013-07-16 Thread kira.wang
 

I have checked it. As the datanode logs show,

2013-07-16 00:05:31,294 WARN org.apache.hadoop.mapred.TaskTracker:
getMapOutput(attempt_201307041810_0138_m_000259_0,53) failed :

org.mortbay.jetty.EofException: timeout

 

This may be caused by a so-called "data skew" problem.

 

Thanks, Devaraj k.

 

 

From: Devaraj k [mailto:devara...@huawei.com] 
Sent: 16 July 2013 15:37
To: user@hadoop.apache.org
Subject: RE: hive task fails when left semi join

 

Hi,

 

   In the given image, I see there are some failed/killed map & reduce task
attempts. Could you check why these are failing? You can check further based
on the fail/kill reason.

 

 

Thanks

Devaraj k

 

From: kira.w...@xiaoi.com [mailto:kira.w...@xiaoi.com] 
Sent: 16 July 2013 12:57
To: user@hadoop.apache.org
Subject: hive task fails when left semi join

 

Hello,

 

I am trying to filter out some records in a table in hive.

The number of lines in this table is 4 billion+, 

I make a left semi join between the above table and a small table with 1k lines.

However, after 3 hours of job running, it turns out a fail status.

My questions are as follows,

1. How could I address this problem and finally solve it?

2. Are there any other good methods that could filter out records with given
conditions?

 

The following picture is a snapshot of the failed job.



 


RE: hive task fails when left semi join

2013-07-16 Thread Devaraj k
Hi,
   In the given image, I see there are some failed/killed map & reduce task 
attempts. Could you check why these are failing? You can check further based on 
the fail/kill reason.


Thanks
Devaraj k

From: kira.w...@xiaoi.com [mailto:kira.w...@xiaoi.com]
Sent: 16 July 2013 12:57
To: user@hadoop.apache.org
Subject: hive task fails when left semi join

Hello,

I am trying to filter out some records in a table in hive.
The number of lines in this table is 4 billion+,
I make a left semi join between the above table and a small table with 1k lines.

However, after 3 hours of job running, it turns out a fail status.

My questions are as follows,

1. How could I address this problem and finally solve it?

2. Are there any other good methods that could filter out records with given 
conditions?

The following picture is a snapshot of the failed job.


RE: hive task fails when left semi join

2013-07-16 Thread kira.wang
Thanks for your positive answer.

 

From your answer I get the keyword "map join" and realize it. Do you
mean that I can do as the blog says:

http://blog.csdn.net/xqy1522/article/details/6699740

 

If you don't mind, please scan the website.

 

 

From: Nitin Pawar [mailto:nitinpawar...@gmail.com] 
Sent: 16 July 2013 15:29
To: user@hadoop.apache.org
Subject: Re: hive task fails when left semi join

 

Can you try a map-only join? 

Your one table is just 1k records... a map join will help you run it faster
and hopefully you will not hit the memory limit. 

 

On Tue, Jul 16, 2013 at 12:56 PM,  wrote:

Hello,

 

I am trying to filter out some records in a table in hive.

The number of lines in this table is 4 billion+, 

I make a left semi join between the above table and a small table with 1k lines.

However, after 3 hours of job running, it turns out a fail status.

My questions are as follows,

1. How could I address this problem and finally solve it?

2. Are there any other good methods that could filter out records with given
conditions?

 

The following picture is a snapshot of the failed job.



 





 

-- 
Nitin Pawar


Re: hive task fails when left semi join

2013-07-16 Thread Nitin Pawar
Can you try a map-only join?
Your one table is just 1k records... a map join will help you run it faster
and hopefully you will not hit the memory limit.


On Tue, Jul 16, 2013 at 12:56 PM,  wrote:

> Hello,
>
> I am trying to filter out some records in a table in hive.
>
> The number of lines in this table is 4 billion+, 
>
> I make a left semi join between the above table and a small table with 1k
> lines.
>
> However, after 3 hours of job running, it turns out a fail status.
>
> My questions are as follows,
>
> 1. How could I address this problem and finally solve it?
>
> 2. Are there any other good methods that could filter out records with
> given conditions?
>
> The following picture is a snapshot of the failed job.
>
>



-- 
Nitin Pawar

hive task fails when left semi join

2013-07-16 Thread kira.wang
Hello,

 

I am trying to filter out some records in a table in hive.

The number of lines in this table is 4 billion+, 

I make a left semi join between the above table and a small table with 1k lines.

However, after 3 hours of job running, it turns out a fail status.

My questions are as follows,

1. How could I address this problem and finally solve it?

2. Are there any other good methods that could filter out records with given
conditions?

 

The following picture is a snapshot of the failed job.



 


RE: Policies for placing a reducer

2013-07-16 Thread Devaraj k
Hi,

It doesn't consider where the maps ran when scheduling the reducers, because 
reducers need to contact all the mappers for the map outputs. It schedules 
reducers wherever slots are available.

Thanks
Devaraj k

From: Felix.徐 [mailto:ygnhz...@gmail.com]
Sent: 16 July 2013 09:25
To: user@hadoop.apache.org
Subject: Policies for placing a reducer

Hi all,

What is the policy of choosing a node for a reducer in mapreduce (Hadoop 
v1.2.0)?
For example,
If a cluster has 5 slaves, each slave can serve 2 maps and 2 reduces , there is 
a job who occupies 5 mappers and 3 reducers , how does the jobtracker assign 
reducers to these nodes (choosing free slaves or placing reducers close to 
mappers)?

Thanks.