Re: binary file deserialization

2016-03-09 Thread Ted Yu
bq. there is a varying number of items for that record

If the combination of items is very large, using a case class would be
tedious.
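
For illustration, a minimal spark-shell sketch of the approach Saurabh suggests below. The Record case class, the field layout (4-byte length header, 8-byte id, 4-byte items) and the HDFS path are invented for the example and are not the poster's actual format:

    import java.io.DataInputStream

    case class Record(id: Long, items: Seq[Int])   // hypothetical layout

    // binaryFiles gives one (path, PortableDataStream) pair per file;
    // walk each stream record by record using the documented length header
    val records = sc.binaryFiles("hdfs:///data/legacy.bin").flatMap { case (_, stream) =>
      val in = new DataInputStream(stream.open())
      val buf = scala.collection.mutable.ArrayBuffer[Record]()
      try {
        while (in.available() > 0) {               // simplified end-of-stream check
          val len = in.readInt()                   // header: total record length in bytes
          val id  = in.readLong()
          val n   = (len - 12) / 4                 // remaining bytes hold 4-byte items
          buf += Record(id, Seq.fill(n)(in.readInt()))
        }
      } finally in.close()
      buf
    }

    import sqlContext.implicits._
    val df = records.toDF()   // or emit JSON strings per record and use sqlContext.read.json

Since binaryFiles reads each file as a single unsplittable stream, this works best when the data is split across many files; one huge file would need a splittable custom Hadoop InputFormat instead.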

On Wed, Mar 9, 2016 at 9:57 AM, Saurabh Bajaj 
wrote:

> You can load that binary up as a String RDD, then map over that RDD and
> convert each row to your case class representing the data. In the map stage
> you could also map the input string into an RDD of JSON values and use the
> following function to convert it into a DF
>
> http://spark.apache.org/docs/latest/sql-programming-guide.html#json-datasets
>
> val anotherPeople = sqlContext.read.json(anotherPeopleRDD)
>
>
> On Wed, Mar 9, 2016 at 9:15 AM, Ruslan Dautkhanov 
> wrote:
>
>> We have a huge binary file in a custom serialization format (e.g. the header
>> tells the length of the record, then there is a varying number of items for
>> that record). This is produced by an old C++ application.
>> What would be the best approach to deserialize it into a Hive table or a
>> Spark RDD?
>> The format is known and well documented.
>>
>>
>> --
>> Ruslan Dautkhanov
>>
>
>


Re: error while defining custom schema in Spark 1.5.0

2015-12-25 Thread Ted Yu
The error was due to the "blank" field being defined twice.
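
For reference, a minimal version of that schema which compiles on Spark 1.5 (nothing here is tied to a particular data source): a list built with :: has to end in Nil — that is what the "value :: is not a member" error points at — and each field name should be unique, so the duplicate "blank" field is dropped.

    import org.apache.spark.sql.types.{StructType, StructField, StringType, IntegerType}

    // :: chains must terminate in Nil to form a List[StructField]
    val customSchema = StructType(
      StructField("year", IntegerType, true) ::
      StructField("make", StringType, true) ::
      StructField("model", StringType, true) ::
      StructField("comment", StringType, true) ::
      StructField("blank", StringType, true) :: Nil)

    // hypothetical usage with the spark-csv package (format and path are examples only):
    // val df = sqlContext.read.format("com.databricks.spark.csv").schema(customSchema).load("cars.csv")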

On Tue, Dec 22, 2015 at 12:03 AM, Divya Gehlot 
wrote:

> Hi,
> I am a newbie to Apache Spark, using the CDH 5.5 Quick Start VM with Spark
> 1.5.0.
> I am working on a custom schema and getting an error:
>
> import org.apache.spark.sql.hive.HiveContext
>>>
>>> scala> import org.apache.spark.sql.hive.orc._
>>> import org.apache.spark.sql.hive.orc._
>>>
>>> scala> import org.apache.spark.sql.types.{StructType, StructField,
>>> StringType, IntegerType};
>>> import org.apache.spark.sql.types.{StructType, StructField, StringType,
>>> IntegerType}
>>>
>>> scala> val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
>>> 15/12/21 23:41:53 INFO hive.HiveContext: Initializing execution hive,
>>> version 1.1.0
>>> 15/12/21 23:41:53 INFO client.ClientWrapper: Inspected Hadoop version:
>>> 2.6.0-cdh5.5.0
>>> 15/12/21 23:41:53 INFO client.ClientWrapper: Loaded
>>> org.apache.hadoop.hive.shims.Hadoop23Shims for Hadoop version 2.6.0-cdh5.5.0
>>> hiveContext: org.apache.spark.sql.hive.HiveContext =
>>> org.apache.spark.sql.hive.HiveContext@214bd538
>>>
>>> scala> val customSchema = StructType(Seq(StructField("year",
>>> IntegerType, true),StructField("make", StringType,
>>> true),StructField("model", StringType, true),StructField("comment",
>>> StringType, true),StructField("blank", StringType, true)))
>>> customSchema: org.apache.spark.sql.types.StructType =
>>> StructType(StructField(year,IntegerType,true),
>>> StructField(make,StringType,true), StructField(model,StringType,true),
>>> StructField(comment,StringType,true), StructField(blank,StringType,true))
>>>
>>> scala> val customSchema = (new StructType).add("year", IntegerType,
>>> true).add("make", StringType, true).add("model", StringType,
>>> true).add("comment", StringType, true).add("blank", StringType, true)
>>> customSchema: org.apache.spark.sql.types.StructType =
>>> StructType(StructField(year,IntegerType,true),
>>> StructField(make,StringType,true), StructField(model,StringType,true),
>>> StructField(comment,StringType,true), StructField(blank,StringType,true))
>>>
>>> scala> val customSchema = StructType( StructField("year", IntegerType,
>>> true) :: StructField("make", StringType, true) :: StructField("model",
>>> StringType, true) :: StructField("comment", StringType, true) ::
>>> StructField("blank", StringType, true)::StructField("blank", StringType,
>>> true))
>>> :24: error: value :: is not a member of
>>> org.apache.spark.sql.types.StructField
>>>val customSchema = StructType( StructField("year", IntegerType,
>>> true) :: StructField("make", StringType, true) :: StructField("model",
>>> StringType, true) :: StructField("comment", StringType, true) ::
>>> StructField("blank", StringType, true)::StructField("blank", StringType,
>>> true))
>>>
>>
> Tried like below also
>
> scala> val customSchema = StructType( StructField("year", IntegerType,
> true), StructField("make", StringType, true) ,StructField("model",
> StringType, true) , StructField("comment", StringType, true) ,
> StructField("blank", StringType, true),StructField("blank", StringType,
> true))
> :24: error: overloaded method value apply with alternatives:
>   (fields:
> Array[org.apache.spark.sql.types.StructField])org.apache.spark.sql.types.StructType
> 
>   (fields:
> java.util.List[org.apache.spark.sql.types.StructField])org.apache.spark.sql.types.StructType
> 
>   (fields:
> Seq[org.apache.spark.sql.types.StructField])org.apache.spark.sql.types.StructType
>  cannot be applied to (org.apache.spark.sql.types.StructField,
> org.apache.spark.sql.types.StructField,
> org.apache.spark.sql.types.StructField,
> org.apache.spark.sql.types.StructField,
> org.apache.spark.sql.types.StructField,
> org.apache.spark.sql.types.StructField)
>val customSchema = StructType( StructField("year", IntegerType,
> true), StructField("make", StringType, true) ,StructField("model",
> StringType, true) , StructField("comment", StringType, true) ,
> StructField("blank", StringType, true),StructField("blank", StringType,
> true))
>   ^
> Would really appreciate it if somebody could share an example that works with
> Spark 1.4 or Spark 1.5.0.
>
> Thanks,
> Divya
>
>


Re: Intermittent BindException during long MR jobs

2015-02-28 Thread Ted Yu
Krishna:
Please take a look at:
http://wiki.apache.org/hadoop/BindException

Cheers

On Thu, Feb 26, 2015 at 10:30 PM, hadoop.supp...@visolve.com wrote:

 Hello Krishna,



 The exception seems to be IP specific. It may have occurred because no IP
 address was available in the system to assign. Double-check the IP address
 availability and rerun the job.



 Thanks,
 S.RagavendraGanesh
 ViSolve Hadoop Support Team
 ViSolve Inc. | San Jose, California
 Website: www.visolve.com
 email: servi...@visolve.com | Phone: 408-850-2243

 From: Krishna Rao [mailto:krishnanj...@gmail.com]
 Sent: Thursday, February 26, 2015 9:48 PM
 To: user@hive.apache.org; u...@hadoop.apache.org
 Subject: Intermittent BindException during long MR jobs

 Hi,

 We occasionally run into a BindException that causes long-running jobs to fail.

 The stacktrace is below. Any ideas what this could be caused by?

 Cheers,

 Krishna

 Stacktrace:

 379969 [Thread-980] ERROR org.apache.hadoop.hive.ql.exec.Task  - Job
 Submission failed with exception 'java.net.BindException(Problem binding to
 [back10/10.4.2.10:0] java.net.BindException: Cannot assign requested address;
 For more details see: http://wiki.apache.org/hadoop/BindException)'
 java.net.BindException: Problem binding to [back10/10.4.2.10:0]
 java.net.BindException: Cannot assign requested address; For more details
 see: http://wiki.apache.org/hadoop/BindException
 at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:718)
 at org.apache.hadoop.ipc.Client.call(Client.java:1242)
 at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:202)
 at com.sun.proxy.$Proxy10.create(Unknown Source)
 at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.create(ClientNamenodeProtocolTranslatorPB.java:193)
 at sun.reflect.GeneratedMethodAccessor43.invoke(Unknown Source)
 at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
 at java.lang.reflect.Method.invoke(Method.java:597)
 at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:164)
 at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:83)
 at com.sun.proxy.$Proxy11.create(Unknown Source)
 at org.apache.hadoop.hdfs.DFSOutputStream.<init>(DFSOutputStream.java:1376)
 at org.apache.hadoop.hdfs.DFSOutputStream.newStreamForCreate(DFSOutputStream.java:1395)
 at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:1255)
 at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:1212)
 at org.apache.hadoop.hdfs.DistributedFileSystem.create(DistributedFileSystem.java:276)
 at org.apache.hadoop.hdfs.DistributedFileSystem.create(DistributedFileSystem.java:265)
 at org.apache.hadoop.hdfs.DistributedFileSystem.create(DistributedFileSystem.java:82)
 at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:888)
 at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:869)
 at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:768)
 at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:757)
 at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:558)
 at org.apache.hadoop.mapreduce.split.JobSplitWriter.createFile(JobSplitWriter.java:96)
 at org.apache.hadoop.mapreduce.split.JobSplitWriter.createSplitFiles(JobSplitWriter.java:85)
 at org.apache.hadoop.mapreduce.JobSubmitter.writeOldSplits(JobSubmitter.java:517)
 at org.apache.hadoop.mapreduce.JobSubmitter.writeSplits(JobSubmitter.java:487)
 at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:369)
 at org.apache.hadoop.mapreduce.Job$11.run(Job.java:1286)
 at org.apache.hadoop.mapreduce.Job$11.run(Job.java:1283)
 at java.security.AccessController.doPrivileged(Native Method)
 at javax.security.auth.Subject.doAs(Subject.java:396)
 at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1438)
 at org.apache.hadoop.mapreduce.Job.submit(Job.java:1283)
 at org.apache.hadoop.mapred.JobClient$1.run(JobClient.java:606)
 at org.apache.hadoop.mapred.JobClient$1.run(JobClient.java:601)
 at java.security.AccessController.doPrivileged(Native Method)
 at javax.security.auth.Subject.doAs(Subject.java:396)
 at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1438)
 at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:601)
 at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:586)
 at org.apache.hadoop.hive.ql.exec.ExecDriver.execute(ExecDriver.java:448)
 at ...

Re: unsubscribe

2013-07-18 Thread Ted Yu
You have to send a mail to user-unsubscr...@hive.apache.org

On Thu, Jul 18, 2013 at 1:30 PM, Beau Rothrock beau.rothr...@lookout.com wrote:




Re: getRegion method is missing from RegionCoprocessorEnvironment class in version 0.94

2013-07-11 Thread Ted Yu
Which release of 0.94 do you use ?

For tip of 0.94, I see:

public interface RegionCoprocessorEnvironment extends
CoprocessorEnvironment {
  /** @return the region associated with this coprocessor */
  public HRegion getRegion();

On Thu, Jul 11, 2013 at 6:40 PM, ch huang justlo...@gmail.com wrote:

 How can I change the example code, which is based on the 0.92 API, so it can
 run on 0.94?



Re: Variable resolution Fails

2013-05-01 Thread Ted Yu
Naidu:
Please don't hijack an existing thread. Your questions are not directly related to
Hive.

Cheers

On May 1, 2013, at 12:53 AM, Naidu MS sanyasinaidu.malla...@gmail.com wrote:

 Hi, I have two questions regarding HDFS and the jps utility.
 
 I am new to Hadoop and started learning it over the past week.
 
 1. Whenever I run start-all.sh and then jps in the console, it shows the processes
 that started:
 
 naidu@naidu:~/work/hadoop-1.0.4/bin$ jps
 22283 NameNode
 23516 TaskTracker
 26711 Jps
 22541 DataNode
 23255 JobTracker
 22813 SecondaryNameNode
 Could not synchronize with target
 
 But along with the list of processes started, it always shows Could not
 synchronize with target in the jps output. What does Could not
 synchronize with target mean? Can someone explain why this is happening?
 
 
 2. Is it possible to format the namenode multiple times? When I enter the
 namenode -format command, it does not format the namenode and shows the
 following output:
 
 naidu@naidu:~/work/hadoop-1.0.4/bin$ hadoop namenode -format
 Warning: $HADOOP_HOME is deprecated.
 
 13/05/01 12:08:04 INFO namenode.NameNode: STARTUP_MSG: 
 /
 STARTUP_MSG: Starting NameNode
 STARTUP_MSG:   host = naidu/127.0.0.1
 STARTUP_MSG:   args = [-format]
 STARTUP_MSG:   version = 1.0.4
 STARTUP_MSG:   build = 
 https://svn.apache.org/repos/asf/hadoop/common/branches/branch-1.0 -r 
 1393290; compiled by 'hortonfo' on Wed Oct  3 05:13:58 UTC 2012
 /
 Re-format filesystem in /home/naidu/dfs/namenode ? (Y or N) y
 Format aborted in /home/naidu/dfs/namenode
 13/05/01 12:08:05 INFO namenode.NameNode: SHUTDOWN_MSG: 
 /
 SHUTDOWN_MSG: Shutting down NameNode at naidu/127.0.0.1
 
 /
 
 Can someone help me understand this? Why is it not possible to format
 the namenode multiple times?
 
 
 On Wed, May 1, 2013 at 8:14 AM, Sanjay Subramanian 
 sanjay.subraman...@wizecommerce.com wrote:
 +1  agreed
 
 Also as a general script programming practice I check if the variables I am 
 going to use are NON empty before using them…nothing related to Hive scripts
 
 if [ "${freq}" == "" ]
 then
    echo "variable freq is empty...exiting"
    exit 1
 fi
  
 
 
 From: Anthony Urso antho...@cs.ucla.edu
 Reply-To: user@hive.apache.org user@hive.apache.org
 Date: Tuesday, April 30, 2013 7:20 PM
 To: user@hive.apache.org user@hive.apache.org, sumit ghosh 
 sumi...@yahoo.com
 Subject: Re: Variable resolution Fails
 
 Your shell is expanding the variable ${env:freq}, which doesn't exist in the 
 shell's environment, so hive is getting the empty string in that place.  If 
 you are always intending to run your query like this, just use ${freq} which 
 will be expanded as expected by bash and then passed to hive.
 
 Cheers,
 Anthony
 
 
 On Tue, Apr 30, 2013 at 4:40 PM, sumit ghosh sumi...@yahoo.com wrote:
 Hi,
  
 The following variable freq fails to resolve:
  
 bash-4.1$ export freq=MNTH
 bash-4.1$ echo $freq
 MNTH
 bash-4.1$ hive -e "select ${env:freq} as dr  from dual"
 Logging initialized using configuration in 
 file:/etc/hive/conf.dist/hive-log4j.properties
 Hive history 
 file=/hadoop1/hive_querylog/sumighos/hive_job_log_sumighos_201304302321_1867815625.txt
 FAILED: ParseException line 1:8 cannot recognize input near 'as' 'dr' 
 'from' in select clause
 bash-4.1$
  
 Here dual is a table with 1 row.
 What am I doing wrong? When I try to resolve freq, it is empty!
  
   $ hive -S -e "select '${env:freq}' as dr  from dual"
  
   $
  
 Thanks,
 Sumit
 
 
 


Re: no data in external table

2012-10-04 Thread Ted Yu
Can you tell us how you created the mapping for the existing table?

In task log, do you see any connection attempt to HBase ?

Cheers

On Thu, Oct 4, 2012 at 11:30 AM, alx...@aim.com wrote:

 Hello,

 I use hive-0.9.0 with hadoop-0.20.2 and hbase-0.92.1. I have created
 external table, mapping it to an existing table in hbase. When I do select
 * from myextrenaltable it returns no results, although scan in hbase shows
 data, and I do not see any errors in jobtracker log.

 Any ideas how to debug this issue.

 Thanks.
 Alex.



Re: HBase aggregate query

2012-09-10 Thread Ted Yu
Hi,
Are you able to get the number you want through hive log ?

Thanks

On Mon, Sep 10, 2012 at 7:03 AM, iwannaplay games 
funnlearnfork...@gmail.com wrote:

 Hi ,

 I want to run query like

 select month(eventdate),scene,count(1),sum(timespent) from eventlog
 group by month(eventdate),scene


 in HBase. Through Hive it's taking a lot of time for 40 million
 records. Do we have any syntax in HBase to compute this result? In SQL
 Server it takes around 9 minutes; how long might it take in HBase?

 Regards
 Prabhjot



Re: HBaseSerDe

2012-07-25 Thread Ted Yu
The ctor is used in TestHBaseSerDe.java

So maybe change it to package private ?

On Wed, Jul 25, 2012 at 12:43 PM, kulkarni.swar...@gmail.com 
kulkarni.swar...@gmail.com wrote:

 While going through some code for HBase/Hive Integration, I came across
 this constructor:

 public HBaseSerDe() throws SerDeException {

 }

 Basically, the constructor does nothing but declare a thrown exception.
 The problem is that fixing this now will be a non-passive change.

 I couldn't really find an obvious reason for this to be there. Are there
 any objections if I file a JIRA to remove this constructor?
 --
 Swarnim



Re: What virtual column does hive.exec.rowoffset add?

2012-03-25 Thread Ted Yu
From VirtualColumn.java:
  public static VirtualColumn ROWOFFSET = new
VirtualColumn("ROW__OFFSET__INSIDE__BLOCK",
(PrimitiveTypeInfo)TypeInfoFactory.longTypeInfo);

Cheers

On Sun, Mar 25, 2012 at 2:30 PM, Edward Capriolo edlinuxg...@gmail.com wrote:

 <property>
   <name>hive.exec.rowoffset</name>
   <value>false</value>
   <description>Whether to provide the row offset virtual column</description>
 </property>

 I know the others are INPUT__FILE__NAME and BLOCK_OFFSET. Does
 anyone know what this third column is?

 Edward



Re: Running Hive Queries in Parallel?

2011-07-08 Thread Ted Yu
Set it in conf/hive-site.xml

On Thu, Jul 7, 2011 at 10:59 PM, hadoop n00b new2h...@gmail.com wrote:

 Found hive.exec.parallel!!!
 How can I set it to 'true' by default?

 Thanks!


 On Fri, Jul 8, 2011 at 11:14 AM, hadoop n00b new2h...@gmail.com wrote:

 Hello,

 When I execute multiple queries in Hive, the mapred tasks are queued up
 and executed one by one. Is there a way I could set Hive or Hadoop to
 execute mapred tasks in parallel?

 I am running on Hive 0.4.1 and Hadoop 0.20

 Tx!





Re: URGENT: I need the Hive Server setup Wiki

2011-06-27 Thread Ted Yu
The wiki has moved.
See
https://cwiki.apache.org/confluence/display/Hive/AdminManual+SettingUpHiveServer

On Mon, Jun 27, 2011 at 2:31 PM, Ayon Sinha ayonsi...@yahoo.com wrote:


 https://cwiki.apache.org/confluence/display/Hive/AdminManual+SettingUpHiveServer
 is empty and the old link is gone.

 -Ayon
 See My Photos on Flickr http://www.flickr.com/photos/ayonsinha/
 Also check out my Blog for answers to commonly asked 
 questions.http://dailyadvisor.blogspot.com



Re: How to change the hive.metastore.warehouse.dir ?

2011-05-17 Thread Ted Yu
Can you try as user hadoop ?

Cheers



On May 17, 2011, at 9:53 PM, jinhang du dujinh...@gmail.com wrote:

 hi,
 The default value is /user/hive/warehouse in hive-site.xml. After I changed
 the directory to a path on HDFS, I got this exception:
 
 FAILED: Error in metadata: MetaException(message:Got exception: 
 org.apache.hadoop.security.
 AccessControlException org.apache.hadoop.security.AccessControlException: 
 Permission denied: 
 user=root, access=WRITE, inode=output 
 FAILED: Execution Error, return code 1 from 
 org.apache.hadoop.hive.ql.exec.DDLTask
 
 Is this failure related to the hadoop-site.xml or something?
 Thanks for your help.
 
 -- 
 dujinhang


Re: Hi, all. Is it possible to generate multiple records in one SerDe?

2011-03-21 Thread Ted Yu
I don't think so:
  Object deserialize(Writable blob) throws SerDeException;


On Mon, Mar 21, 2011 at 4:55 AM, 幻 ygnhz...@gmail.com wrote:

 Hi, all. Is it possible to generate multiple records in one SerDe? I mean, can I
 return more than one row from deserialize?

 Thanks!



Re: skew join optimization

2011-03-20 Thread Ted Yu
Can someone re-attach the missing figures for that wiki ?

Thanks

On Sun, Mar 20, 2011 at 7:15 AM, bharath vissapragada 
bharathvissapragada1...@gmail.com wrote:

 Hi Igor,

 See http://wiki.apache.org/hadoop/Hive/JoinOptimization and see the
 jira 1642 which automatically converts a normal join into map-join
 (Otherwise you can specify the mapjoin hints in the query itself.).
 Because your 'S' table is very small , it can be replicated across all
 the mappers and the reduce phase can be avoided. This can greatly
 reduce the runtime .. (See the results section in the page for
 details.).

 Hope this helps.

 Thanks


 On Sun, Mar 20, 2011 at 6:37 PM, Jov zhao6...@gmail.com wrote:
  2011/3/20 Igor Tatarinov i...@decide.com:
  I have the following join that takes 4.5 hours (with 12 nodes) mostly
  because of a single reduce task that gets the bulk of the work:
  SELECT ...
  FROM T
  LEFT OUTER JOIN S
  ON T.timestamp = S.timestamp and T.id = S.id
  This is a 1:0/1 join so the size of the output is exactly the same as
 the
  size of T (500M records). S is actually very small (5K).
  I've tried:
  - switching the order of the join conditions
  - using a different hash function setting (jenkins instead of murmur)
  - using SET set hive.auto.convert.join = true;
 
  are you sure your query converts to a mapjoin? if not, try using an explicit
  mapjoin hint.
 
 
  - using SET hive.optimize.skewjoin = true;
  but nothing helped :(
  Anything else I can try?
  Thanks!
 



 --
 Regards,
 Bharath .V
 w:http://research.iiit.ac.in/~bharath.v



Re: skew join optimization

2011-03-20 Thread Ted Yu
How about linking to http://imageshack.us/ or TinyPic ?

Thanks

On Sun, Mar 20, 2011 at 7:56 AM, Edward Capriolo edlinuxg...@gmail.com wrote:

 On Sun, Mar 20, 2011 at 10:30 AM, Ted Yu yuzhih...@gmail.com wrote:
  Can someone re-attach the missing figures for that wiki ?
 
  Thanks
 
  On Sun, Mar 20, 2011 at 7:15 AM, bharath vissapragada
  bharathvissapragada1...@gmail.com wrote:
 
  Hi Igor,
 
  See http://wiki.apache.org/hadoop/Hive/JoinOptimization and see the
  jira 1642 which automatically converts a normal join into map-join
  (Otherwise you can specify the mapjoin hints in the query itself.).
  Because your 'S' table is very small , it can be replicated across all
  the mappers and the reduce phase can be avoided. This can greatly
  reduce the runtime .. (See the results section in the page for
  details.).
 
  Hope this helps.
 
  Thanks
 
 
  On Sun, Mar 20, 2011 at 6:37 PM, Jov zhao6...@gmail.com wrote:
   2011/3/20 Igor Tatarinov i...@decide.com:
   I have the following join that takes 4.5 hours (with 12 nodes) mostly
   because of a single reduce task that gets the bulk of the work:
   SELECT ...
   FROM T
   LEFT OUTER JOIN S
   ON T.timestamp = S.timestamp and T.id = S.id
   This is a 1:0/1 join so the size of the output is exactly the same as
   the
   size of T (500M records). S is actually very small (5K).
   I've tried:
   - switching the order of the join conditions
   - using a different hash function setting (jenkins instead of murmur)
   - using SET set hive.auto.convert.join = true;
  
   are you sure your query convert to mapjoin? if not,try use explicit
   mapjoin hint.
  
  
   - using SET hive.optimize.skewjoin = true;
   but nothing helped :(
   Anything else I can try?
   Thanks!
  
 
 
 
  --
  Regards,
  Bharath .V
  w:http://research.iiit.ac.in/~bharath.v
 
 

 The wiki does not allow images; Confluence does, but we have not moved there
 yet.



Re: does hive support Sequence File format ?

2011-02-17 Thread Ted Yu
Look under
http://wiki.apache.org/hadoop/Hive/LanguageManual/DDL#Create_Table

On Thu, Feb 17, 2011 at 12:00 PM, Mapred Learn mapred.le...@gmail.com wrote:

 Hi,
 I was wondering if Hive supports the Sequence File format. If yes, could you
 point me to some documentation about how to use Seq files in Hive.

 Thanks,
 -JJ



Re: Partitioning External table

2010-12-29 Thread Ted Yu
Can you try using:
location 'dt=1/engine'

Cheers

On Wed, Dec 29, 2010 at 1:12 AM, David Ginzburg ginz...@hotmail.com wrote:

  Hi,
 Thank you for the reply.
 I tried  ALTER TABLE tpartitions ADD PARTITION (dt='1') LOCATION
 '/user/training/partitions/';
 SHOW PARTITIONS
 tpartitions;

 OK
 dt=1


  but when I try to issue a select query , I get the following error:

 hive> select count(value) from tpartitions where dt='1';
 Total MapReduce jobs = 1
 Number of reduce tasks not specified. Estimated from input data size: 1
 In order to change the average load for a reducer (in bytes):
   set hive.exec.reducers.bytes.per.reducer=<number>
 In order to limit the maximum number of reducers:
   set hive.exec.reducers.max=<number>
 In order to set a constant number of reducers:
   set mapred.reduce.tasks=<number>
 Job Submission failed with exception 'java.io.FileNotFoundException(File
 does not exist: hdfs://localhost:8022/user/training/partitions/dt=1/data)'
 FAILED: Execution Error, return code 1 from
 org.apache.hadoop.hive.ql.exec.ExecDriver

 Why is it looking for data file when my sequence file is located at
 /user/training/partitions/dt=1/engine, according to the partition






  Date: Tue, 28 Dec 2010 11:25:50 -0500
  Subject: Re: Partitioning External table
  From: edlinuxg...@gmail.com
  To: user@hive.apache.org

 
  On Tue, Dec 28, 2010 at 9:41 AM, David Ginzburg ginz...@hotmail.com
 wrote:
   Hi,
   I am trying to test  creation of  an external table using partitions,
   my files on hdfs are:
  
   /user/training/partitions/dt=2/engine
   /user/training/partitions/dt=2/engine
  
   engine are sequence files which I have managed to create externally
 and
   query from, when I have not used partitions.
  
   When I create with partitions using :
   hive> CREATE EXTERNAL TABLE tpartitions(value STRING) PARTITIONED BY
 (dt
   STRING) STORED AS INPUTFORMAT
   'org.apache.hadoop.mapred.SequenceFileAsTextInputFormat' OUTPUTFORMAT
   'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat' LOCATION
   '/user/training/partitions';
   OK
   Time taken: 0.067 seconds
  
   show partitions
   tpartitions;
   OK
   Time taken: 0.084 seconds
   hive> select * from tpartitions;
   OK
   Time taken: 0.139 seconds
  
   Can someone point to what am I doing wrong here?
  
  
  
  
  
  
  
 
  You need to explicitly add the partitions to the table. The location
  specified for the partition will be appended to the location of the
  table.
 
  http://wiki.apache.org/hadoop/Hive/LanguageManual/DDL#Add_Partitions
 
  Something like this:
  alter table tpartitions add partition (dt='2') location 'dt=2/engine';
  alter table tpartitions add partition (dt='3') location 'dt=3/engine';



Re: FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.MapRedTask

2010-12-20 Thread Ted Yu
Have you found anything interesting from Hive history file
(/tmp/hadoop/hive_job_log_hadoop_201012210353_775358406.txt) ?

Thanks

On Mon, Dec 20, 2010 at 6:11 PM, Sean Curtis sean.cur...@gmail.com wrote:

 Just running a simple select count(1) from a table (using movielens as an
 example) doesn't seem to work for me. Anyone know why this doesn't work? I'm
 using hive trunk:

 hive> select avg(rating) from movierating where movieid=43;
 Total MapReduce jobs = 1
 Launching Job 1 out of 1
 Number of reduce tasks determined at compile time: 1
 In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
 In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
 In order to set a constant number of reducers:
  set mapred.reduce.tasks=<number>
 Starting Job = job_201012141048_0023, Tracking URL =
 http://localhost:50030/jobdetails.jsp?jobid=job_201012141048_0023
 Kill Command = /Users/Sean/dev/hadoop-0.20.2+737/bin/../bin/hadoop job
  -Dmapred.job.tracker=localhost:8021 -kill job_201012141048_0023
 2010-12-20 15:15:03,295 Stage-1 map = 0%,  reduce = 0%
 2010-12-20 15:15:09,420 Stage-1 map = 50%,  reduce = 0%
 ...
 eventually fails after a couple of minutes with:

 2010-12-20 17:33:01,113 Stage-1 map = 100%,  reduce = 0%
 2010-12-20 17:33:32,182 Stage-1 map = 100%,  reduce = 100%
 Ended Job = job_201012141048_0023 with errors
 FAILED: Execution Error, return code 2 from
 org.apache.hadoop.hive.ql.exec.MapRedTask
 hive


 It almost seems like the reduce task never starts. Any help would be
 appreciated.

 sean


Re: Exception in hive startup

2010-10-13 Thread Ted Yu
This should be documented in README.txt

On Wed, Oct 13, 2010 at 6:14 PM, Steven Wong sw...@netflix.com wrote:

  You need to run hive_root/build/dist/bin/hive, not hive_root/bin/hive.





 From: hdev ml [mailto:hde...@gmail.com]
 Sent: Wednesday, October 13, 2010 2:18 PM
 To: hive-u...@hadoop.apache.org
 Subject: Exception in hive startup



 Hi all,

 I installed Hadoop 0.20.2 and installed hive 0.5.0.

 I followed all the instructions on Hive's getting started page for setting
 up environment variables like HADOOP_HOME.

 When I run bin/hive from the command prompt in the hive installation folder,
 it gives me the following exception:

 Exception in thread "main" java.lang.NoClassDefFoundError:
 org/apache/hadoop/hive/conf/HiveConf
 at java.lang.Class.forName0(Native Method)
 at java.lang.Class.forName(Class.java:247)
 at org.apache.hadoop.util.RunJar.main(RunJar.java:149)
 Caused by: java.lang.ClassNotFoundException:
 org.apache.hadoop.hive.conf.HiveConf
 at java.net.URLClassLoader$1.run(URLClassLoader.java:200)
 at java.security.AccessController.doPrivileged(Native Method)
 at java.net.URLClassLoader.findClass(URLClassLoader.java:188)
 at java.lang.ClassLoader.loadClass(ClassLoader.java:307)
 at java.lang.ClassLoader.loadClass(ClassLoader.java:252)
 at java.lang.ClassLoader.loadClassInternal(ClassLoader.java:320)
 ... 3 more

 Please note that my Hadoop installation is working fine.

 What could be the cause of this? Anybody has any idea?

 Thanks
 Harshad