Re: Decompression using LZO

2013-08-16 Thread Sanjay Subramanian
What do u want to do ? View the .LZO file on HDFS ?
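If the goal is just to read an .lzo file that is already on HDFS, a couple of
common approaches (a sketch only - it assumes the hadoop-lzo codec is installed
on the cluster and the lzop tool is available locally; paths are placeholders):

hdfs dfs -text /path/to/file.lzo | head

hdfs dfs -cat /path/to/file.lzo | lzop -dc | head

The first form needs com.hadoop.compression.lzo.LzopCodec listed in
io.compression.codecs; the second just streams the bytes to the client and
decompresses them there.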


From: Sandeep Nemuri <nhsande...@gmail.com>
Reply-To: "user@hadoop.apache.org" <user@hadoop.apache.org>
Date: Tuesday, August 6, 2013 12:08 AM
To: "user@hadoop.apache.org" <user@hadoop.apache.org>
Subject: Decompression using LZO

Hi all,
 Can anyone help me out? How do I decompress data in Hadoop using LZO?

--
--Regards
  Sandeep Nemuri



Re: EBADF: Bad file descriptor

2013-07-11 Thread Sanjay Subramanian
Thanks
I will look into logs to see if I see anything else…
sanjay


From: Colin McCabe <cmcc...@alumni.cmu.edu>
Reply-To: "user@hadoop.apache.org" <user@hadoop.apache.org>
Date: Wednesday, July 10, 2013 11:52 AM
To: "user@hadoop.apache.org" <user@hadoop.apache.org>
Subject: Re: EBADF: Bad file descriptor

To clarify a little bit, the readahead pool can sometimes spit out this message 
if you close a file while a readahead request is in flight.  It's not an error 
and just reflects the fact that the file was closed hastily, probably because 
of some other bug which is the real problem.

Colin
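A side note: if the warning is too noisy, readahead of the intermediate files can
usually be switched off in mapred-site.xml. This only silences the symptom Colin
describes; it does not fix whatever closed the file early. The property below is
the one in MR2's mapred-default.xml, if I remember right:

<property>
  <name>mapreduce.ifile.readahead</name>
  <value>false</value>
</property>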


On Wed, Jul 10, 2013 at 11:50 AM, Colin McCabe <cmcc...@alumni.cmu.edu> wrote:
That's just a warning message.  It's not causing your problem-- it's just a 
symptom.

You will have to find out why the MR job failed.

best,
Colin


On Wed, Jul 10, 2013 at 8:19 AM, Sanjay Subramanian <sanjay.subraman...@wizecommerce.com> wrote:
2013-07-10 07:11:50,131 WARN [Readahead Thread #1] 
org.apache.hadoop.io.ReadaheadPool: Failed readahead on ifile
EBADF: Bad file descriptor
at org.apache.hadoop.io.nativeio.NativeIO.posix_fadvise(Native Method)
at 
org.apache.hadoop.io.nativeio.NativeIO.posixFadviseIfPossible(NativeIO.java:145)
at 
org.apache.hadoop.io.ReadaheadPool$ReadaheadRequestImpl.run(ReadaheadPool.java:205)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)

Hi

I have an Oozie workflow that runs a MR job, and for the past two days I have
started getting this error in one of the MR jobs being processed.
However, if I run it again it succeeds :-(  but about 1 hr is wasted in the
process.

Any clues ?

Or should I post this issue in the Oozie postings ?

Thanks

sanjay

Configuration
Name                                              Value
impression.log.record.cached.tag                  cached=
impression.log.record.end.tag                     [end
impressions.mapreduce.conf.file.full.path         /workflows/impressions/config/aggregations.conf
mapred.job.queue.name                             default
mapred.mapper.new-api                             true
mapred.reducer.new-api                            true
mapreduce.input.fileinputformat.inputdir          /data/input/impressionlogs/outpdirlogs/-99-99
mapreduce.job.inputformat.class                   com.wizecommerce.utils.mapred.ZipMultipleLineRecordInputFormat
mapreduce.job.map.class                           com.wizecommerce.parser.mapred.OutpdirImpressionLogMapper
mapreduce.job.maps                                500
mapreduce.job.name                                OutpdirImpressions_475-130611151004460-oozie-oozi-W
mapreduce.job.output.value.class                  org.apache.hadoop.io.Text
mapreduce.job.outputformat.class                  com.wizecommerce.utils.mapred.NextagTextOutputFormat
mapreduce.job.reduce.class                        com.wizecommerce.parser.mapred.OutpdirImpressionLogReducer
mapreduce.job.reduces                             8
mapreduce.map.output.compress                     true
mapreduce.map.output.compress.codec               org.apache.hadoop.io.compress.SnappyCodec
mapreduce.map.output.key.class                    org.apache.hadoop.io.Text
mapreduce.map.output.value.class                  com.wizecommerce.parser.dao.OutpdirLogRecord
mapreduce.output.fileoutputformat.compress        true
mapreduce.output.fileoutputformat.compress.codec  com.hadoop.compression.lzo.LzopCodec
mapreduce.output.fileoutputformat.outputdir       /data/output/impressions/outpdir/-99-99/475-130611151004460-oozie-oozi-W/outpdir_impressions_ptitle
mapreduce.tasktracker.map.tasks.maximum           12
mapreduce.tasktracker.reduce.tasks.maximum        8
outpdir.log.exclude.processing.datatypes          header,sellerhidden








EBADF: Bad file descriptor

2013-07-10 Thread Sanjay Subramanian
2013-07-10 07:11:50,131 WARN [Readahead Thread #1] 
org.apache.hadoop.io.ReadaheadPool: Failed readahead on ifile
EBADF: Bad file descriptor
at org.apache.hadoop.io.nativeio.NativeIO.posix_fadvise(Native Method)
at 
org.apache.hadoop.io.nativeio.NativeIO.posixFadviseIfPossible(NativeIO.java:145)
at 
org.apache.hadoop.io.ReadaheadPool$ReadaheadRequestImpl.run(ReadaheadPool.java:205)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)

Hi

I have an Oozie workflow that runs a MR job, and for the past two days I have
started getting this error in one of the MR jobs being processed.
However, if I run it again it succeeds :-(  but about 1 hr is wasted in the
process.

Any clues ?

Or should I post this issue in the Oozie postings ?

Thanks

sanjay

Configuration
Name                                              Value
impression.log.record.cached.tag                  cached=
impression.log.record.end.tag                     [end
impressions.mapreduce.conf.file.full.path         /workflows/impressions/config/aggregations.conf
mapred.job.queue.name                             default
mapred.mapper.new-api                             true
mapred.reducer.new-api                            true
mapreduce.input.fileinputformat.inputdir          /data/input/impressionlogs/outpdirlogs/-99-99
mapreduce.job.inputformat.class                   com.wizecommerce.utils.mapred.ZipMultipleLineRecordInputFormat
mapreduce.job.map.class                           com.wizecommerce.parser.mapred.OutpdirImpressionLogMapper
mapreduce.job.maps                                500
mapreduce.job.name                                OutpdirImpressions_475-130611151004460-oozie-oozi-W
mapreduce.job.output.value.class                  org.apache.hadoop.io.Text
mapreduce.job.outputformat.class                  com.wizecommerce.utils.mapred.NextagTextOutputFormat
mapreduce.job.reduce.class                        com.wizecommerce.parser.mapred.OutpdirImpressionLogReducer
mapreduce.job.reduces                             8
mapreduce.map.output.compress                     true
mapreduce.map.output.compress.codec               org.apache.hadoop.io.compress.SnappyCodec
mapreduce.map.output.key.class                    org.apache.hadoop.io.Text
mapreduce.map.output.value.class                  com.wizecommerce.parser.dao.OutpdirLogRecord
mapreduce.output.fileoutputformat.compress        true
mapreduce.output.fileoutputformat.compress.codec  com.hadoop.compression.lzo.LzopCodec
mapreduce.output.fileoutputformat.outputdir       /data/output/impressions/outpdir/-99-99/475-130611151004460-oozie-oozi-W/outpdir_impressions_ptitle
mapreduce.tasktracker.map.tasks.maximum           12
mapreduce.tasktracker.reduce.tasks.maximum        8
outpdir.log.exclude.processing.datatypes          header,sellerhidden






Re: Splitting input file - increasing number of mappers

2013-07-06 Thread Sanjay Subramanian
More mappers will make it faster
 U can try this parameter
  mapreduce.input.fileinputformat.split.maxsize=
 This will control the input split size and force more mappers to run
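 For example, a hedged way to pass it (this assumes the driver goes through
ToolRunner/GenericOptionsParser so -D properties are honored; the jar, class
name and the 32 MB value are only placeholders):

 hadoop jar myjob.jar com.example.MyDriver \
   -D mapreduce.input.fileinputformat.split.maxsize=33554432 \
   /input/path /output/path

 The same thing can be done in the driver before submission with
 job.getConfiguration().setLong("mapreduce.input.fileinputformat.split.maxsize", 33554432L);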


Also ur usecase seems a good candidate for defining a Combiner, because u r
grouping keys based on a criteria.
But the only gotcha is that Combiners are not guaranteed to be run.

Give these a shot

Good luck

sanjay



From: parnab kumar <parnab.2...@gmail.com>
Reply-To: "user@hadoop.apache.org" <user@hadoop.apache.org>
Date: Saturday, July 6, 2013 12:50 AM
To: "user@hadoop.apache.org" <user@hadoop.apache.org>
Subject: Splitting input file - increasing number of mappers

Hi ,

I have an input file where each line is of the form :



  URLs whose number is within a threshold are considered similar. My task
is to group together all similar URLs. For this I wrote a custom Writable where
I implemented the threshold check in the compareTo method. Therefore, when Hadoop
sorts, the similar URLs are grouped together. This seems to work fine.
  I have the following queries:

   1>  Since I am relying mostly on the sort feature provided by Hadoop, am I
decreasing efficiency in any way, or, since I am using the sort feature that
Hadoop does best, am I actually doing the right thing? If it is the right thing,
then my job mostly relies on the map task; therefore, will increasing the number
of mappers increase efficiency?

 2>  My file size is not more than 64 MB, i.e. one Hadoop block, which means
no more than 1 mapper will be invoked. Will splitting the file into smaller
pieces increase efficiency by invoking more mappers?

Can someone kindly provide some insight/advice regarding the above?

Thanks ,
Parnab,
MS student, IIT kharagpur



Piping to HDFS (from Linux or HDFS)

2013-06-24 Thread Sanjay Subramanian
Hi guys

While I was trying to get some test data and configurations done quickly, I
realized one can do this, and I think it's super cool

Processing existing file on Linux/HDFS and Piping it directly to hdfs

source = Linux  dest=HDFS
==
File = sanjay.conf.template
We want to replace one line in the file -99-99 > 1947-08-15
DATE_STR=-99-99

cat sanjay.conf.template | sed 's/-99-99/1947-08-15/g' | hdfs dfs -put - 
/user/sanjay/sanjay.conf

source = HDFS  dest=HDFS
==
hdfs dfs -cat  /user/nextag/sanjay.conf.template  | sed 
's/-99-99/1947-08-15/g' | hdfs dfs -put - 
/user/sanjay/1947-08-15/nextag.conf


Thanks

sanjay



Many Errors at the last step of copying files from _temporary to Output Directory

2013-06-14 Thread Sanjay Subramanian
Hi

My environment is like this

INPUT FILES
==
400 GZIP files , one from each server - average size gzipped 25MB

REDUCER
===
Uses MultipleOutput

OUTPUT  (Snappy)
===
/path/to/output/dir1
/path/to/output/dir2
/path/to/output/dir3
/path/to/output/dir4

Number of output directories = 1600
Number of output files = 17000

SETTINGS
=
Maximum Number of Transfer Threads
dfs.datanode.max.xcievers, dfs.datanode.max.transfer.threads  = 16384
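For reference, that limit normally lives in hdfs-site.xml on the datanodes
(dfs.datanode.max.xcievers is the older, deprecated spelling of the same setting);
a sketch with the value quoted above:

<property>
  <name>dfs.datanode.max.transfer.threads</name>
  <value>16384</value>
</property>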

ERRORS
===
I am getting errors consistently at the last step of  copying files from 
_temporary to Output Directory.

ERROR 1
===
EBADF: Bad file descriptor
at org.apache.hadoop.io.nativeio.NativeIO.posix_fadvise(Native Method)
at 
org.apache.hadoop.io.nativeio.NativeIO.posixFadviseIfPossible(NativeIO.java:145)
at 
org.apache.hadoop.io.ReadaheadPool$ReadaheadRequestImpl.run(ReadaheadPool.java:205)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)


ERROR 2
===
2013-06-13 23:35:15,902 WARN [main] org.apache.hadoop.hdfs.DFSClient: Failed to 
connect to /10.28.21.171:50010 for block, add to deadNodes and continue. 
java.io.IOException: Got error for OP_READ_BLOCK, self=/10.28.21.171:57436, 
remote=/10.28.21.171:50010, for file 
/user/nextag/oozie-workflows/config/aggregations.conf, for pool 
BP-64441488-10.28.21.167-1364511907893 block 213045727251858949_8466884
java.io.IOException: Got error for OP_READ_BLOCK, self=/10.28.21.171:57436, 
remote=/10.28.21.171:50010, for file 
/user/nextag/oozie-workflows/config/aggregations.conf, for pool 
BP-64441488-10.28.21.167-1364511907893 block 213045727251858949_8466884
at 
org.apache.hadoop.hdfs.RemoteBlockReader2.checkSuccess(RemoteBlockReader2.java:444)
at 
org.apache.hadoop.hdfs.RemoteBlockReader2.newBlockReader(RemoteBlockReader2.java:409)
at 
org.apache.hadoop.hdfs.BlockReaderFactory.newBlockReader(BlockReaderFactory.java:105)
at org.apache.hadoop.hdfs.DFSInputStream.getBlockReader(DFSInputStream.java:937)
at org.apache.hadoop.hdfs.DFSInputStream.blockSeekTo(DFSInputStream.java:455)
at 
org.apache.hadoop.hdfs.DFSInputStream.readWithStrategy(DFSInputStream.java:645)
at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:689)
at java.io.DataInputStream.read(DataInputStream.java:132)
at sun.nio.cs.StreamDecoder.readBytes(StreamDecoder.java:264)
at sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:306)
at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:158)
at java.io.InputStreamReader.read(InputStreamReader.java:167)
at java.io.BufferedReader.fill(BufferedReader.java:136)
at java.io.BufferedReader.readLine(BufferedReader.java:299)
at java.io.BufferedReader.readLine(BufferedReader.java:362)
at com.wizecommerce.utils.mapred.HdfsUtils.readFileIntoList(HdfsUtils.java:83)
at com.wizecommerce.utils.mapred.HdfsUtils.getConfigParamMap(HdfsUtils.java:214)
at 
com.wizecommerce.utils.mapred.NextagFileOutputFormat.getOutputPath(NextagFileOutputFormat.java:171)
at 
com.wizecommerce.utils.mapred.NextagFileOutputFormat.getOutputCommitter(NextagFileOutputFormat.java:330)
at 
com.wizecommerce.utils.mapred.NextagFileOutputFormat.getDefaultWorkFile(NextagFileOutputFormat.java:306)
at 
com.wizecommerce.utils.mapred.NextagTextOutputFormat.getRecordWriter(NextagTextOutputFormat.java:111)
at 
org.apache.hadoop.mapreduce.lib.output.MultipleOutputs.getRecordWriter(MultipleOutputs.java:413)
at 
org.apache.hadoop.mapreduce.lib.output.MultipleOutputs.write(MultipleOutputs.java:395)
at 
com.wizecommerce.parser.mapred.OutpdirImpressionLogReducer.writePtitleExplanationBlob(OutpdirImpressionLogReducer.java:337)
at 
com.wizecommerce.parser.mapred.OutpdirImpressionLogReducer.processPTitle(OutpdirImpressionLogReducer.java:171)
at 
com.wizecommerce.parser.mapred.OutpdirImpressionLogReducer.reduce(OutpdirImpressionLogReducer.java:91)
at 
com.wizecommerce.parser.mapred.OutpdirImpressionLogReducer.reduce(OutpdirImpressionLogReducer.java:24)
at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:170)
at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:636)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:396)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:152)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1332)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:147)


Thanks
Sanjay



Re: How to design the mapper and reducer for the following problem

2013-06-14 Thread Sanjay Subramanian
Hi

My quick and dirty non-optimized solution would be as follows

MAPPER
===
OUTPUT from Mapper
  KEY = {HASH1 HASH2 HASH3 HASH4}   VALUE = DOCID1
  KEY = {HASH5 HASH3 HASH1 HASH4}   VALUE = DOCID2

REDUCER

Iterate over keys
For a key = (say) {HASH1,HASH2,HASH3,HASH4}
 Format the collection of values into some StringBuilder kind of class

Output
KEY = {DOCID1 DOCID2}  value = null
KEY = {DOCID3 DOCID5} value = null

Hope I have understood your problem correctly…If not sorry about that

sanjay
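A rough Java sketch of that idea (hypothetical class names, not tested code; it
groups only documents whose hash sets are identical, so the threshold matching
would still have to live in the custom key's compareTo / grouping comparator, as
in your original approach):

import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class SimilarDocsSketch {

    // Mapper: key = the document's hashes in canonical (sorted) order, value = DOCID
    public static class HashSetMapper extends Mapper<LongWritable, Text, Text, Text> {
        protected void map(LongWritable offset, Text line, Context ctxt)
                throws IOException, InterruptedException {
            String[] parts = line.toString().trim().split("\\s+");   // DOCID HASH1 HASH2 ...
            if (parts.length < 2) return;
            String[] hashes = Arrays.copyOfRange(parts, 1, parts.length);
            Arrays.sort(hashes);   // same set of hashes => same key string
            StringBuilder key = new StringBuilder();
            for (String h : hashes) key.append(h).append(' ');
            ctxt.write(new Text(key.toString().trim()), new Text(parts[0]));
        }
    }

    // Reducer: emit every pair of DOCIDs that arrived under the same key
    public static class PairReducer extends Reducer<Text, Text, Text, NullWritable> {
        protected void reduce(Text key, Iterable<Text> values, Context ctxt)
                throws IOException, InterruptedException {
            List<String> docs = new ArrayList<String>();
            for (Text v : values) docs.add(v.toString());
            for (int i = 0; i < docs.size(); i++)
                for (int j = i + 1; j < docs.size(); j++)
                    ctxt.write(new Text(docs.get(i) + " " + docs.get(j)), NullWritable.get());
        }
    }
}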

From: parnab kumar <parnab.2...@gmail.com>
Reply-To: "user@hadoop.apache.org" <user@hadoop.apache.org>
Date: Friday, June 14, 2013 7:06 AM
To: "user@hadoop.apache.org" <user@hadoop.apache.org>
Subject: How to design the mapper and reducer for the following problem

An input file where each line corresponds to a document. Each document is
identified by some fingerprints. For example, a line in the input file
is of the following form:

input:
-
DOCID1   HASH1 HASH2 HASH3 HASH4
DOCID2   HASH5 HASH3 HASH1 HASH4

The output of the mapreduce job should write the pairs of DOCIDs which share a
threshold number of HASHes in common.

output:
--
DOCID1 DOCID2
DOCID3 DOCID5



Re: Now give .gz file as input to the MAP

2013-06-12 Thread Sanjay Subramanian
Rahul-da

I found bz2 pretty slow (although splittable) so I switched to snappy (only 
sequence files are splittable but compress-decompress is fast)

Thanks
Sanjay

From: Rahul Bhattacharjee <rahul.rec@gmail.com>
Reply-To: "user@hadoop.apache.org" <user@hadoop.apache.org>
Date: Tuesday, June 11, 2013 9:53 PM
To: "user@hadoop.apache.org" <user@hadoop.apache.org>
Subject: Re: Now give .gz file as input to the MAP

Nothing special is required to process .gz files using MR. However, as Sanjay
mentioned, verify the codecs configured in core-site; another thing to note is
that these files are not splittable.

You might want to use bz2 , these are splittable.

Thanks,
Rahul


On Wed, Jun 12, 2013 at 10:14 AM, Sanjay Subramanian <sanjay.subraman...@wizecommerce.com> wrote:

hadoopConf.set("mapreduce.job.inputformat.class", 
"com.wizecommerce.utils.mapred.TextInputFormat");

hadoopConf.set("mapreduce.job.outputformat.class", 
"com.wizecommerce.utils.mapred.TextOutputFormat");

No special settings required for reading Gzip except these above

If u want to output Gzip


hadoopConf.set("mapreduce.output.fileoutputformat.compress", "true");

hadoopConf.set("mapreduce.output.fileoutputformat.compress.codec", 
"org.apache.hadoop.io.compress.GzipCodec");


Make sure Gzip codec is defined in core-site.xml

<property>
  <name>io.compression.codecs</name>
  <value>org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.DefaultCodec</value>
</property>


I have a question

Why are u using GZIP as input to Map? These are not splittable…unless u have
to read multiple lines (like lines between a BEGIN and END block in a log file)
and send them as one record to the mapper.

Also, among non-splittable codecs, the Snappy codec is better.

Good Luck


sanjay

From: samir das mohapatra <samir.help...@gmail.com>
Reply-To: "user@hadoop.apache.org" <user@hadoop.apache.org>
Date: Tuesday, June 11, 2013 9:07 PM
To: "cdh-u...@cloudera.com" <cdh-u...@cloudera.com>, "user@hadoop.apache.org" <user@hadoop.apache.org>, "user-h...@hadoop.apache.org" <user-h...@hadoop.apache.org>
Subject: Now give .gz file as input to the MAP

Hi All,
Did anyone work on how to pass a .gz file as the input to a mapreduce job?

Regards,
samir.





Re: Now give .gz file as input to the MAP

2013-06-11 Thread Sanjay Subramanian
hadoopConf.set("mapreduce.job.inputformat.class", 
"com.wizecommerce.utils.mapred.TextInputFormat");

hadoopConf.set("mapreduce.job.outputformat.class", 
"com.wizecommerce.utils.mapred.TextOutputFormat");

No special settings required for reading Gzip except these above

If u want to output Gzip


hadoopConf.set("mapreduce.output.fileoutputformat.compress", "true");

hadoopConf.set("mapreduce.output.fileoutputformat.compress.codec", 
"org.apache.hadoop.io.compress.GzipCodec");


Make sure Gzip codec is defined in core-site.xml

<property>
  <name>io.compression.codecs</name>
  <value>org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.DefaultCodec</value>
</property>


I have a question

Why are u using GZIP as input to Map? These are not splittable…unless u have
to read multiple lines (like lines between a BEGIN and END block in a log file)
and send them as one record to the mapper.

Also, among non-splittable codecs, the Snappy codec is better.

Good Luck


sanjay

From: samir das mohapatra <samir.help...@gmail.com>
Reply-To: "user@hadoop.apache.org" <user@hadoop.apache.org>
Date: Tuesday, June 11, 2013 9:07 PM
To: "cdh-u...@cloudera.com" <cdh-u...@cloudera.com>, "user@hadoop.apache.org" <user@hadoop.apache.org>, "user-h...@hadoop.apache.org" <user-h...@hadoop.apache.org>
Subject: Now give .gz file as input to the MAP

Hi All,
Did anyone work on how to pass a .gz file as the input to a mapreduce job?

Regards,
samir.



Re: MapReduce on Local FileSystem

2013-05-31 Thread Sanjay Subramanian
Hi
What's the data volume per hour or per day u r looking to put into HDFS?

For dumping source data into HDFS there are again few options

Option 1
===
Have parallel threads dumping raw data into HDFS from your source

Option 2
===
Design how your Objects will look and write code to convert raw input files 
into Sequence Files and then dump it into HDFS
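A bare-bones sketch of Option 2, assuming Text key/value pairs and made-up paths
(the old createWriter overload is used simply because it exists in both MRv1 and
MRv2; replace the key/value model with your own object design):

import java.io.BufferedReader;
import java.io.FileReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class RawToSequenceFile {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path target = new Path("/user/sanjay/raw-2013-05-31.seq");              // hypothetical HDFS target
        SequenceFile.Writer writer =
                SequenceFile.createWriter(fs, conf, target, Text.class, Text.class);
        BufferedReader in = new BufferedReader(new FileReader("/data01/raw/input.log")); // hypothetical source
        try {
            String line;
            long lineNo = 0;
            while ((line = in.readLine()) != null) {
                // key = line number, value = the raw line
                writer.append(new Text(Long.toString(lineNo++)), new Text(line));
            }
        } finally {
            in.close();
            writer.close();
        }
    }
}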

The community may have more options….depends on your use case

Regards
sanjay


From: Agarwal, Nikhil <nikhil.agar...@netapp.com>
Reply-To: "user@hadoop.apache.org" <user@hadoop.apache.org>
Date: Friday, May 31, 2013 12:24 AM
To: "user@hadoop.apache.org" <user@hadoop.apache.org>
Subject: RE: MapReduce on Local FileSystem

Hi,

Thank you for your reply. One simple answer can be to reduce the time taken for 
ingesting the data in HDFS.

Regards,
Nikhil

From: Sanjay Subramanian [mailto:sanjay.subraman...@wizecommerce.com]
Sent: Friday, May 31, 2013 12:50 PM
To: user@hadoop.apache.org
Cc: user@hadoop.apache.org
Subject: Re: MapReduce on Local FileSystem

Basic question. Why would u want to do that? Also I think the MapR Hadoop
distribution has an NFS-mountable HDFS
Sanjay

Sent from my iPhone

On May 30, 2013, at 11:37 PM, "Agarwal, Nikhil" <nikhil.agar...@netapp.com> wrote:
Hi,

Is it possible to run MapReduce on multiple nodes using Local File system 
(file:///)  ?
I am able to run it in a single-node setup, but in a multi-node setup the
“slave” nodes are not able to access the “jobtoken” file which is present in
the hadoop.tmp.dir on the “master” node.

Please let me know if it is possible to do this.

Thanks & Regards,
Nikhil




Re: MapReduce on Local FileSystem

2013-05-31 Thread Sanjay Subramanian
Basic question. Why would u want to do that? Also I think the MapR Hadoop
distribution has an NFS-mountable HDFS
Sanjay

Sent from my iPhone

On May 30, 2013, at 11:37 PM, "Agarwal, Nikhil" <nikhil.agar...@netapp.com> wrote:

Hi,

Is it possible to run MapReduce on multiple nodes using Local File system 
(file:///)  ?
I am able to run it in a single-node setup, but in a multi-node setup the
“slave” nodes are not able to access the “jobtoken” file which is present in
the hadoop.tmp.dir on the “master” node.

Please let me know if it is possible to do this.

Thanks & Regards,
Nikhil



Re: Problem in uploading file in WebHDFS

2013-05-25 Thread Sanjay Subramanian
Can u try one of the following

hdfs dfs -put localfile /path/to/dir/in/hdfs

hdfs dfs -copyFromLocal localfile /path/to/dir/in/hdfs

Thanks
Sanjay.

Sent from my iPhone
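For context, WebHDFS file creation is a two-step exchange: the namenode answers
the first PUT with a 307 redirect to a datanode, and the data is sent in a second
PUT to that Location. Doing the two steps by hand makes it easy to see which host
the redirect points at - that hostname must be resolvable from the client, which
is exactly what appears to fail below. A sketch (path and names as in the quoted
mail):

# step 1: ask the namenode; no data is sent yet
curl -i -X PUT "http://127.0.0.1:50070/webhdfs/v1/waseem?op=CREATE&user.name=mustaqeem"

# step 2: send the file to the Location header returned by step 1
curl -i -X PUT -T /home/mustaqeem/abc "<Location-from-step-1>"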

On May 25, 2013, at 5:07 AM, "Mohammad Mustaqeem" 
<3m.mustaq...@gmail.com> wrote:

I am using pseudo-distributed Hadoop-2.0.3-alpha.
I want to upload a file named 'abc' from my local filesystem on HDFS at 
location "/waseem".
When I run this command on terminal -
curl --negotiate -u : -i -X PUT -T /home/mustaqeem/abc -L 
'http://127.0.0.1:50070/webhdfs/v1/waseem?op=CREATE&user.name=mustaqeem'


First it hangs for some time and gives the following error -

HTTP/1.1 100 Continue

HTTP/1.1 307 TEMPORARY_REDIRECT
Cache-Control: no-cache
Expires: Thu, 01-Jan-1970 00:00:00 GMT
Date: Sat, 25 May 2013 12:01:14 GMT
Pragma: no-cache
Date: Sat, 25 May 2013 12:01:14 GMT
Pragma: no-cache
Content-Type: application/octet-stream
Set-Cookie: 
hadoop.auth="u=mustaqeem&p=mustaqeem&t=simple&e=1369519274604&s=HyX9z80lRU+9RdKeN5USW5Q/AZY=";Path=/
Location: 
http://mustaqeem:50075/webhdfs/v1/waseem?op=CREATE&user.name=mustaqeem&namenoderpcaddress=localhost:8020&overwrite=false
Content-Length: 0
Server: Jetty(6.1.26)

curl: (7) couldn't connect to host


Whats the problem?

--
With regards ---
Mohammad Mustaqeem,
M.Tech (CSE)
MNNIT Allahabad





Re: Where to begin from??

2013-05-24 Thread Sanjay Subramanian
Hey guys

Is there a way to dynamically change the input dir and outputdir

I have the following CONSTANT directories in HDFS

  *   /path/to/input/-99-99 (empty directory )
  *   /path/to/output/-99-99 (empty directory)

A new directory with yesterday's date, like /path/to/input/2013-05-23, gets
created every day

I set the job params
mapreduce.input.fileinputformat.inputdir=/path/to/input/-99-99
mapreduce.output.fileoutputformat.outputdir=/path/to/output/-99-99

But in my Mapper I thought I could sneak in this code….but it does not work?


protected void setup(Context context) throws IOException, InterruptedException {

    org.apache.hadoop.conf.Configuration hadoopConf = ((JobContext) context).getConfiguration();

    String inputDir = hadoopConf.get("mapreduce.input.fileinputformat.inputdir");
    String outputDir = hadoopConf.get("mapreduce.output.fileoutputformat.outputdir");
    String yesterdaysDate = DateUtils.getDayMMDD("-", -1);

    String inputDirUseThis = inputDir.replaceAll("-99-99", yesterdaysDate);
    String outputDirUseThis = outputDir.replaceAll("-99-99", yesterdaysDate);

    hadoopConf.set("mapreduce.input.fileinputformat.inputdir", inputDirUseThis);
    hadoopConf.set("mapreduce.output.fileoutputformat.outputdir", outputDirUseThis);
}
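In case it helps anyone who lands on this thread later: by the time a mapper's
setup() runs, the input splits and the output committer have already been built
from the original values, so resetting those two properties there has no effect.
A hedged sketch of doing it in the driver instead (DateUtils.getDayMMDD is the
helper from the code above; the rest is illustrative):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class DailyJobDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        String yesterdaysDate = DateUtils.getDayMMDD("-", -1);  // helper from the post above
        Job job = new Job(conf, "daily-job-" + yesterdaysDate);
        // resolve the dated directories before submission, not inside the mapper
        FileInputFormat.setInputPaths(job, new Path("/path/to/input/" + yesterdaysDate));
        FileOutputFormat.setOutputPath(job, new Path("/path/to/output/" + yesterdaysDate));
        // ... set jar, mapper, reducer and output classes as usual ...
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}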

Thanks

sanjay




Re: Where to begin from??

2013-05-23 Thread Sanjay Subramanian
I agree with Chris…don't worry about what the technology is called Hadoop , Big 
table, Lucene, Hive….Model the problem and see what the solution could 
be….that’s very important

And Lokesh, please don't mind…we are writing to u perhaps stuff that u don't
want to hear, but it's an important real-world perspective

To illustrate what I mean let me give u a few problems to think about and see 
how u would solve them….

1. Before Microsoft took over Skype at least this feature used to be there and 
the feature is like this……u type the name of a person and it used to come back 
with some search results in milliseconds often searching close to a billion 
names…….How would u design such a search architecture ?

2.  In 2012, say 50 million users (cookie based) searched Macys.com on a SALES 
weekend and say 20,000 bought $100 dollar shoes. Now this year 2013 on that 
SALES weekend 60 million users (cookie based) are buying on the website….You 
want to give a 25% extra reward to only those cookies that were from last 
year…So u are looking for an intersection set of possibly 20,000 cookies in two 
sets - 50million and 60 million…..How would u solve this problem within milli 
seconds  ?

3. Last my favorite….The Postal Services department wants to think of new 
business ideas to avoid bankruptcy…One idea I have is they have zillion small 
delivery vans that go to each street in the country….Say I lease out the space 
to BIG wireless phone providers and promise them them that I will mount 
wireless signal strength measurement systems on these vans and I will provide 
them data 3  times a day…how will u devise a solution to analyse and store data 
?

I am sure if u look around in India as well u will see a lot of situations 
where u want to solve a problem….

As Chris says , think about the problem u want to solve, then model the 
solutions and pick the best one…

On the flip side….I can tell u it will still be a few years till many banks and
stock trading houses will believe in Cassandra and HBase for OLTP, because that
data is critical……If your timeline in Facebook does not show a photo, it's
possibly OK, but if your 1 million deposit in a bank does not show up for days or
suddenly vanishes - u r possibly not going to take that lightly…..

Ok enough RAMBLING….

Good luck

sanjay



From: Chris Embree <cemb...@gmail.com>
Reply-To: "user@hadoop.apache.org" <user@hadoop.apache.org>, "ch...@embree.us" <ch...@embree.us>
Date: Thursday, May 23, 2013 7:47 PM
To: "user@hadoop.apache.org" <user@hadoop.apache.org>
Subject: Re: Where to begin from??

I'll be chastised and have mean things said about me for this.

Get some experience in IT before you start looking at Hadoop.  My reasoning is 
this:  If you don't know how to develop real applications in a Non-Hadoop 
world, you'll struggle a lot to develop with Hadoop.

Asking what "things you need to know in compulsory" is like saying you want to 
"learn computers" -- totally worthless!  Find a problem to solve and seek to 
learn the tools you need to solve your problem.  Otherwise, your learning is 
un-applied and somewhat useless.

Picture a recent acting-school graduate being asked to direct the next Star Wars
movie.  It's almost like that.


On Thu, May 23, 2013 at 10:39 PM, Lokesh Basu <lokesh.b...@gmail.com> wrote:
Hi all,

I'm a computer science undergraduate and have recently started to explore
Hadoop. I find it very interesting and want to get involved both as a contributor
and a developer for this open source project. I have been going through many
textbooks related to Hadoop and HDFS, but I still find it very difficult to tell
where a beginner should start from before writing his first line of code as a
contributor or developer.

Also please tell me what are the things I compulsorily need to know before I
dive into the depth of these things.

Thanking you all in anticipation.




--

Lokesh Chandra Basu
B. Tech
Computer Science and Engineering
Indian Institute of Technology, Roorkee
India(GMT +5hr 30min)
+91-8267805498






Re: hive.log

2013-05-23 Thread Sanjay Subramanian
Ok figured it out

-  vi  /etc/hive/conf/hive-log4j.properties

- Modify this line
#hive.log.dir=/tmp/${user.name}
hive.log.dir=/data01/workspace/hive/log/${user.name}


From: Sanjay Subramanian <sanjay.subraman...@wizecommerce.com>
Reply-To: "u...@hive.apache.org" <u...@hive.apache.org>
Date: Thursday, May 23, 2013 2:56 PM
To: "u...@hive.apache.org" <u...@hive.apache.org>
Cc: User <user@hadoop.apache.org>
Subject: hive.log

How do I set the property in hive-site.xml that defines the local linux 
directory for hive.log ?
Thanks
sanjay




hive.log

2013-05-23 Thread Sanjay Subramanian
How do I set the property in hive-site.xml that defines the local linux 
directory for hive.log ?
Thanks
sanjay



Re: Hive tmp logs

2013-05-23 Thread Sanjay Subramanian
Clarification
This property defines a location on HDFS

<property>
  <name>hive.exec.scratchdir</name>
  <value>/data01/workspace/hive scratch/dir/on/local/linux/disk</value>
</property>





From: Sanjay Subramanian <sanjay.subraman...@wizecommerce.com>
Date: Wednesday, May 22, 2013 12:23 PM
To: "u...@hive.apache.org" <u...@hive.apache.org>
Cc: User <user@hadoop.apache.org>
Subject: Re: Hive tmp logs


<property>
  <name>hive.querylog.location</name>
  <value>/path/to/hivetmp/dir/on/local/linux/disk</value>
</property>



From: Anurag Tangri <tangri.anu...@gmail.com>
Reply-To: "u...@hive.apache.org" <u...@hive.apache.org>
Date: Wednesday, May 22, 2013 11:56 AM
To: "u...@hive.apache.org" <u...@hive.apache.org>
Cc: Hive <u...@hive.apache.org>, User <user@hadoop.apache.org>
Subject: Re: Hive tmp logs

Hi,
You can add the Hive query log property in your hive-site.xml and point it to
the directory you want.

Thanks,
Anurag Tangri

Sent from my iPhone

On May 22, 2013, at 11:53 AM, Raj Hadoop <hadoop...@yahoo.com> wrote:

Hi,

My hive job logs are being written to the /tmp/hadoop directory. I want to change
it to a different location, i.e. a sub-directory somewhere under the 'hadoop'
user home directory.
How do I change it?

Thanks,
Ra



Re: Eclipse plugin

2013-05-22 Thread Sanjay Subramanian
Forgot to add, if u run Windows and Eclipse and want to do Hadoop u have to 
setup Cygwin and add $CYGWIN_PATH/bin to PATH

Good Luck

Sanjay

From: Sanjay Subramanian <sanjay.subraman...@wizecommerce.com>
Reply-To: "user@hadoop.apache.org" <user@hadoop.apache.org>
Date: Wednesday, May 22, 2013 4:23 PM
To: "user@hadoop.apache.org" <user@hadoop.apache.org>
Subject: Re: Eclipse plugin

Hi

I don't need any special plugin to walk thru the code

All my map reduce jobs have a

JobMapper.java
JobReducer.java
JobProcessor.java (set any configs u like)

I create a new maven project in eclipse (easier to manage dependencies) ….the 
elements are in the order as they should appear in the POM

Then In Eclipse Debug Configurations I create a new JAVA application and then I 
start debugging ! That’s it…..


MAVEN REPO INFO
===============

<repositories>
  <repository>
    <id>cloudera</id>
    <name>Cloudera repository</name>
    <url>https://repository.cloudera.com/artifactory/cloudera-repos/</url>
  </repository>
</repositories>

<properties>
  <cloudera_version>2.0.0-cdh4.1.2</cloudera_version>
</properties>

<dependencies>
  <dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-mapreduce-client-core</artifactId>
    <version>${cloudera_version}</version>
    <scope>compile</scope>
  </dependency>
  <dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-common</artifactId>
    <version>${cloudera_version}</version>
    <scope>compile</scope>
  </dependency>
  <dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-client</artifactId>
    <version>${cloudera_version}</version>
    <scope>compile</scope>
  </dependency>
</dependencies>



WordCountNew (please modify as needed)
==


public class WordCountNew {

   public static class Map extends
org.apache.hadoop.mapreduce.Mapper<LongWritable, Text, Text, IntWritable> {

 private final static IntWritable one = new IntWritable(1);

 private Text word = new Text();



 public void map(LongWritable key, Text value, Context ctxt) throws 
IOException, InterruptedException {

FileSplit fileSplit = (FileSplit)ctxt.getInputSplit();

// System.out.println(value.toString());

String fileName =  fileSplit.getPath().toString();

String line = value.toString();

StringTokenizer tokenizer = new StringTokenizer(line);

while (tokenizer.hasMoreTokens()) {

word.set(tokenizer.nextToken());

ctxt.write(word, one);

   }

 }

   }



   public static class Reduce extends org.apache.hadoop.mapreduce.Reducer<Text, IntWritable, Text, IntWritable> {

 public void reduce(Text key, Iterable<IntWritable> values, Context ctxt)
throws IOException, InterruptedException {

   int sum = 0;

   for (IntWritable value : values) {

 sum += value.get();

   }

   ctxt.write(key, new IntWritable(sum));

 }

   }



   public static void main(String[] args) throws Exception {

org.apache.hadoop.conf.Configuration hadoopConf = new 
org.apache.hadoop.conf.Configuration();

hadoopConf.set(MapredConfEnum.IMPRESSIONS_LOG_REC_SEPARATOR.getVal(), 
MapredConfEnum.PRODUCT_IMPR_LOG_REC_END.getVal());

hadoopConf.set(MapredConfEnum.IMPRESSIONS_LOG_REC_CACHED_SEPARATOR.getVal(), 
MapredConfEnum.PRODUCT_IMPR_LOG_REC_CACHED.getVal());

hadoopConf.set("io.compression.codecs", 
"org.apache.hadoop.io.compress.GzipCodec");


 Job job = new Job(hadoopConf);

 job.setJobName("wordcountNEW");

 job.setJarByClass(WordCountNew.class);

 job.setOutputKeyClass(Text.class);

 job.setOutputValueClass(IntWritable.class);

 job.setMapOutputKeyClass(Text.class);

 job.setMapOutputValueClass(IntWritable.class);



 job.setMapperClass(WordCountNew.Map.class);

 job.setCombinerClass(WordCountNew.Reduce.class);

 job.setReducerClass(Reduce.class);



//  job.setInputFormatClass(ZipMultipleLineRecordInputFormat.class);

 
job.setInputFormatClass(org.apache.hadoop.mapreduce.lib.input.TextInputFormat.class);


 job.setOutputFormatClass(TextOutputFormat.class);



 if (FileUtils.doesFileOrDirectoryExist(args[1])){

 org.apache.commons.io.FileUtils.deleteDirectory(new File(args[1]));

 }

 org.apache.hadoop.mapreduce.lib.input.FileInputFormat.setInputPaths(job, 
new Path(args[0]));

org.apache.hadoop.mapreduce.lib.output.FileOutputFormat.setOutputPath(job, 
new Path(args[1]));



 job.waitForCompletion(true);

 System.out.println();

   }

}





From: Bharati <bharati.ad...@mparallelo.com>
Reply-To: "user@hadoop.apache.org" <user@hadoop.apache.org>
Date: Wednesday, May 22, 2013 3:39 PM
To: "user@hadoop.apache.org" <user@hadoop.apache.org>
Subject: Re: Eclipse plugin

Hi Jing,

I want to be able to open a project as a map reduce project in Eclipse instead of
a Java project, as per some of the videos on YouTube.

For now let us say I want to write a wordcount program and step through it with 
hadoop 1.2.0
How can I use eclipse to rewrite the code.

The goal here is to set up the development env to start a project as map reduce
right in Eclipse or NetBeans, whichever works better. The idea is to be able to
step through the code.

Thanks,
Bharati

Sent from my iPad

On 

Re: Eclipse plugin

2013-05-22 Thread Sanjay Subramanian
Hi

I don't need any special plugin to walk thru the code

All my map reduce jobs have a

JobMapper.java
JobReducer.java
JobProcessor.java (set any configs u like)

I create a new maven project in eclipse (easier to manage dependencies) ….the 
elements are in the order as they should appear in the POM

Then In Eclipse Debug Configurations I create a new JAVA application and then I 
start debugging ! That’s it…..


MAVEN REPO INFO
===============

<repositories>
  <repository>
    <id>cloudera</id>
    <name>Cloudera repository</name>
    <url>https://repository.cloudera.com/artifactory/cloudera-repos/</url>
  </repository>
</repositories>

<properties>
  <cloudera_version>2.0.0-cdh4.1.2</cloudera_version>
</properties>

<dependencies>
  <dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-mapreduce-client-core</artifactId>
    <version>${cloudera_version}</version>
    <scope>compile</scope>
  </dependency>
  <dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-common</artifactId>
    <version>${cloudera_version}</version>
    <scope>compile</scope>
  </dependency>
  <dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-client</artifactId>
    <version>${cloudera_version}</version>
    <scope>compile</scope>
  </dependency>
</dependencies>



WordCountNew (please modify as needed)
==


public class WordCountNew {



public static class Map extends
org.apache.hadoop.mapreduce.Mapper<LongWritable, Text, Text, IntWritable> {

  private final static IntWritable one = new IntWritable(1);

  private Text word = new Text();



  public void map(LongWritable key, Text value, Context ctxt) throws 
IOException, InterruptedException {

FileSplit fileSplit = (FileSplit)ctxt.getInputSplit();

// System.out.println(value.toString());

String fileName =  fileSplit.getPath().toString();

String line = value.toString();

StringTokenizer tokenizer = new StringTokenizer(line);

while (tokenizer.hasMoreTokens()) {

word.set(tokenizer.nextToken());

ctxt.write(word, one);

}

  }

}



public static class Reduce extends
org.apache.hadoop.mapreduce.Reducer<Text, IntWritable, Text, IntWritable> {

  public void reduce(Text key, Iterable<IntWritable> values, Context ctxt)
throws IOException, InterruptedException {

int sum = 0;

for (IntWritable value : values) {

  sum += value.get();

}

ctxt.write(key, new IntWritable(sum));

  }

}



public static void main(String[] args) throws Exception {

org.apache.hadoop.conf.Configuration hadoopConf = new 
org.apache.hadoop.conf.Configuration();

hadoopConf.set(MapredConfEnum.IMPRESSIONS_LOG_REC_SEPARATOR.getVal(), 
MapredConfEnum.PRODUCT_IMPR_LOG_REC_END.getVal());

hadoopConf.set(MapredConfEnum.IMPRESSIONS_LOG_REC_CACHED_SEPARATOR.getVal(), 
MapredConfEnum.PRODUCT_IMPR_LOG_REC_CACHED.getVal());

hadoopConf.set("io.compression.codecs", 
"org.apache.hadoop.io.compress.GzipCodec");


  Job job = new Job(hadoopConf);

  job.setJobName("wordcountNEW");

  job.setJarByClass(WordCountNew.class);

  job.setOutputKeyClass(Text.class);

  job.setOutputValueClass(IntWritable.class);

  job.setMapOutputKeyClass(Text.class);

  job.setMapOutputValueClass(IntWritable.class);



  job.setMapperClass(WordCountNew.Map.class);

  job.setCombinerClass(WordCountNew.Reduce.class);

  job.setReducerClass(Reduce.class);



//   job.setInputFormatClass(ZipMultipleLineRecordInputFormat.class);

  
job.setInputFormatClass(org.apache.hadoop.mapreduce.lib.input.TextInputFormat.class);


  job.setOutputFormatClass(TextOutputFormat.class);



  if (FileUtils.doesFileOrDirectoryExist(args[1])){

  org.apache.commons.io.FileUtils.deleteDirectory(new File(args[1]));

  }

  org.apache.hadoop.mapreduce.lib.input.FileInputFormat.setInputPaths(job, 
new Path(args[0]));

org.apache.hadoop.mapreduce.lib.output.FileOutputFormat.setOutputPath(job, 
new Path(args[1]));



  job.waitForCompletion(true);

  System.out.println();

}

}





From: Bharati <bharati.ad...@mparallelo.com>
Reply-To: "user@hadoop.apache.org" <user@hadoop.apache.org>
Date: Wednesday, May 22, 2013 3:39 PM
To: "user@hadoop.apache.org" <user@hadoop.apache.org>
Subject: Re: Eclipse plugin

Hi Jing,

I want to be able to open a project as a map reduce project in Eclipse instead of
a Java project, as per some of the videos on YouTube.

For now let us say I want to write a wordcount program and step through it with 
hadoop 1.2.0
How can I use eclipse to rewrite the code.

The goal here is to set up the development env to start a project as map reduce
right in Eclipse or NetBeans, whichever works better. The idea is to be able to
step through the code.

Thanks,
Bharati

Sent from my iPad

On May 22, 2013, at 2:42 PM, Jing Zhao <j...@hortonworks.com> wrote:

> Hi Bharati,
>
>Usually you only need to run "ant clean jar jar-test" and "ant
> eclipse" on your code base, and then import the project into your
> eclipse. Can you provide some more detailed description about the
> problem you met?
>
> Thanks,
> -Jing
>
> On Wed, May 22, 2013 at 2:25 PM, Bharati <bharati.ad...@mparallelo.com> wrote:
>> Hi,
>>
>> I am trying to get or build eclipse plugin for 1.2.0
>>
>> All the methods I found on the web did not work for me. Any tutorial, 
>

Re: Hive tmp logs

2013-05-22 Thread Sanjay Subramanian

<property>
  <name>hive.querylog.location</name>
  <value>/path/to/hivetmp/dir/on/local/linux/disk</value>
</property>

<property>
  <name>hive.exec.scratchdir</name>
  <value>/data01/workspace/hive scratch/dir/on/local/linux/disk</value>
</property>


From: Anurag Tangri <tangri.anu...@gmail.com>
Reply-To: "u...@hive.apache.org" <u...@hive.apache.org>
Date: Wednesday, May 22, 2013 11:56 AM
To: "u...@hive.apache.org" <u...@hive.apache.org>
Cc: Hive <u...@hive.apache.org>, User <user@hadoop.apache.org>
Subject: Re: Hive tmp logs

Hi,
You can add the Hive query log property in your hive-site.xml and point it to
the directory you want.

Thanks,
Anurag Tangri

Sent from my iPhone

On May 22, 2013, at 11:53 AM, Raj Hadoop <hadoop...@yahoo.com> wrote:

Hi,

My hive job logs are being written to the /tmp/hadoop directory. I want to change
it to a different location, i.e. a sub-directory somewhere under the 'hadoop'
user home directory.
How do I change it?

Thanks,
Ra



Re: ORA-01950: no privileges on tablespace

2013-05-21 Thread Sanjay Subramanian
See the CDH notes  here…scroll down to where the Oracle section is
http://www.cloudera.com/content/cloudera-content/cloudera-docs/CDH4/latest/CDH4-Installation-Guide/cdh4ig_topic_18_4.html
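The ORA-01950 itself usually just means the metastore user has no quota on its
default tablespace. A typical fix, run as a DBA (user name and tablespace are
placeholders - use whatever your metastore schema actually is):

ALTER USER hiveuser QUOTA UNLIMITED ON users;

or, more coarsely, GRANT UNLIMITED TABLESPACE TO hiveuser;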



From: Raj Hadoop <hadoop...@yahoo.com>
Reply-To: "u...@hive.apache.org" <u...@hive.apache.org>, Raj Hadoop <hadoop...@yahoo.com>
Date: Tuesday, May 21, 2013 12:26 PM
To: Hive <u...@hive.apache.org>, User <user@hadoop.apache.org>
Subject: ORA-01950: no privileges on tablespace


I am setting up a metastore on Oracle for Hive. I executed the
hive-schema-0.9.0 SQL script successfully too.

When I ran this:
hive > show tables;

I am getting the following error.

ORA-01950: no privileges on tablespace

What kind of Oracle privileges (quota-wise) are required for the Hive Oracle
user in the metastore? Please advise.



Re: Where to get Oracle scripts for Hive Metastore

2013-05-21 Thread Sanjay Subramanian
I think it should be this link because this refers to the /branches/branch-0.9

http://svn.apache.org/viewvc/hive/branches/branch-0.9/metastore/scripts/upgrade/oracle/


Can one of the Hive committers please verify…thanks

sanjay
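If anyone does end up working from the diff-style listing in the mail archive, a
quick way to strip the leading '+' before feeding the script to sqlplus (file
name is just a placeholder):

sed 's/^+//' hive-schema-0.9.0.oracle.sql.txt > hive-schema-0.9.0.oracle.sql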



From: Raj Hadoop <hadoop...@yahoo.com>
Reply-To: "u...@hive.apache.org" <u...@hive.apache.org>, Raj Hadoop <hadoop...@yahoo.com>
Date: Tuesday, May 21, 2013 12:12 PM
To: "u...@hive.apache.org" <u...@hive.apache.org>, User <user@hadoop.apache.org>, Raj Hadoop <hadoop...@yahoo.com>
Subject: Re: Where to get Oracle scripts for Hive Metastore

I got it. This is the link.

http://svn.apache.org/viewvc/hive/trunk/metastore/scripts/upgrade/oracle/hive-schema-0.9.0.oracle.sql?revision=1329416&view=co&pathrev=1329416

From: Raj Hadoop <hadoop...@yahoo.com>
To: Hive <u...@hive.apache.org>; User <user@hadoop.apache.org>
Sent: Tuesday, May 21, 2013 3:08 PM
Subject: Where to get Oracle scripts for Hive Metastore

I am trying to get Oracle scripts for Hive Metastore.

http://mail-archives.apache.org/mod_mbox/hive-commits/201204.mbox/%3c20120423201303.9742b2388...@eris.apache.org%3E

The scripts in the above link have a + at the beginning of each line. How am I 
supposed to execute scripts like this through Oracle sqlplus?

+CREATE TABLE PART_COL_PRIVS
+(
+PART_COLUMN_GRANT_ID NUMBER NOT NULL,
+"COLUMN_NAME" VARCHAR2(128) NULL,
+CREATE_TIME NUMBER (10) NOT NULL,
+GRANT_OPTION NUMBER (5) NOT NULL,
+GRANTOR VARCHAR2(128) NULL,
+GRANTOR_TYPE VARCHAR2(128) NULL,
+PART_ID NUMBER NULL,
+PRINCIPAL_NAME VARCHAR2(128) NULL,
+PRINCIPAL_TYPE VARCHAR2(128) NULL,
+PART_COL_PRIV VARCHAR2(128) NULL
+);
+







Re: Unable to stop Thrift Server

2013-05-21 Thread Sanjay Subramanian
Not that I know of…..sorry
sanjay

From: Raj Hadoop mailto:hadoop...@yahoo.com>>
Reply-To: Raj Hadoop mailto:hadoop...@yahoo.com>>
Date: Monday, May 20, 2013 2:17 PM
To: Sanjay Subramanian 
mailto:sanjay.subraman...@wizecommerce.com>>,
 "u...@hive.apache.org<mailto:u...@hive.apache.org>" 
mailto:u...@hive.apache.org>>, User 
mailto:user@hadoop.apache.org>>
Subject: Re: Unable to stop Thrift Server

Hi Sanjay,

I am using version 0.9.
I do not have sudo access. Is there any other command to stop the service?

thanks,
raj


____
From: Sanjay Subramanian 
mailto:sanjay.subraman...@wizecommerce.com>>
To: "u...@hive.apache.org<mailto:u...@hive.apache.org>" 
mailto:u...@hive.apache.org>>; Raj Hadoop 
mailto:hadoop...@yahoo.com>>; User 
mailto:user@hadoop.apache.org>>
Sent: Monday, May 20, 2013 5:11 PM
Subject: Re: Unable to stop Thrift Server

Raj
Which version r u using ?

I think from 0.9+ onwards it's best to use the service command to stop and start, and NOT hive

sudo service hive-metastore stop
sudo service hive-server stop

sudo service hive-metastore start
sudo service hive-server start

Couple of general things that might help

1. Use Linux screen sessions: then u can start many screen sessions and u don't have 
to run everything in the background with "&"
 It's very easy to manage several screen sessions and they keep running till 
your server restarts….and generally u can ssh to some jumphost and create your 
screen sessions there

2. Run the following
 pstree -pulac | less
 U can possibly search for hive or your username or root which was used to 
start the service…and kill the process
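For example, something like this (the process name to grep for depends on how the 
server was started, so treat it as a sketch):

ps -ef | grep -i hiveserver
kill <pid of the HiveServer process>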

sanjay

From: Raj Hadoop mailto:hadoop...@yahoo.com>>
Reply-To: "u...@hive.apache.org<mailto:u...@hive.apache.org>" 
mailto:u...@hive.apache.org>>, Raj Hadoop 
mailto:hadoop...@yahoo.com>>
Date: Monday, May 20, 2013 2:03 PM
To: Hive mailto:u...@hive.apache.org>>, User 
mailto:user@hadoop.apache.org>>
Subject: Unable to stop Thrift Server

Hi,

I was not able to stop the Thrift Server after performing the following steps.

$ bin/hive --service hiveserver &
Starting Hive Thrift Server

$ netstat -nl | grep 1
tcp 0 0 :::1 :::* LISTEN


I gave the following to stop it, but it is not working.

hive --service hiveserver --action stop 1

How can I stop this service?


Thanks,
Raj






LZO compression implementation in Hive

2013-05-21 Thread Sanjay Subramanian
Hi Programming Hive Book authors

Maybe a lot of u have already successfully implemented this but only these last 
two weeks we implemented our aggregations using LZO compression in Hive - MR 
jobs creating LZO files as input for Hive ---> Thereafter Hive aggregations 
creating more LZO files as output.
As usual nothing was straightforward :-)  Also the other challenge was to 
neatly tie it all into actions in Oozie workflows….but after being underwater for 
weeks I think I am able to rise above water and breathe !
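For anyone searching the archives later, the Hive side boiled down to roughly the 
following — treat this as a sketch, the class names come from the hadoop-lzo 
package and your setup may differ:

SET hive.exec.compress.output=true;
SET mapred.output.compression.codec=com.hadoop.compression.lzo.LzopCodec;

plus defining the tables that read LZO data with 
INPUTFORMAT "com.hadoop.mapred.DeprecatedLzoTextInputFormat", and indexing the 
.lzo files so they stay splittable.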

In the next version of the book, if u guys r planning to add detailed sections 
on using LZO compression in Hive, let me know…my experiences might be useful 
:-)

Thanks

sanjay





Re: Where to get Oracle scripts for Hive Metastore

2013-05-21 Thread Sanjay Subramanian
Raj

The correct location of the script is where u extracted the hive tar

For example
/usr/lib/hive/scripts/metastore/upgrade/oracle

You will find a file in this directory called hive-schema-0.9.0.oracle.sql

Use this
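To load it u can just run it through sqlplus as the metastore user, e.g. (the 
credentials below are placeholders):

sqlplus hiveuser/hivepassword @hive-schema-0.9.0.oracle.sql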

sanjay

From: Raj Hadoop mailto:hadoop...@yahoo.com>>
Reply-To: "user@hadoop.apache.org" 
mailto:user@hadoop.apache.org>>, Raj Hadoop 
mailto:hadoop...@yahoo.com>>
Date: Tuesday, May 21, 2013 12:08 PM
To: Hive mailto:u...@hive.apache.org>>, User 
mailto:user@hadoop.apache.org>>
Subject: Where to get Oracle scripts for Hive Metastore

I am trying to get Oracle scripts for Hive Metastore.

http://mail-archives.apache.org/mod_mbox/hive-commits/201204.mbox/%3c20120423201303.9742b2388...@eris.apache.org%3E

The scripts in the above link have a + at the beginning of each line. How am I 
supposed to execute scripts like this through Oracle sqlplus?

+CREATE TABLE PART_COL_PRIVS
+(
+PART_COLUMN_GRANT_ID NUMBER NOT NULL,
+"COLUMN_NAME" VARCHAR2(128) NULL,
+CREATE_TIME NUMBER (10) NOT NULL,
+GRANT_OPTION NUMBER (5) NOT NULL,
+GRANTOR VARCHAR2(128) NULL,
+GRANTOR_TYPE VARCHAR2(128) NULL,
+PART_ID NUMBER NULL,
+PRINCIPAL_NAME VARCHAR2(128) NULL,
+PRINCIPAL_TYPE VARCHAR2(128) NULL,
+PART_COL_PRIV VARCHAR2(128) NULL
+);
+





Re: hive.metastore.warehouse.dir - Should it point to a physical directory

2013-05-21 Thread Sanjay Subramanian
Hi Raj

http://www.cloudera.com/content/cloudera-content/cloudera-docs/CDH4/latest/CDH4-Quick-Start/cdh4qs_topic_3.html

Installing CDH4 on a Single Linux Node in Pseudo-distributed Mode

On the left panel of the page u will find info on Hive installation etc.

I suggest the CDH4 distribution only because it helps u to get started quickly…as 
a developer I love to install from individual tar balls but sometimes there is 
little time to learn and execute

There are some great notes here

sanjay


From: bharath vissapragada 
mailto:bharathvissapragada1...@gmail.com>>
Date: Tuesday, May 21, 2013 11:12 AM
To: "u...@hive.apache.org<mailto:u...@hive.apache.org>" 
mailto:u...@hive.apache.org>>, Raj Hadoop 
mailto:hadoop...@yahoo.com>>
Cc: Sanjay Subramanian 
mailto:sanjay.subraman...@wizecommerce.com>>,
 User mailto:user@hadoop.apache.org>>
Subject: Re: hive.metastore.warehouse.dir - Should it point to a physical 
directory


Yes !

On Tue, May 21, 2013 at 11:41 PM, Raj Hadoop 
mailto:hadoop...@yahoo.com>> wrote:
So that means I need to create an HDFS directory (not an OS physical directory) 
under Hadoop that needs to be used in the Hive config file for this 
property. Right?

From: Dean Wampler mailto:deanwamp...@gmail.com>>
To: Raj Hadoop mailto:hadoop...@yahoo.com>>
Cc: Sanjay Subramanian 
mailto:sanjay.subraman...@wizecommerce.com>>;
 "u...@hive.apache.org<mailto:u...@hive.apache.org>" 
mailto:u...@hive.apache.org>>; User 
mailto:user@hadoop.apache.org>>
Sent: Tuesday, May 21, 2013 2:06 PM

Subject: Re: hive.metastore.warehouse.dir - Should it point to a physical 
directory

No, you only need a directory in HDFS, which will be "virtually located" 
somewhere in your cluster automatically by HDFS.

Also there's a typo in your hive.xml:

  <value></value>

Should be

  <value>/correct/path/in/hdfs/to/your/warehouse/directory</value>

On Tue, May 21, 2013 at 1:04 PM, Raj Hadoop 
mailto:hadoop...@yahoo.com>> wrote:
Thanks Sanjay.

My environment is  like this.

$ echo $HADOOP_HOME
/software/home/hadoop/hadoop/hadoop-1.1.2

$ echo $HIVE_HOME
/software/home/hadoop/hive/hive-0.9.0

$ id
uid=50052(hadoop) gid=600(apps) groups=600(apps)

So can i do like this:

$pwd
/software/home/hadoop/hive/hive-0.9.0

$mkdir warehouse

$cd /software/home/hadoop/hive/hive-0.9.0/warehouse

$ in hive-site.xml

<property>
  <name>hive.metastore.warehouse.dir</name>
  <value></value>
  <description>location of default database for the warehouse</description>
</property>


Where should I create the HDFS directory ?


From: Sanjay Subramanian 
mailto:sanjay.subraman...@wizecommerce.com>>
To: "u...@hive.apache.org<mailto:u...@hive.apache.org>" 
mailto:u...@hive.apache.org>>; Raj Hadoop 
mailto:hadoop...@yahoo.com>>; Dean Wampler 
mailto:deanwamp...@gmail.com>>
Cc: User mailto:user@hadoop.apache.org>>
Sent: Tuesday, May 21, 2013 1:53 PM

Subject: Re: hive.metastore.warehouse.dir - Should it point to a physical 
directory

Notes below

From: Raj Hadoop mailto:hadoop...@yahoo.com>>
Reply-To: "u...@hive.apache.org<mailto:u...@hive.apache.org>" 
mailto:u...@hive.apache.org>>, Raj Hadoop 
mailto:hadoop...@yahoo.com>>
Date: Tuesday, May 21, 2013 10:49 AM
To: Dean Wampler mailto:deanwamp...@gmail.com>>, 
"u...@hive.apache.org<mailto:u...@hive.apache.org>" 
mailto:u...@hive.apache.org>>
Cc: User mailto:user@hadoop.apache.org>>
Subject: Re: hive.metastore.warehouse.dir - Should it point to a physical 
directory

Ok. I got it. My questions -

1) Should a local physical directory be created before using this property?
I created a directory in HDFS during Hive installation
/user/hive/warehouse

My hive-site.xml has the following property defined:

<property>
  <name>hive.metastore.warehouse.dir</name>
  <value>/user/hive/warehouse</value>
  <description>location of default database for the warehouse</description>
</property>

2) Should a HDFS file directory be created from Hadoop before using this 
property?
hdfs dfs -mkdir /user/hive/warehouse
Change the owner:group to hive:hive



From: Dean Wampler mailto:deanwamp...@gmail.com>>
To: u...@hive.apache.org<mailto:u...@hive.apache.org>; Raj Hadoop 
mailto:hadoop...@yahoo.com>>
Cc: User mailto:user@hadoop.apache.org>>
Sent: Tuesday, May 21, 2013 1:44 PM
Subject: Re: hive.metastore.warehouse.dir - Should it point to a physical 
directory

The name is misleading; this is the directory within HDFS where Hive stores the 
data, by default. (External tables can go elsewhere). It doesn't really have 
anything to do with the metastore.

dean

On Tue, May 21, 2013 at 12:42 PM, Raj Hadoop 
mailto:hadoop...@yahoo.com>> wrote:
Can someone help me on this? I am stuck installing and configuring Hive with 
Oracle. Your timely help is really appreciated.

From: Raj Hadoop mailto:hadoop...@yahoo.com>>
To: Hive mailto:u...@hive.apache.org>>; User 
mailto:user@hadoop.apache.org>>
Sent: Tuesday, May 21, 2013 1:08 PM
Subject

Re: Project ideas

2013-05-21 Thread Sanjay Subramanian
+1

My $0.02 is look around and see problems u can solve…It's better to get a 
list of problems and see if u can model a solution using the map-reduce framework

An example is as follows

PROBLEM
Build a Cars Pricing Model based on advertisements on Craigs list

OBJECTIVE
Recommend a price to the Craigslist car seller when the user gives info about 
make, model, color, miles

DATA required
Collect RSS feeds daily from Craigs List (don't pound their website, else they 
will lock u down)

DESIGN COMPONENTS
- Daily RSS Collector - pulls data and puts into HDFS
- Data Loader - Structures the columns u need to analyze and puts into HDFS
- Hive Aggregator and analyzer - studies and queries data and brings out 
recommendation models for car pricing
- REST Web service to return query results in XML/JSON
- iPhone App that talks to web service and gets info

There u go…this should keep a couple of students busy for 3 months

I find this kind of problem statement and solutions simpler to understand 
because it's all there in the real world !

This way of thinking, for example, led me to found this non-profit called 
www.medicalsidefx.org that gives users valuable metrics regarding medical side 
fx.
It uses Hadoop to aggregate, Lucene to search….This year I am redesigning the 
core to use Hive :-)

Good luck

Sanjay





From: Michael Segel 
mailto:michael_se...@hotmail.com>>
Reply-To: "user@hadoop.apache.org" 
mailto:user@hadoop.apache.org>>
Date: Tuesday, May 21, 2013 6:46 AM
To: "user@hadoop.apache.org" 
mailto:user@hadoop.apache.org>>
Subject: Re: Project ideas

Drink heavily?

Sorry.

Let me rephrase.

Part of the exercise is for you, the student, to come up with the idea, not to 
solicit someone else for a suggestion.  This is how you learn.

The exercise is to get you to think about the following:

1) What is Hadoop
2) How does it work
3) Why would you want to use it

You need to understand #1 and #2 to be able to #3.

But at the same time... you need to also incorporate your own view of the world.
What are your hobbies? What do you like to do?
What scares you the most?  What excites you the most?
Why are you here?
And most importantly, what do you think you can do within the time period.
(What data can you easily capture and work with...)

Have you ever seen 'Eden of the East' ? ;-)

HTH


On May 21, 2013, at 8:35 AM, Anshuman Mathur 
mailto:ans...@gmail.com>> wrote:


Hello fellow users,

We are a group of students studying at the National University of Singapore. As 
part of our course curriculum we need to develop an application using Hadoop 
and map-reduce. Can you please suggest some innovative ideas for our project?

Thanks in advance.

Anshuman




Re: Viewing snappy compressed files

2013-05-21 Thread Sanjay Subramanian
+1 Thanks Rahul-da

Or u can use
hdfs dfs -text /path/to/dir/on/hdfs/part-r-0.snappy | less
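If the file has already been pulled down to local disk, the same command works 
against the local filesystem too, e.g. (the path is just a placeholder):

hdfs dfs -text file:///tmp/part-r-0.snappy | less

As far as I know this is needed because Hadoop's SnappyCodec writes its own 
block-style framing rather than the standard snappy stream format, which is why 
the plain snappy tools complain about a missing identifier.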


From: Rahul Bhattacharjee 
mailto:rahul.rec@gmail.com>>
Reply-To: "user@hadoop.apache.org" 
mailto:user@hadoop.apache.org>>
Date: Tuesday, May 21, 2013 9:52 AM
To: "user@hadoop.apache.org" 
mailto:user@hadoop.apache.org>>
Subject: Re: Viewing snappy compressed files

I haven't tried this with snappy, but you can try using hadoop fs -text 


On Tue, May 21, 2013 at 8:28 PM, Robert Rapplean 
mailto:robert.rappl...@trueffect.com>> wrote:
Hey, there. My Google skills have failed me, and I hope someone here can point 
me in the right direction.


We’re storing data on our Hadoop cluster in Snappy compressed format. When we 
pull a raw file down and try to read it, however, the Snappy libraries don’t 
know how to read the files. They tell me that the stream is missing the snappy 
identifier. I tried inserting 0xff 0x06 0x00 0x00 0x73 0x4e 0x61 0x50 0x70 0x59 
into the beginning of the file, but that didn’t do it.

Can someone point me to resources for figuring out how to uncompress these 
files without going through Hadoop?


Robert Rapplean
Senior Software Engineer
303-872-2256  direct  | 303.438.9597  main 
| www.trueffect.com





Re: hive.metastore.warehouse.dir - Should it point to a physical directory

2013-05-21 Thread Sanjay Subramanian
Notes below

From: Raj Hadoop mailto:hadoop...@yahoo.com>>
Reply-To: "u...@hive.apache.org" 
mailto:u...@hive.apache.org>>, Raj Hadoop 
mailto:hadoop...@yahoo.com>>
Date: Tuesday, May 21, 2013 10:49 AM
To: Dean Wampler mailto:deanwamp...@gmail.com>>, 
"u...@hive.apache.org" 
mailto:u...@hive.apache.org>>
Cc: User mailto:user@hadoop.apache.org>>
Subject: Re: hive.metastore.warehouse.dir - Should it point to a physical 
directory

Ok. I got it. My questions -

1) Should a local physical directory be created before using this property?
I created a directory in HDFS during Hive installation
/user/hive/warehouse

My hive-site.xml has the following property defined:

<property>
  <name>hive.metastore.warehouse.dir</name>
  <value>/user/hive/warehouse</value>
  <description>location of default database for the warehouse</description>
</property>

2) Should a HDFS file directory be created from Hadoop before using this 
property?
hdfs dfs -mkdir /user/hive/warehouse
Change the owner:group to hive:hive
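For example (assuming your install uses the hive user and group — adjust as 
needed):

hdfs dfs -mkdir /user/hive/warehouse
hdfs dfs -chown hive:hive /user/hive/warehouse
hdfs dfs -chmod g+w /user/hive/warehouse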



From: Dean Wampler mailto:deanwamp...@gmail.com>>
To: u...@hive.apache.org; Raj Hadoop 
mailto:hadoop...@yahoo.com>>
Cc: User mailto:user@hadoop.apache.org>>
Sent: Tuesday, May 21, 2013 1:44 PM
Subject: Re: hive.metastore.warehouse.dir - Should it point to a physical 
directory

The name is misleading; this is the directory within HDFS where Hive stores the 
data, by default. (External tables can go elsewhere). It doesn't really have 
anything to do with the metastore.

dean

On Tue, May 21, 2013 at 12:42 PM, Raj Hadoop 
mailto:hadoop...@yahoo.com>> wrote:
Can someone help me on this? I am stuck installing and configuring Hive with 
Oracle. Your timely help is really appreciated.

From: Raj Hadoop mailto:hadoop...@yahoo.com>>
To: Hive mailto:u...@hive.apache.org>>; User 
mailto:user@hadoop.apache.org>>
Sent: Tuesday, May 21, 2013 1:08 PM
Subject: hive.metastore.warehouse.dir - Should it point to a physical directory

Hi,

I am configuring Hive. I have a question on the property 
hive.metastore.warehouse.dir.

Should this point to a physical directory? I am guessing it is a logical 
directory under Hadoop fs.default.name. Please advise 
whether I need to create any directory for the variable 
hive.metastore.warehouse.dir

Thanks,
Raj





--
Dean Wampler, Ph.D.
@deanwampler
http://polyglotprogramming.com/





Re: Unable to stop Thrift Server

2013-05-21 Thread Sanjay Subramanian
Raj
Which version r u using ?

I think from 0.9+ onwards it's best to use the service command to stop and start, and NOT hive

sudo service hive-metastore stop
sudo service hive-server stop

sudo service hive-metastore start
sudo service hive-server start

Couple of general things that might help

1. Use Linux screen sessions: then u can start many screen sessions and u don't have 
to run everything in the background with "&"
 It's very easy to manage several screen sessions and they keep running till 
your server restarts….and generally u can ssh to some jumphost and create your 
screen sessions there

2. Run the following
 pstree -pulac | less
 U can possibly search for hive or your username or root which was used to 
start the service…and kill the process

sanjay

From: Raj Hadoop mailto:hadoop...@yahoo.com>>
Reply-To: "u...@hive.apache.org" 
mailto:u...@hive.apache.org>>, Raj Hadoop 
mailto:hadoop...@yahoo.com>>
Date: Monday, May 20, 2013 2:03 PM
To: Hive mailto:u...@hive.apache.org>>, User 
mailto:user@hadoop.apache.org>>
Subject: Unable to stop Thrift Server

Hi,

I was not able to stop the Thrift Server after performing the following steps.

$ bin/hive --service hiveserver &
Starting Hive Thrift Server

$ netstat -nl | grep 1
tcp 0 0 :::1 :::* LISTEN


I gave the following to stop it, but it is not working.

hive --service hiveserver --action stop 1

How can I stop this service?


Thanks,
Raj



Re: Did any one used Hive on Oracle Metastore

2013-05-18 Thread Sanjay Subramanian
Raj
It should be pretty much similar to setting it up in MySQL, except for any syntax 
differences. Read the cloudera hive installation notes. They have a separate 
section for using MySQL and Oracle.

Also my favorite $0.02 about open source software is to just dare and 
try it out...get errors...figure out...debug...ask for help...solve...share 
and move on :-)

Regards
Sanjay

Sent from my iPhone

On May 18, 2013, at 1:05 PM, "Raj Hadoop" 
mailto:hadoop...@yahoo.com>> wrote:

Hi,
I wanted to know whether anyone has used Hive with an Oracle metastore. Can you 
please share your experiences?
Thanks,
Raj




Re: Hive on Oracle

2013-05-18 Thread Sanjay Subramanian
Try installing Cloudera Manager 4.1.2. It bundles Hadoop, Hive and a few other 
components.  I have this version in production. Cloudera has pretty good 
documentation. This way u don't have to spend time installing versions that 
work successfully with each other.

Sent from my iPhone

On May 17, 2013, at 8:49 PM, "bejoy...@yahoo.com" 
mailto:bejoy...@yahoo.com>> wrote:

Hi Raj

Which jar to use depends on what version of Oracle you are using. The jar version 
corresponding to each Oracle release is given in the Oracle documentation 
online.

JDBC jars should be available from the Oracle website for free download.

Regards
Bejoy KS

Sent from remote device, Please excuse typos

From: Raj Hadoop mailto:hadoop...@yahoo.com>>
Date: Fri, 17 May 2013 20:43:46 -0700 (PDT)
To: 
bejoy...@yahoo.commailto:bejoy...@yahoo.com>>;
 
u...@hive.apache.orgmailto:u...@hive.apache.org>>;
 Usermailto:user@hadoop.apache.org>>
ReplyTo: u...@hive.apache.org
Subject: Re: Hive on Oracle


Thanks for the reply.

Can you specify which jar file needs to be used? Where can I get 
the jar file? Does Oracle provide one for free? Please let me know.

Thanks,
Raj




From: "bejoy...@yahoo.com" 
mailto:bejoy...@yahoo.com>>
To: u...@hive.apache.org; Raj Hadoop 
mailto:hadoop...@yahoo.com>>; User 
mailto:user@hadoop.apache.org>>
Sent: Friday, May 17, 2013 11:42 PM
Subject: Re: Hive on Oracle

Hi

The procedure is the same as setting up a MySQL metastore. You need to use the JDBC 
driver/jar corresponding to the Oracle version/release you are intending to use.
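For reference, a typical hive-site.xml stanza for an Oracle metastore looks 
roughly like this — host, SID, user and password are placeholders, and the exact 
ojdbc jar (e.g. ojdbc6.jar dropped into $HIVE_HOME/lib) depends on your Oracle 
and Java versions:

<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:oracle:thin:@//dbhost:1521/orcl</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>oracle.jdbc.OracleDriver</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionUserName</name>
  <value>hiveuser</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionPassword</name>
  <value>hivepassword</value>
</property>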
Regards
Bejoy KS

Sent from remote device, Please excuse typos

From: Raj Hadoop mailto:hadoop...@yahoo.com>>
Date: Fri, 17 May 2013 17:10:07 -0700 (PDT)
To: Hivemailto:u...@hive.apache.org>>; 
Usermailto:user@hadoop.apache.org>>
ReplyTo: u...@hive.apache.org
Subject: Hive on Oracle

Hi,

I am planning to install Hive and want to set up the metastore on Oracle. What is 
the procedure? Which (JDBC) driver do I need to use?

Thanks,
Raj


