Re: HDFS2 vs MaprFS

2016-06-05 Thread Peyman Mohajerian
It is very common practice to back up the metadata to some SAN store, so the idea of complete loss of all the metadata is preventable. You could lose a day's worth of data if, e.g., you back up the metadata once a day, but you could do it more frequently. I'm not saying S3 or Azure Blob are bad ideas. On
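
For reference, one minimal sketch of pulling a metadata copy for backup on a 2.x+ cluster; the destination directory is a placeholder, and on older releases you may be copying the fsimage/edits directories off the NameNode host instead:

  # download the most recent fsimage from the NameNode to local/SAN storage
  mkdir -p /backup/namenode/$(date +%Y%m%d)
  hdfs dfsadmin -fetchImage /backup/namenode/$(date +%Y%m%d)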

Re: println in MapReduce job

2015-09-24 Thread Peyman Mohajerian
Your log statement should show up in the container log for that map task on the node where it ran. I'm guessing you aren't looking in the right place. On Thu, Sep 24, 2015 at 9:54 AM, xeonmailinglist wrote: > Does anyone know this question about logging in MapReduce. I
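
If the job runs on YARN with log aggregation enabled, one way to pull those container logs after the job finishes (the application id below is a placeholder):

  # aggregated container logs include the stdout/stderr of each map task
  yarn logs -applicationId application_1443100000000_0001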

Re: Hadoop and HttpFs

2015-04-03 Thread Peyman Mohajerian
Maybe this helps: https://cwiki.apache.org/confluence/display/Hive/WebHCat+Reference+Pig On Fri, Apr 3, 2015 at 5:56 AM, Remy Dubois rdub...@talend.com wrote: Hi everyone, I used to think about the constraint that a Hadoop client has to know and to have access to each single datanode

Re: XML files in Hadoop

2015-01-03 Thread Peyman Mohajerian
in the end phase of the project. Thanks Shashi On Sat, Jan 3, 2015 at 10:32 PM, Peyman Mohajerian mohaj...@gmail.com wrote: Hi Shashi, Sure, you can use JSON instead of Parquet; I was thinking in terms of using Hive for processing the data, but if you'd like to use Drill (which I heard

Re: XML files in Hadoop

2015-01-03 Thread Peyman Mohajerian
to parquet if I convert into json and store in Hive as Parquet format, is this a feasible option? The reason I want to convert to json is that Apache Drill works very well with JSON format. Thanks Shashi On Sat, Jan 3, 2015 at 10:08 PM, Peyman Mohajerian mohaj...@gmail.com wrote: You can

Re: XML files in Hadoop

2015-01-03 Thread Peyman Mohajerian
You can land the data in HDFS as XML files and use the Hive XML SerDe to read the data and write it back in a more optimal format, e.g. ORC or Parquet (depending somewhat on your choice of Hadoop distro). Querying XML data directly via Hive is also doable but slow. Converting to Avro is also doable
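
A rough sketch of that flow, assuming the commonly used hive-xml-serde jar and an XML landing directory at /data/xml; the jar path, table and column names, XPath expressions, and record tags are all illustrative:

  hive -e "
    ADD JAR /opt/serde/hivexmlserde-1.0.5.3.jar;
    CREATE EXTERNAL TABLE xml_raw (id STRING, payload STRING)
    ROW FORMAT SERDE 'com.ibm.spss.hive.serde2.xml.XmlSerDe'
    WITH SERDEPROPERTIES (
      'column.xpath.id'='/record/id/text()',
      'column.xpath.payload'='/record/payload/text()')
    STORED AS
      INPUTFORMAT 'com.ibm.spss.hive.serde2.xml.XmlInputFormat'
      OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat'
    LOCATION '/data/xml'
    TBLPROPERTIES ('xmlinput.start'='<record','xmlinput.end'='</record>');
    CREATE TABLE xml_orc STORED AS ORC AS SELECT * FROM xml_raw;"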

Re: Unable to use transfer data using distcp between EC2-classic cluster and VPC cluster

2014-09-20 Thread Peyman Mohajerian
It may be easier to copy the data to S3 and then from S3 to the new cluster. On Fri, Sep 19, 2014 at 8:45 PM, Jameel Al-Aziz jam...@6sense.com wrote: Hi all, We’re in the process of migrating from EC2-Classic to VPC and needed to transfer our HDFS data. We setup a new cluster inside the
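
A hedged sketch of that two-hop copy with distcp; the bucket name and NameNode hosts are placeholders, and the s3n vs. s3a scheme (plus how you supply AWS credentials) depends on your Hadoop version:

  # from the old EC2-classic cluster: push the HDFS data up to S3
  hadoop distcp hdfs://old-nn:8020/data s3n://my-migration-bucket/data
  # from the new VPC cluster: pull it back down into HDFS
  hadoop distcp s3n://my-migration-bucket/data hdfs://new-nn:8020/data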

Re: Data cleansing in modern data architecture

2014-08-24 Thread Peyman Mohajerian
If your data is in different partitions in HDFS, you can simply use tools like Hive or Pig to read the data in a given partition, filter out the bad data, and overwrite the partition. This data cleansing is common practice; I'm not sure why there is such a back and forth on this topic. Of course

Re: Multiple Part files

2014-07-17 Thread Peyman Mohajerian
Hadoop has a getmerge command (http://hadoop.apache.org/docs/r0.19.1/hdfs_shell.html#getmerge); I'm not certain whether it works with RCFile, but I think it should. So maybe you don't have to copy the files to local. On Thu, Jul 17, 2014 at 6:18 AM, Naganarasimha G R (Naga)
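
On current releases the equivalent is the shell command below (paths are illustrative); it concatenates the part files into a single file on the local filesystem:

  hadoop fs -getmerge /user/naga/job_output /tmp/job_output_merged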

Re: The future of MapReduce

2014-07-16 Thread Peyman Mohajerian
This statement is inaccurate. Not all machine learning involves iterative computation, and not every dataset can fit in memory. I'm not an expert in machine learning, but I know enough to know that talking about it in some generic sense from a standpoint of Spark vs. Mahout, or R vs. Python, makes no

Re: Gathering connection information

2014-06-07 Thread Peyman Mohajerian
In my experience you build a node called an edge node, which has all the libraries and XML configuration settings to connect to the cluster; it just doesn't have any of the Hadoop daemons running. On Wed, Jun 4, 2014 at 2:46 PM, John Lilley john.lil...@redpoint.net wrote: We’ve found that much

Re: How to make sure data blocks are shared between 2 datanodes

2014-05-25 Thread Peyman Mohajerian
Block sizes are typically 64 MB or 128 MB, so in your case only a single block is involved, which means that if you have a single replica then only a single data node will be used. The default replication is three, and since you only have two data nodes, you will most likely have two copies of the data in

Re: Realtime sensor's tcpip data to hadoop

2014-05-16 Thread Peyman Mohajerian
Flume is not just for log files; you can wire up a Flume source for this purpose. There are also alternative open-source solutions for data streaming, e.g. Apache Storm or Kafka. On Tue, May 6, 2014 at 10:48 PM, Alex Lee eliy...@hotmail.com wrote: Sensors' may send tcpip data to server. Each

Re: Realtime sensor's tcpip data to hadoop

2014-05-16 Thread Peyman Mohajerian
Whether or not you use Storm/Kafka or any other real-time processing, you may still need to persist the data, which can be written directly to HBase from any of these real-time systems or from the source. On Thu, May 8, 2014 at 9:25 PM, Hardik Pandya smarty.ju...@gmail.com wrote: If I were you I would

Re: hadoop+python+text mining

2014-04-24 Thread Peyman Mohajerian
At a high level I think you have these choices and more: 1) Hadoop Streaming, which lets you leverage some of your Python code, but not all of it because you have to deal with map/reduce (see the sketch below). 2) Use Mahout. 3) Use a distribution of R that works with Hadoop. On Thu, Apr 24, 2014 at 1:58 PM, qiaoresearcher
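
For option 1, a minimal streaming invocation might look like the following; mapper.py and reducer.py are hypothetical scripts that read lines from stdin and write tab-separated key/value pairs to stdout, and the streaming jar path varies by version and distro:

  hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \
    -files mapper.py,reducer.py \
    -mapper mapper.py \
    -reducer reducer.py \
    -input /data/text_corpus \
    -output /data/text_mining_out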

Re: hdfs - get file block path for a datanode

2014-04-14 Thread Peyman Mohajerian
hadoop fsck path -files -blocks -locations On Mon, Apr 14, 2014 at 4:43 PM, Alexandros Papadopoulos alex.pap...@gmail.com wrote: hi all, in some cases as hdfs-client, i would like to know the file block path in a datanode. Is there a way to get a file block path for a datanode ??

Re: how can i archive old data in HDFS?

2014-04-11 Thread Peyman Mohajerian
There is: http://hadoop.apache.org/docs/r1.2.1/hadoop_archives.html but I'm not sure whether it compresses the data or not. On Thu, Apr 10, 2014 at 9:57 PM, Stanley Shi s...@gopivotal.com wrote: AFAIK, no tools now. Regards, *Stanley Shi,* On Fri, Apr 11, 2014 at 9:09 AM, ch huang
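
For what it's worth, HAR files pack many small files into one archive but do not compress them. An illustrative run (the parent directory, source subdirectory, and destination are placeholders):

  # archives /data/logs/2014Q1 into /data/archive/logs-2014Q1.har
  hadoop archive -archiveName logs-2014Q1.har -p /data/logs 2014Q1 /data/archive
  # the archived files stay readable in place through the har:// scheme
  hadoop fs -ls har:///data/archive/logs-2014Q1.har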

Re: A non-empty file's size is reported as 0

2014-04-08 Thread Peyman Mohajerian
If you didn't close the file correctly, then the NameNode wouldn't be notified of the final size of the file. The file size is metadata coming from the NameNode. On Tue, Apr 8, 2014 at 4:35 AM, Tao Xiao xiaotao.cs@gmail.com wrote: I wrote some data into a file using MultipleOutputs in mappers. I
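
One way to check whether such a file is still open for write, and therefore has a stale length at the NameNode (the path is a placeholder):

  hdfs fsck /user/tao/output/part-m-00000 -files -blocks -openforwrite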

Re: recommended block replication for small cluster

2014-04-03 Thread Peyman Mohajerian
The reason for replication also has to do with data locality in a larger cluster for running map-reduce jobs. You can reduce the replication; that's why it's a configurable parameter. On Thu, Apr 3, 2014 at 7:10 AM, Fengyun RAO raofeng...@gmail.com wrote: I know the default replication is 3,
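
As a concrete illustration, the cluster default comes from dfs.replication in hdfs-site.xml, and existing files can be changed after the fact; the path and factor below are examples only:

  # lower the replication factor to 2 for everything under /data and wait
  # for the re-replication to complete
  hadoop fs -setrep -w 2 /data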

Re: Job Tracker not running, Permission denied

2014-02-15 Thread Peyman Mohajerian
Maybe you just have to add the 'mapred' user to the group that owns the HDFS root directory; it seems the group is called 'hdfs'. On Sat, Feb 15, 2014 at 9:56 AM, Panshul Whisper ouchwhis...@gmail.com wrote: Hello, I had a Cloudera Hadoop cluster running on AWS EC2 instances. I used the
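
A hedged sketch of that change on the JobTracker host; the exact user and group names vary by distro, so treat 'mapred' and 'hdfs' below as assumptions taken from the error in this thread:

  # add the mapred user to the hdfs group, then re-check ownership of /
  sudo usermod -a -G hdfs mapred
  hadoop fs -ls /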

Re: Add few record(s) to a Hive table or a HDFS file on a daily basis

2014-02-09 Thread Peyman Mohajerian
The staging table is typically defined as an external Hive table: the data is loaded directly onto HDFS, the staging table is therefore able to read that data directly from HDFS, and you then transfer it to Hive-managed tables with your current statement. Of course there are variations to this as well. On Sun, Feb
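
A minimal sketch of that pattern, assuming a managed table named managed_events already exists; the column list, delimiter, and HDFS landing path are illustrative:

  hive -e "
    CREATE EXTERNAL TABLE staging_events (id BIGINT, payload STRING)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
    LOCATION '/data/incoming/events';
    INSERT INTO TABLE managed_events SELECT * FROM staging_events;"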

Re: Suggestion technology/design on this usecase

2014-01-28 Thread Peyman Mohajerian
tagcombination ids in output, not the no of matches, for the given set of tags.. please illustrate a little your thought by taking my tag combination table design.. On Tue, Jan 28, 2014 at 10:57 PM, Peyman Mohajerian mohaj...@gmail.com wrote: A NoSQL solution with real-time counters would

Re: BlockMissingException reading HDFS file, but the block exists and fsck shows OK

2014-01-28 Thread Peyman Mohajerian
Maybe it's inode exhaustion: the 'df -i' command can tell you more. On Mon, Jan 27, 2014 at 12:00 PM, John Lilley john.lil...@redpoint.net wrote: I've found that the error occurs right around a threshold where 20 tasks attempt to open 220 files each. This is ... slightly over 4k total files

Re: any suggestions on IIS log storage and analysis?

2013-12-31 Thread Peyman Mohajerian
You can run a series of map-reduce jobs on your data. If some log line is related to another line, e.g. based on sessionId, you can emit the sessionId as the key of your mapper output, with the value being the rows associated with that sessionId, so on the reducer side data from different blocks

Re: Why other process can't see the change after calling hdfsHFlush unless hdfsCloseFile is called?

2013-12-19 Thread Peyman Mohajerian
OK, I just read the book section on this (Hadoop: The Definitive Guide) just to be sure: the length of a file is stored in the NameNode, and it is updated only after the client calls the NameNode after closing the file. At that point, if the NameNode has received all the ACKs from the DataNodes, then it will set the length

Re: Streaming Sensor Data using Flume

2013-12-08 Thread Peyman Mohajerian
Your Flume source is a 'custom source': you put the code for connecting to the source data there, and then link the Flume source to the channel and the HDFS sink. There is a Twitter example on the Cloudera blog, a three-part series that explains this in nice detail. On Sun, Dec 8, 2013 at 9:58
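
As a sketch only, here is roughly what the source/channel/sink wiring looks like in the agent's properties file, using the built-in netcat source as a stand-in for a custom TCP source; the agent name, port, and HDFS path are placeholders:

  # flume.conf: TCP-style source -> memory channel -> HDFS sink
  agent.sources  = sensorSrc
  agent.channels = memCh
  agent.sinks    = hdfsSink

  agent.sources.sensorSrc.type = netcat
  agent.sources.sensorSrc.bind = 0.0.0.0
  agent.sources.sensorSrc.port = 9999
  agent.sources.sensorSrc.channels = memCh

  agent.channels.memCh.type = memory

  agent.sinks.hdfsSink.type = hdfs
  agent.sinks.hdfsSink.hdfs.path = /data/sensors/incoming
  agent.sinks.hdfsSink.channel = memCh

  # started with: flume-ng agent -n agent -c conf -f flume.conf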

Re: Migrating from Legacy to Hadoop.

2013-10-08 Thread Peyman Mohajerian
I wonder if a JDBC driver over Hive could help you, if your legacy ETL job can talk to a JDBC driver. It is a slow way of writing to HDFS and I don't have any experience doing it, but e.g.: http://doc.cloveretl.com/documentation/UserGuide/index.jsp?topic=/com.cloveretl.gui.docs/docs/hive-connection.html
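
If HiveServer2 is in the picture, the same JDBC route can be sanity-checked from the shell with beeline before wiring up the ETL tool; the host, port, and database are placeholders:

  beeline -u jdbc:hive2://hiveserver-host:10000/default -e "SHOW TABLES;"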

Re: Retrieve and compute input splits

2013-09-27 Thread Peyman Mohajerian
For the JobClient to compute the input splits, doesn't it need to contact the NameNode? Only the NameNode knows where the blocks are; how can it compute the splits without that additional call? On Fri, Sep 27, 2013 at 1:41 AM, Sonal Goyal sonalgoy...@gmail.com wrote: The input splits are not copied, only the

Re: issue about invisible data in haoop file

2013-09-25 Thread Peyman Mohajerian
In my experience with Flume and this issue, it occurs when the file is not properly closed. If it were closed properly, it would show you the correct size and Hive would read the content. On Wed, Sep 25, 2013 at 12:30 AM, ch huang justlo...@gmail.com wrote: hi,all: i have a question,i start pump

Re: Oozie dynamic action

2013-09-17 Thread Peyman Mohajerian
If you want to see a simple example of what you are looking for: https://github.com/cloudera/cdh-twitter-example It is part of this article: http://blog.cloudera.com/blog/2012/09/analyzing-twitter-data-with-hadoop/ On Tue, Sep 17, 2013 at 4:20 AM, praveenesh kumar praveen...@gmail.com wrote:

Re: How to speed up Hadoop?

2013-09-05 Thread Peyman Mohajerian
How about this: http://hadoop.apache.org/docs/stable/vaidya.html I've never tried it myself; I was just reading about it today. On Thu, Sep 5, 2013 at 5:57 PM, Preethi Vinayak Ponangi vinayakpona...@gmail.com wrote: Solution 1: Throw more hardware at the cluster. That's the whole point of

Re: Hadoop Clients (Hive,Pig) and Hadoop Cluster

2013-08-29 Thread Peyman Mohajerian
Regarding Sqoop, you can install it wherever you have access to both your database and the HDFS cluster; you could, e.g., install it on the NameNode if you want, as long as it has access to the database that is the source or target of your data transfer. On Thu, Aug 29, 2013 at 3:11 PM, Raj Hadoop
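
An illustrative Sqoop import run from such a client node; the JDBC URL, credentials, table, and target directory are all placeholders:

  sqoop import \
    --connect jdbc:mysql://dbhost:3306/sales \
    --username etl -P \
    --table orders \
    --target-dir /data/raw/orders \
    --num-mappers 4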

Re: ETL Tools

2013-05-21 Thread Peyman Mohajerian
Apache Flume is one option. On Tue, May 21, 2013 at 7:32 AM, Aji Janis aji1...@gmail.com wrote: Hello users, I am interested in hearing about what sort of ETL tools are you using with your cloud based apps. Ideally, I am looking ETL(s) with the following feature: -free (yup)

Re: Mapreduce jobs to download job input from across the internet

2013-04-17 Thread Peyman Mohajerian
Apache Flume may help you for this use case. I read an article on Cloudera's site about using Flume to pull tweets, and the same idea may apply here. On Tue, Apr 16, 2013 at 9:26 PM, David Parks davidpark...@yahoo.com wrote: For a set of jobs to run I need to download about 100GB of data from the

Re: Input path with no Output path

2012-12-07 Thread Peyman Mohajerian
I think this does it: http://hadoop.apache.org/docs/r0.20.1/api/org/apache/hadoop/mapreduce/lib/output/NullOutputFormat.html On Fri, Dec 7, 2012 at 10:06 AM, Oleg Zhurakousky oleg.zhurakou...@gmail.com wrote: Guys I have a simple mapper that reads a records and sends out a message as it

Error Using Hadoop .20.2/Mahout.4 on Solr 3.4

2012-01-17 Thread Peyman Mohajerian
Hi Guys, I'm running Clojure code inside Solr 3.4 that makes calls to Mahout 0.4 for a text clustering job. Due to some issues with Clojure I had to put all the jar files in the Solr war file ('WEB-INF/lib'). I also made sure to put the Hadoop core and MapReduce config XML files in the same

Re: From a newbie: Questions and will MapReduce fit our needs

2011-08-26 Thread Peyman Mohajerian
Hi, You should definitely take a look at Apache Sqoop as previously mentioned. If your file is large enough and you have several map jobs running and hitting your database concurrently, you will experience issues at the database level. In terms of speculative jobs (redundant jobs) running to deal with