It is very common practice to back up the metadata to some SAN store, so a
complete loss of all the metadata is preventable. You could lose a day's
worth of data if, e.g., you back up the metadata once a day, but you could
do it more frequently. I'm not saying S3 or Azure Blob are bad ideas.
On
Your log statement should show up in the container log for that Map class
within the data node. I'm guessing you aren't looking in the right place.
On Thu, Sep 24, 2015 at 9:54 AM, xeonmailinglist
wrote:
> Does anyone know the answer to this question about logging in MapReduce? I
Maybe this helps:
https://cwiki.apache.org/confluence/display/Hive/WebHCat+Reference+Pig
On Fri, Apr 3, 2015 at 5:56 AM, Remy Dubois rdub...@talend.com wrote:
Hi everyone,
I used to think about the constraint that a Hadoop client has to know, and
to have access to, every single datanode
in the end phase of
the project.
Thanks
Shashi
On Sat, Jan 3, 2015 at 10:32 PM, Peyman Mohajerian mohaj...@gmail.com
wrote:
Hi Shashi,
Sure, you can use JSON instead of Parquet. I was thinking in terms of
using Hive for processing the data, but if you'd like to use Drill (which I
heard
to parquet
If I convert into JSON and store it in Hive in Parquet format, is this a
feasible option?
The reason I want to convert to JSON is that Apache Drill works very well
with the JSON format.
Thanks
Shashi
On Sat, Jan 3, 2015 at 10:08 PM, Peyman Mohajerian mohaj...@gmail.com
wrote:
You can land the data in HDFS as XML files and use a Hive XML SerDe to read
the data and write it back in a more optimal format, e.g. ORC or Parquet
(depending somewhat on your choice of Hadoop distro). Querying XML data
directly via Hive is also doable but slow. Converting to Avro is also
doable
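A rough sketch of that flow, assuming the open-source hivexmlserde jar
(com.ibm.spss.hive.serde2.xml.XmlSerDe) is on Hive's classpath; the table,
columns, XPaths, and paths below are made up for illustration:

hive -e "
  CREATE EXTERNAL TABLE raw_xml (id STRING, amount BIGINT)
  ROW FORMAT SERDE 'com.ibm.spss.hive.serde2.xml.XmlSerDe'
  WITH SERDEPROPERTIES (
    'column.xpath.id'='/record/@id',
    'column.xpath.amount'='/record/amount/text()'
  )
  STORED AS
    INPUTFORMAT 'com.ibm.spss.hive.serde2.xml.XmlInputFormat'
    OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat'
  LOCATION '/landing/xml'
  TBLPROPERTIES ('xmlinput.start'='<record','xmlinput.end'='</record>');
  -- rewrite into a columnar format (STORED AS PARQUET needs Hive 0.13+)
  CREATE TABLE records_parquet STORED AS PARQUET AS SELECT * FROM raw_xml;
"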
It may be easier to copy the data to S3 and then from S3 to the new cluster.
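A minimal sketch of that round trip with distcp (bucket and paths are made
up; s3n:// was the usual scheme on clusters of that vintage, s3a:// on
newer releases, and the AWS keys have to be configured in core-site.xml):

# on the old cluster
hadoop distcp /user/data s3n://my-bucket/hdfs-backup
# later, on the new cluster
hadoop distcp s3n://my-bucket/hdfs-backup /user/data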
On Fri, Sep 19, 2014 at 8:45 PM, Jameel Al-Aziz jam...@6sense.com wrote:
Hi all,
We’re in the process of migrating from EC2-Classic to VPC and needed to
transfer our HDFS data. We set up a new cluster inside the
If your data is in different partitions in HDFS, you can simply use tools
like Hive or Pig to read the data in a given partition, filter out the bad
data, and overwrite the partition. This kind of data cleansing is common
practice.
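A sketch of the overwrite step in Hive (table, columns, partition value,
and the badness filter are all made up):

hive -e "
  INSERT OVERWRITE TABLE events PARTITION (dt='2014-07-01')
  SELECT user_id, url, ts       -- non-partition columns only
  FROM events
  WHERE dt='2014-07-01' AND user_id IS NOT NULL;  -- drop the bad rows
"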
I'm not sure why there is such a back and forth on this topic. Of course
Hadoop has a getmerge command (
http://hadoop.apache.org/docs/r0.19.1/hdfs_shell.html#getmerge).
I'm not certain it works with RCFile, but I think it should. So maybe you
don't have to copy the files to local.
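Usage is just (source directory and local destination made up):

# concatenate every file under the HDFS directory into one local file
hadoop fs -getmerge /user/hive/warehouse/mytable /tmp/merged.out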
On Thu, Jul 17, 2014 at 6:18 AM, Naganarasimha G R (Naga)
This statement is inaccurate. Not all machine learning involves iterative
computation, and not all datasets can fit in memory. I'm not an expert in
Machine Learning, but I know enough to know that talking about it in some
generic sense from a standpoint of Spark vs Mahout, or R vs Python, makes no
In my experience you build a node called an Edge Node, which has all the
libraries and the XML configuration settings to connect to the cluster; it
just doesn't have any of the Hadoop daemons running.
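In practice that amounts to something like this (host and paths are
hypothetical and vary by distro):

# copy the client configuration from an existing cluster node
scp -r admin@cluster-node:/etc/hadoop/conf /etc/hadoop/conf
# no daemons run here, yet HDFS commands talk to the cluster
hadoop fs -ls /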
On Wed, Jun 4, 2014 at 2:46 PM, John Lilley john.lil...@redpoint.net
wrote:
We’ve found that much
Block sizes are typically 64 MB or 128 MB, so in your case only a single
block is involved, which means if you have a single replica then only a
single data node will be used. The default replication is three, and since
you only have two data nodes, you will most likely have two copies of the
data in
Flume is not just for log files; you can wire up a Flume source for this
purpose. There are also alternative open-source solutions for data
streaming, e.g. Apache Storm or Kafka.
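A minimal sketch of a Flume agent listening on a TCP port and landing
events in HDFS (agent name, port, and path are made up; netcat is Flume's
simple built-in TCP source):

cat > sensor-agent.conf <<'EOF'
a1.sources = tcp1
a1.channels = mem1
a1.sinks = hdfs1
a1.sources.tcp1.type = netcat
a1.sources.tcp1.bind = 0.0.0.0
a1.sources.tcp1.port = 44444
a1.sources.tcp1.channels = mem1
a1.channels.mem1.type = memory
a1.sinks.hdfs1.type = hdfs
a1.sinks.hdfs1.hdfs.path = hdfs:///data/sensors
a1.sinks.hdfs1.channel = mem1
EOF
flume-ng agent -n a1 -c conf -f sensor-agent.conf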
On Tue, May 6, 2014 at 10:48 PM, Alex Lee eliy...@hotmail.com wrote:
Sensors may send TCP/IP data to the server. Each
Whether you use Storm/Kafka or any other realtime processing or not, you
may still need to persist the data, which can be done directly to HBase
from any of these realtime systems or from the source.
On Thu, May 8, 2014 at 9:25 PM, Hardik Pandya smarty.ju...@gmail.com wrote:
If I were you I would
At the high level I think you have these choices and more:
1) Hadoop Streaming: leverage some of your Python code, but not all, b/c
you have to deal with map/reduce (a sketch follows below).
2) Use Mahout.
3) Use a distro of R that works with Hadoop.
..
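A minimal sketch of option 1, assuming mapper.py and reducer.py already
exist (with shebang lines) and the streaming jar path matches your distro:

hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \
  -input /data/in -output /data/out \
  -mapper mapper.py -reducer reducer.py \
  -file mapper.py -file reducer.py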
On Thu, Apr 24, 2014 at 1:58 PM, qiaoresearcher
hadoop fsck path -files -blocks -locations
On Mon, Apr 14, 2014 at 4:43 PM, Alexandros Papadopoulos
alex.pap...@gmail.com wrote:
Hi all,
in some cases, as an HDFS client, I would like to know a file block's path
on a datanode.
Is there a way to get a file block's path on a datanode?
There is: http://hadoop.apache.org/docs/r1.2.1/hadoop_archives.html
But I'm not sure whether it compresses the data or not.
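Creating one looks like this (paths and archive name made up); as far as I
know, HAR just packs many small files into a few bigger ones to relieve the
NameNode, without compressing:

# archive /user/me/input into /user/me/archived/files.har
hadoop archive -archiveName files.har -p /user/me input /user/me/archived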
On Thu, Apr 10, 2014 at 9:57 PM, Stanley Shi s...@gopivotal.com wrote:
AFAIK, no tools now.
Regards,
Stanley Shi
On Fri, Apr 11, 2014 at 9:09 AM, ch huang
If you didn't close the file correctly then the NameNode wouldn't be
notified of the final size of the file. The file size is metadata coming
from the NameNode.
On Tue, Apr 8, 2014 at 4:35 AM, Tao Xiao xiaotao.cs@gmail.com wrote:
I wrote some data into a file using MultipleOutputs in mappers. I
The reason for replication also has to do with data locality when running
map-reduce jobs on a larger cluster. You can reduce the replication;
that's why it's a configurable parameter.
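Both knobs are easy to reach (path made up):

# default for newly written files, in hdfs-site.xml: dfs.replication = 2
# or change existing files from the shell; -w waits for it to finish
hadoop fs -setrep -w 2 /user/me/data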
On Thu, Apr 3, 2014 at 7:10 AM, Fengyun RAO raofeng...@gmail.com wrote:
I know the default replication is 3,
Maybe you just have to add the 'mapred' user to the group that owns the
'hdfs' root directory; it seems the group name is called 'hdfs'.
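Something like this, assuming a standard Linux box:

# check which group owns the top-level directories
hadoop fs -ls /
# add mapred to the hdfs group
sudo usermod -a -G hdfs mapred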
On Sat, Feb 15, 2014 at 9:56 AM, Panshul Whisper ouchwhis...@gmail.com wrote:
Hello,
I had a Cloudera Hadoop cluster running on AWS EC2 instances. I used the
The staging table is typically defined as an external Hive table; the data
is loaded directly onto HDFS, so the staging table is able to read that
data directly from HDFS and then transfer it to Hive-managed tables with
your current statement. Of course there are variations to this as well.
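A bare-bones sketch of that pattern (table names, columns, and the landing
path are made up, and the managed table is assumed to already exist):

hive -e "
  -- external staging table: just points at files already landed on HDFS
  CREATE EXTERNAL TABLE staging_events (user_id STRING, ts BIGINT)
  ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
  LOCATION '/landing/events';
  -- transfer into the Hive-managed table
  INSERT INTO TABLE managed_events SELECT user_id, ts FROM staging_events;
"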
On Sun, Feb
tag combination IDs in the output, not
the number of matches, for the given set of tags..
please illustrate your thought a little by taking my tag combination table
design..
On Tue, Jan 28, 2014 at 10:57 PM, Peyman Mohajerian mohaj...@gmail.com wrote:
A NoSQL solution with real-time counters would
Maybe it's inode exhaustion:
the 'df -i' command can tell you more.
On Mon, Jan 27, 2014 at 12:00 PM, John Lilley john.lil...@redpoint.net wrote:
I've found that the error occurs right around a threshold where 20 tasks
attempt to open 220 files each. This is ... slightly over 4k total files
You can run a series of map-reduce jobs on your data. If some log line is
related to another line, e.g. based on sessionId, you can emit the
sessionId as the key of your mapper output, with the value being the rows
associated with that sessionId, so on the reducer side data from different
blocks
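A tiny sketch of that shuffle-by-sessionId idea with Hadoop Streaming
(paths made up; it assumes the sessionId is the first whitespace-separated
field of each log line):

cat > session_mapper.sh <<'EOF'
#!/bin/sh
# emit "sessionId<TAB>whole line" so the shuffle groups lines by session
awk '{ print $1 "\t" $0 }'
EOF
chmod +x session_mapper.sh
hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \
  -input /logs/raw -output /logs/by-session \
  -mapper session_mapper.sh -reducer /bin/cat \
  -file session_mapper.sh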
OK, I just read the book section on this (Hadoop: The Definitive Guide),
just to be sure: the length of a file is stored in the Name Node, and it's
updated only after the client calls the Name Node on close of the file. At
that point, if the Name Node has received all the ACKs from the Data Nodes,
then it will set the length
Your Flume source is a 'custom source'; you will put the code for
connecting to the source data there, and then link the Flume source with
the channel and the HDFS sink. There is a Twitter example on the Cloudera
blog, a three-part series that explains this in nice detail.
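The wiring in the agent config looks roughly like this (the source class
name is hypothetical; the channel/sink plumbing is the usual Flume pattern):

cat > agent.conf <<'EOF'
a1.sources = src1
a1.channels = ch1
a1.sinks = sink1
# your custom Source class, built and dropped on Flume's classpath
a1.sources.src1.type = com.example.flume.MyCustomSource
a1.sources.src1.channels = ch1
a1.channels.ch1.type = memory
a1.sinks.sink1.type = hdfs
a1.sinks.sink1.hdfs.path = hdfs:///flume/incoming
a1.sinks.sink1.channel = ch1
EOF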
On Sun, Dec 8, 2013 at 9:58
I wonder if a JDBC driver over Hive could help you. If your legacy ETL job
can talk to a JDBC driver, it is a slow way of writing to HDFS and I don't
have any experience doing it, e.g.:
http://doc.cloveretl.com/documentation/UserGuide/index.jsp?topic=/com.cloveretl.gui.docs/docs/hive-connection.html
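For a quick sanity check that the JDBC path works at all, newer Hive
releases ship beeline, which talks JDBC to HiveServer2 (host and statement
made up):

beeline -u jdbc:hive2://hive-host:10000/default \
  -e "INSERT INTO TABLE target_table SELECT * FROM staging_table"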
For the JobClient to compute the input splits, doesn't it need to contact
the Name Node? Only the Name Node knows where the splits are; how can it
compute them without that additional call?
On Fri, Sep 27, 2013 at 1:41 AM, Sonal Goyal sonalgoy...@gmail.com wrote:
The input splits are not copied, only the
In my experience with Flume and this issue, it occurs when the file is not
properly closed. If it were closed properly, it would show the correct size
and Hive would read the content.
On Wed, Sep 25, 2013 at 12:30 AM, ch huang justlo...@gmail.com wrote:
hi all,
i have a question: i start pump
If you want to see a simple example of what you are looking for:
https://github.com/cloudera/cdh-twitter-example
It is part of this article:
http://blog.cloudera.com/blog/2012/09/analyzing-twitter-data-with-hadoop/
On Tue, Sep 17, 2013 at 4:20 AM, praveenesh kumar praveen...@gmail.com wrote:
How about this: http://hadoop.apache.org/docs/stable/vaidya.html
I've never tried it myself; I was just reading about it today.
On Thu, Sep 5, 2013 at 5:57 PM, Preethi Vinayak Ponangi
vinayakpona...@gmail.com wrote:
Solution 1: Throw more hardware at the cluster. That's the whole point of
Regarding Sqoop, you can install it wherever you have access to your
database and the HDFS cluster; you could, e.g., install it on the namenode
if you want, as long as it has access to the database that is the source or
target of your data transfer.
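A typical import invocation from whichever node you pick (connection
string, credentials, table, and target dir are made up):

sqoop import \
  --connect jdbc:mysql://db-host/sales \
  --username etl -P \
  --table orders \
  --target-dir /data/orders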
On Thu, Aug 29, 2013 at 3:11 PM, Raj Hadoop
Apache Flume is one option.
On Tue, May 21, 2013 at 7:32 AM, Aji Janis aji1...@gmail.com wrote:
Hello users,
I am interested in hearing about what sort of ETL tools you are using with
your cloud-based apps. Ideally, I am looking for ETL(s) with the following
features:
- free (yup)
Apache Flume may help you with this use case. I read an article on
Cloudera's site about using Flume to pull tweets, and the same idea may
apply here.
On Tue, Apr 16, 2013 at 9:26 PM, David Parks davidpark...@yahoo.com wrote:
For a set of jobs to run I need to download about 100GB of data from the
I think this does it:
http://hadoop.apache.org/docs/r0.20.1/api/org/apache/hadoop/mapreduce/lib/output/NullOutputFormat.html
On Fri, Dec 7, 2012 at 10:06 AM, Oleg Zhurakousky
oleg.zhurakou...@gmail.com wrote:
Guys
I have a simple mapper that reads records and sends out a message as it
Hi Guys,
I'm running Clojure code inside Solr 3.4 that makes calls to Mahout
0.4 for some text clustering job. Due to some issues with Clojure I had
to put all the jar files in the solr war file ('WEB-INF/lib'). I also
made sure to put hadoop core and mapreduce config xml files in the
same
Hi,
You should definitely take a look at Apache Sqoop, as previously mentioned.
If your file is large enough and you have several map jobs running and
hitting your database concurrently, you will experience issues at the db
level.
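Sqoop lets you cap that concurrency directly (connection details made up);
--num-mappers bounds the number of parallel connections to the database:

sqoop export \
  --connect jdbc:mysql://db-host/warehouse \
  --table results \
  --export-dir /data/results \
  --num-mappers 4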
In terms of speculative jobs (redundant jobs) running to deal with