Re: incremental loads into hadoop

2011-10-03 Thread Sam Seigal
I have given HBase a fair amount of thought, and I am looking for
input. Instead of managing incremental loads myself, why not just
set up an HBase cluster? What are some of the trade-offs?
My primary use for this cluster would still be data
analysis/aggregation and not so much random access. Random access
would be nice to have in case there are problems and we want to
examine the data ad hoc.
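For concreteness, a minimal sketch of what the HBase route could look like with the plain HBase Java client. The "events" table, the "d" column family, and the time-prefixed row key are illustrative assumptions, not a recommendation: writes arrive as individual Puts, and aggregation reads become range scans instead of file reads.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

// Sketch only: table name, column family and key scheme are assumptions.
// A real schema would use fixed-width (zero-padded or binary) timestamps
// so that row keys sort chronologically.
public class HBaseEventSink {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "events");

        // Incremental load: each OLTP transaction becomes one Put.
        long eventTime = System.currentTimeMillis();
        Put put = new Put(Bytes.toBytes(eventTime + "-txn-0001"));
        put.add(Bytes.toBytes("d"), Bytes.toBytes("amount"), Bytes.toBytes("4250"));
        table.put(put);

        // Aggregation / ad-hoc access: scan a time range instead of reading files.
        Scan scan = new Scan(Bytes.toBytes(String.valueOf(eventTime - 3600000L)),
                             Bytes.toBytes(String.valueOf(eventTime + 1)));
        ResultScanner scanner = table.getScanner(scan);
        for (Result row : scanner) {
            // feed each row into the aggregation here
        }
        scanner.close();
        table.close();
    }
}

The usual trade-off: HBase removes the need to manage file rollover and gives random access for ad-hoc lookups, but large scans for heavy aggregation tend to be slower than MapReduce over flat HDFS files, so the fit still depends on the workload.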


On Sat, Oct 1, 2011 at 12:31 PM, in.abdul in.ab...@gmail.com wrote:
 There are two methods for processing OLTP data:

   1. HStreaming or Scribe; these are the main methods.
   2. Otherwise, use Chukwa to stage the data, so that once you have a
   decent volume you can move it into HDFS.

            Thanks and Regards,
        S SYED ABDUL KATHER
                9731841519


 On Sat, Oct 1, 2011 at 4:32 AM, Sam Seigal [via Lucene] 
 ml-node+s472066n3383949...@n3.nabble.com wrote:

 Hi,

 I am relatively new to Hadoop and was wondering how to do incremental
 loads into HDFS.

 I have a continuous stream of data flowing into a service which is
 writing to an OLTP store. Due to the high volume of data, we cannot do
 aggregations on the OLTP store, since this starts affecting the write
 performance.

 We would like to offload this processing into a Hadoop cluster, mainly
 for doing aggregations/analytics.

 The question is how can this continuous stream of data be
 incrementally loaded and processed into Hadoop ?

 Thank you,

 Sam




Re: incremental loads into hadoop

2011-10-01 Thread Sam Seigal
Hi Bejoy,

Thanks for the response.

While reading about Hadoop, I have come across threads where people
claim that Hadoop is not a good fit for a large number of small files;
it is better suited to files that are gigabytes/petabytes in size.

If I am doing incremental loads, let's say every hour, do I need to
wait until, say, the end of the day, when enough data has been
collected, to start off a MapReduce job? I am also wondering whether
an open file that is continuously being written to can at the same
time be used as input to an M/R job ...

Also, let's say I did not want to do a load straight off the DB. The
service, when committing a transaction to the OLTP system, sends a
message for that transaction to a Hadoop service that then writes the
transaction into HDFS (the services are connected to each other via a
persisted queue, and hence are eventually consistent, but that is not a
big deal). What should I keep in mind while designing a service like
this?

Should the files first be written to local disk and then, once they
reach a large enough size (let us say the cut-off is 100 GB), be
uploaded into the cluster using put? Or can they be written directly
into an HDFS file as the data streams in?
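For what it is worth, a rough sketch of the second option: writing records straight into HDFS as they stream in, and rolling to a new file once a size threshold is reached so that completed files can be handed to MapReduce. The path layout under /data/events, the class name, and the roll size are illustrative assumptions.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Writes incoming records straight into HDFS, rolling to a new file once
// the current one passes a size threshold. Paths and the roll size are
// assumptions for this sketch, not a recommendation.
public class HdfsEventWriter {
    private static final long ROLL_BYTES = 128L * 1024 * 1024; // roughly one HDFS block

    private final FileSystem fs;
    private FSDataOutputStream out;
    private long written = 0;
    private int part = 0;

    public HdfsEventWriter(Configuration conf) throws IOException {
        this.fs = FileSystem.get(conf);
        roll();
    }

    public synchronized void write(String record) throws IOException {
        byte[] bytes = (record + "\n").getBytes("UTF-8");
        out.write(bytes);
        written += bytes.length;
        if (written >= ROLL_BYTES) {
            roll(); // close the current file so M/R jobs can pick it up
        }
    }

    private void roll() throws IOException {
        if (out != null) {
            out.close();
        }
        // Hour-bucketed directories so each hourly/daily job reads a closed set of files.
        Path path = new Path(String.format("/data/events/%tY-%<tm-%<td/%<tH/part-%05d",
                System.currentTimeMillis(), part++));
        out = fs.create(path);
        written = 0;
    }

    public synchronized void close() throws IOException {
        out.close();
    }
}

Either way, a MapReduce job should only be pointed at files that are already closed, so rolling (or moving) completed files into per-hour or per-day input directories is the usual pattern; the local-disk-then-put approach works too, it just adds one more copy and some delay.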

Thank you for your help.


Sam





On Sat, Oct 1, 2011 at 12:19 PM, Bejoy KS bejoy.had...@gmail.com wrote:
 Sam
      Try looking into Flume if you need to load incremental data into HDFS.
 If the source data is present in some JDBC-compliant databases, then you can
 use Sqoop to get the data directly into HDFS or Hive incrementally. For big
 data aggregation and analytics, Hadoop is definitely a good choice, as you can
 use MapReduce or optimized tools on top of MapReduce, like Hive or Pig, that
 cater to the purpose very well. So, in short, for the two steps you can go
 with the following:
 1. Load into Hadoop/HDFS - use Flume or Sqoop, depending on your source.
 2. Process within Hadoop/HDFS - use Hive or Pig. These tools are well
 optimised, so go in for a custom map reduce job if and only if you feel these
 tools don't fit some complex processing need (a bare-bones sketch of such a
 job follows below).
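As a rough illustration of step 2, here is a bare-bones custom map reduce aggregation of the kind mentioned above; in practice a short Hive or Pig script would usually replace it. The record layout (key, tab, whole-number amount per line), the class names, and the input/output paths passed on the command line are assumptions for this sketch.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Sums a numeric amount per key from tab-separated input lines.
// The record layout (key \t amount) is an assumption for this sketch.
public class SumPerKey {

    public static class SumMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
        @Override
        protected void map(LongWritable offset, Text line, Context ctx)
                throws IOException, InterruptedException {
            String[] fields = line.toString().split("\t");
            if (fields.length >= 2) {
                ctx.write(new Text(fields[0]), new LongWritable(Long.parseLong(fields[1])));
            }
        }
    }

    public static class SumReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
        @Override
        protected void reduce(Text key, Iterable<LongWritable> values, Context ctx)
                throws IOException, InterruptedException {
            long sum = 0;
            for (LongWritable v : values) {
                sum += v.get();
            }
            ctx.write(key, new LongWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "sum-per-key");
        job.setJarByClass(SumPerKey.class);
        job.setMapperClass(SumMapper.class);
        job.setCombinerClass(SumReducer.class); // sum is associative, so reuse the reducer
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g. /data/events/2011-10-01
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // aggregate output directory
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}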

 There may be other tools as well for getting the source data into HDFS. Let
 us leave it open for others to comment.

 Hope it helps.

 Thanks and Regards
 Bejoy.K.S


 On Sat, Oct 1, 2011 at 4:32 AM, Sam Seigal selek...@yahoo.com wrote:

 Hi,

 I am relatively new to Hadoop and was wondering how to do incremental
 loads into HDFS.

 I have a continuous stream of data flowing into a service which is
 writing to an OLTP store. Due to the high volume of data, we cannot do
 aggregations on the OLTP store, since this starts affecting the write
 performance.

 We would like to offload this processing into a Hadoop cluster, mainly
 for doing aggregations/analytics.

 The question is how can this continuous stream of data be
 incrementally loaded and processed into Hadoop ?

 Thank you,

 Sam




incremental loads into hadoop

2011-09-30 Thread Sam Seigal
Hi,

I am relatively new to Hadoop and was wondering how to do incremental
loads into HDFS.

I have a continuous stream of data flowing into a service which is
writing to an OLTP store. Due to the high volume of data, we cannot do
aggregations on the OLTP store, since this starts affecting the write
performance.

We would like to offload this processing into a Hadoop cluster, mainly
for doing aggregations/analytics.

The question is how can this continuous stream of data be
incrementally loaded and processed into Hadoop ?

Thank you,

Sam