Data collection

2014-01-14 Thread Moty Michaely
Hi,

I just wanted to share with you guys a post I've just published about
ways of collecting (big) data. I believe it might be of interest to you.

http://goo.gl/Et2KRf

Enjoy,

-- 

*Moty Michaely*

VP R&D

Cell: +972 (52) 631-1019

Email: m...@xplenty.com

Web: www.xplenty.com


Web/Data Analytics and Data Collection using Hadoop

2010-03-22 Thread Utku Can Topçu
Hey All,

In a project I'm currently involved in, we're about to make design choices
regarding the use of Hadoop as a scalable, distributed data-analytics
framework.
The application will be the base of a web analytics tool, so my view is
that Hadoop is the right choice for analyzing the collected data.
Collecting the data, however, is a separate issue, and the collection
architecture needs a serious design decision of its own.

What I'd like to have in production is distributed and scalable data
collection. The current situation is that we have multiple servers in 3-4
different locations, each collecting some sort of data.
The basic approach to analyzing this distributed data would be to log it
into structured text files, transfer those files to the Hadoop cluster, and
analyze them there with MapReduce jobs.
The basic process I have in mind goes like this:
- Transfer the log files to the Hadoop master (collectors to master)
- Put the files from the master into HDFS (master to the cluster)
Clearly there is overhead in transferring the log files, and the big log
files will have to be read in full even when only a small portion of the
data is needed.
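As a toy sketch of those two steps, with made-up local paths and no real
cluster involved (the second copy stands in for `hadoop fs -put`):

```shell
# Toy sketch of the two-step flow; all paths here are placeholders.
# Step 1: a collector host ships a rotated log to the master
# (scp/rsync in practice; simulated here with a local write).
mkdir -p /tmp/collector-demo/incoming /tmp/collector-demo/hdfs-stage
echo "2010-03-22T04:50:00 GET /index.html 200" > /tmp/collector-demo/incoming/access.log

# Step 2: the master loads the file into HDFS. On a real setup this
# would be something like:
#   hadoop fs -put /tmp/collector-demo/incoming/access.log /logs/2010/03/22/
cp /tmp/collector-demo/incoming/access.log /tmp/collector-demo/hdfs-stage/
```

Note that in this layout every collected byte passes through the master as
a staging point, which is exactly the transfer overhead described above.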

A better option would be to log directly into a distributed database such
as Cassandra or HBase, so the MapReduce jobs would fetch the data from the
database for analysis, and the data would also be randomly accessible and
queryable in near real time.
I'm not very familiar with distributed databases, but as far as I can see:
- If we use Cassandra to store the logging data, there is no connection
overhead for writes, since every node in the cluster can accept incoming
write requests. With HBase, however, I'm afraid we would have to write
through the master only; in that case there would be a connection
bottleneck at the master, and we could only scale up to the master's
throughput. From this point of view, logging to HBase doesn't seem
scalable.
- On the other hand, if Cassandra were the choice for keeping the log
data, I'm afraid that with a separate Cassandra cluster, which is not
directly part of the Hadoop ecosystem, we would lose data locality when
reading the data in MapReduce jobs. With HBase we would keep data
locality, since it sits directly on top of HDFS.
- Is there a stable way to integrate Cassandra with Hadoop?

Finally, Chukwa looks like a good fit for this kind of data collection:
each source server runs an Agent, which ships the data to Collectors that
reside close to HDFS. After a series of pipelined processes, the data
becomes cleanly available for analysis with MapReduce jobs.
I still see some connection overhead bounded by the master's throughput in
this scenario, and the data to analyze again ends up in big files, so
analyzing even a small sample range of the data would require reading the
full files.
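For reference, a Chukwa source host is typically wired up with an
agent-side adaptor list and a collectors file roughly like the following.
The host names and paths are made up, and the exact adaptor class name
varies between Chukwa versions, so treat this purely as a sketch and check
the admin guide for your release:

```
# conf/initial_adaptors on each source host: tail the app log,
# tagging its records with the datatype "AppLog"
add org.apache.hadoop.chukwa.datacollection.adaptor.filetailer.CharFileTailingAdaptorUTF8 AppLog /var/log/app/access.log 0

# conf/collectors on each agent: the collectors to ship chunks to
http://collector1.example.com:8080/
http://collector2.example.com:8080/
```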

These are, briefly, the options I've worked out so far. Each decision
comes with some drawback and offers some functionality the others don't.

Has anyone on the list solved a need like this before? I'm open to all
kinds of comments and suggestions.

Best Regards,
Utku


Re: Web/Data Analytics and Data Collection using Hadoop

2010-03-22 Thread Bill Graham
Hi Utku,

We're using Chukwa to collect and aggregate data as you describe, and so
far it's working well. Typically Chukwa collectors are deployed to all data
nodes, so there is actually no master write bottleneck with this approach.

There have been discussions lately on the Chukwa list regarding how to
write data into HBase using Chukwa collectors or data processors; you
might want to check those out.

thanks,
Bill



Re: Web/Data Analytics and Data Collection using Hadoop

2010-03-22 Thread Utku Can Topçu
Hi Bill,

Thank you for your comments,
I guess the main thing about a Chukwa installation on top of Hadoop is
that you somehow need to connect to the namenode from the collectors.
Isn't that the case when reaching HDFS, or do the Chukwa collectors write
to the local drives instead of HDFS?

Best,
Utku



Re: Web/Data Analytics and Data Collection using Hadoop

2010-03-22 Thread Bill Graham
Sure, any framework that writes data into HDFS will need to communicate with
the namenode. So yes, there can potentially be large numbers of connections
to the namenode.

I (possibly mistakenly) thought you were speaking specifically of a
bottleneck caused by writes through a single master node. The actual data
does not go through the namenode, though, so there is no bottleneck in the
data flow.


Re: Web/Data Analytics and Data Collection using Hadoop

2010-03-22 Thread Ariel Rabkin
The collectors do indeed need to contact the namenode.

Chukwa goes to some trouble to minimize the load on the namenode by
making sure that data is consolidated into large files. As a result,
we can scale to several hundred MB/sec of collected data without
problems.
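The consolidation Ari describes can be pictured with plain local files (a
toy stand-in; Chukwa does this internally before writing its sink files to
HDFS, and all paths below are made up):

```shell
# Many small chunks arrive from agents...
mkdir -p /tmp/chukwa-demo/chunks
for i in 1 2 3; do echo "record $i" > /tmp/chukwa-demo/chunks/part-$i; done

# ...and the collector appends them into one large sink file, so the
# namenode tracks a single file (plus its blocks) instead of one file
# per chunk.
cat /tmp/chukwa-demo/chunks/part-* > /tmp/chukwa-demo/sink.log
```

Since namenode load grows with the number of files and blocks rather than
with raw bytes, fewer, larger files is what keeps the metadata pressure low.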

--Ari


-- 
Ari Rabkin asrab...@gmail.com
UC Berkeley Computer Science Department