Data collection
Hi, I just wanted to share a post I've published about ways of collecting (big) data: http://goo.gl/Et2KRf I believe it might be of interest to you. Enjoy, -- *Moty Michaely* VP R&D Cell: +972 (52) 631-1019 Email: m...@xplenty.com Web: www.xplenty.com
Web/Data Analytics and Data Collection using Hadoop
Hey All,

In a project I'm involved in, we're about to make design choices regarding the use of Hadoop as a scalable and distributed data analytics framework. The application will be the base of a web analytics tool, so I believe Hadoop is a fine choice for analyzing the collected data. The collection of the data, however, is a different issue: the data collection architecture needs serious design decisions of its own. I'd like to have distributed and scalable data collection in production.

The current situation is that we have multiple servers in 3-4 different locations, each collecting some sort of data. The basic approach to analyzing this distributed data would be to log it into structured text files, transfer those to the Hadoop cluster, and analyze them with MapReduce jobs. The process I have in mind is:

- Transfer the log files to the Hadoop master (collectors to master)
- Put the files on the master into HDFS (master to the cluster)

Clearly there is overhead in transferring the log files, and the big log files would have to be read in full even when only a small portion of the data is needed.

A better option would be logging directly into a distributed database such as Cassandra or HBase, so the MapReduce jobs fetch the data from the database for analysis. The data would also be randomly accessible and open to real-time queries. I'm not very familiar with distributed databases, but as far as I can see:

- If we use Cassandra to store the logging data, there is no connection overhead for writes, since every node in the cluster can accept incoming write requests. In HBase, I'm afraid we'd have to write through the master only; in that case there would be a connection overhead on the master and we could only scale up to the master's throughput. From this point of view, logging to HBase doesn't seem scalable.
- On the other hand, since a Cassandra cluster is not directly part of the Hadoop ecosystem, I'm afraid we'd lose data locality when using the data in MapReduce jobs if Cassandra keeps the log data. With HBase we would retain data locality, since it sits directly on HDFS.
- Is there a stable way to integrate Cassandra with Hadoop?

Finally, Chukwa seems like a good choice for this kind of data collection: each source server runs an Agent, which transfers the data to Collectors that reside close to HDFS. After a series of pipelined processes, the data becomes available for analysis with MapReduce jobs. I still see some connection overhead bounded by the master's throughput in this scenario, and the data again ends up in big files, so analyzing a sample range would require reading the full files.

These are the options I've identified so far. Each choice comes with some drawback and offers some functionality the others don't. Has anyone on the list solved this kind of requirement before?

I'm open to all kinds of comments and suggestions,

Best Regards,
Utku
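[Editorial note: the agent/collector pattern Utku describes can be sketched as a toy in-process model. This is purely illustrative and not the actual Chukwa API; the `Agent` and `Collector` classes and their methods are invented for the sketch.]

```python
from queue import Queue

class Agent:
    """Runs on a source server; batches log records before shipping them."""
    def __init__(self, collector_queue, batch_size=3):
        self.queue = collector_queue
        self.batch_size = batch_size
        self.buffer = []

    def log(self, record):
        self.buffer.append(record)
        if len(self.buffer) >= self.batch_size:
            self.flush()

    def flush(self):
        # Ship whatever is buffered as one batch (even a partial one).
        if self.buffer:
            self.queue.put(list(self.buffer))
            self.buffer.clear()

class Collector:
    """Sits close to HDFS; consolidates incoming batches into one large sink."""
    def __init__(self, collector_queue):
        self.queue = collector_queue
        self.sink = []  # stand-in for a large file on HDFS

    def drain(self):
        while not self.queue.empty():
            self.sink.extend(self.queue.get())

q = Queue()
agent = Agent(q)
collector = Collector(q)
for i in range(7):
    agent.log(f"record-{i}")
agent.flush()        # ship the partial final batch
collector.drain()
print(len(collector.sink))  # all 7 records consolidated at the collector
```

The point of the pattern is that sources never talk to the analysis cluster directly: agents batch, collectors consolidate, and only large consolidated files reach HDFS.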
Re: Web/Data Analytics and Data Collection using Hadoop
Hi Utku,

We're using Chukwa to collect and aggregate data as you describe, and so far it's working well. Typically Chukwa collectors are deployed to all data nodes, so there is actually no master write bottleneck with this approach. There have been discussions lately on the Chukwa list regarding how to write data into HBase using Chukwa collectors or data processors that you might want to check out.

thanks,
Bill

On Mon, Mar 22, 2010 at 4:50 AM, Utku Can Topçu u...@topcu.gen.tr wrote:
Re: Web/Data Analytics and Data Collection using Hadoop
Hi Bill,

Thank you for your comments. The main thing about a Chukwa installation on top of Hadoop, I guess, is that the collectors somehow need to connect to the namenode. Isn't that the case when reaching HDFS, or do the Chukwa collectors write to local drives instead of HDFS?

Best,
Utku

On Mon, Mar 22, 2010 at 6:34 PM, Bill Graham billgra...@gmail.com wrote:
Re: Web/Data Analytics and Data Collection using Hadoop
Sure, any framework that writes data into HDFS will need to communicate with the namenode, so yes, there can potentially be a large number of connections to the namenode. I (possibly mistakenly) thought you were speaking specifically of a bottleneck caused by writes through a single master node. The actual data does not go through the namenode though, so there is no bottleneck in the data flow.

On Mon, Mar 22, 2010 at 10:50 AM, Utku Can Topçu u...@topcu.gen.tr wrote:
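[Editorial note: Bill's point, that only metadata goes through the namenode while the payload streams to datanodes, can be illustrated with a toy accounting model. The per-block metadata figure is made up; this is not the HDFS wire protocol.]

```python
# Toy accounting of an HDFS-style write path: the client asks the
# namenode only for block placement (metadata), then streams the
# actual bytes straight to the datanodes.
namenode_bytes = 0
datanode_bytes = 0
BLOCK_SIZE = 64 * 1024 * 1024   # classic 64 MB HDFS block

def write_file(size):
    global namenode_bytes, datanode_bytes
    blocks = -(-size // BLOCK_SIZE)   # ceiling division
    namenode_bytes += blocks * 200    # ~200 B of metadata per block allocation (invented figure)
    datanode_bytes += size            # payload goes to the datanodes directly

write_file(500 * 1024 * 1024)         # one 500 MB log file
print(namenode_bytes, datanode_bytes)
```

Even under these made-up numbers, the namenode sees a few kilobytes of metadata while half a gigabyte of payload bypasses it entirely, which is why the namenode limits connection and file counts rather than data throughput.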
Re: Web/Data Analytics and Data Collection using Hadoop
The collectors do indeed need to contact the namenode. Chukwa goes to some trouble to minimize the load on the namenode by making sure that data is consolidated into large files. As a result, we can scale to several hundred MB/sec of collected data without problems.

--Ari

--
Ari Rabkin asrab...@gmail.com
UC Berkeley Computer Science Department
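[Editorial note: Ari's consolidation argument is easy to make concrete with back-of-the-envelope arithmetic. All the input numbers below are illustrative assumptions, not measurements from any real deployment.]

```python
# Why consolidating small records into large files keeps the number of
# files (and hence namenode metadata load) manageable.
record_size = 200               # bytes per log record (assumed)
rate = 100_000                  # records per second across all agents (assumed)
seconds = 3600                  # one hour of collection
target_file = 64 * 1024 * 1024  # consolidate into ~64 MB files

total_bytes = record_size * rate * seconds
files_if_tiny = rate * seconds                          # one file per record (worst case)
files_if_consolidated = -(-total_bytes // target_file)  # ceiling division

print(files_if_tiny, files_if_consolidated)
```

Under these assumptions an hour of collection is 360 million files in the worst case versus roughly a thousand consolidated ones; since the namenode tracks every file and block in memory, that three-orders-of-magnitude difference is the whole argument for consolidation.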