Re: When will Spark Streaming support the Kafka simple consumer API?
Hi, Tathagata

Thanks for the information. I'm building a 1.3 snapshot and will give it another try.

There are two reasons why we use the Kafka SimpleConsumer API:

1. Previously, all of our company's real-time processing systems were built on Apache Storm, so the Kafka environment is set up to support only the SimpleConsumer API. That environment is controlled by another group of engineers, and for reasons I don't know, they only support the SimpleConsumer API.

2. There is a document advising against using Kafka + Spark Streaming in production because Spark Streaming only supports the high-level API. See http://www.michael-noll.com/blog/2014/10/01/kafka-spark-streaming-integration-example-tutorial/#known-issues-in-spark-streaming

I'm not quite sure whether that advice is biased against Spark Streaming, but since we don't have any successful project as a reference, we need to be careful about it.

On Thu, Feb 5, 2015 at 12:28 PM, Tathagata Das [via Apache Spark Developers List] wrote:

> 1. There is already a third-party low-level Kafka receiver: http://spark-packages.org/package/5
>
> 2. There is a new experimental Kafka stream that will be available in the Spark 1.3 release. It is based on the low-level API and might suffice for your purpose. JIRA: https://issues.apache.org/jira/browse/SPARK-4964
>
> Can you elaborate on why you have to use SimpleConsumer in your environment?
>
> TD
>
> On Wed, Feb 4, 2015 at 7:44 PM, Xuelin Cao wrote:
>
>> Hi,
>>
>> In our environment, Kafka can only be used with the simple consumer API, like a Storm spout does. I also found suggestions that the Kafka connector of Spark should not be used in production (http://markmail.org/message/2lb776ta5sq6lgtw) because it is based on the high-level consumer API of Kafka.
>>
>> So, my question is: when will Spark Streaming support the Kafka simple consumer API?
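[Editor's note: as a concrete illustration of option 2 in TD's reply, here is a minimal, hedged sketch of the experimental direct Kafka stream that landed in Spark 1.3 (SPARK-4964). It talks to brokers via the low-level Kafka API, so no high-level consumer or ZooKeeper-managed offsets are involved. Broker addresses and the topic name below are hypothetical.]

    import kafka.serializer.StringDecoder
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka.KafkaUtils

    val ssc = new StreamingContext(sc, Seconds(10))

    // Direct stream: offsets are tracked by Spark itself, one RDD partition
    // per Kafka partition, no receiver or write-ahead log needed.
    val kafkaParams = Map("metadata.broker.list" -> "broker1:9092,broker2:9092")
    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, Set("events"))   // "events" is a hypothetical topic

    stream.map(_._2).count().print()     // message values only
    ssc.start()
    ssc.awaitTermination()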
When will Spark Streaming support the Kafka simple consumer API?
Hi,

In our environment, Kafka can only be used with the simple consumer API, like a Storm spout does. I also found suggestions that the Kafka connector of Spark should not be used in production (http://markmail.org/message/2lb776ta5sq6lgtw) because it is based on the high-level consumer API of Kafka.

So, my question is: when will Spark Streaming support the Kafka simple consumer API?
Can Spark provide an option to start the reduce stage early?
In Hadoop MR, there is an option, *mapred.reduce.slowstart.completed.maps*, which can be used to start the reduce stage once X% of the mappers have completed. This lets the data shuffling process run in parallel with the map stage. In a large multi-tenancy cluster, this option is usually turned off, but in some cases, turning it on could accelerate high-priority jobs. Will Spark provide a similar option?
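[Editor's note: for reference, a minimal sketch of how this threshold is set on the Hadoop side. The property was renamed to mapreduce.job.reduce.slowstart.completedmaps in MRv2; the job name below is hypothetical.]

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.mapreduce.Job

    // Start reducers once 80% of the map tasks have finished, so the
    // shuffle overlaps with the tail of the map stage.
    val conf = new Configuration()
    conf.setFloat("mapreduce.job.reduce.slowstart.completedmaps", 0.80f) // MRv2 name
    // conf.setFloat("mapred.reduce.slowstart.completed.maps", 0.80f)    // legacy MRv1 name
    val job = Job.getInstance(conf, "slow-start example")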
Will Spark SQL support a vectorized query engine someday?
Hi,

Correct me if I'm wrong, but it looks like the current version of Spark SQL uses a *tuple-at-a-time* execution model: each physical operator produces one tuple at a time by recursively calling its child's execute(). There are papers that illustrate the benefits of a vectorized query engine, and Hive's Stinger initiative also embraces this style. So, the question is: will Spark SQL support vectorized query execution someday?

Thanks
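[Editor's note: a minimal sketch contrasting the two execution models. The traits and types below are hypothetical illustrations, not Spark's actual operator API.]

    // Tuple-at-a-time: each operator pulls one row from its child per call,
    // paying a virtual call and boxing overhead for every single row.
    trait RowOperator {
      def next(): Option[Array[Any]]
    }

    // Vectorized: each operator pulls a batch of column vectors per call,
    // amortizing call overhead and enabling tight per-column loops.
    final case class ColumnBatch(numRows: Int, columns: Array[Array[Long]])
    trait VectorOperator {
      def nextBatch(): Option[ColumnBatch]   // e.g. 1024 rows per call
    }

    // Example consumer: summing one column a batch at a time.
    def sumColumn(op: VectorOperator, colIdx: Int): Long = {
      var total = 0L
      var batch = op.nextBatch()
      while (batch.isDefined) {
        val b = batch.get
        val col = b.columns(colIdx)
        var i = 0
        while (i < b.numRows) { total += col(i); i += 1 }
        batch = op.nextBatch()
      }
      total
    }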
When will Spark SQL support building DB indexes natively?
Hi,

The Spark SQL documentation says:

> Some of these (such as indexes) are less important due to Spark SQL's in-memory computational model. Others are slotted for future releases of Spark SQL.
> - Block level bitmap indexes and virtual columns (used to build indexes)

For our use cases, a DB index is quite important. We have about 300 GB of data in our database, and we always use customer ID as the predicate for lookups. Without an index, we have to scan all 300 GB, and a simple lookup takes 1 minute, while MySQL takes only 10 seconds. We tried creating an independent table for each customer ID; the results are pretty good, but the logic becomes very complex.

I'm wondering when Spark SQL will support DB indexes, and until then, is there an alternative way to get index-like lookups?

Thanks
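[Editor's note: a hedged sketch of the per-customer "manual index" workaround using partitioned Parquet, which avoids the table-per-customer bookkeeping. This assumes a Spark version with the DataFrameWriter API (1.4+, so newer than the 1.3-era thread); paths and column names are hypothetical.]

    import org.apache.spark.sql.SQLContext

    val sqlContext = new SQLContext(sc)
    val df = sqlContext.read.parquet("/data/events")   // assumed input path

    // One directory per customer_id; a predicate on the partition column
    // is answered by partition pruning instead of a full 300 GB scan.
    df.write.partitionBy("customer_id").parquet("/data/events_by_customer")

    sqlContext.read.parquet("/data/events_by_customer")
      .filter("customer_id = 42")   // reads only .../customer_id=42/
      .count()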
Why does Executor Deserialize Time take more than 300ms?
In our experimental cluster (1 driver, 5 workers), we tried the simplest example:

    sc.parallelize(Range(0, 100), 2).count

In the event log, we found the executor spends too much time on deserialization, about 300~500 ms, while the execution time is only 1 ms. Our servers have 2.3 GHz CPUs with 24 cores, and we have set the serializer to org.apache.spark.serializer.KryoSerializer.

The question is: is it normal for the executor to take 300~500 ms on deserialization? If not, any clues for performance tuning?
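[Editor's note: a minimal sketch using the standard SparkListener API to print per-task deserialization vs. run time, which makes it easy to check whether the 300~500 ms cost recurs on every task or only on the first task per executor (JVM/classloader warm-up).]

    import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}

    sc.addSparkListener(new SparkListener {
      override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
        val m = taskEnd.taskMetrics
        if (m != null) {
          // executorDeserializeTime and executorRunTime are in milliseconds
          println(s"task ${taskEnd.taskInfo.taskId}: " +
            s"deserialize=${m.executorDeserializeTime} ms, run=${m.executorRunTime} ms")
        }
      }
    })

    sc.parallelize(Range(0, 100), 2).count()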
Re: Why does Executor Deserialize Time take more than 300ms?
Thanks Imran,

The problem is, *every time* I run the same task, the deserialization time is around 300~500 ms. I don't know if this is a normal case.