Re: When will Spark Streaming support the Kafka simple consumer API?

2015-02-05 Thread Xuelin Cao
Hi, Tathagata

 Thanks for the information. I'm building the 1.3 snapshot and will
give it another try.

 There are two reasons why we use the Kafka SimpleConsumer API:
 1. Previously, all of the real-time processing systems in our company
were built on Apache Storm, so the Kafka environment is set up to only
support the SimpleConsumer API. That environment is controlled by another
group of engineers in our company, and for reasons I don't know, they only
support the SimpleConsumer API.

 2. There is a document that advises against using Kafka + Spark Streaming
in a production environment, because Spark Streaming only supports the
high-level consumer API. See
http://www.michael-noll.com/blog/2014/10/01/kafka-spark-streaming-integration-example-tutorial/#known-issues-in-spark-streaming

 I'm not quite sure whether that advice is biased against Spark
Streaming. But since we don't have any successful project to use as a
reference, we need to be careful about it.
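 For reference, here is a minimal sketch of what the new experimental
low-level (direct) Kafka stream from SPARK-4964 looks like in the 1.3
snapshot. The broker addresses, topic name, and batch interval below are
placeholders, not our actual setup:

    import kafka.serializer.StringDecoder
    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka.KafkaUtils

    val conf = new SparkConf().setAppName("direct-kafka-sketch")
    val ssc = new StreamingContext(conf, Seconds(2))

    // The direct stream talks to the brokers with the low-level API,
    // so no ZooKeeper-based high-level consumer group is involved.
    val kafkaParams = Map("metadata.broker.list" -> "broker1:9092,broker2:9092") // placeholder brokers
    val topics = Set("events")                                                   // placeholder topic

    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, topics)

    stream.map(_._2).count().print()
    ssc.start()
    ssc.awaitTermination()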



On Thu, Feb 5, 2015 at 12:28 PM, Tathagata Das [via Apache Spark Developers
List] ml-node+s1001551n10477...@n3.nabble.com wrote:

 1. There is already a third-party low-level Kafka receiver -
 http://spark-packages.org/package/5
 2. There is a new experimental Kafka stream that will be available in the
 Spark 1.3 release. It is based on the low-level API and might suit your
 purpose. JIRA - https://issues.apache.org/jira/browse/SPARK-4964

 Can you elaborate on why you have to use SimpleConsumer in your
 environment?

 TD


 On Wed, Feb 4, 2015 at 7:44 PM, Xuelin Cao [hidden email] wrote:

  Hi,

   In our environment, Kafka can only be used with the simple consumer
  API, the way a Storm spout does.

   Also, I found suggestions that the Kafka connector of Spark should
  not be used in production
  http://markmail.org/message/2lb776ta5sq6lgtw because it is based on the
  high-level consumer API of Kafka.

   So, my question is: when will Spark Streaming support the Kafka
  simple consumer API?
 



When will Spark Streaming support the Kafka simple consumer API?

2015-02-04 Thread Xuelin Cao
Hi,

 In our environment, Kafka can only be used with the simple consumer API,
the way a Storm spout does.

 Also, I found suggestions that the Kafka connector of Spark should not
be used in production http://markmail.org/message/2lb776ta5sq6lgtw
because it is based on the high-level consumer API of Kafka.

 So, my question is: when will Spark Streaming support the Kafka simple
consumer API?
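 For context, the existing connector wraps Kafka's high-level consumer,
roughly like the sketch below; the ZooKeeper quorum, group id, and topic
map are placeholders, not our real configuration:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka.KafkaUtils

    val conf = new SparkConf().setAppName("receiver-kafka-sketch")
    val ssc = new StreamingContext(conf, Seconds(2))

    // The receiver-based connector goes through Kafka's high-level consumer,
    // which is why it needs a ZooKeeper quorum and a consumer group id and
    // why offsets are tracked in ZooKeeper rather than by the application.
    val zkQuorum = "zk1:2181,zk2:2181"        // placeholder ZooKeeper quorum
    val groupId  = "spark-consumer-group"     // placeholder consumer group
    val topicMap = Map("events" -> 1)         // topic -> number of receiver threads

    KafkaUtils.createStream(ssc, zkQuorum, groupId, topicMap)
      .map(_._2)
      .count()
      .print()

    ssc.start()
    ssc.awaitTermination()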


Can Spark provide an option to start the reduce stage early?

2015-02-02 Thread Xuelin Cao
In Hadoop MR, there is an option, *mapred.reduce.slowstart.completed.maps*,
which can be used to start the reducer stage once X% of the mappers have
completed. By doing this, the data shuffling process can run in parallel
with the map process.

In a large multi-tenancy cluster, this option is usually turned off. But in
some cases, turning it on could accelerate some high-priority jobs.

Will Spark provide a similar option?
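
For reference, the MapReduce option above is a per-job configuration knob;
a small sketch against the Hadoop Configuration API (the 0.5 value is only
an example):

    import org.apache.hadoop.conf.Configuration

    // Hadoop MR: launch reducers once 50% of the map tasks have completed,
    // so the shuffle/copy phase overlaps with the remaining maps.
    val conf = new Configuration()
    conf.set("mapred.reduce.slowstart.completed.maps", "0.5")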


Will Spark SQL support a vectorized query engine someday?

2015-01-19 Thread Xuelin Cao
Hi,

 Correct me if I'm wrong, but it looks like the current version of
Spark SQL is a tuple-at-a-time engine: each physical operator produces one
tuple at a time by recursively calling its child's execute.

 There are papers that illustrate the benefits of a vectorized query
engine, and Hive's Stinger initiative also embraces this style.

 So, the question is: will Spark SQL support vectorized query
execution someday?

 Thanks
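
 To make the distinction concrete, here is a toy sketch (not Spark SQL
internals) of tuple-at-a-time vs. vectorized evaluation of a sum:

    // Tuple-at-a-time: the operator pulls one row per call from its child,
    // paying an iterator call and a branch for every single row.
    def sumRowAtATime(rows: Iterator[Long]): Long = {
      var total = 0L
      while (rows.hasNext) total += rows.next()
      total
    }

    // Vectorized: the operator pulls a batch (column vector) per call and
    // runs a tight loop over a primitive array, which amortizes call
    // overhead and is friendlier to the CPU cache and pipeline.
    def sumVectorized(batches: Iterator[Array[Long]]): Long = {
      var total = 0L
      while (batches.hasNext) {
        val batch = batches.next()
        var i = 0
        while (i < batch.length) { total += batch(i); i += 1 }
      }
      total
    }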


When will Spark SQL support building DB indexes natively?

2014-12-17 Thread Xuelin Cao

Hi,
     The Spark SQL documentation says: "Some of these (such as indexes) are
less important due to Spark SQL's in-memory computational model. Others are
slotted for future releases of Spark SQL." One of the items listed there is:
   - Block level bitmap indexes and virtual columns (used to build indexes)

     For our use cases, a DB index is quite important. We have about 300 GB
of data in our database, and we always use customer id as the predicate for
lookups. Without an index we have to scan all 300 GB, and a simple lookup
takes more than 1 minute, while MySQL takes only 10 seconds. We tried
creating an independent table for each customer id; the results are pretty
good, but the logic becomes very complex.
     I'm wondering when Spark SQL will support DB indexes, and until then,
is there an alternative way to get index-like behavior?
Thanks
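
One partial workaround we are aware of is to cache the table, since Spark
SQL's in-memory columnar store keeps per-batch column statistics that can
skip batches for a selective predicate. A sketch against the 1.2-era API;
the path and column name are illustrative, and this of course assumes the
cached data fits in memory:

    import org.apache.spark.sql.SQLContext

    // Assumes an existing SparkContext `sc`.
    val sqlContext = new SQLContext(sc)
    val orders = sqlContext.parquetFile("hdfs:///data/orders")  // illustrative path
    orders.registerTempTable("orders")

    // Caching builds the in-memory columnar representation, whose per-batch
    // statistics let Spark SQL skip batches for a selective predicate.
    sqlContext.cacheTable("orders")
    sqlContext.sql("SELECT * FROM orders WHERE customer_id = 42").collect()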


Why does Executor Deserialize Time take more than 300ms?

2014-11-22 Thread Xuelin Cao

In our experimental cluster (1 driver, 5 workers), we tried the simplest
example: sc.parallelize(Range(0, 100), 2).count

In the event log, we found the executor takes too much time on
deserialization, about 300~500ms, while the execution time is only 1ms.

Our servers have 2.3 GHz CPUs with 24 cores each, and we have set the
serializer to org.apache.spark.serializer.KryoSerializer.

The question is: is it normal for the executor to take 300~500ms on
deserialization? If not, any clues for performance tuning?
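
For completeness, this is how the serializer is set, plus a quick sanity
check (a sketch, not a definitive diagnosis) of whether the 300~500ms is a
one-time class-loading / JIT warm-up cost on the executors rather than
per-task work:

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("deserialize-time-test")
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    val sc = new SparkContext(conf)

    // Run the same tiny job twice in one context and compare the
    // "Executor Deserialize Time" of the two stages in the web UI / event log.
    // If most of the time is one-time class loading or JIT warm-up on the
    // executors, the second run should be much cheaper.
    val rdd = sc.parallelize(Range(0, 100), 2)
    rdd.count()   // cold run
    rdd.count()   // warm run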




Re: Why does Executor Deserialize Time take more than 300ms?

2014-11-22 Thread Xuelin Cao

Thanks Imran,

 The problem is, *every time* I run the same task, the deserialization
time is around 300~500ms. I don't know if this is normal.






Why does Executor Deserialize Time take more than 300ms?

2014-11-21 Thread Xuelin Cao

In our experimental cluster (1 driver, 5 workers), we tried the simplest
example: sc.parallelize(Range(0, 100), 2).count

In the event log, we found the executor takes too much time on
deserialization, about 300~500ms, while the execution time is only 1ms.

Our servers have 2.3 GHz CPUs with 24 cores each, and we have set the
serializer to org.apache.spark.serializer.KryoSerializer.

The question is: is it normal for the executor to take 300~500ms on
deserialization? If not, any clues for performance tuning?





