Scala's KafkaRDD uses a trait to handle this problem, but that approach is
not so easy or straightforward in Python, where we would need a specific
API to handle it. I'm not sure whether there is any simple workaround;
maybe we should think about it carefully.
2015-06-12 13:59 GMT+08:00 Amit Ramesh :
Thanks, Jerry. That's what I suspected based on the code I looked at. Any
pointers on what is needed to build in this support would be great. This is
critical to the project we are currently working on.
Thanks!
On Thu, Jun 11, 2015 at 10:54 PM, Saisai Shao wrote:
OK, I get it. I think the Python-based Kafka direct API does not currently
provide an equivalent of the Scala one; maybe we should figure out how to
add this to the Python API as well.
2015-06-12 13:48 GMT+08:00 Amit Ramesh :
Hi Jerry,
Take a look at this example:
https://spark.apache.org/docs/latest/streaming-kafka-integration.html#tab_scala_2
The offsets are needed because, as RDDs get generated within Spark, the
offsets move further along. With the direct Kafka mode, the current offsets
are no longer persisted in ZooKeeper.
Hi,
What do you mean by getting the offsets from the RDD? From my
understanding, the offsetRange is a parameter you pass in to KafkaRDD, so
why do you want to get back the one you previously set?
Thanks
Jerry
2015-06-12 12:36 GMT+08:00 Amit Ramesh :
>
> Congratulations on the release of 1.
Hi,
Thanks for your interest in PySpark.
The first thing is to have a look at the "how to contribute" guide
https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark and
filter the JIRA's using the label PySpark.
If you have your own improvement in mind, you can file a JIRA, d
Hello,
I am currently taking a course in Apache Spark via EdX (
https://www.edx.org/course/introduction-big-data-apache-spark-uc-berkeleyx-cs100-1x)
and at the same time I am trying to look at the code for PySpark too. I
wanted to ask: if I would ideally like to contribute to PySpark specifically, how c
Through the DataFrame API, users should never see UTF8String.
Expression (and any class in the catalyst package) is considered internal
and so uses the internal representation of various types. Which type we
use here is not stable across releases.
Is there a reason you aren't defining a UDF inst
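The suggestion above is that a UDF keeps user code entirely on the public side of the boundary. A minimal sketch of the idea, assuming a hypothetical `is_https` function (the registration line is illustrative, for a Spark 1.4-era `sqlContext`, and is shown only as a comment):

```python
# The UDF body is an ordinary function over plain str -- Spark performs
# the UTF8String <-> str conversion at the boundary, so user code never
# touches the internal Catalyst representation.
def is_https(url):
    return url.startswith("https://")

# In a real job you would register it, e.g. (hypothetical session):
#   sqlContext.registerFunction("is_https", is_https, BooleanType())
# and then call it from DataFrame expressions or SQL.
print(is_https("https://spark.apache.org"))  # True
```

Working through a UDF this way avoids depending on internals that, as noted above, are not stable across releases.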
I'm hoping for some clarity about when to expect String vs UTF8String when
using the Java DataFrames API.
In upgrading to Spark 1.4, I'm dealing with a lot of errors where what was
once a String is now a UTF8String. The comments in the file and the related
commit message indicate that maybe it sho
Good idea -- I've added this to the wiki:
https://cwiki.apache.org/confluence/display/SPARK/Shuffle+Internals. Happy
to stick it elsewhere if folks think there's a more convenient place.
On Thu, Jun 11, 2015 at 4:46 PM, Gerard Maas wrote:
Kay,
Excellent write-up. This should be preserved for reference somewhere
searchable.
-Gerard.
On Fri, Jun 12, 2015 at 1:19 AM, Kay Ousterhout wrote:
Here’s how the shuffle works. This explains what happens for a single
task; this will happen in parallel for each task running on the machine,
and as Imran said, Spark runs up to “numCores” tasks concurrently on each
machine. There's also an answer to the original question about why CPU use
is lo
Hi All,
I'm happy to announce the availability of Spark 1.4.0! Spark 1.4.0 is
the fifth release on the API-compatible 1.X line. It is Spark's
largest release ever, with contributions from 210 developers and more
than 1,000 commits!
A huge thanks go to all of the individuals and organizations invo
+1, and I know I've been guilty of this in the past. :)
On Wed, Jun 10, 2015 at 10:20 PM, Joseph Bradley wrote:
> +1
>
> On Sat, Jun 6, 2015 at 9:01 AM, Patrick Wendell wrote:
>
>> Hey All,
>>
>> Just a request here - it would be great if people could create JIRA's
>> for any and all merged
That is not exactly correct -- that said, I'm not 100% sure of these
details either, so I'd appreciate you double-checking and/or another dev
confirming my description.
Spark actually has more threads going than the "numCores" you specify.
"numCores" is really used for how many threads are acti
I was able to work around this problem in several cases using the class
'enhancement' or 'extension' pattern to add some functionality to the decision
tree model data structures.
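The 'enhancement'/'extension' pattern mentioned above can be sketched as follows; `LeafStats` and its fields are hypothetical stand-ins for the package-private model structures, not the actual ml classes:

```python
# Stand-in for a model class whose internals we cannot modify directly.
class LeafStats:
    def __init__(self, prediction, count):
        self.prediction = prediction
        self.count = count

# The extension pattern: define the extra functionality outside the
# class and attach it afterwards, leaving the original class untouched.
def weighted_prediction(self):
    return self.prediction * self.count

LeafStats.weighted_prediction = weighted_prediction

leaf = LeafStats(prediction=0.25, count=8)
print(leaf.weighted_prediction())  # 2.0
```

In Scala the same effect would come from an implicit class wrapping the model; either way, no change to the private package is required.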
- Original Message -
> Hi, previously all the models in the ml package were private to the package, so
> if I need t