Re: [VOTE] SPARK 4.0.0-preview1 (RC3)

2024-05-31 Thread Tathagata Das
+1
- Tested RC3 with Delta Lake. All our Scala and Python tests pass.

On Fri, May 31, 2024 at 3:24 PM Xiao Li  wrote:

> +1
>
> Cheng Pan  wrote on Thursday, May 30, 2024, at 09:48:
>
>> +1 (non-binding)
>>
>> - All links are valid
>> - Run some basic queries using YARN client mode with Apache Hadoop v3.3.6,
>> HMS 2.3.9
>> - Pass integration tests with Apache Kyuubi v1.9.1 RC0
>>
>> Thanks,
>> Cheng Pan
>>
>>
>> On May 29, 2024, at 02:48, Wenchen Fan  wrote:
>>
>> Please vote on releasing the following candidate as Apache Spark version
>> 4.0.0-preview1.
>>
>> The vote is open until May 31 PST and passes if a majority +1 PMC votes
>> are cast, with
>> a minimum of 3 +1 votes.
>>
>> [ ] +1 Release this package as Apache Spark 4.0.0-preview1
>> [ ] -1 Do not release this package because ...
>>
>> To learn more about Apache Spark, please see http://spark.apache.org/
>>
>> The tag to be voted on is v4.0.0-preview1-rc3 (commit
>> 7cfe5a6e44e8d7079ae29ad3e2cee7231cd3dc66):
>> https://github.com/apache/spark/tree/v4.0.0-preview1-rc3
>>
>> The release files, including signatures, digests, etc. can be found at:
>> https://dist.apache.org/repos/dist/dev/spark/v4.0.0-preview1-rc3-bin/
>>
>> Signatures used for Spark RCs can be found in this file:
>> https://dist.apache.org/repos/dist/dev/spark/KEYS
>>
>> The staging repository for this release can be found at:
>> https://repository.apache.org/content/repositories/orgapachespark-1456/
>>
>> The documentation corresponding to this release can be found at:
>> https://dist.apache.org/repos/dist/dev/spark/v4.0.0-preview1-rc3-docs/
>>
>> The list of bug fixes going into 4.0.0 can be found at the following URL:
>> https://issues.apache.org/jira/projects/SPARK/versions/12353359
>>
>> FAQ
>>
>> =
>> How can I help test this release?
>> =
>>
>> If you are a Spark user, you can help us test this release by taking
>> an existing Spark workload and running on this release candidate, then
>> reporting any regressions.
>>
>> If you're working in PySpark you can set up a virtual env and install
>> the current RC and see if anything important breaks. In the Java/Scala
>> world, you can add the staging repository to your project's resolvers and test
>> with the RC (make sure to clean up the artifact cache before/after so
>> you don't end up building with an out-of-date RC going forward).
>>
>>
>>
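
For the Java/Scala testing path described above, a minimal sbt sketch of pointing a downstream project at the staging repository. The staging URL and the 4.0.0-preview1 version string are taken from this RC announcement; the spark-sql dependency is just an illustrative choice.

// build.sbt (sketch)
resolvers += "spark-4.0.0-preview1-rc3-staging" at
  "https://repository.apache.org/content/repositories/orgapachespark-1456/"

libraryDependencies += "org.apache.spark" %% "spark-sql" % "4.0.0-preview1" % "provided"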


Re: [DISCUSS] Spark 4.0.0 release

2024-05-01 Thread Tathagata Das
Next week sounds great! Thank you Wenchen!

On Wed, May 1, 2024 at 11:16 AM Wenchen Fan  wrote:

> Yea I think a preview release won't hurt (without a branch cut). We don't
> need to wait for all the ongoing projects to be ready. How about we do a
> 4.0 preview release based on the current master branch next Monday?
>
> On Wed, May 1, 2024 at 11:06 PM Tathagata Das 
> wrote:
>
>> Hey all,
>>
>> Reviving this thread, but Spark master has already accumulated a huge
>> amount of changes.  As a downstream project maintainer, I want to really
>> start testing the new features and other breaking changes, and it's hard to
>> do that without a Preview release. So the sooner we make a Preview release,
>> the faster we can start getting feedback for fixing things for a great
>> Spark 4.0 final release.
>>
>> So I urge the community to produce a Spark 4.0 Preview soon even if
>> certain features targeting the Delta 4.0 release are still incomplete.
>>
>> Thanks!
>>
>>
>> On Wed, Apr 17, 2024 at 8:35 AM Wenchen Fan  wrote:
>>
>>> Thank you all for the replies!
>>>
>>> To @Nicholas Chammas  : Thanks for cleaning
>>> up the error terminology and documentation! I've merged the first PR and
>>> let's finish others before the 4.0 release.
>>> To @Dongjoon Hyun  : Thanks for driving the
>>> ANSI on by default effort! Now the vote has passed, let's flip the config
>>> and finish the DataFrame error context feature before 4.0.
>>> To @Jungtaek Lim  : Ack. We can treat the
>>> Streaming state store data source as completed for 4.0 then.
>>> To @Cheng Pan  : Yea we definitely should have a
>>> preview release. Let's collect more feedback on the ongoing projects and
>>> then we can propose a date for the preview release.
>>>
>>> On Wed, Apr 17, 2024 at 1:22 PM Cheng Pan  wrote:
>>>
>>>> Will we have a preview release for 4.0.0 like we did for 2.0.0 and 3.0.0?
>>>>
>>>> Thanks,
>>>> Cheng Pan
>>>>
>>>>
>>>> > On Apr 15, 2024, at 09:58, Jungtaek Lim 
>>>> wrote:
>>>> >
>>>> > W.r.t. state data source - reader (SPARK-45511), there are several
>>>> follow-up tickets, but we don't plan to address them soon. The current
>>>> implementation is the final shape for Spark 4.0.0, unless there are demands
>>>> on the follow-up tickets.
>>>> >
>>>> > We may want to check the plan for transformWithState - my
>>>> understanding is that we want to release the feature to 4.0.0, but there
>>>> are several remaining works to be done. While the tentative timeline for
>>>> releasing is June 2024, what would be the tentative timeline for the RC 
>>>> cut?
>>>> > (cc. Anish to add more context on the plan for transformWithState)
>>>> >
>>>> > On Sat, Apr 13, 2024 at 3:15 AM Wenchen Fan 
>>>> wrote:
>>>> > Hi all,
>>>> >
>>>> > It's close to the previously proposed 4.0.0 release date (June 2024),
>>>> and I think it's time to prepare for it and discuss the ongoing projects:
>>>> > • ANSI by default
>>>> > • Spark Connect GA
>>>> > • Structured Logging
>>>> > • Streaming state store data source
>>>> > • new data type VARIANT
>>>> > • STRING collation support
>>>> > • Spark k8s operator versioning
>>>> > Please help to add more items to this list that are missed here. I
>>>> would like to volunteer as the release manager for Apache Spark 4.0.0 if
>>>> there is no objection. Thank you all for the great work that fills Spark
>>>> 4.0!
>>>> >
>>>> > Wenchen Fan
>>>>
>>>>


Re: [DISCUSS] Deprecate DStream in 3.4

2023-01-12 Thread Tathagata Das
+1

On Thu, Jan 12, 2023 at 7:46 PM Hyukjin Kwon  wrote:

> +1
>
> On Fri, 13 Jan 2023 at 08:51, Jungtaek Lim 
> wrote:
>
>> bump for more visibility.
>>
>> On Wed, Jan 11, 2023 at 12:20 PM Jungtaek Lim <
>> kabhwan.opensou...@gmail.com> wrote:
>>
>>> Hi dev,
>>>
>>> I'd like to propose the deprecation of DStream in Spark 3.4, in favor of
>>> promoting Structured Streaming.
>>> (Sorry for the late proposal; if we don't make the change in 3.4, we
>>> will have to wait another 6 months.)
>>>
>>> We have been focusing on Structured Streaming for years (across multiple
>>> major and minor versions), and during the time we haven't made any
>>> improvements for DStream. Furthermore, recently we updated the DStream doc
>>> to explicitly say DStream is a legacy project.
>>>
>>> https://spark.apache.org/docs/latest/streaming-programming-guide.html#note
>>>
>>> The baseline for deprecation is that we don't see a particular use case
>>> which only DStream solves. This is a different story with GraphX and MLlib,
>>> as we don't have replacements for those.
>>>
>>> The proposal does not mean we will remove the API soon, as the Spark
>>> project routinely deprecates public APIs well before removing them. I don't intend to
>>> propose a target version for removal. The goal is to guide users to
>>> refrain from building new workloads with DStream. We might want to go
>>> further in the future, but that would require a new discussion thread at that
>>> time.
>>>
>>> What do you think?
>>>
>>> Thanks,
>>> Jungtaek Lim (HeartSaVioR)
>>>
>>


Re: [vote] Apache Spark 3.0 RC3

2020-06-09 Thread Tathagata Das
+1 (binding)

On Tue, Jun 9, 2020 at 5:27 PM Burak Yavuz  wrote:

> +1
>
> Best,
> Burak
>
> On Tue, Jun 9, 2020 at 1:48 PM Shixiong(Ryan) Zhu 
> wrote:
>
>> +1 (binding)
>>
>> Best Regards,
>> Ryan
>>
>>
>> On Tue, Jun 9, 2020 at 4:24 AM Wenchen Fan  wrote:
>>
>>> +1 (binding)
>>>
>>> On Tue, Jun 9, 2020 at 6:15 PM Dr. Kent Yao  wrote:
>>>
 +1 (non-binding)



 --
 Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/

 -
 To unsubscribe e-mail: dev-unsubscr...@spark.apache.org




Re: dropDuplicates and watermark in structured streaming

2020-02-28 Thread Tathagata Das
Why do you have two watermarks? Once you apply the watermark to a column
(i.e., "time"), it can be used in all later operations as long as the
column is preserved. So the above code should be equivalent to

df.withWatermark("time", "window size")
  .dropDuplicates("id")
  .groupBy(window("time", "window size", "window size"))
  .agg(count("id"))

The right way to think about the watermark threshold is "how late and out
of order my data can be". The answer may be completely different from the window
size. You may want to calculate 10-minute windows but your data may
come in 5 hours late. So you should define the watermark as 5 hours, not 10
minutes.

Btw, on a side note, just so you know, you can use "approx_count_distinct"
if you are okay with some approximation.
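
For reference, a minimal sketch of that approximate alternative, reusing the column names and durations from the example above (df is assumed to be a streaming DataFrame with "time" and "id" columns):

import org.apache.spark.sql.functions._

df.withWatermark("time", "5 hours")                 // how late the data may be
  .groupBy(window(col("time"), "10 minutes"))       // the window being computed
  .agg(approx_count_distinct("id"))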

On Thu, Feb 27, 2020 at 9:11 PM lec ssmi  wrote:

>   Such as:
> df.withWatermark("time", "window size")
>   .dropDuplicates("id")
>   .withWatermark("time", "real watermark")
>   .groupBy(window("time", "window size", "window size"))
>   .agg(count("id"))
>    Can it make count(distinct) succeed?
>
> Tathagata Das  wrote on Friday, Feb 28, 2020 at 10:25 AM:
>
>> 1. Yes. All times are in event time, not processing time. So you may get 10AM
>> event-time data at 11AM processing time, but it will still be compared
>> against all data within 9-10AM event times.
>>
>> 2. Show us your code.
>>
>> On Thu, Feb 27, 2020 at 2:30 AM lec ssmi  wrote:
>>
>>> Hi:
>>> I'm new to structured streaming. Because the built-in API cannot
>>> perform the Count Distinct operation of Window, I want to use
>>> dropDuplicates first, and then perform the window count.
>>> But in the process of using it, there are two problems:
>>>    1. Because it is streaming computation, in the process of
>>> deduplication the state needs to be cleared in time, which requires the
>>> cooperation of the watermark. Assuming my event-time field is consistently
>>> increasing, and I set the watermark to 1 hour, does it
>>> mean that the data at 10 o'clock will only be compared against the data from 9
>>> o'clock to 10 o'clock, and the data before 9 o'clock will be cleared?
>>>    2. Because it is window deduplication, I set the watermark
>>> before deduplication to the window size. But after deduplication, I need to
>>> call withWatermark() again to set the watermark to the real
>>> watermark. Will setting the watermark again take effect?
>>>
>>>  Thanks a lot !
>>>
>>


Re: In structured streamin, multiple streaming aggregations are not yet supported.

2017-11-28 Thread Tathagata Das
Hello,

What do you mean by multiple streaming aggregations? Something like this is
already supported.

*df.groupBy("key").agg(min("colA"), max("colB"), avg("colC"))*

But the following is not supported.

*df.groupBy("key").agg(min("colA").as("min")).groupBy("min").count()*

In other words, multiple aggregations ONE AFTER ANOTHER are NOT supported
yet, and we currently don't have any plans to support them by 2.3.

If this is what you want, can you explain the use case for why you want
multiple aggregations?

On Tue, Nov 28, 2017 at 9:46 PM, Georg Heiler 
wrote:

> 2.3 around January
> 0.0 <407216...@qq.com> wrote on Wed, Nov 29, 2017 at 05:08:
>
>> Hi, all:
>> Multiple streaming aggregations are not yet supported. When will it
>> be supported? Is it in the plan?
>>
>> Thanks.
>>
>


Re: Time window on Processing Time

2017-08-29 Thread Tathagata Das
Yes, it can be! There is a sql function called current_timestamp() which is
self-explanatory. So I believe you should be able to do something like

import org.apache.spark.sql.functions._

ds.withColumn("processingTime", current_timestamp())
  .groupBy(window("processingTime", "1 minute"))
  .count()


On Mon, Aug 28, 2017 at 5:46 AM, madhu phatak  wrote:

> Hi,
> As I am playing with structured streaming, I observed that the window function
> always requires a time column in the input data. So that means it's event time.
>
> Is it possible to get the old Spark Streaming style window function based on
> processing time? I don't see any documentation on the same.
>
> --
> Regards,
> Madhukara Phatak
> http://datamantra.io/
>


Re: Questions about Stateful Operations in SS

2017-07-26 Thread Tathagata Das
Hello Lubo,

The idea of timeouts is to make a best-effort, last-resort attempt to
process a key when it has not received data for a while. With a processing-time
timeout of 1 minute, the system guarantees that it will not time out
unless at least 1 minute has passed. Defining precisely when the
timeout is triggered is really hard for many reasons (distributed systems,
lack of precise clock synchronization, the need for deterministic re-executions for
fault tolerance, etc.). We made a design decision to process timed-out data
after processing the data in a batch, but that choice should not affect
your business logic if your logic is constructed the right way. So your
business logic should set loosely defined timeout durations, and not depend
on the exact timing of when the timeouts are hit.

TD
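
For readers following along, a rough, self-contained sketch of the pattern described above using flatMapGroupsWithState (as in the StructuredSessionization example referenced below). The key/value types, the one-minute duration, and the `events` Dataset are illustrative assumptions, not part of the original thread.

import org.apache.spark.sql.streaming.{GroupState, GroupStateTimeout, OutputMode}

// Sum per key; on processing-time timeout, emit the final sum and clear the state.
def statefunc(key: String, values: Iterator[Int],
              state: GroupState[Int]): Iterator[(String, Int)] = {
  if (state.hasTimedOut) {                 // fires only after new data triggers a batch
    val last = state.get
    state.remove()
    Iterator((key, last))
  } else {
    state.update(state.getOption.getOrElse(0) + values.sum)
    state.setTimeoutDuration("1 minute")   // a loose lower bound, not an exact alarm
    Iterator.empty
  }
}

// Assumes spark.implicits._ is in scope and `events` is a streaming Dataset[(String, Int)].
val sessions = events
  .groupByKey(_._1)
  .flatMapGroupsWithState(OutputMode.Update, GroupStateTimeout.ProcessingTimeTimeout)(
    (key, rows, state) => statefunc(key, rows.map(_._2), state))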

On Wed, Jul 26, 2017 at 1:54 AM, Zhang, Lubo  wrote:

> Hi all
>
>
>
> I have a question about the stateful operations
> [map/flatMap]GroupsWithState in Structured Streaming. The issue is as follows:
>
>
>
> Take the StructuredSessionization example: first I input two words,
> apache and spark, in batch 0, then input another word, hadoop, in batch 1
> until a timeout happens (here the timeout type is ProcessingTimeTimeout). So I
> can see both words apache and spark are output, since each group's state has
> timed out. But if I input the same word apache in batch 1, which already
> existed in batch 0, the result shows only spark is expired. I dug into
> this and found the code at
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FlatMapGroupsWithStateExec.scala#L131
> that deals with the group state update: it first processes new group data and sets the
> hasTimedOut flag to false in the update function, which results in a key that has already
> timed out being simply updated. I know the timeout function call will not
> occur until there is new data to trigger it, but I am wondering why we don't
> first process the timed-out keys, so that we can retrieve the expired data from
> batch 0 in the user-given function:
>
>
>
> def statefunc(key: K, values: Iterator[V], state: GroupState[S]): U = {
>   if (state.hasTimedOut) { // If called when timing out, remove the state
>     // TODO
>     state.remove()
>   } else if (state.exists) {
>     // ...
>   }
> }
>
>
>
>
>
> Thanks
>
> Lubo
>
>
>


Re: Any plans for making StateStore pluggable?

2017-05-09 Thread Tathagata Das
Thank you for creating the JIRA. I am working towards making it
configurable very soon.
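
(For readers landing here later: the configuration that eventually shipped for this is the SQL conf spark.sql.streaming.stateStore.providerClass. A minimal sketch, with a made-up custom provider class name:)

// Set before starting the streaming query; the provider class below is hypothetical.
spark.conf.set(
  "spark.sql.streaming.stateStore.providerClass",
  "com.example.MyStateStoreProvider")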

On Tue, May 9, 2017 at 4:12 PM, Yogesh Mahajan 
wrote:

> Hi Team,
>
> Any plans to make the StateStoreProvider/StateStore in structured
> streaming pluggable ?
> Currently StateStore#loadedProviders has only one HDFSBackedStateStoreProvider
> and it's not configurable.
> If we make this configurable, users can bring in their own implementation
> of StateStore.
>
> Please refer to this ticket - https://issues.apache.org/jira/browse/SPARK-20376
>
> Thanks,
> http://www.snappydata.io/blog 
>


Re: Structured Streaming Schema Issue

2017-02-01 Thread Tathagata Das
What is the query you are applying writeStream on? Essentially, can you print
the whole query?

Also, you can do StreamingQuery.explain() to see in full detail how the
logical plan changes to the physical plan for a batch of data. That might
help. Try doing that with some other sink to make sure the source works
correctly, and then try using your sink.
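
A minimal sketch of that debugging step (df is assumed to be the streaming DataFrame in question; the console sink is just a stand-in for "some other sink"):

val query = df.writeStream
  .format("console")
  .start()

// Once at least one batch has run, this prints the plan details
// (extended = true includes the logical plans as well).
query.explain(true)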

If you want further debugging, then you will have to dig into the
StreamExecution class in Spark, and debug stuff there.
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamExecution.scala#L523

On Wed, Feb 1, 2017 at 3:49 PM, Sam Elamin <hussam.ela...@gmail.com> wrote:

> Yeah sorry, I'm still working on it; it's on a branch you can find here
> <https://github.com/samelamin/spark-bigquery/blob/Stream_reading/src/main/scala/com/samelamin/spark/bigquery/streaming/BigQuerySource.scala>;
> ignore the logging messages. I was trying to work out how the APIs work, and
> unfortunately, because I have to shade the dependency, I can't debug it in an
> IDE (that I know of!)
>
> So I can see the correct schema here
> <https://github.com/samelamin/spark-bigquery/blob/Stream_reading/src/main/scala/com/samelamin/spark/bigquery/streaming/BigQuerySource.scala#L64>
> and also when the df is returned (after .load()).
>
> But when that same df has writeStream applied to it, the addBatch
> DataFrame has a new schema. It's similar to the old schema but some ints
> have been turned into strings.
>
>
>
> On Wed, Feb 1, 2017 at 11:40 PM, Tathagata Das <
> tathagata.das1...@gmail.com> wrote:
>
>> I am assuming that you have written your own BigQuerySource (I don't see
>> that code in the link you posted). In that source, you must have
>> implemented getBatch, which uses offsets to return the DataFrame holding the
>> data of a batch. Can you double-check whether this DataFrame returned by
>> getBatch has the expected schema?
>>
>> On Wed, Feb 1, 2017 at 3:33 PM, Sam Elamin <hussam.ela...@gmail.com>
>> wrote:
>>
>>> Thanks for the quick response TD!
>>>
>>> I've been trying to identify where exactly this transformation happens
>>>
>>> The readStream returns a dataframe with the correct schema
>>>
>>> The minute I call writeStream, by the time I get to the addBatch method,
>>> the DataFrame there has an incorrect schema.
>>>
>>> So I'm skeptical about the issue being prior to the readStream, since the
>>> output DataFrame has the correct schema.
>>>
>>>
>>> Am I missing something completely obvious?
>>>
>>> Regards
>>> Sam
>>>
>>> On Wed, Feb 1, 2017 at 11:29 PM, Tathagata Das <
>>> tathagata.das1...@gmail.com> wrote:
>>>
>>>> You should make sure that the schema of the streaming Dataset returned by
>>>> `readStream` matches the schema of the DataFrame returned by the source's
>>>> getBatch.
>>>>
>>>> On Wed, Feb 1, 2017 at 3:25 PM, Sam Elamin <hussam.ela...@gmail.com>
>>>> wrote:
>>>>
>>>>> Hi All
>>>>>
>>>>> I am writing a bigquery connector here
>>>>> <http://github.com/samelamin/spark-bigquery> and I am getting a
>>>>> strange error with schemas being overwritten when a dataframe is passed
>>>>> over to the Sink
>>>>>
>>>>>
>>>>> for example the source returns this StructType
>>>>> WARN streaming.BigQuerySource: StructType(StructField(custome
>>>>> rid,LongType,true),
>>>>>
>>>>> and the sink is receiving this StructType
>>>>> WARN streaming.BigQuerySink: StructType(StructField(custome
>>>>> rid,StringType,true)
>>>>>
>>>>>
>>>>> Any idea why this might be happening?
>>>>> I don't have schema inference on
>>>>>
>>>>> spark.conf.set("spark.sql.streaming.schemaInference", "false")
>>>>>
>>>>> I know it's off by default but I set it just to be sure
>>>>>
>>>>> So I'm completely lost as to what could be causing this
>>>>>
>>>>> Regards
>>>>> Sam
>>>>>
>>>>
>>>>
>>>
>>
>


Re: [VOTE] Release Apache Spark 2.0.2 (RC3)

2016-11-10 Thread Tathagata Das
+1 binding

On Thu, Nov 10, 2016 at 6:05 PM, Kousuke Saruta 
wrote:

> +1 (non-binding)
>
>
>> On Nov 8, 2016 at 15:09, Reynold Xin wrote:
>
>> Please vote on releasing the following candidate as Apache Spark version
>> 2.0.2. The vote is open until Thu, Nov 10, 2016 at 22:00 PDT and passes if
>> a majority of at least 3+1 PMC votes are cast.
>>
>> [ ] +1 Release this package as Apache Spark 2.0.2
>> [ ] -1 Do not release this package because ...
>>
>>
>> The tag to be voted on is v2.0.2-rc3 (584354eaac02531c9584188b143367ba694b0c34)
>>
>> This release candidate resolves 84 issues: https://s.apache.org/spark-2.0.2-jira
>>
>> The release files, including signatures, digests, etc. can be found at:
>> http://people.apache.org/~pwendell/spark-releases/spark-2.0.2-rc3-bin/ <
>> http://people.apache.org/%7Epwendell/spark-releases/spark-2.0.2-rc3-bin/>
>>
>> Release artifacts are signed with the following key:
>> https://people.apache.org/keys/committer/pwendell.asc
>>
>> The staging repository for this release can be found at:
>> https://repository.apache.org/content/repositories/orgapachespark-1214/
>>
>> The documentation corresponding to this release can be found at:
>> http://people.apache.org/~pwendell/spark-releases/spark-2.0.2-rc3-docs/ <
>> http://people.apache.org/%7Epwendell/spark-releases/spark-2.0.2-rc3-docs/
>> >
>>
>>
>> Q: How can I help test this release?
>> A: If you are a Spark user, you can help us test this release by taking
>> an existing Spark workload and running on this release candidate, then
>> reporting any regressions from 2.0.1.
>>
>> Q: What justifies a -1 vote for this release?
>> A: This is a maintenance release in the 2.0.x series. Bugs already
>> present in 2.0.1, missing features, or bugs related to new features will
>> not necessarily block this release.
>>
>> Q: What fix version should I use for patches merging into branch-2.0 from
>> now on?
>> A: Please mark the fix version as 2.0.3, rather than 2.0.2. If a new RC
>> (i.e. RC4) is cut, I will change the fix version of those patches to 2.0.2.
>>
>
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: REST api for monitoring Spark Streaming

2016-11-07 Thread Tathagata Das
This may be a good addition. I suggest you read our guidelines on
contributing code to Spark.

https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark#ContributingtoSpark-PreparingtoContributeCodeChanges

It's a long document, but it should have everything you need to figure out how
to contribute your changes. I hope to see your changes in a GitHub PR soon!

TD

On Mon, Nov 7, 2016 at 5:30 PM, Chan Chor Pang 
wrote:

> hi everyone
>
> It seems that there are not many people interested in creating an API for
> Streaming.
> Nevertheless, I still really want the API for monitoring,
> so I tried to see if I can implement it on my own.
>
> After some testing,
> I believe I can achieve the goal by:
> 1. implementing a package (org.apache.spark.streaming.status.api.v1) that
> serves the same purpose as org.apache.spark.status.api.v1
> 2. registering the API path through StreamingTab
> and 3. retrieving the streaming information through
> StreamingJobProgressListener
>
> What I am most concerned about now is whether my implementation will be able to be
> merged upstream.
>
> I'm new to open source projects, so could anyone please show me some light?
> How should/could I proceed to make my implementation mergeable
> upstream?
>
>
> here is my test code base on v1.6.0
> ###
> diff --git a/streaming/src/main/scala/org/apache/spark/streaming/status
> /api/v1/JacksonMessageWriter.scala b/streaming/src/main/scala/org
> /apache/spark/streaming/status/api/v1/JacksonMessageWriter.scala
> new file mode 100644
> index 000..690e2d8
> --- /dev/null
> +++ b/streaming/src/main/scala/org/apache/spark/streaming/status
> /api/v1/JacksonMessageWriter.scala
> @@ -0,0 +1,68 @@
> +package org.apache.spark.streaming.status.api.v1
> +
> +import java.io.OutputStream
> +import java.lang.annotation.Annotation
> +import java.lang.reflect.Type
> +import java.text.SimpleDateFormat
> +import java.util.{Calendar, SimpleTimeZone}
> +import javax.ws.rs.Produces
> +import javax.ws.rs.core.{MediaType, MultivaluedMap}
> +import javax.ws.rs.ext.{MessageBodyWriter, Provider}
> +
> +import com.fasterxml.jackson.annotation.JsonInclude
> +import com.fasterxml.jackson.databind.{ObjectMapper,
> SerializationFeature}
> +
> +@Provider
> +@Produces(Array(MediaType.APPLICATION_JSON))
> +private[v1] class JacksonMessageWriter extends MessageBodyWriter[Object]{
> +
> +  val mapper = new ObjectMapper() {
> +override def writeValueAsString(t: Any): String = {
> +  super.writeValueAsString(t)
> +}
> +  }
> + mapper.registerModule(com.fasterxml.jackson.module.scala.
> DefaultScalaModule)
> +  mapper.enable(SerializationFeature.INDENT_OUTPUT)
> +  mapper.setSerializationInclusion(JsonInclude.Include.NON_NULL)
> +  mapper.setDateFormat(JacksonMessageWriter.makeISODateFormat)
> +
> +  override def isWriteable(
> +  aClass: Class[_],
> +  `type`: Type,
> +  annotations: Array[Annotation],
> +  mediaType: MediaType): Boolean = {
> +  true
> +  }
> +
> +  override def writeTo(
> +  t: Object,
> +  aClass: Class[_],
> +  `type`: Type,
> +  annotations: Array[Annotation],
> +  mediaType: MediaType,
> +  multivaluedMap: MultivaluedMap[String, AnyRef],
> +  outputStream: OutputStream): Unit = {
> +t match {
> +  //case ErrorWrapper(err) => outputStream.write(err.getByte
> s("utf-8"))
> +  case _ => mapper.writeValue(outputStream, t)
> +}
> +  }
> +
> +  override def getSize(
> +  t: Object,
> +  aClass: Class[_],
> +  `type`: Type,
> +  annotations: Array[Annotation],
> +  mediaType: MediaType): Long = {
> +-1L
> +  }
> +}
> +
> +private[spark] object JacksonMessageWriter {
> +  def makeISODateFormat: SimpleDateFormat = {
> +val iso8601 = new SimpleDateFormat("-MM-dd'T'HH:mm:ss.SSS'GMT'")
> +val cal = Calendar.getInstance(new SimpleTimeZone(0, "GMT"))
> +iso8601.setCalendar(cal)
> +iso8601
> +  }
> +}
> diff --git a/streaming/src/main/scala/org/apache/spark/streaming/status
> /api/v1/StreamingApiRootResource.scala b/streaming/src/main/scala/org
> /apache/spark/streaming/status/api/v1/StreamingApiRootResource.scala
> new file mode 100644
> index 000..f4e43dd
> --- /dev/null
> +++ b/streaming/src/main/scala/org/apache/spark/streaming/status
> /api/v1/StreamingApiRootResource.scala
> @@ -0,0 +1,74 @@
> +package org.apache.spark.streaming.status.api.v1
> +
> +import org.apache.spark.status.api.v1.UIRoot
> +import org.eclipse.jetty.server.handler.ContextHandler
> +import org.eclipse.jetty.servlet.ServletContextHandler
> +import org.eclipse.jetty.servlet.ServletHolder
> +
> +import com.sun.jersey.spi.container.servlet.ServletContainer
> +
> +import javax.servlet.ServletContext
> +import javax.ws.rs.Path
> +import javax.ws.rs.Produces
> +import javax.ws.rs.core.Context
> +import org.apache.spark.streaming.ui.StreamingJobProgressListener
> +
> +
> +@Path("/v1")
> +private[v1] class StreamingApiRootResource 

Re: Watermarking in Structured Streaming to drop late data

2016-10-27 Thread Tathagata Das
Assaf, thanks for the feedback!

On Thu, Oct 27, 2016 at 3:28 AM, assaf.mendelson <assaf.mendel...@rsa.com>
wrote:

> Thanks.
>
> This article is excellent. It completely explains everything.
>
> I would add it as a reference to any and all explanations of structured
> streaming (and in the case of watermarking, I simply didn’t understand the
> definition before reading this).
>
>
>
> Thanks,
>
> Assaf.
>
>
>
>
>
> *From:* kostas papageorgopoylos [via Apache Spark Developers List]
> *Sent:* Thursday, October 27, 2016 10:17 AM
> *To:* Mendelson, Assaf
> *Subject:* Re: Watermarking in Structured Streaming to drop late data
>
>
>
> Hi all
>
> I would highly recommend to all users-devs interested in the design
> suggestions / discussions for Structured Streaming Spark API watermarking
>
> to take a look on the following links along with the design document. It
> would help to understand the notions of watermark , out of order data and
> possible use cases.
>
>
>
> https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-101
>
> https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-102
>
>
>
> Kind Regards
>
>
>
>
>
> 2016-10-27 9:46 GMT+03:00 assaf.mendelson <[hidden email]
> <http:///user/SendEmail.jtp?type=node=19592=0>>:
>
> Hi,
>
> Should comments come here or in the JIRA?
>
> Anyway, I am a little confused about the need to expose this as an API to begin
> with.
>
> Let’s consider for a second the most basic behavior: We have some input
> stream and we want to aggregate a sum over a time window.
>
> This means that the window we should be looking at would be the maximum
> time across our data and back by the window interval. Everything older can
> be dropped.
>
> When new data arrives, the maximum time cannot move back so we generally
> drop everything too old.
>
> This basically means we save only the latest time window.
>
> This simpler model would only break if we have a secondary aggregation
> which needs the results of multiple windows.
>
> Is this the use case we are trying to solve?
>
> If so, wouldn’t just calculating the bigger time window across the entire
> aggregation solve this?
>
> Am I missing something here?
>
>
>
> *From:* Michael Armbrust [via Apache Spark Developers List]
> *Sent:* Thursday, October 27, 2016 3:04 AM
> *To:* Mendelson, Assaf
> *Subject:* Re: Watermarking in Structured Streaming to drop late data
>
>
>
> And the JIRA: https://issues.apache.org/jira/browse/SPARK-18124
>
>
>
> On Wed, Oct 26, 2016 at 4:56 PM, Tathagata Das <[hidden email]
> <http://user/SendEmail.jtp?type=node=19590=0>> wrote:
>
> Hey all,
>
>
>
> We are planning to implement watermarking in Structured Streaming that would
> allow us to handle late, out-of-order data better. Specifically, when we are
> aggregating over windows on event time, we currently can end up keeping an
> unbounded amount of data as state. We want to define watermarks on the event
> time in order to mark and drop data that are "too late" and accordingly age
> out old aggregates that will not be updated any more.
>
>
>
> To enable the user to specify details like lateness threshold, we are
> considering adding a new method to Dataset. We would like to get more
> feedback on this API. Here is the design doc
>
>
>
> https://docs.google.com/document/d/1z-Pazs5v4rA31azvmYhu4I5xwqaNQl6ZLIS03xhkfCQ/
>
>
>
> Please comment on the design and proposed APIs.
>
>
>
> Thank you very much!
>
>
>
> TD
>
>
>
>

Re: Watermarking in Structured Streaming to drop late data

2016-10-27 Thread Tathagata Das
Hello Assaf,

I think you are missing the fact that we want to compute over the event time of
the data (i.e., the data generation time), which may arrive at Spark
out of order and late. And we want to aggregate over late data. The
watermark is an estimate made by the system that no data older than the
watermark time will arrive from now on.

If this basic context is clear, then please read the design doc for further
details. Please comment in the doc for more specific design discussions.

On Thu, Oct 27, 2016 at 1:52 AM, Ofir Manor <ofir.ma...@equalum.io> wrote:

> Assaf,
> I think you are using the term "window" differently than Structured
> Streaming,... Also, you didn't consider groupBy. Here is an example:
> I want to maintain, for every minute over the last six hours, a
> computation (trend or average or stddev) on a five-minute window (from t-4
> to t). So,
> 1. My window size is 5 minutes
> 2. The window slides every 1 minute (so, there is a new 5-minute window
> for every minute)
> 3. Old windows should be purged if they are 6 hours old (based on event
> time vs. clock?)
> Option 3 is currently missing - the streaming job keeps all windows
> forever, as the app may want to access very old windows, unless it would
> explicitly say otherwise.
>
>
> Ofir Manor
>
> Co-Founder & CTO | Equalum
>
> Mobile: +972-54-7801286 | Email: ofir.ma...@equalum.io
>
> On Thu, Oct 27, 2016 at 9:46 AM, assaf.mendelson <assaf.mendel...@rsa.com>
> wrote:
>
>> Hi,
>>
>> Should comments come here or in the JIRA?
>>
>> Anyway, I am a little confused about the need to expose this as an API to begin
>> with.
>>
>> Let’s consider for a second the most basic behavior: We have some input
>> stream and we want to aggregate a sum over a time window.
>>
>> This means that the window we should be looking at would be the maximum
>> time across our data and back by the window interval. Everything older can
>> be dropped.
>>
>> When new data arrives, the maximum time cannot move back so we generally
>> drop everything too old.
>>
>> This basically means we save only the latest time window.
>>
>> This simpler model would only break if we have a secondary aggregation
>> which needs the results of multiple windows.
>>
>> Is this the use case we are trying to solve?
>>
>> If so, wouldn’t just calculating the bigger time window across the entire
>> aggregation solve this?
>>
>> Am I missing something here?
>>
>>
>>
>> *From:* Michael Armbrust [via Apache Spark Developers List]
>> *Sent:* Thursday, October 27, 2016 3:04 AM
>> *To:* Mendelson, Assaf
>> *Subject:* Re: Watermarking in Structured Streaming to drop late data
>>
>>
>>
>> And the JIRA: https://issues.apache.org/jira/browse/SPARK-18124
>>
>>
>>
>> On Wed, Oct 26, 2016 at 4:56 PM, Tathagata Das <[hidden email]
>> <http:///user/SendEmail.jtp?type=node=19590=0>> wrote:
>>
>> Hey all,
>>
>>
>>
>> We are planning to implement watermarking in Structured Streaming that would
>> allow us to handle late, out-of-order data better. Specifically, when we are
>> aggregating over windows on event time, we currently can end up keeping an
>> unbounded amount of data as state. We want to define watermarks on the event
>> time in order to mark and drop data that are "too late" and accordingly age
>> out old aggregates that will not be updated any more.
>>
>>
>>
>> To enable the user to specify details like lateness threshold, we are
>> considering adding a new method to Dataset. We would like to get more
>> feedback on this API. Here is the design doc
>>
>>
>>
>> https://docs.google.com/document/d/1z-Pazs5v4rA31azvmYhu4I5xwqaNQl6ZLIS03xhkfCQ/
>>
>>
>>
>> Please comment on the design and proposed APIs.
>>
>>
>>
>> Thank you very much!
>>
>>
>>
>> TD
>>
>>
>>
>>
>

Watermarking in Structured Streaming to drop late data

2016-10-26 Thread Tathagata Das
Hey all,

We are planning to implement watermarking in Structured Streaming that would
allow us to handle late, out-of-order data better. Specifically, when we are
aggregating over windows on event time, we currently can end up keeping an
unbounded amount of data as state. We want to define watermarks on the event
time in order to mark and drop data that are "too late" and accordingly age
out old aggregates that will not be updated any more.

To enable the user to specify details like lateness threshold, we are
considering adding a new method to Dataset. We would like to get more
feedback on this API. Here is the design doc

https://docs.google.com/document/d/1z-Pazs5v4rA31azvmYhu4I5xwqaNQl6ZLIS03xhkfCQ/

Please comment on the design and proposed APIs.

Thank you very much!

TD
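
For later readers: the API that eventually came out of this proposal looks roughly like the sketch below (the column names, thresholds, and the `events` DataFrame are illustrative):

import org.apache.spark.sql.functions._

val windowedCounts = events
  .withWatermark("eventTime", "10 minutes")                     // tolerate data up to 10 minutes late
  .groupBy(window(col("eventTime"), "5 minutes"), col("word"))
  .count()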


Re: Structured streaming use of DataFrame vs Datasource

2016-06-16 Thread Tathagata Das
It's not throwing away any information from the point of view of the SQL
optimizer. The schema preserves all the type information that Catalyst
uses. The type information T in Dataset[T] is only used at the API level to
ensure compile-time type checks of the user program.
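
A small, self-contained sketch of the alias relationship described above (the session setup is purely illustrative):

import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}

val spark = SparkSession.builder().master("local[*]").appName("alias-demo").getOrCreate()
import spark.implicits._

// In the org.apache.spark.sql package object: type DataFrame = Dataset[Row]
val df: DataFrame = Seq("a", "b", "c").toDF("value")
val ds: Dataset[String] = df.as[String]   // adds the compile-time type T = String
val back: DataFrame = ds.toDF()           // erases T back to Row; same underlying plan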

On Thu, Jun 16, 2016 at 2:05 PM, Cody Koeninger <c...@koeninger.org> wrote:

> I'm clear on what a type alias is.  My question is more that moving
> from e.g. Dataset[T] to Dataset[Row] involves throwing away
> information.  Reading through code that uses the Dataframe alias, it's
> a little hard for me to know when that's intentional or not.
>
>
> On Thu, Jun 16, 2016 at 2:50 PM, Tathagata Das
> <tathagata.das1...@gmail.com> wrote:
> > There are different ways to view this. If it's confusing to think of the Source
> > API returning DataFrames, it's equivalent to thinking that you are returning
> > a Dataset[Row], and DataFrame is just a shorthand. And
> > DataFrame/Dataset[Row] is to Dataset[String] what Java Array[Object] is
> > to Array[String]. DataFrame is more general in a way, as every Dataset
> > can be boiled down to a DataFrame. So to keep the Source APIs general (and
> > also source-compatible with 1.x), they return DataFrame.
> >
> > On Thu, Jun 16, 2016 at 12:38 PM, Cody Koeninger <c...@koeninger.org>
> wrote:
> >>
> >> Is this really an internal / external distinction?
> >>
> >> For a concrete example, Source.getBatch seems to be a public
> >> interface, but returns DataFrame.
> >>
> >> On Thu, Jun 16, 2016 at 1:42 PM, Tathagata Das
> >> <tathagata.das1...@gmail.com> wrote:
> >> > DataFrame is a type alias of Dataset[Row], so externally it seems like
> >> > Dataset is the main type and DataFrame is a derivative type.
> >> > However, internally, since everything is processed as Rows, everything uses
> >> > DataFrames; the type classes used in a Dataset are internally converted to
> >> > rows for processing. Therefore, internally, DataFrame is the "main" type
> >> > that is used.
> >> >
> >> > On Thu, Jun 16, 2016 at 11:18 AM, Cody Koeninger <c...@koeninger.org>
> >> > wrote:
> >> >>
> >> >> Sorry, meant DataFrame vs Dataset
> >> >>
> >> >> On Thu, Jun 16, 2016 at 12:53 PM, Cody Koeninger <c...@koeninger.org
> >
> >> >> wrote:
> >> >> > Is there a principled reason why sql.streaming.* and
> >> >> > sql.execution.streaming.* are making extensive use of DataFrame
> >> >> > instead of Datasource?
> >> >> >
> >> >> > Or is that just a holdover from code written before the move / type
> >> >> > alias?
> >> >>
> >> >> -
> >> >> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> >> >> For additional commands, e-mail: dev-h...@spark.apache.org
> >> >>
> >> >
> >
> >
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>
>


Re: Spark-1.6.0-preview2 trackStateByKey exception restoring state

2015-11-23 Thread Tathagata Das
My intention is to make it compatible! Filed this bug -
https://issues.apache.org/jira/browse/SPARK-11932
Looking into it right now. Thanks for testing it out and reporting this!


On Mon, Nov 23, 2015 at 7:22 AM, jan  wrote:

> Hi guys,
>
> I'm trying out the new trackStateByKey API of the Spark-1.6.0-preview2
> release and I'm encountering an exception when trying to restore previously
> checkpointed state in spark streaming.
>
> Use case:
> - execute a stateful Spark streaming job using trackStateByKey
> - interrupt / kill the job
> - start the job again (without any code changes or cleaning out the
> checkpoint directory)
>
> Upon this restart, I encounter the exception below. The nature of the
> exception makes me think either I am doing something wrong, or there's a
> problem with this use case for the new trackStateByKey API.
>
> I uploaded my job code (
> https://gist.github.com/juyttenh/be7973b0c5c2eddd8a81), but it's
> basically just a modified version of the spark streaming example
> StatefulNetworkWordCount (that had already been updated to use
> trackStateByKey). My version however includes the usage of
> StreamingContext.getOrCreate to actually read the checkpointed state when
> the job is started, leading to the exception below.
>
> Just to make sure: using StreamingContext.getOrCreate should still be
> compatible with using the trackStateByKey API?
>
> Thanx,
> Jan
>
> 15/11/23 10:55:07 ERROR StreamingContext: Error starting the context,
> marking it as stopped
>
> java.lang.IllegalArgumentException: requirement failed
>
> at scala.Predef$.require(Predef.scala:221)
>
> at
> org.apache.spark.streaming.rdd.TrackStateRDD.<init>(TrackStateRDD.scala:133)
>
> at
> org.apache.spark.streaming.dstream.InternalTrackStateDStream$$anonfun$compute$2.apply(TrackStateDStream.scala:148)
>
> at
> org.apache.spark.streaming.dstream.InternalTrackStateDStream$$anonfun$compute$2.apply(TrackStateDStream.scala:143)
>
> at scala.Option.map(Option.scala:145)
>
> at
> org.apache.spark.streaming.dstream.InternalTrackStateDStream.compute(TrackStateDStream.scala:143)
>
> at
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:350)
>
> at
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:350)
>
> at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57)
>
> at
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:349)
>
> at
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:349)
>
> at
> org.apache.spark.streaming.dstream.DStream.createRDDWithLocalProperties(DStream.scala:424)
>
> at
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:344)
>
> at
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:342)
>
> at scala.Option.orElse(Option.scala:257)
>
> at
> org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:339)
>
> at
> org.apache.spark.streaming.dstream.TrackStateDStreamImpl.compute(TrackStateDStream.scala:66)
>
> at
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:350)
>
> at
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:350)
>
> at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57)
>
> at
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:349)
>
> at
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:349)
>
> at
> org.apache.spark.streaming.dstream.DStream.createRDDWithLocalProperties(DStream.scala:424)
>
> at
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:344)
>
> at
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:342)
>
> at scala.Option.orElse(Option.scala:257)
>
> at
> org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:339)
>
> at
> org.apache.spark.streaming.dstream.ForEachDStream.generateJob(ForEachDStream.scala:47)
>
> at
> org.apache.spark.streaming.DStreamGraph$$anonfun$1.apply(DStreamGraph.scala:115)
>
> at
> org.apache.spark.streaming.DStreamGraph$$anonfun$1.apply(DStreamGraph.scala:114)
>
> at
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
>
> at
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
>
> at
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>
> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>
> at
> scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:251)
>
> at scala.collection.AbstractTraversable.flatMap(Traversable.scala:105)
>
> at
> 

Re: Problem building Spark

2015-10-19 Thread Tathagata Das
Seems to be a heap space issue for Maven. Have you configured Maven's
memory according to the instructions on the web page?

export MAVEN_OPTS="-Xmx2g -XX:MaxPermSize=512M -XX:ReservedCodeCacheSize=512m"


On Mon, Oct 19, 2015 at 6:59 PM, Annabel Melongo <
melongo_anna...@yahoo.com.invalid> wrote:

> I tried to build Spark according to the build directions, and it
> failed due to the following error:
>
>
>
>
>
>
>
> [ERROR] Failed to execute goal
> org.apache.maven.plugins:maven-assembly-plugin:2.5.5:single (test-jar-with-dependencies)
> on project spark-streaming-mqtt_2.10: Failed to create assembly: Error creating
> assembly archive test-jar-with-dependencies: Problem creating jar: Execution exception
> (and the archive is probably corrupt but I could not delete it): Java heap space -> [Help 1]
>
> Any help?  I have a 64-bit windows 8 machine
>


Re: failure notice

2015-10-06 Thread Tathagata Das
Unfortunately, there is not an obvious way to do this. I am guessing that
you want to partition your stream such that the same keys always go to the
same executor, right?

You could do it by writing a custom RDD. See ShuffledRDD
<https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/ShuffledRDD.scala>.
That is what is used to do a lot of the shuffling. See how it is used from
RDD.partitionBy() or RDD.reduceByKey(). You could subclass it to specify a
set of preferred locations, and the system will try to respect those
locations. These locations should be among the currently active executors.
You can get the current list of executors from
SparkContext.getExecutorMemoryStatus().

Hope this helps.
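
A rough, hypothetical sketch of the subclassing idea above. The class name, the constructor, and the round-robin host assignment are all made up for illustration; the exact ShuffledRDD constructor and override details may vary across Spark versions, and the host list would come from something like sc.getExecutorMemoryStatus on the driver.

import scala.reflect.ClassTag

import org.apache.spark.{HashPartitioner, Partition}
import org.apache.spark.rdd.{RDD, ShuffledRDD}

// Hypothetical: route each shuffle partition to a preferred host chosen from the
// currently active executors. Preferred locations are hints, not guarantees.
class ExecutorPinnedShuffledRDD[K: ClassTag, V: ClassTag](
    prev: RDD[(K, V)],
    numPartitions: Int,
    hosts: IndexedSeq[String])
  extends ShuffledRDD[K, V, V](prev, new HashPartitioner(numPartitions)) {

  override protected def getPreferredLocations(split: Partition): Seq[String] =
    if (hosts.isEmpty) Nil else Seq(hosts(split.index % hosts.length))
}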

On Tue, Oct 6, 2015 at 8:27 AM, Renyi Xiong <renyixio...@gmail.com> wrote:

> yes, it can recover on a different node. it uses a write-ahead log,
> checkpoints the offsets of both ingress and egress (e.g. using zookeeper and/or
> kafka), and relies on the streaming engine's deterministic operations.
>
> by replaying a certain range of data based on the checkpointed
> ingress offset (at-least-once semantics), state can be recovered, and it
> filters out duplicate events based on the checkpointed egress offset (at-most-once
> semantics)
>
> hope it makes sense.
>
> On Mon, Oct 5, 2015 at 3:11 PM, Tathagata Das <t...@databricks.com> wrote:
>
>> What happens when a whole node running  your " per node streaming engine
>> (built-in checkpoint and recovery)" fails? Can its checkpoint and recovery
>> mechanism handle whole node failure? Can you recover from the checkpoint on
>> a different node?
>>
>> Spark and Spark Streaming were designed with the idea that executors are
>> disposable, and there should not be any node-specific long term state that
>> you rely on unless you can recover that state on a different node.
>>
>> On Mon, Oct 5, 2015 at 3:03 PM, Renyi Xiong <renyixio...@gmail.com>
>> wrote:
>>
>>> if RDDs from the same DStream are not guaranteed to run on the same worker, then the
>>> question becomes:
>>>
>>> is it possible to specify an unlimited duration in ssc to have a
>>> continuous stream (as opposed to a discretized one)?
>>>
>>> say we have a per-node streaming engine (built-in checkpoint and
>>> recovery) we'd like to integrate with spark streaming. can we have a
>>> never-ending batch (or RDD) this way?
>>>
>>> On Mon, Sep 28, 2015 at 4:31 PM, <mailer-dae...@apache.org> wrote:
>>>
>>>> Hi. This is the qmail-send program at apache.org.
>>>> I'm afraid I wasn't able to deliver your message to the following
>>>> addresses.
>>>> This is a permanent error; I've given up. Sorry it didn't work out.
>>>>
>>>> <u...@spark.apache.org>:
>>>> Must be sent from an @apache.org address or a subscriber address or an
>>>> address in LDAP.
>>>>
>>>> --- Below this line is a copy of the message.
>>>>
>>>> Return-Path: <renyixio...@gmail.com>
>>>> Received: (qmail 95559 invoked by uid 99); 28 Sep 2015 23:31:46 -
>>>> Received: from Unknown (HELO spamd3-us-west.apache.org)
>>>> (209.188.14.142)
>>>> by apache.org (qpsmtpd/0.29) with ESMTP; Mon, 28 Sep 2015 23:31:46
>>>> +
>>>> Received: from localhost (localhost [127.0.0.1])
>>>> by spamd3-us-west.apache.org (ASF Mail Server at
>>>> spamd3-us-west.apache.org) with ESMTP id 96E361809BA
>>>> for <u...@spark.apache.org>; Mon, 28 Sep 2015 23:31:45 +
>>>> (UTC)
>>>> X-Virus-Scanned: Debian amavisd-new at spamd3-us-west.apache.org
>>>> X-Spam-Flag: NO
>>>> X-Spam-Score: 3.129
>>>> X-Spam-Level: ***
>>>> X-Spam-Status: No, score=3.129 tagged_above=-999 required=6.31
>>>> tests=[DKIM_SIGNED=0.1, DKIM_VALID=-0.1, DKIM_VALID_AU=-0.1,
>>>> FREEMAIL_ENVFROM_END_DIGIT=0.25, HTML_MESSAGE=3,
>>>> RCVD_IN_MSPIKE_H3=-0.01, RCVD_IN_MSPIKE_WL=-0.01,
>>>> SPF_PASS=-0.001]
>>>> autolearn=disabled
>>>> Authentication-Results: spamd3-us-west.apache.org (amavisd-new);
>>>> dkim=pass (2048-bit key) header.d=gmail.com
>>>> Received: from mx1-us-west.apache.org ([10.40.0.8])
>>>> by localhost (spamd3-us-west.apache.org [10.40.0.10])
>>>> (amavisd-new, port 10024)
>>>> with ESMTP id FAGoohFE7Y7A for <u...@spark.apache.org>;
>>>> M

Re: failure notice

2015-10-05 Thread Tathagata Das
What happens when a whole node running  your " per node streaming engine
(built-in checkpoint and recovery)" fails? Can its checkpoint and recovery
mechanism handle whole node failure? Can you recover from the checkpoint on
a different node?

Spark and Spark Streaming were designed with the idea that executors are
disposable, and there should not be any node-specific long term state that
you rely on unless you can recover that state on a different node.

On Mon, Oct 5, 2015 at 3:03 PM, Renyi Xiong  wrote:

> if RDDs from same DStream not guaranteed to run on same worker, then the
> question becomes:
>
> is it possible to specify an unlimited duration in ssc to have a
> continuous stream (as opposed to discretized).
>
> say, we have a per node streaming engine (built-in checkpoint and
> recovery) we'd like to integrate with spark streaming. can we have a
> never-ending batch (or RDD) this way?
>
> On Mon, Sep 28, 2015 at 4:31 PM,  wrote:
>
>> Hi. This is the qmail-send program at apache.org.
>> I'm afraid I wasn't able to deliver your message to the following
>> addresses.
>> This is a permanent error; I've given up. Sorry it didn't work out.
>>
>> :
>> Must be sent from an @apache.org address or a subscriber address or an
>> address in LDAP.
>>

Re: Dynamic DAG use-case for spark streaming.

2015-09-29 Thread Tathagata Das
A very basic form of support for this already exists in DStream:
DStream.transform(), which takes an arbitrary RDD => RDD function. That
function can choose to do a different computation at each batch time, which
may be of help to you.
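
A minimal sketch of that pattern: the predicate consulted inside transform()
is looked up again for every batch, so the computation can change over time.
The mutable filterRef and the socket source here are illustrative, not part
of any proposed API.

import java.util.concurrent.atomic.AtomicReference
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object TransformSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("transform-sketch").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(10))
    val lines = ssc.socketTextStream("localhost", 9999)

    // Some other thread (e.g. one serving UI requests) could swap this at any time.
    val filterRef = new AtomicReference[String => Boolean]((_: String) => true)

    // transform() re-evaluates this closure once per batch on the driver,
    // so the RDD-to-RDD logic can differ from batch to batch.
    val filtered = lines.transform { rdd =>
      val f = filterRef.get()
      rdd.filter(f)
    }
    filtered.print()

    ssc.start()
    ssc.awaitTermination()
  }
}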

On Tue, Sep 29, 2015 at 12:06 PM, Archit Thakur 
wrote:

> Hi,
>
>  We are using Spark Streaming as our processing engine, and as part of the
> output we want to push the data to a UI. Now there would be multiple users
> accessing the system with their different filters on. Based on the filters
> and other inputs we want to either run a SQL query on the DStream or do
> custom logic processing. This would need the system to read the
> filters/query and generate the execution graph at runtime. I can't see any
> support in Spark Streaming for generating the execution graph on the fly.
> I think I can broadcast the query to executors and read the broadcasted
> query at runtime, but that would also limit me to one user at a time.
>
> Do we not expect Spark Streaming to take queries/filters from the outside
> world? Does output in Spark Streaming only mean outputting to an external
> source, which could then be queried?
>
> Thanks,
> Archit Thakur.
>


Re: Checkpoint directory structure

2015-09-23 Thread Tathagata Das
Could you provide the logs on when and how you are seeing this error?

On Wed, Sep 23, 2015 at 6:32 PM, Bin Wang  wrote:

> BTW, I just killed the application and restarted it. Then the application
> could not recover from the checkpoint because some RDDs were lost. So I'm
> wondering: if there are failures in the application, is it possible that it
> won't be able to recover from the checkpoint?
>
> Bin Wang 于2015年9月23日周三 下午6:58写道:
>
>> I find the checkpoint directory structure is like this:
>>
>> -rw-r--r--   1 root root 134820 2015-09-23 16:55
>> /user/root/checkpoint/checkpoint-144299850
>> -rw-r--r--   1 root root 134768 2015-09-23 17:00
>> /user/root/checkpoint/checkpoint-144299880
>> -rw-r--r--   1 root root 134895 2015-09-23 17:05
>> /user/root/checkpoint/checkpoint-144299910
>> -rw-r--r--   1 root root 134899 2015-09-23 17:10
>> /user/root/checkpoint/checkpoint-144299940
>> -rw-r--r--   1 root root 134913 2015-09-23 17:15
>> /user/root/checkpoint/checkpoint-144299970
>> -rw-r--r--   1 root root 134928 2015-09-23 17:20
>> /user/root/checkpoint/checkpoint-14430
>> -rw-r--r--   1 root root 134987 2015-09-23 17:25
>> /user/root/checkpoint/checkpoint-144300030
>> -rw-r--r--   1 root root 134944 2015-09-23 17:30
>> /user/root/checkpoint/checkpoint-144300060
>> -rw-r--r--   1 root root 134956 2015-09-23 17:35
>> /user/root/checkpoint/checkpoint-144300090
>> -rw-r--r--   1 root root 135244 2015-09-23 17:40
>> /user/root/checkpoint/checkpoint-144300120
>> drwxr-xr-x   - root root  0 2015-09-23 18:48
>> /user/root/checkpoint/d3714249-e03a-45c7-a0d5-1dc870b7d9f2
>> drwxr-xr-x   - root root  0 2015-09-23 17:44
>> /user/root/checkpoint/receivedBlockMetadata
>>
>>
>> I restarted Spark and it reads from
>> /user/root/checkpoint/d3714249-e03a-45c7-a0d5-1dc870b7d9f2. But it seems
>> that some RDDs in it were lost, so it is not able to recover. Meanwhile I
>> find other entries in checkpoint/, like
>> /user/root/checkpoint/checkpoint-144300120. What are they used for?
>> Can I recover my data from them?
>>
>


Re: KinesisStreamSuite failing in master branch

2015-07-19 Thread Tathagata Das
The PR to fix this is out.
https://github.com/apache/spark/pull/7519


On Sun, Jul 19, 2015 at 6:41 PM, Tathagata Das t...@databricks.com wrote:

 I am taking care of this right now.

 On Sun, Jul 19, 2015 at 6:08 PM, Patrick Wendell pwend...@gmail.com
 wrote:

 I think we should just revert this patch on all affected branches. No
 reason to leave the builds broken until a fix is in place.

 - Patrick

 On Sun, Jul 19, 2015 at 6:03 PM, Josh Rosen rosenvi...@gmail.com wrote:
  Yep, I emailed TD about it; I think that we may need to make a change
 to the
  pull request builder to fix this.  Pending that, we could just revert
 the
  commit that added this.
 
  On Sun, Jul 19, 2015 at 5:32 PM, Ted Yu yuzhih...@gmail.com wrote:
 
  Hi,
  I noticed that KinesisStreamSuite fails for both hadoop profiles in
 master
  Jenkins builds.
 
  From
 
 https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-Maven-with-YARN/HADOOP_PROFILE=hadoop-2.4,label=centos/3011/console
  :
 
  KinesisStreamSuite:
  *** RUN ABORTED ***
java.lang.AssertionError: assertion failed: Kinesis test not enabled,
  should not attempt to get AWS credentials
at scala.Predef$.assert(Predef.scala:179)
at
 
 org.apache.spark.streaming.kinesis.KinesisTestUtils$.getAWSCredentials(KinesisTestUtils.scala:189)
at
  org.apache.spark.streaming.kinesis.KinesisTestUtils.org
 $apache$spark$streaming$kinesis$KinesisTestUtils$$kinesisClient$lzycompute(KinesisTestUtils.scala:59)
at
  org.apache.spark.streaming.kinesis.KinesisTestUtils.org
 $apache$spark$streaming$kinesis$KinesisTestUtils$$kinesisClient(KinesisTestUtils.scala:58)
at
 
 org.apache.spark.streaming.kinesis.KinesisTestUtils.describeStream(KinesisTestUtils.scala:121)
at
 
 org.apache.spark.streaming.kinesis.KinesisTestUtils.findNonExistentStreamName(KinesisTestUtils.scala:157)
at
 
 org.apache.spark.streaming.kinesis.KinesisTestUtils.createStream(KinesisTestUtils.scala:78)
at
 
 org.apache.spark.streaming.kinesis.KinesisStreamSuite.beforeAll(KinesisStreamSuite.scala:45)
at
 
 org.scalatest.BeforeAndAfterAll$class.beforeAll(BeforeAndAfterAll.scala:187)
at
 
 org.apache.spark.streaming.kinesis.KinesisStreamSuite.beforeAll(KinesisStreamSuite.scala:33)
 
 
  FYI
 
 

 -
 To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
 For additional commands, e-mail: dev-h...@spark.apache.org





Re: KinesisStreamSuite failing in master branch

2015-07-19 Thread Tathagata Das
I am taking care of this right now.

On Sun, Jul 19, 2015 at 6:08 PM, Patrick Wendell pwend...@gmail.com wrote:

 I think we should just revert this patch on all affected branches. No
 reason to leave the builds broken until a fix is in place.

 - Patrick

 On Sun, Jul 19, 2015 at 6:03 PM, Josh Rosen rosenvi...@gmail.com wrote:
  Yep, I emailed TD about it; I think that we may need to make a change to
 the
  pull request builder to fix this.  Pending that, we could just revert the
  commit that added this.
 
  On Sun, Jul 19, 2015 at 5:32 PM, Ted Yu yuzhih...@gmail.com wrote:
 
  Hi,
  I noticed that KinesisStreamSuite fails for both hadoop profiles in
 master
  Jenkins builds.
 
  From
 
 https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-Maven-with-YARN/HADOOP_PROFILE=hadoop-2.4,label=centos/3011/console
  :
 
  KinesisStreamSuite:
  *** RUN ABORTED ***
java.lang.AssertionError: assertion failed: Kinesis test not enabled,
  should not attempt to get AWS credentials
at scala.Predef$.assert(Predef.scala:179)
at
 
 org.apache.spark.streaming.kinesis.KinesisTestUtils$.getAWSCredentials(KinesisTestUtils.scala:189)
at
  org.apache.spark.streaming.kinesis.KinesisTestUtils.org
 $apache$spark$streaming$kinesis$KinesisTestUtils$$kinesisClient$lzycompute(KinesisTestUtils.scala:59)
at
  org.apache.spark.streaming.kinesis.KinesisTestUtils.org
 $apache$spark$streaming$kinesis$KinesisTestUtils$$kinesisClient(KinesisTestUtils.scala:58)
at
 
 org.apache.spark.streaming.kinesis.KinesisTestUtils.describeStream(KinesisTestUtils.scala:121)
at
 
 org.apache.spark.streaming.kinesis.KinesisTestUtils.findNonExistentStreamName(KinesisTestUtils.scala:157)
at
 
 org.apache.spark.streaming.kinesis.KinesisTestUtils.createStream(KinesisTestUtils.scala:78)
at
 
 org.apache.spark.streaming.kinesis.KinesisStreamSuite.beforeAll(KinesisStreamSuite.scala:45)
at
 
 org.scalatest.BeforeAndAfterAll$class.beforeAll(BeforeAndAfterAll.scala:187)
at
 
 org.apache.spark.streaming.kinesis.KinesisStreamSuite.beforeAll(KinesisStreamSuite.scala:33)
 
 
  FYI
 
 

 -
 To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
 For additional commands, e-mail: dev-h...@spark.apache.org




Re: Does RDD checkpointing store the entire state in HDFS?

2015-07-14 Thread Tathagata Das
1. When you set ssc.checkpoint(checkpointDir), Spark Streaming
periodically saves the state RDD (which is a snapshot of all the state
data) to HDFS using RDD checkpointing. In fact, a streaming app with
updateStateByKey will not start until you set a checkpoint directory.

2. The updateStateByKey performance is sort of independent of what source is
being used - receiver-based or direct Kafka. The absolute performance obviously
depends on a LOT of variables: size of the cluster, parallelization, etc. The
key thing is that you must ensure sufficient parallelism at every stage -
receiving, shuffles (updateStateByKey included), and output.

Some more discussion in my talk -
https://www.youtube.com/watch?v=d5UJonrruHk
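
A minimal sketch of the setup in point 1: a checkpoint directory plus
updateStateByKey keeping a running count per key. The path and the counting
logic are placeholders, not taken from the thread.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StatefulCountSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("stateful-count").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(10))
    // Required before updateStateByKey can run; the state RDDs are
    // periodically checkpointed under this directory.
    ssc.checkpoint("hdfs:///tmp/streaming-checkpoint")

    val words = ssc.socketTextStream("localhost", 9999).flatMap(_.split(" "))
    val counts = words.map((_, 1)).updateStateByKey[Int] {
      (newValues: Seq[Int], state: Option[Int]) => Some(state.getOrElse(0) + newValues.sum)
    }
    counts.print()

    ssc.start()
    ssc.awaitTermination()
  }
}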


On Tue, Jul 14, 2015 at 4:11 PM, swetha swethakasire...@gmail.com wrote:


 Hi TD,

 I have a question regarding sessionization using updateStateByKey. If near
 real time state needs to be maintained in a Streaming application, what
 happens when the number of RDDs to maintain the state becomes very large?
 Does it automatically get saved to HDFS and reload when needed or do I have
 to use any code like ssc.checkpoint(checkpointDir)?  Also, how is the
 performance if I use both DStream Checkpointing for maintaining the state
 and use Kafka Direct approach for exactly once semantics?


 Thanks,
 Swetha



 --
 View this message in context:
 http://apache-spark-developers-list.1001551.n3.nabble.com/Does-RDD-checkpointing-store-the-entire-state-in-HDFS-tp7368p13227.html
 Sent from the Apache Spark Developers List mailing list archive at
 Nabble.com.

 -
 To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
 For additional commands, e-mail: dev-h...@spark.apache.org




Re: Does RDD checkpointing store the entire state in HDFS?

2015-07-14 Thread Tathagata Das
BTW, this is more like a user-list kind of mail than a dev-list one. The
dev-list is for Spark developers.

On Tue, Jul 14, 2015 at 4:23 PM, Tathagata Das t...@databricks.com wrote:

 1. When you set ssc.checkpoint(checkpointDir), Spark Streaming
 periodically saves the state RDD (which is a snapshot of all the state
 data) to HDFS using RDD checkpointing. In fact, a streaming app with
 updateStateByKey will not start until you set a checkpoint directory.

 2. The updateStateByKey performance is sort of independent of what source
 is being used - receiver-based or direct Kafka. The absolute performance
 obviously depends on a LOT of variables: size of the cluster,
 parallelization, etc. The key thing is that you must ensure sufficient
 parallelism at every stage - receiving, shuffles
 (updateStateByKey included), and output.

 Some more discussion in my talk -
 https://www.youtube.com/watch?v=d5UJonrruHk


 On Tue, Jul 14, 2015 at 4:11 PM, swetha swethakasire...@gmail.com wrote:


 Hi TD,

 I have a question regarding sessionization using updateStateByKey. If near
 real time state needs to be maintained in a Streaming application, what
 happens when the number of RDDs to maintain the state becomes very large?
 Does it automatically get saved to HDFS and reload when needed or do I
 have
 to use any code like ssc.checkpoint(checkpointDir)?  Also, how is the
 performance if I use both DStream Checkpointing for maintaining the state
 and use Kafka Direct approach for exactly once semantics?


 Thanks,
 Swetha



 --
 View this message in context:
 http://apache-spark-developers-list.1001551.n3.nabble.com/Does-RDD-checkpointing-store-the-entire-state-in-HDFS-tp7368p13227.html
 Sent from the Apache Spark Developers List mailing list archive at
 Nabble.com.

 -
 To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
 For additional commands, e-mail: dev-h...@spark.apache.org





Re: [VOTE] Release Apache Spark 1.4.1

2015-06-29 Thread Tathagata Das
@Ted, could you elaborate more on what was the test command that you ran?
What profiles, using SBT or Maven?

TD

On Sun, Jun 28, 2015 at 12:21 PM, Patrick Wendell pwend...@gmail.com
wrote:

 Hey Krishna - this is still the current release candidate.

 - Patrick

 On Sun, Jun 28, 2015 at 12:14 PM, Krishna Sankar ksanka...@gmail.com
 wrote:
  Patrick,
 Haven't seen any replies on test results. I will byte ;o) - Should I
 test
  this version or is another one in the wings ?
  Cheers
  k/
 
  On Tue, Jun 23, 2015 at 10:37 PM, Patrick Wendell pwend...@gmail.com
  wrote:
 
  Please vote on releasing the following candidate as Apache Spark version
  1.4.1!
 
  This release fixes a handful of known issues in Spark 1.4.0, listed
 here:
  http://s.apache.org/spark-1.4.1
 
  The tag to be voted on is v1.4.1-rc1 (commit 60e08e5):
  https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=
  60e08e50751fe3929156de956d62faea79f5b801
 
  The release files, including signatures, digests, etc. can be found at:
  http://people.apache.org/~pwendell/spark-releases/spark-1.4.1-rc1-bin/
 
  Release artifacts are signed with the following key:
  https://people.apache.org/keys/committer/pwendell.asc
 
  The staging repository for this release can be found at:
  [published as version: 1.4.1]
  https://repository.apache.org/content/repositories/orgapachespark-1118/
  [published as version: 1.4.1-rc1]
  https://repository.apache.org/content/repositories/orgapachespark-1119/
 
  The documentation corresponding to this release can be found at:
  http://people.apache.org/~pwendell/spark-releases/spark-1.4.1-rc1-docs/
 
  Please vote on releasing this package as Apache Spark 1.4.1!
 
  The vote is open until Saturday, June 27, at 06:32 UTC and passes
  if a majority of at least 3 +1 PMC votes are cast.
 
  [ ] +1 Release this package as Apache Spark 1.4.1
  [ ] -1 Do not release this package because ...
 
  To learn more about Apache Spark, please see
  http://spark.apache.org/
 
  -
  To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
  For additional commands, e-mail: dev-h...@spark.apache.org
 
 

 -
 To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
 For additional commands, e-mail: dev-h...@spark.apache.org




Re: Time is ugly in Spark Streaming....

2015-06-27 Thread Tathagata Das
Could you print the time on the driver (that is, in foreachRDD but before
RDD.foreachPartition) and see if it is behaving weird?

TD
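
A minimal sketch of that check, with dataStream standing in for whatever
DStream the job actually uses:

dataStream.foreachRDD { (rdd, time) =>
  // runs on the driver once per batch
  println(s"driver sees batch time: ${new java.util.Date(time.milliseconds)}")
  rdd.foreachPartition { partition =>
    // executor-side work as before
    ()
  }
}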

On Fri, Jun 26, 2015 at 3:57 PM, Emrehan Tüzün emrehan.tu...@gmail.com
wrote:





 On Fri, Jun 26, 2015 at 12:30 PM, Sea 261810...@qq.com wrote:

 Hi, all

 I find a problem in Spark Streaming when I use the time in the function
 "foreachRDD"...
 I find the time is very interesting.

 val messages = KafkaUtils.createDirectStream[String, String, StringDecoder, 
 StringDecoder](ssc, kafkaParams, topicsSet)

 dataStream.map(x => createGroup(x._2, dimensions)).groupByKey().foreachRDD((rdd, time) => {
   try {
     if (!rdd.partitions.isEmpty) {
       rdd.foreachPartition(partition => {
         handlePartition(partition, timeType, time, dimensions, outputTopic, brokerList)
       })
     }
   } catch {
     case e: Exception => e.printStackTrace()
   }
 })


 val dateFormat = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss")

  var date = dateFormat.format(new Date(time.milliseconds))


 Then I insert the 'date' into Kafka, but I found:


 {"timestamp":"2015-06-00T16:50:02","status":3,"type":1,"waittime":0,"count":17}
 {"timestamp":"2015-06-26T16:51:13","status":1,"type":1,"waittime":0,"count":34}
 {"timestamp":"2015-06-00T16:50:02","status":4,"type":0,"waittime":0,"count":279}
 {"timestamp":"2015-06-26T16:52:00","status":11,"type":1,"waittime":0,"count":9}
 {"timestamp":"0020-06-26T16:50:36","status":7,"type":0,"waittime":0,"count":1722}
 {"timestamp":"2015-06-10T16:51:17","status":0,"type":0,"waittime":0,"count":2958}
 {"timestamp":"2015-06-26T16:52:00","status":0,"type":1,"waittime":0,"count":114}
 {"timestamp":"2015-06-10T16:51:17","status":11,"type":0,"waittime":0,"count":2066}
 {"timestamp":"2015-06-26T16:52:00","status":1,"type":0,"waittime":0,"count":1539}





Re: Calculating tuple count /input rate with time

2015-06-23 Thread Tathagata Das
This should give an accurate count for each batch, though for getting the rate
you have to make sure that your streaming app is stable, that is, batches
are processed as fast as they are received (the scheduling delay in the Spark
Streaming UI is approximately 0).

TD
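
Not from the thread, just a sketch of another way to observe per-batch record
counts and scheduling delay on the driver without adding an extra count() job,
via a StreamingListener (batchInfo.numRecords assumes a reasonably recent
Spark version):

import org.apache.spark.streaming.scheduler.{StreamingListener, StreamingListenerBatchCompleted}

// register with: ssc.addStreamingListener(new RateListener)
class RateListener extends StreamingListener {
  override def onBatchCompleted(batchCompleted: StreamingListenerBatchCompleted): Unit = {
    val info = batchCompleted.batchInfo
    println(s"batch ${info.batchTime}: ${info.numRecords} records, " +
      s"processing delay ${info.processingDelay.getOrElse(-1L)} ms, " +
      s"scheduling delay ${info.schedulingDelay.getOrElse(-1L)} ms")
  }
}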

On Tue, Jun 23, 2015 at 2:49 AM, anshu shukla anshushuk...@gmail.com
wrote:

 I am calculating input rate using the following logic.

 And I think this foreachRDD is always running on the driver (the printlns are
 seen on the driver).

 1- Is there any other way to do that at a lower cost?

 2- Will this give me the correct count for the rate?


 //code -

 inputStream.foreachRDD(new Function<JavaRDD<String>, Void>() {
   @Override
   public Void call(JavaRDD<String> stringJavaRDD) throws Exception {
     System.out.println(System.currentTimeMillis() + ",spoutstringJavaRDD, " + stringJavaRDD.count());
     return null;
   }
 });



 --
 Thanks  Regards,
 Anshu Shukla



Re: [VOTE] Release Apache Spark 1.4.0 (RC4)

2015-06-07 Thread Tathagata Das
+1

On Sun, Jun 7, 2015 at 3:01 PM, Joseph Bradley jos...@databricks.com
wrote:

 +1

 On Sat, Jun 6, 2015 at 7:55 PM, Guoqiang Li wi...@qq.com wrote:

 +1 (non-binding)


 -- Original --
 *From: * Reynold Xin;r...@databricks.com;
 *Date: * Fri, Jun 5, 2015 03:18 PM
 *To: * Krishna Sankarksanka...@gmail.com;
 *Cc: * Patrick Wendellpwend...@gmail.com; dev@spark.apache.org
 dev@spark.apache.org;
 *Subject: * Re: [VOTE] Release Apache Spark 1.4.0 (RC4)

 Enjoy your new shiny mbp.

 On Fri, Jun 5, 2015 at 12:10 AM, Krishna Sankar ksanka...@gmail.com
 wrote:

 +1 (non-binding, of course)

 1. Compiled OSX 10.10 (Yosemite) OK Total time: 25:42 min (My brand new
 shiny MacBookPro12,1 : 16GB. Inaugurated the machine with compile  test
 1.4.0-RC4 !)
  mvn clean package -Pyarn -Dyarn.version=2.6.0 -Phadoop-2.4
 -Dhadoop.version=2.6.0 -DskipTests
 2. Tested pyspark, mlib - running as well as compare results with 1.3.1
 2.1. statistics (min,max,mean,Pearson,Spearman) OK
 2.2. Linear/Ridge/Laso Regression OK
 2.3. Decision Tree, Naive Bayes OK
 2.4. KMeans OK
Center And Scale OK
 2.5. RDD operations OK
   State of the Union Texts - MapReduce, Filter,sortByKey (word count)
 2.6. Recommendation (Movielens medium dataset ~1 M ratings) OK
Model evaluation/optimization (rank, numIter, lambda) with
 itertools OK
 3. Scala - MLlib
 3.1. statistics (min,max,mean,Pearson,Spearman) OK
 3.2. LinearRegressionWithSGD OK
 3.3. Decision Tree OK
 3.4. KMeans OK
 3.5. Recommendation (Movielens medium dataset ~1 M ratings) OK
 3.6. saveAsParquetFile OK
 3.7. Read and verify the 4.3 save(above) - sqlContext.parquetFile,
 registerTempTable, sql OK
 3.8. result = sqlContext.sql("SELECT OrderDetails.OrderID,ShipCountry,UnitPrice,Qty,Discount FROM Orders INNER JOIN OrderDetails ON Orders.OrderID = OrderDetails.OrderID") OK
 4.0. Spark SQL from Python OK
 4.1. result = sqlContext.sql("SELECT * from people WHERE State = 'WA'") OK

 Cheers
 k/

 On Tue, Jun 2, 2015 at 8:53 PM, Patrick Wendell pwend...@gmail.com
 wrote:

 Please vote on releasing the following candidate as Apache Spark
 version 1.4.0!

 The tag to be voted on is v1.4.0-rc3 (commit 22596c5):
 https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=
 22596c534a38cfdda91aef18aa9037ab101e4251

 The release files, including signatures, digests, etc. can be found at:
 http://people.apache.org/~pwendell/spark-releases/spark-1.4.0-rc4-bin/

 Release artifacts are signed with the following key:
 https://people.apache.org/keys/committer/pwendell.asc

 The staging repository for this release can be found at:
 [published as version: 1.4.0]
 https://repository.apache.org/content/repositories/orgapachespark-/
 [published as version: 1.4.0-rc4]
 https://repository.apache.org/content/repositories/orgapachespark-1112/

 The documentation corresponding to this release can be found at:
 http://people.apache.org/~pwendell/spark-releases/spark-1.4.0-rc4-docs/

 Please vote on releasing this package as Apache Spark 1.4.0!

 The vote is open until Saturday, June 06, at 05:00 UTC and passes
 if a majority of at least 3 +1 PMC votes are cast.

 [ ] +1 Release this package as Apache Spark 1.4.0
 [ ] -1 Do not release this package because ...

 To learn more about Apache Spark, please see
 http://spark.apache.org/

 == What has changed since RC3 ==
 In addition to many smaller fixes, three blocker issues were fixed:
 4940630 [SPARK-8020] [SQL] Spark SQL conf in spark-defaults.conf make
 metadataHive get constructed too early
 6b0f615 [SPARK-8038] [SQL] [PYSPARK] fix Column.when() and otherwise()
 78a6723 [SPARK-7978] [SQL] [PYSPARK] DecimalType should not be singleton

 == How can I help test this release? ==
 If you are a Spark user, you can help us test this release by
 taking a Spark 1.3 workload and running on this release candidate,
 then reporting any regressions.

 == What justifies a -1 vote for this release? ==
 This vote is happening towards the end of the 1.4 QA period,
 so -1 votes should only occur for significant regressions from 1.3.1.
 Bugs already present in 1.3.X, minor regressions, or bugs related
 to new features will not block this release.

 -
 To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
 For additional commands, e-mail: dev-h...@spark.apache.org







Re: StreamingContextSuite fails with NoSuchMethodError

2015-05-30 Thread Tathagata Das
Was it a clean compilation?

TD

On Fri, May 29, 2015 at 10:48 PM, Ted Yu yuzhih...@gmail.com wrote:

 Hi,
 I ran the following command on 1.4.0 RC3:

 mvn -Phadoop-2.4 -Dhadoop.version=2.7.0 -Pyarn -Phive package

 I saw the following failure:

 StreamingContextSuite:
 - from no conf constructor
 - from no conf + spark home
 - from no conf + spark home + env
 - from conf with settings
 - from existing SparkContext
 - from existing SparkContext with settings
 *** RUN ABORTED ***
   java.lang.NoSuchMethodError: org.apache.spark.ui.JettyUtils$.createStaticHandler(Ljava/lang/String;Ljava/lang/String;)Lorg/eclipse/jetty/servlet/ServletContextHandler;
   at org.apache.spark.streaming.ui.StreamingTab.attach(StreamingTab.scala:49)
   at org.apache.spark.streaming.StreamingContext$$anonfun$start$2.apply(StreamingContext.scala:585)
   at org.apache.spark.streaming.StreamingContext$$anonfun$start$2.apply(StreamingContext.scala:585)
   at scala.Option.foreach(Option.scala:236)
   at org.apache.spark.streaming.StreamingContext.start(StreamingContext.scala:585)
   at org.apache.spark.streaming.StreamingContextSuite$$anonfun$8.apply$mcV$sp(StreamingContextSuite.scala:101)
   at org.apache.spark.streaming.StreamingContextSuite$$anonfun$8.apply(StreamingContextSuite.scala:96)
   at org.apache.spark.streaming.StreamingContextSuite$$anonfun$8.apply(StreamingContextSuite.scala:96)
   at org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22)
   at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)

 Did anyone else encounter similar error ?

 Cheers



Re: Spark Streaming - Design considerations/Knobs

2015-05-24 Thread Tathagata Das
Blocks are replicated immediately, before the driver launches any jobs
using them.

On Thu, May 21, 2015 at 2:05 AM, Hemant Bhanawat hemant9...@gmail.com
wrote:

 Honestly, given the length of my email, I didn't expect a reply. :-)
 Thanks for reading and replying. However, I have a follow-up question:

 I don't think I understand block replication completely. Are the
 blocks replicated immediately after they are received by the receiver? Or
 are they kept only on the receiver node and moved only on shuffle? Does
 the replication have something to do with locality.wait?

 Thanks,
 Hemant

 On Thu, May 21, 2015 at 2:21 AM, Tathagata Das t...@databricks.com
 wrote:

 Correcting the ones that are incorrect or incomplete. BUT this is a good
 list of things to remember about Spark Streaming.


 On Wed, May 20, 2015 at 3:40 AM, Hemant Bhanawat hemant9...@gmail.com
 wrote:

 Hi,

 I have compiled a list (from online sources) of knobs/design
 considerations that need to be taken care of by applications running on
 spark streaming. Is my understanding correct?  Any other important design
 consideration that I should take care of?


- A DStream is associated with a single receiver. For attaining read
parallelism multiple receivers i.e. multiple DStreams need to be created.
- A receiver is run within an executor. It occupies one core. Ensure
that there are enough cores for processing after receiver slots are 
 booked
i.e. spark.cores.max should take the receiver slots into account.
- The receivers are allocated to executors in a round robin fashion.
- When data is received from a stream source, receiver creates
blocks of data.  A new block of data is generated every blockInterval
milliseconds. N blocks of data are created during the batchInterval 
 where N
= batchInterval/blockInterval.
- These blocks are distributed by the BlockManager of the current
executor to the block managers of other executors. After that, the 
 Network
Input Tracker running on the driver is informed about the block locations
for further processing.
- A RDD is created on the driver for the blocks created during the
batchInterval. The blocks generated during the batchInterval are 
 partitions
of the RDD. Each partition is a task in spark. blockInterval==
batchinterval would mean that a single partition is created and probably 
 it
is processed locally.

 The map tasks on the blocks are processed in the executors (one that
 received the block, and another where the block was replicated) that has
 the blocks irrespective of block interval, unless non-local scheduling
 kicks in (as you observed next).


- Having bigger blockinterval means bigger blocks. A high value of
spark.locality.wait increases the chance of processing a block on the 
 local
node. A balance needs to be found out between these two parameters to
ensure that the bigger blocks are processed locally.
- Instead of relying on batchInterval and blockInterval, you can
define the number of partitions by calling dstream.repartition(n). This
reshuffles the data in RDD randomly to create n number of partitions.

 Yes, for greater parallelism. Though comes at the cost of a shuffle.


- An RDD's processing is scheduled by driver's jobscheduler as a
job. At a given point of time only one job is active. So, if one job is
executing the other jobs are queued.


- If you have two dstreams there will be two RDDs formed and there
will be two jobs created which will be scheduled one after the another.


- To avoid this, you can union two dstreams. This will ensure that a
single unionRDD is formed for the two RDDs of the dstreams. This unionRDD
is then considered as a single job. However the partitioning of the RDDs 
 is
not impacted.

 To further clarify, the jobs depend on the number of output operations
 (print, foreachRDD, saveAsXFiles) and the number of RDD actions in those
 output operations.

  dstream1.union(dstream2).foreachRDD { rdd => rdd.count() }    // one Spark job per batch

  dstream1.union(dstream2).foreachRDD { rdd => { rdd.count() ; rdd.count() } }    // TWO Spark jobs per batch

  dstream1.foreachRDD { rdd => rdd.count } ; dstream2.foreachRDD { rdd => rdd.count }  // TWO Spark jobs per batch






-
- If the batch processing time is more than batchinterval then
obviously the receiver's memory will start filling up and will end up in
throwing exceptions (most probably BlockNotFoundException). Currently 
 there
is  no way to pause the receiver.

 You can limit the rate of receiver using SparkConf config
 spark.streaming.receiver.maxRate
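
A short sketch of setting that limit (the 10000 records/second/receiver value
is just an example):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf()
  .setAppName("rate-limited-app")
  .set("spark.streaming.receiver.maxRate", "10000") // records per second per receiver
val ssc = new StreamingContext(conf, Seconds(10))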


-
- For being fully fault tolerant, spark streaming needs to enable
checkpointing. Checkpointing increases the batch processing time.

 Incomplete. There are two types of checkpointing - data and metadata.
 Only data checkpointing, needed by only some operations, increase batch

Re: Spark Streaming with Tachyon : Data Loss on Receiver Failure due to WAL error

2015-05-21 Thread Tathagata Das
Looks like somehow the file size reported by the FSInputDStream of
Tachyon's FileSystem interface is returning zero.

On Mon, May 11, 2015 at 4:38 AM, Dibyendu Bhattacharya 
dibyendu.bhattach...@gmail.com wrote:

 Just to follow up this thread further .

 I was doing some fault-tolerance testing of Spark Streaming with Tachyon as
 the OFF_HEAP block store. As I said in an earlier email, I was able to solve
 the BlockNotFound exception when I used Hierarchical Storage in Tachyon,
 which is good.

 I continued doing some testing around storing the Spark Streaming WAL and
 checkpoint files in Tachyon as well. Here are a few findings:


 When I store the Spark Streaming checkpoint location in Tachyon, the
 throughput is much higher. I tested the Driver and Receiver failure cases,
 and Spark Streaming is able to recover without any data loss on Driver
 failure.

 But on Receiver failure, Spark Streaming loses data, as I see an
 Exception while reading the WAL file from the Tachyon receivedData location
 for the same Receiver id which just failed.

 If I change the checkpoint location back to HDFS, Spark Streaming can
 recover from both Driver and Receiver failure.

 Here are the log details from when the Spark Streaming receiver failed... I
 raised a JIRA for the same issue: https://issues.apache.org/jira/browse/SPARK-7525



 INFO : org.apache.spark.scheduler.DAGScheduler - Executor lost: 2 (epoch 1)
 INFO : org.apache.spark.storage.BlockManagerMasterEndpoint - Trying to
 remove executor 2 from BlockManagerMaster.
 INFO : org.apache.spark.storage.BlockManagerMasterEndpoint - Removing
 block manager BlockManagerId(2, 10.252.5.54, 45789)
 INFO : org.apache.spark.storage.BlockManagerMaster - Removed 2
 successfully in removeExecutor
 INFO : org.apache.spark.streaming.scheduler.ReceiverTracker - Registered
 receiver for stream 2 from 10.252.5.62:47255
 WARN : org.apache.spark.scheduler.TaskSetManager - Lost task 2.1 in stage
 103.0 (TID 421, 10.252.5.62): org.apache.spark.SparkException: Could not
 read data from write ahead log record
 FileBasedWriteAheadLogSegment(tachyon-ft://10.252.5.113:19998/tachyon/checkpoint/receivedData/2/log-1431341091711-1431341151711,645603894,10891919)
 at org.apache.spark.streaming.rdd.WriteAheadLogBackedBlockRDD.org
 $apache$spark$streaming$rdd$WriteAheadLogBackedBlockRDD$$getBlockFromWriteAheadLog$1(WriteAheadLogBackedBlockRDD.scala:144)
 at
 org.apache.spark.streaming.rdd.WriteAheadLogBackedBlockRDD$$anonfun$compute$1.apply(WriteAheadLogBackedBlockRDD.scala:168)
 at
 org.apache.spark.streaming.rdd.WriteAheadLogBackedBlockRDD$$anonfun$compute$1.apply(WriteAheadLogBackedBlockRDD.scala:168)
 at scala.Option.getOrElse(Option.scala:120)
 at
 org.apache.spark.streaming.rdd.WriteAheadLogBackedBlockRDD.compute(WriteAheadLogBackedBlockRDD.scala:168)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
 at org.apache.spark.rdd.UnionRDD.compute(UnionRDD.scala:87)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
 at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:63)
 at org.apache.spark.scheduler.Task.run(Task.scala:70)
 at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
 at
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
 at
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
 at java.lang.Thread.run(Thread.java:744)
 Caused by: java.lang.IllegalArgumentException: Seek position is past EOF: 645603894, fileSize = 0
 at tachyon.hadoop.HdfsFileInputStream.seek(HdfsFileInputStream.java:239)
 at org.apache.hadoop.fs.FSDataInputStream.seek(FSDataInputStream.java:37)
 at
 org.apache.spark.streaming.util.FileBasedWriteAheadLogRandomReader.read(FileBasedWriteAheadLogRandomReader.scala:37)
 at
 org.apache.spark.streaming.util.FileBasedWriteAheadLog.read(FileBasedWriteAheadLog.scala:104)
 at org.apache.spark.streaming.rdd.WriteAheadLogBackedBlockRDD.org
 $apache$spark$streaming$rdd$WriteAheadLogBackedBlockRDD$$getBlockFromWriteAheadLog$1(WriteAheadLogBackedBlockRDD.scala:141)
 ... 15 more

 INFO : org.apache.spark.scheduler.TaskSetManager - Starting task 2.2 in
 stage 103.0 (TID 422, 10.252.5.61, ANY, 1909 bytes)
 INFO : org.apache.spark.scheduler.TaskSetManager - Lost task 2.2 in stage
 103.0 (TID 422) on executor 10.252.5.61: org.apache.spark.SparkException
 (Could not read data from write ahead log record
 FileBasedWriteAheadLogSegment(tachyon-ft://
 10.252.5.113:19998/tachyon/checkpoint/receivedData/2/log-1431341091711-1431341151711,645603894,10891919))
 [duplicate 1]
 INFO : org.apache.spark.scheduler.TaskSetManager - Starting task 2.3 in
 stage 103.0 (TID 423, 10.252.5.62, ANY, 1909 bytes)
 INFO : org.apache.spark.deploy.client.AppClient$ClientActor - 

Re: Spark Streaming - Design considerations/Knobs

2015-05-20 Thread Tathagata Das
Correcting the ones that are incorrect or incomplete. BUT this is a good list
of things to remember about Spark Streaming.


On Wed, May 20, 2015 at 3:40 AM, Hemant Bhanawat hemant9...@gmail.com
wrote:

 Hi,

 I have compiled a list (from online sources) of knobs/design
 considerations that need to be taken care of by applications running on
 spark streaming. Is my understanding correct?  Any other important design
 consideration that I should take care of?


- A DStream is associated with a single receiver. For attaining read
parallelism multiple receivers i.e. multiple DStreams need to be created.
- A receiver is run within an executor. It occupies one core. Ensure
that there are enough cores for processing after receiver slots are booked
i.e. spark.cores.max should take the receiver slots into account.
- The receivers are allocated to executors in a round robin fashion.
- When data is received from a stream source, receiver creates blocks
of data.  A new block of data is generated every blockInterval
milliseconds. N blocks of data are created during the batchInterval where N
= batchInterval/blockInterval.
- These blocks are distributed by the BlockManager of the current
executor to the block managers of other executors. After that, the Network
Input Tracker running on the driver is informed about the block locations
for further processing.
- A RDD is created on the driver for the blocks created during the
batchInterval. The blocks generated during the batchInterval are partitions
of the RDD. Each partition is a task in spark. blockInterval==
batchinterval would mean that a single partition is created and probably it
is processed locally.

 The map tasks on the blocks are processed in the executors (one that
received the block, and another where the block was replicated) that has
the blocks irrespective of block interval, unless non-local scheduling
kicks in (as you observed next).


- Having bigger blockinterval means bigger blocks. A high value of
spark.locality.wait increases the chance of processing a block on the local
node. A balance needs to be found out between these two parameters to
ensure that the bigger blocks are processed locally.
- Instead of relying on batchInterval and blockInterval, you can
define the number of partitions by calling dstream.repartition(n). This
reshuffles the data in RDD randomly to create n number of partitions.

 Yes, for greater parallelism. Though comes at the cost of a shuffle.


- An RDD's processing is scheduled by driver's jobscheduler as a job.
At a given point of time only one job is active. So, if one job is
executing the other jobs are queued.


- If you have two dstreams there will be two RDDs formed and there
will be two jobs created which will be scheduled one after the another.


- To avoid this, you can union two dstreams. This will ensure that a
single unionRDD is formed for the two RDDs of the dstreams. This unionRDD
is then considered as a single job. However the partitioning of the RDDs is
not impacted.

 To further clarify, the jobs depend on the number of output operations
(print, foreachRDD, saveAsXFiles) and the number of RDD actions in those
output operations.

dstream1.union(dstream2).foreachRDD { rdd => rdd.count() }    // one Spark job per batch

dstream1.union(dstream2).foreachRDD { rdd => { rdd.count() ; rdd.count() } }    // TWO Spark jobs per batch

dstream1.foreachRDD { rdd => rdd.count } ; dstream2.foreachRDD { rdd => rdd.count }  // TWO Spark jobs per batch






-
- If the batch processing time is more than batchinterval then
obviously the receiver's memory will start filling up and will end up in
throwing exceptions (most probably BlockNotFoundException). Currently there
is  no way to pause the receiver.

 You can limit the rate of receiver using SparkConf config
spark.streaming.receiver.maxRate


-
- For being fully fault tolerant, spark streaming needs to enable
checkpointing. Checkpointing increases the batch processing time.

 Incomplete. There are two types of checkpointing - data and metadata. Only
 data checkpointing, which is needed by only some operations, increases batch
 processing time. Read -
http://spark.apache.org/docs/latest/streaming-programming-guide.html#checkpointing
Furthermore, with checkpointing you can recover the computation, but you may
lose some data (that was received but not processed before the driver failed)
for some sources. Enabling write-ahead logs and a reliable source + receiver
allows zero data loss. Read - WAL in
http://spark.apache.org/docs/latest/streaming-programming-guide.html#fault-tolerance-semantics
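
A sketch of the write-ahead-log setup described above (checkpoint path, host
and port are placeholders):

import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf()
  .setAppName("wal-app")
  // persist received blocks to the checkpoint directory before they are processed
  .set("spark.streaming.receiver.writeAheadLog.enable", "true")
val ssc = new StreamingContext(conf, Seconds(10))
ssc.checkpoint("hdfs:///tmp/streaming-checkpoint") // the WAL lives under this directory

// with the WAL on, in-memory replication of received data is usually unnecessary
val lines = ssc.socketTextStream("localhost", 9999, StorageLevel.MEMORY_AND_DISK_SER)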


- The frequency of metadata checkpoint cleaning can be controlled
using spark.cleaner.ttl. But, data checkpoint cleaning happens
automatically when the RDDs in the checkpoint are no more required.


 Incorrect. metadata checkpointing or 

Re: Speeding up Spark build during development

2015-05-04 Thread Tathagata Das
In addition to Michael's suggestion, in my SBT workflow I also use ~ to
automatically kick off builds and unit tests. For example,

sbt/sbt ~streaming/test-only *BasicOperationsSuite*

It will automatically detect any file changes in the project and kick off
the compilation and testing.
So my full workflow involves changing code in IntelliJ and then
continuously running unit tests in the background on the command line using
this ~.

TD


On Mon, May 4, 2015 at 2:49 PM, Michael Armbrust mich...@databricks.com
wrote:

 FWIW... My Spark SQL development workflow is usually to run build/sbt
 sparkShell or build/sbt 'sql/test-only testSuiteName'.  These commands
 starts in as little as 30s on my laptop, automatically figure out which
 subprojects need to be rebuilt, and don't require the expensive assembly
 creation.

 On Mon, May 4, 2015 at 5:48 AM, Meethu Mathew meethu.mat...@flytxt.com
 wrote:

  Hi,

   Is it really necessary to run mvn --projects assembly/ -DskipTests
  install? Could you please explain why this is needed?
  I got the changes after running mvn --projects streaming/ -DskipTests
  package.
 
  Regards,
  Meethu
 
 
  On Monday 04 May 2015 02:20 PM, Emre Sevinc wrote:
 
  Just to give you an example:
 
  When I was trying to make a small change only to the Streaming component
  of
  Spark, first I built and installed the whole Spark project (this took
  about
  15 minutes on my 4-core, 4 GB RAM laptop). Then, after having changed
  files
  only in Streaming, I ran something like (in the top-level directory):
 
  mvn --projects streaming/ -DskipTests package
 
  and then
 
  mvn --projects assembly/ -DskipTests install
 
 
  This was much faster than trying to build the whole Spark from scratch,
  because Maven was only building one component, in my case the Streaming
  component, of Spark. I think you can use a very similar approach.
 
  --
  Emre Sevinç
 
 
 
  On Mon, May 4, 2015 at 10:44 AM, Pramod Biligiri 
  pramodbilig...@gmail.com
  wrote:
 
   No, I just need to build one project at a time. Right now SparkSql.
 
  Pramod
 
  On Mon, May 4, 2015 at 12:09 AM, Emre Sevinc emre.sev...@gmail.com
  wrote:
 
   Hello Pramod,
 
  Do you need to build the whole project every time? Generally you
 don't,
  e.g., when I was changing some files that belong only to Spark
  Streaming, I
  was building only the streaming (of course after having build and
  installed
  the whole project, but that was done only once), and then the
 assembly.
  This was much faster than trying to build the whole Spark every time.
 
  --
  Emre Sevinç
 
  On Mon, May 4, 2015 at 9:01 AM, Pramod Biligiri 
  pramodbilig...@gmail.com
 
  wrote:
  Using the inbuilt maven and zinc it takes around 10 minutes for each
  build.
  Is that reasonable?
  My maven opts looks like this:
  $ echo $MAVEN_OPTS
  -Xmx12000m -XX:MaxPermSize=2048m
 
  I'm running it as build/mvn -DskipTests package
 
  Should I be tweaking my Zinc/Nailgun config?
 
  Pramod
 
  On Sun, May 3, 2015 at 3:40 PM, Mark Hamstra 
 m...@clearstorydata.com
  wrote:
 
 
 
 
 https://spark.apache.org/docs/latest/building-spark.html#building-with-buildmvn
 
  On Sun, May 3, 2015 at 2:54 PM, Pramod Biligiri 
 
  pramodbilig...@gmail.com
 
  wrote:
 
   This is great. I didn't know about the mvn script in the build
 
  directory.
 
  Pramod
 
  On Fri, May 1, 2015 at 9:51 AM, York, Brennon 
  brennon.y...@capitalone.com
  wrote:
 
   Following what Ted said, if you leverage the `mvn` from within the
  `build/` directory of Spark you¹ll get zinc for free which should
 
  help
 
  speed up build times.
 
  On 5/1/15, 9:45 AM, Ted Yu yuzhih...@gmail.com wrote:
 
   Pramod:
  Please remember to run Zinc so that the build is faster.
 
  Cheers
 
  On Fri, May 1, 2015 at 9:36 AM, Ulanov, Alexander
  alexander.ula...@hp.com
  wrote:
 
   Hi Pramod,
 
  For cluster-like tests you might want to use the same code as in
 
  mllib's
 
  LocalClusterSparkContext. You can rebuild only the package that
 
  you
 
  change
  and then run this main class.
 
  Best regards, Alexander
 
  -Original Message-
  From: Pramod Biligiri [mailto:pramodbilig...@gmail.com]
  Sent: Friday, May 01, 2015 1:46 AM
  To: dev@spark.apache.org
  Subject: Speeding up Spark build during development
 
  Hi,
  I'm making some small changes to the Spark codebase and trying
 
  it out
 
  on a
  cluster. I was wondering if there's a faster way to build than
 
  running
 
  the
  package target each time.
  Currently I'm using: mvn -DskipTests  package
 
  All the nodes have the same filesystem mounted at the same mount
 
  point.
 
  Pramod
 
   
 
  The information contained in this e-mail is confidential and/or
  proprietary to Capital One and/or its affiliates. The information
  transmitted herewith is intended only for use by the individual or
 
  entity
 
  to which it is addressed.  If the reader of 

Re: Spark Streaming updateStateByKey throws OutOfMemory Error

2015-04-22 Thread Tathagata Das
It could very well be that your executor memory is not enough to store the
state RDDs AND operate on the data. 1G per executor is quite low.
Definitely give more memory. And have you tried increasing the number of
partitions (specify number of partitions in updateStateByKey) ?
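
A short sketch of the second suggestion: pass an explicit partition count to
updateStateByKey and give executors more memory (the DStream name, updateFunc,
and the numbers are placeholders):

// stream: DStream[(String, IConcurrentUsers)], updateFunc as defined below
val state = stream.updateStateByKey(updateFunc, numPartitions = 64)

// and on submission, e.g.: spark-submit --executor-memory 4g ...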

On Wed, Apr 22, 2015 at 2:34 AM, Sourav Chandra 
sourav.chan...@livestream.com wrote:

 Anyone?

 On Wed, Apr 22, 2015 at 12:29 PM, Sourav Chandra 
 sourav.chan...@livestream.com wrote:

  Hi Olivier,
 
  the update function is as below:

  val updateFunc = (values: Seq[IConcurrentUsers], state: Option[(Long, Long)]) => {
    val previousCount = state.getOrElse((0L, 0L))._2
    var startValue: IConcurrentUsers = ConcurrentViewers(0)
    var currentCount = 0L
    val lastIndexOfConcurrentUsers =
      values.lastIndexWhere(_.isInstanceOf[ConcurrentViewers])
    val subList = values.slice(0, lastIndexOfConcurrentUsers)
    val currentCountFromSubList = subList.foldLeft(startValue)(_ op _).count + previousCount
    val lastConcurrentViewersCount = values(lastIndexOfConcurrentUsers).count

    if (math.abs(lastConcurrentViewersCount - currentCountFromSubList) >= 1) {
      logger.error(
        s"Count using state updation $currentCountFromSubList, " +
          s"ConcurrentUsers count $lastConcurrentViewersCount" +
          s" resetting to $lastConcurrentViewersCount"
      )
      currentCount = lastConcurrentViewersCount
    }
    val remainingValuesList = values.diff(subList)
    startValue = ConcurrentViewers(currentCount)
    currentCount = remainingValuesList.foldLeft(startValue)(_ op _).count

    if (currentCount < 0) {
      logger.error(
        s"ERROR: Got new count $currentCount < 0, value:$values, state:$state, resetting to 0"
      )
      currentCount = 0
    }
    // to stop pushing subsequent 0 after receiving first 0
    if (currentCount == 0 && previousCount == 0) None
    else Some(previousCount, currentCount)
  }

  trait IConcurrentUsers {
    val count: Long
    def op(a: IConcurrentUsers): IConcurrentUsers = IConcurrentUsers.op(this, a)
  }

  object IConcurrentUsers {
    def op(a: IConcurrentUsers, b: IConcurrentUsers): IConcurrentUsers = (a, b) match {
      case (_, _: ConcurrentViewers) =>
        ConcurrentViewers(b.count)
      case (_: ConcurrentViewers, _: IncrementConcurrentViewers) =>
        ConcurrentViewers(a.count + b.count)
      case (_: ConcurrentViewers, _: DecrementConcurrentViewers) =>
        ConcurrentViewers(a.count - b.count)
    }
  }

  case class IncrementConcurrentViewers(count: Long) extends IConcurrentUsers
  case class DecrementConcurrentViewers(count: Long) extends IConcurrentUsers
  case class ConcurrentViewers(count: Long) extends IConcurrentUsers
 
 
  also the error stack trace copied from executor logs is:

  java.lang.OutOfMemoryError: Java heap space
    at org.apache.hadoop.io.WritableUtils.readCompressedStringArray(WritableUtils.java:183)
    at org.apache.hadoop.conf.Configuration.readFields(Configuration.java:2564)
    at org.apache.hadoop.io.ObjectWritable.readObject(ObjectWritable.java:285)
    at org.apache.hadoop.io.ObjectWritable.readFields(ObjectWritable.java:77)
    at org.apache.spark.SerializableWritable$$anonfun$readObject$1.apply$mcV$sp(SerializableWritable.scala:43)
    at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:927)
    at org.apache.spark.SerializableWritable.readObject(SerializableWritable.scala:39)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:601)
    at java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1004)
    at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1866)
    at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1771)
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1347)
    at java.io.ObjectInputStream.readObject(ObjectInputStream.java:369)
    at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:62)
    at org.apache.spark.broadcast.TorrentBroadcast$.unBlockifyObject(TorrentBroadcast.scala:236)
    at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$readObject$1.apply$mcV$sp(TorrentBroadcast.scala:169)
    at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:927)
    at
 
 

Re: Which method do you think is better for making MIN_REMEMBER_DURATION configurable?

2015-04-08 Thread Tathagata Das
Approach 2 is definitely better  :)
Can you tell us more about the use case why you want to do this?

TD

On Wed, Apr 8, 2015 at 1:44 AM, Emre Sevinc emre.sev...@gmail.com wrote:

 Hello,

 This is about SPARK-3276 and I want to make MIN_REMEMBER_DURATION (that is
 now a constant) a variable (configurable, with a default value). Before
 spending effort on developing something and creating a pull request, I
 wanted to consult with the core developers to see which approach makes most
 sense, and has the higher probability of being accepted.

 The constant MIN_REMEMBER_DURATION can be seen at:



 https://github.com/apache/spark/blob/master/streaming/src/main/scala/org/apache/spark/streaming/dstream/FileInputDStream.scala#L338

 it is marked as private member of private[streaming] object
 FileInputDStream.

 Approach 1: Make MIN_REMEMBER_DURATION a variable, with a new name of
 minRememberDuration, and then  add a new fileStream method to
 JavaStreamingContext.scala :



 https://github.com/apache/spark/blob/master/streaming/src/main/scala/org/apache/spark/streaming/api/java/JavaStreamingContext.scala

 such that the new fileStream method accepts a new parameter, e.g.
 minRememberDuration: Int (in seconds), and then use this value to set the
 private minRememberDuration.


 Approach 2: Create a new, public Spark configuration property, e.g. named
 spark.rememberDuration.min (with a default value of 60 seconds), and then
 set the private variable minRememberDuration to the value of this Spark
 property.


 Approach 1 would mean adding a new method to the public API, Approach 2
 would mean creating a new public Spark property. Right now, approach 2
 seems more straightforward and simpler to me, but nevertheless I wanted to
 have the opinions of other developers who know the internals of Spark
 better than I do.

 Kind regards,
 Emre Sevinç
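
A sketch of what Approach 2 could look like (the property name and the
accessor are illustrative only, not the final choice):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.Seconds

// read the minimum remember duration from SparkConf, with a 60-second default
def minRememberDuration(conf: SparkConf) =
  Seconds(conf.getLong("spark.streaming.minRememberDuration", 60L))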



Re: Spark + Kinesis

2015-04-06 Thread Tathagata Das
, batchInterval)

/* Kinesis checkpoint interval.  Same as batchInterval for this example. */
val kinesisCheckpointInterval = batchInterval

/* Create the same number of Kinesis DStreams/Receivers as Kinesis stream's shards */
val kinesisStreams = (0 until numStreams).map { i =>
  KinesisUtils.createStream(ssc, streamName, endpointUrl, kinesisCheckpointInterval,
    InitialPositionInStream.LATEST, StorageLevel.MEMORY_AND_DISK_2)
}

/* Union all the streams */
val unionStreams = ssc.union(kinesisStreams).map(byteArray => new String(byteArray))
unionStreams.print()

ssc.start()
ssc.awaitTermination()
  }
}


 On Fri, Apr 3, 2015 at 3:48 PM, Tathagata Das t...@databricks.com
 wrote:

 Just remove provided for spark-streaming-kinesis-asl

  libraryDependencies += "org.apache.spark" %% "spark-streaming-kinesis-asl" % "1.3.0"
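
For reference, a sketch of how the whole build.sbt could look with that change,
so the KCL ends up in the assembly while the Spark core/streaming jars stay
provided (names and versions mirror the ones in this thread; the Scala version
is assumed to match the 2.10-built Spark artifacts):

import AssemblyKeys._

name := "Kinesis Consumer"

version := "1.0"

scalaVersion := "2.10.4"

libraryDependencies += "org.apache.spark" %% "spark-core" % "1.3.0" % "provided"

libraryDependencies += "org.apache.spark" %% "spark-streaming" % "1.3.0" % "provided"

// not "provided": pulls the KCL and its dependencies into the assembly jar
libraryDependencies += "org.apache.spark" %% "spark-streaming-kinesis-asl" % "1.3.0"

assemblySettings

jarName in assembly := "consumer-assembly.jar"

assemblyOption in assembly := (assemblyOption in assembly).value.copy(includeScala = false)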

 On Fri, Apr 3, 2015 at 12:45 PM, Vadim Bichutskiy 
 vadim.bichuts...@gmail.com wrote:

 Thanks. So how do I fix it?

 On Fri, Apr 3, 2015 at 3:43 PM, Kelly, Jonathan jonat...@amazon.com
 wrote:

   spark-streaming-kinesis-asl is not part of the Spark distribution
 on your cluster, so you cannot have it be just a provided dependency.
 This is also why the KCL and its dependencies were not included in the
 assembly (but yes, they should be).


  ~ Jonathan Kelly

   From: Vadim Bichutskiy vadim.bichuts...@gmail.com
 Date: Friday, April 3, 2015 at 12:26 PM
 To: Jonathan Kelly jonat...@amazon.com
 Cc: u...@spark.apache.org u...@spark.apache.org
 Subject: Re: Spark + Kinesis

   Hi all,

  Good news! I was able to create a Kinesis consumer and assemble it
 into an uber jar following
 http://spark.apache.org/docs/latest/streaming-kinesis-integration.html
 http://t.signauxtrois.com/e1t/c/5/f18dQhb0S7lC8dDMPbW2n0x6l2B9nMJW7t5XZs653q_MN8rBNbzRbv22W8r4TLx56dCDWf13Gc8R02?t=http%3A%2F%2Fspark.apache.org%2Fdocs%2Flatest%2Fstreaming-kinesis-integration.htmlsi=5533377798602752pi=8898f7f0-ede6-4e6c-9bfd-00cf2101f0e9
 and example
 https://github.com/apache/spark/blob/master/extras/kinesis-asl/src/main/scala/org/apache/spark/examples/streaming/KinesisWordCountASL.scala
 http://t.signauxtrois.com/e1t/c/5/f18dQhb0S7lC8dDMPbW2n0x6l2B9nMJW7t5XZs653q_MN8rBNbzRbv22W8r4TLx56dCDWf13Gc8R02?t=https%3A%2F%2Fgithub.com%2Fapache%2Fspark%2Fblob%2Fmaster%2Fextras%2Fkinesis-asl%2Fsrc%2Fmain%2Fscala%2Forg%2Fapache%2Fspark%2Fexamples%2Fstreaming%2FKinesisWordCountASL.scalasi=5533377798602752pi=8898f7f0-ede6-4e6c-9bfd-00cf2101f0e9
 .

  However when I try to spark-submit it I get the following exception:

  *Exception in thread main java.lang.NoClassDefFoundError:
 com/amazonaws/auth/AWSCredentialsProvider*

  Do I need to include KCL dependency in *build.sbt*, here's what it
 looks like currently:

  import AssemblyKeys._
  name := "Kinesis Consumer"
  version := "1.0"
  organization := "com.myconsumer"
  scalaVersion := "2.11.5"

  libraryDependencies += "org.apache.spark" %% "spark-core" % "1.3.0" % "provided"
  libraryDependencies += "org.apache.spark" %% "spark-streaming" % "1.3.0" % "provided"
  libraryDependencies += "org.apache.spark" %% "spark-streaming-kinesis-asl" % "1.3.0" % "provided"

  assemblySettings
  jarName in assembly := "consumer-assembly.jar"
  assemblyOption in assembly := (assemblyOption in assembly).value.copy(includeScala = false)

  Any help appreciated.

  Thanks,
 Vadim

 On Thu, Apr 2, 2015 at 1:15 PM, Kelly, Jonathan jonat...@amazon.com
 wrote:

  It looks like you're attempting to mix Scala versions, so that's
 going to cause some problems.  If you really want to use Scala 2.11.5, 
 you
 must also use Spark package versions built for Scala 2.11 rather than
 2.10.  Anyway, that's not quite the correct way to specify Scala
 dependencies in build.sbt.  Instead of placing the Scala version after 
 the
 artifactId (like spark-core_2.10), what you actually want is to use 
 just
 spark-core with two percent signs before it.  Using two percent signs
 will make it use the version of Scala that matches your declared
 scalaVersion.  For example:

  libraryDependencies += "org.apache.spark" %% "spark-core" % "1.3.0" % "provided"

  libraryDependencies += "org.apache.spark" %% "spark-streaming" % "1.3.0" % "provided"

  libraryDependencies += "org.apache.spark" %% "spark-streaming-kinesis-asl" % "1.3.0"

  I think that may get you a little closer, though I think you're
 probably going to run into the same problems I ran into in this thread:
 https://www.mail-archive.com/user@spark.apache.org/msg23891.html  I
 never really got an answer for that, and I temporarily moved on to other
 things for now.


  ~ Jonathan Kelly

   From: 'Vadim Bichutskiy' vadim.bichuts...@gmail.com
 Date: Thursday, April 2, 2015 at 9:53 AM
 To: u...@spark.apache.org u...@spark.apache.org
 Subject: Spark + Kinesis

   Hi all,

  I am trying to write an Amazon Kinesis consumer Scala app that
 processes data in the
 Kinesis stream. Is this the correct way to specify build.sbt:

  ---
 import AssemblyKeys._
 name := "Kinesis Consumer"
 version := "1.0"

Re: Spark 2.0: Rearchitecting Spark for Mobile, Local, Social

2015-04-01 Thread Tathagata Das
This is a significant effort that Reynold has undertaken, and I am super
glad to see that it's finally taking a concrete form. Would love to see
what the community thinks about the idea.

TD

On Wed, Apr 1, 2015 at 3:11 AM, Reynold Xin r...@databricks.com wrote:

 Hi Spark devs,

 I've spent the last few months investigating the feasibility of
 re-architecting Spark for mobile platforms, considering the growing
 population of Android/iOS users. I'm happy to share with you my findings at
 https://issues.apache.org/jira/browse/SPARK-6646

 The tl;dr is that we should support running Spark on Android/iOS, and the
 best way to do this at the moment is to use Scala.js to compile Spark code
 into JavaScript, and then run it in Safari or Chrome (and even node.js
 potentially for servers).

 If you are on your phones right now and prefer reading a blog post rather
 than a PDF file, you can read more about the design doc at

 https://databricks.com/blog/2015/04/01/spark-2-rearchitecting-spark-for-mobile.html


 This is done in collaboration with TD, Xiangrui, Patrick. Look forward to
 your feedback!



Re: [VOTE] Release Apache Spark 1.3.0 (RC3)

2015-03-06 Thread Tathagata Das
To add to what Patrick said, the only reason that those JIRAs are marked as
Blockers (at least I can say for myself) is so that they are at the top of
the JIRA list signifying that these are more *immediate* issues than all
the Critical issues. To make it less confusing for the community voting, we
can definitely add a filter that ignores Documentation issues from the JIRA
list.


On Fri, Mar 6, 2015 at 1:17 PM, Patrick Wendell pwend...@gmail.com wrote:

 Sean,

 The docs are distributed and consumed in a fundamentally different way
 than Spark code itself. So we've always considered the deadline for
 doc changes to be when the release is finally posted.

 If there are small inconsistencies with the docs present in the source
 code for that release tag, IMO that doesn't matter much since we don't
 even distribute the docs with Spark's binary releases and virtually no
 one builds and hosts the docs on their own (that I am aware of, at
 least). Perhaps we can recommend if people want to build the doc
 sources that they should always grab the head of the most recent
 release branch, to set expectations accordingly.

 In the past we haven't considered it worth holding up the release
 process for the purpose of the docs. It just doesn't make sense since
 they are consumed as a service. If we decide to change this
 convention, it would mean shipping our releases later, since we
 couldn't pipeline the doc finalization with voting.

 - Patrick

 On Fri, Mar 6, 2015 at 11:02 AM, Sean Owen so...@cloudera.com wrote:
  Given the title and tagging, it sounds like there could be some
  must-have doc changes to go with what is being released as 1.3. It can
  be finished later, and published later, but then the docs source
  shipped with the release doesn't match the site, and until then, 1.3
  is released without some must-have docs for 1.3 on the site.
 
  The real question to me is: are there any further, absolutely
  essential doc changes that need to accompany 1.3 or not?
 
  If not, just resolve these. If there are, then it seems like the
  release has to block on them. If there are some docs that should have
  gone in for 1.3, but didn't, but aren't essential, well I suppose it
  bears thinking about how to not slip as much work, but it doesn't
  block.
 
  I think Documentation issues certainly can be a blocker and shouldn't
  be specially ignored.
 
 
  BTW the UISeleniumSuite issue is a real failure, but I do not think it
  is serious: http://issues.apache.org/jira/browse/SPARK-6205  It isn't
  a regression from 1.2.x, but only affects tests, and only affects a
  subset of build profiles.
 
 
 
 
  On Fri, Mar 6, 2015 at 6:43 PM, Patrick Wendell pwend...@gmail.com
 wrote:
  Hey Sean,
 
  SPARK-5310 Update SQL programming guide for 1.3
  SPARK-5183 Document data source API
  SPARK-6128 Update Spark Streaming Guide for Spark 1.3
 
  For these, the issue is that they are documentation JIRA's, which
  don't need to be timed exactly with the release vote, since we can
  update the documentation on the website whenever we want. In the past
  I've just mentally filtered these out when considering RC's. I see a
  few options here:
 
  1. We downgrade such issues away from Blocker (more clear, but we risk
  losing them in the fray if they really are things we want to have
  before the release is posted).
  2. We provide a filter to the community that excludes 'Documentation'
  issues and shows all other blockers for 1.3. We can put this on the
  wiki, for instance.
 
  Which do you prefer?
 
  - Patrick

 -
 To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
 For additional commands, e-mail: dev-h...@spark.apache.org




Re: [VOTE] Release Apache Spark 1.3.0 (RC1)

2015-02-23 Thread Tathagata Das
Hey all,

I found a major issue where JobProgressListener (a listener used to keep
track of jobs for the web UI) never forgets stages in one of its data
structures. This is a blocker for long running applications.
https://issues.apache.org/jira/browse/SPARK-5967

I am testing a fix for this right now.

TD

On Mon, Feb 23, 2015 at 7:23 PM, Soumitra Kumar kumar.soumi...@gmail.com
wrote:

 +1 (non-binding)

 For: https://issues.apache.org/jira/browse/SPARK-3660

 . Docs OK
 . Example code is good

 -Soumitra.


 On Mon, Feb 23, 2015 at 10:33 AM, Marcelo Vanzin van...@cloudera.com
 wrote:

  Hi Tom, are you using an sbt-built assembly by any chance? If so, take
  a look at SPARK-5808.
 
  I haven't had any problems with the maven-built assembly. Setting
  SPARK_HOME on the executors is a workaround if you want to use the sbt
  assembly.
 
  On Fri, Feb 20, 2015 at 2:56 PM, Tom Graves
  tgraves...@yahoo.com.invalid wrote:
   Trying to run pyspark on yarn in client mode with basic wordcount
  example I see the following error when doing the collect:
   Error from python worker:
     /usr/bin/python: No module named sql
   PYTHONPATH was:
     /grid/3/tmp/yarn-local/usercache/tgraves/filecache/20/spark-assembly-1.3.0-hadoop2.6.0.1.1411101121.jar
   java.io.EOFException
     at java.io.DataInputStream.readInt(DataInputStream.java:392)
     at org.apache.spark.api.python.PythonWorkerFactory.startDaemon(PythonWorkerFactory.scala:163)
     at org.apache.spark.api.python.PythonWorkerFactory.createThroughDaemon(PythonWorkerFactory.scala:86)
     at org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:62)
     at org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:105)
     at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:69)
     at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
     at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
     at org.apache.spark.api.python.PairwiseRDD.compute(PythonRDD.scala:308)
     at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
     at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
     at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
     at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
     at org.apache.spark.scheduler.Task.run(Task.scala:64)
     at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:197)
     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
     at java.lang.Thread.run(Thread.java:722)
   any ideas on this?
   Tom
  
On Wednesday, February 18, 2015 2:14 AM, Patrick Wendell 
  pwend...@gmail.com wrote:
  
  
Please vote on releasing the following candidate as Apache Spark
  version 1.3.0!
  
   The tag to be voted on is v1.3.0-rc1 (commit f97b0d4a):
  
 
 https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=f97b0d4a6b26504916816d7aefcf3132cd1da6c2
  
   The release files, including signatures, digests, etc. can be found at:
   http://people.apache.org/~pwendell/spark-1.3.0-rc1/
  
   Release artifacts are signed with the following key:
   https://people.apache.org/keys/committer/pwendell.asc
  
   The staging repository for this release can be found at:
  
 https://repository.apache.org/content/repositories/orgapachespark-1069/
  
   The documentation corresponding to this release can be found at:
   http://people.apache.org/~pwendell/spark-1.3.0-rc1-docs/
  
   Please vote on releasing this package as Apache Spark 1.3.0!
  
   The vote is open until Saturday, February 21, at 08:03 UTC and passes
   if a majority of at least 3 +1 PMC votes are cast.
  
   [ ] +1 Release this package as Apache Spark 1.3.0
   [ ] -1 Do not release this package because ...
  
   To learn more about Apache Spark, please see
   http://spark.apache.org/
  
   == How can I help test this release? ==
   If you are a Spark user, you can help us test this release by
   taking a Spark 1.2 workload and running on this release candidate,
   then reporting any regressions.
  
   == What justifies a -1 vote for this release? ==
   This vote is happening towards the end of the 1.3 QA period,
   so -1 votes should only occur for significant regressions from 1.2.1.
   Bugs already present in 1.2.X, minor regressions, or bugs related
   to new features will not block this release.
  
   - Patrick
  
   -
   To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
   For additional commands, e-mail: dev-h...@spark.apache.org
  
  
  
  
 
 
 
  --
  Marcelo
 
  -
  To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
  For additional commands, e-mail: dev-h...@spark.apache.org
 
 



Re: When will Spark Streaming supports Kafka-simple consumer API?

2015-02-04 Thread Tathagata Das
1. There is already a third-party low-level kafka receiver -
http://spark-packages.org/package/5
2. There is a new experimental Kafka stream that will be available in Spark
1.3 release. This is based on the low level API, and might suffice your
purpose. JIRA - https://issues.apache.org/jira/browse/SPARK-4964

Can you elaborate on why you have to use SimpleConsumer in your environment?

TD


On Wed, Feb 4, 2015 at 7:44 PM, Xuelin Cao xuelincao2...@gmail.com wrote:

 Hi,

  In our environment, Kafka can only be used with simple consumer API,
 like storm spout does.

  And, also, I found there are suggestions that  Kafka connector of
 Spark should not be used in production
 http://markmail.org/message/2lb776ta5sq6lgtw because it is based on the
 high-level consumer API of Kafka.

 So, my question is, when will spark streaming supports Kafka simple
 consumer API?



Re: missing document of several messages in actor-based receiver?

2015-01-09 Thread Tathagata Das
It was not really meant to be hidden. So it's essentially a case of the
documentation being insufficient. This code has not gotten much attention
for a while, so it could have bugs. If you find any and submit a fix for
them, I am happy to take a look!

TD

On Thu, Jan 8, 2015 at 6:33 PM, Nan Zhu zhunanmcg...@gmail.com wrote:

  Hi, TD and other streaming developers,

 When I look at the implementation of actor-based receiver
 (ActorReceiver.scala), I found that there are several messages which are
 not mentioned in the document

 case props: Props =>
   val worker = context.actorOf(props)
   logInfo("Started receiver worker at:" + worker.path)
   sender ! worker

 case (props: Props, name: String) =>
   val worker = context.actorOf(props, name)
   logInfo("Started receiver worker at:" + worker.path)
   sender ! worker

 case _: PossiblyHarmful => hiccups.incrementAndGet()

 case _: Statistics =>
   val workers = context.children
   sender ! Statistics(n.get, workers.size, hiccups.get, workers.mkString("\n"))


 Is it hidden intentionally, is the documentation incomplete, or did I miss something?

 And are the handlers of these messages buggy? E.g. when we start a new
 worker, we don't increase n (the counter of children), and n and hiccups are
 unnecessarily declared as AtomicInteger?

 Best,

 --
 Nan Zhu
 http://codingcat.me



Re: Spark Streaming Data flow graph

2015-01-05 Thread Tathagata Das
Hey François,

Well, at a high-level here is what I thought about the diagram.

- ReceiverSupervisor handles only one Receiver.
- BlockGenerator is part of ReceiverSupervisor not ReceivedBlockHandler
- The blocks are inserted in BlockManager and if activated,
WriteAheadLogManager in parallel, not through BlockManager as the
diagram seems to imply
- It would be good to have a clean visual separation of what runs in
Executor (better term than Worker) and what is in Driver ... Driver
stuff on left and Executor stuff on right, or vice versa.

More importantly, the word of caution is that all the internal stuff
like ReceiverBlockHandler, Supervisor, etc are subject to change any
time as we keep refactoring stuff. So highlighting these internal
details too much too publicly may lead to future confusion.

TD

On Thu, Dec 18, 2014 at 11:04 AM,  francois.garil...@typesafe.com wrote:
 I’ve been trying to produce an updated box diagram to refresh :
 http://www.slideshare.net/spark-project/deep-divewithsparkstreaming-tathagatadassparkmeetup20130617/26


 … after the SPARK-3129, and other switches (a surprising number of comments 
 still mention NetworkReceiver).


 Here’s what I have so far:
 https://www.dropbox.com/s/q79taoce2ywdmf1/SparkStreaming.pdf?dl=0


 This is not supposed to respect any particular convention (ER, ORM, …). Data 
 flow up to right before RDD creation is in bold arrows, metadata flow is in 
 normal width arrows.


 This diagram is still very much a WIP (see below : todo), but I wanted to 
 share it to ask:
 - what’s wrong ?
 - what are the glaring omissions ?
 - how can I make this better (i.e. what should I add first to the Todo-list 
 below) ?


 I’ll be happy to share this (including sources) with whoever asks for it.


 Todo :
 - mark private/public classes
 - mark queues in Receiver, ReceivedBlockHandler, BlockManager
 - mark type of info on transport : e.g. Actor message, ReceivedBlockInfo



 —
 François Garillot

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: Which committers care about Kafka?

2014-12-29 Thread Tathagata Das
Hey all,

Some wrap up thoughts on this thread.

Let me first reiterate what Patrick said, that Kafka is super super
important as it forms the largest fraction of Spark Streaming user
base. So we really want to improve the Kafka + Spark Streaming
integration. To this end, some of the things that need to be
considered can be broadly classified into the following, to sort of
facilitate the discussion.

1. Data rate control
2. Receiver failure semantics - partially achieving this gives
at-least once, completely achieving this gives exactly-once
3. Driver failure semantics - partially achieving this gives at-least
once, completely achieving this gives exactly-once

Here is a run down of what is achieved by different implementations
(based on what I think).

1. Prior to WAL in Spark 1.2, the KafkaReceiver could handle 3, could
handle 1 partially (some duplicate data), and could NOT handle 2 (all
previously received data lost).

2. In Spark 1.2 with WAL enabled, the Saisai's ReliableKafkaReceiver
can handle 3, can almost completely handle 1 and 2 (except few corner
cases which prevents it from completely guaranteeing exactly-once).

3. I believe Dibyendu's solution (correct me if I am wrong) can handle
1 and 2 perfectly. And 3 can be partially solved with WAL, or possibly
completely solved by extending the solution further.

4. Cody's solution (again, correct me if I am wrong) does not use
receivers at all (so eliminates 2). It can handle 3 completely for
simple operations like map and filter, but not sure if it works
completely for stateful ops like windows and updateStateByKey. Also it
does not handle 1.

The real challenge for Kafka is in achieving 3 completely for stateful
operations while also handling 1.  (i.e., use receivers, but still get
driver failure guarantees). Solving this will give us our holy grail
solution, and this is what I want to achieve.

On that note, Cody submitted a PR on his style of achieving
exactly-once semantics - https://github.com/apache/spark/pull/3798 . I
am reviewing it. Please follow the PR if you are interested.

TD

On Wed, Dec 24, 2014 at 11:59 PM, Cody Koeninger c...@koeninger.org wrote:
 The conversation was mostly getting TD up to speed on this thread since he
 had just gotten back from his trip and hadn't seen it.

 The jira has a summary of the requirements we discussed, I'm sure TD or
 Patrick can add to the ticket if I missed something.
 On Dec 25, 2014 1:54 AM, Hari Shreedharan hshreedha...@cloudera.com
 wrote:

 In general such discussions happen or is posted on the dev lists. Could
 you please post a summary? Thanks.

 Thanks,
 Hari


 On Wed, Dec 24, 2014 at 11:46 PM, Cody Koeninger c...@koeninger.org
 wrote:

  After a long talk with Patrick and TD (thanks guys), I opened the
 following jira

 https://issues.apache.org/jira/browse/SPARK-4964

 Sample PR has an impementation for the batch and the dstream case, and a
 link to a project with example usage.

 On Fri, Dec 19, 2014 at 4:36 PM, Koert Kuipers ko...@tresata.com wrote:

 yup, we at tresata do the idempotent store the same way. very simple
 approach.

 On Fri, Dec 19, 2014 at 5:32 PM, Cody Koeninger c...@koeninger.org
 wrote:

 That KafkaRDD code is dead simple.

 Given a user specified map

 (topic1, partition0) -> (startingOffset, endingOffset)
 (topic1, partition1) -> (startingOffset, endingOffset)
 ...
 turn each one of those entries into a partition of an rdd, using the
 simple
 consumer.
 That's it.  No recovery logic, no state, nothing - for any failures,
 bail
 on the rdd and let it retry.
 Spark stays out of the business of being a distributed database.
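
 A minimal sketch of that shape (not part of the original message; the class
 and method names are illustrative, and the actual Kafka fetch is stubbed out):

   import org.apache.spark.{Partition, SparkContext, TaskContext}
   import org.apache.spark.rdd.RDD

   // One (topic, partition, startingOffset, endingOffset) entry per RDD partition.
   case class OffsetRange(topic: String, partition: Int, fromOffset: Long, untilOffset: Long)

   class KafkaOffsetPartition(val index: Int, val range: OffsetRange) extends Partition

   class SimpleKafkaRDD(sc: SparkContext, ranges: Seq[OffsetRange])
     extends RDD[(String, String)](sc, Nil) {

     // Each user-specified offset range becomes exactly one partition.
     override def getPartitions: Array[Partition] =
       ranges.zipWithIndex.map { case (r, i) => new KafkaOffsetPartition(i, r) }.toArray

     override def compute(split: Partition, context: TaskContext): Iterator[(String, String)] = {
       val range = split.asInstanceOf[KafkaOffsetPartition].range
       // Placeholder: a real implementation would fetch exactly the messages in
       // [range.fromOffset, range.untilOffset) with Kafka's simple consumer and
       // throw on any error, letting Spark retry the whole partition.
       Iterator.empty
     }
   }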

 The client code does any transformation it wants, then stores the data
 and
 offsets.  There are two ways of doing this, either based on idempotence
 or
 a transactional data store.

 For idempotent stores:

 1.manipulate data
 2.save data to store
 3.save ending offsets to the same store

 If you fail between 2 and 3, the offsets haven't been stored, you start
 again at the same beginning offsets, do the same calculations in the
 same
 order, overwrite the same data, all is good.


 For transactional stores:

 1. manipulate data
 2. begin transaction
 3. save data to the store
 4. save offsets
 5. commit transaction

 If you fail before 5, the transaction rolls back.  To make this less
 heavyweight, you can write the data outside the transaction and then
 update
 a pointer to the current data inside the transaction.


 Again, spark has nothing much to do with guaranteeing exactly once.  In
 fact, the current streaming api actively impedes my ability to do the
 above.  I'm just suggesting providing an api that doesn't get in the
 way of
 exactly-once.
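
 A minimal sketch of the idempotent-store pattern described above (not from the
 original message; the in-memory store and key scheme are hypothetical stand-ins):

   import scala.collection.mutable

   object IdempotentSink {
     // Hypothetical store where keyed writes are overwrite-safe (idempotent).
     private val dataStore   = mutable.Map[String, String]()
     private val offsetStore = mutable.Map[(String, Int), Long]()

     def processBatch(records: Seq[(String, String)],
                      endingOffsets: Map[(String, Int), Long]): Unit = {
       // 1. manipulate data
       val transformed = records.map { case (k, v) => (k, v.trim.toLowerCase) }
       // 2. save data, keyed deterministically, so a replay overwrites the same rows
       transformed.foreach { case (k, v) => dataStore(k) = v }
       // 3. save the ending offsets to the same store, last
       endingOffsets.foreach { case (tp, off) => offsetStore(tp) = off }
       // A failure between 2 and 3 leaves the offsets unchanged, so the batch is
       // re-run from the same starting offsets and overwrites the same data.
     }
   }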





 On Fri, Dec 19, 2014 at 3:57 PM, Hari Shreedharan 
 hshreedha...@cloudera.com
  wrote:

  Can you explain your basic algorithm for the once-only-delivery? It is
  quite a bit of very Kafka-specific code, that would take more time to
 read
  than I can currently afford? If you can explain your 

Re: HA support for Spark

2014-12-11 Thread Tathagata Das
Spark Streaming essentially does this by saving the DAG of DStreams, which
can deterministically regenerate the DAG of RDDs upon recovery from
failure. Along with that the progress information (which batches have
finished, which batches are queued, etc.) is also saved, so that upon
recovery the system can restart from where it was before failure. This was
conceptually easy to do because the RDDs are very deterministically
generated in every batch. Extending this to a very general Spark program
with arbitrary RDD computations is definitely conceptually possible but not
that easy to do.
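
As an illustration of that recovery path in Spark Streaming (a sketch, not from
the original message; the checkpoint directory and batch interval are assumptions):

  import org.apache.spark.SparkConf
  import org.apache.spark.streaming.{Seconds, StreamingContext}

  object CheckpointedApp {
    val checkpointDir = "hdfs:///tmp/streaming-checkpoint"  // assumed path

    def createContext(): StreamingContext = {
      val conf = new SparkConf().setAppName("CheckpointedApp")
      val ssc = new StreamingContext(conf, Seconds(10))
      ssc.checkpoint(checkpointDir)  // persists the DStream DAG and batch progress
      // Define the DStreams here so they become part of the checkpointed DAG.
      ssc
    }

    def main(args: Array[String]): Unit = {
      // After a driver failure, getOrCreate rebuilds the DAG and progress
      // from the checkpoint instead of calling createContext again.
      val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
      ssc.start()
      ssc.awaitTermination()
    }
  }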

On Wed, Dec 10, 2014 at 7:34 PM, Jun Feng Liu liuj...@cn.ibm.com wrote:

 Right, perhaps we also need to preserve some DAG information? I am wondering
 if there is any existing work around this.


 Sandy Ryza sandy.r...@cloudera.com
 2014-12-11 01:34

 To: Jun Feng Liu/China/IBM@IBMCN
 Cc: Reynold Xin r...@databricks.com, dev@spark.apache.org
 Subject: Re: HA support for Spark

 I think that if we were able to maintain the full set of created RDDs as
 well as some scheduler and block manager state, it would be enough for most
 apps to recover.

 On Wed, Dec 10, 2014 at 5:30 AM, Jun Feng Liu liuj...@cn.ibm.com wrote:

  Well, it should not be mission impossible, considering there are so many HA
  solutions existing today. I would be interested to know if there is any
  specific difficulty.
 
  Best Regards
 
 
  Jun Feng Liu
  IBM China Systems & Technology Laboratory in Beijing
  --
  Phone: 86-10-82452683
  E-mail: liuj...@cn.ibm.com

  BLD 28, ZGC Software Park
  No.8 Rd.Dong Bei Wang West, Dist.Haidian Beijing 100193
  China
 
 
 
 
 
   Reynold Xin r...@databricks.com
   2014/12/10 16:30
   To: Jun Feng Liu/China/IBM@IBMCN
   Cc: dev@spark.apache.org
   Subject: Re: HA support for Spark
 
 
 
 
  This would be plausible for specific purposes such as Spark streaming or
  Spark SQL, but I don't think it is doable for general Spark driver since
 it
  is just a normal JVM process with arbitrary program state.
 
  On Wed, Dec 10, 2014 at 12:25 AM, Jun Feng Liu liuj...@cn.ibm.com
 wrote:
 
    Do we have any high availability support at the Spark driver level? For
    example, can the Spark driver move to another node and continue execution
    when a failure happens? I can see that RDD checkpointing can help to
    serialize the status of RDDs. I can imagine loading the checkpoint from
    another node when an error happens, but it seems we would lose track of
    all task statuses, or even the executor information maintained in the
    Spark context. I am not sure if there is any existing stuff I can
    leverage to do that. Thanks for any suggestions.
  
   Best Regards
  
  
    Jun Feng Liu
    IBM China Systems & Technology Laboratory in Beijing
    --
    Phone: 86-10-82452683
    E-mail: liuj...@cn.ibm.com

    BLD 28, ZGC Software Park
    No.8 Rd.Dong Bei Wang West, Dist.Haidian Beijing 100193
    China
  
  
  
  
  
 
 




Re: Replacing Spark's native scheduler with Sparrow

2014-11-10 Thread Tathagata Das
Too bad Nick, I don't have anything immediately ready that tests Spark
Streaming with those extreme settings. :)

On Mon, Nov 10, 2014 at 9:56 AM, Nicholas Chammas
nicholas.cham...@gmail.com wrote:
 On Sun, Nov 9, 2014 at 1:51 AM, Tathagata Das tathagata.das1...@gmail.com
 wrote:

 This causes a scalability vs. latency tradeoff - if your limit is 1000
 tasks per second (simplifying from 1500), you could either configure
 it to use 100 receivers at 100 ms batches (10 blocks/sec), or 1000
 receivers at 1 second batches.


 This raises an interesting question, TD.

 Do we have a benchmark for Spark Streaming that tests it at the extreme for
 some key metric, perhaps processed messages per second per node? Something
 that would press Spark's ability to process tasks quickly enough.

 Given such a benchmark, it would probably be interesting to see how -- if at
 all -- Sparrow has an impact on Spark Streaming's performance.

 Nick

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: [VOTE] Designating maintainers for some Spark components

2014-11-07 Thread Tathagata Das
+1 (binding)

I agree with the proposal that it just formalizes what we have been
doing till now, and will increase the efficiency and focus of the
review process.

To address Davies' concern, I agree coding style is often a hot topic
of contention. But that is just an indication that our processes are
not perfect and we have much room to improve (which is what this
proposal is all about). Regarding the specific case of coding style,
we should all get together, discuss, and make our coding style guide
more comprehensive so that such concerns can be dealt with once and
not be a recurring concern. And that guide will override any one's
personal preference, be it the maintainer or a new committer.

TD


On Fri, Nov 7, 2014 at 3:18 PM, Davies Liu dav...@databricks.com wrote:
 -1 (not binding, +1 for maintainer, -1 for sign off)

 Agree with Greg and Vinod. In the beginning, everything is better
 (more efficient, more focused), but after some time, fighting begins.

 Code style is the hottest topic to fight over (we already saw it in some
 PRs). If two committers (one of them the maintainer) have not reached an
 agreement on code style, then before this process they would ask for comments
 from other committers, but after this process the maintainer has a
 higher priority to -1, so the maintainer will keep his/her personal
 preference and it is hard to reach an agreement. Finally, different
 components will have different code styles (or other differences).

 Right now, maintainers are kind of first contact or best contacts, the
 best person to review the PR in that component. We could announce it,
 then new contributors can easily find the right one to review.

 My 2 cents.

 Davies


 On Thu, Nov 6, 2014 at 11:43 PM, Vinod Kumar Vavilapalli
 vino...@apache.org wrote:
 With the maintainer model, the process is as follows:

 - Any committer could review the patch and merge it, but they would need to 
 forward it to me (or another core API maintainer) to make sure we also 
 approve
 - At any point during this process, I could come in and -1 it, or give 
 feedback
 - In addition, any other committer beyond me is still allowed to -1 this 
 patch

 The only change in this model is that committers are responsible to forward 
 patches in these areas to certain other committers. If every committer had 
 perfect oversight of the project, they could have also seen every patch to 
 their component on their own, but this list ensures that they see it even 
 if they somehow overlooked it.


 Having done the job of playing an informal 'maintainer' of a project myself, 
 this is what I think you really need:

 The so called 'maintainers' do one of the below
  - Actively poll the lists and watch over contributions. And follow what is 
 repeated often around here: Trust but verify.
  - Setup automated mechanisms to send all bug-tracker updates of a specific 
 component to a list that people can subscribe to

 And/or
  - Individual contributors send review requests to unofficial 'maintainers' 
 over dev-lists or through tools. Like many projects do with review boards 
 and other tools.

 Note that none of the above is a required step. It must not be, that's the 
 point. But once set as a convention, they will all help you address your 
 concerns with project scalability.

 Anything else that you add is bestowing privileges to a select few and 
 forming dictatorships. And contrary to what the proposal claims, this is 
 neither scalable nor conforming to Apache governance rules.

 +Vinod

 -
 To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
 For additional commands, e-mail: dev-h...@spark.apache.org


-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: [VOTE] Release Apache Spark 1.1.0 (RC4)

2014-09-06 Thread Tathagata Das
+1

Tested streaming integration with flume on a local test bed.


On Thu, Sep 4, 2014 at 6:08 PM, Kan Zhang kzh...@apache.org wrote:

 +1

 Compiled, ran newly-introduced PySpark Hadoop input/output examples.


 On Thu, Sep 4, 2014 at 1:10 PM, Egor Pahomov pahomov.e...@gmail.com
 wrote:

  +1
 
  Compiled, ran on yarn-hadoop-2.3 simple job.
 
 
  2014-09-04 22:22 GMT+04:00 Henry Saputra henry.sapu...@gmail.com:
 
   LICENSE and NOTICE files are good
   Hash files are good
   Signature files are good
   No 3rd parties executables
   Source compiled
   Run local and standalone tests
   Test persist off heap with Tachyon looks good
  
   +1
  
   - Henry
  
   On Wed, Sep 3, 2014 at 12:24 AM, Patrick Wendell pwend...@gmail.com
   wrote:
Please vote on releasing the following candidate as Apache Spark
  version
   1.1.0!
   
The tag to be voted on is v1.1.0-rc4 (commit 2f9b2bd):
   
  
 
 https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=2f9b2bd7844ee8393dc9c319f4fefedf95f5e460
   
The release files, including signatures, digests, etc. can be found
 at:
http://people.apache.org/~pwendell/spark-1.1.0-rc4/
   
Release artifacts are signed with the following key:
https://people.apache.org/keys/committer/pwendell.asc
   
The staging repository for this release can be found at:
   
  https://repository.apache.org/content/repositories/orgapachespark-1031/
   
The documentation corresponding to this release can be found at:
http://people.apache.org/~pwendell/spark-1.1.0-rc4-docs/
   
Please vote on releasing this package as Apache Spark 1.1.0!
   
The vote is open until Saturday, September 06, at 08:30 UTC and
 passes
  if
a majority of at least 3 +1 PMC votes are cast.
   
[ ] +1 Release this package as Apache Spark 1.1.0
[ ] -1 Do not release this package because ...
   
To learn more about Apache Spark, please see
http://spark.apache.org/
   
== Regressions fixed since RC3 ==
SPARK-3332 - Issue with tagging in EC2 scripts
SPARK-3358 - Issue with regression for m3.XX instances
   
== What justifies a -1 vote for this release? ==
This vote is happening very late into the QA period compared with
previous votes, so -1 votes should only occur for significant
regressions from 1.0.2. Bugs already present in 1.0.X will not block
this release.
   
== What default changes should I be aware of? ==
1. The default value of spark.io.compression.codec is now snappy
-- Old behavior can be restored by switching to lzf
   
2. PySpark now performs external spilling during aggregations.
-- Old behavior can be restored by setting spark.shuffle.spill to
   false.
   
3. PySpark uses a new heuristic for determining the parallelism of
shuffle operations.
-- Old behavior can be restored by setting
spark.default.parallelism to the number of cores in the cluster.
   
-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org
   
  
   -
   To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
   For additional commands, e-mail: dev-h...@spark.apache.org
  
  
 
 
  --
 
 
 
  *Sincerely yoursEgor PakhomovScala Developer, Yandex*
 



Re: Dependency hell in Spark applications

2014-09-05 Thread Tathagata Das
If httpClient dependency is coming from Hive, you could build Spark without
Hive. Alternatively, have you tried excluding httpclient from
spark-streaming dependency in your sbt/maven project?
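
For the sbt route, a minimal sketch of what such an exclusion could look like
(not from the original message; versions here are illustrative):

  // build.sbt fragment: drop the transitive httpclient pulled in via spark-streaming,
  // then depend directly on the httpclient version the AWS SDK expects.
  libraryDependencies += "org.apache.spark" %% "spark-streaming" % "1.0.2" exclude(
    "org.apache.httpcomponents", "httpclient")

  libraryDependencies += "org.apache.httpcomponents" % "httpclient" % "4.2"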

TD



On Thu, Sep 4, 2014 at 6:42 AM, Koert Kuipers ko...@tresata.com wrote:

 custom spark builds should not be the answer. at least not if spark ever
 wants to have a vibrant community for spark apps.

 spark does support a user-classpath-first option, which would deal with
 some of these issues, but I don't think it works.
 On Sep 4, 2014 9:01 AM, Felix Garcia Borrego fborr...@gilt.com wrote:

  Hi,
  I ran into the same issue and, apart from the ideas Aniket mentioned, I could
  only find a nasty workaround: add my custom PoolingClientConnectionManager
  to my classpath.
 
 
 
 http://stackoverflow.com/questions/24788949/nosuchmethoderror-while-running-aws-s3-client-on-spark-while-javap-shows-otherwi/25488955#25488955
 
 
 
  On Thu, Sep 4, 2014 at 11:43 AM, Sean Owen so...@cloudera.com wrote:
 
   Dumb question -- are you using a Spark build that includes the Kinesis
   dependency? that build would have resolved conflicts like this for
   you. Your app would need to use the same version of the Kinesis client
   SDK, ideally.
  
   All of these ideas are well-known, yes. In cases of super-common
   dependencies like Guava, they are already shaded. This is a
   less-common source of conflicts so I don't think http-client is
   shaded, especially since it is not used directly by Spark. I think
   this is a case of your app conflicting with a third-party dependency?
  
   I think OSGi is deemed too over the top for things like this.
  
   On Thu, Sep 4, 2014 at 11:35 AM, Aniket Bhatnagar
   aniket.bhatna...@gmail.com wrote:
I am trying to use Kinesis as source to Spark Streaming and have run
   into a
dependency issue that can't be resolved without making my own custom
   Spark
build. The issue is that Spark is transitively dependent
on org.apache.httpcomponents:httpclient:jar:4.1.2 (I think because of
libfb303 coming from hbase and hive-serde) whereas AWS SDK is
 dependent
on org.apache.httpcomponents:httpclient:jar:4.2. When I package and
 run
Spark Streaming application, I get the following:
   
Caused by: java.lang.NoSuchMethodError:
   
  
 
 org.apache.http.impl.conn.DefaultClientConnectionOperator.init(Lorg/apache/http/conn/scheme/SchemeRegistry;Lorg/apache/http/conn/DnsResolver;)V
at
   
  
 
 org.apache.http.impl.conn.PoolingClientConnectionManager.createConnectionOperator(PoolingClientConnectionManager.java:140)
at
   
  
 
 org.apache.http.impl.conn.PoolingClientConnectionManager.init(PoolingClientConnectionManager.java:114)
at
   
  
 
 org.apache.http.impl.conn.PoolingClientConnectionManager.init(PoolingClientConnectionManager.java:99)
at
   
  
 
 com.amazonaws.http.ConnectionManagerFactory.createPoolingClientConnManager(ConnectionManagerFactory.java:29)
at
   
  
 
 com.amazonaws.http.HttpClientFactory.createHttpClient(HttpClientFactory.java:97)
at
com.amazonaws.http.AmazonHttpClient.init(AmazonHttpClient.java:181)
at
   
  
 
 com.amazonaws.AmazonWebServiceClient.init(AmazonWebServiceClient.java:119)
at
   
  
 
 com.amazonaws.AmazonWebServiceClient.init(AmazonWebServiceClient.java:103)
at
   
  
 
 com.amazonaws.services.kinesis.AmazonKinesisClient.init(AmazonKinesisClient.java:136)
at
   
  
 
 com.amazonaws.services.kinesis.AmazonKinesisClient.init(AmazonKinesisClient.java:117)
at
   
  
 
 com.amazonaws.services.kinesis.AmazonKinesisAsyncClient.init(AmazonKinesisAsyncClient.java:132)
   
I can create a custom Spark build with
org.apache.httpcomponents:httpclient:jar:4.2 included in the assembly
   but I
was wondering if this is something Spark devs have noticed and are
   looking
to resolve in near releases. Here are my thoughts on this issue:
   
Containers that allow running custom user code have to often resolve
dependency issues in case of conflicts between framework's and user
   code's
dependency. Here is how I have seen some frameworks resolve the
 issue:
1. Provide a child-first class loader: Some JEE containers provided a
child-first class loader that allowed for loading classes from user
  code
first. I don't think this approach completely solves the problem as
 the
framework is then susceptible to class mismatch errors.
2. Fold in all dependencies in a sub-package: This approach involves
folding all dependencies in a project specific sub-package (like
spark.dependencies). This approach is tedious because it involves
   building
custom version of all dependencies (and their transitive
 dependencies)
3. Use something like OSGi: Some frameworks has successfully used
 OSGi
  to
manage dependencies between the modules. The challenge in this
 approach
   is
to 

Re: spark-ec2 1.0.2 creates EC2 cluster at wrong version

2014-08-26 Thread Tathagata Das
Yes, this was an oversight on my part. I have opened a JIRA for this.
https://issues.apache.org/jira/browse/SPARK-3242

For the time being the workaround should be providing the version 1.0.2
explicitly as part of the script.

TD


On Tue, Aug 26, 2014 at 6:39 PM, Matei Zaharia matei.zaha...@gmail.com
wrote:

 This shouldn't be a chicken-and-egg problem, since the script fetches the
 AMI from a known URL. Seems like an issue in publishing this release.

 On August 26, 2014 at 1:24:45 PM, Shivaram Venkataraman (
 shiva...@eecs.berkeley.edu) wrote:

 This is a chicken and egg problem in some sense. We can't change the ec2
 script till we have made the release and uploaded the binaries -- But once
 that is done, we can't update the script.

 I think the model we support so far is that you can launch the latest
 spark version from the master branch on github. I guess we can try to add
 something in the release process that updates the script but doesn't commit
 it ? The release managers might be able to add more.

 Thanks
 Shivaram


 On Tue, Aug 26, 2014 at 1:16 PM, Nicholas Chammas 
 nicholas.cham...@gmail.com wrote:

  I downloaded the source code release for 1.0.2 from here
  http://spark.apache.org/downloads.html and launched an EC2 cluster
 using
  spark-ec2.
 
  After the cluster finishes launching, I fire up the shell and check the
  version:
 
  scala sc.version
  res1: String = 1.0.1
 
  The startup banner also shows the same thing. Hmm...
 
  So I dig around and find that the spark_ec2.py script has the default
 Spark
  version set to 1.0.1.
 
  Derp.
 
  parser.add_option("-v", "--spark-version", default="1.0.1",
      help="Version of Spark to use: 'X.Y.Z' or a specific git hash")
 
  Is there any way to fix the release? It’s a minor issue, but could be
 very
  confusing. And how can we prevent this from happening again?
 
  Nick
  ​
 



Re: reference to dstream in package org.apache.spark.streaming which is not available

2014-08-22 Thread Tathagata Das
Figured it out. Fixing this ASAP.

TD


On Fri, Aug 22, 2014 at 5:51 PM, Patrick Wendell pwend...@gmail.com wrote:

 Hey All,

 We can sort this out ASAP. Many of the Spark committers were at a company
 offsite for the last 72 hours, so sorry that it is broken.

 - Patrick


 On Fri, Aug 22, 2014 at 4:07 PM, Hari Shreedharan 
 hshreedha...@cloudera.com
  wrote:

  Sean - I think only the ones in 1726 are enough. It is weird that any
  class that uses the test-jar actually requires the streaming jar to be
  added explicitly. Shouldn't maven take care of this?
 
  I posted some comments on the PR.
 
  --
 
  Thanks,
  Hari
 
 
   Sean Owen mailto:so...@cloudera.com
  August 22, 2014 at 3:58 PM
 
  Yes, master hasn't compiled for me for a few days. It's fixed in:
 
  https://github.com/apache/spark/pull/1726
  https://github.com/apache/spark/pull/2075
 
  Could a committer sort this out?
 
  Sean
 
 
  -
  To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
  For additional commands, e-mail: dev-h...@spark.apache.org
 
  Ted Yu mailto:yuzhih...@gmail.com
  August 22, 2014 at 1:55 PM
 
  Hi,
  Using the following command on (refreshed) master branch:
  mvn clean package -DskipTests
 
  I got:
 
  constituent[36]: file:/homes/hortonzy/apache-maven-3.1.1/conf/logging/
  ---
  java.lang.reflect.InvocationTargetException
  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
  at
  sun.reflect.NativeMethodAccessorImpl.invoke(
  NativeMethodAccessorImpl.java:57)
  at
  sun.reflect.DelegatingMethodAccessorImpl.invoke(
  DelegatingMethodAccessorImpl.java:43)
  at java.lang.reflect.Method.invoke(Method.java:606)
  at
  org.codehaus.plexus.classworlds.launcher.Launcher.
  launchEnhanced(Launcher.java:289)
  at
  org.codehaus.plexus.classworlds.launcher.Launcher.
  launch(Launcher.java:229)
  at
  org.codehaus.plexus.classworlds.launcher.Launcher.
  mainWithExitCode(Launcher.java:415)
  at org.codehaus.plexus.classworlds.launcher.Launcher.
  main(Launcher.java:356)
  Caused by: scala.reflect.internal.Types$TypeError: bad symbolic
  reference.
  A signature in TestSuiteBase.class refers to term dstream
  in package org.apache.spark.streaming which is not available.
  It may be completely missing from the current classpath, or the version
 on
  the classpath might be incompatible with the version used when compiling
  TestSuiteBase.class.
  at
  scala.reflect.internal.pickling.UnPickler$Scan.
  toTypeError(UnPickler.scala:847)
  at
  scala.reflect.internal.pickling.UnPickler$Scan$LazyTypeRef.complete(
  UnPickler.scala:854)
  at scala.reflect.internal.Symbols$Symbol.info(Symbols.scala:1231)
  at
  scala.reflect.internal.Types$TypeMap$$anonfun$noChangeToSymbols$1.apply(
  Types.scala:4280)
  at
  scala.reflect.internal.Types$TypeMap$$anonfun$noChangeToSymbols$1.apply(
  Types.scala:4280)
  at
  scala.collection.LinearSeqOptimized$class.forall(LinearSeqOptimized.
  scala:70)
  at scala.collection.immutable.List.forall(List.scala:84)
  at scala.reflect.internal.Types$TypeMap.noChangeToSymbols(
  Types.scala:4280)
  at scala.reflect.internal.Types$TypeMap.mapOver(Types.scala:4293)
  at scala.reflect.internal.Types$TypeMap.mapOver(Types.scala:4196)
  at scala.reflect.internal.Types$AsSeenFromMap.apply(Types.scala:4638)
  at scala.reflect.internal.Types$TypeMap.mapOver(Types.scala:4202)
  at scala.reflect.internal.Types$AsSeenFromMap.apply(Types.scala:4638)
  at scala.reflect.internal.Types$Type.asSeenFrom(Types.scala:754)
  at scala.reflect.internal.Types$Type.memberInfo(Types.scala:773)
  at xsbt.ExtractAPI.defDef(ExtractAPI.scala:224)
  at xsbt.ExtractAPI.xsbt$ExtractAPI$$definition(ExtractAPI.scala:315)
  at
  xsbt.ExtractAPI$$anonfun$xsbt$ExtractAPI$$processDefinitions$1.apply(
  ExtractAPI.scala:296)
  at
  xsbt.ExtractAPI$$anonfun$xsbt$ExtractAPI$$processDefinitions$1.apply(
  ExtractAPI.scala:296)
  at
  scala.collection.TraversableLike$$anonfun$flatMap$1.apply(
  TraversableLike.scala:251)
  at
  scala.collection.TraversableLike$$anonfun$flatMap$1.apply(
  TraversableLike.scala:251)
  at
  scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.
  scala:33)
  at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
  at scala.collection.TraversableLike$class.flatMap(
  TraversableLike.scala:251)
  at scala.collection.mutable.ArrayOps$ofRef.flatMap(ArrayOps.scala:108)
  at xsbt.ExtractAPI.xsbt$ExtractAPI$$processDefinitions(ExtractAPI.
  scala:296)
  at xsbt.ExtractAPI$$anonfun$mkStructure$4.apply(ExtractAPI.scala:293)
  at xsbt.ExtractAPI$$anonfun$mkStructure$4.apply(ExtractAPI.scala:293)
  at xsbt.Message$$anon$1.apply(Message.scala:8)
  at xsbti.SafeLazy$$anonfun$apply$1.apply(SafeLazy.scala:8)
  at xsbti.SafeLazy$Impl._t$lzycompute(SafeLazy.scala:20)
  at xsbti.SafeLazy$Impl._t(SafeLazy.scala:18)
  at xsbti.SafeLazy$Impl.get(SafeLazy.scala:24)
  at
 

Re: reference to dstream in package org.apache.spark.streaming which is not available

2014-08-22 Thread Tathagata Das
The real fix is that the Spark sink suite does not really need to use
the spark-streaming test jars. Removing that dependency altogether, and
submitting a PR.

TD


On Fri, Aug 22, 2014 at 6:34 PM, Tathagata Das tathagata.das1...@gmail.com
wrote:

 Figured it out. Fixing this ASAP.

 TD


 On Fri, Aug 22, 2014 at 5:51 PM, Patrick Wendell pwend...@gmail.com
 wrote:

 Hey All,

 We can sort this out ASAP. Many of the Spark committers were at a company
 offsite for the last 72 hours, so sorry that it is broken.

 - Patrick


 On Fri, Aug 22, 2014 at 4:07 PM, Hari Shreedharan 
 hshreedha...@cloudera.com
  wrote:

  Sean - I think only the ones in 1726 are enough. It is weird that any
  class that uses the test-jar actually requires the streaming jar to be
  added explicitly. Shouldn't maven take care of this?
 
  I posted some comments on the PR.
 
  --
 
  Thanks,
  Hari
 
 
   Sean Owen mailto:so...@cloudera.com
  August 22, 2014 at 3:58 PM
 
  Yes, master hasn't compiled for me for a few days. It's fixed in:
 
  https://github.com/apache/spark/pull/1726
  https://github.com/apache/spark/pull/2075
 
  Could a committer sort this out?
 
  Sean
 
 
  -
  To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
  For additional commands, e-mail: dev-h...@spark.apache.org
 
  Ted Yu mailto:yuzhih...@gmail.com
  August 22, 2014 at 1:55 PM
 
  Hi,
  Using the following command on (refreshed) master branch:
  mvn clean package -DskipTests
 
  I got:
 
  constituent[36]: file:/homes/hortonzy/apache-maven-3.1.1/conf/logging/
  ---
  java.lang.reflect.InvocationTargetException
  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
  at
  sun.reflect.NativeMethodAccessorImpl.invoke(
  NativeMethodAccessorImpl.java:57)
  at
  sun.reflect.DelegatingMethodAccessorImpl.invoke(
  DelegatingMethodAccessorImpl.java:43)
  at java.lang.reflect.Method.invoke(Method.java:606)
  at
  org.codehaus.plexus.classworlds.launcher.Launcher.
  launchEnhanced(Launcher.java:289)
  at
  org.codehaus.plexus.classworlds.launcher.Launcher.
  launch(Launcher.java:229)
  at
  org.codehaus.plexus.classworlds.launcher.Launcher.
  mainWithExitCode(Launcher.java:415)
  at org.codehaus.plexus.classworlds.launcher.Launcher.
  main(Launcher.java:356)
  Caused by: scala.reflect.internal.Types$TypeError: bad symbolic
  reference.
  A signature in TestSuiteBase.class refers to term dstream
  in package org.apache.spark.streaming which is not available.
  It may be completely missing from the current classpath, or the
 version on
  the classpath might be incompatible with the version used when
 compiling
  TestSuiteBase.class.
  at
  scala.reflect.internal.pickling.UnPickler$Scan.
  toTypeError(UnPickler.scala:847)
  at
  scala.reflect.internal.pickling.UnPickler$Scan$LazyTypeRef.complete(
  UnPickler.scala:854)
  at scala.reflect.internal.Symbols$Symbol.info(Symbols.scala:1231)
  at
 
 scala.reflect.internal.Types$TypeMap$$anonfun$noChangeToSymbols$1.apply(
  Types.scala:4280)
  at
 
 scala.reflect.internal.Types$TypeMap$$anonfun$noChangeToSymbols$1.apply(
  Types.scala:4280)
  at
  scala.collection.LinearSeqOptimized$class.forall(LinearSeqOptimized.
  scala:70)
  at scala.collection.immutable.List.forall(List.scala:84)
  at scala.reflect.internal.Types$TypeMap.noChangeToSymbols(
  Types.scala:4280)
  at scala.reflect.internal.Types$TypeMap.mapOver(Types.scala:4293)
  at scala.reflect.internal.Types$TypeMap.mapOver(Types.scala:4196)
  at scala.reflect.internal.Types$AsSeenFromMap.apply(Types.scala:4638)
  at scala.reflect.internal.Types$TypeMap.mapOver(Types.scala:4202)
  at scala.reflect.internal.Types$AsSeenFromMap.apply(Types.scala:4638)
  at scala.reflect.internal.Types$Type.asSeenFrom(Types.scala:754)
  at scala.reflect.internal.Types$Type.memberInfo(Types.scala:773)
  at xsbt.ExtractAPI.defDef(ExtractAPI.scala:224)
  at xsbt.ExtractAPI.xsbt$ExtractAPI$$definition(ExtractAPI.scala:315)
  at
  xsbt.ExtractAPI$$anonfun$xsbt$ExtractAPI$$processDefinitions$1.apply(
  ExtractAPI.scala:296)
  at
  xsbt.ExtractAPI$$anonfun$xsbt$ExtractAPI$$processDefinitions$1.apply(
  ExtractAPI.scala:296)
  at
  scala.collection.TraversableLike$$anonfun$flatMap$1.apply(
  TraversableLike.scala:251)
  at
  scala.collection.TraversableLike$$anonfun$flatMap$1.apply(
  TraversableLike.scala:251)
  at
  scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.
  scala:33)
  at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
  at scala.collection.TraversableLike$class.flatMap(
  TraversableLike.scala:251)
  at scala.collection.mutable.ArrayOps$ofRef.flatMap(ArrayOps.scala:108)
  at xsbt.ExtractAPI.xsbt$ExtractAPI$$processDefinitions(ExtractAPI.
  scala:296)
  at xsbt.ExtractAPI$$anonfun$mkStructure$4.apply(ExtractAPI.scala:293)
  at xsbt.ExtractAPI$$anonfun$mkStructure$4.apply(ExtractAPI.scala:293)
  at xsbt.Message

Re: failed to build spark with maven for both 1.0.1 and latest master branch

2014-07-31 Thread Tathagata Das
Does a mvn clean or sbt/sbt clean help?

TD

On Wed, Jul 30, 2014 at 9:25 PM, yao yaosheng...@gmail.com wrote:
 Hi Folks,

 Today I am trying to build Spark using Maven; however, the following
 command failed consistently for both 1.0.1 and the latest master. (BTW, it
 seems sbt works fine: sbt/sbt -Dhadoop.version=2.4.0 -Pyarn clean assembly.)

 Environment: Mac OS Mavericks
 Maven: 3.2.2 (installed by homebrew)




 export M2_HOME=/usr/local/Cellar/maven/3.2.2/libexec/
 export PATH=$M2_HOME/bin:$PATH
 export MAVEN_OPTS="-Xmx2g -XX:MaxPermSize=512M -XX:ReservedCodeCacheSize=512m"
 mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -DskipTests clean package

 Build outputs:

 [INFO] Scanning for projects...
 [INFO]
 
 [INFO] Reactor Build Order:
 [INFO]
 [INFO] Spark Project Parent POM
 [INFO] Spark Project Core
 [INFO] Spark Project Bagel
 [INFO] Spark Project GraphX
 [INFO] Spark Project ML Library
 [INFO] Spark Project Streaming
 [INFO] Spark Project Tools
 [INFO] Spark Project Catalyst
 [INFO] Spark Project SQL
 [INFO] Spark Project Hive
 [INFO] Spark Project REPL
 [INFO] Spark Project YARN Parent POM
 [INFO] Spark Project YARN Stable API
 [INFO] Spark Project Assembly
 [INFO] Spark Project External Twitter
 [INFO] Spark Project External Kafka
 [INFO] Spark Project External Flume
 [INFO] Spark Project External ZeroMQ
 [INFO] Spark Project External MQTT
 [INFO] Spark Project Examples
 [INFO]
 [INFO]
 
 [INFO] Building Spark Project Parent POM 1.0.1
 [INFO]
 
 [INFO]
 [INFO] --- maven-clean-plugin:2.5:clean (default-clean) @ spark-parent ---
 [INFO]
 [INFO] --- maven-enforcer-plugin:1.3.1:enforce (enforce-versions) @
 spark-parent ---
 [INFO]
 [INFO] --- build-helper-maven-plugin:1.8:add-source (add-scala-sources) @
 spark-parent ---
 [INFO] Source directory:
 /Users/syao/git/grid/thirdparty/spark/src/main/scala added.
 [INFO]
 [INFO] --- maven-remote-resources-plugin:1.5:process (default) @
 spark-parent ---
 [INFO]
 [INFO] --- scala-maven-plugin:3.1.6:add-source (scala-compile-first) @
 spark-parent ---
 [INFO] Add Test Source directory:
 /Users/syao/git/grid/thirdparty/spark/src/test/scala
 [INFO]
 [INFO] --- scala-maven-plugin:3.1.6:compile (scala-compile-first) @
 spark-parent ---
 [INFO] No sources to compile
 [INFO]
 [INFO] --- build-helper-maven-plugin:1.8:add-test-source
 (add-scala-test-sources) @ spark-parent ---
 [INFO] Test Source directory:
 /Users/syao/git/grid/thirdparty/spark/src/test/scala added.
 [INFO]
 [INFO] --- scala-maven-plugin:3.1.6:testCompile (scala-test-compile-first)
 @ spark-parent ---
 [INFO] No sources to compile
 [INFO]
 [INFO] --- maven-site-plugin:3.3:attach-descriptor (attach-descriptor) @
 spark-parent ---
 [INFO]
 [INFO] --- maven-source-plugin:2.2.1:jar-no-fork (create-source-jar) @
 spark-parent ---
 [INFO]
 [INFO]
 
 [INFO] Building Spark Project Core 1.0.1
 [INFO]
 
 [INFO]
 [INFO] --- maven-clean-plugin:2.5:clean (default-clean) @ spark-core_2.10
 ---
 [INFO]
 [INFO] --- maven-enforcer-plugin:1.3.1:enforce (enforce-versions) @
 spark-core_2.10 ---
 [INFO]
 [INFO] --- build-helper-maven-plugin:1.8:add-source (add-scala-sources) @
 spark-core_2.10 ---
 [INFO] Source directory:
 /Users/syao/git/grid/thirdparty/spark/core/src/main/scala added.
 [INFO]
 [INFO] --- maven-remote-resources-plugin:1.5:process (default) @
 spark-core_2.10 ---
 [INFO]
 [INFO] --- exec-maven-plugin:1.2.1:exec (default) @ spark-core_2.10 ---
 Archive:  lib/py4j-0.8.1-src.zip
   inflating: build/py4j/tests/java_map_test.py
  extracting: build/py4j/tests/__init__.py
   inflating: build/py4j/tests/java_gateway_test.py
   inflating: build/py4j/tests/java_callback_test.py
   inflating: build/py4j/tests/java_list_test.py
   inflating: build/py4j/tests/byte_string_test.py
   inflating: build/py4j/tests/multithreadtest.py
   inflating: build/py4j/tests/java_array_test.py
   inflating: build/py4j/tests/py4j_callback_example2.py
   inflating: build/py4j/tests/py4j_example.py
   inflating: build/py4j/tests/py4j_callback_example.py
   inflating: build/py4j/tests/finalizer_test.py
   inflating: build/py4j/tests/java_set_test.py
   inflating: build/py4j/finalizer.py
  extracting: build/py4j/__init__.py
   inflating: build/py4j/java_gateway.py
   inflating: build/py4j/protocol.py
   inflating: build/py4j/java_collections.py
  extracting: build/py4j/version.py
   inflating: build/py4j/compat.py
 [INFO]
 [INFO] --- maven-resources-plugin:2.6:resources (default-resources) @
 spark-core_2.10 ---
 [INFO] Using 'UTF-8' encoding to copy filtered resources.
 [INFO] Copying 6 resources
 [INFO] Copying 20 resources
 [INFO] Copying 7 resources
 [INFO] Copying 3 

Re: Tranforming flume events using Spark transformation functions

2014-07-22 Thread Tathagata Das
This is because of the RDD's lazy evaluation! Unless you force a
transformed (mapped/filtered/etc.) RDD to give you back some data (like
RDD.count) or output the data (like RDD.saveAsTextFile()), Spark will not
do anything.

So after the eventData.map(...), if you do take(10) and then print the
result, you should see 10 items from each batch printed.

Also, FYI, you can do the same map operation on the DStream as well.

inputDStream.map(...).foreachRDD(...) is equivalent to
 inputDStream.foreachRDD( // call rdd.map(...) )

Either way you have to call some RDD action (count, collect, take,
saveAsHadoopFile, etc.)  that asks the system to something concrete with
the data.
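
A small sketch of that pattern (not from the original message; the socket
source, host, and port are assumptions for illustration):

  import org.apache.spark.SparkConf
  import org.apache.spark.streaming.{Seconds, StreamingContext}

  object LazyEvalDemo {
    def main(args: Array[String]): Unit = {
      val conf = new SparkConf().setAppName("LazyEvalDemo").setMaster("local[2]")
      val ssc = new StreamingContext(conf, Seconds(5))
      val lines = ssc.socketTextStream("localhost", 9999)  // assumed test source

      // The map alone does nothing until an action runs on each batch's RDD.
      lines.map(_.toUpperCase).foreachRDD { rdd =>
        rdd.take(10).foreach(println)  // take() is the action that forces the work
      }

      ssc.start()
      ssc.awaitTermination()
    }
  }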

TD




On Tue, Jul 22, 2014 at 1:55 PM, Sundaram, Muthu X. 
muthu.x.sundaram@sabre.com wrote:

 I tried to map SparkFlumeEvents to an RDD of Strings like below, but that map
 and call are not executed at all. I might be doing this the wrong way. Any
 help would be appreciated.

 flumeStream.foreach(new Function<JavaRDD<SparkFlumeEvent>, Void>() {
   @Override
   public Void call(JavaRDD<SparkFlumeEvent> eventsData) throws Exception {
     System.out.println("Inside for each...call");

     JavaRDD<String> records = eventsData.map(
       new Function<SparkFlumeEvent, String>() {
         @Override
         public String call(SparkFlumeEvent flume) throws Exception {
           String logRecord = null;
           AvroFlumeEvent avroEvent = null;
           ByteBuffer bytePayload = null;

           System.out.println("Inside Map..call");
           /* List<SparkFlumeEvent> events = flume.collect();
              Iterator<SparkFlumeEvent> batchedEvents = events.iterator();
              SparkFlumeEvent flumeEvent = batchedEvents.next(); */
           avroEvent = flume.event();
           bytePayload = avroEvent.getBody();
           logRecord = new String(bytePayload.array());

           System.out.println("Record is " + logRecord);

           return logRecord;
         }
       });
     return null;
   }

 -Original Message-
 From: Sundaram, Muthu X. [mailto:muthu.x.sundaram@sabre.com]
 Sent: Tuesday, July 22, 2014 10:24 AM
 To: u...@spark.apache.org; d...@spark.incubator.apache.org
 Subject: Tranforming flume events using Spark transformation functions

 Hi All,
   I am getting events from Flume using the following line:

   JavaDStream<SparkFlumeEvent> flumeStream = FlumeUtils.createStream(ssc, host, port);

 Each event is a delimited record. I would like to use some of the transformation
 functions like map and reduce on this. Do I need to convert the
 JavaDStream<SparkFlumeEvent> to JavaDStream<String>, or can I apply these
 functions directly on it?

 I need to do following kind of operations

  AA
 YDelta
 TAA
  Southwest
  AA

 Unique tickets are  , Y, , , .
 Count is  2,  1, T 1 and so on...
 AA - 2 tickets(Should not count the duplicate), Delta - 1 ticket,
 Southwest - 1 ticket.

 I have to do transformations like this. Right now I am able to receive
 records, but I am struggling to transform them using Spark transformation
 functions since they are not of type JavaRDD<String>.

 Can I create a new JavaRDD<String>? How do I create a new JavaRDD?

 I loop through  the events like below

 flumeStream.foreach(new Function<JavaRDD<SparkFlumeEvent>, Void>() {
   @Override
   public Void call(JavaRDD<SparkFlumeEvent> eventsData) throws Exception {
     String logRecord = null;
     List<SparkFlumeEvent> events = eventsData.collect();
     Iterator<SparkFlumeEvent> batchedEvents = events.iterator();
     long t1 = System.currentTimeMillis();
     AvroFlumeEvent avroEvent = null;
     ByteBuffer bytePayload = null;
     // All the user level data is carried as payload in Flume Event
     while (batchedEvents.hasNext()) {
       SparkFlumeEvent flumeEvent = batchedEvents.next();
       avroEvent = flumeEvent.event();
       bytePayload = avroEvent.getBody();
       logRecord = new String(bytePayload.array());

       System.out.println("LOG RECORD = " + logRecord);
     }

 Where do I create the new JavaRDD<String>? Do I do it before this loop? How do
 I create this JavaRDD<String>?
 In the loop I am able to get every record and I am able to print them.

 I appreciate any help here.

 Thanks,
 Muthu





Re: [brainstorming] Generalization of DStream, a ContinuousRDD ?

2014-07-16 Thread Tathagata Das
I think it makes sense, though without a concrete implementation it's hard
to be sure. Applying sorting to the records within each RDD makes sense,
but I can think of two kinds of fundamental problems.

1. How do you deal with ordering across RDD boundaries? Say two consecutive
RDDs in the DStream have the following record timestamps: RDD 1: [1, 2, 3,
4, 6, 7], RDD 2: [5, 8, 9, 10]. And you want to run a function through
all these records in timestamp order. I am curious to see how this
problem can be solved without sacrificing efficiency (e.g., I can imagine
doing multiple-pass magic).

2. An even more fundamental question is how you ensure ordering with
delayed records. If you want to process in order of application time and
records are delayed, how do you deal with them?
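
(A tiny sketch of the easy half only: sorting within each batch via DStream.transform. The (Long, String) record type is an assumption, and neither cross-RDD ordering nor late records, the two problems above, is addressed.)

  import org.apache.spark.streaming.dstream.DStream

  // Sort each micro-batch by an application timestamp carried in the record.
  def sortWithinBatches(stream: DStream[(Long, String)]): DStream[(Long, String)] =
    stream.transform(rdd => rdd.sortBy { case (timestamp, _) => timestamp })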

Any ideas? ;)

TD



On Wed, Jul 16, 2014 at 2:37 AM, andy petrella andy.petre...@gmail.com
wrote:

 Heya TD,

 Thanks for the detailed answer! Much appreciated.

 Regarding order among elements within an RDD, you're definitely right:
 it'd kill the parallelism and would require synchronization, which is completely
 avoided in a distributed env.

 That's why I won't push this constraint to the RDDs themselves; actually,
 only the Space is something that *defines* ordered elements, and thus there
 are two functions that will break the RDDs based on a given (extensible,
 pluggable) heuristic, f.i.
 Since the Space is rather decoupled from the data, and thus from the source and the
 partitions, it's the responsibility of the CRDD implementation to dictate
 how (if necessary) the elements should be sorted in the RDDs... which will
 require some shuffles :-s -- or the couple (source, space) is something
 intrinsically ordered (like it is for DStream).

 To be more concrete, an RDD would be composed of an unordered iterator of
 millions of events whose timestamps all land in the same time
 interval.

 WDYT, would that make sense?

 thanks again for the answer!

 greetz

  aℕdy ℙetrella
 about.me/noootsab

 http://about.me/noootsab


 On Wed, Jul 16, 2014 at 12:33 AM, Tathagata Das 
 tathagata.das1...@gmail.com
  wrote:

  Very interesting ideas Andy!
 
  Conceptually i think it makes sense. In fact, it is true that dealing
 with
  time series data, windowing over application time, windowing over number
 of
  events, are things that DStream does not natively support. The real
  challenge is actually mapping the conceptual windows with the underlying
  RDD model. On aspect you correctly observed in the ordering of events
  within the RDDs of the DStream. Another fundamental aspect is the fact
 that
  RDDs as parallel collections, with no well-defined ordering in the
 records
  in the RDDs. If you want to process the records in an RDD as a ordered
  stream of events, you kind of have to process the stream sequentially,
  which means you have to process each RDD partition one-by-one, and
  therefore lose the parallelism. So implementing all these functionality
 may
  mean adding functionality at the cost of performance. Whether that is
 okay
  for Spark Streaming to have these OR this tradeoff is not-intuitive for
  end-users and therefore should not come out-of-the-box with Spark
 Streaming
  -- that is a definitely a question worth debating upon.
 
  That said, for some limited usecases, like windowing over N events, can
 be
  implemented using custom RDDs like SlidingRDD
  
 
 https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/rdd/SlidingRDD.scala
  
  without
  losing parallelism. For things like app time based windows, and
  random-application-event based windows, its much harder.
 
  Interesting ideas nonetheless. I am curious to see how far we can push
  using the RDD model underneath, without losing parallelism and
 performance.
 
  TD
 
 
 
  On Tue, Jul 15, 2014 at 10:11 AM, andy petrella andy.petre...@gmail.com
 
  wrote:
 
   Dear Sparkers,
  
   *[sorry for the lengthy email... = head to the gist
   https://gist.github.com/andypetrella/12228eb24eea6b3e1389 for a
  preview
   :-p**]*
  
   I would like to share some thinking I had due to a use case I faced.
   Basically, as the subject announced it, it's a generalization of the
   DStream currently available in the streaming project.
   First of all, I'd like to say that it's only a result of some personal
   thinking, alone in the dark with a use case, the spark code, a sheet of
   paper and a poor pen.
  
  
   DStream is a very great concept to deal with micro-batching use cases,
  and
   it does it very well too!
   Also, it hardly relies on the elapsing time to create its internal
   micro-batches.
   However, there are similar use cases where we need micro-batches where
  this
   constraint on the time doesn't hold, here are two of them:
   * a micro-batch has to be created every *n* events received
   * a micro-batch has to be generate based on the values of the items
  pushed
   by the source (which might

Re: Does RDD checkpointing store the entire state in HDFS?

2014-07-16 Thread Tathagata Das
After every checkpointing interval, the latest state RDD is stored to HDFS
in its entirety. Along with that, the series of DStream transformations
that was setup with the streaming context is also stored into HDFS (the
whole DAG of DStream objects is serialized and saved).
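
(A minimal sketch of the setup in question, with placeholder source and checkpoint path: ssc.checkpoint names the HDFS directory, and the state produced by updateStateByKey is what gets written out in full at each checkpoint interval.)

  import org.apache.spark.SparkConf
  import org.apache.spark.streaming.{Seconds, StreamingContext}

  object StateCheckpointSketch {
    def main(args: Array[String]): Unit = {
      val ssc = new StreamingContext(
        new SparkConf().setAppName("state-checkpoint-sketch"), Seconds(10))
      ssc.checkpoint("hdfs:///tmp/spark-checkpoints")   // hypothetical HDFS path

      // Hypothetical source; any DStream of words would do here.
      val words = ssc.socketTextStream("localhost", 9999).flatMap(_.split(" "))

      // The running counts are the state that is saved in full to HDFS at
      // every checkpoint interval, along with the serialized DStream DAG.
      // (Very old releases also needed: import org.apache.spark.streaming.StreamingContext._)
      val counts = words.map(w => (w, 1L)).updateStateByKey[Long] {
        (newValues: Seq[Long], state: Option[Long]) =>
          Some(state.getOrElse(0L) + newValues.sum)
      }
      counts.print()

      ssc.start()
      ssc.awaitTermination()
    }
  }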

TD


On Wed, Jul 16, 2014 at 5:38 PM, Yan Fang yanfang...@gmail.com wrote:

 Hi guys,

 I am wondering how the RDD checkpointing
 (https://spark.apache.org/docs/latest/streaming-programming-guide.html#RDD
 Checkpointing) works in Spark Streaming. When I use updateStateByKey, does
 Spark store the entire state (at one time point) into HDFS or only
 put the transformation into HDFS? Thank you.

 Best,

 Fang, Yan
 yanfang...@gmail.com
 +1 (206) 849-4108



Re: [brainstorming] Generalization of DStream, a ContinuousRDD ?

2014-07-15 Thread Tathagata Das
Very interesting ideas Andy!

Conceptually I think it makes sense. In fact, it is true that dealing with
time series data, windowing over application time, and windowing over a number of
events are things that DStream does not natively support. The real
challenge is actually mapping the conceptual windows onto the underlying
RDD model. One aspect you correctly observed is the ordering of events
within the RDDs of the DStream. Another fundamental aspect is the fact that
RDDs are parallel collections, with no well-defined ordering of the records
in the RDDs. If you want to process the records in an RDD as an ordered
stream of events, you kind of have to process the stream sequentially,
which means you have to process each RDD partition one by one, and
therefore lose the parallelism. So implementing all this functionality may
mean adding features at the cost of performance. Whether that is okay
for Spark Streaming to have, or whether this tradeoff is not intuitive for
end users and therefore should not come out of the box with Spark Streaming
-- that is definitely a question worth debating.

That said, some limited use cases, like windowing over N events, can be
implemented using custom RDDs like SlidingRDD
(https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/rdd/SlidingRDD.scala)
without losing parallelism. For things like app-time-based windows and
random-application-event-based windows, it's much harder.
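
(Not the SlidingRDD approach itself, just a rough illustration of the idea: a window of N consecutive events can be approximated per partition with plain Scala iterators, at the cost of losing windows that straddle partition boundaries.)

  import org.apache.spark.rdd.RDD

  // Windows of 3 consecutive events, computed independently inside each
  // partition with Iterator.sliding; windows crossing partition boundaries
  // are simply missed, which is what SlidingRDD exists to fix.
  def windowsOf3[T](rdd: RDD[T]): RDD[Seq[T]] =
    rdd.mapPartitions(_.sliding(3).map(_.toSeq))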

Interesting ideas nonetheless. I am curious to see how far we can push
using the RDD model underneath, without losing parallelism and performance.

TD



On Tue, Jul 15, 2014 at 10:11 AM, andy petrella andy.petre...@gmail.com
wrote:

 Dear Sparkers,

 *[sorry for the lengthy email... => head to the gist
 https://gist.github.com/andypetrella/12228eb24eea6b3e1389 for a preview
 :-p]*

 I would like to share some thinking I had due to a use case I faced.
 Basically, as the subject announced it, it's a generalization of the
 DStream currently available in the streaming project.
 First of all, I'd like to say that it's only a result of some personal
 thinking, alone in the dark with a use case, the spark code, a sheet of
 paper and a poor pen.


 DStream is a great concept to deal with micro-batching use cases, and
 it does it very well too!
 Also, it relies heavily on elapsing time to create its internal
 micro-batches.
 However, there are similar use cases where we need micro-batches but where this
 constraint on time doesn't hold; here are two of them:
 * a micro-batch has to be created every *n* events received
 * a micro-batch has to be generated based on the values of the items pushed
 by the source (which might not even be a stream!).

 An example of a use case (mine ^^) would be:
 * the creation of timeseries from a cold source containing timestamped
 events (like S3)
 * these timeseries have cells that are the mean (sum, count, ...) of one
 of the fields of the event
 * the mean has to be computed over a window depending on a field
 *timestamp*

 * a timeseries is created for each type of event (the number of types is
 high)
 So, in this case, it'd be interesting to have an RDD for each cell, which
 will generate all cells for all needed timeseries.
 It's more or less what DStream does, but here it won't help due to what was
 stated above.

 That's how I came to a raw sketch of what could be named ContinuousRDD
 (CRDD), which is basically an RDD[RDD[_]]. And, for the sake of simplicity,
 I've stuck with the definition of a DStream to think about it. Okay, let's
 go ^^.


 Looking at the DStream contract, here is something that could be drafted
 around CRDD.
 A *CRDD* would be a generalized concept that relies on:
 * a reference space/continuum (to which data can be bound)
 * a binning function that can break the continuum into splits.
 Since *Space* is a continuum, we could define it as:
 * a *SpacePoint* (the origin)
 * a SpacePoint => SpacePoint (the continuous function)
 * an Ordering[SpacePoint]
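
 (A rough Scala transcription of those three bullets, with invented names; the actual skeleton lives in the gist linked further down.)

   // Not from the gist: just a literal reading of the bullets above.
   trait Space[P] {
     def origin: P               // the SpacePoint origin
     def next(p: P): P           // the continuous SpacePoint => SpacePoint function
     def ordering: Ordering[P]   // the Ordering[SpacePoint]
   }

   // Example instance: a time continuum in milliseconds.
   object TimeSpace extends Space[Long] {
     val origin: Long = 0L
     def next(p: Long): Long = p + 1L
     val ordering: Ordering[Long] = Ordering.Long
   }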

 DStream uses a *JobGenerator* along with a DStreamGraph, which use a
 timer and a clock to do their work; in the case of a CRDD we'll also have to
 define a point generator, as a more generic but also adaptable
 concept.


 So far (so good?), these definitions should work quite fine for *ordered*
 spaces
 for which:
 * points are coming/fetched in order
 * the space is fully filled (no gaps)
 For these cases, the JobGenerator (f.i.) could be defined with two extra
 functions:
 * one is responsible for chopping the batches even if the upper bound of the
 batch hasn't been seen yet
 * the other is responsible for handling outliers (and could wrap them into yet
 another CRDD?)


 I created a gist here wrapping up the types and thus the skeleton of this
 idea, you can find it here:
 https://gist.github.com/andypetrella/12228eb24eea6b3e1389

 WDYT?
 *The answer can be: you're a fool!*
 Actually, I already am, but also I'd like to know why, so some
 explanations will help me 

[RESULT][VOTE] Release Apache Spark 1.0.0 (RC11)

2014-05-29 Thread Tathagata Das
Hello everyone,

The vote on Spark 1.0.0 RC11 passes with 13 +1 votes, one 0 vote, and no
-1 votes.

Thanks to everyone who tested the RC and voted. Here are the totals:

+1: (13 votes)
Matei Zaharia*
Mark Hamstra*
Holden Karau
Nick Pentreath*
Will Benton
Henry Saputra
Sean McNamara*
Xiangrui Meng*
Andy Konwinski*
Krishna Sankar
Kevin Markey
Patrick Wendell*
Tathagata Das*

0: (1 vote)
Ankur Dave*

-1: (0 vote)

Please hold off announcing Spark 1.0.0 until Apache Software Foundation
makes the press release tomorrow. Thank you very much for your cooperation.

TD


Re: [VOTE] Release Apache Spark 1.0.0 (RC11)

2014-05-29 Thread Tathagata Das
Let me put in my +1 as well!

This voting is now closed, and it successfully passes with 13 +1
votes and one 0 vote.
Thanks to everyone who tested the RC and voted. Here are the totals:

+1: (13 votes)
Matei Zaharia*
Mark Hamstra*
Holden Karau
Nick Pentreath*
Will Benton
Henry Saputra
Sean McNamara*
Xiangrui Meng*
Andy Konwinski*
Krishna Sankar
Kevin Markey
Patrick Wendell*
Tathagata Das*

0: (1 vote)
Ankur Dave*

-1: (0 vote)

* = binding

Please hold off announcing Spark 1.0.0 until Apache Software
Foundation makes the press release tomorrow. Thank you very much for
your cooperation.

TD

On Thu, May 29, 2014 at 9:14 AM, Patrick Wendell pwend...@gmail.com wrote:
 +1

 I spun up a few EC2 clusters and ran my normal audit checks. Tests
 passing, sigs, CHANGES and NOTICE look good

 Thanks TD for helping cut this RC!

 On Wed, May 28, 2014 at 9:38 PM, Kevin Markey kevin.mar...@oracle.com wrote:
 +1

 Built -Pyarn -Phadoop-2.3 -Dhadoop.version=2.3.0
 Ran current version of one of my applications on 1-node pseudocluster
 (sorry, unable to test on full cluster).
 yarn-cluster mode
 Ran regression tests.

 Thanks
 Kevin


 On 05/28/2014 09:55 PM, Krishna Sankar wrote:

 +1
 Pulled and built on MacOS X, EC2 Amazon Linux
 Ran test programs on OS X, 5 node c3.4xlarge cluster
 Cheers
 k/


 On Wed, May 28, 2014 at 7:36 PM, Andy Konwinski
 andykonwin...@gmail.comwrote:

 +1
 On May 28, 2014 7:05 PM, Xiangrui Meng men...@gmail.com wrote:

 +1

 Tested apps with standalone client mode and yarn cluster and client

 modes.

 Xiangrui

 On Wed, May 28, 2014 at 1:07 PM, Sean McNamara
 sean.mcnam...@webtrends.com wrote:

 Pulled down, compiled, and tested examples on OS X and ubuntu.
 Deployed app we are building on spark and poured data through it.

 +1

 Sean


 On May 26, 2014, at 8:39 AM, Tathagata Das 

 tathagata.das1...@gmail.com

 wrote:

 Please vote on releasing the following candidate as Apache Spark

 version 1.0.0!

 This has a few important bug fixes on top of rc10:
 SPARK-1900 and SPARK-1918: https://github.com/apache/spark/pull/853
 SPARK-1870: https://github.com/apache/spark/pull/848
 SPARK-1897: https://github.com/apache/spark/pull/849

 The tag to be voted on is v1.0.0-rc11 (commit c69d97cd):


 https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=c69d97cdb42f809cb71113a1db4194c21372242a

 The release files, including signatures, digests, etc. can be found

 at:

 http://people.apache.org/~tdas/spark-1.0.0-rc11/

 Release artifacts are signed with the following key:
 https://people.apache.org/keys/committer/tdas.asc

 The staging repository for this release can be found at:

 https://repository.apache.org/content/repositories/orgapachespark-1019/

 The documentation corresponding to this release can be found at:
 http://people.apache.org/~tdas/spark-1.0.0-rc11-docs/

 Please vote on releasing this package as Apache Spark 1.0.0!

 The vote is open until Thursday, May 29, at 16:00 UTC and passes if
 a majority of at least 3 +1 PMC votes are cast.

 [ ] +1 Release this package as Apache Spark 1.0.0
 [ ] -1 Do not release this package because ...

 To learn more about Apache Spark, please see
 http://spark.apache.org/

 == API Changes ==
 We welcome users to compile Spark applications against 1.0. There are
 a few API changes in this release. Here are links to the associated
 upgrade guides - user facing changes have been kept as small as
 possible.

 Changes to ML vector specification:


 http://people.apache.org/~tdas/spark-1.0.0-rc11-docs/mllib-guide.html#from-09-to-10

 Changes to the Java API:


 http://people.apache.org/~tdas/spark-1.0.0-rc11-docs/java-programming-guide.html#upgrading-from-pre-10-versions-of-spark

 Changes to the streaming API:


 http://people.apache.org/~tdas/spark-1.0.0-rc11-docs/streaming-programming-guide.html#migration-guide-from-091-or-below-to-1x

 Changes to the GraphX API:


 http://people.apache.org/~tdas/spark-1.0.0-rc11-docs/graphx-programming-guide.html#upgrade-guide-from-spark-091

 Other changes:
 coGroup and related functions now return Iterable[T] instead of Seq[T]
 ==> Call toSeq on the result to restore the old behavior

 SparkContext.jarOfClass returns Option[String] instead of Seq[String]
 ==> Call toSeq on the result to restore old behavior




[VOTE] Release Apache Spark 1.0.0 (RC11)

2014-05-26 Thread Tathagata Das
Please vote on releasing the following candidate as Apache Spark version 1.0.0!

This has a few important bug fixes on top of rc10:
SPARK-1900 and SPARK-1918: https://github.com/apache/spark/pull/853
SPARK-1870: https://github.com/apache/spark/pull/848
SPARK-1897: https://github.com/apache/spark/pull/849

The tag to be voted on is v1.0.0-rc11 (commit c69d97cd):
https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=c69d97cdb42f809cb71113a1db4194c21372242a

The release files, including signatures, digests, etc. can be found at:
http://people.apache.org/~tdas/spark-1.0.0-rc11/

Release artifacts are signed with the following key:
https://people.apache.org/keys/committer/tdas.asc

The staging repository for this release can be found at:
https://repository.apache.org/content/repositories/orgapachespark-1019/

The documentation corresponding to this release can be found at:
http://people.apache.org/~tdas/spark-1.0.0-rc11-docs/

Please vote on releasing this package as Apache Spark 1.0.0!

The vote is open until Thursday, May 29, at 16:00 UTC and passes if
a majority of at least 3 +1 PMC votes are cast.

[ ] +1 Release this package as Apache Spark 1.0.0
[ ] -1 Do not release this package because ...

To learn more about Apache Spark, please see
http://spark.apache.org/

== API Changes ==
We welcome users to compile Spark applications against 1.0. There are
a few API changes in this release. Here are links to the associated
upgrade guides - user facing changes have been kept as small as
possible.

Changes to ML vector specification:
http://people.apache.org/~tdas/spark-1.0.0-rc11-docs/mllib-guide.html#from-09-to-10

Changes to the Java API:
http://people.apache.org/~tdas/spark-1.0.0-rc11-docs/java-programming-guide.html#upgrading-from-pre-10-versions-of-spark

Changes to the streaming API:
http://people.apache.org/~tdas/spark-1.0.0-rc11-docs/streaming-programming-guide.html#migration-guide-from-091-or-below-to-1x

Changes to the GraphX API:
http://people.apache.org/~tdas/spark-1.0.0-rc11-docs/graphx-programming-guide.html#upgrade-guide-from-spark-091

Other changes:
coGroup and related functions now return Iterable[T] instead of Seq[T]
==> Call toSeq on the result to restore the old behavior

SparkContext.jarOfClass returns Option[String] instead of Seq[String]
==> Call toSeq on the result to restore old behavior
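
(To illustrate the coGroup change with made-up data: toSeq is all that is needed to keep Seq-based code working; older releases also needed import org.apache.spark.SparkContext._ for the pair-RDD implicits.)

  import org.apache.spark.SparkContext
  import org.apache.spark.rdd.RDD

  def cogroupExample(sc: SparkContext): Unit = {
    val a = sc.parallelize(Seq(1 -> "x", 2 -> "y"))
    val b = sc.parallelize(Seq(1 -> "z"))

    // In 1.0 the grouped values come back as Iterable[String], not Seq[String].
    val grouped: RDD[(Int, (Iterable[String], Iterable[String]))] = a.cogroup(b)

    // Calling toSeq restores the pre-1.0 shape for downstream code.
    val asSeqs = grouped.mapValues { case (l, r) => (l.toSeq, r.toSeq) }
    asSeqs.collect().foreach(println)

    // Likewise, SparkContext.jarOfClass now returns Option[String]; calling
    // .toSeq on it gives back a Seq if older code expects one.
  }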


Re: No output from Spark Streaming program with Spark 1.0

2014-05-24 Thread Tathagata Das
What does the kafka receiver status on the streaming UI say when you are
connected to the Kafka sources? Does it show any error?

Can you find out which machine the receiver is running on and check the worker
logs for any exceptions / error messages? Try turning on the DEBUG level in
log4j.
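
(One quick way to do that from code, assuming the log4j 1.x backend Spark shipped with at the time; the persistent alternative is setting log4j.rootCategory=DEBUG in conf/log4j.properties.)

  import org.apache.log4j.{Level, Logger}

  // Raise the root logger to DEBUG while chasing the receiver problem.
  Logger.getRootLogger.setLevel(Level.DEBUG)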

TD
On May 24, 2014 4:58 PM, Jim Donahue jdona...@adobe.com wrote:

 I looked at the Streaming UI for my job and it reports that it has
 processed many batches, but that none of the batches had any records in
 them. Unfortunately, that’s what I expected.  :-(

 I’ve tried multiple test programs and I’m seeing the same thing.  The
 Kafka sources are alive and well and the programs all worked on 0.9 from
 Eclipse.  And there’s no indication of any failure — just no records are
 being delivered.

 Any ideas would be much appreciated …


 Thanks,

 Jim


 On 5/23/14, 7:29 PM, Tathagata Das tathagata.das1...@gmail.com wrote:

 Few more suggestions.
 1. See the web ui, is the system running any jobs? If not, then you may
 need to give the system more nodes. Basically the system should have more
 cores than the number of receivers.
 2. Furthermore there is a streaming specific web ui which gives more
 streaming specific data.
 
 
 On Fri, May 23, 2014 at 6:02 PM, Patrick Wendell pwend...@gmail.com
 wrote:
 
  Also one other thing to try, try removing all of the logic form inside
  of foreach and just printing something. It could be that somehow an
  exception is being triggered inside of your foreach block and as a
  result the output goes away.
 
  On Fri, May 23, 2014 at 6:00 PM, Patrick Wendell pwend...@gmail.com
  wrote:
   Hey Jim,
  
   Do you see the same behavior if you run this outside of eclipse?
  
   Also, what happens if you print something to standard out when setting
   up your streams (i.e. not inside of the foreach) do you see that? This
   could be a streaming issue, but it could also be something related to
   the way it's running in eclipse.
  
   - Patrick
  
   On Fri, May 23, 2014 at 2:57 PM, Jim Donahue jdona...@adobe.com
 wrote:
   I'm trying out 1.0 on a set of small Spark Streaming tests and am running
   into problems. Here's one of the little programs I've used for a long
   time -- it reads a Kafka stream that contains Twitter JSON tweets and does
   some simple counting. The program starts OK (it connects to the Kafka
   stream fine) and generates a stream of INFO logging messages, but never
   generates any output. :-(
  
   I'm running this in Eclipse, so there may be some class loading issue
   (loading the wrong class or something like that), but I'm not seeing
   anything in the console output.
  
   Thanks,
  
   Jim Donahue
   Adobe
  
  
  
   val kafka_messages = KafkaUtils.createStream[Array[Byte], Array[Byte],
     kafka.serializer.DefaultDecoder, kafka.serializer.DefaultDecoder](ssc,
     propsMap, topicMap, StorageLevel.MEMORY_AND_DISK)

   val messages = kafka_messages.map(_._2)

   val total = ssc.sparkContext.accumulator(0)

   val startTime = new java.util.Date().getTime()

   val jsonstream = messages.map[JSONObject](message => {
     val string = new String(message);
     val json = new JSONObject(string);
     total += 1
     json
   })

   val deleted = ssc.sparkContext.accumulator(0)

   val msgstream = jsonstream.filter(json =>
     if (!json.has("delete")) true else { deleted += 1; false }
   )

   msgstream.foreach(rdd => {
     if (rdd.count() > 0) {
       val data = rdd.map(json => (json.has("entities"), json.length())).collect()
       val entities: Double = data.count(t => t._1)
       val fieldCounts = data.sortBy(_._2)
       val minFields = fieldCounts(0)._2
       val maxFields = fieldCounts(fieldCounts.size - 1)._2
       val now = new java.util.Date()
       val interval = (now.getTime() - startTime) / 1000
       System.out.println(now.toString)
       System.out.println("processing time: " + interval + " seconds")
       System.out.println("total messages: " + total.value)
       System.out.println("deleted messages: " + deleted.value)
       System.out.println("message receipt rate: " + (total.value/interval) + " per second")
       System.out.println("messages this interval: " + data.length)
       System.out.println("message fields varied between: " + minFields + " and " + maxFields)
       System.out.println("fraction with entities is " + (entities / data.length))
     }
   })

   ssc.start()
  
 




Re: No output from Spark Streaming program with Spark 1.0

2014-05-23 Thread Tathagata Das
A few more suggestions.
1. See the web UI: is the system running any jobs? If not, then you may
need to give the system more nodes. Basically, the system should have more
cores than the number of receivers.
2. Furthermore, there is a streaming-specific web UI which gives more
streaming-specific data.
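
(A minimal sketch of point 1 for a local run; the master URL, app name and batch interval are arbitrary.)

  import org.apache.spark.SparkConf
  import org.apache.spark.streaming.{Seconds, StreamingContext}

  // A Kafka receiver permanently occupies one core, so give the application at
  // least one more core than the number of receivers, or batches never get processed.
  val conf = new SparkConf().setMaster("local[4]").setAppName("kafka-debug-sketch")
  val ssc = new StreamingContext(conf, Seconds(2))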


On Fri, May 23, 2014 at 6:02 PM, Patrick Wendell pwend...@gmail.com wrote:

 Also one other thing to try, try removing all of the logic form inside
 of foreach and just printing something. It could be that somehow an
 exception is being triggered inside of your foreach block and as a
 result the output goes away.

 On Fri, May 23, 2014 at 6:00 PM, Patrick Wendell pwend...@gmail.com
 wrote:
  Hey Jim,
 
  Do you see the same behavior if you run this outside of eclipse?
 
  Also, what happens if you print something to standard out when setting
  up your streams (i.e. not inside of the foreach) do you see that? This
  could be a streaming issue, but it could also be something related to
  the way it's running in eclipse.
 
  - Patrick
 
  On Fri, May 23, 2014 at 2:57 PM, Jim Donahue jdona...@adobe.com wrote:
  I'm trying out 1.0 on a set of small Spark Streaming tests and am running
  into problems. Here's one of the little programs I've used for a long
  time -- it reads a Kafka stream that contains Twitter JSON tweets and does
  some simple counting. The program starts OK (it connects to the Kafka
  stream fine) and generates a stream of INFO logging messages, but never
  generates any output. :-(
 
  I'm running this in Eclipse, so there may be some class loading issue
  (loading the wrong class or something like that), but I'm not seeing
  anything in the console output.
 
  Thanks,
 
  Jim Donahue
  Adobe
 
 
 
  val kafka_messages = KafkaUtils.createStream[Array[Byte], Array[Byte],
    kafka.serializer.DefaultDecoder, kafka.serializer.DefaultDecoder](ssc,
    propsMap, topicMap, StorageLevel.MEMORY_AND_DISK)

  val messages = kafka_messages.map(_._2)

  val total = ssc.sparkContext.accumulator(0)

  val startTime = new java.util.Date().getTime()

  val jsonstream = messages.map[JSONObject](message => {
    val string = new String(message);
    val json = new JSONObject(string);
    total += 1
    json
  })

  val deleted = ssc.sparkContext.accumulator(0)

  val msgstream = jsonstream.filter(json =>
    if (!json.has("delete")) true else { deleted += 1; false }
  )

  msgstream.foreach(rdd => {
    if (rdd.count() > 0) {
      val data = rdd.map(json => (json.has("entities"), json.length())).collect()
      val entities: Double = data.count(t => t._1)
      val fieldCounts = data.sortBy(_._2)
      val minFields = fieldCounts(0)._2
      val maxFields = fieldCounts(fieldCounts.size - 1)._2
      val now = new java.util.Date()
      val interval = (now.getTime() - startTime) / 1000
      System.out.println(now.toString)
      System.out.println("processing time: " + interval + " seconds")
      System.out.println("total messages: " + total.value)
      System.out.println("deleted messages: " + deleted.value)
      System.out.println("message receipt rate: " + (total.value/interval) + " per second")
      System.out.println("messages this interval: " + data.length)
      System.out.println("message fields varied between: " + minFields + " and " + maxFields)
      System.out.println("fraction with entities is " + (entities / data.length))
    }
  })

  ssc.start()
 



Re: [VOTE] Release Apache Spark 1.0.0 (RC10)

2014-05-22 Thread Tathagata Das
Hey all,

On further testing, I came across a bug that breaks execution of
pyspark scripts on YARN.
https://issues.apache.org/jira/browse/SPARK-1900
This is a blocker and worth cutting a new RC.

We also found a fix for a known issue that prevents additional jar
files from being specified through spark-submit on YARN.
https://issues.apache.org/jira/browse/SPARK-1870
This has been fixed and will be in the next RC.

We are canceling this vote for now. We will post RC11 shortly. Thanks
everyone for testing!

TD

On Thu, May 22, 2014 at 1:25 PM, Kevin Markey kevin.mar...@oracle.com wrote:
 Thank you, all!  This is quite helpful.

 We have been arguing how to handle this issue across a growing application.
 Unfortunately the Hadoop FileSystem java doc should say all this but
 doesn't!

 Kevin


 On 05/22/2014 01:48 PM, Aaron Davidson wrote:

 In Spark 0.9.0 and 0.9.1, we stopped using the FileSystem cache correctly,
 and we just recently resumed using it in 1.0 (and in 0.9.2) when this
 issue
 was fixed: https://issues.apache.org/jira/browse/SPARK-1676

 Prior to this fix, each Spark task created and cached its own FileSystems
 due to a bug in how the FS cache handles UGIs. The big problem that arose
 was that these FileSystems were never closed, so they just kept piling up.
 There were two solutions we considered, with the following effects: (1)
 Share the FS cache among all tasks and (2) Each task effectively gets its
 own FS cache, and closes all of its FSes after the task completes.

 We chose solution (1) for 3 reasons:
   - It does not rely on the behavior of a bug in HDFS.
   - It is the most performant option.
   - It is most consistent with the semantics of the (albeit broken) FS
 cache.

 Since this behavior was changed in 1.0, it could be considered a
 regression. We should consider the exact behavior we want out of the FS
 cache. For Spark's purposes, it seems fine to cache FileSystems across
 tasks, as Spark does not close FileSystems. The issue that comes up is
 that
 user code which uses FileSystem.get() but then closes the FileSystem can
 screw up Spark processes which were using that FileSystem. The workaround
 for users would be to use FileSystem.newInstance() if they want full
 control over the lifecycle of their FileSystems.
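
 (A hedged sketch of that workaround; the URI is a placeholder.)

   import java.net.URI
   import org.apache.hadoop.conf.Configuration
   import org.apache.hadoop.fs.FileSystem

   // newInstance bypasses the shared FileSystem cache, so closing this instance
   // cannot pull a cached FileSystem out from under Spark.
   val fs = FileSystem.newInstance(URI.create("hdfs://namenode:8020"), new Configuration())
   try {
     // ... application I/O with fs ...
   } finally {
     fs.close()
   }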


 On Thu, May 22, 2014 at 12:06 PM, Colin McCabe
 cmcc...@alumni.cmu.eduwrote:

 The FileSystem cache is something that has caused a lot of pain over the
 years.  Unfortunately we (in Hadoop core) can't change the way it works
 now
 because there are too many users depending on the current behavior.

 Basically, the idea is that when you request a FileSystem with certain
 options with FileSystem#get, you might get a reference to an FS object
 that
 already exists, from our FS cache cache singleton.  Unfortunately, this
 also means that someone else can change the working directory on you or
 close the FS underneath you.  The FS is basically shared mutable state,
 and
 you don't know whom you're sharing with.

 It might be better for Spark to call FileSystem#newInstance, which
 bypasses
 the FileSystem cache and always creates a new object.  If Spark can hang
 on
 to the FS for a while, it can get the benefits of caching without the
 downsides.  In HDFS, multiple FS instances can also share things like the
 socket cache between them.

 best,
 Colin


 On Thu, May 22, 2014 at 10:06 AM, Marcelo Vanzin van...@cloudera.com

 wrote:
 Hi Kevin,

 On Thu, May 22, 2014 at 9:49 AM, Kevin Markey kevin.mar...@oracle.com
 wrote:

  The FS closed exception only affects the cleanup of the staging

 directory,

 not the final success or failure.  I've not yet tested the effect of
 changing my application's initialization, use, or closing of

 FileSystem.

 Without going and reading more of the Spark code, if your app is
 explicitly close()'ing the FileSystem instance, it may be causing the
 exception. If Spark is caching the FileSystem instance, your app is
 probably closing that same instance (which it got from the HDFS
 library's internal cache).

 It would be nice if you could test that theory; it might be worth
 knowing that's the case so that we can tell people not to do that.

 --
 Marcelo




Re: [VOTE] Release Apache Spark 1.0.0 (RC10)

2014-05-22 Thread Tathagata Das
Right! Doing that.

TD

On Thu, May 22, 2014 at 3:07 PM, Henry Saputra henry.sapu...@gmail.com wrote:
 Looks like SPARK-1900 is a blocker for YARN and might as well add
 SPARK-1870 while at it.

 TD or Patrick, could you kindly send [CANCEL] prefixed in the subject
 email out for the RC10 Vote to help people follow the active VOTE
 threads? The VOTE emails are getting a bit hard to follow.


 - Henry


 On Thu, May 22, 2014 at 2:05 PM, Tathagata Das
 tathagata.das1...@gmail.com wrote:
 Hey all,

 On further testing, I came across a bug that breaks execution of
 pyspark scripts on YARN.
 https://issues.apache.org/jira/browse/SPARK-1900
 This is a blocker and worth cutting a new RC.

 We also found a fix for a known issue that prevents additional jar
 files to be specified through spark-submit on YARN.
 https://issues.apache.org/jira/browse/SPARK-1870
 The has been fixed and will be in the next RC.

 We are canceling this vote for now. We will post RC11 shortly. Thanks
 everyone for testing!

 TD



[VOTE] Release Apache Spark 1.0.0 (RC10)

2014-05-20 Thread Tathagata Das
Please vote on releasing the following candidate as Apache Spark version 1.0.0!

This has a few bug fixes on top of rc9:
SPARK-1875: https://github.com/apache/spark/pull/824
SPARK-1876: https://github.com/apache/spark/pull/819
SPARK-1878: https://github.com/apache/spark/pull/822
SPARK-1879: https://github.com/apache/spark/pull/823

The tag to be voted on is v1.0.0-rc10 (commit d8070234):
https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=d807023479ce10aec28ef3c1ab646ddefc2e663c

The release files, including signatures, digests, etc. can be found at:
http://people.apache.org/~tdas/spark-1.0.0-rc10/

The release artifacts are signed with the following key:
https://people.apache.org/keys/committer/tdas.asc

The staging repository for this release can be found at:
https://repository.apache.org/content/repositories/orgapachespark-1018/

The documentation corresponding to this release can be found at:
http://people.apache.org/~tdas/spark-1.0.0-rc10-docs/

The full list of changes in this release can be found at:
https://git-wip-us.apache.org/repos/asf?p=spark.git;a=blob;f=CHANGES.txt;h=d21f0ace6326e099360975002797eb7cba9d5273;hb=d807023479ce10aec28ef3c1ab646ddefc2e663c

Please vote on releasing this package as Apache Spark 1.0.0!

The vote is open until Friday, May 23, at 20:00 UTC and passes if
a majority of at least 3 +1 PMC votes are cast.

[ ] +1 Release this package as Apache Spark 1.0.0
[ ] -1 Do not release this package because ...

To learn more about Apache Spark, please see
http://spark.apache.org/

== API Changes ==
We welcome users to compile Spark applications against 1.0. There are
a few API changes in this release. Here are links to the associated
upgrade guides - user facing changes have been kept as small as
possible.

Changes to ML vector specification:
http://people.apache.org/~tdas/spark-1.0.0-rc10-docs/mllib-guide.html#from-09-to-10

Changes to the Java API:
http://people.apache.org/~tdas/spark-1.0.0-rc10-docs/java-programming-guide.html#upgrading-from-pre-10-versions-of-spark

Changes to the streaming API:
http://people.apache.org/~tdas/spark-1.0.0-rc10-docs/streaming-programming-guide.html#migration-guide-from-091-or-below-to-1x

Changes to the GraphX API:
http://people.apache.org/~tdas/spark-1.0.0-rc10-docs/graphx-programming-guide.html#upgrade-guide-from-spark-091

Other changes:
coGroup and related functions now return Iterable[T] instead of Seq[T]
== Call toSeq on the result to restore the old behavior

SparkContext.jarOfClass returns Option[String] instead of Seq[String]
== Call toSeq on the result to restore old behavior


Re: Double hbase dependency in Spark 0.9.1

2014-04-17 Thread Tathagata Das
Aaah, this should have been ported to Spark 0.9.1!

TD


On Thu, Apr 17, 2014 at 12:08 PM, Sean Owen so...@cloudera.com wrote:

 I remember that too, and it has been fixed already in master, but
 maybe it was not included in 0.9.1:

 https://github.com/apache/spark/blob/master/project/SparkBuild.scala#L367
 --
 Sean Owen | Director, Data Science | London


 On Thu, Apr 17, 2014 at 8:03 PM, Dmitriy Lyubimov dlie...@gmail.com
 wrote:
  Not sure if I am seeing double.
 
  SparkBuild.scala for 0.9.1 has a double hbase declaration:
 
    "org.apache.hbase" % "hbase" % "0.94.6" excludeAll(excludeNetty, excludeAsm),
    "org.apache.hbase" % "hbase" % HBASE_VERSION excludeAll(excludeNetty, excludeAsm),
 
 
  as a result I am not getting the right version of hbase here. Perhaps the
  old declaration crept in during a merge at some point?
 
  -d



Re: Spark 0.9.1 released

2014-04-09 Thread Tathagata Das
A small additional note: Please use the direct download links in the Spark
Downloads http://spark.apache.org/downloads.html page. The Apache mirrors
take a day or so to sync from the main repo, so may not work immediately.

TD


On Wed, Apr 9, 2014 at 2:54 PM, Tathagata Das
tathagata.das1...@gmail.comwrote:

 Hi everyone,

 We have just posted Spark 0.9.1, which is a maintenance release with
 bug fixes, performance improvements, better stability with YARN and
 improved parity of the Scala and Python API. We recommend all 0.9.0
 users to upgrade to this stable release.

 This is the first release since Spark graduated as a top level Apache
 project. Contributions to this release came from 37 developers.

 The full release notes are at:
 http://spark.apache.org/releases/spark-release-0-9-1.html

 You can download the release at:
 http://spark.apache.org/downloads.html

 Thanks all the developers who contributed to this release:
 Aaron Davidson, Aaron Kimball, Andrew Ash, Andrew Or, Andrew Tulloch,
 Bijay Bisht, Bouke van der Bijl, Bryn Keller, Chen Chao,
 Christian Lundgren, Diana Carroll, Emtiaz Ahmed, Frank Dai,
 Henry Saputra, jianghan, Josh Rosen, Jyotiska NK, Kay Ousterhout,
 Kousuke Saruta, Mark Grover, Matei Zaharia, Nan Zhu, Nick Lanham,
 Patrick Wendell, Prabin Banka, Prashant Sharma, Qiuzhuang,
 Raymond Liu, Reynold Xin, Sandy Ryza, Sean Owen, Shixiong Zhu,
 shiyun.wxm, Stevo Slavić, Tathagata Das, Tom Graves, Xiangrui Meng

 TD



Re: Spark 0.9.1 released

2014-04-09 Thread Tathagata Das
Thanks Nick for pointing that out! I have updated the release
notes (http://spark.apache.org/releases/spark-release-0-9-1.html).
But I see the new operations like repartition in the latest PySpark
RDD docs (http://spark.apache.org/docs/latest/api/pyspark/index.html).
Maybe refresh the page a couple of times?

TD


On Wed, Apr 9, 2014 at 3:58 PM, Nicholas Chammas nicholas.cham...@gmail.com
 wrote:

 A very nice addition for us PySpark users in 0.9.1 is the addition of
 RDD.repartition(), which is not mentioned in the release
 notes (http://spark.apache.org/releases/spark-release-0-9-1.html)!

 This is super helpful for when you create an RDD from a gzipped file and
 then need to explicitly shuffle the data around to parallelize operations
 on it appropriately.

 Thanks people!

 FYI, docs/latest (http://spark.apache.org/docs/latest/api/pyspark/index.html) hasn't
 been updated yet to reflect the new additions to PySpark.

 Nick



 On Wed, Apr 9, 2014 at 6:07 PM, Matei Zaharia matei.zaha...@gmail.comwrote:

 Thanks TD for managing this release, and thanks to everyone who
 contributed!

 Matei

 On Apr 9, 2014, at 2:59 PM, Tathagata Das tathagata.das1...@gmail.com
 wrote:

 A small additional note: Please use the direct download links in the
 Spark Downloads http://spark.apache.org/downloads.html page. The
 Apache mirrors take a day or so to sync from the main repo, so may not work
 immediately.

 TD


 On Wed, Apr 9, 2014 at 2:54 PM, Tathagata Das 
 tathagata.das1...@gmail.com wrote:

 Hi everyone,

 We have just posted Spark 0.9.1, which is a maintenance release with
 bug fixes, performance improvements, better stability with YARN and
 improved parity of the Scala and Python API. We recommend all 0.9.0
 users to upgrade to this stable release.

 This is the first release since Spark graduated as a top level Apache
 project. Contributions to this release came from 37 developers.

 The full release notes are at:
 http://spark.apache.org/releases/spark-release-0-9-1.html

 You can download the release at:
 http://spark.apache.org/downloads.html

 Thanks all the developers who contributed to this release:
 Aaron Davidson, Aaron Kimball, Andrew Ash, Andrew Or, Andrew Tulloch,
 Bijay Bisht, Bouke van der Bijl, Bryn Keller, Chen Chao,
 Christian Lundgren, Diana Carroll, Emtiaz Ahmed, Frank Dai,
 Henry Saputra, jianghan, Josh Rosen, Jyotiska NK, Kay Ousterhout,
 Kousuke Saruta, Mark Grover, Matei Zaharia, Nan Zhu, Nick Lanham,
 Patrick Wendell, Prabin Banka, Prashant Sharma, Qiuzhuang,
 Raymond Liu, Reynold Xin, Sandy Ryza, Sean Owen, Shixiong Zhu,
 shiyun.wxm, Stevo Slavić, Tathagata Das, Tom Graves, Xiangrui Meng

 TD







Re: Flaky streaming tests

2014-04-07 Thread Tathagata Das
Yes, I will take a look at those tests ASAP.

TD



On Mon, Apr 7, 2014 at 11:32 AM, Patrick Wendell pwend...@gmail.com wrote:

 TD - do you know what is going on here?

 I looked into this a bit, and at least a few of these use
 Thread.sleep() and assume the sleep will be exact, which is wrong. We
 should disable all the tests that do, and they should probably be rewritten
 to virtualize time.
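
 (For illustration only, not Spark's actual test utility: virtualizing time means the test advances a clock object explicitly instead of sleeping.)

   // A toy version of the idea; the streaming tests would depend on a Clock
   // like this instead of wall-clock time, and advance it deterministically.
   trait Clock { def currentTime(): Long }

   class SystemClock extends Clock {
     def currentTime(): Long = System.currentTimeMillis()
   }

   class ManualClock(start: Long = 0L) extends Clock {
     private var now = start
     def currentTime(): Long = synchronized { now }
     def advance(ms: Long): Unit = synchronized { now += ms }  // called by the test
   }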

 - Patrick


 On Mon, Apr 7, 2014 at 10:52 AM, Kay Ousterhout k...@eecs.berkeley.edu
 wrote:

  Hi all,
 
  The InputStreamsSuite seems to have some serious flakiness issues -- I've
  seen the file input stream fail many times and now I'm seeing some actor
  input stream test failures (
 
 
 https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13846/consoleFull
  )
  on what I think is an unrelated change.  Does anyone know anything about
  these?  Should we just remove some of these tests since they seem to be
  constantly failing?
 
  -Kay
 



Re: [VOTE] Release Apache Spark 0.9.1 (RC3)

2014-03-31 Thread Tathagata Das
Yes, let's extend the vote for two more days from now. So the vote is open
till *Wednesday, April 02, at 20:00 UTC*

On that note, my +1

TD




On Mon, Mar 31, 2014 at 9:57 AM, Patrick Wendell pwend...@gmail.com wrote:

 Yeah good point. Let's just extend this vote another few days?


 On Mon, Mar 31, 2014 at 8:12 AM, Tom Graves tgraves...@yahoo.com wrote:

  I should probably pull this off into another thread, but going forward
 can
  we try to not have the release votes end on a weekend? Since we only seem
  to give 3 days, it makes it really hard for anyone who is offline for the
  weekend to try it out.   Either that or extend the voting for more than 3
  days.
 
  Tom
  On Monday, March 31, 2014 12:50 AM, Patrick Wendell pwend...@gmail.com
  wrote:
 
  TD - I downloaded and did some local testing. Looks good to me!
 
  +1
 
  You should cast your own vote - at that point it's enough to pass.
 
  - Patrick
 
 
 
  On Sun, Mar 30, 2014 at 9:47 PM, prabeesh k prabsma...@gmail.com
 wrote:
 
   +1
   tested on Ubuntu12.04 64bit
  
  
   On Mon, Mar 31, 2014 at 3:56 AM, Matei Zaharia 
 matei.zaha...@gmail.com
   wrote:
  
+1 tested on Mac OS X.
   
Matei
   
On Mar 27, 2014, at 1:32 AM, Tathagata Das 
  tathagata.das1...@gmail.com
wrote:
   
 Please vote on releasing the following candidate as Apache Spark
   version
0.9.1

 A draft of the release notes along with the CHANGES.txt file is
 attached to this e-mail.

 The tag to be voted on is v0.9.1-rc3 (commit 4c43182b):

   
  
 
 https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=4c43182b6d1b0b7717423f386c0214fe93073208

 The release files, including signatures, digests, etc. can be found
  at:
 http://people.apache.org/~tdas/spark-0.9.1-rc3/

 Release artifacts are signed with the following key:
 https://people.apache.org/keys/committer/tdas.asc

 The staging repository for this release can be found at:

  
 https://repository.apache.org/content/repositories/orgapachespark-1009/

 The documentation corresponding to this release can be found at:
 http://people.apache.org/~tdas/spark-0.9.1-rc3-docs/

 Please vote on releasing this package as Apache Spark 0.9.1!

 The vote is open until Sunday, March 30, at 10:00 UTC and passes if
 a majority of at least 3 +1 PMC votes are cast.

 [ ] +1 Release this package as Apache Spark 0.9.1
 [ ] -1 Do not release this package because ...

 To learn more about Apache Spark, please see
 http://spark.apache.org/
 CHANGES.txtRELEASE_NOTES.txt
   
   
  
 



Re: [VOTE] Release Apache Spark 0.9.1 (RC3)

2014-03-29 Thread Tathagata Das
Ah yes that should be and will be updated!
One more update in docs.

On the home page of Spark Streaming
(http://spark.incubator.apache.org/streaming/), under Deployment Options,
it is mentioned that *Spark Streaming can read data from HDFS
(http://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/HdfsUserGuide.html),
Flume (http://flume.apache.org/), Kafka (http://kafka.apache.org/), Twitter
(https://dev.twitter.com/) and ZeroMQ (http://zeromq.org/)*.

But from Spark Streaming 0.9.0 onwards it also supports MQTT.

Can you please do the necessary to update the same after the voting has
completed?


On Sat, Mar 29, 2014 at 9:28 PM, Tathagata Das
tathagata.das1...@gmail.comwrote:

 Small fixes to the docs can be done after the voting has completed. This
 should not determine the vote on the release candidate binaries. Please
 vote as +1 if the published artifacts and binaries are good to go.

 TD
 On Mar 29, 2014 5:23 AM, prabeesh k prabsma...@gmail.com wrote:

  https://github.com/apache/spark/blob/master/docs/quick-start.md, line 127:
  one spelling mistake found, please correct it ("proogram" to "program").
 
 
 
  On Fri, Mar 28, 2014 at 9:58 PM, Will Benton wi...@redhat.com wrote:
 
   RC3 works with the applications I'm working on now and MLLib
 performance
   is indeed perceptibly improved over 0.9.0 (although I haven't done a
 real
   evaluation).  Also, from the downstream perspective, I've been
tracking
  the
   0.9.1 RCs in Fedora and have no issues to report there either:
  
  http://koji.fedoraproject.org/koji/buildinfo?buildID=507284
  
   [x] +1 Release this package as Apache Spark 0.9.1
   [ ] -1 Do not release this package because ...
  
 



Re: Spark 0.9.1 release

2014-03-27 Thread Tathagata Das
I have cut another release candidate, RC3, with two important bug
fixes. See the following JIRAs for more details.
1. Bug with intercepts in MLLib's GLM:
https://spark-project.atlassian.net/browse/SPARK-1327
2. Bug in PySpark's RDD.top() ordering:
https://spark-project.atlassian.net/browse/SPARK-1322

Please vote on this candidate on the voting thread.

Thanks!

TD

On Wed, Mar 26, 2014 at 3:09 PM, Tathagata Das
tathagata.das1...@gmail.com wrote:
 Updates:
 1. Fix for the ASM problem that Kevin mentioned is already in Spark 0.9.1
 RC2
 2. Fix for pyspark's RDD.top() that Patrick mentioned has been pulled into
 branch 0.9. This will get into the next RC if there is one.

 TD


 On Wed, Mar 26, 2014 at 9:21 AM, Patrick Wendell pwend...@gmail.com wrote:

 Hey TD,

 This one we just merged into master this morning:
 https://spark-project.atlassian.net/browse/SPARK-1322

 It should definitely go into the 0.9 branch because there was a bug in the
 semantics of top() which at this point is unreleased in Python.

 I didn't backport it yet because I figured you might want to do this at a
 specific time. So please go ahead and backport it. Not sure whether this
 warrants another RC.

 - Patrick


 On Tue, Mar 25, 2014 at 10:47 PM, Mridul Muralidharan
 mri...@gmail.comwrote:

  On Wed, Mar 26, 2014 at 10:53 AM, Tathagata Das
  tathagata.das1...@gmail.com wrote:
   PR 159 seems like a fairly big patch to me. And quite recent, so its
  impact
   on the scheduling is not clear. It may also depend on other changes
   that
   may have gotten into the DAGScheduler but not pulled into branch 0.9.
   I
  am
   not sure it is a good idea to pull that in. We can pull those changes
  later
   for 0.9.2 if required.
 
 
  There is no impact on scheduling: it only has an impact on error
  handling - it ensures that you can actually use Spark on YARN in
  multi-tenant clusters more reliably.
  Currently, any reasonably long-running job (30 mins+) working on a
  non-trivial dataset will fail due to accumulated failures in Spark.
 
 
  Regards,
  Mridul
 
 
  
   TD
  
  
  
  
   On Tue, Mar 25, 2014 at 8:44 PM, Mridul Muralidharan mri...@gmail.com
  wrote:
  
   Forgot to mention this in the earlier request for PR's.
   If there is another RC being cut, please add
   https://github.com/apache/spark/pull/159 to it too (if not done
   already !).
  
   Thanks,
   Mridul
  
   On Thu, Mar 20, 2014 at 5:37 AM, Tathagata Das
   tathagata.das1...@gmail.com wrote:
 Hello everyone,
   
Since the release of Spark 0.9, we have received a number of
important
   bug
fixes and we would like to make a bug-fix release of Spark 0.9.1.
We
  are
going to cut a release candidate soon and we would love it if
people
  test
it out. We have backported several bug fixes into the 0.9 and
updated
   JIRA
accordingly
  
 
  https://spark-project.atlassian.net/browse/SPARK-1275?jql=project%20in%20(SPARK%2C%20BLINKDB%2C%20MLI%2C%20MLLIB%2C%20SHARK%2C%20STREAMING%2C%20GRAPH%2C%20TACHYON)%20AND%20fixVersion%20%3D%200.9.1%20AND%20status%20in%20(Resolved%2C%20Closed)
   .
Please let me know if there are fixes that were not backported but
you
would like to see them in 0.9.1.
   
Thanks!
   
TD
  
 




Re: Spark 0.9.1 release

2014-03-25 Thread Tathagata Das
@evan
From the discussion in the JIRA, it seems that we still don't have a clear
solution for SPARK-1138. Nor do we have a sense of whether the solution is
going to be small enough for a maintenance release. So I don't think we should
block the release of Spark 0.9.1 for this. We can make another Spark 0.9.2
release once the correct solution has been figured out.

@kevin
I understand the problem. I will try to port the solution for master
in this PR https://github.com/apache/spark/pull/100/ into
branch 0.9. Let's see if it works out.


On Tue, Mar 25, 2014 at 10:19 AM, Kevin Markey kevin.mar...@oracle.comwrote:

 TD:

 A correct shading of ASM should only affect Spark code unless someone is
 relying on ASM 4.0 in unrelated project code, in which case they can add
 org.ow2.asm:asm:4.x as a dependency.

 Our short term solution has been to repackage other libraries with a 3.2
 dependency or to exclude ASM when our use of a dependent library really
 doesn't need it.  As you probably know, the real problem arises in
 ClassVisitor, which is an Interface in 3.x and before, but in 4.x it is an
 abstract class that takes a version constant as its constructor.  The ASM
 folks of course had our best interests in mind when they did this,
 attempting to deal with the Java-version dependent  changes from one ASM
 release to the next.  Unfortunately, they didn't change the names or
 locations of their classes and interfaces, which would have helped.

 In our particular case, the only library from which we couldn't exclude
 ASM was org.glassfish.jersey.containers:jersey-container-servlet:jar:2.5.1.
 I added a new module to our project, including some dummy source code,
 because we needed the library to be self contained, made the servlet --
 minus some unrelated transitive dependencies -- the only module dependency,
 then used the Maven shade plugin to relocate org.objectweb.asm to an
 arbitrary target.  We added the new shaded module as a new project
 dependency, plus the unrelated transitive dependencies excluded above.
 This solved the problem. At least until we added WADL to the project.  Then
 we needed to deal with it on its own terms.

 As you can see, we left Spark alone in all its ASM 4.0 glory.  Why? Spark
 is more volatile than the other libraries.  Also, the way in which we
 needed to deploy Spark and other resources on our (Yarn) clusters suggested
 that it would be easier to shade the other libraries.  I wanted to avoid
 having to install a locally patched Spark library into our build, updating
 the cluster and individual developers whenever there's a new patch.
  Individual developers such as me who are testing the impact of patches can
 handle it, but the main build goes to Maven Central via our corporate
 Artifactory mirror.

 If suddenly we had a Spark 0.9.1 with a shaded ASM, it would have no
 negative impact on us.  Only a positive impact.

 I just wish that all users of ASM would read FAQ entry 15!!!

 Thanks
 Kevin



 On 03/24/2014 06:30 PM, Tathagata Das wrote:

 Hello Kevin,

 A fix for SPARK-782 would definitely simplify building against Spark.
 However, its possible that a fix for this issue in 0.9.1 will break
 the builds (that reference spark) of existing 0.9 users, either due to
 a change in the ASM version, or for being incompatible with their
 current workarounds for this issue. That is not a good idea for a
 maintenance release, especially when 1.0 is not too far away.

 Can you (and others) elaborate more on the current workarounds that
 you have for this issue? Its best to understand all the implications
 of this fix.

 Note that in branch 0.9, it is not fixed, neither in SBT nor in Maven.

 TD

 On Mon, Mar 24, 2014 at 4:38 PM, Kevin Markey kevin.mar...@oracle.com
 wrote:

 Is there any way that [SPARK-782] (Shade ASM) can be included?  I see
 that
 it is not currently backported to 0.9.  But there is no single issue that
 has caused us more grief as we integrate spark-core with other project
 dependencies.  There are way too many libraries out there in addition to
 Spark 0.9 and before that are not well-behaved (ASM FAQ recommends
 shading),
 including some Hive and Hadoop libraries and a number of servlet
 libraries.
 We can't control those, but if Spark were well behaved in this regard, it
 would help.  Even for a maintenance release, and even if 1.0 is only 6
 weeks
 away!

 (For those not following 782, according to Jira comments, the SBT build
 shades it, but it is the Maven build that ends up in Maven Central.)

 Thanks
 Kevin Markey




 On 03/19/2014 06:07 PM, Tathagata Das wrote:

Hello everyone,

 Since the release of Spark 0.9, we have received a number of important
 bug
 fixes and we would like to make a bug-fix release of Spark 0.9.1. We are
 going to cut a release candidate soon and we would love it if people
 test
 it out. We have backported several bug fixes into the 0.9 and updated
 JIRA

 accordinglyhttps://spark-project.atlassian.net/browse/
 SPARK-1275?jql=project

Re: Spark 0.9.1 release

2014-03-25 Thread Tathagata Das
PR 159 seems like a fairly big patch to me. And quite recent, so its impact
on the scheduling is not clear. It may also depend on other changes that
may have gotten into the DAGScheduler but not pulled into branch 0.9. I am
not sure it is a good idea to pull that in. We can pull those changes later
for 0.9.2 if required.

TD




On Tue, Mar 25, 2014 at 8:44 PM, Mridul Muralidharan mri...@gmail.comwrote:

 Forgot to mention this in the earlier request for PR's.
 If there is another RC being cut, please add
 https://github.com/apache/spark/pull/159 to it too (if not done
 already !).

 Thanks,
 Mridul

 On Thu, Mar 20, 2014 at 5:37 AM, Tathagata Das
 tathagata.das1...@gmail.com wrote:
   Hello everyone,
 
  Since the release of Spark 0.9, we have received a number of important
 bug
  fixes and we would like to make a bug-fix release of Spark 0.9.1. We are
  going to cut a release candidate soon and we would love it if people test
  it out. We have backported several bug fixes into the 0.9 and updated
 JIRA
  accordingly
 https://spark-project.atlassian.net/browse/SPARK-1275?jql=project%20in%20(SPARK%2C%20BLINKDB%2C%20MLI%2C%20MLLIB%2C%20SHARK%2C%20STREAMING%2C%20GRAPH%2C%20TACHYON)%20AND%20fixVersion%20%3D%200.9.1%20AND%20status%20in%20(Resolved%2C%20Closed)
 .
  Please let me know if there are fixes that were not backported but you
  would like to see them in 0.9.1.
 
  Thanks!
 
  TD



Re: Spark 0.9.1 release

2014-03-24 Thread Tathagata Das
Patrick, that is a good point.


On Mon, Mar 24, 2014 at 12:14 AM, Patrick Wendell pwend...@gmail.comwrote:

  Spark's dependency graph in a maintenance
 *Modifying* Spark's dependency graph...



Re: Spark 0.9.1 release

2014-03-24 Thread Tathagata Das
1051 has been pulled in!

search 1051 in
https://git-wip-us.apache.org/repos/asf?p=spark.git;a=shortlog;h=refs/heads/branch-0.9

TD

On Mon, Mar 24, 2014 at 4:26 PM, Kevin Markey kevin.mar...@oracle.com wrote:
 1051 is essential!
 I'm not sure about the others, but anything that adds stability to
 Spark/Yarn would  be helpful.
 Kevin Markey



 On 03/20/2014 01:12 PM, Tom Graves wrote:

 I'll pull [SPARK-1053] Should not require SPARK_YARN_APP_JAR when running
 on YARN - JIRA and  [SPARK-1051] On Yarn, executors don't doAs as submitting
 user - JIRA in.  The pyspark one I would consider more of an enhancement so
 might not be appropriate for a point release.

 [SPARK-1053] Should not require SPARK_YARN_APP_JAR when running on YARN:
   org.apache.spark.SparkException: env SPARK_YARN_APP_JAR is not set
     at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:49)
     at org.apache.spark.schedule...

 [SPARK-1051] On Yarn, executors don't doAs as submitting user:
   This means that they can't write/read from files that the yarn user
   doesn't have permissions to but the submitting user does.
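
 (For context, "doAs" refers to Hadoop's UserGroupInformation mechanism for
 running work under another user's identity, which is what SPARK-1051 asks the
 executors to do. The Scala sketch below only illustrates that general pattern;
 it is not the actual Spark patch, and the helper name is made up.)

    // Minimal illustration of Hadoop's doAs pattern (not the SPARK-1051 patch).
    // Code executed inside doAs runs with the submitting user's identity, so
    // HDFS permission checks see that user instead of the yarn user.
    import java.security.PrivilegedExceptionAction
    import org.apache.hadoop.security.UserGroupInformation

    def runAsSubmittingUser[T](submittingUser: String)(body: => T): T = {
      val ugi = UserGroupInformation.createRemoteUser(submittingUser)
      ugi.doAs(new PrivilegedExceptionAction[T] {
        override def run(): T = body
      })
    }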



 On Thursday, March 20, 2014 1:35 PM, Bhaskar Dutta bhas...@gmail.com
 wrote:
   It will be great if SPARK-1101 (Umbrella for hardening Spark on YARN,
 https://spark-project.atlassian.net/browse/SPARK-1101) can get into 0.9.1.

 Thanks,
 Bhaskar


 On Thu, Mar 20, 2014 at 5:37 AM, Tathagata Das
 tathagata.das1...@gmail.com wrote:

Hello everyone,

 Since the release of Spark 0.9, we have received a number of important bug
 fixes and we would like to make a bug-fix release of Spark 0.9.1. We are
 going to cut a release candidate soon and we would love it if people test
 it out. We have backported several bug fixes into the 0.9 and updated JIRA
 accordingly:

 https://spark-project.atlassian.net/browse/SPARK-1275?jql=project%20in%20(SPARK%2C%20BLINKDB%2C%20MLI%2C%20MLLIB%2C%20SHARK%2C%20STREAMING%2C%20GRAPH%2C%20TACHYON)%20AND%20fixVersion%20%3D%200.9.1%20AND%20status%20in%20(Resolved%2C%20Closed)

 Please let me know if there are fixes that were not backported but you
 would like to see them in 0.9.1.

 Thanks!

 TD




Re: Spark 0.9.1 release

2014-03-24 Thread Tathagata Das
Hello Kevin,

A fix for SPARK-782 would definitely simplify building against Spark.
However, it's possible that a fix for this issue in 0.9.1 would break
existing 0.9 users' builds that reference Spark, either due to the change in
the ASM version or due to incompatibility with their current workarounds for
this issue. That is not a good idea for a
maintenance release, especially when 1.0 is not too far away.

Can you (and others) elaborate more on the current workarounds that
you have for this issue? It's best to understand all the implications
of this fix.

Note that in branch 0.9, it is not fixed, neither in SBT nor in Maven.

TD
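
(A typical downstream workaround of the sort asked about above is to keep only
one ASM on the application's classpath by excluding the transitive copies in
the application's own build. The sbt fragment below is only an illustration:
the coordinates, versions, and the choice of which dependency to exclude are
assumptions, not taken from this thread.)

    // Illustrative build.sbt fragment only -- versions and exclusions are assumed.
    // The idea is to let a single ASM version win on the classpath by excluding
    // the transitive copies pulled in by other dependencies.
    libraryDependencies ++= Seq(
      "org.apache.spark" %% "spark-core" % "0.9.0-incubating",
      ("org.apache.hadoop" % "hadoop-client" % "2.2.0")
        .exclude("asm", "asm")            // old ASM 3.x coordinates
        .exclude("org.ow2.asm", "asm")    // newer ASM coordinates
    )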

On Mon, Mar 24, 2014 at 4:38 PM, Kevin Markey kevin.mar...@oracle.com wrote:
 Is there any way that [SPARK-782] (Shade ASM) can be included?  I see that
 it is not currently backported to 0.9.  But there is no single issue that
 has caused us more grief as we integrate spark-core with other project
 dependencies.  There are way too many libraries out there, in addition to
 Spark 0.9 and earlier, that are not well-behaved in this respect (the ASM FAQ
 recommends shading), including some Hive and Hadoop libraries and a number of
 servlet libraries.  We can't control those, but if Spark were well-behaved in
 this regard, it would help.  Even for a maintenance release, and even if 1.0
 is only 6 weeks away!

 (For those not following 782, according to Jira comments, the SBT build
 shades it, but it is the Maven build that ends up in Maven Central.)

 Thanks
 Kevin Markey
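
 (For readers unfamiliar with SPARK-782: "shading" here means relocating the
 bundled ASM classes into a private package so they cannot collide with other
 ASM copies on the classpath. The sketch below uses the ShadeRule API from
 later sbt-assembly releases purely as an illustration of what relocation looks
 like; it is not Spark's actual build configuration, and the target package
 name is made up.)

    // build.sbt sketch, assuming the sbt-assembly plugin is enabled.
    // Rewrite every org.objectweb.asm class into a private package inside the
    // assembly jar so it cannot clash with ASM versions pulled in by Hive,
    // Hadoop, or servlet libraries.
    assemblyShadeRules in assembly := Seq(
      ShadeRule.rename("org.objectweb.asm.**" -> "org.sparkproject.asm.@1").inAll
    )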




 On 03/19/2014 06:07 PM, Tathagata Das wrote:

   Hello everyone,

 Since the release of Spark 0.9, we have received a number of important bug
 fixes and we would like to make a bug-fix release of Spark 0.9.1. We are
 going to cut a release candidate soon and we would love it if people test
 it out. We have backported several bug fixes into the 0.9 and updated JIRA

 accordingly:
 https://spark-project.atlassian.net/browse/SPARK-1275?jql=project%20in%20(SPARK%2C%20BLINKDB%2C%20MLI%2C%20MLLIB%2C%20SHARK%2C%20STREAMING%2C%20GRAPH%2C%20TACHYON)%20AND%20fixVersion%20%3D%200.9.1%20AND%20status%20in%20(Resolved%2C%20Closed)

 Please let me know if there are fixes that were not backported but you
 would like to see them in 0.9.1.

 Thanks!

 TD




Spark 0.9.1 release

2014-03-19 Thread Tathagata Das
 Hello everyone,

Since the release of Spark 0.9, we have received a number of important bug
fixes and we would like to make a bug-fix release of Spark 0.9.1. We are
going to cut a release candidate soon and we would love it if people test
it out. We have backported several bug fixes into the 0.9 and updated JIRA
accordingly:
https://spark-project.atlassian.net/browse/SPARK-1275?jql=project%20in%20(SPARK%2C%20BLINKDB%2C%20MLI%2C%20MLLIB%2C%20SHARK%2C%20STREAMING%2C%20GRAPH%2C%20TACHYON)%20AND%20fixVersion%20%3D%200.9.1%20AND%20status%20in%20(Resolved%2C%20Closed)
Please let me know if there are fixes that were not backported but you
would like to see them in 0.9.1.

Thanks!

TD