Re: failure notice

2015-10-06 Thread Renyi Xiong
yes, it can recover on a different node. it uses a write-ahead log,
checkpoints offsets of both ingress and egress (e.g. using zookeeper and/or
kafka), and relies on the streaming engine's deterministic operations.

by replaying a certain range of data starting from the checkpointed
ingress offset (at-least-once semantics), state can be recovered, and
duplicate events are filtered out based on the checkpointed egress offset
(at-most-once semantics).

hope it makes sense.
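
A purely illustrative sketch of the replay-plus-dedup idea described above
(none of these names come from a real API; they only make the two offsets and
their semantics concrete):

case class Event(offset: Long, payload: String)

// Rebuild state by replaying from the last checkpointed ingress offset
// (at-least-once), and drop anything at or below the checkpointed egress
// offset as already emitted (at-most-once on the output side).
def recover(
    checkpointedIngress: Long,
    checkpointedEgress: Long,
    replayFrom: Long => Iterator[Event],        // e.g. re-read from Kafka
    applyToState: Event => Unit,                // deterministic state update
    emit: Event => Unit): Unit = {
  replayFrom(checkpointedIngress).foreach { e =>
    applyToState(e)                             // replay rebuilds the state
    if (e.offset > checkpointedEgress) emit(e)  // skip already-emitted events
  }
}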

On Mon, Oct 5, 2015 at 3:11 PM, Tathagata Das  wrote:

> What happens when a whole node running  your " per node streaming engine
> (built-in checkpoint and recovery)" fails? Can its checkpoint and recovery
> mechanism handle whole node failure? Can you recover from the checkpoint on
> a different node?
>
> Spark and Spark Streaming were designed with the idea that executors are
> disposable, and there should not be any node-specific long term state that
> you rely on unless you can recover that state on a different node.
>
> On Mon, Oct 5, 2015 at 3:03 PM, Renyi Xiong  wrote:
>
>> if RDDs from same DStream not guaranteed to run on same worker, then the
>> question becomes:
>>
>> is it possible to specify an unlimited duration in ssc to have a
>> continuous stream (as opposed to discretized).
>>
>> say, we have a per node streaming engine (built-in checkpoint and
>> recovery) we'd like to integrate with spark streaming. can we have a
>> never-ending batch (or RDD) this way?
>>
>> On Mon, Sep 28, 2015 at 4:31 PM,  wrote:
>>
>>> Hi. This is the qmail-send program at apache.org.
>>> I'm afraid I wasn't able to deliver your message to the following
>>> addresses.
>>> This is a permanent error; I've given up. Sorry it didn't work out.
>>>
>>> :
>>> Must be sent from an @apache.org address or a subscriber address or an
>>> address in LDAP.
>>>

FW: Spark error while running in spark mode

2015-10-06 Thread Ratika Prasad


From: Ratika Prasad
Sent: Monday, October 05, 2015 2:39 PM
To: u...@spark.apache.org
Cc: Ameeta Jayarajan 
Subject: Spark error while running in spark mode

Hi,

When we run our Spark component in cluster mode as below, we get the following
error:

./bin/spark-submit --class 
com.coupons.stream.processing.SparkStreamEventProcessingEngine --master 
spark://172.28.161.138:7077 
EventProcessingEngine-0.0.1-SNAPSHOT-jar-with-dependencies.jar

ERROR ErrorMonitor: dropping message [class akka.actor.ActorSelectionMessage] 
for non-local recipient [Actor[akka.tcp://sparkMaster@172.28.161.138:7077/]] 
arriving at [akka.tcp://sparkMaster@172.28.161.138:7077] inbound addresses are 
[akka.tcp://sparkDriver@172.28.161.138:7077]
akka.event.Logging$Error$NoCause$


Kindly help







Pyspark dataframe read

2015-10-06 Thread Blaž Šnuderl
Hello everyone.

It seems pyspark dataframe read is broken for reading multiple files.

sql.read.json( "file1,file2") fails with java.io.IOException: No input
paths specified in job.

This used to work in spark 1.4 and also still works with sc.textFile

Blaž
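
One possible workaround until this is resolved, sketched in the Scala API (the
same union-per-file approach applies in PySpark; the file names are just the
placeholders from the report):

val paths = Seq("file1", "file2")
// Read each path on its own and union the per-file DataFrames.
val df = paths.map(p => sqlContext.read.json(p)).reduce(_ unionAll _)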


How can I access data on RDDs?

2015-10-06 Thread jatinganhotra
Consider the following 2 scenarios:

*Scenario #1*
val pagecounts = sc.textFile("data/pagecounts")
pagecounts.checkpoint
pagecounts.count

*Scenario #2*
val pagecounts = sc.textFile("data/pagecounts")
pagecounts.count

The total time shown in the Spark shell Application UI was different for the
two scenarios: Scenario #1 took 0.5 seconds, while Scenario #2 took only 0.2 s.

*Questions:*
1. I understand that scenario #1 takes more time because the RDD is
checkpointed (written to disk). Is there a way to find out, from the total
time, how much was spent on checkpointing?

The Spark shell Application UI shows the following: scheduler delay, task
deserialization time, GC time, result serialization time, and getting result
time. But it doesn't show a breakdown for checkpointing.

2. Is there a way to access the above metrics (e.g. scheduler delay, GC time)
and save them programmatically? I want to log some of these metrics for
every action invoked on an RDD (a rough sketch of one approach follows below).

3. How can I programmatically access the following information:
- The size of an RDD when it is persisted to disk on checkpointing?
- What percentage of an RDD is currently in memory?
- The overall time taken to compute an RDD?

Please let me know if you need more information.
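
A rough sketch of one way to get at questions 2 and 3 programmatically,
assuming a spark-shell session where sc is in scope; it uses a SparkListener
for per-task metrics and getRDDStorageInfo for persisted sizes, and does not
recover the checkpoint-time breakdown asked about in question 1:

import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}

// Log a few per-task metrics as tasks finish (field names are those exposed
// by TaskMetrics).
class MetricsLogger extends SparkListener {
  override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
    val m = taskEnd.taskMetrics
    if (m != null) {
      println(s"stage=${taskEnd.stageId} task=${taskEnd.taskInfo.taskId} " +
        s"runTime=${m.executorRunTime}ms gc=${m.jvmGCTime}ms " +
        s"resultSer=${m.resultSerializationTime}ms")
    }
  }
}
sc.addSparkListener(new MetricsLogger)

// Per-RDD persisted size and cached fraction (question 3):
sc.getRDDStorageInfo.foreach { info =>
  println(s"${info.name}: mem=${info.memSize}B disk=${info.diskSize}B " +
    s"cached=${info.numCachedPartitions}/${info.numPartitions} partitions")
}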







Adding Spark Testing functionality

2015-10-06 Thread Holden Karau
Hi Spark Devs,

So this has been brought up a few times before, and generally on the user
list people get directed to use spark-testing-base. I'd like to start
moving some of spark-testing-base's functionality into Spark so that people
don't need a library to do what is (hopefully :p) a very common requirement
across all Spark projects.

To that end I was wondering what people's thoughts are on where this should
live inside of Spark. I was thinking it could either be a separate testing
project (like sql or similar), or just put the bits to enable testing
inside of each relevant project.

I was also thinking it probably makes sense to only move the unit testing
parts at the start and leave things like integration testing in a testing
project since that could vary depending on the user's environment.

What are people's thoughts?

Cheers,

Holden :)
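
For a sense of the kind of helper under discussion, a rough sketch of a
shared-context test trait (this is not the actual spark-testing-base API, only
an illustration of the shape of such functionality):

import org.apache.spark.{SparkConf, SparkContext}
import org.scalatest.{BeforeAndAfterAll, Suite}

// Share one local SparkContext across all tests in a suite and stop it when
// the suite finishes.
trait SharedLocalSparkContext extends BeforeAndAfterAll { self: Suite =>
  @transient var sc: SparkContext = _

  override def beforeAll(): Unit = {
    super.beforeAll()
    sc = new SparkContext(
      new SparkConf().setMaster("local[2]").setAppName("test"))
  }

  override def afterAll(): Unit = {
    if (sc != null) sc.stop()
    super.afterAll()
  }
}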


Re: Pyspark dataframe read

2015-10-06 Thread Koert Kuipers
i ran into the same thing in scala api. we depend heavily on comma
separated paths, and it no longer works.


On Tue, Oct 6, 2015 at 3:02 AM, Blaž Šnuderl  wrote:

> Hello everyone.
>
> It seems pyspark dataframe read is broken for reading multiple files.
>
> sql.read.json( "file1,file2") fails with java.io.IOException: No input
> paths specified in job.
>
> This used to work in spark 1.4 and also still work with sc.textFile
>
> Blaž
>


Re: Pyspark dataframe read

2015-10-06 Thread Reynold Xin
I think the problem is that comma is actually a legitimate character for
file name, and as a result ...

On Tuesday, October 6, 2015, Josh Rosen  wrote:

> Could someone please file a JIRA to track this?
> https://issues.apache.org/jira/browse/SPARK
>
> On Tue, Oct 6, 2015 at 1:21 AM, Koert Kuipers  > wrote:
>
>> i ran into the same thing in scala api. we depend heavily on comma
>> separated paths, and it no longer works.
>>
>>
>> On Tue, Oct 6, 2015 at 3:02 AM, Blaž Šnuderl > > wrote:
>>
>>> Hello everyone.
>>>
>>> It seems pyspark dataframe read is broken for reading multiple files.
>>>
>>> sql.read.json( "file1,file2") fails with java.io.IOException: No input
>>> paths specified in job.
>>>
>>> This used to work in spark 1.4 and also still work with sc.textFile
>>>
>>> Blaž
>>>
>>
>>
>


Re: Pyspark dataframe read

2015-10-06 Thread Koert Kuipers
i personally find the comma separated paths feature much more important
than commas in paths (which one could argue you should avoid).

but assuming people want to keep commas as legitimate characters in paths:
https://issues.apache.org/jira/browse/SPARK-10185
https://github.com/apache/spark/pull/8416



On Tue, Oct 6, 2015 at 4:31 AM, Reynold Xin  wrote:

> I think the problem is that comma is actually a legitimate character for
> file name, and as a result ...
>
>
> On Tuesday, October 6, 2015, Josh Rosen  wrote:
>
>> Could someone please file a JIRA to track this?
>> https://issues.apache.org/jira/browse/SPARK
>>
>> On Tue, Oct 6, 2015 at 1:21 AM, Koert Kuipers  wrote:
>>
>>> i ran into the same thing in scala api. we depend heavily on comma
>>> separated paths, and it no longer works.
>>>
>>>
>>> On Tue, Oct 6, 2015 at 3:02 AM, Blaž Šnuderl  wrote:
>>>
 Hello everyone.

 It seems pyspark dataframe read is broken for reading multiple files.

 sql.read.json( "file1,file2") fails with java.io.IOException: No input
 paths specified in job.

 This used to work in spark 1.4 and also still work with sc.textFile

 Blaž

>>>
>>>
>>


Re: StructType has more rows, than corresponding Row has objects.

2015-10-06 Thread Eugene Morozov
Davies,

that seemed to be my issue; my colleague helped me resolve it. The
problem was that we build the RDD and the corresponding StructType
ourselves (no JSON, Parquet, Cassandra, etc. - we take a list of business
objects and convert them to Rows, then infer the struct type), and I missed
one thing.
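
For context, the hand-built pattern being described looks roughly like this
(illustrative schema and values only, not the poster's code); the pitfall is
that every Row must carry exactly one value per StructField, in the same order
as the schema:

import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

// Schema and matching rows built by hand, then turned into a DataFrame.
val schema = StructType(Seq(
  StructField("id", LongType, nullable = false),
  StructField("name", StringType, nullable = true)))

val rows = sc.parallelize(Seq(Row(1L, "a"), Row(2L, "b")))
val df = sqlContext.createDataFrame(rows, schema)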
--
Be well!
Jean Morozov

On Tue, Oct 6, 2015 at 1:58 AM, Davies Liu  wrote:

> Could you tell us a way to reproduce this failure? Reading from JSON or
> Parquet?
>
> On Mon, Oct 5, 2015 at 4:28 AM, Eugene Morozov
>  wrote:
> > Hi,
> >
> > We're building our own framework on top of spark and we give users pretty
> > complex schema to work with. That requires from us to build dataframes by
> > ourselves: we transform business objects to rows and struct types and
> uses
> > these two to create dataframe.
> >
> > Everything was fine until I started to upgrade to spark 1.5.0 (from
> 1.3.1).
> > Seems to be catalyst engine has been changed and now using almost the
> same
> > code to produce rows and struct types I have the following:
> > http://ibin.co/2HzUsoe9O96l, some of rows in the end result have
> different
> > number of values and corresponding struct types.
> >
> > I'm almost sure it's my own fault, but there is always a small chance,
> that
> > something is wrong in spark codebase. If you've seen something similar
> or if
> > there is a jira for smth similar, I'd be glad to know. Thanks.
> > --
> > Be well!
> > Jean Morozov
>


Re: Pyspark dataframe read

2015-10-06 Thread Josh Rosen
Could someone please file a JIRA to track this?
https://issues.apache.org/jira/browse/SPARK

On Tue, Oct 6, 2015 at 1:21 AM, Koert Kuipers  wrote:

> i ran into the same thing in scala api. we depend heavily on comma
> separated paths, and it no longer works.
>
>
> On Tue, Oct 6, 2015 at 3:02 AM, Blaž Šnuderl  wrote:
>
>> Hello everyone.
>>
>> It seems pyspark dataframe read is broken for reading multiple files.
>>
>> sql.read.json( "file1,file2") fails with java.io.IOException: No input
>> paths specified in job.
>>
>> This used to work in spark 1.4 and also still work with sc.textFile
>>
>> Blaž
>>
>
>


Re: failure notice

2015-10-06 Thread Tathagata Das
Unfortunately, there is not an obvious way to do this. I am guessing that
you want to partition your stream such that the same keys always go to the
same executor, right?

You could do it by writing a custom RDD. See ShuffledRDD, which is what is
used to do a lot of the shuffling; see how it is used from RDD.partitionBy()
or RDD.reduceByKey(). You could subclass it to specify a set of preferred
locations, and the system will try to respect those locations. These
locations should be among the currently active executors. You can get the
current list of executors from SparkContext.getExecutorMemoryStatus().
Hope this helps.

On Tue, Oct 6, 2015 at 8:27 AM, Renyi Xiong  wrote:

> yes, it can recover on a different node. it uses a write-ahead log,
> checkpoints offsets of both ingress and egress (e.g. using zookeeper and/or
> kafka), and relies on the streaming engine's deterministic operations.
>
> by replaying a certain range of data starting from the checkpointed
> ingress offset (at-least-once semantics), state can be recovered, and
> duplicate events are filtered out based on the checkpointed egress offset
> (at-most-once semantics).
>
> hope it makes sense.
>
> On Mon, Oct 5, 2015 at 3:11 PM, Tathagata Das  wrote:
>
>> What happens when a whole node running  your " per node streaming engine
>> (built-in checkpoint and recovery)" fails? Can its checkpoint and recovery
>> mechanism handle whole node failure? Can you recover from the checkpoint on
>> a different node?
>>
>> Spark and Spark Streaming were designed with the idea that executors are
>> disposable, and there should not be any node-specific long term state that
>> you rely on unless you can recover that state on a different node.
>>
>> On Mon, Oct 5, 2015 at 3:03 PM, Renyi Xiong 
>> wrote:
>>
>>> if RDDs from same DStream not guaranteed to run on same worker, then the
>>> question becomes:
>>>
>>> is it possible to specify an unlimited duration in ssc to have a
>>> continuous stream (as opposed to discretized).
>>>
>>> say, we have a per node streaming engine (built-in checkpoint and
>>> recovery) we'd like to integrate with spark streaming. can we have a
>>> never-ending batch (or RDD) this way?
>>>
>>> On Mon, Sep 28, 2015 at 4:31 PM,  wrote:
>>>
 Hi. This is the qmail-send program at apache.org.
 I'm afraid I wasn't able to deliver your message to the following
 addresses.
 This is a permanent error; I've given up. Sorry it didn't work out.

 :
 Must be sent from an @apache.org address or a subscriber address or an
 address in LDAP.


multiple count distinct in SQL/DataFrame?

2015-10-06 Thread Reynold Xin
The current implementation of multiple count distinct in a single query is
very inferior in terms of performance and robustness, and it is also hard
to guarantee correctness of the implementation in some of the refactorings
for Tungsten. Supporting a better version of it is possible in the future,
but will take a lot of engineering efforts. Most other Hadoop-based SQL
systems (e.g. Hive, Impala) don't support this feature.

As a result, we are considering removing support for multiple count
distinct in a single query in the next Spark release (1.6). If you use this
feature, please reply to this email. Thanks.

Note that if you don't care about null values, it is relatively easy to
reconstruct a query using joins to support multiple distincts.
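
For illustration, a hedged sketch of that join-based rewrite in the DataFrame
API, using the colA/colB/foo names from this thread (it assumes neither column
contains NULLs and that foo is a registered table): each distinct count
becomes a one-row result, and a cartesian join stitches the rows back together.

import org.apache.spark.sql.functions.countDistinct

val foo = sqlContext.table("foo")
val cntA = foo.select(countDistinct("colA").as("cnt_a"))  // one row
val cntB = foo.select(countDistinct("colB").as("cnt_b"))  // one row
cntA.join(cntB).show()  // single row carrying both counts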


Re: multiple count distinct in SQL/DataFrame?

2015-10-06 Thread Reynold Xin
To provide more context, if we do remove this feature, the following SQL
query would throw an AnalysisException:

select count(distinct colA), count(distinct colB) from foo;

The following should still work:

select count(distinct colA) from foo;

The following should also work:

select count(distinct colA, colB) from foo;


On Tue, Oct 6, 2015 at 5:51 PM, Reynold Xin  wrote:

> The current implementation of multiple count distinct in a single query is
> very inferior in terms of performance and robustness, and it is also hard
> to guarantee correctness of the implementation in some of the refactorings
> for Tungsten. Supporting a better version of it is possible in the future,
> but will take a lot of engineering efforts. Most other Hadoop-based SQL
> systems (e.g. Hive, Impala) don't support this feature.
>
> As a result, we are considering removing support for multiple count
> distinct in a single query in the next Spark release (1.6). If you use this
> feature, please reply to this email. Thanks.
>
> Note that if you don't care about null values, it is relatively easy to
> reconstruct a query using joins to support multiple distincts.
>
>
>


Re: Adding Spark Testing functionality

2015-10-06 Thread Patrick Wendell
Hey Holden,

It would be helpful if you could outline the set of features you'd imagine
being part of Spark in a short doc. I didn't see a README on the existing
repo, so it's hard to know exactly what is being proposed.

As a general point of process, we've typically avoided merging modules into
Spark that can exist outside of the project. A testing utility package that
is based on Spark's public APIs seems like a really useful thing for the
community, but it does seem like a good fit for a package library. At
least, this is my first question after taking a look at the project.

In any case, getting some high level view of the functionality you imagine
would be helpful to give more detailed feedback.

- Patrick

On Tue, Oct 6, 2015 at 3:12 PM, Holden Karau  wrote:

> Hi Spark Devs,
>
> So this has been brought up a few times before, and generally on the user
> list people get directed to use spark-testing-base. I'd like to start
> moving some of spark-testing-base's functionality into Spark so that people
> don't need a library to do what is (hopefully :p) a very common requirement
> across all Spark projects.
>
> To that end I was wondering what peoples thoughts are on where this should
> live inside of Spark. I was thinking it could either be a separate testing
> project (like sql or similar), or just put the bits to enable testing
> inside of each relevant project.
>
> I was also thinking it probably makes sense to only move the unit testing
> parts at the start and leave things like integration testing in a testing
> project since that could vary depending on the users environment.
>
> What are peoples thoughts?
>
> Cheers,
>
> Holden :)
>


Re: Adding Spark Testing functionality

2015-10-06 Thread Holden Karau
I'll put together a google doc and send that out (in the meantime, a quick
guide to how the current package can be used is in the blog post I did at
http://blog.cloudera.com/blog/2015/09/making-apache-spark-testing-easy-with-spark-testing-base/
). If people think it's better to keep it as a package I am of course happy to
keep doing that. It feels a little strange to have something as core as
being able to test your code live outside the project.

On Tue, Oct 6, 2015 at 3:44 PM, Patrick Wendell  wrote:

> Hey Holden,
>
> It would be helpful if you could outline the set of features you'd imagine
> being part of Spark in a short doc. I didn't see a README on the existing
> repo, so it's hard to know exactly what is being proposed.
>
> As a general point of process, we've typically avoided merging modules
> into Spark that can exist outside of the project. A testing utility package
> that is based on Spark's public API's seems like a really useful thing for
> the community, but it does seem like a good fit for a package library. At
> least, this is my first question after taking a look at the project.
>
> In any case, getting some high level view of the functionality you imagine
> would be helpful to give more detailed feedback.
>
> - Patrick
>
> On Tue, Oct 6, 2015 at 3:12 PM, Holden Karau  wrote:
>
>> Hi Spark Devs,
>>
>> So this has been brought up a few times before, and generally on the user
>> list people get directed to use spark-testing-base. I'd like to start
>> moving some of spark-testing-base's functionality into Spark so that people
>> don't need a library to do what is (hopefully :p) a very common requirement
>> across all Spark projects.
>>
>> To that end I was wondering what peoples thoughts are on where this
>> should live inside of Spark. I was thinking it could either be a separate
>> testing project (like sql or similar), or just put the bits to enable
>> testing inside of each relevant project.
>>
>> I was also thinking it probably makes sense to only move the unit testing
>> parts at the start and leave things like integration testing in a testing
>> project since that could vary depending on the users environment.
>>
>> What are peoples thoughts?
>>
>> Cheers,
>>
>> Holden :)
>>
>
>


-- 
Cell : 425-233-8271
Twitter: https://twitter.com/holdenkarau
Linked In: https://www.linkedin.com/in/holdenkarau


Re: SparkR dataframe UDF

2015-10-06 Thread Hossein
User-defined functions written in R are not supported yet. You can implement
your UDF in Scala, register it in the sqlContext, and use it from SparkR,
provided that you share your context between R and Scala.
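
A minimal sketch of the Scala side of that suggestion (the UDF and table names
are illustrative): register the function by name on the shared sqlContext so
that SQL issued from SparkR against the same context can call it.

sqlContext.udf.register("toUpper", (s: String) => s.toUpperCase)

// Any SQL run against this shared context -- including sql(...) from SparkR --
// can now call toUpper:
sqlContext.sql("SELECT toUpper(name) FROM people").show()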

--Hossein

On Friday, October 2, 2015, Renyi Xiong  wrote:

> Hi Shiva,
>
> Is Dataframe UDF implemented in SparkR yet? - I could not find it in below
> URL
>
> https://github.com/hlin09/spark/tree/SparkR-streaming/R/pkg/R
>
> Thanks,
> Renyi.
>


-- 
--Hossein


Re: CQs on WindowedStream created on running StreamingContext

2015-10-06 Thread Yogesh Mahajan
Anyone know about this? TD?

-yogesh

> On 30-Sep-2015, at 1:25 pm, Yogs  wrote:
> 
> Hi, 
> 
> We intend to run adhoc windowed continuous queries on spark streaming data. 
> The queries could be registered/deregistered dynamically or can be submitted 
> through command line. Currently Spark streaming doesn’t allow adding any new 
> inputs, transformations, and output operations after starting a 
> StreamingContext. But doing following code changes in DStream.scala allows me 
> to create an window on DStream even after StreamingContext has started (in 
> StreamingContextState.ACTIVE). 
> 
> 1) In DStream.validateAtInit()
> Allowed adding new inputs, transformations, and output operations after 
> starting a streaming context
> 2) In DStream.persist()
> Allowed to change storage level of an DStream after streaming context has 
> started
> 
> Ultimately the window api just does slice on the parentRDD and returns 
> allRDDsInWindow.
> We create DataFrames out of these RDDs from this particular WindowedDStream, 
> and evaluate queries on those DataFrames. 
> 
> 1) Do you see any challenges and consequences with this approach ? 
> 2) Will these on the fly created WindowedDStreams be accounted properly in 
> Runtime and memory management?
> 3) What is the reason we do not allow creating new windows with 
> StreamingContextState.ACTIVE state?
> 4) Does it make sense to add our own implementation of WindowedDStream in 
> this case?
> 
> - Yogesh 
> 
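
For reference, the normally supported pattern that the post wants to relax
looks roughly like this (stream source, durations, and names are made up): the
window and its output operation are wired up before ssc.start().

import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(sc, Seconds(1))
val lines = ssc.socketTextStream("localhost", 9999)

lines.window(Seconds(60), Seconds(10)).foreachRDD { rdd =>
  val df = sqlContext.createDataFrame(rdd.map(Tuple1.apply)).toDF("line")
  df.registerTempTable("last_minute")  // ad hoc SQL can query this window
}

ssc.start()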

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org