Trigger.ProcessingTime("10 seconds") & Trigger.Continuous(10.seconds)

2018-02-25 Thread naresh Goud
Hello Spark Experts,

What is the difference between Trigger.Continuous(10.seconds) and
Trigger.ProcessingTime("10 seconds") ?



Thank you,
Naresh


CATALYST rule join

2018-02-25 Thread tan shai
Hi,

I need to write a rule to customize the join function using the Spark
Catalyst optimizer. The objective is to duplicate the rows of the second
dataset using this process:

- Execute a UDF on the column called x; this UDF returns an array

- Execute an explode function on the new column

In SQL terms, my objective is to execute this query on the second table:

SELECT EXPLODE(foo(x)) FROM table2

where `foo` is a UDF that returns an array of elements.

I have this rule:

case class JoinRule(spark: SparkSession) extends Rule[LogicalPlan] {
  override def apply(plan: LogicalPlan): LogicalPlan = plan transform {
    case join @ Join(left, right, _, Some(condition)) =>
      // Find the attribute named "x" in the right child's output
      val attr = right.outputSet.find(_.name == "x")
      val udf = ScalaUDF((x: Long) => x +: f(ipix), ArrayType(LongType),
        Seq(attr.get))
      val explode = Explode(udf)
      val resolvedGenerator = Generate(explode, join = true, outer = false,
        qualifier = None, udf.references.toSeq, right)
      val newRight = Project(resolvedGenerator.output, resolvedGenerator)
      Join(left, newRight, Inner, Option(condition))
  }
}

But the problem is that the `Generate explode` operation appears many times
in the physical plan.

Do you have any other ideas? Maybe rewriting the code.
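One likely cause of the duplication is that the rule's output still contains a Join with a condition, so the same case re-matches its own result on later transform passes, adding a fresh Generate each time. One way to avoid a custom Catalyst rule entirely is to express the explode with the DataFrame API before joining, so the Generate node appears exactly once. A minimal sketch, where `foo`, the column names, and the sample data are placeholders standing in for the real UDF and tables:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, explode, udf}

object ExplodeBeforeJoin {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("explode-before-join").getOrCreate()
    import spark.implicits._

    // `foo` stands in for the user's UDF that returns an array
    val foo = udf((x: Long) => Array(x, x + 1))

    val table1 = Seq((1L, "a"), (2L, "b")).toDF("x", "v1")
    val table2 = Seq((1L, "c"), (2L, "d")).toDF("x", "v2")

    // SELECT EXPLODE(foo(x)) ... applied once, before the join,
    // so only one Generate node ends up in the plan
    val exploded = table2.withColumn("x", explode(foo(col("x"))))

    table1.join(exploded, "x").show()
  }
}
```

Since this runs before the optimizer, there is no risk of a rule re-firing on its own output.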

Thank you.


Re: Spark structured streaming: periodically refresh static data frame

2018-02-25 Thread naresh Goud
Appu,

I have also run into the same problem.

Were you able to solve this issue? Could you please share a snippet of the
code if you were able to?

Thanks,
Naresh

On Wed, Feb 14, 2018 at 8:04 PM, Tathagata Das 
wrote:

> 1. Just loop like this.
>
>
> def startQuery(): StreamingQuery = {
>// Define the dataframes and start the query
> }
>
> // call this on main thread
> while (notShutdown) {
>val query = startQuery()
>query.awaitTermination(refreshIntervalMs)
>query.stop()
>// refresh static data
> }
>
>
> 2. Yes, stream-stream joins are in 2.3.0, soon to be released. RC3 is
> available if you want to test it right now -
> https://dist.apache.org/repos/dist/dev/spark/v2.3.0-rc3-bin/
>
>
>
> On Wed, Feb 14, 2018 at 3:34 AM, Appu K  wrote:
>
>> TD,
>>
>> Thanks a lot for the quick reply :)
>>
>>
>> Did I understand it right that, in the main thread, I will not be able to
>> use outStream.awaitTermination() to wait for the termination of the
>> context - [ since I'll be stopping it inside another thread ]
>>
>> What would be a good approach to keep the main app long running if I have
>> to restart queries?
>>
>> Should I just wait for 2.3, where I'll be able to join two structured
>> streams ( if the release is just a few weeks away )?
>>
>> Appreciate all the help!
>>
>> thanks
>> App
>>
>>
>>
>> On 14 February 2018 at 4:41:52 PM, Tathagata Das (
>> tathagata.das1...@gmail.com) wrote:
>>
>> Let me fix my mistake :)
>> What I suggested in that earlier thread does not work. The streaming
>> query that joins a streaming dataset with a batch view, does not correctly
>> pick up when the view is updated. It works only when you restart the query.
>> That is,
>> - stop the query
>> - recreate the dataframes,
>> - start the query on the new dataframe using the same checkpoint location
>> as the previous query
>>
>> Note that you don't need to restart the whole process/cluster/application,
>> just restart the query in the same process/cluster/application. This should
>> be very fast (within a few seconds). So, unless you have latency SLAs of 1
>> second, you can periodically restart the query without restarting the
>> process.
>>
>> Apologies for my misdirections in that earlier thread. Hope this helps.
>>
>> TD
>>
>> On Wed, Feb 14, 2018 at 2:57 AM, Appu K  wrote:
>>
>>> More specifically,
>>>
>>> Quoting TD from the previous thread
>>> "Any streaming query that joins a streaming dataframe with the view will
>>> automatically start using the most updated data as soon as the view is
>>> updated”
>>>
>>> Wondering if I'm doing something wrong in
>>> https://gist.github.com/anonymous/90dac8efadca3a69571e619943ddb2f6
>>>
>>> My streaming dataframe is not using the updated data, even though the
>>> view is updated!
>>>
>>> Thank you
>>>
>>>
>>> On 14 February 2018 at 2:54:48 PM, Appu K (kut...@gmail.com) wrote:
>>>
>>> Hi,
>>>
>>> I had followed the instructions from the thread
>>> https://mail-archives.apache.org/mod_mbox/spark-user/201704.mbox/%3CD1315D33-41CD-4ba3-8b77-0879f3669...@qvantel.com%3E
>>> while trying to reload a static data frame periodically that gets joined
>>> to a structured streaming query.
>>>
>>> However, the streaming query results do not reflect the data from the
>>> refreshed static data frame.
>>>
>>> Code is here:
>>> https://gist.github.com/anonymous/90dac8efadca3a69571e619943ddb2f6
>>>
>>> I'm using Spark 2.2.1. Any pointers would be highly helpful.
>>>
>>> Thanks a lot
>>>
>>> Appu
>>>
>>>
>>
>
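TD's restart loop above can be fleshed out roughly as follows. The source, paths, and join key are placeholders; the two points that matter are that the static frame is re-read from scratch on every iteration, and that the checkpoint location stays the same across restarts so the query resumes where it left off:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.StreamingQuery

object PeriodicRefresh {
  val refreshIntervalMs = 15 * 60 * 1000L // refresh the static side every 15 min

  def startQuery(spark: SparkSession): StreamingQuery = {
    // Re-read the static frame from scratch each time the query starts
    val staticDf = spark.read.parquet("/path/to/static")   // placeholder path
    val streamDf = spark.readStream
      .format("kafka")                                     // placeholder source
      .option("kafka.bootstrap.servers", "host:9092")
      .option("subscribe", "events")
      .load()

    streamDf.join(staticDf, "key")                         // placeholder join key
      .writeStream
      .format("parquet")
      .option("path", "/path/to/out")
      .option("checkpointLocation", "/path/to/checkpoint") // same location each restart
      .start()
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("periodic-refresh").getOrCreate()
    val shutdown = false // replace with a real shutdown signal
    while (!shutdown) {
      val query = startQuery(spark)
      query.awaitTermination(refreshIntervalMs) // run until the interval elapses
      query.stop() // next loop iteration re-reads the static data
    }
  }
}
```

As TD notes, the restart itself takes only a few seconds, so this is fine unless the latency SLA is around one second.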


Unsubscribe

2018-02-25 Thread Sudheesh K J


Thanks & Regards

Sudheesh K J
Tata Consultancy Services Limited,
TCS Centre SEZ Unit,
Infopark PO,
Kochi - 682042
Kerala, India
Cell Number: 9746068506
Mail to: sudheesh...@tcs.com
Web site: www.tcs.com

From: Brindha Sengottaiyan 
Sent: 24 February 2018 03:06 AM
To: user@spark.apache.org
Subject: Unsubscribe






Unsubscribe

2018-02-25 Thread Anu B Nair



unsubscribe

2018-02-25 Thread qiming zhang
unsubscribe

Unsubscribe

2018-02-25 Thread Sandeep Kr. Choudhary
Thanks

Regards

Sandeep Kumar Choudhary
IIT Delhi
Contact: +917042645119


Unsubscribe

2018-02-25 Thread Bhaskar Kvvsr
-- 
*Thanks and Regards,*
*Rama Bhaskar*


Re: Trigger.ProcessingTime("10 seconds") & Trigger.Continuous(10.seconds)

2018-02-25 Thread Tathagata Das
The continuous one is our new low-latency continuous processing engine in
Structured Streaming (to be released in 2.3).
Here is the pre-release doc -
https://dist.apache.org/repos/dist/dev/spark/v2.3.0-rc5-docs/_site/structured-streaming-programming-guide.html#continuous-processing
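For comparison, both triggers are set the same way on the writer, but they select different engines: Trigger.ProcessingTime runs the micro-batch engine and starts a new batch at the given interval, while Trigger.Continuous runs the new continuous engine, where records are processed as they arrive and the interval only controls how often progress is checkpointed. A sketch, with the rate source and console sink as placeholder choices:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.Trigger
import scala.concurrent.duration._

object TriggerComparison {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("trigger-comparison").getOrCreate()
    val df = spark.readStream.format("rate").load()

    // Micro-batch engine: a new batch is started every 10 seconds
    df.writeStream.format("console")
      .trigger(Trigger.ProcessingTime("10 seconds"))
      .start()

    // Continuous engine (Spark 2.3+): records are processed continuously;
    // 10 seconds is only the checkpointing frequency, not a batch interval
    df.writeStream.format("console")
      .trigger(Trigger.Continuous(10.seconds))
      .start()
  }
}
```

Note also that the continuous engine in 2.3 supports only a subset of sources, sinks, and operations (see the linked guide), so Trigger.ProcessingTime remains the default choice.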

On Sun, Feb 25, 2018 at 12:26 PM, naresh Goud 
wrote:

> Hello Spark Experts,
>
> What is the difference between Trigger.Continuous(10.seconds) and
> Trigger.ProcessingTime("10 seconds") ?
>
>
>
> Thank you,
> Naresh
>


unsubscribe

2018-02-25 Thread Arun Khetarpal