
Re: Passing an array of more than 22 elements in a UDF

2017-12-25 Thread Aakash Basu
What's the advantage of using that specific version for this? Please shed
some light on it.

On Mon, Dec 25, 2017 at 6:51 AM, Felix Cheung 
wrote:

> Or use it with Scala 2.11?
>
> --
> *From:* ayan guha 
> *Sent:* Friday, December 22, 2017 3:15:14 AM
> *To:* Aakash Basu
> *Cc:* user
> *Subject:* Re: Passing an array of more than 22 elements in a UDF
>
> Hi, I think you are on the right track. You can pack all your params into a
> suitable data structure like an array or dict and pass that structure as a
> single param to your UDF.
>
> On Fri, 22 Dec 2017 at 2:55 pm, Aakash Basu 
> wrote:
>
>> Hi,
>>
>> I am using Spark 2.2 with Java. Can anyone please suggest how to take
>> more than 22 parameters in a UDF? For example, can I pass all the
>> parameters as an array of integers?
>>
>> Thanks,
>> Aakash.
>>
> --
> Best Regards,
> Ayan Guha
>
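For reference, a minimal sketch of the approach described above (packing the values into a single array column and passing that one column to the UDF), written in Scala against the Spark 2.2 DataFrame API. The DataFrame, column names, and the summing UDF are illustrative placeholders, not code from the original thread.

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.{array, col, udf}

    val spark = SparkSession.builder.appName("array-udf-sketch").getOrCreate()
    import spark.implicits._

    // Hypothetical DataFrame with several integer columns.
    val df = Seq((1, 2, 3), (4, 5, 6)).toDF("c1", "c2", "c3")

    // Instead of a UDF with 22+ scalar arguments, take a single Seq[Int] argument.
    val sumAll = udf((xs: Seq[Int]) => xs.sum)

    // Pack the columns into one array column and pass that single column to the UDF.
    val result = df.withColumn("total", sumAll(array(col("c1"), col("c2"), col("c3"))))
    result.show()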


Re: Which kafka client to use with spark streaming

2017-12-25 Thread Diogo Munaro Vieira
Hey Serkan, it depends on your Kafka version... Is it 0.8.2?

On Dec 25, 2017 06:17, "Serkan TAS"  wrote:

> Hi,
>
>
>
> Working on a Spark 2.2.0 cluster with Kafka 1.0 brokers.
>
>
>
> I was using the library
>
> "org.apache.spark" % "spark-streaming-kafka-0-10_2.11" % "2.2.0"
>
>
>
> and had lots of problems during the streaming process, so I downgraded to
>
> "org.apache.spark" % "spark-streaming-kafka-0-8_2.11" % "2.2.0"
>
>
>
> And I know there is also another path, which is using the kafka-clients
> jars, whose latest version is 1.0.0:
>
>
>
> 
>
> 
>
> <dependency>
>     <groupId>org.apache.kafka</groupId>
>     <artifactId>kafka-clients</artifactId>
>     <version>1.0.0</version>
> </dependency>
>
> 
>
>
>
> I am confused about which path is the right one.
>
>
>
> Thanks…
>
>
>
>
>
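For context on this thread, a minimal sketch in Scala of the spark-streaming-kafka-0-10 direct stream API, which is the integration documented for the 0.10+ client protocol that Kafka 1.0 brokers also support. The broker address, topic, group id, and batch interval below are placeholders.

    import org.apache.kafka.common.serialization.StringDeserializer
    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka010.{ConsumerStrategies, KafkaUtils, LocationStrategies}

    val conf = new SparkConf().setAppName("kafka-010-direct-stream-sketch")
    val ssc = new StreamingContext(conf, Seconds(10))

    // Standard consumer settings; bootstrap.servers and group.id are placeholders.
    val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> "broker1:9092",
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id" -> "example-group",
      "auto.offset.reset" -> "latest",
      "enable.auto.commit" -> (false: java.lang.Boolean)
    )

    val stream = KafkaUtils.createDirectStream[String, String](
      ssc,
      LocationStrategies.PreferConsistent,
      ConsumerStrategies.Subscribe[String, String](Seq("example-topic"), kafkaParams)
    )

    // Just print the message values for each batch.
    stream.map(_.value).print()

    ssc.start()
    ssc.awaitTermination()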


Is there a way to make the broker merge big result set faster?

2017-12-25 Thread Mu Kong
Hi, community,

I have a subquery that runs slowly on our Druid cluster.

The inner query yields these fields:

SELECT D1, D2, D3, MAX(M1) AS MAX_M1
FROM SOME_TABLE
GROUP BY D1, D2, D3

Then, the outer query looks like:

SELECT D1, D2, SUM(MAX_M1)
FROM INNER_QUERY
GROUP BY D1, D2

D3 is a high-cardinality dimension, which makes the result set of the
inner query very large. Still, the inner query itself takes only 1~2 seconds
to "process" and transfer the data to the broker.

The outer query, however, takes 40 seconds to process.

As far as I understand how the broker works with the historicals, I think
Druid simply fetches the result of each segment directly from the historicals'
memory for the inner query, so there isn't any real computation when Druid
handles the inner query. However, once the inner query finishes, all the
results from the historicals are passed to one single broker to merge.
In my case, because the result set from the inner query is tremendous, this
merging phase takes a long time to finish.

I think the situation mentioned in this thread is quite similar to my case:
https://groups.google.com/d/msg/druid-user/ir7hRpxg0PI/3oqCDAwoPjMJ
Gian mentioned "Historical merging", and I have tried that by disabling the
broker cache, but it didn't really make the query faster.

Is there any other way to make the broker merge faster?

Thanks!


Best regards,
Mu


Re: Apache Spark - (2.2.0) - window function for DataSet

2017-12-25 Thread Diogo Munaro Vieira
The window function requires a timestamp column because you apply a
function (like an aggregation) to each window. You can still use a UDF for
customized tasks.

On Dec 25, 2017 20:15, "M Singh"  wrote:

> Hi:
> I would like to use a window function on a DataSet stream (Spark 2.2.0).
> The window function requires a Column as an argument and can be used with
> DataFrames by passing the column. Is there an analogous window function, or
> pointers to how a window function can be used with DataSets?
>
> Thanks
>
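For reference, a minimal sketch in Scala of the built-in time window on a typed Dataset, in the spirit of the reply above. It uses the rate source (available since Spark 2.2) so it runs without external input; the case class, app name, and window sizes are illustrative placeholders, and it is written shell-script style.

    import java.sql.Timestamp
    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.window

    val spark = SparkSession.builder.appName("dataset-window-sketch").getOrCreate()
    import spark.implicits._

    // The built-in "rate" source emits rows with `timestamp` and `value` columns;
    // converting to a typed Dataset shows that the same window() call works there too.
    case class RateEvent(timestamp: Timestamp, value: Long)

    val events = spark.readStream
      .format("rate")
      .option("rowsPerSecond", "10")
      .load()
      .as[RateEvent]

    // window() takes a Column and buckets rows into 10-minute windows sliding every 5 minutes.
    val counts = events
      .groupBy(window($"timestamp", "10 minutes", "5 minutes"))
      .count()

    val query = counts.writeStream
      .outputMode("complete")
      .format("console")
      .start()

    query.awaitTermination()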


Re: Apache Spark - Structured Streaming from file - checkpointing

2017-12-25 Thread Diogo Munaro Vieira
Can you please post your code here?

On Dec 25, 2017 19:24, "M Singh"  wrote:

> Hi:
>
> I am using Spark structured streaming (v 2.2.0) to read data from files. I
> have configured a checkpoint location. On stopping and restarting the
> application, it looks like it is reading the previously ingested files. Is
> that expected behavior?
>
> Is there any way to prevent reading files that have already been ingested?
> If a file is partially ingested, on restart can we start reading the
> file from the previously checkpointed offset?
>
> Thanks
>
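For reference, a minimal sketch in Scala of a file-source query with a checkpoint location; the paths and schema are placeholders. Assuming the same checkpointLocation is reused across restarts, the file source's log of processed files should keep already-committed files from being re-ingested.

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.types.{LongType, StringType, StructType}

    val spark = SparkSession.builder.appName("file-source-checkpoint-sketch").getOrCreate()

    // A file source needs the schema up front; this one is a placeholder.
    val schema = new StructType()
      .add("id", LongType)
      .add("payload", StringType)

    val input = spark.readStream
      .schema(schema)
      .json("/data/incoming")            // hypothetical input directory

    val query = input.writeStream
      .format("parquet")
      .option("path", "/data/output")    // hypothetical output directory
      // The processed-file log and offsets live under this directory; reuse the
      // same path across restarts so committed files are not read again.
      .option("checkpointLocation", "/data/checkpoints/file-ingest")
      .start()

    query.awaitTermination()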


Re: Apache Spark - Structured Streaming graceful shutdown

2017-12-25 Thread Diogo Munaro Vieira
Hi M Singh! Here I'm using query.stop()

On Dec 25, 2017 19:19, "M Singh"  wrote:

> Hi:
> Are there any patterns/recommendations for gracefully stopping a
> structured streaming application ?
> Thanks
>
>
>
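A minimal sketch in Scala of the query.stop() approach Diogo mentions, wired into a JVM shutdown hook; the rate source, console sink, and checkpoint path are placeholders. On the next start with the same checkpoint location, the query should resume from the last committed offsets.

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder.appName("graceful-stop-sketch").getOrCreate()

    val query = spark.readStream
      .format("rate")                       // placeholder source
      .load()
      .writeStream
      .format("console")                    // placeholder sink
      .option("checkpointLocation", "/tmp/checkpoints/rate-console")
      .start()

    // Stop the query when the JVM is asked to shut down (e.g. on SIGTERM),
    // so the application exits cleanly and can resume from the checkpoint later.
    sys.addShutdownHook {
      query.stop()
    }

    query.awaitTermination()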


Apache Spark - (2.2.0) - window function for DataSet

2017-12-25 Thread M Singh
Hi: I would like to use a window function on a DataSet stream (Spark 2.2.0).
The window function requires a Column as an argument and can be used with
DataFrames by passing the column. Is there an analogous window function, or
pointers to how a window function can be used with DataSets?
Thanks

Apache Spark - Structured Streaming from file - checkpointing

2017-12-25 Thread M Singh
Hi:
I am using Spark structured streaming (v 2.2.0) to read data from files. I have
configured a checkpoint location. On stopping and restarting the application, it
looks like it is reading the previously ingested files. Is that expected
behavior?

Is there any way to prevent reading files that have already been ingested? If a
file is partially ingested, on restart can we start reading the file from the
previously checkpointed offset?
Thanks

Apache Spark - Structured Streaming graceful shutdown

2017-12-25 Thread M Singh
Hi: Are there any patterns/recommendations for gracefully stopping a structured
streaming application? Thanks




Re: Spark Docker

2017-12-25 Thread Jörn Franke
You can find several presentations on this on the Spark Summit web page.

Generally, in the container context you also have to decide whether to run one
cluster for all applications or one cluster per application.

Not sure, though, why you want to run on just one node. If you have only one
node, then you may not want to go for Spark at all.

> On 25. Dec 2017, at 09:54, sujeet jog  wrote:
> 
> Folks, 
> 
> Can you share your experience of running Spark under Docker on a single
> local / standalone node?
> Is anybody using it in production environments? We have an existing Docker
> Swarm deployment, and I want to run Spark in a separate FAT VM hooked up to /
> controlled by Docker Swarm.
> 
> I know there is no official clustering support for running Spark under Docker
> Swarm, but can it be used to run on a single FAT VM controlled by Swarm?
> 
> Any insights on this would be appreciated, production-mode experiences etc.
> 
> Thanks, 
> Sujeet




Spark Docker

2017-12-25 Thread sujeet jog
Folks,

Can you share your experience of running Spark under Docker on a single
local / standalone node?
Is anybody using it in production environments? We have an existing
Docker Swarm deployment, and I want to run Spark in a separate FAT VM
hooked up to / controlled by Docker Swarm.

I know there is no official clustering support for running Spark under
Docker Swarm, but can it be used to run on a single FAT VM controlled by
Swarm?

Any insights on this would be appreciated, production-mode experiences etc.

Thanks,
Sujeet


Which kafka client to use with spark streaming

2017-12-25 Thread Serkan TAS
Hi,

Working on a Spark 2.2.0 cluster with Kafka 1.0 brokers.

I was using the library
"org.apache.spark" % "spark-streaming-kafka-0-10_2.11" % "2.2.0"

and had lots of problems during the streaming process, so I downgraded to
"org.apache.spark" % "spark-streaming-kafka-0-8_2.11" % "2.2.0"

And I know there is also another path, which is using the kafka-clients jars,
whose latest version is 1.0.0:



<dependency>
    <groupId>org.apache.kafka</groupId>
    <artifactId>kafka-clients</artifactId>
    <version>1.0.0</version>
</dependency>


I am confused about which path is the right one.

Thanks…




