PySpark Structured Streaming - using previous iteration computed results in current iteration

2018-05-16 Thread Ofer Eliassaf
We would like to maintain arbitrary state between invocations of the
iterations of Structured Streaming in Python.

How can we maintain a static DataFrame that acts as state between the
iterations?

Several options that may be relevant:
1. in Spark memory (distributed across the workers)
2. External Memory solution (e.g. ElasticSearch / Redis)
3. utilizing other state maintenance that can work with PySpark

Specifically: in iteration N we get a streaming DataFrame from Kafka and apply
a computation that produces a label column over the window of samples from the
last hour.
We want to keep the labels and sample ids around for the next iteration (N+1),
where we join them with the new sample window so that samples which already
existed in the previous (N) iteration inherit their labels.
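
To make the intent concrete, here is a rough sketch of what option (1) might
look like with foreachBatch, if upgrading to Spark 2.4+ is possible: the
previous window's (sample_id, label) pairs live in a plain DataFrame on the
driver and every micro-batch is joined against it. The broker, topic, column
names and the labelling step are placeholders, and the Kafka source needs the
spark-sql-kafka-0-10 package on the classpath.

from pyspark.sql import DataFrame, SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("label-carryover").getOrCreate()

# State carried over from iteration N: starts empty, replaced after every micro-batch.
state = {"labels": spark.createDataFrame([], "sample_id STRING, label STRING")}

samples = (spark.readStream
           .format("kafka")
           .option("kafka.bootstrap.servers", "broker:9092")  # placeholder broker
           .option("subscribe", "samples")                    # placeholder topic
           .load()
           .selectExpr("CAST(key AS STRING) AS sample_id",
                       "CAST(value AS STRING) AS payload"))

def compute_labels(df: DataFrame) -> DataFrame:
    # Stand-in for the real hour-window labelling: keep an inherited label when
    # one exists, otherwise assign a default one.
    return df.withColumn("label", F.coalesce(F.col("label"), F.lit("unlabeled")))

def carry_labels(batch_df: DataFrame, batch_id: int) -> None:
    # Left-join the new window against the labels produced in iteration N, so
    # samples that already existed inherit their label.
    joined = batch_df.join(state["labels"], on="sample_id", how="left")
    relabeled = compute_labels(joined)
    # Materialize and keep the fresh (sample_id, label) pairs for iteration N+1.
    new_labels = relabeled.select("sample_id", "label").cache()
    new_labels.count()
    state["labels"] = new_labels

query = samples.writeStream.foreachBatch(carry_labels).start()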


-- 
Regards,
Ofer Eliassaf


Re: pyspark cluster mode on standalone deployment

2017-03-05 Thread Ofer Eliassaf
Anyone? Please? Is this getting any priority?

On Tue, Sep 27, 2016 at 3:38 PM, Ofer Eliassaf 
wrote:

> Is there any plan to support python spark running in "cluster mode" on a
> standalone deployment?
>
> There is this famous survey mentioning that more than 50% of the users are
> using the standalone configuration.
> Making pyspark work in cluster mode with standalone will help a lot for
> high availability in Python Spark.
>
> Currently only the YARN deployment supports it. Bringing in the huge YARN
> installation just for this feature is not fun at all.
>
> Does someone have time estimation for this?
>
>
>
> --
> Regards,
> Ofer Eliassaf
>



-- 
Regards,
Ofer Eliassaf


Re: PySpark TaskContext

2016-11-24 Thread Ofer Eliassaf
Thank you so much for this! Great to see that you listen to the community.
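
For completeness, here is a minimal sketch of how I would expect to read the
task ids from inside an executor once TaskContext is exposed (this assumes a
release that ships pyspark.TaskContext, i.e. 2.2+; the data and app name are
made up):

from pyspark import TaskContext
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("task-context-demo").getOrCreate()

def tag_with_task_info(iterator):
    ctx = TaskContext.get()        # None on the driver, populated inside executors
    task_id = ctx.taskAttemptId()  # the TID that also shows up in executor logs
    partition = ctx.partitionId()
    for value in iterator:
        yield (partition, task_id, value)

rdd = spark.sparkContext.parallelize(range(8), 4)
print(rdd.mapPartitions(tag_with_task_info).collect())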

On Thu, Nov 24, 2016 at 12:10 PM, Holden Karau  wrote:

> https://issues.apache.org/jira/browse/SPARK-18576
>
> On Thu, Nov 24, 2016 at 2:05 AM, Holden Karau 
> wrote:
>
>> Cool - thanks. I'll circle back with the JIRA number once I've got it
>> created - will probably take awhile before it lands in a Spark release
>> (since 2.1 has already branched) but better debugging information for
>> Python users is certainly important/useful.
>>
>> On Thu, Nov 24, 2016 at 2:03 AM, Ofer Eliassaf 
>> wrote:
>>
>>> Since we can't work with log4j in PySpark executors, we built our own
>>> logging infrastructure (based on Logstash/Elastic/Kibana).
>>> It would help to have the TID in the logs, so we can drill down accordingly.
>>>
>>>
>>> On Thu, Nov 24, 2016 at 11:48 AM, Holden Karau 
>>> wrote:
>>>
>>>> Hi,
>>>>
>>>> The TaskContext isn't currently exposed in PySpark but I've been
>>>> meaning to look at exposing at least some of TaskContext for parity in
>>>> PySpark. Is there a particular use case which you want this for? Would help
>>>> with crafting the JIRA :)
>>>>
>>>> Cheers,
>>>>
>>>> Holden :)
>>>>
>>>> On Thu, Nov 24, 2016 at 1:39 AM, ofer  wrote:
>>>>
>>>>> Hi,
>>>>> Is there a way to get, in PySpark, something like TaskContext from code
>>>>> running on an executor, as in Scala Spark?
>>>>>
>>>>> If not, how can I know my task id from inside the executors?
>>>>>
>>>>> Thanks!
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/PySpark-TaskContext-tp28125.html
>>>>> Sent from the Apache Spark User List mailing list archive at
>>>>> Nabble.com.
>>>>>
>>>>> -
>>>>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>> Cell : 425-233-8271
>>>> Twitter: https://twitter.com/holdenkarau
>>>>
>>>
>>>
>>>
>>> --
>>> Regards,
>>> Ofer Eliassaf
>>>
>>
>>
>>
>> --
>> Cell : 425-233-8271
>> Twitter: https://twitter.com/holdenkarau
>>
>
>
>
> --
> Cell : 425-233-8271
> Twitter: https://twitter.com/holdenkarau
>



-- 
Regards,
Ofer Eliassaf


Re: PySpark TaskContext

2016-11-24 Thread Ofer Eliassaf
Since we can't work with log4j in PySpark executors, we built our own
logging infrastructure (based on Logstash/Elastic/Kibana).
It would help to have the TID in the logs, so we can drill down accordingly.


On Thu, Nov 24, 2016 at 11:48 AM, Holden Karau  wrote:

> Hi,
>
> The TaskContext isn't currently exposed in PySpark but I've been meaning
> to look at exposing at least some of TaskContext for parity in PySpark. Is
> there a particular use case which you want this for? Would help with
> crafting the JIRA :)
>
> Cheers,
>
> Holden :)
>
> On Thu, Nov 24, 2016 at 1:39 AM, ofer  wrote:
>
>> Hi,
>> Is there a way to get, in PySpark, something like TaskContext from code
>> running on an executor, as in Scala Spark?
>>
>> If not, how can I know my task id from inside the executors?
>>
>> Thanks!
>>
>>
>>
>> --
>> View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/PySpark-TaskContext-tp28125.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>> -
>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>
>>
>
>
> --
> Cell : 425-233-8271
> Twitter: https://twitter.com/holdenkarau
>



-- 
Regards,
Ofer Eliassaf


Dynamic Resource Allocation in a standalone

2016-10-27 Thread Ofer Eliassaf
Hi,

I have a question/problem regarding dynamic resource allocation.
I am using Spark 1.6.2 with the standalone cluster manager.

I have one worker with 2 cores.

I set the following arguments in the spark-defaults.conf file on all
my nodes:

spark.dynamicAllocation.enabled  true
spark.shuffle.service.enabled true
spark.deploy.defaultCores 1

I run a sample application with many tasks.

I open port 4040 on the driver and I can verify that the above
configuration is present.

My problem is that no matter what I do, my application only gets 1 core even
though the other cores are available.

Is this normal, or do I have a problem in my configuration?


The behaviour I want is this:
I have many users working with the same Spark cluster.
I want each application to get a fixed number of cores, unless the rest of
the cluster is idle, in which case I want the running applications to get the
total amount of cores until a new application arrives...
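
For concreteness, here is a sketch of how I would pass the equivalent settings
per application from PySpark (the values are illustrative; the property names
are the standard dynamic-allocation ones, plus spark.cores.max, which on
standalone otherwise falls back to spark.deploy.defaultCores):

from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName("dyn-alloc-demo")
        .setMaster("spark://master-host:7077")        # placeholder master URL
        .set("spark.dynamicAllocation.enabled", "true")
        .set("spark.shuffle.service.enabled", "true")
        .set("spark.dynamicAllocation.minExecutors", "1")
        .set("spark.dynamicAllocation.maxExecutors", "2")
        .set("spark.executor.cores", "1")             # one core per executor
        # Without spark.cores.max a standalone application falls back to
        # spark.deploy.defaultCores, which is set to 1 above.
        .set("spark.cores.max", "2"))

sc = SparkContext(conf=conf)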


-- 
Regards,
Ofer Eliassaf


Re: RESTful Endpoint and Spark

2016-10-06 Thread Ofer Eliassaf
There are two main projects for that: Livy (https://github.com/cloudera/livy)
and Spark Job Server (https://github.com/spark-jobserver/spark-jobserver).
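
For a taste of the first one, here is a rough sketch of driving Spark through
Livy's REST interface from Python (the host, port and submitted code are
illustrative; check the Livy docs for the full API):

import json
import time

import requests

LIVY = "http://livy-host:8998"            # placeholder Livy endpoint
HEADERS = {"Content-Type": "application/json"}

# Open an interactive PySpark session.
session = requests.post(LIVY + "/sessions",
                        data=json.dumps({"kind": "pyspark"}),
                        headers=HEADERS).json()
session_url = LIVY + "/sessions/" + str(session["id"])

# Wait for the session to come up before submitting statements.
while requests.get(session_url, headers=HEADERS).json()["state"] != "idle":
    time.sleep(1)

# Run a statement on the cluster and poll for its result.
stmt = requests.post(session_url + "/statements",
                     data=json.dumps({"code": "sc.parallelize(range(100)).count()"}),
                     headers=HEADERS).json()
stmt_url = session_url + "/statements/" + str(stmt["id"])
while True:
    result = requests.get(stmt_url, headers=HEADERS).json()
    if result["state"] == "available":
        print(result["output"])
        break
    time.sleep(1)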

On Fri, Oct 7, 2016 at 2:27 AM, Benjamin Kim  wrote:

> Has anyone tried to integrate Spark with a server farm of RESTful API
> endpoints or even HTTP web-servers for that matter? I know it’s typically
> done using a web farm as the presentation interface, then data flows
> through a firewall/router to direct calls to a JDBC listener that will
> SELECT, INSERT, UPDATE and, at times, DELETE data in a database. Can the
> same be done using Spark SQL Thriftserver on top of, say, HBase, Kudu,
> Parquet, etc.? Or can Kafka be used somewhere? Spark would be an ideal
> solution as the intermediary because it can talk to any data store
> underneath; so, swapping out a technology at any time would be possible.
>
> Just want some ideas.
>
> Thank,
> Ben
>
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


-- 
Regards,
Ofer Eliassaf


Re: spark standalone with multiple workers gives a warning

2016-10-06 Thread Ofer Eliassaf
The slaves should connect to the master using the scripts in sbin...
You can read about it here:
http://spark.apache.org/docs/latest/spark-standalone.html

On Thu, Oct 6, 2016 at 6:46 PM, Mendelson, Assaf 
wrote:

> Hi,
>
> I have a spark standalone cluster. On it, I am using 3 workers per node.
>
> So I added SPARK_WORKER_INSTANCES set to 3 in spark-env.sh
>
> The problem is, that when I run spark-shell I get the following warning:
>
> WARN SparkConf:
>
> SPARK_WORKER_INSTANCES was detected (set to '3').
>
> This is deprecated in Spark 1.0+.
>
>
>
> Please instead use:
>
> - ./spark-submit with --num-executors to specify the number of executors
>
> - Or set SPARK_EXECUTOR_INSTANCES
>
> - spark.executor.instances to configure the number of instances in the
> spark config.
>
>
>
> So how would I start a cluster of 3? SPARK_WORKER_INSTANCES is the only
> way I see to start the standalone cluster and the only way I see to define
> it is in spark-env.sh. The spark submit option, SPARK_EXECUTOR_INSTANCES
> and spark.executor.instances are all related to submitting the job.
>
>
>
> Any ideas?
>
> Thanks
>
> Assaf
>



-- 
Regards,
Ofer Eliassaf


pyspark cluster mode on standalone deployment

2016-09-27 Thread Ofer Eliassaf
Is there any plan to support python spark running in "cluster mode" on a
standalone deployment?

There is this famous survey mentioning that more than 50% of the users are
using the standalone configuration.
Making pyspark work in cluster mode with standalone will help a lot for
high availability in Python Spark.

Currently only the YARN deployment supports it. Bringing in the huge YARN
installation just for this feature is not fun at all.

Does someone have time estimation for this?



-- 
Regards,
Ofer Eliassaf