PySpark Structured Streaming - using previous iteration computed results in current iteration
We would like to maintain arbitrary state between invocations of Structured Streaming iterations in Python. How can we maintain a static DataFrame that acts as state between the iterations? Several options may be relevant:

1. In Spark memory (distributed across the workers)
2. An external memory solution (e.g. Elasticsearch / Redis)
3. Any other state-maintenance mechanism that works with PySpark

Specifically: in iteration N we get a streaming DataFrame from Kafka and apply a computation that produces a label column over the window of samples from the last hour. We want to keep the labels and sample ids around for the next iteration (N+1), where we join them with the new sample window so that samples that already existed in iteration N inherit their labels.

--
Regards,
Ofer Eliassaf
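A minimal sketch of one way this could be wired up, assuming Spark 2.4+ (foreachBatch post-dates this thread), that the sample id travels as the Kafka message key, and that a writable path such as /tmp/label_state is available for the carried-over labels. The broker, topic, paths, and column names are illustrative, not taken from the original setup:

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("label-carryover-sketch").getOrCreate()
    STATE_PATH = "/tmp/label_state"  # hypothetical location for the carried-over labels

    stream = (spark.readStream
              .format("kafka")
              .option("kafka.bootstrap.servers", "localhost:9092")  # assumed broker
              .option("subscribe", "samples")                       # assumed topic
              .load())

    def previous_labels():
        # Labels written by iteration N; empty on the very first batch.
        try:
            return spark.read.parquet(STATE_PATH)
        except Exception:
            return spark.createDataFrame([], "sample_id STRING, label STRING")

    def carry_labels(batch_df, batch_id):
        # Assume the sample id travels as the Kafka message key.
        samples = batch_df.select(F.col("key").cast("string").alias("sample_id"),
                                  F.col("value").cast("string").alias("payload"))
        # Samples seen in iteration N inherit their label; new samples get null.
        labelled = samples.join(previous_labels(), on="sample_id", how="left")
        # ... recompute/refresh the label column over the current window here ...
        # localCheckpoint() materialises the result so the overwrite of STATE_PATH
        # below does not clash with the read of the same path above.
        state = labelled.select("sample_id", "label").localCheckpoint()
        state.write.mode("overwrite").parquet(STATE_PATH)

    query = (stream.writeStream
             .foreachBatch(carry_labels)
             .option("checkpointLocation", "/tmp/label_state_chk")  # hypothetical
             .start())

Keeping the state in an external store (option 2) would follow the same shape, with the parquet read/write replaced by Redis or Elasticsearch lookups inside the batch function.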
Re: pyspark cluster mode on standalone deployment
Anyone? Please? Is this getting any priority?

On Tue, Sep 27, 2016 at 3:38 PM, Ofer Eliassaf wrote:
> Is there any plan to support python spark running in "cluster mode" on a
> standalone deployment?
>
> There is this famous survey mentioning that more than 50% of the users are
> using the standalone configuration.
> Making pyspark work in cluster mode with standalone will help a lot with
> high availability in python spark.
>
> Currently only the YARN deployment supports it, and bringing in the huge
> YARN installation just for this feature is not fun at all.
>
> Does someone have a time estimate for this?
>
> --
> Regards,
> Ofer Eliassaf

--
Regards,
Ofer Eliassaf
Re: PySpark TaskContext
Thank you so much for this! Great to see that you listen to the community.

On Thu, Nov 24, 2016 at 12:10 PM, Holden Karau wrote:
> https://issues.apache.org/jira/browse/SPARK-18576
>
> On Thu, Nov 24, 2016 at 2:05 AM, Holden Karau wrote:
>
>> Cool - thanks. I'll circle back with the JIRA number once I've got it
>> created - it will probably take a while before it lands in a Spark release
>> (since 2.1 has already branched), but better debugging information for
>> Python users is certainly important/useful.
>>
>> On Thu, Nov 24, 2016 at 2:03 AM, Ofer Eliassaf wrote:
>>
>>> Since we can't work with log4j in pyspark executors we build our own
>>> logging infrastructure (based on logstash/elastic/kibana).
>>> It would help to have the TID in the logs, so we can drill down accordingly.
>>>
>>> On Thu, Nov 24, 2016 at 11:48 AM, Holden Karau wrote:
>>>
>>>> Hi,
>>>>
>>>> The TaskContext isn't currently exposed in PySpark but I've been
>>>> meaning to look at exposing at least some of TaskContext for parity in
>>>> PySpark. Is there a particular use case which you want this for? It would
>>>> help with crafting the JIRA :)
>>>>
>>>> Cheers,
>>>>
>>>> Holden :)
>>>>
>>>> On Thu, Nov 24, 2016 at 1:39 AM, ofer wrote:
>>>>
>>>>> Hi,
>>>>> Is there a way to get in PySpark something like TaskContext from code
>>>>> running on an executor, like in Scala Spark?
>>>>>
>>>>> If not - how can I know my task id from inside the executors?
>>>>>
>>>>> Thanks!
>
> --
> Cell : 425-233-8271
> Twitter: https://twitter.com/holdenkarau

--
Regards,
Ofer Eliassaf
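For reference, SPARK-18576 exposed TaskContext to Python from Spark 2.2.0 onwards. A minimal sketch of how the task id could then be picked up on the executors and attached to records (or log lines); the function and record layout here are illustrative only:

    from pyspark import TaskContext
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("taskcontext-sketch").getOrCreate()

    def tag_with_task_info(iterator):
        # TaskContext.get() returns None on the driver and the live context on executors.
        ctx = TaskContext.get()
        stage, partition, tid = ctx.stageId(), ctx.partitionId(), ctx.taskAttemptId()
        # In a real setup these fields would go into the logstash/elastic pipeline;
        # here they are simply attached to each record so the TID is visible downstream.
        for record in iterator:
            yield (stage, partition, tid, record)

    rdd = spark.sparkContext.parallelize(range(8), numSlices=4)
    print(rdd.mapPartitions(tag_with_task_info).collect())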
Re: PySpark TaskContext
Since we can't work with log4j in pyspark executors we build our own logging infrastructure (based on logstash/elastic/kibana).
It would help to have the TID in the logs, so we can drill down accordingly.

On Thu, Nov 24, 2016 at 11:48 AM, Holden Karau wrote:
> Hi,
>
> The TaskContext isn't currently exposed in PySpark but I've been meaning
> to look at exposing at least some of TaskContext for parity in PySpark. Is
> there a particular use case which you want this for? It would help with
> crafting the JIRA :)
>
> Cheers,
>
> Holden :)
>
> On Thu, Nov 24, 2016 at 1:39 AM, ofer wrote:
>
>> Hi,
>> Is there a way to get in PySpark something like TaskContext from code
>> running on an executor, like in Scala Spark?
>>
>> If not - how can I know my task id from inside the executors?
>>
>> Thanks!
>
> --
> Cell : 425-233-8271
> Twitter: https://twitter.com/holdenkarau

--
Regards,
Ofer Eliassaf
Dynamic Resource Allocation in a standalone cluster
Hi,

I have a question/problem regarding dynamic resource allocation. I am using Spark 1.6.2 with the standalone cluster manager. I have one worker with 2 cores.

I set the following options in the spark-defaults.conf file on all my nodes:

spark.dynamicAllocation.enabled true
spark.shuffle.service.enabled true
spark.deploy.defaultCores 1

I run a sample application with many tasks. I open port 4040 on the driver and can verify that the above configuration is in effect. My problem is that no matter what I do, my application only gets 1 core even though the other cores are available. Is this normal, or do I have a problem in my configuration?

The behaviour I want is this: many users work with the same Spark cluster. Each application should get a fixed number of cores unless the rest of the cluster is idle (no other applications pending), in which case the running applications should get all of the cores until a new application arrives...

--
Regards,
Ofer Eliassaf
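One likely culprit, going by the standalone documentation: spark.deploy.defaultCores is the default value of spark.cores.max for applications that do not set it, so with it at 1 each application is capped at a single core regardless of dynamic allocation. A sketch of how the relevant options could instead be set per application from PySpark; the values are illustrative, and the external shuffle service still has to be running on every worker:

    from pyspark import SparkConf, SparkContext

    # Per-application settings; these could equally live in spark-defaults.conf.
    conf = (SparkConf()
            .setAppName("dyn-alloc-sketch")
            .set("spark.dynamicAllocation.enabled", "true")
            # The external shuffle service must also be enabled on every worker.
            .set("spark.shuffle.service.enabled", "true")
            .set("spark.dynamicAllocation.minExecutors", "1")
            # Raise the per-application core cap; with spark.deploy.defaultCores=1 and
            # no spark.cores.max set, each application is limited to a single core.
            .set("spark.cores.max", "2"))

    sc = SparkContext(conf=conf)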
Re: RESTful Endpoint and Spark
There are two main projects for that: Livy (https://github.com/cloudera/livy) and Spark Job Server (https://github.com/spark-jobserver/spark-jobserver).

On Fri, Oct 7, 2016 at 2:27 AM, Benjamin Kim wrote:
> Has anyone tried to integrate Spark with a server farm of RESTful API
> endpoints or even HTTP web-servers for that matter? I know it's typically
> done using a web farm as the presentation interface, then data flows
> through a firewall/router to direct calls to a JDBC listener that will
> SELECT, INSERT, UPDATE and, at times, DELETE data in a database. Can the
> same be done using Spark SQL Thriftserver on top of, say, HBase, Kudu,
> Parquet, etc.? Or can Kafka be used somewhere? Spark would be an ideal
> solution as the intermediary because it can talk to any data store
> underneath; so, swapping out a technology at any time would be possible.
>
> Just want some ideas.
>
> Thanks,
> Ben

--
Regards,
Ofer Eliassaf
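To make the first option concrete, here is a minimal sketch of driving Spark over HTTP through Livy's REST /sessions API using the requests library; the host/port and the submitted code are assumptions, not part of the original thread:

    import time
    import requests

    LIVY = "http://localhost:8998"  # assumed Livy host/port
    headers = {"Content-Type": "application/json"}

    # Create an interactive PySpark session.
    session = requests.post(LIVY + "/sessions",
                            json={"kind": "pyspark"}, headers=headers).json()
    session_url = "{}/sessions/{}".format(LIVY, session["id"])

    # Wait until the session is idle before submitting code.
    while requests.get(session_url, headers=headers).json()["state"] != "idle":
        time.sleep(1)

    # Run a statement and poll for its result.
    stmt = requests.post(session_url + "/statements",
                         json={"code": "sc.parallelize(range(100)).sum()"},
                         headers=headers).json()
    stmt_url = "{}/statements/{}".format(session_url, stmt["id"])
    while True:
        result = requests.get(stmt_url, headers=headers).json()
        if result["state"] == "available":
            print(result["output"])
            break
        time.sleep(1)

    # Clean up the session.
    requests.delete(session_url, headers=headers)

Spark Job Server offers a similar HTTP interface for submitting and managing jobs, so either project can sit between the web farm and the cluster.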
Re: spark standalone with multiple workers gives a warning
The slaves should connect to the master using the scripts in sbin... You can read about it here:
http://spark.apache.org/docs/latest/spark-standalone.html

On Thu, Oct 6, 2016 at 6:46 PM, Mendelson, Assaf wrote:
> Hi,
>
> I have a spark standalone cluster. On it, I am using 3 workers per node.
> So I added SPARK_WORKER_INSTANCES set to 3 in spark-env.sh
>
> The problem is, that when I run spark-shell I get the following warning:
>
> WARN SparkConf:
> SPARK_WORKER_INSTANCES was detected (set to '3').
> This is deprecated in Spark 1.0+.
>
> Please instead use:
>  - ./spark-submit with --num-executors to specify the number of executors
>  - Or set SPARK_EXECUTOR_INSTANCES
>  - spark.executor.instances to configure the number of instances in the
>    spark config.
>
> So how would I start a cluster of 3? SPARK_WORKER_INSTANCES is the only
> way I see to start the standalone cluster and the only way I see to define
> it is in spark-env.sh. The spark submit option, SPARK_EXECUTOR_INSTANCES
> and spark.executor.instances are all related to submitting the job.
>
> Any ideas?
>
> Thanks
> Assaf

--
Regards,
Ofer Eliassaf
pyspark cluster mode on standalone deployment
Is there any plan to support python spark running in "cluster mode" on a standalone deployment?

There is this famous survey mentioning that more than 50% of the users are using the standalone configuration. Making pyspark work in cluster mode with standalone will help a lot with high availability in python spark.

Currently only the YARN deployment supports it, and bringing in the huge YARN installation just for this feature is not fun at all.

Does someone have a time estimate for this?

--
Regards,
Ofer Eliassaf