Re: Show function name in Logs for PythonUDFRunner

2018-11-22 Thread Eike von Seggern
Hi, Abdeali Kothari schrieb am Do., 22. Nov. 2018 um 10:04 Uhr: > When I run Python UDFs with pyspark, I get multiple logs where it says: > > 18/11/22 01:51:59 INFO python.PythonUDFRunner: Times: total = 44, boot = -25, > init = 67, finish = 2 > > > I am wondering if in these logs I can

Re: Pyspark create RDD of dictionary

2018-11-02 Thread Eike von Seggern
Hi, Soheil Pourbafrani schrieb am Fr., 2. Nov. 2018 um 15:43 Uhr: > Hi, I have an RDD of the form (((a), (b), (c), (d)), (e)) and I want to > transform every row to a dictionary of the form a:(b, c, d, e) > > Here is my code, but it's errorful! > > map(lambda row : {row[0][0] : (row[1],
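A minimal sketch of the mapping the question asks for, in pure Python (the row layout `((a, b, c, d), e)` is assumed from the thread; `row_to_dict` is a hypothetical helper name, and the indices may need adjusting to the real schema):

```python
def row_to_dict(row):
    """Turn a row of the form ((a, b, c, d), e) into {a: (b, c, d, e)}."""
    (a, b, c, d), e = row
    return {a: (b, c, d, e)}

# In PySpark this would be applied with rdd.map(row_to_dict); here we
# just check the pure-Python logic on one sample row.
sample = (("a", "b", "c", "d"), "e")
print(row_to_dict(sample))  # {'a': ('b', 'c', 'd', 'e')}
```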

Re: Spark SQL - Truncate Day / Hour

2017-11-13 Thread Eike von Seggern
Hi, you can truncate datetimes like this (in pyspark), e.g. to 5 minutes: import pyspark.sql.functions as F df.select((F.floor(F.col('myDateColumn').cast('long') / 300) * 300).cast('timestamp')) Best, Eike David Hodefi schrieb am Mo., 13. Nov. 2017 um 12:27 Uhr:
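The Spark expression above works by integer-dividing the epoch seconds by the interval and multiplying back. A pure-Python check of that arithmetic (the sample timestamp is made up for illustration):

```python
from datetime import datetime, timezone

def truncate_epoch(ts_seconds, interval=300):
    """Truncate an epoch timestamp (in seconds) to the nearest lower
    multiple of `interval` seconds -- the same arithmetic as
    floor(col.cast('long') / 300) * 300 in the Spark snippet."""
    return (ts_seconds // interval) * interval

ts = int(datetime(2017, 11, 13, 12, 27, 42, tzinfo=timezone.utc).timestamp())
truncated = truncate_epoch(ts)
print(datetime.fromtimestamp(truncated, tz=timezone.utc))  # ...12:25:00+00:00
```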

Re: Loading objects only once

2017-09-28 Thread Eike von Seggern
Hello, maybe a broadcast variable can help you here. [1] You can load the model once on the driver and then broadcast it to the workers with `bc_model = sc.broadcast(model)`. You can access the model in the map function with `bc_model.value` (in PySpark, `value` is an attribute of the `Broadcast` object, not a method). Best Eike [1]

Re: Worker node log not showed

2017-06-08 Thread Eike von Seggern
2017-05-31 10:48 GMT+02:00 Paolo Patierno : > No it's running in standalone mode as Docker image on Kubernetes. > > > The only way I found was to access "stderr" file created under the "work" > directory in the SPARK_HOME but ... is it the right way ? > I think that is the

Re: bug with PYTHONHASHSEED

2017-04-04 Thread Eike von Seggern
2017-04-01 21:54 GMT+02:00 Paul Tremblay : > When I try to to do a groupByKey() in my spark environment, I get the > error described here: > > http://stackoverflow.com/questions/36798833/what-does-exception-randomness-of-hash-of-string-should-be-disabled-via-pythonh >
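The error in that thread comes from Python 3's randomized string hashing: each executor hashes keys differently unless `PYTHONHASHSEED` is pinned to the same value everywhere. A small stdlib-only sketch showing that a fixed seed makes `hash()` deterministic across interpreters (the Spark config line in the comment is one common way to set it, but the exact mechanism depends on your deployment mode):

```python
import os
import subprocess
import sys

def string_hash(seed):
    """hash('spark') in a fresh interpreter with PYTHONHASHSEED=seed."""
    env = dict(os.environ, PYTHONHASHSEED=str(seed))
    out = subprocess.check_output(
        [sys.executable, "-c", "print(hash('spark'))"], env=env)
    return int(out)

# With a fixed seed, every fresh interpreter (i.e. every executor) agrees:
print(string_hash(0) == string_hash(0))  # True

# For a Spark job, the seed would be fixed for executors via e.g.
#   spark.executorEnv.PYTHONHASHSEED=0
# in spark-defaults.conf (assumption: adapt to your cluster setup).
```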

Re: Alternatives for dataframe collectAsList()

2017-04-04 Thread Eike von Seggern
Hi, depending on what you're trying to achieve `RDD.toLocalIterator()` might help you. Best Eike 2017-03-29 21:00 GMT+02:00 szep.laszlo.it : > Hi, > > after I created a dataset > > Dataset df = sqlContext.sql("query"); > > I need to have a result values and I call a

Re: [PySpark - 1.6] - Avoid object serialization

2016-12-29 Thread Eike von Seggern
2016-12-28 20:17 GMT+01:00 Chawla,Sumit : > Would this work for you? > > def processRDD(rdd): > analyzer = ShortTextAnalyzer(root_dir) > rdd.foreach(lambda record: analyzer.analyze_short_text_event(record[1])) > > ssc.union(*streams).filter(lambda x: x[1] !=

Re: What is the deployment model for Spark Streaming? A specific example.

2016-12-19 Thread Eike von Seggern
> c.awaitTermination() > > What is the actual deployment model for Spark Streaming? All I know to do > right now is to restart the PID. I'm new to Spark, and the docs don't > really explain this (that I can see). >

Re: Access S3 buckets in multiple accounts

2016-09-28 Thread Eike von Seggern
Hi Teng, 2016-09-28 10:42 GMT+02:00 Teng Qiu : > hmm, i do not believe security group can control s3 bucket access... is > this something new? or you mean IAM role? > You're right, it's not security groups but you can configure a VPC endpoint for the EMR-Cluster and grant

Re: Access S3 buckets in multiple accounts

2016-09-28 Thread Eike von Seggern
> 14 W 29th Street, 5th Floor > New York, NY 10001 -- Jan Eike von Seggern, Data Scientist, Sevenval Technologies GmbH, FRONT-END-EXPERTS SINCE 1999, Köpenicker Straße 154 | 10997 Berlin office

Re: pyspark pickle error when using itertools.groupby

2016-08-05 Thread Eike von Seggern
Hello, `itertools.groupby` is evaluated lazily and the `g`s in your code are generators not lists. This might cause your problem. Casting everything to lists might help here, e.g.: grp2 = [(k, list(g)) for k,g in groupby(grp1, lambda e: e[1])] HTH Eike 2016-08-05 7:31 GMT+02:00 林家銘
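The fix from the thread as a small runnable example: `groupby` yields lazy group iterators that are invalidated as you advance (and that PySpark cannot pickle), so each group is materialized with `list()` (the sample data is made up for illustration):

```python
from itertools import groupby

data = [("a", 1), ("a", 1), ("b", 2)]
# groupby only groups *consecutive* equal keys, so sort by the key first.
grp1 = sorted(data, key=lambda e: e[1])

# Materialize each group: the generators groupby yields are lazy and
# are exhausted/invalidated once you move to the next key.
grp2 = [(k, list(g)) for k, g in groupby(grp1, lambda e: e[1])]
print(grp2)  # [(1, [('a', 1), ('a', 1)]), (2, [('b', 2)])]
```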

Re: ImportError: No module named numpy

2016-06-02 Thread Eike von Seggern
> On Wed, Jun 1, 2016 at 8:31 AM, Bhupendra Mishra <bhupendra.mis...@gmail.com> wrote: >> If any one please can help me with fo

Re: Launch Spark shell using differnt python version

2016-05-30 Thread Eike von Seggern
Hi Stuti 2016-03-15 10:08 GMT+01:00 Stuti Awasthi : > Thanks Prabhu, > > I tried starting in local mode but still picking Python 2.6 only. I have > exported “DEFAULT_PYTHON” in my session variable and also included in PATH. > > > > Export: > > export
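PySpark picks the worker interpreter from the `PYSPARK_PYTHON` environment variable (and the driver's from `PYSPARK_DRIVER_PYTHON`), not from a variable like `DEFAULT_PYTHON`. A minimal sketch; the interpreter path below is an example, point it at whichever Python you actually want:

```shell
# Tell Spark which interpreter the workers (and driver) should use.
# /usr/bin/python3 is a placeholder path for this example.
export PYSPARK_PYTHON=/usr/bin/python3
export PYSPARK_DRIVER_PYTHON=/usr/bin/python3

echo "workers will use: $PYSPARK_PYTHON"

# Then start the shell or submit a job as usual (shown, not executed here):
# pyspark
# spark-submit my_job.py
```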

Re: value from groubBy paired rdd

2016-02-24 Thread Eike von Seggern
Hello Abhishek, your code appears ok. Can you please post the exception you get? Without, it's hard to track down the issue. Best Eike

Re: Submit custom python packages from current project

2016-02-19 Thread Eike von Seggern
Hello, 2016-02-16 11:03 GMT+01:00 Mohannad Ali : > Hello Everyone, > > I have code inside my project organized in packages and modules, however I > keep getting the error "ImportError: No module named " when > I run spark on YARN. > > My directory structure is something like
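A common way to get local packages onto YARN executors is to zip them and ship the archive with `--py-files`. A sketch under stated assumptions: `mypkg` and `job.py` are placeholder names created here just for the demonstration:

```shell
# Create a tiny placeholder package and job script for the example.
mkdir -p mypkg
printf 'def greet():\n    return "hi"\n' > mypkg/__init__.py
printf 'import mypkg\nprint(mypkg.greet())\n' > job.py

# Zip the package so executors can add it to their PYTHONPATH...
python3 -m zipfile -c deps.zip mypkg/
ls deps.zip

# ...and pass the archive along with the job (shown, not executed here):
# spark-submit --master yarn --py-files deps.zip job.py
```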

Re: how to use sc.hadoopConfiguration from pyspark

2015-11-23 Thread Eike von Seggern
2015-11-23 10:26 GMT+01:00 Tamas Szuromi : > Hello Eike, > > Thanks! Yes I'm using it with Hadoop 2.6 so I'll give a try to the 2.4 > build. > Have you tried it with the 1.6 snapshot, or do you know JIRA tickets for these > missing-libraries issues? I've not tried 1.6.

Re: how to use sc.hadoopConfiguration from pyspark

2015-11-23 Thread Eike von Seggern
Hello Tamas, 2015-11-20 17:23 GMT+01:00 Tamas Szuromi : > > Hello, > > I've just wanted to use sc._jsc.hadoopConfiguration().set('key','value') in > pyspark 1.5.2 but I got set method not exists error. For me it's working with Spark 1.5.2 binary distribution built