Re: PySpark: toPandas() vs collect() execution graph differences

2021-11-11 Thread Gourav Sengupta
Hi, Sorry Regards, Gourav Sengupta On Fri, Nov 12, 2021 at 6:48 AM Sergey Ivanychev wrote:

Re: PySpark: toPandas() vs collect() execution graph differences

2021-11-11 Thread Georg Heiler
https://stackoverflow.com/questions/46832394/spark-access-first-n-rows-take-vs-limit might be related. Best, Georg On Fri, Nov 12, 2021 at 07:48 Sergey Ivanychev < sergeyivanyc...@gmail.com> wrote:

unsubscribe

2021-11-11 Thread Anshul Gupta

Re: PySpark: toPandas() vs collect() execution graph differences

2021-11-11 Thread Sergey Ivanychev
Hi Gourav, Please, read my question thoroughly. My problem is with the plan of the execution and with the fact that toPandas collects all the data not on the driver but on an executor, not with the fact that there’s some memory overhead. I don’t understand how your excerpts answer my question.

Re: PySpark: toPandas() vs collect() execution graph differences

2021-11-11 Thread Gourav Sengupta
Hi Sergey, Please read the excerpts from the book of Dr. Zaharia that I had sent; they explain these fundamentals clearly. Regards, Gourav Sengupta On Thu, Nov 11, 2021 at 9:40 PM Sergey Ivanychev wrote:

Re: PySpark: toPandas() vs collect() execution graph differences

2021-11-11 Thread Sergey Ivanychev
Yes, in fact those are the settings that cause this behaviour. If set to false, everything goes fine, since the implementation in the Spark sources in this case is pdf = pd.DataFrame.from_records(self.collect(), columns=self.columns) Best regards, Sergey Ivanychev > On 11 Nov 2021, at 13:58,
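
With Arrow disabled, `toPandas()` reduces to a plain `collect()` followed by a pandas constructor on the driver, exactly as the quoted source line shows. A minimal stand-alone sketch of that fallback path, using made-up tuples in place of real collected `Row` objects:

```python
import pandas as pd

# Simulated result of df.collect(): a list of row tuples plus the column
# names. In Spark these would be Row objects, but plain tuples behave the
# same for from_records. The data here is made up for illustration.
collected_rows = [(1, "a"), (2, "b"), (3, "c")]
columns = ["id", "value"]

# With spark.sql.execution.arrow.pyspark.enabled=false, toPandas() falls
# back to exactly this: collect everything to the driver, then build the
# DataFrame there.
pdf = pd.DataFrame.from_records(collected_rows, columns=columns)
print(pdf.shape)  # (3, 2)
```

This is why the non-Arrow path always materializes the full result on the driver before pandas ever sees it.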

Re: Using MulticlassClassificationEvaluator for NER evaluation

2021-11-11 Thread martin
OK, thank you, Gourav. I didn't realize that Spark works with numerical formats only by design. What I am trying to achieve is rather straightforward: evaluate a trained model using the standard metrics provided by MulticlassClassificationEvaluator. Since this isn't possible for text

Re: Using MulticlassClassificationEvaluator for NER evaluation

2021-11-11 Thread Gourav Sengupta
Hi Martin, okay, so you will of course need to translate the NER string output to a numerical format, as you would do with any text data before feeding it to SPARK ML. Please read the SPARK ML documentation on this. I think that they are quite clear on how to do that. But more importantly please try to
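
In Spark ML the usual tool for this translation is `pyspark.ml.feature.StringIndexer`, which maps each distinct string label to a double index, most frequent label first. A dependency-free sketch of that same idea, using a made-up sample of NER tags:

```python
from collections import Counter

# Hypothetical NER tags as produced by an NER pipeline (made-up sample).
ner_tags = ["PER", "ORG", "PER", "LOC", "PER", "ORG"]

# Mirror StringIndexer's default behaviour: the most frequent label gets
# index 0, the next most frequent gets 1, and so on.
freq_order = [label for label, _ in Counter(ner_tags).most_common()]
label_to_index = {label: float(i) for i, label in enumerate(freq_order)}

# Indices are doubles because Spark evaluators expect numeric label columns.
numeric_labels = [label_to_index[t] for t in ner_tags]
print(label_to_index)   # {'PER': 0.0, 'ORG': 1.0, 'LOC': 2.0}
print(numeric_labels)   # [0.0, 1.0, 0.0, 2.0, 0.0, 1.0]
```

In a real pipeline the same mapping would have to be applied consistently to both predictions and gold labels before handing them to MulticlassClassificationEvaluator.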

RE: HiveThrift2 ACID Transactions?

2021-11-11 Thread Bode, Meikel, NMA-CFD
Hi all, I now have some more input related to the issues I face at the moment: when I try to UPDATE an external table via a JDBC connection to the HiveThrift2 server, I get the following exception: java.lang.UnsupportedOperationException: UPDATE TABLE is not supported temporarily. When doing an

Re: Using MulticlassClassificationEvaluator for NER evaluation

2021-11-11 Thread Martin Wunderlich
Hi Gourav, Mostly correct. The output of SparkNLP here is a trained pipeline/model/transformer. I am feeding this trained pipeline to the MulticlassClassificationEvaluator for evaluation, and this MulticlassClassificationEvaluator only accepts floats or doubles as the labels (instead of NER

Re: PySpark: toPandas() vs collect() execution graph differences

2021-11-11 Thread Mich Talebzadeh
Have you tried the following settings: spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true") spark.conf.set("spark.sql.execution.arrow.pyspark.fallback.enabled", "true") HTH
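
A sketch of applying these two Arrow-related settings, assuming an already-created SparkSession bound to `spark` (config keys as documented for Spark 3.x):

```python
# Assumes `spark` is an existing SparkSession
# (e.g. SparkSession.builder.getOrCreate()).

# Use Arrow for the JVM -> pandas conversion inside toPandas():
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")
# If the Arrow path fails, silently fall back to the plain
# collect()-based conversion instead of raising an error:
spark.conf.set("spark.sql.execution.arrow.pyspark.fallback.enabled", "true")

pdf = spark.range(10).toPandas()  # now attempts the Arrow code path
```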

Re: Using MulticlassClassificationEvaluator for NER evaluation

2021-11-11 Thread Gourav Sengupta
Hi Martin, just to confirm, you are taking the output of SparkNLP, and then trying to feed it to SPARK ML for running algorithms on the NER output generated by SparkNLP, right? Regards, Gourav Sengupta On Thu, Nov 11, 2021 at 8:00 AM wrote: > Hi Sean, > > Apologies for the delayed reply.

Re: Feature (?): Setting custom parameters for a Spark MLlib pipeline

2021-11-11 Thread martin
Yes, that would be a suitable option. We could just extend the standard Spark MLlib Transformer and add the required metadata. Just out of curiosity: is there a specific reason why the user of a standard Transformer would not be able to add arbitrary key-value pairs for additional
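
One way to read "extend the standard Transformer" is a thin wrapper that delegates `transform()` and carries arbitrary key-value metadata alongside. A plain-Python sketch of the pattern (the class names and the identity transformer are made up for illustration; a real version would subclass `pyspark.ml.Transformer` and use its Params machinery instead of a bare dict):

```python
# Sketch only: no Spark dependency, just the wrapping idea.
class MetadataTransformer:
    """Wraps any transformer-like object and attaches free-form metadata."""

    def __init__(self, inner, **metadata):
        self.inner = inner              # the wrapped transformer
        self.metadata = dict(metadata)  # arbitrary key-value pairs

    def transform(self, dataset):
        # Delegate the actual work to the wrapped transformer.
        return self.inner.transform(dataset)

class IdentityTransformer:
    """Stand-in for a real Spark ML Transformer."""
    def transform(self, dataset):
        return dataset

t = MetadataTransformer(IdentityTransformer(),
                        trained_on="corpus-v2", author="martin")
print(t.metadata["trained_on"])  # corpus-v2
print(t.transform([1, 2, 3]))    # [1, 2, 3]
```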