Re: spark-sklearn

2019-04-08 Thread Abdeali Kothari
I haven't used spark-sklearn much, but their Travis file gives the combinations they test with: https://github.com/databricks/spark-sklearn/blob/master/.travis.yml#L8 Also, your first email is a bit confusing - you mentioned Spark 2.2.3, but the traceback path says spark-2.4.1-bin-hadoop2.6. I then

Spark2: Deciphering saving text file name

2019-04-08 Thread Subash Prabakar
Hi, While saving as a text file in Spark2, I see an encoded/hash value attached to the part files along with the part number. I am curious to know what that value is about. Example: ds.write.mode(SaveMode.Overwrite).option("compression","gzip").text(path) Produces,
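In Spark 2.x the extra token in the part-file name is, as far as I know, a random UUID generated once per write job so that output files from different jobs never collide; all part files of a single write share the same UUID. A small stdlib-only sketch (no Spark needed) parsing a filename in that style — the example filename and exact pattern are assumptions based on the default Spark 2.x naming scheme and may vary by version and file format:

```python
import re

# Hypothetical filename in the Spark 2.x style:
# part-<partition index>-<per-job UUID>-c<file counter>.<extensions>
name = "part-00000-66f7c57a-3ab2-4b19-bd1f-6e52506f96c0-c000.txt.gz"

# Assumed pattern for the default naming scheme (not an official spec):
pattern = re.compile(
    r"part-(?P<part>\d{5})-"                                   # partition index
    r"(?P<uuid>[0-9a-f]{8}(?:-[0-9a-f]{4}){3}-[0-9a-f]{12})"   # per-job UUID
    r"(?:-c(?P<count>\d+))?"                                   # optional file counter
    r"(?P<ext>(?:\.\w+)+)"                                     # extensions, e.g. .txt.gz
)

m = pattern.match(name)
print(m.group("part"))  # partition index: 00000
print(m.group("uuid"))  # job-level UUID, identical across all part files of one write
```

The UUID is not derived from the data; rerunning the same write produces a different value.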

Re: spark-sklearn

2019-04-08 Thread Sudhir Babu Pothineni
Thanks Stephen, I saw that, but this is the already-released version, spark-sklearn 0.3.0, so the tests should be working. I am just checking whether I am doing anything wrong (versions of other libraries, etc.). Thanks, Sudhir > On Apr 8, 2019, at 1:52 PM, Stephen Boesch wrote: > > There are several suggestions

[no subject]

2019-04-08 Thread Siddharth Reddy
unsubscribe

Re: spark-sklearn

2019-04-08 Thread Stephen Boesch
There are several suggestions on this Stack Overflow question: https://stackoverflow.com/questions/38984775/spark-errorexpected-zero-arguments-for-construction-of-classdict-for-numpy-cor 1) You need to convert the final value to a Python list. You implement the function as follows: def uniq_array(col_array): x =
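The original function is truncated above, so the following is only a hedged sketch of the fix that answer describes: a Python UDF must return built-in types (e.g. a plain list), because returning a numpy array makes Spark's serializer fail with the "expected zero arguments for construction of ClassDict (for numpy...)" error. numpy and pyspark are deliberately omitted so the sketch stays self-contained; with numpy, the key line would be something like `return np.unique(col_array).tolist()`.

```python
def uniq_array(col_array):
    # Deduplicate and return a built-in list, which Spark can pickle.
    # (Returning a numpy array here is what triggers the ClassDict error.)
    return sorted(set(col_array))

print(uniq_array([3, 1, 2, 3, 1]))  # [1, 2, 3]
```

In a real job this function would be wrapped with `pyspark.sql.functions.udf` and an `ArrayType` return type; the names here are illustrative.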

Re: spark-sklearn

2019-04-08 Thread Sudhir Babu Pothineni
> Trying to run tests in spark-sklearn, anybody check the below exception
>
> pip freeze:
>
> nose==1.3.7
> numpy==1.16.1
> pandas==0.19.2
> python-dateutil==2.7.5
> pytz==2018.9
> scikit-learn==0.19.2
> scipy==1.2.0
> six==1.12.0
> spark-sklearn==0.3.0
>
> Spark version:
>

Parallelize Join Problem

2019-04-08 Thread Paul.Bauriegel
Hi, I'm struggling with a join of two large DataFrames. The join is extremely slow because it is only executed on one worker. At the first checkpoint Spark uses all four workers, but at the second it only uses one. I first thought it might have something to do with that Spark wants to load the
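A common cause of a join running on a single worker is key skew: one join key dominates, so all its rows hash to one partition. One standard workaround is "salting". Below is a plain-Python sketch of the idea (no Spark, and the names and fan-out are assumptions, not the poster's code): append a random salt to the key on the large side, and replicate each row on the small side once per salt value, so the join key becomes (key, salt) and the hot key spreads across partitions while still producing the same matched pairs.

```python
import random

NUM_SALTS = 4  # assumed salt fan-out; in practice tuned to the observed skew

def salt_large_side(key):
    # Large/skewed side: spread one hot key across NUM_SALTS buckets.
    return (key, random.randrange(NUM_SALTS))

def explode_small_side(key):
    # Small side: replicate the key once per salt value so every
    # salted bucket on the large side still finds its match.
    return [(key, s) for s in range(NUM_SALTS)]

print(explode_small_side("hot_key"))
```

In Spark this translates to adding a salt column (e.g. with `rand()`) on the large side and cross-joining the small side with the salt range before joining on both columns; broadcasting the small side, if it fits in memory, is the other usual fix.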

Re: Spark SQL API taking longer time than DF API.

2019-04-08 Thread chris
Hi, Without more information it’s very difficult to work out what’s going on. If possible, can you do the following and make the results available to us: 1) For each query, call explain() and post the output. 2) Run each query, then go to the SQL tab in the Spark UI. For each query, show us the plan. 3)

Re: Spark SQL API taking longer time than DF API.

2019-04-08 Thread neeraj bhadani
Hi All, Can anyone help me here with my query? Regards, Neeraj On Mon, Apr 1, 2019 at 9:44 AM neeraj bhadani wrote: > In both cases, I am trying to create a Hive table based on a union of 2 > identical queries. > > Not sure how it internally differs in the process of creating the Hive > table?

Re: Is there any spark API function to handle a group of companies at once in this scenario?

2019-04-08 Thread Shyam P
Hi Mich, thanks for your prompt reply. I get some company financial data, such as profits, etc. I would get this company data through Kafka topics, which are fed by a REST service. I am thinking of using Spark Structured Streaming and putting the results back in Hive/C*. Regards, Shyam On Sun, Apr 7,