Re: Best way to process this dataset

2018-06-19 Thread Matteo Cossu
Single machine? Any other framework will perform better than Spark. On Tue, 19 Jun 2018 at 09:40, Aakash Basu wrote: > Georg, just asking, can Pandas handle such a big dataset? If that data is > then passed into any of the sklearn modules? > > On Tue, Jun 19, 2018 at 10:35 AM, Georg

Re: Help explaining explain() after DataFrame join reordering

2018-06-05 Thread Matteo Cossu
Hello, as explained here, the join order can be changed by the optimizer. The difference introduced in Spark 2.2 is that the reordering is based on statistics instead of heuristics, which can appear "random"
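
A hedged sketch of how one might see the statistics-based reordering at work; the table names and query are placeholders, not from the thread, and an active SparkSession is assumed (as in spark-shell):

```scala
// Cost-based optimization is off by default in Spark 2.2
spark.conf.set("spark.sql.cbo.enabled", "true")
spark.conf.set("spark.sql.cbo.joinReorder.enabled", "true")

// Collect the statistics the optimizer uses for reordering
// (t1, t2, t3 are placeholder table names)
spark.sql("ANALYZE TABLE t1 COMPUTE STATISTICS")
spark.sql("ANALYZE TABLE t2 COMPUTE STATISTICS")
spark.sql("ANALYZE TABLE t3 COMPUTE STATISTICS")

// The join order in the physical plan may differ from the query text
spark.sql(
  "SELECT * FROM t1 JOIN t2 ON t1.id = t2.id JOIN t3 ON t2.id = t3.id"
).explain(true)
```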

Re: OneHotEncoderEstimator - java.lang.NoSuchMethodError: org.apache.spark.sql.Dataset.withColumns

2018-05-18 Thread Matteo Cossu
Hi, are you sure Dataset has a method withColumns? On 15 May 2018 at 16:58, Mina Aslani wrote: > Hi, > > I get the below error when I try to run the OneHotEncoderEstimator example. > https://github.com/apache/spark/blob/b74366481cc87490adf4e69d26389e >
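
OneHotEncoderEstimator (and the Dataset.withColumns method the example relies on) only exist from Spark 2.3.0 onward, so a NoSuchMethodError usually points at a runtime version older than the example. A minimal sketch with hypothetical column names, assuming an active SparkSession:

```scala
import org.apache.spark.ml.feature.OneHotEncoderEstimator
import org.apache.spark.sql.functions.col

// If this prints something below 2.3.0, the NoSuchMethodError is expected
println(spark.version)

// Placeholder data: an indexed categorical column
val df = spark.range(10).withColumn("categoryIndex", (col("id") % 3).cast("double"))

val encoder = new OneHotEncoderEstimator()
  .setInputCols(Array("categoryIndex"))   // hypothetical column names
  .setOutputCols(Array("categoryVec"))

encoder.fit(df).transform(df).show()
```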

Re: How to Spark can solve this example

2018-05-18 Thread Matteo Cossu
Hello Esa, all the steps that you described can be performed with Spark. I don't know about CEP, but Spark Streaming should be enough. Best, Matteo On 18 May 2018 at 09:20, Esa Heikkinen wrote: > Hi > > > > I have attached a fictive example (PDF file) about
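
Without the attached example it is hard to be specific, but a minimal Structured Streaming skeleton for event-by-event processing might look like the following; the source, host, port, and transformation are all placeholders:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("event-pipeline").getOrCreate()

// Placeholder source: read lines of events from a socket
val events = spark.readStream
  .format("socket")
  .option("host", "localhost")
  .option("port", 9999)
  .load()

// Stand-in for the per-event processing steps of the example
val processed = events.selectExpr("upper(value) AS event")

processed.writeStream
  .format("console")
  .outputMode("append")
  .start()
  .awaitTermination()
```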

Re: Why doesn't spark use broadcast join?

2018-04-18 Thread Matteo Cossu
Can you check the value for spark.sql.autoBroadcastJoinThreshold? On 29 March 2018 at 14:41, Vitaliy Pisarev wrote: > I am looking at the physical plan for the following query: > > SELECT f1,f2,f3,... > FROM T1 > LEFT ANTI JOIN T2 ON T1.id = T2.id > WHERE f1 =
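
A sketch of how that threshold drives the decision, with placeholder tables and an assumed active SparkSession; the broadcast() hint forces a broadcast regardless of size estimates:

```scala
import org.apache.spark.sql.functions.broadcast

// Tables below this estimated size (default 10 MB) are broadcast
// automatically; -1 disables automatic broadcast joins entirely
println(spark.conf.get("spark.sql.autoBroadcastJoinThreshold"))
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 100L * 1024 * 1024)

val t1 = spark.range(1000000).toDF("id")   // placeholder for the large table
val t2 = spark.range(100).toDF("id")       // placeholder for the small table

// Explicit hint: broadcast t2 even if its statistics are missing or stale
t1.join(broadcast(t2), Seq("id"), "left_anti").explain()
```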

Re: Broadcasting huge array or persisting on HDFS to read on executors - both not working

2018-04-12 Thread Matteo Cossu
but how to do it for each user? > You have one RDD of users on one hand and an RDD of items on the other. How > to go from here? Am I missing something trivial? > > > On Thursday, 12 April, 2018, 2:10:51 AM IST, Matteo Cossu < > elco...@gmail.com> wrote: > > > Why broad
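
The excerpt does not show the thread's final answer; one pattern consistent with the advice below (sample first, broadcast only the small result) is sketched here as an assumption, with placeholder data:

```scala
// Assumption: shrink the items with sample() on the executors, then
// broadcast only the small sample and attach it to every user
val users = spark.sparkContext.parallelize(Seq("u1", "u2", "u3"))
val items = spark.sparkContext.parallelize(0 until 1000000).map(_.toFloat)

val itemSample = items.sample(withReplacement = false, fraction = 0.001).collect()
val bcSample = spark.sparkContext.broadcast(itemSample)

// Each user gets the (small) sampled item set, no huge broadcast needed
val userWithItems = users.map(u => (u, bcSample.value))
```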

Re: Broadcasting huge array or persisting on HDFS to read on executors - both not working

2018-04-11 Thread Matteo Cossu
Why broadcast this list then? You should use an RDD or DataFrame. For example, RDD has a method sample() that returns a random sample from it. On 11 April 2018 at 22:34, surender kumar wrote: > I'm using pySpark. > I've a list of 1 million items (all float values
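
RDD.sample draws the subset in parallel on the executors, so the full list never has to sit on one machine; a minimal sketch with generated placeholder data:

```scala
import scala.util.Random

// One million float values, generated directly on the executors
val items = spark.sparkContext
  .parallelize(0 until 1000000)
  .map(_ => Random.nextFloat())

// Roughly a 1% random sample, drawn without collecting anything locally
val subset = items.sample(withReplacement = false, fraction = 0.01, seed = 42L)
println(subset.count())
```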

Re: can udaf's return complex types?

2018-02-13 Thread Matteo Cossu
Hello, yes, sure they can return complex types. For example, the functions collect_list and collect_set return an ArrayType. On 10 February 2018 at 14:28, kant kodali wrote: > Hi All, > > Can UDAFs return complex types? like, say, a Map with key as an Integer and > the value
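
A hedged sketch of a UDAF with a complex return type (a MapType, close to what the question asks for), using the Spark 2.x UserDefinedAggregateFunction API; the class name and semantics (counting occurrences per integer key) are hypothetical:

```scala
import org.apache.spark.sql.Row
import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction}
import org.apache.spark.sql.types._

// Counts how often each integer key occurs; returns Map[Int, Long]
class CountByKey extends UserDefinedAggregateFunction {
  override def inputSchema: StructType = new StructType().add("key", IntegerType)
  override def bufferSchema: StructType =
    new StructType().add("counts", MapType(IntegerType, LongType))
  override def dataType: DataType = MapType(IntegerType, LongType) // complex return type
  override def deterministic: Boolean = true

  override def initialize(buffer: MutableAggregationBuffer): Unit =
    buffer(0) = Map.empty[Int, Long]

  override def update(buffer: MutableAggregationBuffer, input: Row): Unit =
    if (!input.isNullAt(0)) {
      val m = buffer.getAs[Map[Int, Long]](0)
      val k = input.getInt(0)
      buffer(0) = m + (k -> (m.getOrElse(k, 0L) + 1L))
    }

  override def merge(b1: MutableAggregationBuffer, b2: Row): Unit = {
    val m1 = b1.getAs[Map[Int, Long]](0)
    val m2 = b2.getAs[Map[Int, Long]](0)
    b1(0) = m2.foldLeft(m1) { case (acc, (k, v)) =>
      acc + (k -> (acc.getOrElse(k, 0L) + v))
    }
  }

  override def evaluate(buffer: Row): Any = buffer.getAs[Map[Int, Long]](0)
}
```

Once registered with spark.udf.register("count_by_key", new CountByKey), it can be used in SQL the same way as collect_list or collect_set.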

Re: [Spark DataFrame]: Passing DataFrame to custom method results in NullPointerException

2018-01-22 Thread Matteo Cossu
that, in general, you are using the wrong approach to solve the problem. Best Regards, Matteo Cossu On 15 January 2018 at 12:56, <abdul.h.huss...@bt.com> wrote: > Hi, > > > > My Spark app is mapping lines from a text file to case classes stored > within an RDD. >
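
A common cause of this NullPointerException, offered here as an assumption about the thread: referencing a DataFrame, RDD, or SparkSession inside another RDD's closure. Those handles exist only on the driver, so expressing the relationship as a join is the usual fix:

```scala
val mainDf = spark.range(10).toDF("id")   // placeholder data
val lookupDf = spark.range(5).toDF("id")

// Anti-pattern: using one DataFrame inside another's transformation;
// the nested handle is null/unusable on the executors:
// mainDf.rdd.map(row => lookupDf.count())   // NullPointerException at runtime

// Fix: do the combination as a driver-side join instead
val result = mainDf.join(lookupDf, "id")
result.show()
```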

Re: How to convert Array of Json rows into Dataset of specific columns in Spark 2.2.0?

2017-10-07 Thread Matteo Cossu
Hello, I think you should use *from_json* from spark.sql.functions to parse the JSON string and convert it to a StructType. Afterwards, you
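
A minimal sketch of that approach; the schema and sample JSON are hypothetical stand-ins for the actual rows, and an active SparkSession is assumed:

```scala
import org.apache.spark.sql.functions.from_json
import org.apache.spark.sql.types._
import spark.implicits._

// Hypothetical schema -- adjust the fields to the real JSON rows
val schema = StructType(Seq(
  StructField("name", StringType),
  StructField("age", IntegerType)
))

val raw = Seq("""{"name":"alice","age":30}""").toDF("json")

// Parse the string column into a struct, then select the wanted columns
val parsed = raw
  .select(from_json($"json", schema).as("data"))
  .select("data.name", "data.age")
parsed.show()
```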

Re: Trying to connect Spark 1.6 to Hive

2017-08-09 Thread Matteo Cossu
Hello, try to use these options when starting Spark: *--conf "spark.driver.userClassPathFirst=true" --conf "spark.executor.userClassPathFirst=true"* In this way you will be sure that the executor and the driver of Spark will use the classpath you define. Best Regards,

Re: Reading Hive tables Parallel in Spark

2017-07-18 Thread Matteo Cossu
The context you use for calling SparkSQL can be used only in the driver. Moreover, collect() works because it brings the RDD into the driver's local memory, but it should be used only for debugging (95% of the time); if all your data fits into a single machine's memory you shouldn't use Spark at all but
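
To make the driver-only restriction concrete, a small sketch with hypothetical table names: the session can be used from a plain driver-side loop, but never from inside an RDD or DataFrame operation:

```scala
// Fine: a driver-side loop over a local collection
val tableNames = Seq("db.t1", "db.t2")
val counts = tableNames.map(name => (name, spark.table(name).count()))

// Wrong: the session does not exist on the executors, so this fails
// spark.sparkContext.parallelize(tableNames)
//   .map(name => spark.table(name).count())   // NPE / unsupported
```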

Re: Reading Hive tables Parallel in Spark

2017-07-17 Thread Matteo Cossu
Hello, have you tried to use threads instead of the loop? On 17 July 2017 at 14:12, FN wrote: > Hi > I am currently trying to parallelize reading multiple tables from Hive. As > part of an archival framework, I need to convert a few hundred tables which > are in txt format
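
A sketch of the threaded variant using Scala Futures; the table list and Parquet output path are assumptions (the thread only says the tables are in txt format). Submitting jobs from several threads is safe because SparkSession is thread-safe for job submission:

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration.Duration

// Hypothetical table list; each conversion is launched from its own
// thread so the Spark jobs run concurrently instead of one by one
val tables = Seq("db.table_a", "db.table_b", "db.table_c")

val conversions = tables.map { name =>
  Future {
    spark.table(name)
      .write
      .mode("overwrite")
      .parquet(s"/archive/${name.replace('.', '/')}")   // assumed target
  }
}

Await.result(Future.sequence(conversions), Duration.Inf)
```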