Re: specifying schema on dataframe

2017-02-04 Thread Dirceu Semighini Filho
Hi Sam, Remove the " from the number and it will work. On 4 Feb 2017 at 11:46 AM, "Sam Elamin" wrote: > Hi All > > I would like to specify a schema when reading from a json but when trying > to map a number to a Double it fails, I tried FloatType and IntType with
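The fix suggested above, dropping the quotes so the JSON value is a number rather than a string, can be sketched as follows. This is a minimal, hypothetical example; the field name and file path are assumptions, not from the thread:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{DoubleType, StructField, StructType}

// Hypothetical sketch: read JSON with an explicit schema that maps a field
// to DoubleType. {"price": 10.5} parses as a Double, while {"price": "10.5"}
// is a JSON *string* and will not map to DoubleType -- hence "remove the quotes".
val spark = SparkSession.builder().master("local[*]").appName("schema-example").getOrCreate()
val schema = StructType(Seq(StructField("price", DoubleType, nullable = true)))
val df = spark.read.schema(schema).json("prices.json") // path is an assumption
```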

Re: [SparkStreaming] 1 SQL tab for each SparkStreaming batch in SparkUI

2016-11-22 Thread Dirceu Semighini Filho
8:51 AM, Dirceu Semighini Filho < > dirceu.semigh...@gmail.com> wrote: > >> Has anybody seen this behavior (see the attached picture) in Spark >> Streaming? >> It started to happen here after I changed the HiveContext creation to >> stream.foreachRDD { >> rdd

[SparkStreaming] 1 SQL tab for each SparkStreaming batch in SparkUI

2016-11-22 Thread Dirceu Semighini Filho
Has anybody seen this behavior (see the attached picture) in Spark Streaming? It started to happen here after I changed the HiveContext creation to stream.foreachRDD { rdd => val hiveContext = new HiveContext(rdd.sparkContext) } Is this expected? Kind Regards, Dirceu
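A common workaround for this pattern (a sketch, not necessarily the fix this thread settled on) is to create the HiveContext once and reuse it across batches, instead of constructing a new one inside every foreachRDD call, since each new context registers a fresh SQL execution environment:

```scala
import org.apache.spark.SparkContext
import org.apache.spark.sql.hive.HiveContext

// Sketch: lazily create one HiveContext and reuse it for every streaming
// batch. Constructing a new HiveContext per foreachRDD invocation is
// consistent with seeing one SQL tab per batch in the UI.
object HiveContextSingleton {
  @transient private var instance: HiveContext = _
  def getInstance(sc: SparkContext): HiveContext = synchronized {
    if (instance == null) instance = new HiveContext(sc)
    instance
  }
}

// stream.foreachRDD { rdd =>
//   val hiveContext = HiveContextSingleton.getInstance(rdd.sparkContext)
//   // ... queries ...
// }
```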

Re: Duplicated fit into TrainValidationSplit

2016-04-27 Thread Dirceu Semighini Filho
entire dataset. This is standard practice - > usually the dataset passed to the train validation split is itself further > split into a training and test set, where the final best model is evaluated > against the test set. > > On Wed, 27 Apr 2016 at 14:30, Dirceu Semighini Filho < > dirceu

Duplicated fit into TrainValidationSplit

2016-04-27 Thread Dirceu Semighini Filho
Hi guys, I was testing a pipeline here, and found a possible duplicated call to the fit method in org.apache.spark.ml.tuning.TrainValidationSplit
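For context, a minimal sketch of the API in question, using Spark ML names; the estimator choice and parameter values are illustrative assumptions. As the reply above notes, fit evaluates each candidate on the validation split and then refits the best parameters on the full dataset, which is where a second fit call can legitimately appear:

```scala
import org.apache.spark.ml.evaluation.RegressionEvaluator
import org.apache.spark.ml.regression.LinearRegression
import org.apache.spark.ml.tuning.{ParamGridBuilder, TrainValidationSplit}

// Sketch of TrainValidationSplit usage (illustrative, not the thread's code).
val lr = new LinearRegression()
val grid = new ParamGridBuilder()
  .addGrid(lr.regParam, Array(0.01, 0.1))
  .build()
val tvs = new TrainValidationSplit()
  .setEstimator(lr)
  .setEvaluator(new RegressionEvaluator())
  .setEstimatorParamMaps(grid)
  .setTrainRatio(0.75) // 75% train, 25% validation
// val model = tvs.fit(training) // 'training' DataFrame assumed
```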

Fwd: Null Value in DecimalType column of DataFrame

2015-09-15 Thread Dirceu Semighini Filho
will be [0, 1) >> and casting "10.5" to DecimalType(10, 10) will return null, which is >> expected. >> >> On Mon, Sep 14, 2015 at 1:42 PM, Dirceu Semighini Filho < >> dirceu.semigh...@gmail.com> wrote: >> >>> Hi all, >>>

Null Value in DecimalType column of DataFrame

2015-09-14 Thread Dirceu Semighini Filho
Hi all, I'm moving from Spark 1.4 to 1.5, and one of my tests is failing. It seems that there were some changes in org.apache.spark.sql.types.DecimalType. This ugly code is a little sample to reproduce the error, don't use it in your project. test("spark test") { val file =
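The behavior described in the reply above follows from precision and scale: DecimalType(10, 10) has 10 total digits with all 10 after the decimal point, so a value like 10.5, which needs integer digits, cannot be represented and the cast returns null. A plain java.math.BigDecimal check illustrates the digit counting (the Spark type itself appears only in comments):

```scala
import java.math.BigDecimal

// DecimalType(precision = 10, scale = 10) leaves precision - scale = 0
// digits before the decimal point. "10.5" needs 2 integer digits, so it
// cannot fit, which is why Spark returns null on the cast.
val value = new BigDecimal("10.5")
val integerDigits = value.precision - value.scale // digits before the point
val fitsInDecimal10_10 = integerDigits <= (10 - 10)
println(s"integer digits = $integerDigits, fits in DecimalType(10,10) = $fitsInDecimal10_10")
```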

Re: - Spark 1.4.1 - run-example SparkPi - Failure ...

2015-08-13 Thread Dirceu Semighini Filho
Hi Naga, This happened here sometimes when the memory of the Spark cluster wasn't enough, and the Java GC entered an infinite loop trying to free some memory. To fix this I just added more memory to the Workers of my cluster, or you can increase the number of partitions of your RDD, using the
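The second suggestion, increasing the number of partitions so each task holds a smaller working set in memory, can be sketched like this (the dataset name and partition count are illustrative assumptions):

```scala
import org.apache.spark.rdd.RDD

// Sketch: more, smaller partitions reduce per-task memory pressure.
// For example, ~8 GB of data in 120 partitions is roughly
// 8192 MB / 120 = 68 MB per partition.
def withSmallerPartitions[T](rdd: RDD[T], numPartitions: Int): RDD[T] =
  rdd.repartition(numPartitions)

// val repartitioned = withSmallerPartitions(data, 120) // 'data' assumed
```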

Re: - Spark 1.4.1 - run-example SparkPi - Failure ...

2015-08-13 Thread Dirceu Semighini Filho
on Mac? -- Regards Naga On Thu, Aug 13, 2015 at 11:46 AM, Dirceu Semighini Filho dirceu.semigh...@gmail.com wrote: Hi Naga, This happened here sometimes when the memory of the Spark cluster wasn't enough, and the Java GC entered an infinite loop trying to free some memory. To fix this I

Re: How to create a Row from a List or Array in Spark using Scala

2015-03-02 Thread Dirceu Semighini Filho
You can use the parallelize method: val data = List( Row(1, 5, vlr1, 10.5), Row(2, 1, vl3, 0.1), Row(3, 8, vl3, 10.0), Row(4, 1, vl4, 1.0)) val rdd = sc.parallelize(data) Here I'm using a list of Rows, but you could use it with a list of another kind of object, like this: val x =
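A complete version of the snippet above. The quoted string values are placeholders: in the original, vlr1, vl3, and vl4 are values defined elsewhere, so treating them as string literals here is an assumption for illustration:

```scala
import org.apache.spark.sql.Row

// Sketch completing the example: parallelize a List of Rows into an RDD.
// String values stand in for whatever vlr1/vl3/vl4 held in the original.
val data = List(
  Row(1, 5, "vlr1", 10.5),
  Row(2, 1, "vl3", 0.1),
  Row(3, 8, "vl3", 10.0),
  Row(4, 1, "vl4", 1.0))
// val rdd = sc.parallelize(data) // 'sc' is the SparkContext
// Any other element type works the same way, e.g.:
// val x = sc.parallelize(List((1, "a"), (2, "b")))
```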

Spark performance on 32 Cpus Server Cluster

2015-02-20 Thread Dirceu Semighini Filho
Hi all, I'm running Spark 1.2.0, in standalone mode, on different cluster and server sizes. All of my data is cached in memory. Basically I have a mass of data, about 8gb, with about 37k columns, and I'm running different configs of a BinaryLogisticRegressionBFGS. When I put Spark to run on 9

Re: Spark performance on 32 Cpus Server Cluster

2015-02-20 Thread Dirceu Semighini Filho
other unless one depends on the other. You'd have to clarify what you mean by running stages in parallel, like what are the interdependencies. On Fri, Feb 20, 2015 at 10:01 AM, Dirceu Semighini Filho dirceu.semigh...@gmail.com wrote: Hi all, I'm running Spark 1.2.0, in standalone mode

Re: PSA: Maven supports parallel builds

2015-02-05 Thread Dirceu Semighini Filho
Thanks Nicholas, I didn't know this. 2015-02-05 22:16 GMT-02:00 Nicholas Chammas nicholas.cham...@gmail.com: Y'all may already know this, but I haven't seen it mentioned anywhere in our docs or on here and it's a pretty easy win. Maven supports parallel builds
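The feature in question is Maven 3's `-T` option; an illustrative invocation (the goals and `-DskipTests` are assumptions, not from the thread):

```shell
# -T 1C uses one worker thread per CPU core; -T also accepts an absolute
# count, e.g. -T 4. Mutually independent modules are built in parallel.
mvn -T 1C -DskipTests clean package
```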

Re: [VOTE] Release Apache Spark 1.2.1 (RC3)

2015-02-03 Thread Dirceu Semighini Filho
Hi Patrick, I work at a startup and we want to make one of our projects open source. This project is based on Spark, and it will help users to instantiate Spark clusters in a cloud environment. But for that project we need to use the repl, hive and thrift-server. Can the decision of not

TimeoutException on tests

2015-01-29 Thread Dirceu Semighini Filho
Hi All, I'm trying to use a locally built Spark, adding PR 1290 to the 1.2.0 build, and after I do the build my tests start to fail. should create labeledpoint *** FAILED *** (10 seconds, 50 milliseconds) [info] java.util.concurrent.TimeoutException: Futures timed out after [1

Re: Use mvn to build Spark 1.2.0 failed

2015-01-28 Thread Dirceu Semighini Filho
I was facing the same problem, and I fixed it by adding <plugin> <artifactId>maven-assembly-plugin</artifactId> <version>2.4.1</version> <configuration> <descriptors> <descriptor>assembly/src/main/assembly/assembly.xml</descriptor> </descriptors> </configuration>

Re: renaming SchemaRDD - DataFrame

2015-01-27 Thread Dirceu Semighini Filho
to not break source compatibility for Scala. On Tue, Jan 27, 2015 at 6:28 AM, Dirceu Semighini Filho dirceu.semigh...@gmail.com wrote: Can't the SchemaRDD remain the same, but deprecated, and be removed in release 1.5 (+/- 1) for example, and the new code be added to DataFrame

Re: Issue with repartition and cache

2015-01-21 Thread Dirceu Semighini Filho
looking to parse and convert it, toInt should be used instead of asInstanceOf. -Sandy On Wed, Jan 21, 2015 at 8:43 AM, Dirceu Semighini Filho dirceu.semigh...@gmail.com wrote: Hi guys, has anyone found something like this? I have a training set, and when I repartition it, if I call cache
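The advice above, parse with toInt rather than cast, in a minimal framework-free sketch:

```scala
// asInstanceOf does not convert between types: on a String it merely
// asserts the value already *is* an Int, which fails at runtime with a
// ClassCastException. toInt actually parses the digits.
val s = "120"
val parsed = s.toInt // conversion: parses the string into 120

val castSucceeded = try {
  s.asInstanceOf[Int] // throws ClassCastException: String is not an Int
  true
} catch {
  case _: ClassCastException => false
}
println(s"parsed = $parsed, cast succeeded = $castSucceeded")
```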

Issue with repartition and cache

2015-01-21 Thread Dirceu Semighini Filho
Hi guys, has anyone found something like this? I have a training set, and when I repartition it, if I call cache it throws a ClassCastException when I try to execute anything that accesses it: val rep120 = train.repartition(120) val cached120 = rep120.cache cached120.map(f =

Spark 1.2.0 Repl

2014-12-26 Thread Dirceu Semighini Filho
Hello, Is there any reason for not publishing the Spark repl in version 1.2.0? In repl/pom.xml the deploy and publish are being skipped. Regards, Dirceu