Re: Spark 1.5.0-SNAPSHOT broken with Scala 2.11

2015-06-29 Thread Alessandro Baretta
Steve, It was indeed a protocol buffers issue. I am able to build spark now. Thanks. On Mon, Jun 29, 2015 at 7:37 AM, Steve Loughran ste...@hortonworks.com wrote: On 29 Jun 2015, at 11:27, Iulian Dragoș iulian.dra...@typesafe.com wrote: On Mon, Jun 29, 2015 at 3:02 AM, Alessandro

Spark 1.5.0-SNAPSHOT broken with Scala 2.11

2015-06-28 Thread Alessandro Baretta
I am building the current master branch with Scala 2.11 following these instructions: "Building for Scala 2.11: To produce a Spark package compiled with Scala 2.11, use the -Dscala-2.11 property:"

    dev/change-version-to-2.11.sh
    mvn -Pyarn -Phadoop-2.4 -Dscala-2.11 -DskipTests clean package

Spark-Shell 2.11 1.4.0-RC-03 does not add jars to class path

2015-06-16 Thread Alessandro Baretta
This bug still exists in Spark-1.4.0. Is there a workaround for it? https://issues.apache.org/jira/browse/SPARK-7944 Thanks, Alex

Re: Join the developer community of spark

2015-01-19 Thread Alessandro Baretta
https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark Enjoy! Alex On Mon, Jan 19, 2015 at 6:44 PM, Jeff Wang jingjingwang...@gmail.com wrote: Hi: I would like to contribute to the code of spark. Can I join the community? Thanks, Jeff

Memory config issues

2015-01-18 Thread Alessandro Baretta
All, I'm getting out of memory exceptions in SparkSQL GROUP BY queries. I have plenty of RAM, so I should be able to brute-force my way through, but I can't quite figure out what memory option affects what process. My current memory configuration is the following: export
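
For anyone debugging the same class of failure, these are the memory knobs that typically mattered for Spark 1.2-era GROUP BY jobs; a minimal sketch, where the values are placeholders rather than recommendations:

    import org.apache.spark.SparkConf

    // Driver heap must be sized before the driver JVM starts, e.g.
    //   spark-submit --driver-memory 8g
    // Executor-side settings can go through SparkConf:
    val conf = new SparkConf()
      .setAppName("groupby-job")
      .set("spark.executor.memory", "16g")          // heap per executor
      .set("spark.shuffle.memoryFraction", "0.4")   // share for shuffle aggregation
      .set("spark.storage.memoryFraction", "0.4")   // share for cached RDDs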

Re: Memory config issues

2015-01-18 Thread Alessandro Baretta
Regards On Mon, Jan 19, 2015 at 11:36 AM, Alessandro Baretta alexbare...@gmail.com wrote: All, I'm getting out of memory exceptions in SparkSQL GROUP BY queries. I have plenty of RAM, so I should be able to brute-force my way through, but I can't quite figure out what memory option affects

Re: Join implementation in SparkSQL

2015-01-16 Thread Alessandro Baretta
/scala/org/apache/spark/sql/execution/SparkStrategies.scala In most common use cases (e.g. inner equi join), filters are pushed below the join or into the join. Doing a cartesian product followed by a filter is too expensive. On Thu, Jan 15, 2015 at 7:39 AM, Alessandro Baretta alexbare

Re: Spark SQL API changes and stabilization

2015-01-16 Thread Alessandro Baretta
, 2015 at 7:53 AM, Alessandro Baretta alexbare...@gmail.com wrote: Reynold, Thanks for the heads up. In general, I strongly oppose the use of private to restrict access to certain parts of the API, the reason being that I might find the need to use some of the internals of a library from my

Re: Spark SQL API changes and stabilization

2015-01-15 Thread Alessandro Baretta
Reynold, Thanks for the heads up. In general, I strongly oppose the use of private to restrict access to certain parts of the API, the reason being that I might find the need to use some of the internals of a library from my own project. I find that a @DeveloperAPI annotation serves the same

Re: Job priority

2015-01-11 Thread Alessandro Baretta
, 2015, Alessandro Baretta alexbare...@gmail.com wrote: Cody, Maybe I'm not getting this, but it doesn't look like this page is describing a priority queue scheduling policy. What this section discusses is how resources are shared between queues. A weight-1000 pool will get 1000 times more

Re: Job priority

2015-01-11 Thread Alessandro Baretta
11, 2015 at 7:36 AM, Alessandro Baretta alexbare...@gmail.com wrote: Cody, While I might be able to improve the scheduling of my jobs by using a few different pools with weights equal to, say, 1, 1e3 and 1e6, effectively getting a small handful of priority classes. Still
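
A minimal sketch of the weighted-pool workaround discussed in this thread, assuming a fairscheduler.xml (path hypothetical) that defines pools "low", "mid", and "high" with weights 1, 1000, and 1000000:

    import org.apache.spark.{SparkConf, SparkContext}

    // Fair scheduling with weighted pools; the allocation file below is
    // assumed to define the pools named in the lead-in.
    val conf = new SparkConf()
      .setAppName("priority-pools")
      .set("spark.scheduler.mode", "FAIR")
      .set("spark.scheduler.allocation.file", "/path/to/fairscheduler.xml")
    val sc = new SparkContext(conf)

    // The pool is a thread-local property: set it before running actions
    // from this thread, and clear it afterwards.
    sc.setLocalProperty("spark.scheduler.pool", "high")
    // ... submit high-priority actions here ...
    sc.setLocalProperty("spark.scheduler.pool", null)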

Job priority

2015-01-10 Thread Alessandro Baretta
Is it possible to specify a priority level for a job, such that the active jobs might be scheduled in order of priority? Alex

Re: Spark driver main thread hanging after SQL insert

2015-01-02 Thread Alessandro Baretta
it on the dev list? That's where we track issues like this. Thanks! - Patrick On Wed, Dec 31, 2014 at 8:48 PM, Alessandro Baretta alexbare...@gmail.com wrote: Here's what the console shows: 15/01/01 01:12:29 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 58.0, whose tasks have all

Spark driver main thread hanging after SQL insert

2014-12-31 Thread Alessandro Baretta
Here's what the console shows: 15/01/01 01:12:29 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 58.0, whose tasks have all completed, from pool 15/01/01 01:12:29 INFO scheduler.DAGScheduler: Stage 58 (runJob at ParquetTableOperations.scala:326) finished in 5493.549 s 15/01/01 01:12:29 INFO

Re: Unsupported Catalyst types in Parquet

2014-12-30 Thread Alessandro Baretta
nanoseconds now. Since passing too many flags is ugly, now I need the whole SQLContext, so that we can put more flags there. Thanks, Daoyuan *From:* Michael Armbrust [mailto:mich...@databricks.com] *Sent:* Tuesday, December 30, 2014 10:43 AM *To:* Alessandro Baretta *Cc:* Wang, Daoyuan; dev

Re: Unsupported Catalyst types in Parquet

2014-12-30 Thread Alessandro Baretta
Sorry! My bad. I had stale spark jars sitting on the slave nodes... Alex On Tue, Dec 30, 2014 at 4:39 PM, Alessandro Baretta alexbare...@gmail.com wrote: Gents, I tried #3820. It doesn't work. I'm still getting the following exceptions: Exception in thread Thread-45

Re: Unsupported Catalyst types in Parquet

2014-12-30 Thread Alessandro Baretta
(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) Any input on how to address this issue would be welcome. Alex On Tue, Dec 30, 2014 at 5:21 PM, Alessandro Baretta alexbare...@gmail.com

Re: Unsupported Catalyst types in Parquet

2014-12-30 Thread Alessandro Baretta
I think I might have figured it out myself. Here's a pull request for you guys to check out: https://github.com/apache/spark/pull/3855 I successfully tested this code on my cluster. On Tue, Dec 30, 2014 at 11:01 PM, Alessandro Baretta alexbare...@gmail.com wrote: Here's a more meaningful

RE: Unsupported Catalyst types in Parquet

2014-12-29 Thread Alessandro Baretta
wrote: Hi Alex, I'll create JIRA SPARK-4985 for date type support in parquet, and SPARK-4987 for timestamp type support. For decimal type, I think we only support decimals that fit in a long. Thanks, Daoyuan -Original Message- From: Alessandro Baretta [mailto:alexbare...@gmail.com

Re: Unsupported Catalyst types in Parquet

2014-12-29 Thread Alessandro Baretta
the plan is there to make sure that whatever we do is going to be compatible long term. Michael On Mon, Dec 29, 2014 at 8:13 AM, Alessandro Baretta alexbare...@gmail.com wrote: Daoyuan, Thanks for creating the jiras. I need these features by... last week, so I'd be happy to take care

SQLContext is Serializable, SparkContext is not

2014-12-26 Thread Alessandro Baretta
How, O how can this be? Doesn't the SQLContext hold a reference to the SparkContext? Alex
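
A plausible explanation, sketched below: a Serializable class can hold a SparkContext as long as the field is marked @transient, so serializing the enclosing object simply drops the reference. The class here is illustrative, not Spark's actual declaration:

    import org.apache.spark.SparkContext

    // Serialization skips @transient fields, so holding a SparkContext
    // reference does not make the enclosing object unserializable.
    class MyQueryContext(@transient val sparkContext: SparkContext)
      extends Serializable {
      // Driver-side methods may use sparkContext; any closure shipped to
      // executors must not touch it (it is null after deserialization).
    }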

Assembly jar file name does not match profile selection

2014-12-26 Thread Alessandro Baretta
I am building spark with sbt off of branch 1.2. I'm using the following command: sbt/sbt -Pyarn -Phadoop-2.3 assembly (http://spark.apache.org/docs/latest/building-spark.html#building-with-sbt) Although the jar file I obtain does contain the proper version of the hadoop libraries (v. 2.4), the

Re: Assembly jar file name does not match profile selection

2014-12-26 Thread Alessandro Baretta
PM, Alessandro Baretta alexbare...@gmail.com wrote: I am building spark with sbt off of branch 1.2. I'm using the following command: sbt/sbt -Pyarn -Phadoop-2.3 assembly ( http://spark.apache.org/docs/latest/building-spark.html#building-with-sbt ) Although the jar file I obtain does

Unsupported Catalyst types in Parquet

2014-12-26 Thread Alessandro Baretta
Michael, I'm having trouble storing my SchemaRDDs in Parquet format with SparkSQL, due to my RDDs having DateType and DecimalType fields. What would it take to add Parquet support for these Catalyst types? Are there any other Catalyst types for which there is no Parquet support? Alex
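
Until native support lands, one workaround sketch (assuming a spark-shell-style session with sqlContext defined and a registered table "events"; table and column names are hypothetical) is to cast the unsupported columns to STRING before writing:

    // Rewrite unsupported columns (DATE, wide DECIMAL) as strings,
    // then save; readers must parse them back.
    val safe = sqlContext.sql(
      "SELECT CAST(event_date AS STRING) AS event_date, amount FROM events")
    safe.saveAsParquetFile("/tmp/events.parquet")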

More general submitJob API

2014-12-22 Thread Alessandro Baretta
Fellow Sparkers, I'm rather puzzled at the submitJob API. I can't quite figure out how it is supposed to be used. Is there any more documentation about it? Also, is there any simpler way to multiplex jobs on the cluster, such as starting multiple computations in as many threads in the driver and
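
On the second question, a common pattern independent of submitJob: SparkContext is thread-safe, so a driver can launch several jobs concurrently with ordinary futures. A minimal sketch, assuming nothing beyond the core API:

    import org.apache.spark.{SparkConf, SparkContext}
    import scala.concurrent.{Await, Future}
    import scala.concurrent.ExecutionContext.Implicits.global
    import scala.concurrent.duration.Duration

    val sc = new SparkContext(new SparkConf().setAppName("multiplex"))
    val rddA = sc.parallelize(1 to 1000000)
    val rddB = sc.parallelize(1 to 1000000)

    // Each future submits an independent job; the scheduler
    // interleaves their stages across the cluster.
    val fa = Future { rddA.map(_ * 2).count() }
    val fb = Future { rddB.filter(_ % 3 == 0).count() }

    val total = Await.result(fa.zip(fb).map { case (a, b) => a + b }, Duration.Inf)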

Re: More general submitJob API

2014-12-22 Thread Alessandro Baretta
On Mon, Dec 22, 2014 at 1:32 PM, Alessandro Baretta alexbare...@gmail.com wrote: Fellow Sparkers, I'm rather puzzled at the submitJob API. I can't quite figure out how it is supposed to be used. Is there any more documentation about it? Also, is there any simpler way to multiplex jobs

What RDD transformations trigger computations?

2014-12-18 Thread Alessandro Baretta
All, I noticed that while some operations that return RDDs are very cheap, such as map and flatMap, some are quite expensive, such as union and groupByKey. I'm referring here to the cost of constructing the RDD Scala value, not the cost of collecting the values contained in the RDD. This does not
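
A quick way to separate construction cost from execution cost is to time the two phases independently. A measurement sketch, assuming a spark-shell session where sc is already defined:

    def time[T](label: String)(body: => T): T = {
      val t0 = System.nanoTime()
      val result = body
      println(f"$label took ${(System.nanoTime() - t0) / 1e9}%.3f s")
      result
    }

    val pairs = sc.parallelize(1 to 1000000).map(i => (i % 100, i))
    val grouped = time("construct groupByKey") { pairs.groupByKey() }  // graph only
    time("run count") { grouped.count() }                              // actual job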

Re: What RDD transformations trigger computations?

2014-12-18 Thread Alessandro Baretta
On December 18, 2014 at 1:04:54 AM, Alessandro Baretta ( alexbare...@gmail.com) wrote: All, I noticed that while some operations that return RDDs are very cheap, such as map and flatMap, some are quite expensive, such as union and groupByKey. I'm referring here to the cost of constructing the RDD

Re: Spark Shell slowness on Google Cloud

2014-12-17 Thread Alessandro Baretta
, Dec 17, 2014 at 11:24 PM, Alessandro Baretta alexbare...@gmail.com wrote: Well, what do you suggest I run to test this? But more importantly, what information would this give me? On Wed, Dec 17, 2014 at 10:46 PM, Denny Lee denny.g@gmail.com wrote: Oh, it makes sense of gsutil scans

Where are the docs for the SparkSQL DataTypes?

2014-12-11 Thread Alessandro Baretta
Michael & other Spark SQL junkies, As I read through the Spark API docs, in particular those for the org.apache.spark.sql package, I can't seem to find details about the Scala classes representing the various SparkSQL DataTypes, for instance DecimalType. I find DataType classes in

Re: Where are the docs for the SparkSQL DataTypes?

2014-12-11 Thread Alessandro Baretta
Hao -Original Message- From: Alessandro Baretta [mailto:alexbare...@gmail.com] Sent: Friday, December 12, 2014 6:37 AM To: Michael Armbrust; dev@spark.apache.org Subject: Where are the docs for the SparkSQL DataTypes? Michael & other Spark SQL junkies, As I read through the Spark

SparkSQL not honoring schema

2014-12-10 Thread Alessandro Baretta
Hello, I defined a SchemaRDD by applying a hand-crafted StructType to an RDD. Some of the Rows in the RDD are malformed--that is, they do not conform to the schema defined by the StructType. When running a select statement on this SchemaRDD I would expect SparkSQL to either reject the malformed
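
For context, a sketch of the pattern in question, based on the Spark 1.2-era programmatic-schema API: applySchema attaches a StructType to an RDD[Row] without validating rows against it, which is why malformed rows surface only when the data is touched. Names and values here are illustrative:

    import org.apache.spark.sql._

    val sqlContext = new SQLContext(sc)  // assumes an existing sc

    val schema = StructType(Seq(
      StructField("name", StringType, nullable = false),
      StructField("age",  IntegerType, nullable = false)))

    // applySchema does no validation: the second row below violates the
    // schema but is accepted, and only misbehaves downstream.
    val rowRDD = sc.parallelize(Seq(Row("alice", 30), Row("bob", "oops")))
    val people = sqlContext.applySchema(rowRDD, schema)
    people.registerTempTable("people")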

Re: SparkSQL not honoring schema

2014-12-10 Thread Alessandro Baretta
if you try to manipulate the data, but otherwise it will pass it through. I have written some debugging code (developer API, not guaranteed to be stable) though that you can use:

    import org.apache.spark.sql.execution.debug._
    schemaRDD.typeCheck()

On Wed, Dec 10, 2014 at 6:19 PM, Alessandro

Re: Quantile regression in tree models

2014-11-18 Thread Alessandro Baretta
will try to search for JIRAs or create new ones and update this thread. -Manish On Monday, November 17, 2014, Alessandro Baretta alexbare...@gmail.com wrote: Manish, Thanks for pointing me to the relevant docs. It is unfortunate that absolute error is not supported yet. I can't seem to find

Re: Quantile regression in tree models

2014-11-17 Thread Alessandro Baretta
and deviance as loss functions but I don't think anyone is planning to work on it yet. :-) -Manish On Mon, Nov 17, 2014 at 11:11 AM, Alessandro Baretta alexbare...@gmail.com wrote: I see that, as of v. 1.1, MLLib supports regression and classification tree models. I assume this means
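
For reference, the regression entry point under discussion, as a hedged sketch of the MLlib 1.1-era API; the file path and parameter values are illustrative:

    import org.apache.spark.mllib.tree.DecisionTree
    import org.apache.spark.mllib.util.MLUtils

    val data = MLUtils.loadLibSVMFile(sc, "data/sample_libsvm_data.txt")

    // "variance" (squared error) is the only regression impurity offered;
    // the absolute-error / quantile losses discussed above are not available.
    val model = DecisionTree.trainRegressor(
      data,
      categoricalFeaturesInfo = Map[Int, Int](),
      impurity = "variance",
      maxDepth = 5,
      maxBins = 32)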

Build fails on master (f90ad5d)

2014-11-04 Thread Alessandro Baretta
Fellow Sparkers, I am new here and still trying to learn to crawl. Please, bear with me. I just pulled f90ad5d from https://github.com/apache/spark.git and am running the compile command in the sbt shell. This is the error I'm seeing: [error]

Re: Build fails on master (f90ad5d)

2014-11-04 Thread Alessandro Baretta
use? Cheers On Tue, Nov 4, 2014 at 2:08 PM, Alessandro Baretta alexbare...@gmail.com wrote: Fellow Sparkers, I am new here and still trying to learn to crawl. Please, bear with me. I just pulled f90ad5d from https://github.com/apache/spark.git and am running the compile command

Re: Build fails on master (f90ad5d)

2014-11-04 Thread Alessandro Baretta
it because it's faster than Maven. Nick On Tue, Nov 4, 2014 at 8:03 PM, Alessandro Baretta alexbare...@gmail.com wrote: Nicholas, Yes, I saw them, but they refer to maven, and I'm under the impression that sbt is the preferred way of building spark. Is maven indeed the right way? Anyway, as per

Spark consulting

2014-10-31 Thread Alessandro Baretta
Hello, Is anyone open to do some consulting work on Spark in San Mateo? Thanks. Alex