Spark 1.4 - memory bloat in group by/aggregate???

2015-06-26 Thread Manoj Samel
Hi, - Spark 1.4 on a single node machine. Run spark-shell - Reading from a Parquet file with a bunch of text columns and a couple of amounts in decimal(14,4). On-disk size of the file is 376M. It has ~100 million rows - rdd1 = sqlContext.read.parquet - rdd1.cache - group_by_df
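
A minimal spark-shell sketch of the read/cache/group-by pattern described above (Spark 1.4 DataFrame API). The path and column names are placeholders, since the message is truncated; only the shape of the test comes from the post.

    import org.apache.spark.sql.functions.sum
    // spark-shell already provides sqlContext; path and columns are hypothetical.
    val df = sqlContext.read.parquet("/data/example.parquet")
    df.cache()                                  // mark the DataFrame for in-memory caching
    df.count()                                  // first action materializes the cache
    val grouped = df.groupBy("account").agg(sum("amount"))   // aggregate a decimal(14,4) column
    grouped.show()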

Spark 1.3 saveAsTextFile with codec gives error - works with Spark 1.2

2015-04-15 Thread Manoj Samel
Env - Spark 1.3, Hadoop 2.3, Kerberos. xx.saveAsTextFile(path, codec) gives the following trace. The same works with Spark 1.2 in the same environment. val codec = classOf[some codec class] val a = sc.textFile(/some_hdfs_file) a.saveAsTextFile(/some_other_hdfs_file, codec) fails with following trace in Spark
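
For reference, a concrete spark-shell version of the fragment above, assuming GzipCodec as the codec (the original message elides the codec class) and placeholder HDFS paths:

    import org.apache.hadoop.io.compress.GzipCodec
    // spark-shell provides sc; paths stand in for the elided HDFS paths above.
    val codec = classOf[GzipCodec]
    val a = sc.textFile("/some_hdfs_file")
    a.saveAsTextFile("/some_other_hdfs_file", codec)   // the call that fails on Spark 1.3 + Kerberos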

spark-assembly-1.3.0-hadoop2.3.0.jar has unsigned entries - org/apache/spark/SparkHadoopWriter$.class

2015-04-14 Thread Manoj Samel
With Spark 1.3, xx.saveAsTextFile(path, codec) gives the following trace. The same works with Spark 1.2. Config is CDH 5.3.0 (Hadoop 2.3) with Kerberos. 15/04/14 18:06:15 INFO scheduler.TaskSetManager: Lost task 1.3 in stage 2.0 (TID 17) on executor node1078.svc.devpg.pdx.wd: java.lang.SecurityException

Re: How to specify the port for AM Actor ...

2015-04-01 Thread Manoj Samel
Filed https://issues.apache.org/jira/browse/SPARK-6653 On Sun, Mar 29, 2015 at 8:18 PM, Shixiong Zhu zsxw...@gmail.com wrote: LGTM. Could you open a JIRA and send a PR? Thanks. Best Regards, Shixiong Zhu 2015-03-28 7:14 GMT+08:00 Manoj Samel manojsamelt...@gmail.com: I looked @ the 1.3.0

Spark 1.3 Source - Github and source tar do not seem to match

2015-03-27 Thread Manoj Samel
While looking into an issue, I noticed that the source displayed on the GitHub site does not match the downloaded tar for 1.3. Thoughts?

Re: How to specify the port for AM Actor ...

2015-03-27 Thread Manoj Samel
change is needed? On Wed, Mar 25, 2015 at 4:44 PM, Shixiong Zhu zsxw...@gmail.com wrote: There is no configuration for it now. Best Regards, Shixiong Zhu 2015-03-26 7:13 GMT+08:00 Manoj Samel manojsamelt...@gmail.com: There may be firewall rules limiting the ports between the host running Spark

How to specify the port for AM Actor ...

2015-03-25 Thread Manoj Samel
Spark 1.3, Hadoop 2.5, Kerberos. When running spark-shell in yarn client mode, it shows the following message with a random port every time (44071 in the example below). Is there a way to pin that port to a specific value? It does not seem to be part of the ports specified in
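
The JIRA filed in the April follow-up above (SPARK-6653) introduced a spark.yarn.am.port property for this. A hedged sketch of setting it from a standalone driver, assuming Spark 1.4 or later where the property is available; spark-shell users would pass the same property with --conf:

    import org.apache.spark.{SparkConf, SparkContext}
    // Assumes Spark 1.4+, where SPARK-6653 added spark.yarn.am.port.
    val conf = new SparkConf()
      .setAppName("fixed-am-port-example")      // hypothetical app name
      .set("spark.yarn.am.port", "44071")       // pin the AM actor system port (example value)
    val sc = new SparkContext(conf)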

Re: How to specify the port for AM Actor ...

2015-03-25 Thread Manoj Samel
conflicts, since multiple AMs can run on the same machine. Why do you need a fixed port? Best Regards, Shixiong Zhu 2015-03-26 6:49 GMT+08:00 Manoj Samel manojsamelt...@gmail.com: Spark 1.3, Hadoop 2.5, Kerberos. When running spark-shell in yarn client mode, it shows the following message

Re: Invalid ContainerId ... Caused by: java.lang.NumberFormatException: For input string: e04

2015-03-24 Thread Manoj Samel
PM, Manoj Samel manojsamelt...@gmail.com wrote: When I run any query, it gives java.lang.NoSuchMethodError: com.google.common.hash.HashFunction.hashInt(I)Lcom/google/common/hash/HashCode; Are you running a custom-compiled Spark by any chance? Specifically, one you built with sbt? That would

Hadoop 2.5 not listed in Spark 1.4 build page

2015-03-24 Thread Manoj Samel
http://spark.apache.org/docs/latest/building-spark.html#packaging-without-hadoop-dependencies-for-yarn does not list Hadoop 2.5 in the Hadoop version table etc. I assume it is still OK to compile with -Pyarn -Phadoop-2.5 for use with Hadoop 2.5 (CDH 5.3.2). Thanks,

Re: Invalid ContainerId ... Caused by: java.lang.NumberFormatException: For input string: e04

2015-03-24 Thread Manoj Samel
of which addresses this parsing trouble). You do not need to recompile Spark, just swap the Hadoop libraries on its classpath for those of the CDH server version (overwrite from parcels, etc.). On Wed, Mar 25, 2015 at 1:06 AM, Manoj Samel manojsamelt...@gmail.com wrote: I recompiled Spark

Re: Spark 1.3 Dynamic Allocation - Requesting 0 new executor(s) because tasks are backlogged

2015-03-23 Thread Manoj Samel
) at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) On Mon, Mar 23, 2015 at 2:25 PM, Marcelo Vanzin van...@cloudera.com wrote: On Mon, Mar 23, 2015 at 2:15 PM, Manoj Samel manojsamelt...@gmail.com wrote: Found the issue with the above error - the setting for spark_shuffle was incomplete. Now

Re: Spark 1.3 Dynamic Allocation - Requesting 0 new executor(s) because tasks are backlogged

2015-03-23 Thread Manoj Samel
AM, Ted Yu yuzhih...@gmail.com wrote: bq. Requesting 1 new executor(s) because tasks are backlogged 1 executor was requested. Which Hadoop release are you using? Can you check the resource manager log to see if there is some clue? Thanks On Fri, Mar 20, 2015 at 4:17 PM, Manoj Samel

Invalid ContainerId ... Caused by: java.lang.NumberFormatException: For input string: e04

2015-03-23 Thread Manoj Samel
Spark 1.3, CDH 5.3.2, Kerberos. Setup works fine with the base configuration; spark-shell can be used in yarn client mode etc. When the work-preserving recovery feature is enabled via http://www.cloudera.com/content/cloudera/en/documentation/core/latest/topics/admin_ha_yarn_work_preserving_recovery.html, the

Spark 1.3 Dynamic Allocation - Requesting 0 new executor(s) because tasks are backlogged

2015-03-20 Thread Manoj Samel
Hi, Running Spark 1.3 with secured Hadoop. Spark-shell in YARN client mode runs without issue when not using dynamic allocation. When dynamic allocation is turned on, the shell comes up but the same SQL etc. causes it to loop. spark.dynamicAllocation.enabled=true
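
An illustrative configuration sketch for dynamic allocation on YARN; the values are placeholders, not from the thread. Dynamic allocation also requires the external shuffle service - the yarn spark_shuffle aux-service whose incomplete setup is mentioned in a reply above - to be configured on every NodeManager.

    import org.apache.spark.{SparkConf, SparkContext}
    // Hypothetical standalone driver; spark-shell users would pass the same
    // properties via --conf or spark-defaults.conf.
    val conf = new SparkConf()
      .setAppName("dynamic-allocation-example")
      .set("spark.dynamicAllocation.enabled", "true")
      .set("spark.shuffle.service.enabled", "true")       // needed so executors can be removed safely
      .set("spark.dynamicAllocation.minExecutors", "1")   // example bounds
      .set("spark.dynamicAllocation.maxExecutors", "10")
    val sc = new SparkContext(conf)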

Re: Spark 1.3 Dynamic Allocation - Requesting 0 new executor(s) because tasks are backlogged

2015-03-20 Thread Manoj Samel
Forgot to add - the cluster is idle otherwise so there should be no resource issues. Also the configuration works when not using Dynamic allocation. On Fri, Mar 20, 2015 at 4:15 PM, Manoj Samel manojsamelt...@gmail.com wrote: Hi, Running Spark 1.3 with secured Hadoop. Spark-shell with Yarn

Dataframe v/s SparkSQL

2015-03-02 Thread Manoj Samel
Is it correct to say that the Spark DataFrame API is implemented using the same execution engine as SparkSQL? In other words, while the DataFrame API is different from SparkSQL, the runtime performance of equivalent constructs in DataFrame and SparkSQL should be the same. So one should be able to choose whichever
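
A small Scala sketch of the equivalence being asked about (Spark 1.3-era API, placeholder path and column names): both forms go through the same Catalyst planner, which can be checked with explain().

    import org.apache.spark.sql.functions.sum
    // spark-shell provides sqlContext; parquetFile is the 1.3-era read call.
    val df = sqlContext.parquetFile("/data/example.parquet")
    df.registerTempTable("t")
    val viaSql = sqlContext.sql("SELECT a, SUM(b) FROM t GROUP BY a")   // SparkSQL form
    val viaDf  = df.groupBy("a").agg(sum("b"))                          // DataFrame form
    viaSql.explain()   // compare the physical plans of the two queries
    viaDf.explain()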

New ColumnType For Decimal Caching

2015-02-13 Thread Manoj Samel
/scala/org/apache/spark/sql/columnar/ColumnType.scala . PRs welcome :) On Mon, Feb 9, 2015 at 3:01 PM, Manoj Samel manojsamelt...@gmail.com wrote: Hi Michael, As a test, I have the same data loaded as another parquet file - except with the two decimal(14,4) columns replaced by double. With this, the on-disk

Is there a separate mailing list for Spark Developers ?

2015-02-12 Thread Manoj Samel
d...@spark.apache.org http://apache-spark-developers-list.1001551.n3.nabble.com/ mentioned on http://spark.apache.org/community.html seems to be bouncing. Is there another one ?

Re: SQL group by on Parquet table slower when table cached

2015-02-09 Thread Manoj Samel
are rebuilding column buffers in addition to reading the data off of the disk. On Fri, Feb 6, 2015 at 4:39 PM, Manoj Samel manojsamelt...@gmail.com wrote: Spark 1.2 Data stored in parquet table (large number of rows) Test 1 select a, sum(b), sum(c) from table Test sqlContext.cacheTable

Re: SQL group by on Parquet table slower when table cached

2015-02-09 Thread Manoj Samel
in the in-memory columnar storage, so you are likely paying an expensive serialization cost there. On Mon, Feb 9, 2015 at 2:18 PM, Manoj Samel manojsamelt...@gmail.com wrote: Flat data of types String, Int and a couple of decimal(14,4) On Mon, Feb 9, 2015 at 1:58 PM, Michael Armbrust mich

Re: SQL group by on Parquet table slower when table cached

2015-02-09 Thread Manoj Samel
to store the in-memory decimal in some form of long with decoration? For the immediate future, is there any hook that we can use to provide custom caching / processing for the decimal type in the RDD so other semantics do not change? Thanks, On Mon, Feb 9, 2015 at 2:41 PM, Manoj Samel manojsamelt

SQL group by on Parquet table slower when table cached

2015-02-06 Thread Manoj Samel
Spark 1.2. Data stored in a parquet table (large number of rows). Test 1: select a, sum(b), sum(c) from table. Test sqlContext.cacheTable() select a, sum(b), sum(c) from table - seed cache. First time is slow since it is loading the cache? select a, sum(b), sum(c) from table - the second time it should be faster
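
A spark-shell sketch of the comparison being described, using Spark 1.2-era calls; the path, table name, and columns are placeholders:

    val data = sqlContext.parquetFile("/data/example.parquet")
    data.registerTempTable("t")
    sqlContext.sql("SELECT a, SUM(b), SUM(c) FROM t GROUP BY a").collect()  // Test 1: reads off disk
    sqlContext.cacheTable("t")
    sqlContext.sql("SELECT a, SUM(b), SUM(c) FROM t GROUP BY a").collect()  // seeds the in-memory columnar cache
    sqlContext.sql("SELECT a, SUM(b), SUM(c) FROM t GROUP BY a").collect()  // expected to be faster from cache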

Large # of tasks in groupby on single table

2015-02-04 Thread Manoj Samel
Spark 1.2. Data is read from parquet with 2 partitions and is cached as a table with 2 partitions. Verified in the UI that it shows an RDD with 2 partitions and that it is fully cached in memory. Cached data contains columns a, b, c. Column a has ~150 distinct values. Next, run SQL on this table as select a, sum(b),
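
The large task count typically comes from spark.sql.shuffle.partitions (default 200), which sets the number of post-shuffle partitions for SQL aggregations; this appears to be the setting referred to in the follow-up reply. A hedged sketch with a placeholder table name:

    // Match the number of shuffle partitions to the 2 input partitions described above.
    sqlContext.setConf("spark.sql.shuffle.partitions", "2")
    sqlContext.sql("SELECT a, SUM(b), SUM(c) FROM t GROUP BY a").collect()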

Re: Large # of tasks in groupby on single table

2015-02-04 Thread Manoj Samel
Thanks On Wed, Feb 4, 2015 at 4:09 PM, Manoj Samel manojsamelt...@gmail.com wrote: Awesome! By setting this, I could minimize the collect overhead, e.g. by setting it to the # of partitions of the RDD. Two questions: 1. I had looked for such an option in http://spark.apache.org/docs/latest

Re: Large # of tasks in groupby on single table

2015-02-04 Thread Manoj Samel
, 2015 at 12:38 PM, Manoj Samel manojsamelt...@gmail.com wrote: Spark 1.2 Data is read from parquet with 2 partitions and is cached as table with 2 partitions. Verified in UI that it shows RDD with 2 partitions it is fully cached in memory Cached data contains column a, b, c. Column a has ~150

Re: Error in saving schemaRDD with Decimal as Parquet

2015-02-03 Thread Manoj Samel
Hi, Any thoughts ? Thanks, On Sun, Feb 1, 2015 at 12:26 PM, Manoj Samel manojsamelt...@gmail.com wrote: Spark 1.2 SchemaRDD has schema with decimal columns created like x1 = new StructField(a, DecimalType(14,4), true) x2 = new StructField(b, DecimalType(14,4), true) Registering as SQL

Error in saving schemaRDD with Decimal as Parquet

2015-02-01 Thread Manoj Samel
Spark 1.2. SchemaRDD has a schema with decimal columns created like x1 = new StructField(a, DecimalType(14,4), true) x2 = new StructField(b, DecimalType(14,4), true) Registering as a SQL temp table and doing SQL queries on these columns, including SUM etc., works fine, so the schema Decimal does
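
A compilable version of the schema fragments above, assuming the Spark 1.2 type locations (the catalyst.types package that appears in the related error message further down; these types moved to org.apache.spark.sql.types in 1.3). Column names are kept from the post:

    import org.apache.spark.sql.catalyst.types._   // Spark 1.2 location of StructField/DecimalType
    val x1 = StructField("a", DecimalType(14, 4), nullable = true)
    val x2 = StructField("b", DecimalType(14, 4), nullable = true)
    val schema = StructType(Seq(x1, x2))           // schema used to build the SchemaRDD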

Re: Error in saving schemaRDD with Decimal as Parquet

2015-02-01 Thread Manoj Samel
for decimal So it seems schemaRDD.coalesce returns an RDD whose schema does not match the source RDD, in that the decimal type seems to get changed. Any thoughts? Is this a bug? Thanks, On Sun, Feb 1, 2015 at 12:26 PM, Manoj Samel manojsamelt...@gmail.com wrote: Spark 1.2 SchemaRDD has schema

Why is DecimalType separate from DataType ?

2015-01-30 Thread Manoj Samel
Spark 1.2. While building a schemaRDD using StructType, xxx = new StructField(credit_amount, DecimalType, true) gives the error: type mismatch; found : org.apache.spark.sql.catalyst.types.DecimalType.type required: org.apache.spark.sql.catalyst.types.DataType From
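
The mismatch arises because DecimalType, unlike StringType or IntegerType, is parameterized by precision and scale, so the companion object itself is not a DataType instance; one has to be constructed. A minimal sketch, reusing the precision/scale from the author's other threads:

    import org.apache.spark.sql.catalyst.types._   // Spark 1.2 package, per the error above
    // Pass a constructed DecimalType instance rather than the companion object.
    val field = StructField("credit_amount", DecimalType(14, 4), nullable = true)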

schemaRDD.saveAsParquetFile creates large number of small parquet files ...

2015-01-29 Thread Manoj Samel
Spark 1.2 on Hadoop 2.3. Read one big CSV file, create a schemaRDD on it and saveAsParquetFile. It creates a large number of small (~1MB) parquet part-x- files. Any way to control this so that a smaller number of large files is created? Thanks,
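
One common workaround (a sketch, not necessarily the approach taken on this thread): reduce the number of partitions before writing, since saveAsParquetFile emits one part file per partition. The paths and partition count are placeholders. Note that a later thread in this archive reports coalesce altering a decimal schema in 1.2, so the result should be verified.

    // spark-shell provides sqlContext; substitute the schemaRDD built from the CSV file.
    val schemaRDD = sqlContext.parquetFile("/data/in.parquet")
    schemaRDD.coalesce(8).saveAsParquetFile("/data/out.parquet")   // 8 part files instead of many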

SparkSQL Performance Tuning Options

2015-01-27 Thread Manoj Samel
Spark 1.2, no Hive; prefer not to use HiveContext to avoid metastore_db. The use case is a Spark YARN app that will start and serve as a query server for multiple users, i.e. always up and running. At startup, there is an option to cache data and also pre-compute some result sets, hash maps, etc. that would be

Re: spark 1.2 - Writing parquet fails for timestamp with Unsupported datatype TimestampType

2015-01-26 Thread Manoj Samel
Awesome ! That would be great !! On Mon, Jan 26, 2015 at 3:18 PM, Michael Armbrust mich...@databricks.com wrote: I'm aiming for 1.3. On Mon, Jan 26, 2015 at 3:05 PM, Manoj Samel manojsamelt...@gmail.com wrote: Thanks Michael. I am sure there have been many requests for this support. Any

Re: spark 1.2 - Writing parquet fails for timestamp with Unsupported datatype TimestampType

2015-01-26 Thread Manoj Samel
precision. However, there is a PR to add support using Parquet's INT96 type: https://github.com/apache/spark/pull/3820 On Fri, Jan 23, 2015 at 12:08 PM, Manoj Samel manojsamelt...@gmail.com wrote: Looking further at the trace and ParquetTypes.scala, it seems there is no support for Timestamp

spark 1.2 - Writing parquet fails for timestamp with Unsupported datatype TimestampType

2015-01-23 Thread Manoj Samel
Using Spark 1.2. Read a CSV file, apply a schema to convert to a SchemaRDD, and then schemaRdd.saveAsParquetFile. If the schema includes TimestampType, it gives the following trace when doing the save: Exception in thread main java.lang.RuntimeException: Unsupported datatype TimestampType at
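
A sketch of one possible interim workaround under Spark 1.2 (until the INT96 support referenced in the replies landed): store the timestamp as epoch milliseconds in a LongType field. Paths, column names, and the assumption that the CSV carries yyyy-MM-dd HH:mm:ss strings are all illustrative.

    import org.apache.spark.sql.catalyst.types._          // Spark 1.2 type locations
    import org.apache.spark.sql.catalyst.expressions.Row
    import java.sql.Timestamp
    val schema = StructType(Seq(
      StructField("id", StringType, nullable = true),
      StructField("ts_millis", LongType, nullable = true)))   // LongType instead of TimestampType
    val rows = sc.textFile("/data/events.csv")
      .map(_.split(","))
      .map(r => Row(r(0), Timestamp.valueOf(r(1).trim).getTime))   // assumes "yyyy-MM-dd HH:mm:ss"
    sqlContext.applySchema(rows, schema).saveAsParquetFile("/data/events.parquet")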

Error when running SparkPi on Secure HA Hadoop cluster

2015-01-15 Thread Manoj Samel
Hi, Setup is as follows: Hadoop cluster 2.3.0 (CDH 5.0) - NameNode HA - ResourceManager HA - secured with Kerberos. Spark 1.2. Run SparkPi as follows - conf/spark-defaults.conf has the following entries: spark.yarn.queue myqueue spark.yarn.access.namenodes hdfs://namespace (remember this is the namenode

Running spark 1.2 on Hadoop + Kerberos

2015-01-08 Thread Manoj Samel
Hi, For running Spark 1.2 on a Hadoop cluster with Kerberos, what Spark configurations are required? Using an existing keytab, can any examples be submitted to the secured cluster? How? Thanks,

Re: Running spark 1.2 on Hadoop + Kerberos

2015-01-08 Thread Manoj Samel
logged in (i.e. you've run kinit), everything should just work. You can run klist to make sure you're logged in. On Thu, Jan 8, 2015 at 3:49 PM, Manoj Samel manojsamelt...@gmail.com wrote: Hi, For running spark 1.2 on Hadoop cluster with Kerberos, what spark configurations are required

Cannot see RDDs in Spark UI

2015-01-06 Thread Manoj Samel
Hi, I create a bunch of RDDs, including schema RDDs. When I run the program and go to the UI on xxx:4040, the storage tab does not show any RDDs. Spark version is 1.1.1 (Hadoop 2.3). Any thoughts? Thanks,
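
For context, the Storage tab only lists RDDs that have been persisted and then materialized by an action; a minimal sketch (placeholder path and name):

    val rdd = sc.textFile("/data/example.txt")
    rdd.setName("example")   // optional: gives the Storage tab entry a readable name
    rdd.cache()              // mark the RDD for storage
    rdd.count()              // an action must run before the entry appears in the UI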

Sharing sqlContext between Akka router and routee actors ...

2014-12-18 Thread Manoj Samel
Hi, An Akka router creates a sqlContext and creates a bunch of routee actors with the sqlContext as a parameter. The actors then execute queries on that sqlContext. Would this pattern be an issue? Is there any other way sparkContext etc. should be shared cleanly between Akka routers/routees? Thanks,

Re: Spark Server - How to implement

2014-12-12 Thread Manoj Samel
: On Thu, Dec 11, 2014 at 5:33 PM, Manoj Samel manojsamelt...@gmail.com wrote: Hi, If spark based services are to be exposed as a continuously available server, what are the options? * The API exposed to client will be proprietary and fine grained (RPC style ..), not a Job level API

Spark Server - How to implement

2014-12-11 Thread Manoj Samel
Hi, If Spark-based services are to be exposed as a continuously available server, what are the options? * The API exposed to the client will be proprietary and fine-grained (RPC style ..), not a job-level API * The client API need not be SQL, so the Thrift JDBC server does not seem to be an option ..

Spark 1.1.1 SQLContext.jsonFile dumps trace if JSON has newlines ...

2014-12-10 Thread Manoj Samel
I am using SQLContext.jsonFile. If a valid JSON file contains newlines, Spark 1.1.1 dumps the trace below. If the JSON is read as one line, it works fine. Is this known? 14/12/10 11:44:02 ERROR Executor: Exception in task 0.0 in stage 14.0 (TID 28) com.fasterxml.jackson.core.JsonParseException:
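
jsonFile expects one complete JSON object per line, which is why multi-line JSON fails. One hedged workaround for inputs small enough to be read file-at-a-time: collapse each file onto a single line and hand the result to jsonRDD (directory path is a placeholder, and each file is assumed to hold one JSON object):

    // Each wholeTextFiles record is (path, full file contents).
    val oneLinePerFile = sc.wholeTextFiles("/data/multiline_json/")
      .map { case (_, content) => content.replaceAll("\n", " ") }
    val jsonTable = sqlContext.jsonRDD(oneLinePerFile)   // SchemaRDD in Spark 1.1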

Can HiveContext be used without using Hive?

2014-12-09 Thread Manoj Samel
From the 1.1.1 documentation, it seems one can use HiveContext instead of SQLContext without having a Hive installation. The benefit is a richer SQL dialect. Is my understanding correct? Thanks

Spark SQL - Any time line to move beyond Alpha version ?

2014-11-24 Thread Manoj Samel
Is there any timeline for when Spark SQL will go beyond the alpha version? Thanks,

Re: Spark resilience

2014-04-15 Thread Manoj Samel
checkpointing into a globally visible storage system (e.g., HDFS), which, for example, Spark Streaming already does. Currently, this feature is not supported in YARN or Mesos fine-grained mode. On Mon, Apr 14, 2014 at 2:08 PM, Manoj Samel manojsamelt...@gmail.comwrote: Could you please elaborate

Re: Spark resilience

2014-04-14 Thread Manoj Samel
Could you please elaborate on how drivers can be restarted automatically? Thanks, On Mon, Apr 14, 2014 at 10:30 AM, Aaron Davidson ilike...@gmail.com wrote: Master and slave are somewhat overloaded terms in the Spark ecosystem (see the glossary:

Re: Error in SparkSQL Example

2014-03-31 Thread Manoj Samel
'. Perhaps there is a clearer way to indicate this. As you have realized, using the full line from the first example will allow you to run the rest of them. On Sun, Mar 30, 2014 at 7:31 AM, Manoj Samel manojsamelt...@gmail.comwrote: Hi, On http://people.apache.org/~pwendell/catalyst

Error in SparkSQL Example

2014-03-30 Thread Manoj Samel
Hi, On http://people.apache.org/~pwendell/catalyst-docs/sql-programming-guide.html, I am trying to run the code under Writing Language-Integrated Relational Queries (I have the 1.0.0 snapshot). I am running into an error on val people: RDD[Person] // An RDD of case class objects, from the first example.

Shouldn't the UNION of SchemaRDDs produce SchemaRDD ?

2014-03-30 Thread Manoj Samel
Hi, I am trying SparkSQL based on the example in the docs ... val people = sc.textFile("/data/spark/examples/src/main/resources/people.txt").map(_.split(",")).map(p => Person(p(0), p(1).trim.toInt)) val olderThanTeans = people.where('age > 19) val youngerThanTeans = people.where('age < 13) val
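
On the question in the subject: SchemaRDD defines unionAll, which returns a SchemaRDD and keeps the schema, whereas the inherited RDD.union returns a plain RDD of rows. A minimal sketch continuing the snippet above (same imports and SchemaRDDs assumed):

    // unionAll preserves the SchemaRDD type and its schema.
    val notTeens = olderThanTeans.unionAll(youngerThanTeans)
    notTeens.count()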

SparkSQL where with BigDecimal type gives stacktrace

2014-03-30 Thread Manoj Samel
Hi, If I do a where on BigDecimal, I get a stack trace. Changing BigDecimal to Double works ... scala> case class JournalLine(account: String, credit: BigDecimal, debit: BigDecimal, date: String, company: String, currency: String, costcenter: String, region: String) defined class JournalLine

Re: SparkSQL where with BigDecimal type gives stacktrace

2014-03-30 Thread Manoj Samel
this was not for some reason intentional. On Sun, Mar 30, 2014 at 10:43 AM, smallmonkey...@hotmail.com smallmonkey...@hotmail.com wrote: Can I get the whole operation? Then I can try to locate the error -- smallmonkey...@hotmail.com From: Manoj Samel manojsamelt

groupBy RDD does not have grouping column ?

2014-03-30 Thread Manoj Samel
Hi, If I create a groupBy('a)(Sum('b) as 'foo, Sum('c) as 'bar), then the resulting RDD should have 'a, 'foo and 'bar. The result RDD just shows 'foo and 'bar and is missing 'a. Thoughts? Thanks, Manoj
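
In this early DSL the output schema is exactly the list of aggregate expressions, so the grouping attribute has to be listed there explicitly if it is wanted in the result. A hedged sketch reusing the expressions from the message (people is a placeholder SchemaRDD, with the same imports as the earlier snippets):

    // Listing 'a among the aggregate expressions keeps it in the output schema.
    val result = people.groupBy('a)('a, Sum('b) as 'foo, Sum('c) as 'bar)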