Hi,
- Spark 1.4 on a single-node machine. Run spark-shell
- Reading from a Parquet file with a bunch of text columns and a couple of
amounts in decimal(14,4). On-disk size of the file is 376M. It has ~100
million rows
- rdd1 = sqlContext.read.parquet
- rdd1.cache
- group_by_df
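For concreteness, a minimal sketch of the above sequence (the path and the
column names are hypothetical):

// spark-shell, Spark 1.4
import org.apache.spark.sql.functions._
val df = sqlContext.read.parquet("/data/txns.parquet") // hypothetical path
df.cache()
val grouped = df.groupBy("account").agg(sum("amount1"), sum("amount2"))
grouped.show() // first action materializes the cache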
Env - Spark 1.3 Hadoop 2.3, Kerberos
xx.saveAsTextFile(path, codec) gives the following trace. The same works
with Spark 1.2 in the same environment.
val codec = classOf[some codec class]
val a = sc.textFile("/some_hdfs_file")
a.saveAsTextFile("/some_other_hdfs_file", codec) fails with the following
trace in Spark
With Spark 1.3, xx.saveAsTextFile(path, codec) gives the following trace.
The same works with Spark 1.2.
Config is CDH 5.3.0 (Hadoop 2.3) with Kerberos
15/04/14 18:06:15 INFO scheduler.TaskSetManager: Lost task 1.3 in stage 2.0
(TID 17) on executor node1078.svc.devpg.pdx.wd: java.lang.SecurityException
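For reference, a concrete form of the failing sequence (GzipCodec is only an
example codec; the paths are placeholders as above):

import org.apache.hadoop.io.compress.GzipCodec
val codec = classOf[GzipCodec]
val a = sc.textFile("/some_hdfs_file")
a.saveAsTextFile("/some_other_hdfs_file", codec) // throws the SecurityException on 1.3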
Filed https://issues.apache.org/jira/browse/SPARK-6653
On Sun, Mar 29, 2015 at 8:18 PM, Shixiong Zhu zsxw...@gmail.com wrote:
LGTM. Could you open a JIRA and send a PR? Thanks.
Best Regards,
Shixiong Zhu
2015-03-28 7:14 GMT+08:00 Manoj Samel manojsamelt...@gmail.com:
I looked @ the 1.3.0
While looking into an issue, I noticed that the source displayed on the
GitHub site does not match the downloaded tar for 1.3
Thoughts ?
change is needed?
On Wed, Mar 25, 2015 at 4:44 PM, Shixiong Zhu zsxw...@gmail.com wrote:
There is no configuration for it now.
Best Regards,
Shixiong Zhu
2015-03-26 7:13 GMT+08:00 Manoj Samel manojsamelt...@gmail.com:
There may be firewall rules limiting the ports between hosts running Spark
Spark 1.3, Hadoop 2.5, Kerberos
When running spark-shell in yarn-client mode, it shows the following message
with a random port every time (44071 in the example below). Is there a way to
pin that port to a specific value ? It does not seem to be part of the ports
specified in
conflicts, since multiple AMs can run on
the same machine. Why do you need a fixed port?
Best Regards,
Shixiong Zhu
2015-03-26 6:49 GMT+08:00 Manoj Samel manojsamelt...@gmail.com:
Spark 1.3, Hadoop 2.5, Kerberos
When running spark-shell in yarn-client mode, it shows the following message
PM, Manoj Samel manojsamelt...@gmail.com
wrote:
When I run any query, it gives java.lang.NoSuchMethodError:
com.google.common.hash.HashFunction.hashInt(I)Lcom/google/common/hash/HashCode;
Are you running a custom-compiled Spark by any chance? Specifically,
one you built with sbt? That would
http://spark.apache.org/docs/latest/building-spark.html#packaging-without-hadoop-dependencies-for-yarn
does not list Hadoop 2.5 in the Hadoop version table etc.
I assume it is still OK to compile with -Pyarn -Phadoop-2.5 for use with
Hadoop 2.5 (cdh 5.3.2)
Thanks,
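(For what it's worth, the building-spark page for 1.3 covers Hadoop 2.4+ via
the hadoop-2.4 profile plus an explicit -Dhadoop.version, so presumably
something like the following; the exact CDH version string here is an
assumption:

mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.5.0-cdh5.3.2 -DskipTests clean package
)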
of which addresses this parsing trouble).
You do not need to recompile Spark; just replace the Hadoop libraries on its
classpath with those of the CDH server version (overwrite from parcels,
etc.).
On Wed, Mar 25, 2015 at 1:06 AM, Manoj Samel manojsamelt...@gmail.com
wrote:
I recompiled Spark
)
at
scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
On Mon, Mar 23, 2015 at 2:25 PM, Marcelo Vanzin van...@cloudera.com wrote:
On Mon, Mar 23, 2015 at 2:15 PM, Manoj Samel manojsamelt...@gmail.com
wrote:
Found the issue with the above error - the setting for spark_shuffle was
incomplete.
Now
AM, Ted Yu yuzhih...@gmail.com wrote:
bq. Requesting 1 new executor(s) because tasks are backlogged
1 executor was requested.
Which hadoop release are you using ?
Can you check resource manager log to see if there is some clue ?
Thanks
On Fri, Mar 20, 2015 at 4:17 PM, Manoj Samel
Spark 1.3, CDH 5.3.2, Kerberos
Setup works fine with the base configuration; spark-shell can be used in
yarn-client mode etc.
When the work-preserving recovery feature is enabled via
http://www.cloudera.com/content/cloudera/en/documentation/core/latest/topics/admin_ha_yarn_work_preserving_recovery.html,
the
Hi,
Running Spark 1.3 with secured Hadoop.
Spark-shell with Yarn client mode runs without issue when not using Dynamic
Allocation.
When Dynamic Allocation is turned on, the shell comes up, but the same SQL
etc. causes it to loop.
spark.dynamicAllocation.enabled=true
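(For completeness, the companion settings dynamic allocation typically needs
on YARN in 1.3 - the values below are only examples, and the spark_shuffle
aux-service must also be wired up in yarn-site.xml per the docs:

spark.shuffle.service.enabled=true
spark.dynamicAllocation.minExecutors=1
spark.dynamicAllocation.maxExecutors=10
)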
Forgot to add - the cluster is otherwise idle, so there should be no
resource issues. Also, the configuration works when not using Dynamic
Allocation.
On Fri, Mar 20, 2015 at 4:15 PM, Manoj Samel manojsamelt...@gmail.com
wrote:
Hi,
Running Spark 1.3 with secured Hadoop.
Spark-shell with Yarn
Is it correct to say that the Spark DataFrame API is implemented using the
same execution engine as Spark SQL ? In other words, while the DataFrame API
is different from Spark SQL, the runtime performance of equivalent constructs
in DataFrames and Spark SQL should be the same, so one should be able to
choose whichever
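A sketch of the equivalence in question, with hypothetical table and column
names - both forms should produce the same Catalyst plan:

import org.apache.spark.sql.functions._
val viaSql = sqlContext.sql("SELECT a, SUM(b) AS total FROM t GROUP BY a")
val viaDf = sqlContext.table("t").groupBy("a").agg(sum("b").as("total"))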
/scala/org/apache/spark/sql/columnar/ColumnType.scala
.
PRs welcome :)
On Mon, Feb 9, 2015 at 3:01 PM, Manoj Samel manojsamelt...@gmail.com
wrote:
Hi Michael,
As a test, I have the same data loaded as another parquet file - except with
the two decimal(14,4) columns replaced by double. With this, the on-disk
d...@spark.apache.org
http://apache-spark-developers-list.1001551.n3.nabble.com/ mentioned on
http://spark.apache.org/community.html seems to be bouncing. Is there
another one ?
are rebuilding column buffers in addition to reading the data off of
the disk.
On Fri, Feb 6, 2015 at 4:39 PM, Manoj Samel manojsamelt...@gmail.com
wrote:
Spark 1.2
Data stored in parquet table (large number of rows)
Test 1
select a, sum(b), sum(c) from table
Test
sqlContext.cacheTable
in the in-memory columnar
storage, so you are likely paying expensive serialization there.
On Mon, Feb 9, 2015 at 2:18 PM, Manoj Samel manojsamelt...@gmail.com
wrote:
Flat data of types String, Int and a couple of decimal(14,4)
On Mon, Feb 9, 2015 at 1:58 PM, Michael Armbrust mich
to store in-memory decimal in some form of
long with decoration ?
For the immediate future, is there any hook that we can use to provide
custom caching / processing for the decimal type in RDD so that other
semantics do not change ?
Thanks,
On Mon, Feb 9, 2015 at 2:41 PM, Manoj Samel manojsamelt
Spark 1.2
Data stored in parquet table (large number of rows)
Test 1
select a, sum(b), sum(c) from table
Test
sqlContext.cacheTable("table")
select a, sum(b), sum(c) from table - seeds the cache. First time slow since
it is loading the cache ?
select a, sum(b), sum(c) from table - second time it should be faster
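A minimal sketch of the test (spark-shell, Spark 1.2; the parquet path, the
table name "t", and the GROUP BY clause are my assumptions - the aggregate
needs a grouping clause):

val data = sqlContext.parquetFile("/data/t.parquet") // hypothetical path
data.registerTempTable("t")
sqlContext.cacheTable("t")
sqlContext.sql("select a, sum(b), sum(c) from t group by a").collect() // seeds the cache, slow
sqlContext.sql("select a, sum(b), sum(c) from t group by a").collect() // should hit the cache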
Spark 1.2
Data is read from parquet with 2 partitions and is cached as a table with 2
partitions. Verified in the UI that it shows the RDD with 2 partitions and
it is fully cached in memory.
Cached data contains column a, b, c. Column a has ~150 distinct values.
Next run SQL on this table as select a, sum(b),
Thanks
On Wed, Feb 4, 2015 at 4:09 PM, Manoj Samel manojsamelt...@gmail.com
wrote:
Awesome ! By setting this, I could minimize the collect overhead, e.g. by
setting it to the # of partitions of the RDD.
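(Presumably this refers to spark.sql.shuffle.partitions; a hypothetical form
of the setting: sqlContext.setConf("spark.sql.shuffle.partitions", "2"))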
Two questions
1. I had looked for such an option in
http://spark.apache.org/docs/latest
, 2015 at 12:38 PM, Manoj Samel manojsamelt...@gmail.com
wrote:
Spark 1.2
Data is read from parquet with 2 partitions and is cached as a table with 2
partitions. Verified in the UI that it shows the RDD with 2 partitions and
it is fully cached in memory.
Cached data contains column a, b, c. Column a has ~150
Hi,
Any thoughts ?
Thanks,
On Sun, Feb 1, 2015 at 12:26 PM, Manoj Samel manojsamelt...@gmail.com
wrote:
Spark 1.2
SchemaRDD has schema with decimal columns created like
x1 = new StructField("a", DecimalType(14,4), true)
x2 = new StructField("b", DecimalType(14,4), true)
Registering as SQL
Spark 1.2
SchemaRDD has schema with decimal columns created like
x1 = new StructField("a", DecimalType(14,4), true)
x2 = new StructField("b", DecimalType(14,4), true)
Registering as a SQL temp table and doing SQL queries on these columns,
including SUM etc. works fine, so the schema Decimal does
for decimal
So it seems schemaRDD.coalesce returns an RDD whose schema does not match
the source RDD, in that the decimal type seems to get changed.
Any thoughts ? Is this a bug ???
Thanks,
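A minimal repro sketch, under assumptions (Spark 1.2 imports; the row data
is made up):

import org.apache.spark.sql._
val schema = StructType(Seq(
  StructField("a", DecimalType(14, 4), true),
  StructField("b", DecimalType(14, 4), true)))
val rows = sc.parallelize(Seq(Row(BigDecimal("1.5000"), BigDecimal("2.5000"))))
val schemaRDD = sqlContext.applySchema(rows, schema)
println(schemaRDD.schema)             // shows the DecimalType(14,4) columns
println(schemaRDD.coalesce(1).schema) // reportedly comes back with the decimal type changed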
On Sun, Feb 1, 2015 at 12:26 PM, Manoj Samel manojsamelt...@gmail.com
wrote:
Spark 1.2
SchemaRDD has schema
Spark 1.2
While building schemaRDD using StructType
xxx = new StructField("credit_amount", DecimalType, true) gives error: type
mismatch; found : org.apache.spark.sql.catalyst.types.DecimalType.type
required: org.apache.spark.sql.catalyst.types.DataType
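(Presumably the fix is to instantiate the type rather than pass the companion
object, e.g.:

xxx = new StructField("credit_amount", DecimalType(14, 4), true)

or DecimalType.Unlimited for unbounded precision - that name is from the 1.2
API as I recall, so treat it as an assumption.)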
From
Spark 1.2 on Hadoop 2.3
Read one big CSV file, create a schemaRDD on it and saveAsParquetFile.
It creates a large number of small (~1MB) parquet part-x- files.
Any way to control this so that a smaller number of larger files is created ?
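(One common approach, sketched with a hypothetical target partition count
and output path: reduce partitions before saving.)

schemaRDD.coalesce(8).saveAsParquetFile("/path/out.parquet") // 8 part files instead of hundreds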
Thanks,
Spark 1.2, no Hive, prefer not to use HiveContext to avoid metastore_db.
Use case is a Spark YARN app that will start and serve as a query server for
multiple users, i.e. always up and running. At startup, there is an option to
cache data and also pre-compute some result sets, hash maps etc. that
would be
Awesome ! That would be great !!
On Mon, Jan 26, 2015 at 3:18 PM, Michael Armbrust mich...@databricks.com
wrote:
I'm aiming for 1.3.
On Mon, Jan 26, 2015 at 3:05 PM, Manoj Samel manojsamelt...@gmail.com
wrote:
Thanks Michael. I am sure there have been many requests for this support.
Any
precision.
However, there is a PR to add support using parquet's INT96 type:
https://github.com/apache/spark/pull/3820
On Fri, Jan 23, 2015 at 12:08 PM, Manoj Samel manojsamelt...@gmail.com
wrote:
Looking further at the trace and ParquetTypes.scala, it seems there is no
support for Timestamp
Using Spark 1.2
Read a CSV file, apply a schema to convert to a SchemaRDD, and then
schemaRdd.saveAsParquetFile
If the schema includes TimestampType, it gives the following trace when
doing the save
Exception in thread main java.lang.RuntimeException: Unsupported datatype
TimestampType
at
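(A workaround sketch, under assumptions, until TimestampType is supported in
the parquet path: carry the value as a LongType column of epoch millis and
convert at the edges.)

import org.apache.spark.sql._
// hypothetical: event_time stored as LongType (epoch millis) instead of TimestampType
val field = StructField("event_time", LongType, true)
// populate rows with javaTimestamp.getTime, then applySchema and saveAsParquetFile as before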
Hi,
Setup is as follows
Hadoop Cluster 2.3.0 (CDH5.0)
- Namenode HA
- Resource manager HA
- Secured with Kerberos
Spark 1.2
Run SparkPi as follows
- conf/spark-defaults.conf has the following entries
spark.yarn.queue myqueue
spark.yarn.access.namenodes hdfs://namespace (remember this is namenode
Hi,
For running Spark 1.2 on a Hadoop cluster with Kerberos, what Spark
configurations are required?
Using an existing keytab, can any examples be submitted to the secured
cluster ? How?
Thanks,
logged in (i.e. you've run kinit), everything should
just work. You can run klist to make sure you're logged in.
On Thu, Jan 8, 2015 at 3:49 PM, Manoj Samel manojsamelt...@gmail.com
wrote:
Hi,
For running Spark 1.2 on a Hadoop cluster with Kerberos, what Spark
configurations are required
Hi,
I create a bunch of RDDs, including schema RDDs. When I run the program and
go to the UI on xxx:4040, the Storage tab does not show any RDDs.
Spark version is 1.1.1 (Hadoop 2.3)
Any thoughts?
Thanks,
Hi,
An Akka router creates a sqlContext and creates a bunch of routee actors
with the sqlContext as a parameter. The actors then execute queries on that
sqlContext.
Would this pattern be an issue ? Any other way sparkContext etc. should be
shared cleanly in Akka routers/routees ?
Thanks,
:
On Thu, Dec 11, 2014 at 5:33 PM, Manoj Samel manojsamelt...@gmail.com
wrote:
Hi,
If Spark-based services are to be exposed as a continuously available
server, what are the options?
* The API exposed to clients will be proprietary and fine-grained (RPC
style
..), not a Job-level API
Hi,
If Spark-based services are to be exposed as a continuously available
server, what are the options?
* The API exposed to clients will be proprietary and fine-grained (RPC style
..), not a Job-level API
* The client API need not be SQL, so the Thrift JDBC server does not seem to
be an option ..
I am using SQLContext.jsonFile. If a valid JSON contains newlines,
Spark 1.1.1 dumps the trace below. If the JSON is read as one line, it works
fine. Is this known?
14/12/10 11:44:02 ERROR Executor: Exception in task 0.0 in stage 14.0 (TID
28)
com.fasterxml.jackson.core.JsonParseException:
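(For context: per the 1.1 docs, jsonFile does not accept arbitrary JSON
files - each line must be a separate, self-contained JSON object, e.g.

{"name":"a","age":1}
{"name":"b","age":2}

so a single object pretty-printed across several lines will fail to parse.)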
From the 1.1.1 documentation, it seems one can use HiveContext instead of
SQLContext without having a Hive installation. The benefit is a richer SQL
dialect.
Is my understanding correct ?
Thanks
Is there any timeline where Spark SQL goes beyond alpha version?
Thanks,
checkpointing into a
globally visible storage system (e.g., HDFS), which, for example, Spark
Streaming already does.
Currently, this feature is not supported in YARN or Mesos fine-grained
mode.
On Mon, Apr 14, 2014 at 2:08 PM, Manoj Samel manojsamelt...@gmail.com wrote:
Could you please elaborate how drivers can be restarted automatically ?
Thanks,
On Mon, Apr 14, 2014 at 10:30 AM, Aaron Davidson ilike...@gmail.com wrote:
Master and slave are somewhat overloaded terms in the Spark ecosystem (see
the glossary:
'. Perhaps there is
a clearer way to indicate this.
As you have realized, using the full line from the first example will
allow you to run the rest of them.
On Sun, Mar 30, 2014 at 7:31 AM, Manoj Samel manojsamelt...@gmail.com wrote:
Hi,
On
http://people.apache.org/~pwendell/catalyst
Hi,
On
http://people.apache.org/~pwendell/catalyst-docs/sql-programming-guide.html,
I am trying to run the code in Writing Language-Integrated Relational Queries
(I have the 1.0.0 snapshot).
I am running into an error on
val people: RDD[Person] // An RDD of case class objects, from the first
example.
Hi,
I am trying SparkSQL based on the example on doc ...
val people =
sc.textFile(/data/spark/examples/src/main/resources/people.txt).map(_.split(,)).map(p
= Person(p(0), p(1).trim.toInt))
val olderThanTeans = people.where('age 19)
val youngerThanTeans = people.where('age 13)
val
Hi,
If I do a where on BigDecimal, I get a stack trace. Changing BigDecimal to
Double works ...
scala> case class JournalLine(account: String, credit: BigDecimal, debit:
BigDecimal, date: String, company: String, currency: String, costcenter:
String, region: String)
defined class JournalLine
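As noted above, the Double version works; a sketch of that variant (the
sample row values are made up, and sqlContext._ is imported for the implicit
SchemaRDD conversion and the symbol DSL):

import sqlContext._
case class JournalLine2(account: String, credit: Double, debit: Double,
  date: String, company: String, currency: String, costcenter: String,
  region: String)
val lines = sc.parallelize(Seq(JournalLine2("cash", 10.0, 0.0, "2014-03-30",
  "co", "USD", "cc1", "west")))
lines.where('credit > 5.0) // works with Double where it fails with BigDecimal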
this
was not for some reason intentional.
On Sun, Mar 30, 2014 at 10:43 AM, smallmonkey...@hotmail.com
smallmonkey...@hotmail.com wrote:
Can I get the whole operation? Then I can try to locate the error
--
smallmonkey...@hotmail.com
*From:* Manoj Samel manojsamelt
Hi,
If I create a groupBy('a)(Sum('b) as 'foo, Sum('c) as 'bar), then the
resulting RDD should have 'a, 'foo and 'bar.
The result RDD just shows 'foo and 'bar and is missing 'a
Thoughts?
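(A hedged guess at the behavior: in this DSL the output of
groupBy(groupingExprs)(aggregateExprs) is just the aggregate list, so the
grouping column may need to be listed explicitly among the aggregates, e.g.:

rdd.groupBy('a)('a, Sum('b) as 'foo, Sum('c) as 'bar)
)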
Thanks,
Manoj