The issue is now resolved.
One of the CSV files had an incorrect record at the end.
On Fri, Feb 27, 2015 at 4:24 PM, anamika gupta anamika.guo...@gmail.com
wrote:
I have three tables with the following schema:
case class date_d(WID: Int, CALENDAR_DATE: java.sql.Timestamp,
DATE_STRING:
Hi,
I am exploring SparkSQL for my purposes of performing large relational
operations across a cluster. However, it seems to be in alpha right now. Is
there any indication when it would be considered production-level? I don't
see any info on the site.
Regards,
Ashish
Hopefully the alpha tag will be removed in 1.4.0, if the community can review
code a little bit faster :P
Thanks,
Daoyuan
From: Ashish Mukherjee [mailto:ashish.mukher...@gmail.com]
Sent: Saturday, February 28, 2015 4:28 PM
To: user@spark.apache.org
Subject: SparkSQL production readiness
Hi,
I
Maybe, but any time the workaround is to use spark-submit --conf
spark.executor.extraClassPath=/guava.jar blah, that means that standalone apps
must have hard-coded paths that are honored on every worker. And as you know, a
lib is pretty much blocked from use of this version of Spark—hence the
Hi,
I have been looking at Spark Streaming , which seems to be for the use case
of live streams which are processed one line at a time generally in
real-time.
Since SparkSQL reads data from some filesystem, I was wondering if there is
something which connects SparkSQL with Spark Streaming, so I
I think you can do simple operations like foreachRDD or transform to get
access to the RDDs in the stream and then you can do SparkSQL over it.
Thanks
Best Regards
On Sat, Feb 28, 2015 at 3:27 PM, Ashish Mukherjee
ashish.mukher...@gmail.com wrote:
Hi,
I have been looking at Spark Streaming
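For reference, a minimal sketch of the foreachRDD-plus-SparkSQL pattern suggested above (the socket source, the Event case class, and the query are illustrative placeholders, not from this thread; the Spark 1.2-style createSchemaRDD implicit is assumed):

import org.apache.spark.SparkConf
import org.apache.spark.sql.SQLContext
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Hypothetical record type, only for illustration.
case class Event(id: Int, value: String)

object StreamingSqlSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("streaming-sql-sketch")
    val ssc = new StreamingContext(conf, Seconds(10))
    val lines = ssc.socketTextStream("localhost", 9999)

    lines.foreachRDD { rdd =>
      // Build an SQLContext from the batch RDD's SparkContext and run SQL over it.
      // (Creating it per batch keeps the sketch short; reusing one instance is cheaper.)
      val sqlContext = new SQLContext(rdd.sparkContext)
      import sqlContext.createSchemaRDD

      val events = rdd.map(_.split(","))
        .filter(_.length == 2)
        .map(fields => Event(fields(0).toInt, fields(1)))
      events.registerTempTable("events")
      sqlContext.sql("SELECT value, COUNT(*) FROM events GROUP BY value")
        .collect().foreach(println)
    }

    ssc.start()
    ssc.awaitTermination()
  }
}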
Hi devs,
Is there any connection between the input file size and RAM size for
sorting using SparkSQL?
I tried a 1 GB file with 8 GB RAM with 4 cores and got
java.lang.OutOfMemoryError: GC overhead limit exceeded.
Or could it be for any other reason? It's working for other SparkSQL
Hi,
I am running Spark applications in GCE. I set up clusters with different
numbers of nodes varying from 1 to 7. The machines are single-core machines.
I set spark.default.parallelism to the number of nodes in the cluster
for each cluster. I ran the four applications available in Spark
Hey,
Running my first map-reduce-like (meaning disk-to-disk, avoiding in-memory
RDDs) computation in Spark on YARN, I immediately got bitten by a too-low
spark.yarn.executor.memoryOverhead. However, it took me about an hour to
find out this was the cause. At first I observed failing shuffles leading
to
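For reference, the overhead setting under discussion can be raised either in code or on the command line; a small sketch (the 1024 MB figure is only an illustration, not a recommended value):

import org.apache.spark.SparkConf

// Extra off-heap headroom, in megabytes, that YARN grants each executor
// container before killing it. Can also be passed as
// spark-submit --conf spark.yarn.executor.memoryOverhead=1024.
val conf = new SparkConf()
  .setAppName("shuffle-heavy-job")
  .set("spark.yarn.executor.memoryOverhead", "1024")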
I have created SPARK-6085 with pull request:
https://github.com/apache/spark/pull/4836
Cheers
On Sat, Feb 28, 2015 at 12:08 PM, Corey Nolet cjno...@gmail.com wrote:
+1 to a better default as well.
We were working fine until we ran against a real dataset which was much
larger than the test
Thanks for taking this on Ted!
On Sat, Feb 28, 2015 at 4:17 PM, Ted Yu yuzhih...@gmail.com wrote:
I have created SPARK-6085 with pull request:
https://github.com/apache/spark/pull/4836
Cheers
On Sat, Feb 28, 2015 at 12:08 PM, Corey Nolet cjno...@gmail.com wrote:
+1 to a better default as
We are planning to remove the alpha tag in 1.3.0.
On Sat, Feb 28, 2015 at 12:30 AM, Wang, Daoyuan daoyuan.w...@intel.com
wrote:
Hopefully the alpha tag will be removed in 1.4.0, if the community can
review code a little bit faster :P
Thanks,
Daoyuan
*From:* Ashish Mukherjee
Thanks, Ashish! Is Oozie integrated with Spark? I know it can accommodate
some Hadoop jobs.
On Sat, Feb 28, 2015 at 6:07 PM, Ashish Nigam ashnigamt...@gmail.com
wrote:
Qiang,
Did you look at Oozie?
We use oozie to run spark jobs in production.
On Feb 28, 2015, at 2:45 PM, Qiang Cao
Hi Deep,
Compute times may not be very meaningful for small examples like those. If
you increase the sizes of the examples, then you may start to observe more
meaningful trends and speedups.
Joseph
On Sat, Feb 28, 2015 at 7:26 AM, Deep Pradhan pradhandeep1...@gmail.com
wrote:
Hi,
I am
You have to call spark-submit from oozie.
I used this link to get the idea for my implementation -
http://mail-archives.apache.org/mod_mbox/oozie-user/201404.mbox/%3CCAHCsPn-0Grq1rSXrAZu35yy_i4T=fvovdox2ugpcuhkwmjp...@mail.gmail.com%3E
On Feb 28, 2015, at 3:25 PM, Qiang Cao
Hi,
I have a Spark application that hangs on just one task (the rest of the 200-300
tasks get completed in a reasonable time).
I can see in the thread dump which function gets stuck, however I don't
have a clue as to what value is causing that behaviour.
Also, logging the inputs before the function is
For what it's worth, I was seeing mysterious hangs, but they went away when
upgrading from Spark 1.2 to 1.2.1. I don't know if this is your problem. Also, I'm
using AWS EMR images, which were also upgraded.
Anyway, that's my experience.
-Mike
From: Manas Kar manasdebashis...@gmail.com
To:
Maybe try including the jar with
--driver-class-path jar
On Feb 26, 2015, at 12:16 PM, Akshat Aranya aara...@gmail.com wrote:
My guess would be that you are packaging too many things in your job, which
is causing problems with the classpath. When your jar goes in first, you get
the
A correction to my first post:
There is also a repartition right before groupByKey to help avoid
too-many-open-files error:
rdd2.union(rdd1).map(...).filter(...).repartition(15000).groupByKey().map(...).flatMap(...).saveAsTextFile()
On Sat, Feb 28, 2015 at 11:10 AM, Arun Luthra
Yes. I ran into this problem with the Mahout snapshot and Spark 1.2.0. I didn't
really try to figure out why that was a problem, since there were
already too many moving parts in my app. Obviously there is a classpath
issue somewhere.
/Erlend
On 27 Feb 2015 22:30, Pat Ferrel p...@occamsmachete.com
Hi Everyone,
We need to deal with workflows on Spark. In our scenario, each workflow
consists of multiple processing steps. Among different steps, there could
be dependencies. I'm wondering if there are tools available that can help
us schedule and manage workflows on Spark. I'm looking for
Qiang,
Did you look at Oozie?
We use oozie to run spark jobs in production.
On Feb 28, 2015, at 2:45 PM, Qiang Cao caoqiang...@gmail.com wrote:
Hi Everyone,
We need to deal with workflows on Spark. In our scenario, each workflow
consists of multiple processing steps. Among different
Ted,
spark-catalyst_2.11-1.2.1.jar is present in the classpath. BTW, I am running
the code locally in the Eclipse workspace.
Here's the complete exception stack trace -
Exception in thread "main" scala.ScalaReflectionException: class
org.apache.spark.sql.catalyst.ScalaReflection in JavaMirror with
Hi Shahab,
There are actually a few distributed Matrix types which support sparse
representations: RowMatrix, IndexedRowMatrix, and CoordinateMatrix.
The documentation has a bit more info about the various uses:
http://spark.apache.org/docs/latest/mllib-data-types.html#distributed-matrix
The
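For reference, a small sketch of building one of those sparse distributed matrices (assumes an existing SparkContext sc; the entries are made up):

import org.apache.spark.mllib.linalg.distributed.{CoordinateMatrix, IndexedRowMatrix, MatrixEntry}

// Only the non-zero cells are stored, so the representation stays sparse.
val entries = sc.parallelize(Seq(
  MatrixEntry(0, 1, 3.0),   // row 0, column 1, value 3.0
  MatrixEntry(2, 0, 5.0),
  MatrixEntry(4, 3, 1.5)))
val coordMat = new CoordinateMatrix(entries)

// Convert to a row-oriented form when row-wise operations are needed.
val rowMat: IndexedRowMatrix = coordMat.toIndexedRowMatrix()
println(s"${coordMat.numRows()} x ${coordMat.numCols()}")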
Hi Jao,
You can use external tools and libraries if they can be called from your
Spark program or script (with appropriate conversion of data types, etc.).
The best way to apply a pre-trained model to a dataset would be to call the
model from within a closure, e.g.:
myRDD.map { myDatum =>
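A small, self-contained sketch of that closure pattern (the model here is a stub standing in for whatever external library is used; broadcasting avoids re-serializing the model with every task):

// Stub model; substitute the external library's object and predict call.
case class DummyModel(weights: Array[Double]) {
  def predict(features: Array[Double]): Double =
    weights.zip(features).map { case (w, x) => w * x }.sum
}

val model = sc.broadcast(DummyModel(Array(0.5, -1.0, 2.0)))

val data = sc.parallelize(Seq(Array(1.0, 2.0, 3.0), Array(0.0, 1.0, 0.5)))
val predictions = data.map { datum => model.value.predict(datum) }
predictions.collect().foreach(println)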
There was a recent discussion about whether to increase or indeed make
configurable this kind of default fraction. I believe the suggestion
there too was that 9-10% is a safer default.
Advanced users can lower the resulting overhead value; it may still
have to be increased in some cases, but a
Having good out-of-box experience is desirable.
+1 on increasing the default.
On Sat, Feb 28, 2015 at 8:27 AM, Sean Owen so...@cloudera.com wrote:
There was a recent discussion about whether to increase or indeed make
configurable this kind of default fraction. I believe the suggestion
All stated symptoms are consistent with GC pressure (other nodes time out
trying to connect because of a long stop-the-world), quite possibly due to
groupByKey. groupByKey is a very expensive operation as it may bring all
the data for a particular partition into memory (in particular, it cannot
Moving user to bcc.
What I found was that the TaskSetManager for my task set that had 5 tasks
had preferred locations set for 4 of the 5. Three had localhost/driver
and had completed. The one that had nothing had also completed. The last
one was set by our code to be my IP address. Local mode can
Hi,
I reconfigured everything. Still facing the same issue.
Can someone please help?
On Friday, February 27, 2015, Anusha Shamanur anushas...@gmail.com wrote:
I do.
What tags should I change in this?
I changed the value of hive.exec.scratchdir to /tmp/hive.
What else?
On Fri, Feb 27, 2015
Hi Spark community,
We have a use case where we need to pull huge amounts of data from a SQL
query against a database into Spark. We need to execute the query against
our huge database and not a substitute (SparkSQL, Hive, etc) because of a
couple of factors including custom functions used in the
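For reference, one way to parallelize such a pull in Spark 1.x is JdbcRDD; a hedged sketch (the JDBC URL, credentials, table, and bounds are all hypothetical):

import java.sql.DriverManager
import org.apache.spark.rdd.JdbcRDD

// JdbcRDD requires exactly two '?' placeholders in the query; it fills them
// with per-partition key ranges so the query runs in parallel across executors.
val rows = new JdbcRDD(
  sc,
  () => DriverManager.getConnection("jdbc:postgresql://dbhost/mydb", "user", "secret"),
  "SELECT id, payload FROM events WHERE id >= ? AND id <= ?",
  1L,         // lowerBound
  10000000L,  // upperBound
  100,        // numPartitions
  rs => (rs.getLong("id"), rs.getString("payload")))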
I would first check whether there is any possibility that, after doing
groupByKey, one of the groups does not fit in one of the executors' memory.
To test my theory, instead of doing groupByKey + map, try reduceByKey +
mapValues.
Let me know if that helped.
Pawel Szulc
http://rabbitonweb.com
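A small illustration of that suggestion (assumes an existing SparkContext sc; the per-key sum stands in for whatever the original map did):

val pairs = sc.parallelize(Seq(("a", 1L), ("b", 1L), ("a", 1L)))

// groupByKey pulls every value for a key onto one executor before the map runs...
val viaGroup = pairs.groupByKey().map { case (k, vs) => (k, vs.sum) }

// ...whereas reduceByKey combines values map-side before the shuffle, so no
// single group has to fit in memory; mapValues handles any per-key post-processing.
val viaReduce = pairs.reduceByKey(_ + _).mapValues(total => total)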
But groupByKey will repartition according to the number of keys, as I understand
how it works. How do you know that you haven't reached the groupByKey
phase? Are you using a profiler, or do you base that assumption only on logs?
Sat, 28 Feb 2015, 8:12 PM, user Arun Luthra arun.lut...@gmail.com
So, actually I am removing the persist for now, because there is
significant filtering that happens after calling textFile()... but I will
keep that option in mind.
I just tried a few different combinations of number of executors, executor
memory, and more importantly, number of tasks... all
The job fails before getting to groupByKey.
I see a lot of timeout errors in the yarn logs, like:
15/02/28 12:47:16 WARN util.AkkaUtils: Error sending message in 1 attempts
akka.pattern.AskTimeoutException: Timed out
and
15/02/28 12:47:49 WARN util.AkkaUtils: Error sending message in 2
conf = SparkConf().setAppName("spark_calc3merged").setMaster("spark://ec2-54-145-68-13.compute-1.amazonaws.com:7077")
sc = SparkContext(conf=conf, pyFiles=["/root/platinum.py", "/root/collections2.py"])
15/02/28 19:06:38 WARN scheduler.TaskSetManager: Lost task 5.0 in stage 3.0
(TID 38,
The Spark UI shows the line number and name of the operation (repartition
in this case) that it is performing. Only if this information is wrong
(just a possibility) could it have started groupByKey already.
I will try to analyze the amount of skew in the data by using reduceByKey
(or simply
+1 to a better default as well.
We were working fine until we ran against a real dataset which was much
larger than the test dataset we were using locally. It took me a couple of
days and digging through many logs to figure out this value was what was
causing the problem.
On Sat, Feb 28, 2015 at
Have you verified that the spark-catalyst_2.10 jar was in the classpath?
Cheers
On Sat, Feb 28, 2015 at 9:18 AM, Ashish Nigam ashnigamt...@gmail.com
wrote:
Hi,
I wrote a very simple program in Scala to convert an existing RDD to a
SchemaRDD.
But the createSchemaRDD function is throwing an exception
Just wanted to point out: raising the memory overhead (as I saw in the logs)
was the fix for this issue, and I have not seen dying executors since this
value was increased.
On Tue, Feb 24, 2015 at 3:52 AM, Anders Arpteg arp...@spotify.com wrote:
If you thinking of the yarn memory overhead, then yes,
Hi,
I wrote a very simple program in Scala to convert an existing RDD to a
SchemaRDD.
But the createSchemaRDD function is throwing an exception:
Exception in thread "main" scala.ScalaReflectionException: class
org.apache.spark.sql.catalyst.ScalaReflection in JavaMirror with primordial
classloader with boot
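For context, the Spark 1.2-era pattern being attempted looks roughly like this (Person and people.txt are placeholders, not from the original program):

import org.apache.spark.sql.SQLContext

case class Person(name: String, age: Int)

val sqlContext = new SQLContext(sc)
// Brings the implicit RDD[Product] => SchemaRDD conversion into scope.
import sqlContext.createSchemaRDD

val people = sc.textFile("people.txt")
  .map(_.split(","))
  .map(fields => Person(fields(0), fields(1).trim.toInt))

people.registerTempTable("people")
val adults = sqlContext.sql("SELECT name FROM people WHERE age >= 18")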
Also the data file is on hdfs
Also, can the Scala version play any role here?
I am using Scala 2.11.5 but all Spark packages have a dependency on Scala
2.11.2.
Just wanted to make sure that the Scala version is not an issue here.
On Sat, Feb 28, 2015 at 9:18 AM, Ashish Nigam ashnigamt...@gmail.com
wrote:
Hi,
I wrote a very simple
So somehow Spark Streaming doesn't support display of named accumulators in
the WebUI?
On Tue, Feb 24, 2015 at 7:58 AM, Petar Zecevic petar.zece...@gmail.com
wrote:
Interesting. Accumulators are shown on the Web UI if you are using the
ordinary SparkContext (Spark 1.2). It just has to be named
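For reference, naming the accumulator is what makes it visible; a minimal sketch (assumes an existing SparkContext sc and a placeholder input.txt):

// The second argument is the display name; unnamed accumulators are not
// shown on the Web UI.
val processed = sc.accumulator(0L, "records processed")

sc.textFile("input.txt").foreach { _ => processed += 1L }
println(processed.value)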
Sorry, not really. Spork is a way to migrate your existing Pig scripts to
Spark or write new Pig jobs that can execute on Spark.
For orchestration you are better off using Oozie, especially if you are
using other execution engines/systems besides Spark.
Regards,
Mayur Rustagi
Ph: +1 (760) 203 3257
Thanks for the pointer, Ashish! I was also looking at Spork
(https://github.com/sigmoidanalytics/spork, Pig-on-Spark), but wasn't sure if
that's the right direction.
On Sat, Feb 28, 2015 at 6:36 PM, Ashish Nigam ashnigamt...@gmail.com
wrote:
You have to call spark-submit from oozie.
I used this
Hi guys,
I am new to Spark and we are running a small project that collects data from
Kinesis and inserts it into Mongo.
I would like to share a high-level view of how it is done and would love your
input on it.
I am fetching Kinesis data and for each RDD:
- Parsing string data
- Inserting into
Here was the latest modification in the Spork repo:
Mon Dec 1 10:08:19 2014
Not sure if it is being actively maintained.
On Sat, Feb 28, 2015 at 6:26 PM, Qiang Cao caoqiang...@gmail.com wrote:
Thanks for the pointer, Ashish! I was also looking at Spork
https://github.com/sigmoidanalytics/spork
Thanks Mayur! I'm looking for something that would allow me to easily
describe and manage a workflow on Spark. A workflow in my context is a
composition of Spark applications that may depend on one another based on
HDFS inputs/outputs. Is Spork a good fit? The orchestration I want is at the
app level.
You mean the size of the data that we take?
Thank You
Regards,
Deep
On Sun, Mar 1, 2015 at 6:04 AM, Joseph Bradley jos...@databricks.com
wrote:
Hi Deep,
Compute times may not be very meaningful for small examples like those.
If you increase the sizes of the examples, then you may start to
We do maintain it, but in the Apache repo itself. However, Pig cannot do
orchestration for you. I am not sure what you are looking for from Pig in
this context.
Regards,
Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoid.com http://www.sigmoidanalytics.com/
@mayur_rustagi
I think it's possible that the problem is that the Scala compiler is not
being loaded by the primordial classloader (but instead by some child
classloader), and thus the Scala reflection mirror is failing to initialize
when it can't find it. Unfortunately, the only solution that I know of is
to load