FWIW, this is an essential feature for our use of Spark, and I'm surprised it's not advertised clearly as a limitation in the documentation. All I've found about running Spark 1.3 on 2.11 is here:
http://spark.apache.org/docs/latest/building-spark.html#building-for-scala-211
Also, I'm experiencing
Not sure if this will help, but try clearing your jar cache directories
(~/.ivy2 for sbt, ~/.m2 for maven).
Thanks
Best Regards
On Wed, Apr 15, 2015 at 9:33 PM, Manoj Samel manojsamelt...@gmail.com
wrote:
Env - Spark 1.3, Hadoop 2.3, Kerberos
xx.saveAsTextFile(path, codec) gives following
How many tasks are you seeing in your mapToPair stage? Is it 7000? Then I
suggest giving a number similar/close to 7000 in your .distinct call.
What is happening in your case is that you are repartitioning your data to
a smaller number (32), which would put a lot of load on processing i
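(A hedged sketch of the suggestion; rdd is a placeholder for your RDD, and
distinct takes an optional partition count:)
// Keep roughly the parallelism of the previous stage during de-duplication.
val deduped = rdd.distinct(7000)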
There's a version incompatibility between your hadoop jars. You need to
make sure you build your Spark against Hadoop version 2.5.0-cdh5.3.1.
Thanks
Best Regards
On Fri, Apr 17, 2015 at 5:17 AM, lalasriza . lala.s.r...@gmail.com wrote:
Dear everyone,
right now I am working with SparkR on
Hey Imran,
Thanks for the great explanation! This cleared up a lot of things for me. I am
actually trying to utilize some of the features within Spark for a system I am
developing. I am currently working on developing a subsystem that can be
integrated within Spark and other Big Data
The spark 'master' branch (i.e. v1.4.0) builds successfully on Windows 8.1, Intel i7
64-bit, with Oracle JDK 8u45, with MAVEN_OPTS not including the flag
-XX:ReservedCodeCacheSize=1g.
The build takes about 33 minutes.
Thanking you.
With Regards
Sree
On Thursday, April 16, 2015 9:07 PM, Arun Lists
FYI, the problem is that the column names Spark generates cannot be
referenced within SQL or DataFrame operations (i.e. SUM(cool_cnt#725)).
Any idea how to alias these final aggregate columns?
The syntax below doesn't make sense, but this is what I'd ideally want to
do:
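(A hedged sketch of one way to get there with the Spark 1.3 DataFrame API;
df, user_id and cool_cnt are hypothetical names:)
import org.apache.spark.sql.functions._
// Alias the aggregate so later SQL/DataFrame code can use a stable name
// instead of the generated one (e.g. SUM(cool_cnt#725)).
val agg = df.groupBy("user_id").agg(sum("cool_cnt").as("cool_cnt_sum"))
agg.select("cool_cnt_sum").show()
agg.registerTempTable("agg_t")
sqlContext.sql("SELECT cool_cnt_sum FROM agg_t").show()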
I think this paper will be a good resource
(https://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf); the Dryad
paper is also a good one.
Thanks
Jerry
From: James King [mailto:jakwebin...@gmail.com]
Sent: Friday, April 17, 2015 3:26 PM
To: user
Subject: Spark Directed Acyclic
Hello again,
steps to reproduce the same problem in JdbcRDD:
- create a table containing a Date field in your favourite DBMS; I used PostgreSQL:
CREATE TABLE spark_test
(
pk_spark_test integer NOT NULL,
text character varying(25),
date1 date,
CONSTRAINT pk PRIMARY KEY (pk_spark_test)
)
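(Presumably the next step is reading the table back; a hedged sketch with a
placeholder connection URL and credentials:)
import java.sql.DriverManager
import org.apache.spark.rdd.JdbcRDD
// Read the table through JdbcRDD; the date1 column is where the problem shows up.
val rows = new JdbcRDD(
  sc,
  () => DriverManager.getConnection("jdbc:postgresql://localhost/test", "user", "pass"),
  "SELECT pk_spark_test, text, date1 FROM spark_test " +
    "WHERE pk_spark_test >= ? AND pk_spark_test <= ?",
  1, 100, 2,
  r => (r.getInt("pk_spark_test"), r.getString("text"), r.getDate("date1")))
rows.collect().foreach(println)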
I just checked the code that creates the OutputCommitCoordinator. Could you
reproduce this issue? If so, could you provide details about how to
reproduce it?
Best Regards,
Shixiong(Ryan) Zhu
2015-04-16 13:27 GMT+08:00 Canoe canoe...@gmail.com:
13119 Exception in thread main
Good use of analogies :)
Yep, friction (or entropy in general) exists in everything, but hey, by adding
and doing "more work" at the same time (aka more powerful rockets) some people
have overcome the friction of the air and even got as far as the moon and
beyond.
It is all about the
Hi,
I am working with multiple Kafka streams (23 streams) and currently I am
processing them separately. I receive one stream from each topic. I have the
following questions.
1. The Spark Streaming guide suggests unioning these streams (a sketch follows
below). Is it possible to get statistics for each stream even after
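(A hedged sketch of the union approach from the guide; topic names, ZK quorum,
and consumer group are placeholders:)
import org.apache.spark.streaming.kafka.KafkaUtils
// One receiver-based stream per topic, then a single unioned DStream.
val topics = (1 to 23).map(i => s"topic-$i")
val streams = topics.map { t =>
  KafkaUtils.createStream(ssc, "zk-host:2181", "my-group", Map(t -> 1))
}
val unioned = ssc.union(streams)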
My apologies, I had pasted the wrong exception trace in the previous email.
Here is the actual exception that I am receiving.
Exception in thread main java.lang.NullPointerException
at
org.apache.spark.rdd.ParallelCollectionRDD$.slice(ParallelCollectionRDD.scala:154)
at
Hi Sean,
Thanks a lot for your reply. The problem is that I need to sample random
*independent* pairs. If I draw two samples and build all n*(n-1) pairs
then there is a lot of dependency. My current solution is also not
satisfactory, because some pairs (the closest ones in a partition) have a
Doesn't this reduce to "Scala isn't compatible with itself across
maintenance releases"? Meaning, if this were fixed then Scala
2.11.{x < 6} would have similar failures. It's not not-ready; it's
just not the Scala 2.11.6 REPL. Still, sure, I'd favor breaking the
unofficial support to at least make the
And btw, if you suspect this is a YARN issue, you can always launch and use
Spark in Standalone Mode, which uses its own embedded cluster resource
manager - this is possible even when Spark has been deployed on CDH under
YARN by the pre-canned install scripts of CDH.
To achieve that:
1.
Normally I use something like the following in Scala:
case class datetest(x: Int, y: java.sql.Date)
val dt = sc.parallelize(1 to 3).map(p => datetest(p, new
java.sql.Date(p*1000*60*60*24)))
sqlContext.createDataFrame(dt).registerTempTable("t1")
sql("select * from t1").collect.foreach(println)
If you still
A very basic but strange problem:
on running master I am getting the following error.
My java path is proper; however, the spark-class file fails because
the string bin/java is duplicated in it. Can anybody explain why it
is doing this?
Error:
/bin/spark-class: line 190: exec:
Hi,
We are planning to add new metrics in Spark for executors that get
killed during execution. I was just curious why this info is not already
present. Is there some reason for not adding it?
Any ideas around are welcome.
Thanks and Regards,
Archit Thakur.
map phase of join*
On Fri, Apr 17, 2015 at 5:28 PM, Archit Thakur archit279tha...@gmail.com
wrote:
Ajay,
This is true. When we call join again on two RDDs, rather than computing
the whole pipeline again, it reads the map output of the map phase of an
RDD (which it usually gets from shuffle
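(A hedged sketch of the behaviour being described; rddA and rddB are
placeholders:)
// The first action runs the map phase and writes shuffle files; the second
// action reuses those shuffle files instead of recomputing the upstream
// pipeline (the map stage shows as skipped in the UI).
val joined = rddA.join(rddB)
joined.count()
joined.count()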
Hi Archit, thanks for the reply.
How can I do the custom compilation to reduce it to 4 bytes? I want to make
it 4 bytes in any case; can you please guide me?
I am applying flatMapValues in each step after zipWithIndex, so it should stay
on the same node, right? Why is it shuffling?
Also, I am running with very little
This is the fraction available for caching, which is 60% * 90% * total
by default.
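(A hedged sketch of where the numbers come from: spark.storage.memoryFraction
defaults to 0.6, and Spark multiplies it by a built-in safety fraction of 0.9:)
import org.apache.spark.SparkConf
val conf = new SparkConf()
  .set("spark.executor.memory", "1g")           // executor heap
  .set("spark.storage.memoryFraction", "0.6")   // default caching fraction
// usable cache ≈ 1g * 0.6 * 0.9 ≈ 552 MB, which is what the web UI reports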
On Fri, Apr 17, 2015 at 11:30 AM, podioss grega...@hotmail.com wrote:
Hi,
I am a bit confused with the executor-memory option. I am running
applications with the Standalone cluster manager with 8 workers with 4gb
Thanks, Sree!
Are you able to run your applications using spark-submit? Even after we
were able to build successfully, we ran into problems with running the
spark-submit script. If everything worked correctly for you, we can hope
that things will be smoother when 1.4.0 is made generally
Hi Akhil,
Thank you for your response,
I think it is not because of the processing time, in fact the delay is under 1
second, while the batch interval is 10 seconds… The data volume is low (10
lines / second)
By the way, I have seen some results change with this call to KafkaUtils:
Thanks for your answer Imran. I haven't tried your suggestions yet, but
setting spark.shuffle.blockTransferService=nio solved my issue. There is a
JIRA for this: https://issues.apache.org/jira/browse/SPARK-6962.
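(For reference, a hedged sketch of applying the workaround programmatically;
it can equally be passed via --conf on spark-submit:)
import org.apache.spark.SparkConf
// Fall back to the older NIO transfer service instead of Netty.
val conf = new SparkConf().set("spark.shuffle.blockTransferService", "nio")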
Zsolt
2015-04-14 21:57 GMT+02:00 Imran Rashid iras...@cloudera.com:
is it possible
By custom installation, I meant change the code and build it. I have not
done a complete impact analysis, just had a look at the code.
When you say the same key goes to the same node: it would need shuffling,
unless the raw data you are reading is already laid out that way.
On Apr 17, 2015 6:30 PM, Jeetendra
I decided to play around with DataFrames this morning but I'm running into
quite a few issues. I'm assuming that I must be doing something wrong, so I
would appreciate some advice.
First, I create my Data Frame.
import sqlContext.implicits._
case class Entity(InternalId: Long, EntityId: Long,
Hi all,
Spark 1.2.1.
I have a Cassandra column family and doing the following
SchemaRDD s = cassandraSQLContext.sql("select user.id as user_id from
user");
// user.id is a UUID in the table definition
s.registerTempTable("my_user");
s.cache(); // throws following exception
// tried the
Hi
With Spark Streaming on version 1.3.0, when I'm using the WAL and checkpoints,
I sometimes hit FileNotFound exceptions.
Here's the complete stacktrace:
https://gist.github.com/akhld/126b945f7fef408a525e
The application simply reads data from Kafka and does a simple wordcount
over it. Batch
Hi,
I am a bit confused with the executor-memory option. I am running
applications with the Standalone cluster manager with 8 workers with 4gb memory
and 2 cores each, and when I submit my application with spark-submit I use
--executor-memory 1g.
In the web UI, in the completed applications table, I see
Hi,
I am unable to access the metrics servlet on Spark 1.2. I tried to access
it from the app master UI on port 4040 but I don't see any metrics there. Is
it a known issue with Spark 1.2, or am I doing something wrong?
Also, how do I publish my own metrics and view them on this servlet?
Thanks,
Hi,
I'm trying to figure out when TaskCompletionListeners are called -- are
they called at the end of the RDD's compute() method, or after the
iteration through the iterator returned by compute() has completed?
To put it another way, is this OK:
class DatabaseRDD[T] extends RDD[T] {
def
Thanks everyone for the reply.
Looks like foreachRDD + filtering is the way to go. I'll have 4 independent
Spark streaming applications so the overhead seems acceptable.
Jianshi
On Fri, Apr 17, 2015 at 5:17 PM, Evo Eftimov evo.efti...@isecc.com wrote:
Good use of analogies :)
Yep, friction
I recently started using Spark version 1.3.0 in standalone mode (with Scala
2.10.3), and I'm running into an odd problem. I'm loading data from a file
using sc.textFile, doing some conversion of the data, and then clustering
it. When I do this with a small file (10 lines, 9 KB), it works fine, and
I am saying to partition with something like partitionBy(new HashPartitioner(16)).
Will this not work?
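(A hedged sketch; partitionBy exists only on key-value RDDs, so pairRdd is a
placeholder for an RDD[(K, V)]:)
import org.apache.spark.HashPartitioner
// Records with the same key land in the same one of the 16 partitions.
val partitioned = pairRdd.partitionBy(new HashPartitioner(16))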
On 17 April 2015 at 21:28, Jeetendra Gangele gangele...@gmail.com wrote:
I have given 3000 tasks to mapToPair and now it's taking so much memory and
shuffling and wasting time there. Here are the stats
So I'm trying to store the results of a query into a DataFrame, but I get the
following exception thrown:
Exception in thread main java.lang.RuntimeException: [1.71] failure: ``*''
expected but `select' found
SELECT DISTINCT OutSwitchID FROM wtbECRTemp WHERE OutSwtichID NOT IN (SELECT
SwitchID
Support for subqueries in predicates hasn't been resolved yet - please
refer to SPARK-4226.
BTW, Spark 1.3 binds to Hive 0.13.1 by default.
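(A hedged workaround sketch, not from the thread: a NOT IN subquery can often
be rewritten as a left outer join; the switches table name is hypothetical
since the original subquery is truncated above:)
// Keep rows from wtbECRTemp whose OutSwitchID has no match in switches.
val result = sqlContext.sql("""
  SELECT DISTINCT t.OutSwitchID
  FROM wtbECRTemp t
  LEFT OUTER JOIN switches s ON t.OutSwitchID = s.SwitchID
  WHERE s.SwitchID IS NULL
""")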
On Fri, Apr 17, 2015 at 09:18 ARose ashley.r...@telarix.com wrote:
So I'm trying to store the results of a query into a DataFrame, but I get
the
It's the latter -- after Spark gets to the end of the iterator (or if it
hits an exception).
So your example is good; that is exactly what it is intended for.
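(A hedged sketch of the pattern inside compute(), assuming the Spark 1.3
TaskContext API; openConnection and queryRows are hypothetical helpers:)
import org.apache.spark.{Partition, TaskContext}
// Inside DatabaseRDD (sketch, not a complete class): register cleanup that
// Spark fires once the returned iterator is exhausted (or the task fails).
override def compute(split: Partition, context: TaskContext): Iterator[T] = {
  val conn = openConnection()                              // hypothetical
  context.addTaskCompletionListener { _ => conn.close() }
  queryRows(conn, split)                                   // hypothetical Iterator[T]
}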
On Fri, Apr 17, 2015 at 12:23 PM, Akshat Aranya aara...@gmail.com wrote:
Hi,
I'm trying to figure out when TaskCompletionListeners are
Hi,
I'm new to Spark and am working on a proof of concept. I'm using Spark
1.3.0 and running in local mode.
I can read and parse an RCFile using Spark; however, the performance is not as
good as I hoped.
I'm testing with ~800k rows and it is taking about 30 mins to process.
Is there a better
https://issues.apache.org/jira/browse/SPARK-1061
note the proposed fix isn't to have spark automatically know about the
partitioner when it reloads the data, but at least to make it *possible*
for it to be done at the application level.
On Fri, Apr 17, 2015 at 11:35 AM, Wang, Ningjun (LNG-NPV)
Hi All
I have an RDD[Object]; I then convert it to RDD[(Object, Long)] with
zipWithIndex.
Here the index is a Long and takes 8 bytes. Is there any way to make it an
Integer?
There is no API with an Int index.
How can I create a custom RDD so that it takes only 4 bytes for the index part?
Also, why the API is
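(A hedged sketch of a simpler route: if the RDD has fewer than Int.MaxValue
elements, the Long index can be narrowed after the fact:)
// zipWithIndex returns RDD[(T, Long)]; narrow the index to Int afterwards.
// Safe only while the element count fits in an Int.
val indexed = rdd.zipWithIndex().map { case (v, i) => (v, i.toInt) }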
Thanks. Would that distribution work for hdp 2.2?
On Fri, Apr 17, 2015 at 2:19 PM, Zhan Zhang zzh...@hortonworks.com wrote:
You don’t need to put any yarn assembly in hdfs. The spark assembly jar
will include everything. It looks like your package does not include yarn
module, although I
Hmm... I don't follow. The 2.11.x series is supposed to be binary compatible
against user code. Anyway, I was building Spark against 2.11.2 and still saw
the problems with the REPL. I've created a bug report:
https://issues.apache.org/jira/browse/SPARK-6989
Hi Udit,
By the way, do you mind sharing the whole log trace?
Thanks.
Zhan Zhang
On Apr 17, 2015, at 2:26 PM, Udit Mehta
ume...@groupon.com wrote:
I am just trying to launch a spark shell and not do anything fancy. I got the
binary distribution from apache and put
I was using sbt, and I found that I actually had specified Spark 0.9.1 there.
Once I upgraded my sbt config file to use 1.3.0, and Scala to 2.10.4, the
problem went away.
Michael
It's because you did a repartition -- which rearranges all the data.
Parquet uses all kinds of compression techniques, such as dictionary
encoding and run-length encoding, which would result in the size difference
when the data is ordered differently.
On Fri, Apr 17, 2015 at 4:51 AM, zhangxiongfei
You don’t need to put any yarn assembly in hdfs. The spark assembly jar will
include everything. It looks like your package does not include yarn module,
although I didn’t find anything wrong in your mvn command. Can you check
whether the ExecutorLauncher class is in your jar file or not?
BTW:
You probably want to first try the basic configuration to see whether it works,
instead of setting SPARK_JAR to point to the hdfs location. This error is
caused by not finding ExecutorLauncher on the class path, and is not HDP
specific, I think.
Thanks.
Zhan Zhang
On Apr 17, 2015, at 2:26 PM, Udit
Hi,
This is the log trace:
https://gist.github.com/uditmehta27/511eac0b76e6d61f8b47
On the yarn RM UI, I see :
Error: Could not find or load main class
org.apache.spark.deploy.yarn.ExecutorLauncher
The command I run is: bin/spark-shell --master yarn-client
The spark defaults I use are:
I followed the steps described above and I still get this error:
Error: Could not find or load main class
org.apache.spark.deploy.yarn.ExecutorLauncher
I am trying to build spark 1.3 on hdp 2.2.
I built spark from source using:
build/mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.6.0 -Phive
Hi All,
I'm happy to announce the Spark 1.3.1 and 1.2.2 maintenance releases.
We recommend all users on the 1.3 and 1.2 Spark branches upgrade to
these releases, which contain several important bug fixes.
Download Spark 1.3.1 or 1.2.2:
http://spark.apache.org/downloads.html
Release notes:
I'm using Amazon EMR + S3 as my Spark cluster infrastructure. I'm
running a job with periodic checkpointing (it has a long dependency tree, so
truncating by checkpointing is mandatory; each checkpoint has 320
partitions). The job stops halfway, resulting in an exception:
(On driver)
I'm trying to create a simple SparkListener to get notified of errors on
executors. I do not get any callbacks on my SparkListener. Here is some
simple code I'm executing in spark-shell, but I still don't get any
callbacks on my listener. Am I doing something wrong?
Thanks for any clue you can send
When you start the spark-shell, it's already too late to get the
ApplicationStart event. Try listening for StageCompleted or JobEnd instead.
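(A hedged sketch of listening for an event that still fires after the shell
is up:)
import org.apache.spark.scheduler.{SparkListener, SparkListenerJobEnd}
// Job-end events fire for every job triggered after registration.
sc.addSparkListener(new SparkListener {
  override def onJobEnd(jobEnd: SparkListenerJobEnd): Unit =
    println(s"Job ${jobEnd.jobId} finished: ${jobEnd.jobResult}")
})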
On Fri, Apr 17, 2015 at 5:54 PM, Praveen Balaji
secondorderpolynom...@gmail.com wrote:
I'm trying to create a simple SparkListener to get notified of
Besides the hdp.version in spark-defaults.conf, I think you probably forgot to
put the file java-opts under $SPARK_HOME/conf with the following contents:
[root@c6402 conf]# pwd
/usr/hdp/current/spark-client/conf
[root@c6402 conf]# ls
fairscheduler.xml.template java-opts
Thanks Zhang, that solved the error. This is probably not documented
anywhere so I missed it.
Thanks again,
Udit
On Fri, Apr 17, 2015 at 3:24 PM, Zhan Zhang zzh...@hortonworks.com wrote:
Besides the hdp.version in spark-defaults.conf, I think you probably
forgot to put the file java-opts
Thank you for the explanation! I’ll check what can be done here.
From: Krist Rastislav [mailto:rkr...@vub.sk]
Sent: Friday, April 17, 2015 9:03 PM
To: Wang, Daoyuan; Michael Armbrust
Cc: user
Subject: RE: ClassCastException processing date fields using spark SQL since
1.3.0
So finally,
You are running on 2.11.6, right? Of course, it seems like that should
all work, but it doesn't work for you. My point is that the shell you
are saying doesn't work is Scala's 2.11.2 shell -- with some light
modification.
It's possible that the delta is the problem. I can't entirely make out
I actually just saw your comment on SPARK-6989 before this message. So I'll
copy to the mailing list:
I'm not sure I understand what you mean about running on 2.11.6. I'm just
running the spark-shell command. It in turn is running
java -cp
According to the documentation:
The local directories used by Spark executors will be the local directories
configured for YARN (Hadoop YARN config yarn.nodemanager.local-dirs). If the
user specifies spark.local.dir, it will be ignored.
(https://spark.apache.org/docs/1.2.1/running-on-yarn.html)
Thanks for the response, Imran. I probably chose the wrong methods for this
email. I implemented all methods of SparkListener and the only callback I
get is onExecutorMetricsUpdate.
Here's the complete code:
==
import org.apache.spark.scheduler._
sc.addSparkListener(new SparkListener()