Hello,
I'm using Spark on a YARN cluster and using the mongo-hadoop connector to pull
data into Spark and run a job.
The job has the following stages:
(flatMap - flatMap - reduceByKey - sortByKey)
The data in MongoDB is tweets from Twitter.
First, connect to MongoDB and make an RDD as follows:
val mongoRDD
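(For reference, a minimal sketch of how such an RDD is typically built with the mongo-hadoop connector; the URI and collection below are hypothetical:)

import org.apache.hadoop.conf.Configuration
import com.mongodb.hadoop.MongoInputFormat
import org.bson.BSONObject

// Hypothetical URI; point it at your tweets collection.
val mongoConfig = new Configuration()
mongoConfig.set("mongo.input.uri", "mongodb://host:27017/twitter.tweets")

// Each record is a (document id, BSON document) pair.
val mongoRDD = sc.newAPIHadoopRDD(
  mongoConfig,
  classOf[MongoInputFormat],
  classOf[Object],
  classOf[BSONObject])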
Ah, never mind, I don't know anything about scheduling tasks in YARN.
On Thu, Aug 13, 2015 at 11:03 PM, Ara Vartanian arav...@cs.wisc.edu wrote:
I’m running on Yarn.
On Aug 13, 2015, at 10:58 PM, Philip Weaver philip.wea...@gmail.com
wrote:
Are you running on Mesos, YARN, or standalone? If
What does exponential data mean? Does it mean that the amount of data that is
being received from the stream in a batch interval is increasing exponentially
as time progresses?
Does your process have enough memory to handle the data for a batch
interval?
You may want to share Spark
Hi all,
I got an exception like
"org.apache.spark.sql.catalyst.analysis.UnresolvedException: Invalid call
to dataType on unresolved object" when using some WHERE-condition queries.
I am using Spark version 1.4.0, but it works perfectly in Hive.
Please refer to the following query. I have
Yep, and it works fine for operations which do not involve any shuffle
(like foreach, count, etc.), but those which involve shuffle operations end
up in an infinite loop. Spark should somehow indicate this instead of going
into an infinite loop.
Thanks
Best Regards
On Thu, Aug 13, 2015 at 11:37
-- Forwarded message --
From: Dawid Wysakowicz wysakowicz.da...@gmail.com
Date: 2015-08-14 9:32 GMT+02:00
Subject: Re: Using unserializable classes in tasks
To: mark manwoodv...@googlemail.com
I am not an expert, but first of all check whether there is a ready-made connector
(you mentioned
What I understood from Imran's mail (and what was referenced in his
mail) is that the RDD mentioned seems to be violating some basic contracts on
how partitions are used in Spark [1].
They cannot be arbitrarily numbered, have duplicates, etc.
Extending RDD to add functionality is typically for niche
Data skew? Maybe your partition key has some special value like null or
an empty string.
On Fri, Aug 14, 2015 at 11:01 AM, randylu randyl...@gmail.com wrote:
It is strange that there are always two tasks slower than the others, and the
corresponding partitions' data are larger, no matter how many
I’m running on Yarn.
On Aug 13, 2015, at 10:58 PM, Philip Weaver philip.wea...@gmail.com wrote:
Are you running on Mesos, YARN, or standalone? If you're on Mesos, are you
using coarse-grained or fine-grained mode?
On Thu, Aug 13, 2015 at 10:13 PM, Ara Vartanian arav...@cs.wisc.edu
Hi,
If you are decreasing the number of partitions in this RDD, consider
using coalesce, which can avoid performing a shuffle.
However, if you're doing a drastic coalesce, e.g. to numPartitions =
1, this may result in your computation taking place on fewer nodes
than you like (e.g. one node in
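A quick sketch of the difference (the RDD name and partition counts are made up):

// Dropping to fewer partitions: coalesce avoids a full shuffle.
val fewer = bigRdd.coalesce(10)

// A drastic coalesce (e.g. to 1) can leave all the work on one node;
// shuffle = true (what repartition does) restores upstream parallelism
// at the cost of a shuffle.
val single   = bigRdd.coalesce(1, shuffle = true)
val balanced = bigRdd.repartition(100)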
Thanks Ted for the help!
At 2015-08-14 12:02:14, Ted Yu yuzhih...@gmail.com wrote:
You can look under Developer Track:
https://spark-summit.org/2015/#day-1
http://www.slideshare.net/jeykottalam/spark-sqlamp-camp2014?related=1 (slightly
old)
Catalyst design:
I have a Spark job that computes some values and needs to write those
values to a data store. The classes that write to the data store are not
serializable (e.g., Cassandra session objects).
I don't want to collect all the results at the driver; I want each worker
to write the data - what is
No, the connector does not need to be serializable because it is constructed
on the worker. Only objects shuffled across partitions need to be
serializable.
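A common pattern is to build the client inside foreachPartition so it never leaves the worker; createDataStoreSession below is a hypothetical stand-in for whatever non-serializable client (e.g. a Cassandra session factory) you use:

results.foreachPartition { partition =>
  // Constructed on the worker, once per partition, so it is never serialized.
  val session = createDataStoreSession()
  try {
    partition.foreach(record => session.write(record))
  } finally {
    session.close()
  }
}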
2015-08-14 9:40 GMT+02:00 mark manwoodv...@googlemail.com:
I guess I'm looking for a more general way to use complex graphs of
objects that
Correction: I am not able to convert the Scala statement to Java.
NoClassDefFoundError differs from ClassNotFoundException: it indicates an error
while initializing that class, even though the class is found on the classpath.
Please provide the full stack trace.
2015-08-14 4:59 GMT-07:00 stelsavva stel...@avocarrot.com:
Hello, I am just starting out with
Hi
I am facing a huge performance problem when I try to left outer join a very
big data set (~140GB) with a bunch of small lookups [star schema type]. I am
using DataFrames in Spark SQL. It looks like data is shuffled and skewed when
that join happens. Is there any way to improve performance
Which version of spark are you using? Did you try with --driver-class-path
configuration?
Thanks
Best Regards
On Fri, Aug 14, 2015 at 2:05 PM, Kyle Lin kylelin2...@gmail.com wrote:
Hi all
I had similar usage and also got the same problem.
I guess Spark uses some class in my user jars but
Both work the same way, but with Spark SQL you will get the optimization
etc. done by the Catalyst optimizer. One important thing to consider is the
number of partitions and the key distribution (when you are doing RDD.join);
if the keys are not evenly distributed across machines then you can see the
process
Thanks for the clarifications, Mridul.
Thanks
Best Regards
On Fri, Aug 14, 2015 at 1:04 PM, Mridul Muralidharan mri...@gmail.com
wrote:
What I understood from Imran's mail (and what was referenced in his
mail) the RDD mentioned seems to be violating some basic contracts on
how partitions are
Hi all
I had similar usage and also got the same problem.
I guess Spark uses some class in my user jars, but actually it should use the
class in spark-assembly-xxx.jar; I don't know how to fix it.
Kyle
2015-07-22 23:03 GMT+08:00 Ashish Soni asoni.le...@gmail.com:
Hi All ,
I am getting
Looks like a jar version conflict to me.
Thanks
Best Regards
On Thu, Aug 13, 2015 at 7:59 PM, satish chandra j jsatishchan...@gmail.com
wrote:
Hi,
Please let me know if I am missing anything in the mail below, so the
issue can be fixed.
Regards,
Satish Chandra
On Wed, Aug 12, 2015 at 6:59
You could cache the lookup DataFrames, it’ll then do a broadcast join.
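A rough sketch (paths, table, and column names are hypothetical): cache and materialize the small lookup so the planner knows its size, then join it against the large fact table; if it fits under the broadcast threshold, the join becomes a broadcast join.

// Small dimension table: cache it and materialize so its size is known.
val lookupDF = sqlContext.read.parquet("/data/lookup").cache()
lookupDF.count()

// Large fact table (factDF) left-outer-joined against the cached lookup.
val joined = factDF.join(lookupDF, factDF("lookup_id") === lookupDF("id"), "left_outer")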
On 8/14/15, 9:39 AM, VIJAYAKUMAR JAWAHARLAL sparkh...@data2o.io wrote:
Hi
I am facing huge performance problem when I am trying to left outer join very
big data set (~140GB) with bunch of small lookups [Start schema
You can put a try..catch around all the transformations that you are doing
and catch such exceptions instead of crashing your entire job.
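For example, at the record level (a sketch; parseRecord is a hypothetical transformation that may throw):

import scala.util.Try

// One bad record no longer fails the whole job; failures are simply dropped here,
// but you could also log them or route them to a side output.
val parsed = rawRdd.flatMap(record => Try(parseRecord(record)).toOption)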
Thanks
Best Regards
On Fri, Aug 14, 2015 at 4:35 PM, hide x22t33...@gmail.com wrote:
Hello,
I'm using spark on yarn cluster and using
I think you can try the DataFrame create API that takes an RDD[Row] and a
StructType...
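Something along these lines (the column names are made up); the point is that column C gets an explicit type instead of being inferred from all-null values:

import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

// Declare column C's type explicitly so inference never sees only nulls.
val schema = StructType(Seq(
  StructField("A", StringType, nullable = true),
  StructField("C", StringType, nullable = true)))

// Build Row objects however suits your data, then apply the schema.
val rowRDD = sc.parallelize(Seq(Row("a1", null), Row("a2", null)))
val df = sqlContext.createDataFrame(rowRDD, schema)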
On Aug 11, 2015 4:28 PM, Jyun-Fan Tsai jft...@appier.com wrote:
Hi all,
I'm using Spark 1.4.1. I create a DataFrame from a JSON file. There is
a column C whose values are all null in the JSON file. I found that
There's a long recent thread in this list about stopping apps; the subject was
"stopping spark stream app".
At 1 second I wouldn't run repeated RDDs, no.
I'd take a look at subclassing, personally (you'll have to rebuild the
streaming kafka project since a lot is private), but if topic changes don't
Hi All,
I have a working program in which I create two big Tuple2s out of the data.
This seems to work locally, but when I switch over to cluster standalone mode,
I get this error at the very beginning:
15/08/14 10:22:25 WARN TaskSetManager: Lost task 4.0 in stage 1.0 (TID 10,
162.101.194.44):
How do I get beyond the "This post has NOT been accepted by the mailing list
yet" message? This message was posted through the nabble interface; one
would think that would be enough to get the message accepted.
Use your email client to send a message to the mailing list from the email
address you used to subscribe?
The message you just sent reached the list
On Fri, Aug 14, 2015 at 9:36 AM, dutrow dan.dut...@gmail.com wrote:
How do I get beyond the This post has NOT been accepted by the mailing
list
For those who find this post and may be interested, the most thorough
documentation on the subject may be found here:
https://github.com/koeninger/kafka-exactly-once
Thanks. Looking at the KafkaCluster.scala code, (
https://github.com/apache/spark/blob/master/external/kafka/src/main/scala/org/apache/spark/streaming/kafka/KafkaCluster.scala#L253),
it seems a little hacky for me to alter and recompile spark to expose those
methods, so I'll use the receiver API
Hi Jerry,
This blog post is perfect for window functions in Spark:
https://databricks.com/blog/2015/07/15/introducing-window-functions-in-spark-sql.html
and a generic SQL usage from the oracle-base blog:
https://oracle-base.com/articles/misc/lag-lead-analytic-functions
It seems you are not using
Our additional question on checkpointing is basically about the logistics of it:
At which point does the data get written to the checkpoint? Is it written
as soon as the driver program retrieves an RDD from Kafka (or another
source)? Or is it written after that RDD has been processed and we're
I want to test some Spark Streaming code that is using
reduceByKeyAndWindow. If I do not enable checkpointing, I get the error:
java.lang.IllegalArgumentException: requirement failed: The checkpoint
directory has not been set. Please set it by StreamingContext.checkpoint().
But if I enable
I still want to check if anyone can provide any help with this: Spark 1.2.2
hangs on our production cluster when reading big HDFS data (7800 Avro
blocks), while it looks fine for small data (769 Avro blocks).
I enabled the debug level in the Spark log4j configuration, and attached the log file if it
I don't entirely agree with that assessment. Not paying for extra cores to
run receivers was about as important as delivery semantics, as far as
motivations for the API go.
As I said in the JIRA tickets on the topic, if you want to use the direct
API and save offsets to ZK, you can. The right way
So it seems like DataFrames aren't going to give me a break and just work. Now
it evaluates, but goes nuts if it runs into a null case OR doesn't know how
to get the correct data type when I specify the default value as a string
expression. Let me know if anyone has a workaround for this. PLEASE HELP
In summary, it appears that the use of the DirectAPI was intended
specifically to enable exactly-once semantics. This can be achieved for
idempotent transformations and with transactional processing using the
database to guarantee an onto mapping of results based on inputs. For the
latter, you
-- Forwarded message --
From: Ranjana Rajendran ranjana.rajend...@gmail.com
Date: Thu, Aug 13, 2015 at 7:37 AM
Subject: Graphx - how to add vertices to a HashSet of vertices ?
To: d...@spark.apache.org
Hi,
sampledVertices is a HashSet of vertices
var sampledVertices:
Hello all,
I am writing a program which reads from a database. I run a couple of
computations, but at the end I have a while loop in which I make a
modification to the persisted data, e.g.:
val data = PairRDD... persist()
var i = 0
while (i < 10) {
  val data_mod = data.map(x => (x._1 + 1, x._2))
Thanks Marcelo. But our problem is a little more complicated.
We have 10+ FTP sites that we will be transferring data from. The FTP server
info, filename, and credentials are all coming via Kafka messages. So, I want to
read those Kafka messages and dynamically connect to the FTP site and download
those
I just pushed some code that does this for spark-testing-base
(https://github.com/holdenk/spark-testing-base) (it's in master) and will
publish an updated artifact with it tonight.
On Fri, Aug 14, 2015 at 3:35 PM, Tathagata Das t...@databricks.com wrote:
A hacky workaround is to create a
Hi Salih,
Normally I do sort before performing that operation, but since I've been
trying to get this working for a week, I'm just loading something simple to
test if lag works. Earlier I was having DB issues so it's been a long
run of solving one runtime exception after another. Hopefully
@Koen,
If you meant to reply to my question on distributing matrices, could you
re-send, as there was no content in your post.
Thanks,
On 08/07/2015 10:02 AM, Koen Vantomme wrote:
Sent from my Sony Xperia™ smartphone
iceback wrote:
Is this the sort of problem spark
Spark Streaming seems to be creating 0-byte files even when there is no data.
Also, I have 2 concerns here:
1) Extra unnecessary files are being created in the output
2) Hadoop doesn't work really well with too many files, and I see that it is
creating a directory with a timestamp every 1 second.
First you create the file:
final File outputFile = new File(outputPath);
Then you write to it:
Files.append(counts + "\n", outputFile, Charset.defaultCharset());
Cheers
On Fri, Aug 14, 2015 at 4:38 PM, Mohit Anchlia mohitanch...@gmail.com
wrote:
I thought prefix meant the output
A hacky workaround is to create a custom InputDStream that creates the
right RDDs based on a function. The TestInputDStream
https://github.com/apache/spark/blob/master/streaming/src/test/scala/org/apache/spark/streaming/TestSuiteBase.scala#L61
does something similar for Spark Streaming unit
Please take a look at JavaPairDStream.scala:
def saveAsHadoopFiles[F <: OutputFormat[_, _]](
    prefix: String,
    suffix: String,
    keyClass: Class[_],
    valueClass: Class[_],
    outputFormatClass: Class[F]) {
Did you intend to use outputPath as the prefix?
Cheers
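For what it's worth, a Scala sketch of how the prefix behaves (the paths are hypothetical): each batch is written out as <prefix>-<batch time in ms>.<suffix>, so the prefix is effectively the output path plus a file-name stem.

import org.apache.hadoop.io.{IntWritable, Text}
import org.apache.hadoop.mapred.TextOutputFormat

// Assuming wordCounts is a DStream[(String, Int)] built earlier in the job.
val writableCounts = wordCounts.map { case (w, c) => (new Text(w), new IntWritable(c)) }

// Each batch lands in e.g. hdfs:///user/foo/counts-1439600000000.txt
writableCounts.saveAsHadoopFiles[TextOutputFormat[Text, IntWritable]](
  "hdfs:///user/foo/counts", "txt")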
On Fri, Aug
Spark 1.3
Code:
wordCounts.foreachRDD(new Function2<JavaPairRDD<String, Integer>, Time, Void>() {
  @Override
  public Void call(JavaPairRDD<String, Integer> rdd, Time time) throws IOException {
    String counts = "Counts at time " + time + " " + rdd.collect();
    System.out.println(counts);
On Fri, Aug 14, 2015 at 2:11 PM, Varadhan, Jawahar
varad...@yahoo.com.invalid wrote:
And hence, I was planning to use Spark Streaming with Kafka, or Flume with
Kafka. But Flume runs on a JVM and may not be the best option, as the huge
file will create memory issues. Please suggest some way to
You'll resume and re-process the RDD that didn't finish
On Fri, Aug 14, 2015 at 1:31 PM, Dmitry Goldenberg dgoldenberg...@gmail.com
wrote:
Our additional question on checkpointing is basically the logistics of it
--
At which point does the data get written into checkpointing? Is it
written
Still not cooperating...
lag(A,1,'X') OVER (ORDER BY A) as LA
^
at scala.sys.package$.error(package.scala:27)
at
org.apache.spark.sql.catalyst.SqlParser.parseExpression(SqlParser.scala:45)
at
Additional Comment:
I checked the disk usage on the 3 nodes (using iostat) and it seems that
reading from HDFS partitions happens on a node-by-node basis. Only one of
the nodes shows active IO (as reads) at any given time while the other two
nodes are idle IO-wise. I am not sure why the tasks are
I thought prefix meant the output path? What's the purpose of prefix and
where do I specify the path if not in prefix?
On Fri, Aug 14, 2015 at 4:36 PM, Ted Yu yuzhih...@gmail.com wrote:
Please take a look at JavaPairDStream.scala:
def saveAsHadoopFiles[F <: OutputFormat[_, _]](
prefix:
I am running on YARN and have a question about how Spark runs executors on
different data nodes. Is that primarily decided based on the number of
receivers?
What do I need to do to ensure that multiple nodes are being used for data
processing?
I feel the real fix here is to remove the exception from QueueInputDStream
class by reverting the fix of
https://issues.apache.org/jira/browse/SPARK-8630
I can write another class that is identical to the QueueInputDStream class
except it does not throw the exception. But this feels like a
Convert it to an RDD, then save the RDD to a file:
val str = "dank memes"
sc.parallelize(List(str)).saveAsTextFile("str.txt")
On Fri, Aug 14, 2015 at 7:50 PM, go canal goca...@yahoo.com.invalid wrote:
Hello again,
online resources have sample code for writing RDD to a file, but I have a
simple
Another fix might be to remove the exception that is thrown when windowing
and other stateful operations are used without checkpointing.
On Fri, Aug 14, 2015 at 5:43 PM, Asim Jalis asimja...@gmail.com wrote:
I feel the real fix here is to remove the exception from QueueInputDStream
class by
Thanks, Cody. It sounds like Spark Streaming has enough state info to know
how many batches have been processed, and if not all of them have been, then the
RDD is 'unfinished'. I wonder if it would know whether the last micro-batch has
been fully processed successfully. Hypothetically, the driver program
Thank you very much. Just a quick question - I try to save a string this way,
but the file is always empty:
val file = Path("sample data/ZN_SPARK.OUT").createFile(true)
file.bufferedWriter().write(im.toString())
file.bufferedWriter().flush()
file.bufferedWriter().close()
anything
In Spark 1.4 there is a parameter to control that. Its default value is 10
MB. So you need to cache your DataFrame to hint at its size.
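If the parameter meant here is the broadcast-join size threshold, it can be raised like this (the value is just an example; double-check the property name against your Spark version):

// Broadcast tables up to ~100 MB instead of the ~10 MB default.
sqlContext.setConf("spark.sql.autoBroadcastJoinThreshold", (100 * 1024 * 1024).toString)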
On Aug 14, 2015 7:09 PM, VIJAYAKUMAR JAWAHARLAL sparkh...@data2o.io
wrote:
Hi
I am facing huge performance problem when I am trying to left outer join
very big
Hello again, online resources have sample code for writing an RDD to a file, but I
have a simple string; how do I save it to a text file? (My data is actually a
DenseMatrix.)
Appreciate any help! Thanks, canal
Jyun Fan,
Here is how I have been doing it. I found that I needed to define the
schema when loading the JSON file first.
Francis
import datetime
from pyspark.sql.types import *
# Define schema
upSchema = StructType([
    StructField("field 1", StringType(), True),
    StructField("field 2", LongType(),
Hello, I am just starting out with Spark Streaming and HBase/Hadoop. I'm
writing a simple app to read from Kafka and store to HBase, and I am having
trouble submitting my job to Spark.
I've downloaded Apache Spark 1.4.1 pre-built for Hadoop 2.6.
I am building the project with mvn package
and
Hi Cody,
by start/stopping, do you mean the streaming context or the app entirely?
From what I understand once a streaming context has been stopped it cannot
be restarted, but I also haven't found a way to stop the app
programmatically.
The batch duration will probably be around 1-10 seconds. I
Hello Akhil
I use Spark 1.4.2 on HDP 2.1 (Hadoop 2.4)
I didn't use --driver-class-path. I only use
spark.executor.userClassPathFirst=true
Kyle
2015-08-14 17:11 GMT+08:00 Akhil Das ak...@sigmoidanalytics.com:
Which version of spark are you using? Did you try with --driver-class-path
Hi Akhil,
Which jar version is conflicting, and what needs to be done to fix it?
Regards,
Satish Chandra
On Fri, Aug 14, 2015 at 2:44 PM, Akhil Das ak...@sigmoidanalytics.com
wrote:
Looks like a jar version conflict to me.
Thanks
Best Regards
On Thu, Aug 13, 2015 at 7:59 PM, satish
Data skew is still a problem with Spark.
- If you use groupByKey, try to express your logic without groupByKey (see the
sketch below).
- If you need to use groupByKey, all you can do is scale vertically.
- If you can, repartition with a finer HashPartitioner. You will have many
tasks for each stage, but tasks
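For the first point, the usual rewrite looks roughly like this (pairs is a hypothetical RDD of (key, count) pairs, and the goal is assumed to be an aggregate per key):

// groupByKey ships every value for a key to one task, so a skewed key becomes a hot spot.
val counts1 = pairs.groupByKey().mapValues(_.sum)

// reduceByKey combines locally before the shuffle, so much less data moves
// and a skewed key hurts far less.
val counts2 = pairs.reduceByKey(_ + _)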
Hi Eugene,
In my case the list of values that I want to sort and write to a separate
file is fairly small, so the way I solved it is the following:
.groupByKey().foreach(e => {
  val hadoopConfig = new Configuration()
  val hdfs = FileSystem.get(hadoopConfig)
  val newPath = rootPath + "/" + e._1
Looks like you compiled the code with one Scala version but ran your app using
a different, incompatible version.
BTW, you should not use PrintWriter like this to save your results. There
may be multiple tasks running on the same host, and your job will fail
because you are trying to write to the same
Which Spark release are you using?
Can you show us a snippet of your code?
Have you checked the namenode log?
Thanks
On Aug 13, 2015, at 10:21 PM, Mohit Anchlia mohitanch...@gmail.com wrote:
I was able to get this working by using an alternative method; however, I only
see 0-byte files in