Thanks Davies and Eric. I followed Davies' instructions and it works
wonderfully.
I would add that you can also add these scripts in the pyspark shell too:
pyspark --py-files support.py
where support.py is your script containing your class as Davies described.
Best,
Guillaume Guy
The trick here is getting the Scala compiler to do the implicit conversion
from Symbol to Column. In your second example, the compiler doesn't know
that you are going to try to use the Seq[Symbol] as a Seq[Column] and so
doesn't do the conversion. The following are other ways to provide enough
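For example, a minimal sketch (assuming a Spark 1.3+ DataFrame named df with columns "a" and "b"; the names are placeholders) that sidesteps the implicit by building the Columns explicitly:
import org.apache.spark.sql.Column
// Build the columns yourself so no Seq[Symbol] => Seq[Column] conversion is needed.
val cols: Seq[Column] = Seq("a", "b").map(df.col)
df.select(cols: _*)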
Hi
I tried to run Streaming Linear Regression in my local.
val trainingData =
ssc.textFileStream("/home/barisakgu/Desktop/Spark/train").map(LabeledPoint.parse)
textFileStream is not seeing the new files. I searched on the Internet and
saw that somebody has the same issue, but no solution was found.
Yeah, I do manually delete the files, but it still fails with this error.
On Feb 19, 2015, at 8:16 PM, Ganelin, Ilya ilya.gane...@capitalone.com
wrote:
When writing to hdfs Spark will not overwrite existing files or directories.
You must either manually delete these or use Java's Hadoop
Hello all,
I am trying to integrate Spark with Tranquility
(https://github.com/metamx/tranquility) and when I start the Spark program,
I get the error below:
java.lang.NoClassDefFoundError: org/codehaus/jackson/annotate/JsonClass
at
If you know there is data that doesn't belong to any existing category,
put it into the training set and make a new category for it. It
won't help much if instances from this unknown category are all
outliers. In that case, lower the thresholds and tune the parameters
to get a lower error rate.
On Thu, Feb 19, 2015 at 7:57 AM, jamborta jambo...@gmail.com wrote:
Hi all,
I think I have run into an issue with the lazy evaluation of variables in
pyspark. I have the following:
functions = [func1, func2, func3]
for counter in range(len(functions)):
data = data.map(lambda value:
Well:
I think that I solved my issue in the following way:
val variable_fieldsStr = List("field1", "field2")
val variable_argument_list = variable_fieldsStr.map(f => Alias(Symbol(f), f)())
val schm2 = myschemaRDD.select(variable_argument_list: _*)
schm2 seems to have the required fields, but I would like
schm2 seems to have the required fields, but would like
Hello everybody,
I'm quite new to Spark and Scala as well and I was trying to analyze some
csv data via spark-sql
My csv file contains data like this
Following the example at this link below
https://spark.apache.org/docs/latest/sql-programming-guide.html#inferring-the-schema-using-reflection
Thanks for your reply Sean.
Looks like it's happening in a map:
15/02/18 20:35:52 INFO scheduler.DAGScheduler: Submitting 100 missing tasks
from Stage 1 (MappedRDD[17] at mapToPair at
NativeMethodAccessorImpl.java:-2)
That's my initial 'parse' stage, done before repartitioning. It reduces the
Hello,
We have a Spark Streaming application that watches an input directory, and
as files are copied there the application reads them and sends the contents
to a RESTful web service, receives a response and write some contents to an
output directory.
When testing the application by copying a
On the advice of some recent discussions on this list, I thought I would
try and consume gz files directly. I'm reading them, doing a preliminary
map, then repartitioning, then doing normal spark things.
As I understand it, zip files aren't readable in partitions because of the
format, so I
This should result in 4 executors, not 25. They should be able to
execute 4*4 = 16 tasks simultaneously. You have them grab 4*32 = 128GB
of RAM, not 1TB.
It still feels like this shouldn't be running out of memory, not by a
long shot though. But just pointing out potential differences between
Hi,
I would like you to read my stack overflow answer to this question. If you need
more clarification feel free to drop a msg.
http://stackoverflow.com/questions/28403664/connect-to-existing-hive-in-intellij-using-sbt-as-build
Regards,
Ashutosh
From:
What version of Spark are you using?
TD
On Thu, Feb 19, 2015 at 2:45 AM, Emre Sevinc emre.sev...@gmail.com wrote:
Hello,
We have a Spark Streaming application that watches an input directory, and
as files are copied there the application reads them and sends the contents
to a RESTful web
Hi,
when getting the model out of ALS.train it would be beneficial to store it (to
disk) so the model can be reused later for any following predictions. I am
using pyspark and I had no luck pickling it either using standard pickle module
or even dill.
does anyone have a solution for this (note
Hi,
In our job, we need to process the data in small chunks, so as to avoid GC
and other issues. For this, we are using the old Hadoop API, as it lets us
specify parameters like minPartitions.
Does anyone know if there is a way to do the same via the new Hadoop API as well?
How would that way be different
gzip and zip are not splittable compression formats; bzip and lzo are.
Ideally, use a splittable compression format.
Repartitioning is not a great solution since it means a shuffle, typically.
This is not necessarily related to how big your partitions are. The
question is, when does this happen?
Hi,
I have 4 very powerful boxes (256GB RAM, 32 cores each). I am running spark
1.2.0 in yarn-client mode with following layout:
spark.executor.cores=4
spark.executor.memory=28G
spark.yarn.executor.memoryOverhead=4096
I am submitting bigger ALS trainImplicit task (rank=100, iters=15) on a
Based on the Spark UI I am running 25 executors for sure. Why would you expect
four? I submit the task with --num-executors 25 and I get 6-7 executors running
per host (using more, smaller executors gives me better cluster utilization
when running parallel spark sessions (which is not the case
Oh OK you are saying you are requesting 25 executors and getting them,
got it. You can consider making fewer, bigger executors to pool rather
than split up your memory, but at some point it becomes
counter-productive. 32GB is a fine executor size.
So you have ~8GB available per task which seems
it is from within the ALS.trainImplicit() call. btw. the exception varies
between this GC overhead limit exceeded and Java heap space (which I guess
is just a different outcome of the same problem).
just tried another run and here are the logs (filtered) - note I tried this run
with
On Thu, Feb 19, 2015 at 12:27 PM, Tathagata Das t...@databricks.com wrote:
What version of Spark are you using?
TD
Spark version is 1.2.0 (running on Cloudera CDH 5.3.0)
--
Emre Sevinç
I think that the newer Hadoop API does not expose this suggested min
partitions parameter like the old one did. I believe you can try
setting mapreduce.input.fileinputformat.split.{min,max}size instead on
the Hadoop Configuration to suggest a max/min split size, and
therefore bound the number of
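A hedged sketch of what that could look like (the input path and split sizes are placeholders):
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

// Suggest min/max split sizes to the new-API input format.
val hconf = new Configuration(sc.hadoopConfiguration)
hconf.set("mapreduce.input.fileinputformat.split.minsize", (64L * 1024 * 1024).toString)
hconf.set("mapreduce.input.fileinputformat.split.maxsize", (128L * 1024 * 1024).toString)

val lines = sc.newAPIHadoopFile("/path/to/input", classOf[TextInputFormat],
  classOf[LongWritable], classOf[Text], hconf).map(_._2.toString)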
This is because each line will be split into 4 columns instead of 3.
If you want to use a comma to separate columns, then no column may itself
contain a comma.
2015-02-19 18:12 GMT+08:00 sparkino francescoboname...@gmail.com:
Hello everybody,
I'm quite new to
I am running Spark on Mesos and it works quite well. I have three
users, all of whom set up iPython notebooks to instantiate a Spark instance
to work with in the notebooks. I love it so far.
Since I am auto-instantiating (I don't want a user to have to
think about instantiating and submitting a spark
Hi Cesar,
these methods will be private until the new ml API stabilizes (approximately
in Spark 1.4). My solution for the same issue was to create an
org.apache.spark.ml package in my project and extend/implement
everything there.
Thanks,
Peter Rudenko
On 2015-02-18 22:17, Cesar Flores wrote:
I
now with spark.shuffle.io.preferDirectBufs reverted (to true) I am again getting GC
overhead limit exceeded:
=== spark stdout ===
15/02/19 12:08:08 WARN scheduler.TaskSetManager: Lost task
7.0 in stage 18.0 (TID 5329, 192.168.1.93): java.lang.OutOfMemoryError: GC
overhead limit exceeded at
Hi Yanbo,
unfortunately all csv files contain commas inside some columns and I can't
change the structure.
How can I work with this kind of textfile and spark-sql?
Thank you again
2015-02-19 14:38 GMT+01:00 Yanbo Liang hackingda...@gmail.com:
This is because of each line will be separated
At the beginning of the code, do a query to find the current maximum ID
Don't just put in an arbitrarily large value, or all of your rows will end
up in 1 spark partition at the beginning of the range.
The question of keys is up to you... all that you need to be able to do is
write a sql
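A hedged sketch of that pattern with JdbcRDD (the connection string, table and column names are all hypothetical):
import java.sql.{DriverManager, ResultSet}
import org.apache.spark.rdd.JdbcRDD

def getConnection() = DriverManager.getConnection("jdbc:postgresql://host/db", "user", "pass")

// Query the real ID range up front instead of guessing an upper bound.
val (minId, maxId) = {
  val conn = getConnection()
  try {
    val rs = conn.createStatement().executeQuery("SELECT MIN(id), MAX(id) FROM events")
    rs.next()
    (rs.getLong(1), rs.getLong(2))
  } finally conn.close()
}

val rows = new JdbcRDD(sc, getConnection _,
  "SELECT * FROM events WHERE id >= ? AND id <= ?",
  minId, maxId, 20,                          // 20 partitions spread over the real range
  (rs: ResultSet) => rs.getString("payload"))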
Great, glad it worked out!
From: Todd Nist
Date: Thursday, February 19, 2015 at 9:19 AM
To: Silvio Fiorito
Cc: user@spark.apache.org
Subject: Re: SparkSQL + Tableau Connector
Hi Silvio,
I got this working today using your suggestion with the Initial SQL and a
Custom
Yep, the matrix factorization model has two RDDs of factor vectors representing the
decomposed matrix. You can save these to disk and reuse them.
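For example, a minimal Scala sketch (paths and ids are placeholders; the pyspark equivalent would be similar):
// Persist the two factor RDDs that make up the ALS model.
model.userFeatures.saveAsObjectFile("/models/als/userFeatures")
model.productFeatures.saveAsObjectFile("/models/als/productFeatures")

// Later: reload them and score a (user, product) pair with a dot product.
val users = sc.objectFile[(Int, Array[Double])]("/models/als/userFeatures")
val products = sc.objectFile[(Int, Array[Double])]("/models/als/productFeatures")
val userVec = users.lookup(42).head       // hypothetical user id
val prodVec = products.lookup(7).head     // hypothetical product id
val score = userVec.zip(prodVec).map { case (u, p) => u * p }.sum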
On Thu, Feb 19, 2015 at 2:19 AM Antony Mayi antonym...@yahoo.com.invalid
wrote:
Hi,
when getting the model out of ALS.train it would be beneficial to store it
(to disk) so
Hi Anthony - you are seeing a problem that I ran into. The underlying issue
is your default parallelism setting. What's happening is that within ALS
certain RDD operations end up changing the number of partitions you have of
your data. For example if you start with an RDD of 300 partitions, unless
Hi,
I am doing lookups on cached RDDs [(Int, String)], and I noticed that the
lookup is relatively slow, 30-100 ms. I even tried this on one machine
with a single partition, but no difference!
The RDDs are not large at all, 3-30 MB.
Is this expected behaviour? should I use other data structures,
RDDs are not Maps. lookup() does a linear scan -- parallel by
partition, but still linear. Yes, it is not supposed to be an O(1) lookup
data structure. It'd be much nicer to broadcast the relatively small
data set as a Map and look it up fast, locally.
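A minimal sketch of that approach (smallRdd and bigRdd are hypothetical names):
// Collect the small (Int, String) RDD to the driver and broadcast it as a Map.
val lookupMap = sc.broadcast(smallRdd.collectAsMap().toMap)

// Executors can now do local O(1) lookups inside any transformation.
val enriched = bigRdd.map { case (id, value) =>
  (id, value, lookupMap.value.getOrElse(id, "unknown"))
}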
On Thu, Feb 19, 2015 at 3:29 PM, shahab
Hi Shahab - if your data structures are small enough a broadcasted Map is
going to provide faster lookup. Lookup within an RDD is an O(m) operation
where m is the size of the partition. For RDDs with multiple partitions,
executors can operate on it in parallel so you get some improvement for
For your case, I think you can use a trick: split on "," (quote-comma-quote) instead
of a bare comma.
You can refer to the following code snippet:
val people = sc.textFile("examples/src/main/resources/data.csv").map(x =>
x.substring(1, x.length - 1).split("\",\"")).map(p => List(p(0), p(1), p(2)))
On Feb 19, 2015, at 10:02
There was already a thread around it if i understood your question
correctly, you can go through this
https://mail-archives.apache.org/mod_mbox/spark-user/201502.mbox/%3ccannjawtrp0nd3odz-5-_ya351rin81q-9+f2u-qn+vruqy+...@mail.gmail.com%3E
Thanks
Best Regards
On Thu, Feb 19, 2015 at 8:16 PM,
Hi Dhimant,
I believe it will work if you change your spark-shell invocation to pass --driver-class-path
/usr/local/spark/lib/mysql-connector-java-5.1.34-bin.jar instead of putting it in
--jars.
-Todd
On Wed, Feb 18, 2015 at 10:41 PM, Dhimant dhimant84.jays...@gmail.com
wrote:
Found the solution in one of the posts on
I am able to connect by doing the following using the Tableau Initial SQL
and a custom query:
1.
First ingest csv file or json and save out to file system:
import org.apache.spark.sql.SQLContext
import com.databricks.spark.csv._
val sqlContext = new SQLContext(sc)
val demo =
You have the keys before and after reduceByKey. You want to do
something based on the key within reduceByKey? it just calls
combineByKey, so you can use that method for lower-level control over
the merging.
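As a rough sketch (assuming an RDD of (String, Int) pairs named pairs), reduceByKey(_ + _) is equivalent to the combineByKey call below, and you can customize any of the three functions:
val reduced = pairs.combineByKey[Int](
  (v: Int) => v,                  // createCombiner: first value seen for a key
  (c: Int, v: Int) => c + v,      // mergeValue: map-side combine within a partition
  (c1: Int, c2: Int) => c1 + c2   // mergeCombiners: merge partials after the shuffle
)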
Whether it's possible depends I suppose on what you mean to filter on.
If it's just a
Is this the full stack trace ?
On Wed, Feb 18, 2015 at 2:39 AM, sachin Singh sachin.sha...@gmail.com
wrote:
Hi,
I want to run my spark Job in Hadoop yarn Cluster mode,
I am using below command -
spark-submit --master yarn-cluster --driver-memory 1g --executor-memory 1g
--executor-cores 1
Hi Sean,
This is what I intend to do:
are you saying that you know a key should be filtered based on its value
partway through the merge?
I should use combineByKey...
Thanks.
Deb
On Thu, Feb 19, 2015 at 6:31 AM, Sean Owen so...@cloudera.com wrote:
You have the keys before and after
I'm having an issue with spark 1.2.1 and scala 2.11. I detailed the
symptoms in this stackoverflow question.
http://stackoverflow.com/questions/28612837/spark-classnotfoundexception-when-running-hello-world-example-in-scala-2-11
Has anyone experienced anything similar?
Thank you!
Hi,
I have two RDD's with csv data as below :
RDD-1
101970_5854301840,fbcf5485-e696-4100-9468-a17ec7c5bb43,19229261643
101970_5854301839,fbaf5485-e696-4100-9468-a17ec7c5bb39,9229261645
101970_5854301839,fbbf5485-e696-4100-9468-a17ec7c5bb39,9229261647
Consider the following left outer join
potentialDailyModificationsRDD =
reducedDailyPairRDD.leftOuterJoin(baselinePairRDD).partitionBy(new
HashPartitioner(1024)).persist(StorageLevel.MEMORY_AND_DISK_SER());
Below are the record counts for the RDDs involved
Number of records for
Kafka ordering is guaranteed on a per-partition basis.
The high-level consumer api as used by the spark kafka streams prior to 1.3
will consume from multiple kafka partitions, thus not giving any ordering
guarantees.
The experimental direct stream in 1.3 uses the simple consumer api, and
there
Thanks Akhil, you are right.
I checked and found that I have only 1 core allocated to the program.
I am running on a virtual machine, and only allocated one processor to it (1 core
per processor), so even though I have specified --total-executor-cores 3 in the
submit script, the application will still
Hi Sasi,
I am not sure about Vaadin, but with some simple googling you can find many
articles on how to pass JSON parameters over HTTP.
http://stackoverflow.com/questions/21404252/post-request-send-json-data-java-httpurlconnection
You can also try Finagle, which is a fully fault-tolerant framework by
Hi Kelvin,
Yes. I am creating an uber jar with the Postgres driver included, but
nevertheless tried both --jars and --driver-class-path flags. It didn't help.
Interestingly, I can’t use BoneCP even in the driver program when I run my
application with spark-submit. I am getting the same exception
It appears that the file paths are different when running spark in
local and cluster mode. When running spark without --master the paths to
the pipe command are relative to the local machine. When running spark
with --master the paths to the pipe command are ./
This is what finally worked. I
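The message above is cut off; one common approach (a hedged sketch, the program name and path are placeholders) is to ship the executable with addFile and call it relative to the task's working directory:
// Distribute the program to every executor's working directory
// (the file must carry its executable permission).
sc.addFile("hdfs:///tools/ProgramSIM")
// ...then pipe through it using a relative path that is valid on the workers.
val piped = rdd.pipe("./ProgramSIM")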
Hi Mohammed,
Did you use --jars to specify your jdbc driver when you submitted your job?
Take a look of this link:
http://spark.apache.org/docs/1.2.0/submitting-applications.html
Hope this help!
Kelvin
On Thu, Feb 19, 2015 at 7:24 PM, Mohammed Guller moham...@glassbeam.com
wrote:
Hi –
I
Hi,
I am trying the spark streaming log analysis reference application provided by
Databricks at
https://github.com/databricks/reference-apps/tree/master/logs_analyzer
When I deploy the code to the standalone cluster, there is no output at all
with the following shell script. Which means, the
Thanks Todd. great stuff :)
Regards,
Ashu
From: Todd Nist tsind...@gmail.com
Sent: Thursday, February 19, 2015 7:46 PM
To: Ashutosh Trivedi (MT2013030)
Cc: user@spark.apache.org
Subject: Re: Tableau beta connector
I am able to connect by doing the following
Hi,
I have HDFS file of size 598MB. I create RDD over this file and cache it in
RAM in a 7 node cluster with 2G RAM each. I find that each partition gets
replicated thrice or even 4 times in the cluster even without me specifying
in code. Total partitions are 5 for the RDD created but cached
I'm a bit new to Spark, but had a question on performance. I suspect a lot of
my issue is due to tuning and parameters. I have a Hive external table on
this data and to run queries against it runs in minutes
The Job:
+ 40gb of avro events on HDFS (100 million+ avro events)
+ Read in the files
I had a chance to talk to TD today at the Strata+Hadoop Conf in San Jose.
We talked a bit about this after his presentation - the short
answer is spark streaming does not guarantee any sort of ordering (within
batches, across batches). One would have to use updateStateByKey to
collect
Hi,
Vertices are simply hash-partitioned by spark.HashPartitioner, so
you can easily calculate partition ids by yourself.
Also, you can type the lines to check ids:
import org.apache.spark.graphx._
graph.vertices.mapPartitionsWithIndex { (pid, iter) =>
val vids = Array.newBuilder[VertexId]
for
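The snippet above is truncated; a hedged sketch of the same idea in full:
import org.apache.spark.graphx._

// For each partition, collect the vertex ids it holds and print a summary.
graph.vertices.mapPartitionsWithIndex { (pid, iter) =>
  Iterator((pid, iter.map(_._1).toArray))
}.collect().foreach { case (pid, vids) =>
  println(s"partition $pid holds ${vids.length} vertices")
}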
Even with the new direct streams in 1.3, isn't it the case that the job
*scheduling* follows the partition order, rather than job *execution*? Or
is it the case that the stream listens to job completion event (using a
streamlistener) before scheduling the next batch? To compare with storm
from a
Can you downgrade your scala dependency to 2.10 and give it a try?
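For instance, in build.sbt (the versions shown are just illustrative for the Spark 1.2.x era):
scalaVersion := "2.10.4"
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.2.1"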
Thanks
Best Regards
On Fri, Feb 20, 2015 at 12:40 AM, Luis Solano l...@pixable.com wrote:
I'm having an issue with spark 1.2.1 and scala 2.11. I detailed the
symptoms in this stackoverflow question.
Not quite sure, but this can be the case. One of your executors is stuck in a
GC pause while the other one asks for data from it, and hence the
request times out, ending in that exception. You can try increasing the akka
frame size and ack wait timeout as follows:
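The settings themselves are cut off above; a hedged sketch of what I believe was meant (the values are illustrative):
val conf = new org.apache.spark.SparkConf()
  .set("spark.akka.frameSize", "100")                    // MB, default is 10
  .set("spark.core.connection.ack.wait.timeout", "600")  // seconds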
While running the program go to your cluster's web UI (that runs on 8080,
probably at hadoop.master:8080) and see how many cores are allocated to the
program; it should be >= 2 for the stream to get processed.
[image: Inline image 1]
Thanks
Best Regards
On Fri, Feb 20, 2015 at 9:29 AM,
Already fixed: https://github.com/apache/spark/pull/2802
On Thu, Feb 19, 2015 at 3:17 PM, Mohnish Kodnani mohnish.kodn...@gmail.com
wrote:
Hi,
I am trying to use percentile and getting the following error. I am using
spark 1.2.0. Does UDAF percentile exist in that code line and do i have to
The percentile UDAF came in across a couple of PRs. Commit
f33d55046427b8594fd19bda5fd2214eeeab1a95 reflects the most recent work, I
believe. It will be part of the 1.3.0 release very soon:
Hi all,
I think I have run into an issue with the lazy evaluation of variables in
pyspark. I have the following:
functions = [func1, func2, func3]
for counter in range(len(functions)):
data = data.map(lambda value: [functions[counter](value)])
it looks like the counter is evaluated when
What file system are you using ?
If you use hdfs, the documentation you cited is pretty clear on how
partitions are determined.
bq. file X replicated on 4 machines
I don't think replication factor plays a role w.r.t. partitions.
On Thu, Feb 19, 2015 at 8:05 AM, Alessandro Lulli
Hello,
I would like to use the spark MLlib recommendation filtering library. My
goal will be to predict what a user would like to buy based on what he
bought before.
I read in the Spark documentation that Spark supports implicit feedback.
However, there is no example for this application.
I'm not sure what your use case is, but perhaps you could use mapPartitions
to reduce across the individual partitions and apply your filtering. Then
you can finish with a reduceByKey.
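A rough sketch of that idea (pairs, the Int values, and the threshold are all assumptions; note the filter here is applied to per-partition partial sums):
val threshold = 10
val result = pairs
  .mapPartitions { iter =>
    // Reduce within the partition first (like a map-side combine).
    val local = scala.collection.mutable.Map.empty[String, Int]
    iter.foreach { case (k, v) => local(k) = local.getOrElse(k, 0) + v }
    // Filter keys locally before anything is shuffled.
    local.iterator.filter { case (_, sum) => sum >= threshold }
  }
  .reduceByKey(_ + _)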
On Thu, Feb 19, 2015 at 9:21 AM, Debasish Das debasish.da...@gmail.com
wrote:
Hi,
Before I send out the keys
On Feb 19, 2015, at 7:29 PM, Pavel Velikhov pavel.velik...@icloud.com wrote:
I have a simple Spark job that goes out to Cassandra, runs a pipe and stores
results:
val sc = new SparkContext(conf)
val rdd = sc.cassandraTable("keyspace", "table")
.map(r => r.getInt("column") + "\t" +
I have a simple Spark job that goes out to Cassandra, runs a pipe and stores
results:
val sc = new SparkContext(conf)
val rdd = sc.cassandraTable("keyspace", "table")
.map(r => r.getInt("column") + "\t" +
write(get_lemmas(r.getString(tags
.pipe(python3
Thanks for your detailed reply Imran. I'm writing this in Clojure (using
Flambo which uses the Java API) but I don't think that's relevant. So
here's the pseudocode (sorry I've not written Scala for a long time):
val rawData = sc.hadoopFile("/dir/to/gzfiles") // NB multiple files.
val parsedFiles =
Hi All,
Could you please help me understanding how Spark defines the number of
partitions of the RDDs if not specified?
I found the following in the documentation for file loaded from HDFS:
"The textFile method also takes an optional second argument for controlling
the number of partitions of
It's shown at
http://spark.apache.org/docs/latest/mllib-collaborative-filtering.html
It's really not different to use. It's suitable when you have
count-like data rather than rating-like data. That's what you have
here.
I am not sure what you mean by wanting to add frequency too, but no,
the
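For reference, a minimal sketch of implicit-feedback ALS (the purchases RDD, ids and parameters are placeholders):
import org.apache.spark.mllib.recommendation.{ALS, Rating}

// "Ratings" here are purchase counts rather than explicit scores.
val counts = purchases.map { case (user, product, count) =>
  Rating(user, product, count.toDouble)
}
val model = ALS.trainImplicit(counts, /*rank*/ 10, /*iterations*/ 10, /*lambda*/ 0.01, /*alpha*/ 1.0)
val score = model.predict(42, 7)   // predicted preference of hypothetical user 42 for product 7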
By default you will have (file size in MB / 64) partitions. You can also set
the number of partitions when you read in a file with sc.textFile, as an
optional second parameter.
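For example (the path is a placeholder):
// Ask for at least 100 partitions when reading the file.
val lines = sc.textFile("hdfs:///data/input.txt", 100)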
On Thu, Feb 19, 2015 at 8:07 AM Alessandro Lulli lu...@di.unipi.it wrote:
Hi All,
Could you please help me
When writing to hdfs Spark will not overwrite existing files or directories.
You must either manually delete these or use Java's Hadoop FileSystem class to
remove them.
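A hedged sketch of the FileSystem approach (the output path is a placeholder, rdd stands for whatever you are writing out):
import org.apache.hadoop.fs.{FileSystem, Path}

// Remove the existing output directory (recursively) before writing.
val fs = FileSystem.get(sc.hadoopConfiguration)
fs.delete(new Path("hdfs:///output/results"), true)
rdd.saveAsTextFile("hdfs:///output/results")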
-Original Message-
From: Pavel Velikhov
bq. blocks being 64MB by default in HDFS
In Hadoop 2.1+, the default block size has been increased.
See https://issues.apache.org/jira/browse/HDFS-4053
Cheers
On Thu, Feb 19, 2015 at 8:32 AM, Ted Yu yuzhih...@gmail.com wrote:
What file system are you using ?
If you use hdfs, the
As Ted Yu points out, default block size is 128MB as of Hadoop 2.1.
-Original Message-
From: Ilya Ganelin [ilgan...@gmail.com]
Sent: Thursday, February 19, 2015 12:13 PM Eastern Standard Time
To: Alessandro Lulli;
Hi,
Before I send out the keys for network shuffle, in reduceByKey after map +
combine are done, I would like to filter the keys based on some threshold...
Is there a way to get the key, value after map+combine stages so that I can
run a filter on the keys ?
Thanks.
Deb
Hi Joe,
The issue is not that you have input partitions that are bigger than 2GB --
it's just that they are getting cached. You can see in the stack trace that the
problem is when you try to read data out of the DiskStore:
org.apache.spark.storage.DiskStore.getBytes(DiskStore.scala:132)
Also, just
Hi Silvio,
I got this working today using your suggestion with the Initial SQL and a
Custom Query. See here for details:
http://stackoverflow.com/questions/28403664/connect-to-existing-hive-in-intellij-using-sbt-as-build/28608608#28608608
It is not ideal as I need to write a custom query, but
Hi all,
In Spark Streaming I want to use DStream.saveAsTextFiles with bulk writing,
because the normal saveAsTextFiles cannot finish within the configured batch
interval.
Maybe a common pool of writers, or another worker assigned for bulk writing?
Thanks!
B/R
Jichao
Yup, I did see that. Good point though, Cody. The mismatch was happening
for me when I was trying to get the 'new JdbcRDD' approach going. Once I
switched to the 'create' method things are working just fine. Was just able
to refactor the 'get connection' logic into a 'DbConnection implements
oh, I think you are just choosing a number that is too small for your
number of partitions. All of the data in /dir/to/gzfiles is going to be
sucked into one RDD, with the data divided into partitions. So if you're
parsing 200 files, each about 2 GB, and then repartitioning down to 100
I thought the combiner comes from reduceByKey and not mapPartitions, right?
Let me dig deeper into the APIs.
On Thu, Feb 19, 2015 at 8:29 AM, Daniel Siegmann daniel.siegm...@velos.io
wrote:
I'm not sure what your use case is, but perhaps you could use
mapPartitions to reduce across the individual
That's a good point, thanks. Is there a way to instrument continuous
realtime streaming of data out of a database? If the data keeps changing,
one way to implement extraction would be to keep track of something like
the last-modified timestamp and instrument the query to be 'where
lastmodified > ?'
almost all your data is going to one task. You can see that the shuffle
read for task 0 is 153.3 KB, and for most other tasks it's just 26B (which
is probably just some header saying there are no actual records). You need
to ensure your data is more evenly distributed before this step.
On Thu,
Hi Imran,
Thanks for pointing that out. My data comes from the HBase connector of
Spark. I do not govern the distribution of data myself. HBase decides to
put the data on any of the region servers. Is there a way to distribute
data evenly? And I am especially interested in running even small
If your dataset is large, there is a Spark Package called IndexedRDD
optimized for lookups. Feel free to check that out.
Burak
On Feb 19, 2015 7:37 AM, Ilya Ganelin ilgan...@gmail.com wrote:
Hi Shahab - if your data structures are small enough a broadcasted Map is
going to provide faster
I am trying to pass a variable number of arguments to the select function
of a SchemaRDD I created, as I want to select the fields at run time:
val variable_argument_list = List('field1, 'field2)
val schm1 = myschemaRDD.select('field1, 'field2) // works
val schm2 =
Isn't that PR about being able to pass an array to the percentile function?
If I understand this error correctly, it's not able to find the function
percentile itself.
Also, if I am incorrect and that PR fixes it, is it available in a release?
On Thu, Feb 19, 2015 at 3:27 PM, Mark Hamstra
--
cheers,
chaitu
Yes.
On 19 Feb 2015 23:40, Harshvardhan Chauhan ha...@gumgum.com wrote:
Is this the full stack trace ?
On Wed, Feb 18, 2015 at 2:39 AM, sachin Singh sachin.sha...@gmail.com
wrote:
Hi,
I want to run my spark Job in Hadoop yarn Cluster mode,
I am using below command -
spark-submit --master
Hi –
I am trying to use BoneCP (a database connection pooling library) to write data
from my Spark application to an RDBMS. The database inserts are inside a
foreachPartition code block. I am getting this exception when the code tries to
insert data using BoneCP:
java.sql.SQLException: No
If you have duplicate values for a key, join creates all pairs. E.g. if you have
2 values for key X in RDD A and 2 values for key X in RDD B, then A.join(B)
will have 4 records for key X.
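A tiny illustration (made-up data):
val a = sc.parallelize(Seq(("X", 1), ("X", 2)))
val b = sc.parallelize(Seq(("X", "a"), ("X", "b")))
a.join(b).collect()
// => Array((X,(1,a)), (X,(1,b)), (X,(2,a)), (X,(2,b)))  -- 4 records for key X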
On Thu, Feb 19, 2015 at 3:39 PM, Darin McBeath ddmcbe...@yahoo.com.invalid
wrote:
Consider the following left outer
Hi,
I am trying to use percentile and getting the following error. I am using
spark 1.2.0. Does UDAF percentile exist in that code line, and do I have to
do something to get this to work?
java.util.NoSuchElementException: key not found: percentile
at
My streaming app runs fine for a few hours and then starts spewing Could
not compute split, block input-xx-xxx not found errors. After this,
jobs start to fail and batches start to pile up.
My question isn't so much about why this error but rather, how do I trace
what leads to this error? I
the more scalable alternative is to do a join (or a variant like cogroup,
leftOuterJoin, subtractByKey etc. found in PairRDDFunctions)
the downside is this requires a shuffle of both your RDDs
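A hedged sketch (assuming each line's first comma-separated field is the key you want to match on; rdd1/rdd2 are the two CSV RDDs):
// Key both RDDs on the first CSV field.
val a = rdd1.map(_.split(",")).map(f => (f(0), f))
val b = rdd2.map(_.split(",")).map(f => (f(0), f))

val matched = a.join(b)           // (key, (recordFromA, recordFromB)) for keys in both
val onlyInA = a.subtractByKey(b)  // records whose key appears only in rdd1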
On Thu, Feb 19, 2015 at 3:36 PM, Himanish Kushary himan...@gmail.com
wrote:
Hi,
I have two RDD's
Hi, I'm trying to figure out why the following job is failing on a pipe
http://pastebin.com/raw.php?i=U5E8YiNN
With this exception:
http://pastebin.com/raw.php?i=07NTGyPP
Any help is welcome. Thank you.
The error msg is telling you the exact problem, it can't find
ProgramSIM, the thing you are trying to run
Lost task 3520.3 in stage 0.0 (TID 11, compute3.research.dev):
java.io.IOException: Cannot run program ProgramSIM: error=2, No such file or directory
On Thu, Feb 19, 2015 at 5:52 PM,
The stupid question is whether you're deleting the file from hdfs on the
right node?
On Thu, Feb 19, 2015 at 11:31 AM Pavel Velikhov pavel.velik...@gmail.com
wrote:
Yeah, I do manually delete the files, but it still fails with this error.
On Feb 19, 2015, at 8:16 PM, Ganelin, Ilya