Hi,
On Thu, Jan 29, 2015 at 9:52 AM, Tobias Pfeiffer t...@preferred.jp wrote:
Hi,
On Thu, Jan 29, 2015 at 1:54 AM, YaoPau jonrgr...@gmail.com wrote:
My thinking is to maintain state in an RDD and update it and persist it
with
each 2-second pass, but this also seems like it could get messy
Hi,
On Thu, Mar 26, 2015 at 4:08 AM, Khandeshi, Ami
ami.khande...@fmr.com.invalid wrote:
I am seeing the same behavior. I have enough resources...
CPU *and* memory are sufficient? No previous (unfinished) jobs eating them?
Tobias
Hi,
On Wed, Mar 11, 2015 at 11:05 PM, Cesar Flores ces...@gmail.com wrote:
Thanks for both answers. One final question: isn't this registerTempTable an
extra step that the SQL queries require, and might that decrease performance
compared to the language-integrated method calls?
As far as I
Sean,
On Wed, Mar 11, 2015 at 7:43 PM, Tobias Pfeiffer t...@preferred.jp wrote:
it seems like I am unable to shut down my StreamingContext properly, both
in local[n] and yarn-cluster mode. In addition, (only) in yarn-cluster
mode, subsequent use of a new StreamingContext will raise
Hi,
On Thu, Mar 12, 2015 at 12:08 AM, Huang, Jie jie.hu...@intel.com wrote:
According to my understanding, your approach is to register a series of
tables by using transformWith, right? And then, you can get a new DStream
(i.e., SchemaDStream), which consists of lots of SchemaRDDs.
Yep,
Hi,
it seems like I am unable to shut down my StreamingContext properly, both
in local[n] and yarn-cluster mode. In addition, (only) in yarn-cluster
mode, subsequent use of a new StreamingContext will raise
an InvalidActorNameException.
I use
logger.info("stopping StreamingContext")
Hi,
I discovered what caused my issue when running on YARN and was able to work
around it.
On Wed, Mar 11, 2015 at 7:43 PM, Tobias Pfeiffer t...@preferred.jp wrote:
The processing itself is complete, i.e., the batch currently processed at
the time of stop() is finished and no further batches
Hi,
On Tue, Mar 10, 2015 at 2:13 PM, Cesar Flores ces...@gmail.com wrote:
I am new to the SchemaRDD class, and I am trying to decide between using SQL
queries or Language Integrated Queries (
https://spark.apache.org/docs/1.2.0/api/scala/index.html#org.apache.spark.sql.SchemaRDD
).
Can someone
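For reference, a minimal sketch of the two styles being compared (assuming
Spark 1.2, an existing SQLContext named sqlContext, and a SchemaRDD named
people with name and age columns; all of these names are hypothetical):
import sqlContext._  // brings the symbol-based DSL implicits into scope
// SQL string query: needs a temp table registered first
people.registerTempTable("people")
val viaSql = sqlContext.sql("SELECT name FROM people WHERE age >= 21")
// language-integrated query: the same filter without registering a table
val viaDsl = people.where('age >= 21).select('name)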
Hi,
On Wed, Mar 11, 2015 at 9:33 AM, Cheng, Hao hao.ch...@intel.com wrote:
Intel has a prototype for doing this, SaiSai and Jason are the authors.
Probably you can ask them for some materials.
The github repository is here: https://github.com/intel-spark/stream-sql
Also, what I did is
Hi,
On Thu, Mar 5, 2015 at 10:58 PM, Ashish Mukherjee
ashish.mukher...@gmail.com wrote:
I understand Spark can be used with Hadoop or standalone. I have certain
questions related to use of the correct FS for Spark data.
What is the efficiency trade-off in feeding data to Spark from NFS v
Hi,
On Thu, Mar 5, 2015 at 12:20 AM, Imran Rashid iras...@cloudera.com wrote:
This doesn't involve spark at all, I think this is entirely an issue with
how scala deals w/ primitives and boxing. Often it can hide the details
for you, but IMO it just leads to far more confusing errors when
Hi,
I have a function with signature
def aggFun1(rdd: RDD[(Long, (Long, Double))]):
RDD[(Long, Any)]
and one with
def aggFun2[_Key: ClassTag, _Index](rdd: RDD[(_Key, (_Index, Double))]):
RDD[(_Key, Double)]
where all Double classes involved are scala.Double classes (according
to
Hi,
can you explain how you copied that into your *streaming* application?
Like, how do you issue the SQL, what data do you operate on, how do you
view the logs etc.?
Tobias
On Wed, Mar 4, 2015 at 8:55 AM, Cui Lin cui@hds.com wrote:
Dear all,
I found the below sample code can be
Hi,
On Wed, Mar 4, 2015 at 6:20 AM, Zhan Zhang zzh...@hortonworks.com wrote:
Do you have enough resource in your cluster? You can check your resource
manager to see the usage.
Yep, I can confirm that this is a very annoying issue. If there is not
enough memory or VCPUs available, your app
Hi,
I think your chances for a satisfying answer would increase dramatically if
you elaborated a bit more on what you actually want to know.
(Holds for any of your last four questions about Spark SQL...)
Tobias
Hi
On Fri, Feb 27, 2015 at 10:50 AM, Gary Malouf malouf.g...@gmail.com wrote:
The honest answer is that it is unclear to me at this point. I guess what
I am really wondering is if there are cases where one would find it
beneficial to use Spark against one or more RDBs?
Well, RDBs are all
Gary,
On Fri, Feb 27, 2015 at 8:40 AM, Gary Malouf malouf.g...@gmail.com wrote:
I'm considering whether or not it is worth introducing Spark at my new
company. The data is no-where near Hadoop size at this point (it sits in
an RDS Postgres cluster).
Will it ever become Hadoop size? Looking
On Fri, Feb 27, 2015 at 10:57 AM, Gary Malouf malouf.g...@gmail.com wrote:
So when deciding whether to take on installing/configuring Spark, the size
of the data does not automatically make that decision in your mind.
You got me there ;-)
Tobias
Hi,
On Thu, Feb 26, 2015 at 11:24 AM, Thanigai Vellore
thanigai.vell...@gmail.com wrote:
It appears that the function immediately returns even before the
foreachRDD stage is executed. Is that possible?
Sure, that's exactly what happens. foreachRDD() schedules a computation, it
does not
Sean,
thanks for your message!
On Mon, Feb 23, 2015 at 6:03 PM, Sean Owen so...@cloudera.com wrote:
What I haven't investigated is whether you can enable checkpointing
for the state in updateStateByKey separately from this mechanism,
which is exactly your question. What happens if you set a
Hi,
On Tue, Feb 24, 2015 at 4:34 AM, Tathagata Das tathagata.das1...@gmail.com
wrote:
There are different kinds of checkpointing going on. updateStateByKey
requires RDD checkpointing which can be enabled only by calling
sparkContext.setCheckpointDirectory. But that does not enable Spark
Hi,
On Sat, Feb 21, 2015 at 1:05 AM, craigv craigvanderbo...@gmail.com wrote:
Might it be possible to perform large batch processing on HDFS time
series data using Spark Streaming?
1. I understand that there is not currently an InputDStream that could do
what's needed. I would have
Hi Ayoub,
thanks for your mail!
On Thu, Jan 29, 2015 at 6:23 PM, Ayoub benali.ayoub.i...@gmail.com wrote:
SQLContext and hiveContext have a jsonRDD method which accept an
RDD[String] where the string is a JSON string, and it returns a SchemaRDD; it
extends RDD[Row], which is the type you want.
After
Hi,
I have data as RDD[(Long, String)], where the Long is a timestamp and the
String is a JSON-encoded string. I want to infer the schema of the JSON and
then do a SQL statement on the data (no aggregates, just column selection
and UDF application), but still have the timestamp associated with
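One possible approach, as a sketch only (assuming each String is a JSON
object, an existing SQLContext named sqlContext, and made-up sample data):
wrap the timestamp and the payload into a single JSON document before
inference, so the timestamp survives as an ordinary column.
val data = sc.parallelize(Seq(
  (1421395200L, """{"user": "a", "value": 1}"""),
  (1421395260L, """{"user": "b", "value": 2}""")))
val wrapped = data.map { case (ts, json) => s"""{"timestamp": $ts, "payload": $json}""" }
val schemaRdd = sqlContext.jsonRDD(wrapped)
schemaRdd.printSchema()  // timestamp column plus the inferred payload struct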
Hi,
On Fri, Jan 30, 2015 at 6:32 AM, Ganelin, Ilya ilya.gane...@capitalone.com
wrote:
Make a copy of your RDD with an extra entry in the beginning to offset.
Then you can zip the two RDDs and run a map to generate an RDD of
differences.
Does that work? I recently tried something to compute
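A sketch of one way to compute element-wise differences: rather than
prepending an entry and zipping (zip requires identical partition sizes),
this uses zipWithIndex plus a join, at the cost of a shuffle. The sample
values are made up.
import org.apache.spark.SparkContext._  // pair-RDD functions in Spark 1.x
val values = sc.parallelize(Seq(1.0, 3.0, 6.0, 10.0))
val indexed = values.zipWithIndex().map { case (v, i) => (i, v) }  // (index, value)
val shifted = indexed.map { case (i, v) => (i + 1, v) }            // previous value keyed by next index
val diffs = indexed.join(shifted).map { case (i, (cur, prev)) => (i, cur - prev) }
// diffs: (1, 2.0), (2, 3.0), (3, 4.0)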
Hi,
On Thu, Jan 29, 2015 at 1:54 AM, YaoPau jonrgr...@gmail.com wrote:
My thinking is to maintain state in an RDD and update it and persist it with
each 2-second pass, but this also seems like it could get messy. Any
thoughts or examples that might help me?
I have just implemented some
Hi,
in my Spark Streaming application, computations depend on users' input in
terms of
* user-defined functions
* computation rules
* etc.
that can throw exceptions in various cases (think: exception in UDF,
division by zero, invalid access by key etc.).
Now I am wondering about what is a
Hi,
On Wed, Jan 28, 2015 at 1:45 PM, Soumitra Kumar kumar.soumi...@gmail.com
wrote:
It is a Streaming application, so how/when do you plan to access the
accumulator on driver?
Well... maybe there would be some user command or web interface showing the
errors that have happened during
Hi,
thanks for your mail!
On Wed, Jan 28, 2015 at 11:44 AM, Tathagata Das tathagata.das1...@gmail.com
wrote:
That seems reasonable to me. Are you having any problems doing it this way?
Well, actually I haven't done that yet. The idea of using accumulators to
collect errors just came while
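A minimal sketch of that idea, assuming a Spark 1.x accumulator that just
counts failures of a user-supplied function (riskyUdf and the stream name are
hypothetical):
val errorCount = ssc.sparkContext.accumulator(0L, "udf errors")
val results = dstream.map { item =>
  try {
    Some(riskyUdf(item))    // user-supplied function that may throw
  } catch {
    case e: Exception =>
      errorCount += 1L      // value visible on the driver, e.g. for a status page
      None
  }
}
// on the driver one could periodically inspect errorCount.value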
Hi,
I want to do something like
dstream.foreachRDD(rdd => if (someCondition) ssc.stop())
so in particular the function does not touch any element in the RDD and
runs completely within the driver. However, this fails with a
NotSerializableException because $outer is not serializable etc. The
Hi,
thanks for the answers!
On Wed, Jan 28, 2015 at 11:31 AM, Shao, Saisai saisai.s...@intel.com
wrote:
Also this `foreachFunc` is more like an action function of RDD, thinking
of rdd.foreach(func), in which `func` needs to be serializable. So maybe I
think your way of using it is not a normal
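One way around the serialization issue, as a sketch only (not necessarily the
approach taken in this thread): keep the stop decision in a driver-side flag
so the DStream closure never captures the StreamingContext, and stop from
outside. The stop condition here is made up.
@volatile var shouldStop = false
dstream.foreachRDD { rdd =>
  // this body runs on the driver per batch; only the RDD actions run on executors
  if (rdd.count() == 0) shouldStop = true   // hypothetical stop condition
}
ssc.start()
// poll the flag outside the DStream closure and stop the context from there
while (!shouldStop) Thread.sleep(1000)
ssc.stop(true, true)  // stop SparkContext too, gracefully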
Hi,
On Mon, Jan 26, 2015 at 9:32 AM, Steve Nunez snu...@hortonworks.com wrote:
I've got a list of points: List[(Float, Float)] that represent (x,y)
coordinate pairs and need to sum the distance. It’s easy enough to compute
the distance:
Are you saying you want all combinations (N^2) of
Hi,
On Tue, Jan 20, 2015 at 8:16 PM, balu.naren balu.na...@gmail.com wrote:
I am a beginner to Spark Streaming, so I have a basic question regarding
checkpoints. My use case is to calculate the number of unique users by day. I
am using reduce by key and window for this. Where my window duration is 24
Hi,
On Thu, Jan 22, 2015 at 2:26 AM, Corey Nolet cjno...@gmail.com wrote:
Let's say I have 2 formats for json objects in the same file
schema1 = { "location": "12345 My Lane" }
schema2 = { "location": { "houseAddress": "1234 My Lane" } }
From my tests, it looks like the current inferSchema() function will
Aaron,
On Thu, Jan 15, 2015 at 5:05 PM, Aaron Davidson ilike...@gmail.com wrote:
Scala for-loops are implemented as closures using anonymous inner classes
which are instantiated once and invoked many times. This means, though,
that the code inside the loop is actually sitting inside a class,
Sean,
On Mon, Jan 26, 2015 at 10:28 AM, Sean Owen so...@cloudera.com wrote:
Note that RDDs don't really guarantee anything about ordering though,
so this only makes sense if you've already sorted some upstream RDD by
a timestamp or sequence number.
Speaking of order, is there some reading
Hi,
On Wed, Jan 21, 2015 at 9:13 PM, Bob Tiernay btier...@hotmail.com wrote:
Maybe I'm misunderstanding something here, but couldn't this be done with
broadcast variables? I think there is the following caveat from the docs:
In addition, the object v should not be modified after it is broadcast
Hi,
I am developing a Spark Streaming application where I want every item in my
stream to be assigned a unique, strictly increasing Long. My input data
already has RDD-local integers (from 0 to N-1) assigned, so I am doing the
following:
var totalNumberOfItems = 0L
// update the keys of the
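A sketch of the pattern implied here, assuming each batch already carries
RDD-local indices 0..N-1 in the key position; the running total lives on the
driver, and transform's closure is evaluated there once per batch:
var totalNumberOfItems = 0L
val globallyIndexed = dstream.transform { rdd =>
  val offset = totalNumberOfItems       // captured as a val so the map closure stays serializable
  totalNumberOfItems += rdd.count()     // forces a job per batch to learn the batch size
  rdd.map { case (localIdx, item) => (offset + localIdx, item) }
}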
Hi,
On Wed, Jan 21, 2015 at 4:46 PM, Akhil Das ak...@sigmoidanalytics.com
wrote:
How about using accumulators
http://spark.apache.org/docs/1.2.0/programming-guide.html#accumulators?
As far as I understand, they solve the part of the problem that I am not
worried about, namely increasing the
Hi,
On Sat, Jan 17, 2015 at 3:37 AM, Peng Cheng pc...@uow.edu.au wrote:
I'm talking about RDD1 (not persisted or checkpointed) in this situation:
...(somewhere) -> RDD1 -> RDD2
                   |        |
                   V        V
Hi,
On Sun, Jan 18, 2015 at 11:08 AM, guxiaobo1982 guxiaobo1...@qq.com wrote:
Driver programs submitted by the spark-submit script will get the runtime
spark master URL, but how does it get the URL inside the main method when
creating the SparkConf object?
The master will be stored in the
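Concretely, a short sketch: spark-submit puts the master into the system
properties, so an empty SparkConf already carries it, and it is also exposed
on the SparkContext.
import org.apache.spark.{SparkConf, SparkContext}
val conf = new SparkConf()            // picks up spark.master set by spark-submit
val sc = new SparkContext(conf)
println(conf.get("spark.master"))     // e.g. "yarn-cluster" or "spark://host:7077"
println(sc.master)                    // same information, via the context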
Hi,
On Fri, Jan 16, 2015 at 6:14 PM, Wang, Daoyuan daoyuan.w...@intel.com
wrote:
The second parameter of jsonRDD is the sampling ratio when we infer schema.
OK, I was aware of this, but I guess I understand the problem now. My
sampling ratio is so low that I only see the Long values of data
Hi,
On Sat, Jan 17, 2015 at 3:05 AM, Shuai Zheng szheng.c...@gmail.com wrote:
Can you share more information about how you do that? I also have
similar question about this.
Not very proud about it ;-), but here you go:
// find the number of workers available to us.
val _runCmd =
Hi again,
On Fri, Jan 16, 2015 at 4:25 PM, Tobias Pfeiffer t...@preferred.jp wrote:
Now I'm wondering where this comes from (I haven't touched this component
in a while, nor upgraded Spark etc.) [...]
So the reason that the error is showing up now is that suddenly data from a
different
Hi,
On Fri, Jan 16, 2015 at 5:55 PM, Wang, Daoyuan daoyuan.w...@intel.com
wrote:
Can you provide how you create the JsonRDD?
This should be reproducible in the Spark shell:
-
import org.apache.spark.sql._
val sqlc = new SQLContext(sc)
Hi,
On Fri, Jan 16, 2015 at 7:31 AM, freedafeng freedaf...@yahoo.com wrote:
I myself saw many times that my app threw out exceptions because an empty
RDD cannot be saved. This is not a big issue, but annoying. Having a cheap
solution testing if an RDD is empty would be nice if there is no such
Hi,
I am experiencing a weird error that suddenly popped up in my unit tests. I
have a couple of HDFS files in JSON format and my test is basically
creating a JsonRDD and then issuing a very simple SQL query over it. This
used to work fine, but now suddenly I get:
15:58:49.039 [Executor task
Aaron,
thanks for your mail!
On Thu, Jan 15, 2015 at 5:05 PM, Aaron Davidson ilike...@gmail.com wrote:
Scala for-loops are implemented as closures using anonymous inner classes
[...]
While loops, on the other hand, involve none of this trickery, and
everyone is happy.
Ah, I was suspecting
Sean,
thanks for your message.
On Wed, Jan 14, 2015 at 8:36 PM, Sean Owen so...@cloudera.com wrote:
On Wed, Jan 14, 2015 at 4:53 AM, Tobias Pfeiffer t...@preferred.jp wrote:
OK, it seems like even on a local machine (with no network overhead), the
groupByKey version is about 5 times slower
Hi,
On Thu, Jan 15, 2015 at 12:23 AM, Ted Yu yuzhih...@gmail.com wrote:
On Wed, Jan 14, 2015 at 6:58 AM, Jianguo Li flyingfromch...@gmail.com
wrote:
I am using Spark-1.1.1. When I used sbt test, I ran into the following
exceptions. Any idea how to solve it? Thanks! I think somebody posted
Hi,
sorry, I don't like questions about serializability myself, but still...
Can anyone give me a hint why
for (i <- 0 to (maxId - 1)) { ... }
throws a NotSerializableException in the loop body while
var i = 0
while (i < maxId) {
// same code as in the for loop
i += 1
}
works
Hi again,
On Wed, Jan 14, 2015 at 10:06 AM, Tobias Pfeiffer t...@preferred.jp wrote:
If you think of
items.map(x => /* throw exception */).count()
then even though the count you want to get does not necessarily require
the evaluation of the function in map() (i.e., the number is the same
Hi,
On Wed, Jan 14, 2015 at 12:11 PM, Tobias Pfeiffer t...@preferred.jp wrote:
Now I don't know (yet) if all of the functions I want to compute can be
expressed in this way and I was wondering about *how much* more expensive
we are talking about.
OK, it seems like even on a local machine
Hi,
I have an RDD[(Long, MyData)] where I want to compute various functions on
lists of MyData items with the same key (these will in general be rather
short lists, around 10 items per key).
Naturally I was thinking of groupByKey() but was a bit intimidated by the
warning: This operation may be
Hi,
On Mon, Jan 12, 2015 at 8:09 PM, Ganelin, Ilya ilya.gane...@capitalone.com
wrote:
Use the mapPartitions function. It returns an iterator to each partition.
Then just get that length by converting to an array.
On Tue, Jan 13, 2015 at 2:50 PM, Kevin Burton bur...@spinn3r.com wrote:
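A one-liner sketch of that suggestion; counting via the iterator directly
avoids materializing each partition as an array (the sample RDD is made up):
val rdd = sc.parallelize(1 to 100, 4)
// one element per partition: its size
val partitionSizes = rdd.mapPartitions(iter => Iterator(iter.size)).collect()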
Hi,
On Wed, Jan 7, 2015 at 9:47 AM, Asim Jalis asimja...@gmail.com wrote:
One approach I was considering was to use mapPartitions. It is
straightforward to compute the moving average over a partition, except for
near the end point. Does anyone see how to fix that?
Well, I guess this is not
Hi,
On Thu, Jan 8, 2015 at 2:19 PM, rekt...@voodoowarez.com wrote:
dstream processing bulk HDFS data- is something I don't feel is super
well socialized yet, fingers crossed that base gets built up a little
more.
Just out of interest (and hoping not to hijack my own thread), why are you
Hi,
I have a setup where data from an external stream is piped into Kafka and
also written to HDFS periodically for long-term storage. Now I am trying to
build an application that will first process the HDFS files and then switch
to Kafka, continuing with the first item that was not yet in HDFS.
Hi,
On Tue, Jan 6, 2015 at 11:24 PM, Todd bit1...@163.com wrote:
I am a bit new to Spark, except that I tried simple things like word
count, and the examples given in the spark sql programming guide.
Now, I am investigating the internals of Spark, but I think I am almost
lost, because I
Hi,
it looks to me as if you need the whole user database on every node, so
maybe put the id-name information as a Map[Id, String] in a broadcast
variable and then do something like
recommendations.map(line => {
  line.map(uid => usernames(uid))
})
or so?
Tobias
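To make the broadcast part explicit, a small sketch; the usernames map and
the recommendations RDD here are placeholders:
val usernames: Map[Int, String] = Map(1 -> "alice", 2 -> "bob")
val usernamesBc = sc.broadcast(usernames)          // shipped once per executor
val recommendations = sc.parallelize(Seq(Seq(1, 2), Seq(2)))
val withNames = recommendations.map(line => line.map(uid => usernamesBc.value(uid)))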
Hi,
On Wed, Jan 7, 2015 at 10:47 AM, Riginos Samaras samarasrigi...@gmail.com
wrote:
Yes something like this. Can you please give me an example to create a Map?
That depends heavily on the shape of your input file. What about something
like:
(for (line <- Source.fromFile(filename).getLines())
Hi,
On Wed, Jan 7, 2015 at 11:13 AM, Riginos Samaras samarasrigi...@gmail.com
wrote:
that's exactly what I'm looking for, my code is like this:
//code
val users_map = users_file.map{ s =>
  val parts = s.split(",")
  (parts(0).toInt, parts(1))
}.distinct
//code
but I get the error:
Hi Michael,
On Tue, Jan 6, 2015 at 3:43 PM, Michael Armbrust mich...@databricks.com
wrote:
Oh sorry, I'm rereading your email more carefully. It's only because you
have some setup code that you want to amortize?
Yes, exactly that.
Concerning the docs, I'd be happy to contribute, but I don't
Hi,
I have a SchemaRDD where I want to add a column with a value that is
computed from the rest of the row. As the computation involves a
network operation and requires setup code, I can't use
SELECT *, myUDF(*) FROM rdd,
but I wanted to use a combination of:
- get schema of input SchemaRDD
Hi,
On Fri, Dec 26, 2014 at 5:22 AM, Amit Behera amit.bd...@gmail.com wrote:
How can I do it? Please help me do this.
Have you considered using groupByKey?
http://spark.apache.org/docs/latest/programming-guide.html#transformations
Tobias
Hi,
On Fri, Dec 26, 2014 at 1:32 AM, ey-chih chow eyc...@hotmail.com wrote:
I got some issues with mapPartitions with the following piece of code:
val sessions = sc
.newAPIHadoopFile(
... path to an avro file ...,
Nick,
uh, I would have expected a rather heated discussion, but the opposite
seems to be the case ;-)
Independent of my personal preferences w.r.t. usability, habits etc., I
think it is not good for a software/tool/framework if questions and
discussions are spread over too many places. I guess
Hi,
On Fri, Dec 26, 2014 at 10:13 AM, ey-chih chow eyc...@hotmail.com wrote:
I should rephrase my question as follows:
How to use the corresponding Hadoop Configuration of a HadoopRDD in
defining
a function as an input parameter to the MapPartitions function?
Well, you could try to pull
Hi,
On Wed, Dec 24, 2014 at 3:18 PM, Hafiz Mujadid hafizmujadi...@gmail.com
wrote:
I want to convert a SchemaRDD into an RDD of String. How can we do that?
Currently I am doing it like this, which is not converting correctly: no
exception, but the resulting strings are empty.
here is my code
Hehe,
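Since the snippet above is cut off, just as an illustration, one way to do
the conversion (a sketch assuming a SchemaRDD named schemaRdd; in Spark 1.x a
Row behaves like a Seq, so mkString works):
val strings = schemaRdd.map(row => row.mkString(","))  // one delimited String per Row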
Hi,
On Fri, Dec 19, 2014 at 6:53 PM, Ashic Mahtab as...@live.com wrote:
def doSomething(entry: SomeEntry, session: Session): SomeOutput = {
val result = session.SomeOp(entry)
SomeOutput(entry.Key, result.SomeProp)
}
I could use a transformation for rdd.map, but in case of failures,
Hi,
I have the following code in my application:
tmpRdd.foreach(item => {
  println("abc: " + item)
})
tmpRdd.foreachPartition(iter => {
  iter.map(item => {
    println("xyz: " + item)
  })
})
In the output, I see only the abc prints
Hi again,
On Thu, Dec 18, 2014 at 6:43 PM, Tobias Pfeiffer t...@preferred.jp wrote:
tmpRdd.foreachPartition(iter => {
  iter.map(item => {
    println("xyz: " + item)
  })
})
Uh, with iter.foreach(...) it works... the reason being apparently
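For reference, a sketch of the working variant: Iterator.map is lazy and
nothing forces it inside foreachPartition, whereas foreach consumes the
iterator (tmpRdd as in the snippet above).
tmpRdd.foreachPartition { iter =>
  iter.foreach { item =>
    println("xyz: " + item)
  }
}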
Hi,
On Thu, Dec 18, 2014 at 3:08 AM, Patrick Wendell pwend...@gmail.com wrote:
On Wed, Dec 17, 2014 at 5:43 AM, Gerard Maas gerard.m...@gmail.com
wrote:
I was wondering why one would choose rdd.map vs. rdd.foreach to
execute a
side-effecting function on an RDD.
Personally, I like to
Jerry,
On Wed, Dec 17, 2014 at 3:35 PM, Jerry Raj jerry@gmail.com wrote:
Another problem with the DSL:
t1.where('term == "dmin").count() returns zero.
Looks like you need ===:
https://spark.apache.org/docs/1.1.0/api/scala/index.html#org.apache.spark.sql.SchemaRDD
Tobias
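i.e., with the corrected operator the call would look like this (sketch,
reusing the names from the snippet above):
t1.where('term === "dmin").count()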
Hi,
On Fri, Dec 12, 2014 at 7:01 AM, ryaminal tacmot...@gmail.com wrote:
Now our solution is to make a very simple YARN application which executes
as its command spark-submit --master yarn-cluster
s3n://application/jar.jar
This seemed so simple and elegant, but it has some weird
Nathan,
On Fri, Dec 12, 2014 at 3:11 PM, Nathan Kronenfeld
nkronenf...@oculusinfo.com wrote:
I can see how to do it if can express the added values in SQL - just run
SELECT *, valueCalculation AS newColumnName FROM table
I've been searching all over for how to do this if my added value is a
Hi,
I am interested in building an application that uses sliding windows not
based on the time when the item was received, but on either
* a timestamp embedded in the data, or
* a count (like: every 10 items, look at the last 100 items).
Also, I want to do this on stream data received from
Hi,
On Tue, Dec 9, 2014 at 4:39 AM, Sandy Ryza sandy.r...@cloudera.com wrote:
Can you try using the YARN Fair Scheduler and set
yarn.scheduler.fair.continuous-scheduling-enabled to true?
I'm using Cloudera 5.2.0 and my configuration says
yarn.resourcemanager.scheduler.class =
Hi,
thanks for your responses!
On Sat, Dec 6, 2014 at 4:22 AM, Sandy Ryza sandy.r...@cloudera.com wrote:
What version are you using? In some recent versions, we had a couple of
large hardcoded sleeps on the Spark side.
I am using Spark 1.1.1.
As Andrew mentioned, I guess most of the 10
Hi,
On Thu, Dec 4, 2014 at 11:58 PM, Rohit Pujari rpuj...@hortonworks.com
wrote:
I'd like to do market basket analysis using spark, what're my options?
To do it or not to do it ;-)
Seriously, could you elaborate a bit on what you want to know?
Tobias
Hi,
On Fri, Dec 5, 2014 at 3:56 AM, Akshat Aranya aara...@gmail.com wrote:
Is it possible to have some state across multiple calls to mapPartitions
on each partition, for instance, if I want to keep a database connection
open?
If you're using Scala, you can use a singleton object, this will
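A sketch of the singleton-object pattern, using a JDBC connection as a
stand-in (the URL is made up and rdd is a placeholder); the lazy val is
initialized at most once per executor JVM:
import java.sql.{Connection, DriverManager}
object ConnectionHolder {
  lazy val conn: Connection =
    DriverManager.getConnection("jdbc:postgresql://db-host/mydb")  // hypothetical URL
}
val enriched = rdd.mapPartitions { iter =>
  val conn = ConnectionHolder.conn   // reused across calls within the same JVM
  iter.map { record =>
    // ... look something up for `record` via conn ...
    record
  }
}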
On Fri, Dec 5, 2014 at 12:53 PM, Rahul Bindlish
rahul.bindl...@nectechnologies.in wrote:
Is it a limitation that Spark does not support more than one case class at a
time?
What do you mean? I do not have the slightest idea what you *could*
possibly mean by "to support a case class".
Tobias
Rahul,
On Fri, Dec 5, 2014 at 1:29 PM, Rahul Bindlish
rahul.bindl...@nectechnologies.in wrote:
I have created object files [person_obj, office_obj] from
csv files [person_csv, office_csv] using case classes [person, office] with
the saveAsObjectFile API.
Now I restarted spark-shell and load
Rahul,
On Fri, Dec 5, 2014 at 2:50 PM, Rahul Bindlish
rahul.bindl...@nectechnologies.in wrote:
I have done so, that's why Spark is able to load the objectfile [e.g. person_obj]
and spark has maintained serialVersionUID [person_obj].
Next time when I am trying to load another objectfile [e.g.
Rahul,
On Fri, Dec 5, 2014 at 3:51 PM, Rahul Bindlish
rahul.bindl...@nectechnologies.in wrote:
1. Copy csv files in current directory.
2. Open spark-shell from this directory.
3. Run one_scala file which will create object-files from csv-files in
current directory.
4. Restart spark-shell
Hi,
I have an RDD and a function that should be called on every item in this
RDD once (say it updates an external database). So far, I used
rdd.map(myFunction).count()
or
rdd.mapPartitions(iter => iter.map(myFunction))
but I am wondering if this always triggers the call of myFunction in both
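For comparison, the more direct way to trigger the side effect is an action
that does not depend on a downstream count (a sketch, reusing the names from
the snippet above):
rdd.foreach(myFunction)
// or, when per-partition setup/teardown is needed:
rdd.foreachPartition { iter =>
  // open resources here
  iter.foreach(myFunction)
  // close resources here
}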
Hi,
On Wed, Dec 3, 2014 at 4:31 PM, Jerry Raj jerry@gmail.com wrote:
Exception in thread "main" java.lang.RuntimeException: [1.57] failure:
``('' expected but identifier myudf found
I also tried returning a List of Ints, that did not work either. Is there
a way to write a UDF that returns
Hi,
On Wed, Dec 3, 2014 at 5:31 PM, Bahubali Jain bahub...@gmail.com wrote:
I am trying to use textFileStream(some_hdfs_location) to pick new files
from an HDFS location. I am seeing a pretty strange behavior though.
textFileStream() is not detecting new files when I move them from a
location
Hi,
On Thu, Dec 4, 2014 at 2:59 AM, Ashic Mahtab as...@live.com wrote:
I've been doing this with foreachPartition (i.e. have the parameters for
creating the singleton outside the loop, do a foreachPartition, create the
instance, loop over entries in the partition, close the partition), but
Hi,
I am using spark-submit to submit my application to YARN in yarn-cluster
mode. I have both the Spark assembly jar file as well as my application jar
file put in HDFS and can see from the logging output that both files are
used from there. However, it still takes about 10 seconds for my
Markus,
On Tue, Nov 11, 2014 at 10:40 AM, M. Dale medal...@yahoo.com wrote:
I never tried to use this property. I was hoping someone else would jump
in. When I saw your original question I remembered that Hadoop has
something similar. So I searched and found the link below. A quick JIRA
Hi,
have a look at the documentation for spark.driver.extraJavaOptions (which
seems to have disappeared since I looked it up last week)
and spark.executor.extraJavaOptions at
http://spark.apache.org/docs/latest/configuration.html#runtime-environment.
Tobias
Josh,
On Sun, Nov 30, 2014 at 10:17 PM, Josh J joshjd...@gmail.com wrote:
I would like to setup a Kafka pipeline whereby I write my data to a single
topic 1, then I continue to process using spark streaming and write the
transformed results to topic2, and finally I read the results from topic
Hi,
Thanks for your help!
Sandy, I had a bit of trouble finding the spark.executor.cores property.
(It wasn't there although its value should have been 2.)
I ended up throwing regular expressions
on scala.util.Properties.propOrElse("sun.java.command", ""), which worked
surprisingly well ;-)
Thanks
Hi,
On Tue, Nov 25, 2014 at 9:40 AM, Joanne Contact joannenetw...@gmail.com
wrote:
I seemed to read somewhere that spark is still batch learning, but spark
streaming could allow online learning.
Spark doesn't do Machine Learning itself, but MLlib does. MLlib currently
can do online learning
Hi,
On Sat, Nov 22, 2014 at 12:13 AM, EH eas...@gmail.com wrote:
Unfortunately whether it is possible to have both Spark and HDFS running on
the same machine is not under our control. :( Right now we have Spark and
HDFS running in different machines. In this case, is it still possible to
Hi,
I am using spark-submit to submit my application jar to a YARN cluster. I
want to deliver a single jar file to my users, so I would like to avoid to
tell them also, please put that log4j.xml file somewhere and add that path
to the spark-submit command.
I thought it would be sufficient that
Hi,
it looks like what you are trying to use as a Tuple cannot be inferred to be a
Tuple by the compiler. Try to add type declarations and maybe you will
see where things fail.
Tobias
Hi,
do you have some logging backend (log4j, logback) on your classpath? This
seems a bit like there is no particular implementation of the abstract
`log()` method available.
Tobias
Hi,
see https://www.mail-archive.com/dev@spark.apache.org/msg03520.html for one
solution.
One issue with those XML files is that they cannot be processed line by
line in parallel; plus you inherently need shared/global state to parse XML
or check for well-formedness, I think. (Same issue with