Just pointing out a bug in your code: you should not use `mapPartitions` like
that. For details, I recommend the setup() and cleanup() section in Sean
Owen's post:
http://blog.cloudera.com/blog/2014/09/how-to-translate-from-mapreduce-to-apache-spark/
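(For reference, a minimal sketch of the setup()/cleanup() pattern that post describes -- createConnection and process are hypothetical placeholders:)

rdd.mapPartitions { iter =>
  if (iter.isEmpty) Iterator.empty
  else {
    // setup: runs once per partition, e.g. open a connection
    val connection = createConnection()
    iter.map { item =>
      val result = process(connection, item)
      // cleanup: close the connection after the last element
      if (!iter.hasNext) connection.close()
      result
    }
  }
}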
Best Regards,
Shixiong Zhu
2014-12-14 16:35
We are developing Spark framework wherein we are moving historical data into
RDD sets.
Basically, an RDD is an immutable, read-only dataset on which we do operations.
Based on that, we have moved historical data into RDDs and we do computations
like filtering/mapping, etc. on such RDDs.
Now there is a
Hi Nathan,
#1
Spark SQL DSL can satisfy your requirement. You can refer to the following
code snippet:
jdata.select(Star(None), 'seven.getField("mod"), 'eleven.getField("mod"))
You need to import org.apache.spark.sql.catalyst.analysis.Star in advance.
#2
After you make the transform above, you do
Thanks! shall try it out.
Hi all,
I am trying to run a Spark job on Play Framework + Spark Master/Worker on one
Mac.
When the job ran, I encountered java.lang.ClassNotFoundException.
Could you tell me how to solve it?
Here is my code in Github.
https://github.com/TomoyaIgarashi/spark_cluster_sample
* Environments:
Mac 10.9.5
Hi,
I tried to create a function to convert a Unix timestamp to the hour of the
day.
It works if the code is like this:
sqlContext.registerFunction("toHour", (x: Long) => { new java.util.Date(x * 1000).getHours })
But, if I do it like this, it doesn't work:
def toHour
Try the workaround (addClassPathJars(sparkContext,
this.getClass.getClassLoader)) discussed in
http://mail-archives.apache.org/mod_mbox/spark-user/201412.mbox/%3CCAJOb8buD1B6tUtOfG8_Ok7F95C3=r-zqgffoqsqbjdxd427...@mail.gmail.com%3E
Thanks,
Aniket
On Mon Dec 15 2014 at 07:43:24 Tomoya Igarashi
The reason for not using sc.newAPIHadoopRDD is that it only supports one scan
each time.
I am not sure that's true. You can use multiple scans as follows:
val scanStrings = scans.map(scan => convertScanToString(scan))
conf.setStrings(MultiTableInputFormat.SCANS, scanStrings : _*)
where
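(The message is cut off at "where"; for context, a hedged sketch of how these pieces commonly fit together in HBase examples of that era -- convertScanToString is the usual community helper, not necessarily the poster's exact code, and scans is assumed to be a Seq[Scan] whose elements carry their table names via Scan.SCAN_ATTRIBUTES_TABLE_NAME:)

import org.apache.hadoop.hbase.client.{Result, Scan}
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.MultiTableInputFormat
import org.apache.hadoop.hbase.protobuf.ProtobufUtil
import org.apache.hadoop.hbase.util.Base64

// Serialize a Scan so it can travel through the Hadoop Configuration
def convertScanToString(scan: Scan): String =
  Base64.encodeBytes(ProtobufUtil.toScan(scan).toByteArray)

val scanStrings = scans.map(scan => convertScanToString(scan))
conf.setStrings(MultiTableInputFormat.SCANS, scanStrings: _*)

val rdd = sc.newAPIHadoopRDD(conf, classOf[MultiTableInputFormat],
  classOf[ImmutableBytesWritable], classOf[Result])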
In case you are still looking for help, there have been multiple discussions
in this mailing list that you can try searching for. Or you can simply use
https://github.com/unicredit/hbase-rdd :-)
Thanks,
Aniket
On Wed Dec 03 2014 at 16:11:47 Ted Yu yuzhih...@gmail.com wrote:
Which hbase release
Hi,
I just wonder if the Cassandra-Spark connector supports executing HiveQL on
Cassandra tables?
best,
/Shahab
Hi Helena and All,
I have found one example of reading a multi-line JSON file into an RDD using
https://github.com/alexholmes/json-mapreduce.
val data = sc.newAPIHadoopFile(
filepath,
classOf[MultiLineJsonInputFormat],
classOf[LongWritable],
classOf[Text],
conf ).map(p => (p._1.get,
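(The snippet above is cut off; a minimal completed sketch, assuming the input format lives at com.alexholmes.json.mapreduce.MultiLineJsonInputFormat and that filepath and conf are defined as in the original message:)

import org.apache.hadoop.io.{LongWritable, Text}
import com.alexholmes.json.mapreduce.MultiLineJsonInputFormat

val data = sc.newAPIHadoopFile(
  filepath,
  classOf[MultiLineJsonInputFormat],
  classOf[LongWritable],
  classOf[Text],
  conf
).map(p => (p._1.get, p._2.toString)) // (byte offset, raw JSON object string)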
Thanks Akhil,
In line with your suggestion I have used the following 2 commands to
flatten the directory structure:
find . -type f -iname '*' -exec mv '{}' . \;
find . -type d -exec rm -rf '{}' \;
Kind Regards
Karen
On 12/12/14 13:25, Akhil Das wrote:
I'm not quite sure whether spark
Hi Michael,
I have opened following JIRA for the same :-
https://issues.apache.org/jira/browse/SPARK-4849
I am having a look at the code to see what can be done and then we can have
a discussion over the approach.
Let me know if you have any comments/suggestions.
Thanks
-Nitin
On Sun, Dec
Hi,
is there an easy way to “migrate” parquet files or indicate optional values in
SQL statements? I added a couple of new fields that I also use in a
schemaRDD.sql() query, which obviously fails for input files that don’t have the
new fields.
Thanks
- Marius
As the error log shows, you may need to register it as:
sqlContext.registerFunction("toHour", toHour _)
The "_" means you are passing the function as a parameter, not invoking it.
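(For completeness, a small sketch of both working forms against the 1.1-era SQLContext API, reusing the toHour logic from the question:)

// Works: register a lambda directly
sqlContext.registerFunction("toHour", (x: Long) => new java.util.Date(x * 1000).getHours)

// Also works: define a method, then pass it as a function value with "_"
def toHour(x: Long): Int = new java.util.Date(x * 1000).getHours
sqlContext.registerFunction("toHour", toHour _)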
Cheng Hao
From: Xuelin Cao [mailto:xuelin...@yahoo.com.INVALID]
Sent: Monday, December 15, 2014 5:28 PM
To: User
When you say restored, do you mean the internal/public IPs remained
unchanged, or did you change them accordingly? (I'm assuming you are using a
cloud service like AWS, GCE or Azure.)
What is the serializer that you are using? Try setting the following before
creating the SparkContext; it might help:
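(A hedged guess at the kind of settings meant here -- switching to Kryo is the usual first step:)

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("app") // hypothetical app name
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
val sc = new SparkContext(conf)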
Hi experts
I am wondering if there is a way to interact with Spark remotely, i.e. no
access to clusters required, but submit Python/Scala scripts to the cluster and
get results based on (REST) APIs.
That would facilitate the development process a lot.
Xiaoyong
Hiya,
I too am looking for a PageRank solution in GraphX where the probabilities
sum to 1.
I tried a few modifications, including division by the total number of
vertices in the first part of the equation, as well as trying to return the
full rank instead of the delta (though not correctly, as evident
Hi everybody,
I apologize if the answer to my question is obvious but I haven't been able to
find a straightforward solution anywhere on the internet.
I have a number of Spark jobs written using the Python API that do things like
read in data from Amazon S3 to a main table in the Hive
Hi all.
I'm willing to serialize and later load a model trained using mllib's
ALS.
I've tried using Java serialization with something like:
val model = ALS.trainImplicit(training, rank, numIter, lambda, 1)
val fos = new FileOutputStream("model.bin")
val oos = new
To me this looks like an error internal to the REPL; I am not sure what is
causing it.
Personally I never use the REPL. Can you try typing up your program and
running it from an IDE or spark-submit and see if you still get the same
error?
Simone Franzini, PhD
For an alternative take on a similar idea, see
https://github.com/koeninger/spark-1/tree/kafkaRdd/external/kafka/src/main/scala/org/apache/spark/rdd/kafka
An advantage of the approach I'm taking is that the lower and upper offsets
of the RDD are known in advance, so it's deterministic.
I
Hi,
I’m seeing strange, random errors when running unit tests for my Spark jobs. In
this particular case I’m using Spark SQL to read and write Parquet files, and
one error that I keep running into is this one:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 19 in
stage
Thanks! I will take a look at your code. Didn't realize there was already
something out there.
Good point about upper offsets; I will add that feature to our version as
well if you don't mind.
I was thinking about making it deterministic for task failures transparently
(even if no upper offsets
This class is not going to be serializable, as it contains huge RDDs.
Even if the right constructor existed the RDDs inside would not
serialize.
On Mon, Dec 15, 2014 at 4:33 PM, Albert Manyà alber...@eml.cc wrote:
Hi all.
I'm willing to serialize and later load a model trained using mllib's
Hi Xiaoyong,
You could refer to this post if you are looking for how to run Spark jobs
remotely:
http://apache-spark-user-list.1001560.n3.nabble.com/Submitting-Spark-job-on-Unix-cluster-from-dev-environment-Windows-td16989.html
You will of course require network access to the cluster.
Thanks
Best
In that case, what is the strategy for training a model in some background
batch process and making recommendations for some other service in real
time? Run both processes in the same Spark cluster?
Thanks.
--
Albert Manyà
alber...@eml.cc
On Mon, Dec 15, 2014, at 05:58 PM, Sean Owen wrote:
This
Have you seen the recent announcement of the Spark Kernel, which uses the
IPython/0MQ protocol?
https://github.com/ibm-et/spark-kernel
On Mon, Dec 15, 2014 at 12:06 PM, Akhil Das ak...@sigmoidanalytics.com
wrote:
Hi Xiaoyong,
You could refer to this post if you are looking for how to run Spark jobs
Thanks Vincenzo.
Are you trying out all the models implemented in mllib? Actually, I don't
see decision tree there; sorry if I missed it. When are you planning to
merge this into the Spark branch?
Thanks
Sourabh
On Sun, Dec 14, 2014 at 5:54 PM, selvinsource [via Apache Spark User List]
Hi Albert,
There is some discussion going on here:
http://apache-spark-user-list.1001560.n3.nabble.com/MLLIB-model-export-PMML-vs-MLLIB-serialization-tc20324.html#a20674
I am also looking for this solution. But it looks like until mllib PMML export
is ready, there is no foolproof solution to export
The thing about MatrixFactorizationModel, compared to other models, is
that it is huge. It's not just a few coefficients, but whole RDDs of
coefficients. I think you could save these RDDs of user/product
factors to persistent storage, load them, then recreate the
MatrixFactorizationModel that way.
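(A minimal sketch of that suggestion -- the HDFS paths and rank value are hypothetical, and note that the MatrixFactorizationModel constructor is private[mllib] in some releases, so the last line may need a small wrapper placed in the org.apache.spark.mllib.recommendation package:)

import org.apache.spark.mllib.recommendation.MatrixFactorizationModel

// Persist the factor RDDs rather than the model object itself
model.userFeatures.saveAsObjectFile("hdfs:///models/als/userFeatures")
model.productFeatures.saveAsObjectFile("hdfs:///models/als/productFeatures")

// Later, reload the factors and rebuild the model
val rank = 10 // must match the rank used at training time
val userFeatures = sc.objectFile[(Int, Array[Double])]("hdfs:///models/als/userFeatures")
val productFeatures = sc.objectFile[(Int, Array[Double])]("hdfs:///models/als/productFeatures")
val restored = new MatrixFactorizationModel(rank, userFeatures, productFeatures)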
Hello,
I met a problem when using the Spark SQL CLI. A custom UDTF with lateral view
throws a ClassNotFoundException. I did a couple of experiments in the same
environment (Spark version 1.1.1):
select + same custom UDTF (Passed)
select + lateral view + custom UDTF (ClassNotFoundException)
select +
Thanks Sandy, passing --name works fine :)
Tomer
On Fri, Dec 12, 2014 at 9:35 AM, Sandy Ryza sandy.r...@cloudera.com wrote:
Hi Tomer,
In yarn-cluster mode, the application has already been submitted to YARN
by the time the SparkContext is created, so it's too late to set the app
name
Hi spark users,
Do you know how to access the rows of a row?
I have a SchemaRDD called user and registered it as a table with the
following schema:
root
 |-- user_id: string (nullable = true)
 |-- item: array (nullable = true)
 |    |-- element: struct (containsNull = false)
 |    |    |-- item_id:
Underneath the covers, jsonFile uses TextInputFormat, which will split
files correctly based on new lines. Thus, there is no fixed maximum size
for a json object (other than the fact that it must fit into memory on the
executors).
On Mon, Dec 15, 2014 at 7:22 AM, Madabhattula Rajesh Kumar
Can you add this information to the JIRA?
On Mon, Dec 15, 2014 at 10:54 AM, shenghua wansheng...@gmail.com wrote:
Hello,
I met a problem when using Spark sql CLI. A custom UDTF with lateral view
throws ClassNotFound exception. I did a couple of experiments in same
environment (spark version
scala> val items = Row(1 -> "orange", 2 -> "apple")
items: org.apache.spark.sql.catalyst.expressions.Row =
[(1,orange),(2,apple)]
If you literally want an iterator, then this:
scala> items.toIterator.count { case (user_id, name) => user_id == 1 }
res0: Int = 1
...else:
scala> items.count { case
Is it possible that you are starting more than one SparkContext in a single
JVM without stopping previous ones? I'd try testing with Spark 1.2, which
will throw an exception in this case.
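(A sketch of the test hygiene this implies, assuming ScalaTest:)

import org.apache.spark.{SparkConf, SparkContext}
import org.scalatest.{BeforeAndAfterEach, FunSuite}

class MyJobSuite extends FunSuite with BeforeAndAfterEach {
  @transient private var sc: SparkContext = _

  override def beforeEach(): Unit = {
    sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("test"))
  }

  override def afterEach(): Unit = {
    // Always stop the context so the next test can create a fresh one
    if (sc != null) { sc.stop(); sc = null }
  }
}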
On Mon, Dec 15, 2014 at 8:48 AM, Marius Soutier mps@gmail.com wrote:
Hi,
I’m seeing strange, random
Thanks tsingfu,
I used this configuration based on your post (with ganglia unicast mode):
# Enable GangliaSink for all instances
*.sink.ganglia.class=org.apache.spark.metrics.sink.GangliaSink
*.sink.ganglia.host=10.0.0.7
*.sink.ganglia.port=8649
*.sink.ganglia.period=15
You can use the default toString method to get the string
representation. If you want to customize it, check the indices/values
fields. -Xiangrui
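(A small sketch of both options on a SparseVector:)

import org.apache.spark.mllib.linalg.{SparseVector, Vectors}

val v = Vectors.sparse(5, Array(1, 3), Array(0.5, 2.0))
println(v) // default toString: (5,[1,3],[0.5,2.0])

// Customized output via the indices/values fields
val sv = v.asInstanceOf[SparseVector]
sv.indices.zip(sv.values).foreach { case (i, x) =>
  println(s"index $i -> value $x")
}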
On Fri, Dec 5, 2014 at 7:32 AM, debbie debbielarso...@hotmail.com wrote:
Basic question:
What is the best way to loop through one of these and print
Hi Mark,
Thank you for helping out.
The items I got back from Spark SQL have the following type information:
scala> items
res16: org.apache.spark.sql.Row = [WrappedArray([1,orange],[2,apple])]
I tried to iterate over the items as you suggested but no luck.
Best Regards,
Jerry
On Mon, Dec 15,
If you want to train offline and predict online, you can use the
current LR implementation to train a model and then apply
model.predict on the dstream. -Xiangrui
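(A minimal sketch of that split -- trainingData (an RDD[LabeledPoint]) and labeledStream (a DStream[LabeledPoint]) are assumed to exist:)

import org.apache.spark.mllib.classification.LogisticRegressionWithSGD

// Offline: train on a static RDD
val model = LogisticRegressionWithSGD.train(trainingData, 100)

// Online: score each micro-batch of the stream
labeledStream.map(p => (p.label, model.predict(p.features))).print()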
On Sun, Dec 7, 2014 at 6:30 PM, Nasir Khan nasirkhan.onl...@gmail.com wrote:
I am new to Spark.
Let's say I want to develop a machine
Is it possible that after filtering the feature dimension changed?
This may happen if you use LIBSVM format but didn't specify the number
of features. -Xiangrui
On Tue, Dec 9, 2014 at 4:54 AM, Sameer Tilak ssti...@live.com wrote:
Hi All,
I was able to run LinearRegressionWithSGD for a larger
Please check the number of partitions after sc.textFile. Use
sc.textFile('...', 8) to have at least 8 partitions. -Xiangrui
On Tue, Dec 9, 2014 at 4:58 AM, DB Tsai dbt...@dbtsai.com wrote:
You just need to use the latest master code without any configuration
to get performance improvement from
Could you post the full stacktrace? It seems to be some recursive call
in parsing. -Xiangrui
On Tue, Dec 9, 2014 at 7:44 PM, jishnu.prat...@wipro.com wrote:
Hi
I am getting a StackOverflowError:
Exception in thread "main" java.lang.StackOverflowError
Looks like you've got one more layer of containment than you intend -- i.e.
you've got Row[WrappedArray[Row[(Int, String)]]] where you want
Row[Row[(Int, String)]]. That's easy to do if somewhere along the line you
did something like `val row = Row(collection)` instead of `val row =
I'm a little confused by some of the responses. It seems like there are
two different issues being discussed here:
1. How to turn a sequential algorithm into something that works on Spark,
e.g. dealing with the fact that data is split into partitions which are
processed in parallel (though within a
On Sun, Dec 14, 2014 at 3:06 AM, Saurabh Agrawal
saurabh.agra...@markit.com wrote:
Hi,
I am a newbie in the Spark and Scala world.
I have been trying to implement collaborative filtering using MLlib, supplied
out of the box with Spark and Scala.
I have 2 problems:
1. The best
Unfortunately, it will depend on the Sorter API in 1.2. -Xiangrui
On Mon, Dec 15, 2014 at 11:48 AM, Bharath Ravi Kumar
reachb...@gmail.com wrote:
Hi Xiangrui,
The block size limit was encountered even with a reduced number of item blocks,
as you had expected. I'm wondering if I could try the
In case you must write C# code, you can call Python code from C# or use
IronPython. :)
On Mon, Dec 15, 2014 at 12:04 PM, Xiangrui Meng men...@gmail.com wrote:
On Sun, Dec 14, 2014 at 3:06 AM, Saurabh Agrawal
saurabh.agra...@markit.com wrote:
Hi,
I am a newbie in the Spark and Scala
Restored meant rebooting the slave node with an unchanged IP.
Funny thing is that for small files Spark works fine.
I checked Hadoop with HDFS also, and I'm able to run wordcount on it without
any problems (i.e. a file about 50GB in size).
Hi Muhammad,
Maybe next time you can use http://pastebin.com/ to format and paste
a cleaner Scala code snippet so others can help you more easily. Also,
please only paste the significant portion of the stack trace that causes
the issue instead of giant logs.
First of all, in your log, it seems that you
Just out of curiosity: did you manually apply this patch and see if
it actually resolves the issue? It seems that it was merged at
some point, but reverted because it caused some stability issues.
Sincerely,
DB Tsai
---
My Blog:
I am going to try to export decision trees next; so far I have focused on
linear models and k-means.
Regards,
Vincenzo
sourabh wrote
Thanks Vincenzo.
Are you trying out all the models implemented in mllib? Actually I don't
see decision tree there. Sorry if I missed it. When are you planning to
Adding group back.
FYI Geneis - this was on an m3.xlarge with all default settings in Spark. I
used Spark version 1.3.0.
The 2nd case did work for me:
>>> a = [1,2,3,4,5,6,7,8,9]
>>> b = []
>>> for x in range(100):
...     b.append(a)
...
>>> rdd1 = sc.parallelize(b)
>>> rdd1.first()
14/12/15 16:33:01 WARN
Hi all,
We are using Spark Streaming to ETL a large volume of time series datasets. In our
current design, each dataset we ETL will have a corresponding Spark Streaming
context + process running on our cluster. Each of these processes will be
passed configuration options specifying the data
Ok, maybe these test versions will help me then. I’ll check it out.
On 15.12.2014, at 22:33, Michael Armbrust mich...@databricks.com wrote:
Using a single SparkContext should not cause this problem. In the SQL tests
we use TestSQLContext and TestHive which are global singletons for all of
Hello, everyone
I know 'NumberFormatException' is due to a String that cannot be
parsed properly, but I really cannot find any mistake in my code. I hope
someone may kindly help me.
My hdfs file is as follows:
8,22
3,11
40,10
49,47
48,29
24,28
50,30
33,56
4,20
30,38
...
So each line
I've tried some additional experiments with kmeans and I finally got it to
work as I expected. In fact, the number of partitions is critical. I had a
data set of 24x784 with 12 partitions. In this case the kmeans
algorithm took a very long time (hours) to converge. When I changed
the
Hi Adam,
I have following scala actor based code to do graceful shutdown:
class TimerActor (val timeout : Long, val who : Actor) extends Actor {
  def act {
    reactWithin (timeout) {
      case TIMEOUT => who ! SHUTDOWN
    }
  }
}
class SSCReactor (val ssc :
That certainly looks surprising. Are you sure there are no unprintable
characters in the file?
On Mon, Dec 15, 2014 at 9:49 PM, yu yuz1...@iastate.edu wrote:
The exception info is:
14/12/15 15:35:03 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 0.0
(TID 0, h3):
Yes, this is what I also found in the Spark documentation: that foreach can
have side effects. Nevertheless I have this weird error, that sometimes
files are just empty.
"using" is simply a wrapper that takes our code, does try-catch-finally
and flushes/closes all resources.
I honestly have no clue
Considering that any task could be launched concurrently on
different nodes, in order to make sure the generated files are
valid, you need some atomic operation (such as rename). For
example, you could generate a random name for the output file, write the
data into it, and rename it to
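(A sketch of that write-then-rename pattern with the Hadoop FileSystem API; the helper name is hypothetical:)

import java.util.UUID
import org.apache.hadoop.fs.{FileSystem, Path}

// Write to a uniquely named temp file, then rename into place; rename is
// atomic on HDFS, so readers never observe a half-written file.
def writeAtomically(fs: FileSystem, finalPath: Path, bytes: Array[Byte]): Unit = {
  val tmp = new Path(finalPath.getParent, "." + UUID.randomUUID() + ".tmp")
  val out = fs.create(tmp)
  try out.write(bytes) finally out.close()
  if (!fs.rename(tmp, finalPath)) fs.delete(tmp, false)
}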
Hi Imran, you are right: sequential processing does not make sense for Spark.
I think sequential processing works if the batch for each iteration is large
enough (this batch could be processed in parallel).
My point is that we should not run mini-batches in parallel, but it is still
possible to use
This still seems to be broken. In 1.1.1, it errors immediately on this line
(from the above repro script):
liveTweets.map(t => noop(t)).print()
The stack trace is:
org.apache.spark.SparkException: Task not serializable
at
Hi Yu,
Try this :
val data = csv.map( line => line.split(",").map(elem => elem.trim)) //lines
in rows
data.map( rec => (rec(0).toInt, rec(1).toInt))
to convert them to integers.
On 16 December 2014 at 10:49, yu [via Apache Spark User List]
ml-node+s1001560n20694...@n3.nabble.com wrote:
Hello,
Thank you Peter for the clarification.
Regards,
Rajesh
On Tue, Dec 16, 2014 at 12:42 AM, Michael Armbrust mich...@databricks.com
wrote:
Underneath the covers, jsonFile uses TextInputFormat, which will split
files correctly based on new lines. Thus, there is no fixed maximum size
for a json
Hi Pala,
Spark executors only reserve spark.storage.memoryFraction (default 0.6) of
their spark.executor.memory for caching RDDs. The Spark UI displays this
fraction.
spark.executor.memory controls the executor heap size.
spark.yarn.executor.memoryOverhead controls the extra that's tacked on
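(The message is cut off; an illustrative combination of those three settings -- the values here are examples, not recommendations:)

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.executor.memory", "8g") // executor heap size
  .set("spark.storage.memoryFraction", "0.6") // share of heap reserved for cached RDDs
  .set("spark.yarn.executor.memoryOverhead", "1024") // extra MB YARN adds to the container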
While I was running a Spark MR job, there was FetchFailed(BlockManagerId(47,
xx.com, 40975, 0), shuffleId=2, mapId=5, reduceId=286); then there
were many retries, and the job finally failed.
The log showed the following error. Does anybody meet this error, or is
it a known issue in
Ok. We'll try using it in a test cluster running 1.2.
On 16-Dec-2014 1:36 am, Xiangrui Meng men...@gmail.com wrote:
Unfortunately, it will depend on the Sorter API in 1.2. -Xiangrui
On Mon, Dec 15, 2014 at 11:48 AM, Bharath Ravi Kumar
reachb...@gmail.com wrote:
Hi Xiangrui,
The block size
Hi Aniket,
Thanks for your reply.
I followed your advice and modified my code.
Here is latest one.
https://github.com/TomoyaIgarashi/spark_cluster_sample/commit/ce7613c42d3adbe6ae44e264c11f3829460f3c35
As a result, it works correctly! Thank you very much.
But an AssociationError message appears
Is that possible, and if not, how would one do it from PySpark?
This probably does not make sense in most cases, but I am writing a script
where my job involves downloading and pushing data into Cassandra.
Sometimes a task hangs forever, and I don't really mind killing it. The job
is not actually
Thanks all for your information! What Pietro mentioned seems to be the
appropriate solution. I also found a slide deck talking about it:
http://www.slideshare.net/EvanChan2/spark-summit-2014-spark-job-server-talk
Several quick questions:
1. Is it already available in the Spark main branch? (seems
Seems you are using standalone mode. Can you check the Spark worker logs or
application logs in the Spark work directory to find any errors?
On Tue, Dec 16, 2014, 9:09 AM Tomoya Igarashi
tomoya.igarashi.0...@gmail.com wrote:
Hi Aniket,
Thanks for your reply.
I followed your advice and modified my
There is a Spark listener interface
https://spark.apache.org/docs/1.0.0/api/scala/index.html#org.apache.spark.scheduler.SparkListener
which can be used to receive events like jobStarted, taskGotResults, etc., but I
don't think you can set an execution time anywhere. If a task is hung, it's
mostly because
You could try setting the following while creating the sparkContext:
.set("spark.rdd.compress", "true")
.set("spark.storage.memoryFraction", "1")
.set("spark.core.connection.ack.wait.timeout", "600")
.set("spark.akka.frameSize", "50")
Thanks
Best Regards
On Tue, Dec 16, 2014 at 8:30 AM, Mars Max
There could be some other character like a space or ^M etc. You could try
the following and see the actual row.
val newstream = datastream.map(row => {
  try {
    val strArray = row.trim().split(",")
    (strArray(0).toInt, strArray(1).toInt)
//Instead try this
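(The snippet above is cut off; a minimal completed sketch of the same idea -- the (-1, -1) sentinel is a hypothetical placeholder, not the original code:)

val newstream = datastream.map(row => {
  try {
    val strArray = row.trim().split(",")
    (strArray(0).toInt, strArray(1).toInt)
  } catch {
    case e: NumberFormatException =>
      println("Could not parse row: '" + row + "'") // reveals stray characters
      (-1, -1) // filter these out downstream
  }
})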
Thanks for the response.
Yes, I am using standalone mode.
I couldn't find any errors, but WARN messages appear in the Spark master
logs.
Here are the Spark master logs:
https://gist.github.com/TomoyaIgarashi/72145c11d3769c7d1ddb
FYI
Here are the Spark worker logs.
Hi
I have installed a standalone Spark setup on a Linux
server and I am trying to access that Spark setup from Java on Windows. When
I try connecting to Spark I see the following exception:
14/12/16 12:52:52 WARN TaskSchedulerImpl: Initial job has not accepted any
resources;