These questions have been Scala questions, not Spark questions. It's
better to look for answers on the internet or on discussion groups
devoted to Scala. StackOverflow is good, for example.
An array is indexed by integers, not strings. It's not even clear what
you intend here.
On Tue, Sep 9,
This structure is not specific to Hadoop, but in theory works in any
JAR file. You can put JARs in JARs and refer to them with Class-Path
entries in META-INF/MANIFEST.MF.
It works but I have found it can cause trouble with programs that
query the JARs on the classpath to find other classes. When
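For reference, the Class-Path mechanism mentioned above looks like this in a manifest (entry names are illustrative). Note that the standard Java classloader resolves Class-Path entries as paths relative to the containing JAR's location on disk, so truly nested JAR-in-JAR loading needs a custom classloader (e.g. what One-JAR provides):

```text
Manifest-Version: 1.0
Main-Class: com.example.Main
Class-Path: lib/dep-a.jar lib/dep-b.jar
```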
At 2014-09-05 12:13:18 +0200, Yifan LI iamyifa...@gmail.com wrote:
But how to assign the storage level to a new vertices RDD that mapped from
an existing vertices RDD,
e.g.
val newVertexRDD =
graph.collectNeighborIds(EdgeDirection.Out).map{ case (id: VertexId,
a: Array[VertexId]) => (id,
I want to implement the following logic:
val stream = getFlumeStream() // a DStream
if (size_of_stream > 0) // if the DStream contains some RDD
  stream.someTransformation
stream.count() can figure out the number of RDDs in a DStream, but it returns
a DStream[Long] and can't be compared with a
VisualVM is free and is enough in most situations
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/How-to-profile-a-spark-application-tp13684p13770.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
Hi Xiangrui,
For tall skinny matrices, if I can pass a similarityMeasure to
computeGramian, I could re-use the SVD's computeGramian for similarity
computation as well...
Do you recommend using this approach for tall skinny matrices, or just using
the dimsum routines?
Right now RowMatrix does
What's the type of the key?
If the hash of the key differs across slaves, then you could get these confusing
results. We met similar results in Python, because the hash of None
is different across machines.
Davies
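For context on the workaround: PySpark avoids this by hashing keys with a deterministic function rather than relying on the builtin hash of special values like None. A simplified, illustrative sketch of the idea (not PySpark's actual implementation):

```python
def portable_hash(x):
    """Deterministic stand-in for hash() on values whose builtin hash
    can differ across interpreters/machines (e.g. None)."""
    if x is None:
        return 0
    if isinstance(x, tuple):
        h = 0x345678
        for item in x:
            h = (h * 1000003) ^ portable_hash(item)  # combine deterministically
        return h
    return hash(x)

# Keys route to a stable partition on every machine:
partition_of_none = portable_hash(None) % 4  # always 0
```

With the builtin hash, two executors could disagree about which partition holds None, so a groupBy would silently split one logical key across partitions.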
On Mon, Sep 8, 2014 at 8:16 AM, redocpot julien19890...@gmail.com wrote:
Hi,
I had been using Mahout's Naive Bayes algorithm to classify document data.
For a specific train and test set, I was getting accuracy in the range of
86%. When I shifted to Spark's MLlib, the accuracy dropped to the vicinity
of 82%.
I am using the same version of Lucene and the same logic to generate
Hi,
Is there any specific scenario which needs to know the number of RDDs in the
DStream? To my knowledge, a DStream will generate one RDD in each
batchDuration; some old RDDs will be remembered for windowing-like functions, and
will be removed when useless. The hashmap generatedRDDs
Hi Jerry,
Thanks for your reply.
I use Spark Streaming to receive the Flume stream; then I need to make a
judgement in each batchDuration: if the received stream has data, then I
should do something; if there is no data, do the other thing. I thought
count() could give me that measure, but it returns
How about calling foreachRDD, processing whatever data is in each
RDD normally, and also keeping track within the foreachRDD function of
whether any RDD had a count() > 0? If not, then you can execute your
alternate logic at the end for the no-data case. I don't think you
want to operate at
Hi,
I think every received stream will generate an RDD in each batch duration even
if there is no data fed in (an empty RDD will be generated). So you cannot use
the number of RDDs to judge whether any data was received.
One way is to do this in DStream/foreachRDD(), like
a.foreachRDD { r =>
if
Thanks all,
yes, I did use foreachRDD; the following is my code:
var count = -1L // a global variable in the main object
val currentBatch = some_DStream
val countDStream = currentBatch.map(o => {
  count = 0L // reset the count variable in each batch
o
})
Hi,
I had the same issue in my Java code while I was trying to connect to a
locally hosted spark server (using sbin/start-all.sh etc) using an IDE
(IntelliJ).
I packaged my app into a jar and used spark-submit (in bin/) and it worked!
Hope this helps
Rgds
Hi Deb,
Did you mean to message me instead of Xiangrui?
For TS matrices, dimsum with positiveinfinity and computeGramian have the
same cost, so you can do either one. For dense matrices with say, 1m
columns this won't be computationally feasible and you'll want to start
sampling with dimsum.
It
I'm sorry, I had an error in my code; updated here:
var count = -1L // a global variable in the main object
val currentBatch = some_DStream
val countDStream = currentBatch.map(o => {
count = 0L // reset the count variable in each batch
o
})
countDStream.foreachRDD(rdd => count
Can you provide a small sample or test data that reproduces this problem? And
what's your env setup, single node or cluster?
Sent from my iPhone
On September 8, 2014, at 22:29, redocpot julien19890...@gmail.com wrote:
Hi,
I have a key-value RDD called rdd below. After a groupBy, I tried to count
If you take into account what streaming means in spark, your goal doesn't
really make sense; you have to assume that your streams are infinite and
you will have to process them until the end of days. Operations on a
DStream define what you want to do with each element of each RDD, but spark
I think you should clarify some things in Spark Streaming:
1. The closure in map runs on the remote side, so modifying the count var will only
take effect on the remote side. You will always get -1 on the driver side.
2. Some code in the closure in foreachRDD is lazily executed in each batch
duration, while
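Point 1 above can be illustrated outside Spark; a minimal Python sketch, with a forked worker process standing in for a remote executor (illustrative only, not Spark code):

```python
import multiprocessing

count = -1  # "driver-side" global, like the count var in this thread

def work(x):
    global count
    count = 0  # takes effect only in the worker's copy of the global
    return x * 2

# fork() gives each worker a copy-on-write snapshot of the parent's globals
ctx = multiprocessing.get_context("fork")
with ctx.Pool(2) as pool:
    results = pool.map(work, [1, 2, 3])

# results == [2, 4, 6], but count is still -1 here in the parent:
# the workers mutated their own copies, never the parent's.
```

This is exactly why the driver always sees -1: the closure's writes happen in the executor JVMs, not in the driver.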
Yes, I agree to transform directly on the DStream even if there is no data injected
in this batch duration.
While my situation is:
Spark receives the Flume stream continuously, and I use the updateStateByKey
function to collect data for a key across several batches; then I will handle
the collected data after
To help anyone answer, I can say that I checked the
inactiveIDs.filter operation separately, and found that it doesn't return
null in any case. In addition I don't know how to handle (or check) whether an RDD
is null. I find the debugging too complicated to pinpoint the error. Any ideas
how to
My apologies to the list. I replied to Manu's question and it went directly
to him rather than the list.
In case anyone else has this issue here is my reply and Manu's reply to me.
This also answers Ian's question.
---
Hi Manu,
The dataset is 7.5 million
The reason I think it's the number of files is that I believe
all of those files, or a large part of them, are read when
you run sqlContext.parquetFile(), and the time it would
take on S3 for that to happen is a lot, so something
internally is timing out...
I'll create the parquet files with Drill
Hi,
I tried running the classification program on the famous newsgroup data.
This had an even more drastic effect on the accuracy, as it dropped from
~82% in Mahout to ~72% in Spark MLlib.
Please help me in this regard as I have to use Spark in a production system
very soon and this is a blocker
I want to call a function for batches of elements from an rdd
val javaClass: org.apache.spark.api.java.function.Function[Seq[String], Unit] =
  new JavaClass()
rdd.mapPartitions(_.grouped(5)).foreach(javaClass)
1. This worked fine in Spark 0.9.1; when we upgraded to Spark 1.0.2,
Function changed
You're mixing the Java and Scala APIs here. Your call to foreach() is
expecting a Scala function and you're giving it a Java Function.
Ideally you just use the Scala API, of course. Before explaining how
to actually use a Java function here, maybe clarify that you have to
do it and can't use Scala
Thank you for your replies.
More details here:
The program is executed in local mode (single node). Default env params are
used.
The test code and the result are in this gist:
https://gist.github.com/coderh/0147467f0b185462048c
Here are the first 10 lines of the data: 3 fields per row, the delimiter
Cool...can I add loadRowMatrix in your PR ?
Thanks.
Deb
On Tue, Sep 9, 2014 at 1:14 AM, Reza Zadeh r...@databricks.com wrote:
Hi Deb,
Did you mean to message me instead of Xiangrui?
For TS matrices, dimsum with positiveinfinity and computeGramian have the
same cost, so you can do either
If you are using the Mahout's Multinomial Naive Bayes, it should be
the same as MLlib's. I tried MLlib with news20.scale downloaded from
http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/multiclass.html
and the test accuracy is 82.4%. -Xiangrui
On Tue, Sep 9, 2014 at 4:58 AM, jatinpreet
Better to do it in a PR of your own, it's not sufficiently related to dimsum
On Tue, Sep 9, 2014 at 7:03 AM, Debasish Das debasish.da...@gmail.com
wrote:
Cool...can I add loadRowMatrix in your PR ?
Thanks.
Deb
On Tue, Sep 9, 2014 at 1:14 AM, Reza Zadeh r...@databricks.com wrote:
Hi Deb,
Okay,
This seems to be either a code version issue or a communication issue. It
works if I execute the spark shell from the master node. It doesn't work if
I run it from my laptop and connect to the master node.
I had opened the ports for the WebUI (8080) and the cluster manager (7077)
for the
Hi ,
I come from a map/reduce background and am trying to do a quite trivial thing:
I have a lot of files ( on hdfs ) - format is :
1 , 2 , 3
2 , 3 , 5
1 , 3, 5
2, 3 , 4
2 , 5, 1
I actually need to group by key (the first column):
key values
1 -- (2,3),(3,5)
2 --
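The expected result of that grouping can be sketched in plain Python (an illustration of what Spark's groupByKey would produce for this sample, not Spark code):

```python
from collections import defaultdict

# The sample rows from the question, with their irregular spacing.
lines = [
    "1 , 2 , 3",
    "2 , 3 , 5",
    "1 , 3, 5",
    "2, 3 , 4",
    "2 , 5, 1",
]

grouped = defaultdict(list)
for line in lines:
    key, a, b = [field.strip() for field in line.split(",")]
    grouped[key].append((a, b))

# grouped["1"] == [("2", "3"), ("3", "5")]
# grouped["2"] == [("3", "5"), ("3", "4"), ("5", "1")]
```

In Spark this corresponds to something like `sc.textFile(...).map(parse).groupByKey()` (parse being your own splitter); when a reduction is enough, reduceByKey or aggregateByKey shuffles less data than groupByKey.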
I am currently working on a machine learning project, which requires the RDDs'
content to be (mostly partially) updated during each iteration. Because the
program will be converted directly from traditional Python object-oriented
code, the content of the RDD will be modified in the mapping function.
Yes, that's how file: URLs are interpreted everywhere in Spark. (It's also
explained in the link to the docs I posted earlier.)
The second interpretation below is local: URLs in Spark, but that doesn't
work with Yarn on Spark 1.0 (so it won't work with CDH 5.1 and older
either).
On Mon, Sep 8,
Hi,
val test = persons.value
  .map{ tuple => (tuple._1, tuple._2
    .filter{ event => *inactiveIDs.filter(event2 => event2._1 ==
      tuple._1).count() != 0* })}
Your problem is right between the asterisks. You can't run an RDD operation
inside another RDD operation, because RDDs can't be serialized.
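A common fix (a hedged sketch with made-up stand-in data, not the poster's code): collect the small RDD's keys to the driver and ship them into the closure as a plain set, typically via sc.broadcast, so the inner RDD operation becomes a local membership test. In plain Python terms:

```python
# Stand-ins for the two datasets; in Spark, inactive_keys would come from
# something like inactiveIDs.map(lambda t: t[0]).collect(), then broadcast.
inactive_keys = {"a", "b"}

persons = [("a", [("a", "x"), ("c", "y")]),
           ("c", [("c", "z")])]

# The original filter condition depends only on the person's key, so it
# reduces to: keep the events iff the key is in the inactive set.
test = [(k, [e for e in events if k in inactive_keys])
        for k, events in persons]

# test == [("a", [("a", "x"), ("c", "y")]), ("c", [])]
```

This works whenever the inner RDD is small enough to collect; otherwise a join on the key is the scalable alternative.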
Which version of Spark are you using?
This bug was fixed in 0.9.2, 1.0.2 and 1.1; could you upgrade to
one of these versions to verify it?
Davies
On Tue, Sep 9, 2014 at 7:03 AM, redocpot julien19890...@gmail.com wrote:
Thank you for your replies.
More details here:
The prog is
On Tue, Sep 9, 2014 at 10:07 AM, Boxian Dong box...@indoo.rs wrote:
I currently working on a machine learning project, which require the RDDs'
content to be (mostly partially) updated during each iteration. Because the
program will be converted directly from traditional python object-oriented
Thanks for the information Xiangrui. I am using the following example to
classify documents.
http://chimpler.wordpress.com/2014/06/11/classifiying-documents-using-naive-bayes-on-apache-spark-mllib/
I am not sure if this is the best way to convert textual data into vectors.
Can you please confirm
I have also run some tests on the other algorithms available in MLlib but
got dismal accuracy. Is the method of creating a LabeledPoint RDD different
for other algorithms, such as LinearRegressionWithSGD?
Any help is appreciated.
-
Novice Big Data Programmer
On Tue, Sep 9, 2014 at 9:56 AM, Oleg Ruchovets oruchov...@gmail.com wrote:
Hi ,
I came from map/reduce background and try to do quite trivial thing:
I have a lot of files ( on hdfs ) - format is :
1 , 2 , 3
2 , 3 , 5
1 , 3, 5
2, 3 , 4
2 , 5, 1
I am actually need
Hello Diana,
How can I include this implementation in my code, in terms of starting this
task together with NetworkWordCount?
In my case, I have a directory with several files.
Then,
I include this line:
StreamingDataGenerator.streamingGenerator(NetPort, BytesSecond, DirFiles)
But the program
spark-submit is a script which calls spark-class script. Can you output the
command that spark-class runs (say, by putting set -x before the very last
line?). You should see the java command that is being run. The scripts do
some parameter setting so it's possible you're missing something. It
If that library has native dependencies you'd need to make sure that the
native code is on all boxes and in the path with export
SPARK_LIBRARY_PATH=...
On Tue, Sep 9, 2014 at 10:17 AM, ayandas84 ayanda...@gmail.com wrote:
We have a small apache spark cluster of 6 computers. We are trying to
I run this code separately, but the program blocks at this line:
val socket = listener.accept()
Do you have any suggestion?
Thanks
I am running Spark on Yarn with the HDP 2.1 technical preview. I'm having
issues getting the spark history server permissions to read the spark event
logs from hdfs. Both sides are configured to write/read logs from:
hdfs:///apps/spark/events
The history server is running as user spark, the
I figured out this issue (in our case)... and I'll vent a little in my reply
here... =:) Fedora's well-intentioned firewall (firewall-cmd) requires you to
open (enable) any ports/services on a host that you need to connect to
(including SSH/22, which is enabled by default, of course). So when
Hey - I have a Spark 1.0.2 job (using spark-streaming-kafka) that runs
successfully using master = local[4]. However, when I run it on a Hadoop 2.2
EMR cluster using master yarn-client, it fails after running for about 5
minutes. My main method does something like this:
1. gets streaming
I finally seem to have gotten past this issue. Here’s what I did:
* rather than using the binary distribution, I built Spark from scratch to
eliminate the 4.1 version of org.apache.httpcomponents from the assembly
* git clone https://github.com/apache/spark.git
* cd spark
On September 25-26, SF Scala teams up with Adam Gibson, the creator of
deeplearning4j.org, to teach the first ever Distributed Deep Learning
with Scala, Akka, and Spark workshop. Deep Learning is enabling
break-through advances in the areas such as image recognition and
natural language
Hi,
In Spark website, there’s a plan to support HiveQL on top of Spark SQL and also
to support JDBC/ODBC.
I would like to know if in this “HiveQL” supported by Spark (or Spark SQL
accessible through JDBC/ODBC), is there a plan to add extensions to leverage
different Spark features like
This has all the symptoms of Yarn killing your executors due to them
exceeding their memory limits. Could you check your RM/NM logs to see
if that's the case?
(The error was because of an executor at
domU-12-31-39-0B-F1-D1.compute-1.internal, so you can check that NM's
log file.)
If that's the
The executors of my spark streaming application are being killed due to
memory issues. The memory consumption is quite high on startup because it is
the first run and there are quite a few events on the kafka queues that are
consumed at a rate of 100K events per sec.
I wonder if it's recommended to
I'm running on Yarn with relatively small instances with 4gb memory. I'm not
caching any data but when the map stage ends and shuffling begins all of the
executors request the map output locations at the same time which seems to
kill the driver when the number of executors is turned up.
For
Hi,
Yes, this is a problem, and I'm not aware of any simple workarounds
(or complex one for that matter). There are people working to fix
this, you can follow progress here:
https://issues.apache.org/jira/browse/SPARK-1239
On Tue, Sep 9, 2014 at 2:54 PM, jbeynon jbey...@gmail.com wrote:
I'm
Thanks Marcelo, that looks like the same thing. I'll follow the Jira ticket
for updates.
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/Yarn-Driver-OOME-Java-heap-space-when-executors-request-map-output-locations-tp13827p13829.html
Sent from the Apache
The node manager log looks like this - not exactly sure what this means, but
the container messages seem to indicate there is still plenty of memory.
2014-09-09 21:47:00,718 INFO
org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl:
Memory usage of
Hey,
If you are interested in more details there is also a thread about this
issue here:
http://apache-spark-developers-list.1001551.n3.nabble.com/Eliminate-copy-while-sending-data-any-Akka-experts-here-td7127.html
Kostas
On Tue, Sep 9, 2014 at 3:01 PM, jbeynon jbey...@gmail.com wrote:
Thanks
Hi Greg,
SPARK_EXECUTOR_INSTANCES is the total number of workers in the cluster. The
equivalent spark.executor.instances is just another way to set the same
thing in your spark-defaults.conf. Maybe this should be documented. :)
spark.yarn.executor.memoryOverhead is just an additional margin
Hi, users
1. Disk-based cache eviction policy? The same LRU?
2. What is the scope of a cached RDD? Does it survive the application? What
happens if I run the Java app next time? Will the RDD be created again or read from cache?
If , answer is YES, then ...
3. Is there are any way to invalidate cached RDD
When I stop Spark Streaming Context by calling stop(), I always get the
following error:
ERROR Deregistered receiver for stream 0: Stopped by driver
class=org.apache.spark.streaming.scheduler.ReceiverTracker
WARN Stopped executor without error
Hi Sandeep,
would you be interested in joining my open source project?
https://github.com/tribbloid/spookystuff
IMHO Spark is indeed not for general-purpose crawling, for which the distributed
jobs are highly homogeneous. But it is good enough for directional scraping, which
involves heterogeneous input and
Hi,
I'm trying to execute Spark SQL queries on top of the AccumuloInputFormat.
Not sure if I should be asking on the Spark list or the Accumulo list, but
I'll try here. The problem is that the workload to process SQL queries
doesn't seem to be distributed across my cluster very well.
My Spark
Hi,
I want to use the sparksql thrift server in my application and make sure
everything is loading and working. I built Spark 1.1 SNAPSHOT and ran the
thrift server using ./sbin/start-thrift-server. In my application I load
tables into schemaRDDs and I expect that the thrift-server should pick
I'm mostly interested in the hbase examples in the repo. I saw two examples
hbase_inputformat.py and hbase_outputformat.py in the 1.1 branch. Can you
show me how to run them?
The compile step is done. I tried to run the examples, but failed.
Hi Yana -
I added the following to spark-class:
echo RUNNER: $RUNNER
echo CLASSPATH: $CLASSPATH
echo JAVA_OPTS: $JAVA_OPTS
echo '$@': $@
Here's the output:
$ ./spark-submit --class experiments.SimpleApp --master
spark://myhost.local:7077
Has anything changed in the last 30 days w.r.t serialization? I had 620MB
of compressed data which used to get serialized-in-spark-memory with 4GB
executor memory. Now it fails to get serialized in memory even at 10GB of
executor memory.
-- Bharath
I ran the SimpleApp program from spark tutorial
(https://spark.apache.org/docs/1.0.0/quick-start.html), which works fine.
However, if I change the file location from local to hdfs, then I get an
EOFException.
I did some search online which suggests this error is caused by hadoop
version
Your tables were registered in the SqlContext, whereas the thrift server
works with HiveContext. They seem to be in two different worlds today.
On 9/9/14, 5:16 PM, alexandria1101 alexandria.shea...@gmail.com wrote:
Hi,
I want to use the sparksql thrift server in my application and make sure
What do you mean by “control your input”? Are you trying to pace your Spark
streaming by the number of words? If so, that is not supported as of now; you can
only control the time, consuming all files within that time period.
--
Regards,
Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
Add something like the following to spark-env.sh:
export LD_LIBRARY_PATH=<dir containing libmosekjava7_0.so>:$LD_LIBRARY_PATH
(and remove all 5 exports you listed). Then restart all worker nodes, and try
again.
Good luck!