Add something like the following to spark-env.sh:
export LD_LIBRARY_PATH=:$LD_LIBRARY_PATH
(and remove all 5 exports you listed). Then restart all worker nodes, and
try
again.
Good luck!
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/How-to-set-java-library-pa
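As a quick sanity check (my own suggestion, not from the original thread), after
restarting the workers you can print the library path each executor JVM actually
sees, e.g. from spark-shell:

  // illustrative only: collect java.library.path from the executors
  sc.parallelize(1 to 100, 10)
    .map(_ => System.getProperty("java.library.path"))
    .distinct()
    .collect()
    .foreach(println)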
What do you mean by "control your input"? Are you trying to pace your Spark
Streaming by number of words? If so, that is not supported as of now; you can
only control the time interval and consume all files within that time period.
--
Regards,
Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
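For reference, a minimal sketch of that time-based control (batch interval and
input directory are placeholders, not from the thread):

  import org.apache.spark.SparkConf
  import org.apache.spark.streaming.{Seconds, StreamingContext}

  val conf = new SparkConf().setAppName("PacingSketch")
  val ssc = new StreamingContext(conf, Seconds(10)) // the only pacing knob: the batch interval
  val lines = ssc.textFileStream("hdfs:///some/input/dir") // hypothetical directory
  lines.count().print() // everything that arrived in the 10s window gets consumed
  ssc.start()
  ssc.awaitTermination()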
You need to run mvn install so that the package you built is put into the
local maven repo. Then when compiling your own app (with the right
dependency specified), the package will be retrieved.
On 9/9/14, 8:16 PM, "alexandria1101" wrote:
>I think the package does not exist because I need to c
Hi Mayur,
Thanks for your response. I did write a simple test that set up a DStream
with
5 batches; the batch duration is 1 second, and the 3rd batch takes an extra
2 seconds. The output of the test shows that the 3rd batch causes a backlog,
and Spark Streaming does catch up on the 4th and 5th batches (
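For anyone who wants to reproduce this, here is a rough sketch of such a test
using queueStream (my own reconstruction, not the original test code):

  import scala.collection.mutable
  import org.apache.spark.{SparkConf, SparkContext}
  import org.apache.spark.streaming.{Seconds, StreamingContext}

  val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("backlog-test"))
  val ssc = new StreamingContext(sc, Seconds(1))

  val rddQueue = mutable.Queue[org.apache.spark.rdd.RDD[Int]]()
  (1 to 5).foreach(i => rddQueue += sc.parallelize(Seq(i)))
  val stream = ssc.queueStream(rddQueue, oneAtATime = true)

  var batch = 0
  stream.foreachRDD { rdd =>
    batch += 1                          // foreachRDD runs on the driver
    if (batch == 3) Thread.sleep(2000)  // make the 3rd batch slow to create a backlog
    println(s"batch $batch: ${rdd.count()} records at ${System.currentTimeMillis()}")
  }
  ssc.start()
  ssc.awaitTermination()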
Hi,
I am working on a 3-machine Cloudera cluster. Whenever I submit a Spark job
as a jar file with a native dependency on Mosek, it shows the following error:
java.lang.UnsatisfiedLinkError: no mosekjava7_0 in java.library.path
How should I set java.library.path? I printed the environment varia
Thanks for your response. I do have something like:
val inputDStream = ...
val keyedDStream = inputDStream.map(...) // use sensorId as key
val partitionedDStream = keyedDStream.transform(rdd => rdd.partitionBy(new
MyPartitioner(...)))
val stateDStream = partitionedDStream.updateStateByKey[...](ud
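For what it's worth, updateStateByKey also has an overload that takes a
partitioner directly, so the separate transform step may not be needed. A
sketch with made-up types (sensorId as key, a running sum as state; the
inputDStream is the one from your snippet):

  import org.apache.spark.HashPartitioner
  import org.apache.spark.streaming.StreamingContext._
  import org.apache.spark.streaming.dstream.DStream

  // names and types here are illustrative, not from the original post
  val keyed: DStream[(String, Double)] = inputDStream.map(m => (m.sensorId, m.value))

  val updateFn: (Seq[Double], Option[Double]) => Option[Double] =
    (values, state) => Some(state.getOrElse(0.0) + values.sum)

  val stateDStream = keyed.updateStateByKey[Double](updateFn, new HashPartitioner(100))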
Using your own partitioner didn't work?
e.g.
YourRDD.partitionBy(new HashPartitioner(your number))
xj @ Tokyo
On Wed, Sep 10, 2014 at 12:03 PM, qihong wrote:
> I'm working on a DStream application. The input are sensors' measurements,
> the data format is
>
> There are 10 thousands sensors,
I think the package does not exist because I need to change the pom file:
<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-assembly_2.10</artifactId>
  <version>1.0.1</version>
  <type>pom</type>
  <scope>provided</scope>
</dependency>
I changed the version number to 1.1.1, yet that still causes the build
error:
Failure to find org.apache.spark:spark-assembly_2.10:pom:1.1.1 in
http
I'm working on a DStream application. The input is sensors' measurements;
the data format is
There are 10 thousand sensors, and updateStateByKey is used to maintain
the states of the sensors; the code looks like the following:
val inputDStream = ...
val keyedDStream = inputDStream.map(...) // use s
Hi again,
On Tue, Sep 9, 2014 at 2:20 PM, Tobias Pfeiffer wrote:
>
> On Tue, Sep 9, 2014 at 2:02 PM, Ron's Yahoo! wrote:
>>
>> For example, let’s say there’s a particular topic T1 in a Kafka queue.
>> If I have a new set of requests coming from a particular client A, I was
>> wondering if I co
I've encountered probably the same problem and just figured out the solution.
The error was caused because Spark tried to write to the scratch directory
but the path didn't exist.
It's likely you are running the app on the master node only. In the
spark-ec2 setting, the scratch directory for Spar
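If that is the cause, one hedged workaround is to point the scratch directory
at a path that exists and is writable on every node (path below is a
placeholder):

  import org.apache.spark.SparkConf

  val conf = new SparkConf()
    .setAppName("my-app")
    .set("spark.local.dir", "/mnt/spark") // must exist and be writable on all nodes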
Thanks so much!
That makes complete sense. However, when I compile I get an error "package
org.apache.spark.sql.hive does not exist."
Does anyone else have this and any idea why this might be so?
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/Table-not-f
Hi Luis,
The parameters “spark.cleaner.ttl” and “spark.streaming.unpersist” can be used
to remove stale, timed-out streaming data. The difference is that
“spark.cleaner.ttl” is a time-based cleaner: it cleans not only streaming
input data but also Spark’s other useless metadata; while
“spark.stream
Your tables were registered in the SqlContext, whereas the thrift server
works with HiveContext. They seem to be in two different worlds today.
On 9/9/14, 5:16 PM, "alexandria1101" wrote:
>Hi,
>
>I want to use the sparksql thrift server in my application and make sure
>everything is loading an
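To make the distinction concrete (a sketch on my part, not from this thread):
tables registered on a plain SQLContext live only inside that context, while
the separately started Thrift server is backed by its own HiveContext and
reads from the Hive metastore.

  import org.apache.spark.sql.SQLContext
  import org.apache.spark.sql.hive.HiveContext

  val sqlContext  = new SQLContext(sc)
  val hiveContext = new HiveContext(sc)

  val events = sqlContext.jsonFile("hdfs:///data/events.json") // hypothetical input
  events.registerTempTable("events") // visible only to this SQLContext, not the Thrift server

  // tables created through Hive DDL land in the metastore, which the Thrift server reads
  hiveContext.sql("CREATE TABLE IF NOT EXISTS events_hive (id INT, name STRING)")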
I ran the SimpleApp program from spark tutorial
(https://spark.apache.org/docs/1.0.0/quick-start.html), which works fine.
However, if I change the file location from local to hdfs, then I get an
EOFException.
I did some search online which suggests this error is caused by hadoop
version conflic
Has anything changed in the last 30 days w.r.t serialization? I had 620MB
of compressed data which used to get serialized-in-spark-memory with 4GB
executor memory. Now it fails to get serialized in memory even at 10GB of
executor memory.
-- Bharath
Hi Yana -
I added the following to spark-class:
echo RUNNER: $RUNNER
echo CLASSPATH: $CLASSPATH
echo JAVA_OPTS: $JAVA_OPTS
echo '$@': $@
Here's the output:
$ ./spark-submit --class experiments.SimpleApp --master
spark://myhost.local:7077
/IdeaProjects/spark-experiments/target/spark-experiments
I'm mostly interested in the HBase examples in the repo. I saw two examples,
hbase_inputformat.py and hbase_outputformat.py, in the 1.1 branch. Can you
show me how to run them?
The compile step is done. I tried to run the examples, but failed.
--
View this message in context:
http://apache-spark-u
Hi,
I want to use the sparksql thrift server in my application and make sure
everything is loading and working. I built Spark 1.1 SNAPSHOT and ran the
thrift server using ./sbin/start-thrift-server. In my application I load
tables into schemaRDDs and I expect that the thrift-server should pick th
Hi,
I'm trying to execute Spark SQL queries on top of the AccumuloInputFormat.
Not sure if I should be asking on the Spark list or the Accumulo list, but
I'll try here. The problem is that the workload to process SQL queries
doesn't seem to be distributed across my cluster very well.
My Spark SQL
Hi Sandeep,
would you be interested in joining my open source project?
https://github.com/tribbloid/spookystuff
IMHO Spark is indeed not for general-purpose crawling, where the distributed
jobs are highly homogeneous, but it is good enough for directional scraping,
which involves heterogeneous input and
When I stop Spark Streaming Context by calling stop(), I always get the
following error:
ERROR Deregistered receiver for stream 0: Stopped by driver
class=org.apache.spark.streaming.scheduler.ReceiverTracker
WARN Stopped executor without error
class=org.apache.spark.streaming.receiver.Receive
Hi, users
1. Disk-based cache eviction policy? The same LRU?
2. What is the scope of a cached RDD? Does it survive the application? What
happens if I run the Java app next time? Will the RDD be created or read from
cache? If the answer is YES, then ...
3. Is there any way to invalidate a cached RDD automat
Hi Greg,
SPARK_EXECUTOR_INSTANCES is the total number of workers in the cluster. The
equivalent "spark.executor.instances" is just another way to set the same
thing in your spark-defaults.conf. Maybe this should be documented. :)
"spark.yarn.executor.memoryOverhead" is just an additional margin a
Your executor is exiting or crashing unexpectedly:
On Tue, Sep 9, 2014 at 3:13 PM, Penny Espinoza
wrote:
> org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Exit
> code from container container_1410224367331_0006_01_03 is : 1
> 2014-09-09 21:47:26,345 WARN
> org.apache.hadoo
Hey,
If you are interested in more details there is also a thread about this
issue here:
http://apache-spark-developers-list.1001551.n3.nabble.com/Eliminate-copy-while-sending-data-any-Akka-experts-here-td7127.html
Kostas
On Tue, Sep 9, 2014 at 3:01 PM, jbeynon wrote:
> Thanks Marcelo, that lo
The node manager log looks like this - not exactly sure what this means, but
the container messages seem to indicate there is still plenty of memory.
2014-09-09 21:47:00,718 INFO
org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl:
Memory usage of ProcessTre
Thanks Marcelo, that looks like the same thing. I'll follow the Jira ticket
for updates.
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/Yarn-Driver-OOME-Java-heap-space-when-executors-request-map-output-locations-tp13827p13829.html
Sent from the Apache Spar
Hi,
Yes, this is a problem, and I'm not aware of any simple workarounds
(or complex one for that matter). There are people working to fix
this, you can follow progress here:
https://issues.apache.org/jira/browse/SPARK-1239
On Tue, Sep 9, 2014 at 2:54 PM, jbeynon wrote:
> I'm running on Yarn with
I'm running on Yarn with relatively small instances with 4gb memory. I'm not
caching any data but when the map stage ends and shuffling begins all of the
executors request the map output locations at the same time which seems to
kill the driver when the number of executors is turned up.
For exampl
The executors of my spark streaming application are being killed due to
memory issues. The memory consumption is quite high on startup because it is
the first run and there are quite a few events on the Kafka queues that are
consumed at a rate of 100K events per sec.
I wonder if it's recommended to u
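One knob that may be relevant here (a suggestion on my part, not from the
thread) is the receiver rate limit, which caps how fast the startup backlog is
drained:

  import org.apache.spark.SparkConf

  val conf = new SparkConf()
    .setAppName("streaming-app")
    .set("spark.streaming.receiver.maxRate", "20000") // max records/sec per receiver; value is a guess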
This has all the symptoms of Yarn killing your executors due to them
exceeding their memory limits. Could you check your RM/NM logs to see
if that's the case?
(The error was because of an executor at
domU-12-31-39-0B-F1-D1.compute-1.internal, so you can check that NM's
log file.)
If that's the ca
Hi,
In Spark website, there’s a plan to support HiveQL on top of Spark SQL and also
to support JDBC/ODBC.
I would like to know if in this “HiveQL” supported by Spark (or Spark SQL
accessible through JDBC/ODBC), is there a plan to add extensions to leverage
different Spark features like machin
On September 25-26, SF Scala teams up with Adam Gibson, the creator of
deeplearning4j.org, to teach the first ever Distributed Deep Learning
with Scala, Akka, and Spark workshop. Deep Learning is enabling
breakthrough advances in areas such as image recognition and
natural language processing.
I finally seem to have gotten past this issue. Here’s what I did:
* rather than using the binary distribution, I built Spark from scratch to
eliminate the 4.1 version of org.apache.httpcomponents from the assembly
* git clone https://github.com/apache/spark.git
* cd spark
Hey - I have a Spark 1.0.2 job (using spark-streaming-kafka) that runs
successfully using master = local[4]. However, when I run it on a Hadoop 2.2
EMR cluster using master yarn-client, it fails after running for about 5
minutes. My main method does something like this:
1. gets streaming
I figured out this issue (in our case)... and I'll vent a little in my reply
here... =:)
Fedora's well-intentioned firewall (firewall-cmd) requires you to open
(enable) any ports/services on a host that you need to connect to
(including SSH/22, which is enabled by default, of course). So when
launch
I am running Spark on Yarn with the HDP 2.1 technical preview. I'm having
issues getting the spark history server permissions to read the spark event
logs from hdfs. Both sides are configured to write/read logs from:
hdfs:///apps/spark/events
The history server is running as user spark, the j
I used this code separately, but the program blocks at this line:
val socket = listener.accept()
Do you have any suggestion?
Thanks
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/streaming-code-to-simulate-a-network-socket-data-source-tp3431p13817.
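accept() blocks until a client connects, so if it runs on the main thread
nothing else proceeds. A sketch (my own, assuming the goal is a simple test
data source) that moves the accept loop onto a background thread:

  import java.io.PrintWriter
  import java.net.ServerSocket

  val listener = new ServerSocket(9999)
  new Thread("socket-data-generator") {
    override def run(): Unit = {
      while (true) {
        val socket = listener.accept()       // blocks only this background thread
        new Thread() {
          override def run(): Unit = {
            val out = new PrintWriter(socket.getOutputStream, true)
            (1 to 100).foreach(i => out.println("test line " + i))
            socket.close()
          }
        }.start()
      }
    }
  }.start()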
If that library has native dependencies you'd need to make sure that the
native code is on all boxes and in the path with export
SPARK_LIBRARY_PATH=...
On Tue, Sep 9, 2014 at 10:17 AM, ayandas84 wrote:
> We have a small apache spark cluster of 6 computers. We are trying to
> solve
> a distribut
spark-submit is a script which calls the spark-class script. Can you output
the command that spark-class runs (say, by putting set -x before the very last
line)? You should see the java command that is being run. The scripts do
some parameter setting so it's possible you're missing something. It seems
Hello Diana,
How can I include this implementation in my code, so that this task starts
together with NetworkWordCount?
In my case, I have a directory with several files.
Then,
I include this line:
StreamingDataGenerator.streamingGenerator(NetPort, BytesSecond, DirFiles)
But the program sta
On Tue, Sep 9, 2014 at 9:56 AM, Oleg Ruchovets wrote:
> Hi ,
>
>I came from map/reduce background and try to do quite trivial thing:
>
> I have a lot of files ( on hdfs ) - format is :
>
>1 , 2 , 3
>2 , 3 , 5
>1 , 3, 5
> 2, 3 , 4
> 2 , 5, 1
>
> I am actually need to grou
I have also run some tests on the other algorithms available in MLlib but
got dismal accuracy. Is the method of creating a LabeledPoint RDD different
for other algorithms, such as LinearRegressionWithSGD?
Any help is appreciated.
-
Novice Big Data Programmer
--
View this message in context:
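To the question above: the LabeledPoint format itself is the same for the
regression algorithms; only the meaning of the label changes (a continuous
target instead of a class index). A hedged sketch with an assumed input
format ("label,f1 f2 f3"):

  import org.apache.spark.mllib.linalg.Vectors
  import org.apache.spark.mllib.regression.{LabeledPoint, LinearRegressionWithSGD}

  val data = sc.textFile("hdfs:///path/to/data") // hypothetical path
  val points = data.map { line =>
    val Array(label, features) = line.split(",", 2)
    LabeledPoint(label.toDouble, Vectors.dense(features.trim.split(" ").map(_.toDouble)))
  }
  val model = LinearRegressionWithSGD.train(points, 100) // 100 iterations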
Thanks for the information Xiangrui. I am using the following example to
classify documents.
http://chimpler.wordpress.com/2014/06/11/classifiying-documents-using-naive-bayes-on-apache-spark-mllib/
I am not sure if this is the best way to convert textual data into vectors.
Can you please confirm
On Tue, Sep 9, 2014 at 10:07 AM, Boxian Dong wrote:
> I currently working on a machine learning project, which require the RDDs'
> content to be (mostly partially) updated during each iteration. Because the
> program will be converted directly from "traditional" python object-oriented
> code, the
Which version of Spark are you using?
This bug has been fixed in 0.9.2, 1.0.2, and 1.1; could you upgrade to one of
these versions to verify it?
Davies
On Tue, Sep 9, 2014 at 7:03 AM, redocpot wrote:
> Thank you for your replies.
>
> More details here:
>
> The prog is executed on local mode (sin
Hi,
val test = persons.value
.map{tuple => (tuple._1, tuple._2
.filter{event => *inactiveIDs.filter(event2 => event2._1 ==
tuple._1).count() != 0*})}
Your problem is right between the asterisks. You can't make an RDD operation
inside an RDD operation, because RDDs can't be serialized
You should not be broadcasting an RDD. You also should not be passing an
RDD in a lambda to another RDD. If you want, you can call RDD.collect and then
broadcast those values (of course, you must be able to fit all those values
in memory).
On Tue, Sep 9, 2014 at 6:34 AM, Blackeye wrote:
> In order to
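A sketch of that suggestion applied to the code in question (keeping the
original names, but treating persons as a plain RDD rather than a broadcast
of one; the filter condition mirrors the original post):

  import org.apache.spark.SparkContext._

  // collect the small RDD once, broadcast only plain values, use them inside the map
  val inactiveIdSet = sc.broadcast(inactiveIDs.keys.collect().toSet)

  val test = persons.map { case (id, events) =>
    (id, events.filter(_ => inactiveIdSet.value.contains(id)))
  }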
Yes, that's how file: URLs are interpreted everywhere in Spark. (It's also
explained in the link to the docs I posted earlier.)
The second interpretation below is "local:" URLs in Spark, but that doesn't
work with Yarn on Spark 1.0 (so it won't work with CDH 5.1 and older
either).
On Mon, Sep 8,
I am currently working on a machine learning project which requires the RDDs'
content to be (mostly partially) updated during each iteration. Because the
program will be converted directly from "traditional" Python object-oriented
code, the content of the RDD will be modified in the mapping function.
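Since RDDs are immutable, the usual pattern (a general sketch, not specific to
this project) is to derive a new RDD each iteration rather than modifying
contents in place:

  var state = sc.parallelize(1 to 1000).map(i => (i, 0.0)) // hypothetical initial records
  for (iter <- 1 to 10) {
    // "update" by producing a new RDD from the old one
    state = state.map { case (id, value) => (id, value + 1.0) }.cache()
  }
  println(state.take(5).mkString(", "))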
Hi,
I come from a map/reduce background and am trying to do a quite trivial thing:
I have a lot of files (on HDFS) - the format is:
1 , 2 , 3
2 , 3 , 5
1 , 3, 5
2, 3 , 4
2 , 5, 1
I actually need to group by key (the first column):
key values
1 --> (2,3),(3,5)
2 --> (3,5),(3
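For the record, a minimal sketch of that grouping (the file path is a
placeholder):

  import org.apache.spark.SparkContext._

  val lines = sc.textFile("hdfs:///path/to/files")
  val grouped = lines
    .map(_.split(",").map(_.trim.toInt))
    .map(cols => (cols(0), (cols(1), cols(2)))) // key = first column
    .groupByKey()
  grouped.collect().foreach { case (k, vs) => println(k + " --> " + vs.mkString(",")) }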
Okay,
This seems to be either a code version issue or a communication issue. It
works if I execute the spark shell from the master node. It doesn't work if
I run it from my laptop and connect to the master node.
I had opened the ports for the WebUI (8080) and the cluster manager (7077)
for the m
Better to do it in a PR of your own, it's not sufficiently related to dimsum
On Tue, Sep 9, 2014 at 7:03 AM, Debasish Das
wrote:
> Cool...can I add loadRowMatrix in your PR ?
>
> Thanks.
> Deb
>
> On Tue, Sep 9, 2014 at 1:14 AM, Reza Zadeh wrote:
>
>> Hi Deb,
>>
>> Did you mean to message me in
If you are using the Mahout's Multinomial Naive Bayes, it should be
the same as MLlib's. I tried MLlib with news20.scale downloaded from
http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/multiclass.html
and the test accuracy is 82.4%. -Xiangrui
On Tue, Sep 9, 2014 at 4:58 AM, jatinpreet wrot
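For anyone trying to reproduce that number, a sketch along those lines (file
name, split ratio, and lambda are assumptions on my part):

  import org.apache.spark.SparkContext._
  import org.apache.spark.mllib.classification.NaiveBayes
  import org.apache.spark.mllib.util.MLUtils

  val data = MLUtils.loadLibSVMFile(sc, "news20.scale")
  val Array(train, test) = data.randomSplit(Array(0.8, 0.2), seed = 11L)
  val model = NaiveBayes.train(train, lambda = 1.0)
  val accuracy = test.map(p => if (model.predict(p.features) == p.label) 1.0 else 0.0).mean()
  println("test accuracy = " + accuracy)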
We have a small Apache Spark cluster of 6 computers. We are trying to solve
a distributed problem which requires solving an optimization problem at each
machine during a Spark map operation.
We decided to use Mosek as the solver, and I obtained an academic license to
this end. We observed that mos
Cool...can I add loadRowMatrix in your PR ?
Thanks.
Deb
On Tue, Sep 9, 2014 at 1:14 AM, Reza Zadeh wrote:
> Hi Deb,
>
> Did you mean to message me instead of Xiangrui?
>
> For TS matrices, dimsum with positiveinfinity and computeGramian have the
> same cost, so you can do either one. For dense
Thank you for your replies.
More details here:
The program is executed in local mode (single node). Default env params are
used.
The test code and the result are in this gist:
https://gist.github.com/coderh/0147467f0b185462048c
Here are the first 10 lines of the data: 3 fields per row, the delimiter i
You're mixing the Java and Scala APIs here. Your call to foreach() is
expecting a Scala function and you're giving it a Java Function.
Ideally you just use the Scala API, of course. Before explaining how
to actually use a Java function here, maybe clarify that you have to
do it and can't use Scala
I want to call a function for batches of elements from an RDD:
val javaClass:org.apache.spark.api.java.function.Function[Seq[String],Unit]
= new JavaClass()
rdd.mapPartitions(_.grouped(5)).foreach(javaClass)
1. This worked fine in Spark 0.9.1; when we upgraded to Spark 1.0.2,
Function changed from
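As noted in the reply above, in 1.0 the Java Function interface is no longer a
Scala Function1, so it can't be passed to foreach directly. One workaround (a
sketch, not necessarily the cleanest fix) is to invoke it explicitly from a
Scala closure:

  // javaClass is the org.apache.spark.api.java.function.Function[Seq[String], Unit] from above;
  // its implementation still has to be Serializable, as before
  rdd.mapPartitions(_.grouped(5)).foreach(group => javaClass.call(group))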
Hi,
I tried running the classification program on the famous newsgroup data.
This had an even more drastic effect on the accuracy, as it dropped from
~82% in Mahout to ~72% in Spark MLlib.
Please help me in this regard as I have to use Spark in a production system
very soon and this is a blocker
>Why I think its the number of files is that I believe that a
> all of those or large part of those files are read when
>you run sqlContext.parquetFile() and the time it would
>take in s3 for that to happen is a lot so something
>internally is timing out..
I'll create the parquet files with D
My apologies to the list. I replied to Manu's question and it went directly
to him rather than the list.
In case anyone else has this issue here is my reply and Manu's reply to me.
This also answers Ian's question.
---
Hi Manu,
The dataset is 7.5 million rows
In order to help anyone answer, I can say that I checked the
inactiveIDs.filter operation separately, and I found that it doesn't return
null in any case. In addition, I don't know how to handle (or check) whether
an RDD is null. I find the debugging too complicated to pinpoint the error.
Any ideas how to
I have the following code written in scala in Spark:
(inactiveIDs is a RDD[(Int, Seq[String])], persons is a Broadcast[RDD[(Int,
Seq[Event])]] and Event is a class that I have created)
val test = persons.value
.map{tuple => (tuple._1, tuple._2
.filter{event => inactiveIDs.filter(event2 => eve
Yes, I agree to transform directly on the DStream even if there is no data
injected in this batch duration.
My situation is: Spark receives the Flume stream continuously, and I use the
updateStateByKey function to collect data for a key across several batches;
then I will handle the collected data after wai
I think you should clarify some things in Spark Streaming:
1. The closure in map runs on the remote side, so modifying the count var will
only take effect on the remote side. You will always get -1 on the driver side.
2. Some code in the closure in foreachRDD is lazily executed in each batch
duration, while the
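A small sketch of point 1: count on the driver via foreachRDD instead of
mutating a var inside map (stream stands for whichever DStream you already
have):

  var sawData = false
  stream.foreachRDD { rdd =>
    val n = rdd.count()        // runs as a job; the result comes back to the driver
    if (n > 0) sawData = true  // this var lives on the driver, so the update is visible
    println("this batch had " + n + " records")
  }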
If you take into account what streaming means in Spark, your goal doesn't
really make sense; you have to assume that your streams are infinite and
you will have to process them until the end of days. Operations on a
DStream define what you want to do with each element of each RDD, but Spark
stre
Can you provide a small sample or test data that reproduces this problem? And
what's your env setup, single node or cluster?
Sent from my iPhone
> On September 8, 2014, at 22:29, redocpot wrote:
>
> Hi,
>
> I have a key-value RDD called rdd below. After a groupBy, I tried to count
> rows.
> But the resu
I'm sorry, I had an error in my code; updated here:
var count = -1L // a global variable in the main object
val currentBatch = some_DStream
val countDStream = currentBatch.map(o=>{
count = 0L // reset the count variable in each batch
o
})
countDStream.foreachRDD(rdd=> coun
Hi Deb,
Did you mean to message me instead of Xiangrui?
For TS matrices, dimsum with positiveinfinity and computeGramian have the
same cost, so you can do either one. For dense matrices with say, 1m
columns this won't be computationally feasible and you'll want to start
sampling with dimsum.
It
Hi,
I had the same issue in my Java code while I was trying to connect to a
locally hosted spark server (using sbin/start-all.sh etc) using an IDE
(IntelliJ).
I packaged my app into a jar and used spark-submit (in bin/) and it worked!
Hope this helps
Rgds
--
View this message in context:
Thanks all,
Yes, I did use foreachRDD; the following is my code:
var count = -1L // a global variable in the main object
val currentBatch = some_DStream
val countDStream = currentBatch.map(o=>{
count = 0L // reset the count variable in each batch
o
})
countDStream.foreachRDD
Hi,
I think all the received streams will generate an RDD in each batch duration
even if there is no data fed in (an empty RDD will be generated). So you cannot
use the number of RDDs to judge whether any data was received.
One way is to do this in DStream/foreachRDD(), like
a.foreachRDD { r =>
if
How about calling foreachRDD, and processing whatever data is in each
RDD normally, and also keeping track within the foreachRDD function of
whether any RDD had a count() > 0? If not, then you can execute your
alternate logic at the end in the case of no data. I don't think you
want to operate at t
Hi Jerry,
Thanks for your reply.
I use Spark Streaming to receive the Flume stream, then I need to make a
judgement: in each batchDuration, if the received stream has data, then I
should do something; if there is no data, do the other thing. I thought
count() could give me the measure, but it returns
Hi,
Is there any specific scenario which needs to know the number of RDDs in the
DStream? To my knowledge, a DStream will generate one RDD in each
batchDuration; some old RDDs will be remembered for windowing-like functions,
and will be removed when useless. The hashmap generatedRDDs i
Hi,
I had been using Mahout's Naive Bayes algorithm to classify document data.
For a specific train and test set, I was getting accuracy in the range of
86%. When I shifted to Spark's MLlib, the accuracy dropped to the vicinity
of 82%.
I am using the same version of Lucene and logic to generate TFIDF
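For comparison, MLlib 1.1 also ships its own TF-IDF, which takes Lucene out of
the equation (a sketch; the input path and whitespace tokenization are
assumptions):

  import org.apache.spark.mllib.feature.{HashingTF, IDF}
  import org.apache.spark.mllib.linalg.Vector
  import org.apache.spark.rdd.RDD

  val documents: RDD[Seq[String]] = sc.textFile("hdfs:///docs").map(_.split(" ").toSeq)
  val tf = new HashingTF().transform(documents)
  tf.cache()
  val tfidf: RDD[Vector] = new IDF().fit(tf).transform(tf)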
What's the type of the key?
If the hash of the key is different across slaves, then you could get these
confusing results. We have seen similar results in Python, because the hash
of None is different across machines.
Davies
On Mon, Sep 8, 2014 at 8:16 AM, redocpot wrote:
> Update:
>
> Just test w
Hi Xiangrui,
For tall skinny matrices, if I can pass a similarityMeasure to
computeGrammian, I could re-use the SVD's computeGrammian for similarity
computation as well...
Do you recommend using this approach for tall skinny matrices, or just using
dimsum's routines?
Right now RowMatrix does n