Does my question make sense, or does it require some elaboration?
Sasi
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/Need-help-for-Spark-JobServer-setup-on-Maven-for-Java-programming-tp20849p20896.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
Hey,
why specifically Maven? We set up a Spark Job Server through SBT, which is an
easy way to get the job server up and running.
On 30 Dec 2014 13:32, Sasi [via Apache Spark User List]
ml-node+s1001560n20896...@n3.nabble.com wrote:
Does my question make sense, or does it require some elaboration?
Sasi
The reason is that we have a Vaadin (Java framework) application which displays
data from a Spark RDD, which in turn gets its data from Cassandra. As we know, we
need to use Maven when building against the Spark API in Java.
We tested the spark-jobserver using SBT and were able to run it. However, for our
requirement, we
KMeans really needs the number of clusters to be identified in advance. There
are multiple algorithms (XMeans, ART, ...) which do not need this
information. Unfortunately, none of them is implemented in MLlib for the
moment (you can give a hand and help the community).
Anyway, it seems to me you will
Ohh...
Just curious: we did a similar use case to yours, getting data out of
Cassandra. Since the job server has a REST architecture, all we need is a URL to
access it. Why does integrating with your framework matter here, when all we
need is a URL?
On 30 Dec 2014 14:05, Sasi [via Apache Spark User List]
Thanks Sandy. It was an issue with the number of cores.
Another issue I am facing is that tasks are not getting distributed evenly
among all executors and are running at the NODE_LOCAL locality level, i.e.
all the tasks are running on the same executor where my KafkaReceiver(s)
are running, even
I have a table (CSV file) and loaded data into it by creating a POJO matching the
table structure, then created a SchemaRDD as follows:
JavaRDD<Test1> testSchema =
sc.textFile("D:/testTable.csv").map(GetTableData); /* GetTableData will
transform all the table data into testTable objects */
JavaSchemaRDD schemaTest =
Thanks Abhishek. We understand your point and will try using the REST URL.
However, one concern: we presently have around 1 lakh (100,000) rows in our
Cassandra table. Will the REST URL response be able to withstand that size?
foreach iterates through the partitions in the RDD and executes the operations
for each partition, I guess.
On 29-Dec-2014, at 10:19 pm, SamyaMaiti samya.maiti2...@gmail.com wrote:
Hi All,
Please clarify.
Can we say that 1 RDD is generated every batch interval?
If the above is true, then is
While doing a JOIN operation on three tables using Spark 1.1.1, I always
get the following error. However, I never hit this exception in
Spark 1.1.0 with the same operation and the same data. Has anyone else met this
problem?
14/12/30 17:49:33 ERROR CliDriver:
The DStream model is one RDD of data per interval, yes. foreachRDD
performs an operation on each RDD in the stream, which means it is
executed once* for the one RDD in each interval.
* ignoring the possibility here of failure and retry of course
On Mon, Dec 29, 2014 at 4:49 PM, SamyaMaiti
Hi, I'm facing a weird issue. Any help appreciated.
When I execute the below code and compare input and output, each record
in the output has some extra trailing data appended to it, and is hence
corrupted. I'm just reading and writing, so the input and output should be
exactly the same.
I'm using
Thanks, Sean.
That was helpful.
Regards,
Sam
On Dec 30, 2014, at 4:12 PM, Sean Owen so...@cloudera.com wrote:
The DStream model is one RDD of data per interval, yes. foreachRDD
performs an operation on each RDD in the stream, which means it is
executed once* for the one RDD in each interval.
This poor soul had the exact same problem and solution:
http://stackoverflow.com/questions/24083332/write-and-read-raw-byte-arrays-in-spark-using-sequence-file-sequencefile
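For what it's worth, the fix described in that Stack Overflow thread boils down to copying only the valid bytes out of the Writable buffer that Hadoop reuses for every record. A minimal sketch in plain Java follows; the Writable is simulated here with a byte array plus a length, so the example runs without Hadoop on the classpath:

```java
import java.util.Arrays;

public class CopyWritable {
    // Hadoop's SequenceFile reader reuses a single Writable buffer for every
    // record; getBytes() returns the whole backing array, which can be longer
    // than the current record (getLength()). Copying only the valid prefix
    // avoids the trailing-garbage corruption described above.
    static byte[] copyValidPrefix(byte[] backing, int length) {
        return Arrays.copyOf(backing, length);
    }

    public static void main(String[] args) {
        // Backing array longer than the record: the last two bytes are stale
        byte[] backing = {1, 2, 3, 9, 9};
        byte[] record = copyValidPrefix(backing, 3);
        System.out.println(Arrays.toString(record)); // prints [1, 2, 3]
    }
}
```

With the real API, the same idea is `Arrays.copyOf(writable.getBytes(), writable.getLength())` before handing the bytes to anything that stores them.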
On Tue, Dec 30, 2014 at 10:58 AM, Enno Shioji eshi...@gmail.com wrote:
Hi, I'm facing a weird issue. Any help
Frankly speaking, I have never tried this volume in practice, but I believe it
should work.
On 30 Dec 2014 15:26, Sasi [via Apache Spark User List]
ml-node+s1001560n20902...@n3.nabble.com wrote:
Thanks Abhishek. We understand your point and will try using REST URL.
However one concern, we had
Hi,
well, Spark 1.2 was prepared for Scala 2.10. If you want a stable and fully
functional tool, I'd compile it with this default compiler.
*I was able to compile Spark 1.2 with Java 7 and Scala 2.10 seamlessly.*
I also tried Java 8 and Scala 2.11 (no -Dscala.usejavacp=true), but I failed
for some other
Hi all,
I have one large dataset. When I get the number of partitions, it shows 43.
We can't collect() the large dataset into memory, so I am thinking of
collect()-ing each partition separately, so that each piece is small in size.
Any thoughts?
Thanks Abhishek. We are good now with an answer to try.
collect()-ing a partition still implies copying it to the driver, but
you're suggesting you can't collect() the whole data set to the
driver. What do you mean: collect() 1 partition? or collect() some
smaller result from each partition?
On Tue, Dec 30, 2014 at 11:54 AM, DEVAN M.S.
Hi Team,
I was trying to execute PySpark code on a cluster. It gives me the following
error. (When I run the same job locally, it works fine too. :-()
Error from python worker:
/usr/lib/spark-1.2.0-bin-hadoop2.3/python/pyspark/context.py:209: Warning:
'with' will become a reserved
The Python installed in your cluster is 2.5. You need at least 2.6.
Eric Friedman
On Dec 30, 2014, at 7:45 AM, Jaggu jagana...@gmail.com wrote:
Hi Team,
I was trying to execute PySpark code on a cluster. It gives me the following
error. (When I run the same job in local it is
Dear all:)
We're trying to build a graph from large input data and get a subgraph
after applying some filter.
Now, we want to save this graph to HDFS so that we can load it later.
Is it possible to store a graph or subgraph directly in HDFS and load it as
a graph for future use?
We would be glad for your
How about saving it as an object?
Yours, Xuefeng Wu 吴雪峰 敬上
On 30 Dec 2014, at 9:27 PM, Jason Hong begger3...@gmail.com wrote:
Dear all:)
We're trying to build a graph from large input data and get a subgraph
after applying some filter.
Now, we want to save this graph to HDFS so that we can load it later.
Hi.
I'm trying to configure a spark standalone cluster, with three master nodes
(bigdata1, bigdata2 and bigdata3) managed by Zookeeper.
It seems there's a configuration problem, since every node is saying it is the
cluster leader:
.
14/12/30 13:54:59 INFO Master: I have been
I'm not sure exactly what you're trying to do, but take a look at
rdd.toLocalIterator if you haven't already.
On Tue, Dec 30, 2014 at 6:16 AM, Sean Owen so...@cloudera.com wrote:
collect()-ing a partition still implies copying it to the driver, but
you're suggesting you can't collect() the
Some time ago I took approach (2): I installed Anaconda on every node.
But to avoid messing up Red Hat (it was CentOS in my case, which is essentially
the same), I installed Anaconda on every node as the yarn user and made it the
default Python only for that user.
After you install it, Anaconda asks if it
Do your debug printlns show values? I.e., what would you see if in rowToString
you output println("row to string " + row + " " + sub)?
Another thing to check would be to do schemaRDD.take(3) or something, to
make sure you actually have data.
You can also try this: rowToString(schemaRDD.first, list) and
Hi,
I am using Spark standalone on EC2. I can access the ephemeral HDFS from the
spark-shell interface, but I can't access HDFS from a standalone application. I am
using spark 1.2.0 with hadoop 2.4.0 and launched cluster from ec2 folder from
my local machine. In my pom file I have given hadoop client as
I've been working with Spark 1.2 and Mesos 0.21.0 and while I have set the
spark.executor.uri within spark-env.sh (and directly within bash as well),
the Mesos slaves do not seem to be able to access the spark tgz file via
HTTP or HDFS as per the message below.
14/12/30 15:57:35 INFO SparkILoop:
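In case it helps, this is the shape the setting usually takes in conf/spark-env.sh; the HDFS path below is a placeholder, not the poster's actual path:

```
# conf/spark-env.sh (illustrative path; point at wherever the tgz actually lives)
export SPARK_EXECUTOR_URI=hdfs://namenode:9000/dist/spark-1.2.0-bin-hadoop2.4.tgz

# Equivalent setting in conf/spark-defaults.conf:
# spark.executor.uri   hdfs://namenode:9000/dist/spark-1.2.0-bin-hadoop2.4.tgz
```

Either way, every Mesos slave must be able to resolve and fetch that URI, which is the first thing to check when the fetch fails.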
Did you check firewall rules in security groups?
On Tue, Dec 30, 2014, 9:34 PM Laeeq Ahmed laeeqsp...@yahoo.com.invalid
wrote:
Hi,
I am using spark standalone on EC2. I can access ephemeral hdfs from
spark-shell interface but I can't access hdfs in standalone application. I
am using spark
Hi Michael,
I’ve looked through the example and the test cases and I think I understand
what we need to do - so I’ll give it a go.
I think what I’d like to try to do is allow files to be added at any time, so
perhaps I can cache partition info, and also what may be useful for us would be
to
Hey all,
Since upgrading to 1.2.0 a pyspark job that worked fine in 1.1.1 fails
during shuffle. I've tried reverting from the sort-based shuffle back to
the hash one, and that fails as well. Does anyone see similar problems or
has an idea on where to look next?
For the sort-based shuffle I get a
I'm half-way there. I followed these steps:
1. compiled and installed the OpenBLAS library
2. ln -s libopenblas_sandybridgep-r0.2.13.so /usr/lib/libblas.so.3
3. compiled and built spark:
mvn -Pnetlib-lgpl -DskipTests clean compile package
So far, so good. Then I ran into problems while testing the solution:
This here may also be of help:
http://docs.aws.amazon.com/AmazonS3/latest/dev/request-rate-perf-considerations.html.
Make sure to spread your objects across multiple partitions to not be rate
limited by S3.
-Sven
On Mon, Dec 22, 2014 at 10:20 AM, durga katakam durgak...@gmail.com wrote:
Yes . I
Hi all,
I'm investigating spark for a new project and I'm trying to use
spark-jobserver because... I need to reuse and share RDDs and from what I
read in the forum that's the standard :D
Turns out that spark-jobserver doesn't seem to work on yarn, or at least it
does not on 1.1.1
My config
Without caching, each action is recomputed. So, assuming rdd2 and rdd3
result in separate actions, the answer is yes.
On Mon, Dec 29, 2014 at 7:53 PM, Corey Nolet cjno...@gmail.com wrote:
If I have 2 RDDs which depend on the same RDD like the following:
val rdd1 = ...
val rdd2 = rdd1.groupBy()...
Anytime you see java.lang.NoSuchMethodError it means that you have
multiple conflicting versions of a library on the classpath, or you are
trying to run code that was compiled against the wrong version of a library.
On Tue, Dec 30, 2014 at 1:43 AM, sachin Singh sachin.sha...@gmail.com
wrote:
I
Hi
I am using Anaconda Python. Is there any way to specify which Python we
have to use for running PySpark in a cluster?
Best regards
Jagan
On Tue, Dec 30, 2014 at 6:27 PM, Eric Friedman eric.d.fried...@gmail.com
wrote:
The Python installed in your cluster is 2.5. You need at least 2.6.
Hi
Does Spark have a built-in possibility of exposing the current value of an
Accumulator [1] via Monitoring and Instrumentation [2]?
Unfortunately, I couldn't find anything in the sources which could be used.
Does it mean the only way to expose the current accumulator value is to implement
a new Source, which would
I am not sure that can be done. Receivers are designed to be run only
on the executors/workers, whereas a SQLContext (for using Spark SQL)
can only be defined on the driver.
On Mon, Dec 29, 2014 at 6:45 PM, sranga sra...@gmail.com wrote:
Hi
Could Spark-SQL be used from within a custom actor
For windows that large (1 hour), you will probably also have to
increase the batch interval for efficiency.
TD
On Mon, Dec 29, 2014 at 12:16 AM, Akhil Das ak...@sigmoidanalytics.com wrote:
You can use reduceByKeyAndWindow for that. Here's a pretty clean example
Hi Sven,
Do you have a small example program that you can share which will allow me
to reproduce this issue? If you have a workload that runs into this, you
should be able to keep iteratively simplifying the job and reducing the
data set size until you hit a fairly minimal reproduction (assuming
To configure the Python executable used by PySpark, see the "Using the
Shell" (Python) section in the Spark Programming Guide:
https://spark.apache.org/docs/latest/programming-guide.html#using-the-shell
You can set the PYSPARK_PYTHON environment variable to choose the Python
executable that will be used.
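As a concrete illustration, assuming Anaconda is installed under /opt/anaconda on every node (the path is an assumption; adjust it to the actual install location):

```
# conf/spark-env.sh on each node -- make PySpark use the Anaconda interpreter
export PYSPARK_PYTHON=/opt/anaconda/bin/python
```

This needs to be set on the workers as well as the driver, since the workers spawn their own Python processes.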
Thanks. Will look at other options.
On Tue, Dec 30, 2014 at 11:43 AM, Tathagata Das tathagata.das1...@gmail.com
wrote:
I am not sure that can be done. Receivers are designed to be run only
on the executors/workers, whereas a SQLContext (for using Spark SQL)
can only be defined on the driver.
I'm submitting a script using spark-submit in local mode for testing, and
I'm having trouble figuring out where the logs are stored. The
documentation indicates that they should be in the work folder in the
directory in which Spark lives on my system, but I see no such folder there.
I've set the
Here is the code for my streaming job:
val sparkConf = new SparkConf().setAppName("SparkStreamingJob")
sparkConf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
sparkConf.set("spark.default.parallelism",
Hi Experts,
Few general Queries :
1. Can a single block/partition in an RDD have more than one Kafka message, or
will there be only one Kafka message per block? More broadly,
is the message count related to the block in any way, or is it just that any
message received within a particular
Hi,
I am trying to use MultipleTextOutputFormat to rename the output files
of my Spark job to something different from the default part-N.
I have implemented a custom MultipleTextOutputFormat class as follows:
*class DriveOutputRenameMultipleTextOutputFormat extends
I am not sure of the way to pass the jets3t.properties file to spark-submit.
The --file option does not seem to work.
Can someone please help me? My production Spark jobs sporadically hang
when reading S3 files.
Thanks,
-D
Hi Patrick, to follow up on the below discussion, I am including a short code
snippet that produces the problem on 1.1. This is kind of stupid code since
it’s a greatly simplified version of what I’m actually doing but it has a
number of the key components in place. I’m also including some
No, it still fails using mvn -Pyarn -Phadoop-2.5 -Dhadoop.version=2.5.0
-Dscala-2.10 -X -DskipTests clean package
...
[DEBUG] /opt/xdsp/spark-1.2.0/core/src/main/scala
[DEBUG] includes = [**/*.scala,**/*.java,]
[DEBUG] excludes = []
[WARNING] Zinc server is not available
This file needs to be on your CLASSPATH actually, not just in a directory. The
best way to pass it in is probably to package it into your application JAR. You
can put it in src/main/resources in a Maven or SBT project, and check that it
makes it into the JAR using jar tf yourfile.jar.
Matei
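Sketching the suggestion above as a build recipe (the JAR name is a placeholder; the key points are the src/main/resources location and the jar tf check):

```
# Put the file where Maven will package it onto the classpath:
#   src/main/resources/jets3t.properties
mvn package

# Verify that the properties file made it into the application JAR:
jar tf target/yourapp.jar | grep jets3t.properties
```

In an SBT project the equivalent location is src/main/resources as well; anything there lands at the root of the JAR, which is where the classpath lookup expects it.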
Hey Josh,
I am still trying to prune this to a minimal example, but it has been
tricky since scale seems to be a factor. The job runs over ~720GB of data
(the cluster's total RAM is around ~900GB, split across 32 executors). I've
managed to run it over a vastly smaller data set without issues.
Could you share a link about this? It's common to use Java 7, that
will be nice if we can fix this.
On Mon, Dec 29, 2014 at 1:27 PM, Eric Friedman
eric.d.fried...@gmail.com wrote:
Was your spark assembly jarred with Java 7? There's a known issue with jar
files made with that version. It
There is a known bug with local scheduler, will be fixed by
https://github.com/apache/spark/pull/3779
On Sun, Dec 21, 2014 at 10:57 PM, Samarth Mailinglist
mailinglistsama...@gmail.com wrote:
I’m trying to run the stateful network word count at
1. Of course, a single block / partition has many Kafka messages, and
from different Kafka topics interleaved together. The message count is
not related to the block count. Any message received within a
particular block interval will go in the same block.
2. Yes, the receiver will be started on
That is kind of expected, due to data locality. Though you should see
some tasks running on the other executors as the data gets replicated to
other nodes, which can then run tasks based on locality. You have
two solutions:
1. kafkaStream.repartition() to explicitly repartition the received
data
Which version of Spark Streaming are you using?
When the batch processing time increases to 15-20 seconds, could you
compare the task times to the task times when the application
was just launched? Basically, is the increase from 6 seconds to 15-20
seconds caused by an increase in
I am running the job on 1.1.1.
I will let the job run overnight and send you more info on computation vs GC
time tomorrow.
BTW, do you know what the stage description named getCallSite at
DStream.scala:294 might mean?
Thanks, RK
On Tuesday, December 30, 2014 6:02 PM, Tathagata Das
https://issues.apache.org/jira/browse/SPARK-1911
Is one of several tickets on the problem.
On Dec 30, 2014, at 8:36 PM, Davies Liu dav...@databricks.com wrote:
Could you share a link about this? It's common to use Java 7, that
will be nice if we can fix this.
On Mon, Dec 29, 2014 at
I have a Spark app that involves a series of mapPartitions operations and then
a keyBy operation. I have measured the time inside the mapPartitions function
blocks. These blocks take trivial time. Still, the application takes way too
much time, and even the Spark UI shows that much time.
So I was wondering where
Does anyone have suggestions?
On Tue, Dec 23, 2014 at 3:08 PM, Chen Song chen.song...@gmail.com wrote:
Silly question, what is the best way to shuffle protobuf messages in Spark
(Streaming) job? Can I use Kryo on top of protobuf Message type?
--
Chen Song
--
Chen Song
Thanks Matei.
-D
On Tue, Dec 30, 2014 at 4:49 PM, Matei Zaharia matei.zaha...@gmail.com
wrote:
This file needs to be on your CLASSPATH actually, not just in a directory.
The best way to pass it in is probably to package it into your application
JAR. You can put it in src/main/resources in a
This is still using a non-existent hadoop-2.5 profile, and
-Dscala-2.10 won't do anything. These don't matter though; this error
is just some scalac problem. I don't see this error when compiling.
On Wed, Dec 31, 2014 at 12:48 AM, j_soft zsof...@gmail.com wrote:
no,it still fail use mvn -Pyarn
Thanks for your answer, Xuefeng Wu.
But I don't understand how to save a graph as an object. :(
Do you have any sample codes?
2014-12-31 13:27 GMT+09:00 Jason Hong begger3...@gmail.com:
Thanks for your answer, Xuefeng Wu.
But, I don't understand how to save a graph as object. :(
Do you have
Hi all!
I use Spark SQL 1.2 to start the Thrift server on YARN.
I want to use the fair scheduler in the Thrift server.
I set the properties in spark-defaults.conf like this:
spark.scheduler.mode FAIR
spark.scheduler.allocation.file
/opt/spark-1.2.0-bin-2.4.1/conf/fairscheduler.xml
In the thrift server
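For reference, a minimal fairscheduler.xml of the kind that path points to might look like this (the pool name, weight, and minShare are illustrative values, not taken from the poster's setup):

```
<?xml version="1.0"?>
<allocations>
  <pool name="default">
    <schedulingMode>FAIR</schedulingMode>
    <weight>1</weight>
    <minShare>0</minShare>
  </pool>
</allocations>
```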