I tried running it but it didn't work
public static final SparkConf batchConf = new SparkConf();
String master = "spark://sivarani:7077";
String spark_home = "/home/sivarani/spark-1.0.2-bin-hadoop2/";
String jar = "/home/sivarani/build/Test.jar";
public static final JavaSparkContext batchSparkContext = new
In IntelliJ, Tools > Generate Scaladoc.
Kamal
On Fri, Oct 31, 2014 at 5:35 AM, Alessandro Baretta alexbare...@gmail.com
wrote:
How do I build the scaladoc html files from the spark source distribution?
Alex Baretta
What do your worker logs say?
Best Regards,
Sonal
Nube Technologies http://www.nubetech.co
http://in.linkedin.com/in/sonalgoyal
On Fri, Oct 31, 2014 at 11:44 AM, sivarani whitefeathers...@gmail.com
wrote:
I tried running it but it didn't work
public static final SparkConf batchConf = new
Hi,
Since the Cassandra object is not serializable, you can't open the connection at the driver level and access the object inside foreachRDD (i.e., at the worker level).
You have to open the connection inside foreachRDD, perform the operation, and then close the connection.
For example:
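A minimal sketch of that pattern, assuming the DataStax Java driver and hypothetical host, keyspace, and table names (dstream stands for your input DStream; the connection is opened per partition to amortize the cost):

import com.datastax.driver.core.Cluster

dstream.foreachRDD { rdd =>
  rdd.foreachPartition { records =>
    // Open the connection on the worker, inside foreachRDD/foreachPartition
    val cluster = Cluster.builder().addContactPoint("cassandra-host").build()
    val session = cluster.connect("mykeyspace")
    try {
      records.foreach { record =>
        session.execute(s"INSERT INTO mytable (key, value) VALUES ('$record', 0)")
      }
    } finally {
      // Close the connection before the task finishes
      session.close()
      cluster.close()
    }
  }
}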
Are you expecting something like this?
val data = ssc.textFileStream("hdfs://akhldz:9000/input/")
val rdd = ssc.sparkContext.parallelize(Seq("foo", "bar"))
val sample = data.foreachRDD(x => {
  val new_rdd = x.union(rdd)
  new_rdd.saveAsTextFile("hdfs://akhldz:9000/output/")
})
It says 478548 on host 172.18.152.36: java.lang.ArrayIndexOutOfBoundsException
Can you try putting a try { } catch { } block around all those operations that you are
doing on the DStream? That way it will not stop the entire application
due to corrupt data/fields, etc.
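A minimal sketch of that guard, with a hypothetical parsing/processing step:

def process(key: String, value: String): Unit = ()   // stand-in for the real per-record work

dstream.foreachRDD { rdd =>
  rdd.foreach { line =>
    try {
      val fields = line.split("\t")      // may throw on corrupt rows
      process(fields(0), fields(3))
    } catch {
      case e: Exception =>
        // log and skip the bad record instead of failing the whole application
        System.err.println(s"Skipping corrupt record: $line (${e.getMessage})")
    }
  }
}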
Thanks
Best Regards
On Fri, Oct 31,
I am trying to write some sample code in IntelliJ IDEA. I start with a
non-sbt Scala project. So that the program compiles, I add
*spark-assembly-1.1.0-hadoop2.4.0.jar* from the *spark/lib* directory as an
external library of the IDEA project.
You can also use PairRDDFunctions' saveAsNewAPIHadoopFile that takes an
OutputFormat class.
So you will have to write a custom OutputFormat class that extends
OutputFormat. In this class, you will have to implement a getRecordWriter
which returns a custom RecordWriter.
So you will also have to
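For reference, a minimal sketch of such a custom OutputFormat in Scala, extending FileOutputFormat so only getRecordWriter needs implementing (class and path names are illustrative):

import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.{NullWritable, Text}
import org.apache.hadoop.mapreduce.{RecordWriter, TaskAttemptContext}
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat

class LineOutputFormat extends FileOutputFormat[NullWritable, Text] {
  override def getRecordWriter(context: TaskAttemptContext): RecordWriter[NullWritable, Text] = {
    val path: Path = getDefaultWorkFile(context, ".txt")
    val out = path.getFileSystem(context.getConfiguration).create(path, false)
    new RecordWriter[NullWritable, Text] {
      // Write each value on its own line
      override def write(key: NullWritable, value: Text): Unit =
        out.write((value.toString + "\n").getBytes("UTF-8"))
      override def close(context: TaskAttemptContext): Unit = out.close()
    }
  }
}

// Usage (key/value classes must match the format's type parameters):
// rdd.map(v => (NullWritable.get(), new Text(v)))
//    .saveAsNewAPIHadoopFile("hdfs://.../out", classOf[NullWritable], classOf[Text], classOf[LineOutputFormat])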
Hi everyone, I have an RDD filled with data like (k1, v11)
(k1, v12) (k1, v13) (k2, v21) (k2, v22) (k2, v23)
...
I want to calculate the average and standard deviation of (v11, v12, v13)
and (v21, v22, v23) grouped by their keys for
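One common way to do this is aggregateByKey over (count, sum, sum of squares); a rough sketch, assuming pairs is an RDD[(String, Double)]:

val stats = pairs
  .aggregateByKey((0L, 0.0, 0.0))(
    { case ((n, s, q), v) => (n + 1, s + v, q + v * v) },                  // fold one value into the accumulator
    { case ((n1, s1, q1), (n2, s2, q2)) => (n1 + n2, s1 + s2, q1 + q2) })  // merge partition accumulators
  .mapValues { case (n, s, q) =>
    val mean = s / n
    val stdDev = math.sqrt(q / n - mean * mean)   // population standard deviation
    (mean, stdDev)
  }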
I think you can try to use the Hadoop DBOutputFormat
Best Regards,
Sonal
Nube Technologies http://www.nubetech.co
http://in.linkedin.com/in/sonalgoyal
On Fri, Oct 31, 2014 at 1:00 PM, Kamal Banga ka...@sigmoidanalytics.com
wrote:
You can also use PairRDDFunctions' saveAsNewAPIHadoopFile
No, empty parens do not matter when calling no-arg methods in Scala.
This invocation should work as-is and should result in the RDD showing
in Storage. I see that when I run it right now.
Since it really does/should work, I'd look at other possibilities --
is it maybe taking a short time to start
cache() won't speed up a single operation on an RDD, since it is
computed the same way before it is persisted.
On Thu, Oct 30, 2014 at 7:15 PM, Sameer Farooqui same...@databricks.com wrote:
By the way, in case you haven't done so, do try to .cache() the RDD before
running a .count() on it as
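A minimal sketch of that, for context (the path is illustrative):

val data = sc.textFile("hdfs://namenode/path")
data.cache()      // only marks the RDD for caching; nothing is computed yet
data.count()      // first action computes the RDD and populates the cache
data.count()      // later actions read from the cache and run much faster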
Hi, all
My job failed and there are a lot of "ERROR ConnectionManager: Corresponding
SendingConnection to ConnectionManagerId not found" messages in the log.
Can anyone tell me what's wrong and how to fix it?
Best Regards,
Kevin.
I am dealing with a Lambda Architecture. This means I have Hadoop on the
batch layer, Storm on the speed layer and I'm storing the precomputed views
from both layers in Cassandra.
I understand that Spark is a substitute for Hadoop but at the moment I would
like not to change the batch layer.
I
Hi,
I have input data consisting of many very small files, each containing one .json.
For performance reasons (I use PySpark) I have to do repartitioning; currently I
do:
sc.textFile(files).coalesce(100)
The problem is that I have to guess the number of partitions in such a way that
it's as fast as
I found my problem. I assumed, based on the TF-IDF article in Wikipedia, that log base
10 is used, but as I found in this discussion
https://groups.google.com/forum/#!topic/scala-language/K5tbYSYqQc8, in
Scala it is actually ln (the natural logarithm).
Regards,
Andrejs
On Thu, Oct 30, 2014 at 10:49 PM, Ashic
Yes, here the base doesn't matter as it just multiplies all results by
a constant factor. Math libraries tend to have ln, not log10 or log2.
ln is often the more, er, natural base for several computations. So I
would assume that log = ln in the context of ML.
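For reference, switching bases only rescales every score by a constant factor, since log_b(x) = ln(x) / ln(b); rankings and ratios of TF-IDF values are therefore unchanged.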
On Fri, Oct 31, 2014 at 11:31 AM,
Hi,
I am running into the same problem in the context of Spark and YARN.
When I open pyspark with the following command:
spark/bin/pyspark --master yarn-client --num-executors 1 --executor-memory
2500m
It turns out *INFO storage.BlockManagerMasterActor: Registering block
manager
Hi
Thanks
Branch 1.1 did not work but 1.0 worked.
Why could that be?
Regards Hans-Peter
While testing Spark SQL I noticed that COUNT DISTINCT runs really slowly.
The map-partitions phase finishes fast, but the collect phase is slow.
It only runs on a single executor.
Should it run this way?
And here is the simple code which I use for testing:
val sqlContext = new
Hi Jan. I've actually written a function recently to do precisely that using
the RDD.randomSplit function. You just need to calculate how big each element
of your data is, then how many elements can fit in each RDD, to populate the
input to randomSplit. Unfortunately, in my case I wind up
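A rough sketch of that sizing step, assuming record sizes are estimated from a small sample and targetBytes is the desired size per split (both assumptions, not from the original message):

val targetBytes = 64L * 1024 * 1024                     // assumed target size per split
val sample = rdd.take(1000)
val avgBytes = sample.map(_.toString.length.toLong).sum / math.max(sample.length, 1)
val totalBytes = rdd.count() * avgBytes
val numSplits = math.max(1, (totalBytes / targetBytes).toInt)
val weights = Array.fill(numSplits)(1.0 / numSplits)    // equal weights for randomSplit
val parts = rdd.randomSplit(weights)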
Apologies for having to send this to everyone.
I am highly interested in Spark and would like to stay on this mailing list. But
the email address I signed up with is not the right one. The unsubscribe link below
does not seem to work.
https://spark.apache.org/community.html
Can anyone help?
Hongbin,
Please send an email to user-unsubscr...@spark.apache.org in order to
unsubscribe from the user list.
On Fri, Oct 31, 2014 at 9:05 AM, Hongbin Liu hongbin@theice.com wrote:
Apologies for having to send this to everyone.
I am highly interested in Spark and would like to stay on this mailing
Hi,
I am trying to make Spark SQL 1.1 work to replace part of our ETL
processes that are currently done with Hive 0.12.
A common problem that I have encountered is the "Too many files open"
error. Once that happens, the query just fails. I started the
spark-shell by using ulimit -n 4096
Thanks for the pointers! I did try, but it didn't seem to help...
In my latest try, I am doing spark-submit local
But I see the same message in the Spark app UI (4040):
localhost CANNOT FIND ADDRESS
In the logs, I see a lot of spilling of in-memory maps to disk. I don't understand why
that is the case.
It's almost surely the workers, not the driver (shell) that have too
many files open. You can change their ulimit. But it's probably better
to see why it happened -- a very big shuffle? -- and repartition or
design differently to avoid it. The new sort-based shuffle might help
in this regard.
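For reference, in Spark 1.1 the sort-based shuffle is opt-in via configuration; a minimal sketch:

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("shuffle-heavy-job")
  .set("spark.shuffle.manager", "sort")   // hash-based shuffle is still the default in 1.1
val sc = new SparkContext(conf)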
On
What is it for? When do we call it?
thanks!
Hi Sean,
Thanks for the reply. I think both driver and worker have the problem. You
are right that the ulimit fixed the driver side too many files open error.
And there is a very big shuffle. My maybe naive thought is to migrate the
HQL scripts directly from Hive to Spark SQL and make them work.
Based on execution on small test cases, it appears that the construction
below does what I intend. (Yes, all those Tuple1()s were superfluous.)
var lines = ssc.textFileStream(dirArg)
var linesArray = lines.map( line => (line.split("\t")))
var newState = linesArray.map( lineArray =>
Thanks Chris for looking at this. I was putting data at roughly the same 50
records per batch max. This issue was purely because of a bug in my
persistence logic that was leaking memory.
Overall, I haven't seen a lot of lag with kinesis + spark setup and I am
able to process records at roughly
Hi Ilya,
This seems to me like quite a complicated solution. I'm thinking that an easier
(though not optimal) solution might be, for example, to heuristically use
something like RDD.coalesce(RDD.getNumPartitions() / N), but it keeps me wondering
why Spark does not have something like
Yes, I would expect it as you say: setting executor-cores to 1 should work, but
it seems to me that when I do use executor-cores=1 it actually
performs more than one job on each of the machines at a time (at least
based on what top says).
Hm, now I am also seeing this problem.
The essence of my code is:
final JavaSparkContext sparkContext = new JavaSparkContext(sparkConf);
JavaStreamingContextFactory streamingContextFactory = new
JavaStreamingContextFactory() {
@Override
public JavaStreamingContext create() {
The original problem is in biology but the following captures the CS
issues. Assume I have a large number of locks and a large number of keys.
There is a scoring function between keys and locks and a key that fits a
lock will have a high score. There may be many keys fitting one lock and a
key
As Sean suggested, try out the new sort-based shuffle in 1.1 if you know
you're triggering large shuffles. That should help a lot.
On Friday, October 31, 2014, Bill Q bill.q@gmail.com wrote:
Hi Sean,
Thanks for the reply. I think both driver and worker have the problem. You
are right that the
Are there any tools like Ganglia that I can use to measure performance on Spark, or
do I need to do it myself?
Thanks!
Hi Mahsa,
Use SPM http://sematext.com/spm/. See
http://blog.sematext.com/2014/10/07/apache-spark-monitoring/ .
Otis
--
Monitoring * Alerting * Anomaly Detection * Centralized Log Management
Solr Elasticsearch Support * http://sematext.com/
On Fri, Oct 31, 2014 at 1:00 PM, mahsa
Hi Guys,
I'm trying to execute a Spark job using Python, running on a YARN cluster
(managed by Cloudera Manager). The Python script uses a set of
Python programs installed on each member of the cluster. This set of programs
needs a property file found at a local filesystem path.
My problem is:
Oh this is Awesome! exactly what I needed! Thank you Otis!
It is used to shut down the context when you're done with it, but if you're
using a context for the lifetime of your application I don't think it
matters.
I use this in my unit tests, because they start up local contexts and you
can't have multiple local contexts open so each test must stop its
The only thing in your code that cannot be parallelized is the collect()
because -- by definition -- it collects all the results to the driver node.
This has nothing to do with the DISTINCT in your query.
What do you want to do with the results after you collect them? How many
results do you have
Hi,
I am using the latest Cassandra-Spark Connector to access Cassandra tables
from Spark. While I successfully managed to connect to Cassandra using
CassandraRDD, the similar SparkSQL approach does not work. Here is my code
for both methods:
import com.datastax.spark.connector._
import
Hi Harold,
Can you include the versions of spark and spark-cassandra-connector you are
using?
Thanks!
Helena
@helenaedelson
On Oct 30, 2014, at 12:58 PM, Harold Nguyen har...@nexgate.com wrote:
Hi all,
I'd like to be able to modify values in a DStream, and then send it off to an
Thanks Lalit, and Helena,
What I'd like to do is manipulate the values within a DStream like this:
DStream.foreachRDD( rdd => {
  val arr = record.toArray
})
I'd then like to be able to insert results from the arr back into
Cassandra, after I've manipulated the arr array.
However, for all
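A minimal sketch of that flow with the connector, assuming a hypothetical keyspace, table, and row case class, and a dstream of (String, Double) pairs:

import com.datastax.spark.connector._

case class Entry(id: String, value: Double)   // hypothetical row type matching the table schema

dstream.foreachRDD { rdd =>
  val transformed = rdd.map { case (id, v) =>
    // manipulate the values here, then emit a row to write back
    Entry(id, v * 2.0)
  }
  transformed.saveToCassandra("mykeyspace", "mytable")
}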
Hi Shahab,
I'm just curious, do you explicitly need to use Thrift? Just using the
connector with Spark does not require any Thrift dependencies.
Simply: "com.datastax.spark" %% "spark-cassandra-connector" % "1.1.0-beta1"
But to your question, you declare the keyspace but also unnecessarily
Hi Steve,
Are you talking about sequence alignment?
—
FG
On Fri, Oct 31, 2014 at 5:44 PM, Steve Lewis lordjoe2...@gmail.com
wrote:
The original problem is in biology but the following captures the CS
issues, Assume I have a large number of locks and a large number of keys.
There is a
Hi Harold,
Yes, that is the problem :) Sorry for the confusion, I will make this clear in
the docs ;) since master is the work for the next version.
All you need to do is use
Spark 1.1.0 as you have it already
"com.datastax.spark" %% "spark-cassandra-connector" % "1.1.0-beta1"
and assembly - not from
Hi Harold,
This is a great use case, and here is how you could do it, for example, with
Spark Streaming:
Using a Kafka stream:
https://github.com/killrweather/killrweather/blob/master/killrweather-app/src/main/scala/com/datastax/killrweather/KafkaStreamingActor.scala#L50
Save raw data to
Thanks Helena.
I tried setting the keyspace, but I got the same result. I also removed the other
Cassandra dependencies, but still the same exception!
I also tried to see whether this setting appears in the CassandraSQLContext or
not, so I printed out the configuration:
val cc = new
You don't have to call it if you just exit your application, but it's useful
for example in unit tests if you want to create and shut down a separate
SparkContext for each test.
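A minimal sketch of that test pattern:

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setMaster("local").setAppName("test"))
try {
  assert(sc.parallelize(1 to 10).count() == 10)
} finally {
  sc.stop()   // release the local context so the next test can create its own
}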
Matei
On Oct 31, 2014, at 10:39 AM, Evan R. Sparks evan.spa...@gmail.com wrote:
In cluster settings if you don't
Hi Shahab,
The apache cassandra version looks great.
I think that doing
cc.setKeyspace("mydb")
cc.sql("SELECT * FROM mytable")
versus
cc.setKeyspace("mydb")
cc.sql("select * from mydb.mytable")
Is the problem? And if not, would you mind creating a ticket off-list for us to
help further?
Hi All,
I am using LinearRegression and have a question about the details of the
model.predict method. Basically it predicts a variable y given an input
vector x. However, can someone point me to the documentation about what
threshold is used in the predict method? Can that be changed? I am
OK, I created an issue. Hopefully it will be resolved soon.
Again thanks,
best,
/Shahab
On Fri, Oct 31, 2014 at 7:05 PM, Helena Edelson helena.edel...@datastax.com
wrote:
Hi Shahab,
The apache cassandra version looks great.
I think that doing
cc.setKeyspace("mydb")
cc.sql("SELECT * FROM
Actually, if you don't call SparkContext.stop(), the event log
information that is used by the history server will be incomplete, and
your application will never show up in the history server's UI.
If you don't use that functionality, then you're probably ok not
calling it as long as your
It sounds like you are asking about logistic regression, not linear
regression. If so, yes that's just what it does. The default would be
0.5 in logistic regression. If you 'clear' the threshold you get the
raw margin out of this and other linear classifiers.
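A minimal sketch of both behaviours with MLlib's logistic regression (trainingData and features are stand-ins):

import org.apache.spark.mllib.classification.LogisticRegressionWithSGD

val model = LogisticRegressionWithSGD.train(trainingData, 100)   // trainingData: RDD[LabeledPoint]

model.setThreshold(0.5)             // default behaviour: predict returns 0.0 or 1.0
val label = model.predict(features)

model.clearThreshold()              // predict now returns the raw score instead of a class label
val score = model.predict(features)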
On Fri, Oct 31, 2014 at 7:18 PM,
Does the following help?
JavaPairRDD<bin, key> join with JavaPairRDD<bin, lock>
If you partition both RDDs by the bin id, I think you should be able to get
what you want.
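A rough Scala sketch of that bin-keyed join (types and the scoring function are illustrative):

import org.apache.spark.HashPartitioner
import org.apache.spark.rdd.RDD

def score(key: String, lock: String): Double = key.intersect(lock).length.toDouble  // stand-in scorer

def candidatePairs(keysByBin: RDD[(Int, String)], locksByBin: RDD[(Int, String)]): RDD[(String, String, Double)] = {
  val part = new HashPartitioner(100)
  keysByBin.partitionBy(part)
    .join(locksByBin.partitionBy(part))             // only keys and locks sharing a bin are paired
    .map { case (_, (key, lock)) => (key, lock, score(key, lock)) }
}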
Best Regards,
Sonal
Nube Technologies http://www.nubetech.co
http://in.linkedin.com/in/sonalgoyal
On Fri, Oct 31, 2014 at
You can serialize the model to a local/hdfs file system and use it later
when you want.
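One way to do that in these versions is plain object serialization via the RDD API; a rough sketch (the path is illustrative):

import org.apache.spark.mllib.classification.LogisticRegressionModel

// Save the trained model as a Hadoop object file...
sc.parallelize(Seq(model), 1).saveAsObjectFile("hdfs://namenode/models/lr-model")

// ...and load it back later in another job
val restored = sc.objectFile[LogisticRegressionModel]("hdfs://namenode/models/lr-model").first()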
Best Regards,
Sonal
Nube Technologies http://www.nubetech.co
http://in.linkedin.com/in/sonalgoyal
On Sat, Nov 1, 2014 at 12:02 AM, Sean Owen so...@cloudera.com wrote:
It sounds like you are asking about
I want to fold an RDD into a smaller RDD with the max elements. I have simple
bean objects with 4 properties. I want to group by 3 of the properties and then
select the max of the 4th. So I believe fold is the appropriate method for
this. My question is, is there a good fold example out there?
I am synced up to the Spark master branch as of commit 23468e7e96. I have Maven
3.0.5, Scala 2.10.3, and SBT 0.13.1. I’ve built the master branch successfully
previously and am trying to rebuild again to take advantage of the new Hive
0.13.1 profile. I execute the following command:
$ mvn
Yeah looks like https://github.com/apache/spark/pull/2744 broke the
build. We will fix it soon
On Fri, Oct 31, 2014 at 12:21 PM, Terry Siu terry@smartfocus.com wrote:
I am synced up to the Spark master branch as of commit 23468e7e96. I have
Maven 3.0.5, Scala 2.10.3, and SBT 0.13.1. I’ve
Thanks for the update, Shivaram.
-Terry
On 10/31/14, 12:37 PM, Shivaram Venkataraman
shiva...@eecs.berkeley.edu wrote:
Yeah looks like https://github.com/apache/spark/pull/2744 broke the
build. We will fix it soon
On Fri, Oct 31, 2014 at 12:21 PM, Terry Siu terry@smartfocus.com
wrote:
I
Hi,
I have an issue with running Spark in standalone mode on a cluster.
Everything seems to run fine for a couple of minutes until Spark stops
executing the tasks.
Any idea?
Would appreciate some help.
Thanks in advance,
Tassilo
I get errors like that at the end:
14/10/31 16:16:59 INFO
You should look at how fold is used in scala in general to help. Here is a
blog post that may also give some guidance:
http://blog.madhukaraphatak.com/spark-rdd-fold
The zero value should be your bean, with the 4th parameter set to the
minimum value. Your fold function should compare the 4th
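A minimal sketch of that approach, with a hypothetical bean keyed on its first three fields and folded to the maximum of the fourth:

case class Bean(a: String, b: String, c: String, value: Long)

// beans: RDD[Bean]
val maxPerGroup = beans
  .map(bean => ((bean.a, bean.b, bean.c), bean))
  .foldByKey(Bean("", "", "", Long.MinValue)) { (acc, bean) =>   // zero value carries the minimum 4th field
    if (bean.value > acc.value) bean else acc
  }
  .values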
Try running this command:
sudo -E R CMD javareconf
and then start Spark. Basically, it syncs R's Java configuration with your
system's Java configuration.
Good luck!
How do I set up HADOOP_CONF_DIR correctly when I'm running my Spark job on
YARN? My YARN environment has the correct HADOOP_CONF_DIR settings, but the
configuration that I pull from sc.hadoopConfiguration() is incorrect.
I was really surprised to see the results here, esp. SparkSQL not
completing
http://www.citusdata.com/blog/86-making-postgresql-scale-hadoop-style
I was under the impression that SparkSQL performs really well because it
can optimize the RDD operations and load only the columns that are
required.
We have seen all kinds of results published that often contradict each other.
My take is that the authors often know more tricks about how to tune their
own/familiar products than the others. So the product in focus is tuned for
ideal performance while the competitors are not. The authors are
Dear Sir/Madam,
We want to become organisers of a Singapore meetup to promote the regional
Spark and big data community in the ASEAN area.
My name is Songtao, I am a big data consultant in Singapore and have great
passion for Spark technologies.
Thanks,
Songtao
Hi Sean/Sameer,
It seems you're both right. In the Python shell I need to explicitly call
data.cache() with the empty parens, then run an action, and it appears in the
Storage tab. Using the Scala shell I can just call data.cache without the
parens, run an action, and that works.
Thanks for your help.
I agree. My personal experience with Spark core is that it performs really
well once you tune it properly.
As far as I understand, Spark SQL under the hood performs many of these
optimizations (ordering of Spark operations) and uses a more efficient storage
format. Is this assumption correct?
Has anyone