Out of curiosity I wanted to see what JBoss supported in terms of
clustering and database connection pooling since its implementation should
suffice for your use case. I found:
*Note:* JBoss does not recommend using this feature in a production
environment. It requires accessing a connection pool
I think these are the following configurations that you are looking for:
*spark.locality.wait*: Number of milliseconds to wait to launch a
data-local task before giving up and launching it on a less-local node. The
same wait will be used to step through multiple locality levels
(process-local,
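For reference, a minimal sketch of setting this property via SparkConf (Spark 1.x style; the 6000 ms value is only an illustration):
import org.apache.spark.{SparkConf, SparkContext}
// Wait up to 6000 ms for a data-local slot before falling back to a less-local one
val conf = new SparkConf()
  .setAppName("locality-wait-example")
  .set("spark.locality.wait", "6000")
val sc = new SparkContext(conf)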
Have you referred to the official documentation of KMeans at
https://spark.apache.org/docs/1.1.1/mllib-clustering.html ?
What he meant is to look it up in the Spark UI, specifically in the Stages
tab, to see what is taking so long. And yes, a code snippet helps us debug.
TD
On Fri, Apr 3, 2015 at 12:47 AM, Akhil Das ak...@sigmoidanalytics.com
wrote:
You need to open the page of the stage that is taking time, and see how
How did you build Spark? Which version of Spark are you using? Doesn't
this thread already explain it?
https://www.mail-archive.com/user@spark.apache.org/msg25505.html
Thanks
Best Regards
On Thu, Apr 2, 2015 at 11:10 PM, Todd Nist tsind...@gmail.com wrote:
Hi Akhil,
Tried your suggestion
Did you try these?
- Disable shuffle: spark.shuffle.spill=false
- Enable log rotation:
sparkConf.set("spark.executor.logs.rolling.strategy", "size")
.set("spark.executor.logs.rolling.size.maxBytes", "1024")
.set("spark.executor.logs.rolling.maxRetainedFiles", "3")
Thanks
Best Regards
On Fri, Apr 3, 2015
You could also try setting your `nofile` value in /etc/security/limits.conf
for `soft` to some ridiculously high value if you haven't done so already.
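For example, entries along these lines in /etc/security/limits.conf (the user name and values below are purely illustrative):
sparkuser  soft  nofile  1000000
sparkuser  hard  nofile  1000000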
On Fri, Apr 3, 2015 at 2:09 AM Akhil Das ak...@sigmoidanalytics.com wrote:
Did you try these?
- Disable shuffle : spark.shuffle.spill=false
-
I copied the following from the Spark Streaming UI. I don't know why
Waiting batches is 1; my understanding is that it should be 72.
Following is my understanding:
1. Total time is 1 minute 35 seconds = 95 seconds
2. Batch interval is 1 second, so, 95 batches are generated in 95 seconds.
3.
Good! Thank you.
On Thu, Apr 2, 2015 at 9:05 AM, Xiangrui Meng men...@gmail.com wrote:
I reproduced the bug on master and submitted a patch for it:
https://github.com/apache/spark/pull/5329. It may get into Spark
1.3.1. Thanks for reporting the bug! -Xiangrui
On Wed, Apr 1, 2015 at 12:57
I placed it there. It was downloaded from MySql site.
On Fri, Apr 3, 2015 at 6:25 AM, ÐΞ€ρ@Ҝ (๏̯͡๏) deepuj...@gmail.com wrote:
Akhil
you mentioned /usr/local/spark/lib/mysql-connector-java-5.1.34-bin.jar .
How did you get this lib into the spark/lib folder?
1) did you place it there ?
2) What
Hi Deepujain,
I did include the jar file, I believe it is hive-exec.jar, through the
--jars option:
./bin/spark-shell --master spark://radtech.io:7077
--total-executor-cores 2 --driver-class-path
/usr/local/spark/lib/mysql-connector-java-5.1.34-bin.jar --jars
Can you include -X in your maven command and pastebin the output ?
Cheers
On Apr 3, 2015, at 3:58 AM, myelinji myeli...@aliyun.com wrote:
Thank you for your reply. When I'm using Maven to compile the whole project,
the errors are as follows:
[INFO] Spark Project Parent POM
I think you need to include, through the --jars option, the jar file that
contains the Hive definition (code) of the UDF json_tuple. That should solve
your problem.
On Fri, Apr 3, 2015 at 3:57 PM, Todd Nist tsind...@gmail.com wrote:
I placed it there. It was downloaded from MySql site.
On Fri, Apr 3,
Hi,
Are there any recommendations for operating systems that one should use for
setting up Spark/Hadoop nodes in general?
I am not familiar with the differences between the various Linux distributions
or how well they are (not) suited for cluster set-ups, so I wondered if there
is some
Started the spark shell with the one jar from hive suggested:
./bin/spark-shell --master spark://radtech.io:7077
--total-executor-cores 2 --driver-class-path
/usr/local/spark/lib/mysql-connector-java-5.1.34-bin.jar --jars
/opt/apache-hive-0.13.1-bin/lib/hive-exec-0.13.1.jar
Results in the same
Copy pasted his command in the same thread.
Thanks
Best Regards
On Fri, Apr 3, 2015 at 3:55 PM, ÐΞ€ρ@Ҝ (๏̯͡๏) deepuj...@gmail.com wrote:
Akhil
you mentioned /usr/local/spark/lib/mysql-connector-java-5.1.34-bin.jar .
How did you get this lib into the spark/lib folder?
1) did you place it there
There isn't any specific Linux distro, but I would prefer Ubuntu for a
beginner as it's very easy to apt-get install things on it.
Thanks
Best Regards
On Fri, Apr 3, 2015 at 4:58 PM, Horsmann, Tobias tobias.horsm...@uni-due.de
wrote:
Hi,
Are there any recommendations for operating systems
This thread might give you some insights
http://mail-archives.apache.org/mod_mbox/incubator-spark-user/201311.mbox/%3CCA+WVT8WXbEHac=N0GWxj-s9gqOkgG0VRL5B=ovjwexqm8ev...@mail.gmail.com%3E
Thanks
Best Regards
On Fri, Apr 3, 2015 at 3:53 PM, ÐΞ€ρ@Ҝ (๏̯͡๏) deepuj...@gmail.com wrote:
My Spark Job
What version of Cassandra are you using? Are you using DSE or the stock
Apache Cassandra version? I have connected it with DSE, but have not
attempted it with the standard Apache Cassandra version.
FWIW,
Hi Debasish, Charles,
I solved the problem by using a BPQ like method, based on your suggestions.
So thanks very much for that!
My approach was
1) Count the population of each segment in the RDD by map/reduce so that I
get the bound number N equivalent to 10% of each segment. This becomes the
A hack workaround is to use flatMap:
rdd.flatMap{ case (date, array) => for (x <- array) yield (date, x) }
For those of you who don't know Scala, the for comprehension iterates
through the ArrayBuffer, named array and yields new tuples with the date
and each element. The case expression to the
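For readers who prefer map over a for comprehension, an equivalent form of the same hack (a sketch, not from the original post) would be:
rdd.flatMap { case (date, array) => array.map(x => (date, x)) }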
I was able to write a record that extends SpecificRecord (Avro); this class was
not auto-generated. Do we need to do something extra for auto-generated classes?
Sent from my iPhone
On 03-Apr-2015, at 5:06 pm, Akhil Das ak...@sigmoidanalytics.com wrote:
This thread might give you some insights
My apologies. I was running this locally and the JAR I was building
using IntelliJ had some issues.
This was not related to UDFs. All works fine now.
On Thu, Apr 2, 2015 at 2:58 PM, Ted Yu yuzhih...@gmail.com wrote:
Can you show more code in CreateMasterData ?
How do you run your code ?
Hi,
I got this error when creating a Hive table from a Parquet file:
DDLTask: org.apache.hadoop.hive.ql.metadata.HiveException:
java.lang.UnsupportedOperationException: Parquet does not support
timestamp. See HIVE-6384
I checked HIVE-6384; it's fixed in 0.14.
The Hive in the Spark build is a customized
I meant that I did not have to use Kryo. Why will Kryo help fix this issue now?
Sent from my iPhone
On 03-Apr-2015, at 5:36 pm, Deepak Jain deepuj...@gmail.com wrote:
I was able to write a record that extends SpecificRecord (Avro); this class was
not auto-generated. Do we need to do
Because it's throwing serialization exceptions, and Kryo is a serializer
to serialize your objects.
Thanks
Best Regards
On Fri, Apr 3, 2015 at 5:37 PM, Deepak Jain deepuj...@gmail.com wrote:
I meant that I did not have to use Kryo. Why will Kryo help fix this issue
now?
Sent from my
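For reference, a minimal sketch of enabling Kryo as suggested above (MyRecord is a placeholder for your own class, not something from the original thread):
import org.apache.spark.SparkConf
val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  // Registering your classes avoids Kryo writing out full class names
  .registerKryoClasses(Array(classOf[MyRecord])) // MyRecord is hypothetical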
Thanks Mohammed. Will give it a try today. We would also need the Spark SQL
piece as we are migrating our data store from Oracle to C*, and it would be
easier to maintain all the reports rather than recreating each one from scratch.
Thanks,
Pawan Venugopal.
On Apr 3, 2015 7:59 AM, Mohammed Guller
Cool! You should also consider contributing it back to Spark if you are doing
quantile calculations, for example... There is also a topByKey API added in
master by @coderxiang. See if you can use that API to make the code
cleaner.
On Apr 3, 2015 5:20 AM, Aung Htet aung@gmail.com wrote:
Hi
As Akhil says Ubuntu is a good choice if you're starting from near scratch.
Cloudera CDH virtual machine images[1] include Hadoop, HDFS, Spark, and
other big data tools so you can get a cluster running with very little
effort. Keep in mind Cloudera is a for-profit corporation so they are also
Hi Todd,
Thanks for the link. I would be interested in this solution. I am using DSE
for Cassandra. Would you provide me with info on connecting with DSE either
through Tableau or Zeppelin? The goal here is to query Cassandra through Spark
SQL so that I could perform joins and group by on my queries.
Hi Todd,
We are using Apache C* 2.1.3, not DSE. We got Tableau to work directly with C*
using the ODBC driver, but now would like to add Spark SQL to the mix. I
haven’t been able to find any documentation for how to make this combination
work.
We are using the Spark-Cassandra-Connector in our
(That one was already fixed last week, and so should be updated when
the site updates for 1.3.1.)
On Fri, Apr 3, 2015 at 4:59 AM, Michael Armbrust mich...@databricks.com wrote:
Looks like a typo, try:
df.select(df("name"), df("age") + 1)
Or
df.select("name", "age")
PRs to fix docs are always
If you're asking about a compile error, you should include the command
you used to compile.
I am able to compile branch 1.2 successfully with mvn -DskipTests
clean package.
This error is actually an error from scalac, not a compile error from
the code. It sort of sounds like it has not been able
My Spark Job failed with
15/04/03 03:15:36 INFO scheduler.DAGScheduler: Job 0 failed:
saveAsNewAPIHadoopFile at AbstractInputHelper.scala:103, took 2.480175 s
15/04/03 03:15:36 ERROR yarn.ApplicationMaster: User class threw exception:
Job aborted due to stage failure: Task 0.0 in stage 2.0 (TID
Akhil
you mentioned /usr/local/spark/lib/mysql-connector-java-5.1.34-bin.jar .
How did you get this lib into the spark/lib folder?
1) did you place it there ?
2) What is download location ?
On Fri, Apr 3, 2015 at 3:42 PM, Todd Nist tsind...@gmail.com wrote:
Started the spark shell with the one
Thank you for your reply. When I'm using Maven to compile the whole project,
the errors are as follows:
[INFO] Spark Project Parent POM .. SUCCESS [4.136s]
[INFO] Spark Project Networking .. SUCCESS [7.405s]
[INFO] Spark Project Shuffle Streaming Service
Hi Mohammed,
Not sure if you have tried this or not. You could try using the API below
to start the Thrift server with an existing context.
https://github.com/apache/spark/blob/master/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/HiveThriftServer2.scala#L42
The one
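A rough sketch of what that usage might look like (assuming a HiveContext built on your existing SparkContext, and that startWithContext is available in your Spark build):
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.hive.thriftserver.HiveThriftServer2
val hiveContext = new HiveContext(sc) // sc is your existing SparkContext
// Tables registered/cached in this context become visible over JDBC/ODBC
HiveThriftServer2.startWithContext(hiveContext)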
@Pawan
Not sure if you have seen this or not, but here is a good example by
Jonathan Lacefield of Datastax on hooking up Spark SQL with DSE; adding
Tableau is as simple as Mohammed stated with DSE.
https://github.com/jlacefie/sparksqltest.
HTH,
Todd
On Fri, Apr 3, 2015 at 2:39 PM, Todd Nist
Happening right now
https://www.reddit.com/r/IAmA/comments/31bkue/im_matei_zaharia_creator_of_spark_and_cto_at/
Very good question! This is because the current code is written such that
the UI considers a batch as waiting only when it has actually started being
processed. That is, batches waiting in the job queue are not considered in the
calculation. It is arguable that it may be more intuitive to count that
I am new to MLlib, so I have a basic question: is it possible to save MLlib
models (particularly CF models) to HDFS and then reload them later? If yes, could
you share some sample code (I could not find it in the MLlib tutorial)? Thanks!
Maybe add another stat for batches waiting in the job queue ?
Cheers
On Fri, Apr 3, 2015 at 10:01 AM, Tathagata Das t...@databricks.com wrote:
Very good question! This is because the current code is written such that
the ui considers a batch as waiting only when it has actually started being
Maybe that should be marked as waiting as well. We plan to update the UI
soon, so we will keep that in mind.
On Apr 3, 2015 10:12 AM, Ted Yu yuzhih...@gmail.com wrote:
Maybe add another stat for batches waiting in the job queue ?
Cheers
On Fri, Apr 3, 2015 at 10:01 AM,
spark-streaming-kinesis-asl is not part of the Spark distribution on your
cluster, so you cannot have it be just a provided dependency. This is also
why the KCL and its dependencies were not included in the assembly (but yes,
they should be).
~ Jonathan Kelly
From: Vadim Bichutskiy
Remove provided and got the following error:
[error] (*:assembly) deduplicate: different file contents found in the
following:
[error]
/Users/vb/.ivy2/cache/com.esotericsoftware.kryo/kryo/bundles/kryo-2.21.jar:com/esotericsoftware/minlog/Log$Logger.class
[error]
Hi,
Thanks! I'll add the JIRA. I'll also try to work on a patch this weekend.
-- Ankur Chauhan
On 03/04/2015 13:23, Tim Chen wrote:
Hi Ankur,
There isn't a way to do that yet, but it's simple to add.
Can you create a JIRA in Spark for
@Pawan,
So it's been a couple of months since I have had a chance to do anything
with Zeppelin, but here is a link to a post on what I did to get it working
https://groups.google.com/forum/#!topic/zeppelin-developers/mCNdyOXNikI.
This may or may not work with the newer releases from Zeppelin.
If I use local[2] instead of *URL:* spark://ip-10-241-251-232:7077 this
seems to work. I don't understand why, though, because when I
give spark://ip-10-241-251-232:7077 the application seems to bootstrap
successfully, it just doesn't create a socket on the port?
On Fri, Mar 27, 2015 at 10:55 AM, Mohit
Thanks. So how do I fix it?
On Fri, Apr 3, 2015 at 3:43 PM, Kelly, Jonathan jonat...@amazon.com wrote:
spark-streaming-kinesis-asl is not part of the Spark distribution on
your cluster, so you cannot have it be just a provided dependency. This
is also why the KCL and its dependencies
What does the Spark Standalone UI at port 8080 say about number of cores?
On Fri, Apr 3, 2015 at 2:53 PM, Mohit Anchlia mohitanch...@gmail.com
wrote:
[ec2-user@ip-10-241-251-232 s_lib]$ cat /proc/cpuinfo |grep process
processor : 0
processor : 1
processor : 2
processor
I am afraid not. The whole point of Spark Streaming is to make it easy to
do complicated processing on streaming data while interoperating with core
Spark, MLlib and SQL without the operational overhead of maintaining 4 different
systems. As a slight cost of achieving that unification, there may be some
Just remove "provided" from the end of the line where you specify the
spark-streaming-kinesis-asl dependency. That will cause that package and all
of its transitive dependencies (including the KCL, the AWS Java SDK libraries
and other transitive dependencies) to be included in your uber jar.
That doesn't seem like a good solution unfortunately as I would be needing
this to work in a production environment. Do you know why the limitation
exists for FileInputDStream in the first place? Unless I'm missing
something important about how some of the internals work I don't see why
this
Assembly settings have an option to exclude jars. You need something
similar to:
assemblyExcludedJars in assembly <<= (fullClasspath in assembly) map { cp =>
  val excludes = Set(
    "minlog-1.2.jar"
  )
  cp filter { jar => excludes(jar.data.getName) }
}
in your build file (may need to be
Hi,
I am trying to figure out if there is a way to tell the mesos
scheduler in spark to isolate the workers to a set of mesos slaves
that have a given attribute such as `tachyon:true`.
Does anyone know if that is possible or how I could achieve such a
Hi All
I am building a logistic regression model for matching person data. Let's say
two person objects are given with their attributes and we need to find the score.
That means on one side you have 10 million records and on the other side we have 1
record, and we need to tell which one matches with the highest score among the 1
My apologies for following up on my own post, but a friend just pointed out that if I
use Kryo with reference counting AND copy-and-paste, this runs.
However, if I try to load the file, this fails as described below.
I thought load was supposed to be equivalent?
Thanks! -Mike
From: Michael Albert
Yes, definitely can be added. Just haven't gotten around to doing it :)
There are proposals for this that you can try -
https://github.com/apache/spark/pull/2765/files . Have you reviewed it at
some point?
On Fri, Apr 3, 2015 at 1:08 PM, Adam Ritter adamge...@gmail.com wrote:
That doesn't seem
So after pulling my hair out for a bit trying to convert one of my standard
spark jobs to streaming I found that FileInputDStream does not support
nested folders (see the brief mention here
http://spark.apache.org/docs/latest/streaming-programming-guide.html#basic-sources
the fileStream method
[ec2-user@ip-10-241-251-232 s_lib]$ cat /proc/cpuinfo |grep process
processor : 0
processor : 1
processor : 2
processor : 3
processor : 4
processor : 5
processor : 6
processor : 7
On Fri, Apr 3, 2015 at 2:33 PM, Tathagata Das t...@databricks.com
Hi Todd,
Thanks for the help. So i was able to get the DSE working with tableau as
per the link provided by Mohammed. Now i trying to figure out if i could
write sparksql queries from tableau and get data from DSE. My end goal is
to get a web based tool where i could write sql queries which will
In 1.3, you can use model.save(sc, "hdfs path"). You can check the
code examples here:
http://spark.apache.org/docs/latest/mllib-collaborative-filtering.html#examples.
-Xiangrui
On Fri, Apr 3, 2015 at 2:17 PM, Justin Yip yipjus...@prediction.io wrote:
Hello Zhou,
You can look at the
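For completeness, a short sketch of the save/load round trip (assuming Spark 1.3+, an existing SparkContext sc, and an RDD[Rating] named ratings; the HDFS path is illustrative):
import org.apache.spark.mllib.recommendation.{ALS, MatrixFactorizationModel}
val model = ALS.train(ratings, 10, 10, 0.01) // rank, iterations, lambda
model.save(sc, "hdfs:///models/als")         // persist the factors to HDFS
val sameModel = MatrixFactorizationModel.load(sc, "hdfs:///models/als")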
How many cores are present in the workers allocated to the standalone cluster
spark://ip-10-241-251-232:7077 ?
On Fri, Apr 3, 2015 at 2:18 PM, Mohit Anchlia mohitanch...@gmail.com
wrote:
If I use local[2] instead of *URL:* spark://ip-10-241-251-232:7077 this
seems to work. I don't understand
Thanks, Todd.
It is an interesting idea; worth trying.
I think the cash project is old. The tuplejump guy has created another project
called CalliopeServer2, which works like a charm with BI tools that use JDBC,
but unfortunately Tableau throws an error when it connects to it.
Mohammed
From:
@Todd,
I had looked at it yesterday. All the dependencies explained there are added on
the DSE node. Do I need to include Spark and DSE dependencies on the
Zeppelin node?
I built zeppelin with no spark and no hadoop. To my understanding zeppelin
will send a request to a remote master at spark://
Just remove "provided" for spark-streaming-kinesis-asl
libraryDependencies += "org.apache.spark" %% "spark-streaming-kinesis-asl"
% "1.3.0"
On Fri, Apr 3, 2015 at 12:45 PM, Vadim Bichutskiy
vadim.bichuts...@gmail.com wrote:
Thanks. So how do I fix it?
On Fri, Apr 3, 2015 at 3:43 PM, Kelly,
A sort-of-hacky workaround is to use a queueStream where you can manually
create RDDs (using sparkContext.hadoopFile) and insert them into the queue. Note
that this is for testing only, as queueStream does not work with driver
fault recovery.
TD
On Fri, Apr 3, 2015 at 12:23 PM, adamgerst
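A minimal sketch of that queueStream workaround (assuming an existing SparkContext sc; paths and batch interval are illustrative, and again this is for testing only):
import scala.collection.mutable.Queue
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Seconds, StreamingContext}
val ssc = new StreamingContext(sc, Seconds(1))
val rddQueue = new Queue[RDD[String]]()
val stream = ssc.queueStream(rddQueue)
stream.print()
ssc.start()
// Push RDDs you build yourself (e.g. from nested folders) into the queue
rddQueue += sc.textFile("hdfs:///data/2015/04/03/*/part-*")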
Hi Ankur,
There isn't a way to do that yet, but it's simple to add.
Can you create a JIRA in Spark for this?
Thanks!
Tim
On Fri, Apr 3, 2015 at 1:08 PM, Ankur Chauhan achau...@brightcove.com
wrote:
Hi,
I am trying to figure out if there is
Hello Zhou,
You can look at the recommendation template
http://templates.prediction.io/PredictionIO/template-scala-parallel-recommendation
of PredictionIO. PredictionIO is built on top of Spark, and this
template illustrates how you can save the ALS model to HDFS and then reload
it later.
Hi all,
As we all know, Spark has set the record for sorting data, as published on:
https://databricks.com/blog/2014/10/10/spark-petabyte-sort.html.
Here at our group, we would love to verify these results and compare
machines using this benchmark. We've spent quite some time trying to find the
If you can examine your data matrix and know that about 1/6 or so of the
values are non-zero (so 5/6 are zeros), then it's probably worth using
sparse vectors. (1/6 is a rough estimate.)
There is support for L1 and L2 regularization. You can look at the guide
here:
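For illustration, the two vector representations in MLlib look like this for a toy 6-element vector with two non-zero entries:
import org.apache.spark.mllib.linalg.Vectors
val dense  = Vectors.dense(1.0, 0.0, 0.0, 0.0, 0.0, 3.0)     // stores all 6 values
val sparse = Vectors.sparse(6, Array(0, 5), Array(1.0, 3.0)) // size, non-zero indices, values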
Hi,
Are there any tutorials that explain all the changes between Spark
0.8.0 and Spark 1.3.0, and how can we approach this issue?
Thanks Mohammed,
I was aware of Calliope, but haven't used it since the
spark-cassandra-connector project got released. I was not aware of
CalliopeServer2; cool, thanks for sharing that one.
I would appreciate it if you could let me know how you decide to proceed with this;
I can see this
Sweet - I'll have to play with this then! :)
On Fri, Apr 3, 2015 at 19:43 Reynold Xin r...@databricks.com wrote:
There is already an explode function on DataFrame btw
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/DataFrame.scala#L712
I think
Hi Xiangrui,
I have created JIRA https://issues.apache.org/jira/browse/SPARK-6706, and
attached the sample code. But I could not attach the test data. I will
update the bug once I find a place to host the test data.
Thanks,
David
On Tue, Mar 31, 2015 at 8:18 AM Xiangrui Meng men...@gmail.com
There is already an explode function on DataFrame btw
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/DataFrame.scala#L712
I think something like this would work. You might need to play with the
type.
df.explode(arrayBufferColumn) { x => x }
On Fri,
Thanks Dean - fun hack :)
On Fri, Apr 3, 2015 at 6:11 AM Dean Wampler deanwamp...@gmail.com wrote:
A hack workaround is to use flatMap:
rdd.flatMap{ case (date, array) => for (x <- array) yield (date, x) }
For those of you who don't know Scala, the for comprehension iterates
through the
This might be overkill for your needs, but the scodec parser combinator
library might be useful for creating a parser.
https://github.com/scodec/scodec
Dean Wampler, Ph.D.
Author: Programming Scala, 2nd Edition
http://shop.oreilly.com/product/0636920033073.do (O'Reilly)
Typesafe
Hi experts,
I am trying to write unit tests for my spark application which fails with
javax.servlet.FilterRegistration error.
I am using CDH5.3.2 Spark and below is my dependencies list.
val spark = 1.2.0-cdh5.3.2
val esriGeometryAPI = 1.2
val csvWriter = 1.0.0
You’ll definitely want to use a Kryo-based serializer for Avro. We have a Kryo
based serializer that wraps the Avro efficient serializer here.
Frank Austin Nothaft
fnoth...@berkeley.edu
fnoth...@eecs.berkeley.edu
202-340-0466
On Apr 3, 2015, at 5:41 AM, Akhil Das ak...@sigmoidanalytics.com
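As a sketch only (the registrator, record and serializer names below are placeholders, not the actual library referenced above), wiring a custom Kryo serializer for Avro classes generally looks something like:
import com.esotericsoftware.kryo.Kryo
import org.apache.spark.SparkConf
import org.apache.spark.serializer.KryoRegistrator
class MyAvroRegistrator extends KryoRegistrator {
  override def registerClasses(kryo: Kryo): Unit = {
    // MyAvroRecord / MyAvroSerializer are hypothetical stand-ins
    kryo.register(classOf[MyAvroRecord], new MyAvroSerializer())
  }
}
val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryo.registrator", "com.example.MyAvroRegistrator")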
Thanks everyone for the inputs.
I guess I will try out a custom implementation of InputFormat. But I have
no idea where to start. Are there any code examples of this that might help?
On Fri, Apr 3, 2015 at 9:15 AM, Dean Wampler deanwamp...@gmail.com wrote:
This might be overkill for your
I noticed spark has some nice memory tracking estimators in it, but they are
private. We have some custom implementations of RDD and PairRDD to suit our
internal needs and it’d be fantastic if we’d be able to just leverage the
memory estimates that already exist in Spark.
Is there any chance
Hadoop TextInputFormat is a good start.
It is not really that hard. You just need to implement the logic to identify
the record delimiter, and think of a logical way to represent the Key and Value for
your RecordReader.
Yong
From: kvi...@vt.edu
Date: Fri, 3 Apr 2015 11:41:13 -0400
Subject: Re: Reading
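If the records are simply separated by a custom delimiter string, one sketch that avoids writing a full InputFormat (assuming Hadoop 2.x, where TextInputFormat honors textinputformat.record.delimiter, and an existing SparkContext sc; the path and delimiter are illustrative) is:
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat
val hadoopConf = new Configuration(sc.hadoopConfiguration)
hadoopConf.set("textinputformat.record.delimiter", "\n\n") // one record per blank-line-separated block
val records = sc.newAPIHadoopFile("hdfs:///data/input", classOf[TextInputFormat],
  classOf[LongWritable], classOf[Text], hadoopConf).map(_._2.toString)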