Try adding all the jars in your $HIVE/lib directory. If you want the
specific jar, you could look for the jackson or json serde jars in it.
Thanks
Best Regards
On Thu, Apr 2, 2015 at 12:49 AM, Todd Nist tsind...@gmail.com wrote:
I have a feeling I’m missing a Jar that provides the support or could this
I think before 1.3 you also get stackoverflow problem in ~35
iterations. In 1.3.x, please use setCheckpointInterval to solve this
problem, which is available in the current master and 1.3.1 (to be
released soon). Btw, do you find 80 iterations are needed for
convergence? -Xiangrui
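For anyone searching the archives, a rough sketch of what this looks like (paths and
parameter values are placeholders, not from this thread; assumes a SparkContext sc):

import org.apache.spark.mllib.recommendation.{ALS, Rating}

// A checkpoint directory must be set for checkpointing to take effect.
sc.setCheckpointDir("hdfs:///tmp/spark-checkpoints")  // placeholder path

val ratings = sc.parallelize(Seq(Rating(1, 1, 5.0), Rating(1, 2, 3.0), Rating(2, 1, 4.0)))

val model = new ALS()
  .setRank(10)
  .setIterations(80)
  .setCheckpointInterval(10)  // truncate the lineage every 10 iterations (1.3.1+ per the note above)
  .run(ratings)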
On Wed, Apr 1,
Yes, any Hadoop-related process that asks for Snappy compression or
needs to read it will have to have the Snappy libs available on the
library path. That's usually set up for you in a distro or you can do
it manually like this. This is not Spark-specific.
The second question also isn't
Hi, all
I am new to here. Could you give me some suggestion to learn Spark ? Thanks.
Best Regards,
Star Guo
Hey,
I didn't find any documentation regarding support for cycles in a Spark
topology, although Storm supports this using manual configuration in the
acker function logic (setting it to a particular count). By cycles I don't
mean infinite loops.
--
Thanks Regards,
Anshu Shukla
Thanks Xiangrui,
I used 80 iterations to demonstrate the marginal diminishing return in
prediction quality :)
Justin
On Apr 2, 2015 00:16, Xiangrui Meng men...@gmail.com wrote:
I think before 1.3 you also get stackoverflow problem in ~35
iterations. In 1.3.x, please use
Hi All,
I want to find near-duplicate items in a given dataset.
For example, consider a data set:
1. Cricket,bat,ball,stumps
2. Cricket,bowler,ball,stumps,
3. Football,goalie,midfielder,goal
4. Football,refree,midfielder,goal,
Here 1 and 2 are near duplicates (only field 2 is
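Not from this thread, but as a rough sketch of one possible approach (pairwise Jaccard
similarity over the token sets, with an illustrative threshold; for large data you would
use something like MinHash/LSH rather than a cartesian join):

// Assuming a SparkContext sc; each record is an id plus its set of tokens.
val records = sc.parallelize(Seq(
  (1, Set("Cricket", "bat", "ball", "stumps")),
  (2, Set("Cricket", "bowler", "ball", "stumps")),
  (3, Set("Football", "goalie", "midfielder", "goal")),
  (4, Set("Football", "refree", "midfielder", "goal"))))

// Compare every pair once and keep those whose Jaccard similarity is high enough.
val nearDuplicates = records.cartesian(records)
  .filter { case ((id1, _), (id2, _)) => id1 < id2 }
  .filter { case ((_, a), (_, b)) =>
    a.intersect(b).size.toDouble / a.union(b).size >= 0.6  // illustrative threshold
  }
  .map { case ((id1, _), (id2, _)) => (id1, id2) }

nearDuplicates.collect()  // Array((1,2), (3,4))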
Hi,
Maybe this might be helpful:
https://github.com/GenTang/spark_hbase/blob/master/src/main/scala/examples/pythonConverters.scala
Cheers
Gen
On Thu, Apr 2, 2015 at 1:50 AM, Eric Kimbrel eric.kimb...@soteradefense.com
wrote:
I am attempting to read an hbase table in pyspark with a range
This inside-out parallelization has been a way people have used R
with MapReduce for a long time. Run N copies of an R script on the
cluster, on different subsets of the data, babysat by Mappers. You
just need R installed on the cluster. Hadoop Streaming makes this easy
and things like RDD.pipe in
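For illustration, a minimal sketch of the RDD.pipe approach (the paths and script name
are placeholders; R/Rscript just has to be installed on every worker):

// Each partition's records are written to the external process's stdin, one per line,
// and the lines it prints to stdout come back as a new RDD[String].
val input = sc.textFile("hdfs:///data/records.csv")       // placeholder path
val scored = input.pipe("Rscript /opt/scripts/score.R")   // hypothetical R script
scored.saveAsTextFile("hdfs:///data/scored")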
spark 1.3.0
spark@pc-zjqdyyn1:~ tail /etc/profile
export JAVA_HOME=/usr/jdk64/jdk1.7.0_45
export PATH=$PATH:$JAVA_HOME/bin
#
# End of /etc/profile
#
But the error log shows:
Container: container_1427449644855_0092_02_01 on pc-zjqdyy04_45454
Hi,
Verbose output showed no additional information about the origin of the error
rsync from right
sending incremental file list
sent 20 bytes received 12 bytes 64.00 bytes/sec
total size is 0 speedup is 0.00
starting org.apache.spark.deploy.master.Master, logging to
Fair enough but I'd say you hit that diminishing return after 20 iterations
or so... :)
On Thu, Apr 2, 2015 at 9:39 AM, Justin Yip yipjus...@gmail.com wrote:
Thanks Xiangrui,
I used 80 iterations to demonstrate the marginal diminishing return in
prediction quality :)
Justin
On Apr 2,
I’m unable to access ganglia; it looks like it's due to the web server not starting, as
I receive this error when I launch Spark:
Starting httpd: http: Syntax error on line 154 of /etc/httpd/conf/httpd.conf:
Cannot load /etc/httpd/modules/mod_authz_core.so
This occurs when I’m using the vanilla script.
Hello,
I have been using MLlib's ALS in 1.2 and it works quite well. I have just
upgraded to 1.3 and I encountered a stack overflow problem.
After some digging, I realized that when the iteration reaches ~35, I get the
overflow problem. However, I can get at least 80 iterations with ALS in 1.2.
Is there
Hi,
I am trying to use Spark Jobserver
(https://github.com/spark-jobserver/spark-jobserver) for running Spark
SQL jobs.
I was able to start the server but when I run my application(my Scala class
which extends SparkSqlJob), I am getting the
I reproduced the bug on master and submitted a patch for it:
https://github.com/apache/spark/pull/5329. It may get into Spark
1.3.1. Thanks for reporting the bug! -Xiangrui
On Wed, Apr 1, 2015 at 12:57 AM, Jaonary Rabarisoa jaon...@gmail.com wrote:
Hmm, I got the same error with the master. Here
Hi,
Jira created: https://issues.apache.org/jira/browse/SPARK-6675
Thank you.
On Wed, Apr 1, 2015 at 7:50 PM, Michael Armbrust mich...@databricks.com
wrote:
Can you open a JIRA please?
On Wed, Apr 1, 2015 at 9:38 AM, Hao Ren inv...@gmail.com wrote:
Hi,
I find HiveContext.setConf does
Peter's suggestion sounds good, but watch out for the match case since I
believe you'll have to match on:
case (Row(feature1, feature2, ...), Row(label)) =>
On Thu, Apr 2, 2015 at 7:57 AM, Peter Rudenko petro.rude...@gmail.com
wrote:
Hi try next code:
val labeledPoints: RDD[LabeledPoint] =
Do you have a full stack trace?
On Thu, Apr 2, 2015 at 11:45 AM, ogoh oke...@gmail.com wrote:
Hello,
My ETL uses sparksql to generate parquet files which are served through
Thriftserver using hive ql.
It defines the schema programmatically, since the schema can only be
known at
+1.
Caching is way too slow.
On Wed, Apr 1, 2015 at 12:33 PM, SamyaMaiti samya.maiti2...@gmail.com wrote:
Hi Experts,
I have a parquet dataset of 550 MB (9 blocks) in HDFS. I want to run SQL
queries repetitively.
A few questions:
1. When I do the below (persist to memory after reading
This isn't currently a capability that Spark has, though it has definitely
been discussed: https://issues.apache.org/jira/browse/SPARK-1061. The
primary obstacle at this point is that Hadoop's FileInputFormat doesn't
guarantee that each file corresponds to a single split, so the records
Hm, that will indeed be trickier because this method assumes records are the
same byte size. Is the file an arbitrary sequence of mixed types, or is there
structure, e.g. short, long, short, long, etc.?
If you could post a gist with an example of the kind of file and how it should
look once
Thanks all. Finally I am able to run my code successfully. It is running in
Spark 1.2.1. I will try it on Spark 1.3 too.
The major cause of all errors I faced was that the delimiter was not correctly
declared.
val TABLE_A =
Hi,
I am new to Spark MLlib and I was browsing the internet for good
tutorials more advanced than the Spark documentation examples, but I could not find any.
Need help.
Regards
Phani Kumar
Here's one:
https://databricks-training.s3.amazonaws.com/movie-recommendation-with-mllib.html
Reza
On Thu, Apr 2, 2015 at 12:51 PM, Phani Yadavilli -X (pyadavil)
pyada...@cisco.com wrote:
Hi,
I am new to the spark MLLib and I was browsing through the internet for
good tutorials advanced
I was able to hack around this on my similar setup issue by running (on the driver):
$ sudo hostname ip
Where ip is the same value set in the spark.driver.host property. This
isn't a solution I would use universally, and I hope someone can fix this
bug in the distribution.
Regards,
Mike
Check out the Spark docs for that parameter: maxIterations
http://spark.apache.org/docs/latest/mllib-clustering.html#k-means
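For example, a small sketch (the data path and parameter values below are placeholders):

import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

// Parse whitespace-separated numeric features into vectors.
val data = sc.textFile("hdfs:///data/points.txt")
  .map(line => Vectors.dense(line.split(' ').map(_.toDouble)))

// k = 5 clusters with at most 20 iterations; it may stop earlier if it converges.
val model = KMeans.train(data, 5, 20)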
On Thu, Apr 2, 2015 at 4:42 AM, podioss grega...@hotmail.com wrote:
Hello,
I am running the KMeans algorithm in cluster mode from MLlib and I was
wondering if I could
To Akhil's point, see the Tuning guide's Data Structures section; avoid the standard collection HashMap.
With fewer machines, try running 4 or 5 cores per executor and only
3-4 executors (1 per node):
http://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-spark-jobs-part-2/.
That ought to improve shuffle performance.
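On YARN that translates roughly into flags like these (values are placeholders; on a
standalone cluster you would size spark.executor.memory and spark.cores.max instead):

spark-submit \
  --num-executors 3 \
  --executor-cores 5 \
  --executor-memory 20g \
  --class com.example.MyJob myjob.jar   # class and jar are placeholders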
yup a related JIRA is here
https://issues.apache.org/jira/browse/SPARK-5113 which you might want to
leave a comment in. This can be quite tricky, we found! But there are a
host of environment variable hacks you can use when launching Spark masters/slaves.
On Thu, Apr 2, 2015 at 5:18 PM, Michael
It is hard to say what the reason could be without more detailed information. If you
provide some more information, maybe people here can help you better.
1) What is your worker's memory setting? It looks like your nodes have
128G of physical memory each, but what do you specify for the worker's
Michael, thanks for the response; looking forward to trying 1.3.1.
From: Michael Armbrust [mailto:mich...@databricks.com]
Sent: Friday, April 03, 2015 6:52 AM
To: Haopu Wang
Cc: user
Subject: Re: [SparkSQL 1.3.0] Cannot resolve column name SUM('p.q)
among (k,
Hi, there
you may need to add :
import sqlContext.implicits._
Best,
Sun
fightf...@163.com
From: java8964
Date: 2015-04-03 10:15
To: user@spark.apache.org
Subject: Cannot run the example in the Spark 1.3.0 following the document
I wanted to check out Spark SQL 1.3.0. I installed it
This looks to me like you have incompatible versions of scala on your
classpath?
On Thu, Apr 2, 2015 at 4:28 PM, Okehee Goh oke...@gmail.com wrote:
yes, below is the stacktrace.
Thanks,
Okehee
java.lang.NoSuchMethodError:
I wanted to check out Spark SQL 1.3.0. I installed it and followed the
online document here:
http://spark.apache.org/docs/latest/sql-programming-guide.html
In the example, it shows something like this:
// Select everybody, but increment the age by 1
df.select("name", df("age") + 1).show()
//
Hint:
DF.rdd.map{}
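Spelling the hint out a little (a sketch only; the column positions and element type
are assumptions about the schema, not Denny's actual data):

// Assuming column 0 is the month string and column 1 is the array of values.
val exploded = df.rdd.flatMap { row =>
  val month = row.getString(0)
  row.getAs[Seq[String]](1).map(value => (month, value))
}
// (2015-04,A), (2015-04,B), (2015-04,C), (2015-04,D)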
Mohammed
From: Denny Lee [mailto:denny.g@gmail.com]
Sent: Thursday, April 2, 2015 7:10 PM
To: user@spark.apache.org
Subject: ArrayBuffer within a DataFrame
Quick question - the output of a dataframe is in the format of:
[2015-04, ArrayBuffer(A, B, C, D)]
and I'd
Yes, with spark.cleaner.ttl set there is no cleanup. We pass --properties-file
spark-dev.conf to spark-submit where spark-dev.conf contains:
spark.master spark://10.250.241.66:7077
spark.logConf true
spark.cleaner.ttl 1800
spark.executor.memory 10709m
spark.cores.max 4
Michael,
You are right. The build brought in org.scala-lang:scala-library:2.10.1
from another package (as below).
It works fine after excluding the old scala version.
Thanks a lot,
Okehee
== dependency:
|+--- org.apache.kafka:kafka_2.10:0.8.1.1
||+---
Hi, all: Just now I checked out spark-1.2 from GitHub and wanted to build it with
Maven; however, I encountered an error during compilation:
[INFO]
[ERROR] Failed to execute goal
net.alchim31.maven:scala-maven-plugin:3.2.0:compile
Hi folks, having some seemingly noob issues with the dataframe API.
I have a DF which came from the csv package.
1. What would be an easy way to cast a column to a given type -- my DF
columns are all typed as strings coming from a csv. I see a schema getter
but not setter on DF
2. I am trying
The import command was already run.
Forgot to mention: the rest of the examples related to df all work; just this
one caused a problem.
Thanks
Yong
Date: Fri, 3 Apr 2015 10:36:45 +0800
From: fightf...@163.com
To: java8...@hotmail.com; user@spark.apache.org
Subject: Re: Cannot run the example in the
yes, below is the stacktrace.
Thanks,
Okehee
java.lang.NoSuchMethodError:
scala.reflect.NameTransformer$.LOCAL_SUFFIX_STRING()Ljava/lang/String;
at scala.reflect.internal.StdNames$CommonNames.init(StdNames.scala:97)
at
I'll add we just back ported this so it'll be included in 1.2.2 also.
On Wed, Apr 1, 2015 at 4:14 PM, Michael Armbrust mich...@databricks.com
wrote:
This is fixed in Spark 1.3.
https://issues.apache.org/jira/browse/SPARK-5195
On Wed, Apr 1, 2015 at 4:05 PM, Judy Nash
FYI I wrote a small test to try to reproduce this, and filed
SPARK-6688 to track the fix.
On Tue, Mar 31, 2015 at 1:15 PM, Marcelo Vanzin van...@cloudera.com wrote:
Hmmm... could you try to set the log dir to
file:/home/hduser/spark/spark-events?
I checked the code and it might be the case
I think implementing your own InputFormat and using SparkContext.hadoopFile()
is the best option for your case.
Yong
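As a very rough sketch of how that would plug together (MixedRecordInputFormat is
hypothetical; you would implement it to split the binary file on your actual
short/long record boundaries):

import org.apache.hadoop.io.{BytesWritable, LongWritable}

// hadoopFile wires a custom (old-API) InputFormat into an RDD of key/value pairs.
val records = sc.hadoopFile[LongWritable, BytesWritable, MixedRecordInputFormat](
  "hdfs:///data/records.bin")               // placeholder path
val payloads = records.map { case (_, bytes) => bytes.copyBytes() }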
From: kvi...@vt.edu
Date: Thu, 2 Apr 2015 17:31:30 -0400
Subject: Re: Reading a large file (binary) into RDD
To: freeman.jer...@gmail.com
CC: user@spark.apache.org
The file has a
Hi Young,
Sorry for the duplicate post, want to reply to all.
I just downloaded the prebuilt bits from the Apache Spark download site.
Started the spark shell and got the same error.
I then started the shell as follows:
./bin/spark-shell --master spark://radtech.io:7077 --total-executor-cores 2
For cast, you can use the selectExpr method. For example,
df.selectExpr("cast(col1 as int) as col1", "cast(col2 as bigint) as col2").
Or, df.select(df("colA").cast("int"), ...)
On Thu, Apr 2, 2015 at 8:33 PM, Michael Armbrust mich...@databricks.com
wrote:
val df = Seq(("test", 1)).toDF("col1", "col2")
You can
Are you saying that even with the spark.cleaner.ttl set your files are not
getting cleaned up?
TD
On Thu, Apr 2, 2015 at 8:23 AM, andrem amesa...@gmail.com wrote:
Apparently Spark Streaming 1.3.0 is not cleaning up its internal files and
the worker nodes eventually run out of inodes.
We see
Hmm, I just tested my own Spark 1.3.0 build. I have the same problem, but I
cannot reproduce it on Spark 1.2.1
If we check the code change below:
Spark 1.3 branch
https://github.com/apache/spark/blob/branch-1.3/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala
vs
Spark
Quick question - the output of a dataframe is in the format of:
[2015-04, ArrayBuffer(A, B, C, D)]
and I'd like to return it as:
2015-04, A
2015-04, B
2015-04, C
2015-04, D
What's the best way to do this?
Thanks in advance!
The best way of learning Spark is to use Spark.
You may follow the instructions on the Apache Spark website:
http://spark.apache.org/docs/latest/
Download and deploy it in standalone mode, run some examples, try cluster deploy
mode, then try to develop your own app and deploy it in your Spark cluster.
http://docs.oracle.com/cd/B10500_01/java.920/a96654/connpoca.htm
The question doesn't seem to be Spark specific, btw
On Apr 2, 2015, at 4:45 AM, Sateesh Kavuri sateesh.kav...@gmail.com wrote:
Hi,
We have a case that we will have to run concurrent jobs (for the same
algorithm) on
It failed to find the class org.apache.spark.sql.catalyst.ScalaReflection
in the Spark SQL library. Make sure it's on the classpath and the version
is correct, too.
Dean Wampler, Ph.D.
Author: Programming Scala, 2nd Edition
http://shop.oreilly.com/product/0636920033073.do (O'Reilly)
yes!
thank you very much:-)
On Apr 2, 2015, at 7:13 PM, Sean Owen so...@cloudera.com wrote:
Right, I asked because in your original message, you were looking at
the initialization to a random vector. But that is the initial state,
not final state.
On Thu, Apr 2, 2015 at 11:51 AM, lisendong
Hi Everyone,
I am getting the following error while registering a table using the Scala IDE.
Please let me know how to resolve this error. I am using Spark 1.2.1.
import sqlContext.createSchemaRDD
val empFile = sc.textFile("/tmp/emp.csv", 4)
  .map(_.split(","))
Right, I am aware on how to use connection pooling with oracle, but the
specific question is how to use it in the context of spark job execution
On 2 Apr 2015 17:41, Ted Yu yuzhih...@gmail.com wrote:
http://docs.oracle.com/cd/B10500_01/java.920/a96654/connpoca.htm
The question doesn't seem to
Hi, I want to rename an aggregation field using DataFrame API. The
aggregation is done on a nested field. But I got below exception.
Do you see the same issue and any workaround? Thank you very much!
==
Exception in thread main org.apache.spark.sql.AnalysisException:
Cannot resolve
No, my company is using the Cloudera distribution, and 1.2.0 is the latest
version of Spark available there.
Thanks
On Wed, Apr 1, 2015 at 8:08 PM, Michael Armbrust mich...@databricks.com
wrote:
Can you try with Spark 1.3? Much of this code path has been rewritten /
improved in this version.
On Wed, Apr
Hello!
I have a CSV file that has the following content:
C1;C2;C3
11;22;33
12;23;34
13;24;35
What is the best approach to use Spark (API, MLLib) for achieving the
transpose of it?
C1 11 12 13
C2 22 23 24
C3 33 34 35
I look forward to your solutions and suggestions (some Scala code will be
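Not from the thread, but one possible sketch (assuming the matrix is small enough that
a shuffle-and-sort per column is acceptable, and a SparkContext sc):

// Index every cell by (columnIndex, (rowIndex, value)), then group by column.
val lines = sc.textFile("data.csv")        // placeholder path
val header = lines.first()
val headerCols = header.split(";")

val cells = lines.filter(_ != header).zipWithIndex().flatMap { case (line, rowIdx) =>
  line.split(";").zipWithIndex.map { case (value, colIdx) => (colIdx, (rowIdx, value)) }
}

val transposed = cells.groupByKey().sortByKey().map { case (colIdx, rowValues) =>
  headerCols(colIdx) + " " + rowValues.toSeq.sortBy(_._1).map(_._2).mkString(" ")
}
transposed.collect().foreach(println)
// C1 11 12 13
// C2 22 23 24
// C3 33 34 35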
NO, I’m referring to the result.
You mean there might be many zero features in the ALS result?
I think it is not related to the initial state, but I do not know why the
percentage of zero vectors is so high (around 50%).
On Apr 2, 2015, at 6:08 PM, Sean Owen so...@cloudera.com wrote:
You're referring
Right, I asked because in your original message, you were looking at
the initialization to a random vector. But that is the initial state,
not final state.
On Thu, Apr 2, 2015 at 11:51 AM, lisendong lisend...@163.com wrote:
NO, I’m referring to the result.
you means there might be so many zero
Hello,
I am running the KMeans algorithm in cluster mode from MLlib and I was
wondering if I could run the algorithm with a fixed number of iterations in
some way.
Thanks
You can also refer this blog http://blog.prabeeshk.com/blog/archives/
On 2 April 2015 at 12:19, Star Guo st...@ceph.me wrote:
Hi, all
I am new to here. Could you give me some suggestion to learn Spark ?
Thanks.
Best Regards,
Star Guo
Hi,
We have a case that we will have to run concurrent jobs (for the same
algorithm) on different data sets. And these jobs can run in parallel and
each one of them would be fetching the data from the database.
We would like to optimize the database connections by making use of
connection
Hey Christopher,
I'm working with Teng on this issue. Thank you for the explanation. I tried
both workarounds:
just leaving hive.metastore.warehouse.dir empty is not doing anything.
Still the tmp data is written to S3 and the job attempts to
rename/copy+delete from S3 to S3. But anyway, since
Oh, I found the reason.
According to the ALS optimization formula: if all of a user's ratings are zero,
that is, R(i, Ii) is a zero matrix, then the final feature vector of this user
will be the all-zero vector…
On Apr 2, 2015, at 6:08 PM, Sean Owen so...@cloudera.com wrote:
You're referring to the
You can start with http://spark.apache.org/docs/1.3.0/index.html
Also get the Learning Spark book http://amzn.to/1NDFI5x. It's great.
Enjoy!
Vadim
On Thu, Apr 2, 2015 at 4:19 AM, Star Guo st...@ceph.me wrote:
Hi, all
I am new to here. Could you give me some suggestion to learn Spark ?
I have a self-study workshop here:
https://github.com/deanwampler/spark-workshop
dean
Dean Wampler, Ph.D.
Author: Programming Scala, 2nd Edition
http://shop.oreilly.com/product/0636920033073.do (O'Reilly)
Typesafe http://typesafe.com
@deanwampler http://twitter.com/deanwampler
How long does each executor keep the connection open for? How many
connections does each executor open?
Are you certain that connection pooling is a performant and suitable
solution? Are you running out of resources on the database server and
cannot tolerate each executor having a single
Thanks Michael - that was it! I was drawing a blank on this one for some
reason - much appreciated!
On Thu, Apr 2, 2015 at 8:27 PM Michael Armbrust mich...@databricks.com
wrote:
A lateral view explode using HiveQL. I'm hoping to add explode shorthand
directly to the df API in 1.4.
On
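For reference, a hedged sketch of that HiveQL (table and column names here are
invented, not the actual schema from the thread; assumes a HiveContext named hiveContext):

// Register the DataFrame and explode the array column into one row per element.
df.registerTempTable("events")
val flattened = hiveContext.sql(
  "SELECT month, item FROM events LATERAL VIEW explode(items) t AS item")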
Looks like a typo, try:
df.select(df("name"), df("age") + 1)
Or
df.select("name", "age")
PRs to fix docs are always appreciated :)
On Apr 2, 2015 7:44 PM, java8964 java8...@hotmail.com wrote:
The import command already run.
Forgot the mention, the rest of examples related to df all
But this basically means that the pool is confined to the job (of a single
app) in question, but is not sharable across multiple apps?
The setup we have is a job server (the spark-jobserver) that creates jobs.
Currently, we have each job opening and closing a connection to the
database. What we
Actually they may not be sequentially generated and also the list (RDD)
could come from a different component.
For example from this RDD :
(105,918)
(105,757)
(502,516)
(105,137)
(516,816)
(350,502)
I would like to separate into two RDD's :
1) (105,918)
(502,516)
2) (105,757)
*Ask Me Anything about Apache Spark big data*
Reddit AMA with Matei Zaharia
Friday, April 3 at 9AM PT/ 12PM ET
Details can be found here:
http://strataconf.com/big-data-conference-uk-2015/public/content/reddit-ama
Each executor runs for about 5 secs until which time the db connection can
potentially be open. Each executor will have 1 connection open.
Connection pooling surely has its advantages in performance and in not hitting
the DB server for every open/close. The database in question is not just
used by the
Thanks for the reply. Unfortunately, in my case, the binary file is a mix
of short and long integers. Is there any other way that could be of use here?
My current method happens to have a large overhead (much more than actual
computation time). Also, I am short of memory at the driver when it has to
It appears you are using a Cloudera Spark build, 1.3.0-cdh5.4.0-SNAPSHOT,
which expects to find the hadoop command:
/data/PlatformDep/cdh5/dist/bin/compute-classpath.sh: line 164: hadoop:
command not found
If you don't want to use Hadoop, download one of the pre-built Spark
releases from
Hi All
Is there a way to make a JavaRDD<Object> from an existing Java collection of
type List<Object>?
I know this can be done using Scala, but I am looking at how to do this using
Java.
Regards
Jeetendra
Thanks all. I was able to get the decompression working by adding the
following to my spark-env.sh script:
export
JAVA_LIBRARY_PATH=$JAVA_LIBRARY_PATH:/home/nickt/lib/hadoop/lib/native
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/home/nickt/lib/hadoop/lib/native
export
Hi.
I'm using Spark SQL 1.2. I have this query:
CREATE TABLE test_MA STORED AS PARQUET AS
SELECT
field1
,field2
,field3
,field4
,field5
,COUNT(1) AS field6
,MAX(field7)
,MIN(field8)
,SUM(field9 / 100)
,COUNT(field10)
,SUM(IF(field11 -500, 1, 0))
,MAX(field12)
,SUM(IF(field13 = 1, 1, 0))
Thank you! I will begin with it.
Best Regards,
Star Guo
I have a self-study workshop here:
https://github.com/deanwampler/spark-workshop
dean
Dean Wampler, Ph.D.
Author: Programming Scala, 2nd Edition
I am running a spark streaming stand-alone cluster, connected to rabbitmq
endpoint(s). The application will run for 20-30 minutes before failing with
the following error:
WARN 2015-04-01 21:00:53,944
org.apache.spark.storage.BlockManagerMaster.logWarning.71: Failed to remove
RDD 22 - Ask timed
Thank you for the response, Dean. There are 2 worker nodes, with 8 cores
total, attached to the stream. I have the following settings applied:
spark.executor.memory 21475m
spark.cores.max 16
spark.driver.memory 5235m
On Thu, Apr 2, 2015 at 11:50 AM, Dean Wampler deanwamp...@gmail.com wrote:
Sorry for the obvious typo, I have 4 workers with 16 cores total*
On Thu, Apr 2, 2015 at 11:56 AM, Bill Young bill.yo...@threatstack.com
wrote:
Thank you for the response, Dean. There are 2 worker nodes, with 8 cores
total, attached to the stream. I have the following settings applied:
Hi try next code:
val labeledPoints: RDD[LabeledPoint] = features.zip(labels).map { case
Row(feature1, feature2, ..., label) => LabeledPoint(label,
Vectors.dense(feature1, feature2, ...)) }
Thanks,
Peter Rudenko
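And a more concrete variant with fixed, hypothetical column names, in case it helps
(assumes the JSON columns were inferred as numeric):

import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.linalg.Vectors

// feature1, feature2, feature3 and label are illustrative column names.
val labeledPoints = df.select("label", "feature1", "feature2", "feature3").rdd.map { row =>
  LabeledPoint(row.getDouble(0),
    Vectors.dense(row.getDouble(1), row.getDouble(2), row.getDouble(3)))
}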
On 2015-04-02 17:17, drarse wrote:
Hello!
I have had a question for days now.
So cool !! Thanks.
Best Regards,
Star Guo
=
You can also refer this blog http://blog.prabeeshk.com/blog/archives/
On 2 April 2015 at 12:19, Star Guo st...@ceph.me wrote:
Hi, all
I am new to here. Could you give me some suggestion to
Use JavaSparkContext.parallelize.
http://spark.apache.org/docs/latest/api/java/org/apache/spark/api/java/JavaSparkContext.html#parallelize(java.util.List)
Dean Wampler, Ph.D.
Author: Programming Scala, 2nd Edition
http://shop.oreilly.com/product/0636920033073.do (O'Reilly)
Typesafe
Are you allocating 1 core per input stream plus additional cores for the
rest of the processing? Each input stream Reader requires a dedicated core.
So, if you have two input streams, you'll need local[3] at least.
Dean Wampler, Ph.D.
Author: Programming Scala, 2nd Edition
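A tiny sketch of the point about receivers above (sources and batch interval are
placeholders):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Two receiver-based input streams: each receiver pins a core, so local[3]
// leaves at least one core free for the actual processing.
val conf = new SparkConf().setMaster("local[3]").setAppName("two-streams")
val ssc = new StreamingContext(conf, Seconds(5))
val streamA = ssc.socketTextStream("hostA", 9999)   // placeholder sources
val streamB = ssc.socketTextStream("hostB", 9999)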
To clarify one thing, is count() the first action (
http://spark.apache.org/docs/latest/programming-guide.html#actions) you're
attempting? As defined in the programming guide, an action forces
evaluation of the pipeline of RDDs. It's only then that reading the data
actually occurs. So, count()
Apparently Spark Streaming 1.3.0 is not cleaning up its internal files and
the worker nodes eventually run out of inodes.
We see tons of old shuffle_*.data and *.index files that are never deleted.
How do we get Spark to remove these files?
We have a simple standalone app with one RabbitMQ
Thanks a lot. I will follow your suggestion.
Best Regards,
Star Guo
=
The best way of learning Spark is to use Spark.
You may follow the instructions on the Apache Spark website:
http://spark.apache.org/docs/latest/
Download and deploy it in standalone mode, run some
I'm reading a stream of string lines that are in JSON format.
I'm using Java with Spark.
Is there a way to get this from a transformation, so that I end up with a
stream of JSON objects?
I would also welcome any feedback about this approach or alternative
approaches.
thanks
jk
I am running a standalone Spark streaming cluster, connected to multiple
RabbitMQ endpoints. The application will run for 20-30 minutes before
raising the following error:
WARN 2015-04-01 21:00:53,944
org.apache.spark.storage.BlockManagerMaster.logWarning.71: Failed to remove
RDD 22 - Ask timed
That’s precisely what I was trying to check. It should have 42577 records in
it, because that’s how many there were in the text file I read in.
// Load a text file and convert each line to a JavaBean.
JavaRDD<String> lines = sc.textFile("file.txt");
JavaRDD<BERecord> tbBER =
Hello!
I have had a question for days now. I am working with DataFrames and with
Spark SQL. I imported a jsonFile:
val df = sqlContext.jsonFile("file.json")
In this JSON I have the label and the features. I selected them:
val features = df.select("feature1", "feature2", "feature3", ...)
val labels =
This just reduces to finding a library that can translate a String of
JSON into a POJO, Map, or other representation of the JSON. There are
loads of these, like Gson or Jackson. Sure, you can easily use these
in a function that you apply to each JSON string in each line of the
file. It's not
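Not an official recipe, just a sketch of one way with Jackson (in Scala for brevity;
the same mapPartitions pattern works from the Java API). Creating the mapper once per
partition avoids serializing it with the closure; lines is assumed to be a
DStream[String] (or RDD[String]) of JSON documents:

import com.fasterxml.jackson.databind.{JsonNode, ObjectMapper}

val parsed = lines.mapPartitions { iter =>
  val mapper = new ObjectMapper()            // one mapper per partition
  iter.map(json => mapper.readTree(json))    // yields a JsonNode per record
}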
You shouldn't need to do anything special. Are you using a named context?
I'm not sure those work with SparkSqlJob.
By the way, there is a forum on Google groups for the Spark Job Server:
https://groups.google.com/forum/#!forum/spark-jobserver
On Thu, Apr 2, 2015 at 5:10 AM, Harika
Connection pools aren't serializable, so you generally need to set them up
inside of a closure. Doing that for every item is wasteful, so you
typically want to use mapPartitions or foreachPartition:
rdd.mapPartitions { part =>
  setupPool
  part.map { ...
See Design Patterns for using foreachRDD in
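A slightly fuller sketch (ConnectionPool here is a hypothetical singleton wrapping
whatever pooling library you use, initialized lazily on each executor):

rdd.foreachPartition { partition =>
  // Runs on the worker: borrow one connection for the whole partition.
  val conn = ConnectionPool.getConnection()   // hypothetical pool object
  try {
    partition.foreach { record =>
      // write the record using conn
    }
  } finally {
    ConnectionPool.returnConnection(conn)
  }
}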
A very simple example which works well with Spark 1.2, and fails to compile
with Spark 1.3:
build.sbt:
name := "untitled"
version := "1.0"
scalaVersion := "2.10.4"
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.3.0"
Test.scala:
package org.apache.spark.metrics
import
Yes, I just searched for it!
Best Regards,
Star Guo
==
You can start with http://spark.apache.org/docs/1.3.0/index.html
Also get the Learning Spark book http://amzn.to/1NDFI5x. It's great.
Enjoy!
Vadim
On Thu, Apr 2, 2015 at 4:19 AM, Star Guo
Hi all,
I am trying to write an Amazon Kinesis consumer Scala app that processes
data in the
Kinesis stream. Is this the correct way to specify build.sbt:
---
import AssemblyKeys._
name := "Kinesis Consumer"
version := "1.0"
organization := "com.myconsumer"
scalaVersion :=