Hi,
I am using spark-submit to submit my application jar to a YARN cluster. I
want to deliver a single jar file to my users, so I would like to avoid
also telling them to put a log4j.xml file somewhere and add that path
to the spark-submit command.
I thought it would be sufficient that
Hi,
I'm loading a bunch of json files and there seems to be problems with
specific files (either schema changes or incomplete files).
I'd like to catch the inconsistent files but I'm not sure how to do it.
This is the exception I get:
14/11/20 00:13:49 INFO cluster.YarnClientClusterScheduler:
Hi,
We have a requirement where we have two data sets represented by RDDs,
RDDA and RDDB.
For performing an aggregation operation on RDDA, the action would need a
subset of RDDB's data; I wanted to understand if there is a best practice
for doing this. I don't even know how this will be possible as of
You could use combineTextFile from
https://github.com/RetailRocket/SparkMultiTool
It combines input files before mappers by means of Hadoop
CombineFileInputFormat. In our case it reduced the number of mappers from
10 to approx 3000 and made the job significantly faster.
Example:
import
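The example above is cut off; a rough equivalent that uses Hadoop's own CombineTextInputFormat directly (not the SparkMultiTool wrapper; the path and the 64 MB cap are only illustrative) could look like this:

import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat

// cap combined splits at 64 MB so many small files are packed into few splits
sc.hadoopConfiguration.set("mapreduce.input.fileinputformat.split.maxsize", (64 * 1024 * 1024).toString)
val combined = sc.newAPIHadoopFile("/path/to/small/files",   // illustrative path
    classOf[CombineTextInputFormat], classOf[LongWritable], classOf[Text])
  .map(_._2.toString)                                        // keep just the line text
combined.count()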
Hi!
The hive table is an external table, which I created like this:
CREATE EXTERNAL TABLE MyHiveTable
( id int, data string )
STORED BY 'org.apache.hadoop.hive.cassandra.cql.CqlStorageHandler'
TBLPROPERTIES ("cassandra.host" = "10.194.30.2", "cassandra.ks.name"
= "test" ,
I assume that all examples do actually fall into exactly one of the classes.
If you always have to make a prediction then you always take the most
probable class.
If you can choose to make no classification for lack of confidence, yes you
want to pick a per-class threshold and take the most
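A tiny sketch of that per-class thresholding idea (the class names, posteriors and thresholds below are made up for illustration):

val posteriors = Map("C1" -> 0.72, "C2" -> 0.28)   // model's per-class posterior for one example
val thresholds = Map("C1" -> 0.60, "C2" -> 0.50)   // per-class confidence thresholds
val (bestClass, p) = posteriors.maxBy(_._2)
val prediction = if (p >= thresholds(bestClass)) Some(bestClass) else None  // None = abstain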
Thanks a lot Sean. You are correct in assuming that my examples fall under a
single category.
It is interesting to see that the posterior probability can actually be
treated as something that is stable enough to have a constant threshold
value on a per-class basis. It would, I assume, keep changing
Update:
I tried surrounding the problematic code with try and catch but that does
not do the trick:
try
{
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext._
val jsonFiles = sqlContext.jsonFile("/requests.loading")
} catch {
case _: Throwable => // Catching all exceptions and
Can you not use *hadoop getmerge* option to merge the files afterwards?
In older versions of HDFS, files are immutable, meaning once you close a file
you won't be able to modify it. In newer versions they have support for
it. Code below writes the output to one directory (deletes the previous
I think the standard practice is to include your log config file among
the files uploaded to YARN containers, and then set
-Dlog4j.configuration=yourfile.xml in
spark.{executor,driver}.extraJavaOptions?
http://spark.apache.org/docs/latest/running-on-yarn.html
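A hedged sketch of the same settings expressed on a SparkConf (log4j.xml is an illustrative file name; it also has to be shipped to the containers, e.g. via --files or spark.yarn.dist.files):

val conf = new org.apache.spark.SparkConf()
  .set("spark.yarn.dist.files", "log4j.xml")  // ship the config file to the YARN containers
  .set("spark.driver.extraJavaOptions", "-Dlog4j.configuration=log4j.xml")
  .set("spark.executor.extraJavaOptions", "-Dlog4j.configuration=log4j.xml")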
On Thu, Nov 20, 2014 at 9:20 AM,
Yes, certainly you need to consider the problem of how and when you
update the model with new info. The principle is the same. Low or high
posteriors aren't wrong per se. It seems normal in fact that one class
is more probable than others, maybe a lot more.
On Thu, Nov 20, 2014 at 10:31 AM,
You can repartition to 1 partition to generate 1 output file, but,
that has other potentially bad implications for your processing. You
would be using 1 partition and 1 worker only for the final stages of
processing.
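A minimal sketch (the RDD and the output path are placeholders):

val results = sc.parallelize(1 to 100)                              // stands in for your final RDD
results.repartition(1).saveAsTextFile("hdfs:///tmp/single-output")  // one part file
// coalesce(1) also works and avoids a full shuffle when only reducing partitions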
On Thu, Nov 20, 2014 at 11:32 AM, jishnu.prat...@wipro.com wrote:
Hi
Hi All,
I am working on a spark streaming job. The job is supposed to read
streaming data from Kafka. But after submitting the job it's showing
org.apache.spark.SparkException: Job aborted due to stage failure: All
masters are unresponsive! and the following lines are getting printed
Sean,
My last sentence didn't come out right. Let me try to explain my question
again.
For instance, I have two categories, C1 and C2. I have trained 100 samples
for C1 and 10 samples for C2.
Now, I predict two samples one each of C1 and C2, namely S1 and S2
respectively. I get the following
Hi,
This is on version 1.1.0.
I did a simple test on the MEMORY_AND_DISK storage level.
var file =
sc.textFile("file:///path/to/file.txt").persist(StorageLevel.MEMORY_AND_DISK)
file.count()
The file is 1.5GB and there is only 1 worker. I have requested for 1GB of
worker memory per node:
Up-sampling one class would change the result, yes. The prior for C2
would increase and so would its posterior across the board.
But that is dramatically changing your input: C2 isn't as prevalent as
C1, but you are pretending it is. Their priors aren't the same.
If you want to assume a uniform
Hello Rzykov,
I tried your approach but unfortunately I'm getting an error. What I get is:
[info] Caused by: java.lang.reflect.InvocationTargetException
[info] at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native
Method)
[info] at
I believe assuming uniform priors is the way to go for my use case.
I am not sure about how to 'drop the prior term' with Mllib. I am just
providing the samples as they come after creating term vectors for each
sample. But I guess I can Google that information.
I appreciate all the help. Spark
I am using spark streaming 1.1.0 locally (not in a cluster). I created a
simple app that parses the data (about 10.000 entries), stores it in a
stream and then makes some transformations on it. Here is the code:
def main(args: Array[String]) {
val master = "local[8]"
val conf = new
Hi Folks!
I'm running a Python Spark job on a cluster with 1 master and 10 slaves
(64G RAM and 32 cores each machine).
This job reads a file with 1.2 terabytes and 1128201847 lines from HDFS and
calls the Kmeans method as follows:
# SLAVE CODE - Reading features from HDFS
def
Friends,
I am pretty new to Spark, as well as to Scala, MLlib and the entire Hadoop
stack!! It would be so much help if I could be pointed to some good books on
Spark and MLlib.
Further, does MLlib support any algorithms for B2B cross sell/upsell or
customer retention (out of the box
Take a look at the O'Reilly Learning Spark (Early Release) book. I've found
this very useful.
Darin.
From: Saurabh Agrawal saurabh.agra...@markit.com
To: user@spark.apache.org user@spark.apache.org
Sent: Thursday, November 20, 2014 9:04 AM
Subject: Please help me get started on Apache
For Spark,
You can start with a new book like :
https://www.safaribooksonline.com/library/view/learning-spark/9781449359034/ch01.html
I think the paper book is out now.
You can also have a look at the tutorials and documentation guide available at:
https://spark.apache.org/docs/1.1.0/mllib-guide.html
Hi,
I tried with try/catch blocks. In fact, inside mapPartitionsWithIndex a
method is invoked which does the operation. I put the operations inside the
function in a try...catch block but that's of no use... the error still
persists. I even commented out all the operations and a simple print statement
Dear all,
We encountered a problem with failed Spark jobs.
We have a Spark/Hadoop cluster - CDH 5.1.2 + Spark 1.1
After launching a spark job with command:
~/soft/spark-1.1.0-bin-hadoop2.3/bin/spark-submit --master yarn-cluster
--executor-memory 4G --driver-memory 4G --class
I finally solved this problem.
The org.apache.hadoop.mapreduce.JobContext is a class in hadoop < 2.0 and is
an interface in hadoop >= 2.0.
I have to use a spark build for hadoop v1.
So spark-sql seems fine.
But, the thriftserver does not work with my config!
Here is my spark-env.sh:
So you are saying to query an entire day of data I would need to create one
RDD for every hour and then union them into one RDD. After I have the one
RDD I would be able to query for a=2 throughout the entire day. Please
correct me if I am wrong.
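A rough sketch of that per-hour union approach (the paths and the a=2 filter are only illustrative):

val hourly = (0 until 24).map(h => sc.textFile(f"/events/2014-11-20/$h%02d/*"))
val day = sc.union(hourly)                    // one RDD covering the whole day
val matches = day.filter(_.contains("a=2"))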
Thanks
On Wed, Nov 19, 2014 at 5:53 PM,
Hi Guys,
I really need your help with this:
I'm loading a bunch of json files uploaded via webhdfs; some of them have
some inconsistencies (the json ends mid-line for example) and that causes my
whole application to fail.
How can I continue processing valid json records without failing the
Hi all,
I'm working on an application that has several tables (RDDs of tuples) of
data. Some of the types are complex-ish (e.g., date time objects). I'd like
to use something like case classes for each entry.
What is the best way to store the data to disk in a text format without
writing custom
https://issues.apache.org/jira/browse/SPARK-1825
I've had the following problems making Windows+Pyspark+YARN work properly:
1. net.ScriptBasedMapping: Exception running
/etc/hadoop/conf.cloudera.yarn/topology.py
FIX? Comment the net.topology.script.file.name property configuration in
the file
How do I configure the files to be uploaded to YARN containers? So far, I’ve
only seen --conf spark.yarn.jar=hdfs://… which allows me to specify the HDFS
location of the Spark JAR, but I’m not sure how to prescribe other files for
uploading (e.g., spark-env.sh)
mn
On Nov 20, 2014, at 4:08
Didn't really edit the configs much... but here's what the spark-env.sh
is:
#!/usr/bin/env bash
##
# Generated by Cloudera Manager and should not be modified directly
##
export SPARK_HOME=/opt/cloudera/parcels/CDH-5.2.0-1.cdh5.2.0.p0.36/lib/spark
export
Also attaching the parquet file if anyone wants to take a further look.
On Thu, Nov 20, 2014 at 8:54 AM, Sadhan Sood sadhan.s...@gmail.com wrote:
So, I am seeing this issue with spark sql throwing an exception when
trying to read selective columns from a thrift parquet file and also when
Cool
On 20 Nov 2014 22:01, kam lee cloudher...@gmail.com wrote:
Yes, fixed by setting --executor-cores to 2 or higher.
Thanks a lot! Really appreciate it!
cloud
On Wed, Nov 19, 2014 at 10:48 PM, Akhil Das ak...@sigmoidanalytics.com
wrote:
Make sure the executor cores are set to a value
I have a simple S3 job to read a text file and do a line count.
Specifically I'm doing sc.textFile("s3n://mybucket/myfile").count. The
file is about 1.2GB. My setup is a standalone spark cluster with 4 workers
each with 2 cores / 16GB ram. I'm using branch-1.2 code built against
hadoop 2.4 (though
Hi,
We are seeing bad performance as we incrementally load data. Here is the
config
Spark standalone cluster
spark01 (spark master, shark, hadoop namenode): 15GB RAM, 4vCPU's
spark02 (spark worker, hadoop datanode): 15GB RAM, 8vCPU's
spark03 (spark worker): 15GB RAM, 8vCPU's
spark04 (spark
Hi Guys,
I'm having an issue in standalone mode (Spark 1.1, Hadoop 2.4, Windows Server
2008).
A very simple program runs fine in local mode but fails in standalone mode.
Here is the error:
14/11/20 17:01:53 INFO DAGScheduler: Failed to run count at SimpleApp.scala:22
Exception in thread main
Well, after many attempts I can now successfully run the thrift server using
root@cdb-01:~/spark# ./sbin/start-thriftserver.sh --master
spark://10.194.30.2:7077 --hiveconf hive.server2.thrift.bind.host=0.0.0.0
--hiveconf hive.server2.thrift.port=1
(the command was failing because of the
Awesome! And Patrick just gave his LGTM ;-)
On Wed Nov 19 2014 at 5:13:17 PM Aniket Bhatnagar
aniket.bhatna...@gmail.com wrote:
Thanks for pointing this out Mark. Had totally missed the existing JIRA
items
On Wed Nov 19 2014 at 21:42:19 Mark Hamstra m...@clearstorydata.com
wrote:
This is
To follow on:
I asked the developer how we incrementally load data and the response was
no; union only for updated records (every night).
For every-minute exports the algorithm is:
1. upload file to hadoop.
2. load data inpath... overwrite into table _incremental;
3. insert into table ..._cached
Hi Yiming,
On Wed, Nov 19, 2014 at 5:35 PM, Yiming (John) Zhang sdi...@gmail.com wrote:
Thank you for your reply. I was wondering whether there is a method of
reusing locally-built components without installing them? That is, if I have
successfully built the spark project as a whole, how
How many features and how many partitions? You set kmeans_clusters to
1. If the feature dimension is large, it would be really
expensive. You can check the WebUI and see task failures there. The
stack trace you posted is from the driver. Btw, the total memory you
have is 64GB * 10, so you can
Hi Jerome,
I've been trying to get this working as well...
Where are you specifying cassandra parameters (i.e. seed nodes, consistency
levels, etc.)?
-Ashic.
Date: Thu, 20 Nov 2014 10:34:58 -0700
From: jer...@gmail.com
To: u...@spark.incubator.apache.org
Subject: Re: tableau spark sql
I've read several posts of people struggling to read avro in spark. The
examples I've tried don't work. When I try this solution (
https://stackoverflow.com/questions/23944615/how-can-i-load-avros-in-spark-using-the-schema-on-board-the-avro-files)
I get errors:
spark
One option (starting with Spark 1.2, which is currently in preview) is to
use the Avro library for Spark SQL. This is very new, but we would love to
get feedback: https://github.com/databricks/spark-avro
On Thu, Nov 20, 2014 at 10:19 AM, al b beanb...@googlemail.com wrote:
I've read several
We are loading parquet data as temp tables but wondering if there is a way
to add a partition to the data without going through hive (we still want to
use spark's parquet serde as compared to hive). The data looks like -
/date1/file1, /date1/file2 ... , /date2/file1,
/date2/file2,/daten/filem
Which version are you running on again?
On Thu, Nov 20, 2014 at 8:17 AM, Sadhan Sood sadhan.s...@gmail.com wrote:
Also attaching the parquet file if anyone wants to take a further look.
On Thu, Nov 20, 2014 at 8:54 AM, Sadhan Sood sadhan.s...@gmail.com
wrote:
So, I am seeing this issue
I am running on master, pulled yesterday I believe but saw the same issue
with 1.2.0
On Thu, Nov 20, 2014 at 1:37 PM, Michael Armbrust mich...@databricks.com
wrote:
Which version are you running on again?
On Thu, Nov 20, 2014 at 8:17 AM, Sadhan Sood sadhan.s...@gmail.com
wrote:
Also
If you run master or the 1.2 preview release then it should automatically
skip lines that fail to parse. The corrupted text will be in the column
_corrupted_record and the other columns will be null.
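Continuing the earlier snippet (sqlContext as defined there), a hedged example of pulling out the bad lines; depending on the build the column may be spelled _corrupt_record:

val requests = sqlContext.jsonFile("/requests.loading")   // path from the earlier snippet
requests.registerTempTable("requests")
val bad = sqlContext.sql(
  "SELECT _corrupted_record FROM requests WHERE _corrupted_record IS NOT NULL")
bad.take(10).foreach(println)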
On Thu, Nov 20, 2014 at 7:34 AM, Daniel Haviv danielru...@gmail.com wrote:
Hi Guys,
I really
I believe functions like sc.textFile will also accept paths with globs, for
example /data/*/, which would read all the directories into a single RDD.
Under the covers I think it is just using Hadoop's FileInputFormat, in case
you want to google for the full list of supported syntax.
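For example, a small sketch (the paths are illustrative):

val all = sc.textFile("/data/2014/*/*.log")   // every matching file one level below /data/2014
all.count()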
On Thu, Nov 20,
Thanks I'll try it out!
On Thu, Nov 20, 2014 at 8:39 PM, Michael Armbrust mich...@databricks.com
wrote:
If you run master or the 1.2 preview release then it should automatically
skip lines that fail to parse. The corrupted text will be in the column
_corrupted_record and the other columns
Hi Jerome,
This is cool. It would be great if you could share more details about how you got
your setup to work finally. For example, what additional libraries/jars you are
using. How are you configuring the ThriftServer to use the additional jars to
communicate with Cassandra?
In addition, how
Check the --files argument in the output of spark-submit -h.
On Thu, Nov 20, 2014 at 7:51 AM, Matt Narrell matt.narr...@gmail.com wrote:
How do I configure the files to be uploaded to YARN containers? So far, I’ve
only seen --conf spark.yarn.jar=hdfs://… which allows me to specify the
HDFS
Guys,
After talking with Ankur, it turned out that sharing the talk we gave at
ScalaIO (France) would be worthwhile.
So there you go, and don't hesitate to share your thoughts ;-)
http://www.slideshare.net/noootsab/machine-learning-and-graphx
Greetz,
andy
Hi Tobias,
With the current Yarn code, packaging the configuration in your app's
jar and adding the -Dlog4j.configuration=log4jConf.xml argument to
the extraJavaOptions configs should work.
That's not the recommended way to get it to work, though, since this
behavior may change in the future.
Ah awesome, thanks!!
On Thu, Nov 20, 2014 at 3:01 PM, Michael Armbrust mich...@databricks.com
wrote:
In 1.2 by default we use Spark parquet support instead of Hive when the
SerDe contains the word Parquet. This should work with hive partitioning.
On Thu, Nov 20, 2014 at 10:33 AM, Sadhan
With SchemaRDD you can save out case classes to parquet (or JSON in Spark
1.2) automatically, and when you read it back in the structure will be
preserved. However, you won't get case classes when it's loaded back;
instead you'll get rows that you can query.
There is some experimental support for
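A short sketch of the round trip described above, against the Spark 1.1/1.2 SQL APIs (the class name and path are illustrative):

case class Record(id: Int, data: String)
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.createSchemaRDD                           // implicit RDD[case class] -> SchemaRDD
val records = sc.parallelize(Seq(Record(1, "a"), Record(2, "b")))
records.saveAsParquetFile("/tmp/records.parquet")           // structure is preserved on disk
val loaded = sqlContext.parquetFile("/tmp/records.parquet") // comes back as rows, not case classes
loaded.registerTempTable("records")
sqlContext.sql("SELECT id, data FROM records").collect()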
Say I have two RDDs with the following values
x = [(1, 3), (2, 4)]
and
y = [(3, 5), (4, 7)]
and I want to have
z = [(1, 3), (2, 4), (3, 5), (4, 7)]
How can I achieve this? I know you can use outerJoin followed by map to
achieve this, but is there a more direct way?
hey,
Can anyone tell me how to debug a sql execution? Perhaps so it can show
what the query is doing and how long it takes at each point?
You want to use RDD.union (or SparkContext.union for many RDDs). These
don't join on a key. Union doesn't really do anything itself, so it is low
overhead. Note that the combined RDD will have all the partitions of the
original RDDs, so you may want to coalesce after the union.
val x =
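The example above is cut off; a small sketch of the union-then-coalesce pattern using the values from the earlier question:

val x = sc.parallelize(Seq((1, 3), (2, 4)))
val y = sc.parallelize(Seq((3, 5), (4, 7)))
val z = x.union(y)            // [(1,3), (2,4), (3,5), (4,7)], keeps all partitions of x and y
val compact = z.coalesce(2)   // optional: limit the combined partition count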
Thanks Michael, opened this https://issues.apache.org/jira/browse/SPARK-4520
On Thu, Nov 20, 2014 at 2:59 PM, Michael Armbrust mich...@databricks.com
wrote:
Can you open a JIRA?
On Thu, Nov 20, 2014 at 10:39 AM, Sadhan Sood sadhan.s...@gmail.com
wrote:
I am running on master, pulled
As the Spark Streaming tuning guide indicates, the key indicators of a
healthy streaming job are:
- Processing Time
- Total Delay
The Spark UI page for the Streaming job [1] shows these two indicators but
the metrics source for Spark Streaming (StreamingSource.scala) [2] does
not.
Any reasons
I have a similar type of issue; I want to join two different types of RDDs
into one RDD
file1.txt content (ID, counts)
val x: RDD[(Long, Int)] = sc.textFile("file1.txt").map(line =>
line.split(",")).map(row => (row(0).toLong, row(1).toInt))
[(4407 ,40),
(2064, 38),
(7815 ,10),
(5736,17),
(8031,3)]
Second RDD
There is probably a better way to do it but I would register both as temp
tables and then join them via SQL.
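A rough sketch of that (the case classes, table names and query are made up for illustration):

val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.createSchemaRDD
case class Count(id: Long, n: Int)
case class User(id: Long, name: String)
val counts = sc.parallelize(Seq(Count(1L, 40), Count(2L, 38)))
val users = sc.parallelize(Seq(User(1L, "alice"), User(2L, "bob")))
counts.registerTempTable("counts")
users.registerTempTable("users")
val joined = sqlContext.sql(
  "SELECT u.id, u.name, c.n FROM users u JOIN counts c ON u.id = c.id")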
BR,
Daniel
On 20 בנוב׳ 2014, at 23:53, Harihar Nahak hna...@wynyardgroup.com wrote:
I've similar type of issue, want to join two different type of RDD in one RDD
file1.txt
Harihar, your question is the opposite of what was asked. In the future,
please start a new thread for new questions.
You want to do a join in your case. The join function does an inner join,
which I think is what you want because you stated your IDs are common in
both RDDs. For other cases you
Just to close the loop, it seems no issues pop up when I submit the job
using 'spark-submit' so that the driver process also runs on a container in
the YARN cluster.
In the above, the driver was running on the gateway machine through which
the job was submitted, which led to quite a few issues.
Guys,
As to the questions of pre-processing, you could just migrate your logic to
Spark before using K-means.
I only used Scala on Spark, and haven't used Python binding on Spark, but I
think the basic steps must be the same.
BTW, if your data set is big with huge sparse dimension feature
Hi all,
I want to try the TF-IDF functionality in MLlib.
I can feed it words and generate the tf and idf RDD[Vector]s, using the code
below.
But how do I get this back to words and their counts and tf-idf values for
presentation?
val sentsTmp = sqlContext.sql("SELECT text FROM sentenceTable")
Thanks Daniel ,
Applied Join from PairedRDD
val countByUsername = file1.join(file2)
.map {
case (id, (username, count)) => (id, username, count)
}
-
--Harihar
Hi,
I am wondering how to write logging info in a worker when running a pyspark
app. I saw the thread
http://apache-spark-user-list.1001560.n3.nabble.com/logging-in-pyspark-td5458.html
but did not see a solution. Anybody know a solution? Thanks!
Someone will correct me if I'm wrong.
Actually, TF-IDF scores terms for a given document, and specifically TF.
Internally, these things are holding a Vector (hopefully sparse)
representing all the possible words (up to 2²⁰) per document. So each
document, after applying TF, will be transformed in
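The message is cut off above; here is a hedged sketch of the MLlib 1.1 TF-IDF flow it describes (sample documents are made up), which also shows why scores can't be mapped straight back to words: the vector index is a hash bucket, so you have to look a term's bucket up yourself:

import org.apache.spark.mllib.feature.{HashingTF, IDF}

val docs = sc.parallelize(Seq("spark is fast", "spark streaming is neat")).map(_.split(" ").toSeq)
val hashingTF = new HashingTF()                    // default: 2^20 buckets
val tf = hashingTF.transform(docs)
tf.cache()
val tfidf = new IDF().fit(tf).transform(tf)
val idx = hashingTF.indexOf("spark")               // hash bucket for the term "spark"
val sparkScores = tfidf.map(v => v(idx)).collect() // per-document score for that bucket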
Thank you.
And setting yarn.log-aggregation-enable in yarn-site.xml to true was the key.
It’s somewhat inconvenient that I must use ‘yarn logs’ rather than the YARN
resource manager web UI after the app has completed (that is, it seems that the
history server is not usable for Spark jobs),
Looks like IntelliJ might be trying to load the wrong version of Spark?
On Thu, Nov 20, 2014 at 4:35 PM, Sanjay Subramanian
sanjaysubraman...@yahoo.com.invalid wrote:
hey guys
I am at AmpCamp 2014 at UCB right now :-)
Funny Issue...
This code works in Spark-Shell but throws a funny
This seems pretty standard: your IntelliJ classpath isn't matched to the
correct ones that are used in the spark shell.
Are you using the SBT plugin? If not, how are you putting deps into IntelliJ?
On Nov 20, 2014, at 7:35 PM, Sanjay Subramanian
sanjaysubraman...@yahoo.com.INVALID wrote:
Awesome, that was it... Hit me with a hockey stick :-)
Unmatched Spark Core (1.0.0) and SparkSql (1.1.1) versions. Corrected that to
1.1.0 on both
<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-core_2.10</artifactId>
  <version>1.0.0</version>
</dependency>
<dependency>
Not using SBT... I have been creating and adapting various Spark Scala examples
and put them here; all you have to do is git clone and import as a maven project
into IntelliJ: https://github.com/sanjaysubramanian/msfx_scala.git
Sidenote, IMHO, IDEs encourage developers new to Spark/Scala to
I agree that using yarn logs is cumbersome. We're working to improve
this in future releases.
On Thu, Nov 20, 2014 at 4:31 PM, innowireless TaeYun Kim
taeyun@innowireless.co.kr wrote:
Thank you.
And setting yarn.log-aggregation-enable in yarn-site.xml to true was the
key.
It’s
Hi,
I am using sc.textFile("shared_dir/*") to load all the files in a directory
on a shared partition. The total size of the files in this directory is 1.2
TB. We have a 16 node cluster with 3 TB memory (1 node is driver, 15 nodes
are workers). But the loading fails after around 1 TB of data is
Hi friends,
I have successfully set up the thrift server and executed beeline on top.
Beeline can handle select queries just fine, but it cannot seem to do any kind
of caching/RDD operations.
i.e.
1) The "cache table" command doesn't work. See error:
Error: Error while processing statement: FAILED:
Can you make sure the class SimpleApp$$anonfun$1 is included in your app
jar?
2014-11-20 18:19 GMT+01:00 Benoit Pasquereau [via Apache Spark User List]
ml-node+s1001560n19391...@n3.nabble.com:
Hi Guys,
I’m having an issue in standalone mode (Spark 1.1, Hadoop 2.4, Windows
Server 2008).
I've tried starting a Log Server on the driver according to this doc
https://docs.python.org/2/howto/logging-cookbook.html
so the workers could connect and send their logs to this remote server.
If I remember correctly ... I didn't get any error messages and the job
worked fine, but ... the
On Thu, Nov 20, 2014 at 10:18 PM, Benny Thompson ben.d.tho...@gmail.com
wrote:
I'm trying to use MongoDB as a destination for an ETL I'm writing in
Spark. It appears I'm gaining a lot of overhead in my system databases
(and possibly in the primary documents themselves); I can only assume
I think I understand what is going on here, but I was hoping someone could
confirm (or explain reality if I don't) what I'm seeing.
We are collecting data using a rather sizable accumulator - essentially, an
array of tens of thousands of entries. All told, about 1.3m of data.
If I understand
The disk size will not correctly reflect the size of the memory needed to
store that data. Memory depends a lot on the data structures you are using.
The objects that hold the data will have their own overheads.
Try with reduced input size or more memory.
On 21/11/14 8:26 am, SK
To have a single text file output for each batch you can repartition it to
1 and then call saveAsTextFiles:
stream.repartition(1).saveAsTextFiles("location")
On 21 Nov 2014 11:28, jishnu.prat...@wipro.com wrote:
Hi, I am also having a similar problem... any fix suggested?
*Originally Posted
Hi Akhil
Thanks for the reply.
But it creates different directories... I tried using FileWriter but it shows a
non-serializable error.
non serializable error..
val stream = TwitterUtils.createStream(ssc, None) //, filters)
val statuses = stream.map(
status => sentimentAnalyzer.findSentiment({
Looks like it cannot find the class or jar on your driver machine.
Are you sure that the corresponding jar file exists on the driver machine rather
than on your development machine?
2014-11-21 11:16 GMT+08:00 angel2014 angel.alvarez.pas...@gmail.com:
Can you make sure the class SimpleApp$$anonfun$1 is
Hi,
I am planning to use UIMA library to process data in my RDDs. I have had bad
experiences while using third party libraries inside worker tasks. The
system gets plagued with Serialization issues. But as UIMA classes are not
necessarily Serializable, I am not sure if it will work.
Please
Here's a quick version to store (append) on your local machine:
val tweets = TwitterUtils.createStream(ssc, None)
val hashTags = tweets.flatMap(status => status.getText.split(" ")
.filter(_.startsWith("#")))
hashTags.foreachRDD(rdds => {
rdds.foreach(rdd => {
val fw = new
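The snippet is cut off above; a hedged completion of the idea, appending each batch's tags to a single local file on the driver (the path is illustrative):

hashTags.foreachRDD { rdd =>
  val fw = new java.io.FileWriter("/tmp/hashtags.txt", true)   // true = append
  try rdd.collect().foreach(tag => fw.write(tag + "\n")) finally fw.close()
}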
Hello,
I need to develop an application which:
- reads xml files in thousands of directories, two levels down, from year x to
year y
- extracts data from image tags in those files and stores them in a Sql or
NoSql database
- generates ImageMagick commands based on the extracted data to