From IntelliJ, you can use the remote debugging feature.
http://stackoverflow.com/questions/19128264/how-to-remote-debug-in-intellij-12-1-4
For remote debugging, you need to pass the following JVM options:
-Xdebug -Xrunjdwp:server=y,transport=dt_socket,address=4000,suspend=n
and configure your
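A minimal sketch of wiring this up, assuming Spark 1.x (in client mode the driver JVM is already running when SparkConf is read, so the driver flags usually go on the spark-submit command line via --driver-java-options; executors can carry the same agent string through spark.executor.extraJavaOptions):
import org.apache.spark.{SparkConf, SparkContext}
// hypothetical app name; the JDWP string is the one quoted above
val jdwp = "-Xdebug -Xrunjdwp:server=y,transport=dt_socket,address=4000,suspend=n"
val conf = new SparkConf()
  .setAppName("debuggable-app")
  .set("spark.executor.extraJavaOptions", jdwp) // attach the agent to executors
val sc = new SparkContext(conf)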
I'm trying to build a dashboard to visualize a stream of events coming from
mobile devices.
For example, I have an event called add_photo, from which I want to calculate
trending tags for added photos over the last x minutes. Then I'd like to
aggregate that by country, etc. I've built the streaming part,
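A rough sketch of the trending-tags window in Scala (the input DStream and field names are assumptions, not the poster's code):
import org.apache.spark.streaming.Seconds
import org.apache.spark.streaming.StreamingContext._
// addPhotoEvents: DStream[(String, String)] of (tag, country), already parsed
val trendingTags = addPhotoEvents
  .map { case (tag, _) => (tag, 1) }
  .reduceByKeyAndWindow(_ + _, Seconds(600), Seconds(60)) // last 10 min, sliding every minute
  .transform(_.sortBy(_._2, ascending = false))           // most frequent tags first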
Hello Matei,
Could you please also add our company to the Powered By list?
Details are as follows:
Name: Act Now
URL: www.actnowib.com
Description:
Spark powers NOW APPS, a big data, real-time, predictive analytics
platform.
Using Spark SQL, MLlib and GraphX components for both batch ETL and
Are you submitting your application from a local machine to a remote host?
If you want to run the Spark application from a remote machine, then you have
to at least set the following configurations properly (a sketch follows below).
- *spark.driver.host* - points to the IP/host from which you are submitting
the job (make sure
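A minimal sketch of those settings (host names are placeholders):
import org.apache.spark.{SparkConf, SparkContext}
val conf = new SparkConf()
  .setAppName("remote-submit-test")              // hypothetical
  .setMaster("spark://remote-master:7077")       // the remote cluster
  .set("spark.driver.host", "ip-of-submitting-machine")
val sc = new SparkContext(conf)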
I'll stay with my recommendation - that's exactly what Kibana is made for ;)
2015-03-20 9:06 GMT+01:00 Harut Martirosyan harut.martiros...@gmail.com:
Hey Jeffrey.
Thanks for the reply.
I already have something similar, I use Grafana and Graphite, and for
simple metric streaming we've got all
Hey Jeffrey.
Thanks for the reply.
I already have something similar, I use Grafana and Graphite, and for
simple metric streaming we've got it all set up right.
My question is about interactive patterns. For instance, dynamically choosing
an event to monitor, dynamically choosing a group-by field, or any sort
Hi All
I would like to measure Bytes Read and Peak Memory Usage for a Spark SQL
query.
Please clarify: does Bytes Read equal the aggregate size of all RDDs?
All my RDDs are in memory, with 0B spilled to disk.
And I am unsure how to measure Peak Memory Usage.
You could do a cache and see the memory usage under the Storage tab in the
driver UI (runs on port 4040).
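For instance, a minimal sketch (the input path is hypothetical):
val rdd = sc.textFile("hdfs:///some/input")
rdd.cache()
rdd.count() // materializes the cache; sizes then appear under the Storage tab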
Thanks
Best Regards
On Fri, Mar 20, 2015 at 12:02 PM, anu anamika.guo...@gmail.com wrote:
Hi All
I would like to measure Bytes Read and Peak Memory Usage for a Spark SQL
Query.
Please
1. If you are consuming data from Kafka or any other receiver-based
source, then you can start 1-2 receivers per worker (assuming you'll have
a minimum of 4 cores per worker).
2. If you have a single receiver or a fileStream, then what you can
do to distribute the data across machines is to do a
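The message is cut off, but presumably this ends in a repartition(); a hedged sketch of both points against the Spark Streaming 1.x receiver API:
import org.apache.spark.streaming.kafka.KafkaUtils
// ssc: an existing StreamingContext; Kafka endpoints are placeholders
val streams = (1 to 2).map { _ =>
  KafkaUtils.createStream(ssc, "zk-host:2181", "my-group", Map("events" -> 1))
}
val unified = ssc.union(streams)    // one logical stream fed by several receivers
val spread = unified.repartition(8) // redistribute a single receiver's batches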
Hey Harut,
I don't think there'll be any general practices, as this part heavily
depends on your environment, skills, and what you want to achieve.
If you don't have a general direction yet, I'd suggest you have a look
at Elasticsearch+Kibana. It's very easy to set up, powerful, and therefore
Hi Davies,
I am already using --py-files. The system does use the other file. The
error I am getting is not trivial. Please check the error log.
On Thu, Mar 19, 2015 at 8:03 PM, Davies Liu dav...@databricks.com wrote:
You could submit additional Python source via --py-files, for example:
Hi,
I recently changed from Spark 1.1 to Spark 1.2, and I noticed that it
loses all executors whenever I have any Python code bug (like looking up a
key in a dictionary that does not exist). In earlier versions, it would
raise an exception but it would not lose all executors.
Anybody with a
Hi Mohit,
it also depends on what the source for your streaming application is.
If you use Kafka, you can easily partition topics and have multiple
receivers on different machines.
If you have something like an HTTP, socket, etc. stream, you probably can't do
that. The Spark RDDs generated by your
Hello,
Is it possible to delete the shuffle data of a previous iteration, as it is not
necessary?
Alcaid
Hello Xiangrui,
I use Spark 1.2.0 on CDH 5.3. Thanks!
-Su
On Fri, Mar 20, 2015 at 2:27 PM Xiangrui Meng men...@gmail.com wrote:
Su, which Spark version did you use? -Xiangrui
On Thu, Mar 19, 2015 at 3:49 AM, Akhil Das ak...@sigmoidanalytics.com
wrote:
To get these metrics out, you need
Hi all,
I'm building a Spark Streaming application that will continuously read
multiple Kafka topics at the same time. However, I found a weird issue:
it reads only hundreds of messages and then stops reading any more. If I
change the three topics to only one topic, then it is fine and it
Instead of opening a tunnel to the Spark web UI port, could you open a
tunnel to the YARN RM web UI instead? That should allow you to
navigate to the Spark application's web UI through the RM proxy, and
hopefully that will work better.
On Fri, Feb 6, 2015 at 9:08 PM, yangqch
Feel free to send a pull request to fix the doc (or say which versions it's
needed in).
Matei
On Mar 20, 2015, at 6:49 PM, Krishna Sankar ksanka...@gmail.com wrote:
Yep, the command option is gone. No big deal, just add the '%pylab inline'
command as part of your notebook.
Cheers
k/
Let me do it now. I appreciate the clear, easily understandable
documentation of Spark!
The updated command will be something like:
PYSPARK_DRIVER_PYTHON=ipython PYSPARK_DRIVER_PYTHON_OPTS=notebook
./bin/pyspark
When the IPython notebook server is launched, you can create a new Python
2 notebook from Files.
I ran into a similar issue. What's happening is that when Spark is running
in YARN client mode, YARN automatically launches a Web Application Proxy
http://archive.cloudera.com/cdh5/cdh/5/hadoop/hadoop-yarn/hadoop-yarn-site/WebApplicationProxy.html
to reduce hacking attempts. In doing so, it
Hi,
Running Spark 1.3 with secured Hadoop.
Spark-shell with Yarn client mode runs without issue when not using Dynamic
Allocation.
When Dynamic Allocation is turned on, the shell comes up, but the same SQL etc.
causes it to loop.
spark.dynamicAllocation.enabled=true
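As a hedged aside, on YARN dynamic allocation also requires the external shuffle service; a sketch of the settings that are commonly set together (values are illustrative, not the poster's config):
val conf = new org.apache.spark.SparkConf()
  .set("spark.dynamicAllocation.enabled", "true")
  .set("spark.shuffle.service.enabled", "true") // external shuffle service
  .set("spark.dynamicAllocation.minExecutors", "1")
  .set("spark.dynamicAllocation.maxExecutors", "10")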
Forgot to add - the cluster is otherwise idle, so there should be no
resource issues. Also, the configuration works when not using Dynamic
Allocation.
On Fri, Mar 20, 2015 at 4:15 PM, Manoj Samel manojsamelt...@gmail.com
wrote:
Hi,
Running Spark 1.3 with secured Hadoop.
Spark-shell with Yarn
Hi All,
I have all my UDFs defined in classes residing in a different package
than the one where I am instantiating my HiveContext.
I have a register function in my UDF class. I pass the HiveContext to this
function, and in this function I call
hiveContext.registerFunction("myudf", myudf _)
All goes well
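A hedged reconstruction of that pattern (the package name and UDF body are hypothetical):
package com.example.udfs
import org.apache.spark.sql.hive.HiveContext
object MyUdfs {
  def myudf(s: String): Int = s.length // stand-in for the real UDF
  def register(hiveContext: HiveContext): Unit = {
    hiveContext.registerFunction("myudf", myudf _)
  }
}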
Dear all,
About the index of each partition of an RDD, I am wondering whether we
can keep their numbering on each physical machine in a hash
partitioning process. For example, in a cluster containing three physical
machines A, B, C (all workers), for an RDD with six partitions,
assume that the two
-amz-content-sha256: e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855, Host: frankfurt.ingestion.batch.s3.amazonaws.com, x-amz-date: 20150320T112531Z, Authorization: AWS4-HMAC-SHA256 Credential=XXX_MY_KEY_XXX/20150320/us-east-1/s3/aws4_request,SignedHeaders=date;host;x-amz-content:
Fri, 20 Mar 2015 11:25:31 GMT, x-amz-content-sha256: e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855, Host: frankfurt.ingestion.batch.s3.amazonaws.com, x-amz-date: 20150320T112531Z, Authorization: AWS4-HMAC-SHA256 Credential=XXX_MY_KEY_XXX/20150320/us-east-1/s3/aws4_request
Hi all,
I have 6 nodes in the cluster, and one of the nodes is clearly
under-performing:
I was wondering what the impact of such issues is. Also, what is the
recommended way to work around it?
Thanks a lot,
Yiannis
Hi,
I am curious: when I start a Spark program in local mode, which parameter
is used to decide the JVM memory size for the executor?
In theory it should be:
--executor-memory 20G
But I remember local mode will only start the Spark executor in the same process
as the driver, so then it should be:
Hi all,
We are designing a workflow where we try to stream local files to a socket
streamer that would clean and process the files and write them to HDFS. We
have an issue with bigger files, when the streamer cannot keep up with the
data and runs out of memory.
What would be the best way to
: frankfurt.ingestion.batch.s3.amazonaws.com, x-amz-date: 20150320T112531Z, Authorization: AWS4-HMAC-SHA256 Credential=XXX_MY_KEY_XXX/20150320/us-east-1/s3/aws4_request,SignedHeaders=date;host;x-amz-content-sha256;x-amz-date,Signature=2098d3175c4304e44be912b770add7594d1d1b44f545c3025be1748672ec60e4
Anyone? Or is this question nonsensical... and am I doing something
fundamentally wrong?
On Mon, Mar 16, 2015 at 5:33 PM, Jacob Abraham abe.jac...@gmail.com wrote:
Hi Folks,
I have a situation where I am getting a version conflict between Java
libraries that are used by my application and
Grafana allows pretty slick interactive use patterns, especially with
graphite as the back-end. In a multi-user environment, why not have each
user just build their own independent dashboards and name them under some
simple naming convention?
*Irfan Ahmad*
CTO | Co-Founder | *CloudPhysics*
bq. Caused by: java.net.UnknownHostException: dhcp-10-35-14-100: Name or
service not known
Can you check your DNS?
Cheers
On Fri, Mar 20, 2015 at 8:54 PM, tangzilu zilu.t...@hotmail.com wrote:
Hi All:
I recently started to deploy Spark 1.2 in my VirtualBox Linux.
But when I run the command
Hi, all:
When I exit the console of spark-sql, the following exception is thrown.
My Spark version is 1.3.0 and my Hadoop version is 2.2.0.
Exception in thread "Thread-3" java.io.IOException: Filesystem closed
at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:629)
at
Hi, all:
When I exit the console of spark-sql, the following exception is thrown.
Exception in thread "Thread-3" java.io.IOException: Filesystem closed
at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:629)
at
Hi All:
I recently started to deploy Spark 1.2 in my VirtualBox Linux.
But when I run the command ./spark-shell in the path
/opt/spark-1.2.1/bin, I get a result like this:
[root@dhcp-10-35-14-100 bin]# ./spark-shell
Using Spark's default log4j profile:
Isn't that a feature? Rather than keep running a buggy pipeline, it just kills all
executors? You can always handle exceptions with a proper try/catch in your
code, though.
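The thread is about PySpark, but the same defensive pattern in Scala would be to guard per-record lookups so one bad record doesn't fail the task:
val lookup = Map("a" -> 1)
val safe = sc.parallelize(Seq("a", "b")).map { k =>
  lookup.getOrElse(k, -1) // instead of lookup(k), which throws on a missing key
}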
Thanks
Best Regards
On Fri, Mar 20, 2015 at 3:51 PM, mrm ma...@skimlinks.com wrote:
Hi,
I recently changed from Spark 1.1. to Spark
Uh, does that mean HDP shipped Marcelo's uncommitted patch from
SPARK-1537 anyway? Given the discussion there, that seems kinda
aggressive.
On Wed, Mar 18, 2015 at 8:49 AM, Marcelo Vanzin van...@cloudera.com wrote:
Those classes are not part of standard Spark. You may want to contact
My job crashes with a bunch of these messages in the YARN logs.
What are the appropriate steps in troubleshooting?
15/03/19 23:29:45 ERROR shuffle.RetryingBlockFetcher: Exception while
beginning fetch of 10 outstanding blocks (after 3 retries)
15/03/19 23:29:45 ERROR
Maybe this is related to a bug in 1.2 [1]; it's fixed in 1.2.2 (not yet
released). Could you check out the 1.2 branch and verify that?
[1] https://issues.apache.org/jira/browse/SPARK-5788
On Fri, Mar 20, 2015 at 3:21 AM, mrm ma...@skimlinks.com wrote:
Hi,
I recently changed from Spark 1.1. to Spark
You MUST put --py-files BEFORE main.py, as mentioned in other threads.
On Fri, Mar 20, 2015 at 1:47 AM, Guillaume Charhon
guilla...@databerries.com wrote:
Hi Davies,
I am already using --py-files. The system does use the other file. The error
I am getting is not trivial. Please check the
spark.sql.shuffle.partitions only controls the number of tasks in the second
stage (the number of reducers). For your case, I'd say that the number of
tasks in the first stage (the number of mappers) will be the number of files
you have.
Actually, have you changed spark.executor.memory (it controls
Hi, Imran:
Thanks for your information.
I found a benchmark online about serialization which compares Java vs. Kryo vs.
GridGain here:
http://gridgain.blogspot.com/2012/12/java-serialization-good-fast-and-faster.html
From my test result, in the above benchmark case for the SimpleObject, Kryo is
Thanks!
Let me update the status.
I have copied the DirectOutputCommitter to my local project and set:
conf.set("spark.hadoop.mapred.output.committer.class",
  "org..DirectOutputCommitter")
It works perfectly.
Thanks everyone :)
Regards,
Shuai
From: Aaron Davidson
Hm, is data locality a factor here? I don't know.
Just a side note: this doesn't cause OOM errors per se, since the cache
won't exceed the % of heap it's allowed. However, it will hasten OOM
problems due to tasks using too much memory, of course. The solution
is to get more memory to the tasks or
val aDstream = ...
val distinctStream = aDstream.transform(_.distinct())
but the elements in distinctStream are not distinct.
Did I use it wrong?
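One hedged note: transform(_.distinct()) runs per batch RDD, so it removes duplicates within a batch only; repeats arriving in different batches survive. A sketch of deduplicating over a recent horizon instead:
import org.apache.spark.streaming.Seconds
val windowedDistinct = aDstream
  .window(Seconds(60), Seconds(10)) // look back over the last minute
  .transform(_.distinct())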
I think you should see some other errors before that, from
NettyBlockTransferService, with a msg like "Exception while beginning
fetchBlocks". There might be a bit more information there. There is an
assortment of possible causes, but first let's just make sure you have all
the details from the
Hi All,
I'm experiencing the same issue with Spark 1.2.0 (not verified with previous versions).
Could you please help us on this?
Thanks
Alessandro
On Tue, Nov 18, 2014 at 1:40 AM, mtimper mich...@timper.com wrote:
Hi I'm running a standalone cluster with 8 worker servers.
I'm developing a streaming
Hi Patcharee,
It is an alpha feature in the HDP distribution, integrating ATS with the Spark history
server. If you are using upstream Spark, you can configure it as usual without
these configurations. But other related configurations are still mandatory, such
as the hdp.version-related ones.
Thanks.
Zhan Zhang
But it requires all possible combinations of your filters as separate
metrics; moreover, it can only show time-based information - you cannot
group by, say, country.
On 20 March 2015 at 19:09, Irfan Ahmad ir...@cloudphysics.com wrote:
Grafana allows pretty slick interactive use patterns,
It's not a crazy question, no. I'm having a bit of trouble figuring
out what's happening. Commons Net 2.2 is what's used by Spark. The
error appears to come from Spark. But the error is about failing to find a
method that did not exist in 2.2. I am not sure what ZipStream is, for
example. This could be a
Hi Yin,
the way I set the configuration is:
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
sqlContext.setConf("spark.sql.shuffle.partitions", "1000")
That is the correct way, right?
In the mapPartitions task (the first task that is launched), I again get
the same number of tasks and again
Hi Thomas,
sorry for such a late reply. I don't have any super-useful advice, but
this seems like something that is important to follow up on. To answer
your immediate question: no, there should not be any hard limit to the
number of tasks that MapOutputTracker can handle. Though of course as
I think you are running into a combo of
https://issues.apache.org/jira/browse/SPARK-5928
and
https://issues.apache.org/jira/browse/SPARK-5945
The standard solution is to just increase the number of partitions you are
creating. textFile(), reduceByKey(), and sortByKey() all take an optional
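Presumably that sentence ends with the optional numPartitions argument; a sketch with illustrative counts:
val lines = sc.textFile("hdfs:///input", 400) // minPartitions
val counts = lines.map(line => (line, 1)).reduceByKey(_ + _, 400) // numPartitions
val sorted = counts.sortByKey(ascending = true, numPartitions = 400)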
Hi Harut,
Jeff's right that Kibana + Elasticsearch can take you quite far out of the
box. Depending on your volume of data, you may only be able to keep recent
data around though.
Another option that is custom-built for handling many dimensions at query
time (not as separate metrics) is Druid
Actually, I realized that the correct way is:
sqlContext.sql("set spark.sql.shuffle.partitions=1000")
but I am still experiencing the same behavior/error.
On 20 March 2015 at 16:04, Yiannis Gkoufas johngou...@gmail.com wrote:
Hi Yin,
the way I set the configuration is:
val sqlContext = new
Hi Yong,
yes, I think your analysis is correct. I'd imagine almost all serializers
out there will just convert a string to its UTF-8 representation. You
might be interested in adding compression on top of a serializer, which
would probably bring the string size down in almost all cases, but then
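A hedged sketch of compression on top of a serializer: with a serialized storage level, spark.rdd.compress compresses the stored partitions.
import org.apache.spark.storage.StorageLevel
val conf = new org.apache.spark.SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.rdd.compress", "true") // applies to serialized storage levels
// later: rdd.persist(StorageLevel.MEMORY_ONLY_SER)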
Below is the output:
core file size (blocks, -c) 0
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 1967947
max locked memory (kbytes, -l) 64
max memory
You probably don't cause a shuffle (which requires serialization) unless
there is a join or group by.
It's possible that we need to pass the Spark class loader to Kryo when
creating a new instance (you can get it from Utils, I believe). We never
ran into this problem since this API is not
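A hedged sketch of the idea: when instantiating Kryo by hand, give it a class loader that can see Spark's and the application's classes.
import com.esotericsoftware.kryo.Kryo
val kryo = new Kryo()
kryo.setClassLoader(Thread.currentThread().getContextClassLoader)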
Hi,
We have Spark set up such that there are various users running multiple jobs
at the same time. Currently all the logs go to one file specified in
log4j.properties.
Is it possible to configure log4j in Spark for per-app/user logging instead
of sending all logs to the one file mentioned in the
Hi,
I want to get telemetry metrics on Spark app activities, such as run time and
JVM activities.
Using Spark Metrics I am able to get the following sample data point on an
app:
type=GAUGE, name=application.SparkSQL::headnode0.1426626495312.runtime_ms,
value=414873
How can I match this
Any other users interested in a
DataFrame.saveAsExternalTable() feature for making _useful_ external tables in
Hive, or am I the only one? Bueller? If I start a PR for this, will it
be taken seriously?
On Thu, Mar 19, 2015 at 9:34 AM, Christian Perez christ...@svds.com wrote:
Hi Yin,
Thanks
Could it be the ulimit for the user running Spark on your nodes?
Can you run ulimit -a as the user who is running Spark on the executor
node? Does the result make sense for the data you are trying to process?
Yong
From: szheng.c...@gmail.com
To: user@spark.apache.org
Subject:
Hi,
I created a cluster using the spark-ec2 script, but it installs HDFS version
1.0. I would like to use this cluster to connect to Hive installed on a
Cloudera CDH 5.3 cluster, but I am getting the following error:
org.apache.hadoop.ipc.RemoteException: Server IPC version 9 cannot
communicate
Hi All,
I tried to run a simple sort-by on 1.2.1, and it always gives me the two
errors below:
1. 15/03/20 17:48:29 WARN TaskSetManager: Lost task 2.0 in stage 1.0 (TID
35, ip-10-169-217-47.ec2.internal): java.io.FileNotFoundException:
I notice that some people send messages directly to user@spark.apache.org
and some via nabble, either using email or the web client.
There are two index sites, one directly at apache.org and one at nabble.
But messages sent directly to user@spark.apache.org only show up in the
apache list.
Yes, it did get delivered to the apache list shown here:
http://mail-archives.apache.org/mod_mbox/spark-user/201503.mbox/%3CCAGK53LnsD59wwQrP3-9yHc38C4eevAfMbV2so%2B_wi8k0%2Btq5HQ%40mail.gmail.com%3E
But the web site for spark community directs people to nabble for viewing
messages and it
Assuming you are on Linux, what is your /etc/security/limits.conf set to for
nofile/soft (the number of open file handles)?
On Fri, Mar 20, 2015 at 3:29 PM Shuai Zheng szheng.c...@gmail.com wrote:
Hi All,
I tried to run a simple sort-by on 1.2.1, and it always gives me the two
errors below:
1.
Hi,
I am trying to create a Spark cluster using the spark-ec2 script that will
support 2.5.0-cdh5.3.2 for HDFS as well as Hive. I created a cluster by
adding --hadoop-major-version=2.5.0, which solved some of the errors I was
getting. But now when I run a select query on Hive I get the following
Su, which Spark version did you use? -Xiangrui
On Thu, Mar 19, 2015 at 3:49 AM, Akhil Das ak...@sigmoidanalytics.com wrote:
To get these metrics out, you need to open the driver UI running on port
4040. In there you will see Stages information, and for each stage you
can see how much time
I think you missed -Phadoop-2.4
On Fri, Mar 20, 2015 at 5:27 PM, morfious902002 anubha...@gmail.com wrote:
Hi,
I am trying to create a Spark cluster using the spark-ec2 script that will
support 2.5.0-cdh5.3.2 for HDFS as well as Hive. I created a cluster by
adding
Are these jobs the same jobs just run by different users, or different
jobs?
If the latter, can each application use its own log4j.properties ?
Cheers
On Fri, Mar 20, 2015 at 1:43 PM, Udit Mehta ume...@groupon.com wrote:
Hi,
We have spark setup such that there are various users running
I see. I will try the other way around.
On Thu, Mar 19, 2015 at 8:06 PM, Davies Liu dav...@databricks.com wrote:
the options of spark-submit should come before main.py, or they will
become the options of main.py, so it should be:
../hadoop/spark-install/bin/spark-submit --py-files
I tried your program in yarn-client mode and it worked with no
exception. This is the command I used:
spark-submit --master yarn-client --py-files work.py main.py
(Spark 1.2.1)
On 20.3.2015. 9:47, Guillaume Charhon wrote:
Hi Davies,
I am already using --py-files. The system does use the
Greetings!
I sorted a dataset in Spark and then wrote it out in Avro/Parquet.
Then I wanted to check that it was sorted.
It looks like each partition has been sorted, but when reading it back in, the
first partition (i.e., as seen in the partition index of mapPartitionsWithIndex)
is not the same as
Hello:
I tried the IPython notebook with the following command in my environment:
PYSPARK_DRIVER_PYTHON=ipython PYSPARK_DRIVER_PYTHON_OPTS="notebook
--pylab inline" ./bin/pyspark
But it shows that --pylab inline support is removed from the newest IPython
version. The log is:
---
$
Yep, the command option is gone. No big deal, just add the '%pylab
inline' command
as part of your notebook.
Cheers
k/
On Fri, Mar 20, 2015 at 3:45 PM, cong yue yuecong1...@gmail.com wrote:
Hello:
I tried the IPython notebook with the following command in my environment.