Is there a reason not to just use scala? It's not a lot of code... and
it'll be even less code in scala ;)
On Wed, Aug 19, 2015 at 4:14 AM, Shushant Arora shushantaror...@gmail.com
wrote:
To correct myself - KafkaRDD[K, V, U, T, R] is a subclass of RDD[R] but
Option[KafkaRDD[K, V, U, T, R]] is
I personally build with SBT and run Spark on YARN with IntelliJ. You need
to connect to remote JVMs with a remote debugger. You also need to do
something similar if you use Python, because it will launch a JVM on the
driver as well.
On Wed, Aug 19, 2015 at 2:10 PM canan chen ccn...@gmail.com wrote:
Hi, thank you all for the assistance.
It is odd, it works when creating a new java.math.BigDecimal object, but not
if I work directly with:
scala> 5 match { case x: java.math.BigDecimal => 2 }
<console>:23: error: scrutinee is incompatible with pattern type;
 found   : java.math.BigDecimal
 required: Int
It is okay. This methodology works very well for mapping objects of my Seq[Any].
It is indeed very cool :-)
Saif
From: Ted Yu [mailto:yuzhih...@gmail.com]
Sent: Wednesday, August 19, 2015 10:47 AM
To: Ellafi, Saif A.
Cc: sujitatgt...@gmail.com; wrbri...@gmail.com; user@spark.apache.org
Subject:
Hints are good. Thanks Corey. I will try to find out more from the logs.
On Aug 18, 2015, at 7:23 PM, Corey Nolet cjno...@gmail.com wrote:
Usually more information as to the cause of this will be found down in your
logs. I generally see this happen when an out of memory exception has
Thanks Ted. I noticed another thread about running Spark programmatically
(client mode for standalone and YARN). Would it be much easier to debug
Spark if that were possible? Hasn't anyone thought about it?
On Wed, Aug 19, 2015 at 5:50 PM, Ted Yu yuzhih...@gmail.com wrote:
See this thread:
Saif:
In your example below, the error was because there is no automatic conversion
from Int to BigDecimal.
Cheers
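For what it's worth, a small sketch of why it works when mapping a Seq[Any] (the values below are made-up examples): once the scrutinee is typed Any, the BigDecimal pattern compiles fine:

val xs: Seq[Any] = Seq(new java.math.BigDecimal("5"), 5)
xs.map {
  case b: java.math.BigDecimal => 2
  case i: Int => 1
  case _ => 0
}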
On Aug 19, 2015, at 6:40 AM, saif.a.ell...@wellsfargo.com wrote:
Hi, thank you all for the assistance.
It is odd, it works when creating a
Does anyone know about this? Or am I missing something here?
On Fri, Aug 7, 2015 at 4:20 PM, canan chen ccn...@gmail.com wrote:
Is there any reason that the history server uses a different property for the
event log dir? Thanks
Thanks Andrew for a detailed response.
So the reason why key-value pairs with the same keys are always found in a
single bucket in hash-based shuffle but not in sort-based shuffle is that in
sort-based shuffle each mapper writes a single partitioned file, and it is up
to the reducer to fetch the correct partitions from
any differences in number of cores, memory settings for executors?
On 19 August 2015 at 09:49, Rick Moritz rah...@gmail.com wrote:
Dear list,
I am observing a very strange difference in behaviour between a Spark
1.4.0-rc4 REPL (locally compiled with Java 7) and a Spark 1.4.0 zeppelin
I am not 100% sure, but flatMap probably unwinds the tuples. Try map
instead.
2015-08-19 13:10 GMT+02:00 Jerry OELoo oylje...@gmail.com:
Hi.
I want to parse a file and return a key-value pair with pySpark, but the
result is strange to me.
test.sql is a big file and each line is a username
When submitting to YARN, you can specify two different operation modes for
the driver with the --master parameter: yarn-client or yarn-cluster. For
more information on submitting to YARN, see this page in the Spark docs:
http://spark.apache.org/docs/latest/running-on-yarn.html
yarn-cluster mode
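For example, a minimal submission in yarn-cluster mode might look like the following (the class and jar names are placeholders):

spark-submit --master yarn-cluster --class com.example.MyApp /path/to/myapp.jar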
I would compare Spark UI metrics for both cases and see if there are any
differences (number of partitions, number of spills etc.).
Why can't you make the REPL consistent with the Zeppelin Spark version?
The RC might have issues...
On 19 August 2015 at 14:42, Rick Moritz rah...@gmail.com wrote:
No, the setup
oops, forgot to reply-all on this thread.
-- Forwarded message --
From: Rick Moritz rah...@gmail.com
Date: Wed, Aug 19, 2015 at 2:46 PM
Subject: Re: Strange shuffle behaviour difference between Zeppelin and
Spark-shell
To: Igor Berman igor.ber...@gmail.com
Those values are not
From the log, it looks like the OS user who is running Spark cannot open any
more files.
Check your ulimit setting for that user:
ulimit -a
open files                      (-n) 65536
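If the limit is too low, you can raise it for the current shell session, for example (persistent limits are usually configured in /etc/security/limits.conf, depending on the distro):

ulimit -n 65536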
Date: Tue, 18 Aug 2015 22:06:04 -0700
From: swethakasire...@gmail.com
To: user@spark.apache.org
Subject: Failed
No, the setup is one driver with 32g of memory, and three executors each
with 8g of memory in both cases. No core count has been specified, thus it
should default to single-core (though I've seen the YARN-owned JVMs
wrapping the executors take up to 3 cores in top). That is, unless, as I
Hi,
I would like to ask if there are some blogs/articles/videos on how to analyse
Spark performance at runtime, e.g. tools that can be used or something related.
That worked great, thanks Andrew.
On Tue, Aug 18, 2015 at 1:39 PM, Andrew Or and...@databricks.com wrote:
Hi Axel,
You can try setting `spark.deploy.spreadOut` to false (through your
conf/spark-defaults.conf file). What this does is essentially try to
schedule as many cores on one worker as
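For reference, the setting would go into conf/spark-defaults.conf along these lines:

spark.deploy.spreadOut    false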
Can you try to use HiveContext instead of SQLContext? Your query is trying
to create a table and persist the metadata of the table in the metastore, which
is only supported by HiveContext.
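A minimal sketch in the spark-shell, assuming a build with Hive support (the table name and columns are placeholders):

import org.apache.spark.sql.hive.HiveContext
val hiveContext = new HiveContext(sc)
hiveContext.sql("CREATE TABLE IF NOT EXISTS my_table (id INT, name STRING)")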
On Wed, Aug 19, 2015 at 8:44 AM, Yusuf Can Gürkan yu...@useinsider.com
wrote:
Hello,
I’m trying to create a
Some background on what we're trying to do:
We have four Kinesis receivers with varying amounts of data coming through
them. Ultimately we work on a unioned stream that is getting about 11
MB/second of data. We use a batch size of 5 seconds.
We create four distinct DStreams from this data that
Hey Yin,
Thanks for the answer. I thought that this could be the problem, but I cannot
create a HiveContext because I cannot import org.apache.spark.sql.hive.HiveContext;
it does not see this package.
I read that I should build Spark with -Phive, but I'm running on Amazon EMR
(1.4.1) and in spark-shell
Hello,
I have a DataFrame with a date column which I want to use as a partition.
Each day I want to write the data for the same date in Parquet, and then
read a DataFrame for a date range.
I'm using:
myDataframe.write().partitionBy("date").mode(SaveMode.Overwrite).parquet(parquetDir);
If I use
Excellent resource: http://www.oreilly.com/pub/e/3330
And more amazing is the fact that the presenter actually responds to your
questions.
Regards,
Gourav Sengupta
On Wed, Aug 19, 2015 at 4:12 PM, Todd bit1...@163.com wrote:
Hi,
I would ask if there are some blogs/articles/videos on how to
Hmm, maybe I spoke too soon.
I have an Apache Zeppelin instance running and have configured it to use 48
cores (each node only has 16 cores), so I figured setting it to 48
would mean that Spark would grab 3 nodes. What happens instead, though, is
that Spark reports that 48 cores are being
You don't need to register; search YouTube for this video...
On 19 August 2015 at 18:34, Gourav Sengupta gourav.sengu...@gmail.com
wrote:
Excellent resource: http://www.oreilly.com/pub/e/3330
And more amazing is the fact that the presenter actually responds to your
questions.
Regards,
My question was how to do this in Hadoop? Could somebody point me to some
examples?
On Tue, Aug 18, 2015 at 10:43 PM, UMESH CHAUDHARY umesh9...@gmail.com
wrote:
Of course, Java or Scala can do that (see the rough sketch below):
1) Create a FileWriter with an append or roll-over option
2) For each RDD create a StringBuilder
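A rough Scala sketch of those two steps, assuming a DStream[String] and a local output path (both are placeholders, and collecting each batch to the driver only makes sense for small batches):

import java.io.FileWriter
import org.apache.spark.streaming.dstream.DStream

def appendToFile(stream: DStream[String], path: String): Unit =
  stream.foreachRDD { rdd =>
    // build one string for this batch
    val sb = new StringBuilder
    rdd.collect().foreach(line => sb.append(line).append('\n'))
    // open the file in append mode and write the batch
    val writer = new FileWriter(path, true)
    try writer.write(sb.toString) finally writer.close()
  }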
Hi.
I'd like to read Avro files using this library
https://github.com/databricks/spark-avro
I need to load several files from a folder, not all files. Is there some
functionality to filter the files to load?
And... is it possible to know the names of the files loaded from a folder?
My problem
I'm glad that I could help :)
On 19 Aug 2015 at 8:52 AM, Shenghua(Daniel) Wan wansheng...@gmail.com
wrote:
+1
I wish I had read this blog earlier. I am using Java and have just
implemented a singleton producer per executor/JVM during the day.
Yes, I did see that NonSerializableException when
I'm getting some Spark exception. Please look at this log trace
(http://pastebin.com/xL9jaRUa).
Thanks,
https://in.linkedin.com/in/ramkumarcs31
On Wed, Aug 19, 2015 at 10:20 PM, Hari Shreedharan
hshreedha...@cloudera.com wrote:
It looks like you are having
Will try Scala.
The only reason for not using Scala is that I have never worked with it.
On Wed, Aug 19, 2015 at 7:34 PM, Cody Koeninger c...@koeninger.org wrote:
Is there a reason not to just use scala? It's not a lot of code... and
it'll be even less code in scala ;)
On Wed, Aug 19, 2015 at 4:14 AM,
I'm running a pyspark program on a Mesos cluster and seeing behavior where:
* The first three stages run with multiple (10-15) tasks.
* The fourth stage runs with only one task.
* It is using 10 CPUs, which is 5 machines in this configuration
* It is very slow
I would like it to use more
Hi Axel, what spark version are you using? Also, what do your
configurations look like for the following?
- spark.cores.max (also --total-executor-cores)
- spark.executor.cores (also --executor-cores)
2015-08-19 9:27 GMT-07:00 Axel Dahl a...@whisperstream.com:
hmm maybe I spoke too soon.
I
If you create a PairRDD from the DataFrame, using
dataFrame.toRDD().mapToPair(), then you can call
partitionBy(someCustomPartitioner) which will partition the RDD by the key
(of the pair).
Then the operations on it (like joining with another RDD) will consider
this partitioning.
I'm not sure that
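A rough Scala sketch of that suggestion, assuming an existing DataFrame whose first column is a String key, and an arbitrary partition count of 200:

import org.apache.spark.HashPartitioner

val pairRdd = dataFrame.rdd.map(row => (row.getString(0), row))
val partitioned = pairRdd.partitionBy(new HashPartitioner(200))
// joins with another RDD keyed and partitioned the same way can reuse this partitioning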
Dawid is right, if you did words.count it would be twice the number of input
lines. You can use map like this:
words = lines.map(mapper2)
for i in words.take(10):
    msg = i[0] + ":" + i[1] + "\n"
---
Robin East
I had the exact same issue, and overcame it by overriding
NativeS3FileSystem with my own class, where I replaced the implementation
of globStatus. It's a hack but it works.
Then I set the hadoop config fs.myschema.impl to my class name, and
accessed the files through myschema:// instead of s3n://
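A very rough sketch of that hack, with the class name, scheme and globbing logic all made up (the real value is in whatever replaces the default globStatus body):

import org.apache.hadoop.fs.{FileStatus, Path}
import org.apache.hadoop.fs.s3native.NativeS3FileSystem

class MyS3FileSystem extends NativeS3FileSystem {
  override def globStatus(pathPattern: Path): Array[FileStatus] = {
    // cheaper, custom globbing would go here instead of delegating
    super.globStatus(pathPattern)
  }
}

// register the custom scheme and read through it (sc is an existing SparkContext)
sc.hadoopConfiguration.set("fs.myschema.impl", classOf[MyS3FileSystem].getName)
sc.textFile("myschema://bucket/path")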
Hi Canan,
The event log dir is a per-application setting whereas the history server
is an independent service that serves history UIs from many applications.
If you use history server locally then the `spark.history.fs.logDirectory`
will happen to point to `spark.eventLog.dir`, but the use case
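For reference, in such a local setup the settings typically end up pointing at the same place in conf/spark-defaults.conf (the HDFS path is a placeholder):

spark.eventLog.enabled            true
spark.eventLog.dir                hdfs:///spark-logs
spark.history.fs.logDirectory     hdfs:///spark-logs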
Yes, in other words, a bucket is a single file in hash-based shuffle (no
consolidation), but a segment of a partitioned file in sort-based shuffle.
2015-08-19 5:52 GMT-07:00 Muhammad Haseeb Javed 11besemja...@seecs.edu.pk:
Thanks Andrew for a detailed response,
So the reason why key value pairs
Please look more closely at the Hadoop logs, e.g. yarn logs -applicationId
xxx, and attach more logs to this topic.
Hi,
How do I set the number of executors and tasks in a Spark Streaming job on
Mesos? I have the following settings, but my job still shows me 11 active
tasks and 11 executors. Any idea as to why this is happening?
sparkConf.set("spark.mesos.coarse", "true")
sparkConf.set("spark.cores.max", "128")
Thanks Andrew, makes sense.
On Thu, Aug 20, 2015 at 4:40 AM, Andrew Or and...@databricks.com wrote:
Hi Canan,
The event log dir is a per-application setting whereas the history server
is an independent service that serves history UIs from many applications.
If you use history server locally
I think the YARN ResourceManager has a mechanism to relaunch the driver on
failure, but I am uncertain.
Could someone help on this? Thanks.
At 2015-08-19 16:37:32, Spark Enthusiast sparkenthusi...@yahoo.in wrote:
Thanks for the reply.
Are Standalone or Mesos the only options? Is there a
Hi.
I want to parse a file and return a key-value pair with pySpark, but the
result is strange to me.
test.sql is a big file and each line is a username and password, with
# between them. I use the mapper2 below to map the data, and in my
understanding, i in words.take(10) should be a tuple, but the result
is
Running Spark on YARN in yarn-client mode, everything appears to be working
just fine except I can't access the Spark UI via the Application Master link in
the YARN UI or directly at http://driverip:4040/jobs. I get error 500, and the
driver log shows the error pasted below:
When running the same job
Yes, you can try setting spark.sql.sources.partitionDiscovery.enabled to false.
BTW, which version are you using?
Hao
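In a spark-shell that would be something along these lines (the flag name is taken from the message above):

sqlContext.setConf("spark.sql.sources.partitionDiscovery.enabled", "false")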
From: Jerrick Hoang [mailto:jerrickho...@gmail.com]
Sent: Thursday, August 20, 2015 12:16 PM
To: Philip Weaver
Cc: user
Subject: Re: Spark Sql behaves strangely with tables with
I cloned from TOT after the 1.5.0 cut-off. I noticed there were a couple of CLs
trying to speed up Spark SQL for tables with a huge number of partitions.
I've made sure that those CLs are included, but it's still very slow.
On Wed, Aug 19, 2015 at 10:43 PM, Cheng, Hao hao.ch...@intel.com wrote:
Yes,
Hi All,
I have data in an HDFS partition layout of Year/month/data/event_type. I am
creating Hive tables with this data; the data is in JSON, so I am using the
JSON SerDe to create the Hive tables.
Below is the code:
val jsonFile =
Can you do some more profiling? I am wondering if the driver is busy
scanning HDFS / S3.
For example, jstack <pid of driver process>.
Also, it would be great if you could paste the physical plan for the simple
query.
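For the physical plan, something like this should work in a spark-shell (the query just mirrors the example from this thread):

sqlContext.sql("select count(*) from table where date=20140701").explain(true)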
From: Jerrick Hoang [mailto:jerrickho...@gmail.com]
Sent: Thursday, August
Thank you all. I have updated it to a slightly better version.
Regards,
Sandeep Giri,
+1 347 781 4573 (US)
+91-953-899-8962 (IN)
www.KnowBigData.com
Phone: +1-253-397-1945 (Office)
https://linkedin.com/company/knowbigdata
Dear list,
I am observing a very strange difference in behaviour between a Spark
1.4.0-rc4 REPL (locally compiled with Java 7) and a Spark 1.4.0 Zeppelin
interpreter (compiled with Java 6 and sourced from Maven Central).
The workflow loads data from Hive, applies a number of transformations
There is an option for spark-submit (Spark standalone or Mesos with cluster
deploy mode only):
--supervise    If given, restarts the driver on failure.
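For example (the master URL, class and jar names are placeholders):

spark-submit --master spark://master:7077 --deploy-mode cluster --supervise --class com.example.MyApp /path/to/myapp.jar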
At 2015-08-19 14:55:39, Spark Enthusiast sparkenthusi...@yahoo.in wrote:
Folks,
As I see, the Driver program is a
Just add spark_1.4.1_yarn_shuffle.jar to the classpath, or create a new Maven
project using the dependencies below:
<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-core_2.11</artifactId>
  <version>1.4.1</version>
</dependency>
<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-sql_2.11</artifactId>
I have a simple RDD with Key/Value and do
val partitioned = rdd.partitionBy(new HashPartitioner(400))
val row = partitioned.first
I can get a key G2726 from a returned row. This first row is located on
partition #0 because "G2726".hashCode is 67114000 and 67114000 % 400 is 0. But
the order of
+1
I wish I had read this blog earlier. I am using Java and have just
implemented a singleton producer per executor/JVM during the day.
Yes, I did see that NonSerializableException when I was debugging the code
...
Thanks for sharing.
On Tue, Aug 18, 2015 at 10:59 PM, Tathagata Das
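For anyone curious, the per-executor singleton pattern being discussed looks roughly like this in Scala (the broker, topic and the surrounding rdd are placeholders):

import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

object ProducerHolder {
  // created lazily once per JVM (i.e. per executor); never serialized with the closure
  lazy val producer: KafkaProducer[String, String] = {
    val props = new Properties()
    props.put("bootstrap.servers", "broker1:9092")
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    new KafkaProducer[String, String](props)
  }
}

rdd.foreachPartition { records =>
  records.foreach(r => ProducerHolder.producer.send(new ProducerRecord[String, String]("my-topic", r.toString)))
}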
I can't find any related talk on whether Spark SQL supports column indexing. If it
does, is there a guide on how to do it? Thanks.
Folks,
As I see it, the Driver program is a single point of failure. Now, I have seen
ways to make it recover from failures on a restart (using
checkpointing), but I have not seen anything on how to restart it
automatically if it crashes.
Will running the Driver as a Hadoop Yarn
See this thread:
http://search-hadoop.com/m/q3RTtdZv0d1btRHl/Spark+build+modulesubj=Building+Spark+Building+just+one+module+
On Aug 19, 2015, at 1:44 AM, canan chen ccn...@gmail.com wrote:
I want to work on one jira, but it is not easy to do unit test, because it
involves different
Running Spark on YARN in yarn-client mode, everything appears to be working
just fine except I can't access the Spark UI via the Application Master link in
the YARN UI or directly at http://driverip:4040/jobs. I get error 500, and the
driver log shows the error pasted below:
When running the same job
To correct myself - KafkaRDD[K, V, U, T, R] is a subclass of RDD[R], but
Option[KafkaRDD[K, V, U, T, R]] is not a subclass of Option[RDD[R]].
In Scala, C[T'] can be a subclass of C[T] (covariance), as per
https://twitter.github.io/scala_school/type-basics.html,
but this is not allowed in Java.
So is there any workaround
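A tiny Scala sketch of the variance point, using made-up classes just to show why the assignment compiles in Scala:

class Animal
class Dog extends Animal

val someDog: Option[Dog] = Some(new Dog)
val someAnimal: Option[Animal] = someDog // compiles because Option[+A] is covariant
// Java generics are invariant, so the equivalent assignment does not compile there.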
Thanks a lot for your suggestion. I modified HADOOP_CONF_DIR in
spark-env.sh so that core-site.xml is under HADOOP_CONF_DIR. I can now
see the logs like the ones you had shown above. Now I am able to run for 3
minutes and store results every minute. After some time, there is
an
The answer is simply no,
but I hope someone could give deeper insight or a meaningful reference.
On 2015-08-19 15:21, Todd wrote:
I can't find any related talk on whether Spark SQL supports column indexing. If it
does, is there a guide on how to do it? Thanks.
We are using Cloudera 5.3.1. Since it is one of the earlier versions of CDH,
it doesn't support the latest version of Spark, so I installed Spark 1.4.1
separately on my machine. I was not able to do spark-submit in cluster
mode. How do I put core-site.xml on the classpath? It would be very helpful if you
I want to work on one JIRA, but it is not easy to unit test, because it
involves different components, especially the UI. Building Spark is pretty slow,
and I don't want to build it each time to test my code change. I am wondering
how other people do this. Is there any experience you can share? Thanks
Hi Andrew,
Thanks a lot for your response. I am aware of the '--master' flag in the
spark-submit command. However, I would like to create the SparkContext
inside my code.
Maybe I should elaborate a little bit further: I would like to reuse e.g.
the result of any Spark computation inside my
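A minimal sketch of creating the context programmatically in Spark 1.x, assuming HADOOP_CONF_DIR points at the cluster configuration and the app name is a placeholder:

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf().setAppName("my-app").setMaster("yarn-client")
val sc = new SparkContext(conf)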
Thanks for the reply.
Are Standalone or Mesos the only options? Is there a way to auto-relaunch if
the driver runs as a Hadoop YARN application?
On Wednesday, 19 August 2015 12:49 PM, Todd bit1...@163.com wrote:
There is an option for the spark-submit (Spark standalone or Mesos with
I guess the question is why does Spark have to do partition discovery on
all partitions when the query only needs to look at one partition? Is there
a conf flag to turn this off?
On Wed, Aug 19, 2015 at 9:02 PM, Philip Weaver philip.wea...@gmail.com
wrote:
I've had the same problem. It turns
Hi all,
I did a simple experiment with Spark SQL. I created a partitioned Parquet
table with only one partition (date=20140701). A simple `select count(*)
from table where date=20140701` would run very fast (0.1 seconds). However,
as I added more partitions the query took longer and longer. When
I've had the same problem. It turns out that Spark (specifically Parquet)
is very slow at partition discovery. It got better in 1.5 (not yet
released), but was still unacceptably slow. Sadly, we ended up reading
Parquet files manually in Python (via C++) and had to abandon Spark SQL
because of