On Fri, Jun 5, 2015 at 12:30 PM Marcelo Vanzin van...@cloudera.com wrote:
Ignoring the serialization thing (seems like a red herring):
People seem surprised that I'm getting the Serialization exception at all -
I'm not convinced it's a red herring per se, but on to the blocking issue...
On Fri, Jun 5, 2015 at 1:00 PM Igor Berman igor.ber...@gmail.com wrote:
Lee, what cluster do you use? standalone, yarn-cluster, yarn-client, mesos?
Spark standalone, v1.2.1.
On Fri, Jun 5, 2015 at 12:58 PM Marcelo Vanzin van...@cloudera.com wrote:
You didn't show the error so the only thing we can do is speculate. You're
probably sending the object that's holding the SparkContext reference over
the network at some point (e.g. it's used by a task run in an
On Fri, Jun 5, 2015 at 2:05 PM Will Briggs wrbri...@gmail.com wrote:
Your lambda expressions on the RDDs in the SecondRollup class are closing
around the context, and Spark has special logic to ensure that all
variables in a closure used on an RDD are Serializable - I hate linking to
Quora,
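To illustrate the pattern Will is describing, here is a minimal sketch (only the class name comes from the thread; the body is hypothetical):

import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

class SecondRollup(sc: SparkContext) {
  def broken(rdd: RDD[String]): Long =
    // The lambda references the member sc, so Spark must serialize `this`,
    // dragging the non-serializable SparkContext into the closure.
    rdd.filter(line => line.length > sc.defaultParallelism).count()

  def fixed(rdd: RDD[String]): Long = {
    // Copy the needed value into a local val so the closure no longer captures `this`.
    val threshold = sc.defaultParallelism
    rdd.filter(line => line.length > threshold).count()
  }
}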
Just wondering if we have any timeline on when the hive skew flag will be
included within SparkSQL?
Thanks!
Denny
Delete from table is available as part of Hive 0.14 (reference: Apache Hive
Language Manual DML - Delete
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DML#LanguageManualDML-Delete)
while Spark 1.3 defaults to Hive 0.13. Perhaps rebuild Spark with Hive
0.14 or generate a new
Python dependency management.
As far as I can tell, there is no core issue, upstream or otherwise.
On Tue, May 12, 2015 at 11:39 AM, Lee McFadden splee...@gmail.com wrote:
Thanks again for all the help folks.
I can confirm that simply switching to `--packages
org.apache.spark:spark
Thanks again for all the help folks.
I can confirm that simply switching to `--packages
org.apache.spark:spark-streaming-kafka-assembly_2.10:1.3.1` makes
everything work as intended.
I'm not sure what the difference is between the two packages honestly, or
why one should be used over the other,
:
com.yammer.metrics.core.Gauge is in metrics-core jar
e.g., in master branch:
[INFO] | \- org.apache.kafka:kafka_2.10:jar:0.8.1.1:compile
[INFO] | +- com.yammer.metrics:metrics-core:jar:2.2.0:compile
Please make sure metrics-core jar is on the classpath.
On Mon, May 11, 2015 at 1:32 PM, Lee McFadden splee
Hi,
We've been having some issues getting spark streaming running correctly
using a Kafka stream, and we've been going around in circles trying to
resolve this dependency.
Details of our environment and the error below, if anyone can help resolve
this it would be much appreciated.
Submit
in the assembly, is it? You'd
have to provide it and all its dependencies with your app. You could
also build this into your own app jar. Tools like Maven will add in
the transitive dependencies.
On Mon, May 11, 2015 at 10:04 PM, Lee McFadden splee...@gmail.com wrote:
Thanks Ted
itself can't be introducing java dependency clashes?
On Mon, May 11, 2015, 4:34 PM Lee McFadden splee...@gmail.com wrote:
Ted, many thanks. I'm not used to Java dependencies so this was a real
head-scratcher for me.
Downloading the two metrics packages from the maven repository
(metrics-core
Similar to what Dean called out, we built Puppet manifests so we could do
the automation - it's a bit of work to set up, but well worth the effort.
On Fri, Apr 24, 2015 at 11:27 AM Dean Wampler deanwamp...@gmail.com wrote:
It's mostly manual. You could try automating with something like Chef, or
You may need to specify the hive port itself. For example, my own Thrift
start command is in the form:
./sbin/start-thriftserver.sh --master spark://$myserver:7077
--driver-class-path $CLASSPATH --hiveconf hive.server2.thrift.bind.host
$myserver --hiveconf hive.server2.thrift.port 1
HTH!
@spark.apache.org
I think you want to take a look at:
https://issues.apache.org/jira/browse/SPARK-6207
On Mon, Apr 20, 2015 at 1:58 PM, Andrew Lee alee...@hotmail.com wrote:
Hi All,
Affected version: spark 1.2.1 / 1.2.2 / 1.3-rc1
Posting this problem to user group first to see if someone
Thanks for the correction Mark :)
On Sun, Apr 19, 2015 at 3:45 PM Mark Hamstra m...@clearstorydata.com
wrote:
Almost. Jobs don't get skipped. Stages and Tasks do if the needed
results are already available.
On Sun, Apr 19, 2015 at 3:18 PM, Denny Lee denny.g@gmail.com wrote:
The job
Support for subqueries in predicates hasn't been resolved yet - please
refer to SPARK-4226
BTW, Spark 1.3 default bindings to Hive 0.13.1
On Fri, Apr 17, 2015 at 09:18 ARose ashley.r...@telarix.com wrote:
So I'm trying to store the results of a query into a DataFrame, but I get
the
Bummer - out of curiosity, if you were to use classpath.first or
perhaps copy the jar to the slaves, could that actually do the trick? The
latter isn't really all that efficient, but I'm just curious whether it would
work.
On Thu, Apr 16, 2015 at 7:14 AM ARose ashley.r...@telarix.com wrote:
If you're doing this in Scala per se, then you can probably just reference
JodaTime or the Java Date / Time classes. If you are using SparkSQL, then you can
use the various Hive date functions for conversion.
On Tue, Apr 14, 2015 at 11:04 AM BASAK, ANANDA ab9...@att.com wrote:
I need some help to convert
By default Spark 1.3 has bindings to Hive 0.13.1 though you can bind it to
Hive 0.12 if you specify it in the profile when building Spark as per
https://spark.apache.org/docs/1.3.0/building-spark.html.
If you are downloading a pre built version of Spark 1.3 - then by default,
it is set to Hive 0.13.1.
Can you create the database directly within Hive? If you're getting the
same error within Hive, it sounds like a permissions issue as per Bojan.
More info can be found at:
http://stackoverflow.com/questions/15898211/unable-to-create-database-path-file-user-hive-warehouse-error
On Thu, Apr 9,
At this time, the JDBC data source is not extensible so it cannot support
SQL Server. There were some thoughts - credit to Cheng Lian for this -
about making the JDBC data source extensible for third party support
possibly via slick.
On Mon, Apr 6, 2015 at 10:41 PM bipin bipin@gmail.com
That's correct - at this time, MS SQL Server is not supported through the
JDBC data source. In my environment, we've been using Hadoop
streaming to extract out data from multiple SQL Servers, pushing the data
into HDFS, creating the Hive tables and/or converting them into Parquet,
and
something like this would work. You might need to play with the
type.
df.explode(arrayBufferColumn) { x => x }
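For the ArrayBuffer case in the original question, a minimal sketch using the DataFrame explode call (the column names here are hypothetical):

// df has a schema like (month: String, values: ArrayBuffer[String])
val flattened = df.explode("values", "value") { vs: Seq[String] => vs }
flattened.select("month", "value").show()
// 2015-04  A
// 2015-04  B
// ...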
On Fri, Apr 3, 2015 at 6:43 AM, Denny Lee denny.g@gmail.com wrote:
Thanks Dean - fun hack :)
On Fri, Apr 3, 2015 at 6:11 AM Dean Wampler deanwamp...@gmail.com
wrote:
A hack
://typesafe.com
@deanwampler http://twitter.com/deanwampler
http://polyglotprogramming.com
On Thu, Apr 2, 2015 at 10:45 PM, Denny Lee denny.g@gmail.com wrote:
Thanks Michael - that was it! I was drawing a blank on this one for some
reason - much appreciated!
On Thu, Apr 2, 2015 at 8:27 PM
Quick question - the output of a dataframe is in the format of:
[2015-04, ArrayBuffer(A, B, C, D)]
and I'd like to return it as:
2015-04, A
2015-04, B
2015-04, C
2015-04, D
What's the best way to do this?
Thanks in advance!
, Apr 2, 2015 at 7:10 PM, Denny Lee denny.g@gmail.com wrote:
Quick question - the output of a dataframe is in the format of:
[2015-04, ArrayBuffer(A, B, C, D)]
and I'd like to return it as:
2015-04, A
2015-04, B
2015-04, C
2015-04, D
What's the best way to do this?
Thanks
Thanks Felix :)
On Wed, Apr 1, 2015 at 00:08 Felix Cheung felixcheun...@hotmail.com wrote:
This is tracked by these JIRAs..
https://issues.apache.org/jira/browse/SPARK-5947
https://issues.apache.org/jira/browse/SPARK-5948
--
From: denny.g@gmail.com
Date:
Hi Vincent,
This may be a case where you're missing a semi-colon after your CREATE
TEMPORARY TABLE statement. I ran your original statement (missing the
semi-colon) and got the same error as you did. As soon as I added it in, I
was good to go again:
CREATE TEMPORARY TABLE jsonTable
USING
version matters here, but I did observe
cases where Spark behaves differently because of semantic differences of
the same API in different Hadoop versions.
Cheng
On 3/27/15 11:33 AM, Pei-Lun Lee wrote:
Hi Cheng,
on my computer, execute res0.save(xxx, org.apache.spark.sql.SaveMode.Overwrite
Upon reviewing your other thread, could you confirm that your Hive
metastore that you can connect to via Hive is a MySQL database? And to
also confirm, when you're running spark-shell and doing a show tables
statement, you're getting the same error?
On Fri, Mar 27, 2015 at 6:08 AM ÐΞ€ρ@Ҝ (๏̯͡๏)
1.0.4. Would you
mind opening a JIRA for this?
Cheng
On 3/27/15 2:40 PM, Pei-Lun Lee wrote:
I'm using 1.0.4
Thanks,
--
Pei-Lun
On Fri, Mar 27, 2015 at 2:32 PM, Cheng Lian lian.cs@gmail.com wrote:
Hm, which version of Hadoop are you using? Actually there should also
be a _metadata
If you're not using MySQL as your metastore for Hive, out of curiosity what
are you using?
The error you are seeing is common when the correct driver isn't available
for Spark to connect to the Hive metastore.
As well, I noticed that you're using
, and thus can be faster to
read than _metadata.
Cheng
On 3/26/15 12:48 PM, Pei-Lun Lee wrote:
Hi,
When I save a parquet file with SaveMode.Overwrite, it never generates
_common_metadata, whether it overwrites an existing dir or not.
Is this expected behavior?
And what is the benefit
BTW, a tool that I have been using to help do the preaggregation of data
using hyperloglog in combination with Spark is atscale (http://atscale.com/).
It builds the aggregations and makes use of the speed of SparkSQL - all
within the context of a model that is accessible by Tableau or Qlik.
On
As you noted, you can change the spark.driver.maxResultSize value in your
Spark Configurations (https://spark.apache.org/docs/1.2.0/configuration.html).
Please reference the Spark Properties section noting that you can modify
these properties via the spark-defaults.conf or via SparkConf().
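For example, a minimal sketch of both options (the 2g value is just an illustrative choice):

val conf = new org.apache.spark.SparkConf()
  .set("spark.driver.maxResultSize", "2g")
// or, equivalently, a line in spark-defaults.conf:
//   spark.driver.maxResultSize  2g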
HTH!
I updated the PR for SPARK-6352 to be more like SPARK-3595.
I added a new setting spark.sql.parquet.output.committer.class in hadoop
configuration to allow custom implementation of ParquetOutputCommitter.
Can someone take a look at the PR?
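As a rough sketch of how such a setting would be applied through the Hadoop configuration (the committer class below is a hypothetical custom implementation, not something shipped with Spark):

sc.hadoopConfiguration.set(
  "spark.sql.parquet.output.committer.class",
  "com.example.parquet.DirectS3ParquetOutputCommitter")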
On Mon, Mar 16, 2015 at 5:23 PM, Pei-Lun Lee pl
Hi,
When I save a parquet file with SaveMode.Overwrite, it never generates
_common_metadata, whether it overwrites an existing dir or not.
Is this expected behavior?
And what is the benefit of _common_metadata? Will reading perform better
when it is present?
Thanks,
--
Pei-Lun
Perhaps this email reference may be able to help from a DataFrame
perspective:
http://mail-archives.apache.org/mod_mbox/incubator-spark-user/201503.mbox/%3CCALte62ztepahF=5hk9rcfbnyk4z43wkcq4fkdcbwmgf_3_o...@mail.gmail.com%3E
On Wed, Mar 25, 2015 at 7:29 PM Haopu Wang hw...@qilinsoft.com wrote:
*
Cheers,
Sandeep.v
On Wed, Mar 25, 2015 at 11:10 AM, sandeep vura sandeepv...@gmail.com
wrote:
No, I am just running the ./spark-shell command in the terminal. I will try
with the above command.
On Wed, Mar 25, 2015 at 11:09 AM, Denny Lee denny.g@gmail.com
wrote:
Did you include the connection
Did you include the connection to a MySQL connector jar so that way
spark-shell / hive can connect to the metastore?
For example, when I run my spark-shell instance in standalone mode, I use:
./spark-shell --master spark://servername:7077 --driver-class-path
/lib/mysql-connector-java-5.1.27.jar
By any chance does this thread address look similar:
http://apache-spark-developers-list.1001551.n3.nabble.com/Lost-executor-on-YARN-ALS-iterations-td7916.html
?
On Tue, Mar 24, 2015 at 5:23 AM Harut Martirosyan
harut.martiros...@gmail.com wrote:
What is the performance overhead caused by YARN,
Hadoop 2.5 would be referenced via -Dhadoop.version=2.5.x while using the
profile -Phadoop-2.4.
Please note earlier in the link the section:
# Apache Hadoop 2.4.X or 2.5.X
mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=VERSION -DskipTests clean package
Versions of Hadoop after 2.5.X may or may not work with the
You may be able to utilize Spork (Pig on Apache Spark) as a mechanism to do
this: https://github.com/sigmoidanalytics/spork
On Mon, Mar 23, 2015 at 2:29 AM Dai, Kevin yun...@ebay.com wrote:
Hi, all
Can spark use pig’s load function to load data?
Best Regards,
Kevin.
+1 - I currently am doing what Marcelo is suggesting as I have a CDH 5.2
cluster (with Spark 1.1) and I'm also running Spark 1.3.0+ side-by-side in
my cluster.
On Wed, Mar 18, 2015 at 1:23 PM Marcelo Vanzin van...@cloudera.com wrote:
Since you're using YARN, you should be able to download a
From the standpoint of Spark SQL accessing the files - when it is hitting
Hive, it is in effect hitting HDFS as well. Hive provides a great
framework where the table structure is already well defined. But
underneath it, Hive is just accessing files from HDFS so you are hitting
HDFS either way.
How are you running your spark instance out of curiosity? Via YARN or
standalone mode? When connecting Spark thriftserver to the Spark service,
have you allocated enough memory and CPU when executing with spark?
On Sun, Mar 22, 2015 at 3:39 AM fanooos dev.fano...@gmail.com wrote:
We have
JIRA and PR for first issue:
https://issues.apache.org/jira/browse/SPARK-6408
https://github.com/apache/spark/pull/5087
On Thu, Mar 19, 2015 at 12:20 PM, Pei-Lun Lee pl...@appier.com wrote:
Hi,
I am trying jdbc data source in spark sql 1.3.0 and found some issues.
First, the syntax where
Hi,
I am trying jdbc data source in spark sql 1.3.0 and found some issues.
First, the syntax where str_col='value' will give error for both
postgresql and mysql:
psql> create table foo(id int primary key, name text, age int);
bash> SPARK_CLASSPATH=postgresql-9.4-1201-jdbc41.jar
that direct dependency makes this injection much more
difficult for saveAsParquetFile.
On Thu, Mar 5, 2015 at 12:28 AM, Pei-Lun Lee pl...@appier.com wrote:
Thanks for the DirectOutputCommitter example.
However I found it only works for saveAsHadoopFile. What about
saveAsParquetFile?
It looks
Hi Rares,
If you dig into the descriptions for the two jobs, it will probably return
something like:
Job ID: 1
org.apache.spark.rdd.RDD.takeSample(RDD.scala:447)
$line41.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.init(console:22)
...
Job ID: 0
Thanks for the DirectOutputCommitter example.
However I found it only works for saveAsHadoopFile. What about
saveAsParquetFile?
It looks like SparkSQL is using ParquetOutputCommitter, which is a subclass
of FileOutputCommitter.
On Fri, Feb 27, 2015 at 1:52 AM, Thomas Demoor
It depends on your setup but one of the locations is /var/log/mesos
On Wed, Mar 4, 2015 at 19:11 lisendong lisend...@163.com wrote:
I'm sorry, but how do I look at the Mesos logs?
Where are they?
On Mar 4, 2015, at 6:06 PM, Akhil Das ak...@sigmoidanalytics.com wrote:
You can check in the mesos logs
Hi Suhel,
My team is currently working with a lot of SQL Server databases as one of
our many data sources and ultimately we pull the data into HDFS from SQL
Server. As we had a lot of SQL databases to hit, we used the jTDS driver
and SQOOP to extract the data out of SQL Server and into HDFS
The error message you have is:
FAILED: Execution Error, return code 1 from
org.apache.hadoop.hive.ql.exec.DDLTask.
MetaException(message:file:/user/hive/warehouse/src is not a directory or
unable to create one)
Could you verify that you (the user you are running under) have the rights
to create
<description>location of default database for the warehouse</description>
</property>
Do I need to do anything explicitly other than placing hive-site.xml in
the Spark conf directory?
Thanks !!
On Wed, Feb 25, 2015 at 11:42 AM, Denny Lee denny.g@gmail.com wrote:
The error message
It may have to do with the akka heartbeat interval per SPARK-3923 -
https://issues.apache.org/jira/plugins/servlet/mobile#issue/SPARK-3923 ?
On Tue, Feb 24, 2015 at 16:40 Xi Shen davidshe...@gmail.com wrote:
Hi Sean,
I launched the spark-shell on the same machine as I started YARN service.
I
paper!! We were already
using it as a guideline for our tests.
Best regards,
Francisco
--
From: Denny Lee denny.g@gmail.com
Sent: 22/02/2015 17:56
To: Ashic Mahtab as...@live.com; Francisco Orchard forch...@gmail.com;
Apache Spark user@spark.apache.org
, Davies Liu dav...@databricks.com wrote:
How many executors do you have per machine? It would be helpful if you
could list all the configs.
Could you also try to run it without persist? Caching can hurt more than
help if you don't have enough memory.
On Fri, Feb 20, 2015 at 5:18 PM, Lee Bierman leebier
Hi Francisco,
Out of curiosity - why ROLAP mode with a multi-dimensional model (vs tabular)
from SSAS to Spark? As a past SSAS guy you've definitely piqued my
interest.
The one thing that you may run into is that the SQL generated by SSAS can
be quite convoluted. When we were doing the same thing
Back to thrift, there was an earlier thread on this topic at
http://mail-archives.apache.org/mod_mbox/spark-user/201411.mbox/%3CCABPQxsvXA-ROPeXN=wjcev_n9gv-drqxujukbp_goutvnyx...@mail.gmail.com%3E
that may be useful as well.
On Sun Feb 22 2015 at 8:42:29 AM Denny Lee denny.g@gmail.com wrote
.
On Fri, Feb 20, 2015 at 9:55 AM, Denny Lee denny.g@gmail.com wrote:
Quickly reviewing the latest SQL Programming Guide
https://github.com/apache/spark/blob/master/docs/sql-programming-guide.md
(in github) I had a couple of quick questions:
1) Do we need to instantiate the SparkContext
Thanks for the suggestions.
I'm experimenting with different values for spark memoryOverhead and
explicitly giving the executors more memory, but still have not found the
golden medium to get it to finish in a proper time frame.
Is my cluster massively undersized at 5 boxes, 8 GB / 2 CPUs each?
Trying to
Quickly reviewing the latest SQL Programming Guide
https://github.com/apache/spark/blob/master/docs/sql-programming-guide.md
(in github) I had a couple of quick questions:
1) Do we need to instantiate the SparkContext as per
// sc is an existing SparkContext.
val sqlContext = new
or insights on what I'm missing here.
Thanks for the assistance.
-Todd
On Wed, Feb 11, 2015 at 3:20 PM, Andrew Lee alee...@hotmail.com wrote:
Sorry folks, it is executing Spark jobs instead of Hive jobs. I misread the
logs since there were other activities going on on the cluster.
From: alee
Hi All,
Just want to give everyone an update on what worked for me. Thanks for Cheng's
comment and other people's help.
So what I misunderstood was the --driver-class-path and how that was related to
--files. I put /etc/hive/hive-site.xml in both --files and
--driver-class-path when I
Sorry folks, it is executing Spark jobs instead of Hive jobs. I misread the
logs since there were other activities going on on the cluster.
From: alee...@hotmail.com
To: ar...@sigmoidanalytics.com; tsind...@gmail.com
CC: user@spark.apache.org
Subject: RE: SparkSQL + Tableau Connector
Date: Wed,
I have ThriftServer2 up and running, however, I notice that it relays the query
to HiveServer2 when I pass the hive-site.xml to it.
I'm not sure if this is the expected behavior, but based on what I have up and
running, the ThriftServer2 invokes HiveServer2 that results in MapReduce or Tez
It looks like this is related to the underlying Hadoop configuration.
Try to deploy the Hadoop configuration with your job with --files and
--driver-class-path, or to the default /etc/hadoop/conf core-site.xml.
If that is not an option (depending on how your Hadoop cluster is set up), then
hard
and tableau can extract that RDD persisted on hive.
Regards,
Ashutosh
--
*From:* Denny Lee denny.g@gmail.com
*Sent:* Thursday, February 5, 2015 1:27 PM
*To:* Ashutosh Trivedi (MT2013030); İsmail Keskin
*Cc:* user@spark.apache.org
*Subject:* Re: Tableau beta
works.
--
*From:* Denny Lee denny.g@gmail.com
*Sent:* Thursday, February 5, 2015 12:20 PM
*To:* İsmail Keskin; Ashutosh Trivedi (MT2013030)
*Cc:* user@spark.apache.org
*Subject:* Re: Tableau beta connector
Some quick context behind how Tableau interacts
Hi Ningjun,
I have been working with Spark 1.2 on Windows 7 and Windows 2008 R2 (purely
for development purposes). I had most recently installed them utilizing
Java 1.8, Scala 2.10.4, and Spark 1.2 Precompiled for Hadoop 2.4+. A handy
thread concerning the null\bin\winutils issue is addressed
A great presentation by Evan Chan on utilizing Cassandra as Jonathan noted
is at: OLAP with Cassandra and Spark
http://www.slideshare.net/EvanChan2/2014-07olapcassspark.
On Tue Feb 03 2015 at 10:03:34 AM Jonathan Haddad j...@jonhaddad.com wrote:
Write out the rdd to a cassandra table. The
, Denny Lee denny.g@gmail.com wrote:
I may be missing something here, but typically the hive-site.xml
configurations do not require you to place an "s" within the configuration
itself. Both the retry.delay and socket.timeout values are in seconds, so
you should only need to place the integer value
I may be missing something here, but typically the hive-site.xml
configurations do not require you to place an "s" within the configuration
itself. Both the retry.delay and socket.timeout values are in seconds, so
you should only need to place the integer value (which is in seconds).
On Sun Feb
I've been working with Spark 1.2 and Mesos 0.21.0 and while I have set the
spark.executor.uri within spark-env.sh (and directly within bash as well),
the Mesos slaves do not seem to be able to access the spark tgz file via
HTTP or HDFS as per the message below.
14/12/30 15:57:35 INFO SparkILoop:
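For anyone comparing notes, the property in question can also be set programmatically; a minimal sketch with a hypothetical URI:

val conf = new org.apache.spark.SparkConf()
  .set("spark.executor.uri", "hdfs://namenode:8020/dist/spark-1.2.0-bin-hadoop2.4.tgz")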
Hi All,
I have tried to pass the properties via SparkContext.setLocalProperty and
HiveContext.setConf; both failed. Based on the results (I haven't had a chance to
look into the code yet), HiveContext will try to initiate the JDBC connection
right away, so I couldn't set other properties
A follow-up on the hive-site.xml: if you
1. Specify it in spark/conf, then you can NOT also apply it via the
--driver-class-path option; otherwise, you will get the following exception
when initializing SparkContext.
org.apache.spark.SparkException: Found both spark.driver.extraClassPath
You should be able to kill the job using the webUI or via spark-class.
More info can be found in the thread:
http://apache-spark-user-list.1001560.n3.nabble.com/How-to-kill-a-Spark-job-running-in-cluster-mode-td18583.html.
HTH!
On Tue, Dec 23, 2014 at 4:47 PM, durga durgak...@gmail.com wrote:
To clarify, there isn't a Hadoop 2.6 profile per se, but you can build using
the -Phadoop-2.4 profile with -Dhadoop.version=2.6.0, which works with Hadoop 2.6.
On Fri, Dec 19, 2014 at 12:55 Ted Yu yuzhih...@gmail.com wrote:
You can use hadoop-2.4 profile and pass -Dhadoop.version=2.6.0
Cheers
On Fri, Dec 19, 2014 at 12:51
Lee denny.g@gmail.com wrote:
To clarify, there isn't a Hadoop 2.6 profile per se, but you can build
using the -Phadoop-2.4 profile with -Dhadoop.version=2.6.0, which works with Hadoop 2.6.
On Fri, Dec 19, 2014 at 12:55 Ted Yu yuzhih...@gmail.com wrote:
You can use hadoop-2.4 profile and pass -Dhadoop.version=2.6.0
Cheers
I'm curious if you're seeing the same thing when using bdutil against GCS?
I'm wondering if this may be an issue concerning the transfer rate of Spark
-> Hadoop -> GCS Connector -> GCS.
On Wed Dec 17 2014 at 10:09:17 PM Alessandro Baretta alexbare...@gmail.com
wrote:
All,
I'm using the Spark
. See the
following.
alex@hadoop-m:~/split$ time bash -c "gsutil ls gs://my-bucket/20141205/csv/*/*/* | wc -l"
6860
real    0m6.971s
user    0m1.052s
sys     0m0.096s
Alex
On Wed, Dec 17, 2014 at 10:29 PM, Denny Lee denny.g@gmail.com wrote:
I'm curious if you're seeing the same
to test this? But more importantly, what
information would this give me?
On Wed, Dec 17, 2014 at 10:46 PM, Denny Lee denny.g@gmail.com wrote:
Oh, it makes sense that gsutil scans through this quickly, but I was
wondering if running a Hadoop job / bdutil would result in just as fast
scans
I have a large number of files within HDFS that I would like to do a group by
statement ala
val table = sc.textFile(hdfs://)
val tabs = table.map(_.split("\t"))
I'm trying to do something similar to
tabs.map(c => (c._(167), c._(110), c._(200)))
where I create a new RDD that only has
but that isn't
looks like
the way to go given the context. What's not working?
Kr, Gerard
On Dec 14, 2014 5:17 PM, Denny Lee denny.g@gmail.com wrote:
I have a large number of files within HDFS that I would like to do a group by
statement ala
val table = sc.textFile(hdfs://)
val tabs = table.map(_.split
Yes - that works great! Sorry for implying I couldn't. Was just more
flummoxed that I couldn't make the Scala call work on its own. Will
continue to debug ;-)
On Sun, Dec 14, 2014 at 11:39 Michael Armbrust mich...@databricks.com
wrote:
BTW, I cannot use SparkSQL / case right now because my table
tabs.map(c => (c(167), c(110), c(200))) instead of tabs.map(c => (c._(167),
c._(110), c._(200)))
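For completeness, a minimal end-to-end sketch combining the two (the hdfs path is a placeholder; the column indices are the ones from the thread):

val table = sc.textFile("hdfs://...")
val tabs = table.map(_.split("\t"))
val projected = tabs.map(c => (c(167), c(110), c(200)))  // c(i), not c._(i)
val grouped = projected.groupBy(_._1)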
On Sun, Dec 14, 2014 at 3:12 PM, Denny Lee denny.g@gmail.com wrote:
Yes - that works great! Sorry for implying I couldn't. Was just more
flummoxed that I couldn't make the Scala call work on its
Hi Xiaoyong,
SparkSQL has already been released and has been part of the Spark code-base
since Spark 1.0. The latest stable release is Spark 1.1 (here's the Spark
SQL Programming Guide
http://spark.apache.org/docs/1.1.0/sql-programming-guide.html) and we're
currently voting on Spark 1.2.
Hive
Yes, that is correct. A quick reference on this is the post
https://www.linkedin.com/pulse/20141007143323-732459-an-absolutely-unofficial-way-to-connect-tableau-to-sparksql-spark-1-1?_mSplash=1
with the pertinent section being:
It is important to note that when you create Spark tables (for
Thanks Sandy!
On Mon, Dec 8, 2014 at 23:15 Sandy Ryza sandy.r...@cloudera.com wrote:
Another thing to be aware of is that YARN will round up containers to the
nearest increment of yarn.scheduler.minimum-allocation-mb, which defaults
to 1024.
-Sandy
On Sat, Dec 6, 2014 at 3:48 PM, Denny Lee
This is perhaps more of a YARN question than a Spark question, but I was
just curious as to how memory is allocated in YARN via the various
configurations. For example, if I spin up my cluster with 4GB with a
different number of executors as noted below
4GB executor-memory x 10 executors = 46GB
* executorMemory.
When you set executor memory, the yarn resource request is executorMemory
+ yarnOverhead.
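A rough illustration of that arithmetic, assuming the 1.x-style defaults of max(384 MB, 7% of executor memory) for the overhead and a 1024 MB yarn.scheduler.minimum-allocation-mb (both are assumptions):

val executorMemoryMb = 4096
val overheadMb = math.max(384, (0.07 * executorMemoryMb).toInt)
val minAllocMb = 1024
val containerMb =
  math.ceil((executorMemoryMb + overheadMb).toDouble / minAllocMb).toInt * minAllocMb
// containerMb == 5120, so each "4 GB" executor actually occupies 5 GB in YARN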
- Arun
On Sat, Dec 6, 2014 at 4:27 PM, Denny Lee denny.g@gmail.com wrote:
This is perhaps more of a YARN question than a Spark question, but I was
just curious as to how memory is allocated
My submissions of Spark on YARN (CDH 5.2) resulted in a few thousand steps.
If I was running this on standalone cluster mode the query finished in 55s
but on YARN, the query was still running 30min later. Would the hard coded
sleeps potentially be in play here?
On Fri, Dec 5, 2014 at 11:23 Sandy
, and --num-executors
arguments? When running against a standalone cluster, by default Spark
will make use of all the cluster resources, but when running against YARN,
Spark defaults to a couple tiny executors.
-Sandy
On Fri, Dec 5, 2014 at 11:32 AM, Denny Lee denny.g@gmail.com
wrote:
My
Okay, my bad for not testing out the documented arguments - once I used the
correct ones, the query completes in ~55s (I can probably make it
faster). Thanks for the help, eh?!
On Fri Dec 05 2014 at 10:34:50 PM Denny Lee denny.g@gmail.com wrote:
Sorry for the delay in my response
To determine if this is a Windows vs. other configuration, can you just try
to call the Spark-class.cmd SparkSubmit without actually referencing the
Hadoop or Thrift server classes?
On Tue Nov 25 2014 at 5:42:09 PM Judy Nash judyn...@exchange.microsoft.com
wrote:
I traced the code and used
By any chance are you using Spark 1.0.2? registerTempTable was introduced
from Spark 1.1+ while for Spark 1.0.2, it would be registerAsTable.
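A minimal sketch of the rename (the json path is the example file from the programming guide):

val people = sqlContext.jsonFile("examples/src/main/resources/people.json")
people.registerTempTable("people")    // Spark 1.1+
// registerAsTable("people") was the Spark 1.0.x name for the same call
sqlContext.sql("SELECT name FROM people").collect()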
On Sun Nov 23 2014 at 10:59:48 AM riginos samarasrigi...@gmail.com wrote:
Hi guys,
I'm trying to do the Spark SQL Programming Guide but after the:
It sort of depends on your environment. If you are running on your local
environment, I would just download the latest Spark 1.1 binaries and you'll
be good to go. If its a production environment, it sort of depends on how
you are setup (e.g. AWS, Cloudera, etc.)
On Sun Nov 23 2014 at 11:27:49
extraction job against multiple data sources via Hadoop streaming.
Another good call out for utilizing Scala within Spark is that most of the
Spark code is written in Scala.
On Sat, Nov 22, 2014 at 08:12 Denny Lee denny.g@gmail.com wrote:
There are various scenarios where traditional Hadoop