Re: SparkContext Threading

2015-06-05 Thread Lee McFadden
On Fri, Jun 5, 2015 at 12:30 PM Marcelo Vanzin van...@cloudera.com wrote: Ignoring the serialization thing (seems like a red herring): People seem surprised that I'm getting the Serialization exception at all - I'm not convinced it's a red herring per se, but on to the blocking issue...

Re: SparkContext Threading

2015-06-05 Thread Lee McFadden
On Fri, Jun 5, 2015 at 1:00 PM Igor Berman igor.ber...@gmail.com wrote: Lee, what cluster do you use? standalone, yarn-cluster, yarn-client, mesos? Spark standalone, v1.2.1.

Re: SparkContext Threading

2015-06-05 Thread Lee McFadden
On Fri, Jun 5, 2015 at 12:58 PM Marcelo Vanzin van...@cloudera.com wrote: You didn't show the error so the only thing we can do is speculate. You're probably sending the object that's holding the SparkContext reference over the network at some point (e.g. it's used by a task run in an

Re: SparkContext Threading

2015-06-05 Thread Lee McFadden
On Fri, Jun 5, 2015 at 2:05 PM Will Briggs wrbri...@gmail.com wrote: Your lambda expressions on the RDDs in the SecondRollup class are closing around the context, and Spark has special logic to ensure that all variables in a closure used on an RDD are Serializable - I hate linking to Quora,

Hive Skew flag?

2015-05-15 Thread Denny Lee
Just wondering if we have any timeline on when the hive skew flag will be included within SparkSQL? Thanks! Denny

Re: how to delete data from table in sparksql

2015-05-14 Thread Denny Lee
Delete from table is available as part of Hive 0.14 (reference: Apache Hive Language Manual DML - Delete https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DML#LanguageManualDML-Delete) while Spark 1.3 defaults to Hive 0.13. Perhaps rebuild Spark with Hive 0.14 or generate a new

Re: Kafka stream fails: java.lang.NoClassDefFound com/yammer/metrics/core/Gauge

2015-05-12 Thread Lee McFadden
Python dependency management. As far as I can tell, there is no core issue, upstream or otherwise. On Tue, May 12, 2015 at 11:39 AM, Lee McFadden splee...@gmail.com wrote: Thanks again for all the help folks. I can confirm that simply switching to `--packages org.apache.spark:spark

Re: Kafka stream fails: java.lang.NoClassDefFound com/yammer/metrics/core/Gauge

2015-05-12 Thread Lee McFadden
Thanks again for all the help folks. I can confirm that simply switching to `--packages org.apache.spark:spark-streaming-kafka-assembly_2.10:1.3.1` makes everything work as intended. I'm not sure what the difference is between the two packages honestly, or why one should be used over the other,

Re: Kafka stream fails: java.lang.NoClassDefFound com/yammer/metrics/core/Gauge

2015-05-11 Thread Lee McFadden
com.yammer.metrics.core.Gauge is in the metrics-core jar, e.g., in master branch:
[INFO] | \- org.apache.kafka:kafka_2.10:jar:0.8.1.1:compile
[INFO] |    +- com.yammer.metrics:metrics-core:jar:2.2.0:compile
Please make sure the metrics-core jar is on the classpath. On Mon, May 11, 2015 at 1:32 PM, Lee McFadden splee

Kafka stream fails: java.lang.NoClassDefFound com/yammer/metrics/core/Gauge

2015-05-11 Thread Lee McFadden
Hi, We've been having some issues getting spark streaming running correctly using a Kafka stream, and we've been going around in circles trying to resolve this dependency. Details of our environment and the error below, if anyone can help resolve this it would be much appreciated. Submit

Re: Kafka stream fails: java.lang.NoClassDefFound com/yammer/metrics/core/Gauge

2015-05-11 Thread Lee McFadden
in the assembly, is it? you'd have to provide it and all its dependencies with your app. You could also build this into your own app jar. Tools like Maven will add in the transitive dependencies. On Mon, May 11, 2015 at 10:04 PM, Lee McFadden splee...@gmail.com wrote: Thanks Ted

Re: Kafka stream fails: java.lang.NoClassDefFound com/yammer/metrics/core/Gauge

2015-05-11 Thread Lee McFadden
itself can't be introducing java dependency clashes? On Mon, May 11, 2015, 4:34 PM Lee McFadden splee...@gmail.com wrote: Ted, many thanks. I'm not used to Java dependencies so this was a real head-scratcher for me. Downloading the two metrics packages from the maven repository (metrics-core

Re: Spark Cluster Setup

2015-04-27 Thread Denny Lee
Similar to what Dean called out, we build Puppet manifests so we could do the automation - it's a bit of work to set up, but well worth the effort. On Fri, Apr 24, 2015 at 11:27 AM Dean Wampler deanwamp...@gmail.com wrote: It's mostly manual. You could try automating with something like Chef, of

Re: Start ThriftServer Error

2015-04-22 Thread Denny Lee
You may need to specify the hive port itself. For example, my own Thrift start command is in the form: ./sbin/start-thriftserver.sh --master spark://$myserver:7077 --driver-class-path $CLASSPATH --hiveconf hive.server2.thrift.bind.host $myserver --hiveconf hive.server2.thrift.port 1 HTH!
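A fuller sketch of that invocation (server name and port assumed, since the snippet is truncated):

    ./sbin/start-thriftserver.sh --master spark://myserver:7077 \
      --driver-class-path $CLASSPATH \
      --hiveconf hive.server2.thrift.bind.host=myserver \
      --hiveconf hive.server2.thrift.port=10001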

RE: GSSException when submitting Spark job in yarn-cluster mode with HiveContext APIs on Kerberos cluster

2015-04-20 Thread Andrew Lee
I think you want to take a look at: https://issues.apache.org/jira/browse/SPARK-6207 On Mon, Apr 20, 2015 at 1:58 PM, Andrew Lee alee...@hotmail.com wrote: Hi All, Affected version: spark 1.2.1 / 1.2.2 / 1.3-rc1 Posting this problem to user group first to see if someone

Re: Skipped Jobs

2015-04-19 Thread Denny Lee
Thanks for the correction Mark :) On Sun, Apr 19, 2015 at 3:45 PM Mark Hamstra m...@clearstorydata.com wrote: Almost. Jobs don't get skipped. Stages and Tasks do if the needed results are already available. On Sun, Apr 19, 2015 at 3:18 PM, Denny Lee denny.g@gmail.com wrote: The job

Re: Which version of Hive QL is Spark 1.3.0 using?

2015-04-17 Thread Denny Lee
Support for subqueries in predicates hasn't been resolved yet - please refer to SPARK-4226. BTW, Spark 1.3 binds to Hive 0.13.1 by default. On Fri, Apr 17, 2015 at 09:18 ARose ashley.r...@telarix.com wrote: So I'm trying to store the results of a query into a DataFrame, but I get the

Re: Microsoft SQL jdbc support from spark sql

2015-04-16 Thread Denny Lee
Bummer - out of curiosity, if you were to use classpath.first or perhaps copy the jar to the slaves, could that actually do the trick? The latter isn't really all that efficient, but just curious whether it could work. On Thu, Apr 16, 2015 at 7:14 AM ARose ashley.r...@telarix.com wrote:

Re: Converting Date pattern in scala code

2015-04-14 Thread Denny Lee
If you're doing this in Scala per se - then you can probably just reference JodaTime or the Java Date / Time classes. If you are using SparkSQL, then you can use the various Hive date functions for conversion. On Tue, Apr 14, 2015 at 11:04 AM BASAK, ANANDA ab9...@att.com wrote: I need some help to convert
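A minimal Scala sketch with the standard Java date classes (patterns assumed):

    import java.text.SimpleDateFormat
    val in  = new SimpleDateFormat("yyyy-MM-dd")
    val out = new SimpleDateFormat("MM/dd/yyyy")
    out.format(in.parse("2015-04-14"))  // "04/14/2015"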

Re: Which Hive version should be used for Spark 1.3

2015-04-09 Thread Denny Lee
By default Spark 1.3 has bindings to Hive 0.13.1 though you can bind it to Hive 0.12 if you specify it in the profile when building Spark as per https://spark.apache.org/docs/1.3.0/building-spark.html. If you are downloading a pre built version of Spark 1.3 - then by default, it is set to Hive
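For reference, a sketch of the Maven build for the Hive 0.12 bindings (profile names as documented in the linked 1.3 build guide):

    mvn -Pyarn -Phadoop-2.4 -Phive -Phive-thriftserver -Phive-0.12.0 -DskipTests clean package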

Re: SQL can't not create Hive database

2015-04-09 Thread Denny Lee
Can you create the database directly within Hive? If you're getting the same error within Hive, it sounds like a permissions issue as per Bojan. More info can be found at: http://stackoverflow.com/questions/15898211/unable-to-create-database-path-file-user-hive-warehouse-error On Thu, Apr 9,

Re: Microsoft SQL jdbc support from spark sql

2015-04-07 Thread Denny Lee
At this time, the JDBC data source is not extensible so it cannot support SQL Server. There were some thoughts - credit to Cheng Lian for this - about making the JDBC data source extensible for third party support, possibly via Slick. On Mon, Apr 6, 2015 at 10:41 PM bipin bipin@gmail.com

Re: Microsoft SQL jdbc support from spark sql

2015-04-07 Thread Denny Lee
That's correct - at this time MS SQL Server is not supported through the JDBC data source. In my environment, we've been using Hadoop streaming to extract out data from multiple SQL Servers, pushing the data into HDFS, creating the Hive tables and/or converting them into Parquet, and

Re: ArrayBuffer within a DataFrame

2015-04-03 Thread Denny Lee
something like this would work. You might need to play with the type. df.explode(arrayBufferColumn) { x => x } On Fri, Apr 3, 2015 at 6:43 AM, Denny Lee denny.g@gmail.com wrote: Thanks Dean - fun hack :) On Fri, Apr 3, 2015 at 6:11 AM Dean Wampler deanwamp...@gmail.com wrote: A hack
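A hedged sketch of that approach in the 1.3 DataFrame API (column names and element type assumed):

    // one output row per element of the Seq-valued column "arr"
    val exploded = df.explode("arr", "item") { arr: Seq[String] => arr }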

Re: ArrayBuffer within a DataFrame

2015-04-03 Thread Denny Lee
On Thu, Apr 2, 2015 at 10:45 PM, Denny Lee denny.g@gmail.com wrote: Thanks Michael - that was it! I was drawing a blank on this one for some reason - much appreciated! On Thu, Apr 2, 2015 at 8:27 PM

ArrayBuffer within a DataFrame

2015-04-02 Thread Denny Lee
Quick question - the output of a dataframe is in the format of: [2015-04, ArrayBuffer(A, B, C, D)] and I'd like to return it as: 2015-04, A 2015-04, B 2015-04, C 2015-04, D What's the best way to do this? Thanks in advance!

Re: ArrayBuffer within a DataFrame

2015-04-02 Thread Denny Lee
, Apr 2, 2015 at 7:10 PM, Denny Lee denny.g@gmail.com wrote: Quick question - the output of a dataframe is in the format of: [2015-04, ArrayBuffer(A, B, C, D)] and I'd like to return it as: 2015-04, A 2015-04, B 2015-04, C 2015-04, D What's the best way to do this? Thanks

Re: Creating Partitioned Parquet Tables via SparkSQL

2015-04-01 Thread Denny Lee
Thanks Felix :) On Wed, Apr 1, 2015 at 00:08 Felix Cheung felixcheun...@hotmail.com wrote: This is tracked by these JIRAs.. https://issues.apache.org/jira/browse/SPARK-5947 https://issues.apache.org/jira/browse/SPARK-5948 -- From: denny.g@gmail.com Date:

Re: Anyone has some simple example with spark-sql with spark 1.3

2015-03-30 Thread Denny Lee
Hi Vincent, This may be a case where you're missing a semi-colon after your CREATE TEMPORARY TABLE statement. I ran your original statement (missing the semi-colon) and got the same error as you did. As soon as I added it in, I was good to go again: CREATE TEMPORARY TABLE jsonTable USING
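The statement under discussion, sketched with an assumed path, reads roughly:

    CREATE TEMPORARY TABLE jsonTable
    USING org.apache.spark.sql.json
    OPTIONS (path "/path/to/data.json");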

Re: SparkSQL overwrite parquet file does not generate _common_metadata

2015-03-27 Thread Pei-Lun Lee
version matters here, but I did observe cases where Spark behaves differently because of semantic differences of the same API in different Hadoop versions. Cheng On 3/27/15 11:33 AM, Pei-Lun Lee wrote: Hi Cheng, on my computer, execute res0.save(xxx, org.apache.spark.sql.SaveMode. Overwrite

Re: Hive Table not found from Spark SQL

2015-03-27 Thread Denny Lee
Upon reviewing your other thread, could you confirm that the Hive metastore you can connect to via Hive is a MySQL database? And to also confirm, when you're running spark-shell and doing a show tables statement, you're getting the same error? On Fri, Mar 27, 2015 at 6:08 AM ÐΞ€ρ@Ҝ (๏̯͡๏)

Re: SparkSQL overwrite parquet file does not generate _common_metadata

2015-03-27 Thread Pei-Lun Lee
1.0.4. Would you mind opening a JIRA for this? Cheng On 3/27/15 2:40 PM, Pei-Lun Lee wrote: I'm using 1.0.4 Thanks, -- Pei-Lun On Fri, Mar 27, 2015 at 2:32 PM, Cheng Lian lian.cs@gmail.com wrote: Hm, which version of Hadoop are you using? Actually there should also be a _metadata

Re: spark-sql throws org.datanucleus.store.rdbms.connectionpool.DatastoreDriverNotFoundException

2015-03-26 Thread Denny Lee
If you're not using MySQL as your metastore for Hive, out of curiosity what are you using? The error you are seeing is common when the correct driver for Spark to connect to the Hive metastore isn't on the classpath. As well, I noticed that you're using

Re: SparkSQL overwrite parquet file does not generate _common_metadata

2015-03-26 Thread Pei-Lun Lee
, and thus can be faster to read than _metadata. Cheng On 3/26/15 12:48 PM, Pei-Lun Lee wrote: Hi, When I save parquet file with SaveMode.Overwrite, it never generate _common_metadata. Whether it overwrites an existing dir or not. Is this expected behavior? And what is the benefit

Re: Handling Big data for interactive BI tools

2015-03-26 Thread Denny Lee
BTW, a tool that I have been using to help do the preaggregation of data using hyperloglog in combination with Spark is atscale (http://atscale.com/). It builds the aggregations and makes use of the speed of SparkSQL - all within the context of a model that is accessible by Tableau or Qlik. On

Re: Total size of serialized results is bigger than spark.driver.maxResultSize

2015-03-25 Thread Denny Lee
As you noted, you can change the spark.driver.maxResultSize value in your Spark Configurations (https://spark.apache.org/docs/1.2.0/configuration.html). Please reference the Spark Properties section noting that you can modify these properties via the spark-defaults.conf or via SparkConf(). HTH!
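A minimal sketch of both routes (the 2g value is just an example):

    # in spark-defaults.conf
    spark.driver.maxResultSize  2g

    // or via SparkConf in code
    val conf = new org.apache.spark.SparkConf().set("spark.driver.maxResultSize", "2g")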

Re: Which OutputCommitter to use for S3?

2015-03-25 Thread Pei-Lun Lee
I updated the PR for SPARK-6352 to be more like SPARK-3595. I added a new setting spark.sql.parquet.output.committer.class in hadoop configuration to allow custom implementation of ParquetOutputCommitter. Can someone take a look at the PR? On Mon, Mar 16, 2015 at 5:23 PM, Pei-Lun Lee pl

SparkSQL overwrite parquet file does not generate _common_metadata

2015-03-25 Thread Pei-Lun Lee
Hi, When I save a parquet file with SaveMode.Overwrite, it never generates _common_metadata, whether it overwrites an existing dir or not. Is this expected behavior? And what is the benefit of _common_metadata? Will reading perform better when it is present? Thanks, -- Pei-Lun

Re: [SparkSQL] How to calculate stddev on a DataFrame?

2015-03-25 Thread Denny Lee
Perhaps this email reference may be able to help from a DataFrame perspective: http://mail-archives.apache.org/mod_mbox/incubator-spark-user/201503.mbox/%3CCALte62ztepahF=5hk9rcfbnyk4z43wkcq4fkdcbwmgf_3_o...@mail.gmail.com%3E On Wed, Mar 25, 2015 at 7:29 PM Haopu Wang hw...@qilinsoft.com wrote:
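For a DataFrame without a built-in stddev (as in Spark 1.3), one hedged sketch computes it from aggregate expressions (column name assumed; population form):

    import org.apache.spark.sql.Row
    import org.apache.spark.sql.functions._
    val Row(meanSq: Double, mean: Double) =
      df.agg(avg(col("x") * col("x")), avg(col("x"))).first()
    val stddevPop = math.sqrt(meanSq - mean * mean)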

Re: Errors in SPARK

2015-03-24 Thread Denny Lee
Cheers, Sandeep.v On Wed, Mar 25, 2015 at 11:10 AM, sandeep vura sandeepv...@gmail.com wrote: No I am just running ./spark-shell command in terminal I will try with above command On Wed, Mar 25, 2015 at 11:09 AM, Denny Lee denny.g@gmail.com wrote: Did you include the connection

Re: Errors in SPARK

2015-03-24 Thread Denny Lee
Did you include the connection to a MySQL connector jar so that way spark-shell / hive can connect to the metastore? For example, when I run my spark-shell instance in standalone mode, I use: ./spark-shell --master spark://servername:7077 --driver-class-path /lib/mysql-connector-java-5.1.27.jar

Re: Standalone Scheduler VS YARN Performance

2015-03-24 Thread Denny Lee
By any chance does this thread address look similar: http://apache-spark-developers-list.1001551.n3.nabble.com/Lost-executor-on-YARN-ALS-iterations-td7916.html ? On Tue, Mar 24, 2015 at 5:23 AM Harut Martirosyan harut.martiros...@gmail.com wrote: What is performance overhead caused by YARN,

Re: Hadoop 2.5 not listed in Spark 1.4 build page

2015-03-24 Thread Denny Lee
Hadoop 2.5 would be referenced via -Dhadoop.version=2.5.0 while using the profile -Phadoop-2.4. Please note earlier in the link the section: # Apache Hadoop 2.4.X or 2.5.X mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=VERSION -DskipTests clean package Versions of Hadoop after 2.5.X may or may not work with the

Re: Use pig load function in spark

2015-03-23 Thread Denny Lee
You may be able to utilize Spork (Pig on Apache Spark) as a mechanism to do this: https://github.com/sigmoidanalytics/spork On Mon, Mar 23, 2015 at 2:29 AM Dai, Kevin yun...@ebay.com wrote: Hi, all Can spark use pig’s load function to load data? Best Regards, Kevin.

Re: Using a different spark jars than the one on the cluster

2015-03-23 Thread Denny Lee
+1 - I currently am doing what Marcelo is suggesting as I have a CDH 5.2 cluster (with Spark 1.1) and I'm also running Spark 1.3.0+ side-by-side in my cluster. On Wed, Mar 18, 2015 at 1:23 PM Marcelo Vanzin van...@cloudera.com wrote: Since you're using YARN, you should be able to download a

Re: Should I do spark-sql query on HDFS or hive?

2015-03-23 Thread Denny Lee
From the standpoint of Spark SQL accessing the files - when it is hitting Hive, it is in effect hitting HDFS as well. Hive provides a great framework where the table structure is already well defined. But underneath it, Hive is just accessing files from HDFS so you are hitting HDFS either way.

Re: Spark sql thrift server slower than hive

2015-03-22 Thread Denny Lee
How are you running your spark instance out of curiosity? Via YARN or standalone mode? When connecting Spark thriftserver to the Spark service, have you allocated enough memory and CPU when executing with spark? On Sun, Mar 22, 2015 at 3:39 AM fanooos dev.fano...@gmail.com wrote: We have

Re: SparkSQL 1.3.0 JDBC data source issues

2015-03-19 Thread Pei-Lun Lee
JIRA and PR for first issue: https://issues.apache.org/jira/browse/SPARK-6408 https://github.com/apache/spark/pull/5087 On Thu, Mar 19, 2015 at 12:20 PM, Pei-Lun Lee pl...@appier.com wrote: Hi, I am trying jdbc data source in spark sql 1.3.0 and found some issues. First, the syntax where

SparkSQL 1.3.0 JDBC data source issues

2015-03-18 Thread Pei-Lun Lee
Hi, I am trying the jdbc data source in spark sql 1.3.0 and found some issues. First, the syntax where str_col='value' will give an error for both postgresql and mysql:
psql> create table foo(id int primary key, name text, age int);
bash> SPARK_CLASSPATH=postgresql-9.4-1201-jdbc41.jar

Re: Which OutputCommitter to use for S3?

2015-03-16 Thread Pei-Lun Lee
that direct dependency makes this injection much more difficult for saveAsParquetFile. On Thu, Mar 5, 2015 at 12:28 AM, Pei-Lun Lee pl...@appier.com wrote: Thanks for the DirectOutputCommitter example. However I found it only works for saveAsHadoopFile. What about saveAsParquetFile? It looks

Re: takeSample triggers 2 jobs

2015-03-06 Thread Denny Lee
Hi Rares, If you dig into the descriptions for the two jobs, it will probably return something like:
Job ID: 1
org.apache.spark.rdd.RDD.takeSample(RDD.scala:447)
$line41.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:22)
...
Job ID: 0

Re: Which OutputCommitter to use for S3?

2015-03-05 Thread Pei-Lun Lee
Thanks for the DirectOutputCommitter example. However I found it only works for saveAsHadoopFile. What about saveAsParquetFile? It looks like SparkSQL is using ParquetOutputCommitter, which is a subclass of FileOutputCommitter. On Fri, Feb 27, 2015 at 1:52 AM, Thomas Demoor

Re: spark master shut down suddenly

2015-03-04 Thread Denny Lee
It depends on your setup but one of the locations is /var/log/mesos On Wed, Mar 4, 2015 at 19:11 lisendong lisend...@163.com wrote: I'm sorry, but how to look at the mesos logs? Where are they? On Mar 4, 2015, at 6:06 PM, Akhil Das ak...@sigmoidanalytics.com wrote: You can check in the mesos logs

Re: Use case for data in SQL Server

2015-02-24 Thread Denny Lee
Hi Suhel, My team is currently working with a lot of SQL Server databases as one of our many data sources and ultimately we pull the data into HDFS from SQL Server. As we had a lot of SQL databases to hit, we used the jTDS driver and SQOOP to extract the data out of SQL Server and into HDFS
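A hedged sketch of that extraction pattern (host, database, credentials, and table hypothetical; jTDS-style JDBC URL):

    sqoop import \
      --driver net.sourceforge.jtds.jdbc.Driver \
      --connect "jdbc:jtds:sqlserver://sqlhost:1433/mydb" \
      --username myuser -P \
      --table orders \
      --target-dir /data/sqlserver/orders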

Re: Unable to run hive queries inside spark

2015-02-24 Thread Denny Lee
The error message you have is: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. MetaException(message:file:/user/hive/warehouse/src is not a directory or unable to create one) Could you verify that you (the user you are running under) have the rights to create

Re: Unable to run hive queries inside spark

2015-02-24 Thread Denny Lee
<description>location of default database for the warehouse</description> </property> Do I need to do anything explicitly other than placing hive-site.xml in the spark conf directory? Thanks !! On Wed, Feb 25, 2015 at 11:42 AM, Denny Lee denny.g@gmail.com wrote: The error message

Re: How to start spark-shell with YARN?

2015-02-24 Thread Denny Lee
It may have to do with the akka heartbeat interval per SPARK-3923 - https://issues.apache.org/jira/plugins/servlet/mobile#issue/SPARK-3923 ? On Tue, Feb 24, 2015 at 16:40 Xi Shen davidshe...@gmail.com wrote: Hi Sean, I launched the spark-shell on the same machine as I started YARN service. I

Re: Spark SQL odbc on Windows

2015-02-23 Thread Denny Lee
paper!! We were already using it as a guideline for our tests. Best regards, Francisco -- From: Denny Lee denny.g@gmail.com Sent: 22/02/2015 17:56 To: Ashic Mahtab as...@live.com; Francisco Orchard forch...@gmail.com; Apache Spark user@spark.apache.org

Re: Spark Performance on Yarn

2015-02-23 Thread Lee Bierman
, Davies Liu dav...@databricks.com wrote: How many executors do you have per machine? It will be helpful if you could list all the configs. Could you also try to run it without persist? Caching can hurt rather than help if you don't have enough memory. On Fri, Feb 20, 2015 at 5:18 PM, Lee Bierman leebier

Re: Spark SQL odbc on Windows

2015-02-22 Thread Denny Lee
Hi Francisco, Out of curiosity - why ROLAP mode using multi-dimensional mode (vs tabular) from SSAS to Spark? As a past SSAS guy you've definitely piqued my interest. The one thing that you may run into is that the SQL generated by SSAS can be quite convoluted. When we were doing the same thing

Re: Spark SQL odbc on Windows

2015-02-22 Thread Denny Lee
Back to thrift, there was an earlier thread on this topic at http://mail-archives.apache.org/mod_mbox/spark-user/201411.mbox/%3CCABPQxsvXA-ROPeXN=wjcev_n9gv-drqxujukbp_goutvnyx...@mail.gmail.com%3E that may be useful as well. On Sun Feb 22 2015 at 8:42:29 AM Denny Lee denny.g@gmail.com wrote

Re: Spark 1.3 SQL Programming Guide and sql._ / sql.types._

2015-02-20 Thread Denny Lee
. On Fri, Feb 20, 2015 at 9:55 AM, Denny Lee denny.g@gmail.com wrote: Quickly reviewing the latest SQL Programming Guide https://github.com/apache/spark/blob/master/docs/sql-programming-guide.md (in github) I had a couple of quick questions: 1) Do we need to instantiate the SparkContext

Re: Spark Performance on Yarn

2015-02-20 Thread Lee Bierman
Thanks for the suggestions. I'm experimenting with different values for spark memoryOverhead and explicitly giving the executors more memory, but still have not found the golden medium to get it to finish in a proper time frame. Is my cluster massively undersized at 5 boxes, 8gb 2cpu? Trying to

Spark 1.3 SQL Programming Guide and sql._ / sql.types._

2015-02-20 Thread Denny Lee
Quickly reviewing the latest SQL Programming Guide https://github.com/apache/spark/blob/master/docs/sql-programming-guide.md (in github) I had a couple of quick questions: 1) Do we need to instantiate the SparkContext as per // sc is an existing SparkContext. val sqlContext = new
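The guide snippet in question continues roughly as:

    // sc is an existing SparkContext.
    val sqlContext = new org.apache.spark.sql.SQLContext(sc)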

RE: SparkSQL + Tableau Connector

2015-02-17 Thread Andrew Lee
or insights on what I'm missing here. Thanks for the assistance. -Todd On Wed, Feb 11, 2015 at 3:20 PM, Andrew Lee alee...@hotmail.com wrote: Sorry folks, it is executing Spark jobs instead of Hive jobs. I mis-read the logs since there were other activities going on on the cluster. From: alee

RE: Spark sql failed in yarn-cluster mode when connecting to non-default hive database

2015-02-17 Thread Andrew Lee
Hi All, Just want to give everyone an update of what worked for me. Thanks for Cheng's comment and other people's help. So what I misunderstood was the --driver-class-path and how that was related to --files. I put /etc/hive/hive-site.xml in both --files and --driver-class-path when I

RE: SparkSQL + Tableau Connector

2015-02-11 Thread Andrew Lee
Sorry folks, it is executing Spark jobs instead of Hive jobs. I mis-read the logs since there were other activities going on on the cluster. From: alee...@hotmail.com To: ar...@sigmoidanalytics.com; tsind...@gmail.com CC: user@spark.apache.org Subject: RE: SparkSQL + Tableau Connector Date: Wed,

RE: Is the Thrift server right for me?

2015-02-11 Thread Andrew Lee
I have ThriftServer2 up and running, however, I notice that it relays the query to HiveServer2 when I pass the hive-site.xml to it. I'm not sure if this is the expected behavior, but based on what I have up and running, the ThriftServer2 invokes HiveServer2 that results in MapReduce or Tez

RE: hadoopConfiguration for StreamingContext

2015-02-10 Thread Andrew Lee
It looks like this is related to the underlying Hadoop configuration. Try to deploy the Hadoop configuration with your job with --files and --driver-class-path, or to the default /etc/hadoop/conf core-site.xml. If that is not an option (depending on how your Hadoop cluster is setup), then hard
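A sketch of the --files / --driver-class-path route (application class and jar hypothetical):

    spark-submit \
      --files /etc/hadoop/conf/core-site.xml \
      --driver-class-path /etc/hadoop/conf \
      --class com.example.StreamingApp streaming-app.jar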

Re: Tableau beta connector

2015-02-05 Thread Denny Lee
and tableau can extract that RDD persisted on hive. Regards, Ashutosh -- From: Denny Lee denny.g@gmail.com Sent: Thursday, February 5, 2015 1:27 PM To: Ashutosh Trivedi (MT2013030); İsmail Keskin Cc: user@spark.apache.org Subject: Re: Tableau beta

Re: Tableau beta connector

2015-02-04 Thread Denny Lee
works. -- From: Denny Lee denny.g@gmail.com Sent: Thursday, February 5, 2015 12:20 PM To: İsmail Keskin; Ashutosh Trivedi (MT2013030) Cc: user@spark.apache.org Subject: Re: Tableau beta connector Some quick context behind how Tableau interacts

Re: Fail to launch spark-shell on windows 2008 R2

2015-02-03 Thread Denny Lee
Hi Ningjun, I have been working with Spark 1.2 on Windows 7 and Windows 2008 R2 (purely for development purposes). I had most recently installed them utilizing Java 1.8, Scala 2.10.4, and Spark 1.2 Precompiled for Hadoop 2.4+. A handy thread concerning the null\bin\winutils issue is addressed

Re: Spark (SQL) as OLAP engine

2015-02-03 Thread Denny Lee
A great presentation by Evan Chan on utilizing Cassandra as Jonathan noted is at: OLAP with Cassandra and Spark http://www.slideshare.net/EvanChan2/2014-07olapcassspark. On Tue Feb 03 2015 at 10:03:34 AM Jonathan Haddad j...@jonhaddad.com wrote: Write out the rdd to a cassandra table. The

Re: spark-shell can't import the default hive-site.xml options probably.

2015-02-01 Thread Denny Lee
, Denny Lee denny.g@gmail.com wrote: I may be missing something here, but typically the hive-site.xml configurations do not require you to place an "s" within the value itself. Both the retry.delay and socket.timeout values are in seconds, so you should only need to place the integer value (which is in seconds)

Re: spark-shell can't import the default hive-site.xml options probably.

2015-02-01 Thread Denny Lee
I may be missing something here, but typically the hive-site.xml configurations do not require you to place an "s" within the value itself. Both the retry.delay and socket.timeout values are in seconds, so you should only need to place the integer value (which is in seconds). On Sun Feb
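For example, a sketch of such a value in hive-site.xml (property name assumed; note the bare integer, no "s" suffix):

    <property>
      <name>hive.metastore.client.socket.timeout</name>
      <value>60</value>
    </property>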

Spark 1.2 and Mesos 0.21.0 spark.executor.uri issue?

2014-12-30 Thread Denny Lee
I've been working with Spark 1.2 and Mesos 0.21.0 and while I have set the spark.executor.uri within spark-env.sh (and directly within bash as well), the Mesos slaves do not seem to be able to access the spark tgz file via HTTP or HDFS as per the message below. 14/12/30 15:57:35 INFO SparkILoop:

RE: Spark sql failed in yarn-cluster mode when connecting to non-default hive database

2014-12-29 Thread Andrew Lee
Hi All, I have tried to pass the properties via SparkContext.setLocalProperty and HiveContext.setConf; both failed. Based on the results (haven't had a chance to look into the code yet), HiveContext will try to initiate the JDBC connection right away, and I couldn't set other properties

RE: Spark sql failed in yarn-cluster mode when connecting to non-default hive database

2014-12-29 Thread Andrew Lee
A follow up on the hive-site.xml, if you 1. Specify it in spark/conf, then you can NOT apply it via the --driver-class-path option, otherwise, you will get the following exceptions when initializing SparkContext. org.apache.spark.SparkException: Found both spark.driver.extraClassPath

Re: S3 files , Spark job hungsup

2014-12-23 Thread Denny Lee
You should be able to kill the job using the webUI or via spark-class. More info can be found in the thread: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-kill-a-Spark-job-running-in-cluster-mode-td18583.html. HTH! On Tue, Dec 23, 2014 at 4:47 PM, durga durgak...@gmail.com wrote:

Re: Hadoop 2.6 compatibility?

2014-12-19 Thread Denny Lee
To clarify, there isn't a Hadoop 2.6 profile per se, but you can build using the -Phadoop-2.4 profile with -Dhadoop.version=2.6.0, which works with Hadoop 2.6. On Fri, Dec 19, 2014 at 12:55 Ted Yu yuzhih...@gmail.com wrote: You can use hadoop-2.4 profile and pass -Dhadoop.version=2.6.0 Cheers On Fri, Dec 19, 2014 at 12:51
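Putting Ted's suggestion together, the full command would be roughly:

    mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.6.0 -DskipTests clean package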

Re: Hadoop 2.6 compatibility?

2014-12-19 Thread Denny Lee
Lee denny.g@gmail.com wrote: To clarify, there isn't a Hadoop 2.6 profile per se, but you can build using the -Phadoop-2.4 profile with -Dhadoop.version=2.6.0, which works with Hadoop 2.6. On Fri, Dec 19, 2014 at 12:55 Ted Yu yuzhih...@gmail.com wrote: You can use hadoop-2.4 profile and pass -Dhadoop.version=2.6.0 Cheers

Re: Spark Shell slowness on Google Cloud

2014-12-17 Thread Denny Lee
I'm curious if you're seeing the same thing when using bdutil against GCS? I'm wondering if this may be an issue concerning the transfer rate of Spark -> Hadoop -> GCS Connector -> GCS. On Wed Dec 17 2014 at 10:09:17 PM Alessandro Baretta alexbare...@gmail.com wrote: All, I'm using the Spark

Re: Spark Shell slowness on Google Cloud

2014-12-17 Thread Denny Lee
See the following.
alex@hadoop-m:~/split$ time bash -c "gsutil ls gs://my-bucket/20141205/csv/*/*/* | wc -l"
6860
real    0m6.971s
user    0m1.052s
sys     0m0.096s
Alex On Wed, Dec 17, 2014 at 10:29 PM, Denny Lee denny.g@gmail.com wrote: I'm curious if you're seeing the same

Re: Spark Shell slowness on Google Cloud

2014-12-17 Thread Denny Lee
to test this? But more importantly, what information would this give me? On Wed, Dec 17, 2014 at 10:46 PM, Denny Lee denny.g@gmail.com wrote: Oh, it makes sense of gsutil scans through this quickly, but I was wondering if running a Hadoop job / bdutil would result in just as fast scans

Limit the # of columns in Spark Scala

2014-12-14 Thread Denny Lee
I have a large number of files within HDFS that I would like to do a group by statement ala val table = sc.textFile("hdfs://") val tabs = table.map(_.split("\t")) I'm trying to do something similar to tabs.map(c => (c._(167), c._(110), c._(200)) where I create a new RDD that only has but that isn't

Re: Limit the # of columns in Spark Scala

2014-12-14 Thread Denny Lee
looks like the way to go given the context. What's not working? Kr, Gerard On Dec 14, 2014 5:17 PM, Denny Lee denny.g@gmail.com wrote: I have a large of files within HDFS that I would like to do a group by statement ala val table = sc.textFile(hdfs://) val tabs = table.map(_.split

Re: Limit the # of columns in Spark Scala

2014-12-14 Thread Denny Lee
Yes - that works great! Sorry for implying I couldn't. Was just more flummoxed that I couldn't make the Scala call work on its own. Will continue to debug ;-) On Sun, Dec 14, 2014 at 11:39 Michael Armbrust mich...@databricks.com wrote: BTW, I cannot use SparkSQL / case right now because my table

Re: Limit the # of columns in Spark Scala

2014-12-14 Thread Denny Lee
tabs.map(c => (c(167), c(110), c(200)) instead of tabs.map(c => (c._(167), c._(110), c._(200)) On Sun, Dec 14, 2014 at 3:12 PM, Denny Lee denny.g@gmail.com wrote: Yes - that works great! Sorry for implying I couldn't. Was just more flummoxed that I couldn't make the Scala call work on its
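Assembled from the thread, a minimal working sketch (HDFS path elided as in the original; column indexes from the question):

    val table = sc.textFile("hdfs://...")
    val tabs  = table.map(_.split("\t"))
    val cols  = tabs.map(c => (c(167), c(110), c(200)))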

Re: Spark SQL Roadmap?

2014-12-13 Thread Denny Lee
Hi Xiaoyong, SparkSQL has already been released and has been part of the Spark code-base since Spark 1.0. The latest stable release is Spark 1.1 (here's the Spark SQL Programming Guide http://spark.apache.org/docs/1.1.0/sql-programming-guide.html) and we're currently voting on Spark 1.2. Hive

Re: Spark-SQL JDBC driver

2014-12-11 Thread Denny Lee
Yes, that is correct. A quick reference on this is the post https://www.linkedin.com/pulse/20141007143323-732459-an-absolutely-unofficial-way-to-connect-tableau-to-sparksql-spark-1-1?_mSplash=1 with the pertinent section being: It is important to note that when you create Spark tables (for

Re: Spark on YARN memory utilization

2014-12-09 Thread Denny Lee
Thanks Sandy! On Mon, Dec 8, 2014 at 23:15 Sandy Ryza sandy.r...@cloudera.com wrote: Another thing to be aware of is that YARN will round up containers to the nearest increment of yarn.scheduler.minimum-allocation-mb, which defaults to 1024. -Sandy On Sat, Dec 6, 2014 at 3:48 PM, Denny Lee

Spark on YARN memory utilization

2014-12-06 Thread Denny Lee
This is perhaps more of a YARN question than a Spark question but i was just curious to how is memory allocated in YARN via the various configurations. For example, if I spin up my cluster with 4GB with a different number of executors as noted below 4GB executor-memory x 10 executors = 46GB

Re: Spark on YARN memory utilization

2014-12-06 Thread Denny Lee
* executorMemory. When you set executor memory, the yarn resource request is executorMemory + yarnOverhead. - Arun On Sat, Dec 6, 2014 at 4:27 PM, Denny Lee denny.g@gmail.com wrote: This is perhaps more of a YARN question than a Spark question but i was just curious to how is memory allocated
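A sketch of the arithmetic, assuming the Spark 1.x default overhead of 384MB and yarn.scheduler.minimum-allocation-mb = 1024:

    per-executor request = 4096MB + 384MB = 4480MB
    rounded up to the next 1024MB increment -> 5120MB
    x 10 executors = ~50GB of YARN memory, plus the application master's container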

Re: spark-submit on YARN is slow

2014-12-05 Thread Denny Lee
My submissions of Spark on YARN (CDH 5.2) resulted in a few thousand steps. If I was running this on standalone cluster mode the query finished in 55s, but on YARN the query was still running 30min later. Would the hard-coded sleeps potentially be in play here? On Fri, Dec 5, 2014 at 11:23 Sandy

Re: spark-submit on YARN is slow

2014-12-05 Thread Denny Lee
, and --num-executors arguments? When running against a standalone cluster, by default Spark will make use of all the cluster resources, but when running against YARN, Spark defaults to a couple tiny executors. -Sandy On Fri, Dec 5, 2014 at 11:32 AM, Denny Lee denny.g@gmail.com wrote: My

Re: spark-submit on YARN is slow

2014-12-05 Thread Denny Lee
Okay, my bad for not testing out the documented arguments - once I use the correct ones, the query completes in ~55s (I can probably make it faster). Thanks for the help, eh?! On Fri Dec 05 2014 at 10:34:50 PM Denny Lee denny.g@gmail.com wrote: Sorry for the delay in my response

Re: latest Spark 1.2 thrift server fail with NoClassDefFoundError on Guava

2014-11-25 Thread Denny Lee
To determine if this is a Windows vs. other configuration, can you just try to call the Spark-class.cmd SparkSubmit without actually referencing the Hadoop or Thrift server classes? On Tue Nov 25 2014 at 5:42:09 PM Judy Nash judyn...@exchange.microsoft.com wrote: I traced the code and used

Re: Spark SQL Programming Guide - registerTempTable Error

2014-11-23 Thread Denny Lee
By any chance are you using Spark 1.0.2? registerTempTable was introduced from Spark 1.1+ while for Spark 1.0.2, it would be registerAsTable. On Sun Nov 23 2014 at 10:59:48 AM riginos samarasrigi...@gmail.com wrote: Hi guys , Im trying to do the Spark SQL Programming Guide but after the:
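A quick sketch of the version difference (schemaRDD name assumed):

    // Spark 1.1+
    schemaRDD.registerTempTable("people")
    // Spark 1.0.x
    schemaRDD.registerAsTable("people")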

Re: Spark SQL Programming Guide - registerTempTable Error

2014-11-23 Thread Denny Lee
It sort of depends on your environment. If you are running on your local environment, I would just download the latest Spark 1.1 binaries and you'll be good to go. If its a production environment, it sort of depends on how you are setup (e.g. AWS, Cloudera, etc.) On Sun Nov 23 2014 at 11:27:49

Re: Spark or MR, Scala or Java?

2014-11-22 Thread Denny Lee
extraction job against multiple data sources via Hadoop streaming. Another good call out for utilizing Scala within Spark is that most of the Spark code is written in Scala. On Sat, Nov 22, 2014 at 08:12 Denny Lee denny.g@gmail.com wrote: There are various scenarios where traditional Hadoop
