Re: First Time contribution.

2023-09-17 Thread Denny Lee
Hi Ram, We have some good guidance at https://spark.apache.org/contributing.html HTH! Denny On Sun, Sep 17, 2023 at 17:18 ram manickam wrote: > > > > Hello All, > Recently, joined this community and would like to contribute. Is there a > guideline or recommendation on tasks that can be

Re: Slack for PySpark users

2023-04-03 Thread Denny Lee
and affordable. Alternatives have been suggested as well so those who like investigative search can agree and come up with a freebie one. I am inclined to agree with Bjorn that th

Re: Slack for PySpark users

2023-03-30 Thread Denny Lee
;>>>>>> +1 >>>>>>> >>>>>>> + @d...@spark.apache.org >>>>>>> >>>>>>> This is a good idea. The other Apache projects (e.g., Pinot, Druid, >>>>>>> Flink) have created their

Re: Slack for PySpark users

2023-03-27 Thread Denny Lee
+1 I think this is a great idea! On Mon, Mar 27, 2023 at 6:24 PM Hyukjin Kwon wrote: > Yeah, actually I think we should better have a slack channel so we can > easily discuss with users and developers. > > On Tue, 28 Mar 2023 at 03:08, keen wrote: > >> Hi all, >> I really like *Slack* as

Re: Topics for Spark online classes & webinars

2023-03-15 Thread Denny Lee
. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction. On Wed, 15 Mar 2023 at 18:31, Nitin Bhansali wrote: Hello Mich, My apologies ... but I am

Re: Topics for Spark online classes & webinars

2023-03-15 Thread Denny Lee
Thanks Mich for tackling this! I encourage everyone to add to the list so we can have a comprehensive list of topics, eh?! On Wed, Mar 15, 2023 at 10:27 Mich Talebzadeh wrote: > Hi all, > > Thanks to @Denny Lee to give access to > > https://www.linkedin.com/comp

Re: Topics for Spark online classes & webinars

2023-03-14 Thread Denny Lee
request to leverage the original Spark confluence page <https://cwiki.apache.org/confluence/display/SPARK>. WDYT? On Mon, Mar 13, 2023 at 9:34 AM Mich Talebzadeh wrote: > Well that needs to be created first for this purpose. The appropriate name > etc. to be decided. Maybe @Denny Lee

Re: Online classes for spark topics

2023-03-12 Thread Denny Lee
Looks like we have some good topics here - I'm glad to help with setting up the broadcast infrastructure if that helps. On Thu, Mar 9, 2023 at 6:19 AM neeraj bhadani wrote: > I am happy to be a part of this discussion as well. > > Regards, > Neeraj > > On Wed, 8 Mar 2023 at 22:41, Winston Lai

Re: Online classes for spark topics

2023-03-08 Thread Denny Lee
We used to run Spark webinars on the Apache Spark LinkedIn group but honestly the turnout was pretty low. We dove into various features. If there are particular topics that you would like to discuss during a live session,

Re: Prometheus with spark

2022-10-27 Thread Denny Lee
Hi Raja, A little atypical way to respond to your question - please check out the most recent Spark AMA where we discuss this: https://www.linkedin.com/posts/apachespark_apachespark-ama-committers-activity-6989052811397279744-jpWH?utm_source=share_medium=member_ios HTH! Denny On Tue, Oct 25,

Re: Databricks notebook - cluster taking a long time to get created, often timing out

2021-08-17 Thread Denny Lee
Hi Karan, You may want to ping Databricks Help or Forums as this is a Databricks specific question. I'm a little surprised that a Databricks cluster would take a long time to create so it may be best to utilize these

Re: Append to an existing Delta Lake using structured streaming

2021-07-21 Thread Denny Lee
Including the Delta Lake Users and Developers DL to help out. Saying this, could you clarify how data is not being added? By any chance do you have any code samples to recreate this? Sent via Superhuman On Wed, Jul 21, 2021 at 2:49 AM, wrote: >

Re: How to unsubscribe

2020-05-06 Thread Denny Lee
Hi Fred, To unsubscribe, could you please email: user-unsubscr...@spark.apache.org (for more information, please refer to https://spark.apache.org/community.html). Thanks! Denny On Wed, May 6, 2020 at 10:12 AM Fred Liu wrote: > Hi guys > > > >

Re: can we all help use our expertise to create an IT solution for Covid-19

2020-03-26 Thread Denny Lee
There are a number of really good datasets already available including (but not limited to): - South Korea COVID-19 Dataset - 2019 Novel Coronavirus COVID-19 (2019-nCoV) Data Repository by Johns Hopkins CSSE

Re: Should python-2 be supported in Spark 3.0?

2019-05-31 Thread Denny Lee
+1 On Fri, May 31, 2019 at 17:58 Holden Karau wrote: +1 On Fri, May 31, 2019 at 5:41 PM Bryan Cutler wrote: +1 and the draft sounds good On Thu, May 30, 2019, 11:32 AM Xiangrui Meng wrote: Here is the draft announcement: === Plan for dropping Python 2

Re: Does Pyspark Support Graphx?

2018-02-18 Thread Denny Lee
com> wrote: > Hi Denny, > The pyspark script uses the --packages option to load graphframe library, > what about the SparkLauncher class? > > ------ Original ------ > *From:* Denny Lee <denny.g@gmail.com> > *Date:* Sun, Feb 18, 2018 1

Re: Does Pyspark Support Graphx?

2018-02-17 Thread Denny Lee
. GraphFrames? On Sat, Feb 17, 2018 at 8:26 PM xiaobo <guxiaobo1...@qq.com> wrote: > Thanks Denny, will it be supported in the near future? > > ------ Original ------ > *From:* Denny Lee <denny.g@gmail.com> > *Date:* Sun, Feb

Re: Does Pyspark Support Graphx?

2018-02-17 Thread Denny Lee
That’s correct - you can use GraphFrames though as it does support PySpark. On Sat, Feb 17, 2018 at 17:36 94035420 wrote: > I can not find anything for graphx module in the python API document, does > it mean it is not supported yet? >

Re: With 2.2.0 PySpark is now available for pip install from PyPI :)

2017-07-12 Thread Denny Lee
This is amazingly awesome! :) On Wed, Jul 12, 2017 at 13:23 lucas.g...@gmail.com wrote: > That's great! > > > > On 12 July 2017 at 12:41, Felix Cheung wrote: > >> Awesome! Congrats!! >> >> -- >> *From:*

Re: Spark Shell issue on HDInsight

2017-05-14 Thread Denny Lee
com> wrote: > Works for me too... you are a life-saver :) > > But the question: should/how we report this to Azure team? > > On Fri, May 12, 2017 at 10:32 AM, Denny Lee <denny.g@gmail.com> wrote: > >> I was able to repro your issue when I had downloaded the ja

Re: Spark Shell issue on HDInsight

2017-05-11 Thread Denny Lee
SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:736)
> at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:185)
> at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:210)
> at org.apache.spark.deploy.SparkSubmi

Re: Spark Shell issue on HDInsight

2017-05-08 Thread Denny Lee
This appears to be an issue with the Spark to DocumentDB connector, specifically version 0.0.1. Could you run the 0.0.3 version of the jar and see if you're still getting the same error? i.e. spark-shell --master yarn --jars

Re: Azure Event Hub with Pyspark

2017-04-20 Thread Denny Lee
As well, perhaps another option could be to use the Spark Connector to DocumentDB (https://github.com/Azure/azure-documentdb-spark) if sticking with Scala? On Thu, Apr 20, 2017 at 21:46 Nan Zhu wrote: > DocDB does have a java client? Anything prevent you using that? > >

Support Stored By Clause

2017-03-27 Thread Denny Lee
Per SPARK-19630, wondering if there are plans to support "STORED BY" clause for Spark 2.x? Thanks!

Re: unsubscribe

2017-01-09 Thread Denny Lee
Please unsubscribe by sending an email to user-unsubscr...@spark.apache.org HTH! On Mon, Jan 9, 2017 4:40 PM, william tellme williamtellme...@gmail.com wrote:

Re: UNSUBSCRIBE

2017-01-09 Thread Denny Lee
Please unsubscribe by sending an email to user-unsubscr...@spark.apache.org HTH! On Mon, Jan 9, 2017 4:41 PM, Chris Murphy - ChrisSMurphy.com cont...@chrissmurphy.com wrote: PLEASE!!

Re: Spark app write too many small parquet files

2016-11-27 Thread Denny Lee
Generally, yes - you should try to have larger data sizes due to the overhead of opening up files. Typical guidance is between 64MB-1GB; personally I usually stick with 128MB-512MB with the default snappy codec compression for parquet. A good reference is Vida Ha's presentation Data Storage
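
For illustration, a minimal Scala sketch of consolidating output before a Parquet write - the DataFrame name, partition count, codec setting, and output path here are illustrative assumptions, not from the original thread:

    // aim for roughly (total data size / desired file size) partitions,
    // e.g. ~50GB of data at ~256MB per file => ~200 partitions
    sqlContext.setConf("spark.sql.parquet.compression.codec", "snappy")
    df.repartition(200)
      .write
      .parquet("hdfs:///path/to/output")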

Re: hope someone can recommend some books for me,a spark beginner

2016-11-06 Thread Denny Lee
There are a number of great resources to learn Apache Spark - a good starting point is the Apache Spark Documentation at: http://spark.apache.org/documentation.html The two books that immediately come to mind are - Learning Spark: http://shop.oreilly.com/product/mobile/0636920028512.do (there's

Re: Newbie question - Best way to bootstrap with Spark

2016-11-06 Thread Denny Lee
The one you're looking for is the Data Sciences and Engineering with Apache Spark at https://www.edx.org/xseries/data-science-engineering-apacher-sparktm. Note, a great quick start is the Getting Started with Apache Spark on Databricks at https://databricks.com/product/getting-started-guide HTH!

Re: How do I convert a data frame to broadcast variable?

2016-11-03 Thread Denny Lee
If you're able to read the data in as a DataFrame, perhaps you can use a BroadcastHashJoin so that you can join to that table, presuming it's small enough to distribute? Here's a handy guide on a BroadcastHashJoin:
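
For illustration, a minimal Scala sketch of the broadcast-hint route, assuming Spark 1.5+ where org.apache.spark.sql.functions.broadcast is available; the table and column names are hypothetical:

    import org.apache.spark.sql.functions.broadcast
    // hint that the small lookup table be shipped to every executor,
    // so the join is executed as a BroadcastHashJoin
    val joined = largeDF.join(broadcast(smallDF), "id")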

Re: GraphFrame BFS

2016-11-01 Thread Denny Lee
You should be able to use GraphX's or GraphFrames' subgraph functionality to build up your subgraph. A good example for GraphFrames can be found at: http://graphframes.github.io/user-guide.html#subgraphs. HTH! On Mon, Oct 10, 2016 at 9:32 PM cashinpj wrote: > Hello, > > I have a set of data
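
For reference, the subgraph pattern from the GraphFrames user guide looks roughly like this in Scala (the filter expressions are illustrative):

    import org.graphframes.GraphFrame
    // select a subset of vertices and edges, then assemble a new graph
    val v2 = g.vertices.filter("age > 30")
    val e2 = g.edges.filter("relationship = 'friend'")
    val g2 = GraphFrame(v2, e2)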

Re: Spark GraphFrames

2016-08-02 Thread Denny Lee
Hi Divya, Here's a blog post concerning On-Time Flight Performance with GraphFrames: https://databricks.com/blog/2016/03/16/on-time-flight-performance-with-graphframes-for-apache-spark.html It also includes a Databricks notebook that has the code in it. HTH! Denny On Tue, Aug 2, 2016 at 1:16

Re: Meetup in Rome

2016-02-19 Thread Denny Lee
Hey Domenico, Glad to hear that you love Spark and would like to organize a meetup in Rome. We created a Meetup-in-a-box to help with that - check out the post https://databricks.com/blog/2015/11/19/meetup-in-a-box.html. HTH! Denny On Fri, Feb 19, 2016 at 02:38 Domenico Pontari

Re: How to compile Python and use How to compile Python and use spark-submit

2016-01-08 Thread Denny Lee
Per http://spark.apache.org/docs/latest/submitting-applications.html: For Python, you can use the --py-files argument of spark-submit to add .py, .zip or .egg files to be distributed with your application. If you depend on multiple Python files we recommend packaging them into a .zip or .egg.

Re: subscribe

2016-01-08 Thread Denny Lee
To subscribe, please go to http://spark.apache.org/community.html to join the mailing list. On Fri, Jan 8, 2016 at 3:58 AM Jeetendra Gangele wrote: > >

Re: Intercept in Linear Regression

2015-12-15 Thread Denny Lee
If you're using model = LinearRegressionWithSGD.train(parseddata, iterations=100, step=0.01, intercept=True) then to get the intercept, you would use model.intercept More information can be found at:

Re: Best practises

2015-11-02 Thread Denny Lee
In addition, you may want to check out Tuning and Debugging in Apache Spark (https://sparkhub.databricks.com/video/tuning-and-debugging-apache-spark/) On Mon, Nov 2, 2015 at 05:27 Stefano Baghino wrote: > There is this interesting book from Databricks: >

Spark Survey Results 2015 are now available

2015-10-05 Thread Denny Lee
Thanks to all of you who provided valuable feedback in our Spark Survey 2015. Because of the survey, we have a better picture of who’s using Spark, how they’re using it, and what they’re using it to build - insights that will guide major updates to the Spark platform as we move into Spark’s next

Re: SQL Server to Spark

2015-07-23 Thread Denny Lee
It sort of depends on what you mean by optimized. There is a good thread on the topic at http://search-hadoop.com/m/q3RTtJor7QBnWT42/Spark+and+SQL+server/v=threaded If you have an archival type strategy, you could do daily BCP extracts out to load the data into HDFS / S3 / etc. This would result in minimal impact

Re: 1.4.0 regression: out-of-memory errors on small data

2015-07-06 Thread Denny Lee
, Founder CTO, Swoop http://swoop.com/ @simeons http://twitter.com/simeons | blog.simeonov.com | 617.299.6746 From: Yin Huai yh...@databricks.com Date: Monday, July 6, 2015 at 12:59 AM To: Simeon Simeonov s...@swoop.com Cc: Denny Lee denny.g@gmail.com, Andy Huang andy.hu

Re: Spark SQL queries hive table, real time ?

2015-07-06 Thread Denny Lee
Within the context of your question, Spark SQL utilizing the Hive context is primarily about very fast queries. If you want to use real-time queries, I would utilize Spark Streaming. A couple of great resources on this topic include Guest Lecture on Spark Streaming in Stanford CME 323:

Re: Please add the Chicago Spark Users' Group to the community page

2015-07-06 Thread Denny Lee
Hey Dean, Sure, will take care of this. HTH, Denny On Tue, Jul 7, 2015 at 10:07 Dean Wampler deanwamp...@gmail.com wrote: Here's our home page: http://www.meetup.com/Chicago-Spark-Users/ Thanks, Dean Dean Wampler, Ph.D. Author: Programming Scala, 2nd Edition

Re: 1.4.0 regression: out-of-memory errors on small data

2015-07-05 Thread Denny Lee
I had run into the same problem where everything was working swimmingly with Spark 1.3.1. When I switched to Spark 1.4, either upgrading to Java 8 (from Java 7) or bumping up the PermGen size solved my issue. HTH! On Mon, Jul 6, 2015 at 8:31 AM Andy Huang andy.hu...@servian.com.au
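
For illustration, a minimal sketch of the PermGen workaround on Java 7 - the 256m value is an arbitrary example, and Java 8 removed PermGen entirely so this is unnecessary there:

    import org.apache.spark.SparkConf
    val conf = new SparkConf()
      .set("spark.driver.extraJavaOptions", "-XX:MaxPermSize=256m")
      .set("spark.executor.extraJavaOptions", "-XX:MaxPermSize=256m")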

Hive Skew flag?

2015-05-15 Thread Denny Lee
Just wondering if we have any timeline on when the hive skew flag will be included within SparkSQL? Thanks! Denny

Re: how to delete data from table in sparksql

2015-05-14 Thread Denny Lee
Delete from table is available as part of Hive 0.14 (reference: Apache Hive Language Manual DML - Delete https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DML#LanguageManualDML-Delete) while Spark 1.3 defaults to Hive 0.13. Perhaps rebuild Spark with Hive 0.14 or generate a new

Re: Spark Cluster Setup

2015-04-27 Thread Denny Lee
Similar to what Dean called out, we built Puppet manifests so we could do the automation - it's a bit of work to set up, but well worth the effort. On Fri, Apr 24, 2015 at 11:27 AM Dean Wampler deanwamp...@gmail.com wrote: It's mostly manual. You could try automating with something like Chef, of

Re: Start ThriftServer Error

2015-04-22 Thread Denny Lee
You may need to specify the hive port itself. For example, my own Thrift start command is in the form: ./sbin/start-thriftserver.sh --master spark://$myserver:7077 --driver-class-path $CLASSPATH --hiveconf hive.server2.thrift.bind.host $myserver --hiveconf hive.server2.thrift.port 1 HTH!

Re: Skipped Jobs

2015-04-19 Thread Denny Lee
Thanks for the correction Mark :) On Sun, Apr 19, 2015 at 3:45 PM Mark Hamstra m...@clearstorydata.com wrote: Almost. Jobs don't get skipped. Stages and Tasks do if the needed results are already available. On Sun, Apr 19, 2015 at 3:18 PM, Denny Lee denny.g@gmail.com wrote: The job

Re: Which version of Hive QL is Spark 1.3.0 using?

2015-04-17 Thread Denny Lee
Support for subqueries in predicates hasn't been resolved yet - please refer to SPARK-4226. BTW, Spark 1.3 defaults its bindings to Hive 0.13.1. On Fri, Apr 17, 2015 at 09:18 ARose ashley.r...@telarix.com wrote: So I'm trying to store the results of a query into a DataFrame, but I get the

Re: Microsoft SQL jdbc support from spark sql

2015-04-16 Thread Denny Lee
Bummer - out of curiosity, if you were to use classpath.first or perhaps copy the jar to the slaves, could that actually do the trick? The latter isn't really all that efficient, but just curious if that could work. On Thu, Apr 16, 2015 at 7:14 AM ARose ashley.r...@telarix.com wrote:

Re: Converting Date pattern in scala code

2015-04-14 Thread Denny Lee
If you're doing this in Scala per se, then you can probably just reference JodaTime or the Java Date / Time classes. If you are using Spark SQL, then you can use the various Hive date functions for conversion. On Tue, Apr 14, 2015 at 11:04 AM BASAK, ANANDA ab9...@att.com wrote: I need some help to convert
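
For illustration, a minimal Scala sketch of the plain-Java route; the source pattern, target pattern, and sample value are assumptions for the example:

    import java.text.SimpleDateFormat
    val source = new SimpleDateFormat("MM/dd/yyyy")
    val target = new SimpleDateFormat("yyyy-MM-dd")
    val converted = target.format(source.parse("04/14/2015"))  // "2015-04-14"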

Re: Which Hive version should be used for Spark 1.3

2015-04-09 Thread Denny Lee
By default Spark 1.3 has bindings to Hive 0.13.1 though you can bind it to Hive 0.12 if you specify it in the profile when building Spark as per https://spark.apache.org/docs/1.3.0/building-spark.html. If you are downloading a pre built version of Spark 1.3 - then by default, it is set to Hive

Re: SQL can't not create Hive database

2015-04-09 Thread Denny Lee
Can you create the database directly within Hive? If you're getting the same error within Hive, it sounds like a permissions issue as per Bojan. More info can be found at: http://stackoverflow.com/questions/15898211/unable-to-create-database-path-file-user-hive-warehouse-error On Thu, Apr 9,

Re: Microsoft SQL jdbc support from spark sql

2015-04-07 Thread Denny Lee
At this time, the JDBC data source is not extensible so it cannot support SQL Server. There were some thoughts - credit to Cheng Lian for this - about making the JDBC data source extensible for third-party support, possibly via Slick. On Mon, Apr 6, 2015 at 10:41 PM bipin bipin@gmail.com

Re: Microsoft SQL jdbc support from spark sql

2015-04-07 Thread Denny Lee
That's correct - at this time, MS SQL Server is not supported through the JDBC data source. In my environment, we've been using Hadoop streaming to extract out data from multiple SQL Servers, pushing the data into HDFS, creating the Hive tables and/or converting them into Parquet, and

Re: ArrayBuffer within a DataFrame

2015-04-03 Thread Denny Lee
something like this would work. You might need to play with the type. df.explode(arrayBufferColumn) { x => x } On Fri, Apr 3, 2015 at 6:43 AM, Denny Lee denny.g@gmail.com wrote: Thanks Dean - fun hack :) On Fri, Apr 3, 2015 at 6:11 AM Dean Wampler deanwamp...@gmail.com wrote: A hack
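
For illustration, a minimal sketch of that explode call against Spark 1.3's DataFrame.explode, assuming a hypothetical column named letters holding the ArrayBuffer:

    // emits one row per array element into a new column named "letter"
    val exploded = df.explode("letters", "letter") { letters: Seq[String] => letters }
    exploded.select("month", "letter").show()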

Re: ArrayBuffer within a DataFrame

2015-04-03 Thread Denny Lee
://typesafe.com @deanwampler http://twitter.com/deanwampler http://polyglotprogramming.com On Thu, Apr 2, 2015 at 10:45 PM, Denny Lee denny.g@gmail.com wrote: Thanks Michael - that was it! I was drawing a blank on this one for some reason - much appreciated! On Thu, Apr 2, 2015 at 8:27 PM

ArrayBuffer within a DataFrame

2015-04-02 Thread Denny Lee
Quick question - the output of a dataframe is in the format of: [2015-04, ArrayBuffer(A, B, C, D)] and I'd like to return it as:
2015-04, A
2015-04, B
2015-04, C
2015-04, D
What's the best way to do this? Thanks in advance!

Re: ArrayBuffer within a DataFrame

2015-04-02 Thread Denny Lee
, Apr 2, 2015 at 7:10 PM, Denny Lee denny.g@gmail.com wrote: Quick question - the output of a dataframe is in the format of: [2015-04, ArrayBuffer(A, B, C, D)] and I'd like to return it as: 2015-04, A 2015-04, B 2015-04, C 2015-04, D What's the best way to do this? Thanks

Re: Creating Partitioned Parquet Tables via SparkSQL

2015-04-01 Thread Denny Lee
Thanks Felix :) On Wed, Apr 1, 2015 at 00:08 Felix Cheung felixcheun...@hotmail.com wrote: This is tracked by these JIRAs.. https://issues.apache.org/jira/browse/SPARK-5947 https://issues.apache.org/jira/browse/SPARK-5948 -- From: denny.g@gmail.com Date:

Re: Anyone has some simple example with spark-sql with spark 1.3

2015-03-30 Thread Denny Lee
Hi Vincent, This may be a case of a missing semi-colon after your CREATE TEMPORARY TABLE statement. I ran your original statement (missing the semi-colon) and got the same error as you did. As soon as I added it in, I was good to go again: CREATE TEMPORARY TABLE jsonTable USING
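
For reference, the full statement being completed here is the jsonTable example from the Spark SQL docs, with the trailing semi-colon that the spark-sql shell requires:

    CREATE TEMPORARY TABLE jsonTable
    USING org.apache.spark.sql.json
    OPTIONS (
      path "examples/src/main/resources/people.json"
    );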

Re: Hive Table not found from Spark SQL

2015-03-27 Thread Denny Lee
Upon reviewing your other thread, could you confirm that the Hive metastore you can connect to via Hive is a MySQL database? And to also confirm, when you're running spark-shell and doing a show tables statement, you're getting the same error? On Fri, Mar 27, 2015 at 6:08 AM ÐΞ€ρ@Ҝ (๏̯͡๏)

Re: spark-sql throws org.datanucleus.store.rdbms.connectionpool.DatastoreDriverNotFoundException

2015-03-26 Thread Denny Lee
If you're not using MySQL as your metastore for Hive, out of curiosity what are you using? The error you are seeing is common when the correct driver to allow Spark to connect to the Hive metastore isn't there. As well, I noticed that you're using

Re: Handling Big data for interactive BI tools

2015-03-26 Thread Denny Lee
BTW, a tool that I have been using to help do the pre-aggregation of data using HyperLogLog in combination with Spark is AtScale (http://atscale.com/). It builds the aggregations and makes use of the speed of SparkSQL - all within the context of a model that is accessible by Tableau or Qlik. On

Re: Total size of serialized results is bigger than spark.driver.maxResultSize

2015-03-25 Thread Denny Lee
As you noted, you can change the spark.driver.maxResultSize value in your Spark configuration (https://spark.apache.org/docs/1.2.0/configuration.html). Please reference the Spark Properties section, noting that you can modify these properties via spark-defaults.conf or via SparkConf(). HTH!
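
For illustration, a minimal Scala sketch of setting it programmatically - "2g" is an arbitrary example value (0 removes the limit), and it must be set before the SparkContext starts:

    import org.apache.spark.{SparkConf, SparkContext}
    val conf = new SparkConf()
      .setAppName("example")
      .set("spark.driver.maxResultSize", "2g")
    val sc = new SparkContext(conf)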

Re: [SparkSQL] How to calculate stddev on a DataFrame?

2015-03-25 Thread Denny Lee
Perhaps this email reference may be able to help from a DataFrame perspective: http://mail-archives.apache.org/mod_mbox/incubator-spark-user/201503.mbox/%3CCALte62ztepahF=5hk9rcfbnyk4z43wkcq4fkdcbwmgf_3_o...@mail.gmail.com%3E On Wed, Mar 25, 2015 at 7:29 PM Haopu Wang hw...@qilinsoft.com wrote:

Re: Errors in SPARK

2015-03-24 Thread Denny Lee
* Cheers, Sandeep.v On Wed, Mar 25, 2015 at 11:10 AM, sandeep vura sandeepv...@gmail.com wrote: No I am just running ./spark-shell command in terminal I will try with above command On Wed, Mar 25, 2015 at 11:09 AM, Denny Lee denny.g@gmail.com wrote: Did you include the connection

Re: Errors in SPARK

2015-03-24 Thread Denny Lee
Did you include the connection to a MySQL connector jar so that way spark-shell / hive can connect to the metastore? For example, when I run my spark-shell instance in standalone mode, I use: ./spark-shell --master spark://servername:7077 --driver-class-path /lib/mysql-connector-java-5.1.27.jar

Re: Standalone Scheduler VS YARN Performance

2015-03-24 Thread Denny Lee
By any chance does this thread address look similar: http://apache-spark-developers-list.1001551.n3.nabble.com/Lost-executor-on-YARN-ALS-iterations-td7916.html ? On Tue, Mar 24, 2015 at 5:23 AM Harut Martirosyan harut.martiros...@gmail.com wrote: What is performance overhead caused by YARN,

Re: Hadoop 2.5 not listed in Spark 1.4 build page

2015-03-24 Thread Denny Lee
Hadoop 2.5 would be referenced via -Dhadoop.version=2.5.X using the profile -Phadoop-2.4. Please note earlier in the link the section: # Apache Hadoop 2.4.X or 2.5.X mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=VERSION -DskipTests clean package Versions of Hadoop after 2.5.X may or may not work with the

Re: Use pig load function in spark

2015-03-23 Thread Denny Lee
You may be able to utilize Spork (Pig on Apache Spark) as a mechanism to do this: https://github.com/sigmoidanalytics/spork On Mon, Mar 23, 2015 at 2:29 AM Dai, Kevin yun...@ebay.com wrote: Hi, all Can spark use pig’s load function to load data? Best Regards, Kevin.

Re: Using a different spark jars than the one on the cluster

2015-03-23 Thread Denny Lee
+1 - I currently am doing what Marcelo is suggesting as I have a CDH 5.2 cluster (with Spark 1.1) and I'm also running Spark 1.3.0+ side-by-side in my cluster. On Wed, Mar 18, 2015 at 1:23 PM Marcelo Vanzin van...@cloudera.com wrote: Since you're using YARN, you should be able to download a

Re: Should I do spark-sql query on HDFS or hive?

2015-03-23 Thread Denny Lee
From the standpoint of Spark SQL accessing the files - when it is hitting Hive, it is in effect hitting HDFS as well. Hive provides a great framework where the table structure is already well defined. But underneath it, Hive is just accessing files from HDFS so you are hitting HDFS either way.
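
To make that concrete, a minimal Scala sketch of the two routes against a Spark 1.3-era HiveContext - the table and warehouse path are hypothetical, and both ultimately read the same HDFS files:

    // via the Hive metastore: table name and schema come from Hive
    val viaHive = sqlContext.sql("SELECT * FROM mydb.mytable")
    // via the files directly: you point at the underlying HDFS location yourself
    val viaFiles = sqlContext.parquetFile("hdfs:///user/hive/warehouse/mytable")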

Re: Spark sql thrift server slower than hive

2015-03-22 Thread Denny Lee
How are you running your spark instance out of curiosity? Via YARN or standalone mode? When connecting Spark thriftserver to the Spark service, have you allocated enough memory and CPU when executing with spark? On Sun, Mar 22, 2015 at 3:39 AM fanooos dev.fano...@gmail.com wrote: We have

Re: takeSample triggers 2 jobs

2015-03-06 Thread Denny Lee
Hi Rares, If you dig into the descriptions for the two jobs, it will probably return something like:
Job ID: 1
org.apache.spark.rdd.RDD.takeSample(RDD.scala:447)
$line41.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.init(console:22)
...
Job ID: 0

Re: spark master shut down suddenly

2015-03-04 Thread Denny Lee
It depends on your setup but one of the locations is /var/log/mesos On Wed, Mar 4, 2015 at 19:11 lisendong lisend...@163.com wrote: I'm sorry, but how to look at the mesos logs? Where are they? On Mar 4, 2015, at 6:06 PM, Akhil Das ak...@sigmoidanalytics.com wrote: You can check in the mesos logs

Re: Use case for data in SQL Server

2015-02-24 Thread Denny Lee
Hi Suhel, My team is currently working with a lot of SQL Server databases as one of our many data sources and ultimately we pull the data into HDFS from SQL Server. As we had a lot of SQL databases to hit, we used the jTDS driver and SQOOP to extract the data out of SQL Server and into HDFS

Re: Unable to run hive queries inside spark

2015-02-24 Thread Denny Lee
The error message you have is: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. MetaException(message:file:/user/hive/warehouse/src is not a directory or unable to create one) Could you verify that you (the user you are running under) has the rights to create

Re: Unable to run hive queries inside spark

2015-02-24 Thread Denny Lee
<description>location of default database for the warehouse</description> </property> Do I need to do anything explicitly other than placing hive-site.xml in the spark.conf directory? Thanks!! On Wed, Feb 25, 2015 at 11:42 AM, Denny Lee denny.g@gmail.com wrote: The error message

Re: How to start spark-shell with YARN?

2015-02-24 Thread Denny Lee
It may have to do with the akka heartbeat interval per SPARK-3923 - https://issues.apache.org/jira/plugins/servlet/mobile#issue/SPARK-3923 ? On Tue, Feb 24, 2015 at 16:40 Xi Shen davidshe...@gmail.com wrote: Hi Sean, I launched the spark-shell on the same machine as I started YARN service. I

Re: Spark SQL odbc on Windows

2015-02-23 Thread Denny Lee
paper!! We were already using it as a guideline for our tests. Best regards, Francisco -- From: Denny Lee denny.g@gmail.com Sent: 22/02/2015 17:56 To: Ashic Mahtab as...@live.com; Francisco Orchard forch...@gmail.com; Apache Spark user@spark.apache.org

Re: Spark SQL odbc on Windows

2015-02-22 Thread Denny Lee
Hi Francisco, Out of curiosity - why ROLAP mode using multi-dimensional mode (vs tabular) from SSAS to Spark? As a past SSAS guy you've definitely piqued my interest. The one thing that you may run into is that the SQL generated by SSAS can be quite convoluted. When we were doing the same thing

Re: Spark SQL odbc on Windows

2015-02-22 Thread Denny Lee
Back to thrift, there was an earlier thread on this topic at http://mail-archives.apache.org/mod_mbox/spark-user/201411.mbox/%3CCABPQxsvXA-ROPeXN=wjcev_n9gv-drqxujukbp_goutvnyx...@mail.gmail.com%3E that may be useful as well. On Sun Feb 22 2015 at 8:42:29 AM Denny Lee denny.g@gmail.com wrote

Re: Spark 1.3 SQL Programming Guide and sql._ / sql.types._

2015-02-20 Thread Denny Lee
. On Fri, Feb 20, 2015 at 9:55 AM, Denny Lee denny.g@gmail.com wrote: Quickly reviewing the latest SQL Programming Guide https://github.com/apache/spark/blob/master/docs/sql-programming-guide.md (in github) I had a couple of quick questions: 1) Do we need to instantiate the SparkContext

Spark 1.3 SQL Programming Guide and sql._ / sql.types._

2015-02-20 Thread Denny Lee
Quickly reviewing the latest SQL Programming Guide https://github.com/apache/spark/blob/master/docs/sql-programming-guide.md (in github) I had a couple of quick questions: 1) Do we need to instantiate the SparkContext as per // sc is an existing SparkContext. val sqlContext = new

Re: Tableau beta connector

2015-02-05 Thread Denny Lee
and tableau can extract that RDD persisted on hive. Regards, Ashutosh -- *From:* Denny Lee denny.g@gmail.com *Sent:* Thursday, February 5, 2015 1:27 PM *To:* Ashutosh Trivedi (MT2013030); İsmail Keskin *Cc:* user@spark.apache.org *Subject:* Re: Tableau beta

Re: Tableau beta connector

2015-02-04 Thread Denny Lee
works. -- *From:* Denny Lee denny.g@gmail.com *Sent:* Thursday, February 5, 2015 12:20 PM *To:* İsmail Keskin; Ashutosh Trivedi (MT2013030) *Cc:* user@spark.apache.org *Subject:* Re: Tableau beta connector Some quick context behind how Tableau interacts

Re: Fail to launch spark-shell on windows 2008 R2

2015-02-03 Thread Denny Lee
Hi Ningjun, I have been working with Spark 1.2 on Windows 7 and Windows 2008 R2 (purely for development purposes). I had most recently installed them utilizing Java 1.8, Scala 2.10.4, and Spark 1.2 Precompiled for Hadoop 2.4+. A handy thread concerning the null\bin\winutils issue is addressed

Re: Spark (SQL) as OLAP engine

2015-02-03 Thread Denny Lee
A great presentation by Evan Chan on utilizing Cassandra as Jonathan noted is at: OLAP with Cassandra and Spark http://www.slideshare.net/EvanChan2/2014-07olapcassspark. On Tue Feb 03 2015 at 10:03:34 AM Jonathan Haddad j...@jonhaddad.com wrote: Write out the rdd to a cassandra table. The

Re: spark-shell can't import the default hive-site.xml options probably.

2015-02-01 Thread Denny Lee
, Denny Lee denny.g@gmail.com wrote: I may be missing something here, but typically the hive-site.xml configurations do not require you to place "s" within the configuration itself. Both the retry.delay and socket.timeout values are in seconds so you should only need to place the integer value

Re: spark-shell can't import the default hive-site.xml options probably.

2015-02-01 Thread Denny Lee
I may be missing something here, but typically the hive-site.xml configurations do not require you to place "s" within the configuration itself. Both the retry.delay and socket.timeout values are in seconds, so you should only need to place the integer value (which is in seconds). On Sun Feb

Spark 1.2 and Mesos 0.21.0 spark.executor.uri issue?

2014-12-30 Thread Denny Lee
I've been working with Spark 1.2 and Mesos 0.21.0 and while I have set the spark.executor.uri within spark-env.sh (and directly within bash as well), the Mesos slaves do not seem to be able to access the spark tgz file via HTTP or HDFS as per the message below. 14/12/30 15:57:35 INFO SparkILoop:

Re: S3 files , Spark job hungsup

2014-12-23 Thread Denny Lee
You should be able to kill the job using the webUI or via spark-class. More info can be found in the thread: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-kill-a-Spark-job-running-in-cluster-mode-td18583.html. HTH! On Tue, Dec 23, 2014 at 4:47 PM, durga durgak...@gmail.com wrote:

Re: Hadoop 2.6 compatibility?

2014-12-19 Thread Denny Lee
To clarify, there isn't a Hadoop 2.6 profile per se, but you can build using the hadoop-2.4 profile with -Dhadoop.version=2.6.0, which works with Hadoop 2.6. On Fri, Dec 19, 2014 at 12:55 Ted Yu yuzhih...@gmail.com wrote: You can use hadoop-2.4 profile and pass -Dhadoop.version=2.6.0 Cheers On Fri, Dec 19, 2014 at 12:51

Re: Hadoop 2.6 compatibility?

2014-12-19 Thread Denny Lee
Sorry Ted! I saw profile (-P) but missed the -D. My bad! On Fri, Dec 19, 2014 at 16:46 Ted Yu yuzhih...@gmail.com wrote: Here is the command I used: mvn package -Pyarn -Dyarn.version=2.6.0 -Phadoop-2.4 -Dhadoop.version=2.6.0 -Phive -DskipTests FYI On Fri, Dec 19, 2014 at 4:35 PM, Denny

Re: Spark Shell slowness on Google Cloud

2014-12-17 Thread Denny Lee
I'm curious if you're seeing the same thing when using bdutil against GCS? I'm wondering if this may be an issue concerning the transfer rate of Spark - Hadoop - GCS Connector - GCS. On Wed Dec 17 2014 at 10:09:17 PM Alessandro Baretta alexbare...@gmail.com wrote: All, I'm using the Spark

Re: Spark Shell slowness on Google Cloud

2014-12-17 Thread Denny Lee
. See the following.
alex@hadoop-m:~/split$ time bash -c gsutil ls gs://my-bucket/20141205/csv/*/*/* | wc -l
6860
real 0m6.971s
user 0m1.052s
sys 0m0.096s
Alex On Wed, Dec 17, 2014 at 10:29 PM, Denny Lee denny.g@gmail.com wrote: I'm curious if you're seeing the same

Re: Spark Shell slowness on Google Cloud

2014-12-17 Thread Denny Lee
to test this? But more importantly, what information would this give me? On Wed, Dec 17, 2014 at 10:46 PM, Denny Lee denny.g@gmail.com wrote: Oh, it makes sense if gsutil scans through this quickly, but I was wondering if running a Hadoop job / bdutil would result in just as fast scans

Limit the # of columns in Spark Scala

2014-12-14 Thread Denny Lee
I have a large number of files within HDFS that I would like to do a group by statement ala:
val table = sc.textFile("hdfs://")
val tabs = table.map(_.split("\t"))
I'm trying to do something similar to tabs.map(c => (c._(167), c._(110), c._(200))) where I create a new RDD that only has but that isn't
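
For illustration, a minimal Scala sketch of what the post appears to be after, with the array indexing corrected (a split Array is accessed as c(i), not c._(i)); the path and column indices are illustrative:

    val table = sc.textFile("hdfs:///path/to/files")
    val tabs  = table.map(_.split("\t"))
    // keep only the three columns of interest as a tuple
    val slim  = tabs.map(c => (c(167), c(110), c(200)))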

Re: Limit the # of columns in Spark Scala

2014-12-14 Thread Denny Lee
looks like the way to go given the context. What's not working? Kr, Gerard On Dec 14, 2014 5:17 PM, Denny Lee denny.g@gmail.com wrote: I have a large of files within HDFS that I would like to do a group by statement ala val table = sc.textFile(hdfs://) val tabs = table.map(_.split
