Re: Spark streaming job for kafka transaction does not consume read_committed messages correctly.

2024-04-14 Thread Kidong Lee
…Von Braun <https://en.wikipedia.org/wiki/Wernher_von_Braun>)". On Sun, 14 Apr 2024 at 13:40, Kidong Lee wrote: > Because spark streaming for kafka transaction does not work correctly to suit my need, I moved to another ap…

Re: Spark streaming job for kafka transaction does not consume read_committed messages correctly.

2024-04-14 Thread Kidong Lee
// commit offset. consumer.commitAsync(offsetMap, null); } }); In addition, I increased max.poll.records to 10. Even if this raw kafka consumer approach is not so scalable, it consumes read_committed messages from kafka correctly and is enough for me at the moment. - Kidong. On Fri, Apr 12, 2024…
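(For readers landing on this thread, a minimal sketch of the raw-consumer approach described above, assuming a Kafka 2.x client; the broker, group id, and topic are assumptions, not the poster's actual values:)

    import java.time.Duration
    import java.util.Properties
    import scala.collection.JavaConverters._
    import org.apache.kafka.clients.consumer.{KafkaConsumer, OffsetAndMetadata}
    import org.apache.kafka.common.TopicPartition

    val props = new Properties()
    props.put("bootstrap.servers", "localhost:9092")  // assumed broker
    props.put("group.id", "tx-reader")                // assumed group id
    props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
    props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
    props.put("isolation.level", "read_committed")    // only see committed transactional messages
    props.put("enable.auto.commit", "false")          // commit manually, as the poster does
    props.put("max.poll.records", "10")               // the small poll size mentioned above

    val consumer = new KafkaConsumer[String, String](props)
    consumer.subscribe(Seq("my-topic").asJava)        // assumed topic
    while (true) {
      val records = consumer.poll(Duration.ofSeconds(1)).asScala
      records.foreach { r =>
        // process r.value(), then commit the next offset for this partition
        val offsetMap = Map(
          new TopicPartition(r.topic, r.partition) -> new OffsetAndMetadata(r.offset + 1)
        ).asJava
        consumer.commitAsync(offsetMap, null)
      }
    }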

Re: Spark streaming job for kafka transaction does not consume read_committed messages correctly.

2024-04-13 Thread Kidong Lee
…best of my knowledge but of course cannot be guaranteed. It is essential to note that, as with any advice, quote "one test result is worth one-thousand expert opinions" (Wernher von Braun <https://en.wikipedia.org/…

Spark streaming job for kafka transaction does not consume read_committed messages correctly.

2024-04-12 Thread Kidong Lee
…transaction does not handle kafka consumer properties like isolation.level=read_committed and enable.auto.commit=false correctly. Any help appreciated. - Kidong. -- 이기동 / Kidong Lee. Email: mykid...@gmail.com Chango: https://cloudcheflabs.github.io/chango-private-docs Web Site: http://www.clo…
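(For context on the Structured Streaming side: consumer properties are passed to the Kafka source with a kafka. prefix, as in this minimal sketch; the broker and topic are assumptions. Note that Spark manages offsets itself and rejects an explicit enable.auto.commit setting.)

    val df = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092")   // assumed broker
      .option("subscribe", "my-topic")                    // assumed topic
      .option("kafka.isolation.level", "read_committed")  // pass-through consumer property
      .load()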

Re: First Time contribution.

2023-09-17 Thread Haejoon Lee
…messages that need improvement. Thanks! On Mon, Sep 18, 2023 at 9:33 AM Denny Lee wrote: > Hi Ram, We have some good guidance at https://spark.apache.org/contributing.html HTH! Denny. On Sun, Sep 17, 2023 at 17:18 ram manickam wrote:…

Re: First Time contribution.

2023-09-17 Thread Denny Lee
Hi Ram, We have some good guidance at https://spark.apache.org/contributing.html HTH! Denny. On Sun, Sep 17, 2023 at 17:18 ram manickam wrote: > Hello All, Recently joined this community and would like to contribute. Is there a guideline or recommendation on tasks that can be…

Unsubscribe

2023-06-29 Thread lee
Unsubscribe. 李杰 <leedd1...@163.com>

Re: Slack for PySpark users

2023-04-03 Thread Denny Lee
…communication mechanism rather than the user mailing list. Before Stack Overflow days, there had been a meaningful number of questions around user@. It's just impossible to let them go back and post…

Re: Slack for PySpark users

2023-03-30 Thread Denny Lee
;>>>>>> +1 >>>>>>> >>>>>>> + @d...@spark.apache.org >>>>>>> >>>>>>> This is a good idea. The other Apache projects (e.g., Pinot, Druid, >>>>>>> Flink) have created their

Re: What is the range of the PageRank value of graphx

2023-03-28 Thread lee
To: lee | Cc: user@spark.apache.org | Subject: Re: What is the range of the PageRank value of graphx. From the docs: Note that this is not the "normalized" PageRank and as a consequence pages that have no inlinks will have a PageRank of alpha. In particular, the pageranks may…
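(A minimal sketch of normalizing GraphX ranks so they sum to 1, for comparison with implementations that normalize; graph is an assumed Graph instance:)

    val ranks = graph.pageRank(0.0001).vertices   // un-normalized: values can exceed 1
    val total = ranks.map(_._2).sum()
    val normalized = ranks.mapValues(_ / total)   // now sums to 1, like the HugeGraph values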

What is the range of the PageRank value of graphx

2023-03-28 Thread lee
When I calculate pagerank using HugeGraph, each pagerank value is less than 1, and the total of the pageranks is 1. However, the PageRank values of graphx are often greater than 1, so what is the range of the PageRank value of graphx? 李杰 <leedd1...@163.com>

Re: Slack for PySpark users

2023-03-27 Thread Denny Lee
+1 I think this is a great idea! On Mon, Mar 27, 2023 at 6:24 PM Hyukjin Kwon wrote: > Yeah, actually I think we'd better have a slack channel so we can easily discuss with users and developers. On Tue, 28 Mar 2023 at 03:08, keen wrote: > Hi all, I really like Slack as…

Re: Topics for Spark online classes & webinars

2023-03-15 Thread Denny Lee
…The author will in no case be liable for any monetary damages arising from such loss, damage or destruction. On Wed, 15 Mar 2023 at 18:31, Nitin Bhansali wrote: > Hello Mich, My apologies ... but I am…

Re: Topics for Spark online classes & webinars

2023-03-15 Thread Denny Lee
Thanks Mich for tackling this! I encourage everyone to add to the list so we can have a comprehensive list of topics, eh?! On Wed, Mar 15, 2023 at 10:27 Mich Talebzadeh wrote: > Hi all, Thanks to @Denny Lee for giving access to https://www.linkedin.com/comp…

Re: Topics for Spark online classes & webinars

2023-03-14 Thread Denny Lee
…request to leverage the original Spark confluence page <https://cwiki.apache.org/confluence/display/SPARK>. WDYT? On Mon, Mar 13, 2023 at 9:34 AM Mich Talebzadeh wrote: > Well, that needs to be created first for this purpose. The appropriate name etc. to be decided. Maybe @Denny Lee…

Re: Online classes for spark topics

2023-03-12 Thread Denny Lee
Looks like we have some good topics here - I'm glad to help with setting up the infrastructure to broadcast, if that helps. On Thu, Mar 9, 2023 at 6:19 AM neeraj bhadani wrote: > I am happy to be a part of this discussion as well. Regards, Neeraj. On Wed, 8 Mar 2023 at 22:41, Winston Lai…

Re: Online classes for spark topics

2023-03-08 Thread Denny Lee
We used to run Spark webinars on the Apache Spark LinkedIn group but honestly the turnout was pretty low. We had dived into various features. If there are particular topics that you would like to discuss during a live session,…

Re: Prometheus with spark

2022-10-27 Thread Denny Lee
Hi Raja, A slightly atypical way to respond to your question - please check out the most recent Spark AMA where we discuss this: https://www.linkedin.com/posts/apachespark_apachespark-ama-committers-activity-6989052811397279744-jpWH?utm_source=share_medium=member_ios HTH! Denny. On Tue, Oct 25,…

Re: spark thrift server as hive on spark running on kubernetes, and more.

2021-12-14 Thread Kidong Lee
…on Kubernetes in cluster mode using the DataRoaster spark operator. Please see my blog on how to do so: https://t.co/T3SXG0mZFB Cheers, - Kidong Lee. On Fri, Sep 10, 2021 at 8:38 AM, Kidong Lee wrote: > Hi, Recently I have open-sourced a tool called DataRoaster (https://github.com/cloudcheflabs…

spark thrift server as hive on spark running on kubernetes, and more.

2021-09-09 Thread Kidong Lee
…see also the demo: https://github.com/cloudcheflabs/dataroaster#dataroaster-demo Thank you. - Kidong Lee.

Re: Databricks notebook - cluster taking a long time to get created, often timing out

2021-08-17 Thread Denny Lee
Hi Karan, You may want to ping Databricks Help or Forums as this is a Databricks-specific question. I'm a little surprised that a Databricks cluster would take a long time to create, so it may be best to utilize these…

Re: Append to an existing Delta Lake using structured streaming

2021-07-21 Thread Denny Lee
Including the Delta Lake Users and Developers DL to help out. Saying this, could you clarify how the data is not being added? By any chance do you have any code samples to recreate this? Sent via Superhuman. On Wed, Jul 21, 2021 at 2:49 AM, wrote:…

Re: How to unsubscribe

2020-05-06 Thread Denny Lee
Hi Fred, To unsubscribe, could you please email user-unsubscr...@spark.apache.org (for more information, please refer to https://spark.apache.org/community.html). Thanks! Denny. On Wed, May 6, 2020 at 10:12 AM Fred Liu wrote: > Hi guys…

Re: can we all help use our expertise to create an IT solution for Covid-19

2020-03-26 Thread Denny Lee
There are a number of really good datasets already available including (but not limited to): - South Korea COVID-19 Dataset - 2019 Novel Coronavirus COVID-19 (2019-nCoV) Data Repository by Johns Hopkins CSSE

unsubscribe

2019-12-30 Thread Jaebin Lee
unsubscribe

Re: Should python-2 be supported in Spark 3.0?

2019-05-31 Thread Denny Lee
+1. On Fri, May 31, 2019 at 17:58 Holden Karau wrote: > +1. On Fri, May 31, 2019 at 5:41 PM Bryan Cutler wrote: > +1 and the draft sounds good. On Thu, May 30, 2019, 11:32 AM Xiangrui Meng wrote: > Here is the draft announcement: === Plan for dropping Python 2…

unsubscribe

2019-03-28 Thread Byron Lee
unsubscribe

unsubscribe

2019-03-11 Thread Byron Lee

Re: Re:Writing RDDs to HDFS is empty

2019-01-07 Thread Jian Lee
Sorry, the code is too long; to put it simply, look at the photo. I define an ArrayBuffer containing "1 2", "2 3", and "4 5". I want to save it in HDFS, so I make it into an RDD with sc.…
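(A minimal runnable sketch of what is being described; the output path is an assumption, and the target directory must not already exist:)

    import scala.collection.mutable.ArrayBuffer

    val buf = ArrayBuffer("1 2", "2 3", "4 5")
    val rdd = sc.parallelize(buf)                  // sc is the active SparkContext
    rdd.saveAsTextFile("hdfs:///tmp/pairs-output") // assumed HDFS path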

Writing RDDs to HDFS is empty

2019-01-07 Thread Jian Lee
Hi all, In my experiment program I used Spark GraphX. When running in IDEA on Windows the result is right, but when running on the Linux distributed cluster the result in HDFS is empty. Why, and how can I solve this?

Re: Job hangs in blocked task in final parquet write stage

2018-12-11 Thread Conrad Lee
…Dec 4, 2018 at 9:45 AM Conrad Lee wrote: > Yeah, probably increasing the memory or increasing the number of output partitions would help. However, increasing the memory available to each executor would add expense. I want to keep the number of partitions low so that each parquet fi…

Re: Job hangs in blocked task in final parquet write stage

2018-12-04 Thread Conrad Lee
On Mon, Dec 3, 2018 at 2:48 AM Conrad Lee wrote: > Thanks for the thoughts. While the beginning of the job deals with lots of files in the first stage, they're first coalesced down into just a few thousand partitions. The part of the job that's failing is the redu…

Re: Job hangs in blocked task in final parquet write stage

2018-11-29 Thread Conrad Lee
…wrote: > I ran into problems using 5.19 so I referred to 5.17 and it resolved my issues. On Wed, Nov 28, 2018 at 2:48 AM Conrad Lee wrote: > Hello Vadim, Interesting. I've only been running this job at scale for a couple weeks so I ca…

Re: Job hangs in blocked task in final parquet write stage

2018-11-27 Thread Conrad Lee
…until two weeks ago everything was fine. We're trying to figure out with the EMR team where the issue is coming from. On Tue, Nov 27, 2018 at 6:29 AM Conrad Lee wrote: > Dear spark community, I'm running spark 2.3.2 on EMR 5.19.0. I've got a job t…

Re: Job hangs in blocked task in final parquet write stage

2018-11-27 Thread Conrad Lee
Dear spark community, I'm running spark 2.3.2 on EMR 5.19.0. I've got a job that's hanging in the final stage--the job usually works, but I see this hanging behavior in about one out of 50 runs. The second-to-last stage sorts the dataframe, and the final stage writes the dataframe to HDFS.

Re: Does Pyspark Support Graphx?

2018-02-18 Thread Denny Lee
…wrote: > Hi Denny, The pyspark script uses the --packages option to load the graphframes library; what about the SparkLauncher class? -- Original -- From: Denny Lee <denny.g@gmail.com> Date: Sun, Feb 18, 2018 1…

Re: Does Pyspark Support Graphx?

2018-02-17 Thread Denny Lee
…GraphFrames? On Sat, Feb 17, 2018 at 8:26 PM xiaobo <guxiaobo1...@qq.com> wrote: > Thanks Denny, will it be supported in the near future? -- Original -- From: Denny Lee <denny.g@gmail.com> Date: Sun, Feb…

Re: Does Pyspark Support Graphx?

2018-02-17 Thread Denny Lee
That’s correct - you can use GraphFrames though, as it does support PySpark. On Sat, Feb 17, 2018 at 17:36 94035420 wrote: > I cannot find anything for the graphx module in the python API document; does it mean it is not supported yet?

Spark loads data from HDFS or S3

2017-12-13 Thread Philip Lee
Hi, I have a few questions about the structure of HDFS and S3 when Spark loads data from the two storage systems. Generally, when Spark loads data from HDFS, HDFS supports data locality and already owns the distributed files on the datanodes, right? Spark can just process the data on the workers. What about S3?

Re: With 2.2.0 PySpark is now available for pip install from PyPI :)

2017-07-12 Thread Denny Lee
This is amazingly awesome! :) On Wed, Jul 12, 2017 at 13:23 lucas.g...@gmail.com wrote: > That's great! On 12 July 2017 at 12:41, Felix Cheung wrote: > Awesome! Congrats!! From:…

Re: Spark Shell issue on HDInsight

2017-05-14 Thread Denny Lee
…wrote: > Works for me too... you are a life-saver :) But the question: should/how do we report this to the Azure team? On Fri, May 12, 2017 at 10:32 AM, Denny Lee <denny.g@gmail.com> wrote: > I was able to repro your issue when I had downloaded the ja…

Re: Spark Shell issue on HDInsight

2017-05-11 Thread Denny Lee
…SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:736) at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:185) at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:210) at org.apache.spark.deploy.SparkSubmi…

Re: Spark Shell issue on HDInsight

2017-05-08 Thread Denny Lee
This appears to be an issue with the Spark to DocumentDB connector, specifically version 0.0.1. Could you run the 0.0.3 version of the jar and see if you're still getting the same error? i.e. spark-shell --master yarn --jars

Re: What is correct behavior for spark.task.maxFailures?

2017-04-24 Thread Dongjin Lee
Sumit, I think the post below describes exactly your case: https://blog.cloudera.com/blog/2017/04/blacklisting-in-apache-spark/ Regards, Dongjin -- Dongjin Lee, Software developer in Line+. So interested in massive-scale machine learning. facebook: http://www.facebook.com…

Re: Azure Event Hub with Pyspark

2017-04-20 Thread Denny Lee
As well, perhaps another option could be to use the Spark Connector to DocumentDB (https://github.com/Azure/azure-documentdb-spark) if sticking with Scala? On Thu, Apr 20, 2017 at 21:46 Nan Zhu wrote: > DocDB does have a java client? Anything preventing you from using that?

Support Stored By Clause

2017-03-27 Thread Denny Lee
Per SPARK-19630, wondering if there are plans to support "STORED BY" clause for Spark 2.x? Thanks!

Re: Issues: Generate JSON with null values in Spark 2.0.x

2017-03-21 Thread Dongjin Lee
…2017 at 4:44 PM, Chetan Khatri <chetan.opensou...@gmail.com> wrote: > Exactly. On Sat, Mar 11, 2017 at 1:35 PM, Dongjin Lee <dong...@apache.org> wrote: > Hello Chetan, Could you post some code? If I understood correctly, you are trying to…

Re: Issues: Generate JSON with null values in Spark 2.0.x

2017-03-11 Thread Dongjin Lee
…to SparkSQL [Row], but in scala how can we generate JSON with null values as a Dataframe? Thanks. -- Dongjin Lee, Software developer in Line+. So interested in massive-scale machine learning. facebook: www.facebook.com/dongjin.lee.kr
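(A small reproduction of the behavior under discussion, assuming a Spark 2.0 shell with implicits imported: toJSON drops fields whose value is null, which is why the nulls disappear.)

    val df = Seq(("a", null.asInstanceOf[String]), ("b", "c")).toDF("x", "y")
    df.toJSON.collect().foreach(println)
    // {"x":"a"}           <- the null "y" field is omitted entirely
    // {"x":"b","y":"c"}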

Re: unsubscribe

2017-01-09 Thread Denny Lee
Please unsubscribe by sending an email to user-unsubscr...@spark.apache.org HTH! On Mon, Jan 9, 2017 4:40 PM, william tellme williamtellme...@gmail.com wrote:

Re: UNSUBSCRIBE

2017-01-09 Thread Denny Lee
Please unsubscribe by sending an email to user-unsubscr...@spark.apache.org HTH! On Mon, Jan 9, 2017 4:41 PM, Chris Murphy - ChrisSMurphy.com cont...@chrissmurphy.com wrote: PLEASE!!

Pretrained Word2Vec models

2016-12-05 Thread Lee Becker
…<https://docs.google.com/file/d/0B7XkCwpI5KDYNlNUTTlSS21pQmM/edit?usp=sharing> or the Freebase model <https://docs.google.com/file/d/0B7XkCwpI5KDYaDBDQm1tZGNDRHc/edit?usp=sharing> to see how they perform before training my own. Thanks, Lee

Re: Spark app write too many small parquet files

2016-11-27 Thread Denny Lee
Generally, yes - you should try to have larger data sizes due to the overhead of opening up files. Typical guidance is between 64MB-1GB; personally I usually stick with 128MB-512MB with the default of snappy codec compression with parquet. A good reference is Vida Ha's presentation Data Storage
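(A minimal sketch of steering output toward fewer, larger files; the partition count and path are illustrative assumptions:)

    val out = df.repartition(64)            // fewer partitions => fewer, larger files
    out.write
      .option("compression", "snappy")      // the codec mentioned above
      .parquet("hdfs:///warehouse/mytable") // assumed output path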

Re: hope someone can recommend some books for me,a spark beginner

2016-11-06 Thread Denny Lee
There are a number of great resources to learn Apache Spark - a good starting point is the Apache Spark Documentation at: http://spark.apache.org/documentation.html The two books that immediately come to mind are - Learning Spark: http://shop.oreilly.com/product/mobile/0636920028512.do (there's

Re: Newbie question - Best way to bootstrap with Spark

2016-11-06 Thread Denny Lee
The one you're looking for is the Data Sciences and Engineering with Apache Spark at https://www.edx.org/xseries/data-science-engineering-apacher-sparktm. Note, a great quick start is the Getting Started with Apache Spark on Databricks at https://databricks.com/product/getting-started-guide HTH!

Re: How do I convert a data frame to broadcast variable?

2016-11-03 Thread Denny Lee
If you're able to read the data in as a DataFrame, perhaps you can use a BroadcastHashJoin so that you can join to that table, presuming it's small enough to be distributed? Here's a handy guide on a BroadcastHashJoin:
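(A minimal sketch; table names and the join key are assumptions:)

    import org.apache.spark.sql.functions.broadcast

    // marking the small side makes Spark plan a BroadcastHashJoin instead of a shuffle join
    val joined = largeDF.join(broadcast(smallDF), Seq("id"))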

Re: GraphFrame BFS

2016-11-01 Thread Denny Lee
You should be able to use GraphX or GraphFrames subgraph to build up your subgraph. A good example for GraphFrames can be found at http://graphframes.github.io/user-guide.html#subgraphs. HTH! On Mon, Oct 10, 2016 at 9:32 PM cashinpj wrote: > Hello, I have a set of data…
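(A minimal sketch following the user-guide pattern linked above; the column names are assumptions:)

    import org.graphframes.GraphFrame

    // filter vertices and edges, then rebuild a GraphFrame as the subgraph
    val v2 = g.vertices.filter("age > 30")
    val e2 = g.edges.filter("relationship = 'friend'")
    val sub = GraphFrame(v2, e2)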

Re: Returning DataFrame as Scala method return type

2016-09-08 Thread Lee Becker
On Thu, Sep 8, 2016 at 11:35 AM, Ashish Tadose wrote: > I wish to organize these dataframe operations by grouping them as Scala object methods. Something like below: object Driver { def main(args: Array[String]) { val df =…

collect_set without nulls (1.6 vs 2.0)

2016-09-07 Thread Lee Becker
…df.select(removenulls(collect_set($"x"))).show Any suggestions are appreciated. Thanks, Lee
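(One way to express the removenulls helper in the snippet above is a UDF; a minimal sketch assuming string-typed values:)

    import org.apache.spark.sql.functions.{collect_set, udf}

    val removenulls = udf((xs: Seq[String]) => xs.filter(_ != null))
    df.select(removenulls(collect_set($"x"))).show()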

dynamic allocation in Spark 2.0

2016-08-24 Thread Shane Lee
Hello all, I am running hadoop 2.6.4 with Spark 2.0 and I have been trying to get dynamic allocation to work, without success. I was able to get it to work with Spark 1.6.1, however. When I issue the command spark-shell --master yarn --deploy-mode client this is the error I see: 16/08/24 00:05:40…

Re: countDistinct, partial aggregates and Spark 2.0

2016-08-12 Thread Lee Becker
On Fri, Aug 12, 2016 at 11:55 AM, Lee Becker <lee.bec...@hapara.com> wrote: > val df = sc.parallelize(Array(("a", "a"), ("b", "c"), ("c", "a"))).toDF("x", "y") val grouped = df.groupBy($"…

countDistinct, partial aggregates and Spark 2.0

2016-08-12 Thread Lee Becker
…require computing each aggregation separately and joining later? Is there a partial aggregation version of collect_set? Thanks, Lee
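(For reference, the combination the thread is asking about, built on the sample data quoted in the reply above; per the thread, Spark 2.0 may not plan this in a single pass:)

    import org.apache.spark.sql.functions.{collect_set, countDistinct}

    val df = sc.parallelize(Array(("a", "a"), ("b", "c"), ("c", "a"))).toDF("x", "y")
    df.groupBy($"x").agg(countDistinct($"y"), collect_set($"y")).show()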

Re: SparkR error when repartition is called

2016-08-09 Thread Shane Lee
…Could you give more environment information? On Aug 9, 2016, at 11:35, Shane Lee <shane_y_...@yahoo.com.INVALID> wrote: Hi All, I am trying out SparkR 2.0 and have run into an issue with repartition. Here is the R code (essentially a port of the pi-calculating scala example in the s…

SparkR error when repartition is called

2016-08-08 Thread Shane Lee
Hi All, I am trying out SparkR 2.0 and have run into an issue with repartition. Here is the R code (essentially a port of the pi-calculating scala example in the spark package) that can reproduce the behavior: schema <- structType(structField("input", "integer"), structField("output",…

Re: Spark GraphFrames

2016-08-02 Thread Denny Lee
Hi Divya, Here's a blog post concerning On-Time Flight Performance with GraphFrames: https://databricks.com/blog/2016/03/16/on-time-flight-performance-with-graphframes-for-apache-spark.html It also includes a Databricks notebook that has the code in it. HTH! Denny On Tue, Aug 2, 2016 at 1:16

Spark 2.0 preview - How to configure warehouse for Catalyst? always pointing to /user/hive/warehouse

2016-06-17 Thread Andrew Lee
From branch-2.0, Spark 2.0.0 preview, I found it interesting that no matter what you do by configuring spark.sql.warehouse.dir, it will always pull up the default path, which is /user/hive/warehouse. In the code, I notice that at LOC45…

Re: streaming example has error

2016-06-15 Thread Lee Ho Yeung
…wrote: > Have you tried to “set spark.driver.allowMultipleContexts = true”? David Newberger. From: Lee Ho Yeung [mailto:jobmatt...@gmail.com] Sent: Tuesday, June 14, 2016 8:34 PM To: user@spark.apache.org Subject: streaming ex…

can spark help to prevent memory error for itertools.combinations(initlist, 2) in python script

2016-06-15 Thread Lee Ho Yeung
I wrote a python script which has itertools.combinations(initlist, 2), but it got a memory error when the number of elements in initlist was over 14,000. Is it possible to use spark to do this work? I have seen that yatel can do this; do spark and yatel use the hard disk as memory? If so, what needs to change in…
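(A minimal sketch of the pair generation in Spark, which can spill to disk rather than holding every combination in one process's memory; the integer elements stand in for the real initlist:)

    val items = sc.parallelize(1 to 14000)
    // unordered pairs, equivalent to itertools.combinations(initlist, 2)
    val pairs = items.cartesian(items).filter { case (a, b) => a < b }
    println(pairs.count())  // n * (n - 1) / 2 pairs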

Re: can not show all data for this table

2016-06-15 Thread Lee Ho Yeung
…Dr Mich Talebzadeh. LinkedIn: https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw http://talebzadehmich.wordpress.com…

Re: can not show all data for this table

2016-06-15 Thread Lee Ho Yeung
Dr Mich Talebzadeh. LinkedIn: https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw http://talebzadehmich.wordpress.com. On 15 June 2016 at 03:02, Lee…

Re: can not show all data for this table

2016-06-14 Thread Lee Ho Yeung
deep=1")) org.apache.spark.sql.AnalysisException: cannot resolve 'a0' given input columns: [a0a1a2a3a4a5a6a7a8 a9]; at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42) On Tue, Jun 14, 20

streaming example has error

2016-06-14 Thread Lee Ho Yeung
When I simulate streaming with nc -lk I got the error below; then I tried the example: martin@ubuntu:~/Downloads$ /home/martin/Downloads/spark-1.6.1/bin/run-example streaming.NetworkWordCount localhost Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties 16/06/14 18:33:06…

can not show all data for this table

2016-06-14 Thread Lee Ho Yeung
After trying the following commands, it cannot show the data: https://drive.google.com/file/d/0Bxs_ao6uuBDUVkJYVmNaUGx2ZUE/view?usp=sharing https://drive.google.com/file/d/0Bxs_ao6uuBDUc3ltMVZqNlBUYVk/view?usp=sharing /home/martin/Downloads/spark-1.6.1/bin/spark-shell --packages…

Re: Spark build failure with com.oracle:ojdbc6:jar:11.2.0.1.0

2016-05-09 Thread Andrew Lee
In fact, it does require ojdbc from Oracle which also requires a username and password. This was added as part of the testing scope for Oracle's docker. I notice this PR and commit in branch-2.0 according to https://issues.apache.org/jira/browse/SPARK-12941. In the comment, I'm not sure what

Re: Dataset aggregateByKey equivalent

2016-04-25 Thread Lee Becker
On Sat, Apr 23, 2016 at 8:56 AM, Michael Armbrust wrote: > Have you looked at aggregators? https://docs.cloud.databricks.com/docs/spark/1.6/index.html#examples/Dataset%20Aggregator.html Thanks for the pointer to aggregators. I wasn't yet aware of them. However,…
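(A minimal sketch of the Aggregator pattern being pointed to, in the Spark 2.x typed API; the Rec case class and the sum semantics are assumptions:)

    import org.apache.spark.sql.{Encoder, Encoders}
    import org.apache.spark.sql.expressions.Aggregator

    case class Rec(k: String, v: Long)

    // a typed sum, roughly the Dataset counterpart of aggregateByKey
    val sumV = new Aggregator[Rec, Long, Long] {
      def zero: Long = 0L
      def reduce(b: Long, r: Rec): Long = b + r.v
      def merge(b1: Long, b2: Long): Long = b1 + b2
      def finish(b: Long): Long = b
      def bufferEncoder: Encoder[Long] = Encoders.scalaLong
      def outputEncoder: Encoder[Long] = Encoders.scalaLong
    }.toColumn

    // usage: ds.groupByKey(_.k).agg(sumV).show()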

Re: Meetup in Rome

2016-02-19 Thread Denny Lee
Hey Domenico, Glad to hear that you love Spark and would like to organize a meetup in Rome. We created a Meetup-in-a-box to help with that - check out the post https://databricks.com/blog/2015/11/19/meetup-in-a-box.html. HTH! Denny On Fri, Feb 19, 2016 at 02:38 Domenico Pontari

reading ORC format on Spark-SQL

2016-02-10 Thread Philip Lee
What kind of steps exist when reading ORC format in Spark-SQL? I mean, reading a csv file is usually just reading the dataset directly into memory, but I feel like Spark-SQL has some extra steps when reading ORC format. For example, does it have to create a table to insert the dataset, and then insert the…

Spark Distribution of Small Dataset

2016-01-28 Thread Philip Lee
Hi, a simple question about Spark distribution of a small dataset. Let's say I have 8 machines with 48 cores and 48GB of RAM as a cluster. The dataset (ORC format, by Hive) is small, about 1GB, but I copied it to HDFS. 1) if spark-sql runs on the dataset distributed on HDFS across the machines, what happens…

Re: a question about web ui log

2016-01-26 Thread Philip Lee
…mmed, Author: Big Data Analytics with Spark <http://www.amazon.com/Big-Data-Analytics-Spark-Practitioners/dp/1484209656/> From: Philip Lee [mailto:philjj...@gmail.com] Sent: Monday, January 25, 2016 9:51 AM To: user@spark.apache.org Sub…

a question about web ui log

2016-01-25 Thread Philip Lee
Hello, a question about the web UI log. I could see the web interface log after forwarding the port on my cluster to my local machine and clicking a completed application, but when I clicked "application detail UI" [image: Inline image 1] it happened to me. I do not know why. I also checked the specific log…

Re: a question about web ui log

2016-01-25 Thread Philip Lee
…? 2) Still wondering how to see the log after copying the log file to my local machine. The error was mentioned in the previous mail. Thanks, Phil. On Mon, Jan 25, 2016 at 5:36 PM, Philip Lee <philjj...@gmail.com> wrote: > Hello, a question about the web UI log. I could see the web interface log a…

Re: How to compile Python and use How to compile Python and use spark-submit

2016-01-08 Thread Denny Lee
Per http://spark.apache.org/docs/latest/submitting-applications.html: For Python, you can use the --py-files argument of spark-submit to add .py, .zip or .egg files to be distributed with your application. If you depend on multiple Python files we recommend packaging them into a .zip or .egg.
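(For example, with assumed file names:)

    spark-submit --master yarn --py-files deps.zip main.py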

Re: subscribe

2016-01-08 Thread Denny Lee
To subscribe, please go to http://spark.apache.org/community.html to join the mailing list. On Fri, Jan 8, 2016 at 3:58 AM Jeetendra Gangele wrote:

Re: Intercept in Linear Regression

2015-12-15 Thread Denny Lee
If you're using model = LinearRegressionWithSGD.train(parseddata, iterations=100, step=0.01, intercept=True) then to get the intercept, you would use model.intercept More information can be found at:

Re: Best practises

2015-11-02 Thread Denny Lee
In addition, you may want to check out Tuning and Debugging in Apache Spark (https://sparkhub.databricks.com/video/tuning-and-debugging-apache-spark/). On Mon, Nov 2, 2015 at 05:27 Stefano Baghino wrote: > There is this interesting book from Databricks:…

Spark Survey Results 2015 are now available

2015-10-05 Thread Denny Lee
Thanks to all of you who provided valuable feedback in our Spark Survey 2015. Because of the survey, we have a better picture of who’s using Spark, how they’re using it, and what they’re using it to build: insights that will guide major updates to the Spark platform as we move into Spark’s next…

Re: [Question] ORC - EMRFS Problem

2015-09-13 Thread Cazen Lee
…bounds exception? I don't remember an array out of bounds problem off the top of my head. A stack trace will tell me a lot, obviously. If you are using Spark 1.4 that implies Hive 0.13, which is pretty old. It may be a problem that we fixed a while ago. Tha…

[Question] ORC - EMRFS Problem

2015-09-12 Thread Cazen Lee
Good Day! I think there are some problems between ORC and AWS EMRFS. When I was trying to read "upper 150M" ORC files from S3, an ArrayOutOfIndex exception occurred. I'm sure that it's an AWS-side issue because there was no exception when trying from HDFS or S3NativeFileSystem. Parquet runs…

how to ignore MatchError then processing a large json file in spark-sql

2015-08-02 Thread fuellee lee
I'm trying to process a bunch of large json log files with spark, but it fails every time with `scala.MatchError`, whether I give it a schema or not. I just want to skip lines that do not match the schema, but I can't find how in the spark docs. I know I could write a json parser and map it over the json file RDD…
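(One hedged option on newer Spark versions: the JSON reader's mode option can drop lines that don't match; the schema and path here are assumptions:)

    val df = spark.read
      .schema(logSchema)                // a predefined StructType for the expected fields
      .option("mode", "DROPMALFORMED")  // silently skip lines that fail to parse
      .json("hdfs:///logs/*.json")      // assumed input path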

Re: SQL Server to Spark

2015-07-23 Thread Denny Lee
It sort of depends on what you mean by optimized. There is a good thread on the topic at http://search-hadoop.com/m/q3RTtJor7QBnWT42/Spark+and+SQL+server/v=threaded If you have an archival-type strategy, you could do daily BCP extracts to load the data into HDFS / S3 / etc. This would result in minimal impact…

RE: The auxService:spark_shuffle does not exist

2015-07-21 Thread Andrew Lee
To: alee...@hotmail.com CC: zjf...@gmail.com; rp...@njit.edu; user@spark.apache.org. Hi all, Did you forget to restart the node managers after editing yarn-site.xml by any chance? -Andrew. On 2015-07-17 8:32 GMT-07:00, Andrew Lee <alee...@hotmail.com> wrote: I have encountered the same problem after following…

RE: The auxService:spark_shuffle does not exist

2015-07-21 Thread Andrew Lee
Hi Andrew, Thanks for the advice. I didn't see the log in the NodeManager, so apparently something was wrong with the yarn-site.xml configuration. After digging in more, I realized it was a user error. I'm sharing this so others may know what mistake I made. When I review…

RE: The auxService:spark_shuffle does not exist

2015-07-17 Thread Andrew Lee
I have encountered the same problem after following the document. Here's my spark-defaults.conf:
spark.shuffle.service.enabled true
spark.dynamicAllocation.enabled true
spark.dynamicAllocation.executorIdleTimeout 60
spark.dynamicAllocation.cachedExecutorIdleTimeout 120
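(For anyone hitting the same error: it usually means the NodeManagers don't advertise the spark_shuffle service. A sketch of the yarn-site.xml entries per the Spark docs; the shuffle service jar must also be on the NodeManager classpath, and the NodeManagers restarted:)

    <property>
      <name>yarn.nodemanager.aux-services</name>
      <value>mapreduce_shuffle,spark_shuffle</value>
    </property>
    <property>
      <name>yarn.nodemanager.aux-services.spark_shuffle.class</name>
      <value>org.apache.spark.network.yarn.YarnShuffleService</value>
    </property>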

Re: 1.4.0 regression: out-of-memory errors on small data

2015-07-06 Thread Denny Lee
…Founder CTO, Swoop http://swoop.com/ @simeons http://twitter.com/simeons | blog.simeonov.com | 617.299.6746. From: Yin Huai <yh...@databricks.com> Date: Monday, July 6, 2015 at 12:59 AM To: Simeon Simeonov <s...@swoop.com> Cc: Denny Lee <denny.g@gmail.com>, Andy Huang <andy.hu…

Re: Spark SQL queries hive table, real time ?

2015-07-06 Thread Denny Lee
Within the context of your question, Spark SQL utilizing the Hive context is primarily about very fast queries. If you want to use real-time queries, I would utilize Spark Streaming. A couple of great resources on this topic include Guest Lecture on Spark Streaming in Stanford CME 323:

Re: Please add the Chicago Spark Users' Group to the community page

2015-07-06 Thread Denny Lee
Hey Dean, Sure, will take care of this. HTH, Denny On Tue, Jul 7, 2015 at 10:07 Dean Wampler deanwamp...@gmail.com wrote: Here's our home page: http://www.meetup.com/Chicago-Spark-Users/ Thanks, Dean Dean Wampler, Ph.D. Author: Programming Scala, 2nd Edition

Re: 1.4.0 regression: out-of-memory errors on small data

2015-07-05 Thread Denny Lee
I had run into the same problem: everything was working swimmingly with Spark 1.3.1, but when I switched to Spark 1.4, either upgrading to Java 8 (from Java 7) or knocking up the PermGen size solved my issue. HTH! On Mon, Jul 6, 2015 at 8:31 AM Andy Huang <andy.hu...@servian.com.au…
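(For reference, one way to knock up PermGen without touching the JVM install is via spark-defaults.conf; the size is an assumption:)

    spark.driver.extraJavaOptions -XX:MaxPermSize=512m
    spark.executor.extraJavaOptions -XX:MaxPermSize=512m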

RE: [Spark 1.3.1 on YARN on EMR] Unable to instantiate org.apache.hadoop.hive.metastore.HiveMetaStoreClient

2015-06-20 Thread Andrew Lee
Hi Roberto, I'm not an EMR person, but it looks like option -h is deploying the necessary datanucleus JARs for you. The requirements for HiveContext are the hive-site.xml and the datanucleus JARs. As long as these 2 are there, and Spark is compiled with -Phive, it should work. spark-shell runs in…

Re: SparkContext Threading

2015-06-06 Thread Lee McFadden
…2015, 12:21 AM Will Briggs <wrbri...@gmail.com> wrote: Hi Lee, it's actually not related to threading at all - you would still have the same problem even if you were using a single thread. See this section (https://spark.apache.org/docs/latest/programming-guide.html#passing-functions-to-spark…

SparkContext Threading

2015-06-05 Thread Lee McFadden
and haven't found any docs to point me in the right direction. Does anyone have any advice on how to get jobs submitted by multiple threads? The jobs are fairly simple and work when I run them serially, so I'm not exactly sure what I'm doing wrong. Thanks, Lee
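(A minimal sketch of submitting jobs from multiple threads against one shared SparkContext, which is thread-safe for job submission; the work inside each Future is a placeholder:)

    import java.util.concurrent.Executors
    import scala.concurrent.duration.Duration
    import scala.concurrent.{Await, ExecutionContext, Future}

    implicit val ec = ExecutionContext.fromExecutor(Executors.newFixedThreadPool(4))

    // each Future submits an independent job on the shared SparkContext
    val jobs = (1 to 4).map { i =>
      Future { sc.parallelize(1 to 1000).map(_ * i).sum() }
    }
    val results = jobs.map(Await.result(_, Duration.Inf))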

Re: SparkContext Threading

2015-06-05 Thread Lee McFadden
…although it's not really required at the moment, as I am only submitting one job until I get this issue straightened out :) Thanks, Lee. On Fri, Jun 5, 2015 at 11:50 AM Marcelo Vanzin <van...@cloudera.com> wrote: On Fri, Jun 5, 2015 at 11:48 AM, Lee McFadden <splee...@gmail.com> wrote: Initially I…
