[ANNOUNCE] Apache Kyuubi released 1.9.0

2024-03-18 Thread Binjie Yang
Hi all, The Apache Kyuubi community is pleased to announce that Apache Kyuubi 1.9.0 has been released! Apache Kyuubi is a distributed and multi-tenant gateway to provide serverless SQL on data warehouses and lakehouses. Kyuubi provides a pure SQL gateway through Thrift JDBC/ODBC interface for

Fwd: the life cycle shuffle Dependency

2023-12-27 Thread yang chen
hi, I'm learning Spark and wonder when shuffle data gets deleted. I found the ContextCleaner class, which cleans the shuffle data when the shuffle dependency is GC-ed. Based on the source code, the shuffle dependency is GC-ed only when the active job finishes, but I'm not sure. Could you explain the life cycle of
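
A minimal Scala sketch (not from the thread) of when the ContextCleaner can reclaim shuffle files: the ShuffleDependency is tracked through a weak reference, so its shuffle data becomes eligible for cleanup once no live RDD on the driver references it any more, not only at job completion.

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("shuffle-cleanup-sketch"))

    var reduced = sc.parallelize(1 to 1000000)
      .map(i => (i % 100, 1))
      .reduceByKey(_ + _)      // creates a ShuffleDependency and writes shuffle files
    reduced.count()

    // Drop the only driver-side reference to the shuffled RDD; after a GC the
    // ContextCleaner sees the dependency is unreachable and removes its shuffle
    // files asynchronously (System.gc() is only a hint to the JVM).
    reduced = null
    System.gc()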

Re: [PySpark Structured Streaming] How to tune .repartition(N) ?

2023-10-04 Thread Shao Yang Hong
> Also note that there is an alternative option of using coalesce() instead of > repartition(). > -- > Raghavendra > > > On Thu, Oct 5, 2023 at 10:15 AM Shao Yang Hong > wrote: >> >> Hi all on user@spark: >> >> We are looking for advice and suggestions on how to tune the >
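
A minimal Scala sketch (the thread itself is PySpark; partition counts are placeholders) of the distinction being discussed: repartition(N) does a full shuffle, while coalesce(N) only merges existing partitions.

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("repartition-vs-coalesce").getOrCreate()

    val df = spark.range(0, 1000000).toDF("id")

    // Full shuffle: can both increase and decrease the partition count and
    // rebalances skewed data, at the cost of moving every row.
    val reshuffled = df.repartition(200)

    // No shuffle: only merges existing partitions, so it is usually cheaper when
    // reducing the partition count right before a write.
    val merged = df.coalesce(20)

    println(s"${reshuffled.rdd.getNumPartitions} vs ${merged.rdd.getNumPartitions}")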

[PySpark Structured Streaming] How to tune .repartition(N) ?

2023-10-04 Thread Shao Yang Hong
Name(APP_NAME) .outputMode("append") .format("delta") .partitionBy(CREATED_DATE) .option("checkpointLocation", os.environ["CHECKPOINT"]) .start(os.environ["DELTA_PATH"]) ) query.awaitTermination() sp

Re: Write Spark Connection client application in Go

2023-09-14 Thread bo yang
at’s so cool! Great work y’all :) >> >> On Tue, Sep 12, 2023 at 8:14 PM bo yang wrote: >> >>> Hi Spark Friends, >>> >>> Anyone interested in using Golang to write Spark application? We created >>> a Spark Connect Go Client library >>>

Write Spark Connection client application in Go

2023-09-12 Thread bo yang
Hi Spark Friends, Anyone interested in using Golang to write Spark applications? We created a Spark Connect Go Client library. Would love to hear feedback/thoughts from the community. Please see the quick start guide

Re: [ANNOUNCE] Apache Spark 3.2.3 released

2022-11-30 Thread Yang,Jie(INF)
Thanks, Chao! From: Maxim Gekk Date: Wednesday, November 30, 2022, 19:40 To: Jungtaek Lim Cc: Wenchen Fan, Chao Sun, dev, user Subject: Re: [ANNOUNCE] Apache Spark 3.2.3 released Thank you, Chao! On Wed, Nov 30, 2022 at 12:42 PM Jungtaek Lim <kabhwan.opensou...@gmail.com> wrote: Thanks Chao for

Re: [EXTERNAL] Re: Re: Stage level scheduling - lower the number of executors when using GPUs

2022-11-03 Thread bo yang
Interesting discussion here; it looks like Spark does not support configuring different numbers of executors in different stages. Would love to see the community come out with such a feature. On Thu, Nov 3, 2022 at 9:10 AM Shay Elbaz wrote: > Thanks again Artemis, I really appreciate it. I have watched
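
For reference, Spark 3.1+ does offer stage-level scheduling through ResourceProfiles; a minimal sketch follows (the discovery script and paths are hypothetical, and the executor count itself is still governed by dynamic allocation on YARN/K8s rather than set per stage).

    import org.apache.spark.resource.{ExecutorResourceRequests, ResourceProfileBuilder, TaskResourceRequests}
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("stage-level-scheduling-sketch").getOrCreate()
    val sc = spark.sparkContext

    // Executor/task requirements for the GPU stage only.
    val gpuExecutors = new ExecutorResourceRequests()
      .cores(4)
      .memory("16g")
      .resource("gpu", 1, "/opt/spark/scripts/getGpus.sh")   // hypothetical discovery script
    val gpuTasks = new TaskResourceRequests().cpus(1).resource("gpu", 1)
    val gpuProfile = new ResourceProfileBuilder().require(gpuExecutors).require(gpuTasks).build()

    val prepared = sc.textFile("hdfs:///data/input")          // hypothetical path; CPU-only stage
      .map(_.length)

    // Only the stages computing this RDD request the GPU profile.
    prepared.withResources(gpuProfile).map(_ * 2).count()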

Re: [ANNOUNCE] Apache Spark 3.3.1 released

2022-10-26 Thread Yang,Jie(INF)
Thanks Yuming and all developers ~ Yang Jie From: Maxim Gekk Date: Wednesday, October 26, 2022, 15:19 To: Hyukjin Kwon Cc: "L. C. Hsieh", Dongjoon Hyun, Yuming Wang, dev, User Subject: Re: [ANNOUNCE] Apache Spark 3.3.1 released Congratulations everyone with the new release, and thanks to Yumi

Re: [Java 17] --add-exports required?

2022-06-23 Thread Yang,Jie(INF)
are used to pass all Spark UTs, but maybe you don't need all of them. However, these options don't need to be added explicitly when using spark-shell, spark-sql and spark-submit, though others may need to be added as needed for Java 17. Maybe some instructions should be added to the documentation. Yang Jie From: Greg Kopff Date
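
A minimal sketch (an assumption about one common way to pass such flags, not taken from the thread): executor-side JPMS options can be set through extraJavaOptions, while driver-side options normally go on the spark-submit command line or into spark-defaults.conf, because the driver JVM is already running at this point.

    import org.apache.spark.SparkConf
    import org.apache.spark.sql.SparkSession

    val conf = new SparkConf()
      .setAppName("java17-add-exports-sketch")
      .set("spark.executor.extraJavaOptions",
           "--add-exports java.base/sun.nio.ch=ALL-UNNAMED")

    val spark = SparkSession.builder().config(conf).getOrCreate()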

Re: [Java 17] --add-exports required?

2022-06-22 Thread Yang,Jie(INF)
Hi, Greg "--add-exports java.base/sun.nio.ch=ALL-UNNAMED " does not need to be added when SPARK-33772 is completed, so in order to answer your question, I need more details for testing: 1. Where can I download Java 17 (Temurin-17+35)? 2. What test commands do you use? Yang Jie 在

Re: Reverse proxy for Spark UI on Kubernetes

2022-05-17 Thread bo yang
Yes, it should be possible, any interest to work on this together? Need more hands to add more features here :) On Tue, May 17, 2022 at 2:06 PM Holden Karau wrote: > Could we make it do the same sort of history server fallback approach? > > On Tue, May 17, 2022 at 10:41 PM bo ya

Re: Reverse proxy for Spark UI on Kubernetes

2022-05-17 Thread bo yang
is to behave like that Web Application Proxy. It will simplify settings to access Spark UI on Kubernetes. On Mon, May 16, 2022 at 11:46 PM wilson wrote: > what's the advantage of using reverse proxy for spark UI? > > Thanks > > On Tue, May 17, 2022 at 1:47 PM bo yang wrote: >

Re: Reverse proxy for Spark UI on Kubernetes

2022-05-17 Thread bo yang
Thanks Holden :) On Mon, May 16, 2022 at 11:12 PM Holden Karau wrote: > Oh that’s rad  > > On Tue, May 17, 2022 at 7:47 AM bo yang wrote: > >> Hi Spark Folks, >> >> I built a web reverse proxy to access Spark UI on Kubernetes (working >> together with >&

Reverse proxy for Spark UI on Kubernetes

2022-05-16 Thread bo yang
Hi Spark Folks, I built a web reverse proxy to access Spark UI on Kubernetes (working together with https://github.com/GoogleCloudPlatform/spark-on-k8s-operator). Want to share here in case other people have similar need. The reverse proxy code is here:

Re: Spark Parquet write OOM

2022-03-01 Thread Yang,Jie(INF)
This is a DirectByteBuffer OOM, so plan 2 may not work. We can increase the capacity of the DirectByteBuffer by configuring `-XX:MaxDirectMemorySize`, which is a Java option. However, we'd better check the amount of memory to be allocated, because `-XX:MaxDirectMemorySize` and `-Xmx` should
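
A minimal sketch of the knob mentioned above (the values are placeholders, not recommendations): raise the direct-buffer limit on the executors, since -Xmx does not bound DirectByteBuffer allocations.

    import org.apache.spark.SparkConf
    import org.apache.spark.sql.SparkSession

    val conf = new SparkConf()
      .setAppName("direct-memory-sketch")
      .set("spark.executor.extraJavaOptions", "-XX:MaxDirectMemorySize=2g")
      .set("spark.executor.memoryOverhead", "3g")   // keep the container limit consistent

    val spark = SparkSession.builder().config(conf).getOrCreate()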

Re: One click to run Spark on Kubernetes

2022-02-23 Thread bo yang
> chart to > deploy Spark and some other stuff on K8S? > > On Wed, Feb 23, 2022 at 17:49, bo yang wrote: > >> Hi Sarath, let's follow up offline on this. >> >> On Wed, Feb 23, 2022 at 8:32 AM Sarath Annareddy < >> sarath.annare...@gmail.com> wrote: >> &

Re: One click to run Spark on Kubernetes

2022-02-23 Thread bo yang
Hi Sarath, let's follow up offline on this. On Wed, Feb 23, 2022 at 8:32 AM Sarath Annareddy wrote: > Hi bo > > How do we start? > > Is there a plan? Onboarding, Arch/design diagram, tasks lined up etc > > > Thanks > Sarath > > > Sent from my iPhone >

Re: One click to run Spark on Kubernetes

2022-02-23 Thread bo yang
Guidance is appreciated. > > Sarath > > Sent from my iPhone > > On Feb 23, 2022, at 2:01 AM, bo yang wrote: > >  > > Right, normally people start with simple script, then add more stuff, like > permission and more components. After some time, people want to run th

Re: One click to run Spark on Kubernetes

2022-02-23 Thread bo yang
liable for any monetary damages arising from > such loss, damage or destruction. > > > > > On Wed, 23 Feb 2022 at 04:06, bo yang wrote: > >> Hi Spark Community, >> >> We built an open source tool to deploy and run Spark on Kubernetes with a >> one click

Re: One click to run Spark on Kubernetes

2022-02-22 Thread bo yang
you share link to the source? > > On Wed, Feb 23, 2022, 6:52, bo yang wrote: > >> We do not have SaaS yet. Now it is an open source project we build in our >> part time, and we welcome more people working together on that. >> >> You could specify cluste

Re: One click to run Spark on Kubernetes

2022-02-22 Thread bo yang
r > about 1 hour. Do you have the SaaS solution for this? I can pay as I did. > > Thanks > > On Wed, Feb 23, 2022 at 12:21 PM bo yang wrote: > >> It is not a standalone spark cluster. In some details, it deploys a Spark >> Operator (https://github.com/GoogleCloudPlatfo

Re: One click to run Spark on Kubernetes

2022-02-22 Thread bo yang
ion of spark? or just the standalone node? > > Thanks > > On Wed, Feb 23, 2022 at 12:06 PM bo yang wrote: > >> Hi Spark Community, >> >> We built an open source tool to deploy and run Spark on Kubernetes with a >> one click command. For example, on AWS, it co

One click to run Spark on Kubernetes

2022-02-22 Thread bo yang
Hi Spark Community, We built an open source tool to deploy and run Spark on Kubernetes with a one click command. For example, on AWS, it could automatically create an EKS cluster, node group, NGINX ingress, and Spark Operator. Then you will be able to use curl or a CLI tool to submit Spark

Re: [ANNOUNCE] Apache Kyuubi (Incubating) released 1.4.1-incubating

2022-01-30 Thread Vino Yang
apache.org/ Best, Vino Bitfox wrote on Mon, Jan 31, 2022 at 14:49: > > What’s the difference between Spark and Kyuubi? > > Thanks > > On Mon, Jan 31, 2022 at 2:45 PM Vino Yang wrote: >> >> Hi all, >> >> The Apache Kyuubi (Incubating) community is pleased to announce tha

[ANNOUNCE] Apache Kyuubi (Incubating) released 1.4.1-incubating

2022-01-30 Thread Vino Yang
Hi all, The Apache Kyuubi (Incubating) community is pleased to announce that Apache Kyuubi (Incubating) 1.4.1-incubating has been released! Apache Kyuubi (Incubating) is a distributed multi-tenant JDBC server for large-scale data processing and analytics, built on top of Apache Spark and

Re: Log4J 2 Support

2021-11-10 Thread Yang,Jie(INF)
It may be more feasible to replace the current slf4j + log4j with the log4j2 API. Some projects that Spark relies on may also use log4j at the code level, such as EventCounter and ContainerLogAppender in Hadoop, so directly removing the dependency on log4j may leave some code dependencies missing.

Re: Unsubscribe

2021-08-03 Thread Howard Yang
Unsubscribe Edward Wu wrote on Tue, Aug 3, 2021 at 4:15 PM: > Unsubscribe >

Re: Unsubscribe

2021-07-13 Thread Howard Yang
Unsubscribe Eric Wang wrote on Mon, Jul 12, 2021 at 7:31 AM: > Unsubscribe > > On Sun, Jul 11, 2021 at 9:59 PM Rishi Raj Tandon < > tandon.rishi...@gmail.com> wrote: > >> Unsubscribe >> >

RE: Spark UI Storage Memory

2020-12-04 Thread Jack Yang
unsubscribe

Re: How to submit a job via REST API?

2020-11-25 Thread Zhou Yang
:55, vaquar khan <vaquar.k...@gmail.com> wrote: Hi Yang, Please find the following link https://stackoverflow.com/questions/63677736/spark-application-as-a-rest-service/63678337#63678337 Regards, Vaquar khan On Wed, Nov 25, 2020 at 12:40 AM Sonal Goyal <sonalgoy...@gmail.com> wrote: Y

How to submit a job via REST API?

2020-11-23 Thread Zhou Yang
Dear experts, I found a convenient way to submit a job via REST API at https://gist.github.com/arturmkrtchyan/5d8559b2911ac951d34a#file-submit_job-sh. But I do not know whether I can append a `--conf` parameter like I do in spark-submit. Can someone help me with this issue? Regards, Yang

Anyone interested in Remote Shuffle Service

2020-10-21 Thread bo yang
Hi Spark Users, Uber open sourced Remote Shuffle Service ( https://github.com/uber/RemoteShuffleService ) recently. It works with open source Spark version without code change needed, and could store shuffle data on separate machines other than Spark executors. Anyone interested to try? Also we

unsubscribe

2019-06-24 Thread Song Yang

How does org.apache.spark.sql.catalyst.util.MapData support hash lookup?

2019-05-08 Thread Shawn Yang
Hi guys, I'm reading the Spark source code. When I read org.apache.spark.sql.catalyst.util.ArrayBasedMapData and org.apache.spark.sql.catalyst.expressions.UnsafeMapData, I can't understand how they support hash lookup. Is there anything I'm missing?

Re: unsubscribe

2019-04-27 Thread Song Yang
> > unsubscribe >

Re: Spark Profiler

2019-03-28 Thread bo yang
Yeah, these options are very valuable. Just add another option :) We build a jvm profiler (https://github.com/uber-common/jvm-profiler) to monitor and profile Spark applications in large scale (e.g. sending metrics to kafka / hive for batch analysis). People could try it as well. On Wed, Mar 27,

Re: Using Apache Kylin as data source for Spark

2018-05-25 Thread Li Yang
That is very useful~~ :-) On Fri, May 18, 2018 at 11:56 AM, ShaoFeng Shi wrote: > Hello, Kylin and Spark users, > > A doc is newly added in Apache Kylin website on how to using Kylin as a > data source in Spark; > This can help the users who want to use Spark to

how to kill application

2018-03-26 Thread Shuxin Yang
'. How can I kill those applications in S1 - S2 (i.e. alive Spark app, dead YARN app)? It looks like not closing the SparkContext could cause this problem. However, I'm not always able to close the context, for example when my program crashes prematurely. Tons of thanks in advance! Shuxin Yang

Re: HDP 2.5 - Python - Spark-On-Hbase

2017-06-26 Thread Weiqing Yang
>- The ability to configure closure serializer >>- HTTPBroadcast >>- TTL-based metadata cleaning >>- *Semi-private class org.apache.spark.Logging. We suggest you use >> slf4j directly.* >>- SparkContext.metricsSystem >> >> Thanks, >> &

Meetup in Taiwan

2017-06-25 Thread Yang Bryan
Hi, I'm Bryan, the co-founder of the Taiwan Spark User Group. We discuss and share information on https://www.facebook.com/groups/spark.tw/ and have physical meetups twice a month. Please help us get added to the official website. Also, we will hold a coding competition about Spark; could we print the logo of

Re: HDP 2.5 - Python - Spark-On-Hbase

2017-06-23 Thread Weiqing Yang
Yes. What SHC version you were using? If hitting any issues, you can post them in SHC github issues. There are some threads about this. On Fri, Jun 23, 2017 at 5:46 AM, ayan guha wrote: > Hi > > Is it possible to use SHC from Hortonworks with pyspark? If so, any > working

Re: Use SQL Script to Write Spark SQL Jobs

2017-06-14 Thread bo yang
> Regards, > > On Wed, 14 Jun 2017 at 04:32, bo yang <bobyan...@gmail.com> wrote: > >> Thanks Benjamin and Ayan for the feedback! You kind of represent two >> group of people who need such script tool or not. Personally I find the >> script is very useful for m

Re: Use SQL Script to Write Spark SQL Jobs

2017-06-13 Thread bo yang
;> interface, such as Talend, SSIS, etc. There is a small amount of scripting >> involved but not too much. I looked at what you are trying to do, and I >> welcome it. This could open up Spark to the masses and shorten development >> times. >> >> Cheers, >> Ben >>

Re: Use SQL Script to Write Spark SQL Jobs

2017-06-12 Thread bo yang
> > On 12-Jun-2017 11:00 AM, "bo yang" <bobyan...@gmail.com> wrote: > >> Hi Guys, >> >> I am writing a small open source project >> <https://github.com/uber/uberscriptquery> to use SQL Script to write >> Spark Jobs. Want to see if ther

Use SQL Script to Write Spark SQL Jobs

2017-06-11 Thread bo yang
Hi Guys, I am writing a small open source project to use SQL Script to write Spark Jobs. I want to see if there are other people interested in using or contributing to this project. The project is called UberScriptQuery (

Re: how to mark a (bean) class with schema for catalyst ?

2017-05-09 Thread Yang
s.com/public/4027ec902e239c93eaaa8714f173bcfc/1023043053387187/908554720841389/2840265927289860/latest.html> > in > Spark 2.1. > > On Tue, May 9, 2017 at 12:10 PM, Yang <tedd...@gmail.com> wrote: > >> somehow the schema check is here >> >> https://g

Re: how to mark a (bean) class with schema for catalyst ?

2017-05-09 Thread Yang
/main/scala/org/apache/spark/mllib/tree/configuration/Strategy.scala#L71-L83>. > If you are using scala though I'd consider using the case class encoders. > > On Tue, May 9, 2017 at 12:21 AM, Yang <tedd...@gmail.com> wrote: > >> I'm trying to use Encoders.bean() t

Re: how to mark a (bean) class with schema for catalyst ?

2017-05-09 Thread Yang
2.0.2 with scala 2.11 On Tue, May 9, 2017 at 11:30 AM, Michael Armbrust <mich...@databricks.com> wrote: > Which version of Spark? > > On Tue, May 9, 2017 at 11:28 AM, Yang <tedd...@gmail.com> wrote: > >> actually with var it's the same:

Re: how to mark a (bean) class with schema for catalyst ?

2017-05-09 Thread Yang
> <https://github.com/apache/spark/blob/f830bb9170f6b853565d9dd30ca7418b93a54fe3/mllib/src/main/scala/org/apache/spark/mllib/tree/configuration/Strategy.scala#L71-L83>. > If you are using scala though I'd consider using the case class encoders. > > On Tue, May 9, 2017 at 12:21 AM, Yang

Re: how to mark a (bean) class with schema for catalyst ?

2017-05-09 Thread Yang
. This way when I encode the wrapper, the bean encoder simply encodes the getContent() output, I think. encoding a list of tuples is very fast. Yang On Tue, May 9, 2017 at 11:19 AM, Michael Armbrust <mich...@databricks.com> wrote: > I think you are supposed to set BeanProperty on a var a

how to mark a (bean) class with schema for catalyst ?

2017-05-09 Thread Yang
I'm trying to use Encoders.bean() to create an encoder for my custom class, but it fails, complaining that it can't find the schema: class Person4 { @scala.beans.BeanProperty def setX(x:Int): Unit = {} @scala.beans.BeanProperty def getX():Int = {1} } val personEncoder = Encoders.bean[
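
A minimal sketch of the fix suggested in the replies listed above: put @BeanProperty on a var (which generates the getter/setter pair) instead of annotating hand-written defs, so Encoders.bean can derive the schema.

    import org.apache.spark.sql.Encoders
    import scala.beans.BeanProperty

    class Person4 extends Serializable {
      @BeanProperty var x: Int = 0    // generates getX/setX for the bean encoder
    }

    val personEncoder = Encoders.bean(classOf[Person4])
    println(personEncoder.schema)     // expected: a single int field "x"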

Re: Graph Analytics on HBase with HGraphDB and Spark GraphFrames

2017-04-03 Thread Weiqing Yang
Thanks for sharing this. On Sun, Apr 2, 2017 at 7:08 PM, Irving Duran wrote: > Thanks for the share! > > > Thank You, > > Irving Duran > > On Sun, Apr 2, 2017 at 7:19 PM, Felix Cheung > wrote: > >> Interesting! >> >>

Re: Re: how to call recommend method from ml.recommendation.ALS

2017-03-15 Thread Yuhao Yang
This is something that was just added to ML and will probably be released with 2.2. For now you can try to copy from the master code: https://github.com/apache/spark/blob/70f9d7f71c63d2b1fdfed75cb7a59285c272a62b/mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala#L352 and give it a
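
A minimal sketch (assuming Spark 2.2+, where these helpers became part of ALSModel) of calling the recommend methods on a fitted model; the data and column names are placeholders.

    import org.apache.spark.ml.recommendation.ALS
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("als-recommend-sketch").getOrCreate()
    import spark.implicits._

    val ratings = Seq((0, 10, 4.0f), (0, 11, 1.0f), (1, 10, 5.0f), (1, 12, 3.0f))
      .toDF("userId", "itemId", "rating")

    val model = new ALS()
      .setUserCol("userId").setItemCol("itemId").setRatingCol("rating")
      .setRank(8).setMaxIter(5)
      .fit(ratings)

    model.recommendForAllUsers(3).show(false)   // top 3 items per user
    model.recommendForAllItems(3).show(false)   // top 3 users per item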

Re: [MLlib] kmeans random initialization, same seed every time

2017-03-15 Thread Yuhao Yang
Hi Julian, Thanks for reporting this. This is a valid issue and I created https://issues.apache.org/jira/browse/SPARK-19957 to track it. Right now the seed is set to this.getClass.getName.hashCode.toLong by default, which indeed keeps the same among multiple fits. Feel free to leave your
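
A minimal sketch of one workaround for the behaviour described above: pass an explicit, different seed to each fit, since the default seed is derived from the class name and is therefore identical across fits.

    import org.apache.spark.ml.clustering.KMeans
    import org.apache.spark.ml.linalg.Vectors
    import org.apache.spark.sql.SparkSession
    import scala.util.Random

    val spark = SparkSession.builder().appName("kmeans-seed-sketch").getOrCreate()
    import spark.implicits._

    val dataset = Seq(
      Vectors.dense(0.0, 0.0), Vectors.dense(0.1, 0.1),
      Vectors.dense(9.0, 9.0), Vectors.dense(9.1, 9.1)
    ).map(Tuple1.apply).toDF("features")

    // Vary the seed per fit so the random initialization actually differs.
    val models = (1 to 3).map { _ =>
      new KMeans().setK(2).setInitMode("random").setSeed(Random.nextLong()).fit(dataset)
    }
    val best = models.minBy(_.computeCost(dataset))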

Re: how to construct parameter for model.transform() from datafile

2017-03-14 Thread Yuhao Yang
Hi Jinhong, Based on the error message, your second collection of vectors has a dimension of 804202, while the dimension of your training vectors was 144109. So please make sure your test dataset is of the same dimension as the training data. From the test dataset you posted, the vector

Re: FPGrowth Model is taking too long to generate frequent item sets

2017-03-14 Thread Yuhao Yang
Hi Raju, Have you tried setNumPartitions with a larger number? 2017-03-07 0:30 GMT-08:00 Eli Super : > Hi > > It's area of knowledge , you will need to read online several hours about > it > > What is your programming language ? > > Try search online : "machine learning
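
A minimal sketch of the setNumPartitions suggestion above, on the RDD-based FPGrowth (the transaction data and partition count are placeholders):

    import org.apache.spark.mllib.fpm.FPGrowth
    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("fpgrowth-partitions-sketch"))

    val transactions = sc.parallelize(Seq(
      Array("a", "b", "c"),
      Array("a", "b"),
      Array("b", "c", "d")
    ))

    val model = new FPGrowth()
      .setMinSupport(0.3)
      .setNumPartitions(200)   // spread candidate generation over more partitions
      .run(transactions)

    model.freqItemsets.take(10).foreach(println)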

Sharing my DataFrame (DataSet) cheat sheet.

2017-03-04 Thread Yuhao Yang
Sharing some snippets I accumulated while developing with Apache Spark DataFrame (DataSet). Hope it can help you in some way. https://github.com/hhbyyh/DataFrameCheatSheet. [image: Inline image 1] Regards, Yuhao Yang

Re: physical memory usage keep increasing for spark app on Yarn

2017-02-15 Thread Yang Cao
resources. I don't know whether my explanation is right. Please correct me if you find any issue. THX Best, Yang > On January 23, 2017, at 18:03, Pavel Plotnikov <pavel.plotni...@team.wrike.com> > wrote: > > Hi Yang! > > I don't know exactly why this happens, but i think GC can'

Re: physical memory usage keep increasing for spark app on Yarn

2017-01-22 Thread Yang Cao
Also, do you know why this happens? > On January 20, 2017, at 18:23, Pavel Plotnikov <pavel.plotni...@team.wrike.com> > wrote: > > Hi Yang, > i have faced with the same problem on Mesos and to circumvent this issue i am > usually increase partition number. On last step

Re: physical memory usage keep increasing for spark app on Yarn

2017-01-22 Thread Yang Cao
plotni...@team.wrike.com> > wrote: > > Hi Yang, > i have faced with the same problem on Mesos and to circumvent this issue i am > usually increase partition number. On last step in your code you reduce > number of partitions to 1, try to set bigger value, may be it solve this >

physical memory usage keep increasing for spark app on Yarn

2017-01-20 Thread Yang Cao
Hi all, I am running a spark application in YARN-client mode with 6 executors (each with 4 cores, executor memory = 6G and overhead = 4G; spark version: 1.6.3 / 2.1.0). I find that my executor memory keeps increasing until it gets killed by the node manager, which gives out info telling me to boost

filter push down on har file

2017-01-16 Thread Yang Cao
Hi, My team just did an archive of last year's parquet files. I wonder whether the filter push down optimization still works when I read data through "har:///path/to/data/"? THX. Best,

Re: Kryo On Spark 1.6.0

2017-01-10 Thread Yang Cao
If you don't mind, could you please share the Scala solution with me? I tried to use Kryo but it seemed not to work at all. I hope to get some practical examples. THX > On January 10, 2017, at 19:10, Enrico DUrso wrote: > > Hi, > > I am trying to use Kryo on Spark 1.6.0. > I am able to
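
A minimal Scala sketch (class names are placeholders) of enabling Kryo and registering application classes, in the spirit of what this thread asks for:

    import org.apache.spark.{SparkConf, SparkContext}

    case class Event(id: Long, payload: String)

    val conf = new SparkConf()
      .setAppName("kryo-sketch")
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .registerKryoClasses(Array(classOf[Event], classOf[Array[Event]]))

    val sc = new SparkContext(conf)
    println(sc.parallelize(Seq(Event(1L, "a"), Event(2L, "b"))).count())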

Re: L1 regularized Logistic regression ?

2017-01-04 Thread Yang
regression.html#logistic-regression > > You'd set elasticnetparam = 1 for Lasso > > On Wed, Jan 4, 2017 at 7:13 PM, Yang <tedd...@gmail.com> wrote: > >> does mllib support this? >> >> I do see Lasso impl here https://github.com/apache >> /spark/blob/maste
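
A minimal sketch of the advice above (toy data): with spark.ml's LogisticRegression, elasticNetParam = 1.0 gives a pure L1 (lasso) penalty and 0.0 a pure L2 penalty.

    import org.apache.spark.ml.classification.LogisticRegression
    import org.apache.spark.ml.linalg.Vectors
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("l1-logreg-sketch").getOrCreate()
    import spark.implicits._

    val training = Seq(
      (1.0, Vectors.dense(0.0, 1.1, 0.1)),
      (0.0, Vectors.dense(2.0, 1.0, -1.0)),
      (0.0, Vectors.dense(2.0, 1.3, 1.0)),
      (1.0, Vectors.dense(0.0, 1.2, -0.5))
    ).toDF("label", "features")

    val model = new LogisticRegression()
      .setRegParam(0.01)        // overall regularization strength
      .setElasticNetParam(1.0)  // 1.0 = pure L1
      .fit(training)

    println(model.coefficients) // L1 tends to push some coefficients to exactly 0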

L1 regularized Logistic regression ?

2017-01-04 Thread Yang
does mllib support this? I do see a Lasso impl here https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/regression/Lasso.scala If it supports LR, could you please show me a link? What algorithm does it use? thanks

Re: [Spark Structured Streaming]: Is it possible to ingest data from a jdbc data source incrementally?

2017-01-03 Thread Yuanzhe Yang
Hi Ayan, This "inline view" idea is really awesome and enlightens me! Finally I have a plan to move on. I greatly appreciate your help! Best regards, Yang 2017-01-03 18:14 GMT+01:00 ayan guha <guha.a...@gmail.com>: > Ahh I see what you meanI confused two terminologies

Re: [Spark Structured Streaming]: Is it possible to ingest data from a jdbc data source incrementally?

2017-01-03 Thread Yuanzhe Yang
ently? Best regards, Yang 2017-01-03 17:23 GMT+01:00 ayan guha <guha.a...@gmail.com>: > Hi > > You need to store and capture the Max of the column you intend to use for > identifying new records (Ex: INSERTED_ON) after every successful run of > your job. Then, use the value in low
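
A minimal sketch (my reading of the approach described above, with hypothetical connection details and column names): keep the max of the tracking column from the last successful run and push it into the JDBC read as an inline view.

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.max

    val spark = SparkSession.builder().appName("jdbc-incremental-sketch").getOrCreate()

    val lastWatermark = "2017-01-01 00:00:00"   // in practice, loaded from durable storage

    val increment = spark.read
      .format("jdbc")
      .option("url", "jdbc:postgresql://db-host:5432/appdb")   // hypothetical
      .option("dbtable", s"(SELECT * FROM events WHERE inserted_on > '$lastWatermark') t")
      .option("user", "reader")
      .option("password", sys.env.getOrElse("DB_PASSWORD", ""))
      .load()

    // After processing, persist this as the watermark for the next run.
    val nextWatermark = increment.agg(max("inserted_on")).head.get(0)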

Re: [Spark Structured Streaming]: Is it possible to ingest data from a jdbc data source incrementally?

2017-01-03 Thread Yuanzhe Yang
ays ingest the entire table. So I also cannot simulate a streaming process by starting the job in fix intervals... Best regards, Yang 2017-01-03 15:06 GMT+01:00 ayan guha <guha.a...@gmail.com>: > Hi > > While the solutions provided by others looks promising and I'd like to try >

Re: [Spark Structured Streaming]: Is it possible to ingest data from a jdbc data source incrementally?

2017-01-03 Thread Yuanzhe Yang
Hi Tamas, Thanks a lot for your suggestion! I will also investigate this one later. Best regards, Yang 2017-01-03 12:38 GMT+01:00 Tamas Szuromi <tamas.szur...@odigeo.com>: > > You can also try https://github.com/zendesk/maxwell > > Tamas > > On 3 January 2017 at 12:25,

Re: [Spark Structured Streaming]: Is it possible to ingest data from a jdbc data source incrementally?

2017-01-03 Thread Yuanzhe Yang
Hi Amrit, Thanks a lot for your suggestion! I will investigate it later. Best regards, Yang 2017-01-03 12:25 GMT+01:00 Amrit Jangid <amrit.jan...@goibibo.com>: > You can try out *debezium* : https://github.com/debezium. it reads data > from bin-logs, provides structure and strea

Re: [Spark Structured Streaming]: Is it possible to ingest data from a jdbc data source incrementally?

2017-01-03 Thread Yuanzhe Yang
ingestion tasks on one database. Is Sqoop a proper candidate from your knowledge? Thank you again and have a nice day. Best regards, Yang 2016-12-30 8:28 GMT+01:00 ayan guha <guha.a...@gmail.com>: > > > "If data ingestion speed is faster than data production speed, then >

Re: [Spark Structured Streaming]: Is it possible to ingest data from a jdbc data source incrementally?

2017-01-03 Thread Yuanzhe Yang
Hi Hongdi, Thanks a lot for your suggestion. The data is truly immutable and the table is append-only. But actually there are different databases involved, so the only feature they share in common and I can depend on is jdbc... Best regards, Yang 2016-12-30 6:45 GMT+01:00 任弘迪 <ryan

Re: [Spark Structured Streaming]: Is it possible to ingest data from a jdbc data source incrementally?

2017-01-03 Thread Yuanzhe Yang
Hi Michael, Thanks a lot for your ticket. At least it is the first step. Best regards, Yang 2016-12-30 2:01 GMT+01:00 Michael Armbrust <mich...@databricks.com>: > We don't support this yet, but I've opened this JIRA as it sounds > generally useful: https://issues.apache.org/jira/

[Spark Structured Streaming]: Is it possible to ingest data from a jdbc data source incrementally?

2016-12-29 Thread Yuanzhe Yang (杨远哲)
anything related to streaming data from a growing database. Is there anything I can read to achieve my goal? Any suggestion is highly appreciated. Thank you very much and have a nice day. Best regards, Yang - To uns

Re: scikit-learn and mllib difference in predictions python

2016-12-25 Thread Yuhao Yang
Hi ioanna, I'd like to help look into it. Is there a way to access your training data? 2016-12-20 17:21 GMT-08:00 ioanna : > I have an issue with an SVM model trained for binary classification using > Spark 2.0.0. > I have followed the same logic using scikit-learn and

Re: spark linear regression error training dataset is empty

2016-12-25 Thread Yuhao Yang
Hi Xiaomeng, Have you tried to confirm the DataFrame contents before fitting? like assembleddata.show() before fitting. Regards, Yuhao 2016-12-21 10:05 GMT-08:00 Xiaomeng Wan : > Hi, > > I am running linear regression on a dataframe and get the following error: > >

spark-shell fails to redefine values

2016-12-21 Thread Yang
summary: Spark-shell fails to redefine values in some cases; this is at least found in a case where "implicit" is involved, but not limited to such cases. Run the following in spark-shell, and you can see that the last redefinition does not take effect. The same code runs in the plain Scala REPL without

Re: Multilabel classification with Spark MLlib

2016-11-29 Thread Yuhao Yang
If problem transformation is not an option ( https://en.wikipedia.org/wiki/Multi-label_classification#Problem_transformation_methods), I would try to develop a customized algorithm based on MultilayerPerceptronClassifier, in which you probably need to rewrite LabelConverter. 2016-11-29 9:02

Re: Kafka segmentation

2016-11-16 Thread bo yang
and whether maxRatePerPartition is set. > > I expect that there is something wrong with backpressure, see e.g. > https://issues.apache.org/jira/browse/SPARK-18371 > > On Wed, Nov 16, 2016 at 5:05 PM, bo yang <bobyan...@gmail.com> wrote: > > I hit similar issue with Spark St
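
A minimal sketch (values are placeholders) of the two settings being discussed for evening out direct-stream batch sizes: backpressure plus a per-partition rate cap.

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val conf = new SparkConf()
      .setAppName("kafka-rate-sketch")
      .set("spark.streaming.backpressure.enabled", "true")
      .set("spark.streaming.backpressure.initialRate", "1000")    // rate used for the first batches
      .set("spark.streaming.kafka.maxRatePerPartition", "1000")   // hard upper bound per partition

    val ssc = new StreamingContext(conf, Seconds(10))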

Re: Kafka segmentation

2016-11-16 Thread bo yang
I hit a similar issue with Spark Streaming. The batch size seemed a little random. Sometimes it was large with many Kafka messages inside the same batch, sometimes it was very small with just a few messages. Is it possible that was caused by the backpressure implementation in Spark Streaming? On Wed,

type-safe join in the new DataSet API?

2016-11-10 Thread Yang
the new DataSet API is supposed to provide type safety and type checks at compile time https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#join-operations It does this indeed for a lot of places, but I found it still doesn't have a type safe join: val ds1 =

Do you use spark 2.0 in work?

2016-10-31 Thread Yang Cao
Hi guys, just out of personal interest: I wonder whether spark 2.0 is a production-ready version. Does any company use this version as its main version in daily work? THX

task not serializable in case of groupByKey() + mapGroups + map?

2016-10-31 Thread Yang
with the following simple code: val a = sc.createDataFrame(sc.parallelize(Seq((1,2),(3,4)))).as[(Int,Int)] val grouped = a.groupByKey({x:(Int,Int)=>x._1}) val mappedGroups = grouped.mapGroups((k,x)=>{(k,1)}) val yyy = sc.broadcast(1) val last = mappedGroups.rdd.map(xx=>{

Re: RDD groupBy() then random sort each group ?

2016-10-23 Thread Yang
uld avoid it for large groups. > > The key is to never materialize the grouped and shuffled data. > > To see one approach to do this take a look at > https://github.com/tresata/spark-sorted > > It's basically a combination of smart partitioning and secondary sort. &g
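
A minimal sketch (not the spark-sorted library itself; names are placeholders) of the partition-plus-secondary-sort idea mentioned above: partition by model_id only, sort within partitions by (model_id, random tiebreak), then iterate groups lazily instead of materializing them.

    import org.apache.spark.Partitioner
    import org.apache.spark.{SparkConf, SparkContext}
    import scala.util.Random

    // Partition on modelId only, while the sort key is (modelId, random).
    class ModelIdPartitioner(partitions: Int) extends Partitioner {
      override def numPartitions: Int = partitions
      override def getPartition(key: Any): Int = {
        val modelId = key.asInstanceOf[(Long, Double)]._1
        ((modelId.hashCode % partitions) + partitions) % partitions
      }
    }

    val sc = new SparkContext(new SparkConf().setAppName("group-random-order-sketch"))
    val samples = sc.parallelize(Seq((1L, "s1"), (1L, "s2"), (2L, "s3"), (2L, "s4")))

    val sorted = samples
      .map { case (modelId, sample) => ((modelId, Random.nextDouble()), sample) }
      .repartitionAndSortWithinPartitions(new ModelIdPartitioner(200))

    // Each partition now holds complete groups, with each group's rows in random
    // order; consecutive runs of the same modelId can be streamed into a
    // per-model solver without collecting the whole group.
    sorted.foreachPartition { it =>
      it.foreach { case ((modelId, _), sample) => () /* feed to the per-model SGD here */ }
    }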

Re: RDD groupBy() then random sort each group ?

2016-10-23 Thread Yang
ame wrapping your RDD, and $"id" % 10 with the key > to group by, then you can get the RDD from shuffled and do the following > operations you want. > > Cheng > > > > On 10/20/16 10:53 AM, Yang wrote: > >> in my application, I group by same training sa

RDD groupBy() then random sort each group ?

2016-10-20 Thread Yang
in my application, I group training samples by their model_id's (the input table contains training samples for 100k different models), then each group ends up having about 1 million training samples, then I feed that group of samples to a little Logistic Regression solver (SGD), but SGD

Re: can mllib Logistic Regression package handle 10 million sparse features?

2016-10-19 Thread Yang
>> > per-iteration time). >> >> > >> >> > Note that the current impl forces dense arrays for intermediate data >> >> > structures, increasing the communication cost significantly. See this >> PR for >> >> > info: https://githu

Re: question about the new Dataset API

2016-10-19 Thread Yang
2| +-++ On Tue, Oct 18, 2016 at 11:30 PM, Yang <tedd...@gmail.com> wrote: > scala> val a = sc.parallelize(Array((1,2),(3,4))) > a: org.apache.spark.rdd.RDD[(Int, Int)] = ParallelCollectionRDD[243] at > parallelize at :38 > > scala> val a_ds = hc.di.createDa

question about the new Dataset API

2016-10-19 Thread Yang
scala> val a = sc.parallelize(Array((1,2),(3,4))) a: org.apache.spark.rdd.RDD[(Int, Int)] = ParallelCollectionRDD[243] at parallelize at :38 scala> val a_ds = hc.di.createDataFrame(a).as[(Long,Long)] a_ds: org.apache.spark.sql.Dataset[(Long, Long)] = [_1: int, _2: int] scala>

previous stage results are not saved?

2016-10-17 Thread Yang
while making small changes to the code. Any idea what part of the spark framework might have caused this? thanks Yang

question on the structured DataSet API join

2016-10-17 Thread Yang
I'm trying to use the joinWith() method instead of join(), since the former provides a type-checked result while the latter is a straight DataFrame. The signature is Dataset[(T,U)] joinWith(other: Dataset[U], col: Column). Here the second arg, col: Column, is normally provided by
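
A minimal sketch (toy case classes) of joinWith: unlike join, the result stays a typed Dataset[(L, R)] rather than an untyped DataFrame.

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("joinwith-sketch").getOrCreate()
    import spark.implicits._

    case class User(id: Long, name: String)
    case class Order(userId: Long, amount: Double)

    val users  = Seq(User(1, "a"), User(2, "b")).toDS()
    val orders = Seq(Order(1, 10.0), Order(1, 5.0), Order(2, 7.5)).toDS()

    // Dataset[(User, Order)], so downstream transformations stay type-checked.
    val joined = users.joinWith(orders, users("id") === orders("userId"), "inner")
    joined.map { case (u, o) => (u.name, o.amount) }.show()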

RE: LDA and Maximum Iterations

2016-09-20 Thread Yang, Yuhao
Hi Frank, Which version of Spark are you using? Also can you share more information about the exception. If it’s not confidential, you can send the data sample to me (yuhao.y...@intel.com) and I can try to investigate. Regards, Yuhao From: Frank Zhang [mailto:dataminin...@yahoo.com.INVALID]

Re: Issues with Spark On Hbase Connector and versions

2016-08-30 Thread Weiqing Yang
The PR will be reviewed soon. Thanks, Weiqing From: Sachin Jain > Date: Sunday, August 28, 2016 at 11:12 PM To: spats > Cc: user >

Re: java.net.UnknownHostException

2016-08-02 Thread Yang Cao
actually, I just ran into the same problem. If you can share some code around the error, I can figure out whether I can help you. And is "s001.bigdata" the name of your name node? > On August 2, 2016, at 17:22, pseudo oduesp wrote: > someone can help me please

create external table from partitioned avro file

2016-07-28 Thread Yang Cao
Hi, I am using spark 1.6 and I hope to create a hive external table based on one partitioned avro file. Currently, I can't find any built-in API to do this. I tried write.format().saveAsTable with format com.databricks.spark.avro, but it returned an error: can't find Hive serde for this.

get hdfs file path in spark

2016-07-25 Thread Yang Cao
Hi, Being new here, I hope to get assistance from you guys. I wonder whether there is some elegant way to get the directories under a path. For example, I have a path on hdfs like /a/b/c/d/e/f, and I am given /a/b/c; is there any straightforward way to get the path /a/b/c/d/e? I think I can do
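
A minimal sketch (hypothetical paths) of listing the child directories of an HDFS path with the Hadoop FileSystem API from the driver:

    import org.apache.hadoop.fs.{FileSystem, Path}
    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("hdfs-listing-sketch"))
    val fs = FileSystem.get(sc.hadoopConfiguration)

    val childDirs = fs.listStatus(new Path("/a/b/c"))
      .filter(_.isDirectory)
      .map(_.getPath.toString)

    childDirs.foreach(println)   // e.g. /a/b/c/d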

how do I set TBLPROPERTIES in dataFrame.saveAsTable()?

2016-06-15 Thread Yang
I tried df.options(Map(prop_name -> prop_value)).saveAsTable(tb_name), but it doesn't seem to work. Thanks a lot!

Re: OutOfMemoryError - When saving Word2Vec

2016-06-13 Thread Yuhao Yang
Hi Sharad, what's your vocabulary size and vector length for Word2Vec? Regards, Yuhao 2016-06-13 20:04 GMT+08:00 sharad82 : > Is this the right forum to post Spark related issues ? I have tried this > forum along with StackOverflow but not seeing any response. > >
