Re: A proposal for creating a Knowledge Sharing Hub for Apache Spark Community

2024-03-18 Thread Hyukjin Kwon
One very good example is the SparkR releases in the Conda channel (https://github.com/conda-forge/r-sparkr-feedstock). This is fully run by the community, unofficially. On Tue, 19 Mar 2024 at 09:54, Mich Talebzadeh wrote: > +1 for me > > Mich Talebzadeh, > Dad | Technologist | Solutions Architect |

Re: [ANNOUNCE] Apache Spark 3.5.1 released

2024-03-04 Thread Hyukjin Kwon
Is this related to https://github.com/apache/spark/pull/42428? cc @Yang,Jie(INF) On Mon, 4 Mar 2024 at 22:21, Jungtaek Lim wrote: > Shall we revisit this functionality? The API doc is built with individual > versions, and for each individual version we depend on other released > versions.

Re: Pyspark UDF as a data source for streaming

2023-12-28 Thread Hyukjin Kwon
Just FYI, the streaming Python data source is in progress at https://github.com/apache/spark/pull/44416; we will likely release this in Spark 4.0. On Thu, Dec 28, 2023 at 4:53 PM Поротиков Станислав Вячеславович wrote: > Yes, it's actual data. > > > > Best regards, > > Stanislav Porotikov > > > > *From:*

Re: Architecture of Spark Connect

2023-12-14 Thread Hyukjin Kwon
By default for now, yes. One Spark Connect server handles multiple Spark Sessions. To multiplex or run multiple Drivers, you need some work such as a gateway. On Thu, 14 Dec 2023 at 12:03, Kezhi Xiong wrote: > Hi, > > My understanding is there is only one driver/spark context for all user >

Re: [FYI] SPARK-45981: Improve Python language test coverage

2023-12-02 Thread Hyukjin Kwon
Awesome! On Sat, Dec 2, 2023 at 2:33 PM Dongjoon Hyun wrote: > Hi, All. > > As a part of Apache Spark 4.0.0 (SPARK-44111), the Apache Spark community > starts to have test coverage for all supported Python versions from Today. > > - https://github.com/apache/spark/actions/runs/7061665420 > >

Re: Introducing English SDK for Apache Spark - Seeking Your Feedback and Contributions

2023-07-03 Thread Hyukjin Kwon
The demo was really amazing. On Tue, 4 Jul 2023 at 09:17, Farshid Ashouri wrote: > This is wonderful news! > > On Tue, 4 Jul 2023 at 01:14, Gengliang Wang wrote: > >> Dear Apache Spark community, >> >> We are delighted to announce the launch of a groundbreaking tool that >> aims to make Apache

Re: [ANNOUNCE] Apache Spark 3.4.1 released

2023-06-23 Thread Hyukjin Kwon
Thanks! On Sat, Jun 24, 2023 at 11:01 AM Mridul Muralidharan wrote: > > Thanks Dongjoon ! > > Regards, > Mridul > > On Fri, Jun 23, 2023 at 6:58 PM Dongjoon Hyun wrote: > >> We are happy to announce the availability of Apache Spark 3.4.1! >> >> Spark 3.4.1 is a maintenance release containing

Re: Slack for PySpark users

2023-03-27 Thread Hyukjin Kwon
Yeah, actually I think we'd better have a Slack channel so we can easily discuss with users and developers. On Tue, 28 Mar 2023 at 03:08, keen wrote: > Hi all, > I really like *Slack *as communication channel for a tech community. > There is a Slack workspace for *delta lake users* ( >

Re: [ANNOUNCE] Apache Spark 3.3.1 released

2022-10-26 Thread Hyukjin Kwon
Thanks, Yuming. On Wed, 26 Oct 2022 at 16:01, L. C. Hsieh wrote: > Thank you for driving the release of Apache Spark 3.3.1, Yuming! > > On Tue, Oct 25, 2022 at 11:38 PM Dongjoon Hyun > wrote: > > > > It's great. Thank you so much, Yuming! > > > > Dongjoon > > > > On Tue, Oct 25, 2022 at 11:23

Re: [Feature Request] make unix_micros() and unix_millis() available in PySpark (pyspark.sql.functions)

2022-10-17 Thread Hyukjin Kwon
You can work around it by leveraging expr, e.g., expr("unix_micros(col)"), for now. We should have the Scala binding first before we have the Python one, FWIW. On Sat, 15 Oct 2022 at 06:19, Martin wrote: > Hi everyone, > > In *Spark SQL* there are several timestamp related functions > >-
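For reference, a minimal PySpark sketch of the expr() workaround above (the column name ts and the literal timestamp are illustrative assumptions, not from the thread; unix_micros/unix_millis are Spark SQL functions from Spark 3.1+):

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import expr

    spark = SparkSession.builder.getOrCreate()
    df = spark.sql("SELECT TIMESTAMP '2022-10-15 06:19:00' AS ts")
    # expr() exposes the SQL-only functions to PySpark until a native binding exists
    df.select(expr("unix_micros(ts)").alias("micros"),
              expr("unix_millis(ts)").alias("millis")).show()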

Re: Stickers and Swag

2022-06-14 Thread Hyukjin Kwon
Woohoo On Tue, 14 Jun 2022 at 15:04, Xiao Li wrote: > Hi, all, > > The ASF has an official store at RedBubble > that Apache Community > Development (ComDev) runs. If you are interested in buying Spark Swag, 70 > products featuring the Spark logo

Re: Conda Python Env in K8S

2021-12-24 Thread Hyukjin Kwon
Can you share the logs, settings, environment, etc. and file a JIRA? There are integration test cases for K8S support, and I myself also tested it before. It would be helpful if you try what I did at https://databricks.com/blog/2020/12/22/how-to-manage-python-dependencies-in-pyspark.html and see
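A hedged sketch of the conda-pack approach the linked post describes (the archive file name is an assumption; spark.archives requires Spark 3.1+):

    import os
    from pyspark.sql import SparkSession

    # The packed conda environment is shipped to executors and unpacked as ./environment
    os.environ["PYSPARK_PYTHON"] = "./environment/bin/python"
    spark = (SparkSession.builder
             .config("spark.archives", "pyspark_conda_env.tar.gz#environment")
             .getOrCreate())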

Re: [R] SparkR on conda-forge

2021-12-19 Thread Hyukjin Kwon
Awesome! On Mon, 20 Dec 2021 at 09:43, yonghua wrote: > Nice release. thanks for sharing. > > On 2021/12/20 3:55, Maciej wrote: > > FYI ‒ thanks to good folks from conda-forge we have now these: > > - > To unsubscribe e-mail:

Re: PySpark: toPandas() vs collect() execution graph differences

2021-11-12 Thread Hyukjin Kwon
. We could have a configuration to enable and disable it, but the implementation of this in DataFrame.toPandas would be complicated due to existing optimizations such as Arrow. I haven't taken a deeper look, but my gut says it's not worthwhile. On Sat, Nov 13, 2021 at 12:05 PM Hyukjin Kwon wrote

Re: PySpark: toPandas() vs collect() execution graph differences

2021-11-12 Thread Hyukjin Kwon
Thanks for pinging me, Sean. Yes, there's an optimization on DataFrame.collect which tries to collect the first few partitions and see if the required number of rows is found (and repeats). DataFrame.toPandas does not have such an optimization. I suspect that the shuffle isn't an actual shuffle but just collects

Re: [ANNOUNCE] Apache Spark 3.1.2 released

2021-06-01 Thread Hyukjin Kwon
Awesome! On Wed, 2 Jun 2021 at 9:59 AM, Dongjoon Hyun wrote: > We are happy to announce the availability of Spark 3.1.2! > > Spark 3.1.2 is a maintenance release containing stability fixes. This > release is based on the branch-3.1 maintenance branch of Spark. We strongly > recommend all 3.1 users to

Re: [ANNOUNCE] Announcing Apache Spark 3.1.1

2021-03-03 Thread Hyukjin Kwon
>>>>>> Greenplum >>>>>> with Spark SQL and DataFrames, 10~100x faster.* >>>>>> *spark-func-extras <https://github.com/yaooqinn/spark-func-extras>A >>>>>> library that brings excellent and useful functions from var

[ANNOUNCE] Announcing Apache Spark 3.1.1

2021-03-02 Thread Hyukjin Kwon
We are excited to announce Spark 3.1.1 today. Apache Spark 3.1.1 is the second release of the 3.x line. This release adds Python type annotations and Python dependency management support as part of Project Zen. Other major updates include improved ANSI SQL compliance support, history server

Re: [SparkR] gapply with strings with arrow

2020-10-10 Thread Hyukjin Kwon
If it works without Arrow optimization, it's likely a bug. Please feel free to file a JIRA for that. On Wed, 7 Oct 2020, 22:44 Jacek Pliszka, wrote: > Hi! > > Is there any place I can find information how to use gapply with arrow? > > I've tried something very simple > > collect(gapply( > df,

Re: Apache Spark 3.1 Preparation Status (Oct. 2020)

2020-10-03 Thread Hyukjin Kwon
Nice summary. Thanks Dongjoon. One minor correction -> I believe we dropped R 3.5 and below at branch 2.4 as well. On Sun, 4 Oct 2020, 09:17 Dongjoon Hyun, wrote: > Hi, All. > > As of today, master branch (Apache Spark 3.1.0) resolved > 852+ JIRA issues and 606+ issues are 3.1.0-only patches. >

PySpark documentation main page

2020-08-01 Thread Hyukjin Kwon
Hi all, I am trying to write up the main page of PySpark documentation at https://github.com/apache/spark/pull/29320. While I think the current proposal might be good enough, I would like to collect more feedback about the contents, structure and image since this is the entrance page of PySpark

Re: [PSA] Python 2, 3.4 and 3.5 are now dropped

2020-07-13 Thread Hyukjin Kwon
cc user mailing list too. On Tue, 14 Jul 2020 at 11:27 AM, Hyukjin Kwon wrote: > I am sending another email to make sure dev people know. Python 2, 3.4 and > 3.5 are now dropped at https://github.com/apache/spark/pull/28957. > > >

Re: Error: Vignette re-building failed. Execution halted

2020-06-24 Thread Hyukjin Kwon
Looks like you haven't installed the 'e1071' package. On Wed, 24 Jun 2020 at 6:49 PM, Anwar AliKhan wrote: > ./dev/make-distribution.sh --name custom-spark --pip --r --tgz -Psparkr > -Phive -Phive-thriftserver -Pmesos -Pyarn -Pkubernetes > > > > minor error Spark r test

Re: [ANNOUNCE] Apache Spark 3.0.0

2020-06-18 Thread Hyukjin Kwon
Yay! On Fri, 19 Jun 2020 at 4:46 AM, Mridul Muralidharan wrote: > Great job everyone ! Congratulations :-) > > Regards, > Mridul > > On Thu, Jun 18, 2020 at 10:21 AM Reynold Xin wrote: > >> Hi all, >> >> Apache Spark 3.0.0 is the first release of the 3.x line. It builds on >> many of the innovations

Re: [ANNOUNCE] Apache Spark 2.4.6 released

2020-06-10 Thread Hyukjin Kwon
Yay! On Thu, 11 Jun 2020 at 10:38 AM, Holden Karau wrote: > We are happy to announce the availability of Spark 2.4.6! > > Spark 2.4.6 is a maintenance release containing stability, correctness, > and security fixes. > This release is based on the branch-2.4 maintenance branch of Spark. We > strongly

Re: [ANNOUNCE] Announcing Apache Spark 2.4.5

2020-02-10 Thread Hyukjin Kwon
Thanks Dongjoon! On Sun, 9 Feb 2020 at 10:49 AM, Takeshi Yamamuro wrote: > Happy to hear the release news! > > Bests, > Takeshi > > On Sun, Feb 9, 2020 at 10:28 AM Dongjoon Hyun > wrote: > >> There was a typo in one URL. The correct release note URL is here. >> >>

Re: Fail to use SparkR of 3.0 preview 2

2019-12-26 Thread Hyukjin Kwon
I was randomly googling out of curiosity, and it seems indeed that's the problem ( https://r.789695.n4.nabble.com/Error-in-rbind-info-getNamespaceInfo-env-quot-S3methods-quot-td4755490.html ). Yes, it seems we should make sure we build SparkR with an old version. Since that support for R prior to version

Re: [VOTE] Shall we release ORC 1.4.5rc1?

2019-12-06 Thread Hyukjin Kwon
+1 (as a Spark user) On Sat, 7 Dec 2019 at 11:06 AM, Dongjoon Hyun wrote: > +1 for Apache ORC 1.4.5 release. > > Thank you for making the release. > > I'd like to mention some notable changes here. > Apache ORC 1.4.5 is not a drop-in replacement for 1.4.4 because of the > following. > >

Re: [DISCUSS] Remove sorting of fields in PySpark SQL Row construction

2019-11-07 Thread Hyukjin Kwon
+1 On Wed, 6 Nov 2019 at 11:38 PM, Wenchen Fan wrote: > Sounds reasonable to me. We should make the behavior consistent within > Spark. > > On Tue, Nov 5, 2019 at 6:29 AM Bryan Cutler wrote: > >> Currently, when a PySpark Row is created with keyword arguments, the >> fields are sorted

Re: DataSourceV2: pushFilters() is not invoked for each read call - spark 2.3.2

2019-09-06 Thread Hyukjin Kwon
I believe this issue was fixed in Spark 2.4. Spark DataSource V2 is still being radically developed - it is not complete yet. So, I think the feasible options to get through at the current moment are: 1. upgrade to a higher Spark version 2. disable filter push down at your

Re: [ANNOUNCE] Announcing Apache Spark 2.4.4

2019-09-01 Thread Hyukjin Kwon
YaY! On Mon, 2 Sep 2019 at 1:27 PM, Wenchen Fan wrote: > Great! Thanks! > > On Mon, Sep 2, 2019 at 5:55 AM Dongjoon Hyun > wrote: > >> We are happy to announce the availability of Spark 2.4.4! >> >> Spark 2.4.4 is a maintenance release containing stability fixes. This >> release is based on the

Re: Release Apache Spark 2.4.4

2019-08-14 Thread Hyukjin Kwon
Adding Shixiong. WDYT? On Wed, 14 Aug 2019 at 2:30 PM, Terry Kim wrote: > Can the following be included? > > [SPARK-27234][SS][PYTHON] Use InheritableThreadLocal for current epoch in > EpochTracker (to support Python UDFs) > > > Thanks, > Terry > > On Tue,

Re: Continuous processing mode and python udf

2019-08-13 Thread Hyukjin Kwon
That's fixed in https://github.com/apache/spark/commit/b83b7927b3a85c1a4945e2224ed811b5bb804477 On Tue, 13 Aug 2019 at 12:37 PM, zenglong chen wrote: > Does Spark 2.4.0 support Python UDFs with Continuous Processing mode? > I try it and occur error like below: > WARN scheduler.TaskSetManager:

Re: Release Apache Spark 2.4.4

2019-08-13 Thread Hyukjin Kwon
+1 On Wed, 14 Aug 2019 at 9:13 AM, Takeshi Yamamuro wrote: > Hi, > > Thanks for your notification, Dongjoon! > I put some links for the other committers/PMCs to access the info easily: > > A commit list in github from the last release: >

Re: Usage of PyArrow in Spark

2019-07-17 Thread Hyukjin Kwon
Regular Python UDFs don't use PyArrow under the hood. Yes, they could potentially benefit, but this can be easily worked around via Pandas UDFs. For instance, both below are virtually identical. @udf(...) def func(col): return col @pandas_udf(...) def pandas_func(col): return
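A self-contained sketch completing the truncated comparison above (Spark 3.x-style pandas_udf type hints and a local session are assumptions):

    import pandas as pd
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, pandas_udf, udf
    from pyspark.sql.types import LongType

    spark = SparkSession.builder.getOrCreate()
    df = spark.range(5)

    @udf(LongType())               # plain Python UDF: row at a time, no PyArrow
    def func(v):
        return v

    @pandas_udf(LongType())        # Pandas UDF: vectorized through PyArrow
    def pandas_func(v: pd.Series) -> pd.Series:
        return v

    df.select(func(col("id")), pandas_func(col("id"))).show()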

Re: Exposing JIRA issue types at GitHub PRs

2019-06-16 Thread Hyukjin Kwon
Labels look good and useful. On Sat, 15 Jun 2019, 02:36 Dongjoon Hyun, wrote: > Now, you can see the exposed component labels (ordered by the number of > PRs) here and click the component to search. > > https://github.com/apache/spark/labels?sort=count-desc > > Dongjoon. > > > On Fri, Jun

Re: Exposing JIRA issue types at GitHub PRs

2019-06-13 Thread Hyukjin Kwon
Yea, I think we can automate this process via, for instance, https://github.com/apache/spark/blob/master/dev/github_jira_sync.py +1 for this sort of automatic categorization and matching of metadata between JIRA and GitHub. Adding Josh and Sean as well. On Thu, 13 Jun 2019, 13:17 Dongjoon Hyun,

Re: [ANNOUNCE] Announcing Apache Spark 2.3.3

2019-02-18 Thread Hyukjin Kwon
Yay! Good job Takeshi! On Mon, 18 Feb 2019, 14:47 Takeshi Yamamuro We are happy to announce the availability of Spark 2.3.3! > > Apache Spark 2.3.3 is a maintenance release, based on the branch-2.3 > maintenance branch of Spark. We strongly recommend all 2.3.x users to > upgrade to this stable

Re: [ANNOUNCE] Announcing Apache Spark 2.2.3

2019-01-16 Thread Hyukjin Kwon
Nice! On Wed, 16 Jan 2019 at 11:55 AM, Jiaan Geng wrote: > Glad to hear this. > > > > -- > Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/ > > - > To unsubscribe e-mail: user-unsubscr...@spark.apache.org > >

Re: Python

2019-01-02 Thread Hyukjin Kwon
Yup, it's supported. On Wed, 2 Jan 2019, 3:35 pm Gourav Sengupta Hi, > Can I please confirm which version of Python 3.x is supported by Spark 2.4? > > Regards, > Gourav >

Re: DataSourceV2 APIs creating multiple instances of DataSourceReader and hence not preserving the state

2018-10-09 Thread Hyukjin Kwon
I took a look at the code. val source = classOf[MyDataSource].getCanonicalName spark.read.format(source).load().collect() Looks indeed it calls it twice. First, it looks like it creates it to read the schema for a logical plan

Re: Should python-2 be supported in Spark 3.0?

2018-09-16 Thread Hyukjin Kwon
I think we can deprecate it in 3.x.0 and remove it in Spark 4.0.0. Many people still use Python 2. Also, technically 2.7 support is not officially dropped yet - https://pythonclock.org/ On Mon, 17 Sep 2018 at 9:31 AM, Aakash Basu wrote: > Removing support for an API in a major release makes poor

Re: How to make pyspark use custom python?

2018-09-06 Thread Hyukjin Kwon
Are you doubly sure it is an issue in Spark? I have used a custom Python several times by setting it in PYSPARK_PYTHON before and it was no problem. On Thu, 6 Sep 2018 at 2:21 PM, mithril wrote: > For better looking , please see > >

Re: Issue upgrading to Spark 2.3.1 (Maintenance Release)

2018-06-15 Thread Hyukjin Kwon
I use PyCharm. Mind if I ask you to elaborate what you did, step by step? On Sat, 16 Jun 2018 at 12:11 AM, Marcelo Vanzin wrote: > I'm not familiar with PyCharm. But if you can run "pyspark" from the > command line and not hit this, then this might be an issue with > PyCharm or your environment - e.g.

Re: how to set up pyspark eclipse, pyDev, virtualenv? syntaxError: yield from walk(

2018-04-05 Thread Hyukjin Kwon
FYI, there is a PR and JIRA for virtualEnv support in PySpark https://issues.apache.org/jira/browse/SPARK-13587 https://github.com/apache/spark/pull/13599 2018-04-06 7:48 GMT+08:00 Andy Davidson : > FYI > >

Re: [PySpark SQL] sql function to_date and to_timestamp return the same data type

2018-03-18 Thread Hyukjin Kwon
Mind if I ask for a reproducer? It seems to return timestamps fine: >>> from pyspark.sql.functions import * >>> spark.range(1).select(to_timestamp(current_timestamp())).printSchema() root |-- to_timestamp(current_timestamp()): timestamp (nullable = false) >>>
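A small follow-on sketch showing that the two functions return different types (assuming an active SparkSession named spark, as in the REPL output above):

    from pyspark.sql.functions import current_timestamp, to_date, to_timestamp

    spark.range(1).select(
        to_date(current_timestamp()).alias("as_date"),            # DateType
        to_timestamp(current_timestamp()).alias("as_timestamp"),  # TimestampType
    ).printSchema()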

Re: SparkR test script issue: unable to run run-tests.h on spark 2.2

2018-02-14 Thread Hyukjin Kwon
From a very quick look, I think it is a testthat version issue with SparkR. I had to fix that version to 1.x before in AppVeyor. There are a few details in https://github.com/apache/spark/pull/20003 Can you check and lower the testthat version? On 14 Feb 2018 6:09 pm, "chandan prakash"

Re: Custom line/record delimiter

2018-01-01 Thread Hyukjin Kwon
Hi, There's a PR - https://github.com/apache/spark/pull/18581 and JIRA - SPARK-21289 Alternatively, you could check out the multiLine option for CSV and see if it is applicable. Thanks. 2017-12-30 2:19 GMT+09:00 sk skk : > Hi, > > Do we have an option to write a csv or text
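A minimal sketch of the multiLine alternative mentioned above (the file path is an assumption; the option is available for CSV from Spark 2.2):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    # Quoted fields that span several lines are read as part of a single record
    df = (spark.read
          .option("multiLine", True)
          .option("header", True)
          .csv("/tmp/records.csv"))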

Re: Reading CSV with multiLine option invalidates encoding option.

2017-08-17 Thread Hyukjin Kwon
For when multiLine is not set, we currently only support ASCII-compatible encodings, to my knowledge, mainly due to the line separator, as I investigated in the comment. For when multiLine is set, it appears encoding is not considered. I actually meant encoding does not work at all in this case

Re: how to set the assignee in JIRA please?

2017-07-25 Thread Hyukjin Kwon
> > I find some PR were created one year ago, the last comment is several > monthes before. > No one to close or reject it. > Such as 6880, just put it like this? > > > ---Original--- > *From:* "Hyukjin Kwon"<gurwls...@gmail.com> > *Date:* 2017/7/25 09:

Re: how to set the assignee in JIRA please?

2017-07-24 Thread Hyukjin Kwon
se? ---Original--- *From:* "Hyukjin Kwon"<gurwls...@gmail.com> *Date:* 2017/7/25 09:15:49 *To:* "Marcelo Vanzin"<van...@cloudera.com>; *Cc:* "user"<user@spark.apache.org>;"萝卜丝炒饭"<1427357...@qq.com>; *Subject:* Re: how to set the assignee

Re: how to set the assignee in JIRA please?

2017-07-24 Thread Hyukjin Kwon
It should not be a big deal anyway. Thanks for the details. 2017-07-25 10:09 GMT+09:00 Marcelo Vanzin <van...@cloudera.com>: > On Mon, Jul 24, 2017 at 6:04 PM, Hyukjin Kwon <gurwls...@gmail.com> wrote: > > However, I see some JIRAs are assigned to someone time to time. Wer

Re: how to set the assignee in JIRA please?

2017-07-24 Thread Hyukjin Kwon
However, I see some JIRAs are assigned to someone from time to time. Were those mistakes, or would you mind if I ask when someone gets assigned? When I started to contribute to Spark a few years ago, I was confused by this and I am pretty sure some people are still confused. I do usually say something like

Re: to_json not working with selectExpr

2017-07-19 Thread Hyukjin Kwon
Yes, I guess it is. 2017-07-20 11:31 GMT+09:00 Matthew cao <cybea...@gmail.com>: > AH, I get it. So that’s why I get the not register error? Cuz it not added > into SQL in 2.1.0? > > On 19 Jul 2017, at 22:35, Hyukjin Kwon <gurwls...@gmail.com> wrote: > > Yea, but it

Re: to_json not working with selectExpr

2017-07-19 Thread Hyukjin Kwon
Yea, but it was added to SQL from Spark 2.2.0. 2017-07-19 23:02 GMT+09:00 Matthew cao : > I am using version 2.1.1 As I could remember, this function was added > since 2.1.0. > > On 17 Jul 2017, at 12:05, Burak Yavuz wrote: > > Hi Matthew, > > Which Spark

Re: With 2.2.0 PySpark is now available for pip install from PyPI :)

2017-07-12 Thread Hyukjin Kwon
Cool! 2017-07-13 9:43 GMT+09:00 Denny Lee : > This is amazingly awesome! :) > > On Wed, Jul 12, 2017 at 13:23 lucas.g...@gmail.com > wrote: > >> That's great! >> >> >> >> On 12 July 2017 at 12:41, Felix Cheung wrote: >>

Re: Multiple CSV libs causes issues spark 2.1

2017-05-09 Thread Hyukjin Kwon
Sounds like it is related to https://github.com/apache/spark/pull/17916 We will allow picking up the internal one if this one gets merged. On 10 May 2017 7:09 am, "Mark Hamstra" wrote: > Looks to me like it is a conflict between a Databricks library and Spark > 2.1.

Re: Why selectExpr changes schema (to include id column)?

2017-03-27 Thread Hyukjin Kwon
ompt response. I appreciate. Pozdrawiam, Jacek Laskowski https://medium.com/@jaceklaskowski/ Mastering Apache Spark 2 https://bit.ly/mastering-apache-spark Follow me at https://twitter.com/jaceklaskowski On Mon, Mar 27, 2017 at 2:43 PM, Hyukjin Kwon <gurwls...@gmail.com> wrote: > I ju

Re: Why selectExpr changes schema (to include id column)?

2017-03-27 Thread Hyukjin Kwon
I just tried to build against the current master to help check - https://github.com/apache/spark/commit/3fbf0a5f9297f438bc92db11f106d4a0ae568613 It seems I can't reproduce this as below: scala> spark.range(1).printSchema root |-- id: long (nullable = false) scala>

Re: CSV empty columns handling in Spark 2.0.2

2017-03-16 Thread Hyukjin Kwon
I think this is fixed in https://github.com/apache/spark/pull/15767 This should be fixed in 2.1.0. 2017-03-17 3:28 GMT+09:00 George Obama : > Hello, > > > > I am using spark 2.0.2 to read the CSV file with empty columns and is > hitting the issue: > > scala>val df =

Re: [Spark CSV]: Use Custom TextInputFormat to Prevent Exceptions

2017-03-15 Thread Hyukjin Kwon
Other options are maybe: - the "spark.sql.files.ignoreCorruptFiles" option - DataFrameReader.csv(csvDataset: Dataset[String]) with a custom input format (this is available from Spark 2.2.0). For example, val rdd = spark.sparkContext.newAPIHadoopFile("/tmp/abcd",
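A short sketch of the first option in PySpark (the directory path is an assumption):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    # Skip unreadable files instead of failing the whole job
    spark.conf.set("spark.sql.files.ignoreCorruptFiles", "true")
    df = spark.read.option("header", True).csv("/tmp/csv-dir")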

Re: DataFrameWriter - Where to find list of Options applicable to particular format(datasource)

2017-03-13 Thread Hyukjin Kwon
Hi, all the options are documented in https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.DataFrameWriter It seems we don't have both options for writing. If the goal is trimming the whitespaces, I think we could do this within dataframe operations (as we talked in the

Re: using spark-xml_2.10 to extract data from XML file

2017-02-14 Thread Hyukjin Kwon
Hi Carlo, There was a bug in lower versions when accessing nested values in the library. Otherwise, I suspect another issue about parsing malformed XML. Could you maybe open an issue in https://github.com/databricks/spark-xml/issues with your sample data? I will stick with it until it is

Re: JavaRDD text matadata(file name) findings

2017-01-31 Thread Hyukjin Kwon
Hi, would it maybe be possible to switch to the text datasource with the input_file_name function? Thanks. On 1 Feb 2017 3:58 a.m., "Manohar753" wrote: Hi All, myspark job is reading data from a folder having different files with same structured data. the red JavaRdd
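A minimal sketch of the text datasource plus input_file_name suggestion (the folder path and column name are assumptions):

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import input_file_name

    spark = SparkSession.builder.getOrCreate()
    df = (spark.read.text("/data/folder")
          .withColumn("source_file", input_file_name()))  # records which file each line came from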

Re: Scala Developers

2017-01-25 Thread Hyukjin Kwon
Just as a subscriber to this mailing list, I don't want to receive job-recruiting emails or even have to make the effort to set a filter for them. I don't know the policy in detail, but I feel it is inappropriate to send them where, in my experience, Spark users usually ask questions and discuss

Re: filter rows by all columns

2017-01-16 Thread Hyukjin Kwon
Hi Shawn, Could we do this as below, for "any is true"? scala> val df = spark.range(10).selectExpr("id as a", "id / 2 as b") df: org.apache.spark.sql.DataFrame = [a: bigint, b: double] scala> df.filter(_.toSeq.exists(v => v == 1)).show() +---+---+ | a| b| +---+---+ | 1|0.5| | 2|1.0|

Re: Unable to explain the job kicked off for spark.read.csv

2017-01-08 Thread Hyukjin Kwon
Oh, I mean another job would *not* happen if the schema is explicitly given. 2017-01-09 16:37 GMT+09:00 Hyukjin Kwon <gurwls...@gmail.com>: > Hi Appu, > > > I believe that textFile and filter came from... > > https://github.com/apache/spark/blob/branch-2.1/sql/ > cor

Re: Unable to explain the job kicked off for spark.read.csv

2017-01-08 Thread Hyukjin Kwon
Hi Appu, I believe that textFile and filter came from... https://github.com/apache/spark/blob/branch-2.1/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVFileFormat.scala#L59-L61 It needs to read a first line even if using the header is disabled and schema inference
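As the follow-up above notes, the extra job goes away when the schema is supplied explicitly; a hedged sketch (column names and path are assumptions):

    from pyspark.sql import SparkSession
    from pyspark.sql.types import StringType, StructField, StructType

    spark = SparkSession.builder.getOrCreate()
    schema = StructType([StructField("c0", StringType(), True),
                         StructField("c1", StringType(), True)])
    # No separate first-line read for schema inference is needed here
    df = spark.read.schema(schema).csv("/tmp/data.csv")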

Re: Merging Parquet Files

2016-12-22 Thread Hyukjin Kwon
Hi Benjamin, As you might already know, I believe the Hadoop command does not automatically merge column-based formats such as ORC or Parquet but just simply concatenates them. I haven't tried this myself, but I remember I saw a JIRA in Parquet -

Re: get corrupted rows using columnNameOfCorruptRecord

2016-12-07 Thread Hyukjin Kwon
Let me please just extend the suggestion a bit more verbosely. I think you could try something like this maybe. val jsonDF = spark.read .option("columnNameOfCorruptRecord", "xxx") .option("mode","PERMISSIVE") .schema(StructType(schema.fields :+ StructField("xxx", StringType, true)))
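A hedged PySpark rendering of the same idea (the schema, the corrupt-record column name "xxx", and the path are assumptions):

    from pyspark.sql import SparkSession
    from pyspark.sql.types import LongType, StringType, StructField, StructType

    spark = SparkSession.builder.getOrCreate()
    schema = StructType([StructField("a", LongType(), True),
                         StructField("xxx", StringType(), True)])  # holds the raw corrupt record
    df = (spark.read
          .option("columnNameOfCorruptRecord", "xxx")
          .option("mode", "PERMISSIVE")
          .schema(schema)
          .json("/tmp/data.json"))
    df.where(df["xxx"].isNotNull()).show()  # only the rows that failed to parse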

Re: How do I flatten JSON blobs into a Data Frame using Spark/Spark SQL

2016-12-05 Thread Hyukjin Kwon
Thanks for this but Isn't this what Michael suggested? > > Thanks, > kant > > On Mon, Dec 5, 2016 at 4:45 AM, Hyukjin Kwon <gurwls...@gmail.com> wrote: > >> Hi Kant, >> >> How about doing something like this? >> >> import org.apache.spark.sql.functions._

Re: How do I flatten JSON blobs into a Data Frame using Spark/Spark SQL

2016-12-05 Thread Hyukjin Kwon
Hi Kant, How about doing something like this? import org.apache.spark.sql.functions._ // val df2 = df.select(df("body").cast(StringType).as("body")) val df2 = Seq("""{"a": 1}""").toDF("body") val schema = spark.read.json(df2.as[String].rdd).schema df2.select(from_json(col("body"),

Re: Handling windows characters with Spark CSV on Linux

2016-11-17 Thread Hyukjin Kwon
Actually, the CSV datasource supports an encoding option[1] (although it does not support non-ASCII-compatible encoding types). [1] https://github.com/apache/spark/blob/44c8bfda793b7655e2bd1da5e9915a09ed9d42ce/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala#L364 On 17 Nov 2016 10:59
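A minimal sketch of that option in PySpark (the charset and path are assumptions; ISO-8859-1 is ASCII-compatible, per the caveat above):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    df = (spark.read
          .option("encoding", "ISO-8859-1")   # alias: "charset"
          .option("header", True)
          .csv("/tmp/windows-export.csv"))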

Re: Spark-xml - OutOfMemoryError: Requested array size exceeds VM limit

2016-11-16 Thread Hyukjin Kwon
ll check with new version and try to use different rowTags and >> increase executor-memory tomorrow. I will open a new issue as well. >> >> >> >> On Tue, Nov 15, 2016 at 7:52 PM, Hyukjin Kwon <gurwls...@gmail.com> >> wrote: >> >>> Hi Ar

Re: How do I convert json_encoded_blob_column into a data frame? (This may be a feature request)

2016-11-16 Thread Hyukjin Kwon
Maybe it sounds like you are looking for from_json/to_json functions after en/decoding properly. On 16 Nov 2016 6:45 p.m., "kant kodali" wrote: > > > https://spark.apache.org/docs/2.0.2/sql-programming-guide. > html#json-datasets > > "Spark SQL can automatically infer the

Re: Spark-xml - OutOfMemoryError: Requested array size exceeds VM limit

2016-11-15 Thread Hyukjin Kwon
Hi Arun, I have a few questions. Does your XML file have a few huge documents? In the case of a row having a huge size (like 500MB), it would consume a lot of memory because at least it should hold a row to iterate, if I remember correctly. I remember this happened to me before while

Re: How to read a Multi Line json object via Spark

2016-11-15 Thread Hyukjin Kwon
Hi Sree, There is a blog about that, http://searchdatascience.com/spark-adventures-1-processing-multi-line-json-files/ It is pretty old but I am sure that it is helpful. Currently, the JSON datasource only supports JSON documents formatted according to http://jsonlines.org/ There is an

Re: Spark SQL shell hangs

2016-11-13 Thread Hyukjin Kwon
Hi Rakesh, Could you please open an issue in https://github.com/databricks/spark-xml with some code so that reviewers can reproduce the issue you met? Thanks! 2016-11-14 0:20 GMT+09:00 rakesh sharma : > Hi > > I'm trying to convert an XML file to data frame using

Re: pyspark: accept unicode column names in DataFrame.corr and cov

2016-11-12 Thread Hyukjin Kwon
Hi Sam, I think I have some answers for two questions. > Humble request: could we replace the "isinstance(col1, str)" tests with "isinstance(col1, basestring)"? IMHO, yes, I believe this should be basestring. Otherwise, some functions would not accept unicode as arguments for columns in Python

Re: Reading csv files with quoted fields containing embedded commas

2016-11-06 Thread Hyukjin Kwon
Hi Femi, Have you maybe tried the quote-related options specified in the documentation? http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrameReader.csv Thanks. 2016-11-06 6:58 GMT+09:00 Femi Anthony : > Hi, I am trying to process a very
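One possible combination of those quote-related options, as a hedged sketch (the escape character and path are assumptions):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    # Commas inside double-quoted fields are kept as data, not treated as delimiters
    df = (spark.read
          .option("quote", '"')
          .option("escape", '"')
          .option("header", True)
          .csv("/tmp/quoted.csv"))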

Re: Error creating SparkSession, in IntelliJ

2016-11-03 Thread Hyukjin Kwon
Hi Shyla, there is the documentation for setting up IDE - https://cwiki.apache.org/confluence/display/SPARK/Useful+Developer+Tools#UsefulDeveloperTools-IDESetup I hope this is helpful. 2016-11-04 9:10 GMT+09:00 shyla deshpande : > Hello Everyone, > > I just installed

Re: Spark XML ignore namespaces

2016-11-03 Thread Hyukjin Kwon
Oh, that PR was actually about not concerning the namespaces (meaning leaving data as they are, including prefixes). The problem was, each partition needs to produce each record knowing the namespaces. It is fine to deal with them if they are within each XML document (represented as a

Re: csv date/timestamp type inference in spark 2.0.1

2016-10-26 Thread Hyukjin Kwon
Hi Koert, I am curious about your case. I guess the purpose of timestampFormat and dateFormat is to infer timestamps/dates when parsing/inferring but not to exclude the type inference/parsing. Actually, it does try to infer/parse in 2.0.0 as well (but it fails) so actually I guess there

Re: spark infers date to be timestamp type

2016-10-26 Thread Hyukjin Kwon
2015-01-01 > 2016-03-05 > > next i run this code in spark 2.0.1: > spark.read > .format("csv") > .option("header", true) > .option("inferSchema", true) > .load("test.csv") > .printSchema > > the result is: > root &

Re: spark infers date to be timestamp type

2016-10-26 Thread Hyukjin Kwon
There are now timestampFormat for TimestampType and dateFormat for DateType. Do you mind if I ask you to share your code? On 27 Oct 2016 2:16 a.m., "Koert Kuipers" wrote: > is there a reason a column with dates in format -mm-dd in a csv file > is inferred to be

Re: pyspark doesn't recognize MMM dateFormat pattern in spark.read.load() for dates like 1989Dec31 and 31Dec1989

2016-10-24 Thread Hyukjin Kwon
I am also interested in this issue. I will try to look into this too within the coming few days. 2016-10-24 21:32 GMT+09:00 Sean Owen : > I actually think this is a general problem with usage of DateFormat and > SimpleDateFormat across the code, in that it relies on the default

Re: how to extract arraytype data to file

2016-10-18 Thread Hyukjin Kwon
This reminds me of https://github.com/databricks/spark-xml/issues/141#issuecomment-234835577 Maybe using explode() would be helpful. Thanks! 2016-10-19 14:05 GMT+09:00 Divya Gehlot : > http://stackoverflow.com/questions/33864389/how-can-i- >
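A small sketch of the explode() suggestion (the array column name "items" and output path are assumptions):

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, explode

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([(1, ["a", "b", "c"])], ["id", "items"])
    # Each array element becomes its own row, which can then be written as plain text/CSV
    flat = df.select(col("id"), explode(col("items")).alias("item"))
    flat.write.mode("overwrite").csv("/tmp/items-out")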

Re: Why the json file used by sparkSession.read.json must be a valid json object per line

2016-10-18 Thread Hyukjin Kwon
Regarding his recent PR[1], I guess he meant multiple-line JSON. As far as I know, single-line JSON also complies with the standard. I left a comment with the RFC in the PR but please let me know if I am wrong at any point. Thanks! [1]https://github.com/apache/spark/pull/15511 On 19 Oct 2016 7:00 a.m.,

Re: JSON Arrays and Spark

2016-10-12 Thread Hyukjin Kwon
No, I meant it should be in a single line but it supports array type too as a root wrapper of JSON objects. If you need to parse multiple lines, I have a reference here. http://searchdatascience.com/spark-adventures-1-processing-multi-line-json-files/ 2016-10-12 15:04 GMT+09:00 Kappaganthu,

Re: JSON Arrays and Spark

2016-10-10 Thread Hyukjin Kwon
FYI, it supports [{...}, {...} ...] Or {...} format as input. On 11 Oct 2016 3:19 a.m., "Jean Georges Perrin" wrote: > Thanks Luciano - I think this is my issue :( > > On Oct 10, 2016, at 2:08 PM, Luciano Resende wrote: > > Please take a look at >

Re: pyspark: sqlContext.read.text() does not work with a list of paths

2016-10-06 Thread Hyukjin Kwon
It seems obviously a bug. It was introduced by my PR, https://github.com/apache/spark/commit/d37c7f7f042f7943b5b684e53cf4284c601fb347 +1 for creating a JIRA and PR. If you have any problem with this, I would like to do this quickly. On 5 Oct 2016 9:12 p.m., "Laurent Legrand"

Re: spark sql on json

2016-09-29 Thread Hyukjin Kwon
https://github.com/apache/spark/blob/master/examples/src/main/java/org/apache/spark/examples/sql/JavaSparkSQLExample.java#L104-L181 2016-09-29 18:58 GMT+09:00 Hitesh Goyal : > Hi team, > > > > I have a json document. I want to put spark SQL to it. > > Can you please

Re: Issue with rogue data in csv file used in Spark application

2016-09-27 Thread Hyukjin Kwon
Hi Mich, I guess you could use the nullValue option by setting it to null. If you are reading them into strings in the first place, then you would meet https://github.com/apache/spark/pull/14118 first, which is resolved from 2.0.1. Unfortunately, this bug also exists in the external CSV library for

Re: NumberFormatException: For input string: "0.00000"

2016-09-19 Thread Hyukjin Kwon
It seems not an issue in Spark. Does "CSVParser" work fine without Spark with the data? BTW, it seems there is something wrong with your email address. I am sending this again. On 20 Sep 2016 8:32 a.m., "Hyukjin Kwon" <gurwls...@gmail.com> wrote: > It seem

Re: NumberFormatException: For input string: "0.00000"

2016-09-19 Thread Hyukjin Kwon
It seems not an issue in Spark. Does "CSVParser" work fine without Spark with the data? On 20 Sep 2016 2:15 a.m., "Mohamed ismail" wrote: > Hi all > > I am trying to read: > > sc.textFile(DataFile).mapPartitions(lines => { > val parser = new

How many are there PySpark Windows users?

2016-09-18 Thread Hyukjin Kwon
Hi all, We are currently testing SparkR on Windows[1] and it seems several problems are being identified from time to time. Although it seems it is not easy to automate Spark's tests in Scala on Windows, because I think we should introduce proper change detection to run only related tests rather than

Re: take() works on RDD but .write.json() does not work in 2.0.0

2016-09-17 Thread Hyukjin Kwon
Hi Kevin, I have a few questions on this. Does that not work only with write.json()? I just wonder if write.text, csv or another API does not work as well, and whether it is a JSON-specific issue. Also, does that work with small data? I want to make sure if this happens only on large data. Thanks!

Re: Spark CSV skip lines

2016-09-10 Thread Hyukjin Kwon
r(new StringReader(txt)); > | reader.readAll().map(data => Row(data(3),data(4),data(7), > data(9),data(14)))} > > The above code throws arrayoutofbounce exception for empty line and report > line. > > > On Sat, Sep 10, 2016 at 3:02 PM, Hyukjin Kwon <gurwls...@gmail.

Re: Reading a TSV file

2016-09-10 Thread Hyukjin Kwon
Yeap. Also, sep is preferred and has higher precedence than delimiter. 2016-09-11 0:44 GMT+09:00 Jacek Laskowski : > Hi Muhammad, > > sep or delimiter should both work fine. > > Pozdrawiam, > Jacek Laskowski > > https://medium.com/@jaceklaskowski/ > Mastering Apache
